### Osiel Vivar - ov35 - Machine Learning Project Phase 2 - Dataset 1

# Dataset 1 - House Prices

Dataset 1 is [1460x81]. The columns are an Id, 79 predictive features, and 1 target variable, SalePrice.

In [1]:
#Import the Packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
import statistics

## <b>(A) - Brief Exploration of the Dataset</b>

Here I want to get a better understanding of the data: which features are of high utility and which are of low utility. I may also plot some of the predictive features to get a general idea of them.

In [2]:
#Load the Dataset
df = pd.read_csv('house.csv')
print(df.shape)

(1460, 81)


The dataset is composed of: 

36 Numerical Features, such as lot area, time-related variables (year built), number of rooms, condition, and quality.

38 Categorical Features, such as condition, building type, lot shape, heating, and electrical.

5 Conditional Features that do not apply to all entries, generally to less than half of them: PoolQC, Fence, MiscFeature, FireplaceQu, Alley.

In [3]:
#Inspect the conditional features and count their missing values
import seaborn as sns

conditional_features = df[['PoolQC','Fence','MiscFeature','FireplaceQu', 'Alley']]
print(conditional_features.head())

print("\n\nNew Dataframe composed of conditional/fractional features, Shape is: " ,conditional_features.shape, "\n\nNumber of NaN values :")
conditional_features.isna().sum() #isna() marks NaN entries, sum() counts them per column.

  PoolQC Fence MiscFeature FireplaceQu Alley
0    NaN   NaN         NaN         NaN   NaN
1    NaN   NaN         NaN          TA   NaN
2    NaN   NaN         NaN          TA   NaN
3    NaN   NaN         NaN          Gd   NaN
4    NaN   NaN         NaN          TA   NaN


New Dataframe composed of conditional/fractional features, Shape is:  (1460, 5) 

Number of NaN values :


PoolQC         1453
Fence          1179
MiscFeature    1406
FireplaceQu     690
Alley          1369
dtype: int64

As the output shows, the number of NaN values in these "conditional" (or "fractional") features is quite high.

| Feature | NaN count / total | Fraction missing |
|---|---|---|
| PoolQC | 1453/1460 | 0.995 |
| Fence | 1179/1460 | 0.808 |
| MiscFeature | 1406/1460 | 0.963 |
| FireplaceQu | 690/1460 | 0.473 |
| Alley | 1369/1460 | 0.938 |

Because of this, I will not use these features. The only one that may still be incorporated is FireplaceQu, and only if the model performs poorly at predicting the target.
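The fractions above can also be computed directly instead of by hand. The following is a minimal sketch on a made-up toy frame (the values and the 0.5 cutoff are assumptions for illustration, not taken from the dataset); it relies on the fact that the mean of a boolean mask is the fraction of `True` entries.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "FireplaceQu": ["TA", np.nan, "Gd", "TA"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# isna() gives a boolean mask; the mean of a boolean column is the fraction of True values.
missing_fraction = toy.isna().mean().sort_values(ascending=False)
print(missing_fraction)

# Hypothetical cutoff: flag columns where more than half of the entries are missing.
threshold = 0.5
mostly_missing = missing_fraction[missing_fraction > threshold].index.tolist()
print(mostly_missing)
```

On the real dataframe, the same `df.isna().mean()` call would rank every column by its missing fraction, which makes the choice of features to drop reproducible.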

## <b>(B) - Preprocessing of Data</b>

### <b>Feature Cleanup/Removal</b>

*MSSubClass, despite being stored as numbers, is actually a nominal categorical feature, so it requires encoding.*

In [4]:
MSSub_Unique = pd.unique(df['MSSubClass'])
print(MSSub_Unique, "\nNumber of unique values: ", len(MSSub_Unique))


[ 60  20  70  50 190  45  90 120  30  85  80 160  75 180  40] 
Number of unique values:  15


Because there are 15 unique values, I think it is worthwhile to add MSSubClass to the list of variables that will be removed.

It may be included again at the end of the construction of the model, but it may not be needed. Making 15 encoded features seems like too much.
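If MSSubClass is brought back later, it has to be encoded explicitly, because `pd.get_dummies` only converts object, string, and category columns by default and leaves integer columns alone. A small sketch on a made-up toy column (the values are illustrative) shows the difference when the column is named in the `columns` argument:

```python
import pandas as pd

# Toy column with three of the integer class codes (illustrative values only).
toy = pd.DataFrame({"MSSubClass": [60, 20, 70, 20, 60]})

# Without the columns argument, the integer column is passed through unchanged.
untouched = pd.get_dummies(toy)
print(untouched.columns.tolist())

# Naming the column forces one indicator column per unique code.
encoded = pd.get_dummies(toy, columns=["MSSubClass"])
print(encoded.columns.tolist())
```

With the full data this would add 15 indicator columns, one per unique code, which is the cost weighed above.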


In [5]:
df_cleaned = df.drop(
    labels = ['Id','MSSubClass', 'PoolQC', 'Fence', 'MiscFeature', 'FireplaceQu', 'Alley'],
    axis = 1,
    inplace = False)
df_cleaned.shape


(1460, 74)

df_cleaned is a copy of the original dataframe with 7 columns dropped for ease of use: Id, MSSubClass, and the 5 conditional features.

### <b> One Hot Encoding Categorical Features </b>

In [6]:
#One hot encode the data, categorical values have been broken down.
df_encoded = pd.get_dummies(df_cleaned)
print(df_encoded.shape)
df_encoded.head()

(1460, 270)


Unnamed: 0,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
0,65.0,8450,7,5,2003,2003,196.0,706,0,150,...,0,0,0,1,0,0,0,0,1,0
1,80.0,9600,6,8,1976,1976,0.0,978,0,284,...,0,0,0,1,0,0,0,0,1,0
2,68.0,11250,7,5,2001,2002,162.0,486,0,434,...,0,0,0,1,0,0,0,0,1,0
3,60.0,9550,7,5,1915,1970,0.0,216,0,540,...,0,0,0,1,1,0,0,0,0,0
4,84.0,14260,8,5,2000,2000,350.0,655,0,490,...,0,0,0,1,0,0,0,0,1,0


*Because of the one hot encoding, the dataframe now has 270 columns in total, including the SalePrice target.*

### <b> Outlier Removal </b>
*I am going to continue preprocessing the data by removing outliers. An entry counts as an outlier if its SalePrice lies more than 3 standard deviations from the mean.*

In [7]:
#print("Mean:\n", df_encoded.mean())
#print("\nStandard Deviation:\n", df_encoded.std())
df_encoded.SalePrice.describe()

count      1460.000000
mean     180921.195890
std       79442.502883
min       34900.000000
25%      129975.000000
50%      163000.000000
75%      214000.000000
max      755000.000000
Name: SalePrice, dtype: float64

In [8]:
#We have dropped some features, now we want to drop data entries that exist outside 3 standard deviations
upper_limit = df_encoded['SalePrice'].mean() + 3*df_encoded['SalePrice'].std()
lower_limit = df_encoded['SalePrice'].mean() - 3*df_encoded['SalePrice'].std()

print("Upper Limit: \n",upper_limit)
print("\n\nLower Limit: \n",lower_limit)

Upper Limit: 
 419248.70453907084


Lower Limit: 
 -57406.31275824897


In [9]:
#Reduce the number of entries by removing any row whose SalePrice is above the upper limit (mean + 3 standard deviations). The lower limit is negative, so no row can fall below it. Why? Because extreme prices will likely just throw off our fit.
print(df_encoded.shape)
#print(df_unormalized.shape)
df_noutliers = df_encoded[df_encoded['SalePrice'] <= upper_limit]
print("Dataframe without outliers: ", df_noutliers.shape)

(1460, 270)
Dataframe without outliers:  (1438, 270)


*Out of the 1460 entries, 22 have been dropped, leaving 1438.*
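The filter used here can be wrapped in a small reusable helper. This is a sketch under assumptions: the function name `drop_sigma_outliers` and the toy prices are hypothetical, and unlike the cell above it checks both the lower and the upper bound.

```python
import pandas as pd

def drop_sigma_outliers(frame, column, n_sigma=3.0):
    """Keep rows whose `column` value lies within n_sigma standard deviations of the mean."""
    mean = frame[column].mean()
    std = frame[column].std()
    lower, upper = mean - n_sigma * std, mean + n_sigma * std
    mask = frame[column].between(lower, upper)
    # Return an explicit copy so later column assignments do not touch the original frame.
    return frame.loc[mask].copy()

# Toy data: twenty ordinary prices and one extreme price (made-up values).
toy = pd.DataFrame({"SalePrice": [100_000] * 20 + [1_000_000]})
filtered = drop_sigma_outliers(toy, "SalePrice")
print(len(toy), "->", len(filtered))
```

Note that the mean and standard deviation are themselves pulled toward the extreme values, so on very small samples a single outlier may not exceed the 3 sigma bound at all.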

### <b> NaN Value Removal </b>
*We need to remove NaN values from our dataframe, because the model will not know how to interpret them.*

In [10]:
#Pandas Method
#X_train_NaN = X_train.fillna(0)
#Numpy Method
#X_train.replace(np.nan, 0)
#--------------------------------------# Originally I had this code in a later section, but it makes more sense to include it here. We have to apply fillna(0) to the entire dataset now, so the previous lines won't work.
df_NaN = df_noutliers.fillna(0)
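Filling with 0 is the simplest choice, but for a measurement such as a length or an area it records a value that no house actually has. As a point of comparison, here is a minimal sketch of median imputation on a made-up column (the column name and values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy numeric column with one missing entry (made-up values).
toy = pd.DataFrame({"LotFrontage": [60.0, 80.0, np.nan, 70.0]})

# Option used in this notebook: replace every NaN with 0.
zero_filled = toy.fillna(0)

# Alternative: replace each NaN with the median of its own column.
median_filled = toy.fillna(toy.median(numeric_only=True))

print(zero_filled["LotFrontage"].tolist())
print(median_filled["LotFrontage"].tolist())
```

Which option works better for this dataset would have to be checked against the model's error; the sketch only shows the mechanics.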


WE WILL SKIP NORMALIZATION FOR A SECOND. SEE HOW OUR MODEL PERFORMS BASED ON THE CURRENT CLEANING AND NO FEATURE EXTRACTION.

### <b>Normalization</b>

*Rescale all numerical values to the range [0, 1].*

In [11]:
#Check Value Types in Dataframe
print(df_NaN.dtypes.unique())

#The numerical types are float64 and int64, the categorical values are represented as 1 or 0 , unsigned, so they are uint8
numerics = ['float64','int64']
#Now let's isolate the numerical features so that we can normalize the data.
#We DO NOT want to normalize categorical features.
df_numerics = df_NaN.select_dtypes(include=numerics)
df_numerics.head()

[dtype('float64') dtype('int64') dtype('uint8')]


Unnamed: 0,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
0,65.0,8450,7,5,2003,2003,196.0,706,0,150,...,0,61,0,0,0,0,0,2,2008,208500
1,80.0,9600,6,8,1976,1976,0.0,978,0,284,...,298,0,0,0,0,0,0,5,2007,181500
2,68.0,11250,7,5,2001,2002,162.0,486,0,434,...,0,42,0,0,0,0,0,9,2008,223500
3,60.0,9550,7,5,1915,1970,0.0,216,0,540,...,0,35,272,0,0,0,0,2,2006,140000
4,84.0,14260,8,5,2000,2000,350.0,655,0,490,...,192,84,0,0,0,0,0,12,2008,250000


Now that these have been identified, we want to normalize the data.

In [20]:
#36 numeric columns to normalize (35 features plus SalePrice)
print(df_numerics.columns)

for column in df_numerics.columns:
    print(df_numerics[column])
    df_numerics[column] = (df_numerics[column] - df_numerics[column].min())/(df_numerics[column].max() - df_numerics[column].min()) #Normalization of Numerical Values

df_numerics.head()

Index(['LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt',
       'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF',
       'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea',
       'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr',
       'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt',
       'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal',
       'MoSold', 'YrSold', 'SalePrice'],
      dtype='object')
0       0.207668
1       0.255591
2       0.217252
3       0.191693
4       0.268371
          ...   
1455    0.198083
1456    0.271565
1457    0.210863
1458    0.217252
1459    0.239617
Name: LotFrontage, Length: 1438, dtype: float64
0       0.033420
1       0.038795
2       0.046507
3       0.038561
4       0.060576
          ...   
1455    0.030929
1456    0.055505
1457    0.036187
1458    0.039342
1459    0.040370


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_numerics[column] = (df_numerics[column] - df_numerics[column].min())/(df_numerics[column].max() - df_numerics[column].min()) #Normalization of Numerical Values


Unnamed: 0,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
0,0.207668,0.03342,0.666667,0.5,0.949275,0.883333,0.1225,0.125089,0.0,0.064212,...,0.0,0.111517,0.0,0.0,0.0,0.0,0.0,0.090909,0.5,0.456364
1,0.255591,0.038795,0.555556,0.875,0.753623,0.433333,0.0,0.173281,0.0,0.121575,...,0.347725,0.0,0.0,0.0,0.0,0.0,0.0,0.363636,0.25,0.385386
2,0.217252,0.046507,0.666667,0.5,0.934783,0.866667,0.10125,0.086109,0.0,0.185788,...,0.0,0.076782,0.0,0.0,0.0,0.0,0.0,0.727273,0.5,0.495797
3,0.191693,0.038561,0.666667,0.5,0.311594,0.333333,0.0,0.038271,0.0,0.231164,...,0.0,0.063985,0.492754,0.0,0.0,0.0,0.0,0.090909,0.0,0.27629
4,0.268371,0.060576,0.777778,0.5,0.927536,0.833333,0.21875,0.116052,0.0,0.20976,...,0.224037,0.153565,0.0,0.0,0.0,0.0,0.0,1.0,0.5,0.56546
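The warning in the output above appears because df_numerics was derived from another dataframe and is then modified in place. A minimal sketch of the same min-max scaling on an explicit copy follows; the toy values are made up, and leaving SalePrice unscaled is an assumption of this sketch rather than what the cell above does.

```python
import pandas as pd

# Toy frame with two features and the target (made-up values).
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "OverallQual": [7, 6, 8],
    "SalePrice": [208500, 181500, 223500],
})

# Scale the features only; the target keeps its original units in this sketch.
feature_cols = [c for c in toy.columns if c != "SalePrice"]

# Work on an explicit copy so the assignment never targets a view of another frame.
scaled = toy.copy()
for col in feature_cols:
    col_min, col_max = toy[col].min(), toy[col].max()
    scaled[col] = (toy[col] - col_min) / (col_max - col_min)

print(scaled)
```

A column with a single constant value would divide by zero here and produce NaN, so such columns need to be dropped or skipped first.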


### Standardization

<b> Make data's mean = 0, and std = 1 (unit variance) </b>
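This section has no code yet. A minimal sketch of standardization with NumPy, on made-up values, could look like the following; it uses the population standard deviation (`ddof=0`), which is an assumption of this sketch.

```python
import numpy as np

# Toy matrix: rows are samples, columns are features on very different scales.
values = np.array([[1.0, 200.0],
                   [2.0, 400.0],
                   [3.0, 600.0]])

# Column-wise mean and population standard deviation.
mean = values.mean(axis=0)
std = values.std(axis=0)

# Subtract the mean and divide by the standard deviation, column by column.
standardized = (values - mean) / std
print(standardized)
```

After this step every column has mean 0 and unit variance, so features measured in square feet and features measured in counts contribute on a comparable scale.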

## (C) Feature Extraction

Earlier I eliminated the 5 fractional features, along with Id and MSSubClass. I could apply principal component analysis to reduce the remaining features to, say, 50 components, but I will not do so yet.
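For reference, principal component analysis does not pick out a subset of the original columns; it builds new components that are linear combinations of them, ordered by how much variance each explains. A minimal sketch using NumPy's SVD on synthetic data (the data, the seed, and the choice of `k = 3` are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 5 features, where feature 1 is almost a copy of feature 0.
X = rng.normal(size=(200, 5))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=200)

# PCA works on centered data.
X_centered = X - X.mean(axis=0)

# Rows of Vt are the principal directions; squared singular values are proportional
# to the variance captured along each direction.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
explained_variance_ratio = S**2 / np.sum(S**2)

k = 3  # hypothetical number of components to keep
X_reduced = X_centered @ Vt[:k].T

print(X_reduced.shape)
print(explained_variance_ratio.round(3))
```

The near-duplicate feature shows up as a component with almost no explained variance, which is the kind of redundancy one hot encoded data tends to contain.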

## (D) Processing of Each Dataset

Here I will fit a model. The target variable is SalePrice, a continuous variable, so this is a regression problem.

I will likely try linear regression or polynomial regression.

In [13]:
#import sklearn
from sklearn.model_selection import train_test_split

#Separate Features From Target
df_features = df_NaN.drop(labels = ['SalePrice'], axis = 1, inplace =False)
df_target = df_NaN['SalePrice']

print("Feature Shape: " , df_features.shape)
print("Target Shape: ", df_target.shape)

Feature Shape:  (1438, 269)
Target Shape:  (1438,)


In [14]:
#train_test_split our dataframe. Testing data set will be around 0.25 of the data.
X_train, X_test, y_train, y_test = train_test_split(df_features, df_target, test_size = 0.25, random_state = 1)

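#Second split on the data without outlier removal. Caution: df_encoded still contains the SalePrice column, so these features include the target itself.
nX_train, nX_test, ny_train, ny_test = train_test_split(df_encoded, df_encoded['SalePrice'], test_size = 0.25, random_state = 1)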


The training data is 75% of the entries and the testing data is 25%. We can proceed to train the model.
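One detail of the second split deserves attention: its feature matrix is df_encoded, which still holds the SalePrice column, so the second model is given the answer as an input. The effect of such leakage can be sketched with synthetic data and plain least squares in NumPy (the data, seed, and helper name `fit_and_score` are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Synthetic features and a noisy target (noise standard deviation of 5).
X = rng.normal(size=(n, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=5.0, size=n)

# Fixed split: first 150 rows for training, last 50 for testing.
train, test = np.arange(0, 150), np.arange(150, n)

def fit_and_score(features):
    """Ordinary least squares with an intercept; returns the test-set MSE."""
    A = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(A[train], y[train], rcond=None)
    return float(np.mean((A[test] @ coef - y[test]) ** 2))

mse_clean = fit_and_score(X)                        # features only
mse_leaky = fit_and_score(np.column_stack([X, y]))  # target included as a feature
print(mse_clean, mse_leaky)
```

The leaky fit reaches an error near zero regardless of how noisy the target is, so a fair comparison would drop SalePrice from the features of both splits.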

In [16]:
#Training Data
from sklearn import linear_model

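#Linear regression models. Note: the normalize argument was deprecated in scikit-learn 1.0 and removed in 1.2, so these calls require an older version.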
regression = linear_model.LinearRegression(fit_intercept = True, normalize = True).fit(X_train,y_train)
regression_2 = linear_model.LinearRegression(fit_intercept = True, normalize = True).fit(nX_train.fillna(0),ny_train)
#Test model here
output = regression.predict(X_test)
output2 = regression_2.predict(nX_test.fillna(0))

The predicted home prices should now be stored in "output".

I will first inspect them by eye.

In [48]:
print("Testing the mean of y_test, the testing set that has removed the outliers. ",y_test.mean())
print("\nMean of the predictive model, there is a massive outlier which i will find, ", output.mean())
sorted_output = sorted(output, reverse = True)
print("\nSorted Output, descending, no outliers :\n", sorted_output)
print("\n\nThe top 3 values, total outliers: \n", sorted_output[0],"\n", sorted_output[1],"\n", sorted_output[2],"\n\n")
#We want to identify why these values are so huge by comparison.
print("Indices of 3 huge outliers of predictive model (no outliers) : \n" , np.where(output == sorted_output[0]) ,"\n",  np.where(output == sorted_output[1]), "\n",np.where(output == sorted_output[2]),"\n\n")
#print("df_noutliers , 359 :" , df_noutliers.loc[359,:] ,df_noutliers.loc[358,:],df_noutliers.loc[357,:])

#Non outlier removal
print("\nNo outliers removed, test mean ", ny_test.mean())
print("Model that did not exclude outliers, also not normalized.", output2.mean())

#The model without outlier removal appears to outperform by a wide margin. There is a big problem somewhere in the outlier-removed pipeline.

#print(y_test)

#There are massive outliers in the predictions: three of the houses have been predicted to be worth tens of quadrillions of dollars (about 2.5e16).
#I want to try to get the median loss just to get a sense of how bad the score is, since the median is not dominated by a few huge values.

#CODE
#loss = abs(y_test - output)
#median_loss = statistics.median(loss)


Testing the mean of y_test, the testing set that has removed the outliers.  173497.61388888888

Mean of the predictive model, there is a massive outlier which i will find,  150888255157529.9

Sorted Output, descending, no outliers :
 [2.554182514546395e+16, 2.537818315352124e+16, 2.4546620105499344e+16, 368992.0, 368768.0, 358032.0, 349408.0, 348304.0, 344112.0, 336400.0, 334320.0, 333296.0, 329888.0, 321104.0, 321104.0, 317984.0, 317184.0, 314736.0, 309984.0, 309808.0, 305568.0, 297008.0, 292528.0, 289968.0, 289696.0, 288816.0, 287504.0, 287456.0, 286816.0, 281520.0, 281072.0, 280400.0, 280096.0, 276368.0, 276064.0, 274992.0, 270272.0, 269872.0, 267312.0, 264112.0, 262672.0, 260880.0, 258880.0, 258848.0, 258416.0, 257072.0, 255536.0, 255152.0, 253648.0, 253440.0, 252816.0, 247248.0, 246976.0, 246176.0, 244880.0, 244528.0, 244256.0, 244224.0, 240224.0, 236864.0, 236416.0, 235040.0, 232416.0, 230256.0, 229360.0, 228368.0, 226848.0, 226640.0, 226640.0, 226032.0, 225360.0, 225280.0, 22416
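The median loss mentioned in the comments of the cell above is a useful check here, because a handful of enormous predictions dominate any mean-based score. A minimal sketch with made-up numbers that mimic this pattern (the arrays are illustrative, not the notebook's predictions):

```python
import numpy as np

# Made-up values: mostly sensible predictions plus one numerical blow-up.
y_true = np.array([200_000.0, 150_000.0, 180_000.0, 220_000.0, 170_000.0])
y_pred = np.array([210_000.0, 140_000.0, 185_000.0, 2.5e16, 165_000.0])

abs_error = np.abs(y_true - y_pred)

# The mean is dragged up by the single huge error; the median is not.
mean_abs_error = abs_error.mean()
median_abs_error = np.median(abs_error)
print(mean_abs_error, median_abs_error)
```

If the median error of the real model is reasonable while the mean is astronomical, the fit is probably sound for most houses and the problem is confined to a few rows.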

In [49]:
#The outlier-removed model performs worse, so we will continue with the model trained without outlier removal.

In [51]:
#Time to check the error using MSE (lower is better)
from sklearn.metrics import mean_squared_error
MSE = mean_squared_error(ny_test, output2)
print(MSE)

4789342.051536175
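The MSE is expressed in squared dollars, which is hard to read. Taking the square root gives an error in dollars that can be compared with typical prices. The sketch below reuses the MSE printed above and the SalePrice mean from the earlier describe() output; that mean covers the full dataset rather than the test set, so the ratio is only approximate.

```python
import math

mse = 4789342.051536175      # MSE printed by the cell above
mean_price = 180921.195890   # SalePrice mean from the describe() output

# Root mean squared error, in dollars.
rmse = math.sqrt(mse)

# Error relative to a typical price.
relative_rmse = rmse / mean_price

print(f"RMSE: {rmse:,.0f}")
print(f"RMSE as a fraction of the mean price: {relative_rmse:.3f}")
```

An error of roughly 1% of the mean price is unusually small for a plain linear model, which is consistent with SalePrice being present among the features of df_encoded, so this score should be treated with caution.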
