### This kernel is my submission to the House Prices: Advanced Regression Technique tutorial dataset. It's my first time making a IPython notebook or working with Kaggle datasets so it's mostly for learning's sake and will follow the practices of others, but I'm excited to see what results it'll get.

The steps I will walk through are:
1. Data Summarization
2. Data Cleaning
3. Data Exploration
4. Feature Engineering
5. Trying different algorithms (current planned: linear & logistic regression, SVM, random forests)
6. Comparing accuracy rates

In [1]:
#package imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# 1. Data Summarization

In [2]:
train = pd.read_csv('~/Documents/kagglestuff/train.csv')
test = pd.read_csv('~/Documents/kagglestuff/test.csv')
train.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [3]:
train.info()
train[train.PoolArea!=0]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
197,198,75,RL,174.0,25419,Pave,,Reg,Lvl,AllPub,...,512,Ex,GdPrv,,0,3,2006,WD,Abnorml,235000
810,811,20,RL,78.0,10140,Pave,,Reg,Lvl,AllPub,...,648,Fa,GdPrv,,0,1,2006,WD,Normal,181000
1170,1171,80,RL,76.0,9880,Pave,,Reg,Lvl,AllPub,...,576,Gd,GdPrv,,0,7,2008,WD,Normal,171000
1182,1183,60,RL,160.0,15623,Pave,,IR1,Lvl,AllPub,...,555,Ex,MnPrv,,0,7,2007,WD,Abnorml,745000
1298,1299,60,RL,313.0,63887,Pave,,IR3,Bnk,AllPub,...,480,Gd,,,0,1,2008,New,Partial,160000
1386,1387,60,RL,80.0,16692,Pave,,IR1,Lvl,AllPub,...,519,Fa,MnPrv,TenC,2000,7,2006,WD,Normal,250000
1423,1424,80,RL,,19690,Pave,,IR1,Lvl,AllPub,...,738,Gd,GdPrv,,0,8,2006,WD,Alloca,274970


In [4]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 80 columns):
Id               1459 non-null int64
MSSubClass       1459 non-null int64
MSZoning         1455 non-null object
LotFrontage      1232 non-null float64
LotArea          1459 non-null int64
Street           1459 non-null object
Alley            107 non-null object
LotShape         1459 non-null object
LandContour      1459 non-null object
Utilities        1457 non-null object
LotConfig        1459 non-null object
LandSlope        1459 non-null object
Neighborhood     1459 non-null object
Condition1       1459 non-null object
Condition2       1459 non-null object
BldgType         1459 non-null object
HouseStyle       1459 non-null object
OverallQual      1459 non-null int64
OverallCond      1459 non-null int64
YearBuilt        1459 non-null int64
YearRemodAdd     1459 non-null int64
RoofStyle        1459 non-null object
RoofMatl         1459 non-null object
Exterior1st      1458 non-

### Observations:
- One response variable, SalePrice
- Only 7 houses have pools, with PoolArea != 0 and PoolQC not NaN
- Only ~271 houses have fences, 91 have alleys, 54 "MiscFeature", 770 fireplaces
- Lower amts of null vals in LotFrontage, MasVnrType, MasVnrArea, BsmtQual, BsmtCont, BsmtExposure, BsmtFinType1, BsmtFinSF1, BsmtFinType2, Electrical, FireplaceQu, garage quals (GarageCars and GarageArea are 0 when other garage quals are null)
- Test cols that are missing data: LotFrontage, Alley, Utilities, Exterior1st and 2nd, MasVnrType and Area, Bsmt qualities, Garage qualities, Functional, FireplaceQu, Pool quals, Fence, Misc quals, SaleType

# 2. Data Cleaning

Certain features being missing would be significant to a buyer's judgement. Therefore, we will fill in those in with 0s for NaN values to represent absence. Even the categorical values are filled with 0s, seeing as the other categories will be changed to ints for regression anyways.

In [5]:
train.fillna(0,inplace=True)
test.fillna(0,inplace=True)

Check to ensure no null values remain.

In [6]:
print("Train Null Values")
print(train.isnull().sum().sum())
print("Test Null Values")
print(test.isnull().sum().sum())

Train Null Values
0
Test Null Values
0


So prior to this, I tried manually analyzing and picking out features to remove. That is a pain. Let's experiment instead with four sklearn methods of feature streamlining: SelectKBest, RFE, PCA, and Extra Trees.

### Desired Number of Features

Let's take a look at the features of our data:

In [7]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1460 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            1460 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non

ID is not a feature, so removing that we have 80 features. Skimming over and looking at the number of keyword redundancies we can first estimate that we could do with just 1/4rd (20) of the present features. This figure can be adjusted later.

In [8]:
train.drop(columns='Id',inplace=True)

### First Run

We're first going to run the random forests model to see how bad overfitting is currently, and to get the importance of each feature in order to take the top 20.

In [9]:
from sklearn.model_selection import train_test_split
train = pd.get_dummies(train)
X = train.drop(columns='SalePrice')
y = train.SalePrice
X_train,X_val,y_train,y_val = train_test_split(X,y,random_state=0)

Importing packages:

In [10]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, mean_absolute_error

Setting up, fitting and measuring accuracy of model:

In [11]:
def rfwithval(xt, yt, xv, yv,maxfeat="auto",maxnd=None,maxdp=None,ntrees=10):
    forest = RandomForestClassifier(max_features=maxfeat,max_leaf_nodes=maxnd,max_depth=maxdp,n_estimators=ntrees)
    forest.fit(xt, yt)
    train_predictions = forest.predict(xt)
    val_predictions = forest.predict(xv)
    print("Training MAE: {} | Validation MAE: {}"\
          .format(mean_absolute_error(yt,train_predictions),mean_absolute_error(yv,val_predictions)))
    return forest
    
forest = rfwithval(X_train,y_train,X_val,y_val)

Training MAE: 2.922374429223744 | Validation MAE: 43149.67123287671


So there's room for improvement. Let's see if we can fix this by fiddling with maximum features sampled, maximum leaf nodes, maximum tree depth, and number of trees.

### Features Sampled

In [12]:
feats = [4,8,16,32,64]
for item in feats:
    print("{} features sampled".format(item))
    forest1 = rfwithval(X_train,y_train,X_val,y_val,maxfeat=item)

4 features sampled
Training MAE: 139.51689497716896 | Validation MAE: 48123.71232876712
8 features sampled
Training MAE: 143.38173515981734 | Validation MAE: 44760.654794520546
16 features sampled
Training MAE: 25.45936073059361 | Validation MAE: 47177.73698630137
32 features sampled
Training MAE: 63.62100456621005 | Validation MAE: 41798.51780821918
64 features sampled
Training MAE: 125.37260273972603 | Validation MAE: 43224.70958904109


Not much of a difference is made.

### Leaf Nodes

In [13]:
nds = [3,30,300,3000,30000]
for item in nds:
    print("{} nodes".format(item))
    forest1 = rfwithval(X_train,y_train,X_val,y_val,maxnd=item)

3 nodes
Training MAE: 55180.38173515982 | Validation MAE: 51442.63287671233
30 nodes
Training MAE: 29912.18904109589 | Validation MAE: 38353.687671232874
300 nodes
Training MAE: 2080.7132420091325 | Validation MAE: 34343.99178082192
3000 nodes
Training MAE: 65.9041095890411 | Validation MAE: 43190.501369863014
30000 nodes
Training MAE: 42.922374429223744 | Validation MAE: 43144.52876712329


Maximum 300 leaf nodes minimizes validation MAE substantially but also is far from the minimum training MAE.

### Depth

In [14]:
dps = [3,9,27,81,243]
for item in dps:
    print("{} depth".format(item))
    forest1 = rfwithval(X_train,y_train,X_val,y_val,maxdp=item)

3 depth
Training MAE: 53284.708675799084 | Validation MAE: 56101.86575342466
9 depth
Training MAE: 21775.815525114154 | Validation MAE: 38511.80547945206
27 depth
Training MAE: 261.55159817351597 | Validation MAE: 35015.02191780822
81 depth
Training MAE: 16.894977168949772 | Validation MAE: 45167.13424657534
243 depth
Training MAE: 49.954337899543376 | Validation MAE: 46920.778082191784


Maximum depth of 27 minimizes validation MAE substantially and does not have minimum training MAE but has not as bad of an increase.

### Number of Trees

In [15]:
nums = [5,10,20,40,80]
for item in nums:
    print("{} trees".format(item))
    forest1 = rfwithval(X_train,y_train,X_val,y_val,ntrees=item)

5 trees
Training MAE: 2793.3205479452054 | Validation MAE: 44658.29863013699
10 trees
Training MAE: 64.74885844748859 | Validation MAE: 43180.49315068493
20 trees
Training MAE: 3.1963470319634704 | Validation MAE: 33069.613698630135
40 trees
Training MAE: 0.0 | Validation MAE: 28485.186301369864
80 trees
Training MAE: 0.0 | Validation MAE: 26877.657534246577


Increasing the number of trees eliminates training MAE but doesn't do much for validation MAE, and also increases training time substantially.

## Another Idea: Using the Feature Rankings from RandomForestClassifier

Another way to reduce overfitting is by reducing the number of features input into the model. When a random forest classifier is fit, it also assigns an importance score to each feature. Let's see if we can reduce overfitting by only taking n highest importance features.

In [16]:
#Get scores of features
imps = forest.feature_importances_

In [17]:
def featurereducedforest(df, imps, cut):
    #Get cutoff for top n scores
    cutoff = min(sorted(imps)[-cut:])
    #Get indexes of top n
    inds = [i for i in range(len(imps)) if imps[i]>=cutoff]
    newdict = {i:df.iloc[:,i] for i in inds}
    train1 = pd.DataFrame(newdict, columns=inds)
    X_train,X_val,y_train,y_val = train_test_split(df,y,random_state=0)
    return rfwithval(X_train,y_train,X_val,y_val)

In [18]:
feats = [5,10,20,40,80]
for item in feats:
    print("{} features".format(item))
    forest1 = featurereducedforest(X,imps,item)

5 features
Training MAE: 168.6703196347032 | Validation MAE: 46628.76164383562
10 features
Training MAE: 121.27853881278538 | Validation MAE: 44661.10410958904
20 features
Training MAE: 68.81095890410958 | Validation MAE: 43528.04383561644
40 features
Training MAE: 194.70319634703196 | Validation MAE: 46073.18904109589
80 features
Training MAE: 76.57534246575342 | Validation MAE: 46343.890410958906


Not much of a difference.

# Final Alteration

Let's try combining the overfitting reduction measures above.

In [19]:
forest1 = rfwithval(X_train,y_train,X_val,y_val,maxnd=300,maxdp=27)

Training MAE: 2902.9159817351597 | Validation MAE: 34325.8


In [20]:
forest1 = rfwithval(X_train,y_train,X_val,y_val,maxdp=27,ntrees=80)

Training MAE: 0.0 | Validation MAE: 24707.739726027397


In [21]:
forest1 = rfwithval(X_train,y_train,X_val,y_val,maxnd=300,ntrees=80)

Training MAE: 6.8493150684931505 | Validation MAE: 26020.016438356164


None of the reduction measures work well in combination. So we'll go with capping the max depth at 27.

In [24]:
forest = RandomForestClassifier(max_depth=27)
X_test = pd.get_dummies(test.drop(columns='Id'))
X, X_test = X.align(X_test,join='outer',axis=1,fill_value=0)
forest.fit(X,y)
test_preds = forest.predict(X_test)
output = pd.DataFrame({'Id':test.Id, 'SalePrice':test_preds})
output.to_csv('submission.csv',index=False)

# Conclusion

There's still quite a lot to learn, but this was a good start at getting familiar with the available tools of pandas and sklearn, as well as the Jupyter notebook interface and Kaggle competitions. More to come.