### This kernel is my submission to the House Prices: Advanced Regression Techniques tutorial dataset. It's my first time making an IPython notebook or working with Kaggle datasets, so it's mostly for learning's sake and will follow the practices of others, but I'm excited to see what results it gets.

The steps I will walk through are:
1. Data Summarization
2. Data Cleaning
3. Data Exploration
4. Feature Engineering
5. Trying different algorithms (currently planned: linear & logistic regression, SVM, random forests)
6. Comparing accuracy rates

In [None]:
#package imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# 1. Data Summarization

In [None]:
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
train.head()

In [None]:
train.info()
train[train.PoolArea!=0]

In [None]:
test.info()

### Observations:
- One response variable, SalePrice
- Only 7 houses have pools, with PoolArea != 0 and PoolQC not NaN
- Only ~271 houses have fences, 91 have alleys, 54 "MiscFeature", 770 fireplaces
- Smaller numbers of null values in LotFrontage, MasVnrType, MasVnrArea, BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinSF1, BsmtFinType2, Electrical, FireplaceQu, and the garage quality columns (GarageCars and GarageArea are 0 when the other garage columns are null)
- Test columns that are missing data: LotFrontage, Alley, Utilities, Exterior1st and Exterior2nd, MasVnrType and MasVnrArea, basement qualities, garage qualities, Functional, FireplaceQu, pool qualities, Fence, misc qualities, SaleType
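
The null counts in these observations can be tabulated per column instead of being read off `info()`. A minimal sketch on a toy frame (the values are made up for illustration and are not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for train; values are illustrative only
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "Alley": [np.nan, np.nan, "Grvl", np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Count nulls per column, keep only columns that have any, largest first
null_counts = toy.isnull().sum()
null_counts = null_counts[null_counts > 0].sort_values(ascending=False)
print(null_counts)
```

Running the same three lines on `train` and `test` should give the per-column counts summarized above.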

# 2. Data Cleaning

The absence of certain features would matter to a buyer's judgement, so we fill NaN values with 0 to represent absence. Even the categorical columns are filled with 0, since their categories will be converted to numeric columns for regression anyway.
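
As a hedged alternative, and not what this notebook does, absence could be encoded per column type: 0 for numeric columns and a placeholder label for categorical ones, which avoids mixing integers and strings in one column. The label `"None"` and the toy values below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with one numeric and one categorical column (illustrative values)
toy = pd.DataFrame({
    "PoolArea": [0.0, 512.0, np.nan],
    "PoolQC": [np.nan, "Gd", np.nan],
})

# Split the columns by type
num_cols = toy.select_dtypes(include="number").columns
cat_cols = toy.columns.difference(num_cols)

# Numeric absence becomes 0, categorical absence becomes the label "None"
filled = toy.copy()
filled[num_cols] = filled[num_cols].fillna(0)
filled[cat_cols] = filled[cat_cols].fillna("None")
print(filled)
```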

In [None]:
train.fillna(0,inplace=True)
test.fillna(0,inplace=True)

Check to ensure no null values remain.

In [None]:
print("Train Null Values")
print(train.isnull().sum().sum())
print("Test Null Values")
print(test.isnull().sum().sum())

So prior to this, I tried manually analyzing and picking out features to remove. That is a pain. Let's experiment instead with four sklearn methods of feature streamlining: SelectKBest, RFE, PCA, and Extra Trees.

### Desired Number of Features

Let's take a look at the features of our data:

In [None]:
train.info()

Id is not a feature, so after removing it we have 80 columns (79 features plus the SalePrice target). Skimming over them and noting how many share keywords, we can first estimate that about a quarter (20) of the present features would suffice. This figure can be adjusted later.

In [None]:
train.drop(columns='Id',inplace=True)

### First Run

We're first going to run a random forest model to see how bad overfitting currently is, and to get the importance of each feature so that we can take the top 20.

In [None]:
from sklearn.model_selection import train_test_split
train = pd.get_dummies(train)
X = train.drop(columns='SalePrice')
y = train.SalePrice
X_train,X_val,y_train,y_val = train_test_split(X,y,random_state=0)

Importing packages:

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, mean_absolute_error

Setting up, fitting, and measuring the error (MAE) of the model:

In [None]:
def rfwithval(xt, yt, xv, yv, maxfeat="sqrt", maxnd=None, maxdp=None, ntrees=10):
    # Fit a random forest and print MAE on the training and validation sets.
    # "sqrt" is what the older "auto" setting meant for classifiers;
    # "auto" is no longer accepted by newer scikit-learn versions.
    forest = RandomForestClassifier(random_state=1, max_features=maxfeat,
                                    max_leaf_nodes=maxnd, max_depth=maxdp,
                                    n_estimators=ntrees)
    forest.fit(xt, yt)
    train_predictions = forest.predict(xt)
    val_predictions = forest.predict(xv)
    print("Training MAE: {} | Validation MAE: {}".format(
        mean_absolute_error(yt, train_predictions),
        mean_absolute_error(yv, val_predictions)))
    return forest

forest = rfwithval(X_train, y_train, X_val, y_val)

So there's room for improvement. Let's see if we can fix this by fiddling with maximum features sampled, maximum leaf nodes, maximum tree depth, and number of trees.

# 3. Tuning the Random Forest

### Features Sampled

In [None]:
feats = [4,8,16,32,64]
for item in feats:
    print("{} features sampled".format(item))
    forest1 = rfwithval(X_train,y_train,X_val,y_val,maxfeat=item)

Not much of a difference is made.

### Leaf Nodes

In [None]:
nds = [3,30,300,3000,30000]
for item in nds:
    print("{} nodes".format(item))
    forest1 = rfwithval(X_train,y_train,X_val,y_val,maxnd=item)

A cap of 300 leaf nodes lowers validation MAE substantially, giving the minimum among the values tried, but its training MAE is far from the minimum.

### Depth

In [None]:
dps = [3,9,27,81,243]
for item in dps:
    print("{} depth".format(item))
    forest1 = rfwithval(X_train,y_train,X_val,y_val,maxdp=item)

A maximum depth of 27 also minimizes validation MAE substantially. Its training MAE is not the minimum either, but the increase is not as bad.

### Number of Trees

In [None]:
nums = [5,10,20,40,80]
for item in nums:
    print("{} trees".format(item))
    forest1 = rfwithval(X_train,y_train,X_val,y_val,ntrees=item)

Increasing the number of trees eliminates training MAE but doesn't do much for validation MAE, and also increases training time substantially.

## Another Idea: Using the Feature Rankings from RandomForestClassifier

Another way to reduce overfitting is to reduce the number of features fed into the model. When a random forest classifier is fit, it also assigns an importance score to each feature. Let's see if we can reduce overfitting by taking only the n highest-importance features.
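
To make the idea concrete, here is a minimal sketch of ranking features by importance and keeping the top few. It runs on synthetic data, and the feature names `f0` to `f7` are hypothetical; only `feature_importances_` is the scikit-learn attribute discussed above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; f0..f7 are hypothetical feature names
X_toy, y_toy = make_classification(n_samples=200, n_features=8,
                                   n_informative=3, random_state=0)
X_toy = pd.DataFrame(X_toy, columns=["f{}".format(i) for i in range(8)])

toy_forest = RandomForestClassifier(n_estimators=10, random_state=1)
toy_forest.fit(X_toy, y_toy)

# Pair each importance score with its column name and sort, highest first
ranked = pd.Series(toy_forest.feature_importances_,
                   index=X_toy.columns).sort_values(ascending=False)

# Keep only the three highest-ranked columns
top3 = ranked.head(3).index.tolist()
X_reduced = X_toy[top3]
print(ranked)
print(X_reduced.shape)
```

Labeling the scores with column names makes it easy to see which features survive the cut, which positional indexes alone do not show.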

In [None]:
#Get scores of features
imps = forest.feature_importances_

In [None]:
def featurereducedforest(df, imps, cut):
    # Column positions of the `cut` highest importance scores
    inds = np.argsort(imps)[-cut:]
    # Keep only those columns
    train1 = df.iloc[:, inds]
    # Split the reduced frame; y is the global SalePrice series
    X_train, X_val, y_train, y_val = train_test_split(train1, y, random_state=0)
    return rfwithval(X_train, y_train, X_val, y_val)

In [None]:
feats = [5,10,20,40,80]
for item in feats:
    print("{} features".format(item))
    forest1 = featurereducedforest(X,imps,item)

Not much of a difference.

# 4. Running With the Best Tuning

Let's try combining the overfitting reduction measures above.

In [None]:
forest1 = rfwithval(X_train,y_train,X_val,y_val,maxnd=300,maxdp=27)

In [None]:
forest1 = rfwithval(X_train,y_train,X_val,y_val,maxdp=27,ntrees=80)

In [None]:
forest1 = rfwithval(X_train,y_train,X_val,y_val,maxnd=300,ntrees=80)

None of the reduction measures work well in combination. So we'll go with capping the max depth at 27.

In [None]:
forest = RandomForestClassifier(random_state=1,max_depth=27)
X_test = pd.get_dummies(test.drop(columns='Id'))
X, X_test = X.align(X_test,join='outer',axis=1,fill_value=0)
forest.fit(X,y)
test_preds = forest.predict(X_test)
output = pd.DataFrame({'Id':test.Id, 'SalePrice':test_preds})
output.to_csv('submission.csv',index=False)
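
The `align` call in the cell above matters because `get_dummies` can produce different columns for train and test when a category appears in only one of them. A small sketch of what the call does, using toy one-hot frames with illustrative column names:

```python
import pandas as pd

# Toy one-hot frames whose dummy columns differ (illustrative names and values)
X_a = pd.DataFrame({"LotArea": [8450, 9600],
                    "Street_Pave": [1, 1],
                    "Alley_Grvl": [0, 1]})
X_b = pd.DataFrame({"LotArea": [11622],
                    "Street_Grvl": [1]})

# Outer join on columns: both frames get the union of columns,
# and columns missing from one side are filled with 0
X_a2, X_b2 = X_a.align(X_b, join="outer", axis=1, fill_value=0)
print(list(X_a2.columns))
print(X_b2)
```

With `join='outer'`, columns that exist only in the test set are added to the training frame as all zeros. Passing `join='left'` instead would keep only the training columns.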

# 5. What in the fuck

Somehow, this performed worse than the tutorial model. Some more feature engineering will have to be done after all.
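
One possible contributor, offered as an assumption rather than a diagnosis: SalePrice is continuous, and `RandomForestClassifier` treats every distinct price as its own class. scikit-learn's `RandomForestRegressor` predicts a numeric value directly. A minimal sketch on synthetic data (the data and parameter values are illustrative and not tuned for this dataset):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for the housing data
X_syn, y_syn = make_regression(n_samples=300, n_features=10,
                               noise=10.0, random_state=0)
Xs_train, Xs_val, ys_train, ys_val = train_test_split(X_syn, y_syn,
                                                      random_state=0)

# Same fit/predict pattern as the classifier, but the output is a number
reg = RandomForestRegressor(n_estimators=50, random_state=1)
reg.fit(Xs_train, ys_train)
val_preds = reg.predict(Xs_val)
val_mae = mean_absolute_error(ys_val, val_preds)
print("Validation MAE: {:.2f}".format(val_mae))
```

Swapping the regressor into `rfwithval` would be a small change, but whether it improves the score here is untested.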

# 6. Feature Engineering