# Kaggle Workshop Walkthrough

Example walkthrough for the House Price competition. This notebook shows one possible simple approach for each of the proposed workshop tasks. Consider it a simple baseline; you can do better 🚀

In [None]:
%matplotlib inline
import math
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.api.types import is_numeric_dtype

sns.set()

# Load data

In [None]:
rawtrain = pd.read_csv('../input/train.csv')
rawtest = pd.read_csv('../input/test.csv')

In [None]:
print('Train shape:', rawtrain.shape)
print('Test shape:', rawtest.shape)

These are the types of the columns in the dataset. Columns with dtype `object` hold the string values of the categorical features.

In [None]:
rawtrain.dtypes.value_counts()

# First model with selected features

To make the first steps with the dataset easier, you can use the following list of the 20 most important features (obtained by running a gradient boosting model and keeping the features it ranked highest).

In [None]:
selected = ['GrLivArea',
 'LotArea',
 'BsmtUnfSF',
 '1stFlrSF',
 'TotalBsmtSF',
 'GarageArea',
 'BsmtFinSF1',
 'LotFrontage',
 'YearBuilt',
 'Neighborhood',
 'GarageYrBlt',
 'OpenPorchSF',
 'YearRemodAdd',
 'WoodDeckSF',
 'MoSold',
 '2ndFlrSF',
 'OverallCond',
 'Exterior1st',
 'YrSold',
 'OverallQual']

Or, if you prefer, select every column except `Id` by uncommenting the line below.

In [None]:
#selected = [c for c in rawtest.columns if c not in ['Id']]

The code below builds a single dataframe holding both the `train` and `test` datasets, plus an `is_train` column to tell them apart. This is useful for transformations that must be applied to both `train` and `test`. If you keep this approach, you can use the checking code provided later in the notebook.

In [None]:
train = rawtrain[selected].copy()
train['is_train'] = 1
train['SalePrice'] = rawtrain['SalePrice'].values
train['Id'] = rawtrain['Id'].values

test = rawtest[selected].copy()
test['is_train'] = 0
test['SalePrice'] = 1  #dummy value
test['Id'] = rawtest['Id'].values

full = pd.concat([train, test])

not_features = ['Id', 'SalePrice', 'is_train']
features = [c for c in train.columns if c not in not_features]

# Check target distribution

The competition metric is based on log-transformed values. That is already a hint that a log transform may be useful to make the target distribution behave more like a normal distribution.

Now plot the distribution of `SalePrice`.

In [None]:
pd.Series(train.SalePrice).hist(bins=50);

In [None]:
pd.Series(np.log(train.SalePrice)).hist(bins=50);

And apply the log transformation to `SalePrice` in the dataset.

In [None]:
full['SalePrice'] = np.log(full['SalePrice'])
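
To make the effect of the log transform concrete, the sketch below compares skewness before and after the transform. It uses synthetic log-normal values as a stand-in for prices (an assumption for illustration, not the competition data); the names `fake_prices`, `raw_skew` and `log_skew` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "prices" (assumption: log-normal stand-in, not real data)
rng = np.random.RandomState(0)
fake_prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=1000))

# Skewness near 0 means a roughly symmetric distribution
raw_skew = fake_prices.skew()
log_skew = np.log(fake_prices).skew()

print(f'skew before log: {raw_skew:.2f}, after log: {log_skew:.2f}')
```

On the real data, the same two `.skew()` calls on `train.SalePrice` give a quick numeric complement to the histograms.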

# Check missing values

Do some analysis to identify the missing values. The proposed `summary` function below can be used to check missing values for each dtype (`object`, `np.float64`, `np.int64`).

In [None]:
def summary(df, dtype):
    data = []
    for c in df.select_dtypes([dtype]).columns:
        data.append({'name': c, 'unique': df[c].nunique(), 
                     'nulls': df[c].isnull().sum(),
                     'samples': df[c].unique()[:20] })
    return pd.DataFrame(data)

In [None]:
summary(full[features], object)

In [None]:
summary(full[features], np.float64)

In [None]:
summary(full[features], np.int64)

Now replace the missing values. The best approach is to analyse them case by case. A quick, lazy alternative is to use a new label for missing categoricals and zero for missing numerical values.

In [None]:
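# Assign the result back instead of using inplace=True on a column,
# which is unreliable in recent pandas versions
for c in full.select_dtypes([object]).columns:
    full[c] = full[c].fillna('__NA__')
for c in full.select_dtypes([np.float64]).columns:
    full[c] = full[c].fillna(0)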

Code to check there are no missing values in the dataset

In [None]:
for c in full.columns:
    assert full[c].isnull().sum() == 0, f'There are still missing values in {c}'
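
As one example of the case-by-case analysis mentioned above, a missing `LotFrontage` could be filled with the median of its neighborhood instead of zero. This is only a sketch on a made-up toy frame; the choice of grouping column and the global-median fallback are assumptions, not part of the original walkthrough.

```python
import numpy as np
import pandas as pd

# Toy data, for illustration only
toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B', 'B', 'C'],
    'LotFrontage': [60.0, np.nan, 80.0, 50.0, np.nan, np.nan],
})

# Median per neighborhood, aligned to the rows (NaN if a group has no known value)
group_median = toy.groupby('Neighborhood')['LotFrontage'].transform('median')
# Fallback for neighborhoods where every value is missing
global_median = toy['LotFrontage'].median()

toy['LotFrontage'] = toy['LotFrontage'].fillna(group_median).fillna(global_median)
print(toy)
```

In the notebook this would have to run before the blanket `fillna(0)` cell, since that cell leaves nothing to impute.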


# Encode categorical

Before creating the model, the categorical features must be encoded as numerical values. There are many ways to do it, for example building a mapping dictionary and applying it with pandas, or using `sklearn.preprocessing.LabelEncoder`.

In [None]:
mappers = {}
for c in full.select_dtypes([object]).columns:
    mappers[c] = {v:i for i,v in enumerate(full[c].unique())}
    full[c] = full[c].map(mappers[c]).astype(int)

Code to check that all columns are numeric

In [None]:
for c in full.columns:
    assert is_numeric_dtype(full[c]), f'Non-numeric column {c}'
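
The `LabelEncoder` alternative mentioned above could look roughly like this. The toy column is made up for illustration. One difference from the mapping dictionary: `LabelEncoder` sorts the classes, so the codes follow sorted order rather than order of first appearance.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy categorical column, for illustration only
toy = pd.DataFrame({'Neighborhood': ['NAmes', 'CollgCr', 'NAmes', '__NA__']})

encoder = LabelEncoder()
toy['Neighborhood_enc'] = encoder.fit_transform(toy['Neighborhood'])

# classes_ keeps the sorted labels; position in the array is the integer code
print(dict(zip(encoder.classes_, range(len(encoder.classes_)))))
print(toy)
```

Keeping the fitted `encoder` around allows `encoder.inverse_transform` to recover the original labels later.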

# First model

Now try to build a first predictive model. One suggestion is gradient boosting, which typically gives strong results on this kind of tabular data (`sklearn` provides a `GradientBoostingRegressor` model). If you choose to start with a linear regression instead, don't forget to one-hot encode the categorical values as well.

In [None]:
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split
from sklearn import metrics
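
If you go the linear regression route, one-hot encoding can be done with `pd.get_dummies`. The sketch below uses a tiny made-up frame; the column `LogPrice` and all values are assumptions for illustration. Integer codes like the ones produced in the previous section would impose an arbitrary ordering on a linear model, which is why each category gets its own 0/1 column here.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data, for illustration only
toy = pd.DataFrame({
    'GrLivArea': [1000, 1500, 2000, 1200, 1800, 900],
    'Neighborhood': ['A', 'B', 'A', 'C', 'B', 'C'],
    'LogPrice': [11.5, 12.0, 12.2, 11.7, 12.1, 11.4],
})

# One 0/1 column per category; dtype=float keeps the whole matrix numeric
X = pd.get_dummies(toy[['GrLivArea', 'Neighborhood']],
                   columns=['Neighborhood'], dtype=float)

linreg = LinearRegression().fit(X, toy['LogPrice'])
print(list(X.columns))
print(linreg.predict(X))
```

On the real data, the encoding should be applied to the combined `train` + `test` frame so that both end up with the same dummy columns.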

Implementation of the competition metric (notice target is already log transformed so no need to do that in the metric).

In [None]:
def rmse(y_true, y_pred):
    return np.sqrt(metrics.mean_squared_error(y_true, y_pred))
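
A useful property of measuring error on the log scale is that it depends on the ratio between prediction and truth, not on the absolute difference. The sketch below, with made-up prices, shows that a 10% overestimate yields the same error for a cheap and an expensive house. The `rmse` here is a NumPy-only re-implementation so the example is self-contained.

```python
import numpy as np

def rmse(y_true, y_pred):
    # Same metric as above, written with NumPy only
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Made-up prices: both predictions are 10% too high
cheap_err = rmse(np.log([100000.0]), np.log([110000.0]))
expensive_err = rmse(np.log([500000.0]), np.log([550000.0]))

print(cheap_err, expensive_err)
```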

Choose a validation strategy for your model. A simple approach is to hold out a validation set from the training data (`sklearn.model_selection.train_test_split` can be used for this).

In [None]:
train = full[full.is_train==1][features].values
target = full[full.is_train==1].SalePrice.values
Xtrain, Xvalid, ytrain, yvalid = train_test_split(train, target, test_size=0.2, random_state=42)

Note that these model parameters are just a first guess. The results can be improved with parameter optimization (for example using `sklearn.model_selection.RandomizedSearchCV`, left as a follow-up exercise).

In [None]:
model = GradientBoostingRegressor(n_estimators=1500, learning_rate=0.02, max_depth=4, random_state=42)

In [None]:
model.fit(Xtrain, ytrain)

In [None]:
ypred = model.predict(Xvalid)
rmse(yvalid, ypred)
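
For the parameter optimization follow-up, a `RandomizedSearchCV` run could be set up roughly as follows. This is a sketch on synthetic data; the search space, `n_iter` and `cv` values are arbitrary assumptions kept small so it runs quickly, not tuned recommendations.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data, for illustration only
rng = np.random.RandomState(42)
X = rng.rand(120, 5)
y = X @ np.array([1.0, 2.0, 0.0, -1.0, 0.5]) + 0.1 * rng.randn(120)

# Hypothetical search space
param_distributions = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.02, 0.05, 0.1],
    'max_depth': [2, 3, 4],
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,                               # number of sampled combinations
    scoring='neg_root_mean_squared_error',  # sklearn maximizes, hence the sign
    cv=3,
    random_state=42,
)
search.fit(X, y)

print(search.best_params_, -search.best_score_)
```

With the notebook's data, `X` and `y` would be replaced by `Xtrain` and `ytrain`; `search.best_estimator_` is then refit on all of it and can be evaluated on `Xvalid`.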

Now apply the model to the test dataset to generate the predictions to be submitted as results.

In [None]:
test = full[full.is_train==0]
ytestpred = model.predict(test[features].values)

Since the target was log-transformed, the predictions need to be exponentiated now.

In [None]:
ytestpred = np.exp(ytestpred)

In [None]:
subm = pd.DataFrame(ytestpred, index=test['Id'], columns=['SalePrice'])
subm.to_csv('submission.csv')

🎉 Great! Submission ready 💪 Now it's time to upload to Kaggle.

# Remove Outliers

It is worth checking the data for outliers and trying a model with some of them removed. The suggested task is to draw a boxplot for each numerical variable, then filter out some outliers in the training dataset.

In [None]:
cols = full[features].select_dtypes([np.float64, np.int64]).columns
n_rows = math.ceil(len(cols)/2)
fig, ax = plt.subplots(n_rows, 2, figsize=(14, n_rows*2))
ax = ax.flatten()
for i,c in enumerate(cols):
    sns.boxplot(x=full[c], ax=ax[i])
    ax[i].set_title(c)
    ax[i].set_xlabel("")
plt.tight_layout()

This code removes some training rows based on predefined limits. It is meant only as example code: this is a carefully cleaned dataset, so there is probably no reason to remove entries.

In [None]:
limits = [('TotalBsmtSF', 4000), ('WoodDeckSF', 1400)]

full['__include'] = 1 
for c, val in limits:
    full.loc[full[c] > val, '__include'] = 0

full = full[(full.is_train==0)|(full['__include']==1)]

full = full.drop('__include', axis=1)

# these dates in the future are likely typos
full['GarageYrBlt'] = np.where(full.GarageYrBlt > 2010, full.YearBuilt, full.GarageYrBlt)

# Feature engineering

Some ideas for new features:

- House age (considering the construction year and that this is a dataset from 2010)
- What season was the house sold (winter, summer, etc)
- Reduce the overall condition to 3 levels (good, average, bad)
- How long ago the house was sold
- Total area including first and second floor
- Total area including first floor, second floor and basement


In [None]:
full['Age'] = 2010 - full['YearBuilt']
month_season_map = {12:0, 1:0, 2:0, 3:1, 4:1, 5:1, 6:2, 7:2, 8:2, 9:3, 10:3, 11:3}
full['SeasonSold'] = full['MoSold'].map(month_season_map).astype(int)
full['SimplOverallCond'] = full['OverallCond'].replace(
        {1 : 1, 2 : 1, 3 : 1, 4 : 2, 5 : 2, 6 : 2, 7 : 3, 8 : 3, 9 : 3, 10 : 3})
full['TimeSinceSold'] =  2010 - full['YrSold']
full['TotalArea1st2nd'] = full['1stFlrSF'] + full['2ndFlrSF']
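full['TotalSF'] = full['TotalBsmtSF'] + full['1stFlrSF'] + full['2ndFlrSF']

# Refresh the feature list so the models below actually use the new columns
features = [c for c in full.columns if c not in not_features]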

# Blend 2 models

Now try to make 2 different models (for example GBM and ExtraTrees or RandomForest), combine the models (for example with a weighted average) and evaluate the performance in the validation set.

In [None]:
train = full[full.is_train==1][features].values
target = full[full.is_train==1].SalePrice.values
Xtrain, Xvalid, ytrain, yvalid = train_test_split(train, target, test_size=0.2, random_state=42)

model = GradientBoostingRegressor(n_estimators=1500, learning_rate=0.02, max_depth=4, random_state=42)
model.fit(Xtrain, ytrain)
ypred = model.predict(Xvalid)
rmse(yvalid, ypred)

In [None]:
model2 = ExtraTreesRegressor(n_estimators=1500, random_state=42)
model2.fit(Xtrain, ytrain)
ypred2 = model2.predict(Xvalid)
rmse(yvalid, ypred2)

In [None]:
blendpred = 0.7*ypred + 0.3*ypred2
rmse(yvalid, blendpred)
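
The 0.7/0.3 weights above are a guess. One way to pick them is to scan a grid of weights on the validation predictions, as sketched below with made-up numbers (`pred_a` and `pred_b` stand in for `ypred` and `ypred2`). Keep in mind that a weight tuned on the validation set can overfit to it, so the gain may not fully carry over to `test`.

```python
import numpy as np

def rmse(y_true, y_pred):
    # NumPy-only version of the metric, so the example is self-contained
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Made-up validation targets and predictions from two hypothetical models
y_valid = np.array([12.0, 11.5, 12.3, 11.8])
pred_a = np.array([12.1, 11.4, 12.2, 11.9])
pred_b = np.array([11.8, 11.7, 12.5, 11.6])

# Try weights 0.0, 0.1, ..., 1.0 for the first model
weights = np.linspace(0, 1, 11)
scores = [rmse(y_valid, w * pred_a + (1 - w) * pred_b) for w in weights]

best_w = weights[int(np.argmin(scores))]
print(f'best weight for model a: {best_w:.1f}, rmse: {min(scores):.4f}')
```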

If the performance improved on your validation set, make another submission on Kaggle to check the score on `test`.

In [None]:
test = full[full.is_train==0]
ytestpred = model.predict(test[features].values)
ytestpred2 = model2.predict(test[features].values)
blendtestpred = 0.7*ytestpred + 0.3*ytestpred2

blendtestpred = np.exp(blendtestpred)

subm = pd.DataFrame(blendtestpred, index=test['Id'], columns=['SalePrice'])
subm.to_csv('submission_blend.csv')

Well done! 🏆 Now just keep the momentum and go for the gold 🥇🥇🥇🥇🚀

# Extra: if you still have some time

Try the following:

- K-fold CV 
- Linear regression model
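
As a starting point for the K-fold item, the loop below sketches 5-fold cross-validation with `sklearn.model_selection.KFold`. It runs on synthetic data and uses a smaller model than the one in the walkthrough (both are assumptions to keep the example quick); with the notebook's data, `X` and `y` would be the `train` and `target` arrays.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold

# Synthetic regression data, for illustration only
rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.randn(100)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for train_idx, valid_idx in kfold.split(X):
    # Fit a fresh model on each training fold
    fold_model = GradientBoostingRegressor(n_estimators=100, random_state=42)
    fold_model.fit(X[train_idx], y[train_idx])
    pred = fold_model.predict(X[valid_idx])
    # RMSE on the held-out fold
    fold_scores.append(np.sqrt(np.mean((y[valid_idx] - pred) ** 2)))

print(f'CV rmse: {np.mean(fold_scores):.4f} +/- {np.std(fold_scores):.4f}')
```

Averaging over folds gives a less noisy estimate than a single 80/20 split, at the cost of fitting the model five times.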
