# Kaggle Learn - Machine Learning

Working through lessons from https://www.kaggle.com/learn/maching_learning

In [1]:
import pandas as pd

pd.set_option('display.max_rows', 10)

### Table of Contents

Level 1
1. [How Models Work](#part1)<br>
2. [Starting Your ML Project](#part2)<br>
3. [Selecting and Filtering Data](#part3)<br>
4. [Your First Scikit-Learn Model](#part4)<br>
5. [Model Validation](#part5)<br>
6. [Underfitting, Overfitting and Model Optimization](#part6)<br>
7. [Random Forests](#part7)<br>
8. [Submitting from a Kernel](#part8)<br>

Level 2
1. [Handling Missing Values](#l2_part1)<br>
2. [One Hot Encoding](#l2_part2)<br>
3. [XGBoost](#l2_part3)<br>
4. [Partial Dependence Plots](#l2_part4)<br>
5. [Pipelines](#l2_part5)<br>
6. [Cross-Validation](#l2_part6)<br>
7. [Data Leakage](#l2_part7)<br>

# Level 1

## Part 1: How Models Work<a class="anchor" id="part1"></a>

Course starts basic, but will ramp up quickly.

Modeling steps
* **fit** or **train** - capture patterns from data to build model
* **predict** - apply model to new data
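
The two steps can be sketched on a tiny made-up dataset (the numbers below are invented for illustration and are not from the course). An unrestricted decision tree memorizes its training rows, so predicting on a row it has already seen returns the training value.

```python
from sklearn.tree import DecisionTreeRegressor

# Toy data, invented for illustration: one feature, one target
X_toy = [[1], [2], [3], [4]]
y_toy = [10.0, 20.0, 30.0, 40.0]

toy_model = DecisionTreeRegressor(random_state=0)
toy_model.fit(X_toy, y_toy)            # fit: capture patterns from known data
toy_pred = toy_model.predict([[2]])    # predict: apply the model to a feature row
print(toy_pred)
```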

## Part 2: Starting Your ML Project<a class="anchor" id="part2"></a>

**df.describe**
  * returns summary data for numerical columns <br>
    (or non-numerical columns if called on only non-numerical columns)
  * count - rows with non-missing data in column
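
A small sketch of both behaviours, using a made-up two-column frame (the column names and values are assumptions for illustration only). The numeric summary skips the text column, and `count` excludes the missing entries.

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration: one numeric and one text column
toy = pd.DataFrame({
    'area': [850.0, 1200.0, np.nan, 950.0],
    'style': ['1Story', '2Story', '1Story', None],
})

numeric_summary = toy.describe()           # numeric columns only
text_summary = toy[['style']].describe()   # count, unique, top, freq

print(numeric_summary)
print(text_summary)
```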

In [2]:
# Iowa data for tutorial
data = pd.read_csv('data/train.csv')
data.describe()

| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1460.0 | 1460.0 | 1201.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1452.0 | 1460.0 | ... | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 |
| mean | 730.5 | 56.89726 | 70.049958 | 10516.828082 | 6.099315 | 5.575342 | 1971.267808 | 1984.865753 | 103.685262 | 443.639726 | ... | 94.244521 | 46.660274 | 21.95411 | 3.409589 | 15.060959 | 2.758904 | 43.489041 | 6.321918 | 2007.815753 | 180921.19589 |
| std | 421.610009 | 42.300571 | 24.284752 | 9981.264932 | 1.382997 | 1.112799 | 30.202904 | 20.645407 | 181.066207 | 456.098091 | ... | 125.338794 | 66.256028 | 61.119149 | 29.317331 | 55.757415 | 40.177307 | 496.123024 | 2.703626 | 1.328095 | 79442.502883 |
| min | 1.0 | 20.0 | 21.0 | 1300.0 | 1.0 | 1.0 | 1872.0 | 1950.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2006.0 | 34900.0 |
| 25% | 365.75 | 20.0 | 59.0 | 7553.5 | 5.0 | 5.0 | 1954.0 | 1967.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 2007.0 | 129975.0 |
| 50% | 730.5 | 50.0 | 69.0 | 9478.5 | 6.0 | 5.0 | 1973.0 | 1994.0 | 0.0 | 383.5 | ... | 0.0 | 25.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 2008.0 | 163000.0 |
| 75% | 1095.25 | 70.0 | 80.0 | 11601.5 | 7.0 | 6.0 | 2000.0 | 2004.0 | 166.0 | 712.25 | ... | 168.0 | 68.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 | 2009.0 | 214000.0 |
| max | 1460.0 | 190.0 | 313.0 | 215245.0 | 10.0 | 9.0 | 2010.0 | 2010.0 | 1600.0 | 5644.0 | ... | 857.0 | 547.0 | 552.0 | 508.0 | 480.0 | 738.0 | 15500.0 | 12.0 | 2010.0 | 755000.0 |


## Part 3: Selecting and Filtering Data<a class="anchor" id="part3"></a>

For now, selecting variables to explore by intuition. Later will introduce statistical techniques for prioritizing variables.

**df.columns** - names of columns in data frame as a pandas Index object (an immutable, array-like sequence of labels)<br> 
**df.columns.sort_values** - columns sorted alphabetically rather than in their order in the data frame<br>

Selecting columns<br>
* **dot-notation** - like python attribute - df.column<br>
* **brackets** - like python dictionary lookup - df[['column1', 'column2']]<br>
* **.loc** and **.iloc** - see pandas tutorial
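
The selection styles above can be compared side by side on a small made-up frame (values invented for illustration). Note that dot-notation only works when the column name is a valid Python identifier, so a name such as `1stFlrSF` needs brackets.

```python
import pandas as pd

# Toy frame, invented for illustration
toy = pd.DataFrame({'LotArea': [8450, 9600, 11250],
                    'YearBuilt': [2003, 1976, 2001]})

by_dot = toy.LotArea                      # Series, attribute-style access
by_bracket = toy['LotArea']               # Series, dictionary-style access
by_list = toy[['LotArea', 'YearBuilt']]   # DataFrame, list of column names
by_loc = toy.loc[0:1, 'YearBuilt']        # label-based, end label is included
by_iloc = toy.iloc[0:1, 1]                # position-based, end position is excluded

print(by_list)
print(len(by_loc), len(by_iloc))
```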

In [3]:
data.columns.sort_values()

Index(['1stFlrSF', '2ndFlrSF', '3SsnPorch', 'Alley', 'BedroomAbvGr',
       'BldgType', 'BsmtCond', 'BsmtExposure', 'BsmtFinSF1', 'BsmtFinSF2',
       'BsmtFinType1', 'BsmtFinType2', 'BsmtFullBath', 'BsmtHalfBath',
       'BsmtQual', 'BsmtUnfSF', 'CentralAir', 'Condition1', 'Condition2',
       'Electrical', 'EnclosedPorch', 'ExterCond', 'ExterQual', 'Exterior1st',
       'Exterior2nd', 'Fence', 'FireplaceQu', 'Fireplaces', 'Foundation',
       'FullBath', 'Functional', 'GarageArea', 'GarageCars', 'GarageCond',
       'GarageFinish', 'GarageQual', 'GarageType', 'GarageYrBlt', 'GrLivArea',
       'HalfBath', 'Heating', 'HeatingQC', 'HouseStyle', 'Id', 'KitchenAbvGr',
       'KitchenQual', 'LandContour', 'LandSlope', 'LotArea', 'LotConfig',
       'LotFrontage', 'LotShape', 'LowQualFinSF', 'MSSubClass', 'MSZoning',
       'MasVnrArea', 'MasVnrType', 'MiscFeature', 'MiscVal', 'MoSold',
       'Neighborhood', 'OpenPorchSF', 'OverallCond', 'OverallQual',
       'PavedDrive', 'PoolArea', 'Po

In [4]:
# Summary of prices - target for predictions
data.SalePrice.describe()

count      1460.000000
mean     180921.195890
std       79442.502883
min       34900.000000
25%      129975.000000
50%      163000.000000
75%      214000.000000
max      755000.000000
Name: SalePrice, dtype: float64

In [5]:
# Summary of square footage by floor - potential features
data[['1stFlrSF', '2ndFlrSF']].describe()

| | 1stFlrSF | 2ndFlrSF |
|---|---|---|
| count | 1460.0 | 1460.0 |
| mean | 1162.626712 | 346.992466 |
| std | 386.587738 | 436.528436 |
| min | 334.0 | 0.0 |
| 25% | 882.0 | 0.0 |
| 50% | 1087.0 | 0.0 |
| 75% | 1391.25 | 728.0 |
| max | 4692.0 | 2065.0 |


## Part 4: Your First Scikit-Learn Model <a class="anchor" id="part4"></a>

Choose a **prediction target** aka **outcome variable** aka **dependent variable** conventionally **y**

Choose **predictors** aka **features** aka **independent variables** conventionally **X**


In [6]:
target = 'SalePrice'
y = data.loc[:, target]

features = ['LotArea',
            'YearBuilt',
            '1stFlrSF',
            '2ndFlrSF',
            'FullBath',
            'BedroomAbvGr',
            'TotRmsAbvGrd']
X = data.loc[:, features]

display(y)
display(X)

0       208500
1       181500
2       223500
3       140000
4       250000
         ...  
1455    175000
1456    210000
1457    266500
1458    142125
1459    147500
Name: SalePrice, Length: 1460, dtype: int64

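| | LotArea | YearBuilt | 1stFlrSF | 2ndFlrSF | FullBath | BedroomAbvGr | TotRmsAbvGrd |
|---|---|---|---|---|---|---|---|
| 0 | 8450 | 2003 | 856 | 854 | 2 | 3 | 8 |
| 1 | 9600 | 1976 | 1262 | 0 | 2 | 3 | 6 |
| 2 | 11250 | 2001 | 920 | 866 | 2 | 3 | 6 |
| 3 | 9550 | 1915 | 961 | 756 | 1 | 3 | 7 |
| 4 | 14260 | 2000 | 1145 | 1053 | 2 | 4 | 9 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1455 | 7917 | 1999 | 953 | 694 | 2 | 3 | 7 |
| 1456 | 13175 | 1978 | 2073 | 0 | 2 | 3 | 7 |
| 1457 | 9042 | 1941 | 1188 | 1152 | 2 | 4 | 9 |
| 1458 | 9717 | 1950 | 1078 | 0 | 1 | 2 | 5 |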


#### Training the model
* Import the desired model class from scikit-learn and instantiate it. 
* This initial model was trained on all of the data to show that this is a bad
  idea (set aside some data for [validation](#part5)).
* This initial model uses a single decision tree. More sophisticated models
  (such as [random forest](#part7)) are generally preferred. 

In [7]:
from sklearn.tree import DecisionTreeRegressor

iowa_model = DecisionTreeRegressor()
iowa_model.fit(X, y)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

In [8]:
# Predictions
predicted_prices = iowa_model.predict(X)
print(predicted_prices - y.values) # Most predictions exactly correct
print()
print(data
      .assign(PredictedPrice = predicted_prices.astype('int')) 
      .loc[:, ['SalePrice', 'PredictedPrice']]
      .loc[data.SalePrice != predicted_prices]) # only 24 of 1460 are different
print()

# 'Average' house
#    note .values.reshape only necessary for single row of features
print(iowa_model.predict(data.loc[:, features].mean(axis='rows')
                         .values.reshape(1, -1)))
print(data.SalePrice.mean())
print()

# 'Minimal' house
print(iowa_model.predict(data.loc[:, features].min(axis='rows')
                         .values.reshape(1, -1)))
print(data.SalePrice.min())
print()

# 'Max' house
print(iowa_model.predict(data.loc[:, features].max(axis='rows')
                         .values.reshape(1, -1)))
print(data.SalePrice.max())


[0. 0. 0. ... 0. 0. 0.]

      SalePrice  PredictedPrice
102      118964          118911
126      128000          135875
145      130000          132500
193      130000          132500
232       94500          106250
...         ...             ...
1421     127500          133750
1422     136500          134000
1431     143750          135875
1441     149300          144433
1452     145000          146500

[24 rows x 2 columns]

[126000.]
180921.19589041095

[60000.]
34900

[745000.]
755000


## Part 5: Model Validation <a class="anchor" id="part5"></a>

**Mean Absolute Error** or **MAE**
* Average of the absolute differences between predicted and actual values
* Use **sklearn.metrics.mean_absolute_error**

**Validation data**
* Setting aside some data before training the model to use for testing the model
* Use **sklearn.model_selection.train_test_split**
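
As a sanity check on the definition, MAE can be computed by hand and compared with scikit-learn's result. The prices below are made up for illustration. The sketch also shows that `train_test_split` holds out 25% of the rows by default.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Made-up actual and predicted prices, for illustration only
actual = np.array([200000, 150000, 320000])
predicted = np.array([210000, 140000, 300000])

mae_by_hand = np.mean(np.abs(predicted - actual))   # mean of |error|
mae_sklearn = mean_absolute_error(actual, predicted)
print(mae_by_hand, mae_sklearn)

# Default split: 75% of rows for training, 25% for validation
rows = np.arange(8).reshape(-1, 1)
rows_train, rows_val = train_test_split(rows, random_state=0)
print(len(rows_train), len(rows_val))
```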

In [9]:
# Using training data for validation
#   just to illustrate what a bad idea this is
from sklearn.metrics import mean_absolute_error

predicted_prices = iowa_model.predict(X)
mean_absolute_error(y, predicted_prices)

62.35433789954339

In [10]:
# Using separate training and validation sets
from sklearn.model_selection import train_test_split

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
iowa_model = DecisionTreeRegressor()
iowa_model.fit(train_X, train_y)
val_predictions = iowa_model.predict(val_X)

print(mean_absolute_error(val_y, val_predictions))

33235.24657534246


## Part 6: Underfitting, Overfitting and Model Optimization<a class="anchor" id="part6"></a>

**overfitting** - model accurately predicts training data, but does not generalize well

**underfitting** - model performs poorly even on training data; training and validation error may be similar, but both are consistently poor

Decision tree **depth** is the length of the longest path from root to leaf. A shallow tree is prone to underfitting, while a deep tree is prone to overfitting. Calculating the validation MAE for models trained over a range of sizes can help identify the optimal trade-off. The cell below controls tree size with `max_leaf_nodes` rather than depth directly.
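
The same trade-off can be sketched by limiting depth directly with `max_depth`. The data here is synthetic (a noisy sine curve, an assumption made only so the example is self-contained), so the specific numbers say nothing about the Iowa data. The pattern to look for is training error falling to zero for the unrestricted tree while validation error does not.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, invented for illustration: noisy sine curve
rng = np.random.RandomState(0)
X_syn = rng.uniform(0, 10, size=(400, 1))
y_syn = np.sin(X_syn[:, 0]) + rng.normal(scale=0.3, size=400)

Xs_train, Xs_val, ys_train, ys_val = train_test_split(X_syn, y_syn, random_state=0)

depth_results = {}
for depth in [1, 3, 6, None]:   # None lets the tree grow until leaves are pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(Xs_train, ys_train)
    train_mae = mean_absolute_error(ys_train, tree.predict(Xs_train))
    val_mae = mean_absolute_error(ys_val, tree.predict(Xs_val))
    depth_results[depth] = (train_mae, val_mae)
    print(depth, tree.get_depth(), round(train_mae, 3), round(val_mae, 3))
```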

In [11]:
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor  

def get_mae(max_leaf_nodes, predictors_train, predictors_val, targ_train, targ_val):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(predictors_train, targ_train)
    preds_val = model.predict(predictors_val)
    mae = mean_absolute_error(targ_val, preds_val)
    return mae

# Compare MAE with differing values for max_leaf_nodes parameter
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes,
                                                                my_mae))

Max leaf nodes: 5  		 Mean Absolute Error:  35190
Max leaf nodes: 50  		 Mean Absolute Error:  27825
Max leaf nodes: 500  		 Mean Absolute Error:  32662
Max leaf nodes: 5000  		 Mean Absolute Error:  33382


## Part 7: Random Forests  <a class="anchor" id="part7"></a>

**Random forests** average the predictions of many decision trees for better predictive accuracy. They tend to work well with default parameters, though other models with generally better performance exist.
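
As a rough check of the averaging idea, the sketch below fits a small forest on synthetic data (invented for illustration, not the Iowa data) and compares the forest's prediction with the mean of the individual trees' predictions, which are available through the fitted `estimators_` attribute.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, invented for illustration
rng = np.random.RandomState(0)
X_syn = rng.uniform(0, 10, size=(200, 2))
y_syn = X_syn[:, 0] * 3 + X_syn[:, 1] + rng.normal(size=200)

forest = RandomForestRegressor(n_estimators=25, random_state=0)
forest.fit(X_syn, y_syn)

sample = X_syn[:5]
forest_pred = forest.predict(sample)

# Predict with each tree separately, then average across trees
per_tree = np.array([t.predict(sample) for t in forest.estimators_])
mean_of_trees = per_tree.mean(axis=0)

print(np.allclose(forest_pred, mean_of_trees))
```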

In [12]:
from sklearn.ensemble import RandomForestRegressor

forest_model = RandomForestRegressor()
forest_model.fit(train_X, train_y)
iowa_predictions = forest_model.predict(val_X)
print(mean_absolute_error(val_y, iowa_predictions))

24571.0301826484


## Part 8: Submitting from a Kernel  <a class="anchor" id="part8"></a>

Information for submitting models to Kaggle competitions.
1. Download a .csv with training data<br>
2. Use this to train and validate a model<br>
3. Download a .csv of testing data (same features as training data, but<br>
   no values for the target)<br>
4. Use the model to make predictions on the testing data<br>
5. Save a two column .csv with the IDs from the test data and the predictions<br>
   (no index column)<br>
6. Submit the .csv to Kaggle<br>
   may have to do this from a Kaggle Notebook

In [13]:
#my_submission = pd.DataFrame({'Id': test.Id, 'SalePrice': predicted_prices})
# you could use any filename. We choose submission here
# my_submission.to_csv('submission.csv', index=False)
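
A runnable sketch of step 5, with hypothetical IDs and predictions standing in for the real test data and model output (both are made up here). Calling `to_csv` without a path returns the CSV text, which makes it easy to confirm there is no index column before writing a file.

```python
import pandas as pd

# Hypothetical stand-ins for the test-set IDs and the model's predictions
test_ids = pd.Series([1461, 1462, 1463], name='Id')
test_predictions = [125000.0, 158000.0, 181000.0]

submission = pd.DataFrame({'Id': test_ids, 'SalePrice': test_predictions})

# index=False keeps the file to exactly two columns
csv_text = submission.to_csv(index=False)   # pass a filename to write to disk
print(csv_text)
```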

# Level 2

## Part 1: Handling Missing Values <a class="anchor" id="l2_part1"></a>

Solutions:
1. Drop the columns with NaN
   * usually not the best solution, but for mostly NaN columns may make sense
   * may throw out useful features
   * problems if test set has NaN in other columns
   * dropping rows with NaN is even more dubious, may introduce sampling bias<br>
   <br> 
2. Imputation
   * Fill the missing value with some number
   * **sklearn.preprocessing.Imputer** (replaced by **sklearn.impute.SimpleImputer** in scikit-learn 0.20 and removed in 0.22)
   * The default is to use the column mean. More sophisticated methods exist but
     are generally no better<br>
     <br>
3. Add a boolean 'wasNaN' column for each imputed column
   * May be meaningful if NaNs are systematically above or below their means
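
What each option does to the data is easier to see on a tiny frame. The sketch below uses plain pandas on made-up values (not the Iowa data), with mean imputation written as `fillna` rather than a scikit-learn imputer to keep it short.

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration: one column with missing values
toy = pd.DataFrame({'LotFrontage': [60.0, np.nan, 80.0, np.nan],
                    'LotArea': [8000, 9000, 10000, 11000]})

# 1. Drop every column that contains a NaN
dropped = toy.dropna(axis='columns')

# 2. Impute: fill each NaN with its column mean
imputed = toy.fillna(toy.mean())

# 3. Impute, and also record which rows were originally missing
flagged = imputed.copy()
flagged['LotFrontage_was_missing'] = toy['LotFrontage'].isnull()

print(dropped.columns.tolist())
print(flagged)
```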

In [14]:
# Find columns with NaNs
nan_count_by_column = (data.isnull().sum().loc[data.isnull().sum() > 0]
                      .sort_values(ascending=False))

with pd.option_context('display.max_rows', 20):
    print(nan_count_by_column)

PoolQC          1453
MiscFeature     1406
Alley           1369
Fence           1179
FireplaceQu      690
LotFrontage      259
GarageYrBlt       81
GarageType        81
GarageFinish      81
GarageQual        81
GarageCond        81
BsmtFinType2      38
BsmtExposure      38
BsmtFinType1      37
BsmtCond          37
BsmtQual          37
MasVnrArea         8
MasVnrType         8
Electrical         1
dtype: int64


In [42]:
# Using all numeric features from Iowa data
#   Ignoring categorical data types for now for simplicity
#   will add in next section (level2 - part2)

X = (data.drop(['SalePrice'], axis='columns')
         .select_dtypes(exclude=['object']))
y = data.SalePrice

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Function for training and evaluating a model for each NaN treatment
def score_dataset(X_train, X_test, y_train, y_test):
    model = RandomForestRegressor()
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    return mean_absolute_error(y_test, predictions)

In [17]:
# Solution 1 - drop columns with NaNs

# df.dropna(axis=1) drops all columns with NaNs, but doesn't store which columns
# instead save list of columns to drop
cols_with_nan = [col for col in X_train.columns if X_train[col].isnull().any()]
print(sorted(cols_with_nan))
X_train_reduced = X_train.drop(cols_with_nan, axis='columns')
X_test_reduced = X_test.drop(cols_with_nan, axis='columns')

# list of NaN columns for training data should be empty now
cols_with_nan_train = [col for col in X_train_reduced.columns if X_train_reduced[col].isnull().any()]
print(sorted(cols_with_nan_train))

# check if test data is also free of NaNs after dropping columns
cols_with_nan_test = [col for col in X_test_reduced.columns if X_test_reduced[col].isnull().any()]
print(sorted(cols_with_nan_test))

# train and test dropped NaN solution
mae = score_dataset(X_train_reduced, X_test_reduced, y_train, y_test)
print()
print('Mean Absolute Error from dropping columns with Missing Values: {0}'
     .format(mae))

['GarageYrBlt', 'LotFrontage', 'MasVnrArea']
[]
[]

Mean Absolute Error from dropping columns with Missing Values: 19130.69726027397


In [18]:
# Solution 2 - Imputing
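# Imputer was replaced by sklearn.impute.SimpleImputer in scikit-learn 0.20 and
# removed from sklearn.preprocessing in 0.22; fall back for older releases
try:
    from sklearn.impute import SimpleImputer as Imputer
except ImportError:
    from sklearn.preprocessing import Imputer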

# perform imputations
my_imputer = Imputer()
X_train_imputed = my_imputer.fit_transform(X_train)
X_test_imputed = my_imputer.transform(X_test)
    
# train and test with imputed data
mae = score_dataset(X_train_imputed, X_test_imputed, y_train, y_test)
print('Mean Absolute Error from imputing Missing Values: {0}'.format(mae)) 

Mean Absolute Error from imputing Missing Values: 20254.68082191781


In [40]:
# Solution 3 - Imputing plus adding 'wasNaN' as a feature per imputed column

# Add columns of boolean values for 'was missing' imputed columns
X_train_plus = X_train.copy()
X_test_plus = X_test.copy()
cols_with_nan = [col for col in X_train.columns if X_train[col].isnull().any()]
for col in cols_with_nan:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
    X_test_plus[col + '_was_missing'] = X_test_plus[col].isnull()
    
# List of added columns
print(X_train_plus.columns[X_train_plus.columns.str.contains('_was_missing')])
print()

# Perform imputations
my_imputer = Imputer()
X_train_imputed_plus = my_imputer.fit_transform(X_train_plus)
X_test_imputed_plus = my_imputer.transform(X_test_plus)
    
# train and test with imputed data
mae = score_dataset(X_train_imputed_plus, X_test_imputed_plus, y_train, y_test)
print('Mean Absolute Error from imputing Missing Values: {0}'.format(mae))  

Index(['LotFrontage_was_missing', 'MasVnrArea_was_missing',
       'GarageYrBlt_was_missing'],
      dtype='object')

Mean Absolute Error from imputing Missing Values: 18868.11890410959
