# Kaggle Learn - Machine Learning - Level 1

Working through lessons from https://www.kaggle.com/learn/maching_learning

In [1]:
import pandas as pd

pd.set_option('display.max_rows', 10)

### Table of Contents
1. [How Models Work](#part1)<br>
2. [Starting Your ML Project](#part2)<br>
3. [Selecting and Filtering Data](#part3)<br>
4. [Your First Scikit-Learn Model](#part4)<br>
5. [Model Validation](#part5)<br>
6. [Underfitting, Overfitting and Model Optimization](#part6)<br>
7. [Random Forests](#part7)<br>
8. [Submitting from Kernel](#part8)<br>

## Part 1: How Models Work<a class="anchor" id="part1"></a>

The course starts with the basics but ramps up quickly.

Modeling steps
* **fit** or **train** - capture patterns from data to build model
* **predict** - apply model to new data
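
As a minimal sketch of the two steps, the cell below fits a decision tree to a tiny made-up dataset and then predicts for unseen inputs. The numbers and the names `X_toy`, `y_toy` and `new_homes` are invented for illustration and are not part of the course data.

```python
from sklearn.tree import DecisionTreeRegressor

# Toy data (made up for illustration): one feature (floor area) and a price.
X_toy = [[50], [80], [120], [200]]
y_toy = [100_000, 150_000, 210_000, 320_000]

toy_model = DecisionTreeRegressor(random_state=0)
toy_model.fit(X_toy, y_toy)                     # fit: capture patterns from the data

new_homes = [[85], [190]]                       # inputs the model has not seen
toy_predictions = toy_model.predict(new_homes)  # predict: apply the model to new data
print(toy_predictions)
```

Each prediction is the price of the training home whose leaf the new home falls into, which is why a fully grown tree reproduces its training data so closely.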

## Part 2: Starting Your ML Project<a class="anchor" id="part2"></a>

**df.describe**
  * returns summary statistics for numerical columns <br>
    (or for non-numerical columns if called on only non-numerical columns)
  * count - number of rows with non-missing data in each column
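
A small sketch of both behaviours, using a made-up frame (`toy` and its values are assumptions for illustration): `count` drops below the row total where a value is missing, and calling `describe` on only a text column returns a different set of statistics.

```python
import numpy as np
import pandas as pd

# Made-up frame for illustration: one numeric column has a missing value.
toy = pd.DataFrame({
    'rooms': [2, 3, 4, 3],
    'area': [70.0, np.nan, 120.0, 95.0],
    'suburb': ['A', 'B', 'A', 'C'],
})

numeric_summary = toy.describe()           # numerical columns only
text_summary = toy[['suburb']].describe()  # non-numerical: count, unique, top, freq

print(numeric_summary.loc['count'])        # 'area' counts 3 rows, 'rooms' counts 4
print(text_summary)
```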

In [2]:
# Example data - from finished Melbourne project
melbourne_data = pd.read_csv('data/melb_data.csv')
melbourne_data.describe()

Unnamed: 0.1,Unnamed: 0,Rooms,Price,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
count,18396.0,18396.0,18396.0,18395.0,18395.0,14927.0,14925.0,14820.0,13603.0,7762.0,8958.0,15064.0,15064.0,18395.0
mean,11826.787073,2.93504,1056697.0,10.389986,3107.140147,2.913043,1.538492,1.61552,558.116371,151.220219,1965.879996,-37.809849,144.996338,7517.975265
std,6800.710448,0.958202,641921.7,6.00905,95.000995,0.964641,0.689311,0.955916,3987.326586,519.188596,37.013261,0.081152,0.106375,4488.416599
min,1.0,1.0,85000.0,0.0,3000.0,0.0,0.0,0.0,0.0,0.0,1196.0,-38.18255,144.43181,249.0
25%,5936.75,2.0,633000.0,6.3,3046.0,2.0,1.0,1.0,176.5,93.0,1950.0,-37.8581,144.931193,4294.0
50%,11820.5,3.0,880000.0,9.7,3085.0,3.0,1.0,2.0,440.0,126.0,1970.0,-37.803625,145.00092,6567.0
75%,17734.25,3.0,1302000.0,13.3,3149.0,3.0,2.0,2.0,651.0,174.0,2000.0,-37.75627,145.06,10331.0
max,23546.0,12.0,9000000.0,48.1,3978.0,20.0,8.0,10.0,433014.0,44515.0,2018.0,-37.40853,145.52635,21650.0


In [3]:
data = pd.read_csv('data/train.csv')
data.describe()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,...,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,730.5,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,...,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,421.610009,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,...,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,1.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,365.75,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,730.5,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,...,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,1095.25,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,...,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,1460.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,...,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


## Part 3: Selecting and Filtering Data<a class="anchor" id="part3"></a>

For now, selecting variables to explore by intuition. Later will introduce statistical techniques for prioritizing variables.

**df.columns** - names of the columns in the data frame, as a pandas Index object<br> 
**df.columns.sort_values** - column names sorted alphabetically rather than in their order in the data frame<br>

Selecting columns<br>
* **dot-notation** - like a Python attribute - df.column<br>
* **brackets** - like a Python dictionary lookup - df[['column1', 'column2']]<br>
* **.loc** and **.iloc** - the approach pandas prefers - covered later
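
The selection styles side by side, as a rough sketch on a three-row frame built from the first rows of the Iowa data shown later in these notes. The variable names are made up for illustration.

```python
import pandas as pd

# First three rows of three Iowa columns, typed in by hand for a self-contained example.
toy = pd.DataFrame({
    'LotArea': [8450, 9600, 11250],
    'YearBuilt': [2003, 1976, 2001],
    'SalePrice': [208500, 181500, 223500],
})

by_attribute = toy.SalePrice                     # dot-notation: returns a Series
by_brackets = toy[['LotArea', 'YearBuilt']]      # list in brackets: returns a DataFrame
by_label = toy.loc[:, ['LotArea', 'SalePrice']]  # .loc: select by row/column labels
by_position = toy.iloc[:2, 0]                    # .iloc: first two rows of the first column
sorted_names = toy.columns.sort_values()         # alphabetical Index of column names

print(by_position)
print(sorted_names)
```

Dot-notation only works when the column name is a valid Python identifier, so a column such as `1stFlrSF` has to be selected with brackets or `.loc`.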

In [4]:
data.columns.sort_values()

Index(['1stFlrSF', '2ndFlrSF', '3SsnPorch', 'Alley', 'BedroomAbvGr',
       'BldgType', 'BsmtCond', 'BsmtExposure', 'BsmtFinSF1', 'BsmtFinSF2',
       'BsmtFinType1', 'BsmtFinType2', 'BsmtFullBath', 'BsmtHalfBath',
       'BsmtQual', 'BsmtUnfSF', 'CentralAir', 'Condition1', 'Condition2',
       'Electrical', 'EnclosedPorch', 'ExterCond', 'ExterQual', 'Exterior1st',
       'Exterior2nd', 'Fence', 'FireplaceQu', 'Fireplaces', 'Foundation',
       'FullBath', 'Functional', 'GarageArea', 'GarageCars', 'GarageCond',
       'GarageFinish', 'GarageQual', 'GarageType', 'GarageYrBlt', 'GrLivArea',
       'HalfBath', 'Heating', 'HeatingQC', 'HouseStyle', 'Id', 'KitchenAbvGr',
       'KitchenQual', 'LandContour', 'LandSlope', 'LotArea', 'LotConfig',
       'LotFrontage', 'LotShape', 'LowQualFinSF', 'MSSubClass', 'MSZoning',
       'MasVnrArea', 'MasVnrType', 'MiscFeature', 'MiscVal', 'MoSold',
       'Neighborhood', 'OpenPorchSF', 'OverallCond', 'OverallQual',
       'PavedDrive', 'PoolArea', 'Po

In [5]:
prices = data.SalePrice
prices.head(10)

0    208500
1    181500
2    223500
3    140000
4    250000
5    143000
6    307000
7    200000
8    129900
9    118000
Name: SalePrice, dtype: int64

In [6]:
sf = data[['1stFlrSF', '2ndFlrSF']]
sf.describe()

Unnamed: 0,1stFlrSF,2ndFlrSF
count,1460.0,1460.0
mean,1162.626712,346.992466
std,386.587738,436.528436
min,334.0,0.0
25%,882.0,0.0
50%,1087.0,0.0
75%,1391.25,728.0
max,4692.0,2065.0


## Part 4: Your First Scikit-Learn Model <a class="anchor" id="part4"></a>

Choose a **prediction target** (also called the **outcome variable** or **dependent variable**), conventionally named **y**.

Choose the **predictors** (also called **features** or **independent variables**), conventionally named **X**.


In [7]:
target = 'SalePrice'
y = data.loc[:, target]

features = ['LotArea',
            'YearBuilt',
            '1stFlrSF',
            '2ndFlrSF',
            'FullBath',
            'BedroomAbvGr',
            'TotRmsAbvGrd']
X = data.loc[:, features]

display(y)
display(X)

0       208500
1       181500
2       223500
3       140000
4       250000
         ...  
1455    175000
1456    210000
1457    266500
1458    142125
1459    147500
Name: SalePrice, Length: 1460, dtype: int64

Unnamed: 0,LotArea,YearBuilt,1stFlrSF,2ndFlrSF,FullBath,BedroomAbvGr,TotRmsAbvGrd
0,8450,2003,856,854,2,3,8
1,9600,1976,1262,0,2,3,6
2,11250,2001,920,866,2,3,6
3,9550,1915,961,756,1,3,7
4,14260,2000,1145,1053,2,4,9
...,...,...,...,...,...,...,...
1455,7917,1999,953,694,2,3,7
1456,13175,1978,2073,0,2,3,7
1457,9042,1941,1188,1152,2,4,9
1458,9717,1950,1078,0,1,2,5


In [8]:
# Training the model
#   note - this version used all data (none saved for validation)
#          to illustrate that this is a bad idea. See below.
from sklearn.tree import DecisionTreeRegressor

iowa_model = DecisionTreeRegressor()
iowa_model.fit(X, y)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

In [31]:
# Predictions
predicted_prices = iowa_model.predict(X)
print(predicted_prices - y.values) # Most predictions exactly correct
print()
print(data
      .assign(PredictedPrice = predicted_prices.astype('int')) 
      .loc[:, ['SalePrice', 'PredictedPrice']]
      .loc[data.SalePrice != predicted_prices])   # 24 of 1460 are different
print()

# 'Average' house
#    note .values.reshape(1, -1) is only necessary because this is a single row
#    of features and predict expects a 2-D array
print(iowa_model.predict(data.loc[:, features].mean(axis='rows').values.reshape(1, -1)))
print(data.SalePrice.mean())
print()

# 'Minimal' house
print(iowa_model.predict(data.loc[:, features].min(axis='rows').values.reshape(1, -1)))
print(data.SalePrice.min())
print()

# 'Max' house
print(iowa_model.predict(data.loc[:, features].max(axis='rows').values.reshape(1, -1)))
print(data.SalePrice.max())


[0. 0. 0. ... 0. 0. 0.]

      SalePrice  PredictedPrice
102      118964          118911
126      128000          135875
145      130000          132500
193      130000          132500
232       94500          106250
...         ...             ...
1421     127500          133750
1422     136500          134000
1431     143750          135875
1441     149300          144433
1452     145000          146500

[24 rows x 2 columns]

[126000.]
180921.19589041095

[60000.]
34900

[755000.]
755000


## Part 5: Model Validation <a class="anchor" id="part5"></a>

**Mean Absolute Error** or **MAE**
* Average of the absolute differences between predicted and actual values

**Validation data**
* Set aside some data before training and use it only for testing the model
* Use *sklearn.model_selection.train_test_split* or a similar method
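
To make the MAE definition concrete, here is a small sketch that computes it by hand and compares the result with scikit-learn. The actual prices are the first three sale prices from the Iowa data; the predicted values are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

actual = np.array([208500, 181500, 223500])
predicted = np.array([200000, 190000, 223500])  # hypothetical predictions

# MAE by hand: take each error, drop its sign, then average.
manual_mae = np.mean(np.abs(predicted - actual))
library_mae = mean_absolute_error(actual, predicted)

print(manual_mae, library_mae)
```

The absolute errors are 8500, 8500 and 0, so both calculations give 17000 / 3. Taking the absolute value stops over- and under-predictions from cancelling each other out.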

In [32]:
# Using training data for validation
#   just to illustrate what a bad idea this is
from sklearn.metrics import mean_absolute_error

predicted_prices = iowa_model.predict(X)
mean_absolute_error(y, predicted_prices)

62.35433789954339

In [33]:
# Using separate training and validation sets
from sklearn.model_selection import train_test_split

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
iowa_model = DecisionTreeRegressor()
iowa_model.fit(train_X, train_y)
val_predictions = iowa_model.predict(val_X)

print(mean_absolute_error(val_y, val_predictions))

32760.720547945206


## Part 6: Underfitting, Overfitting and Model Optimization<a class="anchor" id="part6"></a>

**overfitting** - the model predicts the training data accurately but does not generalize well to new data

**underfitting** - the model performs poorly even on the training data. Technically it generalizes well, but 'consistently poor' isn't particularly useful

Decision tree **depth** is the length of the longest path from root to leaf. A shallow tree is prone to underfitting, while a deep tree may overfit. Calculating the validation MAE for models trained over a range of depths can help identify the optimal trade-off. The `get_mae` helper in this section limits tree size through `max_leaf_nodes` rather than depth directly.
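
A rough sketch of that idea using `max_depth` directly. The data is synthetic (a noisy sine curve, an assumption made so the cell is self-contained), so the numbers say nothing about the Iowa data; the point is the pattern of training and validation error as the tree gets deeper.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration only: a noisy sine curve.
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(300, 1))
y_syn = np.sin(X_syn[:, 0]) + rng.normal(scale=0.3, size=300)

X_tr, X_va, y_tr, y_va = train_test_split(X_syn, y_syn, random_state=0)

depth_results = {}
for depth in [1, 2, 4, 8, None]:  # None lets the tree grow until it cannot split further
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    model.fit(X_tr, y_tr)
    train_mae = mean_absolute_error(y_tr, model.predict(X_tr))
    val_mae = mean_absolute_error(y_va, model.predict(X_va))
    depth_results[depth] = (train_mae, val_mae, model.get_depth())

# Choose the setting with the lowest validation (not training) error.
best_depth = min(depth_results, key=lambda d: depth_results[d][1])

for depth, (train_mae, val_mae, actual_depth) in depth_results.items():
    print(f"max_depth={depth}: actual depth={actual_depth}  "
          f"train MAE={train_mae:.3f}  val MAE={val_mae:.3f}")
print("Lowest validation MAE at max_depth =", best_depth)
```

The unrestricted tree drives training MAE to zero because every training point ends up in its own leaf. That is exactly why training error cannot be used to choose the depth.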



In [34]:
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor  

def get_mae(max_leaf_nodes, predictors_train, predictors_val, targ_train, targ_val):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(predictors_train, targ_train)
    preds_val = model.predict(predictors_val)
    mae = mean_absolute_error(targ_val, preds_val)
    return mae

# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))



Max leaf nodes: 5  		 Mean Absolute Error:  35190
Max leaf nodes: 50  		 Mean Absolute Error:  27825
Max leaf nodes: 500  		 Mean Absolute Error:  32662
Max leaf nodes: 5000  		 Mean Absolute Error:  33382


## Part 7: Random Forests <a class="anchor" id="part7"></a>

## Part 8: Submitting from Kernel <a class="anchor" id="part8"></a>