## Machine Learning Course (part 1)

- This document summarizes the introduction to machine learning from the [Kaggle machine learning course](https://www.kaggle.com/learn/machine-learning).
- You can download the data files from the [Kaggle competition page](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data). You need to accept the rules for this competition before you can download any data through the [Kaggle API](https://github.com/Kaggle/kaggle-api).

In [15]:
# Download dataset
!kaggle competitions download -c house-prices-advanced-regression-techniques --path ./data_files --file train.csv

Downloading train.csv to ./data_files
100%|████████████████████████████████████████| 450k/450k [00:00<00:00, 1.21MB/s]



In [18]:
# load dataset
import pandas as pd

home_data = pd.read_csv('./data_files/train.csv')
home_data.head()

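| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 12 | 2008 | WD | Normal | 250000 |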


### Step 1 : Specify the prediction target

In [19]:
y = home_data.SalePrice

### Step 2 : Create X holding the predictive features

In [21]:
feature_names = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
x = home_data[feature_names]

### Step 3 : Specify and fit model

In [26]:
# Install the scikit-learn library if necessary.
# Note: the `sklearn` name on PyPI is a deprecated alias; prefer `pip install scikit-learn`.
!pip install sklearn

Collecting sklearn
  Downloading https://files.pythonhosted.org/packages/1e/7a/dbb3be0ce9bd5c8b7e3d87328e79063f8b263b2b1bfa4774cb1147bfcd3f/sklearn-0.0.tar.gz
Collecting scikit-learn (from sklearn)
  Downloading https://files.pythonhosted.org/packages/28/1d/9fd027fde8a23fa8e3ecdc00cec891cea7bb387ac9d3f77843925f7435b7/scikit_learn-0.20.0-cp37-cp37m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (7.7MB)
    100% |████████████████████████████████| 7.8MB 3.4MB/s ta 0:00:011
Collecting scipy>=0.13.3 (from scikit-learn->sklearn)
  Downloading https://files.pythonhosted.org/packages/4c/4a/440cc9703938bbc86636ff6b9e17810f3d0f06e9b41891c5433dc4cd9091/scipy-1.1.0-cp37-cp37m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (16.7MB)
    100% |████████████████████████████████| 16.7MB 1.4MB/s ta 0:00:011
Building wheels for collected packages: sklearn
  Running setup.py bdist_w

In [29]:
from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor(random_state=1)
model.fit(x,y)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=1, splitter='best')

> Many machine learning models allow some randomness in model training. Specifying a number for `random_state` ensures you get the same results in each run. Model quality won't depend meaningfully on exactly what value you choose.
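
As a small illustration of this note, the sketch below fits the same kind of model twice with the same `random_state` and checks that the predictions agree. The toy arrays and the helper name `fit_and_predict` are assumptions made up for this example; they are not part of the course or the housing data.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up data (an assumption for illustration): 50 homes, 3 numeric features.
rng = np.random.RandomState(0)
toy_x = rng.rand(50, 3)
toy_y = 100000 + 50000 * toy_x[:, 0] + 10000 * rng.rand(50)


def fit_and_predict(seed):
    """Fit a small decision tree with the given random_state and predict on the toy data."""
    tree = DecisionTreeRegressor(random_state=seed, max_leaf_nodes=8)
    tree.fit(toy_x, toy_y)
    return tree.predict(toy_x)


first_run = fit_and_predict(seed=1)
second_run = fit_and_predict(seed=1)

# Same seed and same data: the two runs give identical predictions.
print(np.array_equal(first_run, second_run))
```

The particular seed value is arbitrary; what matters for reproducibility is that it stays fixed between runs.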

### Step 4 : Make predictions

In [30]:
model.predict(x)

array([208500., 181500., 223500., ..., 266500., 142125., 147500.])

In [48]:
print('First in-sample predictions : ', model.predict(x.head()))
print('Actual target values for those homes:', y.head().tolist())

First in-sample predictions :  [208500. 181500. 223500. 140000. 250000.]
Actual target values for those homes: [208500, 181500, 223500, 140000, 250000]


> Isn't this too good to be true? The model is predicting prices for the same homes it was trained on.

### Model validation
There are many metrics for summarizing model quality, but we'll start with one called **Mean Absolute Error** (MAE).
The prediction error for each house is:
```
error = actual - predicted
```
With the MAE metric, we take the absolute value of each error, which makes every error non-negative. We then take the average of those absolute errors. In plain English, it can be stated as:
> On average, our predictions are off by about X.
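
To make the definition concrete, here is a minimal hand computation of MAE in plain Python. The four prices are made-up numbers chosen for illustration, not values from the housing data.

```python
# Made-up prices in dollars (an assumption for illustration).
actual = [200000, 150000, 320000, 275000]
predicted = [210000, 140000, 300000, 275000]

# Error for each house: actual - predicted (can be negative).
errors = [a - p for a, p in zip(actual, predicted)]

# Take absolute values so errors cannot cancel out, then average them.
absolute_errors = [abs(e) for e in errors]
mae = sum(absolute_errors) / len(absolute_errors)

print(errors)  # [-10000, 10000, 20000, 0]
print(mae)     # 10000.0
```

Without the absolute value, the first two errors would cancel each other and the average would understate how far off the predictions are.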



### Step 5. Validate your model

In [32]:
from sklearn.metrics import mean_absolute_error

predicted_home_prices = model.predict(x)
mean_absolute_error(y, predicted_home_prices)

62.35433789954339

### The problem with 'in-sample' scores
The measure we just computed can be called an 'in-sample' score. We used a single 'sample' of houses for both building the model and evaluating it. 

Imagine that in the sample of data you used to build the model, all homes with green doors were very expensive (normally, door color is unrelated to home prices). The model's job is to find patterns that predict home prices, so it will see this pattern and always predict high prices for homes with green doors.

Since this pattern was derived from the training data, the model will appear accurate on the training data.

But if this pattern doesn't hold when the model sees new data, the model will be very inaccurate when used in practice.

Since a model's practical value comes from making predictions on new data, we measure performance on data that wasn't used to build the model. The most straightforward way to do this is to exclude some data from the model-building process, and then use that data to test the model's accuracy on data it hasn't seen before. This held-out data is called **validation data**.
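
The green-door scenario can be sketched with a toy model that only knows the average price per door color. Everything here (the homes, the prices, and the helper names `fit_door_color_model` and `door_model_mae`) is hypothetical and exists only to show how an in-sample score can look much better than a score on held-out data.

```python
# Made-up homes (an assumption for illustration): (door_color, price).
training_homes = [("green", 500000), ("green", 520000), ("red", 200000), ("blue", 210000)]
validation_homes = [("green", 190000), ("red", 205000), ("blue", 215000)]


def fit_door_color_model(homes):
    """'Learn' the average price per door color from the given sample."""
    prices_by_color = {}
    for color, price in homes:
        prices_by_color.setdefault(color, []).append(price)
    return {color: sum(prices) / len(prices) for color, prices in prices_by_color.items()}


def door_model_mae(door_model, homes):
    """Mean absolute error of the door-color model on the given homes."""
    return sum(abs(price - door_model[color]) for color, price in homes) / len(homes)


door_model = fit_door_color_model(training_homes)
in_sample_mae = door_model_mae(door_model, training_homes)
validation_mae = door_model_mae(door_model, validation_homes)

print(in_sample_mae)   # 5000.0
print(validation_mae)  # 110000.0
```

The model looks accurate on the homes it was built from, but the green-door pattern fails on the held-out homes, which is exactly what the validation score reveals.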

### Step 6. Split your data

In [36]:
from sklearn.model_selection import train_test_split

train_x, val_x, train_y, val_y = train_test_split(x, y, random_state=0)
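
For intuition about what a hold-out split does, the following is a rough plain-Python sketch: shuffle the row indices with a fixed seed, then set a fraction aside for validation. The helper `simple_train_val_split` is hypothetical and is not how scikit-learn implements `train_test_split`; the 0.25 fraction mirrors that function's default of holding out 25% of the rows when no size is given.

```python
import random


def simple_train_val_split(rows, val_fraction=0.25, seed=0):
    """Shuffle row indices with a fixed seed, then hold out a fraction for validation."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_val = int(round(len(rows) * val_fraction))
    val_indices = set(indices[:n_val])
    train_rows = [row for i, row in enumerate(rows) if i not in val_indices]
    val_rows = [row for i, row in enumerate(rows) if i in val_indices]
    return train_rows, val_rows


rows = list(range(20))  # stand-in for 20 homes
train_rows, val_rows = simple_train_val_split(rows)

print(len(train_rows), len(val_rows))  # 15 5
```

Fixing the seed plays the same role as `random_state=0` in the cell above: the same rows end up in the validation set on every run, so scores stay comparable.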

### Step 7. Specify and fit the model

In [50]:
model = DecisionTreeRegressor(random_state=1)
model.fit(train_x, train_y)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=1, splitter='best')

### Step 8. Validate your model

In [52]:
val_predictions = model.predict(val_x)
mean_absolute_error(val_y, val_predictions)

32966.449315068494

### Step 9. Compare different tree sizes to optimize your model

In [72]:
# Helper to compare MAE scores for different values of max_leaf_nodes
def get_mae(max_leaf_nodes, train_x, val_x, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_x, train_y)
    prediction_val = model.predict(val_x)
    mae = mean_absolute_error(val_y, prediction_val)
    return mae

In [75]:
candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]
scores = {leaf_size : get_mae(leaf_size, train_x, val_x, train_y, val_y) for leaf_size in candidate_max_leaf_nodes}
scores

{5: 35190.33670788684,
 25: 28501.887126575195,
 50: 27825.888386265695,
 100: 28653.10992820276,
 250: 31738.366204184345,
 500: 32662.00407479887}

In [85]:
print('value of max_leaf_nodes that gives the most accurate model on your data is : {}'.format(min(scores, key=scores.get)))

value of max_leaf_nodes that gives the most accurate model on your data is : 50


### Step 10. Random forest model

In [84]:
from sklearn.ensemble import RandomForestRegressor

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_x, train_y)
rf_predictions = forest_model.predict(val_x)
print(mean_absolute_error(val_y, rf_predictions))

24346.620065231575


