### Intro to ML - Kaggle [Completed]
## Study Notes - Hanaan R. Shafi

In [34]:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

### Basic Info

Capturing patterns from data is called fitting or training the model. The data used to fit the model is called the training data.

In a decision tree, the point at the bottom where we make the prediction is called a leaf.

### 1. Explore your data using Pandas

In [3]:
melb_data = pd.read_csv("melb_data.csv")
melb_data.describe()

,Rooms,Price,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
count,13580.0,13580.0,13580.0,13580.0,13580.0,13580.0,13518.0,13580.0,7130.0,8205.0,13580.0,13580.0,13580.0
mean,2.937997,1075684.0,10.137776,3105.301915,2.914728,1.534242,1.610075,558.416127,151.96765,1964.684217,-37.809203,144.995216,7454.417378
std,0.955748,639310.7,5.868725,90.676964,0.965921,0.691712,0.962634,3990.669241,541.014538,37.273762,0.07926,0.103916,4378.581772
min,1.0,85000.0,0.0,3000.0,0.0,0.0,0.0,0.0,0.0,1196.0,-38.18255,144.43181,249.0
25%,2.0,650000.0,6.1,3044.0,2.0,1.0,1.0,177.0,93.0,1940.0,-37.856822,144.9296,4380.0
50%,3.0,903000.0,9.2,3084.0,3.0,1.0,2.0,440.0,126.0,1970.0,-37.802355,145.0001,6555.0
75%,3.0,1330000.0,13.0,3148.0,3.0,2.0,2.0,651.0,174.0,1999.0,-37.7564,145.058305,10331.0
max,10.0,9000000.0,48.1,3977.0,20.0,8.0,10.0,433014.0,44515.0,2018.0,-37.40853,145.52635,21650.0
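The `count` row above is not the same for every column (for example, `BuildingArea` has 7130 entries against 13580 for `Rooms`), which points to missing values. Here is a minimal sketch of how one might count them. It uses a small made-up DataFrame called `toy` instead of `melb_data.csv`, so the numbers are purely illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for melb_data; the values are made up.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "BuildingArea": [79.0, np.nan, 142.0, np.nan],
    "YearBuilt": [1900.0, 1900.0, np.nan, 2014.0],
})

# describe() counts only non-missing values, so a count below len(toy) signals gaps.
non_missing = toy.describe().loc["count"]

# isnull().sum() reports the number of missing values per column directly.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

Running the same `isnull().sum()` call on `melb_data` should show which columns are responsible for the lower counts.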


### 2. Selecting data for modelling

In [5]:
melb_data.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

In [6]:
melb_data = melb_data.dropna(axis=0)
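`dropna(axis=0)` removes every row that has at least one missing value. A small sketch on made-up data (the `toy` DataFrame is an assumption, not part of the dataset) shows the effect:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; rows 1 and 2 each contain one missing value.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "BuildingArea": [79.0, np.nan, 142.0, 210.0],
    "YearBuilt": [1900.0, 1900.0, np.nan, 2014.0],
})

# axis=0 drops rows (not columns) that contain any NaN.
complete_rows = toy.dropna(axis=0)
print(len(toy), "->", len(complete_rows))
print(list(complete_rows.index))
```

Note that the surviving rows keep their original index labels, which is why the index of `X.head()` further down skips some numbers.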

#### Select your prediction target and features

In [9]:
y = melb_data.Price
feat = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = melb_data[feat]

In [12]:
X.describe()

,Rooms,Bathroom,Landsize,Lattitude,Longtitude
count,6196.0,6196.0,6196.0,6196.0,6196.0
mean,2.931407,1.57634,471.00694,-37.807904,144.990201
std,0.971079,0.711362,897.449881,0.07585,0.099165
min,1.0,1.0,0.0,-38.16492,144.54237
25%,2.0,1.0,152.0,-37.855438,144.926198
50%,3.0,1.0,373.0,-37.80225,144.9958
75%,4.0,2.0,628.0,-37.7582,145.0527
max,8.0,8.0,37000.0,-37.45709,145.52635


In [13]:
X.head()

,Rooms,Bathroom,Landsize,Lattitude,Longtitude
1,2,1.0,156.0,-37.8079,144.9934
2,3,2.0,134.0,-37.8093,144.9944
4,4,1.0,120.0,-37.8072,144.9941
6,3,2.0,245.0,-37.8024,144.9993
7,2,1.0,256.0,-37.806,144.9954


### 3. Building your model

The steps to building and using a model are:
1. Define: What type of model will it be? A decision tree, for example. Specify some parameters of the model.
2. Fit: Capture patterns from the given data.
3. Predict: Use the fitted model to make predictions.
4. Evaluate: Determine how accurate the model's predictions are.
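The four steps above can be sketched end to end on a tiny made-up dataset. The names `toy_X`, `toy_y` and `toy_model` are hypothetical and the prices are invented for illustration.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data; the numbers are made up.
toy_X = pd.DataFrame({"Rooms": [1, 2, 3, 4], "Bathroom": [1, 1, 2, 2]})
toy_y = pd.Series([300000, 450000, 700000, 900000])

# 1. Define: choose the model type and its parameters.
toy_model = DecisionTreeRegressor(random_state=1)

# 2. Fit: capture patterns from the data.
toy_model.fit(toy_X, toy_y)

# 3. Predict: here on the same rows the model was fitted on.
toy_preds = toy_model.predict(toy_X)

# 4. Evaluate: compare predictions with the known targets.
toy_mae = mean_absolute_error(toy_y, toy_preds)
print(toy_preds, toy_mae)
```

Because this sketch evaluates on the same rows it was fitted on, the error it reports is optimistic and says little about performance on new houses.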


Many machine learning models allow some randomness in model training. Specifying a number for random_state ensures you get the same results in each run. This is considered good practice. You can use any number, and model quality won't depend meaningfully on exactly what value you choose.
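As a rough check of the reproducibility claim, the sketch below fits two trees with the same random_state on the same synthetic data and compares their predictions on unseen points. The data and variable names are assumptions made up for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, generated only for this illustration.
rng = np.random.default_rng(0)
toy_X = rng.uniform(0, 10, size=(50, 3))
toy_y = 2 * toy_X[:, 0] + rng.normal(0, 1, size=50)
new_X = rng.uniform(0, 10, size=(20, 3))

# Two separate models, same data, same random_state.
model_a = DecisionTreeRegressor(random_state=1).fit(toy_X, toy_y)
model_b = DecisionTreeRegressor(random_state=1).fit(toy_X, toy_y)

# With identical seeds the two trees should give identical predictions.
same = np.array_equal(model_a.predict(new_X), model_b.predict(new_X))
print(same)
```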

In [16]:
melb_model = DecisionTreeRegressor(random_state=1)  # Specify a number for random_state to ensure the same results each run
melb_model.fit(X, y)

In [17]:
print("Making predictions for the foll. 5 houses")
print(X.head())
print("The predictions are:")
print(melb_model.predict(X.head()))

Making predictions for the foll. 5 houses
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954
The predictions are:
[1035000. 1465000. 1600000. 1876000. 1636000.]


### 4. Validating your model

In most applications, the relevant measure of model quality is predictive accuracy. Will the model's predictions be close to what actually happens? 

Let's start with mean absolute error or MAE. 

error = actual - predicted

With MAE, we take the absolute value of the error for each data point and average them all. 
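To make the definition concrete, here is a small sketch that computes MAE by hand and compares it with scikit-learn's `mean_absolute_error`. The three prices are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up actual and predicted values.
actual = np.array([100.0, 200.0, 300.0])
predicted = np.array([110.0, 190.0, 330.0])

# error = actual - predicted, one value per data point.
error = actual - predicted

# Take the absolute value of each error, then average them.
manual_mae = np.mean(np.abs(error))

# The library function should agree with the manual calculation.
sklearn_mae = mean_absolute_error(actual, predicted)
print(manual_mae, sklearn_mae)
```

The absolute value matters: without it, the errors of -10, 10 and -30 would partly cancel out and understate how far off the predictions are.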

In [22]:
predicted_prices = melb_model.predict(X)
mean_absolute_error(y, predicted_prices)

1115.7467183128902

This is called an in-sample score. A single sample of house data was used both to train the model and to validate it. But we should actually test the model on new data it hasn't seen before, called validation data. The way to do this is to split the original data set into two pieces: training data to fit the model and validation data to test the model.

### 5. Train-Test Split

In [23]:
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)  # Supplying the same random_state ensures we get the same split every time
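When neither `test_size` nor `train_size` is given, `train_test_split` holds out 25% of the rows for validation. The sketch below checks this on a made-up array of 100 rows and also confirms that the same random_state reproduces the same split; the array contents are arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Arbitrary toy data: 100 rows, 2 feature columns.
toy_X = np.arange(200).reshape(100, 2)
toy_y = np.arange(100)

# Default split, seeded for reproducibility.
tr_X, va_X, tr_y, va_y = train_test_split(toy_X, toy_y, random_state=0)
print(len(tr_X), len(va_X))

# Repeating the call with the same random_state gives the same validation rows.
tr_X2, va_X2, tr_y2, va_y2 = train_test_split(toy_X, toy_y, random_state=0)
print(np.array_equal(va_X, va_X2))
```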

In [25]:
melb_model = DecisionTreeRegressor(random_state=0)
melb_model.fit(train_X, train_y)

In [27]:
pred_y = melb_model.predict(val_X)
mean_absolute_error(val_y, pred_y)

271598.0400258231

Wow! The out-of-sample MAE (about 271,598) is massive compared to the in-sample MAE (about 1,116).
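The same gap can be reproduced on synthetic data. This is only a sketch under made-up assumptions (a noisy linear target and an unrestricted tree), not a rerun of the housing model.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a linear signal plus noise, invented for this example.
rng = np.random.default_rng(0)
toy_X = rng.uniform(0, 10, size=(200, 2))
toy_y = 3 * toy_X[:, 0] + rng.normal(0, 1, size=200)

tr_X, va_X, tr_y, va_y = train_test_split(toy_X, toy_y, random_state=0)
toy_model = DecisionTreeRegressor(random_state=0).fit(tr_X, tr_y)

# Score on the rows the tree was fitted on (in-sample).
in_sample_mae = mean_absolute_error(tr_y, toy_model.predict(tr_X))

# Score on rows the tree has never seen (out-of-sample).
validation_mae = mean_absolute_error(va_y, toy_model.predict(va_X))
print(in_sample_mae, validation_mae)
```

An unrestricted tree can memorise the noise in its training rows, so the in-sample score looks far better than the score on held-out rows.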

### 6. Underfitting and Overfitting

scikit-learn DecisionTreeRegressor documentation: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html

There are many parameters. The most important options determine the tree's depth, which is how many splits it makes before arriving at a prediction. 

In overfitting, the houses are divided amongst many leaves with very few houses in each. Leaves with very few houses will make predictions that are very close to the actual home values in the training data, but they will perform badly on new data. This is OVERFITTING, where the model matches the training data almost perfectly but does poorly on validation and new data.

On the other hand, if we have only a few leaves, each group still has a wide variety of houses. The resulting predictions will be far off for most houses, even in the training data, because the model fails to capture patterns in the data. This is UNDERFITTING.

We want to find the sweet spot between overfitting and underfitting, so we have to control the tree depth. Some routes through the tree may be deeper than others.
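To see how tree size changes in practice, the sketch below fits an unrestricted tree and a tree capped with the `max_leaf_nodes` parameter on synthetic data, then inspects them with `get_n_leaves()` and `get_depth()`. The data is made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, invented for this example.
rng = np.random.default_rng(0)
toy_X = rng.uniform(0, 10, size=(200, 2))
toy_y = 3 * toy_X[:, 0] + rng.normal(0, 1, size=200)

# No limit: the tree keeps splitting until it cannot split further.
full_tree = DecisionTreeRegressor(random_state=0).fit(toy_X, toy_y)

# Capped: at most 5 leaves, so each leaf groups many rows together.
small_tree = DecisionTreeRegressor(max_leaf_nodes=5, random_state=0).fit(toy_X, toy_y)

print("full :", full_tree.get_n_leaves(), "leaves, depth", full_tree.get_depth())
print("small:", small_tree.get_n_leaves(), "leaves, depth", small_tree.get_depth())
```

The unrestricted tree sits at the overfitting end of the range and the heavily capped tree at the underfitting end.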

max_leaf_nodes is a useful parameter for this. The more leaves we allow the model to make, the more we move from underfitting toward overfitting. We can compare the MAE scores for different max_leaf_nodes values. Let's make a function for that.

In [31]:
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Fit a tree capped at max_leaf_nodes and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    pred_y = model.predict(val_X)
    mae = mean_absolute_error(val_y, pred_y)
    return mae

In [33]:
for x in [5, 50, 500, 5000]:
    mae = get_mae(x, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(x, mae))

Max leaf nodes: 5  		 Mean Absolute Error:  385696
Max leaf nodes: 50  		 Mean Absolute Error:  279794
Max leaf nodes: 500  		 Mean Absolute Error:  261718
Max leaf nodes: 5000  		 Mean Absolute Error:  271320


So of the values tried, 500 is the best choice for max_leaf_nodes because it gives the lowest validation MAE (261,718).
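Rather than reading the best value off the printed table, one could pick it programmatically. The sketch below does this on synthetic data; `toy_get_mae` is a hypothetical stand-in with the same shape as the helper above, and the candidate values are simply reused from the loop.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def toy_get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Fit a capped tree and return its validation MAE (illustrative helper)."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    return mean_absolute_error(val_y, model.predict(val_X))

# Synthetic data, invented for this example.
rng = np.random.default_rng(0)
toy_X = rng.uniform(0, 10, size=(200, 2))
toy_y = 3 * toy_X[:, 0] + rng.normal(0, 1, size=200)
tr_X, va_X, tr_y, va_y = train_test_split(toy_X, toy_y, random_state=0)

# Map each candidate size to its validation MAE, then take the smallest.
candidates = [5, 50, 500, 5000]
scores = {n: toy_get_mae(n, tr_X, va_X, tr_y, va_y) for n in candidates}
best_size = min(scores, key=scores.get)
print(scores)
print("Best max_leaf_nodes:", best_size)
```

Once a size is chosen, a common next step is to refit the final model on all of the data with that setting, since the validation rows are no longer needed for tuning.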

### 7. Random Forests! 

The random forest uses many trees, and it makes a prediction by averaging the predictions of each component tree. It generally has much better predictive accuracy than a single decision tree and works well with default parameters.

In [37]:
forest_model = RandomForestRegressor(random_state=0)
forest_model.fit(train_X, train_y)
pred_y = forest_model.predict(val_X)
mae = mean_absolute_error(val_y, pred_y)
mae

206868.39967967046
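The averaging behind a random forest can be checked directly. The sketch below, on synthetic data with an arbitrary choice of 10 trees, averages the predictions of the individual trees in `estimators_` and compares the result with the forest's own prediction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, invented for this example.
rng = np.random.default_rng(0)
toy_X = rng.uniform(0, 10, size=(200, 2))
toy_y = 3 * toy_X[:, 0] + rng.normal(0, 1, size=200)
new_X = rng.uniform(0, 10, size=(10, 2))

# n_estimators=10 is an arbitrary small value to keep the example quick.
forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(toy_X, toy_y)

# Predict with each component tree, then average across trees.
per_tree = np.array([tree.predict(new_X) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# The forest's prediction should match the manual average.
matches = np.allclose(manual_average, forest.predict(new_X))
print(matches)
```

Each tree is trained on a different bootstrap sample of the rows, so the individual trees disagree and averaging smooths out their individual errors.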