## Kaggle Intro To ML

We'll start with a model called the Decision Tree. There are fancier models that give more accurate predictions, but decision trees are easy to understand, and they are the basic building block for some of the best models in data science.

A decision tree consists of 3 types of nodes:

1. Decision Nodes: represented by squares
2. Chance Nodes: typically represented by circles
3. End Nodes: typically represented by triangles

The model's predicted value sits at the bottom of the tree, in one of the end nodes (also called leaves).

<img src="img/decisionTree.png">
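
To make the idea concrete, a decision tree can be read as nested if/else questions that end in a prediction. The sketch below is hand-written: the function name, split thresholds, and leaf prices are all invented for illustration, whereas a real model learns them from data.

```python
def predict_price(rooms: int, landsize: float) -> float:
    """Walk a tiny hand-written decision tree and return the leaf's price.

    The thresholds and prices are made up for illustration only.
    """
    if rooms <= 2:                # first split: smaller vs. larger homes
        if landsize <= 150:       # second split on the smaller-home branch
            return 700_000.0      # leaf
        return 900_000.0          # leaf
    if landsize <= 500:           # second split on the larger-home branch
        return 1_200_000.0        # leaf
    return 1_600_000.0            # leaf


print(predict_price(rooms=2, landsize=120))
print(predict_price(rooms=4, landsize=650))
```

Each call follows one path from the top of the tree down to a single leaf, and the value stored in that leaf is the prediction.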

___

### Using Pandas to get familiar with your data

The first step in any machine learning project is to familiarize yourself with the data. We use the pandas library as a tool for exploring and manipulating data.

In [2]:
import pandas as pd

The most important part of the pandas library is the DataFrame, which you can think of as a table in SQL or a sheet in Excel.

As an example, we'll look at home price data from Melbourne, Australia. We can load and explore the data with the following commands:

In [21]:
# save filepath to variable for easier access
melbourne_file_path = 'data/melb_data.csv'
# read the data and store data in DataFrame titled melbourne_data
melbourne_data = pd.read_csv(melbourne_file_path) 
# print a summary of the data in Melbourne data
melbourne_data.describe()

| | Rooms | Price | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13518.0 | 13580.0 | 7130.0 | 8205.0 | 13580.0 | 13580.0 | 13580.0 |
| mean | 2.937997 | 1075684.0 | 10.137776 | 3105.301915 | 2.914728 | 1.534242 | 1.610075 | 558.416127 | 151.96765 | 1964.684217 | -37.809203 | 144.995216 | 7454.417378 |
| std | 0.955748 | 639310.7 | 5.868725 | 90.676964 | 0.965921 | 0.691712 | 0.962634 | 3990.669241 | 541.014538 | 37.273762 | 0.07926 | 0.103916 | 4378.581772 |
| min | 1.0 | 85000.0 | 0.0 | 3000.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1196.0 | -38.18255 | 144.43181 | 249.0 |
| 25% | 2.0 | 650000.0 | 6.1 | 3044.0 | 2.0 | 1.0 | 1.0 | 177.0 | 93.0 | 1940.0 | -37.856822 | 144.9296 | 4380.0 |
| 50% | 3.0 | 903000.0 | 9.2 | 3084.0 | 3.0 | 1.0 | 2.0 | 440.0 | 126.0 | 1970.0 | -37.802355 | 145.0001 | 6555.0 |
| 75% | 3.0 | 1330000.0 | 13.0 | 3148.0 | 3.0 | 2.0 | 2.0 | 651.0 | 174.0 | 1999.0 | -37.7564 | 145.058305 | 10331.0 |
| max | 10.0 | 9000000.0 | 48.1 | 3977.0 | 20.0 | 8.0 | 10.0 | 433014.0 | 44515.0 | 2018.0 | -37.40853 | 145.52635 | 21650.0 |


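The rows of this summary mean the following:

* Count: how many rows have non-missing values. For example, BuildingArea has only 7,130 non-missing values out of 13,580 rows.
* Mean: the average.
* Std: the standard deviation, which measures how spread out the values are.

To interpret the min, 25%, 50%, 75%, and max values, imagine each column sorted in ascending order:

* Min: the smallest value.
* 25% (25th percentile): 25% of the values are smaller than this value.
* 50% (50th percentile, the median): 50% of the values are smaller than this value.
* 75% (75th percentile): 75% of the values are smaller than this value.
* Max: the largest value.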
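
These summary statistics can be reproduced by hand on a small column. The sketch below uses only Python's standard library on made-up values with one missing entry. The "inclusive" quantile method interpolates linearly between sorted values; that this matches the pandas default exactly is an assumption here, not something the notebook confirms.

```python
import statistics

# Made-up column with one missing value (None), for illustration only.
landsize = [202.0, 156.0, None, 134.0, 94.0, 120.0]

# Drop the missing entry and sort in ascending order.
present = sorted(v for v in landsize if v is not None)

count = len(present)                  # rows with non-missing values
mean = statistics.mean(present)       # the average
std = statistics.stdev(present)       # sample standard deviation
# Quartiles: the 25th, 50th and 75th percentiles of the sorted values.
q25, q50, q75 = statistics.quantiles(present, n=4, method="inclusive")

print("count:", count)
print("mean:", mean)
print("std:", std)
print("min, 25%, 50%, 75%, max:", min(present), q25, q50, q75, max(present))
```

The count is 5 rather than 6 because the missing entry is not counted, which is the same reason the count row differs between columns in a summary table.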

### Selecting Data for Modeling

Often, large datasets have too many variables/features/columns to wrap your head around. We need a means of selecting features to be used in our model. We'll start by picking a few variables using our intuition. Later we'll use statistical techniques to automatically prioritize variables.


In [5]:
melbourne_data.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

### Selecting the Prediction Target

You can pull out a variable with dot-notation. This single column is stored in a Series, which is like a DataFrame with a single column.

We'll use the dot notation to select the column we want to predict, which is called the prediction target. By convention, the prediction target is called y.

In [8]:
y = melbourne_data.Price
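
Dot notation and bracket notation select the same column. Here is a minimal sketch on a toy DataFrame; the variable names and values are made up, and only the column names mirror the ones used in these notes.

```python
import pandas as pd

# Toy data for illustration; not the Melbourne dataset.
homes = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Price": [1_035_000.0, 1_465_000.0, 1_600_000.0],
})

y_dot = homes.Price          # dot notation
y_bracket = homes["Price"]   # bracket notation, equivalent here

print(type(y_dot).__name__)       # the selected column is a Series
print(y_dot.equals(y_bracket))    # both notations return the same data
```

Bracket notation is also the safer choice when a column name contains spaces or clashes with an existing DataFrame attribute.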

### Choosing "Features"

The features are the columns other than the prediction target. We use the features in our model to predict the target, so the idea here is to select the columns that will best predict the price.

Here, we'll build a model with only a few features. Later we'll see how to iterate and compare models built with different features.

By convention, this data is called X.

In [29]:
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]
X.describe()

| | Rooms | Bathroom | Landsize | Lattitude | Longtitude |
|---|---|---|---|---|---|
| count | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13580.0 |
| mean | 2.937997 | 1.534242 | 558.416127 | -37.809203 | 144.995216 |
| std | 0.955748 | 0.691712 | 3990.669241 | 0.07926 | 0.103916 |
| min | 1.0 | 0.0 | 0.0 | -38.18255 | 144.43181 |
| 25% | 2.0 | 1.0 | 177.0 | -37.856822 | 144.9296 |
| 50% | 3.0 | 1.0 | 440.0 | -37.802355 | 145.0001 |
| 75% | 3.0 | 2.0 | 651.0 | -37.7564 | 145.058305 |
| max | 10.0 | 8.0 | 433014.0 | -37.40853 | 145.52635 |


In [16]:
X.head()

| | Rooms | Bathroom | Landsize | Lattitude | Longtitude |
|---|---|---|---|---|---|
| 0 | 2 | 1.0 | 202.0 | -37.7996 | 144.9984 |
| 1 | 2 | 1.0 | 156.0 | -37.8079 | 144.9934 |
| 2 | 3 | 2.0 | 134.0 | -37.8093 | 144.9944 |
| 3 | 3 | 2.0 | 94.0 | -37.7969 | 144.9969 |
| 4 | 4 | 1.0 | 120.0 | -37.8072 | 144.9941 |


## Building Your Model

We'll use the scikit-learn library to create our models.

The steps to building and using a model are:

* ***Define***: decide what type of model it will be.
* ***Fit***: capture patterns from the provided data. This is the heart of modeling.
* ***Predict***: use the fitted model to predict the target.
* ***Evaluate***: determine how accurate the model's predictions are.

Here is an example of defining a decision tree model with scikit-learn and fitting it with the features and target variable.

In [19]:
from sklearn.tree import DecisionTreeRegressor

# Define model. Specify a number for random_state to ensure same results each run
melbourne_model = DecisionTreeRegressor(random_state=1)

# Fit model
melbourne_model.fit(X, y)

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=1, splitter='best')

Many machine learning models allow some randomness in model training. Specifying a number for random_state ensures you get the same results in each run. This is considered good practice.
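
The effect of fixing a seed can be seen with Python's own random module. This is only an analogy for what random_state does inside scikit-learn, not its actual implementation, and the helper name is made up.

```python
import random


def draw_numbers(seed: int) -> list:
    """Draw three pseudo-random numbers from a generator with a fixed seed."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(3)]


# The same seed reproduces the same draws; a different seed does not.
print(draw_numbers(1) == draw_numbers(1))
print(draw_numbers(1) == draw_numbers(2))
```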

We now have a fitted model that we can use to make predictions.

In practice, you'll want to make predictions for new houses coming on the market rather than the houses we already have prices for. But we'll make predictions for the first few rows of the training data to see how the predict function works.

In [24]:
print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are")
print(melbourne_model.predict(X.head()))

Making predictions for the following 5 houses:
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
0      2       1.0     202.0   -37.7996    144.9984
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
3      3       2.0      94.0   -37.7969    144.9969
4      4       1.0     120.0   -37.8072    144.9941
The predictions are
[1480000. 1035000. 1465000.  850000. 1600000.]


We've built a model... but is it any good? Next, we'll use model validation to measure the quality of our model. Measuring model quality is the key to iteratively improving models.

## What is Model Validation

You will want to evaluate every model you build. In most applications, the relevant measure of model quality is predictive accuracy: will the model's predictions be close to what actually happens?

Many people make a mistake when measuring predictive accuracy. They make predictions with their training data and compare those to the target values in the training data.

### The problem with "In-Sample" Scores

If we use the training data both to build the model and to evaluate it, the score can be misleading, because patterns found in the training data might not hold in new, unseen data.

For example, say door color is unrelated to housing price, but the houses with green doors in the training data happened to be very expensive. The model would pick up this pattern from the training data, so it would appear accurate when predicting values for the training data but be very inaccurate on new data.
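
The green-door scenario can be sketched with a toy "model" that predicts the average training price for each door color. All of the data and function names below are invented for illustration.

```python
from statistics import mean

# Invented data: (door color, sale price). In the training sample the
# green-door houses happen to be expensive; in the new sample they are not.
train = [("green", 1_500_000), ("green", 1_400_000),
         ("red", 800_000), ("red", 900_000)]
new = [("green", 850_000), ("green", 800_000),
       ("red", 1_450_000), ("red", 1_400_000)]


def fit_by_color(rows):
    """'Model' that stores the mean training price for each door color."""
    prices = {}
    for color, price in rows:
        prices.setdefault(color, []).append(price)
    return {color: mean(values) for color, values in prices.items()}


def average_error(model, rows):
    """Mean absolute difference between actual and predicted prices."""
    return mean(abs(price - model[color]) for color, price in rows)


model = fit_by_color(train)
in_sample = average_error(model, train)      # scored on data it was fit to
out_of_sample = average_error(model, new)    # scored on data it never saw
print(in_sample, out_of_sample)
```

The in-sample score looks good only because the model is being graded on the same coincidence it learned from.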

### Mean Absolute Error (MAE)

There are many metrics for summarizing model quality. We'll start with Mean Absolute Error (MAE).

The prediction error for each house is:

```
error = actualPrice - predictedPrice
```
So if a house costs 125,000 and the model predicted 100,000, our error would be 25,000.

With MAE, we take the absolute value of each error, which converts every error to a positive number, and then average those absolute errors. With MAE, we can say "On average, the model's predictions are off by about X".
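
Computed by hand, MAE looks like this. The first house is the one from the example above; the other prices are invented.

```python
actual = [125_000, 300_000, 210_000]       # sale prices (partly invented)
predicted = [100_000, 320_000, 210_000]    # model outputs (partly invented)

# error = actual - predicted, one value per house
errors = [a - p for a, p in zip(actual, predicted)]

# Take absolute values so over- and under-predictions do not cancel out,
# then average them.
mae = sum(abs(e) for e in errors) / len(errors)

print(errors)
print(mae)
```

Without the absolute value, the errors of 25,000 and -20,000 would partly cancel and make the model look better than it is.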


In [23]:
from sklearn.metrics import mean_absolute_error

predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)

1125.1804614629357

## Coding it

The scikit-learn library has a function train_test_split to break up the data into 2 pieces. We'll use one section as training data to fit the model, and we'll use the other section as testing or validation data to calculate the mean_absolute_error.
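
Conceptually, such a split shuffles the rows and cuts them into two parts. The sketch below is a simplified stand-in written with the standard library, not scikit-learn's implementation; the function name is made up, and the 25% validation share is chosen to mirror what I understand to be the train_test_split default.

```python
import random


def simple_split(rows, validation_fraction=0.25, seed=0):
    """Shuffle row indices with a fixed seed, then cut them into two parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)          # reproducible shuffle
    n_val = round(len(rows) * validation_fraction)
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    return [rows[i] for i in train_idx], [rows[i] for i in val_idx]


rows = list(range(20))  # stand-in for 20 houses
train_rows, val_rows = simple_split(rows)
print(len(train_rows), len(val_rows))
```

Every row lands in exactly one of the two parts, so the model is never scored on a house it was fitted to.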

In [26]:
from sklearn.model_selection import train_test_split

# split data into training and validation data, for both features and target
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
# Define model
melbourne_model = DecisionTreeRegressor()
# Fit model
melbourne_model.fit(train_X, train_y)

# get predicted prices on validation data
val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))

247949.8380952381


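Our in-sample MAE is about 1,125, while our out-of-sample MAE is about 247,950. That gap is large: the mean sale price in the training data (printed in the next cell) is about 1.1 million, so the out-of-sample error is roughly a quarter of a typical home's price. There are many ways to improve this model, such as experimenting to find better features or trying different model types.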

In [28]:
print(train_y.mean())

1075481.2686303388


### Experimenting with Different Models

Now that we have a reliable way to measure model accuracy, we can experiment with alternative models and see which one gives the best predictions.


## Overfitting and Underfitting

Overfitting: capturing spurious patterns that won't recur in the future, leading to less accurate predictions

Underfitting: failing to capture the relevant patterns, again leading to less accurate predictions

Again we use ***validation/testing data*** to measure a candidate model's accuracy. This lets us try many candidate models and keep the best one.

In practice, it's not uncommon for a tree to have 10 splits between the top level (all houses) and a leaf. As the tree gets deeper, the dataset gets sliced up into leaves with fewer houses. If a tree has only 1 split, it divides the data into 2 groups. If each group is split again, we get 4 groups of houses. Splitting each of those again creates 8 groups. If we keep doubling the number of groups by adding more splits at each level, we'll have 2^10 = 1024 groups of houses, and therefore 1024 leaves, by the time we get to the 10th level.
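
The doubling is easy to check in code. The helper name below is made up, and the total of 13,580 houses is the row count reported by describe() earlier in these notes.

```python
def max_leaves(depth: int) -> int:
    """Largest number of leaves when every group is split at each level."""
    return 2 ** depth


total_houses = 13580  # row count from the describe() summary

for depth in (1, 2, 3, 10):
    leaves = max_leaves(depth)
    # Average houses per leaf if the rows were spread evenly.
    print(depth, leaves, round(total_houses / leaves, 1))
```

At depth 10, an even spread would leave only about 13 houses per leaf, which is why deep trees end up basing each prediction on very few examples.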

When we divide the houses amongst many leaves, we also have fewer houses in each leaf. Leaves with very few houses will make predictions that are quite close to those homes' actual values, but they may make very unreliable predictions for new data (because each prediction is based on only a few houses).

This is a phenomenon called overfitting, where a model matches the training data almost perfectly, but does poorly in validation and other new data. On the flip side, if we make our tree very shallow, it doesn't divide up the houses into very distinct groups.

At an extreme, if a tree divides houses into only 2 or 4 groups, each group still has a wide variety of houses. The resulting predictions may be far off for most houses, even in the training data (and they will be bad in validation too, for the same reason). When a model fails to capture important distinctions and patterns in the data, and so performs poorly even on the training data, that is called underfitting.

Since we care about accuracy on new data, which we estimate from our validation data, we want to find the sweet spot between underfitting and overfitting. Visually, we want the low point of the (red) validation curve in the figure below.

<img src="img/MAE.png">
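
One way to look for that sweet spot is to fit trees of different sizes and compare their validation MAE. The sketch below assumes the tree size is controlled through scikit-learn's max_leaf_nodes parameter and uses synthetic data rather than the Melbourne dataset; the helper name get_mae and the candidate sizes are choices made for this illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: a noisy curve, not the Melbourne dataset.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X[:, 0]) * 100 + rng.normal(0, 20, size=500)

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)


def get_mae(max_leaf_nodes):
    """Fit a tree capped at max_leaf_nodes leaves; return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    return mean_absolute_error(val_y, model.predict(val_X))


# Try very small (underfit) through very large (overfit) trees.
scores = {n: get_mae(n) for n in (2, 5, 25, 100, 375)}
best_size = min(scores, key=scores.get)

print(scores)
print("Best max_leaf_nodes:", best_size)
```

The candidate with the lowest validation MAE corresponds to the low point of the validation curve; the exact winner depends on the data and the candidates tried.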