# <center> Kaggle's Intro to Machine Learning </center>

## 1. Basic Data Exploration

The first step in any Machine Learning project is to get familiar with the data. In this course we will be using the `Pandas` library for this. Pandas is the primary tool data scientists use for exploring and manipulating data.

The most important part of the Pandas library is the `DataFrame`. A DataFrame holds the type of data you might think of as a table. This is similar to a sheet in Excel, or a table in a SQL database. Pandas has powerful methods for most things you'll want to do with this type of data.

First, we have to get the data from Kaggle. The dataset we will be using is the *Melbourne housing snapshot*. We can download it with Kaggle's API, as in the following code:

In [3]:
import kaggle
from kaggle.api.kaggle_api_extended import KaggleApi

In [4]:
api = KaggleApi()
api.authenticate()

In [5]:
api.dataset_download_file('dansbecker/melbourne-housing-snapshot',
                             file_name = 'melb_data.csv')

False

In this case, the dataset comes in a .zip file, so we will have to unzip it before loading it with Pandas.
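A minimal sketch of that unzipping step, using only Python's standard `zipfile` module. The archive name `melb_data.csv.zip` and the helper `unzip_dataset` are assumptions for illustration; check the actual file name in your working directory. To stay self-contained, the demo builds a tiny stand-in archive in a temporary folder instead of touching the real download.

```python
import tempfile
import zipfile
from pathlib import Path


def unzip_dataset(zip_path, dest_dir):
    """Extract every file in zip_path into dest_dir and return the file names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()


# Demo on a stand-in archive (hypothetical content, not the real dataset).
with tempfile.TemporaryDirectory() as tmp:
    zip_path = Path(tmp) / "melb_data.csv.zip"   # assumed archive name
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("melb_data.csv",
                         "Suburb,Rooms,Price\nAbbotsford,2,1480000\n")

    extracted = unzip_dataset(zip_path, tmp)
    first_line = (Path(tmp) / "melb_data.csv").read_text().splitlines()[0]

print(extracted)
print(first_line)
```

With the real download, calling the helper on the archive and `'.'` as the destination should leave `melb_data.csv` in the working directory, ready for `pd.read_csv`.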

In [6]:
import pandas as pd

housing_df = pd.read_csv('melb_data.csv')
housing_df.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0
3,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,3067.0,...,2.0,1.0,94.0,,,Yarra,-37.7969,144.9969,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,...,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0


We can see a summary of the dataset with the `describe` method:

In [7]:
housing_df.describe()

Unnamed: 0,Rooms,Price,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
count,13580.0,13580.0,13580.0,13580.0,13580.0,13580.0,13518.0,13580.0,7130.0,8205.0,13580.0,13580.0,13580.0
mean,2.937997,1075684.0,10.137776,3105.301915,2.914728,1.534242,1.610075,558.416127,151.96765,1964.684217,-37.809203,144.995216,7454.417378
std,0.955748,639310.7,5.868725,90.676964,0.965921,0.691712,0.962634,3990.669241,541.014538,37.273762,0.07926,0.103916,4378.581772
min,1.0,85000.0,0.0,3000.0,0.0,0.0,0.0,0.0,0.0,1196.0,-38.18255,144.43181,249.0
25%,2.0,650000.0,6.1,3044.0,2.0,1.0,1.0,177.0,93.0,1940.0,-37.856822,144.9296,4380.0
50%,3.0,903000.0,9.2,3084.0,3.0,1.0,2.0,440.0,126.0,1970.0,-37.802355,145.0001,6555.0
75%,3.0,1330000.0,13.0,3148.0,3.0,2.0,2.0,651.0,174.0,1999.0,-37.7564,145.058305,10331.0
max,10.0,9000000.0,48.1,3977.0,20.0,8.0,10.0,433014.0,44515.0,2018.0,-37.40853,145.52635,21650.0


### 1.1 Interpreting Data Description

The results show 8 numbers for each numeric column in your original dataset. The first number, `count`, shows how many rows have non-missing values.

Missing values arise for many reasons. For example, the size of the 2nd bedroom wouldn't be collected when surveying a 1 bedroom house. We'll come back to the topic of missing data.

The second value is the `mean`, which is the average. Under that, `std` is the standard deviation, which measures how numerically spread out the values are.

To interpret the `min`, `25%`, `50%`, `75%` and `max` values, imagine sorting each column from lowest to highest value. The first (smallest) value is the `min`. If you go a quarter of the way through the list, you'll find a number that is bigger than 25% of the values and smaller than 75% of the values. That is the 25% value (pronounced "25th percentile"). The 50th percentile (the `median`) and the 75th percentile are defined analogously, and the `max` is the largest number.
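To make these eight numbers concrete, here is a small sketch that computes them by hand with Python's standard `statistics` module. The room counts are made up for illustration. The `inclusive` quantile method interpolates linearly between sorted values, which to my understanding matches the default behaviour of `describe`.

```python
import statistics

# Hypothetical room counts for nine houses (illustrative, not from the dataset).
rooms = [2, 3, 4, 3, 2, 1, 5, 3, 4]

# Quartiles: cut the sorted values into four equal parts.
q25, q50, q75 = statistics.quantiles(rooms, n=4, method="inclusive")

summary = {
    "count": len(rooms),
    "mean": statistics.mean(rooms),
    "std": statistics.stdev(rooms),   # sample standard deviation
    "min": min(rooms),
    "25%": q25,
    "50%": q50,
    "75%": q75,
    "max": max(rooms),
}

for name, value in summary.items():
    print(f"{name:>5}: {value}")
```

Sorting the list gives `[1, 2, 2, 3, 3, 3, 4, 4, 5]`, so the quartiles fall on the third, fifth and seventh values.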

### 1.2 Selecting Data for Modeling

Generally, datasets have too many variables to wrap your head around, or even to print out nicely. How can you pare down this overwhelming amount of data to something you can understand?

We'll start by picking a few variables using our intuition. There are also more advanced statistical techniques to automatically prioritize variables, which you can learn about in later Kaggle courses.

To choose variables/columns, we'll need to see a list of all columns in the dataset. That is done with the `columns` property of the DataFrame:

In [8]:
housing_df.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

As we saw when calling the `describe` method on our DataFrame, there are some missing values in our data. There are different ways to handle these missing values, but for now we will take the simplest option, which is to drop houses with missing values from our DataFrame.

In [9]:
housing_df = housing_df.dropna(axis = 0)
housing_df.describe()

Unnamed: 0,Rooms,Price,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
count,6196.0,6196.0,6196.0,6196.0,6196.0,6196.0,6196.0,6196.0,6196.0,6196.0,6196.0,6196.0,6196.0
mean,2.931407,1068828.0,9.751097,3101.947708,2.902034,1.57634,1.573596,471.00694,141.568645,1964.081988,-37.807904,144.990201,7435.489509
std,0.971079,675156.4,5.612065,86.421604,0.970055,0.711362,0.929947,897.449881,90.834824,38.105673,0.07585,0.099165,4337.698917
min,1.0,131000.0,0.0,3000.0,0.0,1.0,0.0,0.0,0.0,1196.0,-38.16492,144.54237,389.0
25%,2.0,620000.0,5.9,3044.0,2.0,1.0,1.0,152.0,91.0,1940.0,-37.855438,144.926198,4383.75
50%,3.0,880000.0,9.0,3081.0,3.0,1.0,1.0,373.0,124.0,1970.0,-37.80225,144.9958,6567.0
75%,4.0,1325000.0,12.4,3147.0,3.0,2.0,2.0,628.0,170.0,2000.0,-37.7582,145.0527,10175.0
max,8.0,9000000.0,47.4,3977.0,9.0,8.0,10.0,37000.0,3112.0,2018.0,-37.45709,145.52635,21650.0


### 1.3 Selecting the Prediction Target

We can pull out a variable with `dot-notation`. This single column is stored in a Pandas `Series`, which is broadly like a DataFrame with only a single column of data.

We'll use the dot notation to select the column we want to predict, which is called the `prediction target`. By convention, the prediction target is called `y`.

In [10]:
y_housing = housing_df.Price
y_housing

1        1035000.0
2        1465000.0
4        1600000.0
6        1876000.0
7        1636000.0
           ...    
12205     601000.0
12206    1050000.0
12207     385000.0
12209     560000.0
12212    2450000.0
Name: Price, Length: 6196, dtype: float64

### 1.4 Choosing the "Features"

The columns that are inputted into our model (and later used to make predictions) are called `features`. In our case, those would be the columns used to determine the home price. Sometimes, you will use all columns except the target as features. Other times you'll be better off with fewer features.

For now, we'll build a model with only a few features. Later on you'll see how to iterate and compare models built with different features.

We select multiple features by providing a list of column names inside brackets. Each item in that list should be a string (with quotes).

Here is an example:

In [11]:
housing_features = ['Rooms',
                    'Bathroom',
                    'Landsize',
                    'Lattitude',
                    'Longtitude']

By convention, this data is called `X`.

In [12]:
X_housing =  housing_df[housing_features]
X_housing.head()

Unnamed: 0,Rooms,Bathroom,Landsize,Lattitude,Longtitude
1,2,1.0,156.0,-37.8079,144.9934
2,3,2.0,134.0,-37.8093,144.9944
4,4,1.0,120.0,-37.8072,144.9941
6,3,2.0,245.0,-37.8024,144.9993
7,2,1.0,256.0,-37.806,144.9954


In [13]:
X_housing.describe()

Unnamed: 0,Rooms,Bathroom,Landsize,Lattitude,Longtitude
count,6196.0,6196.0,6196.0,6196.0,6196.0
mean,2.931407,1.57634,471.00694,-37.807904,144.990201
std,0.971079,0.711362,897.449881,0.07585,0.099165
min,1.0,1.0,0.0,-38.16492,144.54237
25%,2.0,1.0,152.0,-37.855438,144.926198
50%,3.0,1.0,373.0,-37.80225,144.9958
75%,4.0,2.0,628.0,-37.7582,145.0527
max,8.0,8.0,37000.0,-37.45709,145.52635


Visually checking your data with these commands is an important part of a data scientist's job. You'll frequently find surprises in the dataset that deserve further inspection.

## 2. Building your model

You will use the **scikit-learn** library to create your models. When coding, this library is written as `sklearn`, as you will see in the sample code. Scikit-learn is easily the most popular library for modeling the types of data typically stored in DataFrames.

The steps to building and using a model are:

- `Define`: What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.
- `Fit`: Capture patterns from provided data. This is the heart of modeling.
- `Predict`: Just what it sounds like.
- `Evaluate`: Determine how accurate the model's predictions are.
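
The four steps above can be sketched with a toy model that always predicts the average training price. `MeanPriceModel` and the data are hypothetical stand-ins written for this illustration, not part of scikit-learn; only the `fit` / `predict` shape mirrors what a real model offers.

```python
class MeanPriceModel:
    """Toy stand-in for a real model: always predicts the mean training price."""

    def fit(self, features, prices):
        self.mean_price_ = sum(prices) / len(prices)
        return self

    def predict(self, features):
        return [self.mean_price_ for _ in features]


# Hypothetical data: one feature (number of rooms) and a price per house.
rooms = [[2], [3], [4], [3]]
prices = [800_000, 1_000_000, 1_400_000, 1_200_000]

model = MeanPriceModel()                 # Define
model.fit(rooms, prices)                 # Fit
predictions = model.predict(rooms)       # Predict

# Evaluate: average absolute gap between predictions and actual prices.
errors = [abs(p - a) for p, a in zip(predictions, prices)]
average_error = sum(errors) / len(errors)

print(predictions[0], average_error)
```

A model this simple ignores the features entirely, which makes it a useful baseline: any real model should do better than it.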

Here is an example of defining a decision tree model with scikit-learn and fitting it with the features and target variable. We specify a number for `random_state` to ensure the same results on each run.

In [14]:
from sklearn.tree import DecisionTreeRegressor

model_housing = DecisionTreeRegressor(random_state = 1)

model_housing.fit(X_housing, y_housing)

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=1, splitter='best')

Many machine learning models allow some randomness in model training. Specifying a number for `random_state` ensures you get the same results in each run. This is considered a good practice. You can use any number, and model quality won't depend meaningfully on exactly what value you choose.
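The effect of fixing a seed can be seen with Python's own `random` module. This is only an analogy for what `random_state` does inside scikit-learn, and the helper `shuffled` is a hypothetical name used for this sketch.

```python
import random

# Prices of the first five houses in the cleaned DataFrame shown earlier.
prices = [1035000, 1465000, 1600000, 1876000, 1636000]


def shuffled(values, seed):
    """Return a shuffled copy of values using a generator seeded with seed."""
    copy = list(values)
    random.Random(seed).shuffle(copy)
    return copy


first_run = shuffled(prices, seed=1)
second_run = shuffled(prices, seed=1)

# Same seed, same "random" order on every run.
print(first_run == second_run)
```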

We now have a fitted model that we can use to make predictions.

In practice, you'll want to make predictions for new houses coming on the market rather than the houses we already have prices for. But we'll make predictions for the first few rows of the training data to see how the predict function works.

In [15]:
print('\033[1m', "Making predictions for the following 5 houses:", '\033[0m')
print(X_housing.head(), end = '\n\n\n')
print('\033[1m', "The predictions are", '\033[0m')
print(model_housing.predict(X_housing.head()))

Making predictions for the following 5 houses:
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954


The predictions are
[1035000. 1465000. 1600000. 1876000. 1636000.]


## 3. Model validation

You've built a model. But how good is it? You'll want to evaluate almost every model you ever build. In most (though not all) applications, the relevant measure of model quality is predictive accuracy. In other words, will the model's predictions be close to what actually happens?

### 3.1 First steps into Model validation

Many people make a huge mistake when measuring predictive accuracy: they make predictions with their training data and compare those predictions to the target values in the training data. You'll see the problem with this approach and how to solve it in a moment, but let's think about how we'd do this first.

There are many metrics for summarizing model quality, but we'll start with one called `Mean Absolute Error` (also called MAE). With the MAE metric, we take the absolute value of each error. This converts each error to a positive number. We then take the average of those absolute errors. 
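Written as a formula, with actual prices $y_i$ and predictions $\hat{y}_i$ for $n$ houses, this is $\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$. A hand-rolled sketch of the same calculation follows; the function name and the prices are made up for illustration.

```python
def mean_absolute_error_by_hand(actual, predicted):
    """Average of |actual - predicted| over all houses."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    absolute_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute_errors) / len(absolute_errors)


# Hypothetical prices, for illustration only.
actual_prices = [1_000_000, 850_000, 1_200_000]
predicted_prices = [950_000, 900_000, 1_200_000]

mae = mean_absolute_error_by_hand(actual_prices, predicted_prices)
print(mae)
```

The three absolute errors are 50,000, 50,000 and 0, so the MAE is 100,000 divided by 3. Note that the first two errors have opposite signs and would cancel out without the absolute value.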

For the model fitted above, the MAE on the data it was trained with is:

In [16]:
from sklearn.metrics import mean_absolute_error

predictions_housing = model_housing.predict(X_housing)
mean_absolute_error(y_housing, predictions_housing)

1115.7467183128902

### 3.2 The problem with 'In-Sample' scores

The measure we just computed can be called an `in-sample` score. We used a single "sample" of houses for both building the model and evaluating it. Here's why this is bad.

Imagine that, in the large real estate market, door color is unrelated to home price. However, in the sample of data you used to build the model, all homes with green doors were very expensive. The model's job is to find patterns that predict home prices, so it will see this pattern, and it will always predict high prices for homes with green doors.

Since this pattern was derived from the training data, the model will appear accurate in the training data. But if this pattern doesn't hold when the model sees new data, the model would be very inaccurate when used in practice.

Since a model's practical value comes from making predictions on new data, we measure performance on data that wasn't used to build the model. The most straightforward way to do this is to exclude some data from the model-building process, and then use that data to test the model's accuracy on data it hasn't seen before. This data is called `validation data`.
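The idea of holding data out can be sketched without any library. The helper `holdout_split`, the 25% validation fraction and the seed are assumptions chosen for this illustration, not a description of how any particular library implements it.

```python
import random


def holdout_split(rows, validation_fraction=0.25, seed=0):
    """Shuffle row indices and set aside a fraction of them for validation."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_validation = round(len(rows) * validation_fraction)
    validation_idx = indices[:n_validation]
    training_idx = indices[n_validation:]
    training = [rows[i] for i in training_idx]
    validation = [rows[i] for i in validation_idx]
    return training, validation


# Hypothetical houses, identified only by a number.
houses = list(range(20))
training, validation = holdout_split(houses)

print(len(training), len(validation))
```

Every house ends up in exactly one of the two pieces, so the model never sees the validation houses while it is being fitted.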

The scikit-learn library has a function `train_test_split` to break up the data into two pieces. We'll use some of that data as `training data` to fit the model, and we'll use the other data as `validation data` to calculate the MAE. The split is applied to both the features and the target, so each house's features stay paired with its price.

This can be achieved by doing:

In [17]:
from sklearn.model_selection import train_test_split

train_X, val_X, train_y, val_y = train_test_split(X_housing, y_housing, random_state = 0)

model_houses = DecisionTreeRegressor()
model_houses.fit(train_X, train_y)

val_predictions = model_houses.predict(val_X)

print(mean_absolute_error(val_y, val_predictions))

272231.0748870239


We can now see that the mean absolute error for the in-sample data was about 1,100 dollars, but out-of-sample it is more than 270,000 dollars!

This is the difference between a model that is almost exactly right, and one that is unusable for most practical purposes. As a point of reference, the average home value in the validation data is 1.1 million dollars. So the error in new data is about a quarter of the average home value.

There are many ways to improve this model, such as experimenting to find better features or different model types.

## 4. Underfitting and Overfitting

### 4.1 Experimenting with different models

Now that we have a reliable way to measure model accuracy, we can experiment with alternative models and see which one gives us the best predictions. You can see in scikit-learn's documentation that the decision tree model has many options.

In practice, it's not uncommon for a tree to have 10 splits between the top level (all houses) and a leaf. As the tree gets deeper, the dataset gets sliced up into leaves with fewer houses. If we keep doubling the number of groups by adding more splits at each level, we'll have 2<sup>10</sup> groups of houses by the time we get to the 10th level. That's 1024 leaves.

When we divide the houses amongst many leaves, we also have fewer houses in each leaf. Leaves with very few houses will make predictions that are quite close to those homes' actual values, but they may make very unreliable predictions for new data (because each prediction is based on only a few houses).
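A quick back-of-the-envelope calculation shows how fast leaves empty out. It assumes a perfectly balanced tree in which every node splits in two, and roughly 4,647 training houses (75% of the 6,196 rows left after dropping missing values). Both are simplifying assumptions, since real trees are uneven and the count depends on your split.

```python
# Assumption: about 4,647 training houses; adjust to your own split.
n_training_houses = 4647

houses_per_leaf = {}
for depth in (1, 2, 5, 10):
    n_leaves = 2 ** depth                      # each level doubles the groups
    houses_per_leaf[depth] = n_training_houses / n_leaves
    print(f"depth {depth:>2}: {n_leaves:>4} leaves, "
          f"about {houses_per_leaf[depth]:.1f} houses per leaf")
```

At depth 10 the average leaf holds fewer than five houses, so each prediction rests on very little evidence.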

This is a phenomenon called `overfitting`, where a model matches the training data almost perfectly, but does poorly in validation and other new data. On the flip side, if we make our tree very shallow, it doesn't divide up the houses into very distinct groups.

At an extreme, if a tree divides houses into only 2 or 4 groups, each group still has a wide variety of houses. The resulting predictions may be far off for most houses, even in the training data (and they will be bad in validation too, for the same reason). When a model fails to capture important distinctions and patterns in the data, so that it performs poorly even on training data, that is called `underfitting`.

Since we care about accuracy on new data, which we estimate from our validation data, we want to find the sweet spot between underfitting and overfitting. Visually, we want the low point of the (red) validation curve in the figure below:

![Mean Absolute Error](https://i.imgur.com/2q85n9s.png)

In summary, models can suffer from either:

- `Overfitting`: capturing spurious patterns that won't recur in the future, leading to less accurate predictions, or
- `Underfitting`: failing to capture relevant patterns, again leading to less accurate predictions.

We use `validation data`, which isn't used in model training, to measure a candidate model's accuracy. This lets us try many candidate models and keep the best one.

### 4.2 Hands on experimenting with Decision Trees

There are a few alternatives for controlling the tree depth, and many allow for some routes through the tree to have greater depth than other routes. But the `max_leaf_nodes` argument provides a very sensible way to control overfitting vs underfitting. The more leaves we allow the model to make, the more we move from the underfitting area in the above graph to the overfitting area.

We can use a utility function to help compare MAE scores from different values for max_leaf_nodes:

In [18]:
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Fit a tree capped at max_leaf_nodes and return its validation MAE."""
    # A fixed random_state keeps repeated calls comparable
    model = DecisionTreeRegressor(max_leaf_nodes = max_leaf_nodes,
                                  random_state = 0)

    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)

    return mae

In [19]:
# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))

Max leaf nodes: 5  		 Mean Absolute Error:  385696
Max leaf nodes: 50  		 Mean Absolute Error:  279794
Max leaf nodes: 500  		 Mean Absolute Error:  261718
Max leaf nodes: 5000  		 Mean Absolute Error:  271996


To find the best number of leaf nodes we can use a loop. Here I'll present two ways: a more explicit one that is simpler for Python beginners, and another one that uses a dict comprehension:

In [21]:
candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]

# Write loop to find the ideal tree size from candidate_max_leaf_nodes
mae =  get_mae(candidate_max_leaf_nodes[0],
               train_X,
               val_X,
               train_y,
               val_y
        )

leaf_nodes = candidate_max_leaf_nodes[0]
new_mae = 0

for qty in candidate_max_leaf_nodes[1:]:
    new_mae = get_mae(qty, train_X, val_X, train_y, val_y)
    if new_mae < mae:
        leaf_nodes = qty
        mae = new_mae
    

# Store the best value of max_leaf_nodes (it will be either 5, 25, 50, 100, 250 or 500)
best_tree_size = leaf_nodes

print('The best tree size is with ', best_tree_size, ' nodes.')
print('The MAE for this tree size is: ', mae)

The best tree size is with  500  nodes.
The MAE for this tree size is:  261718.1134423186


In [22]:
scores = {leaf_size: get_mae(leaf_size, train_X, val_X, train_y, val_y)
          for leaf_size in candidate_max_leaf_nodes}

best_tree_size = min(scores, key = scores.get)

print('The best tree size is with ', best_tree_size, ' nodes.')
print('The MAE for this tree size is: ', scores[best_tree_size])

The best tree size is with  500  nodes.
The MAE for this tree size is:  261718.1134423186


Now we retrain the model with `best_tree_size`, the value of `max_leaf_nodes` that gave the lowest validation MAE.

In [23]:
# Build the final model with the best tree size found above
final_model = DecisionTreeRegressor(max_leaf_nodes = best_tree_size,
                                    random_state = 1)

# Fit the final model on the training data
final_model.fit(train_X, train_y)

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None,
                      max_features=None, max_leaf_nodes=500,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=1, splitter='best')