# General ML Crash Course
Notebook written by Ze Ming :l and Joel

By the end of this notebook, you will be able to:
1. Participate and make submissions in Kaggle competitions
2. Explain the difference between supervised and unsupervised problems
3. Create a machine learning model by fitting data to a model provided by scikit-learn.

In this notebook, we will be using the Boston housing problem on [Kaggle](https://www.kaggle.com/c/boston-housing/overview) as an example.

## Imports

In [1]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR


from sklearn.metrics import mean_squared_error, mean_absolute_error
from math import sqrt


## Defining the problem
**Sign up and download the data here: https://www.kaggle.com/c/boston-housing/data**

The zip archive contains three files:

- train.csv
- test.csv
- submission_example.csv

`train.csv` is the dataset which you use to teach your machine learning model.

`test.csv` contains the inputs which Kaggle uses to assess your model. It has no answers in it.

In [2]:
# Load the training data and test data into pandas dataframes
train_df = pd.read_csv('./train.csv')
test_df = pd.read_csv('./test.csv')

#### Distinguishing supervised and unsupervised problems

A *supervised problem* is the problem of finding a relationship between two data sets where one data set is the input, and the other is the expected output.

In this case, we are finding the relationship between:
- (Input) some information about a town in Boston
- (Expected Output) the price of a house in that town


An *unsupervised problem* is the problem of finding a certain pattern in a data set. The key difference is that there is no expected output.
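
To make the distinction concrete, here is a minimal sketch on made-up data (not the Boston dataset). The names `X_demo`, `y_demo`, `regressor` and `clusterer` are assumptions chosen for this illustration. The supervised model is given both inputs and expected outputs, while the unsupervised model is given inputs only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Synthetic data, invented for this sketch: y is roughly 3 * x + 2
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(100, 1))
y_demo = 3 * X_demo[:, 0] + 2 + rng.normal(0, 0.1, size=100)

# Supervised: fit() receives the inputs AND the expected outputs
regressor = LinearRegression()
regressor.fit(X_demo, y_demo)
slope = regressor.coef_[0]
print("learnt slope:", slope)

# Unsupervised: fit() receives only the inputs and looks for structure
clusterer = KMeans(n_clusters=2, n_init=10, random_state=0)
clusterer.fit(X_demo)
cluster_labels = clusterer.labels_
print("cluster sizes:", np.bincount(cluster_labels))
```

Note that `KMeans.fit` never sees `y_demo`: there is no expected output for it to match.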

In [3]:
train_df.head()

Unnamed: 0,ID,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
0,1,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
1,2,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
2,4,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
3,5,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2
4,7,0.08829,12.5,7.87,0,0.524,6.012,66.6,5.5605,5,311,15.2,395.6,12.43,22.9


In this problem, the task is to predict `medv` given the rest of the columns (`crim`, `zn`, etc.).
Since there is an expected value that we are trying to predict (`medv`), this is a supervised problem.

The test data is missing `medv` because that is what we have to predict.

In [4]:
test_df.head()

Unnamed: 0,ID,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat
0,3,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03
1,6,0.02985,0.0,2.18,0,0.458,6.43,58.7,6.0622,3,222,18.7,394.12,5.21
2,8,0.14455,12.5,7.87,0,0.524,6.172,96.1,5.9505,5,311,15.2,396.9,19.15
3,9,0.21124,12.5,7.87,0,0.524,5.631,100.0,6.0821,5,311,15.2,386.63,29.93
4,10,0.17004,12.5,7.87,0,0.524,6.004,85.9,6.5921,5,311,15.2,386.71,17.1


## Preparing the data
First we split the training data into the inputs and the expected output, `medv`.

In [5]:
# Extract every column except the last one (medv) for the model to use as inputs
X = train_df.iloc[:, :-1]

# Extract the prediction target medv
Y = train_df['medv']

#### Splitting the data for later validation

It would be good to be able to tell how accurate our model is after training it. So, we split the given dataset (`train.csv`) into two: one part for training and the other for validation.

We don't use the training data to test the model because the model has already seen the answers for that data during training. The only reliable test of whether the model actually learnt from the data is to test it on data it has never seen before.

> Remember to shuffle when doing the splitting. If the rows are stored in some order, an unshuffled split can leave the training and validation sets looking very different.


In [6]:
# Split the data, holding out 30% of it as validation data
X_train, X_valid, Y_train, Y_valid = train_test_split(X, Y, test_size=0.3, shuffle=True)
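
Because the split is shuffled, each run produces a different split and therefore slightly different scores. A minimal sketch of making the split repeatable by passing `random_state` is shown below. The toy arrays and the seed value 42 are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: row i of X_demo is [2*i, 2*i + 1] and its target is i
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# A fixed random_state makes the shuffled split identical on every run
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.3, shuffle=True, random_state=42
)
X_tr2, X_va2, y_tr2, y_va2 = train_test_split(
    X_demo, y_demo, test_size=0.3, shuffle=True, random_state=42
)

print("train rows:", len(X_tr), "validation rows:", len(X_va))
print("same split both times:", np.array_equal(X_va, X_va2))
```

Shuffling moves the inputs and targets together, so each row of `X_tr` still lines up with its entry in `y_tr`.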

## Training the Model
Now we proceed with training a model on the training data.

#### Scikit-learn machine learning library
The scikit-learn machine learning library contains many machine learning algorithms, and many utilities to assist with machine learning.

Scikit-learn is also very useful because most models are used in the same way: 
- `.fit()` - train the model on the given data
- `.predict()`- predict values using the model

> So any time you find yourself wanting to try another type of algorithm for your dataset, just swap it out and it should work fine.

For example, you can swap the model `RandomForestRegressor` with `SVR`, and both should still work.

In [7]:
# create a new model
model = RandomForestRegressor()
# model = SVR(C=1e+2, gamma=3e-3)

# Train the model
model.fit(X_train, Y_train)



RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)
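
The shared `.fit()` / `.predict()` interface means several models can be tried in one loop. The sketch below assumes synthetic data from `make_regression` rather than the Boston data. The names `candidates` and `scores` and the parameter values are hypothetical choices for this illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic regression data, used only to keep the sketch self-contained
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Two different algorithms behind the same interface
candidates = {
    "random forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "SVR": SVR(C=1e2),
}

scores = {}
for name, candidate in candidates.items():
    candidate.fit(X_tr, y_tr)                       # train on the training part
    scores[name] = candidate.score(X_va, y_va)      # R2 on the validation part
    print(name, "R2:", scores[name])
```

Nothing in the loop body depends on which algorithm is being used, which is what makes swapping models cheap.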

## Evaluating the model
We now see how well the model learnt from the data by evaluating its predictions with a _metric_.

We will be using the **RMSE** and **R2** evaluation metrics.

In [9]:
Y_predicted = model.predict(X_valid)

print("RMSE:", sqrt(mean_squared_error(Y_valid, Y_predicted)))

# Evaluate the model based on R2 (this is what .score() returns for regressors)
print("R2", model.score(X_valid, Y_valid))


RMSE: 3.390472828382496
R2 0.8563761261570519


The **R2** evaluation metric gives us an idea of how well the model is doing.

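It is at most 1 (a perfect fit). A score of 0 means the model does no better than always predicting the mean of the target, and the score can be negative for a model that does worse than that. The closer to 1 you get, the better your model is.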
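
As a small worked example, both metrics can be computed by hand and checked against scikit-learn. The four target values and the predictions below are made up for illustration and are not taken from the Boston data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])   # close to the truth
y_bad = np.array([9.0, 7.0, 5.0, 3.0])    # the truth in reverse order

def r2_by_hand(actual, predicted):
    # R2 = 1 - (squared error of the model) / (squared error of predicting the mean)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1 - ss_res / ss_tot

# RMSE is the square root of the mean squared error
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))
r2_manual = r2_by_hand(y_true, y_pred)
r2_bad = r2_by_hand(y_true, y_bad)

print("RMSE:", rmse_manual, "sklearn:", np.sqrt(mean_squared_error(y_true, y_pred)))
print("R2:", r2_manual, "sklearn:", r2_score(y_true, y_pred))
print("R2 of the bad predictions:", r2_bad)
```

Here the good predictions score an R2 of 0.9, while the reversed predictions score -3, which shows that R2 has no lower bound.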

## Hyperparameter tuning
Now the hard part, tuning the hyperparameters:
1. select a value for a hyperparameter
2. train a model with that hyperparameter value
3. evaluate the model to see if the change improved the metrics

In [11]:
# Create a model with the chosen hyperparameter values
model = RandomForestRegressor(n_estimators=279,
                              max_depth=682,
                              max_features=2)

# Train the model
model.fit(X_train, Y_train)

# Evaluate the model based on RMSE
Y_predicted = model.predict(X_valid)
print("RMSE:", sqrt(mean_squared_error(Y_valid, Y_predicted)))

# Evaluate the model based on R2
print("R2", model.score(X_valid, Y_valid))

RMSE: 3.995264543992084
R2 0.8005669265440317
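
The select, train and evaluate steps can also be written as a loop over candidate values. This is a minimal sketch on synthetic data. The candidate values for `n_estimators` and the names `results` and `best_n` are assumptions for illustration, not recommended settings.

```python
from math import sqrt

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data, used only to keep the sketch self-contained
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

results = {}
for n_estimators in [10, 50, 100]:                       # 1. select a value
    candidate = RandomForestRegressor(n_estimators=n_estimators, random_state=0)
    candidate.fit(X_tr, y_tr)                            # 2. train with that value
    predictions = candidate.predict(X_va)
    results[n_estimators] = sqrt(mean_squared_error(y_va, predictions))  # 3. evaluate
    print("n_estimators =", n_estimators, "RMSE:", results[n_estimators])

# Lower RMSE is better, so keep the value with the smallest validation RMSE
best_n = min(results, key=results.get)
print("best n_estimators:", best_n)
```

Fixing `random_state` in both the split and the model keeps the comparison fair, since the only thing that changes between runs is the hyperparameter.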


## Create the submission CSV file
Kaggle provides a `test.csv` file which contains data on towns in Boston but does not contain the price of housing in that town. This is the set of "questions" which Kaggle uses to assess your model.

Let's do the same process to predict the price of housing in the `test.csv` towns.

In [12]:
# Use every column of the test data as input, matching the columns used for training
X_test = test_df.values

In [13]:
# Create the dataframe to be used for submission
predicted = model.predict(X_test)
submission_df = pd.DataFrame({'ID' : test_df['ID'], "medv" : predicted})

Pandas makes making CSVs very easy :)

In [14]:
# Write the dataframe into a CSV file
submission_df.to_csv('submission.csv', index=False)
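
Before uploading, it can help to read the file back and check its shape. The sketch below assumes the `ID` and `medv` column names used above. It writes to an in-memory buffer and uses made-up placeholder values instead of real predictions, so it runs without any files.

```python
import io

import pandas as pd

# Placeholder rows, invented for this sketch (not real model output)
demo_submission = pd.DataFrame({"ID": [3, 6, 8], "medv": [30.1, 24.5, 18.2]})

# Write the CSV to memory exactly as to_csv would write it to disk
buffer = io.StringIO()
demo_submission.to_csv(buffer, index=False)

# Read it back and check what Kaggle would see
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(round_trip)
print("columns:", list(round_trip.columns))
print("any missing values:", bool(round_trip.isna().any().any()))
```

With `index=False`, the pandas row index is not written, so the file contains only the two expected columns.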

Now submit `submission.csv` to the competition and view your score. Kaggle scores this competition with RMSE rather than R2, so the lower your score, the better.