# General ML Crash Course
Notebook written by Ze Ming :l and Joel

By the end of this notebook, you will be able to:
1. Participate and make submissions in Kaggle competitions
2. Explain the difference between supervised and unsupervised problems
3. Create a machine learning model by fitting data to a model provided by scikit-learn.

In this notebook, we will be using the Boston housing problem on [Kaggle](https://www.kaggle.com/c/boston-housing/overview) as an example.

## Predicting Boston Housing Prices
**Sign up and download the data here: https://www.kaggle.com/c/boston-housing/data**

The zip archive contains three files:

- train.csv
- test.csv
- submission_example.csv

`train.csv` is the dataset which you use to teach your machine learning model.

`test.csv` contains the rows, without prices, which Kaggle uses to assess your model: you predict a price for each row and Kaggle scores those predictions.

In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Load the training data and test data into pandas dataframes
train_df = pd.read_csv('./train.csv')
test_df = pd.read_csv('./test.csv')

In [None]:
# Explore the train data with .head()
train_df.head()

In [None]:
# Explore the test data with .head()
test_df.head()

## Distinguishing supervised and unsupervised problems

A *supervised problem* is the problem of finding a relationship between two data sets where one data set is the input, and the other is the expected output.

In this case, we are finding the relationship between:
- (Input) some information about a town in Boston
- (Expected Output) the median price of a house in that town


An *unsupervised problem* is the problem of finding a certain pattern in a data set. The key difference is that there is no expected output.

Many competitions will provide you with the dataset (as this one does) and, just by looking through the dataset, you will be able to tell whether it is a supervised or unsupervised problem.

In this competition, we need to submit a csv file with our answers, given the test rows in `test.csv`.

Since the `train.csv` dataset contains input and expected output data, this is a supervised problem.
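
To make the difference concrete, here is a minimal sketch on made-up data (not the Boston data). `LinearRegression` and `KMeans` are just stand-ins picked for illustration, and all the `*_demo` names are hypothetical. The thing to notice is what gets passed to `.fit()`: the supervised model receives inputs and expected outputs, the unsupervised one receives inputs only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Made-up inputs: 100 rows, 2 features (assumption: purely synthetic data)
X_demo = rng.normal(size=(100, 2))
# Made-up expected output: a noisy linear function of the inputs
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=100)

# Supervised: the model is given the inputs AND the expected outputs
supervised = LinearRegression().fit(X_demo, y_demo)
print(supervised.coef_)

# Unsupervised: the model is given the inputs only and looks for groups
unsupervised = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_demo)
print(unsupervised.labels_[:10])
```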

### The library that we will be using for machine learning this time will be scikit-learn. 

Scikit-learn is very useful because most models are used in the same way: `.fit()`, `.predict()`, `.score()` and some other methods apply to most models.

So any time you find yourself wanting to try another type of algorithm for your dataset, just swap it out and it should work fine.

For example, you can comment out `RandomForestRegressor` and uncomment `SVR`, and both should still work. Some models, like `SVR` and `RandomForestRegressor`, also come with hyperparameters that you can tune to possibly get better performance.

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

In [None]:
model = RandomForestRegressor()
# model = SVR(C=1e+2, gamma=3e-3)
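
As a rough illustration of the "swap it out" idea, the sketch below runs both models through exactly the same loop. The data is synthetic and the names `candidates` and `demo_scores` are made up for this example; only the shared `.fit()` / `.score()` calls are the point.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption: NOT the Boston dataset)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# Two different algorithms behind the same interface
candidates = {
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "svr": SVR(C=1e+2, gamma=3e-3),
}

demo_scores = {}
for name, candidate in candidates.items():
    # Train on the first 150 rows, score (R2) on the remaining 50
    candidate.fit(X_demo[:150], y_demo[:150])
    demo_scores[name] = candidate.score(X_demo[150:], y_demo[150:])
    print(name, round(demo_scores[name], 3))
```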

After doing some exploration, you will observe that several factors seem to affect the median price:

* rm
* lstat
* crim
* rad
* indus

So, we are only going to use a few of these columns for training the model.

Feel free to adjust what columns you use to train the model. `pandas` makes it convenient by allowing you to plug and play the column names.

In this case, I will use `rm`, `lstat` and `crim`

In [None]:
# Extract the factors which seem to affect the median price
X = train_df[['rm', 'lstat', 'crim']]
# Extract the median price
Y = train_df['medv']
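
One possible way to do the exploration mentioned earlier is to rank the columns by how strongly they correlate with the target. This is only a sketch: `rank_features_by_correlation` is a hypothetical helper, the demo frame is synthetic, and correlation only picks up linear relationships, so treat the ranking as a hint rather than an answer.

```python
import numpy as np
import pandas as pd


def rank_features_by_correlation(df, target):
    """Absolute Pearson correlation of each numeric column with `target`, strongest first."""
    correlations = df.select_dtypes("number").corr()[target].drop(target)
    return correlations.abs().sort_values(ascending=False)


# Synthetic demo frame (assumption: made-up columns, not the Boston data)
rng = np.random.default_rng(0)
signal = rng.normal(size=200)
demo_df = pd.DataFrame({
    "strong": signal + rng.normal(scale=0.1, size=200),  # closely tied to the target
    "weak": rng.normal(size=200),                        # unrelated noise
    "target": 2.0 * signal,
})

ranking = rank_features_by_correlation(demo_df, "target")
print(ranking)
```

In this notebook, the equivalent call would be `rank_features_by_correlation(train_df, 'medv')`.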

It would be good to be able to tell how well our model predicts after training it. So, we split the given dataset (`train.csv`) into two parts: one for training and the other, held back from training, for testing the accuracy.

In [None]:
# Split the data, with 30% of the data as test data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, shuffle=True)

In [None]:
print(type(X_train))
print(X_train.shape)
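
Because the split is shuffled, it comes out differently every time the cell runs. If you want the same split on every run, `train_test_split` accepts a `random_state` seed. A small sketch on toy arrays (the value `42` and the `*_demo` names are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 rows, 2 features (assumption: stand-in for X and Y above)
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# The same seed gives the same shuffle, so both calls return identical splits
split_a = train_test_split(X_demo, y_demo, test_size=0.3, shuffle=True, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, shuffle=True, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)
```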

In [None]:
# Train the model
model.fit(X_train, Y_train)

In [None]:
# Score the model based on default scorer (R2)
print(model.score(X_test, Y_test))

# Score the model based on RMSE
from sklearn.metrics import mean_squared_error
from math import sqrt

Y_predicted = model.predict(X_test)
print(sqrt(mean_squared_error(Y_test, Y_predicted)))

The score here will vary from run to run, as the dataset is quite small (a few hundred rows) and the split is random.

## Scoring
Scikit-learn calculates the score by testing the model with the `X_test` "questions". Then it compares the model's "answers" with `Y_test` (the correct answers). Since the dataset is small, the test set is also small, so the random split (and the randomness inside `RandomForestRegressor` itself) can change the result noticeably from run to run. This effect is much less noticeable with large datasets.
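
One common way to get a steadier number on a small dataset is cross-validation: score the model on several different splits and look at the average. The notebook itself does not do this, so take the following as an optional sketch on synthetic data, with `cv_model` and `cv_scores` as made-up names.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption: NOT the Boston dataset)
X_demo = rng.uniform(0, 10, size=(300, 3))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=1.0, size=300)

cv_model = RandomForestRegressor(n_estimators=50, random_state=0)

# 5 folds: train on 4/5 of the rows, score (R2) on the remaining 1/5, five times
cv_scores = cross_val_score(cv_model, X_demo, y_demo, cv=5)
print(cv_scores)
print(cv_scores.mean(), cv_scores.std())
```

In the notebook, the same call would take `X` and `Y` in place of the demo arrays.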

However, you should be able to achieve an R2 score above 0.5, which means the model explains more than half of the variation in the held-out prices.

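There are two scoring methods used here: R2 and RMSE. R2 gives you a rough idea of how well the predictions fit (1.0 is a perfect fit, so higher is better). RMSE (root mean squared error) measures the size of the prediction errors in the same units as `medv` (lower is better), and it is used here because that is what Kaggle is judging you by.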
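
To see what the two numbers actually measure, here they are computed by hand and compared against scikit-learn. The prices below are made-up values for illustration only, not model output.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up "correct answers" and "model answers" (assumption: illustrative values)
y_true = np.array([24.0, 21.6, 34.7, 33.4, 36.2])
y_pred = np.array([25.0, 20.0, 33.0, 35.0, 30.0])

# R2 = 1 - (squared error of the model) / (squared error of always guessing the mean)
residual_ss = np.sum((y_true - y_pred) ** 2)
total_ss = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - residual_ss / total_ss

# RMSE = square root of the average squared error
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(r2_manual, r2_score(y_true, y_pred))
print(rmse_manual, np.sqrt(mean_squared_error(y_true, y_pred)))
```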

## Create the submission CSV file
Kaggle provides a `test.csv` file which contains data on towns in Boston but does not contain the price of housing in that town. This is the set of "questions" which Kaggle uses to assess your model.

Let's do the same process to predict the price of housing in the `test.csv` towns.

In [None]:
# Take the same features that you used previously to train your model
predict_from = test_df[['rm', 'lstat', 'crim']]
predict_from.head()

In [None]:
# Create the dataframe to be used for submission
predicted = model.predict(predict_from)
submission_df = pd.DataFrame({'ID' : test_df['ID'], "medv" : predicted})

Pandas makes making CSVs very easy :)

In [None]:
# Write the dataframe into a CSV file
submission_df.to_csv('submission.csv', index=False)
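
Before uploading, it can be worth a quick sanity check on the dataframe. The helper below is hypothetical (it is not part of pandas or Kaggle) and only assumes the `ID` and `medv` column names used above.

```python
import pandas as pd


def check_submission(submission, expected_ids):
    """Hypothetical helper: raise ValueError if the submission frame looks wrong."""
    if list(submission.columns) != ["ID", "medv"]:
        raise ValueError(f"unexpected columns: {list(submission.columns)}")
    if submission["medv"].isna().any():
        raise ValueError("some predictions are missing")
    if list(submission["ID"]) != list(expected_ids):
        raise ValueError("IDs do not match the test rows")
    return True


# Tiny made-up frame standing in for submission_df
demo_submission = pd.DataFrame({"ID": [3, 6, 8], "medv": [21.5, 19.0, 30.2]})
print(check_submission(demo_submission, expected_ids=[3, 6, 8]))
```

In the notebook, this would be called as `check_submission(submission_df, test_df['ID'])`.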

Now submit `submission.csv` to the competition and view your score. Unlike R2, a lower score is better here, because Kaggle scores this competition with RMSE, which measures error.

## Hyperparameter tuning
As mentioned before, the machine learning model may accept additional parameters to tune the underlying algorithm. These are called *hyperparameters*. We won't go through how to find the best values, but, just to show the effects, here are a few good values to try out:

```
n_estimators=279, max_depth=682, max_features=2
n_estimators=45, max_depth=703, max_features=2
n_estimators=61, max_depth=166, max_features=3
```

In [None]:
# Use the parameters in the model
model = RandomForestRegressor(n_estimators=279, max_depth=682, max_features=2)

# Train the model
model.fit(X_train, Y_train)

# Score the model based on default scorer (R2)
print(model.score(X_test, Y_test))

# Score the model based on RMSE
Y_predicted = model.predict(X_test)
print(sqrt(mean_squared_error(Y_test, Y_predicted)))
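
If you are curious how values like these might be found, one option scikit-learn offers is `RandomizedSearchCV`, which tries random combinations from ranges you give it and keeps the best one. This is only a sketch: the data is synthetic, the ranges are arbitrary, and it is not necessarily how the values above were chosen.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption: NOT the Boston dataset)
X_demo = rng.uniform(0, 10, size=(150, 3))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=1.0, size=150)

# Candidate values to sample from (arbitrary choices for illustration)
param_distributions = {
    "n_estimators": [20, 50, 100],
    "max_depth": [3, 10, None],
    "max_features": [1, 2, 3],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                               # try 5 random combinations
    cv=3,                                   # 3-fold cross-validation for each
    scoring="neg_root_mean_squared_error",  # negated RMSE, so higher is better
    random_state=0,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(-search.best_score_)  # flip the sign to get RMSE back
```

In the notebook, `search.fit(X_train, Y_train)` would run the same search on the housing data.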