# Week 1
## Last Time
Last week we looked at a simple binary classification problem. We followed a process and trained a linear classifier on a simple dataset. Whilst this worked well for our toy data, when we applied it to another dataset, we found that it didn't translate too well. The reason is that we were trying to model a non-linear function using (essentially) the linear function that you'll have come across in high school algebra, $y=mx+c$. If you go back and try a more flexible classification method, such as an [SVM](http://scikit-learn.org/stable/modules/svm.html) or an [MLP](http://scikit-learn.org/stable/modules/neural_networks_supervised.html), then you'll find more success, although you will come across the bias-variance tradeoff. We'll talk about this again in future weeks, but for now, let's focus on what we're doing this week.
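To see why a straight line struggles on a non-linear problem, here is a minimal sketch. It uses synthetic XOR-style data rather than last week's dataset, and the data, seed and variable names are our own assumptions. Accuracy is measured on the training points themselves, so treat the numbers as illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Synthetic XOR-style data: the label is 1 when both features share a sign.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# A linear model can only draw one straight decision boundary.
linear_acc = LogisticRegression().fit(X, y).score(X, y)

# An RBF-kernel SVM can bend its boundary around the four quadrants.
svm_acc = SVC(kernel="rbf").fit(X, y).score(X, y)

print("linear:", round(linear_acc, 3), "svm:", round(svm_acc, 3))
```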

As an aside, for those of you who were wondering, the additional dataset was [the monk's problem](https://www.openml.org/d/333).

For fun, we ran the data through scikit-learn's standard implementations of a few classifiers. We did this 5 times; the per-run and mean results are below.

| Classifier | Accuracy Run 1 | Accuracy Run 2 | Accuracy Run 3 | Accuracy Run 4 | Accuracy Run 5 | Mean Accuracy |
|:----------:|:--------------:|:--------------:|:--------------:|:--------------:|:--------------:|:-------------:|
|     LR     |      0.643     |      0.598     |      0.643     |      0.670     |      0.643     |     0.639     |
|     SVM$^1$    |      0.857     |      0.795     |      0.857     |      0.911     |      0.884     |     0.861     |
|     MLP$^2$    |      0.438     |      0.455     |      0.402     |      0.554     |      0.616     |     0.493     |

$^1$ By tweaking the SVM slightly, we can get it to perform with an accuracy of 0.95. 

$^2$ You're probably thinking this isn't too great. Neural Networks are highly parameterized and this is just an out of the box model. We could improve this greatly by changing the parameters.
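The mean column is just the average of the five runs, so it can be recomputed from the table as a quick sanity check. The dictionary below is our own transcription of the table, nothing more.

```python
# Per-run accuracies copied from the table above.
runs = {
    "LR":  [0.643, 0.598, 0.643, 0.670, 0.643],
    "SVM": [0.857, 0.795, 0.857, 0.911, 0.884],
    "MLP": [0.438, 0.455, 0.402, 0.554, 0.616],
}

# Mean accuracy per classifier, rounded to three decimal places.
means = {name: round(sum(acc) / len(acc), 3) for name, acc in runs.items()}
print(means)
```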

## This Week
Last week we learnt a function $f(x)=y$ that took our input features and output a discrete value $y \in \{0,1\}$. Whilst this is OK, it clearly has its limitations, so we're going to expand on this. This week we're going to follow a similar approach and learn a function $f(x)=y$ that this time outputs a real number rather than a discrete value. To demonstrate this we're going to try and predict house prices. We'll follow the same process as last week, but this time the dataset is more complex, so there are more decisions to make. We will make arbitrary decisions and highlight them. We then encourage you to take the same data and try to beat our score. The place to start is either changing the model, or going back to the arbitrary decisions that we've made and experimenting with different ones. Again, please share your results in Slack, we love to hear how you're getting on.
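As a minimal sketch of what "outputs a real number" means, the snippet below fits a regression model to four made-up points that lie exactly on $y=2x+1$ and asks it for a prediction at a new input. The toy data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data lying exactly on the line y = 2x + 1.
x = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

reg = LinearRegression().fit(x, y)

# The prediction is a real number, not a class label from {0, 1}.
pred = reg.predict(np.array([[4.0]]))[0]
print(pred)
```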

Before we get onto this week though, let's outline a sketch of where we're going over the next couple weeks.

## Future
The goal of these notebooks is to get you excited about ML and, more importantly, to break down the barrier to entry. We're not claiming that by the end of this you'll be an expert. Far from it. But we hope that you will feel confident reading other sources and building something using the tools we've introduced, and that you're intrigued enough to find out more. With that in mind, here's where we're going.

0. Binary Classification
1. Regression
2. Multi-class Classification
3. Deep Learning
4. Clustering
5. Genetic Algorithms
6. Reinforcement Learning

These are all subject to change, but this should give a *broad* overview of the world of machine learning. If you've got any feedback, or want to cover something different, please let us know.

# Predicting House Prices

## Gather the data

The data is (surprisingly) in the file, 'houseprices.csv', so let's bring it in using Pandas.

In [1]:
import pandas as pd
train = pd.read_csv("houseprices.csv")
print(train.shape)

(1460, 81)


## Understand the data
The best way to get a feel for the data is to get our hands dirty. Let's take a look!

We know that there are 81 columns, so printing out the whole dataframe will hide some of the data. Instead, let's look at one row first.

In [2]:
[(col, data) for col, data in zip(train.columns, train.iloc[1])]

[('Id', 2),
 ('MSSubClass', 20),
 ('MSZoning', 'RL'),
 ('LotFrontage', 80.0),
 ('LotArea', 9600),
 ('Street', 'Pave'),
 ('Alley', nan),
 ('LotShape', 'Reg'),
 ('LandContour', 'Lvl'),
 ('Utilities', 'AllPub'),
 ('LotConfig', 'FR2'),
 ('LandSlope', 'Gtl'),
 ('Neighborhood', 'Veenker'),
 ('Condition1', 'Feedr'),
 ('Condition2', 'Norm'),
 ('BldgType', '1Fam'),
 ('HouseStyle', '1Story'),
 ('OverallQual', 6),
 ('OverallCond', 8),
 ('YearBuilt', 1976),
 ('YearRemodAdd', 1976),
 ('RoofStyle', 'Gable'),
 ('RoofMatl', 'CompShg'),
 ('Exterior1st', 'MetalSd'),
 ('Exterior2nd', 'MetalSd'),
 ('MasVnrType', 'None'),
 ('MasVnrArea', 0.0),
 ('ExterQual', 'TA'),
 ('ExterCond', 'TA'),
 ('Foundation', 'CBlock'),
 ('BsmtQual', 'Gd'),
 ('BsmtCond', 'TA'),
 ('BsmtExposure', 'Gd'),
 ('BsmtFinType1', 'ALQ'),
 ('BsmtFinSF1', 978),
 ('BsmtFinType2', 'Unf'),
 ('BsmtFinSF2', 0),
 ('BsmtUnfSF', 284),
 ('TotalBsmtSF', 1262),
 ('Heating', 'GasA'),
 ('HeatingQC', 'Ex'),
 ('CentralAir', 'Y'),
 ('Electrical', 'SBrkr'),


First thing to notice is that there's a mix of all kinds of data, and we don't want this tutorial to take days to read through, so let's make some arbitrary decisions. These initial ones are as follows.

1. The 'Id' column can be dropped.
2. If a numerical column contains a NaN, the NaN is replaced with the mean of the column.
3. If a non-numerical column contains a NaN, the NaN is to be replaced by the string 'None'.
4. All text columns are to be One Hot Encoded, as we're assuming that they contain categorical data.
5. The column 'MSSubClass' is to be treated as containing categorical data.

(Challenging some of these decisions might be a good place to look at improving the model)
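Before applying rules like these, it can help to check which columns are numerical and where the NaNs are. The sketch below does this on a hypothetical miniature of the data (the values are made up, the column names are borrowed from the row printed earlier).

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the house price data (values are made up).
toy = pd.DataFrame({
    "LotFrontage": [80.0, np.nan, 68.0],
    "LotArea": [9600, 11250, 8450],
    "Alley": [np.nan, "Grvl", np.nan],
    "Street": ["Pave", "Pave", "Grvl"],
})

# Split the columns into numerical and non-numerical ones.
num_cols = [c for c in toy.columns if pd.api.types.is_numeric_dtype(toy[c])]
text_cols = [c for c in toy.columns if c not in num_cols]

# Count the missing values per column to see where rules 2 and 3 would apply.
nan_counts = toy.isna().sum()
print(num_cols, text_cols)
print(nan_counts)
```

On the real data the same three lines can be run against `train` instead of `toy`.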

In [3]:
import numpy as np

y = train['SalePrice'].astype(np.float64)
train = train.drop('SalePrice', axis=1)

# Rule 1
train = train.drop('Id', axis=1)

In [4]:
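# Rules 2 & 3
for col in train.columns:
    if pd.api.types.is_numeric_dtype(train[col]):
        # Rule 2: numerical column, replace NaN with the column mean
        train[col] = train[col].fillna(train[col].mean())
    else:
        # Rule 3: non-numerical column, replace NaN with the string 'None'
        train[col] = train[col].fillna(value="None")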
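To make rules 2 and 3 concrete, here is a small sketch on a hypothetical two-column frame. The values are made up; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame (values are made up for illustration).
toy = pd.DataFrame({
    "LotFrontage": [80.0, np.nan, 70.0],
    "Alley": [np.nan, "Grvl", np.nan],
})

# Rule 2: numerical NaNs become the column mean, here (80 + 70) / 2 = 75.
toy["LotFrontage"] = toy["LotFrontage"].fillna(toy["LotFrontage"].mean())

# Rule 3: non-numerical NaNs become the string 'None'.
toy["Alley"] = toy["Alley"].fillna("None")

print(toy)
```

Note that `mean()` skips NaNs by default, which is why the missing value does not drag the mean down.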

In [5]:
# Rules 4 & 5
for col in train.columns:
    # One-hot encode every non-numerical column, plus 'MSSubClass' (rule 5)
    if not pd.api.types.is_numeric_dtype(train[col]) or col == 'MSSubClass':
        train = pd.concat([train, pd.get_dummies(train[col], prefix=col)], axis=1)
        train = train.drop(col, axis=1)
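If One Hot Encoding is new to you, the sketch below shows what `pd.get_dummies` does to a single text column. The toy frame is hypothetical, and `dtype=int` is our choice so that the indicators print as 0/1.

```python
import pandas as pd

# Hypothetical toy frame with one text column and one numerical column.
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "LotArea": [9600, 11250, 8450],
})

# One indicator column per category; dtype=int gives 0/1 rather than booleans.
dummies = pd.get_dummies(toy["Street"], prefix="Street", dtype=int)

# Swap the original text column for its indicator columns.
encoded = pd.concat([toy.drop("Street", axis=1), dummies], axis=1)
print(encoded)
```

Each category becomes its own column, which is why the real dataset ends up with far more columns after this step than it started with.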

## Pre-processing
Now we've enacted those decisions, let's consider pre-processing the data. Let's arbitrarily decide to standardize all the data, including the target. Whoa, whoa, whoa, we're standardizing the target?! You're not crazy if you thought that. But ultimately, it's just a number, and as long as we're consistent (that is, we undo the same transformation on the predictions), this is a legitimate choice.

In [6]:
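# We'll use scikit-learn's standardizer
from sklearn.preprocessing import StandardScaler
scaler_train = StandardScaler()  # scales the input features
scaler_test = StandardScaler()   # despite its name, scales the target y

# Note: fitting on every row before the train/test split lets statistics from
# the test rows leak into training; fitting on the training rows only is stricter.
scaler_train = scaler_train.fit(train)
scaler_test = scaler_test.fit(y.values.reshape(-1,1))

train = pd.DataFrame(scaler_train.transform(train))
y = pd.DataFrame(scaler_test.transform(y.values.reshape(-1,1)))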
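Standardizing a column means replacing each value $x$ with $z = (x - \mu)/\sigma$, where $\mu$ is the column mean and $\sigma$ its standard deviation. The sketch below checks, on made-up prices, that `StandardScaler` does the same thing as the formula.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up sale prices, shaped as a single column.
prices = np.array([[100000.0], [150000.0], [200000.0], [350000.0]])

scaler = StandardScaler().fit(prices)
z = scaler.transform(prices)

# The same thing by hand: subtract the mean, divide by the (population) std.
z_manual = (prices - prices.mean()) / prices.std()

print(np.allclose(z, z_manual))
print(z.mean(), z.std())
```

After the transform the column has mean 0 and standard deviation 1, up to floating point error.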

In [7]:
# Then let's split the data into training and test data
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(train, y, test_size=0.2)
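The split is random, so the rows that land in the test set (and therefore the final score) change from run to run. If you want to compare your changes fairly, one option is to fix the shuffle with the `random_state` parameter. The sketch below uses stand-in arrays, and the seed value 0 is an arbitrary assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same number of rows as the house price data.
X = np.zeros((1460, 3))
y = np.arange(1460)

# random_state fixes the shuffle so the same rows land in the test set each run.
a_train, a_test, ya_train, ya_test = train_test_split(X, y, test_size=0.2, random_state=0)
b_train, b_test, yb_train, yb_test = train_test_split(X, y, test_size=0.2, random_state=0)

print(a_train.shape, a_test.shape)
print(np.array_equal(ya_test, yb_test))
```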

## Training a Model
We're going to use an SVM as a regression model. This is, again, a completely arbitrary choice. Well, kind of arbitrary: SVMs are kind of beautiful. [This](https://static1.squarespace.com/static/58851af9ebbd1a30e98fb283/t/58902fbae4fcb5398aeb7505/1485844411772/SVM+Explained.pdf) is a really nice explanation.

In [8]:
from sklearn import svm
model = svm.SVR(C=1e5, epsilon=1e-10, tol=1e-5)
model.fit(train_x, train_y.values.ravel())

SVR(C=100000.0, cache_size=200, coef0=0.0, degree=3, epsilon=1e-10,
  gamma='auto', kernel='rbf', max_iter=-1, shrinking=True, tol=1e-05,
  verbose=False)
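To get a feel for `SVR` away from the house price data, here is a minimal sketch that fits a noise-free sine curve. The synthetic data and the parameter values are assumptions chosen for illustration, not tuned recommendations.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic one-dimensional regression problem: y = sin(x).
X = np.linspace(0, 2 * np.pi, 200).reshape(-1, 1)
y = np.sin(X).ravel()

# C controls how heavily errors are penalised; epsilon is the width of the
# tube around the prediction inside which errors are ignored.
toy_model = SVR(kernel="rbf", C=10.0, epsilon=0.01)
toy_model.fit(X, y)

# score() returns the r2 score, here measured on the training points.
train_r2 = toy_model.score(X, y)
print(round(train_r2, 3))
```

Changing `C` and `epsilon` here and watching the score move is a cheap way to build intuition before trying the same on the real data.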

## Post-processing
For post-processing, we un-standardize the output.

In [9]:
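res = model.predict(test_x)

# Undo the target standardization to get back to prices
original_y = scaler_test.inverse_transform(test_y.values)
# predict() returns a 1-D array, so reshape to a column before inverting, then flatten
res_pp = scaler_test.inverse_transform(res.reshape(-1, 1)).ravel()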
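Un-standardizing is just the scaler run backwards. The sketch below shows the round trip on made-up prices: scale, pretend the scaled values are predictions, then invert. Recent scikit-learn versions expect a two-dimensional array here, hence the reshape.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up sale prices as a column vector, standardized as in the pre-processing step.
prices = np.array([[120000.0], [180000.0], [240000.0]])
scaler = StandardScaler().fit(prices)
scaled = scaler.transform(prices)

# Model predictions come back one-dimensional, so reshape to a column
# before inverting, then flatten again.
fake_predictions = scaled.ravel()
restored = scaler.inverse_transform(fake_predictions.reshape(-1, 1)).ravel()

print(restored)
```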

## Model Evaluation

So how well did we do? That's a good question. There are a few different ways of measuring success when it comes to regression, and what counts as a good error depends on the scale of the target. With house prices, if we're a couple thousand off, that's fine. If we were trying to predict people's heights, being a couple thousand centimeters off would be a nightmare.

We're going to use the [r2 score](http://scikit-learn.org/stable/modules/model_evaluation.html#r2-score-the-coefficient-of-determination) to measure success. Essentially, a score of 1.0 is the best you can get, whereas a model that always predicts the mean of the targets, disregarding the input, would get an r2 score of 0. A model that does worse than that gets a negative score.
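For reference, the r2 score is defined as $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$, where $y_i$ are the targets, $\hat{y}_i$ the predictions and $\bar{y}$ the mean of the targets. The sketch below computes it by hand on made-up numbers and compares it with scikit-learn's `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

# r2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Always predicting the mean of the targets scores 0.
r2_mean = r2_score(y_true, np.full_like(y_true, y_true.mean()))

# A model that is worse than the mean goes negative.
r2_bad = r2_score(y_true, y_true[::-1])

print(r2_manual, r2_score(y_true, y_pred), r2_mean, r2_bad)
```

Because the score is a ratio of two sums of squares, it comes out the same whether it is computed on standardized values or on the original prices.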

Remember, r2 score is only one method of evaluating our model. There are others and a combination of these might give a better representation of how your model is really doing. But, for now we're just going to stick with r2 score.

In [10]:
from sklearn.metrics import r2_score
print("r2:", r2_score(test_y.values, res))

r2: 0.781302236075


0.78 isn't too bad. Let's look at the results and see how we did.

In [11]:
[("target:", target, "prediction:", int(pred)) for target, pred in zip(original_y, res_pp)]

[('target:', array([ 119000.]), 'prediction:', 126657),
 ('target:', array([ 125500.]), 'prediction:', 128939),
 ('target:', array([ 122500.]), 'prediction:', 110330),
 ('target:', array([ 146800.]), 'prediction:', 163371),
 ('target:', array([ 144500.]), 'prediction:', 167481),
 ('target:', array([ 155000.]), 'prediction:', 174500),
 ('target:', array([ 164700.]), 'prediction:', 159277),
 ('target:', array([ 150500.]), 'prediction:', 147520),
 ('target:', array([ 205000.]), 'prediction:', 219763),
 ('target:', array([ 105000.]), 'prediction:', 114932),
 ('target:', array([ 214000.]), 'prediction:', 237142),
 ('target:', array([ 124000.]), 'prediction:', 127136),
 ('target:', array([ 148000.]), 'prediction:', 165923),
 ('target:', array([ 130000.]), 'prediction:', 134696),
 ('target:', array([ 118000.]), 'prediction:', 118107),
 ('target:', array([ 194000.]), 'prediction:', 185851),
 ('target:', array([ 107000.]), 'prediction:', 142417),
 ('target:', array([ 79900.]), 'prediction:', 10

You can see that some of these have turned out well... others, not so much. But an r2 score of 0.78 feels like a good challenge to try and beat, particularly given the number of arbitrary decisions in pre-processing. Let us know how you get on :)
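If you want a second opinion alongside the r2 score, error metrics in the original units are easy to read. The sketch below computes the mean absolute error and the root mean squared error on the 17 complete rows printed above. That is a small sample, so the numbers describe those rows only, not the whole test set.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# The 17 complete target/prediction pairs copied from the output above.
targets = np.array([119000, 125500, 122500, 146800, 144500, 155000, 164700,
                    150500, 205000, 105000, 214000, 124000, 148000, 130000,
                    118000, 194000, 107000], dtype=float)
predictions = np.array([126657, 128939, 110330, 163371, 167481, 174500, 159277,
                        147520, 219763, 114932, 237142, 127136, 165923, 134696,
                        118107, 185851, 142417], dtype=float)

# Average size of the miss, in the same units as the prices.
mae = mean_absolute_error(targets, predictions)

# Like MAE, but large misses are penalised more heavily.
rmse = np.sqrt(mean_squared_error(targets, predictions))

# Which of these rows did we get most wrong?
errors = np.abs(targets - predictions)
worst = int(errors.argmax())

print("MAE:", round(mae), "RMSE:", round(rmse))
print("worst row:", targets[worst], predictions[worst])
```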