In [30]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


## The problem of overfitting

![](images/overfitting.png)

## Populations

![](images/pop1.png)

![](images/pop2.png)

![](images/pop3.png)

## Your model versus the population

A sample is a **subset** of a population.

You will likely **never** have data that covers the entire population.

That means that you will likely **never** be able to represent the entire population!

Your model will lie!

## Car owners and voters

In 1936, *millions* of mock ballots were mailed to car owners across the USA to learn who would win the presidential election.

The Republicans were the *clear* winner in the mock ballots, but the Democrats won the election.

What was wrong?

## The problem of generalisation

If X% of a sample has Y, it does **not** mean that X% of the population has Y!

**Always** ask yourself: is your data representative?
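
A small simulation can make the point concrete. The sketch below builds a hypothetical population in which car owners vote differently from everyone else, then compares a ballot sent only to car owners with a random sample of the same size. All proportions are made up for illustration and are not historical figures.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: 1,000,000 voters, 30% of whom own a car.
n = 1_000_000
owns_car = rng.random(n) < 0.30

# Assumed preferences (invented for this example): car owners vote
# Republican with probability 0.60, everyone else with probability 0.35.
p_republican = np.where(owns_car, 0.60, 0.35)
votes_republican = rng.random(n) < p_republican

# Biased sample: only car owners receive a ballot.
biased_sample = votes_republican[owns_car]

# Random sample of the same size, drawn from the whole population.
random_idx = rng.choice(n, size=biased_sample.size, replace=False)
random_sample = votes_republican[random_idx]

population_share = votes_republican.mean()
biased_share = biased_sample.mean()
random_share = random_sample.mean()

print(f"Republican share in population:    {population_share:.3f}")
print(f"Republican share among car owners: {biased_share:.3f}")
print(f"Republican share in random sample: {random_share:.3f}")
```

Under these assumptions the car-owner sample predicts a Republican win even though the population as a whole votes the other way, while the random sample stays close to the true share.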

## Back to overfitting 

In ML you learn from data that is **not** the entire population.

![](images/overfitting.png)

So, how can we get our model to work on the entire population?

Answer: By hiding some data from our model, and saving it for future tests.

## Training and testing data

We now have a split between 
* **Training data**: the data that the model sees
* **Testing data**: the data that the model is tested against

Note: the model should **never** train on the testing data
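
The idea behind the split can be sketched by hand with NumPy before reaching for a library helper. The toy arrays and the 30% test share below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data set: 10 rows, 2 features (values are arbitrary).
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Shuffle the row indices, then reserve 30% of them for testing.
indices = rng.permutation(len(X))
n_test = round(0.3 * len(X))
test_idx = indices[-n_test:]
train_idx = indices[:-n_test]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# No row may appear on both sides of the split.
overlap = set(train_idx.tolist()) & set(test_idx.tolist())
print(len(X_train), len(X_test), overlap)
```

The empty overlap is the property that matters: every row is either seen during training or held back for testing, never both.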

## Sklearn `train_test_split`

Splitting the data into training and testing sets lets you measure how well your model generalises to data it has not seen.

But a good test score **does not guarantee** that the model generalises to the whole population!

In [None]:
from sklearn.datasets import load_iris

In [None]:
X = load_iris().data
X

In [None]:
y = load_iris().target
y

In [None]:
from sklearn.model_selection import train_test_split
train_test_split(X, y)

In [None]:
x_train, x_test, y_train, y_test = train_test_split(X, y)
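
`train_test_split` accepts a few optional parameters that are worth knowing. The sketch below is one possible configuration, not the only sensible one: `test_size`, `random_state`, and `stratify` are real parameters of the function, but the values chosen here are just examples.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

x_train, x_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # hold out 20% of the rows (the default is 0.25)
    random_state=0,   # fixed seed so the split is reproducible
    stratify=y,       # keep the class proportions equal in both parts
)

print(x_train.shape, x_test.shape)
print(np.bincount(y_test))  # number of test rows per class
```

With 150 iris rows this leaves 120 rows for training and 30 for testing, and stratification gives each of the three classes 10 test rows.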

In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x_train, y_train)

In [None]:
model.predict(x_train)

In [None]:
model.score(x_train, y_train)

In [None]:
model.score(x_test, y_test)
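
Because the split is random, the two scores above change from run to run. A rough way to see how much is to repeat the split with several seeds, as sketched below. For `LinearRegression`, `model.score` returns the R² coefficient. The iris target is really a class label, so the regression is only used here to stay close to the cells above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

train_scores, test_scores = [], []
for seed in range(5):
    # A different seed gives a different random split of the same data.
    x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=seed)
    model = LinearRegression().fit(x_train, y_train)
    train_scores.append(model.score(x_train, y_train))
    test_scores.append(model.score(x_test, y_test))

for seed, (tr, te) in enumerate(zip(train_scores, test_scores)):
    print(f"seed={seed}  train R^2={tr:.3f}  test R^2={te:.3f}")
```

A single test score is therefore an estimate with some noise in it, which is one reason a split alone cannot guarantee generalisation.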

## Evaluating a model

* Models are supposed to be as accurate as possible
  * `model.score`
  * Read the documentation

* But not *too* accurate
  * Overfitting

## The overfitting curve

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/1f/Overfitting_svg.svg/1280px-Overfitting_svg.svg.png" style="width:40%"/>
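
A curve of this kind can be approximated by fitting models of increasing complexity and recording the error on both parts of the split. The sketch below uses synthetic data (a noisy sine wave, an assumption made purely for illustration) and polynomial regression, where the degree plays the role of model complexity.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Synthetic data: 40 points on a sine wave with Gaussian noise added.
x = np.sort(rng.uniform(0, 1, 40)).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.2, 40)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.5, random_state=0
)

train_errors, test_errors = {}, {}
for degree in (1, 3, 9, 15):
    # Higher degree = more flexible model.
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_errors[degree] = mean_squared_error(y_train, model.predict(x_train))
    test_errors[degree] = mean_squared_error(y_test, model.predict(x_test))
    print(f"degree={degree:2d}  train MSE={train_errors[degree]:.4f}  "
          f"test MSE={test_errors[degree]:.4f}")
```

The training error keeps shrinking as the degree grows, because a more flexible model can always fit the training points at least as well. The test error typically stops improving at some point and may start to rise, which is the signature of overfitting.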

## Exercise

* Import `science.csv` into a pandas DataFrame
* Split the input (X) and target (y) using `train_test_split`
* Train the model on the training data
* Score the model based on the testing data

In [26]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import pandas as pd

df = pd.read_csv('data/science.csv')

x = df['US science spending'].values.reshape(-1, 1)
y = df['Suicides']
x_train, x_test, y_train, y_test = train_test_split(x, y)

In [None]:
model = LinearRegression()
model.fit(x_train, y_train)
model.score(x_test, y_test)

## Other sklearn metrics

`model.score` uses a *default* metric: R² for regressors such as `LinearRegression`, and accuracy for classifiers. But there are numerous others.

https://sklearn.org/modules/classes.html#module-sklearn.metrics

Which metric is appropriate usually depends on the type of your model (classification, regression, etc.)

* **READ THE DOCUMENTATION**

In [18]:
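import sklearn.metrics  # import the submodule explicitly; a bare `import sklearn` may not load it
# Note: newer scikit-learn releases have removed SCORERS;
# there, sklearn.metrics.get_scorer_names() lists the available scorer names instead.
sklearn.metrics.SCORERS.keys()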

dict_keys(['explained_variance', 'r2', 'neg_median_absolute_error', 'neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_mean_squared_log_error', 'accuracy', 'roc_auc', 'balanced_accuracy', 'average_precision', 'neg_log_loss', 'brier_score_loss', 'adjusted_rand_score', 'homogeneity_score', 'completeness_score', 'v_measure_score', 'mutual_info_score', 'adjusted_mutual_info_score', 'normalized_mutual_info_score', 'fowlkes_mallows_score', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted'])
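
Many of these scorers correspond to plain functions in `sklearn.metrics` that take the true and the predicted values. The sketch below computes three regression metrics on made-up numbers, chosen so that the results are easy to check by hand.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up targets and predictions; the errors are -1, 0, 1 and 2.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

mae = mean_absolute_error(y_true, y_pred)  # (1 + 0 + 1 + 2) / 4
mse = mean_squared_error(y_true, y_pred)   # (1 + 0 + 1 + 4) / 4
r2 = r2_score(y_true, y_pred)              # 1 - 6 / 20

print(f"MAE={mae:.2f}  MSE={mse:.2f}  R^2={r2:.2f}")
# prints: MAE=1.00  MSE=1.50  R^2=0.70
```

In practice `y_pred` would come from `model.predict(x_test)` and `y_true` would be `y_test`. The scorer names prefixed with `neg_` are negated versions of error metrics, so that a higher score always means a better model.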