In [None]:
%pylab inline

## Machine learning basics

* Modelling
* Model correctness
* Overfitting/underfitting
* Training/testing data
* Sklearn basics

## What is a model?

* A **simplified** representation

## Model types

* Regression
  * What is the value of 'Y' at 'X'?
* Classification
  * Is it an 'X'?
* Clustering
  * Is this closer to 'X' or 'Y'?

## Regression

* Using a model to predict numerical data
  * Salaries, statistics, age, sizes, etc.

## Example: Linear regression

![](https://upload.wikimedia.org/wikipedia/commons/thumb/0/0e/Linear_Function_Graph.svg/300px-Linear_Function_Graph.svg.png)

## Example: Plotting number of murders and US science spending

* Pandas: A library for reading data files
  * We'll cover this much more in depth later
  
```python
import pandas
```

In [None]:
import pandas as pd
%matplotlib inline
data = pd.read_csv("data/science.csv")
data

In [None]:
data.plot()

In [None]:
# Oops, what I really wanted was the science spending on the x axis and the suicides on the y axis
data.plot(x = 1, y = 2)

In [None]:
# Ok, but why the line? Let's do a scatter plot
data.plot.scatter(x = 1, y = 2)

## Uh, looks close to a line, right?! Let's try to draw a straight line between the points


In [None]:
import matplotlib.pyplot as plt
data.plot.scatter(x = 1, y = 2)
plt.plot([18000, 30000], [5500, 9300])

## We can now do predictions!

We can simply look at the graph to find out how many murders we will have if we change the US science spending.

* Unfortunately it's pretty hard to read it out graphically, so let's get the formula

$y = \alpha x + \beta$

1. We can find the slope $\alpha$ from the two end points of the line: rise over run

2. And then we can plug the slope and the point $(18000, 5500)$ into the formula to solve for $\beta$
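
As a rough sketch, the slope and intercept of the hand-drawn line can be computed from its two end points, $(18000, 5500)$ and $(30000, 9300)$. The helper name `predict` and the input value `24000` are made up for illustration; they are not part of the original notebook.

```python
# End points of the hand-drawn line from the plot above
x1, y1 = 18000, 5500
x2, y2 = 30000, 9300

# Slope: rise over run
alpha = (y2 - y1) / (x2 - x1)

# Intercept: solve y1 = alpha * x1 + beta for beta
beta = y1 - alpha * x1


def predict(x):
    """Evaluate the hand-drawn line y = alpha * x + beta (hypothetical helper)."""
    return alpha * x + beta


print(alpha, beta)
print(predict(24000))  # 24000 is an arbitrary x value halfway along the line
```

This only describes the line we guessed by eye, so the numbers are as good as that guess and no better.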

## Introducing sklearn

* **Scikit-learn** is a pretty cool machine learning framework with a lot of tools
  * https://scikit-learn.org/

In [None]:
import sklearn

## Improving our model

* Before, I was just guessing at what a good model might be. Luckily, `sklearn` is much better at guessing than I am.
  * We can use `sklearn` to construct a `LinearRegression` model
  
* **Regression** here means fitting a function to the data
  * So we are actively trying to find the linear model ($\alpha x + \beta$) that fits our data best, measured by the squared error
  
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression

In [None]:
import sklearn.linear_model
sklearn.linear_model.LinearRegression?

## Fitting a model

* Now that we know what model to use, we have to **train** it or **fit** it to our data

In [None]:
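xs, actual = data.iloc[:, 1], data.iloc[:, 2]  # assumed: the same columns as in the scatter plot
xs_reshape = np.array(xs).reshape(-1, 1)  # one column, one row per data point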

In [None]:
model = sklearn.linear_model.LinearRegression()
model.fit(xs_reshape, actual)

**Note:** sklearn expects `xs` as a 2D array with one row per sample and one column per feature (for reasons we will see later)

In [None]:
model.coef_

In [None]:
model.intercept_

## You can now use the model to predict

In [None]:
predicted = model.predict(xs_reshape)
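
The whole fit-and-predict cycle can be sketched on made-up data, which makes it easy to check that the model recovers the line we put in. The `toy_*` names and the numbers below are assumptions for illustration only; they are not the science data from above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data lying exactly on y = 0.5 * x + 2 (not the science data)
toy_xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
toy_ys = 0.5 * toy_xs + 2.0

# sklearn wants a 2D array: one row per sample, one column per feature
toy_xs_2d = toy_xs.reshape(-1, 1)

toy_model = LinearRegression()
toy_model.fit(toy_xs_2d, toy_ys)

print(toy_model.coef_)       # slope(s), one per feature
print(toy_model.intercept_)  # intercept

# Predict for an x value the model has not seen
print(toy_model.predict([[10.0]]))
```

Because the toy points sit exactly on a line, the fitted slope and intercept should come out very close to 0.5 and 2.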

## Classification

* What if we don't want numbers but classes?
  * Cars, weekdays, emotions, etc.

## Example: decision tree classifier

![](images/decision-tree.png)

## Example: predicting flower classes

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html#sklearn.datasets.load_iris

In [None]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data
y = iris.target

In [None]:
iris.target_names

In [None]:
model = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0)
model.fit(X, y)

In [None]:
model.predict([[6.1, 2.8, 4. , 1.3]])
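
`predict` returns a class index rather than a name. A minimal sketch of turning that index into a flower name, assuming the same classifier settings as above (the names `clf`, `sample`, `predicted_index` and `predicted_name` are made up for this example):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0)
clf.fit(iris.data, iris.target)

# One flower: sepal length, sepal width, petal length, petal width (in cm)
sample = [[6.1, 2.8, 4.0, 1.3]]

# predict returns one class index per input row
predicted_index = clf.predict(sample)[0]

# target_names maps class indices to flower names
predicted_name = iris.target_names[predicted_index]
print(predicted_index, predicted_name)
```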

In [None]:
from graphviz import Source
from sklearn.tree import export_graphviz
from IPython.display import SVG
graph = Source(export_graphviz(model, out_file=None, feature_names=iris.feature_names))
SVG(graph.pipe(format='svg'))

## Sklearn

https://scikit-learn.org/stable/index.html

## Exercise

* Import data using `sklearn.datasets.load_diabetes`:
```python
from sklearn.datasets import load_diabetes
X = load_diabetes().data
Y = load_diabetes().target
```
* Construct a `sklearn.linear_model.LinearRegression` model
* Fit it with the data
* What is the predicted disease progression given this input?
```python
[ 0.01628068, -0.04464164,  0.01750591, -0.02288496,  0.06034892,
  0.0444058 ,  0.03023191, -0.00259226,  0.03723201, -0.0010777 ]
```

## Recap

* Model types
    * Regression
    * Classification
    * Clustering
* Models are
    1. Constructed
    2. Trained
    3. Tested
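
The three steps can be sketched in a few lines. This is one possible way to do it, assuming we hold back part of the iris data for testing with `train_test_split`; the split ratio and the variable names are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold back a quarter of the data for testing (the ratio is an arbitrary choice)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 1. Construct
tree = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0)

# 2. Train, using only the training part
tree.fit(X_train, y_train)

# 3. Test: fraction of correct predictions on data the model has not seen
accuracy = tree.score(X_test, y_test)
print(accuracy)
```

Scoring on held-back data gives a fairer picture than scoring on the data the model was trained on.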

## How do we know that the models are good?