# Scikit-learn 1

Introduction to working with `scikit-learn`, _the_ machine learning library for Python.

## 0 Setting up your system

Here is the recommended way to set up your system:

- install `miniconda`, get it from: https://conda.io/miniconda.html
- use it to install at least: `numpy`, `scikit-learn`, `seaborn`, `matplotlib`, `pandas`:

```
conda install numpy scikit-learn seaborn matplotlib pandas
```

- install `jupyter`, if you do not have it yet - also with `conda`:

```
conda install jupyter
```

- start up jupyter with:

```
jupyter notebook
```

- this should open your browser automatically; if it does not, open a browser window and navigate to the `localhost` address and port printed in the terminal

## 1 Loading Datasets

This section explains how you can load datasets for use with `scikit-learn`.

### Loading a toy data set

`scikit-learn` ships with a number of toy datasets that you can simply import. Here is an example, the Iris dataset:

In [None]:
# import data loading function (without actually loading the data yet)
from sklearn.datasets import load_iris
load_iris

In [None]:
# load data and assign it to a variable
iris = load_iris() # convention: variable name describes data set, instead of generic
type(iris)

All toy datasets come in the same format and have the same attributes (well, at least `data` and `target`):

In [None]:
# what attributes does the `iris` variable have?
dir(iris)

In [None]:
# familiarize yourself with the data set before you do anything else
# note: see how statistics is relevant here?
print(iris.DESCR)

The `data` attribute behaves like a list of lists, where the outer list is a list of observations ("samples"). Each observation is itself a list of features. Or, more precisely, an observation is a list of feature **values**.

In [None]:
# print the feature values of the first observation
iris.data[0]

What do those feature values correspond to? The `feature_names` attribute explains this:

In [None]:
iris.feature_names
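
To see at a glance which value belongs to which feature, the names and the first observation can be paired up. This is a minimal sketch using only the attributes introduced above; the variable name `first_observation` is chosen here for illustration.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# pair each feature name with the matching value of the first observation
first_observation = dict(zip(iris.feature_names, iris.data[0]))
for name, value in first_observation.items():
    print("{}: {}".format(name, value))
```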

Important convention: the list of observations is assigned to the variable `X`, and the list of targets (or labels, or responses, or classes, even though that last one is sloppy) is assigned to `y`:

In [None]:
X = iris.data
y = iris.target
# first 10 observations together with their class:
print("{}\t\t{}".format("Observation", "Response"))
for observation, response in zip(X[:10], y[:10]):
    print("{}\t{}".format(observation, response))

What does a response of 0 mean?

In [None]:
print(set(y))
iris.target_names

**Question for you: why is `X` uppercase, when `y` is not?**

Both the observations and responses are of type `numpy.ndarray`, a flexible container for scalars, vectors and matrices:

In [None]:
type(X), type(y)

`ndarray`s have a convenient attribute `shape` that gives you the dimensions of the object in question:

In [None]:
X.shape, y.shape

**Question for you: what do those numbers mean? Which of them mean columns, which mean rows?**

### Loading an external dataset in a standard format like CSV

Loading an existing dataset for use with `scikit-learn` is also easy. We use a handy library called `pandas` that lets you manipulate data in an R-like fashion:

In [None]:
# always import as `pd`, this is a convention
import pandas as pd

Import an example dataset from the UCI Machine Learning Repository. More information is available here: https://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.names

In [None]:
# read CSV directly from URL
data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data")

Variables created in this way are `pandas` `DataFrame` objects that have a variety of convenient attributes, for instance the `head` or `tail` or `describe` methods:

In [None]:
print(type(data))
# to display all other attributes of the data frame: dir(data)
data.head()

As you can see, something is not quite right. The first observation is mistaken for the row of column headers. To fix this:

In [None]:
# define column names
col_names = ['patient_age', 'operation_year', 'auxiliary_node_count', 'survival_status']

# read CSV directly from URL, indicate that there are no headers
data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data", header=None, names=col_names)
data.head()
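
The effect of `header=None` can also be checked without a network connection. The following is a small sketch on an in-memory CSV; the three rows are made up for illustration and are not taken from the Haberman data, and the names `csv_text`, `with_default` and `with_names` are hypothetical.

```python
import io
import pandas as pd

# made-up rows in the same four-column layout, for illustration only
csv_text = "40,60,2,1\n52,63,0,1\n61,66,5,2\n"
col_names = ['patient_age', 'operation_year', 'auxiliary_node_count', 'survival_status']

# default behaviour: the first row is consumed as column headers
with_default = pd.read_csv(io.StringIO(csv_text))
# header=None keeps every row as data and applies our own names
with_names = pd.read_csv(io.StringIO(csv_text), header=None, names=col_names)

print(with_default.shape)  # one row fewer than the text contains
print(with_names.shape)
```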

Data frames also have a `shape` attribute, and named columns can be accessed by their name, either as an attribute ("dot notation") or with "bracket notation":

In [None]:
print(data.shape)
# access last column directly
print(data.survival_status.head())

**Question for you: Again, what does the output of `shape` mean, in your opinion?**

As with a toy dataset, the convention is to assign the data to `X` and `y`:

In [None]:
# define columns that are features
feature_cols = ['patient_age', 'operation_year', 'auxiliary_node_count']
X = data[feature_cols]
y = data.survival_status # alternatively, data['survival_status']

In [None]:
print(X.head())
print()
print(y.head())

### Loading data that is neither toy nor in a standardized format

If your data is none of the above, it is your job to process it until you can assign it to `X` and `y`, both of which should be array-like: typically of type `numpy.ndarray`, or `pandas` objects as in the previous example.
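
What this processing looks like depends entirely on your data. As a rough sketch, assume the raw data arrives as a plain Python list of `(feature values, label)` records; the records, the label strings and the name `label_names` below are hypothetical.

```python
import numpy as np

# hypothetical raw records: (list of feature values, label)
records = [
    ([5.0, 1.2], "yes"),
    ([3.4, 0.7], "no"),
    ([4.1, 2.2], "yes"),
]

# collect feature values into a 2D array: one row per observation
X = np.array([features for features, _ in records], dtype=float)

# map string labels to integers, keeping the mapping for later lookup
label_names = sorted({label for _, label in records})
y = np.array([label_names.index(label) for _, label in records])

print(X.shape, y.shape)
print(label_names, y)
```

Keeping `label_names` around plays the same role as `target_names` in the toy datasets: it lets you translate an integer response back into a readable label.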

## 2 Visualizing datasets

Once data is loaded into memory, it can be useful to visualize it, before you train any machine learning system.

In [None]:
import seaborn as sns # convention, always import as `sns`
import matplotlib.pyplot as plt # same here

# matplotlib "magic" command
%matplotlib inline

### Visualizing regression data

In [None]:
# load a new dataset, this time a REGRESSION problem
# note: `load_boston` was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.datasets import load_boston
boston = load_boston()

# convert to data frame, check output
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['price'] = boston.target
df.head()

One way to get insight from regression data is a so-called "pair plot" that emphasizes the _pairwise_ relationships in your dataset. Seaborn can generate such plots:

In [None]:
# regression plot for only the first 3 features, plotted against the target variable
sns.pairplot(df, x_vars=boston.feature_names[:3], y_vars='price', height=7, aspect=0.7, kind='reg')

Check out http://www.neural.cz/dataset-exploration-boston-house-pricing.html for more cool examples specifically with the Boston House Pricing data set and the libraries we have used.

### Visualizing classification data

If your data does not describe a regression problem (i.e. if the response variable is not continuous, but categorical), then regression plots do not make much sense. Here is another example, taken from the seaborn pairplot documentation (http://seaborn.pydata.org/generated/seaborn.pairplot.html):


In [None]:
iris = load_iris() # function from scikit-learn
sns.set(style="ticks", color_codes=True)

# `df` is the conventional name for a dataframe variable
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target
df.head()

In [None]:
# target variable is only used for coloring, both x axis and y axis are feature names
sns.pairplot(df, hue="species", height=3, x_vars=iris.feature_names, y_vars=iris.feature_names)

There is a lot to discover in those plots, for instance:
- how well individual features can tell apart observations of different classes
- in the diagonal: how feature values are distributed overall, and in each class

If you're interested, find out more: http://seaborn.pydata.org/tutorial.html, http://pandas.pydata.org/ and http://scikit-learn.org/stable/datasets/index.html.

## 3 Training and evaluating a classifier

This section explains how simple training and evaluation work in `scikit-learn`.

### Your first classifier

Once you have `X` and `y` and have some intuitions about your data, train your first classifier.

In [None]:
# review, make sure correct dataset is loaded
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target

X.shape, y.shape

`scikit-learn` offers a range of different classifiers, which are organized into different modules. A classifier we already know is KNN ("k nearest neighbors").

Classifier classes are imported from modules, then instantiated. Displaying the object shows its parameters; note that recent versions of `scikit-learn` only display the values that differ from the defaults.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=5) # convention: call classifier instance `clf`, or describe estimator type
clf
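
To see every parameter including the implicit defaults, regardless of how the object is displayed, the `get_params` method returns them as a dictionary. A short sketch:

```python
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=5)

# get_params() returns all constructor parameters, including defaults
params = clf.get_params()
for name in sorted(params):
    print("{} = {}".format(name, params[name]))
```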

Then, call the `fit` method to actually learn the relationship between `X` and `y`:

In [None]:
clf.fit(X, y)

**Question for you: In the case of KNN, what does "learning" mean?**

After a model is "fitted", you can predict the response for new observations. The input must be a list of lists, with the same number of features per observation as `X`:

In [None]:
clf.predict([[0.2, 0.4, 0.5, 0.1]])

**Question for you: In the case of KNN, what does "predicting" mean? And: What does this output mean?**

### Simple training and testing split

Since the ultimate goal is to _generalize_ well (this is **extremely** important), there must be a way to evaluate if our model performs well on unseen data. The first method we look at is to simply hold out part of the data (meaning: not show it to the classifier during training), and then evaluate the model on the held-out data.

Split up your observations and responses into a training and testing part each:

In [None]:
from sklearn.model_selection import train_test_split
# variable names are conventional, please also adopt them
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) # 20 percent go into the test set

In [None]:
for v in (X_train, X_test, y_train, y_test):
    print(v.shape)
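
Because `train_test_split` shuffles the data randomly, the shapes stay the same between runs but the contents of each part change. If you want to get the same split every time, one option is the `random_state` parameter. The sketch below assumes an arbitrary seed of `42`; the names `split_a` and `split_b` are only used for this comparison.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

# the same random_state yields the same split on every call
split_a = train_test_split(X, y, test_size=0.2, random_state=42)
split_b = train_test_split(X, y, test_size=0.2, random_state=42)

X_train, X_test, y_train, y_test = split_a
print(X_train.shape, X_test.shape)
print(np.array_equal(split_a[3], split_b[3]))
```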

Now, in order to evaluate properly, we need to train a classifier that has only seen the training part of the data, and is oblivious to the correct test set answers.

Then, predict the correct answers for the test observations:

In [None]:
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test) # always use `y_pred` as the variable name

**Question for you: What is the difference between `y_test` and `y_pred`?**

In [None]:
# compare `y_test` and `y_pred`:
print("Actual\tPredicted")
for t, p in zip(y_test[:10], y_pred[:10]):
    print("{}\t{}".format(t, p))

### Evaluation by calculating accuracy

Now that we have the predictions for the test set examples and the true answers, we can evaluate the performance of the classifier automatically. There are several reasonable metrics; accuracy is by far the simplest:

In [None]:
from sklearn import metrics # module dedicated to measuring performance
metrics.accuracy_score(y_test, y_pred)

**Question for you: How is accuracy computed?**

If you cannot answer this question, look it up. Then try to implement an accuracy function yourself: one that takes `y_pred` and `y_test` as inputs.

## 4 Outlook

This method of model fitting and evaluation has several problems, which we will discuss in later classes. To give you some food for thought:
- We have split the data randomly into training and testing examples. Therefore, it is possible that the test set contains only "easy" examples or only hard ones. Isn't this a bit unfair?
- Each classifier has hyperparameters that need to be set by the user. For instance, `n_neighbors` in our case. We have set it to `5`, an arbitrary decision. Can we do better than that?
- Does accuracy work for a regression problem? Why not?
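
To make the second point a little more concrete, here is a rough sketch that fits KNN with a few different values of `n_neighbors` and reports the accuracy on one held-out split. The candidate values `1`, `5` and `15` and the seed `0` are arbitrary assumptions. The sketch only shows that the choice can matter; it is not a proper way to select a hyperparameter.

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

# fit one classifier per candidate value and record its test accuracy
scores = {}
for k in (1, 5, 15):
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    scores[k] = metrics.accuracy_score(y_test, clf.predict(X_test))

for k, score in scores.items():
    print("n_neighbors={}: accuracy={:.3f}".format(k, score))
```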