# Preface
If you are new to Colab, please familiarize yourself with it by starting with the [introduction](https://colab.research.google.com/notebooks/intro.ipynb) and then working through a [small tutorial](https://colab.research.google.com/drive/1umhPVtUWH8yHD2l9A_G4fdttmgmSgC0Q).

Please always save a copy of the notebook to your Google Drive before you start working, and only edit that copy. In addition, always switch the runtime to `python 3`; for most exercises it is also recommended to switch to a GPU runtime.




# Exercise 1 - Machine Learning Basics

In the first part of this exercise we will apply linear regression to a dataset of brain properties. In the second part we will apply logistic regression to classify different types of iris flowers.

This exercise is based on ["Learning scikit-learn -- An Introduction to Machine Learning in Python @ PyData Chicago 2016"](https://github.com/rasbt/pydata-chicago2016-ml-tutorial).

Before we start, we need to download the two datasets named "dataset_brain.txt" and "dataset_iris.txt" from a shared Google Drive to the Colab virtual machine (or to our local machine) so that they are available:

In [None]:
import gdown

url = 'https://drive.google.com/uc?id=1W7s11mAK3PByOJIxPRpr1cIGhxsriI4c'
output = 'dataset_brain.txt'
gdown.download(url, output, quiet=False)

url = 'https://drive.google.com/uc?id=1lBQ55AHVbX29bEMNfLOunOE5PwYAKDpg'
output = 'dataset_iris.txt'
gdown.download(url, output, quiet=False)


In [None]:
!ls

# Table of Contents

* [1 Linear Regression](#2-Linear-Regression)
    * [Loading the dataset](#Loading-the-dataset)
    * [Preparing the dataset](#Preparing-the-dataset)
    * [Fitting the model](#Fitting-the-model)
    * [Evaluating the model](#Evaluating-the-model)
* [2 Classification](#3-Introduction-to-Classification)
    * [The Iris dataset](#The-Iris-dataset)
    * [Class label encoding](#Class-label-encoding)
    * [Scikit-learn's in-build datasets](#Scikit-learn's-in-build-datasets)
    * [Test/train splits](#Test/train-splits)
    * [Logistic Regression](#Logistic-Regression)
    * [K-Nearest Neighbors](#K-Nearest-Neighbors)

# 1  Linear Regression

## Loading the dataset

We will use a dataset from an old publication that studied the relation between brain weight and head size for different genders and age ranges.

Source: R.J. Gladstone (1905). "A Study of the Relations of the Brain to
the Size of the Head", Biometrika, Vol. 4, pp. 105-123

The dataset is stored in a file called
**`dataset_brain.txt`**

Description: Brain weight (grams) and head size (cubic cm) for 237 adults classified by gender and age group.

Variables/Columns
- Gender (`1`=Male, `2`=Female)
- Age Range (`1`=20-46, `2`=46+)
- Head size (cm$^3$)
- Brain weight (grams)


### Task 1: Print the first 30 lines of the dataset and inspect it
*hints*
- use `open("path/to/file")`
- `readlines` is a useful method

We will use [**`pandas`**](https://pandas.pydata.org/pandas-docs/stable/) to read in the dataset.


> pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal. (quoted from web page)




In [None]:
import pandas as pd

The file contains tabular values separated by whitespace, a variant of the 'comma separated values' (CSV) format, and we will use a pandas [**`DataFrame`**](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html#pandas.DataFrame) to handle the data.

In [None]:
df = pd.read_csv('dataset_brain.txt',
                 encoding='utf-8',
                 comment='#',
                 sep=r'\s+')  # raw string, so that \s is passed on as a regular expression
df.head(10)

*additional comments:*

The cell above reads a text file from disk and converts it into a DataFrame. The parameter `comment` specifies the character that marks the rest of a line as a comment, so that it is not converted into data entries; `sep` specifies how data entries are separated. `\s+` is a regular expression that matches one or more whitespace characters (blanks or tabs) between data entries. `sep` needs to be chosen according to your data format and could be another regular expression or a single separating character such as `;`, `,` or `\t` (tab only).
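As a small illustration of how `sep` and `comment` work together, the sketch below parses two in-memory strings instead of files. The sample text and the column names `a` and `b` are made up for this example and are not taken from `dataset_brain.txt`.

```python
import io
import pandas as pd

# Hypothetical sample data (not the real dataset):
# whitespace-separated values with a leading comment line
text_ws = """# this line is skipped because of comment='#'
a   b
1   2
3   4
"""
df_ws = pd.read_csv(io.StringIO(text_ws), comment='#', sep=r'\s+')

# The same values, this time separated by semicolons
text_sc = "a;b\n1;2\n3;4\n"
df_sc = pd.read_csv(io.StringIO(text_sc), sep=';')

print(df_ws)
print(df_ws.equals(df_sc))  # both calls should yield the same table
```

Only the `sep` argument differs between the two calls; choosing the wrong separator would typically leave all values in a single column.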

Let's look at the relation of the brain weight to the head size by plotting them in a 2D scatter plot. We will use [**`matplotlib`**](https://matplotlib.org/) for that.



In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

We can access the columns of the pandas DataFrame simply by using their names as keys.

In [None]:
plt.scatter(df['head-size'], df['brain-weight'])
plt.xlabel('Head size (cm$^3$)')
plt.ylabel('Brain weight (grams)');

## Preparing the dataset

In order to use the dataset, we need to retrieve a [**`numpy`**](http://www.numpy.org/) array containing only the values.

In [None]:
import numpy as np

In [None]:
y = df['brain-weight'].values
print(y)

How many data points do we have?

In [None]:
y.shape

The same with the head size:

In [None]:
X = df['head-size'].values
print(X)
X.shape

In machine learning frameworks such as *scikit-learn*, *tensorflow* and *keras*, it is a convention that the first data dimension indexes the samples and the second one the features. Our array currently has only one dimension: we have 237 samples, each containing only one feature value. To comply with the convention, we would like an array of shape (n, 1), i.e. n rows that each contain one value:

In [None]:
X = X[:, None]
print(X)

Alternatively, you can use `X = X.reshape(len(X), 1)`
or `X = X.reshape(-1, 1)` if you know that you have only one feature, but you are not sure how many values you have. Each *reshape* call can contain at most one *-1*. The length of that axis is then inferred from the other entries.
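The three variants can be compared directly on a small array. This is a minimal sketch; the array `x` and its values are made up for illustration.

```python
import numpy as np

x = np.array([10, 20, 30])     # one-dimensional, shape (3,)

a = x[:, None]                 # add a new axis of length 1
b = x.reshape(len(x), 1)       # state both axis lengths explicitly
c = x.reshape(-1, 1)           # let numpy infer the first axis

print(x.shape, a.shape)
print(np.array_equal(a, b) and np.array_equal(b, c))
```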


We will use the machine learning tool and library [**`scikit-learn`**](http://scikit-learn.org/stable/) in the following.


A very useful functionality of scikit-learn is to easily split the dataset into a training and a testing dataset. The dataset is split randomly with seed 123; the test size is 30% and the train size 70%:

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=123)

*additional comments:*

Using a seed for randomization results in getting the same random numbers for each call. In this example, we would always get the same train-test split. This may seem like a mistake at first but is surprisingly useful for **testing your code**, as you know that changes in the result do not come from a different randomization.
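The effect of a fixed seed can be checked on a tiny example. This is a sketch on made-up arrays (`X_demo`, `y_demo`), not on the brain dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: ten samples with one feature each
X_demo = np.arange(10).reshape(-1, 1)
y_demo = np.arange(10)

# Two independent calls with the same seed
split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=123)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=123)

# Compare X_train, X_test, y_train, y_test pairwise
same = all(np.array_equal(p, q) for p, q in zip(split_a, split_b))
print(same)
print(len(split_a[0]), len(split_a[1]))  # sizes of the train and test parts
```

Leaving out `random_state` would usually give a different split on every call.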

### Task 2: Plot the training and testing datasets separately again in a 2D scatter plot including axis labels. Use different colors (option [`c`(olor)`='blue'`](https://matplotlib.org/api/colors_api.html)) and different markers (option [`marker='o'`](https://matplotlib.org/api/markers_api.html))

## Fitting the model

We would like to fit the training data now using the [LinearRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) model of scikit-learn:

This uses a linear function and the ordinary least squares method.

*comment: yes, this is pretty much the same as using `curve_fit` from `scipy` with a linear fit function like you did in the good old lab exercise days*

In [None]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

OK, what is the result of the fit?

In [None]:
# The coefficients
print('Coefficients: \n', lr.coef_)
# The intercept
print('Intercept: \n', lr.intercept_)

OK, let's plot this linear function.

In [None]:
plt.scatter(X_test, y_test,  color='blue')
plt.plot(X_test, y_pred, color='red', linewidth=3)
plt.xlabel('Head size (cm$^3$)')
plt.ylabel('Brain weight (grams)');
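To illustrate that `LinearRegression` performs an ordinary least-squares fit, the sketch below compares it with `numpy.polyfit` on synthetic data. The data, the true slope of 2 and the intercept of 1 are assumptions made only for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: a noisy straight line (values chosen for illustration only)
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, size=50)
y_demo = 2.0 * x_demo + 1.0 + rng.normal(scale=0.5, size=50)

# scikit-learn expects a 2D design matrix, hence the extra axis
model = LinearRegression().fit(x_demo[:, None], y_demo)

# Least-squares fit of a first-degree polynomial with numpy
slope, intercept = np.polyfit(x_demo, y_demo, deg=1)

print(model.coef_[0], slope)
print(model.intercept_, intercept)
```

Both approaches minimize the same sum of squared residuals, so the fitted parameters should agree up to numerical precision.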

## Evaluating the model

How do we know if the fit was good? We need to define a performance measure. One way is to calculate the **coefficient of determination**, denoted $R^2$. It is the proportion of the variance in the dependent variable that is predictable from the independent variables. It is calculated in the following way:

 <img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/eef0fc7006ba5f7df32eceeba7f1c5271e0100af">

In [None]:
ss_res = ((y_test - y_pred) ** 2).sum()          # residual sum of squares
ss_tot = ((y_test - y_test.mean()) ** 2).sum()   # total sum of squares
r2 = 1 - (ss_res / ss_tot)
print('R2 score: %.2f' % r2)

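For a reasonable model it lies between 0 and 1, and values close to 1 mean good agreement; on test data it can even become negative if the model predicts worse than simply using the mean of `y_test`. Luckily, scikit-learn has several performance measures for [regression (metrics)](http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics) already included.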

In [None]:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Coefficient of determination: 1 is a perfect prediction
print('Coefficient of determination: %.2f' % r2_score(y_test, y_pred))
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))
# The mean absolute error
print("Mean absolute error: %.2f" % mean_absolute_error(y_test, y_pred))
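As a cross-check of the definition, the sketch below computes the coefficient of determination by hand and compares it with scikit-learn's `r2_score` on made-up numbers. The helper `r2_manual` and the arrays are hypothetical and only serve this illustration. The second prediction is deliberately bad, which shows that the score can drop below 0.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_good = np.array([1.1, 1.9, 3.2, 3.8])   # close to the true values
y_bad = np.array([4.0, 3.0, 2.0, 1.0])    # reversed order, worse than the mean

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot

print(r2_manual(y_true, y_good), r2_score(y_true, y_good))
print(r2_manual(y_true, y_bad), r2_score(y_true, y_bad))
```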


# 2 Classification

## The Iris dataset

### Task 3: The Iris flower dataset is stored in file **`dataset_iris.txt`**. Read in the dataset using a **`pandas`** `DataFrame` and have a look at the first entries.
*hints*:
- look what you did for the first data inspection in Task 1
- what is the separator in the iris dataset?

We now need to create a 150$\times$4 design matrix containing only our feature values. In order to do that, we need to strip the class column from the dataset. We use the [**`iloc`**](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html) indexer for that:

> `DataFrame.iloc`
>
> Purely integer-location based indexing for selection by position.



In [None]:
X = df.iloc[:, :4]
X

And now we get a 150$\times$4 numpy array (design matrix) by using the `values` attribute:

In [None]:
X = X.values
X
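The position-based selection with `iloc` used above can be tried out on a small, made-up DataFrame. The name `toy` and its columns are hypothetical and unrelated to the iris file.

```python
import pandas as pd

# Made-up table: two feature columns followed by a label column
toy = pd.DataFrame({'f1': [1, 2, 3],
                    'f2': [4, 5, 6],
                    'label': ['a', 'b', 'a']})

features = toy.iloc[:, :2]   # all rows, first two columns (by position)
first_row = toy.iloc[0]      # first row, returned as a Series

print(features.shape)
print(list(features.columns))
print(first_row['label'])
```

Since `iloc` works purely on integer positions, it does not depend on how the columns are named.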

However, we also need a numpy array containing the class labels in order to classify. Let's get the class column and create a numpy array out of it:

In [None]:
y = df['class'].values
y

We could also just inspect the targets by only looking at unique values:

In [None]:
np.unique(y)

## Class label encoding

We will now use the **`LabelEncoder`** class to convert the class labels into numerical labels:

In [None]:
from sklearn.preprocessing import LabelEncoder

l_encoder = LabelEncoder()
l_encoder.fit(y)
l_encoder.classes_

By simply using **`transform`**, we can convert the labels into numerical targets:

In [None]:
y_enc = l_encoder.transform(y)
y_enc

Or just the unique values:

In [None]:
np.unique(y_enc)

We can also convert it back by using **`inverse_transform`**:

In [None]:
np.unique(l_encoder.inverse_transform(y_enc))
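The round trip through `LabelEncoder` can also be followed on a handful of made-up labels; the strings below are illustrative only. The classes are stored in sorted order, and the integer codes are the positions in that sorted list.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(['cat', 'dog', 'bird', 'dog'])   # made-up class labels

enc = LabelEncoder().fit(labels)
codes = enc.transform(labels)          # strings -> integer codes
restored = enc.inverse_transform(codes)  # integer codes -> strings

print(enc.classes_)
print(codes)
print(restored)
```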

## Scikit-learn's in-build datasets

Scikit-learn also has a couple of [built-in datasets](http://scikit-learn.org/stable/datasets/index.html). The iris dataset is one of them, and you can simply load it:

In [None]:
from sklearn.datasets import load_iris

iris = load_iris()
print(iris['DESCR'])

We get the feature design matrix via the `data` attribute:

In [None]:
iris.data

And the target array:

In [None]:
iris.target
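Besides `data` and `target`, the object returned by `load_iris` also carries the names of the features and classes, which helps when interpreting the numerical arrays. A short sketch (the variable name `iris_demo` is chosen here to avoid overwriting `iris`):

```python
from sklearn.datasets import load_iris

iris_demo = load_iris()

# Shapes of the design matrix and the target array
print(iris_demo.data.shape, iris_demo.target.shape)

# Column meaning of the design matrix and the class behind each integer label
print(iris_demo.feature_names)
print(iris_demo.target_names)
```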

## Test/train splits

OK, now we need to split the dataset again into a training and a testing set. Let's first assign the design matrix to X and the target to y:

In [None]:
X, y = iris.data[:, :2], iris.target
# ! We only use 2 features for visual purposes


How many examples do we have of each class?

In [None]:
print('Class labels:', np.unique(y))
print('Class counts:', np.bincount(y))

### Task 4: Split the dataset into 40% testing and 60% training sets.
- How many examples of each class do you expect in the training set?
- How many are there? What happened?
- What happens if you don't shuffle?
- Can you create datasets in which each class is equally distributed?

### Task 5: Plot the sepal length vs the sepal width of the training set for the different classes in a scatter plot. You can set different colors for the classes with `c=y_train`

## Logistic Regression

Let's perform a classification using logistic regression:

In [None]:
from sklearn.linear_model import LogisticRegression

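# Note: with three classes and the 'newton-cg' solver, recent scikit-learn
# versions fit a multinomial model by default; the explicit
# multi_class='multinomial' argument is deprecated there.
lr = LogisticRegression(solver='newton-cg',
                        random_state=42)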

lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

OK, how do we evaluate the classification? We can choose one of the [classification performance measures](http://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics).

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))
print("Precision: %.2f" % precision_score(y_test, y_pred, average='weighted'))
print("Recall: %.2f" % recall_score(y_test, y_pred, average='weighted'))


Or we use the classification report function:

In [None]:
print('Classification Report:\n', classification_report(y_test, y_pred))
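The option `average='weighted'` used for precision and recall above combines the per-class scores, weighting each class by its number of true examples (its support). The sketch below reproduces this by hand for the precision on made-up labels; the arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import precision_score

# Made-up true and predicted labels for three classes
y_true_demo = np.array([0, 0, 0, 1, 1, 2])
y_pred_demo = np.array([0, 0, 1, 1, 1, 2])

# One precision value per class
per_class = precision_score(y_true_demo, y_pred_demo, average=None)

# Number of true examples per class (the support)
support = np.bincount(y_true_demo)

# Support-weighted mean of the per-class values
weighted_manual = (per_class * support).sum() / support.sum()
weighted = precision_score(y_true_demo, y_pred_demo, average='weighted')

print(per_class)
print(weighted_manual, weighted)
```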

Finally, we would like to plot the decision regions and our data in order to see how the classifier categorized the events. We have highlighted the test data.

(Technicality) When running on Google Colab, we first need to update the *mlxtend* package, as Colab's default version of the packages is outdated:

In [None]:
%pip install mlxtend --upgrade  #this needs to be run only once and will install the most recent version of the mlxtend package

In [None]:
from mlxtend.plotting import plot_decision_regions

plot_decision_regions(X=X, y=y, clf=lr, X_highlight=X_test, legend=2)
plt.xlabel('sepal length [cm]')
plt.ylabel('sepal width [cm]');

## K-Nearest Neighbors

### Task 6 (Bonus): Perform a classification using [K-nearest neighbors classifier](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), evaluate the performance and show the decision regions.