Python for Beginners Exercise 6: Machine Learning 1 - Sklearn

Made by: Julian Liber

Date Created: 03/17/2020

## Hello Everyone!


#### This activity should teach you:
- What machine learning is
- What supervised machine learning is and requires
- What the difference between classification and regression is
- How to train a basic ML model
- How to interpret ML results

Machine learning (ML) is an often-discussed topic in computer science. Although it is at heart another statistical approach to understanding your data, ML is especially useful for tasks such as classification and regression.

ML can be either supervised or unsupervised. Supervised methods use independent variables (called _features_) to predict one or more dependent variables, and they learn from observations where both the features and the dependent variables are known. Unsupervised methods find relationships between observations based only on the features.

Here are some examples of each type:

#### Supervised methods:
- Classification
- Regression

#### Unsupervised methods:
- Clustering
- Ordination
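
In code, the difference shows up in what you hand to the model. The sketch below uses made-up toy data (not the dataset from this exercise), with `SVC` standing in for a supervised method and `KMeans` for an unsupervised one; both choices are only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

# Toy data (made up for illustration): two well-separated groups of points
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.5, size=(20, 2))
group_b = rng.normal(loc=5.0, scale=0.5, size=(20, 2))
features = np.vstack((group_a, group_b))
labels = np.array([0] * 20 + [1] * 20)

# Supervised: the model is given both the features and the known labels
classifier = SVC()
classifier.fit(features, labels)
supervised_pred = classifier.predict(features)

# Unsupervised: the model is given only the features and finds groups itself
clusterer = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = clusterer.fit_predict(features)

print("Classifier predictions:", supervised_pred)
print("Cluster assignments:   ", cluster_ids)
```

Note that the cluster numbers are arbitrary: the clustering can recover the two groups, but it has no way of knowing which one we called 0 and which one we called 1.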

We want to be careful not to do this, however:

<img src="https://imgs.xkcd.com/comics/machine_learning.png" width=50% alt="xkcd comic: Machine Learning"><p style="text-align: right;">From: https://imgs.xkcd.com/comics/machine_learning.png</p>

We want ML to make **useful** predictions, so we try to avoid two problems that can affect any statistical model: overfitting and underfitting.

<img src="https://scikit-learn.org/stable/_images/sphx_glr_plot_underfitting_overfitting_001.png" width=100% alt="Underfitting and overfitting example from scikit-learn"><p style="text-align: right;">From: https://scikit-learn.org/stable/_images/sphx_glr_plot_underfitting_overfitting_001.png</p>


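To help with this, we often split the data into _training_ and _testing_ data. The model is fit on the training data only and then evaluated on the testing data, which it has never seen, so we get a more honest estimate of how well it will predict in the real world.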
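
One way to see why the split matters is to compare a model's score on the data it was trained on with its score on held-out data. The sketch below is only an illustration: the data are synthetic (from `make_classification`), and the decision tree is chosen because it overfits easily, not because it is used elsewhere in this exercise.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data: flip_y=0.2 assigns random labels
# to about 20% of the samples
features, labels = make_classification(
    n_samples=300, n_features=10, n_informative=3, flip_y=0.2, random_state=0
)
train_feat, test_feat, train_lab, test_lab = train_test_split(
    features, labels, test_size=0.25, random_state=0
)

# An unrestricted tree can memorize the training data, noise included
tree = DecisionTreeClassifier(random_state=0)
tree.fit(train_feat, train_lab)

train_acc = tree.score(train_feat, train_lab)
test_acc = tree.score(test_feat, test_lab)
print(f"Training accuracy: {train_acc:.2f}")
print(f"Testing accuracy:  {test_acc:.2f}")
```

A large gap between the two numbers is a typical sign of overfitting. Without the held-out testing data, the training score alone would make the model look better than it is.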

#### First ML exercise

Our first exercise will be a classification problem. We are using the [seeds](https://archive.ics.uci.edu/ml/datasets/seeds) dataset, which compares seed traits of 3 different wheat varieties.

A classification problem attempts to predict _labels_ given _features_. In this case the wheat varieties (Kama, Rosa, or Canadian) are the labels, which are predicted using these seed traits (from the UCI website):

1. area A,
2. perimeter P,
3. compactness C = 4*pi*A/P^2,
4. length of kernel,
5. width of kernel,
6. asymmetry coefficient
7. length of kernel groove.
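
To get a feel for the compactness feature, here is a small worked sketch of the formula C = 4*pi*A/P^2. The function name and the two example shapes are made up for illustration; they are not part of the dataset.

```python
import math

def compactness(area, perimeter):
    """Return C = 4*pi*A / P^2 for a shape with the given area and perimeter."""
    return 4 * math.pi * area / perimeter ** 2

# A circle of radius 1 has area pi and perimeter 2*pi
circle = compactness(math.pi, 2 * math.pi)
# A 1 x 4 rectangle has area 4 and perimeter 10
rectangle = compactness(4.0, 10.0)

print(f"Circle:    {circle:.3f}")
print(f"Rectangle: {rectangle:.3f}")
```

A perfect circle gives C = 1, and more elongated shapes give smaller values, so compactness describes how round a seed's outline is.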

To get started, we need to do the following:
1. Read in the data
2. Split into features and labels
3. Split into training and testing data
4. Train the model
5. Test the model

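We are going to do this first using a Support Vector Machine (SVM), which attempts to find boundaries that separate the groups with the widest possible margin. In its basic form those boundaries are linear; note that scikit-learn's `SVC` uses an RBF kernel by default, which allows curved boundaries. More can be learned [here](https://scikit-learn.org/stable/modules/svm.html#classification).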
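
The shape of the boundary an SVM can draw is set by the `kernel` parameter of `svm.SVC`: `"linear"` gives straight boundaries, while `"rbf"` (the default when no kernel is given) allows curved ones. The sketch below compares the two. It uses scikit-learn's built-in iris data as a stand-in so that it runs without any files; that dataset is an assumption for illustration, not part of this exercise.

```python
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Built-in example data (a stand-in, not the seeds dataset)
iris = load_iris()
train_feat, test_feat, train_lab, test_lab = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0
)

# Fit one SVM per kernel and record its accuracy on the testing data
scores = {}
for kernel in ("linear", "rbf"):
    clf = svm.SVC(kernel=kernel)
    clf.fit(train_feat, train_lab)
    scores[kernel] = clf.score(test_feat, test_lab)
    print(f"{kernel:6s} kernel accuracy: {scores[kernel]:.2f}")
```

Which kernel works better depends on the data, so it is worth trying both.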

In [None]:
# Load needed libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import metrics

%matplotlib inline

In [None]:
# 1. Read in the data
# Collect the rows in a list first; starting from np.empty((8,)) and stacking
# onto it would leave a row of uninitialized values at the top of the array.
rows = []
with open("seeds_dataset.txt", "r") as datafile:
    for line in datafile:
        # Some rows contain repeated tabs, so split on any run of whitespace
        fields = line.split()
        if fields:
            rows.append([float(value) for value in fields])
full_data = np.array(rows)
# 2. Split into features and labels
x = full_data[:, 0:7]  # the seven seed traits
y = full_data[:, 7]    # the wheat variety

print(x.shape, y.shape)
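
A shorter alternative for steps 1 and 2 is `np.loadtxt`, which splits on any run of whitespace when no delimiter is given. The sketch below reads from a small in-memory string so that it runs on its own; the rows are made up, and only their 8-column layout (7 features and a label) is meant to mimic `seeds_dataset.txt`.

```python
import io
import numpy as np

# Stand-in for seeds_dataset.txt: made-up rows with 7 features and a label.
# The second row contains a doubled tab, the kind of empty field that
# has to be skipped when splitting on single tabs.
sample_text = (
    "15.0\t14.5\t0.87\t5.7\t3.3\t2.2\t5.2\t1\n"
    "17.5\t15.8\t0.88\t6.1\t3.6\t4.0\t\t5.9\t2\n"
    "12.0\t13.3\t0.85\t5.2\t2.9\t5.5\t5.1\t3\n"
)

# With no delimiter given, np.loadtxt treats any run of whitespace as one separator
full_data = np.loadtxt(io.StringIO(sample_text))
x = full_data[:, 0:7]
y = full_data[:, 7]
print(x.shape, y.shape)
```

With the real file, the same call would take the filename in place of the `io.StringIO` object.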

In [None]:
# 3. Split into training and testing data 
train_feat, test_feat, train_lab, test_lab = train_test_split(x, y, test_size=0.25)

In [None]:
print(train_feat.shape, test_feat.shape)
print(train_lab.shape, test_lab.shape)
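
Because `train_test_split` shuffles the data, every run gives a different split. Two optional parameters can make the split easier to work with: `random_state` fixes the shuffle so results are repeatable, and `stratify` keeps the proportion of each label the same in both parts. The sketch below uses made-up data of an assumed size (210 rows, 3 balanced labels) rather than the real file.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up stand-in data: 210 rows, 7 features, 3 balanced labels
rng = np.random.default_rng(0)
x = rng.normal(size=(210, 7))
y = np.repeat([1.0, 2.0, 3.0], 70)

# random_state makes the split repeatable; stratify=y balances the labels
train_feat, test_feat, train_lab, test_lab = train_test_split(
    x, y, test_size=0.25, random_state=42, stratify=y
)

# Count how many testing rows carry each label
values, counts = np.unique(test_lab, return_counts=True)
print(train_feat.shape, test_feat.shape)
print(dict(zip(values, counts)))
```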

In [None]:
plt.scatter(x=train_feat[:,0], y=train_feat[:,5], c=train_lab)
plt.xlabel("Seed Area")
plt.ylabel("Asymmetry Coefficient")

The data here show rather good differentiation between the varieties, so it should be possible to classify them with a machine learning algorithm.

In [None]:
# 4. Train the model
clf = svm.SVC(decision_function_shape='ovo')
clf.fit(train_feat, train_lab)

In [None]:
# 5. Test the model
pred_lab = clf.predict(test_feat)

In [None]:
# Analyse the results
fig = plt.figure(figsize=(10,5))
plt.subplot(121)
plt.scatter(x = test_feat[:,0], y = test_feat[:,5], c = test_lab)
plt.xlabel("Seed Area")
plt.ylabel("Asymmetry Coefficient")
plt.title("True Labels")

plt.subplot(122)
plt.scatter(x = test_feat[:,0], y = test_feat[:,5], c = pred_lab)
plt.xlabel("Seed Area")
plt.ylabel("Asymmetry Coefficient")
plt.title("Predicted Labels")

What can we conclude from this visual analysis?
- Looks good for the points well within the group
- Performs poorly for points outside the groups
- Entirely linearly split

What if we want a more quantitative analysis?
- Classification metrics

In [None]:
print(metrics.classification_report(test_lab, pred_lab))
print(metrics.confusion_matrix(test_lab, pred_lab))

Pretty good, right?

It all depends on your goal here, but precision and recall are pretty balanced.
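
To see where the precision and recall numbers in the report come from, they can be computed by hand from the confusion matrix. The labels below are made up for illustration, so the numbers will not match the output above.

```python
import numpy as np
from sklearn import metrics

# Made-up true and predicted labels for a 3-class problem
true_lab = np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])
pred_lab = np.array([1, 1, 1, 3, 2, 2, 2, 2, 3, 3, 1, 1])

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(true_lab, pred_lab)
true_pos = np.diag(cm)

# Precision: of everything predicted as a class, how much was correct?
precision = true_pos / cm.sum(axis=0)
# Recall: of everything truly in a class, how much was found?
recall = true_pos / cm.sum(axis=1)

print(cm)
print("Precision per class:", precision)
print("Recall per class:   ", recall)
```

Here class 3 has lower recall than precision because two of its four true members were predicted as class 1.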

Additionally, the F1 score is the harmonic mean of precision and recall, so it can be a good way to summarize the balance between the two. This [wiki page](https://en.wikipedia.org/wiki/Precision_and_recall) has a good explanation of what this means.

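$$F_1 = 2 \cdot \frac{\mathrm{PPV} \cdot \mathrm{TPR}}{\mathrm{PPV} + \mathrm{TPR}} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}$$

Here PPV (positive predictive value) is the precision, TPR (true positive rate) is the recall, and TP, FP, and FN are the numbers of true positives, false positives, and false negatives.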
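
As a quick check, the two forms of the F1 formula can be computed by hand and compared with `metrics.f1_score`. The labels below are a made-up binary example in which 1 is the positive class.

```python
from sklearn import metrics

# Made-up binary example: 1 is the positive class
true_lab = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred_lab = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# Count true positives, false positives, and false negatives
tp = sum(t == 1 and p == 1 for t, p in zip(true_lab, pred_lab))
fp = sum(t == 0 and p == 1 for t, p in zip(true_lab, pred_lab))
fn = sum(t == 1 and p == 0 for t, p in zip(true_lab, pred_lab))

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Harmonic-mean form and count form of the same score
f1_harmonic = 2 * precision * recall / (precision + recall)
f1_counts = 2 * tp / (2 * tp + fp + fn)

print(f1_harmonic, f1_counts, metrics.f1_score(true_lab, pred_lab))
```

All three values agree (2/3 for this example), with precision 0.6 and recall 0.75.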

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/Precisionrecall.svg/525px-Precisionrecall.svg.png" width=50% alt="Diagram of precision and recall"><p style="text-align: right;">From: https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/Precisionrecall.svg/525px-Precisionrecall.svg.png</p>

Let's try another model: the `RandomForestClassifier`.

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf2 = RandomForestClassifier()
clf2.fit(train_feat, train_lab)

In [None]:
pred_lab2 = clf2.predict(test_feat)
print(metrics.classification_report(test_lab, pred_lab2))
print(metrics.confusion_matrix(test_lab, pred_lab2))

While the classification performance is similar, the random forest algorithm allows for an additional analysis step: _feature importances_.

Feature importances tell you how influential each feature is in the classification decision, which is useful if you want to know what matters most to the model.
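
Besides plotting them, it can help to print the importances as a ranked list. The sketch below does this on scikit-learn's built-in iris data, used here only as a stand-in so the example runs without any files; the variable names are made up for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Built-in example data (a stand-in, not the seeds dataset)
iris = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(iris.data, iris.target)

# Pair each feature name with its importance and sort, largest first
ranked = sorted(
    zip(iris.feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name:20s} {importance:.3f}")
```

The importances are non-negative and sum to 1, so each value can be read as a share of the model's total reliance on the features.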

In [None]:
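# Plot how much each feature contributes to the random forest's decisions
feature_names = ["area", "perimeter", "compactness", "kernel length",
                 "kernel width", "asymmetry coef", "groove length"]
plt.barh(y=np.arange(len(feature_names)),
         width=clf2.feature_importances_,
         tick_label=feature_names)
plt.xlabel("Feature importance")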

#### Do This:
Find another machine learning algorithm to try. [This page](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning) lists many that you could implement.

Fit the model with the training data, and use the testing data to assess its success.
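
If you are unsure where to start, the general pattern looks like the sketch below. It uses a k-nearest neighbors classifier on scikit-learn's built-in iris data, both chosen only as examples, so you will still need to pick your own algorithm and apply it to the seeds data.

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Built-in example data (a stand-in, not the seeds dataset)
iris = load_iris()
train_feat, test_feat, train_lab, test_lab = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0
)

# Fit on the training data, then predict the testing data
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(train_feat, train_lab)
pred_lab = knn.predict(test_feat)

print(metrics.classification_report(test_lab, pred_lab))
print(metrics.confusion_matrix(test_lab, pred_lab))
```

Most scikit-learn classifiers follow the same `fit` and `predict` pattern, so swapping in a different algorithm usually means changing only the import and the line that creates the model.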

### Thanks for doing Exercise 6!

#### More will follow soon!