# Session 6 - Benchmark classification on ```cifar-10```

This notebook builds on what we were doing last week with the handwritten digits from the MNIST dataset.

This week, we're working with another famous dataset in computer vision and image processing research - [cifar10](https://www.cs.toronto.edu/~kriz/cifar.html).

In [9]:
# path tools
import os

# data loader
from tensorflow.keras.datasets import cifar10

# machine learning tools
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# classification models
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# other modules
import numpy as np
import cv2

We're going to load the data using a function from the library ```TensorFlow```, which we'll be looking at in more detail next week. 

For now, we're just using it to fetch the data!

In [2]:
(X_train, y_train), (X_test, y_test) = cifar10.load_data()

Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz


**Question:** What is the shape of the data?

In [3]:
X_train.shape

(50000, 32, 32, 3)

Unfortunately, this version of the data set doesn't have explict labels, so we need to create our own.

In [5]:
labels = [
    "airplane", 
    "automobile", 
    "bird", 
    "cat", 
    "deer", 
    "dog", 
    "frog", 
    "horse", 
    "ship", 
    "truck"] # in alphabetical order

### Convert all the data to greyscale

In the following cell, I'm converting all of my images to greyscale and then making a ```numpy``` array at the end.

Notice that I'm using something funky here called *[list comprehensions](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions)*.

In [10]:
X_train_grey = np.array([
    cv2.cvtColor(
        image, 
        cv2.COLOR_BGR2GRAY) for image in X_train])
X_test_grey = np.array([
    cv2.cvtColor(
        image, 
        cv2.COLOR_BGR2GRAY) for image in X_test])

# HOW LIST COMPREHENSION WORKS
# the fussy loop way:
# for x in y:
#     do_this(x)
#
# the funky list comprehension way:
# [function(x) for x in y]
# OR
# [x.function() for x in y]
# e.g., if you have the list the list colours = ["red", "green", "blue"] 
# and you want the list to be all uppercase letters
# upper = [colour.upper() for colour in colours] 

In [11]:
X_train_grey.shape # now it's 3 dimensional

(50000, 32, 32)

Then, we're going to do some simple scaling by dividing by 255.

In [12]:
X_train_scaled = (X_train_grey)/255.0
X_test_scaled = (X_test_grey)/255.0

### Reshaping the data

Next, we're going to reshape this data. 

In [13]:
nsamples, nx, ny = X_train_scaled.shape # extracting each dimension
X_train_dataset = X_train_scaled.reshape((
    nsamples,
    nx * ny)) # flattening from 3 dimensions to 2 dimensions

In [14]:
nsamples, nx, ny = X_test_scaled.shape
X_test_dataset = X_test_scaled.reshape((
    nsamples,
    nx * ny))

## Simple logistic regression classifier

We define our Logistic Regression classifier as we have done previously. You'll notice that I've set a lot of different parameters here - you can learn more in the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

PENALTY:

- L1 = setting very small (close to 0) weights to 0, and only keeping meaningful weights
- none = no penalty

TOL:

- tolerance for stopping criteria. When the model stops learning (improving the weight values) and stops training.
- Default is usually pretty small, so 0.1 is a little high actually

VERBOSE:

- Boolean value.
- Default is FALSE: meaing no output is printed
- TRUE: output is printed as it runs

SOLVER:

- algorithm for optimization problem.
- See documentation when choosing (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
- Depends on dataset size and multiclass problems.

MULTI_CLASS:

- either multinomial or binary.
- In this case we have multinomial data, and not (FAKE vs REAL data articles)

In [15]:
clf = LogisticRegression(
    penalty = "none",
    tol = 0.1,
    verbose = True,
    solver = "saga",
    multi_class = "multinomial").fit(
        X_train_dataset, 
        y_train)

  y = column_or_1d(y, warn=True)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Epoch 1, change: 1.00000000
Epoch 2, change: 0.26975541
Epoch 3, change: 0.12787122
Epoch 4, change: 0.10726254
convergence after 5 epochs took 13 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   12.5s finished


In [16]:
y_pred = clf.predict(
    X_test_dataset)

We can then print our classification report, using the label names that we defined earlier.

In [17]:
report = classification_report(
    y_test, 
    y_pred, 
    target_names = labels)
print(report)

              precision    recall  f1-score   support

    airplane       0.33      0.41      0.37      1000
  automobile       0.37      0.37      0.37      1000
        bird       0.27      0.20      0.23      1000
         cat       0.22      0.17      0.19      1000
        deer       0.26      0.20      0.23      1000
         dog       0.30      0.31      0.30      1000
        frog       0.28      0.33      0.30      1000
       horse       0.34      0.30      0.32      1000
        ship       0.34      0.40      0.37      1000
       truck       0.39      0.45      0.42      1000

    accuracy                           0.31     10000
   macro avg       0.31      0.31      0.31     10000
weighted avg       0.31      0.31      0.31     10000



## Neural network classifier

I've set a couple of different parameters here - you can see more in the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html).

**NB!** This will take a long time to run! On the 32 CPU machine on UCloud, this takes around 30 seconds per iteration.

In [18]:
clf = MLPClassifier(
    random_state = 42,
    hidden_layer_sizes = (64, 10),
    learning_rate = "adaptive",
    early_stopping = True,
    verbose = True,
    max_iter = 20).fit(
        X_train_dataset, 
        y_train)

  y = column_or_1d(y, warn=True)


Iteration 1, loss = 2.30872956
Validation score: 0.133000
Iteration 2, loss = 2.15971661
Validation score: 0.239200
Iteration 3, loss = 2.02581278
Validation score: 0.265200
Iteration 4, loss = 1.97076182
Validation score: 0.281800
Iteration 5, loss = 1.93555578
Validation score: 0.302600
Iteration 6, loss = 1.90926190
Validation score: 0.315600
Iteration 7, loss = 1.89160286
Validation score: 0.318800
Iteration 8, loss = 1.87500641
Validation score: 0.322200
Iteration 9, loss = 1.86730610
Validation score: 0.316800
Iteration 10, loss = 1.85845283
Validation score: 0.321200
Iteration 11, loss = 1.84549829
Validation score: 0.331400
Iteration 12, loss = 1.83590762
Validation score: 0.328600
Iteration 13, loss = 1.82908945
Validation score: 0.331400
Iteration 14, loss = 1.82320985
Validation score: 0.330600
Iteration 15, loss = 1.81056794
Validation score: 0.343400
Iteration 16, loss = 1.80707784
Validation score: 0.338400
Iteration 17, loss = 1.79877427
Validation score: 0.339800
Iterat



In [19]:
y_pred = clf.predict(
    X_test_dataset)

Lastly, we can get our classification report as usual.

In [20]:
report = classification_report(
    y_test,
    y_pred,
    target_names=labels)
print(report)

              precision    recall  f1-score   support

    airplane       0.38      0.41      0.40      1000
  automobile       0.40      0.49      0.44      1000
        bird       0.26      0.34      0.30      1000
         cat       0.28      0.11      0.16      1000
        deer       0.27      0.26      0.27      1000
         dog       0.33      0.34      0.34      1000
        frog       0.28      0.29      0.28      1000
       horse       0.45      0.39      0.42      1000
        ship       0.44      0.44      0.44      1000
       truck       0.42      0.47      0.44      1000

    accuracy                           0.35     10000
   macro avg       0.35      0.35      0.35     10000
weighted avg       0.35      0.35      0.35     10000



## Tasks

Take the code outlined in this notebook and turn it into two separate Python scripts, one which performs Logistic Regression classification and one which uses the MLPClassifier on the ```Cifar10``` dataset.

Try to use the things we've spoken about in clas
- Requirements.txt
- Virtual environment
- Setup scripts
- Argparse

This task is [Assignment 2 for Visual Analytics](https://classroom.github.com/a/KLVvny7d).