# Session 6 - Benchmark classification on ```cifar-10```

This notebook builds on what we were doing last week with the handwritten digits from the MNIST dataset.

This week, we're working with another famous dataset in computer vision and image processing research - [cifar10](https://www.cs.toronto.edu/~kriz/cifar.html).

In [2]:
# path tools
import os

# data loader
from tensorflow.keras.datasets import cifar10

# machine learning tools
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# classification models
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier



We're going to load the data using a function from the library ```TensorFlow```, which we'll be looking at in more detail next week. 

For now, we're just using it to fetch the data!

In [3]:
# load_data() returns the training and test splits, each as an (images, labels) pair
(X_train, y_train), (X_test, y_test) = cifar10.load_data()

Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz


**Question:** What is the shape of the data?

In [14]:
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(50000, 32, 32, 3) (50000, 1) (10000, 32, 32, 3) (10000, 1)


For ```X_train```, 50000 is the number of images, 32 and 32 are the height and width of each image in pixels, and 3 is the number of color channels.
```y_train``` holds one integer label per image, so there are 50000 labels in all. The test set has the same layout with 10000 images.

Unfortunately, this version of the data set doesn't have explicit labels (there are just the numbers 0-9, one per class), so we need to create our own.

In [15]:
labels = ["airplane", 
          "automobile", 
          "bird", 
          "cat", 
          "deer", 
          "dog", 
          "frog", 
          "horse", 
          "ship", 
          "truck"]

In [24]:
# List comprehension to convert labels to uppercase
#uppers = [label.upper() for label in labels]
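Since the labels arrive as integers, the list defined above acts as a lookup table: the integer is the index of the class name. The sketch below uses a small made-up array, `y_example`, as a stand-in for `y_train` (it is an assumption for illustration, not real data) and shows how a label column of shape (n, 1) can be flattened and mapped to names.

```python
import numpy as np

# Class names in index order: 0 = airplane, ..., 9 = truck
labels = ["airplane", "automobile", "bird", "cat", "deer",
          "dog", "frog", "horse", "ship", "truck"]

# Made-up stand-in for y_train: a column of integer class ids, shape (n, 1)
y_example = np.array([[6], [9], [0], [3]])

# ravel() turns the (n, 1) column into a flat array of shape (n,)
y_flat = y_example.ravel()

# Use each integer as an index into the list of names
names = [labels[i] for i in y_flat]
print(names)  # ['frog', 'truck', 'airplane', 'cat']
```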

### Convert all the data to greyscale

In the following cell, I'm converting all of my images to greyscale and then making a ```numpy``` array at the end.

Notice that I'm using something funky here called *[list comprehensions](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions)*. List comprehensions are compact and useful, but the downside is that they can make the code harder to read.

In [7]:
import numpy as np
import cv2

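# convert each image to greyscale with a list comprehension, then stack the results into one array
# NB: Keras returns CIFAR-10 in RGB channel order, so cv2.COLOR_RGB2GRAY would weight the channels correctly; COLOR_BGR2GRAY swaps the red and blue weights
X_train_grey = np.array([cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) for image in X_train])
X_test_grey = np.array([cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) for image in X_test])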
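To make the list comprehension easier to read, it can help to see it next to the equivalent for loop. The sketch below is only an illustration: `to_grey` is a hypothetical stand-in for the OpenCV call (a weighted sum of the red, green and blue channels), and the images are random numbers rather than CIFAR-10 data.

```python
import numpy as np

def to_grey(image):
    """Hypothetical stand-in for cv2.cvtColor: weighted sum of R, G and B."""
    weights = np.array([0.299, 0.587, 0.114])
    return image @ weights  # (32, 32, 3) @ (3,) -> (32, 32)

# Four random "images" with the same layout as CIFAR-10
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(4, 32, 32, 3))

# Version 1: explicit for loop
grey_loop = []
for image in images:
    grey_loop.append(to_grey(image))
grey_loop = np.array(grey_loop)

# Version 2: the same thing as a list comprehension
grey_comp = np.array([to_grey(image) for image in images])

print(grey_comp.shape)                       # (4, 32, 32)
print(np.array_equal(grey_loop, grey_comp))  # True
```

Both versions produce the same array; the comprehension just folds the loop and the append into one line.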

In [27]:
X_train_grey.shape

(50000, 32, 32)

Now that the images are grey, the color channels are gone, so each array has one dimension fewer.

Then, we're going to do some simple scaling by dividing by 255.



In [8]:
X_train_scaled = (X_train_grey)/255.0
X_test_scaled = (X_test_grey)/255.0

### Reshaping the data

Next, we're going to reshape this data. 

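We do this to make the data compatible with our classifiers, which expect every sample to be a single row of features. To get there, we flatten each 32x32 image into one dimension with 1024 pixel values.

Rather than flattening the images one at a time, we call ```reshape``` on the whole array (see the comments below). The result is the same as flattening each image, but it happens in a single step and keeps one row per image.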

In [9]:
nsamples, nx, ny = X_train_scaled.shape # returns (50000, 32, 32)
X_train_dataset = X_train_scaled.reshape((nsamples,nx*ny)) # keep one row per image and lay its nx*ny pixels out as columns
# so the reshaped data has 50000 rows and 32*32 = 1024 columns.

In [29]:
nsamples, nx, ny = X_test_scaled.shape
X_test_dataset = X_test_scaled.reshape((nsamples,nx*ny))

In [30]:
X_train_dataset.shape

(50000, 1024)
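The difference between reshaping and flattening is easier to see on a tiny array. The sketch below uses a made-up array of 5 "images" of 4x3 pixels (an assumption for illustration) in place of the real data.

```python
import numpy as np

# Made-up stand-in for X_train_scaled: 5 tiny "images" of 4 x 3 pixels
X = np.arange(5 * 4 * 3, dtype=float).reshape((5, 4, 3))

nsamples, nx, ny = X.shape
X_flat = X.reshape((nsamples, nx * ny))  # explicit number of columns
X_auto = X.reshape((nsamples, -1))       # -1 lets numpy work out the column count

print(X_flat.shape)                      # (5, 12)
print(np.array_equal(X_flat, X_auto))    # True

# Each row is one image with its pixel rows laid end to end
print(np.array_equal(X_flat[0], X[0].flatten()))  # True

# Calling flatten() on the whole array would merge all images into one vector
print(X.flatten().shape)                 # (60,)
```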

## Simple logistic regression classifier

We define our Logistic Regression classifier as we have done previously. You'll notice that I've set a lot of different parameters here - you can learn more in the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

* tol: tolerance for the stopping criterion. Training stops once the change reported after an epoch is no larger than 0.1, meaning the weights have almost stopped moving between epochs. The check is made after every epoch; there is no setting for how many epochs the model may go without improving.

* verbose: prints progress messages while the model is training.

* solver: the algorithm used to minimize the loss function. ```saga``` is a variant of stochastic average gradient descent that works well on large datasets.

* multi_class: with ```multinomial```, a single model is fit over all ten classes at once, rather than one binary classifier per class.
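To give a feel for what a multinomial model computes, the sketch below scores every class for each image and turns the scores into probabilities with a softmax. The weights `W` and bias `b` are random placeholders (assumptions for illustration), not values learned by scikit-learn.

```python
import numpy as np

def softmax(scores):
    """Turn each row of class scores into probabilities that sum to 1."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n_features, n_classes = 1024, 10

# Placeholder parameters: one column of weights per class
W = rng.normal(scale=0.01, size=(n_features, n_classes))
b = np.zeros(n_classes)

# Three random "flattened images" with values between 0 and 1
X = rng.random((3, n_features))

probs = softmax(X @ W + b)       # shape (3, 10): one probability per class
predicted = probs.argmax(axis=1) # most probable class for each image

print(probs.shape)
print(probs.sum(axis=1))
```

Training amounts to finding the values of `W` and `b` that make the true class as probable as possible across the training set.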

In [11]:
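clf = LogisticRegression(penalty="none", # no regularization of the weights (newer scikit-learn versions spell this penalty=None)
                        tol=0.1, # stop once the change between epochs is at most 0.1
                        verbose=True, # print progress for every epoch
                        solver="saga", # a variant of stochastic average gradient descent
                        multi_class="multinomial").fit(X_train_dataset, y_train) # one model over all ten classes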

  y = column_or_1d(y, warn=True)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Epoch 1, change: 1.00000000
Epoch 2, change: 0.23390008
Epoch 3, change: 0.12836014
Epoch 4, change: 0.11127933
convergence after 5 epochs took 17 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   16.3s finished


In [16]:
y_pred = clf.predict(X_test_dataset)

We can then print our classification report, using the label names that we defined earlier.

In [17]:
report = classification_report(y_test, 
                               y_pred, 
                               target_names=labels)
print(report)

              precision    recall  f1-score   support

    airplane       0.34      0.38      0.36      1000
  automobile       0.36      0.38      0.37      1000
        bird       0.27      0.21      0.24      1000
         cat       0.21      0.18      0.19      1000
        deer       0.25      0.20      0.22      1000
         dog       0.31      0.29      0.30      1000
        frog       0.28      0.32      0.30      1000
       horse       0.31      0.33      0.32      1000
        ship       0.33      0.42      0.37      1000
       truck       0.41      0.43      0.42      1000

    accuracy                           0.31     10000
   macro avg       0.31      0.31      0.31     10000
weighted avg       0.31      0.31      0.31     10000
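The columns of the report can be computed by hand from counts of correct and incorrect predictions. The sketch below does this for a small set of made-up labels (an assumption for illustration); it has no guard against division by zero, which would be needed for a class that is never predicted.

```python
import numpy as np

# Made-up true labels and predictions for three classes
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_hat = np.array([0, 0, 1, 1, 2, 2, 2, 0])

def scores_for_class(y_true, y_hat, cls):
    """Precision, recall, F1 and support for one class."""
    tp = np.sum((y_hat == cls) & (y_true == cls))  # predicted cls, was cls
    fp = np.sum((y_hat == cls) & (y_true != cls))  # predicted cls, was not
    fn = np.sum((y_hat != cls) & (y_true == cls))  # missed a real cls
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    support = np.sum(y_true == cls)                # number of true examples
    return precision, recall, f1, support

for cls in range(3):
    p, r, f, s = scores_for_class(y_true, y_hat, cls)
    print(f"class {cls}: precision={p:.2f} recall={r:.2f} f1={f:.2f} support={s}")

accuracy = np.mean(y_true == y_hat)
print(f"accuracy={accuracy:.3f}")  # accuracy=0.625
```

With ten balanced classes, guessing at random would give an accuracy of about 0.10, which is the baseline to compare the report against.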



## Neural network classifier

I've set a couple of different parameters here - you can see more in the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html).

**NB!** This will take a long time to run! On the 32 CPU machine on UCloud, this takes around 30 seconds per iteration.

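* adaptive learning rate: the learning rate stays at its initial value for as long as training keeps improving, and is divided by 5 each time it stalls (the loss, or the validation score when early stopping is on, fails to improve by at least ```tol```). Smaller steps make it easier to settle into a minimum. Note that scikit-learn only uses this setting when ```solver="sgd"```; with the default ```adam``` solver used here it has no effect.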
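The scikit-learn documentation describes the adaptive schedule as keeping the learning rate constant while the training loss decreases and dividing it by 5 when two consecutive epochs fail to improve by at least `tol`. The sketch below is a simplified version of that rule: the function name, the `patience` argument and the loss values are made up for illustration, and scikit-learn's internal bookkeeping differs in detail.

```python
def adaptive_schedule(losses, lr_init=0.001, tol=1e-4, patience=2):
    """Return the learning rate in effect after each epoch (simplified sketch)."""
    lr = lr_init
    best_loss = float("inf")
    stalled = 0  # consecutive epochs without enough improvement
    history = []
    for loss in losses:
        if loss < best_loss - tol:
            stalled = 0           # loss improved by at least tol
        else:
            stalled += 1          # no meaningful improvement this epoch
        best_loss = min(best_loss, loss)
        if stalled >= patience:
            lr /= 5               # take smaller steps from now on
            stalled = 0
        history.append(lr)
    return history

# Made-up training losses that plateau twice
losses = [2.30, 2.10, 2.00, 2.00, 2.00, 1.95, 1.95, 1.95]
rates = adaptive_schedule(losses)
for epoch, (loss, lr) in enumerate(zip(losses, rates), start=1):
    print(f"epoch {epoch}: loss={loss:.2f} learning_rate={lr:.6f}")
```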

In [18]:
clf = MLPClassifier(random_state=42,
                    hidden_layer_sizes=(64, 10), # 64 neurons in the first layer, 10 neurons in the second layer
                    learning_rate="adaptive", # only used by the sgd solver; the default adam solver ignores it
                    early_stopping=True, # stop when the validation score has not improved for n_iter_no_change consecutive epochs (10 by default)
                    verbose=True,
                    max_iter=20).fit(X_train_dataset, y_train)

  y = column_or_1d(y, warn=True)


Iteration 1, loss = 2.30872956
Validation score: 0.133000
Iteration 2, loss = 2.15971661
Validation score: 0.239200
Iteration 3, loss = 2.02581278
Validation score: 0.265200
Iteration 4, loss = 1.97076182
Validation score: 0.281800
Iteration 5, loss = 1.93555578
Validation score: 0.302600
Iteration 6, loss = 1.90926190
Validation score: 0.315600
Iteration 7, loss = 1.89160286
Validation score: 0.318800
Iteration 8, loss = 1.87500641
Validation score: 0.322200
Iteration 9, loss = 1.86730610
Validation score: 0.316800
Iteration 10, loss = 1.85845283
Validation score: 0.321200
Iteration 11, loss = 1.84549829
Validation score: 0.331400
Iteration 12, loss = 1.83590762
Validation score: 0.328600
Iteration 13, loss = 1.82908945
Validation score: 0.331400
Iteration 14, loss = 1.82320985
Validation score: 0.330600
Iteration 15, loss = 1.81056794
Validation score: 0.343400
Iteration 16, loss = 1.80707784
Validation score: 0.338400
Iteration 17, loss = 1.79877427
Validation score: 0.339800
Iterat



**Validation score:** Because ```early_stopping=True```, scikit-learn sets aside part of the training data (10% by default) as a validation set that the model does not train on. After each iteration, the validation score reports the model's accuracy on this held-out set, which shows how well it generalizes while it is still training.
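The validation scores printed in the log above can be used to show how early stopping decides when to halt. The sketch below is a simplified version of the rule (stop once the score has failed to improve by more than `tol` for a given number of consecutive epochs); the function name is hypothetical and scikit-learn's own counting differs slightly.

```python
def early_stop_epoch(scores, tol=1e-4, n_iter_no_change=10):
    """Return the epoch at which training would stop, or None if it never does."""
    best = float("-inf")
    stalled = 0  # consecutive epochs without enough improvement
    for epoch, score in enumerate(scores, start=1):
        if score > best + tol:
            stalled = 0
        else:
            stalled += 1
        best = max(best, score)
        if stalled >= n_iter_no_change:
            return epoch
    return None

# Validation scores for iterations 1-17, copied from the training log
scores = [0.1330, 0.2392, 0.2652, 0.2818, 0.3026, 0.3156, 0.3188, 0.3222,
          0.3168, 0.3212, 0.3314, 0.3286, 0.3314, 0.3306, 0.3434, 0.3384,
          0.3398]

# With a patience of 10, nothing in these 17 epochs triggers a stop
print(early_stop_epoch(scores, n_iter_no_change=10))  # None

# A stricter, made-up patience of 3 would stop at epoch 14
print(early_stop_epoch(scores, n_iter_no_change=3))   # 14
```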

In [25]:
y_pred = clf.predict(X_test_dataset)

Lastly, we can get our classification report as usual.

In [26]:
report = classification_report(y_test, 
                               y_pred, 
                               target_names=labels)
print(report)

              precision    recall  f1-score   support

    airplane       0.38      0.41      0.40      1000
  automobile       0.40      0.49      0.44      1000
        bird       0.26      0.34      0.30      1000
         cat       0.28      0.11      0.16      1000
        deer       0.27      0.26      0.27      1000
         dog       0.33      0.34      0.34      1000
        frog       0.28      0.29      0.28      1000
       horse       0.45      0.39      0.42      1000
        ship       0.44      0.44      0.44      1000
       truck       0.42      0.47      0.44      1000

    accuracy                           0.35     10000
   macro avg       0.35      0.35      0.35     10000
weighted avg       0.35      0.35      0.35     10000



## Tasks

Take the code outlined in this notebook and turn it into two separate Python scripts, one which performs Logistic Regression classification and one which uses the MLPClassifier on the ```Cifar10``` dataset.

Try to use the things we've spoken about in class:
- Requirements.txt
- Virtual environment
- Setup scripts
- Argparse

This task is [Assignment 2 for Visual Analytics](https://classroom.github.com/a/KLVvny7d).
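As a starting point for the argparse part, here is one possible skeleton for handling command-line options in a script. The flag names (`--tol`, `--max_iter`, `--report_path`) and their defaults are assumptions chosen for illustration; pick whatever options make sense for your own scripts.

```python
import argparse

def parse_args(argv=None):
    """Define and parse the command-line options (all names are illustrative)."""
    parser = argparse.ArgumentParser(
        description="Train a classifier on CIFAR-10 and save a report")
    parser.add_argument("--tol", type=float, default=0.1,
                        help="tolerance for the stopping criterion")
    parser.add_argument("--max_iter", type=int, default=20,
                        help="maximum number of training iterations")
    parser.add_argument("--report_path", default="out/report.txt",
                        help="where to save the classification report")
    # argv=None makes argparse read the real command line
    return parser.parse_args(argv)

# In a script you would call parse_args() with no arguments;
# here a list is passed in to show the result without a terminal
args = parse_args(["--tol", "0.05"])
print(args.tol, args.max_iter, args.report_path)  # 0.05 20 out/report.txt
```

Keeping the parsing in its own function makes it easy to try out different options without editing the script.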