# Multiclass Classification

While binary classifiers distinguish between two classes (e.g., detecting whether a transaction is fraudulent, or classifying an email as spam or non-spam), multiclass classifiers distinguish between more than two classes.

There are several ways to perform multiclass classification using binary classifiers. In this exercise, you will implement two such strategies: the _One-versus-all_ strategy and the _One-versus-one_ strategy.

- **One-versus-all (OvA)**: In this strategy, we train a single binary classifier per class, with the samples of that class as positive samples and all other samples as negatives. During inference, we get the prediction from each classifier and select the class with the highest score. This strategy is also called the one-versus-the-rest strategy.

- **One-versus-one (OvO)**: In this strategy, we train a binary classifier for every pair of classes. If there are N classes in the problem, we need to train N * (N-1) / 2 classifiers. During inference, we run a sample through all N * (N-1) / 2 classifiers and see which class wins the most votes. The main advantage of the OvO strategy is that each binary classifier only needs to be trained on the part of the training dataset that belongs to the two classes it separates.
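
To make the classifier counts concrete, here is a minimal sketch that enumerates the binary problems each strategy creates. The class labels are the 12 Avila copyists used later in this notebook; the variable names are illustrative only.

```python
from itertools import combinations

# The 12 copyist labels used later in this notebook.
classes = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "W", "X", "Y"]
n = len(classes)

# OvA: one binary problem per class (that class vs. everything else).
ova_problems = [(c, "rest") for c in classes]

# OvO: one binary problem per unordered pair of classes.
ovo_problems = list(combinations(classes, 2))

print(len(ova_problems))  # 12
print(len(ovo_problems))  # 66, i.e. n * (n - 1) // 2
```

OvO therefore trains more models (66 versus 12 here), but each one sees only the rows of two classes.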

In [161]:
# import packages
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

# make this notebook's output stable across runs
np.random.seed(0)

## Avila Dataset

In this lab assignment, we use the [Avila](https://archive.ics.uci.edu/ml/datasets/Avila) data set, which has been extracted from 800 images of the "Avila Bible", a giant Latin copy of the whole Bible produced during the XII century between Italy and Spain.  
The palaeographic analysis of the manuscript has identified 12 copyists. The pages written by each copyist are not equally numerous. 
Each pattern contains 10 features and corresponds to a group of 4 consecutive rows.

The prediction task consists of associating each pattern with one of the 12 copyists (labeled as: A, B, C, D, E, F, G, H, I, W, X, Y).
The data have been normalized using the Z-normalization method and divided into two data sets: a training set containing 10430 samples and a test set containing 10437 samples.


In [162]:
# Load train and test data from CSV files.
train = pd.read_csv("avila-tr.txt", header=None)
test = pd.read_csv("avila-ts.txt", header=None)

x_train = train.iloc[:,:-1]
y_train = train.iloc[:,-1]

x_test = test.iloc[:,:-1]
y_test = test.iloc[:,-1]

In [163]:
# Output the number of samples in each class in the train and test datasets.

# Take the last column (index 10, the class label), count the samples per class
# in the train and test datasets, and sort the output by class label.
print("Train")
display(y_train.value_counts().sort_index())

print("Test")
display(y_test.value_counts().sort_index())


Train


10
A    4286
B       5
C     103
D     352
E    1095
F    1961
G     446
H     519
I     831
W      44
X     522
Y     266
Name: count, dtype: int64

Test


10
A    4286
B       5
C     103
D     353
E    1095
F    1962
G     447
H     520
I     832
W      45
X     522
Y     267
Name: count, dtype: int64
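
Because the classes are heavily imbalanced, it helps to know how well a classifier would score by always predicting the most frequent class. The sketch below computes that baseline from the test-set counts printed above; the dictionary is copied by hand from that output, and the variable names are illustrative.

```python
# Test-set class counts copied from the output above.
test_counts = {"A": 4286, "B": 5, "C": 103, "D": 353, "E": 1095, "F": 1962,
               "G": 447, "H": 520, "I": 832, "W": 45, "X": 522, "Y": 267}

total = sum(test_counts.values())
majority_class = max(test_counts, key=test_counts.get)

# Accuracy of a classifier that always predicts the majority class.
baseline_accuracy = test_counts[majority_class] / total

print(total)                        # 10437
print(majority_class)               # A
print(round(baseline_accuracy, 4))  # 0.4107
```

Any accuracy reported later in the notebook is best read against this roughly 41% baseline.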

Question 1.1: Check for missing Data

In [164]:
# Check the train and test sets for missing data.
print(f"Does train have any missing data? {train.isna().sum().sum() > 0}")
print(f"Does test have any missing data? {test.isna().sum().sum() > 0}")


Does train have any missing data? False
Does test have any missing data? False


Question 1.2: Apply Z-normalization to data

In [165]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

display(x_train.head())

scaler.fit(X=x_train, y=y_train)
x_train_scaled = scaler.transform(x_train)
x_train = pd.DataFrame(x_train_scaled, columns=x_train.columns)

display(x_train.head())



Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,0.266074,-0.16562,0.32098,0.483299,0.17234,0.273364,0.371178,0.929823,0.251173,0.159345
1,0.130292,0.870736,-3.210528,0.062493,0.261718,1.43606,1.46594,0.636203,0.282354,0.515587
2,-0.116585,0.069915,0.068476,-0.783147,0.261718,0.439463,-0.081827,-0.888236,-0.123005,0.582939
3,0.031541,0.2976,-3.210528,-0.58359,-0.721442,-0.307984,0.710932,1.051693,0.594169,-0.533994
4,0.229043,0.807926,-0.052442,0.082634,0.261718,0.14879,0.635431,0.051062,0.032902,-0.086652


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,0.267527,-0.050815,0.28702,0.481603,0.167308,0.230326,0.278279,0.916331,0.219066,0.157418
1,0.130565,0.213515,-2.865694,0.064334,0.257406,1.262741,1.111627,0.623723,0.247735,0.511168
2,-0.118458,0.00926,0.0616,-0.774197,0.257406,0.377813,-0.066555,-0.895461,-0.12497,0.578049
3,0.030955,0.067333,-2.865694,-0.576317,-0.733677,-0.285881,0.536904,1.037781,0.534432,-0.53107
4,0.230174,0.197495,-0.046348,0.084306,0.257406,0.119711,0.479432,0.040599,0.018378,-0.086858
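
One detail worth checking: the cell above rescales only `x_train`. A common convention is to fit the scaler on the training data and then apply the same transformation to the test data, so that both sets share one scale. The sketch below illustrates that pattern on small synthetic arrays (the data and the names `toy_train` and `toy_test` are made up for illustration). It is not the notebook's own cell, and applying it here could change the results reported further down.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for a train/test split (illustrative only).
rng = np.random.default_rng(0)
toy_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
toy_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# Fit on the training data only, then reuse the same statistics for the test data.
scaler = StandardScaler()
toy_train_scaled = scaler.fit_transform(toy_train)
toy_test_scaled = scaler.transform(toy_test)

# Z-normalization by hand: z = (x - mean) / std, with mean and std taken from the training set.
manual = (toy_test - toy_train.mean(axis=0)) / toy_train.std(axis=0)
print(np.allclose(toy_test_scaled, manual))  # True
```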


Question 2.1: Write a method to train multiple logistic regression models performing One vs All (OvA) classification. The method allows you to pass in training features, and target. The method returns a list of models and their associated labels. 
Within the method:
- Determine the list of classes
- Create a place to store all the models
- For each class, train a model with the target variable set to 1 for that class and 0 for all other classes
- Return the list of models trained and associated labels.

In [166]:
from typing import Any, List


def trainOvA(x, y):
    """
    TODO: Train the multiclass classifier using OvA strategy. 
    """
    labels = sorted(y.unique())
    n_labels = len(labels)
    print("number of classes is {}".format(n_labels))
    
    models: List[Any] = [None] * n_labels
    model_labels = [None] * n_labels

    for i, label in enumerate(labels):
        # Create model
        print("Train Logistic Regression model for class {}".format(label))
        model = LogisticRegression()

        y_binary = (y == label).astype(int)
        model.fit(x, y_binary)

        models[i] = model
        model_labels[i] = label

    return models, model_labels

Question 2.2: Write a method that leverages the multiple models trained for OvA and outputs the class with the highest score.

In [167]:
def predictOvA(models, labels, x):
    """
    TODO: Make predictions on multiclass problems using the OvA strategy. 
    """
    if models is None:
        raise ValueError("The models have not been trained yet. Please call trainOvA() first.")

    #Create prediction
    predictions = pd.DataFrame(columns=labels)
    for label, model in zip(labels, models):
        # We need to extract [:, 1] to get the probability of the positive class.
        predictions[label] = model.predict_proba(x)[:, 1]

    display(predictions.head())
    return predictions.idxmax(axis=1).values
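
As a sanity check on the OvA idea, the following sketch compares a compact manual version against scikit-learn's `OneVsRestClassifier` on a small synthetic dataset. The toy data from `make_blobs` and all variable names are assumptions made for illustration; this is not the notebook's Avila pipeline.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy data: three well-separated 2-D clusters (illustrative only).
X, y = make_blobs(n_samples=300, centers=[[-5, -5], [0, 5], [5, -5]],
                  cluster_std=1.0, random_state=0)

# Manual OvA: one binary model per class, then pick the class whose
# model gives the highest positive-class probability.
classes = np.unique(y)
scores = np.column_stack([
    LogisticRegression().fit(X, (y == c).astype(int)).predict_proba(X)[:, 1]
    for c in classes
])
manual_pred = classes[scores.argmax(axis=1)]

# Library wrapper that follows the same one-vs-rest scheme.
library_pred = OneVsRestClassifier(LogisticRegression()).fit(X, y).predict(X)

agreement = (manual_pred == library_pred).mean()
print(agreement)
```

On data this cleanly separated, the two sets of predictions should agree on nearly every sample.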

Question 2.3: Train OvA model on the Avila dataset

In [168]:
models, model_labels = trainOvA(x_train, y_train)

number of classes is 12
Train Logistic Regression model for class A
Train Logistic Regression model for class B
Train Logistic Regression model for class C
Train Logistic Regression model for class D
Train Logistic Regression model for class E
Train Logistic Regression model for class F
Train Logistic Regression model for class G
Train Logistic Regression model for class H
Train Logistic Regression model for class I
Train Logistic Regression model for class W
Train Logistic Regression model for class X
Train Logistic Regression model for class Y


Question 2.4: Predict and evaluate the results of your model

In [169]:
predicted_classes = predictOvA(models, model_labels, x_test)

Unnamed: 0,A,B,C,D,E,F,G,H,I,W,X,Y
0,0.589714,0.004225,0.014231,0.030204,0.227871,0.024916,0.008499,0.000271,9.294807e-08,0.220406,0.278997,0.047493
1,0.042894,9.7e-05,0.025909,0.012274,0.344777,0.064239,0.119464,0.33726,0.2617123,0.000162,0.48061,0.001315
2,0.21534,1e-06,0.011637,0.074561,0.119586,0.142983,0.071152,0.132069,0.7900001,0.00011,0.011605,0.00067
3,0.136859,5.3e-05,0.014832,0.011912,0.062806,0.104072,0.031091,0.025584,0.00492109,0.000127,0.00109,0.002176
4,0.658432,3e-06,0.005328,0.031513,0.096298,0.257332,0.045448,0.034323,0.0005213751,0.001761,0.001948,0.001207


In [170]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

ova_accuracy = accuracy_score(y_test, predicted_classes)
ova_confusion_matrix = confusion_matrix(y_test, predicted_classes)

print("Accuracy of OvA classifier is {}.".format(ova_accuracy))
print("Confusion matrix of OvA classifier: \n {}".format(ova_confusion_matrix))

Accuracy of OvA classifier is 0.5290792373287343.
Confusion matrix of OvA classifier: 
 [[4065   10    2    0   30  101    0   11   50    0   11    6]
 [   0    5    0    0    0    0    0    0    0    0    0    0]
 [  69    4    0    0   10   12    0    0    8    0    0    0]
 [ 302    0    0    0   13   13    0   12   11    0    2    0]
 [ 835    2    0    0  114   58    0   17   21    0   47    1]
 [1813   10    2    0    3   92    0    3   33    0    2    4]
 [ 353    0    0    0    4   81    0    5    1    0    3    0]
 [ 320    0    0    0   30  106    0   43   19    0    1    1]
 [  52    0    0    1    1    6    0    4  742    0   15   11]
 [  35    0    0    0    9    0    0    0    0    0    1    0]
 [  72    0    0    0    9    1    0    6   52    0  366   16]
 [  43    0    0    0    1    0    0    0  100    0   28   95]]
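
Overall accuracy hides how unevenly the classes are handled. The sketch below reads per-class recall off the OvA confusion matrix printed above (rows are true classes, columns are predicted classes, in sorted label order). The matrix is copied by hand from that output, and the variable names are illustrative.

```python
import numpy as np

labels = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "W", "X", "Y"]

# OvA confusion matrix copied from the output above (rows: true, columns: predicted).
cm = np.array([
    [4065, 10, 2, 0, 30, 101, 0, 11, 50, 0, 11, 6],
    [0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [69, 4, 0, 0, 10, 12, 0, 0, 8, 0, 0, 0],
    [302, 0, 0, 0, 13, 13, 0, 12, 11, 0, 2, 0],
    [835, 2, 0, 0, 114, 58, 0, 17, 21, 0, 47, 1],
    [1813, 10, 2, 0, 3, 92, 0, 3, 33, 0, 2, 4],
    [353, 0, 0, 0, 4, 81, 0, 5, 1, 0, 3, 0],
    [320, 0, 0, 0, 30, 106, 0, 43, 19, 0, 1, 1],
    [52, 0, 0, 1, 1, 6, 0, 4, 742, 0, 15, 11],
    [35, 0, 0, 0, 9, 0, 0, 0, 0, 0, 1, 0],
    [72, 0, 0, 0, 9, 1, 0, 6, 52, 0, 366, 16],
    [43, 0, 0, 0, 1, 0, 0, 0, 100, 0, 28, 95],
])

# Accuracy is the diagonal (correct predictions) over all samples.
accuracy = np.trace(cm) / cm.sum()

# Recall per class: correct predictions divided by the number of true samples of that class.
recall = dict(zip(labels, cm.diagonal() / cm.sum(axis=1)))

print(round(accuracy, 4))  # 0.5291
for label in labels:
    print(label, round(recall[label], 3))
```

Classes C, D, G and W have zero recall here: the OvA models never predict them on the test set, and most of their samples are assigned to class A.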


---

Question 3.1: Develop a method that trains a list of models based on the OvO strategy for multiclass classification using logistic regression. 

In [171]:
def trainOvO(x, y):
    """
    TODO: Train the multiclass classifier using OvO strategy. 
    """
    labels = sorted(y.unique())
    n_labels = len(labels)
    n_models = n_labels * (n_labels - 1) // 2
    print("number of classes is {}".format(n_labels))

    models: List[Any] = [None] * n_models
    model_labels: List[Any] = [None] * n_models 
    model_idx = 0
    for i in range(n_labels):
        for j in range(i+1, n_labels):
            label_i = labels[i]
            label_j = labels[j]
            print("Train Logistic Regression model to distinguish {} and {}".format(label_i, label_j))

            # keep only the rows of the two classes and relabel them for the
            # OvO strategy: 1 for label_i, 0 for label_j
            selected_rows = (y == label_i) | (y == label_j)
            train_y = (y[selected_rows] == label_i).astype(int)
            train_x = x[selected_rows]

            # construct the logistic regression instance
            lr = LogisticRegression(solver = 'liblinear')
            lr.fit(train_x, train_y)
            models[model_idx] = lr
            model_labels[model_idx] = (label_i, label_j)
            model_idx += 1
        
    return models, model_labels

Question 3.2: Write a method that leverages the multiple models trained for OvO and outputs the majority class.

In [172]:
def predictOvO(models, labels, x):
    """
    TODO: Make predictions on multiclass problems using the OvO strategy. 
    """
    if models is None:
        raise ValueError("The models have not been trained yet. Please call trainOvO() first.")

    # Collect one column of votes per pairwise model.
    votes = {}
    for i, (model, (label_i, label_j)) in enumerate(zip(models, labels)):
        # Each model predicts 1 for label_i and 0 for label_j.
        votes[i] = np.where(model.predict(x) == 1, label_i, label_j)
    predictions = pd.DataFrame(votes)

    # Majority vote per row; on a tie, mode() lists the tied labels in
    # sorted order, so the first one is returned.
    return predictions.mode(axis=1).iloc[:, 0].values
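
The voting step relies on `DataFrame.mode(axis=1)`, whose tie-breaking is easy to overlook. The sketch below uses a small hand-made vote table for a hypothetical 3-class problem (the column names and votes are invented for illustration) to show how the winner is picked.

```python
import pandas as pd

# Each row is one sample; each column is the vote of one pairwise classifier.
votes = pd.DataFrame({
    "A_vs_B": ["A", "B", "A"],
    "A_vs_C": ["A", "C", "C"],
    "B_vs_C": ["B", "B", "B"],
})

# mode(axis=1) returns the most frequent value(s) per row, sorted;
# taking the first column picks one winner per sample.
winners = votes.mode(axis=1).iloc[:, 0].values

print(list(winners))  # ['A', 'B', 'A']
```

The third sample is a three-way tie (one vote each for A, B and C), and the first label in sorted order wins. With many classes and uneven class sizes, this kind of tie-breaking can quietly favor labels that sort early.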

Question 3.3: Train OvO model on the Avila dataset

In [173]:
models, labels = trainOvO(x_train, y_train)

number of classes is 12
Train Logistic Regression model to distinguish A and B
Train Logistic Regression model to distinguish A and C
Train Logistic Regression model to distinguish A and D
Train Logistic Regression model to distinguish A and E
Train Logistic Regression model to distinguish A and F
Train Logistic Regression model to distinguish A and G
Train Logistic Regression model to distinguish A and H
Train Logistic Regression model to distinguish A and I
Train Logistic Regression model to distinguish A and W
Train Logistic Regression model to distinguish A and X
Train Logistic Regression model to distinguish A and Y
Train Logistic Regression model to distinguish B and C
Train Logistic Regression model to distinguish B and D
Train Logistic Regression model to distinguish B and E
Train Logistic Regression model to distinguish B and F
Train Logistic Regression model to distinguish B and G
Train Logistic Regression model to distinguish B and H
Train Logistic Regression model to distin

Question 3.4: Predict and evaluate the results of your model

In [174]:
te_z_ovo = predictOvO(models, labels, x_test)

display(te_z_ovo)

array(['W', 'I', 'I', ..., 'A', 'F', 'F'], dtype=object)

In [175]:

ovo_accuracy = accuracy_score(y_test, te_z_ovo)
ovo_confusion_matrix = confusion_matrix(y_test, te_z_ovo)


print("Accuracy of OvO classifier is {}.".format(ovo_accuracy))
print("Confusion matrix of OvO classifier: \n {}".format(ovo_confusion_matrix))

Accuracy of OvO classifier is 0.5489125227555811.
Confusion matrix of OvO classifier: 
 [[3903   10   27    2   54  122   14   13   67    0   72    2]
 [   0    5    0    0    0    0    0    0    0    0    0    0]
 [  55    4    0    0   20   10    0    4   10    0    0    0]
 [ 264    0    5    0   28    9    9    0    8    0   30    0]
 [ 603    2    0    0  225   49   45   33   24    0  109    5]
 [1633   10    2    6   12  235    2    9   34    0   15    4]
 [ 382    0    0    0    0   30   34    0    1    0    0    0]
 [ 259    0    0    0   36   53   63   93   13    0    2    1]
 [  44    0    1    0    1    9   14    3  722    1   19   18]
 [  32    0    0    0   11    0    0    0    0    2    0    0]
 [  49    0    0    0   34    2    2    1   48    3  367   16]
 [  24    0    0    0    5    1    1    0   38    0   55  143]]


Question 4.1: [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) within sklearn supports two approaches for solving multi-class problems: 'ovr' and 'multinomial'. Try out both approaches, then evaluate and compare their performance against what you developed in questions 2 and 3.

In [176]:
clf = LogisticRegression(solver='liblinear', multi_class='ovr').fit(x_train, y_train)
y_ovr = clf.predict(x_test)

ovr_accuracy = accuracy_score(y_test, y_ovr)
ovr_confusion_matrix = confusion_matrix(y_test, y_ovr)

print("Accuracy of sklearn OvR classifier is {}.".format(ovr_accuracy))
print("Confusion matrix of sklearn OvR classifier: \n {}".format(ovr_confusion_matrix))

Accuracy of sklearn OvR classifier is 0.5287917984095046.
Confusion matrix of sklearn OvR classifier: 
 [[4066   11    2    0   30   99    0   12   50    0   10    6]
 [   0    5    0    0    0    0    0    0    0    0    0    0]
 [  69    4    0    0   10   12    0    0    8    0    0    0]
 [ 302    0    0    0   13   13    0   12   11    0    2    0]
 [ 835    2    0    0  113   58    0   17   21    0   48    1]
 [1815   10    2    0    3   91    0    3   33    0    1    4]
 [ 353    0    0    0    4   81    0    5    1    0    3    0]
 [ 319    0    0    0   31  106    0   43   20    0    1    0]
 [  52    0    0    1    1    6    0    4  742    0   15   11]
 [  35    0    0    0    9    0    0    0    0    0    1    0]
 [  72    0    0    0    9    1    0    6   52    0  366   16]
 [  43    0    0    0    1    0    0    0  102    0   28   93]]


In [177]:
# multi_class = multinomial
clf = LogisticRegression(solver='lbfgs', multi_class='multinomial').fit(x_train, y_train)
y_multinomial = clf.predict(x_test)

multinomial_accuracy = accuracy_score(y_test, y_multinomial)
multinomial_confusion_matrix = confusion_matrix(y_test, y_multinomial)


print("Accuracy of sklearn multinomial classifier is {}.".format(multinomial_accuracy))
print("Confusion matrix of sklearn multinomial classifier: \n {}".format(multinomial_confusion_matrix))

Accuracy of sklearn multinomial classifier is 0.5421098016671457.
Confusion matrix of sklearn multinomial classifier: 
 [[3797   10   26   13   57  153    0   21   87    0  115    7]
 [   0    5    0    0    0    0    0    0    0    0    0    0]
 [  55    4    1    0   16   12    0    6    9    0    0    0]
 [ 251    0    0    0   20   14    0   12   18    0   38    0]
 [ 577    2    0    0  275   62    0   30   30    0  112    7]
 [1623   10    4    1   17  219    0   20   48    0   15    5]
 [ 363    0    0    0   11   54    0   18    1    0    0    0]
 [ 271    0    0    0   51   62    0  114   19    0    2    1]
 [  27    0    0    1    2   14    0    7  740    0   25   16]
 [  31    0    0    0   12    0    0    0    0    2    0    0]
 [  36    0    0    0   24    4    0    2   40    9  384   23]
 [  24    0    0    0    5    1    0    0   69    0   47  121]]


# Interpretation of the results

The manual OvA implementation and sklearn's 'ovr' mode give nearly identical results: their accuracies (0.5291 and 0.5288) differ by about 3 × 10^-4, and no cell of the two confusion matrices differs by more than 2. The manual OvO implementation (0.5489) and sklearn's 'multinomial' mode (0.5421) are different methods, so they are not expected to match, but their accuracies still land within about 0.007 of each other.
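
The accuracy gaps can be checked directly from the numbers reported above. The values in the sketch below are copied by hand from the notebook outputs; the dictionary keys are illustrative names.

```python
# Accuracies copied from the outputs above.
accuracy = {
    "manual OvA": 0.5290792373287343,
    "sklearn ovr": 0.5287917984095046,
    "manual OvO": 0.5489125227555811,
    "sklearn multinomial": 0.5421098016671457,
}

# Like-for-like comparison: both are one-vs-rest logistic regression.
ova_gap = abs(accuracy["manual OvA"] - accuracy["sklearn ovr"])

# Different methods: pairwise voting vs. a single multinomial model.
ovo_vs_multinomial_gap = abs(accuracy["manual OvO"] - accuracy["sklearn multinomial"])

print(f"{ova_gap:.4f}")                 # 0.0003
print(f"{ovo_vs_multinomial_gap:.4f}")  # 0.0068
```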



Question 4: Create a new text cell in your Notebook and write a 50-100 word summary (or a short description of your thinking in applying this week's learning to the solution) of your experience in this assignment. Include: what your incoming experience with this model was, if any; what steps you took; what obstacles you encountered; and how you link this exercise to real-world machine learning problem-solving (What steps were missing? What else do you need to learn?). This summary allows your instructor to know how you are doing and allot points for your effort in thinking and planning, and making connections to real-world work.

I have fairly strong experience with Python, so the implementation part was not very difficult for me. The hardest part was understanding the expected output of each function. I'm still no expert in Pandas and sklearn, so working that out was a bit tricky at times.

I generally don't have a lot of experience with multiclass classification in real life, other than the exercises in the first part of the course. It is interesting to see that the results of the manual implementation are very similar to those of the one provided by sklearn.

I'm still confused about the data standardization. I performed the standardization, although I'm not sure what the benefit would be in this case.