# Multiclass Classification

While binary classifiers distinguish between two classes (e.g., detecting whether a transaction is fraudulent, or classifying an email as spam or non-spam), multiclass classifiers distinguish between more than two classes.

There are several ways to perform multiclass classification by combining binary classifiers. In this exercise, you will implement two such strategies: the _One-versus-all_ strategy and the _One-versus-one_ strategy.

- **One-versus-all (OvA)**: In this strategy, we train a single binary classifier per class, with the samples of that class as positive samples and all other samples as negatives. During inference, we get the prediction from each classifier and select the class with the highest score. This strategy is also called the one-versus-the-rest strategy.

- **One-versus-one (OvO)**: In this strategy, we train a binary classifier for every pair of classes. If there are N classes in the problem, you need to train N * (N-1) / 2 classifiers. During inference, we run all N * (N-1) / 2 classifiers and see which class wins the most votes. The main advantage of the OvO strategy is that each binary classifier only needs to be trained on the part of the training dataset that belongs to the two classes it must separate.
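To make the classifier counts concrete, the following minimal sketch enumerates the binary problems each strategy creates for the 12 copyist labels used later in this notebook. The variable names are illustrative only.

```python
from itertools import combinations

# Class labels of the Avila data set used later in this notebook
labels = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "W", "X", "Y"]

# OvA: one binary problem per class (that class versus everything else)
ova_problems = [(label, "rest") for label in labels]

# OvO: one binary problem per unordered pair of classes
ovo_problems = list(combinations(labels, 2))

n = len(labels)
print(len(ova_problems))   # N classifiers
print(len(ovo_problems))   # N * (N - 1) / 2 classifiers
```

With 12 classes this gives 12 OvA classifiers and 66 OvO classifiers.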

In [1]:
# import packages
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

# make this notebook's output stable across runs
np.random.seed(0)

## Avila Dataset

In this lab assignment, we use the [Avila](https://archive.ics.uci.edu/ml/datasets/Avila) data set, which was extracted from 800 images of the "Avila Bible", a giant Latin copy of the whole Bible produced during the XII century between Italy and Spain.  
The palaeographic analysis of the manuscript has identified 12 copyists. The pages written by each copyist are not equally numerous, so the classes are imbalanced.
Each pattern contains 10 features and corresponds to a group of 4 consecutive rows.

The prediction task consists of associating each pattern with one of the 12 copyists (labeled A, B, C, D, E, F, G, H, I, W, X, Y).
The data have been normalized using the Z-normalization method and divided into two data sets: a training set containing 10430 samples and a test set containing 10437 samples.
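Z-normalization rescales each feature to zero mean and unit standard deviation, z = (x - mean) / std. A minimal sketch on a small made-up array (the values are illustrative, not taken from the Avila data):

```python
import numpy as np

# Toy feature matrix: 4 samples, 2 features (made-up values)
features = np.array([[1.0, 10.0],
                     [2.0, 20.0],
                     [3.0, 30.0],
                     [4.0, 40.0]])

# Column-wise statistics
mean = features.mean(axis=0)
std = features.std(axis=0)  # population standard deviation (ddof=0)

# z = (x - mean) / std, applied per column
z_scores = (features - mean) / std

print(z_scores.mean(axis=0))  # approximately 0 for every column
print(z_scores.std(axis=0))   # approximately 1 for every column
```

The population standard deviation (ddof=0) is used here because it is also what `StandardScaler` computes.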


In [17]:
# Load train and test data from CSV files.
train = pd.read_csv("avila-tr.txt", header=None)
test = pd.read_csv("avila-ts.txt", header=None)

x_train = train.iloc[:,:-1]
y_train = train.iloc[:,-1]

x_test = test.iloc[:,:-1]
y_test = test.iloc[:,-1]

In [18]:
# Output the number of samples in the train and test datasets.
print(len(train))
print(len(test))

10430
10437


Question 1.1: Check for missing Data

In [19]:
x_train[x_train.isna().any(axis=1)]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9


In [20]:
y_train[y_train.isna()]

Series([], Name: 10, dtype: object)

In [21]:
x_test[x_test.isna().any(axis=1)]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9


In [22]:
y_test[y_test.isna()]

Series([], Name: 10, dtype: object)

Question 1.2: Apply Z-normalization to data

In [23]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

x_train = pd.DataFrame(scaler.fit_transform(x_train), columns=x_train.columns)
x_test = pd.DataFrame(scaler.transform(x_test), columns=x_train.columns)

Question 2.1: Write a method to train multiple logistic regression models performing One vs All (OvA) classification. The method takes the training features and the target, and returns a list of models and their associated labels.
Within the method:
- Determine the list of classes
- Create a place to store all the models
- For each class, train a model with the target variable set to 1 for that class and 0 for all other classes
- Return the list of trained models and their associated labels.

In [24]:
def trainOvA(x, y):
    """
    Train the multiclass classifier using OvA strategy. 
    """
    labels = sorted(y.unique())
    n_labels = len(labels)
    models = [None for _ in range(n_labels)]
    model_labels = [None for _ in range(n_labels)]
    print("number of classes is {}".format(n_labels))

    #Create model
    for i, label in enumerate(labels): # more pythonic this way
        print("Train Logistic Regression model for class {}".format(label))

        # update the label according to OvA strategy
        ova_train = np.where(y == label, 1, 0)
        
        # Train model
        models[i] = LogisticRegression(solver='lbfgs').fit(x, ova_train)
        model_labels[i] = label
    return models, model_labels

Question 2.2: Write a method that leverages the multiple models trained for OvA and outputs the class with the highest score.

In [25]:
def predictOvA(models, labels, x):
    """
    Make predictions on multiclass problems using the OvA strategy.

    Returns a one-hot DataFrame (one column per label) that marks, for each
    row of x, the class whose binary model gives the highest probability.
    """
    if models is None:
        raise ValueError("The model has not been trained yet. Please call trainOvA() first.")
    
    predictions = pd.DataFrame()
    
    # Score every sample with each binary model (probability of the positive class)
    for label, model in zip(labels, models):
        predictions[label] = model.predict_proba(x)[:, 1]

    # Binarize the output: 1 where a column equals the row maximum, else 0
    return predictions.eq(predictions.where(predictions != 0).max(axis=1), axis=0).astype(int)
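The one-hot output above is convenient for `multilabel_confusion_matrix`, but the same decision rule can also return one label per row. The sketch below applies the arg-max rule to a small made-up score matrix; `scores` and `class_labels` are hypothetical names, not part of the assignment code.

```python
import numpy as np

# Hypothetical positive-class probabilities: one row per sample, one column per class
class_labels = np.array(["A", "B", "C"])
scores = np.array([[0.7, 0.2, 0.4],
                   [0.1, 0.3, 0.9],
                   [0.2, 0.6, 0.5]])

# OvA decision rule: pick the class whose binary model is most confident
best_column = scores.argmax(axis=1)
predicted_labels = class_labels[best_column]

print(predicted_labels)
```

Note that `argmax` resolves ties by taking the first maximal column, whereas comparing every column against the row maximum flags all tied columns.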

Question 2.3: Train OvA model on the Avila dataset

In [26]:
from sklearn.linear_model import LogisticRegression

models, model_labels = trainOvA(x_train, y_train)

number of classes is 12
Train Logistic Regression model for class A
Train Logistic Regression model for class B
Train Logistic Regression model for class C
Train Logistic Regression model for class D
Train Logistic Regression model for class E
Train Logistic Regression model for class F
Train Logistic Regression model for class G
Train Logistic Regression model for class H
Train Logistic Regression model for class I
Train Logistic Regression model for class W
Train Logistic Regression model for class X
Train Logistic Regression model for class Y


Question 2.4: Predict and evaluate the results of your model

In [27]:
predictions = predictOvA(models, model_labels, x_test)
predictions.shape
#te_z_ova = predictOvA(#to do)

(10437, 12)

In [28]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import multilabel_confusion_matrix

bin_y_test = pd.get_dummies(y_test)
ova_accuracy = accuracy_score(bin_y_test, predictions)
ova_confusion_matrix = multilabel_confusion_matrix(bin_y_test, predictions)

print("Accuracy of OvA classifier is {}.".format(ova_accuracy))
print("Confusion matrix of OvA classifier: \n {}".format(ova_confusion_matrix))

Accuracy of OvA classifier is 0.5308996838171889.
Confusion matrix of OvA classifier: 
 [[[ 1917  4234]
  [  127  4159]]

 [[10432     0]
  [    0     5]]

 [[10334     0]
  [  103     0]]

 [[10084     0]
  [  353     0]]

 [[ 9231   111]
  [ 1019    76]]

 [[ 8244   231]
  [ 1907    55]]

 [[ 9990     0]
  [  447     0]]

 [[ 9872    45]
  [  480    40]]

 [[ 9450   155]
  [   99   733]]

 [[10392     0]
  [   45     0]]

 [[ 9839    76]
  [  175   347]]

 [[10126    44]
  [  141   126]]]
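Each 2x2 block printed by `multilabel_confusion_matrix` is laid out as `[[TN, FP], [FN, TP]]` for one class, in the sorted label order A through Y. As a sanity check, the sketch below copies the blocks from the output above and recomputes per-class recall and the overall accuracy. Because every sample receives exactly one predicted class here (assuming no ties), the accuracy equals the sum of the true positives divided by the number of test samples.

```python
# Per-class blocks copied from the OvA output above: label -> [[TN, FP], [FN, TP]]
ova_blocks = {
    "A": [[1917, 4234], [127, 4159]],
    "B": [[10432, 0], [0, 5]],
    "C": [[10334, 0], [103, 0]],
    "D": [[10084, 0], [353, 0]],
    "E": [[9231, 111], [1019, 76]],
    "F": [[8244, 231], [1907, 55]],
    "G": [[9990, 0], [447, 0]],
    "H": [[9872, 45], [480, 40]],
    "I": [[9450, 155], [99, 733]],
    "W": [[10392, 0], [45, 0]],
    "X": [[9839, 76], [175, 347]],
    "Y": [[10126, 44], [141, 126]],
}

n_test = 10437  # size of the test set
total_tp = 0
for label, ((tn, fp), (fn, tp)) in ova_blocks.items():
    # Recall: share of this copyist's samples that were correctly recovered
    recall = tp / (tp + fn)
    print("class {}: recall = {:.3f}".format(label, recall))
    total_tp += tp

accuracy = total_tp / n_test
print("accuracy = {}".format(accuracy))
```

The recomputed accuracy agrees with the reported value, and the per-class recall shows that several copyists (C, D, G, W) are never predicted at all.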


Question 3.1: Develop a method that trains a list of models based on the OvO strategy for multiclass classification using logistic regression.

In [29]:
def trainOvO(x, y):
    """
    Train the multiclass classifier using OvO strategy.

    Returns one binary model per pair of classes, plus the
    (label_i, label_j) tuple that each model separates.
    """
    labels = sorted(y.unique())
    n_labels = len(labels)
    print("number of classes is {}".format(n_labels))

    models = []
    model_labels = []
    for i in range(n_labels):
        for j in range(i+1, n_labels):
            label_i = labels[i]
            label_j = labels[j]
            print("Train Logistic Regression model to distinguish {} and {}".format(label_i, label_j))

            # keep only the samples of the two classes, according to OvO strategy
            # (positional boolean mask, so x and y only need to be row-aligned)
            pair_mask = ((y == label_i) | (y == label_j)).to_numpy()
            train_x = x[pair_mask]
            
            # binarize target: label_i -> 0, label_j -> 1
            train_y = np.where(y[pair_mask] == label_i, 0, 1)

            # construct and fit the logistic regression instance
            model = LogisticRegression(solver = 'liblinear').fit(train_x, train_y)
            models.append(model)
            model_labels.append((label_i, label_j))
        
    return models, model_labels

Question 3.2: Write a method that leverages the multiple models trained for OvO and outputs the majority class.

In [30]:
def predictOvO(models, labels, x):
    """
    Make predictions on multiclass problems using the OvO strategy.

    Each pairwise model votes for one of its two classes; the class with
    the most votes is returned for every row of x.
    """
    if models is None:
        raise ValueError("The model has not been trained yet. Please call trainOvO() first.")

    predictions = pd.DataFrame()
    for label_tup, model in zip(labels, models):
        label_i = label_tup[0]
        label_j = label_tup[1]
        col_name = f'{label_i}_vs_{label_j}'
        predict_bin = model.predict(x) # 0 or 1
        predictions[col_name] = np.where(predict_bin == 0, label_i, label_j)
        
    # Majority vote per row; on a tie, column 0 of mode() holds one of the tied labels
    return predictions.mode(axis=1)[0]
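The voting step can also be illustrated without pandas. The sketch below tallies pairwise votes for a single sample using `collections.Counter`; the votes are made up for illustration, and ties are broken by taking the alphabetically first label, which is an assumption of this sketch rather than a requirement of OvO.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label with the most votes; ties go to the smallest label."""
    counts = Counter(votes)
    top = max(counts.values())
    return min(label for label, count in counts.items() if count == top)

# Hypothetical votes from the three pairwise models of a 3-class problem
# (A vs B, A vs C, B vs C)
sample_votes = ["A", "C", "C"]
print(majority_vote(sample_votes))  # C

# A three-way tie falls back to the alphabetically first label
print(majority_vote(["A", "C", "B"]))  # A
```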

Question 3.3: Train OvO model on the Avila dataset

In [31]:
models, labels = trainOvO(x_train, y_train)

number of classes is 12
Train Logistic Regression model to distinguish A and B
Train Logistic Regression model to distinguish A and C
Train Logistic Regression model to distinguish A and D
Train Logistic Regression model to distinguish A and E
Train Logistic Regression model to distinguish A and F
Train Logistic Regression model to distinguish A and G
Train Logistic Regression model to distinguish A and H
Train Logistic Regression model to distinguish A and I
Train Logistic Regression model to distinguish A and W
Train Logistic Regression model to distinguish A and X
Train Logistic Regression model to distinguish A and Y
Train Logistic Regression model to distinguish B and C
Train Logistic Regression model to distinguish B and D
Train Logistic Regression model to distinguish B and E
Train Logistic Regression model to distinguish B and F
Train Logistic Regression model to distinguish B and G
Train Logistic Regression model to distinguish B and H
Train Logistic Regression model to distin

Question 3.4: Predict and evaluate the results of your model

In [32]:
ovo_pred = predictOvO(models, labels, x_test)
ovo_pred
#te_z_ovo = predictOvO(#to do)

0        A
1        I
2        I
3        A
4        A
        ..
10432    X
10433    A
10434    A
10435    F
10436    F
Name: 0, Length: 10437, dtype: object

In [33]:
ovo_pred_bin = pd.get_dummies(ovo_pred)
ovo_pred_bin

Unnamed: 0,A,B,C,D,E,F,G,H,I,W,X,Y
0,1,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,1,0,0,0
2,0,0,0,0,0,0,0,0,1,0,0,0
3,1,0,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
10432,0,0,0,0,0,0,0,0,0,0,1,0
10433,1,0,0,0,0,0,0,0,0,0,0,0
10434,1,0,0,0,0,0,0,0,0,0,0,0
10435,0,0,0,0,0,1,0,0,0,0,0,0


In [34]:
ovo_accuracy = accuracy_score(bin_y_test, ovo_pred_bin)
ovo_confusion_matrix = multilabel_confusion_matrix(bin_y_test, ovo_pred_bin)


print("Accuracy of OvO classifier is {}.".format(ovo_accuracy))
print("Confusion matrix of OvO classifier: \n {}".format(ovo_confusion_matrix))

Accuracy of OvO classifier is 0.5720992622401073.
Confusion matrix of OvO classifier: 
 [[[ 2670  3481]
  [  243  4043]]

 [[10432     0]
  [    0     5]]

 [[10307    27]
  [  103     0]]

 [[10078     6]
  [  353     0]]

 [[ 9101   241]
  [  768   327]]

 [[ 8163   312]
  [ 1755   207]]

 [[ 9987     3]
  [  440     7]]

 [[ 9810   107]
  [  375   145]]

 [[ 9473   132]
  [  109   723]]

 [[10381    11]
  [   45     0]]

 [[ 9829    86]
  [  179   343]]

 [[10110    60]
  [   96   171]]]


Question 4.1: [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) within sklearn supports two approaches for solving multi-class problems: 'ovr' and 'multinomial'. Try out both approaches, and compare their performance against what you developed in questions 2 and 3.

In [35]:
clf = LogisticRegression(multi_class='ovr').fit(x_train, y_train)
y_ovr = pd.DataFrame(clf.predict_proba(x_test))

# binarize output
y_ovr = y_ovr.eq(y_ovr.where(y_ovr != 0).max(1), axis=0).astype(int)

ovr_accuracy = accuracy_score(bin_y_test, y_ovr)
ovr_confusion_matrix = multilabel_confusion_matrix(bin_y_test, y_ovr)


print("Accuracy of OvR classifier is {}.".format(ovr_accuracy))
print("Confusion matrix of OvR classifier: \n {}".format(ovr_confusion_matrix))

Accuracy of OvR classifier is 0.5308996838171889.
Confusion matrix of OvR classifier: 
 [[[ 1917  4234]
  [  127  4159]]

 [[10432     0]
  [    0     5]]

 [[10334     0]
  [  103     0]]

 [[10084     0]
  [  353     0]]

 [[ 9231   111]
  [ 1019    76]]

 [[ 8244   231]
  [ 1907    55]]

 [[ 9990     0]
  [  447     0]]

 [[ 9872    45]
  [  480    40]]

 [[ 9450   155]
  [   99   733]]

 [[10392     0]
  [   45     0]]

 [[ 9839    76]
  [  175   347]]

 [[10126    44]
  [  141   126]]]


In [36]:
#class = multinomial
clf = LogisticRegression(multi_class='multinomial', max_iter=500).fit(x_train, y_train)
y_multinomial = pd.DataFrame(clf.predict_proba(x_test))

# binarize output
y_multinomial = y_multinomial.eq(y_multinomial.where(y_multinomial != 0).max(1), axis=0).astype(int)

multinomial_accuracy = accuracy_score(bin_y_test, y_multinomial)
multinomial_confusion_matrix = multilabel_confusion_matrix(bin_y_test, y_multinomial)


print("Accuracy of multinomial classifier is {}.".format(multinomial_accuracy))
print("Confusion matrix of multinomial classifier: \n {}".format(multinomial_confusion_matrix))

Accuracy of multinomial classifier is 0.5615598352016863.
Confusion matrix of multinomial classifier: 
 [[[ 2569  3582]
  [  245  4041]]

 [[10432     0]
  [    0     5]]

 [[10331     3]
  [  103     0]]

 [[10079     5]
  [  353     0]]

 [[ 9103   239]
  [  825   270]]

 [[ 8163   312]
  [ 1784   178]]

 [[ 9990     0]
  [  447     0]]

 [[ 9807   110]
  [  424    96]]

 [[ 9485   120]
  [  106   726]]

 [[10386     6]
  [   43     2]]

 [[ 9789   126]
  [  150   372]]

 [[10097    73]
  [   96   171]]]


Question 4: Create a new text cell in your Notebook and complete a 50-100 word summary (or a short description of your thinking in applying this week's learning to the solution) of your experience in this assignment. Include: What was your incoming experience with this model, if any? What steps did you take, and what obstacles did you encounter? How do you link this exercise to real-world machine learning problem-solving? (What steps were missing? What else do you need to learn?) This summary lets your instructor know how you are doing and allot points for your effort in thinking, planning, and making connections to real-world work.

I had experience training multilabel classification models before, but it was useful to have to implement the One-vs-All and One-vs-One strategies myself. Most of the difficulty came from data prep. I was confused as to why we were asked to z-normalize data that had already been z-normalized. Also, I had to come up with a hackish way to binarize my predictions in order to use sklearn's accuracy score. Normally, something simple like pd.get_dummies() would work just fine; the problem is that, for some of the models, there would be a few output classes that the model never once predicted. So, those columns would be missing from the output matrix, which would give it the incorrect shape. 
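One way to avoid the missing-column problem described above is to reindex the one-hot frame against the full label list. The sketch below uses a made-up prediction series in which class "C" is never predicted; `all_labels` is a hypothetical name for the sorted list of classes.

```python
import pandas as pd

all_labels = ["A", "B", "C"]
predicted = pd.Series(["A", "B", "A", "B"])  # "C" is never predicted

# get_dummies only creates columns for labels that actually occur
one_hot = pd.get_dummies(predicted)
print(list(one_hot.columns))

# Reindexing adds the missing columns, filled with zeros, in a fixed order
one_hot_full = one_hot.reindex(columns=all_labels, fill_value=0).astype(int)
print(one_hot_full.shape)
```

Alternatively, `accuracy_score` accepts label vectors directly, so comparing `y_test` with the predicted labels avoids binarizing altogether.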