## Multi Class Classification

Binary classifiers distinguish between two classes.

Multiclass classifiers (also called multinomial classifiers) distinguish between more than two classes.



#### How algorithms train a multiclass classifier

Some algorithms, such as Random Forest or Naive Bayes classifiers, are capable of handling multiple classes directly.

Others, such as SVM or linear classifiers, are strictly binary classifiers. However, there are various strategies that you can use to perform multiclass classification using several binary classifiers.

The two main strategies are One-versus-All (OvA, also called One-versus-Rest) and One-versus-One (OvO).

**One-versus-All**: For example, one way to create a system that can classify the digit images into 10 classes (from 0 to 9) is to train 10 binary classifiers, one for each digit (0-detector, 1-detector, 2-detector and so on).

Then, when you want to classify an image, you get the decision score from each classifier for that image and you select the class whose classifier outputs the highest score.

One-vs-all classification is a method which involves training N distinct binary classifiers, each designed for recognizing a particular class. Then those N classifiers are collectively used for multi-class classification.
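As a rough sketch of this idea (not a line-by-line account of what Scikit-Learn does internally), the snippet below wraps an `SGDClassifier` in `OneVsRestClassifier` and checks that picking the highest decision score reproduces the prediction. The small `load_digits` dataset bundled with Scikit-Learn is used as a stand-in for MNIST, which is an assumption made only to keep the example self-contained.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier

# Stand-in data (assumption): 8x8 digit images bundled with Scikit-Learn
X_demo, y_demo = load_digits(return_X_y=True)

# One binary SGD classifier is trained per digit class
ova_clf = OneVsRestClassifier(SGDClassifier(random_state=42))
ova_clf.fit(X_demo, y_demo)
n_detectors = len(ova_clf.estimators_)  # one detector per class

# One decision score per class; the class with the highest score wins
scores = ova_clf.decision_function(X_demo[:20])
manual_pred = ova_clf.classes_[np.argmax(scores, axis=1)]
same_as_predict = bool(np.array_equal(manual_pred, ova_clf.predict(X_demo[:20])))
```

Here `n_detectors` should equal the number of classes, and `same_as_predict` confirms that the arg-max over the per-class scores is the predicted label.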

**One-versus-one**: considers each pair of classes and trains a binary classifier on the subset of the data containing only those two classes. For n classes it therefore trains a total of n * (n - 1) / 2 classifiers. During the classification phase each classifier predicts one of its two classes, and the class that wins the most of these duels is selected.

For example, on the MNIST data it trains one classifier to distinguish 0s and 1s, another for 0s and 2s, another for 0s and 3s, another for 1s and 2s, and so on. For the MNIST problem this means 10 * 9 / 2 = 45 classifiers.
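The pair count can be checked with a few lines of plain Python. This is only an illustration of the counting argument; the variable names are made up for the example.

```python
from itertools import combinations

digits = range(10)  # the ten MNIST classes, 0 through 9

# Every unordered pair of classes gets its own binary classifier
class_pairs = list(combinations(digits, 2))
n_pairs = len(class_pairs)

print(class_pairs[:3])  # the first pairs: 0 vs 1, 0 vs 2, 0 vs 3
print(n_pairs)          # n * (n - 1) / 2 with n = 10
```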

The main advantage of OvO is that each classifier only needs to be trained on the part of the training set for the two classes that it must distinguish.

Some algorithms, like SVM, scale poorly with the size of the training set, so for these algorithms OvO is preferred, since it is faster to train many classifiers on small training sets than a few classifiers on large training sets. For most binary classification algorithms, OvA is preferred.

Scikit-Learn detects when you try to use a binary classification algorithm for a multiclass classification task, and it automatically runs OvA (except for SVM classifiers, for which it uses OvO).
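One way to observe this behaviour is to look at the shapes the fitted models expose. The sketch below assumes the bundled `load_digits` dataset (10 classes, 64 features) instead of MNIST, purely to keep it quick to run.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC

X_demo, y_demo = load_digits(return_X_y=True)

# Linear model: one weight vector per class, i.e. one binary problem per class
sgd_demo = SGDClassifier(random_state=42).fit(X_demo, y_demo)
sgd_shape = sgd_demo.coef_.shape

# SVC trains one classifier per pair of classes; asking for the "ovo"
# decision function exposes one score per pair
svc_demo = SVC(decision_function_shape="ovo").fit(X_demo, y_demo)
svc_shape = svc_demo.decision_function(X_demo[:1]).shape

print(sgd_shape, svc_shape)
```

With 10 classes, the linear model holds 10 weight vectors while the SVM reports 45 pairwise scores per instance.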


#### Multiclass classification for MNIST data

#### using One-versus-All

In [2]:
import numpy as np
import pandas as pd

train = pd.read_csv('mtrain.csv')
train = np.array(train.values)

X = train[:, 1:]
y = train[:, 0]  # 1-D label vector; a column slice such as train[:, :1] triggers a DataConversionWarning in fit()

X_train, X_test, y_train, y_test = X[:35000], X[35000:], y[:35000], y[35000:]

shuffle_index = np.random.permutation(35000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]

X_train[:5]

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [3]:
from sklearn.linear_model import SGDClassifier

some_digit = X[11000] # index 11000, label is 2

sgd_clf = SGDClassifier()
sgd_clf.fit(X_train, y_train) 
sgd_clf.predict([some_digit])


array([2], dtype=int64)

The code trains the SGDClassifier on the training set using the original target classes from 0 to 9, instead of binary 2-versus-all target classes (y_train_2). Then it makes a prediction.

Under the hood, Scikit-Learn actually trained 10 binary classifiers, got their decision scores for the image, and selected the class with the highest score.

To see that this is indeed the case, you can call the decision_function() method. Instead of returning just one score per instance, it now returns 10 scores, one per class:

In [4]:
some_digit_scores = sgd_clf.decision_function([some_digit])
some_digit_scores

array([[-1739047.29681772,  -412763.97665615,   422638.35572991,
         -374490.40558151,  -669033.22982208,  -863015.9540521 ,
         -481311.03910297, -1126483.94173421,  -359409.30211204,
         -581594.84802513]])

In [5]:
# the highest score is the one for class 2

np.argmax(some_digit_scores)

2

In [6]:
sgd_clf.classes_

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int64)

When a classifier is trained, it stores the list of target classes in its classes_ attribute, ordered by value. In this case, the index of
each class in the classes_ array conveniently matches the class itself (e.g., the class at index 5 happens to be class 5), but in
general you won’t be so lucky.
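When the labels are not the integers 0 to N-1, the index returned by np.argmax has to be looked up in `classes_`. The following sketch assumes the bundled iris dataset with string labels, chosen only because an index cannot be confused with a label there.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier

iris = load_iris()
X_demo = iris.data
# String labels (assumption for illustration), so an index cannot pass for a label
y_demo = iris.target_names[iris.target]

clf = SGDClassifier(random_state=42).fit(X_demo, y_demo)

scores = clf.decision_function(X_demo[:5])
best_index = np.argmax(scores, axis=1)   # positions in classes_, not labels
best_label = clf.classes_[best_index]    # look the labels up explicitly
labels_match = bool(np.array_equal(best_label, clf.predict(X_demo[:5])))

print(clf.classes_)
```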

In [7]:
# decision scores per class for two instances
# row 0: the highest score is in column 2, so the prediction is digit 2
# row 1: the highest score is in column 3, so the prediction is digit 3

pd.DataFrame(sgd_clf.decision_function([X[11000], X[1233]]), columns=sgd_clf.classes_)

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1739047.0 | -412763.976656 | 422638.35573 | -374490.405582 | -669033.2 | -863015.954052 | -481311.0 | -1126484.0 | -359409.302112 | -581594.8 |
| 1 | -879454.3 | -510969.023276 | -448713.754225 | 313750.379877 | -1227551.0 | -178742.066132 | -1032710.0 | -1045567.0 | -91315.392488 | -1187300.0 |


#### using One-versus-One

If you want to force Scikit-Learn to use one-versus-one or one-versus-all, you can use the *OneVsOneClassifier* or *OneVsRestClassifier* classes.

Simply create an instance and pass a binary classifier to its constructor.

In [8]:
from sklearn.multiclass import OneVsOneClassifier

ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42))
ovo_clf.fit(X_train, y_train)
ovo_clf.predict([some_digit])

array([2], dtype=int64)

In [9]:
# number of binary classifiers trained by OvO: 10 * 9 / 2 = 45

len(ovo_clf.estimators_)

45

In [10]:
from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier()
forest_clf.fit(X_train, y_train)
forest_clf.predict([some_digit])



array([2], dtype=int64)

This time Scikit-Learn does not have to run OvA or OvO because Random Forest classifiers can directly classify instances into multiple classes.

You can call **predict_proba()** to get the list of probabilities that the classifier assigned to each instance for each class.

In [11]:
# the classifier is fairly confident about its prediction: 0.8 at index 2, i.e. class 2

forest_clf.predict_proba([some_digit])

array([[0.1, 0. , 0.8, 0. , 0.1, 0. , 0. , 0. , 0. , 0. ]])

In [12]:
# row 0: 0.8 at index 6 (the 7th position) means the model estimates an 80% probability that the image represents a 6

print(forest_clf.predict_proba([X[1200],X[11250], X[100]]))
print(forest_clf.predict([X[1200],X[11250], X[100]]))

[[0.1 0.  0.  0.  0.  0.1 0.8 0.  0.  0. ]
 [1.  0.  0.  0.  0.  0.  0.  0.  0.  0. ]
 [0.  0.  0.  0.  0.  0.  0.  0.  0.  1. ]]
[6 0 9]
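Two properties of these outputs are easy to verify: every row of predict_proba() sums to 1, and predict() returns the class with the highest probability. The sketch below checks both on the bundled `load_digits` dataset, which is assumed here only as a small stand-in for the MNIST file used above.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = load_digits(return_X_y=True)

forest_demo = RandomForestClassifier(n_estimators=10, random_state=42)
forest_demo.fit(X_demo, y_demo)

proba = forest_demo.predict_proba(X_demo[:5])  # shape: (5 instances, 10 classes)
row_sums = proba.sum(axis=1)                   # each row is a probability distribution
top_class = forest_demo.classes_[np.argmax(proba, axis=1)]
agrees = bool(np.array_equal(top_class, forest_demo.predict(X_demo[:5])))

print(proba.round(1))
```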


#### Evaluating the model with cross validation

In [13]:
from sklearn.model_selection import cross_val_score

# accuracy scores in all three folds
cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring='accuracy')



array([0.85998286, 0.88059318, 0.85948217])

#### StandardScaling to improve accuracy

In [14]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))

cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring='accuracy')



array([0.90822622, 0.90733756, 0.91186557])

As you can see, scaling the inputs improved the accuracy from roughly 0.86 to 0.88 per fold to about 0.91 on every fold.
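The cell above scales the whole training set once before cross-validation, so each validation fold also contributes to the scaling statistics. A possible alternative, sketched here on the bundled `load_digits` data as an assumption, is to put the scaler and the classifier in a pipeline so that the scaler is fitted on the training part of each fold only.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_digits(return_X_y=True)

# The scaler is re-fitted on the training part of each fold only
scaled_sgd = make_pipeline(StandardScaler(), SGDClassifier(random_state=42))
fold_scores = cross_val_score(scaled_sgd, X_demo, y_demo, cv=3, scoring="accuracy")

print(fold_scores)
```

On a dataset of this size the difference is usually small, but the pipeline keeps the validation folds strictly unseen during preprocessing.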

### Calculating Metrics for Multiclass problems

When you have a multiclass problem, the average parameter of f1_score, precision_score and recall_score cannot stay at its default value 'binary'. The usual choices are:

 1. weighted

 2. micro

 3. macro
 
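'weighted' calculates the F1 score for each class independently, but when it combines them it uses a weight that depends on the number of true labels of each class:

$$F1_{weighted} = W_1 \cdot F1_{class_1} + W_2 \cdot F1_{class_2} + \cdots + W_N \cdot F1_{class_N}$$

where $W_i$ is the fraction of instances whose true label is class $i$, therefore favouring the majority class.


'micro' pools the TP, FN and FP counts of all classes and calculates a single F1 directly from the totals:

$$F1_{micro} = \frac{2\,TP}{2\,TP + FP + FN}, \qquad TP = \sum_i TP_i,\; FP = \sum_i FP_i,\; FN = \sum_i FN_i$$

Every instance counts equally, so no class is weighted in particular.

Finally, 'macro' calculates the F1 score separately for each class and takes the unweighted mean:

$$F1_{macro} = \frac{1}{N}\left(F1_{class_1} + F1_{class_2} + \cdots + F1_{class_N}\right)$$

which results in a bigger penalisation when your model does not perform well on the minority classes.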

The one to use depends on what you want to achieve. If you are worried about class imbalance, I would suggest using 'macro'. However, it might also be worthwhile implementing some of the techniques available to tackle imbalance problems, such as downsampling the majority class, upsampling the minority class, SMOTE, etc.
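The three averages can be computed by hand from a confusion matrix. The sketch below uses a small hypothetical 3-class matrix with one large, one medium and one small class; the numbers are invented purely to show how the averages diverge under imbalance.

```python
# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted)
cm = [
    [90, 5, 5],
    [10, 25, 5],
    [4, 2, 4],
]
n = len(cm)
total = sum(sum(row) for row in cm)

tp = [cm[i][i] for i in range(n)]
fn = [sum(cm[i]) - tp[i] for i in range(n)]                       # missed instances of class i
fp = [sum(cm[r][i] for r in range(n)) - tp[i] for i in range(n)]  # wrongly predicted as class i
support = [sum(cm[i]) for i in range(n)]                          # true instances per class

f1 = [2 * tp[i] / (2 * tp[i] + fp[i] + fn[i]) for i in range(n)]

macro = sum(f1) / n                                               # unweighted mean
weighted = sum(f1[i] * support[i] for i in range(n)) / total      # support-weighted mean
micro = 2 * sum(tp) / (2 * sum(tp) + sum(fp) + sum(fn))           # pooled counts
accuracy = sum(tp) / total

print(f"macro={macro:.3f} weighted={weighted:.3f} micro={micro:.3f}")
```

Because the smallest class has the lowest F1 here, the macro average comes out below the weighted one. In a single-label multiclass problem the micro average coincides with plain accuracy.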

#### metrics for unscaled data with SGDClassifier

In [15]:
from sklearn.model_selection import cross_val_predict

y_pred = cross_val_predict(sgd_clf, X_train, y_train, cv=3)

def classification_metrics(y_true, y_pred):
    """Print the confusion matrix and the support-weighted F1, precision and recall."""
    from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

    # pos_label only applies to average='binary', so it is not passed here
    print(confusion_matrix(y_true, y_pred))
    print('f1_score ', f1_score(y_true, y_pred, average='weighted'))
    print('precision ', precision_score(y_true, y_pred, average='weighted'))
    print('Recall ', recall_score(y_true, y_pred, average='weighted'))

classification_metrics(y_train, y_pred)




[[3301    1    2   12    2   22   38    1   32    5]
 [   1 3629   43   38    5   15   15    6  150    1]
 [  35   29 2885  147   39   34  116   58  153   10]
 [  11   14   86 3191    2  132   26   28  108   24]
 [  28   14   17   26 2920   14   54   14  127  153]
 [  57   11   18  223   37 2440   94   14  270   31]
 [  35    3   14    2   11   53 3265    4   37    2]
 [  21   13   64   60   24   12   10 3190   48  240]
 [  33   53   36  166   13  264   35   21 2718   37]
 [  37   14   11   84  135   55    2  137  239 2793]]
f1_score  0.8671806256092702
precision  0.8700410053423511
Recall  0.8666285714285714
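The confusion matrix printed above also gives the recall of each digit, which shows where the classifier struggles most. The sketch below copies that matrix by hand and uses plain Python; the variable names are illustrative only.

```python
# Confusion matrix printed above (rows: true digit, columns: predicted digit)
cm = [
    [3301, 1, 2, 12, 2, 22, 38, 1, 32, 5],
    [1, 3629, 43, 38, 5, 15, 15, 6, 150, 1],
    [35, 29, 2885, 147, 39, 34, 116, 58, 153, 10],
    [11, 14, 86, 3191, 2, 132, 26, 28, 108, 24],
    [28, 14, 17, 26, 2920, 14, 54, 14, 127, 153],
    [57, 11, 18, 223, 37, 2440, 94, 14, 270, 31],
    [35, 3, 14, 2, 11, 53, 3265, 4, 37, 2],
    [21, 13, 64, 60, 24, 12, 10, 3190, 48, 240],
    [33, 53, 36, 166, 13, 264, 35, 21, 2718, 37],
    [37, 14, 11, 84, 135, 55, 2, 137, 239, 2793],
]

row_totals = [sum(row) for row in cm]  # number of true instances per digit
recall_per_class = [cm[i][i] / row_totals[i] for i in range(len(cm))]

# Support-weighted recall reduces to overall accuracy: correct / total
weighted_recall = sum(cm[i][i] for i in range(len(cm))) / sum(row_totals)
hardest_digit = min(range(len(cm)), key=lambda i: recall_per_class[i])

print(round(weighted_recall, 4), hardest_digit)
```

The weighted recall reproduces the value reported by recall_score, and the digit with the lowest recall is the one most often confused with other classes.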




#### Metrics for scaled data with SGDClassifier

You can see how the metrics improved with the scaled data: the weighted F1 score rises from about 0.87 to about 0.91.

In [16]:
y_pred_scaled = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)

classification_metrics(y_train, y_pred_scaled)



[[3314    1    9    6    7   21   38    3   14    3]
 [   0 3766   24   20    3   18    6    7   56    3]
 [  24   19 3124   61   53   12   63   47   92   11]
 [  21   19   80 3136    1  146   27   33   97   62]
 [  10   12   23    7 3099    3   33   19   38  123]
 [  40   25   15  116   38 2712   71   24   95   59]
 [  20   13   19    0   21   52 3275    4   20    2]
 [  19   13   48   18   32    6    5 3383    6  152]
 [  29   73   39  116    7   96   28   16 2881   91]
 [  29   15   10   46  105   20    0  132   55 3095]]
f1_score  0.9079088066660824
precision  0.9079343107194907
Recall  0.9081428571428571




#### Metrics with Random Forest classifier and scaled data

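The Random Forest classifier improves the weighted F1 score to about 0.93, compared with about 0.91 for the SGD classifier on the same scaled data.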

In [17]:
y_pred_RF = cross_val_predict(forest_clf, X_train_scaled, y_train, cv=3)

classification_metrics(y_train, y_pred_RF)



[[3334    1    8    3    3   11   25    4   18    9]
 [   1 3820   25   17    5    8    6    7    7    7]
 [  23   15 3286   32   22   10   25   46   35   12]
 [  22   17   92 3298    3   69    8   34   51   28]
 [  11   11   20    6 3175    5   20   12   14   93]
 [  42   13   10  144   23 2861   47    4   30   21]
 [  33    6   27    4   22   42 3277    1   14    0]
 [  10   20   66   12   37    3    0 3462    8   64]
 [  20   37   64  106   37   77   27    9 2954   45]
 [  21   18   18   46  135   33    9   72   42 3113]]
f1_score  0.9306547134382956
precision  0.9308353912553229
Recall  0.9308571428571428




### Hyperparameter Tuning


In [18]:
from pprint import pprint

# look at the parameters used by our current forest

print('Parameters currently in use:\n')
pprint(forest_clf.get_params())

Parameters currently in use:

{'bootstrap': True,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 10,
 'n_jobs': 1,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}


In [19]:
from sklearn.model_selection import RandomizedSearchCV

# Number of trees in random Forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10 )]

# number of features to consider at every split
max_features = ['auto', 'sqrt']

# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)

# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]

# minimum number of samples required at each leaf node
min_samples_leaf = [1,2,4]

# method of selecting samples for training each tree
bootstrap = [True, False]

In [20]:
# create the random grid

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

In [21]:
pprint(random_grid)

{'bootstrap': [True, False],
 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None],
 'max_features': ['auto', 'sqrt'],
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [2, 5, 10],
 'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}


On each iteration, the algorithm will choose a different combination of the hyperparameter values. Altogether, there are 2 * 12 * 2 * 3 * 3 * 10 = 4320 settings.

However, the benefit of a random search is that we are not trying every combination, but selecting combinations at random to sample a wide range of values.
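The size of the full grid, and the idea of drawing only a few settings from it, can be sketched with the standard library. This is a simplified illustration under its own assumptions, not the sampler that RandomizedSearchCV uses internally; the budget of two draws is hypothetical.

```python
import math
import random

# Same candidate values as the random grid above
grid = {
    "n_estimators": list(range(200, 2001, 200)),
    "max_features": ["auto", "sqrt"],
    "max_depth": list(range(10, 111, 10)) + [None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

# A full grid search would have to evaluate every combination
n_settings = math.prod(len(values) for values in grid.values())

# A random search draws only a few of them (two here, as a hypothetical budget)
rng = random.Random(42)
sampled = [{name: rng.choice(values) for name, values in grid.items()}
           for _ in range(2)]

print(n_settings)
```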

#### Random Search Training

In [None]:
# use the random grid to search for the best hyperparameters
# forest_clf is the base model to tune
# random search over the parameters, using 3-fold cross-validation
# n_iter=2 samples only 2 combinations (raise it, e.g. to 100, for a broader search)
# n_jobs=-1 uses all available cores

forest_clf_random = RandomizedSearchCV(estimator=forest_clf, param_distributions=random_grid,
                                     n_iter= 2, cv = 3, verbose=2, random_state=42, n_jobs=-1)

y_train.shape

In [None]:
# fit the random search model
forest_clf_random.fit(X_train, y_train)

Fitting 3 folds for each of 2 candidates, totalling 6 fits


The most important arguments of RandomizedSearchCV are n_iter, which controls the number of different combinations to try, and cv, which is the number of folds to use for cross-validation.

In [24]:
forest_clf_random.best_params_
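Besides best_params_, a fitted search object also exposes the best cross-validated score and a model refitted with the best settings. The sketch below assumes the bundled `load_digits` dataset and a deliberately tiny, hypothetical search space so that it finishes quickly; it is not the search configured above.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X_demo, y_demo = load_digits(return_X_y=True)

# Deliberately small, hypothetical search space so the sketch runs quickly
small_grid = {
    "n_estimators": [10, 20, 30],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=small_grid,
    n_iter=4,         # number of sampled combinations
    cv=3,             # folds per combination
    random_state=42,
)
search.fit(X_demo, y_demo)

print(search.best_params_)             # best combination found
print(round(search.best_score_, 3))    # its mean cross-validated accuracy
best_forest = search.best_estimator_   # refitted on all of X_demo
```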