## Multiclass Classification

Binary classifiers distinguish between two classes.

Multiclass classifiers (also called multinomial classifiers) distinguish between more than two classes.



#### How algorithms train a multiclass classifier

Some algorithms, such as Random Forest or Naive Bayes classifiers, can handle multiple classes directly.

Others, such as SVMs or linear classifiers, are strictly binary classifiers. However, there are several strategies for performing multiclass classification with binary classifiers.

The two main strategies are one-versus-all (OvA, also called one-versus-rest) and one-versus-one (OvO).

**One-versus-All**: For example, one way to build a system that classifies digit images into 10 classes (0 to 9) is to train 10 binary classifiers, one per digit (a 0-detector, a 1-detector, a 2-detector, and so on).

Then, when you want to classify an image, you get the decision score from each classifier for that image and select the class whose classifier outputs the highest score.

In general, one-versus-all trains N distinct binary classifiers, each designed to recognize one particular class, and then uses those N classifiers together for multiclass classification.
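As a rough illustration of the one-versus-all idea, the sketch below trains one linear scorer per class on a tiny made-up 2-D data set and picks the class with the highest score. The least-squares scorer, the toy points, and the helper names `fit_ova` and `predict_ova` are assumptions chosen for brevity; this is not how Scikit-Learn implements it internally.

```python
import numpy as np

def fit_ova(X, y):
    """Fit one linear scorer per class on +1 (this class) / -1 (rest) targets."""
    classes = np.unique(y)
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append a bias column
    weights = []
    for c in classes:
        targets = np.where(y == c, 1.0, -1.0)   # "c-detector" targets
        w, *_ = np.linalg.lstsq(Xb, targets, rcond=None)
        weights.append(w)
    return classes, np.array(weights)

def predict_ova(classes, weights, X):
    """Score every instance with every detector and keep the best class."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    scores = Xb @ weights.T                     # shape (n_samples, n_classes)
    return classes[np.argmax(scores, axis=1)]

# Toy data (hypothetical): three small clusters around three centres
offsets = np.array([[0.5, 0], [-0.5, 0], [0, 0.5], [0, -0.5]])
centers = {0: (0, 0), 1: (5, 0), 2: (0, 5)}
X_toy = np.vstack([np.array(c) + offsets for c in centers.values()])
y_toy = np.repeat(list(centers.keys()), len(offsets))

classes, weights = fit_ova(X_toy, y_toy)
new_points = np.array([[4.5, 0.5], [0.2, 4.0]])
print(predict_ova(classes, weights, new_points))
```

With N classes this always trains N scorers, regardless of how many instances each class has.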

**One-versus-one**: considers every pair of classes and trains a classifier on the subset of the data containing only those two classes. For n classes it therefore trains n*(n-1)/2 classifiers. During classification, each classifier predicts one of its two classes, and the class that wins the most of these pairwise duels is selected.

For example, on the MNIST data it trains one classifier to distinguish 0s and 1s, another for 0s and 2s, another for 0s and 3s, another for 1s and 2s, and so on. For the MNIST problem this means 10 * 9 / 2 = 45 classifiers.
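The pair count and the voting step can be checked with a few lines of standard-library Python. The duel outcomes below are made up for illustration, and `ovo_vote` is a hypothetical helper; Scikit-Learn's own tie-breaking is not modelled here.

```python
from collections import Counter
from itertools import combinations

digits = range(10)
pairs = list(combinations(digits, 2))    # (0, 1), (0, 2), ..., (8, 9)
print(len(pairs))                        # 10 * 9 / 2 pairs, one classifier each

def ovo_vote(pairwise_winners):
    """Return the class that wins the most pairwise duels."""
    return Counter(pairwise_winners).most_common(1)[0][0]

# Hypothetical duel outcomes for one image: class 2 wins every duel it takes
# part in, and every other duel is won by the smaller digit of the pair.
winners = [2 if 2 in pair else min(pair) for pair in pairs]
print(ovo_vote(winners))
```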

The main advantage of OvO is that each classifier only needs to be trained on the part of the training set for the two classes that it must distinguish.

Some algorithms, such as SVMs, scale poorly with the size of the training set, so for these algorithms OvO is preferred: it is faster to train many classifiers on small training sets than a few classifiers on large training sets. For most binary classification algorithms, however, OvA is preferred.

Scikit-Learn detects when you try to use a binary classification algorithm for a multiclass classification task and automatically runs OvA (except for SVM classifiers, for which it uses OvO).


#### Multiclass classification for MNIST data

#### using One-versus-All

In [None]:
import numpy as np
import pandas as pd

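# Load the CSV: the first column is the label, the remaining columns are pixels
train = pd.read_csv('mtrain.csv').to_numpy()

X = train[:, 1:]
y = train[:, 0]  # 1-D label vector, as Scikit-Learn estimators expect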

X_train, X_test, y_train, y_test = X[:35000], X[35000:], y[:35000], y[35000:]

shuffle_index = np.random.permutation(35000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]

X_train[:5]

In [None]:
y_train_ravel = y_train.ravel()
y_train_ravel.shape

In [None]:
from sklearn.linear_model import SGDClassifier

some_digit = X[11000] # index 11000, label is 2

sgd_clf = SGDClassifier()
sgd_clf.fit(X_train, y_train) 
sgd_clf.predict([some_digit])

The code trains the SGDClassifier on the training set using the original target classes from 0 to 9, instead of 2-versus-all target classes (y_train_2). Then it makes a prediction.

Under the hood, Scikit-Learn actually trained 10 binary classifiers, got their decision scores for the image, and selected the class with the highest score.

To see that this is indeed the case, you can call the decision_function() method. Instead of returning just one score per instance, it now returns 10 scores, one per class:

In [None]:
some_digit_scores = sgd_clf.decision_function([some_digit])
some_digit_scores

In [None]:
# highest score is score related to class 2

np.argmax(some_digit_scores)

In [None]:
sgd_clf.classes_

When a classifier is trained, it stores the list of target classes in its classes_ attribute, ordered by value. In this case, the index of
each class in the classes_ array conveniently matches the class itself (e.g., the class at index 5 happens to be class 5), but in
general you won’t be so lucky.
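When the labels do not line up with the column positions, the position returned by argmax has to be mapped back through classes_. A minimal sketch with made-up scores for a hypothetical classifier that only knows the labels 3, 7, and 9:

```python
import numpy as np

# Hypothetical values: a classifier trained on only three digit labels
classes_ = np.array([3, 7, 9])
scores = np.array([[-1.2,  0.4, -0.3],
                   [ 2.1, -0.5, -1.7]])

best_index = np.argmax(scores, axis=1)   # column positions, not labels
predicted = classes_[best_index]         # map positions back to labels
print(best_index, predicted)
```

Indexing classes_ with the argmax result works whether or not the labels happen to match their positions, so it is the safer habit.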

In [None]:
# decision scores for both instances, one column per class
# first row: the highest score is in column 2, so the prediction is digit 2
# second row: the highest score is in column 3, so the prediction is digit 3

pd.DataFrame(sgd_clf.decision_function([X[11000], X[1233]]), columns=sgd_clf.classes_)

#### using One-versus-One

If you want to force Scikit-Learn to use one-versus-one or one-versus-all, you can use the *OneVsOneClassifier* or *OneVsRestClassifier* classes.

Simply create an instance and pass a binary classifier to its constructor.

In [None]:
from sklearn.multiclass import OneVsOneClassifier

ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42))
ovo_clf.fit(X_train, y_train)
ovo_clf.predict([some_digit])

In [None]:
# number of binary classifiers trained: 10 * 9 / 2 = 45

len(ovo_clf.estimators_)
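The one-versus-rest wrapper is used the same way. The sketch below compares both wrappers on Scikit-Learn's small built-in digits data set (load_digits, 8x8 images) instead of the MNIST CSV so that it runs in a few seconds; that substitution and the variable names are assumptions made for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Small stand-in data set with the same 10 digit classes
X_small, y_small = load_digits(return_X_y=True)

ovr_small = OneVsRestClassifier(SGDClassifier(random_state=42)).fit(X_small, y_small)
ovo_small = OneVsOneClassifier(SGDClassifier(random_state=42)).fit(X_small, y_small)

print(len(ovr_small.estimators_))  # one classifier per class
print(len(ovo_small.estimators_))  # one classifier per pair of classes
```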

In [None]:
from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier()
forest_clf.fit(X_train, y_train)
forest_clf.predict([some_digit])

This time Scikit-Learn does not have to run OvA or OvO, because Random Forest classifiers can directly classify instances into multiple classes.

Call **predict_proba()** to get the list of probabilities that the classifier assigned to each instance for each class.

In [None]:
# the classifier is fairly confident about its prediction: probability 1 in the 3rd position (index 2, digit 2)

forest_clf.predict_proba([some_digit])

In [None]:
# 0.8 in the 7th position (index 6) means the model estimates an 80% probability that the image represents a 6

print(forest_clf.predict_proba([X[1200],X[11250], X[100]]))
print(forest_clf.predict([X[1200],X[11250], X[100]]))
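To make the link between predict_proba() and predict() concrete, the sketch below checks that each row of probabilities sums to 1 and that the predicted class is the one with the highest probability. It uses Scikit-Learn's small built-in digits data set and a 50-tree forest as stand-ins (both assumptions) so that it runs quickly.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

X_small, y_small = load_digits(return_X_y=True)
forest_small = RandomForestClassifier(n_estimators=50, random_state=42)
forest_small.fit(X_small, y_small)

proba = forest_small.predict_proba(X_small[:3])   # one row per instance, one column per class
print(proba.shape)
print(proba.sum(axis=1))                          # each row sums to 1

# predict() returns the class with the highest estimated probability
from_proba = forest_small.classes_[np.argmax(proba, axis=1)]
print(from_proba, forest_small.predict(X_small[:3]))
```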

#### Evaluating the model with cross validation

In [None]:
from sklearn.model_selection import cross_val_score

# accuracy scores in all three folds
cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring='accuracy')

#### Standard scaling to improve accuracy

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))

cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring='accuracy')

As you can see, scaling improved accuracy.
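Roughly speaking, StandardScaler subtracts each column's mean and divides by its standard deviation. The plain-NumPy sketch below reproduces that idea on a made-up array; `standardize` is a hypothetical helper, and constant columns (such as always-black border pixels) are divided by 1 so they stay at zero instead of causing a division by zero.

```python
import numpy as np

def standardize(X_fit, X_apply):
    """Scale X_apply with the column statistics of X_fit."""
    mean = X_fit.mean(axis=0)
    std = X_fit.std(axis=0)      # population standard deviation (ddof=0)
    std[std == 0] = 1.0          # constant columns: avoid division by zero
    return (X_apply - mean) / std

# Made-up "pixel" columns: the first one is constant
pixels = np.array([[0., 10., 200.],
                   [0., 30., 100.],
                   [0., 50.,   0.]])
scaled = standardize(pixels, pixels)
print(scaled.mean(axis=0), scaled.std(axis=0))
```

The statistics should come from the training data only and then be reused for new data, for example `standardize(X_train, X_test)`.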

### Calculating Metrics for Multiclass problems

When you have a multiclass problem, the default average='binary' of f1_score, precision_score, and recall_score does not apply, so the average parameter needs to be set to one of the following:

1. weighted
2. micro
3. macro
 
'weighted' calculates the F1 score for each class independently, but when it combines them it uses a weight W for each class that is the share of true labels belonging to that class (so the weights sum to 1):

   F1class1 ∗ W1 + F1class2 ∗ W2 + ⋅⋅⋅ + F1classN ∗ WN

therefore favouring the majority class.

'micro' uses the global numbers of TP, FN, and FP, summed over all classes, and calculates the F1 directly:

   F1 (class1 + class2 + ⋅⋅⋅ + classN) = 2 ∗ TP / (2 ∗ TP + FP + FN)

so no class is singled out and every instance counts equally.

Finally, 'macro' calculates the F1 separately for each class and averages them without weights:

   (F1class1 + F1class2 + ⋅⋅⋅ + F1classN) / N

which results in a bigger penalisation when your model does not perform well on the minority classes.

The one to use depends on what you want to achieve. If you are worried about class imbalance, I would suggest using 'macro'. However, it might also be worthwhile to apply some of the techniques available for tackling imbalance problems, such as downsampling the majority class, upsampling the minority class, SMOTE, etc.
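The three averaging modes can be compared on a small made-up example. The sketch below computes them from scratch in plain Python; `f1_averages` and the toy labels are hypothetical and only meant to show how the aggregation differs, not to replace f1_score.

```python
from collections import Counter

def f1_averages(y_true, y_pred):
    """Return macro, weighted, and micro F1 for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted p, but the instance was not p
            fn[t] += 1   # missed an instance of t

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    support = Counter(y_true)    # number of true instances per class
    n = len(y_true)
    return {
        'macro': sum(per_class.values()) / len(labels),
        'weighted': sum(per_class[c] * support[c] / n for c in labels),
        'micro': f1(sum(tp.values()), sum(fp.values()), sum(fn.values())),
    }

# Imbalanced toy labels: class 0 is the majority class
y_true_toy = [0, 0, 0, 0, 1, 1, 2]
y_pred_toy = [0, 0, 0, 1, 1, 2, 2]
averages = f1_averages(y_true_toy, y_pred_toy)
print(averages)
```

In this example the majority class has the best per-class F1, so 'weighted' comes out highest and 'macro' lowest, which matches the description above.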

#### Metrics for unscaled data with SGDClassifier

In [None]:
from sklearn.model_selection import cross_val_predict

y_pred = cross_val_predict(sgd_clf, X_train, y_train, cv=3)

from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

def classification_metrics(y_true, y_pred):
    # rows of the confusion matrix are true classes, columns are predicted classes
    print(confusion_matrix(y_true, y_pred))
    # pos_label only applies to binary targets, so it is omitted for the multiclass case
    print('f1_score ', f1_score(y_true, y_pred, average='weighted'))
    print('precision ', precision_score(y_true, y_pred, average='weighted'))
    print('Recall ', recall_score(y_true, y_pred, average='weighted'))

classification_metrics(y_train, y_pred)


#### Metrics for scaled data with SGDClassifier

You can see how the metrics improved with the scaled data.

In [None]:
y_pred_scaled = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)

classification_metrics(y_train, y_pred_scaled)

#### Metrics with Random Forest classifier and scaled data

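Accuracy improves further with the Random Forest classifier. (Tree-based models are insensitive to feature scaling, so the scaled data is used here only for consistency with the previous section.)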

In [None]:
y_pred_RF = cross_val_predict(forest_clf, X_train_scaled, y_train, cv=3)

classification_metrics(y_train, y_pred_RF)

### Hyperparameter Tuning


On each iteration, the algorithm will choose a different combination of the hyperparameter values. Altogether, there are 2 * 12 * 2 * 3 * 3 * 10 = 4320 settings.

However, the benefit of a random search is that we are not trying every combination, but sampling combinations at random to cover a wide range of values.
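A random search of this size could be set up roughly as follows. The grid values are assumptions: only the list lengths (2, 12, 2, 3, 3, 10) are taken from the text, and the concrete hyperparameters and values were picked to match those lengths. The name forest_clf_random mirrors the commented-out line in the next cell, and n_iter=20 is an arbitrary choice.

```python
from math import prod

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical grid: only the list lengths (2, 12, 2, 3, 3, 10) come from the text
random_grid = {
    'bootstrap': [True, False],
    'max_depth': list(range(10, 120, 10)) + [None],
    'max_features': ['sqrt', 'log2'],
    'min_samples_leaf': [1, 2, 4],
    'min_samples_split': [2, 5, 10],
    'n_estimators': list(range(200, 2001, 200)),
}
n_settings = prod(len(values) for values in random_grid.values())
print(n_settings)   # total number of combinations in the grid

forest_clf_random = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=random_grid,
    n_iter=20,          # sample only 20 of the possible settings
    cv=3,
    random_state=42,
    n_jobs=-1,
)
# Fitting is left commented out because it is slow (20 settings x 3 folds):
# forest_clf_random.fit(X_train_scaled, y_train)
```

Raising n_iter covers more of the grid at the cost of a proportionally longer search.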

#### Random Search Training

In [None]:
#forest_clf_random.best_params_

#### Evaluate performance of base model with random search model

In [None]:
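def evaluate(model, test_features, test_labels):
    # labels are class IDs, so measure the share of exact matches
    # (a percentage error such as MAPE is a regression metric and divides by zero for digit 0)
    predictions = model.predict(test_features)
    test_labels = np.ravel(test_labels)
    accuracy = 100 * np.mean(predictions == test_labels)
    print('Model Performance')
    print('Misclassified: {} of {} instances.'.format(np.sum(predictions != test_labels), len(test_labels)))
    print('Accuracy = {:0.2f}%.'.format(accuracy))
    
    return accuracy

# forest_clf was fitted on the unscaled X_train, so evaluate it on the held-out, unscaled test split
base_accuracy = evaluate(forest_clf, X_test, y_test)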