## One Versus All and One Versus One 
Much of this notebook is based on [Machine Learning Mastery: OvR, OvO](https://machinelearningmastery.com/one-vs-rest-and-one-vs-one-for-multi-class-classification/). It was prepared for a Q&A session at Data Science Go Virtual (October 25, 2020) to answer a student's question.

In [1]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import FormatStrFormatter
import seaborn as sns
import math
from IPython.display import Video

# Model imports
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler


from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multiclass import OneVsOneClassifier

# Metrics imports
from sklearn import metrics
from sklearn.model_selection import train_test_split

Many linear classification models (for example, support vector machines) are designed for binary classification only and don't extend naturally to the multiclass case, where multiclass means more than two classes. Logistic regression is the exception: it has a direct multinomial extension, although it can also be used through the binary strategies described here.

One approach for using binary classification algorithms for multi-class classification problems is to split the multi-class classification dataset into multiple binary classification datasets and fit a binary classification model on each. Two examples of this approach are the One-vs-Rest (OvR) strategy, also known as One-vs-All (OvA), and the One-vs-One (OvO) strategy.

This notebook covers the following:

* The One-vs-Rest (One-vs-All) strategy splits a multi-class classification into one binary classification problem per class.
* The One-vs-One strategy splits a multi-class classification into one binary classification problem per pair of classes.

### One Versus All (OvA)
One Versus All, sometimes also called One Versus Rest (OvR), is a technique that allows us to extend any binary classifier to multi-class problems. We train one classifier per class, where that class is treated as the positive class and the examples from all other classes are treated as the negative class. To classify a new, unlabeled data instance, we run all n classifiers, where n is the number of class labels, and assign the instance the class label whose classifier reports the highest confidence.

### One Versus All Theoretical Example

For example, consider a multi-class classification problem with examples for each of the classes ‘setosa,’ ‘versicolor,’ and ‘virginica.’ It could be divided into three binary classification datasets as follows:

Binary Classification Problem 1: setosa vs [versicolor, virginica]

Binary Classification Problem 2: versicolor vs [setosa, virginica]

Binary Classification Problem 3: virginica vs [setosa, versicolor]

This approach requires that each model predicts a class membership probability or a probability-like score. The argmax of these scores (class index with the largest score) is then used to predict a class.
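
The three binary problems and the argmax step can be sketched by hand. This is a minimal illustration, not how scikit-learn is implemented internally; it assumes a synthetic dataset from `make_classification` in place of the iris classes, and the names `binary_models`, `scores`, and `manual_pred` are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic 3-class dataset (an assumption; any multi-class data would do)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=5, n_classes=3, random_state=1)
classes = np.unique(y)

# One binary problem per class: "this class" (1) versus "everything else" (0)
binary_models = []
for k in classes:
    y_binary = (y == k).astype(int)
    binary_models.append(LogisticRegression().fit(X, y_binary))

# Column k holds the confidence score of the "class k vs rest" model
scores = np.column_stack([m.decision_function(X) for m in binary_models])

# Predict the class whose binary model is the most confident
manual_pred = classes[np.argmax(scores, axis=1)]

# Compare against scikit-learn's OneVsRestClassifier wrapper
wrapper_pred = OneVsRestClassifier(LogisticRegression()).fit(X, y).predict(X)
agreement = np.mean(manual_pred == wrapper_pred)
print("Agreement with OneVsRestClassifier:", agreement)
```

The sketch uses `decision_function` scores rather than probabilities; because the logistic function is monotonic, the argmax is the same either way.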

This approach is commonly used for algorithms that naturally predict numerical class membership probability or score, such as logistic regression.

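In older scikit-learn releases, LogisticRegression used the OvR strategy by default for multi-class problems. Since version 0.22 the default is multi_class='auto', which selects a multinomial (softmax) model for most solvers, so OvR has to be requested explicitly.

We can demonstrate this with an example on a 3-class classification problem using the LogisticRegression algorithm. The strategy for handling multi-class classification can be set via the “multi_class” argument and can be set to “ovr” for the one-vs-rest strategy. Recent releases (1.5 and later) deprecate this argument; there, wrapping the model in OneVsRestClassifier is the suggested way to get OvR.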

The complete example of fitting a logistic regression model for multi-class classification using the built-in one-vs-rest strategy is listed below.

In [2]:
# logistic regression for multi-class classification using built-in one-vs-rest
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, n_classes=3, random_state=1)
# define model
model = LogisticRegression(multi_class='ovr')
# fit model
model.fit(X, y)
# make predictions
yhat = model.predict(X)

The scikit-learn library also provides a separate [OneVsRestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html) class that allows the one-vs-rest strategy to be used with any classifier.

This class wraps a binary classifier such as Logistic Regression so that it can be used for multi-class classification. It can also wrap classifiers that natively support multi-class classification.

To use it, pass the binary classifier to the OneVsRestClassifier as an argument.

In [3]:
# logistic regression for multi-class classification using a one-vs-rest
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, n_classes=3, random_state=1)
# define model
model = LogisticRegression()
# define the ovr strategy
ovr = OneVsRestClassifier(model)
# fit model
ovr.fit(X, y)
# make predictions
yhat = ovr.predict(X)
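
As a quick check on what the wrapper builds, the fitted object can be inspected. The sketch below refits the same kind of model on the same synthetic data and looks at the `classes_` and `estimators_` attributes; the variable names `first_model` and `train_accuracy` are only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=5, n_classes=3, random_state=1)
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)

# One fitted binary model per class, in the order given by classes_
print("Classes:", ovr.classes_)
print("Number of binary models:", len(ovr.estimators_))

# Each underlying model is an ordinary binary logistic regression
first_model = ovr.estimators_[0]
print("Coefficient shape of one binary model:", first_model.coef_.shape)

# Accuracy on the training data (no held-out set in this toy example)
train_accuracy = accuracy_score(y, ovr.predict(X))
print("Training accuracy:", train_accuracy)
```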

### One Versus All Example

In [4]:
# Load Data
col_names = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                   'Color intensity', 'Hue', 'OD280/OD315 of diluted wines',
                   'Proline']

df = pd.read_csv(r'data/wine.data',
                 header=None,
                 names=col_names)


In [5]:
df.head()

| | Class label | Alcohol | Malic acid | Ash | Alcalinity of ash | Magnesium | Total phenols | Flavanoids | Nonflavanoid phenols | Proanthocyanins | Color intensity | Hue | OD280/OD315 of diluted wines | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050 |
| 2 | 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 1 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 |
| 4 | 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |


In [6]:
df.shape

(178, 14)

In [7]:
# Print out how many classes
print('Class labels', np.unique(df['Class label']))

Class labels [1 2 3]


In [8]:
# Classes aren't balanced.
df['Class label'].value_counts(dropna = False)

2    71
1    59
3    48
Name: Class label, dtype: int64

In [9]:
# Arrange data into features matrix and target vector
X = df.loc[:, df.columns[(df.columns != 'Class label')]]
y = df.loc[:, 'Class label'].values

In [10]:
# In statistical surveys, when subpopulations within an overall population vary,
# it can be advantageous to sample each subpopulation (stratum) independently.
# Stratification is the process of dividing members of the population into
# homogeneous subgroups before sampling.
#help(train_test_split)

In [11]:
# Split into training and test sets 
# Providing the class label array y as an argument to stratify ensures both
# the training set and test datasets have the same class proportions as the
# original dataset
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.3,
                                                    random_state=0,
                                                    stratify=y)

In [12]:
unique, counts = np.unique(y_train, return_counts=True)
dict(zip(unique, counts))

{1: 41, 2: 50, 3: 33}
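
To see what `stratify=y` buys us, the class proportions of the full, training, and test label vectors can be compared. The sketch below does not need the wine file: it assumes a stand-in label vector with the same class counts (59, 71, 48) and a placeholder feature column, and `class_proportions` is a helper defined only for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in label vector with the same class counts as the wine data (assumption)
y = np.repeat([1, 2, 3], [59, 71, 48])
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature; values are irrelevant here

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

def class_proportions(labels):
    """Return the fraction of samples per class, ordered by class label."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

full_props = class_proportions(y)
train_props = class_proportions(y_train)
test_props = class_proportions(y_test)

print("Full :", np.round(full_props, 3))
print("Train:", np.round(train_props, 3))
print("Test :", np.round(test_props, 3))
```

With stratification the three rows should be nearly identical; without it, a small dataset like this one can end up with noticeably different class mixes in the two splits.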

In [13]:
# Standardize Data
scaler = StandardScaler()

# Fit on training set only.
scaler.fit(X_train)

# Apply transform to both the training set and the test set.
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [14]:
log_reg = LogisticRegression(penalty='l1',
                        C=1.0,
                        solver='liblinear',
                        multi_class='ovr')

log_reg.fit(X_train, y_train)
print('Training accuracy:', log_reg.score(X_train, y_train))
print('Test accuracy:', log_reg.score(X_test, y_test))

Training accuracy: 1.0
Test accuracy: 1.0


The training and test accuracies are both 100 percent, so the model classifies every sample in both sets correctly. Accessing the intercept terms via the `log_reg.intercept_` attribute returns an array of three values.

In [15]:
log_reg.intercept_

array([-1.26389183, -1.2159549 , -2.36959333])

Since we fit the LogisticRegression object on a multiclass dataset via the OvR approach, the first intercept belongs to the model that fits class 1 versus classes 2 and 3, the second value is the intercept of the model that fits class 2 versus classes 1 and 3, and the third value is the intercept of the model that fits class 3 versus classes 1 and 2.

In [16]:
log_reg.coef_

array([[ 1.24566275,  0.18064686,  0.74592289, -1.16393436,  0.        ,
         0.        ,  1.16077801,  0.        ,  0.        ,  0.        ,
         0.        ,  0.55718013,  2.50873345],
       [-1.53705281, -0.38729034, -0.99526575,  0.36483503, -0.05932216,
         0.        ,  0.66810689,  0.        ,  0.        , -1.93415219,
         1.23404742,  0.        , -2.23179996],
       [ 0.13525564,  0.16995133,  0.35782592,  0.        ,  0.        ,
         0.        , -2.43087034,  0.        ,  0.        ,  1.56186001,
        -0.8189804 , -0.49717816,  0.        ]])

The weight array accessed through the `log_reg.coef_` attribute contains three rows of weight coefficients, one weight vector for each class. Each row consists of 13 weights, where each weight is multiplied by the respective feature in the 13-dimensional wine dataset to calculate the net input.

#### Predictions
The predicted class is the one with the highest probability among all the class probabilities.

In [17]:
X_test.shape

(54, 13)

In [18]:
# get the first row of features
X_test[0:1].shape

(1, 13)

In [19]:
# get the first label in the test set
y_test[0]

1

In [20]:
X_test[0]

array([ 0.89443737, -0.38811788,  1.10073064, -0.81201711,  1.13201117,
        1.09807851,  0.71204102,  0.18101342,  0.06628046,  0.51285923,
        0.79629785,  0.44829502,  1.90593792])

In [21]:
X_test[0:1].shape

(1, 13)

In [22]:
X_test[0].shape

(13,)

In [23]:
# The first class has the highest score, so it is the prediction for this sample
log_reg.predict_proba(X_test[0:1])

array([[9.76557984e-01, 4.48247143e-04, 2.29937690e-02]])

In [24]:
help(log_reg.predict_proba)

Help on method predict_proba in module sklearn.linear_model._logistic:

predict_proba(X) method of sklearn.linear_model._logistic.LogisticRegression instance
    Probability estimates.
    
    The returned estimates for all classes are ordered by the
    label of classes.
    
    For a multi_class problem, if multi_class is set to be "multinomial"
    the softmax function is used to find the predicted probability of
    each class.
    Else use a one-vs-rest approach, i.e calculate the probability
    of each class assuming it to be positive using the logistic function.
    and normalize these values across all the classes.
    
    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        Vector to be scored, where `n_samples` is the number of samples and
        `n_features` is the number of features.
    
    Returns
    -------
    T : array-like of shape (n_samples, n_classes)
        Returns the probability of the sample for each class in the model,


### One Versus One Theoretical Example (Not my Favorite)

This is not my favorite method to teach in this particular course, since we don't cover support vector machines, the models it is most commonly used with, but I am including it for your information.

Classically, this approach is suggested for support vector machines (SVM) and related kernel-based algorithms. The usual reasoning is that the training cost of kernel methods grows faster than linearly with the size of the training dataset, so fitting each model on a subset of the training data may counter this effect.

One-vs-One (OvO for short) is another heuristic method for using binary classification algorithms for multi-class classification.

Like one-vs-rest, one-vs-one splits a multi-class classification dataset into binary classification problems. Unlike one-vs-rest, which builds one binary dataset per class, one-vs-one builds one binary dataset for each pair of classes, containing only the examples of those two classes.

For example, consider a multi-class classification problem with four classes: ‘red,’ ‘blue,’ ‘green,’ and ‘yellow.’ This could be divided into six binary classification datasets as follows:

* Binary Classification Problem 1: red vs. blue

* Binary Classification Problem 2: red vs. green

* Binary Classification Problem 3: red vs. yellow

* Binary Classification Problem 4: blue vs. green

* Binary Classification Problem 5: blue vs. yellow

* Binary Classification Problem 6: green vs. yellow

This is significantly more datasets, and in turn, models than the one-vs-rest strategy described in the previous section.

The formula for calculating the number of binary datasets, and in turn, models, is as follows:

* (NumClasses * (NumClasses – 1)) / 2

We can see that for four classes, this gives us the expected value of six binary classification problems:

* (NumClasses * (NumClasses – 1)) / 2
* (4 * (4 – 1)) / 2
* (4 * 3) / 2
* 12 / 2
* 6
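
The same count can be checked by enumerating the pairs directly. This small sketch uses `itertools.combinations` from the standard library on the four color classes from the example above.

```python
from itertools import combinations

classes = ['red', 'blue', 'green', 'yellow']

# Every unordered pair of classes defines one binary problem
pairs = list(combinations(classes, 2))
for number, (first, second) in enumerate(pairs, start=1):
    print(f"Binary Classification Problem {number}: {first} vs. {second}")

# The formula gives the same count as the enumeration
n = len(classes)
expected = n * (n - 1) // 2
print("Pairs enumerated:", len(pairs), "| formula:", expected)
```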

Each binary classification model predicts one class label, and the one-vs-one strategy predicts the class that receives the most predictions, or votes.

"An alternative is to introduce K(K − 1)/2 binary discriminant functions, one for every possible pair of classes. This is known as a one-versus-one classifier. Each point is then classified according to a majority vote amongst the discriminant functions." — Page 183, [Pattern Recognition and Machine Learning](https://www.amazon.com/Pattern-Recognition-Learning-Information-Statistics/dp/0387310738/ref=as_li_ss_tl?keywords=Pattern+Recognition+and+Machine+Learning&qid=1579729404&sr=8-1&linkCode=sl1&tag=inspiredalgor-20&linkId=35644770788d3f7d6402ac21dac68952&language=en_US), 2006.

Similarly, if the binary classification models predict a numerical class membership, such as a probability, then the argmax of the sum of the scores (class with the largest sum score) is predicted as the class label.
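
The voting scheme can be sketched by hand as follows. This is an illustration under simplifying assumptions rather than scikit-learn's implementation: it uses logistic regression as the binary model on synthetic data, and ties are broken by picking the lowest class index, whereas `OneVsOneClassifier` uses the classifiers' confidence scores to break ties. The names `votes` and `manual_pred` are made up for the example.

```python
import numpy as np
from itertools import combinations
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=5, n_classes=3, random_state=1)
classes = np.unique(y)

# votes[s, k] counts how many pairwise models chose class k for sample s
votes = np.zeros((len(X), len(classes)), dtype=int)

for i, j in combinations(range(len(classes)), 2):
    # Train only on the examples that belong to the two classes in this pair
    mask = (y == classes[i]) | (y == classes[j])
    model = LogisticRegression().fit(X[mask], y[mask])

    # Every sample gets one vote from this model, for either class i or class j
    pair_pred = model.predict(X)
    votes[pair_pred == classes[i], i] += 1
    votes[pair_pred == classes[j], j] += 1

# Majority vote (ties go to the lowest class index in this simple sketch)
manual_pred = classes[np.argmax(votes, axis=1)]

# Compare against scikit-learn's OneVsOneClassifier wrapper
ovo_pred = OneVsOneClassifier(LogisticRegression()).fit(X, y).predict(X)
agreement = np.mean(manual_pred == ovo_pred)
print("Agreement with OneVsOneClassifier:", agreement)
```

Any disagreement between the two should come from three-way ties, where each class receives exactly one vote.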

The support vector machine implementation in scikit-learn is provided by the SVC class, which trains multi-class models with the one-vs-one method internally. The “decision_function_shape” argument controls only the shape of the decision function output: ‘ovo‘ returns one column per pair of classes, while the default ‘ovr‘ returns one column per class.

The example below demonstrates SVM for multi-class classification using the one-vs-one method.

In [26]:
# SVM for multi-class classification using built-in one-vs-one
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, n_classes=3, random_state=1)
# define model
model = SVC(decision_function_shape='ovo')
# fit model
model.fit(X, y)
# make predictions
yhat = model.predict(X)
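
To see what “decision_function_shape” changes in practice, the two settings can be compared on a problem with four classes, where the number of pairs (six) differs from the number of classes. This is a small sketch on synthetic data; the names `svc_ovo` and `svc_ovr` are only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Four classes, so 'ovr' gives 4 columns and 'ovo' gives 4 * 3 / 2 = 6 columns
X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           n_redundant=5, n_classes=4, random_state=1)

svc_ovo = SVC(decision_function_shape='ovo').fit(X, y)
svc_ovr = SVC(decision_function_shape='ovr').fit(X, y)

ovo_shape = svc_ovo.decision_function(X).shape
ovr_shape = svc_ovr.decision_function(X).shape
print("'ovo' decision function shape:", ovo_shape)
print("'ovr' decision function shape:", ovr_shape)

# The predicted labels do not depend on the shape setting
same_predictions = np.array_equal(svc_ovo.predict(X), svc_ovr.predict(X))
print("Same predictions:", same_predictions)
```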

The scikit-learn library also provides a separate [OneVsOneClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsOneClassifier.html) class that allows the one-vs-one strategy to be used with any classifier.

This class can be used with a binary classifier like SVM, Logistic Regression or Perceptron for multi-class classification, or even other classifiers that natively support multi-class classification.

It is very easy to use and requires that a classifier that is to be used for binary classification be provided to the OneVsOneClassifier as an argument.

The example below demonstrates how to use the OneVsOneClassifier class with an SVC class used as the binary classification model.

In [27]:
# SVM for multi-class classification using one-vs-one
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, n_classes=3, random_state=1)
# define model
model = SVC()
# define ovo strategy
ovo = OneVsOneClassifier(model)
# fit model
ovo.fit(X, y)
# make predictions
yhat = ovo.predict(X)

### Resources

[Machine Learning Mastery: OvR, OvO](https://machinelearningmastery.com/one-vs-rest-and-one-vs-one-for-multi-class-classification/)

[Scikit-Learn: One vs One](https://scikit-learn.org/stable/modules/multiclass.html#ovo-classification)

[Coursera: One vs All (Andrew Ng)](https://www.coursera.org/lecture/machine-learning/multiclass-classification-one-vs-all-68Pol)

If you can't view the Coursera video, try opening it in an incognito window.