## **Overview of Multiclass Classification: one-vs-all and all-pairs**

Give an overview of the algorithm and describe its advantages and disadvantages.

#### <ins>One-vs-All</ins>

This algorithm creates one classifier per class (for example, 3 classifiers when there are 3 classes). Each classifier is responsible for predicting whether an input belongs to its corresponding class or not.

Each classifier is trained on the entire dataset, with the labels modified for that classifier.\
For example, when training the classifier for class $3$, the labels of all other classes are changed to $-1$ and the labels of class $3$ are changed to $1$. In the code, the negative label is $0$ rather than $-1$ because the binary logistic regression takes labels in $\{0, 1\}$. (More details are provided in the Representation section.)
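
The relabeling step can be sketched in a few lines of NumPy. This is only an illustration of the idea described above, not the project's implementation; the helper name `one_vs_all_labels` is hypothetical.

```python
import numpy as np

def one_vs_all_labels(y, target_class):
    """Return +1 where y equals target_class and -1 everywhere else."""
    y = np.asarray(y)
    return np.where(y == target_class, 1, -1)

y = np.array([0, 1, 2, 2, 1, 0])
labels_for_class_2 = one_vs_all_labels(y, 2)
print(labels_for_class_2.tolist())  # [-1, -1, 1, 1, -1, -1]
```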

#### <ins>All-Pairs</ins>

This algorithm creates one classifier per pair of classes. Each classifier is responsible for predicting which of its two classes a given input belongs to.

Each classifier is trained on a portion of the dataset, with the labels modified for that classifier.\
First, each classifier is assigned the portion of the dataset whose samples belong to one of its two classes. Then the labels in that portion are changed so that one class is labeled $1$ and the other is labeled $-1$.
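
A minimal sketch of the two steps (select the pair's samples, then relabel them), assuming NumPy arrays as input. The helper name `pair_subset` and the toy data are hypothetical.

```python
import numpy as np

def pair_subset(X, y, class_pos, class_neg):
    """Keep only rows labeled class_pos or class_neg and relabel them +1 / -1."""
    X, y = np.asarray(X), np.asarray(y)
    mask = (y == class_pos) | (y == class_neg)
    return X[mask], np.where(y[mask] == class_pos, 1, -1)

# Toy data: one feature, three classes
X = np.array([[5.1], [7.0], [6.3], [4.9]])
y = np.array([0, 1, 2, 0])
X_01, y_01 = pair_subset(X, y, 0, 1)
print(X_01.ravel().tolist(), y_01.tolist())  # [5.1, 7.0, 4.9] [1, -1, 1]
```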

#### <ins>Advantages and Disadvantages of Multiclass Classification</ins>

A multiclass classification algorithm classifies an input that belongs to one of more than two classes.\
For this project, we implement the One-vs-All and All-Pairs algorithms for multiclass classification of the UCI Iris dataset.

Compared to an algorithm that handles multiclass classification natively (the model's output directly predicts one of several classes),
the main advantages of One-vs-All and All-Pairs stem from the use of binary classifiers.\
Because binary classifiers are combined to represent the multiclass problem, these two algorithms are simple to implement and their predictions are easy to interpret.

Unfortunately, the disadvantages also stem from the use of binary classifiers.
- The binary classifiers do not know that they are being used for multiclass classification and therefore have no inherent understanding of the multiclass problem.
- In One-vs-All, each classifier separates one class from all the others, so it is trained on a class-imbalanced dataset, which may result in overfitting.
- Training multiple classifiers can be computationally expensive: One-vs-All trains $k$ classifiers for $k$ classes, and All-Pairs trains $k(k-1)/2$.
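
To make the cost point concrete, the number of binary classifiers each strategy trains can be counted directly. This is a small sketch; the helper name `classifier_counts` is hypothetical.

```python
from math import comb

def classifier_counts(k):
    """Number of binary classifiers each strategy trains for k classes."""
    return {"one_vs_all": k, "all_pairs": comb(k, 2)}

iris_counts = classifier_counts(3)        # Iris has 3 classes
ten_class_counts = classifier_counts(10)
print(iris_counts)       # {'one_vs_all': 3, 'all_pairs': 3}
print(ten_class_counts)  # {'one_vs_all': 10, 'all_pairs': 45}
```

For Iris the two strategies train the same number of classifiers, so the gap only shows up with more classes. Each All-Pairs classifier also sees only the samples of its two classes, so the per-classifier training cost is smaller.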

#### <ins>Misc.</ins>

In this final project, we will be using the UCI Iris dataset we encountered in our previous homework:\
[`https://archive.ics.uci.edu/dataset/53/iris`](https://archive.ics.uci.edu/dataset/53/iris)

What we will be comparing to:
[scikit-learn multiclass classification](https://scikit-learn.org/1.5/modules/multiclass.html#multilabel-classification)


### Representation: Logistic Regression

#### Binary Logistic Regression
Given a sample's feature values $x \in \mathbb{R}^{d}$ and a label $y \in \{0, 1\}$, the binary classification of input $x$ is predicted by composing an affine function with the sigmoid function. (The bias is folded into $w$ by padding $x$ with a constant $1$.)
$$ z = \langle w, x\rangle $$
$$\sigma (z) = \frac{1}{1 + e^{-z}}$$
Therefore, our hypothesis function defined on weights $w$ is
$$h_{w}(x) = \frac{1}{1 + e^{-\langle w, x\rangle}}$$
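
As a quick numeric check of the hypothesis function, here is a small sketch with made-up weights and a bias-padded input. The function name `hypothesis` and the values are illustrative assumptions.

```python
import numpy as np

def hypothesis(w, x):
    """h_w(x) = 1 / (1 + exp(-<w, x>)), the predicted probability of label 1."""
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

w = np.array([0.5, -0.25, 0.0])   # last weight acts as the bias
x = np.array([2.0, 4.0, 1.0])     # last entry is the constant 1 for the bias
p = hypothesis(w, x)              # <w, x> = 1 - 1 + 0 = 0, so p = 0.5
print(p)
```

An inner product of exactly 0 lands on the decision boundary, which is why the predicted probability is 0.5 here.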

#### Multiclass Logistic Regression
Now, using the binary logistic regression defined above, we will define one-vs-all and all-pairs multiclass logistic regression algorithms.

Pseudocode for one-vs-all (from textbook):

Given inputs:\
training set $S = (x_1, y_1), ..., (x_m, y_m)$\
binary classifier - logistic regression $L$

$\text{foreach } i \in Y:$\
$\text{ let } S_i = (x_1, (-1)^{\mathbb{1}_{[y_1 \neq i]}}), ..., (x_m, (-1)^{\mathbb{1}_{[y_m \neq i]}})$\
$\text{ let } h_i = L(S_i)$

Predicts:\
$h(x) \in \text{argmax}_{i \in Y} \, h_i(x)$
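
The prediction rule can be sketched as an argmax over per-class scores. The sketch assumes the scores $h_i(x)$ have already been computed and stacked into an array; the function name `one_vs_all_predict` and the numbers are hypothetical.

```python
import numpy as np

def one_vs_all_predict(scores):
    """scores[t, i] holds h_i(x_t); pick the class with the largest score per row."""
    return np.argmax(scores, axis=1)

scores = np.array([[0.9, 0.2, 0.1],
                   [0.3, 0.4, 0.8]])
preds = one_vs_all_predict(scores)
print(preds.tolist())  # [0, 2]
```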


Pseudocode for all-pairs (from textbook):

Given inputs:\
training set $S = (x_1, y_1), ..., (x_m, y_m)$\
binary classifier - logistic regression $L$

$\text{foreach } i,j \in Y \text{ such that } i < j$\
$\text{ initialize empty } S_{i,j}$\
$\text{ for } t = 1, ..., m$\
$\text{ }\text{ If } y_t = i \text{, then add } (x_t, 1) \text{ to } S_{i,j}$\
$\text{ }\text{ If } y_t = j \text{, then add } (x_t, -1) \text{ to } S_{i,j}$\
$\text{ let } h_{i,j} = L(S_{i,j})$

Predicts:\
$h(x) \in \text{argmax}_{i \in Y} \left(\sum_{j \in Y} \text{sign}(j-i) \, h_{i,j}(x)\right)$
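
The all-pairs prediction rule can be sketched as a signed vote. Since classifiers are only trained for $i < j$, this sketch assumes that for $j < i$ the term uses $h_{j,i}$, and that each classifier outputs $+1$ (favoring $i$) or $-1$ (favoring $j$). The function name `all_pairs_predict` and the example outputs are hypothetical.

```python
import numpy as np

def all_pairs_predict(pair_outputs, n_classes):
    """pair_outputs maps (i, j) with i < j to h_ij(x) in {+1, -1}."""
    scores = np.zeros(n_classes)
    for (i, j), h in pair_outputs.items():
        scores[i] += h   # term for class i: sign(j - i) = +1
        scores[j] -= h   # term for class j: sign(i - j) = -1
    return int(np.argmax(scores))

# h_01 and h_02 vote against class 0; h_12 votes for class 1 over class 2
pair_outputs = {(0, 1): -1, (0, 2): -1, (1, 2): 1}
pred = all_pairs_predict(pair_outputs, 3)
print(pred)  # 1
```

Here the scores are $-2$, $2$ and $0$ for classes 0, 1 and 2, so class 1 wins.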



### Loss: Logistic Loss + Regularization

The loss function of a Logistic Regression classifier over $k$ classes is the **log-loss**, also called **cross-entropy loss**. Since this project only uses binary classifiers (Binary Logistic Regression), only the **Binary Log Loss** is introduced in this section.

The Binary Log Loss on a sample of $m$ data points, also called the Binary Cross Entropy Loss, is:
$$L_S(h) = -\frac{1}{m} \sum_{i=1}^m (y_i \log h(x_i) + (1 - y_i)\log (1 - h(x_i)))$$

The corresponding gradient of the Binary Log Loss with respect to the model's weights is:
$$\frac{\partial L_S(h)}{\partial w_j} = \frac{1}{m} \sum_{i=1}^m (h(x_i) - y_i)x_{ij}$$

We also add the squared L2 norm of the weights to the loss function as Tikhonov regularization. The L2 term is:
$$\lambda||w||_2^2 = \lambda\sum_{i=1}^{d}w_i^2$$ 
And the gradient of the L2 term with respect to the model's weights is:
$$\frac{\partial \lambda\sum_{i=1}^{d}w_i^2}{\partial w_j} = 2\lambda w_j$$

Putting these together, the total loss function is:
$$L_S(h) = -\frac{1}{m} \sum_{i=1}^m (y_i \log h(x_i) + (1 - y_i)\log (1 - h(x_i)))+ \lambda\sum_{i=1}^{d}w_i^2$$
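
One way to sanity-check the loss and gradient formulas in this section is to compare the analytic gradient against central finite differences. The sketch below uses random toy data; all names (`regularized_loss`, `regularized_grad`, `lam`) are assumptions made for illustration, and the bias weight is regularized like any other weight.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_loss(w, X, y, lam):
    """Binary log loss averaged over the sample, plus lam * ||w||_2^2."""
    p = np.clip(sigmoid(X @ w), 1e-15, 1 - 1e-15)
    bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    return bce + lam * np.sum(w ** 2)

def regularized_grad(w, X, y, lam):
    """(1/m) * sum_i (h(x_i) - y_i) x_i  +  2 * lam * w."""
    return X.T @ (sigmoid(X @ w) - y) / len(y) + 2 * lam * w

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = (rng.random(20) < 0.5).astype(float)
w = rng.normal(size=3)
lam = 0.1

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.array([
    (regularized_loss(w + eps * e, X, y, lam)
     - regularized_loss(w - eps * e, X, y, lam)) / (2 * eps)
    for e in np.eye(3)
])
max_abs_diff = np.max(np.abs(numeric - regularized_grad(w, X, y, lam)))
print(max_abs_diff)  # expected to be tiny if the formulas agree
```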

### Optimizer

**One-vs-All** and **All-Pairs** are both strategies for solving multiclass classification problems with binary classifiers. Mini-batch Stochastic Gradient Descent is a suitable choice for training these binary classifiers.  
In gradient descent, the general formula for weight update is:
$$w_j = w_j - \alpha \cdot \frac{\partial L}{\partial w_j}$$  

For each batch of size $m$, the gradient of the binary log loss with respect to the weight is:
$$\frac{\partial L}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} (h(x_i) - y_i) \cdot x_{ij}$$  

If we incorporate regularization (mentioned in the previous section), the gradient of the regularized loss $L_{reg}$ becomes:
$$\frac{\partial L_{reg}}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} (h(x_i) - y_i) \cdot x_{ij} + 2 \lambda w_j$$
Thus, the final weight update equation, with the regularization term counted once, is: 
$$w_j = w_j - \alpha \cdot \left( \frac{1}{m} \sum_{i=1}^{m} (h(x_i) - y_i) \cdot x_{ij} + 2 \lambda w_j \right)$$  
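
A single update step can be worked through on a two-example batch. This is a sketch under the formulas above (data-loss gradient plus a single $2\lambda w$ term); the function name `sgd_step` and the toy batch are hypothetical.

```python
import numpy as np

def sgd_step(w, X_batch, y_batch, alpha, lam):
    """One step: w <- w - alpha * (mean_i (h(x_i) - y_i) x_i + 2 * lam * w)."""
    p = 1.0 / (1.0 + np.exp(-(X_batch @ w)))
    grad = X_batch.T @ (p - y_batch) / len(y_batch) + 2 * lam * w
    return w - alpha * grad

w0 = np.zeros(2)
X_batch = np.array([[1.0, 1.0], [3.0, 1.0]])   # second column is the bias
y_batch = np.array([0.0, 1.0])
w1 = sgd_step(w0, X_batch, y_batch, alpha=0.1, lam=0.0)
print(w1.tolist())
```

With zero initial weights both predictions are 0.5, so the errors are $+0.5$ and $-0.5$. The averaged gradient is $(-0.5, 0)$, and the step moves the weights to about $(0.05, 0)$.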

Due to the nature of the One-vs-All and All-Pairs strategies, we apply this optimizer differently compared to direct multiclass classification techniques, such as multiclass logistic regression.  
**One-vs-All**: for each class $j$, we train a separate binary classifier that distinguishes class $j$ from all other classes.  
**All-Pairs**: for each unique class pair $(i, j)$, we train a binary classifier that differentiates between those two classes.  

#### Pseudocode: Stochastic Gradient Descent for Logistic Regression (Lecture 6 Slide 21)  
**Inputs**: Training examples $S$, step size $\alpha$, batch size $b < |S|$  
Set converged = false  
**while** not converged:  
&nbsp;&nbsp;&nbsp;&nbsp;Randomly shuffle $S$  
&nbsp;&nbsp;&nbsp;&nbsp;**for** $i = 0$ to $|S|/b - 1$:  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$S'$ = Extracted current batch using $i$  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$\mathbf{w} = \mathbf{w} - \alpha \cdot \left(\nabla L_{S'}(h_w) + 2\lambda \mathbf{w}\right)$  
&nbsp;&nbsp;&nbsp;&nbsp;converged = check_convergence$(S,w)$  
**return** $\mathbf{w}$
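
The pseudocode can be sketched as a self-contained NumPy function. This is an illustrative version, not the class defined later in this notebook: the convergence check (change in full-sample loss below `tol`), the zero initialization, and all parameter defaults are assumptions.

```python
import numpy as np

def minibatch_sgd(X, y, alpha=0.1, lam=0.0, batch_size=2,
                  max_epochs=500, tol=1e-6, seed=0):
    """Mini-batch SGD for L2-regularized binary logistic regression."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    prev_loss = np.inf
    for _ in range(max_epochs):
        order = rng.permutation(len(y))            # randomly shuffle S
        for start in range(0, len(y), batch_size):
            idx = order[start:start + batch_size]  # current batch S'
            p = 1.0 / (1.0 + np.exp(-(X[idx] @ w)))
            grad = X[idx].T @ (p - y[idx]) / len(idx) + 2 * lam * w
            w = w - alpha * grad
        # Assumed convergence check: full-sample loss stops changing
        p_all = np.clip(1.0 / (1.0 + np.exp(-(X @ w))), 1e-15, 1 - 1e-15)
        loss = (-np.mean(y * np.log(p_all) + (1 - y) * np.log(1 - p_all))
                + lam * np.sum(w ** 2))
        if abs(prev_loss - loss) < tol:
            break
        prev_loss = loss
    return w

# Small linearly separable sample, padded with a bias column of ones
X = np.array([[0, 4, 1], [0, 3, 1], [5, 0, 1], [4, 1, 1], [0, 5, 1]], dtype=float)
y = np.array([0, 0, 1, 1, 0], dtype=float)
w = minibatch_sgd(X, y)
preds = (X @ w > 0).astype(int)   # sigmoid(z) > 0.5 exactly when z > 0
print(preds.tolist())
```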

## **Model**

In [50]:
import numpy as np
from sklearn.linear_model import SGDClassifier

def sigmoid(x):
    '''
        Sigmoid function f(x) =  1/(1 + exp(-x))
        :param x: A scalar or Numpy array
        :return: Sigmoid function evaluated at x (applied element-wise if it is an array)
    '''
    # exp(-|x|) never overflows, so both branches stay finite for any x
    z = np.exp(-np.abs(x))
    return np.where(x >= 0, 1 / (1 + z), z / (1 + z))

def get_estimator(train_epochs, lr):
    estimator = SGDClassifier(
        loss='log_loss',
        tol=None,
        max_iter=train_epochs,
        shuffle=True,
        random_state=0,
        learning_rate='constant',
        eta0=lr,
        alpha=0)
    return estimator

## Model: Standard Scaler

In [2]:
import numpy as np

class StandardScaler:
    def __init__(self, X):
        self.num_samples = X.shape[0]
        self.n_features = X.shape[1] # can include bias or not
        self.mean = None
        self.std = None
    
    def fit(self, X):
        self.mean = np.mean(X, axis=0)
        self.std = np.std(X, axis=0)
    
    def center(self, X):
        return X - self.mean
    
    def scale(self, X):
        X_centered = X - self.mean
        X_scaled = X_centered/self.std
        return X_scaled

## **Model: Representation - Logistic Regression**

In [44]:
import numpy as np

class MyLogisticRegression:
    '''
    Binary Logistic Regression that learns weights using 
    stochastic gradient descent.
    '''
    def __init__(self, batch_size=1, num_epochs=1, lr=0.0001, tol=1e-4):
        '''
        Initializes a LogisticRegression classifier.
        @attrs:
            learning_rate: the step size used in stochastic gradient descent
            num_epochs: the number of passes over the training data
            batch_size: the number of examples per gradient step
            weights: the weights of the Logistic Regression model, set by train()
            tol: the convergence tolerance (unused while the convergence check in train() is commented out)
        '''
        self.learning_rate = lr
        self.num_epochs = num_epochs
        self.batch_size = batch_size
        self.weights = None
        self.tol = tol

    def train(self, X, Y):
        '''
        Train the model, using batch stochastic gradient descent
        @params:
            X: a 2D Numpy array where each row contains an example, padded by 1 column for the bias
            Y: a 1D Numpy array containing the corresponding labels for each example
        @return:
            None
        '''
        num_samples, num_features = X.shape
        self.weights = np.zeros((1, num_features))
        previous_loss = float('inf')

        for epoch in range(self.num_epochs):
            shuffled_inds = np.random.permutation(num_samples)
            shuffled_X = X[shuffled_inds]
            shuffled_Y = Y[shuffled_inds]

            for start in range(0, num_samples, self.batch_size):
                end = start + self.batch_size
                X_batch = shuffled_X [start: min(end, num_samples)] 
                Y_batch = shuffled_Y [start: min(end, num_samples)] 

                predictions = sigmoid(np.dot(X_batch, self.weights.T)) # num_samples * 1 (num_classes)
                Y_batch = np.reshape(Y_batch,(len(Y_batch),1)) # num_samples * 1, reshape Y to same dimensions of sigmoid
                error = predictions - Y_batch
                loss_grad = np.dot(error.T, X_batch)/len(X_batch)
    
                self.weights -= self.learning_rate * loss_grad
            
            current_loss = self.loss(X, Y)

            # if abs(previous_loss - current_loss) < self.tol:
            #     # print(f'Convergence reached at epoch {epoch + 1}')
            #     break
            

            #print(f"Epoch {epoch + 1}, Weights: {self.weights}")
                            

    def loss(self, X, Y):
        '''
        Computes the logistic loss (binary cross-entropy loss) for binary classification
        @params:
            X: 2D Numpy array where each row contains an example, padded by 1 column for the bias
            Y: 1D Numpy array containing the corresponding labels for each example
        @return:
            A float number which is the average loss of the model on the dataset
        '''
        # Use probabilities (not hard class labels) and clip them to prevent log(0)
        y_prob = self.predict_proba(X).ravel()
        y_prob = np.clip(y_prob, 1e-15, 1 - 1e-15)

        # Per-example binary cross-entropy, averaged over the dataset
        losses = Y * np.log(y_prob) + (1 - Y) * np.log(1 - y_prob)
        return -np.mean(losses)


    def predict(self, X):
        '''
        Compute predictions based on the learned parameters and examples X
        @params:
            X: a 2D Numpy array where each row contains an example, padded by 1 column for the bias
        @return:
            A 1D Numpy array with one element for each row in X containing the predicted class.
        '''
        # X.shape: (batch size, num features)
        # self.weights.shape: (1, num features)
        dot_product = np.dot(self.weights, X.T) # n_classes * n_samples
        probs = sigmoid(dot_product)
        probsall = np.vstack((1-probs, probs)) # probs are for class 2
        y_predict = np.argmax(probsall, axis=0) #finding the index of the max value in a column
        return y_predict


    def accuracy(self, X, Y):
        '''
        Output the accuracy of the trained model on a given testing dataset X and labels Y.
        @params:
            X: a 2D Numpy array where each row contains an example, padded by 1 column for the bias
            Y: a 1D Numpy array containing the corresponding labels for each example
        @return:
            a float number indicating accuracy (between 0 and 1)
        '''
        predicted_classes = self.predict(X)
        return np.mean(predicted_classes == Y)
    
    def predict_proba(self, X):
        '''
        Compute probabilities for the input data X.
        @params:
        X: A 2D Numpy array where each row contains an example
        @return:
        Probabilities of each example being in class 1
        '''
        dot_product = np.dot(self.weights, X.T) # n_classes * n_samples
        probs = sigmoid(dot_product)
        
        return probs

## **Check Logistic Regression**

In [None]:
import pytest
import numpy as np
from sklearn.linear_model import SGDClassifier

# Set random seed for testing purposes
np.random.seed(0)

# Create Test Models
# SGDClassifier is always batch_size == 1
train_epochs = 1
lr = 0.01
my_model = MyLogisticRegression(lr=lr, batch_size=1, num_epochs=train_epochs)
sklearn_model = SGDClassifier(
    loss='log_loss', 
    max_iter=train_epochs,
    shuffle=True,
    random_state=0,
    learning_rate='constant',
    eta0=lr,
    alpha=0)

# Creates Test Data
x = np.array([[0,4], [0,3], [5,0], [4,1], [0,5]])
x_bias = np.array([[0,4,1], [0,3,1], [5,0,1], [4,1,1], [0,5,1]])
y = np.array([0,0,1,1,0])

my_model.train(x_bias, y)
sklearn_model.fit(x, y) # Gives same result as partial_fit

weights = my_model.weights
assert isinstance(weights, np.ndarray)
assert weights.ndim==2 and weights.shape == (1,3)
# FIXME: relative tolerance might be not strict enough
assert weights[0][:-1] == pytest.approx(sklearn_model.coef_[0], 0.01)
assert weights[0][-1] == pytest.approx(sklearn_model.intercept_[0], 0.01)

# print('My Model Weights',my_model.weights)
# print('sklearn weights',sklearn_model.coef_[0])
# print('sklearn bias', sklearn_model.intercept_)

# ===================================================================

train_epochs = 10
lr = 0.01
my_model = MyLogisticRegression(lr=lr, batch_size=1, num_epochs=train_epochs)
sklearn_model = SGDClassifier(
    loss='log_loss',
    tol=None,
    max_iter=train_epochs,
    shuffle=True,
    random_state=0,
    learning_rate='constant',
    eta0=lr,
    alpha=0)

my_model.train(x_bias, y)
sklearn_model.fit(x, y, coef_init=[[0,0]], intercept_init=[0])

weights = my_model.weights
# print('My Model Weights',my_model.weights)
# print('sklearn weights',sklearn_model.coef_[0])
# print('sklearn bias', sklearn_model.intercept_)
assert weights[0][:-1] == pytest.approx(sklearn_model.coef_[0], 0.01)
assert weights[0][-1] == pytest.approx(sklearn_model.intercept_[0], 0.1)

# Test model predictions
x_test = np.array([[0,0], [-5,3], [9,0], [1,0], [6,-7]])
x_test_bias = np.array([[0,0,1], [-5,3,1], [9,0,1], [1,0,1], [6,-7,1]])
y_test = np.array([0,0,1,0,1])

my_preds = my_model.predict(x_test_bias)
sklearn_preds = sklearn_model.predict(x_test)
assert (my_preds == sklearn_preds).all()

print("Check Model Finished")

## **Model: one-vs-all**

In [None]:
import numpy as np

class OnevsAll:
    def __init__(self, n_classes, batch_size=1, epochs=1, lr=0.01):
        self.n_classes = n_classes
        self.lr = lr
        self.batch_size = batch_size
        self.epochs = epochs
        # self.conv_threshold = conv_threshold

    def train(self, X, Y):
        num_inputs = X.shape[0]
        num_features = X.shape[1] - 1 # there's bias
        
        # Split data and train each representation
        self.S_Y = np.array([np.array(Y) for _ in range(self.n_classes)])
        self.h = np.array([
            MyLogisticRegression(
                batch_size=self.batch_size, 
                num_epochs=self.epochs, 
                lr=self.lr, 
                tol=0) for _ in range(self.n_classes)])
        self.conv_epochs = [0] * self.n_classes
        for cls in range(self.n_classes):
            # Create S_i for each class i
            S_Y_i = self.S_Y[cls]
            cls_idx = S_Y_i == cls
            S_Y_i[cls_idx] = 1
            non_cls_idx = np.logical_not(cls_idx)
            S_Y_i[non_cls_idx] = 0
            
            # Train h_i for each class i on S_i
            # (train() returns None, so there is no convergence epoch to record)
            self.h[cls].train(X, S_Y_i)
            
    # def loss(self, X, Y):
    #     preds = self.predict(X)
    #     # L1-loss
    #     losses = np.abs(Y-preds)
    #     losses = np.sum(losses)
    #     return losses

    def predict(self, X):
        # h(x) in argmax_i h_i(x): stack each classifier's probabilities, then take the best class per sample
        probs = np.vstack([h_i.predict_proba(X) for h_i in self.h])  # n_classes * n_samples
        return np.argmax(probs, axis=0)

    def accuracy(self, preds, Y):
        correct_preds = Y == preds
        return np.sum(correct_preds) / Y.shape[0]

## **Check Model: one-vs-all**

In [None]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import SGDClassifier
np.random.seed(0)

# Get data for testing
data = load_iris()

X = data.data
bias_col = np.ones((X.shape[0], 1))
X_bias = np.hstack((X, bias_col))

Y = data.target
n_classes = len(np.unique(Y))

# Initialize models:
train_epochs = 5
lr = 0.03
my_model = OnevsAll(n_classes, epochs=train_epochs, lr=lr)  # use the same settings as the sklearn estimator

estimator = get_estimator(train_epochs, lr)
sklearn_model = OneVsRestClassifier(estimator)

my_model.train(X_bias, Y)
sklearn_model.fit(X, Y)

# Check that S is populated correctly
assert my_model.S_Y.shape[0] == n_classes
assert my_model.S_Y.shape[1] == Y.shape[0]

# Check h
assert len(my_model.h) == n_classes
assert len(sklearn_model.estimators_) == n_classes
for h in my_model.h:
    assert isinstance(h, MyLogisticRegression)
for i in range(n_classes):
    my_weights = my_model.h[i].weights[0][:-1]
    my_bias = my_model.h[i].weights[0][-1]
    sklearn_weights = sklearn_model.estimators_[i].coef_[0]
    sklearn_bias = sklearn_model.estimators_[i].intercept_[0]
    print(" === ", i)
    print(my_weights)
    print(sklearn_weights)
    print(my_bias)
    print(sklearn_bias)

    # assert my_weights == pytest.approx(sklearn_weights, 0.01)
    # assert my_bias == pytest.approx(sklearn_bias, 0.01)

# Check predictions
predictions = my_model.predict(X_bias)
print("My Predictions:", np.array(predictions))

sklearn_predictions = sklearn_model.predict(X)
print("sklearn Predictions:", sklearn_predictions)

print('num_samples', X.shape[0])
print('Differences', np.sum(np.abs(np.array(predictions)-sklearn_predictions)))

## **Model: all-pairs**

In [None]:
# You can only use python and numpy in this section.
import numpy as np

class AllPairs:
    def __init__(self, n_classes, conv_threshold, batch_size, epochs, lr):
        self.n_classes = n_classes
        self.batch_size = batch_size
        self.epochs = epochs
        self.conv_threshold = conv_threshold
        self.models = {}
        self.lr = lr

    def train(self, X, Y):
        
        for i in range(self.n_classes):
            for j in range(i + 1, self.n_classes):
                # Keep only the examples from classes i and j
                pair_mask = (Y == i) | (Y == j)
                X_selected = X[pair_mask]

                # Relabel: class i -> 0, class j -> 1
                Y_selected = np.where(Y[pair_mask] == i, 0, 1)

                model = MyLogisticRegression(batch_size=self.batch_size, num_epochs=self.epochs, lr=self.lr, tol=self.conv_threshold)
                model.train(X_selected, Y_selected)
                key = i, j
                self.models[key] = model

    def loss(self, X, Y):
        '''
        Average Logistic loss?
        total_loss = 0
        for (i, j), model in self.models.items():
            selected_indices = []
            for index, label in enumerate(Y):
                    if label == i or label == j:
                        selected_indices.append(index)

            X_selected = X[selected_indices]
            Y_selected = Y[selected_indices]
            
            Y_selected = np.where(Y_selected == i, 0, 1)
            total_loss += model.loss(X_selected, Y_selected)

        return total_loss / len(self.models)
        
        '''
        prediction = self.predict(X)
        losses = np.abs(Y - prediction)
        return np.sum(losses)

    def predict(self, X):
        votes = np.zeros((X.shape[0], self.n_classes))
        for key in self.models:
            i, j = key
            model = self.models[key]
            prediction = model.predict(X)
    
            for idx in range(len(prediction)):
                if prediction[idx] == 0:
                    votes[idx, i] = votes[idx, i] + 1 
                elif prediction[idx] == 1:
                    votes[idx, j] = votes[idx, j] + 1
        return np.argmax(votes, axis=1)
        

    def accuracy(self, X, Y):
        predictions = self.predict(X)
        return np.mean(predictions == Y)

## **Check Model: all-pairs**

In [None]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC

np.random.seed(0)

data = load_iris()

X = data.data
bias_col = np.ones((X.shape[0], 1))
X_biased = np.hstack((X, bias_col))

Y = data.target
n_classes = len(np.unique(Y))

model = AllPairs(n_classes=n_classes, conv_threshold=0.001, batch_size=10, epochs=5000, lr=0.01)
model.train(X_biased, Y)



predictions = model.predict(X_biased)


print("PredictionsA:", np.array(predictions))

sklearn_predictions = OneVsOneClassifier(LinearSVC(random_state=0)).fit(X, Y).predict(X)
print("PredictionsB:", sklearn_predictions)

print('num_samples', X.shape[0])
print('Differences', np.sum(np.abs(np.array(predictions)-sklearn_predictions)))

## **Main**

In [None]:
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
# from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier # Use this instead of LogisticRegression
from sklearn.model_selection import train_test_split

import numpy as np


iris = load_iris()
print(iris.data.shape)  # (150, 4)
print(iris.feature_names)  # ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
print(iris.target_names)  # ['setosa', 'versicolor', 'virginica']

num_classes = len(iris.target_names)

# Compare performance with sklearn on iris dataset
random_seed = 0
train_epochs = 1
lr = 0.01  # assumed step size, matching the checks above; learning_rate='constant' needs eta0 > 0

# Both meta-estimators take a single base estimator and clone it internally
# (once per class pair for one-vs-one, once per class for one-vs-rest).
one_model = OneVsOneClassifier(SGDClassifier(
    loss='log_loss',
    max_iter=train_epochs,
    shuffle=True,
    random_state=random_seed,
    learning_rate='constant',
    eta0=lr))

rest_model = OneVsRestClassifier(SGDClassifier(
    loss='log_loss',
    max_iter=train_epochs,
    shuffle=True,
    random_state=random_seed,
    learning_rate='constant',
    eta0=lr))

# Note: SGDClassifier.partial_fit does not allow specific batch size.
# SGDClassifier trains on batch size == 1 in all cases.

In [None]:
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import numpy as np
import matplotlib.pyplot as plt

# Load Iris dataset
data = load_iris()
X = data.data
Y = data.target

# Apply PCA to reduce the dimensions to 2 for visualization
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

In [None]:
# Train your One-vs-All model
'''one_vs_all_model = OnevsAll(n_classes=len(np.unique(Y)))
one_vs_all_model.train(X_reduced, Y)'''

# Train your All-Pairs model
n_classes=len(np.unique(Y))
all_pairs_model = AllPairs(n_classes=n_classes, batch_size=10, conv_threshold=0.001, epochs=5000, lr=0.01)
all_pairs_model.train(X_reduced, Y)

In [None]:
def plot_decision_boundaries(X, Y, model, title, subplot_position):
    # Create a meshgrid of points
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300),
                         np.linspace(y_min, y_max, 300))

    # Predict for each point in the mesh
    mesh_points = np.c_[xx.ravel(), yy.ravel()]
    Z = model.predict(mesh_points)
    Z = Z.reshape(xx.shape)

    # Plotting decision boundaries
    plt.subplot(subplot_position)
    plt.contourf(xx, yy, Z, alpha=0.3)
    plt.scatter(X[:, 0], X[:, 1], c=Y, edgecolor='k', s=30, cmap=plt.cm.Set1)
    plt.title(title)
    plt.xlabel('First principal component')
    plt.ylabel('Second principal component')

# Prepare figure
plt.figure(figsize=(12, 10))

# Plot decision boundary for your One-vs-All model
'''plot_decision_boundaries(X_reduced, Y, one_vs_all_model, "One-vs-All Model", 222)'''

# Plot decision boundary for your All-Pairs model
plot_decision_boundaries(X_reduced, Y, all_pairs_model, "All-Pairs Model", 223)

plt.tight_layout()
plt.show()
