## SVM implementations

**Support vector methods** explicitly estimate decision boundaries in the input space to separate classes (perfectly or not) or, additionally, to perform regression. The first of these methods, the *maximal margin classifier*, tries to find a hyperplane that perfectly separates two classes. The *support vector classifier*, in turn, allows the classes to overlap while still fitting a linear decision boundary. *Support vector machines* are a generalization that not only allows misclassification of training data points, but also produces non-linear decision boundaries in the original input space.

The explicit estimation of decision boundaries by SVMs depends on the definition of a hyperplane, which is given by the set of points $\{x \in \mathbb{R}^p \mid f(x) = x^T\beta + \beta_0 = 0\}$. For a binary classification task, where $Y \in \{-1, 1\}$ is the target variable, if a given instance $x$ falls above the hyperplane, i.e., $f(x) > 0$, then $x$ is assigned to class $Y = 1$; if $f(x) < 0$, it is assigned to class $Y = -1$. Consequently, the classification rule induced by $f(x)$ is $G(x) = \mathrm{sign}(x^T\beta + \beta_0)$.
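
The classification rule can be illustrated with a minimal sketch. The function name `classify` and the numeric values of `beta` and `beta_0` are hypothetical, chosen only for illustration, and sending points that lie exactly on the hyperplane to class 1 is an assumption of this sketch.

```python
import numpy as np

def classify(X, beta, beta_0):
    """Return G(x) = sign(x^T beta + beta_0) for each row of X."""
    scores = X @ beta + beta_0
    # Points exactly on the hyperplane (score 0) are assigned to class +1
    # here; this tie-breaking choice is an assumption of the sketch.
    return np.where(scores >= 0, 1, -1)

beta = np.array([1.0, -1.0])   # hypothetical normal vector of the hyperplane
beta_0 = 0.5                   # hypothetical intercept
X = np.array([[2.0, 0.0],      # f(x) =  2.5
              [0.0, 2.0],      # f(x) = -1.5
              [1.0, 1.0]])     # f(x) =  0.5
labels = classify(X, beta, beta_0)
```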

**Linearly separable case:** perfect separation can be summarized by the following property, which holds for all available data points: $y_if(x_i) > 0$ $\forall$ $i \in \{1, 2, ..., N\}$. The optimal separating hyperplane $f(x)$ is then defined by maximizing the margin $M \in \mathbb{R}_+$ that separates the points for which $Y = 1$ from those for which $Y = -1$:
\begin{equation}
    \displaystyle \max_{\beta, \beta_0} M
\end{equation}
\begin{equation}
  \displaystyle \mbox{subject to } ||\beta|| = 1
\end{equation}

\begin{equation}
    \displaystyle y_i(x_i^T\beta + \beta_0) \geq M \forall i \in \{1, 2, ..., N\}
\end{equation}

Where $y_i(x_i^T\beta + \beta_0)$ is the *signed distance* between training instance $x_i$ and the hyperplane (positive when $x_i$ lies on the correct side). More precisely, $2M$ is the width of the margin between the two classes, since $M$ is the distance from the hyperplane to the closest data points of either class. This optimization problem can be restated as:
\begin{equation}
    \displaystyle \min_{\beta, \beta_0} ||\beta||
\end{equation}
\begin{equation}
    \displaystyle \mbox{subject to } y_i(x_i^T\beta + \beta_0) \geq 1 \forall i \in \{1, 2, ..., N\}
\end{equation}

Where $M$ is taken to be equal to $1/||\beta||$. The estimated function $\hat{f}(x) = x^T\hat{\beta} + \hat{\beta}_0$ that follows from this problem is named the **maximal margin classifier**.
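
The step from the first problem to the second can be made explicit. The following is a short sketch of the standard argument (as presented in The Elements of Statistical Learning), not an additional result. Dropping the constraint $||\beta|| = 1$ and measuring distances with the normalized vector $\beta/||\beta||$, the margin constraint reads:
\begin{equation}
    \displaystyle \frac{1}{||\beta||}y_i(x_i^T\beta + \beta_0) \geq M \iff y_i(x_i^T\beta + \beta_0) \geq M||\beta||
\end{equation}

Any positive rescaling of $(\beta, \beta_0)$ satisfies the same inequality, so the scale can be fixed arbitrarily by setting $||\beta|| = 1/M$. The right-hand side then becomes 1, and maximizing $M$ is equivalent to minimizing $||\beta||$.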

The **linearly non-separable case** requires changing the main constraint of the first optimization problem to $y_i(x_i^T\beta + \beta_0) \geq M(1 - \xi_i)$, where $\{\xi_i\}_{i=1}^N$, with $\xi_i \geq 0$, are the *slack variables* that allow some data points to be misclassified: an observation falls on the wrong side of the hyperplane when $\xi_i > 1$. Observations with $0 < \xi_i < 1$ lie on the correct side of the hyperplane, but inside the margin. A new constraint, $\sum_i \xi_i \leq K$, bounds the total slack and therefore limits the number of training data points that can be misclassified. The parameters $\hat{\beta}$ that solve this modified version of the optimization problem define the **support vector classifier**. This method not only handles classification in the non-separable case, but is also more robust to overfitting than the maximal margin classifier, even when perfect separation is possible.
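
The role of the slack variables can be sketched numerically. The code below assumes the rescaled formulation in which the margin edges sit at $f(x) = \pm 1$, so that $\xi_i = \max[0, 1 - y_if(x_i)]$; the function name `slack_report` and the toy data are hypothetical.

```python
import numpy as np

def slack_report(X, y, beta, beta_0):
    """Compute slack values and a label describing each point's position."""
    f = X @ beta + beta_0
    # Assumes the rescaled problem, where the margin edges are f(x) = +1 and -1.
    xi = np.maximum(0.0, 1.0 - y * f)
    # xi = 0: on or outside the margin; 0 < xi <= 1: inside the margin
    # (xi = 1 is exactly on the hyperplane); xi > 1: wrong side.
    status = np.where(xi == 0, "outside margin",
                      np.where(xi <= 1, "inside margin", "misclassified"))
    return xi, status

beta = np.array([1.0, 0.0])    # hypothetical hyperplane: x_1 = 0
beta_0 = 0.0
X = np.array([[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0], [-3.0, 0.0]])
y = np.array([1.0, 1.0, 1.0, -1.0])
xi, status = slack_report(X, y, beta, beta_0)
```

In this toy example the total slack is $\sum_i \xi_i = 2.5$, so the constraint $\sum_i \xi_i \leq K$ would be satisfied for any $K \geq 2.5$.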

Support vector methods are highly intuitive, but their mathematical formulation is somewhat complex. Since they are grounded in optimization problems, and since the same optimization problem can be expressed in many ways, presenting every formulation would be cumbersome. [The Elements of Statistical Learning](https://web.stanford.edu/~hastie/Papers/ESLII.pdf) presents a complete derivation of SVMs, including one that justifies the term *support vectors*: only observations on the margin, inside it, or on the wrong side of it effectively contribute to the construction of the hyperplane.

**Support vector machines** have the same structure as support vector classifiers. The main difference is the use of basis functions to enlarge the feature space, which produces linear decision boundaries in the enlarged space but non-linear boundaries in the original space, and this may help increase predictive performance. The hyperplane equation is now given by $f(x) = h(x)^T\beta + \beta_0$, where $h(x)$ is a vector of basis functions constructed from the original inputs. The formulation that leads to the support vector classifier implies the following:
\begin{equation}
  \displaystyle \hat{f}(x) = x^T\hat{\beta} + \hat{\beta}_0 = \sum_{i=1}^N \hat{\alpha}_iy_ix^Tx_i + \hat{\beta}_0
\end{equation}

Consequently:
\begin{equation}
  \displaystyle \hat{f}(x) = \sum_{i=1}^N \hat{\alpha}_iy_i\langle x, x_i \rangle + \hat{\beta}_0
\end{equation}

Where $\langle x, x_i \rangle$ stands for the inner product between vectors $x$ and $x_i$, while $\{\hat{\alpha}_i\}$ are the Lagrange multipliers of the support vector classifier's optimization problem. SVMs replace $x$ and $x_i$ by the basis functions $h(x)$ and $h(x_i)$. In practice, SVMs use a more general *kernel function* $K(x, x_i)$ instead of $\langle x, x_i \rangle$, such that $K(x, x_i) = \langle h(x), h(x_i) \rangle$, so $x$ is implicitly replaced by $h(x)$. Some of the main choices of kernel function are:
\begin{equation}
    \displaystyle K(x, x') = (1 + \langle x, x' \rangle)^d
\end{equation}
\begin{equation}
    \displaystyle K(x, x') = \exp(-\gamma||x - x'||^2)
\end{equation}
\begin{equation}
    \displaystyle K(x, x') = \tanh(\kappa_1\langle x, x' \rangle + \kappa_2)
\end{equation}

These are the $d$-th degree polynomial, the radial basis, and the neural network kernels, respectively. For SVMs, the separating hyperplane is therefore given by:

\begin{equation}
  \displaystyle \hat{f}(x) = \sum_{i=1}^N \hat{\alpha}_iy_iK(x, x_i) + \hat{\beta}_0
\end{equation}

Note that support vector classifiers are a particular case of SVMs when the kernel is linear, i.e., $K(x, x_i) = \langle x, x_i \rangle$.
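
The three kernels and the kernelized decision function can be sketched as follows. The function names, default hyper-parameter values, and the toy multipliers `alpha` are hypothetical; in a fitted model the multipliers would come from solving the optimization problem, which this sketch does not do.

```python
import numpy as np

def polynomial_kernel(x, z, d=2):
    """d-th degree polynomial kernel: (1 + <x, z>)^d."""
    return (1.0 + np.dot(x, z)) ** d

def rbf_kernel(x, z, gamma=0.5):
    """Radial basis kernel: exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def neural_network_kernel(x, z, kappa_1=1.0, kappa_2=0.0):
    """Neural network kernel: tanh(kappa_1 * <x, z> + kappa_2)."""
    return np.tanh(kappa_1 * np.dot(x, z) + kappa_2)

def decision_function(x, train_X, train_y, alpha, beta_0, kernel):
    """Evaluate f(x) = sum_i alpha_i * y_i * K(x, x_i) + beta_0."""
    terms = (a * y_i * kernel(x, x_i)
             for a, y_i, x_i in zip(alpha, train_y, train_X))
    return sum(terms) + beta_0

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.0])
k_poly = polynomial_kernel(x, z, d=2)
k_rbf = rbf_kernel(x, z, gamma=0.5)

# With the linear kernel, the decision function reduces to x^T beta + beta_0,
# where beta = sum_i alpha_i * y_i * x_i (hypothetical multipliers below).
train_X = np.array([[1.0, 0.0], [0.0, 1.0]])
train_y = np.array([1.0, -1.0])
alpha = np.array([2.0, 1.0])
f_linear = decision_function(x, train_X, train_y, alpha, 0.5, np.dot)
```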

The final version of the optimization problem solved to fit an SVM model, which is the one assumed by the [article](https://towardsdatascience.com/svm-implementation-from-scratch-python-2db2fc52e5c2) whose code is presented and discussed next, makes the intrinsic regularization of SVMs explicit:
\begin{equation}
  \displaystyle \min_{\beta, \beta_0} \sum_{i=1}^N \max[0, 1 - y_if(x_i)] + \frac{\lambda}{2}||\beta||^2
\end{equation}

Where $L(y, f) = \max[0, 1 - yf(x)]$ is the *hinge loss function*, while $\frac{\lambda}{2}||\beta||^2$ is the penalty term. The larger the regularization parameter $\lambda$, the more the coefficients of the hyperplane are shrunk towards zero; the stronger the regularization, the more training data points can be misclassified, i.e., the larger the margin.
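
The objective can be evaluated directly for a given hyperplane. The following is a minimal sketch; the function names and the toy data are hypothetical.

```python
import numpy as np

def hinge_loss(y, f):
    """Hinge loss max[0, 1 - y * f], computed element-wise."""
    return np.maximum(0.0, 1.0 - y * f)

def svm_objective(beta, beta_0, X, y, lam):
    """Sum of hinge losses plus the penalty (lam / 2) * ||beta||^2."""
    f = X @ beta + beta_0
    return np.sum(hinge_loss(y, f)) + 0.5 * lam * np.dot(beta, beta)

X = np.array([[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0])
beta = np.array([1.0, 0.0])
# Only the second point is inside the margin (hinge loss 0.5); the penalty
# contributes 0.5 * 2.0 * 1.0 = 1.0.
objective = svm_objective(beta, 0.0, X, y, lam=2.0)
```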

Finally, still regarding the theoretical grounds of support vector methods, they do not explicitly produce estimates of class probabilities. Additional algorithms can perform this estimation by regressing the true labels on the outputs of an SVM model, which are used as inputs to an auxiliary estimation method such as a logistic regression (*Platt scaling*) or a piecewise-constant monotonic regression (*isotonic regression*).
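
The idea behind Platt scaling can be sketched as fitting a one-dimensional logistic regression on SVM outputs. The sketch below assumes the inputs are raw decision values $\hat{f}(x)$ and fits the sigmoid by plain gradient descent, which is a simplification: Platt's original procedure uses smoothed targets and a different optimizer. Function names and toy scores are hypothetical.

```python
import numpy as np

def fit_sigmoid(scores, y, lr=0.1, n_iter=2000):
    """Fit P(Y = 1 | s) = 1 / (1 + exp(-(a * s + b))) by gradient descent."""
    t = (y + 1) / 2.0                      # map labels {-1, 1} to {0, 1}
    a, b = 1.0, 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(a * scores + b)))
        a -= lr * np.mean((p - t) * scores)   # gradient of the mean log-loss
        b -= lr * np.mean(p - t)
    return a, b

def predict_proba(scores, a, b):
    """Map decision values to estimated probabilities of class 1."""
    return 1.0 / (1.0 + np.exp(-(a * scores + b)))

# Hypothetical decision values from an SVM, sorted in increasing order, with
# overlapping classes so that the fitted slope stays finite.
scores = np.array([-2.0, -1.0, -0.5, 0.3, 1.0, 2.5])
y = np.array([-1.0, -1.0, 1.0, -1.0, 1.0, 1.0])
a, b = fit_sigmoid(scores, y)
probs = predict_proba(scores, a, b)
```

In practice the calibration map would be fitted on held-out data rather than on the data used to train the SVM.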

The main [article](https://towardsdatascience.com/svm-implementation-from-scratch-python-2db2fc52e5c2) whose code is reproduced here implements SVMs by solving the last minimization problem presented above through **stochastic gradient descent (SGD)**, a widely used optimization algorithm that is also applied to fit neural networks, for instance. The machine learning pipeline developed by the author starts by removing both highly correlated features and those whose relationship with the target variable is not statistically significant. Then, all input variables are normalized using min-max scaling. This is crucial because the model assigns a weight to each variable, as logistic regression and neural networks do, and very different scales could distort the relevance of each input in the construction of the hyperplane (a similar concern applies to distance-based methods such as KNN).
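
Min-max scaling maps each input to the interval $[0, 1]$ using the column-wise minimum and maximum. The following is a minimal NumPy sketch with hypothetical function names; it estimates the minimum and maximum on the training rows only and reuses them for new data, which is a design choice of this sketch.

```python
import numpy as np

def min_max_fit(X):
    """Return the column-wise minimum and maximum of X."""
    return X.min(axis=0), X.max(axis=0)

def min_max_transform(X, col_min, col_max):
    """Scale columns with (x - min) / (max - min)."""
    # Constant columns would give a zero range; a range of 1 is used for them
    # to avoid division by zero (an assumption of this sketch).
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    return (X - col_min) / span

X_train = np.array([[1.0, 100.0], [3.0, 300.0], [5.0, 500.0]])
X_test = np.array([[2.0, 600.0]])

col_min, col_max = min_max_fit(X_train)
X_train_scaled = min_max_transform(X_train, col_min, col_max)
# Test values outside the training range fall outside [0, 1].
X_test_scaled = min_max_transform(X_test, col_min, col_max)
```

The notebook code scales the full dataset before the train-test split; fitting the scaler on the training rows only, as sketched here, keeps information from the test set out of the preprocessing step.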

After the train-test split, the author trains the SVM model by estimating the coefficients of the separating hyperplane with the SGD algorithm. The *sgd* function relies on the function *compute_cost*, which takes weights, inputs and outputs and returns the cost function defined above for a given regularization parameter (*actually, instead of $\lambda$, the code uses the regularization parameter $C = 1/\lambda$, which multiplies the hinge loss instead of the norm of $\beta$*). Most importantly, *sgd* uses the *calculate_cost_gradient* function, which computes the gradient of the cost function so that weights can be updated with the *gradient descent update rule*. The *sgd* function iterates over a given number of epochs; within each epoch, after shuffling the data, it iterates over each training instance, i.e., it implements mini-batch gradient descent with batch size $S = 1$. At each iteration, weights are updated using the gradient of the cost function evaluated at the mini-batch. At epochs that are powers of two (and at the last epoch), a termination criterion is evaluated: training stops if the cost changed by less than a predefined percentage relative to the previously recorded cost.
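
The update performed at each iteration can be written out. What follows is a sketch inferred from the description above, assuming that the intercept is absorbed into $\beta$ through a constant input equal to 1 and writing $\eta$ for the learning rate (a symbol introduced here for illustration). For a single training instance $(x_i, y_i)$, the cost expressed with $C$ is:
\begin{equation}
  \displaystyle J_i(\beta) = \frac{1}{2}||\beta||^2 + C\max[0, 1 - y_ix_i^T\beta]
\end{equation}

A subgradient of this cost with respect to $\beta$ is:
\begin{equation}
  \displaystyle \nabla_\beta J_i(\beta) = \begin{cases} \beta & \mbox{if } y_ix_i^T\beta \geq 1 \\ \beta - Cy_ix_i & \mbox{otherwise} \end{cases}
\end{equation}

The gradient descent update rule is then:
\begin{equation}
  \displaystyle \beta \leftarrow \beta - \eta\nabla_\beta J_i(\beta)
\end{equation}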

**References**
<br>
[SVM From Scratch](https://towardsdatascience.com/svm-implementation-from-scratch-python-2db2fc52e5c2).
<br>
[The Elements of Statistical Learning](https://web.stanford.edu/~hastie/Papers/ESLII.pdf).
<br>
[Support Vector Machine Algorithm for Data Scientists](https://www.analyticsvidhya.com/blog/2021/07/svm-support-vector-machine-algorithm/).

----------------

This notebook first imports all relevant libraries, then prepares the data, and finally presents an implementation of a linear SVM fitted with stochastic gradient descent. The implementation follows this [article](https://towardsdatascience.com/svm-implementation-from-scratch-python-2db2fc52e5c2).

**Summary:**
1. [Libraries](#libraries)<a href='#libraries'></a>.
2. [Preparing the data](#data_prep)<a href='#data_prep'></a>.
3. [Inner functioning of SVM](#inner_functioning)<a href='#inner_functioning'></a>.
4. [Training the model](#model_training)<a href='#model_training'></a>.

<a id='libraries'></a>

## Libraries

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
cd "/content/gdrive/MyDrive/Studies/svm/Codes"

/content/gdrive/MyDrive/Studies/svm/Codes


In [None]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split as tts
from sklearn.metrics import accuracy_score, recall_score, precision_score
from sklearn.utils import shuffle



<a id='data_prep'></a>

## Preparing the data

<a id='data_transf'></a>

### Reading and transforming the data

In [None]:
# Importing the data:
data = pd.read_csv('../Datasets/data.csv')
data.drop(data.columns[[-1, 0]], axis=1, inplace=True)

# Treating the target variable:
diag_map = {'M': 1.0, 'B': -1.0}
data['diagnosis'] = data['diagnosis'].map(diag_map)

# Input and output variables:
Y = data.loc[:, 'diagnosis']
X = data.iloc[:, 1:]

# Scaling the data:
X_normalized = MinMaxScaler().fit_transform(X.values)
X = pd.DataFrame(X_normalized)
X.insert(loc=len(X.columns), column='intercept', value=1)

# Train-test split:
X_train, X_test, y_train, y_test = tts(X, Y, test_size=0.2, random_state=42)

<a id='feat_selection'></a>

### Features selection

In [None]:
# Function that removes highly correlated input variables:
def remove_correlated_features(X):
    corr_threshold = 0.9
    corr = X.corr()
    drop_columns = np.full(corr.shape[0], False, dtype=bool)
    for i in range(corr.shape[0]):
        for j in range(i + 1, corr.shape[0]):
            if corr.iloc[i, j] >= corr_threshold:
                drop_columns[j] = True
    columns_dropped = X.columns[drop_columns]
    X.drop(columns_dropped, axis=1, inplace=True)
    return columns_dropped

# Function that removes input variables that are not statistically significant
# (backward elimination based on the p-values of an OLS regression):
def remove_less_significant_features(X, Y):
    sl = 0.05
    regression_ols = None
    columns_dropped = np.array([])
    for itr in range(0, len(X.columns)):
        regression_ols = sm.OLS(Y, X).fit()
        max_col = regression_ols.pvalues.idxmax()
        max_val = regression_ols.pvalues.max()
        if max_val > sl:
            X.drop(max_col, axis='columns', inplace=True)
            columns_dropped = np.append(columns_dropped, [max_col])
        else:
            break
    regression_ols.summary()
    return columns_dropped

<a id='inner_functioning'></a>

## Inner functioning of SVM

<a id='cost_function'></a>

### Cost function

In [None]:
# Function that calculates the cost function of SVMs:
def compute_cost(W, X, Y):
    # calculate hinge loss
    N = X.shape[0]
    distances = 1 - Y * (np.dot(X, W))
    distances[distances < 0] = 0  # equivalent to max(0, distance)
    hinge_loss = regularization_strength * (np.sum(distances) / N)

    # calculate cost
    cost = 1 / 2 * np.dot(W, W) + hinge_loss
    return cost

# Function that calculates the gradient vector of cost function with respect to coefficients:
def calculate_cost_gradient(W, X_batch, Y_batch):
    # if only one example is passed (eg. in case of SGD)
    if isinstance(Y_batch, np.float64):
        Y_batch = np.array([Y_batch])
        X_batch = np.array([X_batch])  # gives multidimensional array

    distance = 1 - (Y_batch * np.dot(X_batch, W))
    dw = np.zeros(len(W))

    for ind, d in enumerate(distance):
        if max(0, d) == 0:
            di = W
        else:
            di = W - (regularization_strength * Y_batch[ind] * X_batch[ind])
        dw += di

    dw = dw/len(Y_batch)  # average
    return dw
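
A finite-difference check is one way to gain confidence in a gradient formula of this kind. The sketch below is self-contained and does not call the notebook functions: it re-implements the cost and its gradient with an explicit argument `C` (instead of the global `regularization_strength`), and all names and the random toy data are hypothetical.

```python
import numpy as np

def cost(W, X, Y, C):
    """0.5 * ||W||^2 plus C times the average hinge loss."""
    hinge = np.maximum(0.0, 1.0 - Y * (X @ W))
    return 0.5 * np.dot(W, W) + C * np.mean(hinge)

def gradient(W, X, Y, C):
    """Analytic (sub)gradient of `cost` with respect to W."""
    active = (1.0 - Y * (X @ W)) > 0          # points with positive hinge loss
    hinge_part = (Y[active][:, None] * X[active]).sum(axis=0) / len(Y)
    return W - C * hinge_part

def numerical_gradient(W, X, Y, C, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(W)
    for k in range(len(W)):
        step = np.zeros_like(W)
        step[k] = eps
        grad[k] = (cost(W + step, X, Y, C) - cost(W - step, X, Y, C)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
Y = np.where(rng.random(20) < 0.5, -1.0, 1.0)
W = rng.normal(size=3)

analytic = gradient(W, X, Y, C=10.0)
numeric = numerical_gradient(W, X, Y, C=10.0)
max_diff = np.max(np.abs(analytic - numeric))
```

The two gradients should agree closely as long as no point sits exactly at the kink of the hinge loss, where the cost is not differentiable.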

<a id='sgd'></a>

### Stochastic gradient descent

In [None]:
# Function that implements SGD for minimizing the cost function of SVMs:
def sgd(features, outputs):
    max_epochs = 5000
    weights = np.zeros(features.shape[1])
    nth = 0
    prev_cost = float("inf")
    cost_threshold = 0.01  # relative change in cost (0.01 = 1%)
    # stochastic gradient descent
    for epoch in range(1, max_epochs):
        # shuffle to prevent repeating update cycles
        X, Y = shuffle(features, outputs)
        for ind, x in enumerate(X):
            ascent = calculate_cost_gradient(weights, x, Y[ind])
            weights = weights - (learning_rate * ascent)

        # convergence check on 2^nth epoch
        if epoch == 2 ** nth or epoch == max_epochs - 1:
            cost = compute_cost(weights, features, outputs)
            print("Epoch is: {} and Cost is: {}".format(epoch, cost))
            # stoppage criterion
            if abs(prev_cost - cost) < cost_threshold * prev_cost:
                return weights
            prev_cost = cost
            nth += 1
    return weights

<a id='model_training'></a>

## Model training

<a id='svm_algorithm'></a>

### SVM algorithm

In [None]:
# Algorithm for fitting an SVM model:
def init():
    print("reading dataset...")
    # read data in pandas (pd) data frame
    data = pd.read_csv('../Datasets/data.csv')

    # drop last column (extra column added by pd)
    # and unnecessary first column (id)
    data.drop(data.columns[[-1, 0]], axis=1, inplace=True)

    print("applying feature engineering...")
    # convert categorical labels to numbers
    diag_map = {'M': 1.0, 'B': -1.0}
    data['diagnosis'] = data['diagnosis'].map(diag_map)

    # put features & outputs in different data frames
    Y = data.loc[:, 'diagnosis']
    X = data.iloc[:, 1:]

    # filter features
    remove_correlated_features(X)
    remove_less_significant_features(X, Y)

    # normalize data for better convergence and to prevent overflow
    X_normalized = MinMaxScaler().fit_transform(X.values)
    X = pd.DataFrame(X_normalized)

    # insert 1 in every row for intercept b
    X.insert(loc=len(X.columns), column='intercept', value=1)

    # split data into train and test set
    print("splitting dataset into train and test sets...")
    X_train, X_test, y_train, y_test = tts(X, Y, test_size=0.2, random_state=42)

    # train the model
    print("training started...")
    W = sgd(X_train.to_numpy(), y_train.to_numpy())
    print("training finished.")
    print("weights are: {}".format(W))

    # testing the model
    print("testing the model...")
    y_train_predicted = np.array([])
    for i in range(X_train.shape[0]):
        yp = np.sign(np.dot(X_train.to_numpy()[i], W))
        y_train_predicted = np.append(y_train_predicted, yp)

    y_test_predicted = np.array([])
    for i in range(X_test.shape[0]):
        yp = np.sign(np.dot(X_test.to_numpy()[i], W))
        y_test_predicted = np.append(y_test_predicted, yp)

    print("accuracy on test dataset: {}".format(accuracy_score(y_test, y_test_predicted)))
    print("recall on test dataset: {}".format(recall_score(y_test, y_test_predicted)))
    print("precision on test dataset: {}".format(precision_score(y_test, y_test_predicted)))
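
The row-by-row prediction loops can also be expressed as a single matrix product. The following is a minimal sketch with a hypothetical `predict` helper and toy values, assuming the last column of `X` is the constant 1 used for the intercept.

```python
import numpy as np

def predict(X, W):
    """Predict labels as the sign of X @ W (0 if a score is exactly zero)."""
    return np.sign(X @ W)

# Toy inputs: two features plus the intercept column of ones.
X = np.array([[1.0, 2.0, 1.0],
              [0.0, -1.0, 1.0]])
W = np.array([1.0, -1.0, 0.5])

vectorized = predict(X, W)
# Same computation written as an explicit loop over rows, for comparison.
looped = np.array([np.sign(np.dot(X[i], W)) for i in range(X.shape[0])])
```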

<a id='fitting_model'></a>

### Fitting the model

In [None]:
# Hyper-parameters:
regularization_strength = 10000
learning_rate = 0.000001

# Fitting the model:
init()

reading dataset...
applying feature engineering...
splitting dataset into train and test sets...
training started...
Epoch is: 1 and Cost is: 7265.461796282826
Epoch is: 2 and Cost is: 6548.324667994216
Epoch is: 4 and Cost is: 5454.715412780056
Epoch is: 8 and Cost is: 3963.303287448077
Epoch is: 16 and Cost is: 2640.61157366003
Epoch is: 32 and Cost is: 2044.4096653757088
Epoch is: 64 and Cost is: 1590.1977865307224
Epoch is: 128 and Cost is: 1325.3659890065799
Epoch is: 256 and Cost is: 1161.805355679864
Epoch is: 512 and Cost is: 1074.162147916526
Epoch is: 1024 and Cost is: 1048.5991553237839
Epoch is: 2048 and Cost is: 1044.5797119244573
training finished.
weights are: [ 3.5559056  11.04953495 -2.28574834 -7.89109122 10.1562053  -1.27033218
 -6.44637726  2.25265913 -3.88768916  3.26020627  4.9645179   4.8243594
 -4.7149502 ]
testing the model...
accuracy on test dataset: 0.9912280701754386
recall on test dataset: 0.9767441860465116
precision on test dataset: 0.9767441860465116
