### Support Vector Machines

You are already familiar with one well-known classification model: logistic regression. Although this model is widely used in industries such as banking and e-commerce, it finds limited use in more complex classification problems, such as image classification.

SVMs, as you will see, are capable of dealing with quite complex problems, where models such as logistic regression typically fail. SVMs have been extensively used for solving complex classification problems such as image recognition, voice detection etc.

SVMs are mostly used for classification tasks, but they can also be used for regression. The topics covered here are limited to classification tasks. 

Support Vector Machines, or SVMs, are a class of extremely popular classification models. Besides their ability to solve complex machine learning problems, they have other advantages over other classification models, such as the ability to deal with computationally heavy data sets and to classify nonlinearly separable data.

It is important to remember that SVMs belong to the class of linear machine learning models (logistic regression is also a linear model).

A linear model uses a linear function (i.e. of the form y = ax +b) to model the relationship between the input x and output y. For example, in logistic regression, the log(odds) of an outcome (say, defaulting on a credit card) is linearly related to the attributes x1, x2, etc. Similarly, SVMs are also linear models. You will learn to use SVMs for classification problems.

![image.png](attachment:image.png)

SVMs are linear models that require numeric attributes. In case the attributes are non-numeric, you need to convert them to a numeric form in the data preparation stage.

If your data looks like the figure below, SVMs can handle it easily, and this is what distinguishes SVMs from logistic regression.

![image.png](attachment:image.png)

### Concept of Hyperplane in 2D

Hyperplane: a boundary that separates the classes in a data set (for example, spam emails from ham ones). It could be a line, a 2D plane, or even an n-dimensional plane that is beyond our imagination.

In two dimensions, the line that separates one class from another is the hyperplane. In fact, it is the model you're trying to build, as shown in the figure below:

![image.png](attachment:image.png)

The standard equation of a line is given by ax + by + c = 0. You could generalise it as W0 + W1x1 + W2x2=0, where x1 and x2 are the features — such as 'word_freq_technology' and 'word_freq_money' — and W1 and W2 are the coefficients.

A positive value (blue points in the plot above) would mean that the set of values of the features is in one class; however, a negative value (red points in the plot above) would imply it belongs to the other class. A value of zero would imply that the point lies on the line (hyperplane) because any point on the line will satisfy the equation: W0 + W1x1 + W2x2=0.
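The sign rule described above can be sketched in a few lines of Python. The coefficients and the sample points here are hypothetical values chosen only for illustration; they are not taken from the figure or from any fitted model.

```python
def side_of_line(w0, w1, w2, x1, x2):
    """Evaluate W0 + W1*x1 + W2*x2 and report which side of the line the point is on."""
    value = w0 + w1 * x1 + w2 * x2
    if value > 0:
        return "positive class"
    if value < 0:
        return "negative class"
    return "on the line"


# Hypothetical coefficients (an assumption): the line x1 + x2 - 4 = 0.
W0, W1, W2 = -4.0, 1.0, 1.0

# Each pair is (word_freq_technology, word_freq_money) for a made-up email.
for point in [(3.0, 3.0), (1.0, 1.0), (2.0, 2.0)]:
    print(point, side_of_line(W0, W1, W2, *point))
```

The first point gives a positive value, the second a negative value, and the third satisfies the equation exactly, so it lies on the line.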

### Concept of Hyperplane in 3D

In 3 dimensions (refer to figure below), the hyperplane (light orange) will be a plane with an expression of ax+by+cz+d = 0. The plane divides the data set into two halves. Data points above the plane represent one class (red), while data points below the plane represent the other class (blue).

![image.png](attachment:image.png)

In general, if the hyperplane is built from d attributes in d-dimensional space, the expression can be written as follows:

![image-2.png](attachment:image-2.png)

The model denoted by the expression given above is called a linear discriminator. Similar to the 2D and 3D expressions, a d-dimensional hyperplane also follows the general rule: all points above the plane will yield a value greater than 0, and those below it will yield a value less than 0 when plugged into the expression of the hyperplane.

![image-3.png](attachment:image-3.png)
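As a rough sketch of the same rule for any number of attributes, the function below assumes the hyperplane has the form W0 + W1x1 + ... + Wdxd = 0, extending the 2D notation used earlier. The function names, weights, and feature values are hypothetical and serve only to illustrate the sign check.

```python
def linear_discriminant(w0, weights, features):
    """Return W0 + sum(Wi * xi) for one observation with d attributes."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return w0 + sum(w * x for w, x in zip(weights, features))


def classify(w0, weights, features):
    """Return +1 above the hyperplane, -1 below it, and 0 exactly on it."""
    value = linear_discriminant(w0, weights, features)
    if value > 0:
        return 1
    if value < 0:
        return -1
    return 0


# Hypothetical 4-attribute hyperplane (an assumption, not a fitted model).
w0 = -1.0
weights = [0.5, -1.0, 2.0, 0.0]

print(classify(w0, weights, [2.0, 1.0, 1.0, 5.0]))  # value is 1.0, so +1
print(classify(w0, weights, [0.0, 1.0, 0.0, 3.0]))  # value is -2.0, so -1
```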

### Maximal Margin Classifier

There can be multiple lines (hyperplanes) that perfectly separate the two classes, as shown in the figure below. The best line is the one that maintains the largest possible equal distance from the nearest points of both classes. In other words, for the separator to be optimal, the margin (the distance from the separator to the nearest point) should be maximum. This is called the Maximal Margin Classifier.

![image.png](attachment:image.png)

In the figure above, the third line (hyperplane) should be considered the maximal margin classifier.
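To make the idea of "largest margin" concrete, the sketch below compares a few candidate lines by the distance from each line to its nearest point. The points, labels, and candidate lines are made up for illustration and do not correspond to the figure. The code only ranks the given candidates; it does not search for the true maximal margin hyperplane.

```python
import math


def margin(w0, w, points, labels):
    """Smallest signed distance from any point to the line w0 + w.x = 0.

    The distance is positive when a point is on the correct side for its
    label (+1 or -1), so a negative result means the line misclassifies
    at least one point.
    """
    norm = math.sqrt(sum(c * c for c in w))
    distances = []
    for point, label in zip(points, labels):
        value = w0 + sum(wi * xi for wi, xi in zip(w, point))
        distances.append(label * value / norm)
    return min(distances)


# Hypothetical, linearly separable toy data (an assumption).
points = [(3.0, 3.0), (4.0, 4.0), (4.0, 2.0), (1.0, 1.0), (0.0, 2.0), (1.0, 0.0)]
labels = [1, 1, 1, -1, -1, -1]

# Three hypothetical candidate lines, each stored as (w0, (w1, w2)).
candidates = {
    "A": (-4.0, (1.0, 1.0)),  # x1 + x2 - 4 = 0
    "B": (-2.0, (1.0, 0.0)),  # x1 - 2 = 0
    "C": (-5.0, (1.0, 1.0)),  # x1 + x2 - 5 = 0
}

margins = {name: margin(w0, w, points, labels) for name, (w0, w) in candidates.items()}
best = max(margins, key=margins.get)
print(margins)
print("Largest margin among the candidates:", best)
```

All three lines separate the toy data perfectly, but line A keeps the nearest points furthest away, so it is the preferred separator among these candidates.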

The maximal margin line (hyperplane), although it separates the two classes perfectly, is very sensitive to the training data. This means that the Maximal Margin Classifier will perform perfectly on the training data set. But on the unseen data, it may perform poorly. Also, there are cases where the classes cannot be perfectly separated. Thus, the soft margin classifier helps in solving this problem.

### Soft Margin Classifier

The Support Vector Classifier essentially allows certain points to be deliberately misclassified. By doing this, it is able to classify most of the points correctly in the unseen data and is also more robust. 

The Support Vector Classifier is also called the Soft Margin Classifier because instead of searching for the margin that exactly classifies each and every data point to the correct class, the Soft Margin Classifier allows some observations to fall on the wrong side. Only the points that lie close to the hyperplane are considered when constructing it, and those points are called support vectors.

The Support Vector Classifier works well when the data is partially intermingled (i.e. the data can be classified with minimal misclassifications). But what if the distribution looks completely intermingled and follows some pattern, something like the circular distribution of labels (+ and -) shown in the figure below?

![image.png](attachment:image.png)

Obviously, the Support Vector Classifier can't classify the data above correctly, because it divides the data set into two halves, which misclassifies a lot of data points. But it doesn't mean that this problem cannot be solved. There is a way to solve such problems, which we will learn later.

Like the Maximal Margin Classifier, the Support Vector Classifier also maximises the margin; but the margin, here, will allow some points to be misclassified, as shown in figure below.

![image-2.png](attachment:image-2.png)

To select the best-fit Support Vector Classifier, you can compare classifiers using slack variables, denoted by epsilon (ϵ). A slack variable is used to control misclassifications: it tells you where an observation is located relative to the margin and the hyperplane.

Each observation falls into one of three cases. Suppose you draw a Support Vector Classifier in such a way that it doesn't allow any misclassification, i.e. epsilon (ϵ) = 0; then each observation is on the correct side of the margin, as shown in the figure below.

![image-3.png](attachment:image-3.png)

But if an observation only violates the margin, i.e. 0 < epsilon (ϵ) < 1, it is still classified correctly, as shown in the figure below.

![image-4.png](attachment:image-4.png)

But if the data points violate the hyperplane, i.e. Epsilon(ϵ) > 1, then the observation is on the wrong side of the hyperplane, as shown in figure below.

![image-5.png](attachment:image-5.png)

So you can see that:

* Each data point has a slack value associated to it, according to where the point is located.
* The value of slack lies between 0 and +infinity.

Lower values of slack are better than higher values: slack = 0 implies a correct classification with no margin violation, slack between 0 and 1 implies a correct classification that violates the margin, and slack > 1 implies an incorrect classification.
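The three cases can be reproduced numerically. The sketch below assumes one common way of defining slack: ϵ = max(0, 1 - distance / M), where distance is the signed distance of a point from the hyperplane (positive on the correct side) and M is the margin width. The hyperplane, the margin, and the points are hypothetical values chosen for illustration.

```python
import math


def slack(w0, w, margin, point, label):
    """Slack of one observation, assuming epsilon = max(0, 1 - distance / margin)."""
    norm = math.sqrt(sum(c * c for c in w))
    value = w0 + sum(wi * xi for wi, xi in zip(w, point))
    signed_distance = label * value / norm  # positive on the correct side
    return max(0.0, 1.0 - signed_distance / margin)


def describe(eps):
    """Map a slack value to the case it represents."""
    if eps == 0:
        return "correct side of the margin"
    if eps < 1:
        return "violates the margin, correctly classified"
    if eps == 1:
        return "exactly on the hyperplane"
    return "wrong side of the hyperplane"


# Hypothetical hyperplane x1 - 2 = 0 with margin width 1 (assumptions).
w0, w, M = -2.0, (1.0, 0.0), 1.0

# Three made-up observations, all labelled +1.
for point in [(4.0, 1.0), (2.5, 0.0), (1.5, 0.0)]:
    eps = slack(w0, w, M, point, 1)
    print(point, eps, describe(eps))
```

The three points produce slack values of 0, 0.5, and 1.5, one for each case described above.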


### Cost of Misclassification

The cost of misclassification, denoted by cost or 'C', is an upper bound on the total slack: the summation of the epsilons of all the data points must be less than or equal to C.

![image.png](attachment:image.png)

Once you understand the notion of the slack variable, you can easily compare two Support Vector Classifiers. You can measure the summation of all the epsilons (ϵ) for both hyperplanes and choose the one that gives you the least sum of epsilons (ϵ). This sum is what the cost 'C' bounds from above.
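The comparison can be sketched as follows. As an assumption, slack is computed as ϵ = max(0, 1 - distance / M), with distance being the signed distance of a point from the hyperplane and M the margin width. The data, the two classifiers P and Q, and the budget C are hypothetical values chosen for illustration.

```python
import math


def slack(w0, w, margin, point, label):
    """Slack of one observation, assuming epsilon = max(0, 1 - distance / margin)."""
    norm = math.sqrt(sum(c * c for c in w))
    value = w0 + sum(wi * xi for wi, xi in zip(w, point))
    return max(0.0, 1.0 - (label * value / norm) / margin)


def total_slack(w0, w, margin, points, labels):
    """Sum of the slack values over all observations."""
    return sum(slack(w0, w, margin, p, y) for p, y in zip(points, labels))


# Hypothetical, partially intermingled toy data (an assumption).
points = [(3.0, 3.0), (4.0, 2.0), (1.5, 2.0), (1.0, 1.0), (0.0, 2.0), (1.5, 1.0)]
labels = [1, 1, 1, -1, -1, -1]

# Two hypothetical classifiers, each stored as (w0, (w1, w2), margin width).
classifiers = {
    "P": (-2.0, (1.0, 0.0), 1.0),  # x1 - 2 = 0
    "Q": (-3.0, (1.0, 0.0), 1.0),  # x1 - 3 = 0
}

C = 2.5  # hypothetical budget on the total slack

totals = {name: total_slack(w0, w, m, points, labels)
          for name, (w0, w, m) in classifiers.items()}
best = min(totals, key=totals.get)
within_budget = {name: total <= C for name, total in totals.items()}

print(totals)
print("Lowest total slack:", best)
print("Within budget C:", within_budget)
```

Classifier P has the lower total slack and stays within the budget, while Q exceeds it, so P would be preferred under these assumptions.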

When C is large, the slack variables can be large, i.e. you allow a larger number of data points to be misclassified or to violate the margin. So you get a hyperplane where the margin is wide and misclassifications are allowed. In this case, the model is more tolerant of violations, more generalisable, and less likely to overfit. In other words, it has a high bias.

On the other hand, when C is small, you force the individual slack variables to be small, i.e. you do not allow many data points to fall on the wrong side of the margin or the hyperplane. So, the margin is narrow and there are few misclassifications. In this case, the model is less tolerant of violations, less generalisable, and more likely to overfit. In other words, it has a high variance.

Note that this describes C as a budget on the total slack. Some libraries, such as scikit-learn, instead use C as a penalty on slack, in which case the effect is reversed: a large C gives a narrow margin.