## Support Vector Machine (SVM)

* Support vector machines are popular because they often achieve good accuracy with comparatively little computation.
* A Support Vector Machine (SVM) can be used for both regression and classification tasks, but it is most widely used for classification.
* The objective of the SVM algorithm is to find a hyperplane in an N-dimensional space (N being the number of features) that distinctly separates the data points of the different classes.

* For linearly separable points, the idea can be pictured as follows:

![Hyperplane1](images/hyperplanes.png)

![Hyperplane2](images/svm_hyperplane.png)


* There can be multiple separating planes, **but SVM selects the one that separates the points as widely as possible, i.e. the one for which the gap between the nearest points and the plane is maximum**.
* The dotted hyperplanes are drawn parallel to the main hyperplane, one on each side, so that each passes through at least one of the nearest points on its side.
* The distance between the two dotted lines is called the marginal distance (the margin).
* If a test point falls between the main hyperplane and a dotted hyperplane, it is assigned to the class on that side of the main hyperplane (i.e. if it lies on the side of the dotted plane nearer the blue points it is classified as blue, otherwise as red).

* To summarize: there are many possible hyperplanes that could separate the two classes of data points. Our objective is to find the plane with the maximum margin, i.e. the maximum distance to the nearest data points of both classes. Maximizing the margin gives some reinforcement, so that future data points can be classified with more confidence. The larger the marginal distance, the better the model tends to generalize (in the case of a linearly separable dataset).


* Hyperplanes are decision boundaries that help classify the data points. Data points falling on either side of the hyperplane can be attributed to different classes. Also, the dimension of the hyperplane depends upon the number of features. If the number of input features is 2, then the hyperplane is just a line. If the number of input features is 3, then the hyperplane becomes a 2D plane. It becomes difficult to imagine when the number of features exceeds 3.

![Hyperplane 2D-3D](images/hyperplane2d3d.png)

* **Support vectors** are the data points that are closest to the hyperplane and influence its position and orientation. So the *points that lie on the dotted lines are the support vectors*. Using these support vectors, we maximize the margin of the classifier. Deleting a support vector will change the position of the hyperplane. These are the points that help us build our SVM.
![Support vectors](images/sv.png)
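
The role of the support vectors can be checked with a small sketch. It assumes scikit-learn's `SVC` with a linear kernel and a very large `C` as a stand-in for a hard margin; the six data points and the helper `refit_without` are made up for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable 2-D data (invented for this example).
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard margin on separable data.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
print("support vectors:\n", clf.support_vectors_)

def refit_without(i):
    """Hypothetical helper: refit the same model with sample i removed."""
    mask = np.arange(len(X)) != i
    return SVC(kernel="linear", C=1e6).fit(X[mask], y[mask])

sv_idx = clf.support_[0]                                   # a support vector
non_sv_idx = [i for i in range(len(X)) if i not in clf.support_][0]

# Dropping a non-support point should leave the hyperplane where it was.
same_after_non_sv = np.allclose(refit_without(non_sv_idx).coef_, clf.coef_, atol=1e-2)
# Dropping a support vector should move it.
moved_after_sv = not np.allclose(refit_without(sv_idx).coef_, clf.coef_, atol=1e-2)
print(same_after_non_sv, moved_after_sv)
```

Only the points on the dotted lines determine the fitted boundary here; the remaining points could be removed without changing it.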

* SVMs are a good fit when the dataset is small, since training time grows quickly with the number of samples.

## Classifying non-linearly separable data points:
* For this we use SVM kernels, which map the data from a low-dimensional space to a higher-dimensional one (for example from a 2D graph to a 3D graph) where the classes can become linearly separable.
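
The 2D-to-3D idea can be sketched as follows. The example assumes scikit-learn's `make_circles` as a stand-in for non-linearly separable data; the extra feature `x1^2 + x2^2` is one hand-picked lift chosen for illustration, not the mapping a particular kernel uses internally.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: no straight line in 2-D separates them.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# A linear SVM in the original 2-D space struggles.
linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)

# Manual lift to 3-D: add the squared distance from the origin as a third feature.
X3 = np.c_[X, (X ** 2).sum(axis=1)]
lifted_acc = SVC(kernel="linear").fit(X3, y).score(X3, y)

# An RBF kernel achieves a similar effect without building the features explicitly.
rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)

print(f"linear 2-D: {linear_acc:.2f}, linear 3-D: {lifted_acc:.2f}, rbf: {rbf_acc:.2f}")
```

In the lifted space the inner ring sits at a low height and the outer ring at a high one, so a flat plane can separate them.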


## Advantages of SVM:
* **SVMs are fairly robust to outliers**, because the boundary depends only on the support vectors.
* It is effective in high-dimensional spaces.
* It is relatively memory efficient, because only the support vectors are needed to make predictions.
* **SVMs are a good choice when little is known about the data**.
* With a suitable kernel it works well even on unstructured and semi-structured data such as text, images and trees, so it **works well with non-linear data**.
* The kernel trick is the real strength of SVM. With an appropriate kernel function, many complex problems can be solved.
* SVM models generalize well in practice; the risk of over-fitting is comparatively low.


## Disadvantages of SVM:
* Training time grows quickly for larger datasets.
* It is difficult to choose a good kernel function.
* The main SVM hyperparameters are the cost C and, for kernels such as RBF, gamma. They are not easy to fine-tune, and it is hard to visualize their impact.
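
One common way to search for C and gamma is a cross-validated grid search. The sketch below assumes scikit-learn's `GridSearchCV`, an RBF kernel, and the synthetic `make_moons` dataset; the grid values are arbitrary starting points, not recommendations.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, non-linearly separable data (illustrative only).
X, y = make_moons(n_samples=300, noise=0.25, random_state=0)

# Scale features first: the RBF kernel is sensitive to feature scale.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# make_pipeline names the SVC step "svc", hence the "svc__" prefix.
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.01, 0.1, 1, 10],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 3))
```

The table in `search.cv_results_` can then be inspected to see how accuracy changes across the grid, which partly makes up for the parameters being hard to visualize.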


## Applications:
* SVM can be applied to many of the use cases for which ANNs are used.
* Intrusion Detection
* Handwriting Recognition

## Practical Implementation:
* https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
* https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html
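
A minimal sketch of how the two estimators linked above might be used. The datasets (`load_iris`, `make_regression`), the train/test split and the hyperparameter values are assumptions picked for illustration.

```python
from sklearn.datasets import load_iris, make_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, SVR

# --- Classification with SVC ---
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_tr, y_tr)
clf_acc = clf.score(X_te, y_te)          # mean accuracy on held-out data
print("SVC test accuracy:", round(clf_acc, 3))

# --- Regression with SVR ---
Xr, yr = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(
    Xr, yr, test_size=0.25, random_state=0
)
reg = make_pipeline(StandardScaler(), SVR(kernel="linear", C=100.0, epsilon=0.1))
reg.fit(Xr_tr, yr_tr)
reg_r2 = reg.score(Xr_te, yr_te)         # R^2 on held-out data
print("SVR test R^2:", round(reg_r2, 3))
```

Both estimators share the same `fit` / `predict` / `score` interface, so switching between classification and regression mostly means swapping the class.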


### Differentiate between Support Vector Machine and Logistic Regression

* Logistic regression also draws a straight line (a linear decision boundary), just as SVM does, but SVM additionally maximizes the marginal distance between the boundary and the nearest points.
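
One way to make the difference concrete is to compare the loss each model assigns to a point as a function of its signed score `y * f(x)`. This framing (hinge loss for a soft-margin SVM, log-loss for logistic regression) is not part of the notes above and is added only as an illustration; the score values are made up.

```python
import numpy as np

# Signed scores y * f(x): negative = misclassified, >= 1 = beyond the margin.
scores = np.array([-1.0, 0.0, 0.5, 1.0, 2.0, 5.0])

# Hinge loss (SVM): exactly zero once a point is on or beyond its marginal plane.
hinge = np.maximum(0.0, 1.0 - scores)

# Log-loss (logistic regression): shrinks but never reaches zero.
logistic = np.log1p(np.exp(-scores))

for s, h, l in zip(scores, hinge, logistic):
    print(f"score={s:5.1f}  hinge={h:.3f}  logistic={l:.3f}")
```

Points beyond the margin contribute nothing to the SVM objective, which is why only the support vectors shape its boundary, whereas every point still contributes a little to the logistic regression fit.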


### Mathematical Intuition behind SVM:
* Let's consider two points, (-5, 0) and (5, 5), and draw a straight line to separate them (we assume a slope of -1 and a y-intercept of 0). The equation of the straight line (hyperplane) is $y = w^T x + c$;

    here w is the slope (it is transposed so that the matrix multiplication works out), and we evaluate the expression for the point (-5, 0):
![svm_maths.jpg](attachment:svm_maths.jpg)


    For any point below the (blue) line, the computed y value will always be positive.

* For (5, 5) the same w is used, but the point is now [5 5]; multiplying gives -5. So on this side of the line the values will always be negative.
* These two sides can now be treated as different classes: points giving a positive value belong to the +1 class and points giving a negative value belong to the -1 class.
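
The sign rule can be sketched in a few lines. The example assumes the line y = -x is written in the form w^T x + c = 0 with w = (-1, -1) and c = 0; this particular w is an assumption chosen so that points below the line score positive, so the magnitudes it produces differ from the hand calculation in the figure. Only the signs are being illustrated. The function name `side` is hypothetical.

```python
import numpy as np

# Assumed parameters for the line y = -x, i.e. -x1 - x2 + 0 = 0.
w = np.array([-1.0, -1.0])
c = 0.0

def side(point):
    """Return +1 if w^T x + c is positive, otherwise -1.

    Points lying exactly on the line are assigned -1 here (an arbitrary choice).
    """
    score = w @ np.asarray(point, dtype=float) + c
    return 1 if score > 0 else -1

label_a = side((-5, 0))   # point below the line
label_b = side((5, 5))    # point above the line
print(label_a, label_b)
```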

* Say that in the graph below the points below the line are positive and those above it are negative. We find the points nearest to the hyperplane and draw a marginal plane on each side (the dotted lines below). The plane above the hyperplane is the negative plane, with equation $w^T x + c = -1$, and the plane below it is the positive plane, with equation $w^T x + c = 1$.
![svm_graph.png](attachment:svm_graph.png)

* In SVM, the hyperplane with the largest marginal distance is the one that is chosen, so we have to compute the distance between the two marginal planes. Say the selected points are $x_1$ and $x_2$, lying on the negative and positive plane respectively. We have to compute $x_2 - x_1$.

    Equation for $x_1$: $w^T x_1 + c = -1$ (i), where $T$ denotes the transpose

    Equation for $x_2$: $w^T x_2 + c = 1$ (ii)

    Subtracting (i) from (ii) gives

    $$w^T (x_2 - x_1) = 2$$

    $w^T$ is the transpose of the slope vector. It has a direction as well as a magnitude, so it cannot simply be cancelled; instead we divide both sides by the norm of $w$, $\lVert w \rVert$, which leaves only the unit direction of $w$ on the left:

    $$\frac{w^T}{\lVert w \rVert} (x_2 - x_1) = \frac{2}{\lVert w \rVert}$$

    The left-hand side is the distance between the two marginal planes, so $\frac{2}{\lVert w \rVert}$ is our optimization function. We have to maximize it subject to the conditions

    $$y_i = \begin{cases} +1 & \text{if } w^T x_i + c \ge 1 \\ -1 & \text{if } w^T x_i + c \le -1 \end{cases}$$

    Maximizing $\frac{2}{\lVert w \rVert}$ is the same as minimizing $\frac{\lVert w \rVert}{2}$. We convert it to a minimization because optimization is usually set up as minimizing a quantity, just as in linear regression, where gradient descent uses derivatives to minimize the error.
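
The quantity 2/||w|| can be checked numerically. This sketch assumes scikit-learn's `SVC` with a linear kernel and a very large `C` to approximate the hard-margin problem; the six data points are invented for illustration, and `c` is used for the intercept to match the notation above.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data with labels -1 / +1 (invented for this example).
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)
w = clf.coef_[0]         # learned weight vector
c = clf.intercept_[0]    # learned intercept

# Marginal distance between the two dotted planes.
margin = 2.0 / np.linalg.norm(w)

# Support vectors lie on the marginal planes, so w^T x + c is close to -1 or +1.
sv_scores = clf.support_vectors_ @ w + c

print("w =", w, " c =", c)
print("margin 2/||w|| =", margin)   # analytically 5 / sqrt(2), about 3.54, for these points
print("scores of support vectors:", sv_scores)
```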
        
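* To handle data that cannot be separated perfectly, two more terms are introduced into the objective:

    $$\min \frac{\lVert w \rVert}{2} + C \sum_{i=1}^{n} \zeta_i$$

    * $C$ sets how much error is tolerated (it is written as a capital letter here to distinguish it from the intercept $c$). With a small $C$, a few misclassified points (say 5) will not change the line; with a large $C$ every error is penalized heavily, so fewer are allowed.

    * $\zeta_i$ (zeta) is the value of the error for point $i$: how far the point lies on the wrong side of its marginal plane. The errors of all points are summed.

    * $C$ is also called the regularization parameter; its value is found through hyperparameter tuning.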
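
The trade-off controlled by the regularization parameter can be sketched as follows. The example assumes scikit-learn's linear `SVC`, which minimizes the squared norm of w (divided by 2) plus C times the summed errors, a slightly different but closely related form of the objective above. The overlapping `make_blobs` data and the helper `fit_and_measure` are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters, relabelled to -1 / +1.
X, y01 = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)
y = np.where(y01 == 1, 1, -1)

def fit_and_measure(C):
    """Hypothetical helper: fit a linear SVM and return (margin, total error)."""
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w, c = clf.coef_[0], clf.intercept_[0]
    # Error of each point: zero if it respects the margin, positive otherwise.
    zeta = np.maximum(0.0, 1.0 - y * (X @ w + c))
    return 2.0 / np.linalg.norm(w), zeta.sum()

margin_small_C, error_small_C = fit_and_measure(0.01)
margin_large_C, error_large_C = fit_and_measure(100.0)

print(f"C=0.01: margin={margin_small_C:.2f}, summed error={error_small_C:.2f}")
print(f"C=100 : margin={margin_large_C:.2f}, summed error={error_large_C:.2f}")
```

A small value tolerates more summed error in exchange for a wider margin, while a large value narrows the margin to reduce the error.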


* In the real world, data is mostly not linearly separable and there is a lot of overlap; for that we use the SVM kernel trick.

* For reference: 
    * https://www.youtube.com/watch?v=Js3GLb1xPhc&list=PLZoTAELRMXVPBTrWtJkn3wWQxZkmTXGwe&index=84
    * https://www.baeldung.com/cs/svm-hard-margin-vs-soft-margin