# PHYS555 - Machine Learning Project - Q1
Karlee Zammit - V00823093

## Describe how SVM algorithms can be used for classification and regression problems (describe the algorithms). Which parameters are the most important ones in the models for classification and regression (e.g., for fitting and controlling overfitting...)? What is the difference between classification and regression algorithms in SVM?

# Support Vector Machine 

A support vector machine (SVM) is a supervised learning model that analyzes data for classification and regression tasks. At a high level, the SVM algorithm maximizes a particular mathematical function with respect to a given collection of data. As discussed in Noble (2006), there are four concepts essential to the SVM algorithm: 

1. The separating hyperplane

2. The maximum-margin hyperplane

3. The soft margin 

4. The kernel function

I will discuss these below for a binary classification example.

### The Separating Hyperplane


For an imaginary dataset like the one in the left panel of the Figure below (titled "2D Hyperplane"), a separating line can be drawn through the data. A future observation can then be classified as purple or orange depending on which side of the line it falls. This separating line is called the separating hyperplane. The idea extends to higher dimensions, with a 3-dimensional example provided in the right panel of the Figure below (titled "3D Hyperplane").


<div>
<img src="SVM-hyperplanes.png" width="500"/>
</div>
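As a minimal sketch of this idea, the code below fits a linear SVM to a small made-up 2D dataset and classifies two new points by checking which side of the hyperplane they fall on. The data values and the helper `classify` are my own illustrative assumptions and do not correspond to the Figure.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two well-separated clusters in 2D.
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],   # class 0 ("purple")
              [5.0, 5.0], [5.5, 6.0], [6.0, 5.5]])  # class 1 ("orange")
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear")
clf.fit(X, y)

# The separating hyperplane is the set of points where w . x + b = 0.
w = clf.coef_[0]
b = clf.intercept_[0]

def classify(point):
    """Hypothetical helper: class 1 if the point is on the positive side."""
    return int(np.dot(w, point) + b > 0)

# Two future observations, one near each cluster.
new_points = np.array([[1.2, 1.8], [5.8, 5.2]])
predictions = clf.predict(new_points)
print(predictions, [classify(p) for p in new_points])
```

The manual check with `w` and `b` should agree with `clf.predict`, since for a linear kernel the prediction is just the sign of `w . x + b`.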

### The Maximum-Margin Hyperplane

In a 1D example as shown in the Figure below, the "maximum-margin" hyperplane is located at the position that maximizes its distance from the nearest points of each of the two classes. The training points closest to the hyperplane, which determine its position, are called support vectors. If you were to move the hyperplane closer to one class, it would no longer be the maximum distance away and would therefore have a higher chance of misclassifying a future observation from that class. For perfectly separable data like in the Figure below, the maximum-margin criterion can be used to determine the optimal hyperplane location.

<div>
<img src="SVM-margin-max.png" width="400"/>
</div>
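The 1D case can be sketched in code as follows. The data are made up for illustration (they are not the values in the Figure), and a large `C` is used as an assumption to approximate a hard margin. For a linear SVM the margin width is 2 / ||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical 1D data, stored as a single-column array.
X = np.array([[1.0], [2.0], [3.0], [7.0], [8.0], [9.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A large C approximates a hard (no misclassification) margin.
clf = SVC(kernel="linear", C=1000).fit(X, y)

w = clf.coef_[0, 0]
b = clf.intercept_[0]

boundary = -b / w                 # point where w * x + b = 0
margin_width = 2.0 / abs(w)       # distance between the two margin edges
support_vectors = sorted(clf.support_vectors_.ravel())

print(boundary, margin_width, support_vectors)
```

For this dataset the boundary should sit halfway between the closest points of the two classes (x = 3 and x = 7), which are the support vectors.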

### The Soft Margin Hyperplane

But what if the data were not perfectly separable, as shown in the Figure below? It is then better to allow some misclassifications of the training data so that future observations can be predicted more accurately (i.e., to avoid overfitting). This is an example of the tradeoff between bias and variance, a common theme in machine learning algorithms. How many misclassifications to allow, and therefore where the soft margin lies, is typically chosen by trial and error using cross-validation.

<div>
<img src="SVM-margin-soft.png" width="350"/>
</div>
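The trial-and-error search by cross-validation could look roughly like the sketch below. The overlapping dataset, the candidate values of `C`, and the choice of 5 folds are all assumptions made for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical overlapping classes, so no perfect separating line exists.
X, y = make_blobs(n_samples=200, centers=[[0, 0], [3, 3]],
                  cluster_std=1.5, random_state=0)

# Candidate soft-margin strengths (illustrative values only).
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}

# 5-fold cross-validation: each C is scored on data not used for fitting.
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=5)
search.fit(X, y)

best_C = search.best_params_["C"]
best_score = search.best_score_
print(best_C, best_score)
```

The selected value depends on the data, so a different dataset or grid would generally give a different `best_C`.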

### The Kernel Function

Sometimes the data are too complex to be handled by a soft margin alone. For example, in the top panel of the Figure below, no linear boundary can separate the two classes from one another. The kernel function provides a solution by mapping the data into a higher-dimensional space. In this example, squaring the original values introduces a new dimension in which a straight line can separate the classes. It can be proven that for any consistently labelled data set (one with no identical observations carrying opposite labels), there exists a kernel function that allows the data to be linearly separated (Noble, 2006). The curse of dimensionality must be kept in mind, however: complex data can be projected into higher and higher dimensions, but the number of possible solutions increases exponentially, which makes it harder to select one that generalizes and increases the risk of overfitting. 

<div>
<img src="SVM-kernel.png" width="400"/>
</div>
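The squaring example can be sketched with made-up 1D data in which one class sits between two groups of the other class. The values below are assumptions for illustration, and the extra dimension is added by hand rather than through a kernel so that the effect is visible.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical 1D data: class 0 in the middle, class 1 on both sides.
x = np.array([-4.0, -3.0, -2.5, -1.0, -0.5, 0.0, 0.5, 1.0, 2.5, 3.0, 4.0])
y = (np.abs(x) > 2).astype(int)

# Original 1D representation, and the same data with x**2 as a second dimension.
X_1d = x.reshape(-1, 1)
X_2d = np.column_stack([x, x ** 2])

# A linear SVM cannot separate the classes in 1D...
acc_1d = SVC(kernel="linear").fit(X_1d, y).score(X_1d, y)
# ...but it can once the squared values are added.
acc_2d = SVC(kernel="linear").fit(X_2d, y).score(X_2d, y)

print(acc_1d, acc_2d)
```

In practice the mapping is not built explicitly. A kernel such as `kernel="poly"` with `degree=2` in scikit-learn performs a comparable mapping implicitly.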

# Avoiding Overfitting

The two goals of SVM are: 
- Increase the distance of decision boundary to classes (or support vectors)
- Maximize the number of points that are correctly classified in the training set

For SVM, the most important parameters for avoiding overfitting are "C" and "Gamma". 

## C 

C sets the penalty for each misclassified training point, i.e., it tells the SVM optimization how strongly to avoid misclassifying each training example. If C is small, the penalty is small, so a decision boundary with a large margin is chosen at the expense of more misclassified points. If C is large, the SVM tries to minimize the number of misclassified examples, which results in a decision boundary with a smaller margin and a higher risk of overfitting. 
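To illustrate the effect of C on the margin, the sketch below fits a linear SVM with a small and a large C on the same made-up overlapping data and compares the margin widths (2 / ||w||). The dataset and the two C values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Hypothetical overlapping classes.
X, y = make_blobs(n_samples=100, centers=[[0, 0], [3, 3]],
                  cluster_std=1.5, random_state=0)

results = {}
for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    results[C] = {
        # Margin width of a linear SVM is 2 / ||w||.
        "margin_width": 2.0 / np.linalg.norm(clf.coef_[0]),
        "n_support_vectors": int(clf.n_support_.sum()),
        "train_accuracy": clf.score(X, y),
    }

for C, r in results.items():
    print(C, r)
```

The small C should give the wider margin, since the optimizer is allowed to trade training errors for margin width.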

## Gamma

For the RBF kernel, gamma controls how far the influence of a single training point reaches. Low values of gamma result in a large similarity radius, so points that are far apart are still grouped together. High values of gamma mean that points need to be very close to each other to be considered part of the same group or class. Large gamma values therefore tend to lead to overfitting. 
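A rough way to see this is to compare training and test accuracy for several gamma values. The sketch below uses scikit-learn's `make_moons` data, and the gamma values, noise level, and train/test split are assumptions for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical noisy, non-linearly separable data.
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

scores = {}
for gamma in (0.01, 1.0, 100.0):
    clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for each gamma.
    scores[gamma] = (clf.score(X_train, y_train), clf.score(X_test, y_test))

for gamma, (train_acc, test_acc) in scores.items():
    print(f"gamma={gamma}: train={train_acc:.2f}, test={test_acc:.2f}")
```

A training accuracy that is much higher than the test accuracy at large gamma would be the signature of overfitting described above.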

# Difference between Classification and Regression Algorithms



# References 

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

Noble, W. (2006). What is a support vector machine? Nat Biotechnol 24, 1565–1567. https://doi.org/10.1038/nbt1206-1565

https://en.wikipedia.org/wiki/Support_vector_machine

Scikit-Learn Documentation: 

https://scikit-learn.org/stable/modules/svm.html#svm-regression

https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html



# Appendix

## Sklearn Function Documentation Explanations

### Classification

https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

`class sklearn.svm.SVC(*, C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', break_ties=False, random_state=None)`

1. C: regularization parameter
2. kernel: kernel type to be used in the algorithm, rbf is default
3. degree: only used in poly kernel, degree of polynomial
4. gamma: 'scale', 'auto', or a float; kernel coefficient for rbf, poly, and sigmoid 
5. coef0: only used in poly or sigmoid, independent term in kernel function
6. shrinking: whether to use shrinking heuristic. default true
7. probability: enable probability estimates, much slower
8. tol: tolerance for stopping criterion
9. cache_size: kernel cache size
10. class_weight: can use this to balance unbalanced data
11. verbose: enable verbose output
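12. max_iter: hard limit on solver iterations, -1 for no limit
13. decision_function_shape: one-vs-one ('ovo') or one-vs-rest ('ovr'); ovr is constructed from ovo output
14. break_ties: if true, ties in predict are broken using the decision function confidence values (only for ovr with more than two classes); uses lots of computational resources 
15. random_state: controls random number generation for the probability estimates, ignored when probability is false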


### Regression

https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html

`class sklearn.svm.SVR(*, kernel='rbf', degree=3, gamma='scale', coef0=0.0, tol=0.001, C=1.0, epsilon=0.1, shrinking=True, cache_size=200, verbose=False, max_iter=-1)`

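The descriptions shared with SVC are the same as above and are not repeated here. SVR adds one parameter, epsilon, which defines the tube around the prediction within which points incur no penalty in the training loss. The following parameters are specific to classification and are not present in SVR: 

1. No probability (and therefore no random_state)
2. No class_weight 
3. No decision_function_shape
4. No break_ties 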
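
As a minimal sketch of how the two estimators are used side by side, the code below fits `SVC` to discrete labels and `SVR` to a continuous target on the same made-up inputs, then compares their parameter names with `get_params()`. The data and variable names are assumptions for illustration, and the exact parameter sets may vary between scikit-learn versions.

```python
import numpy as np
from sklearn.svm import SVC, SVR

rng = np.random.default_rng(0)

# Hypothetical 1D inputs shared by both tasks.
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)

# Classification target: discrete class labels.
y_class = (X.ravel() > 2.5).astype(int)
# Regression target: continuous values (noisy sine curve).
y_reg = np.sin(X.ravel()) + rng.normal(0, 0.1, size=80)

clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y_class)
reg = SVR(kernel="rbf", C=1.0, gamma="scale", epsilon=0.1).fit(X, y_reg)

class_pred = clf.predict(X[:5])   # class labels
reg_pred = reg.predict(X[:5])     # real-valued predictions

# Parameters that exist in one estimator but not the other.
svc_only = set(SVC().get_params()) - set(SVR().get_params())
svr_only = set(SVR().get_params()) - set(SVC().get_params())
print(sorted(svc_only), sorted(svr_only))
```

The classifier returns one of the training labels for each input, while the regressor returns a real number.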
