# Intuition

The idea of SVR is to fit a hyperplane that captures as many instances as possible within a small tolerance while limiting the deviation of the remaining instances.
Imagine we are trying to predict housing prices from several features. SVR tries to find a function whose predictions are at most ε away from the actual price for as many points as possible. This can be visualized as a tube of radius ε fitted around the function. Instances that fall inside the ε tube are not counted as errors, while instances outside the tube are penalized by how far they lie beyond it.  
SVR keeps the function as flat as possible while penalizing the points outside the tube, which is how it controls error. The support vectors are the points that lie on the boundary of the tube or outside it. These are the points our model is most "sensitive" to, since they alone determine the fitted function and therefore where the tube sits; the width of the tube itself is set by ε. SVR can also use kernels, which implicitly map the features into a higher-dimensional space so that non-linear relationships between features and target can be modeled.
The intuition is to find a function that fits as many instances as possible within a tolerance (ε) while minimizing the deviations of the remaining data points.

**SVR Advantages**
- Effective in high-dimensional spaces
- Memory efficient: the fitted model uses only a subset of the data points (the support vectors)
- Versatile: can model linear and non-linear relationships depending on the kernel we use

# Mathematics

As stated previously, we are looking for a function f(x) that keeps as many instances as possible inside our tube, limits the deviation of the rest, and remains as flat as possible. We can think of this as trying to find a line or hyperplane whose prediction is at most ε away from the actual target value.  

Essentially, we can write this as:  
|y - f(x)| ≤ ε  

This emphasizes that our model is insensitive to errors inside the ε tube: deviations smaller than ε cost nothing. The tube plays a role similar to the margin in SVM classification (the distance between the closest data points and the decision boundary). Because ε is the maximum deviation from the target that we tolerate, we are explicitly allowing for some error, which makes the model more resistant to noise. Note that ε sets the width of the tube directly and does not appear in the form of f(x), although changing ε can change which points end up as support vectors.

Some errors will be larger than ε, so for those points we introduce slack variables. The slack variables measure how far a point deviates outside of the tube.  
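The tube and the deviation outside it can be sketched as an ε-insensitive loss. This is a minimal illustration; the function name and the example prices are made up for this note.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Return max(0, |y_true - y_pred| - epsilon) for a single prediction."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


# Hypothetical house prices in thousands, with a tolerance of 10.
print(epsilon_insensitive_loss(300.0, 295.0, 10.0))  # 0.0  (inside the tube)
print(epsilon_insensitive_loss(300.0, 320.0, 10.0))  # 10.0 (10 beyond the tube)
```

A prediction that misses by less than ε costs nothing, and beyond ε the cost grows linearly with the distance outside the tube.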

## Linear Algebra and Analytical Geometry

For linear SVR, our hyperplane can be defined as `f(x) = w^T x + b`, where `w` is the weight vector, `x` is the input vector, and `b` is the bias term. The objective is to minimize `(1/2) * ||w||^2 + C * Σ(ξ + ξ*)`, where `C > 0` controls the trade-off between flatness and the penalty on points outside the tube.

The term `||w||^2` is the squared Euclidean norm (L2 norm) of our weight vector, which measures the flatness of our function: the smaller it is, the flatter the function. The Euclidean norm of a vector `v` is defined as `sqrt(v1^2 + v2^2 + ... + vn^2)`, so `||w||^2` is simply `w1^2 + w2^2 + ... + wn^2`.
The terms `ξ` and `ξ*` are slack variables that measure the deviation above and below the ε tube, respectively. The sum `Σ(ξ + ξ*)` is the sum of these slack variables over all training instances.
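To make the two terms of the objective concrete, here is a small sketch that evaluates it for given values. The function name and the numbers are illustrative assumptions, not part of any library.

```python
def svr_primal_objective(w, xi, xi_star, C):
    """Evaluate 0.5 * ||w||^2 + C * sum(xi + xi_star)."""
    squared_norm = sum(w_j ** 2 for w_j in w)  # w1^2 + w2^2 + ... + wn^2
    slack_total = sum(xi) + sum(xi_star)       # total deviation outside the tube
    return 0.5 * squared_norm + C * slack_total


w = [3.0, 4.0]             # ||w|| = 5, so ||w||^2 = 25
xi = [0.0, 0.5, 0.0]       # deviations above the tube, one per instance
xi_star = [0.25, 0.0, 0.0] # deviations below the tube, one per instance
value = svr_primal_objective(w, xi, xi_star, C=2.0)
print(value)  # 0.5 * 25 + 2.0 * 0.75 = 14.0
```

A larger `C` makes points outside the tube more expensive relative to flatness, and a smaller `C` does the opposite.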

The constraints of the problem can be written as:


y - w^T x - b ≤ ε + ξ  
w^T x + b - y ≤ ε + ξ*  
ξ, ξ* ≥ 0  


These are linear inequalities, and they ensure that the residuals (the differences between the actual values and the predictions) are at most ε in absolute value, apart from any excess absorbed by the slack variables `ξ` and `ξ*`.  
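For a fixed `w` and `b`, the smallest slack values that satisfy these constraints can be computed directly. The sketch below does that and then checks each inequality; the helper name and the toy data are assumptions for illustration.

```python
def slack_variables(X, y, w, b, epsilon):
    """Smallest xi, xi_star that satisfy the SVR constraints for a given w, b."""
    xi, xi_star = [], []
    for x_i, y_i in zip(X, y):
        f_i = sum(w_j * x_ij for w_j, x_ij in zip(w, x_i)) + b
        xi.append(max(0.0, y_i - f_i - epsilon))       # target above the tube
        xi_star.append(max(0.0, f_i - y_i - epsilon))  # target below the tube
    return xi, xi_star


X = [[1.0], [2.0], [3.0]]
y = [2.0, 5.0, 5.0]
w, b, epsilon = [2.0], 0.0, 0.5   # predictions are 2.0, 4.0, 6.0
xi, xi_star = slack_variables(X, y, w, b, epsilon)
print(xi, xi_star)  # [0.0, 0.5, 0.0] [0.0, 0.0, 0.5]

# Every constraint holds with these slack values.
for x_i, y_i, s, s_star in zip(X, y, xi, xi_star):
    f_i = w[0] * x_i[0] + b
    assert y_i - f_i <= epsilon + s
    assert f_i - y_i <= epsilon + s_star
    assert s >= 0 and s_star >= 0
```

For any single point at most one of `ξ` and `ξ*` is non-zero, because a target cannot be above and below the tube at the same time.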

The dual problem of SVR involves Lagrange multipliers, which are used to incorporate the constraints into the objective function. The solution to the dual problem gives the optimal weight vector `w` and bias term `b`, as well as the support vectors. The dual is covered in more detail in the optimization section below.

## Continuous Optimization and Vector Calculus

Let us go back to the objective function and constraints of SVR from earlier. This is a convex optimization problem, so any local minimum we find is also a global minimum. The solution to this problem gives the weight vector w and bias term b, which define our hyperplane. To find these terms, we can use the method of Lagrange multipliers, which incorporates the constraints into the objective function.  
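As a sketch of what incorporating the constraints looks like, the Lagrangian for the problem above can be written as follows. The index i over training instances and the multiplier names α, α*, η, η* are notation assumed here (they follow a common convention and are not defined elsewhere in these notes); all multipliers are non-negative.

$$
\begin{aligned}
L ={}& \frac{1}{2}\lVert w \rVert^2 + C \sum_i (\xi_i + \xi_i^*) - \sum_i (\eta_i \xi_i + \eta_i^* \xi_i^*) \\
& - \sum_i \alpha_i \left( \varepsilon + \xi_i - y_i + w^T x_i + b \right) \\
& - \sum_i \alpha_i^* \left( \varepsilon + \xi_i^* + y_i - w^T x_i - b \right)
\end{aligned}
$$

Setting the partial derivatives with respect to w, b, and the slack variables to zero gives:

$$
\begin{aligned}
\frac{\partial L}{\partial w} = 0 \;&\Rightarrow\; w = \sum_i (\alpha_i - \alpha_i^*)\, x_i \\
\frac{\partial L}{\partial b} = 0 \;&\Rightarrow\; \sum_i (\alpha_i - \alpha_i^*) = 0 \\
\frac{\partial L}{\partial \xi_i} = 0 \;&\Rightarrow\; C - \alpha_i - \eta_i = 0 \\
\frac{\partial L}{\partial \xi_i^*} = 0 \;&\Rightarrow\; C - \alpha_i^* - \eta_i^* = 0
\end{aligned}
$$

The first condition already shows that w is a weighted combination of the training inputs.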

### The Dual Problem
With Lagrange multipliers we arrive at the dual problem: the Lagrangian is minimized with respect to the parameters of the objective function (w, b, and the slack variables) while being maximized with respect to the Lagrange multipliers attached to the inequality constraints.

(Looked this up but haven't researched it myself yet: this can be solved using techniques like Sequential Minimal Optimization, or SMO.)

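Why does that look like a problem? We are trying to minimize our original objective by maximizing over Lagrange multipliers that are themselves derived from that objective.
However, because the problem is convex with linear constraints, the maximum of the dual equals the minimum of the original problem, so solving the dual returns the optimal values for our weights and bias.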
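For reference, the dual of the linear SVR problem can be written as below. Here α_i and α_i* denote the multipliers of the upper and lower tube constraints for training instance i; this notation is an assumption of this sketch and is not used elsewhere in these notes.

$$
\begin{aligned}
\max_{\alpha,\, \alpha^*} \quad & -\frac{1}{2} \sum_{i,j} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, x_i^T x_j \;-\; \varepsilon \sum_i (\alpha_i + \alpha_i^*) \;+\; \sum_i y_i (\alpha_i - \alpha_i^*) \\
\text{subject to} \quad & \sum_i (\alpha_i - \alpha_i^*) = 0, \qquad 0 \le \alpha_i,\ \alpha_i^* \le C
\end{aligned}
$$

Once the multipliers are known, the weights follow from

$$
w = \sum_i (\alpha_i - \alpha_i^*)\, x_i, \qquad f(x) = \sum_i (\alpha_i - \alpha_i^*)\, x_i^T x + b
$$

so only instances with a non-zero difference of multipliers (the support vectors) contribute to the prediction.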

### Advantages of dual problem
- It depends on the input vectors only through their dot products, which allows us to use the kernel trick to implicitly map our inputs to a higher-dimensional space and solve non-linear problems
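A small sketch of why dot products are enough: for 2-D inputs, the degree-2 polynomial kernel `(x . z)^2` gives the same number as a dot product in an explicit 3-D feature space, without ever building that space. The function names and the choice of kernel are assumptions made for illustration.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(u_k * v_k for u_k, v_k in zip(u, v))


def poly2_kernel(x, z):
    """Degree-2 polynomial kernel: k(x, z) = (x . z)^2."""
    return dot(x, z) ** 2


def poly2_features(x):
    """Explicit feature map for 2-D inputs that matches poly2_kernel."""
    x1, x2 = x
    return [x1 ** 2, math.sqrt(2.0) * x1 * x2, x2 ** 2]


x, z = [1.0, 2.0], [3.0, 4.0]
kernel_value = poly2_kernel(x, z)                            # computed in 2-D
explicit_value = dot(poly2_features(x), poly2_features(z))   # computed in 3-D
print(kernel_value, explicit_value)  # the two agree up to floating-point rounding
```

Replacing every dot product in the dual with a kernel evaluation therefore fits a linear model in the mapped space, which is a non-linear model in the original inputs.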

# Support Vector Regression from scratch

[Great Resource](https://www.mathworks.com/help/stats/understanding-support-vector-machine-regression.html)

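```python
class SVR:
    """Skeleton for a from-scratch support vector regressor."""

    def __init__(self, kernel, C, epsilon):
        self.kernel = kernel    # kernel to use, e.g. linear or RBF
        self.C = C              # penalty on deviations outside the ε tube
        self.epsilon = epsilon  # half-width of the ε tube

    def fit(self, X, y):
        # Not implemented yet: solve the SVR optimization problem here.
        pass

    def predict(self, X):
        # Not implemented yet: evaluate f(x) for each row of X.
        pass
```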
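One way the skeleton could be filled in for the linear case is sketched below. It does not solve the dual or use SMO. Instead it runs plain subgradient descent on the primal objective `(1/2) * ||w||^2 + C * Σ max(0, |y - f(x)| - ε)`, which is a swapped-in technique chosen for brevity. The class name `LinearSVRSketch` and the `lr` and `n_iters` parameters are assumptions made for this illustration, and kernels are not supported.

```python
import numpy as np


class LinearSVRSketch:
    """Linear ε-SVR trained by subgradient descent on the primal objective."""

    def __init__(self, C=1.0, epsilon=0.1, lr=0.001, n_iters=5000):
        self.C = C
        self.epsilon = epsilon
        self.lr = lr            # step size (assumed hyperparameter)
        self.n_iters = n_iters  # number of full-batch updates (assumed)

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        self.w = np.zeros(X.shape[1])
        self.b = 0.0
        for _ in range(self.n_iters):
            residual = y - (X @ self.w + self.b)
            # +1 for targets above the tube, -1 below it, 0 inside it
            outside = np.where(np.abs(residual) > self.epsilon,
                               np.sign(residual), 0.0)
            # Subgradient of 0.5*||w||^2 + C * sum(max(0, |r_i| - epsilon))
            grad_w = self.w - self.C * (X.T @ outside)
            grad_b = -self.C * outside.sum()
            self.w -= self.lr * grad_w
            self.b -= self.lr * grad_b
        return self

    def predict(self, X):
        return np.asarray(X, dtype=float) @ self.w + self.b


# Toy data on the line y = 2x + 1 (made up for this example).
X_demo = np.linspace(0.0, 1.0, 20).reshape(-1, 1)
y_demo = 2.0 * X_demo[:, 0] + 1.0
model = LinearSVRSketch(C=1.0, epsilon=0.1).fit(X_demo, y_demo)
print(model.w, model.b)
print(np.max(np.abs(model.predict(X_demo) - y_demo)))
```

On this toy data the fitted slope is expected to come out a little below the true slope of 2, since the objective rewards flatness and tolerates errors up to ε. A constant step size also means the parameters keep jittering slightly around the optimum rather than settling exactly.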