# Support Vector Machine (SVM)

* For classification → **Support Vector Classifier (SVC)**
* For regression → **Support Vector Regression (SVR)**

---

## 1. Pre-requisite

Before diving into SVM, it is **very important to understand Logistic Regression** (mathematical intuition and implementation).
If you skipped logistic regression, please go through it first.

---

## 2. Logistic Regression Recap

* In **binary classification** (2 categories of points), logistic regression creates a **decision boundary** (line in 2D, plane in 3D, hyperplane in nD).
* Example:

  * In 2D → best fit **line** separates the two classes.
  * In 3D → best fit **plane** separates the classes.
  * In nD → best fit **hyperplane** is created.

### Example:

* Features: $$ (x_1, x_2) $$ → line
* Features: $$ (x_1, x_2, x_3) $$ → plane
* Features: $$ (x_1, x_2, \dots, x_n) $$ → hyperplane

---

## 3. Support Vector Machine (SVM)

SVM extends the concept of decision boundaries by not only finding a **separating hyperplane**, but also maximizing the **margin**.

### Key Idea:

* Create a **best fit line (or plane/hyperplane)**.
* Along with it, create **two marginal planes** that are equidistant from the separating hyperplane.
* The goal is to maximize the distance between these marginal planes.

---

## 4. Margins in SVM

* Suppose the distance between marginal planes is $$ D $$.
* We want to **maximize $$ D $$**.
* The marginal planes should touch the **nearest data points** from each class.

These nearest points are called **Support Vectors**.

---

## 5. Geometric Intuition

* **2D Case:**

  * Best fit line + 2 parallel marginal lines.
  * Margins pass through the closest data points.

* **3D Case:**

  * Decision boundary becomes a **plane**.
  * Two parallel planes act as **marginal planes**.

* **nD Case:**

  * Decision boundary is a **hyperplane**.
  * Marginal hyperplanes are placed on either side.

---

## 6. Classification with SVM

* If a **new test point** falls:

  * Above the hyperplane → assign to one class.
  * Below the hyperplane → assign to another class.

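Because the margin is maximized, the boundary sits as far as possible from the nearest training points of both classes, which makes the classification less sensitive to small shifts in the data.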
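The "above or below" rule amounts to checking the sign of $$ w^T x + b $$, the notation used in the mathematical sections of these notes. A minimal sketch in plain Python follows. The weights `w`, the bias `b`, and the test points are made-up values for illustration, not the output of a trained model.

```python
def decision_value(w, b, x):
    """Return w . x + b for a weight list w, bias b, and point x."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def predict(w, b, x):
    """Assign +1 on the positive side of the hyperplane, -1 on the other side."""
    return 1 if decision_value(w, b, x) >= 0 else -1


# Hypothetical hyperplane x1 + x2 - 3 = 0 (illustrative values only).
w = [1.0, 1.0]
b = -3.0

label_above = predict(w, b, [3.0, 2.0])   # decision value  2.0 -> class +1
label_below = predict(w, b, [0.5, 1.0])   # decision value -1.5 -> class -1
```

Treating a point that lies exactly on the hyperplane as class +1 is an arbitrary tie-breaking choice made for this sketch.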

---

## 7. Support Vectors

* The **data points lying on the marginal planes** are called **Support Vectors**.
* These points are crucial since they define the margin.
* Removing non-support vectors does not change the decision boundary.
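As a rough illustration, support vectors can be picked out numerically once a hyperplane is known. The sketch below assumes the scaling used in the mathematical sections of these notes, where the marginal planes are $$ w^T x + b = \pm 1 $$. The toy data and the values of `w` and `b` are assumptions chosen by hand so that they form the maximum-margin solution for this data; they are not produced by a solver.

```python
def functional_margin(w, b, x, y):
    """Return y * (w . x + b); it equals 1 for points on a marginal plane."""
    return y * (sum(wj * xj for wj, xj in zip(w, x)) + b)


def support_vector_indices(w, b, X, y, tol=1e-9):
    """Indices of the points that lie on one of the two marginal planes."""
    return [
        i
        for i, (xi, yi) in enumerate(zip(X, y))
        if abs(functional_margin(w, b, xi, yi) - 1.0) <= tol
    ]


# Toy 2D data (illustrative values only).
X = [[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]]
y = [1, 1, -1, -1]

# Assumed maximum-margin hyperplane for this data: 0.5*x1 + 0.5*x2 - 1 = 0.
w = [0.5, 0.5]
b = -1.0

sv = support_vector_indices(w, b, X, y)   # [0, 2]
```

Only the points `(2, 2)` and `(0, 0)` touch the margins here. The other two points sit farther away, so dropping them would leave the same two points defining the margin.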

---

## 8. Problem Statement (Simplified)

* Logistic Regression → finds a separating hyperplane.
* Support Vector Machine → finds a separating hyperplane **with maximum margin**.

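Thus, among all separating hyperplanes, SVM picks the one that leaves the most room between the classes, which improves classification robustness.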

---


# Support Vector Machine (SVM) – Hard Margin vs Soft Margin

In the previous discussion, we understood the **main idea of SVM** for classification:

* Find a **best fit hyperplane**.
* Create **marginal planes**.
* Ensure the **margin is maximized**.

Now, let us extend this understanding to **real-world scenarios** using **Hard Margin** and **Soft Margin**.

---

## 1. Hard Margin

* Hard Margin assumes that **all data points are perfectly separable** by a linear boundary.
* The separating hyperplane is chosen such that:

  * **No misclassification occurs.**
  * **All points lie outside or on the margin.**

### Key Characteristics:

* Works only when data is **linearly separable**.
* Margin must perfectly divide the classes without errors.
* Sensitive to outliers (a single outlier can break separability).

---

## 2. Soft Margin

* In real-world datasets, classes are **rarely linearly separable**.
* Some points **overlap** across classes.
* To handle this, SVM introduces **Soft Margin**.

### Idea:

* Allow **some errors (misclassifications)**.
* Introduce **slack variables** $$ \xi_i \geq 0 $$ that measure how far each data point violates the margin (a point is actually misclassified only when $$ \xi_i > 1 $$).

### Optimization Goal:

* Maximize the margin
* While minimizing the total misclassification error

Mathematically, the optimization balances between:
$$
\text{Minimize: } \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^n \xi_i
$$

Subject to:
$$
y_i (w \cdot x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0
$$

Where:

* $$ w $$ → weight vector (orientation of hyperplane)
* $$ b $$ → bias (shift of hyperplane)
* $$ C $$ → regularization (trade-off: margin vs error)
* $$ \xi_i $$ → slack variable (misclassification measure)
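To make the objective concrete, the sketch below evaluates it for one fixed candidate hyperplane. For given $$ w $$ and $$ b $$, the smallest slack that satisfies the constraint is $$ \xi_i = \max(0, 1 - y_i (w \cdot x_i + b)) $$. The 1D data and the values of `w`, `b`, and `C` are assumptions chosen for easy arithmetic, and nothing is being optimized here.

```python
def slack_values(w, b, X, y):
    """Smallest xi_i >= 0 satisfying y_i * (w . x_i + b) >= 1 - xi_i."""
    slacks = []
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        slacks.append(max(0.0, 1.0 - margin))
    return slacks


def soft_margin_objective(w, b, X, y, C):
    """Value of 0.5 * ||w||^2 + C * sum(xi_i) for the given w and b."""
    norm_sq = sum(wj * wj for wj in w)
    return 0.5 * norm_sq + C * sum(slack_values(w, b, X, y))


# Toy 1D data (illustrative): the third point lies inside the margin,
# the fourth lies on the wrong side of the boundary.
X = [[2.0], [-2.0], [0.5], [1.0]]
y = [1, -1, 1, -1]
w, b, C = [1.0], 0.0, 1.0

slacks = slack_values(w, b, X, y)                  # [0.0, 0.0, 0.5, 2.0]
objective = soft_margin_objective(w, b, X, y, C)   # 0.5 + 1.0 * 2.5 = 3.0
```

The point inside the margin contributes a slack of 0.5 and the misclassified point contributes 2.0, so errors on the wrong side of the boundary cost more than mild margin violations.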



## 3. Hard Margin vs Soft Margin – Comparison

| Aspect            | Hard Margin SVM                  | Soft Margin SVM                          |
| ----------------- | -------------------------------- | ---------------------------------------- |
| Data separability | Assumes perfectly separable data | Handles overlapping/non-separable data   |
| Errors            | No misclassification allowed     | Allows some misclassifications           |
| Outliers          | Very sensitive                   | More robust                              |
| Use case          | Ideal for clean, simple datasets | Practical for noisy, real-world datasets |

---

## 4. Key Insight

* **Hard Margin** → Works only when the dataset is perfectly separable, which is rare outside of theory.
* **Soft Margin** → Practical version of SVM, used in real-world problems with noise and overlap.

---

## 5. Next Step

In the next section, we will cover the **mathematical intuition** of how to derive the best fit line and margins:

* Equation of a straight line (in 2D).
* Equation of a plane (in 3D).
* Distance from a point to a hyperplane.
* Optimization using Lagrange multipliers.

---


# Support Vector Machine (SVM) – Mathematical Intuition

In the last discussion, we saw the geometric idea of SVM. Now, let’s dive into the **mathematics**.

---

## 1. Equation of the Hyperplane

The separating hyperplane (a line in 2D) is given by:

$$
w^T x + b = 0
$$

Where:

* $$ w $$ → weight vector (perpendicular to the hyperplane)
* $$ b $$ → bias term

If the line passes through the origin, $$ b = 0 $$. Otherwise, $$ b \neq 0 $$.

---

## 2. Distance of a Point from Hyperplane

The signed distance of a point $$ x $$ from the hyperplane is:

$$
d = \frac{w^T x + b}{\lVert w \rVert}
$$

* $$ d > 0 $$ → point lies **above** the hyperplane
* $$ d < 0 $$ → point lies **below** the hyperplane

When the hyperplane separates the two classes, every point of one class has a positive signed distance and every point of the other class has a negative one.
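The distance formula can be checked numerically with a few lines of plain Python. The hyperplane below is a hypothetical example chosen so that the norm of `w` is exactly 5.

```python
import math


def signed_distance(w, b, x):
    """Signed distance (w . x + b) / ||w|| from point x to the hyperplane."""
    norm = math.sqrt(sum(wj * wj for wj in w))
    return (sum(wj * xj for wj, xj in zip(w, x)) + b) / norm


# Hypothetical hyperplane 3*x1 + 4*x2 - 5 = 0, so ||w|| = 5.
w = [3.0, 4.0]
b = -5.0

d_above = signed_distance(w, b, [3.0, 4.0])    # (9 + 16 - 5) / 5 =  4.0
d_below = signed_distance(w, b, [0.0, 0.0])    # (0 + 0 - 5) / 5  = -1.0
d_on = signed_distance(w, b, [3.0, -1.0])      # (9 - 4 - 5) / 5  =  0.0
```

Dividing by the norm is what turns the raw value of `w . x + b` into a geometric distance. Without it, rescaling `w` and `b` would change the number without moving the hyperplane.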

---

## 3. Marginal Hyperplanes

Along with the main hyperplane, we define two marginal planes:

* Upper margin:
  $$
  w^T x + b = +1
  $$

* Lower margin:
  $$
  w^T x + b = -1
  $$

The **margin width** is the distance between them:

$$
\text{Margin} = \frac{2}{\lVert w \rVert}
$$
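One way to see where this width comes from is to take one point on each marginal plane and subtract the two plane equations. In the sketch below, $$ x_+ $$ and $$ x_- $$ are illustrative names (not used elsewhere in these notes) for a point on the upper margin and a point on the lower margin:

$$
\begin{aligned}
w^T x_+ + b &= +1 \\
w^T x_- + b &= -1 \\
\Rightarrow \quad w^T (x_+ - x_-) &= 2 \\
\Rightarrow \quad \frac{w^T}{\lVert w \rVert} (x_+ - x_-) &= \frac{2}{\lVert w \rVert}
\end{aligned}
$$

The left-hand side of the last line is the projection of $$ x_+ - x_- $$ onto the unit normal of the hyperplane, which is the perpendicular distance between the two marginal planes.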

---

## 4. Optimization Objective

We want to **maximize the margin** → equivalently **minimize**:

$$
\frac{1}{2} \lVert w \rVert^2
$$

---

## 5. Classification Constraint

For correct classification:

* If $$ y_i = +1 $$, then $$ w^T x_i + b \geq +1 $$
* If $$ y_i = -1 $$, then $$ w^T x_i + b \leq -1 $$

This can be compactly written as:

$$
y_i \, (w^T x_i + b) \geq 1, \quad \forall i
$$
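The compact constraint is easy to test for a given hyperplane. The sketch below checks it on toy data and then adds one overlapping point to show how a single outlier breaks it. The data, `w`, and `b` are hand-picked assumptions for illustration.

```python
def satisfies_hard_margin(w, b, X, y):
    """True if y_i * (w . x_i + b) >= 1 holds for every training point."""
    return all(
        yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 1.0
        for xi, yi in zip(X, y)
    )


# Toy separable data and a hand-picked hyperplane (illustrative values).
X = [[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]]
y = [1, 1, -1, -1]
w, b = [0.5, 0.5], -1.0

feasible = satisfies_hard_margin(w, b, X, y)   # True

# A negative-class point placed between the two positive points.
X_outlier = X + [[2.5, 2.5]]
y_outlier = y + [-1]

feasible_with_outlier = satisfies_hard_margin(w, b, X_outlier, y_outlier)   # False
```

The added point lies on the segment between two positive points, so no hyperplane can put it on the negative side while keeping both positives on the other side. This is the situation that the soft margin formulation is designed to handle.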

---

## 6. Final Optimization Problem (Hard Margin SVM)

$$
\begin{aligned}
\text{Minimize: } & \frac{1}{2} \lVert w \rVert^2 \\
\text{Subject to: } & y_i (w^T x_i + b) \geq 1, \quad \forall i
\end{aligned}
$$

---

## 7. Key Idea

* **Margin width** = $$ \frac{2}{\lVert w \rVert} $$
* **Objective** = maximize margin (or minimize $$ \lVert w \rVert^2 $$)
* **Constraints** = all points classified correctly

This is the **core formulation of SVM** for linearly separable data (Hard Margin).

---


# Support Vector Machine (SVM) – Soft Margin & Hinge Loss

In real-world data, perfect linear separation is rare. To handle **overlapping points and misclassifications**, we introduce slack variables and hinge loss.

---

## 1. Hard Margin Cost Function

For linearly separable data:

$$
\min_{w, b} \ \frac{1}{2} \lVert w \rVert^2
\quad \text{subject to: } y_i (w^T x_i + b) \geq 1, \ \forall i
$$

---

## 2. Introducing Slack Variables (Soft Margin)

To allow some misclassification, we introduce slack variables $$ \xi_i \geq 0 $$:

$$
y_i (w^T x_i + b) \geq 1 - \xi_i, \quad \forall i
$$

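* $$ \xi_i = 0 $$ → point is correctly classified and lies on or outside the margin
* $$ 0 < \xi_i \leq 1 $$ → point is inside the margin but not on the wrong side of the hyperplane
* $$ \xi_i > 1 $$ → point is misclassified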

---

## 3. Regularization Parameter $$ C $$

We modify the cost function to include a penalty for slack:

$$
\min_{w, b, \xi} \ \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
$$

Where:

* $$ C $$ → regularization parameter (controls trade-off between margin width and error)
* Large $$ C $$ → fewer misclassifications, narrower margin
* Small $$ C $$ → wider margin, more tolerance to errors
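The trade-off can be illustrated by scoring two fixed candidate hyperplanes under different values of $$ C $$. This is only a sketch: the 1D data and the two candidates (`wide` and `narrow`) are assumptions picked by hand, and neither is claimed to be the true optimum.

```python
def soft_margin_objective(w, b, X, y, C):
    """Value of 0.5 * ||w||^2 + C * (total slack) for the given w and b."""
    norm_sq = sum(wj * wj for wj in w)
    total_slack = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        total_slack += max(0.0, 1.0 - margin)
    return 0.5 * norm_sq + C * total_slack


# Toy 1D data (illustrative): the last point is a negative example
# sitting on the positive side of the boundary x = 0.
X = [[3.0], [1.0], [-1.0], [-3.0], [0.25]]
y = [1, 1, -1, -1, -1]

wide = ([0.5], 0.0)     # margin width 2 / 0.5 = 4
narrow = ([2.0], 0.0)   # margin width 2 / 2.0 = 1


def preferred(C):
    """Name of the candidate with the lower objective value for this C."""
    j_wide = soft_margin_objective(*wide, X, y, C)
    j_narrow = soft_margin_objective(*narrow, X, y, C)
    return "wide" if j_wide < j_narrow else "narrow"


small_c_choice = preferred(0.1)    # "wide"
large_c_choice = preferred(10.0)   # "narrow"
```

With a small `C` the slack is cheap, so the wide margin scores better even though more points violate it. With a large `C` every violation is expensive, so the narrow margin scores better.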

---

## 4. Hinge Loss

The error term for each point is defined using **hinge loss**:

$$
L_i = \max \big(0, \ 1 - y_i (w^T x_i + b) \big)
$$

At the optimum each slack variable equals this loss, $$ \xi_i = L_i $$, so the full objective can also be written without constraints as:

$$
\min_{w, b} \ \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} L_i
$$
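Because this form has no constraints, it can be minimized directly. The sketch below uses plain batch subgradient descent, which is one simple option and an assumption of this example; these notes do not prescribe a solver, and practical libraries typically use more specialized methods. The function names, learning rate, epoch count, and toy data are all illustrative.

```python
def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=2000):
    """Minimize 0.5 * ||w||^2 + C * sum(hinge losses) by subgradient descent."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)   # gradient of 0.5 * ||w||^2 is w itself
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:   # hinge loss is active for this point
                for j in range(n_features):
                    grad_w[j] -= C * yi * xi[j]
                grad_b -= C * yi
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Class +1 if w . x + b >= 0, otherwise class -1."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Toy linearly separable 2D data (illustrative values only).
X = [[2.0, 2.0], [3.0, 3.0], [3.0, 1.0], [0.0, 0.0], [-1.0, 0.0], [0.0, -1.0]]
y = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(X, y)
predictions = [predict(w, b, xi) for xi in X]
```

Points with a margin of at least 1 contribute nothing to the update, so only margin-violating points and the regularization term move the hyperplane. With a fixed step size the result hovers near the minimizer rather than converging exactly.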

---

## 5. Key Idea

* **Hard Margin SVM** → assumes perfect separation
* **Soft Margin SVM** → allows some errors using slack variables
* **Hinge Loss** → penalizes misclassified or margin-violating points
* **Goal** → balance **margin maximization** and **error minimization**

---