# Explanation
*This file records what I have learnt about the inner workings of support vector machines (SVMs): what they are, how they work, how to optimize them, the different functions involved, and so on. The aim is to gain a deep, confident grasp of SVMs. The research ties into the Titanic model I'm making.*

# Table of Contents:
- [Classification VS Regression](#ClassVSReg)
- [What is a Support Vector Machine?](#svm-intro)
- [SVM Basic Code Explanation](#svm-intro-code)

<a name="ClassVSReg"></a>
# Classification VS Regression

*Classification and regression are both types of supervised learning used in machine learning, where the goal is to train a model on labeled data to make predictions. The key difference lies in the type of output they produce: classification predicts discrete categories or class labels (such as 'spam' or 'not spam'), while regression predicts continuous numerical values (such as house prices or temperatures). Despite this difference, both approaches follow the same underlying process - learning a mapping from input features to an output - and use similar algorithms that are adapted to suit either categorical or continuous prediction tasks.*

<p align="center">
  <img src="classification.png" alt="Classification" width="300"/>
  <img src="regression.png" alt="Regression" width="300"/>
</p>

### Classification:
Example:
- A list of students in a class can be categorised by gender.
- A dataset of images of hand-drawn digits 0-9 can be classified by the digit each image shows.

Common Usage:
- Medical diagnostics
- Identifying spam vs non-spam
- Identifying whether a file is malicious
- Image recognition

### Regression:
Example:
- Predicting the price of a house based on its features (size, location, number of rooms).
- Estimating a person's weight based on height and age.

Common Usage:
- Forecasting stock prices or sales revenue
- Predicting temperature or rainfall levels
- Estimating delivery time or traffic flow
- Modeling population growth over time
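
The difference in output type can be sketched in a few lines of plain Python. Both functions below are hypothetical toy models with hand-picked numbers (nothing is learned from data here); the point is only that one returns a discrete label and the other a continuous value.

```python
def classify_email(num_suspicious_words: int) -> str:
    """Toy classifier: maps a feature to one of two discrete labels."""
    # The threshold of 3 is an arbitrary assumption for illustration.
    return "spam" if num_suspicious_words >= 3 else "not spam"


def predict_house_price(size_m2: float) -> float:
    """Toy regressor: maps a feature to a continuous number."""
    # Slope and intercept are made-up values, not fitted to any dataset.
    return 2000.0 * size_m2 + 50000.0


label = classify_email(5)          # a category
price = predict_house_price(80.0)  # a number

print(label)
print(price)
```

A real model would learn the threshold or the slope from labeled data instead of having them written in by hand.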


<a name="svm-intro"></a>
# What is a Support Vector Machine?

*A Support Vector Machine (SVM) is a supervised machine learning algorithm used for both classification and regression tasks. In this project, the focus will be on **classification**, not regression.*

### How It Works:

SVM works by plotting data points in a multi-dimensional space, where each feature represents one axis (or dimension).  
- For example: a dataset with two features like **height** and **weight** can be visualized in a 2D space.  
- A dataset with three features would be 3D, and so on.

The SVM algorithm then attempts to find the **best hyperplane** that separates the different classes of data. A hyperplane is a decision boundary:
- In **2D**, it’s a line (1D hyperplane).
- In **3D**, it’s a plane (2D hyperplane).
- In general, the hyperplane is always **one dimension less** than the feature space.
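
A hyperplane is usually written as the set of points `x` where `w·x + b = 0`, and a point is classified by the sign of `w·x + b`. The sketch below shows this in 2D with plain Python; the weights `w`, the offset `b` and the helper names are hypothetical values chosen for illustration, not something an SVM has fitted.

```python
def decision_value(w, b, x):
    """Return w·x + b; zero means x lies exactly on the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict_side(w, b, x):
    """Classify a point by which side of the hyperplane it falls on."""
    return 1 if decision_value(w, b, x) >= 0 else -1


# Hypothetical 2D boundary: the line x1 + x2 - 3 = 0.
w, b = [1.0, 1.0], -3.0

side_a = predict_side(w, b, [3.0, 2.0])  # 3 + 2 - 3 = 2, so class +1
side_b = predict_side(w, b, [0.5, 1.0])  # 0.5 + 1 - 3 = -1.5, so class -1

print(side_a, side_b)
```

The same two functions work unchanged for three or more features, because the hyperplane is always described by one weight per feature plus an offset.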

The key question is: **where should the hyperplane be placed?**

SVM answers this by choosing the hyperplane that **maximizes the margin**: the distance between the hyperplane and the **nearest data points** from each class (called **support vectors**). A wider margin leaves more room for new points to land on the correct side of the boundary, which tends to improve generalization to unseen data. In the figure below, the margin is the distance between the hyperplane and the dotted line.
<p align="center">
  <img src="margin.png" alt="margin" width="300"/>
</p>
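
The margin of a given hyperplane can be computed directly: the perpendicular distance from a point `x` to the hyperplane `w·x + b = 0` is `|w·x + b| / ||w||`, and the margin is the smallest such distance. The sketch below uses a hand-picked boundary and four invented points; it measures the margin of that boundary and does not search for the best one.

```python
import math


def distance_to_hyperplane(w, b, x):
    """Perpendicular distance |w·x + b| / ||w|| from point x to the hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


# Hypothetical boundary: the vertical line x1 = 2.
w, b = [1.0, 0.0], -2.0

# Invented points: the first two belong to one class, the last two to the other.
points = [[0.0, 1.0], [1.0, 3.0], [3.0, 0.0], [5.0, 2.0]]

distances = [distance_to_hyperplane(w, b, p) for p in points]
margin = min(distances)

# The points that sit exactly at the margin play the role of support vectors.
nearest = [p for p, d in zip(points, distances) if d == margin]

print(margin)
print(nearest)
```

Here one point from each class lies at distance 1 from the line, so moving the line left or right would bring it closer to one of them and shrink the margin.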

### Types of Margins
*Hard Margin: No misclassifications allowed; assumes data is perfectly separable*

*Soft Margin: Allows some misclassification; better for real-world noisy data*
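
One common way to write the soft-margin idea down (a standard textbook formulation, not something taken from this project) is as a quantity to minimise: half the squared length of `w`, plus a penalty weight `C` times the sum of the hinge losses `max(0, 1 - y * (w·x + b))`. A hard margin corresponds to demanding that every hinge loss is zero. The function names, the value of `C` and the data below are illustrative assumptions.

```python
def hinge_loss(w, b, x, y):
    """Zero when x is on the correct side and outside the margin, positive otherwise."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)


def soft_margin_objective(w, b, X, Y, C):
    """0.5 * ||w||^2 + C * (sum of hinge losses)."""
    regulariser = 0.5 * sum(wi * wi for wi in w)
    penalty = sum(hinge_loss(w, b, x, y) for x, y in zip(X, Y))
    return regulariser + C * penalty


# Hypothetical boundary x1 = 2 and invented points with labels -1 / +1.
w, b = [1.0, 0.0], -2.0
X = [[0.0, 1.0], [1.0, 3.0], [3.0, 0.0], [2.5, 2.0]]
Y = [-1, -1, 1, 1]

losses = [hinge_loss(w, b, x, y) for x, y in zip(X, Y)]
print(losses)  # only the last point, which sits inside the margin, is penalised

print(soft_margin_objective(w, b, X, Y, C=1.0))
print(soft_margin_objective(w, b, X, Y, C=10.0))
```

A larger `C` makes each violation more expensive, pushing the result towards hard-margin behaviour; a smaller `C` tolerates more violations in exchange for a wider margin.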


<a name="svm-intro-code"></a>
# SVM Basic Code Explanation:

The first bit of every AI model consists of data loading and data exploration. This can easily be done with Python modules such as pandas, matplotlib, numpy and seaborn.

Pandas is a fast and powerful tool for analysing data, and it is my preferred way to load data from a CSV file. It can be installed from the console with pip, the package installer for Python:

```shell
pip install pandas
```

To import the pandas library and load the data we can use these lines of code:

```python
import pandas

# read_csv is the pandas function for loading CSV files.
df = pandas.read_csv('titanic/train.csv')
```

We can analyse this dataframe (df) with pandas functions such as head(), describe() and info().

**head()** shows the first 5 (default) rows of the dataframe; head(x) shows the first x rows.

**describe()** shows each numeric column's count, mean, standard deviation (std), minimum (min), 25th percentile (25%), 50th percentile (50%), 75th percentile (75%), and maximum (max).

**info()** shows each column's index (#), Column label, number of non-null values, and data type (Dtype). It also states the number of rows and columns.

```python
df.head()
df.describe()
df.info()
```
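
These three calls can be tried without the Titanic file. The sketch below builds a small DataFrame by hand; the column names echo the Titanic data, but every value is invented for illustration.

```python
import pandas as pd

# Invented rows; None becomes a missing value (NaN) in the Age column.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Sex": ["male", "female", "female", "male"],
})

first_two = toy.head(2)   # first 2 rows instead of the default 5
summary = toy.describe()  # by default, summarises the numeric columns only
toy.info()                # prints column names, non-null counts and dtypes

print(first_two)
print(summary)
```

Note that the count reported for `Age` is lower than the number of rows, because the missing value is not counted. This is a quick way to spot columns with gaps.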

The core idea behind AI models is prediction. We're training models on past (labeled) data with the goal of accurately predicting future or unseen data. This is essentially about learning which features (inputs) are most correlated with the target (output).

At their heart, all machine learning models are just function estimators. In regression, we estimate a function that maps inputs to continuous outputs. In classification, we estimate a function that separates categories of data.

That’s it - it’s all just functions.

And that’s part of what makes AI so powerful: the world itself is governed by patterns and functions, and machines can often model these high-dimensional relationships far better than humans.

The point of this (mini) rant is to highlight the importance of understanding which features are most correlated with the outcome - this is the foundation of good predictions and good models. The better we understand the data, the smarter our focus and efforts will be.

Pandas has a cool function that maps correlation between all combinations of columns: ***corr()***.
This function calculates the pairwise correlation between all numerical columns in the DataFrame, returning a matrix where each value ranges from -1 to 1:

- A value close to 1 means a strong positive correlation (as one feature increases, so does the other).
- A value close to -1 means a strong negative correlation (as one feature increases, the other decreases).
- A value close to 0 means little to no linear correlation.

This is super useful when deciding which features are most influential or redundant, especially when trying to understand what drives predictions in your model.

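```python
# numeric_only=True skips text columns such as 'Name' and 'Sex';
# recent pandas versions raise an error on them otherwise.
df.corr(numeric_only=True)
```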
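
To see what values near 1, -1 and 0 look like in practice, here is a minimal sketch on a hand-made DataFrame. The column names and numbers are invented so that each case shows up clearly.

```python
import pandas as pd

toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "double_x": [2, 4, 6, 8, 10],  # rises whenever x rises
    "minus_x": [5, 4, 3, 2, 1],    # falls whenever x rises
    "unrelated": [2, 1, 3, 1, 2],  # no linear relationship with x
})

corr = toy.corr()
print(corr.round(2))
```

Reading along the `x` row, `double_x` comes out at 1, `minus_x` at -1 and `unrelated` at 0. The diagonal is always 1, because every column is perfectly correlated with itself.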

However, corr() only works on numeric columns, so categorical columns must first be converted to numbers. This is called encoding. One simple way to encode categorical values is the map() method, which replaces each category with a number by applying a dictionary mapping to the column.
For example:

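```python
# Store the encoded values in a new column, 'Sex_encoded':
# 'male' becomes 0 and 'female' becomes 1.
df['Sex_encoded'] = df['Sex'].map({'male': 0, 'female': 1})

# 'Embarked_encoded' is assumed to have been created the same way from 'Embarked'
# (that mapping is not shown here).
corr_matrix = df[['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare',
                  'Embarked_encoded', 'Sex_encoded']].corr()
```

Note that I encoded both 'Embarked' and 'Sex'. This returns: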

<p align="center">
  <img src="corr.png" alt="Correlation matrix" width="450"/>
</p>
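
The encoding step can be sketched on toy data. The `Embarked` mapping below is an assumption (these notes do not show which numbers were actually used), and the rows are invented. One thing worth knowing is that map() turns any value missing from the dictionary, including an empty cell, into NaN.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", None],  # the last passenger has no port recorded
})

toy["Sex_encoded"] = toy["Sex"].map({"male": 0, "female": 1})

# Assumed mapping for illustration only; any consistent numbering would do.
toy["Embarked_encoded"] = toy["Embarked"].map({"S": 0, "C": 1, "Q": 2})

print(toy)
print(toy["Embarked_encoded"].isna().sum())  # number of rows left unencoded
```

Checking for NaN after encoding is a cheap sanity test: a non-zero count means either missing data or a category that the dictionary forgot.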

This is good: we can now see the correlation between every pair of columns. However, we want to focus on the features that correlate with surviving, so we need to filter out the rest. This can be done with **loc[]**.

loc[] selects the row in the correlation matrix corresponding to 'Survived'.

We can also filter out the correlation of Survived with itself using **drop()**, since it is always 1 and not useful for analysis.

```python
survived_corr = corr_matrix.loc['Survived'].drop('Survived')
```

<p align="center">
  <img src="corr_filtered.png" alt="Correlations with Survived" width="200"/>
</p>

We can now see that sex has the strongest correlation with surviving, followed by Pclass. We want to look at the absolute values of the correlations, because a strong negative correlation is just as informative as a strong positive one: as explained before, the further a value is from 0, the stronger the pattern.
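
Ranking by absolute value can be sketched as follows. The DataFrame is invented (a `target` column plus three made-up features), so the numbers say nothing about the Titanic data; the sketch only shows the select, drop and sort steps working together.

```python
import pandas as pd

toy = pd.DataFrame({
    "target": [0, 1, 1, 0, 1, 0],
    "feat_pos": [1, 3, 4, 1, 5, 2],   # tends to rise with target
    "feat_neg": [9, 2, 3, 8, 1, 7],   # tends to fall as target rises
    "feat_weak": [1, 2, 1, 2, 2, 1],  # only loosely related to target
})

# Select the 'target' row of the matrix, then drop its correlation with itself.
target_corr = toy.corr().loc["target"].drop("target")

# Sort by magnitude while keeping the original signs visible.
ranked = target_corr.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked.round(2))
```

Sorting the raw values instead would push `feat_neg` to the bottom of the list, even though it is the strongest signal in this toy example.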

Now we can start programming the SVM.