# Spam detection with SVMs

SVMs are a type of **supervised** algorithm whose task is to find the hyperplane that best separates classes of data that can be represented in a **multidimensional space**. In some cases several different hyperplanes can correctly separate the data; in that case we choose the hyperplane that **optimizes the margin**, which represents the distance between the hyperplane and the data.

The SVM can be considered an improvement on the Perceptron. The key difference is that with the Perceptron we tried to **minimize** classification errors, whereas with the SVM the goal is to maximize the margin, which is the distance between the hyperplane and the training samples *closest* to it (known as **support vectors**).

## The optimization strategy of SVMs
Choosing **wider margins** corresponds to fewer classification errors, while with **narrower margins** we risk incurring the phenomenon known as **overfitting** (we will discuss it later).

Now let's translate the SVM into mathematical terms. Similar to what we did for the Perceptron, we must define the conditions that must be met to ensure that the SVM correctly identifies the best hyperplane separating the classes of data.

$$ y = \sum w_{i}x_{i} + \beta \geq \mu $$

The constant $\beta$ represents the *bias* and $\mu$ represents the *margin*.
In practice we add the bias $\beta$ to $\sum w_{i}x_{i}$; this allows us to obtain a value greater than or equal to zero for samples that fall in the same **class label** (recall that $y$ takes the value -1 or +1 to distinguish between the corresponding classes).

The computed value is then compared with the margin $\mu$ to ensure that the distance between each sample and the separating hyperplane we identified is greater than or equal to our margin.
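
The condition above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the toy samples, weights, bias and margin are made up, the helper names `decision_values` and `satisfies_margin` are hypothetical, and multiplying the score by the label $y$ is an assumed way of testing both classes with a single inequality.

```python
import numpy as np

def decision_values(X, w, beta):
    """Return sum_i w_i * x_i + beta for every row of X."""
    return X @ w + beta

def satisfies_margin(X, y, w, beta, mu):
    """Check, per sample, that the label-signed score y * (w.x + beta) is at least mu."""
    return y * decision_values(X, w, beta) >= mu

# Hypothetical toy data: two features per sample, labels in {-1, +1}.
X_toy = np.array([[3.0, 0.0], [2.0, 1.0], [0.0, 3.0], [1.0, 1.5]])
y_toy = np.array([1, 1, -1, -1])
w_toy = np.array([1.0, -1.0])  # assumed weights
beta_toy = 0.0                 # assumed bias
mu_toy = 1.0                   # assumed margin

margin_ok = satisfies_margin(X_toy, y_toy, w_toy, beta_toy, mu_toy)
print(margin_ok)
```

With these made-up values the last sample lies on the correct side of the hyperplane but closer to it than the margin allows, so only that entry fails the check.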

## SVM example
As with the Perceptron, we will choose a linear classifier for the SVM, so that we can compare it with our previous one.
We use essentially the same dataset, but this time we have stored, for each message, the number of harmless words and the number of suspicious keywords, and we then compare those counts.


In [11]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

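df = pd.read_csv("./resources/smsSpamSvm.csv")
y = df.iloc[:, 0].values  # first column: message label
y = np.where(y == 'spam', -1, 1)  # 'spam' -> -1, any other label -> +1
X = df.iloc[:, [1, 2]].values  # second and third columns: the two features

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)  # hold out 30% for testing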


Now let's initialize our SVM (**support vector classifier**), choosing the linear classifier (*kernel = 'linear'*); then we can proceed to train the model (*fit()*) and finally predict the labels of the test data by invoking *predict()*:

In [15]:
from sklearn.svm import SVC

svm = SVC(kernel = 'linear', C = 1.0, random_state = 0)
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)
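
To connect the fitted model back to the formula $\sum w_{i}x_{i} + \beta$, the sketch below inspects a linear `SVC` after training. It runs on a small made-up dataset (the values in `X_demo` and `y_demo` are hypothetical, not taken from the CSV) and reads the weights, bias and support vectors through scikit-learn's `coef_`, `intercept_` and `support_vectors_` attributes.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy counts per message: [harmless words, suspicious keywords].
X_demo = np.array([[8, 0], [7, 1], [6, 0], [1, 5], [0, 6], [2, 7]], dtype=float)
y_demo = np.array([1, 1, 1, -1, -1, -1])  # -1 = spam, +1 = not spam

clf = SVC(kernel='linear', C=1.0, random_state=0)
clf.fit(X_demo, y_demo)

w = clf.coef_[0]       # weights of the separating hyperplane (linear kernel only)
b = clf.intercept_[0]  # bias term
print('weights:', w, 'bias:', b)
print('support vectors:')
print(clf.support_vectors_)

# For a linear kernel, decision_function matches the weighted sum plus bias.
scores = clf.decision_function(X_demo)
manual = X_demo @ w + b
print(np.allclose(scores, manual))
```

The support vectors printed here are the training samples closest to the hyperplane, which are the ones that determine the margin.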


Now let's evaluate the accuracy of our predictions using **sklearn.metrics**:

In [16]:
from sklearn.metrics import accuracy_score

print('Misclassified sample: %d' % (y_test != y_pred).sum())
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))

Misclassified sample: 7
Accuracy: 0.84
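
The two numbers reported above are related: accuracy is the fraction of test samples that were not misclassified. The sketch below shows this on made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical, not the results of the model above) and, as an optional extra, uses `confusion_matrix` to see which class the errors fall in.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels, only to illustrate the metrics (-1 = spam, +1 = not spam).
y_true_demo = np.array([1, 1, -1, -1, 1, -1, 1, 1])
y_pred_demo = np.array([1, -1, -1, -1, 1, 1, 1, 1])

misclassified = int((y_true_demo != y_pred_demo).sum())
manual_accuracy = 1 - misclassified / len(y_true_demo)

print('Misclassified samples: %d' % misclassified)
print('Accuracy (manual): %.2f' % manual_accuracy)
print('Accuracy (sklearn): %.2f' % accuracy_score(y_true_demo, y_pred_demo))

# Rows are true labels, columns are predicted labels, in the order [-1, +1].
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[-1, 1])
print(cm)
```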
