# SVM Example

This example uses a slightly modified SMS Spam data set from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). The UCI data set is a collection of 5574 SMS messages that have been labeled as ham or spam. The file is tab-delimited, with the label in the first column and the message content in the second. I edited the data set to remove some unwanted columns and add headings.

The SVM algorithm finds a separating hyperplane to divide classes. The goal is to find the hyperplane with the widest margin between the classes. Observations that lie on the margins (or violate them) are called *support vectors*. Once the support vectors are found, they alone determine the hyperplane, and new observations are classified by the side of the hyperplane on which they fall.
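To make the idea of support vectors concrete, here is a minimal sketch on a hand-made 2D data set. The points and the names `X_toy`, `y_toy`, and `toy_clf` are made up for illustration and are not part of the SMS data. A large C is used so that the fit behaves approximately like a hard-margin classifier.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two well-separated clusters in 2D
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                  [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# Large C: margin violations are penalized heavily
toy_clf = SVC(kernel='linear', C=1000).fit(X_toy, y_toy)

# Only the observations on the margin are kept as support vectors
print('support vectors:\n', toy_clf.support_vectors_)

# The hyperplane is w . x + b = 0; its sign decides the predicted class
print('w:', toy_clf.coef_, 'b:', toy_clf.intercept_)
toy_pred = toy_clf.predict([[0.5, 0.5], [3.5, 3.5]])
print('predictions:', toy_pred)
```

Points far from the boundary, such as (0, 0) and (4, 3) here, could be removed without changing the fitted hyperplane.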

![image](svm.png)

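SVM classifiers can use a linear, polynomial, or radial kernel. SVM classifiers have a hyperparameter C which controls how much margin violation is tolerated during training, including observations that fall on the wrong side of the line. The amount by which each observation violates the margin is measured by a *slack variable*. In scikit-learn, C is the penalty on these violations: larger values of C tolerate fewer violations and fit the training data more closely (higher variance), while smaller values of C allow a wider, more regularized margin (higher bias).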
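As a rough illustration of how C behaves in scikit-learn's `SVC`, where C scales the penalty on margin violations, the sketch below fits a linear SVM on synthetic overlapping clusters for several values of C and counts the support vectors. The data and the names `X_blob`, `y_blob`, and `sv_counts` are assumptions for this example only.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic, overlapping 2D clusters (illustrative only, not the SMS data)
X_blob, y_blob = make_blobs(n_samples=200, centers=[[0, 0], [2, 2]],
                            cluster_std=1.0, random_state=0)

sv_counts = {}
for C in [0.01, 1, 100]:
    model = SVC(kernel='linear', C=C).fit(X_blob, y_blob)
    # n_support_ holds the number of support vectors for each class
    sv_counts[C] = int(model.n_support_.sum())
    print('C=%g: %d support vectors, train accuracy %.3f'
          % (C, sv_counts[C], model.score(X_blob, y_blob)))
```

A small C gives a wide margin that many observations fall inside, so more of them become support vectors; a large C narrows the margin and relies on fewer observations.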

Polynomial and radial kernel SVMs have an additional hyperparameter, gamma, which controls the bias-variance tradeoff. Higher gamma values result in higher variance, and might overfit the training data. 
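The effect of gamma can be sketched by comparing train and test accuracy for a radial kernel on a synthetic data set. The data (`make_moons`) and the variable names below are assumptions for illustration. With a very large gamma, each training observation influences only its immediate neighborhood, which typically produces a near-perfect fit on the training data that does not carry over to the test data.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data with noise (illustrative only)
X_moon, y_moon = make_moons(n_samples=300, noise=0.3, random_state=0)
Xm_train, Xm_test, ym_train, ym_test = train_test_split(
    X_moon, y_moon, test_size=0.3, random_state=0)

gamma_scores = {}
for gamma in [0.01, 1, 1000]:
    model = SVC(kernel='rbf', C=1.0, gamma=gamma).fit(Xm_train, ym_train)
    # store (train accuracy, test accuracy) for each gamma
    gamma_scores[gamma] = (model.score(Xm_train, ym_train),
                           model.score(Xm_test, ym_test))
    print('gamma=%g: train %.3f, test %.3f'
          % ((gamma,) + gamma_scores[gamma]))
```

A large gap between train and test accuracy at high gamma is the overfitting described above.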

In [4]:
import pandas as pd
df = pd.read_csv('data/sms-spam.csv', header=0, usecols=[1,2], encoding='latin-1')
print('rows and columns:', df.shape)
print(df.head())

rows and columns: (4837, 2)
   spam                                               text
0     0  Go until jurong point, crazy.. Available only ...
1     0                      Ok lar... Joking wif u oni...
2     1  Free entry in 2 a wkly comp to win FA Cup fina...
3     0  U dun say so early hor... U c already then say...
4     0  Nah I don't think he goes to usf, he lives aro...


In [5]:
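# text preprocessing
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

# requires the NLTK stop word list: nltk.download('stopwords')
# use a new name so the nltk stopwords module is not overwritten;
# stop_words is documented as 'english', a list, or None, so pass the list
stop_words = stopwords.words('english')
vectorizer = TfidfVectorizer(stop_words=stop_words, binary=True)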

In [6]:
# set up X and y
X = vectorizer.fit_transform(df.text)
y = df.spam

In [7]:
# divide into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, train_size=0.8, random_state=1234)
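
The cells above fit the vectorizer on all messages before splitting, so the vocabulary and IDF weights also reflect the test messages. One common way to avoid this is to wrap the vectorizer and the classifier in a pipeline and fit it on the training text only. The sketch below shows the idea on a few made-up messages. The texts, labels, and the name `pipe` are hypothetical, and scikit-learn's built-in `'english'` stop word list is used instead of NLTK's to keep the example self-contained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy messages, not taken from the SMS data set
train_texts = ["win a free prize now", "free entry to win cash",
               "are you coming home for dinner", "see you at lunch tomorrow"]
train_labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

# The vectorizer is fit only on the text passed to pipe.fit()
pipe = make_pipeline(TfidfVectorizer(stop_words='english', binary=True),
                     SVC(kernel='linear'))
pipe.fit(train_texts, train_labels)

# New text is transformed with the training vocabulary, then classified
toy_text_pred = pipe.predict(["free cash prize", "dinner tomorrow"])
print(toy_text_pred)
```

With this arrangement, the raw text column would be split with `train_test_split` first and the pipeline fit on the training portion.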

### Train and test

Train on the train data and then evaluate on the test data.

In [8]:
from sklearn import svm
classifier = svm.SVC(kernel='linear')
classifier.fit(X_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='linear', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

In [9]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score, recall_score, f1_score
pred = classifier.predict(X_test)
print('accuracy score: ', accuracy_score(y_test, pred))
print('precision score: ', precision_score(y_test, pred))
print('recall score: ', recall_score(y_test, pred))
print('f1 score: ', f1_score(y_test, pred))

accuracy score:  0.981404958677686
precision score:  0.9903846153846154
recall score:  0.8583333333333333
f1 score:  0.9196428571428571


This is a higher accuracy and precision than Naive Bayes, and we just used default settings. Perhaps even higher accuracy could be achieved if we tuned the C parameter. 

The best values for hyperparameters are often found with cross validation. With cross validation, the available training data is divided into a set number of portions (folds), usually 5 or 10. On each iteration, one portion is held out as a test set and the model is trained on the remaining portions. The results are then averaged across iterations.

![image](cv.png)
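
The procedure can be written out by hand as a short sketch. The synthetic data from `make_classification` and the names `X_cv`, `y_cv`, and `fold_scores` are assumptions for illustration. In practice, helpers such as `cross_val_score` or `GridSearchCV` run this loop for you.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.svm import SVC

# Synthetic data (illustrative only, not the SMS data)
X_cv, y_cv = make_classification(n_samples=100, n_features=5, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, val_idx in kfold.split(X_cv):
    # train on four portions, evaluate on the held-out portion
    model = SVC(kernel='linear', C=1.0).fit(X_cv[train_idx], y_cv[train_idx])
    fold_scores.append(model.score(X_cv[val_idx], y_cv[val_idx]))

print('fold accuracies:', fold_scores)
print('mean accuracy:', np.mean(fold_scores))
```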

Cross validation is a useful alternative to a single train-test split when the size of the data is very small. In the case below, it is applied to the training data, with each portion taking a turn as the 'validation' set, to identify the best hyperparameters.

In [13]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.svm import SVC
import warnings

# Set the parameters by cross-validation
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

scores = ['precision', 'recall']

for score in scores:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        print("# Tuning hyper-parameters for %s" % score)
        print()

        clf = GridSearchCV(SVC(), tuned_parameters, cv=5,
                           scoring='%s_macro' % score)
        clf.fit(X_train, y_train)

        print("Best parameters set found on development set:")
        print()
        print(clf.best_params_)
        print()
        print("Grid scores on development set:")
        print()
        means = clf.cv_results_['mean_test_score']
        stds = clf.cv_results_['std_test_score']
        for mean, std, params in zip(means, stds, clf.cv_results_['params']):
            print("%0.3f (+/-%0.03f) for %r"
                  % (mean, std * 2, params))
        print()

        print("Detailed classification report:")
        print()
        print("The model is trained on the full development set.")
        print("The scores are computed on the full evaluation set.")
        print()
        y_true, y_pred = y_test, clf.predict(X_test)
        print(classification_report(y_true, y_pred))
        print()

# Tuning hyper-parameters for precision

Best parameters set found on development set:

{'C': 1000, 'gamma': 0.001, 'kernel': 'rbf'}

Grid scores on development set:

0.433 (+/-0.001) for {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}
0.433 (+/-0.001) for {'C': 1, 'gamma': 0.0001, 'kernel': 'rbf'}
0.433 (+/-0.001) for {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
0.433 (+/-0.001) for {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}
0.962 (+/-0.013) for {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}
0.433 (+/-0.001) for {'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'}
0.983 (+/-0.006) for {'C': 1000, 'gamma': 0.001, 'kernel': 'rbf'}
0.962 (+/-0.013) for {'C': 1000, 'gamma': 0.0001, 'kernel': 'rbf'}
0.979 (+/-0.011) for {'C': 1, 'kernel': 'linear'}
0.977 (+/-0.005) for {'C': 10, 'kernel': 'linear'}
0.977 (+/-0.005) for {'C': 100, 'kernel': 'linear'}
0.977 (+/-0.005) for {'C': 1000, 'kernel': 'linear'}

Detailed classification report:

The model is trained on the full development set.
The scores are computed on the full evaluation set.

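Run the SVM again with the suggested parameters. There wasn't a lot of room to improve, but accuracy and precision both improved very slightly (accuracy from 0.981 to 0.985, precision from 0.9904 to 0.9907), and recall improved more noticeably, from 0.858 to 0.883.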

In [14]:
# all other parameters are left at their default values
classifier = svm.SVC(C=1000, kernel='rbf', gamma=0.001)

classifier.fit(X_train, y_train)

pred = classifier.predict(X_test)

print('accuracy score: ', accuracy_score(y_test, pred))
print('precision score: ', precision_score(y_test, pred))
print('recall score: ', recall_score(y_test, pred))
print('f1 score: ', f1_score(y_test, pred))


accuracy score:  0.984504132231405
precision score:  0.9906542056074766
recall score:  0.8833333333333333
f1 score:  0.933920704845815


In [15]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, pred))

[[847   1]
 [ 14 106]]
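
The scores reported for the tuned model can be recovered directly from this confusion matrix. In scikit-learn's layout, rows are the true labels and columns are the predicted labels, so for labels 0 (ham) and 1 (spam) the matrix reads `[[TN, FP], [FN, TP]]`. The sketch below copies the matrix from the output above. The variable names are chosen for illustration.

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([[847, 1],
               [14, 106]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()     # correct predictions / all predictions
precision = tp / (tp + fp)          # predicted spam that really is spam
recall = tp / (tp + fn)             # actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)

print('accuracy: ', accuracy)
print('precision:', precision)
print('recall:   ', recall)
print('f1:       ', f1)
```

The single false positive explains the high precision, while the 14 spam messages classified as ham are what hold recall at about 0.88.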
