# SVM with scikit-learn

The goal of this notebook is to build an SVM model with the scikit-learn library to predict sentiment from product reviews. You will do the following:

 * Load the product reviews dataset, as in the previous module.
 * Implement an SVM model using scikit-learn.
 * Tune some hyperparameters.

In [4]:
# Import some libs

import pandas
import numpy as np
from utils import get_product_reviews_data

## Load product reviews dataset
As in the previous module, we load and preprocess the data, then split it into training and validation sets. That is not the focus of this notebook, so you can just run the following cells. The data-loading code is in the **utils** folder.

In [5]:
train_set, val_set = get_product_reviews_data()

sentiment_X_train, sentiment_y_train = train_set
sentiment_X_valid, sentiment_y_valid = val_set

*****Sentiment data shape*****
sentiment_X_train.shape:  (42458, 194)
sentiment_y_train.shape:  (42458,)
sentiment_X_valid.shape:  (10614, 194)
sentiment_y_valid.shape:  (10614,)
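Before training, it can help to sanity-check the split. The sketch below is only an illustration: it uses synthetic stand-in arrays with the shapes printed above, and the `+1`/`-1` label encoding is an assumption (the real arrays come from `get_product_reviews_data()`).

```python
import numpy as np

# Synthetic stand-ins with the shapes printed above. The real arrays come from
# get_product_reviews_data(); the +1/-1 label encoding is an assumption.
rng = np.random.default_rng(0)
X_train = np.zeros((42458, 194), dtype=np.float32)
y_train = rng.choice([-1, 1], size=42458)
X_valid = np.zeros((10614, 194), dtype=np.float32)
y_valid = rng.choice([-1, 1], size=10614)

# Features and labels must agree on the number of rows, and both splits
# must have the same number of feature columns.
assert X_train.shape[0] == y_train.shape[0]
assert X_valid.shape[0] == y_valid.shape[0]
assert X_train.shape[1] == X_valid.shape[1]

# Fraction of all examples held out for validation.
valid_fraction = X_valid.shape[0] / (X_train.shape[0] + X_valid.shape[0])
print("Validation fraction: {:.3f}".format(valid_fraction))

# Class balance: a heavily skewed label distribution would make accuracy misleading.
labels, counts = np.unique(y_train, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))
```

With the real data, you would run the same checks on `sentiment_X_train`, `sentiment_y_train`, `sentiment_X_valid` and `sentiment_y_valid` directly.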


# Build a classifier using scikit-learn
Now, let's use the built-in SVM learner [sklearn.svm.LinearSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html). For more details about the LinearSVC class, see the scikit-learn documentation. LinearSVC is similar to an SVM with a linear kernel (`SVC(kernel="linear")`), but it is implemented with liblinear rather than libsvm and scales better to a large number of samples. We use it here because our dataset is fairly large.

In [6]:
from sklearn.svm import LinearSVC

# Linear SVM; C controls regularization (smaller C means stronger regularization)
sentiment_clf = LinearSVC(C=1, random_state=0, max_iter=2000)
sentiment_clf.fit(sentiment_X_train, sentiment_y_train)

# score() returns the mean accuracy on the given data
print("***Sentiment result***")
print("Train accuracy: {}".format(sentiment_clf.score(sentiment_X_train, sentiment_y_train)))
print("Validation accuracy: {}".format(sentiment_clf.score(sentiment_X_valid, sentiment_y_valid)))



***Sentiment result***
Train accuracy: 0.7777332893683169
Validation accuracy: 0.7710570944036179
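To make concrete what the fitted linear classifier does, here is a minimal sketch on a small synthetic dataset (an assumption; it stands in for the review features). It recomputes the score `w . x + b` by hand from `coef_` and `intercept_` and checks it against `decision_function` and `predict`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Small synthetic binary problem standing in for the review features.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

clf = LinearSVC(C=1, random_state=0, max_iter=2000)
clf.fit(X, y)

# A linear SVM scores each example as w . x + b ...
w = clf.coef_.ravel()
b = clf.intercept_[0]
manual_scores = X @ w + b

# ... and predicts the second class in classes_ when the score is above zero.
manual_pred = clf.classes_[(manual_scores > 0).astype(int)]

scores_match = np.allclose(manual_scores, clf.decision_function(X))
preds_match = np.array_equal(manual_pred, clf.predict(X))
print("Scores match decision_function:", scores_match)
print("Predictions match predict:", preds_match)
```

The same attributes are available on `sentiment_clf` after fitting, so the learned weights can be inspected per feature.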


# Hyperparameter tuning
Let's tune a hyperparameter to see if we can get better results. We will use [sklearn.model_selection.GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to tune the C hyperparameter of the SVM classifier.

In [7]:
from sklearn.model_selection import GridSearchCV

tuned_parameters = {
    "C": [1, 2, 5, 10, 20, 100]
}

# Cross-validated grid search over C (the output below shows 5 folds per candidate)
tuned_sentiment_cls = GridSearchCV(
    LinearSVC(C=1),
    param_grid=tuned_parameters,
    n_jobs=2,
    verbose=1,
)

tuned_sentiment_cls.fit(sentiment_X_train, sentiment_y_train)

# With the default refit=True, score() uses the best model refit on the full training set
print("***Sentiment result***")
print("Train accuracy: {}".format(tuned_sentiment_cls.score(sentiment_X_train, sentiment_y_train)))
print("Validation accuracy: {}".format(tuned_sentiment_cls.score(sentiment_X_valid, sentiment_y_valid)))

Fitting 5 folds for each of 6 candidates, totalling 30 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  30 out of  30 | elapsed:  2.5min finished


***Sentiment result***
Train accuracy: 0.7778039474304018
Validation accuracy: 0.771528170341059


As you can see, the results are only slightly better: the tuned model classifies 5 more of the 10,614 validation reviews correctly.
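To see which value of C a grid search selected and how each candidate scored, you can inspect the fitted `GridSearchCV` object. The sketch below is a hedged illustration on synthetic data, not the notebook's actual run; the grid values (including C below 1) and `max_iter=5000` are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic stand-in data; the real notebook uses the review features.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Illustrative grid that also covers C < 1 (stronger regularization).
param_grid = {"C": [0.01, 0.1, 1, 10]}

search = GridSearchCV(
    LinearSVC(random_state=0, max_iter=5000),
    param_grid=param_grid,
    cv=5,
)
search.fit(X, y)

# Best hyperparameter and its mean cross-validated accuracy.
print("Best C:", search.best_params_["C"])
print("Best mean CV accuracy:", search.best_score_)

# Mean cross-validated accuracy for every candidate in the grid.
results = search.cv_results_
for params, mean_score in zip(results["params"], results["mean_test_score"]):
    print(params, mean_score)
```

On the notebook's own search, the same attributes (`best_params_`, `best_score_`, `cv_results_`) are available on `tuned_sentiment_cls`.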
<br>
**Quiz**: What is the validation accuracy?
<br>
**Your answer**: