# Explore the Settings of SVM

*Yuhui Hong <yuhhong@iu.edu>*

Here, the 'Hillary Clinton' target is used as an example to explore the effect of each SVM parameter. We then optimized the parameters for each target one by one; the final results are in `a.py`. 

## 0. Load data

In [1]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn import svm
from sklearn.metrics import accuracy_score

from a_util import TweetsData
from a import per_SVM

# change paths if necessary
TRAIN_SET_PATH = './StanceDataset/train.csv'
TEST_SET_PATH = './StanceDataset/test.csv'
TARGET_LIST = ['Hillary Clinton', 'Climate Change is a Real Concern', 'Legalization of Abortion', 'Atheism', 'Feminist Movement']
STANCE_DICT = {'AGAINST': 0, 'NONE': 1, 'FAVOR': 2}

In [2]:
### 1: Read in train.csv and test.csv. 
# 'latin1' avoids a Unicode decode error
df_train = pd.read_csv(TRAIN_SET_PATH, engine='python', dtype='str', encoding='latin1')
df_test = pd.read_csv(TEST_SET_PATH, engine='python', dtype='str', encoding='latin1')

### 2: Preprocess on data (details in `a_1_2_util.py`).
### 3: Extract a bag-of-words list of nouns, adj, and verbs from original Tweets.
data_train = TweetsData(df_train) # init a TweetsData
print("Load {} training data from {}".format(len(data_train), TRAIN_SET_PATH))
data_test = TweetsData(df_test) # init a TweetsData
print("Load {} test data from {}\n".format(len(data_test), TEST_SET_PATH))
# print("Targets in train: {}".format(data_train.get_targets())) 
# print("Targets in test: {}".format(data_test.get_targets()))  

Load 2914 training data from ./StanceDataset/train.csv
Load 1956 test data from ./StanceDataset/test.csv



## 1. Default Settings

In [3]:
clf = svm.SVC(C=1.0, kernel='rbf', decision_function_shape='ovr', class_weight=None)
per_SVM(data_train, data_test, clf, target='Hillary Clinton')

>>> Hillary Clinton
X_train: (689, 3127), Y_train: (689,)
X_test: (295, 3127), Y_test: (295,)
Training the SVM...
Done!
Accuracy score: 0.6305084745762712
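
`per_SVM` is imported from `a.py` and its body is not shown in this notebook. As a rough illustration only, a helper of this kind could look like the sketch below. It assumes bag-of-words features built with `CountVectorizer`; the function name `fit_and_score` and the toy tweets are hypothetical and are not taken from `a.py` or the stance dataset.

```python
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score


def fit_and_score(train_texts, train_labels, test_texts, test_labels, clf):
    """Vectorize texts as bag-of-words, train clf, and return the test accuracy."""
    vectorizer = CountVectorizer()
    # Learn the vocabulary on the training texts only, then reuse it for the test texts.
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)
    clf.fit(X_train, train_labels)
    return accuracy_score(test_labels, clf.predict(X_test))


# Made-up toy data; labels follow STANCE_DICT (0 = AGAINST, 2 = FAVOR).
train_texts = ["great candidate strong leader", "terrible candidate bad policy",
               "strong policy great plan", "bad leader terrible plan"]
train_labels = [2, 0, 2, 0]
test_texts = ["great plan", "terrible leader"]
test_labels = [2, 0]

score = fit_and_score(train_texts, train_labels, test_texts, test_labels,
                      svm.SVC(kernel='linear'))
print("Accuracy score: {}".format(score))
```

Fitting the vectorizer on the training texts alone keeps the train and test matrices on the same vocabulary, which matches the equal feature counts printed above.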



## 2. Kernel type

**kernel: {‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’}, default=’rbf’**

Related parameters: 

- **degree: int, default=3**
- **gamma: {‘scale’, ‘auto’} or float, default=’scale’**
- **coef0: float, default=0.0**

In our experience, the linear and RBF kernels are the most commonly used. If the feature dimension is large enough, the data can become linearly separable in the high-dimensional space, and the linear kernel performs well and trains fast. If the feature dimension is not large enough, the RBF kernel can be a good choice. Below we run the kernels one by one and adjust their related parameters. 
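
The comparison described above can be sketched as a simple loop over kernel names. This is only an illustration: it uses synthetic data from `make_classification` as a stand-in for the tweet features, so the accuracies it prints say nothing about the stance dataset.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic three-class data (an assumption, not the stance dataset).
X, y = make_classification(n_samples=300, n_features=50, n_informative=10,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

scores = {}
for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
    # Each kernel keeps its default degree, gamma and coef0.
    clf = svm.SVC(kernel=kernel)
    clf.fit(X_train, y_train)
    scores[kernel] = accuracy_score(y_test, clf.predict(X_test))

for kernel, acc in scores.items():
    print("{:8s} accuracy: {:.3f}".format(kernel, acc))
```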

### 2.1 kernel = 'linear'

$$K(x_i,x_j)=x_i^Tx_j$$

As the equation above shows, the linear kernel has no additional parameters. 

In [4]:
clf = svm.SVC(kernel='linear')
per_SVM(data_train, data_test, clf, target='Hillary Clinton')

>>> Hillary Clinton
X_train: (689, 3127), Y_train: (689,)
X_test: (295, 3127), Y_test: (295,)
Training the SVM...
Done!
Accuracy score: 0.6101694915254238



### 2.2 kernel = 'poly'

$$K(x_i,x_j)=(\gamma x_i^Tx_j + r)^d, d>1$$

As the equation above shows, the polynomial kernel has three parameters: $d$, $\gamma$ and $r$ (`degree`, `gamma` and `coef0` in scikit-learn). We adjust them one by one. 

Here are the default polynomial kernel settings: 
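
To make the formula concrete, the sketch below evaluates it by hand on a small random matrix and compares the result with scikit-learn's `polynomial_kernel`. The matrix and the parameter values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

rng = np.random.default_rng(0)
X = rng.random((4, 3))  # 4 samples, 3 features

d, gamma, r = 3, 0.5, 1.0

# K(x_i, x_j) = (gamma * x_i^T x_j + r)^d for every pair of rows.
K_manual = (gamma * (X @ X.T) + r) ** d
K_sklearn = polynomial_kernel(X, degree=d, gamma=gamma, coef0=r)

same = np.allclose(K_manual, K_sklearn)
print(same)
```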

In [14]:
clf = svm.SVC(kernel='poly', degree=3, gamma='scale', coef0=0)
per_SVM(data_train, data_test, clf, target='Hillary Clinton')

>>> Hillary Clinton
X_train: (689, 3127), Y_train: (689,)
X_test: (295, 3127), Y_test: (295,)
Training the SVM...
Done!
Accuracy score: 0.5830508474576271



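Let's adjust $d$! The best degree is $1$. With $d=1$ the polynomial kernel reduces to a scaled and shifted linear kernel, $\gamma x_i^Tx_j + r$. 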

In [11]:
clf = svm.SVC(kernel='poly', degree=1)
per_SVM(data_train, data_test, clf, target='Hillary Clinton')

>>> Hillary Clinton
X_train: (689, 3127), Y_train: (689,)
X_test: (295, 3127), Y_test: (295,)
Training the SVM...
Done!
Accuracy score: 0.6305084745762712



Let's adjust $\gamma$! There are three choices: 'auto', 'scale' (default), or a float value. 

- If `gamma='scale'` (default) is passed, it uses `1 / (n_features * X.var())` as the value of gamma.

- If `gamma='auto'`, it uses `1 / n_features`.

The value of $\gamma$ also changes how many support vectors the model keeps. In general, a larger $\gamma$ lets the model fit the training data more closely, which may lead to worse performance on the test data, while a smaller $\gamma$ constrains the model more strongly. Here, we only try 'scale' and 'auto' rather than other float values, and 'scale' performs better. 
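
The two named settings can be worked out by hand from the formulas listed above. The sketch below does this for a small, made-up count matrix; the values are for illustration only.

```python
import numpy as np

# Toy count matrix standing in for bag-of-words features: 4 samples, 4 features.
X = np.array([[0, 1, 0, 2],
              [1, 0, 0, 0],
              [0, 0, 3, 1],
              [2, 1, 0, 0]], dtype=float)

n_features = X.shape[1]

# gamma='scale': 1 / (n_features * X.var()), using the variance over all entries.
gamma_scale = 1.0 / (n_features * X.var())
# gamma='auto': 1 / n_features, independent of the feature values.
gamma_auto = 1.0 / n_features

print("scale:", gamma_scale)
print("auto: ", gamma_auto)
```

Because 'scale' depends on the variance of the features, the two settings only coincide when `X.var()` equals 1.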

In [13]:
clf = svm.SVC(kernel='poly', degree=1, gamma='scale')
per_SVM(data_train, data_test, clf, target='Hillary Clinton')

>>> Hillary Clinton
X_train: (689, 3127), Y_train: (689,)
X_test: (295, 3127), Y_test: (295,)
Training the SVM...
Done!
Accuracy score: 0.6305084745762712



Let's adjust $r$! When the degree equals $1$, $r$ does not affect the results. 
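
A short derivation suggests why $r$ drops out when $d=1$. It is a sketch under the standard soft-margin dual formulation, with dual variables $\alpha_i$, binary labels $y_i \in \{-1, +1\}$ and bias $b$; this notation is an assumption and is not used elsewhere in this notebook. With $d=1$ the kernel is

$$K(x_i,x_j)=\gamma x_i^Tx_j + r$$

so the quadratic term of the dual objective becomes

$$\sum_{i,j}\alpha_i\alpha_j y_i y_j K(x_i,x_j)=\gamma\sum_{i,j}\alpha_i\alpha_j y_i y_j x_i^Tx_j + r\Big(\sum_i \alpha_i y_i\Big)^2$$

The dual problem carries the constraint $\sum_i \alpha_i y_i = 0$, so the last term vanishes and the optimal $\alpha_i$ do not depend on $r$. The same constraint removes $r$ from the decision function:

$$f(x)=\sum_i \alpha_i y_i K(x_i,x) + b=\gamma\sum_i \alpha_i y_i x_i^Tx + b$$

`SVC` handles the three stance classes by training binary one-vs-one problems internally, so the argument applies to each of them.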

In [17]:
clf = svm.SVC(kernel='poly', degree=1, gamma='scale', coef0=100)
per_SVM(data_train, data_test, clf, target='Hillary Clinton')

>>> Hillary Clinton
X_train: (689, 3127), Y_train: (689,)
X_test: (295, 3127), Y_test: (295,)
Training the SVM...
Done!
Accuracy score: 0.6305084745762712



### 2.3 kernel = 'rbf'

$$K(x_i,x_j)=\exp(-\gamma \|x_i-x_j\|^2),\gamma>0$$

As the equation above shows, the radial basis function (RBF) kernel has one parameter, $\gamma$. 
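
To see what $\gamma$ does in this formula, the sketch below evaluates the kernel by hand for a small and a large $\gamma$ and checks one of them against scikit-learn's `rbf_kernel`. The three points, the two $\gamma$ values and the helper name `rbf_by_hand` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

X = np.array([[0.0, 1.0], [1.0, 1.0], [3.0, 0.0]])


def rbf_by_hand(X, gamma):
    """Return exp(-gamma * ||x_i - x_j||^2) for every pair of rows in X."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)


K_small = rbf_by_hand(X, gamma=0.1)
K_large = rbf_by_hand(X, gamma=10.0)

# The hand-written version should agree with scikit-learn's implementation.
matches = np.allclose(K_small, rbf_kernel(X, gamma=0.1))

print(np.round(K_small, 3))
print(np.round(K_large, 3))
```

With the larger $\gamma$ the off-diagonal entries shrink towards zero, so each sample is similar almost only to itself.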

As in the experiment above, we adjusted $\gamma$. The results show that 'scale' performs better. 

In [19]:
clf = svm.SVC(kernel='rbf', gamma='scale')
per_SVM(data_train, data_test, clf, target='Hillary Clinton')

>>> Hillary Clinton
X_train: (689, 3127), Y_train: (689,)
X_test: (295, 3127), Y_test: (295,)
Training the SVM...
Done!
Accuracy score: 0.6305084745762712



### 2.4 kernel = 'sigmoid'

$$K(x_i,x_j)=\tanh(\gamma x_i^Tx_j + r ), \gamma>0, r<0$$

As the equation above shows, the sigmoid kernel has two parameters, $\gamma$ and $r$. 

As in the experiments above, we adjusted $\gamma$ and $r$. The results show that `gamma='scale'` and `coef0=0` perform better. 

In [20]:
clf = svm.SVC(kernel='sigmoid', gamma='scale', coef0=0)
per_SVM(data_train, data_test, clf, target='Hillary Clinton')

>>> Hillary Clinton
X_train: (689, 3127), Y_train: (689,)
X_test: (295, 3127), Y_test: (295,)
Training the SVM...
Done!
Accuracy score: 0.6237288135593221



## 3. Regularization parameter

**C: float, default=1.0**

Reference: [Intuition for the regularization parameter in SVM](https://datascience.stackexchange.com/questions/4943/intuition-for-the-regularization-parameter-in-svm)

The regularization parameter controls how much importance is given to misclassifications. SVM poses a quadratic optimization problem that maximizes the margin between the classes while minimizing the number of misclassifications. For non-separable problems, the misclassification constraint must be relaxed in order to find a solution, and the regularization parameter sets how far it is relaxed. In scikit-learn this parameter is `C`, and the strength of the regularization is inversely proportional to `C`. 

If `C` is too small, the model underfits. If `C` is too large, the model overfits. 

In this example, `C=10` gives the best accuracy, tied with `C=100`. 
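
The effect of `C` can be sketched with a small sweep that records both training and test accuracy. This is an illustration on synthetic data from `make_classification`, not on the stance dataset, so the numbers will differ from the experiments below.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic three-class data (an assumption, not the stance dataset).
X, y = make_classification(n_samples=300, n_features=50, n_informative=10,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

results = {}
for C in [0.1, 1, 10, 100]:
    clf = svm.SVC(C=C, kernel='rbf', gamma='scale').fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for each value of C.
    results[C] = (clf.score(X_train, y_train), clf.score(X_test, y_test))

for C, (train_acc, test_acc) in results.items():
    print("C={:<5} train: {:.3f}  test: {:.3f}".format(C, train_acc, test_acc))
```

A training accuracy that keeps rising with `C` while the test accuracy stalls or drops is the usual sign of overfitting.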

### 3.1 C = 0.1

In [26]:
clf = svm.SVC(C=0.1, kernel='rbf', gamma='scale')
per_SVM(data_train, data_test, clf, target='Hillary Clinton')

>>> Hillary Clinton
X_train: (689, 3127), Y_train: (689,)
X_test: (295, 3127), Y_test: (295,)
Training the SVM...
Done!
Accuracy score: 0.5830508474576271



### 3.2 C=10

In [27]:
clf = svm.SVC(C=10, kernel='rbf', gamma='scale')
per_SVM(data_train, data_test, clf, target='Hillary Clinton')

>>> Hillary Clinton
X_train: (689, 3127), Y_train: (689,)
X_test: (295, 3127), Y_test: (295,)
Training the SVM...
Done!
Accuracy score: 0.6338983050847458



### 3.3 C=100

In [28]:
clf = svm.SVC(C=100, kernel='rbf', gamma='scale')
per_SVM(data_train, data_test, clf, target='Hillary Clinton')

>>> Hillary Clinton
X_train: (689, 3127), Y_train: (689,)
X_test: (295, 3127), Y_test: (295,)
Training the SVM...
Done!
Accuracy score: 0.6338983050847458



## 4. Class weight of regularization parameter

**class_weight: dict or ‘balanced’, default=None**

This parameter sets the regularization parameter `C` of class i to `class_weight[i]*C` for SVC. If not given, all classes are assumed to have weight one. The ‘balanced’ mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data, as `n_samples / (n_classes * np.bincount(y))`. Let's try the ‘balanced’ mode. 

In this example, the class weights do not affect the results. 
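
The 'balanced' formula can be checked by hand. The sketch below applies it to a small, made-up label vector and compares the result with scikit-learn's `compute_class_weight`; the labels are hypothetical and only reuse the encoding from `STANCE_DICT`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 0 = AGAINST (5 samples), 1 = NONE (3 samples), 2 = FAVOR (2 samples).
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 2, 2])

n_samples = len(y)
n_classes = len(np.unique(y))

# n_samples / (n_classes * np.bincount(y)): rarer classes get larger weights.
weights_manual = n_samples / (n_classes * np.bincount(y))
weights_sklearn = compute_class_weight(class_weight='balanced',
                                       classes=np.unique(y), y=y)

print(weights_manual)
print(weights_sklearn)
```

Each class's `C` is multiplied by its weight, so misclassifying a sample from a rare class costs more than misclassifying one from a frequent class.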

In [29]:
clf = svm.SVC(C=10, kernel='rbf', gamma='scale', class_weight='balanced')
per_SVM(data_train, data_test, clf, target='Hillary Clinton')

>>> Hillary Clinton
X_train: (689, 3127), Y_train: (689,)
X_test: (295, 3127), Y_test: (295,)
Training the SVM...
Done!
Accuracy score: 0.6338983050847458



## 5. Best settings

The experiments above suggest a general procedure for tuning an SVM: 

- Check which kernel performs best with its default related parameters. 
    In most situations, it is the linear kernel or the RBF kernel. If the feature dimension is large enough compared to the number of samples, the data can become linearly separable in the high-dimensional space, and the linear kernel performs well and trains fast. If the feature dimension is not large enough, the RBF kernel can be a good choice. 
- Adjust the regularization parameter.
- Check whether a balanced class weight needs to be used.
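
The steps above can also be run as one automated search. The sketch below swaps the one-at-a-time manual tuning used in this notebook for a cross-validated grid search with `GridSearchCV`, which is a different technique: it selects settings on the training set only and tries all combinations. It runs on synthetic data, and the grid values are assumptions chosen to mirror the experiments above.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic three-class data (an assumption, not the stance dataset).
X, y = make_classification(n_samples=300, n_features=50, n_informative=10,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

param_grid = {
    'kernel': ['linear', 'rbf'],         # step 1: kernel type
    'C': [0.1, 1, 10, 100],              # step 2: regularization parameter
    'class_weight': [None, 'balanced'],  # step 3: class weights
}

# 5-fold cross-validation on the training set picks the best combination.
search = GridSearchCV(svm.SVC(gamma='scale'), param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print("Best settings:", search.best_params_)
print("Test accuracy: {:.3f}".format(test_accuracy))
```

Selecting the settings by cross-validation keeps the test set out of the tuning loop, so the final test accuracy is a less optimistic estimate.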

The best settings for this example are therefore: 

In [30]:
clf = svm.SVC(C=10, kernel='rbf', gamma='scale', class_weight=None)
per_SVM(data_train, data_test, clf, target='Hillary Clinton')

>>> Hillary Clinton
X_train: (689, 3127), Y_train: (689,)
X_test: (295, 3127), Y_test: (295,)
Training the SVM...
Done!
Accuracy score: 0.6338983050847458

