## **Support Vector Machines**

#### **Decision Boundaries**
* A classification function is represented by decision surfaces (boundaries)
* A decision boundary defines the region a data point must fall in to be assigned a given class
* **Data overfitting**: Decision boundary learned over training data doesn't generalize to test data
* **Linear Boundaries**
    * Line that separates classes
    * Easy to find
    * Easy to evaluate
    * More generalizable

SVMs are **linear classifiers** that find a hyperplane to separate **two classes** of data: positive and negative.

Given training data $(x_1, y_1), (x_2, y_2), \dots$, where each $x_i = (x_{i1}, x_{i2}, \dots, x_{in})$ is an instance vector and $y_i \in \{-1, +1\}$ is its label, an SVM finds a linear function defined by a weight vector $w$ and a bias $b$:

$f(x_i) = \langle w, x_i \rangle + b$

If $f(x_i) > 0$, the instance is classified as $+1$; otherwise it is classified as $-1$.
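The decision rule above can be sketched in a few lines of NumPy. The weight vector `w`, bias `b`, and sample points are hypothetical values picked by hand for illustration; a real SVM would learn `w` and `b` from the training data.

```python
import numpy as np

# Hypothetical weights and bias, chosen by hand (not learned from data).
w = np.array([2.0, -1.0])
b = -0.5

def decision_function(X):
    """Return f(x) = <w, x> + b for each row of X."""
    return X @ w + b

def predict(X):
    """Map f(x) > 0 to +1 and everything else to -1."""
    return np.where(decision_function(X) > 0, 1, -1)

X = np.array([
    [1.0, 0.0],    # f = 2.0 - 0.5 = 1.5   -> +1
    [0.0, 1.0],    # f = -1.0 - 0.5 = -1.5 -> -1
    [0.25, 0.0],   # f = 0.5 - 0.5 = 0.0   -> -1 (not strictly positive)
])

scores = decision_function(X)
labels = predict(X)
print(scores, labels)
```

Points that land exactly on the hyperplane ($f(x) = 0$) fall into the negative class under this rule, because the condition is a strict inequality.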

#### **Multi-class classification**
* **One vs Rest** - Train one classifier per class to differentiate that class from all the others
    * An $n$-class problem needs $n$ classifiers
* **One vs One** - Train one classifier for each pair of classes
    * An $n$-class problem needs $C(n, 2) = n(n-1)/2$ classifiers
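As a rough illustration of how the two strategies scale, the sketch below only enumerates the binary sub-problems each one would train; it does not fit any model. The class names `A` to `D` are placeholders.

```python
from itertools import combinations
from math import comb

# Placeholder class names for a 4-class problem.
classes = ["A", "B", "C", "D"]

# One vs Rest: one binary problem per class (that class against all others).
ovr_tasks = [(c, "rest") for c in classes]

# One vs One: one binary problem per unordered pair of classes.
ovo_tasks = list(combinations(classes, 2))

n_ovr = len(ovr_tasks)
n_ovo = len(ovo_tasks)

# The pair count matches C(n, 2).
assert n_ovo == comb(len(classes), 2)

print(n_ovr, "one-vs-rest classifiers")
print(n_ovo, "one-vs-one classifiers:", ovo_tasks)
```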

#### **SVM Parameters**
* **Regularization parameter $(C)$**: How much weight to give individual data points compared with a better-generalized model
    * Larger values of $C$ = less regularization: fit the training data as closely as possible; every data point is important.
    * Smaller values of $C$ = more regularization: more tolerant of errors on individual data points.
* **Kernels**
    * Linear kernels usually work better for text data
    * Other kernels include `rbf`, polynomial (`poly` in scikit-learn), etc.
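One way to see the role of the regularization parameter is the standard soft-margin objective, $\frac{1}{2}\lVert w \rVert^2 + C \sum_i \max(0,\ 1 - y_i f(x_i))$, where $C$ scales the penalty for margin violations. The sketch below evaluates that objective for a fixed, hand-picked `w` and `b` on toy data (all values are hypothetical). It does not train anything; it only shows how the same violation costs more as $C$ grows.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * sum of hinge losses max(0, 1 - y * f(x))."""
    margins = y * (X @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins)
    return 0.5 * np.dot(w, w) + C * hinge.sum()

# Toy data: the third point has label -1 but lies on the positive side.
X = np.array([[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]])
y = np.array([1, -1, -1])

# Hand-picked hyperplane (an assumption for illustration).
w = np.array([1.0, 0.0])
b = 0.0

low_c = soft_margin_objective(w, b, X, y, C=0.1)    # 0.5 + 0.1 * 1.5
high_c = soft_margin_objective(w, b, X, y, C=10.0)  # 0.5 + 10 * 1.5
print(low_c, high_c)
```

With a small $C$ the misclassified point barely changes the objective, so the optimizer can keep a wide margin; with a large $C$ the same point dominates, pushing the optimizer to fit it.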

#### **Important Concepts**
* SVMs are often among the most accurate classifiers, especially on high-dimensional data
* Strong theoretical foundations
* Handle only numeric features
    * Categorical features must be converted to numeric features
    * Features should be normalized
* The learned hyperplane is hard to interpret
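The two preprocessing steps mentioned above (converting categorical features and normalizing) might look like this with pandas. The toy frame and its column names `color` and `size` are made up for illustration, and one-hot encoding plus standardization is only one of several reasonable choices.

```python
import pandas as pd

# Hypothetical toy data: one categorical and one numeric feature.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "blue"],
    "size": [1.0, 2.0, 3.0, 4.0],
})

# One-hot encode the categorical column into 0/1 indicator columns.
encoded = pd.get_dummies(df, columns=["color"], dtype=float)

# Standardize the numeric column to zero mean and unit variance.
size = encoded["size"]
encoded["size"] = (size - size.mean()) / size.std(ddof=0)

print(encoded)
```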

In [1]:
import pandas as pd
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score

train_df = pd.read_csv('../datasets/News_Classification/train.csv')
test_df = pd.read_csv('../datasets/News_Classification/test.csv')

# keep only the Business (3) and Sci/Tech (4) news; copy so later edits do not touch a slice
train_df = train_df[train_df['Class Index'].isin([3, 4])].copy()
test_df = test_df[test_df['Class Index'].isin([3, 4])].copy()

# map labels 3 -> -1 and 4 -> +1
replace_labels = {3: -1, 4: 1}
train_df['Class Index'] = train_df['Class Index'].replace(replace_labels)
test_df['Class Index'] = test_df['Class Index'].replace(replace_labels)

# concatenate title and description into a single text column
train_df['Description'] = train_df['Title'].astype(str) + ' ' + train_df['Description'].astype(str)
test_df['Description'] = test_df['Title'].astype(str) + ' ' + test_df['Description'].astype(str)

# gets X and y data
X_train, y_train = train_df['Description'].to_list(), train_df['Class Index'].to_list()
X_test, y_test = test_df['Description'].to_list(), test_df['Class Index'].to_list()

print('TRAIN DATA:')
display(train_df.info())

print('\nTEST DATA:')
display(test_df.info())

TRAIN DATA:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 60000 entries, 0 to 119981
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Class Index  60000 non-null  int64 
 1   Title        60000 non-null  object
 2   Description  60000 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.8+ MB


None


TEST DATA:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3800 entries, 0 to 7599
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Class Index  3800 non-null   int64 
 1   Title        3800 non-null   object
 2   Description  3800 non-null   object
dtypes: int64(1), object(2)
memory usage: 118.8+ KB


None

In [2]:
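# bag-of-words features: the 500 most frequent terms, English stop words removed
count_vectorizer = CountVectorizer(stop_words='english', max_features=500)
X_train = count_vectorizer.fit_transform(X_train)  # learn the vocabulary on the training set only
X_test = count_vectorizer.transform(X_test)        # reuse the same vocabulary for the test set

# linear-kernel SVM with a small C (stronger regularization)
modelSVM = SVC(kernel='linear', C=0.1)
modelSVM.fit(X_train, y_train)

y_pred = modelSVM.predict(X_test)
# micro-averaged F1 equals accuracy for single-label classification
print(f1_score(y_test, y_pred, average='micro'))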

0.8781578947368421
