## SVM for Class Imbalance with Scikit-Learn

The Kaggle Credit Card Fraud Detection dataset can be downloaded here:
https://github.com/nsethi31/Kaggle-Data-Credit-Card-Fraud-Detection/blob/master/creditcard.csv?raw=true

### How SVM Works:
Part 1: https://www.youtube.com/watch?v=-3URiBiFgIg
        
Part 2: https://www.youtube.com/watch?v=BomqrcbJS4g 

In [None]:
%%time
import pandas as pd
data = pd.read_csv('creditcard.csv')

In [None]:
data.head()

In [None]:
data.shape

In [None]:
data['Class'].value_counts()

In [None]:
%%time
# Split data into train and test splits

from sklearn.model_selection import train_test_split

# retrieve numpy array
data = data.values
# split into input and output elements: drop the first column (Time)
# from the inputs and use the last column (Class) as the target
X, y = data[:, 1:-1], data[:, -1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

In [None]:
# Count the examples in each class, first for the full dataset,
# then for the test split

import numpy as np
unique, counts = np.unique(y, return_counts=True)
print(np.asarray((unique, counts)).T)

unique, counts = np.unique(y_test, return_counts=True)
print(np.asarray((unique, counts)).T)

### The class weighting can be defined in multiple ways; for example:

* Domain expertise, determined by talking to subject matter experts
* Tuning, determined by a hyperparameter search such as a grid search
* Heuristic, specified using a general best practice

A best practice for class weighting is to use the inverse of the class distribution present in the training dataset.
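The inverse-distribution heuristic can be checked by hand. The sketch below uses a made-up label array (`y_toy` is hypothetical, not the fraud data) and compares a manual calculation with scikit-learn's `'balanced'` mode, which computes `n_samples / (n_classes * count_per_class)`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (hypothetical): 8 majority-class and 2 minority-class examples
y_toy = np.array([0] * 8 + [1] * 2)

classes, counts = np.unique(y_toy, return_counts=True)

# 'balanced' heuristic by hand: n_samples / (n_classes * count of each class)
manual = len(y_toy) / (len(classes) * counts)

# Same heuristic via scikit-learn, passing keyword arguments
sk = compute_class_weight(class_weight='balanced', classes=classes, y=y_toy)

print(manual)  # 0.625 for class 0, 2.5 for class 1
print(sk)
```

The rarer class receives the larger weight, which is what the fraud data needs because positive examples are a tiny fraction of the rows.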

In [None]:
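# calculate heuristic class weighting
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# calculate class weighting according to training data
# (recent scikit-learn versions require keyword arguments here)
weighting = compute_class_weight(
    class_weight='balanced', classes=np.unique(y_train), y=y_train)
print(weighting)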

### Weighting C for SVM:
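Each class gets its own penalty value: the base C scaled by that class's weight. With the heuristic above, the weights are inversely proportional to the class distribution:

$C_i = \text{weight}_i \cdot C$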

A larger weighting can be used for the minority class, making the margin harder for it so that misclassified minority examples are penalized heavily, whereas a smaller weighting can be used for the majority class, making the margin softer and tolerating some misclassified majority examples.

* *Small Weight:* Smaller C value, smaller penalty for misclassified examples.
* *Larger Weight:* Larger C value, larger penalty for misclassified examples.

This has the effect of treating the majority class with more flexibility while penalizing errors on the minority class more strictly, so that majority class examples may be misclassified onto the minority class side if needed.
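As a small illustration of the per-class C, the sketch below fits an `SVC` on a made-up imbalanced dataset and reads back the fitted `class_weight_` attribute, which scikit-learn documents as the multipliers of C for each class. The dataset, the base C of 2.0, and the 1:10 weights are arbitrary choices for the example, not values taken from the fraud data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Small imbalanced toy dataset (roughly 90% class 0, 10% class 1)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=2, n_redundant=0, n_clusters_per_class=1,
    weights=[0.9], flip_y=0, random_state=0)

base_C = 2.0
# Weight the minority class (1) ten times more than the majority class (0)
svc = SVC(C=base_C, class_weight={0: 1.0, 1: 10.0})
svc.fit(X_demo, y_demo)

# class_weight_ holds the per-class multipliers of C after fitting
effective_C = base_C * svc.class_weight_
print(effective_C)  # C_0 = 2.0 for the majority class, C_1 = 20.0 for the minority class
```

Errors on the minority class are therefore penalized ten times more heavily than errors on the majority class in this configuration.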

### This cell can take a long time to run

In [None]:
%%time
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

weights = {0: weighting[0], 1: weighting[1]}
print(weights)
# define model
# keep exactly one of the lines below active to compare the unweighted
# model (default class_weight=None) with the two weighted variants
model = SVC(probability=True)
#model = SVC(class_weight=weights, probability=True)
#model = SVC(gamma='scale', class_weight='balanced', probability=True)

# fit a model
model.fit(X_train, y_train)

# predict probabilities
probs = model.predict_proba(X_test)
# keep probabilities for the positive outcome only
pos_probs = probs[:, 1]

auc = roc_auc_score(y_test, pos_probs)

# summarize performance
print(' ROC AUC = %.3f' % auc)
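Part of the run time above comes from `probability=True`, which scikit-learn documents as slowing down `fit` because it runs an internal cross-validation. Since `roc_auc_score` also accepts non-thresholded decision values, one possible alternative is to score with `decision_function` instead. The sketch below shows this on a small made-up dataset rather than the fraud data; the variable names are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Small imbalanced toy dataset standing in for the real data
X_s, y_s = make_classification(
    n_samples=600, n_features=4, n_redundant=0, weights=[0.9],
    flip_y=0, random_state=1)

# Stratify so both splits keep the same class ratio
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_s, y_s, test_size=0.33, stratify=y_s, random_state=42)

# probability is left at its default (False), so no internal cross-validation
clf = SVC(class_weight='balanced')
clf.fit(Xs_train, ys_train)

# signed distance to the separating surface; larger means "more positive"
scores = clf.decision_function(Xs_test)

auc_scores = roc_auc_score(ys_test, scores)
print('ROC AUC = %.3f' % auc_scores)
```

ROC AUC depends only on how the examples are ranked, so decision values are sufficient here; calibrated probabilities are needed only when the probability values themselves are used.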

### Use Synthetic Data

In [None]:
%%time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC
from sklearn.utils.class_weight import compute_class_weight


# prepare train and test dataset
def prepare_data(n_samples=1000):
    # generate an imbalanced classification dataset with 3 features
    # (2 informative by default, 1 noise) and a 99:1 class ratio
    X, y = make_classification(n_samples=n_samples, n_features=3, n_redundant=0,
                               n_clusters_per_class=2, weights=[0.99], flip_y=0,
                               random_state=4)
    # split into train and test
    n_train = n_samples//2
    trainX, testX = X[:n_train, :], X[n_train:, :]
    trainy, testy = y[:n_train], y[n_train:]
    return trainX, trainy, testX, testy

# prepare dataset
X_train, y_train, X_test, y_test = prepare_data()

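# calculate class weighting according to training data
# (recent scikit-learn versions require keyword arguments here)
weighting = compute_class_weight(
    class_weight='balanced', classes=np.unique(y_train), y=y_train)
weights = {0: weighting[0], 1: weighting[1]}
print(weights)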

# define model
# try with class_weight=weights and class_weight=None
#model = SVC(probability=True)
#model = SVC(class_weight=weights, probability=True)
model = SVC(gamma='scale', class_weight='balanced', probability=True)

# fit a model
model.fit(X_train, y_train)

# predict probabilities
probs = model.predict_proba(X_test)
# keep probabilities for the positive outcome only
pos_probs = probs[:, 1]

auc = roc_auc_score(y_test, pos_probs)

# summarize performance
print(' ROC AUC = %.3f' % auc)
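The tuning option listed earlier (choosing class weights with a grid search) could look roughly like the sketch below. The candidate weights, the 3-fold stratified split, and the slightly less extreme 95:5 toy dataset are all assumptions made so the example runs quickly; none of them come from the notebook's experiments.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Imbalanced toy dataset (about 95% class 0, 5% class 1)
X_g, y_g = make_classification(
    n_samples=500, n_features=3, n_redundant=0, n_clusters_per_class=2,
    weights=[0.95], flip_y=0, random_state=4)

# Candidate class weightings to compare (illustrative values only)
param_grid = {
    'class_weight': [None, 'balanced', {0: 1, 1: 10}, {0: 1, 1: 100}],
}

# Stratified folds keep some minority examples in every fold
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

search = GridSearchCV(SVC(gamma='scale'), param_grid, scoring='roc_auc', cv=cv)
search.fit(X_g, y_g)

print(search.best_params_)
print('Best mean ROC AUC = %.3f' % search.best_score_)
```

With so few minority examples per fold, the scores are noisy, so differences between candidate weightings should be read with caution.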