<div style="border-radius: 10px; border: 1px solid #0F9CF5; background-color: #232323; white-space: nowrap;">
    <p style="margin-top: -10px; margin-bottom: 0px; margin-left: 10px; font-size: 1.15em; padding: 10px; overflow: hidden;">
        <span style="color: orange; font-size: 2em;">&#9432;  </span>
        Click the <span style="color: orange;">Run All</span> <img style="max-height: 1.5em; border: 1px solid orange;" src="../img/RunAll.png" /> button in the toolbar above to run the code in this notebook 
    </p>
</div>

<a id="document-top"></a>
# BQuant Machine Learning Series Part 5 - SVM



<a href='https://bloombergslides.com/view/new/mail?iID=4kX986qfQH4VFvvbWG77'>Video: Episode 5 - ML Series Video - Support Vector Machines</a>

In [None]:
import bql
bq = bql.Service()

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# cache bql request on disk
import src.cache as cachereq
from src.shared import * ## Shared library for retrieving data via BQL for Machine Learning Series

%load_ext autoreload
%autoreload 2

### Initial set up - PLEASE READ
<font color='magenta'>The data is pre-cached on disk and is loaded automatically when you run the get_earnings_factors() function. The query sources a significant amount of data from BQL, so to avoid running into data limit issues, we strongly recommend that you do not modify the code below. You can examine the BQL code in the src folder, in shared.py.
</font>

In [None]:
# Read cached data from data_svm folder
cache = cachereq.CacheRequest(bq, {'cache_folder': 'data_svm', 'cache_data_on_disk': True})

# src -> shared.py -> get_earnings_factors()
data = get_earnings_factors(cache=cache)
print(data.shape)
print("Quarterly data from 2018-12-31 : 2020-12-31 for SP500")

<h3>Earnings movement prediction</h3>

<h4>Forecast direction of next quarter earnings based on accounting information of the current quarter </h4>

#### Steps:
- Enhance data with additional information
- Preprocess the data
- Apply Support Vector Machines on our dataset
- Try to improve our results through PCA



In [None]:
data.head(3)

#### Enhance data:
- Change in earnings per share (EPS): current-period EPS minus prior-period EPS
- Assign 1 to a positive change in EPS and 0 otherwise
- Shift the binary change by -1: each row then pairs the current financial data with the next period's change in earnings, which is what we want to predict
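
The steps above can be illustrated on a toy frame. This is only a sketch: the `ticker` column and the numbers are hypothetical, and whether the cached data stacks several companies in one frame is an assumption. If it does, shifting within each company keeps the last quarter of one company from being paired with the first quarter of the next.

```python
import pandas as pd

# Toy data: two hypothetical companies, three quarters each, stacked in one frame
toy = pd.DataFrame({
    "ticker": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "change_in_EPS": [0.5, -0.2, 0.1, -0.3, 0.4, -0.1],
})

# 1 for a positive EPS change, 0 otherwise
toy["binary_change"] = (toy["change_in_EPS"] > 0).astype(int)

# Shift by -1 within each company so that a row only ever sees
# the next quarter of the same company as its target
toy["Future_change"] = toy.groupby("ticker")["binary_change"].shift(-1)

print(toy)
```

The last quarter of each company has no future value, so its target is NaN and the row would be dropped before training.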


In [None]:
# Create binary column of positive and negative earnings changes
data['binary_change'] = (data['change_in_EPS'] > 0).astype(int)

# Shift date index by -1 so we are predicting future changes: 1 or 0
data['Future_change'] = data['binary_change'].shift(-1)

In [None]:
# Goal is to anticipate the sign of the future earnings change from the financial data of the current quarter.
# If the future earnings change is positive, we assign 1, otherwise 0, to the Future_change value of the current quarter
data[['EPS','change_in_EPS','Future_change']].head(6)

In [None]:
# Examine data 
data.describe()

In [None]:
# Replace infinity with nan
data = data.replace([np.inf, -np.inf], np.nan)

In [None]:
# Drop rows where change_in_EPS or Future_change is NaN: they are of no use to us
data = data.dropna(subset = ['change_in_EPS', 'Future_change'])

In [None]:
# We no longer need these columns
data = data.drop(columns = ['EPS','change_in_EPS','binary_change'])

In [None]:
# Examine missing data
missing_column_data = 100*(data.isnull().sum() / data.shape[0]).round(3)
print('Percent of missing values per column:\n', missing_column_data)

In [None]:
# Drop 10 columns that have more than 30% of data missing
columns_to_drop = missing_column_data[missing_column_data > 30]
columns_to_drop

In [None]:
# Number of columns dropped, 10 
data = data.drop(columns = list(columns_to_drop.index))
print( f'New Dataframe shape : {data.shape}' )

#### Preprocess data:
- Handle remaining missing values
- Minimize influence of outliers by performing Winsorization
- Standardize data 


Handle remaining missing data by replacing NaN by mean of the column

In [None]:
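# Keep in mind that this is a naive way to handle missing values.
# The means are computed on the full dataset before the train/test split, so this method can cause data leakage,
# and it does not account for the covariance between features.
# For more robust methods, take a look at MICE and KNN imputation.

# Assign the result back instead of using inplace=True on a column selection,
# which may operate on a copy and leave the DataFrame unchanged in recent pandas versions
for col in data.columns:
    data[col] = data[col].fillna(data[col].mean())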

In [None]:
# Check for missing values
missing_column_data = 100*(data.isnull().sum()/ data.shape[0]).round(3)
print('Percent of missing values per column:\n',missing_column_data)
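
One way to avoid the leakage mentioned above is to fit the imputer on the training rows only and reuse it on the test rows. The sketch below uses synthetic data and scikit-learn's `SimpleImputer` and `KNNImputer`; the array names and the 80/20 split are illustrative assumptions, not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Synthetic feature matrix with some missing entries (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
X_demo[::7, 0] = np.nan
X_demo[3::11, 2] = np.nan

# Split first, then impute
X_tr, X_te = X_demo[:80], X_demo[80:]

# Mean imputation: column means are learned from the training rows only
mean_imputer = SimpleImputer(strategy="mean")
X_tr_mean = mean_imputer.fit_transform(X_tr)
X_te_mean = mean_imputer.transform(X_te)

# KNN imputation: a missing value is filled from the 5 nearest training rows
knn_imputer = KNNImputer(n_neighbors=5)
X_tr_knn = knn_imputer.fit_transform(X_tr)
X_te_knn = knn_imputer.transform(X_te)

print(np.isnan(X_te_mean).sum(), np.isnan(X_te_knn).sum())
```

For a MICE-style approach, scikit-learn also provides `IterativeImputer`, which is still marked experimental.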

In [None]:
# First we need to split our data into train and test. 
from sklearn.model_selection import train_test_split

# Independent values/features
X = data.iloc[:,:-1].values
# Dependent values
y = data.iloc[:,-1].values

# Create test and train data sets, split data randomly into 20% test and 80% train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

Winsorization transforms data by limiting extreme values, typically by setting all outliers to a specified percentile of data

In [None]:
from scipy.stats import mstats
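# Winsorize top 1% and bottom 1% of points.

# Apply on X_train and X_test separately.
# axis=0 winsorizes each feature (column) on its own; without an axis argument,
# winsorize trims the flattened array, so features on a large scale would dominate the cut-offs
X_train = mstats.winsorize(X_train, limits = [0.01, 0.01], axis = 0)
X_test = mstats.winsorize(X_test, limits = [0.01, 0.01], axis = 0)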
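
The cell above winsorizes the training and test sets separately. An alternative, sketched here on synthetic data, is to learn the 1st and 99th percentile of each feature from the training set and clip both sets to those bounds. Note that this uses plain NumPy clipping rather than `mstats.winsorize`, so the replacement values are interpolated percentiles and may differ slightly from what `winsorize` produces.

```python
import numpy as np

# Synthetic train/test features with one extreme value in each (illustrative only)
rng = np.random.default_rng(0)
X_tr = rng.normal(size=(500, 3))
X_te = rng.normal(size=(100, 3))
X_tr[0, 0] = 50.0
X_te[0, 1] = -40.0

# Per-feature bounds learned from the training data only
lower = np.percentile(X_tr, 1, axis=0)
upper = np.percentile(X_tr, 99, axis=0)

# Clip both sets to the training bounds
X_tr_w = np.clip(X_tr, lower, upper)
X_te_w = np.clip(X_te, lower, upper)

print(X_tr_w[0, 0], X_te_w[0, 1])
```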

Standardize the data

$$z = \frac{x - \mu}{\sigma}$$

where $\mu$ is the mean and $\sigma$ is the standard deviation of the feature.

Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).

For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines (SVM) or the L1 and L2 regularizers of linear models) assume that all features are centered around 0 and have variance of the same order. If a feature has a variance that is orders of magnitude larger than that of the others, it might dominate the objective function and make the estimator unable to learn from the other features correctly.
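
As a small worked example of the formula, the sketch below standardizes a test row by hand using the training mean and standard deviation, and checks that `StandardScaler` gives the same result. The numbers are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (illustrative values)
X_tr = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
X_te = np.array([[2.5, 250.0]])

# Mean and standard deviation come from the training data only
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)  # population standard deviation (ddof=0), as used by StandardScaler

# z = (x - mu) / sigma applied to the test row
z_manual = (X_te - mu) / sigma

# Same computation through scikit-learn
scaler = StandardScaler().fit(X_tr)
z_sklearn = scaler.transform(X_te)

print(z_manual, z_sklearn)
```

The test row sits exactly at the training mean of both features, so both standardized values are 0 despite the difference in scale.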

In [None]:
# Standardize features by removing the mean and scaling to unit variance.

# IMPORTANT: During testing, it is important to construct the test feature vectors using the means and standard deviations saved from
# the training data, rather than computing it from the test data. You must scale your test inputs using the saved means
# and standard deviations, prior to sending them to your SVM library for classification.

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

# Fit to training data and then transform it
X_train = sc.fit_transform(X_train)
# Perform standardization on testing data using mu and sigma from training data
X_test = sc.transform(X_test)

[Source: scikit-learn](https://scikit-learn.org/stable/modules/svm.html) <br>

### SVM

**Advantages:**
* Effective in high dimensional spaces.
* Still effective in cases where number of dimensions is greater than the number of samples.
* Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
* Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.

**Disadvantages:**
* If the number of features is much greater than the number of samples, choosing the kernel function and the regularization term carefully is crucial to avoid over-fitting.
* It does not perform well when the data set is noisy, i.e. when the target classes overlap.
* SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.
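
To illustrate the last point, the sketch below fits an `SVC` on synthetic data and compares its raw decision scores with the probability estimates that become available when `probability=True` is set. The data set and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary classification problem (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# probability=True triggers the extra internal cross-validation, so fitting is slower
clf = SVC(kernel="rbf", probability=True, random_state=0)
clf.fit(X_tr, y_tr)

# Signed distance to the separating hyperplane: always available
scores = clf.decision_function(X_te)
# Class probability estimates: only available with probability=True
proba = clf.predict_proba(X_te)

print(scores[:3])
print(proba[:3])
```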


<img src='img/svm.jpg'>

In [None]:
# Support Vector Classification(C)
from sklearn.svm import SVC

# Initialize the SVM; rbf is the default kernel
classifier_rbf = SVC(C = 1, kernel = 'rbf', gamma = 'auto', random_state = 0)

# Fit the model on training data
classifier_rbf.fit(X_train, y_train)

# Make a prediction on testing data
y_pred_rbf = classifier_rbf.predict(X_test)

In [None]:
# Import accuracy score
from sklearn.metrics import accuracy_score
ac_rbf = accuracy_score(y_test, y_pred_rbf)
print('Accuracy with RBF: {:.2f}'.format(ac_rbf))

In [None]:
# Precision and recall
from sklearn.metrics import classification_report
result = classification_report(y_test, y_pred_rbf)
print(result)

#### Hyperparameters:
- Kernel - implicitly maps the data into a higher-dimensional space in which the classes can be separated by a hyperplane. The RBF kernel is useful when the boundary between the classes is non-linear in the original feature space: it computes the separating hyperplane in the higher dimension. In some applications, a more complex kernel is suggested to separate classes whose boundary is curved or non-linear.
- Regularization, C - penalty parameter for misclassification. It tells the SVM optimization how much error is tolerable. A small C tolerates more misclassified training points and results in a wider margin, while a large C penalizes errors heavily and results in a narrower margin.
- Gamma - defines how far the influence of a single training example reaches, with low values meaning ‘far’ and high values meaning ‘close’. The gamma parameter can be seen as the inverse of the radius of influence of samples selected by the model as support vectors. Very high values of gamma let the model fit the training dataset almost exactly, which can cause over-fitting.
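
One common way to choose these hyperparameters is a cross-validated grid search. The following is a minimal sketch on synthetic data; the grid values are arbitrary assumptions for illustration, not tuned settings for the earnings dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic binary classification problem (illustrative only)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate values for C and gamma with the RBF kernel
param_grid = {
    "C": [0.1, 1, 10],
    "gamma": [0.01, 0.1, 1],
    "kernel": ["rbf"],
}

# 5-fold cross-validation over the 9 parameter combinations
search = GridSearchCV(SVC(random_state=0), param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

On real data the search should be run on the training set only, with the test set kept aside for the final evaluation.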

In [None]:
# Keep the default C = 1 and change the kernel to linear (gamma is ignored by the linear kernel)
classifier_lin = SVC(C = 1, kernel = 'linear',gamma = 'auto',random_state=0)

# Fit the model on training data
classifier_lin.fit(X_train, y_train)

# Make a prediction on testing data
y_pred_lin = classifier_lin.predict(X_test)

from sklearn.metrics import accuracy_score
ac_lin = accuracy_score(y_test, y_pred_lin)
print('Accuracy with Linear: {:.2f}'.format(ac_lin))

Can we speed up our SVM algorithm?

#### Principal Component Analysis (PCA)
- A common way to speed up machine learning algorithms
- A large number of features in the dataset can affect both the training time and the accuracy of the model
- PCA is a statistical technique that replaces the original features with a smaller number of principal components (linear combinations of the features) that capture most of the information in the dataset
- Components are ranked by the variance they explain: the higher the variance, the more information that component conveys
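
The variance-based ranking can be made concrete by looking at the cumulative explained variance. The sketch below builds synthetic data driven by three latent factors (an assumption made only for this illustration), counts how many components are needed to reach 95% of the variance, and compares that with what `PCA(0.95)` keeps.

```python
import numpy as np
from sklearn.decomposition import PCA

# 10 observed features generated from 3 latent factors plus small noise (illustrative only)
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X_demo = latent @ mixing + 0.05 * rng.normal(size=(500, 10))

# Fit PCA with all components and accumulate the explained variance ratios
pca_full = PCA().fit(X_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio exceeds 95%
n_needed = int(np.searchsorted(cumulative, 0.95, side="right") + 1)

# Passing a float between 0 and 1 lets PCA pick that number itself
pca_95 = PCA(0.95).fit(X_demo)

print(n_needed, pca_95.n_components_)
```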

In [None]:
from sklearn.decomposition import PCA

# keep 95% of variance
pca = PCA(0.95)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

In [None]:
# Principal components that explain 95% of the variance in our dataset
explained_variance = pca.explained_variance_ratio_
# Number of components kept (27 in the source run, down from the original 40 features)
len(explained_variance)

We are able to achieve similar accuracy with only 27 principal components instead of the original 40 features.

In [None]:
classifier = SVC(C = 1, kernel='rbf',gamma = 'auto',random_state=0)

classifier.fit(X_train_pca, y_train)
y_pred = classifier.predict(X_test_pca)
ac = accuracy_score(y_test, y_pred)
print('Accuracy: {:.2f}'.format(ac))
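
The standardization, PCA, and SVM steps can also be chained in a scikit-learn pipeline, so that the scaler and the PCA are always fit on the training portion only. This is a sketch on synthetic data rather than the earnings dataset, using the same SVC settings as above; the data set and the 5-fold cross-validation are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification problem (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# Scale, keep 95% of the variance, then classify
model = make_pipeline(
    StandardScaler(),
    PCA(0.95),
    SVC(C=1, kernel="rbf", gamma="auto", random_state=0),
)

# Each fold refits the scaler and the PCA on its own training split
scores = cross_val_score(model, X_demo, y_demo, cv=5)
print("Mean cross-validated accuracy: {:.2f}".format(scores.mean()))
```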


<h3>Additional Resources</h3>

<h4>Python Libraries</h4>

Scikit train_test_split:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

Scikit SVM:
https://scikit-learn.org/stable/modules/svm.html
    
PCA:
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

Missing values imputation:
https://scikit-learn.org/stable/modules/impute.html

    