## Model with Incremental Learning

In the previous notebook, we identified the appropriate model and its hyperparameters. Here, I apply the same model and parameters, but train it with incremental learning. Incremental learning techniques are often used when the full dataset is too large to hold in memory at once: the data is processed in parts (only a subset is considered at any given point in time), and the results are combined, which saves memory.

### Incremental Learning with SVM Explanation

Given $l$ data points $\{(x_1, y_1), (x_2, y_2), \dots, (x_l, y_l)\}$, the decision function of an SVM is as follows:

<img src="image/equation.png" width="300"/>

Often, only a small fraction of the $\alpha_i$ coefficients are non-zero. The decision function is therefore fully defined by these non-zero $\alpha_i$ together with their corresponding $x_i$ entries and $y_i$ output labels, and only these three are preserved for use in the classification process. All the remaining training points do not contribute in any way and are regarded as redundant. The $x_i$ entries with non-zero $\alpha_i$ are the support vectors.

Since only a small number of data points end up as support vectors, the support vector algorithm is able to summarize the data space very concisely.
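
To make this concrete, here is a minimal sketch that rebuilds the decision function from the support vectors alone and compares it with scikit-learn's own output. It assumes a linear kernel and uses synthetic data from `make_classification` as a stand-in; `X_demo`, `y_demo` and the other variable names are hypothetical and not part of this project's dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=7)

svc = SVC(kernel="linear", C=1.0).fit(X_demo, y_demo)

# dual_coef_ stores alpha_i * y_i, and only for the support vectors
alpha_times_y = svc.dual_coef_[0]
support_vectors = svc.support_vectors_

# f(x) = sum_i alpha_i * y_i * <x_i, x> + b, summed over support vectors only
kernel_values = support_vectors @ X_demo.T  # shape (n_support, n_samples)
manual_scores = alpha_times_y @ kernel_values + svc.intercept_[0]

n_support = len(support_vectors)
matches = np.allclose(manual_scores, svc.decision_function(X_demo))
print(f"{n_support} of {len(X_demo)} points are support vectors; scores match: {matches}")
```

The non-support-vector points never enter the computation, which is why they can be discarded once the model is fitted.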

This is how incremental training works: 
1. Feed a subset of the data into the model.
2. Preserve only the support vectors.
3. Add them to the next subset.
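
The three steps above can be sketched as follows. This is only an illustration under stated assumptions: it uses `SVC` with a linear kernel on synthetic data split into three chunks, and all variable names (`X_all`, `carried_X`, and so on) are hypothetical. Note that the application cells later in this notebook take a different route: they use `SGDClassifier` with hinge loss, a linear SVM trained by stochastic gradient descent and updated through `partial_fit`, which does not store support vectors explicitly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in data, split into three subsets (assumption)
X_all, y_all = make_classification(n_samples=600, n_features=5, random_state=7)
chunks = zip(np.array_split(X_all, 3), np.array_split(y_all, 3))

# Support vectors carried over from earlier subsets (empty at the start)
carried_X = np.empty((0, X_all.shape[1]))
carried_y = np.empty((0,), dtype=y_all.dtype)

svc = None
for X_chunk, y_chunk in chunks:
    # Step 3: add the preserved support vectors to the next subset
    X_fit = np.vstack([carried_X, X_chunk])
    y_fit = np.concatenate([carried_y, y_chunk])

    # Step 1: feed the subset into the model
    svc = SVC(kernel="linear", C=1.0).fit(X_fit, y_fit)

    # Step 2: preserve only the support vectors
    carried_X = X_fit[svc.support_]
    carried_y = y_fit[svc.support_]

n_retained = len(carried_X)
print(f"Retained {n_retained} support vectors out of {len(X_all)} points")
```

At any time, only the current subset plus the retained support vectors need to be in memory, which is where the memory saving comes from.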

Reference: https://medium.com/computers-papers-and-everything/incremental-learning-with-support-vector-machines-e838cd2d7691

In [None]:
# Import relevant packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import gzip

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Classifiers used
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

### Incremental Learning with SVM Application

In [None]:
# Import the test sets
X_test = pd.read_csv('data/incremental_model/X_test.csv',index_col=0)
y_test = pd.read_csv('data/incremental_model/y_test.csv',index_col=0)

In [None]:
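# Scaler that is updated incrementally with each sub-dataset
scaler = StandardScaler()

# Instantiate a linear SVM (hinge loss) trained with stochastic gradient descent
SGDClass = SGDClassifier(loss='hinge',class_weight={-1: 1, 1: 5},random_state=7)

# partial_fit needs the full set of class labels up front
# (taken from the class_weight keys above)
classes = np.array([-1, 1])

# Run a loop to feed sub datasets into the model
for i in range(0,13):
    X = pd.read_csv(f'data/incremental_model/X_train/X_{i}.csv',index_col=0)
    y = pd.read_csv(f'data/incremental_model/y_train/y_{i}.csv',index_col=0)

    # Update the scaler with this sub-dataset and keep the scaled copy
    scaler.partial_fit(X)
    X_scaled = scaler.transform(X)

    # Partial fit the scaled data; y is flattened to a 1-D array
    SGDClass.partial_fit(X_scaled,y.values.ravel(),classes=classes)

    # Progress
    print(f"Fitted sub-dataset {i}")

In [None]:
# Scale the test set with the same scaler before scoring
print(SGDClass.score(scaler.transform(X_test),y_test.values.ravel()))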

After partially fitting on the entire dataset, we obtain a test accuracy of only 50.09%, which is significantly lower than that of the model trained on the reduced dataset. Hence, for now, the best option appears to be the logistic model fitted on the reduced dataset.

### Saving the Model with Incremental Learning

In [None]:
import pickle
filename = 'pickle/sgd_incremental_model.sav'
# Save the model to disk; the context manager closes the file afterwards
with open(filename, 'wb') as f:
    pickle.dump(SGDClass, f)
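
If the features are standardised before fitting, the fitted scaler is needed again at prediction time, so it may be worth storing it together with the model. Below is a minimal round-trip sketch of that idea. It assumes synthetic data and a temporary directory; the file name `demo_bundle.sav` and the dictionary keys are hypothetical and not used elsewhere in this project.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with labels -1 and 1 (assumption)
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(100, 4))
y_demo = np.where(X_demo[:, 0] > 0, 1, -1)

# Fit a scaler and a hinge-loss SGD model on the scaled data
demo_scaler = StandardScaler().fit(X_demo)
demo_model = SGDClassifier(loss="hinge", random_state=7)
demo_model.partial_fit(demo_scaler.transform(X_demo), y_demo, classes=np.array([-1, 1]))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_bundle.sav")  # hypothetical file name

    # Save the scaler and the model together
    with open(path, "wb") as f:
        pickle.dump({"scaler": demo_scaler, "model": demo_model}, f)

    # Load them back
    with open(path, "rb") as f:
        bundle = pickle.load(f)

# The restored pair should reproduce the original predictions
restored_pred = bundle["model"].predict(bundle["scaler"].transform(X_demo))
original_pred = demo_model.predict(demo_scaler.transform(X_demo))
same_predictions = np.array_equal(restored_pred, original_pred)
print(f"Restored model reproduces predictions: {same_predictions}")
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that was used to save them.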