https://coderzcolumn.com/tutorials/machine-learning/scikit-learn-incremental-learning-for-large-datasets

1. List of Estimators with "partial_fit()" Method 
Below we have listed the scikit-learn estimators that provide a partial_fit() method.

Regression
sklearn.linear_model.SGDRegressor
sklearn.linear_model.PassiveAggressiveRegressor
sklearn.neural_network.MLPRegressor
Classification
sklearn.naive_bayes.MultinomialNB
sklearn.naive_bayes.BernoulliNB
sklearn.linear_model.Perceptron
sklearn.linear_model.SGDClassifier
sklearn.linear_model.PassiveAggressiveClassifier
sklearn.neural_network.MLPClassifier
Clustering
sklearn.cluster.MiniBatchKMeans
sklearn.cluster.Birch
Preprocessing
sklearn.preprocessing.StandardScaler
sklearn.preprocessing.MinMaxScaler
sklearn.preprocessing.MaxAbsScaler
Decomposition / Dimensionality Reduction
sklearn.decomposition.MiniBatchDictionaryLearning
sklearn.decomposition.IncrementalPCA
sklearn.decomposition.LatentDirichletAllocation

Below is a list of the regression estimators in scikit-learn that support partial fitting on batches of data, which makes them usable for datasets that do not fit into the main memory of the computer.

sklearn.linear_model.SGDRegressor
sklearn.linear_model.PassiveAggressiveRegressor
sklearn.neural_network.MLPRegressor

In [None]:
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, Y = datasets.make_regression(n_samples=240000, random_state=123)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.9, random_state=123)

X_train.shape, X_test.shape, Y_train.shape, Y_test.shape

In [None]:
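## Reshaping data into batches of 24 samples each (make_regression creates 100 features by default)
X_train, X_test = X_train.reshape(-1,24,100), X_test.reshape(-1,24,100)
Y_train, Y_test = Y_train.reshape(-1,24), Y_test.reshape(-1,24)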

X_train.shape, X_test.shape, Y_train.shape, Y_test.shape

In [None]:
X_train[0].shape, Y_train[0].shape
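
Reshaping into batches works here because the number of samples is an exact multiple of the batch size (24). When that is not the case, batches can be produced by slicing instead. The snippet below is a minimal sketch of that idea; iter_batches() is a hypothetical helper written for this illustration and is not part of scikit-learn.

```python
import numpy as np

def iter_batches(X, y, batch_size):
    """Yield consecutive (X_batch, y_batch) slices; the last one may be smaller."""
    for start in range(0, X.shape[0], batch_size):
        stop = start + batch_size
        yield X[start:stop], y[start:stop]

# Toy data whose sample count (50) is not a multiple of the batch size (24).
X_demo = np.arange(50 * 3, dtype=float).reshape(50, 3)
y_demo = np.arange(50, dtype=float)

batch_sizes = [X_b.shape[0] for X_b, _ in iter_batches(X_demo, y_demo, batch_size=24)]
print(batch_sizes)  # [24, 24, 2]
```

Each yielded pair can be passed directly to partial_fit() in place of X_train[i], Y_train[i].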

2.2 Create and Train Model
In this section, we have created an ML model using the SGDRegressor class of scikit-learn. We have then looped through the data in batches and trained the estimator by calling partial_fit() on it for each batch. We have repeated this pass over the whole training data 10 times (10 epochs), training in batches each time.

Below we have included the signature of the SGDRegressor estimator for reference.

SGDRegressor(loss='squared_error', penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, max_iter=1000, tol=0.001, shuffle=True, verbose=0, epsilon=0.1, random_state=None, learning_rate='invscaling', eta0=0.01, power_t=0.25, early_stopping=False, validation_fraction=0.1, warm_start=False) - This class creates a linear model for regression tasks.
The loss parameter accepts one of the below strings specifying loss.
'squared_error'
'huber'
'epsilon_insensitive'
'squared_epsilon_insensitive'
The penalty parameter accepts a string specifying the penalty. The possible values are 'l2', 'l1' and 'elasticnet'. The default is 'l2'.
The l1_ratio parameter accepts a float in the range [0,1] specifying the share of the l1 penalty in the elasticnet penalty, which is a mix of l1 and l2. A value of 0 means only the l2 penalty is used and a value of 1.0 means only the l1 penalty is used. A value between 0 and 1 specifies a combination of l1 and l2.
The fit_intercept parameter accepts boolean values specifying whether to include an intercept in the model or not.
The learning_rate parameter accepts one of the below-mentioned strings specifying the learning rate schedule.
'constant'
'optimal'
'invscaling'
'adaptive'
The validation_fraction parameter accepts a float in the range 0-1 specifying how much of the training data should be set aside for validation. It is only used when early_stopping is True. The default is 0.1, which means that 10% of the training samples will be used for validation purposes.
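
To illustrate how the parameters described above are passed, the snippet below is a minimal sketch that creates an SGDRegressor with a few non-default settings and trains it on a small toy dataset in batches. The dataset size, the chosen parameter values and the names X_small, y_small and regressor_demo are assumptions made for this illustration only; they are not tuned recommendations.

```python
from sklearn import datasets
from sklearn.linear_model import SGDRegressor

# Small toy dataset (assumed sizes, for illustration only).
X_small, y_small = datasets.make_regression(n_samples=240, n_features=5, random_state=123)

regressor_demo = SGDRegressor(
    loss="huber",              # one of the loss strings listed above
    penalty="elasticnet",      # mix of l1 and l2 penalties
    l1_ratio=0.5,              # equal weight for l1 and l2
    learning_rate="adaptive",  # one of the learning rate strings listed above
    eta0=0.01,
    random_state=123,
)

# Train in batches of 24 samples using partial_fit().
for start in range(0, X_small.shape[0], 24):
    regressor_demo.partial_fit(X_small[start:start + 24], y_small[start:start + 24])

print(regressor_demo.coef_.shape, regressor_demo.intercept_.shape)
```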
Below we have created an instance of SGDRegressor with the default parameters. We have then looped through the data in batches and called partial_fit() on the regressor instance with each batch. We have performed this process for 10 epochs, which means we have looped through the total training data 10 times in batches.

In [None]:
from sklearn.linear_model import SGDRegressor

regressor = SGDRegressor()

epochs = 10

for k in range(epochs): ## Number of loops through data
    for i in range(X_train.shape[0]): ## Looping through batches
        X_batch, Y_batch = X_train[i], Y_train[i]
        regressor.partial_fit(X_batch, Y_batch) ## Partially fitting data in batches
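
The loop above visits the batches in the same order in every epoch. A common variation is to visit them in a different random order each epoch. The snippet below is a minimal sketch of that variation on a smaller toy dataset; the dataset size, the number of epochs and the names X_batches, y_batches and regressor_demo are assumptions for this illustration, and whether shuffling helps depends on the data.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import r2_score

# Smaller toy dataset (assumed sizes), split into 100 batches of 24 samples.
X_demo, y_demo = datasets.make_regression(n_samples=2400, n_features=10, random_state=123)
X_batches = X_demo.reshape(-1, 24, 10)
y_batches = y_demo.reshape(-1, 24)

rng = np.random.default_rng(123)
regressor_demo = SGDRegressor(random_state=123)

for epoch in range(5):
    # Visit the batches in a new random order in every epoch.
    for i in rng.permutation(X_batches.shape[0]):
        regressor_demo.partial_fit(X_batches[i], y_batches[i])

train_r2 = r2_score(y_demo, regressor_demo.predict(X_demo))
print("Train R2 after 5 epochs: {:.4f}".format(train_r2))
```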

2.3 Evaluate Model Performance on Test Data
In this section, we have evaluated the performance of our trained model on test data. We have looped through the test data in batches, made predictions on each batch, and then combined the predictions of all batches.

Finally, we have calculated MSE and R^2 scores on the test dataset to check the performance of the model.

If you are interested in learning about model evaluation metrics using scikit-learn then please feel free to check our tutorial on the same which explains the topic with simple and easy-to-understand examples.

Scikit-Learn - Model Evaluation & Scoring Metrics


In [None]:
from sklearn.metrics import mean_squared_error, r2_score


Y_test_preds = []
for j in range(X_test.shape[0]): ## Looping through test batches for making predictions
    Y_preds = regressor.predict(X_test[j])
    Y_test_preds.extend(Y_preds.tolist())

print("Test MSE      : {}".format(mean_squared_error(Y_test.reshape(-1), Y_test_preds)))
print("Test R2 Score : {}".format(r2_score(Y_test.reshape(-1), Y_test_preds)))
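
The code above collects all predictions in a list before calculating the metrics. If even the predictions are too many to keep in memory, both metrics can be accumulated batch by batch from running sums. The snippet below is a minimal sketch of that idea using only NumPy; streaming_mse_r2() is a hypothetical helper written for this illustration, and the one-pass formula for the total sum of squares can lose precision when target values are very large relative to their spread.

```python
import numpy as np

def streaming_mse_r2(batches):
    """Compute MSE and R^2 from an iterable of (y_true, y_pred) batch pairs."""
    n = 0
    sum_sq_err = 0.0  # running sum of (y_true - y_pred)^2
    sum_y = 0.0       # running sum of y_true
    sum_y_sq = 0.0    # running sum of y_true^2
    for y_true, y_pred in batches:
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        n += y_true.size
        sum_sq_err += np.sum((y_true - y_pred) ** 2)
        sum_y += np.sum(y_true)
        sum_y_sq += np.sum(y_true ** 2)
    mse = sum_sq_err / n
    ss_tot = sum_y_sq - (sum_y ** 2) / n  # total sum of squares around the mean
    r2 = 1.0 - sum_sq_err / ss_tot
    return mse, r2

# Two toy batches of true values and predictions.
y_true_batches = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
y_pred_batches = np.array([[1.5, 2.0, 2.5], [4.0, 5.5, 6.0]])

mse, r2 = streaming_mse_r2(zip(y_true_batches, y_pred_batches))
print("MSE: {}, R2: {}".format(mse, r2))
```

With the regressor from this tutorial, the pairs would be (Y_test[j], regressor.predict(X_test[j])) for each test batch j.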

2.4 Evaluate Model Performance on Train Data
In this section, we have evaluated the performance of our trained model on train data. We have looped through the train data in batches, made predictions on each batch, and combined the predictions of all batches.

Finally, we have calculated MSE and R^2 scores on the training dataset to check the performance of the model on train data.



In [None]:

from sklearn.metrics import mean_squared_error, r2_score

Y_train_preds = []
for j in range(X_train.shape[0]): ## Looping through train batches for making predictions
    Y_preds = regressor.predict(X_train[j])
    Y_train_preds.extend(Y_preds.tolist())

print("Train MSE      : {}".format(mean_squared_error(Y_train.reshape(-1), Y_train_preds)))
print("Train R2 Score : {}".format(r2_score(Y_train.reshape(-1), Y_train_preds)))

In [None]:
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

### Scaling Data

scaler = StandardScaler()

for i in range(X_train.shape[0]):
    X_batch, Y_batch = X_train[i], Y_train[i]
    scaler.partial_fit(X_batch, Y_batch) ## Partially fitting scaler in batches (the target argument is ignored by the scaler)


### Fitting Data in batches
regressor = SGDRegressor()

epochs = 10

for k in range(epochs):
    for i in range(X_train.shape[0]):
        X_batch, Y_batch = X_train[i], Y_train[i]
        X_batch = scaler.transform(X_batch) ## Preprocessing Single batch of data
        regressor.partial_fit(X_batch, Y_batch) ## Partially fitting data in batches
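
A model trained on scaled batches also needs its test batches transformed with the same fitted scaler before calling predict(). The snippet below is a minimal, self-contained sketch of the full flow on a smaller toy dataset; the dataset size, the number of epochs and the names ending in _demo, _tr and _te are assumptions for this illustration.

```python
from sklearn import datasets
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Smaller toy dataset (assumed sizes): 2160 train and 240 test samples.
X_demo, y_demo = datasets.make_regression(n_samples=2400, n_features=10, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.9, random_state=123)

# Batches of 24 samples each.
X_tr, X_te = X_tr.reshape(-1, 24, 10), X_te.reshape(-1, 24, 10)
y_tr, y_te = y_tr.reshape(-1, 24), y_te.reshape(-1, 24)

# First pass: fit the scaler incrementally on the training batches only.
scaler_demo = StandardScaler()
for X_batch in X_tr:
    scaler_demo.partial_fit(X_batch)

# Training passes: scale each batch, then partially fit the regressor.
regressor_demo = SGDRegressor(random_state=123)
for epoch in range(5):
    for X_batch, y_batch in zip(X_tr, y_tr):
        regressor_demo.partial_fit(scaler_demo.transform(X_batch), y_batch)

# Evaluation: apply the same fitted scaler to every test batch before predicting.
test_preds = []
for X_batch in X_te:
    test_preds.extend(regressor_demo.predict(scaler_demo.transform(X_batch)).tolist())

test_r2 = r2_score(y_te.reshape(-1), test_preds)
print("Test R2 Score : {}".format(test_r2))
```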

In [None]:
from sklearn.linear_model import SGDClassifier
from sklearn.decomposition import IncrementalPCA

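### Reducing Dimensionality
### Note: this cell assumes X_train / Y_train hold a batched binary classification dataset (labels 0 and 1), not the regression data above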

pca = IncrementalPCA(n_components=20)

for i in range(X_train.shape[0]):
    X_batch, Y_batch = X_train[i], Y_train[i]
    pca.partial_fit(X_batch, Y_batch) ## Partially fitting data in batches


### Fitting Data in batches
classifier = SGDClassifier()

epochs = 20

for k in range(epochs):
    for i in range(X_train.shape[0]):
        X_batch, Y_batch = X_train[i], Y_train[i]
        X_batch = pca.transform(X_batch) ## Preprocessing Single batch of data
        classifier.partial_fit(X_batch, Y_batch, classes=list(range(2))) ## Partially fitting data in batches

In [None]:
from sklearn.metrics import accuracy_score


Y_test_preds = []
for j in range(X_test.shape[0]): ## Looping through test batches for making predictions
    X_batch = pca.transform(X_test[j]) ## Preprocessing Single batch of data
    Y_preds = classifier.predict(X_batch)
    Y_test_preds.extend(Y_preds.tolist())

print("Test Accuracy      : {}".format(accuracy_score(Y_test.reshape(-1), Y_test_preds)))
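
The classification cells above rely on X_train, Y_train, X_test and Y_test holding a binary classification dataset split into batches. The snippet below is a minimal, self-contained sketch of the same flow under the assumption that such data is generated with datasets.make_classification(); the dataset size, the number of epochs and the names ending in _clf, _demo, _tr and _te are chosen for this illustration only.

```python
from sklearn import datasets
from sklearn.decomposition import IncrementalPCA
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumed toy binary classification dataset: 2160 train and 240 test samples.
X_clf, y_clf = datasets.make_classification(
    n_samples=2400, n_features=100, n_classes=2, random_state=123
)
X_tr, X_te, y_tr, y_te = train_test_split(X_clf, y_clf, train_size=0.9, random_state=123)

# Batches of 24 samples each.
X_tr, X_te = X_tr.reshape(-1, 24, 100), X_te.reshape(-1, 24, 100)
y_tr, y_te = y_tr.reshape(-1, 24), y_te.reshape(-1, 24)

# Fit PCA incrementally; each batch (24 samples) has at least n_components (20) samples.
pca_demo = IncrementalPCA(n_components=20)
for X_batch in X_tr:
    pca_demo.partial_fit(X_batch)

# Train the classifier on PCA-transformed batches; the full label set is passed via classes.
classifier_demo = SGDClassifier(random_state=123)
for epoch in range(5):
    for X_batch, y_batch in zip(X_tr, y_tr):
        classifier_demo.partial_fit(pca_demo.transform(X_batch), y_batch, classes=[0, 1])

# Predict on PCA-transformed test batches and combine the predictions.
test_preds = []
for X_batch in X_te:
    test_preds.extend(classifier_demo.predict(pca_demo.transform(X_batch)).tolist())

test_accuracy = accuracy_score(y_te.reshape(-1), test_preds)
print("Test Accuracy      : {}".format(test_accuracy))
```

Passing classes to partial_fit() matters because a single batch may not contain every label.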

Creme is a Python library for online machine learning which:

Implements a number of popular algorithms for classification, regression, feature selection, and feature preprocessing.
Has an API similar to scikit-learn.
Makes it easy to perform online/incremental learning.

In [None]:
from creme.linear_model import LogisticRegression
from creme.multiclass import OneVsRestClassifier
from creme.preprocessing import StandardScaler
from creme.compose import Pipeline
from creme.metrics import Accuracy
from creme import stream
import argparse