# Using mini-batches

In its purest form, online machine learning encompasses models which learn with one sample at a time. This is the design which is used in `creme`.

The main downside of single-instance processing is that it doesn't scale to big data. Indeed, processing one sample at a time means that we are able to use [vectorisation](https://www.wikiwand.com/en/Vectorization) and other computational tools that are taken for granted in batch learning. On top of this, processing a large dataset in `creme` essentially involves a Python `for` loop, which might be too slow for some usecases. However, this doesn't mean that `creme` is slow. In fact, for processing a single instance, `creme` is actually a couple of orders of magnitude faster than libraries such as scikit-learn, PyTorch, and Tensorflow. The reason why is because `creme` is designed from the ground up to process a single instance, whereas the majority of other libraries choose to care about batches of data. Both approaches offer different compromises, and the best choice depends on your usecase.

In order to propose the best of both worlds, `creme` offers some limited support for mini-batch learning. Some of `creme`'s estimators implement `*_many` methods on top of their `*_one` counterparts. For instance, `preprocessing.StandardScaler` has a `fit_many` method as well as a `transform_many` method, in addition to `fit_one` and `transform_one`. Each mini-batch method takes as input a `pandas.DataFrame`. Supervised estimators also take as input a `pandas.Series` of target values. We choose to use `pandas.DataFrames` over `numpy.ndarrays` because of the simple fact that the former allows us to name each feature. This in turn allows us to offer a uniform interface for both single instance and mini-batch learning.

As an example, we will build a simple pipeline that scales the data and trains a logistic regression. Indeed, the `compose.Pipeline` class can be applied to mini-batches, as long as each step is able to do so.

In [1]:
from creme import compose
from creme import linear_model
from creme import preprocessing

model = compose.Pipeline(
    preprocessing.StandardScaler(),
    linear_model.LogisticRegression()
)

For this example, we will use `datasets.Higgs`.

In [2]:
from creme import datasets

dataset = datasets.Higgs()
dataset

Higgs dataset

              Task  Binary classification                                                       
 Number of samples  11,000,000                                                                  
Number of features  28                                                                          
            Sparse  False                                                                       
              Path  /Users/mhalford/creme_data/Higgs/HIGGS.csv.gz                               
               URL  https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz
              Size  2.55 GB                                                                     
        Downloaded  True                                                                        

The easiest way to read the data in a mini-batch fashion is to use the `read_csv` from `pandas`.

In [12]:
import pandas as pd

names = [
    'target', 'lepton pT', 'lepton eta', 'lepton phi',
    'missing energy magnitude', 'missing energy phi',
    'jet 1 pt', 'jet 1 eta', 'jet 1 phi', 'jet 1 b-tag',
    'jet 2 pt', 'jet 2 eta', 'jet 2 phi', 'jet 2 b-tag',
    'jet 3 pt', 'jet 3 eta', 'jet 3 phi', 'jet 3 b-tag',
    'jet 4 pt', 'jet 4 eta', 'jet 4 phi', 'jet 4 b-tag',
    'm_jj', 'm_jjj', 'm_lv', 'm_jlv', 'm_bb', 'm_wbb', 'm_wwbb'
]

for x in pd.read_csv(dataset.path, names=columns, chunksize=8096, nrows=3e5):
    y = x.pop('target')
    
    model.fit
    
batch.head()

Unnamed: 0,lepton pT,lepton eta,lepton phi,missing energy magnitude,missing energy phi,jet 1 pt,jet 1 eta,jet 1 phi,jet 1 b-tag,jet 2 pt,...,jet 4 eta,jet 4 phi,jet 4 b-tag,m_jj,m_jjj,m_lv,m_jlv,m_bb,m_wbb,m_wwbb
0,0.869293,-0.635082,0.22569,0.32747,-0.689993,0.754202,-0.248573,-1.092064,0.0,1.374992,...,-0.010455,-0.045767,3.101961,1.35376,0.979563,0.978076,0.920005,0.721657,0.988751,0.876678
1,0.907542,0.329147,0.359412,1.49797,-0.31301,1.095531,-0.557525,-1.58823,2.173076,0.812581,...,-1.13893,-0.000819,0.0,0.30222,0.833048,0.9857,0.978098,0.779732,0.992356,0.798343
2,0.798835,1.470639,-1.635975,0.453773,0.425629,1.104875,1.282322,1.381664,0.0,0.851737,...,1.128848,0.900461,0.0,0.909753,1.10833,0.985692,0.951331,0.803252,0.865924,0.780118
3,1.344385,-0.876626,0.935913,1.99205,0.882454,1.786066,-1.646778,-0.942383,0.0,2.423265,...,-0.678379,-1.360356,0.0,0.946652,1.028704,0.998656,0.728281,0.8692,1.026736,0.957904
4,1.105009,0.321356,1.522401,0.882808,-1.205349,0.681466,-1.070464,-0.921871,0.0,0.800872,...,-0.373566,0.113041,0.0,0.755856,1.361057,0.98661,0.838085,1.133295,0.872245,0.808487


Note that you can check which estimators can process mini-batches programmatically:

In [None]:
import importlib
import inspect

def can_mini_batch(obj):
    return hasattr(obj, 'fit_many')

for module in importlib.import_module('creme').__all__:
    for obj in inspect.getmembers(importlib.import_module(f'creme.{module}'), can_mini_batch):
        print(obj)

We plan to promote more models to the mini-batch regime. However, we will only be doing so for the methods that benefit the most from it, as well as those that are most popular. Indeed, `creme`'s core philosophy will remain to cater to single instance learning.