#### After some research, I found that the term used in sklearn for batch processing large dataset are known as partial_fit features, which was explained as 'out-of-core' or external-memory training. Whichever method that support this features will allow the model to be trained batch by batch. 
#### reference (https://scikit-learn.org/0.15/modules/scaling_strategies.html)

#### As stated in the link, these ML methods were designed to support partial_fit
#### Classification
#### 1. sklearn.naive_bayes.MultinomialNB
#### 2. sklearn.naive_bayes.BernoulliNB
#### 3. sklearn.linear_model.Perceptron
#### 4. sklearn.linear_model.SGDClassifier
#### 5. sklearn.linear_model.PassiveAggressiveClassifier

#### Regression
#### 1. sklearn.linear_model.SGDRegressor
#### 2. sklearn.linear_model.PassiveAggressiveRegressor

#### Clustering
#### 1. sklearn.cluster.MiniBatchKMeans

#### Decomposition / feature Extraction
#### 1. sklearn.decomposition.MiniBatchDictionaryLearning
#### 2. sklearn.cluster.MiniBatchKMeans

#### The following coding will experiment the MultinomialNB using Iris dataset, performance was evaluated through the comparison of default Multinomial with fit and Multinomial model train through partial_fit. 

In [85]:
import pandas as pd
import requests
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC 
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import SGDClassifier
import matplotlib.pyplot as plt
from collections import Counter
from sklearn.metrics import plot_confusion_matrix
from sklearn.preprocessing import StandardScaler
from random import choice, randint
from sklearn.naive_bayes import MultinomialNB

In [2]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
r = requests.get(url, allow_redirects=True)
open('iris.txt','wb').write(r.content)

4551

In [3]:
header = ['sepal_length','sepal_width','petal_length','petal_width','names']
df = pd.read_csv('iris.txt',names = header,index_col =False)
df.head(10)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,names
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa
6,4.6,3.4,1.4,0.3,Iris-setosa
7,5.0,3.4,1.5,0.2,Iris-setosa
8,4.4,2.9,1.4,0.2,Iris-setosa
9,4.9,3.1,1.5,0.1,Iris-setosa


In [4]:
def name_to_numeric(x):
    if x=='Iris-setosa':return 0
    if x =='Iris-versicolor':return 1
    if x =='Iris-virginica':return 2

df['names'] = df['names'].apply(name_to_numeric)
df.to_csv('new iris.csv', header=False,index=False)
df.shape

(150, 5)

In [5]:
#Randomly selecting small amount of dataset from new iris.csv
selected_list = []
choice([i for i in range(0,151) if i not in []])

94

In [6]:
def random_data_selector():
    
    exclude = []
    final_lst = []
    for X in range(6):
        selection =[]
        while len(selection)< 20:
            randInt = randint(0,149)
            if randInt in exclude:
                continue
            else:
                exclude.append(randInt)
                selection.append(randInt)
        final_lst.append(selection)
    return final_lst,exclude

In [7]:
rand,exclude = random_data_selector()

In [8]:
np.array(rand).shape

(6, 20)

In [9]:
#Fail-safe checking, min must be =0 max must be 149
for list in rand:
    print('min for this list is ', min(list))
    print('max for this list is ', max(list), '\n')

min for this list is  4
max for this list is  131 

min for this list is  2
max for this list is  149 

min for this list is  0
max for this list is  145 

min for this list is  7
max for this list is  143 

min for this list is  1
max for this list is  142 

min for this list is  13
max for this list is  148 



In [10]:
#check if number duplicated, set function only return unique value in a list)
flat_list = [item for sublist in rand for item in sublist]
print(len(flat_list),len(set(flat_list)))

120 120


In [53]:
#Extracting the remaining 30 entries of data for testing set
test_indexes = [i for i in range(0,150) if i not in exclude]
df_test = pd.DataFrame(columns = ['sepal_length','sepal_width','petal_length','petal_width','names'])
for index in test_indexes:
    line = pd.read_csv('new iris.csv',skiprows = index-1,nrows=1,header=None,names = ['sepal_length','sepal_width','petal_length','petal_width','names'])
    df_test = pd.concat([df_test,line])


In [139]:
##Constructing ML model
nb = MultinomialNB()

In [180]:
%%time
for x in range(6):
    indexes = rand[x]
    minidf = pd.DataFrame(columns = ['sepal_length','sepal_width','petal_length','petal_width','names'])
    for index in indexes:
        line = pd.read_csv('new iris.csv',skiprows= index-1,nrows=1,header=None,names = ['sepal_length','sepal_width','petal_length','petal_width','names'])
        minidf = pd.concat([minidf,line])
    X = minidf.iloc[:,:4]
    y = minidf.iloc[:,-1]
    y = y.astype('int')

    if x == 0: nb.fit(X,y)
    if x == 1:nb.partial_fit(X,y,classes=np.unique(y))
    else: nb.partial_fit(X,y)
    

    print('minibatch {}  done'.format(x))

minibatch 0  done
minibatch 1  done
minibatch 2  done
minibatch 3  done
minibatch 4  done
minibatch 5  done
Wall time: 314 ms


In [181]:
%%time
X_test = df_test.iloc[:,:4]
y_test = df_test.iloc[:,-1]
y_pred = nb.predict(X_test)
y_test = np.array(y_test.values).astype('int')
accuracy_score(y_test,y_pred)

Wall time: 2 ms


0.9666666666666667

##### The result can be further improve by disabling the fit_prior feature. The improvement is due to each minibatch could be providing different number of class ratio (setosa, verginica, and versicolor), and assuming the probability based on the earlier batches is misleading.


In [142]:
nb = MultinomialNB(fit_prior = False)
for x in range(6):
    indexes = rand[x]
    minidf = pd.DataFrame(columns = ['sepal_length','sepal_width','petal_length','petal_width','names'])
    for index in indexes:
        line = pd.read_csv('new iris.csv',skiprows= index-1,nrows=1,header=None,names = ['sepal_length','sepal_width','petal_length','petal_width','names'])
        minidf = pd.concat([minidf,line])
    X = minidf.iloc[:,:4]
    y = minidf.iloc[:,-1]
    y = y.astype('int')

    if x == 0: nb.fit(X,y)
    if x == 1:nb.partial_fit(X,y,classes=np.unique(y))
    else: nb.partial_fit(X,y)

    print('minibatch {}  done'.format(x))

minibatch 0  done
minibatch 1  done
minibatch 2  done
minibatch 3  done
minibatch 4  done
minibatch 5  done


In [163]:
X_test = df_test.iloc[:,:4]
y_test = df_test.iloc[:,-1]
y_pred = nb.predict(X_test)
y_test = np.array(y_test.values).astype('int')
accuracy_score(y_test,y_pred)

0.9666666666666667

In [171]:
def name_to_numeric(x):
    if x=='Iris-setosa':return 0
    if x =='Iris-versicolor':return 1
    if x =='Iris-virginica':return 2

In [179]:
%%time
####Standard fit all data method
header = ['sepal_length','sepal_width','petal_length','petal_width','names']
df = pd.read_csv('iris.txt',names = header,index_col =False)
df.head(10)
df['names'] = df['names'].apply(name_to_numeric)
X = df.iloc[:,:-1]
y = (df.iloc[:,-1:])
X_train, X_test, y_train, y_test = train_test_split( X,y, test_size = 0.2, random_state = 500)
dtc = MultinomialNB(fit_prior=False)
dtc.fit(X_train, np.ravel(y_train))
y_pred = dtc.predict(X_test)
accuracy_score(y_test,y_pred)

Wall time: 10 ms


0.9333333333333333

##### Conclusively, the standard fit all method performs well with slightly low accuracy compare to the partial fit version which is assumed can be improved by hyperparameter tuning, However, runtime-wise, partial_fit induces significant more runtime. Nevertheless, runtime and accuracy shouldn't be a concern when it comes to choosing between fit or partial fit. Rather, partial fit should only be used when there is hardware limitations.

#### In this project, the partial fit feature was experimented and could be recycle use on other support