# Outline
- Large-scale data sets may not fit in memory, so we need strategies for handling them in both training and prediction use cases.
- This notebook discusses the following topics:

>- Overview of handling large scale data
>- Incremental preprocessing and learning.
>  - `fit` vs `partial_fit()`: `partial_fit()` is our friend in this case
>- Combining preprocessing and incremental learning


# Large Scale Machine Learning

> The idea is to process data in **batches** and **update** the model parameters for each batch. This way of learning is referred to as **incremental (or out-of-core) learning**.

## Incremental Learning
> Instead of re-training the model on the entire data set, one can employ an incremental learning approach in which the model parameters are updated with each new batch of data.

### Incremental learning in `sklearn`

- Use `partial_fit()`.
- This method is called several times consecutively on different chunks of the dataset.
- Each call carries some function-call overhead, so chunks should be as large as the memory budget allows.
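
The points above can be sketched as a loop that feeds consecutive slices of the data to `partial_fit()`. This is a minimal illustration on synthetic data, not a definitive recipe: the `iter_batches` helper, the batch size, and the data set sizes are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic data stands in for a data set that would normally be streamed from disk.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           n_classes=2, random_state=0)


def iter_batches(X, y, batch_size):
    """Yield consecutive (X_batch, y_batch) slices (hypothetical helper)."""
    for start in range(0, X.shape[0], batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]


clf = SGDClassifier(random_state=0)
classes = np.unique(y)  # all labels must be known before the first call

n_batches = 0
for X_batch, y_batch in iter_batches(X, y, batch_size=200):
    # Each call updates the existing coefficients instead of starting over.
    clf.partial_fit(X_batch, y_batch, classes=classes)
    n_batches += 1

print(n_batches, clf.coef_.shape)
```

Only one batch is passed to the model at a time, so memory use is bounded by the batch size rather than by the size of the data set.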

### `partial_fit()` parameters

`partial_fit(X,y,[classes],[sample_weight])`
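
A small sketch of how the optional `classes` and `sample_weight` arguments behave for `SGDClassifier`. The random data and the uniform weights are assumptions for illustration only; the point is that `classes` has to be supplied on the first call and can be left out afterwards.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X_first = rng.normal(size=(20, 3))
y_first = np.array([0, 1] * 10)  # this batch only contains classes 0 and 1

clf = SGDClassifier(random_state=0)

# Omitting `classes` on the very first call raises a ValueError.
try:
    clf.partial_fit(X_first, y_first)
    first_call_failed = False
except ValueError:
    first_call_failed = True

# Declaring every label up front lets a later batch introduce class 2.
clf.partial_fit(X_first, y_first,
                classes=np.array([0, 1, 2]),
                sample_weight=np.ones(len(y_first)))

X_later = rng.normal(size=(20, 3))
y_later = np.full(20, 2)
clf.partial_fit(X_later, y_later)  # `classes` may be omitted now

print(first_call_failed, clf.classes_, clf.coef_.shape)
```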

The following estimators implement the `partial_fit` method:
- Classification
  - MultinomialNB
  - BernoulliNB
  - SGDClassifier
  - Perceptron
- Regression
  - SGDRegressor
- Clustering
  - MiniBatchKMeans

`SGDRegressor` and `SGDClassifier` are commonly used for handling large data
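
The same calling pattern applies to the unsupervised estimators in the list. The sketch below uses `MiniBatchKMeans` on two synthetic blobs; the blob locations, the batch size of 100, and `n_init=3` are assumptions chosen for this example.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs, shuffled so every batch contains both.
blob_a = rng.normal(loc=0.0, scale=0.5, size=(500, 2))
blob_b = rng.normal(loc=10.0, scale=0.5, size=(500, 2))
data = np.vstack([blob_a, blob_b])
rng.shuffle(data)  # shuffles rows in place

kmeans = MiniBatchKMeans(n_clusters=2, random_state=0, n_init=3)
for start in range(0, len(data), 100):
    # No labels are needed: each call refines the current cluster centers.
    kmeans.partial_fit(data[start:start + 100])

print(kmeans.cluster_centers_)
```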



# **fit() vs partial_fit()**

In [55]:
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

### 1. Traditional approach (using `fit()`)

In [56]:
x,y = make_classification(n_samples=50000, n_features=10,
                          n_classes=3,
                          n_clusters_per_class=1)
xtrain,xtest,ytrain,ytest = train_test_split(x,y,test_size=0.15)

In [57]:
clf1 = SGDClassifier(max_iter=1000,tol=0.01)

In [58]:
clf1.fit(xtrain,ytrain)

SGDClassifier(tol=0.01)

In [59]:
train_score = clf1.score(xtrain,ytrain)

In [60]:
test_score = clf1.score(xtest,ytest)

In [61]:
print('train score',train_score)
print('test score:',test_score)

train score 0.9061411764705882
test score: 0.9032


In [62]:
ypred = clf1.predict(xtest)
cm =confusion_matrix(ytest,ypred)
print(cm)

[[2390   29   94]
 [ 419 2079    9]
 [   6  169 2305]]


In [63]:
print(classification_report(ytest,ypred))

              precision    recall  f1-score   support

           0       0.85      0.95      0.90      2513
           1       0.91      0.83      0.87      2507
           2       0.96      0.93      0.94      2480

    accuracy                           0.90      7500
   macro avg       0.91      0.90      0.90      7500
weighted avg       0.91      0.90      0.90      7500



### 2. Incremental approach (using `partial_fit()`)

In [64]:
xtrain[:5]

array([[-1.93657713, -0.63177565, -0.27479035, -0.63227078,  0.54735588,
         0.35760078,  1.61041344, -1.85146695,  0.69436861, -0.10551839],
       [-0.33375655,  0.19102651, -0.10879669,  1.45279473,  0.75270828,
        -0.4380048 , -1.13234159, -0.9763225 , -1.02907512, -0.95388221],
       [-1.42335852, -0.27899837,  0.72207817,  0.52116284,  0.38146247,
        -0.04595028,  0.3123013 , -0.05017705,  0.21495823, -1.66276572],
       [-1.67314073,  0.12414249,  0.25412015, -0.82380348,  0.02704602,
        -0.8071953 , -1.75824464, -0.35446733,  1.6040287 ,  0.10471175],
       [ 2.15578194,  0.47927764, -0.87019071, -1.62635913, -0.45844213,
        -0.02488779, -0.7396183 ,  2.65570143, -0.44297942,  1.01167011]])

In [65]:
ytrain[:5]

array([0, 2, 0, 1, 2])

In [66]:
ytrain[:,np.newaxis]

array([[0],
       [2],
       [0],
       ...,
       [1],
       [2],
       [2]])

In [67]:
train_data=np.concatenate((xtrain, ytrain[:,np.newaxis]),axis=1)

In [68]:
train_data[:5]

array([[-1.93657713, -0.63177565, -0.27479035, -0.63227078,  0.54735588,
         0.35760078,  1.61041344, -1.85146695,  0.69436861, -0.10551839,
         0.        ],
       [-0.33375655,  0.19102651, -0.10879669,  1.45279473,  0.75270828,
        -0.4380048 , -1.13234159, -0.9763225 , -1.02907512, -0.95388221,
         2.        ],
       [-1.42335852, -0.27899837,  0.72207817,  0.52116284,  0.38146247,
        -0.04595028,  0.3123013 , -0.05017705,  0.21495823, -1.66276572,
         0.        ],
       [-1.67314073,  0.12414249,  0.25412015, -0.82380348,  0.02704602,
        -0.8071953 , -1.75824464, -0.35446733,  1.6040287 ,  0.10471175,
         1.        ],
       [ 2.15578194,  0.47927764, -0.87019071, -1.62635913, -0.45844213,
        -0.02488779, -0.7396183 ,  2.65570143, -0.44297942,  1.01167011,
         2.        ]])

In [69]:
a =np.asarray(train_data)
np.savetxt("train_data.csv",a,delimiter=",")

In [70]:
clf2 = SGDClassifier(max_iter=1000,tol=0.01)

#### Processing data chunk by chunk

The pandas `read_csv()` function has a `chunksize` parameter that specifies the number of rows per chunk, so the file can be read chunk by chunk.
We can then pass each chunk to `partial_fit()` and repeat these two steps (read a chunk, update the model) until the file is exhausted. That way, the entire data set never needs to be held in memory at once.
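
To see what `chunksize` yields, here is a minimal sketch that reads a small headerless CSV from an in-memory buffer. The buffer, the 25-row table, and the chunk size of 10 are assumptions made for illustration; a real workflow would point `read_csv()` at a file on disk.

```python
import io

import numpy as np
import pandas as pd

# Build a small headerless CSV in memory (25 rows, 3 columns).
table = pd.DataFrame(np.arange(25 * 3).reshape(25, 3))
buffer = io.StringIO()
table.to_csv(buffer, header=False, index=False)
buffer.seek(0)

# With chunksize set, read_csv returns an iterator of DataFrames.
chunk_shapes = []
for chunk in pd.read_csv(buffer, header=None, chunksize=10):
    chunk_shapes.append(chunk.shape)

print(chunk_shapes)  # [(10, 3), (10, 3), (5, 3)]
```

The last chunk is smaller because 25 is not a multiple of 10. `header=None` matters for files written without a header row, since otherwise the first data row is consumed as column names.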


In [71]:
import pandas as pd
chunksize = 1000
# train_data.csv was written without a header row, so header=None keeps the first sample.
for iteration, train_df in enumerate(
    pd.read_csv("train_data.csv", chunksize=chunksize, header=None), start=1):
  xtrain_partial = train_df.iloc[:, 0:10]
  ytrain_partial = train_df.iloc[:, 10]
  if iteration == 1:
    # In the first iteration, we specify all possible class labels.
    clf2.partial_fit(xtrain_partial, ytrain_partial,
                     classes=np.array([0, 1, 2]))
  else:
    clf2.partial_fit(xtrain_partial, ytrain_partial)
  print("After iter #", iteration)
  print(clf2.coef_)
  print(clf2.intercept_)


After iter # 1
[[-19.97925233 -10.26620682 -20.78429324  13.88516655  -5.05936298
    9.93381241  34.23528168 -11.04576437  -6.93570413  -1.93477971]
 [-16.84656949   2.43409102 -11.26973178   2.6443482    7.87741752
  -10.10020578 -23.27007588   2.31261837 -10.19835205   3.91637462]
 [ 57.15618518  19.76298764  -4.32204653   0.66227691  -4.83858455
  -12.41468632 -52.77965388  -0.12661262 -17.49429465  -6.84503729]]
[ -7.6158045  -32.35436741 -40.73281325]
After iter # 2
[[ -7.54134443  -7.63586285   2.7921551   -0.41687174   0.76920182
   10.01491822  30.60206942  10.03945167  -3.94230983   1.9606068 ]
 [ -2.63701483   4.26730814   1.08454143  -7.27093835  -2.26092356
   -8.05539538 -21.9121557    7.9727461   -5.61262599  -8.02210359]
 [ 35.76945614  14.67260934  -0.6893345    4.03584666  -7.08343082
  -11.60862396 -43.86431487  -4.87679715  -4.90418904  -0.89331304]]
[ -1.67984048 -20.30813191 -37.44430267]
After iter # 3
[[ -6.53956928  -6.94465313  -1.31566633   6.02186889  -4.436

In [72]:
test_score= clf2.score(xtest,ytest)
print("Test Score:", test_score)

Test Score: 0.9010666666666667




In [73]:
ypred = clf2.predict(xtest)
cm=confusion_matrix(ytest,ypred)
print(cm)

[[2327  101   85]
 [ 288 2194   25]
 [   5  238 2237]]




In [74]:
print(classification_report(ytest,ypred))

              precision    recall  f1-score   support

           0       0.89      0.93      0.91      2513
           1       0.87      0.88      0.87      2507
           2       0.95      0.90      0.93      2480

    accuracy                           0.90      7500
   macro avg       0.90      0.90      0.90      7500
weighted avg       0.90      0.90      0.90      7500



# **Incremental Preprocessing Example**

## `CountVectorizer` vs `HashingVectorizer`

In [75]:
text_documents = ['You must have heard “Apple a day keeps the doctor away” but what does that mean? Well, it has a very straightforward and precise meaning that eating apples maintains good health and acts as a bodyguard to save your body from diseases.',
                  'The proverb was first published in print in 1866 and over 150 years later, a medical journal has used the excuse of April Fool’s Day to publish a study that questions – seriously – If this wisdom really does keep the doctor away. ',
                  'British apples are one of the nations best loved fruit and according to British Apples, we consume around 122,000 tonned of them each year.',
                  'But what are the health benefits, and do they really keep the doctor away?']

### 1. CountVectorizer

In [76]:
from sklearn.feature_extraction.text import CountVectorizer
c_vectorizer = CountVectorizer()

In [77]:
X_c =c_vectorizer.fit_transform(text_documents)

In [78]:
X_c.shape

(4, 84)

In [79]:
c_vectorizer.vocabulary_

{'000': 0,
 '122': 1,
 '150': 2,
 '1866': 3,
 'according': 4,
 'acts': 5,
 'and': 6,
 'apple': 7,
 'apples': 8,
 'april': 9,
 'are': 10,
 'around': 11,
 'as': 12,
 'away': 13,
 'benefits': 14,
 'best': 15,
 'body': 16,
 'bodyguard': 17,
 'british': 18,
 'but': 19,
 'consume': 20,
 'day': 21,
 'diseases': 22,
 'do': 23,
 'doctor': 24,
 'does': 25,
 'each': 26,
 'eating': 27,
 'excuse': 28,
 'first': 29,
 'fool': 30,
 'from': 31,
 'fruit': 32,
 'good': 33,
 'has': 34,
 'have': 35,
 'health': 36,
 'heard': 37,
 'if': 38,
 'in': 39,
 'it': 40,
 'journal': 41,
 'keep': 42,
 'keeps': 43,
 'later': 44,
 'loved': 45,
 'maintains': 46,
 'mean': 47,
 'meaning': 48,
 'medical': 49,
 'must': 50,
 'nations': 51,
 'of': 52,
 'one': 53,
 'over': 54,
 'precise': 55,
 'print': 56,
 'proverb': 57,
 'publish': 58,
 'published': 59,
 'questions': 60,
 'really': 61,
 'save': 62,
 'seriously': 63,
 'straightforward': 64,
 'study': 65,
 'that': 66,
 'the': 67,
 'them': 68,
 'they': 69,
 'this': 70,
 'to': 71

In [80]:
print(X_c)

  (0, 82)	1
  (0, 50)	1
  (0, 35)	1
  (0, 37)	1
  (0, 7)	1
  (0, 21)	1
  (0, 43)	1
  (0, 67)	1
  (0, 24)	1
  (0, 13)	1
  (0, 19)	1
  (0, 78)	1
  (0, 25)	1
  (0, 66)	2
  (0, 47)	1
  (0, 77)	1
  (0, 40)	1
  (0, 34)	1
  (0, 74)	1
  (0, 64)	1
  (0, 6)	2
  (0, 55)	1
  (0, 48)	1
  (0, 27)	1
  (0, 8)	1
  :	:
  (2, 45)	1
  (2, 32)	1
  (2, 4)	1
  (2, 76)	1
  (2, 20)	1
  (2, 11)	1
  (2, 1)	1
  (2, 0)	1
  (2, 72)	1
  (2, 68)	1
  (2, 26)	1
  (2, 80)	1
  (3, 67)	2
  (3, 24)	1
  (3, 13)	1
  (3, 19)	1
  (3, 78)	1
  (3, 6)	1
  (3, 36)	1
  (3, 61)	1
  (3, 42)	1
  (3, 10)	1
  (3, 14)	1
  (3, 23)	1
  (3, 69)	1


### 2. `HashingVectorizer`

In [81]:
from sklearn.feature_extraction.text import HashingVectorizer

Note: A small `n_features` is likely to cause hash collisions, while a large `n_features` leads to a larger coefficient dimension in linear learners.
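
The trade-off in this note can be made visible by counting how many output columns the hashed tokens occupy. This is a rough sketch with a made-up three-sentence corpus; `alternate_sign=False` and `norm=None` are set only so that every hashed token shows up as a plain nonzero count.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

docs = [
    "apple a day keeps the doctor away",
    "british apples are loved fruit",
    "health benefits of eating apples",
]

# Number of distinct tokens according to the default tokenizer.
vocab_size = len(CountVectorizer().fit(docs).vocabulary_)

used_columns = {}
for n_features in (8, 64, 2**18):
    hasher = HashingVectorizer(n_features=n_features,
                               alternate_sign=False, norm=None)
    hashed = hasher.transform(docs)
    # Columns that received at least one token.
    used_columns[n_features] = np.unique(hashed.nonzero()[1]).size

print("distinct tokens:", vocab_size)
print("columns used per n_features:", used_columns)
```

When `n_features` is smaller than the number of distinct tokens, several tokens must share a column, which is exactly a hash collision. A larger `n_features` makes collisions rarer, at the cost of a wider coefficient vector in the downstream linear model.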

In [82]:
h_vectorizer=HashingVectorizer(n_features=50)

In [83]:
X_h =h_vectorizer.fit_transform(text_documents)

In [84]:
X_h.shape

(4, 50)

In [85]:
print(X_h[0])

  (0, 2)	-0.13736056394868904
  (0, 4)	-0.13736056394868904
  (0, 5)	0.27472112789737807
  (0, 8)	-0.13736056394868904
  (0, 9)	0.13736056394868904
  (0, 10)	-0.13736056394868904
  (0, 11)	0.13736056394868904
  (0, 12)	0.0
  (0, 13)	0.0
  (0, 15)	-0.13736056394868904
  (0, 18)	0.13736056394868904
  (0, 20)	0.13736056394868904
  (0, 26)	0.13736056394868904
  (0, 29)	0.13736056394868904
  (0, 31)	-0.13736056394868904
  (0, 34)	-0.13736056394868904
  (0, 36)	-0.13736056394868904
  (0, 38)	0.5494422557947561
  (0, 39)	-0.13736056394868904
  (0, 41)	0.0
  (0, 42)	-0.13736056394868904
  (0, 45)	-0.5494422557947561
  (0, 47)	0.13736056394868904


# Combining preprocessing and fitting in incremental learning
(`HashingVectorizer` along with `SGDClassifier`)

## Downloading the dataset

In [86]:
import pandas as pd
from io import StringIO, BytesIO, TextIOWrapper
from zipfile import ZipFile
import urllib.request

resp = urllib.request.urlopen('https://archive.ics.uci.edu/ml/machine-learning-databases/00331/sentiment%20labelled%20sentences.zip')
zipfile = ZipFile(BytesIO(resp.read()))
data = TextIOWrapper(zipfile.open('sentiment labelled sentences/amazon_cells_labelled.txt'),encoding='utf-8')

df=pd.read_csv(data,sep='\t')
df.columns=['review','sentiment']

## Exploring the dataset

In [87]:
df.head()

| | review | sentiment |
|---|---|---|
| 0 | Good case, Excellent value. | 1 |
| 1 | Great for the jawbone. | 1 |
| 2 | Tied to charger for conversations lasting more... | 0 |
| 3 | The mic is great. | 1 |
| 4 | I have to jiggle the plug to get it to line up... | 0 |


In [88]:
df.tail()

| | review | sentiment |
|---|---|---|
| 994 | The screen does get smudged easily because it ... | 0 |
| 995 | What a piece of junk.. I lose more calls on th... | 0 |
| 996 | Item Does Not Match Picture. | 0 |
| 997 | The only thing that disappoint me is the infra... | 0 |
| 998 | You can not answer calls with the unit, never ... | 0 |


In [89]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999 entries, 0 to 998
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     999 non-null    object
 1   sentiment  999 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 15.7+ KB


In [90]:
df.describe()

| | sentiment |
|---|---|
| count | 999.0 |
| mean | 0.500501 |
| std | 0.50025 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 1.0 |
| 75% | 1.0 |
| max | 1.0 |


## Splitting data into train and test

In [91]:
from sklearn.model_selection import train_test_split


In [92]:
X=df.loc[:,'review']
y=df.loc[:,'sentiment']

In [93]:
X_train, X_test, y_train, y_test= train_test_split(X,y,test_size=0.2)

In [94]:
X_train.shape

(799,)

In [95]:
y_train.shape

(799,)

## Preprocessing

In [96]:
from sklearn.feature_extraction.text import HashingVectorizer
vectorizer =HashingVectorizer()

## Creating instance of SGDClassifier

In [97]:
from sklearn.linear_model import SGDClassifier
classifier = SGDClassifier(penalty='l2',loss='hinge')

## Iteration 1 of `partial_fit()`

In [98]:
X_train_part1_hashed = vectorizer.fit_transform(X_train[0:400])
y_train_part1 = y_train[0:400]

In [99]:
all_classes = np.unique(df.loc[:,'sentiment']) #we need to mention all classes in the first iteration of partial_fit()

In [100]:
classifier.partial_fit(X_train_part1_hashed,y_train_part1, classes=all_classes)

SGDClassifier()

In [101]:
X_test_hashed = vectorizer.transform(X_test) #X_test must be preprocessed with the same vectorizer that was used on the training data

In [102]:
test_score= classifier.score(X_test_hashed,y_test)
print("Test score:", test_score)

Test score: 0.69


## Iteration 2 of `partial_fit()`

In [103]:
X_train_part2_hashed = vectorizer.transform(X_train[400:]) #HashingVectorizer is stateless, so no re-fitting is needed
y_train_part2 = y_train[400:]

In [104]:
classifier.partial_fit(X_train_part2_hashed,y_train_part2)

SGDClassifier()

In [105]:
test_score= classifier.score(X_test_hashed,y_test)
print("Test score:", test_score)

Test score: 0.755
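
The two manual iterations above generalize to a loop over batches. The sketch below is one possible way to write it, using a tiny made-up corpus instead of the downloaded reviews; the corpus, `n_features=2**10`, and the batch size of 50 are all assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Tiny hypothetical corpus; 1 = positive, 0 = negative.
reviews = ["great phone", "terrible battery", "excellent value", "awful sound",
           "great sound", "terrible value", "excellent battery", "awful phone"] * 25
labels = np.array([1, 0, 1, 0, 1, 0, 1, 0] * 25)

# HashingVectorizer keeps no vocabulary, so transform() works without fitting.
vectorizer = HashingVectorizer(n_features=2**10)
classifier = SGDClassifier(loss="hinge", penalty="l2", random_state=0)
all_classes = np.array([0, 1])

batch_size = 50
for start in range(0, len(reviews), batch_size):
    X_batch = vectorizer.transform(reviews[start:start + batch_size])
    y_batch = labels[start:start + batch_size]
    # Passing the same `classes` on every call is allowed and keeps the loop uniform.
    classifier.partial_fit(X_batch, y_batch, classes=all_classes)

predictions = classifier.predict(
    vectorizer.transform(["excellent phone", "awful battery"]))
print(predictions)
```

Because the vectorizer is stateless, the test data can be hashed at any time, before or after training, and will always land in the same feature space.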


# Assignment

## Practice Assignment

In [None]:
text_data=['A metaverse is a network of 3D virtual worlds focused on social connection.',
           'In futurism and science fiction, the term is often described as a hypothetical iteration of the Internet as a single', 
           'universal virtual world that is facilitated by the use of virtual and augmented reality headsets.',
           'The term "metaverse" has its origins  the 1992 science fiction novel Snow Crash as a portmanteau of "meta" and "universe."',
           'Various metaverses have been developed for popular use such as virtual world platforms like Second Life.',
           'Some metaverse iterations involve integration between virtual and physical spaces and virtual economies',
           'often including a significant interest in advancing virtual reality technology.', 
           'The term has seen considerable use as a buzzword for public relations purposes to exaggerate development progress for various related technologies and projects.[10] Information privacy and user addiction are concerns within metaverses',
           'stemming from challenges facing the social media and video game industries as a whole.']

In [108]:
c_vectorizer=CountVectorizer()
X_c=c_vectorizer.fit_transform(text_data)

In [109]:
X_c.shape

(9, 99)

In [110]:
c_vectorizer.vocabulary_

{'10': 0,
 '1992': 1,
 '3d': 2,
 'addiction': 3,
 'advancing': 4,
 'and': 5,
 'are': 6,
 'as': 7,
 'augmented': 8,
 'been': 9,
 'between': 10,
 'buzzword': 11,
 'by': 12,
 'challenges': 13,
 'concerns': 14,
 'connection': 15,
 'considerable': 16,
 'crash': 17,
 'described': 18,
 'developed': 19,
 'development': 20,
 'economies': 21,
 'exaggerate': 22,
 'facilitated': 23,
 'facing': 24,
 'fiction': 25,
 'focused': 26,
 'for': 27,
 'from': 28,
 'futurism': 29,
 'game': 30,
 'has': 31,
 'have': 32,
 'headsets': 33,
 'hypothetical': 34,
 'in': 35,
 'including': 36,
 'industries': 37,
 'information': 38,
 'integration': 39,
 'interest': 40,
 'internet': 41,
 'involve': 42,
 'is': 43,
 'iteration': 44,
 'iterations': 45,
 'its': 46,
 'life': 47,
 'like': 48,
 'media': 49,
 'meta': 50,
 'metaverse': 51,
 'metaverses': 52,
 'network': 53,
 'novel': 54,
 'of': 55,
 'often': 56,
 'on': 57,
 'origins': 58,
 'physical': 59,
 'platforms': 60,
 'popular': 61,
 'portmanteau': 62,
 'privacy': 63,
 'pr

In [116]:
cv_new=CountVectorizer(min_df=2)
X_c_new = cv_new.fit_transform(text_data)
X_c_new.shape

(9, 20)

In [117]:
Docs = ['This is the first question.', 'This document is the second document.', 'And this is the third one' ]


In [118]:
cv2 = CountVectorizer(max_features=10)
X_c2= cv2.fit_transform(Docs)

In [121]:
print(X_c2.toarray())

[[0 0 1 1 0 1 0 1 0 1]
 [0 2 0 1 0 0 1 1 0 1]
 [1 0 0 1 1 0 0 1 1 1]]


In [122]:
import numpy as np

In [129]:
X = np.array([[72, 69 ,82], [ 9 ,79, 99], [20 ,47, 88], [80 ,64, 49]])
p=np.array([[0,0,0]])

In [124]:
def euclid(a,b):
  # Returns the squared Euclidean distance from each row of a to b (no square root is taken).
  return np.sum((a-b)**2,axis=1)

In [127]:
euclid(X,p)

array([16669, 16123, 10353, 12897])

In [130]:
from sklearn.neighbors import NearestNeighbors
np.random.seed(0)
Xclosest=NearestNeighbors().fit(X)
idx= Xclosest.kneighbors(p,1,return_distance=False)
X[idx[0]]

array([[20, 47, 88]])

In [131]:
def minkowsky(a,b,p):
  # Minkowski distance of order p; the absolute value keeps the result correct for odd p.
  d=np.sum(np.abs(a-b)**p,axis=1)
  return np.power(d,1/p)

In [132]:
X0 = np.array([1,0,0,0])
X= np.asarray([[1, 0, 0,0], [0, 1, 1,1],[1,2,0,0]])
p= 2

In [133]:
minkowsky(X0,X,p)

array([0., 2., 2.])

In [135]:
from scipy.spatial import minkowski_distance

dist = minkowski_distance(X0, X,2)
print(dist)

[0. 2. 2.]


In [136]:
from sklearn.datasets import load_digits
df= load_digits()

In [137]:
X=df.data
y=df.target

In [138]:
X.shape

(1797, 64)

In [139]:
unique,counts= np.unique(y,return_counts=True)

In [140]:
unique

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [142]:
np.column_stack((unique,counts))

array([[  0, 178],
       [  1, 182],
       [  2, 177],
       [  3, 183],
       [  4, 181],
       [  5, 182],
       [  6, 181],
       [  7, 179],
       [  8, 174],
       [  9, 180]])

In [143]:
X_train, X_test,y_train,y_test=train_test_split(X,y,test_size=0.2, random_state=10)

In [146]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
pipe_log = make_pipeline(StandardScaler(),LogisticRegression(multi_class='multinomial',
                                                             solver='sag'))
pipe_log.fit(X_train,y_train)



Pipeline(steps=[('standardscaler', StandardScaler()),
                ('logisticregression',
                 LogisticRegression(multi_class='multinomial', solver='sag'))])

In [149]:
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score
y_hat=pipe_log.predict(X_test)
accuracy_score(y_test,y_hat)

0.9694444444444444

In [150]:
from sklearn.metrics import f1_score
f1_score(y_test,y_hat,average='weighted')

0.9697233182569327

## Graded Assignment

In [151]:
import pandas as pd
df = pd.read_csv('/content/data_for_large_scale.csv')

In [152]:
df.head()

| | Feature-1 | Feature-2 | Feature-3 | Feature-4 | Feature-5 | Feature-6 | Feature-7 | Feature-8 | Feature-9 | Feature-10 | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.58 | 1.05 | 1.06 | -0.44 | 0.451 | -0.0348 | 0.643 | 0.265 | 0.268 | -0.851 | 84.7 |
| 1 | -0.832 | -0.866 | -1.34 | 0.138 | 1.18 | 0.733 | -1.41 | 0.135 | -0.088 | -1.55 | -211.0 |
| 2 | -0.237 | 2.09 | -3.93 | 0.296 | 0.352 | -0.501 | 0.961 | -0.0287 | 1.82 | 0.938 | -96.9 |
| 3 | -1.17 | -1.13 | -1.09 | 1.12 | 0.312 | 0.183 | 0.448 | -0.819 | -1.01 | -1.08 | -152.0 |
| 4 | 0.26 | -0.0273 | 0.925 | -1.15 | -1.39 | 0.0251 | 0.627 | 0.095 | -0.28 | -0.848 | -57.7 |


In [154]:
X=df.iloc[:,:10]
y=df.iloc[:,10]

In [156]:
X_array = X.to_numpy()
y_array=y.to_numpy()

In [157]:
X_array[:5]

array([[-1.58  ,  1.05  ,  1.06  , -0.44  ,  0.451 , -0.0348,  0.643 ,
         0.265 ,  0.268 , -0.851 ],
       [-0.832 , -0.866 , -1.34  ,  0.138 ,  1.18  ,  0.733 , -1.41  ,
         0.135 , -0.088 , -1.55  ],
       [-0.237 ,  2.09  , -3.93  ,  0.296 ,  0.352 , -0.501 ,  0.961 ,
        -0.0287,  1.82  ,  0.938 ],
       [-1.17  , -1.13  , -1.09  ,  1.12  ,  0.312 ,  0.183 ,  0.448 ,
        -0.819 , -1.01  , -1.08  ],
       [ 0.26  , -0.0273,  0.925 , -1.15  , -1.39  ,  0.0251,  0.627 ,
         0.095 , -0.28  , -0.848 ]])

In [165]:
X_train, X_test, y_train, y_test= train_test_split(X_array,y_array,train_size=0.9, random_state=10)

In [173]:
X_train, X_test = X_train.reshape(-1,90,10), X_test.reshape(-1,90,10)
y_train, y_test = y_train.reshape(-1,90), y_test.reshape(-1,90)

In [167]:
from sklearn.linear_model import SGDRegressor
regressor = SGDRegressor(random_state=10)
for i in range(X_train.shape[0]):
  X_batch, y_batch = X_train[i],y_train[i]
  regressor.partial_fit(X_batch, y_batch)

In [168]:
regressor.intercept_

array([-0.0051044])

In [169]:
regressor.coef_

array([51.32117195, 22.26580455, 81.23785419, 53.19578104, 76.46592847,
       71.47197842, 93.45078666, 51.92184073, 30.03844346, 40.95696488])

In [174]:
from sklearn.metrics import r2_score
regressor = SGDRegressor(random_state=10)
for i in range(X_train.shape[0]):
  X_batch, y_batch = X_train[i],y_train[i]
  regressor.partial_fit(X_batch, y_batch)
y_preds_li=[]
for j in range(X_test.shape[0]):
  y_pred = regressor.predict(X_test[j])
  y_preds_li.extend(y_pred.tolist())

In [176]:
r2_score(y_test.reshape(-1),y_preds_li)

0.9999919827709302

In [177]:
network_data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00246/3D_spatial_network.txt" 
df2=pd.read_csv(network_data_url, header=None)

In [178]:
df2.columns

Int64Index([0, 1, 2, 3], dtype='int64')

In [180]:
df2.head()

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 144552912 | 9.349849 | 56.740876 | 17.052772 |
| 1 | 144552912 | 9.350188 | 56.740679 | 17.61484 |
| 2 | 144552912 | 9.350549 | 56.740544 | 18.083536 |
| 3 | 144552912 | 9.350806 | 56.740485 | 18.279465 |
| 4 | 144552912 | 9.351053 | 56.740486 | 18.422974 |


In [181]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 434874 entries, 0 to 434873
Data columns (total 4 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   0       434874 non-null  int64  
 1   1       434874 non-null  float64
 2   2       434874 non-null  float64
 3   3       434874 non-null  float64
dtypes: float64(3), int64(1)
memory usage: 13.3 MB


In [183]:
scaler = StandardScaler()
chunksize = 20000
for train_df in pd.read_csv(network_data_url,
                            chunksize=chunksize,
                            iterator=True):
    xtrain_partial = train_df.iloc[:, 1:3]
    ytrain_partial = train_df.iloc[:, 3]
    scaler.partial_fit(xtrain_partial, ytrain_partial) 

In [184]:
chunksize = 20000
regressor = SGDRegressor(random_state=10)

iteration = 1
for train_df in pd.read_csv(network_data_url,
                            chunksize=chunksize,
                            iterator=True):

    xtrain_partial = train_df.iloc[:, 1:3]
    ytrain_partial = train_df.iloc[:, 3]
    # Note: the array returned by transform() is not assigned, so the
    # regressor below is fit on the unscaled features.
    scaler.transform(xtrain_partial)
    regressor.partial_fit(xtrain_partial, ytrain_partial)
    if iteration == 7:
      print("After iter #", iteration)
      print(regressor.intercept_)
      print(regressor.coef_)
      break
    iteration = iteration + 1

After iter # 7
[-3.3580229e+08]
[-1732780.23537162  6145752.04828986]
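
The very large intercept and coefficients are what one would expect when `SGDRegressor` sees unscaled features: `scaler.transform()` returns a new array and does not modify its input, so its result has to be kept and passed to `partial_fit()`. The sketch below shows that two-pass pattern on synthetic data. The data, the `batches` helper, and the batch size are assumptions for illustration, so the printed numbers are not comparable to the output above.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic features with large offsets and different scales.
X = rng.normal(loc=[1000.0, -500.0], scale=[50.0, 5.0], size=(6000, 2))
y = 0.03 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(scale=0.1, size=6000)


def batches(X, y, size=500):
    """Yield consecutive (X_batch, y_batch) slices (hypothetical helper)."""
    for start in range(0, len(X), size):
        yield X[start:start + size], y[start:start + size]


# Pass 1: accumulate the mean and variance batch by batch.
scaler = StandardScaler()
for X_batch, _ in batches(X, y):
    scaler.partial_fit(X_batch)

# Pass 2: scale each batch and train on the scaled copy.
regressor = SGDRegressor(random_state=10)
for X_batch, y_batch in batches(X, y):
    X_scaled = scaler.transform(X_batch)  # keep the returned array
    regressor.partial_fit(X_scaled, y_batch)

print(regressor.intercept_, regressor.coef_)
```

Running the scaler over the whole stream first means every batch in the second pass is standardized with the same statistics.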
