## First Online Learning in Python

In this tutorial, we will build our first online learning model in Python. We will use demo data that we split manually into batches and treat as if they were arriving in real time.

You can find the data for this tutorial [**here**](https://drive.google.com/file/d/1WeylXMG4JZ_wxyqjz6MgjwpMI4ZVkdFz/view?usp=sharing). It's an NBA dataset we used in the exercises for probability and statistics. It contains the statistics and result of each NBA game in three regular seasons, 2013-2015.

We will use statistics from regular season games to predict whether teams won or lost in the playoffs. You can find the playoff dataset [**here**](https://drive.google.com/file/d/15cx7LsopbCZ9WQ5CbGZHK_Dp-IPDrRqF/view?usp=sharing).

In [1]:
# import required packages
import pandas as pd
import numpy as np

In [4]:
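# Load the data and keep only the columns we will need in this tutorial
data_path = "/Users/jurajkapasny/Data/NBA/"
# NOTE: this cell should load the regular-season file from the first link above.
# The file name below is the playoff file, so only 2016 dates are loaded and the 2013 filter further down returns no rows.
df = pd.read_csv("nba_playoff_games_2016.csv",sep=";")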
# we want to keep only these statistics
cols_to_keep = ['GAME_DATE','WL', 'PTS', 'FGM', 'FGA', 'FG_PCT',
       'FG3M', 'FG3A', 'FG3_PCT', 'FTM', 'FTA', 'FT_PCT', 'OREB', 'DREB',
       'REB', 'AST', 'STL', 'BLK', 'TOV', 'PF']
df = df[cols_to_keep]
# convert to datetime
df['GAME_DATE'] = pd.to_datetime(df['GAME_DATE'])
df = df.sort_values('GAME_DATE')

In [5]:
df.head()

Unnamed: 0,GAME_DATE,WL,PTS,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,FTA,FT_PCT,OREB,DREB,REB,AST,STL,BLK,TOV,PF
171,2016-04-16,L,101,37,102,0.363,11,35,0.314,16,19,0.842,15,35,50,27,6,6,10,32
164,2016-04-16,W,100,34,79,0.43,11,21,0.524,21,29,0.724,9,29,38,19,11,8,13,25
165,2016-04-16,W,104,39,91,0.429,10,25,0.4,16,24,0.667,15,38,53,26,10,7,15,22
166,2016-04-16,L,90,30,79,0.38,4,19,0.211,26,38,0.684,20,32,52,17,6,5,19,27
170,2016-04-16,L,70,25,84,0.298,4,18,0.222,16,26,0.615,9,24,33,16,5,0,12,25


For the initial training, we will use only games from 2013.

In [6]:
first_train = df[df.GAME_DATE.dt.year == 2013]
first_train = first_train.drop("GAME_DATE", axis = 1)

Now we need to prepare our target variable, `WL`. We will convert it into 0 and 1 using `LabelEncoder()`.

In [7]:
from sklearn.preprocessing import LabelEncoder

In [8]:
le = LabelEncoder()
first_train.WL = le.fit_transform(first_train.WL.values)
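
To see which number each label gets, here is a minimal sketch on a hypothetical toy array (not the tutorial data). `LabelEncoder` sorts the classes it finds, so `L` should map to 0 and `W` to 1. The names `toy_labels`, `le_demo`, and `toy_mapping` are made up for this illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels, only to illustrate the mapping
toy_labels = np.array(["W", "L", "L", "W"])

le_demo = LabelEncoder()
toy_encoded = le_demo.fit_transform(toy_labels)

# classes_ holds the sorted labels; the position of a label is its encoded value
toy_mapping = {str(label): int(code) for code, label in enumerate(le_demo.classes_)}
print(toy_mapping)
print(toy_encoded)
```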

In [9]:
# Let's see what our dataset looks like now:
first_train.head()

Unnamed: 0,WL,PTS,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,FTA,FT_PCT,OREB,DREB,REB,AST,STL,BLK,TOV,PF


In [10]:
# Extract y and X from the dataset
y_first = first_train.WL
X_first = first_train.drop("WL", axis = 1)

### Prepare Test Set

In [11]:
# Load the data and keep only the columns we will need in this tutorial
play_offs = pd.read_csv("nba_playoff_games_2016.csv",sep=";")
# we want to keep only these statistics
cols_to_keep = ['WL', 'PTS', 'FGM', 'FGA', 'FG_PCT',
       'FG3M', 'FG3A', 'FG3_PCT', 'FTM', 'FTA', 'FT_PCT', 'OREB', 'DREB',
       'REB', 'AST', 'STL', 'BLK', 'TOV', 'PF']
play_offs = play_offs[cols_to_keep]

In [12]:
play_offs.head()

Unnamed: 0,WL,PTS,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,FTA,FT_PCT,OREB,DREB,REB,AST,STL,BLK,TOV,PF
0,W,93,33,82,0.402,6,25,0.24,21,25,0.84,9,39,48,17,7,6,11,15
1,L,89,32,83,0.386,15,41,0.366,10,13,0.769,7,32,39,22,7,5,10,23
2,L,101,33,82,0.402,15,39,0.385,20,29,0.69,9,26,35,19,5,3,14,25
3,W,115,40,77,0.519,10,27,0.37,25,32,0.781,8,37,45,24,12,7,10,25
4,W,112,44,83,0.53,10,24,0.417,14,23,0.609,8,33,41,15,11,9,16,22


> #### Warning
> It's important to use ONLY `.transform()` for the LabelEncoder here. We don't want to accidentally end up with different numbers for W and L.

In [13]:
play_offs.WL = le.transform(play_offs.WL)

ValueError: y contains previously unseen labels: 'W'
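
`LabelEncoder.transform()` raises this `ValueError` whenever it meets a label that was not present when the encoder was fitted. The sketch below reproduces the behaviour with a hypothetical encoder (`le_partial`) that was fitted on a single class. It is an illustration only, not the tutorial's encoder.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical encoder that only ever saw the label "L"
le_partial = LabelEncoder().fit(["L", "L"])

try:
    le_partial.transform(["W"])  # "W" was never seen during fit
    unseen_error = None
except ValueError as err:
    unseen_error = err

print(le_partial.classes_)
print(unseen_error)
```

If you hit this error, inspecting `le.classes_` is a quick check: an empty array means the encoder was fitted on an empty column.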

In [None]:
# Extract y and X from the dataset
y_test = play_offs.WL
X_test = play_offs.drop("WL", axis = 1)

### Modeling

We will use the **Stochastic Gradient Descent Classifier (SGDClassifier)**. The main difference from most other methods is that it optimizes its coefficients using only one observation at a time (stochastic gradient descent). It therefore takes more iterations to reach results comparable to a classic ridge or lasso regression, but it requires much less memory.
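
The idea of updating the coefficients from one observation at a time can be sketched in plain NumPy. This is a simplified illustration for logistic loss with a fixed learning rate and no regularization. It is not how `SGDClassifier` is implemented internally, and the function name `sgd_logistic_step` and the synthetic data are assumptions made for the example.

```python
import numpy as np

def sgd_logistic_step(w, b, x, y, lr=0.1):
    """One SGD update for logistic regression using a single observation."""
    p = 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))  # predicted probability of class 1
    grad = p - y                                   # gradient of the log loss w.r.t. the logit
    return w - lr * grad * x, b - lr * grad

# Hypothetical synthetic data: class 1 when the two features sum to a positive number
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

w, b = np.zeros(2), 0.0
for _ in range(5):                      # a few passes over the data
    for x_i, y_i in zip(X_demo, y_demo):
        w, b = sgd_logistic_step(w, b, x_i, y_i)

demo_accuracy = np.mean(((X_demo @ w + b) > 0).astype(int) == y_demo)
print(w, b, demo_accuracy)
```

Only the current observation and the coefficient vector are needed for each update, which is why the memory footprint stays small.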

> #### Note
> SGD is sensitive to the scale of the variables, and that's not just because of regularization; it's because of the way it works internally. Consequently, we should always standardize our features (for instance, with StandardScaler) or force them into the range [0,+1] or [-1,+1]. We will get poorer results if we don't do this.
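
As a small illustration of the note above, the sketch below fits a `StandardScaler` on a hypothetical training batch and reuses the same fitted scaler for a later batch. The arrays are made-up values, not rows from the NBA data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales (e.g. points and a shooting percentage)
X_train_demo = np.array([[100.0, 0.40], [90.0, 0.35], [110.0, 0.50]])
X_new_demo = np.array([[105.0, 0.45]])

scaler_demo = StandardScaler().fit(X_train_demo)   # learn mean and std from training data only
train_scaled = scaler_demo.transform(X_train_demo)
new_scaled = scaler_demo.transform(X_new_demo)     # reuse the same statistics for new data

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(new_scaled)
```

Every later batch, including the test set, should pass through the same fitted scaler so that all inputs to the model share one scale.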

In [None]:
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
# metrics for evaluation
from sklearn.metrics import accuracy_score, recall_score, precision_score

#### Scaling

In [None]:
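scaling = StandardScaler()
scaling.fit(X_first)
X_first = scaling.transform(X_first)
# the test set must go through the same fitted scaler before we predict on it
X_test = scaling.transform(X_test)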

In [None]:
type(y_first)

#### First Model

In [None]:
# loss='log' was renamed to 'log_loss' in scikit-learn 1.1 (and removed in 1.3)
SGD = SGDClassifier(loss='log_loss')
# we will use the .partial_fit() method. This will allow us to train on new data incrementally.
# When using online learning, we need to specify the final list of classes. 
# It might happen that we miss some classes in the first batch of data.
SGD.partial_fit(X_first, y_first, classes=np.unique(y_first))
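
The sketch below shows why the `classes` argument matters, assuming a hypothetical first batch that happens to contain only one class. The data and the names `clf_demo`, `X_a`, and `X_b` are made up for the illustration.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X_a = rng.normal(size=(20, 3))
y_a = np.zeros(20, dtype=int)            # first batch: only class 0 is present
X_b = rng.normal(loc=2.0, size=(20, 3))
y_b = np.ones(20, dtype=int)             # second batch: class 1 shows up

# Without `classes`, the very first partial_fit call is rejected
try:
    SGDClassifier(random_state=0).partial_fit(X_a, y_a)
    missing_classes_error = None
except ValueError as err:
    missing_classes_error = err

# Declaring all classes up front lets later batches introduce the missing class
clf_demo = SGDClassifier(random_state=0)
clf_demo.partial_fit(X_a, y_a, classes=np.array([0, 1]))
clf_demo.partial_fit(X_b, y_b)           # `classes` is only needed on the first call
print(clf_demo.classes_)
```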

In [None]:
print("Accuracy")
print(accuracy_score(y_test, SGD.predict(X_test)))
print("")
print("Precision")
print(precision_score(y_test, SGD.predict(X_test)))
print("")
print("Recall")
print(recall_score(y_test, SGD.predict(X_test)))

We can see our model is not good. The good precision is caused by the very small number of cases where we actually predicted `WIN`. Results can also differ between runs because stochastic gradient descent is randomized; setting `random_state` makes them reproducible.
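
A toy example, with made-up labels rather than the tutorial's predictions, shows how a model that rarely predicts a win can still score a high precision while recall stays low:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels: 1 = win, 0 = loss
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_demo = [1, 0, 0, 0, 0, 0, 0, 0]   # the model predicts a win only once

demo_precision = precision_score(y_true_demo, y_pred_demo)  # 1 correct out of 1 predicted win
demo_recall = recall_score(y_true_demo, y_pred_demo)        # 1 found out of 4 actual wins
demo_acc = accuracy_score(y_true_demo, y_pred_demo)         # 5 correct out of 8 games
print(demo_precision, demo_recall, demo_acc)
```

This is why precision should be read together with recall rather than on its own.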

Now, let's see if we can improve the model when we have new data.

### Online Learning
We will add new data day by day and look for improvements in our model.

In [None]:
# Let's extract the rest of the data
other_data = df[df.GAME_DATE.dt.year != 2013]

In [None]:
# We will put all unique dates into the list
all_dates = list(other_data.GAME_DATE.unique())

In [None]:
# Let's test if this works
df[df.GAME_DATE == all_dates[0]]
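
An alternative way to obtain the daily batches, sketched here on a hypothetical toy frame (`toy_games`), is to let `groupby` split the rows by date. This is only an illustration of the idea and does not replace the filtering used in this tutorial.

```python
import pandas as pd

# Hypothetical toy frame with a few games on three different days
toy_games = pd.DataFrame({
    "GAME_DATE": pd.to_datetime(["2014-01-02", "2014-01-01", "2014-01-02", "2014-01-03"]),
    "PTS": [101, 95, 88, 110],
})

# groupby sorts the group keys, so the batches come out in chronological order
daily_batches = [
    (day, batch.drop("GAME_DATE", axis=1))
    for day, batch in toy_games.groupby("GAME_DATE")
]

for day, batch in daily_batches:
    print(day.date(), len(batch))
```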

### Simulation

In [None]:
acc = list()
precision = list()
recall = list()
for day in all_dates:
    train = df[df.GAME_DATE == day]
    train = train.drop("GAME_DATE", axis = 1)
    # Extract y and X from the dataset (encode labels with the already fitted encoder)
    y_train = le.transform(train.WL)
    X_train = train.drop("WL", axis = 1)
    

    X_train = scaling.transform(X_train)
    
    # partial fit on new data
    SGD.partial_fit(X_train, y_train)
    # storing improvements (if any :)); predict once and reuse the result
    y_pred = SGD.predict(X_test)
    acc.append(accuracy_score(y_test, y_pred))
    precision.append(precision_score(y_test, y_pred, zero_division=0))
    recall.append(recall_score(y_test, y_pred))

#### Visualization of performance over the iterations

In [None]:
# Accuracy
import matplotlib.pyplot as plt
plt.subplot(1,2,1)
plt.plot(range(1,21),np.abs(acc[:20]),'o--')
plt.xlabel('Partial fit initial iterations')
plt.ylabel('Test set accuracy')
plt.title("Accuracy First 20 Iterations")
plt.subplot(1,2,2)
plt.plot(range(0,len(acc),50),np.abs(acc[0:len(acc):50]),'o--')
plt.xlabel('Partial fit ending iterations')
plt.title("Accuracy Overall")
plt.show()

In [None]:
# Precision
import matplotlib.pyplot as plt
plt.subplot(1,2,1)
plt.plot(range(1,21),np.abs(precision[:20]),'o--')
plt.xlabel('Partial fit initial iterations')
plt.ylabel('Test set precision')
plt.title("Precision First 20 Iterations")
plt.subplot(1,2,2)
plt.plot(range(0,len(precision),50),np.abs(precision[0:len(precision):50]),'o--')
plt.xlabel('Partial fit ending iterations')
plt.title("Precision Overall")
plt.show()

In [None]:
# Recall
import matplotlib.pyplot as plt
plt.subplot(1,2,1)
plt.plot(range(1,21),np.abs(recall[:20]),'o--')
plt.xlabel('Partial fit initial iterations')
plt.ylabel('Test set recall')
plt.title("Recall First 20 Iterations")
plt.subplot(1,2,2)
plt.plot(range(0,len(recall),50),np.abs(recall[0:len(recall):50]),'o--')
plt.xlabel('Partial fit ending iterations')
plt.title("Recall Overall")
plt.show()

## Conclusion

We see that we were able to improve our original performance. However, the best model is somewhere in the middle of all iterations.

> #### Warning
> We need to be careful because new data doesn't always mean a better model. We should always test a new version and replace the old one only if there is an improvement.
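
One possible way to follow this advice is sketched below: train a copy of the current model on the new batch and keep it only if it scores better on held-out data. The helper `update_if_better`, the choice of accuracy as the metric, and the synthetic data are all assumptions made for this illustration.

```python
import copy
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

def update_if_better(model, X_new, y_new, X_val, y_val):
    """Train a copy on the new batch; keep it only if validation accuracy improves."""
    candidate = copy.deepcopy(model)          # the current model stays untouched
    candidate.partial_fit(X_new, y_new)
    old_score = accuracy_score(y_val, model.predict(X_val))
    new_score = accuracy_score(y_val, candidate.predict(X_val))
    if new_score > old_score:
        return candidate, new_score
    return model, old_score

# Hypothetical synthetic data: the label depends on the sign of the first feature
rng = np.random.default_rng(0)
X_all = rng.normal(size=(300, 4))
y_all = (X_all[:, 0] > 0).astype(int)
X_val, y_val = X_all[200:], y_all[200:]

current = SGDClassifier(random_state=0)
current.partial_fit(X_all[:20], y_all[:20], classes=np.array([0, 1]))
baseline = accuracy_score(y_val, current.predict(X_val))

current, best_score = update_if_better(current, X_all[20:200], y_all[20:200], X_val, y_val)
print(baseline, best_score)
```

Because the candidate is a deep copy, a batch that makes the model worse is simply discarded and the previous version stays in use.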