<a href="https://colab.research.google.com/github/noahgift/Python-MLOps-Cookbook/blob/main/Baseball_Predictions_Export_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ML Regression

This notebook is featured in the [Practical MLOps book by O'Reilly](https://learning.oreilly.com/library/view/practical-mlops/9781098103002/) as well as a Coursera + Duke course.

## Ingest

Source:  http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_MLB_HeightsWeights

In [34]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv("https://raw.githubusercontent.com/noahgift/functional_intro_to_python/master/data/mlb_weight_ht.csv")
df.head()

| | Name | Team | Position | Height(inches) | Weight(pounds) | Age |
|---|---|---|---|---|---|---|
| 0 | Adam_Donachie | BAL | Catcher | 74 | 180.0 | 22.99 |
| 1 | Paul_Bako | BAL | Catcher | 74 | 215.0 | 34.69 |
| 2 | Ramon_Hernandez | BAL | Catcher | 72 | 210.0 | 30.78 |
| 3 | Kevin_Millar | BAL | First_Baseman | 72 | 210.0 | 35.43 |
| 4 | Chris_Gomez | BAL | First_Baseman | 73 | 188.0 | 35.71 |


Check for missing values (N/A)

In [35]:
df.shape

(1034, 6)

In [36]:
df.isnull().values.any()

True

In [37]:
df = df.dropna()
df.isnull().values.any()

False

In [38]:
df.shape

(1033, 6)
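Before dropping rows, it can help to see which ones are incomplete. The following is a minimal sketch on a small made-up frame (the `demo` data is hypothetical, not the MLB dataset); it counts missing values per column, lists the incomplete rows, and then drops them the same way the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with one missing weight (not the MLB data).
demo = pd.DataFrame(
    {
        "Name": ["A", "B", "C"],
        "Height": [74, 72, 73],
        "Weight": [180.0, np.nan, 188.0],
    }
)

# Count missing values per column.
missing_per_column = demo.isnull().sum()

# Select only the rows that contain at least one missing value.
incomplete_rows = demo[demo.isnull().any(axis=1)]

# Drop them, as the notebook does with df.dropna().
cleaned = demo.dropna()

print(missing_per_column)
print(incomplete_rows)
print(cleaned.shape)
```

Inspecting `incomplete_rows` first makes it easier to decide whether dropping is acceptable or whether the missing value should be imputed instead.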

### Clean

In [39]:
df.rename(index=str, 
             columns={"Height(inches)": "Height", "Weight(pounds)": "Weight"},
             inplace=True)
df.head()


| | Name | Team | Position | Height | Weight | Age |
|---|---|---|---|---|---|---|
| 0 | Adam_Donachie | BAL | Catcher | 74 | 180.0 | 22.99 |
| 1 | Paul_Bako | BAL | Catcher | 74 | 215.0 | 34.69 |
| 2 | Ramon_Hernandez | BAL | Catcher | 72 | 210.0 | 30.78 |
| 3 | Kevin_Millar | BAL | First_Baseman | 72 | 210.0 | 35.43 |
| 4 | Chris_Gomez | BAL | First_Baseman | 73 | 188.0 | 35.71 |


## EDA

In [40]:
df.describe()

| | Height | Weight | Age |
|---|---|---|---|
| count | 1033.0 | 1033.0 | 1033.0 |
| mean | 73.698935 | 201.689255 | 28.737648 |
| std | 2.30633 | 20.991491 | 4.322298 |
| min | 67.0 | 150.0 | 20.9 |
| 25% | 72.0 | 187.0 | 25.44 |
| 50% | 74.0 | 200.0 | 27.93 |
| 75% | 75.0 | 215.0 | 31.24 |
| max | 83.0 | 290.0 | 48.52 |


## Modeling

In [41]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

### Select Feature

Use Weight to predict Height, so there is just one feature.


In [42]:
var = df['Height'].values
var.shape

(1033,)

In [43]:
y = df['Height'].values #Target
y = y.reshape(-1, 1)
X = df['Weight'].values #Feature(s)
X = X.reshape(-1,1)

In [44]:
X.shape

(1033, 1)

In [45]:
y.shape

(1033, 1)
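The `reshape(-1, 1)` calls are there because scikit-learn estimators expect the feature matrix as a 2-D array of shape `(n_samples, n_features)`. A minimal sketch with made-up values (the `weights` array is hypothetical):

```python
import numpy as np

# A 1-D array, like the result of df['Weight'].values.
weights = np.array([180.0, 215.0, 210.0])

# -1 lets NumPy infer the number of rows; 1 is the number of feature columns.
X_demo = weights.reshape(-1, 1)

print(weights.shape, X_demo.shape)
```

Selecting with double brackets, as in `df[['Weight']].values`, yields a 2-D array directly and avoids the separate reshape.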

### Split Data and Scale Data

The scaling workflow is shown step by step to make it easier to follow. Both `X` and `y` are standardized first, and the data is split afterwards.

In [46]:
scaler = StandardScaler()

In [47]:
X_scaler = scaler.fit(X)
X


array([[180.],
       [215.],
       [210.],
       ...,
       [205.],
       [190.],
       [195.]])

In [48]:
X = X_scaler.transform(X)
X

array([[-1.0337408 ],
       [ 0.6344091 ],
       [ 0.39610197],
       ...,
       [ 0.15779485],
       [-0.55712654],
       [-0.31881941]])
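`StandardScaler` standardizes each value as `z = (x - mean) / std`, using the population standard deviation. The scaled values above can be checked by hand from the `df.describe()` summary; this is a sketch that assumes those rounded summary numbers, and it converts the sample standard deviation that `describe()` reports into the population one.

```python
import math

n = 1033                  # rows after dropna
mean_w = 201.689255       # Weight mean from df.describe()
sample_std_w = 20.991491  # Weight std from df.describe() (sample std, ddof=1)

# StandardScaler divides by the population standard deviation (ddof=0).
pop_std_w = sample_std_w * math.sqrt((n - 1) / n)


def standardize(x):
    """Return the z-score of a weight in pounds."""
    return (x - mean_w) / pop_std_w


print(round(standardize(180.0), 4))  # about -1.0337, the first scaled value
print(round(standardize(215.0), 4))  # about 0.6344, the second scaled value
```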

In [49]:
y_scaler = StandardScaler().fit(y) #separate scaler: scaler.fit returns the same object, which would overwrite X_scaler
y

array([[74],
       [74],
       [72],
       ...,
       [75],
       [75],
       [73]])

In [50]:
y = y_scaler.transform(y)
y

array([[ 0.13060176],
       [ 0.13060176],
       [-0.73699706],
       ...,
       [ 0.56440117],
       [ 0.56440117],
       [-0.30319765]])

In [51]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(929, 1) (929, 1)
(104, 1) (104, 1)
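The notebook scales the full dataset before splitting, which lets test-set statistics influence the scalers. A common alternative is to split first and fit the scalers on the training portion only. The sketch below assumes synthetic data; `X_raw`, `y_raw`, `feature_scaler`, and `target_scaler` are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic weight/height-like data, for illustration only.
rng = np.random.default_rng(0)
X_raw = rng.normal(200, 20, size=(100, 1))
y_raw = 60 + 0.07 * X_raw + rng.normal(0, 2, size=(100, 1))

# Split first; random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.1, random_state=42
)

# Fit one scaler per variable on the training data only.
feature_scaler = StandardScaler().fit(X_tr)
target_scaler = StandardScaler().fit(y_tr)

# Apply the training statistics to both partitions.
X_tr_s = feature_scaler.transform(X_tr)
X_te_s = feature_scaler.transform(X_te)
y_tr_s = target_scaler.transform(y_tr)
y_te_s = target_scaler.transform(y_te)

print(X_tr_s.shape, X_te_s.shape)
```

Using two separate `StandardScaler` instances also keeps the feature and target statistics from overwriting each other.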


### Fit the model

In [52]:
from sklearn.linear_model import Ridge
clf = Ridge()
model = clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

In [53]:
y_test.shape

(104, 1)

In [54]:
predictions.shape

(104, 1)

Convert the predictions, which are still in scaled units, to a DataFrame. Note that the cell then describes the original `df`, not `df_predictions`.

In [55]:
df_predictions = pd.DataFrame(predictions)
df.describe()

| | Height | Weight | Age |
|---|---|---|---|
| count | 1033.0 | 1033.0 | 1033.0 |
| mean | 73.698935 | 201.689255 | 28.737648 |
| std | 2.30633 | 20.991491 | 4.322298 |
| min | 67.0 | 150.0 | 20.9 |
| 25% | 72.0 | 187.0 | 25.44 |
| 50% | 74.0 | 200.0 | 27.93 |
| 75% | 75.0 | 215.0 | 31.24 |
| max | 83.0 | 290.0 | 48.52 |


### Plot Predictions

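Inverse-transform the scaled values back to inches so they are easier to interpret. Note that this cell inverse-transforms the full scaled target `y` rather than `predictions`, so the summary reproduces the original Height statistics; no plot is produced.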

In [56]:
df_inverse_scaled_prediction = pd.DataFrame(y_scaler.inverse_transform(y))
df_inverse_scaled_prediction.describe()

| | 0 |
|---|---|
| count | 1033.0 |
| mean | 73.698935 |
| std | 2.30633 |
| min | 67.0 |
| 25% | 72.0 |
| 50% | 74.0 |
| 75% | 75.0 |
| max | 83.0 |
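To view the model's predictions in inches, the same inverse transform can be applied to the predictions themselves. The sketch below assumes synthetic data and, for brevity, predicts on the data it was trained on; all names (`weight`, `height`, `x_scaler`, `h_scaler`, `comparison`) are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic weight/height-like data, for illustration only.
rng = np.random.default_rng(1)
weight = rng.normal(200, 20, size=(200, 1))
height = 60 + 0.07 * weight + rng.normal(0, 2, size=(200, 1))

# Separate scalers for the feature and the target.
x_scaler = StandardScaler().fit(weight)
h_scaler = StandardScaler().fit(height)

# Fit on scaled data; predictions come back in scaled units.
ridge = Ridge().fit(x_scaler.transform(weight), h_scaler.transform(height))
scaled_pred = ridge.predict(x_scaler.transform(weight))

# Inverse-transform the predictions with the target scaler to get inches.
comparison = pd.DataFrame(
    {
        "actual_height": height.ravel(),
        "predicted_height": h_scaler.inverse_transform(scaled_pred).ravel(),
    }
)
print(comparison.describe())
```

A frame like `comparison` can then be passed to a scatter plot of actual versus predicted height.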


### Print R² Score of the Ridge Regression Model

In [58]:
model.score(X_test, y_test)

0.26073908792710265
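For scikit-learn regressors, `score` returns the coefficient of determination (R²) rather than a classification accuracy. R² is defined as `1 - SS_res / SS_tot`; the sketch below computes it by hand on made-up numbers (`r2_score_manual` is a hypothetical helper, not a library function).

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0, 4.0]
close = [1.1, 1.9, 3.2, 3.8]     # small errors, so R² is near 1
baseline = [2.5, 2.5, 2.5, 2.5]  # always predicts the mean, so R² is 0

print(r2_score_manual(actual, close))
print(r2_score_manual(actual, baseline))
```

Read this way, a score of about 0.26 means Weight alone explains roughly a quarter of the variance in Height on the test split.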

### Export Model

In [59]:
import joblib

In [60]:
joblib.dump(model, '../models/model.joblib')

['../models/model.joblib']

### Verify Model Import Feedback Loop

In [62]:
clf_disk = joblib.load("../models/model.joblib")
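The exported file contains only the fitted `Ridge` estimator, so whoever loads it must rebuild the scalers separately. One possible alternative, sketched here on synthetic data, bundles feature scaling, target scaling, and the regressor into a single object before saving. `TransformedTargetRegressor` and `make_pipeline` are scikit-learn utilities, but using them is an assumption of this sketch, not something the notebook does.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic weight/height-like data, for illustration only.
rng = np.random.default_rng(2)
weight = rng.normal(200, 20, size=(200, 1))
height = 60 + 0.07 * weight.ravel() + rng.normal(0, 2, size=200)

# The pipeline scales the feature; the wrapper scales and unscales the target.
bundle = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), Ridge()),
    transformer=StandardScaler(),
)
bundle.fit(weight, height)

# Round-trip through joblib in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(bundle, path)
    restored = joblib.load(path)

# Raw pounds in, inches out; no manual scaling needed.
sample = np.array([[205.0]])
original_pred = bundle.predict(sample)
restored_pred = restored.predict(sample)
print(original_pred, restored_pred)
```

Comparing the two predictions confirms that the object loaded from disk behaves the same as the one in memory.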

### Test Predict From Model Loaded From Disk

View data

In [63]:
df.tail()

| | Name | Team | Position | Height | Weight | Age |
|---|---|---|---|---|---|---|
| 1029 | Brad_Thompson | STL | Relief_Pitcher | 73 | 190.0 | 25.08 |
| 1030 | Tyler_Johnson | STL | Relief_Pitcher | 74 | 180.0 | 25.73 |
| 1031 | Chris_Narveson | STL | Relief_Pitcher | 75 | 205.0 | 25.19 |
| 1032 | Randy_Keisler | STL | Relief_Pitcher | 75 | 190.0 | 31.01 |
| 1033 | Josh_Kinney | STL | Relief_Pitcher | 73 | 195.0 | 27.92 |


Get one observation, Chris_Narveson, and keep only the Weight column. Because `dropna()` removed one earlier row, he sits at positional index 1030 (index label 1031).

In [64]:
pX = df.iloc[[1030]][["Weight"]].values #Feature(s)
pX = pX.reshape(-1,1)
pX

array([[205.]])

Scale Input

In [65]:
target = df["Weight"].values #despite the name, this is the feature column (Weight) used to fit the input scaler
target = target.reshape(-1, 1)
target

array([[180.],
       [215.],
       [210.],
       ...,
       [205.],
       [190.],
       [195.]])

In [66]:
import numpy as np
input_scaler = StandardScaler().fit(target) #scale relative to the values in the df
scaled_input = input_scaler.transform(pX)
np.array2string(scaled_input, formatter={'float_kind':'{0:.3f}'.format})

'[[0.158]]'

Inverse Transform Predicted Height

In [67]:
result = clf_disk.predict(scaled_input) #use the model loaded from disk
print(f"Scaled prediction {result.tolist()[0]}")
transformed_prediction = y_scaler.inverse_transform(result) #Note the y_scaler is the target scaler
print(f"transformed_prediction {transformed_prediction.tolist()[0]}")

Scaled prediction [0.08282752662941836]
transformed_prediction [73.88987022495702]
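The inverse transform is the standardization formula run backwards: `x = z * std + mean`. As a rough check, the final value can be reproduced by hand from the Height summary in `df.describe()`; this sketch assumes those rounded summary numbers and converts the reported sample standard deviation to the population one that `StandardScaler` uses.

```python
import math

n = 1033                # rows after dropna
mean_h = 73.698935      # Height mean from df.describe()
sample_std_h = 2.30633  # Height std from df.describe() (sample std, ddof=1)

# StandardScaler uses the population standard deviation (ddof=0).
pop_std_h = sample_std_h * math.sqrt((n - 1) / n)

# The model's raw output for the 205-pound input, in standardized units.
scaled_prediction = 0.08282752662941836

# Undo the standardization to get inches.
height_inches = scaled_prediction * pop_std_h + mean_h
print(round(height_inches, 2))  # about 73.89
```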
