# Pipelines and Model Persistence - W07D1
### Instructor: Eric Elmoznino

## Overview - Persistence
- Motivation
- Pickle
- Joblib
- Saving parameters only

---
## Motivation
Serialization is the process of converting a program entity into a stream of bytes that can be saved as a file.
There are two primary reasons why we might want to save (and later load) a trained model:
- Avoid redundant training. Models can take a long time to train, and data can take a long time to load and process.
- Deployment into an application (keep model training/deployment code separate)

---
## Pickle
- Pickling is the process where a Python object is converted into a byte stream (usually not human readable).
- Unpickling is the reverse operation, where a byte stream is converted back into a working Python object.
- Pickling is the simplest way to store the object from a coding perspective.
- The Python `pickle` module is a general-purpose way of storing objects.
    - It can store almost *any* Python object, not just Sklearn models (see the limitations below).
    
#### Features
- Store/load dictionaries and lists.
- Store/load the attributes of arbitrary data types (i.e. instances of your own classes).
- Do this recursively, so that if your object has attributes that are
themselves instances of other classes, they are saved just as easily.
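
As a minimal sketch of the recursive behaviour described above, the snippet below pickles an object whose attributes include another custom object, a dictionary, and a list. The `Scaler` and `Model` classes are hypothetical and exist only for this illustration.

```python
import pickle


class Scaler:
    """Hypothetical helper class holding a single learned value."""

    def __init__(self, factor):
        self.factor = factor


class Model:
    """Hypothetical class whose attributes include another object."""

    def __init__(self):
        self.scaler = Scaler(factor=2.5)        # nested custom object
        self.settings = {"layers": [4, 8, 1]}   # nested dict and list


original = Model()

# Round-trip through an in-memory bytes object
restored = pickle.loads(pickle.dumps(original))

# The nested attributes come back with the same values
print(restored.scaler.factor)
print(restored.settings)
```

The restored object is a separate copy: changing `restored.settings` afterwards does not affect `original`.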

#### Limitations
- Does not save the *code* of an object, only its attribute values. The class definition must be importable when the object is loaded.
- Cannot save file handles or connection sockets.
- Pickle is **version-dependent**. For example, if you save a model with one version
of Sklearn and then try to load it with a different one (e.g. after an update), there may be issues.
    - This is another motivation for using virtual environments, which can be containerized.
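
The file-handle limitation can be seen in a short sketch like the one below. The `record` dictionary and the temporary file name are made up for this illustration. One possible workaround, shown at the end, is to store the file *path* (plain data) rather than the open handle.

```python
import os
import pickle
import tempfile

# Create a temporary file so there is a real handle to work with
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "log.txt")

with open(path, "w") as handle:
    record = {"name": "experiment", "log": handle}   # dict holding an open handle
    try:
        pickle.dumps(record)
        pickled_ok = True
    except TypeError as err:
        # Open file objects cannot be pickled
        pickled_ok = False
        print("Could not pickle:", err)

# Storing the path (plain data) instead of the handle works fine
record = {"name": "experiment", "log_path": path}
restored = pickle.loads(pickle.dumps(record))
```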

#### Saving procedure
```python
import pickle        # Built-in python module

# Create some object and manipulate it in some way (e.g. train the model)
myobj = SomeClass(...)
myobj = myobj.some_method(...)

# Save to a file using Pickle
with open('myfile.pickle', 'wb') as file_handle:
    pickle.dump(myobj, file_handle)
```

#### Loading procedure
```python
import pickle        # Built-in python module

# Load from a file using Pickle
with open('myfile.pickle', 'rb') as file_handle:
    myobj = pickle.load(file_handle)    # myobj will be an instance of SomeClass
```

#### Methods
The pickle module provides four main functions:
- `dump()`: serializes to an open file (or file-like object).
- `dumps()`: serializes to a `bytes` object.
- `load()`: deserializes from an open file (or file-like object).
- `loads()`: deserializes from a `bytes` object.
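
To compare the four functions side by side, here is a small sketch using a plain dictionary as a stand-in for a model. An in-memory `io.BytesIO` buffer plays the role of the open file, which is an assumption made only to keep the example self-contained.

```python
import io
import pickle

data = {"weights": [0.1, 0.2, 0.3], "bias": -1.0}

# dumps()/loads(): serialize to and from a bytes object in memory
payload = pickle.dumps(data)
from_bytes = pickle.loads(payload)

# dump()/load(): the same round trip through a file-like object
buffer = io.BytesIO()
pickle.dump(data, buffer)
buffer.seek(0)                      # rewind before reading back
from_buffer = pickle.load(buffer)

print(type(payload).__name__)
print(from_bytes == from_buffer == data)
```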

### Example

```python
import pickle
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
df = pd.read_csv(url, names=names)
X = df.drop(columns='class')
y = df['class']

pipeline = Pipeline(steps=[('scaling', StandardScaler()),
                           ('pca', PCA(n_components=3)),
                           ('classifier', LogisticRegression())])
pipeline.fit(X, y)
```

```
Pipeline(steps=[('scaling', StandardScaler()), ('pca', PCA(n_components=3)),
                ('classifier', LogisticRegression())])
```

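```python
import os

# Make sure the output directory exists before writing to it
os.makedirs('saved_models', exist_ok=True)

# Save the model
with open('saved_models/pipeline.pickle', 'wb') as f:
    pickle.dump(pipeline, f)

# Load the model
with open('saved_models/pipeline.pickle', 'rb') as f:
    pipeline_loaded = pickle.load(f)

# The loaded classifier has the same coefficients as the original
assert (pipeline.steps[2][1].coef_ == pipeline_loaded.steps[2][1].coef_).all()
```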

---
## Joblib
Joblib is an alternative serialization module to Pickle. Its main advantage over Pickle
is that it is faster and more efficient at saving large `numpy` arrays.

<sub>*Note: Starting with Python 3.8, Pickle is reported to be at least as good as Joblib for saving `numpy` arrays.
    If you have Python >=3.8, Pickle is usually sufficient. [Source](https://stackoverflow.com/a/12617603).*</sub>
    
#### Saving procedure
```python
import joblib

# Create some object and manipulate it in some way (e.g. train the model)
myobj = SomeClass(...)
myobj = myobj.some_method(...)

# Save to a file using Joblib
joblib.dump(myobj, file_path)
```

#### Loading procedure
```python
import joblib

# Load from a file using Joblib
myobj = joblib.load(file_path)    # myobj will be an instance of SomeClass
```
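
As a rough sketch of Joblib with a large array, the snippet below saves a dictionary containing a `numpy` array twice, once as-is and once with the optional `compress` argument of `joblib.dump`. The `model_like` dictionary and the file names are hypothetical stand-ins for a trained model, and an all-zeros array is used because it compresses very well.

```python
import os
import tempfile

import joblib
import numpy as np

# Hypothetical stand-in for a trained model: a dict holding a large array
model_like = {"weights": np.zeros((1000, 100)), "name": "demo"}

tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "model.joblib")
small_path = os.path.join(tmp_dir, "model_compressed.joblib")

joblib.dump(model_like, plain_path)                # no compression
joblib.dump(model_like, small_path, compress=3)    # compression level 3

restored = joblib.load(small_path)

# Compare the file sizes in bytes
print(os.path.getsize(plain_path), os.path.getsize(small_path))
```

Compression trades smaller files for slower saving and loading, so whether it is worthwhile depends on the model.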

### Example

```python
import joblib

# Save the model
joblib.dump(pipeline, 'saved_models/pipeline.can')

# Load the model
pipeline_loaded = joblib.load('saved_models/pipeline.can')

assert (pipeline.steps[2][1].coef_ == pipeline_loaded.steps[2][1].coef_).all()
```

---
## Saving parameters only
Oftentimes, we don't want to save the entire Python class. What we really might want to save
could just be the model parameters. This can have some advantages:
- Faster to save, smaller on disk, and simpler to store, since parameters are usually just numpy arrays.
- Fewer version issues, since you are just saving numbers. If the Sklearn developers change something
about the model class, you can still load the parameter arrays you saved.
- Framework-independent (i.e. you don't have to load the parameters back into Sklearn, or even Python).
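
To illustrate the framework-independence point, here is a minimal sketch that makes predictions for a binary linear classifier using only `numpy`. The parameter values and `X_new` are made up for the example. In practice they would come from saved arrays, and the mapping to labels 0 and 1 is an assumption about how the classes were encoded.

```python
import numpy as np

# Hypothetical parameters, standing in for arrays restored from disk
coef = np.array([[0.5, -1.0]])      # shape (1, n_features)
intercept = np.array([0.25])

# Two new samples with two features each
X_new = np.array([[2.0, 0.5],
                  [0.0, 1.0]])

# Linear decision function: one score per sample
scores = X_new @ coef.T + intercept              # shape (n_samples, 1)

# Positive score -> class 1, otherwise class 0 (assumed label encoding)
predictions = (scores.ravel() > 0).astype(int)
print(predictions)
```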

Of course, there are also some disadvantages:
- More convoluted process programmatically
- Models might have many different parameter attributes
- Models might have complex architectures that link their parameters
- You cannot easily save an entire workflow (e.g. an Sklearn pipeline)
- When loading, you have to recreate the model and define it the same way as when you trained it,
before restoring the parameters

#### Saving procedure
```python
import numpy as np

# Create some object and manipulate it in some way (e.g. train the model)
myobj = SomeClass(param1=a, param2=b, ...)
myobj = myobj.some_method(...)

# Save parameters to a file using numpy
np.save(file_path, myobj.some_parameters)
np.save(...)    # Save all the different parameters (e.g. weights and intercept) separately
```

#### Loading procedure
```python
import numpy as np

# Recreate the model exactly as you did when training it
myobj = SomeClass(param1=a, param2=b, ...)

# Load parameters from a file using numpy
params = np.load(file_path)

# Replace the model's parameters
myobj.some_parameters = params
```
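
When a model has several parameter arrays, one alternative to a separate `np.save` call per array is `np.savez`, which stores multiple named arrays in a single `.npz` file. This is a sketch of that alternative, not part of the procedure above. The arrays and the key names `coef` and `intercept` are hypothetical.

```python
import os
import tempfile

import numpy as np

# Hypothetical parameter arrays standing in for a trained model's attributes
coef = np.array([[0.5, -1.0, 2.0]])
intercept = np.array([0.25])

path = os.path.join(tempfile.mkdtemp(), "params.npz")

# Store both arrays in one file under chosen key names
np.savez(path, coef=coef, intercept=intercept)

# Load them back by key
with np.load(path) as archive:
    loaded_coef = archive["coef"]
    loaded_intercept = archive["intercept"]

print(loaded_coef.shape, loaded_intercept.shape)
```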

### Example

```python
from sklearn.linear_model import RidgeClassifier

model = RidgeClassifier(alpha=0.1)
model.fit(X, y)
```

```
RidgeClassifier(alpha=0.1)
```

```python
import numpy as np

# Save the model's parameters
np.save('saved_models/ridge_coef.npy', model.coef_)
np.save('saved_models/ridge_intercept.npy', model.intercept_)

# Recreate the model with the same arguments as when you trained it
loaded_model = RidgeClassifier(alpha=0.1)

# Load the model's parameters
loaded_model.coef_ = np.load('saved_models/ridge_coef.npy')
loaded_model.intercept_ = np.load('saved_models/ridge_intercept.npy')

# The restored parameters match the trained ones.
# Note: other fitted attributes (e.g. the class labels) are not restored here,
# so check what predict() needs before relying on loaded_model.
assert (model.coef_ == loaded_model.coef_).all()
assert (model.intercept_ == loaded_model.intercept_).all()
```