# Model persistence

Hello, welcome back to your machine learning book with scikit-learn.

You have everything ready: you're happy with your model and with all the vectorizers and transformers you trained to make it work. But what's next? It's time to save it to disk for distribution and deployment.

For the rest of this chapter, I'm going to create and train a simple scikit-learn model, a linear regression. Everything shown here applies equally to a scikit-learn pipeline, which, for practical purposes, is a scikit-learn model. If you want to know more details, check out the chapter on pipelines.

In [None]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

def load_trained_model():
    """Train a linear regression on synthetic data and return the fitted model."""
    # Generate synthetic data: y = 2x + 1 plus Gaussian noise
    np.random.seed(0)
    X = np.random.rand(100, 1)
    y = 2 * X + 1 + 0.1 * np.random.randn(100, 1)

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Create a linear regression model
    model = LinearRegression()

    # Train the model using the training data
    model.fit(X_train, y_train)

    return model

model = load_trained_model()


Remember that this is already a trained model.

## Pickle

Traditionally, models were serialized to disk using `pickle`, the default library for serializing objects in Python.

Saving a model with pickle is simple. All you need to do is:

In [None]:
import pickle

with open("model.pickle", "wb") as file:
    pickle.dump(model, file)


This serializes the model into the "model.pickle" file, which we can share with someone else or put into production ourselves. To read it back from disk, do the following:

In [None]:
with open("model.pickle", "rb") as file:
    unpickled_model = pickle.load(file)


We can verify that it is the type we are expecting:

In [None]:
type(unpickled_model)


And make predictions on new data:
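
As a minimal, self-contained sketch of that check: the cell below retrains a small model on noiseless synthetic data instead of reusing the one above, and `X_new` is a made-up input chosen only for illustration. It round-trips the model through pickle and compares the restored model's predictions with the original's.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LinearRegression

# Train a small model on noiseless synthetic data (y = 2x + 1).
X = np.linspace(0.0, 1.0, 20).reshape(-1, 1)
y = 2 * X + 1
model = LinearRegression().fit(X, y)

# Round-trip the model through pickle inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "model.pickle"
    with open(path, "wb") as file:
        pickle.dump(model, file)
    with open(path, "rb") as file:
        unpickled_model = pickle.load(file)

# Hypothetical "new" inputs that were not part of the training data.
X_new = np.array([[0.25], [0.5], [2.0]])
predictions = unpickled_model.predict(X_new)
print(predictions)

# The restored model should agree with the one still in memory.
same_predictions = np.allclose(predictions, model.predict(X_new))
print(same_predictions)
```

If the two sets of predictions match, the model on disk behaves like the one you trained.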

### Disadvantages

However, pickle has serious security flaws and is not the recommended way to persist your models as it allows arbitrary code execution during deserialization. I recommend using pickle if, and only if, you have no other alternative.
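
To make the risk concrete, here is a harmless sketch of how a pickle payload can run code chosen by whoever created it. The `Innocent` class and the expression it evaluates are hypothetical and exist only for this illustration. A real attack would swap in something destructive, such as a shell command.

```python
import pickle


class Innocent:
    """Hypothetical class whose __reduce__ controls what happens at load time."""

    def __reduce__(self):
        # pickle stores "call eval with this string" instead of the object.
        # The expression here is harmless; an attacker's would not be.
        return (eval, ("'code ran during unpickling'.upper()",))


payload = pickle.dumps(Innocent())

# Loading the bytes executes the stored call. No Innocent instance comes back.
result = pickle.loads(payload)
print(type(result).__name__, result)
```

The loader never asks for permission: simply calling `pickle.loads` (or `pickle.load`) on untrusted data is enough to run the embedded call.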

## Joblib

There is another library called `joblib` that is optimized for serializing large NumPy arrays. But internally, it ends up using pickle for many of its tasks, so they share the same problems.

For completeness, I will show you how to use it:

In [None]:
from joblib import dump

dump(model, "model.joblib")


And to load it:

In [None]:
from joblib import load

model_joblibbed = load("model.joblib")


Review the type and make some predictions so you can see that it does work:

In [None]:
type(model_joblibbed)


Joblib improves on pickle in speed and in the size of the stored model, but nothing changes in terms of security. That's why I recommend that you don't use joblib unless you have no other alternative.

## Skops

Recently, a new library called `skops` emerged, whose goal is to help you save machine learning models and put them into production.

This library is not part of scikit-learn.

It is simple to use:

In [None]:
import skops.io as sio

sio.dump(model, "model.sio")


And to load it:

In [None]:
skopted_model = sio.load("model.sio")


Check the type and make some predictions so you can see that it does work:

In [None]:
type(skopted_model)


## About extensions

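In practice, the extension you give your model file barely matters, since to read or write it you have to specify the complete path to the file and call the matching library yourself. The one exception worth knowing is joblib, which automatically picks a compression method when the filename ends in `.z`, `.gz`, `.bz2`, `.xz`, or `.lzma`.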
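
A small sketch with pickle shows this. The `payload` dictionary is a stand-in for a trained model and the file names are made up. The same bytes load correctly whatever the file is called.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in object; any picklable object, such as a trained model, behaves the same.
payload = {"coef": [2.0], "intercept": 1.0}

loaded = {}
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ("model.pickle", "model.pkl", "model.anything", "model"):
        path = Path(tmp_dir) / name
        path.write_bytes(pickle.dumps(payload))
        # pickle looks only at the bytes, never at the file name.
        loaded[name] = pickle.loads(path.read_bytes())

print(all(obj == payload for obj in loaded.values()))
```

A consistent, descriptive extension is still useful for the people who have to guess which library wrote the file.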

## Other recommendations

Saving models is just one part of putting our models into production. Other things you should do are:

1. **Maintain consistent library versions**: Make sure to use the same versions, or compatible ones, of scikit-learn and its dependencies when saving and loading models. This ensures compatibility and consistent model behavior.
2. **Include preprocessing steps**: Save trained preprocessing transformers along with the model. You can achieve this by creating a pipeline that includes all steps and saving the entire pipeline.
3. **Document your model**: Consider documenting the model's purpose, performance metrics, dataset used for training, feature engineering, and any other relevant information. This documentation will help you and others understand the context, limitations, and use cases of the model.
4. **Use a version control system**: Store your saved models and associated files (e.g., data preprocessing scripts, configuration files, and documentation) in a version control system like Git.
5. **Back up your models**: Ensure that your saved models are backed up in a secure and reliable storage system. This could involve saving models in cloud storage or using a dedicated backup solution.
6. **Ensure the model works after saving and opening**: After saving a model, test loading it and making predictions to ensure that the serialization and deserialization process works as expected.
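
Some of these recommendations can be partly automated. The sketch below is one possible approach, not an established convention. It writes a JSON "sidecar" file next to the model, recording the Python and package versions used at save time plus a SHA-256 checksum of the model file. The helper names (`installed_version`, `write_sidecar`, `verify_checksum`), the sidecar layout, and the stand-in model bytes are all assumptions made for illustration.

```python
import hashlib
import json
import pickle
import platform
import tempfile
from importlib import metadata
from pathlib import Path


def installed_version(package: str) -> str:
    """Return the installed version of a package, or a marker if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"


def write_sidecar(model_path: Path, packages: list) -> Path:
    """Write a JSON file describing the environment and the model file contents."""
    info = {
        "python": platform.python_version(),
        "packages": {name: installed_version(name) for name in packages},
        "sha256": hashlib.sha256(model_path.read_bytes()).hexdigest(),
    }
    sidecar_path = model_path.with_name(model_path.name + ".json")
    sidecar_path.write_text(json.dumps(info, indent=2))
    return sidecar_path


def verify_checksum(model_path: Path, sidecar_path: Path) -> bool:
    """Check that the model file still matches the checksum recorded at save time."""
    info = json.loads(sidecar_path.read_text())
    return hashlib.sha256(model_path.read_bytes()).hexdigest() == info["sha256"]


with tempfile.TemporaryDirectory() as tmp_dir:
    # Stand-in for a real serialized model.
    model_path = Path(tmp_dir) / "model.pickle"
    model_path.write_bytes(pickle.dumps({"coef": [2.0], "intercept": 1.0}))

    sidecar_path = write_sidecar(model_path, ["scikit-learn", "numpy", "joblib"])
    sidecar = json.loads(sidecar_path.read_text())
    intact = verify_checksum(model_path, sidecar_path)

    # Simulate a corrupted or swapped file: the checksum no longer matches.
    model_path.write_bytes(b"corrupted")
    tamper_detected = not verify_checksum(model_path, sidecar_path)

print(intact, tamper_detected)
```

At load time, you could compare the recorded versions against the current environment and refuse to load, or at least warn, when they differ. A checksum only detects accidental corruption or an unexpected file swap. It does not make an untrusted pickle safe to load.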

And that's all. I hope these tips help you successfully put your model into production.