Commit d427d9d (parent ad64f20): adds first version of deployment guide. 3 changed files, 322 additions, 0 deletions.

Deployment
----------

The two most common ways to deploy data pipelines are batch and online.
Ploomber supports both deployment options.

Batch implies running your pipeline (usually on a schedule), generating
results, and making them available for future consumption. For example, you
may develop a Machine Learning pipeline that runs every morning, predicts the
probability of user churn, and stores the probabilities in a database table.
The probabilities can then be used to guide decision-making.

Alternatively, you may deploy a pipeline as an online service. In this case,
instead of storing predictions for later consumption, you expose your model
as a REST API and users can make requests and get predictions on demand.

Pipeline composition
====================

Before diving into the deployment details, let's introduce the concept of
pipeline composition.

The only difference between a Machine Learning training pipeline and its
serving counterpart is what happens at the beginning and at the end.

At **training** time, we obtain historical data, generate features and train a
model:

.. raw:: html

    <div class="mermaid">
    graph LR
        A[Get historical data] --> B1[A feature] --> C[Train model]
        A --> B2[Another feature] --> C
        subgraph Feature engineering
        B1
        B2
        end
    </div>

At **serving** time, we obtain new data, generate features and make
predictions using a previously trained model:

.. raw:: html

    <div class="mermaid">
    graph LR
        A[Get new data] --> B1[A feature] --> C[Predict]
        A --> B2[Another feature] --> C
        subgraph Feature engineering
        B1
        B2
        end
    </div>

When the feature engineering code of the two pipelines does not match,
`training-serving skew <https://ploomber.io/posts/train-serve-skew/>`_ arises.
This is one of the most common problems when deploying ML models. To fix it,
Ploomber allows you to compose pipelines: **write your feature generation once
and re-use it to compose your training and serving pipelines**, ensuring that
the feature engineering code matches exactly.

Batch processing
================

Ploomber pipelines can be exported to production-grade schedulers for batch
processing. Check out our package
`Soopervisor <https://soopervisor.readthedocs.io/en/stable/index.html>`_, which
allows you to export to
`Kubernetes <https://soopervisor.readthedocs.io/en/stable/kubernetes.html>`_
(via Argo Workflows) and
`Airflow <https://soopervisor.readthedocs.io/en/stable/airflow.html>`_. It's
also possible to run Ploomber projects using `cron
<https://soopervisor.readthedocs.io/en/stable/scheduling.html#cron>`_ or
`GitHub Actions <https://soopervisor.readthedocs.io/en/stable/scheduling.html#github-actions>`_.

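For instance, scheduling a daily run with GitHub Actions only requires a
workflow that installs your dependencies and calls ``ploomber build``. The
following is a minimal sketch (the file name, schedule, Python version, and
``requirements.txt`` are assumptions about your project, not something
Ploomber requires):

.. code-block:: yaml
    :class: text-editor
    :name: github-actions-pipeline-yml

    # .github/workflows/pipeline.yml (illustrative)
    name: batch-pipeline

    on:
      schedule:
        # assumed schedule: every day at 6 AM UTC
        - cron: '0 6 * * *'

    jobs:
      run-pipeline:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v3
          - uses: actions/setup-python@v4
            with:
              python-version: '3.10'
          # install project dependencies (assumes a requirements.txt in the repo)
          - run: pip install -r requirements.txt
          # run the Ploomber pipeline end-to-end
          - run: ploomber build
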
Composing batch pipelines
*************************

To compose a batch pipeline, use the ``import_tasks_from`` directive in
your ``pipeline.yaml`` file.

For example, define all your feature generation tasks in a ``features.yaml`` file:

.. code-block:: yaml
    :class: text-editor
    :name: features-yaml

    # generate one feature...
    - source: features.a_feature
      product: features/a_feature.csv

    # another feature...
    - source: features.another_feature
      product: features/another_feature.csv

    # join the two previous features...
    - source: features.join
      product: features/all.csv

Then import those tasks in your training pipeline, ``pipeline.yaml``:

.. code-block:: yaml
    :class: text-editor
    :name: pipeline-yaml

    meta:
        # import feature generation tasks
        import_tasks_from: features.yaml

    tasks:
        # Get raw data for training
        - source: train.get_historical_data
          product: raw/get.csv

        # The import_tasks_from directive injects your feature generation tasks here

        # Train a model
        - source: train.train_model
          product: model/model.pickle

Your serving pipeline ``pipeline-serve.yaml`` would look like this:

.. code-block:: yaml
    :class: text-editor
    :name: pipeline-serve-yaml

    meta:
        # import feature generation tasks
        import_tasks_from: features.yaml

    tasks:
        # Get new data for predictions
        - source: serve.get_new_data
          product: serve/get.parquet

        # The import_tasks_from directive injects your feature generation tasks here

        # Make predictions using a trained model
        - source: serve.predict
          product: serve/predictions.csv
          params:
            path_to_model: model.pickle

Example
*******

`Here's an example
<https://github.com/ploomber/projects/tree/master/ml-intermediate>`_ project
showing how to use ``import_tasks_from`` to create a training
(``pipeline.yaml``) and serving (``pipeline-serve.yaml``) pipeline.

Online service (API)
====================

To encapsulate all of your pipeline's logic for generating online predictions,
use :py:mod:`ploomber.OnlineDAG`. Once implemented, you can generate
predictions like this:

.. code-block:: python
    :class: text-editor
    :name: online-py

    from my_project import MyOnlineDAG

    # MyOnlineDAG is a subclass of OnlineDAG
    dag = MyOnlineDAG()

    dag.predict(input_data=input_data)

You can easily integrate an online DAG with any library such as Flask or gRPC.

The only requirement is that your feature generation code must consist entirely
of Python function tasks (i.e., :py:mod:`ploomber.tasks.PythonCallable`) with a
configured :ref:`serializer-and-unserializer`.

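To give an idea of what that involves, here is a minimal, pickle-based
serializer/unserializer pair. This is only a sketch: it assumes the usual
``serializer(obj, product)`` and ``unserializer(product)`` signatures, and the
module and function names are hypothetical.

.. code-block:: python
    :class: text-editor
    :name: io-py

    # my_project/io.py (hypothetical module)
    import pickle
    from pathlib import Path


    def serialize(obj, product):
        # write the object returned by a task to its product path
        Path(str(product)).write_bytes(pickle.dumps(obj))


    def unserialize(product):
        # load a product back into memory for the downstream task
        return pickle.loads(Path(str(product)).read_bytes())

Both functions would then be referenced from your pipeline spec (e.g.,
``serializer: my_project.io.serialize``); see :ref:`serializer-and-unserializer`
for the details.
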
The next section explains the implementation details.

Composing online pipelines
**************************

To create an online DAG, list your feature tasks in a ``features.yaml`` and
use ``import_tasks_from`` in your training pipeline ``pipeline.yaml``. To
create the serving pipeline, you have to create a subclass of
:py:mod:`ploomber.OnlineDAG`.

``OnlineDAG`` will take your ``features.yaml`` and create "input tasks" based
on the ``upstream`` references in your feature tasks. For example, if your
pipeline has features ``a_feature`` and ``another_feature`` (just like the
pipeline described in the first section), and both obtain their inputs from a
task named ``get``, the code will look like this:

.. code-block:: py
    :class: text-editor
    :name: features-py

    def a_feature(upstream):
        raw_data = upstream['get']
        # process raw_data to generate features...
        # return a_feature
        return df_a_feature


    def another_feature(upstream):
        raw_data = upstream['get']
        # process raw_data to generate features...
        # return another_feature
        return df_another_feature

Since ``features.yaml`` does not contain a task named ``get``, ``OnlineDAG``
automatically identifies it as an input. Finally, you must provide a
"terminal task", which will be the last task in your online pipeline:

.. raw:: html

    <div class="mermaid">
    graph LR
        A[Input] --> B1[A feature] --> C[Terminal task]
        A --> B2[Another feature] --> C
        subgraph Feature engineering
        B1
        B2
        end
    </div>

To implement this, create a subclass of ``OnlineDAG`` and provide the path
to your ``features.yaml``, the parameters for your terminal task, and the
terminal task itself:

.. code-block:: py
    :class: text-editor
    :name: online-dag-py

    import pickle
    from importlib import resources

    import pandas as pd

    from ploomber import OnlineDAG

    # package that bundles the trained model (taken from the sample project)
    import ml_online


    # subclass OnlineDAG...
    class MyOnlineDAG(OnlineDAG):
        # and provide these three methods...

        # get_partial: returns a path to your feature tasks
        @staticmethod
        def get_partial():
            return 'tasks-features.yaml'

        # terminal_params: returns a dictionary with parameters for the terminal task
        @staticmethod
        def terminal_params():
            model = pickle.loads(resources.read_binary(ml_online, 'model.pickle'))
            return dict(model=model)

        # terminal_task: implementation of your terminal task
        @staticmethod
        def terminal_task(upstream, model):
            # receives all tasks with no downstream dependencies in
            # tasks-features.yaml
            a_feature = upstream['a_feature']
            another_feature = upstream['another_feature']
            X = pd.DataFrame({'a_feature': a_feature,
                              'another_feature': another_feature})
            return model.predict(X)

To call ``MyOnlineDAG``:

.. code-block:: python
    :class: text-editor
    :name: online-call-py

    from my_project import MyOnlineDAG

    dag = MyOnlineDAG()

    # pass parameters (one per input)
    prediction = dag.predict(get=input_data)

You can import and call ``MyOnlineDAG`` in any framework (e.g., Flask) to
expose your pipeline as an online service:

.. code-block:: python
    :class: text-editor
    :name: micro-service-py

    from flask import Flask, request, jsonify
    import pandas as pd
    from my_project import MyOnlineDAG

    # instantiate the online DAG
    dag = MyOnlineDAG()

    app = Flask(__name__)


    @app.route('/', methods=['POST'])
    def predict():
        request_data = request.get_json()
        # get JSON data and create a data frame with a single row
        input_data = pd.DataFrame(request_data, index=[0])
        # pass input data, one argument per root node
        out = dag.predict(get=input_data)
        # return the output from the terminal task
        return jsonify({'prediction': int(out['terminal'])})

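To consume the service, a client sends a POST request with one JSON field per
column that your feature tasks expect. Here is a minimal sketch using the
``requests`` library (the payload fields are hypothetical, and it assumes the
app above is running locally on port 5000, e.g., via ``flask run``):

.. code-block:: python
    :class: text-editor
    :name: client-py

    import requests

    # field names must match the columns your pipeline expects (hypothetical here)
    payload = {'a_column': 1.0, 'another_column': 'some_value'}

    response = requests.post('http://localhost:5000/', json=payload)

    # e.g., {'prediction': 0}
    print(response.json())
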
Example
*******

`Click here <https://github.com/ploomber/projects/tree/master/ml-online>`_ to
see a full sample project that trains a model and exposes an API via Flask.

@@ -11,6 +11,7 @@ User Guide
    sql-templating
    testing
    debugging
+   deployment
    r-support
    faq_index
    spec-vs-python