adds first version of deployment guide
edublancas committed Mar 12, 2021
1 parent ad64f20 commit d427d9d
Showing 3 changed files with 322 additions and 0 deletions.
1 change: 1 addition & 0 deletions doc/api/spec.rst
@@ -148,6 +148,7 @@ to backup pipeline results (say, for example, you run a job that trains
several models and want to save output results. You can use
:py:mod:`ploomber.clients.GCloudStorageClient` for that.

.. _serializer-and-unserializer:

``serializer`` and ``unserializer``
***********************************
320 changes: 320 additions & 0 deletions doc/user-guide/deployment.rst
@@ -0,0 +1,320 @@
Deployment
----------

The two most common ways to deploy data pipelines are batch and online.
Ploomber supports both deployment options.

Batch implies running your pipeline (usually on a schedule), generating results,
and making them available for future consumption. For example, you may develop a
Machine Learning pipeline that runs every morning, predicts the probability
of user churn, and stores the probabilities in a database table. These
probabilities can then be used to guide decision-making.

Alternatively, you may deploy a pipeline as an online service. This time,
instead of storing predictions for later consumption, you expose your model
as a REST API and users can make requests and get predictions on demand.

Pipeline composition
====================

Before diving into the deployment details, let's introduce the concept of
pipeline composition.

The only difference between a Machine Learning training pipeline and its serving
counterpart is what happens at the beginning and at the end.


At **training** time, we obtain historical data, generate features and train a
model:


.. raw:: html

<div class="mermaid">
graph LR
A[Get historical data] --> B1[A feature] --> C[Train model]
A --> B2[Another feature] --> C
subgraph Feature engineering
B1
B2
end

</div>

At **serving** time, we obtain new data, generate features and make
predictions using a previously trained model:


.. raw:: html

<div class="mermaid">
graph LR
A[Get new data] --> B1[A feature] --> C[Predict]
A --> B2[Another feature] --> C
subgraph Feature engineering
B1
B2
end

</div>

When the feature engineering code used for training does not match the code used
for serving, `training-serving skew <https://ploomber.io/posts/train-serve-skew/>`_
arises. This is one of the most common problems when deploying ML models. To
prevent it, Ploomber allows you to compose pipelines: **write your feature
generation once and re-use it to compose your training and serving pipelines**,
ensuring that the feature engineering code matches exactly.


Batch processing
================

Ploomber pipelines can be exported to production-grade schedulers for batch
processing. Check out our package
`Soopervisor <https://soopervisor.readthedocs.io/en/stable/index.html>`_, which
allows you to export to
`Kubernetes <https://soopervisor.readthedocs.io/en/stable/kubernetes.html>`_
(via Argo workflows) and
`Airflow <https://soopervisor.readthedocs.io/en/stable/airflow.html>`_. It's
also possible to run Ploomber projects using `cron
<https://soopervisor.readthedocs.io/en/stable/scheduling.html#cron>`_ or
`GitHub Actions <https://soopervisor.readthedocs.io/en/stable/scheduling.html#github-actions>`_.
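
If you manage the schedule yourself (say, with cron or a CI workflow), the
scheduled job only needs a single entry point that builds the pipeline. The
following is an illustrative sketch of such a script using
:py:mod:`ploomber.spec.DAGSpec` (the ``run.py`` filename is just an example):

.. code-block:: python
    :class: text-editor
    :name: run-py

    # run.py - hypothetical entry script invoked by cron or a CI workflow
    from ploomber.spec import DAGSpec

    # load the pipeline declared in pipeline.yaml and convert it to a DAG
    dag = DAGSpec('pipeline.yaml').to_dag()

    # execute the pipeline (only outdated tasks run)
    dag.build()

Scheduling then reduces to running ``python run.py`` (or the equivalent
``ploomber build`` command) at the desired interval.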

Composing batch pipelines
*************************

To compose a batch pipeline, use the ``import_tasks_from`` directive in
your ``pipeline.yaml`` file.

For example, define all your feature generation tasks in a ``features.yaml`` file:


.. code-block:: yaml
    :class: text-editor
    :name: features-yaml

    # generate one feature...
    - source: features.a_feature
      product: features/a_feature.csv

    # another feature...
    - source: features.another_feature
      product: features/another_feature.csv

    # join the two previous features...
    - source: features.join
      product: features/all.csv

Then import those tasks in your training pipeline, ``pipeline.yaml``:


.. code-block:: yaml
    :class: text-editor
    :name: pipeline-yaml

    meta:
      # import feature generation tasks
      import_tasks_from: features.yaml

    tasks:
      # Get raw data for training
      - source: train.get_historical_data
        product: raw/get.csv

      # The import_tasks_from injects your feature generation tasks here

      # Train a model
      - source: train.train_model
        product: model/model.pickle

Your serving pipeline, ``pipeline-serve.yaml``, would look like this:

.. code-block:: yaml
    :class: text-editor
    :name: pipeline-serve-yaml

    meta:
      # import feature generation tasks
      import_tasks_from: features.yaml

    tasks:
      # Get new data for predictions
      - source: serve.get_new_data
        product: serve/get.parquet

      # The import_tasks_from injects your feature generation tasks here

      # Make predictions using a trained model
      - source: serve.predict
        product: serve/predictions.csv
        params:
          path_to_model: model.pickle
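
For reference, ``serve.predict`` can be a plain Python function that receives the
generated features via ``upstream`` and the ``path_to_model`` parameter declared
above. The following is only a sketch; it assumes the model was stored with
``pickle`` and that the joined features produced by ``features.join`` are the
model's input:

.. code-block:: python
    :class: text-editor
    :name: serve-predict-py

    import pickle
    from pathlib import Path

    import pandas as pd


    def predict(product, upstream, path_to_model):
        # load the model produced by the training pipeline
        model = pickle.loads(Path(path_to_model).read_bytes())

        # load the joined features generated by the features.join task
        X = pd.read_csv(str(upstream['join']))

        # store predictions in the product declared in pipeline-serve.yaml
        X.assign(prediction=model.predict(X)).to_csv(str(product), index=False)
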
Example
*******

`Here's an example
<https://github.com/ploomber/projects/tree/master/ml-intermediate>`_ project
showing how to use ``import_tasks_from`` to create a training
(``pipeline.yaml``) and serving (``pipeline-serve.yaml``) pipeline.


Online service (API)
====================

To encapsulate all of your pipeline's logic for generating online predictions, use
:py:mod:`ploomber.OnlineDAG`. Once implemented, you can generate predictions
like this:

.. code-block:: python
    :class: text-editor
    :name: online-py

    from my_project import MyOnlineDAG

    # MyOnlineDAG is a subclass of OnlineDAG
    dag = MyOnlineDAG()

    dag.predict(input_data=input_data)

You can easily integrate an online DAG with any library such as Flask or gRPC.

The only requirement is that your feature generation code must consist entirely
of Python function tasks (i.e., :py:mod:`ploomber.tasks.PythonCallable`) with a
configured :ref:`serializer-and-unserializer`.
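
For example (this is only an illustrative sketch; a pickle-based pair is one of
several options), the serializer and unserializer could look like the functions
below and be referenced by their dotted paths from your ``pipeline.yaml``:

.. code-block:: python
    :class: text-editor
    :name: io-py

    import pickle
    from pathlib import Path


    def serializer(obj, product):
        # receives the object returned by a task and the product (path) to store it
        Path(str(product)).write_bytes(pickle.dumps(obj))


    def unserializer(product):
        # loads and returns the object that serializer stored at the given product
        return pickle.loads(Path(str(product)).read_bytes())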

The next section explains the implementation details.


Composing online pipelines
**************************

To create an online DAG, list your feature tasks in a ``features.yaml`` file and
use ``import_tasks_from`` in your training pipeline ``pipeline.yaml``. To create
the serving pipeline, subclass :py:mod:`ploomber.OnlineDAG`.

``OnlineDAG`` takes your ``features.yaml`` and creates "input tasks" based
on ``upstream`` references in your feature tasks. For example, if your pipeline
has features ``a_feature`` and ``another_feature`` (just like the pipeline
described in the first section), and both obtain their inputs from a task
named ``get``, the code will look like this:

.. code-block:: py
    :class: text-editor
    :name: features-py

    def a_feature(upstream):
        raw_data = upstream['get']
        # process raw_data to generate features...
        # return a_feature
        return df_a_feature


    def another_feature(upstream):
        raw_data = upstream['get']
        # process raw_data to generate features...
        # return another_feature
        return df_another_feature

Since ``features.yaml`` does not contain a task named ``get``, ``OnlineDAG``
automatically identifies it as an input. Finally, you must provide a
"terminal task", which will be the last task in your online pipeline:

.. raw:: html

<div class="mermaid">
graph LR
A[Input] --> B1[A feature] --> C[Terminal task]
A --> B2[Another feature] --> C
subgraph Feature engineering
B1
B2
end

</div>

To implement this, create a subclass of ``OnlineDAG`` and provide the path
to your ``features.yaml``, the parameters for your terminal task, and the
terminal task itself:

.. code-block:: py
    :class: text-editor
    :name: online-dag-py

    import pickle
    from importlib import resources

    import pandas as pd
    from ploomber import OnlineDAG

    import ml_online  # your project's package (stores model.pickle as package data)


    # subclass OnlineDAG...
    class MyOnlineDAG(OnlineDAG):
        # and provide these three methods...

        # get_partial: returns a path to your feature tasks
        @staticmethod
        def get_partial():
            return 'tasks-features.yaml'

        # terminal_params: returns a dictionary with parameters for the terminal task
        @staticmethod
        def terminal_params():
            model = pickle.loads(resources.read_binary(ml_online, 'model.pickle'))
            return dict(model=model)

        # terminal_task: implementation of your terminal task
        @staticmethod
        def terminal_task(upstream, model):
            # receives all tasks with no downstream dependencies in
            # tasks-features.yaml
            a_feature = upstream['a_feature']
            another_feature = upstream['another_feature']
            X = pd.DataFrame({'a_feature': a_feature,
                              'another_feature': another_feature})
            return model.predict(X)

To call ``MyOnlineDAG``:

.. code-block:: python
    :class: text-editor
    :name: online-predict-py

    from my_project import MyOnlineDAG

    dag = MyOnlineDAG()

    # pass parameters (one per input)
    prediction = dag.predict(get=input_data)

You can import and call ``MyOnlineDAG`` in any framework (e.g., Flask) to
expose your pipeline as an online service.


.. code-block:: python
    :class: text-editor
    :name: micro-service-py

    from flask import Flask, request, jsonify
    import pandas as pd

    from my_project import MyOnlineDAG

    # instantiate the online dag
    dag = MyOnlineDAG()

    app = Flask(__name__)


    @app.route('/', methods=['POST'])
    def predict():
        # get JSON data and create a data frame with a single row
        request_data = request.get_json()
        input_data = pd.DataFrame(request_data, index=[0])

        # pass input data, one argument per root node
        out = dag.predict(get=input_data)

        # return the output from the terminal task
        return jsonify({'prediction': int(out['terminal'])})
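
Once the app is running (for example, via ``flask run``), clients can request
predictions over HTTP. Here is a quick usage sketch with the ``requests``
library; the port and the column names in the payload are assumptions and
should match whatever your input task expects:

.. code-block:: python
    :class: text-editor
    :name: client-py

    import requests

    # one value per column expected by the pipeline's input (the `get` root node)
    payload = {'a_column': 1.0, 'another_column': 'some-value'}

    response = requests.post('http://localhost:5000/', json=payload)
    print(response.json())  # e.g., {'prediction': 0}
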
Example
*******

`Click here <https://github.com/ploomber/projects/tree/master/ml-online>`_ to
see a full sample project that trains a model and exposes an API via Flask.
1 change: 1 addition & 0 deletions doc/user-guide/index.rst
@@ -11,6 +11,7 @@ User Guide
sql-templating
testing
debugging
deployment
r-support
faq_index
spec-vs-python
