adds first version of deployment guide
edublancas committed Mar 12, 2021
1 parent ad64f20 commit d427d9d
Showing 3 changed files with 322 additions and 0 deletions.
1 change: 1 addition & 0 deletions doc/api/spec.rst
@@ -148,6 +148,7 @@ to backup pipeline results (say, for example, you run a job that trains
several models and want to save output results. You can use
:py:mod:`ploomber.clients.GCloudStorageClient` for that.

.. _serializer-and-unserializer:

``serializer`` and ``unserializer``
***********************************
320 changes: 320 additions & 0 deletions doc/user-guide/deployment.rst
@@ -0,0 +1,320 @@
Deployment
----------

The two most common ways to deploy data pipelines are batch and online.
Ploomber supports both deployment options.

Batch implies running your pipeline (usually on a schedule), generating results,
and making them available for future consumption. For example, you may develop a
Machine Learning pipeline that runs every morning, predicts the probability
of user churn, and stores the probabilities in a database table. These
probabilities can then be used to guide decision-making.

Alternatively, you may deploy a pipeline as an online service. This time,
instead of storing predictions for later consumption, you expose your model
as a REST API and users can make requests and get predictions on demand.

Pipeline composition
====================

Before diving into the deployment details, let's introduce the concept of
pipeline composition.

The only difference between a Machine Learning training pipeline and its serving
counterpart is what happens at the beginning and at the end.


At **training** time, we obtain historical data, generate features and train a
model:


.. raw:: html

<div class="mermaid">
graph LR
A[Get historical data] --> B1[A feature] --> C[Train model]
A --> B2[Another feature] --> C
subgraph Feature engineering
B1
B2
end

</div>

At **serving** time, we obtain new data, generate features and make
predictions using a previously trained model:


.. raw:: html

<div class="mermaid">
graph LR
A[Get new data] --> B1[A feature] --> C[Predict]
A --> B2[Another feature] --> C
subgraph Feature engineering
B1
B2
end

</div>

When the feature engineering code used for training does not match the code used
for serving, `training-serving skew <https://ploomber.io/posts/train-serve-skew/>`_
arises. This is one of the most common problems when deploying ML models. To
prevent it, Ploomber allows you to compose pipelines: **write your feature
generation once and re-use it to compose your training and serving pipelines**,
ensuring that the feature engineering code matches exactly.


Batch processing
================

Ploomber pipelines can be exported to production-grade schedulers for batch
processing. Check out our package
`Soopervisor <https://soopervisor.readthedocs.io/en/stable/index.html>`_, which
allows you to export to
`Kubernetes <https://soopervisor.readthedocs.io/en/stable/kubernetes.html>`_
(via Argo workflows) and
`Airflow <https://soopervisor.readthedocs.io/en/stable/airflow.html>`_. It's
also possible to run Ploomber projects using `cron
<https://soopervisor.readthedocs.io/en/stable/scheduling.html#cron>`_ or
`GitHub Actions <https://soopervisor.readthedocs.io/en/stable/scheduling.html#github-actions>`_.
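
If you manage the schedule yourself (say, with cron or a CI workflow), the
scheduled job only needs a single entry point that builds the pipeline. The
following is an illustrative sketch of such a script using
:py:mod:`ploomber.spec.DAGSpec` (the ``run.py`` filename is just an example):

.. code-block:: python
    :class: text-editor
    :name: run-py

    # run.py - hypothetical entry script invoked by cron or a CI workflow
    from ploomber.spec import DAGSpec

    # load the pipeline declared in pipeline.yaml and convert it to a DAG
    dag = DAGSpec('pipeline.yaml').to_dag()

    # execute the pipeline (only outdated tasks run)
    dag.build()

Scheduling then reduces to running ``python run.py`` (or the equivalent
``ploomber build`` command) at the desired interval.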

Composing batch pipelines
*************************

To compose a batch pipeline, use the ``import_tasks_from`` directive in
your ``pipeline.yaml`` file.

For example, define all your feature generation tasks in a ``features.yaml`` file:


.. code-block:: yaml
    :class: text-editor
    :name: features-yaml

    # generate one feature...
    - source: features.a_feature
      product: features/a_feature.csv

    # another feature...
    - source: features.another_feature
      product: features/another_feature.csv

    # join the two previous features...
    - source: features.join
      product: features/all.csv

Then import those tasks in your training pipeline, ``pipeline.yaml``:


.. code-block:: yaml
    :class: text-editor
    :name: pipeline-yaml

    meta:
      # import feature generation tasks
      import_tasks_from: features.yaml

    tasks:
      # Get raw data for training
      - source: train.get_historical_data
        product: raw/get.csv

      # The import_tasks_from injects your feature generation tasks here

      # Train a model
      - source: train.train_model
        product: model/model.pickle

Your serving pipeline, ``pipeline-serve.yaml``, would look like this:

.. code-block:: yaml
    :class: text-editor
    :name: pipeline-serve-yaml

    meta:
      # import feature generation tasks
      import_tasks_from: features.yaml

    tasks:
      # Get new data for predictions
      - source: serve.get_new_data
        product: serve/get.parquet

      # The import_tasks_from injects your feature generation tasks here

      # Make predictions using a trained model
      - source: serve.predict
        product: serve/predictions.csv
        params:
          path_to_model: model.pickle
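
For reference, ``serve.predict`` can be a plain Python function that receives the
generated features via ``upstream`` and the ``path_to_model`` parameter declared
above. The following is only a sketch; it assumes the model was stored with
``pickle`` and that the joined features produced by ``features.join`` are the
model's input:

.. code-block:: python
    :class: text-editor
    :name: serve-predict-py

    import pickle
    from pathlib import Path

    import pandas as pd


    def predict(product, upstream, path_to_model):
        # load the model produced by the training pipeline
        model = pickle.loads(Path(path_to_model).read_bytes())

        # load the joined features generated by the features.join task
        X = pd.read_csv(str(upstream['join']))

        # store predictions in the product declared in pipeline-serve.yaml
        X.assign(prediction=model.predict(X)).to_csv(str(product), index=False)
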
Example
*******

`Here's an example
<https://github.com/ploomber/projects/tree/master/ml-intermediate>`_ project
showing how to use ``import_tasks_from`` to create a training
(``pipeline.yaml``) and serving (``pipeline-serve.yaml``) pipeline.


Online service (API)
====================

To encapsulate all of your pipeline's logic for generating online predictions, use
:py:mod:`ploomber.OnlineDAG`. Once implemented, you can generate predictions
like this:

.. code-block:: python
    :class: text-editor
    :name: online-py

    from my_project import MyOnlineDAG

    # MyOnlineDAG is a subclass of OnlineDAG
    dag = MyOnlineDAG()

    dag.predict(input_data=input_data)

You can easily integrate an online DAG with any library such as Flask or gRPC.

The only requirement is that your feature generation code must consist entirely
of Python function tasks (i.e., :py:mod:`ploomber.tasks.PythonCallable`) with a
configured :ref:`serializer-and-unserializer`.
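
For example (this is only an illustrative sketch; a pickle-based pair is one of
several options), the serializer and unserializer could look like the functions
below and be referenced by their dotted paths from your ``pipeline.yaml``:

.. code-block:: python
    :class: text-editor
    :name: io-py

    import pickle
    from pathlib import Path


    def serializer(obj, product):
        # receives the object returned by a task and the product (path) to store it
        Path(str(product)).write_bytes(pickle.dumps(obj))


    def unserializer(product):
        # loads and returns the object that serializer stored at the given product
        return pickle.loads(Path(str(product)).read_bytes())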

The next section explains the implementation details.


Composing online pipelines
**************************

To create an online DAG, list your feature tasks in a ``features.yaml`` file and
use ``import_tasks_from`` in your training pipeline ``pipeline.yaml``. To create
the serving pipeline, subclass :py:mod:`ploomber.OnlineDAG`.

``OnlineDAG`` takes your ``features.yaml`` and creates "input tasks" based
on ``upstream`` references in your feature tasks. For example, if your pipeline
has features ``a_feature`` and ``another_feature`` (just like the pipeline
described in the first section), and both obtain their inputs from a task
named ``get``, the code will look like this:

.. code-block:: py
    :class: text-editor
    :name: features-py

    def a_feature(upstream):
        raw_data = upstream['get']
        # process raw_data to generate features...
        # return a_feature
        return df_a_feature


    def another_feature(upstream):
        raw_data = upstream['get']
        # process raw_data to generate features...
        # return another_feature
        return df_another_feature

Since ``features.yaml`` does not contain a task named ``get``, ``OnlineDAG``
automatically identifies it as an input. Finally, you must provide a
"terminal task", which will be the last task in your online pipeline:

.. raw:: html

<div class="mermaid">
graph LR
A[Input] --> B1[A feature] --> C[Terminal task]
A --> B2[Another feature] --> C
subgraph Feature engineering
B1
B2
end

</div>

To implement this, create a subclass of ``OnlineDAG`` and provide the path
to your ``features.yaml``, the parameters for your terminal task, and the
terminal task itself:

.. code-block:: py
    :class: text-editor
    :name: online-dag-py

    import pickle
    from importlib import resources

    import pandas as pd
    from ploomber import OnlineDAG

    import ml_online  # your project's package (stores model.pickle as package data)


    # subclass OnlineDAG...
    class MyOnlineDAG(OnlineDAG):
        # and provide these three methods...

        # get_partial: returns a path to your feature tasks
        @staticmethod
        def get_partial():
            return 'tasks-features.yaml'

        # terminal_params: returns a dictionary with parameters for the terminal task
        @staticmethod
        def terminal_params():
            model = pickle.loads(resources.read_binary(ml_online, 'model.pickle'))
            return dict(model=model)

        # terminal_task: implementation of your terminal task
        @staticmethod
        def terminal_task(upstream, model):
            # receives all tasks with no downstream dependencies in
            # tasks-features.yaml
            a_feature = upstream['a_feature']
            another_feature = upstream['another_feature']
            X = pd.DataFrame({'a_feature': a_feature,
                              'another_feature': another_feature})
            return model.predict(X)

To call ``MyOnlineDAG``:

.. code-block:: python
    :class: text-editor
    :name: online-predict-py

    from my_project import MyOnlineDAG

    dag = MyOnlineDAG()

    # pass parameters (one per input)
    prediction = dag.predict(get=input_data)

You can import and call ``MyOnlineDAG`` in any framework (e.g., Flask) to
expose your pipeline as an online service.


.. code-block:: python
    :class: text-editor
    :name: micro-service-py

    from flask import Flask, request, jsonify
    import pandas as pd

    from my_project import MyOnlineDAG

    # instantiate the online dag
    dag = MyOnlineDAG()

    app = Flask(__name__)


    @app.route('/', methods=['POST'])
    def predict():
        # get JSON data and create a data frame with a single row
        request_data = request.get_json()
        input_data = pd.DataFrame(request_data, index=[0])

        # pass input data, one argument per root node
        out = dag.predict(get=input_data)

        # return the output from the terminal task
        return jsonify({'prediction': int(out['terminal'])})
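
Once the app is running (for example, via ``flask run``), clients can request
predictions over HTTP. Here is a quick usage sketch with the ``requests``
library; the port and the column names in the payload are assumptions and
should match whatever your input task expects:

.. code-block:: python
    :class: text-editor
    :name: client-py

    import requests

    # one value per column expected by the pipeline's input (the `get` root node)
    payload = {'a_column': 1.0, 'another_column': 'some-value'}

    response = requests.post('http://localhost:5000/', json=payload)
    print(response.json())  # e.g., {'prediction': 0}
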
Example
*******

`Click here <https://github.com/ploomber/projects/tree/master/ml-online>`_ to
see a full sample project that trains a model and exposes an API via Flask.
1 change: 1 addition & 0 deletions doc/user-guide/index.rst
@@ -11,6 +11,7 @@ User Guide
sql-templating
testing
debugging
deployment
r-support
faq_index
spec-vs-python
