
Commit

More complex quick start examples to introduce other concepts (#17)
* More complex quick start
* Merge new changes
* Split up more complex examples, integrate nick's comments on #17, add more content for partial workflows, add new syntax
* Methods -> Functions, and reorganisation
* remove functions from fully composed workflows
* updated with new module versioning syntax, fixed a few typos, restructured
lanthias authored and mands committed Mar 31, 2017
1 parent 1ea1911 commit 83f9a18
Showing 8 changed files with 307 additions and 8 deletions.
15 changes: 15 additions & 0 deletions advanced_start/index.rst
@@ -0,0 +1,15 @@
.. _advanced_start_index:

******************
Advanced Tutorial
******************

In this section, we're going to productionise a Random Forest classifier written with `sklearn <http://scikit-learn.org/>`_, deploy it to the cloud, and use it in a more sophisticated workflow.

By the end of the tutorial, you will learn how to build modules with dependencies, write more sophisticated workflows, and build abstractions over data-sources. Enjoy!


.. toctree::

   more
   workflow_power
171 changes: 171 additions & 0 deletions advanced_start/more.rst
@@ -0,0 +1,171 @@
.. _more:

Productionising a Classifier as an NStack Module
================================================

So far, we have built and published a Python module with a single function, ``numChars``, and built a workflow which connects that function to an HTTP endpoint. This in itself isn't particularly useful, so, now that you've got the gist of how NStack works, it's time to build something more realistic!

In this tutorial, we're going to create and productionise a simple classifier which uses the famous `iris dataset <https://en.wikipedia.org/wiki/Iris_flower_data_set>`_.
We're going to train our classifier to classify which species an iris is, given measurements of its sepals and petals. You can find the dataset we're using to train our model `here <https://raw.githubusercontent.com/nstackcom/nstack-examples/master/iris/Iris.Classify/train.csv>`_.

First, let's look at the format of our data to see how we should approach the problem. We see that we have five fields:

================ ======================= ===========
Field Name Description Type
================ ======================= ===========
``species`` The species of iris Text

``sepal_width`` The width of the sepal Double

``sepal_length`` The length of the sepal Double

``petal_width`` The width of the petal Double

``petal_length`` The length of the petal Double
================ ======================= ===========

Since we want to predict the species from the sepal and petal measurements, the four measurements will be the input to our classifier module, and the species will be the output. This means we need to write a function in Python which takes four ``Double``\s and returns ``Text``.
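
If you'd like to sanity-check the data before writing any NStack code, a few lines of pandas (run locally once you have downloaded ``train.csv`` as described below) will confirm the columns and their types. This is an optional sketch on our part, not part of the module itself:

.. code :: python

import pandas as pd

# Peek at the training data to confirm the fields and types
# listed in the table above.
train = pd.read_csv("train.csv")
print(train.dtypes)
print(train.head())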

Creating your classifier module
--------------------------------

To begin, let's make a new directory called ``Iris.Classify``, ``cd`` into it, and initialise a new Python module:

.. code:: bash

~/ $ mkdir Iris.Classify; cd Iris.Classify
~/Iris.Classify/ $ nstack init python
python module 'Iris.Classify' successfully initialised at ~/Iris.Classify

Next, let's download our training data into this directory so we can use it in our module. We have hosted it for you as a CSV on GitHub.

.. code:: bash

~/Iris.Classify/ $ curl -O https://raw.githubusercontent.com/nstackcom/nstack-examples/master/iris/Iris.Classify/train.csv

Defining our API
----------------

As we know what the input and output of our classifier are going to look like, let's edit the ``api`` section of ``nstack.yaml`` to define our API (i.e. the entry-point into our module). By default, a new module contains a sample function ``numChars``, which we replace with our definition. We're going to call the function we write in Python ``predict``, which means we fill in the ``api`` section of ``nstack.yaml`` as follows:

.. code :: java

api : |
    predict : (Double, Double, Double, Double) -> Text


This means we want to productionise a single function, ``predict``, which takes four ``Double``\s (the measurements) and returns ``Text`` (the iris species).


Writing our classifier
----------------------

Now that we've defined our API, let's jump into our Python module, which lives in ``service.py``.
We see that NStack has created a class ``Service``. This is where we add the functions for our module. Right now it also has a sample function in it, ``numChars``, which we can remove.

Let's import the libraries we're using.

.. code :: python

import nstack
import pandas as pd

from sklearn.ensemble import RandomForestClassifier

.. note :: Python modules must also import ``nstack``

Before we add our ``predict`` function, we're going to add ``__init__``, the Python constructor function which runs upon the creation of our module. It's going to load our data from ``train.csv``, and use it to train our Random Forest classifier:

.. code :: python

def __init__(self):
    train = pd.read_csv("train.csv")

    self.cols = ['petal_length', 'petal_width', 'sepal_length', 'sepal_width']
    colsRes = ['class']
    trainArr = train.as_matrix(self.cols)
    trainRes = train.as_matrix(colsRes)

    rf = RandomForestClassifier(n_estimators=100)
    rf.fit(trainArr, trainRes)
    self.rf = rf

Now we can write our ``predict`` function. Its second argument, ``inputArr``, is the input -- in this case, our four ``Double``\s. To return ``Text``, we simply return a string from the function in Python.

.. code :: python

def predict(self, inputArr):
    points = [inputArr]
    df = pd.DataFrame(points, columns=self.cols)

    results = self.rf.predict(df)
    return results.item()
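
Before publishing, it can be worth checking the model logic outside of NStack. The following standalone sketch (our own addition, not part of the generated module) reproduces the training and prediction steps from ``service.py`` so you can run them directly in a local Python session:

.. code :: python

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

cols = ['petal_length', 'petal_width', 'sepal_length', 'sepal_width']

# Train exactly as __init__ does
train = pd.read_csv("train.csv")
rf = RandomForestClassifier(n_estimators=100)
rf.fit(train.as_matrix(cols), train.as_matrix(['class']))

# Predict exactly as predict does, for one set of measurements
df = pd.DataFrame([[4.7, 1.4, 6.1, 2.9]], columns=cols)
print(rf.predict(df).item())  # prints an iris species, e.g. "Iris-versicolor"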

Configuring your module
-----------------------

When your module is started, it is run in a Linux container on the NStack server. Because our module uses libraries like ``pandas`` and ``sklearn``, we have to tell NStack to install some extra operating system libraries inside your module's container. NStack lets us specify these in our ``nstack.yaml`` configuration file in the ``packages`` section. Let's add the following packages:

.. code :: yaml

packages: ['numpy', 'python3-scikit-learn', 'scipy', 'python3-scikit-image', 'python3-pandas']

Additionally, we want to tell NStack to copy our ``train.csv`` file into our module, so we can use it in ``__init__``. ``nstack.yaml`` also has a section for specifying files you'd like to include:

.. code :: yaml

files: ['train.csv']
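
Taken together, the sections of ``nstack.yaml`` we have touched in this tutorial now look roughly as follows (the rest of the generated file is left as-is; the exact layout may differ slightly in your version):

.. code :: yaml

api : |
    predict : (Double, Double, Double, Double) -> Text

packages: ['numpy', 'python3-scikit-learn', 'scipy', 'python3-scikit-image', 'python3-pandas']

files: ['train.csv']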

Publishing and starting
-----------------------

Now we're ready to build and publish our classifier. Remember, even though we run this command locally, our module gets built and published on your NStack server in the cloud.

.. code :: bash

~/Iris.Classify/ $ nstack build
Building NStack Container module Iris.Classify. Please wait. This may take some time.
Module Iris.Classify built successfully. Use `nstack list functions` to see all available functions.

We can now see ``Iris.Classify.predict`` in the list of existing functions (along with previously built functions like ``Demo.numChars``):

.. code :: bash

~/Iris.Classify/ $ nstack list functions
Iris.Classify:0.0.1-SNAPSHOT
    predict : (Double, Double, Double, Double) -> Text
Demo:0.0.1-SNAPSHOT
    numChars : Text -> Integer

Our classifier is now published, but to use it we need to connect it to an event source and sink. In the previous tutorial, we used HTTP as a source, and the NStack log as a sink. We can do the same here. This time, instead of creating a workflow module right away, we can use NStack's ``notebook`` command to test our workflow first. ``notebook`` opens an interactive shell where we can write our workflow; when we are finished, we press ``Ctrl-D``.

.. code :: bash

~/Iris.Classify/ $ nstack notebook
import Iris.Classify:0.0.1-SNAPSHOT as Classifier
Sources.http<(Double, Double, Double, Double)> { http_path = "/irisendpoint" } | Classifier.predict | Sinks.log<Text>
[Ctrl-D]

This creates an HTTP endpoint on ``http://localhost:8080/irisendpoint`` which can receive four ``Double``\s, and writes the results to the log as ``Text``. Let's check it is running as a process:

.. code :: bash

~/Iris.Classify/ $ nstack ps
1
2

In this instance, it is running as process ``2``. We can test our classifier by sending it some of the sample data from ``train.csv``.

.. code :: bash

~/Iris.Classify/ $ curl -X PUT -d '{ "params" : [4.7, 1.4, 6.1, 2.9] }' localhost:8080/irisendpoint
Msg Accepted
~/Iris.Classify/ $ nstack log 2
Feb 17 10:32:30 nostromo nstack-server[8925]: OUTPUT: "Iris-versicolor"
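
If you prefer to exercise the endpoint from code rather than ``curl``, the same request can be made with Python's ``requests`` library. This is just an illustrative sketch assuming the endpoint created above; the prediction itself still appears in the NStack log:

.. code :: python

import requests

# Same payload as the curl example: four measurements for one iris
payload = {"params": [4.7, 1.4, 6.1, 2.9]}
resp = requests.put("http://localhost:8080/irisendpoint", json=payload)
print(resp.status_code, resp.text)  # e.g. 200 and "Msg Accepted"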

Our classifier is now productionised. Next, we're going to explore some of the more sophisticated workflows you can build using NStack.





116 changes: 116 additions & 0 deletions advanced_start/workflow_power.rst
@@ -0,0 +1,116 @@
.. _workflow_power:

More Powerful Workflows
=======================

Now that we have published our classifier to NStack as a module, we can use it to demonstrate some of the more powerful features of the workflow engine.

Multiple Steps
---------------

Workflows can contain as many steps as you like, as long as the output type of each step matches the input type of the next. For instance, let's say we wanted to create the following workflow:

- Expose an HTTP endpoint which takes four ``Double``\s
- Send these ``Double``\s to our classifier, ``Iris.Classify``, which will tell us the species of the iris
- Count the number of characters in the species of the iris using our ``Demo.numChars`` function
- Write the result to the log

We could write the following workflow:

.. code :: java

module Iris.Workflow:0.0.1-SNAPSHOT {
    import Iris.Classify:0.0.1-SNAPSHOT as Classifier;
    import Demo:0.0.1-SNAPSHOT as Demo;
    def multipleSteps = Sources.http<(Double, Double, Double, Double)> { http_path = "/irisendpoint" } | Classifier.predict | Demo.numChars | Sinks.log<Integer>;
}

.. note :: ``numChars`` and ``predict`` can be `composed` together because their types -- or schemas -- match. If ``predict`` wasn't configured to output ``Text``, or ``numChars`` wasn't configured to take ``Text`` as input, NStack would not let you build this workflow.

Partial workflows
-----------------

All of the workflows that we have written so far have been `fully composed`, which means that they contain a source and a sink. Often, you want to split up sources, sinks, and functions into separate pieces you can share and reuse. In this case, we say that a workflow is `partially composed`, which just means it does not contain both a source and a sink. These workflows cannot be ``start``\ed by themselves, but can be shared and attached to other sources and/or sinks to become `fully composed`.

For instance, we could combine ``Iris.Classify.predict`` and ``Demo.numChars`` from the previous example to form a new workflow, ``speciesLength``, like so:

.. code :: java

module Iris.Workflow:0.0.1-SNAPSHOT {
    import Iris.Classify:0.0.1-SNAPSHOT as Classifier;
    import Demo:0.0.1-SNAPSHOT as Demo;
    def speciesLength = Classifier.predict | Demo.numChars;
}

Because our workflow ``Iris.Workflow.speciesLength`` has not been connected to a source or a sink, it is itself still a function. If we build this workflow, we can see ``speciesLength`` alongside our other functions by using the ``list`` command:

.. code :: bash

~/Iris.Workflow/ $ nstack list functions
Iris.Classify:0.0.1-SNAPSHOT
    predict : (Double, Double, Double, Double) -> Text
Demo:0.0.1-SNAPSHOT
    numChars : Text -> Integer
Iris.Workflow:0.0.1-SNAPSHOT
    speciesLength : (Double, Double, Double, Double) -> Integer

As we would expect, the input type of the workflow is the input type of ``Iris.Classify.predict``, and the output type is the output type of ``Demo.numChars``. Like other functions, this must be connected to a source and a sink to make it `fully composed`, which means we could use this workflow in *another* workflow.

.. code :: java

module Iris.Endpoint:0.0.1-SNAPSHOT {
    import Iris.Workflow:0.0.1-SNAPSHOT as IrisWF;
    def http = Sources.http<(Double, Double, Double, Double)> | IrisWF.speciesLength | Sinks.log<Integer>;
}

Often, you want to reuse a source or a sink without reconfiguring it each time. To do this, we can similarly separate the sources and sinks into their own definitions, like so:

.. code :: java

module Iris.Workflow:0.0.1-SNAPSHOT {
    import Iris.Classify:0.0.1-SNAPSHOT as Classifier;
    def httpEndpoint = Sources.http<(Double, Double, Double, Double)> { http_path = "speciesLength" };
    def logSink = Sinks.log<Text>;
    def speciesWf = httpEndpoint | Classifier.predict | logSink;
}

Separating sources and sinks becomes useful when you're connecting to more complex integrations which you don't want to configure each time you use them -- often you want to reuse a source or sink in multiple workflows. In the following example, we define a module which provides a source and a sink which both sit on top of Postgres.

.. code :: java

module Iris.DB:0.0.1-SNAPSHOT {
    def petalsAndSepals = Sources.postgres<(Double, Double, Double, Double)> {
        pg_database = "flowers",
        pg_query = "SELECT * FROM iris"
    };
    def irisSpecies = Sinks.postgres<Text> {
        pg_database = "flowers",
        pg_table = "iris"
    };
}

If we built this module, ``petalsAndSepals`` and ``irisSpecies`` could themselves be used as sources and sinks in other modules.

We may also want to attach functions to a source or sink to do some pre- or post-processing. For instance:

.. code :: java

module IrisCleanDbs:0.0.1-SNAPSHOT {
    import PetalTools:1.0.0 as PetalTools;
    import TextTools:1.1.2 as TextTools;
    import Iris.DB:0.0.1-SNAPSHOT as DB;
    def roundedPetalsSource = DB.petalsAndSepals | PetalTools.roundPetalLengths;
    def irisSpeciesUppercase = TextTools.toUppercase | DB.irisSpecies;
}

Because ``roundedPetalsSource`` is a combination of a source and a function, it is still a valid source. Similarly, ``irisSpeciesUppercase`` is a combination of a function and a sink, so it is still a valid sink.

Because NStack functions, sources, and sinks can be composed and reused, you can build powerful abstractions over your infrastructure.

1 change: 1 addition & 0 deletions index.rst
@@ -16,6 +16,7 @@ Welcome to the NStack Documentation!
concepts
installation
quick_start/index
advanced_start/index
architecture
reference/index

4 changes: 2 additions & 2 deletions quick_start/index.rst
@@ -1,7 +1,7 @@
.. _quick_start_index:

******************
Quick Start
Quick Tutorial
******************

In this section, we're going to see how to build up a simple NStack module, deploy it to the cloud, and use it in a workflow by connecting it to a `source` and a `sink`.
@@ -10,7 +10,7 @@ By the end of the tutorial, you will learn how to publish your code to NStack an

.. note:: To learn more about modules, sources, and sinks, read :ref:`Concepts<concepts>`

Make sure you have :doc:`installed NStack </installation>` and let's get going.
Make sure you have :doc:`installed NStack </installation>`.

.. toctree::

2 changes: 1 addition & 1 deletion quick_start/module.rst
@@ -1,6 +1,6 @@
.. _module:

Writing your Module
Building a Module
=========================

NStack Modules contain the functions that can be used on the NStack platform. They are the building blocks which can be used to build workflows and applications.
4 changes: 0 additions & 4 deletions quick_start/starting.rst

This file was deleted.

2 changes: 1 addition & 1 deletion quick_start/workflow.rst
@@ -1,6 +1,6 @@
.. _workflow:

Building your Workflow
Building a Workflow
=========================

In the previous tutorial, we built and published a Python module `Demo` using NStack.
