This repository has been archived by the owner on Jun 23, 2018. It is now read-only.
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
More complex quick start examples to introduce other concepts (#17)
* More complex quick start
* Merge new changes
* Split up more complex examples, integrate nick's comments on #17, add more content for partial workflows, add new syntax
* Methods -> Functions, and reorganisation
* remove functions from fully composed workflows
* updated with new module versioning syntax, fixed a few typos, restructured
Showing 8 changed files with 307 additions and 8 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
.. _advanced_start_index: | ||
|
||
****************** | ||
Advanced Tutorial | ||
****************** | ||
|
||
In this section, we're going to productionise a Random Forest classifier written with `sklearn <http://scikit-learn.org/>`_, deploy it to the cloud, and use it in a more sophisticated workflow. | ||
|
||
By the end of the tutorial, you will learn how to build modules with dependencies, write more sophisticated workflows, and build abstractions over data-sources. Enjoy! | ||
|
||
|
||
.. toctree:: | ||
|
||
more | ||
workflow_power |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,171 @@ | ||
.. _more: | ||
|
||
Productionising a Classifier as an NStack Module | ||
================================================ | ||
|
||
So far, we have built and published a Python module with a single function on it, ``numChars``, and built a workflow which connects our function to an HTTP endpoint. This in itself isn't particularly useful, so, now that you've got the gist of how NStack works, it's time to build something more realistic! | ||
|
||
In this tutorial, we're going to create and productionise a simple classifier which uses the famous `iris dataset <https://en.wikipedia.org/wiki/Iris_flower_data_set>`_. | ||
We're going to train our classifier to classify which species an iris is, given measurements of its sepals and petals. You can find the dataset we're using to train our model `here <https://raw.githubusercontent.com/nstackcom/nstack-examples/master/iris/Iris.Classify/train.csv>`_. | ||
|
||
First, let's look at the format of our data to see how we should approach the problem. We see that we have five fields: | ||
|
||
================ ======================= =========== | ||
Field Name       Description             Type        | ||
================ ======================= =========== | ||
``species``      The species of iris     Text        | ||
``sepal_width``  The width of the sepal  Double      | ||
``sepal_length`` The length of the sepal Double      | ||
``petal_width``  The width of the petal  Double      | ||
``petal_length`` The length of the petal Double      | ||
================ ======================= =========== | ||
|
||
If we are trying to find the species based on the sepal and petal measurements, this means these measurements are going to be the input to our classifier module, with text being the output. This means we need to write a function in Python which takes four ``Double``\s and returns ``Text``. | ||
|
||
Creating your classifier module | ||
------------------------------- | ||
|
||
To begin, let's make a new directory called ``Iris.Classify``, ``cd`` into it, and initialise a new Python module: | ||
|
||
.. code:: bash | ||
|
||
~/ $ mkdir Iris.Classify; cd Iris.Classify | ||
~/Iris.Classify/ $ nstack init python | ||
python module 'Iris.Classify' successfully initialised at ~/Iris.Classify | ||
|
||
Next, let's download our training data into this directory so we can use it in our module. We have hosted it for you as a CSV on GitHub. | ||
|
||
.. code:: bash | ||
|
||
~/Iris.Classify/ $ curl -O https://raw.githubusercontent.com/nstackcom/nstack-examples/master/iris/Iris.Classify/train.csv | ||
|
||
Defining our API | ||
---------------- | ||
|
||
As we know what the input and output of our classifier is going to look like, let's edit the ``api`` section of ``nstack.yaml`` to define our API (i.e. the entry-point into our module). By default, a new module contains a sample function ``numChars``, which we replace with our definition. We're going to call the function we write in Python ``predict``, which means we fill in the ``api`` section of ``nstack.yaml`` as follows: | ||
|
||
.. code :: java | ||
|
||
api : | | ||
predict : (Double, Double, Double, Double) -> Text | ||
|
||
|
||
This means we want to productionise a single function, ``predict``, which takes four ``Double``\s (the measurements) and returns ``Text`` (the iris species). | ||
|
||
|
||
Writing our classifier | ||
---------------------- | ||
|
||
Now that we've defined our API, let's jump into our Python module, which lives in ``service.py``. | ||
We see that NStack has created a class ``Service``. This is where we add the functions for our module. Right now it also has a sample function in it, ``numChars``, which we can remove. | ||
|
||
Let's import the libraries we're using. | ||
|
||
.. code :: python | ||
|
||
import nstack | ||
import pandas as pd | ||
|
||
from sklearn.ensemble import RandomForestClassifier | ||
|
||
.. note :: Python modules must also import ``nstack`` | ||
|
||
Before we add our ``predict`` function, we're going to add ``__init__``, the Python constructor function which runs upon the creation of our module. It's going to load our data from ``train.csv``, and use it to train our Random Forest classifier: | ||
|
||
.. code :: python | ||
|
||
  def __init__(self): | ||
      train = pd.read_csv("train.csv") | ||
|
||
      self.cols = ['petal_length', 'petal_width', 'sepal_length', 'sepal_width'] | ||
      colsRes = ['class'] | ||
      # as_matrix() was removed in pandas 1.0; select the columns and use .values | ||
      trainArr = train[self.cols].values | ||
      # flatten the single label column into the 1-D array sklearn expects | ||
      trainRes = train[colsRes].values.ravel() | ||
|
||
      rf = RandomForestClassifier(n_estimators=100) | ||
      rf.fit(trainArr, trainRes) | ||
      self.rf = rf | ||
|
||
Now we can write our ``predict`` function. The second argument, ``inputArr``, is the input -- in this case, our four ``Double``\s. To return ``Text``, we simply return a Python string from the function. | ||
|
||
.. code :: python | ||
|
||
  def predict(self, inputArr): | ||
      points = [inputArr] | ||
      df = pd.DataFrame(points, columns=self.cols) | ||
|
||
      results = self.rf.predict(df) | ||
      return results.item() | ||
|
||
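The order of the values in ``inputArr`` matters, because ``predict`` pairs them positionally with ``self.cols``. The following standalone sketch uses only the standard library and is not part of the module; the helper name ``label_measurements`` is made up for illustration. It shows which measurement each position is treated as.

```python
# Column order copied from ``self.cols`` in ``__init__`` above.
COLS = ['petal_length', 'petal_width', 'sepal_length', 'sepal_width']


def label_measurements(input_arr):
    """Pair each incoming value with the column it will be treated as.

    Hypothetical helper for illustration only; the real module hands the
    values to pandas via ``pd.DataFrame([inputArr], columns=self.cols)``.
    """
    if len(input_arr) != len(COLS):
        raise ValueError("expected %d values, got %d" % (len(COLS), len(input_arr)))
    return dict(zip(COLS, input_arr))


labelled = label_measurements((4.7, 1.4, 6.1, 2.9))
print(labelled)
```

Sending the measurements in a different order would not raise an error, but the classifier would silently read, say, a sepal width as a petal length.
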
Configuring your module | ||
----------------------- | ||
|
||
When your module is started, it is run in a Linux container on the NStack server. Because our module uses libraries like ``pandas`` and ``sklearn``, we have to tell NStack to install some extra operating system libraries inside your module's container. NStack lets us specify these in our ``nstack.yaml`` configuration file in the ``packages`` section. Let's add the following packages: | ||
|
||
.. code :: yaml | ||
|
||
packages: ['numpy', 'python3-scikit-learn', 'scipy', 'python3-scikit-image', 'python3-pandas'] | ||
|
||
Additionally, we want to tell NStack to copy our ``train.csv`` file into our module, so we can use it in ``__init__``. ``nstack.yaml`` also has a section for specifying files you'd like to include: | ||
|
||
.. code :: yaml | ||
|
||
files: ['train.csv'] | ||
|
||
Publishing and starting | ||
----------------------- | ||
|
||
Now we're ready to build and publish our classifier. Remember, even though we run this command locally, our module gets built and published on your NStack server in the cloud. | ||
|
||
.. code :: bash | ||
|
||
~/Iris.Classify/ $ nstack build | ||
Building NStack Container module Iris.Classify. Please wait. This may take some time. | ||
Module Iris.Classify built successfully. Use `nstack list functions` to see all available functions. | ||
|
||
We can now see ``Iris.Classify.predict`` in the list of existing functions (along with previously built functions like ``Demo.numChars``): | ||
|
||
.. code :: bash | ||
|
||
~/Iris.Classify/ $ nstack list functions | ||
Iris.Classify:0.0.1-SNAPSHOT | ||
predict : (Double, Double, Double, Double) -> Text | ||
Demo:0.0.1-SNAPSHOT | ||
numChars : Text -> Integer | ||
|
||
Our classifier is now published, but to use it we need to connect it to an event source and sink. In the previous tutorial, we used HTTP as a source, and the NStack log as a sink. We can do the same here. This time, instead of creating a workflow module right away, we can use NStack's ``notebook`` command to test our workflow first. ``notebook`` opens an interactive shell where we can write our workflow. When we are finished, we press ``Ctrl-D``. | ||
|
||
.. code :: bash | ||
|
||
~/Iris.Classify/ $ nstack notebook | ||
import Iris.Classify:0.0.1-SNAPSHOT as Classifier | ||
Sources.http<(Double, Double, Double, Double)> { http_path = "/irisendpoint" } | Classifier.predict | Sinks.log<Text> | ||
[Ctrl-D] | ||
|
||
This creates an HTTP endpoint on ``http://localhost:8080/irisendpoint`` which can receive four ``Double``\s, and writes the results to the log as ``Text``. Let's check it is running as a process: | ||
|
||
.. code :: bash | ||
|
||
~/Iris.Classify/ $ nstack ps | ||
1 | ||
2 | ||
|
||
In this instance, it is running as process ``2``. We can test our classifier by sending it some of the sample data from ``train.csv``. | ||
|
||
.. code :: bash | ||
|
||
~/Iris.Classify/ $ curl -X PUT -d '{ "params" : [4.7, 1.4, 6.1, 2.9] }' localhost:8080/irisendpoint | ||
Msg Accepted | ||
~/Iris.Classify/ $ nstack log 2 | ||
Feb 17 10:32:30 nostromo nstack-server[8925]: OUTPUT: "Iris-versicolor" | ||
|
||
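The same request can be built from Python. This is a minimal sketch using only the standard library; it assumes the endpoint URL and the ``{"params": [...]}`` body shown in the ``curl`` call above, and ``build_request`` is an illustrative name rather than part of any NStack client. The request is constructed but not sent, so the snippet runs without a server.

```python
import json
import urllib.request

# Assumed endpoint: the URL the tutorial's workflow exposes.
ENDPOINT = "http://localhost:8080/irisendpoint"


def build_request(measurements, url=ENDPOINT):
    """Build (but do not send) the PUT request for four measurements."""
    body = json.dumps({"params": list(measurements)}).encode("utf-8")
    return urllib.request.Request(url, data=body, method="PUT")


request = build_request([4.7, 1.4, 6.1, 2.9])
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To actually send it against a running workflow (needs a live NStack server):
#   with urllib.request.urlopen(request) as response:
#       print(response.read().decode("utf-8"))
```
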
Our classifier is now productionised. Next, we're going to explore some of the more sophisticated workflows you can build using NStack. | ||
|
||
|
||
|
||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,116 @@ | ||
.. _workflow_power: | ||
|
||
More Powerful Workflows | ||
======================= | ||
|
||
Now that we have published our classifier to NStack as a module, we can use it to demonstrate some of the more powerful features of the workflow engine. | ||
|
||
Multiple Steps | ||
--------------- | ||
|
||
Workflows can contain as many steps as you like, as long as the output type of each step matches the input type of the next. For instance, let's say we wanted to create the following workflow: | ||
|
||
- Expose an HTTP endpoint which takes four ``Double``\s | ||
- Send these ``Double``\s to our classifier, ``Iris.Classify``, which will tell us the species of the iris | ||
- Count the number of characters in the species of the iris using our ``Demo.numChars`` function | ||
- Write the result to the log | ||
|
||
We could write the following workflow: | ||
|
||
.. code :: bash | ||
  module Iris.Workflow:0.0.1-SNAPSHOT { | ||
    import Iris.Classify:0.0.1-SNAPSHOT as Classifier; | ||
    import Demo:0.0.1-SNAPSHOT as Demo; | ||
    def multipleSteps = Sources.http<(Double, Double, Double, Double)> { http_path = "/irisendpoint" } | Classifier.predict | Demo.numChars | Sinks.log<Integer>; | ||
  } | ||
.. note :: ``numChars`` and ``predict`` can be `composed` together because their types -- or schemas -- match. If ``predict`` wasn't configured to output ``Text``, or ``numChars`` wasn't configured to take ``Text`` as input, NStack would not let you build the workflow above. | ||
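
The type-matching rule can be modelled in a few lines of Python. This is only a sketch of the idea, not how NStack implements it: ``Fn`` and ``compose`` are hypothetical names, and types are represented as plain tuples of strings taken from the function signatures in this tutorial.

```python
from typing import NamedTuple, Tuple


class Fn(NamedTuple):
    """A function signature: a name, an input type, and an output type."""
    name: str
    input_type: Tuple[str, ...]
    output_type: Tuple[str, ...]


def compose(first, second):
    """Model ``first | second``: allowed only when the types line up."""
    if first.output_type != second.input_type:
        raise TypeError(
            "cannot compose %s | %s: %s does not match %s"
            % (first.name, second.name, first.output_type, second.input_type)
        )
    return Fn(first.name + " | " + second.name, first.input_type, second.output_type)


# Signatures as reported by ``nstack list functions`` in this tutorial.
predict = Fn("Classifier.predict", ("Double",) * 4, ("Text",))
num_chars = Fn("Demo.numChars", ("Text",), ("Integer",))

species_length = compose(predict, num_chars)
print(species_length.input_type, "->", species_length.output_type)

try:
    compose(num_chars, predict)  # Integer output cannot feed four Doubles
except TypeError as err:
    print("rejected:", err)
```

The composed value takes its input type from the first step and its output type from the last.
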
Partial workflows | ||
----------------- | ||
|
||
All of the workflows that we have written so far have been `fully composed`, which means that they contain a source and a sink. Many times, you want to split up sources, sinks, and functions into separate pieces you can share and reuse. In this case, we say that a workflow is `partially composed`, which just means it is missing a source, a sink, or both. These workflows cannot be ``start``\ed by themselves, but can be shared and attached to other sources and/or sinks to become `fully composed`. | ||
|
||
For instance, we could combine ``Iris.Classify.predict`` and ``Demo.numChars`` from the previous example to form a new workflow ``speciesLength`` like so: | ||
|
||
.. code :: java | ||
  module Iris.Workflow:0.0.1-SNAPSHOT { | ||
    import Iris.Classify:0.0.1-SNAPSHOT as Classifier; | ||
    import Demo:0.0.1-SNAPSHOT as Demo; | ||
    def speciesLength = Classifier.predict | Demo.numChars; | ||
  } | ||
Because our workflow ``Iris.Workflow.speciesLength`` has not been connected to a source or a sink, it in itself is still a function. If we build this workflow, we can see ``speciesLength`` alongside our other functions by using the ``list`` command: | ||
|
||
.. code :: bash | ||
  ~/Iris.Workflow/ $ nstack list functions | ||
  Iris.Classify:0.0.1-SNAPSHOT | ||
    predict : (Double, Double, Double, Double) -> Text | ||
  Demo:0.0.1-SNAPSHOT | ||
    numChars : Text -> Integer | ||
  Iris.Workflow:0.0.1-SNAPSHOT | ||
    speciesLength : (Double, Double, Double, Double) -> Integer | ||
As we would expect, the input type of the workflow is the input type of ``Iris.Classify.predict``, and the output type is the output type of ``Demo.numChars``. Like other functions, this must be connected to a source and a sink to make it `fully composed`, which means we could use this workflow in *another* workflow: | ||
|
||
.. code :: bash | ||
  module Iris.Endpoint:0.0.1-SNAPSHOT { | ||
    import Iris.Workflow:0.0.1-SNAPSHOT as IrisWF; | ||
    def http = Sources.http<(Double, Double, Double, Double)> | IrisWF.speciesLength | Sinks.log<Integer>; | ||
  } | ||
Often you want to re-use a source or a sink without reconfiguring it. To do this, we can similarly separate the sources and sinks into separate workflows, like so: | ||
|
||
.. code :: java | ||
  module Iris.Workflow:0.0.1-SNAPSHOT { | ||
    import Iris.Classify:0.0.1-SNAPSHOT as Classifier; | ||
    def httpEndpoint = Sources.http<(Double, Double, Double, Double)> { http_path = "speciesLength" }; | ||
    def logSink = Sinks.log<Text>; | ||
    def speciesWf = httpEndpoint | Classifier.predict | logSink; | ||
  } | ||
Separating sources and sinks becomes useful when you're connecting to more complex integrations which you don't want to configure each time you use them -- many times you want to reuse a source or sink in multiple workflows. In the following example, we define a module which provides a source and a sink which both sit on top of Postgres. | ||
|
||
.. code :: java | ||
  module Iris.DB:0.0.1-SNAPSHOT { | ||
    def petalsAndSepals = Sources.postgres<(Double, Double, Double, Double)> { | ||
      pg_database = "flowers", | ||
      pg_query = "SELECT * FROM iris" | ||
    }; | ||
    def irisSpecies = Sinks.postgres<Text> { | ||
      pg_database = "flowers", | ||
      pg_table = "iris" | ||
    }; | ||
  } | ||
If we built this module, ``petalsAndSepals`` and ``irisSpecies`` could be used in other modules as sources and sinks, themselves. | ||
|
||
We may also want to add functions to do some pre- or post-processing to a source or sink. For instance: | ||
|
||
.. code :: java | ||
  module IrisCleanDbs:0.0.1-SNAPSHOT { | ||
    import PetalTools:1.0.0 as PetalTools; | ||
    import TextTools:1.1.2 as TextTools; | ||
    import Iris.DB:0.0.1-SNAPSHOT as DB; | ||
    def roundedPetalsSource = DB.petalsAndSepals | PetalTools.roundPetalLengths; | ||
    def irisSpeciesUppercase = TextTools.toUppercase | DB.irisSpecies; | ||
  } | ||
Because ``roundedPetalsSource`` is a combination of a source and a function, it is still a valid source. Similarly, ``irisSpeciesUppercase`` is a combination of a function and a sink, so it is still a valid sink. | ||
|
||
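The rules for what a composition produces can be summarised with a small Python model. This is a sketch under simplifying assumptions: the ``Stage`` class is hypothetical, it tracks only whether each piece is a source, function, or sink, and it ignores the type checking NStack also performs.

```python
class Stage:
    """One piece of a workflow; ``kind`` is 'source', 'function' or 'sink'."""

    def __init__(self, name, kind):
        self.name = name
        self.kind = kind

    def __or__(self, other):
        # Mirror the ``|`` operator of the workflow language.
        rules = {
            ("source", "function"): "source",
            ("function", "function"): "function",
            ("function", "sink"): "sink",
            ("source", "sink"): "workflow",
        }
        kind = rules.get((self.kind, other.kind))
        if kind is None:
            raise TypeError("cannot connect a %s to a %s" % (self.kind, other.kind))
        return Stage("%s | %s" % (self.name, other.name), kind)


petals = Stage("DB.petalsAndSepals", "source")
round_lengths = Stage("PetalTools.roundPetalLengths", "function")
to_upper = Stage("TextTools.toUppercase", "function")
species = Stage("DB.irisSpecies", "sink")

rounded_source = petals | round_lengths
uppercase_sink = to_upper | species
print(rounded_source.kind, uppercase_sink.kind)
```

In this model only a value of kind ``workflow``, which has both a source and a sink, corresponds to a fully composed workflow that can be started.
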
Because NStack functions, sources, and sinks can be composed and reused, this lets you build powerful abstractions over infrastructure. | ||
|
This file was deleted.