# Hume

__Hume__ is a tool for building, deploying and interacting with machine learning models. Hume tries to solve the problem of creating machine learning artifacts that are completely portable without having to pin requirement versions. Hume also provides a framework for machine learning IO that we believe is more composable.

In this notebook we'll walk through building your first __hume__ artefact using an example machine learning problem.

In [None]:
%%bash
# The variable must be set on the same command (or exported) for pip to see it
PIP_EXTRA_INDEX_URL=http://54.83.192.85:8080 pip install hume



### Requirements

Before starting, you'll need Docker __1.10.0__ or greater installed. You can check with:

In [1353]:
! docker version

Client:
 Version:      1.10.2
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   c3959b1
 Built:        Mon Feb 22 22:37:33 2016
 OS/Arch:      darwin/amd64

Server:
 Version:      1.10.2
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   c3959b1
 Built:        Mon Feb 22 22:37:33 2016
 OS/Arch:      linux/amd64
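
If you'd rather check this requirement from a script, the version has to be compared numerically, since a plain string comparison would rank `1.9.0` above `1.10.0`. The sketch below parses text shaped like the output above; `parse_versions` and `meets_minimum` are illustrative helper names, not part of Hume or Docker.

```python
import re

# Sample text shaped like the `docker version` output above.
SAMPLE = """\
Client:
 Version:      1.10.2
 API version:  1.22

Server:
 Version:      1.10.2
 API version:  1.22
"""

def parse_versions(text):
    """Return every 'Version:' value as a tuple of ints, e.g. (1, 10, 2)."""
    found = re.findall(r"^[ \t]*Version:[ \t]*(\d+(?:\.\d+)*)", text, flags=re.MULTILINE)
    return [tuple(int(part) for part in version.split(".")) for version in found]

def meets_minimum(text, minimum=(1, 10, 0)):
    """True when at least one version was found and all are >= minimum."""
    versions = parse_versions(text)
    return bool(versions) and all(version >= minimum for version in versions)

print(parse_versions(SAMPLE))
print(meets_minimum(SAMPLE))
```

Comparing tuples of integers checks the client and the server versions component by component, which is what "1.10.0 or greater" means here.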


In addition, make sure you have a docker-machine instance running if you're not running on Linux.

We'll be using `pandashells` for some quick command line transformations, so make sure you have it installed. You can do so by running the following:

In [1692]:
! pip install pandashells



In [1235]:
!mkdir -p /tmp/humedemo
%cd /tmp/humedemo

/private/tmp/humedemo


### Data
First let's download the data we'll be using. For this example we'll be using the __iris__ dataset from the Rdatasets collection.

In [1252]:
!wget https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv

--2016-03-16 16:09:30--  https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv
Resolving vincentarelbundock.github.io... 23.235.44.133
Connecting to vincentarelbundock.github.io|23.235.44.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4821 (4.7K) [text/csv]
Saving to: 'iris.csv.2'


2016-03-16 16:09:33 (270 MB/s) - 'iris.csv.2' saved [4821/4821]



In [1257]:
# Make sure you have it and do some cleaning
!sed -i .bak '1,2s/\"\"/"id"/' iris.csv && rm iris.csv.bak && head iris.csv

"id","Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
"1",5.1,3.5,1.4,0.2,"setosa"
"2",4.9,3,1.4,0.2,"setosa"
"3",4.7,3.2,1.3,0.2,"setosa"
"4",4.6,3.1,1.5,0.2,"setosa"
"5",5,3.6,1.4,0.2,"setosa"
"6",5.4,3.9,1.7,0.4,"setosa"
"7",4.6,3.4,1.4,0.3,"setosa"
"8",5,3.4,1.5,0.2,"setosa"
"9",4.4,2.9,1.4,0.2,"setosa"


### Example 1: K-Means
For this example we'll be training a k-means model on the iris dataset.

#### Model
First let's write out our estimator.

In [1507]:
%%writefile demo.py
import os
import pickle
import sys
import json
import argparse
from sklearn.cluster import KMeans
import pandas as pd
import numpy as np


def fit(X, params):
    "Fit a model using data with the given parameters"
    kmeans = KMeans(**params)
    kmeans.fit(X)
    # As a side effect, we'll save the model so that we can predict
    # with it later.
    with open("./fitted.pkl", "wb") as f:
        pickle.dump(kmeans, f)
    return kmeans

def predict(X):
    "Predict cluster membership"
    try:
        with open("./fitted.pkl", "rb") as f:
            kmeans = pickle.load(f)
    except IOError as e:
        raise IOError("No fitted model found at {}. Train your model first before calling predict.".format(e.filename))
    else:
        return kmeans.predict(X)

def split_X_y(df, target):
    "Prep the data for modeling"
    X, y = df.drop(target, axis=1), df[target]
    return X, y

def fit_main(args):

    params_file  = os.getenv("HUME_PARAMS_FILE")
    data_file    = os.getenv("HUME_TRAIN_DATA")
    target_label = os.getenv("HUME_TARGET_LABEL")
    
    with open(params_file, "rb") as f:
        params = json.load(f)
    df = pd.read_csv(data_file, header=None)
    X, y = split_X_y(df.pivot(0,1,2), target_label)
    print("params: "+str(params))
    print("params type: "+str(type(params)))
    print(fit(X, params))
    
def predict_main(args):
    # print("args.sample = {}".format(args.sample))
    X = pd.read_csv(args.sample, header=None)
    out = predict(X)
    np.savetxt(sys.stdout, out, delimiter=",")

def make_parser():
    # create the top-level parser
    parser = argparse.ArgumentParser(prog='Hume inter')
    subparsers = parser.add_subparsers()

    # create the parser for the "fit" command
    parser_fit = subparsers.add_parser('fit')
    parser_fit.set_defaults(func=fit_main)

    # create the parser for the "predict" command
    parser_predict = subparsers.add_parser('predict')
    parser_predict.add_argument('sample', nargs="?", type=argparse.FileType('r'), default=sys.stdin)
    parser_predict.set_defaults(func=predict_main)
    return parser

# A trick to make sure we're not being called from an ipython notebook
if __name__ == '__main__' and "get_ipython" not in dir():
    parser = make_parser()
    args = parser.parse_args()
    args.func(args)

Overwriting demo.py
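
The subcommand dispatch used in `make_parser` can be tried on its own, without Docker or scikit-learn. The sketch below keeps the same `set_defaults(func=...)` pattern but swaps the real handlers for stubs that only return a string, and uses a plain string argument instead of `argparse.FileType` so that no files are opened; both substitutions are assumptions made for illustration.

```python
import argparse

def fit_main(args):
    # Stub standing in for the real training routine
    return "fit"

def predict_main(args):
    # Stub standing in for the real scoring routine
    return "predict:{}".format(args.sample)

def make_parser():
    parser = argparse.ArgumentParser(prog="demo")
    subparsers = parser.add_subparsers()

    parser_fit = subparsers.add_parser("fit")
    parser_fit.set_defaults(func=fit_main)

    parser_predict = subparsers.add_parser("predict")
    # "-" is used here as a placeholder meaning "read from stdin"
    parser_predict.add_argument("sample", nargs="?", default="-")
    parser_predict.set_defaults(func=predict_main)
    return parser

parser = make_parser()
for argv in (["fit"], ["predict"], ["predict", "rows.csv"]):
    args = parser.parse_args(argv)
    # Each subparser stored its handler under args.func
    print(argv, "->", args.func(args))
```

Because each subparser attaches its own handler, the entry point only needs `args.func(args)`, and adding a new command means adding one more subparser.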


#### Environment
Now we need a Docker image as the runtime environment for our model. For simplicity, we'll just inherit from a pre-built scipy image since it has most of the dependencies we'll need to run our model.

In [1508]:
%%writefile Dockerfile
FROM registry.spartzinc.com:5000/dose_scipy27:latest
MAINTAINER Troy de Freitas troy@dose.com

Overwriting Dockerfile


However, the image does not come with `sklearn`, so we need to add a directive to install it.

In [1509]:
%%writefile -a Dockerfile

RUN pip install sklearn

Appending to Dockerfile


Finally, we need to tell Docker where to put our model and how to run it. The `ENTRYPOINT` is significant to Hume. Hume parses the Dockerfile looking for the `ENTRYPOINT` as the gateway for running commands.

In [1510]:
%%writefile -a Dockerfile

COPY . /opt/model

ENTRYPOINT ["python", "/opt/model/demo.py"]

Appending to Dockerfile


Note: In future versions you'll be able to indicate the commands for different operations using environment variables, for instance:
```docker
ENV HUME_CMD_FIT "python /opt/model/demo.py --fit"
ENV HUME_CMD_PREDICT "python /opt/model/demo.py --predict"
ENV HUME_CMD_PARAMS "cat /opt/hume/params"
```
but we'll just stick with the entrypoint for now.

Putting it all together, our Dockerfile looks like the following:

In [1511]:
!cat Dockerfile

FROM registry.spartzinc.com:5000/dose_scipy27:latest
MAINTAINER Troy de Freitas troy@dose.com
RUN pip install sklearn
COPY . /opt/model

ENTRYPOINT ["python", "/opt/model/demo.py"]

#### Build Step

Now that we've written a Dockerfile specifying our estimator, we're ready to build our Hume container. To build your container, simply run:

In [1512]:
! hume build demo . > /dev/null 2>&1

Hume just built a Docker image tagged as `demo` from the Dockerfile in our current directory. Once we've built our estimator, we have a reusable artefact that we can retrain multiple times with different data and different parameters while reusing only one base image.

#### Parameters

Often you want to train a model with the same estimator but with different parameters. Hume supports passing parameters to estimators in a couple of ways: via a `params.yml` file in the same directory as your Dockerfile or via command line arguments. We'll use the first method for this example.

In [1513]:
%%writefile params.yml
n_clusters: 4
init: "k-means++"

Overwriting params.yml


#### Training Data

Machine learning data customarily comes in a matrix or wide format. However, Hume takes data in a long format, which is better for encoding sparse data and for streaming. We'll do a quick transformation using pandas (via `pandashells`) to munge our iris dataset into the long format that Hume accepts.

In [1514]:
%%bash
cat iris.csv | p.df "pd.melt(df, id_vars='id')" > iris_long.csv
head -n 5 iris_long.csv

"id","variable","value"
1,"Sepal.Length",5.1
2,"Sepal.Length",4.9
3,"Sepal.Length",4.7
4,"Sepal.Length",4.6
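
The same reshaping can be done directly in pandas, which also makes it easy to see the inverse step that a model performs before fitting. This is a minimal sketch on three hand-typed rows shaped like the cleaned `iris.csv`; the variable names are illustrative.

```python
import pandas as pd

# Three rows shaped like the cleaned iris.csv above
wide = pd.DataFrame({
    "id": [1, 2, 3],
    "Sepal.Length": [5.1, 4.9, 4.7],
    "Sepal.Width": [3.5, 3.0, 3.2],
    "Species": ["setosa", "setosa", "setosa"],
})

# Wide -> long: one (id, variable, value) row per cell
long_df = pd.melt(wide, id_vars="id")

# Long -> wide: rebuild one row per id and one column per variable
restored = long_df.pivot(index="id", columns="variable", values="value")

print(long_df.head())
print(restored)
```

Because `Species` is a string, the melted `value` column mixes numbers and text, so the numeric columns come back with a generic object dtype after pivoting and may need converting (for example with `pd.to_numeric`) before modeling.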


#### Model Fitting

Now that we have our parameters and data in the right format, we're ready to fit our model. We'll pass the long-style csv we created to Hume via stdin.

In [1515]:
# Note that the `-` tells Hume to accept data via stdin
!tail -n+2 iris_long.csv | hume fit --target Species demo -

Creating build context for Docker image at /var/folders/6q/_wph_y897bgf0q5xss3jltpw0000gn/T/tmpDzZ6qz
Sending build context to Docker daemon 19.46 kB
Step 1 : FROM demo
# Executing 13 build triggers...
Step 1 : RUN mkdir /opt/hume
 ---> Running in 6e9987bd4101
Step 1 : ARG TRAIN_WITH
 ---> Running in e52c183fea4b
Step 1 : ENV TRAIN_WITH ${TRAIN_WITH}
 ---> Running in 15f9632f5197
Step 1 : ENV HUME_TRAIN_DATA /opt/hume/data
 ---> Running in 09eca22fecac
Step 1 : COPY ${TRAIN_WITH} $HUME_TRAIN_DATA
Step 1 : ARG TRAIN_PARAMS=
 ---> Running in 81932298222f
Step 1 : ENV TRAIN_PARAMS "${TRAIN_PARAMS}"
 ---> Running in 78d17834bfca
Step 1 : ENV HUME_PARAMS_FILE /opt/hume/params
 ---> Running in bbe81ea5f763
Step 1 : RUN echo $TRAIN_PARAMS
 ---> Running in 1b172664fe74
{"init": "k-means++", "n_clusters": 4}
Step 1 : RUN echo $TRAIN_PARAMS > $HUME_PARAMS_FILE
 ---> Running in 9c18b87fcf63
Step 1 : ARG TARGET_LABEL=target
 ---> Running in 16dbc6a24e38
Step 1 : ENV HUME_TARGET_LABEL ${TA

Running `hume fit [..]` caused Hume to take our original `demo` image along with the parameters specified in `params.yml` and generate a new image tagged `demo:fitted`, which has the same environment as the original but now contains a fitted model. Once we have a `:fitted` version of an image, we can use it to generate predictions from samples.

We can score single observations:

In [1516]:
! echo "0, 1, 2, 3" | hume predict demo:fitted -

1.000000000000000000e+00



Or score in batches:

In [1517]:
%%bash 
hume predict demo:fitted - <<EOF 
0, 1, 2, 3
4, 5, 8, 9
EOF

1.000000000000000000e+00
3.000000000000000000e+00



### Inspecting a Model
Hume allows you to inspect models you've created after the fact with a few convenience commands.

#### Getting and Setting Parameters

Suppose we trained a model but we don't remember what parameters were used to fit it. To recall the fitting parameters, we can just call:

In [1518]:
! hume params demo:fitted 

{
    "init": "k-means++", 
    "n_clusters": 4
}


__Note__: In future versions Hume will allow you to get and set parameters in place via the command line.

### Creating Your Own Hume Containers 

In our first example, we glossed over the details of how to create a model that Hume can use. Now we'll walk through the steps needed to write a model that targets Hume.

Continuing with the iris dataset, let's now write an SVM model that Hume can build.

Hume exports a number of environment variables that we can access while within our model:
- `HUME_TRAIN_DATA`: the path to the training data.
- `HUME_PARAMS_FILE`: the path to the parameters file (stored as json).
- `HUME_TARGET_LABEL`: the name of the target variable.
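
One defensive way to read these variables is to fail early with a clear message when any of them is unset. The sketch below simulates the container environment with a plain dictionary and a temporary params file; `read_hume_env` is a hypothetical helper written for this walkthrough, not something Hume provides.

```python
import json
import os
import tempfile

def read_hume_env(environ=os.environ):
    """Return (data_path, params, target_label), failing early if a variable is unset."""
    names = ("HUME_TRAIN_DATA", "HUME_PARAMS_FILE", "HUME_TARGET_LABEL")
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    # The parameters file is stored as JSON
    with open(environ["HUME_PARAMS_FILE"]) as f:
        params = json.load(f)
    return environ["HUME_TRAIN_DATA"], params, environ["HUME_TARGET_LABEL"]

# Simulate the container environment with a temporary params file
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump({"probability": True, "tol": 0.001}, tmp)

fake_env = {
    "HUME_TRAIN_DATA": "/opt/hume/data",
    "HUME_PARAMS_FILE": tmp.name,
    "HUME_TARGET_LABEL": "Species",
}
data_path, params, target = read_hume_env(fake_env)
print(data_path, params, target)
os.remove(tmp.name)
```

Passing the environment in as an argument keeps the function easy to exercise outside a container.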


In [1683]:
%%writefile ex1/demo.py
import sys, os, json, pickle
import pandas as pd
from sklearn.svm import SVC

TRAINING_DATA = os.getenv("HUME_TRAIN_DATA")
TARGET_LABEL = os.getenv("HUME_TARGET_LABEL")
with open(os.getenv("HUME_PARAMS_FILE")) as f:
    SVM_PARAMS = json.load(f)

Overwriting ex1/demo.py


We'll use these to source the data and separate the features matrix from the target vector.

In [1684]:
def split_X_y(df):
    X, y = df.drop(TARGET_LABEL, axis=1), df[TARGET_LABEL]
    return X, y

The fit method is significant because it's the only time we get to save data that persists in the container before a fitted version is built. In general, you should only save data that you need for scoring, like a serialized fitted model object.

In [1685]:
def fit():
    svm = SVC(**SVM_PARAMS)
    df = pd.read_csv(TRAINING_DATA)
    # Since we're piping observations in long format, we'll need to pivot the data
    # before splitting. This, however, is not a strict Hume requirement -- you can do
    # whatever suits your use case so long as it reads a csv from stdin and writes
    # a csv to stdout.
    X, y = split_X_y(df.pivot("id","variable","value"))
    svm.fit(X,y)
    with open("/opt/model/fitted.pkl", "wb") as f:
        pickle.dump(svm, f)
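
The save-and-restore step can be exercised outside a container. In this sketch a plain dictionary stands in for the fitted model and a temporary directory stands in for `/opt/model`; `save_model` and `load_model` are illustrative names, not Hume functions.

```python
import os
import pickle
import tempfile

def save_model(model, path):
    # Pickle is a binary format, so the file is opened with "wb"
    with open(path, "wb") as f:
        pickle.dump(model, f)

def load_model(path):
    # Read back in binary mode to match how the file was written
    with open(path, "rb") as f:
        return pickle.load(f)

# A stand-in for a fitted estimator
stand_in = {"classes": ["setosa", "versicolor"], "tol": 0.001}

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "fitted.pkl")
    save_model(stand_in, path)
    restored = load_model(path)

print(restored == stand_in)
```

Whatever is written during fitting is all the scoring step will have to work with, so it is worth checking the round trip before building an image.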

We can source the data we've saved in our predict method.
Note that predict receives its data via stdin and reports its scores via stdout.

In [1686]:
def predict():
    with open("/opt/model/fitted.pkl", "rb") as f:
        svm = pickle.load(f)
    data = pd.read_csv(sys.stdin)
    preds = svm.predict(data.pivot("id","variable","value"))
    pd.Series(preds).to_csv(sys.stdout)
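
Because the scoring step only talks to stdin and stdout, it can be tested by passing in-memory streams instead. The sketch below is an assumption-laden variant of `predict`: `predict_stream` and `ConstantModel` are hypothetical names, and the stand-in model always answers `setosa` rather than loading a pickled estimator.

```python
import io
import pandas as pd

class ConstantModel:
    """Stand-in for a fitted estimator: predicts the same label for every row."""
    def predict(self, X):
        return ["setosa"] * len(X)

def predict_stream(model, instream, outstream):
    # Read long-format rows, pivot to one row per id, score, and write a csv
    data = pd.read_csv(instream)
    wide = data.pivot(index="id", columns="variable", values="value")
    preds = model.predict(wide)
    pd.Series(preds, index=wide.index).to_csv(outstream, header=False)

LONG_CSV = """id,variable,value
1,Sepal.Length,5.1
2,Sepal.Length,4.9
1,Sepal.Width,3.5
2,Sepal.Width,3.0
"""

out = io.StringIO()
predict_stream(ConstantModel(), io.StringIO(LONG_CSV), out)
print(out.getvalue())
```

Unlike the version above, this sketch labels each prediction with the observation's `id` rather than its position, which makes the output easier to join back to the input.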

Finally, we'll write a main method that dispatches calls to their respective methods.

In [1687]:
def main():
    args = sys.argv[1:]
    method = args[0]
    if method == "fit":
        fit()
    elif method == "predict":
        predict()
    else:
        raise NotImplementedError("Command {} is not implemented".format(method))
    

if __name__ == '__main__' and "get_ipython" not in dir():
    main()

Putting it all together, we have:

In [1690]:
%%writefile ex1/demo.py
import sys, os, json, pickle
import pandas as pd
from sklearn.svm import SVC

TRAINING_DATA = os.getenv("HUME_TRAIN_DATA")
TARGET_LABEL = os.getenv("HUME_TARGET_LABEL")
with open(os.getenv("HUME_PARAMS_FILE")) as f:
    SVM_PARAMS = json.load(f)
    
def split_X_y(df):
    X, y = df.drop(TARGET_LABEL, axis=1), df[TARGET_LABEL]
    return X, y

def fit():
    svm = SVC(**SVM_PARAMS)
    df = pd.read_csv(TRAINING_DATA)
    X, y = split_X_y(df.pivot("id","variable","value"))
    svm.fit(X,y)
    with open("/opt/model/fitted.pkl", "wb") as f:
        pickle.dump(svm, f)

def predict():
    with open("/opt/model/fitted.pkl", "rb") as f:
        svm = pickle.load(f)
    data = pd.read_csv(sys.stdin)
    preds = svm.predict(data.pivot("id","variable","value"))
    pd.Series(preds).to_csv(sys.stdout)

def main():
    args = sys.argv[1:]
    method = args[0]
    if method == "fit":
        fit()
    elif method == "predict":
        predict()
    else:
        raise NotImplementedError("Command {} is not implemented".format(method))
    

if __name__ == '__main__' and "get_ipython" not in dir():
    main()

Overwriting ex1/demo.py


We can also write out our parameters:

In [1691]:
%%writefile ex1/params.yml
probability: True
tol: 0.001

Overwriting ex1/params.yml


In [1633]:
!hume build svm ./ex1 > /dev/null 2>&1

In [1586]:
!cd ex1 && hume fit --target Species svm - < ../iris_long.csv

Creating build context for Docker image at /var/folders/6q/_wph_y897bgf0q5xss3jltpw0000gn/T/tmpXK37Xz
Sending build context to Docker daemon 19.46 kB
Step 1 : FROM svm
# Executing 13 build triggers...
Step 1 : RUN mkdir /opt/hume
 ---> Running in 70cd9b190fde
Step 1 : ARG TRAIN_WITH
 ---> Running in d2073598c88b
Step 1 : ENV TRAIN_WITH ${TRAIN_WITH}
 ---> Running in 60bc1aa3ce0c
Step 1 : ENV HUME_TRAIN_DATA /opt/hume/data
 ---> Running in 0f592e1da607
Step 1 : COPY ${TRAIN_WITH} $HUME_TRAIN_DATA
Step 1 : ARG TRAIN_PARAMS=
 ---> Running in 5e5108deadac
Step 1 : ENV TRAIN_PARAMS "${TRAIN_PARAMS}"
 ---> Running in 4d1822221cb7
Step 1 : ENV HUME_PARAMS_FILE /opt/hume/params
 ---> Running in 31929f2cc22c
Step 1 : RUN echo $TRAIN_PARAMS
 ---> Running in 7e4d70bc0efa
{"probability": true, "tol": 0.001}
Step 1 : RUN echo $TRAIN_PARAMS > $HUME_PARAMS_FILE
 ---> Running in b2ac94122608
Step 1 : ARG TARGET_LABEL=target
 ---> Running in ed7b6408f0d7
Step 1 : ENV HUME_TARGET_LABEL ${TARGET

In [1632]:
%%bash 
grep -v "Species" iris_long.csv | # Remove the target variable
awk -F ',' '($1 <= 3){print}'   | # Use only a subset of lines (where id<=3)
hume predict svm:fitted -

0,setosa
1,setosa
2,setosa



### Using Hume as a scikit-learn Estimator

Using the sklearn adapter, we can use Hume as if it were any scikit-learn estimator. Since it obeys the same interface, we can use Hume containers for grid search and cross-validation.

In [1335]:
from hume.ext.sklearn import SciKitHume

In [1347]:
skhume = SciKitHume("demo", params={"init": "k-means++", "n_clusters": 4})

In [1348]:
import pandas as pd
iris_data = pd.read_csv("./iris.csv", index_col=0)

In [1349]:
y = iris_data['Species']
X = iris_data.drop('Species', axis=1)

In [1350]:
skhume.fit(X, y)

In [1351]:
skhume.predict(X.drop("target", axis=1))

array([ 3.,  3.,  3.,  3.,  3.,  3.,  3.,  3.,  3.,  3.,  3.,  3.,  3.,
        3.,  3.,  3.,  3.,  3.,  3.,  3.,  3.,  3.,  3.,  3.,  3.,  3.,
        3.,  3.,  3.,  3.,  3.,  3.,  3.,  3.,  3.,  3.,  3.,  3.,  3.,
        3.,  3.,  3.,  3.,  3.,  3.,  3.,  3.,  3.,  3.,  3.,  2.,  2.,
        2.,  3.,  2.,  0.,  2.,  3.,  2.,  3.,  3.,  0.,  3.,  0.,  3.,
        2.,  0.,  0.,  0.,  3.,  0.,  0.,  0.,  0.,  0.,  2.,  2.,  2.,
        0.,  3.,  3.,  3.,  3.,  0.,  0.,  0.,  2.,  0.,  0.,  3.,  0.,
        0.,  3.,  3.,  0.,  0.,  0.,  0.,  3.,  0.,  2.,  0.,  2.,  2.,
        2.,  2.,  3.,  2.,  2.,  2.,  2.,  2.,  2.,  0.,  0.,  2.,  2.,
        2.,  2.,  0.,  2.,  0.,  2.,  2.,  2.,  2.,  0.,  0.,  2.,  2.,
        2.,  2.,  2.,  2.,  2.,  2.,  2.,  2.,  0.,  2.,  2.,  2.,  0.,
        2.,  2.,  2.,  2.,  2.,  2.,  0.])
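
To make "obeys the same interface" concrete: scikit-learn's model selection tools expect an estimator to expose `fit`, `predict`, `get_params` and `set_params`. The sketch below uses a hypothetical toy class, `ThresholdClassifier`, in place of `SciKitHume` (whose internals are not shown here) and a hand-rolled parameter sweep in place of scikit-learn's grid search, so it runs with the standard library alone.

```python
class ThresholdClassifier:
    """Hypothetical stand-in estimator: predicts 1 when x > threshold."""

    def __init__(self, threshold=0.0):
        self.threshold = threshold

    def get_params(self, deep=True):
        return {"threshold": self.threshold}

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y):
        # Nothing to learn in this toy example
        return self

    def predict(self, X):
        return [1 if x > self.threshold else 0 for x in X]

def accuracy(estimator, X, y):
    preds = estimator.predict(X)
    return sum(p == t for p, t in zip(preds, y)) / float(len(y))

X = [0.5, 1.5, 2.5, 3.5]
y = [0, 0, 1, 1]

# Hand-rolled sweep: rebuild the estimator from its parameters, then score it
base = ThresholdClassifier()
scores = {}
for threshold in (1.0, 2.0, 3.0):
    candidate = ThresholdClassifier(**base.get_params()).set_params(threshold=threshold)
    scores[threshold] = accuracy(candidate.fit(X, y), X, y)

best = max(scores, key=scores.get)
print(scores, best)
```

Any object with these four methods can be rebuilt from its parameters and re-fitted, which is what lets a container-backed estimator take part in a search over parameter settings.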