# nbjob - Using Jupyter notebooks for machine learning research
*Implements experiment management and lightweight version control.*

`nbjob` is a Python library that enhances Jupyter notebooks.

Features:
* Dispatch jobs from Jupyter Notebooks to an ipyparallel cluster
* Designed for long-running jobs, such as training a neural network for more than a day
* Store intermediate and final job results
* Keep reproducible source code backups for all jobs
* Run analyses on job results
* Easily run a series of related jobs with different hyperparameters

## Background

### Problem
The Jupyter notebook is a powerful tool for interacting with data. It lets you visualize what is happening while developing algorithms that process the data.

At the same time, the Jupyter notebook is a way of sharing a computational narrative: that is, slowly building up pieces of an algorithm and explaining how they work on data.

However, these two views skip the entire process in between. Here Jupyter is limited in two key ways: it doesn't provide a good architecture for farming out parallel jobs when pickle fails, and it doesn't have version control that's well-adapted to the domain of algorithms research.

#### Job architecture

One key feature of the notebook is that it encourages keeping a lot of information stored as data structures in memory, as opposed to stored as files on disk. So you end up having data structures that aren't serialized anywhere, and that maybe can't be regenerated because the code cells leading to them have been deleted. You also have functions that exist only in the memory of the notebook, and maybe as source code somewhere inside a cell.

This makes it hard to transmit these data structures to other machines as jobs. Objects in memory which are executable (like functions) or rely on interpreter state (like generators) are not guaranteed to be portable, and in many cases can't be sent across machines. Details of the Python pickle implementation impose further limitations.
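
As a small illustration of these limits, the sketch below tries to pickle three kinds of objects using only the standard library. The helper `try_pickle` and the generator `count_up` are names made up for this example; they are not part of `nbjob`.

```python
import pickle

def count_up(limit):
    """A generator: its progress lives in interpreter state, not in plain data."""
    for i in range(limit):
        yield i

def try_pickle(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, TypeError, AttributeError):
        return False

gen = count_up(3)
next(gen)  # advance the generator so it carries some state

plain_data_ok = try_pickle({"x": 1.0, "y": 2.0})  # plain data structures pickle fine
generator_ok = try_pickle(gen)                    # generators cannot be pickled
lambda_ok = try_pickle(lambda v: v + 1)           # functions are pickled by name, and a lambda has none to look up

print(plain_data_ok, generator_ok, lambda_ok)
```

Named top-level functions do pickle, but only as a reference to their module and name, so the receiving machine still needs the same source code available.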

#### Version control

A notebook in the process of being written may be filled with one-time code, written to visualize a particular piece of data, that will never be needed again. It may also be highly nonlinear: executing the notebook from the top down, starting with a clean slate, may not even work. Sometimes code may be changed in place, obscuring old versions.

Jupyter provides checkpointing as a backup against crashes, but this only covers one use of version control. If a piece of code is deleted, and a researcher wants to remember what it did months down the line, this is not well-supported by a simple versioning setup.

Similarly, research often involves considering many different code variations in parallel. Whereas for well-understood domains software engineers can have a linear revision history that slowly adds functionality, it is not unnatural in research to have many branching paths. Most of these will be swiftly abandoned, but some need to be dug up at a much later date.

### Solution

These are symptoms of the same problem: storing code and data in memory instead of on disk. Even if the full history leading to the current state is theoretically available (e.g. in IPython history logs), it is not easy to use.

A common solution to this problem is well known: make sure that jobs are submitted in reproducible source-code form, and store the code of any job ever submitted. This solves both portability across machines and versioning of any code that may be of interest. Frameworks that do this typically rely on a combination of existing version control systems, shell scripts, and archive files.
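
To make the "store the code of any job ever submitted" idea concrete, here is a minimal sketch that archives job source under a name derived from its content hash. The function `archive_source` and the directory layout are assumptions made for illustration; this is not how `nbjob` stores its backups.

```python
import hashlib
import tempfile
from pathlib import Path

def archive_source(source: str, archive_dir: Path) -> Path:
    """Write source to archive_dir under a name derived from its SHA-256 hash."""
    digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
    path = archive_dir / f"{digest}.py"
    if not path.exists():  # identical code is stored only once
        path.write_text(source, encoding="utf-8")
    return path

# A throwaway directory stands in for a permanent archive location.
archive_root = Path(tempfile.mkdtemp())

job_v1 = "learning_rate = 0.1\n"
job_v2 = "learning_rate = 0.01\n"

path_v1 = archive_source(job_v1, archive_root)
path_v2 = archive_source(job_v2, archive_root)
path_v1_again = archive_source(job_v1, archive_root)  # resubmitting reuses the same file
```

Because the file name depends only on the content, every variation of a job keeps its own copy, and abandoned branches remain retrievable without any explicit commit step.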

What makes it hard to adapt the notebook to this kind of workflow out-of-the-box is that a notebook might contain a mix of algorithmic and exploratory code. The algorithmic code is the key to running reproducible jobs, but the exploratory code is there to help the author make sense of the results and find bugs. It is usually this exploratory code that is unreproducible.

The key observation is: the author knows which code is exploratory, and which code isn't. The algorithmic code is also typically fairly concise, and more reproducible than the exploratory sections.

So this is the general flavor of the proposal: the author annotates algorithmic code, runs the code in jobs (possibly on remote clusters), and this framework keeps track of versioning the results.
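
The annotation step can be pictured with a rough sketch like the one below. `TinyCollector` is a hypothetical stand-in written for this explanation, working on source strings; it is not the implementation of `nbjob`'s collector.

```python
class TinyCollector:
    """Hypothetical stand-in for a snippet collector; not nbjob's implementation."""

    def __init__(self):
        self.sources = []    # source text of every annotated snippet, in order
        self.namespace = {}  # local namespace in which snippets are executed

    def snip(self, source):
        """Run the snippet locally first; record it only if it executes cleanly."""
        exec(source, self.namespace)
        self.sources.append(source)

    def script(self):
        """Return all recorded snippets as one reproducible script."""
        return "\n".join(self.sources)

collector = TinyCollector()
collector.snip("def square(v):\n    return v * v\n")
collector.snip("four = square(2)\n")

try:
    collector.snip("broken = undefined_name + 1\n")
except NameError:
    pass  # a snippet that fails locally is never recorded

# Exploratory code is simply never handed to the collector,
# so it cannot leak into the script that a job would run.
print(collector.script())
```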

### Other things along the way

Other features we pick up along the way:
* Having running jobs report intermediate and final results, and storing them
* Running analyses on the results
* Integration with ipyparallel clusters

## About this framework

This framework, called `nbjob`, is still very much a work in progress. It is in flux as I adapt it to my personal research needs.

Comments welcome.

## Mini-Tutorial

### Introducing Snippets

We have `SnippetCollector` objects, which you can use to mark cells as non-experimental.

In [3]:
from nbjob import SnippetCollector
sc = SnippetCollector()

importing IPython notebook from nbjob.ipynb
importing IPython notebook from Structs.ipynb
importing IPython notebook from Parallel Theano.ipynb


You can collect functions and classes by attaching decorators to them.

In [4]:
@sc
def norm(val):
    import math
    return math.sqrt(val.x**2 + val.y**2)

@sc
class Point(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y

You can also collect pieces of code. These are run locally as well, to verify that the code is correct.

In [5]:
%%snip sc
origin = Point(0., 0.)

Finally, you can spin up jobs relating to these snippets.

This first requires connecting to both the database (which stores job results) and an IPython cluster.

In [20]:
from nbjob import DBWrapper
db_wrapper = DBWrapper()

import ipyparallel as ipp
rc = ipp.Client()

worker_view = rc[0] # Pointer to worker 0 in the cluster

Then we can proceed to running jobs on a particular worker in the cluster.

In [None]:
job = db_wrapper.create_job(
    worker_view, ['norm(origin)'],
    sc
    )
job.result

### Running jobs from multiple snippets

We support running multiple snippets in series.

In [8]:
sc2 = SnippetCollector()

In [9]:
@sc2
def distance(p1, p2):
    diff = Point(p1.x - p2.x, p1.y - p2.y)
    return norm(diff)

In [10]:
%%snip sc2
a = Point(1., 2.)
b = Point(3., 0.)

Now we can start a job to run the code. This is designed to be interactive, so you will get a widget to review the job before it is submitted.

In [None]:
job = db_wrapper.create_job(
    worker_view, ['distance(a, b)'],
    sc, sc2 # Note that we use both here!
    )
job.result
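
Conceptually, running snippets in series amounts to executing each collection's source in order in a fresh namespace and then evaluating the job expressions. The sketch below shows that idea on the same `Point`, `norm`, and `distance` code; `run_job` is a hypothetical helper and says nothing about how `create_job` actually ships code to a worker.

```python
def run_job(expressions, *snippet_sources):
    """Execute snippet sources in order in a fresh namespace, then evaluate expressions."""
    namespace = {}
    for source in snippet_sources:
        exec(source, namespace)
    return [eval(expression, namespace) for expression in expressions]

# Source text standing in for what the first collector gathered.
first_source = '''
import math

def norm(val):
    return math.sqrt(val.x**2 + val.y**2)

class Point(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y
'''

# Source text standing in for the second collector; it depends on the first.
second_source = '''
def distance(p1, p2):
    diff = Point(p1.x - p2.x, p1.y - p2.y)
    return norm(diff)

a = Point(1., 2.)
b = Point(3., 0.)
'''

results = run_job(['distance(a, b)'], first_source, second_source)
print(results[0])
```

Order matters here: the second source refers to `Point` while it runs, so passing it without the first would fail with a `NameError`.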

### Specifying parameters

We also have support for parameters.

In [12]:
from nbjob import ParamLogger
params = ParamLogger()
params.x = 5
params.y = 5

In [13]:
%%snip sc2
c = Point(params.x, params.y)

When you run the code below, you will be prompted to adjust the parameters for that particular run, if desired.

In [None]:
job = db_wrapper.create_job(
    worker_view, ['distance(a, b)'],
    sc, sc2,
    params=params # Here we introduce the parameters
    )
job.result
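
One way to think about per-run parameter adjustment is as a copy of the defaults with a few values overridden, so the defaults stay intact for the next run. The class below is a hypothetical illustration of that idea; it is not `ParamLogger`, and the method names are invented for this sketch.

```python
import copy

class TinyParams:
    """Hypothetical parameter holder; not nbjob's ParamLogger."""

    def __init__(self, **defaults):
        self.__dict__.update(defaults)

    def with_overrides(self, **overrides):
        """Return a copy with some values replaced, leaving the defaults untouched."""
        unknown = set(overrides) - set(self.__dict__)
        if unknown:
            raise KeyError(f"unknown parameters: {sorted(unknown)}")
        adjusted = copy.deepcopy(self)
        adjusted.__dict__.update(overrides)
        return adjusted

    def as_dict(self):
        """Snapshot of the values, suitable for storing next to a job's results."""
        return dict(self.__dict__)

defaults = TinyParams(x=5, y=5)
run_params = defaults.with_overrides(y=7)  # adjusted for one particular run

print(defaults.as_dict(), run_params.as_dict())
```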

### Viewing the jobs dashboard

In [15]:
# You need to download and run the notebook to see the dashboard.
import nbjob
nbjob.make_default_dashboard()

### Checkpointing and running analyses

This is particular to machine learning, where we have intermediate values of model parameters and we may want to evaluate something over time.

Let's imagine the following dummy setup, where we're training a single parameter value to approximate the number `5.0`. However, as a parallel to machine learning, we don't do it in one step. Instead, we have a procedure where the number iteratively approaches its final value.

In [16]:
sc3 = SnippetCollector()

In [18]:
%%snip sc3
iteration = 0
param_values = [0]
jobtracker.register_checkpointer('iteration') # Save this for intermediate results
jobtracker.register_checkpointer('param_values') # Save this for intermediate results

def train():
    global iteration, param_values
    while abs(param_values[0] - 5.0) > 1e-5:
        iteration += 1
        param_values[0] += 0.5 * (5.0 - param_values[0]) # Iteratively approach 5
        jobtracker.checkpoint() # Save the values of iteration and param_values
        
# The jobtracker variable is a global variable available in all workers, that provides an interface
# for communicating back with the database

In [None]:
job = db_wrapper.create_job(
    worker_view, ['train()'],
    sc3,
    )
job.result

Here we have been checkpointing at every training iteration, and we have made it so that the iteration and parameter values are logged at each checkpoint. We can then analyze the incremental results from running our job.

In [None]:
def test_performance():
    global iteration, param_values
    return {'iteration':iteration,
           'error': abs(param_values[0] - 5.0)}

analyzer = db_wrapper.create_analyzer(
    worker_view,
    job,
    [test_performance])
analyzer.result

The analyzer gives us a list of dictionaries, each of which has `iteration` and `error` as keys. We can use them to make a plot of the error over time.

The analysis gets a snapshot of the system with the checkpointed variables correctly restored. This means that if the parameters of a neural network are checkpointed, the analysis can run the network on validation or testing data.
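
The checkpoint-then-analyze flow can be sketched in memory as follows. `TinyTracker` and `analyze` are hypothetical stand-ins that mirror the names used in the cells of this section; the real `jobtracker` talks to a database, which this sketch does not attempt to model.

```python
import copy

class TinyTracker:
    """Hypothetical stand-in for the jobtracker; keeps checkpoints in memory."""

    def __init__(self, namespace):
        self.namespace = namespace  # dict holding the job's global variables
        self.names = []             # variable names to save at each checkpoint
        self.checkpoints = []       # one dict of saved values per checkpoint

    def register_checkpointer(self, name):
        self.names.append(name)

    def checkpoint(self):
        # Deep-copy so later in-place updates do not alter earlier checkpoints.
        self.checkpoints.append(
            {name: copy.deepcopy(self.namespace[name]) for name in self.names})

def analyze(tracker, analysis):
    """Call analysis once per checkpoint, passing a copy of the saved variables."""
    return [analysis(dict(saved)) for saved in tracker.checkpoints]

state = {'iteration': 0, 'param_values': [0.0]}
tracker = TinyTracker(state)
tracker.register_checkpointer('iteration')
tracker.register_checkpointer('param_values')

# Same dummy training loop: halve the distance to 5.0 at every step.
while abs(state['param_values'][0] - 5.0) > 1e-5:
    state['iteration'] += 1
    state['param_values'][0] += 0.5 * (5.0 - state['param_values'][0])
    tracker.checkpoint()

def measure_error(restored):
    return {'iteration': restored['iteration'],
            'error': abs(restored['param_values'][0] - 5.0)}

errors = analyze(tracker, measure_error)
```

The deep copy is the important design choice: `param_values` is mutated in place during training, so saving a plain reference would make every checkpoint show the final value.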

## Misc [not really relevant]

In [2]:
# Cross-notebook include shim
with open("/home/nikita/notebooks/nbinclude.ipynb") as nbinclude_f: # don't rename nbinclude_f
    import nbformat
    get_ipython().run_cell(nbformat.read(nbinclude_f, nbformat.NO_CONVERT).cells[0].source)