<h1> Datadriver for Data Scientists </h1>

_Execute the following cell in order to make the table of contents appear_

In [None]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')

In this notebook, we will go over the main concepts of the datadriver API, which will enable you to push your exploration code to production faster than ever.

<h2 id="tocheading">Table of Contents</h2>
<div id="toc"></div>

# Context

First, you need to create a context. The context is the object through which you communicate with your environment during your exploration, so it must be able to reach your database. To that end, create a DB object and pass it to the context constructor:

In [None]:
from dd import DB
from dd.api.contexts import LocalContext

db = DB(dbtype='sqlite', filename=':memory:')
context = LocalContext(db)
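
For intuition about what `filename=':memory:'` implies, the sketch below uses Python's built-in `sqlite3` module directly rather than the dd API. How `DB` connects internally is not shown in this notebook, so treat this as an illustration under that assumption. An in-memory SQLite database lives only as long as its connection, which makes it convenient for tutorials and tests.

```python
import sqlite3

# ':memory:' asks SQLite for a temporary database held in RAM.
connection = sqlite3.connect(":memory:")

# Hypothetical table, only used to show that the database is usable.
connection.execute("CREATE TABLE demo (id INTEGER, label TEXT)")
connection.execute("INSERT INTO demo VALUES (1, 'hello')")
rows = connection.execute("SELECT id, label FROM demo").fetchall()
print(rows)

# Closing the connection discards the in-memory database.
connection.close()
```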

We will add some more options in order to make sure this tutorial executes properly. You don't need to understand this line right now, as we will cover it in a later tutorial.

In [None]:
context.set_default_write_options(if_exists="replace", index=False)

There you go. Your context is now set up. Time to load some data and start playing!

# Import data

The dd library ships with some packaged data, which can be accessed through pkg_resources (part of setuptools, not the standard library). In this tutorial, we instead point to a publicly hosted copy of the Titanic dataset:

In [None]:
titanic_datapath = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/titanic.csv'

Now, let's use the context to load the file.

In [None]:
train = context.load_file(titanic_datapath,
                          table_name="titanic.train")

The returned object is called a dataset. It is the main abstraction you will use to transform and save your data.

In [None]:
type(train)
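
Conceptually, loading a file into a table amounts to parsing the CSV and writing its rows to the database. The sketch below illustrates this with the standard `csv` and `sqlite3` modules on a tiny inline sample. It is not how `load_file` is implemented, and the table name `titanic_train` and the sample rows are made up for the example.

```python
import csv
import io
import sqlite3

# Tiny inline sample standing in for the real CSV file (made-up rows).
sample_csv = "passengerid,survived,age\n1,0,22\n2,1,38\n"

reader = csv.DictReader(io.StringIO(sample_csv))
records = [(int(r["passengerid"]), int(r["survived"]), float(r["age"])) for r in reader]

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE titanic_train (passengerid INTEGER, survived INTEGER, age REAL)"
)
connection.executemany("INSERT INTO titanic_train VALUES (?, ?, ?)", records)

row_count = connection.execute("SELECT COUNT(*) FROM titanic_train").fetchone()[0]
print(row_count)
connection.close()
```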

# Datasets

You may consider datasets as wrappers around Pandas DataFrames. They give you access to some methods you may recognise if you are familiar with this awesome library.

In [None]:
train.shape

In [None]:
train.columns

In [None]:
train.head()

However, datasets are NOT dataframes:

In [None]:
#train.loc[:10, u'survived']

But you can access the underlying dataframe by calling _collect()_ on the dataset:

In [None]:
dataframe = train.collect()
dataframe.loc[:10, u'survived']

# Transformations

Transformations are the easiest way to transform your data. And the easiest way to create a transformation is by using the transform method.

## Transform()

By calling _transform()_ on a dataset, you apply a function to it and get back a new object that wraps the new data.

For example, let's say you need to replace all missing values in the previous dataset. You may define the following function:

In [None]:
def fillna_with_zeros(dataframe):
    """
    Returns a copy of the dataframe with null values replaced by zeros.
    """
    return dataframe.fillna(0)

As you may have noticed, your function takes a dataframe as input, and returns a new dataframe. This is very important, because it gives you access to the full Pandas DataFrame power. It also forces you to keep your data in DataFrames.

To apply this function to your dataset, you would then write:

In [None]:
filled_with_zeros = train.transform(fillna_with_zeros)

Easy peasy! Let's look at the new data:

In [None]:
filled_with_zeros.head()

Great! The cabin column (like all the others) is now non-null.

If you want your transformation function to be more generic and accept extra parameters, you may proceed like this:

In [None]:
def fillna(dataframe, value):
    """
    Returns a copy of the dataframe with null values replaced by {value}.
    """
    return dataframe.fillna(value=value)

In [None]:
filled_with_ones = train.transform(fillna, value=1)

In [None]:
filled_with_ones.head()
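
One way to picture how extra keyword arguments reach your function is sketched below. `SketchDataset` is a hypothetical stand-in, not the dd class, and it wraps a plain list of dicts instead of a DataFrame so that the example runs without dependencies. The point is only that `transform` forwards `**kwargs` to the function and wraps the result in a new object, leaving the original untouched.

```python
class SketchDataset:
    """Hypothetical stand-in for a dataset; wraps a list of row dicts."""

    def __init__(self, rows):
        self._rows = rows

    def transform(self, function, **kwargs):
        # Forward extra keyword arguments to the user function and
        # wrap whatever it returns in a new SketchDataset.
        return SketchDataset(function(self._rows, **kwargs))

    def collect(self):
        return self._rows


def fill_missing(rows, value):
    """Returns new rows with None entries replaced by value."""
    return [{k: (value if v is None else v) for k, v in row.items()} for row in rows]


raw = SketchDataset([{"age": 22, "cabin": None}, {"age": None, "cabin": "C85"}])
filled = raw.transform(fill_missing, value=1)

print(filled.collect())
print(raw.collect())  # the original rows are unchanged
```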

## MultiTransform()

With multitransform, a Python function can output multiple datasets.
In this case, output_tables must be specified as a list of strings, one name per output.

In [None]:
def split_survived(dataframe):
    return dataframe[dataframe.survived==0], dataframe[dataframe.survived==1]

In [None]:
surv0, surv1 = filled_with_ones.multitransform(split_survived, output_tables=["surv0", "surv1"])

In [None]:
surv0.head()

In [None]:
surv1.head()
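
The pairing between the function's outputs and `output_tables` can be sketched as follows. `multitransform_sketch` is a hypothetical helper written for this illustration, with plain lists of dicts standing in for DataFrames. The check that both lengths match is an assumption about sensible behaviour, not a documented feature of dd.

```python
def multitransform_sketch(rows, function, output_tables):
    """Applies function and pairs each output with a table name."""
    outputs = function(rows)
    if len(outputs) != len(output_tables):
        raise ValueError(
            "function returned %d outputs for %d table names"
            % (len(outputs), len(output_tables))
        )
    return dict(zip(output_tables, outputs))


def split_by_survival(rows):
    died = [r for r in rows if r["survived"] == 0]
    lived = [r for r in rows if r["survived"] == 1]
    return died, lived


passengers = [
    {"name": "A", "survived": 0},
    {"name": "B", "survived": 1},
    {"name": "C", "survived": 1},
]
tables = multitransform_sketch(passengers, split_by_survival, ["surv0", "surv1"])
print(len(tables["surv0"]), len(tables["surv1"]))
```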

## More transformations

### select_columns
allows you to restrict a dataframe to a subset of its columns:

In [None]:
some_columns = filled_with_zeros.select_columns(["passengerid", "survived", "pclass", "age", "sibsp", "parch", "fare"],
                                                write_options=dict(if_exists="replace", index=False))

In [None]:
some_columns.head()

You may do the same thing, with less control over the options passed to the method, using the [bracket notation]:

In [None]:
some_other_columns = filled_with_zeros[["passengerid", "name", "sex", "ticket"]]
some_other_columns.head()
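
In Python, bracket notation is provided by the `__getitem__` special method. A plausible way for a dataset to support it is to delegate to `select_columns` with default options, as sketched here with a hypothetical `ColumnsSketch` class. The actual dd implementation may differ.

```python
class ColumnsSketch:
    """Hypothetical dataset stand-in holding a list of row dicts."""

    def __init__(self, rows):
        self.rows = rows

    def select_columns(self, columns, write_options=None):
        # write_options is accepted only to mirror the call shown above;
        # this sketch does not write anything.
        return ColumnsSketch([{c: row[c] for c in columns} for row in self.rows])

    def __getitem__(self, columns):
        # dataset[[...]] becomes select_columns([...]) with default options.
        return self.select_columns(columns)


data = ColumnsSketch([{"passengerid": 1, "name": "Braund", "fare": 7.25}])
subset = data[["passengerid", "name"]]
print(subset.rows)
```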

### join
joins two datasets:

In [None]:
some_columns.join(some_other_columns).head()

### split\_train\_test
splits a dataset into TWO new disjoint datasets:

In [None]:
Xtrain, Xtest = some_columns.split_train_test(train_size=0.75)

In [None]:
Xtrain.head()

In [None]:
Xtest.head()
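
A disjoint split can be pictured as shuffling the row positions once and cutting the shuffled list at `train_size`. The function below is a sketch of that idea in plain Python. The name `split_rows` and the `seed` parameter are assumptions for the example, and dd may well rely on a different mechanism.

```python
import random


def split_rows(rows, train_size=0.75, seed=0):
    """Splits rows into two disjoint lists (train, test)."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    cut = int(len(rows) * train_size)
    train_part = [rows[i] for i in indices[:cut]]
    test_part = [rows[i] for i in indices[cut:]]
    return train_part, test_part


rows = list(range(8))
train_part, test_part = split_rows(rows, train_size=0.75)

print(len(train_part), len(test_part))
# No row appears on both sides, and none is lost.
print(set(train_part).isdisjoint(test_part))
```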

## Lazy operations

Note that all the operations described above are __lazy__, which means they are not executed until explicitly required. The concepts of _actions_ and _transformations_ are thus similar to those of Spark: _transformations_ are lazy, while _actions_ trigger the execution.

In [None]:
def dummify(dataframe):
    """
    Returns a one-hot encoded version of a dataframe
    """
    import pandas as pd
    return pd.get_dummies(dataframe)

In [None]:
dummified = filled_with_zeros[["sex"]].transform(dummify)

In [None]:
dummified.memory_usage

In [None]:
dummified.head()

In [None]:
dummified.memory_usage

As you can see, the memory taken by the dataset before the execution (launched by the action _head()_) is much smaller than the memory taken by the dataset after the execution. This is because nothing is computed before the execution, AND because the result is cached in the object after the data has been computed.
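
The behaviour described above (nothing is computed until an action runs, and the result is then cached) can be sketched in a few lines. `LazySketch` is a hypothetical class, not part of dd. It counts how many times the computation runs so that laziness and caching are observable.

```python
class LazySketch:
    """Hypothetical lazy node: stores a recipe, computes on demand, caches."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument function producing the data
        self._cache = None
        self.is_cached = False
        self.runs = 0

    def transform(self, function):
        # Lazy: only records what to do; nothing is executed here.
        return LazySketch(lambda: function(self.collect()))

    def collect(self):
        # Action: triggers the computation the first time, then reuses it.
        if not self.is_cached:
            self._cache = self._compute()
            self.is_cached = True
            self.runs += 1
        return self._cache


source = LazySketch(lambda: [1, 2, 3])
doubled = source.transform(lambda values: [2 * v for v in values])

print(doubled.is_cached)  # nothing has been computed yet
print(doubled.collect())
doubled.collect()  # second call is served from the cache
print(doubled.runs)
```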

# Models

Models are objects that need to be trained before they can be applied to a new set of data. You may create a Model through the context from any object that implements a fit method and either a predict or a transform method (as all scikit-learn models do). Here is how to proceed:

In [None]:
# Importing scikit-learn model class
from sklearn.ensemble import RandomForestClassifier

In [None]:
# Instantiating Scikit model
scikit_model = RandomForestClassifier(max_depth=4, n_jobs=-1) 

# Creating Datadriver Model
model = context.model(scikit_model, model_address="model@foo.bar")

The model_address keyword is used to store the model in the database so that it can be retrieved later. The correct address format is {identifier}@{schema_name}.{table_name}
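
To make the address format concrete, here is a small parser for `{identifier}@{schema_name}.{table_name}`. `parse_model_address` is a hypothetical helper written for this tutorial, not a dd function, and the assumption that each part consists of word characters only is made purely for the example.

```python
import re

# Assumed pattern: identifier@schema.table, each part made of word characters.
ADDRESS_PATTERN = re.compile(r"^(\w+)@(\w+)\.(\w+)$")


def parse_model_address(address):
    """Splits an address into (identifier, schema_name, table_name)."""
    match = ADDRESS_PATTERN.match(address)
    if match is None:
        raise ValueError(
            "expected {identifier}@{schema_name}.{table_name}, got %r" % address
        )
    return match.groups()


print(parse_model_address("model@foo.bar"))
```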

Now that we have a model, we will want to train it on our carefully crafted training dataset:

In [None]:
fitted_model = model.fit(Xtrain, target="survived")

With this fitted model (or soon to be fitted: remember, all of this is lazy!), we are able to make predictions on our test dataset:

In [None]:
predictions = fitted_model.predict(Xtest, target="survived")

In both cases, notice that we included the target in the input dataset. It is used in the fit method to train the model, and the column is dropped in the predict method.

In [None]:
predictions.head()
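
The way the target column is used by `fit` and ignored by `predict` can be illustrated with a deliberately trivial model. `MajorityClassSketch` is a hypothetical class working on lists of dicts. It always predicts the most frequent training label and is only meant to show the handling of the `target` column, not how dd wraps scikit-learn estimators.

```python
from collections import Counter


class MajorityClassSketch:
    """Toy model: always predicts the most frequent training label."""

    def fit(self, rows, target):
        # The target column is read from the input rows during training.
        labels = [row[target] for row in rows]
        self.majority_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, rows, target):
        # The target column is dropped before predicting, if present.
        features = [{k: v for k, v in row.items() if k != target} for row in rows]
        return [self.majority_ for _ in features]


train_rows = [
    {"age": 22, "survived": 0},
    {"age": 38, "survived": 1},
    {"age": 26, "survived": 0},
]
test_rows = [{"age": 35, "survived": 1}, {"age": 54, "survived": 0}]

toy_model = MajorityClassSketch().fit(train_rows, target="survived")
toy_predictions = toy_model.predict(test_rows, target="survived")
print(toy_predictions)
```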