Scalable Machine Learning in Python 
===================
with Scikit-Learn and Dask 
===============
**May 2017**

Ian Stokes-Rees [@ijstokes](http://twitter.com/ijstokes) 
[http://bit.ly/scaleml-dask-wkshp](http://bit.ly/scaleml-dask-wkshp)


<a href=http://dask.pydata.org ><img src=https://www.continuum.io/sites/default/files/dask_stacked.png
 width=200 />
</a>

## Description

This hands-on 3-hour workshop will give participants an opportunity to explore [Dask](http://dask.pydata.org), a parallel computing framework for Python.  We will start with an overview of Dask and the problem it was designed to address, and then work through three exercises that demonstrate the Dask parallel wrappers for [Pandas](http://pandas.pydata.org), [NumPy](http://www.numpy.org), and [Scikit-Learn](http://www.scikit-learn.org).

<table>
<tr><td>
<a href=http://dask.pydata.org ><img src=https://www.continuum.io/sites/default/files/dask_stacked.png
 width=200 />
</a>

</td>
<td>
<a href=http://scikit-learn.org/ ><img src=http://scikit-learn.org/stable/_images/scikit-learn-logo-notext.png
 width=200 />
</a>
</td>
<td>
<a href=http://pandas.pydata.org ><img src=http://people.math.sc.edu/etpalmer/Images/pandas_logo.png
 width=200 />
</a>
<br/>
<a href=http://www.numpy.org ><img src=https://valohai.com/static/img/support-logos/numpy-simple.svg
 width=200 />
</a>
</td></tr>
</table>

Presenter
--------

<table>

<tr><td>
<font size=+2><b>Ian Stokes-Rees</b> [@ijstokes](http://twitter.com/ijstokes)
<br/>[ijstokes@continuum.io](mailto:ijstokes@continuum.io)
<br/>
[http://about.me/ijstokes](http://about.me/ijstokes)
<br/>
[http://linkedin.com/in/ijstokes](http://linkedin.com/in/ijstokes)
<br/></font>
</td>
<td>
<a href=https://continuum.io ><img src=http://ijstokes-public.s3.amazonaws.com/dspyr/img/AnacondaCIO_Logo width=400 />
</a>
</td></tr>
</table>

Acknowledgements
---------------
Adapted from material created by:
* [Matthew Rocklin](https://github.com/mrocklin)
* [Ben Zaitlen](https://github.com/quasiben)
* [Min Ragan-Kelley](https://github.com/minrk)
* [Olivier Grisel](https://github.com/ogrisel)

In particular:
* [PyCon 2017 Parallel Data Analysis Tutorial](https://us.pycon.org/2017/schedule/presentation/189/)
* [Dask Tutorial](https://github.com/dask/dask-tutorial)
* [Dask Talk](http://matthewrocklin.com/slides/dask-short#)

Assets and Reference
-------------------
This presentation:
* Anaconda Cloud: https://anaconda.org/ijstokes/scaleml-dask-wkshp
* GitHub: https://github.com/ijstokes/scaleml-dask-wkshp

The material is based on the BSD-3 open source Dask project, which is included in the Anaconda Distribution:
* Docs: http://dask.pydata.org/
* GitHub: https://github.com/dask/dask
* Support: http://dask.pydata.org/en/latest/support.html

Setup
-----
* [Download Anaconda 4.3 for Python 3.6](http://continuum.io/downloads)
* Clone or download the GitHub repo for the workshop:
```bash
# HTTPS works without GitHub SSH keys configured
git clone https://github.com/ijstokes/scaleml-dask-wkshp.git
```

* Create a conda environment for the workshop:
```bash
conda env create ijstokes/daskwkshp
source activate daskwkshp # macOS and *nix
activate daskwkshp # Windows
```

* If that doesn't work, create the environment by hand and then activate it as shown above:
```bash
conda create -n daskwkshp dask scikit-learn \
    jupyter notebook=5 python-graphviz pandas \
    python=3.6
```

## Before we start

We need to get some data to work with.
We are going to generate some [fake stock data](https://github.com/mrocklin/fakestockdata) by adding a bunch of points between real stock data points. This will take a few minutes the first time we run it.

In [None]:
# or do this from the command line
%run ./prep.py

## Introductions
<table>

<tr><td>
At Continuum we say 
<br/>
<font size=+2><b>*"Programming Python with Anaconda
<br/>is more fun with a friend"*</b></font>
</td>
<td>
<a href=https://continuum.io ><img src=http://ijstokes-public.s3.amazonaws.com/dspyr/img/AnacondaCIO_Logo width=400 />
</a>
</td></tr>
</table>

### Introduce yourself to the people on either side of you

There is only one of me, so you're going to need to rely on each other for help during exercises!

# Exercise 1.1: Setup and Basic Dask Operations
Take 20 minutes to get set up, then run through these basic Dask operations to see how Dask provides data structures similar to a `numpy.ndarray` or a `pandas.DataFrame`.

In [None]:
import numpy as np

In [None]:
a = np.random.randint(size=(10,10), low=1, high=10)

In [None]:
a

In [None]:
type(a)

In [None]:
import dask.array as da
a = da.random.randint(size=(60,60), low=1, high=10, chunks=(20,20))

In [None]:
a

In [None]:
type(a)

Dask uses ***lazy evaluation***: indexing returns a reference to a delayed operation that has not yet been invoked.

In [None]:
a[3,10]

`.compute()` is required to actually get back the values

In [None]:
a[3,10].compute()
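
The same lazy-then-compute pattern in one self-contained sketch. It uses an array of ones rather than random integers (an assumption made here so the result is predictable); the names `x`, `lazy`, and `value` are illustrative only.

```python
import dask.array as da

# A 60x60 array of ones split into nine 20x20 chunks (illustrative data)
x = da.ones((60, 60), chunks=(20, 20))

# Indexing only extends the task graph; no element has been computed yet
lazy = x[3, 10]
print(type(lazy))

# .compute() executes the graph and hands back a concrete value
value = lazy.compute()
print(value)
```

The first `print` shows a Dask array type, while the second shows an actual number.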

## Same story for vectors

In [None]:
a[3,15:25]

In [None]:
a[3,15:25].compute()

In [None]:
# or regions/matrix
a[3:5, 15:25]

In [None]:
a[3:5, 15:25].compute()

In [None]:
# Notice what type this gives you, once it is fully reified
b = a[3:5, 10:20].compute()

In [None]:
type(b)

In [None]:
type(a[3:5, 15:25].compute())

## ... and methods

In [None]:
a.mean()

In [None]:
a.mean().compute()

## Ex 1.2 Try some computations on `dask.array` objects

In [None]:
a

In [None]:
b = a.T * a + 100

In [None]:
b

In [None]:
type(b)

In [None]:
b[3:5,10:20].compute()

In [None]:
b.max()

In [None]:
b.max().compute()
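
One way to convince yourself that these lazy expressions match NumPy is to wrap a small NumPy array with `da.from_array` and compare results. This is a sketch on made-up data (`x_np` is hypothetical, not part of the workshop data). Note that `*` is element-wise multiplication here, just as in NumPy, not a matrix product.

```python
import numpy as np
import dask.array as da

# Small, deterministic input so the two results can be compared exactly
x_np = np.arange(36).reshape(6, 6)
x = da.from_array(x_np, chunks=(3, 3))

# Same expression as in the exercise, once lazily and once eagerly
y = x.T * x + 100
expected = x_np.T * x_np + 100

# Reify the Dask result and compare it with the NumPy result
matches = np.array_equal(y.compute(), expected)
largest = y.max().compute()
print(matches, largest)
```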

## Ex 1.3 Visualize Dask Task Graphs
### These may not work for you
It depends on whether or not graphviz and python-graphviz have been installed properly.

If not, you'll still be able to do all the exercises; you just won't be able to see the task graphs that Dask is creating.

In [None]:
a.visualize() # a = randint(size=(60,60), chunks=(20,20))

In [None]:
b.visualize() # b = a.T * a + 100
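
If `.visualize()` fails on your machine, the array's metadata still gives a feel for the shape of the graph. A rough sketch, assuming a Dask version recent enough that collections expose `__dask_graph__()` (older releases, including those current in 2017, may not); the array of ones stands in for the random array above.

```python
import dask.array as da

x = da.ones((60, 60), chunks=(20, 20))

# Chunk sizes along each axis, and how many chunks that makes per axis
print(x.chunks)
print(x.numblocks)

# The graph is a mapping of task keys; its length is the number of tasks
n_tasks = len(x.__dask_graph__())
print(n_tasks)
```

The exact task count can vary between Dask versions, but there is at least one task per chunk.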

## Ex 1.4 Dask DataFrame
If you're familiar with the `pandas.DataFrame`, then the `dask.dataframe` is going to be easy to use.

In [None]:
import pandas as pd
pd.options.display.max_rows = 10

In [None]:
pdf = pd.read_csv("./data/minute/aig/2010-01-04.csv", 
                 parse_dates=['timestamp']).set_index('timestamp')

In [None]:
pdf

In [None]:
# may need to fix slashes in file path if you're on Windows
import dask.dataframe as dd
df = dd.read_csv("./data/minute/aig/*.csv", 
                 parse_dates=['timestamp']).set_index('timestamp')

In [None]:
# narrow the glob to the January 2010 files
# (may need to fix slashes in file path if you're on Windows)
df = dd.read_csv("./data/minute/aig/2010-01-*.csv",
                 parse_dates=['timestamp']).set_index('timestamp')

In [None]:
!ls ./data/minute/aig/2010-01-*.csv | wc -l

In [None]:
df

In [None]:
df.visualize()

In [None]:
len(df)

## Ex 1.5 DataFrame columns

In [None]:
df.columns

In [None]:
df.high

In [None]:
df.close.compute()

## Ex 1.6 DataFrame methods

In [None]:
df.close.mean()

In [None]:
df.close.mean().compute()

In [None]:
%matplotlib inline

In [None]:
# Pay attention to where .plot() comes in this expression
df[['close']].compute().plot(title='AIG', figsize=(10,4))

## Ex 1.7 Visualize DataFrame Method Task Graphs

Think about what this task graph is telling you about distributed data and distributed data structures

In [None]:
df.close.mean().visualize()
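
One way to read this graph: a mean over partitioned data cannot simply average the per-partition means, because partitions may differ in size. Instead, each partition contributes a sum and a count, and those are combined at the end. The sketch below mimics that idea by hand with plain pandas. It is a conceptual illustration with invented numbers, not Dask's actual implementation.

```python
import pandas as pd

# Three "partitions" of unequal length, standing in for per-day CSV files
partitions = [
    pd.Series([10.0, 12.0]),
    pd.Series([11.0, 13.0, 15.0]),
    pd.Series([20.0]),
]

# Step 1 (one task per partition): reduce each partition to (sum, count)
partials = [(p.sum(), p.count()) for p in partitions]

# Step 2 (combine): add up the partial sums and the partial counts
total = sum(s for s, _ in partials)
count = sum(c for _, c in partials)

# Step 3 (finalize): a single division yields the global mean
global_mean = total / count

# Averaging the per-partition means gives a different answer here
naive = sum(p.mean() for p in partitions) / len(partitions)
print(global_mean, naive)
```

The many-to-one shape of steps 1 to 3 is the kind of fan-in you should look for in the task graph.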

## Dask data loading
Why do the graph and its connections look like this?