Scalable Machine Learning in Python 
===================
with Scikit-Learn and Dask 
===============
## 1 - Dask Task Graphs
**May 2017**

<a href=https://dask.pydata.org ><img src=https://www.continuum.io/sites/default/files/dask_stacked.png
 width=200 />
</a>

[http://bit.ly/scaleml-dask-wkshp](http://bit.ly/scaleml-dask-wkshp)


### We have a strong analytics ecosystem (NumPy, Pandas)

### that is mostly restricted to a single core and RAM

How do we parallelize an ecosystem?

of thousands of packages

each with custom algorithms

### Scikit-Image: general image analysis

    skimage.feature.canny(im, sigma=3)

<img src="http://scikit-image.org/docs/dev/_images/sphx_glr_plot_canny_001.png"
     alt="Canny edge detection from skimage"
     width="50%">


### Scikit-Allel: Specialized genomics

<img src="http://alimanfoo.github.io/assets/2016-06-10-scikit-allel-tour_files/2016-06-10-scikit-allel-tour_50_0.png" alt="scikit-allel example" width="50%" align="center">

### Need a parallel computing library

... that is flexible enough

... and familiar enough

... to parallelize a disparate ecosystem

Outline
-------

-  Parallel NumPy and Pandas
-  Parallel code generally
-  Task Graphs and Task Scheduling
    -   Compare with other systems (Spark, Airflow)
    -   Dask's task schedulers
-  Python APIs and Protocols
-  Python Ecosystem and strengths for parallel computing

# Distributed NumPy: `dask.array`

<img src="images/dask-array-black-text.svg" width="60%">

In [None]:
# NumPy code
import numpy as np
x = np.random.random((1000, 1000))
u, s, v = np.linalg.svd(x.dot(x.T))

In [None]:
# Dask.array code
import dask.array as da
x = da.random.random((100000, 100000), chunks=(1000, 1000))
u, s, v = da.linalg.svd(x.dot(x.T))

## `dask.dataframe`

<img src="images/dask-dataframe.svg" width="30%">

In [None]:
import pandas as pd
df = pd.read_csv('myfile.csv', parse_dates=['timestamp'])
df.groupby(df.timestamp.dt.hour).value.mean()

In [None]:
import dask.dataframe as dd
df = dd.read_csv('hdfs://myfiles.*.csv', parse_dates=['timestamp'])
df.groupby(df.timestamp.dt.hour).value.mean().compute()

# But many problems aren't just big arrays and dataframes

The Python community writes clever algorithms

Fine-grained Python code:

In [None]:
results = {}

for a in A:
    for b in B:
        if a < b:
            results[a, b] = f(a, b)
        else:
            results[a, b] = g(a, b)

## Parallelizable, but not a list, dataframe, or array

In [None]:
from dask import delayed, compute

results = {}

for a in A:
    for b in B:
        if a < b:
            results[a, b] = delayed(f)(a, b)  # lazily construct graph
        else:
            results[a, b] = delayed(g)(a, b)  # without structure

(results,) = compute(results)  # trigger all computation; compute() returns a tuple

## `concurrent.futures.ThreadPoolExecutor`

In [None]:
from concurrent.futures import ThreadPoolExecutor 

e = ThreadPoolExecutor()

results = {}

for a in A:
    for b in B:
        if a < b:
            results[a, b] = e.submit(f, a, b)  # submit work asynchronously
        else:
            results[a, b] = e.submit(g, a, b)  # submit work asynchronously

results = {k: v.result() for k, v in results.items()} # block until finished
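
The pattern above can be tried end to end with small stand-in inputs. This is a minimal sketch: `f`, `g`, `A`, and `B` are hypothetical placeholders invented for illustration (the slides do not define them), and the pool size is an arbitrary choice.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the slide's f, g, A, and B
def f(a, b):
    return a + b

def g(a, b):
    return a * b

A = [1, 2, 3]
B = [2, 3]

futures = {}
# The context manager shuts the pool down once the block exits
with ThreadPoolExecutor(max_workers=4) as executor:
    for a in A:
        for b in B:
            func = f if a < b else g
            futures[a, b] = executor.submit(func, a, b)  # returns immediately

    # .result() blocks until that particular task has finished
    results = {key: future.result() for key, future in futures.items()}

print(results)
```

Each `submit` call starts work right away, whereas the `delayed` version only records the calls and runs nothing until `compute` is called.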

# Dask APIs Produce Task Graphs

---
# Dask Schedulers Execute Task Graphs
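
To make the split between graphs and schedulers concrete, here is a small sketch of a task graph written by hand as a plain dictionary, in the style of Dask's graph specification: keys name results, and a tuple whose first element is callable describes a task. The `toy_get` function is a hypothetical, single-threaded executor written only for illustration; it is not how Dask's schedulers are implemented.

```python
from operator import add

def inc(x):
    return x + 1

# Keys name results; a tuple starting with a callable is a task
graph = {
    "x": 1,
    "y": (inc, "x"),       # y = inc(x)
    "z": (add, "y", 10),   # z = add(y, 10)
}

def toy_get(graph, key, cache=None):
    """Toy executor: recursively resolve `key`, reusing finished results."""
    if cache is None:
        cache = {}
    if key in cache:
        return cache[key]
    value = graph[key]
    if isinstance(value, tuple) and value and callable(value[0]):
        func, *args = value
        # Arguments that name other keys are resolved first
        resolved = [
            toy_get(graph, arg, cache) if isinstance(arg, str) and arg in graph else arg
            for arg in args
        ]
        value = func(*resolved)
    cache[key] = value
    return value

result = toy_get(graph, "z")
print(result)
```

With Dask installed, the same dictionary can be handed to a real scheduler, for example `dask.threaded.get(graph, "z")`, which can run independent tasks in parallel.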

In [None]:
import numpy      as np
import dask.array as da

## 1D-Array

<img src="images/array-1d.svg">

    >>> np.ones((15,))
    array([ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

    >>> x = da.ones((15,), chunks=(5,))

### 1D-Array

<img src="images/array-1d-sum.svg" width="30%">

In [None]:
x = da.ones((15,), chunks=(5,))
x.sum()
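
The graph in the figure can be read as three steps: split the array into chunks, sum each chunk independently, then sum the partial results. A plain-Python sketch of that idea follows; the lists stand in for array chunks, and this is an illustration of the pattern rather than Dask's implementation.

```python
# Plain-Python sketch of the chunked sum (not Dask's implementation)
data = [1.0] * 15   # stands in for da.ones((15,))
chunk_size = 5      # stands in for chunks=(5,)

# Step 1: split the data into independent chunks
chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

# Step 2: reduce each chunk on its own; these sums could run in parallel
partial_sums = [sum(chunk) for chunk in chunks]

# Step 3: combine the partial results into the final answer
total = sum(partial_sums)

print(partial_sums, total)
```

Only step 2 touches the full data, and its pieces are independent of one another, which is what lets a scheduler run them on separate cores or machines.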

## ND-Array - Sum

<img src="images/array-sum.svg">

In [None]:
x = da.ones((15, 15), chunks=(5, 5))
x.sum(axis=0)

### ND-Array - Transpose

<img src="images/array-xxT.svg">

In [None]:
x = da.ones((15, 15), chunks=(5, 5))
x + x.T

### ND-Array - Matrix Multiply

<img src="images/array-xdotxT.svg">

In [None]:
x = da.ones((15, 15), chunks=(5, 5))
x.dot(x.T + 1)

## ND-Array - Compound Operations

<img src="images/array-xdotxT-mean.svg">

In [None]:
x = da.ones((15, 15), chunks=(5, 5))
x.dot(x.T + 1) - x.mean()

## ND-Array - Compound Operations

<img src="images/array-xdotxT-mean-std.svg">

In [None]:
x = da.ones((15, 15), chunks=(5, 5))
y = (x.dot(x.T + 1) - x.mean()).std()

# Dask APIs Produce Task Graphs

<hr>

# Dask Schedulers Execute Task Graphs

# Exercise 1: Dask Arrays and Task Graphs

In [None]:
import pandas as pd
ge = pd.read_csv('../data/minute/ge/2012-05-01.csv', parse_dates=['timestamp'])
hp = pd.read_csv('../data/minute/hp/2012-05-01.csv', parse_dates=['timestamp'])

In [None]:
%matplotlib inline

In [None]:
ge.plot(x='timestamp')

In [None]:
hp.plot(x='timestamp')

In [None]:
hp.close.max()

In [None]:
hp.close.mean()

In [None]:
from glob import glob
hp_filenames = glob('../data/minute/hp/*.csv')
len(hp_filenames)

In [None]:
%%time
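# Parse timestamps on load so the plot below gets a real datetime axis
hp_pd = pd.concat([pd.read_csv(fn, parse_dates=['timestamp']) for fn in sorted(hp_filenames)])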

In [None]:
hp_pd.plot(x='timestamp', title='HP', figsize=(10,6))

Now use `dask.dataframe.read_csv()` to perform the same operations.

In [None]:
%%time
import dask.dataframe as dd
hp_dd = dd.read_csv('../data/minute/hp/*.csv', parse_dates=['timestamp'])

**NOTE:** You will have to think about when to call the `.compute()` method.

-  How many rows are in the `hp_dd` dataset?
-  Get the *max* and *min* for the `high` column over the entire data set
-  Get the *mean* for the `close` column over the entire data set
-  Read the first few rows of the timestamp column
-  Use the `.dt.round` method to round the timestamp column to days
-  Get the high value for each day by grouping by the result from above and computing the maximum of the high column per group
-  Compute the daily high-low spread.
-  Plot the resulting Pandas DataFrame

In [None]:
df = hp_dd

In [None]:
df.groupby(df.timestamp.dt.round('1d')).high.max().compute()

In [None]:
%%time
hp_dd['high'].mean().compute()