# Dask

`Dask` is a parallel computing library whose data structures are analogous to `numpy` arrays, `pandas` DataFrames and `Spark` RDDs. Unlike the numpy and pandas originals, the `Dask` arrays and DataFrames are distributed, and operations on them are parallelized. The Dask `bag` is the analog of a Spark RDD; both are distributed data structures. Dask calls these distributed data structures `high-level collections`.

| Dask Analog | Existing Data Structures |
| --- | ----------- |
| Array | Numpy Array |
| DataFrame | Pandas DataFrame |
| Bag | Spark RDD |
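
As a minimal sketch of the first row of the table, the snippet below wraps a numpy array in a Dask Array and sums it chunk by chunk. The array contents and the `chunks=250` setting are arbitrary choices for illustration, and `scheduler='synchronous'` is used only to keep this small example in a single process.

```python
import numpy as np
import dask.array as da

# An ordinary in-memory numpy array (illustrative data).
values = np.arange(1000)

# Wrap it in a Dask Array split into chunks of 250 elements each.
dvalues = da.from_array(values, chunks=250)

# Building the sum is lazy; compute() runs it and returns a concrete value.
total = dvalues.sum().compute(scheduler='synchronous')

print(dvalues.numblocks)  # number of chunks along each axis
print(total)
```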

Dask builds on the fundamental tools of the Python data science stack and unifies them in a single framework. If you have ever worked with [Apache Spark](https://spark.apache.org/) and seen its web interface, Dask has a similar dashboard (it runs on port `8787` by default). Like Spark, Dask is built on the concept of a `driver` submitting jobs (centered on a distributed data structure) to a `cluster` of workers. Here's a more in-depth [comparison and contrast](https://docs.dask.org/en/latest/spark.html) of Dask and Spark. Let's take a glance at Dask and see how it works.

## Cluster and client

In Dask, the `Client` represents the `driver` and the `LocalCluster` represents the `cluster`. The driver controls the manipulation of distributed data through submission of jobs to the cluster. The `LocalCluster` is not a real cluster of separate, physical worker nodes, but it mimics one and is useful for local development.

In [1]:
from dask.distributed import Client, LocalCluster

params = {
    'n_workers': 4,
    'threads_per_worker': 2,
    'dashboard_address': '8787'
}
cluster = LocalCluster(**params)
client = Client(cluster)

In [2]:
client

| Client | Cluster |
| --- | --- |
| Scheduler: tcp://127.0.0.1:59227 | Workers: 4 |
| Dashboard: http://127.0.0.1:8787/status | Cores: 8 |
| | Memory: 25.76 GB |
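
To make the driver and cluster roles concrete, here is a small sketch in which the client submits a function to the workers and waits for the result. The `square` function is a hypothetical example. The `processes=False` and `dashboard_address=None` settings are choices made for this sketch (threads instead of worker processes, no dashboard) and are not the settings used in the cell above.

```python
from dask.distributed import Client, LocalCluster


def square(x):
    # Hypothetical task that runs on a worker.
    return x * x


# A thread-based local cluster without a dashboard; both objects are closed
# automatically when the with-block exits.
with LocalCluster(n_workers=1, threads_per_worker=2, processes=False,
                  dashboard_address=None) as cluster, Client(cluster) as client:
    future = client.submit(square, 7)   # returns immediately with a Future
    result = future.result()            # blocks until the worker finishes

    futures = client.map(square, range(5))  # one task per input
    results = client.gather(futures)        # collect all results on the driver

print(result)
print(results)
```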


## DataFrame

A Dask [DataFrame](https://docs.dask.org/en/latest/dataframe-api.html) behaves nearly identically to a pandas one. There are several ways to create a Dask DataFrame (e.g. reading from files); here, we create a pandas DataFrame and convert it to a Dask one with the `from_pandas()` function.

In [60]:
import numpy as np
import random
import pandas as pd
from sklearn.datasets import make_regression, make_classification

np.random.seed(37)
random.seed(37)

def to_pdf(X, y):
    # Wrap the feature matrix and the target vector in pandas DataFrames.
    A = pd.DataFrame(X, columns=[f'x{i}' for i in range(X.shape[1])])
    b = pd.DataFrame(pd.Series(y, name='y'))
    return A, b

def to_seq(X, y):
    # Turn each DataFrame row into a dictionary keyed by column name.
    A = [{c: r[c] for c in X.columns} for _, r in X.iterrows()]
    b = [{c: r[c] for c in y.columns} for _, r in y.iterrows()]
    return A, b

def get_regression(n_samples=2000):
    X, y = make_regression(**{
        'n_samples': n_samples,
        'n_features': 10,
        'n_informative': 5,
        'n_targets': 1,
        'bias': 5.3,
        'random_state': 37
    })

    return X, y

def get_classification(n_samples=2000):
    X, y = make_classification(**{
        'n_samples': n_samples,
        'n_features': 10,
        'n_informative': 5,
        'n_redundant': 2,
        'n_repeated': 0,
        'n_classes': 2,
        'n_clusters_per_class': 2,
        'random_state': 37
    })
    
    return X, y

X, y = get_regression()
X, y = to_pdf(X, y)

As with Spark, we have to choose the number of partitions in Dask. For now, we arbitrarily specify 100.

In [23]:
import dask.dataframe as dd

X = dd.from_pandas(X, npartitions=100)

Displaying a Dask DataFrame shows only metadata (the column names, data types and partition boundaries), not the actual data.

In [12]:
X

| npartitions=100 | x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 |
| 20 | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1980 | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1999 | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |


We can check the data types of each column.

In [13]:
X.dtypes

x0    float64
x1    float64
x2    float64
x3    float64
x4    float64
x5    float64
x6    float64
x7    float64
x8    float64
x9    float64
dtype: object

We can invoke statistical functions like `sum()`, `mean()` and `std()` on the DataFrame. These calls are lazy: they only describe the computation. To run it and bring the results back to the client, we need to issue `compute()`. This pattern mimics Spark's `transformation` vs `action` functions.

In [14]:
X.sum().compute()

x0     8.309788
x1    -7.660057
x2    -1.745221
x3    -8.071736
x4   -25.665312
x5    70.528265
x6   -18.859073
x7   -15.195804
x8   -46.170896
x9   -29.945949
dtype: float64

In [15]:
X.mean().compute()

x0    0.004155
x1   -0.003830
x2   -0.000873
x3   -0.004036
x4   -0.012833
x5    0.035264
x6   -0.009430
x7   -0.007598
x8   -0.023085
x9   -0.014973
dtype: float64

In [16]:
X.std().compute()

x0    0.984805
x1    1.004140
x2    1.014151
x3    1.016681
x4    0.982926
x5    0.981421
x6    0.986267
x7    1.025351
x8    1.019498
x9    0.986211
dtype: float64

## Bag

Dask [Bags](https://docs.dask.org/en/latest/bag-api.html) are like Spark RDDs. Bags offer counterparts to RDD functions like `map()`, `filter()` and `reduce()`. The `reduceByKey()` of Spark RDDs corresponds to `foldby()` in Dask Bags, and the `reduce()` of Spark RDDs corresponds to `fold()`.

Below, we create a Dask Bag with `from_sequence()` from a collection of dictionary elements.

In [25]:
import dask.bag as db

X, y = get_regression()
X, y = to_pdf(X, y)
X, y = to_seq(X, y)
X = db.from_sequence(X)

Here's an example of map and reduce by key. The map operation turns each dictionary element into a tuple whose first element is a boolean indicating whether the integer representation of the `x0` field is even, and whose second element is the value of `x0`. `foldby()` then groups the tuples by that boolean and sums the `x0` values within each group.

In [26]:
(X.map(lambda r: (int(r['x0']) % 2 == 0, r['x0']))
 .foldby(lambda tup: tup[0], lambda a, b: (a[0], a[1] + b[1]))
 .compute())

[(True, (True, -7.959470449062381)), (False, (False, 16.269258267363348))]
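
It helps to see what the key expression does to individual values. Python's `int()` truncates toward zero, so every `x0` strictly between -1 and 1 maps to 0 and counts as even. The sample values below are made up for illustration.

```python
# The same key expression as in the map step, applied to a bare float.
def is_even_int(x0):
    return int(x0) % 2 == 0

samples = [-1.2, -0.7, 0.3, 1.9, 2.5]
keys = [(x0, int(x0), is_even_int(x0)) for x0 in samples]

# Each row shows the value, its truncated integer, and the resulting key.
for x0, truncated, even in keys:
    print(x0, truncated, even)
```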

In this example, we keep only the elements in the even group before folding.

In [27]:
(X.map(lambda r: (int(r['x0']) % 2 == 0, r['x0']))
 .filter(lambda tup: tup[0])
 .foldby(lambda tup: tup[0], lambda a, b: (a[0], a[1] + b[1]))
 .compute())

[(True, (True, -7.959470449062381))]

In this example, we keep only the elements in the odd group before folding.

In [28]:
(X.map(lambda r: (int(r['x0']) % 2 == 0, r['x0']))
 .filter(lambda tup: not tup[0])
 .foldby(lambda tup: tup[0], lambda a, b: (a[0], a[1] + b[1]))
 .compute())

[(False, (False, 16.269258267363348))]

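The same sum over the even group can also be computed with `fold()`, the analog of Spark's `reduce()`, after mapping each element to its `x0` value.

In [29]: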
(X.map(lambda r: r['x0'])
 .filter(lambda x0: int(x0) % 2 == 0)
 .fold(lambda a, b: a + b)
 .compute())

-7.959470449062381
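
The `foldby()` calls above return each key paired with a `(key, sum)` tuple because the mapped tuples themselves are folded. As an alternative sketch, `foldby()` also accepts an `initial` value and a separate `combine` function, which lets the per-key result be a plain number. The records and the `group` and `value` field names are hypothetical, and `scheduler='synchronous'` is used only to keep the example in a single process.

```python
from operator import add
import dask.bag as db

# Hypothetical records; integers keep the sums exact.
records = [
    {'group': 'a', 'value': 1},
    {'group': 'b', 'value': 10},
    {'group': 'a', 'value': 2},
    {'group': 'b', 'value': 20},
]
bag = db.from_sequence(records, npartitions=2)

sums = bag.foldby(
    key=lambda r: r['group'],                    # group records by this field
    binop=lambda total, r: total + r['value'],   # fold one record into a running total
    initial=0,                                   # starting total within each partition
    combine=add,                                 # merge the per-partition totals
    combine_initial=0,                           # starting value for the merge step
)

result = dict(sums.compute(scheduler='synchronous'))
print(result)
```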

## Machine Learning

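A Dask DataFrame mirrors much of the pandas DataFrame API. Note that the cell below fits a scikit-learn `LogisticRegression` on the pandas DataFrames returned by `to_pdf()`; they are not converted to Dask DataFrames first.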

In [61]:
from sklearn.linear_model import LogisticRegression

X, y = get_classification()
X, y = to_pdf(X, y)

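# penalty='none' disables regularization; newer scikit-learn releases expect penalty=None instead
model = LogisticRegression(random_state=37, penalty='none', verbose=10)
# pass y as a 1-D array to avoid scikit-learn's column-vector warning
model.fit(X, y.values.ravel())
[('intercept', model.intercept_[0])] + [(f, c) for f, c in zip(X.columns, model.coef_[0])]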

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s finished


[('intercept', -0.4096500455402346),
 ('x0', 0.05472569830192885),
 ('x1', -0.3576951243307095),
 ('x2', -0.016887091414339032),
 ('x3', 0.9748572795615676),
 ('x4', 0.3314410250166681),
 ('x5', 0.7442808664015453),
 ('x6', 0.0035693089244456795),
 ('x7', 0.09115568775512932),
 ('x8', 0.08237183854292032),
 ('x9', -0.1556104235780201)]

In [62]:
from dask_ml.linear_model import LogisticRegression

ModuleNotFoundError: No module named 'dask_ml'