# Getting Started with Dask

**Dask** is a library for parallel computing in Python. It works by splitting a large dataset into smaller chunks and performing computations on those chunks in parallel. For instance, it can break up a large Pandas DataFrame of 100k rows into 10 Pandas DataFrames of 10k rows. Then, when we want to perform some operation on the large 100k-row DataFrame, the Dask scheduler will schedule 10 tasks to be performed in parallel (in fact, it will work out which tasks the final result depends on and schedule only those).
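
To make the chunking idea concrete, here is a minimal sketch in plain Pandas, without Dask. The names `big`, `chunk_size`, and `partial_sums` are made up for illustration, and the chunks are processed one after another here, whereas Dask would be free to run such independent pieces in parallel:

```python
import numpy as np
import pandas as pd

# A stand-in for a "large" dataset: 100k rows, as in the example above.
big = pd.DataFrame({"x": np.arange(100_000, dtype="int64")})

# Split it into 10 chunks of 10k rows each.
chunk_size = 10_000
chunks = [big.iloc[i:i + chunk_size] for i in range(0, len(big), chunk_size)]

# Each partial sum depends only on its own chunk, so these steps are
# independent of one another and could run in parallel.
partial_sums = [chunk["x"].sum() for chunk in chunks]

# Combining the partial results gives the same answer as summing everything at once.
total = sum(partial_sums)
print(total == big["x"].sum())
```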

This notebook will get us accustomed to the way Dask works cooperatively with Pandas. It assumes familiarity with Pandas DataFrames and basic operations like indexing and aggregations.

In [1]:
import pandas as pd
import numpy as np

import dask.dataframe as dd

As seen in the above import, Dask has its own DataFrame, called Dask DataFrame. A Dask DataFrame is composed of a number of Pandas DataFrames, each representing a "chunk". We can specify the number of chunks when we initialize a Dask DataFrame.

One way to initialize a Dask DataFrame is to take an existing Pandas DataFrame, and create a Dask DataFrame from that.

Let's start by creating a dummy Pandas DataFrame (using a Pandas DatetimeIndex):

In [2]:
index = pd.date_range("2021-09-01", periods=2400, freq="1H")
df = pd.DataFrame({"a": np.arange(2400), "b": list("abcaddbe" * 300)}, index=index)
df

,a,b
2021-09-01 00:00:00,0,a
2021-09-01 01:00:00,1,b
2021-09-01 02:00:00,2,c
2021-09-01 03:00:00,3,a
2021-09-01 04:00:00,4,d
...,...,...
2021-12-09 19:00:00,2395,a
2021-12-09 20:00:00,2396,d
2021-12-09 21:00:00,2397,d
2021-12-09 22:00:00,2398,b


We can then convert it to a Dask DataFrame. Our original Pandas DataFrame has 2400 rows. Let's set up the Dask DataFrame such that it has 10 partitions, each of 240 rows:

In [3]:
ddf = dd.from_pandas(df, npartitions=10)
ddf

,a,b
npartitions=10,,
2021-09-01 00:00:00,int64,object
2021-09-11 00:00:00,...,...
...,...,...
2021-11-30 00:00:00,...,...
2021-12-09 23:00:00,...,...


As you can see, a Dask DataFrame is really just a collection of Pandas DataFrames, each representing a chunk of the entire dataset. We can see the edges of the Pandas DataFrames (i.e. the index value at which each partition starts, followed by the last index value of the final partition) as such:

In [5]:
ddf.divisions

(Timestamp('2021-09-01 00:00:00', freq='H'),
 Timestamp('2021-09-11 00:00:00', freq='H'),
 Timestamp('2021-09-21 00:00:00', freq='H'),
 Timestamp('2021-10-01 00:00:00', freq='H'),
 Timestamp('2021-10-11 00:00:00', freq='H'),
 Timestamp('2021-10-21 00:00:00', freq='H'),
 Timestamp('2021-10-31 00:00:00', freq='H'),
 Timestamp('2021-11-10 00:00:00', freq='H'),
 Timestamp('2021-11-20 00:00:00', freq='H'),
 Timestamp('2021-11-30 00:00:00', freq='H'),
 Timestamp('2021-12-09 23:00:00', freq='H'))

To access a specific partition, we can index into `ddf.partitions`:

In [6]:
ddf.partitions[4]

,a,b
npartitions=1,,
2021-10-11,int64,object
2021-10-21,...,...


One cool thing about Dask is that when we set up an operation for it (like indexing or averaging), it won't immediately compute the result. It will only record the operation so that it can be computed when we need the result. If we want the result now, we have to explicitly ask for it. This way, Dask can determine exactly which tasks it needs to do to get the final result, instead of greedily doing everything.

For instance, let's say we want to define a 2D matrix, fill it with random values, and compute the average value across the first row. In regular NumPy, we would have to generate the entire matrix, random values included, before computing the average across the first row, which wastes time on rows we never use. Dask, however, does not immediately generate the matrix. It waits until we need the result, and only does the work that's necessary for it (i.e. it only generates the chunks that contain the first row). It does this by building up a task graph and running only the tasks that the requested result depends on, ordered according to their dependencies.
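
The matrix example could look roughly like this with `dask.array`. The matrix shape and chunk size are arbitrary choices made for this sketch, and the variable names are ours:

```python
import dask.array as da

# A 10,000 x 10,000 matrix of random values, split into 1,000 x 1,000 chunks.
# Nothing is generated yet: this only records how to build the matrix.
matrix = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))

# Still lazy: select the first row and describe its mean.
first_row_mean = matrix[0].mean()

# Only now does Dask do any work. The chunks that do not overlap
# the first row are not needed for this result.
value = first_row_mean.compute()
print(value)
```

Note that the unit of work is the chunk, not the row: with this chunking, the 10 chunks covering the first row are generated in full, while the other 90 can be skipped.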

That explains why above, we don't see the values of the Pandas DataFrame at partition 4. To actually see the results, we need to tell Dask explicitly to perform the computation, as such:

In [8]:
ddf.partitions[4].compute()

,a,b
2021-10-11 00:00:00,960,a
2021-10-11 01:00:00,961,b
2021-10-11 02:00:00,962,c
2021-10-11 03:00:00,963,a
2021-10-11 04:00:00,964,d
...,...,...
2021-10-20 19:00:00,1195,a
2021-10-20 20:00:00,1196,d
2021-10-20 21:00:00,1197,d
2021-10-20 22:00:00,1198,b


Also, earlier, I did oversimplify a little bit by saying that a Dask DataFrame is *just* a collection of Pandas DataFrames. While that's true at its core, these Pandas DataFrames can easily work together if need be. For instance, we can index the Dask DataFrame without thinking of partitions, and Dask will schedule a task for each of the affected partitions to do its own indexing. Dask will then collect the results of these different tasks nicely:

In [9]:
ddf["2021-10-01": "2021-10-09 5:00"]

,a,b
npartitions=1,,
2021-10-01 00:00:00.000000000,int64,object
2021-10-09 05:00:59.999999999,...,...


Therefore, in most cases, we don't actually have to deal with the individual partitions. We can just perform operations on the big Dask DataFrame, and it will take care of scheduling the proper tasks for the proper partitions, as well as collecting the result to return to us in the expected format. In other words, **in many cases, we can just treat the Dask DataFrame like a Pandas DataFrame and expect it to work well** (the only difference being that we have to explicitly tell Dask to compute the result, since it doesn't do so on its own):

In [11]:
ddf["2021-10-01": "2021-10-09 5:00"].compute()

,a,b
2021-10-01 00:00:00,720,a
2021-10-01 01:00:00,721,b
2021-10-01 02:00:00,722,c
2021-10-01 03:00:00,723,a
2021-10-01 04:00:00,724,d
...,...,...
2021-10-09 01:00:00,913,b
2021-10-09 02:00:00,914,c
2021-10-09 03:00:00,915,a
2021-10-09 04:00:00,916,d


Let's say we want to calculate the average of the `a` column in the entire Dask DataFrame. Normally, in Pandas, we would do `df["a"].mean()`. In Dask, we can do the same thing:

In [12]:
ddf["a"].mean()

dd.Scalar<series-..., dtype=float64>

Well, what the heck is a `dd.Scalar`? Didn't you say that Dask will take care of dividing the work and collecting the result? 

That's correct, Dask will do that, but remember: we have to ask Dask to compute the result. A `dd.Scalar` is a lazy data structure -- it knows what work it has to do, but hasn't performed the work yet.

In [13]:
ddf["a"].mean().compute()

1199.5

Similarly, let's do another operation -- this time, a cumulative sum and a subtraction.

In [15]:
result = ddf["2021-10-01": "2021-10-09 5:00"].a.cumsum() - 100
result

Dask Series Structure:
npartitions=1
2021-10-01 00:00:00.000000000    int64
2021-10-09 05:00:59.999999999      ...
Name: a, dtype: int64
Dask Name: sub, 7 graph layers

Again, the Dask Series is a lazy equivalent of the Pandas Series. If we need the results immediately, we can specify it as such:

In [16]:
result.compute()

2021-10-01 00:00:00       620
2021-10-01 01:00:00      1341
2021-10-01 02:00:00      2063
2021-10-01 03:00:00      2786
2021-10-01 04:00:00      3510
                        ...  
2021-10-09 01:00:00    158301
2021-10-09 02:00:00    159215
2021-10-09 03:00:00    160130
2021-10-09 04:00:00    161046
2021-10-09 05:00:00    161963
Freq: H, Name: a, Length: 198, dtype: int64

**To summarize:** Dask lets us perform parallel computing on different data structures, including Pandas DataFrames, by chunking the data structure into smaller pieces and processing the relevant pieces in parallel. Dask uses *lazy evaluation*, which means it doesn't execute tasks immediately when they are defined. Instead, it builds up a task graph (a directed acyclic graph, or DAG), and only computes the result when explicitly asked. This lets it optimize the execution plan and perform only the work that is necessary for the final result.

Dask also has its own scheduler, which is responsible for assigning tasks to available worker threads/processes, and ensuring that the dependencies in the task graph are satisfied.

For more information about Dask DataFrames, I highly recommend you check out this doc page: https://docs.dask.org/en/stable/dataframe.html