## Setup

In [None]:
from dask.distributed import Client

client = Client(n_workers=4)
client

<img src="img/dask-overview.svg" width=75%>

# Bag: Parallel Lists for semi-structured data

Dask Bag excels at processing data that can be represented as a sequence of arbitrary inputs. We'll refer to this as "messy" data, because it can contain complex nested structures, missing fields, mixtures of data types, etc. The *functional* programming style fits very nicely with standard Python iteration, such as can be found in the `itertools` module.

Dask Bag is a high-level Dask collection that automates common workloads of this form. In a nutshell:

    dask.bag = map, filter, toolz + parallel execution
    
Bags provide very general computation (any Python function). This generality
comes at a cost: bag operations tend to be slower than array/dataframe computations, in the same way that pure Python tends to be slower than NumPy/pandas.
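
To make the "map, filter" style concrete, here is a minimal plain-Python sketch with no Dask involved. The sample records and the `has_age` helper are invented for illustration; a bag applies the same kind of per-record functions, only spread across partitions.

```python
# Hypothetical "messy" records: nested fields, a missing key, mixed types.
records = [
    {"name": ["Ada", "Lovelace"], "age": 36, "address": {"city": "London"}},
    {"name": ["Alan", "Turing"], "age": "41"},  # age stored as a string
    {"name": ["Grace", "Hopper"], "address": {"city": "Arlington"}},  # no age
]


def has_age(record):
    """Keep only the records that carry an 'age' field."""
    return "age" in record


# filter, then map: drop records without an age, then coerce age to int.
ages = list(map(lambda r: int(r["age"]), filter(has_age, records)))

# Nested and possibly missing fields are handled record by record.
cities = [r.get("address", {}).get("city") for r in records]
```

Every step here is an ordinary Python function call per record, which is where both the flexibility and the per-record cost come from.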
    
**Related Documentation**

* [Bag documentation](https://docs.dask.org/en/latest/bag.html)
* [Bag screencast](https://youtu.be/-qIiJ1XtSv0)
* [Bag API](https://docs.dask.org/en/latest/bag-api.html)
* [Bag examples](https://examples.dask.org/bag.html)

To showcase bags, we'll generate a random set of records and store it on disk as many JSON files.

In [None]:
import dask
import json
import os
import getpass

username = getpass.getuser()
data_dir = f"/tmp/{username}/data"
os.makedirs(data_dir, exist_ok=True)  # Create data/ directory

b = dask.datasets.make_people()  # Make records of people
b.map(json.dumps).to_textfiles(data_dir+"/*.json")  # Encode as JSON, write to disk

# Bag creation

We'll make a dask bag from the data files we wrote above:

In [None]:
import dask.bag as db
import json

b = db.read_text(data_dir+"/*.json").map(json.loads)
b
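
The files read above are line-delimited JSON: one record per line, which is why each line can be passed to `json.loads` independently. A minimal standard-library sketch of the same round trip, using a temporary file and two made-up records (both are assumptions of this example, not part of the dataset above):

```python
import json
import os
import tempfile

# Two made-up records standing in for the generated people data.
records = [{"name": "Ada", "age": 36}, {"name": "Alan", "age": 41}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "0.json")

    # Write one JSON document per line.
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

    # Read the file back line by line and decode each line on its own.
    with open(path) as f:
        loaded = [json.loads(line) for line in f]
```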

In [None]:
b.take(1)

# Bag transformation
Here are a few map, filter, and aggregate operations on our bag:

In [None]:
b.filter(lambda record: record["age"] > 30).take(2)  # Select only people over 30

In [None]:
b.map(lambda record: record["occupation"]).take(2)  # Select the occupation field

In [None]:
b.count().compute()  # Count total number of records

These operations are lazy and can be chained together; nothing is computed until we ask for a result:

In [None]:
result = (
    b.filter(lambda record: record["age"] > 30)
    .map(lambda record: record["occupation"])
    .frequencies(sort=True)
    .topk(10, key=1)
)
result

In [None]:
result.visualize()

In [None]:
result.compute()
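
To check by hand what the chained pipeline returns, the same steps can be run on a tiny in-memory bag. This is a sketch under assumptions: the six records below are invented, and `scheduler="synchronous"` is used only so the example runs without a cluster.

```python
import dask.bag as db

# Invented records; only the two fields used by the pipeline are present.
records = [
    {"age": 45, "occupation": "Chef"},
    {"age": 41, "occupation": "Pilot"},
    {"age": 35, "occupation": "Chef"},
    {"age": 52, "occupation": "Chef"},
    {"age": 60, "occupation": "Pilot"},
    {"age": 33, "occupation": "Nurse"},
    {"age": 22, "occupation": "Nurse"},  # filtered out: not over 30
]
small = db.from_sequence(records, npartitions=2)

top = (
    small.filter(lambda record: record["age"] > 30)
    .map(lambda record: record["occupation"])
    .frequencies(sort=True)  # (occupation, count) pairs
    .topk(2, key=1)  # keep the two largest counts
)

# Run in the calling thread; no distributed client is needed for this.
top_result = top.compute(scheduler="synchronous")
```

The result is a list of `(occupation, count)` pairs, here covering "Chef" with 3 records and "Pilot" with 2.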

# Arrays

<img src="https://docs.dask.org/en/stable/_images/dask-array.svg" width="25%" align="right">
Dask Array provides a parallel, larger-than-memory, n-dimensional array using blocked algorithms. Simply put: distributed NumPy.

*  **Parallel**: Uses all of the cores on your computer
*  **Larger-than-memory**:  Lets you work on datasets that are larger than your available memory by breaking up your array into many small pieces, operating on those pieces in an order that minimizes the memory footprint of your computation, and effectively streaming data from disk.
*  **Blocked Algorithms**:  Perform large computations by performing many smaller computations
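
A blocked algorithm can be sketched in plain NumPy. The example below computes a mean by summing each block separately and combining the partial sums; the array size, block size, and seed are arbitrary choices for illustration, and Dask's own implementation is more general than this loop.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(10, 0.1, size=(8, 8))
block = 4  # arbitrary block edge length for this sketch

# Many small computations: one partial sum per 4x4 block.
partial_sums = []
for i in range(0, data.shape[0], block):
    for j in range(0, data.shape[1], block):
        partial_sums.append(data[i : i + block, j : j + block].sum())

# One small combine step: total of the partial sums over the element count.
blocked_mean = sum(partial_sums) / data.size
```

Each partial sum only needs its own block in memory, and the blocks are independent of each other, which is what makes the pattern both memory-friendly and easy to parallelize.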


**Related Documentation**

* [Array documentation](https://docs.dask.org/en/latest/array.html)
* [Array screencast](https://youtu.be/9h_61hXCDuI)
* [Array API](https://docs.dask.org/en/latest/array-api.html)
* [Array examples](https://examples.dask.org/array.html)

### Example

1.  Construct a 20000x20000 array of normally distributed random values broken up into 1000x1000 sized chunks
2.  Take the mean along one axis
3.  Take every 100th element

In [None]:
import numpy as np
import dask.array as da

x = da.random.normal(
    10,
    0.1,
    size=(20000, 20000),  # 400 million element array
    chunks=(1000, 1000),
)  # Cut into 1000x1000 sized chunks
x

In [None]:
y = x.mean(axis=0)[::100]  # Perform NumPy-style operations
y

In [None]:
%%time
y.compute()  # Time to compute the result
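
The same pattern on a tiny array shows that the expression is built lazily and that shape and chunk layout are known before any values are computed. The 8x8 array of ones is an assumption chosen so the result is easy to verify.

```python
import dask.array as da

x_small = da.ones((8, 8), chunks=(4, 4))  # 2x2 grid of 4x4 chunks
n_blocks = x_small.numblocks  # chunk grid, known without computing

lazy = x_small.mean(axis=0)[::2]  # still lazy: only a task graph so far
lazy_shape = lazy.shape  # shape is inferred from metadata

values = lazy.compute()  # now the chunks are actually evaluated
```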

Performance comparison
---------------------------

The following experiment was performed on a heavy personal laptop. Your performance may vary. If you attempt the NumPy version, please ensure that you have more than 4 GB of main memory.

**NumPy: 19s, Needs gigabytes of memory**

```python
%%time
import numpy as np

x = np.random.normal(10, 0.1, size=(20000, 20000))
y = x.mean(axis=0)[::100]
y

CPU times: user 19.6 s, sys: 160 ms, total: 19.8 s
Wall time: 19.7 s
```

**Dask Array: 4s, Needs megabytes of memory**

```python
%%time
import dask.array as da

x = da.random.normal(10, 0.1, size=(20000, 20000), chunks=(1000, 1000))
y = x.mean(axis=0)[::100]
y.compute()

CPU times: user 29.4 s, sys: 1.07 s, total: 30.5 s
Wall time: 4.01 s
```

**Discussion**

Notice that the Dask array computation ran in 4 seconds, but used 29.4 seconds of user CPU time. The NumPy computation ran in 19.7 seconds and used 19.6 seconds of user CPU time.

Dask finished faster but used more total CPU time: splitting the array into chunks let Dask transparently run the computation on several cores at once, at the price of some extra work per chunk.
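
The arithmetic behind that comparison, using only the timings reported above, can be sketched as follows. Treating user CPU time divided by wall time as "average busy cores" is a rough approximation that ignores system time.

```python
# Timings copied from the two runs above (seconds).
numpy_user, numpy_wall = 19.6, 19.7
dask_user, dask_wall = 29.4, 4.01

wall_speedup = numpy_wall / dask_wall  # about 4.9x faster in wall time
avg_busy_cores = dask_user / dask_wall  # about 7.3 cores busy on average
cpu_overhead = dask_user / numpy_user  # about 1.5x more CPU time in total
```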

*Questions*

*  What happens if the Dask array uses `chunks=(20000, 20000)`?
    * Will the computation run in 4 seconds?
    * How much memory will be used?
*  What happens if the Dask array uses `chunks=(25, 25)`?
    * What happens to CPU and memory?
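
A back-of-the-envelope calculation can help with these questions. The sketch below assumes 8-byte `float64` elements; `chunk_stats` is a hypothetical helper written for this example, not a Dask function.

```python
import math


def chunk_stats(shape, chunks, itemsize=8):
    """Return (number of chunks, bytes per full chunk) for a chunked array."""
    n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
    bytes_per_chunk = math.prod(chunks) * itemsize
    return n_chunks, bytes_per_chunk


shape = (20000, 20000)
stats = {
    chunks: chunk_stats(shape, chunks)
    for chunks in [(1000, 1000), (20000, 20000), (25, 25)]
}
# (1000, 1000)   -> 400 chunks of 8 MB each
# (20000, 20000) -> 1 chunk of 3.2 GB
# (25, 25)       -> 640,000 chunks of 5 kB each
```

A single 3.2 GB chunk leaves nothing to run in parallel, while hundreds of thousands of tiny chunks mean hundreds of thousands of tasks, so scheduling overhead is likely to dominate.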

Limitations
-----------

Dask Array does not implement the entire numpy interface.  Users expecting this
will be disappointed.  Notably Dask Array has the following failings:

1.  Dask does not implement all of ``np.linalg``.  This has been done by a
    number of excellent BLAS/LAPACK implementations and is the focus of
    numerous ongoing academic research projects.
2.  Dask Array does not support some operations where the resulting shape
    depends on the values of the array. For those that it does support
    (for example, masking one Dask Array with another boolean mask),
    the chunk sizes will be unknown, which may cause issues with other
    operations that need to know the chunk sizes.
3.  Dask Array does not attempt operations like ``sort`` which are notoriously
    difficult to do in parallel and are of somewhat diminished value on very
    large data (you rarely actually need a full sort).
    Often we include parallel-friendly alternatives like ``topk``.
4.  Dask development is driven by immediate need, and so many lesser-used
    functions, like ``np.sometrue``, have not been implemented purely out of
    laziness.  These would make excellent community contributions.
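
The unknown chunk sizes mentioned in item 2 can be seen on a small example. This is a sketch on a ten-element array chosen for illustration; `compute_chunk_sizes()` evaluates the mask so that the sizes become known, which costs a computation.

```python
import math

import dask.array as da

x = da.arange(10, chunks=5)
masked = x[x > 6]  # output length depends on the values in x

unknown_shape = masked.shape  # (nan,): length is not known yet

# Evaluate the mask to learn the real chunk sizes (updates the array in place).
masked.compute_chunk_sizes()
known_shape = masked.shape  # concrete now that the sizes were computed
```

Operations that need concrete chunk sizes can be applied after this step.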
    

# Dask DataFrames

Pandas is great for tabular datasets that fit in memory. Dask becomes useful when the dataset you want to analyze is larger than your machine's RAM. The demo dataset we're working with is only about 3 MB, but `dask.dataframe` scales to datasets much larger than memory.

<img src="https://examples.dask.org/_images/dask-dataframe.svg" align="right" width="28%">

The `dask.dataframe` module implements a blocked parallel `DataFrame` object that mimics a large subset of the pandas `DataFrame` API. One Dask `DataFrame` is composed of many in-memory pandas `DataFrame`s split along the index. One operation on a Dask `DataFrame` triggers many pandas operations on the constituent pandas `DataFrame`s in a way that is mindful of potential parallelism and memory constraints.

**Related Documentation**

* [DataFrame documentation](https://docs.dask.org/en/latest/dataframe.html)
* [DataFrame screencast](https://youtu.be/AT2XtFehFSQ)
* [DataFrame API](https://docs.dask.org/en/latest/dataframe-api.html)
* [DataFrame examples](https://examples.dask.org/dataframe.html)
* [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/)

Let's load the JSON files we wrote earlier into a Dask DataFrame for more efficient processing. Some of the fields are nested, so we'll need to normalize them afterwards:

In [None]:
import dask.dataframe as dd

df = dd.read_json(
    data_dir+"/*.json",
)
df

Now we can operate on the dataframe as if it were a pandas object. Note that some of our columns have `object` type because the records were not normalized. We can normalize them by calling `apply` on the affected columns:

In [None]:
import ast

df["city"] = df.address.apply(
    lambda rec: ast.literal_eval(rec)["city"], meta=("city", str)
)
df["fullname"] = df.name.apply(lambda l: " ".join(l), meta=("fullname", str))

df

In [None]:
df.nlargest(columns="age").compute()

In [None]:
df.groupby("occupation").aggregate({"age": "mean", "city": "first"}).compute()

In [None]:
client.shutdown()