DataFrames at Scale
===================

*Playing with memory and performance*

**Technical Goal:** Visualize large volumes of NYC Taxi data quickly.

**Pedagogical Goal:** Learn how to manage memory use, interpret the dashboard, and improve performance on dataframe workloads.

Visualize NYC Taxi Data
-----------------------

We can visualize the NYC Taxi data using Pandas and Datashader, a plotting library designed for large datasets.

*Datashader motivation: a scatter plot of a billion points breaks down.  It is slow to draw, and the result looks like a solid blob.  Datashader instead aggregates the points into a fixed grid of pixels and colors each pixel by what falls into it.*
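
The idea behind this kind of rendering can be sketched without Datashader itself: rather than drawing every point, aggregate the points into a fixed grid of pixel-sized bins and color each bin by its count. The snippet below is a minimal illustration using NumPy on synthetic points. It is not Datashader's implementation, and the grid size and random data are assumptions made for the example.

```python
import numpy as np

# Synthetic stand-in for pickup coordinates (not real taxi data)
rng = np.random.default_rng(0)
n_points = 200_000
x = rng.normal(loc=0.0, scale=1.0, size=n_points)
y = rng.normal(loc=0.0, scale=1.0, size=n_points)

# Aggregate the points into a fixed 300 x 200 grid of bins
width, height = 300, 200
counts, x_edges, y_edges = np.histogram2d(x, y, bins=(width, height))

# Log-scale the counts so dense and sparse regions are both visible
image = np.log1p(counts)

print(counts.shape)       # grid size, independent of n_points
print(int(counts.sum()))  # every point lands in exactly one bin
```

The size of `image` depends only on the grid, not on the number of points, which is why this approach keeps working as the dataset grows.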

In [None]:
import pandas as pd

# TODO: replace this placeholder with the path to one file of NYC Taxi data
path = "one-file-of-nyc-taxi"
df = pd.read_parquet(path)

In [None]:
import datashader

# TODO: placeholder call; replace with real Datashader rendering code
datashader.render_nice_image(df, ...)

## Exercise: Explore using Datashader pan/zoom

Datashader allows you to pan and zoom and explore the data.  It re-renders whenever you do so.  Explore the dataset now for a minute.

## Create Dask Cluster in the cloud

For this notebook we're going to use a Dask cluster running in the cloud, near the data.  It should take a couple of minutes to get these machines up with all of the right software installed.

In [None]:
import coiled

cluster = coiled.Cluster(
    package_sync=True,
    backend_options={"region": "us-east-1"},  # run the cluster near the data
)

In [None]:
from dask.distributed import Client
client = Client(cluster)  # Point Dask to use this cluster for all operations

## Exercise: Load more data

Datashader knows about Dask and will execute in parallel on very large datasets if given a Dask dataframe. 

Rewrite the pandas code above to use a Dask dataframe, and then use Datashader to render the full dataset.

In [None]:
import dask.dataframe as dd

# TODO: point this at the full dataset; the path below is incomplete
df = dd.read_parquet("s3://nyc-tlc")

In [None]:
# TODO: datashader code

Hopefully this looks impressive.

## Exercise: View the Dashboard

How do we make this faster?  The secret to performance is measurement.  Fortunately, Dask records all sorts of measurements and makes them available to us through the Dask dashboard.

Run this again, but this time observe the Dask dashboard.  What do you observe?

If you want to look through additional dashboard plots consider trying:

-  Task Stream
-  Progress
-  Workers Memory
-  Profile
-  Workers Bandwidth

We'll have a conversation about this.  Write down any observations that you'd like to share, especially about what might be slowing us down.



How fast is our current response time whenever we pan/zoom?
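
One way to put a number on response time is to time the rendering call directly. The helper below is a small sketch using only the standard library. `render` is a hypothetical stand-in for whatever Datashader call gets re-run on pan/zoom, and the `timed` helper is illustrative rather than part of Dask or Datashader.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def render():
    # Hypothetical stand-in for the real rendering call
    time.sleep(0.05)


timings = {}
with timed("render", timings):
    render()

print(f"render took {timings['render']:.3f} s")
```

Timing the same call before and after each change in this notebook gives a simple baseline to compare against what the dashboard shows.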

## Persist data in memory to improve performance

Mostly we're slowed down by reading Parquet data from S3.  Each time we pan/zoom, Dask has to read the entire dataset from S3 again.  This is slow.

We can avoid this if we `persist` the data in memory.  We do this below.  

Watch the dashboard as we run this command.  What happens?

In [None]:
df = df.persist()

## Too much Data

-  How much memory does our cluster have?  You can find this out in a few ways:
    -   The cluster memory dashboard plot
    -   The client `repr`
    -   `client.scheduler_info()`

-  How much data does our dataset take in memory?  You can find this out in a few ways:
    -   The cluster memory dashboard plot
    -   `df.memory_usage(deep=True).compute()`
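
The two questions above boil down to comparing two byte counts. The sketch below assumes the dictionary returned by `client.scheduler_info()` has a `workers` mapping whose entries carry a `memory_limit` in bytes. The example dictionary is hand-written with made-up values, and a tiny pandas frame stands in for the real dataset.

```python
import pandas as pd


def total_memory_limit(scheduler_info):
    """Sum the per-worker memory limits (in bytes) from a scheduler-info dict."""
    return sum(w["memory_limit"] for w in scheduler_info["workers"].values())


# Hand-written example shaped like scheduler info (addresses and sizes are made up)
example_info = {
    "workers": {
        "tcp://10.0.0.1:1234": {"memory_limit": 16 * 2**30},
        "tcp://10.0.0.2:1234": {"memory_limit": 16 * 2**30},
    }
}
cluster_bytes = total_memory_limit(example_info)

# Small pandas frame standing in for the dataset
pdf = pd.DataFrame({"fare": [1.5, 2.5, 3.5], "vendor": ["a", "b", "a"]})
data_bytes = int(pdf.memory_usage(deep=True).sum())

print(f"cluster: {cluster_bytes / 2**30:.1f} GiB, data: {data_bytes} bytes")
print("fits in memory:", data_bytes < cluster_bytes)
```

On a Dask dataframe, `memory_usage(deep=True)` is lazy and gives one value per column, so the per-column result needs to be computed and summed to get a comparable total.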

## Exercise: reduce memory use by sampling

Find a method in the [dask.dataframe API](https://docs.dask.org/en/stable/dataframe-api.html) to sample the dataset down by an appropriate amount.

Persist that dataset and overwrite the previous one.  Is the data small enough to pan and zoom smoothly?
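
Choosing "an appropriate amount" is mostly arithmetic: pick a fraction so that the sampled data fits comfortably in cluster memory. The sketch below uses made-up sizes, and the 50% headroom target is an assumption for illustration, not a Dask rule. `sample_fraction` is a hypothetical helper.

```python
def sample_fraction(data_bytes, cluster_bytes, target_share=0.5):
    """Fraction of rows to keep so the data uses about `target_share` of memory."""
    if data_bytes <= 0:
        raise ValueError("data_bytes must be positive")
    budget = cluster_bytes * target_share
    return min(1.0, budget / data_bytes)


# Made-up sizes: a 200 GiB dataset and 80 GiB of total cluster memory
GiB = 2**30
frac = sample_fraction(data_bytes=200 * GiB, cluster_bytes=80 * GiB)
print(frac)
```

The resulting fraction is the kind of value you would pass to the sampling method you find in the API documentation.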

## Exercise: reduce memory use by column and dtype refinement

We're probably storing lots of data in memory that we don't need.  Try removing columns and casting the remaining columns to smaller dtypes to reduce memory.

How slim can you make the dataset while still getting all of the rows?
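
To see how much column selection and dtype casting can save, here is a minimal sketch on a synthetic pandas frame. The column names and values are illustrative assumptions, not the actual taxi schema. The same `[...]` selection and `astype` calls exist on Dask dataframes.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for taxi data; column names are illustrative only
rng = np.random.default_rng(0)
n = 100_000
pdf = pd.DataFrame({
    "x": rng.uniform(-74.1, -73.7, n),         # float64
    "y": rng.uniform(40.6, 40.9, n),           # float64
    "passenger_count": rng.integers(1, 7, n),  # integer column we don't plot
    "note": ["unused text"] * n,               # text column we don't plot
})
before = int(pdf.memory_usage(deep=True).sum())

# Keep only the columns needed for plotting and shrink their dtypes
slim = pdf[["x", "y"]].astype("float32")
after = int(slim.memory_usage(deep=True).sum())

print(f"before: {before / 1e6:.1f} MB, after: {after / 1e6:.1f} MB")
```

`float32` keeps roughly seven significant digits, which is usually enough for plotting coordinates, but it is worth checking that the precision is acceptable for your own use.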

## Exercise: scale

We're probably good at this point, but another way to handle larger datasets is to scale your cluster.  

Inspect the `cluster.scale` method.  Then scale up your cluster to twice as many machines.  How long does this take?

*Tip: use the `Cluster Map` or `Workers` dashboard plot to watch the number of workers change live.*

In [None]:
cluster.scale?

Run your computation again.  Does this make things go any faster?  

Why or why not?

## Shut down your cluster

Shutting down is the polite thing to do, although the cluster will shut itself off after 20 minutes regardless.

In [None]:
cluster.close()