<img src="http://dask.readthedocs.io/en/latest/_images/dask_horizontal.svg"
     align="right"
     width="30%"
     alt="Dask logo">


Dask DataFrames
===============

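Dask DataFrames look and behave like pandas DataFrames, but they scale to data that does not fit in memory: a Dask DataFrame is composed of many pandas DataFrames.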

<img src="https://docs.dask.org/en/stable/_images/dask-dataframe.svg"
     align="right"
     width="30%"
     alt="Dask DataFrame is composed of pandas DataFrames"/>



The API is mostly the same, with one difference: Dask operations are lazy, so when you want an actual answer, add `.compute()` to the end.

```python
# Pandas
df.groupby("name").value.mean()

# Dask DataFrame
df.groupby("name").value.mean().compute()
```

This brings the result back to your local machine, so it had better be small!

```python
df.compute()  # this would be unwise
```


## Ask for machines

In [None]:
import coiled

cluster = coiled.Cluster(
    n_workers=20,
    region="us-east-2",  # start workers close to data to minimize costs
    account="dask-tutorials",
)

client = cluster.get_client()

## Ingest Uber/Lyft Data


The NYC Taxi dataset is a timeless classic.  

Interestingly, there is a new variant.  The NYC Taxi and Limousine Commission (TLC) requires data from all ride-share services in the city of New York.  This includes private limousine services, van services, and a new category, "High Volume For Hire Vehicle" services: those that dispatch 10,000 rides per day or more.  This is a special category defined for Uber and Lyft.

In [None]:
import dask
import dask.dataframe as dd
import pandas as pd

dask.config.set({"dataframe.convert-string": True})  # use PyArrow strings by default

# df = pd.read_parquet(  # this would work if we had enough memory
df = dd.read_parquet(
    "s3://coiled-datasets/uber-lyft-tlc/",
    storage_options={"anon": True},
)
df.head()

In [None]:
df = df.persist()

Play time
---------

We start by playing around.  We assume that you understand Pandas syntax.  Please use it to compute the following quantities:

In [None]:
df.columns

How much did New Yorkers pay Uber/Lyft?  Sum the `base_passenger_fare` column.

How much did Uber/Lyft pay drivers?

Were there ever rides where Uber/Lyft paid the driver more than they collected from the passenger?  How often did this occur?

What fraction of rides had a non-zero tip?

## Broken down by carrier

If we look at the frequencies of values in the `hvfhs_license_num` column we can identify rides as Uber/Lyft or other less common carriers.

In [None]:
df.hvfhs_license_num.value_counts().compute()

Probably HV0003 is Uber, and HV0005 is Lyft.

How do the questions above break down by carrier?