# Getting up to speed with Dask

## Part 0: Getting data

We are using the [NYC Taxi data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page), which contains several publicly available datasets about taxi and ride-share trips taken in New York City.

Data is available from 2009 onward (the file listing below runs through mid-2020), but for this exercise we will use 2019 data only. Take care when using other years, as the schemas in the CSV files changed over time. Most notably, in mid-2016 the latitude and longitude fields were replaced with more generic taxi zones for privacy reasons.
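Because the column layout changed over the years, it can help to compare CSV headers before combining files from different periods. The following is a minimal sketch using only the standard library. The helper names `read_header` and `group_by_header` are hypothetical, and the sketch assumes the files of interest are already on local disk.

```python
import csv
from collections import defaultdict
from pathlib import Path


def read_header(path):
    """Return the column names from the first row of a CSV file."""
    with open(path, newline="") as fh:
        return tuple(name.strip() for name in next(csv.reader(fh)))


def group_by_header(paths):
    """Group file names by their exact header so schema changes stand out."""
    groups = defaultdict(list)
    for path in paths:
        groups[read_header(path)].append(Path(path).name)
    return dict(groups)
```

If the result has more than one key, the files do not all share a schema and probably should not be read together without mapping columns first.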

Files are hosted in this S3 location: `s3://nyc-tlc/trip data`. Both Dask and pandas support reading directly from S3, with some slight nuances, but for simplicity we will download the data for the laptop examples (Parts 1 & 2). This requires about 8 GB of disk space.
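For reference, reading straight from S3 instead of downloading might look like the sketch below. It assumes `dask` (with the dataframe extras) and `s3fs` are installed, and that the bucket is still publicly readable at this location. The helper names `taxi_glob` and `read_taxi_year` are hypothetical.

```python
def taxi_glob(year, bucket="nyc-tlc", prefix="trip data"):
    """Build the S3 glob pattern for one year of yellow-taxi CSV files."""
    return f"s3://{bucket}/{prefix}/yellow_tripdata_{year}-*.csv"


def read_taxi_year(year):
    """Lazily open one year of trips as a Dask DataFrame without downloading."""
    # Imported here so taxi_glob stays usable without Dask installed.
    import dask.dataframe as dd

    # storage_options is forwarded to s3fs; anon=True requests
    # unauthenticated access (an assumption about the bucket's permissions).
    return dd.read_csv(taxi_glob(year), storage_options={"anon": True})
```

Calling `read_taxi_year(2019)` only builds a lazy task graph; data is fetched from S3 when a computation such as `.head()` or `.compute()` is triggered.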

In [1]:
import s3fs
import numpy as np
from pathlib import Path

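Note: if this cell raises `ModuleNotFoundError: No module named 's3fs'`, install the package first (for example, `pip install s3fs`) and re-run it.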

In [2]:
# change this path if you don't want it here
data_path = Path('data')
data_path.mkdir(exist_ok=True)

In [3]:
fs = s3fs.S3FileSystem(anon=True)

files = fs.glob('s3://nyc-tlc/trip data/yellow_tripdata_*')
len(files), files[:5], files[-5:]

(138,
 ['nyc-tlc/trip data/yellow_tripdata_2009-01.csv',
  'nyc-tlc/trip data/yellow_tripdata_2009-02.csv',
  'nyc-tlc/trip data/yellow_tripdata_2009-03.csv',
  'nyc-tlc/trip data/yellow_tripdata_2009-04.csv',
  'nyc-tlc/trip data/yellow_tripdata_2009-05.csv'],
 ['nyc-tlc/trip data/yellow_tripdata_2020-02.csv',
  'nyc-tlc/trip data/yellow_tripdata_2020-03.csv',
  'nyc-tlc/trip data/yellow_tripdata_2020-04.csv',
  'nyc-tlc/trip data/yellow_tripdata_2020-05.csv',
  'nyc-tlc/trip data/yellow_tripdata_2020-06.csv'])

There is one file per month. The twelve files for 2019 take up approximately 8 GB on disk.

In [4]:
files_2019 = fs.glob('s3://nyc-tlc/trip data/yellow_tripdata_2019-*.csv')
len(files_2019), np.sum([fs.du(f) for f in files_2019]) / 1e9

(12, 7.799242459)

In [5]:
%%time

def download_file(f):
    # copy one remote file into data_path, keeping its original file name
    fs.get(f, str(data_path / Path(f).name))

for f in files_2019:
    download_file(f)

CPU times: user 29 s, sys: 24.5 s, total: 53.5 s
Wall time: 10min 53s
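Since the download takes several minutes, it can be convenient to skip files that are already on disk when re-running the notebook. The following is a minimal sketch, not part of the original workflow. The name `download_missing` is hypothetical, and `fetch` stands for any callable that copies one file, such as `fs.get` above.

```python
from pathlib import Path


def download_missing(remote_files, target_dir, fetch):
    """Fetch only the files that are not already present in target_dir.

    fetch(remote, local) is any callable that copies a single file.
    Returns the names of the files that were actually downloaded.
    """
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    downloaded = []
    for remote in remote_files:
        local = target_dir / Path(remote).name
        if local.exists():
            continue  # already on disk; skip the transfer
        fetch(remote, str(local))
        downloaded.append(local.name)
    return downloaded
```

A call such as `download_missing(files_2019, data_path, fs.get)` would then transfer only what is missing. Note that this check looks at file existence only, so a partially downloaded file would be treated as complete.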


# Dask sneak peek!

You can parallelize this file copy using [dask.bag](https://docs.dask.org/en/latest/bag.html).

In [6]:
import dask.bag as db

In [7]:
%%time
_ = db.from_sequence(files_2019).map(download_file).compute()

CPU times: user 153 ms, sys: 110 ms, total: 263 ms
Wall time: 5min 11s
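To unpack the one-liner above: `from_sequence` turns the list into a partitioned bag, `map` lazily attaches a function to each element, and `compute` runs everything and returns a list of results. Dask bags default to a process-based scheduler. Because downloading is mostly I/O-bound, a thread-based scheduler is a reasonable alternative to try. The sketch below wraps that pattern in a hypothetical helper, `run_in_parallel`, and assumes `dask` is installed.

```python
def run_in_parallel(func, items, scheduler="threads"):
    """Apply func to every item with dask.bag and return the results as a list."""
    import dask.bag as db  # assumes dask is installed

    # Build the bag lazily; nothing runs until compute() is called.
    bag = db.from_sequence(items).map(func)

    # scheduler="threads" avoids pickling func and its arguments;
    # pass "processes" to use separate worker processes instead.
    return bag.compute(scheduler=scheduler)
```

With the names defined earlier, `run_in_parallel(download_file, files_2019)` would perform the same parallel copy. Whether threads or processes are faster here depends on the network and the machine, so it is worth timing both.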
