# Introduction to scalable computing with Dask

## What is Dask?

* Parallel and distributed computing (diagrammatic representation of both)
* Various APIs, distributed scheduler
* We'll look at Dask DataFrame

## Dask DataFrame API

* Parallel pandas
* Discuss partitions
* Discuss paradigm shift with mean and median example
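
The mean and median example can be sketched without Dask at all. The snippet below is a minimal illustration, assuming three made-up "partitions" modeled as plain Python lists (the values are hypothetical, not taken from the airline dataset). It shows that a mean can be assembled exactly from per-partition sums and counts, while combining per-partition medians generally does not give the true median.

```python
from statistics import mean, median

# Hypothetical data split into three "partitions", the way Dask splits a column.
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 100]]
all_values = [v for part in partitions for v in part]

# Mean: each partition reports (sum, count); combining them is exact.
partials = [(sum(part), len(part)) for part in partitions]
total, count = map(sum, zip(*partials))
parallel_mean = total / count

# Median: the median of per-partition medians is not the global median in general.
median_of_medians = median(median(part) for part in partitions)
true_median = median(all_values)

print(parallel_mean, mean(all_values))
print(median_of_medians, true_median)  # 4.5 versus 5
```

This is the paradigm shift: some reductions decompose cleanly across partitions, while others need all the data together or an approximate algorithm.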

## Dask cluster

* Scheduler, workers, dashboard

## Parallelize previous pandas workflows

- LocalCluster?
- Parallel read (CSV vs Parquet)
- Previous exercises

In [1]:
import json
import gcsfs

In [2]:
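# Load the GCS credentials; the context manager closes the file afterwards
with open("prep/credentials.json") as f:
    token = json.load(f)
fs = gcsfs.GCSFileSystem(token=token)
storage_options = {"token": token}  # passed to Dask readers below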

In [3]:
from dask.distributed import LocalCluster, Client

In [4]:
cluster = LocalCluster(n_workers=2)

In [5]:
client = Client(cluster)

In [6]:
client

Client
  Connection method: Cluster object
  Cluster type: distributed.LocalCluster
  Dashboard: http://127.0.0.1:8787/status

LocalCluster
  Dashboard: http://127.0.0.1:8787/status
  Workers: 2, Total threads: 4, Total memory: 14.65 GiB
  Status: running, Using processes: True

Scheduler
  Comm: tcp://127.0.0.1:46167
  Dashboard: http://127.0.0.1:8787/status
  Workers: 2, Total threads: 4, Total memory: 14.65 GiB
  Started: Just now

Workers
  Comm: tcp://127.0.0.1:37227, Total threads: 2, Memory: 7.32 GiB
    Dashboard: http://127.0.0.1:34711/status
    Nanny: tcp://127.0.0.1:35793
    Local directory: /tmp/dask-worker-space/worker-qmf96x9w
  Comm: tcp://127.0.0.1:33997, Total threads: 2, Memory: 7.32 GiB
    Dashboard: http://127.0.0.1:45995/status
    Nanny: tcp://127.0.0.1:33373
    Local directory: /tmp/dask-worker-space/worker-ivvq2pue

Dashboard through the Nebari proxy: https://nebari.quansight.dev/user/login_email/proxy/8787/

### Read the entire dataset

In [7]:
import dask.dataframe as dd

In [9]:
%%time

ddf = dd.read_csv("gcs://quansight-datasets/airline-ontime-performance/csv/*",
                  storage_options=storage_options)

In [None]:
ddf["FLIGHTS"]  # selecting a column is lazy and returns a Dask Series

In [10]:
ddf

Dask DataFrame Structure (npartitions=624, 109 columns; values display as ... until computed)
Columns: YEAR, QUARTER, MONTH, DAY_OF_MONTH, DAY_OF_WEEK, FL_DATE, OP_UNIQUE_CARRIER, OP_CARRIER_AIRLINE_ID, OP_CARRIER, TAIL_NUM, ..., DIV5_TAIL_NUM
Inferred dtypes: a mix of int64, float64, and object (TAIL_NUM is inferred as float64)


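Lazy evaluation: displaying `ddf` shows only the column names, inferred dtypes, and partition count. No data is loaded until a result is requested, for example with `head()` or `compute()`.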

In [11]:
ddf.head() # Will raise an error

Key:       ('read-csv-ec8707a5581123dabf6af00e0c05775e', 0)
Function:  execute_task
args:      ((subgraph_callable-5e719f03-45b1-4027-9609-3104273a655c, [(<function read_block_from_file at 0x7f39bb144dc0>, <OpenFile 'quansight-datasets/airline-ontime-performance/csv/bts_airline_ontime_performance_april_2003.csv'>, 0, 65257922, b'\n'), None, True, False]))
kwargs:    {}
Exception: 'ValueError(\'Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.\\n\\n+----------+--------+----------+\\n| Column   | Found  | Expected |\\n+----------+--------+----------+\\n| TAIL_NUM | object | float64  |\\n+----------+--------+----------+\\n\\nThe following columns also raised exceptions on conversion:\\n\\n- TAIL_NUM\\n  ValueError("could not convert string to float: \\\'N050AA\\\'")\\n\\nUsually this is due to dask\\\'s dtype inference failing, and\\n*may* be fixed by specifying dtypes manually by adding:\\n\\ndtype={\\\'TAIL_NUM\\\': \\\'object\\\'}\\n\\nto the call to `read_csv`/`read_table`.\')

ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.

+----------+--------+----------+
| Column   | Found  | Expected |
+----------+--------+----------+
| TAIL_NUM | object | float64  |
+----------+--------+----------+

The following columns also raised exceptions on conversion:

- TAIL_NUM
  ValueError("could not convert string to float: 'N050AA'")

Usually this is due to dask's dtype inference failing, and
*may* be fixed by specifying dtypes manually by adding:

dtype={'TAIL_NUM': 'object'}

to the call to `read_csv`/`read_table`.

### Longest flight (distance) across the dataset

In [14]:
ddf["DISTANCE"].max()

dd.Scalar<series-..., dtype=float64>

In [None]:
# takes ~8 mins on a Medium instance

ddf["DISTANCE"].max().compute()

### Specify `dtypes`

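Dask infers the dtypes from a small sample at the start of the data, because it's "lazy" and does not read the entire dataset. If later rows hold values of a different type, as with `TAIL_NUM` above, the inferred dtype is wrong and the read fails.

Best practice: pass explicit dtypes to `read_csv`.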
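
The failure mode can be illustrated with plain pandas. This is a sketch under an assumption: that the sampled rows had an empty `TAIL_NUM` while later rows contain strings such as `N050AA`. The two CSV snippets below are made up to mimic that situation.

```python
import io
import pandas as pd

# Hypothetical chunks: the sampled rows have no tail number, later rows do.
first_chunk = "YEAR,TAIL_NUM\n2003,\n2003,\n"
later_chunk = "YEAR,TAIL_NUM\n2003,N050AA\n"

# Inference on the sample sees only missing values and picks float64.
inferred = pd.read_csv(io.StringIO(first_chunk)).dtypes["TAIL_NUM"]
# The later chunk contains strings, so the float64 guess cannot hold.
actual = pd.read_csv(io.StringIO(later_chunk)).dtypes["TAIL_NUM"]
print(inferred, actual)

# An explicit dtype makes every chunk agree, whatever the sample contained.
explicit = {"TAIL_NUM": "object"}
fixed = pd.read_csv(io.StringIO(first_chunk), dtype=explicit).dtypes["TAIL_NUM"]
print(fixed)
```

Passing the same kind of mapping through the `dtype` argument of `dd.read_csv` applies it to every partition.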

In [12]:
with open('prep/dtypes.json', 'r') as f:
    dtypes = json.load(f)

In [13]:
ddf = dd.read_csv("gcs://quansight-datasets/airline-ontime-performance/csv/*",
                  storage_options=storage_options,
                  dtype=dtypes)  # the keyword is `dtype`, not `dtypes`

In [None]:
ddf.head()

In [16]:
client.close()
cluster.close()

## Dask Gateway

* What is it?
* How it works in Nebari (diagram?)

- Intro to dashboard
- First time sign in

In [12]:
import dask_gateway

gateway = dask_gateway.Gateway()

In [13]:
options = gateway.cluster_options(use_local_defaults=False)
options

VBox(children=(HTML(value='<h2>Cluster Options</h2>'), GridBox(children=(HTML(value="<p style='font-weight: bo…

In [14]:
cluster = gateway.new_cluster(options)
cluster

VBox(children=(HTML(value='<h2>GatewayCluster</h2>'), HBox(children=(HTML(value='\n<div>\n<style scoped>\n    …

In [15]:
client = cluster.get_client()
client

Connection method: Cluster object
Cluster type: dask_gateway.GatewayCluster
Dashboard: https://nebari.quansight.dev/gateway/clusters/dev.813339dafcb04264b19f5dcd23540787/status


2023-04-08 08:36:35,618 - distributed.client - ERROR - Failed to reconnect to scheduler after 30.00 seconds, closing client


First time sign-in

## Adaptive scaling

### Always make sure to shut down your cluster

In [None]:
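client.close()      # disconnect this notebook from the scheduler
cluster.shutdown()  # stop the Gateway cluster so it releases its resources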

## Other Dask APIs

* Dask Array
* Dask Bag
* Dask Delayed and Futures
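
As a taste of the lower-level APIs, here is a minimal Dask Delayed sketch. The functions `load` and `total` are hypothetical placeholders for real work. Decorating them with `dask.delayed` turns calls into graph nodes that only run on `compute()`.

```python
import dask

@dask.delayed
def load(n):
    # Placeholder for an expensive step, such as reading one file.
    return list(range(n))

@dask.delayed
def total(parts):
    # Combine the results of all the load steps.
    return sum(sum(p) for p in parts)

# Nothing runs here; this only builds a graph of three loads and one total.
graph = total([load(n) for n in (3, 4, 5)])

# The independent load tasks can run in parallel when the graph is computed.
result = graph.compute()
print(result)  # 19
```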

Check out the official Dask tutorial to learn more!