# Scaling Data Science with Dask

## pandas

In [1]:
import pandas as pd

**Read 900MB dataset:**

In [2]:
df = pd.read_parquet("data-900mb/")

In [3]:
df.head()

timestamp,id,name,x,y
2000-01-01 00:00:00,989,Bob,-0.298115,-0.570104
2000-01-01 00:00:30,947,Patricia,-0.998487,-0.52243
2000-01-01 00:01:00,1013,Victor,0.317766,-0.018864
2000-01-01 00:01:30,984,Zelda,-0.606786,0.076914
2000-01-01 00:02:00,1017,Zelda,-0.67582,-0.152123


In [4]:
df.x.sum()

-805.5547844604573

**Read 52GB dataset:**

In [None]:
# df = pd.read_parquet("data-52gb/")  # kernel restarts, most likely because the data does not fit in memory
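A rough size check suggests why the read above fails. The sketch below rests on two assumptions that the notebook does not state: that "52GB" means 52 × 10^9 bytes on disk, and that the pandas session runs on the same machine as the local Dask cluster in the next section, which reports 16.00 GiB of total memory. The helper name `fits_in_memory` is hypothetical.

```python
# Rough check: can a dataset of a given size be loaded into RAM in one piece?
# Assumptions (not stated in the notebook): "52GB" means 52e9 bytes on disk,
# and the machine has the 16 GiB reported by the local Dask cluster.

GIB = 2**30  # bytes in one gibibyte


def fits_in_memory(dataset_bytes: float, memory_bytes: float) -> bool:
    """Return True if the raw dataset size is no larger than available memory."""
    return dataset_bytes <= memory_bytes


dataset_bytes = 52e9      # on-disk size of data-52gb/ (assumed decimal GB)
memory_bytes = 16 * GIB   # total memory of the machine (assumed)

ratio = dataset_bytes / memory_bytes
print(f"dataset / memory = {ratio:.2f}")
print("fits in memory:", fits_in_memory(dataset_bytes, memory_bytes))
```

Parquet files are usually compressed, so the in-memory footprint is typically larger than the on-disk size, which only widens the gap.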

## Dask

In [1]:
import dask.dataframe as dd

from dask.distributed import Client

In [2]:
client = Client()

In [3]:
client

Client
Connection method: Cluster object
Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status

Cluster
Workers: 4, Total threads: 8, Total memory: 16.00 GiB
Status: running, Using processes: True

Scheduler
Comm: tcp://127.0.0.1:59483
Dashboard: http://127.0.0.1:8787/status
Started: Just now

Workers
Comm,Total threads,Memory,Dashboard,Nanny,Local directory
tcp://127.0.0.1:59500,2,4.00 GiB,http://127.0.0.1:59503/status,tcp://127.0.0.1:59486,/Users/pavithra/Developer/Confs/scalable-data-science-with-dask/pyladies-dublin/dask-worker-space/worker-s1ai_w_5
tcp://127.0.0.1:59501,2,4.00 GiB,http://127.0.0.1:59502/status,tcp://127.0.0.1:59489,/Users/pavithra/Developer/Confs/scalable-data-science-with-dask/pyladies-dublin/dask-worker-space/worker-85rh9htu
tcp://127.0.0.1:59494,2,4.00 GiB,http://127.0.0.1:59495/status,tcp://127.0.0.1:59487,/Users/pavithra/Developer/Confs/scalable-data-science-with-dask/pyladies-dublin/dask-worker-space/worker-f_acv7tv
tcp://127.0.0.1:59497,2,4.00 GiB,http://127.0.0.1:59498/status,tcp://127.0.0.1:59488,/Users/pavithra/Developer/Confs/scalable-data-science-with-dask/pyladies-dublin/dask-worker-space/worker-a7ah3dfk


Open the dashboard at http://127.0.0.1:8787/status to watch the computation run.

In [4]:
ddf = dd.read_parquet("data-52gb/")

In [5]:
ddf.head()

timestamp,id,name,x,y
1980-01-01 00:00:00,1001,Ingrid,0.036391,0.700839
1980-01-01 00:00:01,1009,Alice,-0.155686,0.154424
1980-01-01 00:00:02,1075,Wendy,0.813202,-0.366577
1980-01-01 00:00:03,946,Xavier,-0.156739,-0.798744
1980-01-01 00:00:04,999,Ingrid,-0.749491,-0.095637


In [6]:
ddf.x.sum().compute()

21408.524283867402
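Conceptually, a reduction such as the sum above can be computed one partition at a time, with the partial results combined at the end, so no single process has to hold the whole dataset. The pure-Python sketch below illustrates that idea with a thread pool standing in for the workers. It is not Dask's implementation, and `partition` and `partitioned_sum` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator, List, Sequence


def partition(values: Sequence[float], size: int) -> Iterator[Sequence[float]]:
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


def partitioned_sum(values: Sequence[float], size: int, n_workers: int = 4) -> float:
    """Sum each partition on a worker thread, then combine the partial sums."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums: List[float] = list(pool.map(sum, partition(values, size)))
    return sum(partial_sums)


# Toy data standing in for one column of a large dataset.
x = list(range(1, 101))
total = partitioned_sum(x, size=10)
print(total)  # 5050, the same result as sum(x)
```

In Dask, `ddf.x.sum()` only describes the computation; calling `.compute()` is what triggers the work on the cluster.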

In [7]:
client.close()

## Coiled

To follow along, sign up for Coiled at https://cloud.coiled.io

In [9]:
import coiled

In [10]:
cluster = coiled.Cluster(
    name="pavithraes/scaling-with-dask",
    n_workers=20,
)

Token: ········································
Save credentials for next time? [Y/n]: n

Found software environment build
Created FW rules: coiled-dask-pavithr13-68258-firewall
Created scheduler VM: coiled-dask-pavithr13-68258-scheduler (type: t3a.medium, ip: ['3.215.174.196'])


In [11]:
client = Client(cluster)
client


+-------------+-----------+-----------+-----------+
| Package     | client    | scheduler | workers   |
+-------------+-----------+-----------+-----------+
| blosc       | None      | 1.10.2    | 1.10.2    |
| dask        | 2021.11.1 | 2021.10.0 | 2021.10.0 |
| distributed | 2021.11.1 | 2021.10.0 | 2021.10.0 |
| lz4         | None      | 3.1.3     | 3.1.3     |
| numpy       | 1.21.4    | 1.21.3    | 1.21.3    |
| toolz       | 0.11.2    | 0.11.1    | 0.11.1    |
+-------------+-----------+-----------+-----------+
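The table above lists packages whose versions differ between the client, the scheduler, and the workers. A small sketch of how such a table can be triaged follows. The version strings are copied from the table; the helper name `classify` is hypothetical, and this is not how `distributed` produces its report.

```python
from typing import Dict, List, Optional, Tuple

# (client, scheduler, workers) versions copied from the table above.
versions: Dict[str, Tuple[Optional[str], Optional[str], Optional[str]]] = {
    "blosc": (None, "1.10.2", "1.10.2"),
    "dask": ("2021.11.1", "2021.10.0", "2021.10.0"),
    "distributed": ("2021.11.1", "2021.10.0", "2021.10.0"),
    "lz4": (None, "3.1.3", "3.1.3"),
    "numpy": ("1.21.4", "1.21.3", "1.21.3"),
    "toolz": ("0.11.2", "0.11.1", "0.11.1"),
}


def classify(table) -> Tuple[List[str], List[str]]:
    """Split packages into those missing somewhere and those with differing versions."""
    missing, differing = [], []
    for name, installed in sorted(table.items()):
        if any(v is None for v in installed):
            missing.append(name)      # not installed on at least one side
        elif len(set(installed)) > 1:
            differing.append(name)    # installed everywhere, versions disagree
    return missing, differing


missing, differing = classify(versions)
print("not installed everywhere:", missing)
print("version differs:", differing)
```

Here the client runs newer `dask` and `distributed` releases than the cluster, and aligning the local environment with the cluster's software environment is the usual remedy. On a live client, `client.get_versions(check=True)` performs a comparable check.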


Client
Connection method: Cluster object
Cluster type: coiled.Cluster
Dashboard: http://3.215.174.196:8787

Cluster
Workers: 20, Total threads: 40, Total memory: 149.55 GiB

Scheduler
Comm: tls://10.4.11.86:8786
Dashboard: http://10.4.11.86:8787/status
Started: Just now

Workers
Comm,Total threads,Memory,Dashboard,Nanny,Local directory
tls://10.4.26.52:39773,2,7.48 GiB,http://10.4.26.52:41229/status,tls://10.4.26.52:36823,/dask-worker-space/worker-k16uwsa1
tls://10.4.25.42:33757,2,7.48 GiB,http://10.4.25.42:33815/status,tls://10.4.25.42:46099,/dask-worker-space/worker-q28739o_
tls://10.4.21.120:33019,2,7.48 GiB,http://10.4.21.120:33033/status,tls://10.4.21.120:33659,/dask-worker-space/worker-1vs3linm
tls://10.4.26.188:43709,2,7.48 GiB,http://10.4.26.188:45677/status,tls://10.4.26.188:36043,/dask-worker-space/worker-3havrdyv
tls://10.4.29.92:38173,2,7.48 GiB,http://10.4.29.92:38733/status,tls://10.4.29.92:35201,/dask-worker-space/worker-g85lzyqw
tls://10.4.21.3:32811,2,7.48 GiB,http://10.4.21.3:46489/status,tls://10.4.21.3:39753,/dask-worker-space/worker-4yizuuok
tls://10.4.27.236:37457,2,7.48 GiB,http://10.4.27.236:34323/status,tls://10.4.27.236:46237,/dask-worker-space/worker-pdwxtayc
tls://10.4.20.207:37427,2,7.48 GiB,http://10.4.20.207:36545/status,tls://10.4.20.207:43765,/dask-worker-space/worker-7l8o5068
tls://10.4.30.38:42315,2,7.48 GiB,http://10.4.30.38:46579/status,tls://10.4.30.38:43045,/dask-worker-space/worker-t6sd33hc
tls://10.4.22.37:46095,2,7.48 GiB,http://10.4.22.37:40541/status,tls://10.4.22.37:42643,/dask-worker-space/worker-vqb9mf6h
tls://10.4.30.246:35143,2,7.48 GiB,http://10.4.30.246:43207/status,tls://10.4.30.246:38597,/dask-worker-space/worker-ryvjvycv
tls://10.4.27.165:46383,2,7.48 GiB,http://10.4.27.165:43831/status,tls://10.4.27.165:39685,/dask-worker-space/worker-1tp21tal
tls://10.4.29.138:37207,2,7.48 GiB,http://10.4.29.138:33081/status,tls://10.4.29.138:40277,/dask-worker-space/worker-a6_fx0im
tls://10.4.20.151:44737,2,7.48 GiB,http://10.4.20.151:45407/status,tls://10.4.20.151:44627,/dask-worker-space/worker-7tvl8vxg
tls://10.4.21.7:35477,2,7.48 GiB,http://10.4.21.7:39885/status,tls://10.4.21.7:35669,/dask-worker-space/worker-y8lquchx
tls://10.4.24.105:40219,2,7.48 GiB,http://10.4.24.105:37565/status,tls://10.4.24.105:36267,/dask-worker-space/worker-9nm3umr_
tls://10.4.23.38:36079,2,7.48 GiB,http://10.4.23.38:33689/status,tls://10.4.23.38:41981,/dask-worker-space/worker-qxlcch5y
tls://10.4.31.8:45119,2,7.48 GiB,http://10.4.31.8:34103/status,tls://10.4.31.8:46285,/dask-worker-space/worker-6jmgx82v
tls://10.4.27.122:37583,2,7.48 GiB,http://10.4.27.122:35045/status,tls://10.4.27.122:35399,/dask-worker-space/worker-nk8rusfz
tls://10.4.18.17:35045,2,7.48 GiB,http://10.4.18.17:39051/status,tls://10.4.18.17:37821,/dask-worker-space/worker-sqmg2q3_


In [12]:
ddf = dd.read_parquet("s3://coiled-datasets/synthetic-time-series-data/1TB/*")

In [13]:
ddf.head()

timestamp,col1,col10,col100,col101,col102,col103,col104,col105,col106,col107,...,col90,col91,col92,col93,col94,col95,col96,col97,col98,col99
1990-01-01 00:00:00,1014,1000,1003,1006,1078,1025,1032,979,950,966,...,1045,988,1006,1002,961,978,1011,1001,1047,1046
1990-01-01 00:00:01,962,984,1015,988,1016,977,1000,1009,1015,985,...,981,1002,1043,1022,1028,1034,990,1009,942,997
1990-01-01 00:00:02,952,998,954,1009,1033,973,1019,1008,967,1007,...,1003,997,1021,968,1020,1026,992,955,1026,987
1990-01-01 00:00:03,1000,1019,940,963,958,1056,977,958,1011,1012,...,977,1005,1026,1016,965,921,959,1008,1004,999
1990-01-01 00:00:04,990,986,985,1017,1041,973,1005,1004,994,972,...,948,955,987,943,1047,965,1018,1002,985,976


In [14]:
ddf.col1.sum().compute()

946597895493

In [15]:
client.close()