# Deploy a Dask cluster with AWS CloudFormation

## Objectives

Create and deploy a Dask cluster using AWS CloudFormation (success, see file blabla). The current template creates:
- 1 dask-scheduler and 3 dask-workers.
- A security group that allows a local machine to connect to the cluster (see the other notebook).
- An EC2 role, attached to all instances, that allows S3 access.
- An S3 bucket in which we can store CSV files.

Things you have to do manually:
- Upload the CSV files to the S3 bucket.
- Create a user with programmatic access and configure the AWS CLI to use its keys. This user must have at least read rights on S3.

### TODO
- The scheduler can be a t2.micro, but the workers should be larger.
- Use parameters to select the instance type for the scheduler and the workers.
- To get 3 workers, the template of one worker is copied 3 times. It would probably be better to use an Auto Scaling group and define the number of workers there.
- Can the workers be given only internal IPs instead of public IPs?
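
The parameter item in this TODO list could be driven from Python when the stack is created. The sketch below only builds a `Parameters` list in the shape that CloudFormation's `create_stack` call accepts in boto3. The parameter keys `SchedulerInstanceType` and `WorkerInstanceType` and the worker type `t2.large` are hypothetical; they would have to match parameters actually declared in the template.

```python
def build_stack_parameters(scheduler_type="t2.micro", worker_type="t2.large"):
    """Build a Parameters list in the ParameterKey/ParameterValue format.

    The key names are assumptions for illustration; the template must
    declare parameters with the same names for this to have any effect.
    """
    return [
        {"ParameterKey": "SchedulerInstanceType", "ParameterValue": scheduler_type},
        {"ParameterKey": "WorkerInstanceType", "ParameterValue": worker_type},
    ]


params = build_stack_parameters()
print(params)
```

With boto3 installed and credentials configured, a list like this would typically be passed as the `Parameters=` argument of `create_stack`, next to `StackName` and `TemplateBody`.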

### Blogs to do
- `https://docs.coiled.io/blog/tpch.html`
- `https://sujitpal.blogspot.com/2020/06/dask-mappartitions-and-almost.html`
- `https://medium.com/@shubham27/introduction-to-dask-insights-on-nyc-parking-dataset-using-dask-b34019aa44b`
- `https://registry.opendata.aws/nyc-tlc-trip-records-pds/`: this page links to an example that uses Dask with Fargate, which might be interesting

### CloudFormation Template

Execute the cell below if you want to see the entire CloudFormation template in the notebook.

In [None]:
%load cf-dask-cluster.yaml

You can use this template to create a CloudFormation stack on AWS.

Once the stack is complete and all EC2 instances are running, take the public IP of the scheduler and paste it into the code snippet below.

Everything should work smoothly (it did for me).
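
Copying the IP by hand can be avoided if the template exposes it as a stack output. This is a minimal sketch, assuming an output key named `SchedulerPublicIp` (hypothetical; use whatever key the template defines) and an `outputs` list shaped like the `Outputs` field that CloudFormation returns from `describe_stacks`. The ports 8786 and 8787 are the ones used in this notebook.

```python
def scheduler_urls(outputs, key="SchedulerPublicIp"):
    """Return (scheduler address, dashboard URL) from a list of stack outputs.

    `key` is an assumed output name; it must match the template's Outputs section.
    """
    by_key = {o["OutputKey"]: o["OutputValue"] for o in outputs}
    ip = by_key[key]
    return f"tcp://{ip}:8786", f"http://{ip}:8787/status"


# Example input, using the scheduler IP that appears in this notebook
example_outputs = [{"OutputKey": "SchedulerPublicIp", "OutputValue": "54.210.51.198"}]
address, dashboard = scheduler_urls(example_outputs)
print(address)
print(dashboard)
```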

In [3]:
import dask
from dask.distributed import Client

# Copy the scheduler's public IP from the Outputs section of the CloudFormation stack
ip = "54.210.51.198"
address = f"tcp://{ip}:8786"
dashboard = f"http://{ip}:8787/status"

print(f"Use the link below to connect to the cluster dashboard:\n{dashboard}")

print(address)
client = Client(address=address)

client

Use the link below to connect to the cluster dashboard:
http://54.210.51.198:8787/status
tcp://54.210.51.198:8786


**Client**
- Connection method: Direct
- Dashboard: http://54.210.51.198:8787/status

**Scheduler**
- Comm: tcp://172.31.43.62:8786
- Dashboard: http://172.31.43.62:8787/status
- Workers: 3
- Total threads: 6
- Total memory: 11.46 GiB
- Started: 1 minute ago

**Worker** tcp://172.31.88.170:46141
- Dashboard: http://172.31.88.170:41953/status
- Nanny: tcp://172.31.88.170:44471
- Total threads: 2
- Memory: 3.82 GiB
- Local directory: /tmp/dask-scratch-space/worker-yvl4umg8
- Tasks executing / in memory / ready / in flight: (no values shown)
- CPU usage: 4.0%
- Last seen: Just now
- Memory usage: 141.98 MiB
- Spilled bytes: 0 B
- Read bytes: 258.28393307733927 B
- Write bytes: 1.45 kiB

**Worker** tcp://172.31.93.103:45733
- Dashboard: http://172.31.93.103:38193/status
- Nanny: tcp://172.31.93.103:32973
- Total threads: 2
- Memory: 3.82 GiB
- Local directory: /tmp/dask-scratch-space/worker-f4a_m3_m
- Tasks executing / in memory / ready / in flight: (no values shown)
- CPU usage: 4.0%
- Last seen: Just now
- Memory usage: 142.03 MiB
- Spilled bytes: 0 B
- Read bytes: 466.21698857570317 B
- Write bytes: 1.58 kiB

**Worker** tcp://172.31.94.183:45021
- Dashboard: http://172.31.94.183:44643/status
- Nanny: tcp://172.31.94.183:33973
- Total threads: 2
- Memory: 3.82 GiB
- Local directory: /tmp/dask-scratch-space/worker-0u90qdve
- Tasks executing / in memory / ready / in flight: (no values shown)
- CPU usage: 2.0%
- Last seen: Just now
- Memory usage: 140.62 MiB
- Spilled bytes: 0 B
- Read bytes: 257.33865458338255 B
- Write bytes: 1.44 kiB


In [1]:
import dask.array as da

a_da = da.ones(10, chunks=5)
a_da

| | Array | Chunk |
|---|---|---|
| Bytes | 80 B | 40 B |
| Shape | (10,) | (5,) |
| Dask graph | 2 chunks in 1 graph layer | 2 chunks in 1 graph layer |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |


In [2]:
a_da_sum = a_da.sum()
a_da_sum

| | Array | Chunk |
|---|---|---|
| Bytes | 8 B | 8 B |
| Shape | () | () |
| Dask graph | 1 chunks in 3 graph layers | 1 chunks in 3 graph layers |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |


In [3]:
a_da_sum.compute()

np.float64(10.0)

In [4]:
xd = da.random.normal(10, 0.1, size=(30_000, 30_000), chunks=(3000, 3000))
xd

| | Array | Chunk |
|---|---|---|
| Bytes | 6.71 GiB | 68.66 MiB |
| Shape | (30000, 30000) | (3000, 3000) |
| Dask graph | 100 chunks in 1 graph layer | 100 chunks in 1 graph layer |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |
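
The numbers in this summary follow directly from the array shape, the chunk shape, and the 8 bytes taken by each `float64` element. A quick check in plain Python (no Dask needed):

```python
import math

shape = (30_000, 30_000)
chunks = (3000, 3000)
itemsize = 8  # bytes per float64 element

# Number of chunks: chunks along each axis, multiplied together
n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))

# Sizes in binary units, as shown in the Dask summary
array_gib = math.prod(shape) * itemsize / 2**30
chunk_mib = math.prod(chunks) * itemsize / 2**20

print(n_chunks, round(array_gib, 2), round(chunk_mib, 2))
```

The full array (6.71 GiB) is larger than the 3.82 GiB of any single worker shown above, so it can only be processed chunk by chunk.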


In [5]:
%%time
xd = da.random.normal(10, 0.1, size=(30_000, 30_000), chunks=(3000, 3000))
yd = xd.mean(axis=0)
yd.compute()

CPU times: total: 31.3 s
Wall time: 3.12 s


array([10.00034189,  9.99979383,  9.99925054, ..., 10.0004483 ,
        9.99915362, 10.00051356])

## Dask dataframes

In this part we point to data in S3.

Source for examples:

`https://tutorial.dask.org/01_dataframe.html`

**Remark**:
- Make sure you point to the bucket that actually contains your data.

In [6]:
!aws s3 ls

2024-10-17 13:06:26 dask-input-data


In [7]:
import dask.dataframe as dd

# Read all CSV files from the root of the bucket
# (the bucket name must match one listed by `aws s3 ls` above)
ddf = dd.read_csv(
    "s3://dask-input-data/*.csv",
    dtype={"TailNum": str, "CRSElapsedTime": float, "Cancelled": bool},
    blocksize="25MB",
)

ddf

If the path points at a bucket that does not contain the files (an earlier run used `s3://dask-input-data-svw/*.csv`), `read_csv` fails with:

OSError: An error occurred while calling the read_csv method registered to the pandas backend.
Original Message: s3://dask-input-data-svw/*.csv resolved to no files

In [12]:
%%time
len(ddf)

CPU times: total: 0 ns
Wall time: 4.78 s


2611892

In [13]:
%%time
ddf.head(2)

CPU times: total: 78.1 ms
Wall time: 6.22 s


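| | Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | FlightNum | ... | AirTime | ArrDelay | DepDelay | Origin | Dest | Distance | TaxiIn | TaxiOut | Cancelled | Diverted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1990 | 1 | 1 | 1 | 1621.0 | 1540 | 1747.0 | 1701 | US | 33 | ... | NaN | 46.0 | 41.0 | EWR | PIT | 319.0 | NaN | NaN | False | 0 |
| 1 | 1990 | 1 | 2 | 2 | 1547.0 | 1540 | 1700.0 | 1701 | US | 33 | ... | NaN | -1.0 | 7.0 | EWR | PIT | 319.0 | NaN | NaN | False | 0 |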


In [14]:
%%time
result = ddf.DepDelay.max()
result.compute()

CPU times: total: 15.6 ms
Wall time: 3.34 s


np.float64(1435.0)

In [15]:
%%time
len(ddf[~ddf.Cancelled])

CPU times: total: 31.2 ms
Wall time: 4.3 s


2540961

In [16]:
%%time
ddf[~ddf.Cancelled].groupby("Origin")["Origin"].count().compute()

CPU times: total: 31.2 ms
Wall time: 3.15 s


Origin
EWR    1139451
JFK     427243
LGA     974267
Name: Origin, dtype: int64

In [17]:
%%time
ddf.groupby("Origin").DepDelay.mean().compute()

CPU times: total: 46.9 ms
Wall time: 2.1 s


Origin
EWR    10.295469
JFK    10.351299
LGA     7.431142
Name: DepDelay, dtype: float64

In [18]:
%%time
ddf.groupby("DayOfWeek").DepDelay.mean().idxmax().compute()

CPU times: total: 31.2 ms
Wall time: 1.97 s


np.int64(5)

## Sharing Intermediate Results

#### Example 1

In [19]:
non_canceled = ddf[~ddf.Cancelled]
mean_delay = non_canceled.DepDelay.mean()
std_delay = non_canceled.DepDelay.std()

If you compute them with two calls to compute, there is no sharing of intermediate computations.

In [20]:
%%time

mean_delay_res = mean_delay.compute()
std_delay_res = std_delay.compute()

CPU times: total: 78.1 ms
Wall time: 4.72 s


But let's try passing both to a single `compute` call:

In [21]:
%%time

mean_delay_res, std_delay_res = dask.compute(mean_delay, std_delay)

CPU times: total: 0 ns
Wall time: 3.04 s
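
The effect can be made visible on a toy graph. The sketch below is an illustration with `dask.delayed`, not the DataFrame code above: it uses made-up values and the local synchronous scheduler, and counts how often a shared upstream step runs under each approach.

```python
import dask
from dask import delayed

calls = []  # records every execution of the shared step


@delayed
def load():
    calls.append("load")
    return [1.0, 2.0, 3.0, 4.0]


@delayed
def mean(xs):
    return sum(xs) / len(xs)


@delayed
def total(xs):
    return sum(xs)


data = load()  # one shared upstream task
m, t = mean(data), total(data)

# Two separate compute calls: the shared step runs once per call
m.compute(scheduler="synchronous")
t.compute(scheduler="synchronous")
separate_runs = len(calls)

calls.clear()

# One combined call: both graphs are merged, so the shared step runs once
m_res, t_res = dask.compute(m, t, scheduler="synchronous")
shared_runs = len(calls)

print(separate_runs, shared_runs, m_res, t_res)
```

The same reasoning applies to `mean_delay` and `std_delay`: both depend on reading and filtering the CSV data, which a single `dask.compute` call only has to do once.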


#### Example 2

In [22]:
non_cancelled = ddf[~ddf.Cancelled]
ddf_jfk = non_cancelled[non_cancelled.Origin == "JFK"]

In [23]:
%%time
ddf_jfk.DepDelay.mean().compute()
ddf_jfk.DepDelay.sum().compute()

CPU times: total: 46.9 ms
Wall time: 9.53 s


np.float64(4422520.0)

In [24]:
ddf_jfk = ddf_jfk.persist() 

In [25]:
%%time
ddf_jfk.DepDelay.mean().compute()
ddf_jfk.DepDelay.std().compute()

CPU times: total: 15.6 ms
Wall time: 559 ms


np.float64(31.242509798271147)

## Custom code with Dask DataFrame

In [26]:
import pandas as pd
import dask.dataframe as dd

In [None]:
df1 = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                    'y': [1., 2., 3., 4., 5.]})

# Split the pandas DataFrame into a Dask DataFrame with 2 partitions
ddf = dd.from_pandas(df1, npartitions=2)
ddf

Use `map_partitions` to apply a function to each partition. Extra positional and keyword arguments can optionally be provided; they are passed to the function after the partition.

In [None]:
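def myadd(df, a, b=1):
    # df is a whole partition (a pandas DataFrame) under map_partitions,
    # or a single row (a pandas Series) under DataFrame.apply(axis=1)
    return df.x + df.y + a + b

# using pandas
display(df1.apply(myadd, args=(1, 2), axis=1))

# using Dask: without meta, Dask infers the output dtype itself
res = ddf.map_partitions(myadd, 1, b=2)
print(res.dtype)
res.compute()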

In [None]:
res = ddf.map_partitions(myadd, 1, b=2, meta=(None, 'f8'))
res.compute()

In [None]:
res = ddf.map_partitions(lambda df: df.assign(z=df.x * df.y))
res.dtypes
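
Conceptually, `map_partitions` calls the function once per partition, with the partition as the first argument followed by the extra arguments, and then combines the results. The pandas-only sketch below emulates that behaviour to make the argument passing explicit; `map_partitions_sketch` is a hypothetical helper written for this illustration, not how Dask implements it.

```python
import pandas as pd


def map_partitions_sketch(partitions, func, *args, **kwargs):
    """Apply func(partition, *args, **kwargs) to each partition and concatenate."""
    return pd.concat([func(part, *args, **kwargs) for part in partitions])


def myadd(df, a, b=1):
    return df.x + df.y + a + b


pdf = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [1.0, 2.0, 3.0, 4.0, 5.0]})

# Two hand-made partitions, mimicking npartitions=2
partitions = [pdf.iloc[:3], pdf.iloc[3:]]

result = map_partitions_sketch(partitions, myadd, 1, b=2)
print(result.tolist())
```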

In [None]:
ddf = dd.read_csv("s3://dask-input-data/*.csv", 
                  dtype={"TailNum": str, "CRSElapsedTime": float, "Cancelled": bool},
                  blocksize="25MB" )


In [28]:
%%time
ddfD = ddf[~ddf.Distance.isna()]
dask.compute(len(ddf),len(ddfD))

CPU times: total: 62.5 ms
Wall time: 8.05 s


(2611892, 2610397)

In [29]:
ddfD['Distance'] = ddfD.Distance.astype('float64')

In [30]:
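def my_custom_converter(df, multiplier=1):
    # Receives one partition of the Distance column (a pandas Series)
    return df * multiplier

# meta describes the output: an empty Series with the expected name and dtype
meta = pd.Series(name="Distance", dtype="float64")

# NOTE: 0.6 is only an illustrative multiplier. If Distance is in miles,
# a conversion to kilometres would need a factor of about 1.609 instead.
distance_km = ddfD.Distance.map_partitions(
    my_custom_converter, multiplier=0.6, meta=meta
)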

distance_km.compute()

0         191.4
1         191.4
2         191.4
3         191.4
4         191.4
          ...  
269176    971.4
269177    971.4
269178    971.4
269179    971.4
269180    971.4
Name: Distance, Length: 2610397, dtype: float64
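
The `meta` argument only has to describe the output (name and dtype), not contain data. One way to build it without typing the dtype by hand is to run the function on an empty slice of a small pandas sample. This is a sketch with a made-up two-row sample, shown as an alternative to the hand-written `meta` above:

```python
import pandas as pd


def my_custom_converter(df, multiplier=1):
    return df * multiplier


# Small illustrative sample with the same name and dtype as the real column
sample = pd.Series([319.0, 319.0], name="Distance")

# An empty slice carries no data, but the name and dtype survive the function
meta_from_sample = my_custom_converter(sample.iloc[:0], multiplier=0.6)

# The hand-written equivalent used in the cell above
meta_by_hand = pd.Series(name="Distance", dtype="float64")

print(meta_from_sample.dtype, meta_from_sample.name, len(meta_from_sample))
```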

In [31]:
client.close()