## Start and prepare EC2 instances

In this notebook, instructions are provided to start three EC2 instances. The same packages have to be installed on all three. For this we use the `user data` section of the launch wizard.

Go to EC2 in the AWS Management Console.

Go to `Instances` and click `Launch instances`.

![image.png](attachment:dbe77d7c-2749-4c4e-8557-508d532e7186.png)

Scroll down and make sure you select or create a `Key pair`.

Scroll further down and, under `Advanced details`, go to the `User data` field (at the bottom) and enter the script below.

```
#!/bin/bash
sudo apt update
sudo apt upgrade -y
sudo apt install python3-pip -y
sudo apt install python3-distributed -y
sudo  apt install python3.12-venv -y

python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install "dask[complete]"
python3 -m pip install "bokeh!=3.0.*,>=2.4.2"
```

#### improved script

I used ChatGPT to create a slightly improved script that installs specific versions.

When we log into the instance we are logged in as the `ubuntu` user. Run `cd ..` twice to get to the root directory, where the `.venv` folder is.

To start the environment, type `source .venv/bin/activate`. Do this on the scheduler and on the workers.

```
#!/bin/bash
# Update and upgrade system packages
sudo apt update
sudo apt upgrade -y

# Add PPA for Python 3.10 to install it from a trusted source
sudo add-apt-repository ppa:deadsnakes/ppa -y
sudo apt update

# Install Python 3.10 and related packages like venv, dev headers, and pip
sudo apt install python3.10 python3.10-venv python3.10-dev python3-pip -y

# Create and activate Python 3.10 virtual environment
python3.10 -m venv .venv
source .venv/bin/activate

# Upgrade pip, setuptools, and wheel to the latest versions for compatibility
python3 -m pip install --upgrade pip setuptools wheel

# Install specific version of Dask (2024.2.1) and ensure a compatible version of Bokeh
python3 -m pip install "dask[complete]==2024.2.1"
python3 -m pip install "bokeh!=3.0.*,>=2.4.2"
```

### TODO 

Include `pip install s3fs` in the user data.

ChatGPT returned the following user data:

```
#!/bin/bash
# Update and upgrade system packages
sudo apt update
sudo apt upgrade -y

# Add PPA for Python 3.10 to install it from a trusted source
sudo add-apt-repository ppa:deadsnakes/ppa -y
sudo apt update

# Install Python 3.10 and related packages like venv, dev headers, and pip
sudo apt install python3.10 python3.10-venv python3.10-dev python3-pip -y

# Install unzip to handle AWS CLI zip download
sudo apt install unzip -y

# Install AWS CLI (version 2)
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install

# Verify AWS CLI installation
aws --version

# Create and activate Python 3.10 virtual environment
python3.10 -m venv .venv
source .venv/bin/activate

# Upgrade pip, setuptools, and wheel to the latest versions for compatibility
python3 -m pip install --upgrade pip setuptools wheel

# Install specific version of Dask (2024.2.1) and ensure a compatible version of Bokeh
python3 -m pip install "dask[complete]==2024.2.1"
python3 -m pip install "bokeh!=3.0.*,>=2.4.2"
```

### TO TEST

With the user data below I was able to read the data from S3 into our dataframe.

This user data updates dask to version 2024.9.1 due to dependencies, so we need to adjust our `environment.yaml`:
- set the dask version to 2024.9.1
- set the dask-expr version as well (we may also need dask-expr installed at a specific version)


**Remark** - the package versions on the cluster need to match the versions on the client closely. That is why the `environment.yaml` was added. It only uses the `pip` section, so that exactly the versions on the cluster are installed. If the versions differ too much, dask might not be able to perform the computations.

name: dask-cluster
channels:
  - defaults
  - conda-forge
  - pytorch
  - nvidia
dependencies:
  - python=3.10
  - pip
  - pip:
    - pandas==2.2.3
    - numpy==2.1.2
    - dask[complete]==2024.2.1
    - jupyter

Create the environment with `conda env create --file environment.yml`. **Adjust the versions in the yaml file if necessary.**
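Once the environment exists, it can help to confirm that the pinned versions were really installed. The sketch below is one possible check, not part of the original setup: the helper names `parse_pins` and `find_mismatches` are made up for illustration, and the list of pins is copied by hand from the `pip` section above. It only uses the standard library module `importlib.metadata`.

```python
from importlib import metadata


def parse_pins(requirements):
    """Return {package: version} for every 'name==version' entry."""
    pins = {}
    for line in requirements:
        entry = line.strip().lstrip("-").strip()
        if "==" not in entry:
            continue  # unpinned entries such as 'jupyter' are skipped
        name, version = entry.split("==", 1)
        name = name.split("[", 1)[0].strip()  # drop extras like [complete]
        pins[name] = version.strip()
    return pins


def find_mismatches(pins):
    """Compare the pins with what is installed in the active environment."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


# Pins copied from the pip section of the environment file above.
pip_section = [
    "- pandas==2.2.3",
    "- numpy==2.1.2",
    "- dask[complete]==2024.2.1",
    "- jupyter",
]
pins = parse_pins(pip_section)
print(pins)
print(find_mismatches(pins))  # empty dict when everything matches
```

Running the same check on the scheduler and on a worker (inside the activated `.venv`) gives a quick view of where the cluster and the client disagree.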

Launch the instances.

Give the first instance the name `dask-scheduler` and the other two instances the name `dask-worker`.

The user data instructions have been tested on an EC2 instance and they all worked.

Because of the user data script, starting the instances might take a bit longer.

## Opening port in the security group

Since the instances are all started at the same time, they all share the same security group.

The security group can be found by selecting one of the instances and then going to the `Security` tab.

![image.png](attachment:f4304247-b2c7-4cb9-9f33-9c111a42e6a0.png)

Open the security group and go to `Edit Inbound Rules` under `Inbound Rules`, then click `Add Rule`. Add rules until the list looks like the image below.

![image.png](attachment:46cd76e2-7948-4ffc-90be-4590d0f673c5.png)

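The unreadable part states port range `30000-65535` (originally this was `49152-65535`). With the original range I was not able to get the cluster running; with the adjustment I was. The worker and nanny ports in the log output further down (for example 33373 and 46673) fall below 49152, which would explain why the narrower range did not work.

Click `Save rules`.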

Log into the dask scheduler instance and try to ping the workers using

`ping <private-ip-address>`

On my machine both pinged correctly.
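A successful `ping` only shows that ICMP traffic gets through, while dask communicates over TCP. As a complementary check, the sketch below tests whether a TCP port accepts connections. The helper name `tcp_port_open` is made up for this example, and the demo probes a listener on the local machine so that it runs anywhere.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Self-contained demo: listen on a free local port, then probe it.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]
print(tcp_port_open("127.0.0.1", demo_port))
listener.close()
```

On the cluster, the same function can be called with the private IP of another instance and one of the ports opened in the security group, for example port 8786 on the scheduler once it is running.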

## Setting up cluster with the CLI

Log into each instance, for example using PuTTY.

In the terminal of the scheduler instance, type:

`dask scheduler` 

Among other lines, this should show something like the output below:

```
2024-06-25 14:53:43,235 - distributed.scheduler - INFO -   Scheduler at:  tcp://172.31.21.203:8786
2024-06-25 14:53:43,235 - distributed.scheduler - INFO -   dashboard at:  http://172.31.21.203:8787/status
```

On each worker, now execute

`dask worker tcp://172.31.21.203:8786`

The output should contain, among other lines:

```
2024-06-25 14:59:22,385 - distributed.nanny - INFO -         Start Nanny at: 'tcp://172.31.24.71:33373'
2024-06-25 14:59:23,015 - distributed.worker - INFO -       Start worker at:   tcp://172.31.24.71:46673
2024-06-25 14:59:23,016 - distributed.worker - INFO -          Listening to:   tcp://172.31.24.71:46673
2024-06-25 14:59:23,016 - distributed.worker - INFO -          dashboard at:         172.31.24.71:35887
2024-06-25 14:59:23,016 - distributed.worker - INFO - Waiting to connect to:   tcp://172.31.21.203:8786
2024-06-25 14:59:23,016 - distributed.worker - INFO - -------------------------------------------------
2024-06-25 14:59:23,017 - distributed.worker - INFO -               Threads:                          1
2024-06-25 14:59:23,017 - distributed.worker - INFO -                Memory:                   0.94 GiB
2024-06-25 14:59:23,017 - distributed.worker - INFO -       Local Directory: /tmp/dask-scratch-space/worker-9tb94gip
2024-06-25 14:59:23,017 - distributed.worker - INFO - -------------------------------------------------
2024-06-25 14:59:23,277 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-06-25 14:59:23,278 - distributed.worker - INFO -         Registered to:   tcp://172.31.21.203:8786
2024-06-25 14:59:23,278 - distributed.worker - INFO - -------------------------------------------------
2024-06-25 14:59:23,279 - distributed.core - INFO - Starting established connection to tcp://172.31.21.203:8786
```
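The address passed to `dask worker` is the one printed in the `Scheduler at:` line of the scheduler log. If you start clusters often, a small helper can extract it. The sketch below is only an illustration: the function name `scheduler_address` and the regular expression are assumptions based on the log format shown above, not a dask API.

```python
import re

# Matches the address in a line such as
# "... INFO -   Scheduler at:  tcp://172.31.21.203:8786"
SCHEDULER_LINE = re.compile(r"Scheduler at:\s+(tcp://[\d.]+:\d+)")


def scheduler_address(log_text):
    """Return the first tcp://host:port found in a 'Scheduler at:' log line."""
    match = SCHEDULER_LINE.search(log_text)
    return match.group(1) if match else None


log = (
    "2024-06-25 14:53:43,235 - distributed.scheduler - INFO - "
    "  Scheduler at:  tcp://172.31.21.203:8786\n"
    "2024-06-25 14:53:43,235 - distributed.scheduler - INFO - "
    "  dashboard at:  http://172.31.21.203:8787/status\n"
)
address = scheduler_address(log)
print(f"dask worker {address}")
```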



Now the cluster is ready with one scheduler and two workers.

We can now view the dask dashboard using the **public IP** (not the private one):

`http://34.229.148.157:8787/status`

On this page, open the `Workers` tab:

![image.png](attachment:433c805c-9275-4c15-a249-776f4816ad47.png)

### Remark

The user data script installs everything in the root directory, while with PuTTY we log in as the `ubuntu` user in the directory `/home/ubuntu`. To get to the root directory, run `cd ..` twice. The `.venv` directory can be found there.
To start the environment, type 

`source .venv/bin/activate`

Then use the URL above with the **public IP** of the scheduler node you created. 

## Install the AWS CLI

Check the link below to install the AWS CLI:

https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html

You still need to assign an IAM role to the EC2 instance if you want to access data in S3. After that, you can test whether it works by typing

`aws s3 ls` 

This will list the buckets in your account.

# Connect locally to the scheduler

Now we try to connect from our local computer to the dask cluster. Please make sure that you use the **public IP** of the scheduler and not the private IP.

## Checks

### package version check

In [13]:
# version checking
import pandas as pd
import numpy as np
import dask 
import dask_expr
import s3fs
import distributed

print(pd.__version__)
print(np.__version__)
print(dask.__version__)
print(s3fs.__version__)
print(distributed.__version__)
print(dask_expr.__version__)


2.2.3
2.1.2
2024.9.1
2024.9.0
2024.9.1
1.1.15
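The versions printed above are those of the local client. What matters for the cluster is that the scheduler and the workers report the same ones. The sketch below shows one way to compare such version listings. The function name `version_mismatches` and the shape of the input (node name mapped to `{package: version}`) are assumptions made for this example, and the worker data is hypothetical. The `distributed` library also offers `Client.get_versions(check=True)`, which performs a similar check against a live cluster.

```python
def version_mismatches(versions_by_node, reference="client"):
    """List (node, package, reference_version, node_version) differences."""
    expected = versions_by_node[reference]
    problems = []
    for node, packages in versions_by_node.items():
        if node == reference:
            continue
        for package, wanted in expected.items():
            found = packages.get(package)  # None if the package is missing
            if found != wanted:
                problems.append((node, package, wanted, found))
    return problems


# Hypothetical snapshot: the client matches the output above,
# while one worker still runs the older pinned dask release.
versions = {
    "client": {"dask": "2024.9.1", "distributed": "2024.9.1", "pandas": "2.2.3"},
    "worker-1": {"dask": "2024.9.1", "distributed": "2024.9.1", "pandas": "2.2.3"},
    "worker-2": {"dask": "2024.2.1", "distributed": "2024.2.1", "pandas": "2.2.3"},
}
for problem in version_mismatches(versions):
    print(problem)
```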


### Remote storage access check

Let's first read data using pandas and dask without the EC2 cluster. This way we can check that our S3 permissions work from the local machine.

In [16]:
%%time
import pandas as pd

# Example: Reading a CSV file from S3
df = pd.read_csv("s3://dask-input-data/1991.csv")
display(df.head(2))

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,...,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,Diverted
0,1991,1,8,2,1215.0,1215,1340.0,1336,US,121,...,,4.0,0.0,EWR,PIT,319.0,,,0,0
1,1991,1,9,3,1215.0,1215,1353.0,1336,US,121,...,,17.0,0.0,EWR,PIT,319.0,,,0,0


CPU times: total: 469 ms
Wall time: 9.31 s


In [15]:
%%time
import dask.dataframe as dd
from dask.distributed import Client

# create a local cluster
client = Client()

# Example: Reading a CSV file from S3
ddf = dd.read_csv('s3://dask-input-data/1990.csv', blocksize="10MB" )
print(ddf)

client.close()

### Check connection to cluster

In [21]:
from dask.distributed import Client

# provide the public IP address of the scheduler
ip = "44.202.6.3"
address = f"tcp://{ip}:8786"
dashboard = f"http://{ip}:8787/status"

print(f"Use the link below to connect to the cluster dashboard:\n{dashboard}")

print(address)
client = Client(address=address)

client

Use the link below to connect to the cluster dashboard:
http://44.202.6.3:8787/status
tcp://44.202.6.3:8786


Connection method: Direct
Dashboard: http://44.202.6.3:8787/status

Comm: tcp://172.31.87.82:8786,Workers: 2
Dashboard: http://172.31.87.82:8787/status,Total threads: 2
Started: 1 minute ago,Total memory: 3.84 GiB

Comm: tcp://172.31.82.136:35811,Total threads: 1
Dashboard: http://172.31.82.136:35929/status,Memory: 1.92 GiB
Nanny: tcp://172.31.82.136:45847
Local directory: /tmp/dask-scratch-space/worker-e1u6kzu0
Tasks executing:,Tasks in memory:
Tasks ready:,Tasks in flight:
CPU usage: 2.0%,Last seen: Just now
Memory usage: 140.44 MiB,Spilled bytes: 0 B
Read bytes: 258.73723137791296 B,Write bytes: 1.45 kiB

Comm: tcp://172.31.87.80:42053,Total threads: 1
Dashboard: http://172.31.87.80:43449/status,Memory: 1.92 GiB
Nanny: tcp://172.31.87.80:45849
Local directory: /tmp/dask-scratch-space/worker-rsgb_cdt
Tasks executing:,Tasks in memory:
Tasks ready:,Tasks in flight:
CPU usage: 2.0%,Last seen: Just now
Memory usage: 141.43 MiB,Spilled bytes: 0 B
Read bytes: 258.18059629448356 B,Write bytes: 1.45 kiB


#### Simple example

In [22]:
import dask.array as da

a_da = da.ones(10, chunks=5)
a_da

Unnamed: 0,Array,Chunk
Bytes,80 B,40 B
Shape,"(10,)","(5,)"
Dask graph,2 chunks in 1 graph layer,2 chunks in 1 graph layer
Data type,float64 numpy.ndarray,float64 numpy.ndarray


In [23]:
a_da_sum = a_da.sum()
a_da_sum

Unnamed: 0,Array,Chunk
Bytes,8 B,8 B
Shape,(),()
Dask graph,1 chunks in 3 graph layers,1 chunks in 3 graph layers
Data type,float64 numpy.ndarray,float64 numpy.ndarray


In [24]:
a_da_sum.compute()

np.float64(10.0)
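The repr of `a_da` shows two chunks, and the sum is built from them. As a rough illustration of that idea in plain Python (not dask's actual implementation), each chunk can be summed on its own and the partial results added afterwards. The function name `chunked_sum` is made up for this sketch.

```python
def chunked_sum(values, chunk_size):
    """Sum each chunk separately, then add the partial sums together."""
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    partial_sums = [sum(chunk) for chunk in chunks]  # one independent task per chunk
    return len(chunks), sum(partial_sums)            # final aggregation step


n_chunks, total = chunked_sum([1.0] * 10, chunk_size=5)
print(n_chunks, total)  # 2 10.0
```

Because the per-chunk sums do not depend on each other, they can run on different workers, which is what makes the computation distributable.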

#### Bigger example

In [25]:
xd = da.random.normal(10, 0.1, size=(20_000, 20_000), chunks=(3000, 3000))
xd

Unnamed: 0,Array,Chunk
Bytes,2.98 GiB,68.66 MiB
Shape,"(20000, 20000)","(3000, 3000)"
Dask graph,49 chunks in 1 graph layer,49 chunks in 1 graph layer
Data type,float64 numpy.ndarray,float64 numpy.ndarray


In [26]:
%%time
xd = da.random.normal(10, 0.1, size=(20_000, 20_000), chunks=(3000, 3000))
yd = xd.mean(axis=0)
yd.compute()

CPU times: total: 0 ns
Wall time: 6.07 s


array([ 9.99910342, 10.00072483,  9.99921359, ...,  9.99842834,
        9.99976833,  9.99945769])
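The numbers in the array repr (49 chunks, 68.66 MiB per chunk, 2.98 GiB in total) follow directly from the shape, the chunk shape and the 8 bytes of a `float64`. The short calculation below reproduces them with the standard library only.

```python
import math

shape = (20_000, 20_000)
chunk_shape = (3_000, 3_000)
itemsize = 8  # bytes per float64 element

# 20000 / 3000 is not a whole number, so the last chunk on each axis is smaller.
chunks_per_axis = [math.ceil(n / c) for n, c in zip(shape, chunk_shape)]
n_chunks = math.prod(chunks_per_axis)
chunk_mib = math.prod(chunk_shape) * itemsize / 2**20
total_gib = math.prod(shape) * itemsize / 2**30

print(chunks_per_axis, n_chunks)
print(f"{chunk_mib:.2f} MiB per full chunk, {total_gib:.2f} GiB in total")
```

The full array is larger than the memory of a single worker (1.92 GiB each in the client output above), which is one reason to keep chunks small and reduce them as they are produced.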

![image.png](attachment:f227d2b3-7bff-478d-aaf9-8f6b0aa5d45d.png)

If it all works so far, grab a coffee to celebrate this **BIG SUCCESS!**

## Cluster to use remote data on S3

The code below works for objects with public access. As an example, I made two CSV files public and connect to them with dask. 

In [27]:
import dask.dataframe as dd
import os
import s3fs

You can point dask at a list of remote files.

In [31]:
import dask.dataframe as dd

filenames = ["s3://dask-input-data/1990.csv", "s3://dask-input-data/1991.csv"]

# Example: Reading a CSV file from S3
ddf = dd.read_csv(filenames, 
                  parse_dates={"Date": [0, 1, 2]},
                  dtype={"TailNum": str, "CRSElapsedTime": float, "Cancelled": bool},
                  blocksize="10MB" )

ddf

Unnamed: 0_level_0,Date,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,Diverted
npartitions=4,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
,datetime64[ns],int64,float64,int64,float64,int64,string,int64,string,float64,float64,float64,float64,float64,string,string,float64,float64,float64,bool,int64
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...


You can also point dask at all CSV files in an S3 folder.

In [40]:
import dask.dataframe as dd


# Read all CSV files from the root of the bucket
# ddf = dd.read_csv("s3://dask-input-data/*.csv", 
#                   parse_dates={"Date": [0, 1, 2]},
#                   dtype={"TailNum": str, "CRSElapsedTime": float, "Cancelled": bool},
#                   blocksize="10MB" )

# Read all CSV files from the root of the bucket
ddf = dd.read_csv("s3://dask-input-data/*.csv", 
                  dtype={"TailNum": str, "CRSElapsedTime": float, "Cancelled": bool},
                  blocksize="25MB" )


ddf

Unnamed: 0_level_0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,Diverted
npartitions=10,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1
,int64,int64,int64,int64,float64,int64,float64,int64,string,int64,string,float64,float64,float64,float64,float64,string,string,float64,float64,float64,bool,int64
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...


In [41]:
%%time
len(ddf)

CPU times: total: 15.6 ms
Wall time: 6.13 s


2611892

In [42]:
ddf.head(2)

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,...,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,Diverted
0,1990,1,1,1,1621.0,1540,1747.0,1701,US,33,...,,46.0,41.0,EWR,PIT,319.0,,,False,0
1,1990,1,2,2,1547.0,1540,1700.0,1701,US,33,...,,-1.0,7.0,EWR,PIT,319.0,,,False,0


### Example calculations

In [45]:
%%time
result = ddf.DepDelay.max()
result.compute()

CPU times: total: 46.9 ms
Wall time: 11.9 s


np.float64(1435.0)

#### In total, how many non-canceled flights were taken?

In [49]:
len(ddf[~ddf.Cancelled])

2540961

#### In total, how many non-canceled flights were taken from each airport?

In [54]:
ddf[~ddf.Cancelled].groupby("Origin")["Origin"].count().compute()

Origin
EWR    1139451
JFK     427243
LGA     974267
Name: Origin, dtype: int64

#### What was the average departure delay from each airport?

In [57]:
ddf.groupby("Origin").DepDelay.mean().compute()

Origin
EWR    10.295469
JFK    10.351299
LGA     7.431142
Name: DepDelay, dtype: float64

#### What day of the week has the worst average departure delay?

In [50]:
ddf.columns

Index(['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'DepTime', 'CRSDepTime',
       'ArrTime', 'CRSArrTime', 'UniqueCarrier', 'FlightNum', 'TailNum',
       'ActualElapsedTime', 'CRSElapsedTime', 'AirTime', 'ArrDelay',
       'DepDelay', 'Origin', 'Dest', 'Distance', 'TaxiIn', 'TaxiOut',
       'Cancelled', 'Diverted'],
      dtype='object')

In [61]:
ddf.groupby("DayOfWeek").DepDelay.mean().idxmax().compute()

np.int64(5)
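The last query chains a groupby, a mean and an `idxmax`. With data spread over partitions, a mean per group cannot be obtained by averaging per-partition means; sums and counts have to be combined first. The sketch below illustrates that idea in plain Python on made-up rows. The function names are hypothetical, and this is an illustration of the principle rather than dask's actual implementation.

```python
from collections import defaultdict


def partial_sums(rows):
    """Per-partition step: sum and count of delays for each day of the week."""
    acc = defaultdict(lambda: [0.0, 0])
    for day, delay in rows:
        if delay is None:
            continue  # missing delays are left out, as pandas skips NaN
        acc[day][0] += delay
        acc[day][1] += 1
    return acc


def worst_day(partitions):
    """Combine the partial results, then pick the day with the highest mean."""
    total = defaultdict(lambda: [0.0, 0])
    for rows in partitions:
        for day, (delay_sum, count) in partial_sums(rows).items():
            total[day][0] += delay_sum
            total[day][1] += count
    means = {day: s / c for day, (s, c) in total.items() if c}
    return max(means, key=means.get), means


# Made-up (DayOfWeek, DepDelay) rows split over two partitions.
partitions = [
    [(1, 4.0), (5, 20.0), (5, None)],
    [(1, 6.0), (5, 10.0), (2, 3.0)],
]
day, means = worst_day(partitions)
print(day, means)
```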

In [None]:
client.close()

This takes about 8 seconds with 2 workers. During this period you can go to the dask dashboard to follow the execution of the task graph.

# Run jupyter 

Now we want to run a Jupyter notebook on the dask scheduler instance and connect to the cluster from that notebook (ideally you would set up the cluster and connect to it from your local machine - to be done). To run Jupyter on an EC2 instance so that you can connect to it from your local machine, you have to
- install Jupyter on the server - use PuTTY to log into our dask-scheduler
- open up port 8080 (or another port) in the security group
- execute the following command on the server - `jupyter notebook --no-browser --port=8080 --ip=0.0.0.0 --allow-root`

The key part is `--ip=0.0.0.0`; without it the connection did not work (I tried several other options).

The output will contain something like

` http://127.0.0.1:8080/tree?token=ea112db195c9c58c4a2043aac3b52b3897fab80dbf546e12`

Replace `127.0.0.1` with the external IP address of the server and open the whole URL in your local browser. You should now be connected to the notebook server.
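Swapping the host by hand is easy to get wrong when the token is long. A minimal sketch of the same replacement with `urllib.parse` is shown below. The function name `public_notebook_url` is made up, and the IP `203.0.113.10` is a placeholder from the documentation address range, to be replaced by the public IP of your own instance.

```python
from urllib.parse import urlsplit, urlunsplit


def public_notebook_url(local_url, public_ip):
    """Replace the host of the URL printed by Jupyter, keeping port, path and token."""
    parts = urlsplit(local_url.strip())
    netloc = f"{public_ip}:{parts.port}" if parts.port else public_ip
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


local_url = "http://127.0.0.1:8080/tree?token=ea112db195c9c58c4a2043aac3b52b3897fab80dbf546e12"
print(public_notebook_url(local_url, "203.0.113.10"))
```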

In [None]:
from dask.distributed import Client

In [None]:
client = Client()

In [None]:
client

In [None]:
client.close()