# Create Dask Cluster using dask_cloudprovider

This notebook creates a Dask cluster with the `EC2Cluster` class from `dask_cloudprovider`.

## Create necessary permissions

### 1. IAM user
Creating this cluster requires permissions on EC2 resources, and the exercises read data from S3. Therefore a user with programmatic access was created with the `AmazonS3FullAccess` and `AmazonEC2FullAccess` policies attached. If you intend to provide an IAM role you also need `IAMFullAccess`. For the Amazon ECR steps later in this notebook, also attach `AmazonEC2ContainerRegistryFullAccess`.

### 2. EC2 role

Create a role `dask-cluster-ec2-role` with the `AmazonS3FullAccess` and `AmazonEC2ContainerRegistryFullAccess` policies attached.

### 3. Key pair

Create a key pair `dask-keys`.



### Rest

In [None]:
from dask.distributed import Client
from dask_cloudprovider.aws import EC2Cluster
from typing import List, Optional

import dask.array as da
import dask.dataframe as dd

check if we have sufficient permissions

check s3 permissions

In [None]:
# check s3 permissions
# !aws s3 ls

check ec2 permissions

In [None]:
# Check ec2 permissions
#!aws ec2 describe-subnets

check iam permissions

In [None]:
# !aws iam list-users
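
The three CLI checks above can also be run from Python. The sketch below is one possible way to do it, assuming `boto3` is installed and credentials are configured; `run_checks` and `aws_checks` are hypothetical helper names, not part of any library. Each check is one cheap read-only call per service, and a failure (for example an access-denied error) is reported instead of raised.

```python
from typing import Callable, Dict


def run_checks(checks: Dict[str, Callable[[], object]]) -> Dict[str, bool]:
    """Call each check and record True if it succeeded, False if it raised."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = True
        except Exception as exc:  # e.g. an access-denied error from AWS
            print(f"{name}: failed ({exc})")
            results[name] = False
    return results


def aws_checks(region: str = "us-east-1") -> Dict[str, Callable[[], object]]:
    """Build one read-only call per service (assumes boto3 is installed)."""
    import boto3  # imported lazily so run_checks works without boto3

    return {
        "s3": boto3.client("s3", region_name=region).list_buckets,
        "ec2": boto3.client("ec2", region_name=region).describe_subnets,
        "iam": boto3.client("iam", region_name=region).list_users,
    }


# Usage (needs AWS credentials):
# print(run_checks(aws_checks()))
```

Keeping the AWS-specific part in `aws_checks` means the reporting logic can be tried out without touching an account.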

#### Create the cluster on AWS

To use the `EC2Cluster` class we had to set `security=False`; otherwise an error was returned.

**remark**:
- When using ACG sandboxes the region **must** be `us-east-1`

## TODO

- include outbound traffic on 80 and 443

**remark**:
- A hint from the Dask Discord was to use `env_vars` to install extra packages. However, this did not work.

In [None]:
cluster = EC2Cluster(
    region="us-east-1",
    availability_zone=None,
    bootstrap=True,  # to install docker on the image
    auto_shutdown=None,
    ami=None,
    instance_type=None,
    scheduler_instance_type="t2.micro",
    worker_instance_type="t2.medium",
    vpc=None,
    subnet_id=None,
    security_groups=None,
    filesystem_size=None,
    key_name="dask-keys",
    # the Name is the IAM role name
    iam_instance_profile={"Name": "dask-cluster-ec2-role"},
    n_workers=2,
    docker_image="daskdev/dask:latest",
    docker_args=None,
    debug=False,
    security=False,
    # env_vars={"EXTRA_CONDA_PACKAGES": "s3fs"},
)
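
Many of the arguments in the cell above are `None`, which as far as we can tell just selects the defaults from the `dask_cloudprovider` configuration. A minimal sketch of keeping only the explicit settings is shown here; `explicit_only` is a hypothetical helper, and the final `EC2Cluster` call is left commented out because it starts real instances.

```python
from typing import Any, Dict


def explicit_only(**kwargs: Any) -> Dict[str, Any]:
    """Return only the keyword arguments whose value is not None."""
    return {key: value for key, value in kwargs.items() if value is not None}


cluster_kwargs = explicit_only(
    region="us-east-1",
    availability_zone=None,  # dropped: None means "use the default"
    bootstrap=True,
    scheduler_instance_type="t2.micro",
    worker_instance_type="t2.medium",
    key_name="dask-keys",
    iam_instance_profile={"Name": "dask-cluster-ec2-role"},
    n_workers=2,
    docker_image="daskdev/dask:latest",
    security=False,  # kept: False is an explicit choice, not None
)
print(sorted(cluster_kwargs))

# Usage (starts real EC2 instances, which cost money):
# from dask_cloudprovider.aws import EC2Cluster
# cluster = EC2Cluster(**cluster_kwargs)
```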

The following call works:

In [None]:
cluster.get_logs()

You can always scale the cluster with `cluster.scale(n)`, where `n` is the number of worker instances.

In [None]:
client = Client(cluster)
client
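
Scaling is asynchronous: `.scale(n)` only requests the workers, and new EC2 instances take a while to boot. A small sketch that requests workers and then blocks until the scheduler sees them, using `Client.wait_for_workers`; `scale_and_wait` is a hypothetical helper name and the default timeout is an arbitrary assumption.

```python
def scale_and_wait(cluster, client, n_workers: int, timeout: float = 600.0) -> None:
    """Request n_workers and block until the scheduler has seen them."""
    cluster.scale(n_workers)  # asks the cluster manager for n workers
    # Blocks until n workers are connected; raises if the timeout expires.
    client.wait_for_workers(n_workers, timeout=timeout)


# Usage with the cluster and client from above:
# scale_and_wait(cluster, client, 4)
```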

### Manual installation of s3fs

In [None]:
# Check if 's3fs' is installed on the workers
def check_s3fs():
    try:
        import s3fs
        return True
    except ImportError:
        return False

# Run the check on all workers
print(client.run(check_s3fs))

install on workers

In [None]:
client.run(lambda: __import__('subprocess').check_call(['pip3', 'install', 's3fs']))

# Run the check on all workers
print(client.run(check_s3fs))

install on scheduler

In [None]:
# Install s3fs on the scheduler
client.run_on_scheduler(lambda: __import__('subprocess').check_call(['pip3', 'install', 's3fs']))

# Run the check on the scheduler
print(client.run_on_scheduler(check_s3fs))
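
The `pip3` calls above rely on `pip3` on the `PATH` belonging to the same Python that runs Dask. A slightly more defensive sketch is to invoke pip through `sys.executable`, so the package lands in the interpreter that executes the function. `pip_command` and `pip_install` are hypothetical helper names.

```python
import subprocess
import sys
from typing import List


def pip_command(package: str) -> List[str]:
    """Build a pip command bound to the interpreter running this function."""
    return [sys.executable, "-m", "pip", "install", package]


def pip_install(package: str) -> int:
    """Install a package and return pip's exit code (0 means success)."""
    return subprocess.call(pip_command(package))


# Usage with the client from above:
# client.run(pip_install, "s3fs")               # on every worker
# client.run_on_scheduler(pip_install, "s3fs")  # on the scheduler
```

Packages installed this way are lost when a worker instance is replaced. `distributed` also ships a `PipInstall` worker plugin that may be a cleaner option; check its documentation before relying on it.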

In [None]:
# client.get_worker_logs()

In [None]:
#print(cluster.get_logs())

### Cluster in action

In [None]:
import dask.array as da

a_da = da.ones(10, chunks=5)
a_da

In [None]:
a_da_sum = a_da.sum()
a_da_sum

In [None]:
a_da_sum.compute()

In [None]:
xd = da.random.normal(10, 0.1, size=(40_000, 40_000), chunks=(3000, 3000))
xd
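
To get a feel for the size of this array, the arithmetic can be worked out by hand. The sketch below assumes `float64` elements (8 bytes each), which is what `da.random.normal` produces by default.

```python
import math

shape = (40_000, 40_000)
chunks = (3_000, 3_000)
itemsize = 8  # bytes per float64 element

total_bytes = shape[0] * shape[1] * itemsize
chunk_bytes = chunks[0] * chunks[1] * itemsize  # a full (non-edge) chunk
n_chunks = math.ceil(shape[0] / chunks[0]) * math.ceil(shape[1] / chunks[1])

print(f"total array: {total_bytes / 1e9:.1f} GB")  # total array: 12.8 GB
print(f"full chunk:  {chunk_bytes / 1e6:.0f} MB")  # full chunk:  72 MB
print(f"chunks:      {n_chunks}")                  # chunks:      196
```

The full array is larger than the combined memory of two `t2.medium` workers (4 GiB each), so it only works for computations such as reductions that can process and discard chunks as they go.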

In [None]:
%%time
xd = da.random.normal(10, 0.1, size=(30_000, 30_000), chunks=(3000, 3000))
yd = xd.mean(axis=0)
yd.compute()

Let's try to point the cluster to data on S3

In [None]:
import dask.dataframe as dd

# Read all CSV files from the root of the bucket
ddf = dd.read_csv("s3://dask-input-bucket/*.csv", 
                  dtype={"TailNum": str, "CRSElapsedTime": float, "Cancelled": bool},
                  blocksize="25MB" )

ddf

In [None]:
ddf.head()

In [None]:
%%time
result = ddf.DepDelay.max()
result.compute()

In [None]:
client.close()

In [None]:
cluster.close()

## Create cluster using containers locally

First we try to create a Docker image that contains all dependencies and works locally - **in progress**

Create a requirements file that matches the dependencies on the local machine.

In [None]:
%%writefile ./docker/requirements.txt
pandas==2.2.3
numpy==2.1.0
dask[complete]==2024.9.1
dask-cloudprovider==2024.9.0
s3fs==2024.9.0
dask-expr==1.1.15
awscli
jupyter

Create a simple Dockerfile

In [None]:
%%writefile ./docker/Dockerfile
FROM python:3.10.12-slim

WORKDIR /app
COPY ./docker/requirements.txt .

RUN pip3 install --upgrade pip

RUN pip3 install -r requirements.txt

EXPOSE 8786 8787 30000-65535

CMD ["dask", "scheduler", "--host", "0.0.0.0"]
#CMD ["dask", "scheduler"]

Build the image and give it the tag `dask-cluster-image`.

**remark** - building this image takes more than 15 minutes

In [None]:
!docker build . -t dask-cluster-image -f ./docker/Dockerfile

When running the containers we want to test two setups: publishing individual ports and using the host network.

#### Create a network

We will create three containers (one scheduler and two workers) that need to communicate with each other, so they need a shared network.

In [None]:
!docker network create dask-net

In [None]:
!docker network ls

#### Case 1 - expose ports

Below we create the scheduler container with ports 8786 and 8787 published and connect it to the `dask-net` network.

In [None]:
!docker run -d --rm --name dask-scheduler --network dask-net -p 8786:8786 -p 8787:8787 dask-cluster-image

In [None]:
!docker ps

Check the logs to confirm that the service has started.

In [None]:
!docker logs dask-scheduler
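
Besides reading the logs, the published scheduler port can be probed from the notebook before creating a `Client`. This is a minimal sketch using only the standard library; `wait_for_port` is a hypothetical helper. A successful connection only shows that something accepts TCP connections on the port, not that the scheduler is fully ready.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, else False."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)  # nothing listening yet; retry shortly
    return False


# Usage: the scheduler port published with -p 8786:8786
# print(wait_for_port("localhost", 8786))
```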

now create dask-worker-1

In [None]:
!docker run -d --rm --name dask-worker-1 --network dask-net dask-cluster-image dask-worker tcp://dask-scheduler:8786

In [None]:
!docker ps

In [None]:
!docker logs dask-worker-1

create a second worker

In [None]:
!docker run -d --rm --name dask-worker-2 --network dask-net dask-cluster-image dask-worker tcp://dask-scheduler:8786

In [None]:
!docker ps

In [None]:
!docker logs dask-worker-2

In [None]:
from dask.distributed import Client

client = Client("tcp://localhost:8786")

client

In [None]:
%%time
import dask.array as da

# Create a Dask array and perform some computations
array = da.random.random((30000, 30000), chunks=(1000, 1000))
result = array.mean().compute()

print(result)

#### clean up!

In [None]:
client.close()

# docker clean up
!docker stop dask-scheduler
!docker stop dask-worker-1
!docker stop dask-worker-2
!docker network rm dask-net

This all works!

## Create a dask cluster on EC2 instances

### Preliminary work

- Create IAM role (see at the top)
- Create IAM user (see at the top)
- Create a key pair (see at the top)
- Create an Amazon ECR Repository
    - Navigate to Amazon ECR in the AWS Management Console.
    - Click Create Repository.
    - Name your repository, e.g., `dask-cluster-image`.
    - Choose other settings as needed, then click Create Repository.

The commands below are shown when you select the radio button of the repository and then click `View push commands`.

In [None]:
!aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 654654225119.dkr.ecr.us-east-1.amazonaws.com

In [None]:
!docker tag dask-cluster-image:latest 654654225119.dkr.ecr.us-east-1.amazonaws.com/dask-cluster-image:latest

In [None]:
!docker push 654654225119.dkr.ecr.us-east-1.amazonaws.com/dask-cluster-image:latest
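
The registry address in the three commands above follows a fixed pattern, so only the account number changes between sandbox accounts. A small sketch that builds it; `ecr_image_uri` is a hypothetical helper, and the account number is the one used in this notebook.

```python
def ecr_image_uri(account_id: str, region: str, repository: str, tag: str = "latest") -> str:
    """Build the image URI used by `docker tag` and `docker push` for ECR."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


print(ecr_image_uri("654654225119", "us-east-1", "dask-cluster-image"))
# 654654225119.dkr.ecr.us-east-1.amazonaws.com/dask-cluster-image:latest
```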

start 2 EC2 instances and use the user script below

**remark** - change the last command to match the sandbox account

Make sure you link your `dask-keys` key pair and the `dask-cluster-ec2-role` role that you created earlier.

while the instances are created open ports on the security group

![image.png](attachment:f8ae9183-61ee-41f0-ab9e-9528b94d2890.png)

On all instances pull down the image

on scheduler

on worker - please note we use `--network host` such that all ports are open

you need to adjust the `<accountnr>.dkr` part and the internal address in tcp

Under construction: the steps below do not work yet.

Logs to inspect on the instances:

- /var/log/cloud-init.log
- /var/log/cloud-init-output.log

In [1]:
from dask.distributed import Client

client = Client("tcp://44.206.236.130:8786")

client

**Client**
- Connection method: Direct
- Dashboard: http://44.206.236.130:8787/status

**Scheduler**
- Comm: tcp://172.17.0.2:8786
- Dashboard: http://172.17.0.2:8787/status
- Workers: 1
- Total threads: 2
- Started: 17 minutes ago
- Total memory: 3.82 GiB

**Worker**
- Comm: tcp://172.17.0.2:30000
- Dashboard: http://172.17.0.2:30002/status
- Nanny: tcp://172.17.0.2:30001
- Local directory: /tmp/dask-scratch-space/worker-2a3et4sw
- Total threads: 2
- Memory: 3.82 GiB
- Tasks executing, in memory, ready, in flight: (no values shown)
- CPU usage: 2.0%
- Last seen: Just now
- Memory usage: 146.66 MiB
- Spilled bytes: 0 B
- Read bytes: 286.30768857385084 B
- Write bytes: 1.44 kiB


2024-10-21 21:48:47,263 - distributed.client - ERROR - Failed to reconnect to scheduler after 30.00 seconds, closing client


In [2]:
%%time 
import dask.array as da

xd = da.random.normal(10, 0.1, size=(10_000, 10_000), chunks=(3000, 3000))
yd = xd.mean(axis=0)
yd.compute()



KeyboardInterrupt: 
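
One thing that stands out in the client output above is that the scheduler and worker advertise `172.17.0.2`, which falls in Docker's default bridge range, while the client connects through the public address. Whether this causes the hang is an assumption that the notebook does not confirm. The sketch below only classifies the addresses; `is_private` is a hypothetical helper built on the standard library.

```python
import ipaddress
from urllib.parse import urlparse


def is_private(address: str) -> bool:
    """Return True if the host part of a Dask address is a private IP."""
    host = urlparse(address).hostname
    return ipaddress.ip_address(host).is_private


for addr in (
    "tcp://44.206.236.130:8786",  # address the client connects to
    "tcp://172.17.0.2:8786",      # address the scheduler advertises
    "tcp://172.17.0.2:30000",     # address the worker advertises
):
    print(addr, "private" if is_private(addr) else "public")
```

Private addresses are not reachable from outside the host or VPC, so any component that has to dial them directly from another machine will time out.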

In [None]:
import dask.dataframe as dd


# Read all CSV files from the root of the bucket
# ddf = dd.read_csv("s3://dask-input-data/*.csv", 
#                   parse_dates={"Date": [0, 1, 2]},
#                   dtype={"TailNum": str, "CRSElapsedTime": float, "Cancelled": bool},
#                   blocksize="10MB" )

# Read all CSV files from the root of the bucket
ddf = dd.read_csv("s3://dask-input-data/*.csv", 
                  dtype={"TailNum": str, "CRSElapsedTime": float, "Cancelled": bool},
                  blocksize="25MB" )


ddf

In [None]:
%%time
len(ddf)

In [None]:
%%time 
ddf[~ddf.Cancelled].groupby("Origin")["Origin"].count().compute()

In [None]:
client.close()