# K8s Dask notebook

## 1️⃣ Import and access the cluster

In [2]:
from dask_kubernetes import HelmCluster
cluster = HelmCluster(release_name="dask-release")
cluster

You can scale the cluster from inside dask!

In [44]:
# cluster.scale(4)

## 2️⃣ Plug the cluster into the client

Run the code below 👇 to connect the client to the **cluster**. You will see some warnings about mismatched versions ⚠️. This is okay as long as the mismatched library is not one we rely on when we distribute our code!

In [42]:
from distributed import Client

client = Client(cluster)

### 🧪 Testing our cluster

The block below runs an arbitrary piece of code on a dask array. Start by scaling the cluster down to **1** worker, then up to our maximum of **6** (make sure to wait for the **pods** to be ready so you can see the actual speed difference!). Don't forget to also check out the dashboard we opened earlier to see how the tasks are distributed!

In [43]:
%%time
import dask.array as da
x = da.random.random((10000, 10000), chunks=(1000, 1000))
y = x + x.T
z = y[::2, 5000:].mean(axis=1)
z.compute()
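
To see why this computation can be spread across workers, it may help to look at how the array is split into chunks before anything is computed. The sketch below rebuilds the same lazy array and only inspects its metadata, so it runs locally without the cluster. It assumes `dask[array]` is installed.

```python
import dask.array as da

# Same shape and chunking as the cell above; building the array is lazy,
# so no random numbers are generated here.
x = da.random.random((10000, 10000), chunks=(1000, 1000))
y = x + x.T
z = y[::2, 5000:].mean(axis=1)

print(x.numblocks)    # number of chunks along each axis
print(x.npartitions)  # total number of chunks
print(z.shape)        # shape of the result, known before computing
```

Each chunk becomes at least one task in the graph, which is what the workers pick up when `z.compute()` is called.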

## 3️⃣ Utilising our power 💪

🎯 Our goal is to get all of the summary statistics from pandas `describe` for the yellow taxi data from `2020-2021`.

### 3.1 One file

In [36]:
import pandas as pd

In [37]:
%%time
df = pd.read_parquet("https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet")

In [38]:
%%time
description = df.describe()

In [40]:
round(description, 2)

Now you can see how long it takes to load and describe a single parquet file, and we need **twenty-four** of them!

### 3.2 Pure pandas

Here is the code we used for the **naive** pandas version. Check out the run time, **no need to run it yourself!**

In [16]:
import pandas as pd

In [41]:
monthly_data = []

In [42]:
%%time 
for i in range(1,13):
    if i < 10:
        i = f"0{i}"
    df = pd.read_parquet(f"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-{i}.parquet")
    monthly_data.append(df.describe())
    df = pd.read_parquet(f"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2020-{i}.parquet")
    monthly_data.append(df.describe())

In [34]:
round(sum(monthly_data)/24)
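
The file names need a zero-padded month (`01` to `12`). One compact way to build all twenty-four URLs is a format specifier, sketched below. `BASE_URL` and `taxi_url` are hypothetical names introduced only for this illustration; the URL pattern itself is the one used in the cells above.

```python
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def taxi_url(year: int, month: int) -> str:
    """Build the parquet URL for one month; {month:02d} zero-pads to two digits."""
    return f"{BASE_URL}/yellow_tripdata_{year}-{month:02d}.parquet"


# One URL per month for each of the two years, in the same order as the loop above.
urls = [taxi_url(year, month) for month in range(1, 13) for year in (2021, 2020)]
print(len(urls))
print(urls[0])
```

This only builds strings, so it is safe to run without downloading anything.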

### 3.3 DASK 💪

First import **dask**

In [17]:
import dask

Then we decorate the functions we want to distribute with `@dask.delayed`. Here we have created two separate tasks to help you distinguish them on the dashboard!

In [25]:
monthly_data = []

@dask.delayed
def twenty_twenty_monthly_describe(i: int):
    if i < 10:
        i = f"0{i}"
    df = pd.read_parquet(f"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2020-{i}.parquet")
    return df.describe()

@dask.delayed
def twenty_twenty_one_monthly_describe(i: int):
    if i < 10:
        i = f"0{i}"
    df = pd.read_parquet(f"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-{i}.parquet")
    return df.describe()

We then create a similar list, but this time it is strangely fast, even for **k8s**. Check out the list!

In [26]:
for i in range(1,13):
    monthly_data.append(twenty_twenty_monthly_describe(i))
    monthly_data.append(twenty_twenty_one_monthly_describe(i))


In [27]:
monthly_data

They are all delayed objects: **dask** is lazy, so nothing runs until we call compute, which then works out the result of each object! Don't forget to watch the dashboard and see dask tear through this task!

In [28]:
%%time
monthly_data = dask.compute(monthly_data)

In [33]:
round(sum(monthly_data[0])/24,2)
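
The lazy behaviour can also be seen on a tiny scale without a cluster. The sketch below uses a hypothetical `square` function as a stand-in for the monthly describe tasks and runs on dask's local synchronous scheduler, so it is only an illustration of the pattern, not the notebook's actual workload.

```python
import dask


@dask.delayed
def square(n: int) -> int:
    # Hypothetical stand-in for an expensive task such as a monthly describe.
    return n * n


# Calling the decorated function only records a task; nothing is executed yet.
tasks = [square(n) for n in range(1, 4)]
print(type(tasks[0]).__name__)

# dask.compute returns a tuple with one entry per argument. We passed a single
# list, so the computed values sit at index 0 of the result.
results = dask.compute(tasks, scheduler="synchronous")
total = sum(results[0])
print(results, total)
```

This is why the cell above indexes `monthly_data[0]` after calling `dask.compute`.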

# Head back to the readme if you want to explore some more **dask** or delete your **cluster** now!