<div align="center">
    <img src = "../assets/dask-logo.svg" alt="Dask logo" width="20%">
</div>

---

# Example: Using Dask, Dask Gateway, and adaptive scaling

In this notebook, we will explore some [Indian stock market data](https://www.kaggle.com/datasets/debashis74017/stock-market-data-nifty-50-stocks-1-min-data?select=ASIANPAINT_minute_data_with_indicators.csv).

In [None]:
# Location of Indian stock market data on Google Storage
data_uri = "gs://nebari-public/nifty_stock_market_data/stock-market-data-india"

import warnings
warnings.filterwarnings("ignore")

## Explore a dataset with Dask DataFrame

First, let's see how large this dataset is.

In [None]:
# Determine the size of the dataset using gcsfs
from gcsfs import GCSFileSystem

gs = GCSFileSystem()
print(f"{gs.du(data_uri) / 1e9} GB")
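
The cell above prints a long, unrounded float. As an optional convenience, here is a minimal sketch of a formatter for byte counts. The `format_size` helper is hypothetical (it is not part of gcsfs or Dask) and uses decimal units, matching the `1e9` divisor above.

```python
def format_size(num_bytes):
    """Format a byte count with decimal (power-of-1000) units."""
    units = ["B", "kB", "MB", "GB", "TB"]
    size = float(num_bytes)
    for unit in units:
        # Stop at the first unit that keeps the number below 1000,
        # or at the largest unit we know about.
        if size < 1000 or unit == units[-1]:
            return f"{size:.2f} {unit}"
        size /= 1000


# Made-up byte count, used only to show the output format
print(format_size(1_500_000_000))
```

In the notebook, you could call it as `print(format_size(gs.du(data_uri)))`.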

In the following notebook cells you will load the stocks dataset into a Dask DataFrame and view the first few elements.

In [None]:
# Import Dask's Dask DataFrame API
import dask.dataframe as dd

In [None]:
# Read CSV files using a glob-pattern into a Dask DataFrame
ddf = dd.read_csv(data_uri + "/*.csv")

In [None]:
# View the lazy Dask DataFrame
ddf

In [None]:
# Inspect the first few values of the DataFrame (`head` calls `compute` internally)
ddf.head()

**Convert your Dask DataFrame to a pandas DataFrame to load the entire dataset into local memory.**

> ⚠️ Warning! This will crash your kernel because of insufficient local memory! You'll need to restart the kernel and read the dataset in again.

In [None]:
# Convert a Dask DataFrame to a pandas DataFrame
# Uncomment the next line to run. This will crash your kernel!

# df = ddf.compute()

As we mentioned earlier, Dask computations look very similar to pandas with an extra `compute()` at the end.

---

## Scale to large dataset with Dask Gateway

You can now scale your computation to all the ~100 files in the dataset using a Dask cluster with Dask Gateway.

### Create a Dask Gateway instance

As the first step, import and instantiate Dask Gateway.

In [None]:
from dask_gateway import Gateway

gateway = Gateway()
gateway

Open the `Cluster Options` widget where you can view and update cluster configurations like the conda environment, instance type, and any environment variables.

In [None]:
options = gateway.cluster_options()
options

This is a visual example, but all of this can also be done programmatically:

```python
# `conda_env` and `worker_type` are placeholders: use values offered by your deployment
options.conda_environment = conda_env
options.profile = worker_type
options.environment_vars = {"MYENV": "aNeNvVaR"}
```

> ⚠️ Warning: The environment used for your notebook (that is, the IPython kernel) must match the Dask worker environment (that is, `options.conda_environment`).
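
Once a client is connected (later in this notebook), one way to check for environment drift is to compare the package versions reported by `client.get_versions()`. The sketch below is a hedged illustration: `find_version_mismatches` is a hypothetical helper, the dictionary layout is an assumption modeled on what `get_versions()` returns, and the addresses and version numbers are made up.

```python
def find_version_mismatches(versions):
    """Return {worker_address: {package: (client_version, worker_version)}}."""
    client_packages = versions["client"]["packages"]
    mismatches = {}
    for address, info in versions["workers"].items():
        # Keep only the packages whose version differs from the client's
        diffs = {
            name: (client_packages.get(name), worker_version)
            for name, worker_version in info["packages"].items()
            if client_packages.get(name) != worker_version
        }
        if diffs:
            mismatches[address] = diffs
    return mismatches


# Hand-written stand-in for the dictionary returned by client.get_versions()
example_versions = {
    "client": {"packages": {"dask": "2024.1.0", "pandas": "2.1.4"}},
    "workers": {
        "tls://10.0.0.1:8786": {"packages": {"dask": "2024.1.0", "pandas": "2.1.4"}},
        "tls://10.0.0.2:8786": {"packages": {"dask": "2023.12.0", "pandas": "2.1.4"}},
    },
}

mismatches = find_version_mismatches(example_versions)
print(mismatches)
```

With a live client, you would pass `client.get_versions()` instead of `example_versions`.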

### Create a new Dask cluster and connect to a Client

In [None]:
# Create a new cluster with the above options
cluster = gateway.new_cluster(options)

In [None]:
# View the cluster widget
cluster

> Once the cluster is initialized, you'll need to log in to the dashboard via Keycloak before connecting to it using the JupyterLab extension. Click on the dashboard link above and log in now.


The cluster starts with zero workers, so you need to either set the number of workers manually or set up **adaptive scaling**. With adaptive scaling, your cluster automatically resizes itself between the minimum and maximum bounds based on the workload. Learn more in Dask's [adaptive deployments documentation](https://docs.dask.org/en/stable/how-to/adaptive.html).

**In the above UI, set up adaptive scaling with a minimum of 1 worker and a maximum of 10 workers.**

Image source: [Dask documentation](https://docs.dask.org/en/stable/how-to/adaptive.html)

<img src="../assets/dask-adaptive.svg" alt="Dask adaptive scaling" width="30%">



In [None]:
# Enable adaptive scaling
cluster.adapt(minimum=1, maximum=10)

The cell above is the programmatic equivalent of the adaptive setting in the UI:

```python
cluster.adapt(minimum=1, maximum=10)
```
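
To build intuition for what the `minimum` and `maximum` bounds do, here is a toy sketch. It is not Dask's actual scaling algorithm; `target_workers` and its `tasks_per_worker` parameter are hypothetical and only illustrate how a workload-driven target is clamped between the two bounds.

```python
import math


def target_workers(pending_tasks, tasks_per_worker, minimum, maximum):
    """Toy model: workers wanted for the workload, clamped to [minimum, maximum]."""
    desired = math.ceil(pending_tasks / tasks_per_worker)
    return max(minimum, min(maximum, desired))


# Idle cluster: stays at the minimum
print(target_workers(pending_tasks=0, tasks_per_worker=4, minimum=1, maximum=10))
# Moderate load: scales with the workload
print(target_workers(pending_tasks=20, tasks_per_worker=4, minimum=1, maximum=10))
# Heavy load: capped at the maximum
print(target_workers(pending_tasks=400, tasks_per_worker=4, minimum=1, maximum=10))
```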

In [None]:
# Connect a new client to the Gateway cluster
client = cluster.get_client()

In [None]:
# View the client widget
client

The `Dask Client` interface gives us a brief summary of everything we've set up so far. 

### Dask's diagnostic dashboard

Open the Dask dashboard by clicking on the link in the Client UI.

Alternatively (recommended), use the JupyterLab extension in the left sidebar to open:

* Cluster map
* Task stream
* Progress bar
* Worker memory plots (Optional)
* Task groups plot (Optional)

## Computation on the large dataset

### Stock data compute

With the Dask cluster running, we have the resources to do some computation!

Let's compute the highest `high` and lowest `low`. Make sure to look at the dashboard plots!

In [None]:
# Compute highest-high
# Uncomment the next line to run. This will NOT crash your kernel, 
# but it might take a little while as workers get spun up.

# ddf.high.max().compute()

### Standalone example with Dask Array

The previous example reads data from cloud storage, which can take time. Here is an example with Dask Array that you can execute immediately!

In [None]:
import dask.array as da

In [None]:
x = da.random.random((100000, 100000), chunks=(1000, 1000))
x
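
The array summary above reports the total size and the chunk layout. The arithmetic behind those numbers can be sketched as follows, assuming 8-byte `float64` values (the default for `da.random.random`) and decimal units.

```python
shape = (100_000, 100_000)
chunks = (1_000, 1_000)
itemsize = 8  # bytes per float64 value

# Size of the full array and of a single chunk
total_bytes = shape[0] * shape[1] * itemsize
chunk_bytes = chunks[0] * chunks[1] * itemsize

# Number of chunks along each axis, multiplied together
n_chunks = (shape[0] // chunks[0]) * (shape[1] // chunks[1])

print(f"total: {total_bytes / 1e9} GB")
print(f"per chunk: {chunk_bytes / 1e6} MB")
print(f"chunks: {n_chunks}")
```

The full array is far larger than typical local memory, but each chunk is small enough for a worker to process on its own.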

In [None]:
y = x * x
z = y.mean(axis=1)

In [None]:
z.compute()

## Shut down the cluster

**ALWAYS** remember to shut down your cluster with the following commands.

> ⚠️ Warning: As with JupyterLab servers, Dask workers run on cloud compute instances and cost actual money.

In [None]:
# Close the client first, then shut down the cluster
client.close()
cluster.close()
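
If an error interrupts the notebook before this cell runs, the cluster keeps running. One possible safeguard, sketched below, is to wrap the work so that cleanup always happens. `run_then_close` is a hypothetical helper built on the standard library's `contextlib`; it works with any objects that have a `close()` method. `FakeResource` is a stand-in used only so the sketch runs without a real cluster.

```python
from contextlib import ExitStack, closing


def run_then_close(cluster, client, work):
    """Run work(client), then close the client and the cluster, even on error."""
    with ExitStack() as stack:
        # ExitStack unwinds in reverse order: the client is closed
        # first, then the cluster.
        stack.enter_context(closing(cluster))
        stack.enter_context(closing(client))
        return work(client)


class FakeResource:
    """Stand-in for a cluster or client; records when close() is called."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def close(self):
        self.log.append(self.name)


close_order = []
result = run_then_close(
    FakeResource("cluster", close_order),
    FakeResource("client", close_order),
    work=lambda client: "done",
)
print(result, close_order)
```

With the real objects from this notebook, the call could look like `run_then_close(cluster, client, lambda c: ddf.high.max().compute())`.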

---
## 👏 Next:
* [04_visualizations_and_dashboards](../04_visualizations_and_dashboards.ipynb)
---