# Dask for large-scale computations

## What is [Dask](https://www.dask.org/)?

How Dask describes itself:

> Dask is a parallel and distributed computing library that scales the existing Python and PyData ecosystem.
>
> Dask can scale up to your full laptop capacity and out to a cloud cluster.

Source: [Dask Tutorial](https://tutorial.dask.org/00_overview.html)

## Core advantages of Dask

As a brief overview, Dask offers Dask Collections, Dask Cluster, and several additional packages such as Dask-ML and Dask-SQL.

Dask Collections are the part of the library dedicated to multi-core, distributed execution on larger-than-memory datasets. This is ultimately the key reason to use Dask: your dataset is larger than the memory of a single machine, but you'd still like to work with the data as you would with NumPy or pandas.

The high-level API includes [Dask Array](https://docs.dask.org/en/stable/array.html), which implements a subset of NumPy's `ndarray` interface; [Dask Bag](https://docs.dask.org/en/stable/bag.html), which implements generic Python operations like `map`, `filter`, and `groupby`; and [Dask DataFrame](https://docs.dask.org/en/stable/dataframe.html), which implements a subset of pandas' `DataFrame` and `Series` interfaces.
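
To make two of these collections concrete, here is a minimal sketch using Dask Array and Dask Bag on a single machine. The array size, chunk size, and sample numbers are illustrative assumptions and do not come from the Nebari examples.

```python
import dask.array as da
import dask.bag as db

# Dask Array: a 1-D array of ones, split into 4 chunks of 250 elements each.
x = da.ones(1000, chunks=250)
total = x.sum()            # lazy: this only builds a task graph
print(total.compute())     # .compute() triggers the actual work

# Dask Bag: generic Python operations over a partitioned sequence.
numbers = db.from_sequence(range(10), npartitions=2)
even_squares = numbers.filter(lambda n: n % 2 == 0).map(lambda n: n * n)

# The synchronous (single-threaded) scheduler keeps this tiny demo simple;
# real workloads would normally use a parallel scheduler or a cluster.
print(even_squares.compute(scheduler="synchronous"))
```

Nothing is computed until `.compute()` is called. Until then, each operation only adds to the task graph, which is what lets Dask process data chunk by chunk instead of loading everything at once.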

> A low-level API, Dask Delayed and Dask Futures, is also available but beyond the scope of this tutorial.

The advantage of the approach taken by Dask is that if you are familiar with NumPy's `ndarray` or pandas' `DataFrame`, you can get started using Dask Array or Dask DataFrame rather quickly.

## What is [Dask Gateway](https://gateway.dask.org/)?

**Dask-Gateway comes standard with your Nebari deployment.**

Although Dask can be used on a single machine, Dask Cluster is the part of the library dedicated to distributing your workload across a cluster of machines. Dask can be set up to run on a variety of backend clusters, including Kubernetes, Docker, HPC, YARN/Hadoop, Dask-Gateway, and more.

For the purposes of this Nebari tutorial, when we refer to Dask running on a distributed cluster, we mean connecting to Dask-Gateway.

This means that users with access to Dask-Gateway (more on user permissions in a later notebook) simply need to connect to the gateway to submit their workloads to the Dask cluster. See either of the links at the bottom of this page for a concrete example of how this is done.
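
As a rough sketch of what "connecting to the gateway" can look like with the `dask_gateway` client library: the helper name `connect_to_gateway` and the worker count are hypothetical, and the code assumes the gateway address and authentication are already provided by your deployment's configuration, so `Gateway()` needs no arguments. Your deployment may also require cluster options, such as an environment or worker profile, that are not shown here.

```python
def connect_to_gateway(n_workers=2):
    """Start a cluster through Dask-Gateway and return (cluster, client).

    Hypothetical helper for illustration; not part of Nebari or Dask.
    """
    # Imported inside the function so that defining it does not require
    # dask_gateway or a reachable gateway.
    from dask_gateway import Gateway

    # Assumption: address and auth are picked up from existing configuration.
    gateway = Gateway()
    cluster = gateway.new_cluster()   # request a new Dask cluster
    cluster.scale(n_workers)          # ask for a fixed number of workers
    client = cluster.get_client()     # client bound to this cluster
    return cluster, client
```

After `cluster, client = connect_to_gateway()`, subsequent `.compute()` calls on Dask collections are sent to the cluster's workers. Call `cluster.shutdown()` when you are done to release the resources.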

## When is Dask used?

Dask Collections, such as Dask DataFrame, and Dask-Gateway are used together when you have a dataset that is larger than the memory of your local machine (or the Nebari JupyterLab server you are running on) but you still need to load it and manipulate it in some way.

## 👀 Watch this...

Here we run through a basic example of how to use Dask-Gateway in Nebari: [finance_examples/02_dask_gateway_adaptive_scaling.ipynb](./finance_examples/02_dask_gateway_adaptive_scaling.ipynb)

---
## 👏 Next:
* [03_managing_environments](./03_managing_environments.ipynb)
---