# Dask — Complete Notes & Roadmap (Beginner → Advanced)

This notebook is a structured, practical guide to Dask for data engineers and ML teams. Work through the sections in order; each one contains short notes, examples, and exercises.

## Setup & Installation

Install Dask with its optional dependencies (recommended; the `complete` extra already includes `distributed`):

```bash
pip install "dask[complete]" distributed --upgrade
# Optional: for dask-ml, xgboost, and cloud storage
pip install dask-ml xgboost s3fs gcsfs adlfs
```

The examples below start a `Client()`, which launches a local cluster and enables the diagnostic dashboard.

## Beginner — Fundamentals

### What is Dask?
- Dask is a flexible parallel computing library for analytics.
- It scales Python workloads from a laptop to a cluster by parallelizing existing Python libraries (NumPy, Pandas, Scikit-learn).

### When to use Dask vs Pandas / Spark
- Use **Pandas**: small data that fits in memory and single-machine workflows.
- Use **Dask**: when you want to scale existing pandas/NumPy code to out-of-core or multiple cores/machines with minimal changes.
- Use **Spark**: if you need JVM ecosystem integrations, run on very large clusters, or your organization already has mature Spark tooling in production.

### Dask ecosystem
- **dask.dataframe (dd)** — pandas-like API for tabular data.
- **dask.array (da)** — NumPy-like arrays.
- **dask.bag** — for unstructured or semi-structured data.
- **dask.delayed** — turn arbitrary Python functions into lazy tasks.
- **dask.distributed** — scheduler & cluster management (Client, workers, dashboard).
- **dask-ml** — scalable ML utilities.

In [None]:
# Basic imports and starting a local Client (distributed scheduler)
from dask.distributed import Client, LocalCluster
import dask.dataframe as dd
import dask.array as da

# Start a local client. In many environments, simply calling Client() will suffice.
client = Client()  # opens a local cluster and exposes a dashboard link in many environments
print(client)

### Dask DataFrame — Basics
- Dask DataFrame mirrors the pandas API but operations are **lazy**.
- Data is partitioned into many pandas DataFrames, each processed in parallel.
- Use `.compute()` to get results, `.persist()` to cache intermediate results in memory.

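**Key differences**: `.iloc` supports only column selection (`df.iloc[:, cols]`), not positional row access, because row positions are not tracked across partitions. Some pandas operations are unsupported, and others are expensive because they require a shuffle.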

In [None]:
# Read multiple CSV files (the sample CSVs created alongside this notebook)
import dask.dataframe as dd
sample_path = "Dask_Tutorial/dask_tutorial_with_data/dask_sample_data/sample_part_*.csv"
ddf = dd.read_csv(sample_path, parse_dates=['timestamp'])
ddf.head()  # computes immediately; by default it reads only the first partition

In [None]:
# Basic operations: selection, filter, groupby, aggregations
# Note: these return lazy Dask objects until compute() is called.
df = ddf
print('npartitions:', df.npartitions)

# Select columns
sel = df[['id', 'timestamp', 'value']]

# Filter rows
filtered = df[df['flag'] == 1]

# Groupby and aggregation (lazy)
agg = df.groupby('category').value.mean()

# Compute results
print('Aggregations result:')
print(agg.compute())

In [None]:
# Visualize the task graph for an operation (requires graphviz installed to render locally)
agg.visualize(filename='dask_agg_graph')  # saves a .png/svg in working directory if graphviz available
print('Saved a visualization file if graphviz is present.')

### Handling Missing Values
- Use `.isna()`, `.dropna()`, `.fillna()` — same API as pandas.
- Repartitioning before heavy operations can help (but avoid unnecessary shuffles).

**Tip**: For large datasets, prefer operations that act partition-wise to avoid costly shuffles.

In [None]:
# Example: fillna and dropna (partition-wise)
# Create a column with some missing values for demo
df2 = ddf.assign(v2=ddf['value'])
df2['v2'] = df2['v2'].mask(df2['v2'] < df2['v2'].quantile(0.05), None)
res = df2['v2'].fillna(df2['v2'].mean())
print('Computed mean-filled sample:')
print(res.head())  # head() already returns a pandas object, so no .compute() is needed

### Saving Output
- Writing to Parquet is recommended (columnar, faster reads, schema and metadata preserved).

```python
ddf.to_parquet('out/parquet_dataset')
# or write partitioned by a column
ddf.to_parquet('out/parquet_dataset', partition_on=['category'])
```

Compared with CSV, Parquet is typically faster to read, and it is a good fit for downstream processing and cloud object storage.

## Intermediate — Performance & Scaling

### Partitions
- Check partitions: `ddf.npartitions`.
- Repartition: `ddf.repartition(npartitions=...)` or `ddf.repartition(divisions=...)`.
- A common rule of thumb is roughly 100–250 MB per partition; the right size depends on the workload and on worker memory.

### Task Graphs
- Dask builds a DAG of tasks; operations are lazy until `.compute()`.
- `.visualize()` lets you inspect the DAG; pass `optimize_graph=True` to render the graph after Dask's optimizations (such as task fusion), which is often smaller and easier to read.
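A task graph can also be inspected without graphviz. The sketch below builds a three-step graph with `dask.delayed` and looks at it as a plain dictionary; `inc` and `add` are throwaway helper functions defined only for this example.

```python
from dask import delayed

def inc(x):
    return x + 1

def add(x, y):
    return x + y

# Build the graph lazily: two independent increments feeding one addition.
a = delayed(inc)(1)
b = delayed(inc)(2)
total = delayed(add)(a, b)

# The graph is a mapping from task keys to task definitions.
graph = dict(total.__dask_graph__())
n_tasks = len(graph)
print("tasks in graph:", n_tasks)

# Only now is any work executed. The synchronous scheduler runs everything
# in the current thread, which is handy for debugging.
value = total.compute(scheduler="synchronous")
print("result:", value)
```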

In [None]:
# Example: repartition and checking sizes
print('Original npartitions:', ddf.npartitions)
small = ddf.repartition(npartitions=3)
print('New npartitions:', small.npartitions)
small.head()

### Dask Delayed
- Use `dask.delayed` to parallelize arbitrary Python functions.
- Build custom DAGs and compute or persist only when needed.

### Dask Futures
- With `client.submit` and `client.map`, work starts running immediately and you get back futures that you can wait on, pass to other tasks, or cancel.
- Useful for asynchronous workflows and non-data-parallel tasks.

In [None]:
# Example: dask.delayed
from dask import delayed
import time

@delayed
def slow_double(x):
    time.sleep(0.2)
    return x * 2

tasks = [slow_double(i) for i in range(10)]
res = delayed(sum)(tasks)
print(res.compute())  # triggers parallel execution using the client

In [None]:
# Example: futures with client.map (client.submit does the same for a single call)
def add(x, y):
    return x + y

futures = client.map(add, range(5), range(5,10))
results = client.gather(futures)
print('Futures results:', results)
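For a single call, `client.submit` returns one future, and `as_completed` yields futures in the order they finish. This is a minimal sketch: it creates its own in-process client (`demo_client`, a name chosen here) with `processes=False` and no dashboard so that it runs anywhere, which is an assumption of the example rather than a recommendation for real workloads.

```python
from dask.distributed import Client, as_completed

def square(x):
    return x * x

# processes=False keeps the workers as threads inside this process;
# dashboard_address=None skips starting the dashboard server.
with Client(processes=False, dashboard_address=None) as demo_client:
    # submit() schedules the call right away and returns a future.
    single = demo_client.submit(square, 7)
    single_result = single.result()  # blocks until the task has finished

    # map() submits one task per input; as_completed() yields each future
    # as soon as it is done, in completion order rather than input order.
    futures = demo_client.map(square, range(5))
    finished = [f.result() for f in as_completed(futures)]

results = sorted(finished)
print(single_result, results)
```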

### Optimization Techniques
- `persist()` caches intermediate results on workers — good for re-use.
- Use `compute()` on final results; avoid calling `.compute()` inside loops or prematurely.
- Minimize shuffles (joins, groupbys across partitions).
- Use categorical columns where possible to reduce memory.
- Monitor the dashboard for memory and task-time hotspots.

**Example:** `df = df.categorize(columns=['category'])`

In [None]:
# Persist example
cached = ddf.persist()
print('Persisted dataframe with npartitions:', cached.npartitions)
# Small computation; head() already returns a pandas DataFrame, so no .compute() is needed
print(cached.head())

## Advanced — Production

### Distributed scheduler internals
- Work stealing, task fusion, and resource restrictions help with efficient scheduling.
- You can set worker resources and constraints (e.g., `resources={'GPU':1}`).

### Dask on Kubernetes / Cloud
- Use the `dask-kubernetes` package or the Dask Helm chart to deploy clusters with autoscaling.
- Integrate with S3/GCS using `s3fs` and `gcsfs`.

### Dask Array
- Similar API to NumPy. Use chunking and rechunking for efficient computations.
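A short sketch of chunking and rechunking, using an array of ones so the result is easy to verify. The shapes and chunk sizes are arbitrary choices for illustration.

```python
import dask.array as da

# A 1000 x 1000 array stored as a 4 x 4 grid of 250 x 250 chunks.
x = da.ones((1000, 1000), chunks=(250, 250))
print(x.numblocks)

# Reductions run per chunk and are then combined.
total = float(x.sum().compute())
print(total)

# Rechunk so that each chunk spans full rows, which suits row-wise work.
y = x.rechunk((500, 1000))
print(y.numblocks)

# The values are unchanged; only the layout differs.
row_sums = y.sum(axis=1).compute()
print(row_sums[:3])
```

Rechunking moves data between chunks, so it is cheapest to choose a chunk layout that matches the access pattern when the array is first created.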

### Dask Bag
- For unstructured/line-based data (logs, JSON), with `bag.map`, `bag.filter`.
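As an illustration of `bag.map` and `bag.filter`, the sketch below parses a few JSON log lines and keeps the users that hit server errors. The log lines are made up, and the threaded scheduler is chosen explicitly so the lambda does not need to be sent to other processes.

```python
import json

import dask.bag as db

# Hypothetical log lines, one JSON record per line.
lines = [
    '{"user": "alice", "status": 200}',
    '{"user": "bob", "status": 500}',
    '{"user": "alice", "status": 500}',
    '{"user": "carol", "status": 200}',
]

bag = db.from_sequence(lines, npartitions=2)

# Parse each line, keep server errors, and extract the user field.
errors = (
    bag.map(json.loads)
       .filter(lambda record: record["status"] >= 500)
       .pluck("user")
)

error_users = sorted(errors.compute(scheduler="threads"))
print(error_users)
```

For real log files, `db.read_text` reads lines from one or many files into a bag in the same way.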

### Time series with Dask
- Partition by time and use `set_index('timestamp')` carefully (may require shuffle).
- Rolling and resampling rely on a sorted datetime index with known divisions; keep each partition a contiguous time range so that a window only needs rows from adjacent partitions.

### Dask for ML
- dask-ml provides parallelized hyperparameter search and scalable estimators.
- Many scikit-learn estimators can be parallelized using joblib or dask-ml wrappers.

### XGBoost + Dask
- XGBoost ships a Dask interface (`xgboost.dask`) for distributed training across workers (CPU/GPU).

### Deploying pipelines
- Orchestrate with Airflow/Prefect, monitor with Prometheus/Grafana, and make tasks idempotent so that retries are safe.

### Performance debugging
- Use `client.profile()`, worker logs, and `client.run_on_scheduler` for introspection.
- Avoid large Python objects in the graph; prefer DataFrames/arrays.
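One way to keep a large object out of every task is to wrap it in `delayed` once and pass the wrapper around, so the graph holds a single copy. This is a sketch of that idea; the `lookup` array and the `pick` function are hypothetical stand-ins for a real lookup table and a real task.

```python
import numpy as np
import dask
from dask import delayed

# A moderately large object that many tasks need to read.
lookup = np.arange(1_000_000)

# Wrap it once: the array becomes a single node in the graph instead of
# being embedded separately in every task that uses it.
lookup_d = delayed(lookup)

@delayed
def pick(table, i):
    # Read one element from the shared table.
    return int(table[i])

tasks = [pick(lookup_d, i) for i in (1, 10, 100)]
values = dask.compute(*tasks, scheduler="threads")
print(values)
```

On a distributed cluster, `client.scatter` serves a similar purpose by placing the object on the workers ahead of time.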

## Expert (Optional)
- Custom schedulers, CUDA workers with RAPIDS, writing custom blockwise ops, and low-level graph optimization.

---

## Exercises & Practice
1. Read the sample CSV files using `dd.read_csv` and compute the mean `value` per `category`.
2. Convert `category` to categorical and measure memory improvement.
3. Create an expensive Python function and parallelize it using `dask.delayed` and `client.map`. Compare runtimes.
4. Write out a filtered subset to Parquet partitioned by `category`.

Run these exercises and watch the dashboard while they execute.