# Dask: Beginner → Advanced Tutorial

A hands-on Jupyter notebook covering core Dask topics from basics to advanced. Includes examples, notes, and practice exercises.

## Prerequisites

- Python 3.8+
- Basic knowledge of pandas and NumPy

**Optional** installs (run if you need them):

```bash
pip install "dask[complete]" distributed --upgrade
pip install matplotlib pandas nbformat
```

*(The install cell below is commented out; run it in your local environment if needed.)*

In [None]:
# !pip install "dask[complete]" distributed matplotlib pandas
# Uncomment and run the line above in your environment if Dask isn't installed

## Quick imports & version check
*Check installed versions of dask and distributed.*

In [None]:
import dask, distributed
import pandas as pd
import numpy as np

print('dask version:', dask.__version__)
print('distributed version:', distributed.__version__)
print('pandas version:', pd.__version__)

## Create sample CSV files
We'll create several small CSV files to demonstrate `dd.read_csv` with glob patterns and partitioning.

In [None]:
import os
import numpy as np
import pandas as pd
os.makedirs('sample_csvs', exist_ok=True)
n_files = 4
rows_per = 250
for i in range(n_files):
    df = pd.DataFrame({
        'id': range(i*rows_per, (i+1)*rows_per),
        'group': np.random.choice(['A','B','C'], size=rows_per),
        'value': np.random.randn(rows_per) * 100 + 500,
        'date': pd.date_range('2020-01-01', periods=rows_per).astype(str)
    })
    df.to_csv(f'sample_csvs/data_part_{i}.csv', index=False)
print('Created', n_files, 'CSV files in ./sample_csvs')

## Read multiple CSVs as a Dask DataFrame
Use `dd.read_csv` with a glob pattern. Dask creates at least one partition per matched file; files larger than the `blocksize` setting are split into several partitions.

In [None]:
import dask.dataframe as dd
ddf = dd.read_csv('sample_csvs/data_part_*.csv', parse_dates=['date'])
ddf

## Basic operations & lazy evaluation
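Dask operations are lazy: they build a task graph instead of running immediately. `.head()` computes only the first few rows (from the first partition by default), `.compute()` runs the whole graph and returns an in-memory result such as a pandas object, and `.persist()` runs the graph but keeps the result as a Dask collection held in memory.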

In [None]:
# Inspect dataframe (lazy)
print('npartitions =', ddf.npartitions)
print('columns =', ddf.columns.tolist())

# Compute a small sample
print('\n.head() sample:')
display(ddf.head())

# Perform a groupby aggregation (lazy until compute)
agg = ddf.groupby('group').value.mean()
print('\nAggregation object (lazy):', type(agg))

# Compute result
print('\nComputed result:')
display(agg.compute())

## Count NULL/NA values per column
Count nulls in every column, then count nulls plus empty strings in a single string column.

In [None]:
# Count nulls per column (Dask DataFrame approach)
null_counts = ddf.isnull().sum().compute()
print('Null counts per column:') 
print(null_counts)

# Count null+empty strings for object columns (compute needed)
def count_null_or_empty(col):
    return ((ddf[col].isnull()) | (ddf[col] == '')).sum()

# Example for 'group' column
print('\nNull or empty in "group":', count_null_or_empty('group').compute())

## Fill NULL values with mean (numeric columns)
Compute the column means, then pass them to `fillna` as a dictionary. Note: the means are global statistics, so they must be computed with a pass over all partitions (`.compute()`) before the fill can be defined.

In [None]:
# Build a list of numeric columns (covers all integer and float dtypes)
numeric_cols = ddf.select_dtypes(include='number').columns.tolist()

# Compute means for numeric columns
means = ddf[numeric_cols].mean().compute().to_dict()
print('Means:', means)

# Fill nulls using the means dict (returns a new ddf)
ddf_filled = ddf.fillna(means)
print(ddf_filled.head())

## Dask Delayed - convert Python functions into lazy tasks
`delayed` is useful for wrapping arbitrary Python functions and building custom DAGs.

In [None]:
from dask import delayed
import time

@delayed
def slow_square(x):
    time.sleep(0.1)
    return x*x

tasks = [slow_square(i) for i in range(10)]
sum_task = delayed(sum)(tasks)
res = sum_task.compute()  # runs the tasks in parallel on the default threaded scheduler
print('sum of squares:', res)
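
To illustrate a custom DAG with dependent stages, here is a small sketch in which each input flows through two delayed steps before a final reduction. The functions `load`, `clean`, and `combine` are hypothetical stand-ins for real ingestion and processing code.

```python
from dask import delayed


@delayed
def load(i):
    # Stand-in for reading one chunk of data: three consecutive integers.
    return list(range(i * 3, i * 3 + 3))


@delayed
def clean(chunk):
    # Keep only the even numbers in the chunk.
    return [x for x in chunk if x % 2 == 0]


@delayed
def combine(chunks):
    # Count the surviving items across all chunks.
    return sum(len(c) for c in chunks)


# Each clean() depends on its own load(); combine() depends on all of them.
cleaned = [clean(load(i)) for i in range(4)]
total_even = combine(cleaned)

# Nothing has run yet; compute() executes the whole graph.
result = total_even.compute()
print("even numbers in 0..11:", result)
```

The four `load` and `clean` chains are independent of each other, so the scheduler is free to run them in parallel and only the final `combine` has to wait for all of them.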

## Dask Futures (distributed) example
Create a `distributed.Client` to connect to a scheduler, then submit tasks as futures. Run this cell only if you can start a local client.

In [None]:
from distributed import Client
# client = Client(n_workers=2, threads_per_worker=2)  # uncomment to start a scheduler
# print(client)
# future = client.submit(lambda x: x + 1, 10)
# print('future result:', future.result())
print('This cell is a template: start a Client locally or on a cluster to use Futures.')
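
For a version that can be run as-is, the following sketch starts a small in-process cluster and uses `client.map`, `client.gather`, and `client.submit`. The settings are assumptions chosen to keep the example lightweight: `processes=False` runs the worker as threads in the current process, and `dashboard_address=None` disables the dashboard. The helper `inc` is illustrative.

```python
from dask.distributed import Client


def inc(x):
    return x + 1


# In-process cluster: no separate worker processes, no dashboard.
with Client(
    processes=False,
    n_workers=1,
    threads_per_worker=2,
    dashboard_address=None,
) as client:
    # map() submits one task per input and returns futures immediately.
    futures = client.map(inc, range(5))

    # gather() blocks until all futures finish and collects their results.
    results = client.gather(futures)

    # Futures can be passed to other tasks; the data stays on the workers.
    total = client.submit(sum, futures).result()

print("results:", results)
print("total:", total)
```

Unlike `delayed`, futures start executing as soon as they are submitted, which suits workloads where new tasks are created based on earlier results.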

## Visualize Task Graphs
Use `.visualize()` on Dask collections to render the task graph (requires Graphviz to be installed). The example below saves the graph to a file; whether it renders depends on your environment.

In [None]:
# Visualize the aggregation graph (saves to file if graphviz available)
try:
    agg = ddf.groupby('group').value.mean()
    agg.visualize(filename='agg_graph.png')  # requires graphviz in the environment
    print('Saved agg_graph.png (if graphviz installed).')
except Exception as e:
    print('Graphviz not available or visualization failed:', e)
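
If Graphviz is unavailable, the graph can still be inspected as plain data. The sketch below is one possible fallback: it uses `__dask_graph__()`, the collection-protocol method that returns the task mapping, to list the keys of a small delayed computation. The exact key names and graph layout are internal details and may differ between Dask versions.

```python
from dask import delayed


@delayed
def square(x):
    return x * x


# Three square tasks feeding one sum task.
total = delayed(sum)([square(i) for i in range(3)])

# The graph is a mapping from task keys to task definitions.
graph = total.__dask_graph__()
n_tasks = len(graph)
print("number of tasks:", n_tasks)
for key in graph:
    print("  ", key)

print("result:", total.compute())
```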

## Repartitioning & performance tips

- Aim for partition sizes ~100–250 MB.
- Use `.repartition(npartitions=...)` or `.repartition(partition_size='200MB')`.
- Use `.persist()` when reusing intermediate results.
- Avoid wide shuffles unless necessary (joins, groupbys with high-cardinality keys).
- Prefer Parquet for storage.

Example:

In [None]:
# Example repartition
print('Before repartition:', ddf.npartitions)
ddf2 = ddf.repartition(npartitions=2)
print('After repartition:', ddf2.npartitions)

## Dask for ML & Integration (notes)

- Use `dask-ml` for scaling scikit-learn workflows.
- Use Dask with XGBoost for distributed training.
- For hyperparameter tuning, use `dask-ml`'s `IncrementalSearchCV` or tools such as Optuna with Dask.


## Persisting results & writing to Parquet
Write results using `.to_parquet()` for efficient storage and later reads.

In [None]:
# Write to parquet (local)
out_dir = 'dask_out_parquet'
ddf.to_parquet(out_dir, engine='pyarrow', overwrite=True)
print('Wrote parquet to', out_dir)

## Exercises & Practice Tasks

1. Read a large dataset using `dd.read_csv` and compute mean by a category.
2. Use `delayed` to parallelize a pure-Python data ingestion function.
3. Repartition data by a datetime column and run a rolling-window aggregation.
4. Measure performance before/after adding `.persist()`.

Try implementing these and use the Dask dashboard to inspect workers.

## Troubleshooting & Resources

- Official docs: https://docs.dask.org/
- Dask tutorial: https://tutorial.dask.org/
- Dask on distributed: https://distributed.dask.org/

If you hit memory errors: use smaller partitions (repartition into more of them), add workers or give each worker more memory, or tune the distributed workers' spill-to-disk settings.
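
As a sketch of where the spill settings live, the snippet below overrides two of the `distributed.worker.memory` configuration keys. Each value is a fraction of a worker's memory limit. The values 0.50 and 0.60 are illustrative only, not recommendations.

```python
import dask
import dask.distributed  # importing registers the distributed.* config defaults

# target: fraction of memory at which workers start moving data to disk
#         based on the tracked size of stored results.
# spill:  fraction at which workers spill based on the process memory
#         reported by the operating system.
with dask.config.set({
    "distributed.worker.memory.target": 0.50,
    "distributed.worker.memory.spill": 0.60,
}):
    target = dask.config.get("distributed.worker.memory.target")
    spill = dask.config.get("distributed.worker.memory.spill")
    print("target:", target, "spill:", spill)
    # Create the Client or cluster inside this block so that the
    # workers start with these settings.
```

These settings are read when workers start, so they need to be in place before the `Client` or cluster is created.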