# Dask Multi-file Analysis Demo

This notebook demonstrates how **Dask** reads and processes many CSV files at once and performs analytics over them. **Note:** Dask must be installed in your environment to run the code cells (`pip install 'dask[complete]'`).

Files used by this notebook are in the folder `dask_csv_demo_files/` (already generated for you).

## 1. Setup
Install Dask and start a local client (if needed):
```bash
pip install 'dask[complete]'
```

In [16]:
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client
client = Client()  # opens a local scheduler and dashboard
client


Perhaps you already have a cluster running?
Hosting the HTTP server on port 64942 instead


Client
Connection method: Cluster object
Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:64942/status

LocalCluster
Workers: 4
Total threads: 8
Total memory: 16.00 GiB
Status: running
Using processes: True

Scheduler
Comm: tcp://127.0.0.1:64943
Started: Just now

Workers (each with Total threads: 2, Memory: 4.00 GiB)
Comm: tcp://127.0.0.1:64961, Dashboard: http://127.0.0.1:64962/status, Nanny: tcp://127.0.0.1:64947, Local directory: /var/folders/gl/dsvyrs6s2hz_q5dj1qdz19680000gn/T/dask-worker-space/worker-l7k3moqd
Comm: tcp://127.0.0.1:64959, Dashboard: http://127.0.0.1:64963/status, Nanny: tcp://127.0.0.1:64946, Local directory: /var/folders/gl/dsvyrs6s2hz_q5dj1qdz19680000gn/T/dask-worker-space/worker-aj3sx2ro
Comm: tcp://127.0.0.1:64960, Dashboard: http://127.0.0.1:64965/status, Nanny: tcp://127.0.0.1:64949, Local directory: /var/folders/gl/dsvyrs6s2hz_q5dj1qdz19680000gn/T/dask-worker-space/worker-jbcs8x7d
Comm: tcp://127.0.0.1:64958, Dashboard: http://127.0.0.1:64964/status, Nanny: tcp://127.0.0.1:64948, Local directory: /var/folders/gl/dsvyrs6s2hz_q5dj1qdz19680000gn/T/dask-worker-space/worker-9s6v2937


## 2. Read multiple CSV files using a wildcard
Dask can read many files with a single `read_csv` call. Each matched file becomes at least one partition (files larger than the `blocksize` are split into several), and Dask builds a task graph so the partitions can be processed in parallel.

In [17]:
# adjust the path if you moved the CSVs
sales_ddf = dd.read_csv('dask_csv_demo_files/sales_part_*.csv', parse_dates=['date'])
print('partitions:', sales_ddf.npartitions)
sales_ddf.head()
sales_ddf.shape[0].compute()


partitions: 6


12000

## 3. Quick global stats (one compute call)
Use a single `.compute()` for combined operations to reduce overhead.

In [18]:
import dask
# one compute call evaluates both lazy results and lets them share the CSV reads
total_rows, total_revenue = dask.compute(sales_ddf.shape[0], sales_ddf['total_price'].sum())
print(f"Total rows: {total_rows}")
print(f"Total revenue: ${total_revenue:,.2f}")


Total rows: 12000
Total revenue: $6,551,027.24


## 4. Aggregations across all files
Examples: total sales per product, monthly sales (time-based resampling).

In [19]:
# total sales per product (aggregates across every file)
sales_per_product = sales_ddf.groupby('product_id')['total_price'].sum().compute().reset_index().sort_values('total_price', ascending=False)
sales_per_product.head(10)


,product_id,total_price
15,16,363578.59
14,15,346328.42
5,6,343864.81
16,17,342111.53
17,18,341777.79
12,13,339480.82
9,10,339323.11
3,4,338051.03
8,9,332364.56
18,19,331764.42


In [20]:
# monthly sales: set date index and resample ('M' = month end; pandas 2.2+ prefers the alias 'ME')
sales_time = sales_ddf.set_index('date')
monthly_sales = sales_time['total_price'].resample('M').sum().compute().reset_index()
monthly_sales


,date,total_price
0,2023-01-31,570446.7
1,2023-02-28,489567.84
2,2023-03-31,555725.8
3,2023-04-30,534250.43
4,2023-05-31,532793.83
5,2023-06-30,546890.97
6,2023-07-31,508245.67
7,2023-08-31,565660.5
8,2023-09-30,538814.24
9,2023-10-31,584012.62


## 5. Joins with small lookup tables (products/customers)
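Best practice: load small lookup tables with pandas and merge them into the Dask DataFrame. Because the lookup table fits in memory, Dask can join it against each partition independently (a map-side join) without shuffling the large table.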

In [21]:
# load small files as pandas, then merge
products = pd.read_csv('dask_csv_demo_files/products.csv')
customers = pd.read_csv('dask_csv_demo_files/customers.csv')

# merge (Dask will handle partitioned compute)
sales_enriched = sales_ddf.merge(products, on='product_id', how='left').merge(customers, on='customer_id', how='left')
sales_enriched.head()


,order_id,customer_id,product_id,date,quantity,unit_price,total_price,product_name,category,customer_name,region
0,1,1027,4,2023-04-13,1,186.56,186.56,Product_4,A,Customer_1027,South
1,2,1078,14,2023-12-15,1,408.44,408.44,Product_14,A,Customer_1078,South
2,3,1005,10,2023-09-28,1,309.67,309.67,Product_10,C,Customer_1005,North
3,4,1060,3,2023-04-17,1,391.76,391.76,Product_3,A,Customer_1060,North
4,5,1019,3,2023-03-13,3,362.33,1086.99,Product_3,A,Customer_1019,West


## 6. Analytics examples
Revenue by region, correlation, quantiles, rolling averages.

In [22]:
# revenue by region (not shown below: a cell displays only its last expression)
revenue_region = sales_enriched.groupby('region')['total_price'].sum().compute().reset_index().sort_values('total_price', ascending=False)
revenue_region

# pairwise correlations between quantity, unit_price and total_price
corr = sales_ddf[['quantity','unit_price','total_price']].corr().compute()
corr


,quantity,unit_price,total_price
quantity,1.0,-0.006762,0.674962
unit_price,-0.006762,1.0,0.631087
total_price,0.674962,0.631087,1.0


## 7. Performance tips and notes
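- Avoid calling `.compute()` repeatedly; build the lazy results first, then evaluate them together (for example with `dask.compute(a, b)`) so shared work such as reading the CSVs runs once.
- Use `.persist()` to keep an intermediate result in memory when several later computations reuse it.
- Prefer Parquet over CSV for large workloads: it is columnar and stores dtypes, so Dask can read only the columns it needs and skip type inference.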

## 8. Save results
```python
sales_per_product.to_csv('agg_total_sales_by_product.csv', index=False)
monthly_sales.to_csv('agg_monthly_sales.csv', index=False)
```
