# Dask Complete Cheat Sheet

Dask enables **parallel and distributed computing** with a familiar pandas-like API. It is well suited to **large datasets** that do not fit into memory.


## 1. Installation

In [4]:

# Install Dask with the DataFrame, Array, and distributed extras (uncomment if running locally)
# !pip install "dask[complete]"




## 2. Import and Setup

In [5]:

import dask.dataframe as dd
import dask.array as da
from dask.distributed import Client

# Start a local cluster and client; the client output links to the diagnostics dashboard
client = Client()
client


Connection method: Cluster object
Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status
Status: running
Using processes: True
Workers: 4
Total threads: 8
Total memory: 16.00 GiB
Scheduler comm: tcp://127.0.0.1:60987
Started: Just now

| Worker comm | Dashboard | Nanny | Total threads | Memory | Local directory |
|---|---|---|---|---|---|
| tcp://127.0.0.1:61004 | http://127.0.0.1:61008/status | tcp://127.0.0.1:60990 | 2 | 4.00 GiB | /var/folders/gl/dsvyrs6s2hz_q5dj1qdz19680000gn/T/dask-worker-space/worker-8yh5dilw |
| tcp://127.0.0.1:61003 | http://127.0.0.1:61009/status | tcp://127.0.0.1:60992 | 2 | 4.00 GiB | /var/folders/gl/dsvyrs6s2hz_q5dj1qdz19680000gn/T/dask-worker-space/worker-c21o1pfe |
| tcp://127.0.0.1:61005 | http://127.0.0.1:61007/status | tcp://127.0.0.1:60991 | 2 | 4.00 GiB | /var/folders/gl/dsvyrs6s2hz_q5dj1qdz19680000gn/T/dask-worker-space/worker-_qi06u37 |
| tcp://127.0.0.1:61002 | http://127.0.0.1:61006/status | tcp://127.0.0.1:60993 | 2 | 4.00 GiB | /var/folders/gl/dsvyrs6s2hz_q5dj1qdz19680000gn/T/dask-worker-space/worker-mdz5pdzs |


## 3. Creating Dask DataFrames

In [6]:

import pandas as pd
import numpy as np

# Create a sample Pandas DataFrame
pdf = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 40, 45],
    'Salary': [50000, 60000, 70000, 80000, 90000],
    'Department': ['HR', 'IT', 'IT', 'Finance', 'HR']
})

# Convert Pandas to Dask DataFrame (with 2 partitions)
df = dd.from_pandas(pdf, npartitions=2)
df


| npartitions=2 | Name | Age | Salary | Department |
|---|---|---|---|---|
| 0 | object | int64 | int64 | object |
| 3 | ... | ... | ... | ... |
| 4 | ... | ... | ... | ... |


## 4. Reading and Writing Data

In [None]:

# CSV
# df = dd.read_csv('data/*.csv')
# df.to_csv('output/*.csv', index=False)

# Parquet
# df.to_parquet('output_parquet/')
# dd.read_parquet('output_parquet/')


## 5. Inspecting Data and Computing

In [None]:

df.head()
df.info()

# Trigger actual computation
df.compute()

# Number of partitions
df.npartitions


## 6. Selecting Columns and Filtering Rows

In [None]:

df['Name'].head()
df[['Name', 'Salary']].compute()

# Conditional filtering
df[df['Age'] > 30].compute()


## 7. Basic Operations

In [None]:

df['Salary'] + 1000
df['Age'] * 2

# Apply a lambda element-wise; meta=(name, dtype) declares the output so Dask does not have to infer it
df['Bonus'] = df['Salary'].apply(lambda x: x * 0.1, meta=('Salary', 'f8'))
df.compute()


## 8. Aggregations and GroupBy

In [None]:

df['Salary'].mean().compute()
df.groupby('Department')['Salary'].mean().compute()
df.groupby('Department').agg({'Salary': ['mean', 'max'], 'Age': 'min'}).compute()


## 9. Merge, Join, and Concat

In [None]:

df1 = dd.from_pandas(pd.DataFrame({'ID':[1,2,3], 'Name':['A','B','C']}), npartitions=1)
df2 = dd.from_pandas(pd.DataFrame({'ID':[1,2,4], 'Score':[90,80,70]}), npartitions=1)

dd.merge(df1, df2, on='ID', how='inner').compute()
dd.concat([df1, df2]).compute()


## 10. Handling Missing Values

In [None]:

df_nan = df.copy()
df_nan = df_nan.map_partitions(lambda x: x.assign(Salary=x['Salary'].mask(x['Salary'] > 85000, np.nan)))

df_nan.isna().sum().compute()
df_nan.fillna(0).compute()
df_nan.dropna().compute()


## 11. String and DateTime Operations

In [None]:

df['Name'].str.upper().compute()

dates = dd.from_pandas(pd.date_range('2023-01-01', periods=5, freq='D').to_frame(name='Date'), npartitions=2)
dates['Year'] = dates['Date'].dt.year
dates['Month'] = dates['Date'].dt.month
dates.compute()


## 12. Analytics and Statistical Functions

In [None]:

df[['Age', 'Salary']].corr().compute()
df['Salary'].quantile([0.25, 0.5, 0.75]).compute()

df['Salary'].max().compute()
df['Salary'].std().compute()
df['Salary'].var().compute()

# Rolling computations
df['Rolling_Avg'] = df['Salary'].rolling(2).mean()
df.compute()


## 13. Dask Arrays

In [None]:

arr = da.random.random((10000, 10000), chunks=(1000, 1000))
arr.mean().compute()
arr.std().compute()
arr[:5, :5].compute()


## 14. Persisting and Caching

In [None]:

# persist() computes the collection and keeps the resulting partitions in memory, so later operations reuse them
df_persisted = df.persist()
df_persisted.head()


## 15. Visualization

In [None]:

import matplotlib.pyplot as plt

df['Salary'].compute().plot(kind='hist', title='Salary Distribution')
plt.show()

df.compute().plot(x='Age', y='Salary', kind='scatter')
plt.show()


## 16. Performance Optimization Tips


✅ **Best Practices for Speed and Memory Efficiency:**  
- Use **Parquet** files instead of CSV (columnar format, faster I/O).  
- Use **`persist()`** when reusing computed datasets.  
- Use **appropriate chunk sizes** for Dask Arrays/DataFrames.  
- **Avoid calling `.compute()` too often** — combine operations first.  
- For distributed clusters, pass the scheduler address to `Client()`, e.g. `Client('tcp://scheduler-address:8786')`.  
- Visualize task graphs using `.visualize()` to debug performance bottlenecks.
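
To make the chunk-size tip concrete, the sketch below inspects how an array is split into blocks and then rechunks it into fewer, larger ones. The shapes and chunk sizes are arbitrary illustrative values; suitable sizes depend on the workload and the memory available per worker.

```python
import dask.array as da

# A 1000 x 1000 array split into 100 x 100 chunks gives 10 x 10 blocks
arr = da.ones((1000, 1000), chunks=(100, 100))
blocks_before = arr.numblocks

# Fewer, larger chunks mean fewer tasks and less scheduling overhead
rechunked = arr.rechunk((500, 500))
blocks_after = rechunked.numblocks

# The numerical result does not depend on the chunking
total = rechunked.sum().compute()
print(blocks_before, blocks_after, total)
```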


## 17. Task Graph and Diagnostics

In [None]:

# Visualize the computation graph (requires the optional graphviz dependency)
df['Salary'].mean().visualize(filename='task_graph.png')

# Dashboard is available when using Dask client (link shown above)


## 18. Summary


✅ Dask provides scalable, parallel computation with familiar Pandas syntax.  
Ideal for **Big Data**, **machine learning preprocessing**, and **ETL pipelines**.

**Key modules:**
- `dask.dataframe` → Pandas-like big data processing  
- `dask.array` → NumPy-like arrays for large data  
- `dask.delayed` → Convert custom Python code to parallel tasks  
- `dask.distributed` → Multi-core and multi-node cluster management
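
Of the modules listed, `dask.delayed` is the only one not demonstrated in the sections above. As a brief sketch, it wraps ordinary Python functions so that calling them builds a task graph instead of running immediately. The functions `square` and `add` are hypothetical examples.

```python
from dask import delayed


@delayed
def square(x):
    return x * x


@delayed
def add(a, b):
    return a + b


# Nothing runs here: the calls only record tasks and their dependencies
total = add(square(3), square(4))

# compute() executes the graph; the two square() calls can run in parallel
result = total.compute()
print(result)  # 25
```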


In [7]:
# Since this environment cannot install Dask, we'll:
# 1) create multiple synthetic CSV files (these you can download and use)
# 2) create a ready-to-run Jupyter notebook (ipynb) that contains Dask code to analyze those CSVs.
#    You can download the notebook and run it locally (where Dask is installed) or in Colab (install dask there).
#
# This cell only uses pandas/numpy and nbformat so it will run here.

import os
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from nbformat import v4 as nbf
import json

output_dir = "dask_csv_demo_files"
os.makedirs(output_dir, exist_ok=True)

# 1) Create synthetic CSV files (sales data split across multiple files)
np.random.seed(42)
n_files = 6
rows_per_file = 2000

start_date = datetime(2023,1,1)
product_ids = np.arange(1, 21)   # 20 products
customer_ids = np.arange(1000, 1100)  # 100 customers

for i in range(n_files):
    rows = rows_per_file
    dates = [start_date + timedelta(days=int(x)) for x in np.random.randint(0, 365, size=rows)]
    # Named sales_part (not df) so the Dask DataFrame `df` from earlier cells is not overwritten
    sales_part = pd.DataFrame({
        "order_id": np.arange(i*rows+1, i*rows+rows+1),
        "customer_id": np.random.choice(customer_ids, size=rows),
        "product_id": np.random.choice(product_ids, size=rows),
        "date": dates,
        "quantity": np.random.poisson(lam=2.0, size=rows).clip(min=1),
        "unit_price": np.round(np.random.uniform(10, 500, size=rows), 2)
    })
    sales_part["total_price"] = (sales_part["quantity"] * sales_part["unit_price"]).round(2)
    file_path = os.path.join(output_dir, f"sales_part_{i+1}.csv")
    sales_part.to_csv(file_path, index=False)

# 2) Create customers and products CSVs
customers = pd.DataFrame({
    "customer_id": customer_ids,
    "customer_name": [f"Customer_{i}" for i in customer_ids],
    "region": np.random.choice(["North", "South", "East", "West"], size=len(customer_ids))
})
customers.to_csv(os.path.join(output_dir, "customers.csv"), index=False)

products = pd.DataFrame({
    "product_id": product_ids,
    "product_name": [f"Product_{i}" for i in product_ids],
    "category": np.random.choice(["A", "B", "C"], size=len(product_ids))
})
products.to_csv(os.path.join(output_dir, "products.csv"), index=False)

created_files = sorted(os.listdir(output_dir))

# 3) Create a Jupyter notebook with Dask analysis code (the user can run locally where Dask is installed)
nb = nbf.new_notebook()
cells = []

cells.append(nbf.new_markdown_cell("# Dask Multi-file Analysis Demo\n\nThis notebook demonstrates how **Dask** reads and processes many CSV files at once and performs analytics over them. **Note:** Dask must be installed in your environment to run the code cells (`pip install 'dask[complete]'`).\n\nFiles used by this notebook are in the folder `dask_csv_demo_files/` (already generated for you)."))
cells.append(nbf.new_markdown_cell("## 1. Setup\nInstall Dask and start a local client (if needed):\n```bash\npip install 'dask[complete]'\n```"))

cells.append(nbf.new_code_cell("""import pandas as pd  # used later to load the small lookup tables\nimport dask.dataframe as dd\nfrom dask.distributed import Client\nclient = Client()  # opens a local scheduler and dashboard\nclient\n"""))

cells.append(nbf.new_markdown_cell("## 2. Read multiple CSV files using a wildcard\nDask can read many files using a single `read_csv` call — it will create one partition per file (by default) and build a task graph for parallel processing."))
cells.append(nbf.new_code_cell("""# adjust the path if you moved the CSVs\nsales_ddf = dd.read_csv('dask_csv_demo_files/sales_part_*.csv', parse_dates=['date'])\nprint('partitions:', sales_ddf.npartitions)\nsales_ddf.head()\n"""))

cells.append(nbf.new_markdown_cell("## 3. Quick global stats (one compute call)\nUse a single `.compute()` for combined operations to reduce overhead."))
cells.append(nbf.new_code_cell("""import dask\n\n# evaluate both lazy results with one compute call\ntotal_rows, total_revenue = dask.compute(sales_ddf.shape[0], sales_ddf['total_price'].sum())\nprint(f\"Total rows: {total_rows}\")\nprint(f\"Total revenue: ${total_revenue:,.2f}\")\n"""))

cells.append(nbf.new_markdown_cell("## 4. Aggregations across all files\nExamples: total sales per product, monthly sales (time-based resampling)."))
cells.append(nbf.new_code_cell("""# total sales per product (aggregates across every file)\nsales_per_product = sales_ddf.groupby('product_id')['total_price'].sum().compute().reset_index().sort_values('total_price', ascending=False)\nsales_per_product.head(10)\n"""))

cells.append(nbf.new_code_cell("""# monthly sales: set date index and resample\nsales_time = sales_ddf.set_index('date')\nmonthly_sales = sales_time['total_price'].resample('M').sum().compute().reset_index()\nmonthly_sales\n"""))

cells.append(nbf.new_markdown_cell("## 5. Joins with small lookup tables (products/customers)\nBest practice: load small lookup tables as pandas and merge into the Dask dataframe (map-join pattern)."))
cells.append(nbf.new_code_cell("""# load small files as pandas, then merge\nproducts = pd.read_csv('dask_csv_demo_files/products.csv')\ncustomers = pd.read_csv('dask_csv_demo_files/customers.csv')\n\n# merge (Dask will handle partitioned compute)\nsales_enriched = sales_ddf.merge(products, on='product_id', how='left').merge(customers, on='customer_id', how='left')\nsales_enriched.head()\n"""))

cells.append(nbf.new_markdown_cell("## 6. Analytics examples\nRevenue by region, correlation, quantiles, rolling averages."))
cells.append(nbf.new_code_cell("""# revenue by region\nrevenue_region = sales_enriched.groupby('region')['total_price'].sum().compute().reset_index().sort_values('total_price', ascending=False)\nrevenue_region\n\n# correlation between quantity and total_price\ncorr = sales_ddf[['quantity','unit_price','total_price']].corr().compute()\ncorr\n"""))

cells.append(nbf.new_markdown_cell("## 7. Performance tips and notes\n- Avoid calling `.compute()` repeatedly; chain operations then compute once.\n- Use `.persist()` for reused intermediate results.\n- Prefer Parquet for large workloads.\n\n## 8. Save results\n```python\nsales_per_product.to_csv('agg_total_sales_by_product.csv', index=False)\nmonthly_sales.to_csv('agg_monthly_sales.csv', index=False)\n```\n"))

cells.append(nbf.new_markdown_cell("----\n\n### How to run\n1. Download the CSV files and this notebook into the same folder (or update the paths).\n2. Create a virtualenv and `pip install 'dask[complete]'`.\n3. Open the notebook and run cells. The Dask dashboard will be available at the URL shown by `Client()`.\n\n"))

nb['cells'] = cells

notebook_path = "Dask_MultiFile_Analysis.ipynb"
with open(notebook_path, "w") as f:
    json.dump(nb, f)

# Print outputs and provide file list for user to download
print("Created CSV files (downloadable):")
for f in created_files:
    print("-", f)

print("\nNotebook created:", os.path.basename(notebook_path))
print("\nYou can download the CSVs and the notebook from the following paths:")
print(" - CSV folder:", output_dir)
print(" - Notebook:", notebook_path)

{"csv_folder": output_dir, "notebook": notebook_path, "files": created_files}



Created CSV files (downloadable):
- customers.csv
- products.csv
- sales_part_1.csv
- sales_part_2.csv
- sales_part_3.csv
- sales_part_4.csv
- sales_part_5.csv
- sales_part_6.csv

Notebook created: Dask_MultiFile_Analysis.ipynb

You can download the CSVs and the notebook from the following paths:
 - CSV folder: dask_csv_demo_files
 - Notebook: Dask_MultiFile_Analysis.ipynb


{'csv_folder': 'dask_csv_demo_files',
 'notebook': 'Dask_MultiFile_Analysis.ipynb',
 'files': ['customers.csv',
  'products.csv',
  'sales_part_1.csv',
  'sales_part_2.csv',
  'sales_part_3.csv',
  'sales_part_4.csv',
  'sales_part_5.csv',
  'sales_part_6.csv']}