# Green data and sustainable ML
Explore how to reduce the environmental impact of data and ML pipelines.

In [1]:
# Imports
import time
from functools import lru_cache
from pprint import pprint
from typing import TypedDict

## Learning goals
- Map the data lifecycle to energy drivers (storage, transfer, compute).
- Estimate carbon intensity of workloads and spot hotspots.
- Apply efficiency patterns: pruning data, batching, caching, lower precision, and scheduling.
- Compare cloud regions by carbon intensity when choosing deployment.

### Why this matters
Machine learning models and data processing pipelines consume significant amounts of electricity. This energy use translates into carbon emissions depending on the **carbon intensity** of the electricity grid (which varies by region and time). By optimizing code and infrastructure choices, you can often reduce environmental footprint without sacrificing performance. This notebook focuses on practical, code-level changes you can make today.

## Quick reference & vocabulary
- **PUE (Power Usage Effectiveness):** total facility energy / IT equipment energy. Ideal is 1.0.
- **CI (Carbon Intensity):** gCO2e per kWh of electricity. Varies by region and time.
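- **SCI (Software Carbon Intensity):** rate of carbon emissions per functional unit $R$ (for example, per request or per training run): $SCI = ((E \cdot I) + M) / R$, where $E$ is the energy consumed (kWh), $I$ is the grid carbon intensity (gCO2e/kWh), and $M$ is the embodied emissions (gCO2e).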
- **Embodied carbon:** CO2e emitted during hardware manufacturing and disposal.
- **Operational carbon:** CO2e emitted from running the software (electricity).
- **Demand shifting:** moving workloads to times (temporal) or regions (spatial) with lower CI.
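
To make the SCI formula concrete, here is a minimal sketch that plugs numbers into it. The function name `software_carbon_intensity` and every input value are assumptions chosen for illustration, not measurements.

```python
def software_carbon_intensity(
    energy_kwh: float,        # E: energy consumed by the software
    grid_ci: float,           # I: grid carbon intensity in gCO2e/kWh
    embodied_g: float,        # M: embodied emissions attributed to the software, in gCO2e
    functional_units: float,  # R: e.g. number of requests served
) -> float:
    """Return SCI in gCO2e per functional unit: ((E * I) + M) / R."""
    if functional_units <= 0:
        raise ValueError("functional_units must be positive")
    operational_g = energy_kwh * grid_ci  # E * I
    return (operational_g + embodied_g) / functional_units


# Hypothetical service: 2 kWh at 300 gCO2e/kWh, 100 g embodied share, 10,000 requests
sci = software_carbon_intensity(
    energy_kwh=2.0, grid_ci=300.0, embodied_g=100.0, functional_units=10_000
)
print(f"SCI: {sci:.3f} gCO2e per request")
```

Because SCI is a rate, it can fall either by cutting emissions (lower $E$, $I$, or $M$) or by serving more functional units with the same resources.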

## Main Green AI concepts (with sources)
- **Green Software Foundation Principles:** Energy efficiency, Hardware efficiency, Carbon awareness. [GSF](https://greensoftware.foundation/)
- **Energy and Policy Considerations (Strubell et al., 2019):** Estimated that training one large NLP model with neural architecture search can emit roughly as much CO2e as five cars over their lifetimes. [Paper](https://arxiv.org/abs/1906.02243)
- **Sustainable AI:** Trade-offs between accuracy and efficiency. "Red AI" (buying accuracy with massive compute) vs "Green AI".

## Green patterns in code
- **Carbon awareness:** run jobs when the grid is greener (or choose a lower-CI region when allowed).
- **Energy efficiency:** optimize code to run faster or on less hardware.
- **Hardware efficiency:** use higher utilization or specialized hardware (GPU/TPU/NPU) when appropriate.
- **Data efficiency:** train on less data (curation, deduplication).

The code snippets below show concrete examples of applying these patterns.
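
The carbon-awareness pattern also has a temporal form: given an hourly carbon-intensity forecast, start a deferrable job in the window with the lowest average intensity. The sketch below assumes such a forecast is already available as a list; the function name `greenest_start_hour` and the forecast values are made up for illustration.

```python
def greenest_start_hour(hourly_ci: list[float], job_hours: int) -> tuple[int, float]:
    """Return (start_index, mean_ci) of the contiguous window with the lowest mean CI."""
    if job_hours <= 0 or job_hours > len(hourly_ci):
        raise ValueError("job_hours must be between 1 and the forecast length")

    best_start, best_mean = 0, float("inf")
    for start in range(len(hourly_ci) - job_hours + 1):
        window = hourly_ci[start:start + job_hours]
        mean_ci = sum(window) / job_hours
        if mean_ci < best_mean:
            best_start, best_mean = start, mean_ci
    return best_start, best_mean


# Made-up 8-hour forecast in gCO2e/kWh (index 0 = the next hour)
forecast = [420.0, 390.0, 310.0, 180.0, 150.0, 200.0, 350.0, 430.0]
start, mean_ci = greenest_start_hour(forecast, job_hours=3)
print(f"Start in {start} hours; average intensity {mean_ci:.1f} gCO2e/kWh")
```

This only helps for jobs that tolerate a delay, such as nightly batch processing; latency-sensitive work cannot be moved this way.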

In [2]:
# Simple carbon estimate for a data job
def estimate_job_impact(job_name: str, cpu_hours: float, watts: float = 65, carbon_intensity: float = 300):
    kwh_used = cpu_hours * watts / 1000
    co2e_grams = kwh_used * carbon_intensity
    
    print(f"--- Impact Report: {job_name} ---")
    print(f"Duration: {cpu_hours} hours")
    print(f"Power: {watts} Watts")
    print(f"Grid Intensity: {carbon_intensity} gCO2e/kWh")
    print(f"Total Energy: {kwh_used:.2f} kWh")
    print(f"Total Emissions: {co2e_grams:.2f} gCO2e")
    print("--------------------------------")
    return co2e_grams

# Scenario: ETL job running for 12 hours on a standard server (80 W) in a coal-heavy region (450 gCO2e/kWh)
job_co2e = estimate_job_impact("Daily ETL (Coal Region)", cpu_hours=12, watts=80, carbon_intensity=450)

--- Impact Report: Daily ETL (Coal Region) ---
Duration: 12 hours
Power: 80 Watts
Grid Intensity: 450 gCO2e/kWh
Total Energy: 0.96 kWh
Total Emissions: 432.00 gCO2e
--------------------------------


In [3]:
# Carbon Awareness: Spatial shifting to greener regions
# Different regions have different energy mixes (Coal vs. Hydro/Nuclear).
# We also consider PUE (Power Usage Effectiveness), which measures data center efficiency.
# A PUE of 1.2 means for every 1.0 kWh used by servers, 0.2 kWh is used for cooling/lighting.
class RegionInfo(TypedDict):
    ci: float
    name: str
    pue: float
regions: dict[str, RegionInfo] = {
    'us-east-1': {'ci': 480.0, 'name': 'N. Virginia (Coal/Gas)', 'pue': 1.2},
    'eu-north-1': {'ci': 15.0, 'name': 'Stockholm (Hydro/Wind)', 'pue': 1.1},
    'eu-west-3': {'ci': 55.0, 'name': 'Paris (Nuclear)', 'pue': 1.15},
    'ap-south-1': {'ci': 720.0, 'name': 'Mumbai (Coal)', 'pue': 1.25},
}
def pick_greenest_region(options: list[str]):
    print(f"Evaluating regions: {', '.join(options)}")
    # Find the lowest- and highest-CI regions (PUE is printed but not used in the ranking)
    best = min(options, key=lambda r: regions[r]['ci'])
    worst = max(options, key=lambda r: regions[r]['ci'])
    
    savings = regions[worst]['ci'] - regions[best]['ci']
    pct_savings = (savings / regions[worst]['ci']) * 100
    
    print(f"\n[BEST CHOICE] {regions[best]['name']}")
    print(f"   Intensity: {regions[best]['ci']} gCO2e/kWh")
    print(f"   PUE: {regions[best]['pue']}")
    print(f"\n[AVOID] {regions[worst]['name']} ({regions[worst]['ci']} gCO2e/kWh)")
    print(f"[POTENTIAL SAVINGS] {pct_savings:.1f}% reduction in operational carbon.")

pick_greenest_region(['us-east-1', 'eu-west-3', 'eu-north-1', 'ap-south-1'])

Evaluating regions: us-east-1, eu-west-3, eu-north-1, ap-south-1

[BEST CHOICE] Stockholm (Hydro/Wind)
   Intensity: 15.0 gCO2e/kWh
   PUE: 1.1

[AVOID] Mumbai (Coal) (720.0 gCO2e/kWh)
[POTENTIAL SAVINGS] 97.9% reduction in operational carbon.


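**What this code does:** Selects the candidate region with the lowest carbon intensity from a static, illustrative lookup table and reports the reduction relative to the worst candidate. Real grid intensity changes over time, so the table values should be read as examples rather than live data.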
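
The region-selection cell prints PUE but ranks on carbon intensity alone. One possible refinement, sketched here as an assumption rather than a standard method, is to rank by `ci * pue`, the emissions per kWh of IT energy once facility overhead is included. The table repeats the illustrative values used above under a separate name, `region_table`.

```python
# Illustrative values copied from the regions table above
region_table = {
    "us-east-1": {"ci": 480.0, "pue": 1.2},
    "eu-north-1": {"ci": 15.0, "pue": 1.1},
    "eu-west-3": {"ci": 55.0, "pue": 1.15},
    "ap-south-1": {"ci": 720.0, "pue": 1.25},
}


def effective_intensity(region: str) -> float:
    """gCO2e emitted per kWh of IT energy, including data center overhead."""
    info = region_table[region]
    return info["ci"] * info["pue"]


# Lowest effective intensity first
ranked = sorted(region_table, key=effective_intensity)
for region in ranked:
    print(f"{region:<11} {effective_intensity(region):7.1f} gCO2e per IT kWh")
```

With these numbers the ordering matches the CI-only ranking, but two regions with similar grids and very different PUE values could swap places.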

### Challenge: Lower the estimate
Try to lower the estimate: reduce CPU hours (optimize queries), lower watts (right-size instances), or pick a region with lower carbon intensity.

In [4]:
# Deduplicate and prune columns to reduce storage and transfer
# Expanded dataset simulating raw logs
records = [
    {'id': 1, 'email': 'ana@example.com', 'country': 'FR', 'timestamp': '2023-10-01T10:00:00', 'browser': 'Chrome', 'os': 'Mac'},
    {'id': 2, 'email': 'ana@example.com', 'country': 'FR', 'timestamp': '2023-10-01T10:05:00', 'browser': 'Chrome', 'os': 'Mac'}, # Duplicate user
    {'id': 3, 'email': 'lee@example.com', 'country': 'DE', 'timestamp': '2023-10-01T10:10:00', 'browser': 'Firefox', 'os': 'Linux'},
    {'id': 4, 'email': 'joe@example.com', 'country': 'US', 'timestamp': '2023-10-01T10:15:00', 'browser': 'Safari', 'os': 'iOS'},
    {'id': 5, 'email': 'lee@example.com', 'country': 'DE', 'timestamp': '2023-10-01T10:20:00', 'browser': 'Firefox', 'os': 'Linux'}, # Duplicate user
]

def optimize_dataset(rows, key, keep_columns):
    initial_count = len(rows)
    seen = set()
    result = []
    
    print(f"Processing {initial_count} records...")
    
    for row in rows:
        k = row[key]
        if k in seen:
            continue
        seen.add(k)
        # Pruning: create a new dict with only needed columns
        # This reduces memory usage and network transfer size
        pruned_row = {col: row[col] for col in keep_columns if col in row}
        result.append(pruned_row)
        
    final_count = len(result)
    # Guard against an empty input to avoid dividing by zero
    reduction = 100 * (1 - final_count / initial_count) if initial_count else 0.0
    
    print(f"Deduplicated count: {final_count}")
    print(f"Kept columns: {keep_columns}")
    print(f"[ROW REDUCTION] {reduction:.1f}%")
    return result

optimized = optimize_dataset(records, key='email', keep_columns=['id', 'email', 'country'])
pprint(optimized)

Processing 5 records...
Deduplicated count: 3
Kept columns: ['id', 'email', 'country']
[ROW REDUCTION] 40.0%
[{'country': 'FR', 'email': 'ana@example.com', 'id': 1},
 {'country': 'DE', 'email': 'lee@example.com', 'id': 3},
 {'country': 'US', 'email': 'joe@example.com', 'id': 4}]
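
Row counts understate the effect of column pruning. A rough way to see the combined effect is to compare serialized sizes before and after. The sketch below uses compact JSON as a stand-in for whatever format your pipeline actually transfers; `payload_bytes` and `sample_rows` are hypothetical names, and the rows are a shortened copy of the records above.

```python
import json

sample_rows = [
    {"id": 1, "email": "ana@example.com", "country": "FR", "browser": "Chrome", "os": "Mac"},
    {"id": 2, "email": "ana@example.com", "country": "FR", "browser": "Chrome", "os": "Mac"},
    {"id": 3, "email": "lee@example.com", "country": "DE", "browser": "Firefox", "os": "Linux"},
]


def payload_bytes(rows: list[dict]) -> int:
    """Size of the rows when serialized as compact UTF-8 JSON."""
    return len(json.dumps(rows, separators=(",", ":")).encode("utf-8"))


# Deduplicate on email and keep only three columns
seen: set[str] = set()
slim_rows = []
for row in sample_rows:
    if row["email"] in seen:
        continue
    seen.add(row["email"])
    slim_rows.append({col: row[col] for col in ("id", "email", "country")})

before, after = payload_bytes(sample_rows), payload_bytes(slim_rows)
print(f"{before} bytes -> {after} bytes ({100 * (1 - after / before):.1f}% smaller)")
```

Columnar or compressed formats will give different absolute numbers, so treat the percentage as indicative only.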


In [5]:
# Write your notes here
green_analysis = {
    'current_impact': 'High (Coal region, full precision, hot storage)',
    'optimization_1': 'Region shift to...',
    'optimization_2': 'Data pruning...',
    'optimization_3': 'Lifecycle policy...',
    'estimated_savings': '?? %'
}
green_analysis

{'current_impact': 'High (Coal region, full precision, hot storage)',
 'optimization_1': 'Region shift to...',
 'optimization_2': 'Data pruning...',
 'optimization_3': 'Lifecycle policy...',
 'estimated_savings': '?? %'}

**What this code does:** Template to capture energy drivers and proposed optimizations.

## Mini-case: Audit this pipeline
- **Scenario:** A daily ETL job runs in `us-east-1` (coal-heavy) at 10 AM (peak load). It processes 1TB of raw logs, keeping all columns, and stores the result in hot storage forever.
- **Issues:** High CI region, peak time (no temporal shifting), data bloat (no minimization), storage waste (no lifecycle).
- **Fixes:** Move to `eu-north-1` or run at night? Prune columns? Archive old logs?

Document your reasoning in the `green_analysis` notes cell above.

In [6]:
# Hardware Efficiency: Quantization / Lower Precision
# Simulating model weights size reduction for a large language model (e.g., 7B parameters)
model_params = 7_000_000_000

def estimate_model_footprint(params: int):
    precisions = {
        'float32 (Full)': 4,
        'float16 (Half)': 2,
        'bfloat16 (Brain float)': 2,
        'int8 (Quantized)': 1,
        'int4 (Aggressive)': 0.5
    }
    
    print(f"--- Memory Footprint for {params/1e9:.1f}B Parameter Model ---")
    base_size = params * precisions['float32 (Full)'] / (1024**3) # GiB (1024**3 bytes), labelled "GB" in the output
    
    for name, bytes_per_param in precisions.items():
        size_gb = (params * bytes_per_param) / (1024**3)
        savings = 100 * (1 - size_gb/base_size)
        print(f"{name:<20} | {size_gb:>6.2f} GB | Savings: {savings:>5.1f}%")

estimate_model_footprint(model_params)

--- Memory Footprint for 7.0B Parameter Model ---
float32 (Full)       |  26.08 GB | Savings:   0.0%
float16 (Half)       |  13.04 GB | Savings:  50.0%
bfloat16 (Brain float) |  13.04 GB | Savings:  50.0%
int8 (Quantized)     |   6.52 GB | Savings:  75.0%
int4 (Aggressive)    |   3.26 GB | Savings:  87.5%


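**What this code does:** Estimates how much memory the model weights need at each precision (parameter count × bytes per parameter). It covers weights only; activations, optimizer state, and any accuracy loss from quantization are not modelled.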
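
The footprint table shows how much space lower precision saves, not how values get mapped to it. As a minimal sketch, the code below applies symmetric per-tensor int8 quantization to a tiny list of weights in pure Python. Real frameworks use more elaborate schemes (per-channel scales, calibration), so read this as an illustration of the idea only; the function names and weight values are made up.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric quantization: map floats to integers in [-127, 127] plus one scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float values from the integers."""
    return [v * scale for v in quantized]


weights = [0.5, -1.27, 0.003, 1.0]  # made-up example weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 values:", q)
print(f"scale: {scale:.4f}, worst-case rounding error: {max_error:.4f}")
```

Each weight now needs one byte instead of four, at the cost of a rounding error of at most half the scale. Small weights such as `0.003` collapse to zero, which is one reason quantized models need accuracy checks.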

## Exercises
1) Extend the carbon estimate to include PUE (multiply kWh by PUE).
2) Add caching to a data transform you commonly run and measure runtime savings.
3) Draft a lifecycle policy for cold/archival storage and estimate yearly kWh saved.
4) Identify constraints (legal or business) that may prevent region-hopping for greener grids.

## Solutions

The following cells contain example solutions for the exercises above.

### Solution 1: Carbon estimate with PUE

PUE (Power Usage Effectiveness) accounts for the overhead of the data center (cooling, lighting).

$\text{Total Energy} = \text{IT Energy} \times \text{PUE}$

In [7]:
def estimate_job_impact_with_pue(cpu_hours: float, watts: float, carbon_intensity: float, pue: float):
    kwh_it = cpu_hours * watts / 1000
    kwh_total = kwh_it * pue  # Apply PUE multiplier
    co2e_grams = kwh_total * carbon_intensity
    
    print(f"--- Impact Report (with PUE {pue}) ---")
    print(f"IT Energy: {kwh_it:.2f} kWh")
    print(f"Total Facility Energy: {kwh_total:.2f} kWh")
    print(f"Total Emissions: {co2e_grams:.2f} gCO2e")
    return co2e_grams

estimate_job_impact_with_pue(cpu_hours=12, watts=80, carbon_intensity=450, pue=1.2)

--- Impact Report (with PUE 1.2) ---
IT Energy: 0.96 kWh
Total Facility Energy: 1.15 kWh
Total Emissions: 518.40 gCO2e


518.4

### Solution 2: Caching to reduce redundant compute

We use `functools.lru_cache` from Python's standard library to avoid recomputing expensive transforms for inputs we have already seen.

In [8]:
# Simulate an expensive data transformation
@lru_cache(maxsize=128)
def expensive_transform(data_id):
    time.sleep(0.5)  # Simulate 500ms of heavy compute
    return f"Transformed_{data_id}"

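def run_pipeline(ids):
    # perf_counter() is a monotonic, high-resolution clock intended for measuring durations
    start = time.perf_counter()
    for i in ids:
        _ = expensive_transform(i)
    elapsed = time.perf_counter() - start
    print(f"Processed {len(ids)} items in {elapsed:.2f} seconds")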

data_ids = [1, 2, 1, 3, 2, 4, 1, 5]  # Note duplicates: 1 and 2 appear multiple times

print("First run (some cache misses):")
run_pipeline(data_ids)

print("\nSecond run (all cache hits):")
run_pipeline(data_ids)

First run (some cache misses):
Processed 8 items in 2.52 seconds

Second run (all cache hits):
Processed 8 items in 0.00 seconds
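
Wall-clock time is one way to measure the benefit. Functions wrapped with `lru_cache` also expose `cache_info()`, which reports hits and misses directly. The sketch below uses a cheap stand-in transform (`normalize_country`, a hypothetical name) so it runs instantly.

```python
from functools import lru_cache


@lru_cache(maxsize=128)
def normalize_country(code: str) -> str:
    """Stand-in for an expensive, pure transform (same input always gives same output)."""
    return code.strip().upper()


for code in ["fr", "de", "fr", "us", "de", "fr"]:
    normalize_country(code)

info = normalize_country.cache_info()
hit_rate = info.hits / (info.hits + info.misses)
print(f"hits={info.hits} misses={info.misses} hit rate={hit_rate:.0%}")
```

Every hit is a computation that did not have to run. Note that `lru_cache` requires hashable arguments and is only safe for functions without side effects.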


### Solution 3: Lifecycle Policy Draft

**Goal:** Reduce storage energy for data that is rarely accessed.

**Policy:**
1.  **Hot Storage (SSD/Standard S3):** Data < 30 days old. High availability, high energy cost.
2.  **Warm Storage (HDD/Infrequent Access):** Data 30-90 days old. Lower availability, lower energy.
3.  **Cold Storage (Tape/Glacier):** Data > 90 days old. Very low energy (often offline), high retrieval latency.
4.  **Delete:** Logs > 1 year old (unless required for compliance).
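
The policy above can be expressed as a small classification function. This is a sketch of the rules as written, not any cloud provider's lifecycle API; `storage_tier` and the `compliance_hold` flag are hypothetical names.

```python
from datetime import date


def storage_tier(created: date, today: date, compliance_hold: bool = False) -> str:
    """Map an object's age to a tier following the draft policy above."""
    age_days = (today - created).days
    if age_days > 365 and not compliance_hold:
        return "delete"  # logs older than one year, unless retention is required
    if age_days > 90:
        return "cold"
    if age_days >= 30:
        return "warm"
    return "hot"


today = date(2024, 1, 1)
for created in [date(2023, 12, 20), date(2023, 11, 1), date(2023, 6, 1), date(2022, 6, 1)]:
    print(created, "->", storage_tier(created, today))
```

In practice you would usually encode the same thresholds in the storage service's own lifecycle configuration instead of running code against each object.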

**Estimated Savings:**
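As a rough rule of thumb rather than a measured figure, moving 1 TB of data from Standard (Hot) to Archive (Cold) storage may reduce the associated operational emissions by something on the order of **80-90%**, because offline media such as tape draw little or no power while idle, unlike spinning disks or SSDs. The actual saving depends on the provider and hardware, so verify against your provider's published data.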
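
To turn a percentage into the yearly kWh the exercise asks for, you need a power draw per terabyte for each tier. The sketch below shows the arithmetic only: the `watts_per_tb` values are placeholders invented for this example, not vendor figures, and should be replaced with numbers from your provider or hardware datasheets.

```python
HOURS_PER_YEAR = 24 * 365


def yearly_storage_kwh(terabytes: float, watts_per_tb: float) -> float:
    """Energy in kWh to keep `terabytes` online for one year at a constant power draw."""
    return terabytes * watts_per_tb * HOURS_PER_YEAR / 1000


# PLACEHOLDER power draws, chosen only to illustrate the calculation
hot_kwh = yearly_storage_kwh(terabytes=1.0, watts_per_tb=5.0)
cold_kwh = yearly_storage_kwh(terabytes=1.0, watts_per_tb=0.5)
saved_kwh = hot_kwh - cold_kwh

print(f"Hot:  {hot_kwh:.2f} kWh/year")
print(f"Cold: {cold_kwh:.2f} kWh/year")
print(f"Saved: {saved_kwh:.2f} kWh/year ({100 * saved_kwh / hot_kwh:.0f}%)")
```

Multiplying the saved kWh by PUE and by the grid's carbon intensity converts the result into gCO2e, in the same way as the job estimate earlier in the notebook.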

---

### Solution 4: Constraints on Region Hopping

While moving workloads to Sweden (eu-north-1) or France (eu-west-3) might be greener, you may be blocked by:

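1.  **Data Residency Laws (e.g., GDPR):** GDPR restricts transfers of personal data about people in the EU to countries outside the EEA unless safeguards such as an adequacy decision or standard contractual clauses are in place. Other jurisdictions and sectors have their own rules; for example, some US health or government data may need to stay in the US.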
2.  **Latency:** If your users are in India, serving them from Sweden will introduce significant lag, degrading user experience.
3.  **Service Availability:** Not all cloud services (e.g., specific GPU types) are available in every region.
4.  **Cost:** Electricity prices vary, and so do cloud prices. Sometimes the greenest region is more expensive.
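
These constraints can be applied as filters before choosing the greenest region. The sketch below covers two of them, residency and service availability, using the illustrative CI values from earlier. The function name, the allowed set, and the GPU availability set are hypothetical.

```python
from typing import Optional


def pick_compliant_region(
    ci_by_region: dict[str, float],
    allowed: set[str],    # regions permitted by residency rules
    available: set[str],  # regions offering the services the job needs
) -> Optional[str]:
    """Return the lowest-CI region that satisfies both constraints, or None."""
    eligible = [r for r in ci_by_region if r in allowed and r in available]
    if not eligible:
        return None
    return min(eligible, key=lambda r: ci_by_region[r])


ci_by_region = {"us-east-1": 480.0, "eu-north-1": 15.0, "eu-west-3": 55.0, "ap-south-1": 720.0}
eu_only = {"eu-north-1", "eu-west-3"}       # hypothetical residency requirement
has_required_gpu = {"us-east-1", "eu-west-3"}  # hypothetical availability

choice = pick_compliant_region(ci_by_region, eu_only, has_required_gpu)
no_choice = pick_compliant_region(ci_by_region, {"ap-south-1"}, has_required_gpu)
print("Chosen region:", choice)
print("When nothing qualifies:", no_choice)
```

Here the greenest region overall (`eu-north-1`) is ruled out by availability, so the job lands in `eu-west-3`. Latency and cost could be added as further filters or as weights in the ranking.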