## 1. Introduction

This notebook implements **Part 1: Data Persistence and Pipeline** for the DAT204M HW2 project.

It demonstrates how to securely upload a cleaned dataset from a local CSV file into a **cloud database (Supabase)** using Python.

The goal is to establish a reusable and professional data pipeline that can:
- Read and validate the cleaned CSV data.
- Sanitize missing or invalid values (e.g., NaN → None).
- Create a cloud database table matching the dataset schema.
- Upload or upsert the records to Supabase programmatically.
- Verify successful persistence by querying one or more rows.

This step forms the foundation for **Part 2: Analysis and Modeling**, where the data will be retrieved directly from the cloud.

## 2. Database Setup

We use **Supabase**, a managed PostgreSQL service with a Python SDK (`supabase-py`) for easy integration.

- The `.env` file keeps the credentials out of the notebook source (it should also be excluded from version control):
```
SUPABASE_URL=https://<PROJECT_REF>.supabase.co
SUPABASE_SERVICE_ROLE_KEY=<your_service_key>
```
- These credentials are loaded via `python-dotenv` to avoid hardcoding sensitive keys.
- The connection uses HTTPS REST endpoints (no direct port access).

Supabase is chosen for this project because it offers:
- A generous free tier.
- PostgreSQL compatibility.
- RESTful API and SDK support.
- Built-in data browser for verification.
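
The introduction lists table creation as a pipeline step, but no cell in this notebook shows it. As a minimal sketch only, the statement below could be generated and then run once in the Supabase SQL editor. The column types are assumptions inferred from the sample output in this notebook, and `build_create_table_sql` is a hypothetical helper, not part of `supabase-py`.

```python
# Hypothetical column -> PostgreSQL type mapping (assumed, not taken from the
# real table definition). Measures use double precision because the CSV values
# are uploaded as floats.
COLUMN_TYPES = {
    "country": "text",
    "year": "integer",
    "co2_per_capita_tco2e_excl_lulucf": "double precision",
    "co2_total_mtco2e_excl_lulucf": "double precision",
    "energy_use_kg_oe_per_capita": "double precision",
    "gdp_current_usd": "double precision",
    "population_total": "double precision",
    "renewable_electricity_pct": "double precision",
    "renewable_energy_consumption_pct": "double precision",
    "urban_pop_pct": "double precision",
    "missing_indicator_count": "integer",
}
PRIMARY_KEY = ("country", "year")


def build_create_table_sql(table, column_types, primary_key):
    """Return a CREATE TABLE IF NOT EXISTS statement with a composite primary key."""
    missing = [c for c in primary_key if c not in column_types]
    if missing:
        raise ValueError(f"Primary key columns not in schema: {missing}")
    lines = []
    for name, sql_type in column_types.items():
        # Key columns must never be null; measures may be null after sanitization
        null_clause = " not null" if name in primary_key else ""
        lines.append(f'    "{name}" {sql_type}{null_clause}')
    pk = ", ".join(f'"{c}"' for c in primary_key)
    lines.append(f"    primary key ({pk})")
    body = ",\n".join(lines)
    return f'create table if not exists "{table}" (\n{body}\n);'


ddl = build_create_table_sql("brn_indicators", COLUMN_TYPES, PRIMARY_KEY)
print(ddl)
```

Declaring (`country`, `year`) as the primary key is what lets a later `upsert()` decide whether a row is new or an update.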

In [1]:
import os
import json
import pandas as pd
import numpy as np
from dotenv import load_dotenv
from supabase import create_client, Client

load_dotenv()

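url = os.environ.get("SUPABASE_URL")
key = os.environ.get("SUPABASE_SERVICE_ROLE_KEY")

# Fail fast with a clear message if the .env file is missing or incomplete
if not url or not key:
    raise RuntimeError("SUPABASE_URL and SUPABASE_SERVICE_ROLE_KEY must be set in .env")

supabase: Client = create_client(url, key)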

## 3. Load and Inspect the Cleaned Dataset

We begin with the final, cleaned dataset produced in HW1, loaded here from `../data/asean_energy_urban_wdi.csv`.

The dataset includes energy, emissions, GDP, and population indicators for ASEAN countries.

Key data-cleaning steps from HW1 ensure:
- Consistent column names and types.
- No duplicated rows.
- Proper numeric formatting.

Here, we will verify:
- Data shape (rows × columns).
- Sample records for structure validation.
- Basic completeness before upload.
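
The checks listed above could be sketched as a small report over the list-of-dicts form that is eventually uploaded. This is an illustrative helper only: `completeness_report` is a hypothetical name, the sample rows are made up, and missing values are assumed to be represented as `None` already.

```python
from collections import Counter


def completeness_report(records, key_cols=("country", "year")):
    """Summarize row count, column count, duplicate keys, and missing values."""
    columns = list(records[0].keys()) if records else []
    # Count how often each (country, year) key occurs
    key_counts = Counter(tuple(r.get(k) for k in key_cols) for r in records)
    duplicate_keys = [k for k, n in key_counts.items() if n > 1]
    # Count None values per column
    missing = {c: sum(1 for r in records if r.get(c) is None) for c in columns}
    return {
        "rows": len(records),
        "columns": len(columns),
        "duplicate_keys": duplicate_keys,
        "missing_per_column": {c: n for c, n in missing.items() if n > 0},
    }


# Made-up sample rows for illustration only (not taken from the dataset)
sample = [
    {"country": "AAA", "year": 2000, "value": 1.5},
    {"country": "AAA", "year": 2001, "value": None},
    {"country": "AAA", "year": 2001, "value": 2.5},
]
report = completeness_report(sample)
print(report)
```

Duplicate keys are worth catching before upload, because an upsert keyed on (`country`, `year`) may fail if the same key appears twice in one batch.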

In [2]:
CSV_PATH  = "../data/asean_energy_urban_wdi.csv"
TABLE_NAME = "brn_indicators"
BATCH_SIZE = 1000  

# 1) read
df = pd.read_csv(
    CSV_PATH,
    keep_default_na=True,
    na_values=["", "NA", "NaN", "nan"]
)

# 2) sanitize to JSON-safe: ±inf -> NaN first, then NaN -> None
#    (cast to object dtype so None is not coerced back to NaN)
df = df.replace([np.inf, -np.inf], np.nan)
df = df.astype(object).where(df.notna(), None)

# 3) records + preflight check (will raise if any NaN/Infinity slipped through)
records = df.to_dict(orient="records")
json.dumps(records, allow_nan=False)

print(f"Ready to upload: {len(records)} rows, {len(df.columns)} columns")
df.head()

Ready to upload: 350 rows, 11 columns


Unnamed: 0,country,year,co2_per_capita_tco2e_excl_lulucf,co2_total_mtco2e_excl_lulucf,energy_use_kg_oe_per_capita,gdp_current_usd,population_total,renewable_electricity_pct,renewable_energy_consumption_pct,urban_pop_pct,missing_indicator_count
0,BRN,1990,16.843458,4.3,6766.202658,6039881086.68157,255292.0,0.0,0.7,66.438,0
1,BRN,1991,18.69602,4.9095,7438.262038,6284497300.38401,262596.0,0.0,0.4,66.585,0
2,BRN,1992,19.868514,5.3613,7783.232736,6327966444.87488,269839.0,0.0,0.2,67.078,0
3,BRN,1993,19.535619,5.4108,7299.532663,6203339925.05349,276971.0,0.0,0.0,67.604,0
4,BRN,1994,19.33817,5.4938,6603.066579,6467782521.31603,284091.0,0.0,0.0,68.126,0


Before uploading to Supabase, the data must be JSON-serializable.

Supabase’s REST API does not accept `NaN`, `inf`, or `-inf` values (invalid JSON tokens).

We handle this by:
- Converting non-finite floats (`np.nan`, `np.inf`, `-np.inf`) to Python `None`, which serializes as JSON `null`.
- Casting columns to `object` dtype so that `None` is not coerced back to `NaN`.
- Validating the payload with `json.dumps(..., allow_nan=False)`, which raises `ValueError` if any non-finite value remains.

This prevents the upload from failing with errors such as `Token "NaN" is invalid (22P02)`.
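
To make the rule concrete, here is a minimal standard-library sketch of the same idea applied to a single record. The helper names (`to_json_safe`, `sanitize_record`) and the sample record are hypothetical; the notebook itself does this at the DataFrame level.

```python
import json
import math


def to_json_safe(value):
    """Map NaN and ±inf to None; leave every other value unchanged."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    return value


def sanitize_record(record):
    """Apply to_json_safe to every value of one record."""
    return {k: to_json_safe(v) for k, v in record.items()}


# Made-up record containing the values that strict JSON rejects
raw = {"country": "AAA", "year": 2000, "a": float("nan"), "b": float("inf"), "c": 1.5}

# With allow_nan=False, json.dumps raises ValueError on NaN/Infinity
try:
    json.dumps(raw, allow_nan=False)
    raw_is_valid = True
except ValueError:
    raw_is_valid = False

clean = sanitize_record(raw)
payload = json.dumps(clean, allow_nan=False)  # succeeds after sanitization
print(raw_is_valid)
print(payload)
```

Running the strict `json.dumps` check locally surfaces a bad value before any request is sent, which is cheaper to debug than a rejected batch.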

## 4. Data Upload and Verification

We use the Supabase Python client’s `upsert()` method to insert or update rows.

**Key properties of this approach:**
- `upsert()` inserts new rows or updates existing ones based on the primary key (`country`, `year`).
- Data is sent in batches of up to `BATCH_SIZE` rows (1,000 in this notebook) to stay under the 10 MB REST payload limit.
- Each batch is validated locally before upload.
- Verification:
  - Query one known row via `maybe_single()`.
  - Count total rows in the cloud table.

This ensures the dataset is now persisted in a cloud-accessible format for subsequent analysis.
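
Whether a given batch size actually stays under the payload limit depends on how wide the rows are. As a rough sketch, the serialized size of each batch can be estimated locally before upload. `iter_batches`, `payload_bytes`, and the demo rows are hypothetical, `MAX_PAYLOAD_BYTES` simply encodes the 10 MB figure quoted above, and the real request body may differ slightly from this estimate.

```python
import json

MAX_PAYLOAD_BYTES = 10 * 1024 * 1024  # assumption: the 10 MB figure quoted above


def iter_batches(records, batch_size):
    """Yield consecutive slices of at most batch_size records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


def payload_bytes(batch):
    """Approximate size in bytes of a batch serialized as UTF-8 JSON."""
    return len(json.dumps(batch, allow_nan=False).encode("utf-8"))


# Made-up rows for illustration only
demo = [{"country": "AAA", "year": 1990 + i, "value": float(i)} for i in range(2500)]
batches = list(iter_batches(demo, 1000))
sizes = [payload_bytes(b) for b in batches]

print([len(b) for b in batches])
print(all(s < MAX_PAYLOAD_BYTES for s in sizes))
```

If a batch comes out too large, lowering the batch size is the simplest adjustment.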

In [3]:
from math import ceil
from postgrest import APIError  # optional: to catch specific API errors

total = len(records)
num_batches = ceil(total / BATCH_SIZE)
print(f"Uploading {total} rows in {num_batches} batch(es) to '{TABLE_NAME}'...")

for i in range(0, total, BATCH_SIZE):
    batch = records[i:i + BATCH_SIZE]
    try:
        resp = (
            supabase
            .table(TABLE_NAME)
            .upsert(batch, returning="minimal")  # don't echo rows back
            .execute()
        )
        # success: no exception thrown
        print(f"  ✓ Batch {i//BATCH_SIZE + 1}/{num_batches} — {len(batch)} rows")
    except APIError as e:
        # PostgREST / Supabase API error (e.g., JSON NaN, RLS, etc.)
        raise RuntimeError(f"Batch starting at {i} failed: {e.message}") from e
    except Exception as e:
        # Any other client/network error
        raise RuntimeError(f"Batch starting at {i} failed: {e}") from e

print("Upload complete")

Uploading 350 rows in 1 batch(es) to 'brn_indicators'...
  ✓ Batch 1/1 — 350 rows
Upload complete


In [4]:
# Verify one row
probe = records[0]
q = supabase.table(TABLE_NAME).select("*")
for k in ("country", "year"):
    if k in probe:
        q = q.eq(k, probe[k])
row = q.limit(1).maybe_single().execute()
print("One row:", row.data)

# Count rows
count_resp = supabase.table(TABLE_NAME).select("country", count="exact").execute()
print("Total rows:", count_resp.count)

One row: {'country': 'BRN', 'year': 1990, 'co2_per_capita_tco2e_excl_lulucf': 16.8434576876675, 'co2_total_mtco2e_excl_lulucf': 4.3, 'energy_use_kg_oe_per_capita': 6766.20265813163, 'gdp_current_usd': 6039881086.68157, 'population_total': 255292, 'renewable_electricity_pct': 0, 'renewable_energy_consumption_pct': 0.7, 'urban_pop_pct': 66.438, 'missing_indicator_count': 0}
Total rows: 350


## 5. Summary

This notebook successfully demonstrates:
- A secure and reusable cloud upload pipeline.
- Proper credential handling with `.env`.
- Robust error handling and data validation.
- Direct integration between local analytics and cloud data storage.

The resulting Supabase table (`brn_indicators`) is now ready for analysis in the next phase (`02_analysis_from_db.ipynb`).