# Rearc Data Quest - Part 2

## What this notebook does
1. **Fetch the payload** from the DataUSA API using the `tesseract/data.jsonrecords` endpoint with `drilldowns=Year,Nation` and `measures=Population`.  
2. **Do not reshape or aggregate** the JSON. We upload the same content the API returns.  
3. **Write the JSON** to `population_data.json` locally (re-serialized with `json.dump`, so the content is unchanged but the bytes may differ from the raw response) and then **PUT that file's bytes** to S3.  
4. **Validate** (optional) to catch mistakes (e.g., verify `Year` appears in `columns` and `data` has more than one row).

> Set your AWS credentials in the environment (or credentials file) and update `BUCKET_NAME` and `OBJECT_KEY` below.

In [1]:
# --- Imports ---
import json
import os
from typing import Dict, Any
import requests
import boto3
from botocore.exceptions import ClientError


In [2]:
# --- Configuration ---
# Base endpoint only; requests builds the query string from API_PARAMS,
# so repeating the parameters here would send each of them twice.
API_URL = "https://honolulu-api.datausa.io/tesseract/data.jsonrecords"
API_PARAMS = {
    "cube": "acs_yg_total_population_1",
    "drilldowns": "Year,Nation",
    "locale": "en",
    "measures": "Population"
}

# >>>> EDIT THESE if needed <<<<
BUCKET_NAME = os.getenv("REARC_DQ_BUCKET", "rearc-dataquest-harpreet")  # change if needed
OBJECT_KEY  = "part2/population_data.json"                              # path in S3 bucket
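
As a minimal sketch of how a parameter dictionary like `API_PARAMS` turns into a query string, the standard library's `urlencode` can preview the encoding (`requests` applies a comparable encoding when it is given `params`). The names `base_url`, `params`, and `preview_url` are illustrative and not part of the notebook.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical names for illustration; the values mirror the configuration cell.
base_url = "https://honolulu-api.datausa.io/tesseract/data.jsonrecords"
params = {
    "cube": "acs_yg_total_population_1",
    "drilldowns": "Year,Nation",
    "locale": "en",
    "measures": "Population",
}

# urlencode percent-encodes the comma in "Year,Nation" as %2C.
preview_url = f"{base_url}?{urlencode(params)}"
print(preview_url)

# Parsing the query back shows how many times each parameter appears.
parsed = parse_qs(urlsplit(preview_url).query)
print(parsed["drilldowns"])
```

If a parameter shows up more than once in the parsed query, the URL and the parameter dictionary are probably both supplying it.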

In [None]:
# --- Step 1: Fetch the payload (Year,Nation drilldowns) ---
def fetch_population_payload() -> Dict[str, Any]:
    resp = requests.get(API_URL, params=API_PARAMS, timeout=30)
    resp.raise_for_status()
    return resp.json()

payload = fetch_population_payload()

# Print a short preview for sanity (first 2 data rows if present)
print("columns:", payload.get("columns", []))
print("data rows:", len(payload.get("data", [])))
print("sample:", payload.get("data", [])[:2])

columns: ['Nation ID', 'Nation', 'Year', 'Population']
data rows: 10
sample: [{'Nation ID': '01000US', 'Nation': 'United States', 'Year': 2013, 'Population': 316128839.0}, {'Nation ID': '01000US', 'Nation': 'United States', 'Year': 2014, 'Population': 318857056.0}]
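
The fetch above makes a single attempt. As a hedged sketch of one way to tolerate transient failures, the helper below retries a callable with exponential backoff. `with_retries`, its parameters, and the `flaky` demo function are hypothetical names introduced here, not part of the notebook or of `requests`.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")

def with_retries(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it succeeds or attempts run out, doubling the delay each time."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            # Re-raise on the final attempt so the caller sees the real error.
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")

# Demo with a stand-in function that fails twice, then succeeds.
calls = {"n": 0}

def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"

delays: List[float] = []  # record delays instead of actually sleeping
result = with_retries(flaky, attempts=3, base_delay=0.5, sleep=delays.append)
print(result, delays)  # ok [0.5, 1.0]
```

In the notebook this could wrap the existing function, for example `payload = with_retries(fetch_population_payload)`. Catching `Exception` keeps the sketch short; narrowing it to `requests.RequestException` would avoid retrying on programming errors.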


In [None]:
# --- Step 2: Validate structure before writing ---
columns = payload.get("columns", [])
data = payload.get("data", [])

assert "Year" in columns, "Expected 'Year' in columns – check drilldowns=Year,Nation"
assert "Nation" in columns and "Population" in columns, "Missing expected columns"
assert len(data) > 1, "Expected multiple yearly rows; got a single row – check endpoint/params"

print("Validation OK")

Validation OK
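
Because `assert` statements are removed when Python runs with the `-O` flag, the same checks can also be written as an explicit function. This is a sketch, assuming the payload shape shown in the preview (a `columns` list and a `data` list); `validate_payload` and `REQUIRED_COLUMNS` are hypothetical names.

```python
from typing import Any, Dict, List

# Hypothetical constant: the columns the notebook's asserts look for.
REQUIRED_COLUMNS = ("Year", "Nation", "Population")

def validate_payload(payload: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems: List[str] = []
    columns = payload.get("columns", [])
    data = payload.get("data", [])

    for name in REQUIRED_COLUMNS:
        if name not in columns:
            problems.append(f"missing column: {name}")
    if len(data) <= 1:
        problems.append(f"expected multiple yearly rows, got {len(data)}")
    return problems

# Example with a tiny hand-made payload (not real API data).
sample = {
    "columns": ["Nation", "Year", "Population"],
    "data": [{"Year": 1}, {"Year": 2}],
}
print(validate_payload(sample))  # []
print(validate_payload({"columns": ["Nation"], "data": []}))
```

Returning a list of problems rather than raising on the first one lets a single run report every failed check.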


In [5]:
# --- Step 3: Save EXACT payload to local file (no transformation) ---
LOCAL_JSON_PATH = "population_data.json"
with open(LOCAL_JSON_PATH, "w", encoding="utf-8") as f:
    json.dump(payload, f, ensure_ascii=False, indent=2)
print(f"Wrote {LOCAL_JSON_PATH} with {len(data)} rows")

Wrote population_data.json with 10 rows
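
A minimal sketch of the difference between content-identical and byte-identical output, using a hand-made byte string in place of a real response body (`raw_bytes` is an assumption for illustration). If byte-for-byte fidelity is required, one option is to write the raw response bytes (`resp.content` in `requests`) to a file opened in binary mode.

```python
import hashlib
import json

# Stand-in for a raw HTTP response body (hand-made, not real API data).
raw_bytes = b'{"columns":["Year","Population"],"data":[{"Year":2013,"Population":1.0}]}'

# Parsing and re-serializing with indent=2 changes whitespace, hence the bytes.
parsed = json.loads(raw_bytes)
reserialized = json.dumps(parsed, ensure_ascii=False, indent=2).encode("utf-8")

raw_digest = hashlib.sha256(raw_bytes).hexdigest()
new_digest = hashlib.sha256(reserialized).hexdigest()

print("bytes identical:  ", raw_digest == new_digest)            # False
print("content identical:", json.loads(reserialized) == parsed)  # True
```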


In [None]:
# --- Step 4: Upload to S3 ---
def upload_to_s3(local_path: str, bucket: str, key: str):
    s3 = boto3.client("s3")
    with open(local_path, "rb") as fh:
        s3.put_object(Bucket=bucket, Key=key, Body=fh, ContentType="application/json")
    print(f"Uploaded s3://{bucket}/{key}")

try:
    upload_to_s3(LOCAL_JSON_PATH, BUCKET_NAME, OBJECT_KEY)
except ClientError as e:
    print("S3 upload failed:", e)
    print("• Ensure AWS credentials are configured for the target account/region.")
    print("• Ensure bucket exists and you have PutObject permission.")

Uploaded s3://rearc-dataquest-harpreet/part2/population_data.json


In [None]:
# --- Optional: Download back from S3 and re-validate (read-after-write) ---
def download_from_s3(bucket: str, key: str) -> Dict[str, Any]:
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket=bucket, Key=key)
    return json.loads(obj["Body"].read().decode("utf-8"))

try:
    roundtrip = download_from_s3(BUCKET_NAME, OBJECT_KEY)
    assert roundtrip == payload, "Round-trip mismatch – uploaded content differs from API payload"
    print("Round-trip check OK (S3 content matches API payload)")
except ClientError as e:
    print("S3 read-back skipped/failed (this is optional):", e)

Round-trip check OK (S3 content matches API payload)


## Troubleshooting Notes
- If your uploaded file still lacks `Year`, double-check:
  - You are running **this** notebook (and not the older one).
  - The `API_URL` path includes **`data.jsonrecords`** (plural) and **not** `data.json`.
  - `API_PARAMS["drilldowns"] == "Year,Nation"` exactly. The sample output above lists `Nation` before `Year`, so check for column membership rather than position.
- If S3 upload fails, confirm:
  - IAM user/role has `s3:PutObject` on the bucket/key.
  - The bucket exists and is in your account (or correct cross-account policy is in place).
  - Region is correct (set via AWS config or environment).
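
When an upload fails, botocore's `ClientError` carries the service error code in `e.response["Error"]["Code"]`. The sketch below maps a few S3 error codes to the checks listed above. `hint_for_s3_error`, `HINTS`, and the hint wording are illustrative, and the fake dictionary stands in for a real exception's response.

```python
from typing import Any, Dict

# Illustrative mapping from S3 error codes to the checks in the notes above.
HINTS = {
    "AccessDenied": "Check that the IAM user/role has s3:PutObject on the bucket/key.",
    "NoSuchBucket": "Check that the bucket exists and the name is spelled correctly.",
    "PermanentRedirect": "Check that the client is using the bucket's region.",
}

def hint_for_s3_error(error_response: Dict[str, Any]) -> str:
    """Return a troubleshooting hint for a ClientError-style response dict."""
    code = error_response.get("Error", {}).get("Code", "Unknown")
    return HINTS.get(code, f"Unrecognized error code: {code}")

# Fake response shaped like ClientError.response (not a real AWS reply).
fake = {"Error": {"Code": "NoSuchBucket", "Message": "The specified bucket does not exist"}}
print(hint_for_s3_error(fake))
```

Inside the notebook's `except ClientError as e:` block, this could be called as `hint_for_s3_error(e.response)`.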