# 🔄 02 - Transform Raw JSON into Clean Dataset

This notebook performs the **transformation stage** of the pipeline.

Raw API responses are often nested and not suitable for analysis.
Here, we convert raw JSON into a **clean, standardized DataFrame**
with a consistent schema.

### Input
- Latest file in `../data/raw/`

### Output
- `../data/processed/crypto_processed_<timestamp>.json`

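### Output Schema
- `timestamp`: datetime, converted from epoch milliseconds and written as an ISO 8601 string
- `price`: numeric price in USD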

In [1]:
import json
import pandas as pd
from pathlib import Path

## 📥 Load Latest Raw JSON File

We automatically select the most recent raw file.
Filenames embed a zero-padded `YYYYMMDD_HHMMSS` timestamp,
so sorting them by name also sorts them chronologically.

In [2]:
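RAW_DIR = Path("../data/raw")

# Pick the last file in name order; fail clearly if the directory is empty.
raw_files = sorted(RAW_DIR.glob("*.json"))
if not raw_files:
    raise FileNotFoundError(f"No raw JSON files found in {RAW_DIR}")
raw_path = raw_files[-1]

with open(raw_path, "r", encoding="utf-8") as f:
    raw_data = json.load(f)

raw_path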

WindowsPath('../data/raw/crypto_raw_20251217_140226.json')
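
A minimal sketch of a stricter selection that parses the timestamp out of each filename instead of relying on name order. It assumes the `crypto_raw_<YYYYMMDD_HHMMSS>.json` naming seen in the output above. The helper name `latest_raw_file` is hypothetical and not part of the pipeline.

```python
from datetime import datetime
from pathlib import Path
import re

# Assumed filename pattern, inferred from the path shown in the cell output.
_STAMP = re.compile(r"crypto_raw_(\d{8}_\d{6})\.json$")

def latest_raw_file(paths):
    """Return the path with the newest embedded timestamp (hypothetical helper)."""
    stamped = []
    for p in map(Path, paths):
        m = _STAMP.search(p.name)
        if m:  # skip files that do not follow the naming scheme
            stamped.append((datetime.strptime(m.group(1), "%Y%m%d_%H%M%S"), p))
    if not stamped:
        raise FileNotFoundError("no timestamped raw files found")
    return max(stamped, key=lambda pair: pair[0])[1]

# Illustrative filenames only; the second one matches the output above.
example = [
    "crypto_raw_20251216_235959.json",
    "crypto_raw_20251217_140226.json",
    "notes.json",
]
chosen = latest_raw_file(example)
```

Parsing the stamp costs a few extra lines, but stray files in the directory such as `notes.json` can no longer be picked up by accident.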

## 🔧 Transform JSON → Clean DataFrame

The CoinGecko market chart endpoint provides data in this structure:


    "prices": [
        [timestamp_ms, price_usd],
        ...
    ]


We extract this into a DataFrame with readable timestamps.
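
To make the conversion concrete, here is a minimal sketch on a hand-written payload. The values in `sample` are made up for illustration and are not real API output; only the `prices` layout follows the structure described above.

```python
import pandas as pd

# Made-up payload in the [timestamp_ms, price_usd] layout (not real market data).
sample = {
    "prices": [
        [1700000000000, 100.0],
        [1700003600000, 101.5],
    ]
}

sample_df = pd.DataFrame(sample.get("prices", []), columns=["timestamp", "price"])

# Epoch milliseconds count from 1970-01-01 UTC; the result is a naive datetime column.
sample_df["timestamp"] = pd.to_datetime(sample_df["timestamp"], unit="ms")

print(sample_df.dtypes)
```

Because the resulting column carries no timezone, downstream code should treat the values as UTC.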

In [3]:
prices = raw_data.get("prices", [])

df = pd.DataFrame(prices, columns=["timestamp", "price"])
df["timestamp"] = pd.to_datetime(df["timestamp"], unit="ms")

df.head()

## 🧹 Clean & Validate Data

We:
- Sort rows by timestamp and reset the index
- Coerce `price` to a numeric type
- Drop rows whose price could not be parsed
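
The three steps can be seen on a deliberately messy frame. This is a sketch with invented rows, and the helper name `clean_prices` is hypothetical; it mirrors the sort, coerce, and drop sequence rather than adding new rules.

```python
import pandas as pd

def clean_prices(frame):
    """Sort by timestamp, coerce price to numeric, drop unparseable rows."""
    out = frame.sort_values("timestamp").reset_index(drop=True)
    # errors="coerce" turns values that cannot be parsed into NaN instead of raising
    out["price"] = pd.to_numeric(out["price"], errors="coerce")
    return out.dropna(subset=["price"])

# Invented rows: out of order, with one price that is not a number.
messy = pd.DataFrame({
    "timestamp": pd.to_datetime(["2025-01-02", "2025-01-01", "2025-01-03"]),
    "price": ["101.5", "100.0", "n/a"],
})

cleaned = clean_prices(messy)
```

Here the `"n/a"` row becomes `NaN` during coercion and is then dropped, leaving two rows in chronological order.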

In [4]:
df = df.sort_values("timestamp").reset_index(drop=True)
df["price"] = pd.to_numeric(df["price"], errors="coerce")
df = df.dropna(subset=["price"])

df.head()

## 💾 Save Processed Dataset

The cleaned dataset is saved for downstream loading and visualization.

In [5]:
PROCESSED_DIR = Path("../data/processed")
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

timestamp = pd.Timestamp.now().strftime("%Y%m%d_%H%M%S")
processed_path = PROCESSED_DIR / f"crypto_processed_{timestamp}.json"

df.to_json(processed_path, orient="records", indent=4, date_format="iso")

processed_path

WindowsPath('../data/processed/crypto_processed_20251217_140249.json')
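
A quick way to confirm that a saved file matches the output schema is to read it back. The sketch below is a round trip on invented rows written to a temporary directory; it is an illustration under those assumptions, not a check the notebook itself runs.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Invented rows standing in for the cleaned DataFrame.
frame = pd.DataFrame({
    "timestamp": pd.to_datetime(["2025-01-01 00:00:00", "2025-01-01 01:00:00"]),
    "price": [100.0, 101.5],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "crypto_processed_example.json"
    frame.to_json(path, orient="records", indent=4, date_format="iso")

    # orient="records" writes a list of {"timestamp": ..., "price": ...} objects
    with open(path, "r", encoding="utf-8") as f:
        records = json.load(f)

restored = pd.DataFrame(records)
# Parse the ISO strings as UTC so the comparison does not depend on a "Z" suffix.
restored["timestamp"] = pd.to_datetime(restored["timestamp"], utc=True)
```

Loading with `json` and converting the column explicitly keeps the date parsing visible instead of relying on automatic inference.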

## ✅ Transformation Complete

The dataset is now clean and structured.

Proceed to:
➡ **03_load_data.ipynb**