
# 01 — Data Collection via SpaceX REST API (v4)

**Project:** IBM Applied Data Science Capstone — SpaceX  
**Goal:** Fetch, normalize, and save SpaceX launch-related datasets needed for downstream notebooks:

- `02_data-wrangling`
- `03_eda-visualization`
- `04_eda_sql`
- `05_folium-interactive`
- `06_ml-classification`

This notebook produces both **raw JSON** and **clean CSV** artifacts in `./data/`.



## What this notebook will produce

**Raw JSON (for reproducibility & auditing)**
- `data/payloads.json`
- `data/rockets.json`
- `data/launchpads.json`
- `data/launches_raw.json`  *(launches with populated references)*

**Tabular CSV (normalized for analysis)**
- `data/launches_exploded.csv` *(one row per **payload** per **launch**)*
- `data/launches_summary.csv`  *(one row per **launch**; payload mass aggregated)*

> These files directly support required plots and SQL tasks (e.g., Flight Number vs Launch Site, Payload vs Orbit/Launch Site, success rates, and landing outcome analyses).


## Setup

In [5]:

# If running in a fresh environment, uncomment to install:
# !pip install pandas numpy requests

import json
import time
from pathlib import Path
from typing import Dict, Any, List, Optional

import pandas as pd
import numpy as np
import requests

# Paths
DATA_DIR = Path("./data")
ARTIFACTS_DIR = Path("./artifacts")
DATA_DIR.mkdir(exist_ok=True, parents=True)
ARTIFACTS_DIR.mkdir(exist_ok=True, parents=True)

BASE_URL = "https://api.spacexdata.com/v4"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/124.0 Safari/537.36"
}

pd.set_option("display.max_columns", 100)
pd.set_option("display.width", 160)

print(f"Data dir: {DATA_DIR.resolve()}")
print(f"Artifacts dir: {ARTIFACTS_DIR.resolve()}")


Data dir: /Users/johnpaulsandiego/Desktop/kData/data-science-capstone/data
Artifacts dir: /Users/johnpaulsandiego/Desktop/kData/data-science-capstone/artifacts


## Helper functions

In [8]:

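def _retry_get(url: str, retries: int = 5, backoff: float = 0.8, **kwargs) -> requests.Response:
    """GET with linearly increasing backoff; retries on 429/5xx and request errors."""
    last_err = None
    for i in range(retries):
        try:
            r = requests.get(url, headers=HEADERS, timeout=30, **kwargs)
            if r.status_code == 429 or 500 <= r.status_code < 600:
                # Record the failure so it is raised if every attempt is exhausted
                last_err = requests.HTTPError(f"HTTP {r.status_code} for {url}", response=r)
                time.sleep((i + 1) * backoff)
                continue
            r.raise_for_status()
            return r
        except requests.RequestException as e:
            last_err = e
            time.sleep((i + 1) * backoff)
    # Never fall through and return None: callers immediately call .json() on the result
    raise last_err or RuntimeError(f"GET {url} was not attempted (retries={retries})")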


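def _retry_post(url: str, json_body: Dict[str, Any], retries: int = 5, backoff: float = 0.8, **kwargs) -> requests.Response:
    """POST a JSON body with linearly increasing backoff; retries on 429/5xx and request errors."""
    last_err = None
    for i in range(retries):
        try:
            r = requests.post(url, headers=HEADERS, json=json_body, timeout=30, **kwargs)
            if r.status_code == 429 or 500 <= r.status_code < 600:
                # Record the failure so it is raised if every attempt is exhausted
                last_err = requests.HTTPError(f"HTTP {r.status_code} for {url}", response=r)
                time.sleep((i + 1) * backoff)
                continue
            r.raise_for_status()
            return r
        except requests.RequestException as e:
            last_err = e
            time.sleep((i + 1) * backoff)
    # Never fall through and return None: callers immediately call .json() on the result
    raise last_err or RuntimeError(f"POST {url} was not attempted (retries={retries})")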


def fetch_all(endpoint: str) -> List[Dict[str, Any]]:
    """GET /v4/<endpoint> returning an array."""
    url = f"{BASE_URL}/{endpoint.strip('/')}"
    resp = _retry_get(url)
    return resp.json()


def query_collection(collection: str, query: Dict[str, Any], options: Dict[str, Any]) -> Dict[str, Any]:
    """POST /v4/<collection>/query with Mongo-style query + populate options."""
    url = f"{BASE_URL}/{collection.strip('/')}/query"
    payload = {"query": query, "options": options}
    resp = _retry_post(url, payload)
    return resp.json()


def save_json(path: Path, obj: Any) -> None:
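    # ensure_ascii=False emits non-ASCII characters verbatim, so pin the file encoding
    path.write_text(json.dumps(obj, indent=2, ensure_ascii=False), encoding="utf-8")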


def to_dataframe(records: List[Dict[str, Any]]) -> pd.DataFrame:
    return pd.json_normalize(records, sep=".") if records else pd.DataFrame()
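
The retry helpers wait `(i + 1) * backoff` seconds between attempts, which is a linear schedule. If an exponential schedule is preferred, the delays could be computed roughly as below. This is only a sketch: `backoff_delays` and its `factor`, `cap`, and `jitter` parameters are hypothetical names, not part of the notebook.

```python
import random
from typing import List


def backoff_delays(retries: int = 5, base: float = 0.8, factor: float = 2.0,
                   cap: float = 30.0, jitter: float = 0.0) -> List[float]:
    """Return the wait in seconds before each retry: base * factor**i, capped at `cap`.

    Hypothetical helper. `jitter` adds up to that fraction of random extra delay
    so that parallel clients do not all retry at the same moment.
    """
    delays = []
    for i in range(retries):
        delay = min(cap, base * factor ** i)
        delay += delay * jitter * random.random()
        delays.append(delay)
    return delays


print(backoff_delays(retries=4, base=1.0))  # [1.0, 2.0, 4.0, 8.0]
```

Each value would replace the `(i + 1) * backoff` argument passed to `time.sleep` in the corresponding attempt.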


## Fetch reference data: rockets, launchpads, payloads

In [11]:

rockets = fetch_all("rockets")
launchpads = fetch_all("launchpads")
payloads_all = fetch_all("payloads")  # complete catalog (used later to cross-check)

# Persist raw JSONs
save_json(DATA_DIR / "rockets.json", rockets)
save_json(DATA_DIR / "launchpads.json", launchpads)
save_json(DATA_DIR / "payloads.json", payloads_all)

print(f"rockets: {len(rockets)} | launchpads: {len(launchpads)} | payloads: {len(payloads_all)}")
pd.DataFrame(rockets)[["name","id"]].head()


rockets: 4 | launchpads: 6 | payloads: 225


| | name | id |
|---|---|---|
| 0 | Falcon 1 | 5e9d0d95eda69955f709d1eb |
| 1 | Falcon 9 | 5e9d0d95eda69973a809d1ec |
| 2 | Falcon Heavy | 5e9d0d95eda69974db09d1ed |
| 3 | Starship | 5e9d0d96eda699382d09d1ee |


### Identify Falcon 9 rocket ID

In [14]:

df_rockets = to_dataframe(rockets)
falcon9_row = df_rockets.loc[df_rockets["name"].str.lower() == "falcon 9"].head(1)
if falcon9_row.empty:
    raise RuntimeError("Falcon 9 rocket not found in rockets endpoint.")
FALCON9_ID = falcon9_row.iloc[0]["id"]
FALCON9_ID


'5e9d0d95eda69973a809d1ec'


## Fetch launches (Falcon 9) with populated references

We POST to `/launches/query` to keep only past **Falcon 9** launches (`upcoming: false`) and populate the referenced documents:
- `rocket` → name
- `launchpad` → name, region, locality, latitude/longitude
- `payloads` → id, name, mass_kg, orbit, customers, nationalities
- `cores.core` → serial, block
- `cores.landpad` → name (e.g., LZ-1, OCISLY, JRTI)


In [17]:

populate = [
    {"path": "rocket", "select": {"name": 1}},
    {"path": "launchpad", "select": {"name": 1, "region": 1, "locality": 1, "latitude": 1, "longitude": 1}},
    {"path": "payloads", "select": {"name": 1, "id": 1, "mass_kg": 1, "orbit": 1, "customers": 1, "nationalities": 1}},
    {"path": "cores.core", "select": {"serial": 1, "block": 1}},
    {"path": "cores.landpad", "select": {"name": 1}},
]

options = {
    "pagination": False,
    "populate": populate,
    "select": {
        "flight_number": 1,
        "name": 1,
        "date_utc": 1,
        "success": 1,
        "rocket": 1,
        "payloads": 1,
        "cores": 1,
        "launchpad": 1
    },
    "sort": {"flight_number": 1}
}

query = {
    "rocket": FALCON9_ID,
    "upcoming": False
}

launches_res = query_collection("launches", query, options)
launches_docs = launches_res.get("docs", [])
save_json(DATA_DIR / "launches_raw.json", launches_docs)

len(launches_docs)


179
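
The request above sets `"pagination": False` to get every document in one response. If the result set grows or the server enforces paging, the same query could be walked page by page. The sketch below assumes the query response carries `hasNextPage` and `nextPage` next to `docs` and accepts `limit` and `page` in `options`; `iter_query_pages` and `post_query` are hypothetical names, and the demo runs against a fake backend rather than the real API.

```python
from typing import Any, Callable, Dict, Iterator


def iter_query_pages(post_query: Callable[[Dict[str, Any]], Dict[str, Any]],
                     query: Dict[str, Any], options: Dict[str, Any],
                     limit: int = 50) -> Iterator[Dict[str, Any]]:
    """Yield every document matching `query`, requesting one page at a time.

    `post_query` is any callable that sends the request body and returns the
    decoded JSON response (in this notebook it could wrap `_retry_post`).
    Assumes the response exposes `docs`, `hasNextPage` and `nextPage`.
    """
    page = 1
    while True:
        paged = {**options, "pagination": True, "limit": limit, "page": page}
        res = post_query({"query": query, "options": paged})
        yield from res.get("docs", [])
        if not res.get("hasNextPage"):
            break
        page = res.get("nextPage") or page + 1


def fake_post(body: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in backend holding five documents, used only for this demo."""
    opts = body["options"]
    data = [{"flight_number": n} for n in range(1, 6)]
    start = (opts["page"] - 1) * opts["limit"]
    has_next = start + opts["limit"] < len(data)
    return {
        "docs": data[start:start + opts["limit"]],
        "hasNextPage": has_next,
        "nextPage": opts["page"] + 1 if has_next else None,
    }


docs = list(iter_query_pages(fake_post, {"upcoming": False},
                             {"sort": {"flight_number": 1}}, limit=2))
print(len(docs))  # 5
```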

## Normalize to analysis tables

In [20]:

def compute_landing_outcome(core_entry: Dict[str, Any]) -> str:
    """Return label like 'Success (drone ship)', 'Failure (ground pad)', 'No attempt', 'Unknown'."""
    if not core_entry:
        return "Unknown"
    attempt = core_entry.get("landing_attempt")
    success = core_entry.get("landing_success")
    ltype = core_entry.get("landing_type")  # 'ASDS' or 'RTLS' etc.
    landpad = (core_entry.get("landpad") or {}).get("name") if isinstance(core_entry.get("landpad"), dict) else None

    if not attempt:
        return "No attempt"

    # Normalize type to drone/ground
    site_kind = None
    if ltype:
        lt = str(ltype).upper()
        if "ASDS" in lt:
            site_kind = "drone ship"
        elif "RTLS" in lt:
            site_kind = "ground pad"

    if success is True:
        if site_kind:
            return f"Success ({site_kind})"
        return "Success"
    elif success is False:
        if site_kind:
            return f"Failure ({site_kind})"
        return "Failure"
    else:
        return "Unknown"


def flatten_launches_to_exploded(launches: List[Dict[str, Any]]) -> pd.DataFrame:
    rows = []
    for doc in launches:
        flight = doc.get("flight_number")
        date_utc = doc.get("date_utc")
        success = doc.get("success")
        rocket_name = (doc.get("rocket") or {}).get("name") if isinstance(doc.get("rocket"), dict) else None

        pad = doc.get("launchpad") or {}
        site_name = pad.get("name")
        site_region = pad.get("region")
        site_locality = pad.get("locality")
        site_lat = pad.get("latitude")
        site_lon = pad.get("longitude")

        cores = doc.get("cores") or []
        core0 = cores[0] if cores else {}
        core_serial = ((core0.get("core") or {}).get("serial") if isinstance(core0.get("core"), dict) else None)
        core_block = ((core0.get("core") or {}).get("block") if isinstance(core0.get("core"), dict) else None)
        landing_outcome = compute_landing_outcome(core0)

        payloads = doc.get("payloads") or []
        if not payloads:
            rows.append({
                "flight_number": flight,
                "date_utc": date_utc,
                "year": pd.to_datetime(date_utc, errors="coerce", utc=True).year if date_utc else None,
                "launch_site": site_name,
                "site_region": site_region,
                "site_locality": site_locality,
                "site_lat": site_lat,
                "site_lon": site_lon,
                "rocket_name": rocket_name,
                "core_serial": core_serial,
                "booster_block": core_block,
                "landing_outcome": landing_outcome,
                "launch_success": success,
                "payload_id": None,
                "payload_name": None,
                "payload_mass_kg": None,
                "orbit": None,
                "customers": None,
                "nationalities": None,
            })
        else:
            for p in payloads:
                rows.append({
                    "flight_number": flight,
                    "date_utc": date_utc,
                    "year": pd.to_datetime(date_utc, errors="coerce", utc=True).year if date_utc else None,
                    "launch_site": site_name,
                    "site_region": site_region,
                    "site_locality": site_locality,
                    "site_lat": site_lat,
                    "site_lon": site_lon,
                    "rocket_name": rocket_name,
                    "core_serial": core_serial,
                    "booster_block": core_block,
                    "landing_outcome": landing_outcome,
                    "launch_success": success,
                    "payload_id": p.get("id"),
                    "payload_name": p.get("name"),
                    "payload_mass_kg": p.get("mass_kg"),
                    "orbit": p.get("orbit"),
                    "customers": ", ".join(p.get("customers") or []),
                    "nationalities": ", ".join(p.get("nationalities") or []),
                })
    return pd.DataFrame(rows)


df_exploded = flatten_launches_to_exploded(launches_docs)
print(df_exploded.shape)
df_exploded.sample(5, random_state=42) if not df_exploded.empty else df_exploded.head()


(192, 19)


| | flight_number | date_utc | year | launch_site | site_region | site_locality | site_lat | site_lon | rocket_name | core_serial | booster_block | landing_outcome | launch_success | payload_id | payload_name | payload_mass_kg | orbit | customers | nationalities |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 45 | 47 | 2017-09-07T13:50:00.000Z | 2017 | KSC LC 39A | Florida | Cape Canaveral | 28.608058 | -80.603956 | Falcon 9 | B1040 | 4 | Success (ground pad) | True | 5eb0e4c5b6c3bb0006eeb214 | X-37B OTV-5 | 4990.0 | LEO | USAF | United States |
| 136 | 134 | 2021-09-14T03:55:00.000Z | 2021 | VAFB SLC 4E | California | Vandenberg Space Force Base | 34.632093 | -120.610829 | Falcon 9 | B1049 | 5 | Success (drone ship) | True | 60e3bf3373359e1e20335c3c | Starlink 2-1 (v1.5) | 15600.0 | PO | SpaceX | United States |
| 76 | 75 | 2019-02-22T01:45:00.000Z | 2019 | CCSFS SLC 40 | Florida | Cape Canaveral | 28.561857 | -80.577366 | Falcon 9 | B1048 | 5 | Success (drone ship) | True | 5eb0e4cab6c3bb0006eeb234 | Beresheet | 585.0 | GTO | SpaceIL | Israel |
| 143 | 141 | 2021-12-18T12:41:40.000Z | 2021 | VAFB SLC 4E | California | Vandenberg Space Force Base | 34.632093 | -120.610829 | Falcon 9 | B1051 | 5 | Success (drone ship) | True | 61bbac16437241381bf70632 | Starlink 4-4 (v1.5) | 13260.0 | PO | SpaceX | United States |
| 113 | 113 | 2021-01-08T02:15:00.000Z | 2021 | CCSFS SLC 40 | Florida | Cape Canaveral | 28.561857 | -80.577366 | Falcon 9 | B1060 | 5 | Success (drone ship) | True | 5eb0e4d3b6c3bb0006eeb264 | Turksat 5A | 3500.0 | GTO | Turksat | Turkey |
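
When only payload fields plus a few launch-level fields are needed, `pd.json_normalize` with `record_path` can produce a similar one-row-per-payload shape in a single call. The sketch below uses invented toy records shaped like the populated launch documents. One difference from `flatten_launches_to_exploded`: a launch with an empty `payloads` list yields no row here, whereas the function above emits a placeholder row.

```python
import pandas as pd

# Toy documents shaped like populated launch records (all values are invented)
toy_launches = [
    {"flight_number": 1, "launchpad": {"name": "Site A"},
     "payloads": [{"id": "p1", "mass_kg": 100.0}, {"id": "p2", "mass_kg": None}]},
    {"flight_number": 2, "launchpad": {"name": "Site B"},
     "payloads": [{"id": "p3", "mass_kg": 50.0}]},
    {"flight_number": 3, "launchpad": {"name": "Site A"}, "payloads": []},
]

toy_exploded = pd.json_normalize(
    toy_launches,
    record_path="payloads",                          # one output row per payload
    meta=["flight_number", ["launchpad", "name"]],   # launch-level fields to repeat
    record_prefix="payload_",
)
print(toy_exploded)
```

The result has three rows (two for flight 1, one for flight 2) with columns `payload_id`, `payload_mass_kg`, `flight_number`, and `launchpad.name`.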


### Build per-launch summary (aggregated payload mass)

In [23]:

def summarize_by_launch(df: pd.DataFrame) -> pd.DataFrame:
    if df.empty:
        return df
    agg = (df.groupby(["flight_number","date_utc","year","launch_site","rocket_name","core_serial","booster_block","landing_outcome","launch_success"], dropna=False)
             .agg(total_payload_mass_kg=("payload_mass_kg","sum")).reset_index())
    # Ensure numeric
    agg["total_payload_mass_kg"] = pd.to_numeric(agg["total_payload_mass_kg"], errors="coerce")
    return agg.sort_values("flight_number")

df_summary = summarize_by_launch(df_exploded.copy())
df_summary.head(10)


| | flight_number | date_utc | year | launch_site | rocket_name | core_serial | booster_block | landing_outcome | launch_success | total_payload_mass_kg |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 2010-06-04T18:45:00.000Z | 2010 | CCSFS SLC 40 | Falcon 9 | B0003 | 1 | No attempt | True | 0.0 |
| 1 | 7 | 2010-12-08T15:43:00.000Z | 2010 | CCSFS SLC 40 | Falcon 9 | B0004 | 1 | No attempt | True | 0.0 |
| 2 | 8 | 2012-05-22T07:44:00.000Z | 2012 | CCSFS SLC 40 | Falcon 9 | B0005 | 1 | No attempt | True | 525.0 |
| 3 | 9 | 2012-10-08T00:35:00.000Z | 2012 | CCSFS SLC 40 | Falcon 9 | B0006 | 1 | No attempt | True | 800.0 |
| 4 | 10 | 2013-03-01T19:10:00.000Z | 2013 | CCSFS SLC 40 | Falcon 9 | B0007 | 1 | No attempt | True | 677.0 |
| 5 | 11 | 2013-09-29T16:00:00.000Z | 2013 | VAFB SLC 4E | Falcon 9 | B1003 | 1 | Failure | True | 500.0 |
| 6 | 12 | 2013-12-03T22:41:00.000Z | 2013 | CCSFS SLC 40 | Falcon 9 | B1004 | 1 | No attempt | True | 3170.0 |
| 7 | 13 | 2014-01-06T18:06:00.000Z | 2014 | CCSFS SLC 40 | Falcon 9 | B1005 | 1 | No attempt | True | 3325.0 |
| 8 | 14 | 2014-04-18T19:25:00.000Z | 2014 | CCSFS SLC 40 | Falcon 9 | B1006 | 1 | Success | True | 2296.0 |
| 9 | 15 | 2014-07-14T15:15:00.000Z | 2014 | CCSFS SLC 40 | Falcon 9 | B1007 | 1 | Success | True | 1316.0 |
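
Flights 6 and 7 show a total of `0.0`. pandas `sum` skips missing values and returns 0 for a group that contains nothing else, so such a total may mean "no mass reported" rather than a true zero (that reading of these two flights is an assumption and is not verified here). A minimal sketch on toy data shows how `min_count=1` keeps an all-missing group as `NaN`:

```python
import numpy as np
import pandas as pd

# Toy data: flight 1 has no reported mass at all, flight 2 has one missing value
toy = pd.DataFrame({
    "flight_number": [1, 2, 2],
    "payload_mass_kg": [np.nan, 500.0, np.nan],
})

default_sum = toy.groupby("flight_number")["payload_mass_kg"].sum()
strict_sum = toy.groupby("flight_number")["payload_mass_kg"].sum(min_count=1)

print(default_sum.loc[1])  # 0.0: the all-missing group collapses to zero
print(strict_sum.loc[1])   # nan: missing stays missing
print(strict_sum.loc[2])   # 500.0: partially missing groups still sum the known values
```

Whether `NaN` or `0.0` is preferable depends on the downstream plots and SQL queries. `NaN` avoids pulling averages toward zero.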


## Save artifacts

In [26]:

# Persist normalized CSVs
df_exploded.to_csv(DATA_DIR / "launches_exploded.csv", index=False)
df_summary.to_csv(DATA_DIR / "launches_summary.csv", index=False)

print("Saved:")
print(" - data/launches_exploded.csv")
print(" - data/launches_summary.csv")
print("Also raw JSONs:")
print(" - data/payloads.json\n - data/rockets.json\n - data/launchpads.json\n - data/launches_raw.json")


Saved:
 - data/launches_exploded.csv
 - data/launches_summary.csv
Also raw JSONs:
 - data/payloads.json
 - data/rockets.json
 - data/launchpads.json
 - data/launches_raw.json



## Data dictionary (columns you'll use later)

**`launches_exploded.csv`** (one row per payload per launch)
- `flight_number` — SpaceX-wide integer flight number; because only Falcon 9 launches are kept, values start at 6 and may skip numbers
- `date_utc`, `year` — UTC date/time and extracted year
- `launch_site` — human-readable site name (for scatter plots & Folium)
- `site_region`, `site_locality`, `site_lat`, `site_lon` — geo context for maps
- `rocket_name` — typically "Falcon 9"
- `core_serial` — booster core serial (e.g., B1060)
- `booster_block` — integer block if available (e.g., 5 for Block 5)
- `landing_outcome` — **Success/Failure (drone ship/ground pad)**, bare `Success`/`Failure` when `landing_type` is neither ASDS nor RTLS, `No attempt`, or `Unknown`
- `launch_success` — boolean for overall mission success
- `payload_id`, `payload_name`
- `payload_mass_kg` — numeric payload mass in kilograms
- `orbit` — e.g., LEO, GTO, SSO (used heavily in required plots/SQL)
- `customers`, `nationalities`

**`launches_summary.csv`** (one row per launch)
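- Aggregated `total_payload_mass_kg` per flight (useful for "Payload vs Launch Site"). Because pandas `sum` skips missing values, a launch whose payload masses are all missing shows `0.0` rather than a blank.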
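
When a later notebook reads these CSVs back, the column types are not restored automatically. The sketch below shows one way to recover them, using an in-memory CSV with invented rows and only a subset of the columns. The choice of the nullable `boolean` dtype for `launch_success` is an assumption about what downstream code wants.

```python
import io

import pandas as pd

# Invented rows with a subset of the columns listed above
csv_text = (
    "flight_number,date_utc,launch_site,payload_mass_kg,orbit,launch_success\n"
    "1,2010-06-04T18:45:00.000Z,Site A,,LEO,True\n"
    "2,2012-05-22T07:44:00.000Z,Site A,525.0,LEO,False\n"
    "3,2013-03-01T19:10:00.000Z,Site B,677.0,ISS,\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),   # in the project: DATA_DIR / "launches_exploded.csv"
    parse_dates=["date_utc"],
    dtype={"flight_number": "int64", "launch_success": "boolean"},
)

print(df.dtypes)
print(df["date_utc"].dt.year.tolist())  # [2010, 2012, 2013]
```

With the nullable dtype, a missing `launch_success` stays `<NA>` instead of turning the whole column into generic objects.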


## Quick validation checks

In [30]:

required_cols = [
    "flight_number","date_utc","year","launch_site","rocket_name",
    "payload_mass_kg","orbit","launch_success","landing_outcome"
]
missing = [c for c in required_cols if c not in df_exploded.columns]
if missing:
    print("WARNING: Missing expected columns:", missing)

print("Rows (exploded):", len(df_exploded))
print("Rows (summary):", len(df_summary))
print("Unique launch sites:", sorted([s for s in df_exploded["launch_site"].dropna().unique()]) if not df_exploded.empty else [])
print("Unique orbits:", sorted([o for o in df_exploded["orbit"].dropna().unique()]) if not df_exploded.empty else [])


Rows (exploded): 192
Rows (summary): 179
Unique launch sites: ['CCSFS SLC 40', 'KSC LC 39A', 'VAFB SLC 4E']
Unique orbits: ['ES-L1', 'GEO', 'GTO', 'HEO', 'ISS', 'LEO', 'MEO', 'PO', 'SO', 'SSO', 'TLI', 'VLEO']
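
The printed counts can also be turned into explicit checks. This is a minimal sketch, assuming the two tables share `flight_number` and the exploded table has `payload_mass_kg`; `check_tables` is a hypothetical helper and the demo uses toy frames.

```python
from typing import List

import pandas as pd


def check_tables(exploded: pd.DataFrame, summary: pd.DataFrame) -> List[str]:
    """Return human-readable problems; an empty list means every check passed."""
    problems = []
    if summary["flight_number"].duplicated().any():
        problems.append("summary has duplicate flight_number values")
    if set(summary["flight_number"]) != set(exploded["flight_number"]):
        problems.append("summary and exploded cover different flights")
    if (exploded["payload_mass_kg"].dropna() < 0).any():
        problems.append("negative payload mass found")
    return problems


toy_exploded = pd.DataFrame({"flight_number": [1, 1, 2],
                             "payload_mass_kg": [100.0, None, 50.0]})
toy_summary = pd.DataFrame({"flight_number": [1, 2]})

print(check_tables(toy_exploded, toy_summary))  # []
```

In the notebook the call would be `check_tables(df_exploded, df_summary)`.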



> **Note on `booster_version` vs `booster_block`:**  
The classic IBM capstone sometimes references booster versions like **F9 v1.1**, **FT**, or **Block 5**. The SpaceX v4 API exposes `cores.core.block` (e.g., 5 for Block 5).  
If you need the exact legacy **version labels** for older flights (v1.0 / v1.1 / FT), combine this API data with the **Wikipedia scrape** in your data-wrangling notebook (`02_data-wrangling`) to enrich the table with those human-readable labels.



## How this feeds the required outputs

- **Scatter plots** (Flight Number vs Launch Site, Payload vs Launch Site/Orbit) → use `launches_exploded.csv` and `launches_summary.csv`.
- **Bar/Line charts** (Success Rate vs Orbit, Yearly Average Success Rate) → compute rates from `launch_success` grouped by `orbit` / `year`.
- **SQL tasks** → ingest `launches_exploded.csv` into SQLite and run queries for unique sites, payload aggregations, landing outcomes, etc.
- **Folium** → `site_lat`, `site_lon`, and `launch_site` for marker layers; color by `launch_success` / `landing_outcome`.
- **Dash** → interactive filters over `orbit`, `payload_mass_kg`, `launch_site`, `landing_outcome`.
- **ML** → label = `launch_success`; features include `orbit`, `payload_mass_kg`, `booster_block`, site, etc.
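
Two of the items above, success rate by orbit and the SQLite ingest, can be sketched on toy data as follows. The rows and the table name `launches` are invented for illustration.

```python
import sqlite3

import pandas as pd

toy = pd.DataFrame({
    "orbit": ["LEO", "LEO", "GTO", "GTO"],
    "year": [2019, 2020, 2020, 2020],
    "launch_site": ["Site A", "Site A", "Site B", "Site B"],
    "launch_success": [True, False, True, True],
})

# The mean of a boolean column is the share of True values, i.e. the success rate
rate_by_orbit = toy.groupby("orbit")["launch_success"].mean()
print(rate_by_orbit)

# Load the same table into an in-memory SQLite database and query it
conn = sqlite3.connect(":memory:")
toy.to_sql("launches", conn, index=False, if_exists="replace")
sites = [row[0] for row in conn.execute(
    "SELECT DISTINCT launch_site FROM launches ORDER BY launch_site")]
conn.close()
print(sites)  # ['Site A', 'Site B']
```

Two caveats for the real data: `launches_exploded.csv` repeats a launch once per payload, so rates computed there weight multi-payload launches more heavily, and `launch_success` may contain missing values that need handling before taking a mean.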



## Re-run & reproducibility tips
- If SpaceX API schema changes, adjust the `populate` `select` fields accordingly.
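- If you need *all* launches (not only Falcon 9), remove the `rocket` filter in `query`. Note that `flatten_launches_to_exploded` reads only the first entry in `cores`, so Falcon Heavy side boosters would not appear in the tables.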
- Always persist both RAW and CLEAN artifacts so later notebooks don't depend on re-calling the API.
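
One way to act on the last tip is to read the raw JSON from disk when it exists and call the API only otherwise. The sketch below is an assumption about how that could look: `load_or_fetch` is a hypothetical helper, and the demo uses a stand-in fetcher and a temporary directory instead of the real endpoints.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable


def load_or_fetch(path: Path, fetch: Callable[[], Any], refresh: bool = False) -> Any:
    """Return cached JSON from `path` if present, else call `fetch` and cache the result."""
    if path.exists() and not refresh:
        return json.loads(path.read_text(encoding="utf-8"))
    obj = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(obj, indent=2, ensure_ascii=False), encoding="utf-8")
    return obj


# Stand-in fetcher that records how often it is called
calls = []


def fake_fetch():
    calls.append(1)
    return [{"name": "Falcon 9"}]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "rockets.json"
    first = load_or_fetch(cache, fake_fetch)
    second = load_or_fetch(cache, fake_fetch)  # served from disk, no second call

print(len(calls))  # 1
```

In the notebook, a call such as `load_or_fetch(DATA_DIR / "rockets.json", lambda: fetch_all("rockets"))` would reuse the saved file, and `refresh=True` would force a new download.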
