Normalize data schemas, enrich datasets, and fix API handler quality #65

Open
AdityaAsopa wants to merge 1 commit into isro:master from AdityaAsopa:feat/normalize-data-schemas

@AdityaAsopa
Summary

The ISRO API serves spacecraft, launcher, and mission data — but the underlying JSON has grown organically from multiple scrapers and manual edits, resulting in data that is difficult to consume programmatically. This PR brings schema consistency, data enrichment, and code quality fixes across all five endpoints without changing any API routes or response structures.

What was wrong (examples)

  • spacecraft_missions.json had 9+ field name variants for mass alone (weight, lift-off_mass, liftoff_mass, lift_off_mass, spacecraft_mass, mass, liftoffmass, liftoff mass, …) and 5+ variants for power (power_of_solar_array, solar_array_power, electrical_power, power, …)
  • 15+ date formats coexisted: "April 19, 1975", "02-Oct-2008", "22.10.2008", "2015", "Sept 23, 2009", "July 15,2023", and more
  • Values appeared in wrong fields: one entry had mission: "7 Years" (should be mission_life), another had lift-off_mass: "IRS-P3" (a spacecraft name)
  • spacecrafts.json had only id + name for 113 records — no dates, orbits, or status
  • customer_satellites.json mixed country casing and naming ("GERMANY" vs "Germany", "UK" vs "UNITED KINGDOM")
  • API handlers imported fs (unused), used misleading variable names (launchers for customer satellite data), and returned no Content-Type header

What this PR does

1. Data normalization pipeline (scripts/normalize_data.py)

  • Parses all date variants into ISO 8601 (YYYY-MM-DD)
  • Resolves all mass field variants into numeric mass_kg
  • Extracts wattage from complex power strings (e.g., "15 Sq.m Solar Array generating 1360W" → 1360)
  • Classifies orbit types: LEO, SSO, GEO, Lunar, Interplanetary, Failed
  • Infers mission status (active / decommissioned / failed) from launch date + mission life
  • Normalizes country names to consistent title case
  • Merges fresh scraper output with existing data (idempotent — safe to re-run)
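
The parsing steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/normalize_data.py: the function names, the format list, and the rule that a bare year such as "2015" stays null are all assumptions made for the example. Status inference takes `today` as a parameter so the result is reproducible.

```python
import re
from datetime import date, datetime, timedelta

# Assumed subset of the date layouts quoted in this PR; the real script may cover more.
DATE_FORMATS = (
    "%B %d, %Y",  # April 19, 1975
    "%b %d, %Y",  # Sep 23, 2009
    "%d-%b-%Y",   # 02-Oct-2008
    "%d.%m.%Y",   # 22.10.2008
    "%d %B %Y",   # 22 October 2008
    "%d-%m-%Y",   # 26-05-1999
)

def parse_date(raw):
    """Return an ISO 8601 date string, or None when no known layout matches."""
    if not raw:
        return None
    text = re.sub(r"\bSept\b\.?", "Sep", raw.strip())  # "Sept" is not a valid %b value
    text = re.sub(r",\s*", ", ", text)                 # "July 15,2023" -> "July 15, 2023"
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # assumption: a bare year has no month or day, so it is not guessed

NUMBER = r"(\d+(?:,\d{3})*(?:\.\d+)?)"

def extract_mass_kg(raw):
    """Pull a kilogram value out of free text; None if no '<number> kg' is present."""
    match = re.search(NUMBER + r"\s*kg\b", str(raw), flags=re.IGNORECASE)
    if not match:
        return None
    return float(match.group(1).replace(",", ""))

def extract_power_watts(raw):
    """Pull a wattage out of free text, converting kW to W."""
    match = re.search(NUMBER + r"\s*(k?)W(?:atts?)?\b", str(raw), flags=re.IGNORECASE)
    if not match:
        return None
    watts = float(match.group(1).replace(",", ""))
    return watts * 1000 if match.group(2) else watts

def infer_status(launch_iso, mission_life_years, today, launch_failed=False):
    """Heuristic status from launch date plus design life; None if inputs are missing."""
    if launch_failed:
        return "failed"
    if launch_iso is None or mission_life_years is None:
        return None
    end_of_life = date.fromisoformat(launch_iso) + timedelta(days=365.25 * mission_life_years)
    return "active" if today <= end_of_life else "decommissioned"
```

In this sketch, a value that landed in the wrong field (such as "IRS-P3" in a mass field) produces None rather than a guessed number.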

2. Data files normalized

| Dataset | Records | Key improvements |
|---|---|---|
| spacecraft_missions.json | 64 | Consistent 17-field schema; 59 ISO dates, 61 numeric masses, 56 numeric power values, 51 orbit classifications |
| spacecrafts.json | 113 | Enriched from missions — 73 now have launch_date, vehicle, mission_type, orbit_type, status |
| launchers.json | 81 | Classified by vehicle_family (SLV, ASLV, PSLV, GSLV, LVM-3, RLV, …) |
| customer_satellites.json | 75 | ISO dates, numeric mass_kg, normalized country names (22 countries) |
| centres.json | 44 | Consistent lowercase field names |
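
As a rough illustration of the vehicle_family classification, a prefix match on the launcher id is one plausible approach. The id shapes used here ("PSLV-C1", "LVM3 M2") and the function name are assumptions for the example, and only the six families named in the table are covered.

```python
import re

# Families named in this PR; the remaining ones are elided there and omitted here.
FAMILIES = ("ASLV", "PSLV", "GSLV", "SLV", "LVM-3", "RLV")

def classify_vehicle_family(launcher_id):
    """Return the family whose name prefixes the id, or None if nothing matches."""
    key = re.sub(r"[\s_]+", "", str(launcher_id).upper())
    if key.startswith("LVM3"):
        key = "LVM-3" + key[4:]  # treat "LVM3" and "LVM-3" as the same spelling
    for family in FAMILIES:
        if key.startswith(family):
            return family
    return None  # unknown ids stay unclassified instead of being guessed
```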

3. API handler fixes

  • Removed unused fs imports from all handlers
  • Fixed misleading variable names (launchers → customerSatellites, spacecraftMissions)
  • Added Content-Type: application/json header to all responses
  • Sanitized error responses (no more leaked internal objects)
  • Root endpoint (/api) now returns a JSON endpoint directory instead of an HTML string

4. Landing page & documentation

  • Added missing /api/spacecraft_missions endpoint to index.html
  • Removed hotlinked external images
  • Comprehensive README with endpoint table, full schema documentation, and data pipeline instructions

What is preserved

  • All API routes unchanged — no breaking changes for existing consumers
  • Response wrapper keys unchanged (spacecrafts, launchers, etc.)
  • All existing records preserved — normalization only adds/fixes fields, never drops entries
  • Original data is enriched, not replaced — the pipeline overlays structured fields onto existing records
  • Latest data incorporated — fresh scraper output from isro.gov.in merged with existing dataset
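
The overlay behaviour described above can be sketched as below. It is a simplified model under stated assumptions: records are matched on a "name" key, null values never overwrite existing data, and the helper names are hypothetical rather than taken from the script.

```python
def overlay(existing, structured):
    """Copy `existing` and add non-null structured fields on top of it."""
    merged = dict(existing)
    for key, value in structured.items():
        if value is not None:  # a failed parse must not erase the original value
            merged[key] = value
    return merged

def merge_datasets(existing_records, fresh_records, key="name"):
    """Keep every existing record, overlay fresh data, append records seen only in the scrape."""
    fresh_by_key = {record[key]: record for record in fresh_records}
    merged = [overlay(record, fresh_by_key.pop(record[key], {})) for record in existing_records]
    merged.extend(fresh_by_key.values())
    return merged
```

Because existing records are never dropped and the overlay is deterministic, feeding the merged output back in with the same scrape returns the same list.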

Data verification

The normalization script was tested for idempotency (running it twice produces identical output) and spot-checked against isro.gov.in source pages. Fields that could not be confidently parsed are set to null rather than guessed.
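
One way to express the idempotency check in code is to apply a normalizer twice and compare the serialized results. The sketch below uses a toy normalizer, since the real entry point of scripts/normalize_data.py is not shown in this PR; `is_idempotent` and `normalize_records` are hypothetical names.

```python
import json

def is_idempotent(normalize, records):
    """True if normalizing already-normalized records changes nothing."""
    once = normalize(records)
    twice = normalize(once)
    # Compare canonical JSON so that key order does not affect the result
    return json.dumps(once, sort_keys=True) == json.dumps(twice, sort_keys=True)

def normalize_records(records):
    """Toy stand-in for the pipeline: trims and title-cases the country field."""
    return [{**record, "country": record["country"].strip().title()} for record in records]
```

A step that keeps changing its own output, such as one that appends a unit suffix on every run, would fail this check.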

Test plan

  • Verify all five endpoints return valid JSON with correct Content-Type header
  • Spot-check normalized dates against the isro.gov.in spacecraft page
  • Confirm python scripts/normalize_data.py is idempotent (run twice, diff shows no changes)
  • Verify no existing API consumers break (same routes, same wrapper keys)

The spacecraft_missions data had deeply inconsistent schemas — mass appeared
as 'weight', 'lift-off_mass', 'spacecraft_mass', 'mass_at_lift-off' and
5 other variants; dates ranged from 'April 19, 1975' to '22 October 2008'
to '26-05-1999' across 15+ formats; KALPANA-1 had mission_life stored in
the 'mission' field as '7 Years'; and TES appeared as a duplicate entry.
spacecrafts.json had only id+name for 113 records. launchers.json had
only id for 81 records. customer_satellites.json mixed 'GERMANY' with
'Germany' and 'UK' with 'UNITED KINGDOM'.

This commit introduces scripts/normalize_data.py — an idempotent pipeline
that parses all date formats to ISO 8601, extracts numeric mass_kg and
power_watts from free-text fields (handling edge cases like '15 Sq.m Solar
Array generating 1360W'), classifies orbits (LEO/SSO/GEO/Lunar/Failed),
infers mission status from launch date + mission life, and normalizes
country names. The scraper was re-run against isro.gov.in and the fresh
data is merged with existing records — no data is lost, only enriched.

All 5 data files now have consistent, documented schemas. spacecrafts are
enriched with launch date, vehicle, orbit type, and status from missions.
Launchers are classified into 8 vehicle families. All API endpoints remain
backward-compatible — same URLs, same structure, just cleaner data.

API handlers: removed unused 'fs' imports, fixed misleading variable names
(customer_satellites.js loaded data into a var called 'launchers'), added
Content-Type: application/json headers, and sanitized error responses.
Root endpoint now returns a JSON directory of all available endpoints.
AdityaAsopa added a commit to AdityaAsopa/isro_api that referenced this pull request Mar 12, 2026
- CHANGELOG.md: full project history (v1.0.0 → v1.1.0) documenting all
  7 PRs (isro#65–isro#71) in Keep a Changelog format; Unreleased section for today's work
- index.html: complete rewrite — space-themed mission control dashboard;
  live Chart.js visualisations (orbit distribution, mission status, vehicle
  families, top countries by satellite count); animated counters fed from
  /api/stats; responsive star-field background; endpoint quick-reference cards
- style.css: full rewrite with CSS custom properties; dark space palette
  (#080818 bg, #06b8ee accent); responsive grid at 900 px and 600 px breakpoints
- api/timeline.js: GET /api/timeline — aggregates launch dates from
  spacecraft_missions, spacecrafts, and customer_satellites into a unified
  chronological event stream; supports ?date=MM-DD, ?month=YYYY-MM,
  ?year=YYYY, ?range=YYYY,YYYY query params
- isro_api_plan.md: big-picture vision document (10 major platform moves)
- social_posts.md: LinkedIn posts and X thread for all 7 PRs