<a href="https://colab.research.google.com/github/l-87hjl/3i-atlas-public-data/blob/main/00_scrape_mpec_observations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 3I/ATLAS MPEC Observation Scraper

## Purpose
Scrapes astrometric observations for interstellar comet **3I/ATLAS** from the Minor Planet Center (MPC) database and creates two standardized CSV outputs for the public data repository.

---

## ‚ö†Ô∏è CRITICAL: J2000 COORDINATE SYSTEM

**All positional data (RA/Dec) from MPC is in the J2000.0 reference frame.**

- **Equinox**: J2000.0 (January 1, 2000, 12:00 TT)
- **Reference Frame**: International Celestial Reference Frame (ICRF)
- **Coordinate System**: Equatorial (Right Ascension, Declination)

**This is the same coordinate system used by JPL Horizons** when you query with the standard settings. When comparing MPEC observations to Horizons ephemeris, both datasets are already in J2000, so **no coordinate transformation is required**.

**Why this matters downstream:**
- Residual calculations (Observed - Computed) assume matching coordinate frames
- Any external catalogs or reference stars must also be in J2000
- Precession/nutation corrections are NOT needed for epoch matching
- Proper motion corrections may still be needed for field stars

---

## Data Source
- **URL**: https://minorplanetcenter.net/db_search/show_object?object_id=3I
- **Object**: 3I/ATLAS (interstellar comet, discovered 2025)
- **Date Range**: Customizable (default: earliest available ‚Üí +14 days)
- **Rate Limit**: Conservative scraping with delays to respect MPC servers

---

## Outputs

### 1. `observations_MPEC.csv` (Full Data)
Complete observation records with all available fields:
- `timestamp_utc`: UTC observation time (ISO 8601 format)
- `observatory_code`: MPC 3-character observatory code
- `ra_j2000_deg`: Right Ascension in decimal degrees (J2000.0)
- `dec_j2000_deg`: Declination in decimal degrees (J2000.0)
- `magnitude`: Apparent magnitude (if reported)
- `reference`: MPEC reference identifier

### 2. `observations_timestamp_observatory_only.csv` (Minimal Index)
Lightweight index file containing only:
- `timestamp_utc`: UTC observation time
- `observatory_code`: MPC observatory code

Used for quick timestamp/observatory lookups without loading full positional data.

---

## Usage
1. Run all cells in order
2. Enter desired date range when prompted (or accept defaults)
3. Review summary statistics
4. Download generated CSV files to your local machine
5. Upload to `3i-atlas-public-data/observations/` repository

---

## Notes
- **Maximum recommended range**: 14 days per run (to avoid server overload)
- **Total observations available**: 5000+ (as of December 2025)
- **Observatories**: 97+ contributing sites worldwide
- **Coordinate precision**: Typically 0.01" - 0.1" depending on observatory
- **No data filtering applied**: Raw observations preserved as reported by MPC


In [1]:
# Install required packages
!pip install -q requests beautifulsoup4 pandas lxml

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
from datetime import datetime, timedelta
import time
from typing import List, Dict, Tuple
import warnings
warnings.filterwarnings('ignore')

print("‚úÖ Libraries imported successfully")

‚úÖ Libraries imported successfully


## Configuration & Helper Functions

In [11]:
# MPC 3I/ATLAS permalink
MPC_URL = "https://minorplanetcenter.net/db_search/show_object?utf8=%E2%9C%93&object_id=3I"

# Headers for polite scraping
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Scientific Research; 3I/ATLAS Analysis) AppleWebKit/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
}

def parse_ra_to_degrees(ra_str: str) -> float:
    """
    Convert RA from HH:MM:SS.SS format to decimal degrees (J2000).

    Args:
        ra_str: Right Ascension in format "HH:MM:SS.SS" or "HH MM SS.SS"

    Returns:
        RA in decimal degrees (0-360)
    """
    ra_str = ra_str.strip().replace(':', ' ')
    parts = ra_str.split()

    hours = float(parts[0])
    minutes = float(parts[1]) if len(parts) > 1 else 0.0
    seconds = float(parts[2]) if len(parts) > 2 else 0.0

    # Convert to degrees: 1 hour = 15 degrees
    degrees = (hours + minutes/60.0 + seconds/3600.0) * 15.0
    return round(degrees, 8)

def parse_dec_to_degrees(dec_str: str) -> float:
    """
    Convert Dec from ¬±DD:MM:SS.S format to decimal degrees (J2000).

    Args:
        dec_str: Declination in format "¬±DD:MM:SS.S" or "¬±DD MM SS.S"

    Returns:
        Dec in decimal degrees (-90 to +90)
    """
    dec_str = dec_str.strip().replace(':', ' ')

    # Handle sign
    sign = -1.0 if dec_str.startswith('-') else 1.0
    dec_str = dec_str.lstrip('+-')

    parts = dec_str.split()
    degrees = float(parts[0])
    minutes = float(parts[1]) if len(parts) > 1 else 0.0
    seconds = float(parts[2]) if len(parts) > 2 else 0.0

    # Convert to decimal degrees
    decimal = degrees + minutes/60.0 + seconds/3600.0
    return round(sign * decimal, 8)

def parse_observation_date(date_str: str) -> datetime:
    """
    Parse MPC observation date to datetime object.

    Args:
        date_str: Date in format "YYYY MM DD.ddddd" (UTC)

    Returns:
        datetime object in UTC
    """
    parts = date_str.strip().split()
    year = int(parts[0])
    month = int(parts[1])
    day_decimal = float(parts[2])

    day = int(day_decimal)
    fraction = day_decimal - day

    dt = datetime(year, month, day)
    dt += timedelta(days=fraction)

    return dt

print("‚úÖ Helper functions defined")

‚úÖ Helper functions defined


## Fetch Object Summary from MPC

In [4]:
def get_object_summary():
    """
    Fetch 3I/ATLAS object summary and total observation count from MPC.
    """
    print(f"üîç Fetching object summary from MPC...")
    print(f"    URL: {MPC_URL}")

    response = requests.get(MPC_URL, headers=HEADERS, timeout=30)
    response.raise_for_status()

    soup = BeautifulSoup(response.content, 'lxml')

    # Extract object name
    title = soup.find('h2')
    object_name = title.text.strip() if title else "3I/ATLAS"

    # Try to find observation count
    obs_count = "Unknown"
    for text in soup.stripped_strings:
        if 'observation' in text.lower():
            match = re.search(r'(\d+)\s+observation', text, re.IGNORECASE)
            if match:
                obs_count = match.group(1)
                break

    print(f"\n{'='*70}")
    print(f"OBJECT CONFIRMATION")
    print(f"{'='*70}")
    print(f"Name:               {object_name}")
    print(f"Total Observations: {obs_count}")
    print(f"Coordinate System:  J2000.0 (ICRF equatorial)")
    print(f"{'='*70}\n")

    return object_name, obs_count

# Execute
obj_name, total_obs = get_object_summary()

üîç Fetching object summary from MPC...
    URL: https://minorplanetcenter.net/db_search/show_object?utf8=%E2%9C%93&object_id=3I

OBJECT CONFIRMATION
Name:               Orbit
Total Observations: Unknown
Coordinate System:  J2000.0 (ICRF equatorial)



## User Input: Date Range Selection

In [12]:
# Default values
DEFAULT_START = "2025-07-15"  # Approximate discovery date range
DEFAULT_DAYS = 14

print("\nüìÖ DATE RANGE SELECTION")
print("=" * 70)
print("Enter the date range for observations to scrape.")
print(f"Default start: {DEFAULT_START}")
print(f"Recommended max span: {DEFAULT_DAYS} days per run\n")

# Get user input
start_date_str = input(f"Start date (YYYY-MM-DD) [{DEFAULT_START}]: ").strip()
if not start_date_str:
    start_date_str = DEFAULT_START

days_str = input(f"Number of days to scrape [{DEFAULT_DAYS}]: ").strip()
if not days_str:
    num_days = DEFAULT_DAYS
else:
    num_days = int(days_str)

# Parse dates
start_date = datetime.strptime(start_date_str, "%Y-%m-%d")
end_date = start_date + timedelta(days=num_days)

print(f"\n‚úÖ Selected Range:")
print(f"   Start: {start_date.strftime('%Y-%m-%d')}")
print(f"   End:   {end_date.strftime('%Y-%m-%d')}")
print(f"   Span:  {num_days} days")
print("=" * 70)


üìÖ DATE RANGE SELECTION
Enter the date range for observations to scrape.
Default start: 2025-07-15
Recommended max span: 14 days per run

Start date (YYYY-MM-DD) [2025-07-15]: 2025-12-18
Number of days to scrape [14]: 14

‚úÖ Selected Range:
   Start: 2025-12-18
   End:   2026-01-01
   Span:  14 days


In [13]:
# Calculate timing estimate
estimate_min = num_days * 0.5
estimate_max = num_days * 1.0

print(f"\n‚è±Ô∏è  TIMING ESTIMATE")
print(f"{'='*70}")
print(f"For {num_days} days of observations:")
print(f"  Expected scraping time: {estimate_min:.1f} - {estimate_max:.1f} minutes")
print(f"  Approximate rate: ~30-60 seconds per day")
print(f"")
print(f"‚ö†Ô∏è  The scraping cell may appear frozen for several minutes.")
print(f"    This is NORMAL! Progress messages will appear as parsing completes.")
print(f"    Larger date ranges = longer wait times.")
print(f"{'='*70}\n")


‚è±Ô∏è  TIMING ESTIMATE
For 14 days of observations:
  Expected scraping time: 7.0 - 14.0 minutes
  Approximate rate: ~30-60 seconds per day

‚ö†Ô∏è  The scraping cell may appear frozen for several minutes.
    This is NORMAL! Progress messages will appear as parsing completes.
    Larger date ranges = longer wait times.



## Scrape Observations from MPC Table

**Note**: This scrapes the HTML observation table. Each row contains:
- Date (UT) in YYYY MM DD.ddddd format
- J2000 RA in HH:MM:SS.SS format
- J2000 Dec in ¬±DD:MM:SS.S format
- Magnitude (optional)
- Observatory code (3-character MPC designation)
- Reference (MPEC identifier)

---

### ‚è±Ô∏è Expected Duration
- **Initial page load**: 2-5 seconds (polite delay to respect MPC servers)
- **Parsing time**: Approximately **20-40 seconds per day** of observations
- **For 14-day range**: Expect **5-10 minutes total**
- **Actual time varies** based on:
  - Network speed
  - MPC server response time
  - Number of observations in date range
  - HTML table complexity

**Don't panic if it seems frozen!** The MPC page can be large (5000+ observations). Progress messages will appear as parsing completes.

In [15]:
def scrape_observations(start_dt: datetime, end_dt: datetime) -> List[Dict]:
    """
    Scrape MPC observations for 3I/ATLAS within specified date range.

    Returns:
        List of observation dictionaries with J2000 coordinates
    """
    num_days = (end_dt - start_dt).days
    estimated_minutes = num_days * 0.5  # Rough estimate: ~30 sec/day = 0.5 min/day

    print(f"\\nüåê Scraping observations from MPC...")
    print(f"‚è±Ô∏è  Estimated time: {estimated_minutes:.1f}-{estimated_minutes*2:.1f} minutes for {num_days} days")
    print(f"    (Network speed and server response may vary)")
    print(f"    This is normal - the page is large! Please wait...\\n")

    time.sleep(2)  # Polite delay
    print(f"    ‚¨áÔ∏è  Fetching MPC page...")

    response = requests.get(MPC_URL, headers=HEADERS, timeout=30)
    response.raise_for_status()
    print(f"    ‚úÖ Page downloaded ({len(response.content)//1024} KB)")

    print(f"    üîç Parsing HTML table...")
    soup = BeautifulSoup(response.content, 'lxml')

    # Find observation table
    tables = soup.find_all('table')
    obs_table = None

    for table in tables:
        # Look for table with observation headers
        headers = table.find_all('th')
        if any('Date' in th.text and 'RA' in str(table) for th in headers):
            obs_table = table
            break

    if not obs_table:
        raise ValueError("Could not find observation table on MPC page")

    print(f"    üìä Processing observations...")
    observations = []
    rows = obs_table.find_all('tr')[1:]  # Skip header row
    total_rows = len(rows)
    print(f"    Found {total_rows} total observation rows in table")

    processed = 0
    for row in rows:
        cols = row.find_all('td')
        if len(cols) < 5:
            continue

        try:
            # Parse observation date
            date_str = cols[0].text.strip()
            obs_dt = parse_observation_date(date_str)

            # Filter by date range
            if not (start_dt <= obs_dt < end_dt):
                continue

            # Extract fields
            ra_str = cols[1].text.strip()
            dec_str = cols[2].text.strip()
            mag_str = cols[3].text.strip()
            location = cols[4].text.strip()  # Observatory code
            reference = cols[5].text.strip() if len(cols) > 5 else ""

            # Convert coordinates to decimal degrees (J2000)
            ra_deg = parse_ra_to_degrees(ra_str)
            dec_deg = parse_dec_to_degrees(dec_str)

            # Parse magnitude (may be empty)
            magnitude = None
            if mag_str and mag_str != '‚Äî':
                try:
                    magnitude = float(mag_str)
                except ValueError:
                    pass

            obs = {
                'timestamp_utc': obs_dt.isoformat(),
                'observatory_code': location,
                'ra_j2000_deg': ra_deg,
                'dec_j2000_deg': dec_deg,
                'magnitude': magnitude,
                'reference': reference
            }

            observations.append(obs)
            processed += 1

            # Progress indicator every 100 observations
            if processed % 100 == 0:
                print(f"       ... {processed} observations matched so far")

        except Exception as e:
            print(f"‚ö†Ô∏è  Warning: Failed to parse row: {e}")
            continue

    print(f"\n‚úÖ Scraped {len(observations)} observations in date range")
    return observations

# Execute scraping
observations = scrape_observations(start_date, end_date)

\nüåê Scraping observations from MPC...
‚è±Ô∏è  Estimated time: 7.0-14.0 minutes for 14 days
    (Network speed and server response may vary)
    This is normal - the page is large! Please wait...\n
    ‚¨áÔ∏è  Fetching MPC page...
    ‚úÖ Page downloaded (1636 KB)
    üîç Parsing HTML table...
    üìä Processing observations...
    Found 5816 total observation rows in table
       ... 100 observations matched so far

‚úÖ Scraped 157 observations in date range


## Create Output DataFrames

In [16]:
if len(observations) == 0:
    print("\n‚ö†Ô∏è  No observations found in specified date range!")
    print("   Try adjusting the start date or expanding the date range.")
else:
    # Full observation dataset
    df_full = pd.DataFrame(observations)

    # Minimal timestamp/observatory index
    df_index = df_full[['timestamp_utc', 'observatory_code']].copy()

    # Sort by timestamp
    df_full = df_full.sort_values('timestamp_utc').reset_index(drop=True)
    df_index = df_index.sort_values('timestamp_utc').reset_index(drop=True)

    print(f"\nüìä DATASET SUMMARY")
    print(f"{'='*70}")
    print(f"Total observations:  {len(df_full)}")
    print(f"Date range:          {df_full['timestamp_utc'].min()} to {df_full['timestamp_utc'].max()}")
    print(f"Observatories:       {df_full['observatory_code'].nunique()} unique sites")
    print(f"Coordinate system:   J2000.0 equatorial (RA/Dec)")
    print(f"{'='*70}")

    # Show sample
    print("\nüìã Sample observations (first 5):")
    print(df_full.head())

    print("\nüìã Observatory distribution:")
    print(df_full['observatory_code'].value_counts().head(10))


üìä DATASET SUMMARY
Total observations:  157
Date range:          2025-12-18T00:35:55.507200 to 2025-12-23T03:25:15.888000
Observatories:       33 unique sites
Coordinate system:   J2000.0 equatorial (RA/Dec)

üìã Sample observations (first 5):
                timestamp_utc                       observatory_code  \
0  2025-12-18T00:35:55.507200  M09 ‚Äì Observatory Gromme - Oudsbergen   
1  2025-12-18T00:42:00.028800                            C23 ‚Äì Olmen   
2  2025-12-18T00:45:57.628800  M09 ‚Äì Observatory Gromme - Oudsbergen   
3  2025-12-18T00:55:37.718400  M09 ‚Äì Observatory Gromme - Oudsbergen   
4  2025-12-18T01:03:34.041600                            C23 ‚Äì Olmen   

   ra_j2000_deg  dec_j2000_deg magnitude  reference  
0    162.942558       6.616783      None   MPEC Y51  
1    162.937621       6.618531      None  MPEC Y151  
2    162.934492       6.619500      None   MPEC Y51  
3    162.926683       6.622164      None   MPEC Y51  
4    162.920271       6.624431      Non

## Save Output Files

In [17]:
if len(observations) > 0:
    # Generate filenames with date range
    date_suffix = f"{start_date.strftime('%Y%m%d')}_{end_date.strftime('%Y%m%d')}"

    filename_full = f"observations_MPEC_{date_suffix}.csv"
    filename_index = f"observations_timestamp_observatory_only_{date_suffix}.csv"

    # Save CSVs
    df_full.to_csv(filename_full, index=False, float_format='%.8f')
    df_index.to_csv(filename_index, index=False)

    print(f"\nüíæ FILES SAVED")
    print(f"{'='*70}")
    print(f"Full dataset:  {filename_full}")
    print(f"               ({len(df_full)} rows, {df_full.shape[1]} columns)")
    print(f"")
    print(f"Index file:    {filename_index}")
    print(f"               ({len(df_index)} rows, {df_index.shape[1]} columns)")
    print(f"{'='*70}")

    print("\n‚úÖ COMPLETE! Download the files from the Colab file browser (left sidebar).")
    print("   Upload to: 3i-atlas-public-data/observations/")
else:
    print("\n‚ùå No files created (no observations in range)")


üíæ FILES SAVED
Full dataset:  observations_MPEC_20251218_20260101.csv
               (157 rows, 6 columns)

Index file:    observations_timestamp_observatory_only_20251218_20260101.csv
               (157 rows, 2 columns)

‚úÖ COMPLETE! Download the files from the Colab file browser (left sidebar).
   Upload to: 3i-atlas-public-data/observations/


---

## üìö Reference: Column Definitions

### Full Dataset (`observations_MPEC.csv`)

| Column | Type | Description | Units |
|--------|------|-------------|-------|
| `timestamp_utc` | string | UTC observation time | ISO 8601 format |
| `observatory_code` | string | MPC 3-character observatory designation | ‚Äî |
| `ra_j2000_deg` | float | Right Ascension in J2000.0 frame | decimal degrees (0-360) |
| `dec_j2000_deg` | float | Declination in J2000.0 frame | decimal degrees (-90 to +90) |
| `magnitude` | float | Apparent visual magnitude | mag (nullable) |
| `reference` | string | MPEC reference identifier | ‚Äî |

### Index File (`observations_timestamp_observatory_only.csv`)

| Column | Type | Description |
|--------|------|-------------|
| `timestamp_utc` | string | UTC observation time |
| `observatory_code` | string | MPC observatory code |

---

## üîó Related Documentation

- **MPC Observatory Codes**: https://minorplanetcenter.net/iau/lists/ObsCodesF.html
- **MPEC Format Guide**: https://minorplanetcenter.net/iau/info/MPCFormat.html
- **J2000 Reference Frame**: https://aa.usno.navy.mil/faq/ICRS_doc
- **3I/ATLAS Discovery**: MPEC 2025-N12

---

**Repository**: https://github.com/l-87hjl/3i-atlas-public-data  
**License**: CC0 1.0 (Public Domain)
