# Load JSONL into MongoDB (PyMongo): Flights Project

This notebook imports the cleaned JSONL files into MongoDB with **proper BSON Date types** and **safe upsert behavior**.

**What it does:**
- Parameterize paths and Mongo connection
- Read `flights_clean_10d.jsonl` and `aircraft_10d.jsonl`
- Parse ISO timestamps to `datetime`
- **Upsert** flights by `(icao24, departure_time)`
- **Upsert** aircraft by `icao24` using **sparse updates** (won't overwrite non-null DB values with nulls)
- Sanity checks: counts and sample docs

> Run this in VS Code with the kernel started from your repo root and your `adt` virtual environment selected.


In [20]:
# 🔧 Parameters: adjust as needed

# Where your JSONL files live (relative paths resolve against the kernel's working directory)
FLIGHTS_FILE = "../data_enriched/flights_clean_10d.jsonl"
AIRCRAFT_FILE = "../data_enriched/aircraft_10d.jsonl"

# MongoDB connection
MONGO_URI = "mongodb://localhost:27017"
DB_NAME = "openair"

# Safe upsert behavior for aircraft:
# - If True: only $set non-null fields (prevents wiping existing values with None)
# - If False: $set everything as-is (may overwrite values with None)
AIRCRAFT_SPARSE_UPSERT = True

FLIGHTS_FILE, AIRCRAFT_FILE, MONGO_URI, DB_NAME, AIRCRAFT_SPARSE_UPSERT

('../data_enriched/flights_clean_10d.jsonl',
 '../data_enriched/aircraft_10d.jsonl',
 'mongodb://localhost:27017',
 'openair',
 True)
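
Because the two paths are relative, they resolve against whatever directory the kernel was started in, so a quick existence check before loading can save a confusing `FileNotFoundError` later. The sketch below is a minimal illustration; `missing_files` is a hypothetical helper, not part of the original notebook.

```python
import os

def missing_files(paths):
    """Return the subset of `paths` that do not point to an existing file."""
    return [p for p in paths if not os.path.isfile(p)]

# Same relative paths as in the parameters cell above
candidates = [
    "../data_enriched/flights_clean_10d.jsonl",
    "../data_enriched/aircraft_10d.jsonl",
]
print("Working directory:", os.getcwd())
print("Missing files:", missing_files(candidates))
```

An empty list means both inputs were found from the current working directory.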

In [21]:
# 📦 Imports (install in your venv if missing: pip install pymongo)
from pymongo import MongoClient, UpdateOne
from datetime import datetime
import json, os, itertools

print("PyMongo ready")

PyMongo ready


## Helpers

In [22]:
def iso_to_dt(s: str):
    """Parse an ISO-8601 string to datetime (handles a trailing 'Z')."""
    if s is None:
        return None
    if s.endswith('Z'):
        # Swap only the trailing 'Z' for an explicit UTC offset
        s = s[:-1] + '+00:00'
    return datetime.fromisoformat(s)

def non_null_set(doc: dict, allowed_keys):
    """Return a dict with only non-null values for the allowed keys."""
    payload = {}
    for k in allowed_keys:
        v = doc.get(k, None)
        if v is not None and str(v) not in ("NaT", "None"):
            payload[k] = v
    return payload
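
To illustrate how the two helpers combine into a sparse `$set` payload, the sketch below repeats their definitions so it runs on its own and applies them to a made-up aircraft record. The record values are hypothetical and not taken from the dataset.

```python
from datetime import datetime, timezone

def iso_to_dt(s):
    """Parse an ISO-8601 string to datetime (handles a trailing 'Z')."""
    if s is None:
        return None
    if s.endswith("Z"):
        s = s[:-1] + "+00:00"
    return datetime.fromisoformat(s)

def non_null_set(doc, allowed_keys):
    """Return a dict with only non-null values for the allowed keys."""
    payload = {}
    for k in allowed_keys:
        v = doc.get(k)
        if v is not None and str(v) not in ("NaT", "None"):
            payload[k] = v
    return payload

# Hypothetical input record: one real value, one null, one "NaT" placeholder
raw = {
    "icao24": "abc123",
    "registration": "OO-XYZ",
    "model": None,
    "typecode": "NaT",
    "last_seen": "2022-09-09T20:46:37Z",
}
merged = {**raw, "last_seen": iso_to_dt(raw["last_seen"])}
payload = non_null_set(merged, ["registration", "model", "typecode", "last_seen"])
print(payload)  # only registration and last_seen survive
```

Only `registration` and `last_seen` end up in the payload, so an existing `model` or `typecode` in the database would be left untouched. Note that the parsed timestamp is timezone-aware; PyMongo stores it as UTC and, with default client settings, returns it as a naive `datetime`.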

## Connect to MongoDB

In [23]:
client = MongoClient(MONGO_URI)
db = client[DB_NAME]
db.name

'openair'
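
Each upsert below looks up a document by its key, so without an index on that key every operation may scan the whole collection. A minimal sketch of creating matching indexes first follows. `ensure_upsert_indexes` is a hypothetical helper, and `unique=True` is an assumption that the upsert keys are meant to be unique; index creation fails if duplicates already exist in the collection.

```python
def ensure_upsert_indexes(db):
    """Create indexes that match the upsert keys (hypothetical helper).

    `db` is expected to be a PyMongo Database; 1 means ascending order.
    """
    # Compound index on the flights upsert key (icao24, departure_time)
    flights_idx = db.flights.create_index(
        [("icao24", 1), ("departure_time", 1)], unique=True
    )
    # Single-field index on the aircraft upsert key
    aircraft_idx = db.aircraft.create_index([("icao24", 1)], unique=True)
    return flights_idx, aircraft_idx
```

Calling `ensure_upsert_indexes(db)` once before the bulk writes should be enough; `create_index` does nothing if an identical index already exists.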

## Load & Upsert: Flights

In [24]:
def chunked(seq, size):
    for i in range(0, len(seq), size):
        yield seq[i:i+size]
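
`chunked` slices a sequence, so every operation has to be built and held in memory before the first batch is written. If memory becomes a concern, a generator-based variant can batch any iterable lazily. This is a sketch only; `iter_batches` is a hypothetical name, not something the notebook defines.

```python
from itertools import islice

def iter_batches(iterable, size):
    """Yield lists of up to `size` items from any iterable, lazily."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

print(list(iter_batches(range(5), 2)))  # [[0, 1], [2, 3], [4]]
```

With this variant, the operations could be produced by a generator that reads the JSONL file line by line, and each batch passed to `bulk_write` as soon as it is full.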


In [25]:
# Load & Upsert: Flights (BATCHED)
from pymongo import UpdateOne
import json

BATCH = 1000  # 1000-2000 is usually safe; reduce to 500 if you still see message size errors

ops = []
n_read = 0

with open(FLIGHTS_FILE, "r", encoding="utf-8") as f:
    for line in f:
        n_read += 1
        doc = json.loads(line)
        # Cast ISO strings to datetime
        doc["departure_time"] = iso_to_dt(doc["departure_time"])
        doc["arrival_time"]   = iso_to_dt(doc["arrival_time"])
        # Upsert key
        key = {"icao24": doc["icao24"], "departure_time": doc["departure_time"]}
        ops.append(UpdateOne(key, {"$set": doc}, upsert=True))

print(f"Prepared {len(ops)} upserts for flights (read {n_read} lines).")

upserted = modified = 0
for i, chunk in enumerate(chunked(ops, BATCH), start=1):
    res = db.flights.bulk_write(chunk, ordered=False)
    upserted += res.upserted_count
    modified += res.modified_count
    print(f"[Flights] Batch {i}: upserted={res.upserted_count}, modified={res.modified_count}")

print("Flights totals -> upserted:", upserted, "| modified:", modified)


Prepared 610535 upserts for flights (read 610535 lines).
[Flights] Batch 1: upserted=453, modified=3
[Flights] Batch 2: upserted=368, modified=0
[Flights] Batch 3: upserted=370, modified=1
[Flights] Batch 4: upserted=352, modified=1
[Flights] Batch 5: upserted=398, modified=1
[Flights] Batch 6: upserted=424, modified=7
[Flights] Batch 7: upserted=391, modified=0
[Flights] Batch 8: upserted=363, modified=0
[Flights] Batch 9: upserted=341, modified=5
[Flights] Batch 10: upserted=396, modified=0
[Flights] Batch 11: upserted=432, modified=11
[Flights] Batch 12: upserted=313, modified=0
[Flights] Batch 13: upserted=474, modified=0
[Flights] Batch 14: upserted=422, modified=11
[Flights] Batch 15: upserted=365, modified=2
[Flights] Batch 16: upserted=414, modified=6
[Flights] Batch 17: upserted=340, modified=1
[Flights] Batch 18: upserted=369, modified=2
[Flights] Batch 19: upserted=465, modified=4
[Flights] Batch 20: upserted=405, modified=5
[Flights] Batch 21: upserted=416, modified=0
[Flig
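
The per-batch numbers above do not add up to the batch size because an operation that matches an existing document with identical field values counts as neither upserted nor modified. PyMongo's `BulkWriteResult` also exposes `matched_count`, which makes that third group visible. The sketch below uses a stand-in object instead of a real result so it runs without a database; `tally` is a hypothetical helper, and the `matched_count` value is assumed (1000 operations minus 453 upserts).

```python
from types import SimpleNamespace

def tally(res, totals=None):
    """Accumulate upserted / modified / unchanged counts from a bulk result."""
    if totals is None:
        totals = {"upserted": 0, "modified": 0, "unchanged": 0}
    totals["upserted"] += res.upserted_count
    totals["modified"] += res.modified_count
    # Matched but not modified: the document already held the same values
    totals["unchanged"] += res.matched_count - res.modified_count
    return totals

# Stand-in for a BulkWriteResult, shaped like the first flights batch
fake_result = SimpleNamespace(upserted_count=453, modified_count=3, matched_count=547)
print(tally(fake_result))  # {'upserted': 453, 'modified': 3, 'unchanged': 544}
```

A large unchanged count usually just means the notebook was re-run against a collection that already held most of the data.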

## Load & Upsert: Aircraft (sparse by default)

In [26]:
# Load & Upsert: Aircraft (BATCHED, sparse upsert)
from pymongo import UpdateOne
import json
from datetime import datetime

BATCH = 1000  # reduce if you still see message size errors

ops = []
n_read = 0

with open(AIRCRAFT_FILE, "r", encoding="utf-8") as f:
    for line in f:
        n_read += 1
        raw = json.loads(line)

        # Base doc and optional last_seen parsing (reuses the iso_to_dt helper)
        doc = {"icao24": raw.get("icao24")}
        if raw.get("last_seen"):
            try:
                doc["last_seen"] = iso_to_dt(raw["last_seen"])
            except (ValueError, TypeError, AttributeError):
                # Unparseable timestamp: leave last_seen out of the parsed doc
                pass

        key = {"icao24": raw.get("icao24")}
        update = {"$setOnInsert": {"icao24": raw.get("icao24")}}

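        if AIRCRAFT_SPARSE_UPSERT:
            # Only set non-null values to avoid clobbering known fields with nulls.
            # Take last_seen from the parsed doc only, so an unparseable raw string is never stored.
            merged = {**raw, "last_seen": doc.get("last_seen")}
            set_payload = non_null_set(merged, ["registration","model","typecode","last_seen"])
            if set_payload:
                update["$set"] = set_payload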
        else:
            # Blind set (may overwrite with nulls)
            update["$set"] = {k: raw.get(k) for k in ["registration","model","typecode"] if k in raw}
            if "last_seen" in doc:
                update["$set"]["last_seen"] = doc["last_seen"]

        ops.append(UpdateOne(key, update, upsert=True))

print(f"Prepared {len(ops)} upserts for aircraft (read {n_read} lines).")

upserted = modified = 0
for i, chunk in enumerate(chunked(ops, BATCH), start=1):
    res = db.aircraft.bulk_write(chunk, ordered=False)
    upserted += res.upserted_count
    modified += res.modified_count
    print(f"[Aircraft] Batch {i}: upserted={res.upserted_count}, modified={res.modified_count}")

print("Aircraft totals -> upserted:", upserted, "| modified:", modified)


Prepared 75306 upserts for aircraft (read 75306 lines).
[Aircraft] Batch 1: upserted=1000, modified=0
[Aircraft] Batch 2: upserted=1000, modified=0
[Aircraft] Batch 3: upserted=1000, modified=0
[Aircraft] Batch 4: upserted=1000, modified=0
[Aircraft] Batch 5: upserted=1000, modified=0
[Aircraft] Batch 6: upserted=1000, modified=0
[Aircraft] Batch 7: upserted=1000, modified=0
[Aircraft] Batch 8: upserted=1000, modified=0
[Aircraft] Batch 9: upserted=1000, modified=0
[Aircraft] Batch 10: upserted=1000, modified=0
[Aircraft] Batch 11: upserted=1000, modified=0
[Aircraft] Batch 12: upserted=1000, modified=0
[Aircraft] Batch 13: upserted=1000, modified=0
[Aircraft] Batch 14: upserted=1000, modified=0
[Aircraft] Batch 15: upserted=1000, modified=0
[Aircraft] Batch 16: upserted=1000, modified=0
[Aircraft] Batch 17: upserted=1000, modified=0
[Aircraft] Batch 18: upserted=1000, modified=0
[Aircraft] Batch 19: upserted=1000, modified=0
[Aircraft] Batch 20: upserted=1000, modified=0
[Aircraft] Ba

## Sanity Checks

In [27]:
print("flights:", db.flights.count_documents({}))
print("aircraft:", db.aircraft.count_documents({}))

print("\nSample flight:")
print(db.flights.find_one() or {})

print("\nSample aircraft:")
print(db.aircraft.find_one() or {})

flights: 610535
aircraft: 75306

Sample flight:
{'_id': ObjectId('6910f4b9832a6b7e9fc66d00'), 'departure_time': datetime.datetime(2022, 9, 1, 8, 39, 54), 'icao24': '44cdc6', 'arrival_time': datetime.datetime(2022, 9, 1, 10, 33, 25), 'callsign': 'BEL8DG', 'duration_min': 114, 'estarrivalairport': 'EBBR', 'estdepartureairport': 'LIRF', 'firstseen': 1662021594.0, 'ingest_date': '2022-09-01', 'lastseen': 1662028405, 'model': 'A320 214', 'registration': 'OO-SNF', 'typecode': 'A320'}

Sample aircraft:
{'_id': ObjectId('691103c1832a6b7e9fd0b428'), 'icao24': '000001', 'last_seen': datetime.datetime(2022, 9, 9, 20, 46, 37), 'model': 'Antonov 2', 'registration': 'SP-FGR', 'typecode': 'AN2'}
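
The sample documents show `datetime` values, but a sample of one does not prove every document was converted. One way to check the whole collection is to count documents whose timestamp fields are still stored as strings, using MongoDB's `$type` query operator. The sketch below is a minimal illustration; `count_string_timestamps` is a hypothetical helper, and `db` is assumed to be the PyMongo database opened earlier.

```python
def count_string_timestamps(db):
    """Count documents whose timestamp fields are stored as strings, not BSON dates."""
    checks = [
        ("flights", "departure_time"),
        ("flights", "arrival_time"),
        ("aircraft", "last_seen"),
    ]
    result = {}
    for coll, field in checks:
        # {"$type": "string"} matches only documents where the field is a string
        query = {field: {"$type": "string"}}
        result[f"{coll}.{field}"] = db[coll].count_documents(query)
    return result
```

Running `count_string_timestamps(db)` after the load should return zero for every entry if all timestamps were parsed before being written.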
