<a href="https://colab.research.google.com/github/ipeirotis-org/datasets/blob/main/Flight_Stats/Load_DB1B_Market_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DB1B Market Data Pipeline

## Overview
This notebook loads the **DB1B Origin & Destination Survey** from the Bureau of Transportation Statistics (BTS) into BigQuery.

**What is DB1B?**
- A 10% sample of airline tickets from reporting carriers
- Contains origin, destination, fare, carrier, and routing information
- Published quarterly by the US Department of Transportation
- Critical for analyzing airline pricing, competition, and market structure

**Pipeline Architecture:**
1. **Download**: Fetch quarterly ZIP files from BTS
2. **Stage**: Upload raw CSVs to Google Cloud Storage
3. **Validate**: Check schema consistency across years
4. **Load**: Use BigQuery load jobs to import from GCS

**Key Features:**
- Resume capability (skips already-loaded quarters)
- Schema evolution tracking
- Data quality validation
- Partitioned and clustered BigQuery table for query performance
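
The resume behaviour amounts to a set difference between the configured quarters and those already present in BigQuery. A minimal sketch, assuming a hypothetical helper `pending_quarters` (not part of this notebook) and a hand-written `loaded` set standing in for a real query result:

```python
from itertools import product
from typing import Iterable, List, Set, Tuple

def pending_quarters(
    years: Iterable[int],
    quarters: Iterable[int],
    loaded: Set[Tuple[int, int]],
) -> List[Tuple[int, int]]:
    """Return (year, quarter) pairs not yet loaded, in the order the inputs are given."""
    return [yq for yq in product(years, quarters) if yq not in loaded]

# Hypothetical state: the first three quarters of 2000 are already loaded
loaded = {(2000, 1), (2000, 2), (2000, 3)}
todo = pending_quarters(range(2000, 2002), [1, 2, 3, 4], loaded)
print(todo)
```

Keying the check on `(Year, Quarter)` pairs means a quarter that only partially loaded would still be skipped, so a failed load may need manual cleanup before a rerun.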

**Data Source:** https://www.transtats.bts.gov/DatabaseInfo.asp?QO_VQ=EFI&Yv0x=D

## Configuration

In [1]:
# Configuration parameters
CONFIG = {
    # GCP Settings
    'PROJECT_ID': 'nyu-datasets',
    'DATASET_ID': 'flights',
    'TABLE_NAME': 'raw_db1b_market',
    'GCS_BUCKET': 'bts_datasets',  # Bucket for staging CSV files
    'GCS_PREFIX': 'db1b_market/',           # Folder within bucket

    # Data Range
    'YEARS': list(range(2000, 2025)),
    'QUARTERS': [1, 2, 3, 4],

    # Processing
    'MAX_RETRIES': 3,
    'RETRY_DELAY': 5,  # seconds
    'SAMPLE_RATE': 10,  # Expansion factor: DB1B is a 10% sample, so counts are multiplied by 10

    # BTS URL Template
    'BASE_URL': 'https://transtats.bts.gov/PREZIP/Origin_and_Destination_Survey_DB1BMarket_{}_{}.zip'
}

print(f"Configuration loaded.")
print(f"Will process {len(CONFIG['YEARS'])} years × {len(CONFIG['QUARTERS'])} quarters = {len(CONFIG['YEARS']) * len(CONFIG['QUARTERS'])} files")
print(f"Target: {CONFIG['PROJECT_ID']}.{CONFIG['DATASET_ID']}.{CONFIG['TABLE_NAME']}")
print(f"Staging: gs://{CONFIG['GCS_BUCKET']}/{CONFIG['GCS_PREFIX']}")

Configuration loaded.
Will process 25 years × 4 quarters = 100 files
Target: nyu-datasets.flights.raw_db1b_market
Staging: gs://bts_datasets/db1b_market/
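
The `BASE_URL` template is filled positionally, first with the year and then with the quarter. The sketch below shows how one download URL and its staging URI are derived. The values are copied out of `CONFIG` so the snippet runs on its own, and `build_paths` is a hypothetical helper name:

```python
BASE_URL = "https://transtats.bts.gov/PREZIP/Origin_and_Destination_Survey_DB1BMarket_{}_{}.zip"
GCS_BUCKET = "bts_datasets"
GCS_PREFIX = "db1b_market/"

def build_paths(year: int, quarter: int) -> tuple:
    """Return (download URL, GCS staging URI) for one quarterly file."""
    url = BASE_URL.format(year, quarter)
    gcs_uri = f"gs://{GCS_BUCKET}/{GCS_PREFIX}{year}_Q{quarter}.csv"
    return url, gcs_uri

url, gcs_uri = build_paths(2000, 1)
print(url)
print(gcs_uri)
```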


## Setup and Authentication

In [2]:
# Import required libraries
import pandas as pd
import requests
import io
import zipfile
import gc
import time
from typing import List, Dict, Optional, Tuple, Set
from datetime import datetime
from google.cloud import bigquery
from google.cloud import storage
from google.colab import auth
from tqdm.notebook import tqdm

# Authenticate
auth.authenticate_user()

# Initialize clients
bq_client = bigquery.Client(project=CONFIG['PROJECT_ID'])
storage_client = storage.Client(project=CONFIG['PROJECT_ID'])

print("✓ Authentication successful")
print(f"✓ BigQuery client initialized for project: {CONFIG['PROJECT_ID']}")
print(f"✓ Storage client initialized")

✓ Authentication successful
✓ BigQuery client initialized for project: nyu-datasets
✓ Storage client initialized


## Column Metadata

Official column descriptions from the DOT data dictionary.

In [3]:
COLUMN_DESCRIPTIONS = {
    "ItinID": "Itinerary ID. Identification number assigned to identify an itinerary. Foreign key to DB1BTicket.",
    "MktID": "Market ID. Identification number assigned to identify a market. Foreign key to DB1BMarket.",
    "MktCoupons": "Number of Coupons in the Market. The number of flight segments in the market.",
    "Year": "Year of the survey.",
    "Quarter": "Quarter of the survey (1-4).",
    "Origin": "Origin Airport Code (e.g., JFK, ORD).",
    "OriginAirportID": "Origin Airport ID. A unique numeric code assigned by US DOT to the origin airport.",
    "OriginAirportSeqID": "Origin Airport Sequence ID.",
    "OriginCityMarketID": "Origin City Market ID. Use this field to consolidate airports serving the same city market.",
    "OriginCountry": "Origin Country Code.",
    "OriginStateFips": "Origin State FIPS Code.",
    "OriginState": "Origin State Code.",
    "OriginStateName": "Origin State Name.",
    "OriginWac": "Origin World Area Code (WAC). Geographic area code for the origin.",
    "Dest": "Destination Airport Code (e.g., LAX, SFO).",
    "DestAirportID": "Destination Airport ID. A unique numeric code assigned by US DOT to the destination airport.",
    "DestAirportSeqID": "Destination Airport Sequence ID.",
    "DestCityMarketID": "Destination City Market ID. Use this field to consolidate airports serving the same city market.",
    "DestCountry": "Destination Country Code.",
    "DestStateFips": "Destination State FIPS Code.",
    "DestState": "Destination State Code.",
    "DestStateName": "Destination State Name.",
    "DestWac": "Destination World Area Code (WAC). Geographic area code for the destination.",
    "AirportGroup": "Airport Group. Sequence of airports in the market.",
    "WacGroup": "World Area Code Group. Sequence of WACs in the market.",
    "TkCarrier": "Ticketing Carrier. The airline that sold the ticket.",
    "TkCarrierChange": "Ticketing Carrier Change Indicator. 1 if the ticketing carrier changes within the market; 0 otherwise.",
    "OpCarrier": "Operating Carrier. The airline that actually operated the flight.",
    "OpCarrierChange": "Operating Carrier Change Indicator. 1 if the operating carrier changes within the market; 0 otherwise.",
    "RPCarrier": "Reporting Carrier. The airline that submitted the data to DOT.",
    "TkCarrierGroup": "Ticketing Carrier Group. Sequence of ticketing carriers.",
    "OpCarrierGroup": "Operating Carrier Group. Sequence of operating carriers.",
    "Passengers": "Number of Passengers. 10% sample count (multiply by 10 for total estimate).",
    "MktFare": "Market Fare. The prorated fare for this specific market (one-way portion of the trip).",
    "BulkFare": "Bulk Fare Indicator (1=Yes). Flags bulk-fare tickets; it is not a dollar amount.",
    "MktDistance": "Market Distance (including ground transport).",
    "MktMilesFlown": "Market Miles Flown. Actual miles flown for this market (may differ from non-stop distance).",
    "NonStopMiles": "Non-Stop Miles. Great circle distance between origin and destination.",
    "ItinGeoType": "Itinerary Geography Type. 1=Domestic, 2=International.",
    "MktGeoType": "Market Geography Type. 1=Domestic, 2=International.",
    "MktDistanceGroup": "Market Distance Group. Categorical grouping of market distances.",
    "Unnamed: 41": "Unnamed trailing column (pandas placeholder name). Appears to be empty/null. Likely a data artifact."
}

# Define expected data types for better memory efficiency
COLUMN_DTYPES = {
    'ItinID': 'int64',
    'MktID': 'int32',
    'MktCoupons': 'int8',
    'Year': 'int16',
    'Quarter': 'int8',
    'Origin': 'category',
    'OriginAirportID': 'int32',
    'OriginAirportSeqID': 'int32',
    'OriginCityMarketID': 'int32',
    'OriginCountry': 'category',
    'OriginStateFips': 'category',
    'OriginState': 'category',
    'OriginStateName': 'category',
    'OriginWac': 'int16',
    'Dest': 'category',
    'DestAirportID': 'int32',
    'DestAirportSeqID': 'int32',
    'DestCityMarketID': 'int32',
    'DestCountry': 'category',
    'DestStateFips': 'category',
    'DestState': 'category',
    'DestStateName': 'category',
    'DestWac': 'int16',
    'TkCarrier': 'category',
    'TkCarrierChange': 'int8',
    'OpCarrier': 'category',
    'OpCarrierChange': 'int8',
    'RPCarrier': 'category',
    'Passengers': 'int32',
    'MktFare': 'float32',
    'BulkFare': 'float32',
    'MktDistance': 'int16',
    'MktMilesFlown': 'int16',
    'NonStopMiles': 'int16',
    'ItinGeoType': 'int8',
    'MktGeoType': 'int8',
    'MktDistanceGroup': 'int8'
}

print(f"✓ Loaded metadata for {len(COLUMN_DESCRIPTIONS)} columns")

✓ Loaded metadata for 42 columns
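
A name such as `Unnamed: 41` is what pandas assigns to a header field that has no name. One plausible cause, assumed here rather than confirmed against the BTS files, is a trailing comma at the end of each CSV line, which adds an empty final field. The standard-library sketch below uses a made-up three-column sample to show the effect:

```python
import csv
import io

# Made-up sample with a trailing comma on every line (not real DB1B data)
sample = "ItinID,Origin,Dest,\n1,JFK,LAX,\n2,ORD,SFO,\n"

rows = list(csv.reader(io.StringIO(sample)))
header, first_row = rows[0], rows[1]

# The trailing comma yields an extra, empty field in the header and in each row
print(header)
print(first_row)

# Positions of header fields with no name; pandas would label index 3 "Unnamed: 3"
unnamed_positions = [i for i, name in enumerate(header) if name == ""]
print(unnamed_positions)
```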


## Utility Functions

In [4]:
def get_gcs_path(year: int, quarter: int) -> str:
    """Generate GCS path for a given year/quarter."""
    return f"{CONFIG['GCS_PREFIX']}{year}_Q{quarter}.csv"

def get_table_id() -> str:
    """Get full BigQuery table ID."""
    return f"{CONFIG['PROJECT_ID']}.{CONFIG['DATASET_ID']}.{CONFIG['TABLE_NAME']}"

def file_exists_in_gcs(year: int, quarter: int) -> bool:
    """Check if a file already exists in GCS."""
    bucket = storage_client.bucket(CONFIG['GCS_BUCKET'])
    blob = bucket.blob(get_gcs_path(year, quarter))
    return blob.exists()

def get_loaded_quarters() -> Set[Tuple[int, int]]:
    """Query BigQuery to find which year/quarter combinations are already loaded."""
    table_id = get_table_id()
    try:
        query = f"""
        SELECT DISTINCT Year, Quarter
        FROM `{table_id}`
        ORDER BY Year, Quarter
        """
        results = bq_client.query(query).result()
        loaded = {(row.Year, row.Quarter) for row in results}
        return loaded
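    except Exception as e:
        # Usually means the table doesn't exist yet. Report the error so that
        # auth, quota, or network failures are not silently read as "nothing loaded".
        print(f"  ⚠ Could not query loaded quarters ({e}); assuming none are loaded")
        return set()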

def validate_dataframe(df: pd.DataFrame, year: int, quarter: int) -> List[str]:
    """Validate data quality of a dataframe."""
    issues = []

    # Check row count
    if len(df) < 10000:
        issues.append(f"Suspiciously low row count: {len(df)}")
    elif len(df) > 10_000_000:
        issues.append(f"Suspiciously high row count: {len(df)}")

    # Check critical columns aren't all null
    critical_cols = ['Origin', 'Dest', 'Passengers', 'MktFare']
    for col in critical_cols:
        if col in df.columns and df[col].isna().all():
            issues.append(f"Critical column '{col}' is entirely NULL")

    # Check Year/Quarter match expected values
    if 'Year' in df.columns and not df['Year'].isna().all():
        unique_years = df['Year'].unique()
        if len(unique_years) != 1 or unique_years[0] != year:
            issues.append(f"Year mismatch: expected {year}, found {unique_years}")

    if 'Quarter' in df.columns and not df['Quarter'].isna().all():
        unique_quarters = df['Quarter'].unique()
        if len(unique_quarters) != 1 or unique_quarters[0] != quarter:
            issues.append(f"Quarter mismatch: expected {quarter}, found {unique_quarters}")

    # Check passenger counts are non-negative
    if 'Passengers' in df.columns:
        negative_passengers = (df['Passengers'] < 0).sum()
        if negative_passengers > 0:
            issues.append(f"Found {negative_passengers} rows with negative passengers")

    return issues

print("✓ Utility functions defined")

✓ Utility functions defined
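
Staged objects follow the `{year}_Q{quarter}.csv` naming used by `get_gcs_path`, so an object name can be mapped back to its quarter, for example when listing what is already staged. A sketch with a hypothetical `parse_gcs_path` helper that is not part of this notebook:

```python
import re
from typing import Optional, Tuple

# Matches names such as "db1b_market/2000_Q1.csv"
_STAGED_NAME = re.compile(r"(?P<year>\d{4})_Q(?P<quarter>[1-4])\.csv$")

def parse_gcs_path(path: str) -> Optional[Tuple[int, int]]:
    """Return (year, quarter) for a staged object name, or None if it does not match."""
    match = _STAGED_NAME.search(path)
    if match is None:
        return None
    return int(match["year"]), int(match["quarter"])

print(parse_gcs_path("db1b_market/2000_Q1.csv"))
print(parse_gcs_path("db1b_market/README.txt"))
```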


## Download Functions

In [5]:
def download_and_upload_to_gcs(year: int, quarter: int, force: bool = False) -> Optional[str]:
    """
    Download a DB1B file from BTS and upload to GCS.

    Args:
        year: Year to download
        quarter: Quarter to download (1-4)
        force: If True, re-download even if file exists in GCS

    Returns:
        GCS URI if successful, None otherwise
    """
    gcs_path = get_gcs_path(year, quarter)
    gcs_uri = f"gs://{CONFIG['GCS_BUCKET']}/{gcs_path}"

    # Check if already exists
    if not force and file_exists_in_gcs(year, quarter):
        print(f"  ✓ Already in GCS: {year} Q{quarter}")
        return gcs_uri

    url = CONFIG['BASE_URL'].format(year, quarter)

    # Retry logic for downloads
    for attempt in range(CONFIG['MAX_RETRIES']):
        try:
            print(f"  ↓ Downloading {year} Q{quarter}... (attempt {attempt + 1}/{CONFIG['MAX_RETRIES']})")
            response = requests.get(url, timeout=300)
            response.raise_for_status()

            # Extract CSV from ZIP
            with zipfile.ZipFile(io.BytesIO(response.content)) as z:
                csv_files = [f for f in z.namelist() if f.endswith('.csv')]
                if not csv_files:
                    print(f"  ✗ No CSV found in ZIP for {year} Q{quarter}")
                    return None

                csv_data = z.read(csv_files[0])

            # Load into pandas for validation
            df = pd.read_csv(io.BytesIO(csv_data))

            # Validate data quality
            issues = validate_dataframe(df, year, quarter)
            if issues:
                print(f"  ⚠ Data quality issues for {year} Q{quarter}:")
                for issue in issues:
                    print(f"    - {issue}")

            # Upload to GCS
            bucket = storage_client.bucket(CONFIG['GCS_BUCKET'])
            blob = bucket.blob(gcs_path)
            blob.upload_from_string(csv_data, content_type='text/csv')

            print(f"  ✓ Uploaded to GCS: {year} Q{quarter} ({len(df):,} rows, {len(csv_data) / 1024 / 1024:.1f} MB)")

            # Clean up
            del df, csv_data, response
            gc.collect()

            return gcs_uri

        except requests.RequestException as e:
            print(f"  ✗ Download failed for {year} Q{quarter}: {e}")
            if attempt < CONFIG['MAX_RETRIES'] - 1:
                print(f"    Retrying in {CONFIG['RETRY_DELAY']} seconds...")
                time.sleep(CONFIG['RETRY_DELAY'])
            else:
                print(f"    Max retries exceeded. Skipping {year} Q{quarter}.")
                return None
        except Exception as e:
            print(f"  ✗ Error processing {year} Q{quarter}: {e}")
            return None

    return None

def download_all_files(skip_existing: bool = True) -> List[Tuple[int, int, str]]:
    """
    Download all configured files and upload to GCS.

    Returns:
        List of (year, quarter, gcs_uri) tuples for successful downloads
    """
    print("\n" + "="*70)
    print("STEP 1: DOWNLOAD AND STAGE TO GCS")
    print("="*70 + "\n")

    # Get already loaded quarters
    loaded_quarters = get_loaded_quarters() if skip_existing else set()
    if loaded_quarters:
        print(f"Found {len(loaded_quarters)} quarters already in BigQuery. Will skip these.\n")

    successful_files = []

    total_files = len(CONFIG['YEARS']) * len(CONFIG['QUARTERS'])
    with tqdm(total=total_files, desc="Overall Progress") as pbar:
        for year in CONFIG['YEARS']:
            for quarter in CONFIG['QUARTERS']:
                # Skip if already loaded in BigQuery
                if (year, quarter) in loaded_quarters:
                    print(f"  ⊘ Skipping {year} Q{quarter} (already in BigQuery)")
                    pbar.update(1)
                    continue

                gcs_uri = download_and_upload_to_gcs(year, quarter)
                if gcs_uri:
                    successful_files.append((year, quarter, gcs_uri))

                pbar.update(1)
                time.sleep(1)  # Be nice to BTS servers

    print(f"\n✓ Successfully staged {len(successful_files)} files in GCS")
    return successful_files

print("✓ Download functions defined")

✓ Download functions defined
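
The extraction step works entirely in memory: the response bytes are wrapped in `io.BytesIO` and opened as a ZIP archive. The sketch below reproduces that step against a small archive built on the fly, with no network access. It matches the `.csv` extension case-insensitively, which is a defensive assumption rather than a documented property of the BTS archives, and `first_csv_bytes` is a hypothetical helper name:

```python
import io
import zipfile
from typing import Optional

def first_csv_bytes(zip_bytes: bytes) -> Optional[bytes]:
    """Return the contents of the first CSV member of a ZIP archive, or None."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
        if not csv_names:
            return None
        return archive.read(csv_names[0])

# Build a tiny in-memory archive to stand in for a BTS download
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("readme.html", "<p>not data</p>")
    archive.writestr("SAMPLE.CSV", "Origin,Dest\nJFK,LAX\n")

csv_data = first_csv_bytes(buffer.getvalue())
print(csv_data.decode("utf-8"))
```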


## Schema Analysis Functions

In [16]:
def analyze_schema(files: List[Tuple[int, int, str]]) -> List[bigquery.SchemaField]:
    """
    Analyze schema across all files and create a unified BigQuery schema.
    Tracks schema evolution across years.
    IMPORTANT: Preserves the actual column order from CSV files.
    """
    print("\n" + "="*70)
    print("STEP 2: ANALYZE SCHEMA")
    print("="*70 + "\n")

    all_columns = set()
    schema_by_year = {}
    column_order = None  # Will store the actual column order from CSV

    print("Scanning columns across all files...")
    for year, quarter, gcs_uri in tqdm(files, desc="Schema Analysis"):
        bucket = storage_client.bucket(CONFIG['GCS_BUCKET'])
        blob = bucket.blob(get_gcs_path(year, quarter))

        # Read just the header
        sample = blob.download_as_bytes(start=0, end=10000)
        df_sample = pd.read_csv(io.BytesIO(sample), nrows=0)

        # Get columns as a list (preserves order)
        columns_list = list(df_sample.columns)
        columns = set(columns_list)
        all_columns.update(columns)

        # Store the column order from the first file we see
        if column_order is None:
            column_order = columns_list
            print(f"Using column order from {year} Q{quarter}")

        # Accumulate columns from every quarter of the year, not just the first file seen
        schema_by_year.setdefault(year, set()).update(columns)

    # Report schema evolution
    print(f"\nFound {len(all_columns)} unique columns across all files.\n")

    print("Schema Evolution:")
    sorted_years = sorted(schema_by_year.keys())
    baseline_schema = schema_by_year[sorted_years[0]]

    for year in sorted_years:
        year_schema = schema_by_year[year]
        new_cols = year_schema - baseline_schema
        removed_cols = baseline_schema - year_schema

        if new_cols or removed_cols:
            print(f"  {year}:")
            if new_cols:
                print(f"    + New columns: {', '.join(sorted(new_cols))}")
            if removed_cols:
                print(f"    - Removed columns: {', '.join(sorted(removed_cols))}")
        else:
            print(f"  {year}: No changes from baseline")

    # Create BigQuery schema using the ACTUAL column order from CSV
    bq_schema = []

    # Append any columns missing from the first file's header, sorted so the order is deterministic
    all_columns_ordered = column_order + sorted(c for c in all_columns if c not in column_order)

    for col in all_columns_ordered:
        # Determine BigQuery type
        if col in ['ItinID']:
            bq_type = 'INT64'
        elif col in ['Year', 'OriginAirportID', 'DestAirportID', 'MktID',
                     'OriginAirportSeqID', 'DestAirportSeqID', 'OriginCityMarketID', 'DestCityMarketID']:
            bq_type = 'INTEGER'
        elif col in ['MktFare', 'BulkFare', 'Passengers', 'MktDistance', 'MktMilesFlown', 'NonStopMiles']:
            bq_type = 'FLOAT'
        elif col in ['Quarter', 'MktCoupons',
                     'ItinGeoType', 'MktGeoType', 'MktDistanceGroup', 'OriginWac', 'DestWac']:
            bq_type = 'INTEGER'
        else:
            bq_type = 'STRING'

        # Get description
        description = COLUMN_DESCRIPTIONS.get(col, "")

        bq_schema.append(
            bigquery.SchemaField(
                name=col,
                field_type=bq_type,
                mode='NULLABLE',
                description=description
            )
        )

    print(f"\n✓ Created BigQuery schema with {len(bq_schema)} columns")
    print(f"✓ Column order preserved from CSV files")
    return bq_schema

print("✓ Schema analysis functions defined")

✓ Schema analysis functions defined
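
Schema evolution reduces to set differences against a baseline header. The sketch below compares every quarter against the earliest one, using made-up headers. `diff_schemas` and the sample column names are illustrative assumptions:

```python
from typing import Dict, List, Set, Tuple

Quarter = Tuple[int, int]  # (year, quarter)

def diff_schemas(headers: Dict[Quarter, List[str]]) -> Dict[Quarter, Tuple[Set[str], Set[str]]]:
    """Map each changed quarter to (added, removed) columns relative to the earliest quarter."""
    ordered = sorted(headers)
    baseline = set(headers[ordered[0]])
    changes = {}
    for key in ordered[1:]:
        current = set(headers[key])
        added, removed = current - baseline, baseline - current
        if added or removed:
            changes[key] = (added, removed)
    return changes

# Made-up headers: a column appears in 2001 Q2 and another disappears in 2001 Q3
headers = {
    (2000, 1): ["ItinID", "Origin", "Dest"],
    (2001, 1): ["ItinID", "Origin", "Dest"],
    (2001, 2): ["ItinID", "Origin", "Dest", "Extra"],
    (2001, 3): ["ItinID", "Origin"],
}
for (year, quarter), (added, removed) in diff_schemas(headers).items():
    print(f"{year} Q{quarter}: +{sorted(added)} -{sorted(removed)}")
```

Comparing per quarter rather than per year catches a header change that first shows up mid-year.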


## BigQuery Load Functions

In [7]:
def create_or_update_table(schema: List[bigquery.SchemaField], recreate: bool = False):
    """
    Create or update the BigQuery table with partitioning and clustering.

    Args:
        schema: BigQuery schema
        recreate: If True, drop and recreate the table
    """
    table_id = get_table_id()

    # Check if table exists
    try:
        existing_table = bq_client.get_table(table_id)
        if recreate:
            print(f"Dropping existing table: {table_id}")
            bq_client.delete_table(table_id)
        else:
            print(f"Table {table_id} already exists. Will append data.")
            return
    except Exception:
        pass  # Table doesn't exist, will create

    # Create table with partitioning and clustering
    table = bigquery.Table(table_id, schema=schema)

    # Partition by Year (range partitioning)
    table.range_partitioning = bigquery.RangePartitioning(
        field="Year",
        range_=bigquery.PartitionRange(start=2000, end=2030, interval=1)
    )

    # Cluster by common query columns
    table.clustering_fields = ["Origin", "Dest", "Quarter"]

    # Set table description
    table.description = (
        "DB1B Origin & Destination Survey - Market Data. "
        "Contains a 10% sample of airline tickets with origin, destination, fare, and routing info. "
        "Source: US Department of Transportation, Bureau of Transportation Statistics. "
        f"Loaded on {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}."
    )

    table = bq_client.create_table(table)
    print(f"✓ Created table: {table_id}")
    print(f"  - Partitioned by: Year (range partitioning)")
    print(f"  - Clustered by: {', '.join(table.clustering_fields)}")

def load_from_gcs(files: List[Tuple[int, int, str]], schema: List[bigquery.SchemaField]):
    """
    Load data from GCS to BigQuery using load jobs.
    """
    print("\n" + "="*70)
    print("STEP 3: LOAD TO BIGQUERY")
    print("="*70 + "\n")

    table_id = get_table_id()

    # Ensure table exists
    try:
        bq_client.get_table(table_id)
        print(f"Table {table_id} exists. Appending data...\n")
    except Exception:
        print(f"Creating table {table_id}...\n")
        create_or_update_table(schema, recreate=False)

    # Load each file
    failed_loads = []

    for year, quarter, gcs_uri in tqdm(files, desc="Loading to BigQuery"):
        try:
            print(f"\nLoading {year} Q{quarter} from {gcs_uri}")

            job_config = bigquery.LoadJobConfig(
                source_format=bigquery.SourceFormat.CSV,
                skip_leading_rows=1,
                schema=schema,
                write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
                create_disposition=bigquery.CreateDisposition.CREATE_NEVER,
                allow_quoted_newlines=True,
                max_bad_records=100  # Allow some malformed rows
            )

            load_job = bq_client.load_table_from_uri(
                gcs_uri,
                table_id,
                job_config=job_config
            )

            # Wait for job to complete
            load_job.result()

            # Read the loaded row count from the finished job
            output_rows = load_job.output_rows

            print(f"  ✓ Loaded {output_rows:,} rows for {year} Q{quarter}")

        except Exception as e:
            print(f"  ✗ Failed to load {year} Q{quarter}: {e}")
            failed_loads.append((year, quarter, str(e)))

    # Summary
    print("\n" + "="*70)
    print("LOAD SUMMARY")
    print("="*70)
    print(f"✓ Successfully loaded: {len(files) - len(failed_loads)}/{len(files)} files")

    if failed_loads:
        print(f"\n✗ Failed loads ({len(failed_loads)}):")
        for year, quarter, error in failed_loads:
            print(f"  - {year} Q{quarter}: {error}")

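    # Get final row count and the sampled passenger total.
    # Passengers must be summed: a row can represent more than one passenger.
    query = f"SELECT COUNT(*) AS total_rows, SUM(Passengers) AS sampled_passengers FROM `{table_id}`"
    result = bq_client.query(query).result()
    summary = list(result)[0]
    sampled_passengers = summary.sampled_passengers or 0  # SUM is NULL on an empty table
    print(f"\n✓ Total rows in {table_id}: {summary.total_rows:,}")
    print(f"✓ Estimated total passengers (×10): {sampled_passengers * CONFIG['SAMPLE_RATE']:,.0f}")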

print("✓ BigQuery load functions defined")

✓ BigQuery load functions defined
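
Each row carries its own `Passengers` count, so the row count and the passenger count generally differ. A worked example with made-up rows, showing the estimate as the sum of `Passengers` scaled by the ×10 expansion factor:

```python
SAMPLE_RATE = 10  # expansion factor for a 10% sample, as in CONFIG

# Made-up rows: (Origin, Dest, Passengers)
rows = [
    ("JFK", "LAX", 1.0),
    ("JFK", "LAX", 3.0),
    ("ORD", "SFO", 2.0),
]

row_count = len(rows)
sampled_passengers = sum(p for _, _, p in rows)
estimated_passengers = sampled_passengers * SAMPLE_RATE

print(row_count)             # 3 rows
print(estimated_passengers)  # 60.0 estimated passengers, not 30
```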


## Main Execution

In [None]:
# Execute the pipeline
print("\n" + "#"*70)
print("#" + " "*68 + "#")
print("#" + "DB1B DATA PIPELINE".center(68) + "#")
print("#" + " "*68 + "#")
print("#"*70 + "\n")

print(f"Start time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")

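# Step 1: Download and stage to GCS
# Commented out for this run, which reuses `files` from an earlier execution.
# On a fresh runtime, uncomment the next line; otherwise `files` is undefined.
# files = download_all_files(skip_existing=True)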

if not files:
    print("\n⚠ No new files to process. Pipeline complete.")
else:
    # Step 2: Analyze schema
    schema = analyze_schema(files)

    # Step 3: Load to BigQuery
    load_from_gcs(files, schema)

    print(f"\nEnd time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    print("\n" + "#"*70)
    print("#" + " "*68 + "#")
    print("#" + "PIPELINE COMPLETE".center(68) + "#")
    print("#" + " "*68 + "#")
    print("#"*70)


######################################################################
#                                                                    #
#                    DB1B DATA PIPELINE                    #
#                                                                    #
######################################################################

Start time: 2025-12-01 15:09:13


STEP 2: ANALYZE SCHEMA

Scanning columns across all files...

Using column order from 2000 Q1

Found 42 unique columns across all files.

Schema Evolution:
  2000: No changes from baseline
  2001: No changes from baseline
  2002: No changes from baseline
  2003: No changes from baseline
  2004: No changes from baseline
  2005: No changes from baseline
  2006: No changes from baseline
  2007: No changes from baseline
  2008: No changes from baseline
  2009: No changes from baseline
  2010: No changes from baseline
  2011: No changes from baseline
  2012: No changes from baseline
  2013: No changes from baseline
  2014: No changes from baseline
  2015: No changes from baseline
  2016: No changes from baseline
  2017: No changes from baseline
  2018: No changes from baseline
  2019: No changes from baseline
  2020: No changes from baseline
  2021: No changes from baseline
  2022: No changes from baseline
  2023: No changes from baseline
  2024: No changes from baseline

✓ Created BigQuery schema with 42 columns
✓ Column order preserved from CSV file

Loading 2000 Q1 from gs://bts_datasets/db1b_market/2000_Q1.csv
  ✓ Loaded 4,409,439 rows for 2000 Q1

Loading 2000 Q2 from gs://bts_datasets/db1b_market/2000_Q2.csv
  ✓ Loaded 4,926,593 rows for 2000 Q2

Loading 2000 Q3 from gs://bts_datasets/db1b_market/2000_Q3.csv


## Example Queries

Now that the data is loaded, here are some useful queries to explore it:

In [None]:
# Example 1: Top 10 routes by passenger volume (2024)
query_top_routes = f"""
SELECT
    Origin,
    Dest,
    SUM(Passengers * {CONFIG['SAMPLE_RATE']}) as estimated_total_passengers,
    ROUND(AVG(MktFare), 2) as avg_fare,
    COUNT(*) as num_tickets
FROM `{get_table_id()}`
WHERE Year = 2024
GROUP BY Origin, Dest
ORDER BY estimated_total_passengers DESC
LIMIT 10
"""

print("Top 10 Routes by Passenger Volume (2024):")
print(query_top_routes)
print("\nTo run: bq_client.query(query_top_routes).to_dataframe()")

In [None]:
# Example 2: Average fares by carrier
query_carrier_fares = f"""
SELECT
    TkCarrier as carrier,
    COUNT(*) as num_tickets,
    ROUND(AVG(MktFare), 2) as avg_fare,
    ROUND(STDDEV(MktFare), 2) as fare_stddev,
    SUM(Passengers * {CONFIG['SAMPLE_RATE']}) as estimated_total_passengers
FROM `{get_table_id()}`
WHERE Year = 2024 AND Quarter = 2
    AND MktFare > 0
GROUP BY TkCarrier
HAVING num_tickets > 1000
ORDER BY estimated_total_passengers DESC
"""

print("Average Fares by Carrier (2024 Q2):")
print(query_carrier_fares)
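
`AVG(MktFare)` weights every row equally, whereas a row with `Passengers = 3` represents three sampled travellers. If a per-passenger average is wanted, one option is a passenger-weighted mean. The sketch below uses made-up numbers, and `weighted_avg_fare` is a hypothetical helper:

```python
from typing import Iterable, Tuple

def weighted_avg_fare(rows: Iterable[Tuple[float, float]]) -> float:
    """Passenger-weighted mean fare for (fare, passengers) pairs."""
    rows = list(rows)
    total_passengers = sum(p for _, p in rows)
    if total_passengers == 0:
        raise ValueError("no passengers in input")
    return sum(f * p for f, p in rows) / total_passengers

# Made-up rows: (MktFare, Passengers)
rows = [(100.0, 1), (200.0, 3)]

simple_mean = sum(f for f, _ in rows) / len(rows)
weighted_mean = weighted_avg_fare(rows)
print(simple_mean)    # 150.0
print(weighted_mean)  # 175.0
```

In SQL the same quantity would be `SUM(MktFare * Passengers) / SUM(Passengers)`.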

In [None]:
# Example 3: Quarterly trends from a specific airport
query_airport_trends = f"""
SELECT
    Year,
    Quarter,
    COUNT(DISTINCT Dest) as num_destinations,
    SUM(Passengers * {CONFIG['SAMPLE_RATE']}) as estimated_total_passengers,
    ROUND(AVG(MktFare), 2) as avg_fare
FROM `{get_table_id()}`
WHERE Origin = 'JFK'  -- Change to your airport of interest
GROUP BY Year, Quarter
ORDER BY Year, Quarter
"""

print("Quarterly Trends from JFK:")
print(query_airport_trends)

In [None]:
# Example 4: Market concentration analysis
query_market_concentration = f"""
WITH route_carriers AS (
    SELECT
        Origin,
        Dest,
        TkCarrier,
        SUM(Passengers * {CONFIG['SAMPLE_RATE']}) as passengers
    FROM `{get_table_id()}`
    WHERE Year = 2024
    GROUP BY Origin, Dest, TkCarrier
),
route_totals AS (
    SELECT
        Origin,
        Dest,
        SUM(passengers) as total_passengers,
        COUNT(DISTINCT TkCarrier) as num_carriers
    FROM route_carriers
    GROUP BY Origin, Dest
)
SELECT
    Origin,
    Dest,
    num_carriers,
    total_passengers,
    ROUND(total_passengers / num_carriers, 0) as avg_passengers_per_carrier
FROM route_totals
WHERE total_passengers > 100000  -- Major routes only
ORDER BY num_carriers ASC, total_passengers DESC
LIMIT 20
"""

print("Market Concentration (Routes with Few Competitors):")
print(query_market_concentration)
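
The query above ranks routes by the number of ticketing carriers. A common complementary measure is the Herfindahl-Hirschman Index (HHI), the sum of squared market shares in percent, which also reflects how unevenly passengers are split. This goes beyond what the notebook computes and is sketched here with made-up passenger counts:

```python
from typing import Dict

def hhi(passengers_by_carrier: Dict[str, float]) -> float:
    """Herfindahl-Hirschman Index on a 0-10000 scale (shares in percent, squared)."""
    total = sum(passengers_by_carrier.values())
    if total <= 0:
        raise ValueError("total passengers must be positive")
    return sum((100.0 * p / total) ** 2 for p in passengers_by_carrier.values())

# Made-up routes: carrier code -> estimated passengers
monopoly = {"AA": 1000}
even_duopoly = {"AA": 500, "DL": 500}
skewed_duopoly = {"AA": 900, "DL": 100}

print(hhi(monopoly))        # 10000.0
print(hhi(even_duopoly))    # 5000.0
print(hhi(skewed_duopoly))  # 8200.0
```

Both duopolies have two carriers, yet the skewed one scores much closer to a monopoly, which a carrier count alone would not show.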

## Data Dictionary Reference

Quick reference for key columns:

In [None]:
# Display key columns and their meanings
key_columns = [
    'Origin', 'Dest', 'Passengers', 'MktFare', 'BulkFare',
    'TkCarrier', 'OpCarrier', 'Year', 'Quarter',
    'MktDistance', 'MktCoupons', 'ItinGeoType'
]

print("Key Column Reference:")
print("=" * 80)
for col in key_columns:
    if col in COLUMN_DESCRIPTIONS:
        print(f"\n{col}:")
        print(f"  {COLUMN_DESCRIPTIONS[col]}")

print("\n" + "=" * 80)
print("\nIMPORTANT: Passenger counts are 10% sample. Multiply by 10 for total estimates.")
print("MktFare = Prorated fare for this directional market (may span several flight coupons)")
print("BulkFare = Bulk fare indicator (1 = yes), not a dollar amount")