# Flighter Data Mining Portfolio

This notebook transforms OpenFlights data into a curated dataset that supports the Flighter portfolio story. It ingests airline, airport, route, and aircraft metadata, cleans each source, merges the pieces at the route level, and derives distances plus CO₂ estimates for the resulting legs.

In [77]:
%pip install pandas numpy requests

Note: you may need to restart the kernel to use updated packages.


## Environment setup

- Install the required libraries up front to keep the notebook self-contained for reviewers.
- `pandas` and `numpy` cover the tabular and numerical work; `requests`, `time`, and `datetime.date` are imported for future enrichment and are not used yet.

In [78]:
import pandas as pd
import numpy as np
import requests
import time
from datetime import date

## Data ingestion

- `airlines.csv`, `ports-extended.csv`, `routes.csv`, and `planes.csv` come from OpenFlights exports.
- `aircraft_data.json` contains emissions metadata that will be merged later.
- Keeping the raw files in `raw_data` ensures every run is reproducible.

In [79]:
airlines_df = pd.read_csv('./raw_data/airlines.csv')
ports_df = pd.read_csv('./raw_data/ports-extended.csv')
routes_df = pd.read_csv('./raw_data/routes.csv')
planes_df = pd.read_csv('./raw_data/planes.csv')
aircraft_data_df = pd.read_json('./raw_data/aircraft_data.json')
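
The cleaning code further down treats the literal string `\N` as a null marker. As an alternative to filtering it afterwards, the sketch below shows how `na_values` in `pd.read_csv` could turn that marker into `NaN` at load time. It runs on an in-memory sample, not the real files, and the rows and column names are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical two-row sample; the real files live in ./raw_data.
sample_csv = io.StringIO(
    "Name,IATA,ICAO\n"
    "Example Air,EX,EXA\n"
    "No Code Air,\\N,NCA\n"
)

# na_values adds "\N" to the strings pandas already parses as missing.
sample_df = pd.read_csv(sample_csv, na_values=[r"\N"])
print(sample_df["IATA"].isna().tolist())
```

By default pandas also reads the string `NA` as missing, which can matter for two-letter codes; passing `keep_default_na=False` together with an explicit `na_values` list is one way to control that.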

In [80]:
valid_countries = [
  'ALASKA','Afghanistan','Albania','Algeria','American Samoa','Angola','Antigua and Barbuda','Argentina','Armenia','Aruba','Australia','Austria','Azerbaijan','Bahamas',
  'Bahrain','Bangladesh','Barbados','Belarus','Belgium','Belize','Benin','Bermuda','Bhutan','Bolivia','Bosnia and Herzegovina','Botswana','Brazil','British Virgin Islands',
  'Brunei','Bulgaria','Burkina Faso','Burma','Burundi','Cambodia','Cameroon','Canada','Canadian Territories','Cape Verde','Cayman Islands','Central African Republic',
  'Chad','Chile','China','Colombia','Comoros','Congo (Brazzaville)','Congo (Kinshasa)','Cook Islands','Costa Rica',"Cote d'Ivoire",'Croatia','Cuba','Cyprus',
  'Czech Republic',"Democratic People's Republic of Korea",'Democratic Republic of Congo','Democratic Republic of the Congo','Denmark','Djibouti','Dominican Republic',
  'Ecuador','Egypt','El Salvador','Equatorial Guinea','Eritrea','Estonia','Ethiopia','Faroe Islands','Fiji','Finland','France','French Guiana','French Polynesia','Gabon',
  'Gambia','Georgia','Germany','Ghana','Greece','Guadeloupe','Guatemala','Guinea','Guinea-Bissau','Guyana','Haiti','Honduras','Hong Kong','Hong Kong SAR of China','Hungary',
  'Iceland','India','Indonesia','Iraq','Ireland','Israel','Italy','Ivory Coast','Jamaica','Japan','Jordan','Kazakhstan','Kenya','Kiribati','Kuwait','Kyrgyzstan',
  'Lao Peoples Democratic Republic','Latvia','Lebanon','Liberia','Libya','Lithuania','Luxembourg','Macao','Macedonia','Madagascar','Malawi','Malaysia','Maldives','Mali',
  'Malta','Marshall Islands','Mauritania','Mauritius','Mexico','Moldova','Monaco','Mongolia','Montenegro','Montserrat','Morocco','Mozambique','Myanmar','Namibia','Nauru',
  'Nepal','Netherland','Netherlands','Netherlands Antilles','New Zealand','Nicaragua','Niger','Nigeria','Norway','Oman','Pakistan','Palau','Panama','Papua New Guinea',
  'Paraguay','Peru','Philippines','Poland','Portugal','Puerto Rico','Qatar','Republic of Korea','Republic of the Congo','Reunion','Romania','Russia','Rwanda','S.A.',
  'Saint Kitts and Nevis','Saint Lucia','Saint Vincent and the Grenadines','Samoa','Sao Tome and Principe','Saudi Arabia','Senegal','Serbia','Seychelles','Sierra Leone',
  'Singapore','Slovakia','Slovenia','Solomon Islands','South Africa','South Korea','South Sudan','Spain','Sri Lanka','Sudan','Suriname','Swaziland','Sweden','Switzerland',
  'Taiwan','Tajikistan','Tanzania','Thailand','Togo','Tonga','Trinidad and Tobago','Tunisia','Turkey','Turkmenistan','Turks and Caicos Islands','Uganda','Ukraine',
  'United Arab Emirates','United Kingdom','UNited Kingdom','United States','Uruguay','Uzbekistan','Vanuatu','Venezuela','Vietnam','Zambia','Zimbabwe'
]

In [81]:
planes = [
  'Airbus A319','Airbus A320','Airbus A330','Airbus A321','Airbus A330-200','Airbus A330-300','Airbus A340','Airbus A380','Airbus A318','Airbus A340-600',
  'Airbus A340-300','Airbus A380-800','Airbus A340-200','Airbus A340-500','Airbus A300-600','Airbus A310','Boeing 737-400','Boeing 737-700','Boeing 777',
  'Boeing 767-300','Boeing 737','Boeing 737-500','Boeing 737-800','Boeing 777-300ER','Boeing 737-900','Boeing 777-200','Boeing 737-600','Boeing 737-300',
  'Boeing 747-400','Boeing 757','Boeing 787','Boeing 767','Boeing 777-300','Boeing 777-200LR','Boeing 787-8','Boeing 717','Boeing 767-400','Boeing 757-200',
  'Boeing 757-300','Boeing 747','Boeing 737-200','Boeing 767-200'
]

In [82]:
operational_airlines_2026 = [
  'Helvetic Airways','Jetstar Asia Airways','Sichuan Airlines','Gazpromavia','Air North Charter - Canada','Cebu Pacific','IndiGo Airlines','Israir','Jeju Air','Star Flyer',
  'Aegean Airlines','Georgian Airways','American Airlines','Air Canada','Mandarin Airlines','Air France','Air Algerie','Air India Limited','AirAsia','AeroMéxico',
  'Aerolineas Argentinas','Alaska Airlines','Royal Air Maroc','Finnair','Belavia Belarusian Airlines','JetBlue Airways','Uni Air','British Airways','Skymark Airlines','Biman Bangladesh Airlines',
  'Royal Brunei Airlines','Nouvel Air Tunisie','EVA Air','Air Baltic','Caribbean Airlines','Air Busan','Air China','Airlines PNG','China Airlines','Copa Airlines','Cubana de Aviación','Cathay Pacific',
  'China Southern Airlines','Daallo Airlines','AirAsia X','Nok Air','Condor Flugdienst','Delta Air Lines','TAAG Angola Airlines','Scat Air','Norwegian Air Shuttle','Aer Lingus','Emirates',
  'Ethiopian Airlines','Etihad Airways','Frontier Airlines','Bulgaria Air','Thai AirAsia','Ariana Afghan Airlines','Shanghai Airlines','Ryanair','Rossiya-Russian Airlines','Allegiant Air','Air Arabia',
  'Garuda Indonesia','Gulf Air Bahrain','Air Greenland','Sky Airline','Hawaiian Airlines','Air Seychelles','Juneyao Airlines','Hainan Airlines','Hong Kong Airlines','Uzbekistan Airways','Iberia Airlines',
  'Solomon Airlines','Air India Express','Arkia Israel Airlines','Azerbaijan Airlines','Jazeera Airways','Mango','Japan Airlines','Jetstar Airways','Lion Mentari Airlines','Air Serbia','Druk Air','Air Astana',
  'Korean Air','KLM Royal Dutch Airlines','Kenya Airways','Kuwait Airways','Cayman Airways','LAN Airlines','Lufthansa','Jin Air','LOT Polish Airlines','Jet2.com','Swiss International Air Lines',
  'El Al Israel Airlines','Air Madagascar','Middle East Airlines','Xiamen Airlines','Malaysia Airlines','SilkAir','Air Mauritius','Egyptair','China Eastern Airlines','Air Vanuatu','All Nippon Airways',
  'Spirit Airlines','Nile Air','Japan Transocean Air','Air Macau','Air New Zealand','Malindo Air','Overland Airways','MIAT Mongolian Airlines','Nauru Air Corporation','Austrian Airlines','Croatia Airlines',
  'Asiana Airlines','Pegasus Airlines','Bangkok Airways','Pakistan International Airlines','West Air China','Philippine Airlines','Ukraine International Airlines','Air Niugini','Surinam Airways','Qantas',
  'Qatar Airways','Lao Airlines','Indonesia AirAsia','Nepal Airlines','Atlantic Airways','Royal Jordanian','Tarom','Kam Air','S7 Airlines','South African Airways','Shandong Airlines','Sudan Airways','Spicejet',
  'Sriwijaya Air','Scandinavian Airlines System','Brussels Airlines','Singapore Airlines','Corsairfly','Aeroflot Russian Airlines','Saudi Arabian Airlines','Sun Country Airlines','Turkmenistan Airlines',
  'Thai Airways International','Turkish Airlines','Air Tahiti Nui','Transavia France','TAP Portugal','Air Transat','Tunisair','Tway Airlines','Air Caraïbes','easyJet','Ural Airlines','United Airlines',
  'AlMasria Universal Airlines','SriLankan Airlines','Bahamasair','Air Austral','Air Europa','Conviasa','Carpatair','Virgin Australia','Vietnam Airlines','Cabo Verde Airlines','Virgin Atlantic Airways',
  'Wizz Air','Rwandair Express','Southwest Airlines','WestJet','Oman Air','SunExpress','Volaris','Eastar Jet','Shenzhen Airlines'
]

## Reference lists

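The three lists defined above act as whitelists: `valid_countries` holds the country labels accepted from the raw airline data (including raw variants such as `'Netherland'` and `'UNited Kingdom'`, which are normalized during cleaning), `planes` enumerates the commercial airframes we analyze, and `operational_airlines_2026` keeps the focus on carriers flying in 2026.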
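
Because a whitelist silently discards anything it does not contain, it can help to list the rejected labels before filtering. The following is a minimal sketch on made-up data; `raw_countries` and `allowed` are hypothetical stand-ins for the `Country` column and `valid_countries`.

```python
import pandas as pd

# Hypothetical raw labels; the real values come from the Country column of airlines.csv.
raw_countries = pd.Series(["France", "Netherland", "Atlantis", None, "France"])
allowed = {"France", "Netherland", "Netherlands"}  # tiny stand-in for valid_countries

# Rows whose label is not whitelisted (missing labels count as rejected too).
rejected = raw_countries[~raw_countries.isin(allowed)]
# value_counts(dropna=False) keeps missing labels visible in the report.
report = rejected.value_counts(dropna=False)
print(report)
```

Reviewing such a report is one way to decide whether a label needs a new whitelist entry or a normalization rule.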

## Cleaning functions

`remove_nulls` drops rows where any of the listed columns is null or holds the `\N` marker, and the per-dataset `clean_data` routines rename columns, enforce IATA/ICAO code lengths, and normalize country names so downstream joins work with a consistent schema. Each cell redefines `clean_data`, so the cells must run in order.

In [83]:
def remove_nulls(df, cols):
    mask = (
        df[cols]
        .notna()
        .all(axis=1)
        & ~(df[cols] == r'\N').any(axis=1)
    )
    return df.loc[mask].copy()
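
To make the behavior of `remove_nulls` concrete, here is a small self-contained run on a toy frame. The function body repeats the cell above; the column names and values are invented for the example.

```python
import numpy as np
import pandas as pd

def remove_nulls(df, cols):
    # Same logic as the cell above: keep rows where every listed column
    # is non-null and none of them holds the "\N" marker.
    mask = df[cols].notna().all(axis=1) & ~(df[cols] == r"\N").any(axis=1)
    return df.loc[mask].copy()

toy = pd.DataFrame({
    "code": ["AA", r"\N", "CC", "DD"],
    "name": ["Alpha", "Beta", np.nan, "Delta"],
    "note": [r"\N", "x", "y", "z"],  # not in the checked columns
})

kept = remove_nulls(toy, ["code", "name"])
print(kept.index.tolist())
```

Only the columns passed in `cols` are checked, so the `\N` in `note` does not remove the first row. One caveat worth keeping in mind: a column cast with `astype(str)` beforehand stores missing values as the string `'nan'`, which this check does not catch.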

In [84]:
def clean_data(df):
    df = df.drop(columns=['Alias'])
    df = df[df['Active'].apply(str).str.contains('Y', regex=False, na=False, case=False)]
    df = df[df['IATA'].notna()]
    df = df[df['IATA'].str.len() == 2]
    df = df.dropna(subset=['ICAO'])
    df = df[df['Callsign'].notna() & (df['Callsign'] != r'\N')]
    df = df[df['Country'].isin(valid_countries)]
    # Keys must match the spellings in valid_countries exactly (map is case-sensitive).
    # Labels absent from valid_countries (e.g. 'Russian Federation') were already
    # filtered out above, so they need no entry here.
    country_mapping = {
        'S.A.': 'South Africa',
        'Republic of Korea': 'South Korea',
        "Democratic People's Republic of Korea": 'North Korea',
        'Burma': 'Myanmar',
        "Cote d'Ivoire": 'Ivory Coast',
        'Congo (Kinshasa)': 'Democratic Republic of the Congo',
        'Democratic Republic of Congo': 'Democratic Republic of the Congo',
        'Congo (Brazzaville)': 'Republic of the Congo',
        'Netherland': 'Netherlands',
        'Hong Kong SAR of China': 'Hong Kong',
        'UNited Kingdom': 'United Kingdom',
    }
    df = df.copy()  # avoid assigning into a filtered slice
    df['Country'] = df['Country'].map(country_mapping).fillna(df['Country'])
    df = df[~df['Name'].str.contains('cargo', case=False, na=False)]
    df = df.rename(columns={
        'Airline ID': 'airline_id',
        'Name': 'airline_name',
        'IATA': 'airline_iata',
        'ICAO': 'airline_icao',
        'Callsign': 'airline_call_sign',
        'Country': 'airline_country',
        'Active': 'airline_active',
    })
    df['airline_id'] = df['airline_id'].astype(str)
    df = remove_nulls(
        df,
        [
            'airline_id',
            'airline_name',
            'airline_iata',
            'airline_icao',
            'airline_country',
            'airline_active',
        ],
    )
    return df

airlines_df_clean = clean_data(airlines_df.copy())
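
The country normalization relies on the `map(...).fillna(...)` idiom, which is case-sensitive. The sketch below, with a two-entry stand-in for `country_mapping`, shows how labels without a matching key pass through unchanged.

```python
import pandas as pd

# Small stand-in for country_mapping; keys must match the raw spelling exactly.
mapping = {"Burma": "Myanmar", "Netherland": "Netherlands"}
countries = pd.Series(["Burma", "France", "Netherland", "burma"])

# map() returns NaN for labels without a key; fillna() restores the original label.
normalized = countries.map(mapping).fillna(countries)
print(normalized.tolist())
```

The lowercase `burma` is left as-is because no key matches it, which is why mapping keys need to mirror the spellings that survive the whitelist filter.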

In [85]:
def clean_data(df):
    df = df.rename(columns={
        'Airline ID': 'port_id',
        'Name': 'port_name',
        'City': 'port_city',
        'Country': 'port_country',
        'IATA': 'port_iata',
        'ICAO': 'port_icao',
        'Latitude': 'port_latitude',
        'Longitude': 'port_longitude',
        'Timezone': 'port_timezone',
        'Tz database timezone': 'port_database_timezone',
    })
    df = df.drop(columns=['Source ', 'DST', 'Altitude'])
    df = df[
        ~(
            (df['port_iata'].isna() | df['port_icao'].isna())
            | (df['port_iata'] == r'\N')
            | (df['port_icao'] == r'\N')
        )
    ]
    df = df[df['port_iata'].str.len() == 3]
    df = df[df['port_icao'].str.len() == 4]
    df = df.dropna(subset=['port_city'])
    df = df[df['port_timezone'].notna() & (df['port_timezone'] != r'\N')]
    df = df[df['port_database_timezone'].notna() & (df['port_database_timezone'] != r'\N')]
    df['port_id'] = df['port_id'].astype(str)
    df = remove_nulls(
        df,
        [
            'port_id',
            'port_name',
            'port_city',
            'port_country',
            'port_iata',
            'port_icao',
            'port_latitude',
            'port_longitude',
            'port_timezone',
            'port_database_timezone',
        ],
    )
    return df

ports_df_clean = clean_data(ports_df.copy())

In [86]:
def clean_data(df):
    df = df.drop(columns=['Codeshare'])
    df = df.rename(columns={
        'Airline': 'route_airline_iata',
        'Airline ID': 'route_airline_id',
        'Source airport': 'route_source_airport',
        'Source airport ID': 'route_source_airport_id',
        'Destination airport': 'route_destination_airport',
        'Destination airport ID': 'route_destination_airport_id',
        'Stops': 'route_stops',
        'Equipment': 'route_plane_iso',
    })
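    # Cast every join key to str so it matches the string IDs in the airline and airport tables.
    df['route_airline_id'] = df['route_airline_id'].astype(str)
    df['route_source_airport_id'] = df['route_source_airport_id'].astype(str)
    df['route_destination_airport_id'] = df['route_destination_airport_id'].astype(str)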
    df = remove_nulls(
        df,
        [
            'route_airline_id',
            'route_airline_iata',
            'route_source_airport',
            'route_source_airport_id',
            'route_destination_airport',
            'route_destination_airport_id',
            'route_stops',
            'route_plane_iso',
        ],
    )
    return df

routes_df_clean = clean_data(routes_df.copy())
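
In the upstream OpenFlights route data, the equipment field can hold several aircraft codes separated by spaces. Whether that applies to this `routes.csv` is an assumption worth checking. If it does, such rows cannot match a single `plane_iso` in the later merge. A possible sketch for splitting them into one row per code, using made-up routes:

```python
import pandas as pd

# Hypothetical routes; "320 321" stands for a leg flown with two aircraft types.
routes = pd.DataFrame({
    "route_source_airport": ["AAA", "BBB"],
    "route_plane_iso": ["320 321", "738"],
})

# str.split() without arguments splits on whitespace; explode() gives one row per code.
exploded = (
    routes.assign(route_plane_iso=routes["route_plane_iso"].str.split())
    .explode("route_plane_iso")
    .reset_index(drop=True)
)
print(exploded)
```

Exploding multiplies route rows, so any later per-route counts would need to account for that.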

In [87]:
def clean_data(df):
    df = df.drop(columns=['dafif_code'])
    df = df.rename(columns={'name': 'plane_name', 'iso_code': 'plane_iso'})
    df = remove_nulls(df, ['plane_name', 'plane_iso'])
    return df

planes_df_clean = clean_data(planes_df.copy())

## Merge strategy

Join the cleaned routes with airline metadata, then attach source and destination airport details, and finally merge the equipment lookup to add plane-level context.

In [88]:
merged_df = pd.merge(
    routes_df_clean,
    airlines_df_clean,
    left_on='route_airline_id',
    right_on='airline_id',
    how='inner',
)

src_ports = ports_df_clean.rename(
    columns={c: f'source_{c}' for c in ports_df_clean.columns if c != 'port_id'}
)
merged_df = merged_df.merge(
    src_ports,
    left_on='route_source_airport_id',
    right_on='port_id',
    how='left',
).drop(columns=['port_id'])

dst_ports = ports_df_clean.rename(
    columns={c: f'destination_{c}' for c in ports_df_clean.columns if c != 'port_id'}
)
merged_df = merged_df.merge(
    dst_ports,
    left_on='route_destination_airport_id',
    right_on='port_id',
    how='left',
).drop(columns=['port_id'])

planes_df_clean['plane_iso'] = planes_df_clean['plane_iso'].astype(str)
merged_df['route_plane_iso'] = merged_df['route_plane_iso'].astype(str)
merged_df = merged_df.merge(
    planes_df_clean,
    left_on='route_plane_iso',
    right_on='plane_iso',
    how='left',
)

merged_df = merged_df.drop(
    columns=[
        'route_airline_iata',
        'route_airline_id',
        'route_source_airport',
        'route_destination_airport',
    ]
)
merged_df = merged_df.dropna()
df = merged_df.copy()
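
The left merges followed by `dropna()` remove unmatched routes without reporting how many were lost. As a hedged sketch, pandas' `indicator` and `validate` merge options could be used to audit a join before dropping rows; the toy frames below only borrow the column names from the cells above.

```python
import pandas as pd

routes = pd.DataFrame({"route_source_airport_id": ["1", "2", "999"]})
ports = pd.DataFrame({"port_id": ["1", "2"], "port_name": ["Alpha", "Beta"]})

# indicator=True adds a _merge column; validate raises if port_id is not unique.
audited = routes.merge(
    ports,
    left_on="route_source_airport_id",
    right_on="port_id",
    how="left",
    indicator=True,
    validate="many_to_one",
)

# Rows tagged left_only found no airport and would be removed by dropna().
unmatched = audited[audited["_merge"] == "left_only"]
print(len(unmatched), "route(s) without a matching airport")
```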

## Commercial filters

Keep only the routes that use the provided commercial planes and airlines still operating in 2026.

In [89]:
final_df = df[df['plane_name'].isin(planes)]
final_df = final_df[final_df['airline_name'].isin(operational_airlines_2026)]

## Aircraft metadata enrichment

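Merge in `aircraft_data.json` on the aircraft name. The join is a left merge followed by `dropna()`, so routes whose plane has no emissions metadata are dropped and every remaining route carries it.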

In [90]:
final_df = final_df.merge(
    aircraft_data_df,
    left_on='plane_name',
    right_on='aircraft_name',
    how='left',
)
final_df = final_df.dropna()
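
The merge depends on an `aircraft_name` column, and the later CO₂ step on `co2_g_per_pax_mile`. A small pre-merge check along these lines could catch a missing column or duplicated aircraft names early. The frame below is a hypothetical stand-in for `aircraft_data_df`, and its emission values are invented.

```python
import pandas as pd

# Hypothetical stand-in for aircraft_data_df; the emission values are made up.
aircraft = pd.DataFrame({
    "aircraft_name": ["Airbus A320", "Boeing 737-800", "Airbus A320"],
    "co2_g_per_pax_mile": [100.0, 110.0, 100.0],
})

required = {"aircraft_name", "co2_g_per_pax_mile"}
missing_cols = required - set(aircraft.columns)

# Duplicate names would multiply route rows in a left merge.
dup_names = aircraft.loc[aircraft["aircraft_name"].duplicated(), "aircraft_name"].tolist()
print(missing_cols, dup_names)
```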

## Distances and international flags

Compute Haversine distances between airports, add a miles conversion, and flag whether a leg crosses a border.

In [91]:
def haversine_km(lat1, lon1, lat2, lon2):
    R = 6371.0
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    c = 2 * np.arcsin(np.sqrt(a))
    return R * c

final_df['distance_km'] = haversine_km(
    final_df['source_port_latitude'],
    final_df['source_port_longitude'],
    final_df['destination_port_latitude'],
    final_df['destination_port_longitude'],
)
final_df['distance_miles'] = final_df['distance_km'] * 0.621371
final_df['is_international'] = (
    final_df['source_port_country'] != final_df['destination_port_country']
)
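
A quick sanity check for the distance formula: a quarter of the equator should measure R·π/2. The sketch below repeats the function so it runs on its own and adds an `np.clip` guard, which is not part of the original cell and is included here only as a precaution against rounding near antipodal points.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    R = 6371.0  # mean Earth radius in km, as in the cell above
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    # Clipping keeps a within [0, 1] so arcsin never receives an invalid input.
    return 2 * R * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

# A quarter of the equator should equal R * pi / 2.
quarter = haversine_km(0.0, 0.0, 0.0, 90.0)
same_point = haversine_km(10.0, 20.0, 10.0, 20.0)
print(round(float(quarter), 1), float(same_point))
```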

## CO₂ estimation

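Estimate CO₂ per passenger for each leg by multiplying the distance in miles by the aircraft-specific emission factor (`co2_g_per_pax_mile`, grams per passenger-mile as its name indicates) and dividing by 1000 to convert grams to kilograms.

In [92]:
# The factor is per passenger-mile, so use the distance in miles, not kilometers.
final_df['co2_total_kg'] = (
    final_df['distance_miles'] * final_df['co2_g_per_pax_mile'] / 1000
)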
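
As a worked unit check, assuming `co2_g_per_pax_mile` means grams of CO₂ per passenger-mile (as the name suggests), the distance has to be expressed in miles before the factor is applied. The helper name `co2_kg_per_passenger` and the numbers below are hypothetical.

```python
KM_TO_MILES = 0.621371  # same conversion factor as the distance cell

def co2_kg_per_passenger(distance_km, g_per_pax_mile):
    """Convert km to miles, apply the per-mile factor, then grams to kilograms."""
    distance_miles = distance_km * KM_TO_MILES
    return distance_miles * g_per_pax_mile / 1000

# Hypothetical leg: 1000 km with a made-up factor of 150 g per passenger-mile.
example_kg = co2_kg_per_passenger(1000, 150)
print(round(example_kg, 2))
```

Multiplying the kilometer distance by the same factor would overstate the result by roughly 61%, since one mile is about 1.609 km.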

Persist the column schema for reference.

In [None]:
import os

os.makedirs('./clean_data', exist_ok=True)  # create the output folder if needed
with open('./clean_data/columns.txt', 'w') as f:
    for col in final_df.columns:
        f.write(f'{col}\n')

## Results and next steps

- `final_df` now contains clean, merged, and enriched route-level data for the portfolio.
- `columns.txt` captures the schema so collaborators can inspect the column set.
- The final cell exports the dataset to `./clean_data/final_flight_data.csv`.

In [94]:
final_df.to_csv("./clean_data/final_flight_data.csv", index=False)
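
CSV files do not store dtypes, so ID columns that were cast to strings during cleaning may come back as integers when the export is reloaded. A minimal round-trip sketch, using a temporary directory and a made-up two-row frame instead of the real output:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical two-row frame; airline_id is a string column, as in the cleaning step.
toy = pd.DataFrame({"airline_id": ["007", "24"], "distance_km": [100.5, 200.0]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "final_flight_data.csv"
    toy.to_csv(out_path, index=False)
    # Without dtype, pandas would infer integers and drop the leading zeros.
    restored = pd.read_csv(out_path, dtype={"airline_id": str})

print(restored["airline_id"].tolist())
```

Passing an explicit `dtype` mapping when reading the file back is one way for collaborators to keep the ID columns as strings.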