
# General information on the data-scraping process

Here, we gather key data for the airports we want to scrape:

location (coordinates, country, city) and IATA code.

Using the Wikipedia API, this then allows us to find each airport's Wikipedia link and Wikipedia name.
For example, JFK is stored in Wikipedia under the name John_F._Kennedy_International_Airport. This matches the last part of the URL of the airport's Wikipedia page:
https://en.wikipedia.org/wiki/John_F._Kennedy_International_Airport

For each time range:

The raw text of the Wikipedia page can be scraped predictably for the list of destinations served from an airport. We encode airport destinations using IATA codes.

We do this for each airport, generating a large CSV file of airport-destination pairs for that time range.

We look at 2 time ranges: now (as of June 4th, 11am Eastern Time) and before Jan 1, 2000 (UTC+0).
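
As a rough illustration of the airport-destination pair encoding described above, the sketch below maps destination Wikipedia names to IATA codes and writes the pairs as CSV text. The helper names (`encode_pairs`, `pairs_to_csv`), the column headers, and the small lookup table are assumptions made for this example, not the format used in the final dataset.

```python
import csv
import io


def encode_pairs(source_iata, destination_names, name_to_iata):
    """Return (source, destination) IATA pairs, skipping unknown destinations."""
    pairs = []
    for name in destination_names:
        dest_iata = name_to_iata.get(name)
        if dest_iata is not None:
            pairs.append((source_iata, dest_iata))
    return pairs


def pairs_to_csv(pairs):
    """Serialise the pairs to CSV text with a header row (hypothetical column names)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["source_iata", "destination_iata"])
    writer.writerows(pairs)
    return buffer.getvalue()


# Hypothetical lookup table from Wikipedia name to IATA code
name_to_iata = {"Dublin_Airport": "DUB", "Shannon_Airport": "SNN"}
pairs = encode_pairs(
    "JFK", ["Dublin_Airport", "Shannon_Airport", "Unknown_Airport"], name_to_iata
)
csv_text = pairs_to_csv(pairs)
print(csv_text)
```

Destinations that are missing from the lookup table are dropped here; logging them instead would make gaps in the airport list easier to spot.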

### Sources:


List of the top 1000 airports by traffic to scrape:

https://gettocenter.com/airports/top-100-airports-in-world/1000#google_vignette  


Detailed airport database for cross-referencing:

https://www.partow.net/miscellaneous/airportdatabase/index.html#Downloads 

In [1]:
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd
# Suppress just SettingWithCopyWarning
import warnings
warnings.simplefilter(action='ignore', category=pd.errors.SettingWithCopyWarning)

# Part 1: Get basic data for the list of top 1000 airports

In [2]:
# Download the page listing the top airports and parse its HTML
url = "https://gettocenter.com/airports/top-100-airports-in-world/1000"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Take the first table on the page and collect its rows
table = soup.find("table")
rows = table.find_all("tr")

with open("./data/top_airports_basic_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for row in rows:
        # Extract all cells (td or th)
        cols = row.find_all(["td", "th"])
        # Write the row text content to CSV
        writer.writerow([col.get_text(strip=True) for col in cols])
    # No explicit f.close() is needed: the with block closes the file

After replacing blank strings (""), read the CSV file back in and add column names.

In [3]:
data = pd.read_csv("./data/top_airports_basic_data.csv", names=["full_name", "iata", "city", "country", "estimated_pax"])
print(len(data))  # number of rows before dropping incomplete ones
data = data.dropna()
print(len(data))  # number of rows after dropping incomplete ones
data.head(n=11)

1087
989


Unnamed: 0,full_name,iata,city,country,estimated_pax
1.0,Hartsfield–Jackson Atlanta International Airport,ATL,Atlanta,United States,103902992
2.0,Beijing Capital International Airport,PEK,Beijing,China,95786442
3.0,Dubai International Airport,DXB,Dubai,United Arab Emirates,88242099
4.0,Los Angeles International Airport,LAX,Los Angeles,United States,84557968
5.0,O'Hare International Airport,ORD,Chicago,United States,79828183
6.0,Heathrow Airport,LHR,London,United Kingdom,78014598
7.0,Haneda Airport,HND,Tokyo,Japan,76476251
8.0,Hong Kong International Airport,HKG,Hong Kong,Hong Kong,72665078
9.0,Shanghai Pudong International Airport,PVG,Shanghai,China,70001237
10.0,Charles de Gaulle International Airport,CDG,Paris,France,69471442
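
The row count drops from 1087 to 989 because `dropna()` removes every row with at least one missing value. The sketch below reproduces that behaviour on a tiny in-memory CSV; the sample rows and the extra `rank` column name are made up for illustration.

```python
import io

import pandas as pd

# Made-up sample: the middle row is entirely blank, as a scraped spacer row might be
raw = (
    "1,Airport A,AAA,City A,Country A,100\n"
    ",,,,,\n"
    "2,Airport B,BBB,City B,Country B,50\n"
)
columns = ["rank", "full_name", "iata", "city", "country", "estimated_pax"]

# Empty fields are read as NaN by default
sample = pd.read_csv(io.StringIO(raw), names=columns)
clean = sample.dropna()

print(len(sample), len(clean))
```

Passing `subset=[...]` to `dropna()` would restrict the check to specific columns if only some fields are required.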


We now have basic information for the top airports in the world. Save the DataFrame back to the CSV file.

In [4]:
data.to_csv("./data/top_airports_basic_data.csv")

# Part 2:
Get the Wikipedia URL for each IATA code by using Wikipedia's search API (the search itself uses the airport's full name)

In [5]:
# Testing the lookup function
def get_wikipedia_url_from_name(name):
    """Search Wikipedia for an airport name and return the URL of the best-matching article, or None."""
    search_url = "https://en.wikipedia.org/w/api.php"
    params = {
        "action": "query",
        "list": "search",
        "srsearch": f"{name}",
        "format": "json"
    }

    response = requests.get(search_url, params=params)
    data = response.json()
    try:
        raw_name = data['query']['search'][0]['title'] 
        raw_name = raw_name.replace(" ", "_") #replace the raw name spaces with _

        return "https://en.wikipedia.org/wiki/"+raw_name #format for english wiki
    except (KeyError, IndexError):
        return None
    
print(get_wikipedia_url_from_name("Hartsfield–Jackson Atlanta International Airport")) #popular airport
print(get_wikipedia_url_from_name("Gobernador Castello Airport")) #more obscure airport


https://en.wikipedia.org/wiki/Hartsfield–Jackson_Atlanta_International_Airport
https://en.wikipedia.org/wiki/Gobernador_Edgardo_Castello_Airport
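
The response handling can be checked without network access by separating it from the HTTP call. The sketch below assumes only the response shape that the function above already relies on (`query` → `search` → first item → `title`); the helper name `first_result_url` and the sample payloads are hypothetical.

```python
WIKI_PREFIX = "https://en.wikipedia.org/wiki/"


def first_result_url(payload):
    """Return the article URL for the first search hit, or None if there is none."""
    try:
        title = payload["query"]["search"][0]["title"]
    except (KeyError, IndexError, TypeError):
        return None
    # Wikipedia URLs use underscores in place of spaces
    return WIKI_PREFIX + title.replace(" ", "_")


# Hand-written payloads mimicking the parts of the JSON response used above
hit = {"query": {"search": [{"title": "Gobernador Edgardo Castello Airport"}]}}
no_hit = {"query": {"search": []}}

hit_url = first_result_url(hit)
no_hit_url = first_result_url(no_hit)
print(hit_url)
print(no_hit_url)
```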


Now, run it for all airports.

In [6]:
data = pd.read_csv("./data/top_airports_basic_data.csv")

In [7]:
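data["wiki_url"] = None  # add column
for index, row in data.iterrows():
    print("current index:", index)
    try:
        url = get_wikipedia_url_from_name(row["full_name"])
    except (requests.RequestException, ValueError):
        # Network or JSON decoding failure: leave this entry empty
        url = None
    # .loc avoids chained assignment, which may write to a temporary copy
    data.loc[index, "wiki_url"] = url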

current index: 0
current index: 1
current index: 2
current index: 3
current index: 4
current index: 5
current index: 6
current index: 7
current index: 8
current index: 9
current index: 10
current index: 11
current index: 12
current index: 13
current index: 14
current index: 15
current index: 16
current index: 17
current index: 18
current index: 19
current index: 20
current index: 21
current index: 22
current index: 23
current index: 24
current index: 25
current index: 26
current index: 27
current index: 28
current index: 29
current index: 30
current index: 31
current index: 32
current index: 33
current index: 34
current index: 35
current index: 36
current index: 37
current index: 38
current index: 39
current index: 40
current index: 41
current index: 42
current index: 43
current index: 44
current index: 45
current index: 46
current index: 47
current index: 48
current index: 49
current index: 50
current index: 51
current index: 52
current index: 53
current index: 54
current index: 55
cu

In [8]:
data.head(n=989)

Unnamed: 0.1,Unnamed: 0,full_name,iata,city,country,estimated_pax,wiki_url
0,1.0,Hartsfield–Jackson Atlanta International Airport,ATL,Atlanta,United States,103902992,https://en.wikipedia.org/wiki/Hartsfield–Jacks...
1,2.0,Beijing Capital International Airport,PEK,Beijing,China,95786442,https://en.wikipedia.org/wiki/Beijing_Capital_...
2,3.0,Dubai International Airport,DXB,Dubai,United Arab Emirates,88242099,https://en.wikipedia.org/wiki/Dubai_Internatio...
3,4.0,Los Angeles International Airport,LAX,Los Angeles,United States,84557968,https://en.wikipedia.org/wiki/Los_Angeles_Inte...
4,5.0,O'Hare International Airport,ORD,Chicago,United States,79828183,https://en.wikipedia.org/wiki/O'Hare_Internati...
...,...,...,...,...,...,...,...
984,985.0,Noumérat - Moufdi Zakaria Airport,GHA,Ghardaia,Algeria,45794,https://en.wikipedia.org/wiki/Noumérat_–_Moufd...
985,986.0,Guarani International Airport,AGT,Ciudad del Este,Paraguay,40923,https://en.wikipedia.org/wiki/Guaraní_Internat...
986,987.0,Catarman National Airport,CRM,Catarman,Philippines,40237,https://en.wikipedia.org/wiki/Catarman_Nationa...
987,988.0,Sauce Viejo Airport,SFN,Santa Fe,Argentina,37725,https://en.wikipedia.org/wiki/Sauce_Viejo_Inte...


Find the number of null entries in the wiki_url column.

In [9]:
print(f"Number of null entries: {data['wiki_url'].isnull().sum()}")

Number of null entries: 7


Find the rows with a null wiki_url.

In [10]:
data[data['wiki_url'].isnull()]

Unnamed: 0.1,Unnamed: 0,full_name,iata,city,country,estimated_pax,wiki_url
97,98.0,Liuting Airport,TAO,Qingdao,China,23210530,
98,99.0,Brisbane International Airport,BNE,Brisbane,Australia,23205702,
138,139.0,Presidente Juscelino Kubistschek International...,BSB,Brasilia,Brazil,16912680,
161,162.0,Nice-Côte d'Azur Airport,NCE,Nice,France,13304782,
671,672.0,Regional de Maringá - Sílvio Nane Junior Airport,MGF,Maringa,Brazil,658000,
677,678.0,Lajes Field,TER,Lajes (terceira Island),Portugal,631236,
972,973.0,Suboficial Ay Santiago Germano Airport,AFA,San Rafael,Argentina,56905,


Fill in the missing URLs manually.


In [11]:
# Use .loc instead of chained indexing so the assignment always hits the original DataFrame
data.loc[97, "wiki_url"] = "https://en.wikipedia.org/wiki/Qingdao_Liuting_International_Airport"
data.loc[98, "wiki_url"] = "https://en.wikipedia.org/wiki/Brisbane_Airport"
data.loc[138, "wiki_url"] = "https://en.wikipedia.org/wiki/Bras%C3%ADlia_International_Airport"
data.loc[161, "wiki_url"] = "https://en.wikipedia.org/wiki/Nice_C%C3%B4te_d%27Azur_Airport"
data.loc[671, "wiki_url"] = "https://en.wikipedia.org/wiki/Maring%C3%A1_Regional_Airport"
data.loc[677, "wiki_url"] = "https://en.wikipedia.org/wiki/Lajes_Field"
data.loc[972, "wiki_url"] = "https://en.wikipedia.org/wiki/San_Rafael_Airport_(Argentina)"
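
Row positions such as 97 or 972 change whenever the table is re-scraped or re-filtered, so one possible alternative is to key the manual fixes by IATA code. This is only a sketch on a made-up three-row frame; `apply_manual_urls` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd


def apply_manual_urls(df, manual_urls):
    """Return a copy of df with wiki_url filled in for the given IATA codes."""
    patched = df.copy()
    for iata, url in manual_urls.items():
        patched.loc[patched["iata"] == iata, "wiki_url"] = url
    return patched


# Made-up miniature of the airports table
frame = pd.DataFrame(
    {
        "iata": ["BNE", "TER", "LHR"],
        "wiki_url": [None, None, "https://en.wikipedia.org/wiki/Heathrow_Airport"],
    }
)
manual_urls = {
    "BNE": "https://en.wikipedia.org/wiki/Brisbane_Airport",
    "TER": "https://en.wikipedia.org/wiki/Lajes_Field",
}

patched = apply_manual_urls(frame, manual_urls)
print(patched["wiki_url"].isnull().sum())
```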


Save the updated data.

In [12]:
data.to_csv("./data/top_airports_basic_data.csv")

# Part 3:

Get the Wikipedia airport names (with underscores) to make querying easier, and test a script that extracts a list of destinations from a Wikipedia article.

### Getting Wikipedia names


Populating the Wikipedia names in our top_airports_basic_data.csv

In [13]:
# The Wikipedia name is simply the part of the URL after the /wiki/ prefix:
print("name:", "https://en.wikipedia.org/wiki/Victoria_International_Airport".split("https://en.wikipedia.org/wiki/")[1])

name: Victoria_International_Airport
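
Some of the manually entered URLs are percent-encoded (for example `Bras%C3%ADlia_International_Airport`), so a plain split leaves the encoding in the name. A minimal sketch of a more defensive extraction, using `urllib.parse.unquote` from the standard library, is shown below; the helper name `wiki_name_from_url` is hypothetical, and whether decoded or encoded names are preferable depends on how the names are used later.

```python
from urllib.parse import unquote

WIKI_PREFIX = "https://en.wikipedia.org/wiki/"


def wiki_name_from_url(url):
    """Return the decoded article name, or None if url is not an English Wikipedia article URL."""
    if not isinstance(url, str) or not url.startswith(WIKI_PREFIX):
        return None
    return unquote(url[len(WIKI_PREFIX):])


plain = wiki_name_from_url("https://en.wikipedia.org/wiki/Victoria_International_Airport")
encoded = wiki_name_from_url("https://en.wikipedia.org/wiki/Bras%C3%ADlia_International_Airport")
missing = wiki_name_from_url(None)
print(plain, encoded, missing)
```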


In [14]:
data = pd.read_csv("./data/top_airports_basic_data.csv")
data.head(n=1)

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,full_name,iata,city,country,estimated_pax,wiki_url
0,0,1.0,Hartsfield–Jackson Atlanta International Airport,ATL,Atlanta,United States,103902992,https://en.wikipedia.org/wiki/Hartsfield–Jacks...


In [15]:
wiki_names = []

for i, row in data.iterrows():
    url = str(row["wiki_url"])
    # Everything after the /wiki/ prefix is the Wikipedia name
    wiki_names.append(url.split("https://en.wikipedia.org/wiki/")[1])

data["wiki_name"] = wiki_names
# Overwrite the CSV file with the new column included
data.to_csv("./data/top_airports_basic_data.csv")
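
The `Unnamed: 0`, `Unnamed: 0.1` and `Unnamed: 0.2` columns seen in the outputs above come from writing the DataFrame index on every save and reading it back as a regular column. The sketch below shows the effect on a made-up two-row frame, and how `index=False` avoids it; it uses in-memory buffers rather than the notebook's CSV file.

```python
import io

import pandas as pd

frame = pd.DataFrame({"iata": ["ATL", "PEK"], "city": ["Atlanta", "Beijing"]})

# Default: the index is written as an unnamed first column
with_index_buffer = io.StringIO()
frame.to_csv(with_index_buffer)
with_index_buffer.seek(0)
with_index = pd.read_csv(with_index_buffer)

# index=False: only the real columns are written
without_index_buffer = io.StringIO()
frame.to_csv(without_index_buffer, index=False)
without_index_buffer.seek(0)
without_index = pd.read_csv(without_index_buffer)

print(list(with_index.columns))
print(list(without_index.columns))
```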

# Part 4: Transition into getting routes data (in the other Jupyter notebook)

### Testing a script that fetches a Wikipedia page and extracts destination tuples for the destinations currently listed there, using Beautiful Soup

In [92]:
def get_destinations(iata_source, wiki_name):
    """Print one CSV-style line per destination listed on an airport's Wikipedia page."""
    url = f"https://en.wikipedia.org/wiki/{wiki_name}"
    response = requests.get(url)

    soup = BeautifulSoup(response.text, 'html.parser')
    # Find the heading that precedes the destination table
    heading = soup.find("h2", string="Airlines and destinations")
    if heading is None:
        print("No 'Airlines and destinations' section found for", wiki_name)
        return
    # Advance to the next table with the predictable 'wikitable' class
    table = heading.find_next("table")
    while table is not None and 'wikitable' not in (table.get("class") or []):
        table = table.find_next("table")
    if table is None:
        print("No destination table found for", wiki_name)
        return
    rows = table.find_all("tr")

    for row in rows[1:]:  # exclude the header row
        # Extract all cells (td or th)
        cols = row.find_all(["td", "th"])
        if len(cols) < 2:
            continue  # not an airline/destinations row
        # The first column is the airline
        airline = cols[0].get_text(strip=True)
        print("current airline:", airline)
        # The second column holds the list of destinations
        destinations = cols[1]
        # Seasonal destinations always come last, so start as non-seasonal
        isSeasonal = 0
        for child in destinations.children:
            # Anchor elements are the only destinations
            if child.name == "a":
                # The title is the official Wikipedia airport name (with spaces, not underscores)
                dest_name = child.get('title')
                if dest_name is None:
                    continue
                dest_name = dest_name.replace(" ", "_")
                # Final output line that would be appended to the file
                output = f"{iata_source},{wiki_name},{dest_name},{airline},{isSeasonal}\n"
                print("DESTINATION FOUND:", output)
            elif child.name == "b" and child.text == "Seasonal:":
                isSeasonal = 1  # every destination after this marker is seasonal
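
The seasonal-flag logic can be exercised offline on a hand-written cell. The sketch below returns tuples instead of printing; the sample HTML is a simplified assumption about how a destinations cell looks, and `parse_destination_cell` is a hypothetical helper rather than part of the script above.

```python
from bs4 import BeautifulSoup, Tag


def parse_destination_cell(cell_html):
    """Return (destination_name, is_seasonal) tuples from one destinations cell."""
    cell = BeautifulSoup(cell_html, "html.parser").find("td")
    results = []
    is_seasonal = 0
    for child in cell.children:
        if not isinstance(child, Tag):
            continue  # skip plain text such as commas
        if child.name == "a" and child.get("title"):
            results.append((child["title"].replace(" ", "_"), is_seasonal))
        elif child.name == "b" and child.get_text(strip=True) == "Seasonal:":
            is_seasonal = 1  # everything after the marker is seasonal
    return results


# Simplified, hand-written cell (an assumption, not copied from Wikipedia)
sample_cell = (
    '<td><a href="/wiki/Dublin_Airport" title="Dublin Airport">Dublin</a>, '
    '<a href="/wiki/Shannon_Airport" title="Shannon Airport">Shannon</a><br/>'
    '<b>Seasonal:</b> '
    '<a href="/wiki/Manchester_Airport" title="Manchester Airport">Manchester</a></td>'
)
parsed = parse_destination_cell(sample_cell)
print(parsed)
```

Returning tuples rather than printing makes it straightforward to collect the rows from many airports and write them out in one pass.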
            


Testing on various airports

In [93]:
get_destinations("JFK","John_F._Kennedy_International_Airport")

current airline: Aer Lingus
DESTINATION FOUND: JFK,John_F._Kennedy_International_Airport,Dublin_Airport,Aer Lingus,0

DESTINATION FOUND: JFK,John_F._Kennedy_International_Airport,Manchester_Airport,Aer Lingus,0

DESTINATION FOUND: JFK,John_F._Kennedy_International_Airport,Shannon_Airport,Aer Lingus,0

current airline: Aeroméxico
DESTINATION FOUND: JFK,John_F._Kennedy_International_Airport,Mexico_City_International_Airport,Aeroméxico,0

DESTINATION FOUND: JFK,John_F._Kennedy_International_Airport,Monterrey_International_Airport,Aeroméxico,1

current airline: Air Canada Express
DESTINATION FOUND: JFK,John_F._Kennedy_International_Airport,Montréal–Trudeau_International_Airport,Air Canada Express,0

DESTINATION FOUND: JFK,John_F._Kennedy_International_Airport,Toronto_Pearson_International_Airport,Air Canada Express,0

current airline: Air China
DESTINATION FOUND: JFK,John_F._Kennedy_International_Airport,Beijing_Capital_International_Airport,Air China,0

current airline: Air Europa
DESTINA

In [94]:

get_destinations("DFW","Dallas_Fort_Worth_International_Airport")

current airline: Aeroméxico
DESTINATION FOUND: DFW,Dallas_Fort_Worth_International_Airport,Mexico_City_International_Airport,Aeroméxico,0

current airline: Air Canada
DESTINATION FOUND: DFW,Dallas_Fort_Worth_International_Airport,Toronto_Pearson_International_Airport,Air Canada,0

DESTINATION FOUND: DFW,Dallas_Fort_Worth_International_Airport,Montréal–Trudeau_International_Airport,Air Canada,1

current airline: Air Canada Express
DESTINATION FOUND: DFW,Dallas_Fort_Worth_International_Airport,Montréal–Trudeau_International_Airport,Air Canada Express,1

current airline: Air France
DESTINATION FOUND: DFW,Dallas_Fort_Worth_International_Airport,Charles_de_Gaulle_Airport,Air France,0

current airline: Alaska Airlines
DESTINATION FOUND: DFW,Dallas_Fort_Worth_International_Airport,Portland_International_Airport,Alaska Airlines,0

DESTINATION FOUND: DFW,Dallas_Fort_Worth_International_Airport,Seattle–Tacoma_International_Airport,Alaska Airlines,0

current airline: American Airlines
DESTINATION

In [95]:
get_destinations("OUI","Ushant_Airport")

current airline: Finist'air
DESTINATION FOUND: OUI,Ushant_Airport,Brest_Bretagne_Airport,Finist'air,0



In [96]:
get_destinations("CAN", "Guangzhou_Baiyun_International_Airport")

current airline: 9 Air
DESTINATION FOUND: CAN,Guangzhou_Baiyun_International_Airport,Beijing_Daxing_International_Airport,9 Air,0

DESTINATION FOUND: CAN,Guangzhou_Baiyun_International_Airport,Changchun_Longjia_International_Airport,9 Air,0

DESTINATION FOUND: CAN,Guangzhou_Baiyun_International_Airport,Chengde_Puning_Airport,9 Air,0

DESTINATION FOUND: CAN,Guangzhou_Baiyun_International_Airport,Chengdu_Tianfu_International_Airport,9 Air,0

DESTINATION FOUND: CAN,Guangzhou_Baiyun_International_Airport,Chifeng_Yulong_Airport,9 Air,0

DESTINATION FOUND: CAN,Guangzhou_Baiyun_International_Airport,Chongqing_Jiangbei_International_Airport,9 Air,0

DESTINATION FOUND: CAN,Guangzhou_Baiyun_International_Airport,Dalian_Zhoushuizi_International_Airport,9 Air,0

DESTINATION FOUND: CAN,Guangzhou_Baiyun_International_Airport,Haikou_Meilan_International_Airport,9 Air,0

DESTINATION FOUND: CAN,Guangzhou_Baiyun_International_Airport,Hangzhou_Xiaoshan_International_Airport,9 Air,0

DESTINATION FOUND: CA

In [97]:
get_destinations("VDM","Gobernador_Edgardo_Castello_Airport")


current airline: Aerolíneas Argentinas
DESTINATION FOUND: VDM,Gobernador_Edgardo_Castello_Airport,Aeroparque_Jorge_Newbery,Aerolíneas Argentinas,0

DESTINATION FOUND: VDM,Gobernador_Edgardo_Castello_Airport,San_Carlos_de_Bariloche_Airport,Aerolíneas Argentinas,0



In [98]:
get_destinations("FRA","Frankfurt_Airport")

current airline: Aegean Airlines
DESTINATION FOUND: FRA,Frankfurt_Airport,Athens_International_Airport,Aegean Airlines,0

DESTINATION FOUND: FRA,Frankfurt_Airport,Thessaloniki_International_Airport,Aegean Airlines,0

DESTINATION FOUND: FRA,Frankfurt_Airport,Heraklion_International_Airport,Aegean Airlines,1

current airline: Aer Lingus
DESTINATION FOUND: FRA,Frankfurt_Airport,Dublin_Airport,Aer Lingus,0

current airline: Air Algérie
DESTINATION FOUND: FRA,Frankfurt_Airport,Houari_Boumediene_Airport,Air Algérie,0

current airline: Air Astana
DESTINATION FOUND: FRA,Frankfurt_Airport,Almaty_International_Airport,Air Astana,0

DESTINATION FOUND: FRA,Frankfurt_Airport,Nursultan_Nazarbayev_International_Airport,Air Astana,0

DESTINATION FOUND: FRA,Frankfurt_Airport,Oral_Ak_Zhol_Airport,Air Astana,0

current airline: Air Cairo
DESTINATION FOUND: FRA,Frankfurt_Airport,Hurghada_International_Airport,Air Cairo,0

DESTINATION FOUND: FRA,Frankfurt_Airport,Marsa_Alam_International_Airport,Air Cairo,

In [99]:
get_destinations("PAP","Toussaint_Louverture_International_Airport")

current airline: Air Caraïbes
DESTINATION FOUND: PAP,Toussaint_Louverture_International_Airport,Orly_Airport,Air Caraïbes,0

current airline: Air France
DESTINATION FOUND: PAP,Toussaint_Louverture_International_Airport,Pointe-à-Pitre_International_Airport,Air France,0

current airline: Air Transat
DESTINATION FOUND: PAP,Toussaint_Louverture_International_Airport,Montréal-Pierre_Elliott_Trudeau_International_Airport,Air Transat,0

current airline: American Airlines
DESTINATION FOUND: PAP,Toussaint_Louverture_International_Airport,Miami_International_Airport,American Airlines,0

current airline: Caicos Express Airways
DESTINATION FOUND: PAP,Toussaint_Louverture_International_Airport,Providenciales_International_Airport,Caicos Express Airways,0

current airline: InterCaribbean Airways
DESTINATION FOUND: PAP,Toussaint_Louverture_International_Airport,Providenciales_International_Airport,InterCaribbean Airways,0

current airline: JetBlue
DESTINATION FOUND: PAP,Toussaint_Louverture_Internati

In [100]:
get_destinations("PEK","Beijing_Capital_International_Airport")

current airline: Air Algérie
DESTINATION FOUND: PEK,Beijing_Capital_International_Airport,Houari_Boumediene_Airport,Air Algérie,0

current airline: Air Astana
DESTINATION FOUND: PEK,Beijing_Capital_International_Airport,Almaty_International_Airport,Air Astana,0

current airline: Air Canada
DESTINATION FOUND: PEK,Beijing_Capital_International_Airport,Vancouver_International_Airport,Air Canada,0

current airline: Air China
DESTINATION FOUND: PEK,Beijing_Capital_International_Airport,Aksu_Hongqipo_Airport,Air China,0

DESTINATION FOUND: PEK,Beijing_Capital_International_Airport,Nursultan_Nazarbayev_International_Airport,Air China,0

DESTINATION FOUND: PEK,Beijing_Capital_International_Airport,Athens_International_Airport,Air China,0

DESTINATION FOUND: PEK,Beijing_Capital_International_Airport,Auckland_Airport,Air China,0

DESTINATION FOUND: PEK,Beijing_Capital_International_Airport,Suvarnabhumi_Airport,Air China,0

DESTINATION FOUND: PEK,Beijing_Capital_International_Airport,Josep_Tarrad