# Aircraft Crashes Data Collection And Cleaning

## Overview

This notebook collects and cleans the data used to analyze aircraft accidents recorded since 1918.

### About dataset

The data will be scraped from the [BAAA Crash Archives](https://www.baaa-acro.com/crash-archives) and the [ASN Database](https://asn.flightsafety.org/database/).

**BAAA**

`date` date and local time of the accident<br>
`aircraft_type` aircraft make and model<br>
`operator` operator of the aircraft<br>
`registration` unique code identifying a single aircraft, required by international convention<br>
`flight_phase` phase of the flight when the accident occurred<br>
`flight_type` type of flight (e.g. military)<br>
`survivors` indicates whether there were survivors<br>
`site` type of location where the accident happened (ex: mountains)<br>
`departure` city where the departure was planned<br> 
`arrival` city where the arrival was planned<br> 
`msn` manufacturer's serial number of the aircraft<br>
`yom` year of manufacture of the aircraft involved in the accident<br>
`flight_number` flight number<br>
`location` location of the accident<br>
`country` country where the crash happened<br>
`region` region of the world where the crash happened<br>
`crew_on_board` number of crew members on board at the time of the accident<br>
`crew_fatalities` number of crew members who died in the crash<br>
`pax_on_board` number of passengers on board at the time of the accident<br> 
`pax_fatalities` number of passengers who died in the crash<br>                 
`other_fatalities` other victims of the accident outside of the aircraft<br>
`total_fatalities` total number of deaths<br>
`captain_flying_hours` number of flying hours of the captain<br>
`captain_flying_hours_on_type` number of hours the captain flew on the type of aircraft involved in the crash<br>
`copilot_flying_hours` number of flying hours of the copilot<br>  
`copilot_flying_hours_on_type` number of hours the copilot flew on the type of aircraft involved in the crash<br>  
`aircraft_flying_hours` number of flying hours of the aircraft before the crash<br>
`aircraft_flight_cycles` number of flights of the aircraft<br><br>


**ASN**

`date` date of the accident<br>
`time` time of the accident<br>
`type` make and model of the aircraft<br>
`first_flight` year of the aircraft type's first flight<br>
`engine` type and number of engines<br>
`owner` operator of the aircraft<br>
`registration` unique code identifying a single aircraft, required by international convention<br>
`msn` manufacturer's serial number of the aircraft<br>
`year_of_manufacture` year of manufacture of the aircraft involved in the accident<br>
`total_airframe_hrs` number of flying hours of the aircraft before the crash<br>
`cycles` number of flights of the aircraft<br>
`engine_model` make and model of the aircraft engine<br>
`fatalities` total number of fatalities<br>
`occupants` number of crew members and passengers on board<br>
`other_fatalities` other victims of the accident outside of the aircraft<br>
`aircraft_damage` severity of the aircraft damage<br>
`category` type of accident<br>
`location` location of the crash<br>
`phase` phase of the flight when the accident occurred<br>
`nature` type of flight (ex: military)<br>
`departure_airport` airport where the departure was planned<br>
`destination_airport` airport where the arrival was planned<br>
`investigating_agency` agency that produced the accident report<br>
`confidence_rating` quality of the information (ex: missing information)

---

## Data Collection

In [None]:
from bs4 import BeautifulSoup
import math
import os
import pandas as pd
import re
import requests
from urllib.parse import unquote

In [None]:
def export_list_to_csv(data:list, csv_path:str) -> None:
	df = pd.DataFrame(data)
	if not os.path.isfile(csv_path):
		df.to_csv(csv_path, index=False)
	else:
		df.to_csv(csv_path, index=False, header=False, mode='a')
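
To illustrate the append behaviour of `export_list_to_csv`, here is a minimal sketch that calls an equivalent function twice on a temporary file. The records and the file path are made up for the example, and the `os.makedirs` call is an assumption added here: the helper above expects the target folder to exist already.

```python
import os
import tempfile

import pandas as pd


def export_list_to_csv(data: list, csv_path: str) -> None:
    """Append a list of dicts to a CSV file, writing the header only once."""
    df = pd.DataFrame(data)
    # Assumption: create the target folder if it is missing (not in the original helper).
    os.makedirs(os.path.dirname(csv_path) or '.', exist_ok=True)
    write_header = not os.path.isfile(csv_path)
    df.to_csv(csv_path, index=False, header=write_header, mode='a')


# Hypothetical records, only to show that two calls produce a single header row.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'data', 'demo.csv')
export_list_to_csv([{'date': 'Mar 7, 2025', 'registration': None}], demo_path)
export_list_to_csv([{'date': 'Feb 25, 2025', 'registration': 'N525CZ'}], demo_path)

demo_df = pd.read_csv(demo_path)
```

Appending without a header only lines up if every batch has the same keys in the same order, which is why each scraped record below fills every field, using `None` when a value is missing.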

### BAAA

In [None]:
# Scrape total number of accidents
root_url = 'https://www.baaa-acro.com'

response = requests.get(root_url)
soup = BeautifulSoup(response.content, 'html.parser')
accident_files = soup.find('div', {'class': 'total-accident-files'})
nb_crashes = int(accident_files.text.replace(',', ''))
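
The page count used in the next cell follows directly from this total. A small worked example, assuming the counter text looks like `'36,087'` (a hypothetical value chosen to match the row count seen later in this notebook):

```python
import math

# Hypothetical counter text as it might appear on the page.
counter_text = '36,087'
nb_crashes = int(counter_text.replace(',', ''))

nb_rows_per_page = 20
# Round up so that a partially filled last page is still requested.
nb_pages = math.ceil(nb_crashes / nb_rows_per_page)

# Listing pages are requested as ?page=0 up to ?page=nb_pages - 1.
page_indices = range(nb_pages)
```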
	

In [None]:
# Scrape details of all accidents
nb_rows_per_page = 20
nb_pages = math.ceil(nb_crashes / nb_rows_per_page)
csv_path = 'data/baaa_scraped_data.csv'

for i in range(nb_pages):
	listing_url = '{}/crash-archives?page={}'.format(root_url, i)
	response = requests.get(listing_url)
	soup = BeautifulSoup(response.content, 'html.parser')
	anchors = soup.find_all('a', {'class': 'red-btn'})

	crash_list = []
	
	for j, a in enumerate(anchors):
		link = a['href']
		#print('Page {}, link {}: {}{}'.format(i, j + 1, root_url, link))
		details_url = root_url + link
		response = requests.get(details_url)
		soup = BeautifulSoup(response.content, 'html.parser')
		details = {}
		
		details_div = soup.find('div', {'class': 'crash-details'})
		
		date_div = details_div.find('div', {'class': 'crash-date'})
		details['date'] = date_div.find('span').next_sibling.text.strip() if date_div else None
		
		aircraft_div = details_div.find('div', {'class': 'crash-aircraft'})
		details['aircraft_type'] = aircraft_div.find('a').find('div').text if aircraft_div else None
		
		operator_div = details_div.find('div', {'class': 'crash-operator'})

		if operator_div:
			if (operator_div.find('img')): # Extract operator name from image link
				pattern = re.compile(r'(?<=target_id=).*(?= \(\d+\))')
				img_link = unquote(operator_div.find('img').parent['href'])
				details['operator'] = pattern.search(img_link).group(0)
			else:
				details['operator'] = operator_div.find('a').find('div').text
		else:
			details['operator'] = None

		reg_div = details_div.find('div', {'class': 'crash-registration'})
		details['registration'] = reg_div.find('div').text if reg_div else None
		
		flight_phase_div = details_div.find('div', {'class': 'crash-flight-phase'})
		details['flight_phase'] = flight_phase_div.find('a').find('div').text if flight_phase_div else None
		
		flight_type_div = details_div.find('div', {'class': 'crash-flight-type'})
		details['flight_type'] = flight_type_div.find('a').find('div').text if flight_type_div else None
		
		survivors_div = details_div.find('div', {'class': 'crash-survivors'})
		details['survivors'] = survivors_div.find('a').find('div').text if survivors_div else None
		
		site_div = details_div.find('div', {'class': 'crash-site'})
		details['site'] = site_div.find('a').find('div').text if site_div else None
		
		schedule_div = details_div.find('div', {'class': 'crash-schedule'})
		details['schedule'] = schedule_div.find('div').text if schedule_div else None
		
		msn_div = details_div.find('div', {'class': 'crash-construction-num'})
		details['msn'] = msn_div.find('div').text if msn_div else None
		
		yom_div = details_div.find('div', {'class': 'crash-yom'})
		details['yom'] = yom_div.find('div').text if yom_div else None

		flight_number = details_div.find('div', {'class': 'crash-flight-number'})
		details['flight_number'] = flight_number.find('div').text if flight_number else None
		
		location_div = details_div.find('div', {'class': 'crash-location'})
		if location_div:
			location_details = location_div.select('a')
			details['location'] = ', '.join(item.text.strip() for item in location_details) if location_details else None
		else:
			details['location'] = None
		
		country_div = details_div.find('div', {'class': 'crash-country'})
		details['country'] = country_div.find('a').find('div').text if country_div else None
		
		region_div = details_div.find('div', {'class': 'crash-region'})
		details['region'] = region_div.find('a').find('div').text if region_div else None
		
		crew_on_board_div = details_div.find('div', {'class': 'crash-crew-on-board'})
		details['crew_on_board'] = crew_on_board_div.find('div').text if crew_on_board_div else None
		
		crew_fatalities_div = details_div.find('div', {'class': 'crash-crew-fatalities'})
		details['crew_fatalities'] = crew_fatalities_div.find('div').text if crew_fatalities_div else None
		
		pax_on_board_div = details_div.find('div', {'class': 'crash-pax-on-board'})
		details['pax_on_board'] = pax_on_board_div.find('div').text if pax_on_board_div else None
		
		pax_fatalities_div = details_div.find('div', {'class': 'crash-pax-fatalities'})
		details['pax_fatalities'] = pax_fatalities_div.find('div').text if pax_fatalities_div else None
		
		others_div = details_div.find('div', {'class': 'crash-other-fatalities'})
		details['other_fatalities'] = others_div.find('div').text if others_div else None
		
		total_fatalities_div = details_div.find('div', {'class': 'crash-total-fatalities'})
		details['total_fatalities'] = total_fatalities_div.find('div').text if total_fatalities_div else None

		captain_hours_div = details_div.find('div', {'class': 'captain-total-flying-hours'})
		details['captain_flying_hours'] = captain_hours_div.find('div').text if captain_hours_div else None

		captain_hours_type_div = details_div.find('div', {'class': 'captain-total-hours-type'})
		details['captain_flying_hours_on_type'] = captain_hours_type_div.find('div').text if captain_hours_type_div else None

		copilot_hours_div = details_div.find('div', {'class': 'copilot-total-flying-hours'})
		details['copilot_flying_hours'] = copilot_hours_div.find('div').text if copilot_hours_div else None

		copilot_hours_type_div = details_div.find('div', {'class': 'copilot-total-hours-type'})
		details['copilot_flying_hours_on_type'] = copilot_hours_type_div.find('div').text if copilot_hours_type_div else None

		aircraft_hours_div = details_div.find('div', {'class': 'crash-aircraft-flight-hours'})
		details['aircraft_flying_hours'] = aircraft_hours_div.find('div').text if aircraft_hours_div else None

		aircraft_cycles_div = details_div.find('div', {'class': 'crash-aircraft-flight-cycles'})
		details['aircraft_flight_cycles'] = aircraft_cycles_div.find('div').text if aircraft_cycles_div else None
		
		crash_list.append(details)
	
	export_list_to_csv(crash_list, csv_path)
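
When the operator is shown as a logo, the cell above recovers its name from the link around the image. The sketch below isolates that step; the `href` value is hypothetical and the real link format on the site may differ. It also guards against a link that does not match, since calling `.group(0)` on `None` would raise an `AttributeError`.

```python
import re
from urllib.parse import unquote

# Hypothetical href; the real link format on the site may differ.
href = '/crash-archives?field_crash_operator_target_id=Air%20France%20%2812345%29'

# Keep what sits between 'target_id=' and the trailing ' (<digits>)'.
pattern = re.compile(r'(?<=target_id=).*(?= \(\d+\))')
match = pattern.search(unquote(href))

# Fall back to None instead of failing when the link does not match.
operator = match.group(0) if match else None
```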


In [None]:
# Scrape accident causes
csv_path = 'data/baaa_crash_reasons.csv'

reasons = {
  'Human factor': 12990,
  'Other causes': 12992,
  'Technical failure': 12988,
  'Terrorism act, hijacking, sabotage, any kind of hostile action': 12991,
  'Unknown': 12993,
  'Weather': 12989
}

for reason, target_id in reasons.items():
	url = 'https://www.baaa-acro.com/crash-archives?field_crash_cause_target_id={}'.format(target_id)
	response = requests.get(url)
	soup = BeautifulSoup(response.content, 'html.parser')
	pattern = re.compile(r'\d+$')
	total_items_txt = soup.find('div', {'class': 'view-header'}).find('span').text
	total_items = int(pattern.search(total_items_txt).group(0))
	nb_items_per_page = 20
	nb_pages = math.ceil(total_items / nb_items_per_page)
	
	for i in range(nb_pages):
		page_url = url + '&page={}'.format(i)
		page_response = requests.get(page_url)
		page_soup = BeautifulSoup(page_response.content, 'html.parser')
		table = page_soup.find('table')
		rows = table.find_all('tr')

		crash_list = []
		for row in rows[1:]: # skip table header
			created = row.find('td', {'class': 'views-field-created'}).find('time').text
			registration_div = row.find('div', {'class': 'registration-field'})
			crash_list.append({
				'date': created,
				'registration': registration_div.text if registration_div else None,
				'cause': reason
			})
		export_list_to_csv(crash_list, csv_path)		
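
The loops above send thousands of requests and stop at the first network error. One possible safeguard is a small retry wrapper; this is only a sketch, and `fetch_with_retry` and `flaky_fetch` are hypothetical names that do not exist in the notebook. The wrapper takes any callable so that it can be tried here without network access.

```python
import time
from typing import Callable, TypeVar

T = TypeVar('T')


def fetch_with_retry(fetch: Callable[[str], T], url: str,
                     retries: int = 3, delay: float = 1.0) -> T:
    """Call fetch(url), retrying after any exception with a fixed pause."""
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception as error:  # broad on purpose for this sketch
            last_error = error
            if attempt < retries - 1:
                time.sleep(delay)
    raise RuntimeError('giving up on {}'.format(url)) from last_error


# Stand-in for a real request: fails twice, then succeeds.
calls = {'n': 0}


def flaky_fetch(url):
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('temporary failure')
    return 'content of ' + url


result = fetch_with_retry(flaky_fetch, 'https://example.org/page', delay=0)
```

In the scraping cells, `fetch` could be something like `lambda u: requests.get(u, timeout=30)`, with a non-zero `delay` to avoid overloading the site.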

### ASN

In [None]:
csv_path = 'data/asn_scraped_data.csv'
root_url = 'https://asn.flightsafety.org'

# Add a browser-like User-Agent header to avoid a 403 Forbidden error
headers = {
  'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36'
}

database = '/database'
database_url = root_url + database

#for year in range(1919, 2026):
for year in range(1973, 2026):
	year_url = '{}{}/year/{}/1'.format(root_url, database, year)
	response = requests.get(year_url, headers=headers)
	soup = BeautifulSoup(response.content, 'html.parser')
	nb_occurences_txt = soup.find('div', {'class': 'innertube'}).find('span').text
	pattern = re.compile(r'\d+(?= occurrences)')
	nb_occurences = int(pattern.search(nb_occurences_txt).group(0))
	max_items_per_page = 100
	nb_pages = math.ceil(nb_occurences / max_items_per_page)

	for page in range(1, nb_pages + 1):
		page_url = '{}{}/year/{}/{}'.format(root_url, database, year, page)
		response = requests.get(page_url, headers=headers)
		soup = BeautifulSoup(response.content, 'html.parser')
		table = soup.find('table', {'class': 'hp'})
		anchors = table.find_all('a')
		links = [a['href'] for a in anchors]
		
		crash_list = []
		for i, link in enumerate(links):
			details_url = root_url + link
			print('Year {}, page {}, item {}, link: {}'.format(year, page, i + 1, details_url))
			response = requests.get(details_url, headers=headers)
			soup = BeautifulSoup(response.content, 'html.parser')
			table = soup.find('table')
			details = {}

			date_label = table.find('td', string='Date:')
			details['date'] = date_label.next_sibling.text

			time_label = table.find('td', string='Time:')
			details['time'] = time_label.next_sibling.text

			type_label = table.find('td', string='Type:')
			anchor = type_label.next_sibling.find('a')

			if anchor: # Get more details about aircraft if link exists
				details['type'] = anchor.text
				href = anchor['href']
				type_url = root_url + href
				type_response = requests.get(type_url, headers=headers)
				type_soup = BeautifulSoup(type_response.content, 'html.parser')
				type_table = type_soup.find('table')
				type_details = list(type_table.find('td', {'valign': 'top'}).stripped_strings)
				details['type_details'] = ', '.join(type_details)
			else:
				details['type'] = type_label.next_sibling.text
				details['type_details'] = None

			owner_label = table.find('td', string='Owner/operator:')
			details['owner'] = owner_label.next_sibling.text

			reg_label = table.find('td', string='Registration:')
			details['registration'] = reg_label.next_sibling.text

			msn_label = table.find('td', string='MSN:')
			details['msn'] = msn_label.next_sibling.text

			yom_label = table.find('td', string='Year of manufacture:')
			details['year_of_manufacture'] = yom_label.next_sibling.text if yom_label else None

			air_hours_label = table.find('td', string='Total airframe hrs:')
			details['total_airframe_hrs'] = air_hours_label.next_sibling.text if air_hours_label else None

			cycles_label = table.find('td', string='Cycles:')
			details['cycles'] = cycles_label.next_sibling.text if cycles_label else None

			engine_label = table.find('td', string='Engine model:')
			details['engine_model'] = engine_label.next_sibling.text if engine_label else None

			fatal_label = table.find('td', string='Fatalities:')
			details['fatalities'] = fatal_label.next_sibling.text

			other_label = table.find('td', string='Other fatalities:')
			details['other_fatalities'] = other_label.next_sibling.text

			damage_label = table.find('td', string='Aircraft damage:')
			details['aircraft_damage'] = damage_label.next_sibling.text

			cat_label = table.find('td', string='Category:')
			details['category'] = cat_label.next_sibling.text if cat_label else None

			loc_label = table.find('td', string='Location:')
			details['location'] = ' '.join(loc_label.next_sibling.stripped_strings)

			phase_label = table.find('td', string='Phase:')
			details['phase'] = phase_label.next_sibling.text

			nature_label = table.find('td', string='Nature:')
			details['nature'] = nature_label.next_sibling.text

			dep_label = table.find('td', string='Departure airport:')
			details['departure_airport'] = dep_label.next_sibling.text

			des_label = table.find('td', string='Destination airport:')
			details['destination_airport'] = des_label.next_sibling.text

			inv_label = table.find('td', string=re.compile('Investigating'))
			details['investigating_agency'] = inv_label.next_sibling.text if inv_label else None

			conf_label = table.find('td', string='Confidence Rating:')
			details['confidence_rating'] = ''.join(conf_label.next_sibling.stripped_strings) if conf_label else None

			crash_list.append(details)
		
		export_list_to_csv(crash_list, csv_path)	
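
Several fields in the cell above call `.next_sibling` directly on the label cell, which fails if a label is missing from a page. A possible helper is sketched below; `get_field` is a hypothetical name, and the HTML is a simplified stand-in because the real ASN markup may differ. It uses `find_next_sibling('td')` so that whitespace between cells is skipped.

```python
from bs4 import BeautifulSoup


def get_field(table, label):
    """Return the text of the cell that follows the `label` cell, or None."""
    label_cell = table.find('td', string=label)
    if label_cell is None:
        return None
    value_cell = label_cell.find_next_sibling('td')
    return value_cell.get_text(strip=True) if value_cell else None


# Hypothetical, simplified markup; the real page layout may differ.
html = '''
<table>
  <tr><td>Date:</td> <td>Saturday 2 August 1919</td></tr>
  <tr><td>Owner/operator:</td> <td>Caproni</td></tr>
</table>
'''
table = BeautifulSoup(html, 'html.parser').find('table')

date_value = get_field(table, 'Date:')
time_value = get_field(table, 'Time:')  # label absent in this sample
```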

---

## Data Exploration

In [81]:
from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Nominatim
import numpy as np
import pandas as pd

### BAAA

In [53]:
# Crashes
baaa_df = pd.read_csv('data/baaa_scraped_data.csv')
baaa_df.head()

Unnamed: 0,date,aircraft_type,operator,registration,flight_phase,flight_type,survivors,site,schedule,msn,...,pax_on_board,pax_fatalities,other_fatalities,total_fatalities,captain_flying_hours,captain_flying_hours_on_type,copilot_flying_hours,copilot_flying_hours_on_type,aircraft_flying_hours,aircraft_flight_cycles
0,"Mar 17, 2025 at 1818 LT",BAe Jetstream 31,Línea Aérea Nacional de Honduras - LANHSA,HR-AYW,Takeoff (climb),Scheduled Revenue Flight,Yes,"Lake, Sea, Ocean, River",Roatán – La Ceiba,863,...,15.0,10.0,0.0,13,,,,,,
1,"Mar 13, 2025 at 0733 LT",Cessna 525 CitationJet CJ2,LBL 525 CZ LLC,N525CZ,Takeoff (climb),Private,No,"Plain, Valley",Mesquite - Addison,525A-0380,...,0.0,0.0,0.0,1,,,,,,
2,"Mar 7, 2025",Antonov AN-32,Indian Air Force - Bharatiya Vayu Sena,,Landing (descent or approach),Military,Yes,Airport (less than 10 km from airport),,,...,0.0,0.0,0.0,0,,,,,,
3,"Mar 4, 2025 at 0954 LT",BAe Jetstream 31,SAETA Perú (Servicios Aéreos Tarapota),OB-2178,Landing (descent or approach),Scheduled Revenue Flight,Yes,Airport (less than 10 km from airport),Iquitos - Güeppí,861,...,11.0,0.0,0.0,0,,,,,,
4,"Feb 25, 2025",Antonov AN-26,Sudanese Air Force - Al Quwwat al-Jawwiya As-S...,,Takeoff (climb),Military,No,City,,,...,13.0,13.0,29.0,46,,,,,,


In [54]:
# Causes
baaa_causes_df = pd.read_csv('data/baaa_crash_reasons.csv', parse_dates=['date'], date_format='%b %d, %Y')
baaa_causes_df.head()

Unnamed: 0,date,registration,cause
0,2025-02-05,UP-A0123,Human factor
1,2025-01-29,N709PS,Human factor
2,2025-01-09,PR-GFS,Human factor
3,2025-01-08,HK-2522,Human factor
4,2024-08-16,C-FZHG,Human factor


In [55]:
baaa_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36087 entries, 0 to 36086
Data columns (total 27 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   date                          36087 non-null  object 
 1   aircraft_type                 36087 non-null  object 
 2   operator                      36085 non-null  object 
 3   registration                  34900 non-null  object 
 4   flight_phase                  35476 non-null  object 
 5   flight_type                   36030 non-null  object 
 6   survivors                     34811 non-null  object 
 7   site                          35720 non-null  object 
 8   schedule                      25713 non-null  object 
 9   msn                           28065 non-null  object 
 10  yom                           26337 non-null  float64
 11  flight_number                 2896 non-null   object 
 12  location                      36076 non-null  object 
 13  c

In [56]:
baaa_causes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34572 entries, 0 to 34571
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   date          34572 non-null  datetime64[ns]
 1   registration  33462 non-null  object        
 2   cause         34572 non-null  object        
dtypes: datetime64[ns](1), object(2)
memory usage: 810.4+ KB


In [57]:
baaa_df.isnull().sum()

date                                0
aircraft_type                       0
operator                            2
registration                     1187
flight_phase                      611
flight_type                        57
survivors                        1276
site                              367
schedule                        10374
msn                              8022
yom                              9750
flight_number                   33191
location                           11
country                             3
region                              2
crew_on_board                      22
crew_fatalities                     1
pax_on_board                       50
pax_fatalities                      4
other_fatalities                   16
total_fatalities                    0
captain_flying_hours            29207
captain_flying_hours_on_type    30242
copilot_flying_hours            33856
copilot_flying_hours_on_type    34097
aircraft_flying_hours           30384
aircraft_fli

In [58]:
baaa_causes_df.isnull().sum()

date               0
registration    1110
cause              0
dtype: int64

In [59]:
# Check for duplicates
baaa_df[baaa_df.duplicated(keep=False)]

Unnamed: 0,date,aircraft_type,operator,registration,flight_phase,flight_type,survivors,site,schedule,msn,...,pax_on_board,pax_fatalities,other_fatalities,total_fatalities,captain_flying_hours,captain_flying_hours_on_type,copilot_flying_hours,copilot_flying_hours_on_type,aircraft_flying_hours,aircraft_flight_cycles
2500,"Jun 15, 2008",Harbin Yunsunji Y-12,China Flying Dragon Special Aviation Company,B-3841,Flight,Geographical / Geophysical / Scientific,Yes,Mountains,,0061,...,2.0,2.0,0.0,3,,,,,,
2501,"Jun 15, 2008",Harbin Yunsunji Y-12,China Flying Dragon Special Aviation Company,B-3841,Flight,Geographical / Geophysical / Scientific,Yes,Mountains,,0061,...,2.0,2.0,0.0,3,,,,,,
7540,"Jun 8, 1988",Lockheed C-130 Hercules,United States Air Force - USAF (since 1947),61-2373,Landing (descent or approach),Training,No,Airport (less than 10 km from airport),Little Rock - Greenville,3720,...,0.0,0.0,0.0,6,,,,,,
7541,"Jun 8, 1988",Lockheed C-130 Hercules,United States Air Force - USAF (since 1947),61-2373,Landing (descent or approach),Training,No,Airport (less than 10 km from airport),Little Rock - Greenville,3720,...,0.0,0.0,0.0,6,,,,,,
7660,"Dec 28, 1987",PZL-Mielec AN-2,Aeroflot - Russian International Airlines,CCCP-02531,Takeoff (climb),Scheduled Revenue Flight,Yes,"Plain, Valley",,1G121-15,...,0.0,0.0,0.0,0,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33821,"Sep 30, 1933",Avro 594 Avian,Holden's Air Transport Services,VH-UIV,Landing (descent or approach),Cargo,Yes,Airport (less than 10 km from airport),Salamaua – Bulolo,193,...,1.0,0.0,0.0,0,,,,,,
35000,"Oct 18, 1928",Douglas M-3,National Air Transport - USA,NC1064,Flight,Postal (mail),No,Mountains,Cleveland – New York,658,...,0.0,0.0,0.0,1,,,,,,
35001,"Oct 18, 1928",Douglas M-3,National Air Transport - USA,NC1064,Flight,Postal (mail),No,Mountains,Cleveland – New York,658,...,0.0,0.0,0.0,1,,,,,,
35540,"Dec 31, 1923",Loening 23 Air Yacht,New York-Newport Air Service,,,Scheduled Revenue Flight,Yes,"Lake, Sea, Ocean, River",,,...,0.0,0.0,0.0,0,,,,,,


In [60]:
baaa_causes_df[baaa_causes_df.duplicated(keep=False)]

Unnamed: 0,date,registration,cause
3989,1987-04-13,VT-DFM,Human factor
3990,1987-04-13,VT-DFM,Human factor
6074,1972-10-23,,Human factor
6075,1972-10-23,,Human factor
6262,1971-07-03,,Human factor
...,...,...,...
34239,1938-09-09,,Weather
34240,1938-09-09,,Weather
34241,1938-09-09,,Weather
34469,1929-09-17,,Weather


### ASN

In [61]:
asn_df = pd.read_csv('data/asn_scraped_data.csv')
asn_df.head()

Unnamed: 0,date,time,type,type_details,owner,registration,msn,year_of_manufacture,total_airframe_hrs,cycles,...,other_fatalities,aircraft_damage,category,location,phase,nature,departure_airport,destination_airport,investigating_agency,confidence_rating
0,Saturday 2 August 1919,,Caproni Ca.48,"Caproni Ca.48, First flight: 1919, 3 Piston en...",Caproni,,,1919.0,,,...,0,"Destroyed, written off",Accident,Verona - Italy,En route,Passenger,Venice-Marco Polo Airport (VCE/LIPZ),Milano-Taliedo Airport,,"Information is only available from news, socia..."
1,Monday 11 August 1919,,Felixstowe Fury,"Felixstowe Fury, First flight: 1918, 5 Piston ...",Royal Air Force - RAF,N123,,1918.0,,,...,0,"Destroyed, written off",Accident,off Felixtowe RNAS - United Kingdom,Initial climb,Military,Felixstowe RNAS,Felixstowe RNAS,,
2,Monday 23 February 1920,,Handley Page O/7,"Handley Page Type O, First flight: 1915, 2 Pis...",Handley Page Transport,G-EANV,HP-7,1919.0,,,...,0,"Destroyed, written off",Accident,"Acadia Siding, Cape Province - South Africa",En route,Passenger - Scheduled,,,,
3,Wednesday 25 February 1920,,Handley Page O/400,"Handley Page Type O, First flight: 1915, 2 Pis...",Handley Page Transport,G-EAMC,HP-27,,,,...,0,"Destroyed, written off",Accident,10 km N of El Shereik - Sudan,Unknown,Unknown,Aswan Airport (ASW/HESN),Khartoum-Civil Airport (KRT/HSSS),,
4,Wednesday 30 June 1920,,Handley Page O/400,"Handley Page Type O, First flight: 1915, 2 Pis...",Handley Page Transport,G-EAKE,HP-22,1919.0,,,...,0,"Destroyed, written off",Accident,Östanå - Sweden,En route,Demo/Airshow/Display,Stockholm (unknown airport),Kjeller Air Base (ENKJ),,


In [62]:
asn_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26566 entries, 0 to 26565
Data columns (total 22 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   date                  26566 non-null  object 
 1   time                  12085 non-null  object 
 2   type                  26566 non-null  object 
 3   type_details          26236 non-null  object 
 4   owner                 26501 non-null  object 
 5   registration          26566 non-null  object 
 6   msn                   26566 non-null  object 
 7   year_of_manufacture   20651 non-null  float64
 8   total_airframe_hrs    5635 non-null   object 
 9   cycles                1641 non-null   object 
 10  engine_model          13704 non-null  object 
 11  fatalities            26566 non-null  object 
 12  other_fatalities      26566 non-null  int64  
 13  aircraft_damage       26566 non-null  object 
 14  category              25765 non-null  object 
 15  location           

In [63]:
asn_df.isnull().sum()

date                        0
time                    14481
type                        0
type_details              330
owner                      65
registration                0
msn                         0
year_of_manufacture      5915
total_airframe_hrs      20931
cycles                  24925
engine_model            12862
fatalities                  0
other_fatalities            0
aircraft_damage             0
category                  801
location                    0
phase                       0
nature                      9
departure_airport        8265
destination_airport      8318
investigating_agency    19919
confidence_rating       14864
dtype: int64

In [64]:
# Check for duplicates
asn_df[asn_df.duplicated(keep=False)]

Unnamed: 0,date,time,type,type_details,owner,registration,msn,year_of_manufacture,total_airframe_hrs,cycles,...,other_fatalities,aircraft_damage,category,location,phase,nature,departure_airport,destination_airport,investigating_agency,confidence_rating
499,Friday 10 May 1940,,Junkers Ju-52/3m,"Junkers Ju-52/3m, First flight: 1932, 3 Piston...",Luftwaffe,,,,,,...,0,"Destroyed, written off",Accident,Waalhaven - Netherlands,Unknown,Military,,,,Little or no information is available
500,Friday 10 May 1940,,Junkers Ju-52/3m,"Junkers Ju-52/3m, First flight: 1932, 3 Piston...",Luftwaffe,,,,,,...,0,"Destroyed, written off",Accident,Waalhaven - Netherlands,Unknown,Military,,,,Little or no information is available
501,Friday 10 May 1940,,Junkers Ju-52/3m,"Junkers Ju-52/3m, First flight: 1932, 3 Piston...",Luftwaffe,,,,,,...,0,"Destroyed, written off",Accident,Waalhaven - Netherlands,Unknown,Military,,,,Little or no information is available
502,Friday 10 May 1940,,Junkers Ju-52/3m,"Junkers Ju-52/3m, First flight: 1932, 3 Piston...",Luftwaffe,,,,,,...,0,"Destroyed, written off",Accident,Waalhaven - Netherlands,Unknown,Military,,,,Little or no information is available
503,Friday 10 May 1940,,Junkers Ju-52/3m,"Junkers Ju-52/3m, First flight: 1932, 3 Piston...",Luftwaffe,,,,,,...,0,"Destroyed, written off",Accident,Waalhaven - Netherlands,Unknown,Military,,,,Little or no information is available
504,Friday 10 May 1940,,Junkers Ju-52/3m,"Junkers Ju-52/3m, First flight: 1932, 3 Piston...",Luftwaffe,,,,,,...,0,"Destroyed, written off",Accident,Waalhaven - Netherlands,Unknown,Military,,,,Little or no information is available
505,Friday 10 May 1940,,Junkers Ju-52/3m,"Junkers Ju-52/3m, First flight: 1932, 3 Piston...",Luftwaffe,,,,,,...,0,"Destroyed, written off",Accident,Rijksweg 13 - Netherlands,Unknown,Military,,,,Little or no information is available
509,Friday 10 May 1940,,Junkers Ju-52/3m,"Junkers Ju-52/3m, First flight: 1932, 3 Piston...",Luftwaffe,,,,,,...,0,"Destroyed, written off",Accident,Rotterdam - Netherlands,Unknown,Military,,,,Little or no information is available
514,Friday 10 May 1940,,Junkers Ju-52/3m,"Junkers Ju-52/3m, First flight: 1932, 3 Piston...",Luftwaffe,,,,,,...,0,"Destroyed, written off",Accident,"Delft - Den Haag, on the road - Netherlands",Unknown,Military,,,,Little or no information is available
515,Friday 10 May 1940,,Junkers Ju-52/3m,"Junkers Ju-52/3m, First flight: 1932, 3 Piston...",Luftwaffe,,,,,,...,0,"Destroyed, written off",Accident,"Delft - Den Haag, on the road - Netherlands",Unknown,Military,,,,Little or no information is available


---

## Data Cleaning

In [65]:
# Remove duplicates
baaa_df = baaa_df.drop_duplicates()
baaa_causes_df = baaa_causes_df.drop_duplicates()
asn_df = asn_df.drop_duplicates()

In [66]:
# Strip whitespaces
def remove_whitespaces(df):
	for column in df.columns:
		if df[column].dtype == 'object':
			df[column] = df[column].str.strip()
	return df

baaa_df = remove_whitespaces(baaa_df)
baaa_causes_df = remove_whitespaces(baaa_causes_df)
asn_df = remove_whitespaces(asn_df)
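
One caveat with `.str.strip()`: on an `object` column that mixes strings with other values, the non-string values come back as `NaN`. Whether any column here is affected depends on how `read_csv` inferred the types, so the sketch below only demonstrates the behaviour on a made-up series and shows one possible alternative.

```python
import pandas as pd

# Hypothetical object column mixing a padded string, a number and a missing value.
mixed = pd.Series([' N123 ', 42, None], dtype='object')

# .str.strip() returns NaN for the non-string element 42.
stripped_with_str = mixed.str.strip()

# Alternative sketch: strip only real strings and leave other values untouched.
stripped_safely = mixed.map(lambda v: v.strip() if isinstance(v, str) else v)
```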

### Merge dataframes on date and registration number

Although unlikely, the same aircraft can be involved in more than one accident, so the registration number alone does not identify a crash. Combining the registration number with the date gives a key that should identify each accident.

The BAAA dataset is the main (left) one because it is the more reliable of the two; the ASN dataset is the second (right) one.
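
A point to keep in mind with this key: many registrations are missing, and pandas matches missing key values against each other during a merge. The sketch below uses made-up rows to show the effect; how much it matters for the real data would need to be checked after merging.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the first accident has no registration.
left = pd.DataFrame({
    'registration': [np.nan, 'N525CZ'],
    'date_str': ['1972-10-23', '2025-03-13'],
    'operator': ['Operator A', 'Operator B'],
})
# Two cause rows on the same day, both without a registration.
right = pd.DataFrame({
    'registration': [np.nan, np.nan, 'N525CZ'],
    'date_str': ['1972-10-23', '1972-10-23', '2025-03-13'],
    'cause': ['Human factor', 'Weather', 'Human factor'],
})

merged = pd.merge(left, right, how='left', on=['registration', 'date_str'])
# The missing registration on the left matches both missing ones on the right,
# so the single 'Operator A' accident appears twice.
```

Comparing the row count before and after a merge is a quick way to spot this kind of duplication.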

In [67]:
# Convert BAAA date to datetime
baaa_df['date'] = pd.to_datetime(baaa_df['date'], format='%b %d, %Y at %H%M LT', errors='coerce') \
				.fillna(pd.to_datetime(baaa_df['date'], format='%b %d, %Y', errors='coerce'))
assert baaa_df['date'].isna().sum() == 0
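
The conversion above tries the format with a local time first and falls back to the date-only format. The same logic for a single value, written with the standard library (`parse_baaa_date` is a hypothetical name used only for this illustration):

```python
from datetime import datetime

# Same two formats as in the cell above, most specific first.
FORMATS = ('%b %d, %Y at %H%M LT', '%b %d, %Y')


def parse_baaa_date(text):
    """Try the format with a local time first, then the date-only format."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None


with_time = parse_baaa_date('Mar 17, 2025 at 1818 LT')
date_only = parse_baaa_date('Mar 7, 2025')
```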

In [68]:
# Convert ASN date to datetime
asn_df['date'] = pd.to_datetime(asn_df['date'], format='%A %d %B %Y', errors='coerce')
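
With `errors='coerce'`, any value that does not fit the format silently becomes `NaT`, and unlike the BAAA conversion there is no assertion afterwards. A minimal sketch of how the failures could be listed; the second value is a hypothetical string, not one taken from the dataset:

```python
import pandas as pd

# The first value comes from the notebook output; the second is a
# hypothetical string that does not fit the format.
raw = pd.Series(['Saturday 2 August 1919', 'date unknown'])
parsed = pd.to_datetime(raw, format='%A %d %B %Y', errors='coerce')

# Rows that failed to parse become NaT; list them before relying on the column.
unparsed = raw[parsed.isna()]
```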

In [69]:
# Create date string column
baaa_df['date_str'] = baaa_df['date'].dt.strftime('%Y-%m-%d')
baaa_causes_df['date_str'] = baaa_causes_df['date'].dt.strftime('%Y-%m-%d')
asn_df['date_str'] = asn_df['date'].dt.strftime('%Y-%m-%d')

In [70]:
# Merge three dataframes
df = pd.merge(left=baaa_df, right=baaa_causes_df, how='left', on=['registration', 'date_str'])
df = pd.merge(left=df, right=asn_df, how='left', on=['registration', 'date_str'])
df.head()

Unnamed: 0,date_x,aircraft_type,operator,registration,flight_phase,flight_type,survivors,site,schedule,msn_x,...,other_fatalities_y,aircraft_damage,category,location_y,phase,nature,departure_airport,destination_airport,investigating_agency,confidence_rating
0,2025-03-17 18:18:00,BAe Jetstream 31,Línea Aérea Nacional de Honduras - LANHSA,HR-AYW,Takeoff (climb),Scheduled Revenue Flight,Yes,"Lake, Sea, Ocean, River",Roatán – La Ceiba,863,...,0.0,Destroyed,Accident,off Juan Manuel Gálvez International Airport (...,Initial climb,Passenger,Roatán-Juan Manuel Gálvez International Airpor...,La Ceiba-Goloson International Airport (LCE/MHLC),,"Information is only available from news, socia..."
1,2025-03-13 07:33:00,Cessna 525 CitationJet CJ2,LBL 525 CZ LLC,N525CZ,Takeoff (climb),Private,No,"Plain, Valley",Mesquite - Addison,525A-0380,...,0.0,Destroyed,Accident,"near Mesquite Metro Airport (KHQZ), Mesquite, ...",Initial climb,Ferry/positioning,"Mesquite Metro Airport, TX (KHQZ)","Dallas-Addison Airport, TX (ADS/KADS)",NTSB,"Information is only available from news, socia..."
2,2025-03-07 00:00:00,Antonov AN-32,Indian Air Force - Bharatiya Vayu Sena,,Landing (descent or approach),Military,Yes,Airport (less than 10 km from airport),,,...,,,,,,,,,,
3,2025-03-04 09:54:00,BAe Jetstream 31,SAETA Perú (Servicios Aéreos Tarapota),OB-2178,Landing (descent or approach),Scheduled Revenue Flight,Yes,Airport (less than 10 km from airport),Iquitos - Güeppí,861,...,0.0,Destroyed,Accident,Güeppi Airport (SPGP) - Peru,Landing,Passenger,Iquitos-Coronel FAP Francisco Secada Vignetta ...,Güeppi Airport (SPGP),,"Information is only available from news, socia..."
4,2025-02-25 00:00:00,Antonov AN-26,Sudanese Air Force - Al Quwwat al-Jawwiya As-S...,,Takeoff (climb),Military,No,City,,,...,,,,,,,,,,
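
pandas matches rows whose join keys are both null, so BAAA rows with a missing `registration` could in principle pair with ASN rows that also lack one on the same date. The sketch below shows one way to guard against that and to count how many rows found a match. The two small frames are hypothetical stand-ins for `baaa_df` and `asn_df`, and `right_keyed` and `match_counts` are illustrative names.

```python
import pandas as pd

# Hypothetical miniature frames standing in for baaa_df and asn_df
left = pd.DataFrame({
    'registration': ['HR-AYW', None, 'OB-2178'],
    'date_str': ['2025-03-17', '2025-03-07', '2025-03-04'],
    'operator': ['LANHSA', 'Indian Air Force', 'SAETA Perú'],
})
right = pd.DataFrame({
    'registration': ['HR-AYW', None],
    'date_str': ['2025-03-17', '2025-03-07'],
    'aircraft_damage': ['Destroyed', 'Substantial'],
})

# Drop right-hand rows with a missing key so they cannot pair up with
# left-hand rows whose registration is also missing
right_keyed = right.dropna(subset=['registration'])

# indicator=True adds a '_merge' column; validate raises if the right-hand
# keys are not unique, which would silently duplicate left-hand rows
merged = pd.merge(left, right_keyed, how='left',
                  on=['registration', 'date_str'],
                  indicator=True, validate='many_to_one')

match_counts = merged['_merge'].value_counts()
print(match_counts)
```

Comparing `len(merged)` with `len(left)` is a quick additional check that the left join did not add rows.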


In [71]:
df.columns

Index(['date_x', 'aircraft_type', 'operator', 'registration', 'flight_phase',
       'flight_type', 'survivors', 'site', 'schedule', 'msn_x', 'yom',
       'flight_number', 'location_x', 'country', 'region', 'crew_on_board',
       'crew_fatalities', 'pax_on_board', 'pax_fatalities',
       'other_fatalities_x', 'total_fatalities', 'captain_flying_hours',
       'captain_flying_hours_on_type', 'copilot_flying_hours',
       'copilot_flying_hours_on_type', 'aircraft_flying_hours',
       'aircraft_flight_cycles', 'date_str', 'date_y', 'cause', 'date', 'time',
       'type', 'type_details', 'owner', 'msn_y', 'year_of_manufacture',
       'total_airframe_hrs', 'cycles', 'engine_model', 'fatalities',
       'other_fatalities_y', 'aircraft_damage', 'category', 'location_y',
       'phase', 'nature', 'departure_airport', 'destination_airport',
       'investigating_agency', 'confidence_rating'],
      dtype='object')

In [72]:
# Remove time from datetime and drop the other date and time columns
df['date'] = pd.to_datetime(df['date_str'])
df = df.drop(['date_x', 'date_y', 'time', 'date_str'], axis=1)

In [73]:
# Keep data from 1970 to now
df = df[df['date'].dt.year >= 1970]

In [74]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 13647 entries, 0 to 13646
Data columns (total 47 columns):
 #   Column                        Non-Null Count  Dtype         
---  ------                        --------------  -----         
 0   aircraft_type                 13647 non-null  object        
 1   operator                      13646 non-null  object        
 2   registration                  13383 non-null  object        
 3   flight_phase                  13440 non-null  object        
 4   flight_type                   13619 non-null  object        
 5   survivors                     13299 non-null  object        
 6   site                          13526 non-null  object        
 7   schedule                      10746 non-null  object        
 8   msn_x                         13084 non-null  object        
 9   yom                           12975 non-null  float64       
 10  flight_number                 2093 non-null   object        
 11  location_x                    136

### Add latitude and longitude

In [75]:
# Merge location (BAAA, then ASN, then country)
df['location_y'] = df['location_y'].str.replace(' - ', ', ')
df['location'] = df['location_x'].fillna(df['location_y']).fillna(df['country'])
df = df.drop(['location_x', 'location_y'], axis=1)
assert df['location'].isnull().sum() == 0
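
The fallback order (BAAA location, then ASN location, then country) can be checked on a tiny frame. This is a sketch with hypothetical values; `demo` is an illustrative name and only the column names come from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: BAAA location present, only ASN location present, neither present
demo = pd.DataFrame({
    'location_x': ['Roatán', np.nan, np.nan],
    'location_y': ['off Roatán', 'Güeppi Airport (SPGP) - Peru', np.nan],
    'country': ['Honduras', 'Peru', 'Sudan'],
})

# Same steps as the cell above: normalise the ASN separator, then coalesce
demo['location_y'] = demo['location_y'].str.replace(' - ', ', ', regex=False)
demo['location'] = demo['location_x'].fillna(demo['location_y']).fillna(demo['country'])
print(demo['location'].tolist())
```

Each row takes the first non-missing value in that order, so the country is used only when both sources lack a location.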

In [90]:
# Get coordinates from geocoder
geolocator = Nominatim(user_agent='aircraft_crashes_analysis')
# Nominatim's usage policy allows at most one request per second
geocoder = RateLimiter(geolocator.geocode, min_delay_seconds=1)

def get_coord(row, column: str = 'location') -> tuple:
	"""Return (latitude, longitude) for row[column], or (nan, nan) if no match is found."""
	location = geocoder(row[column], language='en', exactly_one=True)

	if location is None:
		return (np.nan, np.nan)

	print(f'Coordinates: ({location.latitude}, {location.longitude})')
	return (location.latitude, location.longitude)
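
Each geocoder call waits at least one second, so rows that share a location string repeat the same request. One possible way to avoid that is to look up each distinct string once and reuse the result. The sketch below uses only the standard library and a stub in place of the real geocoder; `geocode_unique`, `fake_lookup`, and the coordinate values are hypothetical and not part of the notebook.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Coord = Tuple[float, float]

def geocode_unique(locations: Iterable[str],
                   lookup: Callable[[str], Optional[Coord]]) -> Dict[str, Coord]:
    """Call `lookup` once per distinct location string and return a mapping."""
    cache: Dict[str, Coord] = {}
    for name in locations:
        if name in cache:
            continue  # already resolved, no second request
        found = lookup(name)
        # Missing results become (nan, nan), mirroring get_coord
        cache[name] = found if found is not None else (float('nan'), float('nan'))
    return cache

# Stub lookup with made-up coordinates that records how often it is called
calls = []

def fake_lookup(name: str) -> Optional[Coord]:
    calls.append(name)
    return {'Rio de Janeiro': (-22.95, -43.37)}.get(name)

coords = geocode_unique(['Rio de Janeiro', 'Nowhere', 'Rio de Janeiro'], fake_lookup)
print(calls)
print(coords['Rio de Janeiro'])
```

In the notebook, the same idea would mean passing the unique values of `df['location']` and a thin wrapper around `geocoder`, then mapping the resulting dictionary back onto the column.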

In [None]:
# Create columns with coordinates
coordinates = df.apply(get_coord, axis=1, result_type='expand')
coordinates.columns = ['latitude', 'longitude']
coordinates

Coordinates: (16.34902105, -86.49775125625627)
Coordinates: (32.749898900000005, -96.53114976775751)
Coordinates: (26.6981094, 88.3245465)
Coordinates: (-0.1176738, -75.2510798)
Coordinates: (12.2809691, 24.7741672)
Coordinates: (29.3433587, -81.3110131)
Coordinates: (33.4942189, -111.926018)
Coordinates: (8.4111477, 20.648473)
Coordinates: (-23.5506507, -46.6333824)
Coordinates: (64.49906304999999, -165.3781247368321)
Coordinates: (6.8321846, 124.4571005)
Coordinates: (47.1746607, 7.4585386)
Coordinates: (50.253281, 66.914993)
Coordinates: (39.9527237, -75.1635262)
Coordinates: (38.851289449999996, -77.03968893913375)
Coordinates: (9.4770538, 29.6738532)
Coordinates: (10.2890152, -66.6753173)
Coordinates: (35.1649646820541, 128.96658353110658)
Coordinates: (-7.99008235, 27.573188020340893)
Coordinates: (-38.9514242, -72.6254333)
Coordinates: (-23.433162, -45.083415)
Coordinates: (6.3168685, -76.1344983)
Coordinates: (-32.007009499999995, 115.52052709064799)
Coordinates: (34.9905794, 1

RateLimiter caught an error, retrying (0/2 tries). Called with (*('San José-Juan Santamaría, Alajuela (Center-North)',), **{'language': 'en', 'exactly_one': True}).
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.12/site-packages/urllib3/connectionpool.py", line 468, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/opt/anaconda3/lib/python3.12/site-packages/urllib3/connectionpool.py", line 463, in _make_request
    httplib_response = conn.getresponse()
                       ^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/http/client.py", line 1428, in getresponse
    response.begin()
  File "/opt/anaconda3/lib/python3.12/http/client.py", line 331, in begin
    version, status, reason = self._read_status()
                              ^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/http/client.py", line 292, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
               

Coordinates: (15.4702001, -90.3735065)
Coordinates: (40.7970382, -74.4809868)
Coordinates: (37.9331799, -75.3788141)
Coordinates: (18.93313205, -99.25995794657074)
Coordinates: (23.5715402, 111.0139268)
Coordinates: (50.864902, 39.077244)
Coordinates: (50.07671, 30.778891)
Coordinates: (46.797985, 61.661221)
Coordinates: (-2.5055839, 28.8594885)
Coordinates: (35.469331, -76.8927522)
Coordinates: (23.2035785, -106.4208391)
Coordinates: (53.2637767, 158.129562)
Coordinates: (48.406414, -89.259796)
Coordinates: (-15.782571449999999, -47.90641827266603)
Coordinates: (34.9988862, 25.37756405)
Coordinates: (30.2489634, 120.2052342)
Coordinates: (25.416515099999998, -77.87826580832996)
Coordinates: (24.7137524, -81.0903512)
Coordinates: (-2.993575, 27.470205281994577)
Coordinates: (29.690292, -95.8996261)
Coordinates: (42.9956397, -71.4547891)
Coordinates: (42.3727441, -122.87343834417277)
Coordinates: (60.7922222, -161.755833)
Coordinates: (45.216675, -85.013942)
Coordinates: (45.66607815000

RateLimiter caught an error, retrying (0/2 tries). Called with (*('Shanghai-Pudong, Shanghai',), **{'language': 'en', 'exactly_one': True}).

Coordinates: (31.142735899999998, 121.80411312160436)
Coordinates: (15.771172, -84.5394884)
Coordinates: (38.2462145, 43.1065625)
Coordinates: (55.0756557, 45.3591634)
Coordinates: (47.53670165, -116.80823236378586)
Coordinates: (10.0596326, -72.5463146)
Coordinates: (-14.6213903, -57.4907268)
Coordinates: (43.5476008, -96.7293629)
Coordinates: (33.3267128, -83.388485)
Coordinates: (61.5808901, -159.5364417)
Coordinates: (33.3870578, -84.2829784)
Coordinates: (-42.9173055, -71.3216508)
Coordinates: (40.5152491, -107.5464541)
Coordinates: (32.749898900000005, -96.53114976775751)
Coordinates: (45.809234849999996, -108.54767517885944)
Coordinates: (16.7873999, -90.1183212)
Coordinates: (-15.5430787, -55.1590511)
Coordinates: (14.5123016, 121.02188606807516)
Coordinates: (14.4951959, -91.8497718)
Coordinates: (38.5312385, -99.3083345)
Coordinates: (33.4942189, -111.926018)
Coordinates: (15.969131, -85.0932642)
Coordinates: (-13.0453165, 143.3049774337979)
Coordinates: (45.636623, -89.41207

RateLimiter caught an error, retrying (0/2 tries). Called with (*('Santa Rosa-Route 66, New Mexico',), **{'language': 'en', 'exactly_one': True}).

Coordinates: (55.084263, -59.1773419)
Coordinates: (56.3449837, -94.7062234)
Coordinates: (50.7845975, 8.1208181)
Coordinates: (24.665151, -82.8553985)
Coordinates: (33.8708215, -117.929416)
Coordinates: (-41.45479125, -72.91925159064559)
Coordinates: (34.4942683, -89.0078418)
Coordinates: (16.5242969, -90.1884154)
Coordinates: (18.391647772677963, -68.64539723154932)
Coordinates: (42.5599532, -0.714569)
Coordinates: (-23.00602225, -47.142003969661445)
Coordinates: (49.9668352, 8.6659289)
Coordinates: (-24.3557932, 26.090973359547775)
Coordinates: (35.4445183, -97.3075152336067)
Coordinates: (40.2317686, -82.9651045)
Coordinates: (39.1908926, -84.3635507)
Coordinates: (8.750004, 38.981734)
Coordinates: (-34.559455400000004, -58.41436375)
Coordinates: (46.6812216, -68.0154578)
Coordinates: (32.54041615, -93.74500728259136)
Coordinates: (55.96004625, 37.41296395)
Coordinates: (29.7713308, -94.6785092)
Coordinates: (35.9120114, -100.3839018)
Coordinates: (1.827598, -61.125389)
Coordinates

RateLimiter caught an error, retrying (0/2 tries). Called with (*('Guatemala City-La Aurora, Guatemala',), **{'language': 'en', 'exactly_one': True}).

Coordinates: (14.5821741, -90.52902094901863)
Coordinates: (8.5115155, -77.2790562)
Coordinates: (17.6467874, -101.551974)
Coordinates: (-17.7698233, -47.0986827)
Coordinates: (40.9152061, -81.43992419328089)
Coordinates: (9.181376628779413, 32.19393763260591)
Coordinates: (30.8763519, -84.4324676)
Coordinates: (4.8459246, 31.5959173)
Coordinates: (31.5746624, 74.33579487454149)
Coordinates: (14.8854699, -11.2338767)
Coordinates: (26.072017000000002, -80.15099673135214)
Coordinates: (-34.608797, -58.449165)
Coordinates: (26.1003392, -80.399513)
Coordinates: (-26.1360131, 28.24496774636681)
Coordinates: (20.4325631, -100.59215214404156)
Coordinates: (4.61001675, -74.06930256481051)
Coordinates: (56.159227, -120.6867691)
Coordinates: (41.7598919, -84.8721294)
Coordinates: (30.523018999999998, -90.41507115892587)
Coordinates: (3.5834664, -76.4952223)
Coordinates: (2.23548745, 44.99694192797743)
Coordinates: (34.794631, 67.684334)
Coordinates: (9.4051992, -0.8423986)
Coordinates: (11.17297

RateLimiter caught an error, retrying (0/2 tries). Called with (*('Lyon-Bron, Rhône',), **{'language': 'en', 'exactly_one': True}).

Coordinates: (45.72819335, 4.9427041133278555)
Coordinates: (43.4887907, -112.03628)
Coordinates: (62.1445646, 65.4349289)
Coordinates: (-33.0244535, -71.5517636)
Coordinates: (13.68187665, 100.74858027718852)
Coordinates: (40.3100446, -75.1304588)
Coordinates: (49.21230655, -2.1255999596428845)
Coordinates: (38.8339578, -104.825348)
Coordinates: (39.611146, -87.6961374)
Coordinates: (5.3083763, 45.8796603)
Coordinates: (25.55449055, -79.2753655146048)
Coordinates: (60.48309525, -106.46911866978007)
Coordinates: (23.0324259, -102.8941125)
Coordinates: (-3.62220965, 35.806488328097174)
Coordinates: (62.47496825, -114.4304652775906)
Coordinates: (49.49719485, -126.39698830612838)
Coordinates: (33.56349565, -86.75152731622148)
Coordinates: (52.1502914, -1.1581125)
Coordinates: (41.3082138, -72.9250518)
Coordinates: (2.0140883, 45.30403425603224)
Coordinates: (51.421190100000004, 12.229585904866344)
Coordinates: (45.78122545, 3.1641316301933693)
Coordinates: (33.7042605, 110.4272323)
Coord

RateLimiter caught an error, retrying (0/2 tries). Called with (*("Saint John's-V. C. Bird, All Antigua",), **{'language': 'en', 'exactly_one': True}).

Coordinates: (15.5635972, 32.5349123)
Coordinates: (47.18714275, 11.481965913116952)
Coordinates: (27.6939119, 85.3582197389141)
Coordinates: (18.86508265, -70.69128274955573)
Coordinates: (28.555524, 77.0851248)
Coordinates: (32.8406946, -83.6324022)
Coordinates: (59.0827605, 159.9512943)
Coordinates: (18.580374900000002, 73.9182264958952)
Coordinates: (41.79954695, 12.591271133206927)
Coordinates: (32.5661322, -97.3089674572978)
Coordinates: (31.5656822, 74.3141829)
Coordinates: (60.5544444, -151.258333)
Coordinates: (47.3464669, 6.7043732)
Coordinates: (0.1236548, 117.471708)
Coordinates: (-1.0917422, 35.1991268)
Coordinates: (-22.9531726, -43.3715825)
Coordinates: (10.6326436, 30.3793795)
Coordinates: (48.8376296, -1.5959177)
Coordinates: (39.5487452, -89.2941549)
Coordinates: (47.4903839, 9.5540541)
Coordinates: (42.8804219, -8.5458608)
Coordinates: (-21.7609533, -43.3501129)
Coordinates: (34.8688613, -111.7614394)
Coordinates: (-22.9531726, -43.3715825)
Coordinates: (50.2136954, 

RateLimiter caught an error, retrying (0/2 tries). Called with (*('San Juan-Luis Muñoz Marín (Isla Verde), All Puerto Rico',), **{'language': 'en', 'exactly_one': True}).

Coordinates: (-45.1641639, -73.522341)
Coordinates: (68.166667, 19.5)
Coordinates: (35.1823171, -83.3815429)


RateLimiter caught an error, retrying (0/2 tries). Called with (*('San Juan-Luis Muñoz Marín (Isla Verde), All Puerto Rico',), **{'language': 'en', 'exactly_one': True}).

Coordinates: (55.0667545, -132.1476799)
Coordinates: (23.1785489, -75.0883369563434)
Coordinates: (-3.9067038, -70.5151509)
Coordinates: (49.9668352, 8.6659289)
Coordinates: (-43.1200305, -73.6203025)
Coordinates: (20.121629249999998, -103.14594641558197)
Coordinates: (28.5552719, -82.3878709)
Coordinates: (44.3190159, 23.7965614)
Coordinates: (-2.5055839, 28.8594885)
Coordinates: (43.2372099, 6.072772)
Coordinates: (11.6931716, -70.1917178)
Coordinates: (-1.3798578, -48.47964610194296)
Coordinates: (29.1871986, -82.1400923)
Coordinates: (31.6205738, 65.7157573)
Coordinates: (-41.45479125, -72.91925159064559)
Coordinates: (52.6600742, -3.1474212)
Coordinates: (32.7703841, -89.1153488)
Coordinates: (36.7501195, -95.9336707113497)
Coordinates: (-1.2471976, 35.0135428)
Coordinates: (52.5076307, -93.0235632)
Coordinates: (36.2005843, -115.121584)
Coordinates: (52.02481615, 113.31022128113224)
Coordinates: (40.537187849999995, 72.80517527412405)
Coordinates: (26.19595785, -80.18080289928344

RateLimiter caught an error, retrying (0/2 tries). Called with (*('San Juan-Luis Muñoz Marín (Isla Verde), All Puerto Rico',), **{'language': 'en', 'exactly_one': True}).

Coordinates: (40.4531829, -112.36104)
Coordinates: (46.1882007, -123.8319802)
Coordinates: (6.235794, -62.852937)
Coordinates: (26.19595785, -80.18080289928344)
Coordinates: (-6.9060304, 107.5858421)
Coordinates: (17.6118858, 121.7300377)
Coordinates: (46.0131505, -112.536508)
Coordinates: (-6.12075515, 106.67843374344412)
Coordinates: (0.0611715, 32.4698564)
Coordinates: (59.4453384, 29.4838591)
Coordinates: (12.98815675, 77.62260003796)
Coordinates: (9.3148173, -70.6081655)
Coordinates: (52.32698005, 4.74150530038293)
Coordinates: (30.0292907, 31.2349472)
Coordinates: (42.868594, 74.6043881)
Coordinates: (60.5544444, -151.258333)
Coordinates: (25.702096, 32.647186)
Coordinates: (64.4975098, -165.4061701)
Coordinates: (5.4827786, -74.65692612424778)
Coordinates: (51.5166529, -0.170663323598357)
Coordinates: (42.8867166, -78.8783922)
Coordinates: (34.9988862, 25.37756405)
Coordinates: (46.5332403, 9.874586)
Coordinates: (-3.2996768, -60.6213528)
Coordinates: (41.7648867, 12.4770551)
Co

RateLimiter caught an error, retrying (0/2 tries). Called with (*('San José-Juan Santamaría, Alajuela (Center-North)',), **{'language': 'en', 'exactly_one': True}).

Coordinates: (-1.70301095, 26.922519404864875)
Coordinates: (-7.99008235, 27.573188020340893)
Coordinates: (41.3767645, 75.63946056211226)
Coordinates: (15.74603365664981, -86.86237974433374)
Coordinates: (-25.512629, -49.1883148)
Coordinates: (26.2122345, 127.6791452)
Coordinates: (-17.9429225, 145.9285815740916)
Coordinates: (35.1649646820541, 128.96658353110658)
Coordinates: (-17.484320599999997, -149.80705089813)
Coordinates: (57.4086082, -135.4596206)
Coordinates: (-0.52363325, 40.35640534962073)
Coordinates: (33.3315804, -105.673099)
Coordinates: (-37.3679025, 145.15894799344477)
Coordinates: (48.83647775, -125.19982959423098)
Coordinates: (55.34211905, 37.83439345)
Coordinates: (55.3430696, -131.6466819)
Coordinates: (42.1765004, 13.7178999)
Coordinates: (29.8946952, -81.3145395)
Coordinates: (40.1672117, -105.101928)
Coordinates: (-3.3201862, 17.3774958)
Coordinates: (-23.62741465, -46.65556099749749)
Coordinates: (11.11889205, -74.23118878167132)
Coordinates: (-23.62741465, -4

RateLimiter caught an error, retrying (0/2 tries). Called with (*('San Juan-Luis Muñoz Marín (Isla Verde), All Puerto Rico',), **{'language': 'en', 'exactly_one': True}).

Coordinates: (27.9216441, -110.8994059)
Coordinates: (19.3847356, -100.8257528)
Coordinates: (-17.831773, 31.045686)
Coordinates: (-14.8349438, -64.9044936)
Coordinates: (45.3187339, -72.6463074)
Coordinates: (38.263995, -104.6141867)
Coordinates: (42.6603037, -77.0540989)
Coordinates: (9.0668934, 30.8800267)
Coordinates: (19.292545, -99.6569007)
Coordinates: (-32.8894155, -68.8446177)
Coordinates: (34.5266431, 69.1849082)
Coordinates: (15.5635972, 32.5349123)
Coordinates: (40.8598219, -74.0593075)
Coordinates: (60.32181555, 24.94736111181276)
Coordinates: (45.4140984, 6.6349892)
Coordinates: (33.3061701, 44.3872213)
Coordinates: (39.124162600000005, -94.59283735999804)
Coordinates: (47.1615598, 27.5837814)
Coordinates: (51.2915724, 6.7869558)
Coordinates: (-7.99008235, 27.573188020340893)
Coordinates: (-17.3608546, -66.1843653)
Coordinates: (38.8838856, -94.81887)
Coordinates: (41.752695900000006, -111.808733045283)
Coordinates: (-27.2361111, 28.8513889)
Coordinates: (64.2741955, 100.

RateLimiter caught an error, retrying (0/2 tries). Called with (*('Panama City, Panamá',), **{'language': 'en', 'exactly_one': True}).

Coordinates: (8.9714493, -79.5341802)
Coordinates: (39.049906449999995, -84.66515749429266)
Coordinates: (58.502505, -119.400279)
Coordinates: (15.6675884, -96.5536552)
Coordinates: (1.9232201, -67.0606645)
Coordinates: (30.3644888, -97.9875325)
Coordinates: (15.74603365664981, -86.86237974433374)
Coordinates: (-8.673149500000001, 147.2621270017873)
Coordinates: (0.5753738500000001, 29.478005283454408)
Coordinates: (32.6405247, -115.474899)
Coordinates: (-36.5517611, 145.9833562)
Coordinates: (11.6931716, -70.1917178)
Coordinates: (13.1649129, -87.9106084)
Coordinates: (26.19595785, -80.18080289928344)
Coordinates: (58.3838352, -116.0395199)
Coordinates: (43.8487795, -73.4232317)
Coordinates: (48.324280349999995, -80.75000560627319)
Coordinates: (44.5126379, -88.0125794)
Coordinates: (-15.028736949999999, 40.743558362822235)
Coordinates: (7.7001398, 27.9909335)
Coordinates: (66.1504928, -18.9096279)
Coordinates: (5.8158742, -71.7232411)
Coordinates: (45.4277659, -75.710976)
Coordinates

RateLimiter caught an error, retrying (0/2 tries). Called with (*('San Juan-Luis Muñoz Marín (Isla Verde), All Puerto Rico',), **{'language': 'en', 'exactly_one': True}).

Coordinates: (63.0897222, -156.4305556)
Coordinates: (8.3691166, 31.1362454)
Coordinates: (14.6027962, -61.0676724)
Coordinates: (25.6024386, -103.7555886)
Coordinates: (7.79845155, -76.74603894250129)
Coordinates: (21.94263367904466, -102.34798357261975)
Coordinates: (4.61001675, -74.06930256481051)
Coordinates: (-32.3680849, -54.1697146)
Coordinates: (45.8261611, 13.473368702886166)
Coordinates: (-15.1194063, 39.2615886)
Coordinates: (49.9137407, -74.3713954)
Coordinates: (-17.5366464, -149.55616940209092)
Coordinates: (46.5332403, 9.874586)
Coordinates: (51.32421415, -0.8451410407803077)
Coordinates: (32.697580849999994, -80.00563739746423)
Coordinates: (-2.993575, 27.470205281994577)
Coordinates: (30.1140504, 31.424546074834208)
Coordinates: (10.6498095, -71.6443596)
Coordinates: (42.4484778, -73.2541069)
Coordinates: (50.4520214, 40.1399663)
Coordinates: (16.0990897, -88.8078855)
Coordinates: (43.1009031, -75.2326641)
Coordinates: (-3.62220965, 35.806488328097174)
Coordinates: (38

RateLimiter caught an error, retrying (0/2 tries). Called with (*('San Juan-Luis Muñoz Marín (Isla Verde), All Puerto Rico',), **{'language': 'en', 'exactly_one': True}).

Coordinates: (35.1649646820541, 128.96658353110658)
Coordinates: (51.864399, -1.2266697)
Coordinates: (39.58125245, 2.7092683401782343)
Coordinates: (45.4279922, 10.328465676993805)
Coordinates: (58.3019496, -134.419734)
Coordinates: (43.7480676, 27.4058887)
Coordinates: (25.03933675, -77.47012287193701)
Coordinates: (-8.7676566, 18.0014181)
Coordinates: (40.10776265, -85.6101306392725)
Coordinates: (39.6786205, -104.9599839)
Coordinates: (40.10776265, -85.6101306392725)
Coordinates: (32.4113732, -111.21748791631757)
Coordinates: (44.320281, -91.9149094)
Coordinates: (22.406628, -79.9651013)
Coordinates: (43.8015666, -115.1267474)
Coordinates: (-16.9206657, 145.7721854)
Coordinates: (42.2051312, 13.5192689)
Coordinates: (28.555524, 77.0851248)
Coordinates: (53.333333, -60.416667)
Coordinates: (43.735584, 44.653614)
Coordinates: (31.6816183, -96.4783677)
Coordinates: (1.352975, 103.99605546603934)
Coordinates: (36.2974945, 59.6059232)
Coordinates: (41.5774793, -71.5376881)
Coordinates: 

RateLimiter caught an error, retrying (0/2 tries). Called with (*('San Juan-Luis Muñoz Marín (Isla Verde), All Puerto Rico',), **{'language': 'en', 'exactly_one': True}).

Coordinates: (47.5490076, -52.7806361)
Coordinates: (44.5126379, -88.0125794)
Coordinates: (33.2342834, -97.5861393)
Coordinates: (25.9372513, -81.71573)
Coordinates: (39.22142085, -106.86734117631343)
Coordinates: (-7.3804481, 112.7458578)
Coordinates: (49.453872, 11.077298)
Coordinates: (39.6786205, -104.9599839)
Coordinates: (47.8992374, 2.1660454246529404)
Coordinates: (-14.9195617, 13.4897509)
Coordinates: (31.0404625, -84.8790911)
Coordinates: (-25.9396667, 27.92617105029214)
Coordinates: (-23.437859000000003, -46.48055622681591)
Coordinates: (63.426664599999995, -20.26988158464468)
Coordinates: (32.2617378, -83.7364358)
Coordinates: (46.4976734, -84.3476583)
Coordinates: (-21.59120195, -45.47401856941612)
Coordinates: (48.1023356, -77.7875715)
Coordinates: (-11.7790355, 19.9122676)
Coordinates: (4.085013, -72.9584671)
Coordinates: (12.0831175, 15.0191262)
Coordinates: (33.3967079, -86.1597137)
Coordinates: (-6.3364522, 106.76341868042863)
Coordinates: (0.0401529, -51.0569588)
Co

RateLimiter caught an error, retrying (0/2 tries). Called with (*('San Juan-Luis Muñoz Marín (Isla Verde), All Puerto Rico',), **{'language': 'en', 'exactly_one': True}).

Coordinates: (33.8278405, -78.6800323)
Coordinates: (-23.9656606, -46.3486247)
Coordinates: (67.1659371, -163.049567)
Coordinates: (20.528865, -103.3136855407796)
Coordinates: (15.35100255, -92.55576537361532)
Coordinates: (4.2050673, 34.3600434)
Coordinates: (42.4797854, -78.2497346)
Coordinates: (59.6454064, -151.5445643)
Coordinates: (40.1000956, -76.5663556)
Coordinates: (44.29372617055997, -64.8548064659279)
Coordinates: (59.4513337, -157.3146558)
Coordinates: (-9.5483991, 16.3475335)
Coordinates: (40.642947899999996, -73.7793733748521)
Coordinates: (31.809584, -106.36659802997961)
Coordinates: (-8.3996233, 13.445438)
Coordinates: (-25.68744875, 28.199824541000496)
Coordinates: (-4.2136618, -69.9423596)
Coordinates: (20.4490789, 99.8842143)
Coordinates: (69.37182119373804, 88.05523473164229)
Coordinates: (51.3582945, 7.473296)
Coordinates: (28.4024938, 83.6999277)
Coordinates: (-23.6463741, -70.3980033)
Coordinates: (1.1219929, 104.11570805549073)
Coordinates: (40.9298182, 73.0011

RateLimiter caught an error, retrying (0/2 tries). Called with (*('Barra de Tijuca, Rio de Janeiro',), **{'language': 'en', 'exactly_one': True}).

Coordinates: (-23.01681215, -43.41762522970831)
Coordinates: (-24.298056, 28.120833)
Coordinates: (-3.37546155, -76.63961612438942)
Coordinates: (-25.9396667, 27.92617105029214)
Coordinates: (48.4506259, -0.1443836)
Coordinates: (40.9782389, 28.826320811867586)
Coordinates: (-22.4683626, -44.4463494)
Coordinates: (4.61001675, -74.06930256481051)
Coordinates: (43.0865381, 77.1202694)
Coordinates: (48.0679255, -114.09666640516207)
Coordinates: (52.726192, 39.390289)
Coordinates: (46.6812216, -68.0154578)
Coordinates: (43.7342156, 7.417846160725075)
Coordinates: (29.357515, -100.8987707)
Coordinates: (46.808327, -100.783739)
Coordinates: (53.63636215, 9.994550134684175)
Coordinates: (33.9528472, -84.5496148)
Coordinates: (26.715364, -80.0532942)
Coordinates: (51.88696515, 0.24419290157330298)
Coordinates: (-5.1971376, -80.6267237)
Coordinates: (10.777000699999999, 123.01801470321547)
Coordinates: (33.921563, -78.0202677)
Coordinates: (-33.2315713, -54.3862595)
Coordinates: (34.5266431, 69

RateLimiter caught an error, retrying (0/2 tries). Called with (*('San Juan-Luis Muñoz Marín (Isla Verde), All Puerto Rico',), **{'language': 'en', 'exactly_one': True}).

Coordinates: (41.0339862, -73.7629097)
Coordinates: (54.356937, 55.897495)
Coordinates: (64.4975098, -165.4061701)
Coordinates: (61.8050465, -149.73677930645164)
Coordinates: (4.61001675, -74.06930256481051)
Coordinates: (52.6041877, 39.5936899)
Coordinates: (54.1930321, 37.61754)
Coordinates: (45.0072635, 43.3522929)
Coordinates: (45.32203645, -75.66726438915012)
Coordinates: (45.7630435, 106.2704145)
Coordinates: (-26.0294904, -48.8545202)
Coordinates: (40.735657, -74.1723667)
Coordinates: (-11.1851561, -40.5112434)
Coordinates: (46.476974, 38.64938)
Coordinates: (41.7907878, -107.239176)
Coordinates: (52.131802, -106.660767)
Coordinates: (-8.861408, 13.228756132544063)
Coordinates: (22.5150514, -103.4739008)
Coordinates: (25.774109385498363, -100.2921006025493)
Coordinates: (47.5048851, -111.29189)
Coordinates: (-31.294444, 25.823889)
Coordinates: (12.2536026, 104.6658389)
Coordinates: (39.185911000000004, -77.61041775092207)
Coordinates: (42.36609515, -122.21459729828005)
Coordinat

RateLimiter caught an error, retrying (0/2 tries). Called with (*('Charlotte Amalie-Cyril E. King (ex Harry S. Truman), All US Virgin Islands',), **{'language': 'en', 'exactly_one': True}).

Coordinates: (-25.143075, -65.50421)
Coordinates: (13.7692585, -13.6682901)
Coordinates: (56.64116920800744, -131.00652178637174)
Coordinates: (8.7498359, 124.9757909)
Coordinates: (9.2089255, 12.4802485)
Coordinates: (61.09604365, -155.57460505433366)
Coordinates: (53.32782, 91.945374)
Coordinates: (64.7316875, 177.5060925)
Coordinates: (27.4930246, -99.507425)
Coordinates: (40.1672117, -105.101928)
Coordinates: (45.031823, -93.3606072)
Coordinates: (9.792644899999999, 80.0690376002588)
Coordinates: (-19.750833, -47.936666)
Coordinates: (41.4776121, -91.1210053)
Coordinates: (41.9153358, -83.5135665)


RateLimiter caught an error, retrying (0/2 tries). Called with (*("Apia-Fagali'i, All Samoa Islands",), **{'language': 'en', 'exactly_one': True}).

Coordinates: (35.1477774, -114.568298)
Coordinates: (50.178029, -86.7138858)
Coordinates: (36.7168315, -76.2494453)
Coordinates: (36.057938, -76.6077213)
Coordinates: (28.0394654, -81.9498042)
Coordinates: (45.636623, -89.412075)
Coordinates: (15.1161645, 79.3689683)
Coordinates: (43.643032, -72.251587)
Coordinates: (37.3315103, -80.8111867)
Coordinates: (4.7676576, 7.0188527)
Coordinates: (56.646801, 32.26532)
Coordinates: (-26.48657915, 28.235991770634506)
Coordinates: (-10.1632209, 123.6017755)
Coordinates: (55.1380556, -131.9961111)
Coordinates: (43.6166163, -116.200886)
Coordinates: (48.0260305, 1.8739814)
Coordinates: (58.7166428, -111.1500187)
Coordinates: (48.5452022, -58.5871146)
Coordinates: (62.1445646, 65.4349289)
Coordinates: (33.6742048, -117.86802428008934)
Coordinates: (40.8211046, -82.51523393226182)
Coordinates: (12.3529756, 121.0663052)
Coordinates: (53.72068, 91.4406019)
Coordinates: (38.10184812665927, -80.6034853099025)
Coordinates: (60.7922222, -161.755833)
Coord

RateLimiter caught an error, retrying (0/2 tries). Called with (*('São Paulo-Guarulhos, São Paulo',), **{'language': 'en', 'exactly_one': True}).

Coordinates: (-23.437859000000003, -46.48055622681591)
Coordinates: (-8.4195244, 20.7410918)
Coordinates: (-25.095556, 30.779444)
Coordinates: (15.5635972, 32.5349123)
Coordinates: (13.7414754, 106.9873662)
Coordinates: (-29.8052778, 29.7672222)
Coordinates: (54.94897, 73.0278)
Coordinates: (47.6565584, 23.5719843)
Coordinates: (47.8409723, 12.9823914)
Coordinates: (29.9841416, -95.33298595614491)
Coordinates: (45.2864689, -122.334259)
Coordinates: (15.4702001, -90.3735065)
Coordinates: (-12.8203689, 28.2155973)
Coordinates: (40.86108, -79.895197)
Coordinates: (-21.8988131, 140.915775)
Coordinates: (-39.9296533, 143.8523082)
Coordinates: (-38.23632245, 146.42203085872927)
Coordinates: (19.7891616, -70.6926181)
Coordinates: (31.8658887, -116.602983)
Coordinates: (34.52054225, -109.37834294269751)
Coordinates: (-8.5172149, 17.8303814)
Coordinates: (36.5331586, -82.326806)
Coordinates: (35.1987522, -111.651822)
Coordinates: (20.020452, -155.66391)
Coordinates: (-17.831773, 31.045686)
Coor

RateLimiter caught an error, retrying (0/2 tries). Called with (*('Istanbul, Marmara Region (Marmara Bölgesi)',), **{'language': 'en', 'exactly_one': True}).

Coordinates: (41.006381, 28.9758715)
Coordinates: (36.307854750000004, -112.29289603702432)
Coordinates: (35.8789231, -97.4252772)
Coordinates: (35.714259, -83.5101638)
Coordinates: (41.7120775, -112.165779)
Coordinates: (53.6399088, 22.457839141500717)
Coordinates: (-23.437859000000003, -46.48055622681591)
Coordinates: (2.7431004, 101.70631463311038)
Coordinates: (68.9304699, -179.491343)
Coordinates: (-12.8074989, 15.757342194111722)
Coordinates: (25.0612686, 121.4598089)
Coordinates: (-8.7676566, 18.0014181)
Coordinates: (46.0131505, -112.536508)
Coordinates: (12.5824125, -13.3719081)
Coordinates: (47.2610441, 39.72490856518773)
Coordinates: (52.1575911, -95.3762948)
Coordinates: (51.066955, 8.6270903)
Coordinates: (42.511726, -83.6156648)
Coordinates: (48.9434494, 2.4220457)
Coordinates: (33.5855677, -101.8470215)
Coordinates: (27.6939119, 85.3582197389141)
Coordinates: (-7.8011998, 110.3646608)
Coordinates: (43.0865381, 77.1202694)
Coordinates: (37.6624312, -121.8746789)
Coordinat

RateLimiter caught an error, retrying (0/2 tries). Called with (*('Rio de Janeiro-Santos Dumont, Rio de Janeiro',), **{'language': 'en', 'exactly_one': True}).

Coordinates: (-21.22025825, -41.88430764225166)
Coordinates: (-20.7051873, 140.5058305)
Coordinates: (-6.9318423, -76.7724793)
Coordinates: (46.2311749, 7.3588795)
Coordinates: (41.151458, -87.2856621455731)
Coordinates: (57.959473, 102.734192)
Coordinates: (42.9830241, 47.5048717)
Coordinates: (47.2596224, -0.0785177)
Coordinates: (34.7313843, 137.65969421936074)
Coordinates: (-9.9293255, -76.2394845)
Coordinates: (15.5635972, 32.5349123)
Coordinates: (-11.169126921828052, -45.38281147497049)
Coordinates: (33.5071275, 51.9135547)
Coordinates: (59.0831232, -135.3430573)
Coordinates: (44.4297011, 5.7168847)
Coordinates: (-23.437859000000003, -46.48055622681591)
Coordinates: (37.2061545, -93.29206311368085)
Coordinates: (34.7336097, -115.2449794)
Coordinates: (-31.3035268, -64.21219174274952)
Coordinates: (46.306999, 44.270187)
Coordinates: (60.345203, 102.28093)
Coordinates: (-20.2308712, -58.1690716)
Coordinates: (63.426664599999995, -20.26988158464468)
Coordinates: (22.315395549999998

In [None]:
df = df.join(coordinates)

In [None]:
# Export data even if it's not completely clean
# Because it takes a long time to get the coordinates
df.to_csv('data/merged_data_with_coordinates2.csv', index=False)
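
Because geocoding is slow and the log above shows the same location string being requested more than once, caching lookups may save time. The sketch below is a minimal illustration, not the notebook's method: `slow_geocode` is a hypothetical stand-in for the real rate-limited geocoder, and the coordinates are sample values.

```python
from functools import lru_cache

# Hypothetical stand-in for the real geocoder call; it only counts invocations.
calls = {"count": 0}

def slow_geocode(query):
    calls["count"] += 1
    sample_results = {
        "Istanbul, Marmara Region (Marmara Bölgesi)": (41.006381, 28.9758715),
    }
    return sample_results.get(query)  # None when the lookup fails

@lru_cache(maxsize=None)
def cached_geocode(query):
    """Look each distinct location string up only once."""
    return slow_geocode(query)

locations = [
    "Istanbul, Marmara Region (Marmara Bölgesi)",
    "Istanbul, Marmara Region (Marmara Bölgesi)",
    "Unknown place",
]
coordinates = [cached_geocode(loc) for loc in locations]
```

Three lookups trigger only two geocoder calls here. Failed lookups (`None`) are cached too, which may or may not be desirable when failures are transient.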

---

### Continue data cleaning with added coordinates

In [None]:
import numpy as np
import pandas as pd
import pickle

In [None]:
# Load merged data
df = pd.read_csv('data/merged_data_with_coordinates.csv', parse_dates=['date'])
df.head()

### Split some columns into multiple

In [None]:
# Split schedule into 2 columns
schedule = df['schedule'].str.split(' - ', expand=True)
df['departure'] = schedule[0]
df['arrival'] = schedule[1]
df = df.drop('schedule', axis=1)
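
The following sketch shows how the split behaves on toy data. The schedule strings are invented, and `n=1` is an added assumption: it splits on the first `' - '` only, so a schedule with an intermediate stop still yields exactly two columns.

```python
import pandas as pd

# Toy values; the real 'schedule' strings may differ in format.
toy = pd.DataFrame({"schedule": ["Paris - London", "Lima - Cusco - Juliaca", None]})

# n=1 splits on the first ' - ' only, so the result always has two columns.
parts = toy["schedule"].str.split(" - ", n=1, expand=True)
toy["departure"] = parts[0]
toy["arrival"] = parts[1]
toy = toy.drop("schedule", axis=1)
```

Rows with a missing schedule end up with missing values in both new columns.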

In [None]:
# Split type details into 2 columns
df['type_details'] = df['type_details'].str.extract(r'(\bFirst flight: \d{4}, .*)$', expand=False)
details = df['type_details'].str.split(', ', expand=True)
df['first_flight'] = details[0].str.extract(r'(\d{4})', expand=False)
df['engine'] = details[1]
df = df.drop('type_details', axis=1)
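
To see what the two regular expressions above keep, here is a small worked example. The raw strings are hypothetical; only the `First flight: YYYY, <engines>` tail is assumed to match the real data.

```python
import pandas as pd

# Hypothetical raw values; check them against the real 'type_details' column.
raw = pd.Series([
    "Some model notes First flight: 1969, 4 Jet engines",
    "No usable details",
])

# Keep only the 'First flight: YYYY, ...' tail, then split it on ', '.
tail = raw.str.extract(r"(\bFirst flight: \d{4}, .*)$", expand=False)
details = tail.str.split(", ", expand=True)
first_flight = details[0].str.extract(r"(\d{4})", expand=False)
engine = details[1]
```

Values without a `First flight:` tail produce missing values in both derived columns.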

In [None]:
# Split fatalities from ASN
fatalities = df['fatalities'].str.split(' / ', expand=True)
df['fatalities'] = fatalities[0].str.extract(r'(\d+)')
df['occupants'] = fatalities[1].str.extract(r'(\d+)')
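
A minimal sketch of the same split on toy data follows. The `'Fatalities: N / Occupants: M'` format is an assumption about the ASN column, and converting with `pd.to_numeric` right away is an addition that the notebook defers to a later step.

```python
import pandas as pd

# Assumed format; verify it against the real 'fatalities' column.
raw = pd.Series(["Fatalities: 3 / Occupants: 5", "Fatalities: 0 / Occupants: 12", None])

parts = raw.str.split(" / ", expand=True)
# Extract the first run of digits on each side and convert it to a number.
fatalities = pd.to_numeric(parts[0].str.extract(r"(\d+)", expand=False))
occupants = pd.to_numeric(parts[1].str.extract(r"(\d+)", expand=False))
```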

### Merge common columns

In [None]:
df['operator'] = df['operator'].fillna(df['owner'])
df = df.drop('owner', axis=1)

In [None]:
df['type'] = df['aircraft_type'].fillna(df['type'])
df = df.drop('aircraft_type', axis=1)

In [None]:
df['yom'] = df['yom'].fillna(df['year_of_manufacture'])
df = df.drop('year_of_manufacture', axis=1)

In [None]:
df['aircraft_flying_hours'] = df['aircraft_flying_hours'].fillna(df['total_airframe_hrs'])
df = df.drop('total_airframe_hrs', axis=1)

In [None]:
df['aircraft_flight_cycles'] = df['aircraft_flight_cycles'].fillna(df['cycles'])
df = df.drop('cycles', axis=1)

In [None]:
df['msn'] = df['msn_x'].fillna(df['msn_y'])
df = df.drop(['msn_x', 'msn_y'], axis=1)

In [None]:
df['flight_phase'] = df['flight_phase'].fillna(df['phase'])
df = df.drop('phase', axis=1)

In [None]:
df['flight_type'] = df['flight_type'].fillna(df['nature'])
df = df.drop('nature', axis=1)

In [None]:
df['departure'] = df['departure_airport'].fillna(df['departure'])
df = df.drop('departure_airport', axis=1)

In [None]:
df['arrival'] = df['destination_airport'].fillna(df['arrival'])
df = df.drop('destination_airport', axis=1)

In [None]:
on_board = df['crew_on_board'] + df['pax_on_board']
df['occupants'] = on_board.fillna(df['occupants'])
df = df.drop(['crew_on_board', 'pax_on_board'], axis=1)

In [None]:
df['fatalities'] = df['total_fatalities'].fillna(df['fatalities'])
df = df.drop(['crew_fatalities', 'pax_fatalities', 'total_fatalities'], axis=1)

In [None]:
df['other_fatalities'] = df['other_fatalities_x'].fillna(df['other_fatalities_y'])
df = df.drop(['other_fatalities_x', 'other_fatalities_y'], axis=1)
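
The cells above all repeat the same fill-then-drop pattern, which could be factored into a helper. The sketch below is one possible shape: the `coalesce` function and its parameter names are hypothetical and not part of the notebook.

```python
import pandas as pd

def coalesce(frame, target, primary, fallback):
    """Fill gaps in `primary` from `fallback`, store the result in `target`,
    and drop whichever source columns are not the target."""
    merged = frame[primary].fillna(frame[fallback])
    leftovers = [c for c in (primary, fallback) if c != target]
    out = frame.drop(columns=leftovers)
    out[target] = merged
    return out

toy = pd.DataFrame({
    "operator": ["Air A", None, None],
    "owner": ["Owner X", "Owner Y", None],
})
toy = coalesce(toy, target="operator", primary="operator", fallback="owner")
```

The same helper would cover the `msn_x`/`msn_y` case with `target="msn"`, since both source columns are then dropped.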

### Drop rows and columns

#### Drop redundant/unnecessary columns

**registration, msn**<br>
These are unique identifiers of an aircraft.

**flight_number**<br>
It's a unique identifier of a flight.

**captain_flying_hours, captain_flying_hours_on_type, copilot_flying_hours, copilot_flying_hours_on_type, aircraft_flying_hours, aircraft_flight_cycles, departure, arrival**<br>
There are too many null values.

**date_str**<br>
It was used to merge the dataframes.

**survivors**<br>
It can be calculated with the number of occupants minus the number of fatalities.

**first_flight**<br>
It's the year of the first flight of the aircraft type in general, not of the specific aircraft involved in the accident.

**investigating_agency, confidence_rating**<br>
It won't help categorize the data.

In [None]:
columns_to_drop = [
  'registration',
  'msn',
  'flight_number',
  'captain_flying_hours', 
  'captain_flying_hours_on_type', 
  'copilot_flying_hours',
  'copilot_flying_hours_on_type',
  'aircraft_flying_hours', 
  'aircraft_flight_cycles',
  'departure',
  'arrival',
  'date_str',
  'survivors',
  'first_flight',
  'investigating_agency',
  'confidence_rating']

df = df.drop(columns_to_drop, axis=1)

#### Drop rows

In [None]:
# Drop rows where latitude, longitude or occupants is null (dropna defaults to how='any')
subset = ['latitude', 'longitude', 'occupants']
df = df.dropna(subset=subset)
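
`dropna` with a `subset` removes a row as soon as any of the listed columns is missing. The toy example below contrasts that default with `how='all'`; the values are invented.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "latitude": [48.85, np.nan, 40.71],
    "longitude": [2.35, np.nan, -74.0],
    "occupants": [10, 4, np.nan],
})
subset = ["latitude", "longitude", "occupants"]

any_missing = toy.dropna(subset=subset)             # default how='any'
all_missing = toy.dropna(subset=subset, how="all")  # drops only fully empty rows
```

With the default, only the first row survives. With `how='all'`, all three rows are kept because none of them is missing every listed column.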

### Impute missing values

#### String columns

In [None]:
columns = df.drop(['latitude', 'longitude', 'occupants'], axis=1).select_dtypes(include='object').columns

for column in columns:
    df[column] = df[column].fillna('Unknown')

#### Numeric columns

In [None]:
# Impute missing other_fatalities with 0
df['other_fatalities'] = df['other_fatalities'].fillna(0)
assert df['other_fatalities'].isna().sum() == 0

In [None]:
# Impute missing yom with crash year minus the average aircraft age
df['aircraft_age'] = df['date'].dt.year - df['yom']
df['aircraft_age'] = df['aircraft_age'].fillna(int(df['aircraft_age'].mean()))
df['yom'] = df['yom'].fillna(df['date'].dt.year - df['aircraft_age'])
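
A worked example of this imputation on three invented rows makes the arithmetic explicit. Note that `int()` truncates the mean age rather than rounding it.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "date": pd.to_datetime(["1950-06-01", "1960-03-15", "1970-09-30"]),
    "yom": [1940, 1955, np.nan],
})

toy["aircraft_age"] = toy["date"].dt.year - toy["yom"]   # 10, 5, NaN
mean_age = int(toy["aircraft_age"].mean())               # int(7.5) -> 7
toy["aircraft_age"] = toy["aircraft_age"].fillna(mean_age)
# Missing yom becomes crash year minus the (imputed) age: 1970 - 7 = 1963
toy["yom"] = toy["yom"].fillna(toy["date"].dt.year - toy["aircraft_age"])
```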

In [None]:
# Assert there are no more null values
assert df.isna().sum().sum() == 0

### Collapse categories

In [None]:
# Get unique values of flight phase
df['flight_phase'].sort_values().unique()

In [None]:
df['flight_phase'] = np.where(df['flight_phase'].isin(['Take off', 'Initial climb']), 'Takeoff (climb)', df['flight_phase'])
df['flight_phase'] = np.where(df['flight_phase'] == 'En route', 'Flight', df['flight_phase'])
df['flight_phase'] = np.where(df['flight_phase'].isin(['Landing', 'Approach']), 'Landing (descent or approach)', df['flight_phase'])
df['flight_phase'] = np.where(df['flight_phase'] == 'Taxi', 'Taxiing', df['flight_phase'])
df['flight_phase'] = np.where(df['flight_phase'] == 'Standing', 'Parking', df['flight_phase'])
df['flight_phase'].sort_values().unique()
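
The same regrouping can be written as a single lookup table passed to `Series.replace`, which may be easier to audit than a chain of `np.where` calls. This is an alternative sketch on toy data, not the notebook's code; the mapping mirrors the labels used above.

```python
import pandas as pd

phase_map = {
    "Take off": "Takeoff (climb)",
    "Initial climb": "Takeoff (climb)",
    "En route": "Flight",
    "Landing": "Landing (descent or approach)",
    "Approach": "Landing (descent or approach)",
    "Taxi": "Taxiing",
    "Standing": "Parking",
}

toy = pd.Series(["Take off", "Approach", "En route", "Unknown", "Standing"])
# Values that are not keys of the mapping are left unchanged.
collapsed = toy.replace(phase_map)
```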

In [None]:
# Get unique values of flight_type
df['flight_type'].sort_values().unique()

In [None]:
# Regroup values
df['flight_type'] = np.where(df['flight_type'] == '-', 'Unknown', df['flight_type'])
df['flight_type'].sort_values().unique()

In [None]:
# Get unique values of aircraft_damage
df['aircraft_damage'].sort_values().unique()

In [None]:
# Regroup values
df['aircraft_damage'] = df['aircraft_damage'].str.replace(', written off', '')
df['aircraft_damage'].sort_values().unique()

In [None]:
# Get unique values of cause
df['cause'].sort_values().unique()

In [None]:
df['category'].sort_values().unique()

In [None]:
# Regroup categories
df['category'] = np.where(df['category'] == 'UK', 'Unknown', df['category'])
df['category'].sort_values().unique()

### Convert columns

In [None]:
# Remove time from date
df['date'] = df['date'].dt.normalize()

In [None]:
# Convert yom, occupants and fatalities to int
df[['yom', 'occupants', 'fatalities', 'other_fatalities']] = df[['yom', 'occupants', 'fatalities', 'other_fatalities']].astype('int')

In [None]:
# Convert coordinates to float
df[['latitude', 'longitude']] = df[['latitude', 'longitude']].astype('float')

#### Convert category, aircraft_damage and engine into ordinal categories

In [None]:
df['category'].sort_values().unique()

In [None]:
categories = ['Unknown', 'Incident', 'Serious incident', 'Accident', 'Unlawful Interference']
df['category'] = pd.Categorical(df['category'], categories, ordered=True)
df['category'] = df['category'].fillna('Unknown')
df['category'].sort_values().unique()

In [None]:
df['aircraft_damage'].sort_values().unique()

In [None]:
categories = [
    'Unknown',
    'Minor, repaired',
    'Minor',
    'Substantial, repaired',
    'Substantial',
    'Destroyed',
    'Aircraft missing']
df['aircraft_damage'] = pd.Categorical(df['aircraft_damage'], categories, ordered=True)
df['aircraft_damage'].cat.categories
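
One thing to watch with `pd.Categorical`: any value that is not in the `categories` list silently becomes missing. The sketch below uses a shortened, illustrative category list and an invented unlisted label to show how to spot such values, and how an ordered categorical supports comparisons.

```python
import pandas as pd

levels = ["Unknown", "Minor", "Substantial", "Destroyed"]  # shortened list for illustration
toy = pd.Series(["Minor", "Destroyed", "Damaged beyond repair", "Unknown"])
damage = pd.Series(pd.Categorical(toy, categories=levels, ordered=True))

# Labels that were turned into missing values because they are not in `levels`
unlisted = toy[damage.isna()].unique()

# Ordered categories can be compared against one of their levels
severe = damage >= "Substantial"
```

Checking `unlisted` before overwriting the column is a cheap way to confirm the category list is complete.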

In [None]:
df['engine'].unique()

In [None]:
df['engine'] = np.where(df['engine'].isin([
  '2 Piston engines',
  '3 Piston engines',
  '4 Piston engines',
  '6 Piston engines']), 'Multi Piston Engines', df['engine'])
df['engine'].unique()

In [None]:
df['engine'] = np.where(df['engine'].isin([
  '2 Turboprop engines',
  '3 Turboprop engines',
  '4 Turboprop engines']), 'Multi Turboprop Engines', df['engine'])
df['engine'].unique()

In [None]:
df['engine'] = np.where(df['engine'].isin([
  '2 Jet engines',
  '3 Jet engines',
  '4 Jet engines']), 'Multi Jet Engines', df['engine'])
df['engine'].unique()

In [None]:
categories = [
  'Unknown',
  '1 Piston engine',
  'Multi Piston Engines',
  '1 Turboprop engine',
  'Multi Turboprop Engines',
  '1 Jet engine',
  'Multi Jet Engines']
df['engine'] = pd.Categorical(df['engine'], categories, ordered=True)
df['engine'].cat.categories
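
The three engine regrouping cells could also be expressed with one regular expression. The sketch below is an alternative under one stated difference: it groups any engine count above 1, whereas the cells above list specific counts explicitly.

```python
import pandas as pd

toy = pd.Series(["2 Piston engines", "1 Jet engine", "4 Turboprop engines", "Unknown"])

# Capture the engine count and kind; non-matching rows yield missing values.
parts = toy.str.extract(r"^(?P<count>\d+) (?P<kind>Piston|Turboprop|Jet) engines?$")
is_multi = parts["count"].astype(float) > 1

# Keep the original label unless the row has more than one engine.
grouped = toy.where(~is_multi, "Multi " + parts["kind"] + " Engines")
```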

### Validate values

In [None]:
df.describe()

In [None]:
# Get rows with yom below 1900
low_yom = df['yom'] < 1900
df[low_yom]

In [None]:
# Replace with crash year minus the average aircraft age
df.loc[low_yom, 'yom'] = df['date'].dt.year - int(df['aircraft_age'].mean())

In [None]:
# Get rows with a yom in the future (e.g. a 5-digit typo)
high_yom = df['yom'] > 2025
df[high_yom]

In [None]:
# After checking in the BAAA and ASN website, replace with 1956
df.loc[high_yom, 'yom'] = 1956

In [None]:
# Make sure fatalities are not greater than occupants
df.loc[df['fatalities'] > df['occupants'], 'fatalities'] = df['occupants']
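
It can be worth counting the inconsistent rows before capping them. The sketch below uses invented numbers, and `Series.clip` with a per-row upper bound is an equivalent alternative to the `.loc` assignment above.

```python
import pandas as pd

toy = pd.DataFrame({"occupants": [5, 10, 3], "fatalities": [5, 12, 0]})

# Rows where more deaths than occupants are recorded
suspicious = toy[toy["fatalities"] > toy["occupants"]]

# Cap fatalities at the number of occupants, row by row
toy["fatalities"] = toy["fatalities"].clip(upper=toy["occupants"])
```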

### Export data

In [None]:
df.info()

In [None]:
# Reorder columns
df = df[[
    'date',
    'category',
    'type',
    'operator',
    'yom',
    'engine',
    'engine_model',
    'flight_phase',
    'flight_type',
    'site',
    'location',
    'country',
    'region',
    'latitude',
    'longitude',
    'aircraft_damage',
    'occupants',
    'fatalities',
    'other_fatalities',
    'cause'
]]

In [None]:
# Sort data from the earliest to the latest crash
df = df.sort_values(by='date')

In [None]:
# Reset index
df = df.reset_index(drop=True)

In [None]:
# Serialize data with pickle
with open('data/cleaned_data.pkl', 'wb') as handle:
  pickle.dump(df, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [None]:
# Export data to CSV
df.to_csv('data/cleaned_data.csv', index=False)
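
One practical difference between the two exports is that a pickle keeps the ordered categorical dtypes, while a CSV stores only the labels. The in-memory round trip below illustrates this on a toy frame, using buffers instead of the real files.

```python
import io
import pickle

import pandas as pd

toy = pd.DataFrame({
    "engine": pd.Categorical(
        ["1 Jet engine", "Unknown"],
        categories=["Unknown", "1 Jet engine"],
        ordered=True,
    )
})

# Round trip through pickle (bytes in memory) and through CSV (text in memory)
from_pickle = pickle.loads(pickle.dumps(toy, protocol=pickle.HIGHEST_PROTOCOL))
from_csv = pd.read_csv(io.StringIO(toy.to_csv(index=False)))
```

After reading the CSV, the `engine` column is plain text again, so the categories and their order would need to be rebuilt.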

## End