# Aircraft Crashes Data Collection And Cleaning

## Overview

This notebook collects and prepares the data used to analyze aircraft accidents recorded since 1918.

### About dataset

The data will be scraped from the [BAAA Crash Archives](https://www.baaa-acro.com/crash-archives) and the [ASN Database](https://asn.flightsafety.org/database/).

**BAAA**

`date` date and local time of the accident<br>
`aircraft_type` aircraft make and model<br>
`operator` operator of the aircraft<br>
`registration` unique code to a single aircraft, required by international convention<br>
`flight_phase` phase of the flight when the accident occurred<br>
`flight_type` type of flight (ex: military)<br>
`survivors` indicates whether there were survivors<br>
`site` type of location where the accident happened (ex: mountains)<br>
`departure` city where the departure was planned<br> 
`arrival` city where the arrival was planned<br> 
`msn` manufacturer's serial number of the aircraft<br>
`yom` year of manufacture of the aircraft involved in the accident<br>
`flight_number` flight number<br>
`location` location of the accident<br>
`country` country where the crash happened<br>
`region` region of the world where the crash happened<br>
`crew_on_board` number of crew members on board at the time of the accident<br>
`crew_fatalities` number of crew members who died in the crash<br>
`pax_on_board` number of passengers on board at the time of the accident<br> 
`pax_fatalities` number of passengers who died in the crash<br>                 
`other_fatalities` other victims of the accident outside of the aircraft<br>
`total_fatalities` total number of deaths<br>
`captain_flying_hours` number of flying hours of the captain<br>
`captain_flying_hours_on_type` number of hours the captain flew on the type of aircraft involved in the crash<br>
`copilot_flying_hours` number of flying hours of the copilot<br>  
`copilot_flying_hours_on_type` number of hours the copilot flew on the type of aircraft involved in the crash<br>  
`aircraft_flying_hours` number of flying hours of the aircraft before the crash<br>
`aircraft_flight_cycles` number of flights of the aircraft<br><br>


**ASN**

`date` date of the accident<br>
`time` time of the accident<br>
`type` make and model of the aircraft<br>
`first_flight` year of the first flight of the aircraft type<br>
`engine` type and number of engines<br>
`owner` operator of the aircraft<br>
`registration` unique code to a single aircraft, required by international convention<br>
`msn` manufacturer's serial number of the aircraft<br>
`year_of_manufacture` year of manufacture of the aircraft involved in the accident<br>
`total_airframe_hrs` number of flying hours of the aircraft before the crash<br>
`cycles` number of flights of the aircraft<br>
`engine_model` make and model of the aircraft engine<br>
`fatalities` total number of fatalities<br>
`occupants` number of crew members and passengers on board<br>
`other_fatalities` other victims of the accident outside of the aircraft<br>
`aircraft_damage` severity of the aircraft damage<br>
`category` type of accident<br>
`location` location of the crash<br>
`phase` phase of the flight when the accident occurred<br>
`nature` type of flight (ex: military)<br>
`departure_airport` airport where the departure was planned<br>
`destination_airport` airport where the arrival was planned<br>
`investigating_agency` agency that produced the accident report<br>
`confidence_rating` quality of the information (ex: missing information)

---

## Data Collection

In [None]:
from bs4 import BeautifulSoup
from datetime import datetime
from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Nominatim
import math
import numpy as np
import pandas as pd
import pickle
import re
import requests
from urllib.parse import unquote

### BAAA

In [None]:
# Scrape total number of accidents
root_url = 'https://www.baaa-acro.com'

response = requests.get(root_url)
soup = BeautifulSoup(response.content, 'html.parser')
accident_files = soup.find('div', {'class': 'total-accident-files'})
nb_crashes = int(accident_files.text.replace(',', ''))
	

In [None]:
# Scrape details of all accidents
nb_rows_per_page = 20
nb_pages = math.ceil(nb_crashes / nb_rows_per_page)
csv_path = 'data/baaa_scraped_data.csv'

for i in range(nb_pages):
	listing_url = '{}/crash-archives?page={}'.format(root_url, i)
	response = requests.get(listing_url)
	soup = BeautifulSoup(response.content, 'html.parser')
	anchors = soup.find_all('a', {'class': 'red-btn'})

	crash_list = []
	
	for j, a in enumerate(anchors):
		link = a['href']
		#print('Page {}, link {}: {}{}'.format(i, j + 1, root_url, link))
		details_url = root_url + link
		response = requests.get(details_url)
		soup = BeautifulSoup(response.content, 'html.parser')
		details = {}
		
		details_div = soup.find('div', {'class': 'crash-details'})
		
		date_div = details_div.find('div', {'class': 'crash-date'})
		details['date'] = date_div.find('span').next_sibling.text.strip() if date_div else None
		
		aircraft_div = details_div.find('div', {'class': 'crash-aircraft'})
		details['aircraft_type'] = aircraft_div.find('a').find('div').text if aircraft_div else None
		
		operator_div = details_div.find('div', {'class': 'crash-operator'})

		if operator_div:
			if (operator_div.find('img')): # Extract operator name from image link
				pattern = re.compile(r'(?<=target_id=).*(?= \(\d+\))')
				img_link = unquote(operator_div.find('img').parent['href'])
				details['operator'] = pattern.search(img_link).group(0)
			else:
				details['operator'] = operator_div.find('a').find('div').text
		else:
			details['operator'] = None

		reg_div = details_div.find('div', {'class': 'crash-registration'})
		details['registration'] = reg_div.find('div').text if reg_div else None
		
		flight_phase_div = details_div.find('div', {'class': 'crash-flight-phase'})
		details['flight_phase'] = flight_phase_div.find('a').find('div').text if flight_phase_div else None
		
		flight_type_div = details_div.find('div', {'class': 'crash-flight-type'})
		details['flight_type'] = flight_type_div.find('a').find('div').text if flight_type_div else None
		
		survivors_div = details_div.find('div', {'class': 'crash-survivors'})
		details['survivors'] = survivors_div.find('a').find('div').text if survivors_div else None
		
		site_div = details_div.find('div', {'class': 'crash-site'})
		details['site'] = site_div.find('a').find('div').text if site_div else None
		
		schedule_div = details_div.find('div', {'class': 'crash-schedule'})
		details['schedule'] = schedule_div.find('div').text if schedule_div else None
		
		msn_div = details_div.find('div', {'class': 'crash-construction-num'})
		details['msn'] = msn_div.find('div').text if msn_div else None
		
		yom_div = details_div.find('div', {'class': 'crash-yom'})
		details['yom'] = yom_div.find('div').text if yom_div else None

		flight_number = details_div.find('div', {'class': 'crash-flight-number'})
		details['flight_number'] = flight_number.find('div').text if flight_number else None
		
		location_div = details_div.find('div', {'class': 'crash-location'})
		if location_div:
			location_details = location_div.select('a')
			details['location'] = ', '.join(item.text.strip() for item in location_details) if location_details else None
		else:
			details['location'] = None
		
		country_div = details_div.find('div', {'class': 'crash-country'})
		details['country'] = country_div.find('a').find('div').text if country_div else None
		
		region_div = details_div.find('div', {'class': 'crash-region'})
		details['region'] = region_div.find('a').find('div').text if region_div else None
		
		crew_on_board_div = details_div.find('div', {'class': 'crash-crew-on-board'})
		details['crew_on_board'] = crew_on_board_div.find('div').text if crew_on_board_div else None
		
		crew_fatalities_div = details_div.find('div', {'class': 'crash-crew-fatalities'})
		details['crew_fatalities'] = crew_fatalities_div.find('div').text if crew_fatalities_div else None
		
		pax_on_board_div = details_div.find('div', {'class': 'crash-pax-on-board'})
		details['pax_on_board'] = pax_on_board_div.find('div').text if pax_on_board_div else None
		
		pax_fatalities_div = details_div.find('div', {'class': 'crash-pax-fatalities'})
		details['pax_fatalities'] = pax_fatalities_div.find('div').text if pax_fatalities_div else None
		
		others_div = details_div.find('div', {'class': 'crash-other-fatalities'})
		details['other_fatalities'] = others_div.find('div').text if others_div else None
		
		total_fatalities_div = details_div.find('div', {'class': 'crash-total-fatalities'})
		details['total_fatalities'] = total_fatalities_div.find('div').text if total_fatalities_div else None

		captain_hours_div = details_div.find('div', {'class': 'captain-total-flying-hours'})
		details['captain_flying_hours'] = captain_hours_div.find('div').text if captain_hours_div else None

		captain_hours_type_div = details_div.find('div', {'class': 'captain-total-hours-type'})
		details['captain_flying_hours_on_type'] = captain_hours_type_div.find('div').text if captain_hours_type_div else None

		copilot_hours_div = details_div.find('div', {'class': 'copilot-total-flying-hours'})
		details['copilot_flying_hours'] = copilot_hours_div.find('div').text if copilot_hours_div else None

		copilot_hours_type_div = details_div.find('div', {'class': 'copilot-total-hours-type'})
		details['copilot_flying_hours_on_type'] = copilot_hours_type_div.find('div').text if copilot_hours_type_div else None

		aircraft_hours_div = details_div.find('div', {'class': 'crash-aircraft-flight-hours'})
		details['aircraft_flying_hours'] = aircraft_hours_div.find('div').text if aircraft_hours_div else None

		aircraft_cycles_div = details_div.find('div', {'class': 'crash-aircraft-flight-cycles'})
		details['aircraft_flight_cycles'] = aircraft_cycles_div.find('div').text if aircraft_cycles_div else None
		
		crash_list.append(details)
	
	df = pd.DataFrame(crash_list)

	if i == 0:
		df.to_csv(csv_path, index=False)
	else:
		df.to_csv(csv_path, index=False, header=False, mode='a')
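
The operator extraction in the cell above relies on a regular expression applied to the decoded image link. A minimal sketch of that step follows. The `href` value is a made-up example (the real BAAA link layout is an assumption here), and the `None` guard is an addition that the original cell does not have.

```python
import re
from urllib.parse import unquote

# Hypothetical href; the real BAAA link layout is assumed, not confirmed here
href = '/crash-archives?field_crash_operator_target_id=Air%20France%20%2812345%29'

# Same pattern as in the scraping cell: text after "target_id=" and before " (digits)"
pattern = re.compile(r'(?<=target_id=).*(?= \(\d+\))')

decoded = unquote(href)  # percent-decoding restores spaces and parentheses
match = pattern.search(decoded)
operator = match.group(0) if match else None  # guard against links that do not match
```

Guarding the match avoids an `AttributeError` on a link that does not follow the expected shape, at the cost of a missing operator for that row.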


In [None]:
# Scrape accident causes
csv_path = 'data/baaa_crash_reasons.csv'

reasons = {
  'Human factor': 12990,
  'Other causes': 12992,
  'Technical failure': 12988,
  'Terrorism act, hijacking, sabotage, any kind of hostile action': 12991,
  'Unknown': 12993,
  'Weather': 12989
}

for i, (reason, target_id) in enumerate(reasons.items()):
	url = 'https://www.baaa-acro.com/crash-archives?field_crash_cause_target_id={}'.format(target_id)
	response = requests.get(url)
	soup = BeautifulSoup(response.content, 'html.parser')
	pattern = re.compile(r'\d+$')
	total_items_txt = soup.find('div', {'class': 'view-header'}).find('span').text
	total_items = int(pattern.search(total_items_txt).group(0))
	nb_items_per_page = 20
	nb_pages = math.ceil(total_items / nb_items_per_page)
	
	for j in range(nb_pages):
		page_url = url + '&page={}'.format(j)
		page_response = requests.get(page_url)
		page_soup = BeautifulSoup(page_response.content, 'html.parser')
		table = page_soup.find('table')
		rows = table.find_all('tr')

		crash_list = []
		for k, row in enumerate(rows[1:]):
			created = row.find('td', {'class': 'views-field-created'}).find('time').text
			registration_div = row.find('div', {'class': 'registration-field'})
			crash_list.append({
				'date': created,
				'registration': registration_div.text if registration_div else None,
				'cause': reason
			})
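		df = pd.DataFrame(crash_list)
		# Write the header with the first chunk only, then append (the file is read back by column name later)
		first_chunk = (i == 0 and j == 0)
		df.to_csv(csv_path, index=False, header=first_chunk, mode='w' if first_chunk else 'a')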
	

### ASN

In [None]:
csv_path = 'data/asn_scraped_data.csv'
root_url = 'https://asn.flightsafety.org'

# Add a browser-like User-Agent header to avoid a 403 Forbidden response
headers = {
  'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36'
}

database = '/database'
database_url = root_url + database

# Full range. To resume an interrupted run, raise the start year
# (the CSV header is only written for year 1919, page 1).
for year in range(1919, 2026):
	year_url = '{}{}/year/{}/1'.format(root_url, database, year)
	response = requests.get(year_url, headers=headers)
	soup = BeautifulSoup(response.content, 'html.parser')
	nb_occurences_txt = soup.find('div', {'class': 'innertube'}).find('span').text
	pattern = re.compile(r'\d+(?= occurrences)')
	nb_occurences = int(pattern.search(nb_occurences_txt).group(0))
	max_items_per_page = 100
	nb_pages = math.ceil(nb_occurences / max_items_per_page)

	for page in range(1, nb_pages + 1):
		page_url = '{}{}/year/{}/{}'.format(root_url, database, year, page)
		response = requests.get(page_url, headers=headers)
		soup = BeautifulSoup(response.content, 'html.parser')
		table = soup.find('table', {'class': 'hp'})
		anchors = table.find_all('a')
		links = [a['href'] for a in anchors]
		
		crash_list = []
		for i, link in enumerate(links):
			details_url = root_url + link
			print('Year {}, page {}, item {}, link: {}'.format(year, page, i + 1, details_url))
			response = requests.get(details_url, headers=headers)
			soup = BeautifulSoup(response.content, 'html.parser')
			table = soup.find('table')
			details = {}

			date_label = table.find('td', string='Date:')
			details['date'] = date_label.next_sibling.text

			time_label = table.find('td', string='Time:')
			details['time'] = time_label.next_sibling.text

			type_label = table.find('td', string='Type:')
			anchor = type_label.next_sibling.find('a')

			if anchor: # Get more details about aircraft if link exists
				details['type'] = anchor.text
				href = anchor['href']
				type_url = root_url + href
				type_response = requests.get(type_url, headers=headers)
				type_soup = BeautifulSoup(type_response.content, 'html.parser')
				type_table = type_soup.find('table')
				type_details = list(type_table.find('td', {'valign': 'top'}).stripped_strings)
				details['type_details'] = ', '.join(type_details)
			else:
				details['type'] = type_label.next_sibling.text
				details['type_details'] = None

			owner_label = table.find('td', string='Owner/operator:')
			details['owner'] = owner_label.next_sibling.text

			reg_label = table.find('td', string='Registration:')
			details['registration'] = reg_label.next_sibling.text

			msn_label = table.find('td', string='MSN:')
			details['msn'] = msn_label.next_sibling.text

			yom_label = table.find('td', string='Year of manufacture:')
			details['year_of_manufacture'] = yom_label.next_sibling.text if yom_label else None

			air_hours_label = table.find('td', string='Total airframe hrs:')
			details['total_airframe_hrs'] = air_hours_label.next_sibling.text if air_hours_label else None

			cycles_label = table.find('td', string='Cycles:')
			details['cycles'] = cycles_label.next_sibling.text if cycles_label else None

			engine_label = table.find('td', string='Engine model:')
			details['engine_model'] = engine_label.next_sibling.text if engine_label else None

			fatal_label = table.find('td', string='Fatalities:')
			details['fatalities'] = fatal_label.next_sibling.text

			other_label = table.find('td', string='Other fatalities:')
			details['other_fatalities'] = other_label.next_sibling.text

			damage_label = table.find('td', string='Aircraft damage:')
			details['aircraft_damage'] = damage_label.next_sibling.text

			cat_label = table.find('td', string='Category:')
			details['category'] = cat_label.next_sibling.text if cat_label else None

			loc_label = table.find('td', string='Location:')
			details['location'] = ' '.join(loc_label.next_sibling.stripped_strings)

			phase_label = table.find('td', string='Phase:')
			details['phase'] = phase_label.next_sibling.text

			nature_label = table.find('td', string='Nature:')
			details['nature'] = nature_label.next_sibling.text

			dep_label = table.find('td', string='Departure airport:')
			details['departure_airport'] = dep_label.next_sibling.text

			des_label = table.find('td', string='Destination airport:')
			details['destination_airport'] = des_label.next_sibling.text

			inv_label = table.find('td', string=re.compile('Investigating'))
			details['investigating_agency'] = inv_label.next_sibling.text if inv_label else None

			conf_label = table.find('td', string='Confidence Rating:')
			details['confidence_rating'] = ''.join(conf_label.next_sibling.stripped_strings) if conf_label else None

			crash_list.append(details)
		
		df = pd.DataFrame(crash_list)

		if year == 1919 and page == 1:
			df.to_csv(csv_path, index=False)
		else:
			df.to_csv(csv_path, index=False, header=False, mode='a')

---

## Data Exploration

### BAAA

In [None]:
baaa_df = pd.read_csv('data/baaa_scraped_data.csv')
baaa_df.head()

In [None]:
baaa_df.info()

In [None]:
baaa_df.isnull().sum()

In [None]:
# Check for duplicates
baaa_df[baaa_df.duplicated(keep=False)]

### ASN

In [None]:
asn_df = pd.read_csv('data/asn_scraped_data.csv')
asn_df.head()

In [None]:
asn_df.info()

In [None]:
asn_df.isnull().sum()

In [None]:
# Check for duplicates
asn_df[asn_df.duplicated(keep=False)]

---

## Data Cleaning

In [None]:
# Remove duplicates
baaa_df = baaa_df.drop_duplicates()
asn_df = asn_df.drop_duplicates()

In [None]:
# Strip whitespaces
def remove_whitespaces(df):
	for column in df.columns:
		if df[column].dtype == 'object':
			df[column] = df[column].str.strip()
	return df

baaa_df = remove_whitespaces(baaa_df)
asn_df = remove_whitespaces(asn_df)

### Merge dataframes on date and registration number

Although it is unlikely, the same aircraft can be involved in more than one accident, so the registration number alone is not a unique key. Combining the registration number with the date gives a key that should identify each accident.

The BAAA dataset is the main (left) table because it is the more reliable of the two; the ASN dataset is the secondary (right) table.
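
As a rough illustration of why the composite key matters, here is a sketch on toy data (all values are made up). It checks that the key is unique and uses the `validate` and `indicator` options of `pd.merge` to inspect the result. On the real data, rows with a missing registration may repeat the key, in which case `validate` would raise; that is an assumption to check, not something this notebook verifies.

```python
import pandas as pd

# Toy stand-ins for baaa_df and asn_df (values are made up for illustration)
left = pd.DataFrame({
    'registration': ['N123AB', 'N123AB', 'G-ABCD'],
    'date_str': ['1980-01-01', '1995-06-15', '1980-01-01'],
    'operator': ['A', 'A', 'B'],
})
right = pd.DataFrame({
    'registration': ['N123AB', 'G-ABCD'],
    'date_str': ['1995-06-15', '1980-01-01'],
    'category': ['Accident', 'Incident'],
})

key = ['registration', 'date_str']

# The registration alone repeats, but the (registration, date) pair does not
registration_repeats = left['registration'].duplicated().any()
key_repeats = left.duplicated(subset=key).any()

# validate='one_to_one' raises MergeError if either side repeats a key;
# indicator=True adds a _merge column showing where each row came from
merged = pd.merge(left, right, how='left', on=key,
                  validate='one_to_one', indicator=True)
match_counts = merged['_merge'].value_counts()
```

Rows marked `left_only` in `_merge` are BAAA accidents with no ASN counterpart.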

In [None]:
# Convert BAAA date to datetime
baaa_df['date'] = pd.to_datetime(baaa_df['date'], format='%b %d, %Y at %H%M LT', errors='coerce') \
				.fillna(pd.to_datetime(baaa_df['date'], format='%b %d, %Y', errors='coerce'))
assert baaa_df['date'].isna().sum() == 0
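
The two-pass parsing above can be checked on a small sample. This sketch uses two made-up strings in the shapes the format strings expect and assumes an English locale for the month abbreviations.

```python
import pandas as pd

# Made-up samples: one with a local time, one with a date only
raw = pd.Series(['Mar 17, 2025 at 1818 LT', 'Mar 07, 2025'])

# Each pass yields NaT for the rows that do not match its format
with_time = pd.to_datetime(raw, format='%b %d, %Y at %H%M LT', errors='coerce')
date_only = pd.to_datetime(raw, format='%b %d, %Y', errors='coerce')

# Rows that failed the first format are filled from the second pass
parsed = with_time.fillna(date_only)
```

Rows without a time end up at midnight, which is why the merge later uses a date-only string rather than the full timestamp.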

In [None]:
# Convert ASN date to datetime (the time column is left as is)
asn_df['date'] = pd.to_datetime(asn_df['date'], format='%A %d %B %Y', errors='coerce')

In [None]:
# Create date string column
baaa_df['date_str'] = baaa_df['date'].dt.strftime('%Y-%m-%d')
asn_df['date_str'] = asn_df['date'].dt.strftime('%Y-%m-%d')

In [None]:
# Merge two dataframes
df = pd.merge(left=baaa_df, right=asn_df, how='left', on=['registration', 'date_str'])
df.head()

### Add latitude and longitude

In [None]:
# Merge location (BAAA, then ASN, then country)
df['location'] = df['location_x'].fillna(df['location_y']).fillna(df['country'])
df = df.drop(['location_x', 'location_y'], axis=1)
assert df['location'].isnull().sum() == 0

In [None]:
# Get coordinates from geocoder
geolocator = Nominatim(user_agent='aircraft_crashes_analysis')
geocoder = RateLimiter(geolocator.geocode, min_delay_seconds=1)

def get_coord(row, country=False):
	"""Return 'lat, lng' for the row's location (or its country if country=True), or NaN on failure."""
	result = np.nan
	query = row['country'] if country else row['location']

	try:
		location = geocoder(query, language='en', exactly_one=True)

		if location:
			print('Coordinates: ({}, {})'.format(location.latitude, location.longitude))
			result = '{}, {}'.format(location.latitude, location.longitude)
	except Exception as e:
		print('An error occurred: {}'.format(e))

	return result


In [None]:
# Add column with coordinates looked up from the location
df['lat_lng'] = df.apply(get_coord, country=False, axis=1)

In [None]:
# Impute missing coordinates with country coordinates
mask = df['lat_lng'].isna()
df.loc[mask, 'lat_lng'] = df[mask].apply(get_coord, country=True, axis=1)
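
Geocoding row by row repeats the lookup for every row that shares a location, and each call is rate-limited to one per second. One possible optimization, sketched below, is to look up each distinct location once and map the results back. `fake_geocode` is a hypothetical stand-in for the Nominatim call so the example runs without network access; it is not part of geopy.

```python
import numpy as np
import pandas as pd

calls = []

def fake_geocode(query):
    # Hypothetical stand-in for the rate-limited geocoder; records each call
    calls.append(query)
    known = {'Paris': '48.85, 2.35'}  # made-up coordinates for illustration
    return known.get(query, np.nan)

toy = pd.DataFrame({'location': ['Paris', 'Paris', 'Nowhere', 'Paris']})

# Look up each distinct location once, then map the results back onto the rows
lookup = {loc: fake_geocode(loc) for loc in toy['location'].unique()}
toy['lat_lng'] = toy['location'].map(lookup)
```

Here four rows need only two lookups; the saving on the real data depends on how often locations repeat.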

In [None]:
# Export data
df.to_csv('data/merged_data_with_coordinates.csv', index=False)

---

### Continue data cleaning with added coordinates

In [1]:
import numpy as np
import pandas as pd
import pickle

In [66]:
# Load data
df = pd.read_csv('data/merged_data_with_coordinates.csv', parse_dates=['date_x'])
df.head()

Unnamed: 0,date_x,aircraft_type,operator,registration,flight_phase,flight_type,survivors,site,schedule,msn_x,...,aircraft_damage,category,phase,nature,departure_airport,destination_airport,investigating_agency,confidence_rating,location,lat_lng
0,2025-03-17 18:18:00,BAe Jetstream 31,Línea Aérea Nacional de Honduras - LANHSA,HR-AYW,Takeoff (climb),Scheduled Revenue Flight,Yes,"Lake, Sea, Ocean, River",Roatán – La Ceiba,863,...,Destroyed,Accident,Initial climb,Passenger,Roatán-Juan Manuel Gálvez International Airpor...,La Ceiba-Goloson International Airport (LCE/MHLC),,"Information is only available from news, socia...",Roatán Islas de la Bahía,"16.34902105, -86.49775125625627"
1,2025-03-13 07:33:00,Cessna 525 CitationJet CJ2,LBL 525 CZ LLC,N525CZ,Takeoff (climb),Private,No,"Plain, Valley",Mesquite - Addison,525A-0380,...,Destroyed,Accident,Initial climb,Ferry/positioning,"Mesquite Metro Airport, TX (KHQZ)","Dallas-Addison Airport, TX (ADS/KADS)",NTSB,"Information is only available from news, socia...","Mesquite Metro, Texas","32.749898900000005, -96.53114976775751"
2,2025-03-07 00:00:00,Antonov AN-32,Indian Air Force - Bharatiya Vayu Sena,,Landing (descent or approach),Military,Yes,Airport (less than 10 km from airport),,,...,,,,,,,,,"Bagdogra, West Bengal","26.6981094, 88.3245465"
3,2025-03-04 09:54:00,BAe Jetstream 31,SAETA Perú (Servicios Aéreos Tarapota),OB-2178,Landing (descent or approach),Scheduled Revenue Flight,Yes,Airport (less than 10 km from airport),Iquitos - Güeppí,861,...,Destroyed,Accident,Landing,Passenger,Iquitos-Coronel FAP Francisco Secada Vignetta ...,Güeppi Airport (SPGP),,"Information is only available from news, socia...","Güeppí, Loreto","-0.1176738, -75.2510798"
4,2025-02-25 00:00:00,Antonov AN-26,Sudanese Air Force - Al Quwwat al-Jawwiya As-S...,,Takeoff (climb),Military,No,City,,,...,,,,,,,,,"Wadi Seidna AFB, Khartoum (الخرطوم)","14.5844444, 29.4917691"


In [67]:
# Load reasons
reasons_df = pd.read_csv('data/baaa_crash_reasons.csv', parse_dates=['date'], date_format='%b %d, %Y')
reasons_df.sample(5)

Unnamed: 0,date,registration,cause
2757,1995-10-26,N9NP,Human factor
16301,1947-07-31,G-AHZJ,Technical failure
14623,1969-09-26,CCCP-44984,Technical failure
20788,1941-12-07,W8417,"Terrorism act, hijacking, sabotage, any kind o..."
9239,1943-03-27,A65-2,Human factor


In [68]:
# Merge accident causes
df['date_str'] = df['date_x'].dt.strftime('%Y-%m-%d')
reasons_df['date_str'] = reasons_df['date'].dt.strftime('%Y-%m-%d')
df = pd.merge(df, reasons_df, how='left', on=['registration', 'date_str'])
df.columns

Index(['date_x', 'aircraft_type', 'operator', 'registration', 'flight_phase',
       'flight_type', 'survivors', 'site', 'schedule', 'msn_x', 'yom',
       'flight_number', 'country', 'region', 'crew_on_board',
       'crew_fatalities', 'pax_on_board', 'pax_fatalities',
       'other_fatalities_x', 'total_fatalities', 'captain_flying_hours',
       'captain_flying_hours_on_type', 'copilot_flying_hours',
       'copilot_flying_hours_on_type', 'aircraft_flying_hours',
       'aircraft_flight_cycles', 'date_str', 'date_y', 'time', 'type',
       'type_details', 'owner', 'msn_y', 'year_of_manufacture',
       'total_airframe_hrs', 'cycles', 'engine_model', 'fatalities',
       'other_fatalities_y', 'aircraft_damage', 'category', 'phase', 'nature',
       'departure_airport', 'destination_airport', 'investigating_agency',
       'confidence_rating', 'location', 'lat_lng', 'date', 'cause'],
      dtype='object')

### Split some columns into multiple

In [69]:
# Split schedule into 2 columns (the separator is a hyphen or an en dash surrounded by spaces)
schedule = df['schedule'].str.split(r'\s[-–]\s', n=1, expand=True, regex=True)
df['departure'] = schedule[0]
df['arrival'] = schedule[1]
df = df.drop('schedule', axis=1)

In [70]:
# Split lat_lng into 2 columns
split_columns = df['lat_lng'].str.split(', ', expand=True)
df['latitude'] = split_columns[0]
df['longitude'] = split_columns[1]
df = df.drop('lat_lng', axis=1)

In [71]:
# Split type details into 2 columns
df['type_details'] = df['type_details'].str.extract(r'(\bFirst flight: \d{4}, .*)$', expand=False)
details = df['type_details'].str.split(', ', expand=True)
df['first_flight'] = details[0].str.extract(r'(\d{4})', expand=False)
df['engine'] = details[1]
df = df.drop('type_details', axis=1)

In [72]:
# Split fatalities from ASN
fatalities = df['fatalities'].str.split(' / ', expand=True)
df['fatalities'] = fatalities[0].str.extract(r'(\d+)')
df['occupants'] = fatalities[1].str.extract(r'(\d+)')
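
To make the split above concrete, here is a sketch on made-up strings. The `Fatalities: n / Occupants: m` shape is an assumption about the scraped ASN field, inferred from the split and extract calls. The conversion with `pd.to_numeric` is an addition; the notebook converts these columns later.

```python
import pandas as pd

# Made-up values in the assumed shape of the ASN "Fatalities" field
raw = pd.Series(['Fatalities: 2 / Occupants: 5', 'Fatalities: 0 / Occupants: 1', None])

parts = raw.str.split(' / ', expand=True)

# expand=False returns a Series holding the first run of digits (NaN if none)
fatalities = pd.to_numeric(parts[0].str.extract(r'(\d+)', expand=False))
occupants = pd.to_numeric(parts[1].str.extract(r'(\d+)', expand=False))
```

Missing input stays missing in both results rather than raising.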

### Merge common columns

In [73]:
df['date'] = df['date_x']
df = df.drop(['date_x', 'date_y', 'time'], axis=1)
df['date'] = pd.to_datetime(df['date'].dt.strftime('%Y-%m-%d'))

In [74]:
df['operator'] = df['operator'].fillna(df['owner'])
df = df.drop('owner', axis=1)

In [75]:
df['type'] = df['aircraft_type'].fillna(df['type'])
df = df.drop('aircraft_type', axis=1)

In [76]:
df['yom'] = df['yom'].fillna(df['year_of_manufacture'])
df = df.drop('year_of_manufacture', axis=1)

In [77]:
df['aircraft_flying_hours'] = df['aircraft_flying_hours'].fillna(df['total_airframe_hrs'])
df = df.drop('total_airframe_hrs', axis=1)

In [78]:
df['aircraft_flight_cycles'] = df['aircraft_flight_cycles'].fillna(df['cycles'])
df = df.drop('cycles', axis=1)

In [79]:
df['msn'] = df['msn_x'].fillna(df['msn_y'])
df = df.drop(['msn_x', 'msn_y'], axis=1)

In [80]:
df['flight_phase'] = df['flight_phase'].fillna(df['phase'])
df = df.drop('phase', axis=1)

In [81]:
df['flight_type'] = df['flight_type'].fillna(df['nature'])
df = df.drop('nature', axis=1)

In [82]:
df['departure'] = df['departure_airport'].fillna(df['departure'])
df = df.drop('departure_airport', axis=1)

In [83]:
df['arrival'] = df['destination_airport'].fillna(df['arrival'])
df = df.drop('destination_airport', axis=1)

In [84]:
on_board = df['crew_on_board'] + df['pax_on_board']
df['occupants'] = on_board.fillna(df['occupants'])
df = df.drop(['crew_on_board', 'pax_on_board'], axis=1)

In [85]:
df['fatalities'] = df['total_fatalities'].fillna(df['fatalities'])
df = df.drop(['crew_fatalities', 'pax_fatalities', 'total_fatalities'], axis=1)

In [86]:
df['other_fatalities'] = df['other_fatalities_x'].fillna(df['other_fatalities_y'])
df = df.drop(['other_fatalities_x', 'other_fatalities_y'], axis=1)

### Drop rows and columns

#### Drop redundant/unnecessary columns

**registration, msn**<br>
These are unique identifiers of an aircraft.

**flight_number**<br>
It's a unique identifier of a flight.

**captain_flying_hours, captain_flying_hours_on_type, copilot_flying_hours, copilot_flying_hours_on_type, aircraft_flying_hours, aircraft_flight_cycles, departure, arrival**<br>
There are too many null values.

**date_str**<br>
It was used to merge the dataframes.

**survivors**<br>
It can be calculated with the number of occupants minus the number of fatalities.

**first_flight**<br>
It's the first flight of the aircraft type in general, not of the aircraft involved in the accident.

**investigating_agency, confidence_rating**<br>
It won't help categorize the data.
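
As an illustration of the `survivors` point above, the flag can be rebuilt from the two counts. This sketch uses made-up numbers and assumes `fatalities` counts only people on board (not `other_fatalities`).

```python
import pandas as pd

# Made-up counts for three accidents
toy = pd.DataFrame({'occupants': [10, 4, 150], 'fatalities': [10, 0, 3]})

# Number of survivors, plus a Yes/No flag in the style of the BAAA column
toy['survivor_count'] = toy['occupants'] - toy['fatalities']
toy['survivors'] = (toy['survivor_count'] > 0).map({True: 'Yes', False: 'No'})
```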

In [87]:
columns_to_drop = [
  'registration',
  'msn',
  'flight_number',
  'captain_flying_hours', 
  'captain_flying_hours_on_type', 
  'copilot_flying_hours',
  'copilot_flying_hours_on_type',
  'aircraft_flying_hours', 
  'aircraft_flight_cycles',
  'departure',
  'arrival',
  'date_str',
  'survivors',
  'first_flight',
  'investigating_agency',
  'confidence_rating']

df = df.drop(columns_to_drop, axis=1)

#### Drop rows

In [88]:
# Keep data from 1970 to now
df = df[df['date'].dt.year >= 1970]

In [89]:
# Drop rows where coordinates or occupants are null
subset = ['latitude', 'longitude', 'occupants']
df = df.dropna(subset=subset)

### Impute missing values

#### String columns

In [90]:
columns = df.drop(['latitude', 'longitude', 'occupants'], axis=1).select_dtypes(include='object').columns

for column in columns:
	df[column] = df[column].fillna('Unknown')

#### Numeric columns

In [91]:
# Impute missing other_fatalities with 0
df['other_fatalities'] = df['other_fatalities'].fillna(0)
assert df['other_fatalities'].isna().sum() == 0

In [92]:
# Impute missing yom with accident year minus average aircraft age
df['aircraft_age'] = df['date'].dt.year - df['yom']
df['aircraft_age'] = df['aircraft_age'].fillna(int(df['aircraft_age'].mean()))
df['yom'] = df['yom'].fillna(df['date'].dt.year - df['aircraft_age'])
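
A small worked example of this imputation, on made-up values, shows the arithmetic: the mean age is truncated to an integer and subtracted from the accident year wherever `yom` is missing.

```python
import pandas as pd

# Made-up rows; the third aircraft has no year of manufacture
toy = pd.DataFrame({
    'date': pd.to_datetime(['2000-05-01', '2010-07-01', '2020-01-15']),
    'yom': [1990.0, 1995.0, None],
})

toy['aircraft_age'] = toy['date'].dt.year - toy['yom']   # 10, 15, NaN
mean_age = int(toy['aircraft_age'].mean())               # int(12.5) gives 12
toy['aircraft_age'] = toy['aircraft_age'].fillna(mean_age)

# Missing yom becomes accident year minus the mean age: 2020 - 12 = 2008
toy['yom'] = toy['yom'].fillna(toy['date'].dt.year - toy['aircraft_age'])
```

Note that `int()` truncates rather than rounds, so a mean of 12.5 is imputed as 12.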

In [93]:
# Assert there are no more null values
assert df.isna().sum().sum() == 0

### Collapse categories

In [94]:
# Get unique values of flight phase
df['flight_phase'].sort_values().unique()

array(['Approach', 'En route', 'Flight', 'Landing',
       'Landing (descent or approach)', 'Parking', 'Take off',
       'Takeoff (climb)', 'Taxiing', 'Unknown'], dtype=object)

In [95]:
df['flight_phase'] = np.where(df['flight_phase'].isin(['Take off', 'Initial climb']), 'Takeoff (climb)', df['flight_phase'])
df['flight_phase'] = np.where(df['flight_phase'] == 'En route', 'Flight', df['flight_phase'])
df['flight_phase'] = np.where(df['flight_phase'].isin(['Landing', 'Approach']), 'Landing (descent or approach)', df['flight_phase'])
df['flight_phase'] = np.where(df['flight_phase'] == 'Taxi', 'Taxiing', df['flight_phase'])
df['flight_phase'] = np.where(df['flight_phase'] == 'Standing', 'Parking', df['flight_phase'])
df['flight_phase'].sort_values().unique()

array(['Flight', 'Landing (descent or approach)', 'Parking',
       'Takeoff (climb)', 'Taxiing', 'Unknown'], dtype=object)
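
The chained `np.where` calls above can also be written as a single mapping, which keeps all the regroupings in one place. The sketch below is an alternative formulation, not the notebook's own code, shown on a made-up sample.

```python
import pandas as pd

# Same regroupings as the np.where chain, expressed as old label -> new label
phase_map = {
    'Take off': 'Takeoff (climb)',
    'Initial climb': 'Takeoff (climb)',
    'En route': 'Flight',
    'Landing': 'Landing (descent or approach)',
    'Approach': 'Landing (descent or approach)',
    'Taxi': 'Taxiing',
    'Standing': 'Parking',
}

toy = pd.Series(['Take off', 'En route', 'Approach', 'Parking', 'Unknown'])

# Labels absent from the mapping are left unchanged
collapsed = toy.replace(phase_map)
```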

In [96]:
# Get unique values of flight_type
df['flight_type'].sort_values().unique()

array(['-', 'Aerial photography', 'Aerobatic', 'Ambulance', 'Bombing',
       'Calibration', 'Cargo',
       'Charter/Taxi (Non Scheduled Revenue Flight)', 'Cinematography',
       'Delivery', 'Demonstration', 'Executive/Corporate/Business',
       'Ferry', 'Fire fighting',
       'Geographical / Geophysical / Scientific', 'Government',
       'Humanitarian', 'Illegal (smuggling)', 'Meteorological / Weather',
       'Military', 'Positioning', 'Postal (mail)', 'Private',
       'Refuelling', 'Scheduled Revenue Flight',
       'Skydiving / Paratroopers', 'Spraying (Agricultural)', 'Supply',
       'Survey / Patrol / Reconnaissance', 'Test', 'Topographic',
       'Training', 'Unknown'], dtype=object)

In [97]:
# Regroup values
df['flight_type'] = np.where(df['flight_type'] == '-', 'Unknown', df['flight_type'])
df['flight_type'].sort_values().unique()

array(['Aerial photography', 'Aerobatic', 'Ambulance', 'Bombing',
       'Calibration', 'Cargo',
       'Charter/Taxi (Non Scheduled Revenue Flight)', 'Cinematography',
       'Delivery', 'Demonstration', 'Executive/Corporate/Business',
       'Ferry', 'Fire fighting',
       'Geographical / Geophysical / Scientific', 'Government',
       'Humanitarian', 'Illegal (smuggling)', 'Meteorological / Weather',
       'Military', 'Positioning', 'Postal (mail)', 'Private',
       'Refuelling', 'Scheduled Revenue Flight',
       'Skydiving / Paratroopers', 'Spraying (Agricultural)', 'Supply',
       'Survey / Patrol / Reconnaissance', 'Test', 'Topographic',
       'Training', 'Unknown'], dtype=object)

In [98]:
# Get unique values of aircraft_damage
df['aircraft_damage'].sort_values().unique()

array(['Aircraft missing, written off', 'Destroyed',
       'Destroyed, written off', 'Minor, repaired', 'Minor, written off',
       'Substantial', 'Substantial, repaired', 'Substantial, written off',
       'Unknown', 'Unknown, written off'], dtype=object)

In [99]:
# Regroup values
df['aircraft_damage'] = df['aircraft_damage'].str.replace(', written off', '')
df['aircraft_damage'].sort_values().unique()

array(['Aircraft missing', 'Destroyed', 'Minor', 'Minor, repaired',
       'Substantial', 'Substantial, repaired', 'Unknown'], dtype=object)

In [100]:
# Get unique values of cause
df['cause'].sort_values().unique()

array(['Human factor', 'Other causes', 'Technical failure',
       'Terrorism act, hijacking, sabotage, any kind of hostile action',
       'Unknown', 'Weather'], dtype=object)

In [102]:
df['cause'] = np.where(df['cause'] == 'Other causes', 'Unknown', df['cause'])

In [103]:
df['category'].sort_values().unique()

array(['Accident', 'Incident', 'Other', 'Serious incident', 'UK',
       'Unknown', 'Unlawful Interference'], dtype=object)

In [104]:
# Merge 'Other' and 'UK' into 'Unknown'
df['category'] = np.where(df['category'].isin(['Other', 'UK']), 'Unknown', df['category'])
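
The regrouping steps above each fold one or more labels into `Unknown`. As a minimal sketch on toy data, the same idea can be written with a lookup dictionary and `Series.replace`, which keeps the mapping in one place. The `regroup` dictionary and the sample values are assumptions made for illustration, not part of the notebook.

```python
import pandas as pd

# Toy values mirroring the category labels shown above.
category = pd.Series(['Accident', 'Other', 'UK', 'Incident', 'Unknown'])

# Hypothetical lookup table: label to fold -> label it folds into.
regroup = {'Other': 'Unknown', 'UK': 'Unknown'}

# Labels that are not keys of the dictionary are left unchanged.
category = category.replace(regroup)
```

Whether this reads better than `np.where` is a matter of taste; the dictionary form mostly pays off when several labels are merged at once.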

### Convert columns

In [105]:
# Convert yom, occupants, fatalities and other_fatalities to int
df[['yom', 'occupants', 'fatalities', 'other_fatalities']] = df[['yom', 'occupants', 'fatalities', 'other_fatalities']].astype('int')

In [106]:
# Convert coordinates to float
df[['latitude', 'longitude']] = df[['latitude', 'longitude']].astype('float')

In [107]:
df['category'].sort_values().unique()

array(['Accident', 'Incident', 'Serious incident', 'Unknown',
       'Unlawful Interference'], dtype=object)

In [108]:
categories = ['Unknown', 'Incident', 'Serious incident', 'Accident', 'Unlawful Interference']
df['category'] = pd.Categorical(df['category'], categories, ordered=True)
df['category'].cat.categories

Index(['Unknown', 'Incident', 'Serious incident', 'Accident',
       'Unlawful Interference'],
      dtype='object')

In [109]:
df['aircraft_damage'].sort_values().unique()

array(['Aircraft missing', 'Destroyed', 'Minor', 'Minor, repaired',
       'Substantial', 'Substantial, repaired', 'Unknown'], dtype=object)

In [110]:
categories = [
  'Unknown',
  'Minor, repaired',
  'Minor',
  'Substantial, repaired',
  'Substantial',
  'Destroyed',
  'Aircraft missing']
df['aircraft_damage'] = pd.Categorical(df['aircraft_damage'], categories, ordered=True)
df['aircraft_damage'].cat.categories

Index(['Unknown', 'Minor, repaired', 'Minor', 'Substantial, repaired',
       'Substantial', 'Destroyed', 'Aircraft missing'],
      dtype='object')
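
Declaring the damage levels as an ordered categorical is what makes severity comparisons and severity-based sorting possible later on. The sketch below illustrates this on a small, made-up sample. It also shows a side effect worth keeping in mind: a value that is not in the category list becomes missing without any warning. The sample values and the `'Written off'` label are assumptions for illustration only.

```python
import pandas as pd

damage_levels = [
    'Unknown', 'Minor, repaired', 'Minor', 'Substantial, repaired',
    'Substantial', 'Destroyed', 'Aircraft missing']

# Toy sample, not taken from the dataset.
sample = ['Minor', 'Destroyed', 'Unknown', 'Substantial']
damage = pd.Series(pd.Categorical(sample, damage_levels, ordered=True))

# Ordered categoricals can be compared against one of their labels.
severe = damage >= 'Substantial'

# Sorting follows the declared order, not the alphabetical order.
ranked = damage.sort_values().tolist()

# A label outside the category list is silently converted to NaN.
unexpected = pd.Categorical(['Written off'], damage_levels, ordered=True)
n_missing = int(unexpected.isna().sum())
```

Because of that last behavior, checking `isna().sum()` right after the conversion is a cheap way to catch a label that was missed during regrouping.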

In [111]:
df['engine'].unique()

array(['2 Turboprop engines', '2 Jet engines', 'Unknown',
       '1 Turboprop engine', '1 Piston engine', '2 Piston engines',
       '4 Jet engines', '4 Piston engines', '3 Jet engines',
       '1 Jet engine', '4 Turboprop engines', '3 Piston engines'],
      dtype=object)

In [112]:
df['engine'] = np.where(df['engine'].isin([
  '2 Piston engines',
  '3 Piston engines',
  '4 Piston engines',
  '6 Piston engines']), 'Multi Piston Engines', df['engine'])
df['engine'].unique()

array(['2 Turboprop engines', '2 Jet engines', 'Unknown',
       '1 Turboprop engine', '1 Piston engine', 'Multi Piston Engines',
       '4 Jet engines', '3 Jet engines', '1 Jet engine',
       '4 Turboprop engines'], dtype=object)

In [113]:
df['engine'] = np.where(df['engine'].isin([
  '2 Turboprop engines',
  '3 Turboprop engines',
  '4 Turboprop engines']), 'Multi Turboprop Engines', df['engine'])
df['engine'].unique()

array(['Multi Turboprop Engines', '2 Jet engines', 'Unknown',
       '1 Turboprop engine', '1 Piston engine', 'Multi Piston Engines',
       '4 Jet engines', '3 Jet engines', '1 Jet engine'], dtype=object)

In [114]:
df['engine'] = np.where(df['engine'].isin([
  '2 Jet engines',
  '3 Jet engines',
  '4 Jet engines']), 'Multi Jet Engines', df['engine'])
df['engine'].unique()

array(['Multi Turboprop Engines', 'Multi Jet Engines', 'Unknown',
       '1 Turboprop engine', '1 Piston engine', 'Multi Piston Engines',
       '1 Jet engine'], dtype=object)

In [115]:
categories = [
  'Unknown',
  '1 Piston engine',
  'Multi Piston Engines',
  '1 Turboprop engine',
  'Multi Turboprop Engines',
  '1 Jet engine',
  'Multi Jet Engines']
df['engine'] = pd.Categorical(df['engine'], categories, ordered=True)
df['engine'].cat.categories

Index(['Unknown', '1 Piston engine', 'Multi Piston Engines',
       '1 Turboprop engine', 'Multi Turboprop Engines', '1 Jet engine',
       'Multi Jet Engines'],
      dtype='object')
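
The three `np.where` cells above apply the same rule to piston, turboprop and jet engines: a count of one is kept, and any higher count becomes `Multi <kind> Engines`. As a hedged alternative, the rule could be expressed once with a regular expression. The `group_engine` helper below is hypothetical and assumes every label follows the `<count> <kind> engine(s)` pattern seen in the output above.

```python
import re

import pandas as pd


def group_engine(label: str) -> str:
    """Collapse '<n> <kind> engine(s)' into a single or multi engine label."""
    match = re.fullmatch(r'(\d+) (\w+) engines?', label)
    if match is None:
        # Labels such as 'Unknown' or already grouped values pass through.
        return label
    count, kind = int(match.group(1)), match.group(2)
    return label if count == 1 else f'Multi {kind} Engines'


# Toy sample using labels from the output above.
engines = pd.Series([
    '2 Turboprop engines', '1 Jet engine', 'Unknown',
    '4 Piston engines', '3 Jet engines'])
grouped = engines.map(group_engine)
```

The function is idempotent, so running it on already grouped labels leaves them unchanged.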

### Validate values

In [116]:
df.describe()

Unnamed: 0,yom,fatalities,date,latitude,longitude,occupants,other_fatalities,aircraft_age
count,13631.0,13631.0,13631,13631.0,13631.0,13631.0,13631.0,13631.0
mean,1972.164258,6.506493,1992-08-11 08:55:04.467757440,27.385568,-22.33269,14.179737,0.151786,19.939843
min,0.0,0.0,1970-01-02 00:00:00,-72.843869,-179.491343,0.0,0.0,-17588.0
25%,1964.0,0.0,1979-10-14 12:00:00,10.641227,-90.051764,2.0,0.0,9.0
50%,1973.0,1.0,1990-08-28 00:00:00,34.217637,-60.621353,4.0,0.0,19.0
75%,1981.0,4.0,2003-12-17 12:00:00,45.035153,36.722534,9.0,0.0,28.0
max,19567.0,520.0,2025-03-17 00:00:00,80.916649,178.029725,524.0,297.0,2000.0
std,161.986414,21.878993,,25.823273,83.77619,34.198537,3.506298,161.905


In [117]:
# Get rows with yom below 1900
low_yom = df['yom'] < 1900
df[low_yom]

Unnamed: 0,operator,flight_phase,flight_type,site,yom,country,region,type,engine_model,fatalities,...,category,location,date,cause,latitude,longitude,engine,occupants,other_fatalities,aircraft_age
1101,Technoservis-A,Takeoff (climb),Spraying (Agricultural),"Plain, Valley",16,Russia,Asia,PZL-Mielec AN-2,Unknown,0,...,Accident,"Aksarino, Republic of Tatarstan",2016-04-03,Human factor,55.342326,51.906545,1 Piston engine,1,0,2000.0
1310,FlyBe,Takeoff (climb),Scheduled Revenue Flight,Airport (less than 10 km from airport),23,United Kingdom,Europe,Saab 340,General Electric CT7-9B,0,...,Accident,"Stornoway, Hebrides Islands",2015-01-02,Weather,58.207704,-6.382723,Multi Turboprop Engines,29,0,1992.0
1346,Air Century (ACSA),Landing (descent or approach),Charter/Taxi (Non Scheduled Revenue Flight),Airport (less than 10 km from airport),18,Dominican Republic,Central America,BAe Jetstream 31,Garrett TPE331,0,...,Accident,"Punta Cana, La Altagracia",2014-10-12,Technical failure,18.556551,-68.369161,Multi Turboprop Engines,13,0,1996.0
1392,Skyward International Aviation,Takeoff (climb),Cargo,City,26,Kenya,Africa,Fokker 50,Pratt & Whitney Canada PW125B,4,...,Accident,"Nairobi-Jomo Kenyatta (ex Embakasi), Nairobi C...",2014-07-02,Human factor,1.441968,38.431398,Multi Turboprop Engines,4,0,1988.0
8120,Rural Aerial co-op,Flight,Spraying (Agricultural),"Plain, Valley",1,New Zealand,Oceania,Fletcher FU-24,Unknown,0,...,Unknown,"Ihuraua, Manawatu-Wanganui (Horizons Regional ...",1986-07-31,Unknown,-41.500083,172.834408,Unknown,0,0,1985.0
10428,Jose Benitez,Landing (descent or approach),Private,Airport (less than 10 km from airport),254,United States of America,North America,Convair CV-440 Metropolitan,Unknown,0,...,Accident,"Key West-Intl, Florida",1979-04-16,Technical failure,24.555477,-81.759616,Multi Piston Engines,2,0,1725.0
11258,German Air Force - Deutsche Luftwaffe,Unknown,Military,Airport (less than 10 km from airport),0,Germany,Europe,Dornier DO.28D Skyservant,Unknown,0,...,Accident,"Kaufbeuren AFB, Bavaria",1977-02-28,Unknown,51.163818,10.447831,Multi Piston Engines,0,0,1977.0
13170,Aeroflot - Russian International Airlines,Flight,Spraying (Agricultural),"Plain, Valley",2,Russia,Asia,PZL-Mielec AN-2,Unknown,2,...,Accident,"Pochep, Bryansk oblast",1971-07-15,Human factor,52.92878,33.454536,1 Piston engine,2,0,1969.0
13233,Aeroflot - Russian International Airlines,Flight,Scheduled Revenue Flight,"Plain, Valley",2,Ukraine,Europe,PZL-Mielec AN-2,Unknown,0,...,Accident,"Chernivtsi, Chernivtsi Oblast",1971-04-29,Technical failure,48.28647,25.937653,1 Piston engine,0,0,1969.0
13471,Aeroflot - Russian International Airlines,Flight,Spraying (Agricultural),"Plain, Valley",29,Russia,Asia,PZL-Mielec AN-2,Unknown,1,...,Accident,"Satyshevo, Republic of Tatarstan",1970-07-19,Human factor,64.686314,97.745306,1 Piston engine,2,0,1941.0


In [118]:
# Replace implausible yom with the crash year minus the mean aircraft age
df.loc[low_yom, 'yom'] = df['date'].dt.year - int(df['aircraft_age'].mean())
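
Note that the mean of `aircraft_age` is computed while the column still contains the implausible ages shown in `df.describe()` (for example -17588 and 2000). A possible variation, sketched here on toy data, is to derive the typical age from plausible rows only and to use the median, which is less sensitive to outliers. The 1900 to 2025 bounds and the toy rows are assumptions for illustration; this is not the approach used in the notebook.

```python
import pandas as pd

# Toy rows: two implausible yom values and two plausible ones.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2016-04-03', '1979-02-12', '1990-08-28', '2003-12-17']),
    'yom': [16, 19567, 1970, 1973],
})
toy['aircraft_age'] = toy['date'].dt.year - toy['yom']

# Assumed plausibility window for the year of manufacture.
plausible = toy['yom'].between(1900, 2025)

# Typical age estimated from plausible rows only.
typical_age = int(toy.loc[plausible, 'aircraft_age'].median())

# Impute the implausible rows as crash year minus the typical age.
toy.loc[~plausible, 'yom'] = toy.loc[~plausible, 'date'].dt.year - typical_age
```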

In [119]:
# Get rows with a yom later than 2025 (here, a 5-digit value)
high_yom = df['yom'] > 2025
df[high_yom]

Unnamed: 0,operator,flight_phase,flight_type,site,yom,country,region,type,engine_model,fatalities,...,category,location,date,cause,latitude,longitude,engine,occupants,other_fatalities,aircraft_age
10488,Air Rhodesia,Takeoff (climb),Scheduled Revenue Flight,Airport (less than 10 km from airport),19567,Zimbabwe,Africa,Vickers Viscount,Unknown,59,...,Unlawful Interference,"Kariba, Mashonaland West",1979-02-12,"Terrorism act, hijacking, sabotage, any kind o...",-16.527274,28.775548,Multi Turboprop Engines,59,0,-17588.0


In [120]:
# After checking on the ASN website, replace with 1956
df.loc[high_yom, 'yom'] = 1956

In [121]:
# Make sure fatalities are not greater than occupants
df.loc[df['fatalities'] > df['occupants'], 'fatalities'] = df['occupants']
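
The checks in this section could be gathered into one function that counts the rows breaking each rule, which makes it easy to rerun them after every fix. The sketch below is one way to do that. The `find_violations` helper, its rule names and the coordinate bounds are assumptions added for illustration and do not appear elsewhere in the notebook.

```python
import pandas as pd


def find_violations(frame: pd.DataFrame, max_year: int = 2025) -> dict:
    """Count the rows that break each sanity rule (hypothetical helper)."""
    return {
        'yom_out_of_range':
            int((~frame['yom'].between(1900, max_year)).sum()),
        'fatalities_exceed_occupants':
            int((frame['fatalities'] > frame['occupants']).sum()),
        'latitude_out_of_range':
            int((~frame['latitude'].between(-90, 90)).sum()),
        'longitude_out_of_range':
            int((~frame['longitude'].between(-180, 180)).sum()),
    }


# Toy rows: one bad yom and one row with more fatalities than occupants.
toy = pd.DataFrame({
    'yom': [1956, 16, 1980],
    'occupants': [59, 1, 4],
    'fatalities': [59, 0, 6],
    'latitude': [-16.5, 55.3, 1.4],
    'longitude': [28.8, 51.9, 38.4],
})
report = find_violations(toy)
```

On the cleaned DataFrame, every count would be expected to be zero before exporting.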

### Export data

In [122]:
# Reorder columns
df = df[[
  'date',
  'category',
  'type',
  'operator',
  'yom',
  'engine',
  'engine_model',
  'flight_phase',
  'flight_type',
  'site',
  'location',
  'country',
  'region',
  'latitude',
  'longitude',
  'aircraft_damage',
  'occupants',
  'fatalities',
  'other_fatalities',
  'cause'
  ]]

In [123]:
# Sort data from the earliest to the latest crash
df = df.sort_values(by='date')

In [124]:
# Reset index
df = df.reset_index(drop=True)

In [125]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13631 entries, 0 to 13630
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   date              13631 non-null  datetime64[ns]
 1   category          13631 non-null  category      
 2   type              13631 non-null  object        
 3   operator          13631 non-null  object        
 4   yom               13631 non-null  int64         
 5   engine            13631 non-null  category      
 6   engine_model      13631 non-null  object        
 7   flight_phase      13631 non-null  object        
 8   flight_type       13631 non-null  object        
 9   site              13631 non-null  object        
 10  location          13631 non-null  object        
 11  country           13631 non-null  object        
 12  region            13631 non-null  object        
 13  latitude          13631 non-null  float64       
 14  longitude         1363

In [126]:
# Serialize data with pickle
with open('data/cleaned_data.pkl', 'wb') as handle:
  pickle.dump(df, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [127]:
# Export data to CSV
df.to_csv('data/cleaned_data.csv', index=False)
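
The pickle file preserves the dtypes set above, but a CSV file stores only text, so the datetime and ordered categorical columns have to be restored when the CSV is read back. The sketch below shows this round trip on a toy frame and an in-memory buffer. The toy rows are assumptions; only the category order is taken from the notebook.

```python
import io

import pandas as pd

levels = ['Unknown', 'Incident', 'Serious incident', 'Accident',
          'Unlawful Interference']

# Toy frame with the same dtypes as the cleaned data.
toy = pd.DataFrame({
    'date': pd.to_datetime(['1970-01-02', '1979-02-12']),
    'category': pd.Categorical(
        ['Accident', 'Unlawful Interference'], levels, ordered=True),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Dates must be parsed explicitly when reading the CSV back.
reloaded = pd.read_csv(buffer, parse_dates=['date'])

# The category column comes back as plain text, not as a categorical.
was_categorical = isinstance(reloaded['category'].dtype, pd.CategoricalDtype)

# Restore the ordered categorical with the same level order.
reloaded['category'] = pd.Categorical(
    reloaded['category'], levels, ordered=True)
```

The same restoration would apply to the `aircraft_damage` and `engine` columns, each with its own list of levels.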

## End