# Aircraft Crashes Data Collection And Cleaning

## Overview

This notebook collects and prepares the data for the analysis of all the aircraft accidents since 1970.

### About dataset

The data will be scraped from the [BAAA Crash Archives](https://www.baaa-acro.com/crash-archives) and the [ASN Database](https://asn.flightsafety.org/database/).

**BAAA**

`date` date and local time of the accident<br>
`aircraft_type` aircraft make and model<br>
`operator` operator of the aircraft<br>
`registration` unique code identifying a single aircraft, required by international convention<br>
`flight_phase` phase of the flight when the accident occurred<br>
`flight_type` type of flight (e.g. military)<br>
`survivors` indicates whether there were survivors<br>
`site` type of location where the accident happened (e.g. mountains)<br>
`departure` city where the departure was planned<br> 
`arrival` city where the arrival was planned<br> 
`msn` manufacturer's serial number of the aircraft<br>
`yom` year of manufacture of the aircraft involved in the accident<br>
`flight_number` flight number<br>
`location` location of the accident<br>
`country` country where the crash happened<br>
`region` region of the world where the crash happened<br>
`crew_on_board` number of crew members on board at the time of the accident<br>
`crew_fatalities` number of crew members who died in the crash<br>
`pax_on_board` number of passengers on board at the time of the accident<br> 
`pax_fatalities` number of passengers who died in the crash<br>                 
`other_fatalities` other victims of the accident outside of the aircraft<br>
`total_fatalities` total number of deaths<br>
`captain_flying_hours` number of flying hours of the captain<br>
`captain_flying_hours_on_type` number of hours the captain flew on the type of aircraft involved in the crash<br>
`copilot_flying_hours` number of flying hours of the copilot<br>  
`copilot_flying_hours_on_type` number of hours the copilot flew on the type of aircraft involved in the crash<br>  
`aircraft_flying_hours` number of flying hours of the aircraft before the crash<br>
`aircraft_flight_cycles` number of flights of the aircraft<br><br>


**ASN**

`date` date of the accident<br>
`time` time of the accident<br>
`type` make and model of the aircraft<br>
`first_flight` year of the first flight of the aircraft type<br>
`engine` type and number of engines<br>
`owner` operator of the aircraft<br>
`registration` unique code identifying a single aircraft, required by international convention<br>
`msn` manufacturer's serial number of the aircraft<br>
`year_of_manufacture` year of manufacture of the aircraft involved in the accident<br>
`total_airframe_hrs` number of flying hours of the aircraft before the crash<br>
`cycles` number of flights of the aircraft<br>
`engine_model` make and model of the aircraft engine<br>
`fatalities` total number of fatalities<br>
`occupants` number of crew members and passengers on board<br>
`other_fatalities` other victims of the accident outside of the aircraft<br>
`aircraft_damage` severity of the aircraft damage<br>
`category` type of accident<br>
`location` location of the crash<br>
`phase` phase of the flight when the accident occurred<br>
`nature` type of flight (e.g. military)<br>
`departure_airport` airport where the departure was planned<br>
`destination_airport` airport where the arrival was planned<br>
`investigating_agency` agency that produced the accident report<br>
`confidence_rating` quality of the information (e.g. missing information)

---

## Data Collection

In [None]:
from bs4 import BeautifulSoup
import math
import os
import pandas as pd
import re
import requests
from urllib.parse import unquote

In [None]:
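def export_list_to_csv(data:list, csv_path:str) -> None:
	"""Append a list of dicts to a CSV file, writing the header only when the file is created."""
	if not data: # An empty batch would create a file without a header
		return
	os.makedirs(os.path.dirname(csv_path) or '.', exist_ok=True)
	df = pd.DataFrame(data)
	write_header = not os.path.isfile(csv_path)
	df.to_csv(csv_path, index=False, header=write_header, mode='a')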
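
The scraping loops below write one batch of rows per listing page, so the export has to append to the file and emit the header only once. A minimal, self-contained sketch of that pattern follows; `append_rows_to_csv`, the temporary file and the sample rows are hypothetical and only serve the illustration.

```python
import os
import tempfile

import pandas as pd


def append_rows_to_csv(rows, csv_path):
    """Append a batch of dict rows to csv_path, writing the header only once."""
    if not rows:
        # Skip empty batches so the file is never created without a header
        return
    write_header = not os.path.isfile(csv_path)
    pd.DataFrame(rows).to_csv(csv_path, index=False, header=write_header, mode='a')


# Two batches written one after the other (made-up sample rows)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.csv')
    append_rows_to_csv([{'date': 'Jan 1, 1970', 'registration': 'N12345'}], demo_path)
    append_rows_to_csv([{'date': 'Jan 2, 1970', 'registration': 'N67890'}], demo_path)
    demo_df = pd.read_csv(demo_path)

print(demo_df)
```

Appending only lines up correctly when every batch has the same keys in the same order, which is why the scrapers set every field on every row, using `None` when a value is missing.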

### BAAA

In [None]:
# Scrape total number of accidents
root_url = 'https://www.baaa-acro.com'

response = requests.get(root_url)
soup = BeautifulSoup(response.content, 'html.parser')
accident_files = soup.find('div', {'class': 'total-accident-files'})
nb_crashes = int(accident_files.text.replace(',', ''))
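
The total scraped above is turned into a number of listing pages in the next cell. As a small sketch of that arithmetic: the counter text `'35,041'` is a made-up value and `page_indices` is a hypothetical helper; the page size of 20 and the zero-based `page` parameter are taken from the scraping loop.

```python
import math


def page_indices(total_items, items_per_page):
    """Return the zero-based page indices needed to cover total_items."""
    return range(math.ceil(total_items / items_per_page))


# The counter is displayed with a thousands separator, hence the replace()
nb_crashes_demo = int('35,041'.replace(',', ''))
pages = page_indices(nb_crashes_demo, 20)

print(len(pages), pages[0], pages[-1])
```

Rounding up with `math.ceil` keeps the last, partially filled page; a total that is an exact multiple of the page size does not produce an extra empty page.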
	

In [None]:
# Scrape details of all accidents
nb_rows_per_page = 20
nb_pages = math.ceil(nb_crashes / nb_rows_per_page)
csv_path = 'data/baaa_scraped_data.csv'

for i in range(nb_pages):
	listing_url = '{}/crash-archives?page={}'.format(root_url, i)
	response = requests.get(listing_url)
	soup = BeautifulSoup(response.content, 'html.parser')
	anchors = soup.find_all('a', {'class': 'red-btn'})

	crash_list = []
	
	for j, a in enumerate(anchors):
		link = a['href']
		#print('Page {}, link {}: {}{}'.format(i, j + 1, root_url, link))
		details_url = root_url + link
		response = requests.get(details_url)
		soup = BeautifulSoup(response.content, 'html.parser')
		details = {}
		
		details_div = soup.find('div', {'class': 'crash-details'})
		
		date_div = details_div.find('div', {'class': 'crash-date'})
		details['date'] = date_div.find('span').next_sibling.text.strip() if date_div else None
		
		aircraft_div = details_div.find('div', {'class': 'crash-aircraft'})
		details['aircraft_type'] = aircraft_div.find('a').find('div').text if aircraft_div else None
		
		operator_div = details_div.find('div', {'class': 'crash-operator'})

		if operator_div:
			if (operator_div.find('img')): # Extract operator name from image link
				pattern = re.compile(r'(?<=target_id=).*(?= \(\d+\))')
				img_link = unquote(operator_div.find('img').parent['href'])
				details['operator'] = pattern.search(img_link).group(0)
			else:
				details['operator'] = operator_div.find('a').find('div').text
		else:
			details['operator'] = None

		reg_div = details_div.find('div', {'class': 'crash-registration'})
		details['registration'] = reg_div.find('div').text if reg_div else None
		
		flight_phase_div = details_div.find('div', {'class': 'crash-flight-phase'})
		details['flight_phase'] = flight_phase_div.find('a').find('div').text if flight_phase_div else None
		
		flight_type_div = details_div.find('div', {'class': 'crash-flight-type'})
		details['flight_type'] = flight_type_div.find('a').find('div').text if flight_type_div else None
		
		survivors_div = details_div.find('div', {'class': 'crash-survivors'})
		details['survivors'] = survivors_div.find('a').find('div').text if survivors_div else None
		
		site_div = details_div.find('div', {'class': 'crash-site'})
		details['site'] = site_div.find('a').find('div').text if site_div else None
		
		schedule_div = details_div.find('div', {'class': 'crash-schedule'})
		details['schedule'] = schedule_div.find('div').text if schedule_div else None
		
		msn_div = details_div.find('div', {'class': 'crash-construction-num'})
		details['msn'] = msn_div.find('div').text if msn_div else None
		
		yom_div = details_div.find('div', {'class': 'crash-yom'})
		details['yom'] = yom_div.find('div').text if yom_div else None

		flight_number = details_div.find('div', {'class': 'crash-flight-number'})
		details['flight_number'] = flight_number.find('div').text if flight_number else None
		
		location_div = details_div.find('div', {'class': 'crash-location'})
		if location_div:
			location_details = location_div.select('a')
			details['location'] = ', '.join(item.text.strip() for item in location_details) if location_details else None
		else:
			details['location'] = None
		
		country_div = details_div.find('div', {'class': 'crash-country'})
		details['country'] = country_div.find('a').find('div').text if country_div else None
		
		region_div = details_div.find('div', {'class': 'crash-region'})
		details['region'] = region_div.find('a').find('div').text if region_div else None
		
		crew_on_board_div = details_div.find('div', {'class': 'crash-crew-on-board'})
		details['crew_on_board'] = crew_on_board_div.find('div').text if crew_on_board_div else None
		
		crew_fatalities_div = details_div.find('div', {'class': 'crash-crew-fatalities'})
		details['crew_fatalities'] = crew_fatalities_div.find('div').text if crew_fatalities_div else None
		
		pax_on_board_div = details_div.find('div', {'class': 'crash-pax-on-board'})
		details['pax_on_board'] = pax_on_board_div.find('div').text if pax_on_board_div else None
		
		pax_fatalities_div = details_div.find('div', {'class': 'crash-pax-fatalities'})
		details['pax_fatalities'] = pax_fatalities_div.find('div').text if pax_fatalities_div else None
		
		others_div = details_div.find('div', {'class': 'crash-other-fatalities'})
		details['other_fatalities'] = others_div.find('div').text if others_div else None
		
		total_fatalities_div = details_div.find('div', {'class': 'crash-total-fatalities'})
		details['total_fatalities'] = total_fatalities_div.find('div').text if total_fatalities_div else None

		captain_hours_div = details_div.find('div', {'class': 'captain-total-flying-hours'})
		details['captain_flying_hours'] = captain_hours_div.find('div').text if captain_hours_div else None

		captain_hours_type_div = details_div.find('div', {'class': 'captain-total-hours-type'})
		details['captain_flying_hours_on_type'] = captain_hours_type_div.find('div').text if captain_hours_type_div else None

		copilot_hours_div = details_div.find('div', {'class': 'copilot-total-flying-hours'})
		details['copilot_flying_hours'] = copilot_hours_div.find('div').text if copilot_hours_div else None

		copilot_hours_type_div = details_div.find('div', {'class': 'copilot-total-hours-type'})
		details['copilot_flying_hours_on_type'] = copilot_hours_type_div.find('div').text if copilot_hours_type_div else None

		aircraft_hours_div = details_div.find('div', {'class': 'crash-aircraft-flight-hours'})
		details['aircraft_flying_hours'] = aircraft_hours_div.find('div').text if aircraft_hours_div else None

		aircraft_cycles_div = details_div.find('div', {'class': 'crash-aircraft-flight-cycles'})
		details['aircraft_flight_cycles'] = aircraft_cycles_div.find('div').text if aircraft_cycles_div else None
		
		crash_list.append(details)
	
	export_list_to_csv(crash_list, csv_path)
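
When the operator is shown as a logo, the loop above recovers its name from the link around the image. The sketch below isolates that step, assuming the link looks like the made-up `demo_href` (the query parameter name and the numeric id are hypothetical); `operator_from_href` is an illustrative helper that also returns `None` instead of raising when the pattern does not match.

```python
import re
from urllib.parse import unquote

# Same pattern as in the scraping loop: the text between 'target_id=' and ' (<digits>)'
OPERATOR_PATTERN = re.compile(r'(?<=target_id=).*(?= \(\d+\))')


def operator_from_href(href):
    """Extract the operator name from a percent-encoded link, or None if absent."""
    match = OPERATOR_PATTERN.search(unquote(href))
    return match.group(0) if match else None


# Hypothetical link, decoded to '...target_id=Example Airways (12345)'
demo_href = '/crash-archives?field_crash_operator_target_id=Example%20Airways%20%2812345%29'

print(operator_from_href(demo_href))
print(operator_from_href('/crash-archives'))
```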


In [None]:
# Scrape accident causes
csv_path = 'data/baaa_crash_reasons.csv'

reasons = {
  'Human factor': 12990,
  'Other causes': 12992,
  'Technical failure': 12988,
  'Terrorism act, hijacking, sabotage, any kind of hostile action': 12991,
  'Unknown': 12993,
  'Weather': 12989
}

for reason, target_id in reasons.items():
	url = 'https://www.baaa-acro.com/crash-archives?field_crash_cause_target_id={}'.format(target_id)
	response = requests.get(url)
	soup = BeautifulSoup(response.content, 'html.parser')
	pattern = re.compile(r'\d+$')
	total_items_txt = soup.find('div', {'class': 'view-header'}).find('span').text
	total_items = int(pattern.search(total_items_txt).group(0))
	nb_items_per_page = 20
	nb_pages = math.ceil(total_items / nb_items_per_page)
	
	for i in range(nb_pages):
		page_url = url + '&page={}'.format(i)
		page_response = requests.get(page_url)
		page_soup = BeautifulSoup(page_response.content, 'html.parser')
		table = page_soup.find('table')
		rows = table.find_all('tr')

		crash_list = []
		for row in rows[1:]: # skip table header
			created = row.find('td', {'class': 'views-field-created'}).find('time').text
			registration_div = row.find('div', {'class': 'registration-field'})
			crash_list.append({
				'date': created,
				'registration': registration_div.text if registration_div else None,
				'cause': reason
			})
		export_list_to_csv(crash_list, csv_path)		

### ASN

In [None]:
csv_path = 'data/asn_scraped_data.csv'
root_url = 'https://asn.flightsafety.org'

# Add headers to avoid 403 Forbidden error
headers = {
  'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36'
}

database = '/database'
database_url = root_url + database

#for year in range(1919, 2026):
for year in range(1973, 2026):
	year_url = '{}{}/year/{}/1'.format(root_url, database, year)
	response = requests.get(year_url, headers=headers)
	soup = BeautifulSoup(response.content, 'html.parser')
	nb_occurences_txt = soup.find('div', {'class': 'innertube'}).find('span').text
	pattern = re.compile(r'\d+(?= occurrences)')
	nb_occurences = int(pattern.search(nb_occurences_txt).group(0))
	max_items_per_page = 100
	nb_pages = math.ceil(nb_occurences / max_items_per_page)

	for page in range(1, nb_pages + 1):
		page_url = '{}{}/year/{}/{}'.format(root_url, database, year, page)
		response = requests.get(page_url, headers=headers)
		soup = BeautifulSoup(response.content, 'html.parser')
		table = soup.find('table', {'class': 'hp'})
		anchors = table.find_all('a')
		links = [a['href'] for a in anchors]
		
		crash_list = []
		for i, link in enumerate(links):
			details_url = root_url + link
			print('Year {}, page {}, item {}, link: {}'.format(year, page, i + 1, details_url))
			response = requests.get(details_url, headers=headers)
			soup = BeautifulSoup(response.content, 'html.parser')
			table = soup.find('table')
			details = {}

			date_label = table.find('td', string='Date:')
			details['date'] = date_label.next_sibling.text

			time_label = table.find('td', string='Time:')
			details['time'] = time_label.next_sibling.text

			type_label = table.find('td', string='Type:')
			anchor = type_label.next_sibling.find('a')

			if anchor: # Get more details about aircraft if link exists
				details['type'] = anchor.text
				href = anchor['href']
				type_url = root_url + href
				type_response = requests.get(type_url, headers=headers)
				type_soup = BeautifulSoup(type_response.content, 'html.parser')
				type_table = type_soup.find('table')
				type_details = list(type_table.find('td', {'valign': 'top'}).stripped_strings)
				details['type_details'] = ', '.join(type_details)
			else:
				details['type'] = type_label.next_sibling.text
				details['type_details'] = None

			owner_label = table.find('td', string='Owner/operator:')
			details['owner'] = owner_label.next_sibling.text

			reg_label = table.find('td', string='Registration:')
			details['registration'] = reg_label.next_sibling.text

			msn_label = table.find('td', string='MSN:')
			details['msn'] = msn_label.next_sibling.text

			yom_label = table.find('td', string='Year of manufacture:')
			details['year_of_manufacture'] = yom_label.next_sibling.text if yom_label else None

			air_hours_label = table.find('td', string='Total airframe hrs:')
			details['total_airframe_hrs'] = air_hours_label.next_sibling.text if air_hours_label else None

			cycles_label = table.find('td', string='Cycles:')
			details['cycles'] = cycles_label.next_sibling.text if cycles_label else None

			engine_label = table.find('td', string='Engine model:')
			details['engine_model'] = engine_label.next_sibling.text if engine_label else None

			fatal_label = table.find('td', string='Fatalities:')
			details['fatalities'] = fatal_label.next_sibling.text

			other_label = table.find('td', string='Other fatalities:')
			details['other_fatalities'] = other_label.next_sibling.text

			damage_label = table.find('td', string='Aircraft damage:')
			details['aircraft_damage'] = damage_label.next_sibling.text

			cat_label = table.find('td', string='Category:')
			details['category'] = cat_label.next_sibling.text if cat_label else None

			loc_label = table.find('td', string='Location:')
			details['location'] = ' '.join(loc_label.next_sibling.stripped_strings)

			phase_label = table.find('td', string='Phase:')
			details['phase'] = phase_label.next_sibling.text

			nature_label = table.find('td', string='Nature:')
			details['nature'] = nature_label.next_sibling.text

			dep_label = table.find('td', string='Departure airport:')
			details['departure_airport'] = dep_label.next_sibling.text

			des_label = table.find('td', string='Destination airport:')
			details['destination_airport'] = des_label.next_sibling.text

			inv_label = table.find('td', string=re.compile('Investigating'))
			details['investigating_agency'] = inv_label.next_sibling.text if inv_label else None

			conf_label = table.find('td', string='Confidence Rating:')
			details['confidence_rating'] = ''.join(conf_label.next_sibling.stripped_strings) if conf_label else None

			crash_list.append(details)
		
		export_list_to_csv(crash_list, csv_path)	
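
The ASN loop repeats the same lookup for every field: find the label cell, then read the cell next to it. Only some of those lookups are guarded against a missing label. A possible helper is sketched below; `label_value` and the small HTML table are hypothetical, and the markup is only an assumption about how a details page is laid out.

```python
from bs4 import BeautifulSoup


def label_value(table, label):
    """Return the text of the cell that follows the label cell, or None."""
    label_cell = table.find('td', string=label)
    if label_cell is None:
        return None
    # find_next_sibling skips whitespace-only text nodes between the two cells
    value_cell = label_cell.find_next_sibling('td')
    return value_cell.get_text(strip=True) if value_cell else None


# Made-up fragment standing in for a details page
demo_html = '''
<table>
  <tr><td>Date:</td><td>Monday 1 January 1973</td></tr>
  <tr><td>Time:</td><td> 10:30 </td></tr>
</table>
'''
demo_table = BeautifulSoup(demo_html, 'html.parser').find('table')

print(label_value(demo_table, 'Date:'))
print(label_value(demo_table, 'Cycles:'))
```

With a helper of this kind a page that lacks a label yields `None` for that field instead of stopping the whole scrape with an `AttributeError`.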

---

## Data Exploration

In [1]:
from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Nominatim
import numpy as np
import pandas as pd

### BAAA

In [None]:
# Crashes
baaa_df = pd.read_csv('data/baaa_scraped_data.csv')
baaa_df.head()

In [None]:
# Causes
baaa_causes_df = pd.read_csv('data/baaa_crash_reasons.csv', parse_dates=['date'], date_format='%b %d, %Y')
baaa_causes_df.head()

In [None]:
baaa_df.info()

In [None]:
baaa_causes_df.info()

In [None]:
baaa_df.isnull().sum()

In [None]:
baaa_causes_df.isnull().sum()

In [None]:
# Check for duplicates
baaa_df[baaa_df.duplicated(keep=False)]

In [None]:
baaa_causes_df[baaa_causes_df.duplicated(keep=False)]

### ASN

In [None]:
asn_df = pd.read_csv('data/asn_scraped_data.csv')
asn_df.head()

In [None]:
asn_df.info()

In [None]:
asn_df.isnull().sum()

In [None]:
# Check for duplicates
asn_df[asn_df.duplicated(keep=False)]

---

## Data Cleaning

In [None]:
# Remove duplicates
baaa_df = baaa_df.drop_duplicates()
baaa_causes_df = baaa_causes_df.drop_duplicates()
asn_df = asn_df.drop_duplicates()

In [None]:
# Strip whitespaces
def remove_whitespaces(df):
	for column in df.columns:
		if df[column].dtype == 'object':
			df[column] = df[column].str.strip()
	return df

baaa_df = remove_whitespaces(baaa_df)
baaa_causes_df = remove_whitespaces(baaa_causes_df)
asn_df = remove_whitespaces(asn_df)

### Merge dataframes on date and registration number

The same aircraft can, although rarely, be involved in more than one accident, so the registration number alone does not identify an accident. Combining the registration number with the date gives a key that identifies each row.

The main (left) dataset is the one from BAAA, as it's the most reliable; the second (right) one is the ASN dataset.
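
As a minimal sketch of this composite key, the toy frames below contain one aircraft that appears twice on the left side; all registrations, dates and values are made up. The `validate` argument is an addition that is not used in this notebook: it makes pandas raise if a (registration, date) pair occurs more than once on the right side, which would duplicate left rows.

```python
import pandas as pd

# Left: main dataset, the same (hypothetical) aircraft appears on two dates
left = pd.DataFrame({
    'registration': ['N100AA', 'N100AA', 'N200BB'],
    'date_str': ['1980-01-01', '1985-06-15', '1990-03-03'],
    'site': ['Airport', 'Mountains', 'Airport'],
})

# Right: secondary dataset, covering only two of the three accidents
right = pd.DataFrame({
    'registration': ['N100AA', 'N200BB'],
    'date_str': ['1985-06-15', '1990-03-03'],
    'category': ['Accident', 'Incident'],
})

merged = pd.merge(left, right, how='left',
                  on=['registration', 'date_str'],
                  validate='many_to_one')

print(merged)
```

The left join keeps all three left rows; the accident without a match on the right simply gets a missing `category`.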

In [None]:
# Convert BAAA date to datetime
baaa_df['date'] = pd.to_datetime(baaa_df['date'], format='%b %d, %Y at %H%M LT', errors='coerce') \
				.fillna(pd.to_datetime(baaa_df['date'], format='%b %d, %Y', errors='coerce'))
assert baaa_df['date'].isna().sum() == 0

In [None]:
# Convert ASN date to datetime
asn_df['date'] = pd.to_datetime(asn_df['date'], format='%A %d %B %Y', errors='coerce')

In [None]:
# Create date string column
baaa_df['date_str'] = baaa_df['date'].dt.strftime('%Y-%m-%d')
baaa_causes_df['date_str'] = baaa_causes_df['date'].dt.strftime('%Y-%m-%d')
asn_df['date_str'] = asn_df['date'].dt.strftime('%Y-%m-%d')
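
The two sources write dates differently, and BAAA only sometimes includes the local time. The sketch below replays the conversion on made-up sample strings that follow the formats used above, to show that both sources end up with the same `date_str` key.

```python
import pandas as pd

# BAAA-style strings: with and without the local time (made-up samples)
baaa_raw = pd.Series(['Mar 27, 1977 at 1706 LT', 'Jan 1, 1970'])
with_time = pd.to_datetime(baaa_raw, format='%b %d, %Y at %H%M LT', errors='coerce')
date_only = pd.to_datetime(baaa_raw, format='%b %d, %Y', errors='coerce')
baaa_dates = with_time.fillna(date_only)  # each row matches exactly one format

# ASN-style string: weekday, day, month name, year
asn_raw = pd.Series(['Sunday 27 March 1977'])
asn_dates = pd.to_datetime(asn_raw, format='%A %d %B %Y', errors='coerce')

# Dropping the time of day makes the two sources comparable
baaa_keys = baaa_dates.dt.strftime('%Y-%m-%d')
asn_keys = asn_dates.dt.strftime('%Y-%m-%d')

print(baaa_keys.tolist(), asn_keys.tolist())
```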

In [None]:
# Merge three dataframes
df = pd.merge(left=baaa_df, right=baaa_causes_df, how='left', on=['registration', 'date_str'])
df = pd.merge(left=df, right=asn_df, how='left', on=['registration', 'date_str'])
df.head()

In [None]:
df.columns

In [None]:
# Remove time from datetime and drop other date and time columns
df['date'] = pd.to_datetime(df['date_str'])
df = df.drop(['date_x', 'date_y', 'time', 'date_str'], axis=1)

In [None]:
# Keep data from 1970 to now
df = df[df['date'].dt.year >= 1970]

In [None]:
df.info()

### Add latitude and longitude

In [None]:
# Merge location (BAAA, then ASN, then country)
df['location_y'] = df['location_y'].str.replace(' - ', ', ')
df['location'] = df['location_x'].fillna(df['location_y']).fillna(df['country'])
df = df.drop(['location_x', 'location_y'], axis=1)
assert df['location'].isnull().sum() == 0

In [None]:
# Get coordinates from geocoder
geolocator = Nominatim(user_agent='aircraft_crashes_analysis')
geocoder = RateLimiter(geolocator.geocode, min_delay_seconds=1)

def get_coord(row, column:str='location') -> tuple:
	result = (np.nan, np.nan)

	location = geocoder(row[column], language='en', exactly_one=True)

	if (location):
		print('Coordinates: ({}, {})'.format(location.latitude, location.longitude))
		result = (location.latitude, location.longitude)
	
	return result

In [None]:
# Create columns with coordinates
coordinates = df.apply(get_coord, axis=1, result_type='expand')
coordinates.columns = ['latitude', 'longitude']

In [None]:
# Export data even if it's not completely clean
# Because it takes a long time to get the coordinates
df = df.join(coordinates)
df.to_csv('data/merged_data_with_coordinates.csv', index=False)
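
Since the geocoder is limited to one request per second, one way to shorten this step would be to geocode each distinct location string only once and reuse the result for repeated values. The sketch below is not what this notebook does: `geocode_unique` is a hypothetical helper, and `fake_geocode` is an offline stand-in for the rate-limited geocoder with approximate, made-up coordinates.

```python
import pandas as pd


def geocode_unique(locations, geocode_fn):
    """Geocode each distinct location once and expand the result to all rows."""
    lookup = {location: geocode_fn(location) for location in locations.unique()}
    rows = [lookup[location] for location in locations]
    return pd.DataFrame(rows, index=locations.index, columns=['latitude', 'longitude'])


calls = []


def fake_geocode(location):
    """Offline stand-in for the real geocoder; records every call it receives."""
    calls.append(location)
    known = {'Paris, France': (48.85, 2.35)}
    return known.get(location, (float('nan'), float('nan')))


demo_locations = pd.Series(['Paris, France', 'Nowhere', 'Paris, France'])
demo_coords = geocode_unique(demo_locations, fake_geocode)

print(demo_coords)
print(len(calls))
```

The helper assumes there are no missing locations, which the assertion on `location` in the earlier cell already guarantees.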

---

### Continue data cleaning with added coordinates

In [30]:
import numpy as np
import pandas as pd
import pickle

In [None]:
# Load merged data
df = pd.read_csv('data/merged_data_with_coordinates.csv', parse_dates=['date'])
df.head()

### Split some columns into multiple

In [None]:
# Split schedule into 2 columns
schedule = df['schedule'].str.split(' - ', expand=True)
df['departure'] = schedule[0]
df['arrival'] = schedule[1]
df = df.drop('schedule', axis=1)

In [None]:
# Split type details into 2 columns
df['type_details'] = df['type_details'].str.extract(r'(\bFirst flight: \d{4}, .*)$', expand=False)
details = df['type_details'].str.split(', ', expand=True)
df['first_flight'] = details[0].str.extract(r'(\d{4})', expand=False)
df['engine'] = details[1]
df = df.drop('type_details', axis=1)

In [None]:
# Split fatalities from ASN into numeric fatalities and occupants
fatalities = df['fatalities'].str.split(' / ', expand=True)
df['fatalities'] = pd.to_numeric(fatalities[0].str.extract(r'(\d+)', expand=False))
df['occupants'] = pd.to_numeric(fatalities[1].str.extract(r'(\d+)', expand=False))
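
The split above relies on the ASN field holding two numbers separated by ` / `. The sketch below assumes strings shaped like `'Fatalities: 3 / Occupants: 5'` (a guess at the scraped format, with made-up values) and shows an alternative that pulls both numbers out with a single regular expression using named groups.

```python
import pandas as pd

# Made-up samples; the last one stands for a row without ASN data
raw = pd.Series(['Fatalities: 3 / Occupants: 5',
                 'Fatalities: 0 / Occupants: 12',
                 None])

# First run of digits -> fatalities, next run of digits -> occupants
parts = raw.str.extract(r'(?P<fatalities>\d+)\D+(?P<occupants>\d+)')

# extract() returns strings; convert them so they can be compared and summed
parts = parts.apply(pd.to_numeric)

print(parts)
```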

### Merge common columns

In [None]:
df['operator'] = df['operator'].fillna(df['owner'])
df = df.drop('owner', axis=1)

In [None]:
df['type'] = df['aircraft_type'].fillna(df['type'])
df = df.drop('aircraft_type', axis=1)

In [None]:
df['yom'] = df['yom'].fillna(df['year_of_manufacture'])
df = df.drop('year_of_manufacture', axis=1)

In [None]:
df['aircraft_flying_hours'] = df['aircraft_flying_hours'].fillna(df['total_airframe_hrs'])
df = df.drop('total_airframe_hrs', axis=1)

In [None]:
df['aircraft_flight_cycles'] = df['aircraft_flight_cycles'].fillna(df['cycles'])
df = df.drop('cycles', axis=1)

In [None]:
df['msn'] = df['msn_x'].fillna(df['msn_y'])
df = df.drop(['msn_x', 'msn_y'], axis=1)

In [None]:
df['flight_phase'] = df['flight_phase'].fillna(df['phase'])
df = df.drop('phase', axis=1)

In [None]:
df['flight_type'] = df['flight_type'].fillna(df['nature'])
df = df.drop('nature', axis=1)

In [None]:
df['departure'] = df['departure_airport'].fillna(df['departure'])
df = df.drop('departure_airport', axis=1)

In [None]:
df['arrival'] = df['destination_airport'].fillna(df['arrival'])
df = df.drop('destination_airport', axis=1)

In [None]:
on_board = df['crew_on_board'] + df['pax_on_board']
df['occupants'] = on_board.fillna(df['occupants'])
df = df.drop(['crew_on_board', 'pax_on_board'], axis=1)

In [None]:
df['fatalities'] = df['total_fatalities'].fillna(df['fatalities'])
df = df.drop(['crew_fatalities', 'pax_fatalities', 'total_fatalities'], axis=1)

In [None]:
df['other_fatalities'] = df['other_fatalities_x'].fillna(df['other_fatalities_y'])
df = df.drop(['other_fatalities_x', 'other_fatalities_y'], axis=1)
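
All the cells in this section follow the same pattern: keep the value from the preferred source, fall back to the other source where it is missing, then drop the redundant column. A toy illustration with made-up operator names:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'operator': ['Airline A', np.nan, np.nan],     # preferred source
    'owner': ['Airline A Ltd', 'Airline B', np.nan],  # fallback source
})

# Keep 'operator' where present, otherwise take 'owner'
demo['operator'] = demo['operator'].fillna(demo['owner'])
demo = demo.drop(columns='owner')

print(demo)
```

The order of the two columns decides which source wins. Most cells prefer BAAA, but `departure` and `arrival` take the ASN airport first and only fall back to the BAAA city.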

### Impute missing values

In [5]:
def get_null_columns():
	is_null = df.isna().sum()
	columns = is_null[is_null > 0].index
	return df[columns].info()

In [None]:
get_null_columns()

In [None]:
# Get string columns
string_columns = df.drop([
  'survivors',
  'aircraft_flying_hours',
  'aircraft_flight_cycles',
  'occupants',
  'first_flight'], axis=1).select_dtypes(include='object').columns
string_columns

In [None]:
# Fill missing string columns with Unknown
for column in string_columns:
	df[column] = df[column].fillna('Unknown')

In [None]:
get_null_columns()

In [None]:
is_null_mask = df['occupants'].isna()

In [None]:
# Fill missing values with 0 to be able to convert the column to int
df['occupants'] = df['occupants'].fillna(0)
df['occupants'] = df['occupants'].astype('int')

In [None]:
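# Fill missing occupants with the average of the known values per aircraft type,
# falling back to the overall average when a type has no known value
known = df[~is_null_mask]
fill_values = df.loc[is_null_mask, 'type'].map(known.groupby('type')['occupants'].mean())
df.loc[is_null_mask, 'occupants'] = fill_values.fillna(known['occupants'].mean()).round().astype('int')
assert df['occupants'].isna().sum() == 0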
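
Imputing by the average per aircraft type only works if the averages are computed from the rows whose value is known. The toy example below (made-up types and counts) shows one way to do it, with a fallback to the overall average for a type that has no known value at all.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'type': ['A', 'A', 'A', 'B', 'B', 'C'],
    'occupants': [100, np.nan, 120, 4, np.nan, np.nan],
})

# Per-type average, aligned with every row; NaN values are ignored by mean()
type_means = demo.groupby('type')['occupants'].transform('mean')

# Type 'C' has no known value, so its average is NaN: use the overall average
overall_mean = demo['occupants'].mean()
demo['occupants_filled'] = demo['occupants'].fillna(type_means).fillna(overall_mean)

print(demo)
```

Replacing the missing values with a placeholder such as 0 before computing the averages would pull every average toward that placeholder, so the placeholder rows have to be excluded from the computation.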

In [None]:
# Make sure fatalities are not greater than occupants
df.loc[df['fatalities'] > df['occupants'], 'fatalities'] = df['occupants']

In [None]:
# Recompute survivors as a boolean: True when at least one occupant survived
df['survivors'] = df['occupants'] > df['fatalities']

In [None]:
get_null_columns()

In [None]:
# Impute missing yom with accident year minus the average aircraft age
df['aircraft_age'] = df['date'].dt.year - df['yom']
df['aircraft_age'] = df['aircraft_age'].fillna(int(df['aircraft_age'].mean()))
df['yom'] = df['yom'].fillna(df['date'].dt.year - df['aircraft_age'])
assert df['yom'].isna().sum() == 0
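
The year of manufacture is imputed indirectly: the average age at the time of the accident is computed from the known rows, then subtracted from the accident year of the rows where `yom` is missing. A small worked example with made-up values:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'date': pd.to_datetime(['1980-05-01', '1990-07-15', '2000-01-10']),
    'yom': [1970, np.nan, 1995],
})

# Known ages are 10 and 5 years, so the truncated average age is 7
age = demo['date'].dt.year - demo['yom']
mean_age = int(age.mean())

# The 1990 accident gets 1990 - 7 = 1983 as its year of manufacture
demo['yom'] = demo['yom'].fillna(demo['date'].dt.year - mean_age)

print(demo)
```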

In [None]:
# Regroup flight phases
print(df['flight_phase'].sort_values().unique())
df['flight_phase'] = df['flight_phase'].replace({
  'Take off': 'Takeoff (climb)',
  'Initial climb': 'Takeoff (climb)',
  'En route': 'Flight',
  'Landing': 'Landing (descent or approach)',
  'Approach': 'Landing (descent or approach)',
  'Taxi': 'Taxiing',
  'Standing': 'Parking'})
print(df['flight_phase'].sort_values().unique())

In [None]:
null_coordinates = (df['latitude'].isna()) | (df['longitude'].isna())
close_to_airport = df['site'].str.contains('Airport')
take_off = df['flight_phase'].str.contains('Takeoff')
landing = df['flight_phase'].str.contains('Landing')

In [None]:
# Fill missing latitude and longitude with departure or arrival coordinates
# If crash happened close to an airport
coordinates = df[null_coordinates & close_to_airport & take_off].apply(get_coord, column='departure', axis=1, result_type='expand')
df.loc[null_coordinates & close_to_airport & take_off, 'latitude'] = coordinates[0]
df.loc[null_coordinates & close_to_airport & take_off, 'longitude'] = coordinates[1]

In [None]:
# Save data again because of the geocoding
df.to_csv('data/imputed_data.csv', index=False)

In [None]:
coordinates = df[null_coordinates & close_to_airport & landing].apply(get_coord, column='arrival', axis=1, result_type='expand')
df.loc[null_coordinates & close_to_airport & landing, 'latitude'] = coordinates[0]
df.loc[null_coordinates & close_to_airport & landing, 'longitude'] = coordinates[1]

In [None]:
df.to_csv('data/imputed_data.csv', index=False)

In [None]:
null_coordinates = (df['latitude'].isna()) | (df['longitude'].isna())
df[null_coordinates][['location', 'departure', 'arrival', 'flight_phase']]

In [None]:
# Remove text between parenthesis
df.loc[null_coordinates, 'location'] = df[null_coordinates]['location'].str.replace(r'\s+\(.*\)$', '', regex=True)
df.loc[null_coordinates, 'departure'] = df[null_coordinates]['departure'].str.replace(r'\s+\(.*\)$', '', regex=True)
df.loc[null_coordinates, 'arrival'] = df[null_coordinates]['arrival'].str.replace(r'\s+\(.*\)$', '', regex=True)
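
The pattern used above removes a parenthesised suffix, which the geocoder may not recognise, but only when it sits at the very end of the string. A few made-up place names show what it does and does not touch; `strip_trailing_parentheses` is a hypothetical helper wrapping the same regular expression.

```python
import re

# Same pattern as above: whitespace, then '(...)' up to the end of the string
TRAILING_PARENTHESES = re.compile(r'\s+\(.*\)$')


def strip_trailing_parentheses(text):
    """Remove a trailing parenthesised suffix, if there is one."""
    return TRAILING_PARENTHESES.sub('', text)


print(strip_trailing_parentheses('Example City Airport (XYZ)'))
print(strip_trailing_parentheses('Lake (north shore) area'))
print(strip_trailing_parentheses('Example City (a) (b)'))
```

Because `.*` is greedy, everything from the first opening parenthesis to the final closing one is removed when the string ends with `)`.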

In [None]:
df[null_coordinates][['location', 'departure', 'arrival', 'flight_phase']]

In [None]:
coordinates = df[null_coordinates].apply(get_coord, column='location', axis=1, result_type='expand')
df.loc[null_coordinates, 'latitude'] = coordinates[0]
df.loc[null_coordinates, 'longitude'] = coordinates[1]

In [None]:
df.to_csv('data/imputed_data.csv', index=False)

In [None]:
null_coordinates = (df['latitude'].isna()) | (df['longitude'].isna())
close_to_airport = df['site'].str.contains('Airport')
take_off = df['flight_phase'].str.contains('Takeoff')
landing = df['flight_phase'].str.contains('Landing')

In [None]:
coordinates = df[null_coordinates & close_to_airport & take_off].apply(get_coord, column='departure', axis=1, result_type='expand')
df.loc[null_coordinates & close_to_airport & take_off, 'latitude'] = coordinates[0]
df.loc[null_coordinates & close_to_airport & take_off, 'longitude'] = coordinates[1]

In [None]:
df.to_csv('data/imputed_data.csv', index=False)

In [None]:
coordinates = df[null_coordinates & close_to_airport & landing].apply(get_coord, column='arrival', axis=1, result_type='expand')
df.loc[null_coordinates & close_to_airport & landing, 'latitude'] = coordinates[0]
df.loc[null_coordinates & close_to_airport & landing, 'longitude'] = coordinates[1]

In [None]:
df.to_csv('data/imputed_data.csv', index=False)

---

In [33]:
# Load data
df = pd.read_csv('data/imputed_data.csv', parse_dates=['date'])

In [34]:
# Impute missing other_fatalities with 0
df['other_fatalities'] = df['other_fatalities'].fillna(0)
assert df['other_fatalities'].isna().sum() == 0

In [35]:
get_null_columns()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13647 entries, 0 to 13646
Data columns (total 9 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   captain_flying_hours          5665 non-null   float64
 1   captain_flying_hours_on_type  4777 non-null   float64
 2   copilot_flying_hours          1616 non-null   float64
 3   copilot_flying_hours_on_type  1437 non-null   float64
 4   aircraft_flying_hours         4656 non-null   object 
 5   aircraft_flight_cycles        1395 non-null   object 
 6   latitude                      12426 non-null  float64
 7   longitude                     12426 non-null  float64
 8   first_flight                  6946 non-null   float64
dtypes: float64(7), object(2)
memory usage: 959.7+ KB


### Drop rows and columns

#### Unnecessary/redundant columns

**registration, msn**<br>
Those are unique identifiers of an aircraft.

**flight_number**<br>
It's a unique identifier of a flight.

**captain_flying_hours, captain_flying_hours_on_type, copilot_flying_hours, copilot_flying_hours_on_type, aircraft_flying_hours, aircraft_flight_cycles**<br>
There are too many null values.

**first_flight**<br>
It's the first flight of the aircraft type in general, not of the aircraft involved in the accident.

**investigating_agency, confidence_rating**<br>
They won't help categorize the data.

In [36]:
# Drop columns
columns_to_drop = [
  'registration',
  'msn',
  'flight_number',
  'captain_flying_hours', 
  'captain_flying_hours_on_type', 
  'copilot_flying_hours',
  'copilot_flying_hours_on_type',
  'aircraft_flying_hours', 
  'aircraft_flight_cycles',
  'first_flight',
  'investigating_agency',
  'confidence_rating']

df = df.drop(columns_to_drop, axis=1)

In [37]:
get_null_columns()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13647 entries, 0 to 13646
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   latitude   12426 non-null  float64
 1   longitude  12426 non-null  float64
dtypes: float64(2)
memory usage: 213.4 KB


In [38]:
# Drop rows where coordinates are null
df = df.dropna(subset=['latitude', 'longitude'])

### Collapse categories

In [39]:
# Regroup values of flight_type
print(df['flight_type'].sort_values().unique())
df['flight_type'] = np.where(df['flight_type'] == '-', 'Unknown', df['flight_type'])
print(df['flight_type'].sort_values().unique())

['-' 'Aerial photography' 'Aerobatic' 'Ambulance' 'Bombing' 'Calibration'
 'Cargo' 'Charter/Taxi (Non Scheduled Revenue Flight)' 'Cinematography'
 'Delivery' 'Demonstration' 'Executive/Corporate/Business' 'Ferry'
 'Fire fighting' 'Geographical / Geophysical / Scientific' 'Government'
 'Humanitarian' 'Illegal (smuggling)' 'Meteorological / Weather'
 'Military' 'Positioning' 'Postal (mail)' 'Private' 'Refuelling'
 'Scheduled Revenue Flight' 'Skydiving / Paratroopers'
 'Spraying (Agricultural)' 'Supply' 'Survey / Patrol / Reconnaissance'
 'Test' 'Topographic' 'Training' 'Unknown']
['Aerial photography' 'Aerobatic' 'Ambulance' 'Bombing' 'Calibration'
 'Cargo' 'Charter/Taxi (Non Scheduled Revenue Flight)' 'Cinematography'
 'Delivery' 'Demonstration' 'Executive/Corporate/Business' 'Ferry'
 'Fire fighting' 'Geographical / Geophysical / Scientific' 'Government'
 'Humanitarian' 'Illegal (smuggling)' 'Meteorological / Weather'
 'Military' 'Positioning' 'Postal (mail)' 'Private' 'Refuelling'
 'Sc

In [40]:
# Strip the ', written off' suffix from aircraft_damage so each damage level appears once
print(df['aircraft_damage'].sort_values().unique())
df['aircraft_damage'] = df['aircraft_damage'].str.replace(', written off', '', regex=False)
print(df['aircraft_damage'].sort_values().unique())

['Aircraft missing, written off' 'Destroyed' 'Destroyed, written off'
 'Minor, repaired' 'Minor, written off' 'Substantial'
 'Substantial, repaired' 'Substantial, written off' 'Unknown'
 'Unknown, written off']
['Aircraft missing' 'Destroyed' 'Minor' 'Minor, repaired' 'Substantial'
 'Substantial, repaired' 'Unknown']


In [41]:
# Get unique values of cause
df['cause'].sort_values().unique()

array(['Human factor', 'Other causes', 'Technical failure',
       'Terrorism act, hijacking, sabotage, any kind of hostile action',
       'Unknown', 'Weather'], dtype=object)

In [42]:
# Merge the 'UK' abbreviation of category into 'Unknown'
print(df['category'].sort_values().unique())
df['category'] = np.where(df['category'] == 'UK', 'Unknown', df['category'])
print(df['category'].sort_values().unique())

['Accident' 'Incident' 'Other' 'Serious incident' 'UK' 'Unknown'
 'Unlawful Interference']
['Accident' 'Incident' 'Other' 'Serious incident' 'Unknown'
 'Unlawful Interference']


### Convert columns

#### Convert category, aircraft_damage and engine into ordinal categories

In [43]:
print(df['category'].sort_values().unique())
# 'Other' is not in the ordered list, so pd.Categorical turns it into NaN,
# which fillna then maps to 'Unknown'
categories = ['Unknown', 'Incident', 'Serious incident', 'Accident', 'Unlawful Interference']
df['category'] = pd.Categorical(df['category'], categories, ordered=True)
df['category'] = df['category'].fillna('Unknown')
print(df['category'].sort_values().unique())

['Accident' 'Incident' 'Other' 'Serious incident' 'Unknown'
 'Unlawful Interference']
['Unknown', 'Incident', 'Serious incident', 'Accident', 'Unlawful Interference']
Categories (5, object): ['Unknown' < 'Incident' < 'Serious incident' < 'Accident' < 'Unlawful Interference']


In [44]:
print(df['aircraft_damage'].sort_values().unique())
categories = [
  'Unknown',
  'Minor, repaired',
  'Minor',
  'Substantial, repaired',
  'Substantial',
  'Destroyed',
  'Aircraft missing']
df['aircraft_damage'] = pd.Categorical(df['aircraft_damage'], categories, ordered=True)
print(df['aircraft_damage'].sort_values().unique())

['Aircraft missing' 'Destroyed' 'Minor' 'Minor, repaired' 'Substantial'
 'Substantial, repaired' 'Unknown']
['Unknown', 'Minor, repaired', 'Minor', 'Substantial, repaired', 'Substantial', 'Destroyed', 'Aircraft missing']
Categories (7, object): ['Unknown' < 'Minor, repaired' < 'Minor' < 'Substantial, repaired' < 'Substantial' < 'Destroyed' < 'Aircraft missing']
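
What the ordered dtype buys can be shown on toy data. This is a minimal sketch, not the notebook's DataFrame: the series `s` and the shortened `damage_levels` list are hypothetical. It shows that sorting follows the declared order, that comparisons against a label work, and that values absent from the list become NaN.

```python
import pandas as pd

# Hypothetical, shortened list of damage levels (least to most severe)
damage_levels = ['Unknown', 'Minor', 'Substantial', 'Destroyed']
s = pd.Series(pd.Categorical(
    ['Destroyed', 'Minor', 'Unknown', 'Substantial'],
    categories=damage_levels, ordered=True))

# Sorting follows the declared order, not alphabetical order
sorted_values = s.sort_values().tolist()

# Ordered categoricals can be compared against one of their labels
severe = s[s >= 'Substantial'].tolist()

# A value missing from the category list is converted to NaN
with_other = pd.Categorical(['Minor', 'Other'], categories=damage_levels, ordered=True)
n_missing = int(pd.isna(with_other).sum())

print(sorted_values, severe, n_missing)
```

The NaN behaviour is why any label left out of `categories` has to be handled explicitly, otherwise it reappears as a null value later.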


In [45]:
print(df['engine'].unique())
df['engine'] = np.where(df['engine'].isin([
  '2 Piston engines',
  '3 Piston engines',
  '4 Piston engines',
  '6 Piston engines']), 'Multi Piston Engines', df['engine'])

['2 Turboprop engines' '2 Jet engines' 'Unknown' '1 Turboprop engine'
 '1 Piston engine' '2 Piston engines' '4 Jet engines' '4 Piston engines'
 '3 Jet engines' '1 Jet engine' '4 Turboprop engines' '3 Piston engines']


In [46]:
df['engine'] = np.where(df['engine'].isin([
  '2 Turboprop engines',
  '3 Turboprop engines',
  '4 Turboprop engines']), 'Multi Turboprop Engines', df['engine'])

In [47]:
df['engine'] = np.where(df['engine'].isin([
  '2 Jet engines',
  '3 Jet engines',
  '4 Jet engines']), 'Multi Jet Engines', df['engine'])

In [48]:
categories = [
  'Unknown',
  '1 Piston engine',
  'Multi Piston Engines',
  '1 Turboprop engine',
  'Multi Turboprop Engines',
  '1 Jet engine',
  'Multi Jet Engines']
df['engine'] = pd.Categorical(df['engine'], categories, ordered=True)
print(df['engine'].unique())

['Multi Turboprop Engines', 'Multi Jet Engines', 'Unknown', '1 Turboprop engine', '1 Piston engine', 'Multi Piston Engines', '1 Jet engine']
Categories (7, object): ['Unknown' < '1 Piston engine' < 'Multi Piston Engines' < '1 Turboprop engine' < 'Multi Turboprop Engines' < '1 Jet engine' < 'Multi Jet Engines']
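
The three `np.where` cells above could probably be collapsed into one regular expression. A sketch on toy data, assuming every multi-engine label has the form `<count> <kind> engines` (plural), which matches the values printed above; the `engine` series here is hypothetical.

```python
import pandas as pd

# Toy labels in the same format as the engine column
engine = pd.Series([
    '2 Turboprop engines', '1 Piston engine', '4 Jet engines',
    '6 Piston engines', 'Unknown'])

# Only plural labels match, so single-engine rows and 'Unknown' pass through untouched
collapsed = engine.str.replace(
    r'^\d+ (Piston|Turboprop|Jet) engines$',
    r'Multi \1 Engines',
    regex=True)

print(collapsed.tolist())
```

The explicit `isin` lists in the notebook are easier to audit; the pattern is shorter but would also swallow any unexpected count.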


### Correct other inconsistencies

In [49]:
df.describe()

| | yom | date | fatalities | latitude | longitude | occupants | other_fatalities | aircraft_age |
|---|---|---|---|---|---|---|---|---|
| count | 12426.0 | 12426 | 12426.0 | 12426.0 | 12426.0 | 12426.0 | 12426.0 | 12426.0 |
| mean | 1972.691936 | 1993-02-14 00:16:55.161757568 | 6.199018 | 27.121567 | -24.166197 | 14.414293 | 0.123773 | 19.922904 |
| min | 1.0 | 1970-01-02 00:00:00 | 0.0 | -89.991843 | -179.491343 | 0.0 | 0.0 | -17588.0 |
| 25% | 1964.0 | 1980-02-03 06:00:00 | 0.0 | 11.266133 | -87.83682 | 2.0 | 0.0 | 9.0 |
| 50% | 1973.0 | 1991-05-21 00:00:00 | 1.0 | 33.637401 | -62.755826 | 4.0 | 0.0 | 19.0 |
| 75% | 1981.0 | 2005-01-08 00:00:00 | 4.0 | 44.503655 | 35.371527 | 9.0 | 0.0 | 28.0 |
| max | 19567.0 | 2025-03-17 00:00:00 | 520.0 | 82.525369 | 178.719537 | 524.0 | 180.0 | 2000.0 |
| std | 167.780095 | | 21.31673 | 24.800868 | 82.591362 | 34.796056 | 2.398518 | 167.739832 |


In [50]:
# Get rows with yom below 1900
low_yom = df['yom'] < 1900
df[low_yom]

| | operator | flight_phase | flight_type | survivors | site | yom | country | region | cause | date | ... | category | location | latitude | longitude | departure | arrival | engine | occupants | other_fatalities | aircraft_age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1101 | Technoservis-A | Takeoff (climb) | Spraying (Agricultural) | True | Plain, Valley | 16.0 | Russia | Asia | Human factor | 2016-04-03 | ... | Accident | Aksarino, Republic of Tatarstan | 55.342326 | 51.906545 | Aksarino | Aksarino | 1 Piston engine | 1 | 0.0 | 2000.0 |
| 1310 | FlyBe | Takeoff (climb) | Scheduled Revenue Flight | True | Airport (less than 10 km from airport) | 23.0 | United Kingdom | Europe | Weather | 2015-01-02 | ... | Accident | Stornoway, Hebrides Islands | 58.207704 | -6.382723 | Stornoway Airport (SYY/EGPO) | Glasgow International Airport (GLA/EGPF) | Multi Turboprop Engines | 29 | 0.0 | 1992.0 |
| 1346 | Air Century (ACSA) | Landing (descent or approach) | Charter/Taxi (Non Scheduled Revenue Flight) | True | Airport (less than 10 km from airport) | 18.0 | Dominican Republic | Central America | Technical failure | 2014-10-12 | ... | Accident | Punta Cana, La Altagracia | 18.556551 | -68.369161 | San Juan-Luis Muñoz Marín International Airpor... | Punta Cana International Airport (PUJ/MDPC) | Multi Turboprop Engines | 13 | 0.0 | 1996.0 |
| 1392 | Skyward International Aviation | Takeoff (climb) | Cargo | False | City | 26.0 | Kenya | Africa | Human factor | 2014-07-02 | ... | Accident | Nairobi-Jomo Kenyatta (ex Embakasi), Nairobi C... | -1.322256 | 36.924926 | Nairobi-Jomo Kenyatta International Airport | Mogadishu International Airport | Multi Turboprop Engines | 4 | 0.0 | 1988.0 |
| 8117 | Rural Aerial co-op | Flight | Spraying (Agricultural) | False | Plain, Valley | 1.0 | New Zealand | Oceania | Unknown | 1986-07-31 | ... | Unknown | Ihuraua, Manawatu-Wanganui | -40.682711 | 175.856601 | Unknown | Unknown | Unknown | 0 | 0.0 | 1985.0 |
| 10423 | Jose Benitez | Landing (descent or approach) | Private | True | Airport (less than 10 km from airport) | 254.0 | United States of America | North America | Technical failure | 1979-04-16 | ... | Accident | Key West-Intl, Florida | 24.555477 | -81.759616 | Key West International Airport, FL (EYW/KEYW) | Key West International Airport, FL (EYW/KEYW) | Multi Piston Engines | 2 | 0.0 | 1725.0 |
| 13163 | Aeroflot - Russian International Airlines | Flight | Spraying (Agricultural) | False | Plain, Valley | 2.0 | Russia | Asia | Human factor | 1971-07-15 | ... | Accident | Pochep, Bryansk oblast | 52.92878 | 33.454536 | Pochep | Pochep | 1 Piston engine | 2 | 0.0 | 1969.0 |
| 13224 | Aeroflot - Russian International Airlines | Flight | Scheduled Revenue Flight | False | Plain, Valley | 2.0 | Ukraine | Europe | Technical failure | 1971-04-29 | ... | Accident | Chernivtsi, Chernivtsi Oblast | 48.28647 | 25.937653 | Unknown | Unknown | 1 Piston engine | 0 | 0.0 | 1969.0 |
| 13573 | Aeroflot - Russian International Airlines | Flight | Spraying (Agricultural) | False | Plain, Valley | 27.0 | Russia | Asia | Human factor | 1970-03-19 | ... | Accident | Nikolayevo-Kozlovski, Rostov oblast | 47.220578 | 38.361728 | Nikolayevo-Kozlovski | Nikolayevo-Kozlovski | 1 Piston engine | 2 | 0.0 | 1943.0 |
| 13615 | Aeroflot - Russian International Airlines | Flight | Positioning | False | Plain, Valley | 29.0 | Russia | Asia | Human factor | 1970-01-31 | ... | Accident | Tokmasskiy, Chelyabinsk oblast | 54.42112 | 60.263129 | Chelyabinsk Airport (CEK/USCC) | Magnitogorsk Airport (MQF/USCM) | 1 Piston engine | 2 | 0.0 | 1941.0 |


In [51]:
# Replace implausible yom with the crash year minus the mean aircraft age (truncated to whole years)
df.loc[low_yom, 'yom'] = df['date'].dt.year - int(df['aircraft_age'].mean())
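
The fill above subtracts a typical aircraft age from the crash year. The sketch below applies the same idea to toy data, with two deliberate differences that are assumptions rather than the notebook's method: it uses the median of the plausible ages instead of the mean over all rows, and it recomputes `aircraft_age` afterwards. The `toy` frame and the `plausible` mask are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'date': pd.to_datetime(['2016-04-03', '1990-06-01', '2001-09-15']),
    'yom': [16.0, 1975.0, 1980.0],  # 16.0 stands in for a truncated year
})
toy['aircraft_age'] = toy['date'].dt.year - toy['yom']

# Same lower bound as the notebook: years before 1900 are treated as entry errors
plausible = toy['yom'] >= 1900

# Estimate a typical age from plausible rows only, so bad rows cannot skew it
typical_age = int(toy.loc[plausible, 'aircraft_age'].median())

# Impute the year of manufacture, then refresh the derived age column
toy.loc[~plausible, 'yom'] = toy['date'].dt.year - typical_age
toy['aircraft_age'] = toy['date'].dt.year - toy['yom']

print(toy)
```

On the real data the mean and the median of `aircraft_age` are close (about 19.9 and 19.0 in the `describe()` output), so both choices give the same whole-year offset.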

In [52]:
# Get rows whose yom lies in the future (here, a 5-digit entry error)
high_yom = df['yom'] > 2025
df[high_yom]

| | operator | flight_phase | flight_type | survivors | site | yom | country | region | cause | date | ... | category | location | latitude | longitude | departure | arrival | engine | occupants | other_fatalities | aircraft_age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10483 | Air Rhodesia | Takeoff (climb) | Scheduled Revenue Flight | False | Airport (less than 10 km from airport) | 19567.0 | Zimbabwe | Africa | Terrorism act, hijacking, sabotage, any kind o... | 1979-02-12 | ... | Unlawful Interference | Kariba, Mashonaland West | -16.527274 | 28.775548 | Kariba Airport (KAB/FVKB) | Salisbury Airport (HRE/FVHA) | Multi Turboprop Engines | 59 | 0.0 | -17588.0 |


In [53]:
# After checking the BAAA and ASN websites, replace with 1956
df.loc[high_yom, 'yom'] = 1956

### Export data

In [54]:
# Assert there are no more null values
assert df.isna().sum().sum() == 0

In [55]:
# Reorder columns
df = df[[
  'date',
  'category',
  'type',
  'operator',
  'yom',
  'engine',
  'engine_model',
  'flight_phase',
  'flight_type',
  'site',
  'location',
  'country',
  'region',
  'latitude',
  'longitude',
  'aircraft_damage',
  'survivors',
  'occupants',
  'fatalities',
  'other_fatalities',
  'cause'
]]

In [56]:
# Sort data from the earliest to the latest crash
df = df.sort_values(by='date')

In [57]:
# Reset index
df = df.reset_index(drop=True)

In [58]:
# Serialize data with pickle
with open('data/cleaned_data.pkl', 'wb') as handle:
  pickle.dump(df, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [59]:
# Export data to CSV
df.to_csv('data/cleaned_data.csv', index=False)
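
The two exports are not equivalent: pickle keeps pandas dtypes, while CSV stores plain text, so the ordered categories have to be rebuilt after loading. A minimal sketch of the difference on toy data, using in-memory buffers instead of the `data/` files; the `toy` frame and the shortened `levels` list are hypothetical.

```python
import io
import pickle

import pandas as pd

levels = ['Unknown', 'Incident', 'Accident']
toy = pd.DataFrame({
    'date': pd.to_datetime(['1970-01-02', '1979-02-12']),
    'category': pd.Categorical(['Accident', 'Incident'], categories=levels, ordered=True),
})

# Pickle round trip: the ordered categorical dtype survives
from_pickle = pickle.loads(pickle.dumps(toy, protocol=pickle.HIGHEST_PROTOCOL))
pickle_keeps_order = bool(from_pickle['category'].cat.ordered)

# CSV round trip: the column comes back as plain text
csv_text = toy.to_csv(index=False)
from_csv = pd.read_csv(io.StringIO(csv_text))
csv_is_categorical = isinstance(from_csv['category'].dtype, pd.CategoricalDtype)

# Restoring the dtypes by hand when reading the CSV
restored = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'])
restored['category'] = pd.Categorical(restored['category'], categories=levels, ordered=True)

print(pickle_keeps_order, csv_is_categorical, restored.dtypes.to_dict())
```

Anyone loading `cleaned_data.csv` would need to repeat the categorical conversions; loading `cleaned_data.pkl` should not require it.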

## End