# Aircraft Crashes Data Collection And Cleaning

## Overview

This notebook collects and prepares the data used to analyze aircraft accidents recorded since 1918.

### About the dataset

The data is scraped from the [BAAA Crash Archives](https://www.baaa-acro.com/crash-archives) and the [ASN Database](https://asn.flightsafety.org/database/).

**BAAA**

`date` date and local time of the accident<br>
`aircraft_type` aircraft make and model<br>
`operator` operator of the aircraft<br>
`registration` unique code to a single aircraft, required by international convention<br>
`flight_phase` phase of the flight when the accident occurred<br>
`flight_type` type of flight (ex: military)<br>
`survivors` indicates whether there were survivors<br>
`site` type of location where the accident happened (ex: mountains)<br>
`schedule` planned route of the flight<br>
`msn` manufacturer's serial number of the aircraft<br>
`yom` year of manufacture of the aircraft involved in the accident<br>
`flight_number` flight number<br>
`location` location of the accident<br>
`country` country where the crash happened<br>
`region` region of the world where the crash happened<br>
`crew_on_board` number of crew members on board at the time of the accident<br>
`crew_fatalities` number of crew members who died in the crash<br>
`pax_on_board` number of passengers on board at the time of the accident<br> 
`pax_fatalities` number of passengers who died in the crash<br>                 
`other_fatalities` other victims of the accident outside of the aircraft<br>
`total_fatalities` total number of deaths<br>
`captain_flying_hours` number of flying hours of the captain<br>
`captain_flying_hours_on_type` number of hours the captain flew on the type of aircraft involved in the crash<br>
`copilot_flying_hours` number of flying hours of the copilot<br>  
`copilot_flying_hours_on_type` number of hours the copilot flew on the type of aircraft involved in the crash<br>  
`aircraft_flying_hours` number of flying hours of the aircraft before the crash<br>
`aircraft_flight_cycles` number of flights of the aircraft<br><br>


**ASN**

`date` date of the accident<br>
`time` time of the accident<br>
`type` make and model of the aircraft<br>
`first_flight` year of the aircraft's first flight<br>
`engine` type and number of engines<br>
`type_details` details of the aircraft<br>
`owner` operator of the aircraft<br>
`registration` unique code to a single aircraft, required by international convention<br>
`msn` manufacturer's serial number of the aircraft<br>
`year_of_manufacture` year of manufacture of the aircraft involved in the accident<br>
`total_airframe_hrs` number of flying hours of the aircraft before the crash<br>
`cycles` number of flights of the aircraft<br>
`engine_model` make and model of the aircraft engine<br>
`fatalities` total number of fatalities<br>
`other_fatalities` other victims of the accident outside of the aircraft<br>
`aircraft_damage` severity of the aircraft damage<br>
`category` type of accident<br>
`location` location of the crash<br>
`phase` phase of the flight when the accident occurred<br>
`nature` type of flight (ex: military)<br>
`departure_airport` airport where the departure was planned<br>
`destination_airport` airport where the arrival was planned<br>
`investigating_agency` agency that produced the accident report<br>
`confidence_rating` quality of the information (ex: missing information)

---

## Data Collection

In [1]:
from bs4 import BeautifulSoup
from datetime import datetime
from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Nominatim
import math
import numpy as np
import pandas as pd
import pickle
import re
import requests
from urllib.parse import unquote

### BAAA

### ASN

In [101]:
csv_path = 'data/asn_scraped_data.csv'
root_url = 'https://asn.flightsafety.org'

# Add a browser-like User-Agent header to avoid a 403 Forbidden error
headers = {
  'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36'
}

database = '/database'
database_url = root_url + database

#for year in range(1919, 2026):
for year in range(1973, 2026):
	year_url = '{}{}/year/{}/1'.format(root_url, database, year)
	response = requests.get(year_url, headers=headers)
	soup = BeautifulSoup(response.content, 'html.parser')
	nb_occurences_txt = soup.find('div', {'class': 'innertube'}).find('span').text
	pattern = re.compile(r'\d+(?= occurrences)')
	nb_occurences = int(pattern.search(nb_occurences_txt).group(0))
	max_items_per_page = 100
	nb_pages = math.ceil(nb_occurences / max_items_per_page)

	for page in range(1, nb_pages + 1):
		page_url = '{}{}/year/{}/{}'.format(root_url, database, year, page)
		response = requests.get(page_url, headers=headers)
		soup = BeautifulSoup(response.content, 'html.parser')
		table = soup.find('table', {'class': 'hp'})
		anchors = table.find_all('a')
		links = [a['href'] for a in anchors]
		
		crash_list = []
		for i, link in enumerate(links):
			details_url = root_url + link
			print('Year {}, page {}, item {}, link: {}'.format(year, page, i + 1, details_url))
			response = requests.get(details_url, headers=headers)
			soup = BeautifulSoup(response.content, 'html.parser')
			table = soup.find('table')
			details = {}

			date_label = table.find('td', string='Date:')
			details['date'] = date_label.next_sibling.text

			time_label = table.find('td', string='Time:')
			details['time'] = time_label.next_sibling.text

			type_label = table.find('td', string='Type:')
			anchor = type_label.next_sibling.find('a')

			if anchor: # Get more details about aircraft if link exists
				details['type'] = anchor.text
				href = anchor['href']
				type_url = root_url + href
				type_response = requests.get(type_url, headers=headers)
				type_soup = BeautifulSoup(type_response.content, 'html.parser')
				type_table = type_soup.find('table')
				type_details = list(type_table.find('td', {'valign': 'top'}).stripped_strings)
				details['type_details'] = ', '.join(type_details)
			else:
				details['type'] = type_label.next_sibling.text
				details['type_details'] = None

			owner_label = table.find('td', string='Owner/operator:')
			details['owner'] = owner_label.next_sibling.text

			reg_label = table.find('td', string='Registration:')
			details['registration'] = reg_label.next_sibling.text

			msn_label = table.find('td', string='MSN:')
			details['msn'] = msn_label.next_sibling.text

			yom_label = table.find('td', string='Year of manufacture:')
			details['year_of_manufacture'] = yom_label.next_sibling.text if yom_label else None

			air_hours_label = table.find('td', string='Total airframe hrs:')
			details['total_airframe_hrs'] = air_hours_label.next_sibling.text if air_hours_label else None

			cycles_label = table.find('td', string='Cycles:')
			details['cycles'] = cycles_label.next_sibling.text if cycles_label else None

			engine_label = table.find('td', string='Engine model:')
			details['engine_model'] = engine_label.next_sibling.text if engine_label else None

			fatal_label = table.find('td', string='Fatalities:')
			details['fatalities'] = fatal_label.next_sibling.text

			other_label = table.find('td', string='Other fatalities:')
			details['other_fatalities'] = other_label.next_sibling.text

			damage_label = table.find('td', string='Aircraft damage:')
			details['aircraft_damage'] = damage_label.next_sibling.text

			cat_label = table.find('td', string='Category:')
			details['category'] = cat_label.next_sibling.text if cat_label else None

			loc_label = table.find('td', string='Location:')
			details['location'] = ' '.join(loc_label.next_sibling.stripped_strings)

			phase_label = table.find('td', string='Phase:')
			details['phase'] = phase_label.next_sibling.text

			nature_label = table.find('td', string='Nature:')
			details['nature'] = nature_label.next_sibling.text

			dep_label = table.find('td', string='Departure airport:')
			details['departure_airport'] = dep_label.next_sibling.text

			des_label = table.find('td', string='Destination airport:')
			details['destination_airport'] = des_label.next_sibling.text

			inv_label = table.find('td', string=re.compile('Investigating'))
			details['investigating_agency'] = inv_label.next_sibling.text if inv_label else None

			conf_label = table.find('td', string='Confidence Rating:')
			details['confidence_rating'] = ''.join(conf_label.next_sibling.stripped_strings) if conf_label else None

			crash_list.append(details)
		
		df = pd.DataFrame(crash_list)

		# Write the header only for the first page of the first year (1919); every other page is appended
		if year == 1919 and page == 1:
			df.to_csv(csv_path, index=False)
		else:
			df.to_csv(csv_path, index=False, header=False, mode='a')

Year 1973, page 1, item 1, link: https://asn.flightsafety.org/database/record.php?id=19730102-1
Year 1973, page 1, item 2, link: https://asn.flightsafety.org/database/record.php?id=19730102-0
Year 1973, page 1, item 3, link: https://asn.flightsafety.org/database/record.php?id=19730104-0
Year 1973, page 1, item 4, link: https://asn.flightsafety.org/wikibase/193931
Year 1973, page 1, item 5, link: https://asn.flightsafety.org/database/record.php?id=19730114-0
Year 1973, page 1, item 6, link: https://asn.flightsafety.org/wikibase/155647
Year 1973, page 1, item 7, link: https://asn.flightsafety.org/database/record.php?id=19730119-2
Year 1973, page 1, item 8, link: https://asn.flightsafety.org/database/record.php?id=19730119-1
Year 1973, page 1, item 9, link: https://asn.flightsafety.org/database/record.php?id=19730121-0
Year 1973, page 1, item 10, link: https://asn.flightsafety.org/database/record.php?id=19730122-1
Year 1973, page 1, item 11, link: https://asn.flightsafety.org/database/rec
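
The number of pages requested for each year is derived from the occurrence count shown on that year's first page. The arithmetic can be sketched in isolation as follows; the helper name `count_pages` and the sample strings are illustrative assumptions, not text taken from the ASN site.

```python
import math
import re

# Same lookahead pattern as in the scraping loop
OCCURRENCES = re.compile(r'\d+(?= occurrences)')

def count_pages(summary_text, max_items_per_page=100):
    """Return how many result pages a year needs (hypothetical helper)."""
    match = OCCURRENCES.search(summary_text)
    if match is None:
        # No count found: report zero pages instead of failing on None
        return 0
    return math.ceil(int(match.group(0)) / max_items_per_page)

pages_for_251 = count_pages('251 occurrences in the database')
pages_for_100 = count_pages('100 occurrences in the database')
```

Returning zero when the pattern does not match lets the inner page loop simply do nothing for that year, which may be preferable to an `AttributeError` in the middle of a long scrape.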

---

## Data Exploration

### BAAA

In [None]:
baaa_df = pd.read_csv('data/baaa_scraped_data.csv')
baaa_df.head()

Unnamed: 0,date,aircraft_type,operator,registration,flight_phase,flight_type,survivors,site,schedule,msn,...,pax_on_board,pax_fatalities,other_fatalities,total_fatalities,captain_flying_hours,captain_flying_hours_on_type,copilot_flying_hours,copilot_flying_hours_on_type,aircraft_flying_hours,aircraft_flight_cycles
0,"Mar 13, 2025 at 0733 LT",Cessna 525 CitationJet CJ2,LBL 525 CZ LLC,N525CZ,Takeoff (climb),Private,No,"Plain, Valley",Mesquite - Addison,525A-0380,...,0.0,0.0,0.0,1,,,,,,
1,"Mar 7, 2025",Antonov AN-32,Indian Air Force - Bharatiya Vayu Sena,,Landing (descent or approach),Military,Yes,Airport (less than 10 km from airport),,,...,0.0,0.0,0.0,0,,,,,,
2,"Mar 4, 2025 at 0954 LT",BAe Jetstream 31,SAETA Perú (Servicios Aéreos Tarapota),OB-2178,Landing (descent or approach),Scheduled Revenue Flight,Yes,Airport (less than 10 km from airport),Iquitos - Güeppí,861,...,11.0,0.0,0.0,0,,,,,,
3,"Feb 25, 2025",Antonov AN-26,Sudanese Air Force - Al Quwwat al-Jawwiya As-S...,,Takeoff (climb),Military,No,City,,,...,13.0,13.0,29.0,46,,,,,,
4,"Feb 23, 2025",Ilyushin II-76,Sudanese Air Force - Al Quwwat al-Jawwiya As-S...,1106,Flight,Military,No,Desert,,10234 08265,...,0.0,0.0,0.0,7,,,,,,


In [None]:
baaa_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36086 entries, 0 to 36085
Data columns (total 27 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   date                          36086 non-null  object 
 1   aircraft_type                 36086 non-null  object 
 2   operator                      36084 non-null  object 
 3   registration                  34899 non-null  object 
 4   flight_phase                  35475 non-null  object 
 5   flight_type                   36029 non-null  object 
 6   survivors                     34810 non-null  object 
 7   site                          35719 non-null  object 
 8   schedule                      25712 non-null  object 
 9   msn                           28064 non-null  object 
 10  yom                           26336 non-null  float64
 11  flight_number                 2895 non-null   object 
 12  location                      36075 non-null  object 
 13  c

In [None]:
baaa_df.isnull().sum()

date                                0
aircraft_type                       0
operator                            2
registration                     1187
flight_phase                      611
flight_type                        57
survivors                        1276
site                              367
schedule                        10374
msn                              8022
yom                              9750
flight_number                   33191
location                           11
country                             3
region                              2
crew_on_board                      22
crew_fatalities                     1
pax_on_board                       50
pax_fatalities                      4
other_fatalities                   16
total_fatalities                    0
captain_flying_hours            29206
captain_flying_hours_on_type    30241
copilot_flying_hours            33855
copilot_flying_hours_on_type    34096
aircraft_flying_hours           30383
aircraft_fli

In [None]:
# Check for duplicates
baaa_df[baaa_df.duplicated(keep=False)]

Unnamed: 0,date,aircraft_type,operator,registration,flight_phase,flight_type,survivors,site,schedule,msn,...,pax_on_board,pax_fatalities,other_fatalities,total_fatalities,captain_flying_hours,captain_flying_hours_on_type,copilot_flying_hours,copilot_flying_hours_on_type,aircraft_flying_hours,aircraft_flight_cycles
2499,"Jun 15, 2008",Harbin Yunsunji Y-12,China Flying Dragon Special Aviation Company,B-3841,Flight,Geographical / Geophysical / Scientific,Yes,Mountains,,0061,...,2.0,2.0,0.0,3,,,,,,
2500,"Jun 15, 2008",Harbin Yunsunji Y-12,China Flying Dragon Special Aviation Company,B-3841,Flight,Geographical / Geophysical / Scientific,Yes,Mountains,,0061,...,2.0,2.0,0.0,3,,,,,,
7539,"Jun 8, 1988",Lockheed C-130 Hercules,United States Air Force - USAF (since 1947),61-2373,Landing (descent or approach),Training,No,Airport (less than 10 km from airport),Little Rock - Greenville,3720,...,0.0,0.0,0.0,6,,,,,,
7540,"Jun 8, 1988",Lockheed C-130 Hercules,United States Air Force - USAF (since 1947),61-2373,Landing (descent or approach),Training,No,Airport (less than 10 km from airport),Little Rock - Greenville,3720,...,0.0,0.0,0.0,6,,,,,,
7659,"Dec 28, 1987",PZL-Mielec AN-2,Aeroflot - Russian International Airlines,CCCP-02531,Takeoff (climb),Scheduled Revenue Flight,Yes,"Plain, Valley",,1G121-15,...,0.0,0.0,0.0,0,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33820,"Sep 30, 1933",Avro 594 Avian,Holden's Air Transport Services,VH-UIV,Landing (descent or approach),Cargo,Yes,Airport (less than 10 km from airport),Salamaua – Bulolo,193,...,1.0,0.0,0.0,0,,,,,,
34999,"Oct 18, 1928",Douglas M-3,National Air Transport - USA,NC1064,Flight,Postal (mail),No,Mountains,Cleveland – New York,658,...,0.0,0.0,0.0,1,,,,,,
35000,"Oct 18, 1928",Douglas M-3,National Air Transport - USA,NC1064,Flight,Postal (mail),No,Mountains,Cleveland – New York,658,...,0.0,0.0,0.0,1,,,,,,
35539,"Dec 31, 1923",Loening 23 Air Yacht,New York-Newport Air Service,,,Scheduled Revenue Flight,Yes,"Lake, Sea, Ocean, River",,,...,0.0,0.0,0.0,0,,,,,,
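
The `keep=False` argument is what makes every member of a duplicate group appear in the listing above. A small sketch on toy data (three rows loosely modeled on the output, not the real dataframe) shows how it differs from the default and from `drop_duplicates`:

```python
import pandas as pd

# Toy data: rows 0 and 1 are exact copies
toy = pd.DataFrame({
    'date': ['Jun 15, 2008', 'Jun 15, 2008', 'Jun 8, 1988'],
    'registration': ['B-3841', 'B-3841', '61-2373'],
})

all_copies = toy[toy.duplicated(keep=False)]  # every row belonging to a duplicate group
later_copies = toy[toy.duplicated()]          # default keep='first': only the repeats
deduplicated = toy.drop_duplicates()          # keeps the first row of each group
```

Listing with `keep=False` is convenient for inspection, while `drop_duplicates` with its default settings is enough for the actual cleanup.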


In [None]:
# Check number of unique values
baaa_df.nunique()

date                            28447
aircraft_type                    1176
operator                         9365
registration                    34040
flight_phase                        5
flight_type                        31
survivors                           2
site                                6
schedule                        16829
msn                             20551
yom                               151
flight_number                    2814
location                        17272
country                           219
region                              9
crew_on_board                      31
crew_fatalities                    25
pax_on_board                      255
pax_fatalities                    187
other_fatalities                   47
total_fatalities                  202
captain_flying_hours             4132
captain_flying_hours_on_type     2129
copilot_flying_hours             1709
copilot_flying_hours_on_type     1071
aircraft_flying_hours            4939
aircraft_fli

### ASN

In [None]:
asn_df = pd.read_csv('data/asn_scraped_data.csv')
asn_df.head()

In [None]:
asn_df.info()

In [None]:
asn_df.isnull().sum()

In [None]:
# Check for duplicates
asn_df[asn_df.duplicated(keep=False)]

In [None]:
# Check number of unique values
asn_df.nunique()

---

## Data Cleaning

In [None]:
# Remove duplicates
baaa_df = baaa_df.drop_duplicates()
asn_df = asn_df.drop_duplicates()

In [None]:
# Strip whitespaces
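def remove_whitespaces(df):
	# Work on a copy so the caller's dataframe is not modified
	df = df.copy()
	for column in df.columns:
		# Only string (object) columns have the .str accessor
		if df[column].dtype == 'object':
			df[column] = df[column].str.strip()
	return df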

baaa_df = remove_whitespaces(baaa_df)
asn_df = remove_whitespaces(asn_df)

### Merge dataframes on date and registration number

Although it is unlikely, the same aircraft can be involved in more than one accident. Combining the registration number with the date gives a key that identifies each accident uniquely.

In [None]:
# Convert BAAA date to datetime
baaa_df['date'] = pd.to_datetime(baaa_df['date'], format='%b %d, %Y at %H%M LT', errors='coerce') \
				.fillna(pd.to_datetime(baaa_df['date'], format='%b %d, %Y', errors='coerce'))
assert baaa_df['date'].isna().sum() == 0
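
When the assertion fails, it helps to see which raw strings matched neither format. The sketch below applies the same two-format fallback to two strings from the preview above plus one deliberately malformed value (an assumption added for illustration) and isolates the failures:

```python
import pandas as pd

# Two BAAA-style strings and one value that matches neither format
raw = pd.Series(['Mar 13, 2025 at 0733 LT', 'Mar 7, 2025', 'unknown'])

with_time = pd.to_datetime(raw, format='%b %d, %Y at %H%M LT', errors='coerce')
date_only = pd.to_datetime(raw, format='%b %d, %Y', errors='coerce')
parsed = with_time.fillna(date_only)

# Rows that stayed NaT after both attempts
unparsed = raw[parsed.isna()]
```

Inspecting `unparsed` before asserting shows whether a third format is needed or whether the values are genuinely unusable.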

In [None]:
# Convert ASN date and time to datetime
# Parsing has no '%-d' directive; '%d' already accepts non-padded days
asn_df['date'] = (asn_df['date'] + ' ' + asn_df['time'].fillna('')).str.strip()
asn_df['date'] = pd.to_datetime(asn_df['date'], format='%A %d %B %Y %H:%M', errors='coerce') \
					.fillna(pd.to_datetime(asn_df['date'], format='%A %d %B %Y %I:%M %p', errors='coerce')) \
					.fillna(pd.to_datetime(asn_df['date'], format='%A %d %B %Y %H:%M LT', errors='coerce')) \
					.fillna(pd.to_datetime(asn_df['date'], format='%A %d %B %Y %H:%M UTC', errors='coerce')) \
					.fillna(pd.to_datetime(asn_df['date'], format='%A %d %B %Y', errors='coerce'))
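
The try-each-format idea can also be checked on individual strings with the standard library. The sketch below is a minimal illustration: the sample layouts are assumed from the format strings in this notebook rather than verified against the ASN site, and both helper names are hypothetical. It uses `%d`, which accepts padded and non-padded days when parsing; the platform-specific `%-d` is a formatting directive that `strptime` rejects.

```python
from datetime import datetime

# Candidate layouts, most specific first (assumed from the notebook's format strings)
FORMATS = [
    '%A %d %B %Y %H:%M',
    '%A %d %B %Y %I:%M %p',
    '%A %d %B %Y',
]

def parse_asn_datetime(text):
    """Return the first successful parse, or None (hypothetical helper)."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    return None

def parses_with(fmt, text):
    """Report whether strptime accepts this format for this text."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False
```

Calling `parses_with` on a handful of raw values is a quick way to confirm a format string before applying it to the whole column.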

In [None]:
# Create date string column
baaa_df['date_str'] = baaa_df['date'].dt.strftime('%Y-%m-%d')
asn_df['date_str'] = asn_df['date'].dt.strftime('%Y-%m-%d')

In [None]:
# Merge two dataframes
df = pd.merge(left=baaa_df, right=asn_df, how='outer', on=['registration', 'date_str'])
df.head()
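
To see how the composite key behaves, the sketch below merges two toy frames (made-up values standing in for `baaa_df` and `asn_df`) and uses the `indicator` option of `pd.merge` to count how many rows came from one source, the other, or both:

```python
import pandas as pd

# Toy frames: the same aircraft (N123AB) appears twice on different dates
left = pd.DataFrame({
    'registration': ['N123AB', 'N123AB', 'G-ABCD'],
    'date_str': ['2001-01-01', '2005-06-30', '2001-01-01'],
    'operator': ['A', 'A', 'B'],
})
right = pd.DataFrame({
    'registration': ['N123AB', 'D-EFGH'],
    'date_str': ['2005-06-30', '1999-12-31'],
    'owner': ['A', 'C'],
})

key = ['registration', 'date_str']
registration_repeats = left['registration'].duplicated().any()
key_repeats = left.duplicated(subset=key).any()

# indicator=True adds a '_merge' column: left_only, right_only or both
merged = pd.merge(left, right, how='outer', on=key, indicator=True)
counts = merged['_merge'].value_counts().to_dict()
```

One caveat worth checking on the real data: pandas matches null key values against each other when merging, so rows without a registration number can pair up on the date alone.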

### Add latitude and longitude

In [None]:
# Get coordinates from geocoder
geolocator = Nominatim(user_agent='aircraft_crashes_analysis')
geocoder = RateLimiter(geolocator.geocode, min_delay_seconds=1)

def get_coord(row):
	# Try each candidate column in turn and stop at the first match,
	# so the rate-limited geocoder is not called more often than needed
	for column in ['location_x', 'location_y', 'city']:
		query = row.get(column)  # None if the column is absent from df
		if pd.isna(query):
			continue
		location = geocoder(query, language='en', exactly_one=True)
		if location:
			return str(location.latitude) + ', ' + str(location.longitude)
	return None

df['lat_lng'] = df.apply(get_coord, axis=1)

In [None]:
# Split result into 2 columns
split_columns = df['lat_lng'].str.split(', ', expand=True)
df['latitude'] = split_columns[0]
df['longitude'] = split_columns[1]
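
With a one-second delay per request, geocoding every row takes a long time, and many rows share the same location string. One possible optimization is to look up each distinct location only once and reuse the result. The sketch below shows the idea with a stand-in geocoder so that it runs without network access; `fake_geocode`, `geocode_unique` and the coordinates are all hypothetical.

```python
calls = []

def fake_geocode(query):
    """Stand-in for the rate-limited geocoder (hypothetical, no network access)."""
    calls.append(query)
    known = {'Paris, France': (48.86, 2.35), 'Lima, Peru': (-12.05, -77.04)}
    return known.get(query)

def geocode_unique(locations, geocode):
    """Look up each distinct, non-empty location once and reuse the result."""
    cache = {}
    for location in locations:
        if location and location not in cache:
            cache[location] = geocode(location)
    return [cache.get(location) for location in locations]

locations = ['Paris, France', 'Lima, Peru', 'Paris, France', None, 'Nowhere']
coords = geocode_unique(locations, fake_geocode)
```

Failed lookups are cached as well, so a location that the geocoder cannot resolve is only requested once.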

In [None]:
df.info()

In [None]:
null_values = df.isnull().sum()
null_values

In [None]:
# Get proportion of rows with null values
nan_rows = df[df.isna().any(axis=1)]

f'{len(nan_rows) / len(df):.0%}'

In [None]:
# Get proportion of null values for each column
null_values_ratios = null_values / len(df)
null_values_ratios

In [None]:
# Check number of unique values
df.nunique()

### Drop rows and columns

In [10]:
# Drop columns with more than 5% of null values
columns = null_values_ratios[null_values_ratios > 0.05].index
df = df.drop(columns, axis=1)
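
The thresholding step can be tried on a toy frame first. In the sketch below (made-up values, 25 rows), one column is 96% null and another 4% null, so only the former crosses the 5% limit:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'total_fatalities': [1] * 25,                  # 0% missing
    'flight_number': [np.nan] * 24 + ['AB123'],    # 96% missing
    'location': ['Lima'] * 24 + [None],            # 4% missing
})

null_ratios = toy.isna().mean()                    # equivalent to isna().sum() / len(toy)
to_drop = null_ratios[null_ratios > 0.05].index
trimmed = toy.drop(columns=to_drop)
```

Because the comparison is strict, a column sitting exactly at the threshold would be kept.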

In [12]:
# Drop rows with a null value in any column other than survivors and other_fatalities
subset = df.columns[~df.columns.isin(['survivors', 'other_fatalities'])]
df = df.dropna(subset=subset)
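
`dropna` defaults to `how='any'`, which removes a row as soon as one column of the subset is null. The sketch below contrasts this with `how='all'` on a toy frame with made-up values:

```python
import pandas as pd

toy = pd.DataFrame({
    'operator': ['A', None, None],
    'country': ['Peru', 'Sudan', None],
    'survivors': [None, 'Yes', 'No'],
})
subset = toy.columns[~toy.columns.isin(['survivors'])]

any_null = toy.dropna(subset=subset)             # default how='any'
all_null = toy.dropna(subset=subset, how='all')  # only rows empty across the whole subset
```

Row 0 survives both calls because its only null is in `survivors`, which is outside the subset; row 1 survives only with `how='all'`.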

### Impute missing values

In [13]:
# Impute the survivors column from the on-board and fatality counts
survivors = df['crew_on_board'] + df['pax_on_board'] - df['crew_fatalities'] - df['pax_fatalities'] > 0
df['survivors'] = np.where(survivors, 'Yes', 'No')

In [14]:
# Impute missing other_fatalities with 0
df['other_fatalities'] = df['other_fatalities'].fillna(0)

In [15]:
# Assert there are no more null values
assert df.isna().sum().sum() == 0

### Convert columns

In [18]:
# Convert flight_phase, flight_type, site and region to category
df[['flight_phase', 'flight_type', 'site', 'region']] = \
	df[['flight_phase', 'flight_type', 'site', 'region']].astype('category')

In [19]:
# Convert survivors to boolean
df['survivors'] = df['survivors'] == 'Yes'

In [None]:
# Convert float columns to integer
columns = df.select_dtypes(include=[float]).columns
df[columns] = df[columns].astype('int')
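
The conversion to `int` only works because the null values were removed beforehand. The sketch below, on a made-up series, shows what happens when a gap remains and how the pandas nullable `Int64` dtype could be used in that case:

```python
import numpy as np
import pandas as pd

counts = pd.Series([2.0, np.nan, 13.0])

try:
    counts.astype('int')
    plain_int_ok = True
except ValueError:
    # NaN has no representation in NumPy's integer dtypes
    plain_int_ok = False

# Nullable integer dtype: the gap is kept as <NA>
nullable = counts.astype('Int64')
```

Using `Int64` would be an alternative if the fatality counts had to stay as integers without dropping or imputing rows first.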

### Export data

In [21]:
# Sort data from the earliest to the latest crash
df = df.sort_values(by='date')

In [22]:
# Reset index
df = df.reset_index(drop=True)

In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34956 entries, 0 to 34955
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   date              34956 non-null  datetime64[ns]
 1   aircraft_type     34956 non-null  object        
 2   operator          34956 non-null  object        
 3   flight_phase      34956 non-null  category      
 4   flight_type       34956 non-null  category      
 5   survivors         34956 non-null  bool          
 6   site              34956 non-null  category      
 7   location          34956 non-null  object        
 8   country           34956 non-null  object        
 9   region            34956 non-null  category      
 10  crew_on_board     34956 non-null  int64         
 11  crew_fatalities   34956 non-null  int64         
 12  pax_on_board      34956 non-null  int64         
 13  pax_fatalities    34956 non-null  int64         
 14  other_fatalities  3495

In [24]:
# Serialize data to pickle
with open('data/crashes_cleaned_data.pkl', 'wb') as handle:
  pickle.dump(df, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [25]:
# Export data to CSV
df.to_csv('data/crashes_cleaned_data.csv', index=False)
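
The two export formats differ in what they preserve: a pickle keeps the dtypes set during cleaning, whereas a CSV stores text only. The sketch below illustrates this on a toy frame with made-up values and in-memory buffers, and shows one way the dtypes could be restored when reading the CSV back:

```python
import io
import pickle
import pandas as pd

toy = pd.DataFrame({
    'date': pd.to_datetime(['1928-10-18', '2008-06-15']),
    'site': pd.Series(['Mountains', 'Mountains'], dtype='category'),
    'survivors': [False, True],
})

# Pickle round trip: dtypes survive unchanged
from_pickle = pickle.loads(pickle.dumps(toy, protocol=pickle.HIGHEST_PROTOCOL))

# CSV round trip: dtypes are re-inferred from text on load
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
from_csv = pd.read_csv(buffer)

# Restoring the dtypes explicitly when reading the CSV
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=['date'], dtype={'site': 'category'})
```

The CSV remains the more portable file, so passing `parse_dates` and `dtype` at load time is a reasonable trade-off for consumers that cannot read the pickle.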

## End