# Aircraft Crashes Data Collection And Cleaning

## Overview

This notebook collects and prepares the data for an analysis of aircraft accidents since 1918.

### About the dataset

**Crash**

`date` date and local time of the accident<br>
`aircraft_type` aircraft make and model<br>
`operator` operator of the aircraft<br>
`registration` registration code unique to a single aircraft<br>
`flight_phase` phase of the flight when the accident occurred<br>
`flight_type` type of flight<br>
`survivors` indicates whether there were survivors<br>
`site` type of location where the accident happened (ex: mountains)<br>
`schedule` planned route of the flight<br>
`msn` manufacturer's serial number of the aircraft<br>
`yom` year of manufacture of the aircraft<br>
`flight_number` flight number<br>
`location` location of the accident<br>
`country` country where the crash happened<br>
`region` region of the world where the crash happened<br>
`crew_on_board` number of crew members on board at the time of the accident<br>
`crew_fatalities` number of crew members who died in the crash<br>
`pax_on_board` number of passengers on board at the time of the accident<br>     
`pax_fatalities` number of passengers who died in the crash<br>                 
`other_fatalities` other victims of the accident outside of the aircraft<br>
`total_fatalities` total number of deaths<br>
`captain_flying_hours` number of flying hours of the captain<br>
`captain_flying_hours_on_type` number of hours the captain flew on the type of aircraft involved in the crash<br>
`copilot_flying_hours` number of flying hours of the copilot<br>  
`copilot_flying_hours_on_type` number of hours the copilot flew on the type of aircraft involved in the crash<br>  
`aircraft_flying_hours` number of flying hours of the plane involved in the crash<br>
`aircraft_flight_cycles` number of flights of the aircraft<br><br>


**Aircraft**

`make` manufacturer<br>
`model` model (name)<br>
`body` body type<br>
`wing` type of wing<br>
`position` wing position<br>
`tail` tail configuration<br>
`engine` type of engine<br>
`engine_count` number of engines<br>
`wing_span` distance from one wingtip to the opposite, in meters<br>
`length` length in meters<br>
`height` height in meters<br>
`manufactured_as` other names of the aircraft

## Data Collection

In [None]:
from bs4 import BeautifulSoup
import math
import pandas as pd
import re
import requests
from urllib.parse import unquote

In [108]:
# Scrape total number of planes
root_url = 'https://skybrary.aero'
aircraft_types = '/aircraft-types'

response = requests.get(root_url + aircraft_types)
soup = BeautifulSoup(response.content, 'html.parser')
view_header = soup.find('div', {'class': 'view-header'}).text

pattern = re.compile(r'(?<=below )\d+(?= results)')
nb_planes_str = pattern.search(view_header).group(0)
nb_planes = int(nb_planes_str)
nb_planes

580
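The lookbehind and lookahead in the pattern above keep only the digits between `below ` and ` results`. Here is a minimal sketch of the same extraction on a plain string, with a guard for the case where the header wording changes. The sample header text and the `extract_result_count` helper are assumptions made for illustration, not the actual page content.

```python
import re

# Same pattern as in the cell above: digits preceded by "below " and followed by " results"
RESULT_COUNT_PATTERN = re.compile(r'(?<=below )\d+(?= results)')


def extract_result_count(header_text):
    """Return the result count as an int, or None if the pattern is not found."""
    match = RESULT_COUNT_PATTERN.search(header_text)
    return int(match.group(0)) if match else None


# Hypothetical header text, shaped only to satisfy the pattern
sample_header = 'Displaying below 580 results'
print(extract_result_count(sample_header))
print(extract_result_count('No count here'))
```

Returning `None` instead of calling `.group(0)` directly avoids an `AttributeError` when the search finds nothing.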

In [109]:
# Scrape details of all planes
items_per_page = 50
nb_pages = math.ceil(nb_planes / items_per_page)
csv_path = 'data/planes_scraped_data.csv'

# Optional fields share the same markup, so a single helper extracts them all
def get_field_text(page, field_class):
	field_div = page.find('div', {'class': field_class})
	return field_div.find('div', {'class': 'field-items'}).find('div').text if field_div else None

# Column name -> CSS class of the field wrapper, in output column order
optional_fields = {
	'body': 'field-node-field__body-type',
	'position': 'field-node-field-wing-position',
	'wing': 'field-node-field-wing-type',
	'tail': 'field-node-field-tail-configuration',
	'engine': 'field-node-field-engine-type',
	'engine_count': 'field-node-field-engine-count',
	'wing_span': 'field-node-field-aircraft-wing-span',
	'length': 'field-node-field-aircraft-length',
	'height': 'field-node-field-aircraft-height',
}

for i in range(nb_pages):
	listing_url = '{}{}?items_per_page={}&page={}'.format(root_url, aircraft_types, items_per_page, i)
	response = requests.get(listing_url)
	soup = BeautifulSoup(response.content, 'html.parser')
	masonry_items = soup.find_all('div', {'class': 'masonry-item views-row'})

	plane_list = []

	for j, item in enumerate(masonry_items):
		link = item.find('a')['href']
		details_url = root_url + link
		print('Page {}, link {}: {}'.format(i, j + 1, details_url))
		response = requests.get(details_url)
		details_soup = BeautifulSoup(response.content, 'html.parser')

		details = {}

		details['name'] = details_soup.find('div', {'class': 'field-node-dynamic-token-fieldnode-aircraft-name'}) \
			.find('div', {'class': 'field-items'}).find('p').text
		details['make'] = details_soup.find('div', {'class': 'field-node-field-aircraft-manufacturer'}).find('a').text
		details['model'] = details_soup.find('div', {'class': 'field-node-field-aircraft-name'}) \
			.find('div', {'class': 'field-items'}).find('div').text

		for column, field_class in optional_fields.items():
			details[column] = get_field_text(details_soup, field_class)

		manufactured_as = details_soup.find('div', {'class': 'view-manufacturered-as'}).find_all('span', {'class': 'field-content'})
		details['manufactured_as'] = ', '.join(span.text for span in manufactured_as) if manufactured_as else None

		plane_list.append(details)

	planes_df = pd.DataFrame(plane_list)

	# Write the header with the first page only, then append the later pages
	if i == 0:
		planes_df.to_csv(csv_path, index=False)
	else:
		planes_df.to_csv(csv_path, index=False, header=False, mode='a')
		


Page 0, link 1: https://skybrary.aero/aircraft/f260
Page 0, link 2: https://skybrary.aero/aircraft/m339
Page 0, link 3: https://skybrary.aero/aircraft/ac56
Page 0, link 4: https://skybrary.aero/aircraft/l159
Page 0, link 5: https://skybrary.aero/aircraft/ac68
Page 0, link 6: https://skybrary.aero/aircraft/sgup
Page 0, link 7: https://skybrary.aero/aircraft/l39
Page 0, link 8: https://skybrary.aero/aircraft/alo2
Page 0, link 9: https://skybrary.aero/aircraft/alo3
Page 0, link 10: https://skybrary.aero/aircraft/as32
Page 0, link 11: https://skybrary.aero/aircraft/as3b
Page 0, link 12: https://skybrary.aero/aircraft/as50
Page 0, link 13: https://skybrary.aero/aircraft/as55
Page 0, link 14: https://skybrary.aero/aircraft/as65
Page 0, link 15: https://skybrary.aero/aircraft/frel
Page 0, link 16: https://skybrary.aero/aircraft/gazl
Page 0, link 17: https://skybrary.aero/aircraft/lama
Page 0, link 18: https://skybrary.aero/aircraft/n262
Page 0, link 19: https://skybrary.aero/aircraft/puma
...
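The page loop above relies on simple arithmetic: with 580 aircraft and 50 items per page, `math.ceil` gives 12 listing pages, numbered from 0. The following is a small sketch of that URL construction without any network access. `build_listing_urls` is a hypothetical helper name.

```python
import math

root_url = 'https://skybrary.aero'
aircraft_types = '/aircraft-types'


def build_listing_urls(nb_planes, items_per_page=50):
    """Build one listing URL per page; page numbers start at 0."""
    nb_pages = math.ceil(nb_planes / items_per_page)
    return [
        '{}{}?items_per_page={}&page={}'.format(root_url, aircraft_types, items_per_page, page)
        for page in range(nb_pages)
    ]


urls = build_listing_urls(580)
print(len(urls))
print(urls[0])
print(urls[-1])
```

Building the URLs up front makes it easy to check the page count before sending any request.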

---

## Data Exploration

### Crashes

In [35]:
crashes_df = pd.read_csv('data/crashes_scraped_data.csv')
crashes_df.head()

Unnamed: 0,date,aircraft_type,operator,registration,flight_phase,flight_type,survivors,site,schedule,msn,...,pax_on_board,pax_fatalities,other_fatalities,total_fatalities,captain_flying_hours,captain_flying_hours_on_type,copilot_flying_hours,copilot_flying_hours_on_type,aircraft_flying_hours,aircraft_flight_cycles
0,"Mar 13, 2025 at 0733 LT",Cessna 525 CitationJet CJ2,LBL 525 CZ LLC,N525CZ,Takeoff (climb),Private,No,"Plain, Valley",Mesquite - Addison,525A-0380,...,0.0,0.0,0.0,1,,,,,,
1,"Mar 7, 2025",Antonov AN-32,Indian Air Force - Bharatiya Vayu Sena,,Landing (descent or approach),Military,Yes,Airport (less than 10 km from airport),,,...,0.0,0.0,0.0,0,,,,,,
2,"Mar 4, 2025 at 0954 LT",BAe Jetstream 31,SAETA Perú (Servicios Aéreos Tarapota),OB-2178,Landing (descent or approach),Scheduled Revenue Flight,Yes,Airport (less than 10 km from airport),Iquitos - Güeppí,861,...,11.0,0.0,0.0,0,,,,,,
3,"Feb 25, 2025",Antonov AN-26,Sudanese Air Force - Al Quwwat al-Jawwiya As-S...,,Takeoff (climb),Military,No,City,,,...,13.0,13.0,29.0,46,,,,,,
4,"Feb 23, 2025",Ilyushin II-76,Sudanese Air Force - Al Quwwat al-Jawwiya As-S...,1106,Flight,Military,No,Desert,,10234 08265,...,0.0,0.0,0.0,7,,,,,,


In [None]:
crashes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36086 entries, 0 to 36085
Data columns (total 27 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   date                          36086 non-null  object 
 1   aircraft_type                 36086 non-null  object 
 2   operator                      36084 non-null  object 
 3   registration                  34899 non-null  object 
 4   flight_phase                  35475 non-null  object 
 5   flight_type                   36029 non-null  object 
 6   survivors                     34810 non-null  object 
 7   site                          35719 non-null  object 
 8   schedule                      25712 non-null  object 
 9   msn                           28064 non-null  object 
 10  yom                           26336 non-null  float64
 11  flight_number                 2895 non-null   object 
 12  location                      36075 non-null  object 
 ...

In [None]:
crashes_df.isnull().sum()

date                                0
aircraft_type                       0
operator                            2
registration                     1187
flight_phase                      611
flight_type                        57
survivors                        1276
site                              367
schedule                        10374
msn                              8022
yom                              9750
flight_number                   33191
location                           11
country                             3
region                              2
crew_on_board                      22
crew_fatalities                     1
pax_on_board                       50
pax_fatalities                      4
other_fatalities                   16
total_fatalities                    0
captain_flying_hours            29206
captain_flying_hours_on_type    30241
copilot_flying_hours            33855
copilot_flying_hours_on_type    34096
aircraft_flying_hours           30383
...
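Raw null counts are easier to compare as a share of the 36086 rows reported by `info()`. This quick sketch uses three counts copied from the output above; `missing_share` is an illustrative helper name.

```python
total_rows = 36086  # RangeIndex size reported by crashes_df.info()

# Null counts copied from the isnull().sum() output
null_counts = {
    'schedule': 10374,
    'flight_number': 33191,
    'captain_flying_hours': 29206,
}


def missing_share(null_count, row_count):
    """Percentage of rows with a missing value, rounded to one decimal."""
    return round(100 * null_count / row_count, 1)


for column, count in null_counts.items():
    print('{}: {}%'.format(column, missing_share(count, total_rows)))
```

With pandas, `crashes_df.isnull().mean() * 100` should give the same percentages for every column at once.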

In [None]:
# Check for duplicates
crashes_df[crashes_df.duplicated(keep=False)]

Unnamed: 0,date,aircraft_type,operator,registration,flight_phase,flight_type,survivors,site,schedule,msn,...,pax_on_board,pax_fatalities,other_fatalities,total_fatalities,captain_flying_hours,captain_flying_hours_on_type,copilot_flying_hours,copilot_flying_hours_on_type,aircraft_flying_hours,aircraft_flight_cycles
2499,"Jun 15, 2008",Harbin Yunsunji Y-12,China Flying Dragon Special Aviation Company,B-3841,Flight,Geographical / Geophysical / Scientific,Yes,Mountains,,0061,...,2.0,2.0,0.0,3,,,,,,
2500,"Jun 15, 2008",Harbin Yunsunji Y-12,China Flying Dragon Special Aviation Company,B-3841,Flight,Geographical / Geophysical / Scientific,Yes,Mountains,,0061,...,2.0,2.0,0.0,3,,,,,,
7539,"Jun 8, 1988",Lockheed C-130 Hercules,United States Air Force - USAF (since 1947),61-2373,Landing (descent or approach),Training,No,Airport (less than 10 km from airport),Little Rock - Greenville,3720,...,0.0,0.0,0.0,6,,,,,,
7540,"Jun 8, 1988",Lockheed C-130 Hercules,United States Air Force - USAF (since 1947),61-2373,Landing (descent or approach),Training,No,Airport (less than 10 km from airport),Little Rock - Greenville,3720,...,0.0,0.0,0.0,6,,,,,,
7659,"Dec 28, 1987",PZL-Mielec AN-2,Aeroflot - Russian International Airlines,CCCP-02531,Takeoff (climb),Scheduled Revenue Flight,Yes,"Plain, Valley",,1G121-15,...,0.0,0.0,0.0,0,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33820,"Sep 30, 1933",Avro 594 Avian,Holden's Air Transport Services,VH-UIV,Landing (descent or approach),Cargo,Yes,Airport (less than 10 km from airport),Salamaua – Bulolo,193,...,1.0,0.0,0.0,0,,,,,,
34999,"Oct 18, 1928",Douglas M-3,National Air Transport - USA,NC1064,Flight,Postal (mail),No,Mountains,Cleveland – New York,658,...,0.0,0.0,0.0,1,,,,,,
35000,"Oct 18, 1928",Douglas M-3,National Air Transport - USA,NC1064,Flight,Postal (mail),No,Mountains,Cleveland – New York,658,...,0.0,0.0,0.0,1,,,,,,
35539,"Dec 31, 1923",Loening 23 Air Yacht,New York-Newport Air Service,,,Scheduled Revenue Flight,Yes,"Lake, Sea, Ocean, River",,,...,0.0,0.0,0.0,0,,,,,,
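`duplicated(keep=False)` marks every member of a duplicate group, which is why both copies of a record appear in the output above. Here is the same idea on plain tuples, as a sketch. The rows are abbreviated from the outputs above and `all_duplicates` is a hypothetical helper.

```python
from collections import Counter

# Abbreviated (date, aircraft_type, registration) rows taken from the outputs above
rows = [
    ('Jun 15, 2008', 'Harbin Yunsunji Y-12', 'B-3841'),
    ('Jun 15, 2008', 'Harbin Yunsunji Y-12', 'B-3841'),
    ('Jun 8, 1988', 'Lockheed C-130 Hercules', '61-2373'),
    ('Mar 7, 2025', 'Antonov AN-32', None),
]


def all_duplicates(records):
    """Return the indices of every record that occurs more than once."""
    counts = Counter(records)
    return [index for index, record in enumerate(records) if counts[record] > 1]


print(all_duplicates(rows))
```

By contrast, the default `keep='first'` would flag only the second copy of each pair, which is the set of rows that `drop_duplicates()` removes.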


In [None]:
# Check number of unique values
crashes_df.nunique()

date                            28447
aircraft_type                    1176
operator                         9365
registration                    34040
flight_phase                        5
flight_type                        31
survivors                           2
site                                6
schedule                        16829
msn                             20551
yom                               151
flight_number                    2814
location                        17272
country                           219
region                              9
crew_on_board                      31
crew_fatalities                    25
pax_on_board                      255
pax_fatalities                    187
other_fatalities                   47
total_fatalities                  202
captain_flying_hours             4132
captain_flying_hours_on_type     2129
copilot_flying_hours             1709
copilot_flying_hours_on_type     1071
aircraft_flying_hours            4939
...

### Planes

In [110]:
planes_df = pd.read_csv('data/planes_scraped_data.csv')
planes_df.head()

Unnamed: 0,name,make,model,body,position,wing,tail,engine,engine_count,wing_span,length,height,manufactured_as
0,AERMACCHI SF.260,AERMACCHI,SF.260,Narrow,,Fixed Wing,,Piston,Single,8.22 m,7.00 m,2.60 m,"TUSAS SF-260, SIAI-MARCHETTI SF-260E, SIAI-MAR..."
1,AERMACCHI MB-339,AERMACCHI,MB-339,Narrow,Low wing with wing tip tanks,Fixed Wing,"Regular tail, low set",Jet,Single,10.90 m,11.00 m,4.00 m,AERMACCHI MB-339
2,AERO (1) Commander 560,AERO (1),Commander 560,Narrow,High wing,Fixed Wing,"Regular tail, low set",Piston,Multi,14.90 m,11.20 m,4.60 m,"AERO (1) L-26B Commander 560, AERO (1) U-4A Co..."
3,AERO (2) L-159,AERO (2),L-159,Narrow,Low wing with wing tip tanks,Fixed Wing,"Regular tail, mid set",Jet,Single,9.50 m,12.70 m,4.80 m,"AERO (2) Albatros 2, AERO (2) L-159 Albatros 2..."
4,AERO COMMANDER Commander 680F,AERO COMMANDER,Commander 680F,Narrow,High wing,Fixed Wing,"Regular tail, low set",Piston,Multi,13.40 m,13.10 m,4.60 m,"AERO (1) U-9 Commander 680 Super, AERO (1) U-4..."


In [111]:
planes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 580 entries, 0 to 579
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   name             580 non-null    object
 1   make             580 non-null    object
 2   model            580 non-null    object
 3   body             520 non-null    object
 4   position         440 non-null    object
 5   wing             578 non-null    object
 6   tail             438 non-null    object
 7   engine           577 non-null    object
 8   engine_count     577 non-null    object
 9   wing_span        574 non-null    object
 10  length           574 non-null    object
 11  height           572 non-null    object
 12  manufactured_as  538 non-null    object
dtypes: object(13)
memory usage: 59.0+ KB


In [113]:
planes_df.isnull().sum()

name                 0
make                 0
model                0
body                60
position           140
wing                 2
tail               142
engine               3
engine_count         3
wing_span            6
length               6
height               8
manufactured_as     42
dtype: int64

In [114]:
# Check for duplicates
planes_df[planes_df.duplicated(keep=False)]

Unnamed: 0,name,make,model,body,position,wing,tail,engine,engine_count,wing_span,length,height,manufactured_as


In [115]:
planes_df.nunique()

name               579
make               125
model              578
body                 2
position            25
wing                 3
tail                21
engine               5
engine_count         2
wing_span          319
length             402
height             292
manufactured_as    537
dtype: int64

---

## Data Cleaning

### Planes

In [116]:
# Convert wing_span, length and height from strings such as '8.22 m' to float
dimensions = ['wing_span', 'length', 'height']

for column in dimensions:
	# regex=False treats ' m' as a literal suffix rather than a pattern
	planes_df[column] = planes_df[column].str.replace(' m', '', regex=False).astype('float')


planes_df[dimensions].describe()

Unnamed: 0,wing_span,length,height
count,574.0,574.0,572.0
mean,22.318049,22.744599,6.712605
std,15.02324,16.81549,4.504609
min,5.3,2.83,1.0
25%,11.5,10.58,3.55
50%,15.85,16.05,4.96
75%,28.7,30.1775,8.59
max,88.5,84.0,24.09
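The conversion above assumes that every non-null dimension looks like `8.22 m`. A stricter sketch parses one value at a time and returns `None` for anything unexpected. `parse_meters` is an illustrative helper, not part of the notebook.

```python
import re

# A number (with optional decimals) followed by the unit "m"
METERS_PATTERN = re.compile(r'^\s*(\d+(?:\.\d+)?)\s*m\s*$')


def parse_meters(value):
    """Convert a string such as '8.22 m' to a float; return None if it does not match."""
    if not isinstance(value, str):
        return None
    match = METERS_PATTERN.match(value)
    return float(match.group(1)) if match else None


print(parse_meters('8.22 m'))
print(parse_meters('10.90 m'))
print(parse_meters(None))
print(parse_meters('unknown'))
```

Unlike `astype('float')`, which raises on the first bad value, this approach turns unexpected strings into missing values that can be inspected afterwards.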


### Crashes

In [None]:
# Convert date column to datetime
crashes_df['date'] = pd.to_datetime(crashes_df['date'], format='%b %d, %Y at %H%M LT', errors='coerce') \
					.fillna(pd.to_datetime(crashes_df['date'], format='%b %d, %Y', errors='coerce'))
crashes_df.sample(5)

Unnamed: 0,date,aircraft_type,operator,registration,flight_phase,flight_type,survivors,site,schedule,msn,...,pax_on_board,pax_fatalities,other_fatalities,total_fatalities,captain_flying_hours,captain_flying_hours_on_type,copilot_flying_hours,copilot_flying_hours_on_type,aircraft_flying_hours,aircraft_flight_cycles
7005,1990-01-11 10:30:00,Boeing KC-135 Stratotanker,United States Air Force - USAF (since 1947),59-1494,Parking,Military,Yes,Airport (less than 10 km from airport),,17982.0,...,0.0,0.0,0.0,0,,,,,,
28697,1941-06-13 01:35:00,Armstrong Whitworth AW.38 Whitley,Royal Air Force - RAF,Z6721,Flight,Bombing,Yes,"Lake, Sea, Ocean, River",Leeming - Leeming,2109.0,...,0.0,0.0,0.0,0,,,,,,
11599,1976-01-21 00:00:00,Antonov AN-24,CAAC - Civil Aviation Administration of China,B-492,Landing (descent or approach),Scheduled Revenue Flight,No,Airport (less than 10 km from airport),Guangzhou – Changsha – Hangzhou – Shanghai,,...,36.0,36.0,0.0,40,,,,,,
18828,1952-02-03 00:00:00,De Havilland DH.104 Dove,Indian Air Force - Bharatiya Vayu Sena,HW516,Takeoff (climb),Military,Yes,Airport (less than 10 km from airport),Lucknow – New Delhi,4159.0,...,6.0,0.0,0.0,0,,,,,,
10155,1979-12-23 00:00:00,GAF Nomad N.22,Douglas Airways,P2-DNL,Landing (descent or approach),Scheduled Revenue Flight,No,Airport (less than 10 km from airport),Port Moresby - Manari,39.0,...,14.0,14.0,0.0,16,5000.0,,,,,
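The `fillna` chain above tries the format with a local time first and falls back to the date-only format, which is why date-only records show `00:00:00`. Below is the same fallback for a single string, using only the standard library. `parse_crash_date` is a hypothetical helper.

```python
from datetime import datetime

# Formats seen in the raw data, most specific first
DATE_FORMATS = ('%b %d, %Y at %H%M LT', '%b %d, %Y')


def parse_crash_date(text):
    """Try each known format in turn; return None if none of them fits."""
    for date_format in DATE_FORMATS:
        try:
            return datetime.strptime(text, date_format)
        except ValueError:
            continue
    return None


print(parse_crash_date('Mar 13, 2025 at 0733 LT'))
print(parse_crash_date('Mar 7, 2025'))
print(parse_crash_date('unknown'))
```

Returning `None` for unknown formats mirrors `errors='coerce'`, which produces `NaT` instead of raising.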


In [94]:
# Remove duplicates
crashes_df = crashes_df.drop_duplicates()

In [121]:
crashes_df['aircraft_type'].nunique()

1176

In [None]:
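# TODO: Collapse aircraft types in crashes_df onto the canonical names from planes_df
# Each entry holds the canonical name followed by its known aliases
possible_names = planes_df['name'] + ', ' + planes_df['manufactured_as'].fillna('')

for names_str in possible_names:
	# Drop the empty alias left behind when manufactured_as is missing
	name_list = [name for name in names_str.split(', ') if name]
	same_plane_mask = crashes_df['aircraft_type'].isin(name_list)
	crashes_df.loc[same_plane_mask, 'aircraft_type'] = name_list[0]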

In [131]:
crashes_df['aircraft_type'].nunique()

1176
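The count is 1176 both before and after the loop, so the exact `isin` match appears to change little. One possible cause is casing: manufacturer names in the planes data are upper case (`AERMACCHI`), while crash records use mixed case (`Cessna`). The sketch below shows a case-insensitive alias lookup. The crash-side strings, the `EXAMPLE X-1` entry, and the `build_alias_map` and `canonical_name` helpers are all hypothetical, and this is one possible approach rather than the notebook's method.

```python
def build_alias_map(planes):
    """Map every case-folded name or alias to its canonical name."""
    alias_map = {}
    for name, manufactured_as in planes:
        aliases = manufactured_as.split(', ') if manufactured_as else []
        for alias in [name] + aliases:
            # setdefault keeps the first canonical name seen for an alias
            alias_map.setdefault(alias.casefold(), name)
    return alias_map


def canonical_name(aircraft_type, alias_map):
    """Return the canonical name, or the input unchanged when no alias matches."""
    return alias_map.get(aircraft_type.casefold(), aircraft_type)


# (name, manufactured_as) pairs: the first comes from planes_df.head(),
# the second is a made-up entry used to show alias handling
planes = [
    ('AERMACCHI MB-339', 'AERMACCHI MB-339'),
    ('EXAMPLE X-1', 'EXAMPLE X-1A, EXAMPLE X-1B'),
]
alias_map = build_alias_map(planes)

print(canonical_name('Aermacchi MB-339', alias_map))
print(canonical_name('Example X-1b', alias_map))
print(canonical_name('Cessna 525 CitationJet CJ2', alias_map))
```

A dictionary lookup also replaces the per-plane `isin` scan with a single pass over the crash records. Casing may not be the only mismatch, so differences in manufacturer or model spelling would still need separate handling.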

In [83]:
crashes_df.columns

Index(['date', 'aircraft_type', 'operator', 'registration', 'flight_phase',
       'flight_type', 'survivors', 'site', 'schedule', 'msn', 'yom',
       'flight_number', 'location', 'country', 'region', 'crew_on_board',
       'crew_fatalities', 'pax_on_board', 'pax_fatalities', 'other_fatalities',
       'total_fatalities', 'captain_flying_hours',
       'captain_flying_hours_on_type', 'copilot_flying_hours',
       'copilot_flying_hours_on_type', 'aircraft_flying_hours',
       'aircraft_flight_cycles'],
      dtype='object')

## End