# Aircraft Crashes Data Collection And Cleaning

## Overview

This notebook collects and prepares the data for an analysis of aircraft accidents recorded since 1918.

### About dataset

`date` date and local time of the accident<br>
`aircraft_type` aircraft make and model<br>
`operator` operator of the aircraft<br>
`registration` registration code unique to a single aircraft<br>
`flight_phase` phase of the flight when the accident occurred<br>
`flight_type` type of flight (e.g. military)<br>
`survivors` indicates whether there were survivors<br>
`site` type of location where the accident happened (e.g. mountains)<br>
`schedule` planned route of the flight<br>
`msn` manufacturer's serial number of the aircraft<br>
`yom` year of manufacture of the aircraft<br>
`flight_number` flight number<br>
`location` location of the accident<br>
`country` country where the crash happened<br>
`region` region of the world where the crash happened<br>
`crew_on_board` number of crew members on board at the time of the accident<br>
`crew_fatalities` number of crew members who died in the crash<br>
`pax_on_board` number of passengers on board at the time of the accident<br> 
`pax_fatalities` number of passengers who died in the crash<br>                 
`other_fatalities` other victims of the accident outside of the aircraft<br>
`total_fatalities` total number of deaths<br>
`captain_flying_hours` number of flying hours of the captain<br>
`captain_flying_hours_on_type` number of hours the captain flew on the type of aircraft involved in the crash<br>
`copilot_flying_hours` number of flying hours of the copilot<br>  
`copilot_flying_hours_on_type` number of hours the copilot flew on the type of aircraft involved in the crash<br>  
`aircraft_flying_hours` number of flying hours of the plane involved in the crash<br>
`aircraft_flight_cycles` number of flights of the aircraft<br><br>
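
The fatality columns described above suggest a simple consistency check. The sketch below assumes that `total_fatalities` is meant to equal the sum of `crew_fatalities`, `pax_fatalities` and `other_fatalities`. The source does not state this, so treat it as a hypothesis to test on the real data. The toy values are made up for illustration.

```python
import pandas as pd

# Toy rows shaped like the dataset; the numbers are invented.
toy = pd.DataFrame({
    "crew_fatalities": [2, 0, 4, 2],
    "pax_fatalities": [3, 0, 13, 1],
    "other_fatalities": [0, 0, 29, 0],
    "total_fatalities": [5, 0, 46, 5],
})

# Assumption (not confirmed by the source): total = crew + pax + other.
parts = toy[["crew_fatalities", "pax_fatalities", "other_fatalities"]].sum(axis=1)

# Rows where the reported total disagrees with the sum of its parts.
inconsistent = toy[parts != toy["total_fatalities"]]
print(inconsistent)
```

Only the last toy row is flagged, because its parts sum to 3 while the reported total is 5.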

---

## Data Collection

In [1]:
from bs4 import BeautifulSoup
from datetime import datetime
import math
import numpy as np
import pandas as pd
import pickle
import re
import requests
from urllib.parse import unquote
from thefuzz import fuzz, process

---

## Data Exploration

In [2]:
df = pd.read_csv('data/crashes_scraped_data.csv')
df.head()

Unnamed: 0,date,aircraft_type,operator,registration,flight_phase,flight_type,survivors,site,schedule,msn,...,pax_on_board,pax_fatalities,other_fatalities,total_fatalities,captain_flying_hours,captain_flying_hours_on_type,copilot_flying_hours,copilot_flying_hours_on_type,aircraft_flying_hours,aircraft_flight_cycles
0,"Mar 13, 2025 at 0733 LT",Cessna 525 CitationJet CJ2,LBL 525 CZ LLC,N525CZ,Takeoff (climb),Private,No,"Plain, Valley",Mesquite - Addison,525A-0380,...,0.0,0.0,0.0,1,,,,,,
1,"Mar 7, 2025",Antonov AN-32,Indian Air Force - Bharatiya Vayu Sena,,Landing (descent or approach),Military,Yes,Airport (less than 10 km from airport),,,...,0.0,0.0,0.0,0,,,,,,
2,"Mar 4, 2025 at 0954 LT",BAe Jetstream 31,SAETA Perú (Servicios Aéreos Tarapota),OB-2178,Landing (descent or approach),Scheduled Revenue Flight,Yes,Airport (less than 10 km from airport),Iquitos - Güeppí,861,...,11.0,0.0,0.0,0,,,,,,
3,"Feb 25, 2025",Antonov AN-26,Sudanese Air Force - Al Quwwat al-Jawwiya As-S...,,Takeoff (climb),Military,No,City,,,...,13.0,13.0,29.0,46,,,,,,
4,"Feb 23, 2025",Ilyushin II-76,Sudanese Air Force - Al Quwwat al-Jawwiya As-S...,1106,Flight,Military,No,Desert,,10234 08265,...,0.0,0.0,0.0,7,,,,,,


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36086 entries, 0 to 36085
Data columns (total 27 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   date                          36086 non-null  object 
 1   aircraft_type                 36086 non-null  object 
 2   operator                      36084 non-null  object 
 3   registration                  34899 non-null  object 
 4   flight_phase                  35475 non-null  object 
 5   flight_type                   36029 non-null  object 
 6   survivors                     34810 non-null  object 
 7   site                          35719 non-null  object 
 8   schedule                      25712 non-null  object 
 9   msn                           28064 non-null  object 
 10  yom                           26336 non-null  float64
 11  flight_number                 2895 non-null   object 
 12  location                      36075 non-null  object 
 13  c

In [4]:
null_values = df.isnull().sum()
null_values

date                                0
aircraft_type                       0
operator                            2
registration                     1187
flight_phase                      611
flight_type                        57
survivors                        1276
site                              367
schedule                        10374
msn                              8022
yom                              9750
flight_number                   33191
location                           11
country                             3
region                              2
crew_on_board                      22
crew_fatalities                     1
pax_on_board                       50
pax_fatalities                      4
other_fatalities                   16
total_fatalities                    0
captain_flying_hours            29206
captain_flying_hours_on_type    30241
copilot_flying_hours            33855
copilot_flying_hours_on_type    34096
aircraft_flying_hours           30383
aircraft_fli

In [5]:
# Get proportion of rows with null values
nan_rows = df[df.isna().any(axis=1)]

f'{len(nan_rows) / len(df):0%}'

'98.741894%'

In [6]:
# Get proportion of null values for each column
null_values_ratios = null_values / len(df)
null_values_ratios

date                            0.000000
aircraft_type                   0.000000
operator                        0.000055
registration                    0.032894
flight_phase                    0.016932
flight_type                     0.001580
survivors                       0.035360
site                            0.010170
schedule                        0.287480
msn                             0.222302
yom                             0.270188
flight_number                   0.919775
location                        0.000305
country                         0.000083
region                          0.000055
crew_on_board                   0.000610
crew_fatalities                 0.000028
pax_on_board                    0.001386
pax_fatalities                  0.000111
other_fatalities                0.000443
total_fatalities                0.000000
captain_flying_hours            0.809344
captain_flying_hours_on_type    0.838026
copilot_flying_hours            0.938175
copilot_flying_h

In [7]:
# Check for duplicates
df[df.duplicated(keep=False)]

Unnamed: 0,date,aircraft_type,operator,registration,flight_phase,flight_type,survivors,site,schedule,msn,...,pax_on_board,pax_fatalities,other_fatalities,total_fatalities,captain_flying_hours,captain_flying_hours_on_type,copilot_flying_hours,copilot_flying_hours_on_type,aircraft_flying_hours,aircraft_flight_cycles
2499,"Jun 15, 2008",Harbin Yunsunji Y-12,China Flying Dragon Special Aviation Company,B-3841,Flight,Geographical / Geophysical / Scientific,Yes,Mountains,,0061,...,2.0,2.0,0.0,3,,,,,,
2500,"Jun 15, 2008",Harbin Yunsunji Y-12,China Flying Dragon Special Aviation Company,B-3841,Flight,Geographical / Geophysical / Scientific,Yes,Mountains,,0061,...,2.0,2.0,0.0,3,,,,,,
7539,"Jun 8, 1988",Lockheed C-130 Hercules,United States Air Force - USAF (since 1947),61-2373,Landing (descent or approach),Training,No,Airport (less than 10 km from airport),Little Rock - Greenville,3720,...,0.0,0.0,0.0,6,,,,,,
7540,"Jun 8, 1988",Lockheed C-130 Hercules,United States Air Force - USAF (since 1947),61-2373,Landing (descent or approach),Training,No,Airport (less than 10 km from airport),Little Rock - Greenville,3720,...,0.0,0.0,0.0,6,,,,,,
7659,"Dec 28, 1987",PZL-Mielec AN-2,Aeroflot - Russian International Airlines,CCCP-02531,Takeoff (climb),Scheduled Revenue Flight,Yes,"Plain, Valley",,1G121-15,...,0.0,0.0,0.0,0,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33820,"Sep 30, 1933",Avro 594 Avian,Holden's Air Transport Services,VH-UIV,Landing (descent or approach),Cargo,Yes,Airport (less than 10 km from airport),Salamaua – Bulolo,193,...,1.0,0.0,0.0,0,,,,,,
34999,"Oct 18, 1928",Douglas M-3,National Air Transport - USA,NC1064,Flight,Postal (mail),No,Mountains,Cleveland – New York,658,...,0.0,0.0,0.0,1,,,,,,
35000,"Oct 18, 1928",Douglas M-3,National Air Transport - USA,NC1064,Flight,Postal (mail),No,Mountains,Cleveland – New York,658,...,0.0,0.0,0.0,1,,,,,,
35539,"Dec 31, 1923",Loening 23 Air Yacht,New York-Newport Air Service,,,Scheduled Revenue Flight,Yes,"Lake, Sea, Ocean, River",,,...,0.0,0.0,0.0,0,,,,,,


In [8]:
# Check number of unique values
df.nunique()

date                            28447
aircraft_type                    1176
operator                         9365
registration                    34040
flight_phase                        5
flight_type                        31
survivors                           2
site                                6
schedule                        16829
msn                             20551
yom                               151
flight_number                    2814
location                        17272
country                           219
region                              9
crew_on_board                      31
crew_fatalities                    25
pax_on_board                      255
pax_fatalities                    187
other_fatalities                   47
total_fatalities                  202
captain_flying_hours             4132
captain_flying_hours_on_type     2129
copilot_flying_hours             1709
copilot_flying_hours_on_type     1071
aircraft_flying_hours            4939
aircraft_fli

---

## Data Cleaning

### Drop rows and columns

In [9]:
# Remove duplicates
df = df.drop_duplicates()
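
As a small illustration of the two calls used for duplicates, the sketch below runs them on a made-up frame. `duplicated(keep=False)` flags every copy of a repeated row, while `drop_duplicates()` keeps the first copy of each. Rows whose missing values sit in the same positions are also treated as duplicates of each other.

```python
import pandas as pd

# Toy frame; only the shape mirrors the dataset.
toy = pd.DataFrame({
    "date": ["Jun 15, 2008", "Jun 15, 2008", "Jun 8, 1988",
             "Dec 31, 1923", "Dec 31, 1923"],
    "registration": ["B-3841", "B-3841", "61-2373", None, None],
})

# keep=False marks all copies, which is handy for inspection.
all_copies = toy[toy.duplicated(keep=False)]

# drop_duplicates keeps the first occurrence of each repeated row.
deduped = toy.drop_duplicates()

print(all_copies.index.tolist())
print(deduped.index.tolist())
```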

In [10]:
# Drop columns with more than 5% null values
columns = null_values_ratios[null_values_ratios > 0.05].index
df = df.drop(columns=columns)
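
The threshold rule can be tried on a toy frame. This is a minimal sketch: the frame and the `threshold` value of 0.5 are hypothetical and chosen so the effect is visible on four rows. The notebook itself uses 0.05.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "date": ["a", "b", "c", "d"],
    "flight_number": [np.nan, np.nan, np.nan, "AB12"],  # 75% null
    "site": ["City", None, "Desert", "City"],            # 25% null
})

# Share of null values per column (same as isnull().sum() / len(toy)).
null_ratios = toy.isna().mean()

threshold = 0.5  # hypothetical value for this toy example
sparse_columns = null_ratios[null_ratios > threshold].index
trimmed = toy.drop(columns=sparse_columns)

print(list(trimmed.columns))
```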

In [11]:
# Drop registration column as it is a unique identifier
df = df.drop(columns='registration')

In [12]:
# Drop rows with a null in any column other than survivors and other_fatalities
subset = df.columns[~df.columns.isin(['survivors', 'other_fatalities'])]
df = df.dropna(subset=subset)
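
`dropna` defaults to `how='any'`, so with `subset` it drops a row as soon as one of the listed columns is null. The toy frame below (invented values) contrasts this with `how='all'`, which only drops rows where every listed column is null.

```python
import pandas as pd

toy = pd.DataFrame({
    "operator": ["A", None, "C"],
    "country": ["X", "Y", "Z"],
    "survivors": [None, "Yes", None],
})

# Every column except survivors must be present.
required = toy.columns[~toy.columns.isin(["survivors"])]

any_null = toy.dropna(subset=required)             # default how='any'
all_null = toy.dropna(subset=required, how="all")  # stricter condition to drop

print(any_null.index.tolist())
print(all_null.index.tolist())
```

Row 1 is removed only in the first call, because `operator` is null but `country` is not.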

### Impute missing values

In [13]:
# Impute survivors from the on-board and fatality counts
# (recomputed for every row, kept as 'Yes'/'No' for now)
survivors = (df['crew_on_board'] + df['pax_on_board']
             - df['crew_fatalities'] - df['pax_fatalities']) > 0
df['survivors'] = np.where(survivors, 'Yes', 'No')
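
A toy run of the same rule makes its edge cases visible. The values below are made up. Note that a row with nobody recorded on board ends up as `'No'`, since the difference is 0 rather than positive.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "crew_on_board": [2.0, 1.0, 0.0],
    "crew_fatalities": [0.0, 1.0, 0.0],
    "pax_on_board": [10.0, 0.0, 0.0],
    "pax_fatalities": [10.0, 0.0, 0.0],
})

on_board = toy["crew_on_board"] + toy["pax_on_board"]
fatalities = toy["crew_fatalities"] + toy["pax_fatalities"]

# At least one occupant not counted as a fatality means there were survivors.
toy["survivors"] = np.where(on_board - fatalities > 0, "Yes", "No")

print(toy["survivors"].tolist())
```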

In [14]:
# Impute missing other_fatalities with 0
df['other_fatalities'] = df['other_fatalities'].fillna(0)

In [15]:
# Assert there are no more null values
assert df.isna().sum().sum() == 0

### Replace values

In [16]:
# Only keep the first part of location (the text before the first ', ')
df['location'] = df['location'].str.split(', ').str[0]
df.sample(5)

Unnamed: 0,date,aircraft_type,operator,flight_phase,flight_type,survivors,site,location,country,region,crew_on_board,crew_fatalities,pax_on_board,pax_fatalities,other_fatalities,total_fatalities
11780,"Jun 17, 1975 at 0833 LT",Sud-Aviation SE-210 Caravelle,Indian Airlines,Landing (descent or approach),Scheduled Revenue Flight,Yes,Airport (less than 10 km from airport),Mumbai-Chhatrapati Shivaji (Santa Cruz),India,Asia,6.0,0.0,87.0,0.0,0.0,0
25612,"Apr 15, 1942",Simmonds Spartan,Private Papuan,Flight,Humanitarian,No,"Plain, Valley",Goroka,Papua New Guinea,Oceania,0.0,0.0,0.0,0.0,0.0,0
692,"Mar 22, 2019",Rockwell Sabreliner 60,Private American,Landing (descent or approach),Illegal (smuggling),No,"Plain, Valley",Bajamar,Honduras,Central America,0.0,0.0,0.0,0.0,0.0,0
8570,"Dec 29, 1984 at 1315 LT",Piper PA-31-310 Navajo,Ready Air,Flight,Positioning,Yes,"Lake, Sea, Ocean, River",Port-de-Paix,Haiti,Central America,2.0,0.0,0.0,0.0,0.0,0
22528,"Jun 18, 1944",Douglas C-47 Skytrain (DC-3),China National Aviation Corporation - CNAC,Flight,Scheduled Revenue Flight,Yes,Mountains,Guilin,China,Asia,3.0,0.0,11.0,1.0,0.0,1


### Convert columns

In [17]:
# Convert date column to datetime: try the format with local time first,
# then fall back to the date-only format
df['date'] = (
    pd.to_datetime(df['date'], format='%b %d, %Y at %H%M LT', errors='coerce')
    .fillna(pd.to_datetime(df['date'], format='%b %d, %Y', errors='coerce'))
)
assert df['date'].isna().sum() == 0
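
The two-format parse can be checked in isolation. In this sketch the first two strings follow the formats seen in the raw data, and the third is a deliberately invalid value added for illustration. With `errors='coerce'`, a string that does not match a format becomes `NaT`, so `fillna` can take the result of the second format.

```python
import pandas as pd

raw = pd.Series(["Mar 13, 2025 at 0733 LT", "Mar 7, 2025", "not a date"])

# First pass: dates that carry a local time.
with_time = pd.to_datetime(raw, format="%b %d, %Y at %H%M LT", errors="coerce")

# Second pass: date-only strings.
date_only = pd.to_datetime(raw, format="%b %d, %Y", errors="coerce")

# Prefer the richer parse and fall back to the date-only one.
parsed = with_time.fillna(date_only)
print(parsed)
```

A value that matches neither format stays `NaT`, which is what the assertion on the real column guards against.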

In [18]:
# Convert flight_phase, flight_type, site and region to category
category_columns = ['flight_phase', 'flight_type', 'site', 'region']
df[category_columns] = df[category_columns].astype('category')

In [19]:
# Convert survivors to boolean
df['survivors'] = df['survivors'] == 'Yes'

In [20]:
# Convert float columns to integer
columns = df.select_dtypes(include=[float]).columns
df[columns] = df[columns].astype('int')

### Export data

In [21]:
# Sort data from the earliest to the latest crash
df = df.sort_values(by='date')

In [22]:
# Reset index
df = df.reset_index(drop=True)

In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34956 entries, 0 to 34955
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   date              34956 non-null  datetime64[ns]
 1   aircraft_type     34956 non-null  object        
 2   operator          34956 non-null  object        
 3   flight_phase      34956 non-null  category      
 4   flight_type       34956 non-null  category      
 5   survivors         34956 non-null  bool          
 6   site              34956 non-null  category      
 7   location          34956 non-null  object        
 8   country           34956 non-null  object        
 9   region            34956 non-null  category      
 10  crew_on_board     34956 non-null  int64         
 11  crew_fatalities   34956 non-null  int64         
 12  pax_on_board      34956 non-null  int64         
 13  pax_fatalities    34956 non-null  int64         
 14  other_fatalities  3495

In [24]:
# Serialize data to pickle
with open('data/crashes_cleaned_data.pkl', 'wb') as handle:
    pickle.dump(df, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [25]:
# Export data to CSV
df.to_csv('data/crashes_cleaned_data.csv', index=False)
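
One practical difference between the two exports is dtype preservation. The sketch below uses a made-up frame and in-memory buffers instead of files. It suggests that a pickle round trip keeps the `datetime64` and `category` dtypes, while a plain `read_csv` does not unless they are requested explicitly.

```python
import io
import pickle

import pandas as pd

# Toy frame with the kinds of dtypes produced by the cleaning steps.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2001-01-01", "2002-02-02"]),
    "region": pd.Categorical(["Europe", "Asia"]),
    "survivors": [True, False],
})

# Pickle round trip: dtypes come back unchanged.
from_pickle = pickle.loads(pickle.dumps(toy, protocol=pickle.HIGHEST_PROTOCOL))

# CSV round trip: dates and categories come back as plain strings.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
from_csv = pd.read_csv(buffer)

# The dtypes can be restored by asking for them when reading.
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["date"], dtype={"region": "category"})

print(from_pickle.dtypes.equals(toy.dtypes))
print(from_csv.dtypes)
print(restored.dtypes)
```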

## End