# Aircraft Crashes Data Collection And Cleaning

## Overview

This notebook collects and cleans data on aircraft accidents since 1918 to prepare it for analysis.

### About dataset

**Crash**

`date` date and local time of the accident<br>
`aircraft_type` aircraft make and model<br>
`operator` operator of the aircraft<br>
`registration` registration code that uniquely identifies a single aircraft<br>
`flight_phase` phase of the flight when the accident occurred<br>
`flight_type` type of flight<br>
`survivors` indicates whether there were survivors<br>
`site` type of location where the accident happened (ex: mountains)<br>
`schedule` planned route of the flight<br>
`msn` manufacturer's serial number of the aircraft<br>
`yom` year of manufacture of the aircraft<br>
`flight_number` flight number<br>
`location` location of the accident<br>
`country` country where the crash happened<br>
`region` region of the world where the crash happened<br>
`crew_on_board` number of crew members on board at the time of the accident<br>
`crew_fatalities` number of crew members who died in the crash<br>
`pax_on_board` number of passengers on board at the time of the accident<br>
`pax_fatalities` number of passengers who died in the crash<br>
`other_fatalities` other victims of the accident outside of the aircraft<br>
`total_fatalities` total number of deaths<br>
`captain_flying_hours` number of flying hours of the captain<br>
`captain_flying_hours_on_type` number of hours the captain flew on the type of aircraft involved in the crash<br>
`copilot_flying_hours` number of flying hours of the copilot<br>
`copilot_flying_hours_on_type` number of hours the copilot flew on the type of aircraft involved in the crash<br>
`aircraft_flying_hours` number of flying hours of the plane involved in the crash<br>
`aircraft_flight_cycles` number of flights of the aircraft<br><br>

## Data Collection

In [112]:
from bs4 import BeautifulSoup
from datetime import datetime
import math
import numpy as np
import pandas as pd
import re
import requests
from urllib.parse import unquote
from thefuzz import fuzz, process

---

## Data Exploration

In [113]:
df = pd.read_csv('data/crashes_scraped_data.csv')
df.head()

Unnamed: 0,date,aircraft_type,operator,registration,flight_phase,flight_type,survivors,site,schedule,msn,...,pax_on_board,pax_fatalities,other_fatalities,total_fatalities,captain_flying_hours,captain_flying_hours_on_type,copilot_flying_hours,copilot_flying_hours_on_type,aircraft_flying_hours,aircraft_flight_cycles
0,"Mar 13, 2025 at 0733 LT",Cessna 525 CitationJet CJ2,LBL 525 CZ LLC,N525CZ,Takeoff (climb),Private,No,"Plain, Valley",Mesquite - Addison,525A-0380,...,0.0,0.0,0.0,1,,,,,,
1,"Mar 7, 2025",Antonov AN-32,Indian Air Force - Bharatiya Vayu Sena,,Landing (descent or approach),Military,Yes,Airport (less than 10 km from airport),,,...,0.0,0.0,0.0,0,,,,,,
2,"Mar 4, 2025 at 0954 LT",BAe Jetstream 31,SAETA Perú (Servicios Aéreos Tarapota),OB-2178,Landing (descent or approach),Scheduled Revenue Flight,Yes,Airport (less than 10 km from airport),Iquitos - Güeppí,861,...,11.0,0.0,0.0,0,,,,,,
3,"Feb 25, 2025",Antonov AN-26,Sudanese Air Force - Al Quwwat al-Jawwiya As-S...,,Takeoff (climb),Military,No,City,,,...,13.0,13.0,29.0,46,,,,,,
4,"Feb 23, 2025",Ilyushin II-76,Sudanese Air Force - Al Quwwat al-Jawwiya As-S...,1106,Flight,Military,No,Desert,,10234 08265,...,0.0,0.0,0.0,7,,,,,,


In [114]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36086 entries, 0 to 36085
Data columns (total 27 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   date                          36086 non-null  object 
 1   aircraft_type                 36086 non-null  object 
 2   operator                      36084 non-null  object 
 3   registration                  34899 non-null  object 
 4   flight_phase                  35475 non-null  object 
 5   flight_type                   36029 non-null  object 
 6   survivors                     34810 non-null  object 
 7   site                          35719 non-null  object 
 8   schedule                      25712 non-null  object 
 9   msn                           28064 non-null  object 
 10  yom                           26336 non-null  float64
 11  flight_number                 2895 non-null   object 
 12  location                      36075 non-null  object 
 13  c

In [115]:
df.isnull().sum()

date                                0
aircraft_type                       0
operator                            2
registration                     1187
flight_phase                      611
flight_type                        57
survivors                        1276
site                              367
schedule                        10374
msn                              8022
yom                              9750
flight_number                   33191
location                           11
country                             3
region                              2
crew_on_board                      22
crew_fatalities                     1
pax_on_board                       50
pax_fatalities                      4
other_fatalities                   16
total_fatalities                    0
captain_flying_hours            29206
captain_flying_hours_on_type    30241
copilot_flying_hours            33855
copilot_flying_hours_on_type    34096
aircraft_flying_hours           30383
aircraft_fli

In [116]:
# Get proportion of rows with null values
nan_rows = df[df.isna().any(axis=1)]

f'{len(nan_rows) / len(df):0%}'

'98.741894%'
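
The row-level figure shows that almost every row has at least one missing value, but not which columns are responsible. A minimal sketch of ranking columns by their share of missing values follows; the frame `toy` and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame standing in for the scraped data (values are made up)
toy = pd.DataFrame({
    'date': ['Mar 7, 2025', 'Mar 4, 2025', 'Feb 25, 2025', 'Feb 23, 2025'],
    'registration': ['N525CZ', None, 'OB-2178', '1106'],
    'flight_number': [None, None, None, 'AB123'],
})

# isna() gives a boolean frame; the column-wise mean is the share of nulls
null_share = toy.isna().mean().sort_values(ascending=False)

# Share of rows with at least one null (the quantity computed row-wise above)
row_share = toy.isna().any(axis=1).mean()

print(null_share.map('{:.1%}'.format))
print(f'{row_share:.1%}')
```

Sorting the per-column shares makes it easier to see whether the missing values are concentrated in a few sparse columns or spread across the frame.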

In [117]:
# Check for duplicates
df[df.duplicated(keep=False)]

Unnamed: 0,date,aircraft_type,operator,registration,flight_phase,flight_type,survivors,site,schedule,msn,...,pax_on_board,pax_fatalities,other_fatalities,total_fatalities,captain_flying_hours,captain_flying_hours_on_type,copilot_flying_hours,copilot_flying_hours_on_type,aircraft_flying_hours,aircraft_flight_cycles
2499,"Jun 15, 2008",Harbin Yunsunji Y-12,China Flying Dragon Special Aviation Company,B-3841,Flight,Geographical / Geophysical / Scientific,Yes,Mountains,,0061,...,2.0,2.0,0.0,3,,,,,,
2500,"Jun 15, 2008",Harbin Yunsunji Y-12,China Flying Dragon Special Aviation Company,B-3841,Flight,Geographical / Geophysical / Scientific,Yes,Mountains,,0061,...,2.0,2.0,0.0,3,,,,,,
7539,"Jun 8, 1988",Lockheed C-130 Hercules,United States Air Force - USAF (since 1947),61-2373,Landing (descent or approach),Training,No,Airport (less than 10 km from airport),Little Rock - Greenville,3720,...,0.0,0.0,0.0,6,,,,,,
7540,"Jun 8, 1988",Lockheed C-130 Hercules,United States Air Force - USAF (since 1947),61-2373,Landing (descent or approach),Training,No,Airport (less than 10 km from airport),Little Rock - Greenville,3720,...,0.0,0.0,0.0,6,,,,,,
7659,"Dec 28, 1987",PZL-Mielec AN-2,Aeroflot - Russian International Airlines,CCCP-02531,Takeoff (climb),Scheduled Revenue Flight,Yes,"Plain, Valley",,1G121-15,...,0.0,0.0,0.0,0,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33820,"Sep 30, 1933",Avro 594 Avian,Holden's Air Transport Services,VH-UIV,Landing (descent or approach),Cargo,Yes,Airport (less than 10 km from airport),Salamaua – Bulolo,193,...,1.0,0.0,0.0,0,,,,,,
34999,"Oct 18, 1928",Douglas M-3,National Air Transport - USA,NC1064,Flight,Postal (mail),No,Mountains,Cleveland – New York,658,...,0.0,0.0,0.0,1,,,,,,
35000,"Oct 18, 1928",Douglas M-3,National Air Transport - USA,NC1064,Flight,Postal (mail),No,Mountains,Cleveland – New York,658,...,0.0,0.0,0.0,1,,,,,,
35539,"Dec 31, 1923",Loening 23 Air Yacht,New York-Newport Air Service,,,Scheduled Revenue Flight,Yes,"Lake, Sea, Ocean, River",,,...,0.0,0.0,0.0,0,,,,,,
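
`duplicated(keep=False)` flags every copy of a repeated row, which is why most duplicates appear in adjacent pairs in the listing, while `drop_duplicates()` keeps the first copy of each. A small sketch of the difference, on made-up rows (the frame `toy` is illustrative only):

```python
import pandas as pd

# Two identical rows followed by one unique row (illustrative values)
toy = pd.DataFrame({
    'date': ['Jun 15, 2008', 'Jun 15, 2008', 'Jun 8, 1988'],
    'registration': ['B-3841', 'B-3841', '61-2373'],
})

# keep=False marks all members of a duplicate group
all_copies = toy[toy.duplicated(keep=False)]

# The default keep='first' marks only the later copies
extra_copies = toy[toy.duplicated()]

# drop_duplicates() removes exactly the rows marked by the default
deduped = toy.drop_duplicates()

print(len(all_copies), len(extra_copies), len(deduped))
```

The number of rows removed by `drop_duplicates()` therefore equals `df.duplicated().sum()`, not the length of the `keep=False` listing.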


In [118]:
# Check number of unique values
df.nunique()

date                            28447
aircraft_type                    1176
operator                         9365
registration                    34040
flight_phase                        5
flight_type                        31
survivors                           2
site                                6
schedule                        16829
msn                             20551
yom                               151
flight_number                    2814
location                        17272
country                           219
region                              9
crew_on_board                      31
crew_fatalities                    25
pax_on_board                      255
pax_fatalities                    187
other_fatalities                   47
total_fatalities                  202
captain_flying_hours             4132
captain_flying_hours_on_type     2129
copilot_flying_hours             1709
copilot_flying_hours_on_type     1071
aircraft_flying_hours            4939
aircraft_fli

---

## Data Cleaning

In [119]:
# Remove duplicates
df = df.drop_duplicates()

In [120]:
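# Convert date column to datetime: try the "date at HHMM LT" format first,
# then fall back to the date-only format for rows without a time
with_time = pd.to_datetime(df['date'], format='%b %d, %Y at %H%M LT', errors='coerce')
date_only = pd.to_datetime(df['date'], format='%b %d, %Y', errors='coerce')
df['date'] = with_time.fillna(date_only)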
df.sample(5)

Unnamed: 0,date,aircraft_type,operator,registration,flight_phase,flight_type,survivors,site,schedule,msn,...,pax_on_board,pax_fatalities,other_fatalities,total_fatalities,captain_flying_hours,captain_flying_hours_on_type,copilot_flying_hours,copilot_flying_hours_on_type,aircraft_flying_hours,aircraft_flight_cycles
34997,1928-10-23 00:00:00,Ryan B-1 Brougham,George Peck,,Flight,Private,No,Mountains,,,...,3.0,3.0,0.0,4,,,,,,
18952,1951-10-28 00:00:00,Short S.45 Solent,Trans Oceanic Airways - TOA,VH-TOC,Takeoff (climb),Positioning,Yes,"Lake, Sea, Ocean, River",Brisbane – Port Moresby,S.1308,...,0.0,0.0,0.0,0,,,,,,
11405,1976-09-17 09:21:00,Douglas C-47 Skytrain (DC-3),Government of Madhya Pradesh,VT-AXC,Takeoff (climb),Government,Yes,Airport (less than 10 km from airport),Bhopal - Indore,20303,...,8.0,0.0,0.0,0,,,,,,
1834,2011-12-20 17:13:00,Boeing 737-300,Sriwijaya Air,PK-CKM,Landing (descent or approach),Scheduled Revenue Flight,Yes,Airport (less than 10 km from airport),Jakarta - Yogyakarta,28333/2810,...,131.0,0.0,0.0,0,29801.0,,562.0,,31281.0,21591.0
28645,1941-06-19 00:00:00,Douglas DC-3,Liniile Aeriene Române Exploatate cu Statul - ...,YR-PAF,Takeoff (climb),Scheduled Revenue Flight,Yes,Airport (less than 10 km from airport),Bucharest – Sofia,1986,...,15.0,0.0,0.0,0,,,,,,
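
Because `errors='coerce'` silently turns unparseable strings into `NaT`, it can be worth checking what neither format matched. A minimal sketch of the two-format fallback on a few sample strings follows; the series `raw` is made up, and the `unparsed` check is an addition rather than part of the notebook. It assumes an English locale for `%b`.

```python
import pandas as pd

# Sample strings: one with a local time, one date-only, one invalid
raw = pd.Series(['Mar 13, 2025 at 0733 LT', 'Mar 7, 2025', 'not a date'])

# Each pass returns NaT for strings that do not match its format
with_time = pd.to_datetime(raw, format='%b %d, %Y at %H%M LT', errors='coerce')
date_only = pd.to_datetime(raw, format='%b %d, %Y', errors='coerce')

# Prefer the timestamp with a time, fall back to the date-only parse
parsed = with_time.fillna(date_only)

# Strings that matched neither format
unparsed = raw[parsed.isna()]

print(parsed)
print(unparsed.tolist())
```

Rows parsed by the date-only format get a time of 00:00:00, so midnight in the `date` column can mean "time unknown" rather than an accident at midnight.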


In [121]:
# Get proportion of rows where flight_phase or flight_type is null
mask = df['flight_phase'].isna() | df['flight_type'].isna()
f'{len(df[mask]) / len(df):0%}'

'1.786164%'

In [122]:
# Drop rows where flight_phase or flight_type is null
df = df.dropna(subset=['flight_phase', 'flight_type']).reset_index(drop=True)

In [123]:
# Convert 2 columns to category
df[['flight_phase', 'flight_type']] = df[['flight_phase', 'flight_type']].astype('category')

In [124]:
# Recompute the survivors column from the on-board and fatality counts and
# store it as boolean (this overwrites the scraped Yes/No values)
survivors_mask = df['crew_on_board'] + df['pax_on_board'] - df['crew_fatalities'] - df['pax_fatalities'] > 0
df['survivors'] = survivors_mask
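
One caveat of this comparison: if any of the four counts is missing, the arithmetic yields `NaN`, and `NaN > 0` evaluates to `False`, so such rows are recorded as having no survivors. A hedged sketch of an alternative that keeps those rows as missing, using pandas' nullable `boolean` dtype, is shown below. The frame `toy` and the name `survivors_nullable` are illustrative; this is not what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy counts: survivors, no survivors, and a row with a missing crew count
toy = pd.DataFrame({
    'crew_on_board':   [2.0, 2.0, np.nan],
    'pax_on_board':    [3.0, 0.0, 0.0],
    'crew_fatalities': [0.0, 2.0, 0.0],
    'pax_fatalities':  [1.0, 0.0, 0.0],
})

on_board = toy['crew_on_board'] + toy['pax_on_board']
fatalities = toy['crew_fatalities'] + toy['pax_fatalities']
alive = on_board - fatalities

# Plain comparison: a NaN difference compares as False
plain = alive > 0

# Nullable variant: rows with missing counts become <NA> instead of False
survivors_nullable = (alive > 0).astype('boolean').mask(alive.isna())

print(plain.tolist())
print(survivors_nullable.tolist())
```

Whether to keep `<NA>` or fall back to the scraped value for such rows is a modelling choice; the sketch only makes the missing cases visible.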

In [125]:
# Get proportion of rows where site is null
f"{len(df[df['site'].isna()]) / len(df):0%}"

'0.347222%'

In [126]:
# Drop rows where site is null
df = df.dropna(subset='site').reset_index(drop=True)

In [127]:
# Convert column to category
df['site'] = df['site'].astype('category')

In [128]:
# Get proportion of rows where schedule is null
f"{len(df[df['schedule'].isna()]) / len(df):0%}"

'27.166276%'

In [129]:
# Drop schedule column
df = df.drop('schedule', axis=1)

In [130]:
# Get proportion of rows where yom is null
f"{len(df[df['yom'].isna()]) / len(df):0%}"

'26.894956%'

In [131]:
# Drop yom column
df = df.drop('yom', axis=1)

In [132]:
# Get proportion of rows where flight_number is null
f"{len(df[df['flight_number'].isna()]) / len(df):0%}"

'91.740447%'

In [133]:
# Drop flight_number column
df = df.drop('flight_number', axis=1)

In [153]:
# Only keep first part in location column
df['location'] = df['location'].str.split(', ', expand=True)[0]
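
To illustrate what this step keeps, here is a small sketch on made-up location strings (the series `loc` is illustrative only). It also shows an equivalent form that uses `.str[0]` instead of `expand=True`, which avoids building a column for every comma-separated part.

```python
import numpy as np
import pandas as pd

# Illustrative values: a two-part location, a single-part one, and a null
loc = pd.Series(['Springfield, Illinois', 'Lake Example', np.nan])

# Form used above: expand into columns, then take the first column
first_expand = loc.str.split(', ', expand=True)[0]

# Equivalent form: split once and take the first list element
first_str = loc.str.split(', ', n=1).str[0]

print(first_expand.tolist())
print(first_str.tolist())
```

In both forms a null location stays null, which is why the null check that follows is still needed.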

In [155]:
# Get proportion of rows where location is null
f"{len(df[df['location'].isna()]) / len(df):0%}"

'0.031416%'

In [156]:
# Drop rows where location is null
df = df.dropna(subset='location').reset_index(drop=True)

In [157]:
df.head()

Unnamed: 0,date,aircraft_type,operator,registration,flight_phase,flight_type,survivors,site,msn,location,...,pax_on_board,pax_fatalities,other_fatalities,total_fatalities,captain_flying_hours,captain_flying_hours_on_type,copilot_flying_hours,copilot_flying_hours_on_type,aircraft_flying_hours,aircraft_flight_cycles
0,2025-03-13 07:33:00,Cessna 525 CitationJet CJ2,LBL 525 CZ LLC,N525CZ,Takeoff (climb),Private,False,"Plain, Valley",525A-0380,Mesquite Metro,...,0.0,0.0,0.0,1,,,,,,
1,2025-03-07 00:00:00,Antonov AN-32,Indian Air Force - Bharatiya Vayu Sena,,Landing (descent or approach),Military,False,Airport (less than 10 km from airport),,Bagdogra,...,0.0,0.0,0.0,0,,,,,,
2,2025-03-04 09:54:00,BAe Jetstream 31,SAETA Perú (Servicios Aéreos Tarapota),OB-2178,Landing (descent or approach),Scheduled Revenue Flight,True,Airport (less than 10 km from airport),861,Güeppí,...,11.0,0.0,0.0,0,,,,,,
3,2025-02-25 00:00:00,Antonov AN-26,Sudanese Air Force - Al Quwwat al-Jawwiya As-S...,,Takeoff (climb),Military,False,City,,Wadi Seidna AFB,...,13.0,13.0,29.0,46,,,,,,
4,2025-02-23 00:00:00,Ilyushin II-76,Sudanese Air Force - Al Quwwat al-Jawwiya As-S...,1106,Flight,Military,False,Desert,10234 08265,Nyala,...,0.0,0.0,0.0,7,,,,,,


In [158]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35003 entries, 0 to 35002
Data columns (total 24 columns):
 #   Column                        Non-Null Count  Dtype         
---  ------                        --------------  -----         
 0   date                          35003 non-null  datetime64[ns]
 1   aircraft_type                 35003 non-null  object        
 2   operator                      35001 non-null  object        
 3   registration                  33876 non-null  object        
 4   flight_phase                  35003 non-null  category      
 5   flight_type                   35003 non-null  category      
 6   survivors                     35003 non-null  bool          
 7   site                          35003 non-null  category      
 8   msn                           27238 non-null  object        
 9   location                      35003 non-null  object        
 10  country                       35003 non-null  object        
 11  region                      

In [137]:
df.dtypes

date                            datetime64[ns]
aircraft_type                           object
operator                                object
registration                            object
flight_phase                          category
flight_type                           category
survivors                                 bool
site                                  category
msn                                     object
location                                object
country                                 object
region                                  object
crew_on_board                          float64
crew_fatalities                        float64
pax_on_board                           float64
pax_fatalities                         float64
other_fatalities                       float64
total_fatalities                         int64
captain_flying_hours                   float64
captain_flying_hours_on_type           float64
copilot_flying_hours                   float64
copilot_flyin

## End