# Aircraft Crashes Data Collection And Cleaning

## Overview

This notebook collects and cleans the data used to analyze aircraft accidents recorded since 1918.

### About dataset

**Crash**

`date` date and local time of the accident<br>
`aircraft_type` aircraft make and model<br>
`operator` operator of the aircraft<br>
`registration` registration code that uniquely identifies the aircraft<br>
`flight_phase` phase of the flight when the accident occurred<br>
`flight_type` type of flight<br>
`survivors` indicates whether there were survivors<br>
`site` type of location where the accident happened (ex: mountains)<br>
`schedule` planned route of the flight<br>
`msn` manufacturer's serial number of the aircraft<br>
`yom` year of manufacture of the aircraft<br>
`flight_number` flight number<br>
`location` location of the accident<br>
`country` country where the crash happened<br>
`region` region of the world where the crash happened<br>
`crew_on_board` number of crew members on board at the time of the accident<br>
`crew_fatalities` number of crew members who died in the crash<br>
`pax_on_board` number of passengers on board at the time of the accident<br>
`pax_fatalities` number of passengers who died in the crash<br>
`other_fatalities` other victims of the accident outside of the aircraft<br>
`total_fatalities` total number of deaths<br>
`captain_flying_hours` number of flying hours of the captain<br>
`captain_flying_hours_on_type` number of hours the captain flew on the type of aircraft involved in the crash<br>
`copilot_flying_hours` number of flying hours of the copilot<br>
`copilot_flying_hours_on_type` number of hours the copilot flew on the type of aircraft involved in the crash<br>
`aircraft_flying_hours` number of flying hours of the plane involved in the crash<br>
`aircraft_flight_cycles` number of flights of the aircraft
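
The fatality columns described above suggest a simple consistency check. Assuming `total_fatalities` is meant to be the sum of crew, passenger, and other fatalities (the source does not state this explicitly), rows where the parts do not add up are worth inspecting. A minimal sketch on a small hand-made frame; `fatality_mismatches` is a hypothetical helper name, not part of the notebook:

```python
import pandas as pd

def fatality_mismatches(df: pd.DataFrame) -> pd.DataFrame:
    """Return rows where the fatality parts do not sum to total_fatalities.

    Assumes total_fatalities = crew + pax + other. Rows with a missing
    part are skipped because the sum cannot be verified.
    """
    parts = ['crew_fatalities', 'pax_fatalities', 'other_fatalities']
    complete = df.dropna(subset=parts)
    mismatch = complete[parts].sum(axis=1) != complete['total_fatalities']
    return complete[mismatch]

# Toy data, made up for illustration (not taken from the dataset)
toy = pd.DataFrame({
    'crew_fatalities': [4.0, 1.0, None],
    'pax_fatalities': [13.0, 0.0, 2.0],
    'other_fatalities': [29.0, 0.0, 0.0],
    'total_fatalities': [46, 5, 2],
})
bad_rows = fatality_mismatches(toy)
print(bad_rows)
```

On the real data this could be called as `fatality_mismatches(crashes_df)` once the frame is loaded. Whether mismatches indicate scraping errors or a different definition of the total would need to be checked against the source.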


**Aircraft**

`make` manufacturer<br>
`model` model (name)<br>
`body` body type<br>
`wing` type of wing<br>
`position` wing position<br>
`tail` tail configuration<br>
`engine` type of engine<br>
`engine_count` number of engines<br>
`wing_span` distance from one wingtip to the opposite, in meters<br>
`length` length in meters<br>
`height` height in meters<br>
`manufactured_as` other names of the aircraft

## Data Collection

In [376]:
from bs4 import BeautifulSoup
from datetime import datetime
import math
import pandas as pd
import re
import requests
from urllib.parse import unquote
from thefuzz import fuzz, process

---

## Data Exploration

### Crashes

In [366]:
crashes_df = pd.read_csv('data/crashes_scraped_data.csv')
crashes_df.head()

| | date | aircraft_type | operator | registration | flight_phase | flight_type | survivors | site | schedule | msn | ... | pax_on_board | pax_fatalities | other_fatalities | total_fatalities | captain_flying_hours | captain_flying_hours_on_type | copilot_flying_hours | copilot_flying_hours_on_type | aircraft_flying_hours | aircraft_flight_cycles |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mar 13, 2025 at 0733 LT | Cessna 525 CitationJet CJ2 | LBL 525 CZ LLC | N525CZ | Takeoff (climb) | Private | No | Plain, Valley | Mesquite - Addison | 525A-0380 | ... | 0.0 | 0.0 | 0.0 | 1 | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | Mar 7, 2025 | Antonov AN-32 | Indian Air Force - Bharatiya Vayu Sena | NaN | Landing (descent or approach) | Military | Yes | Airport (less than 10 km from airport) | NaN | NaN | ... | 0.0 | 0.0 | 0.0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | Mar 4, 2025 at 0954 LT | BAe Jetstream 31 | SAETA Perú (Servicios Aéreos Tarapota) | OB-2178 | Landing (descent or approach) | Scheduled Revenue Flight | Yes | Airport (less than 10 km from airport) | Iquitos - Güeppí | 861 | ... | 11.0 | 0.0 | 0.0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | Feb 25, 2025 | Antonov AN-26 | Sudanese Air Force - Al Quwwat al-Jawwiya As-S... | NaN | Takeoff (climb) | Military | No | City | NaN | NaN | ... | 13.0 | 13.0 | 29.0 | 46 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | Feb 23, 2025 | Ilyushin II-76 | Sudanese Air Force - Al Quwwat al-Jawwiya As-S... | 1106 | Flight | Military | No | Desert | NaN | 10234 08265 | ... | 0.0 | 0.0 | 0.0 | 7 | NaN | NaN | NaN | NaN | NaN | NaN |


In [367]:
crashes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36086 entries, 0 to 36085
Data columns (total 27 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   date                          36086 non-null  object 
 1   aircraft_type                 36086 non-null  object 
 2   operator                      36084 non-null  object 
 3   registration                  34899 non-null  object 
 4   flight_phase                  35475 non-null  object 
 5   flight_type                   36029 non-null  object 
 6   survivors                     34810 non-null  object 
 7   site                          35719 non-null  object 
 8   schedule                      25712 non-null  object 
 9   msn                           28064 non-null  object 
 10  yom                           26336 non-null  float64
 11  flight_number                 2895 non-null   object 
 12  location                      36075 non-null  object 
 13  c

In [368]:
crashes_df.isnull().sum()

date                                0
aircraft_type                       0
operator                            2
registration                     1187
flight_phase                      611
flight_type                        57
survivors                        1276
site                              367
schedule                        10374
msn                              8022
yom                              9750
flight_number                   33191
location                           11
country                             3
region                              2
crew_on_board                      22
crew_fatalities                     1
pax_on_board                       50
pax_fatalities                      4
other_fatalities                   16
total_fatalities                    0
captain_flying_hours            29206
captain_flying_hours_on_type    30241
copilot_flying_hours            33855
copilot_flying_hours_on_type    34096
aircraft_flying_hours           30383
aircraft_fli
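
Raw null counts are easier to compare as shares of the total. For instance, `flight_number` is missing in 33,191 of 36,086 rows, or about 92%. A small sketch of that calculation on toy data; `missing_share` is a hypothetical helper name:

```python
import pandas as pd

def missing_share(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, highest first."""
    return (df.isnull().mean() * 100).round(1).sort_values(ascending=False)

# Toy frame, made up for illustration
toy = pd.DataFrame({
    'date': ['Mar 7, 2025', 'Feb 25, 2025', 'Feb 23, 2025', 'Jun 15, 2008'],
    'flight_number': [None, None, None, 'XY123'],
    'schedule': ['A - B', None, None, 'C - D'],
})
shares = missing_share(toy)
print(shares)
```

Applied to `crashes_df`, this would make it easier to decide which columns are too sparse to analyze directly.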

In [369]:
# Check for duplicates
crashes_df[crashes_df.duplicated(keep=False)]

| | date | aircraft_type | operator | registration | flight_phase | flight_type | survivors | site | schedule | msn | ... | pax_on_board | pax_fatalities | other_fatalities | total_fatalities | captain_flying_hours | captain_flying_hours_on_type | copilot_flying_hours | copilot_flying_hours_on_type | aircraft_flying_hours | aircraft_flight_cycles |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2499 | Jun 15, 2008 | Harbin Yunsunji Y-12 | China Flying Dragon Special Aviation Company | B-3841 | Flight | Geographical / Geophysical / Scientific | Yes | Mountains | NaN | 0061 | ... | 2.0 | 2.0 | 0.0 | 3 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2500 | Jun 15, 2008 | Harbin Yunsunji Y-12 | China Flying Dragon Special Aviation Company | B-3841 | Flight | Geographical / Geophysical / Scientific | Yes | Mountains | NaN | 0061 | ... | 2.0 | 2.0 | 0.0 | 3 | NaN | NaN | NaN | NaN | NaN | NaN |
| 7539 | Jun 8, 1988 | Lockheed C-130 Hercules | United States Air Force - USAF (since 1947) | 61-2373 | Landing (descent or approach) | Training | No | Airport (less than 10 km from airport) | Little Rock - Greenville | 3720 | ... | 0.0 | 0.0 | 0.0 | 6 | NaN | NaN | NaN | NaN | NaN | NaN |
| 7540 | Jun 8, 1988 | Lockheed C-130 Hercules | United States Air Force - USAF (since 1947) | 61-2373 | Landing (descent or approach) | Training | No | Airport (less than 10 km from airport) | Little Rock - Greenville | 3720 | ... | 0.0 | 0.0 | 0.0 | 6 | NaN | NaN | NaN | NaN | NaN | NaN |
| 7659 | Dec 28, 1987 | PZL-Mielec AN-2 | Aeroflot - Russian International Airlines | CCCP-02531 | Takeoff (climb) | Scheduled Revenue Flight | Yes | Plain, Valley | NaN | 1G121-15 | ... | 0.0 | 0.0 | 0.0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 33820 | Sep 30, 1933 | Avro 594 Avian | Holden's Air Transport Services | VH-UIV | Landing (descent or approach) | Cargo | Yes | Airport (less than 10 km from airport) | Salamaua – Bulolo | 193 | ... | 1.0 | 0.0 | 0.0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 34999 | Oct 18, 1928 | Douglas M-3 | National Air Transport - USA | NC1064 | Flight | Postal (mail) | No | Mountains | Cleveland – New York | 658 | ... | 0.0 | 0.0 | 0.0 | 1 | NaN | NaN | NaN | NaN | NaN | NaN |
| 35000 | Oct 18, 1928 | Douglas M-3 | National Air Transport - USA | NC1064 | Flight | Postal (mail) | No | Mountains | Cleveland – New York | 658 | ... | 0.0 | 0.0 | 0.0 | 1 | NaN | NaN | NaN | NaN | NaN | NaN |
| 35539 | Dec 31, 1923 | Loening 23 Air Yacht | New York-Newport Air Service | NaN | NaN | Scheduled Revenue Flight | Yes | Lake, Sea, Ocean, River | NaN | NaN | ... | 0.0 | 0.0 | 0.0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |


In [370]:
# Check number of unique values
crashes_df.nunique()

date                            28447
aircraft_type                    1176
operator                         9365
registration                    34040
flight_phase                        5
flight_type                        31
survivors                           2
site                                6
schedule                        16829
msn                             20551
yom                               151
flight_number                    2814
location                        17272
country                           219
region                              9
crew_on_board                      31
crew_fatalities                    25
pax_on_board                      255
pax_fatalities                    187
other_fatalities                   47
total_fatalities                  202
captain_flying_hours             4132
captain_flying_hours_on_type     2129
copilot_flying_hours             1709
copilot_flying_hours_on_type     1071
aircraft_flying_hours            4939
aircraft_fli

### Aircraft

In [348]:
aircrafts_df = pd.read_csv('data/aircrafts_scraped_data.csv').sort_values(by='name').reset_index(drop=True)
aircrafts_df.head()

| | name | make | model | body | position | wing | tail | engine | engine_count | wing_span | length | height |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 328 SUPPORT SERVICES Dornier 328JET | FAIRCHILD DORNIER | 328JET | Narrow | High wing | Fixed Wing | T-tail | Jet | Multi | 20.90 m | 20.90 m | 7.20 m |
| 1 | ABHCO Gazelle | AEROSPATIALE | SA-341 Gazelle | NaN | NaN | Rotary | NaN | Turboshaft | Single | 10.50 m | 11.97 m | 3.15 m |
| 2 | ABHCO SA-342 Gazelle | AEROSPATIALE | SA-341 Gazelle | NaN | NaN | Rotary | NaN | Turboshaft | Single | 10.50 m | 11.97 m | 3.15 m |
| 3 | ADVANCED AIRCRAFT Spirit 750 | CESSNA | P210 (turbine) | Narrow | NaN | Fixed Wing | NaN | Turboprop | Single | 11.20 m | 8.59 m | 2.95 m |
| 4 | ADVANCED AIRCRAFT Turbine P210 | CESSNA | P210 (turbine) | Narrow | NaN | Fixed Wing | NaN | Turboprop | Single | 11.20 m | 8.59 m | 2.95 m |


In [349]:
aircrafts_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4146 entries, 0 to 4145
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   name          4146 non-null   object
 1   make          4146 non-null   object
 2   model         4146 non-null   object
 3   body          3540 non-null   object
 4   position      3184 non-null   object
 5   wing          4140 non-null   object
 6   tail          3175 non-null   object
 7   engine        4139 non-null   object
 8   engine_count  4141 non-null   object
 9   wing_span     4133 non-null   object
 10  length        4133 non-null   object
 11  height        4126 non-null   object
dtypes: object(12)
memory usage: 388.8+ KB


In [350]:
aircrafts_df.isnull().sum()

name              0
make              0
model             0
body            606
position        962
wing              6
tail            971
engine            7
engine_count      5
wing_span        13
length           13
height           20
dtype: int64

In [351]:
# Check for duplicates
aircrafts_df[aircrafts_df.duplicated(keep=False)]

| | name | make | model | body | position | wing | tail | engine | engine_count | wing_span | length | height |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 17 | AERMACCHI MB-339 | AERMACCHI | MB-339 | Narrow | Low wing with wing tip tanks | Fixed Wing | Regular tail, low set | Jet | Single | 10.90 m | 11.00 m | 4.00 m |
| 18 | AERMACCHI MB-339 | AERMACCHI | MB-339 | Narrow | Low wing with wing tip tanks | Fixed Wing | Regular tail, low set | Jet | Single | 10.90 m | 11.00 m | 4.00 m |
| 22 | AERO (1) Commander 560 | AERO (1) | Commander 560 | Narrow | High wing | Fixed Wing | Regular tail, low set | Piston | Multi | 14.90 m | 11.20 m | 4.60 m |
| 23 | AERO (1) Commander 560 | AERO (1) | Commander 560 | Narrow | High wing | Fixed Wing | Regular tail, low set | Piston | Multi | 14.90 m | 11.20 m | 4.60 m |
| 36 | AERO (2) L-159 | AERO (2) | L-159 | Narrow | Low wing with wing tip tanks | Fixed Wing | Regular tail, mid set | Jet | Single | 9.50 m | 12.70 m | 4.80 m |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4141 | ZENAIR Zenith (CH-200) | ZENAIR | Zenith (CH-200) | Narrow | NaN | Fixed Wing | NaN | Piston | Single | 7.00 m | 6.25 m | 2.11 m |
| 4142 | ZENAIR Zenith (CH-2000) | ZENAIR | Zenith (CH-2000) | Narrow | NaN | Fixed Wing | NaN | Piston | Single | 8.79 m | 7.01 m | 2.08 m |
| 4143 | ZENAIR Zenith (CH-2000) | ZENAIR | Zenith (CH-2000) | Narrow | NaN | Fixed Wing | NaN | Piston | Single | 8.79 m | 7.01 m | 2.08 m |
| 4144 | ZENAIR Zenith (CH-250) | ZENAIR | Zenith (CH-250) | Narrow | NaN | Fixed Wing | NaN | Piston | Single | 7.00 m | 6.25 m | 2.11 m |


In [352]:
aircrafts_df.nunique()

name            3761
make             125
model            578
body               2
position          25
wing               3
tail              21
engine             5
engine_count       2
wing_span        319
length           402
height           292
dtype: int64

---

## Data Cleaning

### Aircraft

In [353]:
# Drop duplicates
aircrafts_df = aircrafts_df.drop_duplicates().reset_index(drop=True)

In [354]:
# Convert wing_span, length and height to float
dimensions = ['wing_span', 'length', 'height']

# Strip the trailing ' m' unit as a literal string (not a regex), then cast to float
for column in dimensions:
    aircrafts_df[column] = (
        aircrafts_df[column].str.replace(' m', '', regex=False).astype('float')
    )


aircrafts_df[dimensions].describe()

| | wing_span | length | height |
|---|---|---|---|
| count | 3753.0 | 3753.0 | 3747.0 |
| mean | 17.533642 | 17.315683 | 5.330507 |
| std | 11.182198 | 12.292348 | 3.504684 |
| min | 5.3 | 2.83 | 1.0 |
| 25% | 10.7 | 9.0 | 2.9 |
| 50% | 13.61 | 13.5 | 4.32 |
| 75% | 20.0 | 19.76 | 6.2 |
| max | 88.5 | 84.0 | 24.09 |
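
The cast above assumes every non-null value looks like `'10.50 m'`. A more defensive variant, sketched here under the assumption that some scraped values might be malformed, extracts the number with a regular expression and turns anything unparseable into `NaN` instead of raising. `parse_metres` is a hypothetical helper, not part of the notebook:

```python
import pandas as pd

def parse_metres(values: pd.Series) -> pd.Series:
    """Extract the number from strings such as '10.50 m'.

    Values that do not match '<number> m' become NaN.
    """
    number = values.str.extract(r'^\s*(\d+(?:\.\d+)?)\s*m\s*$', expand=False)
    return pd.to_numeric(number, errors='coerce')

# Toy input, made up for illustration
toy = pd.Series(['10.50 m', '7 m', 'unknown', None])
parsed = parse_metres(toy)
print(parsed)
```

The trade-off is that bad values are silently dropped to `NaN`, so comparing `parsed.isna().sum()` against the original null count is a sensible follow-up.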


### Crashes

In [371]:
# Remove duplicates
crashes_df = crashes_df.drop_duplicates()

In [378]:
# Convert date column to datetime: try the "date at time" format first,
# then fall back to the date-only format for rows without a local time
crashes_df['date'] = pd.to_datetime(crashes_df['date'], format='%b %d, %Y at %H%M LT', errors='coerce') \
    .fillna(pd.to_datetime(crashes_df['date'], format='%b %d, %Y', errors='coerce'))
crashes_df.sample(5)

| | date | aircraft_type | operator | registration | flight_phase | flight_type | survivors | site | schedule | msn | ... | pax_on_board | pax_fatalities | other_fatalities | total_fatalities | captain_flying_hours | captain_flying_hours_on_type | copilot_flying_hours | copilot_flying_hours_on_type | aircraft_flying_hours | aircraft_flight_cycles |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11298 | 1977-01-13 18:14:00 | Tupolev TU-104 | Aeroflot - Russian International Airlines | CCCP-42369 | Landing (descent or approach) | Scheduled Revenue Flight | No | Airport (less than 10 km from airport) | Novossibirsk – Almaty | 8 66 012 03 | ... | 82.0 | 82.0 | 0.0 | 90 | NaN | NaN | NaN | NaN | 27189.0 | 12819.0 |
| 5285 | 1996-02-19 09:04:00 | Douglas DC-9 | Continental Airlines | N10556 | Landing (descent or approach) | Scheduled Revenue Flight | Yes | Airport (less than 10 km from airport) | Washington DC - Houston | 47423 | ... | 82.0 | 0.0 | 0.0 | 0 | 17500.0 | 5000.0 | 2200.0 | 575.0 | 63132.0 | 58913.0 |
| 26233 | 1942-02-16 00:20:00 | Armstrong Whitworth AW.38 Whitley | Royal Air Force - RAF | Z9229 | Flight | Bombing | Yes | Plain, Valley | Leeming - Leeming | 2334 | ... | 0.0 | 0.0 | 0.0 | 4 | NaN | NaN | NaN | NaN | NaN | NaN |
| 23803 | 1943-01-21 00:00:00 | Handley Page H.P.57 Halifax II | Royal Air Force - RAF | DT581 | Flight | Bombing | Yes | Mountains | Snaith - Snaith | NaN | ... | 0.0 | 0.0 | 0.0 | 2 | NaN | NaN | NaN | NaN | NaN | NaN |
| 1460 | 2014-02-02 07:36:00 | Airbus A320 | East Air | EY-623 | Landing (descent or approach) | Scheduled Revenue Flight | Yes | Airport (less than 10 km from airport) | Moscow – Kulob | 428 | ... | 187.0 | 0.0 | 0.0 | 0 | 18321.0 | 509.0 | 2900.0 | 1300.0 | 54604.0 | 23974.0 |
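
One side effect of the two-format parse is that date-only rows end up with a time of `00:00:00`, which cannot be told apart from an accident that really happened at midnight. Row 23803 in the sample has this value, and the output alone does not show which case it is. A possible refinement, sketched here as an assumption rather than part of the original notebook, records whether a time was present before the fallback fills the gaps. The `has_time` column name is hypothetical:

```python
import pandas as pd

# Toy input in the two formats seen in the raw data, plus one bad value
raw = pd.Series(['Mar 13, 2025 at 0733 LT', 'Mar 7, 2025', 'not a date'])

# First pass: rows that carry a local time; everything else becomes NaT
with_time = pd.to_datetime(raw, format='%b %d, %Y at %H%M LT', errors='coerce')
# Second pass: date-only rows
date_only = pd.to_datetime(raw, format='%b %d, %Y', errors='coerce')

parsed = pd.DataFrame({
    'date': with_time.fillna(date_only),
    # True only where the first (date + time) format matched
    'has_time': with_time.notna(),
})
print(parsed)
```

With such a flag, time-of-day analyses could be restricted to rows where `has_time` is `True`.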


In [None]:
crashes_df['aircraft_type'].nunique()

1176

In [385]:
crashes_df.columns

Index(['date', 'aircraft_type', 'operator', 'registration', 'flight_phase',
       'flight_type', 'survivors', 'site', 'schedule', 'msn', 'yom',
       'flight_number', 'location', 'country', 'region', 'crew_on_board',
       'crew_fatalities', 'pax_on_board', 'pax_fatalities', 'other_fatalities',
       'total_fatalities', 'captain_flying_hours',
       'captain_flying_hours_on_type', 'copilot_flying_hours',
       'copilot_flying_hours_on_type', 'aircraft_flying_hours',
       'aircraft_flight_cycles'],
      dtype='object')

In [386]:
crashes_df.dtypes

date                            datetime64[ns]
aircraft_type                           object
operator                                object
registration                            object
flight_phase                            object
flight_type                             object
survivors                               object
site                                    object
schedule                                object
msn                                     object
yom                                    float64
flight_number                           object
location                                object
country                                 object
region                                  object
crew_on_board                          float64
crew_fatalities                        float64
pax_on_board                           float64
pax_fatalities                         float64
other_fatalities                       float64
total_fatalities                         int64
captain_flyin

In [None]:
# Convert survivors to boolean
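
This cell is empty in the source. One way to do the conversion, sketched here as an assumption about the intended approach, maps `'Yes'` and `'No'` to booleans and uses pandas' nullable `boolean` dtype so that missing values (1,276 in the raw data) stay missing rather than silently becoming `False`:

```python
import pandas as pd

# Toy stand-in for crashes_df['survivors'], which holds 'Yes', 'No' or NaN
survivors = pd.Series(['Yes', 'No', None, 'Yes'])

# Map the two labels to booleans; missing or unmapped values become <NA>
survivors_bool = survivors.map({'Yes': True, 'No': False}).astype('boolean')
print(survivors_bool)
```

In the notebook, the same expression would be applied to `crashes_df['survivors']` and assigned back to that column.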


## End