# Aircraft Safety Analysis for Business Expansion

<img src="images/Flight-Safety-Instructions.jpg">

## Introduction

For my Flatiron School Phase 1 Project, I have been given the following prompt:

Your company is expanding into new industries to diversify its portfolio. Specifically, the company is interested in purchasing and operating airplanes for both commercial and private enterprises. However, the potential risks associated with different aircraft types are unknown. As a data analyst, you have been tasked with determining which aircraft are the lowest risk for the company to start this new business endeavor. Your findings will be translated into actionable insights for the head of the new aviation division to help decide which aircraft to purchase.

## Problem Statement

I will aim to identify which aircraft models present the lowest risk in terms of safety. The key questions I'm seeking to answer include:
- Which aircraft models have the lowest average injury scores?
- How do factors such as the number of engines and the purpose of flight impact the safety of the aircraft?
- Are there temporal trends in the occurrence of accidents or incidents?

## Dataset

The dataset contains records of aircraft incidents and accidents from the NTSB up to 2023, including information on the event, location, aircraft characteristics, and injury severity. The dataset has 90,348 entries and 31 columns. Some key columns include:
- `Event.Id`: Unique identifier for each event.
- `Event.Date`: Date of the event.
- `Location`: Location of the event.
- `Country`: Country where the event occurred.
- `Injury.Severity`: Severity of injuries in the event.
- `Aircraft.damage`: Extent of damage to the aircraft.
- `Make` and `Model`: Manufacturer and model of the aircraft.
- `Number.of.Engines`: Number of engines on the aircraft.
- `Purpose.of.flight`: The purpose of the flight during the event.

## Data Cleaning and Exploration

### Data Cleaning
1. **Handling Missing Values**: Identify and handle missing values appropriately, either by imputation or removal.
2. **Data Type Conversion**: Convert columns to appropriate data types (e.g., dates to datetime objects, numerical columns to numeric types).
3. **Feature Engineering**: Create new features such as `Injury.Severity.Score` to quantify injury severity and extract useful information such as `Event.Year` from the `Event.Date`.
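
As a rough illustration of the feature-engineering step, the sketch below derives `Event.Year` and one possible `Injury.Severity.Score` on a toy frame. The weights (3 for fatal, 2 for serious, 1 for minor) and the per-person normalization are assumptions made for this example, not necessarily the scoring used later in the analysis.

```python
import pandas as pd

# Toy rows using the dataset's column names; the values are made up
toy = pd.DataFrame({
    'Event.Date': pd.to_datetime(['1982-01-01', '1999-06-15']),
    'Total.Fatal.Injuries': [1, 0],
    'Total.Serious.Injuries': [0, 1],
    'Total.Minor.Injuries': [1, 0],
    'Total.Uninjured': [2, 3],
})

# Assumed weights: a fatal injury counts 3, a serious one 2, a minor one 1
weights = {'Total.Fatal.Injuries': 3, 'Total.Serious.Injuries': 2, 'Total.Minor.Injuries': 1}
weighted = sum(toy[col] * w for col, w in weights.items())

# People on board = everyone counted in any of the four columns
on_board = toy[list(weights) + ['Total.Uninjured']].sum(axis=1)

# Dividing by on_board.where(on_board > 0) yields NaN instead of a division by zero
toy['Injury.Severity.Score'] = weighted / on_board.where(on_board > 0)
toy['Event.Year'] = toy['Event.Date'].dt.year
```

Normalizing by the number of people on board keeps a large airliner event from dominating the score purely because of its size; whether that is desirable depends on the question being asked.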

### Exploratory Data Analysis (EDA)
1. **Summary Statistics**: Generate summary statistics for numerical and categorical columns.
2. **Visualizations**: Create histograms, bar plots, and scatter plots to understand the distribution and relationships in the data.
   - **Average Injury Score by Model**: Identify aircraft models with the lowest average injury scores.
   - **Average Injury Score by Number of Engines**: Investigate how the number of engines impacts the injury score.
   - **Average Injury Score by Purpose of Flight**: Explore the impact of flight purpose on safety.
   - **Events by Year**: Visualize temporal trends in the number of incidents.

By following this structured approach, we will be able to gain insights into the safety of different aircraft models and provide actionable recommendations to the company's aviation division.

# Data Import & Initial Handling

In [54]:
# Importing necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm

In [55]:
# Reading data from a CSV file
df = pd.read_csv('data/Aviation_Data.csv')

# Preview the data
df.head()



Unnamed: 0,Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Latitude,Longitude,Airport.Code,Airport.Name,...,Purpose.of.flight,Air.carrier,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date
0,20001218X45444,Accident,SEA87LA080,1948-10-24,"MOOSE CREEK, ID",United States,,,,,...,Personal,,2.0,0.0,0.0,0.0,UNK,Cruise,Probable Cause,
1,20001218X45447,Accident,LAX94LA336,1962-07-19,"BRIDGEPORT, CA",United States,,,,,...,Personal,,4.0,0.0,0.0,0.0,UNK,Unknown,Probable Cause,19-09-1996
2,20061025X01555,Accident,NYC07LA005,1974-08-30,"Saltville, VA",United States,36.922223,-81.878056,,,...,Personal,,3.0,,,,IMC,Cruise,Probable Cause,26-02-2007
3,20001218X45448,Accident,LAX96LA321,1977-06-19,"EUREKA, CA",United States,,,,,...,Personal,,2.0,0.0,0.0,0.0,IMC,Cruise,Probable Cause,12-09-2000
4,20041105X01764,Accident,CHI79FA064,1979-08-02,"Canton, OH",United States,,,,,...,Personal,,1.0,2.0,,0.0,VMC,Approach,Probable Cause,16-04-1980


In [56]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90348 entries, 0 to 90347
Data columns (total 31 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                88889 non-null  object 
 1   Investigation.Type      90348 non-null  object 
 2   Accident.Number         88889 non-null  object 
 3   Event.Date              88889 non-null  object 
 4   Location                88837 non-null  object 
 5   Country                 88663 non-null  object 
 6   Latitude                34382 non-null  object 
 7   Longitude               34373 non-null  object 
 8   Airport.Code            50132 non-null  object 
 9   Airport.Name            52704 non-null  object 
 10  Injury.Severity         87889 non-null  object 
 11  Aircraft.damage         85695 non-null  object 
 12  Aircraft.Category       32287 non-null  object 
 13  Registration.Number     87507 non-null  object 
 14  Make                    88826 non-null

# Data Cleaning

As previously stated, there appear to be 90,348 rows and 31 columns. However, all columns except `Investigation.Type` have missing values. Let's figure out what remains and start a plan to handle the missing data.

In [57]:
# I want to start with dropping duplicate rows and then re-check to see where we stand.
df.drop_duplicates(inplace=True)

df = df.reset_index(drop=True)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88958 entries, 0 to 88957
Data columns (total 31 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                88889 non-null  object 
 1   Investigation.Type      88958 non-null  object 
 2   Accident.Number         88889 non-null  object 
 3   Event.Date              88889 non-null  object 
 4   Location                88837 non-null  object 
 5   Country                 88663 non-null  object 
 6   Latitude                34382 non-null  object 
 7   Longitude               34373 non-null  object 
 8   Airport.Code            50132 non-null  object 
 9   Airport.Name            52704 non-null  object 
 10  Injury.Severity         87889 non-null  object 
 11  Aircraft.damage         85695 non-null  object 
 12  Aircraft.Category       32287 non-null  object 
 13  Registration.Number     87507 non-null  object 
 14  Make                    88826 non-null

Okay, we found and removed 1390 duplicated rows. I like to write a function that shows me a similar view to `.info()`, but in percentages of the overall dataset.

In [58]:
def percent_data_missing(df):
    missing_data_summary = df.isnull().sum().sort_values(ascending=False)
    missing_data_percentage = (df.isnull().sum() / len(df) * 100).sort_values(ascending=False)
    
    missing_data_df = pd.DataFrame({'Missing Values': missing_data_summary, 'Percentage': missing_data_percentage})
    print(missing_data_df)
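
A possible variant of this helper, shown only as a sketch, returns the summary instead of printing it so the result can be filtered or plotted later. The name `missing_summary` and the demo frame are made up for the example; `isna().mean()` gives the fraction of missing values per column directly.

```python
import pandas as pd

def missing_summary(df):
    """Return missing-value counts and percentages per column, most-missing first."""
    summary = pd.DataFrame({
        'Missing Values': df.isna().sum(),
        'Percentage': df.isna().mean() * 100,
    })
    return summary.sort_values('Missing Values', ascending=False)

# Small made-up frame to show the output shape
demo = pd.DataFrame({'a': [1, None, None, 4], 'b': [1, 2, 3, None], 'c': [1, 2, 3, 4]})
report = missing_summary(demo)
```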

In [59]:
percent_data_missing(df)

                        Missing Values  Percentage
Schedule                         76376   85.856247
Air.carrier                      72310   81.285550
FAR.Description                  56935   64.002113
Aircraft.Category                56671   63.705344
Longitude                        54585   61.360417
Latitude                         54576   61.350300
Airport.Code                     38826   43.645316
Airport.Name                     36254   40.754064
Broad.phase.of.flight            27234   30.614447
Publication.Date                 15299   17.198004
Total.Serious.Injuries           12579   14.140381
Total.Minor.Injuries             12002   13.491760
Total.Fatal.Injuries             11470   12.893725
Engine.Type                       7165    8.054363
Report.Status                     6453    7.253985
Purpose.of.flight                 6261    7.038153
Number.of.Engines                 6153    6.916747
Total.Uninjured                   5981    6.723398
Weather.Condition              

`Aircraft.Category` is something I'd like to be able to filter on and there are a ton of missing values. I want to see if we can glean that information from matching values in other columns. This is a great case for a function as we can probably use it for filling in other missing values later on.

In [60]:
def fill_nan_column_with_mode(df, column_to_fill, reference_column):
    nan_indices = df[df[column_to_fill].isna()].index.tolist()
    
    # Use tqdm to create a progress bar
    for index in tqdm(nan_indices, desc=f"Filling NaN values in '{column_to_fill}'"):
        row = df.loc[index]
        ref_value = row[reference_column]
        
        # If reference value is NaN, skip this row
        if pd.isna(ref_value):
            continue
        
        # Create a temporary DataFrame with the same reference value
        temp_df = df[df[reference_column] == ref_value]
        
        # Drop rows where the column to fill is NaN
        temp_df = temp_df.dropna(subset=[column_to_fill])
        
        # If temp_df is not empty, calculate the mode
        if not temp_df.empty:
            mode_value = temp_df[column_to_fill].mode()
            
            if not mode_value.empty:
                # Fill the NaN value with the mode
                df.at[index, column_to_fill] = mode_value.iloc[0]
                
    return df
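
The row-by-row loop above re-filters the whole DataFrame for every missing value, which is why the progress bar below reports minutes rather than seconds. A vectorized alternative, sketched here under the assumption that "most frequent value per reference group" is the intended rule, computes one mode per group and maps it back. The function name and the demo data are hypothetical; by my reading of the loop it should produce the same fills, but that has not been checked against the full dataset.

```python
import pandas as pd

def fill_nan_with_group_mode(df, column_to_fill, reference_column):
    """Fill NaNs in column_to_fill with the most frequent value among
    rows that share the same reference_column value."""
    known = df.dropna(subset=[column_to_fill, reference_column])
    # One mode per reference value; Series.mode() returns sorted values,
    # so iloc[0] picks the smallest one when there is a tie
    mode_by_ref = known.groupby(reference_column)[column_to_fill].agg(
        lambda s: s.mode().iloc[0]
    )
    # Rows whose reference value is NaN or has no known mode map to NaN
    fallback = df[reference_column].map(mode_by_ref)
    out = df.copy()
    out[column_to_fill] = out[column_to_fill].fillna(fallback)
    return out

# Made-up demo: 'A' and 'B' have known categories, 'C' has none
demo = pd.DataFrame({
    'Model': ['A', 'A', 'A', 'B', 'B', None, 'C'],
    'Aircraft.Category': ['Airplane', 'Airplane', None, 'Helicopter', None, None, None],
})
filled = fill_nan_with_group_mode(demo, 'Aircraft.Category', 'Model')
```

Unlike the loop, this version returns a new DataFrame rather than modifying its input in place.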

In [61]:
df = fill_nan_column_with_mode(df, 'Aircraft.Category', 'Model')

Filling NaN values in 'Aircraft.Category': 100%|██████████████████████████████████████████████████████████████████████| 56671/56671 [02:51<00:00, 331.07it/s]


That saved 48,323 data points in the `Aircraft.Category` column. Now I'll clean up the remaining `Aircraft.Category` values that didn't get caught by our function.

In [62]:
df['Aircraft.Category'] = df['Aircraft.Category'].fillna('Unknown')

Now I want to take a look at the aircraft categories to see if there are any we should exclude.

In [63]:
df['Aircraft.Category'].value_counts()

Aircraft.Category
Airplane             70768
Unknown               8500
Helicopter            7646
Glider                 989
Balloon                470
Gyrocraft              266
Weight-Shift           165
Powered Parachute       95
Ultralight              33
WSFT                     9
Blimp                    7
Powered-Lift             6
UNK                      2
Rocket                   1
ULTR                     1
Name: count, dtype: int64

We are likely only looking at airplanes and possibly helicopters, so let's exclude everything else.

In [64]:
aircraft_cats_to_keep = ['Airplane', 'Helicopter']

df = df[df['Aircraft.Category'].isin(aircraft_cats_to_keep)]

`Schedule`, `Air.carrier`, `FAR.Description`, `Latitude`, and `Longitude` are all missing data in over 60% of the rows. I feel it's safe to just remove those columns, as they aren't contributing anything meaningful to our analysis and aren't values that we can impute.

In [65]:
df.drop(columns=['Schedule', 'Air.carrier', 'FAR.Description', 'Latitude', 'Longitude'], inplace=True)

`Event.Id`, `Accident.Number`, `Event.Date`, `Location`, `Amateur.Built`, and `Country` are each missing data in less than 1% of the rows. Because of the low impact, I will remove those rows. `Model` is a value I would like to keep if possible, so I will use my function to try to fill in missing models with the most common model for the same `Make`. Anything that is still missing I will drop. Since I can determine the make from the model but not the model from the make, I will drop the rows with missing `Make` values as well.

In [66]:
df = fill_nan_column_with_mode(df, 'Model', 'Make')

Filling NaN values in 'Model': 100%|████████████████████████████████████████████████████████████████████████████████████████| 33/33 [00:00<00:00, 166.85it/s]


In [67]:
df.dropna(subset=['Event.Id', 'Accident.Number', 'Event.Date', 'Location', 'Amateur.Built', 'Country', 'Make', 'Model'], inplace=True)

`Registration.Number` doesn't seem like it would be useful for our analysis so I will drop that as well.

In [68]:
df.drop(columns=['Registration.Number'], inplace=True)

`Amateur.Built` is an interesting datapoint. Let's look deeper at that.

In [69]:
df['Amateur.Built'].value_counts()

Amateur.Built
No     72394
Yes     5692
Name: count, dtype: int64

Nearly 5,700 incidents were related to 'homebuilt' aircraft. From a company standpoint, I don't think we would be interested in purchasing amateur-built aircraft rather than commercially produced aircraft, so let's get rid of those rows and then drop the column.

In [70]:
df = df[df['Amateur.Built'] != 'Yes']
df.drop(columns=['Amateur.Built'], inplace=True)

`Injury.Severity` might be something that could help us with our safety determinations:

In [71]:
df['Injury.Severity'].unique()

array(['Fatal(2)', 'Fatal(3)', 'Fatal(1)', 'Non-Fatal', 'Fatal(4)',
       'Incident', 'Fatal(8)', 'Fatal(78)', 'Fatal(7)', 'Fatal(6)',
       'Fatal(5)', 'Fatal(12)', 'Fatal(14)', 'Fatal(10)', 'Fatal(11)',
       'Fatal(17)', 'Fatal(13)', 'Fatal(29)', 'Fatal(9)', 'Unavailable',
       'Fatal(135)', 'Fatal(25)', 'Fatal(82)', 'Fatal(156)', 'Fatal(18)',
       'Fatal(144)', 'Fatal(111)', 'Fatal(20)', 'Fatal(73)', 'Fatal(27)',
       'Fatal(34)', 'Fatal(87)', 'Fatal(16)', 'Fatal(56)', 'Fatal(37)',
       'Fatal(54)', 'Fatal(65)', 'Fatal(160)', 'Fatal(189)', 'Fatal(97)',
       'Fatal(125)', 'Fatal(228)', 'Fatal(52)', 'Fatal(15)', 'Fatal(229)',
       'Fatal(80)', 'Fatal(169)', 'Fatal(88)', 'Fatal(131)', 'Fatal(60)',
       'Fatal(143)', 'Fatal(83)', 'Fatal(24)', 'Fatal(44)', 'Fatal(64)',
       'Fatal(92)', 'Fatal(19)', 'Fatal(26)', 'Fatal(138)', 'Fatal(206)',
       'Fatal(71)', 'Fatal(23)', 'Fatal(21)', 'Fatal(115)', 'Fatal(141)',
       'Fatal(121)', 'Fatal(45)', 'Fatal(145)', 'Fatal(1

There are a lot of different variations of "Fatal" in this column, all with a number in parentheses following the value. Let's compare these to a sample of the `Total.Fatal.Injuries` column.

In [72]:
df[df['Injury.Severity'] == 'Fatal(2)']['Total.Fatal.Injuries']

0        2.0
3        2.0
15       2.0
24       2.0
33       2.0
        ... 
63822    2.0
63823    2.0
63826    2.0
63828    2.0
63874    2.0
Name: Total.Fatal.Injuries, Length: 2972, dtype: float64

It seems like much of this data is duplicated from the `Total.Fatal.Injuries` column, i.e., rows with 'Fatal(2)' in the `Injury.Severity` column also have 2 in the `Total.Fatal.Injuries` column. As this is duplicate information, let's drop it.
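
One way to quantify that impression rather than eyeballing a single value is to parse the number out of every `Fatal(N)` label and compare it with `Total.Fatal.Injuries`. The sketch below runs on a small made-up sample (the last row disagrees on purpose); the variable names are illustrative.

```python
import pandas as pd

# Made-up sample; the last row deliberately mismatches
sample = pd.DataFrame({
    'Injury.Severity': ['Fatal(2)', 'Fatal(1)', 'Non-Fatal', 'Incident', 'Fatal(3)'],
    'Total.Fatal.Injuries': [2.0, 1.0, 0.0, 0.0, 4.0],
})

# Pull N out of 'Fatal(N)'; labels without that pattern become NaN
parsed = sample['Injury.Severity'].str.extract(r'Fatal\((\d+)\)', expand=False).astype(float)

# Share of 'Fatal(N)' rows where N equals the recorded fatality count
is_fatal = parsed.notna()
agreement = (parsed[is_fatal] == sample.loc[is_fatal, 'Total.Fatal.Injuries']).mean()
```

An agreement close to 1.0 on the real data would support treating the column as redundant.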

In [73]:
df.drop(columns=['Injury.Severity'], inplace=True)

Now I want to take a look at our 'date' columns.

We have two different 'date' columns (`Event.Date` and `Publication.Date`). We probably don't need both, as we're not interested in the time elapsed between the incident and the release of the report. I'll keep `Event.Date`.

In [74]:
df.drop(columns=['Publication.Date'], inplace=True)

I'll probably want to use the `Event.Date` column for analysis so I need to convert it to a `datetime` object instead of a string.

In [75]:
df['Event.Date'] = pd.to_datetime(df['Event.Date'])
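
If any date strings turn out to be malformed, a slightly more defensive conversion is possible. The sketch below assumes the `YYYY-MM-DD` layout seen in the preview and uses `errors='coerce'` so that unparseable values become `NaT` and can be counted; the sample series is made up.

```python
import pandas as pd

# Made-up values: two valid dates, one bad string, one missing value
raw_dates = pd.Series(['1948-10-24', '1962-07-19', 'not a date', None])

# Unparseable strings become NaT instead of raising an error
parsed = pd.to_datetime(raw_dates, format='%Y-%m-%d', errors='coerce')

# Values that were present but could not be parsed
n_unparsed = parsed.isna().sum() - raw_dates.isna().sum()
```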

Now I'm interested in some of the categorical columns that may benefit from a placeholder value.

In [76]:
# Let's check the Weather.Condition column
df['Weather.Condition'].unique()

array(['UNK', 'IMC', 'VMC', nan, 'Unk'], dtype=object)

It looks like weather conditions include IMC, VMC, UNK, Unk, and nan. I'll group Unk, UNK, and nan all as UNK.

In [77]:
df['Weather.Condition'] = df['Weather.Condition'].fillna('UNK')

Since I may need to do more value renaming, I'll create a function for future use.

In [78]:
def rename_value(df, col_name, old_value, new_value):
    # Case-insensitive substring match; NaN entries count as non-matches
    mask = df[col_name].str.contains(old_value, case=False, regex=False, na=False)
    df.loc[mask, col_name] = new_value

rename_value(df, 'Weather.Condition', 'Unk', 'UNK')

df['Weather.Condition'].unique()

array(['UNK', 'IMC', 'VMC'], dtype=object)
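
For a column like this one, where the variants differ only by letter case, a simpler option is to normalize the case directly. This is a sketch on a made-up series, not a replacement for `rename_value`, which is still needed for substring matches such as airport names.

```python
import pandas as pd

# The five distinct values seen in Weather.Condition
weather = pd.Series(['UNK', 'IMC', 'VMC', None, 'Unk'])

# Fill missing values, then upper-case so 'Unk' and 'UNK' collapse together
normalized = weather.fillna('UNK').str.upper()
```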

In [79]:
df['Broad.phase.of.flight'].unique()

array(['Cruise', 'Approach', 'Climb', 'Unknown', 'Takeoff', 'Landing',
       'Taxi', 'Descent', 'Maneuvering', 'Standing', 'Go-around', 'Other',
       nan], dtype=object)

In [80]:
df['Broad.phase.of.flight'] = df['Broad.phase.of.flight'].fillna('Unknown')

In [81]:
df['Purpose.of.flight'].unique()

array(['Personal', nan, 'Business', 'Instructional', 'Unknown', 'Ferry',
       'Executive/corporate', 'Aerial Observation', 'Aerial Application',
       'Public Aircraft', 'Skydiving', 'Other Work Use', 'Positioning',
       'Flight Test', 'Air Race/show', 'Air Drop',
       'Public Aircraft - Federal', 'Glider Tow',
       'Public Aircraft - Local', 'External Load',
       'Public Aircraft - State', 'Banner Tow', 'Firefighting',
       'Air Race show', 'PUBS', 'ASHO', 'PUBL'], dtype=object)

In [82]:
df['Purpose.of.flight'] = df['Purpose.of.flight'].fillna('Unknown')

There are some flight types in here that likely don't apply to our business case, so let's drop them.

In [83]:
# List of flight types to drop
flight_types_to_drop = ['Air Drop', 'Air Race/show', 'Firefighting', 'Glider Tow', 'Air Race show', 'Flight Test']

# Drop the unwanted flight types
df = df[~df['Purpose.of.flight'].isin(flight_types_to_drop)]

In [84]:
df['Airport.Name'].unique()

array([nan, 'BLACKBURN AG STRIP', 'HANOVER', ...,
       'HAWKINSVILLE-PULASKI COUNTY', 'Lewiston Municipal Airport',
       'WICHITA DWIGHT D EISENHOWER NT'], dtype=object)

In [85]:
df['Airport.Name'] = df['Airport.Name'].fillna('Unknown')

# Rename values in 'Airport.Name' containing 'private' (case-insensitive) to 'PRIVATE'
rename_value(df, 'Airport.Name', 'private', 'PRIVATE')

In [86]:
df['Airport.Code'].value_counts()

Airport.Code
NONE    1164
PVT      362
APA      143
MRI      125
ORD      106
        ... 
32J        1
0CL1       1
10S        1
A8L        1
EIKH       1
Name: count, Length: 9427, dtype: int64

In [87]:
df['Airport.Code'] = df['Airport.Code'].fillna('UNK')

In [88]:
df = fill_nan_column_with_mode(df, 'Number.of.Engines', 'Model')

Filling NaN values in 'Number.of.Engines': 100%|████████████████████████████████████████████████████████████████████████| 4404/4404 [00:11<00:00, 383.08it/s]


Since one of the safety metrics I want to look at is the number of engines, I will go ahead and drop the remaining rows where we don't have that information.

In [89]:
df = df.dropna(subset=['Number.of.Engines'])

In [90]:
df['Engine.Type'].value_counts()

Engine.Type
Reciprocating      57671
Turbo Shaft         3050
Turbo Prop          2752
Turbo Fan           1863
Unknown              573
Turbo Jet            422
Geared Turbofan       12
Electric               3
UNK                    1
Name: count, dtype: int64

In [91]:
df = fill_nan_column_with_mode(df, 'Engine.Type', 'Model')

Filling NaN values in 'Engine.Type': 100%|██████████████████████████████████████████████████████████████████████████████| 5219/5219 [00:13<00:00, 379.86it/s]


In [92]:
df['Engine.Type'] = df['Engine.Type'].fillna('UNK')

In [93]:
df['Aircraft.damage'].value_counts()

Aircraft.damage
Substantial    52736
Destroyed      14384
Minor           2057
Unknown          104
Name: count, dtype: int64

In [94]:
df['Aircraft.damage'] = df['Aircraft.damage'].fillna('Unknown')

In [95]:
df['Report.Status'] = df['Report.Status'].fillna('')

I want to check how many rows have no injury or uninjured counts at all.

In [96]:
# Specify the columns to check
columns_to_check = ['Total.Uninjured', 'Total.Minor.Injuries', 'Total.Serious.Injuries', 'Total.Fatal.Injuries']

# Count rows where all specified columns are NaN
num_rows_all_nan = df[columns_to_check].isna().all(axis=1).sum()

print(f"Number of rows with NaN in all specified columns: {num_rows_all_nan}")

Number of rows with NaN in all specified columns: 116


That's not very many, and we likely won't be able to impute them well, so let's drop them.

In [97]:
# Identify rows where all specified columns are NaN
rows_to_drop = df[columns_to_check].isna().all(axis=1)

# Drop those rows
df = df.drop(index=df[rows_to_drop].index)

# Reset index after dropping rows
df.reset_index(drop=True, inplace=True)
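
The same filter can be written more compactly with `dropna(how='all')`, which removes a row only when every column in `subset` is missing. A small sketch on made-up data:

```python
import pandas as pd

cols = ['Total.Uninjured', 'Total.Minor.Injuries', 'Total.Serious.Injuries', 'Total.Fatal.Injuries']

# Made-up rows: only the middle one is missing all four counts
demo = pd.DataFrame({
    'Total.Uninjured': [1.0, None, None],
    'Total.Minor.Injuries': [None, None, 2.0],
    'Total.Serious.Injuries': [0.0, None, None],
    'Total.Fatal.Injuries': [None, None, None],
})

# Drop rows where all of the listed columns are NaN, then renumber the index
kept = demo.dropna(subset=cols, how='all').reset_index(drop=True)
```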

There are a lot of NaN values in the injury columns. For now, I will fill those with a 0.

In [98]:
columns_to_fill = ['Total.Uninjured', 'Total.Minor.Injuries', 'Total.Serious.Injuries', 'Total.Fatal.Injuries']

for col in columns_to_fill:
    df[col] = df[col].fillna(0)

In [107]:
df['Country'].value_counts()

Country
United States               66323
Brazil                        332
Mexico                        306
Canada                        286
United Kingdom                262
                            ...  
French Guiana                   1
Chad                            1
Cayenne                         1
Yemen                           1
Turks and Caicos Islands        1
Name: count, Length: 204, dtype: int64

It seems like the vast majority of incidents are in the US. Let's keep the three primary North American countries and remove everything else.

In [108]:
# List of countries to keep
countries_to_keep = ['Canada', 'United States', 'Mexico']

# Keep only the rows from those countries
df = df[df['Country'].isin(countries_to_keep)]

Let's see where we're at now.

In [110]:
percent_data_missing(df)

                        Missing Values  Percentage
Event.Id                             0         0.0
Investigation.Type                   0         0.0
Broad.phase.of.flight                0         0.0
Weather.Condition                    0         0.0
Total.Uninjured                      0         0.0
Total.Minor.Injuries                 0         0.0
Total.Serious.Injuries               0         0.0
Total.Fatal.Injuries                 0         0.0
Purpose.of.flight                    0         0.0
Engine.Type                          0         0.0
Number.of.Engines                    0         0.0
Model                                0         0.0
Make                                 0         0.0
Aircraft.Category                    0         0.0
Aircraft.damage                      0         0.0
Airport.Name                         0         0.0
Airport.Code                         0         0.0
Country                              0         0.0
Location                       

In [111]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 66915 entries, 0 to 71449
Data columns (total 22 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   Event.Id                66915 non-null  object        
 1   Investigation.Type      66915 non-null  object        
 2   Accident.Number         66915 non-null  object        
 3   Event.Date              66915 non-null  datetime64[ns]
 4   Location                66915 non-null  object        
 5   Country                 66915 non-null  object        
 6   Airport.Code            66915 non-null  object        
 7   Airport.Name            66915 non-null  object        
 8   Aircraft.damage         66915 non-null  object        
 9   Aircraft.Category       66915 non-null  object        
 10  Make                    66915 non-null  object        
 11  Model                   66915 non-null  object        
 12  Number.of.Engines       66915 non-null  float64    