# Choosing The Best Aircraft

<img src="https://images.unsplash.com/photo-1524592714635-d77511a4834d" alt="drawing" width="700"/>

*Image by S. Esenin from Unsplash*

# Overview

Our company is expanding into new industries to diversify its portfolio. Specifically, we are interested in purchasing and operating airplanes for commercial and private enterprises, despite lacking knowledge about potential risks in the aviation sector.

As the data scientist of the company, my goal is to explore which aircraft present the lowest risk. With my findings, I will create actionable insights that the head of the new aviation division can use to help decide which aircraft to purchase.

Analysis by Kawsar Hussain

# Challenge

We will present three recommendations to our company stakeholders that will directly affect our entry into the aviation industry. The goal is to give our new aviation division data-driven direction, backed by evidence that the recommended aircraft minimize risk and maximize benefit for the new aviation company.

# Dataset

The `data` folder contains [aviation data](https://www.kaggle.com/datasets/khsamaha/aviation-accident-database-synopses) from the National Transportation Safety Board. It covers certain civil aviation accidents and selected incidents in the United States and international waters from 1962 to 2023.

# Solution

This project utilizes comprehensive risk assessment and statistical analysis to determine the lowest-risk aircraft for our company’s new aviation division. We will formulate three hypotheses regarding the factors that contribute to aircraft safety and operational reliability. By analyzing accident data, we will test these hypotheses and provide evidence-based recommendations that translate into actionable insights, guiding stakeholders in making informed decisions about which aircraft to purchase for commercial and private operations.

# Results

# Code

## Loading Tools

Import our data science tools.

In [2]:
import itertools
import numpy as np
import pandas as pd 
from numbers import Number
import sqlite3
import scipy.stats as stats
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
import seaborn as sns
import zipfile
import os
import warnings
warnings.filterwarnings('ignore')
plt.style.use('ggplot')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 200)
pd.set_option('display.float_format', '{:,.2f}'.format)

## Loading Datasets

#### National Transportation Safety Board Aviation Accident Data

This dataset holds information about accidents and incidents throughout the U.S. and international waters, including event dates, location, aircraft details, flight purpose, and more. We aim to use this information to analyze aircraft risk.

In [3]:
df = pd.read_csv('data/Aviation_Data.csv')
df.sample(n=5)

Unnamed: 0,Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Latitude,Longitude,Airport.Code,Airport.Name,Injury.Severity,Aircraft.damage,Aircraft.Category,Registration.Number,Make,Model,Amateur.Built,Number.of.Engines,Engine.Type,FAR.Description,Schedule,Purpose.of.flight,Air.carrier,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date
13615,20001214X38485,Accident,NYC86LA050,1985-12-15,"MT. JOY, PA",United States,,,,,Non-Fatal,Substantial,,N7177W,Piper,PA-28-180,No,1.0,Reciprocating,,,Personal,,0.0,0.0,0.0,2.0,VMC,Cruise,Probable Cause,
87188,20210408102893,Accident,ERA21LA179,2021-03-26,"Tampa, FL",United States,028050N,0822043W,VDF,Tampa Executive Airport,Non-Fatal,Substantial,Airplane,N392DC,DIAMOND AIRCRAFT IND INC,DA20-C1,No,1.0,Reciprocating,091,,Instructional,Superior Aviation Gateway LLC,0.0,0.0,0.0,1.0,VMC,,The pilots improper recovery from a bounced l...,20-08-2021
88603,20220207104606,Accident,DCA22WA074,2022-01-16,"Okayama City,",Japan,,,,,Serious,,Airplane,JA24MC,AIRBUS,A320,No,,,NUSC,,,,0.0,0.0,1.0,55.0,,,,25-02-2022
55168,20030725X01198,Accident,CHI03LA224,2003-07-23,"Highland, MI",United States,42.666667,-83.616667,,,Non-Fatal,,,N11BQ,Aerostar,S81A,No,0.0,,,,Unknown,,,2.0,8.0,2.0,VMC,Approach,Probable Cause,30-12-2003
42027,20001208X08198,Accident,MIA97LA193,1997-06-20,"ORANGEBURG, SC",United States,,,OGB,ORANGEBURG MUNICIPAL,Non-Fatal,Substantial,,N64936,Cessna,152,No,1.0,Reciprocating,,,Personal,,0.0,0.0,0.0,1.0,VMC,Landing,Probable Cause,07-01-1998


In [6]:
df[['Injury.Severity', 'Total.Fatal.Injuries', 'Total.Serious.Injuries', 'Total.Minor.Injuries',
    'Total.Uninjured']].head(20)

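| | Injury.Severity | Total.Fatal.Injuries | Total.Serious.Injuries | Total.Minor.Injuries | Total.Uninjured |
|---|---|---|---|---|---|
| 0 | Fatal(2) | 2.0 | 0.0 | 0.0 | 0.0 |
| 1 | Fatal(4) | 4.0 | 0.0 | 0.0 | 0.0 |
| 2 | Fatal(3) | 3.0 | NaN | NaN | NaN |
| 3 | Fatal(2) | 2.0 | 0.0 | 0.0 | 0.0 |
| 4 | Fatal(1) | 1.0 | 2.0 | NaN | 0.0 |
| 5 | Non-Fatal | NaN | NaN | 1.0 | 44.0 |
| 6 | Fatal(4) | 4.0 | 0.0 | 0.0 | 0.0 |
| 7 | Non-Fatal | 0.0 | 0.0 | 0.0 | 2.0 |
| 8 | Non-Fatal | 0.0 | 0.0 | 0.0 | 2.0 |
| 9 | Non-Fatal | 0.0 | 0.0 | 3.0 | 0.0 |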


# Exploratory Data Analysis

### 1. Understanding The Data
- Dataframe `shape`
- `head` and `tail`
- `info`
- `describe`

In [3]:
df.shape

(90348, 31)

In [4]:
df.head()

Unnamed: 0,Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Latitude,Longitude,Airport.Code,Airport.Name,Injury.Severity,Aircraft.damage,Aircraft.Category,Registration.Number,Make,Model,Amateur.Built,Number.of.Engines,Engine.Type,FAR.Description,Schedule,Purpose.of.flight,Air.carrier,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date
0,20001218X45444,Accident,SEA87LA080,1948-10-24,"MOOSE CREEK, ID",United States,,,,,Fatal(2),Destroyed,,NC6404,Stinson,108-3,No,1.0,Reciprocating,,,Personal,,2.0,0.0,0.0,0.0,UNK,Cruise,Probable Cause,
1,20001218X45447,Accident,LAX94LA336,1962-07-19,"BRIDGEPORT, CA",United States,,,,,Fatal(4),Destroyed,,N5069P,Piper,PA24-180,No,1.0,Reciprocating,,,Personal,,4.0,0.0,0.0,0.0,UNK,Unknown,Probable Cause,19-09-1996
2,20061025X01555,Accident,NYC07LA005,1974-08-30,"Saltville, VA",United States,36.92,-81.88,,,Fatal(3),Destroyed,,N5142R,Cessna,172M,No,1.0,Reciprocating,,,Personal,,3.0,,,,IMC,Cruise,Probable Cause,26-02-2007
3,20001218X45448,Accident,LAX96LA321,1977-06-19,"EUREKA, CA",United States,,,,,Fatal(2),Destroyed,,N1168J,Rockwell,112,No,1.0,Reciprocating,,,Personal,,2.0,0.0,0.0,0.0,IMC,Cruise,Probable Cause,12-09-2000
4,20041105X01764,Accident,CHI79FA064,1979-08-02,"Canton, OH",United States,,,,,Fatal(1),Destroyed,,N15NY,Cessna,501,No,,,,,Personal,,1.0,2.0,,0.0,VMC,Approach,Probable Cause,16-04-1980


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90348 entries, 0 to 90347
Data columns (total 31 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                88889 non-null  object 
 1   Investigation.Type      90348 non-null  object 
 2   Accident.Number         88889 non-null  object 
 3   Event.Date              88889 non-null  object 
 4   Location                88837 non-null  object 
 5   Country                 88663 non-null  object 
 6   Latitude                34382 non-null  object 
 7   Longitude               34373 non-null  object 
 8   Airport.Code            50132 non-null  object 
 9   Airport.Name            52704 non-null  object 
 10  Injury.Severity         87889 non-null  object 
 11  Aircraft.damage         85695 non-null  object 
 12  Aircraft.Category       32287 non-null  object 
 13  Registration.Number     87507 non-null  object 
 14  Make                    88826 non-null

In [6]:
df.describe()

| | Number.of.Engines | Total.Fatal.Injuries | Total.Serious.Injuries | Total.Minor.Injuries | Total.Uninjured |
|---|---|---|---|---|---|
| count | 82805.0 | 77488.0 | 76379.0 | 76956.0 | 82977.0 |
| mean | 1.15 | 0.65 | 0.28 | 0.36 | 5.33 |
| std | 0.45 | 5.49 | 1.54 | 2.24 | 27.91 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 75% | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| max | 8.0 | 349.0 | 161.0 | 380.0 | 699.0 |
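
The `info` output shows that non-null counts vary widely between columns. As a complement, here is a minimal sketch of how the share of missing values per column could be computed. The `sample` frame is a made-up stand-in, not real records; on the real data the same expression would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the accident table; values are illustrative only.
sample = pd.DataFrame({
    'Make': ['Cessna', 'Piper', None, 'Boeing'],
    'Aircraft.Category': [None, 'Airplane', None, None],
    'Total.Fatal.Injuries': [2.0, np.nan, 0.0, 0.0],
})

# isna() yields booleans; their per-column mean is the fraction of missing values.
missing_share = sample.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Sorting in descending order puts the sparsest columns first, which makes it easier to decide which ones are too incomplete to rely on.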


### 2. Data Preparation
- Dropping irrelevant columns and rows
- Identifying duplicated columns
- Renaming columns
- Feature creation

Let's clean up our dataset by dropping irrelevant columns and renaming the remaining ones to describe the data better. I kept the columns that are essential for our analysis and gave them shorter names that are easier to work with in code.

In [8]:
# Keep only the columns needed for the aircraft risk analysis. Dropped columns:
# 'Event.Id', 'Accident.Number', 'Latitude', 'Longitude', 'Airport.Code', 'Airport.Name',
# 'Registration.Number', 'FAR.Description', 'Schedule', 'Air.carrier', 'Report.Status', 'Publication.Date'
df = df[[
    'Investigation.Type', 'Event.Date', 'Location', 'Country', 'Injury.Severity', 'Aircraft.damage',
    'Aircraft.Category', 'Make', 'Model', 'Amateur.Built', 'Number.of.Engines', 'Engine.Type',
    'Purpose.of.flight', 'Total.Fatal.Injuries', 'Total.Serious.Injuries', 'Total.Minor.Injuries',
    'Total.Uninjured', 'Weather.Condition', 'Broad.phase.of.flight'
]].copy()  # copy so later changes do not act on a slice of the original frame

# Shorter snake_case names; 'Injury.Severity', 'Make' and 'Model' keep their original names.
df = df.rename(columns={
    'Investigation.Type': 'investigation', 'Event.Date': 'date', 'Location': 'location', 'Country': 'country',
    'Aircraft.damage': 'damage', 'Aircraft.Category': 'category', 'Amateur.Built': 'amateur',
    'Number.of.Engines': 'engine_count', 'Engine.Type': 'engine_type', 'Purpose.of.flight': 'flight_purpose',
    'Total.Fatal.Injuries': 'fatal_injuries', 'Total.Serious.Injuries': 'serious_injuries',
    'Total.Minor.Injuries': 'minor_injuries', 'Total.Uninjured': 'not_injured',
    'Weather.Condition': 'weather_condition', 'Broad.phase.of.flight': 'flight_stage'
})

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90348 entries, 0 to 90347
Data columns (total 19 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   investigation      90348 non-null  object 
 1   date               88889 non-null  object 
 2   location           88837 non-null  object 
 3   country            88663 non-null  object 
 4   Injury.Severity    87889 non-null  object 
 5   damage             85695 non-null  object 
 6   category           32287 non-null  object 
 7   Make               88826 non-null  object 
 8   Model              88797 non-null  object 
 9   amateur            88787 non-null  object 
 10  engine_count       82805 non-null  float64
 11  engine_type        81793 non-null  object 
 12  flight_purpose     82697 non-null  object 
 13  fatal_injuries     77488 non-null  float64
 14  serious_injuries   76379 non-null  float64
 15  minor_injuries     76956 non-null  float64
 16  not_injured        829

The columns that remain in the dataframe are the ones I found essential for assessing aircraft risk after manually sifting through each column's unique values. The other columns describe the accident itself, whereas we are prioritizing the aircraft details within the accident data. This keeps our analysis focused specifically on aircraft.

One **important** column was dropped after weighing how reliable it is as an indicator of the type of injuries against the injury counts already recorded in the columns `fata

There is some important data preparation to do after looking at the relevant columns in `df.info()`:
- Remove `null` values in `Make` and `Model`, because we only want to work with aircraft that we have accident data for.
- Remove `null` values in `Injury.Severity`, because this column holds the injury type that is reflected in the other injury columns. The two do not always match, but this column has more entries, so `Injury.Severity` will be prioritized.
- Remove `null` values in `fatal_injuries`, `serious_injuries`, and `minor_injuries`, because we only want to work with aircraft risk data, and records with missing risk values will not help our analysis.
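
The note that the severity label and the injury counts do not always match can be checked directly. The sketch below is one possible way to do that, assuming the label follows the `Fatal(N)` pattern seen in the preview and that the label column is still called `Injury.Severity`, as in the `df.info()` output. The `sample` rows are illustrative, and the last one is a made-up mismatch.

```python
import numpy as np
import pandas as pd

# A few rows shaped like the preview: severity label plus recorded fatal count.
sample = pd.DataFrame({
    'Injury.Severity': ['Fatal(2)', 'Fatal(3)', 'Non-Fatal', 'Non-Fatal', 'Fatal(1)'],
    'fatal_injuries': [2.0, 3.0, np.nan, 0.0, 4.0],  # last row is a made-up mismatch
})

# Pull the number out of labels such as "Fatal(2)"; labels without one yield NaN.
label_count = (
    sample['Injury.Severity']
    .str.extract(r'Fatal\((\d+)\)', expand=False)
    .astype(float)
)

# Only compare rows where both the label and the column carry a count.
comparable = label_count.notna() & sample['fatal_injuries'].notna()
mismatch = comparable & (label_count != sample['fatal_injuries'])
print(sample[mismatch])
```

Rows flagged by `mismatch` are the ones where a choice between the two sources actually matters.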

Let's start with these tasks before cleaning our data even further.

In [None]:
df = df[df['Make'].notnull()]
df = df[df['Model'].notnull()]
# df = df[df['fatal_injuries'].notnull()]
# df = df[df['serious_injuries'].notnull()]
# df = df[df['minor_injuries'].notnull()]
df.info()

In [None]:
df[df['fatal_injuries'].isna()]
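
If the rows with missing injury counts are eventually dropped, the three filters that are commented out in the earlier cell could be collapsed into one call. This is a sketch on a made-up `sample` frame, assuming the renamed column names; `injury_cols` is a helper name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical helper list of the renamed injury-count columns.
injury_cols = ['fatal_injuries', 'serious_injuries', 'minor_injuries']

# Illustrative rows only, not real records.
sample = pd.DataFrame({
    'Make': ['Cessna', 'Piper', 'Cessna', 'Boeing'],
    'fatal_injuries': [2.0, np.nan, 0.0, 0.0],
    'serious_injuries': [0.0, 0.0, np.nan, 1.0],
    'minor_injuries': [0.0, 1.0, 0.0, 3.0],
})

# Count the rows that would be lost before committing to the drop.
rows_with_gaps = int(sample[injury_cols].isna().any(axis=1).sum())

# Keep only rows where every injury column is populated.
cleaned = sample.dropna(subset=injury_cols)
print(rows_with_gaps, len(cleaned))
```

Checking `rows_with_gaps` first shows how much data the drop costs, which matters here because the injury columns have thousands of missing entries.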

After inspecting `category`, we've determined for our initial analysis

In [None]:
df.category.unique()
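
Since `category` is populated for only about a third of the rows according to `df.info()`, counting the values with missing entries included may be more informative than listing the unique values. A minimal sketch on a made-up `sample` frame:

```python
import pandas as pd

# Illustrative values only; None stands for a missing category.
sample = pd.DataFrame({
    'category': ['Airplane', None, 'Helicopter', None, 'Airplane', None]
})

# dropna=False keeps missing entries as their own bucket so their size is visible.
counts = sample['category'].value_counts(dropna=False)
print(counts)
```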
