## Final Project Submission

Please fill out:
* Student name: Noah Hunsicker, Colby Gates
* Student pace: Full time
* Scheduled project review date/time: 
* Instructor name: Praveen Gowtham
* Blog post URL: 

# Recommending Airplanes to Purchase by Minimizing Risk

# Goals

Our task was to identify airplanes for the business to purchase, based on which planes pose the lowest risk to the company. This question can be interpreted in several ways: minimizing financial risk, maximizing aircraft longevity, or, the option we chose, minimizing the risk to passengers when an incident or accident occurs on board.

To answer this question we will parse the relevant data, isolate the variables that are most useful for identifying which airplanes are best at keeping passengers alive, and then apply statistics to our dataset to discover which airplanes perform best.

# Data

The data for our analysis comes from a Kaggle [dataset](https://www.kaggle.com/datasets/khsamaha/aviation-accident-database-synopses), which is taken from the National Transportation Safety Board. The dataset contains information on aircraft incidents and accidents in the United States dating back to 1962. It takes the form of a CSV, which we converted to a pandas DataFrame to make cleaning and manipulating the data easier. For full documentation of the data cleaning process, refer to the Data_Exploration_Cleaning Jupyter notebook in the notebooks folder. After cleaning the data, we exported it to a new CSV for manipulation, statistical aggregation, and use in Tableau.

In [2]:
# Import libraries and the cleaned data for later visualization
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv('../data/final_data.csv')

# Methods

Before we dive into the methodology we used to identify the planes that are best at keeping passengers safe in the event of an incident or accident, it is important to note why we chose this metric over others that might seem better suited to answering a broad question such as "how do we minimize risk".

The first metric we considered was which planes have the fewest incidents and accidents, as those planes should be considered the least risky. The problem is that our dataset does not contain data on ALL flights taken by each aircraft, only on events that occurred on aircraft. This means that we have no way of knowing which airplanes have accidents more often than others; we only have data on the outcomes of accidents when they do happen. If we had data on how many flights took place for each make and model in the given time frame we could have made this analysis, but with the data we had it was not possible.

The second metric we considered was a net cost function that combined all of the factors we had to estimate the overall monetary cost of each plane crashing, based on a [model](https://www.nlr.org/wp-content/uploads/2019/10/App-11-NLR-9-CR-2008-307.pdf) originally put out by the Netherlands Air Transport Safety Institute. However, this model requires the cost of each aircraft, the insurance value of each aircraft, the cost of a closure at the airport that has to deal with the recovery from the accident, and the amount of damage that the aircraft sustained, and our dataset did not contain this information.

This led us to approach the question from the angle of minimizing injury, as we have data on the severity of injury for each event. At first we measured which airplanes were best at keeping passengers uninjured, but we decided that it was more important to focus on minimizing major and fatal injuries, since minor injuries, while unfortunate, do not outweigh the need to prevent death and permanent injury.

To find this value for each plane we created a new variable for the number of passengers per flight, equal to the sum of the major injury, minor injury, fatal injury, and uninjured columns. Then we constructed a variable equal to the number of major injuries plus the number of fatal injuries in each event. Finally we divided the number of major and fatal injuries by the number of passengers to get the percent of passengers seriously injured or killed. We then grouped our data by model of plane and sorted by percent of passengers with serious and fatal injuries to find the best and worst planes.
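The steps above can be sketched on a small toy table. This is only an illustration under stated assumptions: the injury column names follow the naming pattern of `Percent.Serious.and.Fatal` but are not confirmed for the cleaned CSV, `Total.Passengers` and `Serious.and.Fatal` are hypothetical helper names, and the per-model score is assumed to be the mean of the per-event percentages (the text does not say which aggregation was used).

```python
import pandas as pd

# Toy stand-in for the cleaned events table; all column names are assumptions.
events = pd.DataFrame({
    "Model": ["A100", "A100", "B200", "B200", "C300"],
    "Total.Fatal.Injuries":   [0, 1, 0, 0, 2],
    "Total.Serious.Injuries": [1, 0, 0, 0, 0],
    "Total.Minor.Injuries":   [1, 0, 0, 1, 0],
    "Total.Uninjured":        [2, 1, 4, 3, 0],
})

outcome_cols = ["Total.Fatal.Injuries", "Total.Serious.Injuries",
                "Total.Minor.Injuries", "Total.Uninjured"]

# People on board per event = sum of the four outcome columns.
events["Total.Passengers"] = events[outcome_cols].sum(axis=1)

# Serious (major) plus fatal injuries per event.
events["Serious.and.Fatal"] = (events["Total.Fatal.Injuries"]
                               + events["Total.Serious.Injuries"])

# Drop events with nobody recorded on board to avoid dividing by zero.
events = events[events["Total.Passengers"] > 0].copy()

events["Percent.Serious.and.Fatal"] = (
    100 * events["Serious.and.Fatal"] / events["Total.Passengers"]
)

# Assumption: a model's score is the mean of its per-event percentages.
by_model = (events.groupby("Model")["Percent.Serious.and.Fatal"]
            .mean()
            .sort_values())
print(by_model)
```

A pooled alternative (total serious and fatal injuries divided by total passengers per model) would weight large flights more heavily and could rank models differently.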

## Additional Metrics

We looked at a few other metrics to make additional recommendations and guide the airplane purchase process as well as possible.

The first of these additional metrics was engine type, as we hypothesized that different engine types could affect plane safety in the event of an accident or incident. We had already constructed our measurement variable as outlined above, so all we had to do was group our data by engine type and sort by percent of passengers with serious or fatal injuries. We also performed this calculation for number of engines, as looking at both engine type and engine count gives us a better picture of how important engine specifications are for airplane safety.
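Because the same group-and-sort step is reused for every additional metric, it could be wrapped in a small helper. The sketch below is hypothetical: `rank_by_category` is not a function from the project, and the `Engine.Type` and `Number.of.Engines` column names are assumptions about the cleaned data. It also reports the number of events per group, since a mean over very few events is not very informative.

```python
import pandas as pd

def rank_by_category(events, column, metric="Percent.Serious.and.Fatal"):
    """Return the mean metric and event count per category, safest first."""
    return (events.groupby(column)[metric]
            .agg(mean_percent="mean", n_events="count")
            .sort_values("mean_percent"))

# Toy data standing in for the cleaned events table.
events = pd.DataFrame({
    "Engine.Type": ["Reciprocating", "Reciprocating", "Turbo Fan", "Turbo Prop"],
    "Number.of.Engines": [1, 1, 2, 2],
    "Percent.Serious.and.Fatal": [50.0, 30.0, 0.0, 20.0],
})

by_engine_type = rank_by_category(events, "Engine.Type")
by_engine_count = rank_by_category(events, "Number.of.Engines")
print(by_engine_type)
print(by_engine_count)
```

The same call would apply to any other categorical column, provided that column exists in the cleaned data.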

The next metric we looked at was location, as there are many reasons to believe that the area you are flying over affects the number of serious and fatal injuries sustained in the event of an accident. We chose to group this by state, as our data had information on which state the plane was flying over at the time of the incident. Not only do different states have different weather, geographical, and atmospheric conditions that can affect the performance of a plane, but you also have to consider how developed the area you are flying over is in the event of an emergency. For example, if you are flying over a sparsely populated area where the nearest airport is hundreds of miles away, your options for an emergency landing are much worse than if you were flying near a major airport that could accommodate one. The process that we used to extract the state from the location information is also in the Data_Exploration_cleaning Jupyter notebook. Once we had this information we applied the same groupby and sorting process that we used for the other metrics to find which states are best to be flying over in the event of an accident.
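The actual state extraction lives in the cleaning notebook and is not reproduced here. As a rough illustration only, one way to pull a state code out of a location string is a regular expression on the trailing two-letter code. The `Location` column name and the "CITY, ST" format are assumptions, and this sketch does not check that the captured code is a real US state.

```python
import pandas as pd

# Toy location strings; the "CITY, ST" format is an assumption.
events = pd.DataFrame({
    "Location": ["MOOSE CREEK, ID", "Anchorage, AK", "ATLANTIC OCEAN", None],
})

# Capture a two-letter code after the last comma; rows without one become NaN.
events["State"] = (
    events["Location"]
    .str.extract(r",\s*([A-Za-z]{2})\s*$", expand=False)
    .str.upper()
)
print(events)
```

Rows with no recognizable state code stay missing rather than being guessed, so they drop out of a later `groupby` on `State`.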

While the general climate of the area is important to consider for plane safety, the weather conditions at the time of the incident are also important to examine, as inclement weather can interfere with a plane's ability to navigate safely. The dataset recorded whether there were visual meteorological conditions (VMC), instrument meteorological conditions (IMC), or unknown weather conditions at the time of the event. We grouped and sorted these in the same way as the other variables.

The purpose of the flight is also important to consider, as it can help us recommend specific ventures to avoid as too risky and identify ventures to pursue as safer. As with the other metrics, we grouped and sorted the data by percent of passengers with serious and fatal injuries to find which ventures are safest and which are the most dangerous.

The final metric we looked at was the phase of flight in which the event took place. This metric differs from the others because every flight goes through all of these phases, so we cannot recommend avoiding one phase or another. What we can do is find out which phases of flight have the worst outcomes for passengers and recommend that additional safety protocols be put in place for those phases, in an attempt to reduce their rates of major and fatal injuries. We applied the same grouping and sorting procedure to this metric as we did to all the others.

# Results

First, we examine the results of our make and model analysis, as that is the primary metric behind our recommendation. We already imported our data and the required libraries earlier in this notebook, so we go straight into the analysis.

In [3]:
# top_ten_small_planes_model, top_ten_small_planes_percmajor, and
# by_model_final_over_20 must already be defined by an earlier cell.

# Left panel: models with a capacity under 20
x1 = top_ten_small_planes_model
y1 = top_ten_small_planes_percmajor

# Right panel: models with a capacity over 20
x2 = list(by_model_final_over_20.index)
y2 = list(by_model_final_over_20['Percent.Serious.and.Fatal'])

fig, ax = plt.subplots(1, 2, figsize=(12, 6))

# Plot data
ax[0].bar(x1, y1)
ax[1].bar(x2, y2)

# Customize appearance
ax[0].tick_params(axis="x", labelrotation=90)
ax[0].set_ylabel("Percent Serious or Fatal")
ax[0].set_xlabel("Plane Model")
ax[0].set_title("Percent Serious or Fatal in Planes with a Capacity < 20")
ax[1].tick_params(axis="x", labelrotation=90)
ax[1].set_ylabel("Percent Serious or Fatal")
ax[1].set_xlabel("Plane Model")
ax[1].set_title("Percent Serious or Fatal in Planes with a Capacity > 20")
# plt.ylim acts on the current (last-created) axes, so set it explicitly
ax[1].set_ylim(0, 12)

NameError: name 'top_ten_small_planes_model' is not defined
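The `NameError` above shows that `top_ten_small_planes_model`, `top_ten_small_planes_percmajor`, and `by_model_final_over_20` are never defined in this notebook, so the plotting cell cannot run as shown. How they were originally built is not documented here. The following is a minimal sketch of one way they might be derived, under loud assumptions: the toy data, the `Total.Passengers` column, the use of mean passengers per event as a stand-in for capacity, and the cutoff of 20 (taken from the plot titles) are all hypothetical.

```python
import pandas as pd

# Toy per-event table; column names and values are assumptions.
events = pd.DataFrame({
    "Model": ["S1", "S1", "S2", "L1", "L1", "L2"],
    "Total.Passengers": [2, 4, 3, 120, 80, 50],
    "Percent.Serious.and.Fatal": [50.0, 0.0, 100.0, 5.0, 0.0, 10.0],
})

# Per-model mean of the injury metric and of passengers per event.
by_model = events.groupby("Model").agg(
    **{
        "Percent.Serious.and.Fatal": ("Percent.Serious.and.Fatal", "mean"),
        "Mean.Passengers": ("Total.Passengers", "mean"),
    }
)

CAPACITY_CUTOFF = 20  # hypothetical threshold, taken from the plot titles

# Assumption: mean passengers per event approximates a model's capacity.
small = by_model[by_model["Mean.Passengers"] < CAPACITY_CUTOFF]
large = by_model[by_model["Mean.Passengers"] >= CAPACITY_CUTOFF]

# Ten small models with the lowest share of serious or fatal injuries.
top_ten_small = small.sort_values("Percent.Serious.and.Fatal").head(10)
top_ten_small_planes_model = list(top_ten_small.index)
top_ten_small_planes_percmajor = list(top_ten_small["Percent.Serious.and.Fatal"])

# Larger models, safest first, in the shape the plotting cell expects.
by_model_final_over_20 = large.sort_values("Percent.Serious.and.Fatal")
```

With names defined along these lines in an earlier cell, the plotting cell would have the inputs it expects. The real definitions may differ, for example if "over 20" refers to a different capacity measure.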