# Aviation Risk Project

### Business Problem

Company X is expanding into new industries to diversify its portfolio. Specifically, it is interested in purchasing and operating airplanes for commercial and private enterprises, but it has no experience with the potential risks of aircraft. You are charged with determining which aircraft pose the lowest risk for the company as it starts this new business endeavor. You must then translate your findings into actionable insights that the head of the new aviation division can use to help decide which aircraft to purchase.

- Goals
    - Identify causes of aviation accidents and risk factors
    - Assess aircraft options to find lowest risk options
    - Also consider: regulatory compliance, maintenance requirements, insurance costs, industry standards, capacity, fuel efficiency
- Data
    - Aviation accidents from 1962 to 2023
    - Includes civil accidents and selected incidents in the US and international waters
- Methods
- Results

In [17]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [18]:
# import data file 

df = pd.read_csv('../data/AviationData.csv', encoding='latin-1')
df.head()


Unnamed: 0,Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Latitude,Longitude,Airport.Code,Airport.Name,...,Purpose.of.flight,Air.carrier,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date
0,20001218X45444,Accident,SEA87LA080,1948-10-24,"MOOSE CREEK, ID",United States,,,,,...,Personal,,2.0,0.0,0.0,0.0,UNK,Cruise,Probable Cause,
1,20001218X45447,Accident,LAX94LA336,1962-07-19,"BRIDGEPORT, CA",United States,,,,,...,Personal,,4.0,0.0,0.0,0.0,UNK,Unknown,Probable Cause,19-09-1996
2,20061025X01555,Accident,NYC07LA005,1974-08-30,"Saltville, VA",United States,36.9222,-81.8781,,,...,Personal,,3.0,,,,IMC,Cruise,Probable Cause,26-02-2007
3,20001218X45448,Accident,LAX96LA321,1977-06-19,"EUREKA, CA",United States,,,,,...,Personal,,2.0,0.0,0.0,0.0,IMC,Cruise,Probable Cause,12-09-2000
4,20041105X01764,Accident,CHI79FA064,1979-08-02,"Canton, OH",United States,,,,,...,Personal,,1.0,2.0,,0.0,VMC,Approach,Probable Cause,16-04-1980


In [19]:
# convert Event.Date column from object to datetime

df['Event.Date'] = pd.to_datetime(df['Event.Date'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88889 entries, 0 to 88888
Data columns (total 31 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   Event.Id                88889 non-null  object        
 1   Investigation.Type      88889 non-null  object        
 2   Accident.Number         88889 non-null  object        
 3   Event.Date              88889 non-null  datetime64[ns]
 4   Location                88837 non-null  object        
 5   Country                 88663 non-null  object        
 6   Latitude                34382 non-null  object        
 7   Longitude               34373 non-null  object        
 8   Airport.Code            50249 non-null  object        
 9   Airport.Name            52790 non-null  object        
 10  Injury.Severity         87889 non-null  object        
 11  Aircraft.damage         85695 non-null  object        
 12  Aircraft.Category       32287 non-null  object
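
The stated coverage is 1962 to 2023, yet the first row shown above is dated 1948, so a quick range check after parsing may be worthwhile. The sketch below uses a small made-up frame (`sample` is a hypothetical stand-in for `df`) and assumes that unparseable dates should become `NaT` rather than raise an error.

```python
import pandas as pd

# Toy stand-in for df['Event.Date']; the real values come from AviationData.csv
sample = pd.DataFrame(
    {'Event.Date': ['1948-10-24', '1962-07-19', '2022-12-26', 'not a date']}
)

# errors='coerce' converts unparseable strings to NaT instead of raising
sample['Event.Date'] = pd.to_datetime(sample['Event.Date'], errors='coerce')

# Count values that failed to parse and values before the stated start year
n_unparsed = int(sample['Event.Date'].isna().sum())
n_before_1962 = int((sample['Event.Date'] < pd.Timestamp('1962-01-01')).sum())

print(f'unparsed: {n_unparsed}, before 1962: {n_before_1962}')
```

Whether pre-1962 rows should be dropped or kept is a judgment call for the analysis; the check only makes them visible.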

In [20]:
# create new data frame with only US incidents

df['Country'].value_counts()
df_us = df[df['Country'] == 'United States']
df_us

Unnamed: 0,Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Latitude,Longitude,Airport.Code,Airport.Name,...,Purpose.of.flight,Air.carrier,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date
0,20001218X45444,Accident,SEA87LA080,1948-10-24,"MOOSE CREEK, ID",United States,,,,,...,Personal,,2.0,0.0,0.0,0.0,UNK,Cruise,Probable Cause,
1,20001218X45447,Accident,LAX94LA336,1962-07-19,"BRIDGEPORT, CA",United States,,,,,...,Personal,,4.0,0.0,0.0,0.0,UNK,Unknown,Probable Cause,19-09-1996
2,20061025X01555,Accident,NYC07LA005,1974-08-30,"Saltville, VA",United States,36.9222,-81.8781,,,...,Personal,,3.0,,,,IMC,Cruise,Probable Cause,26-02-2007
3,20001218X45448,Accident,LAX96LA321,1977-06-19,"EUREKA, CA",United States,,,,,...,Personal,,2.0,0.0,0.0,0.0,IMC,Cruise,Probable Cause,12-09-2000
4,20041105X01764,Accident,CHI79FA064,1979-08-02,"Canton, OH",United States,,,,,...,Personal,,1.0,2.0,,0.0,VMC,Approach,Probable Cause,16-04-1980
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88884,20221227106491,Accident,ERA23LA093,2022-12-26,"Annapolis, MD",United States,,,,,...,Personal,,0.0,1.0,0.0,0.0,,,,29-12-2022
88885,20221227106494,Accident,ERA23LA095,2022-12-26,"Hampton, NH",United States,,,,,...,,,0.0,0.0,0.0,0.0,,,,
88886,20221227106497,Accident,WPR23LA075,2022-12-26,"Payson, AZ",United States,341525N,1112021W,PAN,PAYSON,...,Personal,,0.0,0.0,0.0,1.0,VMC,,,27-12-2022
88887,20221227106498,Accident,WPR23LA076,2022-12-26,"Morgan, UT",United States,,,,,...,Personal,MC CESSNA 210N LLC,0.0,0.0,0.0,0.0,,,,


In [21]:
df_us['Investigation.Type'].value_counts()

Accident    79906
Incident     2342
Name: Investigation.Type, dtype: int64

In [22]:
# convert all location data to lowercase
# (.assign returns a new DataFrame, which avoids the SettingWithCopyWarning raised when writing to a filtered slice)

df_us = df_us.assign(Location=df_us['Location'].str.lower())
df_us['Location'].value_counts()

anchorage, ak          548
miami, fl              275
houston, tx            271
albuquerque, nm        265
chicago, il            256
                      ... 
shady grove cor, va      1
new vienna, ia           1
andersonville, ga        1
chippewa lake, oh        1
south paris, me          1
Name: Location, Length: 17588, dtype: int64

In [23]:
# pull state info from locations (the text after the last comma)

df_us = df_us.assign(State=df_us['Location'].str.split(',').str[-1].str.strip().str.upper())
df_us['State'].value_counts()

CA    8857
TX    5913
FL    5825
AK    5672
AZ    2834
      ... 
PO      14
GU       8
VI       6
UN       3
CB       1
Name: State, Length: 61, dtype: int64
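
The tail of the counts above includes codes such as `UN` and `CB` that may not be real state abbreviations. One way to surface them is to compare the extracted codes against a reference set. This is a minimal sketch on made-up locations; `known_codes` is a hypothetical stand-in for the abbreviations in `USState_Codes.csv`.

```python
import pandas as pd

# Illustrative locations only; the real values come from df_us['Location']
locations = pd.Series(
    ['anchorage, ak', 'miami, fl', 'gulf of mexico, gm', 'unknown, un', None]
)

# Same extraction as in the cell above: text after the last comma, upper-cased
state = locations.str.split(',').str[-1].str.strip().str.upper()

# Hypothetical reference set; in the notebook this would be built from state_codes
known_codes = {'AK', 'FL', 'GM'}

# Keep only non-missing codes that are absent from the reference set
unrecognized = state[state.notna() & ~state.isin(known_codes)]
print(unrecognized.value_counts())
```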

In [24]:
state_codes = pd.read_csv('../data/USState_Codes.csv')
state_codes

Unnamed: 0,US_State,Abbreviation
0,Alabama,AL
1,Alaska,AK
2,Arizona,AZ
3,Arkansas,AR
4,California,CA
...,...,...
57,Virgin Islands,VI
58,Washington_DC,DC
59,Gulf of mexico,GM
60,Atlantic ocean,AO


In [25]:
# merge state names to abbreviated codes

df_us = pd.merge(df_us, state_codes, how='left', left_on='State', right_on='Abbreviation')
df_us = df_us.drop(columns=['Abbreviation'])  # positional axis argument is no longer accepted in pandas 2.x
df_us = df_us.rename(columns={'State':'State.Code','US_State':"State.Name"})
df_us.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 82248 entries, 0 to 82247
Data columns (total 33 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   Event.Id                82248 non-null  object        
 1   Investigation.Type      82248 non-null  object        
 2   Accident.Number         82248 non-null  object        
 3   Event.Date              82248 non-null  datetime64[ns]
 4   Location                82237 non-null  object        
 5   Country                 82248 non-null  object        
 6   Latitude                32265 non-null  object        
 7   Longitude               32255 non-null  object        
 8   Airport.Code            49189 non-null  object        
 9   Airport.Name            51654 non-null  object        
 10  Injury.Severity         82140 non-null  object        
 11  Aircraft.damage         80269 non-null  object        
 12  Aircraft.Category       28154 non-null  object
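
A left merge silently leaves `NaN` for codes with no match and can duplicate rows if the lookup table repeats a key. The sketch below shows one way to guard against both using the `validate` and `indicator` parameters of `pd.merge`. The `events` and `codes` frames are small hypothetical stand-ins for `df_us` and `state_codes`.

```python
import pandas as pd

# Toy stand-ins for df_us and state_codes
events = pd.DataFrame({'State': ['CA', 'TX', 'UN']})
codes = pd.DataFrame(
    {'US_State': ['California', 'Texas'], 'Abbreviation': ['CA', 'TX']}
)

merged = pd.merge(
    events, codes,
    how='left', left_on='State', right_on='Abbreviation',
    validate='many_to_one',  # raises MergeError if an abbreviation repeats in codes
    indicator=True,          # adds a '_merge' column describing each row's origin
)

# Rows whose state code had no match in the lookup table
unmatched = merged.loc[merged['_merge'] == 'left_only', 'State']
print(unmatched.tolist())
```

The `_merge` column can be dropped once the unmatched codes have been reviewed.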

In [None]:
# Regions are based on those at
# http://nationalgeographic.org/maps/united-states-regions/
# Includes District of Columbia as a state

regions_to_states = {
    'South': ['West Virginia', 'District of Columbia', 'Maryland', 'Virginia',
              'Kentucky', 'Tennessee', 'North Carolina', 'Mississippi',
              'Arkansas', 'Louisiana', 'Alabama', 'Georgia', 'South Carolina',
              'Florida', 'Delaware'],
    'Southwest': ['Arizona', 'New Mexico', 'Oklahoma', 'Texas'],
    'West': ['Washington', 'Oregon', 'California', 'Nevada', 'Idaho', 'Montana',
             'Wyoming', 'Utah', 'Colorado', 'Alaska', 'Hawaii'],
    'Midwest': ['North Dakota', 'South Dakota', 'Nebraska', 'Kansas', 'Minnesota',
                'Iowa', 'Missouri', 'Wisconsin', 'Illinois', 'Michigan', 'Indiana',
                'Ohio'],
    'Northeast': ['Maine', 'Vermont', 'New York', 'New Hampshire', 'Massachusetts',
                  'Rhode Island', 'Connecticut', 'New Jersey', 'Pennsylvania']
}

states_to_regions = {
    'Washington': 'West', 'Oregon': 'West', 'California': 'West', 'Nevada': 'West',
    'Idaho': 'West', 'Montana': 'West', 'Wyoming': 'West', 'Utah': 'West',
    'Colorado': 'West', 'Alaska': 'West', 'Hawaii': 'West', 'Maine': 'Northeast',
    'Vermont': 'Northeast', 'New York': 'Northeast', 'New Hampshire': 'Northeast',
    'Massachusetts': 'Northeast', 'Rhode Island': 'Northeast', 'Connecticut': 'Northeast',
    'New Jersey': 'Northeast', 'Pennsylvania': 'Northeast', 'North Dakota': 'Midwest',
    'South Dakota': 'Midwest', 'Nebraska': 'Midwest', 'Kansas': 'Midwest',
    'Minnesota': 'Midwest', 'Iowa': 'Midwest', 'Missouri': 'Midwest', 'Wisconsin': 'Midwest',
    'Illinois': 'Midwest', 'Michigan': 'Midwest', 'Indiana': 'Midwest', 'Ohio': 'Midwest',
    'West Virginia': 'South', 'District of Columbia': 'South', 'Maryland': 'South',
    'Virginia': 'South', 'Kentucky': 'South', 'Tennessee': 'South', 'North Carolina': 'South',
    'Mississippi': 'South', 'Arkansas': 'South', 'Louisiana': 'South', 'Alabama': 'South',
    'Georgia': 'South', 'South Carolina': 'South', 'Florida': 'South', 'Delaware': 'South',
    'Arizona': 'Southwest', 'New Mexico': 'Southwest', 'Oklahoma': 'Southwest',
    'Texas': 'Southwest'}
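
The second dictionary repeats the first in inverted form, so it could be derived rather than typed out. The sketch below shows that inversion on an abbreviated dictionary and then maps state names to regions. It assumes the names being mapped look like those in `state_codes`, where the capital appears as `Washington_DC` rather than `District of Columbia`. The `names` series is illustrative.

```python
import pandas as pd

# Abbreviated version of regions_to_states, for illustration only
regions_to_states = {
    'Southwest': ['Arizona', 'Texas'],
    'South': ['Florida', 'District of Columbia'],
}

# Invert region -> [states] into state -> region
states_to_regions = {
    state: region
    for region, states in regions_to_states.items()
    for state in states
}

# Illustrative state names in the style of the state_codes lookup table
names = pd.Series(['Texas', 'Florida', 'Washington_DC', 'Gulf of mexico'])

# Align spelling with the dictionary keys before mapping
names = names.replace({'Washington_DC': 'District of Columbia'})

# Names with no region (for example bodies of water) become NaN
region = names.map(states_to_regions)
print(region.tolist())
```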

In [None]:
# clean airport names - lots of private air strips

df_us['Airport.Name'] = df_us['Airport.Name'].str.lower()
df_us['Airport.Name'] = df_us['Airport.Name'].replace(['private airstrip', 'private strip'], 'private')
df_us['Airport.Name'].value_counts()

In [None]:
df_us['Injury.Severity'].value_counts().head(20)

In [None]:
# split number from Fatal(#) to count number of fatalities

df_us['Num.Fatalities'] = df_us['Injury.Severity'].str.split('(').str[-1]
df_us['Num.Fatalities'] = df_us['Num.Fatalities'].str.split(')').str[0]
df_us['Num.Fatalities'] = pd.to_numeric(df_us['Num.Fatalities'], errors='coerce').fillna(0).astype(int)
df_us['Num.Fatalities'].value_counts()
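
An alternative to the two-step split is a single regular expression that captures the digits inside the parentheses. This is a sketch on made-up values, assuming the column holds labels of the form `Fatal(#)` alongside labels without a count. It is not necessarily how the notebook's author would write it.

```python
import pandas as pd

# Illustrative labels; the real values come from df_us['Injury.Severity']
severity = pd.Series(['Fatal(2)', 'Non-Fatal', 'Incident', 'Fatal', None, 'Fatal(10)'])

# Capture the digits between parentheses; rows without a match become NaN
num = severity.str.extract(r'\((\d+)\)', expand=False)

# Convert to integers, treating "no count present" as zero
num_fatalities = pd.to_numeric(num, errors='coerce').fillna(0).astype(int)
print(num_fatalities.tolist())
```

As in the original cell, a bare `Fatal` label with no count ends up as 0, so cross-checking against `Total.Fatal.Injuries` may be worthwhile.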

In [None]:
# update labeling in severity column

df_us['Injury.Severity'] = df_us['Injury.Severity'].str.split('(').str[0]
df_us['Injury.Severity'].value_counts()

In [None]:
# clean registration column

df_us['Registration.Number'] = df_us['Registration.Number'].str.upper()

df_us['Registration.Number'].value_counts()

In [None]:
# clean weather condition

df_us['Weather.Condition'] = df_us['Weather.Condition'].str.upper()
df_us['Weather.Condition'].value_counts()

# VMC - Visual Meteorological Conditions - generally clear and good visibility; pilots can navigate and operate aircraft by visual reference to the ground
# IMC - Instrument Meteorological Conditions - reduced visibility due to factors like fog, rain, or low clouds; pilots may need to rely on instruments for navigation and control
# UNK - Unknown
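
For plots and summaries, the codes described above could be mapped to readable labels. The sketch below is one possible approach on made-up values. The label text and the choice to treat missing values as unknown are assumptions, not part of the dataset.

```python
import pandas as pd

# Illustrative values; the real ones come from df_us['Weather.Condition']
weather = pd.Series(['VMC', 'imc', 'Unk', None])

# Hypothetical display labels for the three codes
labels = {'VMC': 'Visual', 'IMC': 'Instrument', 'UNK': 'Unknown'}

# Normalize case, map to labels, and (by assumption) treat missing as unknown
weather_label = weather.str.upper().map(labels).fillna('Unknown')
print(weather_label.value_counts())
```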