# Aviation Accidents Analysis

You are part of a consulting firm tasked with analyzing commercial and passenger jet airline safety. The client (an airline/airplane insurer) wants to know which types of aircraft (makes/models) exhibit low rates of total destruction and a low likelihood of fatal or serious passenger injuries in the event of an accident. They are also interested in any general variables/conditions that might be at play. Your analysis will be based on aviation accident data accumulated from 1948 to 2023.

Our client is only interested in airplane makes/models that are professional builds and could potentially still be active. Assume a maximum lifetime of 40 years before a make/model is retired, and filter your data accordingly (i.e. from 1983 onwards). They would also like separate recommendations for small aircraft vs. larger passenger models. **In addition, make sure that the claims you make are statistically robust and that you have enough samples when making comparisons between groups.**


In this summative assessment you will demonstrate your ability to:
- **Use Pandas to load, inspect, and clean the dataset appropriately.**
- **Transform relevant columns to create measures that address the problem at hand.**
- Conduct EDA: use visualization and statistical measures to systematically understand the structure of the data.
- Recommend a set of airplanes and makes conforming to the client's request and identify at least *two* factors contributing to airplane safety. You must provide supporting evidence (visuals, summary statistics, tables) for each claim you make.

### Make relevant library imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Data Loading and Inspection

### Load in data from the relevant directory and inspect the dataframe.
- inspect NaNs, datatypes, and summary statistics

In [None]:
df = pd.read_csv('./data/AviationData.csv', encoding='latin-1', low_memory=False)

**Datatypes**

In [None]:
df.info()

#### Summary Stats

In [None]:
df.describe()
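
`df.info()` reports non-null counts, but the share of missing values per column is often easier to compare. A minimal sketch on a made-up toy frame (the values below are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the accident table; all values are invented for illustration.
toy = pd.DataFrame({
    "Make": ["CESSNA", "PIPER", None, "BOEING"],
    "Total.Fatal.Injuries": [0.0, np.nan, np.nan, 2.0],
    "Weather.Condition": ["VMC", "IMC", "VMC", "VMC"],
})

# isna() gives a boolean frame; its column mean is the fraction of missing values.
nan_share = toy.isna().mean().sort_values(ascending=False)
print(nan_share)
```

The same two calls applied to `df` would rank the real columns by how sparse they are.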

## Data Cleaning

### Filtering aircraft and events

We want to filter the dataset down to the aircraft the client is interested in:
- inspect relevant columns
- figure out any reasonable imputations
- filter the dataset

In [None]:
air_craft_se = df['Aircraft.Category'] # Series
print(f'NaN Values: {air_craft_se.isna().sum()}') # NaN values

print('\nCount per Category')
air_craft_se.value_counts() # count of non-NaN values

**Reasonable Imputation**

In [None]:
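df['Aircraft.Category'] = df['Aircraft.Category'].fillna('Airplane') # assumption: records with a missing category are airplanes
air_craft_se = df['Aircraft.Category'] # refresh the Series after imputation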

print('\nNew Count per Category (Imputation)')
air_craft_se.value_counts()

**Airplane DataFrame**

In [None]:
airplane_df = df[air_craft_se == 'Airplane'].copy() # explicit copy so later column assignments do not warn about writing to a slice of df
airplane_df.info()

**Retain Only 'Professional Builds'**

In [None]:
airplane_df['Amateur.Built'].value_counts() # count no. of amateur builds

In [None]:
airplane_df = airplane_df[airplane_df['Amateur.Built'] == 'No'] # remove amateur builds
airplane_df['Amateur.Built'].value_counts()

**Retain Events of the last 40 Years**

In [None]:
airplane_df = airplane_df.dropna(subset=['Publication.Date']) # drop rows without a publication date
airplane_df['Report.Date'] = pd.to_datetime(airplane_df['Publication.Date']) # convert to pandas datetime

airplane_df = airplane_df.sort_values(by=['Report.Date']) # sort by date, ascending
airplane_df = airplane_df[airplane_df['Report.Date'] >= '1983-01-01'] # keep 1983 onwards (inclusive), i.e. the last 40 years

airplane_df.info()
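
The cutoff logic can be checked in isolation. A minimal sketch on toy data, assuming a hypothetical `event_date` column (not a column name confirmed by this notebook) and an inclusive 1983-01-01 boundary:

```python
import pandas as pd

# Toy data; `event_date` is a hypothetical column name used only for illustration.
toy = pd.DataFrame({
    "event_date": ["1982-12-31", "1983-01-01", "2001-06-15"],
})
toy["event_date"] = pd.to_datetime(toy["event_date"], format="%Y-%m-%d")

# Inclusive cutoff: records dated exactly 1983-01-01 are kept.
cutoff = pd.Timestamp("1983-01-01")
recent = toy[toy["event_date"] >= cutoff]
print(len(recent))
```

Using an explicit `pd.Timestamp` and `>=` makes the boundary behaviour visible, which a bare `> '1983'` string comparison does not.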

### Cleaning and constructing Key Measurables

Injuries and robustness to destruction are key points of interest for the client. Clean and impute relevant columns and then create derived fields that best quantify what the client wishes to track. **Use commenting or markdown to explain any cleaning assumptions as well as any derived columns you create.**

**Construct metric for fatal/serious injuries**

*Hint:* Estimate the total number of passengers on each flight. The likelihood of serious / fatal injury can be estimated as a fraction from this.

**Filter for 'Injury' Columns**

In [None]:
filter_col = airplane_df.columns.str.contains('njur') # boolean mask over columns whose name contains 'njur'
inj_cols = airplane_df.columns[filter_col][1:] # skip the first match and keep the remaining (numeric) injury-count columns

print('NaNs for Injury Columns')
airplane_df[inj_cols].isna().sum() # calc NaN per injury type

In [None]:
airplane_df[inj_cols].isna().all(axis=1).sum()

**Remove NaNs on 'Injury' Columns**

In [None]:
airplane_df = airplane_df[~airplane_df[inj_cols].isna().all(axis=1)] # drop rows where every injury column is NaN

print('\nRemaining \'true\' NaNs')
airplane_df[inj_cols].isna().sum()

**Replace Remaining NaNs with Zero**

In [None]:
airplane_df.loc[:,inj_cols] = airplane_df[inj_cols].fillna(0) # fill NaNs with zero
airplane_df[inj_cols].isna().sum() # NaNs are now absent
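
The two-step treatment of missing injury counts (drop rows with no injury information at all, then treat the remaining gaps as zero) can be sketched on toy data. The values are invented, and reading a partial NaN as "zero people in that category" is the cleaning assumption being illustrated:

```python
import numpy as np
import pandas as pd

# Toy injury counts; row 0 has no information at all, rows 1 and 2 are partially filled.
toy = pd.DataFrame({
    "Total.Fatal.Injuries": [np.nan, 1.0, np.nan],
    "Total.Uninjured": [np.nan, np.nan, 2.0],
})

# Step 1: rows where every injury column is NaN carry no usable information.
all_missing = toy.isna().all(axis=1)

# Step 2: in the remaining rows, assume a missing count means zero.
cleaned = toy[~all_missing].fillna(0)
print(cleaned)
```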

**Calculate Injury Rate**

In [None]:
airplane_df.loc[:,'N_passenger'] = airplane_df.loc[:,inj_cols].sum(axis=1) # estimated people on board: sum of all injury-count columns

airplane_df = airplane_df[airplane_df['N_passenger'] > 0] # keep only records with at least one person accounted for

passengers = airplane_df['N_passenger'] # estimated people on board
serious_inj = airplane_df['Total.Serious.Injuries'] # serious injuries
fatal_inj = airplane_df['Total.Fatal.Injuries'] # fatal injuries

airplane_df['ser_inj_rate'] = (fatal_inj + serious_inj)/passengers # fraction of people on board with fatal or serious injuries
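
As a worked example of the rate, here is a minimal sketch on three invented records. The `Total.Minor.Injuries` and `Total.Uninjured` column names are assumptions about what the injury-count columns contain:

```python
import pandas as pd

# Three made-up accident records with counts per outcome category.
toy = pd.DataFrame({
    "Total.Fatal.Injuries": [0, 1, 2],
    "Total.Serious.Injuries": [0, 1, 0],
    "Total.Minor.Injuries": [1, 0, 0],
    "Total.Uninjured": [3, 2, 0],
})

# People on board are estimated as the sum over all outcome categories.
occupants = toy.sum(axis=1)

# Share of occupants who were fatally or seriously injured.
rate = (toy["Total.Fatal.Injuries"] + toy["Total.Serious.Injuries"]) / occupants
print(rate.tolist())
```

The rate is bounded between 0 (nobody fatally or seriously hurt) and 1 (everyone on board was).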

**View New Columns**

In [None]:
airplane_df.head()

**Aircraft.damage**
- identify and execute any cleaning tasks
- construct a derived column tracking whether an aircraft was destroyed or not.

**Remove NaNs and unknowns**

In [None]:
airplane_df['Aircraft.damage'].unique()

In [None]:
airplane_df['Aircraft.damage'].value_counts()

In [None]:
print('\nReplace unknowns with Nans')

airplane_df['Aircraft.damage'] = airplane_df['Aircraft.damage'].replace({'Unknown':np.nan}) # turn unkws to NaN
airplane_df.dropna(subset=['Aircraft.damage'], inplace=True) # drop NaNs from A.d col
airplane_df['Aircraft.damage'].value_counts() # confirm dropped NaNs

**Boolean Masking for 'Destroyed'**

In [None]:
airplane_df['is_destroyed'] = (airplane_df['Aircraft.damage'] == 'Destroyed').astype(int)
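
Encoding the flag as 0/1 means its mean within a group is that group's destruction rate. A minimal sketch on toy data (makes `A` and `B` are placeholders), reporting the sample size next to the rate so that small groups are easy to spot:

```python
import pandas as pd

# Toy records; makes "A" and "B" are placeholders, not real manufacturers.
toy = pd.DataFrame({
    "Make": ["A", "A", "A", "B", "B"],
    "Aircraft.damage": ["Destroyed", "Substantial", "Minor", "Destroyed", "Destroyed"],
})
toy["is_destroyed"] = (toy["Aircraft.damage"] == "Destroyed").astype(int)

# Mean of a 0/1 flag = fraction destroyed; size = number of records behind that fraction.
summary = toy.groupby("Make")["is_destroyed"].agg(rate="mean", n="size")
print(summary)
```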

In [None]:
airplane_df.head()

### Investigate the *Make* column
- Identify cleaning tasks here
- List cleaning tasks clearly in markdown
- Execute the cleaning tasks
- For your analysis, keep only Makes with a reasonable number of records (you can put the threshold at 50, though lower could work as well)

In [None]:
airplane_df['Make'].value_counts()

**Consolidate**

In [None]:
# Canonical make -> spelling variants observed in the data that should be merged into it
make_variants = {
    "CESSNA": ["CESSNA AIRCRAFT CO", "CESSNA AIRCRAFT COMPANY", "CESSNA AIRCRAFT", "Cessna"],
    "PIPER": ["PIPER AIRCRAFT INC", "PIPER AIRCRAFT CORPORATION", "PIPER AIRCRAFT", "Piper"],
    "BEECH": ["BEECHCRAFT", "HAWKER BEECHCRAFT", "HAWKER BEECHCRAFT CORP", "HAWKER BEECH", "Beech"],
    "BOEING": ["THE BOEING COMPANY", "BOEING COMPANY", "BOEING STEARMAN", "Boeing"],
    "MOONEY": ["MOONEY AIRCRAFT CORP.", "MOONEY AIRPLANE CO INC", "MOONEY INTERNATIONAL", "Mooney"],
    "GRUMMAN": ["GRUMMAN ACFT ENG COR-SCHWEIZER", "GRUMMAN AMERICAN AVN. CORP", "Grumman-schweizer", "Schweizer", "Grumman"],
    "AIRBUS": ["AIRBUS INDUSTRIES", "Airbus Industrie", "Airbus"],
    "MAULE": ["MAULE AIRCRAFT CORP", "Maule"],
    "AERONCA": ["AERONCA AIRCRAFT CORPORATION", "AERONCA CHAMPION", "AERONCA CHAMP", "Aeronca", "Champion"],
    "AIR TRACTOR": ["AIR TRACTOR INC", "Air Tractor"],
    "CIRRUS": ["CIRRUS DESIGN CORP", "CIRRUS DESIGN CORP.", "Cirrus"],
    "ERCOUPE": ["ERCOUPE (ENG & RESEARCH CORP.)", "Ercoupe"],
    "AVIAT": ["AVIAT AIRCRAFT INC", "Aviat"],
    "ROCKWELL": ["ROCKWELL INTERNATIONAL", "Rockwell"],
}

# Flatten to {variant: canonical} and apply all replacements in one pass
make_map = {variant: canonical for canonical, variants in make_variants.items() for variant in variants}
airplane_df["Make"] = airplane_df["Make"].replace(make_map)

In [None]:
make_counts = airplane_df['Make'].value_counts() # records per make, computed once
filtered_makes = make_counts[make_counts > 50] # keep makes with more than 50 records
filtered_makes.head(25)
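
Many spelling variants differ only in letter case or stray whitespace, so normalizing the text before counting can shrink the list of manual replacements. A minimal sketch on invented labels, with an illustrative threshold of 3 rather than the 50 used above:

```python
import pandas as pd

# Made-up make labels that differ only by case and trailing whitespace.
makes = pd.Series(["Cessna", "CESSNA ", "cessna", "Piper", "PIPER"])

# Strip surrounding whitespace and upper-case so identical makes collapse together.
normalized = makes.str.strip().str.upper()
counts = normalized.value_counts()

min_records = 3  # illustrative threshold for this toy example only
kept = counts[counts >= min_records]
print(kept)
```

Variants that differ by more than case (for example a company suffix) would still need an explicit mapping.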

**Filter Based On Makes**

In [None]:
airplane_df = airplane_df[airplane_df['Make'].isin(filtered_makes.index)]

In [None]:
airplane_df.info()

### Inspect Model column
- Get rid of any NaNs.
- Inspect the column and counts for each model/make. Are model labels unique to each make?
- If not, create a derived column that is a unique identifier for a given plane type.

In [None]:
airplane_df.dropna(subset=['Model'], inplace=True)

**Combined Identifier (Make & Model)**

In [None]:
airplane_df['make_model'] = airplane_df['Make'] + '_' + airplane_df['Model'].str.upper()
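
Whether model labels are unique to a make can be checked by counting distinct makes per model. A minimal sketch on toy data (the make/model pairs are illustrative):

```python
import pandas as pd

# Toy data in which the model label "180" appears under two different makes.
toy = pd.DataFrame({
    "Make": ["CESSNA", "PIPER", "CESSNA", "BEECH"],
    "Model": ["180", "180", "172", "A36"],
})

# Number of distinct makes sharing each model label; >1 means the label is ambiguous.
makes_per_model = toy.groupby("Model")["Make"].nunique()
shared = makes_per_model[makes_per_model > 1]

# A combined key removes the ambiguity.
toy["make_model"] = toy["Make"] + "_" + toy["Model"].str.upper()
print(shared)
```

A non-empty `shared` result is the reason a combined make/model key is needed.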

### Cleaning other columns
- there are other columns containing data that might be related to the outcome of an accident. We list a few here:
- Engine.Type
- Weather.Condition
- Number.of.Engines
- Purpose.of.flight
- Broad.phase.of.flight

Inspect and identify potential cleaning tasks in each of the above columns. Execute those cleaning tasks. 

**Note**: You do not necessarily need to impute or drop NaNs here.

**Clean 'Engine.Type'**

In [None]:
airplane_df['Engine.Type'].value_counts()

In [None]:
print('\nReplace \'Unknown\' & \'UNK\'')

In [None]:
airplane_df['Engine.Type'] = airplane_df['Engine.Type'].replace({'Unknown': np.nan, 'UNK': np.nan}) # assign back; inplace on a column selection may not update the DataFrame
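
A minimal sketch of turning placeholder labels into NaN by assigning the result back to the column, which behaves the same regardless of pandas' copy-on-write settings. The toy engine types are illustrative:

```python
import numpy as np
import pandas as pd

# Toy column with two placeholder spellings for "unknown".
toy = pd.DataFrame({"Engine.Type": ["Reciprocating", "Unknown", "UNK", "Turbo Fan"]})

# replace() returns a new Series; assigning it back updates the DataFrame explicitly.
toy["Engine.Type"] = toy["Engine.Type"].replace({"Unknown": np.nan, "UNK": np.nan})

n_missing = toy["Engine.Type"].isna().sum()
print(n_missing)
```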

**Filter Out Engine Types That Occur Only Once**

In [None]:
engine_counts = airplane_df['Engine.Type'].value_counts() # records per engine type (NaN excluded)
filtr_engine = engine_counts[engine_counts > 1] # engine types with more than one record

airplane_df = airplane_df[airplane_df['Engine.Type'].isin(filtr_engine.index)] # keep those types; rows with NaN Engine.Type are dropped too

airplane_df['Engine.Type'].value_counts() # check work

In [None]:
airplane_df['Number.of.Engines'].value_counts()

**Removing Zero 'Number.of.Engines'**

In [None]:
airplane_df = airplane_df[airplane_df['Number.of.Engines'] > 0.0] # drops zero-engine records; rows with a NaN engine count are dropped as well

print('\nCheck Work')
airplane_df['Number.of.Engines'].value_counts()

**Clean 'Weather.Condition'**

In [None]:
print('\nUnique Values')
airplane_df['Weather.Condition'].unique()

**Replace 'Weather.Condition' Unknowns with NaNs**

In [None]:
airplane_df['Weather.Condition'] = airplane_df['Weather.Condition'].replace({'UNK': np.nan, 'Unk': np.nan}) # assign back; inplace on a column selection may not update the DataFrame

In [None]:
print('\n Check Work')
airplane_df['Weather.Condition'].value_counts()

### Column Removal
- inspect the dataframe and drop any columns that have too many NaNs

In [None]:
airplane_df.info()

**Drop Cols with Many NaNs**

In [None]:
airplane_df = airplane_df.drop(columns=['FAR.Description', 'Air.carrier', 'Schedule'])
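
Instead of reading the sparse columns off `info()` by eye, the choice can be made explicit with a missing-share threshold. A minimal sketch on toy data, where the 0.5 cutoff is an arbitrary illustrative choice:

```python
import numpy as np
import pandas as pd

# Toy frame: "Schedule" is mostly empty, "Weather.Condition" has a single gap.
toy = pd.DataFrame({
    "Make": ["A", "B", "C", "D"],
    "Schedule": [np.nan, np.nan, np.nan, "NSCH"],
    "Weather.Condition": ["VMC", np.nan, "IMC", "VMC"],
})

max_nan_share = 0.5  # illustrative cutoff; tune to the analysis at hand
nan_share = toy.isna().mean()

# Columns whose missing share exceeds the cutoff are dropped.
sparse_cols = nan_share[nan_share > max_nan_share].index.tolist()
trimmed = toy.drop(columns=sparse_cols)
print(sparse_cols)
```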

### Save DataFrame to csv
- it's generally useful to save data to file/server once it's in a sufficiently cleaned or intermediate state
- the data can then be loaded directly in another notebook for further analysis
- this helps keep your notebooks and workflow readable, clean and modularized

In [None]:
airplane_df.to_csv('data/air_cleaned.csv', index=False)
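
CSV files do not store dtypes, so datetime columns come back as plain strings unless they are parsed on load. A minimal sketch of the round trip using an in-memory buffer in place of a file (the toy rows are invented):

```python
import io

import pandas as pd

# Toy cleaned frame with one datetime column.
toy = pd.DataFrame({
    "make_model": ["CESSNA_172", "PIPER_PA-28"],
    "Report.Date": pd.to_datetime(["1999-05-01", "2010-11-20"]),
})

# Write to an in-memory buffer instead of disk, then rewind and read it back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype that the CSV format itself does not keep.
reloaded = pd.read_csv(buffer, parse_dates=["Report.Date"])
print(reloaded.dtypes)
```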