## Airplane Purchase Risk Assessment


### Project Overview

The objective of this analysis is to identify low-risk aircraft for purchase by our company, Mawingu Group of Companies, as we gear towards expanding our portfolio and breaking into the aviation industry. We aim to operate airplanes for commercial and private enterprises, so we need to determine the potential risks of different aircraft.
This project analyzes aviation accident data from the National Transportation Safety Board, covering civil aviation accidents and selected incidents that occurred in the United States and international waters from 1962 to 2023.

### Business Problem
In order for Mawingu Group of Companies to expand its portfolio into the aviation industry, we have to understand the risks associated with purchasing and operating airplanes for both commercial and private enterprises. Choosing aircraft models known to be safe is important not only for our customers' safety but also for the business's finances and reputation.

Assessing historical aviation accident data offers an opportunity to identify key trends, patterns and risk factors linked to different airplane models. This will enable us to make informed decisions backed by the data, as we will have aircraft models with proven track records of safety to choose from.

### Data Understanding

The goal is to explore the National Transportation Safety Board data and identify key features that we can use to assess the risks associated with various airplanes. These include injury severity, engine type, and aircraft damage, among others.

In [2]:
import pandas as pd
import numpy as np

In [None]:
!ls

In [3]:
df = pd.read_csv("AviationData.csv", encoding='ISO-8859-1', low_memory=False)

state_codes = pd.read_csv("USState_Codes.csv")

In [None]:
state_codes.info()

In [None]:
df.info()

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.describe()

### Data Cleaning

After exploring the data, we need to clean it to make it easier to work with. We can do this by dropping unnecessary columns and normalizing column names.


In [4]:
# Normalize the column names: title-case them and replace the "." separator with "_"
# regex=False makes pandas treat "." as a literal dot, not a regex wildcard

df.columns = df.columns.str.title().str.replace(".", "_", regex=False)
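
The renaming step can be checked in isolation on a small frame. This is a minimal sketch, assuming dotted column names of the kind the cleaning step targets; the toy names below are illustrative and are not read from `AviationData.csv`.

```python
import pandas as pd

# Toy frame with dotted column names (illustrative only)
toy = pd.DataFrame(columns=["Event.Id", "Aircraft.damage", "Total.Fatal.Injuries"])

# Title-case each name, then swap the literal "." for "_"
toy.columns = toy.columns.str.title().str.replace(".", "_", regex=False)

print(list(toy.columns))
```

Passing `regex=False` keeps the behavior the same across pandas versions, since an unescaped `.` in a regular expression matches any character.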

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88889 entries, 0 to 88888
Data columns (total 31 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event_Id                88889 non-null  object 
 1   Investigation_Type      88889 non-null  object 
 2   Accident_Number         88889 non-null  object 
 3   Event_Date              88889 non-null  object 
 4   Location                88837 non-null  object 
 5   Country                 88663 non-null  object 
 6   Latitude                34382 non-null  object 
 7   Longitude               34373 non-null  object 
 8   Airport_Code            50249 non-null  object 
 9   Airport_Name            52790 non-null  object 
 10  Injury_Severity         87889 non-null  object 
 11  Aircraft_Damage         85695 non-null  object 
 12  Aircraft_Category       32287 non-null  object 
 13  Registration_Number     87572 non-null  object 
 14  Make                    88826 non-null

In [6]:
#check the number of null values in each column

df.isnull().sum()

Event_Id                      0
Investigation_Type            0
Accident_Number               0
Event_Date                    0
Location                     52
Country                     226
Latitude                  54507
Longitude                 54516
Airport_Code              38640
Airport_Name              36099
Injury_Severity            1000
Aircraft_Damage            3194
Aircraft_Category         56602
Registration_Number        1317
Make                         63
Model                        92
Amateur_Built               102
Number_Of_Engines          6084
Engine_Type                7077
Far_Description           56866
Schedule                  76307
Purpose_Of_Flight          6192
Air_Carrier               72241
Total_Fatal_Injuries      11401
Total_Serious_Injuries    12510
Total_Minor_Injuries      11933
Total_Uninjured            5912
Weather_Condition          4492
Broad_Phase_Of_Flight     27165
Report_Status              6381
Publication_Date          13771
dtype: int64
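
Raw null counts are easier to compare as a share of the 88,889 rows: `Latitude`, for example, is missing in 54,507 rows, or roughly 61%, and `Schedule` in 76,307 rows, or roughly 86%. A minimal sketch of that calculation follows, using a made-up toy frame in place of `df`.

```python
import pandas as pd
import numpy as np

# Toy frame standing in for df (values are made up for illustration)
toy = pd.DataFrame({
    "Make": ["Cessna", "Piper", "Boeing", "Cessna"],
    "Latitude": [np.nan, np.nan, np.nan, "36.9"],
    "Engine_Type": ["Reciprocating", np.nan, "Turbo Fan", "Reciprocating"],
})

# isnull().mean() gives the fraction of missing rows per column;
# multiply by 100 and sort so the sparsest columns come first
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)

print(missing_pct)
```

On the real data, the same expression applied to `df` would rank the columns by how sparse they are.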

In [7]:
# Drop unnecessary columns: the null counts above show several columns
# with many missing values that we do not need for this analysis
df.drop(
    columns=["Latitude", "Longitude", "Airport_Code", "Airport_Name",
             "Schedule", "Air_Carrier", "Far_Description", "Publication_Date"],
    inplace=True,
)
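
The columns dropped here were chosen by hand. A threshold on the missing share is one possible alternative, sketched below on a made-up toy frame. The 50% cutoff `MAX_MISSING` is a hypothetical value, not something the notebook uses, and a pure threshold would not reproduce the hand-picked list: `Aircraft_Category` is kept despite 56,602 missing values, while `Publication_Date` is dropped with only 13,771.

```python
import pandas as pd
import numpy as np

# Toy frame standing in for df (values are made up for illustration)
toy = pd.DataFrame({
    "Make": ["Cessna", "Piper", "Boeing", "Cessna"],
    "Latitude": [np.nan, np.nan, np.nan, "36.9"],
    "Schedule": [np.nan, np.nan, "NSCH", np.nan],
    "Engine_Type": ["Reciprocating", np.nan, "Turbo Fan", "Reciprocating"],
})

MAX_MISSING = 0.5  # hypothetical cutoff, not taken from the notebook

# Fraction of missing rows per column
missing_share = toy.isnull().mean()

# Columns whose missing share exceeds the cutoff
sparse_cols = missing_share[missing_share > MAX_MISSING].index.tolist()

# Return a new frame rather than mutating the original
trimmed = toy.drop(columns=sparse_cols)

print(sparse_cols)
print(trimmed.columns.tolist())
```

Returning a new frame instead of using `inplace=True` keeps the original available if the cutoff needs revisiting.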

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88889 entries, 0 to 88888
Data columns (total 23 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event_Id                88889 non-null  object 
 1   Investigation_Type      88889 non-null  object 
 2   Accident_Number         88889 non-null  object 
 3   Event_Date              88889 non-null  object 
 4   Location                88837 non-null  object 
 5   Country                 88663 non-null  object 
 6   Injury_Severity         87889 non-null  object 
 7   Aircraft_Damage         85695 non-null  object 
 8   Aircraft_Category       32287 non-null  object 
 9   Registration_Number     87572 non-null  object 
 10  Make                    88826 non-null  object 
 11  Model                   88797 non-null  object 
 12  Amateur_Built           88787 non-null  object 
 13  Number_Of_Engines       82805 non-null  float64
 14  Engine_Type             81812 non-null

In [9]:
df.columns

Index(['Event_Id', 'Investigation_Type', 'Accident_Number', 'Event_Date',
       'Location', 'Country', 'Injury_Severity', 'Aircraft_Damage',
       'Aircraft_Category', 'Registration_Number', 'Make', 'Model',
       'Amateur_Built', 'Number_Of_Engines', 'Engine_Type',
       'Purpose_Of_Flight', 'Total_Fatal_Injuries', 'Total_Serious_Injuries',
       'Total_Minor_Injuries', 'Total_Uninjured', 'Weather_Condition',
       'Broad_Phase_Of_Flight', 'Report_Status'],
      dtype='object')