# Aircraft Safety Analysis for Business Expansion

<img src="images/Flight-Safety-Instructions.jpg">

## Introduction

For my Flatiron School Phase 1 Project, I have been given the following prompt:

Your company is expanding into new industries to diversify its portfolio. Specifically, the company is interested in purchasing and operating airplanes for both commercial and private enterprises. However, the potential risks associated with different aircraft types are unknown. As a data analyst, you have been tasked with determining which aircraft are the lowest risk for the company to start this new business endeavor. Your findings will be translated into actionable insights for the head of the new aviation division to help decide which aircraft to purchase.

## Problem Statement

I aim to identify which aircraft models present the lowest safety risk. The key questions are:
- Which aircraft models have the lowest average injury scores?
- How do factors such as the number of engines and the purpose of flight impact the safety of the aircraft?
- Are there temporal trends in the occurrence of accidents or incidents?

## Dataset

The dataset contains National Transportation Safety Board (NTSB) records of aircraft accidents and incidents through 2023, covering the event itself, its location, aircraft characteristics, and injury severity. As loaded, it has 90,348 rows and 31 columns. Key columns include:
- `Event.Id`: Unique identifier for each event.
- `Event.Date`: Date of the event.
- `Location`: Location of the event.
- `Country`: Country where the event occurred.
- `Injury.Severity`: Severity of injuries in the event.
- `Aircraft.damage`: Extent of damage to the aircraft.
- `Make` and `Model`: Manufacturer and model of the aircraft.
- `Number.of.Engines`: Number of engines on the aircraft.
- `Purpose.of.flight`: The purpose of the flight during the event.
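
Before relying on these columns, it can help to check how complete each one is. The following is a minimal sketch, not part of the original notebook: the helper name `missing_share` and the toy rows are made up for illustration, and in practice the DataFrame loaded from the CSV would be passed in instead.

```python
import pandas as pd

# Key columns named in the list above.
KEY_COLUMNS = [
    "Event.Id", "Event.Date", "Location", "Country", "Injury.Severity",
    "Aircraft.damage", "Make", "Model", "Number.of.Engines", "Purpose.of.flight",
]

def missing_share(df: pd.DataFrame, columns: list) -> pd.Series:
    """Return the fraction of missing values per column, largest first.

    Hypothetical helper: columns absent from df are skipped silently.
    """
    present = [c for c in columns if c in df.columns]
    return df[present].isna().mean().sort_values(ascending=False)

# Toy, made-up rows standing in for the real dataset.
toy = pd.DataFrame({
    "Event.Id": ["A1", "A2", "A3", None],
    "Country": ["United States", None, "United States", None],
    "Make": ["MakeA", "MakeB", "MakeA", "MakeC"],
})

shares = missing_share(toy, KEY_COLUMNS)
print(shares)
```

Columns with a high missing share are candidates for removal or imputation during cleaning.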

## Data Cleaning and Exploration

### Data Cleaning
1. **Handling Missing Values**: Identify and handle missing values appropriately, either by imputation or removal.
2. **Data Type Conversion**: Convert columns to appropriate data types (e.g., dates to datetime objects, numerical columns to numeric types).
3. **Feature Engineering**: Create new features, such as an `Injury.Severity.Score` that quantifies injury severity and an `Event.Year` extracted from `Event.Date`.
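
The three cleaning steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's actual implementation: the severity weights (3, 2, 1), the choice to treat missing injury counts as zero, the helper name `clean_events`, and the toy rows are all hypothetical. Only the column names come from the dataset.

```python
import pandas as pd

# Toy, made-up rows standing in for the real CSV.
toy = pd.DataFrame({
    "Event.Id": ["A1", "A2", None, "A4"],
    "Event.Date": ["1982-01-01", "1990-06-15", None, "2005-03-09"],
    "Number.of.Engines": ["1", "2", None, "1"],
    "Total.Fatal.Injuries": [0.0, 1.0, None, None],
    "Total.Serious.Injuries": [1.0, 0.0, None, 2.0],
    "Total.Minor.Injuries": [None, 3.0, None, 0.0],
})

# Hypothetical weights: an assumption for illustration only.
SEVERITY_WEIGHTS = {
    "Total.Fatal.Injuries": 3,
    "Total.Serious.Injuries": 2,
    "Total.Minor.Injuries": 1,
}

def clean_events(df: pd.DataFrame) -> pd.DataFrame:
    # 1. Missing values: drop rows without an event identifier.
    out = df.dropna(subset=["Event.Id"]).copy()

    # 2. Type conversion: parse dates and numeric columns,
    #    turning unparseable entries into NaT/NaN.
    out["Event.Date"] = pd.to_datetime(out["Event.Date"], errors="coerce")
    out["Number.of.Engines"] = pd.to_numeric(out["Number.of.Engines"], errors="coerce")

    # Assumption: a missing injury count is treated as zero.
    injury_cols = list(SEVERITY_WEIGHTS)
    out[injury_cols] = out[injury_cols].fillna(0)

    # 3. Feature engineering: weighted injury score and event year.
    out["Injury.Severity.Score"] = sum(
        out[col] * weight for col, weight in SEVERITY_WEIGHTS.items()
    )
    out["Event.Year"] = out["Event.Date"].dt.year
    return out

cleaned = clean_events(toy)
print(cleaned[["Event.Year", "Number.of.Engines", "Injury.Severity.Score"]])
```

Treating missing counts as zero keeps more rows but may understate severity. Dropping those rows instead is an equally defensible choice.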

### Exploratory Data Analysis (EDA)
1. **Summary Statistics**: Generate summary statistics for numerical and categorical columns.
2. **Visualizations**: Create histograms, bar plots, and scatter plots to understand the distribution and relationships in the data.
   - **Average Injury Score by Model**: Identify aircraft models with the lowest average injury scores.
   - **Average Injury Score by Number of Engines**: Investigate how the number of engines impacts the injury score.
   - **Average Injury Score by Purpose of Flight**: Explore the impact of flight purpose on safety.
   - **Events by Year**: Visualize temporal trends in the number of incidents.

Following this structured approach should yield insights into the safety of different aircraft models and support actionable recommendations for the company's aviation division.
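
The EDA aggregations listed above can be sketched with `groupby`. This is a minimal sketch, assuming a cleaned DataFrame that already has `Injury.Severity.Score` and `Event.Year` columns. The helper `average_score_by`, its `min_events` threshold, and the toy rows are hypothetical and not taken from the original analysis.

```python
import pandas as pd

# Toy, made-up rows; in the notebook this would be the cleaned DataFrame.
toy = pd.DataFrame({
    "Model": ["X-100", "X-100", "Y-200", "Y-200", "Z-300"],
    "Number.of.Engines": [1, 1, 2, 2, 1],
    "Purpose.of.flight": ["Personal", "Personal", "Business", "Personal", "Business"],
    "Event.Year": [1990, 1991, 1990, 1990, 1991],
    "Injury.Severity.Score": [4.0, 2.0, 0.0, 1.0, 6.0],
})

def average_score_by(df: pd.DataFrame, column: str, min_events: int = 1) -> pd.Series:
    """Mean injury score per group, lowest (safest) first.

    min_events is a hypothetical filter: groups with fewer events are
    dropped, since an average over very few events is unreliable.
    """
    stats = df.groupby(column)["Injury.Severity.Score"].agg(["mean", "count"])
    stats = stats[stats["count"] >= min_events]
    return stats["mean"].sort_values()

by_model = average_score_by(toy, "Model")
by_engines = average_score_by(toy, "Number.of.Engines")
by_purpose = average_score_by(toy, "Purpose.of.flight")

# Temporal trend: number of events per year, in chronological order.
events_by_year = toy["Event.Year"].value_counts().sort_index()

print(by_model)
print(events_by_year)
```

Each resulting Series can be passed to a bar or line plot, for example with `by_model.plot(kind="bar")`, to produce the visualizations described above.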

# Data Import & Initial Handling

In [1]:
# Importing necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import seaborn as sns
from tqdm import tqdm

In [2]:
# Reading data from a CSV file
df = pd.read_csv('data/Aviation_Data.csv')

# Preview the data
df.head()


| | Event.Id | Investigation.Type | Accident.Number | Event.Date | Location | Country | Latitude | Longitude | Airport.Code | Airport.Name | ... | Purpose.of.flight | Air.carrier | Total.Fatal.Injuries | Total.Serious.Injuries | Total.Minor.Injuries | Total.Uninjured | Weather.Condition | Broad.phase.of.flight | Report.Status | Publication.Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20001218X45444 | Accident | SEA87LA080 | 1948-10-24 | MOOSE CREEK, ID | United States | | | | | ... | Personal | | 2.0 | 0.0 | 0.0 | 0.0 | UNK | Cruise | Probable Cause | |
| 1 | 20001218X45447 | Accident | LAX94LA336 | 1962-07-19 | BRIDGEPORT, CA | United States | | | | | ... | Personal | | 4.0 | 0.0 | 0.0 | 0.0 | UNK | Unknown | Probable Cause | 19-09-1996 |
| 2 | 20061025X01555 | Accident | NYC07LA005 | 1974-08-30 | Saltville, VA | United States | 36.922223 | -81.878056 | | | ... | Personal | | 3.0 | | | | IMC | Cruise | Probable Cause | 26-02-2007 |
| 3 | 20001218X45448 | Accident | LAX96LA321 | 1977-06-19 | EUREKA, CA | United States | | | | | ... | Personal | | 2.0 | 0.0 | 0.0 | 0.0 | IMC | Cruise | Probable Cause | 12-09-2000 |
| 4 | 20041105X01764 | Accident | CHI79FA064 | 1979-08-02 | Canton, OH | United States | | | | | ... | Personal | | 1.0 | 2.0 | | 0.0 | VMC | Approach | Probable Cause | 16-04-1980 |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90348 entries, 0 to 90347
Data columns (total 31 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                88889 non-null  object 
 1   Investigation.Type      90348 non-null  object 
 2   Accident.Number         88889 non-null  object 
 3   Event.Date              88889 non-null  object 
 4   Location                88837 non-null  object 
 5   Country                 88663 non-null  object 
 6   Latitude                34382 non-null  object 
 7   Longitude               34373 non-null  object 
 8   Airport.Code            50132 non-null  object 
 9   Airport.Name            52704 non-null  object 
 10  Injury.Severity         87889 non-null  object 
 11  Aircraft.damage         85695 non-null  object 
 12  Aircraft.Category       32287 non-null  object 
 13  Registration.Number     87507 non-null  object 
 14  Make                    88826 non-null