In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
from dateutil.parser import parse
from datetime import datetime
import seaborn as sns
import pycountry_convert as pc

<div style="background-color: #9df9ef; padding: 10px;"></div>

# Dataset 2 - Aviation Accident Database & Synopses, up to 2023 from NTSB - National Transportation Safety Board 

<div style="background-color: #9df9ef; padding: 10px;"></div>

## 2. Aviation Accident Database & Synopses, up to 2023


The dataset is available on Kaggle [here](https://www.kaggle.com/datasets/khsamaha/aviation-accident-database-synopses).

The NTSB aviation accident database ([NTSB home page](https://www.ntsb.gov/Pages/home.aspx)) contains information from 1962 and later about civil aviation accidents and selected incidents within the United States, its territories and possessions, and in international waters.

### 2.1. Reading, analyzing the dataset

In [2]:
avioset_ntsb = pd.read_csv('data/dataset_2_ntsb_gov/AviationData.csv', 
                           encoding='windows-1252', 
                           low_memory=False)

In [3]:
# 88889 rows × 31 columns
avioset_ntsb

Unnamed: 0,Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Latitude,Longitude,Airport.Code,Airport.Name,...,Purpose.of.flight,Air.carrier,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date
0,20001218X45444,Accident,SEA87LA080,1948-10-24,"MOOSE CREEK, ID",United States,,,,,...,Personal,,2.0,0.0,0.0,0.0,UNK,Cruise,Probable Cause,
1,20001218X45447,Accident,LAX94LA336,1962-07-19,"BRIDGEPORT, CA",United States,,,,,...,Personal,,4.0,0.0,0.0,0.0,UNK,Unknown,Probable Cause,19-09-1996
2,20061025X01555,Accident,NYC07LA005,1974-08-30,"Saltville, VA",United States,36.922223,-81.878056,,,...,Personal,,3.0,,,,IMC,Cruise,Probable Cause,26-02-2007
3,20001218X45448,Accident,LAX96LA321,1977-06-19,"EUREKA, CA",United States,,,,,...,Personal,,2.0,0.0,0.0,0.0,IMC,Cruise,Probable Cause,12-09-2000
4,20041105X01764,Accident,CHI79FA064,1979-08-02,"Canton, OH",United States,,,,,...,Personal,,1.0,2.0,,0.0,VMC,Approach,Probable Cause,16-04-1980
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88884,20221227106491,Accident,ERA23LA093,2022-12-26,"Annapolis, MD",United States,,,,,...,Personal,,0.0,1.0,0.0,0.0,,,,29-12-2022
88885,20221227106494,Accident,ERA23LA095,2022-12-26,"Hampton, NH",United States,,,,,...,,,0.0,0.0,0.0,0.0,,,,
88886,20221227106497,Accident,WPR23LA075,2022-12-26,"Payson, AZ",United States,341525N,1112021W,PAN,PAYSON,...,Personal,,0.0,0.0,0.0,1.0,VMC,,,27-12-2022
88887,20221227106498,Accident,WPR23LA076,2022-12-26,"Morgan, UT",United States,,,,,...,Personal,MC CESSNA 210N LLC,0.0,0.0,0.0,0.0,,,,


<div style="background-color: #9df9ef; padding: 10px;"></div>

### 2.2. Dataset cleaning, normalization

#### 2.2.1. Let's see what data we have in all columns.

In [4]:
avioset_ntsb.dtypes

Event.Id                   object
Investigation.Type         object
Accident.Number            object
Event.Date                 object
Location                   object
Country                    object
Latitude                   object
Longitude                  object
Airport.Code               object
Airport.Name               object
Injury.Severity            object
Aircraft.damage            object
Aircraft.Category          object
Registration.Number        object
Make                       object
Model                      object
Amateur.Built              object
Number.of.Engines         float64
Engine.Type                object
FAR.Description            object
Schedule                   object
Purpose.of.flight          object
Air.carrier                object
Total.Fatal.Injuries      float64
Total.Serious.Injuries    float64
Total.Minor.Injuries      float64
Total.Uninjured           float64
Weather.Condition          object
Broad.phase.of.flight      object
Report.Status 

<div style="background-color: #9df9ef; padding: 10px;"></div>

#### 2.2.2. Let's change column names like Event.Id into event_id.

In [8]:
# Work on a copy so the original DataFrame is not modified
# (plain assignment would only create a second name for the same object)
avioset_ntsb_low = avioset_ntsb.copy()

In [9]:
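def change_col_names(df, old_symbol, new_symbol):
    # Lower-case the column names and replace old_symbol with new_symbol.
    # regex=False treats old_symbol literally; as a regex, '.' would match any character.
    df.columns = df.columns.str.lower().str.replace(old_symbol, new_symbol, regex=False)
    return df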

In [10]:
change_col_names(avioset_ntsb_low, '.', '_')

Unnamed: 0,event_id,investigation_type,accident_number,event_date,location,country,latitude,longitude,airport_code,airport_name,...,purpose_of_flight,air_carrier,total_fatal_injuries,total_serious_injuries,total_minor_injuries,total_uninjured,weather_condition,broad_phase_of_flight,report_status,publication_date
0,20001218X45444,Accident,SEA87LA080,1948-10-24,"MOOSE CREEK, ID",United States,,,,,...,Personal,,2.0,0.0,0.0,0.0,UNK,Cruise,Probable Cause,
1,20001218X45447,Accident,LAX94LA336,1962-07-19,"BRIDGEPORT, CA",United States,,,,,...,Personal,,4.0,0.0,0.0,0.0,UNK,Unknown,Probable Cause,19-09-1996
2,20061025X01555,Accident,NYC07LA005,1974-08-30,"Saltville, VA",United States,36.922223,-81.878056,,,...,Personal,,3.0,,,,IMC,Cruise,Probable Cause,26-02-2007
3,20001218X45448,Accident,LAX96LA321,1977-06-19,"EUREKA, CA",United States,,,,,...,Personal,,2.0,0.0,0.0,0.0,IMC,Cruise,Probable Cause,12-09-2000
4,20041105X01764,Accident,CHI79FA064,1979-08-02,"Canton, OH",United States,,,,,...,Personal,,1.0,2.0,,0.0,VMC,Approach,Probable Cause,16-04-1980
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88884,20221227106491,Accident,ERA23LA093,2022-12-26,"Annapolis, MD",United States,,,,,...,Personal,,0.0,1.0,0.0,0.0,,,,29-12-2022
88885,20221227106494,Accident,ERA23LA095,2022-12-26,"Hampton, NH",United States,,,,,...,,,0.0,0.0,0.0,0.0,,,,
88886,20221227106497,Accident,WPR23LA075,2022-12-26,"Payson, AZ",United States,341525N,1112021W,PAN,PAYSON,...,Personal,,0.0,0.0,0.0,1.0,VMC,,,27-12-2022
88887,20221227106498,Accident,WPR23LA076,2022-12-26,"Morgan, UT",United States,,,,,...,Personal,MC CESSNA 210N LLC,0.0,0.0,0.0,0.0,,,,


In [11]:
avioset_ntsb_low.columns

Index(['event_id', 'investigation_type', 'accident_number', 'event_date',
       'location', 'country', 'latitude', 'longitude', 'airport_code',
       'airport_name', 'injury_severity', 'aircraft_damage',
       'aircraft_category', 'registration_number', 'make', 'model',
       'amateur_built', 'number_of_engines', 'engine_type', 'far_description',
       'schedule', 'purpose_of_flight', 'air_carrier', 'total_fatal_injuries',
       'total_serious_injuries', 'total_minor_injuries', 'total_uninjured',
       'weather_condition', 'broad_phase_of_flight', 'report_status',
       'publication_date'],
      dtype='object')
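The renaming step can be checked in isolation on a tiny frame. The sketch below is a minimal illustration, not the notebook's own cell: the helper name `rename_columns_literal` and the two-column sample are assumptions made for the example. It returns a renamed copy instead of modifying its input, and it replaces the dot literally.

```python
import pandas as pd


def rename_columns_literal(df, old_symbol, new_symbol):
    """Return a copy of df with lower-cased column names and old_symbol replaced literally."""
    renamed = df.copy()
    renamed.columns = (
        renamed.columns.str.lower().str.replace(old_symbol, new_symbol, regex=False)
    )
    return renamed


# Hypothetical two-column sample that mimics the NTSB column naming style
sample = pd.DataFrame({"Event.Id": ["A1"], "Total.Fatal.Injuries": [2.0]})
renamed_sample = rename_columns_literal(sample, ".", "_")

print(list(renamed_sample.columns))  # renamed copy
print(list(sample.columns))          # the input frame keeps its original names
```

Returning a copy keeps the helper free of side effects, which makes it easier to rerun cells in any order.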

<div style="background-color: #9df9ef; padding: 10px;"></div>

#### 2.2.4. Let's look at the unique values in each column 

In [12]:
# View unique values for all columns
def all_columns_unique_values(df):
    for col in df.columns:
        print(f"Unique values in column \'{col}\': \n {df[col].unique()}\n\n **************** \n")

all_columns_unique_values(avioset_ntsb_low)

Unique values in column 'event_id': 
 ['20001218X45444' '20001218X45447' '20061025X01555' ... '20221227106497'
 '20221227106498' '20221230106513']

 **************** 

Unique values in column 'investigation_type': 
 ['Accident' 'Incident']

 **************** 

Unique values in column 'accident_number': 
 ['SEA87LA080' 'LAX94LA336' 'NYC07LA005' ... 'WPR23LA075' 'WPR23LA076'
 'ERA23LA097']

 **************** 

Unique values in column 'event_date': 
 ['1948-10-24' '1962-07-19' '1974-08-30' ... '2022-12-22' '2022-12-26'
 '2022-12-29']

 **************** 

Unique values in column 'location': 
 ['MOOSE CREEK, ID' 'BRIDGEPORT, CA' 'Saltville, VA' ... 'San Manual, AZ'
 'Auburn Hills, MI' 'Brasnorte, ']

 **************** 

Unique values in column 'country': 
 ['United States' nan 'GULF OF MEXICO' 'Puerto Rico' 'ATLANTIC OCEAN'
 'HIGH ISLAND' 'Bahamas' 'MISSING' 'Pakistan' 'Angola' 'Germany'
 'Korea, Republic Of' 'Martinique' 'American Samoa' 'PACIFIC OCEAN'
 'Canada' 'Bolivia' 'Mexico' 'Domini

<div style="background-color: #9df9ef; padding: 10px;"></div>

#### 2.2.3. Let's analyze nan in each column

In [14]:
# View NAN values in all columns
def all_columns_nan_values(df):
    for col in df.columns:
        print(f"All nan values in column \'{col}\': \n {df[col].isna().sum()}\n\n **************** \n")
        
all_columns_nan_values(avioset_ntsb_low)

All nan values in column 'event_id': 
 0

 **************** 

All nan values in column 'investigation_type': 
 0

 **************** 

All nan values in column 'accident_number': 
 0

 **************** 

All nan values in column 'event_date': 
 0

 **************** 

All nan values in column 'location': 
 52

 **************** 

All nan values in column 'country': 
 226

 **************** 

All nan values in column 'latitude': 
 54507

 **************** 

All nan values in column 'longitude': 
 54516

 **************** 

All nan values in column 'airport_code': 
 38757

 **************** 

All nan values in column 'airport_name': 
 36185

 **************** 

All nan values in column 'injury_severity': 
 1000

 **************** 

All nan values in column 'aircraft_damage': 
 3194

 **************** 

All nan values in column 'aircraft_category': 
 56602

 **************** 

All nan values in column 'registration_number': 
 1382

 **************** 

All nan values in column 'make': 
 63



<div style="background-color: #9df9ef; padding: 10px;"></div>

#### 2.2.5. Let's analyze columns 6, 7, 28 
- TODO

In [None]:
avioset_ntsb_low.columns[6]

In [None]:
avioset_ntsb_low.latitude.unique()

In [None]:
avioset_ntsb_low.columns[7]

In [None]:
avioset_ntsb_low.longitude.unique()

In [None]:
avioset_ntsb_low.columns[28]
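According to the column index printed above, positions 6, 7 and 28 are `latitude`, `longitude` and `broad_phase_of_flight`. One possible way to see whether such a column mixes types is to count the Python type of every value. The sketch below runs on a small made-up frame; the helper name `python_type_counts` and the sample values are assumptions for illustration, loosely modelled on the two coordinate styles visible in the preview (`36.922223` and `341525N`).

```python
import numpy as np
import pandas as pd

# Made-up sample: decimal strings, degree-style strings, NaN and a real float
sample = pd.DataFrame({
    "latitude": ["36.922223", "341525N", np.nan, 36.5],
    "longitude": ["-81.878056", "1112021W", np.nan, -81.0],
})


def python_type_counts(df, column):
    """Count how many values of each Python type a column holds (NaN counts as float)."""
    return df[column].map(lambda value: type(value).__name__).value_counts()


type_counts = {col: python_type_counts(sample, col) for col in sample.columns}

for col, counts in type_counts.items():
    print(col)
    print(counts, "\n")
```

The same helper could be applied to `avioset_ntsb_low` for the three columns in question; a column that reports both `str` and `float` needs normalization before any numeric analysis.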

<div style="background-color: #9df9ef; padding: 10px;"></div>

#### 2.2.6. How many NaN values do we have? Should we replace them or not?
- TODO
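As a starting point for this question, the per-column NaN counts can be collected into one sorted table instead of a long printout. This is only a sketch on made-up data; the helper name `nan_summary` and the sample frame are assumptions for the example.

```python
import numpy as np
import pandas as pd


def nan_summary(df):
    """Return NaN count and NaN share per column, sorted from most to least missing."""
    summary = pd.DataFrame({
        "nan_count": df.isna().sum(),
        "nan_share": df.isna().mean().round(3),
    })
    return summary.sort_values("nan_count", ascending=False)


# Made-up sample with different amounts of missing data per column
sample = pd.DataFrame({
    "event_id": ["A1", "A2", "A3", "A4"],
    "latitude": [np.nan, np.nan, "36.922223", np.nan],
    "country": ["United States", np.nan, "Canada", "Mexico"],
})

summary = nan_summary(sample)
print(summary)
```

Seeing the share next to the count helps with the replace-or-drop decision: a column that is mostly empty is a different case from one with a handful of gaps.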

<div style="background-color: #9df9ef; padding: 10px;"></div>

#### 2.2.7. Let's convert date columns into datetime 
- TODO
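A possible approach, sketched on a few values copied from the preview above: `event_date` appears in `YYYY-MM-DD` form and `publication_date` in `DD-MM-YYYY` form. Treating those two formats as valid for the whole column is an assumption, so `errors="coerce"` is used to turn anything that does not match into `NaT` rather than raising.

```python
import pandas as pd

# Values copied from the dataset preview; None stands in for a missing publication date
sample = pd.DataFrame({
    "event_date": ["1948-10-24", "1962-07-19", "2022-12-26"],
    "publication_date": [None, "19-09-1996", "27-12-2022"],
})

converted = sample.copy()
converted["event_date"] = pd.to_datetime(
    converted["event_date"], format="%Y-%m-%d", errors="coerce"
)
converted["publication_date"] = pd.to_datetime(
    converted["publication_date"], format="%d-%m-%Y", errors="coerce"
)

print(converted.dtypes)
print(converted)
```

After a conversion like this, comparing the NaN count before and after would show how many values failed to parse.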

<div style="background-color: #9df9ef; padding: 10px;"></div>

#### 2.2.8. Let's take a look at the summary statistics for the numerical columns. It's a good idea to change some dtypes first!
- TODO

In [None]:
avioset_ntsb_low.describe()

#### count: The number of non-null entries.
The numerical columns have a fairly large number of non-null entries.
*****
#### mean: The average value of the data.
*****
#### std: The standard deviation, which measures the amount of variation or dispersion from the mean.
*****
#### min: The minimum value in the data.
It would be interesting to look more closely at the records where Number.of.Engines == 0.
Aircraft without engines do exist and are commonly referred to as gliders or sailplanes. They are designed to fly without an engine and stay airborne by using natural sources of lift, such as rising air currents (thermals), ridge lift, or wave lift.
- TODO
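One way to start on this TODO is to filter the zero-engine rows and count their aircraft categories. The sketch below uses a made-up frame; the category labels in it are illustrative assumptions, not values confirmed from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up sample; the category labels are illustrative only
sample = pd.DataFrame({
    "number_of_engines": [1.0, 0.0, 2.0, 0.0, np.nan],
    "aircraft_category": ["Airplane", "Glider", "Airplane", "Balloon", np.nan],
})

# Rows with exactly zero engines (NaN compares as False, so missing values are excluded)
no_engine = sample[sample["number_of_engines"] == 0]

# Keep NaN categories visible, since they can hide unexpected records
category_counts = no_engine["aircraft_category"].value_counts(dropna=False)
print(category_counts)
```

On the real data the same filter would be applied to `avioset_ntsb_low`; categories other than engineless aircraft in the result would point to data-entry problems.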
*****
#### 25%: The 25th percentile (first quartile), which is the value below which 25% of the data fall.
*****
#### 50%: The 50th percentile (median), which is the middle value of the data.
*****
#### 75%: The 75th percentile (third quartile), which is the value below which 75% of the data fall.
*****
#### max: The maximum value in the data.
It would be interesting to look further at:
- How many injured people do we have?
- How many accidents with injuries do we have?
- How are these values distributed? The maxima are 349 for Total.Fatal.Injuries, 161 for Total.Serious.Injuries, and 380 for Total.Minor.Injuries.
- TODO
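The first two questions can be sketched as follows. The frame is made up, and treating a missing injury count as zero is an assumption that should be revisited once the NaN handling is decided.

```python
import numpy as np
import pandas as pd

injury_cols = [
    "total_fatal_injuries",
    "total_serious_injuries",
    "total_minor_injuries",
]

# Made-up sample with some missing counts
sample = pd.DataFrame({
    "total_fatal_injuries": [2.0, 0.0, np.nan, 1.0],
    "total_serious_injuries": [0.0, 0.0, np.nan, 2.0],
    "total_minor_injuries": [0.0, 0.0, 3.0, np.nan],
})

# Row-wise sum; pandas skips NaN by default, so missing counts contribute 0 here
injured_per_event = sample[injury_cols].sum(axis=1)

total_injured = injured_per_event.sum()
events_with_injuries = int((injured_per_event > 0).sum())

print(f"Total injured people: {total_injured}")
print(f"Events with at least one injury: {events_with_injuries}")
```

For the third question, `injured_per_event.describe()` or a histogram on a logarithmic axis would show how far the few very large events sit from the typical case.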

<div style="background-color: #9df9ef; padding: 10px;"></div>