# Data Cleaning

First, we load AviationData.csv into a DataFrame and look at the first five rows.

In [248]:
import pandas as pd
 
df = pd.read_csv('./data/AviationData.csv', encoding='latin1')

df.head(5)


Unnamed: 0,Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Latitude,Longitude,Airport.Code,Airport.Name,...,Purpose.of.flight,Air.carrier,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date
0,20001218X45444,Accident,SEA87LA080,1948-10-24,"MOOSE CREEK, ID",United States,,,,,...,Personal,,2.0,0.0,0.0,0.0,UNK,Cruise,Probable Cause,
1,20001218X45447,Accident,LAX94LA336,1962-07-19,"BRIDGEPORT, CA",United States,,,,,...,Personal,,4.0,0.0,0.0,0.0,UNK,Unknown,Probable Cause,19-09-1996
2,20061025X01555,Accident,NYC07LA005,1974-08-30,"Saltville, VA",United States,36.922223,-81.878056,,,...,Personal,,3.0,,,,IMC,Cruise,Probable Cause,26-02-2007
3,20001218X45448,Accident,LAX96LA321,1977-06-19,"EUREKA, CA",United States,,,,,...,Personal,,2.0,0.0,0.0,0.0,IMC,Cruise,Probable Cause,12-09-2000
4,20041105X01764,Accident,CHI79FA064,1979-08-02,"Canton, OH",United States,,,,,...,Personal,,1.0,2.0,,0.0,VMC,Approach,Probable Cause,16-04-1980


Then, we will look at the column names and the dataframe info to find possible null values.

In [249]:
df.columns

Index(['Event.Id', 'Investigation.Type', 'Accident.Number', 'Event.Date',
       'Location', 'Country', 'Latitude', 'Longitude', 'Airport.Code',
       'Airport.Name', 'Injury.Severity', 'Aircraft.damage',
       'Aircraft.Category', 'Registration.Number', 'Make', 'Model',
       'Amateur.Built', 'Number.of.Engines', 'Engine.Type', 'FAR.Description',
       'Schedule', 'Purpose.of.flight', 'Air.carrier', 'Total.Fatal.Injuries',
       'Total.Serious.Injuries', 'Total.Minor.Injuries', 'Total.Uninjured',
       'Weather.Condition', 'Broad.phase.of.flight', 'Report.Status',
       'Publication.Date'],
      dtype='object')

In [250]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88889 entries, 0 to 88888
Data columns (total 31 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                88889 non-null  object 
 1   Investigation.Type      88889 non-null  object 
 2   Accident.Number         88889 non-null  object 
 3   Event.Date              88889 non-null  object 
 4   Location                88837 non-null  object 
 5   Country                 88663 non-null  object 
 6   Latitude                34382 non-null  object 
 7   Longitude               34373 non-null  object 
 8   Airport.Code            50132 non-null  object 
 9   Airport.Name            52704 non-null  object 
 10  Injury.Severity         87889 non-null  object 
 11  Aircraft.damage         85695 non-null  object 
 12  Aircraft.Category       32287 non-null  object 
 13  Registration.Number     87507 non-null  object 
 14  Make                    88826 non-null
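The non-null counts above are easier to compare as fractions. The following is a minimal sketch on a made-up toy frame (the values in `toy_df` are illustrative and not taken from the dataset); the same expression could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the aviation data; the values are invented for illustration.
toy_df = pd.DataFrame({
    'Event.Id': ['a', 'b', 'c', 'd'],
    'Latitude': [np.nan, np.nan, '36.92', np.nan],
    'Make': ['Cessna', 'Piper', None, 'Beech'],
})

# isna() marks missing cells; the mean of that boolean mask is the
# fraction of missing values in each column.
missing_share = toy_df.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Sorting in descending order puts the sparsest columns first, which makes candidates for dropping easy to spot.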

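We drop the Latitude, Longitude, Airport.Code, and Airport.Name columns and rely on the Location column instead. We also drop Schedule, Injury.Severity, Publication.Date, Report.Status, and Air.carrier, then remove every row that is missing Location, Make, Model, or Registration.Number.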

In [251]:
dropped_cols = ['Latitude', 'Longitude', 'Airport.Code', 'Airport.Name', 'Schedule', 'Injury.Severity', 'Publication.Date', 'Report.Status', 'Air.carrier']
dropped_df = df.drop(columns=dropped_cols)
dropped_df = dropped_df.dropna(axis='index', subset=['Location', 'Make', 'Model', 'Registration.Number'])
dropped_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 87408 entries, 0 to 88888
Data columns (total 22 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                87408 non-null  object 
 1   Investigation.Type      87408 non-null  object 
 2   Accident.Number         87408 non-null  object 
 3   Event.Date              87408 non-null  object 
 4   Location                87408 non-null  object 
 5   Country                 87188 non-null  object 
 6   Aircraft.damage         84404 non-null  object 
 7   Aircraft.Category       31960 non-null  object 
 8   Registration.Number     87408 non-null  object 
 9   Make                    87408 non-null  object 
 10  Model                   87408 non-null  object 
 11  Amateur.Built           87377 non-null  object 
 12  Number.of.Engines       82575 non-null  float64
 13  Engine.Type             81262 non-null  object 
 14  FAR.Description         31646 non-null  obj
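To see what the two steps do, here is a small sketch on a toy frame (invented values, only three of the real column names). It also shows why the index in the output above still runs from 0 to 88888 even though fewer rows remain.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; column names follow the dataset.
toy_df = pd.DataFrame({
    'Location': ['BOSTON, MA', None, 'EUREKA, CA', 'CANTON, OH'],
    'Make': ['Cessna', 'Piper', None, 'Beech'],
    'Latitude': [np.nan, 42.3, np.nan, np.nan],
})

# Drop a column by name, then drop rows missing any of the key fields.
trimmed = toy_df.drop(columns=['Latitude'])
trimmed = trimmed.dropna(axis='index', subset=['Location', 'Make'])

# The surviving rows keep their original index labels.
print(trimmed.index.tolist())
```

If a contiguous index is preferred, `trimmed.reset_index(drop=True)` would renumber the rows; the notebook keeps the original labels.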

Next, we split the Location column into two new columns, Location_City and Location_State, and convert the city names to title case.

In [252]:
dropped_df[['Location_City', 'Location_State']] = dropped_df['Location'].str.split(', ', n=1, expand=True)
dropped_df['Location_City'] = dropped_df['Location_City'].str.title()
dropped_df[['Location_City', 'Location_State']].head(10)

| | Location_City | Location_State |
|---|---|---|
| 0 | Moose Creek | ID |
| 1 | Bridgeport | CA |
| 2 | Saltville | VA |
| 3 | Eureka | CA |
| 4 | Canton | OH |
| 5 | Boston | MA |
| 6 | Cotton | MN |
| 7 | Pullman | WA |
| 8 | East Hanover | NJ |
| 9 | Jacksonville | FL |
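The split assumes every location looks like "CITY, ST". A short sketch of how it behaves when that does not hold; the third value, 'ATLANTIC OCEAN', is a hypothetical example of a location without a comma and is not taken from the dataset.

```python
import pandas as pd

# Two values from the head() output plus one hypothetical comma-free value.
locations = pd.Series(['MOOSE CREEK, ID', 'Saltville, VA', 'ATLANTIC OCEAN'])

# n=1 splits only at the first ', '; expand=True returns one column per piece.
parts = locations.str.split(', ', n=1, expand=True)
parts.columns = ['Location_City', 'Location_State']
parts['Location_City'] = parts['Location_City'].str.title()

print(parts)
```

A value without the separator ends up with a missing Location_State, so it may be worth checking `Location_State.isna().sum()` after the split. Because `n=1` splits at the first comma, a location with two commas would keep everything after the first one in Location_State; `str.rsplit` with the same arguments would split at the last comma instead.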


We standardize the Make column to title case so that differently capitalized spellings of the same manufacturer are counted together.

In [253]:
dropped_df['Make'] = dropped_df['Make'].str.title()
dropped_df['Make'].value_counts()

Make
Cessna           26867
Piper            14753
Beech             5299
Bell              2612
Boeing            2477
                 ...  
Izatt                1
Mince                1
Dana A. Moore        1
Slater               1
Royse Ralph L        1
Name: count, Length: 7534, dtype: int64
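A quick sketch of the effect on a toy Series (invented spellings). The `str.strip()` call is an extra step that the cell above does not perform; it is included here on the assumption that stray whitespace may also occur.

```python
import pandas as pd

# Invented spellings of two manufacturers, one with trailing whitespace.
makes = pd.Series(['CESSNA', 'Cessna', 'cessna ', 'PIPER', 'Piper'])

# Before cleaning, every spelling counts as its own category.
print(makes.nunique())

# strip() is an assumption added for this sketch; title() matches the notebook.
cleaned = makes.str.strip().str.title()
print(cleaned.value_counts())
```

Title case is a blunt tool: it lowercases everything after the first letter of each word, so a name such as "MCDONNELL" becomes "Mcdonnell". That is fine for grouping, but not for display.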

Let's take another look at dropped_df.

In [254]:
dropped_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 87408 entries, 0 to 88888
Data columns (total 24 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                87408 non-null  object 
 1   Investigation.Type      87408 non-null  object 
 2   Accident.Number         87408 non-null  object 
 3   Event.Date              87408 non-null  object 
 4   Location                87408 non-null  object 
 5   Country                 87188 non-null  object 
 6   Aircraft.damage         84404 non-null  object 
 7   Aircraft.Category       31960 non-null  object 
 8   Registration.Number     87408 non-null  object 
 9   Make                    87408 non-null  object 
 10  Model                   87408 non-null  object 
 11  Amateur.Built           87377 non-null  object 
 12  Number.of.Engines       82575 non-null  float64
 13  Engine.Type             81262 non-null  object 
 14  FAR.Description         31646 non-null  obj

We assume that a NaN in any of the four injury-count columns (Total.Fatal.Injuries, Total.Serious.Injuries, Total.Minor.Injuries, Total.Uninjured) means zero people in that category.

In [255]:
injury_cols = ['Total.Fatal.Injuries', 'Total.Serious.Injuries', 'Total.Minor.Injuries', 'Total.Uninjured']
dropped_df[injury_cols] = dropped_df[injury_cols].fillna(value=0)
dropped_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 87408 entries, 0 to 88888
Data columns (total 24 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                87408 non-null  object 
 1   Investigation.Type      87408 non-null  object 
 2   Accident.Number         87408 non-null  object 
 3   Event.Date              87408 non-null  object 
 4   Location                87408 non-null  object 
 5   Country                 87188 non-null  object 
 6   Aircraft.damage         84404 non-null  object 
 7   Aircraft.Category       31960 non-null  object 
 8   Registration.Number     87408 non-null  object 
 9   Make                    87408 non-null  object 
 10  Model                   87408 non-null  object 
 11  Amateur.Built           87377 non-null  object 
 12  Number.of.Engines       82575 non-null  float64
 13  Engine.Type             81262 non-null  object 
 14  FAR.Description         31646 non-null  obj
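The same fill on a two-row toy frame, as a minimal sketch. The injury values mirror rows 2 and 4 of the `head()` output; the `injury_cols` list is a convenience name introduced here.

```python
import numpy as np
import pandas as pd

# Convenience list so the column names are written only once.
injury_cols = ['Total.Fatal.Injuries', 'Total.Serious.Injuries',
               'Total.Minor.Injuries', 'Total.Uninjured']

# Injury counts as they appear in rows 2 and 4 of the head() output.
toy_df = pd.DataFrame({
    'Total.Fatal.Injuries': [3.0, 1.0],
    'Total.Serious.Injuries': [np.nan, 2.0],
    'Total.Minor.Injuries': [np.nan, np.nan],
    'Total.Uninjured': [np.nan, 0.0],
})

# Replace NaN with 0 in the four injury columns only.
toy_df[injury_cols] = toy_df[injury_cols].fillna(0)
print(toy_df)
```

Note that this treats "not recorded" and "zero" as the same thing, which is the assumption stated above.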

In [262]:
dropped_df['Amateur.Built'] = dropped_df['Amateur.Built'].fillna('No')
dropped_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 87408 entries, 0 to 88888
Data columns (total 24 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                87408 non-null  object 
 1   Investigation.Type      87408 non-null  object 
 2   Accident.Number         87408 non-null  object 
 3   Event.Date              87408 non-null  object 
 4   Location                87408 non-null  object 
 5   Country                 87188 non-null  object 
 6   Aircraft.damage         84404 non-null  object 
 7   Aircraft.Category       31960 non-null  object 
 8   Registration.Number     87408 non-null  object 
 9   Make                    87408 non-null  object 
 10  Model                   87408 non-null  object 
 11  Amateur.Built           87408 non-null  object 
 12  Number.of.Engines       82575 non-null  float64
 13  Engine.Type             81262 non-null  object 
 14  FAR.Description         31646 non-null  obj

In [266]:
dropped_df['Weather.Condition'] = dropped_df['Weather.Condition'].fillna('VMC')
dropped_df['Weather.Condition'] = dropped_df['Weather.Condition'].str.upper()
dropped_df['Weather.Condition'].value_counts()
dropped_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 87408 entries, 0 to 88888
Data columns (total 24 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                87408 non-null  object 
 1   Investigation.Type      87408 non-null  object 
 2   Accident.Number         87408 non-null  object 
 3   Event.Date              87408 non-null  object 
 4   Location                87408 non-null  object 
 5   Country                 87188 non-null  object 
 6   Aircraft.damage         84404 non-null  object 
 7   Aircraft.Category       31960 non-null  object 
 8   Registration.Number     87408 non-null  object 
 9   Make                    87408 non-null  object 
 10  Model                   87408 non-null  object 
 11  Amateur.Built           87408 non-null  object 
 12  Number.of.Engines       82575 non-null  float64
 13  Engine.Type             81262 non-null  object 
 14  FAR.Description         31646 non-null  obj

In [269]:
dropped_df['Purpose.of.flight'] = dropped_df['Purpose.of.flight'].fillna('Unknown')
dropped_df['Purpose.of.flight'].value_counts()
dropped_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 87408 entries, 0 to 88888
Data columns (total 24 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                87408 non-null  object 
 1   Investigation.Type      87408 non-null  object 
 2   Accident.Number         87408 non-null  object 
 3   Event.Date              87408 non-null  object 
 4   Location                87408 non-null  object 
 5   Country                 87188 non-null  object 
 6   Aircraft.damage         84404 non-null  object 
 7   Aircraft.Category       31960 non-null  object 
 8   Registration.Number     87408 non-null  object 
 9   Make                    87408 non-null  object 
 10  Model                   87408 non-null  object 
 11  Amateur.Built           87408 non-null  object 
 12  Number.of.Engines       82575 non-null  float64
 13  Engine.Type             81262 non-null  object 
 14  FAR.Description         31646 non-null  obj

In [None]:
#dropped_df['Aircraft.damage'].value_counts()

In [257]:
#dropped_df[dropped_df['Number.of.Engines'] == 0][['Aircraft.Category', 'Event.Date']]
#dropped_df[(dropped_df['Number.of.Engines'] == 0) & (dropped_df['Aircraft.Category'] == 'Airplane')]
#dropped_df[(dropped_df['Aircraft.damage'].isna()) & (dropped_df['Total.Fatal.Injuries'] > 0)]['Aircraft.damage'].fillna('Substantial')
#dropped_df['Aircraft.damage'].value_counts()

In [258]:
#dropped_df['Total.Fatal.Injuries'].value_counts().head(10)

Rows where Make or Model is null were already dropped by the `dropna` call in cell 251, so no further filtering is needed for those columns.

## IN PROGRESS

In [259]:
dropped_df[dropped_df['Number.of.Engines'].isna()][['Make', 'Model', 'Aircraft.Category']]

| | Make | Model | Aircraft.Category |
|---|---|---|---|
| 4 | Cessna | 501 | |
| 3600 | Piccard | AX-6 | |
| 3741 | Schweizer | 2-33A | |
| 3772 | Schweizer | SGS 1-26B | |
| 3870 | Pratt-Read | PRG-1 | |
| ... | ... | ... | ... |
| 88883 | Air Tractor | AT502 | |
| 88884 | Piper | PA-28-151 | |
| 88885 | Bellanca | 7ECA | |
| 88887 | Cessna | 210N | |


Next, we'll look at the Aircraft Category column. There is still useful information here, but a lot of it is missing. We will try to fill in missing values in this column with info from the Number of Engines column, following these rules:
- 1 engine = 'Small Aircraft'
- 2 engines = 'Medium Aircraft'
- more than 2 engines = 'Large Aircraft'

In [260]:
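def num_engines_to_size(record):
    """Map a row's engine count to a size label; return None if no rule applies."""
    engines = record['Number.of.Engines']
    if engines == 1:
        return 'Small Aircraft'
    elif engines == 2:
        return 'Medium Aircraft'
    elif engines > 2:
        return 'Large Aircraft'
    else:
        # 0 engines or NaN: none of the rules apply, so leave the value missing
        return None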
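One way the rules could be used to fill only the missing categories is sketched below on a toy frame. The helper `engines_to_size` and the toy values are hypothetical; the sketch assumes that rows with 0 engines or a missing engine count should stay unfilled, and that existing categories should not be overwritten.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; the last row already has a category.
toy_df = pd.DataFrame({
    'Number.of.Engines': [1.0, 2.0, 4.0, np.nan, 0.0, 1.0],
    'Aircraft.Category': [None, None, None, None, None, 'Helicopter'],
})

def engines_to_size(n):
    """Hypothetical helper: apply the three rules to a single engine count."""
    if n == 1:
        return 'Small Aircraft'
    if n == 2:
        return 'Medium Aircraft'
    if n > 2:
        return 'Large Aircraft'
    return None  # NaN or 0 engines: no rule applies

# Build a label for every row, then use it only where the category is missing.
size_labels = toy_df['Number.of.Engines'].map(engines_to_size)
toy_df['Aircraft.Category'] = toy_df['Aircraft.Category'].fillna(size_labels)
print(toy_df)
```

Passing a Series to `fillna` aligns on the index, so each missing category gets the label from its own row and the existing 'Helicopter' value is left alone.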