# Phase 1 Code Challenge
This code challenge is designed to test your understanding of the Phase 1 material. It covers:

- Pandas
- Data Visualization
- Exploring Statistical Data
- Python Data Structures

*Read the instructions carefully.* Your code will need to meet detailed specifications to pass automated tests.

## Code Tests

We have provided some code tests for you to run to check that your work meets the item specifications. Passing these tests does not necessarily mean that you have gotten the item correct - there are additional hidden tests. However, if any of the tests do not pass, this tells you that your code is incorrect and needs changes to meet the specification. To determine what the issue is, read the comments in the code test cells, the error message you receive, and the item instructions.

---
## Part 1: Pandas [Suggested Time: 15 minutes]
---
In this part, you will preprocess a dataset from the video game [FIFA19](https://www.kaggle.com/karangadiya/fifa19), which contains data from the players' real-life careers.

In [4]:
# Run this cell without changes

import pandas as pd
import numpy as np
from numbers import Number
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt

### 1.1) Read `fifa.csv` into a pandas DataFrame named `df`

Use pandas to create a new DataFrame, called `df`, containing the data from the dataset in the file `fifa.csv` in the folder containing this notebook. 

Hint: Use the string `'./fifa.csv'` as the file reference.

In [5]:
## load data into dataframe
df = pd.read_csv('./AviationData.csv', encoding='latin-1')

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88889 entries, 0 to 88888
Data columns (total 31 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                88889 non-null  object 
 1   Investigation.Type      88889 non-null  object 
 2   Accident.Number         88889 non-null  object 
 3   Event.Date              88889 non-null  object 
 4   Location                88837 non-null  object 
 5   Country                 88663 non-null  object 
 6   Latitude                34382 non-null  object 
 7   Longitude               34373 non-null  object 
 8   Airport.Code            50249 non-null  object 
 9   Airport.Name            52790 non-null  object 
 10  Injury.Severity         87889 non-null  object 
 11  Aircraft.damage         85695 non-null  object 
 12  Aircraft.Category       32287 non-null  object 
 13  Registration.Number     87572 non-null  object 
 14  Make                    88826 non-null

In [7]:
## removing Month and Day from the Event.Date column
df['Event.Date'] = df['Event.Date'].str[:-6]
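
The `[:-6]` slice above only works if every `Event.Date` value is a string ending in a six-character `-MM-DD` suffix. As an alternative, here is a minimal sketch that parses the dates and reads the year from the parsed value. The sample values are made up for illustration, and the use of `errors='coerce'` is an assumption about how malformed dates might be handled, not something this notebook does.

```python
import pandas as pd

# Hypothetical sample values shaped like the YYYY-MM-DD strings the slice assumes
sample = pd.DataFrame({'Event.Date': ['1948-10-24', '2009-01-01', '2022-12-26']})

# Slicing approach: drop the trailing '-MM-DD' (6 characters), then cast to int
year_by_slice = sample['Event.Date'].str[:-6].astype('int64')

# Parsing approach: convert to datetime, then take the year component
parsed = pd.to_datetime(sample['Event.Date'], format='%Y-%m-%d', errors='coerce')
year_by_parse = parsed.dt.year

print(year_by_slice.tolist())
print(year_by_parse.tolist())
```

Both approaches give the same years on well-formed input. Parsing has the advantage that an unexpected format turns into a missing value that can be inspected, instead of a silently wrong substring.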

In [8]:
## counting incidents per year (value_counts returns a Series, not a DataFrame)
df_by_year = df['Event.Date'].value_counts()
df_by_year = df_by_year.sort_index()
df_by_year

1948       1
1962       1
1974       1
1977       1
1979       2
1981       1
1982    3593
1983    3556
1984    3457
1985    3096
1986    2880
1987    2828
1988    2730
1989    2544
1990    2518
1991    2462
1992    2355
1993    2313
1994    2257
1995    2309
1996    2187
1997    2148
1998    2226
1999    2209
2000    2220
2001    2063
2002    2020
2003    2085
2004    1952
2005    2031
2006    1851
2007    2016
2008    1893
2009    1783
2010    1786
2011    1850
2012    1835
2013    1561
2014    1535
2015    1582
2016    1664
2017    1638
2018    1681
2019    1624
2020    1392
2021    1545
2022    1607
Name: Event.Date, dtype: int64

In [9]:
## converting Event.Date column to type int
df['Event.Date'] = df['Event.Date'].astype(np.int64)

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88889 entries, 0 to 88888
Data columns (total 31 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                88889 non-null  object 
 1   Investigation.Type      88889 non-null  object 
 2   Accident.Number         88889 non-null  object 
 3   Event.Date              88889 non-null  int64  
 4   Location                88837 non-null  object 
 5   Country                 88663 non-null  object 
 6   Latitude                34382 non-null  object 
 7   Longitude               34373 non-null  object 
 8   Airport.Code            50249 non-null  object 
 9   Airport.Name            52790 non-null  object 
 10  Injury.Severity         87889 non-null  object 
 11  Aircraft.damage         85695 non-null  object 
 12  Aircraft.Category       32287 non-null  object 
 13  Registration.Number     87572 non-null  object 
 14  Make                    88826 non-null

In [11]:
## creating new dataframe of incidents from 2009 onward (copied so later column assignments do not touch df)
df_after_09 = df.loc[df['Event.Date'] >= 2009].copy()
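
Later cells assign new values to columns of `df_after_09`. When a filtered frame is going to be modified, taking an explicit `.copy()` is a common way to keep it independent of the original and to avoid pandas' `SettingWithCopyWarning`, which this notebook would not show because warnings are silenced in the first cell. The sketch below uses a made-up three-row frame, and the names `toy` and `recent` are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of the incident table
toy = pd.DataFrame({
    'Event.Date': [2007, 2009, 2015],
    'Make': ['Cessna', 'Boeing', 'Piper'],
})

# The boolean mask keeps 2009 and later; .copy() makes the result independent of `toy`
recent = toy.loc[toy['Event.Date'] >= 2009].copy()

# Assigning to the copy leaves the original frame untouched
recent['Make'] = recent['Make'].str.lower()

print(recent)
print(toy['Make'].tolist())
```

Note that the filtered frame keeps the original row labels (here 1 and 2), which is why `df_after_09` is indexed from 65806 and not from 0.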

In [12]:
df_after_09.head()

Unnamed: 0,Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Latitude,Longitude,Airport.Code,Airport.Name,...,Purpose.of.flight,Air.carrier,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date
65806,20090101X22627,Accident,ERA09CA119,2009,"Lancaster, NY",United States,425519N,0783646W,BQR,Buffalo-Lancaster Regional,...,Instructional,"Bob Miller Flight Training, Inc.",0.0,0.0,0.0,1.0,VMC,,The student pilot's failure to maintain direct...,25-09-2020
65807,20090102X02015,Accident,CEN09FA111,2009,"Joliet, IL",United States,041316N,0881053W,JOT,Joliet Regional Airport,...,Personal,Stuart Seffern,2.0,0.0,0.0,0.0,VMC,,The pilots failure to maintain adequate flyin...,25-09-2020
65808,20090105X01036,Incident,ENG09IA002,2009,"Atlanta, GA",United States,033451N,0842326W,KATL,Atlanta International Airport,...,,Delta Air Lines,0.0,0.0,0.0,257.0,,,The fan blade fractured due to a fatigue crack...,25-09-2020
65809,20090105X22709,Accident,ERA09LA128,2009,"Philadelphia, MS",United States,324946N,0885711W,,,...,Positioning,QUINN AVIATION INC,1.0,0.0,0.0,0.0,IMC,,The pilot's improper preflight weather plannin...,25-09-2020
65810,20090113X30943,Accident,CEN09WA129,2009,"Molesmes, France",France,047360N,0003260E,,,...,Unknown,,1.0,0.0,0.0,0.0,,,Under the jurisdiction and control of the Fren...,03-11-2020


In [13]:
## dropping columns which are not needed for EDA
df_after_09 = df_after_09.drop(columns=['Accident.Number', 'Location', "Latitude", 'Longitude', 'Airport.Code', 'Airport.Name', 'Registration.Number',
'Schedule', 'Air.carrier', 'Report.Status', 'Publication.Date'])
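
`drop(columns=...)` raises a `KeyError` if any listed column is missing, for example when the cell is re-run after the columns are already gone. A minimal sketch of a more forgiving variant, assuming that silently skipping absent columns is acceptable. The toy frame and the `unneeded` list are hypothetical.

```python
import pandas as pd

# Hypothetical frame with only a few of the real column names
toy = pd.DataFrame({
    'Event.Id': ['a1', 'b2'],
    'Location': ['Lancaster, NY', 'Joliet, IL'],
    'Make': ['Cessna', 'Boeing'],
})

# 'Schedule' is absent from this toy frame; errors='ignore' skips it instead of raising KeyError
unneeded = ['Location', 'Schedule']
trimmed = toy.drop(columns=unneeded, errors='ignore')

print(trimmed.columns.tolist())
```

The trade-off is that a typo in a column name also goes unnoticed, so the default strict behaviour may be preferable while the column list is still being worked out.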

In [14]:
df_after_09.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23083 entries, 65806 to 88888
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                23083 non-null  object 
 1   Investigation.Type      23083 non-null  object 
 2   Event.Date              23083 non-null  int64  
 3   Country                 23083 non-null  object 
 4   Injury.Severity         22105 non-null  object 
 5   Aircraft.damage         21598 non-null  object 
 6   Aircraft.Category       22655 non-null  object 
 7   Make                    23044 non-null  object 
 8   Model                   23035 non-null  object 
 9   Amateur.Built           23083 non-null  object 
 10  Number.of.Engines       19702 non-null  float64
 11  Engine.Type             17680 non-null  object 
 12  FAR.Description         22373 non-null  object 
 13  Purpose.of.flight       18830 non-null  object 
 14  Total.Fatal.Injuries    23083 non-

In [15]:
## converting all values in column 'Make' to lower-case strings
df_after_09['Make'] = df_after_09['Make'].astype(str).str.lower()


In [16]:
## converting all rows in column 'Make' to objects
df_after_09['Make'] = df_after_09['Make'].astype(object)

In [17]:
## Boeing is listed in several different formats in the data, e.g. 'The Boeing Company'. This code checks every value in 'Make'
## for the substring 'boeing' and replaces matching values with the standard 'boeing'

df_after_09.loc[df_after_09['Make'].str.contains('boeing'), 'Make'] = 'boeing'


In [18]:

def normalize_company_names(df, column_name, company_name):
    '''Takes in a DataFrame df and checks every value in column_name for the substring company_name.
        If the substring is present, the value is overwritten with company_name.
        Modifies df in place and returns it.'''
    # A single .loc call with a row mask and a column label avoids chained assignment
    df.loc[df[column_name].str.contains(company_name), column_name] = company_name
    return df
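
To see the substring normalization on a small input, here is a self-contained sketch. `normalize_make` is a hypothetical variant of the function above, not a replacement for it: it adds `na=False` and `regex=False` on the assumption that missing values should be left alone and that the company name should be matched literally. The sample values are invented.

```python
import pandas as pd

def normalize_make(frame, column_name, company_name):
    """Hypothetical variant: overwrite every value containing company_name with company_name."""
    # na=False treats missing values as non-matches; regex=False matches the literal substring
    mask = frame[column_name].str.contains(company_name, na=False, regex=False)
    # One .loc call with row mask and column label writes directly into the frame
    frame.loc[mask, column_name] = company_name
    return frame

# Invented sample, including a missing value
toy = pd.DataFrame({'Make': ['the boeing company', 'boeing', 'piper aircraft inc', None, 'cessna']})

normalize_make(toy, 'Make', 'boeing')
normalize_make(toy, 'Make', 'piper')

print(toy['Make'].tolist())
```

In the notebook itself `Make` has already been passed through `astype(str)`, so missing values have become the string `'nan'` and `na=False` would make no difference there. It matters only if the function is reused on a column that still contains real missing values.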


In [19]:
## Checking function by normalizing all companies with 'piper'
normalize_company_names(df_after_09, 'Make', 'piper')

Unnamed: 0,Event.Id,Investigation.Type,Event.Date,Country,Injury.Severity,Aircraft.damage,Aircraft.Category,Make,Model,Amateur.Built,Number.of.Engines,Engine.Type,FAR.Description,Purpose.of.flight,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight
65806,20090101X22627,Accident,2009,United States,Non-Fatal,Substantial,Airplane,cessna,172SP,No,1.0,Reciprocating,091,Instructional,0.0,0.0,0.0,1.0,VMC,
65807,20090102X02015,Accident,2009,United States,Fatal,Destroyed,Airplane,lantzair flyers inc,LANCAIR 360,Yes,1.0,Reciprocating,091,Personal,2.0,0.0,0.0,0.0,VMC,
65808,20090105X01036,Incident,2009,United States,Non-Fatal,Minor,Airplane,boeing,777,No,2.0,Turbo Fan,121,,0.0,0.0,0.0,257.0,,
65809,20090105X22709,Accident,2009,United States,Fatal,Substantial,Airplane,air tractor inc,AT-602,No,1.0,Turbo Prop,091,Positioning,1.0,0.0,0.0,0.0,IMC,
65810,20090113X30943,Accident,2009,France,Fatal,Destroyed,Helicopter,eurocopter deutsch,EC 135 T2,No,1.0,,NUSN,Unknown,1.0,0.0,0.0,0.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88884,20221227106491,Accident,2022,United States,Minor,,,piper,PA-28-151,No,,,091,Personal,0.0,1.0,0.0,0.0,,
88885,20221227106494,Accident,2022,United States,,,,bellanca,7ECA,No,,,,,0.0,0.0,0.0,0.0,,
88886,20221227106497,Accident,2022,United States,Non-Fatal,Substantial,Airplane,american champion aircraft,8GCBC,No,1.0,,091,Personal,0.0,0.0,0.0,1.0,VMC,
88887,20221227106498,Accident,2022,United States,,,,cessna,210N,No,,,091,Personal,0.0,0.0,0.0,0.0,,


In [22]:
## Normalizing the company names of the top aircraft manufacturers
normalize_company_names(df_after_09, 'Make', 'boeing')
normalize_company_names(df_after_09, 'Make', 'airbus')
normalize_company_names(df_after_09, 'Make', 'cessna')
normalize_company_names(df_after_09, 'Make', 'beech')
normalize_company_names(df_after_09, 'Make', 'cirrus')




In [21]:
## Looking at the top 20 companies by number of incidents
df_after_09['Make'].value_counts().head(20)

cessna                         5509
piper                          3236
beech                          1226
boeing                         1207
bell                            631
robinson                        375
cirrus                          353
airbus                          311
mooney                          275
robinson helicopter             219
air tractor inc                 217
robinson helicopter company     185
aeronca                         161
bellanca                        159
maule                           156
schweizer                       152
air tractor                     145
hughes                          144
embraer                         139
champion                        130
Name: Make, dtype: int64
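
The counts above still split some manufacturers across several spellings, such as `robinson`, `robinson helicopter`, and `robinson helicopter company`, or `air tractor inc` and `air tractor`. A minimal sketch of collapsing such variants in a loop, using only spellings visible in that output. The `canonical_names` list is an assumption for illustration and is not part of the notebook.

```python
import pandas as pd

# Hypothetical sample built from variant spellings visible in the counts above
makes = pd.Series([
    'robinson', 'robinson helicopter', 'robinson helicopter company',
    'air tractor inc', 'air tractor', 'cessna',
], name='Make')

# Substrings to collapse onto; this list is an assumption for the sketch
canonical_names = ['robinson', 'air tractor']

for name in canonical_names:
    # Literal substring match; missing values count as non-matches
    mask = makes.str.contains(name, na=False, regex=False)
    makes.loc[mask] = name

counts = makes.value_counts()
print(counts)
```

Substring matching needs some care with short names: `'bell'` is also a substring of `'bellanca'`, so normalizing on `'bell'` this way would merge two different manufacturers from the list above.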