# Intro

Investigating Airplane Accidents

In this project, I work with a data set of 77,282 aviation accidents that occurred in the U.S., along with the metadata associated with them. The data is stored in the `AviationData.txt` file, which comes from the National Transportation Safety Board (NTSB).

I'm going to walk through this project by:

 - Cleaning the data.

 - Using algorithms to:
   - Search for a string like "LAX94LA336" through rows.

   - Create a hash table and search for "LAX94LA336" again, to compare time complexity between the two methods.

   - Investigate the states where most accidents occurred.

   - Calculate the number of fatalities and serious injuries by month.

   - Calculate the worst air carriers.

   - Find the percentage of accidents occurring in adverse weather conditions.

## The Data

In the text file, each record sits on its own line, and the fields within a record are separated by a pipe character (`|`). We'll read the file in and do some preliminary cleaning.

In [14]:
aviation_list = []
aviation_data = []

with open('AviationData.txt', 'r') as file:
    for line in file:
        aviation_data.append(line)
        text = line.split('|')
        words = []
        for word in text:
            word = word.strip()
            words.append(word)
        aviation_list.append(words)

print(aviation_data[1]) # 0th row contains the headers
        
print(aviation_list[1])

20150908X74637 | Accident | CEN15LA402 | 09/08/2015 | Freeport, IL | United States | 42.246111 | -89.581945 | KFEP | albertus Airport | Non-Fatal | Substantial | Unknown | N24TL | CLARKE REGINALD W | DRAGONFLY MK |  |  |  | Part 91: General Aviation |  | Personal |  |  | 1 |  |  | VMC | TAKEOFF | Preliminary | 09/09/2015 | 

['20150908X74637', 'Accident', 'CEN15LA402', '09/08/2015', 'Freeport, IL', 'United States', '42.246111', '-89.581945', 'KFEP', 'albertus Airport', 'Non-Fatal', 'Substantial', 'Unknown', 'N24TL', 'CLARKE REGINALD W', 'DRAGONFLY MK', '', '', '', 'Part 91: General Aviation', '', 'Personal', '', '', '1', '', '', 'VMC', 'TAKEOFF', 'Preliminary', '09/09/2015', '']


In [15]:
def linear_search(code):
    lax_code = []    
    for row in aviation_list:
        for item in row:
            if item == code:
                lax_code.append(row)
    return lax_code


lin_search = linear_search('LAX94LA336')

print(lin_search[0])

['20001218X45447', 'Accident', 'LAX94LA336', '07/19/1962', 'BRIDGEPORT, CA', 'United States', '', '', '', '', 'Fatal(4)', 'Destroyed', '', 'N5069P', 'PIPER', 'PA24-180', 'No', '1', 'Reciprocating', '', '', 'Personal', '', '4', '0', '0', '0', 'UNK', 'UNKNOWN', 'Probable Cause', '09/19/1996', '']


## Hash table - dictionary 

We'll now store the data in a dictionary:

 - Create an empty list and name it `aviation_dict_list`.
 - Loop through each item in `aviation_data` and split it on the pipe character (`|`).
 - Convert the split row to a dictionary. The dictionary should use the column names as keys and the row's fields as values. Here's an example of a single item:
 - `{"Event Id": "20150908X74637", "Investigation Type": "Accident", ...}`
 - Append the result to `aviation_dict_list`.
 - Create an empty list and name it `lax_dict`.
 - Search through `aviation_dict_list` for `LAX94LA336`. This value could be under any key in any dictionary.
 - When you find the value, append the entire dictionary to `lax_dict`.

In [16]:
def dictionary(l):
    # Clean input and create a list of keys for a dictionary    
    not_yet_keys = l[0].split('|')
    keys = []
    for key in not_yet_keys:
        key = key.strip()
        keys.append(key)
    
    # Get the values for the keys
    values = []
    for n in range(1, len(l)):
        not_yet_values = l[n].split('|')
        clean_values = []
        for value in not_yet_values:
            value = value.strip()
            clean_values.append(value)
        values.append(clean_values)
     
    # Pair the values to the keys
    aviation_dict_list = []
    for y in range(0, len(values)):
        paired = {}
        for x in range(0, len(keys)):        
            paired[keys[x]] = values[y][x]
        aviation_dict_list.append(paired)    
    return aviation_dict_list
        

        
aviation_dict_list = dictionary(aviation_data)
aviation_dict_list[1]

{'Event Id': '20150906X32704',
 'Investigation Type': 'Accident',
 'Accident Number': 'ERA15LA339',
 'Event Date': '09/05/2015',
 'Location': 'Laconia, NH',
 'Country': 'United States',
 'Latitude': '43.606389',
 'Longitude': '-71.452778',
 'Airport Code': 'LCI',
 'Airport Name': 'Laconia Municipal Airport',
 'Injury Severity': 'Fatal(1)',
 'Aircraft Damage': 'Substantial',
 'Aircraft Category': 'Weight-Shift',
 'Registration Number': 'N2264X',
 'Make': 'EVOLUTION AIRCRAFT INC',
 'Model': 'REVO',
 'Amateur Built': 'No',
 'Number of Engines': '1',
 'Engine Type': 'Reciprocating',
 'FAR Description': 'Part 91: General Aviation',
 'Schedule': '',
 'Purpose of Flight': 'Personal',
 'Air Carrier': '',
 'Total Fatal Injuries': '1',
 'Total Serious Injuries': '',
 'Total Minor Injuries': '',
 'Total Uninjured': '',
 'Weather Condition': 'VMC',
 'Broad Phase of Flight': 'MANEUVERING',
 'Report Status': 'Preliminary',
 'Publication Date': '09/10/2015',
 '': ''}

In [17]:
def dict_search(dict_list, target):
    lax_dict = []
    for x in range(0, len(dict_list)):
        for value in dict_list[x].values():
            if value == target:
                lax_dict.append(dict_list[x])
    return lax_dict


lax_dict = dict_search(aviation_dict_list, "LAX94LA336")

lax_dict[0]

{'Event Id': '20001218X45447',
 'Investigation Type': 'Accident',
 'Accident Number': 'LAX94LA336',
 'Event Date': '07/19/1962',
 'Location': 'BRIDGEPORT, CA',
 'Country': 'United States',
 'Latitude': '',
 'Longitude': '',
 'Airport Code': '',
 'Airport Name': '',
 'Injury Severity': 'Fatal(4)',
 'Aircraft Damage': 'Destroyed',
 'Aircraft Category': '',
 'Registration Number': 'N5069P',
 'Make': 'PIPER',
 'Model': 'PA24-180',
 'Amateur Built': 'No',
 'Number of Engines': '1',
 'Engine Type': 'Reciprocating',
 'FAR Description': '',
 'Schedule': '',
 'Purpose of Flight': 'Personal',
 'Air Carrier': '',
 'Total Fatal Injuries': '4',
 'Total Serious Injuries': '0',
 'Total Minor Injuries': '0',
 'Total Uninjured': '0',
 'Weather Condition': 'UNK',
 'Broad Phase of Flight': 'UNKNOWN',
 'Report Status': 'Probable Cause',
 'Publication Date': '09/19/1996',
 '': ''}

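Although the rows are now dictionaries, this search is still linear: `dict_search` visits every row and compares every value in it, so the work is O(n·m) for n rows and m columns, the same as `linear_search`. A hash table only pays off when we look a value up by its key, which avoids the scan altogether.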
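
A dictionary only beats the scan when the search value is used as a key. As a minimal sketch, assuming most lookups are by accident number, the rows can be indexed once and then fetched without looping. `build_index` and `sample_rows` are illustrative names that do not appear in the notebook, and the two sample rows are trimmed from the output shown earlier.

```python
def build_index(dict_list, key="Accident Number"):
    """Map each value of `key` to the list of rows that carry it."""
    index = {}
    for row in dict_list:
        index.setdefault(row.get(key, ""), []).append(row)
    return index


# Two trimmed rows standing in for aviation_dict_list
sample_rows = [
    {"Event Id": "20001218X45447", "Accident Number": "LAX94LA336"},
    {"Event Id": "20150908X74637", "Accident Number": "CEN15LA402"},
]

accident_index = build_index(sample_rows)

# Average-case constant-time lookup instead of a scan over every row
matches = accident_index.get("LAX94LA336", [])
print(matches)
```

Building the index is itself one O(n) pass, so it pays off only when several lookups follow. Unlike `dict_search`, it finds the code only under the indexed key, not under any key.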

## Accidents by US State

 - Count up how many accidents occurred in each U.S. state, and assign the result to `state_accidents`.
 - Parse the state out of the `Location` field (the code below takes its last two characters).
 - Sort `state_accidents`, and extract the name of the state with the most aviation accidents.
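
The parsing step can also be sketched by splitting on the last comma, which skips blank locations and values that do not end in a two-letter code. `extract_state` and `sample_locations` are hypothetical names, the `"Atlantic Ocean"` entry is made up, and the check only tests the shape of the code, not whether it is a real U.S. state.

```python
from collections import Counter


def extract_state(location):
    """Return the two-letter code after the last comma, or None if absent."""
    if "," not in location:
        return None
    candidate = location.rsplit(",", 1)[1].strip().upper()
    # Accept only something shaped like a state abbreviation
    if len(candidate) == 2 and candidate.isalpha():
        return candidate
    return None


sample_locations = ["Freeport, IL", "BRIDGEPORT, CA", "Laconia, NH", "", "Atlantic Ocean"]

state_counts = Counter(
    state for state in map(extract_state, sample_locations) if state is not None
)
print(state_counts.most_common(2))
```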

In [18]:
from collections import Counter

def most_state_accidents(data):
    state_accidents = []
    for x in range(0, len(data)):
        state_accidents.append(data[x]['Location'][-2:])
    state_count = Counter(state_accidents)
    return state_accidents, state_count.most_common(5)

state_accidents, accident_prone_states = most_state_accidents(aviation_dict_list)

accident_prone_states

[('CA', 8032), ('FL', 5118), ('TX', 5112), ('AK', 5049), ('AZ', 2502)]

The states with the highest number of airplane accidents are:
- California
- Florida
- Texas
- Alaska
- Arizona

We can't compare accident rates between states fairly, because the data has no exposure figures such as the number of flights or flight hours per state. From our research we know that most accidents occur during take-off or landing.

Next, we will look at which months have the most accidents.

# Fatalities & Injuries by Month

In [19]:
def worst_month_accidents(data):
    months = []
    change_month = {"01":"January",
                    "02":"February",
                    "03":"March",
                    "04":"April",
                    "05":"May",
                    "06":"June",
                    "07":"July",
                    "08":"August",
                    "09":"September",
                    "10":"October",
                    "11":"November",
                    "12":"December"}
    
    for x in range(0, len(data)):
        month = data[x]['Event Date'][0:2]
        try:
            month = change_month[month]
        except KeyError:
            month = data[x]['Event Id'][4:6]
            month = change_month[month]
        if data[x]['Event Date'] != '':
            year = data[x]['Event Date'][-4:]
        else:
            year = data[x]['Event Id'][0:4]
        months.append(month + ' ' + year)
        
    worst_months = Counter(months)
    return worst_months, worst_months.most_common(3)

month_count_accidents, worst_3_months_acc = worst_month_accidents(aviation_dict_list)

worst_3_months_acc

[('July 1982', 433), ('August 1983', 421), ('July 1983', 413)]

The worst months all fall in the summers of 1982 and 1983. We'd have to do a bit of external research to find out why this occurred.
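
The month labels can also be produced with `datetime` instead of a lookup table. This is a sketch rather than a drop-in replacement: `month_label` is a hypothetical helper, it assumes `Event Date` is `MM/DD/YYYY` and that `Event Id` starts with `YYYYMMDD` (as in the rows shown earlier), and it raises `ValueError` on malformed dates instead of falling back.

```python
from datetime import datetime


def month_label(row):
    """Return a label such as 'July 1962' for one accident row."""
    date_text = row.get("Event Date", "").strip()
    if date_text:
        parsed = datetime.strptime(date_text, "%m/%d/%Y")
    else:
        # Assumption: Event Id begins with YYYYMMDD, as in '20150908X74637'
        parsed = datetime.strptime(row["Event Id"][:8], "%Y%m%d")
    # %B follows the current locale; month names are English by default
    return parsed.strftime("%B %Y")


with_date = {"Event Id": "20001218X45447", "Event Date": "07/19/1962"}
without_date = {"Event Id": "20150908X74637", "Event Date": ""}

print(month_label(with_date))
print(month_label(without_date))
```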

## Worst months for injuries

We'll now look at the months with the most fatal and serious injuries:

In [20]:
def worst_month_injuries(data):
    injuries_by_month = {}
    change_month = {"01":"January",
                    "02":"February",
                    "03":"March",
                    "04":"April",
                    "05":"May",
                    "06":"June",
                    "07":"July",
                    "08":"August",
                    "09":"September",
                    "10":"October",
                    "11":"November",
                    "12":"December"}
    for x in range(0, len(data)):
        injuries = 0
        month = data[x]['Event Date'][0:2]
        try: 
            month = change_month[month]
        except KeyError:
            month = data[x]['Event Id'][4:6]
            month = change_month[month]
        if data[x]['Event Date'] != '':
            year = data[x]['Event Date'][-4:]
        else:
            year = data[x]['Event Id'][0:4]
        month = month + ' ' + year
        fatal = data[x]['Total Fatal Injuries']
        serious = data[x]['Total Serious Injuries']
        # Skip the blanks        
        if fatal:
            injuries += int(fatal)
        if serious:
            injuries += int(serious)
        injuries_by_month[month] = injuries
        injuries_by_month = Counter(injuries_by_month)        
        
    return injuries_by_month, injuries_by_month.most_common(3)
           
month_count_injuries, worst_3_months_inj  = worst_month_injuries(aviation_dict_list)

worst_3_months_inj

[('January 2007', 102), ('July 2002', 71), ('June 2010', 5)]

## Summary

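Two of the three months are again summer months, possibly because that is when most people are on vacation and take more trips, and one is a winter month. Read these figures with care, though: `injuries_by_month[month] = injuries` replaces the month's value on every row instead of adding to it, so each number is the fatal and serious injury count of the last accident processed for that month rather than a monthly total.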
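
Summing per month, rather than assigning, can be sketched with a `Counter` that accumulates as it goes. `injuries_by_month_total` is a hypothetical name, the sample rows are made up, and each row is assumed to already carry a precomputed `Month` label to keep the example short.

```python
from collections import Counter


def injuries_by_month_total(rows):
    """Sum fatal and serious injuries per 'Month' label."""
    totals = Counter()
    for row in rows:
        for field in ("Total Fatal Injuries", "Total Serious Injuries"):
            value = row.get(field, "").strip()
            if value:  # skip the blanks
                totals[row["Month"]] += int(value)
    return totals


# Made-up rows; 'Month' is an assumed, precomputed label
sample_rows = [
    {"Month": "July 1983", "Total Fatal Injuries": "2", "Total Serious Injuries": ""},
    {"Month": "July 1983", "Total Fatal Injuries": "1", "Total Serious Injuries": "3"},
    {"Month": "May 1990", "Total Fatal Injuries": "", "Total Serious Injuries": "1"},
]

monthly_totals = injuries_by_month_total(sample_rows)
print(monthly_totals.most_common(1))
```

Because a missing key in a `Counter` counts as zero, `+=` works without initializing each month first.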

Next steps:

 - Map out accidents using the cartopy library for matplotlib.
 - Count the number of accidents by air carrier.
 - Count the number of accidents by airplane make and model.
 - Figure out what percentage of accidents occur under adverse weather conditions.

In [23]:
# Makes with the most accidents

def most_makes_accidents(data):
    makes_accidents = []
    for x in range(0, len(data)):
        makes_accidents.append(data[x]['Make'])
    make_count = Counter(makes_accidents)
    return makes_accidents, make_count.most_common(5)

make_accidents, accident_prone_makes = most_makes_accidents(aviation_dict_list)

accident_prone_makes

[('CESSNA', 16611),
 ('PIPER', 9183),
 ('Cessna', 7739),
 ('Piper', 4096),
 ('BEECH', 3031)]

In [22]:
# Models with the most accidents

def most_model_accidents(data):
    model_accidents = []
    for x in range(0, len(data)):
        model_accidents.append(data[x]['Model'])
    model_count = Counter(model_accidents)
    return model_accidents, model_count.most_common(5)

model_accidents, accident_prone_models = most_model_accidents(aviation_dict_list)

accident_prone_models


[('152', 2251),
 ('172', 1164),
 ('172N', 1121),
 ('PA-28-140', 900),
 ('172M', 771)]

## Summary

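The [Cessna 152](https://en.wikipedia.org/wiki/Cessna_152) is a single-engine aircraft that is frequently used for training, so it makes sense that it shows up in more accidents. Piper and Beech also make small aircraft, which typically aren't used for commercial flights. Note that the makes are counted case-sensitively, so `CESSNA` and `Cessna` (and `PIPER` and `Piper`) appear as separate entries.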

It might be interesting to see what the most common model involved in accidents is among commercial aircraft. 

In [38]:
# Airlines with the most accidents
def air_carriers_acc(n, data):
    air_carriers = []
    for i in range(0, len(data)):
        air_carrier = data[i]['Air Carrier']
        if air_carrier != '':
            air_carriers.append(air_carrier)
    ac_count = Counter(air_carriers)
    return ac_count.most_common(n)

top_10 = air_carriers_acc(10, aviation_dict_list)
print(top_10)

[('UNITED AIRLINES', 49), ('AMERICAN AIRLINES', 41), ('CONTINENTAL AIRLINES', 25), ('USAIR', 24), ('DELTA AIR LINES INC', 23), ('AMERICAN AIRLINES, INC.', 22), ('SOUTHWEST AIRLINES CO', 21), ('CONTINENTAL AIRLINES, INC.', 19), ('UNITED AIR LINES INC', 14), ('US AIRWAYS INC', 12)]


## Findings

We have to be careful here, because some carriers appear under more than one name (for example, with and without "INC").

Combining the two spellings of each, American Airlines (41 + 22) ties with United (49 + 14) for the most accidents at 63. However, we know that United and Continental merged around 2010.

If we count United Airlines, United Air Lines Inc, and Continental Airlines, Inc. together, we get a total of 82; adding the separate `CONTINENTAL AIRLINES` entry (25) brings it to 107.
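
The merging arithmetic can be sketched with a small alias table applied to the printed counts. `CARRIER_ALIASES` is hypothetical and hand-written, not derived from the data, and it only covers spellings visible in the top ten; variants further down the list would be missed.

```python
from collections import Counter

# Counts copied from the top-10 output above
top_10 = [
    ("UNITED AIRLINES", 49), ("AMERICAN AIRLINES", 41),
    ("CONTINENTAL AIRLINES", 25), ("USAIR", 24),
    ("DELTA AIR LINES INC", 23), ("AMERICAN AIRLINES, INC.", 22),
    ("SOUTHWEST AIRLINES CO", 21), ("CONTINENTAL AIRLINES, INC.", 19),
    ("UNITED AIR LINES INC", 14), ("US AIRWAYS INC", 12),
]

# Hypothetical alias table: variant spelling -> canonical name
CARRIER_ALIASES = {
    "AMERICAN AIRLINES, INC.": "AMERICAN AIRLINES",
    "UNITED AIR LINES INC": "UNITED AIRLINES",
    "CONTINENTAL AIRLINES, INC.": "CONTINENTAL AIRLINES",
}

merged = Counter()
for name, count in top_10:
    # Names without an alias are kept as they are
    merged[CARRIER_ALIASES.get(name, name)] += count

print(merged.most_common(3))
```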

In [39]:
# Modified most_makes_accidents to find commercial flights
def most_commercial_accidents(data):
    makes_accidents = []
    for x in range(0, len(data)):
        air_carrier = data[x]['Air Carrier']
        make = data[x]['Make']
        
        if air_carrier != '' and make != 'CESSNA' :
            makes_accidents.append(data[x]['Make'])
            
    make_count = Counter(makes_accidents)
    return makes_accidents, make_count.most_common(5)

make_accidents, accident_prone_makes = most_commercial_accidents(aviation_dict_list)

accident_prone_makes

[('BOEING', 454),
 ('PIPER', 284),
 ('Cessna', 207),
 ('BEECH', 204),
 ('Boeing', 181)]

It looks like we need to clean the data further: the makes should be normalized to a single case before counting. The filter above only drops the uppercase spelling `CESSNA`, so rows recorded as `Cessna` are still counted. For now, though, the result is sufficient.

Boeing is very likely the most common make among commercial carriers in the US, so even if we filtered further by Boeing model the results would probably not tell us much.

It's somewhat of a surprise to see Cessna still among the top makes, even for flights with an air carrier listed.
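
The case clean-up can be sketched by normalizing each make before counting. `count_makes` and `sample_makes` are hypothetical names, and the sketch uppercases rather than lowercases only because that matches the most common spelling in the output; either choice merges the duplicates.

```python
from collections import Counter


def count_makes(makes):
    """Count makes case-insensitively, ignoring blanks."""
    normalized = (make.strip().upper() for make in makes)
    return Counter(make for make in normalized if make)


# Made-up sample with mixed spellings and one blank
sample_makes = ["CESSNA", "Cessna", "cessna ", "PIPER", "Piper", "BOEING", ""]

make_counts = count_makes(sample_makes)
print(make_counts.most_common(2))  # [('CESSNA', 3), ('PIPER', 2)]
```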

In [11]:
# Accidents by weather condition
def worst_weather(data):
    weathers = []
    for i in range(0, len(data)):
        weather = data[i]['Weather Condition']
        if weather != '':
            weathers.append(weather)
    weather_count = Counter(weathers)
    percentage = [(x, weather_count[x]/len(data)*100) for x in weather_count]
    return sorted(percentage, key=lambda x: x[1], reverse=True) 

# Use a new name so the worst_weather function is not overwritten by its result
weather_percentages = worst_weather(aviation_dict_list)
print(weather_percentages)

[('VMC', 89.024469145068), ('IMC', 7.220403462688112), ('UNK', 1.1969306815388)]


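Interestingly, far more accidents happen in Visual Meteorological Conditions (VMC, about 89%) than in Instrument Meteorological Conditions (IMC, about 7%). The percentages are taken over all records, including those with no weather recorded, which is why they don't sum to 100. Without knowing how much flying takes place in each condition, we can't say whether bad weather raises the accident rate; what the data does show is that adverse weather is recorded for only a small share of accidents.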
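
The choice of denominator changes the percentages, which the following sketch makes explicit. `weather_shares` and its `skip_blank` flag are hypothetical, and the sample rows are made up; with `skip_blank=False` the blank condition is reported as its own entry, which differs slightly from the cell above.

```python
from collections import Counter


def weather_shares(rows, skip_blank=True):
    """Percentage of rows per 'Weather Condition' value."""
    conditions = [row.get("Weather Condition", "").strip() for row in rows]
    if skip_blank:
        # Divide by the rows that actually have a weather value
        conditions = [c for c in conditions if c]
    total = len(conditions)
    if total == 0:
        return []
    counts = Counter(conditions)
    return [(name, count / total * 100) for name, count in counts.most_common()]


# Made-up rows: three VMC, one IMC, one blank
sample_rows = [{"Weather Condition": w} for w in ["VMC", "VMC", "VMC", "IMC", ""]]

known_only = weather_shares(sample_rows)
all_rows = weather_shares(sample_rows, skip_blank=False)
print(known_only)
print(all_rows)
```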

In [26]:
# Accidents by Flight Phase
def maniobra_accidents(data):
    maneuvers_accidents = []
    for x in range(0, len(data)):
        maneuvers_accidents.append(data[x]['Broad Phase of Flight'])
    maniobra_count = Counter(maneuvers_accidents)
    return maneuvers_accidents, maniobra_count.most_common(5)

maneuvers_accidents, accident_prone_moves = maniobra_accidents(aviation_dict_list)

accident_prone_moves
 

[('LANDING', 18569),
 ('TAKEOFF', 14751),
 ('CRUISE', 10598),
 ('MANEUVERING', 9502),
 ('APPROACH', 7513)]

As expected, landing and take-off are the two phases with the most accidents. Together they account for 33,320 accidents, or about 43% of the 77,282 in the data set.