# **Exploratory Data Analysis**
---

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('dark_background')

## **1) Core Objectives**
---

### Overview:

Our primary objective with this exploratory analysis is to identify the features in our data that best represent our understanding of the [root causes of flight delays](../Research/flight_delay_reasons.md) and, in so doing, provide the basis for the **accurate** and **reliable** prediction of commercial flight delays. We want to gauge how well our feature space captures common flight delay reasons so that we can decide which features to select, which additional data to gather, and which features to re-engineer or exclude.

### [Suggested Considerations](https://github.com/lighthouse-labs/mid-term-project-I/blob/master/exploratory_analysis.ipynb):

- Test the hypothesis that delays follow a Normal distribution and that the mean delay is 0. Be careful about outliers.
- Does the average/median monthly delay vary during the year? If so, which months have the biggest delays and what could be the reason?
- Does the weather affect the delay?
- How do taxi times change during the day? Does higher traffic lead to longer taxi times?
- What is the average percentage of delays that exist prior to departure (*i.e.* are arrival delays caused by departure delays)? Are airlines able to lower the delay during the flights?
- How many states cover 50% of US air traffic?
- Test the hypothesis that planes fly faster when there is a departure delay.
- When (which hour) do most 'LONG', 'SHORT', 'MEDIUM' haul flights take off?
- Find the 10 busiest airports. Does the greatest number of flights mean that the majority of passengers went through a given airport? How much traffic do these 10 airports cover?
- Do bigger delays lead to bigger fuel consumption per passenger?
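
As an illustration of the first consideration above (a mean-zero check that guards against outliers), here is a minimal sketch on synthetic data. The simulated `arr_delay` values, the 1.5×IQR trimming rule, and the large-sample z approximation are all assumptions made for the example, not choices taken from the project.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for an arrival-delay column (minutes); NOT real flight data.
rng = np.random.default_rng(0)
delays = pd.Series(rng.exponential(scale=30, size=5000) - 20, name="arr_delay")

def trim_iqr(s, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return s[(s >= q1 - k * iqr) & (s <= q3 + k * iqr)]

def mean_zero_z(s):
    """z statistic for H0: mean == 0 (large-sample approximation)."""
    return s.mean() / (s.std(ddof=1) / np.sqrt(len(s)))

trimmed = trim_iqr(delays)
skewness = delays.skew()          # strong skew argues against normality
z_raw = mean_zero_z(delays)
z_trimmed = mean_zero_z(trimmed)

print(f"skewness (raw): {skewness:.2f}")
print(f"z (raw): {z_raw:.2f}, z (trimmed): {z_trimmed:.2f}")
```

Comparing the statistic before and after trimming gives a rough sense of how much the conclusion depends on extreme delays. A formal normality test would be a natural next step on the real `flights` data.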

## **2) Data Cleaning**
---

### **flights**:

In [2]:
flights = pd.read_csv('../Data/files/flights_sample.csv')
flights.head()


| Unnamed: 0 | fl_date | mkt_unique_carrier | branded_code_share | mkt_carrier | mkt_carrier_fl_num | op_unique_carrier | tail_num | op_carrier_fl_num | origin_airport_id | origin | ... | distance | carrier_delay | weather_delay | nas_delay | security_delay | late_aircraft_delay | first_dep_time | total_add_gtime | longest_add_gtime | no_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-05-19 | UA | UA_CODESHARE | UA | 4264 | EV | N48901 | 4264 | 12266 | IAH | ... | 127 | | | | | | | | | |
| 1 | 2019-05-19 | UA | UA_CODESHARE | UA | 4266 | EV | N12540 | 4266 | 13244 | MEM | ... | 468 | | | | | | | | | |
| 2 | 2019-05-19 | UA | UA_CODESHARE | UA | 4272 | EV | N11164 | 4272 | 12266 | IAH | ... | 1091 | | | | | | | | | |
| 3 | 2019-05-19 | UA | UA_CODESHARE | UA | 4281 | EV | N13995 | 4281 | 11042 | CLE | ... | 310 | | | | | | | | | |
| 4 | 2019-05-19 | UA | UA_CODESHARE | UA | 4286 | EV | N13903 | 4286 | 13061 | LRD | ... | 301 | | | | | | | | | |


In [3]:
flights.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2387955 entries, 0 to 2387954
Data columns (total 42 columns):
 #   Column               Dtype  
---  ------               -----  
 0   fl_date              object 
 1   mkt_unique_carrier   object 
 2   branded_code_share   object 
 3   mkt_carrier          object 
 4   mkt_carrier_fl_num   int64  
 5   op_unique_carrier    object 
 6   tail_num             object 
 7   op_carrier_fl_num    int64  
 8   origin_airport_id    int64  
 9   origin               object 
 10  origin_city_name     object 
 11  dest_airport_id      int64  
 12  dest                 object 
 13  dest_city_name       object 
 14  crs_dep_time         int64  
 15  dep_time             float64
 16  dep_delay            float64
 17  taxi_out             float64
 18  wheels_off           float64
 19  wheels_on            float64
 20  taxi_in              float64
 21  crs_arr_time         int64  
 22  arr_time             float64
 23  arr_delay            float64
 24

Preview missing data percentages for features with null values.

In [13]:
has_missing_values = flights.isna().sum() > 0
missing_value_percentage = flights.isna().sum() / flights.shape[0] * 100

missing_value_percentage[has_missing_values]

tail_num                 0.304947
dep_time                 1.616865
dep_delay                1.647979
taxi_out                 1.707151
wheels_off               1.707151
wheels_on                1.754681
taxi_in                  1.754681
arr_time                 1.718165
arr_delay                1.944132
cancellation_code       98.317305
crs_elapsed_time         0.000168
actual_elapsed_time      1.927172
air_time                 1.963605
carrier_delay           81.124770
weather_delay           81.124770
nas_delay               81.124770
security_delay          81.124770
late_aircraft_delay     81.124770
first_dep_time          99.313597
total_add_gtime         99.313722
longest_add_gtime       99.313681
no_name                100.000000
dtype: float64
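
The same summary can be computed in a single pass, because the mean of a boolean mask equals the fraction of `True` values. A minimal sketch on a toy frame follows; the frame and its values are made up for illustration and only stand in for `flights`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `flights`; values are illustrative only.
toy = pd.DataFrame({
    "tail_num": ["N48901", None, "N11164", "N13995"],
    "dep_delay": [5.0, np.nan, np.nan, -2.0],
    "distance": [127, 468, 1091, 310],
    "no_name": [np.nan] * 4,
})

# isna().mean() gives the fraction missing per column; scale to percent.
missing_pct = toy.isna().mean() * 100

# Keep only columns with at least one missing value, largest first.
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```

Sorting the result makes it easier to separate columns that are almost entirely empty from those with only a small fraction of gaps.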

`no_name` is completely empty (100% missing), so we can drop it with no information loss.

In [27]:
flights = flights.drop('no_name', axis=1)
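
If more than one column could be entirely empty, a name-independent alternative is `DataFrame.dropna` with `how='all'` along the column axis. This is a sketch on a toy frame with made-up values, not a step the notebook itself takes.

```python
import numpy as np
import pandas as pd

# Toy frame with one fully empty column; values are illustrative only.
toy = pd.DataFrame({
    "distance": [127, 468, 1091],
    "dep_delay": [5.0, np.nan, -2.0],
    "no_name": [np.nan, np.nan, np.nan],
})

# Drop only the columns in which every value is missing.
cleaned = toy.dropna(axis="columns", how="all")
print(cleaned.columns.tolist())
```

Partially missing columns such as `dep_delay` are kept, so this only removes columns that carry no information at all.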

In [29]:
flights.cancellation_code.unique()

array([nan, 'A', 'C', 'B', 'D'], dtype=object)

It may be worthwhile to investigate the `cancellation_code` feature further, but its missing proportion should match the share of non-cancelled flights in the `cancelled` feature, assuming that codes are only present for cancelled flights (as suggested in the flights description). In other words, most data in this column is ***structurally missing*** (i.e. missing because most flights were not cancelled).

In [31]:
flights['cancelled'].value_counts()/flights.shape[0] * 100

0    98.317305
1     1.682695
Name: cancelled, dtype: float64

The percentages support the structurally missing data hypothesis, but let us confirm by inspecting the relevant subset more rigorously.

In [33]:
flights[flights.cancelled > 0]['cancellation_code'].unique()

array(['A', 'C', 'B', 'D'], dtype=object)

In [34]:
flights[flights.cancelled == 0]['cancellation_code'].unique()

array([nan], dtype=object)

This is as much support for the structurally missing hypothesis as the data itself can give us for `cancellation_code`: every cancelled flight has a code, and no non-cancelled flight does. It remains possible that human errors (e.g. data entry) are present but obscured in this table.
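
The two `unique()` checks can also be expressed as one row-level condition: a code is present exactly when the flight is cancelled. Below is a minimal sketch on a toy frame; the values are made up, and the column names simply mirror `cancelled` and `cancellation_code`.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the two columns of interest; values are illustrative only.
toy = pd.DataFrame({
    "cancelled": [0, 0, 1, 1, 0],
    "cancellation_code": [np.nan, np.nan, "A", "C", np.nan],
})

has_code = toy["cancellation_code"].notna()
is_cancelled = toy["cancelled"] == 1

# True only if a code is present exactly when the flight is cancelled.
consistent = bool((has_code == is_cancelled).all())

# Cross-tabulation of the two flags; off-diagonal cells should be zero.
table = pd.crosstab(is_cancelled, has_code)
print(consistent)
print(table)
```

Any non-zero off-diagonal count in the cross-tabulation would point to rows that contradict the hypothesis, such as a cancelled flight without a code.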