# Exploratory Data Analysis
----



### Lighthouse Labs, Midterm Project - Predicting Flight Delays.

##### January 13, 2023. Terre Leung, Tetiana Fesenko, and Jamie Dormaar

---

_Use this notebook to get familiar with the datasets we have. There are 10 questions we need to answer during the EDA._


_We shouldn't limit our EDA to these 10 questions. Let's be creative :)._

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime as dt

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_theme()

from scipy import stats

import warnings
warnings.filterwarnings('ignore')

import os
import json
import requests



In [None]:
# Load data tables:
flights_initial_500000_records    = pd.read_csv('../data/flights_initial_500000_records.csv', delimiter= ',')
flights_delay_dates_all_records   = pd.read_csv('../data/flights_delay_dates_all_records.csv', delimiter= ',')
flights_random_100000_records     = pd.read_csv('../data/flights_random_100000_records.csv', delimiter= ',')
flights_random_5000_records       = pd.read_csv('../data/flights_random_5000_records.csv', delimiter= ',')
flights_test_all_records          = pd.read_csv('../data/flights_test_all_records.csv', delimiter= ',')
fuel_consumption_all_records      = pd.read_csv('../data/fuel_consumption_all_records.csv', delimiter= ',')
passengers_initial_300000_records = pd.read_csv('../data/passengers_initial_300000_records.csv', delimiter= ',')
# flights_usa = pd.read_csv('../data/usa_flights2.csv', delimiter= ',')  # Terre is there a new csv to go with this one?

In [None]:
# Save working copies of the data:
df_fl_init    = flights_initial_500000_records.copy()
df_fl_delays  = flights_delay_dates_all_records.copy()
df_fl_smpl1   = flights_random_100000_records.copy()
df_fl_smpl2   = flights_random_5000_records.copy()
df_fl_test    = flights_test_all_records.copy()
df_fc         = fuel_consumption_all_records.copy()
df_pa_init    = passengers_initial_300000_records.copy()

##### Set your session working table to temp variable df:

In [None]:
df = df_fl_smpl1.copy()

##### SAVE a session timestamp to label the saved outputs: 
>(Optional: this can be useful if you want to help keep your files organized)


In [None]:

tag = 'smpl_100K_' # option with leading name            'Jamie_'
# tag = ''
# NOTE: the timestamp is stored in `stamp` so the imported `dt` class is not overwritten.
# stamp = dt.now().strftime('%b%d_%H%M')   # 'Jan01_1704'
# stamp = dt.now().strftime('%b%-d_%H%M')  # 'Jan1_1708' (%-d is platform dependent)
# stamp = dt.now().strftime('%a_%H%M')     # 'Mon_1710'
stamp = ''
session = f'{tag}{stamp}'

##### SETUP: A first look at tables:


In [None]:

# flights_initial_500000_records
print(f'\nflights_initial_500000_records.shape: {flights_initial_500000_records.shape}')
display(flights_initial_500000_records.head())

# flights_delay_dates_all_records
print(f'\nflights_delay_dates_all_records.shape: {flights_delay_dates_all_records.shape}')
display(flights_delay_dates_all_records.head())

# flights_random_100000_records
print(f'\nflights_random_100000_records.shape: {flights_random_100000_records.shape}')
display(flights_random_100000_records.head())

# flights_test_all_records
print(f'\nflights_test_all_records.shape: {flights_test_all_records.shape}')
display(flights_test_all_records.head())

# fuel_consumption_all_records
print(f'\nfuel_consumption_all_records.shape: {fuel_consumption_all_records.shape}')
display(fuel_consumption_all_records.head())

# passengers_initial_300000_records
print(f'\npassengers_initial_300000_records.shape: {passengers_initial_300000_records.shape}')
display(passengers_initial_300000_records.head())


In [None]:
df.info()

##### NOTE: Missing Data content for each of the four data tables.


In [None]:
# Check for nulls:
# flights Table percent Null content:
df_fl_init_nulls = df_fl_init.isnull().sum().sort_values(ascending= False)
percent = (df_fl_init.isnull().sum()/df_fl_init.isnull().count()).sort_values(ascending = False)
df_fl_init_missing_data = pd.concat(
    [df_fl_init_nulls, percent]
  , axis=1
  , keys=['Total', 'Percent']
  , verify_integrity= True
)
print(f'\nflights_missing_data.head(20)')
display(df_fl_init_missing_data.head(20))

# flights_test Table percent Null content:
df_fl_test_nulls = df_fl_test.isnull().sum().sort_values(ascending= False)
percent = (df_fl_test.isnull().sum()/df_fl_test.isnull().count()).sort_values(ascending = False)
df_fl_test_missing_data = pd.concat(
    [df_fl_test_nulls, percent]
  , axis=1
  , keys=['Total', 'Percent']
  , verify_integrity= True
)
print(f'\nflights_test_missing_data.head(20)')
display(df_fl_test_missing_data.head(20))

# fuel_consumption Table percent Null content:
df_fc_nulls = df_fc.isnull().sum().sort_values(ascending= False)
percent = (df_fc.isnull().sum()/df_fc.isnull().count()).sort_values(ascending = False)
df_fc_missing_data = pd.concat(
    [df_fc_nulls, percent]
  , axis=1
  , keys=['Total', 'Percent']
  , verify_integrity= True
)
print(f'\nfuel_consumption_missing_data.head(20)')
display(df_fc_missing_data.head(20))

# passengers Table percent Null content:
df_pa_init_nulls = df_pa_init.isnull().sum().sort_values(ascending= False)
percent = (df_pa_init.isnull().sum()/df_pa_init.isnull().count()).sort_values(ascending = False)
df_pa_init_missing_data = pd.concat(
    [df_pa_init_nulls, percent]
  , axis=1
  , keys=['Total', 'Percent']
  , verify_integrity= True
)
print(f'\npassengers_missing_data.head(20)')
display(df_pa_init_missing_data.head(20))
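
The four blocks above repeat the same pattern, so they could be folded into one helper. The sketch below is one possible version; the function name `missing_data_summary` and the tiny `demo` frame are made up for illustration and are not part of the project data.

```python
import numpy as np
import pandas as pd

def missing_data_summary(table: pd.DataFrame) -> pd.DataFrame:
    """Return the null count and null fraction per column, largest first."""
    total = table.isnull().sum()
    percent = table.isnull().mean()   # fraction of rows that are null
    summary = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    return summary.sort_values('Total', ascending=False)

# Made-up frame, only to show the shape of the output:
demo = pd.DataFrame({'a': [1, np.nan, 3, np.nan], 'b': [1, 2, 3, 4]})
demo_summary = missing_data_summary(demo)
print(demo_summary)
```

With a helper like this, each table needs a single call, for example `missing_data_summary(df_fl_init).head(20)`.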


##### NOTE: Differences between the flights and flights_test table data:

In [None]:
flights_columns = df_fl_init.columns
flights_columns

In [None]:
flights_test_columns = df_fl_test.columns
flights_test_columns

In [None]:
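# Columns that exist in flights but not in flights_test.
# isin() takes the column Index directly; wrapping it in a list breaks the comparison.
fl_test_exclusion = df_fl_init.loc[:, ~df_fl_init.columns.isin(flights_test_columns)]
fl_test_exclusion.head()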

##### ANALYZE: Arrival delay details in the flights table:


#### **Task 1**: 

1. Test the hypothesis that the delay is from Normal distribution. 
1. And, that the **mean** of the arrival delays is 0. 
1. Be careful about the outliers.

>##### TASK 1.1: Test the hypothesis that the delay is from Normal distribution. 

In [None]:
df[['arr_delay']].value_counts().sort_values(ascending=False).head(20)

In [None]:
df['arr_delay'].describe()

The `stats` module from the `scipy` package provides the Shapiro-Wilk test, whose null hypothesis is that the data is normally distributed.
If the resulting p-value is greater than 0.05, we fail to reject that hypothesis at the 5% level. This means the data is consistent with a normal distribution, not that normality is proven.

In [None]:
# from scipy import stats
# Drop missing delays first so NaN values do not propagate into the result.
stat, p = stats.shapiro(df['arr_delay'].dropna())
print('%0.15f' % p, stat)

The statistical calculation above printed the following warning:
```
UserWarning: p-value may not be accurate for N > 5000.
```
A smaller sample is taken to stay within the range where the p-value is reliable:

In [None]:
# Sample a subset:
x = df.sample(1000)
len(x)

In [None]:
# Rerun the Shapiro-Wilk normality test on the sample (missing delays dropped):
stat, p = stats.shapiro(x['arr_delay'].dropna())
print('%0.15f' % p, stat)

Based on this sample, the data appears to be normally distributed.
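
Task 1.2 (testing that the mean arrival delay is 0) is not worked through in this notebook. A minimal sketch of how it could be approached is a one-sample t-test with `stats.ttest_1samp`. The `demo_delays` array below is synthetic and only stands in for `df['arr_delay'].dropna()`, so the printed numbers say nothing about the real flights.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-in for the arrival delays (NOT the project data):
demo_delays = rng.normal(loc=5.0, scale=40.0, size=2000)

# H0: the population mean of the arrival delay is 0.
t_stat, p_value = stats.ttest_1samp(demo_delays, popmean=0.0)
print(f't = {t_stat:.3f}, p = {p_value:.4g}')

# A small p-value (for example below 0.05) would reject a mean of exactly 0.
reject_h0 = p_value < 0.05
```

Because extreme delays pull the mean strongly, it may be worth running the same test before and after the outlier handling in Task 1.3 and comparing the two results.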

>##### TASK 1.3: Managing outliers.

##### VISUALIZE: Arrival delay distribution, and manage outliers:

In [None]:
plt.hist(df['arr_delay'], bins=500)


plt.xlabel('Delay Time (min)')
plt.title('Arrival Delay Distribution')
plt.xlim(-50, 50)
plt.ylim(0, 10000)

plt.savefig(f'../Images/Arrival_delay_distn_{session}.png')
plt.show()


In [None]:
# Outlier detection:
sns.boxplot(data= df, x='arr_delay', whis= 2.5)

plt.savefig(f'../Images/Arrival_delay_outliers_boxplot_{session}.png')

Manually chosen outlier range limits:

In [None]:
# # Define and remove the outliers by a chosen parameter:
# max_delay = 100
# outliers    = df_fl_init[df_fl_init['arr_delay'] > max_delay]
# df_fl_clean = df_fl_init[df_fl_init['arr_delay'] < max_delay]

or using the standard 1.5 * IQR:

In [None]:
# Instantiate the Arrival Delays:
delays = df['arr_delay']

# Define the quantiles of the delay distribution:
Q1 = delays.quantile(0.25)
Q3 = delays.quantile(0.75)
IQR = Q3 - Q1

# Define the outlier thresholds
min_threshold = (Q1 - 1.5 * IQR)
max_threshold = (Q3 + 1.5 * IQR)

In [None]:
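# .copy() makes df_clean an independent table, so columns added later do not write into a slice of df.
df_clean = df[~((delays < min_threshold)|(delays > max_threshold))].copy()
df_clean.shape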

In [None]:
sns.boxplot(x=df_clean['arr_delay'])

#SAVE boxplot of clean delay distribution:
plt.savefig(f'../Images/Arrival_delay_boxplot_{session}.png')

In [None]:
# fig, ax1 = plt.subplots()

# ax1 = fl_df_clean.plot()
# ax2 = fl_df.plot()

# ax1.hist([y1, y2])
# ax1.set_xlim(-10,10)
# fig, (ax1, ax2) = plt.subplots(1, 2)


#### **Task 2**: Is average/median monthly delay different during the year? If yes, which are months with the biggest delays and what could be the reason?

In [None]:
# Convert fl_date from string to datetime data type
df_clean[['fl_date']] = df_clean[['fl_date']].apply(pd.to_datetime)

In [None]:
df_clean['date'] = df_clean['fl_date'].dt.date
df_clean['year'] = df_clean['fl_date'].dt.year
df_clean['month'] = df_clean['fl_date'].dt.month


In [None]:
print(df_clean[['year', 'month', 'fl_date', 'date']].dtypes)
display(df_clean[['year', 'month', 'fl_date', 'date']].head())

In [None]:
df_clean[['month', 'year', 'arr_delay']].groupby(['year', 'month']).describe()

>NOTE: This was the point where we discovered that the initial sample of 500000 records we collected from the source flights table only includes records from 2 months in 2018. Evidently the source table is sorted by date.

In [None]:
df_clean.columns

In [None]:
# Separate the data for easier viewing re annual delay trends:
df_clean_2018 = df_clean[df_clean['year']==2018]
df_clean_2019 = df_clean[df_clean['year']==2019]

In [None]:
df_clean_2018.groupby('month').agg({'arr_delay': 'mean'}).sort_values('arr_delay', ascending=False)
# df_clean_2018.groupby('month').agg({'arr_delay': 'median'})

In [None]:
df_clean_2019.groupby('month').agg({'arr_delay': 'mean'}).sort_values('arr_delay', ascending=False)
# df_clean_2019.groupby('month').agg({'arr_delay': 'median'})

In [None]:
sns.scatterplot(data=df_clean_2019, x="month", y="arr_delay")
plt.show()

In [None]:
# month
var = 'month'
data = df_clean_2019[['arr_delay',var]]
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x=var, y="arr_delay", data=data)
fig.axis(ymin=-100, ymax=100)

plt.savefig(f'../Images/Arrival_delays_monthly_boxplot_{session}.png')

In [None]:
df_clean_2019.describe()

In [None]:
pass_columns = sorted(list(df_pa_init.columns))
# pass_columns

In [None]:
# df['origin_city_name'].value_counts()

There doesn't appear to be an observable trend.

A trend may emerge if we isolate the flights with a single country as a destination, for example the US:

In [None]:
df_clean['origin_region_code'] = df_clean['origin_city_name'].str[-2:]
df_clean['dest_region_code'] = df_clean['dest_city_name'].str[-2:]


In [None]:
us_states = [
    'AL', 'AK', 'AZ', 'AR', 'CA', 'CO'
  , 'CT', 'DE', 'DC', 'FL', 'GA', 'HI'
  , 'ID', 'IL', 'IN', 'IA', 'KS', 'KY'
  , 'LA', 'ME', 'MD', 'MA', 'MI', 'MN'
  , 'MS', 'MO', 'MT', 'NE', 'NV', 'NH'
  , 'NJ', 'NM', 'NY', 'NC', 'ND', 'OH'
  , 'OK', 'OR', 'PA', 'RI', 'SC', 'SD'
  , 'TN', 'TX', 'UT', 'VT', 'VA', 'WA'
  , 'WV', 'WI', 'WY'
]


In [None]:
df_us = df_clean[df_clean['dest_region_code'].isin(us_states)]
df_us.head()

In [None]:
# month
var = 'month'
data = df_us[['arr_delay',var]]
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x=var, y="arr_delay", data=data)
fig.axis(ymin=-60, ymax=60)

plt.savefig(f'../Images/Arr_delays_monthly_boxplot_US_{session}.png')

In [None]:
df_us[['month', 'arr_delay']].groupby('month').median().sort_values('arr_delay')

#### **Task 3**: Does the weather affect the delay? 
Use the API to pull the weather information for flights ([Local Historical Weather API, WWO](https://www.worldweatheronline.com/weather-api/api/docs/historical-weather-api.aspx)). There is no need to get weather for ALL flights. We can choose the right representative sample. Let's focus on four weather types:

- sunny
- cloudy
- rainy
- snow

Test the hypothesis that the delays under these 4 weather types are from the same distribution. If they are not, which ones are significantly different?
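
One way to approach this hypothesis is the Kruskal-Wallis H-test (`stats.kruskal`), followed by pairwise Mann-Whitney U tests with a Bonferroni correction to see which weather types differ. This choice of tests is an assumption made for this sketch, not something the notebook already does, and the `demo` frame below is synthetic.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)
weather_types = ['sunny', 'cloudy', 'rainy', 'snow']

# Synthetic delays per weather type (made-up means, NOT the project data):
demo = pd.DataFrame({
    'weather': np.repeat(weather_types, 200),
    'arr_delay': np.concatenate(
        [rng.normal(loc, 20.0, 200) for loc in (0.0, 0.0, 5.0, 15.0)]
    ),
})

# H0: all four groups come from the same distribution.
groups = [g['arr_delay'].to_numpy() for _, g in demo.groupby('weather')]
h_stat, p_value = stats.kruskal(*groups)
print(f'H = {h_stat:.2f}, p = {p_value:.4g}')

# Pairwise follow-up, Bonferroni-corrected for the number of pairs:
pairs = list(combinations(weather_types, 2))
pairwise = {}
for a, b in pairs:
    xa = demo.loc[demo['weather'] == a, 'arr_delay']
    xb = demo.loc[demo['weather'] == b, 'arr_delay']
    _, p = stats.mannwhitneyu(xa, xb, alternative='two-sided')
    pairwise[(a, b)] = min(1.0, p * len(pairs))
```

On the real data, the free-text weather descriptions returned by the API would first need to be grouped into these four types.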

In [None]:
# NOTE: would like to increase location precision later with lat/long coords of the dest_airport_id.
# For now will approximate with dest_city_name and state code.
dest_cities = list(df_clean['dest_city_name'])
arr_date = list(df_clean['date'])
arr_time = list(df_clean['arr_time'])

The `arr_time_code` number will be used to index the correct weather type according to the flight arr_time:


In [None]:
arr_time_code = []
for i in arr_time:
    if i <= 150:
      arr_time_code.append(0)
    elif i <= 450:
      arr_time_code.append(1)
    elif i <= 750:
      arr_time_code.append(2)
    elif i <= 1050:
      arr_time_code.append(3)
    elif i <= 1350:
      arr_time_code.append(4)
    elif i <= 1650:
      arr_time_code.append(5)
    elif i <= 1950:
      arr_time_code.append(6)
    elif i <= 2250:
      arr_time_code.append(7)
    else:
      arr_time_code.append(0)
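
The if/elif chain above can also be written as a single vectorized lookup. This is a sketch only: the helper name `arr_time_to_code` and the `demo_times` values are made up, and the bin edges are copied from the chain.

```python
import numpy as np
import pandas as pd

# Upper edges of the bins used in the if/elif chain:
EDGES = np.array([150, 450, 750, 1050, 1350, 1650, 1950, 2250])

def arr_time_to_code(arr_time):
    """Map HHMM arrival times to the 0-7 index of the hourly weather records."""
    values = np.asarray(arr_time, dtype=float)
    # side='left' keeps a value equal to an edge in the lower bin (150 -> 0).
    idx = np.searchsorted(EDGES, values, side='left')
    # Times after 2250 fall into an extra bin 8, which wraps back to 0.
    return idx % 8

demo_times = pd.Series([0, 150, 151, 450, 1200, 2250, 2251, 2400])
demo_codes = arr_time_to_code(demo_times)
print(list(demo_codes))   # [0, 0, 1, 1, 4, 7, 0, 0]
```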

In [None]:
print(dest_cities[0])
print(arr_date[0])
print(arr_time[0])
print(arr_time_code[0])

In [None]:
def WWO_API_weather_type(city, date, time):
  '''
  input:
    city: location query string, e.g. 'Aberdeen, SD'
    date: date of the flight, e.g. '2018-01-01'
    time: index into the hourly records of that day (see arr_time_code, 0-7)
  output: one-row DataFrame with a single 'weather_type' column
  '''
  api_key = os.environ['WEATHER_API_KEY']
  params = {
    'q': city
    , 'date': date
    , 'format': 'json'
    , 'key': api_key
  }

  wwo_url = 'https://api.worldweatheronline.com/premium/v1/past-weather.ashx'
  wwoHxWeather_json = requests.get(wwo_url, params=params).json()

  # Hourly records for the requested day at the destination:
  dest_site = wwoHxWeather_json['data']['weather'][0]['hourly']

  weather_dict = {
      'weather_type':   dest_site[time]['weatherDesc'][0]['value']
  }
  return pd.DataFrame([weather_dict])

    

In [None]:
test_weather_desc = WWO_API_weather_type('Aberdeen, SD', '2018-01-01', 4)
test_weather_desc

In [None]:
# Create list of tiny dfs:
weather_type_list = []
for i in range(df_clean.shape[0]):
  city_x = dest_cities[i]
  date_x = arr_date[i]
  time_x = arr_time_code[i]
  x = WWO_API_weather_type(city_x, date_x, time_x)
  weather_type_list.append(x)


I stopped the API function loop after 119 min, by which point it had retrieved only about 21.8% of our data.
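
Many flights share the same destination city and date, and one response already holds all the hourly records for that day, so caching by `(city, date)` could cut the number of requests considerably. The sketch below shows the idea with a made-up stand-in, `fake_fetch_day`, in place of the real API call; nothing here contacts the weather service.

```python
from functools import lru_cache

calls = []   # records every simulated request, to show how many were made

def fake_fetch_day(city, date):
    """Hypothetical stand-in for one API request: 8 hourly descriptions."""
    calls.append((city, date))
    return tuple(f'{city}|{date}|slot{k}' for k in range(8))

@lru_cache(maxsize=None)
def cached_day(city, date):
    # Only the first call for each unique (city, date) reaches the fetcher.
    return fake_fetch_day(city, date)

def weather_for_flight(city, date, time_code):
    return cached_day(city, date)[time_code]

demo_flights = [
    ('Aberdeen, SD', '2018-01-01', 4),
    ('Aberdeen, SD', '2018-01-01', 6),
    ('Denver, CO', '2018-01-01', 2),
    ('Aberdeen, SD', '2018-01-01', 0),
]
demo_weather = [weather_for_flight(c, d, t) for c, d, t in demo_flights]
print(len(calls))   # 2 requests for 4 flights
```

The cache keys must be hashable, which holds for strings and `datetime.date` objects.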

In [None]:
# Confirm equal lengths:  
print(len(weather_type_list))
print(len(dest_cities))

In [None]:
# Concatenate the list of dfs to one (a single concat avoids re-copying the table on every iteration):
df_weather_type = pd.concat(weather_type_list, ignore_index=True)

In [None]:
df_clean.shape

In [None]:
# Create a truncated version of the df_clean to at least save what we have:
temp = df_clean.copy()
temp = temp.reset_index()
temp.shape

In [None]:
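# Keep only the rows that received a weather type (rows 0 to 19702 in the original run):
df_clean_trunc = temp.iloc[:len(df_weather_type)].copy()
df_clean_trunc.shape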

In [None]:
df_weather_type.head(10)

In [None]:
len(list(df_weather_type['weather_type']))

In [None]:
# Add the new column to working df:
df_clean_trunc['weather'] = list(df_weather_type['weather_type'])

In [None]:
# Confirm the weather type in df:
df_clean_trunc.weather.head()

##### SAVE new version of df_clean with weather_types:

In [None]:
df_clean_trunc.to_csv(f'../data/flights_clean_df{session}.csv', index= False)

In [None]:
# 
var = 'weather'
data = df_clean_trunc[['arr_delay',var]]

f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x=var, y="arr_delay", data=data)
fig.axis(ymin=-60, ymax=60)

plt.savefig(f'../Images/Arr_delays_weather_type_boxplot_{session}.png')

In [None]:
def WWO_API_aux_weather(city, date, time):
  '''
  input: This function requires 3 input variables. 
    1. city: acceptable format includes:         # string data type
        • City Name
        • City Name, State (US only)
        • City Name, State, Country
        • City Name, Country
        • IP address: XXX.XXX.XXX.XXX
        • Postal Code (UK or Canada), Zipcode (US)
        • Latitude and longitude in decimal degrees
    2. date: date of the flight, e.g. '2018-01-01'
    3. time: index into the hourly records of that day (see arr_time_code, 0-7)
  output: one-row DataFrame with the daily and arrival-time weather fields
  '''
  api_key = os.environ['WEATHER_API_KEY']
  params = {
    'q': city
    , 'date': date
    , 'format': 'json'
    , 'key': api_key
  }

  wwo_url = 'https://api.worldweatheronline.com/premium/v1/past-weather.ashx'
  wwoHxWeather_json = requests.get(wwo_url, params=params).json()

  # Daily summary and hourly records for the requested day:
  weather = wwoHxWeather_json['data']['weather'][0]
  hourly = weather['hourly']

  weather_dict = {
        'max_temp_C':           weather['maxtempC']
      , 'min_temp_C':           weather['mintempC']
      , 'avg_temp_C':           weather['avgtempC']
      , 'total_snow_cm':        weather['totalSnow_cm']
      , 'sun_hour':             weather['sunHour']
      , 'uv_index':             weather['uvIndex']
      , 'arr_wind_chill_C':     hourly[time]['WindChillC']
      , 'arr_wind_gust_Kmph':   hourly[time]['WindGustKmph']
      , 'arr_cloud_cover':      hourly[time]['cloudcover']
      , 'arr_precip_MM':        hourly[time]['precipMM']
      , 'arr_pressure':         hourly[time]['pressure']
      , 'arr_temp_C':           hourly[time]['tempC']
      , 'arr_time':             hourly[time]['time']
      , 'arr_uv_index':         hourly[time]['uvIndex']
      , 'arr_visibility':       hourly[time]['visibility']
      , 'arr_weather_code':     hourly[time]['weatherCode']
      , 'arr_wind_dir_16Point': hourly[time]['winddir16Point']
      , 'arr_wind_dir_degree':  hourly[time]['winddirDegree']
      , 'arr_wind_speed_Kmph':  hourly[time]['windspeedKmph']
      , 'arr_weather_type':     hourly[time]['weatherDesc'][0]['value']
  }
  return pd.DataFrame([weather_dict])

    

##### TEST API function:

In [None]:
test_aux = WWO_API_aux_weather('Aberdeen, SD', '2018-01-01', 4)
test_aux

In [None]:
def WWO_API_weather_json(city, date):
  '''
  input:
    city: location query string, e.g. 'Aberdeen, SD'
    date: date string, e.g. '2018-01-01'
  output: the parsed JSON response (dict) for that city and date
  '''
  api_key = os.environ['WEATHER_API_KEY']  # read the key from the environment, never hard-code it
  params = {
    'q': city
    , 'date': date
    , 'format': 'json'
    , 'key': api_key
  }

  wwo_url = 'https://api.worldweatheronline.com/premium/v1/past-weather.ashx'
  return requests.get(wwo_url, params=params).json()

In [None]:
test = WWO_API_weather_json('Aberdeen, SD', '2018-01-01')

In [None]:
# Step down through the nested response:
test['data']
test['data']['weather']
test['data']['weather'][0]

In [None]:
df_test = pd.DataFrame(test['data']['weather'])
list(df_test.columns)

In [None]:
df_test_hourly = pd.json_normalize(df_test['hourly'][0])
df_test_hourly

In [None]:
hourly_columns = sorted(list(df_test_hourly.columns))

In [None]:
df_test_hourly['weatherDesc']

In [None]:
# arr_time ranges (HHMM) mapped to each hourly index:
time_dict = {
    0: [0, 150]       # 2251-2400 also maps to index 0
  , 1: [151, 450]
  , 2: [451, 750]
  , 3: [751, 1050]
  , 4: [1051, 1350]
  , 5: [1351, 1650]
  , 6: [1651, 1950]
  , 7: [1951, 2250]
}

#### **Task 4**: How do taxi times change during the day? Does higher traffic lead to longer taxi times?

##### Handling Outliers

In [None]:
df_clean

##### Let's look at a boxplot of our target variable (taxi_out) to identify any outliers.

In [None]:
ax = sns.boxplot(x=df_clean["taxi_out"])
plt.show()

In [None]:
Q1 = df["taxi_out"].quantile(0.25)
Q3 = df["taxi_out"].quantile(0.75)
IQR = Q3 - Q1

bound = Q3 + 1.5 * IQR
print(f'The upper bound time limit for taxi time is : {bound}')


##### Months

In [None]:
sns.set_style('darkgrid')
ax = sns.countplot(x="month", data=df_clean)
ax.set_title('Month Counts');

In [None]:
month_grouped = df_clean.groupby(['month'])['taxi_out'].mean()

month_grouped = month_grouped.reset_index()

ax = sns.barplot(x='month', y='taxi_out', data=month_grouped, color='#45B39D');

ax.set_title('Taxi-Out time by Month');
ax.set_ylabel('Average Taxi-Out time');

In [None]:
time_grouped = df_clean.groupby(['arr_time'])['taxi_out'].mean()

time_grouped = time_grouped.reset_index()

ax = sns.barplot(x='arr_time', y='taxi_out', data=time_grouped, color='#45B39D');

ax.set_title('Taxi-Out time by arr_time');
ax.set_ylabel('Average Taxi-Out time');
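
Grouping by the raw `arr_time` gives one bar per minute-level value, which is hard to read. A possible alternative is to bin the times by hour and compare the mean taxi-out time with the number of flights in each hour. The sketch below uses a synthetic `demo` frame; the helper name `hourly_taxi_summary` is made up, and whether the arrival or the departure clock is the better choice for taxi-out is left open, so the time column is a parameter.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Synthetic stand-in for df_clean (NOT the project data):
demo = pd.DataFrame({
    'arr_time': rng.integers(0, 24, 500) * 100 + rng.integers(0, 60, 500),
    'taxi_out': rng.normal(17.0, 5.0, 500).clip(min=1.0),
})

def hourly_taxi_summary(table, time_col='arr_time', taxi_col='taxi_out'):
    """Flights per hour and mean taxi time per hour of day."""
    hour = (table[time_col] // 100) % 24      # HHMM -> hour, 2400 -> 0
    return (
        table.assign(hour=hour)
             .groupby('hour')[taxi_col]
             .agg(flights='count', mean_taxi='mean')
             .reset_index()
    )

hourly = hourly_taxi_summary(demo)
# Correlation between traffic (flights per hour) and mean taxi time:
traffic_corr = hourly['flights'].corr(hourly['mean_taxi'])
print(hourly.head())
```

A positive `traffic_corr` on the real data would be consistent with busier hours having longer taxi times, although it would not show causation by itself.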

##### Days

In [None]:
# Count flights per day of the week (fl_date was converted to datetime in Task 2):
ax = sns.countplot(x=df_clean['fl_date'].dt.day_name())
ax.set_title('Day Counts');

In [None]:
df_clean.columns


#### **Task 5**: What is the average percentage of delays that is already created before departure? (aka are arrival delays caused by departure delays?) Are airlines able to lower the delay during the flights?

In [None]:
flights_usa[['fl_date']] = flights_usa[['fl_date']].apply(pd.to_datetime)
flights_usa['fl_date']

In [None]:
flights_usa['year'] = flights_usa['fl_date'].dt.year
flights_usa['month'] = flights_usa['fl_date'].dt.month

In [None]:
#See the distributions
flights_usa[['year', 'month']].value_counts().sort_index(ascending=False)

In [None]:
flights_usa['state'] = flights_usa['origin_city_name'].str[-2:]
flights_usa['late_arr'] = (flights_usa['arr_delay'] > 0).astype(int)
flights_usa['late_dep'] = (flights_usa['dep_delay'] > 0).astype(int)
flights_usa

In [None]:
flights_usa['speed'] = flights_usa['distance']/flights_usa['air_time']
no_dep_delay = flights_usa[flights_usa['late_dep'] == 0]
yes_dep_delay = flights_usa[flights_usa['late_dep'] == 1]

In [None]:
#If there is no departure delay, there is a 15% chance of late arrival
no_dep_delay['late_arr'].mean()

In [None]:
#If there is a departure delay, there is a 73% chance of late arrival
yes_dep_delay['late_arr'].mean()
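
The two rates above show how often a late departure ends in a late arrival, but not how much of the arrival delay already existed at departure. One possible definition, assumed here for illustration, is the ratio of departure delay to arrival delay for flights that arrived late, capped at 100%. The `demo` values are made up.

```python
import pandas as pd

# Made-up delays in minutes (NOT the project data):
demo = pd.DataFrame({
    'dep_delay': [10.0, 0.0, 30.0, -5.0, 20.0],
    'arr_delay': [20.0, 5.0, 15.0, 10.0, -2.0],
})

# Share of the arrival delay that was already present at departure:
late = demo[demo['arr_delay'] > 0]
share_before_departure = (
    late['dep_delay'].clip(lower=0) / late['arr_delay']
).clip(upper=1.0)
avg_share = share_before_departure.mean()

# Fraction of late departures that made up time in the air:
departed_late = demo[demo['dep_delay'] > 0]
recovered = (departed_late['arr_delay'] < departed_late['dep_delay']).mean()

print(avg_share, recovered)   # 0.375 and 2/3 for this demo frame
```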

#### **Task 6**: How many states cover 50% of US air traffic? 

In [None]:
top_8 = flights_usa['state'].value_counts().head(8)
top_8

In [None]:
total_flight = flights_usa['origin_city_name'].count()
total_flight

In [None]:
#These 8 states cover 53% of the flights
top_8.sum()/total_flight
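
Rather than trying a fixed number of states, the count can be read from the cumulative share of flights. A minimal sketch, using a made-up `demo_states` column in place of `flights_usa['state']`:

```python
import pandas as pd

# Made-up origin states (NOT the project data):
demo_states = pd.Series(['CA'] * 4 + ['TX'] * 3 + ['FL'] * 2 + ['NY'], name='state')

# value_counts sorts descending, so cumsum gives the share covered by the top k states.
share = demo_states.value_counts(normalize=True)
cumulative = share.cumsum()

# Smallest number of states whose cumulative share reaches 50%:
n_states = int((cumulative < 0.5).sum()) + 1
print(n_states)   # 2 for this demo (CA 40% + TX 30%)
```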

#### **Task 7**: Test the hypothesis that planes fly faster when there is a departure delay.

In [None]:
#Mean of planes speed without departure delay
no_dep_delay['speed'].mean()

In [None]:
#Mean of planes speed with departure delay
yes_dep_delay['speed'].mean()
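
Comparing the two means does not show whether the difference is statistically significant. One option, assumed here rather than taken from the notebook, is a one-sided Welch t-test with `stats.ttest_ind` (the `alternative` argument needs SciPy 1.6 or newer). The two speed arrays are synthetic stand-ins for `yes_dep_delay['speed']` and `no_dep_delay['speed']`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Synthetic speeds in distance units per minute (NOT the project data):
demo_speed_no_delay = rng.normal(6.8, 1.0, 1000)
demo_speed_delay = rng.normal(6.9, 1.0, 1000)

# H0: mean speed is the same; H1: delayed flights are faster.
t_stat, p_value = stats.ttest_ind(
    demo_speed_delay,
    demo_speed_no_delay,
    equal_var=False,          # Welch's test, no equal-variance assumption
    alternative='greater',
)
print(f't = {t_stat:.3f}, p = {p_value:.4g}')
```

On the real data, rows with a missing or zero `air_time` would need to be dropped first, since they give a NaN or infinite speed.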

#### **Task 8**: When (which hour) do most 'LONG', 'SHORT', 'MEDIUM' haul flights take off?

#### **Task 9**: Find the top 10 busiest airports. Does the biggest number of flights mean that the biggest number of passengers went through the particular airport? How much traffic do these 10 airports cover?

#### **Task 10**: Do bigger delays lead to bigger fuel consumption per passenger? 
We need to do four things to answer this as accurately as possible:
- Find out average monthly delay per air carrier (monthly delay is sum of all delays in 1 month)
- Find out distance covered monthly by different air carriers
- Find out number of passengers that were carried by different air carriers
- Find out total fuel consumption per air carrier.

Use this information to get the average fuel consumption per passenger per km. Is this higher for the airlines with bigger average delays?
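
Once the four per-carrier quantities are available, the final step could look like the sketch below. Every column name and value in `per_carrier` is hypothetical, and dividing total fuel by passengers times distance is only a rough reading of "per passenger per km".

```python
import pandas as pd

# Hypothetical per-carrier totals (made-up names and numbers, NOT the project data):
per_carrier = pd.DataFrame({
    'carrier': ['AA', 'DL', 'UA'],
    'avg_monthly_delay_min': [1200.0, 800.0, 1500.0],
    'total_distance_km': [2.0e6, 1.5e6, 1.8e6],
    'total_passengers': [4.0e5, 3.5e5, 3.0e5],
    'total_fuel_gallons': [6.0e6, 4.0e6, 6.5e6],
})

# Rough fuel use per passenger per km:
per_carrier['fuel_per_pax_km'] = per_carrier['total_fuel_gallons'] / (
    per_carrier['total_passengers'] * per_carrier['total_distance_km']
)

# Do carriers with bigger average delays burn more fuel per passenger-km?
delay_fuel_corr = per_carrier['avg_monthly_delay_min'].corr(
    per_carrier['fuel_per_pax_km']
)
print(per_carrier[['carrier', 'fuel_per_pax_km']])
```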