# Flights Data Visualization
## by Nithin Venugopal

## Preliminary Wrangling

> In this project, we will use the Flights dataset taken from [Stats_Computing](http://stat-computing.org/dataexpo/2009/the-data.html). The dataset records flights in the US from 1987 to 2008. It offers many variables for analysis, including delays, flight information, causes of delays, taxi in/out times and dates.
For my project, I only analyze the year 2007, since my system had trouble loading the full dataset, which exceeds 1.5 GB. We will proceed through:
- Gathering the data
- Assessing the important variables
- Cleaning unnecessary information
- Exploratory Analysis
- Explanatory Analysis
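
Since memory was the constraint that limited this analysis to 2007, one possible way to shrink the footprint is to load only the columns used later. The sketch below is an assumption, not what this notebook actually ran: `load_columns` is a hypothetical helper, and `sample_csv` is a tiny in-memory stand-in for a yearly file.

```python
import io
import pandas as pd

# Columns used later in this notebook.
NEEDED = ["Month", "UniqueCarrier", "DepDelay", "ArrDelay"]

def load_columns(source, columns=NEEDED, chunksize=100_000):
    """Read only `columns` from a CSV, one chunk at a time (hypothetical helper)."""
    chunks = pd.read_csv(source, usecols=columns, chunksize=chunksize)
    return pd.concat(chunks, ignore_index=True)

# Tiny in-memory stand-in for a yearly file such as '2007.csv'.
sample_csv = io.StringIO(
    "Month,UniqueCarrier,DepDelay,ArrDelay,TailNum\n"
    "1,WN,5,3,N123\n"
    "2,AA,,12,N456\n"
)
subset = load_columns(sample_csv, chunksize=1)
print(subset.shape)
```

Most of the saving would likely come from `usecols`; reading in chunks mainly keeps the parser's working memory bounded.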

In [2]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
pd.set_option('display.max_columns', None)


%matplotlib inline

In [4]:
flights_2007_original = pd.read_csv('2007.csv')
# Missing values per column, plus the (rows, columns) shape
flights_2007_original.isna().sum(), flights_2007_original.shape

In [None]:
flights_2007_original.head()

In [None]:
# How many duplicates?
flights_2007_original.duplicated().sum()

In [None]:
# Show the duplicated rows
flights_2007_original[flights_2007_original.duplicated()]

In [None]:
# Remove duplicates
flights_2007_original.drop_duplicates(inplace = True)
flights_2007_original.info()

In [None]:
# How many NaN values?
flights_2007_original.isna().sum()

> I will not remove any NaNs, since they do not affect the variables I am taking into consideration.
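
Leaving the NaNs in place is workable because pandas reductions such as `mean` skip missing values by default. A minimal illustration with made-up values, not taken from the flight data:

```python
import numpy as np
import pandas as pd

# Made-up delays; NaN stands in for a flight with no recorded delay.
delays = pd.Series([10.0, np.nan, 20.0])

mean_skipping_nan = delays.mean()            # NaN ignored (skipna=True is the default)
mean_with_nan = delays.mean(skipna=False)    # NaN propagates into the result
print(mean_skipping_nan, mean_with_nan)
```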

### What is the structure of your dataset?

> My data contains 2389217 rows and 29 columns. It holds daily flight records from many carriers for 2007. Most columns are integers or strings. Durations such as delays and elapsed times are expressed in minutes.

### What is/are the main feature(s) of interest in your dataset?

> As a data analyst and an avid traveller, I want to find out which airlines are the worst in terms of departure and arrival delays, and how these two delays are correlated. I also want to explore the other causes of delay per month, and finally look at how those causes break down for the 5 worst airlines by delay time.

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

> The features that will help me analyze this dataset are:
- Month
- Unique Carriers
- Departure Delay time
- Arrival Delay time
- Various delay causes, such as weather delay, carrier delay etc.

In [None]:
# Making a copy

flights_2007 = flights_2007_original.copy()

## Univariate Exploration



#### What is the relationship between Arrival, Departure and Elapsed Delay?

In [None]:
plt.figure(figsize = [20,5])

plt.subplot(1,3,1)
plt.hist(flights_2007.ArrDelay, bins = np.arange(-50, 80,5), rwidth = 0.9)
plt.title('Arrival Delay')



plt.subplot(1,3,2)
plt.hist(flights_2007.DepDelay, bins = np.arange(-50, 80,5), rwidth = 0.9)
plt.title('Departure Delay')


plt.subplot(1,3,3)
plt.hist((flights_2007.ActualElapsedTime-flights_2007.CRSElapsedTime), bins = np.arange(-50, 80,5), rwidth = 0.9)
plt.title('Elapsed Delay')



> We can see that the Departure Delay is strongly right-skewed, while the Arrival Delay is only slightly right-skewed. The Elapsed Delay is closer to a normal distribution. What conclusions can we draw?
- If Departure and Arrival Delay had the same distribution, we could have concluded that every flight with a departure delay also has an arrival delay. This is not the case, so although departure delays can cause arrival delays, they do not always do so.
- Since all three plots use the same kind of y-axis (flight counts on a linear scale), I feel a transformation would only be an extra step toward the same conclusion.
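
The visual impression of skewness can be checked numerically with `Series.skew()`. The sketch below uses synthetic samples rather than the flight data, so the numbers only illustrate how to read the statistic. On the real data one might call `flights_2007.DepDelay.skew()` instead.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins: a long right tail versus a symmetric bell shape.
right_skewed = pd.Series(rng.exponential(scale=10, size=10_000))
symmetric = pd.Series(rng.normal(loc=0, scale=10, size=10_000))

# Positive skew means a long right tail; values near 0 mean rough symmetry.
skew_right = right_skewed.skew()
skew_sym = symmetric.skew()
print(round(skew_right, 2), round(skew_sym, 2))
```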



### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?
- Essentially, no straightforward correlation can be read off these histograms, although we cannot say there is no correlation at all. We will need a bivariate exploration for a better picture. I did not perform any transformations, since zooming in on the central range was enough to get an idea of the distributions.

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
- Yes, there were unusual distributions. I expected the arrival delay and the departure delay to have the same distribution, but that turned out not to be true. I did not make any adjustments, since we can explore this further in the bivariate exploration.

## Bivariate Exploration



#### Worst airlines in Terms of delay

In [None]:
carriers = pd.read_csv('carriers.csv')
carrier_dict = dict(carriers.values)

#plt.figure(figsize=([20, 15]));

flights_2007_worst = flights_2007.groupby('UniqueCarrier').agg({'DepDelay':'mean', 'ArrDelay':'mean'}).rename(index=carrier_dict).sort_values('DepDelay', ascending=False).head(5)

flights_2007_worst.plot(kind = 'barh');
plt.xlabel('Time in mins');
plt.ylabel('Unique Carrier');
plt.title('Worst Airlines ranked by Delay Time');



> We can see the worst performers of 2007 in terms of delays. There could be many reasons for this poor performance, which we will explore as we go further.

#### Relation between Arrival Time Delay, Departure Time Delay and Elapsed Time Delay

In [None]:
fig, [ax1, ax2, ax3] = plt.subplots(1, 3, figsize = [16,6])

# Elapsed-time delay: actual minus scheduled (CRS) elapsed time
elapsed_delay = flights_2007.ActualElapsedTime - flights_2007.CRSElapsedTime

# x and y are passed by keyword, since recent seaborn releases reject them positionally
sb.scatterplot(x = flights_2007.DepDelay, y = flights_2007.ArrDelay, alpha = 0.01, s = 0.3, ax = ax1)
ax1.set_xlabel('Departure Delay')
ax1.set_ylabel('Arrival Delay')
ax1.set_xlim([-200,2000])
ax1.set_ylim([-200,2000])

sb.scatterplot(x = flights_2007.DepDelay, y = elapsed_delay, s = 0.3, alpha = 0.01, ax = ax2)
ax2.set_xlabel('Departure Delay')
ax2.set_ylabel('Elapsed Time Delay')
ax2.set_xlim([-200,2000])
ax2.set_ylim([-200,2000])

sb.scatterplot(x = flights_2007.ArrDelay, y = elapsed_delay, s = 0.3, alpha = 0.01, ax = ax3)
ax3.set_xlabel('Arrival Delay')
ax3.set_ylabel('Elapsed Time Delay')
ax3.set_xlim([-200,2000])
ax3.set_ylim([-200,2000])


> - Left: there is a strong correlation between Departure Delay and Arrival Delay, although in general the Departure Delay seems to be slightly larger than the Arrival Delay.
> - Centre: there is no visible correlation between Departure Delay and Elapsed Time Delay.
> - Right: the Arrival Delay shows some correlation with the Elapsed Time Delay, but it is weak, since the two variables span very different ranges.
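
These visual judgements can be backed by Pearson coefficients from `DataFrame.corr()`. The data below is synthetic and rests on an assumption: arrival delay is taken to be departure delay plus elapsed-time delay, with the two components generated independently. On the real data, the same call could be applied to `DepDelay`, `ArrDelay` and the elapsed-time difference.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 5_000

# Assumed, independent synthetic components (minutes).
dep = rng.exponential(scale=30, size=n)        # wide spread
elapsed = rng.normal(loc=0, scale=8, size=n)   # narrow spread

delays = pd.DataFrame({
    "DepDelay": dep,
    "ElapsedDelay": elapsed,
    "ArrDelay": dep + elapsed,   # assumption: arrival = departure + elapsed
})

corr = delays.corr()   # Pearson correlation by default
print(corr.round(2))
```

Under these assumptions, the component with the larger spread dominates the arrival delay, which would be consistent with the pattern seen in the scatter plots.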

#### Delays compounded by Month

In [None]:
flights_2007_mean = flights_2007.groupby('Month').agg({'DepDelay':'mean', 'ArrDelay':'mean'})


flights_2007_mean.plot.line()

plt.show()

> Here we can see the best months to travel, where the mean delays are much lower than in the rest of the year. Further analysis across all the years is needed to see whether this pattern is consistent. A city-by-city breakdown would also help explain the pattern.
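
Reading the best month off the line plot can also be done programmatically with `idxmin`. The frame below is made up purely to show the mechanics; applied to `flights_2007_mean`, the same call would return the month with the lowest mean for each delay column.

```python
import pandas as pd

# Made-up flights: three months, two flights each.
toy = pd.DataFrame({
    "Month":    [1, 1, 2, 2, 3, 3],
    "DepDelay": [12, 8, 4, 2, 20, 10],
    "ArrDelay": [10, 6, 3, 1, 18, 12],
})

monthly = toy.groupby("Month").agg({"DepDelay": "mean", "ArrDelay": "mean"})

# idxmin returns the index label (the month) with the smallest mean, per column.
best_month = monthly.idxmin()
print(best_month)
```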

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?
> Here we can see a strong correlation between arrival delay and departure delay. The elapsed time delay shows no correlation with the departure delay, although it does show some correlation with the arrival delay.

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?
> I expected the correlation between elapsed time delay and arrival delay to be stronger, since a larger arrival delay should show up as a larger elapsed time delay. That was not the case: the arrival delay instead tracks the departure delay. This means that in many instances the carrier took the scheduled number of minutes to reach its destination even though the flight was delayed.
> One thing to note is the delay pattern from September to November. It needs to be checked against other years before calling it a pattern. Furthermore, we can plot the various causes of delay against the Month variable and see whether they fit.

## Multivariate Exploration



#### Line Plot comparing the various causes of delays with respect to month

In [None]:
Carrier_month = flights_2007.query('CarrierDelay > 0').groupby('Month').agg({'CarrierDelay':'mean'})
Weather_month = flights_2007.query('WeatherDelay > 0').groupby('Month').agg({'WeatherDelay':'mean'})
NASDelay_month = flights_2007.query('NASDelay > 0').groupby('Month').agg({'NASDelay':'mean'})
Security_month = flights_2007.query('SecurityDelay > 0').groupby('Month').agg({'SecurityDelay':'mean'})
LateAircraft_month = flights_2007.query('LateAircraftDelay >0').groupby('Month').agg({'LateAircraftDelay':'mean'})



In [None]:
plt.figure(figsize=[20,10])

plt.errorbar(x = Carrier_month.index, y = Carrier_month['CarrierDelay'])
plt.errorbar(x = Weather_month.index, y = Weather_month['WeatherDelay'])
plt.errorbar(x = NASDelay_month.index, y = NASDelay_month['NASDelay'])
plt.errorbar(x = Security_month.index, y = Security_month['SecurityDelay'])
plt.errorbar(x = LateAircraft_month.index, y = LateAircraft_month['LateAircraftDelay'])

plt.xlabel('Month')
plt.ylabel('Delay Time in Minutes')
plt.title('Month vs Delay Time Due to Carrier, Weather, NAS, Security or Late Aircraft')
plt.legend(['Carrier Delay', 'Weather Delay', 'NAS Delay', 'Security Delay', 'Late Aircraft Delay'])

plt.show()

> Here we can see that the 2 main reasons for delays are the weather and the late arrival of the aircraft. It is also interesting to note that these delays vary considerably from month to month.
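
The per-cause monthly means behind this plot can also be computed in a single pass by masking non-positive values before grouping. This is an alternative sketch on made-up rows, not the code used above; it should give the same per-month values as the separate `query(...)` calls, except that a month with no positive values appears as NaN instead of being dropped.

```python
import pandas as pd

causes = ["CarrierDelay", "WeatherDelay"]

# Made-up rows; zeros mean the cause did not contribute to that flight.
toy = pd.DataFrame({
    "Month":        [1, 1, 1, 2, 2],
    "CarrierDelay": [10, 0, 30, 0, 8],
    "WeatherDelay": [0, 0, 40, 15, 5],
})

# Mask non-positive values as NaN, then average per month. mean() skips NaN,
# so each column is averaged only over flights where that cause was present.
by_month = toy[causes].where(toy[causes] > 0).groupby(toy["Month"]).mean()
print(by_month)
```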

#### Line Plot comparing the various causes of delays with respect to the worst performing carriers

In [None]:
#Subsetting the required data
flights_2007_worst1 = flights_2007.groupby('UniqueCarrier').agg({'DepDelay':'mean', 'ArrDelay':'mean'}).sort_values('DepDelay', ascending=False).head(5)
worst = pd.DataFrame(flights_2007_worst1.reset_index().UniqueCarrier)
worst1 = flights_2007.loc[flights_2007['UniqueCarrier'].isin(worst['UniqueCarrier'])]
flights_2007_worst.head(5).index

In [None]:
# Grouping carrier-wise
carrier1 = worst1.groupby('UniqueCarrier').agg({'CarrierDelay':'mean'})
weather1 = worst1.groupby('UniqueCarrier').agg({'WeatherDelay':'mean'})
nas1 = worst1.groupby('UniqueCarrier').agg({'NASDelay':'mean'})
security1 = worst1.groupby('UniqueCarrier').agg({'SecurityDelay':'mean'})
late1 = worst1.groupby('UniqueCarrier').agg({'LateAircraftDelay':'mean'})

In [None]:
# Plotting the figure
plt.figure(figsize=[20,10])

plt.errorbar(x = carrier1.index, y = carrier1['CarrierDelay'])
plt.errorbar(x = weather1.index, y = weather1['WeatherDelay'])
plt.errorbar(x = nas1.index, y = nas1['NASDelay'])
plt.errorbar(x = security1.index, y = security1['SecurityDelay'])
plt.errorbar(x = late1.index, y = late1['LateAircraftDelay'])

plt.xlabel('Carriers')
plt.ylabel('Delay Time in Minutes')
plt.title('Worst Carriers (Delays) vs Delay Time Due to Carrier, Weather, NAS, Security or Late Aircraft')
plt.legend(['Carrier Delay', 'Weather Delay', 'NAS Delay', 'Security Delay', 'Late Aircraft Delay'])
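# groupby sorts by carrier code, so build the labels from carrier1.index to keep names aligned with the data
plt.xticks(range(len(carrier1.index)), labels = [carrier_dict.get(c, c) for c in carrier1.index], rotation = 45)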

plt.show()

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> In the 1st plot, we can see that the 2 main reasons for delays are the weather and the late arrival of the aircraft. These delays also vary quite a bit from month to month.
> In the 2nd plot, we can see that the carriers with high departure and arrival delays were largely to blame themselves: carrier delay and late aircraft delay were the largest factors.
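
One thing to watch when plotting the worst carriers is that the ranking order and the grouped order differ, so tick labels need to follow the grouped index. The sketch below uses made-up data, and `names` is a hypothetical lookup standing in for the carrier table.

```python
import pandas as pd

# Made-up flights for three carriers.
toy = pd.DataFrame({
    "UniqueCarrier": ["AA", "AA", "WN", "WN", "EV", "EV"],
    "DepDelay":      [10, 20, 5, 7, 30, 40],
    "CarrierDelay":  [4, 6, 1, 3, 12, 8],
})
names = {"AA": "American", "WN": "Southwest", "EV": "Atlantic Southeast"}  # hypothetical lookup

# Two worst carriers by mean departure delay, worst first.
worst_codes = toy.groupby("UniqueCarrier")["DepDelay"].mean().nlargest(2).index

# groupby sorts by carrier code, so labels are derived from this index.
per_carrier = (toy[toy["UniqueCarrier"].isin(worst_codes)]
               .groupby("UniqueCarrier")["CarrierDelay"].mean())
labels = [names[code] for code in per_carrier.index]
print(list(worst_codes), list(per_carrier.index), labels)
```

Here the ranking puts `EV` first while the grouped result lists `AA` first, which is why the labels are built from `per_carrier.index` and not from the ranking.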

### Were there any interesting or surprising interactions between features?
> It is interesting to note that, in the 1st plot, the delays attributed to each cause vary quite a bit from month to month.
 