# Flights Data Exploration
## by Natalya Bakhshetyan

## Preliminary Wrangling

> This document explores flight data for the years 2017 through 2019.

In [129]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

Load the datasets into pandas DataFrames.

In [130]:
flight_data1 = pd.read_csv("flight_data1.csv", dtype = {'DEP_TIME': object, 'ARR_TIME': object})
flight_data2 = pd.read_csv("flight_data2.csv", dtype = {'DEP_TIME': object, 'ARR_TIME': object})
flight_data3 = pd.read_csv("flight_data3.csv", dtype = {'DEP_TIME': object, 'ARR_TIME': object})
print(flight_data1.shape, flight_data2.shape, flight_data3.shape)

(583985, 25) (570118, 25) (450017, 25)


Sample 10,000 rows from each DataFrame for our analysis.

In [188]:
sample_data1 = flight_data1.sample(n=10000, random_state=1)
sample_data2 = flight_data2.sample(n=10000, random_state=1)
sample_data3 = flight_data3.sample(n=10000, random_state=1)
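Fixing `random_state` is what makes these samples repeatable across notebook runs. The sketch below illustrates this on a small made-up DataFrame (`df` and its `value` column are placeholders, not part of the flight data).

```python
import pandas as pd

# Hypothetical stand-in for one of the flight DataFrames.
df = pd.DataFrame({"value": range(100)})

# Two draws with the same seed select exactly the same rows.
first = df.sample(n=10, random_state=1)
second = df.sample(n=10, random_state=1)
print(first.index.equals(second.index))  # True
```

Note that drawing a fixed 10,000 rows from each file gives every file equal weight in the combined sample, regardless of how many flights it originally contained.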

Combine the three samples into a single DataFrame.

In [189]:
combined_df = pd.concat([sample_data1, sample_data2, sample_data3])
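By default `pd.concat` keeps each input's original row labels, so rows sampled from different files may end up sharing an index label. A minimal sketch on toy data (the frames `a` and `b` are made up for illustration) shows the effect and how `ignore_index=True` avoids it.

```python
import pandas as pd

# Toy stand-ins for two sampled DataFrames with overlapping row labels.
a = pd.DataFrame({"YEAR": [2017, 2017]}, index=[5, 9])
b = pd.DataFrame({"YEAR": [2018, 2018]}, index=[5, 7])

# Default behaviour: original labels are kept, so label 5 appears twice.
kept = pd.concat([a, b])
print(kept.index.is_unique)   # False

# ignore_index=True assigns a fresh 0..n-1 index to the combined frame.
fresh = pd.concat([a, b], ignore_index=True)
print(fresh.index.tolist())   # [0, 1, 2, 3]
```

Duplicate labels are harmless for the column-wise cleaning steps here, but they can surprise label-based lookups with `.loc`.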

In [190]:
# high-level overview of data shape and composition
print(combined_df.shape)
print(combined_df.dtypes)
print(combined_df.columns)
print(combined_df.head())

(30000, 25)
YEAR                     int64
QUARTER                  int64
MONTH                    int64
DAY_OF_WEEK              int64
ORIGIN                  object
ORIGIN_CITY_NAME        object
ORIGIN_STATE_NM         object
DEST                    object
DEST_CITY_NAME          object
DEST_STATE_NM           object
DEP_TIME                object
DEP_DELAY_NEW          float64
ARR_TIME                object
ARR_DELAY_NEW          float64
CANCELLED              float64
DIVERTED               float64
AIR_TIME               float64
FLIGHTS                float64
DISTANCE               float64
CARRIER_DELAY          float64
WEATHER_DELAY          float64
NAS_DELAY              float64
SECURITY_DELAY         float64
LATE_AIRCRAFT_DELAY    float64
Unnamed: 24            float64
dtype: object
Index(['YEAR', 'QUARTER', 'MONTH', 'DAY_OF_WEEK', 'ORIGIN', 'ORIGIN_CITY_NAME',
       'ORIGIN_STATE_NM', 'DEST', 'DEST_CITY_NAME', 'DEST_STATE_NM',
       'DEP_TIME', 'DEP_DELAY_NEW', 'ARR_TIME', 'A
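The `Unnamed: 24` column in the dtype listing is the kind of column pandas creates when a CSV header row ends with a trailing delimiter. Whether that is the cause in these files is an assumption. The sketch below reproduces the effect on a tiny in-memory CSV (made-up data) and drops any column that is entirely empty.

```python
import io
import pandas as pd

# A made-up CSV whose lines end with a trailing comma.
csv_text = "A,B,\n1,2,\n3,4,\n"
df = pd.read_csv(io.StringIO(csv_text))
print(list(df.columns))        # ['A', 'B', 'Unnamed: 2']

# Drop columns in which every value is missing.
cleaned = df.dropna(axis=1, how="all")
print(list(cleaned.columns))   # ['A', 'B']
```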

Cleaning steps.

In [191]:
#convert DAY_OF_WEEK to ordered categorical

def to_categorical(col, ordered_categories):
    ordered_var = pd.api.types.CategoricalDtype(ordered = True, categories = ordered_categories)
    combined_df[col] = combined_df[col].astype(ordered_var)

col = "DAY_OF_WEEK"
ordered_categories = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', "Saturday", 'Sunday']
combined_df[col] = combined_df[col].astype(str)
weekday_dict = {'1': 'Monday', '2' : "Tuesday", '3': 'Wednesday',
               '4': 'Thursday', '5': 'Friday', '6': 'Saturday',
               '7': 'Sunday'}
combined_df[col] = combined_df[col].replace(weekday_dict)
to_categorical(col, ordered_categories)

In [192]:
#confirm conversion of DAY_OF_WEEK to ordered categorical

combined_df.DAY_OF_WEEK.unique()

[Wednesday, Sunday, Tuesday, Monday, Saturday, Thursday, Friday]
Categories (7, object): [Monday < Tuesday < Wednesday < Thursday < Friday < Saturday < Sunday]
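The same conversion can be written as a function that takes a Series and returns a new one, rather than modifying `combined_df` in place. This is only a sketch of an alternative: the helper name `weekday_to_categorical` is hypothetical, and it assumes the same coding the notebook uses (1 = Monday through 7 = Sunday).

```python
import pandas as pd

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def weekday_to_categorical(codes: pd.Series) -> pd.Series:
    """Map integer codes 1..7 (1 = Monday) to an ordered categorical."""
    names = codes.map(dict(enumerate(WEEKDAYS, start=1)))
    dtype = pd.api.types.CategoricalDtype(categories=WEEKDAYS, ordered=True)
    return names.astype(dtype)

# Small made-up example.
demo = pd.Series([3, 7, 1])
result = weekday_to_categorical(demo)
print(result.tolist())             # ['Wednesday', 'Sunday', 'Monday']
print(result.min(), result.max())  # Monday Sunday
```

Because the categories are ordered, comparisons and sorting follow the calendar week rather than alphabetical order.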

In [193]:
#drop rows containing null values in columns DEP_TIME and ARR_TIME
combined_df.dropna(subset = ['ARR_TIME'], how = 'all', inplace = True)
combined_df.dropna(subset = ['DEP_TIME'], how = 'all', inplace = True)
#confirm the change
combined_df.shape

(29211, 25)

In [196]:
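# convert DEP_TIME and ARR_TIME from "HHMM" strings to datetime.time values
# (both columns were loaded as strings; values that do not match %H%M become null)
def hhmm_to_time(col):
    # left-pad to four digits so that e.g. "4" is read as 00:04
    combined_df[col] = combined_df[col].apply(lambda x: x.zfill(4))
    combined_df[col] = pd.to_datetime(combined_df[col], format='%H%M', errors = 'coerce').dt.time

hhmm_to_time('DEP_TIME')
hhmm_to_time('ARR_TIME')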
print(combined_df.DEP_TIME.head())
print(combined_df.ARR_TIME.head())

380728    14:43:00
544669    15:11:00
211691    00:04:00
501386    23:03:00
443541    10:09:00
403423    14:19:00
263383    18:58:00
131232    13:33:00
439250    20:46:00
199955    08:52:00
388695    15:20:00
106354    10:18:00
303020    08:48:00
340065    17:24:00
370913    06:44:00
110145    17:07:00
237786    08:29:00
30967     14:36:00
494159    13:22:00
392647    15:18:00
150369    08:22:00
222944    17:25:00
48271     14:41:00
543135    17:13:00
114230    05:50:00
486664    09:22:00
473541    20:33:00
357885    19:41:00
576937    06:53:00
2533      06:43:00
            ...   
429762    12:20:00
273022    17:15:00
55266     15:09:00
45742     12:12:00
56030     15:07:00
407867    05:53:00
94964     20:52:00
362662    09:43:00
185084    20:53:00
127578    22:22:00
192806    20:16:00
308704    11:39:00
218331    10:20:00
75638     15:22:00
379280    11:34:00
332375    08:19:00
337191    08:36:00
436654    13:44:00
287967    09:40:00
232366    13:30:00
384548    05:51:00
287945    12

In [208]:
#drop rows whose DEP_TIME or ARR_TIME could not be parsed and were coerced to null
combined_df.dropna(subset = ['ARR_TIME'], how = 'all', inplace = True)
combined_df.dropna(subset = ['DEP_TIME'], how = 'all', inplace = True)
#confirm the change
combined_df.shape

(29192, 25)
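The row count fell from 29,211 to 29,192 after parsing, so 19 rows held times that did not match `%H%M`. One possible cause, not confirmed in this notebook, is the value `2400`, which is not a valid hour under `%H`. The sketch below uses made-up values to show how such entries are coerced to null, and one optional way to keep them by treating `2400` as midnight (an assumption, not something the analysis above does).

```python
import pandas as pd

# Made-up raw values: a short string, a normal time, "2400", and a missing value.
raw = pd.Series(["4", "1443", "2400", None])

# Same approach as above: left-pad to four digits, then parse as HHMM.
padded = raw.str.zfill(4)
parsed = pd.to_datetime(padded, format="%H%M", errors="coerce")
print(parsed.isna().tolist())    # [False, False, True, True]

# Optional alternative (assumption): read "2400" as "0000" before parsing.
adjusted = padded.replace("2400", "0000")
reparsed = pd.to_datetime(adjusted, format="%H%M", errors="coerce")
print(reparsed.isna().tolist())  # [False, False, False, True]
```

Checking `combined_df` for such values before dropping rows would show whether this explanation applies to the actual data.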

### What is the structure of your dataset?

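> After sampling and cleaning, the combined dataset contains 29,192 flights and 25 columns. `YEAR`, `QUARTER`, and `MONTH` are integers; `DAY_OF_WEEK` is an ordered categorical (Monday through Sunday); the six origin and destination columns are strings; `DEP_TIME` and `ARR_TIME` hold time-of-day values; and the remaining 13 columns, including the delay, distance, and air-time measures, are floats.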

### What is/are the main feature(s) of interest in your dataset?

> Your answer here!

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

> Your answer here!

## Univariate Exploration

> In this section, investigate distributions of individual variables. If
you see unusual points or outliers, take a deeper look to clean things up
and prepare yourself to look at relationships between variables.

> After every plot or related series of plots, make sure that you
include a Markdown cell with comments about what you observed and what
you plan on investigating next.

### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> Your answer here!

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> Your answer here!

## Bivariate Exploration

> In this section, investigate relationships between pairs of variables in your
data. Make sure the variables that you cover here have been introduced in some
fashion in the previous section (univariate exploration).

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> Your answer here!

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> Your answer here!

## Multivariate Exploration

> Create plots of three or more variables to investigate your data even
further. Make sure that your investigations are justified, and follow from
your work in the previous sections.

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> Your answer here!

### Were there any interesting or surprising interactions between features?

> Your answer here!
