# Capstone Project 1 - Data Wrangling

Background:
- Data consists of information on all commercial U.S. domestic flights during Jan, Aug, Nov, and Dec of 2016

Procedure:
1. Initialization
  - import and view raw data
2. Improve readability
  - reformat headers and reindex
  - address missing values
3. Evaluate remaining data
  - separate into dataframes based on component of flight information
  - address each component and reshape if necessary
  - join into one final Flights dataframe
4. Create final Links dataframes
  - link = unique origin to destination pair
  - use the 100 most traveled links, since they carry the highest frequency of flights
  - create links dataframe consisting of median delay times per hour per link during sample period
5. Store final dataframes for further analysis

## Step 1: Initialization

In [1]:
import numpy as np
import pandas as pd

#create dataframe from flight data csv
flights = pd.read_csv(r"C:\Users\mm183\Documents\Springboard\CP1\flight_data\flight_data.csv")

#explore data
print(flights.info())
print(flights.head())
print(flights.tail())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1856061 entries, 0 to 1856060
Data columns (total 43 columns):
QUARTER                  int64
MONTH                    int64
DAY_OF_MONTH             int64
DAY_OF_WEEK              int64
FL_DATE                  object
UNIQUE_CARRIER           object
FL_NUM                   int64
ORIGIN_AIRPORT_ID        int64
ORIGIN_AIRPORT_SEQ_ID    int64
ORIGIN_CITY_MARKET_ID    int64
ORIGIN                   object
ORIGIN_CITY_NAME         object
ORIGIN_STATE_ABR         object
ORIGIN_STATE_NM          object
DEST_AIRPORT_ID          int64
DEST_AIRPORT_SEQ_ID      int64
DEST_CITY_MARKET_ID      int64
DEST                     object
DEST_CITY_NAME           object
DEST_STATE_ABR           object
DEST_STATE_NM            object
CRS_DEP_TIME             int64
DEP_TIME                 float64
DEP_DELAY                float64
DEP_DELAY_NEW            float64
WHEELS_ON                float64
TAXI_IN                  float64
CRS_ARR_TIME             int64

## Step 2: Improve readability
  - reformat headers and reindex
  - address missing values

In [2]:
#make column names lowercase
flights.columns = flights.columns.str.lower()

#view fraction of each column consisting of NaN
print(flights.isnull().mean())

quarter                  0.000000
month                    0.000000
day_of_month             0.000000
day_of_week              0.000000
fl_date                  0.000000
unique_carrier           0.000000
fl_num                   0.000000
origin_airport_id        0.000000
origin_airport_seq_id    0.000000
origin_city_market_id    0.000000
origin                   0.000000
origin_city_name         0.000000
origin_state_abr         0.000000
origin_state_nm          0.000000
dest_airport_id          0.000000
dest_airport_seq_id      0.000000
dest_city_market_id      0.000000
dest                     0.000000
dest_city_name           0.000000
dest_state_abr           0.000000
dest_state_nm            0.000000
crs_dep_time             0.000000
dep_time                 0.014290
dep_delay                0.014290
dep_delay_new            0.014290
wheels_on                0.015120
taxi_in                  0.015120
crs_arr_time             0.000000
arr_time                 0.015120
arr_delay     

In [3]:
#make lists of columns with high and low proportions of missing values 
high_per_nan = [col for col in flights if flights[col].isnull().mean() >= .5 ]
low_per_nan = [col for col in flights if 0 < flights[col].isnull().mean() < .5]
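
The threshold split above can be illustrated on a small frame. This is a minimal sketch, assuming made-up values and a hypothetical `weather_delay` column; it computes the missing-value fraction once and reuses it for both lists.

```python
import numpy as np
import pandas as pd

# toy frame standing in for the flight data (all values are invented)
toy = pd.DataFrame({
    "dep_delay": [5.0, np.nan, -2.0, 10.0],
    "weather_delay": [np.nan, np.nan, np.nan, 12.0],  # hypothetical column
    "origin": ["DEN", "PBI", "TTN", "RDU"],
})

# fraction of missing values per column, computed once
nan_frac = toy.isnull().mean()

# split column names with the same 0.5 threshold used in the cell above
high_nan_cols = nan_frac[nan_frac >= 0.5].index.tolist()
low_nan_cols = nan_frac[(nan_frac > 0) & (nan_frac < 0.5)].index.tolist()

print(high_nan_cols)
print(low_nan_cols)
```

Columns with no missing values fall into neither list, which matches the strict `0 <` bound in the original comprehension.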

In [4]:
'''set aside columns containing high percentage of missing values
    - such columns contain circumstantial flight delay information for a small proportion of total flights
'''
#create dataframe containing high percentage NaN columns
hpn_cols = flights[high_per_nan]

#drop columns from flights dataframe
flights = flights.drop(columns=high_per_nan)

In [5]:
'''explore columns containing low percentage of missing values
    - create dataframe of columns with low percentage of NaN values
    - view descriptive statistics
    - view standard deviation of columns as percentage of mean to gauge variability
    - view percentage of total rows containing missing values
'''
lpn_cols = flights[low_per_nan]
print(lpn_cols.describe(include='all'))
print('coefficient of variation:','\n', lpn_cols.apply(lambda x: (x.std() / x.mean()) * 100))
print('percent of rows containing nan:', (len(flights[flights.isnull().any(axis=1)]) / len(flights)) * 100)

           dep_time     dep_delay  dep_delay_new     wheels_on       taxi_in  \
count  1.829537e+06  1.829537e+06   1.829537e+06  1.827997e+06  1.827997e+06   
mean   1.334419e+03  9.812064e+00   1.267238e+01  1.471640e+03  7.562686e+00   
std    5.004642e+02  4.166410e+01   4.064121e+01  5.270121e+02  6.047238e+00   
min    1.000000e+00 -6.000000e+01   0.000000e+00  1.000000e+00  1.000000e+00   
25%    9.200000e+02 -5.000000e+00   0.000000e+00  1.053000e+03  4.000000e+00   
50%    1.329000e+03 -2.000000e+00   0.000000e+00  1.510000e+03  6.000000e+00   
75%    1.741000e+03  7.000000e+00   7.000000e+00  1.914000e+03  9.000000e+00   
max    2.400000e+03  2.040000e+03   2.040000e+03  2.400000e+03  2.500000e+02   

           arr_time     arr_delay  arr_delay_new  crs_elapsed_time  \
count  1.827997e+06  1.824403e+06   1.824403e+06      1.856058e+06   
mean   1.476238e+03  4.206204e+00   1.259892e+01      1.463799e+02   
std    5.314493e+02  4.389959e+01   4.031154e+01      7.673182e+01   
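
As a worked check of the coefficient of variation, the sketch below plugs in the `dep_delay` mean and standard deviation printed in the table above. The variable names are illustrative only.

```python
# mean and standard deviation of dep_delay, copied from the describe() output above
dep_delay_mean = 9.812064
dep_delay_std = 41.66410

# coefficient of variation: standard deviation as a percentage of the mean
dep_delay_cv = dep_delay_std / dep_delay_mean * 100
print(round(dep_delay_cv, 1))
```

A value well above 100 means the spread is several times larger than the mean, which is the basis for not imputing means in the next cell.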

In [6]:
'''set aside rows containing missing values from flights dataframe
    - due to high variability in critical columns (pertaining to delay times) imputation of means would not be appropriate
    - rows constitute small percentage of total data
    - still possible at a later point to calculate missing values from remaining delay information
'''
#create dataframe of rows with null values
nan_rows = flights[flights.isnull().any(axis=1)]

#drop rows with null values from flights
flights = flights.drop(nan_rows.index)

#view percentage of remaining columns consisting of NaN 
print(flights.isnull().mean())

#view shape of resulting flights dataframe
print(flights.shape)

quarter                  0.0
month                    0.0
day_of_month             0.0
day_of_week              0.0
fl_date                  0.0
unique_carrier           0.0
fl_num                   0.0
origin_airport_id        0.0
origin_airport_seq_id    0.0
origin_city_market_id    0.0
origin                   0.0
origin_city_name         0.0
origin_state_abr         0.0
origin_state_nm          0.0
dest_airport_id          0.0
dest_airport_seq_id      0.0
dest_city_market_id      0.0
dest                     0.0
dest_city_name           0.0
dest_state_abr           0.0
dest_state_nm            0.0
crs_dep_time             0.0
dep_time                 0.0
dep_delay                0.0
dep_delay_new            0.0
wheels_on                0.0
taxi_in                  0.0
crs_arr_time             0.0
arr_time                 0.0
arr_delay                0.0
arr_delay_new            0.0
crs_elapsed_time         0.0
actual_elapsed_time      0.0
air_time                 0.0
flights       
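
The row split above can be sketched on invented data. The example keeps the incomplete rows in their own frame and confirms that what remains is the same as what `dropna()` would return.

```python
import numpy as np
import pandas as pd

# toy frame with one complete row and two rows missing a value (values invented)
toy = pd.DataFrame({
    "dep_delay": [5.0, np.nan, -2.0],
    "arr_delay": [3.0, 4.0, np.nan],
})

# boolean mask: True where a row has at least one NaN
has_nan = toy.isnull().any(axis=1)

toy_nan_rows = toy[has_nan]    # set aside for possible later use
toy_complete = toy[~has_nan]   # rows kept for analysis

print(len(toy_nan_rows), len(toy_complete))
```

Keeping `toy_nan_rows` rather than calling `dropna()` directly preserves the option of reconstructing the missing values later.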

## Step 3: Evaluate remaining data
  - separate into dataframes based on component of flight information
  - reshape if necessary
  - drop redundant columns

In [7]:
'''separate by columns based on component of flight information
    - date
    - location (origin & destination)
    - status timestamps and other metrics
'''
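#create dataframes for groups of columns
#  - .copy() makes each subset independent of flights, so later column edits do not raise SettingWithCopyWarning
date_info = flights.loc[:, 'quarter':'fl_date'].copy()
location_info = flights.loc[:, 'origin_airport_id':'dest_state_nm'].copy()
flight_metrics = flights.loc[:, 'crs_dep_time':'distance'].copy()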

In [8]:
'''define function for creating dictionary with unique values in columns in given dataframe
    - key as column name
    - value as array of unique values in column 
    - sort array for readability
'''
def uv_dict(dataframe):
    #np.sort returns a sorted copy of each array of unique values
    return {col: np.sort(dataframe[col].unique()) for col in dataframe}
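
A short usage sketch of the same idea on invented data follows. The helper name `sorted_unique_values` is hypothetical and stands in for the function above.

```python
import numpy as np
import pandas as pd

def sorted_unique_values(dataframe):
    """Map each column name to a sorted array of its unique values."""
    return {col: np.sort(np.asarray(dataframe[col].unique())) for col in dataframe}

# toy frame (values invented)
toy = pd.DataFrame({
    "month": [12, 1, 8, 1],
    "origin": ["TTN", "DEN", "DEN", "PBI"],
})

result = sorted_unique_values(toy)
print(result)
```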

### Date

In [9]:
#view unique values in flight date columns
print(uv_dict(date_info))

#view columns pertaining to flight date
date_info.head()

{'quarter': array([1, 3, 4], dtype=int64), 'month': array([ 1,  8, 11, 12], dtype=int64), 'day_of_month': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31],
      dtype=int64), 'day_of_week': array([1, 2, 3, 4, 5, 6, 7], dtype=int64), 'fl_date': array(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
       '2016-01-05', '2016-01-06', '2016-01-07', '2016-01-08',
       '2016-01-09', '2016-01-10', '2016-01-11', '2016-01-12',
       '2016-01-13', '2016-01-14', '2016-01-15', '2016-01-16',
       '2016-01-17', '2016-01-18', '2016-01-19', '2016-01-20',
       '2016-01-21', '2016-01-22', '2016-01-23', '2016-01-24',
       '2016-01-25', '2016-01-26', '2016-01-27', '2016-01-28',
       '2016-01-29', '2016-01-30', '2016-01-31', '2016-08-01',
       '2016-08-02', '2016-08-03', '2016-08-04', '2016-08-05',
       '2016-08-06', '2016-08-07', '2016-08-08', '2016-08-09',
       '2016-08-10', '2016-08-11', '

Unnamed: 0,quarter,month,day_of_month,day_of_week,fl_date
0,1,1,3,7,2016-01-03
1,1,1,3,7,2016-01-03
2,1,1,3,7,2016-01-03
3,1,1,3,7,2016-01-03
4,1,1,3,7,2016-01-03


In [10]:
#dummify quarter, month, day of month, and day of week columns
date_info = pd.get_dummies(date_info, columns=['quarter','month','day_of_week','day_of_month'])

#view final dataframe on date info
date_info.head()

Unnamed: 0,fl_date,quarter_1,quarter_3,quarter_4,month_1,month_8,month_11,month_12,day_of_week_1,day_of_week_2,...,day_of_month_22,day_of_month_23,day_of_month_24,day_of_month_25,day_of_month_26,day_of_month_27,day_of_month_28,day_of_month_29,day_of_month_30,day_of_month_31
0,2016-01-03,1,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2016-01-03,1,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2016-01-03,1,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,2016-01-03,1,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,2016-01-03,1,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
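
To make the dummy encoding concrete, here is a minimal sketch on three invented rows. Only `month` and `day_of_week` are encoded, to keep the output small.

```python
import pandas as pd

# toy date columns (day_of_week uses 1 = Monday ... 7 = Sunday, as in the data above)
toy_dates = pd.DataFrame({
    "fl_date": ["2016-01-03", "2016-08-15", "2016-12-25"],
    "month": [1, 8, 12],
    "day_of_week": [7, 1, 7],
})

# each listed column is replaced by one indicator column per distinct value
toy_dummies = pd.get_dummies(toy_dates, columns=["month", "day_of_week"])
print(toy_dummies.columns.tolist())
```

Columns not listed in `columns=` (here `fl_date`) pass through unchanged, which is why `fl_date` remains the first column in the output above.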


### Location

****
#### Bureau of Transportation Statistics Definitions
(https://www.transtats.bts.gov/Fields.asp?Table_ID=292)
****

In [11]:
#create dictionary of unique values in location columns
location_uv = uv_dict(location_info)
print(location_uv)

#view columns pertaining to flight location
location_info.head()

{'origin_airport_id': array([10135, 10136, 10140, 10141, 10146, 10154, 10155, 10157, 10158,
       10165, 10170, 10185, 10208, 10245, 10257, 10279, 10299, 10333,
       10372, 10397, 10408, 10423, 10431, 10434, 10469, 10529, 10551,
       10561, 10577, 10581, 10599, 10620, 10627, 10631, 10666, 10685,
       10693, 10713, 10721, 10728, 10731, 10732, 10739, 10747, 10754,
       10779, 10781, 10785, 10792, 10800, 10821, 10849, 10868, 10874,
       10918, 10926, 10980, 10990, 10994, 11003, 11013, 11042, 11049,
       11057, 11066, 11076, 11097, 11109, 11122, 11140, 11146, 11150,
       11193, 11203, 11252, 11259, 11267, 11278, 11292, 11298, 11308,
       11336, 11337, 11413, 11423, 11433, 11447, 11471, 11481, 11495,
       11503, 11525, 11537, 11540, 11577, 11587, 11603, 11612, 11617,
       11618, 11624, 11630, 11637, 11638, 11641, 11648, 11695, 11697,
       11721, 11775, 11778, 11823, 11865, 11867, 11884, 11898, 11905,
       11921, 11953, 11973, 11977, 11980, 11982, 11986, 11995, 11996

Unnamed: 0,origin_airport_id,origin_airport_seq_id,origin_city_market_id,origin,origin_city_name,origin_state_abr,origin_state_nm,dest_airport_id,dest_airport_seq_id,dest_city_market_id,dest,dest_city_name,dest_state_abr,dest_state_nm
0,11292,1129202,30325,DEN,"Denver, CO",CO,Colorado,11003,1100303,31003,CID,"Cedar Rapids/Iowa City, IA",IA,Iowa
1,14027,1402702,34027,PBI,"West Palm Beach/Palm Beach, FL",FL,Florida,11292,1129202,30325,DEN,"Denver, CO",CO,Colorado
2,15356,1535602,35356,TTN,"Trenton, NJ",NJ,New Jersey,14492,1449202,34492,RDU,"Raleigh/Durham, NC",NC,North Carolina
3,14492,1449202,34492,RDU,"Raleigh/Durham, NC",NC,North Carolina,15356,1535602,35356,TTN,"Trenton, NJ",NJ,New Jersey
4,15356,1535602,35356,TTN,"Trenton, NJ",NJ,New Jersey,13930,1393004,30977,ORD,"Chicago, IL",IL,Illinois


In [12]:
"""remove state abbreviation suffix from city_name values
    - already present in state_abr columns
    - create function to remove abbreviations from list of columns in a dataframe
"""
def remove_abbr(dataframe, lst_of_columns):
    for col in lst_of_columns:
        dataframe[col] = dataframe[col].str[:-4]

#call function on location info for origin and destination city name columns 
remove_abbr(location_info,['origin_city_name','dest_city_name'])

'''create link column from origin and destination airport code columns
    - link = unique origin to destination pair
'''
location_info['link'] = location_info['origin'] + '-' + location_info['dest']

#view final dataframe on flight origin and destination
location_info.head()

Unnamed: 0,origin_airport_id,origin_airport_seq_id,origin_city_market_id,origin,origin_city_name,origin_state_abr,origin_state_nm,dest_airport_id,dest_airport_seq_id,dest_city_market_id,dest,dest_city_name,dest_state_abr,dest_state_nm,link
0,11292,1129202,30325,DEN,Denver,CO,Colorado,11003,1100303,31003,CID,Cedar Rapids/Iowa City,IA,Iowa,DEN-CID
1,14027,1402702,34027,PBI,West Palm Beach/Palm Beach,FL,Florida,11292,1129202,30325,DEN,Denver,CO,Colorado,PBI-DEN
2,15356,1535602,35356,TTN,Trenton,NJ,New Jersey,14492,1449202,34492,RDU,Raleigh/Durham,NC,North Carolina,TTN-RDU
3,14492,1449202,34492,RDU,Raleigh/Durham,NC,North Carolina,15356,1535602,35356,TTN,Trenton,NJ,New Jersey,RDU-TTN
4,15356,1535602,35356,TTN,Trenton,NJ,New Jersey,13930,1393004,30977,ORD,Chicago,IL,Illinois,TTN-ORD
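
The fixed-width slice `.str[:-4]` relies on every city name ending in a comma, a space, and a two-letter state code. As a hedged alternative, the sketch below strips the suffix with a regular expression, so a value without that suffix is left untouched. The pattern and the toy rows (taken from the sample output above) are illustrative assumptions, not the notebook's method.

```python
import pandas as pd

# two rows taken from the location sample shown above
toy_loc = pd.DataFrame({
    "origin": ["DEN", "TTN"],
    "origin_city_name": ["Denver, CO", "Trenton, NJ"],
    "dest": ["CID", "ORD"],
    "dest_city_name": ["Cedar Rapids/Iowa City, IA", "Chicago, IL"],
})

# remove a trailing ", XX" state abbreviation only where it is present
for col in ["origin_city_name", "dest_city_name"]:
    toy_loc[col] = toy_loc[col].str.replace(r",\s*[A-Z]{2}$", "", regex=True)

# link = origin airport code, hyphen, destination airport code
toy_loc["link"] = toy_loc["origin"] + "-" + toy_loc["dest"]
print(toy_loc)
```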


### Metrics

****
#### Bureau of Transportation Statistics Definitions
(https://www.transtats.bts.gov/Fields.asp)
****

In [13]:
#view unique values in flight metrics columns
print(uv_dict(flight_metrics))

#view columns pertaining to flight metrics
flight_metrics.head()

{'crs_dep_time': array([   1,    2,    3, ..., 2357, 2358, 2359], dtype=int64), 'dep_time': array([1.000e+00, 2.000e+00, 3.000e+00, ..., 2.358e+03, 2.359e+03,
       2.400e+03]), 'dep_delay': array([ -60.,  -58.,  -54., ..., 1663., 1964., 2040.]), 'dep_delay_new': array([0.000e+00, 1.000e+00, 2.000e+00, ..., 1.663e+03, 1.964e+03,
       2.040e+03]), 'wheels_on': array([1.000e+00, 2.000e+00, 3.000e+00, ..., 2.358e+03, 2.359e+03,
       2.400e+03]), 'taxi_in': array([  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,  11.,
        12.,  13.,  14.,  15.,  16.,  17.,  18.,  19.,  20.,  21.,  22.,
        23.,  24.,  25.,  26.,  27.,  28.,  29.,  30.,  31.,  32.,  33.,
        34.,  35.,  36.,  37.,  38.,  39.,  40.,  41.,  42.,  43.,  44.,
        45.,  46.,  47.,  48.,  49.,  50.,  51.,  52.,  53.,  54.,  55.,
        56.,  57.,  58.,  59.,  60.,  61.,  62.,  63.,  64.,  65.,  66.,
        67.,  68.,  69.,  70.,  71.,  72.,  73.,  74.,  75.,  76.,  77.,
        78.,  79.,  80.,  

Unnamed: 0,crs_dep_time,dep_time,dep_delay,dep_delay_new,wheels_on,taxi_in,crs_arr_time,arr_time,arr_delay,arr_delay_new,crs_elapsed_time,actual_elapsed_time,air_time,flights,distance
0,1525,1524.0,-1.0,0.0,1807.0,8.0,1820,1815.0,-5.0,0.0,115.0,111.0,87.0,1.0,692.0
1,700,744.0,44.0,44.0,940.0,8.0,929,948.0,19.0,19.0,269.0,244.0,224.0,1.0,1679.0
2,1845,1858.0,13.0,13.0,2006.0,7.0,2015,2013.0,-2.0,0.0,90.0,75.0,60.0,1.0,373.0
3,2055,2054.0,-1.0,0.0,2208.0,7.0,2220,2215.0,-5.0,0.0,85.0,81.0,57.0,1.0,373.0
4,1250,1252.0,2.0,2.0,1349.0,51.0,1420,1440.0,20.0,20.0,150.0,168.0,107.0,1.0,693.0


In [14]:
'''remove flights column from metrics dataframe
    - all rows contain only one flight i.e. column values are redundant
'''
flight_metrics = flight_metrics.drop(columns='flights')

In [15]:
'''rename delay columns for clarity
    - dep_delay and arr_delay to dep_dev and arr_dev (deviation from schedule)
        -- values can also indicate instances of early flights
    - dep_delay_new and arr_delay_new to dep_delay and arr_delay
        -- they are exclusively indicative of flight delay
'''
flight_metrics.rename(columns={'dep_delay':'dep_dev', 'dep_delay_new':'dep_delay', 'arr_delay':'arr_dev','arr_delay_new':'arr_delay'}, inplace=True)

#view resulting flight metrics dataframe
flight_metrics.head()

Unnamed: 0,crs_dep_time,dep_time,dep_dev,dep_delay,wheels_on,taxi_in,crs_arr_time,arr_time,arr_dev,arr_delay,crs_elapsed_time,actual_elapsed_time,air_time,distance
0,1525,1524.0,-1.0,0.0,1807.0,8.0,1820,1815.0,-5.0,0.0,115.0,111.0,87.0,692.0
1,700,744.0,44.0,44.0,940.0,8.0,929,948.0,19.0,19.0,269.0,244.0,224.0,1679.0
2,1845,1858.0,13.0,13.0,2006.0,7.0,2015,2013.0,-2.0,0.0,90.0,75.0,60.0,373.0
3,2055,2054.0,-1.0,0.0,2208.0,7.0,2220,2215.0,-5.0,0.0,85.0,81.0,57.0,373.0
4,1250,1252.0,2.0,2.0,1349.0,51.0,1420,1440.0,20.0,20.0,150.0,168.0,107.0,693.0
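
In the sample rows above, the delay column equals the deviation column with negative (early) values set to zero. The sketch below reproduces that relationship for the five `dep_dev` values shown. That it holds for every row of the full dataset is an assumption here, not something checked in this notebook.

```python
import pandas as pd

# dep_dev values from the five sample rows above (negative = early departure)
dep_dev = pd.Series([-1.0, 44.0, 13.0, -1.0, 2.0])

# clip negatives to zero to get a delay-only measure
dep_delay = dep_dev.clip(lower=0)
print(dep_delay.tolist())
```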


In [16]:
'''create final flights dataframe
    - join edited content-specific dataframes
    - create multiIndex from link and flight departure datetime
'''
flights = date_info.join(location_info).join(flight_metrics).sort_values(by=['link','fl_date','dep_time'])

#zero-pad dep_time to four digits (HHMM) and append it to the flight date
#  - a dep_time of 2400 is mapped to 0000 of the same flight date
dep_hhmm = flights.dep_time.astype(int).astype(str).str.zfill(4).replace('2400', '0000')
fl_datetime = flights.fl_date + dep_hhmm

#add datetime index as column to flights dataframe
flights['dt_index'] = pd.to_datetime(fl_datetime, format='%Y-%m-%d%H%M')

#create multiIndexed dataframe from link and dt_index columns
flights = flights.set_index(['link', 'dt_index'],drop=False)

#view final flights dataframe
print(flights.info())
print(flights.head())
print(flights.tail())

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1824403 entries, (ABE-ATL, 2016-01-01 07:00:00) to (YUM-PHX, 2016-12-31 19:15:00)
Data columns (total 76 columns):
fl_date                  object
quarter_1                uint8
quarter_3                uint8
quarter_4                uint8
month_1                  uint8
month_8                  uint8
month_11                 uint8
month_12                 uint8
day_of_week_1            uint8
day_of_week_2            uint8
day_of_week_3            uint8
day_of_week_4            uint8
day_of_week_5            uint8
day_of_week_6            uint8
day_of_week_7            uint8
day_of_month_1           uint8
day_of_month_2           uint8
day_of_month_3           uint8
day_of_month_4           uint8
day_of_month_5           uint8
day_of_month_6           uint8
day_of_month_7           uint8
day_of_month_8           uint8
day_of_month_9           uint8
day_of_month_10          uint8
day_of_month_11          uint8
day_of_month_12          uin
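
The datetime construction in the cell above can be sketched on three invented rows. Departure times are stored as numbers such as `5.0` or `744.0`, so they are zero-padded to four digits before parsing.

```python
import pandas as pd

# toy rows: 00:05, 07:44, and the special value 2400 (values invented)
toy = pd.DataFrame({
    "fl_date": ["2016-01-01", "2016-01-01", "2016-12-31"],
    "dep_time": [5.0, 744.0, 2400.0],
})

# 5.0 -> "0005", 744.0 -> "0744"; "2400" is not a valid %H%M value, so map it to "0000"
hhmm = toy["dep_time"].astype(int).astype(str).str.zfill(4).replace("2400", "0000")

# concatenate date and time strings, then parse with an explicit format
toy["dt_index"] = pd.to_datetime(toy["fl_date"] + hhmm, format="%Y-%m-%d%H%M")
print(toy)
```

Note that mapping 2400 to 0000 on the same date, as the notebook does, places that departure at the start of the flight date rather than at its end.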

## Step 4: Create final Links dataframes
  - link = unique origin to destination pair
  - use the 100 most traveled links, since they carry the highest frequency of flights
  - create links dataframe consisting of median delay times per hour per link during sample period

In [17]:
'''create top links dataframe
    - consisting of only the top 100 most traveled links
'''
#get counts for occurrence of each link in flights dataframe
name, count = np.unique(flights.link, return_counts=True)

#create series from link counts and sort in descending order
link_counts = pd.Series(dict(zip(name, count))).sort_values(ascending=False)

#create list of top 100 links
top_links_lst = link_counts[:100].index.tolist()

#create top links dataframe from flights dataframe
top_links = flights[flights.link.isin(top_links_lst)]
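
The same selection can be sketched with `value_counts`, which counts and sorts in one step. This is a minimal example on invented data that keeps the top 2 links instead of the top 100.

```python
import pandas as pd

# toy flights frame with a link column (values invented)
toy_flights = pd.DataFrame({
    "link": ["ATL-TPA", "TPA-ATL", "ATL-TPA", "DEN-CID", "ATL-TPA", "TPA-ATL"],
    "dep_delay": [0.0, 5.0, 12.0, 0.0, 3.0, 0.0],
})

# value_counts returns counts sorted in descending order
top_links_lst = toy_flights["link"].value_counts().head(2).index.tolist()

# keep only the flights on the most traveled links
toy_top = toy_flights[toy_flights["link"].isin(top_links_lst)]
print(top_links_lst, len(toy_top))
```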

In [18]:
'''create list of dataframes resampled by link and hour
    - initialize empty list
    - loop through multiIndex by link
    - resample link-specific dataframes by hour and append to list
'''
links_hourly_lst = []
for n in top_links.index.get_level_values(0).unique():
    n_df = top_links.loc[n]
    #median of numeric columns per hour; hours without flights are forward-filled
    n_hourly = n_df.resample('60min').median(numeric_only=True).ffill()
    n_hourly['link'] = n
    links_hourly_lst.append(n_hourly)

#concatenate resampled dataframes in list
links_hourly_concat = pd.concat(links_hourly_lst)
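
To show what the hourly resampling does to a single link, here is a small sketch on invented departures. Link `A-B` has two flights in the 07:00 hour, none in the 08:00 hour, and one in the 09:00 hour.

```python
import pandas as pd

# toy departures indexed by datetime (values invented)
toy = pd.DataFrame(
    {
        "link": ["A-B", "A-B", "A-B", "C-D"],
        "dep_delay": [0.0, 10.0, 4.0, 6.0],
    },
    index=pd.to_datetime([
        "2016-01-01 07:05", "2016-01-01 07:40",
        "2016-01-01 09:15", "2016-01-01 07:30",
    ]),
)

pieces = []
for name, group in toy.groupby("link"):
    # hourly median; the empty 08:00 bin is NaN until forward-filled
    hourly_one = group[["dep_delay"]].resample("60min").median().ffill()
    hourly_one["link"] = name
    pieces.append(hourly_one)

hourly = pd.concat(pieces)
print(hourly)
```

The 08:00 row for `A-B` carries the 07:00 median forward, which is how filled values enter the links dataframe.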

In [19]:
'''create links_d dataframe 
    - slice for months in original sample
    - avoids having final links dataframe consist mostly of filled values
    - link column not yet dummified
'''
#create list of months in original sample
sample_months = top_links.index.get_level_values(1).month.unique().tolist()

#populate empty list with slices of the concatenated links_hourly dataframe containing sample months
links_s_months_lst= []
for month in sample_months:
    links_nmonth = links_hourly_concat.loc[links_hourly_concat.index.month == month]
    links_s_months_lst.append(links_nmonth)

#concatenate list of sample month slices to create final links_d dataframe
links_d = pd.concat(links_s_months_lst)

#view final links_d dataframe
print(links_d.info())
print(links_d.head())
print(links_d.tail())

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 294369 entries, 2016-01-01 00:00:00 to 2016-12-31 19:00:00
Data columns (total 66 columns):
quarter_1                294369 non-null float64
quarter_3                294369 non-null float64
quarter_4                294369 non-null float64
month_1                  294369 non-null float64
month_8                  294369 non-null float64
month_11                 294369 non-null float64
month_12                 294369 non-null float64
day_of_week_1            294369 non-null float64
day_of_week_2            294369 non-null float64
day_of_week_3            294369 non-null float64
day_of_week_4            294369 non-null float64
day_of_week_5            294369 non-null float64
day_of_week_6            294369 non-null float64
day_of_week_7            294369 non-null float64
day_of_month_1           294369 non-null float64
day_of_month_2           294369 non-null float64
day_of_month_3           294369 non-null float64
day_of_month_4        
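
Because resampling produces a continuous hourly index, the forward fill also creates rows for months outside the sample (for example February through July). The sketch below, on an invented index, shows the month filter with a single `isin` mask.

```python
import pandas as pd

# hourly index crossing from January (sampled) into February (not sampled)
idx = pd.date_range("2016-01-31 22:00", periods=4, freq="60min")
toy_hourly = pd.DataFrame({"dep_delay": [3.0, 3.0, 3.0, 3.0]}, index=idx)

# months present in the original sample
sample_months = [1, 8, 11, 12]

# keep only rows whose timestamp falls in a sampled month
in_sample = toy_hourly.index.month.isin(sample_months)
toy_sampled = toy_hourly.loc[in_sample]
print(toy_sampled)
```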

In [20]:
'''create final links dataframe 
    - dummify link column
'''
links = pd.get_dummies(links_d, columns=['link'])

#view final links dataframe
print(links.info())
print(links.head())
print(links.tail())

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 294369 entries, 2016-01-01 00:00:00 to 2016-12-31 19:00:00
Columns: 165 entries, quarter_1 to link_TPA-ATL
dtypes: float64(65), uint8(100)
memory usage: 176.3 MB
None
                     quarter_1  quarter_3  quarter_4  month_1  month_8  \
dt_index                                                                 
2016-01-01 00:00:00        1.0        0.0        0.0      1.0      0.0   
2016-01-01 01:00:00        1.0        0.0        0.0      1.0      0.0   
2016-01-01 02:00:00        1.0        0.0        0.0      1.0      0.0   
2016-01-01 03:00:00        1.0        0.0        0.0      1.0      0.0   
2016-01-01 04:00:00        1.0        0.0        0.0      1.0      0.0   

                     month_11  month_12  day_of_week_1  day_of_week_2  \
dt_index                                                                
2016-01-01 00:00:00       0.0       0.0            0.0            0.0   
2016-01-01 01:00:00       0.0       0.0   

## Step 5: Store final dataframes for further analysis

In [21]:
%store flights
%store links_d
%store links

Stored 'flights' (DataFrame)
Stored 'links_d' (DataFrame)
Stored 'links' (DataFrame)
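
`%store` only works inside IPython. As a hedged alternative for plain Python scripts, the sketch below round-trips a small invented frame through a pickle file. The file name `links.pkl` and the temporary directory are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# toy frame standing in for one of the final dataframes (values invented)
toy_links = pd.DataFrame(
    {"dep_delay": [0.0, 12.0]},
    index=pd.to_datetime(["2016-01-01 00:00", "2016-01-01 01:00"]),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "links.pkl")  # hypothetical file name
    toy_links.to_pickle(path)                  # write to disk
    restored = pd.read_pickle(path)            # read back in a later session

print(restored.equals(toy_links))
```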
