Sources: http://nbviewer.jupyter.org/url/ddowey.github.io/cs109-Final-Project/FlightDelay-ProjectFinal.ipynb

As a frequent traveller, I have spent many hours at airports waiting for delayed flights. 
This analysis has two main goals:<br> 
<br> 
1) Predict whether a flight will be delayed. I will treat this as a binary classification problem (delayed or not) and compare models to find the one that predicts flight delays best. <br> 
<br> 
2) Examine the causes of flight delays. I will also carry out an exploratory analysis to determine the relationships between the available variables (temporal, spatial and other). For example, which airlines' flights are most likely to be delayed? 

Conclusion: prediction likely requires more features than those available. As a next step, I would add historical weather information for each airport's location; I would expect more delays when it snows or during a particularly rainy season.
Information about the state of the plane would also be interesting. <br>
<br>
Possible extension: coupon-specific information for each domestic itinerary of the Origin and Destination Survey, such as the operating carrier, origin and destination airports, number of passengers, fare class, coupon type, trip break indicator, and distance.
https://www.transtats.bts.gov/Tables.asp?DB_ID=125&DB_Name=Airline%20Origin%20and%20Destination%20Survey%20%28DB1B%29&DB_Short_Name=Origin%20and%20Destination%20Survey


The data has to be downloaded from the Bureau of Transportation Statistics. There does not seem to be an API, but the complete dataset can be downloaded month by month from URLs of the following format:  https://transtats.bts.gov/PREZIP/On_Time_On_Time_Performance_YYYY_M.zip <br>
<br>
Due to the size of the dataset, I would rather download only specific fields. I will select them and download the data using Selenium.
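
As a small illustration of the URL format above, the monthly archive addresses can be generated from a list of years and months. This is only a sketch: `URL_TEMPLATE` and `monthly_zip_urls` are names introduced here, and the assumption that the month is written without zero padding is read off the single `M` in the pattern. The code does not download anything.

```python
# Template copied from the text; YYYY and M are replaced by the year and the
# month number (assumed to be written without zero padding).
URL_TEMPLATE = "https://transtats.bts.gov/PREZIP/On_Time_On_Time_Performance_{year}_{month}.zip"

def monthly_zip_urls(years, months):
    """Return one archive URL per (year, month) pair, in the order given."""
    return [URL_TEMPLATE.format(year=int(y), month=int(m))
            for y in years for m in months]

urls = monthly_zip_urls(["2016"], range(1, 13))
print(urls[0])
print(len(urls))
```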

In [1]:
#load basic libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [2]:
import selenium
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
import zipfile
import time
import os
from bs4 import BeautifulSoup
import requests
import re

In [39]:
import bts_data

In [3]:
# set options
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', 200)
pd.set_option('display.max_columns', None)
plt.style.use('fivethirtyeight')
%matplotlib inline
#import warnings
#warnings.filterwarnings('ignore')

The years and months to download can be changed easily. For now, I will download 2015 data. Later on, I can download and analyze more years, or use 2016 data as a test set to imitate a real-life implementation. <br>

In [23]:
years = ["2016"] # years to download
months = [str(month+1) for month in range(12)] # months to download

In [24]:
root = os.getcwd() # current folder
target_folder = 'bts_data' # download the data here

In [25]:
os.getcwd()

'C:\\Users\\micakova\\Experiments\\Pokusy\\Search'

First off, I will check what fields I can download. The full dataset contains a lot of columns. To make the volume of downloaded data smaller, I will specify which variables I would like to download as opposed to downloading the whole data set.

In [26]:
target_url = 'https://transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time'
page = requests.get(target_url) 
soup = BeautifulSoup(page.content, "lxml")
fields = list()
for x in soup.findAll(type='checkbox'):
    try:
        fields.append(x['title'])
    except KeyError: # skip checkboxes without a title attribute
        pass

In [27]:
fields

['Year',
 'Quarter',
 'Month',
 'DayofMonth',
 'DayOfWeek',
 'FlightDate',
 'UniqueCarrier',
 'AirlineID',
 'Carrier',
 'TailNum',
 'FlightNum',
 'OriginAirportID',
 'OriginAirportSeqID',
 'OriginCityMarketID',
 'Origin',
 'OriginCityName',
 'OriginState',
 'OriginStateFips',
 'OriginStateName',
 'OriginWac',
 'DestAirportID',
 'DestAirportSeqID',
 'DestCityMarketID',
 'Dest',
 'DestCityName',
 'DestState',
 'DestStateFips',
 'DestStateName',
 'DestWac',
 'CRSDepTime',
 'DepTime',
 'DepDelay',
 'DepDelayMinutes',
 'DepDel15',
 'DepartureDelayGroups',
 'DepTimeBlk',
 'TaxiOut',
 'WheelsOff',
 'WheelsOn',
 'TaxiIn',
 'CRSArrTime',
 'ArrTime',
 'ArrDelay',
 'ArrDelayMinutes',
 'ArrDel15',
 'ArrivalDelayGroups',
 'ArrTimeBlk',
 'Cancelled',
 'CancellationCode',
 'Diverted',
 'CRSElapsedTime',
 'ActualElapsedTime',
 'AirTime',
 'Flights',
 'Distance',
 'DistanceGroup',
 'CarrierDelay',
 'WeatherDelay',
 'NASDelay',
 'SecurityDelay',
 'LateAircraftDelay',
 'FirstDepTime',
 'TotalAddGTime',
 

**As a first step, I will select the following variables:** <br>
<br>
1) **Time Period**: ['Year','Month','DayofMonth','DayOfWeek'] <br>
2) **Airline**: ['UniqueCarrier','FlightNum']  <br>
3) **Origin and destination**: ['OriginAirportID','DestAirportID','OriginWac','DestWac']<br>
4) **Departure and arrival performance + flight summary information**: ['CRSArrTime','CRSDepTime','DepDelay','ArrDelay','Cancelled','CancellationCode','CRSElapsedTime','Distance','DivAirportLandings'] <br>
5) **Detailed information about the cause of delay**: ['CarrierDelay','WeatherDelay','NASDelay','SecurityDelay','LateAircraftDelay']

<br> For the prediction itself, I will only use information that is available weeks or months ahead of the flight. However, as part of the exploratory analysis, I will also examine variables that are only available ex post, in order to study the reasons for delays and the relationship between departure and arrival delay. <br> 
I tried to minimize duplication (for example, downloading only one of OriginAirportID and OriginAirportSeqID is sufficient), but some may remain. If I discover during the analysis that some of the fields are redundant, I will omit them. 
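
The distinction between information known ahead of the flight and information known only afterwards can be made explicit in code. The sketch below is illustrative: the split of the selected fields into `EX_ANTE` and `EX_POST` is my reading of the field names, and the 15-minute threshold is an assumption borrowed from the convention behind the `ArrDel15` field, not a choice confirmed at this point in the analysis. `is_delayed` and `split_record` are hypothetical helper names.

```python
# Fields taken from the selection listed above. The split into "known ahead"
# and "known only afterwards" is an assumption made for illustration.
EX_ANTE = ['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'UniqueCarrier', 'FlightNum',
           'OriginAirportID', 'DestAirportID', 'OriginWac', 'DestWac',
           'CRSDepTime', 'CRSArrTime', 'CRSElapsedTime', 'Distance']
EX_POST = ['DepDelay', 'ArrDelay', 'Cancelled', 'CancellationCode', 'DivAirportLandings',
           'CarrierDelay', 'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay']

DELAY_THRESHOLD_MIN = 15  # assumed cut-off, following the 15-minute convention of ArrDel15

def is_delayed(arr_delay_minutes, threshold=DELAY_THRESHOLD_MIN):
    """Binary target: True if the flight arrived at least `threshold` minutes late."""
    return arr_delay_minutes >= threshold

def split_record(record):
    """Split one flight record (a dict) into model features and the binary target."""
    features = {name: record[name] for name in EX_ANTE if name in record}
    return features, is_delayed(record['ArrDelay'])

# Made-up record, for illustration only.
flight = {'Month': 1, 'DayOfWeek': 5, 'UniqueCarrier': 'AA', 'Distance': 733.0,
          'DepDelay': 20.0, 'ArrDelay': 17.0}
features, delayed = split_record(flight)
print(features, delayed)
```

Keeping the two lists separate makes it harder to leak ex-post fields such as `DepDelay` into the model by accident. Cancelled flights have no arrival delay and would need separate handling.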

In [28]:
unselect_vars = ['OriginAirportSeqID','DestAirportSeqID'] # preselected on the form; clicking them unselects them

# list of fields to select
selected_vars = ['Year','Month','DayofMonth','DayOfWeek',
                 'UniqueCarrier',
                 'OriginWac','DestWac',
                 'FlightNum',
                 'CRSArrTime','CRSDepTime',
                 'DepDelay','ArrDelay',
                 'Cancelled','CancellationCode',
                 'CRSElapsedTime','Distance',
                 'DivAirportLandings',
                 'CarrierDelay','WeatherDelay','NASDelay','SecurityDelay','LateAircraftDelay']
selected_vars = selected_vars + unselect_vars # OriginAirportID is selected by default

In [29]:
print(selected_vars)

['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'UniqueCarrier', 'OriginWac', 'DestWac', 'FlightNum', 'CRSArrTime', 'CRSDepTime', 'DepDelay', 'ArrDelay', 'Cancelled', 'CancellationCode', 'CRSElapsedTime', 'Distance', 'DivAirportLandings', 'CarrierDelay', 'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay', 'OriginAirportSeqID', 'DestAirportSeqID']
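
Because each selected name is later used to locate a checkbox by its title, a misspelled field would only fail once the browser is already running. A cheap guard is to compare the requested names with the scraped ones first. The helper below is a hypothetical sketch (`missing_fields` is not part of the original notebook) and uses toy inputs rather than the real `selected_vars` and `fields`.

```python
def missing_fields(requested, available):
    """Return the requested field names that are not offered on the form, in order."""
    offered = set(available)
    return [name for name in requested if name not in offered]

# Toy inputs for illustration; in the notebook these would be `selected_vars` and `fields`.
available = ['Year', 'Month', 'FlightNum', 'DepDelay']
requested = ['Year', ' FlightNum', 'DepDelay']  # note the stray leading space
print(missing_fields(requested, available))
```

An empty result means every requested field has a matching checkbox title.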


Other data that could be used: latitude, longitude, weather data, airplane data.

In [30]:
def download_data(root_path, target_folder, years, months, selected_vars=None):
    """
    Download the BTS data using selenium. 
    years, months, selected_vars are lists. 
    selected_vars is optional - if not specified, all available fields will be downloaded.
    """
    target_folder_path = os.path.join(root_path,target_folder)
    if os.path.exists(target_folder_path):
        print('folder already exists!')
    else:
        #specify target_folder, years and months (list) and selected_vars (optional)
        profile = webdriver.FirefoxProfile() #create a new profile
        profile.set_preference("browser.download.folderList", 2)
        profile.set_preference("browser.download.manager.showWhenStarting", False)
        profile.set_preference("browser.download.dir", target_folder_path) # save data to
        profile.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/x-gzip")
        target_url = 'https://transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time'
        driver = webdriver.Firefox(firefox_profile=profile)
        driver.get(target_url)
        time.sleep(10)
        if selected_vars:
            for element in selected_vars:
                driver.find_element_by_xpath("//input[@type='checkbox' and @title='{}']".format(element)).click()
        else:  
            driver.find_element_by_name("DownloadZip").click() # download all fields
        time.sleep(10)
        for year in years:
            for month in months:
                print("Downloading data for {}/{}".format(month, year))
                Select(driver.find_element_by_id('XYEAR')).select_by_value(year) # select year in the dropdown list
                time.sleep(5)
                Select(driver.find_element_by_id('FREQUENCY')).select_by_value(month) # select month in the dropdown list
                time.sleep(5)
                driver.find_element_by_name("Download").click() 
                time.sleep(20) # wait 
                #time.sleep(10)
        time.sleep(150)  # wait to  make sure all the files have finished downloading  
        print('Driver is quitting...')
        driver.quit()
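
The fixed waits in `download_data` are either too long or, on a slow connection, too short. One possible alternative is to poll the download folder until the expected number of archives is present and no partial files remain. The sketch below assumes that Firefox marks unfinished downloads with a `.part` suffix; `wait_for_downloads` and its parameters are names introduced here, not part of the original notebook.

```python
import os
import time

def wait_for_downloads(folder, expected_count, timeout=300, poll=1.0):
    """
    Block until `folder` holds at least `expected_count` .zip files and no
    partial-download files (*.part). Return True on success, False on timeout.
    """
    deadline = time.time() + timeout
    while True:
        names = os.listdir(folder) if os.path.isdir(folder) else []
        zips = [n for n in names if n.lower().endswith('.zip')]
        partial = [n for n in names if n.lower().endswith('.part')]
        if len(zips) >= expected_count and not partial:
            return True
        if time.time() >= deadline:
            return False
        time.sleep(poll)
```

In `download_data`, a call such as `wait_for_downloads(target_folder_path, len(years) * len(months))` could stand in for the final `time.sleep(150)`.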

In [31]:
# unzips individual files, renames them and moves them to a single folder 

def unzip_data(main_folder_path,target_folder_name):
    file_extension = 'zip'
    new_folder_name = 'unzipped'
    target_folder = os.path.join(main_folder_path,target_folder_name)
    print('target folder: {}'.format(target_folder))
    new_folder = os.path.join(target_folder,new_folder_name)
    if not os.path.exists(new_folder):
        os.makedirs(new_folder)
        for i,item in enumerate(os.listdir(target_folder)): # loop through items in dir.
            if item.endswith(file_extension): # check for ".zip" extension
                # the unzipped files all have the same name, so extract each archive to its own subfolder
                with zipfile.ZipFile(os.path.join(target_folder,item)) as zip_ref:
                    zip_ref.extractall(os.path.join(new_folder,new_folder_name+'{}'.format(i+1)))
                print('{} unzipped'.format(item))
        # move the extracted files to a single folder, renaming each one after its subfolder
        for subfolder in next(os.walk(new_folder))[1]:
            subfolder_path = os.path.join(new_folder,subfolder)
            match = re.search('unzipped[0-9]+', subfolder_path) # None if the subfolder has no number
            for file_name in os.listdir(subfolder_path):
                if match is None:
                    new_file_name = os.path.join(new_folder,'data0.csv')
                else:
                    new_file_name = os.path.join(new_folder,'data{}.csv'.format(match.group(0)))
                os.rename(os.path.join(subfolder_path,file_name), new_file_name)
    else:
        print('The directory {} already exists!'.format(new_folder))
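
The extract-then-rename approach above is needed because every archive contains a file with the same name. A possible alternative, sketched here under that same assumption, is to copy each CSV member straight out of its archive under a name derived from the archive itself, which avoids the intermediate subfolders. `extract_csvs` is a hypothetical helper, not part of the original notebook.

```python
import os
import shutil
import zipfile

def extract_csvs(zip_folder, out_folder):
    """
    Copy every .csv member of every .zip in `zip_folder` into `out_folder`,
    naming it after its archive so equally named members do not collide.
    Returns the list of written paths.
    """
    os.makedirs(out_folder, exist_ok=True)
    written = []
    for zip_name in sorted(os.listdir(zip_folder)):
        if not zip_name.lower().endswith('.zip'):
            continue
        stem = os.path.splitext(zip_name)[0]
        with zipfile.ZipFile(os.path.join(zip_folder, zip_name)) as archive:
            members = [m for m in archive.namelist() if m.lower().endswith('.csv')]
            for k, member in enumerate(members):
                # add an index only when one archive holds several CSV files
                suffix = '' if len(members) == 1 else '_{}'.format(k)
                out_path = os.path.join(out_folder, '{}{}.csv'.format(stem, suffix))
                with archive.open(member) as src, open(out_path, 'wb') as dst:
                    shutil.copyfileobj(src, dst)
                written.append(out_path)
    return written
```

Naming the output after the archive also keeps the link between each CSV and the month it came from.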
            

In [32]:
#new_folder = os.path.join(main_folder_path,target_folder_name)
# mf path C:\\Users\\micakova\\Experiments\\Pokusy\\Search
# target folder name data_train_raw\\
# unzipped1...12
def concat_files(main_folder_path,target_folder_name,output_name):
    write_header = True
    new_folder_path = os.path.join(main_folder_path,target_folder_name,'unzipped')
    print(new_folder_path)
    if os.path.exists(output_name):
        print('file already exists!')
    else:
        with open(output_name, 'w') as target:
            #for item in os.listdir(target_folder+ '\\' + new_folder_name):
            csv_files = [file for file in os.listdir(new_folder_path) if file.endswith('csv')]
            for item in csv_files:
                print(item)
                with open(os.path.join(new_folder_path,item), 'r') as source:
                    lines = source.readlines()
                    if not write_header:
                        lines = lines[1:]
                    else:
                        write_header = False
                    lines = [line for line in lines if line.strip() != '']
                    target.writelines(lines)        
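
`concat_files` reads each monthly file fully into memory with `readlines()` before writing it out. For files of this size, a line-by-line variant may be gentler on memory. The sketch below keeps the same rules (header from the first file only, blank lines skipped) and additionally guards against a last line without a trailing newline. `concat_csvs` is a name introduced here, and like the original it assumes the first line of each file is the header.

```python
def concat_csvs(csv_paths, output_path):
    """
    Append the files in `csv_paths` to `output_path` line by line, keeping the
    header of the first non-empty file only and skipping blank lines.
    Returns the number of data rows written (header excluded).
    """
    rows = 0
    header_written = False
    with open(output_path, 'w', newline='') as target:
        for path in csv_paths:
            with open(path, 'r', newline='') as source:
                for line_no, line in enumerate(source):
                    if not line.strip():
                        continue  # skip blank lines
                    if line_no == 0:
                        if header_written:
                            continue  # drop repeated headers
                        header_written = True
                    else:
                        rows += 1
                    # make sure a missing final newline does not merge two rows
                    target.write(line if line.endswith('\n') else line + '\n')
    return rows
```

Passing a sorted list of paths makes the row order of the combined file reproducible, which `os.listdir` alone does not guarantee.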

In [47]:
download_data(root, target_folder, years, months, selected_vars=selected_vars)

In [35]:
unzip_data(root,target_folder)

target folder: C:\Users\micakova\Experiments\Pokusy\Search\bts_data


In [2]:
output_name = 'df_2016.csv'
concat_files(root, target_folder,output_name)

In [None]:
df_2015 = pd.read_csv(output_name)