## 1) What is your question?

__How are you operationalizing your dependent variable (if any)?__

The dependent variable is a binary category: a positive result means a flight departure will be delayed, and a negative result means it will not be delayed.  For brevity, "delay" means departure delay throughout.

Initially, the goal was to determine which Chicago airport would provide a shorter delay for a flight on a specified date to a specified airport.  The model was going to determine average expected delay in minutes for the specified flights at Midway and O'Hare and recommend an airport based on the shorter expected delay.  The models created for this dependent variable had low accuracy and struggled mightily to predict delays longer than 45 minutes. These models were more successful at predicting whether there was a delay or not.  

Therefore, in hopes of producing a model that could return somewhat accurate results, the dependent variable was switched to delayed or not delayed.  Various cutoffs for qualifying as delayed (5, 10, 15 minutes) were tested, but the model performs best when simply deciding between delay and no delay.
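
A minimal sketch of how the binary label and the tested cutoffs could be derived from a delay-in-minutes column. The function name, the sample values, and the use of a strict `>` comparison are assumptions for illustration, not details taken from the project code.

```python
import pandas as pd

def label_delays(dep_delay_minutes: pd.Series, cutoff: float = 0.0) -> pd.Series:
    """Return 1 where the departure delay exceeds `cutoff` minutes, else 0."""
    # Assumption: "delayed" means strictly more than `cutoff` minutes late.
    return (dep_delay_minutes > cutoff).astype(int)

# Hypothetical delays in minutes (negative = departed early).
delays = pd.Series([-3, 0, 4, 12, 50])

labels_any = label_delays(delays)            # delay vs. no delay
labels_15 = label_delays(delays, cutoff=15)  # only delays over 15 minutes count
```

Changing `cutoff` to 5, 10, or 15 reproduces the alternative definitions of "delayed" mentioned above.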

__What are your potential independent variables, why, and how will you operationalize them?__

1) Airports & Historic Flight Data  
    Prior Airport, Origin Airport, Destination Airport, Daily Scheduled Departures, Hourly Scheduled Departures

2) Aircraft  
    Airline, Model, Year Manufactured

3) Time  
    Scheduled Departure (Year, Week, Day of Week & Hour), Minutes between scheduled arrival in Chicago and scheduled departure

4) Weather  
    Temperature, Visibility, Wind Speed, and Precipitation at the departure and arrival airports, by hour, from three hours before departure/arrival through two hours after departure/arrival.  A different iteration used weather at the prior airport and the departure airport.

The goal of using these variables is to capture delays related to airports, airlines, aircraft, scheduling, seasonal variability, and weather.  These variables fail to account for other possible sources of delay such as staffing issues, security issues, vendor issues/delays, airport construction, policy changes and any other unknowns which may affect the timeliness of aircraft departures.
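
The hourly weather window in item 4 (three hours before through two hours after) could be built roughly as follows. This is a sketch under assumptions: the column names (`temp`, `visibility`, `wind_speed`, `precip`, `airport`, `hour`, `origin`, `sched_dep`) are hypothetical, the weather table is assumed to hold exactly one row per airport and hour, and the flights table must not already have a column named `airport`.

```python
import pandas as pd

WEATHER_COLS = ["temp", "visibility", "wind_speed", "precip"]  # hypothetical names
OFFSETS = range(-3, 3)  # -3, -2, -1, 0, +1, +2 hours around the scheduled time

def add_weather_window(flights, weather, time_col, airport_col, prefix):
    """Attach one set of weather columns per hourly offset around `time_col`.

    Assumes `weather` has one row per (airport, hour).
    """
    out = flights.reset_index(drop=True)
    base_hour = out[time_col].dt.floor("h")
    for off in OFFSETS:
        renamed = weather.rename(
            columns={c: f"{prefix}_{c}_h{off:+d}" for c in WEATHER_COLS}
        )
        out["_hour"] = base_hour + pd.Timedelta(hours=off)
        out = out.merge(
            renamed, how="left",
            left_on=[airport_col, "_hour"], right_on=["airport", "hour"],
        ).drop(columns=["airport", "hour"])
    return out.drop(columns="_hour")

# Hypothetical data: six hourly observations and one flight at 08:20.
hours = pd.date_range("2016-01-01 05:00", periods=6, freq="h")
weather = pd.DataFrame({
    "airport": "ORD", "hour": hours,
    "temp": [20, 21, 22, 23, 24, 25], "visibility": 10.0,
    "wind_speed": [5, 6, 7, 8, 9, 10], "precip": 0.0,
})
flights = pd.DataFrame({
    "origin": ["ORD"],
    "sched_dep": pd.to_datetime(["2016-01-01 08:20"]),
})
features = add_weather_window(flights, weather, "sched_dep", "origin", "dep")
```

Each flight ends up with 24 weather columns per airport (4 measurements times 6 hourly offsets), which is one reason the feature count grows quickly.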


## 2) What is your data?

__Where did you find it and how?__

1) Airport, Airline, and Historic Flight Data:  
What: Historic flight schedules and actual flight data, including delay times, Jan 2011 - July 2016  
Source: http://www.transtats.bts.gov/  
Method: programmatically downloaded csv files from their reporting interface

What: Location Data on airports  
Source: http://openflights.org/data.html  
Method: downloaded csv

2) Aircraft:  
What: Match table of old Tail Numbers to new Tail Numbers.  Some BTS Tail Numbers are no longer correct.  
Source: https://www.researchgate.net  
Method: downloaded CSV match table 

What: Aircraft data by tail number (Manufacturer, Model, Manufacture Date, Seats)  
Source: https://flightaware.com/resources/registration/  
Method: scraped site using tail numbers from bts.gov data


3) Weather:  
What: Roughly hourly historic weather at airports Jan 2011 - July 2016  
Source: https://www.wunderground.com/history  
Method: scraped site using airport codes from bts.gov data

__What are the major transformations you made and why?__

Each row of the BTS flight dataset provided a single departure date and scheduled and actual departure and arrival times.

 - Arrival date.  Created an arrival date, because some flights are overnight and arrive the day after they depart.
 - Turnaround time.  Computed the difference between an aircraft's scheduled arrival and scheduled departure times at Chicago.
 - Prior airport.  The model required info about the prior airport (the airport an aircraft visited before arriving at Chicago). A method involving tail numbers and Chicago airport codes was used to determine which airport an aircraft had flown from into Chicago and what time it departed the prior airport.
 - Tail Numbers.  Tail Numbers change over time and get reassigned.  Figuring out which new Tail Number corresponded to each old one was necessary to get aircraft details.
 - Times and Dates.  Dates and times were presented in different formats across the BTS dataset and the weather dataset, so they had to be converted to a common format.
 - Airport Codes.  Needed a match table to go from the US DOT Airport ID provided in the BTS data to IATA codes.
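
The arrival-date, turnaround, and prior-airport steps above can be sketched with pandas as shown below. This is an illustration under assumptions, not the project's actual code: all column names are hypothetical, clock times are assumed to be integers in HHMM form, and time zones are ignored (so an arrival clock time earlier than the departure is simply treated as next-day).

```python
import pandas as pd

def hhmm_to_timestamp(date, hhmm):
    """Combine a date column with an integer HHMM clock time such as 2355."""
    return (date
            + pd.to_timedelta(hhmm // 100, unit="h")
            + pd.to_timedelta(hhmm % 100, unit="min"))

def add_schedule_timestamps(legs):
    """Build scheduled departure/arrival timestamps from one departure date."""
    out = legs.copy()
    out["sched_dep"] = hhmm_to_timestamp(out["flight_date"], out["dep_hhmm"])
    out["sched_arr"] = hhmm_to_timestamp(out["flight_date"], out["arr_hhmm"])
    # Assumption: an arrival earlier than the departure means an overnight flight.
    overnight = out["sched_arr"] < out["sched_dep"]
    out.loc[overnight, "sched_arr"] += pd.Timedelta(days=1)
    return out

def add_prior_leg(legs):
    """For each leg, look up the same aircraft's previous leg by tail number."""
    out = legs.sort_values(["tail_num", "sched_dep"]).copy()
    prev = out.groupby("tail_num")[["origin", "dest", "sched_dep", "sched_arr"]].shift(1)
    # Only trust the previous leg if it landed where this leg departs.
    connects = prev["dest"] == out["origin"]
    out["prior_airport"] = prev["origin"].where(connects)
    out["prior_sched_dep"] = prev["sched_dep"].where(connects)
    gap = out["sched_dep"] - prev["sched_arr"]
    out["turnaround_min"] = (gap.dt.total_seconds() / 60).where(connects)
    return out

# Hypothetical legs for two aircraft passing through ORD.
legs = pd.DataFrame({
    "tail_num": ["N1", "N1", "N2", "N2"],
    "flight_date": pd.to_datetime(
        ["2016-03-01", "2016-03-02", "2016-03-01", "2016-03-02"]),
    "origin": ["MSP", "ORD", "LAX", "ORD"],
    "dest": ["ORD", "LAX", "ORD", "BOS"],
    "dep_hhmm": [2200, 15, 2300, 600],
    "arr_hhmm": [2330, 230, 450, 900],
})
result = add_prior_leg(add_schedule_timestamps(legs))
```

Legs with no matching previous leg (for example, the first appearance of a tail number) are left with missing values for the prior-airport columns rather than guessed.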


__Did you do any form of dimensionality reduction? Why?__

Yes. The amount of data in the training set made the models slow to train.
PCA was originally used during the model testing stage to reduce dimensionality.  As an alternative, a Random Forest was fit on a sample of the data to derive feature importances, and the feature set was then reduced to the most important features.
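
Both reduction approaches can be sketched with scikit-learn. The data here is synthetic, and the choices of 95% retained variance, a 200-row sample, and the top 5 features are illustrative assumptions rather than the values used in the project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the flight features: only columns 0 and 3 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)

# Option 1: PCA, keeping enough components to explain 95% of the variance.
pca = make_pipeline(StandardScaler(), PCA(n_components=0.95, svd_solver="full"))
X_pca = pca.fit_transform(X)

# Option 2: fit a Random Forest on a sample, then keep the top-k features.
sample = rng.choice(len(X), size=200, replace=False)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X[sample], y[sample])
top_k = np.argsort(forest.feature_importances_)[::-1][:5]
X_reduced = X[:, top_k]
```

One practical difference: the importance-based subset keeps the original, interpretable columns, while PCA components are linear mixtures of all features.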

## 3) Potential Models?

__Looking at your data and your question, what are the potential techniques you might use? Why?__

A Random Forest Regressor with PCA was originally fit on the entire training set.  Results were dismal, so the approach switched to classification models.  

Classification Models Tested:  
 - Random Forest Classifier
 - Logistic Regression
 - MLP Classifier
 - KNN Classifier
 - Extremely Randomized Trees Classifier
 
In addition to multiple models with different hyperparameters, many iterations of feature sets were tested, as was using PCA versus the raw features.
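
A sketch of how the five classifier families listed above could be compared on one train/test split with scikit-learn. The dataset is synthetic and the hyperparameters shown are placeholder assumptions, not the settings used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the delayed / not delayed dataset.
X, y = make_classification(n_samples=600, n_features=15, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Scale-sensitive models are wrapped in a pipeline with a scaler.
models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)),
    "mlp": make_pipeline(
        StandardScaler(), MLPClassifier(max_iter=500, random_state=0)),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "extra_trees": ExtraTreesClassifier(n_estimators=100, random_state=0),
}

# Record (training accuracy, holdout accuracy) for each model.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))

for name, (train_acc, test_acc) in scores.items():
    print(f"{name:20s} train={train_acc:.3f} test={test_acc:.3f}")
```

Printing training and holdout accuracy side by side makes it easy to spot a model whose training score is far above its holdout score.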


## 4) Results

__For each of your potential models, what did you find?__

I found that results were fairly consistent across classification models.

__How well does it score on holdout data?__

The score on holdout (test) data is similar to the score on training data, suggesting the model isn't overfit.

__If you chose hyperparameters, why did you choose the ones you did / did you use gridsearch and what was the outcome there?__

Grid search wasn't used; instead, hyperparameters were adjusted by hand over many iterations.

## 5) Answer your original question!

__What did you find out about the world? Are there models that seem well or not well suited to the task?__

The causes of flight delays are more complicated than I thought!  Simply predicting a delay is a challenge and predicting the length of a delay is extremely difficult.  

__Given more code / more time / other resources, what are changes you would make to your analysis?__

Given more time and resources, I would conduct additional research on the proven causes of delays and investigate datasets that encompass those causes.