## Goal

Create a model that predicts the expected delay of a flight from Chicago to a given city or airport on a specific future date, broken out by departure airport (ORD vs. MDW).

Extended goals:
 - Create an interactive website
 - Predict security wait time
 - Predict travel time to the airport given start location and mode of transport
 
## Success Criteria

Predict delays to within +/- 5 minutes of the actual delay for at least 50% of flights.
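
One way this criterion could be checked against held-out flights is sketched below. The function name `fraction_within` and the sample values are hypothetical, and the criterion is assumed to mean "absolute error of at most 5 minutes on at least half of the flights".

```python
def fraction_within(predicted, actual, tolerance=5.0):
    """Share of predictions whose absolute error is at most `tolerance` minutes."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        raise ValueError("need at least one prediction")
    hits = sum(1 for p, a in zip(predicted, actual) if abs(p - a) <= tolerance)
    return hits / len(predicted)


# Made-up delays in minutes, for illustration only.
predicted = [10.0, 0.0, 22.0, 40.0]
actual = [12.0, 7.0, 20.0, 90.0]

# The success criterion is met when at least half the predictions are hits.
meets_target = fraction_within(predicted, actual) >= 0.5
```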


## Features

Key features:

 - __Airports:__  Prior Airport, Origin Airport, Destination Airport
 - __Airplane:__  Airline, Manufacture Year, and potentially airplane model/type
 - __Date:__  Week Number, Weekday, Daypart, and potentially others
 - __Origin Airport Weather:__  Temperature, Visibility, Wind Speed, Precipitation, Events, Conditions
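
As a rough sketch, the Date features above could be derived from a scheduled departure timestamp as follows. The daypart boundaries are an assumption (the plan does not define them), as are the function names and the choice to encode weekday as an integer with Monday as 0.

```python
from datetime import datetime


def daypart(hour):
    """Bucket an hour of the day. Boundaries are hypothetical, not from the plan."""
    if 5 <= hour < 12:
        return "morning"
    if 12 <= hour < 17:
        return "afternoon"
    if 17 <= hour < 21:
        return "evening"
    return "overnight"


def date_features(departure):
    """Return week number, weekday, and daypart for a departure datetime."""
    return {
        "week_number": departure.isocalendar()[1],  # ISO week of the year
        "weekday": departure.weekday(),             # 0 = Monday ... 6 = Sunday
        "daypart": daypart(departure.hour),
    }


features = date_features(datetime(2016, 11, 28, 7, 30))
```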


Potential additional features:

 - __Prior Airport Weather__
 - __Destination Airport Weather__
 - __Holidays__

## Models

__Delay Model__

Create a regression model that predicts the number of minutes a flight will be delayed.  Several regression models will be tested, and the top performer will be used.

Delay will be defined as: 

```
max((Actual Arrival Time - Expected Arrival Time), 0)
```
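
A minimal sketch of this definition in Python, assuming both arrival times are available as `datetime` values (the function and argument names are hypothetical):

```python
from datetime import datetime


def delay_minutes(expected_arrival, actual_arrival):
    """Minutes late at arrival; early and on-time arrivals count as 0."""
    minutes = (actual_arrival - expected_arrival).total_seconds() / 60
    return max(minutes, 0.0)


# Made-up times, for illustration only.
late = delay_minutes(datetime(2016, 6, 1, 14, 0), datetime(2016, 6, 1, 14, 35))
early = delay_minutes(datetime(2016, 6, 1, 14, 0), datetime(2016, 6, 1, 13, 50))
```

Clipping at zero means early arrivals do not offset late ones when delays are averaged.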

Since 2011, 43% of all Chicago departures have experienced a delay.  Of those delayed flights, the average delay is 35 minutes.  Because flights that are not delayed count as 0 minutes, the expected delay across all Chicago flights is about 15 minutes (0.43 x 35).

__Cancellation Model__ -- ***No Longer Planning to Create***

Because the likelihood of cancellation is so low, it will be difficult to outperform a baseline prediction of "not cancelled".  Additionally, there is no data to train a model on cancellation likelihood: the historic data only records whether a flight was cancelled or not.

The website may instead show historic cancellation rates for flights to the selected destination.
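
To illustrate why the baseline is hard to beat, the sketch below scores a classifier that always predicts the most common label. The sample labels are made up to match the 1.4% cancellation rate cited in the original Plan B; the function name is hypothetical.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    if not labels:
        raise ValueError("need at least one label")
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# 14 cancelled flights out of 1,000 (1.4%), for illustration only.
cancelled = [True] * 14 + [False] * 986
baseline = majority_baseline_accuracy(cancelled)
```

Any cancellation classifier would need to exceed this accuracy to add value over the trivial prediction.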

~~Plan A:  Create a classification model that returns a flight's cancellation status as likely or unlikely.  Will test KNN and Random Forest models.~~

~~Plan B:  Only 1.4% of flights from Chicago have been cancelled since 2011, which creates a very high accuracy threshold for the classification model.  It's very likely the classification models will be unable to achieve an accuracy above 98.6%.  If that is the case, a regression model will be used instead.  Multiple regression models will be tested, and the best model will be used to return a probability of cancellation for flights from ORD vs MDW.~~


## Risks and Assumptions

__Future Flights__

In order to create an interactive model, there needs to be a database of future departures from Chicago.  I've been unable to get a database of future flights that provides Airport, Airplane, and Date features.  The main sources of this data are [SRS Analyzer from diio](https://www.diio.net/products/srs-analyser-1) and [OAG](http://www.oag.com/schedules/worldwide-direct-flights).  SRS no longer offers free trials, and I'm waiting to hear back from OAG.  Alternatively, [Route Happy](https://www.routehappy.com) uses OAG's database, and the data may be scrapable.

__Future Weather__

Weather is inherently unpredictable, especially for dates more than a few days in the future.  The current solution is to use daily historic averages.  This will remove all hourly weather variations and severely limit daily variations.  Relying on averages will decrease model accuracy and decrease the variance of responses for departure dates near each other.
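
A minimal sketch of how daily historic averages might be computed from hourly readings, assuming each reading is a `(timestamp, value)` pair. The function name and sample temperatures are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def daily_historic_averages(hourly_records):
    """Average one weather reading per calendar day, pooled across hours and years."""
    by_day = defaultdict(list)
    for timestamp, value in hourly_records:
        by_day[(timestamp.month, timestamp.day)].append(value)
    return {day: mean(values) for day, values in by_day.items()}


# Made-up temperatures for July 4 in two different years.
records = [
    (datetime(2014, 7, 4, 9), 70.0),
    (datetime(2014, 7, 4, 15), 80.0),
    (datetime(2015, 7, 4, 9), 74.0),
    (datetime(2015, 7, 4, 15), 84.0),
]
averages = daily_historic_averages(records)
```

All four readings collapse into a single value for July 4, which shows how the hourly variation described above is lost.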

__Future Aircraft__

The Tail Number for future flights is unknown.  If historic data shows that flight numbers consistently use the same type of aircraft, then aircraft type will be assigned based on flight number.  If type of aircraft is inconsistent within flight numbers, then aircraft features may have to be dropped from the model.
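
The consistency check described above could look something like the following sketch. The 90% threshold (`min_share`), the function name, and the flight numbers and aircraft types in the example are all hypothetical.

```python
from collections import Counter, defaultdict


def dominant_aircraft_by_flight(history, min_share=0.9):
    """Map flight number to its most common aircraft type, if consistent enough."""
    types_by_flight = defaultdict(Counter)
    for flight_number, aircraft_type in history:
        types_by_flight[flight_number][aircraft_type] += 1

    assigned = {}
    for flight_number, counts in types_by_flight.items():
        aircraft_type, count = counts.most_common(1)[0]
        # Only assign a type when it covers at least `min_share` of past flights.
        if count / sum(counts.values()) >= min_share:
            assigned[flight_number] = aircraft_type
    return assigned


# Made-up history of (flight number, aircraft type) pairs.
history = [("AA100", "B738")] * 10 + [("UA200", "A320")] * 5 + [("UA200", "B739")] * 5
assigned = dominant_aircraft_by_flight(history)
```

Flight numbers missing from the result are the ones for which aircraft features could not be assigned reliably.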

__Outliers__

Over 99% of all flights are on time or delayed less than 3 hours.  The longest delay since 2011 was 21 hours.  It is unlikely that the model will be able to predict such extreme delays, and it may be beneficial to remove these outliers from the dataset altogether.
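
If outliers are removed, one simple approach is a fixed cap on delay length. The sketch below assumes the 3-hour mark mentioned above as the cutoff; the function name and sample delays are hypothetical.

```python
def drop_extreme_delays(delays, cap_minutes=180):
    """Keep only delays at or below the cap, and report the share kept."""
    kept = [d for d in delays if d <= cap_minutes]
    share_kept = len(kept) / len(delays) if delays else 0.0
    return kept, share_kept


# Made-up delays in minutes; 1260 minutes corresponds to a 21-hour delay.
delays = [0, 12, 35, 180, 1260]
kept, share_kept = drop_extreme_delays(delays)
```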

## Datasets Acquired

[Link to Dropbox 11/28/16](https://www.dropbox.com/sh/uoqmwp3ay868ywd/AACZL3OJVg5SrvrpVfCLmo-Na?dl=0)

 - Historic Flight Data: Departures from ORD and MDW from 01/2011 through 06/2016
 - Historic Weather Data: Hourly weather at ORD and MDW from 01/2011 through 06/2016
 - Weather Averages: Temperature and Precipitation. Other averages will be calculated from Historic Weather Data
 - Airplane Data: Model and Manufacture Year by Tail Number

## Datasets Desired

 - Future Flight Data: From OAG, or scraped from Route Happy
 - Historic Security Wait Data: Data may not exist. Wait times vary by terminal, and terminal is not available in future flight data
 - Google Maps Travel Time: May be possible to incorporate into the site via an API
 - Employment info by airline over time: May not exist
 - Average flight fill rate by airline/route over time: May not exist