## Problem
The goal is to predict flight delays between Boston and Chicago and determine the most predictive factors.

## Client
Airline customers obviously don’t enjoy flight delays and would like to avoid them. They would be interested in knowing which airline, day of the week, city, and weather factors contribute to a delay. A customer might then book on certain days or with certain airlines, and may consider flying from a different airport nearby. A customer on a strict time constraint would be much more likely to do this. The analysis could also be helpful from an airline's perspective, since an airline may need to focus more attention on reducing delays under certain conditions, or may want to adjust its flight schedule during certain months to ensure on-time arrivals.

## Dataset
To address this problem, I used a 2015 flight delays and cancellations dataset from the Department of Transportation. The data was provided on Kaggle in csv format and contains over 5 million rows. I also included weather data from NOAA's National Centers for Environmental Information (NCEI). That dataset was also in csv format and contained daily weather summaries for Boston and Chicago during 2015.

## Cleaning 
**Kaggle provided a csv of flights from 2015. Took the following steps after reading it:**
* Combined flights.MONTH and flights.DAY columns into one datetime object. 
* Selected flights that were from Boston to Chicago and Chicago to Boston. 
* Dropped all rows that were missing a value in flights.ARRIVAL_DELAY. 
* Deleted any unnecessary columns such as cancellations.
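
The flight-cleaning steps above can be sketched in pandas roughly as follows. This is a minimal illustration on a toy dataframe, not the original notebook code: the airport codes `BOS` and `ORD`, the `CANCELLED` and `CANCELLATION_REASON` column names, and the helper `clean_flights` are assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for flights.csv (the real file has over 5 million rows).
flights = pd.DataFrame({
    "MONTH": [1, 1, 2, 3, 3],
    "DAY": [5, 6, 14, 9, 10],
    "ORIGIN_AIRPORT": ["BOS", "ORD", "BOS", "JFK", "ORD"],
    "DESTINATION_AIRPORT": ["ORD", "BOS", "ORD", "BOS", "BOS"],
    "ARRIVAL_DELAY": [12.0, None, -3.0, 40.0, 5.0],
    "CANCELLED": [0, 1, 0, 0, 0],
})


def clean_flights(df, airport_a="BOS", airport_b="ORD", year=2015):
    """Hypothetical helper: airport codes and dropped columns are assumptions."""
    df = df.copy()

    # Combine MONTH and DAY (plus the known year) into one datetime column.
    parts = df[["MONTH", "DAY"]].rename(columns={"MONTH": "month", "DAY": "day"})
    df["DATE"] = pd.to_datetime(parts.assign(year=year))

    # Keep flights in either direction between the two airports.
    a_to_b = df["ORIGIN_AIRPORT"].eq(airport_a) & df["DESTINATION_AIRPORT"].eq(airport_b)
    b_to_a = df["ORIGIN_AIRPORT"].eq(airport_b) & df["DESTINATION_AIRPORT"].eq(airport_a)
    df = df[a_to_b | b_to_a]

    # Drop rows with no recorded arrival delay, then drop unneeded columns.
    df = df.dropna(subset=["ARRIVAL_DELAY"])
    return df.drop(columns=["CANCELLED", "CANCELLATION_REASON"], errors="ignore")


cleaned = clean_flights(flights)
print(cleaned)
```
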

**The NOAA provided daily weather data for Boston and Chicago in a csv file. Took the following steps after reading it:**
* Converted weather.DATE column to datetime in order to merge with flights.
* Created weather.ORIGIN_AIRPORT column in order to merge with flights. 
* Merged flights.csv and weather.csv together on DATE and ORIGIN_AIRPORT.
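
A rough sketch of the weather merge is shown below. The station names, the `NAME` column, and the `STATION_TO_AIRPORT` mapping are hypothetical placeholders; the real NOAA file may identify stations differently. The `validate="many_to_one"` argument is an extra safety check I added so the merge fails loudly if a date and airport pair appears more than once in the weather data.

```python
import pandas as pd

flights = pd.DataFrame({
    "DATE": pd.to_datetime(["2015-01-05", "2015-01-05", "2015-02-14"]),
    "ORIGIN_AIRPORT": ["BOS", "ORD", "BOS"],
    "ARRIVAL_DELAY": [12.0, -4.0, 30.0],
})

# Toy stand-in for weather.csv; station names here are made up.
weather = pd.DataFrame({
    "NAME": ["BOSTON STATION", "CHICAGO STATION", "BOSTON STATION"],
    "DATE": ["2015-01-05", "2015-01-05", "2015-02-14"],
    "PRCP": [0.0, 0.1, 0.4],
    "SNOW": [0.0, 1.2, 5.1],
    "AWND": [10.1, 12.3, 15.2],
})

# Hypothetical mapping from station name to the airport code used in flights.
STATION_TO_AIRPORT = {"BOSTON STATION": "BOS", "CHICAGO STATION": "ORD"}

weather["DATE"] = pd.to_datetime(weather["DATE"])
weather["ORIGIN_AIRPORT"] = weather["NAME"].map(STATION_TO_AIRPORT)

# Left merge keeps every flight and attaches the origin's weather for that day.
flights_w_weather = flights.merge(
    weather.drop(columns="NAME"),
    on=["DATE", "ORIGIN_AIRPORT"],
    how="left",
    validate="many_to_one",
)
print(flights_w_weather)
```
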

**Took the following steps after combining the data in one dataframe called flights_w_weather:**
* Verified that all flights were from Boston to Chicago and Chicago to Boston and that they had the correct weather data
* Created a unique identifier to avoid duplicate rows. Kept one instance of each. 
* Identified rows with extreme values of DEPARTURE_DELAY and deleted them.
* Created flights_w_weather.DD_TAG and flights_w_weather.AD_TAG to flag departure and arrival delays. A flight with ARRIVAL_DELAY greater than 0 is considered an arrival delay and tagged 1 in AD_TAG. The same rule applies to DEPARTURE_DELAY and DD_TAG.
* Created flights_w_weather.DELAY to show delays. If an arrival or departure delay occurred, it is considered a delay and tagged 1. 
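
The tagging rules above could be implemented along these lines. The helper name `tag_delays` and the toy values are illustrative only.

```python
import pandas as pd

flights_w_weather = pd.DataFrame({
    "DEPARTURE_DELAY": [-5.0, 10.0, 0.0, -2.0],
    "ARRIVAL_DELAY": [-10.0, 3.0, 7.0, 0.0],
})


def tag_delays(df):
    """Add DD_TAG, AD_TAG, and DELAY columns (hypothetical helper)."""
    df = df.copy()
    # A delay value strictly greater than 0 minutes counts as a delay.
    df["DD_TAG"] = (df["DEPARTURE_DELAY"] > 0).astype(int)
    df["AD_TAG"] = (df["ARRIVAL_DELAY"] > 0).astype(int)
    # A flight is delayed if either the departure or the arrival was late.
    df["DELAY"] = df["DD_TAG"] | df["AD_TAG"]
    return df


tagged = tag_delays(flights_w_weather)
delay_rate = tagged["DELAY"].mean()
print(tagged)
print(f"Delay rate: {delay_rate:.2%}")
```
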

## Data Storytelling
I first wanted to find out the number of delays that occurred in 2015. In total, there were 6467 delays (50.76% of flights) between Boston and Chicago. A rate of 50.76% is very high and shows how common the problem is. I continued my analysis by asking additional questions about these delays and the rest of the data.

### How many airlines fly this route?
The chart below shows that there are five airlines that run between Chicago and Boston in this dataset. American Airlines has the highest count with 5744 flights while Spirit Airlines has the lowest count with 352 flights.
<img src="files/images/airlines.JPG">
### Which delay type caused the most delays? 
The chart below shows that air system delays have the highest count, with 1736 delays. I was surprised by the relatively small number of weather delays (229).
<img src="files/images/delay_types.JPG">
### Which airlines have the highest and lowest percentage of delays?
The chart below shows that Spirit has the highest percentage of delays with 68.47% while American has the lowest percentage of delays with 41.56%. 
<img src="files/images/airline_delay.JPG">
### Which seasons have the highest and lowest percentage of delays?
The chart below shows that Winter has the highest percentage of delays with 62.12% while Fall has the lowest percentage of delays with 35.46%.
<img src="files/images/season.JPG">
### Which days of the week have the highest and lowest percentage of delays?
The chart below shows that Wednesday has the highest percentage of delays with 53.65% while Friday has the lowest percentage of delays with 42.95%.
<img src="files/images/day_of_week.JPG">
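
The delay percentages quoted in this section are group means of the DELAY tag. A minimal sketch of that calculation, using toy data and a hypothetical helper `delay_pct_by`, might look like this:

```python
import pandas as pd

# Toy data; the real dataframe has one row per flight.
flights_w_weather = pd.DataFrame({
    "AIRLINE": ["AA", "AA", "AA", "AA", "NK", "NK", "NK"],
    "DELAY": [1, 0, 0, 1, 1, 1, 0],
})


def delay_pct_by(df, column):
    """Percentage of delayed flights per group, highest first."""
    return (
        df.groupby(column)["DELAY"]
        .mean()          # mean of a 0/1 tag is the delay proportion
        .mul(100)
        .round(2)
        .sort_values(ascending=False)
    )


pct_by_airline = delay_pct_by(flights_w_weather, "AIRLINE")
print(pct_by_airline)
```

The same helper would apply to any categorical column, for example a season or day-of-week column.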

## Hypothesis Tests
<p> A few of the charts above provided insight that I wanted to test further. I conducted two-proportion z-tests to test the differences in proportions of delays. In total, I conducted three tests and analyzed the p-value and confidence interval of each to reach my conclusions. </p>
<p> The purpose of the first test was to determine if United Airlines has a higher proportion of delays than American Airlines. I compared American and United because they have similar sample sizes and American had the lowest delay rate among all five airlines. The test produced a p-value less than .05, so I rejected the null hypothesis that the proportion of delays is the same for American and United. The 95% confidence interval for the difference in proportions is -.232 to -.155. Because the entire interval is below zero, I can conclude that United has a higher proportion of delays. </p>
<p> The purpose of the second test was to determine if Friday flights have a smaller proportion of delays than Saturday flights. I compared Friday and Saturday because I felt it was a common decision people make. For example, a customer going on vacation for the weekend may be deciding whether to take Friday off. The p-value was less than .05, so I rejected the null hypothesis that the proportion of delays is the same for Friday and Saturday. The 95% confidence interval for the difference in proportions is -.15 to -.006. I can conclude that Friday has a lower proportion of delays. </p>
<p> The purpose of the third test was to determine if winter has a higher proportion of delays than any other season. I chose to conduct this test because it is a common assumption among airline customers. I compared winter against each of the other three seasons. In every comparison the p-value was less than .05, so I rejected the null hypothesis that the proportion of delays is the same as in winter. The 95% confidence intervals for the difference in proportions are .052 to .148 for summer, .21 to .31 for fall, and .06 to .16 for spring. Each interval is entirely above zero, so I can conclude that winter has a higher proportion of delays than any other season. The differences for spring and summer are similar, while fall differs from winter the most. </p>
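
For reference, here is a standard-library sketch of a two-proportion z-test with a 95% confidence interval. It is not the code used for the results above: the counts in the example are made up, the function name is hypothetical, and I assume a two-sided test with a pooled standard error for the z statistic and an unpooled standard error for the interval.

```python
import math


def two_proportion_ztest(x1, n1, x2, n2, z_crit=1.96):
    """Two-sided z-test for p1 - p2, plus an approximate 95% CI.

    x1, x2 are delay counts; n1, n2 are flight counts.
    """
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2

    # Pooled standard error under the null hypothesis p1 == p2.
    pooled = (x1 + x2) / (n1 + n2)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = diff / se_pooled

    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))

    # Unpooled standard error for the confidence interval.
    se_unpooled = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)
    return z, p_value, ci


# Made-up counts: group 1 has 400 delays in 1000 flights, group 2 has 600 in 1000.
z, p_value, ci = two_proportion_ztest(400, 1000, 600, 1000)
print(f"z = {z:.2f}, p = {p_value:.3g}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

A confidence interval that excludes zero agrees with rejecting the null hypothesis at the .05 level, which is how the intervals above are read.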

## Machine Learning 

### Data Prep
First, I normalized PRCP, SNOW, and AWND. Second, I redefined the dataframe flights_w_weather as dflogit_1. Third, I created dummy variables for DAY_OF_WEEK, SEASON, AIRLINE, ORIGIN_AIRPORT, and DESTINATION_AIRPORT. 
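
A sketch of these preparation steps follows. The text does not say which normalization was used, so min-max scaling is an assumption here, as is `drop_first=True`, which drops one dummy per category to serve as the reference level. The toy values are illustrative.

```python
import pandas as pd

flights_w_weather = pd.DataFrame({
    "PRCP": [0.0, 0.5, 1.0],
    "SNOW": [0.0, 2.0, 4.0],
    "AWND": [5.0, 10.0, 15.0],
    "DAY_OF_WEEK": ["Mon", "Tue", "Wed"],
    "SEASON": ["Winter", "Fall", "Winter"],
    "AIRLINE": ["AA", "UA", "NK"],
    "ORIGIN_AIRPORT": ["BOS", "ORD", "BOS"],
    "DESTINATION_AIRPORT": ["ORD", "BOS", "ORD"],
    "DELAY": [0, 1, 1],
})


def min_max(series):
    """Scale a numeric column to the [0, 1] range (assumed normalization)."""
    span = series.max() - series.min()
    return (series - series.min()) / span if span else series * 0.0


dflogit_1 = flights_w_weather.copy()
for col in ["PRCP", "SNOW", "AWND"]:
    dflogit_1[col] = min_max(dflogit_1[col])

# One-hot encode the categorical columns; the original columns are replaced.
categorical = ["DAY_OF_WEEK", "SEASON", "AIRLINE", "ORIGIN_AIRPORT", "DESTINATION_AIRPORT"]
dflogit_1 = pd.get_dummies(dflogit_1, columns=categorical, drop_first=True)
print(dflogit_1.columns.tolist())
```
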

### Model Fitting and Evaluation
Models were fit using GridSearchCV to determine the optimal hyperparameters. I felt accuracy score was appropriate because the classes are roughly balanced: the delay rate in the dataset was around 50%. I then used a confusion matrix to determine the type 1 and type 2 errors made by each model and to verify the accuracy score. Finally, to choose the best model, I built ROC curves and used AUC score as my primary metric for comparison. The model with the highest AUC score was selected.
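
The workflow described above can be sketched with scikit-learn as follows. Synthetic data from `make_classification` stands in for the flight features, and the parameter grid, fold count, and split size are illustrative assumptions rather than the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, roughly balanced binary data in place of the real features.
X, y = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Small illustrative grid; the real search space is not given in the text.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=3, scoring="accuracy"
)
search.fit(X_train, y_train)

# Evaluate the refit best model on the held-out test set.
y_pred = search.predict(X_test)
y_score = search.predict_proba(X_test)[:, 1]

accuracy = accuracy_score(y_test, y_pred)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
auc = roc_auc_score(y_test, y_score)

print("Best parameters:", search.best_params_)
print(f"Accuracy: {accuracy:.2f}, AUC: {auc:.2f}")
print(f"TP={tp}, TN={tn}, FP={fp} (type 1), FN={fn} (type 2)")
```

Swapping in `LogisticRegression` or `KNeighborsClassifier` with its own grid would follow the same pattern.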

### Logistic Regression
Logistic regression provided an accuracy score of 61%. The confusion matrix showed that 1010 flights were delays and predicted correctly, 938 were not delays and predicted correctly, 634 were not delays and predicted incorrectly, and 603 were delays and predicted incorrectly. The coefficients showed that SNOW is the most influential feature and PRCP is the second most influential. The weekday coefficients showed that Wednesday, Thursday, Saturday, and Monday all contribute to a higher likelihood of delay than Tuesday. The airline coefficients showed that Spirit Airlines has the lowest contribution to a delay. B6, AA, UA, and OO are all negative. The season coefficients showed that fall has the lowest contribution to a delay. Winter, spring, and summer are all positive. The airport coefficients showed that Logan has a higher contribution to delay than Chicago.

Below is the ROC curve for Logistic regression. The AUC score is .65. 
<img src="files/images/roc_lr.JPG">

### Random Forest 
Random forest provided an accuracy score of 64%. The confusion matrix showed that 1085 flights were delays and predicted correctly, 942 were not delays and predicted correctly, 599 were not delays and predicted incorrectly, and 559 were delays and predicted incorrectly. The feature importances showed that PRCP is the most important feature and AWND is the second most important.

Below is the ROC curve for Random forest. The AUC score is .68. 
<img src="files/images/roc_rf.JPG">

### KNN
K-NN provided an accuracy score of 61%. The confusion matrix showed that 950 flights were delays and predicted correctly, 979 were not delays and predicted correctly, 562 were not delays and predicted incorrectly, and 694 were delays and predicted incorrectly.

Below is the ROC curve for K-NN. The AUC score is .64. 
<img src="files/images/roc_knn.JPG">

## Conclusion 

### Model Selection
Random Forest Classifier provides the highest AUC score of .68. This is fairly low, but I would choose it over KNN and Logistic Regression. The main concern for all the models is the type 1 and type 2 errors. There were 599 type 1 errors and 559 type 2 errors using Random Forest. These are both important errors to consider. A type 1 error could result in a customer missing their flight because they gave themselves more time when they shouldn't have. A type 2 error could result in a customer being late to an important meeting that same day as they wouldn't have given themselves enough time to travel. I think it is obvious that one shouldn't rely on these models entirely, but there are some useful insights. 
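
To compare the error profiles side by side, the confusion-matrix counts reported in the sections above can be converted into rates. The short script below only re-derives numbers from those counts. The helper `summarize` is illustrative, and "delay" is treated as the positive class, so a type 1 error is a false positive and a type 2 error is a false negative.

```python
# Counts taken from the confusion matrices reported for each model.
reported = {
    "Logistic Regression": {"tp": 1010, "tn": 938, "fp": 634, "fn": 603},
    "Random Forest": {"tp": 1085, "tn": 942, "fp": 599, "fn": 559},
    "K-NN": {"tp": 950, "tn": 979, "fp": 562, "fn": 694},
}


def summarize(tp, tn, fp, fn):
    """Accuracy and error rates from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        # Type 1 rate: share of on-time flights predicted as delayed.
        "false_positive_rate": fp / (fp + tn),
        # Type 2 rate: share of delayed flights predicted as on time.
        "false_negative_rate": fn / (fn + tp),
    }


summary = {name: summarize(**counts) for name, counts in reported.items()}
for name, stats in summary.items():
    print(
        f"{name}: accuracy={stats['accuracy']:.1%}, "
        f"type 1 rate={stats['false_positive_rate']:.1%}, "
        f"type 2 rate={stats['false_negative_rate']:.1%}"
    )
```
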

### Recommendations 
As most people would suspect, snow, precipitation, and wind are important factors in predicting a delay. These factors are often more severe in winter, so it makes sense that winter had the highest delay rate of any season. When booking winter flights, customers should keep this in mind and give themselves more time to travel. American Airlines had the lowest delay rate, so for travelers on a time crunch between Chicago and Boston it would be a smarter choice than United. For a weekend trip, leaving on Friday would be a smarter choice than leaving on Saturday.

### Next Steps
I would like to use the rest of the dataset to get a deeper understanding of the best airlines and airports when under a time crunch. For example, if a customer had an interview in New York and was flying from the Cleveland area, should they fly out of Akron/Canton or Cleveland? Should they fly to JFK or LaGuardia? I would also go back and analyze time of day to better understand the impact of timing on delays. Should my flight be at 6 a.m. or 1 p.m.? I would then feed upcoming flight information to the model and use the results to inform travel decisions.
