# Problem: Predicting Airplane Delays

The goals of this notebook are:
- Process and create a dataset from downloaded ZIP files
- Perform exploratory data analysis (EDA)
- Establish a baseline model and improve it

## Introduction to business scenario
You work for a travel booking website that wants to improve the customer experience around delayed flights. The company wants to create a feature that tells customers, at booking time, whether a flight to or from one of the busiest US domestic airports will be delayed due to weather.

You are tasked with solving part of this problem by using machine learning to identify whether a flight will be delayed due to weather. You have been given access to a dataset of on-time performance of domestic flights operated by large air carriers. You can use this data to train a machine learning model that predicts whether a flight to or from the busiest airports will be delayed.

### Dataset
The provided dataset contains scheduled and actual departure and arrival times reported by certified US air carriers that account for at least 1 percent of domestic scheduled passenger revenues. The data was collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS). The dataset contains the date, time, origin, destination, airline, distance, and delay status for flights between 2014 and 2018.
The data comes as 60 compressed files, one per month for the five years from 2014 to 2018, each containing a CSV of that month's flight details. The data can be downloaded from this link: [https://ucstaff-my.sharepoint.com/:f:/g/personal/ibrahim_radwan_canberra_edu_au/Er0nVreXmihEmtMz5qC5kVIB81-ugSusExPYdcyQTglfLg?e=bNO312]. Please download the data files and place them in a folder referenced by a relative path. The dataset(s) used in this assignment were compiled by the Office of Airline Information, Bureau of Transportation Statistics (BTS), Airline On-Time Performance Data, available at the following link: [https://www.transtats.bts.gov/Fields.asp?gnoyr_VQ=FGJ].


# Step 1: Problem formulation and data collection

Start this project off by writing a few sentences below that summarize the business problem and the business goal you're trying to achieve in this scenario. Include a business metric you would like your team to aspire toward. With that information defined, clearly write out the machine learning problem statement. Finally, add a comment or two about the type of machine learning this represents. 


### 1. Determine if and why ML is an appropriate solution to deploy.

Business Problem: Flight delays are a common frustration for travelers, disrupting their schedules and plans.

Business Goal: Improve the customer booking experience by predicting weather-related flight delays.

Machine Learning Problem Statement: Predict whether a domestic flight will be delayed due to weather conditions, based on historical flight performance data, including scheduled and actual departure times, flight distances, origin and destination airports, and carriers.

Type of Machine Learning: Supervised learning, classification task.

ML Appropriateness: ML is an appropriate solution because:

- Historical data is available on flight times and delays.
- ML algorithms can identify complex patterns in data that may correlate with weather-related delays.
- ML models can automatically process and analyze large volumes of data.
- ML can use historical trends to make predictions about future events.
- ML models can be retrained and improved over time.

In summary, ML has the potential to provide a more robust and scalable solution than traditional rule-based systems for predicting flight delays.

### 2. Formulate the business problem, success metrics, and desired ML output.

Business Problem: We want to provide customers with a way to see real-time predictions of weather-related flight delays before they book their flights. 

This will help them to plan their travel more effectively and manage their expectations.

Desired ML Output: We want the ML model to output a binary prediction for each flight, indicating whether it is likely to be delayed due to weather:

0: The flight is unlikely to be delayed due to weather.

1: The flight is likely to be delayed due to weather.

Additionally, it would be helpful to provide a confidence score for each prediction, giving customers an idea of how certain the model is about the prediction. 
This could be in the form of a probability score that indicates the likelihood of a delay.
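
As a minimal sketch of how a binary label and a probability score can come from the same model, the example below fits scikit-learn's `LogisticRegression` on synthetic data. The feature matrix, the labelling rule, and the 0.5 threshold are illustrative assumptions and are not taken from the flight dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption): 200 samples, 3 numeric features.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
# Hypothetical labelling rule so that the two classes are learnable.
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

model = LogisticRegression()
model.fit(X, y)

# predict_proba returns one column per class; column 1 is P(class 1).
delay_probability = model.predict_proba(X[:5])[:, 1]
# Binary output derived from the probability with an assumed 0.5 threshold.
delay_label = (delay_probability >= 0.5).astype(int)

for p, label in zip(delay_probability, delay_label):
    print(f"P(delay) = {p:.2f} -> predicted class {label}")
```

The threshold is a design choice: lowering it would flag more flights as likely delayed, at the cost of more false alarms.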

Success Metrics: We will measure the success of our solution by tracking the following metrics:

Prediction Accuracy: The percentage of flight delay predictions that are correct.

Reduction in Customer Complaints: The number of customer complaints related to unexpected flight delays.

Increase in Customer Satisfaction: Customer satisfaction scores related to the new predictive feature.

Booking Conversion Rate: The share of customers who complete a booking after using the prediction feature.

Customer Retention Rate: The rate at which customers return to book flights with us.

We believe that by providing customers with real-time predictions of weather-related flight delays, we can improve their satisfaction and trust in our travel booking system.
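
Prediction accuracy on its own can be hard to interpret when one class is much rarer than the other. The sketch below uses made-up labels (an assumption: 5 weather-delayed flights out of 100) to compare accuracy with recall for a trivial model that never predicts a delay. The numbers are illustrative only and are not results from the flight data.

```python
# Made-up ground truth (assumption): 95 on-time flights, 5 weather-delayed.
y_true = [0] * 95 + [1] * 5
# A trivial model that never predicts a delay.
y_pred = [0] * 100

# Accuracy: share of all predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: share of the actual delays that the model caught.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
recall = true_positives / actual_positives

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
# prints: accuracy = 0.95, recall = 0.00
```

In this toy case the model looks accurate yet misses every delay, which is why it may be worth tracking recall or precision alongside accuracy.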


### 3. Identify the type of ML problem you’re dealing with.

Supervised binary classification is the type of machine learning problem we are facing in this scenario. 
This implies that the model will learn from historical flight data, where the delay status is known, to predict delays for new, unseen flights. 

The two classes in this case are:

Class 0: The flight will not be delayed due to weather.

Class 1: The flight will be delayed due to weather.

The model will learn to distinguish between these two classes by analyzing the input features, which may include historical flight data and weather data. 
Once trained, the model can be used to predict the delay status of new flights by providing it with the relevant input features.

In simpler terms, we are trying to train a machine learning model to identify which flights are likely to be delayed due to weather. 
We will do this by providing the model with historical flight data, which includes information such as the scheduled and actual departure times, flight distances, origin and destination airports, and carriers. 
The model will also be given weather data, such as the forecast for the day of the flight. Once the model is trained, it will be able to predict the delay status of new flights by analyzing the input features.

This is a binary classification problem because the model is only predicting two possible outcomes: delayed or not delayed.
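
To make the two classes concrete, here is a minimal sketch of deriving a 0/1 target from a delay column with pandas. The column names (`weather_delay_minutes`, `is_weather_delay`) and the toy records are hypothetical, and treating missing values as "no weather delay" is an assumption rather than a rule from the dataset.

```python
import pandas as pd

# Toy records (assumption): the column names are hypothetical, not the dataset's.
flights = pd.DataFrame({
    "flight_id": [1, 2, 3, 4],
    "weather_delay_minutes": [0.0, 25.0, None, 3.0],
})

# Class 1 if any weather delay was recorded, otherwise class 0.
# Missing values are treated as "no weather delay" here, which is an assumption.
flights["is_weather_delay"] = (
    flights["weather_delay_minutes"].fillna(0) > 0
).astype(int)

# Count how many flights fall into each class.
print(flights["is_weather_delay"].value_counts().sort_index())
```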

### Setup

Now that we have decided where to focus our energy, let's set things up so you can start working on solving the problem.

In [1]:
import os
from pathlib import Path
from zipfile import ZipFile
import time

import pandas as pd
import numpy as np
import subprocess

import matplotlib.pyplot as plt
import seaborn as sns

sns.set()

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

# <please add any other library or function you are aiming to import here>
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline



# Step 2: Data preprocessing and visualization  
In this data preprocessing phase, you should take the opportunity to explore and visualize your data to better understand it. First, import the necessary libraries and read the data into a Pandas dataframe. After that, explore your data. Look for the shape of the dataset and explore your columns and the types of columns you're working with (numerical, categorical). Consider performing basic statistics on the features to get a sense of feature means and ranges. Take a close look at your target column and determine its distribution.
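
The exploration steps described above can be sketched on a toy dataframe as follows. The columns (`distance`, `carrier`, `is_delay`) and their values are hypothetical placeholders, not the real flight table; the calls themselves are standard pandas.

```python
import pandas as pd

# Toy data (assumption): hypothetical columns standing in for the real flight table.
df = pd.DataFrame({
    "distance": [689.0, 731.0, 1199.0, 1587.0, 2475.0],
    "carrier": ["AA", "DL", "AA", "UA", "DL"],
    "is_delay": [0, 0, 1, 0, 0],
})

print(df.shape)                  # (rows, columns)
print(df.dtypes)                 # type of each column
print(df.describe())             # count, mean, std, min/max for numeric columns
print(df["carrier"].nunique())   # number of distinct carriers

# Share of each target class; a skewed split signals class imbalance.
target_share = df["is_delay"].value_counts(normalize=True)
print(target_share)
```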

### Specific questions to consider
1. What can you deduce from the basic statistics you ran on the features? 

2. What can you deduce from the distributions of the target classes?

3. Is there anything else you deduced from exploring the data?

Start by bringing the downloaded ZIP files into this notebook environment.

In [4]:
# Define the project paths

base_path = os.getcwd()                                  # folder that contains the whole project (data and code)
zip_path = os.path.join(base_path, 'data_compressed')    # folder that holds the downloaded zip files
csv_base_path = os.path.join(base_path, 'Data_Files')    # folder where the zip files will be extracted

# Create the extraction directory if it does not exist.
# os.makedirs works on every platform and handles paths with spaces or '&',
# unlike an unquoted `!mkdir -p` shell call.
os.makedirs(csv_base_path, exist_ok=True)

In [6]:
# How many zip files do we have?

# Absolute path to the folder that holds the downloaded zip files (adjust for your machine)
zip_path = '/Users/harshapotluri/Documents/DataScicnecTechlogy&systems/FinalProject_U3221945/data_compressed'

# Create the directory if it does not exist already
os.makedirs(zip_path, exist_ok=True)

# Count only the .zip archives, ignoring any other files in the folder.
# A count of 0 means the archives have not been copied into zip_path yet (60 are expected).
zip_files = [f for f in os.listdir(zip_path) if f.endswith('.zip')]
print("Number of zip files:", len(zip_files))

Number of zip files: 0
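
For comparison, here is a small self-contained sketch of counting archives with the standard-library `pathlib` module. It builds its own temporary folder with placeholder archives, so the folder layout and the file names (`sample_0.zip`, `notes.txt`) are made up for illustration and do not reflect the real download.

```python
import tempfile
from pathlib import Path
from zipfile import ZipFile

with tempfile.TemporaryDirectory() as tmp:
    zip_dir = Path(tmp) / "data_compressed"
    zip_dir.mkdir()

    # Create three small placeholder archives plus one unrelated file.
    for i in range(3):
        with ZipFile(zip_dir / f"sample_{i}.zip", "w") as zf:
            zf.writestr(f"sample_{i}.csv", "col_a,col_b\n1,2\n")
    (zip_dir / "notes.txt").write_text("not an archive")

    # Count only the .zip files, ignoring everything else in the folder.
    zip_files = sorted(zip_dir.glob("*.zip"))
    num_zip_files = len(zip_files)
    print("Number of zip files:", num_zip_files)
    # prints: Number of zip files: 3
```

Filtering by extension keeps stray files, such as notes or hidden system files, out of the count.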


#### Extract CSV files from ZIP files