# Predicting Flight Delays with GitHub Copilot
The flight dataset contains one row per flight, including the carrier, the origin and destination airports, the scheduled times, and the departure and arrival delays.
In this notebook, we use it to predict whether a flight will be delayed.

In [1]:
import pandas as pd

df = pd.read_csv("data/flights.csv")
df.head()

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,Carrier,OriginAirportID,OriginAirportName,OriginCity,OriginState,DestAirportID,DestAirportName,DestCity,DestState,CRSDepTime,DepDelay,DepDel15,CRSArrTime,ArrDelay,ArrDel15,Cancelled
0,2013,9,16,1,DL,15304,Tampa International,Tampa,FL,12478,John F. Kennedy International,New York,NY,1539,4,0.0,1824,13,0,0
1,2013,9,23,1,WN,14122,Pittsburgh International,Pittsburgh,PA,13232,Chicago Midway International,Chicago,IL,710,3,0.0,740,22,1,0
2,2013,9,7,6,AS,14747,Seattle/Tacoma International,Seattle,WA,11278,Ronald Reagan Washington National,Washington,DC,810,-3,0.0,1614,-7,0,0
3,2013,7,22,1,OO,13930,Chicago O'Hare International,Chicago,IL,11042,Cleveland-Hopkins International,Cleveland,OH,804,35,1.0,1027,33,1,0
4,2013,5,16,4,DL,13931,Norfolk International,Norfolk,VA,10397,Hartsfield-Jackson Atlanta International,Atlanta,GA,545,-1,0.0,728,-9,0,0


## Cleaning data with GitHub Copilot

GitHub Copilot can help you clean your data and answer questions about it. For example, you can ask Copilot to remove missing values, drop duplicates, normalize columns, or filter out outliers.
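
Two of the cleaning requests mentioned above, removing missing values and removing duplicates, might translate into pandas calls like the following. This is a minimal sketch on a made-up frame rather than the flights data; the `toy` frame and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the flights data (values are made up for illustration).
toy = pd.DataFrame({
    "Carrier": ["DL", "WN", "WN", "AS", "OO"],
    "DepDelay": [4, 3, 3, -3, 35],
    "DepDel15": [0.0, 0.0, 0.0, np.nan, 1.0],
})

# Drop rows with any missing value, then drop exact duplicate rows.
no_missing = toy.dropna()
deduplicated = no_missing.drop_duplicates()

print(len(toy), len(no_missing), len(deduplicated))
```

Dropping rows is the simplest option; whether it is appropriate depends on how many rows are affected and why the values are missing.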

In [2]:
from scipy.stats import zscore

# Select delay columns
delay_columns = ['DepDelay', 'ArrDelay']

# Calculate z-scores for delay columns
z_scores = df[delay_columns].apply(zscore)

# Keep only rows where every delay z-score is below 3 (drops extreme positive delays)
df_cleaned = df[(z_scores < 3).all(axis=1)]

# Display the cleaned dataframe
df_cleaned.head()

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,Carrier,OriginAirportID,OriginAirportName,OriginCity,OriginState,DestAirportID,DestAirportName,DestCity,DestState,CRSDepTime,DepDelay,DepDel15,CRSArrTime,ArrDelay,ArrDel15,Cancelled
0,2013,9,16,1,DL,15304,Tampa International,Tampa,FL,12478,John F. Kennedy International,New York,NY,1539,4,0.0,1824,13,0,0
1,2013,9,23,1,WN,14122,Pittsburgh International,Pittsburgh,PA,13232,Chicago Midway International,Chicago,IL,710,3,0.0,740,22,1,0
2,2013,9,7,6,AS,14747,Seattle/Tacoma International,Seattle,WA,11278,Ronald Reagan Washington National,Washington,DC,810,-3,0.0,1614,-7,0,0
3,2013,7,22,1,OO,13930,Chicago O'Hare International,Chicago,IL,11042,Cleveland-Hopkins International,Cleveland,OH,804,35,1.0,1027,33,1,0
4,2013,5,16,4,DL,13931,Norfolk International,Norfolk,VA,10397,Hartsfield-Jackson Atlanta International,Atlanta,GA,545,-1,0.0,728,-9,0,0
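
The filter above compares the signed z-score with 3, so it removes unusually long delays but keeps unusually early flights. If both tails should count as outliers, one option is to compare the absolute z-score instead. The sketch below uses made-up delays and computes z-scores with pandas using `ddof=0`, which matches the default of `scipy.stats.zscore`; the `toy` frame is an assumption, not the flights data.

```python
import pandas as pd

# Toy delays (made up): 30 ordinary flights plus one extreme early flight.
toy = pd.DataFrame({
    "DepDelay": [-2, 0, 3, 5, 1] * 6 + [-120],
    "ArrDelay": [-5, 2, 4, 8, 0] * 6 + [-130],
})

# Population z-scores (ddof=0), the convention scipy.stats.zscore uses by default.
z_scores = (toy - toy.mean()) / toy.std(ddof=0)

# One-sided filter: only large positive z-scores are removed.
one_sided = toy[(z_scores < 3).all(axis=1)]

# Two-sided filter: large z-scores in either direction are removed.
two_sided = toy[(z_scores.abs() < 3).all(axis=1)]

print(len(one_sided), len(two_sided))
```

Here the one-sided filter keeps all 31 rows, while the two-sided filter drops the extreme early flight and keeps 30.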


## Create a model with GitHub Copilot

You can also ask GitHub Copilot to create a model, for example one that predicts whether a flight will arrive at least 15 minutes late (the ArrDel15 column).

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Select features and target variable
X = df_cleaned[['DayOfWeek', 'DestAirportID']]
y = df_cleaned['ArrDel15']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

Model Accuracy: 0.80
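
An accuracy figure is easier to interpret next to a baseline, such as always predicting the most common class. The sketch below computes that baseline for a made-up label series; the values in `y_test` here are an assumption for illustration and say nothing about the real class balance in the flights data.

```python
import pandas as pd

# Toy labels (made up): 8 on-time flights (0) and 2 delayed flights (1).
y_test = pd.Series([0, 0, 0, 0, 1, 0, 0, 1, 0, 0], name="ArrDel15")

# A "model" that always predicts the most common class.
majority_class = y_test.mode()[0]
baseline_accuracy = (y_test == majority_class).mean()

print(f"Majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.2f}")
```

If the real test set is similarly imbalanced, an accuracy close to the baseline would suggest the model adds little beyond predicting the majority class, so it is worth running this check on the actual `y_test`.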


## Code with your voice with GitHub Copilot

With Copilot Chat, you can start a voice chat and ask for help with your dataset.


In [4]:
# Assuming the DestAirportID for Los Angeles is known, let's say it's 12892
los_angeles_id = 12892
# DayOfWeek appears to be coded 1 = Monday through 7 = Sunday, so Wednesday is 3
wednesday = 3

# Create a DataFrame for the input
input_data = pd.DataFrame({'DayOfWeek': [wednesday], 'DestAirportID': [los_angeles_id]})

# Predict the probability of delay
probability_of_delay = model.predict_proba(input_data)[0][1]

# Calculate the odds
odds_of_delay = probability_of_delay / (1 - probability_of_delay)

print(f"Odds of a flight being delayed to Los Angeles on a Wednesday: {odds_of_delay:.2f}")

Odds of a flight being delayed to Los Angeles on a Wednesday: 0.24
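
Odds and probability are easy to confuse, so it can help to convert the printed value back. The helper names below are hypothetical, and the input is the rounded figure from the output above, so the result is approximate.

```python
def probability_to_odds(p: float) -> float:
    """Convert a probability in [0, 1) to odds."""
    return p / (1 - p)


def odds_to_probability(odds: float) -> float:
    """Convert non-negative odds back to a probability."""
    return odds / (1 + odds)


reported_odds = 0.24  # rounded value printed by the cell above
probability = odds_to_probability(reported_odds)

print(f"{probability:.3f}")
```

Odds of 0.24 correspond to a probability of roughly 0.19, or about one delayed flight for every four on-time flights.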


## Export origin airport names and IDs

In [6]:
# Get the unique origin airport name/ID pairs and export them to CSV (saved under data/).
# Selecting both columns before drop_duplicates keeps each name aligned with its ID.

unique_origin_airports_df = df[['OriginAirportName', 'OriginAirportID']].drop_duplicates()
unique_origin_airports_df.to_csv('data/unique_origin_airports.csv', index=False)

## Export the model to use in an app


In [7]:
# Export the model so it can be loaded later from Flask.
# The file is written to model.pkl, relative to the working directory.

import joblib

joblib.dump(model, 'model.pkl')

['model.pkl']
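
To use the exported file from another process, such as the Flask app mentioned in the comment, the model can be read back with `joblib.load`. The sketch below is a round trip under stated assumptions: it trains a stand-in model on made-up rows and writes to a temporary directory instead of `model.pkl`, so it does not touch the notebook's files.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the notebook's training data (values are made up).
X = pd.DataFrame({
    "DayOfWeek": [1, 2, 3, 4, 5, 6, 7, 1],
    "DestAirportID": [12478, 13232, 11278, 11042, 10397, 12892, 12478, 12892],
})
y = pd.Series([0, 1, 0, 1, 0, 0, 1, 0], name="ArrDel15")

model = LogisticRegression(max_iter=1000).fit(X, y)

# Save the model, then load it back as a separate object.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = str(Path(tmp_dir) / "model.pkl")
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should give the same predictions as the original.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

Files written by joblib are pickle-based, so they should only be loaded from trusted sources, and it is safest to load them with the same scikit-learn version that saved them.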