# Third Assignment

For this assignment you will have to identify a suitable dataset and implement a machine learning pipeline, comprising the following steps:

* Presentation of the problem (6 points)
* data exploration (4 points)
* data preparation (10 points)
* model selection: try 2 models (max 3) (20 points)
* evaluation on test set (3 points)
* conclusion commenting on the results and comparing the models (7 points)

**Total: 50 points**

Some possible datasets:
* [Titanic dataset](https://www.kaggle.com/c/titanic): binary classification
* [Wine dataset](https://www.kaggle.com/datasets/yasserh/wine-quality-dataset): multi-class classification/regression
* [Abalone dataset](https://www.kaggle.com/datasets/rodolfomendes/abalone-dataset): regression
* [California housing price](https://www.kaggle.com/datasets/camnugent/california-housing-prices): regression
* [CIFAR-10](http://www.cs.toronto.edu/~kriz/cifar.html): multi-class classification in images

Look for more on [Kaggle](https://www.kaggle.com/datasets) or elsewhere.

**Deadline:** To be confirmed by the tutor - please see the date on our Canvas page.

**Submission:** Please email your solutions and your completed Declaration of Authorship (DoA) form to weeklyclasses@conted.ox.ac.uk 

# Presentation of the problem (6 points)
I am going to build on the NYC flights data that I briefly explored in "Python for Data Science Part 1".

## Introduction to the dataset

The New York City (NYC) flight data can be imported as the `flights` table from the `nycflights13` Python data package. It contains data on flights that departed from three NYC airports (JFK, LGA or EWR) in 2013. The dataset records a lot of information about each flight, such as its departure time, its arrival time, and whether it was delayed.

## Introduction to the problem that I want to solve

I thought it would be useful to create a machine learning model that could predict which flights will be cancelled, given the departure delay, airline carrier, and travel distance. Many people don't know whether their flight will be cancelled, and this uncertainty can cause anxiety when travelling. A flight cancellation predictor might help travellers prepare for the possibility that their flight is cancelled.

I believe it would be interesting to see whether a model can give an approximate prediction of whether a flight will be cancelled given these factors. I am using these three variables (departure delay, airline carrier, and travel distance) because I believe they might play a crucial role in flight cancellations: for example, a flight with a huge departure delay would most probably be cancelled, some airline carriers might be known to cancel their flights more often, and flights with longer travel distances might be more prone to cancellations. I will explore these observations/correlations in the data exploration section.

# Data Exploration (4 points)

*   To begin exploring the data, I printed the first five rows of the dataset to get a better understanding of the information I am given. This is where I picked the input variables that the machine learning model will use to predict whether a flight is cancelled.
*   I then checked the unique values in the `year` column to make sure that the flights dataset only contains 2013 flight data. This helped me understand the data I am dealing with and will help me in my explanations.
*   Lastly, I created a summary statistics table to get a better idea of each variable. I noticed that departure delay and arrival delay take both positive and negative values, whereas I had assumed that these variables would only contain positive values. I realised that the negative numbers mean that the flight departed (or arrived) early. I believe that having positive and negative numbers to represent the departure delay will help the machine learning model in predicting whether a flight will be cancelled.

In [1]:
#installed the `nycflights13` python data package for nyc flight data
#pip install nycflights13
#pip install plotly

In [2]:
from nycflights13 import flights  # importing the flights table, which contains all the NYC flight data
import pandas as pd

pd.DataFrame(flights)[0:5] #viewing the data

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
0,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,1545,N14228,EWR,IAH,227.0,1400,5,15,2013-01-01T10:00:00Z
1,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,1714,N24211,LGA,IAH,227.0,1416,5,29,2013-01-01T10:00:00Z
2,2013,1,1,542.0,540,2.0,923.0,850,33.0,AA,1141,N619AA,JFK,MIA,160.0,1089,5,40,2013-01-01T10:00:00Z
3,2013,1,1,544.0,545,-1.0,1004.0,1022,-18.0,B6,725,N804JB,JFK,BQN,183.0,1576,5,45,2013-01-01T10:00:00Z
4,2013,1,1,554.0,600,-6.0,812.0,837,-25.0,DL,461,N668DN,LGA,ATL,116.0,762,6,0,2013-01-01T11:00:00Z


In [3]:
pd.DataFrame(flights)['year'].unique() #I am double-checking to see if all the flight data is from 2013 only

array([2013], dtype=int64)

In [4]:
pd.DataFrame(flights).describe() #summary statistics of the data

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,flight,air_time,distance,hour,minute
count,336776.0,336776.0,336776.0,328521.0,336776.0,328521.0,328063.0,336776.0,327346.0,336776.0,327346.0,336776.0,336776.0,336776.0
mean,2013.0,6.54851,15.710787,1349.109947,1344.25484,12.63907,1502.054999,1536.38022,6.895377,1971.92362,150.68646,1039.912604,13.180247,26.2301
std,0.0,3.414457,8.768607,488.281791,467.335756,40.210061,533.264132,497.457142,44.633292,1632.471938,93.688305,733.233033,4.661316,19.300846
min,2013.0,1.0,1.0,1.0,106.0,-43.0,1.0,1.0,-86.0,1.0,20.0,17.0,1.0,0.0
25%,2013.0,4.0,8.0,907.0,906.0,-5.0,1104.0,1124.0,-17.0,553.0,82.0,502.0,9.0,8.0
50%,2013.0,7.0,16.0,1401.0,1359.0,-2.0,1535.0,1556.0,-5.0,1496.0,129.0,872.0,13.0,29.0
75%,2013.0,10.0,23.0,1744.0,1729.0,11.0,1940.0,1945.0,14.0,3465.0,192.0,1389.0,17.0,44.0
max,2013.0,12.0,31.0,2400.0,2359.0,1301.0,2400.0,2359.0,1272.0,8500.0,695.0,4983.0,23.0,59.0
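
As a rough check of the sign convention in `dep_delay`, the sketch below recomputes the delay for the five rows printed by `In [2]` from their `dep_time` and `sched_dep_time` values. The helper `hhmm_to_minutes` is hypothetical (it is not part of `nycflights13`), and the calculation assumes that both times fall on the same calendar day.

```python
def hhmm_to_minutes(hhmm: int) -> int:
    """Convert a clock time stored as HHMM (e.g. 554 for 5:54) to minutes after midnight."""
    hours, minutes = divmod(hhmm, 100)
    return hours * 60 + minutes


# (dep_time, sched_dep_time) pairs copied from the first five rows shown above
rows = [(517, 515), (533, 529), (542, 540), (544, 545), (554, 600)]

# Assumption: no flight in this sample crosses midnight between schedule and departure
delays = [hhmm_to_minutes(dep) - hhmm_to_minutes(sched) for dep, sched in rows]
print(delays)  # [2, 4, 2, -1, -6]
```

The results match the `dep_delay` column of those rows: a flight that left before its scheduled time gets a negative value, and a flight that left late gets a positive one.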


# Data Preparation (10 points)

*   To prepare the data, I first need to create a new column that specifies whether a flight was cancelled. To do this, I first check whether there are any `NA` values in the arrival and departure delay columns, because a missing value would indicate that a flight was cancelled.

In [5]:
df = pd.DataFrame(flights)
print('The number of NA values in the departure delay columns are: ', df['dep_delay'].isna().sum())
print('The number of NA values in the arrival delay columns are: ', df['arr_delay'].isna().sum())

The number of NA values in the departure delay columns are:  8255
The number of NA values in the arrival delay columns are:  9430


*   I noticed that there are more NA values in the arrival delay column. I expected the NA counts for the departure and arrival delays to be the same, but after printing the rows with NA values in each column I can see that some flights have a recorded departure delay but no arrival delay. (Keep in mind that the arrival delay column ranges from negative to positive values, so a recorded value can represent a flight that arrived early, on time, or late; an on-time arrival is stored as 0, not as NA.) This probably means that these flights were cancelled for an unexpected reason.
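
The gap between the two NA counts can be quantified with a little arithmetic. This is only a sketch based on the numbers printed in this notebook, and it assumes that every flight with a missing departure delay is also missing an arrival delay, which the outputs suggest but do not prove.

```python
total_flights = 336_776     # `count` of the `year` column in the summary table
missing_dep_delay = 8_255   # NA values in dep_delay
missing_arr_delay = 9_430   # NA values in arr_delay

# Assumption: flights without a departure delay are a subset of those without an arrival delay
departed_without_arr_delay = missing_arr_delay - missing_dep_delay

share_missing_arr = missing_arr_delay / total_flights
share_without_departure = missing_dep_delay / missing_arr_delay

print(departed_without_arr_delay)          # 1175
print(f"{share_missing_arr:.2%}")          # 2.80%
print(f"{share_without_departure:.1%}")    # 87.5%
```

Under that assumption, about 2.8% of all flights have no arrival delay, and roughly seven in eight of those have no departure delay either.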

In [6]:
df[df['dep_delay'].isna()].head(3) #there are situations where both the arrival and departure delay columns have no value

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
838,2013,1,1,,1630,,,1815,,EV,4308,N18120,EWR,RDU,,416,16,30,2013-01-01T21:00:00Z
839,2013,1,1,,1935,,,2240,,AA,791,N3EHAA,LGA,DFW,,1389,19,35,2013-01-02T00:00:00Z
840,2013,1,1,,1500,,,1825,,AA,1925,N3EVAA,LGA,MIA,,1096,15,0,2013-01-01T20:00:00Z


In [7]:
df[df['arr_delay'].isna()].head(3) #there are situations where a departure delay is listed, but the arrival delay is NaN

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
471,2013,1,1,1525.0,1530,-5.0,1934.0,1805,,MQ,4525,N719MQ,LGA,XNA,,1147,15,30,2013-01-01T20:00:00Z
477,2013,1,1,1528.0,1459,29.0,2002.0,1647,,EV,3806,N17108,EWR,STL,,872,14,59,2013-01-01T19:00:00Z
615,2013,1,1,1740.0,1745,-5.0,2158.0,2020,,MQ,4413,N739MQ,LGA,XNA,,1147,17,45,2013-01-01T22:00:00Z


In [8]:
df[df['arr_delay'] == 0].head(3) #the arrival delay column contains on-time flight arrivals

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
35,2013,1,1,627.0,630,-3.0,1018.0,1018,0.0,US,27,N535UW,JFK,PHX,330.0,2153,6,30,2013-01-01T11:00:00Z
114,2013,1,1,807.0,810,-3.0,1043.0,1043,0.0,DL,269,N308DE,JFK,ATL,126.0,760,8,10,2013-01-01T13:00:00Z
217,2013,1,1,956.0,1000,-4.0,1241.0,1241,0.0,DL,1847,N956DL,LGA,ATL,129.0,762,10,0,2013-01-01T15:00:00Z


*   I am dropping the columns that I do not need. After looking at the dataset again, I have decided to also keep the `hour` column, since I believe that the scheduled hour of the day might also affect whether a flight is cancelled.
*   I then created the `cancelled` column, which is 1 if a flight was cancelled and 0 if it was not.

In [9]:
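df = df.loc[:, ['dep_delay', 'arr_delay', 'carrier', 'distance', 'hour']].copy()  # keep only the columns used below
df['cancelled'] = df['arr_delay'].isna().astype(int)  # 1 = no arrival delay recorded (treated as cancelled), 0 = arrived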

In [10]:
print(df[df['cancelled'] == 1].head(2)) #one represents cancelled flights, i.e. flights with no recorded arrival delay
print()
print(df[df['cancelled'] == 0].head(2)) #zero represents non-cancelled flights that arrived

     dep_delay  arr_delay carrier  distance  hour  cancelled
471       -5.0        NaN      MQ      1147    15          1
477       29.0        NaN      EV       872    14          1

   dep_delay  arr_delay carrier  distance  hour  cancelled
0        2.0       11.0      UA      1400     5          0
1        4.0       20.0      UA      1416     5          0


*   Now that I have created the `cancelled` column, I will create the training and test datasets.

In [11]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer

# selecting the input features and the target; .copy() avoids pandas' SettingWithCopyWarning below
features = df[['dep_delay', 'carrier', 'distance', 'hour']].copy()
target = df['cancelled']

# imputing missing values in the 'dep_delay' column with a constant sentinel value (-999)
imputer = SimpleImputer(strategy='constant', fill_value=-999)
features['departure_delay_imputed'] = imputer.fit_transform(features[['dep_delay']])
features = features.drop(columns=['dep_delay'])  # removing the dep_delay column


# label encoding for the `carrier` column
label_encoder = LabelEncoder()
features['airline_encoded'] = label_encoder.fit_transform(features['carrier'])  # converting the carrier strings to integer codes
features = features.drop(columns=['carrier'])  # removing the carrier column

#training and testing data split
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42, stratify = target)

#feature scaling (standardisation)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
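
The reason for calling `fit_transform` on the training data but only `transform` on the test data can be illustrated without scikit-learn. The sketch below is a simplified, single-column version of standardisation; the distance values are made up for illustration and are not taken from the dataset.

```python
from statistics import fmean, pstdev

# Made-up distances (miles), for illustration only
train_distance = [200.0, 760.0, 1400.0, 2153.0]
test_distance = [1089.0, 4983.0]

# "fit": learn the mean and the population standard deviation from the training data only
mu = fmean(train_distance)
sigma = pstdev(train_distance, mu)

# "transform": apply the same training statistics to both sets,
# so that nothing about the test data influences the preprocessing
train_scaled = [(x - mu) / sigma for x in train_distance]
test_scaled = [(x - mu) / sigma for x in test_distance]

print([round(x, 2) for x in train_scaled])
print([round(x, 2) for x in test_scaled])
```

After scaling, the training column has mean 0 and standard deviation 1 by construction, while the test column generally does not, because it is measured against the training statistics.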

In [12]:
print(X_train.shape, X_test.shape)

(269420, 4) (67356, 4)
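
The printed shapes can be reproduced by hand. The sketch below assumes that the test share is rounded up to a whole number of rows, which is consistent with the shapes above. It also uses the `support` values that appear in the classification reports later in this notebook to check that the stratified split kept the share of cancelled flights the same in both sets.

```python
import math

n_samples = 336_776   # rows in the flights table
test_size = 0.2

# Assumption: the number of test rows is rounded up, the rest go to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)  # 269420 67356

# Cancelled flights per split, taken from the `support` column of the classification reports
cancelled_train, cancelled_test = 7_544, 1_886
share_train = cancelled_train / n_train
share_test = cancelled_test / n_test
print(f"{share_train:.2%} {share_test:.2%}")  # 2.80% 2.80%
```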


# Model selection: try 2 models (max 3) (20 points)

*   I will be using three supervised classification models for the flight cancellation prediction: Random Forest, Logistic Regression, and a Support Vector Machine (SVM).
*   I will first fit the models on the training dataset and report the classification results on that same data. I will then compare these with the results on the test data in the next section.

### Random Forest Classification

In [13]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# Train the model
RF_model = RandomForestClassifier(random_state=42)
RF_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = RF_model.predict(X_train_scaled)

# Evaluate the model
print("Accuracy on training data:", accuracy_score(y_train, y_pred))

print("Confusion Matrix:")
print(confusion_matrix(y_train, y_pred))

print("\nClassification Report:")
print(classification_report(y_train, y_pred))

Accuracy on training data: 0.9977841288694232
Confusion Matrix:
[[261872      4]
 [   593   6951]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    261876
           1       1.00      0.92      0.96      7544

    accuracy                           1.00    269420
   macro avg       1.00      0.96      0.98    269420
weighted avg       1.00      1.00      1.00    269420
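
To make the link between the confusion matrix and the classification report explicit, the sketch below recomputes the metrics for the cancelled class from the four counts printed above. It assumes the layout `[[TN, FP], [FN, TP]]` (rows are the true classes, columns the predicted classes); `binary_metrics` is a hypothetical helper written for this illustration.

```python
def binary_metrics(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Precision, recall, F1 (for the positive class) and overall accuracy."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Random forest on the training data, counts copied from the confusion matrix above
rf_train = binary_metrics(tn=261_872, fp=4, fn=593, tp=6_951)
for name, value in rf_train.items():
    print(f"{name}: {value:.4f}")
```

Rounded to two decimals, the recall (0.92) and F1 score (0.96) agree with the row for class `1` in the report, and the accuracy agrees with the printed value.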



### Logistic Regression Classification

In [14]:
from sklearn.linear_model import LogisticRegression

LR_model = LogisticRegression(random_state=42)
LR_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = LR_model.predict(X_train_scaled)

# Evaluate the model
print("Accuracy:", accuracy_score(y_train, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_train, y_pred))
print("\nClassification Report:\n", classification_report(y_train, y_pred))

Accuracy: 0.9965370054190483

Confusion Matrix:
 [[261876      0]
 [   933   6611]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00    261876
           1       1.00      0.88      0.93      7544

    accuracy                           1.00    269420
   macro avg       1.00      0.94      0.97    269420
weighted avg       1.00      1.00      1.00    269420



### SVC Classification

In [15]:
from sklearn.svm import SVC

# Train an SVM model
SVC_model = SVC(kernel='linear', C=1.0, random_state=42)
SVC_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = SVC_model.predict(X_train_scaled)

# Evaluate the model
print("Accuracy:", accuracy_score(y_train, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_train, y_pred))
print("\nClassification Report:\n", classification_report(y_train, y_pred))

Accuracy: 0.9965370054190483

Confusion Matrix:
 [[261876      0]
 [   933   6611]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00    261876
           1       1.00      0.88      0.93      7544

    accuracy                           1.00    269420
   macro avg       1.00      0.94      0.97    269420
weighted avg       1.00      1.00      1.00    269420
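
Because only about 2.8% of the flights are cancelled, accuracy on its own is easy to misread. As a point of comparison, the sketch below computes the accuracy of a hypothetical baseline that always predicts "not cancelled", using the class counts from the `support` column above, and compares it with the errors of the logistic regression and SVC models on the training data.

```python
n_not_cancelled = 261_876   # support of class 0 in the training set
n_cancelled = 7_544         # support of class 1 in the training set
n_total = n_not_cancelled + n_cancelled

# Hypothetical baseline: predict "not cancelled" for every flight
baseline_accuracy = n_not_cancelled / n_total
baseline_errors = n_cancelled

# Logistic regression / SVC on the training data: 0 false positives, 933 false negatives
model_errors = 0 + 933
model_accuracy = (n_total - model_errors) / n_total

print(f"baseline accuracy: {baseline_accuracy:.4f}")   # 0.9720
print(f"model accuracy:    {model_accuracy:.4f}")      # 0.9965
print(f"errors removed:    {1 - model_errors / baseline_errors:.1%}")
```

Seen this way, the recall and F1 score of the cancelled class are more informative than the overall accuracy, since even the trivial baseline is about 97% accurate.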



# Evaluation on test set (3 points)

### Random Forest Classification on test set

In [16]:
# Make predictions on the test data
y_pred_test = RF_model.predict(X_test_scaled)

# Evaluate the model on the test data
print("Accuracy on test data:", accuracy_score(y_test, y_pred_test))
print("\nConfusion Matrix on test data:\n", confusion_matrix(y_test, y_pred_test))
print("\nClassification Report on test data:\n", classification_report(y_test, y_pred_test))

Accuracy on test data: 0.995828137062771

Confusion Matrix on test data:
 [[65428    42]
 [  239  1647]]

Classification Report on test data:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     65470
           1       0.98      0.87      0.92      1886

    accuracy                           1.00     67356
   macro avg       0.99      0.94      0.96     67356
weighted avg       1.00      1.00      1.00     67356



### Logistic Regression Classification on test set

In [17]:
# Make predictions
y_pred_test = LR_model.predict(X_test_scaled)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred_test))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred_test))
print("\nClassification Report:\n", classification_report(y_test, y_pred_test))

Accuracy: 0.9964071500682938

Confusion Matrix:
 [[65470     0]
 [  242  1644]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     65470
           1       1.00      0.87      0.93      1886

    accuracy                           1.00     67356
   macro avg       1.00      0.94      0.96     67356
weighted avg       1.00      1.00      1.00     67356



### SVC Classification on test set

In [18]:
# Make predictions
y_pred_test = SVC_model.predict(X_test_scaled)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred_test))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred_test))
print("\nClassification Report:\n", classification_report(y_test, y_pred_test))

Accuracy: 0.9964071500682938

Confusion Matrix:
 [[65470     0]
 [  242  1644]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     65470
           1       1.00      0.87      0.93      1886

    accuracy                           1.00     67356
   macro avg       1.00      0.94      0.96     67356
weighted avg       1.00      1.00      1.00     67356
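
One way to read the identical logistic regression and SVC results is to compare the number of cancelled flights they detect with the number of flights whose `dep_delay` was missing and was therefore imputed as -999. The sketch below only uses numbers printed earlier in this notebook. The interpretation in the final comment is an assumption, not something the outputs prove.

```python
missing_dep_delay = 8_255   # NA count in dep_delay for the whole dataset
tp_train = 6_611            # cancelled flights detected on the training set
tp_test = 1_644             # cancelled flights detected on the test set

detected = tp_train + tp_test
print(detected, detected == missing_dep_delay)  # 8255 True

# Possible reading (assumption): the two linear models flag the flights carrying
# the -999 sentinel and miss the cancelled flights that have a real departure delay.
missed = (933 + 242)        # false negatives on train + test
print(missed)               # 1175
```

The 1,175 missed flights equal the difference between the two NA counts (9,430 - 8,255), which is consistent with that reading.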



# Conclusion commenting on the results and comparing the models (7 points)

After looking at the F1 scores and confusion matrices of all the models, I have come to the conclusion that the logistic regression and SVM classifiers are slightly more accurate than the random forest classifier. I came to this conclusion after looking at the models' performance on the test dataset. Initially, I thought that the random forest classifier was the most accurate model, but that observation was based on the training data and turned out to be incorrect.

Random Forest Classification

*  On the training dataset, the random forest classifier was able to correctly classify 261,872 non-cancelled flights and 6,951 cancelled flights. The F1 score for non-cancelled flights was 1 and for cancelled flights it was 0.96
*  On the test dataset, the random forest classifier was able to correctly classify 65,428 non-cancelled flights and 1,647 cancelled flights. The F1 score for non-cancelled flights was 1 and for cancelled flights it was 0.92

Logistic Regression Classification

*  On the training dataset, the logistic regression classifier was able to correctly classify 261,876 non-cancelled flights and 6,611 cancelled flights. The F1 score for non-cancelled flights was 1 and for cancelled flights it was 0.93
*  On the test dataset, the logistic regression classifier was able to correctly classify 65,470 non-cancelled flights and 1,644 cancelled flights. The F1 score for non-cancelled flights was 1 and for cancelled flights it was 0.93

Support Vector Machines (SVM) Classification

*  On the training dataset, the SVM classifier was able to correctly classify 261,876 non-cancelled flights and 6,611 cancelled flights. The F1 score for non-cancelled flights was 1 and for cancelled flights it was 0.93
*  On the test dataset, the SVM classifier was able to correctly classify 65,470 non-cancelled flights and 1,644 cancelled flights. The F1 score for non-cancelled flights was 1 and for cancelled flights it was 0.93

As you can see, on the training data the F1 score for cancelled flights was higher for the random forest (0.96) than for the logistic regression and SVM classifiers (0.93), but this value decreased when the classifier was used on the test data. On the test data, the random forest classifier did correctly classify more cancelled flights than the logistic regression and SVM classifiers, but the difference is only three flights (1,647 versus 1,644). However, for non-cancelled flights the random forest classifier fell short: it misclassified 42 non-cancelled flights as cancelled, while the logistic regression and SVM classifiers classified all 65,470 non-cancelled flights correctly.

Comparing the F1 scores of all three classifiers on both the training and test data, I can say that all three models perform very similarly and are fairly accurate. If you only look at the training data results, you would assume that the random forest classifier is the best classifier to use, since it has F1 scores of 1 and 0.96 for the non-cancelled and cancelled flights respectively. But on the test data the random forest F1 score for cancelled flights drops to 0.92, which suggests that it might not be the best classifier to use, since the logistic regression and SVM classifiers have the highest F1 score of 0.93 for cancelled flights. This difference is not huge, but if someone wants a little more accuracy then I would go with the logistic regression or SVM classifier for this supervised classification task.
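
The comparison can be summarised with unrounded numbers. The sketch below recomputes the F1 score for cancelled flights from the confusion matrices reported above, along with the drop from training to test data and the total number of test errors. `f1_from_counts` is a hypothetical helper written for this illustration.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 score of the positive class, written directly in terms of counts."""
    return 2 * tp / (2 * tp + fp + fn)


# (TP, FP, FN) for the cancelled class, copied from the confusion matrices above
counts = {
    "Random Forest": {"train": (6_951, 4, 593), "test": (1_647, 42, 239)},
    "Logistic Regression / SVC": {"train": (6_611, 0, 933), "test": (1_644, 0, 242)},
}

summary = {}
for model, splits in counts.items():
    f1_train = f1_from_counts(*splits["train"])
    f1_test = f1_from_counts(*splits["test"])
    _, fp_test, fn_test = splits["test"]
    summary[model] = {
        "f1_train": f1_train,
        "f1_test": f1_test,
        "gap": f1_train - f1_test,
        "test_errors": fp_test + fn_test,
    }
    print(f"{model}: train F1={f1_train:.3f}, test F1={f1_test:.3f}, "
          f"gap={f1_train - f1_test:.3f}, test errors={fp_test + fn_test}")
```

The random forest loses noticeably more F1 between training and test data than the other two models, which is what one would expect if it fits the training data more closely than it generalises.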