<h2 style='text-align: center;'> Data Science Technology and Systems </h2>
<h3 style='text-align: center;'> Final Assignment: Predicting Airplane Delays </h3>
<h3 style='text-align: center;'> Part B – On Cloud </h3>
<h4 style='text-align: center;'> Pauline Armamento - u3246782 </h4>

# Problem: Predicting Airplane Delays

The goals of this notebook are:
- Process and create a dataset from downloaded ZIP files
- Exploratory data analysis (EDA)
- Establish a baseline model and improve it

## Introduction to business scenario
You work for a travel booking website that wants to improve the customer experience for delayed flights. The company wants to create a feature that tells customers, when they book a flight to or from the busiest US domestic airports, whether the flight will be delayed due to weather.

You are tasked with solving part of this problem by using machine learning to identify whether a flight will be delayed due to weather. You have been given access to a dataset of on-time performance of domestic flights operated by large air carriers. You can use this data to train a machine learning model to predict whether flights at the busiest airports will be delayed.

### Dataset
The provided dataset contains scheduled and actual departure and arrival times reported by certified US air carriers that account for at least 1 percent of domestic scheduled passenger revenues. The data was collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS). The dataset contains the date, time, origin, destination, airline, distance, and delay status of flights between 2014 and 2018.
The data are in 60 compressed files, each containing a CSV with the flight details for one month of the five years from 2014 to 2018. The data can be downloaded from this [link](https://ucstaff-my.sharepoint.com/:f:/g/personal/ibrahim_radwan_canberra_edu_au/Er0nVreXmihEmtMz5qC5kVIB81-ugSusExPYdcyQTglfLg?e=bNO312). Please download the data files and place them on a relative path. The dataset(s) used in this assignment were compiled by the Office of Airline Information, Bureau of Transportation Statistics (BTS), Airline On-Time Performance Data, available at the following [link](https://www.transtats.bts.gov/Fields.asp?gnoyr_VQ=FGJ).

# Step 1: Prepare the environment 

Use one of the Amazon SageMaker labs we have practised with and perform the following steps:
1. Start a lab.
2. Create a notebook instance and name it "oncloudproject".
3. Increase the storage volume size to 25 GB under the additional configuration options.
4. Open Jupyter Lab and upload this notebook into it.
5. Upload the two combined CSV files (combined_csv_v1.csv and combined_csv_v2.csv), which you created in Part A of this project.

In [None]:
import time
start = time.time()
import warnings, requests, zipfile, io
warnings.simplefilter('ignore')
import pandas as pd
import numpy as np
from scipy.io import arff


import os
import boto3
import sagemaker
import subprocess
from sagemaker import image_uris
from sagemaker.image_uris import retrieve
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve, auc, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt



In [None]:
# Load CSV files into dataframes
df1 = pd.read_csv('combined_csv_v1.csv')
df2 = pd.read_csv('combined_csv_v2.csv')

In [None]:
# Check for missing values in df1
print(df1.isnull().sum())

In [None]:
# Remove rows with missing values
df1.dropna(inplace=True)

In [None]:
df1.shape

# Step 2: Build and evaluate simple models

Write code to perform the following steps:
1. Split data into training, validation and testing sets (70% - 15% - 15%).
2. Use the linear learner estimator to build a classification model.
3. Host the model on another instance.
4. Perform batch transform to evaluate the model on testing data.
5. Report the performance metrics that you consider best for testing the model's performance.

Note: You are required to perform the above steps on the two combined datasets separately and to comment on the difference.

### 1. Split data into training, validation and testing sets (70% - 15% - 15%).

In [None]:
# Split the data into train (70%), and then split the remaining 30% into validation (15%) and test (15%)
class_column = 'target'  # Target variable for stratified splitting

# Split the data into training (70%) and temp (30%)
train_data, temp_data = train_test_split(df1, test_size=0.3, random_state=0, stratify=df1[class_column])

# Split the temp data into validation (15%) and testing (15%)
val_data, test_data = train_test_split(temp_data, test_size=0.5, random_state=0, stratify=temp_data[class_column])

# Print the shapes of the datasets
print("Training set shape:", train_data.shape)
print("Validation set shape:", val_data.shape)
print("Testing set shape:", test_data.shape)

# Save these splits to CSV files for uploading to S3
train_file = 'train.csv'
val_file = 'validation.csv'
test_file = 'test.csv'

train_data.to_csv(train_file, index=False, header=False)
val_data.to_csv(val_file, index=False, header=False)
test_data.to_csv(test_file, index=False, header=False)
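
The two-stage split above (30% held out, then halved) should yield roughly 70% / 15% / 15% of the rows. The sketch below only checks that arithmetic with integer math; the helper name `split_sizes` is hypothetical, and rounding the held-out part up is an assumption about how fractional sizes are resolved, so the real shapes printed by the cell above may differ by a row.

```python
def split_sizes(n_rows, temp_pct=30, test_pct_of_temp=50):
    """Return (train, validation, test) row counts for a two-stage split.

    Assumption: each held-out part is rounded up and the remainder is kept.
    """
    n_temp = (n_rows * temp_pct + 99) // 100            # ceiling of 30% of the rows
    n_train = n_rows - n_temp                           # remaining ~70%
    n_test = (n_temp * test_pct_of_temp + 99) // 100    # ceiling of half of the temp set
    n_val = n_temp - n_test
    return n_train, n_val, n_test

sizes = split_sizes(1000)
print("train, validation, test:", sizes)
```

Comparing these counts with the printed shapes is a quick sanity check that no rows were lost between the two calls to `train_test_split`.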


In [None]:
import boto3

# Check Existing Buckets
s3_client = boto3.client('s3')
response = s3_client.list_buckets()

print("Existing buckets:")
for existing_bucket in response['Buckets']:  # separate name so the `bucket` variable used for uploads is not overwritten
    print(f'  {existing_bucket["Name"]}')

## V1 - Linear Learner Estimator
This section presents the code implementation and execution of the Linear Learner Estimator Model, specifically tailored for the combined_csv_v1 dataset.

In [None]:
# Function to upload CSV files to S3
s3_resource = boto3.Session().resource('s3')

def upload_s3_csv(filename, folder, dataframe):
    csv_buffer = io.StringIO()
    dataframe.to_csv(csv_buffer, header=False, index=False)
    s3_resource.Bucket(bucket).Object(os.path.join(prefix, folder, filename)).put(Body=csv_buffer.getvalue())

# Define the S3 bucket and prefix
bucket='c135321a3429024l7975892t1w377404591872-labbucket-hgx0ormdptdb' # Change according to existing bucket
prefix = 'flight-delay-project-data1xgboost'

# Upload data to S3
upload_s3_csv('train.csv', 'train', train_data)
upload_s3_csv('validation.csv', 'validate', val_data)
upload_s3_csv('test.csv', 'test', test_data)
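
The upload helper builds object keys with `os.path.join`, which uses the separator of the operating system it runs on. On a SageMaker notebook instance that is a forward slash, so the keys come out as expected. As a minimal sketch, assuming you want keys that are independent of the platform, `posixpath.join` always uses `/`. The helper names `s3_key` and `s3_uri` and the bucket name `example-bucket` are hypothetical.

```python
import posixpath

def s3_key(prefix, folder, filename):
    # S3 keys always use forward slashes, regardless of the local OS
    return posixpath.join(prefix, folder, filename)

def s3_uri(bucket, key):
    # Full URI in the form expected by training and transform inputs
    return f"s3://{bucket}/{key}"

key = s3_key("flight-delay-project", "train", "train.csv")
uri = s3_uri("example-bucket", key)
print(key)
print(uri)
```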

### 2. Use linear learner estimator to build a classification model.

In [None]:
import sagemaker
from sagemaker.serializers import CSVSerializer
from sagemaker.amazon.amazon_estimator import RecordSet
import boto3

# Instantiate the LinearLearner estimator object with 1 ml.m4.xlarge
num_classes = len(pd.unique(train_data[class_column]))
classifier_estimator = sagemaker.LinearLearner(role=sagemaker.get_execution_role(),
                                              instance_count=1,
                                              instance_type='ml.m4.xlarge',
                                              predictor_type='binary_classifier',
                                              binary_classifier_model_selection_criteria = 'cross_entropy_loss')

# Create train, validate, and test records
train_records = classifier_estimator.record_set(train_data.values[:, 1:].astype(np.float32), train_data.values[:, 0].astype(np.float32), channel='train')
val_records = classifier_estimator.record_set(val_data.values[:, 1:].astype(np.float32), val_data.values[:, 0].astype(np.float32), channel='validation')
test_records = classifier_estimator.record_set(test_data.values[:, 1:].astype(np.float32), test_data.values[:, 0].astype(np.float32), channel='test')


# Fit the model
classifier_estimator.fit([train_records, val_records, test_records])

In [None]:
# Model Evaluation

sagemaker.analytics.TrainingJobAnalytics(classifier_estimator._current_job_name, 
                                         metric_names = ['test:objective_loss', 
                                                         'test:binary_f_beta',
                                                         'test:precision',
                                                         'test:recall']
                                        ).dataframe()

### 3. Host the model on another instance

In [None]:
# Hosting the model

import sagemaker

# classifier_estimator was defined and trained in the previous cells
predictor = classifier_estimator.deploy(
    initial_instance_count=1,  # Number of instances
    instance_type='ml.m4.xlarge', 
    endpoint_name='flight-delay-endpoint'
)

print("Model deployed. Endpoint name:", predictor.endpoint_name)


### 4. Perform batch transform to evaluate the model on testing data

In [None]:
# Batch Transformer

def batch_linear_predict(test_data, estimator):
    # Drop the label (first column) so that only the features are sent for inference
    batch_X = test_data.iloc[:, 1:]
    batch_X_file = 'batch-in.csv'
    upload_s3_csv(batch_X_file, 'batch-in', batch_X)

    batch_output = "s3://{}/{}/batch-out/".format(bucket,prefix)
    batch_input = "s3://{}/{}/batch-in/{}".format(bucket,prefix,batch_X_file)

    classifier_transformer = estimator.transformer(instance_count=1,
                                           instance_type='ml.m4.xlarge',
                                           strategy='MultiRecord',
                                           assemble_with='Line',
                                           output_path=batch_output)

    classifier_transformer.transform(data=batch_input,
                             data_type='S3Prefix',
                             content_type='text/csv',
                             split_type='Line')
    
    classifier_transformer.wait()

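    # Download the transform output (one JSON record per line) and read it into a dataframe
    s3 = boto3.client('s3')
    obj = s3.get_object(Bucket=bucket, Key="{}/batch-out/{}".format(prefix, 'batch-in.csv.out'))
    target_predicted_df = pd.read_json(io.BytesIO(obj['Body'].read()), orient="records", lines=True)
    # Return the true labels and the first output column (used downstream as the predicted label)
    return test_data.iloc[:, 0], target_predicted_df.iloc[:, 0]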

In [None]:
test_labels, target_predicted = batch_linear_predict(test_data, classifier_estimator)
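
The batch output is read as JSON Lines and only its first column is kept. As a minimal sketch, assuming each output line carries both a hard label and a probability, the snippet below parses such lines with the standard library and keeps both values. The field names `predicted_label` and `score` and the sample records are assumptions for illustration, not confirmed output from this job.

```python
import json

# Hypothetical sample of a batch transform output file (one JSON object per line)
sample_output = "\n".join([
    '{"predicted_label": 0, "score": 0.12}',
    '{"predicted_label": 1, "score": 0.81}',
    '{"predicted_label": 0, "score": 0.40}',
])

def parse_predictions(text):
    """Return (labels, scores) parsed from JSON Lines text."""
    labels, scores = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        labels.append(int(record["predicted_label"]))
        scores.append(float(record["score"]))
    return labels, scores

labels, scores = parse_predictions(sample_output)
print(labels)
print(scores)
```

Keeping the scores alongside the labels makes it possible to evaluate the model at thresholds other than the default one.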

### 5. Report the performance metrics that you consider best for testing the model's performance

In [None]:
# Model Performance Metrics 

from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(test_labels, target_predicted):
    matrix = confusion_matrix(test_labels, target_predicted)
    df_confusion = pd.DataFrame(matrix)
    colormap = sns.color_palette("BrBG", 10)
    sns.heatmap(df_confusion, annot=True, fmt='d', cbar=False, cmap=colormap)  # counts are integers
    plt.title("Confusion Matrix")
    plt.tight_layout()
    plt.ylabel("True Class")
    plt.xlabel("Predicted Class")
    plt.show()
    
plot_confusion_matrix(test_labels, target_predicted)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, roc_curve, auc, roc_auc_score

def plot_roc(test_labels, target_predicted):
    # Calculate confusion matrix components
    TN, FP, FN, TP = confusion_matrix(test_labels, target_predicted).ravel()
    
    # Calculate various metrics
    Sensitivity = float(TP) / (TP + FN) * 100 if (TP + FN) > 0 else 0
    Specificity = float(TN) / (TN + FP) * 100 if (TN + FP) > 0 else 0
    Precision = float(TP) / (TP + FP) * 100 if (TP + FP) > 0 else 0
    NPV = float(TN) / (TN + FN) * 100 if (TN + FN) > 0 else 0
    FPR = float(FP) / (FP + TN) * 100 if (FP + TN) > 0 else 0
    FNR = float(FN) / (TP + FN) * 100 if (TP + FN) > 0 else 0
    FDR = float(FP) / (TP + FP) * 100 if (TP + FP) > 0 else 0
    ACC = float(TP + TN) / (TP + FP + FN + TN) * 100 if (TP + FP + FN + TN) > 0 else 0

    # Print metrics
    print("Sensitivity or TPR: ", Sensitivity, "%") 
    print("Specificity or TNR: ", Specificity, "%") 
    print("Precision: ", Precision, "%") 
    print("Negative Predictive Value: ", NPV, "%") 
    print("False Positive Rate: ", FPR, "%")
    print("False Negative Rate: ", FNR, "%") 
    print("False Discovery Rate: ", FDR, "%")
    print("Accuracy: ", ACC, "%") 

    # Calculate AUC on the test set (from hard labels, so it reflects a single operating point)
    auc_value = roc_auc_score(test_labels, target_predicted)
    print("Test AUC:", auc_value)

    # Calculate ROC curve
    fpr, tpr, thresholds = roc_curve(test_labels, target_predicted)
    roc_auc = auc(fpr, tpr)

    # Plot ROC curve
    plt.figure()
    plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")

    # Create a secondary axis for the thresholds
    ax2 = plt.gca().twinx()
    ax2.plot(fpr, thresholds, markeredgecolor='r', linestyle='dashed', color='r')
    ax2.set_ylabel('Threshold', color='r')
    plt.show()




In [None]:
plot_roc(test_labels, target_predicted)
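
`plot_roc` receives hard 0/1 predictions, so the AUC it prints describes a single operating point, not the full ranking quality of the model. The dependency-free sketch below illustrates the difference on a small made-up example. The helper `pairwise_auc` is hypothetical and computes AUC as the fraction of (positive, negative) pairs ranked correctly, with ties counted as one half.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and probabilities, for illustration only
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.10, 0.20, 0.30, 0.45, 0.25, 0.35, 0.40, 0.90]
hard_labels = [1 if s > 0.5 else 0 for s in scores]

auc_from_scores = pairwise_auc(y_true, scores)       # 0.75
auc_from_labels = pairwise_auc(y_true, hard_labels)  # 0.625
print("AUC from scores:", auc_from_scores)
print("AUC from hard labels:", auc_from_labels)
```

When a model predicts the positive class very rarely, the label-based value moves towards 0.5 even if the underlying scores rank flights reasonably well.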

## V2 - Linear Learner Estimator
This section presents the code implementation and execution of the Linear Learner Estimator Model, specifically tailored for the combined_csv_v2 dataset.

We repeat the previous steps, this time applying them to df2, which contains the combined_csv_v2 dataset.

### 1. Split data into training, validation and testing sets (70% - 15% - 15%).

In [None]:
# Split the data into train (70%), and then split the remaining 30% into validation (15%) and test (15%)
class_column = 'target'  # Target variable for stratified splitting

# Split the data into training (70%) and temp (30%)
train_data, temp_data = train_test_split(df2, test_size=0.3, random_state=0, stratify=df2[class_column])

# Split the temp data into validation (15%) and testing (15%)
val_data, test_data = train_test_split(temp_data, test_size=0.5, random_state=0, stratify=temp_data[class_column])

# Print the shapes of the datasets
print("Training set shape:", train_data.shape)
print("Validation set shape:", val_data.shape)
print("Testing set shape:", test_data.shape)

# Save these splits to CSV files for uploading to S3
train_file = 'train.csv'
val_file = 'validation.csv'
test_file = 'test.csv'

train_data.to_csv(train_file, index=False, header=False)
val_data.to_csv(val_file, index=False, header=False)
test_data.to_csv(test_file, index=False, header=False)


In [None]:
import boto3

s3_client = boto3.client('s3')
response = s3_client.list_buckets()

print("Existing buckets:")
for existing_bucket in response['Buckets']:  # separate name so the `bucket` variable used for uploads is not overwritten
    print(f'  {existing_bucket["Name"]}')


### 2. Use linear learner estimator to build a classification model.

In [None]:
# Function to upload CSV files to S3
s3_resource = boto3.Session().resource('s3')

def upload_s3_csv(filename, folder, dataframe):
    csv_buffer = io.StringIO()
    dataframe.to_csv(csv_buffer, header=False, index=False)
    s3_resource.Bucket(bucket).Object(os.path.join(prefix, folder, filename)).put(Body=csv_buffer.getvalue())

# Define the S3 bucket and prefix
bucket='c135321a3429026l8211909t1w07796198217-flightbucket-wiycexlqxuki' #change accordingly
prefix = 'flight-delay-project'

# Upload data to S3
upload_s3_csv('train.csv', 'train', train_data)
upload_s3_csv('validation.csv', 'validate', val_data)
upload_s3_csv('test.csv', 'test', test_data)

In [None]:
import sagemaker
from sagemaker.serializers import CSVSerializer
from sagemaker.amazon.amazon_estimator import RecordSet
import boto3

# Instantiate the LinearLearner estimator object with 1 ml.m4.xlarge
num_classes = len(pd.unique(train_data[class_column]))
classifier_estimator = sagemaker.LinearLearner(role=sagemaker.get_execution_role(),
                                              instance_count=1,
                                              instance_type='ml.m4.xlarge',
                                              predictor_type='binary_classifier',
                                              binary_classifier_model_selection_criteria = 'cross_entropy_loss')

In [None]:
# Create train, validate, and test records
train_records = classifier_estimator.record_set(train_data.values[:, 1:].astype(np.float32), train_data.values[:, 0].astype(np.float32), channel='train')
val_records = classifier_estimator.record_set(val_data.values[:, 1:].astype(np.float32), val_data.values[:, 0].astype(np.float32), channel='validation')
test_records = classifier_estimator.record_set(test_data.values[:, 1:].astype(np.float32), test_data.values[:, 0].astype(np.float32), channel='test')


In [None]:
# Fit the model
classifier_estimator.fit([train_records, val_records, test_records])

In [None]:
# Model Evaluation

sagemaker.analytics.TrainingJobAnalytics(classifier_estimator._current_job_name, 
                                         metric_names = ['test:objective_loss', 
                                                         'test:binary_f_beta',
                                                         'test:precision',
                                                         'test:recall']
                                        ).dataframe()

### 3. Host the model on another instance

In [None]:
# Hosting the model

import sagemaker

predictor = classifier_estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.m4.xlarge',
    endpoint_name='flight-delay-endpoint-v2'  
)

print("Model deployed. Endpoint name:", predictor.endpoint_name)

### 4. Perform batch transform to evaluate the model on testing data

In [None]:
# Batch Transformer

def batch_linear_predict(test_data, estimator):
    # Drop the label (first column) so that only the features are sent for inference
    batch_X = test_data.iloc[:, 1:]
    batch_X_file = 'batch-in.csv'
    upload_s3_csv(batch_X_file, 'batch-in', batch_X)

    batch_output = "s3://{}/{}/batch-out/".format(bucket,prefix)
    batch_input = "s3://{}/{}/batch-in/{}".format(bucket,prefix,batch_X_file)

    classifier_transformer = estimator.transformer(instance_count=1,
                                           instance_type='ml.m4.xlarge',
                                           strategy='MultiRecord',
                                           assemble_with='Line',
                                           output_path=batch_output)

    classifier_transformer.transform(data=batch_input,
                             data_type='S3Prefix',
                             content_type='text/csv',
                             split_type='Line')
    
    classifier_transformer.wait()

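    # Download the transform output (one JSON record per line) and read it into a dataframe
    s3 = boto3.client('s3')
    obj = s3.get_object(Bucket=bucket, Key="{}/batch-out/{}".format(prefix, 'batch-in.csv.out'))
    target_predicted_df = pd.read_json(io.BytesIO(obj['Body'].read()), orient="records", lines=True)
    # Return the true labels and the first output column (used downstream as the predicted label)
    return test_data.iloc[:, 0], target_predicted_df.iloc[:, 0]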

In [None]:
test_labels, target_predicted = batch_linear_predict(test_data, classifier_estimator)

### 5. Report the performance metrics that you consider best for testing the model's performance

In [None]:
# Model Performance Metrics 

from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(test_labels, target_predicted):
    matrix = confusion_matrix(test_labels, target_predicted)
    df_confusion = pd.DataFrame(matrix)
    colormap = sns.color_palette("BrBG", 10)
    sns.heatmap(df_confusion, annot=True, fmt='d', cbar=False, cmap=colormap)  # counts are integers
    plt.title("Confusion Matrix")
    plt.tight_layout()
    plt.ylabel("True Class")
    plt.xlabel("Predicted Class")
    plt.show()
    

In [None]:
plot_confusion_matrix(test_labels, target_predicted)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, roc_curve, auc, roc_auc_score

def plot_roc(test_labels, target_predicted):
    # Calculate confusion matrix components
    TN, FP, FN, TP = confusion_matrix(test_labels, target_predicted).ravel()
    
    # Calculate various metrics
    Sensitivity = float(TP) / (TP + FN) * 100 if (TP + FN) > 0 else 0
    Specificity = float(TN) / (TN + FP) * 100 if (TN + FP) > 0 else 0
    Precision = float(TP) / (TP + FP) * 100 if (TP + FP) > 0 else 0
    NPV = float(TN) / (TN + FN) * 100 if (TN + FN) > 0 else 0
    FPR = float(FP) / (FP + TN) * 100 if (FP + TN) > 0 else 0
    FNR = float(FN) / (TP + FN) * 100 if (TP + FN) > 0 else 0
    FDR = float(FP) / (TP + FP) * 100 if (TP + FP) > 0 else 0
    ACC = float(TP + TN) / (TP + FP + FN + TN) * 100 if (TP + FP + FN + TN) > 0 else 0

    # Print metrics
    print("Sensitivity or TPR: ", Sensitivity, "%") 
    print("Specificity or TNR: ", Specificity, "%") 
    print("Precision: ", Precision, "%") 
    print("Negative Predictive Value: ", NPV, "%") 
    print("False Positive Rate: ", FPR, "%")
    print("False Negative Rate: ", FNR, "%") 
    print("False Discovery Rate: ", FDR, "%")
    print("Accuracy: ", ACC, "%") 

    # Calculate AUC on the test set (from hard labels, so it reflects a single operating point)
    auc_value = roc_auc_score(test_labels, target_predicted)
    print("Test AUC:", auc_value)

    # Calculate ROC curve
    fpr, tpr, thresholds = roc_curve(test_labels, target_predicted)
    roc_auc = auc(fpr, tpr)

    # Plot ROC curve
    plt.figure()
    plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")

    # Create a secondary axis for the thresholds
    ax2 = plt.gca().twinx()
    ax2.plot(fpr, thresholds, markeredgecolor='r', linestyle='dashed', color='r')
    ax2.set_ylabel('Threshold', color='r')
    plt.show()




In [None]:
plot_roc(test_labels, target_predicted)

## Linear Estimator Model 1

Linear estimator model 1 struggles to predict delays. It correctly classifies 193,742 flights as "No Delay" (true negatives), but it also labels 51,375 actual delays as "No Delay" (false negatives). This large number of missed delays undermines the model's usefulness for predicting delays, despite its strong performance on "No Delay" cases.

The metrics confirm this. Specificity is 99.95%, but sensitivity is only 0.24%, so the model misses almost every actual delay. Precision is 56.31%, meaning that roughly 44% of its few "Delay" predictions are wrong, and the negative predictive value is 79.04%. Overall, an accuracy of 79.02% and a test-set AUC of 0.5009, which is barely above chance, show that the model cannot reliably predict delays.


## Linear Estimator Model 2

Linear estimator model 2 improves on model 1 but still struggles to predict delays. It correctly classifies 191,858 flights as "No Delay" (true negatives), but it labels 48,910 actual delays as "No Delay" (false negatives). Its true positive count (2,590 correct "Delay" predictions) is higher than model 1's, yet the large number of false negatives continues to hinder its overall performance in predicting delays.

Sensitivity increases to 5.03%, so the model correctly identifies more actual delays than model 1. However, the false negative rate is still 94.97%, which means it misses the large majority of true delays. Specificity remains high at 98.98%, while the precision (56.66%) and negative predictive value (79.69%) show that both its positive and negative predictions are still frequently wrong. Overall, the accuracy of 79.26% and test-set AUC of 0.5200 indicate a modest improvement over model 1 but still limited predictive power for delay events.
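
As a quick arithmetic check, the sketch below recomputes three of the percentages quoted for model 2 from the counts reported above (true positives, false negatives, and true negatives). The false positive count is not stated in the text, so metrics that depend on it (specificity, precision, accuracy) are left out.

```python
# Counts reported above for linear estimator model 2
tp = 2_590     # delays correctly predicted as "Delay"
fn = 48_910    # delays incorrectly predicted as "No Delay"
tn = 191_858   # non-delays correctly predicted as "No Delay"

sensitivity = 100 * tp / (tp + fn)   # share of actual delays that were caught
fnr = 100 * fn / (tp + fn)           # share of actual delays that were missed
npv = 100 * tn / (tn + fn)           # share of "No Delay" predictions that were right

print(f"Sensitivity: {sensitivity:.2f}%")
print(f"False negative rate: {fnr:.2f}%")
print(f"Negative predictive value: {npv:.2f}%")
```

The results agree with the 5.03%, 94.97%, and 79.69% quoted in the text.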

## Linear Estimator Model Comparison

Model 1 has very high specificity (99.95%) but a sensitivity of only 0.24%, so it misses almost all actual delay events.

Model 2 raises sensitivity to 5.03% and reduces the number of false negatives from 51,375 to 48,910, at the cost of a slightly lower specificity (98.98%). Accuracy (79.26% versus 79.02%) and AUC (0.5200 versus 0.5009) improve only marginally.

Model 2 is therefore a step forward in identifying delay instances, but with a false negative rate of 94.97% it still misses the large majority of true delays.

## Linear Estimator Model Conclusion

Both linear estimator models have high false negative rates and therefore predict delays poorly. Model 2 improves on Model 1, and its higher sensitivity shows a better ability to identify true delay instances, but both models need further refinement and optimization before they can predict delays reliably. Future work should explore more advanced machine learning techniques, additional relevant features, and hyperparameter tuning to improve predictive performance.

# Step 3: Build and evaluate ensemble models

Write code to perform the following steps:
1. Split data into training, validation and testing sets (70% - 15% - 15%).
2. Use the xgboost estimator to build a classification model.
3. Host the model on another instance.
4. Perform batch transform to evaluate the model on testing data.
5. Report the performance metrics that you consider best for testing the model's performance.
6. Write down your observations on the difference in performance between the simple and ensemble models.

Note: You are required to perform the above steps on the two combined datasets separately.

## V1 and V2 - XGBoost
The following sections outline the implementation and execution of the XGBoost Estimator Model for the datasets combined_csv_v1 (assigned to df1) and combined_csv_v2 (assigned to df2). The cells are written for df1; to run them on combined_csv_v2, change df1 to df2 in the split cell.

In [12]:
# Load CSV files into dataframes
df1 = pd.read_csv('combined_csv_v1.csv')
df2 = pd.read_csv('combined_csv_v2.csv')

In [None]:
# Check for missing values in df1
print(df1.isnull().sum())

In [None]:
# Remove rows with missing values
df1.dropna(inplace=True)

### 1. Split data into training, validation and testing sets (70% - 15% - 15%).

In [None]:
# Split the data into train (70%), and then split the remaining 30% into validation (15%) and test (15%)
class_column = 'target'  # Target variable for stratified splitting

# Split the data into training (70%) and temp (30%)
train_data, temp_data = train_test_split(df1, test_size=0.3, random_state=0, stratify=df1[class_column]) #change to df2 for combined_csv_v2 

# Split the temp data into validation (15%) and testing (15%)
val_data, test_data = train_test_split(temp_data, test_size=0.5, random_state=0, stratify=temp_data[class_column])

# Print the shapes of the datasets
print("Training set shape:", train_data.shape)
print("Validation set shape:", val_data.shape)
print("Testing set shape:", test_data.shape)

# Save these splits to CSV files for uploading to S3
train_file = 'train.csv'
val_file = 'validation.csv'
test_file = 'test.csv'

train_data.to_csv(train_file, index=False, header=False)
val_data.to_csv(val_file, index=False, header=False)
test_data.to_csv(test_file, index=False, header=False)


In [None]:
import boto3

# Check existing buckets
s3_client = boto3.client('s3')
response = s3_client.list_buckets()

print("Existing buckets:")
for existing_bucket in response['Buckets']:  # separate name so the `bucket` variable used for uploads is not overwritten
    print(f'  {existing_bucket["Name"]}')


In [None]:
# Function to upload CSV files to S3
s3_resource = boto3.Session().resource('s3')

def upload_s3_csv(filename, folder, dataframe):
    csv_buffer = io.StringIO()
    dataframe.to_csv(csv_buffer, header=False, index=False)
    s3_resource.Bucket(bucket).Object(os.path.join(prefix, folder, filename)).put(Body=csv_buffer.getvalue())

# Define the S3 bucket and prefix
bucket='c135321a3429022l7975879t1w796437860557-labbucket-zqdijia3ifya'
prefix = 'flight-delay-project-data1xgboost'

# Upload data to S3
upload_s3_csv('train.csv', 'train', train_data)
upload_s3_csv('validation.csv', 'validate', val_data)
upload_s3_csv('test.csv', 'test', test_data)

### 2. Use xgboost estimator to build a classification model.

In [None]:
# Retrieve the XGBoost container
container = sagemaker.image_uris.retrieve('xgboost', boto3.Session().region_name, '1.0-1')


In [None]:
# Set hyperparameters for the XGBoost model
hyperparams = {
    "num_round": "42",
    "eval_metric": "auc",
    "objective": "binary:logistic"
}


In [None]:
# Define S3 output location
s3_output_location = f"s3://{bucket}/{prefix}/output/"


In [None]:
# Create the XGBoost model estimator
xgb_model = sagemaker.estimator.Estimator(
    container,
    sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.m4.xlarge',
    output_path=s3_output_location,
    hyperparameters=hyperparams,
    sagemaker_session=sagemaker.Session()
)

In [None]:
# Set up the training input channels
train_channel = sagemaker.inputs.TrainingInput(
    f"s3://{bucket}/{prefix}/train/",
    content_type='text/csv'
)

validate_channel = sagemaker.inputs.TrainingInput(
    f"s3://{bucket}/{prefix}/validate/",
    content_type='text/csv'
)

# Combine channels into a dictionary
data_channels = {'train': train_channel, 'validation': validate_channel}


In [None]:
# Fit the model
xgb_model.fit(inputs=data_channels, logs=False)

print('Model training complete and ready for hosting!')

### 3. Host the model on another instance

In [None]:
# Hosting the model

xgb_predictor = xgb_model.deploy(initial_instance_count=1,
                serializer = sagemaker.serializers.CSVSerializer(),
                instance_type='ml.m4.xlarge')

### 4. Perform batch transform to evaluate the model on testing data

In [None]:
# Batch Transform

batch_X = test_data.iloc[:, 1:]  # features only; the label is in the first column
batch_X.head()

In [None]:
batch_X_file='batch-in.csv'
upload_s3_csv(batch_X_file, 'batch-in', batch_X)

In [None]:
batch_output = "s3://{}/{}/batch-out/".format(bucket,prefix)
batch_input = "s3://{}/{}/batch-in/{}".format(bucket,prefix,batch_X_file)

xgb_transformer = xgb_model.transformer(instance_count=1,
                                       instance_type='ml.m4.xlarge',
                                       strategy='MultiRecord',
                                       assemble_with='Line',
                                       output_path=batch_output)

xgb_transformer.transform(data=batch_input,
                         data_type='S3Prefix',
                         content_type='text/csv',
                         split_type='Line')
xgb_transformer.wait()

In [None]:
s3 = boto3.client('s3')
obj = s3.get_object(Bucket=bucket, Key="{}/batch-out/{}".format(prefix,'batch-in.csv.out'))
target_predicted = pd.read_csv(io.BytesIO(obj['Body'].read()),sep=',',names=['class'])
target_predicted.head(5)

In [None]:
def binary_convert(x, threshold=0.65):
    # Map a predicted probability to a class label: 1 (delay) above the threshold, otherwise 0
    return 1 if x > threshold else 0

target_predicted['binary'] = target_predicted['class'].apply(binary_convert)

print(target_predicted.head(10))
test_data.head(10)
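
The threshold controls how many flights are labelled as delayed: raising it from the common default of 0.50 to the 0.65 used above turns fewer probabilities into positive predictions. The sketch below shows the effect on a handful of made-up probabilities. The helper `to_binary` and the values in `probs` are hypothetical.

```python
def to_binary(probabilities, threshold):
    # 1 (delay) when the probability exceeds the threshold, otherwise 0
    return [1 if p > threshold else 0 for p in probabilities]

# Made-up predicted probabilities, for illustration only
probs = [0.05, 0.30, 0.55, 0.60, 0.70, 0.92]

for threshold in (0.50, 0.65):
    labels = to_binary(probs, threshold)
    print(f"threshold={threshold:.2f} labels={labels} predicted delays={sum(labels)}")
```

A higher threshold trades recall on the "Delay" class for fewer false alarms, so it is worth comparing the confusion matrix at more than one value.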

In [None]:
test_labels = test_data.iloc[:,0]
test_labels.head()

### 5. Report the performance metrics that you consider best for testing the model's performance

In [None]:
from sklearn.metrics import confusion_matrix
import pandas as pd


matrix = confusion_matrix(test_labels, target_predicted['binary'])
matrix

In [None]:
df_confusion = pd.DataFrame(matrix, 
                             index=['Actual No Delay', 'Actual Delay'], 
                             columns=['Predicted No Delay', 'Predicted Delay'])
print(df_confusion)


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8, 6))
sns.heatmap(df_confusion, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
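
The discussion of the models relies on per-class precision, recall, and F1-score, which can be read off the same labels and predictions used for the confusion matrix. The sketch below uses scikit-learn on toy labels (illustrative values only, not the notebook's test data); in the notebook, `test_labels` and `target_predicted['binary']` would presumably take the place of `y_true` and `y_pred`.

```python
from sklearn.metrics import classification_report, precision_recall_fscore_support

# Toy labels for illustration only: 0 = No Delay, 1 = Delay.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 0, 0, 0, 1]

# Per-class arrays, ordered as given in `labels`.
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1], zero_division=0
)

# Human-readable summary with the same class names used in the confusion matrix.
print(classification_report(
    y_true, y_pred, labels=[0, 1],
    target_names=['No Delay', 'Delay'], zero_division=0
))
```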


In [None]:
from sklearn.metrics import roc_auc_score

# Get probabilities for the positive class

y_scores = target_predicted['class']

# Calculate ROC AUC score
roc_auc = roc_auc_score(test_labels, y_scores)
print(f"ROC AUC Score: {roc_auc:.4f}")


In [None]:
from sklearn.metrics import roc_curve

# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(test_labels, y_scores)


In [None]:
import matplotlib.pyplot as plt

# Plot ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {roc_auc:.4f})')
plt.plot([0, 1], [0, 1], linestyle='--', color='gray')  # Diagonal line
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.show()
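
The ROC curve summarizes every possible cutoff, whereas `binary_convert` fixes the cutoff at 0.65. One way to see what that choice costs is to sweep a few thresholds and compare precision and recall for the "Delay" class. The following is a sketch on toy scores; the helper name `sweep_thresholds` and the candidate thresholds are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Toy labels and scores for illustration only (1 = Delay).
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_score = np.array([0.05, 0.20, 0.35, 0.45, 0.55, 0.60, 0.70, 0.90])

def sweep_thresholds(y_true, y_score, thresholds):
    """Return (threshold, precision, recall) for the positive class at each cutoff."""
    rows = []
    for t in thresholds:
        y_pred = (y_score > t).astype(int)  # same rule as binary_convert
        rows.append((
            t,
            precision_score(y_true, y_pred, zero_division=0),
            recall_score(y_true, y_pred, zero_division=0),
        ))
    return rows

results = sweep_thresholds(y_true, y_score, [0.3, 0.5, 0.65, 0.8])
for t, p, r in results:
    print(f"threshold={t:.2f}  precision={p:.3f}  recall={r:.3f}")
```

Raising the threshold can only keep or reduce the number of predicted delays, so recall never increases as the cutoff goes up.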


## XGBoost Model 1

XGBoost Model 1 is highly accurate on "No Delay" flights, correctly identifying 193,819 of them, but it struggles badly with "Delay" predictions. It misclassifies 51,439 actual delays as "No Delay" (false negatives) and correctly identifies only 61 delays (true positives), so it misses almost every delayed flight.

Model 1 achieves a ROC AUC score of 0.6791. For the "No Delay" class it shows high precision, very high recall, and a good F1-score. For the "Delay" class, precision is 0.753086, but recall (0.001184) and F1-score (0.002365) are very low. The overall accuracy of 0.790253 therefore mostly reflects the majority "No Delay" class rather than any ability to detect delays.

## XGBoost Model 2

XGBoost Model 2 improves on Model 1 but still struggles to predict "Delay" instances. It correctly identifies 193,130 "No Delay" flights (true negatives) and 2,570 delays (true positives), up from 61 for Model 1. However, 48,930 actual delays are still labeled "No Delay" (false negatives), so the large number of missed delays continues to limit its usefulness for predicting delays.

Model 2 achieves a ROC AUC score of 0.7308. It predicts "No Delay" cases with high precision, very high recall, and a good F1-score. Compared to Model 1, precision, recall, and F1-score for "Delay" cases are all higher, although recall (0.049903) remains low in absolute terms. Overall accuracy is 0.797672.

## XGBoost Model Comparison

Model 2 outperforms Model 1 overall. Its ROC AUC score is 0.7308 compared to Model 1's 0.6791, indicating better discrimination between the two classes. Both models exhibit high precision and recall for the "No Delay" class, but Model 2 is better on "Delay" cases, with higher precision (0.783776 vs. 0.753086), recall (0.049903 vs. 0.001184), and F1-score (0.093832 vs. 0.002365). Model 2 also has a slightly higher overall accuracy (0.797672 vs. 0.790253).

Based on these metrics, Model 2 is the better model: it identifies more "Delay" cases and has higher overall accuracy. Even so, it still detects only about 5% of actual delays.
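
As a cross-check, the "Delay" metrics quoted above can be recomputed from the confusion-matrix counts. The sketch below uses the true positive, false negative, and true negative counts reported in this notebook. The false positive counts are not stated in the text; the values 20 and 709 are inferred from the reported precision and true positive counts and should be treated as an assumption. The helper name `metrics_from_counts` is also hypothetical.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Positive-class precision/recall/F1, overall accuracy, and the
    accuracy of always predicting the negative ("No Delay") class."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'accuracy': (tp + tn) / total,
        'always_no_delay_accuracy': (tn + fp) / total,
    }

# tp, fn, tn come from the text; fp is inferred (assumption, see lead-in).
model_1 = metrics_from_counts(tp=61, fp=20, fn=51439, tn=193819)
model_2 = metrics_from_counts(tp=2570, fp=709, fn=48930, tn=193130)

for name, m in [('Model 1', model_1), ('Model 2', model_2)]:
    print(name, {k: round(v, 6) for k, v in m.items()})
```

Under these counts, a classifier that always predicts "No Delay" would already reach an accuracy of roughly 0.79, which is why accuracy alone says little about how well either model detects delays.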

## XGBoost Model Conclusion

XGBoost Model 2 outperforms Model 1 at predicting "Delay" cases, with higher precision, recall, and F1-score, and it has slightly higher overall accuracy. Both models are strong at identifying "No Delay" instances; Model 2 gives up a small number of true negatives (193,130 vs. 193,819) in exchange for many more true positives (2,570 vs. 61).

To further improve the model, consider hyperparameter tuning, feature engineering, and ensemble methods. Investigating why the model misses most "Delay" cases, for example by examining the 0.65 classification threshold and the class imbalance, can also guide future enhancements.


## Project Conclusion

Both linear and XGBoost models were explored to predict flight delays. The results indicate that all of them, particularly the linear models, struggle to accurately predict "Delay" cases. This is largely due to the class imbalance in the dataset, where "No Delay" cases significantly outnumber "Delay" cases.

Several strategies could improve predictive performance. First, tuning the model's hyperparameters may improve accuracy. Second, addressing the class imbalance through oversampling, undersampling, or class weighting can help the model learn from the underrepresented "Delay" class. Finally, exploring different machine learning algorithms and ensemble methods could lead to more robust and accurate predictions.
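
As one concrete form of class weighting, XGBoost exposes a `scale_pos_weight` parameter, and a common starting point is the ratio of negative to positive training examples. The sketch below computes that ratio on toy labels. The helper name, the toy label array, and the other hyperparameter values are assumptions for illustration, and this setting was not evaluated in this notebook.

```python
import numpy as np

def compute_scale_pos_weight(labels):
    """Ratio of negative (0) to positive (1) examples, a common
    starting value for XGBoost's scale_pos_weight."""
    labels = np.asarray(labels)
    n_pos = int((labels == 1).sum())
    n_neg = int((labels == 0).sum())
    if n_pos == 0:
        raise ValueError("No positive examples; the ratio is undefined.")
    return n_neg / n_pos

# Toy training labels for illustration only: 8 "No Delay", 2 "Delay".
train_labels = np.array([0] * 8 + [1] * 2)
weight = compute_scale_pos_weight(train_labels)

# Hypothetical hyperparameter set; only scale_pos_weight is the point here.
hyperparameters = {
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'scale_pos_weight': weight,
}
print(hyperparameters)
```

Assuming the model is trained with the SageMaker built-in XGBoost estimator, a dictionary like this could be passed through the estimator's `set_hyperparameters` call before retraining, and the effect on "Delay" recall checked with the same confusion matrix and ROC AUC cells used above.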