# Problem: Predicting Airplane Delays

The goals of this notebook are:
- Process and create a dataset from downloaded ZIP files
- Perform exploratory data analysis (EDA)
- Establish a baseline model and improve it

## Introduction to business scenario
You work for a travel booking website that wants to improve the customer experience around delayed flights. The company wants to build a feature that tells customers, at the time of booking, whether a flight to or from one of the busiest US domestic airports is likely to be delayed due to weather.

You are tasked with solving part of this problem by using machine learning to identify whether a flight will be delayed due to weather. You have been given access to a dataset of on-time performance for domestic flights operated by large air carriers. You can use this data to train a machine learning model that predicts whether a flight at one of the busiest airports will be delayed.

### Dataset
The provided dataset contains scheduled and actual departure and arrival times reported by certified US air carriers that account for at least 1 percent of domestic scheduled passenger revenues. The data was collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS). It covers flights between 2014 and 2018 and includes the date, time, origin, destination, airline, distance, and delay status of each flight.
The data is in 60 compressed files, one per month across the five years (2014 to 2018), each containing a CSV of that month's flight details. The data can be downloaded from this [link](https://ucstaff-my.sharepoint.com/:f:/g/personal/ibrahim_radwan_canberra_edu_au/EhWeqeQsh-9Mr1fneZc9_0sBOBzEdXngvxFJtAlIa-eAgA?e=8ukWwa). Please download the data files and place them in a path relative to this notebook. The dataset used in this assignment was compiled by the Office of Airline Information, Bureau of Transportation Statistics (BTS), Airline On-Time Performance Data, available at the following [link](https://www.transtats.bts.gov/Fields.asp?gnoyr_VQ=FGJ).

# Step 1: Prepare the environment 

Use one of the Amazon SageMaker labs we have practised with and perform the following steps:
1. Start a lab.
2. Create a notebook instance and name it "oncloudproject".
3. Under Additional configuration, increase the volume size (storage) to 25 GB.
4. Open Jupyter Lab and upload this notebook into it.
5. Upload the two combined CSV files (combined_csv_v1.csv and combined_csv_v2.csv), which you created in Part A of this project.

**Note:** If the data is too large to upload to AWS, use only 20% of the data for this task.

# Step 2: Build and evaluate simple models

Write code to perform the following steps:
1. Split the data into training, validation and testing sets (70% - 15% - 15%).
2. Use the linear learner estimator to build a classification model.
3. Host the model on another instance.
4. Perform a batch transform to evaluate the model on the testing data.
5. Report the performance metrics that you think best assess the model's performance.

Note: You are required to perform the above steps on the two combined datasets separately and to comment on the differences.

In [1]:
import os, io, boto3, zipfile, requests
import pandas as pd
from sklearn.model_selection import train_test_split
from sagemaker import get_execution_role, Session, image_uris
import sagemaker

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [2]:
bucket='c182567a4701745l12374482t1w921201583050-labbucket-9dbcd7xdlpx6'

In [3]:
df = pd.read_csv('combined_csv_v2_20percent.csv')

# label column is named 'target' and is first column 
if 'target' not in df.columns:
    raise ValueError("Expected a 'target' column in the dataset.")
cols = df.columns.tolist()
cols = ['target'] + [c for c in cols if c != 'target']
df = df[cols]

# Optional: ensure target is 0/1 ints
df['target'] = pd.to_numeric(df['target'], errors='coerce').fillna(0).astype(int)

# Train/Val/Test split (70/15/15)
train, temp = train_test_split(
    df, test_size=0.30, random_state=42, stratify=df['target']
)
validate, test = train_test_split(
    temp, test_size=0.50, random_state=42, stratify=temp['target']
)

print(f"Train: {len(train)}  Validate: {len(validate)}  Test: {len(test)}")

Train: 228982  Validate: 49068  Test: 49068
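
The two-stage split reaches 70/15/15 because the second call halves the 30% holdout from the first. The following is a minimal sketch of that arithmetic, assuming the held-out share is rounded up to a whole row (which agrees with the sizes printed above). The helper name `split_sizes` is hypothetical and not part of the notebook.

```python
import math

def split_sizes(n_rows, holdout=0.30, test_share=0.50):
    """Return (train, validate, test) row counts for a two-stage split."""
    n_holdout = math.ceil(n_rows * holdout)     # first split: 30% held out
    n_test = math.ceil(n_holdout * test_share)  # second split: half of the holdout
    n_validate = n_holdout - n_test
    n_train = n_rows - n_holdout
    return n_train, n_validate, n_test

# Total rows, taken from the sizes printed above
total_rows = 228982 + 49068 + 49068
print(split_sizes(total_rows))
```

Because both calls pass `stratify`, each split should keep roughly the same share of delayed flights as the full dataset.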


In [4]:
prefix = 'lab3'  # S3 key prefix for everything this notebook uploads

s3_resource = boto3.Session().resource('s3')

def upload_s3_csv(filename, folder, dataframe):
    """Write a DataFrame to s3://<bucket>/<prefix>/<folder>/<filename> as CSV."""
    csv_buffer = io.StringIO()
    # SageMaker built-in algorithms expect CSV without a header or index column
    dataframe.to_csv(csv_buffer, header=False, index=False)
    # Build the key with '/' explicitly; S3 keys are not OS paths
    key = f"{prefix}/{folder}/{filename}"
    s3_resource.Bucket(bucket).Object(key).put(Body=csv_buffer.getvalue())

# File names for the three splits
train_file    = 'flight_delay_train.csv'
validate_file = 'flight_delay_validate.csv'
test_file     = 'flight_delay_test.csv'

# Upload splits
upload_s3_csv(train_file, 'train', train)
upload_s3_csv(validate_file, 'validate', validate)
upload_s3_csv(test_file, 'test', test)

print(f"S3 paths:\n  s3://{bucket}/{prefix}/train/{train_file}\n  s3://{bucket}/{prefix}/validate/{validate_file}\n  s3://{bucket}/{prefix}/test/{test_file}")

S3 paths:
  s3://c182567a4701745l12374482t1w921201583050-labbucket-9dbcd7xdlpx6/lab3/train/flight_delay_train.csv
  s3://c182567a4701745l12374482t1w921201583050-labbucket-9dbcd7xdlpx6/lab3/validate/flight_delay_validate.csv
  s3://c182567a4701745l12374482t1w921201583050-labbucket-9dbcd7xdlpx6/lab3/test/flight_delay_test.csv


In [5]:
region = boto3.Session().region_name
region

'us-east-1'

In [6]:
container = image_uris.retrieve('linear-learner', region)

In [7]:
s3_output_location="s3://{}/{}/output/".format(bucket,prefix)
# Create estimator
linear = sagemaker.estimator.Estimator(
    image_uri=container,
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.m5.large',
    output_path=s3_output_location,
    sagemaker_session=sagemaker.Session()
)

# Set hyperparameters for classification
linear.set_hyperparameters(
    predictor_type='binary_classifier',
    mini_batch_size=200,
    epochs=10,
    loss='logistic',
    normalize_data=True
)

In [8]:
# Each channel points at an S3 prefix; every object under it is used as input
train_channel = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/train/".format(bucket, prefix),
    content_type='text/csv')

validate_channel = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/validate/".format(bucket, prefix),
    content_type='text/csv')

data_channels = {'train': train_channel, 'validation': validate_channel}

linear.fit(inputs=data_channels, logs=False)

print('ready for hosting!')

INFO:sagemaker:Creating training-job with name: linear-learner-2025-10-30-09-48-46-671



2025-10-30 09:48:51 Starting - Starting the training job..
2025-10-30 09:49:06 Starting - Preparing the instances for training....
2025-10-30 09:49:31 Downloading - Downloading input data........
2025-10-30 09:50:16 Downloading - Downloading the training image...............
2025-10-30 09:51:37 Training - Training image download completed. Training in progress..................................................................................................
2025-10-30 09:59:49 Uploading - Uploading generated training model.
2025-10-30 10:00:02 Completed - Training job completed
ready for hosting!


In [9]:
predictor = linear.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.large'
)

INFO:sagemaker:Creating model with name: linear-learner-2025-10-30-10-00-11-614
INFO:sagemaker:Creating endpoint-config with name linear-learner-2025-10-30-10-00-11-614
INFO:sagemaker:Creating endpoint with name linear-learner-2025-10-30-10-00-11-614


-------!

In [10]:
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer
import numpy as np

# Make sure features are numeric float32
X_test = test.drop('target', axis=1).astype('float32')
y_test = test['target']

# Configure endpoint I/O; the request Content-Type and Accept headers
# are taken from the serializer and deserializer
predictor.serializer   = CSVSerializer()        # send text/csv
predictor.deserializer = JSONDeserializer()     # receive application/json

pred_labels, pred_scores = [], []
batch_size = 200

for start in range(0, len(X_test), batch_size):
    batch = X_test.iloc[start:start+batch_size]
    payload = batch.to_csv(header=False, index=False).strip()  # no trailing newline

    resp = predictor.predict(payload)          # parsed JSON response (dict)
    preds = resp['predictions']                # list of dicts, one per row

    for p in preds:
        pred_labels.append(int(p['predicted_label']))
        pred_scores.append(float(p['score']))

pred_labels = np.array(pred_labels)
pred_scores = np.array(pred_scores)


In [12]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report, roc_auc_score
acc  = accuracy_score(y_test, pred_labels)
prec = precision_score(y_test, pred_labels, zero_division=0)
rec  = recall_score(y_test, pred_labels, zero_division=0)
f1   = f1_score(y_test, pred_labels, zero_division=0)
cm   = confusion_matrix(y_test, pred_labels)

print(f"Accuracy    : {acc:.4f}")
print(f"Precision   : {prec:.4f}")
print(f"Recall      : {rec:.4f}")
print(f"F1-score    : {f1:.4f}")
print("Confusion Matrix:\n", cm)
print("\nClassification Report:\n", classification_report(y_test, pred_labels, zero_division=0))

Accuracy    : 0.7943
Precision   : 0.5702
Recall      : 0.0524
F1-score    : 0.0960
Confusion Matrix:
 [[38437   404]
 [ 9691   536]]

Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.99      0.88     38841
           1       0.57      0.05      0.10     10227

    accuracy                           0.79     49068
   macro avg       0.68      0.52      0.49     49068
weighted avg       0.75      0.79      0.72     49068
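
The headline metrics above can be recovered from the confusion matrix alone, which also makes it easy to compare accuracy against a model that always predicts the majority class. The sketch below uses only the four counts printed above; the helper name `metrics_from_confusion` is hypothetical.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Derive metrics for the positive (delayed) class from confusion-matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tn + tp) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Accuracy obtained by always predicting the larger class
        "majority_baseline": max(tn + fp, fn + tp) / total,
    }

# Counts copied from the Linear Learner confusion matrix above
linear_metrics = metrics_from_confusion(tn=38437, fp=404, fn=9691, tp=536)
for name, value in linear_metrics.items():
    print(f"{name:18s}: {value:.4f}")
```

The `majority_baseline` entry is the reference point for reading accuracy on an imbalanced test set: an accuracy close to it says little about how well delays are detected.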



# Step 3: Build and evaluate ensemble models

Write code to perform the following steps:
1. Split the data into training, validation and testing sets (70% - 15% - 15%).
2. Use the xgboost estimator to build a classification model.
3. Host the model on another instance.
4. Perform a batch transform to evaluate the model on the testing data.
5. Report the performance metrics that you think best assess the model's performance.
6. Write down your observations on the difference in performance between the simple and ensemble models.

Note: You are required to perform the above steps on the two combined datasets separately.

In [14]:
# Reuse the train/validate/test splits already uploaded for the v2 dataset
container = image_uris.retrieve('xgboost', boto3.Session().region_name, '1.0-1')

hyperparams = {"num_round": "42",
               "eval_metric": "auc",
               "objective": "binary:logistic"}

s3_output_location = "s3://{}/{}/output/".format(bucket, prefix)
xgb_model = sagemaker.estimator.Estimator(container,
                                          sagemaker.get_execution_role(),
                                          instance_count=1,
                                          instance_type='ml.m4.xlarge',
                                          output_path=s3_output_location,
                                          hyperparameters=hyperparams,
                                          sagemaker_session=sagemaker.Session())

# Each channel points at an S3 prefix; every object under it is used as input
train_channel = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/train/".format(bucket, prefix),
    content_type='text/csv')

validate_channel = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/validate/".format(bucket, prefix),
    content_type='text/csv')

data_channels = {'train': train_channel, 'validation': validate_channel}

xgb_model.fit(inputs=data_channels, logs=False)

print('ready for hosting!')

INFO:sagemaker.image_uris:Defaulting to only available Python version: py3
INFO:sagemaker.image_uris:Defaulting to only supported image scope: cpu.
INFO:sagemaker.telemetry.telemetry_logging:SageMaker Python SDK will collect telemetry to help us better understand our user's needs, diagnose issues, and deliver additional features.
To opt out of telemetry, please disable via TelemetryOptOut parameter in SDK defaults config. For more information, refer to https://sagemaker.readthedocs.io/en/stable/overview.html#configuring-and-using-defaults-with-the-sagemaker-python-sdk.
INFO:sagemaker:Creating training-job with name: sagemaker-xgboost-2025-10-30-10-15-19-039



2025-10-30 10:15:20 Starting - Starting the training job..
2025-10-30 10:15:35 Starting - Preparing the instances for training...
2025-10-30 10:15:59 Downloading - Downloading input data......
2025-10-30 10:16:30 Downloading - Downloading the training image.........
2025-10-30 10:17:20 Training - Training image download completed. Training in progress........
2025-10-30 10:18:01 Uploading - Uploading generated training model.
2025-10-30 10:18:14 Completed - Training job completed
ready for hosting!
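
The hyperparameters above treat both classes equally, although the Linear Learner report earlier shows roughly four non-delayed flights for every delayed one. One common adjustment, not used in this notebook, is XGBoost's `scale_pos_weight`, conventionally started at the ratio of negative to positive examples. The sketch below assumes the test-set support (38,841 vs 10,227) reflects the training ratio, which the stratified split should preserve. The dictionary name `weighted_hyperparams` is hypothetical.

```python
n_negative = 38841  # non-delayed flights in the test set (from the report above)
n_positive = 10227  # delayed flights in the test set

# Conventional starting point: ratio of negative to positive examples
scale_pos_weight = n_negative / n_positive

# Same settings as the notebook, plus the class weight (passed as a string)
weighted_hyperparams = {
    "num_round": "42",
    "eval_metric": "auc",
    "objective": "binary:logistic",
    "scale_pos_weight": f"{scale_pos_weight:.2f}",
}
print(weighted_hyperparams)
```

For the Linear Learner, the comparable knob is the `positive_example_weight_mult` hyperparameter, which accepts `'balanced'`. Either change would need a fresh training job, and its effect on precision and recall would have to be measured rather than assumed.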


In [15]:
xgb_predictor = xgb_model.deploy(initial_instance_count=1,
                serializer = sagemaker.serializers.CSVSerializer(),
                instance_type='ml.m4.xlarge')

INFO:sagemaker:Creating model with name: sagemaker-xgboost-2025-10-30-10-19-41-170
INFO:sagemaker:Creating endpoint-config with name sagemaker-xgboost-2025-10-30-10-19-41-170
INFO:sagemaker:Creating endpoint with name sagemaker-xgboost-2025-10-30-10-19-41-170


------!

In [19]:
batch_X = test.iloc[:, 1:]  # drop the target, which is the first column
batch_X.head()

Unnamed: 0,Distance,DepHourofDay,AWND_O,PRCP_O,TAVG_O,AWND_D,PRCP_D,TAVG_D,SNOW_O,SNOW_D,...,Origin_SFO,Dest_CLT,Dest_DEN,Dest_DFW,Dest_IAH,Dest_LAX,Dest_ORD,Dest_PHX,Dest_SFO,is_holiday_True
234564,602.0,12,43,0,-69.0,18,0,157.0,0.0,0.0,...,False,False,False,False,False,False,False,True,False,False
32337,802.0,14,33,0,115.0,51,0,214.0,0.0,0.0,...,False,False,False,True,False,False,False,False,False,False
10046,370.0,12,34,0,186.0,40,0,197.0,0.0,0.0,...,False,False,False,False,False,False,False,True,False,False
228703,862.0,9,36,0,194.0,50,0,284.0,0.0,0.0,...,False,False,True,False,False,False,False,False,False,False
101846,1199.0,21,42,0,171.0,64,0,192.0,0.0,0.0,...,False,False,True,False,False,False,False,False,False,False


In [20]:
batch_X_file='batch-in.csv'
upload_s3_csv(batch_X_file, 'batch-in', batch_X)
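
Step 4 of the task asks for a batch transform, while the cells in this notebook evaluate through the real-time endpoint. The file uploaded above could instead be scored with a batch transform job. The following is a sketch only: `run_batch_transform` is a hypothetical helper, the `batch-out` output folder is an assumed name, and the instance type is a placeholder. It relies on the SageMaker SDK's `Estimator.transformer()` and `Transformer.transform()`.

```python
def run_batch_transform(estimator, input_s3_uri, output_s3_uri,
                        instance_type="ml.m5.large"):
    """Score a headerless CSV in S3 with a trained SageMaker estimator."""
    transformer = estimator.transformer(
        instance_count=1,
        instance_type=instance_type,
        strategy="MultiRecord",   # send several rows per request
        assemble_with="Line",     # write one prediction per output line
        output_path=output_s3_uri,
    )
    transformer.transform(
        data=input_s3_uri,
        data_type="S3Prefix",
        content_type="text/csv",
        split_type="Line",        # split the input file on newlines
    )
    transformer.wait()            # returns at once if the job already finished
    return transformer.output_path

# Example call (needs AWS access, so it is left commented out):
# out = run_batch_transform(
#     xgb_model,
#     f"s3://{bucket}/{prefix}/batch-in/",
#     f"s3://{bucket}/{prefix}/batch-out/",   # assumed output location
# )
```

The predictions would then be read back from the output location (typically a file named after the input with an `.out` suffix) and compared with `test['target']`, in the same way as the endpoint predictions.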

In [None]:
import re
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import BytesDeserializer
import numpy as np

# Prepare test data
X_test = test.drop('target', axis=1).astype('float32')
y_test = test['target'].astype(int).to_numpy()

# Configure predictor; the request Content-Type and Accept headers
# are taken from the serializer and deserializer
xgb_predictor.serializer   = CSVSerializer()                       # send text/csv
xgb_predictor.deserializer = BytesDeserializer(accept='text/csv')  # raw CSV bytes back

def parse_scores(resp_bytes, expect_n=None):
    """Parse CSV text that may be one line with commas or many lines."""
    text = resp_bytes.decode('utf-8', errors='ignore').strip()
    # Split on commas, whitespace, or newlines; filter empties
    tokens = [t for t in re.split(r'[,\r\n\s]+', text) if t]
    scores = np.array([float(t) for t in tokens], dtype=float)
    if expect_n is not None and scores.size != expect_n:
        print(f"Parsed {scores.size} scores, expected {expect_n}. First 120 chars:\n{text[:120]}")
    return scores

pred_scores = []
batch_size = 200

for start in range(0, len(X_test), batch_size):
    batch = X_test.iloc[start:start + batch_size]
    payload = batch.to_csv(header=False, index=False).strip()
    resp = xgb_predictor.predict(payload)           # bytes
    scores = parse_scores(resp, expect_n=len(batch))
    pred_scores.extend(scores.tolist())

pred_scores = np.array(pred_scores, dtype=float)
pred_labels = (pred_scores >= 0.5).astype(int)

In [26]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, classification_report

print("Accuracy :", accuracy_score(y_test, pred_labels))
print("Precision:", precision_score(y_test, pred_labels, zero_division=0))
print("Recall   :", recall_score(y_test, pred_labels, zero_division=0))
print("F1-score :", f1_score(y_test, pred_labels, zero_division=0))
try:
    print("ROC AUC  :", roc_auc_score(y_test, pred_scores))
except Exception as e:
    print("ROC AUC  : n/a (", e, ")")
print("\nConfusion Matrix:\n", confusion_matrix(y_test, pred_labels))
print("\nClassification Report:\n", classification_report(y_test, pred_labels, zero_division=0))

Accuracy : 0.8012757805494416
Precision: 0.64
Recall   : 0.10638505915713307
F1-score : 0.18244319610966714
ROC AUC  : 0.693120653833251

Confusion Matrix:
 [[38229   612]
 [ 9139  1088]]

Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.98      0.89     38841
           1       0.64      0.11      0.18     10227

    accuracy                           0.80     49068
   macro avg       0.72      0.55      0.53     49068
weighted avg       0.77      0.80      0.74     49068



### Comparison between Linear and XGBoost on Combined Data:

The results from the combined dataset v2 show a clear improvement in overall performance for both the simple Linear Learner model and the ensemble XGBoost model compared to the earlier version (v1). However, the XGBoost model demonstrates a stronger ability to capture delayed flights, indicating that ensemble learning provides better predictive power and sensitivity to complex patterns in the data.

Starting with the Linear Learner, it achieved an accuracy of 79.43%, a precision of 0.57, and a recall of 0.0524, resulting in an F1-score of 0.0960. The accuracy is barely above the 79.16% that a model predicting "not delayed" for every flight would score on this test set (38,841 of 49,068 flights), so it mostly reflects the majority class. The model classified most non-delayed flights correctly but struggled to identify the minority class of delayed flights. The confusion matrix confirms this: out of 10,227 delayed flights, only 536 were correctly identified, while 9,691 were missed. Although the linear model has learned some separation between classes, its linear decision boundary likely limits its ability to capture non-linear interactions between features such as weather, time, and route characteristics. The result is moderate precision (57% of its delay predictions were correct) but very low recall (it missed about 95% of actual delays).

The XGBoost model, on the other hand, achieved a slightly higher accuracy of 80.13%, and improved on every minority-class metric: precision increased to 0.64, recall to 0.106, and F1-score to 0.182. The ROC-AUC score of 0.693, while still modest, suggests that XGBoost learned a more discriminative decision boundary than the near-random performance seen in the previous dataset version. The confusion matrix shows that XGBoost correctly identified 1,088 delayed flights, roughly double the Linear Learner's 536, while maintaining strong performance on non-delayed flights (38,229 true negatives). This is consistent with XGBoost’s ability to handle feature interactions and non-linear patterns, using boosting to correct, round by round, errors that a single linear model cannot adapt to.
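
To make the ROC-AUC figure concrete: it can be read as the probability that a randomly chosen delayed flight receives a higher score than a randomly chosen non-delayed one, with 0.5 meaning random ranking. The sketch below computes that pairwise definition on made-up data. The labels and scores are illustrative, not notebook output, and `pairwise_auc` is a hypothetical helper.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data for illustration only
demo_labels = [0, 0, 1, 0, 1]
demo_scores = [0.10, 0.40, 0.35, 0.20, 0.80]
print(f"AUC = {pairwise_auc(demo_labels, demo_scores):.3f}")
```

The pairwise loop is quadratic in the number of rows, so on the full test set `roc_auc_score`, as used in the notebook, remains the practical choice.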

In summary, while both models still suffer from the class imbalance problem, the XGBoost model performed notably better, particularly in identifying delayed flights. Its improvement in recall and F1-score highlights the advantage of ensemble methods in capturing complex relationships between features. The Linear Learner remains useful as a fast, interpretable baseline, but it lacks the flexibility to generalize effectively in data with high dimensionality and non-linear dependencies. To further improve both models, future work should focus on techniques such as class-weight adjustment, oversampling the delayed class, or threshold tuning, which would help enhance recall without overly compromising precision.
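
Of the remedies listed, threshold tuning is the only one that needs no retraining: the predicted scores are kept and only the 0.5 cut-off changes. The sketch below runs a simple threshold sweep on made-up scores. The data is illustrative, not notebook output, and `sweep_thresholds` is a hypothetical helper.

```python
def sweep_thresholds(y_true, scores, thresholds):
    """Return (threshold, precision, recall, f1) for each candidate cut-off."""
    rows = []
    for t in thresholds:
        preds = [1 if s >= t else 0 for s in scores]
        tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
        fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
        fn = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 0)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        rows.append((t, precision, recall, f1))
    return rows

# Toy data for illustration only
toy_labels = [0, 0, 0, 0, 1, 0, 1, 1]
toy_scores = [0.05, 0.10, 0.20, 0.35, 0.30, 0.45, 0.40, 0.70]

results = sweep_thresholds(toy_labels, toy_scores, [0.5, 0.4, 0.3])
best = max(results, key=lambda row: row[3])  # threshold with the highest F1
for t, p, r, f in results:
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}  f1={f:.2f}")
```

On the toy data, lowering the threshold trades precision for recall. In practice the threshold would be chosen on validation-set scores and then applied once to the test set, so that the reported test metrics stay unbiased.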