# Preliminary SVM Testing on Flight Delays

### By: Tristan Levy-Park

## Working with Unbalanced Data

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, auc, precision_recall_curve
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv('/content/Dataset/Illinois_10_years_data.csv')

# Drop rows with missing values
df.dropna(inplace=True)

# Filter data to only include January
df = df[df['Month'] == 1]

# Downsample the data to reduce size (frac=0.5 keeps a random 50% of the rows)
df = df.sample(frac=0.5, random_state=42).reset_index(drop=True)

# Convert integer features to int32 for memory efficiency
int_columns = ['Year', 'Quarter', 'Month', 'Day_of_Month', 'Day_of_Week', 'Scheduled_Departure_Time', 'Scheduled_Departure_Time_Minutes', 'Target']
for col in int_columns:
    df[col] = df[col].astype(np.int32)

# Convert continuous numeric features to float32
float_columns = ['Taxi_Out_Time_Minutes', 'Flight_Distance_Miles', 'Air_Temperature_Fahrenheit', 'Dew_Point_Temperature_Fahrenheit',
                 'Relative_Humidity_Percent', 'Wind_Direction_Degrees', 'Wind_Speed_Knots', 'Hourly_Precipitation_Inches',
                 'Pressure_Altimeter_Inches', 'Sea_Level_Pressure_Millibar', 'Visibility_Miles', 'Sky_Level_1_Altitude_Feet',
                 'Apparent_Temperature_Fahrenheit']
for col in float_columns:
    df[col] = df[col].astype(np.float32)

# Drop unnecessary columns
df = df.drop(['Origin_State_Name', 'Departure_Datetime', 'Departure_Delay_Minutes'], axis=1)

# Define categorical columns
categorical_columns = ['Operating_Carrier_Code', 'Tail_Number', 'Origin_Airport_ID', 'Origin_Airport_Code', 'Destination_Airport_Code', 'Destination_State_Name', 'Sky_Cover_Level_1']

# Separate high-cardinality and low-cardinality categorical columns
high_cardinality_columns = ['Tail_Number', 'Origin_Airport_Code', 'Destination_Airport_Code']
low_cardinality_columns = [col for col in categorical_columns if col not in high_cardinality_columns]

# Apply label encoding to high-cardinality categorical features
label_encoders = {}
for col in high_cardinality_columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le

# Apply one-hot encoding to low-cardinality categorical features
df = pd.get_dummies(df, columns=low_cardinality_columns, drop_first=True)

# Define the years for each dataset
train_years = [2014, 2015, 2016, 2017, 2018, 2019]
val_years = [2020, 2021, 2022]
test_years = [2023, 2024]

# Create train, validation, and test sets
train_df = df[df['Year'].isin(train_years)].reset_index(drop=True)
val_df = df[df['Year'].isin(val_years)].reset_index(drop=True)
test_df = df[df['Year'].isin(test_years)].reset_index(drop=True)

# Separate features and target variable
X_train = train_df.drop('Target', axis=1)
y_train = train_df['Target']
X_val = val_df.drop('Target', axis=1)
y_val = val_df['Target']
X_test = test_df.drop('Target', axis=1)
y_test = test_df['Target']

# Standardize numeric features to improve SVM model performance
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

# Train the LinearSVC model
svm_model = LinearSVC(random_state=42)
svm_model.fit(X_train_scaled, y_train)

# Predict on the test set
y_pred = svm_model.predict(X_test_scaled)
y_pred_scores = svm_model.decision_function(X_test_scaled)  # Use decision_function for score-based metrics

# Evaluate on validation set
y_val_pred = svm_model.predict(X_val_scaled)
print("Validation Accuracy:", accuracy_score(y_val, y_val_pred))
print("\nClassification Report on Validation Set:\n", classification_report(y_val, y_val_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_val, y_val_pred))

# Evaluate on test set
y_test_pred = svm_model.predict(X_test_scaled)
print("Test Accuracy:", accuracy_score(y_test, y_test_pred))
print("\nClassification Report on Test Set:\n", classification_report(y_test, y_test_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_test_pred))

# Visualization Metrics
# 1. Confusion Matrix Heatmap
plt.figure(figsize=(6, 4))
conf_matrix = confusion_matrix(y_val, y_val_pred)  # Store it in 'conf_matrix'
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", xticklabels=['No Delay', 'Delay'], yticklabels=['No Delay', 'Delay'])
plt.title("Confusion Matrix of Validation Set")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()


plt.figure(figsize=(6, 4))
conf_matrix = confusion_matrix(y_test, y_test_pred)  # Store it in 'conf_matrix'
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", xticklabels=['No Delay', 'Delay'], yticklabels=['No Delay', 'Delay'])
plt.title("Confusion Matrix of Test Set")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

# 2. ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_scores)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(6, 4))
plt.plot(fpr, tpr, color='blue', label='AUC = %0.2f' % roc_auc)
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver Operating Characteristic (ROC) Curve")
plt.legend(loc="lower right")
plt.show()

# 3. Precision-Recall Curve
precision, recall, _ = precision_recall_curve(y_test, y_pred_scores)

plt.figure(figsize=(6, 4))
plt.plot(recall, precision, color='purple')
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.show()



## Results of Unbalanced Data

**For the Validation Set:**

* Accuracy is 86.3%, which looks high but is misleading: the model is heavily biased towards predicting the majority class (no delay), which dominates the dataset.

* Class 0 (No Delay): Precision is 0.86 and recall is 1.00. The perfect recall is a side effect of the model labeling every flight "No Delay", not evidence that it can tell the classes apart.

* Class 1 (Delay): Precision and recall are both 0, so the model fails to identify any delays. The likely cause is class imbalance: delayed flights are far rarer than non-delayed ones, so the model defaults to predicting "No Delay" for all instances.

* The confusion matrix shows that all 946 delayed flights were classified as non-delayed, with no true positives for the delay class.

**For the test set:**

* Accuracy is 77%, which again seems reasonable at first glance but reflects the same bias.

* Class 0 (No Delay): Precision is 0.77 with perfect recall, again because every flight is predicted as "No Delay".

* Class 1 (Delay): As on the validation set, the model fails to identify any delays, with recall and precision of 0.

* The confusion matrix again shows that all delayed flights were misclassified as non-delayed.

We have a heavily **imbalanced** dataset!

Thus, we need to oversample the minority class (delays) or undersample the majority class (no delay). One option is SMOTE (Synthetic Minority Over-sampling Technique), which generates synthetic samples for the minority class.
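To make the misleading accuracy concrete, here is a minimal sketch that scores a "model" which always predicts "No Delay". The class counts are hypothetical (chosen so that roughly 86% of flights are on time) and are not taken from the dataset.

```python
# Hypothetical labels: 863 on-time flights (0) and 137 delayed flights (1).
y_true = [0] * 863 + [1] * 137

# A degenerate "model" that predicts the majority class for every flight.
y_pred = [0] * len(y_true)

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
# Guard against division by zero when the model never predicts "Delay".
delay_precision = tp / (tp + fp) if (tp + fp) else 0.0
delay_recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"Accuracy: {accuracy:.3f}")
print(f"Delay precision: {delay_precision:.3f}")
print(f"Delay recall: {delay_recall:.3f}")
```

Accuracy stays high even though every delayed flight is missed, which is why the per-class precision and recall are the more informative numbers here.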


## Balancing our Data, using LinearSVC() for computational improvement

## Applying SMOTE() function to balance the dataset

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, auc, precision_recall_curve
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv('/content/Dataset/Illinois_10_years_data.csv')

# Drop rows with missing values
df.dropna(inplace=True)

# Filter data to only include January
df = df[df['Month'] == 1]

# Downsample to 50% for performance (adjust frac as needed)
df = df.sample(frac=0.5, random_state=42).reset_index(drop=True)

# Convert integer features to int32 for memory efficiency
int_columns = ['Year', 'Quarter', 'Month', 'Day_of_Month', 'Day_of_Week', 'Scheduled_Departure_Time', 'Scheduled_Departure_Time_Minutes', 'Target']
for col in int_columns:
    df[col] = df[col].astype(np.int32)

# Convert continuous numeric features to float32
float_columns = ['Taxi_Out_Time_Minutes', 'Flight_Distance_Miles', 'Air_Temperature_Fahrenheit', 'Dew_Point_Temperature_Fahrenheit',
                 'Relative_Humidity_Percent', 'Wind_Direction_Degrees', 'Wind_Speed_Knots', 'Hourly_Precipitation_Inches',
                 'Pressure_Altimeter_Inches', 'Sea_Level_Pressure_Millibar', 'Visibility_Miles', 'Sky_Level_1_Altitude_Feet',
                 'Apparent_Temperature_Fahrenheit']
for col in float_columns:
    df[col] = df[col].astype(np.float32)

# Drop unnecessary columns
df = df.drop(['Origin_State_Name', 'Departure_Datetime', 'Departure_Delay_Minutes'], axis=1)

# Define categorical columns
categorical_columns = ['Operating_Carrier_Code', 'Tail_Number', 'Origin_Airport_ID', 'Origin_Airport_Code', 'Destination_Airport_Code', 'Destination_State_Name', 'Sky_Cover_Level_1']

# Apply label encoding to all categorical features (no one-hot encoding in this run)
label_encoders = {}
for col in categorical_columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le

# Separate features and target
X = df.drop('Target', axis=1)
y = df['Target']

# Split data into train, validation, and test sets by year
train_years = [2014, 2015, 2016, 2017, 2018, 2019]
val_years = [2020, 2021, 2022]
test_years = [2023, 2024]

train_df = df[df['Year'].isin(train_years)].reset_index(drop=True)
val_df = df[df['Year'].isin(val_years)].reset_index(drop=True)
test_df = df[df['Year'].isin(test_years)].reset_index(drop=True)

# Separate features and target variable for train, validation, and test sets
X_train = train_df.drop('Target', axis=1)
y_train = train_df['Target']
X_val = val_df.drop('Target', axis=1)
y_val = val_df['Target']
X_test = test_df.drop('Target', axis=1)
y_test = test_df['Target']

# Apply SMOTE only on training data
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_resampled)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

# Train the LinearSVC model
svm_model = LinearSVC(random_state=42)
svm_model.fit(X_train_scaled, y_train_resampled)

# Predict on the test set
y_pred = svm_model.predict(X_test_scaled)
y_pred_scores = svm_model.decision_function(X_test_scaled)  # Use decision_function for score-based metrics

# Evaluate on train set (SMOTE-resampled, so compare against y_train_resampled)
y_train_pred = svm_model.predict(X_train_scaled)
print("Training Accuracy:", accuracy_score(y_train_resampled, y_train_pred))
print("\nClassification Report on Training Set:\n", classification_report(y_train_resampled, y_train_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_train_resampled, y_train_pred))

# Evaluate on validation set
y_val_pred = svm_model.predict(X_val_scaled)
print("Validation Accuracy:", accuracy_score(y_val, y_val_pred))
print("\nClassification Report on Validation Set:\n", classification_report(y_val, y_val_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_val, y_val_pred))

# Evaluate on test set
y_test_pred = svm_model.predict(X_test_scaled)
print("Test Accuracy:", accuracy_score(y_test, y_test_pred))
print("\nClassification Report on Test Set:\n", classification_report(y_test, y_test_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_test_pred))

# Visualization Metrics
# 1. Confusion Matrix Heatmap

plt.figure(figsize=(6, 4))
conf_matrix = confusion_matrix(y_train_resampled, y_train_pred)  # Store it in 'conf_matrix'
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", xticklabels=['No Delay', 'Delay'], yticklabels=['No Delay', 'Delay'])
plt.title("Confusion Matrix of Training Set")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

plt.figure(figsize=(6, 4))
conf_matrix = confusion_matrix(y_val, y_val_pred)  # Store it in 'conf_matrix'
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", xticklabels=['No Delay', 'Delay'], yticklabels=['No Delay', 'Delay'])
plt.title("Confusion Matrix of Validation Set")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()


plt.figure(figsize=(6, 4))
conf_matrix = confusion_matrix(y_test, y_test_pred)  # Store it in 'conf_matrix'
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", xticklabels=['No Delay', 'Delay'], yticklabels=['No Delay', 'Delay'])
plt.title("Confusion Matrix of Test Set")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

# 2. ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_scores)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(6, 4))
plt.plot(fpr, tpr, color='blue', label='AUC = %0.2f' % roc_auc)
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver Operating Characteristic (ROC) Curve on Test Set")
plt.legend(loc="lower right")
plt.show()

# 3. Precision-Recall Curve
precision, recall, _ = precision_recall_curve(y_test, y_pred_scores)

plt.figure(figsize=(6, 4))
plt.plot(recall, precision, color='purple')
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve on Test Set")
plt.show()

# Get feature importance from LinearSVC
feature_importance = np.abs(svm_model.coef_[0])  # Take absolute value to show strength regardless of direction

# Create a DataFrame to pair feature names with their importance
feature_importance_df = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': feature_importance
})

# Sort features by importance
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# Display feature importance
print("Feature Importance:\n", feature_importance_df)

# Optional: Visualize feature importance as a bar plot
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=feature_importance_df)
plt.title("Feature Importance based on LinearSVC Coefficients")
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.show()




Here, we apply SMOTE only to the training set to avoid a form of data leakage, which occurs when information from the test or validation sets influences the training process. Here’s why:

* The goal of validation and test sets is to simulate how the model will perform on completely unseen data. If we apply SMOTE to the entire dataset, synthetic examples based on validation or test data will indirectly influence the model during training. This creates an "unfair" advantage, as the model has indirectly seen patterns from future (test/validation) data.

* SMOTE creates synthetic samples by interpolating between points in the feature space. Applying it only to the training set ensures these synthetic samples are based solely on the training data distribution. This keeps the validation and test sets "pure" and independent, providing an unbiased evaluation of model performance.

By using SMOTE exclusively on the training set, we maintain a realistic measure of model generalization to new data.

NOTE: In SMOTE (Synthetic Minority Over-sampling Technique), synthetic samples are artificially generated data points that balance the class distribution of an imbalanced dataset.

SMOTE generates synthetic samples by:

* Generating New Data Points: SMOTE creates synthetic samples for the minority class by selecting a data point from the minority class and generating new data points between it and its nearest neighbors from the same class.

* Interpolating Between Points: SMOTE doesn’t just duplicate minority class examples; instead, it finds nearby points in the feature space (minority samples) and interpolates to create a new sample that is somewhere between the two original samples.
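The interpolation step described above can be sketched in plain Python. This is a simplified illustration, not imblearn's implementation: the function name `smote_like_samples`, the sample points, and the choice of `k=2` are all assumptions made for the example.

```python
import math
import random


def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by Euclidean distance to the base point.
        others = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )
        neighbour = rng.choice(others[:k])
        # lam in [0, 1) places the new point on the segment base -> neighbour.
        lam = rng.random()
        synthetic.append(
            tuple(b + lam * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical two-feature minority-class samples.
minority_points = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.5), (1.2, 2.2)]
new_points = smote_like_samples(minority_points, n_new=3)
for point in new_points:
    print(point)
```

Because each new point is a convex combination of two existing minority points, it always falls inside the region those points already span, which is why SMOTE adds variety without simply duplicating rows.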



## Results of balanced dataset


### Validation Set

**Accuracy:**

* The model achieves an accuracy of 0.80 (80%) on the validation set, meaning 80% of predictions are correct. Because the validation set itself is still imbalanced, accuracy alone might not reflect the model's true performance on both classes.

* SMOTE helps mitigate class imbalance during training, but it doesn't fully solve imbalanced performance. To improve further, we might consider adjusting decision thresholds, using cost-sensitive learning, or employing other classifiers that handle class imbalance effectively.

**Precision:**

* Class 0 (No Delay): Precision is high at 0.84, meaning 84% of flights predicted as "No Delay" were correct.

* Class 1 (Delay): Precision is lower at 0.39, so only 39% of flights predicted as "Delay" were actually delayed.

**Recall:**

* Class 0: High recall of 0.94, meaning 94% of actual "No Delay" flights were correctly identified.

* Class 1: Low recall of 0.17, indicating that only 17% of actual "Delay" flights were correctly identified. This suggests the model often misses delays.

**F1-Score:**

* Class 0: High F1-score of 0.89, reflecting a good balance between precision and recall for "No Delay."

* Class 1: Low F1-score of 0.24 due to low recall, showing poor performance in identifying delays.

**Macro and Weighted Averages:**

* Macro Avg: Averages precision, recall, and F1-score equally across both classes, resulting in lower scores (0.56 for recall and F1) due to the poor recall for delays.

* Weighted Avg: Reflects the class imbalance by giving more weight to Class 0, resulting in higher overall scores (0.80 accuracy and 0.77 weighted F1-score).

**Confusion Matrix:**

* The matrix highlights the model’s strong performance in identifying non-delayed flights but its struggle with delays, with many delays incorrectly classified as "No Delay."
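As a rough illustration of how the metrics above relate to one another, the following sketch derives per-class precision, recall, and F1, plus the macro and weighted F1 averages, from a 2x2 confusion matrix. The counts are hypothetical, not the notebook's actual output, and `report_from_confusion` is a helper name made up for this example.

```python
def report_from_confusion(cm):
    """cm[i][j] = number of samples with true class i predicted as class j."""
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    per_class = []
    for c in range(n_classes):
        tp = cm[c][c]
        predicted_c = sum(cm[r][c] for r in range(n_classes))  # column sum
        actual_c = sum(cm[c])                                  # row sum (support)
        precision = tp / predicted_c if predicted_c else 0.0
        recall = tp / actual_c if actual_c else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        per_class.append({"precision": precision, "recall": recall,
                          "f1": f1, "support": actual_c})
    accuracy = sum(cm[c][c] for c in range(n_classes)) / total
    # Macro: every class counts equally. Weighted: classes count by support.
    macro_f1 = sum(m["f1"] for m in per_class) / n_classes
    weighted_f1 = sum(m["f1"] * m["support"] for m in per_class) / total
    return per_class, accuracy, macro_f1, weighted_f1


# Hypothetical counts: rows are true classes, columns are predicted classes.
cm = [[900, 50],    # true "No Delay"
      [100, 50]]    # true "Delay"
per_class, accuracy, macro_f1, weighted_f1 = report_from_confusion(cm)
print(f"Accuracy: {accuracy:.3f}")
print(f"Delay precision/recall/F1: {per_class[1]['precision']:.3f} / "
      f"{per_class[1]['recall']:.3f} / {per_class[1]['f1']:.3f}")
print(f"Macro F1: {macro_f1:.3f}  Weighted F1: {weighted_f1:.3f}")
```

With an imbalanced matrix like this one, the weighted average sits well above the macro average because the majority class dominates the weighting.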

### Test Set

**Accuracy**:

* The model achieves 0.74 (74%) accuracy on the test set, which is lower than the validation accuracy, suggesting a slight drop in performance on truly unseen data.

**Precision:**

* Class 0: High precision of 0.75, meaning 75% of "No Delay" predictions were correct.

* Class 1: Precision is somewhat better at 0.66, indicating that 66% of "Delay" predictions were correct.

**Recall:**

* Class 0: Very high recall at 0.97, so almost all non-delayed flights were correctly identified.

* Class 1: Low recall of 0.16, indicating that only 16% of actual delays were correctly identified. This means the model misses a large portion of delayed flights.

**F1-Score:**

* Class 0: F1-score of 0.84, reflecting a good balance between precision and recall for non-delayed flights.

* Class 1: F1-score of 0.26 due to low recall, showing poor performance in identifying delays.

**Macro and Weighted Averages:**

* Macro Avg: Shows lower averages due to poor recall for delays (0.56 for recall and 0.55 for F1-score).

* Weighted Avg: Skewed by the high number of non-delayed flights, with overall metrics of 0.74 for accuracy and 0.68 for the weighted F1-score.

**Confusion Matrix:**

* Similar to the validation set, the confusion matrix for the test set indicates that the model effectively identifies non-delayed flights but struggles significantly with delayed flights.


**ROC Curve:** The ROC curve plots the True Positive Rate (Recall) against the False Positive Rate for different threshold values.

* AUC (Area Under the Curve) = 0.70: This value indicates the model’s ability to distinguish between the two classes. An AUC of 0.70 suggests that the model has moderate discriminatory power, though it’s not particularly strong. A perfect model would have an AUC of 1.0, and a model with no discriminatory power would have an AUC of 0.5.

* Curve Shape: The ROC curve initially rises but then flattens out as the False Positive Rate increases, which is common for models with moderate performance. The curve above the diagonal (gray dashed line) shows that the model performs better than random guessing, but there is room for improvement.

* Interpretation: This ROC curve and AUC value indicate that the model is somewhat effective at separating delayed flights from non-delayed flights but could benefit from improvements, especially in recognizing the minority class (delayed flights).
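One way to read the AUC value is as a ranking probability: the chance that a randomly chosen delayed flight receives a higher score than a randomly chosen non-delayed flight. The sketch below computes it that way on a tiny hypothetical example. It is a brute-force illustration only; `roc_auc_by_ranking` is a made-up helper, and the `roc_curve` and `auc` calls used in the notebook are the practical route.

```python
def roc_auc_by_ranking(y_true, scores):
    """AUC as the fraction of (delayed, on-time) pairs in which the delayed
    flight receives the higher score; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample from each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and scores (not from the flight data).
y_demo = [0, 0, 1, 1]
scores_demo = [0.1, 0.4, 0.35, 0.8]
print(roc_auc_by_ranking(y_demo, scores_demo))  # prints 0.75
```

Under this reading, an AUC of 0.70 means the model ranks a delayed flight above a non-delayed one in about 70% of such pairs.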

**Precision-Recall (PR) Curve:** The Precision-Recall curve plots Precision against Recall (True Positive Rate) for various threshold values.

* Curve Shape: The curve starts with high precision at very low recall and then declines as recall increases. This decline shows that as the model captures more true positives (increasing recall), precision decreases, likely due to an increasing number of false positives.

* Low precision at higher recall: When the model tries to capture more of the minority class (delays), precision drops, so a growing share of the flights it flags as delayed are actually on time.

* Imbalance challenges: Precision-Recall curves are particularly useful for imbalanced datasets. Here, the curve suggests that the model has difficulty balancing precision and recall for the minority class, indicating room for improvement in delay prediction.



## Overall Summary/Findings

* The model performs well for non-delayed flights (high recall and F1-score for class 0) but poorly for delayed flights, with low recall and F1-score for class 1.

* The low recall for class 1 on both sets indicates that the model often misses delayed flights, which could be problematic if predicting delays is crucial.

* The ROC curve and AUC indicate moderate overall performance in class discrimination.

* The Precision-Recall curve highlights challenges in achieving a good balance between precision and recall for delays, especially given the class imbalance.

* Potential improvements include threshold tuning, alternative classifiers, or additional sampling strategies to improve recall for the delayed class.

## Next Steps

To improve the recall and F1-score for the minority class (delayed flights), we can try different resampling techniques. SMOTE is a popular choice, but we might also consider:

* ADASYN (Adaptive Synthetic Sampling): ADASYN is a variation of SMOTE that focuses on creating more synthetic samples for minority class points that are harder to classify.

* Borderline-SMOTE: This variant of SMOTE generates synthetic points only near the decision boundary, potentially improving the model’s ability to differentiate between classes near the boundary.

* Undersample the Majority Class: Sometimes a combination of oversampling the minority class and undersampling the majority class can improve model performance, as this reduces the imbalance while also minimizing the risk of overfitting to synthetic samples.
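As a small illustration of the undersampling idea in the last bullet, here is a plain-Python sketch that randomly drops majority-class rows from a training set until the classes are the same size. The helper name `random_undersample` and the toy data are assumptions for this example. The imbalanced-learn package provides ready-made resamplers (for instance `ADASYN`, `BorderlineSMOTE`, and `RandomUnderSampler`) that would normally be used instead.

```python
import random


def random_undersample(X, y, majority_label=0, seed=42):
    """Randomly drop majority-class rows until both classes have equal size."""
    rng = random.Random(seed)
    majority_idx = [i for i, label in enumerate(y) if label == majority_label]
    minority_idx = [i for i, label in enumerate(y) if label != majority_label]
    # Keep only as many majority rows as there are minority rows.
    n_keep = min(len(minority_idx), len(majority_idx))
    kept_majority = rng.sample(majority_idx, k=n_keep)
    keep = sorted(kept_majority + minority_idx)
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical training rows: 8 on-time flights (0) and 2 delayed flights (1).
X_demo = [[float(i)] for i in range(10)]
y_demo = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
X_bal, y_bal = random_undersample(X_demo, y_demo)
print(y_bal.count(0), y_bal.count(1))  # prints 2 2
```

As with SMOTE, any resampling of this kind should be applied to the training set only, leaving the validation and test sets untouched.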

We can also try adjusting the decision threshold:

* By default, many probabilistic classifiers use a threshold of 0.5 for binary classification, while LinearSVC's predict() cuts the decision_function score at 0. We can lower this threshold so that more flights are classified as "Delayed," which increases the recall of the minority class.

* We can experiment with different thresholds and evaluate the effect on precision, recall, and F1-score using the validation set. The ideal threshold will depend on the specific trade-off we want (e.g., a higher recall at the cost of precision).
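A threshold sweep of this kind might look like the following sketch. The labels and scores are hypothetical stand-ins for `y_val` and `svm_model.decision_function(X_val_scaled)`; the helper names and the candidate thresholds are assumptions for illustration. For LinearSVC, the default prediction corresponds to a cut at 0 on the decision score, so lowering the cut below 0 flags more flights as delayed.

```python
def metrics_at_threshold(y_true, scores, threshold):
    """Precision, recall, and F1 for the "Delay" class when a flight is
    flagged as delayed whenever its score is >= threshold."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


def best_f1_threshold(y_true, scores, candidates):
    """Return the candidate threshold with the highest Delay-class F1."""
    return max(candidates,
               key=lambda th: metrics_at_threshold(y_true, scores, th)[2])


# Hypothetical validation labels and decision_function-style scores.
y_val_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
val_scores_demo = [-1.2, -0.9, -0.7, -0.4, -0.3, 0.3, -0.6, -0.2, 0.1, 0.8]

candidates = [-1.0, -0.75, -0.5, -0.25, 0.0, 0.25]
for th in candidates:
    p, r, f1 = metrics_at_threshold(y_val_demo, val_scores_demo, th)
    print(f"threshold={th:+.2f}  precision={p:.2f}  recall={r:.2f}  f1={f1:.2f}")

chosen = best_f1_threshold(y_val_demo, val_scores_demo, candidates)
print("Chosen threshold:", chosen)
```

The threshold should be chosen on the validation set and then applied unchanged to the test set, for example via `(svm_model.decision_function(X_test_scaled) >= chosen).astype(int)`, so that the test metrics remain an unbiased estimate.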
