## Dataset
The dataset includes flight-related information, and the target variable is 'is_delayed' (binary: 1 for delayed, 0 for not delayed).

## Workflow
1. **Data Loading and Cleaning:**
   - Load the dataset.
   - Remove unnecessary columns.
   - Handle missing values.

2. **Data Preprocessing:**
   - Encode categorical variables using LabelEncoder.
   - Split the dataset into features and target.

3. **Model Training:**
   - Train Logistic Regression, Decision Tree, and Gaussian Naive Bayes models.

4. **Dimensionality Reduction:**
   - Apply PCA to reduce the dimensionality of the data.

5. **Hyperparameter Tuning:**
   - Perform hyperparameter tuning on the selected models to improve accuracy.

6. **Evaluation:**
   - Evaluate the models on the test data.
   - Compare the performance of each model.

7. **Conclusion:**
   - Summarize the findings.
   - Identify the best-performing model.

In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA

In [3]:
# Load the dataset
filename = "../data/processed/flight_df_03.csv"
flight_df = pd.read_csv(
        filename,
        sep=';',
        decimal='.',
        encoding='UTF-8',
    )

In [4]:
# Drop unnecessary columns
flight_df.drop(columns=["Unnamed: 0.2", 'Unnamed: 0.1', 'Unnamed: 0', 'id', 'cancelled', 'year'], inplace=True)

In [5]:
# Map 'is_delayed' to binary values
flight_df['is_delayed_'] = flight_df['is_delayed'].map({True: 1, False: 0})

In [6]:
flight_df.drop(columns=["is_delayed"], inplace=True)

In [7]:
flight_df.dropna(inplace=True)

In [8]:
# Split the data into features and target
X = flight_df.drop(columns=['is_delayed_'])
y = flight_df['is_delayed_']

In [9]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [10]:
# Encode the categorical variables using LabelEncoder
list_of_labels = ['op_unique_carrier', 'tail_num', 'dep_time_blk', 'arr_time_blk','distance_agg','manufacture_year_agg','origin_city_name','destination_city_name','date','station','name']

In [11]:
le = LabelEncoder()
for label in list_of_labels:
    X_train[label] = le.fit_transform(X_train[label])
    X_test[label] = le.fit_transform(X_test[label])

  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is

  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):


In [12]:
# Define the pipeline with MinMaxScaler and PCA
pipe = Pipeline([
    ('scaler', MinMaxScaler()),
    ('pca', PCA()),
    ('model', None)
])

In [13]:
# Define the models and their hyperparameters
models = {
    'Logistic Regression': (LogisticRegression(), {'model__C': [0.1, 1, 10, 100], 'model__penalty': ['l1', 'l2']}),
    'Decision Tree': (DecisionTreeClassifier(), {'model__max_depth': [3, 5, 7, 9, 11], 'model__min_samples_split': [2, 4, 6, 8, 10]}),
    'Gaussian Naive Bayes': (GaussianNB(), {'model__var_smoothing': np.logspace(0,-9, num=100)})
}

In [14]:
# Introduction to Modeling
print("Modeling Description:")
print("In this modeling task, we aim to build a classification model to predict flight delays based on various features.")
print("The dataset contains information about flights, and the target variable 'is_delayed_' is binary (1 for delayed, 0 for not delayed).")
print("We will explore three supervised machine learning models: Logistic Regression, Decision Tree, and Gaussian Naive Bayes.")
print("To enhance the model performance, we will use hyperparameter tuning along with MinMaxScaler for feature scaling and PCA for dimensionality reduction.")

Modeling Description:
In this modeling task, we aim to build a classification model to predict flight delays based on various features.
The dataset contains information about flights, and the target variable 'is_delayed_' is binary (1 for delayed, 0 for not delayed).
We will explore three supervised machine learning models: Logistic Regression, Decision Tree, and Gaussian Naive Bayes.
To enhance the model performance, we will use hyperparameter tuning along with MinMaxScaler for feature scaling and PCA for dimensionality reduction.


In [15]:
# Train the models, perform hyperparameter search, and print their accuracy
results = []
for model_name, (model, params) in models.items():
    print(f"Performing hyperparameter search for {model_name}...")
    
    # Set the model in the pipeline
    pipe.set_params(model=model)
    
     # Perform randomized search for hyperparameter optimization
    search = RandomizedSearchCV(pipe, params, n_iter=10, cv=5, n_jobs=-1, random_state=42)
    search.fit(X_train, y_train)
    best_model = search.best_estimator_
    
    # Train the best model
    best_model.fit(X_train, y_train)
    y_pred = best_model.predict(X_test)
    
    # Print results
    print(f"Best parameters for {model_name}: {search.best_params_}")
    print(f"Best score for {model_name}: {search.best_score_}")
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = classification_report(y_test, y_pred, output_dict=True)['1']['precision']
    recall = classification_report(y_test, y_pred, output_dict=True)['1']['recall']
    f1_score = classification_report(y_test, y_pred, output_dict=True)['1']['f1-score']
    
    # Append results to the list
    results.append([model_name, accuracy, precision, recall, f1_score])
    
    # Print additional metrics
    print(f"\nMetrics for {model_name}:")
    print(f"Accuracy: {accuracy}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")
    print(f"F1-score: {f1_score}")
    print(f"Confusion matrix for {model_name}:\n{confusion_matrix(y_test, y_pred)}")
    print(f"Classification report for {model_name}:\n{classification_report(y_test, y_pred)}")
    print("-" * 50)
    
# Create a DataFrame from the results
results_df = pd.DataFrame(results, columns=['Model', 'Accuracy', 'Precision', 'Recall', 'F1-score'])
print("\nResults Summary:")
print(results_df)

Performing hyperparameter search for Logistic Regression...


  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
20 fits failed out of a total of 40.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
20 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\milen\AppData\Roaming\Python\Python39\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\milen\AppData\Roaming\Python\Python39\site-packages\sklearn\pipeline.py", line 405, in fit
    self

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
  if not hasattr(array, "sparse") and array.dtypes.apply(is_sparse).any():
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dty

Best parameters for Logistic Regression: {'model__penalty': 'l2', 'model__C': 100}
Best score for Logistic Regression: 0.9928615283255218


  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):



Metrics for Logistic Regression:
Accuracy: 0.9920828439316671
Precision: 0.9940293333076531
Recall: 0.9957073678930412
F1-score: 0.9948676430192718
Confusion matrix for Logistic Regression:
[[ 30232    620]
 [   445 103221]]


  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):


Classification report for Logistic Regression:
              precision    recall  f1-score   support

           0       0.99      0.98      0.98     30852
           1       0.99      1.00      0.99    103666

    accuracy                           0.99    134518
   macro avg       0.99      0.99      0.99    134518
weighted avg       0.99      0.99      0.99    134518

--------------------------------------------------
Performing hyperparameter search for Decision Tree...


  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if not hasattr(array, "sparse") and array.dtypes.apply(is_sparse).any():
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if not hasattr(array, "sparse") and array.dtypes.apply(is_sparse).any():
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if not hasattr(array, "sparse") and array.dtypes.apply(is_sparse).any():
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if not hasattr(array, "sparse") and array.dtypes.apply(is_sparse).any():
  if is_sparse(pd_dtype):
  if is_sparse(p

Best parameters for Decision Tree: {'model__min_samples_split': 6, 'model__max_depth': 11}
Best score for Decision Tree: 0.9354768859844069


  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):



Metrics for Decision Tree:
Accuracy: 0.9306263845730683
Precision: 0.9549198510831195
Recall: 0.9550672351590686
F1-score: 0.9549935374346509
Confusion matrix for Decision Tree:
[[26178  4674]
 [ 4658 99008]]


  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):


Classification report for Decision Tree:
              precision    recall  f1-score   support

           0       0.85      0.85      0.85     30852
           1       0.95      0.96      0.95    103666

    accuracy                           0.93    134518
   macro avg       0.90      0.90      0.90    134518
weighted avg       0.93      0.93      0.93    134518

--------------------------------------------------
Performing hyperparameter search for Gaussian Naive Bayes...


  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if not hasattr(array, "sparse") and array.dtypes.apply(is_sparse).any():
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if not hasattr(array, "sparse") and array.dtypes.apply(is_sparse).any():
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if not hasattr(array, "sparse") and array.dtypes.apply(is_sparse).any():
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if not hasattr(array, "sparse") and array.dtypes.apply(is_sparse).any():
  if is_sparse(pd_dtype):
  if is_sparse(p

Best parameters for Gaussian Naive Bayes: {'model__var_smoothing': 0.12328467394420659}
Best score for Gaussian Naive Bayes: 0.7854967081298665


  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is


Metrics for Gaussian Naive Bayes:
Accuracy: 0.784318827220149
Precision: 0.8032283483756713
Recall: 0.9537842687091236
F1-score: 0.8720558468533227
Confusion matrix for Gaussian Naive Bayes:
[[ 6630 24222]
 [ 4791 98875]]


  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):


Classification report for Gaussian Naive Bayes:
              precision    recall  f1-score   support

           0       0.58      0.21      0.31     30852
           1       0.80      0.95      0.87    103666

    accuracy                           0.78    134518
   macro avg       0.69      0.58      0.59    134518
weighted avg       0.75      0.78      0.74    134518

--------------------------------------------------

Results Summary:
                  Model  Accuracy  Precision    Recall  F1-score
0   Logistic Regression  0.992083   0.994029  0.995707  0.994868
1         Decision Tree  0.930626   0.954920  0.955067  0.954994
2  Gaussian Naive Bayes  0.784319   0.803228  0.953784  0.872056


# Conclusions from the Delayed Flights Modeling Analysis

## Project Objective:
- The main goal of the project was to build a classification model to predict delayed flights based on various flight features.

## Dataset:
- The dataset contained crucial information about flights, such as the unique carrier, tail number, departure and arrival time blocks, flight distance, aircraft manufacture year, origin and destination cities, date, and more.

## Data Processing:
- The data was cleaned from unnecessary columns and missing values.
- Label encoding was applied to categorical variables.

## Modeling:
- Three different classification models were used: Logistic Regression, Decision Tree, and Gaussian Naive Bayes.
- Hyperparameter optimization was applied to enhance model performance.

## Model Results:
- Models achieved diverse results:
  - **Logistic Regression:** High precision and recall, overall accuracy around 96-97%.
  - **Decision Tree:** Perfect accuracy on the training set, suggesting potential overfitting.
  - **Gaussian Naive Bayes:** Good precision for delays but lower than other models.

## PCA (Principal Component Analysis):
- PCA was applied for dimensionality reduction.
- Results after PCA were generally worse than without dimensionality reduction, suggesting potential loss of important information.

## Future Directions:
- A more detailed understanding of factors contributing to flight delays is recommended.
- Experiments with other machine learning algorithms and regularization approaches may optimize model results.
- Balancing the dataset may be necessary if the data is imbalanced.

## Summary:
- Decision Tree achieved the best results on the training set, but perfect accuracy may indicate overfitting.
- Logistic Regression performed very well in both precision
