# 🧠 Flight Delay Prediction using Logistic Regression

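In this project, we'll build a logistic regression model that predicts whether a flight will be delayed. The notebook runs end to end on a placeholder dataset with a single feature; swap in real flight data with delay information to get meaningful predictions.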

## 📦 Step 1: Import Libraries
We'll import the necessary libraries for data manipulation, visualization, preprocessing, and modeling.

In [None]:
# 📦 Step 1: Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
import joblib

## 📥 Step 2: Load Dataset
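We'll load a sample airline dataset as a placeholder. It contains monthly passenger counts (`Month`, `Passengers`) and no delay information, so in a real scenario replace it with a dataset that records flight delays.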

In [None]:
# 📥 Step 2: Load Dataset
# You can replace this with a real dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv'  # Placeholder
df = pd.read_csv(url)
df.head()
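
When a real dataset replaces the placeholder, the binary target usually has to be derived from a delay measurement. The sketch below shows one way to do that. The column names (`dep_hour`, `distance_km`, `arr_delay_min`), the values, and the 15-minute cutoff are all assumptions made for illustration, not part of this notebook's data.

```python
import pandas as pd

# Hypothetical flight records; column names and values are invented for illustration
flights = pd.DataFrame({
    "dep_hour": [6, 9, 13, 17, 21],
    "distance_km": [350, 1200, 800, 2400, 600],
    "arr_delay_min": [-5, 22, 0, 47, 14],
})

# Assumed labeling rule: a flight counts as delayed when it arrives 15+ minutes late
DELAY_THRESHOLD_MIN = 15
flights["delayed"] = (flights["arr_delay_min"] >= DELAY_THRESHOLD_MIN).astype(int)

print(flights)
print(flights["delayed"].value_counts())
```

The threshold is a modeling choice: a stricter or looser cutoff changes the class balance, so it is worth checking `value_counts()` after labeling.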

## 🔍 Step 3: Explore Dataset
We'll inspect the dataset’s structure, data types, and summary statistics.

In [None]:
# 🔍 Step 3: Explore Dataset
df.info()
df.describe()
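
Beyond `info()` and `describe()`, two quick checks are often useful at this stage: counting missing values and converting date-like strings to a proper datetime type. The sketch below uses a small stand-in frame named `df_demo` (a hypothetical name) with the same two columns as the placeholder CSV; the values, including the deliberately missing one, are illustrative.

```python
import pandas as pd

# Small stand-in with the same two columns as the placeholder CSV (values are illustrative)
df_demo = pd.DataFrame({
    "Month": ["1949-01", "1949-02", "1949-03", "1949-04"],
    "Passengers": [112, 118, None, 129],
})

# Count missing values per column
missing = df_demo.isna().sum()
print(missing)

# Parse the year-month strings into datetimes so they can be sorted and resampled
df_demo["Month"] = pd.to_datetime(df_demo["Month"], format="%Y-%m")
print(df_demo.dtypes)
```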

## 📊 Step 4: Exploratory Data Analysis (EDA)
Visualizing feature distributions helps us understand patterns in the data.

In [None]:
# 📊 Step 4: Exploratory Data Analysis (EDA)
plt.figure(figsize=(10, 6))
sns.histplot(df['Passengers'], kde=True)
plt.title('Passenger Count Distribution')
plt.xlabel('Passengers')
plt.ylabel('Frequency')
plt.show()

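## 🧼 Step 5: Preprocessing
We'll create a placeholder target label and select the feature column. Because this target is derived directly from the only feature, the classification task is trivially easy; with real data, the label should come from an actual delay column. Scaling is deferred to Step 6 so that the scaler is fit on training data only.

In [None]:
# 🧼 Step 5: Preprocessing
# Placeholder target: 1 if the passenger count is above the mean, else 0
df['Target'] = np.where(df['Passengers'] > df['Passengers'].mean(), 1, 0)
X = df[['Passengers']]
y = df['Target']

## 🧪 Step 6: Train-Test Split
We’ll split the dataset into training and testing sets to evaluate model performance, then scale the features using statistics from the training set alone.

In [None]:
# 🧪 Step 6: Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)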
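
One way to guarantee that the scaler only ever sees training data is to bundle it with the model in a scikit-learn `Pipeline`. The following is a minimal sketch on synthetic data, not the notebook's dataset; the names `X_demo`, `y_demo`, and `pipe` are hypothetical, and `stratify` is added here to keep the class ratio equal across splits.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic single-feature data with a noisy binary label (illustration only)
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=280, scale=120, size=(200, 1))
y_demo = (X_demo[:, 0] + rng.normal(scale=40, size=200) > 280).astype(int)

# Split first; stratify keeps the class proportions similar in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on X_tr only, then reuses it when scoring X_te
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)
print("Test accuracy:", pipe.score(X_te, y_te))
```

A pipeline also simplifies cross-validation and model persistence, since the preprocessing and the classifier travel together as one object.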

## 🧠 Step 7: Train the Model
Train a logistic regression model on the training data.

In [None]:
# 🧠 Step 7: Train the Model
model = LogisticRegression()
model.fit(X_train, y_train)
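
To see what the fitted model computes, it can help to reproduce its probabilities by hand. For a binary problem, logistic regression passes a linear score through the sigmoid function. The sketch below checks this on synthetic data; `clf`, `X_demo`, and `y_demo` are hypothetical names used only for this illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic single-feature binary problem (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 1))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

clf = LogisticRegression().fit(X_demo, y_demo)

# Linear score z = w * x + b, using the learned coefficient and intercept
z = X_demo @ clf.coef_.ravel() + clf.intercept_[0]

# Sigmoid maps the score to the probability of class 1
manual_proba = 1.0 / (1.0 + np.exp(-z))

sklearn_proba = clf.predict_proba(X_demo)[:, 1]
print("Max difference:", np.abs(manual_proba - sklearn_proba).max())
```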

## ✅ Step 8: Evaluate the Model
We’ll report accuracy, precision, recall, and F1 score, then plot the confusion matrix and the ROC curve with its AUC.

In [None]:
# ✅ Step 8: Evaluate the Model
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]
print(classification_report(y_test, y_pred))
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d')
plt.title('Confusion Matrix')
plt.show()

fpr, tpr, _ = roc_curve(y_test, y_proba)
plt.plot(fpr, tpr, label=f'AUC = {roc_auc_score(y_test, y_proba):.2f}')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
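
The numbers in the classification report follow directly from the confusion matrix. As a small worked sketch, the code below derives precision, recall, and F1 for the positive class from the four matrix cells and compares them with scikit-learn's own functions. The label arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up true labels and predictions (illustration only)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(precision_score(y_true, y_hat), recall_score(y_true, y_hat), f1_score(y_true, y_hat))
```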

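## 💾 Step 9: Save the Model
Save the trained model and the fitted scaler using joblib for future inference. New inputs must pass through the same scaler the model was trained with.

In [None]:
# 💾 Step 9: Save the Model
joblib.dump(model, 'flight_delay_model.pkl')
# Persist the scaler too, so inference can reproduce the training-time scaling
joblib.dump(scaler, 'flight_delay_scaler.pkl')

## 🔮 Step 10: Predict on New or Unseen Data
We’ll test the model with a new, unseen input to check the prediction.

In [None]:
# 🔮 Step 10: Predict on New or Unseen Data
# Use a DataFrame with the training column name so the scaler sees matching feature names
sample = pd.DataFrame({'Passengers': [140]})
sample_scaled = scaler.transform(sample)
pred = model.predict(sample_scaled)
print('Prediction:', 'Delayed' if pred[0] == 1 else 'On-Time')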
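
Saving and predicting can be combined into a round trip: persist the fitted objects, reload them, and confirm the reloaded copy gives the same answer. The sketch below assumes the scaler and model are bundled in a `Pipeline` and uses synthetic training data; the file name `flight_delay_pipeline.joblib` and the variable names are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data with the placeholder labeling rule (illustration only)
rng = np.random.default_rng(1)
train = pd.DataFrame({"Passengers": rng.integers(100, 600, size=120)})
labels = (train["Passengers"] > train["Passengers"].mean()).astype(int)

pipe = make_pipeline(StandardScaler(), LogisticRegression()).fit(train, labels)

# Write to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "flight_delay_pipeline.joblib")
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline scales and predicts in a single call
sample = pd.DataFrame({"Passengers": [140]})
proba = restored.predict_proba(sample)[0, 1]
label = "Delayed" if proba >= 0.5 else "On-Time"
print(f"P(delayed) = {proba:.3f} -> {label}")
```

Working with `predict_proba` rather than `predict` keeps the decision threshold explicit, which makes it easier to trade precision against recall later.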

## 🧠 Final Summary
- Applied logistic regression to a placeholder airline dataset standing in for real flight-delay data
- Practiced data preprocessing and binary classification
- Evaluated using precision, recall, F1 score, and AUC
- Visualized confusion matrix and ROC curve