# Airline Passenger Satisfaction Prediction using AutoML

## Project Overview
Customer satisfaction is a critical metric in the airline industry.  
This notebook applies **AutoML** using **AutoGluon** to predict whether a passenger is **satisfied** or **neutral/dissatisfied** based on survey and service quality data.

### Workflow
1. Data exploration and preprocessing  
2. AutoML model training and leaderboard evaluation  
3. Selecting the best model (WeightedEnsemble_L2)  
4. Feature importance analysis  
5. Model evaluation on test data  
6. Actionable business insights

## Dataset

**Source:** Kaggle – Airline Passenger Satisfaction  
**Link:** https://www.kaggle.com/datasets/teejmahal20/airline-passenger-satisfaction  

**Target Variable:** `satisfaction`  
**Classes:**  
- `satisfied`  
- `neutral or dissatisfied`  

**Features:**  
22 input features (24 columns once the `id` column and the `satisfaction` target are counted), covering customer demographics, flight details, service quality ratings, and operational metrics.

## Objectives
- Train an AutoML model to predict passenger satisfaction
- Identify key drivers of satisfaction
- Generate actionable insights for airline service improvement

## Tools & Technologies
- Python
- Pandas
- AutoGluon
- Matplotlib & Seaborn
- Kaggle Notebook Environment


In [None]:
# Enable internet in Kaggle before running this cell
!pip install -U autogluon --quiet

# Verify Installation
from autogluon.tabular import TabularPredictor
print("AutoGluon installed successfully")

## Dataset Description

### Feature Categories
- **Customer Information:** Gender, Age, Customer Type  
- **Flight Details:** Class, Type of Travel, Flight Distance  
- **Service Quality Ratings:** Seat comfort, Inflight wifi, Cleanliness, Food & drink, Inflight entertainment, Inflight service, On-board service, Leg room service, Gate location, Checkin service, Online boarding, Baggage handling, Ease of online booking, Departure/Arrival time convenient  
- **Operational Metrics:** Departure delay, Arrival delay

In [None]:
import pandas as pd

# Load train and test datasets
train_df = pd.read_csv("/kaggle/input/airline-passenger-satisfaction/train.csv", index_col=0)
test_df  = pd.read_csv("/kaggle/input/airline-passenger-satisfaction/test.csv", index_col=0)

train_df.head()

## Data Inspection

We inspect the dataset to understand:
- Data types and missing values
- Target class distribution
- Readiness for AutoML training

In [None]:
# Dataset info
train_df.info()

In [None]:
# Target class distribution
train_df['satisfaction'].value_counts()

## Why AutoML?

Traditional ML requires:
- Feature preprocessing
- Model selection
- Hyperparameter tuning
- Ensemble construction

AutoML automates these steps, enabling:
- Training multiple models efficiently
- Building strong ensembles
- Reliable feature importance extraction
- Faster insight generation

## Model Training with AutoGluon

We use **TabularPredictor** with **F1 Macro** as the evaluation metric, which is more reliable than accuracy for slightly imbalanced classification tasks.
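
To illustrate why macro F1 can be more informative than accuracy, the sketch below computes both by hand on made-up labels. The class split (8 vs. 2) and the always-majority "model" are assumptions chosen for illustration and are deliberately more imbalanced than the real dataset; the notebook itself relies on AutoGluon's built-in `f1_macro` metric.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted and never present scores 0 here
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy labels (made up): 8 of 10 passengers are "neutral or dissatisfied"
y_true = ["neutral or dissatisfied"] * 8 + ["satisfied"] * 2
# A degenerate model that always predicts the majority class
y_pred = ["neutral or dissatisfied"] * 10

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"accuracy = {acc:.3f}, macro F1 = {f1:.3f}")
```

Accuracy rewards the majority-class guess, while macro F1 is pulled down because the `satisfied` class is never predicted correctly.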

In [None]:
from autogluon.tabular import TabularPredictor

# Drop identifier column
train_df_clean = train_df.drop(columns=["id"])
test_df_clean  = test_df.drop(columns=["id"])

# Train AutoML model
predictor = TabularPredictor(
    label="satisfaction",
    eval_metric="f1_macro"  # better than accuracy for slightly imbalanced classes
).fit(
    train_data=train_df_clean,
)

In [None]:
# View leaderboard
predictor.leaderboard(silent=True)

## Model Selection

After training multiple models, we review the leaderboard:

- **WeightedEnsemble_L2** achieved the highest validation **F1 Macro score (~0.97)**  
- Other strong models include:
  - NeuralNetFastAI  
  - LightGBMLarge  
  - XGBoost  
  - LightGBM  

**Decision:**  
Use **WeightedEnsemble_L2** as the final model due to its superior and balanced performance across both satisfaction classes.
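
As a rough intuition for what a weighted ensemble does, the sketch below blends per-model probabilities with fixed weights. The weights and probabilities are made up, and the function `weighted_ensemble_proba` is a hypothetical helper, not part of AutoGluon; AutoGluon selects its own ensemble weights from validation data, so this shows only the averaging step.

```python
def weighted_ensemble_proba(model_probas, weights):
    """Blend per-model P(satisfied) values using normalized weights."""
    total = sum(weights.values())
    n_rows = len(next(iter(model_probas.values())))
    blended = []
    for i in range(n_rows):
        score = sum(weights[name] * model_probas[name][i] for name in weights)
        blended.append(score / total)
    return blended


# Hypothetical P(satisfied) for three passengers from three base models
model_probas = {
    "LightGBM": [0.90, 0.40, 0.10],
    "XGBoost": [0.80, 0.60, 0.20],
    "NeuralNetFastAI": [0.70, 0.30, 0.30],
}
# Hypothetical ensemble weights (illustrative only)
weights = {"LightGBM": 0.5, "XGBoost": 0.3, "NeuralNetFastAI": 0.2}

blended = weighted_ensemble_proba(model_probas, weights)
labels = ["satisfied" if p >= 0.5 else "neutral or dissatisfied" for p in blended]
print(blended, labels)
```

For the second passenger the base models disagree (0.40, 0.60, 0.30), and the blended probability falls below 0.5 because the more heavily weighted models lean toward `neutral or dissatisfied`.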


## Feature Importance Analysis

We analyze which features most influence passenger satisfaction based on the final ensemble model.
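
AutoGluon's `feature_importance` is based on permutation importance: shuffle one feature's values and measure how much the score drops. A minimal sketch of that idea follows, assuming a hypothetical rule-based `toy_model`, made-up rows, and accuracy as the score for brevity (the notebook's predictor uses F1 Macro).

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows where the model's prediction matches the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, feature, n_repeats=20, seed=0):
    """Average drop in accuracy after shuffling one feature's values."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(n_repeats):
        shuffled_values = [r[feature] for r in rows]
        rng.shuffle(shuffled_values)
        shuffled_rows = [{**r, feature: v} for r, v in zip(rows, shuffled_values)]
        drops.append(baseline - accuracy(model, shuffled_rows, labels))
    return sum(drops) / n_repeats


# Hypothetical stand-in model: looks only at the wifi rating, ignores age
def toy_model(row):
    return "satisfied" if row["wifi"] >= 4 else "neutral or dissatisfied"


# Made-up (wifi rating, age) pairs
pairs = [(5, 23), (1, 45), (4, 31), (2, 60), (5, 52), (3, 38), (1, 27), (4, 49)]
rows = [{"wifi": w, "age": a} for w, a in pairs]
labels = [toy_model(r) for r in rows]

wifi_importance = permutation_importance(toy_model, rows, labels, "wifi")
age_importance = permutation_importance(toy_model, rows, labels, "age")
print(f"wifi: {wifi_importance:.3f}, age: {age_importance:.3f}")
```

Shuffling a feature the model ignores (`age` here) leaves the score unchanged, so its importance is zero, while shuffling a feature the model depends on lowers the score.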

In [None]:
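# Compute permutation-based feature importance.
# Note: scores computed on the training data can be optimistic; AutoGluon's
# documentation recommends held-out data (e.g. test_df_clean) for this step.
feature_importance = predictor.feature_importance(train_df_clean)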
feature_importance

### Feature Importance Summary

#### Top Drivers of Satisfaction
1. **Inflight wifi service**
2. **Type of Travel**
3. **Gate location**
4. **Customer Type**
5. **Baggage handling**

#### Moderate Influence
- Inflight service  
- Online boarding  
- Seat comfort  
- Checkin service  
- Class  
- On-board service  
- Inflight entertainment  

#### Lower Influence
- Age  
- Cleanliness  
- Leg room service  
- Delays  
- Flight distance  
- Food & drink  
- Gender  

### Business Interpretation
- Service quality and convenience dominate satisfaction
- Operational improvements in **wifi, boarding, baggage, and gate operations** yield the highest impact
- Demographic factors (age, gender) play a minor role

## Feature Importance Visualization

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Prepare feature importance for plotting (feature names are in the index)
df = feature_importance.reset_index().rename(columns={"index": "feature"})

# Categorize features by influence
def categorize(value):
    if value >= 0.05:
        return "Top Influence"
    elif value >= 0.02:
        return "Moderate Influence"
    else:
        return "Lower Influence"

df["category"] = df["importance"].apply(categorize)

# Plot horizontal bar chart
plt.figure(figsize=(10, 8))
sns.barplot(
    data=df.sort_values("importance", ascending=True),
    x="importance",
    y="feature",
    hue="category",
    dodge=False
)
plt.title("Feature Importance for Passenger Satisfaction")
plt.xlabel("Importance Score")
plt.ylabel("Feature")
plt.tight_layout()
plt.show()

## Model Evaluation

We evaluate the final model on unseen test data using:
- Precision, Recall, F1-score
- Confusion Matrix

This estimates how well the model generalizes to passengers it did not see during training.
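
To show how precision and recall are read off a confusion matrix, the sketch below builds one by hand with an explicit class order. The labels and predictions are made up for illustration, and `confusion_counts` is a hypothetical helper; the notebook itself uses scikit-learn for these metrics.

```python
def confusion_counts(y_true, y_pred, labels):
    """Rows are actual classes, columns are predicted classes, in `labels` order."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


labels = ["neutral or dissatisfied", "satisfied"]
nd, sat = labels

# Made-up ground truth and predictions for 10 passengers
y_true = [nd] * 6 + [sat] * 4
y_pred = [nd, nd, nd, nd, sat, sat] + [sat, sat, sat, nd]

cm = confusion_counts(y_true, y_pred, labels)

# Metrics for the "satisfied" class (row/column index 1)
true_pos = cm[1][1]
precision_sat = true_pos / (cm[0][1] + cm[1][1])  # of predicted satisfied, how many were
recall_sat = true_pos / (cm[1][0] + cm[1][1])     # of actual satisfied, how many were found
print(cm, precision_sat, recall_sat)
```

Fixing the class order up front keeps the matrix rows and columns aligned with whatever tick labels are used when plotting.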


In [None]:
from sklearn.metrics import classification_report, confusion_matrix

# Predict on the test set, using the same cleaned columns as in training
y_true = test_df_clean['satisfaction']
y_pred = predictor.predict(test_df_clean)

# Classification report
print(classification_report(y_true, y_pred))

In [None]:
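# Confusion matrix with an explicit class order so the tick labels match rows/columns
labels = ["neutral or dissatisfied", "satisfied"]
cm = confusion_matrix(y_true, y_pred, labels=labels)

plt.figure(figsize=(6, 4))
sns.heatmap(
    cm,
    annot=True,
    fmt="d",
    cmap="Blues",
    xticklabels=["Neutral/Dissatisfied", "Satisfied"],
    yticklabels=["Neutral/Dissatisfied", "Satisfied"]
)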
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.tight_layout()
plt.show()

## Business Insights & Conclusion

### Key Insights
- **Service quality is the strongest driver** of satisfaction
- **Customer segmentation matters** (Type of Travel, Customer Type)
- **Operational focus areas:** wifi, boarding process, baggage handling
- Minor features (age, gender, food & drink) have limited impact

### Conclusion
The AutoML pipeline successfully delivers a **high-performing, well-generalized model** using **WeightedEnsemble_L2**.  
Feature importance analysis provides **clear, actionable insights** that airlines can use to improve customer satisfaction and operational efficiency.
