
# ✈️ Airline Sentiment Review System (Jupyter Notebook)

This notebook builds an **end-to-end sentiment analysis pipeline** for airline reviews/tweets using **Python, pandas, scikit-learn, and TF-IDF + Logistic Regression**.  
It includes: data loading, cleaning, EDA, model training, evaluation, and exporting a reusable predictor.



## 🚀 How to Use
1. If you have the Kaggle dataset (e.g., `Tweets.csv` from *Twitter US Airline Sentiment*), place it next to this notebook and set `DATA_PATH` accordingly.  
2. Otherwise, the notebook auto-generates a **small sample dataset** so you can run everything end-to-end.
3. Run the cells **top to bottom**.


## 📦 Install & Import Libraries

In [None]:

# If running locally and you don't have the packages, uncomment and run:
# !pip install -U pandas scikit-learn matplotlib numpy joblib

import os
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import joblib

pd.set_option("display.max_colwidth", 120)


## 🧾 Load Data

In [None]:

# Set your dataset path here. If the file is not found, a tiny demo dataset is created.
DATA_PATH = "Tweets.csv"  # Change this to your file (e.g., 'airline_tweets.csv')

def load_or_create_demo(path):
    if os.path.exists(path):
        print(f"✅ Loading dataset from: {path}")
        df = pd.read_csv(path)
        # Try to auto-detect columns
        possible_text_cols = ["text", "review", "tweet", "content", "body"]
        possible_label_cols = ["airline_sentiment", "sentiment", "label", "target"]
        text_col = next((c for c in possible_text_cols if c in df.columns), None)
        label_col = next((c for c in possible_label_cols if c in df.columns), None)
        if not text_col or not label_col:
            raise ValueError(f"Could not detect text/label columns. Found columns: {list(df.columns)}\n"
                             f"Expected a text column named one of {possible_text_cols} "
                             f"and a label column named one of {possible_label_cols}.")
        # Drop rows with a missing text or label so NaN never becomes the string "nan" downstream
        return (df[[text_col, label_col]]
                .rename(columns={text_col: "text", label_col: "sentiment"})
                .dropna())
    else:
        print("⚠️ Dataset not found. Creating a small demo dataset instead.")
        data = {
            "text": [
                "Loved the flight, staff were amazing and seats were comfy",
                "Terrible delay and rude service, never flying this airline again",
                "Average experience, nothing special but it was on time",
                "Great crew and smooth landing!",
                "Lost my luggage and no one helped, extremely disappointed",
                "Check-in was quick and easy, happy with the service",
                "Flight canceled without proper notice, very bad management",
                "Snacks were good and plane was clean",
                "Long layover and no updates, frustrated",
                "Pilot made clear announcements and cabin felt safe"
            ],
            "sentiment": [
                "positive", "negative", "neutral", "positive", "negative",
                "positive", "negative", "positive", "negative", "positive"
            ]
        }
        return pd.DataFrame(data)

df = load_or_create_demo(DATA_PATH)
print(df.head())
print("\nClass balance:\n", df['sentiment'].value_counts())
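
The column auto-detection used by the loader can be tried in isolation. The sketch below is illustrative only: `detect_column` is a hypothetical helper and the three-row frame is made up; neither is part of the notebook.

```python
import pandas as pd

def detect_column(df, candidates):
    """Return the first candidate name found in df.columns, or None (hypothetical helper)."""
    return next((c for c in candidates if c in df.columns), None)

# Made-up frame whose column names mimic a typical tweet export
raw = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "airline_sentiment": ["positive", "negative", "neutral"],
    "text": ["Great crew", "Lost my bag", "On time"],
})

text_col = detect_column(raw, ["text", "review", "tweet", "content", "body"])
label_col = detect_column(raw, ["airline_sentiment", "sentiment", "label", "target"])

# Normalize to the two column names the rest of the notebook expects
normalized = raw[[text_col, label_col]].rename(columns={text_col: "text", label_col: "sentiment"})
print(list(normalized.columns))
```

Note that the candidate order decides ties: if a file contained both `text` and `review`, the first name in the candidate list would win.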


## 👀 Quick EDA

In [None]:

# Basic length features
df['len'] = df['text'].str.len()
print("Text length (chars):\n", df['len'].describe())

# Plot class balance
counts = df['sentiment'].value_counts()
plt.figure()
counts.plot(kind='bar', title='Class Balance')
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.show()
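
Length statistics are often more informative when broken down by class. As a possible extension (a sketch on a made-up three-row frame, with `n_words` as an assumed extra feature that the notebook does not compute), a `groupby` gives per-sentiment averages:

```python
import pandas as pd

# Made-up rows, one per sentiment class
demo = pd.DataFrame({
    "text": ["Great crew and smooth landing!", "Lost my luggage and no one helped", "It was on time"],
    "sentiment": ["positive", "negative", "neutral"],
})

demo["len"] = demo["text"].str.len()                   # characters per text
demo["n_words"] = demo["text"].str.split().str.len()   # whitespace-separated tokens per text

# Mean length features per sentiment class
per_class = demo.groupby("sentiment")[["len", "n_words"]].mean()
print(per_class)
```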


## 🧹 Preprocess & Split

In [None]:

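# We will rely on TfidfVectorizer's built-in cleaning (lowercasing, tokenization, and English stopwords).
X = df['text'].astype(str).values
y = df['sentiment'].astype(str).values

# Stratify only when feasible: every class needs at least 2 rows, and each split must be able
# to hold one row per class. The 10-row demo set (a single 'neutral' row) does not qualify.
TEST_SIZE = 0.2
_, class_counts = np.unique(y, return_counts=True)
n_test = int(np.ceil(TEST_SIZE * len(y)))
can_stratify = class_counts.min() >= 2 and min(n_test, len(y) - n_test) >= len(class_counts) > 1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=TEST_SIZE, random_state=42, stratify=y if can_stratify else None
)

len(X_train), len(X_test)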
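
To see what the `stratify` argument buys, the following sketch splits a synthetic, imbalanced label array (the 60/25/15 class mix is an assumption chosen for illustration, not a property of any real dataset) and counts the classes that land in the test set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 60 negative, 25 neutral, 15 positive
labels = np.array(["negative"] * 60 + ["neutral"] * 25 + ["positive"] * 15)
texts = np.array([f"tweet {i}" for i in range(len(labels))])

_, _, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# With stratification the 20-row test set keeps the 60/25/15 class mix
classes, counts = np.unique(y_te, return_counts=True)
test_mix = {str(c): int(n) for c, n in zip(classes, counts)}
print(test_mix)
```

Without `stratify`, the test-set mix would depend on the random shuffle, which makes per-class metrics on small or skewed data harder to compare across runs.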


## 🤖 Model: TF-IDF + Logistic Regression

In [None]:

pipeline = Pipeline([
    # Unigrams + bigrams, English stop words removed, vocabulary capped at 30,000 terms
    ("tfidf", TfidfVectorizer(stop_words="english", ngram_range=(1, 2), max_features=30000)),
    ("clf", LogisticRegression(max_iter=1000))
])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

acc = accuracy_score(y_test, y_pred) if len(X_test) else None
print(f"Accuracy: {acc:.4f}" if acc is not None else "Accuracy: (demo set too small)")
print("\nClassification report:\n", classification_report(y_test, y_pred, zero_division=0) if len(X_test) else "N/A")

# Confusion Matrix
if len(X_test):
    cm = confusion_matrix(y_test, y_pred, labels=np.unique(y))
    plt.figure()
    plt.imshow(cm, interpolation='nearest')
    plt.title('Confusion Matrix')
    plt.xticks(range(len(np.unique(y))), np.unique(y), rotation=45)
    plt.yticks(range(len(np.unique(y))), np.unique(y))
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            plt.text(j, i, cm[i, j], ha="center", va="center")
    plt.xlabel('Predicted')
    plt.ylabel('True')
    plt.tight_layout()
    plt.show()


## 🔁 Cross-Validation (optional)

In [None]:

if len(df) >= 50 and len(np.unique(y)) > 1:
    scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
    print("CV accuracy (5-fold):", scores)
    print("Mean ± Std:", scores.mean(), "±", scores.std())
else:
    print("Dataset too small for meaningful CV; skipping.")
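
Accuracy can look good on imbalanced data even when a minority class is poorly predicted. A possible variation, sketched here on a synthetic corpus, is to pass an explicit shuffled `StratifiedKFold` and score with macro-averaged F1. The corpus is trivially separable, so the scores it prints say nothing about real tweets; only the wiring is the point.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic, imbalanced toy corpus (30 / 15 / 10 rows per class)
texts = ["great friendly crew"] * 30 + ["terrible rude delay"] * 15 + ["average on time"] * 10
labels = ["positive"] * 30 + ["negative"] * 15 + ["neutral"] * 10

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Shuffled, stratified folds with a fixed seed for repeatable results
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, texts, labels, cv=cv, scoring="f1_macro")
print(f"Macro-F1 (5-fold): {scores.mean():.3f} ± {scores.std():.3f}")
```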


## 💾 Save Model & Vectorizer

In [None]:

MODEL_PATH = "airline_sentiment_model.joblib"
joblib.dump(pipeline, MODEL_PATH)
print(f"✅ Saved model to {MODEL_PATH}")
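
Because the whole `Pipeline` object is dumped, the fitted vectorizer and the classifier travel together in one file. A quick round-trip check, sketched below with a toy pipeline and a temporary directory (the file name `toy_model.joblib` is hypothetical), confirms that a reloaded pipeline predicts the same labels as the in-memory one:

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up training sentences
texts = ["great friendly crew", "smooth flight great staff", "terrible rude delay", "lost luggage rude staff"]
labels = ["positive", "positive", "negative", "negative"]

toy = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(texts, labels)

# Dump and reload inside a temporary directory so nothing is left on disk
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.joblib")
    joblib.dump(toy, path)
    restored = joblib.load(path)

same = list(restored.predict(texts)) == list(toy.predict(texts))
print("Round trip preserved predictions:", same)
```

A joblib file is a pickle under the hood, so it should only be loaded from trusted sources and, ideally, with the same scikit-learn version that produced it.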


## 🔮 Try Predictions

In [None]:

loaded = joblib.load(MODEL_PATH)  # same path the model was saved to above
samples = [
    "Flight was on time and staff were very friendly.",
    "Worst experience ever, delayed and rude attendants.",
    "It was okay, nothing great but not bad either."
]
preds = loaded.predict(samples)
for s, p in zip(samples, preds):
    print(f"{p:8s} | {s}")
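
A hard label hides how confident the model is, which matters for mixed reviews. Logistic Regression exposes class probabilities through `predict_proba`; the sketch below shows the call on a toy pipeline trained on four made-up sentences (data and names are illustrative assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up training sentences
texts = ["great friendly crew", "friendly staff great flight", "terrible rude delay", "rude staff lost luggage"]
labels = ["positive", "positive", "negative", "negative"]

toy = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(texts, labels)

# A deliberately mixed review: contains both positive and negative cue words
sample = ["crew was friendly but the delay was terrible"]
probs = toy.predict_proba(sample)[0]

# Columns of predict_proba follow the order of classes_
for cls, p in zip(toy.classes_, probs):
    print(f"{cls:8s} {p:.2f}")
```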


## 📝 Project Summary (for forms/portfolio)

In [None]:

summary = {
    "project": "Airline Sentiment Review System",
    "stack": ["Python", "pandas", "scikit-learn", "TF-IDF", "Logistic Regression", "matplotlib"],
    "features": [
        "Data loading with auto-detection of text/label columns",
        "EDA (class balance, text lengths)",
        "TF-IDF + Logistic Regression pipeline",
        "Evaluation (accuracy, classification report, confusion matrix)",
        "Model export with joblib",
        "Quick inference examples"
    ],
    "usage": [
        "Set DATA_PATH to your dataset filename (e.g., 'Tweets.csv')",
        "Run cells top-to-bottom",
        "Use the saved model for production inference"
    ]
}
print(json.dumps(summary, indent=2))
