<a id="top"></a>

## Table of Contents

* [0. Header Section](#0)
    - [0.1 Project title, chosen domain, and team members](#0.1)
    - [0.2 Business motivation and problem statement](#0.2)
    - [0.3 Dataset description and target categories](#0.3)

* [1. Setup and Configuration](#1)
    - [1.1 Import required libraries (pandas, sklearn, nltk, xgboost, torch)](#1.1)
    - [1.2 Environment configuration and random seeds](#1.2)
    - [1.3 Helper functions for preprocessing, visualization, and evaluation](#1.3)

* [2. Data Understanding and Preprocessing](#2)
    - [2.1 Load and inspect the dataset `jason23322/high-accuracy-email-classifier`](#2.1)
    - [2.2 Clean text (remove HTML, punctuation, stopwords, lowercasing)](#2.2)
    - [2.3 Lemmatization / Tokenization with NLTK or spaCy](#2.3)
    - [2.4 Convert text to TF-IDF features](#2.4)
    - [2.5 Dimensionality Reduction with PCA for visualization](#2.5)

* [3. Exploratory Data Analysis (EDA)](#3)
    - [3.1 Analyze class distribution across 6 email categories](#3.1)
    - [3.2 Keyword frequency, message length, and term correlation](#3.2)
    - [3.3 Visualize TF-IDF and PCA projections in 2D space](#3.3)

* [4. Unsupervised Learning (Clustering)](#4)
    - [4.1 Apply K-Means clustering on TF-IDF features](#4.1)
    - [4.2 Determine optimal `k` using Elbow, Silhouette, Davies‚ÄìBouldin](#4.2)
    - [4.3 Visualize and interpret clusters (PCA / t-SNE)](#4.3)

* [5. Supervised Machine Learning Models](#5)
    - [5.1 Decision Tree Classifier (baseline)](#5.1)
    - [5.2 Random Forest (Bagging Ensemble)](#5.2)
    - [5.3 XGBoost (Boosting Ensemble)](#5.3)
    - [5.4 Stacking Ensemble (meta-learner over RF, XGB, etc.)](#5.4)
    - [5.5 Evaluation: Accuracy, Precision, Recall, F1, ROC-AUC](#5.5)
    - [5.6 Feature importance / SHAP](#5.6)

* [6. Deep Learning Model (Neural Network)](#6)
    - [6.1 Build Feed-Forward / 1D-CNN / LSTM (PyTorch)](#6.1)
    - [6.2 Inputs: TF-IDF or embeddings](#6.2)
    - [6.3 Train/validate and visualize loss/accuracy](#6.3)
    - [6.4 Compare NN vs. ensembles (incl. Stacking)](#6.4)

* [7. Dimensionality Reduction and Visualization](#7)
    - [7.1 PCA on high-dimensional TF-IDF](#7.1)
    - [7.2 Explained variance plots](#7.2)
    - [7.3 t-SNE for non-linear structure](#7.3)

* [8. Integration of LLM / Generative AI (Optional)](#8)
    - [8.1 LLM assistance (cluster summaries, error analysis)](#8.1)
    - [8.2 Synthetic email generation for data balance](#8.2)
    - [8.3 Compare manual vs. LLM-augmented preprocessing](#8.3)

* [9. Results and Discussion](#9)
    - [9.1 Performance comparison: DT, RF, XGB, **Stacking**, NN](#9.1)
    - [9.2 Confusion matrices and error analysis](#9.2)
    - [9.3 Cluster‚Äìlabel alignment insights](#9.3)
    - [9.4 Limitations and future work](#9.4)

* [10. Business Insights and Recommendations](#10)
    - [10.1 Productivity gains from auto-categorization](#10.1)
    - [10.2 Inbox/CRM workflow automation](#10.2)
    - [10.3 Governance & explainability](#10.3)

* [11. Deployment (FastAPI + Streamlit)](#11)
    - [11.1 FastAPI `/predict` endpoint for inference](#11.1)
    - [11.2 Streamlit UI (text box ‚Üí predicted category + probabilities)](#11.2)
    - [11.3 Live demo pipeline: input ‚Üí TF-IDF ‚Üí model ‚Üí category](#11.3)

* [12. Appendices and Deliverables](#12)
    - [12.1 Source notebooks, trained models, config](#12.1)
    - [12.2 API URLs/keys and dataset files](#12.2)
    - [12.3 Slides and references](#12.3)


In [None]:
from huggingface_hub import login
login()   # then paste your access token from https://huggingface.co/settings/tokens


VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv‚Ä¶

In [3]:
import pandas as pd


In [4]:
# Login using e.g. `huggingface-cli login` to access this dataset
# splits = {'train': 'train.json', 'test': 'test.json'}
# train_df = pd.read_json("hf://datasets/jason23322/high-accuracy-email-classifier/" + splits["train"])
# test_df = pd.read_json("hf://datasets/jason23322/high-accuracy-email-classifier/" + splits["test"])

In [16]:
train_df = pd.read_csv('/content/combined_train.csv')
test_df = pd.read_csv('/content/combined_test.csv')

In [17]:
train_df

Unnamed: 0,id,subject,body,text,category,category_id
0,promotions_582,Anniversary Special: Buy one get one free,"As our loyal customer, get exclusive $60 off $...",Anniversary Special: Buy one get one free As o...,promotions,1
1,spam_1629,Your Amazon was used on new device,Your $5000 refund is processed. Claim: bit.ly/...,Your Amazon was used on new device Your $5000 ...,spam,3
2,spam_322,Re: Your Google inquiry,"Hi, following up about your Google application...","Re: Your Google inquiry Hi, following up about...",spam,3
3,social_media_80,Digital Ritual Experience Creation,Cross-cultural ceremony design. Join: virtualr...,Digital Ritual Experience Creation Cross-cultu...,social_media,2
4,forum_1351,"Your post was moved to ""Programming Help""","Trending: ""cooking"" (258 comments). View: supp...","Your post was moved to ""Programming Help"" Tren...",forum,0
...,...,...,...,...,...,...
12695,music_415,ü§ò The Ultimate Calvin Harris Experience for th...,Calvin Harris is back! Catch them live at Heng...,ü§ò The Ultimate Calvin Harris Experience for th...,concert_promotion,6
12696,flight_1177,Group Booking Confirmed: Cathay Pacific CA636,Explore New York on 2025-11-12! Flight CA636 w...,Group Booking Confirmed: Cathay Pacific CA636 ...,flight_booking,7
12697,music_168,ü§ò The Ultimate Die Twice Experience for the t...,Early bird tickets end soon! Die Twice at Quee...,ü§ò The Ultimate Die Twice Experience for the t...,concert_promotion,6
12698,music_546,‚ù§Ô∏è Get Ready for Prateek Kuhad on 06-09-2024 ‚Äì...,"Good news, Melbourne! Prateek Kuhad is perform...",‚ù§Ô∏è Get Ready for Prateek Kuhad on 06-09-2024 ‚Äì...,concert_promotion,6


In [18]:
test_df

Unnamed: 0,id,subject,body,text,category,category_id
0,social_media_1558,Watch later: Recommended story,"Group update: ""Book Club"" posted video. Trendi...","Watch later: Recommended story Group update: ""...",social_media,2
1,social_media_505,News from groups you follow,"Group ""Tech Enthusiasts"" invited you. RSVP: pl...","News from groups you follow Group ""Tech Enthus...",social_media,2
2,forum_190,Two-Factor Authentication Enforcement Notice,Required for all accounts by Dec 1: security.f...,Two-Factor Authentication Enforcement Notice R...,forum,0
3,updates_1851,Security upgrade: 2FA enabled,Your monthly statement is available. View/down...,Security upgrade: 2FA enabled Your monthly sta...,updates,4
4,verify_code_1753,Verification PIN: 907472,Use 404583 as your verification code. Device: ...,Verification PIN: 907472 Use 404583 as your ve...,verify_code,5
...,...,...,...,...,...,...
3172,flight_812,Luxury Travel Awaits! Singapore Airlines SI554...,Adventure awaits! Singapore Airlines Flight SI...,Luxury Travel Awaits! Singapore Airlines SI554...,flight_booking,7
3173,music_441,‚ù§Ô∏è The Ultimate twenty one pilots Experience f...,"Feel the bass, live the moment. twenty one pil...",‚ù§Ô∏è The Ultimate twenty one pilots Experience f...,concert_promotion,6
3174,flight_11,Your Journey with IndiGo is About to Begin!,Your journey is ready! Flight IN818 with IndiG...,Your Journey with IndiGo is About to Begin! Yo...,flight_booking,7
3175,music_207,üé∂ Experience Lizzo performing hits like ['Ever...,Experience the incredible stage presence of Li...,üé∂ Experience Lizzo performing hits like ['Ever...,concert_promotion,6


In [8]:
# from datasets import load_dataset
# import pandas as pd

# ds = load_dataset("jason23322/high-accuracy-email-classifier")
# ds.save_to_disk("local_folder/high_accuracy_email")
# train_df = pd.DataFrame(ds['train'])
# display(train_df.head())

In [9]:
# View categories
print(train_df['category'].value_counts())
print(test_df['category'].value_counts())



category
forum                1800
verify_code          1800
social_media         1796
promotions           1796
spam                 1794
updates              1794
flight_booking        960
concert_promotion     960
Name: count, dtype: int64
category
verify_code          451
forum                450
social_media         449
updates              449
promotions           449
spam                 449
flight_booking       240
concert_promotion    240
Name: count, dtype: int64


In [10]:
# !pip install keras

In [11]:
# import tensorflow as tf
# import json
# import numpy as np
# import re
# from tensorflow.keras.preprocessing.sequence import pad_sequences

# # 1. Load the trained model
# model = tf.keras.models.load_model('/content/best_high_accuracy_model.h5')

# # 2. Load tokenizer config
# with open('/content/high_accuracy_tokenizer_config.json', 'r') as f:
#     config = json.load(f)
# word_index = config['word_index']
# max_len = config['max_len']
# categories = config['categories']

# # 3. Preprocessing
# def preprocess_text(text):
#     text = text.lower()
#     text = re.sub(r'http[s]?://\S+', 'URL', text)
#     text = re.sub(r'www\.\S+', 'URL', text)
#     text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b', 'EMAIL', text)
#     text = re.sub(r'\b\d+\b', 'NUMBER', text)
#     text = re.sub(r'[^\w\s]', ' ', text)
#     text = ' '.join(text.split())
#     return text

# def text_to_sequence(text):
#     words = text.split()
#     sequence = [ word_index.get(w, 1) for w in words ]  # 1 => OOV token
#     return pad_sequences([sequence], maxlen=max_len, padding='post', truncating='post')

# def predict_email_category(text):
#     processed = preprocess_text(text)
#     seq = text_to_sequence(processed)
#     probabilities = model.predict(seq, verbose=0)[0]
#     idx = np.argmax(probabilities)
#     category = categories[idx]
#     confidence = float(probabilities[idx])
#     all_probs = {categories[i]: float(probabilities[i]) for i in range(len(categories))}
#     return {
#         'predicted_category': category,
#         'confidence': confidence,
#         'all_probabilities': all_probs
#     }

# # 4. Example usage
# email_text = "Your verification code is 123456. Please enter this code."
# result = predict_email_category(email_text)
# print("Category:", result['predicted_category'])
# print("Confidence:", result['confidence'])
# print("All probabilities:", result['all_probabilities'])


In [12]:
!pip install -U "transformers>=4.30.0"
!nvidia-smi

Fri Oct 31 07:36:42 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   33C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [19]:
# Install dependencies if needed
# !pip install transformers datasets torch pandas scikit-learn

import re
import torch
print(torch.cuda.is_available())
import pandas as pd
from torch.utils.data import Dataset
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import os
os.environ["WANDB_DISABLED"] = "true"

# --- Helper preprocessing function ---
def preprocess_text(text):
    """Preprocess text exactly as done during training"""
    text = text.lower()
    text = re.sub(r'http[s]?://\S+', 'URL', text)
    text = re.sub(r'www\.\S+', 'URL', text)
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b', 'EMAIL', text)
    text = re.sub(r'\b\d+\b', 'NUMBER', text)
    text = re.sub(r'[^\w\s]', ' ', text)
    text = ' '.join(text.split())
    return text


# --- Load dataset ---
# splits = {'train': 'train.json', 'test': 'test.json'}
# train_df = pd.read_json("hf://datasets/jason23322/high-accuracy-email-classifier/" + splits["train"])
# test_df = pd.read_json("hf://datasets/jason23322/high-accuracy-email-classifier/" + splits["test"])

# Use preprocessed text and numeric labels
train_df['text'] = train_df['text'].apply(preprocess_text)
test_df['text'] = test_df['text'].apply(preprocess_text)
train_df = train_df.rename(columns={'category_id': 'label'})
test_df = test_df.rename(columns={'category_id': 'label'})


# --- Dataset class for PyTorch ---
class EmailDataset(Dataset):
    def __init__(self, df, tokenizer, max_len=128):
        self.texts = df['text'].tolist()
        self.labels = df['label'].tolist()
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        encoding = self.tokenizer(
            text,
            padding='max_length',
            truncation=True,
            max_length=self.max_len,
            return_tensors='pt'
        )
        item = {key: val.squeeze(0) for key, val in encoding.items()}
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item


# --- Tokenizer and Datasets ---
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

train_dataset = EmailDataset(train_df, tokenizer)
test_dataset = EmailDataset(test_df, tokenizer)

# --- Model setup ---
num_labels = train_df['label'].nunique()
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=num_labels)


# --- Metrics ---
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    acc = accuracy_score(labels, preds)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='weighted')
    return {'accuracy': acc, 'precision': precision, 'recall': recall, 'f1': f1}


# --- Training configuration ---
training_args = TrainingArguments(
    output_dir='./results',
    do_train=True,
    do_eval=True,
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=200,
    save_steps=500,
    save_total_limit=2
)



# --- Trainer ---
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)


# --- Train and evaluate ---
trainer.train()
trainer.evaluate()


True


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
  trainer = Trainer(


Step,Training Loss
200,0.651
400,0.0601
600,0.0414
800,0.0267
1000,0.0079
1200,0.0157
1400,0.005
1600,0.0153
1800,0.0072
2000,0.0024


{'eval_loss': 0.026874130591750145,
 'eval_accuracy': 0.9959080893925086,
 'eval_precision': 0.995920529109329,
 'eval_recall': 0.9959080893925086,
 'eval_f1': 0.995906629861107,
 'eval_runtime': 22.779,
 'eval_samples_per_second': 139.47,
 'eval_steps_per_second': 8.736,
 'epoch': 3.0}

In [20]:
# Evaluate the model
metrics = trainer.evaluate(eval_dataset=test_dataset)

# Print metrics nicely
print("=== Test Set Evaluation ===")
for key, value in metrics.items():
    if isinstance(value, float):
        print(f"{key}: {value:.4f}")
    else:
        print(f"{key}: {value}")


=== Test Set Evaluation ===
eval_loss: 0.0269
eval_accuracy: 0.9959
eval_precision: 0.9959
eval_recall: 0.9959
eval_f1: 0.9959
eval_runtime: 23.1778
eval_samples_per_second: 137.0710
eval_steps_per_second: 8.5860
epoch: 3.0000


In [21]:
import torch
import torch.nn.functional as F

# --- Dynamically build category mapping from dataset ---
# Make sure `train_df` contains 'label' and 'category'
category_id_to_name = dict(zip(train_df['label'].unique(), train_df['category'].unique()))
print(category_id_to_name)
# Example output: {0: 'Forum', 1: 'Promotions', 2: 'Social Media', 3: 'Spam', 4: 'Updates', 5: 'Verify Code'}

def predict_email_category(email_text, model, tokenizer, max_len=128):
    """
    Predict the category of a single email text.

    Returns:
        dict: {
            'predicted_category': str,
            'confidence': float,
            'all_probabilities': dict
        }
    """
    # Preprocess
    text = preprocess_text(email_text)

    # Tokenize
    encoding = tokenizer(
        text,
        padding='max_length',
        truncation=True,
        max_length=max_len,
        return_tensors='pt'
    )

    # Move to GPU if available
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    input_ids = encoding['input_ids'].to(device)
    attention_mask = encoding['attention_mask'].to(device)

    # Forward pass
    model.eval()
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        probs = F.softmax(logits, dim=-1).squeeze().cpu().numpy()

    # Prediction
    pred_id = int(probs.argmax())
    predicted_category = category_id_to_name.get(pred_id, f"unknown_{pred_id}")
    confidence = float(probs[pred_id])

    all_probabilities = {category_id_to_name.get(i, f"unknown_{i}"): float(probs[i]) for i in range(len(probs))}

    return {
        'predicted_category': predicted_category,
        'confidence': confidence,
        'all_probabilities': all_probabilities
    }

# --- Example usage ---
email_text = "Your verification code is 123456. Please enter this code."
result = predict_email_category(email_text, model, tokenizer)
print("Category:", result['predicted_category'])       # üîê Verify Code
print("Confidence:", result['confidence'])
print("All probabilities:", result['all_probabilities'])


{np.int64(1): 'promotions', np.int64(3): 'spam', np.int64(2): 'social_media', np.int64(0): 'forum', np.int64(5): 'verify_code', np.int64(4): 'updates', np.int64(7): 'flight_booking', np.int64(6): 'concert_promotion'}
Category: verify_code
Confidence: 0.9994841814041138
All probabilities: {'forum': 5.1899056416004896e-05, 'promotions': 5.4017658840166405e-05, 'social_media': 7.671470666537061e-05, 'spam': 7.765327609376982e-05, 'updates': 6.802252028137445e-05, 'verify_code': 0.9994841814041138, 'concert_promotion': 7.594014459755272e-05, 'flight_booking': 0.00011157039261888713}


In [22]:
examples = [
    "Your verification code is 123456. Please enter this code.",  # Verify Code
    "Big sale! Get 50% off all items today only.",                # Promotions
    "Mike Chen commented on your post in Programming Help.",      # Social Media
    "Your post was moved to 'Cooking' forum.",                    # Forum
    "Security alert: 5 failed login attempts detected.",          # Spam
    "Your August system update is now available."                 # Updates
]

for email_text in examples:
    result = predict_email_category(email_text, model, tokenizer)
    print("Email text:", email_text)
    print("Predicted Category:", result['predicted_category'])
    print("Confidence:", round(result['confidence'], 3))
    print("All probabilities:", {k: round(v,3) for k,v in result['all_probabilities'].items()})
    print("-"*80)


Email text: Your verification code is 123456. Please enter this code.
Predicted Category: verify_code
Confidence: 0.999
All probabilities: {'forum': 0.0, 'promotions': 0.0, 'social_media': 0.0, 'spam': 0.0, 'updates': 0.0, 'verify_code': 0.999, 'concert_promotion': 0.0, 'flight_booking': 0.0}
--------------------------------------------------------------------------------
Email text: Big sale! Get 50% off all items today only.
Predicted Category: promotions
Confidence: 0.999
All probabilities: {'forum': 0.0, 'promotions': 0.999, 'social_media': 0.0, 'spam': 0.0, 'updates': 0.0, 'verify_code': 0.0, 'concert_promotion': 0.0, 'flight_booking': 0.0}
--------------------------------------------------------------------------------
Email text: Mike Chen commented on your post in Programming Help.
Predicted Category: social_media
Confidence: 0.997
All probabilities: {'forum': 0.001, 'promotions': 0.0, 'social_media': 0.997, 'spam': 0.0, 'updates': 0.001, 'verify_code': 0.0, 'concert_promotion'

In [26]:
import random

num_samples = 5

print("\n=== Sample Predictions from Test Dataset ===")

# Randomly choose 5 unique indices from the test dataset
sample_indices = random.sample(range(len(test_df)), num_samples)

for i, idx in enumerate(sample_indices):
    email_text = test_df.iloc[idx]['text']
    true_category = test_df.iloc[idx]['category']

    result = predict_email_category(email_text, model, tokenizer)

    print(f"\nSample {i+1}:")
    print("Text:", email_text[:100] + "..." if len(email_text) > 100 else email_text)
    print("True Category:", true_category)
    print("Predicted Category:", result['predicted_category'])
    print("Confidence:", round(result['confidence'], 3))
    print("All probabilities:", {k: round(v,3) for k,v in result['all_probabilities'].items()})



=== Sample Predictions from Test Dataset ===

Sample 1:
Text: your vip seat on lufthansa flight lu716 is ready enjoy premium comfort and priority boarding on your...
True Category: flight_booking
Predicted Category: flight_booking
Confidence: 0.999
All probabilities: {'forum': 0.0, 'promotions': 0.0, 'social_media': 0.0, 'spam': 0.0, 'updates': 0.0, 'verify_code': 0.0, 'concert_promotion': 0.0, 'flight_booking': 0.999}

Sample 2:
Text: your chance to see bts on NUMBER NUMBER NUMBER secure your spot love live music bts s hits seoul at ...
True Category: concert_promotion
Predicted Category: concert_promotion
Confidence: 0.999
All probabilities: {'forum': 0.0, 'promotions': 0.0, 'social_media': 0.0, 'spam': 0.0, 'updates': 0.0, 'verify_code': 0.0, 'concert_promotion': 0.999, 'flight_booking': 0.0}

Sample 3:
Text: thread locked reason revised rules effective immediately key changes citation requirements for news ...
True Category: forum
Predicted Category: forum
Confidence: 1.0
All prob

Additional Data

In [None]:
body_templates = [
    # 1. General & Direct Announcements
    "Join {artist} at {venue} in {city} on {date}! Experience hits like {songs} and an unforgettable night for {attendance} fans.",
    "Good news, {city}! {artist} is performing live at {venue} on {date} as part of their {tour_name}. Grab your tickets at {link}!",
    "{artist} is back! Catch them live at {venue}, {city}, on {date}. Sing along to {songs} and more.",
    "Feel the energy as {artist} rocks {city} at {venue} on {date}. Tickets are moving fast!",
    "Love live music? {artist}'s {tour_name} hits {city} at {venue} on {date}. Don't miss {songs} performed live!",
    "Get ready, {city}! {artist} brings their iconic sound to {venue} on {date}. Be part of {attendance} fans cheering live.",
    "The wait is over! {artist} takes the stage at {venue}, {city}, on {date}. Reserve your spot here: {link}.",
    "It's official: {artist} is performing at {venue} in {city} on {date}. Experience {songs} live!",
    "{city}, mark your calendars! {artist} delivers a powerhouse show at {venue} on {date}.",
    "One night only: {artist} live at {venue}, {city} on {date}. Get your tickets now!",

    # 2. Experience & Atmosphere
    "Experience {artist} like never before. {venue}, {city}, {date}. Top songs include {songs}.",
    "Prepare for an audiovisual spectacle! {artist} brings {tour_name} to {venue} on {date} in {city}.",
    "More than a concert‚Äîit‚Äôs an experience. Join {artist} at {venue}, {city}, on {date}.",
    "Feel the lights, hear the music. {artist} live in {city} at {venue}, {date}.",
    "The atmosphere will be electric! {artist} performs {songs} at {venue}, {city}, on {date}.",
    "Witness {artist}'s passion on stage at {venue}, {city}, {date}. {attendance} fans expected.",
    "Dazzling lights, incredible sound, {artist}! {venue}, {city}, {date}.",
    "Feel the bass, live the moment. {artist} in concert at {venue}, {city}, on {date}.",
    "Pure music magic: {artist} at {venue} in {city} on {date}. Featuring hits like {songs}.",
    "Experience the incredible stage presence of {artist} at {venue}, {city}, on {date}.",

    # 3. Music & Songs Focus
    "Sing your heart out! {artist} live at {venue} in {city} on {date}. Performances include {songs}.",
    "From classics to new hits: {artist} performs {songs} live at {venue} ({city}) on {date}.",
    "Don't just stream‚Äîexperience {artist} live at {venue}, {city}, on {date}.",
    "The soundtrack to your life, performed live. {artist} in {city} at {venue} on {date}.",
    "{artist} brings chart-topping hits like {songs} to {venue}, {city}, {date}.",
    "Studio hits to live thrills: {artist} performs at {venue} on {date}.",
    "Witness timeless music by {artist} at {venue}, {city}, on {date}.",
    "All the hits, all the energy! {artist} live at {venue}, {city}, {date}.",
    "{artist} brings the full stadium show to {venue} on {date}.",
    "The icon returns! {artist} live at {venue}, {city}, {date}.",

    # 4. Urgency & CTA
    "Tickets are selling fast! {artist} at {venue}, {city}, {date}. Secure yours at {link}!",
    "Don't wait! See {artist} live at {venue} on {date}. Limited tickets remaining.",
    "Final release tickets for {artist} at {venue} ({city}) on {date}. Act fast!",
    "{artist} adds a show in {city} at {venue} on {date}. Get your tickets now!",
    "The hottest ticket in town: {artist} at {venue}, {city}, {date}.",
    "Last chance! {artist} at {venue} on {date}. Grab yours: {link}.",
    "Early bird tickets end soon! {artist} at {venue}, {city}, {date}.",
    "Low ticket warning! {artist} live in {city} at {venue}, {date}.",
    "Tag your concert buddy! {artist} at {venue}, {city}, {date}.",
    "Get your spot now! {artist} live at {venue}, {city}, on {date}.",

    # 5. Tour & Special Events
    "The {tour_name} tour lands in {city}! See {artist} at {venue} on {date}.",
    "{artist} is back on tour! Catch the new show in {city} at {venue} on {date}.",
    "As part of {tour_name}, {artist} stops at {venue}, {city}, on {date}.",
    "Don't miss the {tour_name} spectacle: {artist} at {venue}, {city}, {date}.",
    "Experience the {tour_name} tour with {artist} live at {venue} on {date}.",
    "The final stop of {tour_name}! {artist} live in {city} at {venue} on {date}.",
    "{artist} anniversary tour! {venue}, {city}, {date}. Celebrate with hits like {songs}.",
    "{artist} brings {tour_name} to {venue}, {city}, {date}. Tickets at {link}.",
    "A special night with {artist}: {venue}, {city}, {date}.",
    "The concert you've been waiting for! {artist} at {venue} ({city}) on {date}.",

    # 6. Short & Minimalist
    "{artist}. {city}. {venue}. {date}. Tickets: {link}.",
    "Live in {city}: {artist} at {venue}, {date}.",
    "Announcing: {artist} at {venue} on {date}.",
    "{city}: {artist} is coming. {venue}, {date}.",
    "{artist} // {city} // {venue} // {date}. Get tickets at {link}.",
]


In [None]:
import random
subject_templates=[]

# Base components
emojis = ["üé§","üé∂","üî•","‚ú®","üé∏","üéπ","üéüÔ∏è","üåü","üéâ","ü§ò","üôå","üï∫","üí•","üö®","‚ù§Ô∏è"]
prefixes = [
    "Get Ready for {artist}","Don't Miss {artist}","Join {artist}","Experience {artist}",
    "The Ultimate {artist} Experience","Catch {artist} Live","See {artist} in {city}","Your Chance to See {artist}"
]
actions = [
    "in {city}","at {venue}","on {date}","for the {tour_name} tour","performing hits like {songs}"
]
suffixes = [
    "‚Äì One Night Only!","Tickets Going Fast!","Secure Your Spot!","Limited Seats Available","Be There!","Don't Miss Out!","üéüÔ∏è Get Tickets Now!","Live & Loud!","Epic Night Ahead!"
]

# Generate 50 new templates
new_subject_templates = []
while len(new_subject_templates) < 50:
    template = f"{random.choice(emojis)} {random.choice(prefixes)} {random.choice(actions)} {random.choice(suffixes)}"
    if template not in new_subject_templates:
        new_subject_templates.append(template)

# Combine with your existing list
subject_templates.extend(new_subject_templates)

print(f"Total templates: {len(subject_templates)}")
print(subject_templates[-10:])  # Preview last 10 generated


Total templates: 50
['üéâ Catch {artist} Live for the {tour_name} tour Secure Your Spot!', 'üôå Get Ready for {artist} at {venue} ‚Äì One Night Only!', 'üî• Your Chance to See {artist} performing hits like {songs} Secure Your Spot!', '‚ù§Ô∏è Join {artist} for the {tour_name} tour Limited Seats Available', 'ü§ò See {artist} in {city} on {date} Live & Loud!', "üé§ Don't Miss {artist} for the {tour_name} tour Live & Loud!", "üí• The Ultimate {artist} Experience in {city} Don't Miss Out!", "ü§ò The Ultimate {artist} Experience for the {tour_name} tour Don't Miss Out!", "üö® Experience {artist} in {city} Don't Miss Out!", 'ü§ò Experience {artist} performing hits like {songs} üéüÔ∏è Get Tickets Now!']


In [None]:
artists = [
    # --- International Pop / Rock / Hip-hop ---
    "Coldplay", "Taylor Swift", "The Weeknd", "Ed Sheeran", "Adele", "Beyonce",
    "Bruno Mars", "Dua Lipa", "Justin Bieber", "Billie Eilish", "Shawn Mendes",
    "Ariana Grande", "Katy Perry", "Lady Gaga", "Maroon 5", "Imagine Dragons",
    "Post Malone", "Sam Smith", "Harry Styles", "Rihanna", "Khalid", "Halsey",
    "Sia", "The Chainsmokers", "Lizzo", "Camila Cabello", "Olivia Rodrigo",
    "Doja Cat", "BLACKPINK", "BTS", "OneRepublic", "Twenty One Pilots", "P!nk",
    "Selena Gomez", "Miley Cyrus", "Shakira", "Enrique Iglesias", "Nicki Minaj",

    # --- Bollywood / Indian Pop / Hindi ---
    "Arijit Singh", "Shreya Ghoshal", "Neha Kakkar", "Badshah", "A. R. Rahman",
    "Darshan Raval", "Armaan Malik", "Sunidhi Chauhan", "Pritam", "Honey Singh",
    "Kailash Kher", "Mohit Chauhan", "Sonu Nigam", "Atif Aslam", "Ankit Tiwari",
    "Tanishk Bagchi", "Jubin Nautiyal", "Jonita Gandhi", "Siddharth Slathia", "Monali Thakur",
    "Ritviz", "DJ Chetas", "Seedhe Maut", "DIVINE", "Raftaar",

    # --- Indian Indie / Rock / Bands ---
    "Indian Ocean", "Parikrama", "Euphoria", "The Local Train", "Prateek Kuhad",
    "Raghu Dixit Project", "Agnee", "Ankur Tewari", "Nikhil D'Souza", "When Chai Met Toast",
    "The F16s", "Advaita", "Krosswindz", "Thermal and a Quarter", "Motherjane",
    "BlueFROG Collective", "Indus Creed", "Pentagram", "Madboy/Mink", "Swarathma",

    # --- Tamil / Malayalam / South Indian ---
    "Anirudh Ravichander", "Sid Sriram", "Yuvan Shankar Raja", "Harris Jayaraj",
    "AR Rahman Tamil", "Vijay Yesudas", "Shweta Mohan", "Karthik", "S. P. Balasubrahmanyam",
    "Chinmayi Sripaada", "Haricharan", "G. V. Prakash Kumar", "Dhee", "Vijay Prakash",
    "Pradeep Kumar", "Sukanya", "Swetha Mohan Malayalam", "Vineeth Sreenivasan", "Najim Arshad",

    # --- Korean / K-pop ---
    "EXO", "TWICE", "BLACKPINK", "BTS", "Red Velvet", "Stray Kids", "Seventeen",
    "ITZY", "NCT 127", "GOT7", "Mamamoo", "TXT", "Aespa", "SHINee", "Super Junior",
    "BIGBANG", "MONSTA X", "IU", "Taeyeon", "Sunmi",

    # --- Chinese / Mandopop / Cantopop ---
    "Jay Chou", "JJ Lin", "G.E.M.", "Faye Wong", "Eason Chan", "Jacky Cheung",
    "Li Ronghao", "Leehom Wang", "A-Lin", "Cyndi Wang", "Jolin Tsai", "Rainie Yang",
    "Mayday", "Wang Leehom", "Hacken Lee", "Hins Cheung", "Hebe Tien", "Angela Zhang",
    "Li Yuchun", "Tiger Hu",

    # --- Singapore / Malaysia / Malay / Arab Artists ---
    "Harris J", "Aliff Aziz", "Yuna", "Siti Nurhaliza", "Ahmad Dhani", "Raisa",
    "Agnes Monica", "Tulus", "Misha Omar", "Faizal Tahir", "Stacy Anam", "Shila Amzah",
    "Zainal Abidin", "Jaclyn Victor", "Afgansyah Reza", "Aizat Amdan", "Hael Husaini",
    "Ayda Jebat", "Nabila Razali", "Sufian Suhaimi",

    # --- EDM / Fusion / World Music / Others ---
    "Nucleya", "Lost Stories", "Dualist Inquiry", "Anish Sood", "Sez on the Beat",
    "DJ NYK", "Brodha V", "MC Altaf", "Prateek Kuhad", "Taba Chake", "Raghav Meattle",
    "Bandish Projekt", "Kushal Das", "Benny Dayal", "Abhijeet Sawant", "When Chai Met Toast",
    "Darshan Raval feat. Neha Kakkar", "Jonita Gandhi", "Sona Mohapatra", "Siddharth Slathia",
    "Monali Thakur", "Divine feat. Naezy", "Seedhe Maut feat. Dino", "Prateek Kuhad Acoustic",
    "Ritviz Live", "The Local Train Unplugged"
]


In [None]:
import random
import pandas as pd
import requests

# -----------------------------
# 1Ô∏è‚É£ Fetch concerts from Setlist.fm
# -----------------------------
def get_concerts(artist_name, limit=5):
    url = "https://api.setlist.fm/rest/1.0/search/setlists"
    headers = {
        "x-api-key": "vq10HitmOhgf-DabG8C1auqcfze6MtRwVKwO",
        "Accept": "application/json"
    }
    params = {"artistName": artist_name, "p": 1}
    res = requests.get(url, headers=headers, params=params)
    if res.status_code != 200:
        return []
    data = res.json().get('setlist', [])
    concerts = []
    for c in data[:limit]:
        songs = []
        for s in c.get('sets', {}).get('set', []):
            songs.extend([song['name'] for song in s.get('song', [])])
        concerts.append({
            "artist": c['artist']['name'],
            "venue": c['venue']['name'],
            "city": c['venue']['city']['name'],
            "country": c['venue']['city']['country']['name'],
            "date": c['eventDate'],
            "tour_name": c.get('tour', {}).get('name', ''),
            "attendance": c.get('attendance', ''),
            "songs": songs,
            "link": c.get('url', '')
        })
    return concerts

# -----------------------------
# 2Ô∏è‚É£ Collect concerts for multiple artists
# -----------------------------
# artists = ["Coldplay", "Taylor Swift", "The Weeknd", "Ed Sheeran", "Adele", "Beyonce"]
concerts = []
for a in artists:
    concerts.extend(get_concerts(a, limit=5))  # get more concerts per artist

# -----------------------------
# 3Ô∏è‚É£ Generate multiple emails per concert
# -----------------------------
target_emails = 1200
mock_emails = []

while len(mock_emails) < target_emails:
    concert = random.choice(concerts)
    subject = random.choice(subject_templates).format(**concert)
    top_songs = ", ".join(concert['songs'][:3]) if concert['songs'] else "their greatest hits"
    body = random.choice(body_templates).format(**concert, top_songs=top_songs)

    mock_emails.append({
        "subject": subject,
        "body": body,
        "text": subject + " " + body,
        "category": "concert_promotion",
        "category_id": 6
    })

# -----------------------------
# 4Ô∏è‚É£ Create DataFrame
# -----------------------------
mock_df = pd.DataFrame(mock_emails)
mock_df["id"] = ["music_" + str(i) for i in range(len(mock_df))]

print(f"Generated {len(mock_df)} emails")


Generated 1200 emails


In [None]:
mock_df

Unnamed: 0,subject,body,text,category,category_id,id
0,ü§ò See Atif Aslam in Leicester on 07-09-2024 Li...,Live in Leicester: Atif Aslam at Mattioli Aren...,ü§ò See Atif Aslam in Leicester on 07-09-2024 Li...,concert_promotion,6,music_0
1,üé∏ Experience BigBang at Scene Grefsenkollen üéüÔ∏è...,From classics to new hits: BigBang performs []...,üé∏ Experience BigBang at Scene Grefsenkollen üéüÔ∏è...,concert_promotion,6,music_1
2,üí• Your Chance to See MONSTA X on 14-09-2025 Se...,"Get ready, Seoul! MONSTA X brings their iconic...",üí• Your Chance to See MONSTA X on 14-09-2025 Se...,concert_promotion,6,music_2
3,üéüÔ∏è Your Chance to See Miley Cyrus performing h...,Love live music? Miley Cyrus's hits New York ...,üéüÔ∏è Your Chance to See Miley Cyrus performing h...,concert_promotion,6,music_3
4,üôå Your Chance to See Sonu Nigam performing hit...,The icon returns! Sonu Nigam live at Riverside...,üôå Your Chance to See Sonu Nigam performing hit...,concert_promotion,6,music_4
...,...,...,...,...,...,...
1195,üé∂ Experience Coldplay performing hits like ['M...,Feel the energy as Coldplay rocks London at We...,üé∂ Experience Coldplay performing hits like ['M...,concert_promotion,6,music_1195
1196,üôå Get Ready for NCT 127 for the tour Tickets ...,Don't miss the spectacle: NCT 127 at Crypto.c...,üôå Get Ready for NCT 127 for the tour Tickets ...,concert_promotion,6,music_1196
1197,üî• Your Chance to See Imagine Dragons performin...,Witness timeless music by Imagine Dragons at M...,üî• Your Chance to See Imagine Dragons performin...,concert_promotion,6,music_1197
1198,‚ù§Ô∏è Join The Weeknd for the tour Limited Seats...,Love live music? The Weeknd's hits Atlantic C...,‚ù§Ô∏è Join The Weeknd for the tour Limited Seats...,concert_promotion,6,music_1198


In [None]:
mock_df.to_csv("mock_concert_emails.csv", index=False)

print("Saved mock emails to mock_concert_emails.csv")

Saved mock emails to mock_concert_emails.csv


In [None]:
subject_templates = [
    # Booking Confirmation
    "Your Flight with {airline} is Confirmed! ‚úàÔ∏è",
    "Booking Successful: {airline} Flight {flight_number}",
    "E-Ticket for {airline} Flight {flight_number} from {departure_city} to {arrival_city}",
    "Flight {flight_number} with {airline} ‚Äì Booking Complete",
    "Confirmation: {airline} {departure_city} ‚Üí {arrival_city}",

    # Reminder
    "Reminder: {airline} Flight {flight_number} on {departure_date}",
    "Upcoming Flight Alert: {airline} {flight_number}",
    "Your Trip is Coming Up! {airline} Flight {flight_number}",
    "Don't Forget! {airline} Flight {flight_number} on {departure_date}",

    # Urgency / Check-in
    "Check-in Open for {airline} Flight {flight_number}",
    "‚è≥ Boarding Soon: {airline} {flight_number}",
    "Last Chance to Check-in: {airline} Flight {flight_number}",
    "Prepare for Departure: {airline} {flight_number} from {departure_city}",

    # Promotional / Loyalty
    "Fly with {airline} ‚Äì Upgrade to Business Class!",
    "Earn Miles on {airline} Flight {flight_number}",
    "Special Offer: {airline} Flight {flight_number} This Week",
    "Exclusive: {airline} Flight Deals for {departure_city} ‚Üí {arrival_city}",

    # International / Domestic Variations
    "Travel Abroad with {airline} ‚Äì {departure_city} to {arrival_city}",
    "Domestic Flight Booking Confirmed: {airline} {flight_number}",
    "Your Next Adventure: {airline} {departure_city} ‚Üí {arrival_city}",

    # Casual / Fun
    "Pack Your Bags! {airline} Flight {flight_number} Awaits",
    "‚úàÔ∏è Adventure Starts Soon: {airline} Flight {flight_number}",
    "Your Journey with {airline} is About to Begin!",
    "Boarding Pass Inside: {airline} Flight {flight_number}",

    # VIP / Premium Focus
    "Your VIP Seat on {airline} Flight {flight_number} is Ready",
    "Luxury Travel Awaits! {airline} {flight_number} Booking Confirmed",
    "Enjoy Premium Comfort on {airline} Flight {flight_number}",

    # Generic / Catch-All
    "{airline} Flight {flight_number} ‚Äì Your Ticket Details",
    "Flight Confirmation: {airline} {flight_number} ‚Äì {departure_date}",
    "Ready to Fly? {airline} {flight_number} from {departure_city} to {arrival_city}",

    # Emojis & Fun Variations
    "üõ´ Your Flight {flight_number} with {airline} is Booked!",
    "üõ¨ Landing Soon: {airline} {flight_number}",
    "üéüÔ∏è E-Ticket for {airline} {flight_number}",
    "üåç Explore {arrival_city} ‚Äì {airline} Flight {flight_number}",
    "üïí Departure Alert: {airline} Flight {flight_number} on {departure_date}",

    # Loyalty / Frequent Flyer
    "Miles Update: {airline} Flight {flight_number} Confirmation",
    "Earn Points on Your Upcoming Flight {flight_number} with {airline}",
    "VIP Access: {airline} Flight {flight_number} Booking Confirmed",

    # Multiple Options / Group
    "Group Booking Confirmed: {airline} {flight_number}",
    "Family Trip? {airline} Flight {flight_number} Details Inside",
    "Business Travel? {airline} {flight_number} Itinerary",

    # Last-Minute / Flash
    "Last-Minute Flight Alert: {airline} {flight_number}",
    "Your Flight {flight_number} is About to Depart! ‚è∞",
    "Urgent: {airline} Flight {flight_number} ‚Äì Confirm Your Seat"
]


In [None]:
body_templates = [
    # Confirmation
    "Hello! Your flight {flight_number} with {airline} is confirmed. Depart from {departure_city} at {departure_time} on {departure_date} and arrive at {arrival_city} at {arrival_time}. Seat Class: {seat_class}. Price: ${price}.",
    "Booking complete! {airline} Flight {flight_number} from {departure_city} to {arrival_city}. Departure: {departure_date} at {departure_time}. Arrival: {arrival_time}. Reserve extras here: {booking_url}",
    "Your journey is ready! Flight {flight_number} with {airline} leaves {departure_city} at {departure_time} on {departure_date} and lands in {arrival_city} at {arrival_time}. Seat Class: {seat_class}.",

    # Reminder / Check-in
    "Reminder: Check-in is open for your {airline} Flight {flight_number}. Departure from {departure_city} at {departure_time}, arriving {arrival_city} at {arrival_time}.",
    "Boarding soon! {airline} {flight_number} departs {departure_city} at {departure_time} on {departure_date}. Ensure you have your ID and ticket ready.",
    "It's almost time to fly! {airline} {flight_number} departs {departure_city} at {departure_time}. Arrival: {arrival_city} at {arrival_time}.",

    # Promotions / Loyalty
    "Earn reward points on your upcoming flight {flight_number} with {airline}. Book meals, lounges, and more at {booking_url}.",
    "Upgrade your seat for extra comfort on {airline} {flight_number}. Departure {departure_city}, arrival {arrival_city}.",
    "Special offer! Add baggage or meals to your {airline} flight {flight_number} at {booking_url}.",

    # VIP / Premium
    "Enjoy premium comfort and priority boarding on your {airline} Flight {flight_number}. Departure {departure_date}, {departure_city} ‚Üí {arrival_city}.",
    "Your VIP access is confirmed for {airline} Flight {flight_number}. Seat Class: {seat_class}. Departure: {departure_date}, {departure_time}.",

    # Group / Family / Business
    "Family trip confirmed! {airline} Flight {flight_number} from {departure_city} to {arrival_city}. Seats booked: {seat_class}.",
    "Business travel itinerary: {airline} Flight {flight_number}. Departure: {departure_city} at {departure_time}, arrival: {arrival_city} at {arrival_time}.",

    # Urgency / Last Minute
    "Last chance to check-in for your flight {flight_number} with {airline}. Departure: {departure_date} {departure_time}.",
    "Hurry! Boarding for {airline} Flight {flight_number} begins soon. Don't miss your departure.",

    # Fun / Casual
    "Adventure awaits! {airline} Flight {flight_number} from {departure_city} to {arrival_city}. Pack your bags and enjoy the journey.",
    "‚úàÔ∏è Ready for takeoff? Flight {flight_number} with {airline} departs {departure_city} at {departure_time}. Arrival: {arrival_city} at {arrival_time}.",
    "Your next adventure starts here! {airline} Flight {flight_number}, {departure_city} ‚Üí {arrival_city}, {departure_date}. Seat Class: {seat_class}.",

    # International / Multi-city
    "Travel across continents! {airline} Flight {flight_number} from {departure_city} to {arrival_city} departs {departure_date}. Check-in: {booking_url}",
    "Explore {arrival_city} on {departure_date}! Flight {flight_number} with {airline} departs {departure_city} at {departure_time}."
]


In [None]:
import random
import pandas as pd
from datetime import datetime, timedelta

# Sample cities and airlines
airlines = ["Emirates", "IndiGo", "Delta", "Singapore Airlines", "Cathay Pacific", "Qatar Airways", "Lufthansa"]
cities = ["New York", "London", "Dubai", "Mumbai", "Singapore", "Seoul", "Beijing", "Tokyo", "Kuala Lumpur", "Chennai"]

def random_time():
    return f"{random.randint(0,23):02d}:{random.randint(0,59):02d}"

def make_flight_email(i):
    departure_city, arrival_city = random.sample(cities, 2)
    airline = random.choice(airlines)
    flight_number = f"{airline[:2].upper()}{random.randint(100,999)}"
    departure_date = (datetime.now() + timedelta(days=random.randint(1,60))).strftime("%Y-%m-%d")
    arrival_date = departure_date
    departure_time = random_time()
    arrival_time = random_time()
    seat_class = random.choice(["Economy", "Business", "First"])
    price = random.randint(100, 1500)
    booking_url = f"https://www.{airline.lower().replace(' ','')}.com/book/{flight_number}"

    flight_info = {
        "airline": airline,
        "flight_number": flight_number,
        "departure_city": departure_city,
        "arrival_city": arrival_city,
        "departure_date": departure_date,
        "arrival_date": arrival_date,
        "departure_time": departure_time,
        "arrival_time": arrival_time,
        "seat_class": seat_class,
        "price": price,
        "booking_url": booking_url
    }

    subject = random.choice(subject_templates).format(**flight_info)
    body = random.choice(body_templates).format(**flight_info)

    return {
        "id": f"flight_{i}",
        "subject": subject,
        "body": body,
        "text": subject + " " + body,
        "category": "flight_booking",
        "category_id": 7
    }

# Generate 600 mock flight emails
mock_flights = [make_flight_email(i) for i in range(1200)]
mock_flight_df = pd.DataFrame(mock_flights)


In [None]:
mock_flight_df

Unnamed: 0,id,subject,body,text,category,category_id
0,flight_0,Miles Update: Cathay Pacific Flight CA538 Conf...,It's almost time to fly! Cathay Pacific CA538 ...,Miles Update: Cathay Pacific Flight CA538 Conf...,flight_booking,7
1,flight_1,Luxury Travel Awaits! Qatar Airways QA845 Book...,Hurry! Boarding for Qatar Airways Flight QA845...,Luxury Travel Awaits! Qatar Airways QA845 Book...,flight_booking,7
2,flight_2,Earn Miles on Singapore Airlines Flight SI390,Earn reward points on your upcoming flight SI3...,Earn Miles on Singapore Airlines Flight SI390 ...,flight_booking,7
3,flight_3,üõ´ Your Flight DE889 with Delta is Booked!,Adventure awaits! Delta Flight DE889 from Beij...,üõ´ Your Flight DE889 with Delta is Booked! Adve...,flight_booking,7
4,flight_4,Boarding Pass Inside: Singapore Airlines Fligh...,Reminder: Check-in is open for your Singapore ...,Boarding Pass Inside: Singapore Airlines Fligh...,flight_booking,7
...,...,...,...,...,...,...
1195,flight_1195,Earn Miles on IndiGo Flight IN238,Business travel itinerary: IndiGo Flight IN238...,Earn Miles on IndiGo Flight IN238 Business tra...,flight_booking,7
1196,flight_1196,E-Ticket for Qatar Airways Flight QA113 from C...,Explore Tokyo on 2025-11-22! Flight QA113 with...,E-Ticket for Qatar Airways Flight QA113 from C...,flight_booking,7
1197,flight_1197,Your Next Adventure: IndiGo Kuala Lumpur ‚Üí Che...,Earn reward points on your upcoming flight IN2...,Your Next Adventure: IndiGo Kuala Lumpur ‚Üí Che...,flight_booking,7
1198,flight_1198,Pack Your Bags! IndiGo Flight IN664 Awaits,Upgrade your seat for extra comfort on IndiGo ...,Pack Your Bags! IndiGo Flight IN664 Awaits Upg...,flight_booking,7


In [None]:
mock_flight_df.to_csv("mock_flight_emails.csv", index=False)