<a href="https://colab.research.google.com/github/muhirwaJD/text_classification/blob/main/Carine_lstm_ml/lstm_ml_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LSTM Models for BBC News Text Classification

## Table of Contents
1. [Setup & Imports](#setup)
2. [Data Loading & Exploration](#data)
3. [Data Preprocessing](#preprocessing)
4. [Feature Extraction - TF-IDF](#tfidf)
5. [Feature Extraction - GloVe](#glove)
6. [Feature Extraction - FastText](#fasttext)
7. [Model Training - LSTM](#lstm)
8. [Results Comparison](#comparison)
9. [Conclusion](#conclusion)



## 1. Setup & Imports <a id='setup'></a>

In [1]:
# Force CPU execution; set before TensorFlow is imported so it takes effect
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

!pip install gensim

# Standard libraries
import re
import string
import warnings
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import joblib  # model saving

warnings.filterwarnings('ignore')

# Text processing
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Scikit-learn
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix
)

# Keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

# Gensim for FastText
from gensim.models import FastText
from gensim.utils import simple_preprocess

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("All libraries imported successfully!")


Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
Installing collected packages: gensim
Successfully installed gensim-4.4.0
All libraries imported successfully!


In [2]:
# Download NLTK data
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [3]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive



## 2. Data Loading & Exploration <a id='data'></a>

In [4]:
# Define paths
BASE_DIR = Path('/content/drive/MyDrive/text_classification (1)/data/preprocessed_data')
TRAIN_PATH = BASE_DIR / 'train.csv'
VAL_PATH = BASE_DIR / 'validation.csv'
TEST_PATH = BASE_DIR / 'test.csv'

# Load datasets
print("Loading preprocessed datasets...")
train_df = pd.read_csv(TRAIN_PATH)
val_df = pd.read_csv(VAL_PATH)
test_df = pd.read_csv(TEST_PATH)

print(f"\nDataset Sizes:")
print(f"Training set: {len(train_df):,} samples")
print(f"Validation set: {len(val_df):,} samples")
print(f"Test set: {len(test_df):,} samples")
print(f"Total: {len(train_df) + len(val_df) + len(test_df):,} samples")

# Display first few rows
print("\nFirst 5 samples from training set:")
train_df.head(5)

Loading preprocessed datasets...


FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/MyDrive/text_classification (1)/data/preprocessed_data/train.csv'
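
The `FileNotFoundError` above means the Drive path did not resolve to the expected CSV files. A minimal sketch of a pre-flight check that could run before `pd.read_csv`; the helper name `find_missing_files` is hypothetical and not part of the notebook:

```python
from pathlib import Path

def find_missing_files(base_dir, filenames):
    """Return the expected files that do not exist under base_dir."""
    base = Path(base_dir)
    return [name for name in filenames if not (base / name).is_file()]

# The three split files this notebook expects
expected = ["train.csv", "validation.csv", "test.csv"]
missing = find_missing_files(
    "/content/drive/MyDrive/text_classification (1)/data/preprocessed_data",
    expected,
)
if missing:
    print("Missing files:", missing)
```

Listing the missing files up front makes it easier to tell a wrong folder name apart from an unmounted Drive.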

In [None]:
# Check columns
print("Training columns:", train_df.columns.tolist())
print("Validation columns:", val_df.columns.tolist())
print("Test columns:", test_df.columns.tolist())

# Preview first few rows to see labels
print("\nTraining data sample:")
print(train_df.head(5))


In [None]:
# Basic dataset information
print("Training Data Info:")
print(train_df.info())

print("\nColumn Names:", train_df.columns.tolist())
print("\nCategories:", train_df['category'].unique())
print(f"Number of categories: {train_df['category'].nunique()}")

# Check for missing values
print("\nMissing Values:")
print(train_df.isnull().sum())

In [None]:
def text_length_analysis(df, name):
    df['text_length'] = df['text'].astype(str).apply(len)
    df['word_count'] = df['text'].astype(str).apply(lambda x: len(x.split()))

    print(f"\n{name} Text Length Statistics:")
    print(df[['text_length', 'word_count']].describe())

text_length_analysis(train_df, "Training")
text_length_analysis(val_df, "Validation")
text_length_analysis(test_df, "Test")


In [None]:
# Create output directory if it doesn't exist
output_dir = '/content/drive/MyDrive/text_classification (1)/results (1)'
os.makedirs(output_dir, exist_ok=True)

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Text length boxplot
train_df.boxplot(
    column='text_length',
    by='category',
    ax=axes[0],
    grid=False
)
axes[0].set_title('Text Length Distribution by Category', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Category', fontsize=10)
axes[0].set_ylabel('Text Length (characters)', fontsize=10)
axes[0].tick_params(axis='x', rotation=45)

# Word count boxplot
train_df.boxplot(
    column='word_count',
    by='category',
    ax=axes[1],
    grid=False
)
axes[1].set_title('Word Count Distribution by Category', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Category', fontsize=10)
axes[1].set_ylabel('Word Count', fontsize=10)
axes[1].tick_params(axis='x', rotation=45)

plt.suptitle('')  # removes automatic pandas title
plt.tight_layout()
plt.savefig(
    os.path.join(output_dir, 'text_length_analysis.png'),
    dpi=300,
    bbox_inches='tight'
)
plt.show()



## 3. Data Preprocessing <a id='preprocessing'></a>

In [None]:
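# NOTE: this function relies on `string` and on `expand_contractions`,
# which is defined in the setup cell below; run that cell first.
import string

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()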

def preprocess_text_strong(
    text,
    remove_stopwords=True,
    lemmatize=True,
    normalize_numbers=True
):
    # Ensure string & lowercase
    text = str(text).lower()

    # Remove HTML tags
    text = re.sub(r'<.*?>', ' ', text)

    # Remove URLs & emails
    text = re.sub(r'http\S+|www\S+|https\S+', ' ', text)
    text = re.sub(r'\S+@\S+', ' ', text)

    # Expand contractions
    text = expand_contractions(text)

    # Normalize numbers
    if normalize_numbers:
        text = re.sub(r'\d+', ' num ', text)

    # Remove punctuation & special characters
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Tokenize
    tokens = word_tokenize(text)

    # Remove stopwords
    if remove_stopwords:
        tokens = [t for t in tokens if t not in stop_words]

    # Lemmatization
    if lemmatize:
        tokens = [lemmatizer.lemmatize(t) for t in tokens]

    # Remove junk tokens
    tokens = [
        t for t in tokens
        if 3 <= len(t) <= 20 and t.isalpha()
    ]

    # Final clean text
    return ' '.join(tokens)


In [None]:
# ====== NLP PREPROCESSING SETUP (RUN ONCE) ======

import re
import string
import nltk

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

# Contractions
CONTRACTIONS = {
    "don't": "do not", "can't": "can not", "won't": "will not",
    "i'm": "i am", "it's": "it is", "that's": "that is",
    "what's": "what is", "there's": "there is",
    "isn't": "is not", "aren't": "are not", "wasn't": "was not",
    "weren't": "were not", "haven't": "have not", "hasn't": "has not",
    "hadn't": "had not", "wouldn't": "would not",
    "couldn't": "could not", "shouldn't": "should not"
}

def expand_contractions(text):
    for c, e in CONTRACTIONS.items():
        text = re.sub(rf"\b{c}\b", e, text)
    return text

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(
    text,
    remove_stopwords=True,
    lemmatize=True,
    normalize_numbers=True
):
    text = str(text).lower()

    # Remove HTML
    text = re.sub(r'<.*?>', ' ', text)

    # Remove URLs / emails
    text = re.sub(r'http\S+|www\S+|https\S+', ' ', text)
    text = re.sub(r'\S+@\S+', ' ', text)

    # Expand contractions
    text = expand_contractions(text)

    # Normalize numbers
    if normalize_numbers:
        text = re.sub(r'\d+', ' num ', text)

    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Tokenize
    tokens = word_tokenize(text)

    # Stopword removal
    if remove_stopwords:
        tokens = [t for t in tokens if t not in stop_words]

    # Lemmatization
    if lemmatize:
        tokens = [lemmatizer.lemmatize(t) for t in tokens]

    # Remove junk tokens
    tokens = [t for t in tokens if 3 <= len(t) <= 20 and t.isalpha()]

    return ' '.join(tokens)
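
To see what the regex stages do before NLTK gets involved, here is a standard-library-only sketch. `basic_clean` and `SAMPLE_CONTRACTIONS` are illustrative names, and the sketch leaves out tokenization with `word_tokenize`, stopword removal, lemmatization and the junk-token filter:

```python
import re
import string

# Small illustrative subset of the CONTRACTIONS table
SAMPLE_CONTRACTIONS = {"don't": "do not", "it's": "it is"}

def basic_clean(text):
    """Apply the regex-based cleaning steps, without the NLTK stages."""
    text = str(text).lower()
    text = re.sub(r'<.*?>', ' ', text)            # HTML tags
    text = re.sub(r'http\S+|www\S+', ' ', text)   # URLs
    text = re.sub(r'\S+@\S+', ' ', text)          # e-mail addresses
    for short, full in SAMPLE_CONTRACTIONS.items():
        text = re.sub(rf"\b{re.escape(short)}\b", full, text)
    text = re.sub(r'\d+', ' num ', text)          # numbers -> placeholder
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text.split()

print(basic_clean("<p>It's 2024: don't visit http://example.com!</p>"))
# ['it', 'is', 'num', 'do', 'not', 'visit']
```

Contractions are expanded before punctuation is stripped; otherwise the apostrophes would already be gone and "don't" would never match.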


In [None]:
print("Preprocessing training set...")
train_df['processed_text'] = train_df['text'].apply(preprocess_text_strong)

print("Preprocessing validation set...")
val_df['processed_text'] = val_df['text'].apply(preprocess_text_strong)

print("Preprocessing test set...")
test_df['processed_text'] = test_df['text'].apply(preprocess_text_strong)

print("Preprocessing complete!")


In [None]:
# Define features and labels
X_train = train_df['processed_text']
y_train = train_df['category']

X_val   = val_df['processed_text']
y_val   = val_df['category']

X_test  = test_df['processed_text']
y_test  = test_df['category']

# Encode labels to integers
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
y_train_enc = label_encoder.fit_transform(y_train)
y_val_enc   = label_encoder.transform(y_val)
y_test_enc  = label_encoder.transform(y_test)

print("Number of classes:", len(label_encoder.classes_))
print("Sample encoded labels:", y_train_enc[:5])
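
For intuition, `LabelEncoder` assigns integers to the sorted distinct labels. A plain-Python sketch of that mapping, using made-up category names and a hypothetical helper `fit_label_mapping`:

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer, in sorted order."""
    classes = sorted(set(labels))
    return classes, {label: idx for idx, label in enumerate(classes)}

# Made-up labels for illustration
labels = ["sport", "business", "tech", "sport"]
classes, to_index = fit_label_mapping(labels)
encoded = [to_index[label] for label in labels]
decoded = [classes[i] for i in encoded]

print(classes)  # ['business', 'sport', 'tech']
print(encoded)  # [1, 0, 2, 1]
```

Because the mapping is fitted on the training labels only, a label that appears only in the validation or test split would raise an error in `transform`.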


## 4. Feature Extraction - TF-IDF <a id='tfidf'></a>

**TF-IDF (Term Frequency-Inverse Document Frequency)** is a numerical statistic that reflects how important a word is to a document in a collection.
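
As a rough illustration of the weighting, the sketch below computes TF-IDF by hand on a made-up three-document corpus. It uses raw term counts, the smoothed IDF `ln((1 + n) / (1 + df)) + 1` and L2 normalization, which match `TfidfVectorizer`'s defaults; the whitespace tokenization is a simplification, and `tfidf_vectors` is a hypothetical helper:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute L2-normalised TF-IDF vectors for whitespace-tokenised docs."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for doc in tokenised for tok in doc})
    n_docs = len(tokenised)
    # Document frequency: number of documents containing each term
    df = Counter(tok for doc in tokenised for tok in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for doc in tokenised:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        vectors.append([v / norm for v in row])
    return vocab, vectors

docs = ["market shares fall", "team wins match", "market team news"]
vocab, vectors = tfidf_vectors(docs)
for term, weight in zip(vocab, vectors[0]):
    if weight > 0:
        print(f"{term}: {weight:.3f}")
```

In the first document, "market" gets a lower weight than "shares" or "fall" because it also appears in another document.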

In [None]:
# ======= Imports =======
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Assume these already exist from previous steps:
# train_df, val_df, test_df with column 'processed_text'

X_train = train_df['processed_text']
X_val   = val_df['processed_text']
X_test  = test_df['processed_text']



# ====== TF-IDF Vectorizer ======
tfidf_vectorizer = TfidfVectorizer(
    max_features=5000,  # Keep top 5000 features
    min_df=2,           # Ignore terms that appear in less than 2 documents
    max_df=0.8,         # Ignore terms that appear in more than 80% of documents
    ngram_range=(1, 2)  # Use unigrams and bigrams
)

# Fit on training data, transform train/val/test
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_val_tfidf   = tfidf_vectorizer.transform(X_val)
X_test_tfidf  = tfidf_vectorizer.transform(X_test)

print(f"TF-IDF features created!")
print(f"Train shape: {X_train_tfidf.shape}")
print(f"Validation shape: {X_val_tfidf.shape}")
print(f"Test shape: {X_test_tfidf.shape}")
print(f"Vocabulary size: {len(tfidf_vectorizer.vocabulary_)}")

# ====== Optional: reshape for LSTM ======
# NOTE: TF-IDF vectors are not sequential. Adding a singleton timestep axis gives
# the (samples, timesteps=1, features) shape that the LSTM layer expects.
X_train_tfidf_seq = np.expand_dims(X_train_tfidf.toarray(), axis=1)
X_val_tfidf_seq   = np.expand_dims(X_val_tfidf.toarray(), axis=1)
X_test_tfidf_seq  = np.expand_dims(X_test_tfidf.toarray(), axis=1)

print("TF-IDF reshaped for LSTM:", X_train_tfidf_seq.shape)


In [None]:
def get_top_tfidf_terms(vectorizer, X, y, category, n=10):
    """
    Returns the top n TF-IDF terms for a given category.
    """
    # Boolean mask -> row indices
    mask = y == category
    indices = np.where(mask)[0]  # get row numbers

    # Select only rows for this category
    X_category = X[indices]

    # Sum TF-IDF scores across documents
    tfidf_sum = np.array(X_category.sum(axis=0)).flatten()

    # Feature names
    feature_names = vectorizer.get_feature_names_out()

    # Indices of top n terms
    top_indices = tfidf_sum.argsort()[-n:][::-1]

    return [(feature_names[i], tfidf_sum[i]) for i in top_indices]



## 5. Feature Extraction - GloVe <a id='glove'></a>

**GloVe (Global Vectors)** creates word embeddings by aggregating global word-word co-occurrence statistics.

In [None]:
# ====== Imports ======
import os
import numpy as np

# ====== Path to GloVe file ======
GLOVE_PATH = '/content/drive/MyDrive/text_classification (1)/glove/glove.6B.100d.txt'  # update if needed
# ====== Load GloVe embeddings ======
def load_glove_embeddings(glove_path):
    """
    Loads GloVe embeddings into a dictionary.
    Key: word
    Value: np.array vector
    """
    embeddings = {}

    if not os.path.exists(glove_path):
        print(f"WARNING: GloVe file not found at {glove_path}")
        print("Please download from: https://nlp.stanford.edu/projects/glove/")
        print("Recommended: glove.6B.zip (glove.6B.100d.txt or 300d)")
        return None

    print("Loading GloVe embeddings...")
    with open(glove_path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.strip().split()
            word = values[0]
            vector = np.asarray(values[1:], dtype='float32')
            embeddings[word] = vector

    # Summary
    print(f"Loaded {len(embeddings)} word vectors")
    print(f"Vector dimension: {len(next(iter(embeddings.values())))}")
    return embeddings

# ===== Load GloVe =====
glove_embeddings = load_glove_embeddings(GLOVE_PATH)

# Optional test
if glove_embeddings:
    test_word = 'amazon'
    if test_word in glove_embeddings:
        print(f"\nExample embedding for '{test_word}':\n", glove_embeddings[test_word][:10], "...")


In [None]:
import numpy as np
from typing import List

# -------------------------
# SETTINGS
# -------------------------
glove_path = "/content/drive/MyDrive/text_classification (1)/glove/glove.6B.300d.txt"  # change this to your local GloVe file path
embedding_dim = 300
max_len = 100  # for sequence embeddings

# -------------------------
# LOAD GLOVE
# -------------------------
print("Loading GloVe embeddings...")
glove_embeddings = {}
with open(glove_path, "r", encoding="utf8") as f:
    for line in f:
        parts = line.strip().split()
        word = parts[0]
        vector = np.array(parts[1:], dtype=np.float32)
        if vector.shape[0] == embedding_dim:  # sanity check
            glove_embeddings[word] = vector
print(f"Loaded {len(glove_embeddings)} word vectors.")

# -------------------------
# TOKENIZATION
# -------------------------
def tokenize(text: str) -> List[str]:
    return text.lower().split()

def get_word_embedding(word: str, glove_embeddings: dict) -> np.ndarray:
    return glove_embeddings.get(word, None)

# -------------------------
# DOCUMENT-LEVEL EMBEDDING
# -------------------------
def get_doc_embedding_glove(text: str, glove_embeddings: dict, embedding_dim=300) -> np.ndarray:
    tokens = tokenize(text)
    embeddings = [get_word_embedding(word, glove_embeddings) for word in tokens]
    embeddings = [e for e in embeddings if e is not None]

    if not embeddings:
        return np.zeros(embedding_dim, dtype=np.float32)
    return np.mean(np.stack(embeddings, axis=0), axis=0).astype(np.float32)

def create_doc_embeddings(texts, glove_embeddings, embedding_dim=300):
    return np.vstack([get_doc_embedding_glove(text, glove_embeddings, embedding_dim) for text in texts])

# -------------------------
# SEQUENCE-LEVEL EMBEDDING (FOR LSTM)
# -------------------------
def get_sequence_embedding_glove(text: str, glove_embeddings: dict, embedding_dim=300, max_len=100) -> np.ndarray:
    tokens = tokenize(text)
    embeddings = [get_word_embedding(word, glove_embeddings) for word in tokens]
    embeddings = [e for e in embeddings if e is not None]

    seq = embeddings[:max_len]  # truncate
    pad_length = max_len - len(seq)
    if pad_length > 0:
        seq += [np.zeros(embedding_dim, dtype=np.float32)] * pad_length

    return np.stack(seq, axis=0).astype(np.float32)

def create_sequence_embeddings(texts, glove_embeddings, embedding_dim=300, max_len=100):
    return np.stack([get_sequence_embedding_glove(text, glove_embeddings, embedding_dim, max_len) for text in texts])

# -------------------------
# USAGE EXAMPLE
# -------------------------
# Replace X_train, X_val, X_test with your actual datasets
# Example: X_train = ["This is a sentence.", "Another document!"]

print("Creating document-level embeddings...")
X_train_doc = create_doc_embeddings(X_train, glove_embeddings, embedding_dim)
X_val_doc   = create_doc_embeddings(X_val, glove_embeddings, embedding_dim)
X_test_doc  = create_doc_embeddings(X_test, glove_embeddings, embedding_dim)
print("Document-level shapes:", X_train_doc.shape, X_val_doc.shape, X_test_doc.shape)

print("Creating sequence-level embeddings for LSTM...")
X_train_seq = create_sequence_embeddings(X_train, glove_embeddings, embedding_dim, max_len)
X_val_seq   = create_sequence_embeddings(X_val, glove_embeddings, embedding_dim, max_len)
X_test_seq  = create_sequence_embeddings(X_test, glove_embeddings, embedding_dim, max_len)
print("Sequence-level shapes:", X_train_seq.shape, X_val_seq.shape, X_test_seq.shape)
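
The difference between the two representations is easier to see on toy data. This sketch uses made-up 2-dimensional vectors and plain lists instead of NumPy; the names `toy_embeddings`, `average_embedding` and `padded_sequence` are illustrative only:

```python
# Toy 2-dimensional "embeddings"; the GloVe vectors loaded above have 300
toy_embeddings = {"market": [1.0, 0.0], "falls": [0.0, 1.0]}
DIM = 2

def known_vectors(text):
    """Look up each token and drop out-of-vocabulary words."""
    return [toy_embeddings[t] for t in text.lower().split() if t in toy_embeddings]

def average_embedding(text):
    """Document-level: one vector per document."""
    vecs = known_vectors(text)
    if not vecs:
        return [0.0] * DIM
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def padded_sequence(text, max_len):
    """Sequence-level: max_len vectors per document, truncated or zero-padded."""
    vecs = known_vectors(text)[:max_len]
    return vecs + [[0.0] * DIM] * (max_len - len(vecs))

print(average_embedding("Market falls sharply"))   # [0.5, 0.5]
print(padded_sequence("Market falls sharply", 3))  # [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
```

The averaged vector discards word order, while the padded sequence keeps it and is the form an LSTM can actually exploit.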



## 6. Feature Extraction - FastText <a id='fasttext'></a>

**FastText** learns word embeddings using neural networks and subword information, making it robust to rare words and typos.
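
The subword idea can be sketched in a few lines: each word is wrapped in boundary markers `<` and `>` and split into character n-grams, whose vectors are combined into the word vector. `char_ngrams` is a hypothetical helper for illustration, not gensim's internal code:

```python
def char_ngrams(word, min_n=3, max_n=5):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

print(char_ngrams("news", min_n=3, max_n=3))
# ['<ne', 'new', 'ews', 'ws>']
```

A misspelling such as "newss" still shares most of these n-grams with "news", which is why FastText can produce a sensible vector for words it never saw in training.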

In [None]:
# ===== FastText Training (RAM-safe, single cell) =====
from gensim.models import FastText
from gensim.utils import simple_preprocess

# -------- Re-iterable streaming corpus (FIX) --------
class corpus_iterator:
    def __init__(self, texts):
        self.texts = texts

    def __iter__(self):
        for doc in self.texts:
            yield simple_preprocess(doc)

# -------- Hyperparameters --------
VECTOR_SIZE = 100
WINDOW = 5
MIN_COUNT = 5
WORKERS = 2
SG = 1
EPOCHS = 15
MIN_N = 3
MAX_N = 5
ALPHA = 0.025
MIN_ALPHA = 0.0001

print("Initializing FastText model...")
fasttext_model = FastText(
    vector_size=VECTOR_SIZE,
    window=WINDOW,
    min_count=MIN_COUNT,
    workers=WORKERS,
    sg=SG,
    min_n=MIN_N,
    max_n=MAX_N,
    alpha=ALPHA,
    min_alpha=MIN_ALPHA
)

print("Building vocabulary...")
fasttext_model.build_vocab(corpus_iterator(X_train))
print(f"Vocabulary size: {len(fasttext_model.wv)}")

print(f"Training FastText model for {EPOCHS} epochs...")
fasttext_model.train(
    corpus_iterator(X_train),
    total_examples=fasttext_model.corpus_count,
    epochs=EPOCHS
)

print("FastText training complete!")


## 7. Model Training - LSTM <a id='lstm'></a>

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout, Bidirectional, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
import matplotlib.pyplot as plt
import numpy as np

def train_lstm_best(X_train, y_train, X_val, y_val, num_classes=None,
                    lstm_units=128, dropout_rate=0.3, epochs=30, batch_size=32,
                    name="LSTM"):
    """
    Flexible LSTM trainer for TF-IDF, GloVe, FastText embeddings.

    X_train, X_val: np.array (2D for TF-IDF or 3D for embeddings)
    y_train, y_val: categorical (one-hot) or integer labels (sparse)
    num_classes: detected automatically if None
    """

    # Auto-detect number of classes
    if num_classes is None:
        if y_train.ndim == 1:  # sparse labels
            num_classes = int(np.max(y_train) + 1)
        else:  # one-hot labels
            num_classes = y_train.shape[1]

    input_shape = X_train.shape[1:]
    model = Sequential()
    model.add(Input(shape=input_shape))

    # Choose hidden layers based on the per-sample input shape
    if len(input_shape) == 2:    # (timesteps, features): sequence input
        model.add(Bidirectional(LSTM(lstm_units, return_sequences=False)))
    elif len(input_shape) == 1:  # (features,): flat input such as raw TF-IDF
        model.add(Dense(256, activation='relu'))
    else:
        raise ValueError(f"Unsupported input shape: {input_shape}")
    model.add(Dropout(dropout_rate))

    # Output layer; the loss depends on how the labels are encoded
    model.add(Dense(num_classes, activation='softmax'))
    loss = ('sparse_categorical_crossentropy' if y_train.ndim == 1
            else 'categorical_crossentropy')

    model.compile(
        loss=loss,
        optimizer=Adam(learning_rate=0.001),
        metrics=['accuracy']
    )

    # Early stopping
    early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

    # Train
    history = model.fit(
        X_train, y_train,
        validation_data=(X_val, y_val),
        epochs=epochs,
        batch_size=batch_size,
        callbacks=[early_stop],
        verbose=2
    )

    # Plot
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    axes[0].plot(history.history['accuracy'], 'b-o', label='Train Accuracy')
    axes[0].plot(history.history['val_accuracy'], 'r-s', label='Val Accuracy')
    axes[0].set_title(f'{name} Accuracy')
    axes[0].set_xlabel('Epoch')
    axes[0].set_ylabel('Accuracy')
    axes[0].legend()
    axes[0].grid(True)

    axes[1].plot(history.history['loss'], 'b-o', label='Train Loss')
    axes[1].plot(history.history['val_loss'], 'r-s', label='Val Loss')
    axes[1].set_title(f'{name} Loss')
    axes[1].set_xlabel('Epoch')
    axes[1].set_ylabel('Loss')
    axes[1].legend()
    axes[1].grid(True)

    plt.tight_layout()
    plt.show()

    return model, history
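
The `EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)` callback used above can be understood with a small simulation. This is a simplified model of the stopping rule on made-up loss values, not Keras's implementation; `early_stopping_epoch` is a hypothetical helper:

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training stops once val_loss has failed to improve for `patience`
    consecutive epochs; best_epoch is where the weights would be restored from.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Made-up validation losses
losses = [0.90, 0.60, 0.45, 0.47, 0.46, 0.50, 0.52]
print(early_stopping_epoch(losses, patience=3))  # (5, 2)
```

Here training would stop after the sixth epoch and the weights from the third epoch would be kept, so the `epochs=30` setting acts only as an upper bound.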


In [None]:
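# ===== Train model using TF-IDF features =====
# Use the 3D arrays (samples, 1, features) so the model takes the LSTM branch
# and matches the X_test_tfidf_seq input used at evaluation time.

model_tfidf, history_tfidf = train_lstm_best(
    X_train=X_train_tfidf_seq,
    y_train=y_train_enc,
    X_val=X_val_tfidf_seq,
    y_val=y_val_enc,
    epochs=30,
    batch_size=32,
    name="TF-IDF LSTM"
)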


In [None]:
# Alias the trained model under the names used in the evaluation cells
tfidf_model, tfidf_history = model_tfidf, history_tfidf

In [None]:
print(type(tfidf_model))

In [None]:
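# GloVe LSTM
# Add a single timestep to the averaged document vectors: (n, 300) -> (n, 1, 300)
X_train_glove_seq = np.expand_dims(X_train_doc, axis=1)
X_val_glove_seq   = np.expand_dims(X_val_doc, axis=1)
X_test_glove_seq  = np.expand_dims(X_test_doc, axis=1)
glove_model, glove_history = train_lstm_best(
    X_train_glove_seq, y_train_enc,
    X_val_glove_seq, y_val_enc,
    epochs=30,
    name="GloVe LSTM"
)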

In [None]:
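# FastText LSTM
# Assumption: each document is the average of its FastText word vectors with one
# timestep, shape (n, 1, VECTOR_SIZE). This overwrites the GloVe X_*_seq arrays.
def fasttext_doc_vector(text):
    tokens = simple_preprocess(text)
    if not tokens:
        return np.zeros(VECTOR_SIZE, dtype=np.float32)
    return np.mean([fasttext_model.wv[t] for t in tokens], axis=0)

X_train_seq, X_val_seq, X_test_seq = (
    np.expand_dims(np.vstack([fasttext_doc_vector(t) for t in X]), axis=1)
    for X in (X_train, X_val, X_test)
)
fasttext_model_lstm, fasttext_history = train_lstm_best(
    X_train_seq, y_train_enc, X_val_seq, y_val_enc, epochs=30, name="FastText LSTM"
)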

## 8. Results Comparison <a id='comparison'></a>

In [None]:
print("TF-IDF model:", type(tfidf_model))
print("GloVe model:", type(glove_model))
print("FastText model:", type(fasttext_model_lstm))


In [None]:
def evaluate_model(model, X_test, y_test):
    y_pred = np.argmax(model.predict(X_test), axis=1)
    report = classification_report(y_test, y_pred, target_names=label_encoder.classes_)
    return report

print("TF-IDF LSTM Test Metrics:\n")
print(evaluate_model(tfidf_model, X_test_tfidf_seq, y_test_enc))

print("\nGloVe LSTM Test Metrics:\n")
print(evaluate_model(glove_model, X_test_glove_seq, y_test_enc))

print("\nFastText LSTM Test Metrics:\n")
print(evaluate_model(fasttext_model_lstm, X_test_seq, y_test_enc))


In [None]:
def evaluate_model_with_cm(model, X_test, y_test, categories, name="Model"):
    y_pred = np.argmax(model.predict(X_test), axis=1)

    print(f"\n{name} Classification Report:")
    print(classification_report(y_test, y_pred, target_names=categories))

    cm = confusion_matrix(y_test, y_pred)
    plt.figure(figsize=(6,5))
    sns.heatmap(cm, annot=True, fmt='d', xticklabels=categories, yticklabels=categories, cmap='Blues')
    plt.title(f'{name} Confusion Matrix')
    plt.xlabel('Predicted')
    plt.ylabel('True')
    plt.show()

    # Return metrics
    return {
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred, average='macro'),
        'Recall': recall_score(y_test, y_pred, average='macro'),
        'F1': f1_score(y_test, y_pred, average='macro')
    }
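
The `average='macro'` setting computes each metric per class and then takes the unweighted mean, so every category counts equally regardless of its size. A plain-Python sketch on made-up labels, with a hypothetical helper `macro_scores`:

```python
def macro_scores(y_true, y_pred, num_classes):
    """Return macro-averaged (precision, recall, f1) for integer labels."""
    precisions, recalls, f1s = [], [], []
    pairs = list(zip(y_true, y_pred))
    for c in range(num_classes):
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = num_classes
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Made-up labels for three classes
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
precision, recall, f1 = macro_scores(y_true, y_pred, num_classes=3)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Note that macro F1 is the mean of the per-class F1 scores, not the harmonic mean of macro precision and macro recall.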


In [None]:
def summarize_metrics(model, X_test, y_test, name):
    y_pred = np.argmax(model.predict(X_test), axis=1)
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
    return {
        'Embedding': name,
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred, average='macro'),
        'Recall': recall_score(y_test, y_pred, average='macro'),
        'F1': f1_score(y_test, y_pred, average='macro')
    }

results = pd.DataFrame([
    summarize_metrics(tfidf_model, X_test_tfidf_seq, y_test_enc, 'TF-IDF'),
    summarize_metrics(glove_model, X_test_glove_seq, y_test_enc, 'GloVe'),
    summarize_metrics(fasttext_model_lstm, X_test_seq, y_test_enc, 'FastText')
])

print("\nComparison of LSTM Models with Different Embeddings:")
print(results)




**TF-IDF performs best.** It reaches the highest accuracy (96.6%) and macro F1 (96.4%). Sparse bag-of-words features capture category-specific vocabulary well, which is consistent with the small training set (~1,335 samples): TF-IDF tends to be effective in low-data settings.

**GloVe is very close.** Accuracy is 96.2% and F1 is 96.0%. The pretrained embeddings work well, but because the word vectors are averaged over the whole document (no real sequence), some contextual information is lost. Averaging likely smooths out the most discriminative words, which would explain the small gap to TF-IDF.

**FastText is slightly lower.** Accuracy is 94.6% and F1 is 94.5%. The model works reasonably well, but the gap shows that subword information was not enough to beat TF-IDF. Results could improve if the LSTM were trained on full word sequences instead of document-level averages.

## 9. Conclusion <a id='conclusion'></a>

### Embeddings Comparison

**TF-IDF**: Performed the best on this dataset, achieving 96.6% accuracy. Captures important words in each category effectively, especially for smaller datasets.

**GloVe**: Slightly lower performance (96.2% accuracy), but pretrained embeddings capture semantic relationships between words. Averaging word vectors works well, but some context is lost.

**FastText**: Accuracy of 94.6%, slightly behind the others. Subword information is robust for rare words, but averaging across the document reduces its advantage.

### Model Comparison

**LSTM with TF-IDF**: Works surprisingly well on small-to-medium datasets. Because each TF-IDF vector is fed as a single timestep, the model relies on key category words rather than on word order.

**LSTM with GloVe**: Slightly lower than TF-IDF due to averaging embeddings, but strong generalization potential.

**LSTM with FastText**: Decent performance, could improve with sequence-level input rather than averaged embeddings.

### Next Steps

Perform hyperparameter tuning (learning rate, LSTM units, dropout) to improve performance.

Explore ensemble methods combining multiple embeddings (e.g., TF-IDF + GloVe) to capture both statistical and semantic features.

Conduct error analysis on misclassified examples to understand model weaknesses.

Compare results with other deep learning models (RNN, GRU) to see if different architectures improve performance.
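
One simple way to combine models built on different features is soft voting: average the class probabilities from each model and take the argmax. The sketch below uses made-up probabilities and a hypothetical helper `soft_vote`; it is one possible ensemble scheme, not something evaluated in this notebook:

```python
def soft_vote(prob_sets):
    """Average class probabilities across models and pick the argmax per sample."""
    n_models = len(prob_sets)
    predictions = []
    for per_sample in zip(*prob_sets):
        avg = [sum(p) / n_models for p in zip(*per_sample)]
        predictions.append(max(range(len(avg)), key=avg.__getitem__))
    return predictions

# Made-up probabilities for 2 samples and 3 classes from two hypothetical models
tfidf_probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
glove_probs = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
print(soft_vote([tfidf_probs, glove_probs]))  # [0, 2]
```

With real models, the inputs would be the outputs of `model.predict` on each model's own test features, provided all models share the same label encoding.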