# Model Training - using CONDA Dataset

**Roshan Saigal & Mickey Paulus**

# CONDA Model Training Pipeline

This notebook trains three models on the preprocessed CONDA dataset: Multinomial Naive Bayes, Logistic Regression, and Latent Dirichlet Allocation (LDA).

## Summary
- Loads preprocessed features and labels from the preprocessing pipeline
- Splits data into training and testing sets (80/20 split)
- Trains three models:
  - Multinomial Naive Bayes
  - Logistic Regression
  - Latent Dirichlet Allocation (LDA), plus a Logistic Regression classifier trained on its topic distributions
- Saves trained models for later inference

**Note:** Model evaluation, comparison, and visualization are performed in `model_analysis.ipynb`.

## Inputs
- `preprocessing_outputs/features/X_tfidf.npz` – TF-IDF matrix
- `preprocessing_outputs/features/y.npy` – Binary labels
- `preprocessing_outputs/conda_cleaned_binary_without_chat_time.csv` – Cleaned text data for LDA

## Outputs
- `model_outputs/naive_bayes_model.joblib` – trained Multinomial Naive Bayes
- `model_outputs/logistic_regression_model.joblib` – trained Logistic Regression
- `model_outputs/lda_model.joblib` – trained LDA model
- `model_outputs/lda_count_vectorizer.joblib` – Count vectorizer for LDA
- `model_outputs/lda_lr_classifier.joblib` – LDA-based classifier
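
The saved `.joblib` files can be read back with `joblib.load` for later inference. The following is a minimal sketch of that round trip, assuming a toy count matrix and a temporary directory in place of the real features and `model_outputs/`; the names `X_toy`, `y_toy`, and `restored` are illustrative only.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for the real TF-IDF features and labels (assumption)
X_toy = np.array([[2, 0, 1],
                  [0, 3, 1],
                  [1, 0, 2],
                  [0, 2, 0]])
y_toy = np.array([0, 1, 0, 1])

model = MultinomialNB(alpha=1.0).fit(X_toy, y_toy)

# Save to a temporary directory, then load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "naive_bayes_model.joblib")
    dump(model, model_path)
    restored = load(model_path)

# The restored model should predict exactly like the original
same_predictions = np.array_equal(model.predict(X_toy), restored.predict(X_toy))
print(f"Restored model reproduces predictions: {same_predictions}")
```

Loading a model generally works most reliably with the same scikit-learn version that was used to save it.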


## 0) Set Paths for Directories

In [1]:
import os

# Preprocessing input paths
PREPROCESSING_DIR = "../without_chat_time/preprocessing_outputs"
FEATURES_DIR      = f"{PREPROCESSING_DIR}/features"

# Model output paths
OUTPUT_DIR = "../without_chat_time/model_outputs"
MODELS_DIR = OUTPUT_DIR

# Create the output directory if it does not already exist
os.makedirs(OUTPUT_DIR, exist_ok=True)

## 1) Imports

In [2]:
import pandas as pd
import numpy as np
from scipy import sparse
from joblib import dump

# Scikit-learn library imports
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

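# Setting a seed for reproducibility.
# zlib.crc32 is used instead of the built-in hash(), whose value for strings
# changes between Python processes unless PYTHONHASHSEED is fixed.
import zlib
RANDOM_STATE = zlib.crc32(b"I love MAXXXXXXX!")  # deterministic, fits in 32 bits
np.random.seed(RANDOM_STATE)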

## 2) Load Preprocessed Data

In [3]:
# Load TF-IDF features (sparse matrix)
X = sparse.load_npz(f"{FEATURES_DIR}/X_tfidf.npz")
print(f"Loaded TF-IDF features shape: {X.shape}")
print(f"Sparsity: {1.0 - X.nnz / (X.shape[0] * X.shape[1]):.2%}")

# Load labels
y = np.load(f"{FEATURES_DIR}/y.npy")
print(f"Loaded labels shape: {y.shape}")

# Check class distribution
unique, counts = np.unique(y, return_counts=True)
print(f"\nClass distribution:")
for label, count in zip(unique, counts):
    class_name = "Toxic" if label == 1 else "Non-Toxic"
    percentage = (count / len(y)) * 100
    print(f"  {class_name} ({label}): {count} samples ({percentage:.1f}%)")

Loaded TF-IDF features shape: (26914, 6171)
Sparsity: 99.96%
Loaded labels shape: (26914,)

Class distribution:
  Non-Toxic (0): 21694 samples (80.6%)
  Toxic (1): 5220 samples (19.4%)
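
With roughly four non-toxic samples for every toxic one, accuracy alone can be misleading: a classifier that always predicts the majority class already scores about 80.6%. The sketch below recomputes that baseline from the class counts printed above; `y_demo` is a synthetic label vector rebuilt from those counts, not the real `y`.

```python
import numpy as np

# Class counts taken from the output above
n_non_toxic, n_toxic = 21694, 5220

# Synthetic label vector with the same class counts (stand-in for y)
y_demo = np.concatenate([
    np.zeros(n_non_toxic, dtype=int),
    np.ones(n_toxic, dtype=int),
])

# Accuracy of a classifier that always predicts the most frequent label
majority_label = np.bincount(y_demo).argmax()
baseline_accuracy = np.mean(y_demo == majority_label)
print(f"Majority label: {majority_label}, baseline accuracy: {baseline_accuracy:.1%}")
```

Any trained model should be compared against this figure, not against 50%.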


## 3) Train-Test Split

In [4]:
# Split data into 80% training and 20% testing, a common ratio for model training

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=RANDOM_STATE,
    stratify=y  # keeps the class distribution the same in both sets
)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")
train_unique, train_counts = np.unique(y_train, return_counts=True)
for label, count in zip(train_unique, train_counts):
    class_name = "Toxic" if label == 1 else "Non-Toxic"
    percentage = (count / len(y_train)) * 100
    print(f"  {class_name} ({label}): {count} ({percentage:.1f}%)")

test_unique, test_counts = np.unique(y_test, return_counts=True)
for label, count in zip(test_unique, test_counts):
    class_name = "Toxic" if label == 1 else "Non-Toxic"
    percentage = (count / len(y_test)) * 100
    print(f"  {class_name} ({label}): {count} ({percentage:.1f}%)")

Training set size: 21531 samples
Testing set size: 5383 samples
  Non-Toxic (0): 17355 (80.6%)
  Toxic (1): 4176 (19.4%)
  Non-Toxic (0): 4339 (80.6%)
  Toxic (1): 1044 (19.4%)


## 4) Train Multinomial Naive Bayes

In [5]:
nb_model = MultinomialNB(alpha=1.0)
nb_model.fit(X_train, y_train)
print("Multinomial Naive Bayes trained")

dump(nb_model, f"{MODELS_DIR}/naive_bayes_model.joblib")
print(f"Model saved to {MODELS_DIR}/naive_bayes_model.joblib")

Multinomial Naive Bayes trained
Model saved to ../without_chat_time/model_outputs/naive_bayes_model.joblib
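
The `alpha=1.0` argument is add-one (Laplace) smoothing: each feature's per-class total is increased by `alpha` before normalizing, so a term never seen in a class still gets a non-zero probability. A small sketch on toy counts (the arrays below are illustrative, not CONDA data) checks the fitted `feature_log_prob_` against the smoothing formula computed by hand.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# One toy document per class, three features (illustrative values)
X_toy = np.array([[2, 1, 0],
                  [0, 1, 3]])
y_toy = np.array([0, 1])
alpha = 1.0

nb_toy = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)

# Per-class feature totals; with one document per class these equal the rows of X_toy
class_feature_totals = X_toy.astype(float)
n_features = class_feature_totals.shape[1]

# log P(feature | class) = log((total + alpha) / (class_total + alpha * n_features))
manual_log_prob = np.log(
    (class_feature_totals + alpha)
    / (class_feature_totals.sum(axis=1, keepdims=True) + alpha * n_features)
)

matches = np.allclose(nb_toy.feature_log_prob_, manual_log_prob)
print(f"Manual smoothing matches scikit-learn: {matches}")
```

With TF-IDF input the totals are sums of fractional weights instead of integer counts, but the smoothing is applied the same way.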


## 5) Train Logistic Regression

In [6]:
lr_model = LogisticRegression(
    max_iter=1000,
    random_state=RANDOM_STATE,
    solver='lbfgs',  # using the lbfgs solver, suitable for small datasets
    n_jobs=-1
)
lr_model.fit(X_train, y_train)

dump(lr_model, f"{MODELS_DIR}/logistic_regression_model.joblib")
print(f"Model saved to {MODELS_DIR}/logistic_regression_model.joblib")

Model saved to ../without_chat_time/model_outputs/logistic_regression_model.joblib




## 6) Train Latent Dirichlet Allocation (LDA)

In [None]:
df_cleaned = pd.read_csv(f"{PREPROCESSING_DIR}/conda_cleaned_binary_without_chat_time.csv")
texts = df_cleaned['utterance_cleaned'].fillna('').astype(str).tolist()

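# Re-split the raw texts with the same test_size, random_state, and stratify
# arguments as in Section 3, so the rows match X_train / X_test.
# This assumes the CSV rows are in the same order as the rows of X and y.
texts_train, texts_test, y_train, y_test = train_test_split(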
    texts,
    y,
    test_size=0.2,
    random_state=RANDOM_STATE,
    stratify=y
)

count_vectorizer = CountVectorizer(
    max_features=5000,
    min_df=2,
    max_df=0.98
)
X_count_train = count_vectorizer.fit_transform(texts_train)

lda_model = LatentDirichletAllocation(
    n_components=15,
    random_state=RANDOM_STATE,
    max_iter=20,
    n_jobs=-1
)
lda_model.fit(X_count_train)

dump(lda_model, f"{MODELS_DIR}/lda_model.joblib")
dump(count_vectorizer, f"{MODELS_DIR}/lda_count_vectorizer.joblib")
print(f"LDA model saved to {MODELS_DIR}/lda_model.joblib")
print(f"Count vectorizer saved to {MODELS_DIR}/lda_count_vectorizer.joblib")


LDA model saved to ../without_chat_time/model_outputs/lda_model.joblib
Count vectorizer saved to ../without_chat_time/model_outputs/lda_count_vectorizer.joblib
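
The LDA cell splits the raw texts separately from the TF-IDF matrix, which only lines up if both calls select the same rows. The sketch below, on synthetic data with hypothetical names (`X_demo`, `texts_demo`, `y_demo`), checks that two `train_test_split` calls with the same `test_size`, `random_state`, and `stratify` arguments produce the same partition, even when one input is a sparse matrix and the other a list of strings.

```python
import numpy as np
from scipy import sparse
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 200

# Synthetic imbalanced labels, a sparse matrix, and texts that each encode their row index
y_demo = (rng.random(n) < 0.2).astype(int)
X_demo = sparse.csr_matrix(np.arange(n, dtype=float).reshape(-1, 1))
texts_demo = [f"doc {i}" for i in range(n)]

# Two independent splits with identical arguments
X_tr, X_te, y_tr_a, y_te_a = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
t_tr, t_te, y_tr_b, y_te_b = train_test_split(
    texts_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Recover the original row indices from each split and compare them
rows_from_matrix = X_tr.toarray().ravel().astype(int)
rows_from_texts = np.array([int(t.split()[1]) for t in t_tr])
same_partition = (
    np.array_equal(rows_from_matrix, rows_from_texts)
    and np.array_equal(y_tr_a, y_tr_b)
)
print(f"Both splits select the same training rows: {same_partition}")
```

The check only covers the split itself; the text rows still need to be in the same order as the rows of the feature matrix for the alignment to hold.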


## 7) Topic Distributions + LDA-based Classifier


In [None]:
# Display top words for each topic
feature_names = count_vectorizer.get_feature_names_out()
n_top_words = 10
print(f"\nTop {n_top_words} words per topic:")
for topic_idx, topic in enumerate(lda_model.components_):
    top_words_idx = topic.argsort()[-n_top_words:][::-1]
    top_words = [feature_names[i] for i in top_words_idx]
    print(f"Topic {topic_idx + 1}: {' '.join(top_words)}")

# Get topic distributions to use as classifier features.
# LDA is unsupervised, so it cannot predict labels on its own. Its per-document
# topic distributions are used as features for a Logistic Regression classifier,
# which makes it comparable with the supervised models.
X_count_test = count_vectorizer.transform(texts_test)
doc_topic_dist_train = lda_model.transform(X_count_train)

# Create an "LDA-based classifier" that can be compared w/ other models
lda_lr_model = LogisticRegression(
    max_iter=1000,
    random_state=RANDOM_STATE,
    solver='lbfgs',
    n_jobs=-1
)
lda_lr_model.fit(doc_topic_dist_train, y_train)

dump(lda_lr_model, f"{MODELS_DIR}/lda_lr_classifier.joblib")
print(f"\nLDA-based classifier saved to {MODELS_DIR}/lda_lr_classifier.joblib")



Top 10 words per topic:
Topic 1: ggwp xd good hahaha game like report plz idiot guy
Topic 2: haha fuck game ok go get ur bad play come
Topic 3: gg ez mid kill shit time rofl lmao yes w8
Topic 4: nice end ty noob wait im wtf commend fucking yeah
Topic 5: lol wp report pls sf team sad please pudge dont

LDA-based classifier saved to ../without_chat_time/model_outputs/lda_lr_classifier.joblib
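
At prediction time the three LDA artifacts are applied in the same order as in training: count vectorizer, then LDA, then the Logistic Regression classifier. The following is a minimal end-to-end sketch on a toy corpus; the texts, labels, and variable names are made up for illustration and the hyperparameters are not the ones used above.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus and labels (illustrative only)
train_texts = [
    "gg wp nice game",
    "good game well played",
    "report this noob",
    "noob team report pls",
    "nice play gg",
    "bad feeder report",
]
train_labels = np.array([0, 0, 1, 1, 0, 1])

# Training: word counts -> topic distributions -> classifier
vectorizer = CountVectorizer()
counts_train = vectorizer.fit_transform(train_texts)
lda = LatentDirichletAllocation(n_components=2, random_state=0, max_iter=10)
topics_train = lda.fit_transform(counts_train)
clf = LogisticRegression(max_iter=1000).fit(topics_train, train_labels)

# Inference: apply the fitted steps in the same order, using transform only
new_texts = ["gg nice game", "report noob"]
topics_new = lda.transform(vectorizer.transform(new_texts))
predictions = clf.predict(topics_new)
print(predictions)
```

Each row of `topics_new` is a distribution over topics that sums to 1, so the classifier sees `n_components` features per message regardless of vocabulary size.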


