# NLP Topic Modeling

**AUTHOR:** Nolan MacDonald

**DATE OF LAST SIGNIFICANT UPDATE:** 2024-NOV-02

**DESCRIPTION:** Enron Corpus Natural Language Processing (NLP) topic modeling

**GITHUB ISSUE #2:** https://github.com/nolmacdonald/INTA6450_Enron/issues/2

# Overview

Topic modeling here uses Latent Dirichlet Allocation (LDA), a generative probabilistic model that treats each email as a mixture of topics and each topic as a distribution over words. Fitting it to the Enron corpus groups emails by shared vocabulary, which could surface clusters of emails that discuss an act of wrongdoing. The model is implemented with scikit-learn's [LatentDirichletAllocation](https://scikit-learn.org/1.5/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html).
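
For reference, the generative process that LDA assumes can be written as below. The notation ($K$ topics, $D$ documents, priors $\alpha$ and $\eta$) is the conventional one and does not appear in the original notebook. In scikit-learn, $K$ corresponds to `n_components`, while $\alpha$ and $\eta$ correspond to `doc_topic_prior` and `topic_word_prior`.

$$
\begin{aligned}
\beta_k &\sim \mathrm{Dirichlet}(\eta), & k &= 1,\dots,K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D \\
z_{d,n} &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n} &\sim \mathrm{Categorical}(\beta_{z_{d,n}})
\end{aligned}
$$

Here $\beta_k$ is the word distribution of topic $k$, $\theta_d$ is the topic mixture of email $d$, $z_{d,n}$ is the topic assigned to the $n$-th word of that email, and $w_{d,n}$ is the observed word. Fitting the model estimates $\beta$ and $\theta$ from the observed words alone.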

# Load

In [10]:
import numpy as np
import pandas as pd
import sqlite3

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [2]:
# Connect to the SQLite database (sqlite3 creates an empty file if the path does not exist)
connection = sqlite3.connect("../data/emails.db")

# Load the dataframe from the SQLite database
emails_df = pd.read_sql_query("SELECT * FROM emails", connection)

# Close the connection
connection.close()

# Show email data
emails_df.head()

Unnamed: 0,text,message_id,date,from,to,subject,cc,bcc,mime-version,content-type,content-transfer-encoding,x-from,x-to,x-cc,x-bcc,folder,origin,filename,priority
0,---------------------- Forwarded by Rika Imai/...,<88180.1075863689140.JavaMail.evans@thyme>,"Tue, 8 May 2001 08:37:00 -0700 (PDT)",rika.imai@enron.com,"john.forney@enron.com, mike.carson@enron.com, ...",4 Month Rolling Forecast,,,1.0,text/plain; charset=ANSI_X3.4-1968,,Rika Imai,"John M Forney, Mike Carson, Clint Dean, Doug G...",,,\Rob_Benson_Jun2001\Notes Folders\Notes inbox,Benson-R,rbenson.nsf,normal
1,great,<4460514.1075857469666.JavaMail.evans@thyme>,"Wed, 21 Jun 2000 02:01:00 -0700 (PDT)",hunter.shively@enron.com,richard.tomaski@enron.com,Re: Jim Simpson,,,1.0,text/plain; charset=us-ascii,,Hunter S Shively,Richard Tomaski,,,\Hunter_Shively_Jun2001\Notes Folders\Sent,Shively-H,hshivel.nsf,normal
2,"oohh la la. who was your ""friend""? did you g...",<2160301.1075858147494.JavaMail.evans@thyme>,"Wed, 16 Aug 2000 03:03:00 -0700 (PDT)",matthew.lenhart@enron.com,shelliott@dttus.com,Re: Re[2]:,,,1.0,text/plain; charset=us-ascii,,Matthew Lenhart,Shirley Elliott <shelliott@dttus.com> @ ENRON,,,\Matthew_Lenhart_Jun2001\Notes Folders\Sent,Lenhart-M,mlenhar.nsf,normal
3,\nAttached are the two files with this week's ...,<22847680.1075863611080.JavaMail.evans@thyme>,"Wed, 15 Aug 2001 05:46:47 -0700 (PDT)",rika.imai@enron.com,"russell.ballato@enron.com, hicham.benjelloun@e...",FW: Nuclear Rolling Forecast,,,1.0,text/plain; charset=us-ascii,,"Imai, Rika </O=ENRON/OU=NA/CN=RECIPIENTS/CN=RI...","Ballato, Russell </O=ENRON/OU=NA/CN=RECIPIENTS...",,,"\ExMerge - Benson, Robert\Inbox\Large Messages",BENSON-R,rob benson 6-25-02.PST,normal
4,lm:\nWhat are your thoughts going forward........,<15012282.1075852957298.JavaMail.evans@thyme>,"Wed, 3 Oct 2001 00:35:05 -0700 (PDT)",jennifer.fraser@enron.com,larry.may@enron.com,hello,,,1.0,text/plain; charset=us-ascii,,"Fraser, Jennifer </O=ENRON/OU=NA/CN=RECIPIENTS...","May, Larry </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Lm...",,,\LMAY2 (Non-Privileged)\Inbox,May-L,LMAY2 (Non-Privileged).pst,normal


# Initial Topic Model

In [None]:
# Step 1: Select the text data
# For simplicity, use the raw 'text' column of the DataFrame.
# fillna("") guards against missing bodies, which CountVectorizer rejects.
text_data = emails_df["text"].fillna("").values

# Step 2: Convert the text data into a document-term matrix
vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words="english")
dtm = vectorizer.fit_transform(text_data)

# Step 3: Fit the LDA model to the document-term matrix
lda = LatentDirichletAllocation(n_components=10, random_state=42)
lda.fit(dtm)


# Step 4: Analyze the topics
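def display_topics(model, feature_names, no_top_words):
    """Print the highest-weighted words for each fitted topic."""
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic {topic_idx}:")
        # argsort is ascending, so this reversed slice walks back from the
        # largest weight and yields the top `no_top_words` indices
        print(
            " ".join(
                [feature_names[i] for i in topic.argsort()[: -no_top_words - 1 : -1]]
            )
        )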


no_top_words = 10
tf_feature_names = vectorizer.get_feature_names_out()
display_topics(lda, tf_feature_names, no_top_words)
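
The `max_df=0.95` and `min_df=2` arguments prune the vocabulary by document frequency: a term is kept only if it appears in at least 2 documents and in no more than 95% of them. The sketch below reproduces that rule in plain Python on a made-up four-document corpus. The documents are hypothetical, and whitespace splitting is a simplification of `CountVectorizer`'s own tokenizer.

```python
from collections import Counter

# Hypothetical mini-corpus standing in for the email bodies
docs = [
    "enron power price",
    "enron gas price",
    "enron meeting schedule",
    "enron power market",
]

n_docs = len(docs)

# Document frequency: the number of documents each term appears in
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(doc.split()))

min_df = 2     # integer: minimum number of documents
max_df = 0.95  # float: maximum proportion of documents

# Keep terms that are neither too rare nor too common
vocabulary = sorted(
    term
    for term, df in doc_freq.items()
    if df >= min_df and df <= max_df * n_docs
)
print(vocabulary)
```

In this toy corpus `enron` occurs in every document and is dropped by `max_df`, while terms that occur only once are dropped by `min_df`.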

# Pre-Processing

The initial topic model is dominated by tokens that carry no meaning, so the text needs pre-processing before it is modeled again.
For example, the top words for Topic 0 are all HTML markup and URL fragments: `td font com http br tr size width href align`.

- Text Extraction: Strip content that carries no topical meaning.
    - Remove HTML tags with `BeautifulSoup`
    - Remove URLs and email addresses
- Normalization: Convert the text to lowercase and remove special characters.
- Tokenization: Split the text into individual words (tokens).
    - `nltk.tokenize.word_tokenize()`
- Removing Stop Words: Filter out common words that don’t carry significant meaning for topic modeling.
    - `nltk.corpus.stopwords.words("english")`
- Stemming: Reduce words to their stem so that related word forms are grouped together.
    - `nltk.stem.PorterStemmer()`
- Final Prep: Join the tokens back into a single string per email for the `CountVectorizer` to process.

## Import

In [3]:
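from bs4 import BeautifulSoup
import re
import string

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# One-time downloads of the NLTK data used below
# (newer NLTK releases may also require "punkt_tab" for word_tokenize)
nltk.download("punkt")
nltk.download("stopwords")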

## Text Extraction

In [4]:
def extract_text(text):
    # Remove HTML tags
    text = BeautifulSoup(text, "html.parser").get_text() if pd.notnull(text) else ""
    # Remove URLs and email addresses
    text = re.sub(r"\S+@\S+", "", text)  # Remove email addresses
    text = re.sub(r"http\S+|www\S+", "", text)  # Remove URLs
    return text


emails_df["processed_text"] = emails_df["text"].apply(extract_text)
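
To see what the two regular expressions in `extract_text` do on their own, the sketch below applies them to a single made-up sentence. The helper name `strip_addresses_and_urls`, the sample address, and the final whitespace collapse are assumptions added for illustration; the original function leaves the extra spaces in place.

```python
import re


def strip_addresses_and_urls(text):
    """Apply the same two substitutions as extract_text, then tidy whitespace."""
    text = re.sub(r"\S+@\S+", "", text)  # any whitespace-free token containing "@"
    text = re.sub(r"http\S+|www\S+", "", text)  # anything starting at "http" or "www"
    # Assumption: collapsing whitespace is an extra step, not in the original
    return " ".join(text.split())


# Hypothetical sample text
sample = "Contact john.doe@enron.com or visit http://www.enron.com/news today"
print(strip_addresses_and_urls(sample))  # Contact or visit today
```

Both patterns are greedy up to the next whitespace character, so punctuation attached to an address or URL is removed along with it.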



## Normalization

In [None]:
def normalize_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation using string.punctuation
    text = ''.join([char for char in text if char not in string.punctuation])
    # Remove non-alphabetic characters
    text = re.sub(r"[^a-z\s]", "", text)
    return text


emails_df["processed_text"] = emails_df["processed_text"].apply(normalize_text)
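
One side effect of `normalize_text` is worth knowing about: punctuation is deleted rather than replaced, so words joined by a slash or hyphen are fused together. This is likely where the token `tofrom` in the later topic output comes from. The sketch below contrasts the notebook's logic with a hypothetical variant, `normalize_text_spaced`, that inserts a space instead. The variant is an assumption for comparison, not part of the original pipeline.

```python
import re
import string


def normalize_text(text):
    # Same logic as the cell above: punctuation is deleted, not replaced
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return re.sub(r"[^a-z\s]", "", text)


def normalize_text_spaced(text):
    # Hypothetical variant: replace non-letters with a space, then collapse runs
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(text.split())


sample = "Travel to/from Houston, e-mail me by 5pm!"
print(normalize_text(sample))         # travel tofrom houston email me by pm
print(normalize_text_spaced(sample))  # travel to from houston e mail me by pm
```

The spaced variant avoids fused tokens but produces fragments such as a lone `e`. Single-character tokens are ignored by `CountVectorizer`'s default token pattern, so either choice is workable; which one is preferable depends on the corpus.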

## Tokenization

In [6]:
# Tokenization splits the text into individual words (tokens)
emails_df["tokens"] = emails_df["processed_text"].apply(word_tokenize)

## Remove Stop Words

In [7]:
# Filter out common stop words that don’t carry significant meaning for topic modeling.
stop_words = set(stopwords.words("english"))

emails_df["tokens"] = emails_df["tokens"].apply(
    lambda x: [word for word in x if word not in stop_words]
)
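
In the Topic Model v2 output, Topics 1, 7, and 10 are dominated by email header and forwarding terms (`subject`, `cc`, `pm`, `sent`, `origin`, `messag`). One possible refinement, sketched below, is to extend the stop list with corpus-specific words. The small `stop_words` set is a stand-in for the NLTK list so the example runs without downloads, and the contents of `email_stop_words` are hypothetical choices, not part of the original notebook.

```python
# Stand-in for set(stopwords.words("english")); a tiny hand-picked subset
stop_words = {"the", "to", "for", "is", "a", "and"}

# Hypothetical corpus-specific additions: email header and forwarding words
email_stop_words = {"subject", "cc", "pm", "sent", "forwarded", "original", "message"}

all_stop_words = stop_words | email_stop_words

tokens = ["original", "message", "subject", "the", "gas", "price", "is", "rising"]
filtered = [word for word in tokens if word not in all_stop_words]
print(filtered)  # ['gas', 'price', 'rising']
```

Because this filter runs before stemming, the extra words are listed in their unstemmed form (`message`, not `messag`).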

## Stemming

In [8]:
# Stemming reduces words to their root form, which helps group similar words.
stemmer = PorterStemmer()
emails_df["tokens"] = emails_df["tokens"].apply(
    lambda x: [stemmer.stem(word) for word in x]
)
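
Stemming explains why words in the topic output look truncated, for example `energi`, `messag`, and `compani`. The sketch below stems a few sample words and caches the results, since an email corpus repeats the same words many times. The `cached_stem` helper and the use of `functools.lru_cache` are assumptions added here as an optional speed-up; the notebook calls `stemmer.stem` directly.

```python
from functools import lru_cache

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()


@lru_cache(maxsize=None)
def cached_stem(word):
    # Each distinct word is stemmed once; repeats are served from the cache
    return stemmer.stem(word)


tokens = ["energy", "message", "company", "energy", "company"]
stems = [cached_stem(word) for word in tokens]
print(stems)  # ['energi', 'messag', 'compani', 'energi', 'compani']
```

In the notebook this could replace the lambda with `lambda x: [cached_stem(word) for word in x]`, which returns the same stems.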

In [None]:
emails_df.head()

## Final Preparation for LDA

In [11]:
# Join the tokens back into sentences for the CountVectorizer to process
# Then proceed with your LDA modeling
emails_df["final_text"] = emails_df["tokens"].apply(lambda x: " ".join(x))

# Now, use 'final_text' for LDA
text_data = emails_df["final_text"].values
vectorizer = CountVectorizer(max_df=0.95, min_df=2)
dtm = vectorizer.fit_transform(text_data)

# Topic Model v2

In [12]:
# LDA
lda = LatentDirichletAllocation(n_components=20, random_state=42)
lda.fit(dtm)
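
Beyond listing the top words per topic, a fitted model can also report how strongly each document belongs to each topic, which is what is needed to pull out the emails behind a topic of interest. The sketch below shows the idea on a made-up four-document corpus. The documents and the variable names `doc_topic`, `dominant_topic`, and `dominant_weight` are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents standing in for emails_df["final_text"] (hypothetical data)
docs = [
    "power price market california electr",
    "power market price util gener",
    "game team play fantasi week",
    "team play game sunday updat",
]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=42)
# fit_transform returns one row per document: its mixture over topics
doc_topic = lda.fit_transform(dtm)

# Index of the highest-weight topic for each document
dominant_topic = doc_topic.argmax(axis=1)
# Weight of that topic, useful for keeping only confident assignments
dominant_weight = doc_topic.max(axis=1)

for doc, topic, weight in zip(docs, dominant_topic, dominant_weight):
    print(f"topic={topic} weight={weight:.2f} :: {doc}")
```

With the notebook's own objects, the equivalent call would be `lda.transform(dtm)`, and the resulting `dominant_topic` array could be stored as a new column on `emails_df` to filter emails by topic.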

In [14]:
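# Analyze the topics (same helper as in the initial topic model)
def display_topics(model, feature_names, no_top_words):
    """Print the highest-weighted words for each fitted topic."""
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic {topic_idx}:")
        # argsort is ascending, so this reversed slice walks back from the
        # largest weight and yields the top `no_top_words` indices
        print(
            " ".join(
                [feature_names[i] for i in topic.argsort()[: -no_top_words - 1 : -1]]
            )
        )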


no_top_words = 10
tf_feature_names = vectorizer.get_feature_names_out()
display_topics(lda, tf_feature_names, no_top_words)

Topic 0:
power energi state electr california said price util market gener
Topic 1:
messag subject enron origin sent intend recipi contract corp may
Topic 2:
court schedul final date hour varianc school univers award program
Topic 3:
jeff subject ferc file issu cc pm would order iso
Topic 4:
enron compani said new million energi stock year employe financi
Topic 5:
compani servic million technolog new fund manag invest said ventur
Topic 6:
enron manag pleas busi report group risk inform market work
Topic 7:
subject pm sent origin messag thank meet john cc pleas
Topic 8:
click email free offer area save price receiv market get
Topic 9:
updat game week play wr start fantasi sunday team rb
Topic 10:
cc subject pm forward enron pleas thank mark agreement sara
Topic 11:
ga price capac day pleas chang volum deliveri product point
Topic 12:
trade market stock price futur report time may close compani
Topic 13:
would project cost ga issu need discuss us year meet
Topic 14:
travel hotel tofrom t