# Gensim Topic Modeling

**AUTHOR:** Nolan MacDonald

**DATE OF LAST SIGNIFICANT UPDATE:** 2024-NOV-24

**DESCRIPTION:** Use `gensim` instead of NLTK for LDA topic modeling; `gensim` also allows for dynamic topic modeling.

**GITHUB ISSUE #2:** https://github.com/nolmacdonald/INTA6450_Enron/issues/2

## Import Modules

In [11]:
import numpy as np
import pandas as pd
import re
import sqlite3
# NLP with the LDA model for topic modeling
from gensim.corpora import Dictionary
from gensim.models import LdaSeqModel
from gensim.models import LdaModel, LdaMulticore
from gensim.models.coherencemodel import CoherenceModel
# Visualize the topics
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
from bs4 import BeautifulSoup

# LDA Topic Model

Train the LDA model with a chosen number of topics, `num_topics`, and number of passes over the corpus, `passes`. Coherence scores and approximate run times recorded for different settings:

- 5 Topics, 10 Passes: Coherence Score:
- 10 Topics, 10 Passes: Coherence Score: 0.5997052048168511 (~11 min)
- 25 Topics, 10 Passes: Coherence Score: 0.6024988483443581 (~17 min)
- 50 Topics, 10 Passes: Coherence Score: 0.6112180168839475 (~32 min)
- 5 Topics, 20 Passes: Coherence Score: 0.5678327235325005
- 10 Topics, 20 Passes: Coherence Score: 0.5999938379847258
- 20 Topics, 20 Passes: Coherence Score: 0.5944007966496214 (~29 min)

The corpus contains 251734 documents (36142821 corpus positions in total).

## Import Pre-Processed Database

To obtain `emails_processed.db`, download the emails and extract the zip file.

Next, use `data_wrangler` to parse the data and store the email data in `emails.db`.

Finally, run `email_processing.py`, which reads the email database and preprocesses the data
for NLP models. It produces `emails_processed.db` with a table named `emails_processed`.
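
Before loading the data, it can help to confirm that the expected table is present. The following is a minimal sketch, assuming only the table name `emails_processed` from above. It uses an in-memory database as a stand-in so that it runs on its own, and the helper name `table_exists` is hypothetical.

```python
import sqlite3


def table_exists(connection, table_name):
    """Return True if `table_name` is a table in the connected SQLite database."""
    row = connection.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table_name,),
    ).fetchone()
    return row is not None


# Stand-in for ../data/emails_processed.db so the sketch is self-contained
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE emails_processed (text TEXT, tokens TEXT)")

has_processed = table_exists(demo, "emails_processed")
has_missing = table_exists(demo, "some_other_table")
demo.close()

print(has_processed, has_missing)
```

Against the real file, the same helper would be called on the connection opened in the next cell.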

In [6]:
# Connect to the database (sqlite3 creates an empty file if the path doesn't exist)
connection = sqlite3.connect("../data/emails_processed.db")

# Load the dataframe from the SQLite database
emails_df = pd.read_sql_query("SELECT * FROM emails_processed", connection)

# Close the connection
connection.close()

# Show email data
emails_df.head()

Unnamed: 0,text,message_id,date,from,to,subject,cc,bcc,mime-version,content-type,...,x-cc,x-bcc,folder,origin,filename,priority,processed_text,tokens,stripped_date,datetime
0,---------------------- Forwarded by Rika Imai/...,<88180.1075863689140.JavaMail.evans@thyme>,"Tue, 8 May 2001 08:37:00 -0700 (PDT)",rika.imai@enron.com,"john.forney@enron.com, mike.carson@enron.com, ...",4 Month Rolling Forecast,,,1.0,text/plain; charset=ANSI_X3.4-1968,...,,,\Rob_Benson_Jun2001\Notes Folders\Notes inbox,Benson-R,rbenson.nsf,normal,forwarded by rika imainaenron on pm dan sal...,"['forward', 'rika', 'imainaenron', 'pm', 'dan'...","Tue, 8 May 2001 08:37:00",2001-05-08 08:37:00
1,great,<4460514.1075857469666.JavaMail.evans@thyme>,"Wed, 21 Jun 2000 02:01:00 -0700 (PDT)",hunter.shively@enron.com,richard.tomaski@enron.com,Re: Jim Simpson,,,1.0,text/plain; charset=us-ascii,...,,,\Hunter_Shively_Jun2001\Notes Folders\Sent,Shively-H,hshivel.nsf,normal,great,['great'],"Wed, 21 Jun 2000 02:01:00",2000-06-21 02:01:00
2,"oohh la la. who was your ""friend""? did you g...",<2160301.1075858147494.JavaMail.evans@thyme>,"Wed, 16 Aug 2000 03:03:00 -0700 (PDT)",matthew.lenhart@enron.com,shelliott@dttus.com,Re: Re[2]:,,,1.0,text/plain; charset=us-ascii,...,,,\Matthew_Lenhart_Jun2001\Notes Folders\Sent,Lenhart-M,mlenhar.nsf,normal,oohh la la who was your friend did you guys ...,"['oohh', 'la', 'la', 'friend', 'guy', 'read', ...","Wed, 16 Aug 2000 03:03:00",2000-08-16 03:03:00
3,\nAttached are the two files with this week's ...,<22847680.1075863611080.JavaMail.evans@thyme>,"Wed, 15 Aug 2001 05:46:47 -0700 (PDT)",rika.imai@enron.com,"russell.ballato@enron.com, hicham.benjelloun@e...",FW: Nuclear Rolling Forecast,,,1.0,text/plain; charset=us-ascii,...,,,"\ExMerge - Benson, Robert\Inbox\Large Messages",BENSON-R,rob benson 6-25-02.PST,normal,attached are the two files with this weeks nuc...,"['attach', 'two', 'file', 'week', 'nuclear', '...","Wed, 15 Aug 2001 05:46:47",2001-08-15 05:46:47
4,lm:\nWhat are your thoughts going forward........,<15012282.1075852957298.JavaMail.evans@thyme>,"Wed, 3 Oct 2001 00:35:05 -0700 (PDT)",jennifer.fraser@enron.com,larry.may@enron.com,hello,,,1.0,text/plain; charset=us-ascii,...,,,\LMAY2 (Non-Privileged)\Inbox,May-L,LMAY2 (Non-Privileged).pst,normal,lmwhat are your thoughts going forward also wh...,"['lmwhat', 'thought', 'go', 'forward', 'also',...","Wed, 3 Oct 2001 00:35:05",2001-10-03 00:35:05


## Create Corpus

In [9]:
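# NOTE: `tokens` is stored in SQLite as the text form of a list, so
# str.split() keeps the list punctuation (e.g. "'deal',") in every token.
emails_df['tokens'] = emails_df['tokens'].apply(lambda x: x.split())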

# Create a dictionary and a corpus
dictionary = Dictionary(emails_df['tokens'])
corpus = [dictionary.doc2bow(tokens) for tokens in emails_df['tokens']]
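
The `tokens` values in the table preview above look like Python list literals stored as text (for example `['great']`), and the topic terms printed later carry quotes and commas (such as `'deal',`), which is what `str.split()` would leave behind. If that reading is right, `ast.literal_eval` recovers the original token lists. This is a minimal sketch: `parse_tokens` is a hypothetical helper, and the sample string is shortened from the table preview.

```python
import ast


def parse_tokens(raw):
    """Convert the text form of a token list back into a list of strings."""
    tokens = ast.literal_eval(raw)
    return [str(token) for token in tokens]


raw = "['oohh', 'la', 'la', 'friend', 'guy', 'read']"

split_tokens = raw.split()         # keeps brackets, quotes, and commas
parsed_tokens = parse_tokens(raw)  # clean tokens

print(split_tokens[:2])
print(parsed_tokens[:2])
```

In the notebook this would take the place of the `x.split()` lambda, for example `emails_df['tokens'].apply(parse_tokens)`. The topic terms and scores recorded below were produced with the split tokens, so they would change.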

## Form Topic Model

In [12]:
# Number of topics
num_topics = 10

# Train the LDA model
# lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, passes=10, random_state=42)
lda_model = LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=num_topics, passes=10, random_state=42, workers=6)

### Print Topic Information

Print the topics. Each topic is shown as its top words with their corresponding weights.

In [13]:
# Print topics
for idx, topic in lda_model.print_topics(-1):
    print(f"Topic {idx}: {topic}")

Topic 0: 0.009*"'deal'," + 0.009*"'pleas'," + 0.009*"'would'," + 0.008*"'need'," + 0.008*"'enron'," + 0.006*"'know'," + 0.006*"'time'," + 0.006*"'price'," + 0.005*"'report'," + 0.005*"'us',"
Topic 1: 0.011*"'email'," + 0.008*"'servic'," + 0.008*"'manag'," + 0.008*"'compani'," + 0.008*"'busi'," + 0.007*"'market'," + 0.007*"'inform'," + 0.007*"'new'," + 0.006*"'enron'," + 0.006*"'provid',"
Topic 2: 0.007*"'updat'," + 0.006*"'game'," + 0.006*"'email'," + 0.006*"'free'," + 0.006*"'week'," + 0.006*"'imag'," + 0.005*"'get'," + 0.005*"'one'," + 0.004*"'play'," + 0.004*"'time',"
Topic 3: 0.016*"'enron'," + 0.010*"'compani'," + 0.007*"'imag'," + 0.006*"'trade'," + 0.006*"'new'," + 0.006*"'stock'," + 0.005*"'market'," + 0.005*"'us'," + 0.004*"'rate'," + 0.004*"'day',"
Topic 4: 0.012*"'pleas'," + 0.011*"'email'," + 0.011*"'enron'," + 0.010*"'agreement'," + 0.009*"'subject'," + 0.009*"'attach'," + 0.007*"'contract'," + 0.007*"'messag'," + 0.007*"'may'," + 0.007*"'intend',"
Topic 5: 0.021*"'origin'

## Coherence Score

Calculate the coherence score of the trained model.

The coherence measure is selected with the `coherence` argument of `CoherenceModel`:

- `c_v`: based on a sliding window, a one-set segmentation of the top words, and an indirect confirmation measure that uses normalized pointwise mutual information (NPMI) and cosine similarity
- `u_mass`: based on document co-occurrence counts, a one-preceding segmentation, and a logarithmic conditional probability as the confirmation measure
- `c_uci`: based on a sliding window and the pointwise mutual information (PMI) of all word pairs of the given top words; also known as `c_pmi`
- `c_npmi`: an enhanced version of the `c_uci` coherence that uses the normalized pointwise mutual information (NPMI)
- Fastest method: `u_mass`
- For `u_mass`, `corpus` should be provided; if `texts` is provided instead, it is converted to a corpus using the dictionary.
- For `c_v`, `c_uci`, and `c_npmi`, `texts` should be provided (`corpus` isn't needed).
- [Reference: Coherence Measures](https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0)
- [Reference: Gensim models.coherencemodel](https://radimrehurek.com/gensim/models/coherencemodel.html)

Other `CoherenceModel` parameters:

- `topn` (int): number of top words to extract from each topic
- `processes` (int): number of processes to use; any value less than 1 is interpreted as `num_cpus - 1`
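
To make the NPMI term in the list above concrete, here is a simplified sketch of NPMI for one word pair. It is not gensim's implementation: it treats each whole document as the co-occurrence window (gensim's `c_v` and `c_npmi` use a sliding window and aggregate over many word pairs), and the function name and toy documents are made up for illustration.

```python
import math


def npmi(word_a, word_b, documents):
    """NPMI of two words, using whole documents as the co-occurrence window."""
    n_docs = len(documents)
    p_a = sum(word_a in doc for doc in documents) / n_docs
    p_b = sum(word_b in doc for doc in documents) / n_docs
    p_ab = sum(word_a in doc and word_b in doc for doc in documents) / n_docs
    if p_ab == 0:
        return -1.0  # the words never co-occur
    if p_ab == 1:
        return 0.0   # both words appear in every document; NPMI is undefined, use 0
    pmi = math.log(p_ab / (p_a * p_b))
    return pmi / -math.log(p_ab)


# Toy documents represented as sets of (stemmed) tokens
toy_documents = [
    {"power", "energi", "price"},
    {"power", "price"},
    {"energi", "power"},
    {"game", "play"},
]

score_related = npmi("power", "energi", toy_documents)
score_unrelated = npmi("power", "game", toy_documents)
print(round(score_related, 3), score_unrelated)
```

NPMI ranges from -1 (never together) to 1 (always together), so word pairs that tend to share documents score higher. A coherent topic is one whose top words score well against each other.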


In [14]:
# Compute Coherence Score (c_v)
coherence_model_lda = CoherenceModel(model=lda_model, texts=emails_df['tokens'], dictionary=dictionary, coherence='c_v', processes=6)
coherence_score = coherence_model_lda.get_coherence()
print(f"Coherence Score (c_v): {coherence_score}")

Coherence Score (c_v): 0.5158661585162183


### Not Run/Used

In [None]:
# Compute Coherence Score (u_mass); noted range: -14 to 14
coherence_model_u_mass = CoherenceModel(model=lda_model, texts=emails_df['tokens'], dictionary=dictionary, coherence='u_mass', processes=6)
coherence_score_u_mass = coherence_model_u_mass.get_coherence()
print(f"Coherence Score (u_mass): {coherence_score_u_mass}")

In [None]:
# Compute Coherence Score (c_uci)
coherence_model_c_uci = CoherenceModel(model=lda_model, texts=emails_df['tokens'], dictionary=dictionary, coherence='c_uci', processes=6)
coherence_score_c_uci = coherence_model_c_uci.get_coherence()
print(f"Coherence Score (c_uci): {coherence_score_c_uci}")

In [None]:
# Compute Coherence Score (c_npmi)
coherence_model_c_npmi = CoherenceModel(model=lda_model, texts=emails_df['tokens'], dictionary=dictionary, coherence='c_npmi', processes=6)
coherence_score_c_npmi = coherence_model_c_npmi.get_coherence()
print(f"Coherence Score (c_npmi): {coherence_score_c_npmi}")

## Topic Distribution

The topic distribution reported here is the average weight of each topic across all documents.
It is calculated by iterating over the entire corpus, summing each topic's weight over all documents, and dividing by the number of documents.

In [15]:
# Step 1: Calculate topic importance (weights)
topic_importance = [0] * lda_model.num_topics
for doc in corpus:
    for topic_id, weight in lda_model[doc]:
        topic_importance[topic_id] += weight
topic_importance = [weight / len(corpus) for weight in topic_importance]  # Normalize by number of documents

# Step 2: Combine importance and terms into a DataFrame
num_words = 10  # Number of top words to show per topic
ranked_topics_data = []

for topic_id, importance in enumerate(topic_importance):
    topic_terms = lda_model.print_topic(topic_id, topn=num_words)
    ranked_topics_data.append({
        'Topic': topic_id,
        'Importance': importance,
        'Terms': topic_terms
    })

# Create DataFrame and sort by importance
ranked_topics_df = pd.DataFrame(ranked_topics_data).sort_values(by='Importance', ascending=False)
ranked_topics_df

Unnamed: 0,Topic,Importance,Terms
5,5,0.185359,"0.021*""'origin',"" + 0.020*""'messagefrom',"" + 0..."
4,4,0.181667,"0.012*""'pleas',"" + 0.011*""'email',"" + 0.011*""'..."
0,0,0.177235,"0.009*""'deal',"" + 0.009*""'pleas',"" + 0.009*""'w..."
8,8,0.140159,"0.012*""'cc',"" + 0.012*""'subject',"" + 0.009*""'p..."
2,2,0.078023,"0.007*""'updat',"" + 0.006*""'game',"" + 0.006*""'e..."
1,1,0.073583,"0.011*""'email',"" + 0.008*""'servic',"" + 0.008*""..."
9,9,0.065846,"0.016*""'power',"" + 0.012*""'energi',"" + 0.010*""..."
3,3,0.039798,"0.016*""'enron',"" + 0.010*""'compani',"" + 0.007*..."
7,7,0.025138,"0.020*""'pm',"" + 0.009*""'employe',"" + 0.008*""'p..."
6,6,0.017328,"0.020*""'tofrom',"" + 0.015*""'tx',"" + 0.013*""'co..."
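
The Importance values above add up to about 0.98 rather than 1. A likely explanation, offered as an assumption rather than something verified on this model: `lda_model[doc]` returns only the topics whose weight is above the model's `minimum_probability` threshold, so small weights are left out of the sum. The sketch below reproduces the effect with made-up per-document weights. The function name and the data are hypothetical.

```python
def average_topic_weights(doc_topics, num_topics):
    """Average each topic's weight over all documents (missing topics count as 0)."""
    totals = [0.0] * num_topics
    for topics in doc_topics:
        for topic_id, weight in topics:
            totals[topic_id] += weight
    return [total / len(doc_topics) for total in totals]


# Hypothetical sparse output per document: (topic_id, weight) pairs, with any
# topic below a 0.01 threshold already dropped
doc_topics = [
    [(0, 0.70), (1, 0.295)],  # topic 2 (weight 0.005) was dropped
    [(1, 0.50), (2, 0.495)],  # topic 0 (weight 0.005) was dropped
]

averages = average_topic_weights(doc_topics, num_topics=3)
total = sum(averages)
print(averages, total)
```

If exact totals matter, `get_document_topics` accepts a `minimum_probability` argument that can be lowered for the call.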


In [16]:
# Add new columns for each term and weight
for i in range(1, num_words + 1):
    ranked_topics_df[f'Term {i}'] = None
    ranked_topics_df[f'Term {i} Weight'] = None

# Populate the new columns with terms and weights
for idx, row in ranked_topics_df.iterrows():
    terms_weights = row['Terms'].split(' + ')
    for i, term_weight in enumerate(terms_weights):
        weight, term = term_weight.split('*')
        ranked_topics_df.at[idx, f'Term {i+1}'] = term.strip("'\",")
        ranked_topics_df.at[idx, f'Term {i+1} Weight'] = float(weight)

ranked_topics_df.reset_index(inplace=True, drop=True)
ranked_topics_df


Unnamed: 0,Topic,Importance,Terms,Term 1,Term 1 Weight,Term 2,Term 2 Weight,Term 3,Term 3 Weight,Term 4,...,Term 6,Term 6 Weight,Term 7,Term 7 Weight,Term 8,Term 8 Weight,Term 9,Term 9 Weight,Term 10,Term 10 Weight
0,5,0.185359,"0.021*""'origin',"" + 0.020*""'messagefrom',"" + 0...",origin,0.021,messagefrom,0.02,know,0.012,get,...,sent,0.009,octob,0.007,would,0.007,let,0.007,think,0.007
1,4,0.181667,"0.012*""'pleas',"" + 0.011*""'email',"" + 0.011*""'...",pleas,0.012,email,0.011,enron,0.011,agreement,...,attach,0.009,contract,0.007,messag,0.007,may,0.007,intend,0.007
2,0,0.177235,"0.009*""'deal',"" + 0.009*""'pleas',"" + 0.009*""'w...",deal,0.009,pleas,0.009,would,0.009,need,...,know,0.006,time,0.006,price,0.006,report,0.005,us,0.005
3,8,0.140159,"0.012*""'cc',"" + 0.012*""'subject',"" + 0.009*""'p...",cc,0.012,subject,0.012,pm,0.009,meet,...,email,0.007,john,0.007,messagefrom,0.007,pleas,0.007,mark,0.007
4,2,0.078023,"0.007*""'updat',"" + 0.006*""'game',"" + 0.006*""'e...",updat,0.007,game,0.006,email,0.006,free,...,imag,0.006,get,0.005,one,0.005,play,0.004,time,0.004
5,1,0.073583,"0.011*""'email',"" + 0.008*""'servic',"" + 0.008*""...",email,0.011,servic,0.008,manag,0.008,compani,...,market,0.007,inform,0.007,new,0.007,enron,0.006,provid,0.006
6,9,0.065846,"0.016*""'power',"" + 0.012*""'energi',"" + 0.010*""...",power,0.016,energi,0.012,said,0.01,state,...,electr,0.008,price,0.007,market,0.007,ga,0.007,would,0.007
7,3,0.039798,"0.016*""'enron',"" + 0.010*""'compani',"" + 0.007*...",enron,0.016,compani,0.01,imag,0.007,trade,...,stock,0.006,market,0.005,us,0.005,rate,0.004,day,0.004
8,7,0.025138,"0.020*""'pm',"" + 0.009*""'employe',"" + 0.008*""'p...",pm,0.02,employe,0.009,perform,0.008,databasealia,...,oper,0.006,dbcapsdataunknown,0.005,law,0.005,request,0.005,enron,0.005
9,6,0.017328,"0.020*""'tofrom',"" + 0.015*""'tx',"" + 0.013*""'co...",tofrom,0.02,tx,0.015,court,0.013,oneway,...,way,0.01,appeal,0.008,san,0.007,travel,0.006,citi,0.006


## Get Dominant Topic

In [17]:
def get_dominant_topic(lda_model, corpus):
    dominant_topics = []
    for doc_bow in corpus:
        topic_probs = lda_model.get_document_topics(doc_bow)
        dominant_topic = max(topic_probs, key=lambda x: x[1])[0]
        dominant_topics.append(dominant_topic)
    return dominant_topics

emails_df['dominant_topic'] = get_dominant_topic(lda_model, corpus)

In [18]:
# Showing that dominant_topic is added to the DataFrame and contains an integer value
emails_df['dominant_topic']

0         8
1         6
2         5
3         9
4         5
         ..
251729    5
251730    0
251731    4
251732    8
251733    9
Name: dominant_topic, Length: 251734, dtype: int64
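
With `dominant_topic` assigned, the share of documents per topic can be counted directly. This is a different quantity from the averaged weights in the Topic Distribution section, because each document counts once for its single strongest topic. A minimal sketch, using only the ten values visible in the output above. The helper name is hypothetical.

```python
from collections import Counter


def dominant_topic_shares(dominant_topics):
    """Return {topic_id: fraction of documents whose dominant topic it is}."""
    counts = Counter(dominant_topics)
    n_docs = len(dominant_topics)
    return {topic: count / n_docs for topic, count in sorted(counts.items())}


# The ten values shown in the output above (first five and last five rows)
sample = [8, 6, 5, 9, 5, 5, 0, 4, 8, 9]
shares = dominant_topic_shares(sample)
print(shares)
```

On the full DataFrame, `emails_df['dominant_topic'].value_counts(normalize=True)` should give the same shares without a helper.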

## Visualize

In [19]:
# Save the model
lda_model.save('enron_lda_model')

# Visualize
vis = gensimvis.prepare(lda_model, corpus, dictionary)

# Save as HTML
pyLDAvis.save_html(vis, 'lda_visualization.html')

# Show
pyLDAvis.display(vis)
