# NLP Topic Modeling

**AUTHOR:** David Mara

**DATE OF LAST SIGNIFICANT UPDATE:** 2024-NOV-24

**DESCRIPTION:** Enron Corpus Natural Language Processing (NLP) topic modeling

**GITHUB ISSUE #1:** https://github.com/nolmacdonald/INTA6450_Enron/issues/1

# Overview

The goal is to score the sentiment of each email and identify those whose scores fall below a predefined "neutral" threshold, so that emotionally charged or potentially problematic emails, such as those with negative sentiment, can be prioritized. Sentiment is scored with NLTK's sentiment tools; see the [SentimentAnalyzer](https://www.nltk.org/howto/sentiment.html) how-to.

## Download NLTK Data

macOS: Running `nltk.download('all')` in a Jupyter notebook saves the data
to the per-user directory `/Users/username/nltk_data` rather than a shared system location.

To install the data system-wide instead, run the following command in a terminal:

```shell
$ sudo python -m nltk.downloader -d /usr/local/share/nltk_data all
```

This saves all NLTK data packages under `/usr/local/share/nltk_data`.
If you only need a subset of the NLTK data, replace the final argument `all`,
which downloads every package, with the name of the package you need.
For example, use `punkt` for the tokenizer or `vader_lexicon` for sentiment analysis.

A more robust approach is to set the `NLTK_DATA` environment variable, which NLTK checks when locating its data, in your shell configuration.
After adding `NLTK_DATA` to your shell configuration, reload it and download the data
to the path stored in the variable.

```shell
# Add the NLTK data path to ~/.bashrc or ~/.zshrc
export NLTK_DATA="/usr/local/share/nltk_data"
# Reload the shell configuration (use ~/.bashrc for bash)
$ source ~/.zshrc
# Download all NLTK data to that path
$ sudo python -m nltk.downloader -d $NLTK_DATA all
```
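
If the notebook kernel was started from a session that did not load the shell configuration, the variable may not be visible to Python. The following is a minimal sketch, using only the standard library, of checking which directories `NLTK_DATA` names. The helper `nltk_data_dirs` is a hypothetical name introduced for illustration, and it assumes the variable holds one or more paths joined by the platform's path separator.

```python
import os


def nltk_data_dirs(environ=None):
    """Return the directories named in NLTK_DATA, in order (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    raw = environ.get("NLTK_DATA", "")
    # Several directories may be joined with the platform path separator
    return [path for path in raw.split(os.pathsep) if path]


# Example with an explicit mapping instead of the real environment
example_dirs = nltk_data_dirs({"NLTK_DATA": "/usr/local/share/nltk_data"})
print(example_dirs)

# Check the environment of the running kernel
print(nltk_data_dirs() or "NLTK_DATA is not set in this process")
```

An empty result inside the notebook suggests the kernel did not inherit the variable, in which case launching Jupyter from a freshly reloaded terminal is one thing to try.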

# Load

In [1]:
import sqlite3
import pandas as pd
# Run once here or in a terminal (see "Download NLTK Data" above)
# import nltk; nltk.download('all')


In [2]:
# Connect to the existing SQLite database
# (sqlite3 creates an empty file if the path is wrong, and the query below would then fail)
connection = sqlite3.connect("../data/emails.db")

# Load the dataframe from the SQLite database
emails_df = pd.read_sql_query("SELECT * FROM emails", connection)

# Close the connection
connection.close()

# Show email data
emails_df.head()

Unnamed: 0,text,message_id,date,from,to,subject,cc,bcc,mime-version,content-type,content-transfer-encoding,x-from,x-to,x-cc,x-bcc,folder,origin,filename,priority
0,Thanks so much. \n\n -----Original Message---...,<19486923.1075862012747.JavaMail.evans@thyme>,"Thu, 8 Nov 2001 11:24:50 -0800 (PST)",matt.smith@enron.com,kam.keiser@enron.com,RE: new books,,,1.0,text/plain; charset=us-ascii,,"Smith, Matt </O=ENRON/OU=NA/CN=RECIPIENTS/CN=M...","Keiser, Kam </O=ENRON/OU=NA/CN=RECIPIENTS/CN=K...",,,"\MSMITH18 (Non-Privileged)\Smith, Matt\Sent Items",Smith-M,MSMITH18 (Non-Privileged).pst,normal
1,Carol St. Clair\nEB 3892\n713-853-3989 (Phone)...,<12570643.1075842116980.JavaMail.evans@thyme>,"Tue, 9 May 2000 09:37:00 -0700 (PDT)",carol.clair@enron.com,russell.diamond@enron.com,American Central,,,1.0,text/plain; charset=us-ascii,,Carol St Clair,Russell Diamond,,,\Carol_StClair_Dec2000_1\Notes Folders\Sent,STCLAIR-C,cstclai.nsf,normal
2,\nPlease see attached. Hard copies are being ...,<21222840.1075861608475.JavaMail.evans@thyme>,"Tue, 20 Nov 2001 10:39:17 -0800 (PST)",jfagan@hewm.com,el00-95@listserv.gsa.gov,First Set of Discovery Requests of the Califor...,,,1.0,text/plain; charset=us-ascii,,"""Fagan, Joseph H."" <JFagan@HEWM.COM>",EL00-95@LISTSERV.GSA.GOV,,,"\JSTEFFE (Non-Privileged)\Steffes, James D.\De...",Steffes-J,JSTEFFE (Non-Privileged).pst,normal
3,"When: Tuesday, April 17, 2001 2:00 PM-4:00 PM ...",<11978852.1075840779217.JavaMail.evans@thyme>,"Tue, 8 May 2001 12:14:42 -0700 (PDT)",rumaldo.lopez@enron.com,"teresa.seibel@enron.com, vasant.shanbhogue@enr...",Moody's/Famas Credit Scoring Models--Moody's j...,"william.bradford@enron.com, vince.kaminski@enr...","william.bradford@enron.com, vince.kaminski@enr...",1.0,text/plain; charset=us-ascii,,"Lopez, Rumaldo </O=ENRON/OU=NA/CN=RECIPIENTS/C...","Teresa Seibel, Vasant Shanbhogue, Rabi De, Rud...","William S Bradford, Vince J Kaminski",,\vkamins\Calendar,KAMINSKI-V,vincent kaminski 1-30-02.pst,normal
4,I'll get you my comments ASAP. Has Shelley Co...,<12733008.1075843217120.JavaMail.evans@thyme>,"Wed, 13 Dec 2000 02:27:00 -0800 (PST)",jeff.dasovich@enron.com,sarah.novosel@enron.com,Re: Enron Response to San Diego Request for Ga...,,,1.0,text/plain; charset=us-ascii,,Jeff Dasovich,Sarah Novosel,,,\Jeff_Dasovich_Dec2000\Notes Folders\Sent,DASOVICH-J,jdasovic.nsf,normal
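
`SELECT *` loads every column, although the cells in this notebook only read `text`. As a hedged sketch, the query could name just the columns of interest. The example below runs against a throwaway in-memory database with two made-up rows so that it is self-contained; the table name `emails` and the column names come from the output above, while the row values and variable names are illustrative only.

```python
import sqlite3
from contextlib import closing

# Throwaway in-memory database standing in for ../data/emails.db
with closing(sqlite3.connect(":memory:")) as connection:
    connection.execute("CREATE TABLE emails (message_id TEXT, subject TEXT, text TEXT)")
    connection.executemany(
        "INSERT INTO emails VALUES (?, ?, ?)",
        [
            ("<id-1>", "RE: example", "Thanks so much."),
            ("<id-2>", "Example", "Please see attached."),
        ],
    )
    # Name only the needed columns instead of SELECT *
    query = "SELECT message_id, subject, text FROM emails"
    rows = connection.execute(query).fetchall()

# closing() ensures the connection is closed even if the query raises
print(rows)
```

The same query string can be passed to `pd.read_sql_query(query, connection)` to obtain a DataFrame instead of a list of tuples.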


# Pre-Processing

Without pre-processing, the topic model surfaces a lot of insignificant tokens that need to be cleaned up.
For example, Topic 0 is useless because it consists only of HTML markup remnants: `td font com http br tr size width href align`.

- Text Extraction: Clean up insignificant data.
    - Remove HTML tags: `BeautifulSoup`
    - Remove URLs and email addresses
- Normalization: Convert the text to lowercase and remove special characters.
- Tokenization: Split the text into individual words (tokens).
    - `nltk.tokenize.word_tokenize()`
- Removing Stop Words: Filter out common stop words that don’t carry significant meaning for topic modeling.
    - `nltk.corpus.stopwords.words("english")`
- Stemming: Reduce words to their root form, which helps group similar words.
    - `nltk.stem.PorterStemmer()`
- Final Prep: Join the tokens back into strings for the `CountVectorizer` to process.

## Import

In [3]:
from bs4 import BeautifulSoup
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import string

## Text Extraction

In [4]:
def extract_text(text):
    # Remove HTML tags
    text = BeautifulSoup(text, "html.parser").get_text() if pd.notnull(text) else ""
    # Remove URLs and email addresses
    text = re.sub(r"\S+@\S+", "", text)  # Remove email addresses
    text = re.sub(r"http\S+|www\S+", "", text)  # Remove URLs
    return text


emails_df["processed_text"] = emails_df["text"].apply(extract_text)
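
To see what the two substitutions do, the sketch below applies the same patterns to a made-up sentence. The whitespace-collapsing step at the end is an addition that is not present in `extract_text`; it is shown only because removing matches leaves double spaces behind. `strip_addresses_and_urls` is a hypothetical name.

```python
import re


def strip_addresses_and_urls(text):
    """Apply the same two patterns as extract_text (hypothetical helper)."""
    text = re.sub(r"\S+@\S+", "", text)  # email addresses
    text = re.sub(r"http\S+|www\S+", "", text)  # URLs
    # Assumption: collapse the gaps left behind; not part of the original function
    return re.sub(r"\s+", " ", text).strip()


sample = (
    "Contact jeff.dasovich@enron.com or see "
    "http://www.enron.com/info and www.example.org today."
)
cleaned = strip_addresses_and_urls(sample)
print(cleaned)
```

Because both patterns match up to the next whitespace, any punctuation attached to an address or URL is removed along with it.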



## Normalization

In [5]:
def normalize_text(text):
    # Convert to lowercase
    text = text.lower()
    # Keep only lowercase letters and whitespace; this also drops punctuation and digits
    text = re.sub(r"[^a-z\s]", "", text)
    return text


emails_df["processed_text"] = emails_df["processed_text"].apply(normalize_text)
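
One side effect worth knowing: deleting characters, rather than replacing them with spaces, fuses word fragments together. The small sketch below reproduces the lowercase and letters-only effect of `normalize_text` on a made-up sentence; `letters_only` is a hypothetical helper name.

```python
import re


def letters_only(text):
    """Lowercase, then keep only a-z and whitespace, as normalize_text does."""
    return re.sub(r"[^a-z\s]", "", text.lower())


normalized = letters_only("Don't e-mail me before 5pm!")
print(normalized)  # dont email me before pm
```

If that matters for later steps, substituting a space instead of an empty string would keep `e` and `mail` separate; whether that is preferable depends on how the downstream vectorizer should treat such fragments.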

## Tokenization

In [6]:
# Tokenization splits the text into individual words (tokens)
emails_df["tokens"] = emails_df["processed_text"].apply(word_tokenize)

## Remove Stop Words

In [7]:
# Filter out common stop words that don’t carry significant meaning for topic modeling.
stop_words = set(stopwords.words("english"))

emails_df["tokens"] = emails_df["tokens"].apply(
    lambda x: [word for word in x if word not in stop_words]
)

## Stemming

In [8]:
# Stemming reduces words to their root form, which helps group similar words.
stemmer = PorterStemmer()
emails_df["tokens"] = emails_df["tokens"].apply(
    lambda x: [stemmer.stem(word) for word in x]
)
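
The final step listed under Pre-Processing, joining the tokens back into strings for `CountVectorizer`, has no cell above. A minimal sketch of that step on plain Python lists follows. The token values are made up for illustration, and the column name `clean_text` in the closing comment is an assumption rather than something the notebook defines.

```python
# Made-up stemmed tokens standing in for two rows of emails_df["tokens"]
token_rows = [
    ["thank", "much"],
    ["pleas", "see", "attach", "hard", "copi"],
]

# Join each token list into one space-separated string
joined_rows = [" ".join(tokens) for tokens in token_rows]
print(joined_rows)

# On the DataFrame, the equivalent would be something like:
# emails_df["clean_text"] = emails_df["tokens"].apply(" ".join)
```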

## Test Query

In [9]:
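# Build the analyzer once; constructing it on every call reloads the VADER lexicon
analyzer = SentimentIntensityAnalyzer()


def get_sentiment(text, sentiment):
    # Return one VADER score: "pos", "neu", "neg", or "compound"
    return analyzer.polarity_scores(text)[sentiment]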

In [None]:

# Score only the first 1,000 emails as a test; the remaining rows are left as NaN
sample = emails_df["processed_text"].head(1000)
for label in ("pos", "neu", "neg", "compound"):
    emails_df[label] = sample.apply(get_sentiment, args=(label,))
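
With scores in place, the filtering described in the Overview could look like the sketch below. It assumes the cut-off commonly suggested for VADER, where a `compound` score of -0.05 or lower is treated as negative. The notebook does not state its own threshold, so `NEGATIVE_THRESHOLD`, `flag_negative`, and the score values are all illustrative.

```python
# Assumed cut-off: the conventional VADER boundary for negative sentiment
NEGATIVE_THRESHOLD = -0.05


def flag_negative(records, threshold=NEGATIVE_THRESHOLD):
    """Return records whose compound score is at or below the threshold."""
    return [r for r in records if r["compound"] <= threshold]


# Made-up scores; real values would come from the compound column
scored = [
    {"message_id": "<id-1>", "compound": 0.44},
    {"message_id": "<id-2>", "compound": -0.61},
    {"message_id": "<id-3>", "compound": 0.0},
]

flagged = flag_negative(scored)
print([r["message_id"] for r in flagged])

# On the DataFrame, the equivalent filter would be:
# emails_df[emails_df["compound"] <= NEGATIVE_THRESHOLD]
```

Rows that were never scored hold NaN, and a comparison against NaN is false, so they would be excluded by the DataFrame filter rather than flagged.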
