# NLP Topic Modeling

**AUTHOR:** Nolan MacDonald

**DATE OF LAST SIGNIFICANT UPDATE:** 2024-NOV-02

**DESCRIPTION:** Enron Corpus Natural Language Processing (NLP) topic modeling

**GITHUB ISSUE #2:** https://github.com/nolmacdonald/INTA6450_Enron/issues/2

## Overview




The raw Enron emails need to be parsed and wrangled into a data set whose structure suits each model being tested.

# Testing Data

- Each JSON file has the keys `text`, `headers`, `subject`, `messageId`, `priority`, `from`, `to`, `cc`, `bcc`, `date`

In [9]:
import os
import json
import pandas as pd

# Get the current working directory
cwd = os.getcwd()
print(cwd)

/Users/nmacdonald/projects/INTA6450_Enron/notebooks


In [10]:
# Load only emails.csv (slow: the file holds 517,401 emails)
emails_df = pd.read_csv("../emails.csv")
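
While exploring, it may be enough to load only the first rows of `emails.csv` instead of the whole file. A minimal sketch, assuming a hypothetical helper `load_sample` (not part of the original notebook) and the `nrows` argument of `pandas.read_csv`:

```python
import pandas as pd


def load_sample(source, n_rows=1000):
    """Read only the first n_rows records from a CSV path or file-like object.

    Hypothetical helper for quick exploration; n_rows=1000 is an arbitrary default.
    """
    return pd.read_csv(source, nrows=n_rows)


# Possible usage in the notebook (path taken from the cell above):
# sample_df = load_sample("../emails.csv", n_rows=1000)
```

The sample keeps the same `file` and `message` columns, so the inspection cells that follow would work unchanged on it, as long as they only index rows inside the sample.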

## Data Columns

- Look at the data for each column (first row)

In [41]:
# Column names: Index(['file', 'message'], dtype='object')
# Print the column names
print("Column names:", emails_df.columns)

# Print the first row of the first column
print(f"\nFirst row, first column:\n{emails_df.iloc[0, 0]}")

# Print the first row of the second column
print("\nFirst row, second column:\n", emails_df.iloc[0, 1])

print(f"\n11697 row, first column:\n{emails_df.iloc[11696, 0]}")

Column names: Index(['file', 'message'], dtype='object')

First row, first column:
allen-p/_sent_mail/1.

First row, second column:
 Message-ID: <18782981.1075855378110.JavaMail.evans@thyme>
Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
From: phillip.allen@enron.com
To: tim.belden@enron.com
Subject: 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: Tim Belden <Tim Belden/Enron@EnronXGate>
X-cc: 
X-bcc: 
X-Folder: \Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Sent Mail
X-Origin: Allen-P
X-FileName: pallen (Non-Privileged).pst

Here is our forecast

 

11697 row, first column:
bass-e/all_documents/1310.


In [15]:
# Alternative way to print the first row of the second column
print(emails_df["message"].iloc[0])

Message-ID: <18782981.1075855378110.JavaMail.evans@thyme>
Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
From: phillip.allen@enron.com
To: tim.belden@enron.com
Subject: 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: Tim Belden <Tim Belden/Enron@EnronXGate>
X-cc: 
X-bcc: 
X-Folder: \Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Sent Mail
X-Origin: Allen-P
X-FileName: pallen (Non-Privileged).pst

Here is our forecast
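
Each `message` value is a raw email: header lines, a blank line, then the body. One possible way to split it is sketched here with the standard-library `email` package. The helper name `split_message` is hypothetical, and this is not necessarily how the JSON files were produced:

```python
from email import message_from_string


def split_message(raw):
    """Split a raw email string into (headers, body)."""
    msg = message_from_string(raw)
    # Lower-case the header names so they resemble the JSON "headers" keys
    headers = {name.lower(): value for name, value in msg.items()}
    # For a non-multipart text/plain message the payload is the body string
    body = msg.get_payload()
    return headers, body


# Shortened copy of the first message shown above
raw = (
    "Message-ID: <18782981.1075855378110.JavaMail.evans@thyme>\n"
    "From: phillip.allen@enron.com\n"
    "To: tim.belden@enron.com\n"
    "\n"
    "Here is our forecast\n"
)
headers, body = split_message(raw)
print(headers["from"])
print(body.strip())
```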

 


# Individual Email

- Open a single email

The `headers` entry carries most of the metadata.

Might be useful:
- `message-id`: Message identifier 
- `from`: Sender email
- `to`: Recipient email
- `cc`: CC'd recipient emails (separated by comma)
- `bcc`: BCC'd recipient emails
- `x-from`: Sender name
- `x-to`: Recipient name
- `x-cc`: CC'd recipient names
- `x-bcc`: BCC'd recipient names
- `subject`

Not so useful:
- `mime-version`: 1.0
- `content-type`: text/plain; `charset`=us-ascii
- `content-transfer-encoding`: 7bit
- `x-folder`: Email folder
- `x-origin`: Mailbox the email was collected from (e.g., `SHACKLETON-S`)
- `x-filename`: File name

`data["subject"]` and `data["headers"]["subject"]` return the same thing.

Besides `text` and `headers`, the remaining keys in `data` are listed below with example values.
```
data["subject"]: Compagnie de Papiers Stadacona
data["messageId"]: 18639077.1075844798592.JavaMail.evans@thyme
data["priority"]: normal
data["from"]: [{'address': 'susan.bailey@enron.com', 'name': ''}]
data["to"]: [{'address': 'jeff.blumenthal@enron.com', 'name': ''}]
data["cc"]: [{'address': 'sara.shackleton@enron.com', 'name': ''}, {'address': 'laurel.adams@enron.com', 'name': ''}]
data["bcc"]: [{'address': 'sara.shackleton@enron.com', 'name': ''}, {'address': 'laurel.adams@enron.com', 'name': ''}]
data["date"]: 2001-04-25T18:42:00.000Z
```

Same:
- `data["subject"]` and `data["headers"]["subject"]`
- `data["messageId"]` and `data["headers"]["message-id"]`
    - `data["headers"]["message-id"]` holds the same value wrapped in angle brackets (`<...>`)

Nested lists (same addresses, but the top-level keys hold lists of dicts while the `headers` values are comma-separated strings):
- `data["from"]` and `data["headers"]["from"]`
    - `data["from"]` is a list of dicts with `address` and `name` keys (`name` is blank)
- `data["to"]` and `data["headers"]["to"]`
    - `data["to"]` is a list of dicts with `address` and `name` keys (`name` is blank)
- `data["cc"]` and `data["headers"]["cc"]`
    - `data["cc"]` is a list of dicts with `address` and `name` keys (`name` is blank)
- `data["bcc"]` and `data["headers"]["bcc"]`
    - `data["bcc"]` is a list of dicts with `address` and `name` keys (`name` is blank)
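
The relationships listed above can be checked directly. A small sketch using the example values from this email; the helper names `strip_angle_brackets` and `join_addresses` are hypothetical:

```python
def strip_angle_brackets(message_id):
    """Remove the surrounding <...> from a header-style message id."""
    return message_id.strip().strip("<>")


def join_addresses(entries):
    """Turn a list of {'address': ..., 'name': ...} dicts into a comma-separated string."""
    return ", ".join(entry["address"] for entry in entries)


# Example values copied from the email inspected in this section
header_message_id = "<18639077.1075844798592.JavaMail.evans@thyme>"
top_level_cc = [
    {"address": "sara.shackleton@enron.com", "name": ""},
    {"address": "laurel.adams@enron.com", "name": ""},
]

print(strip_angle_brackets(header_message_id))
print(join_addresses(top_level_cc))
```

With these helpers, the top-level values and the `headers` values can be compared as plain strings.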

Different:
- `data["date"]` and `data["headers"]["date"]` hold the same instant in different formats
    - `data["date"]`: 2001-04-25T18:42:00.000Z (ISO 8601, UTC)
    - `data["headers"]["date"]`: Wed, 25 Apr 2001 11:42:00 -0700 (PDT) (email header format, local time with a UTC offset)
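
Both date strings can be parsed into timezone-aware datetimes with the standard library, which makes them directly comparable. A minimal sketch; the function names are hypothetical, and replacing the trailing `Z` with `+00:00` is a workaround so that `datetime.fromisoformat` also accepts the value on Python versions before 3.11:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_header_date(value):
    """Parse an email-header date such as 'Wed, 25 Apr 2001 11:42:00 -0700 (PDT)'."""
    return parsedate_to_datetime(value)


def parse_iso_date(value):
    """Parse an ISO 8601 UTC timestamp such as '2001-04-25T18:42:00.000Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


header_dt = parse_header_date("Wed, 25 Apr 2001 11:42:00 -0700 (PDT)")
iso_dt = parse_iso_date("2001-04-25T18:42:00.000Z")

# Aware datetimes compare by instant, so the two formats agree
print(header_dt == iso_dt)
print(header_dt.astimezone(timezone.utc).isoformat())
```

Storing one normalized datetime column would avoid carrying two string formats through the later steps.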

## Import File

In [43]:
# Load the JSON file
with open("../data/0a0a0a98bade637d3c593c0db8e1e3f7.json", "r") as file:
    data = json.load(file)

# Print the data to verify
print(data)

data.keys()

{'text': 'Jeff,\n\nSara has asked me to follow-up regarding the captioned Enron Affiliate and \nthe need to confirm a trade.\n\nBeing that this Enron affiliate does not appear in the Corporate Database I \nwas hoping  you could furnish details regarding this Enron Affiliate, as \nfollows:\n\n1. When was the company incorporated??? \n2. Where was it incorporated???\n3 Or did we acquire an existing company by virtue of an acquisition???\n4. What is the date of the FX trade, for which this Enron Affiliate is a \nparty???\n5. If this is a newly formed Enron Affiliate was the date of the FX trade \ndated simultaneous with the company\'s incorporation???\n6. For notice purposes under the ISDA -- would we use the following??? \n    \n    Bay Wellington Tower - BCE Place\n    181 Bay Street, Suite 1540\n    Toronto, Ontario M5J 2T3\n7. If the foregoing address for notice is sufficient -- please advise as to: \n(a)  whom\'s attention, (b) phone number, and (c) fax number.\n\nIn discussing this 

dict_keys(['text', 'headers', 'subject', 'messageId', 'priority', 'from', 'to', 'cc', 'bcc', 'date'])

## Text

In [34]:
print(data["text"])

Jeff,

Sara has asked me to follow-up regarding the captioned Enron Affiliate and 
the need to confirm a trade.

Being that this Enron affiliate does not appear in the Corporate Database I 
was hoping  you could furnish details regarding this Enron Affiliate, as 
follows:

1. When was the company incorporated??? 
2. Where was it incorporated???
3 Or did we acquire an existing company by virtue of an acquisition???
4. What is the date of the FX trade, for which this Enron Affiliate is a 
party???
5. If this is a newly formed Enron Affiliate was the date of the FX trade 
dated simultaneous with the company's incorporation???
6. For notice purposes under the ISDA -- would we use the following??? 
    
    Bay Wellington Tower - BCE Place
    181 Bay Street, Suite 1540
    Toronto, Ontario M5J 2T3
7. If the foregoing address for notice is sufficient -- please advise as to: 
(a)  whom's attention, (b) phone number, and (c) fax number.

In discussing this matter with Sara, we have determined

## Headers



In [29]:
keys_headers = data["headers"].keys()
print(
    f"Keys in headers column: {len(keys_headers)}\n\nKey names in headers column:\n{keys_headers}"
)

# dict_keys(['message-id', 'date', 'from', 'to', 'subject', 'cc', 'mime-version',
# 'content-type', 'content-transfer-encoding', 'bcc',
# 'x-from', 'x-to', 'x-cc', 'x-bcc', 'x-folder', 'x-origin', 'x-filename'])

Keys in headers column: 17

Key names in headers column:
dict_keys(['message-id', 'date', 'from', 'to', 'subject', 'cc', 'mime-version', 'content-type', 'content-transfer-encoding', 'bcc', 'x-from', 'x-to', 'x-cc', 'x-bcc', 'x-folder', 'x-origin', 'x-filename'])


In [30]:
# Print every header field and its value
for key, value in data["headers"].items():
    print(f"{key}: {value}")

message-id: <18639077.1075844798592.JavaMail.evans@thyme>
date: Wed, 25 Apr 2001 11:42:00 -0700 (PDT)
from: susan.bailey@enron.com
to: jeff.blumenthal@enron.com
subject: Compagnie de Papiers Stadacona
cc: sara.shackleton@enron.com, laurel.adams@enron.com
mime-version: 1.0
content-type: text/plain; charset=us-ascii
content-transfer-encoding: 7bit
bcc: sara.shackleton@enron.com, laurel.adams@enron.com
x-from: Susan Bailey
x-to: Jeff Blumenthal
x-cc: Sara Shackleton, Laurel Adams
x-bcc: 
x-folder: \Sara_Shackleton_Dec2000_June2001_2\Notes Folders\Notes inbox
x-origin: SHACKLETON-S
x-filename: sshackle.nsf


## Remaining Keys

In [33]:
print(f'data["subject"]: {data["subject"]}')
print(f'data["messageId"]: {data["messageId"]}')
print(f'data["priority"]: {data["priority"]}')
print(f'data["from"]: {data["from"]}')
print(f'data["to"]: {data["to"]}')
print(f'data["cc"]: {data["cc"]}')
print(f'data["bcc"]: {data["bcc"]}')
print(f'data["date"]: {data["date"]}')

data["subject"]: Compagnie de Papiers Stadacona
data["messageId"]: 18639077.1075844798592.JavaMail.evans@thyme
data["priority"]: normal
data["from"]: [{'address': 'susan.bailey@enron.com', 'name': ''}]
data["to"]: [{'address': 'jeff.blumenthal@enron.com', 'name': ''}]
data["cc"]: [{'address': 'sara.shackleton@enron.com', 'name': ''}, {'address': 'laurel.adams@enron.com', 'name': ''}]
data["bcc"]: [{'address': 'sara.shackleton@enron.com', 'name': ''}, {'address': 'laurel.adams@enron.com', 'name': ''}]
data["date"]: 2001-04-25T18:42:00.000Z


# Approach

- Store like `emails.csv`
    - 517,401 rows, laid out much like the CMU `maildir` release
    - First column is `file`, second column is `message`
    - `file`: relative path of the email, e.g. row 11697 is `bass-e/all_documents/1310.`
        - Mirrors the CMU `maildir` directory layout
    - `message`: the full raw email (headers plus body)

Output from `emails_df["message"].iloc[0]`:
```
Message-ID: <18782981.1075855378110.JavaMail.evans@thyme>
Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
From: phillip.allen@enron.com
To: tim.belden@enron.com
Subject: 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: Tim Belden <Tim Belden/Enron@EnronXGate>
X-cc: 
X-bcc: 
X-Folder: \Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Sent Mail
X-Origin: Allen-P
X-FileName: pallen (Non-Privileged).pst
```

- Store like CMU `maildir`
    - Each person has a separate directory under `maildir` (e.g., `allen-p` and `arnold-j`).

Example of `maildir` tree for `allen-p`: 
- `lastname-firstinitial`
    - `_sent_mail`
    - `all_documents`
    - `contacts`
    - `deleted_items`
    - `discussion_threads`
    - `inbox`
    - `notes_inbox`
    - `sent`
    - `sent_items`
    - `straw`
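
Because the `file` column in `emails.csv` follows this directory layout, it can be split into person, folder, and file name. A sketch with a hypothetical helper `split_file_path`; it assumes `/` separators and treats everything between the first and last component as the folder, in case folders are nested:

```python
def split_file_path(file_value):
    """Split a value like 'allen-p/_sent_mail/1.' into (user, folder, name)."""
    parts = file_value.split("/")
    user = parts[0]
    name = parts[-1]
    # Join any intermediate components so nested folders stay intact
    folder = "/".join(parts[1:-1])
    return user, folder, name


print(split_file_path("allen-p/_sent_mail/1."))
print(split_file_path("bass-e/all_documents/1310."))
```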


# Parse All Emails

Reading all 251,734 JSON files is very slow.

In [45]:
import os
import json

# Directory containing the JSON files
json_dir = os.path.join(cwd, "../data")

# List to store the data from each JSON file
json_data_list = []

# Iterate over all files in the directory
for filename in os.listdir(json_dir):
    if filename.endswith(".json"):
        file_path = os.path.join(json_dir, filename)
        with open(file_path, "r") as file:
            data = json.load(file)
            json_data_list.append(data)

# Print the number of JSON files read
print(f"Number of JSON files read: {len(json_data_list)}")

# Example: Print the keys of the first JSON file
if json_data_list:
    print(f"Keys in the first JSON file: {json_data_list[0].keys()}")

Number of JSON files read: 251734
Keys in the first JSON file: dict_keys(['text', 'headers', 'subject', 'messageId', 'priority', 'from', 'to', 'date'])
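
The first file read here has no `cc` or `bcc` key, unlike the single email inspected earlier, so the keys are not the same in every file. A sketch for tallying which keys appear, assuming a list of already-loaded dicts such as `json_data_list`; `count_keys` is a hypothetical helper:

```python
from collections import Counter


def count_keys(records):
    """Count how many records contain each top-level key."""
    counts = Counter()
    for record in records:
        counts.update(record.keys())
    return counts


# Tiny illustrative input; in the notebook this would be json_data_list
example_records = [
    {"text": "a", "headers": {}, "cc": []},
    {"text": "b", "headers": {}},
]
print(count_keys(example_records))
```

Keys with a count below the number of files are optional, which is why lookups on them should use `.get` with a default.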


In [47]:
import os
import json
import pandas as pd

# Directory containing the JSON files
json_dir = os.path.join(cwd, "../data")

# List to store the data from each JSON file
data_list = []

# Iterate over all files in the directory
for filename in os.listdir(json_dir):
    if filename.endswith(".json"):
        file_path = os.path.join(json_dir, filename)
        with open(file_path, "r") as file:
            data = json.load(file)
            # Extract relevant information
            email_data = {
                "message_id": data["headers"].get("message-id", ""),
                "date": data["headers"].get("date", ""),
                "from": data["headers"].get("from", ""),
                "from_name": data["headers"].get("x-from", ""),
                "to": data["headers"].get("to", ""),
                "to_name": data["headers"].get("x-to", ""),
                "cc": data["headers"].get("cc", ""),
                "cc_name": data["headers"].get("x-cc", ""),
                "bcc": data["headers"].get("bcc", ""),
                "bcc_name": data["headers"].get("x-bcc", ""),
                "subject": data["headers"].get("subject", ""),
                "text": data.get("text", ""),
                "folder": data["headers"].get("x-folder", ""),
                "origin": data["headers"].get("x-origin", ""),
                "filename": data["headers"].get("x-filename", ""),
            }
            data_list.append(email_data)

# Create a DataFrame from the list of dictionaries
emails_df = pd.DataFrame(data_list)

# Save the DataFrame to a CSV file
emails_df.to_csv("parsed_emails.csv", index=False)

# Print the first few rows of the DataFrame to verify
# print(emails_df.head())
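
The extraction step in the loop above could also be written as a small function driven by a column-to-header mapping, which keeps the column list in one place and tolerates emails without a `headers` entry. A sketch only: `HEADER_COLUMNS` and `flatten_email` are hypothetical names, and `text` ends up as the last column here rather than before `folder`.

```python
# Output column name -> key inside data["headers"]
HEADER_COLUMNS = {
    "message_id": "message-id",
    "date": "date",
    "from": "from",
    "from_name": "x-from",
    "to": "to",
    "to_name": "x-to",
    "cc": "cc",
    "cc_name": "x-cc",
    "bcc": "bcc",
    "bcc_name": "x-bcc",
    "subject": "subject",
    "folder": "x-folder",
    "origin": "x-origin",
    "filename": "x-filename",
}


def flatten_email(data):
    """Build one flat row from a parsed JSON email, using '' for missing fields."""
    headers = data.get("headers", {})
    row = {column: headers.get(key, "") for column, key in HEADER_COLUMNS.items()}
    row["text"] = data.get("text", "")
    return row


example = {"text": "great", "headers": {"from": "hunter.shively@enron.com"}}
print(flatten_email(example)["from"])
```

Inside the loop, `data_list.append(flatten_email(data))` would replace the inline dictionary.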

In [48]:
emails_df.head(5)

|  | message_id | date | from | from_name | to | to_name | cc | cc_name | bcc | bcc_name | subject | text | folder | origin | filename |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | `<88180.1075863689140.JavaMail.evans@thyme>` | Tue, 8 May 2001 08:37:00 -0700 (PDT) | rika.imai@enron.com | Rika Imai | john.forney@enron.com, mike.carson@enron.com, ... | John M Forney, Mike Carson, Clint Dean, Doug G... |  |  |  |  | 4 Month Rolling Forecast | `---------------------- Forwarded by Rika Imai/...` | `\Rob_Benson_Jun2001\Notes Folders\Notes inbox` | Benson-R | rbenson.nsf |
| 1 | `<4460514.1075857469666.JavaMail.evans@thyme>` | Wed, 21 Jun 2000 02:01:00 -0700 (PDT) | hunter.shively@enron.com | Hunter S Shively | richard.tomaski@enron.com | Richard Tomaski |  |  |  |  | Re: Jim Simpson | great | `\Hunter_Shively_Jun2001\Notes Folders\Sent` | Shively-H | hshivel.nsf |
| 2 | `<2160301.1075858147494.JavaMail.evans@thyme>` | Wed, 16 Aug 2000 03:03:00 -0700 (PDT) | matthew.lenhart@enron.com | Matthew Lenhart | shelliott@dttus.com | `Shirley Elliott <shelliott@dttus.com> @ ENRON` |  |  |  |  | Re: Re[2]: | oohh la la. who was your "friend"? did you g... | `\Matthew_Lenhart_Jun2001\Notes Folders\Sent` | Lenhart-M | mlenhar.nsf |
| 3 | `<22847680.1075863611080.JavaMail.evans@thyme>` | Wed, 15 Aug 2001 05:46:47 -0700 (PDT) | rika.imai@enron.com | `Imai, Rika </O=ENRON/OU=NA/CN=RECIPIENTS/CN=RI...` | russell.ballato@enron.com, hicham.benjelloun@e... | `Ballato, Russell </O=ENRON/OU=NA/CN=RECIPIENTS...` |  |  |  |  | FW: Nuclear Rolling Forecast | `\nAttached are the two files with this week's ...` | `\ExMerge - Benson, Robert\Inbox\Large Messages` | BENSON-R | rob benson 6-25-02.PST |
| 4 | `<15012282.1075852957298.JavaMail.evans@thyme>` | Wed, 3 Oct 2001 00:35:05 -0700 (PDT) | jennifer.fraser@enron.com | `Fraser, Jennifer </O=ENRON/OU=NA/CN=RECIPIENTS...` | larry.may@enron.com | `May, Larry </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Lm...` |  |  |  |  | hello | `lm:\nWhat are your thoughts going forward........` | `\LMAY2 (Non-Privileged)\Inbox` | May-L | LMAY2 (Non-Privileged).pst |


# Topic Modeling

- The first attempt took 9 min 12 s to run.

In [49]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np

# Step 1: Select the text data
# No cleaning is applied yet; the raw 'text' column of the DataFrame is used as-is
text_data = emails_df['text'].values

# Step 2: Convert the text data into a document-term matrix
vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
dtm = vectorizer.fit_transform(text_data)

# Step 3: Fit the LDA model to the document-term matrix
lda = LatentDirichletAllocation(n_components=10, random_state=42)
lda.fit(dtm)

# Step 4: Analyze the topics
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic {topic_idx}:")
        print(" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))

no_top_words = 10
tf_feature_names = vectorizer.get_feature_names_out()
display_topics(lda, tf_feature_names, no_top_words)

Topic 0:
td font com http br tr size width href align
Topic 1:
know time just like good going let don think ll
Topic 2:
com company new business services http management www million technology
Topic 3:
enron power energy said company california state market gas new
Topic 4:
00 2001 message sent original subject pm 10 11 12
Topic 5:
com http www way new travel 00 2001 tx day
Topic 6:
com enron mail message intended subject mailto recipient 2001 net
Topic 7:
com http image www click asp email free net gif
Topic 8:
ect enron hou cc subject 2000 pm corp ees 2001
Topic 9:
enron subject gas need agreement thanks cc attached 2001 know
