<a href="https://colab.research.google.com/github/gstripling00/conferences/blob/main/Text_Mining.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Word search with dataframes
- We're going to work with text data containing emails from Enron employees. The Enron scandal is a famous fraud case: Enron employees covered up the bad financial position of the company, thereby keeping the stock price artificially high. They sold their own stock options, and when the truth came out, Enron investors were left with nothing. The goal is to find all emails that mention specific phrases, such as **"sell enron stock"**.
- By using string operations on dataframes, we can easily sift through messy email data and create flags based on word-hits.

In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv('data/enron_emails_clean.csv')
df.head()

Unnamed: 0,Message-ID,From,To,Date,content,clean_content
0,<8345058.1075840404046.JavaMail.evans@thyme>,('advdfeedback@investools.com'),('advdfeedback@investools.com'),2002-01-29 23:20:55,INVESTools Advisory\nA Free Digest of Trusted ...,investools advisory free digest trusted invest...
1,<1512159.1075863666797.JavaMail.evans@thyme>,('richard.sanders@enron.com'),('richard.sanders@enron.com'),2000-09-20 19:07:00,----- Forwarded by Richard B Sanders/HOU/ECT o...,forwarded richard b sanders hou ect pm justin ...
2,<26118676.1075862176383.JavaMail.evans@thyme>,('m..love@enron.com'),('m..love@enron.com'),2001-10-30 16:15:17,hey you are not wearing your target purple shi...,hey wearing target purple shirt today mine wan...
3,<10369289.1075860831062.JavaMail.evans@thyme>,('leslie.milosevich@kp.org'),('leslie.milosevich@kp.org'),2002-01-30 17:54:18,Leslie Milosevich\n1042 Santa Clara Avenue\nAl...,leslie milosevich santa clara avenue alameda c...
4,<26728895.1075860815046.JavaMail.evans@thyme>,('rtwait@graphicaljazz.com'),('rtwait@graphicaljazz.com'),2002-01-30 19:36:01,"Rini Twait\n1010 E 5th Ave\nLongmont, CO 80501...",rini twait e th ave longmont co rtwait graphic...


In [None]:
# drop empty rows

df.dropna(subset=['clean_content'], inplace=True)

In [None]:
df['clean_content'].head()

0    investools advisory free digest trusted invest...
1    forwarded richard b sanders hou ect pm justin ...
2    hey wearing target purple shirt today mine wan...
3    leslie milosevich santa clara avenue alameda c...
4    rini twait e th ave longmont co rtwait graphic...
Name: clean_content, dtype: object

In [None]:
# Find all cleaned emails that contain 'sell enron stock'
mask = df['clean_content'].str.contains('sell enron stock', na=False)
mask.head()

0    False
1    False
2    False
3    False
4    False
Name: clean_content, dtype: bool

In [None]:
# Select the data from df using the mask
df.loc[mask].head()

Unnamed: 0,Message-ID,From,To,Date,content,clean_content
154,<6336501.1075841154311.JavaMail.evans@thyme>,('sarah.palmer@enron.com'),('sarah.palmer@enron.com'),2002-02-01 14:53:35,\nJoint Venture: A 1997 Enron Meeting Belies O...,joint venture enron meeting belies officers cl...


- Searching for particular string values in a dataframe is relatively easy, and it allows us to bring textual data into our analysis. We can use this word search as an additional flag, or as a feature in our fraud detection model.
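- As a minimal sketch of the same idea on data small enough to check by eye, the snippet below builds a hypothetical four-row dataframe (the rows are invented for illustration and are not taken from the Enron file) and applies the same `str.contains` mask, including a missing value to show what `na=False` does.

```python
import pandas as pd

# Hypothetical toy data standing in for the clean_content column
toy = pd.DataFrame({
    'clean_content': [
        'please sell enron stock before friday',
        'lunch meeting moved to noon',
        None,  # a missing email body
        'enron stock price update',
    ]
})

# na=False treats missing values as "no match", so the mask is strictly boolean
mask = toy['clean_content'].str.contains('sell enron stock', na=False)

# Boolean indexing keeps only the matching rows
hits = toy.loc[mask]
print(hits)
```

- Only the first row matches: the last row mentions "enron stock" but not the full phrase, and the missing body is treated as a non-match rather than propagating `NaN` into the mask.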

### Using list of terms
- Oftentimes we don't want to search on just one term. We can usually create a full **"fraud dictionary"** of terms that could potentially flag fraudulent clients and/or transactions. Fraud analysts will often have an idea of what should be in such a dictionary.
- We're going to flag a multitude of terms, and we'll create a new flag variable out of it.
- The 'flag' can be used either directly in a machine learning model as a feature, or as an additional filter on top of our machine learning model results.

In [None]:
# Create a list of terms to search for
searchfor = ['enron stock', 'sell stock', 'stock bonus', 'sell enron stock']

# Filter cleaned emails on searchfor list and select from df
filtered_emails = df.loc[df['clean_content'].str.contains('|'.join(searchfor), na=False)]
filtered_emails.head()

Unnamed: 0,Message-ID,From,To,Date,content,clean_content
0,<8345058.1075840404046.JavaMail.evans@thyme>,('advdfeedback@investools.com'),('advdfeedback@investools.com'),2002-01-29 23:20:55,INVESTools Advisory\nA Free Digest of Trusted ...,investools advisory free digest trusted invest...
1,<1512159.1075863666797.JavaMail.evans@thyme>,('richard.sanders@enron.com'),('richard.sanders@enron.com'),2000-09-20 19:07:00,----- Forwarded by Richard B Sanders/HOU/ECT o...,forwarded richard b sanders hou ect pm justin ...
2,<26118676.1075862176383.JavaMail.evans@thyme>,('m..love@enron.com'),('m..love@enron.com'),2001-10-30 16:15:17,hey you are not wearing your target purple shi...,hey wearing target purple shirt today mine wan...
3,<10369289.1075860831062.JavaMail.evans@thyme>,('leslie.milosevich@kp.org'),('leslie.milosevich@kp.org'),2002-01-30 17:54:18,Leslie Milosevich\n1042 Santa Clara Avenue\nAl...,leslie milosevich santa clara avenue alameda c...
4,<26728895.1075860815046.JavaMail.evans@thyme>,('rtwait@graphicaljazz.com'),('rtwait@graphicaljazz.com'),2002-01-30 19:36:01,"Rini Twait\n1010 E 5th Ave\nLongmont, CO 80501...",rini twait e th ave longmont co rtwait graphic...


- By joining the search terms with the 'or' sign, i.e. `|`, we build a single regular expression that matches any of the terms, so we can search on a multitude of terms in our dataset very easily.
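- Because the joined string is interpreted as a regular expression, terms that contain characters such as `.` or `+` can match more than intended. The sketch below uses hypothetical terms and sentences (not from the Enron data) and escapes each term with `re.escape` before joining; the terms used in this notebook contain no such characters, so this is only a precaution for larger dictionaries.

```python
import re
import pandas as pd

# Hypothetical terms; 'u.s. stock' contains '.', which is a regex wildcard
terms = ['sell stock', 'u.s. stock']

# Escape each term so it is matched literally, then join with the regex OR
pattern = '|'.join(re.escape(t) for t in terms)

toy = pd.Series([
    'advised to sell stock today',
    'u.s. stock markets closed',
    'uxsx stock is not a real phrase',
    'weekly schedule attached',
])

matches = toy.str.contains(pattern, na=False)
print(matches.tolist())
```

- Without the escaping step, the third sentence would also match, because each unescaped `.` accepts any character.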

### Creating a flag
- We are going to create an actual flag variable that gives a 1 when an email gets a hit on the search terms of interest, and 0 otherwise. This is the last step we need to take in order to use the text content as a feature in a machine learning model, or as a flag on top of model results.

In [None]:
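# Create flag variable where the emails match the searchfor terms
# (na=False counts missing text as "no match"; the mask is already boolean, so no == True is needed)
df['flag'] = np.where(df['clean_content'].str.contains('|'.join(searchfor), na=False), 1, 0)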

# Count the values of the flag variable
count = df['flag'].value_counts()
print(count)

0    1772
1     314
Name: flag, dtype: int64


In [None]:
df['flag'].head()

0    1
1    1
2    1
3    1
4    1
Name: flag, dtype: int32

- We have now managed to search for a list of strings in several lines of text data. These skills come in handy when we want to flag certain words based on what we discovered in our topic model, or when we know beforehand what we want to search for.
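- An equivalent way to build the flag, sketched here on a hypothetical toy dataframe, is to cast the boolean mask to integers directly. The `searchfor` list is the one defined earlier; the example rows are invented for illustration.

```python
import pandas as pd

# Hypothetical rows standing in for cleaned email bodies
toy = pd.DataFrame({'clean_content': [
    'employees sell stock',
    'stock bonus paid in march',
    'see you at lunch',
    None,
]})
searchfor = ['enron stock', 'sell stock', 'stock bonus', 'sell enron stock']

# True/False per row; missing text counts as no hit
hit = toy['clean_content'].str.contains('|'.join(searchfor), na=False)

# Casting the boolean mask gives the same 1/0 flag as np.where(hit, 1, 0)
toy['flag'] = hit.astype(int)

print(toy['flag'].value_counts())
print('share flagged:', toy['flag'].mean())
```

- Since the flag is numeric, its mean is the share of flagged emails, which is a quick sanity check on how broad the term list is.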

## Data Preprocessing

### Removing stopwords
- Next we clean the Enron emails so that the data can be used in a topic model.
- As a first step, we need to define the list of stopwords and punctuation characters that are to be removed.

In [None]:
# Import nltk packages and string
from nltk.corpus import stopwords
import string

# Define stopwords to exclude
stop = set(stopwords.words('english'))
stop.update(("to","cc","subject","http","from","sent", "ect", "u", "fwd", "www", "com"))

# Define punctuation characters to exclude
exclude = set(string.punctuation)

### Lemmatization and stemming

In [None]:
# Import the lemmatizer from nltk
from nltk.stem.wordnet import WordNetLemmatizer
lemma = WordNetLemmatizer()

# Define word cleaning function
def clean(text, stop):
    text = text.rstrip()
    # Lowercase, then drop stopwords and purely numeric tokens
    stop_free = " ".join(i for i in text.lower().split() if i not in stop and not i.isdigit())
    # Strip punctuation characters (uses the exclude set defined earlier)
    punc_free = ''.join(i for i in stop_free if i not in exclude)
    # Reduce each remaining word to its lemma
    normalized = " ".join(lemma.lemmatize(i) for i in punc_free.split())
    return normalized

# Clean the emails in df and store each one as a list of tokens
text_clean = [clean(text, stop).split() for text in df['clean_content']]

In [None]:
print(text_clean[0][:5])

['investools', 'advisory', 'free', 'digest', 'trusted']


- The data is now fully cleaned: we have split the text into words, removed stopwords and punctuation, and lemmatized the words. We are now ready to run a topic model on this data.
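- To make the order of the cleaning steps concrete, here is a simplified sketch that uses only the standard library. It assumes a tiny hand-written stopword set (`toy_stop`) instead of NLTK's English list and leaves out lemmatization, so `clean_sketch` is a hypothetical stand-in for the real cleaning function, not a replacement.

```python
import string

# Tiny stand-in stopword list (assumption); the notebook uses NLTK's English list
toy_stop = {'the', 'to', 'a', 'is', 'on', 'from', 'cc'}
punctuation = set(string.punctuation)

def clean_sketch(text, stop_words):
    """Lowercase, drop stopwords and pure-digit tokens, then strip punctuation."""
    tokens = [t for t in text.lower().split()
              if t not in stop_words and not t.isdigit()]
    joined = ' '.join(tokens)
    no_punct = ''.join(ch for ch in joined if ch not in punctuation)
    return no_punct.split()

print(clean_sketch('The meeting is moved to 3 PM on Friday, cc Richard.', toy_stop))
# ['meeting', 'moved', 'pm', 'friday', 'richard']

# Stopwords are filtered before punctuation is stripped,
# so a stopword with punctuation attached survives
print(clean_sketch('the, end', {'the'}))
# ['the', 'end']
```

- The second call shows a side effect of this ordering: a stopword glued to punctuation is not recognized as a stopword. On text that has already been pre-cleaned, as in `clean_content`, this should matter little.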

### Topic modelling on fraud

#### Create dictionary and corpus
- In order to run an LDA topic model, we first need to define our dictionary and corpus, as both go into the model. We're going to continue working on the cleaned text data.

In [None]:
# Import the packages
import gensim
from gensim import corpora

# Define the dictionary
dictionary = corpora.Dictionary(text_clean)

# Define the corpus
corpus = [dictionary.doc2bow(text) for text in text_clean]


In [None]:
# Print dictionary
c=0
for k,v in dictionary.items():
    print(k,v)
    c+=1
    if c==5:
        break

0 account
1 accurate
2 acquiring
3 acre
4 address


In [None]:
# Print corpus
print(corpus[0][:5])

[(0, 2), (1, 1), (2, 1), (3, 1), (4, 1)]


- These are the two ingredients we need to run our topic model on the Enron emails. We are now ready for the final step: creating our fraud detection topic model.
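- The corpus is a bag-of-words representation: each document becomes a list of `(token_id, count)` pairs. The sketch below reproduces that idea with the standard library on two hypothetical token lists. The id assignment (order of first appearance) is an assumption of this sketch and is not how gensim numbers its tokens.

```python
from collections import Counter

# Hypothetical tokenized documents
docs = [['enron', 'stock', 'sell', 'stock'], ['stock', 'bonus']]

# Build a token -> id mapping, assigning ids in order of first appearance
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(token2id[t] for t in doc)
    return sorted(counts.items())

bows = [to_bow(d) for d in docs]
print(token2id)  # {'enron': 0, 'stock': 1, 'sell': 2, 'bonus': 3}
print(bows)      # [[(0, 1), (1, 2), (2, 1)], [(1, 1), (3, 1)]]
```

- Word order is discarded; only how often each token occurs in each document is kept, which is all that LDA uses.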

### LDA model
- We can now build the LDA model. Using the dictionary and corpus, we are ready to discover which topics are present in the Enron emails. With a quick print of the words assigned to each topic, we can do a first exploration of whether any obvious topics jump out. Be mindful that the topic model is computationally heavy, so it will take a while to run.

In [None]:
# Define the LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=5, id2word=dictionary, passes=5)

# Save the topics and top 5 words
topics = ldamodel.print_topics(num_words=5)

# Print the results
for topic in topics:
    print(topic)

(0, '0.011*"enron" + 0.010*"bakernet" + 0.010*"employee" + 0.007*"company" + 0.007*"schedule"')
(1, '0.038*"image" + 0.037*"td" + 0.026*"net" + 0.025*"money" + 0.024*"tr"')
(2, '0.053*"enron" + 0.014*"pm" + 0.013*"hou" + 0.011*"development" + 0.010*"corp"')
(3, '0.020*"enron" + 0.007*"company" + 0.006*"e" + 0.004*"said" + 0.004*"may"')
(4, '0.010*"enron" + 0.009*"please" + 0.006*"market" + 0.006*"conference" + 0.005*"energy"')


- We have now successfully created our first topic model on the Enron email data. However, the print of words doesn't really give us enough information to find a topic that might lead us to signs of fraud. We'll therefore need to closely inspect the model results in order to be able to detect anything that can be related to fraud in our data.
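- One way to see why five words per topic say so little is to add up their weights. The sketch below parses the topic 3 string printed above with a regular expression (the pattern is an assumption about the `weight*"word"` layout of that output) and sums the weights.

```python
import re

# Topic 3 string copied from the printed output above
topic_str = '0.020*"enron" + 0.007*"company" + 0.006*"e" + 0.004*"said" + 0.004*"may"'

# Extract (word, weight) pairs from the weight*"word" pieces
pairs = [(word, float(weight))
         for weight, word in re.findall(r'([\d.]+)\*"([^"]+)"', topic_str)]
print(pairs)

# Share of the topic's probability mass covered by the words shown
total = sum(weight for _, weight in pairs)
print(round(total, 3))
```

- The five words shown account for only about 4% of the topic's probability mass, so most of what defines the topic is not visible in this printout.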

### Visualizing topics

In [None]:
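import pyLDAvis
try:
    import pyLDAvis.gensim_models as gensimvis  # pyLDAvis >= 3.0 renamed the gensim adapter
except ImportError:
    import pyLDAvis.gensim as gensimvis  # older pyLDAvis releases

lda_display = gensimvis.prepare(ldamodel, corpus, dictionary, sort_topics=False)

pyLDAvis.display(lda_display)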

### Finding fraudsters based on topic
- We're going to link the results from the topic model back to our original data. We have now learned that we want to flag everything related to topic 3. As we will see, this is actually not that straightforward.

In [None]:
def get_topic_details(ldamodel, corpus):
    # Collect one (dominant topic, score) row per document.
    # DataFrame.append was removed in pandas 2.0, so rows are gathered in a list first.
    rows = []
    for doc_topics in ldamodel[corpus]:
        # The dominant topic is the one with the highest proportion
        topic_num, prop_topic = max(doc_topics, key=lambda x: x[1])
        rows.append((float(topic_num), prop_topic))
    return pd.DataFrame(rows, columns=['Dominant_Topic', '% Score'])

# Run get_topic_details function and check the results
print(get_topic_details(ldamodel, corpus))

      Dominant_Topic   % Score
0                3.0  0.999211
1                3.0  0.427233
2                4.0  0.798478
3                0.0  0.993462
4                0.0  0.993382
...              ...       ...
2081             4.0  0.910436
2082             1.0  0.598023
2083             1.0  0.999323
2084             4.0  0.998146
2085             2.0  0.902532

[2086 rows x 2 columns]


In [None]:
# Add original text to topic details in a dataframe
contents = pd.DataFrame({'Original text': text_clean})
topic_details = pd.concat([get_topic_details(ldamodel, corpus), contents], axis=1)
print(topic_details.head())

   Dominant_Topic   % Score                                      Original text
0             3.0  0.999212  [investools, advisory, free, digest, trusted, ...
1             3.0  0.427231  [forwarded, richard, b, sander, hou, pm, justi...
2             4.0  0.798478  [hey, wearing, target, purple, shirt, today, m...
3             0.0  0.993462  [leslie, milosevich, santa, clara, avenue, ala...
4             0.0  0.993383  [rini, twait, e, th, ave, longmont, co, rtwait...


In [None]:
# Create flag for emails whose dominant topic is topic 3
topic_details['flag'] = np.where((topic_details['Dominant_Topic'] == 3.0), 1, 0)
print(topic_details.head())

   Dominant_Topic   % Score  \
0             3.0  0.999212   
1             3.0  0.427231   
2             4.0  0.798478   
3             0.0  0.993462   
4             0.0  0.993383   

                                       Original text  flag  
0  [investools, advisory, free, digest, trusted, ...     1  
1  [forwarded, richard, b, sander, hou, pm, justi...     1  
2  [hey, wearing, target, purple, shirt, today, m...     0  
3  [leslie, milosevich, santa, clara, avenue, ala...     0  
4  [rini, twait, e, th, ave, longmont, co, rtwait...     0  


- We have now flagged all emails whose dominant topic is topic 3, which seems to cover internal conversations about Enron stock options.
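- The dominant-topic flag treats an email that belongs to topic 3 with a score of 0.43 the same as one with a score of 0.99. As a hedged sketch of one possible refinement, the snippet below adds a stricter flag that also requires a minimum score. The topic distributions are hypothetical, and the `MIN_SCORE` cutoff of 0.5 is an arbitrary assumption, not a value taken from the analysis above.

```python
import pandas as pd

# Hypothetical per-document topic distributions: lists of (topic_id, proportion)
doc_topics = [
    [(3, 0.999), (0, 0.001)],
    [(3, 0.427), (2, 0.400), (4, 0.173)],
    [(4, 0.798), (1, 0.202)],
]

# Keep the highest-proportion topic for each document
rows = [max(topics, key=lambda pair: pair[1]) for topics in doc_topics]
details = pd.DataFrame(rows, columns=['Dominant_Topic', '% Score'])

TARGET_TOPIC = 3
MIN_SCORE = 0.5  # assumed cutoff for illustration only

# Plain flag: dominant topic equals the target topic
is_target = details['Dominant_Topic'] == TARGET_TOPIC
details['flag'] = is_target.astype(int)

# Stricter flag: additionally require a minimum topic score
details['flag_strict'] = (is_target & (details['% Score'] >= MIN_SCORE)).astype(int)

print(details)
```

- In this toy example the second document is flagged by the plain rule but not by the strict one. Whether such a cutoff helps, and where to set it, would need to be checked against known fraud cases.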