# Chapter Four. Using text data to detect fraud
In this final chapter, you will use text data, text mining and topic modeling to detect fraudulent behavior.

> **Topics:**
- 1. Using text data
    - 1.1. Word search with dataframes
    - 1.2. Using list of terms
    - 1.3. Creating a flag
- 2. Text mining to detect fraud
    - 2.1. Removing stopwords
    - 2.2. Cleaning text data
- 3. Topic modeling on fraud
    - 3.1. Create dictionary and corpus
    - 3.2. LDA model
- 4. Flagging fraud based on topics
    - 4.1. Interpreting the topic model
    - 4.2. Finding fraudsters based on topic
- 5. Fraud detection in Python Recap

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

filepath = '../_datasets/chapter_4/'

sns.set()

## 1. Using text data
### You will often encounter text data during fraud detection
Types of useful text data:
1. Emails from employees and/or clients
2. Transaction descriptions
3. Employee notes
4. Insurance claim form description box
5. Recorded telephone conversations
6. ...

### Text mining techniques for fraud detection
1. Word search
2. Sentiment analysis
3. Word frequencies and topic analysis
4. Style

### Word search for fraud detection
Flagging suspicious words:
1. Simple, straightforward and easy to explain
2. Match results can be used as a filter on top of machine learning model
3. Match results can be used as a feature in a machine learning model
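
As a rough illustration of the second point, a word-search flag can be combined with model output so that a case is reviewed when either signal fires. This is only a sketch: the `results` dataframe, its column names and its values are invented here and do not come from the course data.

```python
import pandas as pd

# Invented example: model predictions and a word-search flag per client
results = pd.DataFrame({
    'model_pred': [0, 1, 0, 0],   # hypothetical classifier output
    'word_flag':  [0, 0, 1, 0],   # 1 if a suspicious term was found
})

# Use the flag as an extra filter: review a case if either signal fires
either_signal = (results['model_pred'] == 1) | (results['word_flag'] == 1)
results['review'] = either_signal.astype(int)
print(results['review'].tolist())
```

Whether to combine the two signals with "or" (more cases to review) or "and" (fewer cases) is a choice that depends on how many alerts the analysts can handle.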

### Word counts to flag fraud with pandas
```Python
# Using a string operator to find words
df['email_body'].str.contains('money laundering')

# Select data that matches
df.loc[df['email_body'].str.contains('money laundering', na=False)]

# Create a list of words to search for
list_of_words = ['police', 'money laundering']
df.loc[df['email_body'].str.contains('|'.join(list_of_words), na=False)]

# Create a fraud flag
df['flag'] = np.where((df['email_body'].str.contains('|'.join(list_of_words)) == True), 1, 0)
```

### 1.1 Word search with dataframes
In this exercise you're going to work with text data, containing emails from Enron employees. The **Enron scandal** is a famous fraud case. Enron employees covered up the bad financial position of the company, thereby keeping the stock price artificially high. Enron employees sold their own stock options, and when the truth came out, Enron investors were left with nothing. The goal is **to find all emails** that mention specific words, such as "sell enron stock".

By using string operations on dataframes, you can easily sift through messy email data and create flags based on word-hits. The Enron email data has been put into a dataframe called `df` so let's search for suspicious terms. Feel free to explore `df` in the Console before getting started.

In [5]:
df = pd.read_csv(filepath+"enron_emails_clean.csv", index_col='Message-ID')
df.head(2)

| Message-ID | From | To | Date | content | clean_content |
|---|---|---|---|---|---|
| `<8345058.1075840404046.JavaMail.evans@thyme>` | ('advdfeedback@investools.com') | ('advdfeedback@investools.com') | 2002-01-29 23:20:55 | INVESTools Advisory\nA Free Digest of Trusted ... | investools advisory free digest trusted invest... |
| `<1512159.1075863666797.JavaMail.evans@thyme>` | ('richard.sanders@enron.com') | ('richard.sanders@enron.com') | 2000-09-20 19:07:00 | ----- Forwarded by Richard B Sanders/HOU/ECT o... | forwarded richard b sanders hou ect pm justin ... |


In [8]:
# Find all cleaned emails that contain 'sell enron stock'
mask = df['clean_content'].str.contains('sell enron stock', na=False)

# Select the data from df using the mask
df.loc[mask]

| Message-ID | From | To | Date | content | clean_content |
|---|---|---|---|---|---|
| `<6336501.1075841154311.JavaMail.evans@thyme>` | ('sarah.palmer@enron.com') | ('sarah.palmer@enron.com') | 2002-02-01 14:53:35 | \nJoint Venture: A 1997 Enron Meeting Belies O... | joint venture enron meeting belies officers cl... |


You see that searching for particular string values in a dataframe is relatively easy, and allows you to include textual data in your model or analysis. You can use this word search as an additional flag, or as a feature in your fraud detection model. Let's now have a look at how to filter the data using multiple search terms.
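
By default `str.contains` is case-sensitive and treats the pattern as a regular expression. The sketch below shows the effect of the `case`, `regex` and `na` arguments on a small made-up dataframe; the `emails` frame and its contents are invented for illustration and are not the Enron data.

```python
import pandas as pd

# Toy data, invented for illustration; not the Enron emails
emails = pd.DataFrame({
    'clean_content': [
        'please sell enron stock today',
        'Sell Enron Stock before friday',
        'lunch at noon',
        None,  # missing email body
    ]
})

# Default: case-sensitive regex match; na=False turns missing values into False
mask_default = emails['clean_content'].str.contains('sell enron stock', na=False)

# Case-insensitive literal match (regex=False treats the pattern as plain text)
mask_nocase = emails['clean_content'].str.contains(
    'sell enron stock', case=False, regex=False, na=False)

print(mask_default.tolist())
print(mask_nocase.tolist())
```

Since `clean_content` is already lowercased, the default search is sufficient there; the case-insensitive variant matters mainly when searching the raw `content` column.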

### 1.2 Using list of terms
Often you don't want to search on just one term. You can probably create a full **"fraud dictionary"** of terms that could potentially **flag fraudulent clients** and/or transactions. Fraud analysts will often have an idea of what should be in such a dictionary. In this exercise you're going to **flag a multitude of terms**, and in the next exercise you'll create a new flag variable out of it. The 'flag' can be used either directly in a machine learning model as a feature, or as an additional filter on top of your machine learning model results. Let's first use a **list of terms** to filter our data on. The dataframe containing the cleaned emails is again available as `df`.

In [11]:
# Create a list of terms to search for
searchfor = [ 'enron stock', 'sell stock', 'stock bonus', 'sell enron stock']

# Filter cleaned emails on searchfor list and select from df 
filtered_emails = df.loc[df['clean_content'].str.contains('|'.join(searchfor), na=False)]
filtered_emails.head()

| Message-ID | From | To | Date | content | clean_content |
|---|---|---|---|---|---|
| `<8345058.1075840404046.JavaMail.evans@thyme>` | ('advdfeedback@investools.com') | ('advdfeedback@investools.com') | 2002-01-29 23:20:55 | INVESTools Advisory\nA Free Digest of Trusted ... | investools advisory free digest trusted invest... |
| `<1512159.1075863666797.JavaMail.evans@thyme>` | ('richard.sanders@enron.com') | ('richard.sanders@enron.com') | 2000-09-20 19:07:00 | ----- Forwarded by Richard B Sanders/HOU/ECT o... | forwarded richard b sanders hou ect pm justin ... |
| `<26118676.1075862176383.JavaMail.evans@thyme>` | ('m..love@enron.com') | ('m..love@enron.com') | 2001-10-30 16:15:17 | hey you are not wearing your target purple shi... | hey wearing target purple shirt today mine wan... |
| `<10369289.1075860831062.JavaMail.evans@thyme>` | ('leslie.milosevich@kp.org') | ('leslie.milosevich@kp.org') | 2002-01-30 17:54:18 | Leslie Milosevich\n1042 Santa Clara Avenue\nAl... | leslie milosevich santa clara avenue alameda c... |
| `<26728895.1075860815046.JavaMail.evans@thyme>` | ('rtwait@graphicaljazz.com') | ('rtwait@graphicaljazz.com') | 2002-01-30 19:36:01 | Rini Twait\n1010 E 5th Ave\nLongmont, CO 80501... | rini twait e th ave longmont co rtwait graphic... |


By joining the search terms with the 'or' sign, i.e. `|`, you can search on a multitude of terms in your dataset very easily. Let's now create a flag from this which you can use as a feature in a machine learning model.
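
Because the joined terms form a regular expression, a term that contains regex metacharacters such as `.` or `$` would not be matched literally. One way to guard against this, sketched here with invented terms and invented text, is to escape each term with `re.escape` before joining:

```python
import re
import pandas as pd

# Hypothetical search terms; the second contains regex metacharacters
terms = ['sell stock', 'u.s. $ account']

# Escape each term so that characters such as '.' and '$' are matched literally
pattern = '|'.join(re.escape(term) for term in terms)

emails = pd.Series([
    'we should sell stock now',
    'wire it to the u.s. $ account',
    'wire it to the uxsx account',
    'nothing to see here',
])

hits = emails.str.contains(pattern, na=False)
print(hits.tolist())
```

For the plain-word terms in `searchfor` the escaping makes no difference, so the exercise code works as it stands.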

### 1.3 Creating a flag
This time you are going to **create an actual flag** variable that gives a **1 when the emails get a hit on the search terms of interest**, and 0 otherwise. This is the last step you need to make in order to actually use the text data content as a feature in a machine learning model, or as an actual flag on top of model results. You can continue working with the dataframe `df` containing the emails, and the `searchfor` list is the one defined in the last exercise.

In [12]:
# Create flag variable where the emails match the searchfor terms
df['flag'] = np.where((df['clean_content'].str.contains('|'.join(searchfor)) == True), 1, 0)

# Count the values of the flag variable
count = df['flag'].value_counts()
count

0    1776
1     314
Name: flag, dtype: int64

You have now managed to search for a list of strings in several lines of text data. These skills come in handy when you want to flag certain words based on what you discovered in your topic model, or when you know beforehand what you want to search for. In the next exercises you're going to learn how to clean text data and to create your own topic model to further look for indications of fraud in your text data.
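
An alternative to the `np.where(... == True, 1, 0)` pattern is to pass `na=False` to `str.contains` and cast the boolean mask to integers. The sketch below uses a small invented dataframe (`df_toy`, `searchfor_toy`) rather than the Enron data.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration
df_toy = pd.DataFrame({'clean_content': ['sell stock now', 'see you at lunch', np.nan]})
searchfor_toy = ['enron stock', 'sell stock']

# na=False maps missing text to False, so the cast to int is safe
hits = df_toy['clean_content'].str.contains('|'.join(searchfor_toy), na=False)
df_toy['flag'] = hits.astype(int)

print(df_toy['flag'].value_counts().to_dict())
```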

## 2. Text mining to detect fraud
### Cleaning your text data
Must-dos when working with textual data:
1. Tokenization
2. Remove all stopwords
3. Lemmatize your words
4. Stem your words

### Go from this...
![][28-from]

### To this...
![][29-to]

### Data preprocessing part 1
```Python
# 1. Tokenization (shown for a single email; apply per row for the full dataframe)
import re
from nltk import word_tokenize
text = df['email_body'].iloc[0]
text = text.rstrip()
text = re.sub(r'[^a-zA-Z]', ' ', text)   # replace non-letters with spaces
tokens = word_tokenize(text.lower())

# 2. Remove all stopwords and punctuation
from nltk.corpus import stopwords
import string
exclude = set(string.punctuation)
stop = set(stopwords.words('english'))
stop_free = " ".join([word for word in tokens if((word not in stop) and (not word.isdigit()))])
punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
```

### Data preprocessing part 2
```Python
# 3. Lemmatize words
from nltk.stem.wordnet import WordNetLemmatizer
lemma = WordNetLemmatizer()
normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())

# 4. Stem words
from nltk.stem.porter import PorterStemmer
porter= PorterStemmer()
cleaned_text = [porter.stem(token) for token in normalized.split()]
print(cleaned_text)
['philip','going','street','curious','hear','perspective','may','wish',
'offer','trading','floor','enron','stock','lower','joined','company',
'business','school','imagine','quite','happy','people','day','relate',
'somewhat','stock','around','fact','broke','day','ago','knowing',
'imagine','letting','event','get','much','taken','similar',
'problem','hope','everything','else','going','well','family','knee',
'surgery','yet','give','call','chance','later']
```

[28-from]:_Docs/28-from.png
[29-to]:_Docs/29-to.png

### 2.1 Removing stopwords
In the following exercises you're going to **clean the Enron emails**, in order to be able to use the data in a topic model. Text cleaning can be challenging, so you'll learn some steps to do this well. The dataframe containing the emails `df` is available. In a first step you need to **define the list of stopwords and punctuations** that are to be removed in the next exercise from the text data. Let's give it a try.

In [13]:
# Import nltk packages and string 
from nltk.corpus import stopwords
import string

# Define stopwords to exclude
stop = set(stopwords.words('english'))
stop.update(("to","cc","subject","http","from","sent", "ect", "u", "fwd", "www", "com"))

# Define punctuation to exclude
exclude = set(string.punctuation)

### 2.2 Cleaning text data
Now that you've defined the **stopwords and punctuation**, let's use these to **clean the Enron emails** in the dataframe `df` further. The stopwords and punctuation characters are available as `stop` and `exclude`. There are a few more steps to take before you have cleaned data, such as **lemmatizing and stemming the words**. The verbs in the email data are already stemmed, and the lemmatization is done for you in this exercise.

In [36]:
# Import the lemmatizer from nltk
from nltk.stem.wordnet import WordNetLemmatizer
lemma = WordNetLemmatizer()

# Define word cleaning function
def clean(text, stop):
    text = text.rstrip()
    stop_free = " ".join([i for i in text.lower().split() if((i not in stop) and (not i.isdigit()))])
    punc_free = ''.join(i for i in stop_free if i not in exclude)
    normalized = " ".join(lemma.lemmatize(i) for i in punc_free.split())      
    return normalized

# Clean the emails in df and print results
text_clean=[]
for text in df['clean_content'][:5]:
    if type(text) != str:
        continue
    text_clean.append(clean(text, stop).split())    
print(text_clean)

[['investools', 'advisory', 'free', 'digest', 'trusted', 'investment', 'advice', 'unsubscribe', 'free', 'newsletter', 'please', 'see', 'issue', 'fried', 'sell', 'stock', 'gain', 'month', 'km', 'rowe', 'january', 'index', 'confirms', 'bull', 'market', 'aloy', 'small', 'cap', 'advisor', 'earns', 'lbix', 'compounding', 'return', 'pine', 'tree', 'pcl', 'undervalued', 'high', 'yield', 'bank', 'put', 'customer', 'first', 'aso', 'word', 'sponsor', 'top', 'wall', 'street', 'watcher', 'ben', 'zacks', 'year', 'year', 'gain', 'moving', 'best', 'brightest', 'wall', 'street', 'big', 'money', 'machine', 'earned', 'ben', 'zacks', 'five', 'year', 'average', 'annual', 'gain', 'start', 'outperforming', 'long', 'term', 'get', 'zacks', 'latest', 'stock', 'buylist', 'free', 'day', 'trial', 'investools', 'c', 'go', 'zaks', 'mtxtu', 'zakstb', 'investools', 'advisory', 'john', 'brobst', 'investools', 'fried', 'sell', 'stock', 'lock', 'month', 'km', 'david', 'fried', 'know', 'stock', 'undervalued', 'company', 

You have now cleaned your data with all the necessary steps: splitting the text into words, removing stopwords and punctuation, and lemmatizing the words. You are ready to run a topic model on this data; in the following exercises you're going to explore how to do that.
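
The core of the cleaning step can also be sketched without NLTK. The version below is a simplified illustration: `toy_stop` is a tiny hand-written stop list (an assumption, not NLTK's English list), `clean_tokens` is a hypothetical helper, and lemmatization is left out.

```python
import string

# Tiny hand-written stop list, an assumption for illustration;
# the exercise above uses NLTK's much larger English list instead
toy_stop = {'the', 'to', 'a', 'is', 'on', 'and', 'cc', 'subject'}
punctuation = set(string.punctuation)

def clean_tokens(text, stop_words):
    """Lowercase, strip punctuation, then drop stopwords and pure digits."""
    # Strip punctuation first so that 'subject:' is recognised as 'subject'
    no_punct = ''.join(ch for ch in text.lower() if ch not in punctuation)
    return [w for w in no_punct.split()
            if w not in stop_words and not w.isdigit()]

print(clean_tokens('Subject: Sell the stock on 2001, please!', toy_stop))
```

Unlike `clean()` above, this sketch removes punctuation before filtering stopwords. With the order used in `clean()`, a raw token such as `subject:` would not match the stopword `subject`; that is harmless for the already cleaned `clean_content` column but can matter on raw text.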

## 3. Topic modeling on fraud

### Topic modelling: discover hidden patterns in text data
1. Discovering topics in text data
2. "What is the text about"
3. Conceptually similar to clustering data
4. Compare topics of fraud cases to non-fraud cases and use as a feature or flag
5. Or.. is there a particular topic in the data that seems to point to fraud?

### Latent Dirichlet Allocation (LDA)
With LDA you obtain:
1. "topics per text item" model (i.e. probabilities)
2. "words per topic" model

Creating your own topic model:
1. Clean your data
2. Create a bag of words with dictionary and corpus
3. Feed dictionary and corpus into the LDA model

![][30-LDA]

### Bag of words: dictionary and corpus
```Python
from gensim import corpora

# Create a dictionary that maps each word to an id and tracks how often it appears
dictionary = corpora.Dictionary(cleaned_emails)

# Filter out words that are too rare or too frequent
dictionary.filter_extremes(no_below=5, keep_n=50000)

# Create corpus
corpus = [dictionary.doc2bow(text) for text in cleaned_emails]
```
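
The filtering step works on document frequency: `no_below=5` drops tokens that occur in fewer than 5 documents, and gensim's default `no_above=0.5` additionally drops tokens that occur in more than half of the documents. The idea can be sketched in plain Python; `filter_by_doc_freq` is a hypothetical helper written for this illustration, not a gensim function, and the documents are invented.

```python
from collections import Counter

def filter_by_doc_freq(docs, no_below=2, no_above=0.5):
    """Keep tokens found in at least `no_below` documents and in no more
    than a fraction `no_above` of all documents."""
    # Count each token once per document (document frequency)
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return {tok for tok, n in doc_freq.items() if no_below <= n <= max_docs}

docs = [
    ['enron', 'stock', 'sell'],
    ['enron', 'stock', 'bonus'],
    ['enron', 'meeting'],
    ['enron', 'lunch'],
]
kept = filter_by_doc_freq(docs, no_below=2, no_above=0.5)
print(sorted(kept))
```

In this toy corpus `enron` appears in every document and is therefore removed as too frequent, which mirrors why a word that occurs in nearly every Enron email adds little to a topic model.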

### Latent Dirichlet Allocation (LDA) with gensim
```Python
import gensim

# Define the LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=3,
                                           id2word=dictionary, passes=15)

# Print the three topics from the model with top words
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)
(0, '0.029*"email" + 0.016*"send" + 0.016*"results" + 0.016*"invoice"')
(1, '0.026*"price" + 0.026*"work" + 0.026*"management" + 0.026*"sell"')
(2, '0.029*"distribute" + 0.029*"contact" + 0.016*"supply" + 0.016*"fast"')
```

### To learn more about LDA, see:
- DataCamp: [Latent Semantic Analysis using Python][2]
- Towards Data Science: [Topic Modeling in Python: Latent Dirichlet Allocation (LDA)][1]
- Machine Learning Plus: [Topic Modeling with Gensim (Python)][3]

[30-LDA]:_Docs/30-LDA.png
[1]: https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0
[2]: https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python
[3]: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

### 3.1 Create dictionary and corpus
To run an LDA topic model, you first need to **define your dictionary and corpus**, as those need to go into the model. You're going to continue working with the text data you cleaned in the previous exercises. That means `text_clean` is already available, and you'll use it to create your dictionary and corpus.

This exercise will take a little longer to execute than usual.

In [39]:
# Import the packages
import gensim
from gensim import corpora

# Define the dictionary
dictionary = corpora.Dictionary(text_clean)

# Define the corpus 
corpus = [dictionary.doc2bow(text) for text in text_clean]

# Print corpus and dictionary
print(corpus)
print(dictionary)

[[(0, 2), (1, 1), (2, 1), (3, 1), (4, 1), (5, 6), (6, 1), (7, 2), (8, 4), (9, 1), (10, 1), (11, 3), (12, 2), (13, 1), (14, 5), (15, 3), (16, 1), (17, 3), (18, 1), (19, 1), (20, 1), (21, 5), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 3), (31, 3), (32, 1), (33, 3), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 2), (42, 1), (43, 1), (44, 2), (45, 1), (46, 1), (47, 1), (48, 1), (49, 4), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 4), (56, 1), (57, 4), (58, 9), (59, 5), (60, 1), (61, 8), (62, 1), (63, 1), (64, 2), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (75, 1), (76, 1), (77, 2), (78, 1), (79, 1), (80, 1), (81, 12), (82, 2), (83, 2), (84, 1), (85, 1), (86, 3), (87, 1), (88, 1), (89, 1), (90, 1), (91, 1), (92, 1), (93, 1), (94, 1), (95, 4), (96, 1), (97, 1), (98, 2), (99, 2), (100, 1), (101, 2), (102, 5), (103, 1), (104, 3), (105, 8), (106, 1), (107, 1), (108, 1), (109, 1), (110, 1

These are the two ingredients you need to run your topic model on the Enron emails. You are now ready for the final step: creating your first fraud detection topic model.
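
To make the printed corpus easier to read: each document becomes a list of `(token_id, count)` pairs. The plain-Python sketch below mimics that structure on two invented documents; the ids are assigned in first-seen order here, which need not match the ids gensim assigns.

```python
from collections import Counter

docs = [['sell', 'stock', 'stock'], ['stock', 'bonus']]

# Assign an integer id to every distinct token, in first-seen order
# (gensim's own id assignment may differ)
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one tokenised document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

toy_corpus = [to_bow(doc) for doc in docs]
print(token2id)
print(toy_corpus)
```

Read this way, a pair such as `(81, 12)` in the output above means that the token with id 81 occurs 12 times in that email.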

### 3.2 LDA model
Now it's time to build the **LDA model**. Using the `dictionary` and `corpus`, you are ready to discover which topics are present in the Enron emails. With a quick print of words assigned to the topics, you can do a first exploration about whether there are any obvious topics that jump out. Be mindful that the topic model is **heavy to calculate** so it will take a while to run. Let's give it a try!

In [40]:
# Define the LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=5, id2word=dictionary, passes=5)

# Save the topics and top 5 words
topics = ldamodel.print_topics(num_words=5)

# Print the results
for topic in topics:
    print(topic)

(0, '0.015*"league" + 0.015*"due" + 0.012*"message" + 0.012*"love" + 0.012*"phillip"')
(1, '0.010*"investools" + 0.006*"year" + 0.006*"stock" + 0.006*"free" + 0.006*"go"')
(2, '0.021*"enron" + 0.021*"mg" + 0.015*"plc" + 0.015*"rudolph" + 0.015*"jane"')
(3, '0.031*"enron" + 0.022*"employee" + 0.019*"company" + 0.018*"million" + 0.014*"fund"')
(4, '0.018*"investools" + 0.011*"stock" + 0.010*"go" + 0.010*"see" + 0.010*"company"')


You have now successfully created your first topic model on the Enron email data. However, the print of words doesn't really give you enough information to find a topic that might lead you to signs of fraud. You'll therefore need to closely inspect the model results in order to be able to detect anything that can be related to fraud in your data. You'll learn more about this in the next section.

## 4. Flagging fraud based on topics
### Using your LDA model results for fraud detection
1. Are there any suspicious topics? (no labels)
2. Are the topics in fraud and non-fraud cases similar? (with labels)
3. Are fraud cases associated more with certain topics? (with labels)
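
When labels are available, the last question can be explored with a simple cross-tabulation of dominant topic against the fraud label. The sketch below uses an invented dataframe; the column name `is_fraud` and all values are assumptions for illustration only.

```python
import pandas as pd

# Invented example: dominant topic per document plus a known fraud label
labelled = pd.DataFrame({
    'Dominant_Topic': [0, 3, 3, 1, 3, 0],
    'is_fraud':       [0, 1, 1, 0, 0, 0],
})

# Share of fraud and non-fraud documents within each dominant topic
topic_vs_fraud = pd.crosstab(labelled['Dominant_Topic'], labelled['is_fraud'],
                             normalize='index')
print(topic_vs_fraud)
```

A topic whose row shows a clearly higher fraud share than the others is a candidate for closer inspection, not proof of fraud in itself.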

### To understand topics, you need to visualize

- `pyLDAvis.gensim` only works in Jupyter notebooks (pyLDAvis 3.3 and later rename the module to `pyLDAvis.gensim_models`)

```Python
import pyLDAvis.gensim
lda_display = pyLDAvis.gensim.prepare(ldamodel, corpus,
                                      dictionary, sort_topics=False)
pyLDAvis.display(lda_display)
```

### Inspecting how topics differ

![][31-topics]

### Assign topics to your original data
```Python
def get_topic_details(ldamodel, corpus):
    rows = []
    for doc_topics in ldamodel[corpus]:
        # Sort by probability so that the first entry is the dominant topic
        topic_num, prop_topic = sorted(doc_topics, key=lambda x: x[1], reverse=True)[0]
        # DataFrame.append was removed in pandas 2.0, so collect rows in a list
        rows.append([float(topic_num), prop_topic])
    return pd.DataFrame(rows, columns=['Dominant_Topic', '% Score'])

contents = pd.DataFrame({'Original text': text_clean})
topic_details = pd.concat([get_topic_details(ldamodel, corpus), contents], axis=1)
topic_details.head()

   Dominant_Topic   % Score  Original text
0             0.0  0.989108  [investools, advisory, free, ...
1             0.0  0.993513  [forwarded, richard, b, ...
2             1.0  0.964858  [hey, wearing, target, purple, ...
3             0.0  0.989241  [leslie, milosevich, santa, clara, ...
```
[31-topics]:_Docs/31-topics.png

### 4.1 Interpreting the topic model
Below are visualisation results from the pyLDAvis library. Have a look at topics 1 and 3 from the LDA model on the Enron email data. Which one would you research further for fraud detection purposes, and why?

- Topic 1:
![][32-pyLDAvis_topic1]

- Topic 3:
![][33-pyLDAvis_topic3]

**Possible Answers**
- [x] Topic 1.
> **Correct:** Topic 1 seems to discuss the employee share option program, and seems to point to internal conversation (with "please, may, know" etc), so this is more likely to be related to the internal accounting fraud and trading stock with insider knowledge. *Topic 3 seems to be more related to general news around Enron.*
- [ ] Topic 3.
- [ ] None of these topics seem related to fraud.

[32-pyLDAvis_topic1]:_Docs/32-pyLDAvis_topic1.png
[33-pyLDAvis_topic3]:_Docs/33-pyLDAvis_topic3.png

### 4.2 Finding fraudsters based on topic
In this exercise you're going to **link the results** from the topic model **back to your original data**. You have now learned that you want to **flag** everything related to **topic 3**. (Topic numbers are not stable across LDA runs, so always check which index the topic of interest has in your own model; in the run shown here it is topic 3.) As you will see, this is actually not that straightforward. You'll be given the function `get_topic_details()`, which takes the arguments `ldamodel` and `corpus` and retrieves the dominant topic for each line of text. With that function, you can append the results back to your original data. If you want to learn in more detail how to work with the model results, which is beyond the scope of this course, you're highly encouraged to read this article: [Topic Modeling with Gensim (Python)][3].

Available for you are the `dictionary` and `corpus`, the text data `text_clean` as well as your model results `ldamodel`. Also defined is `get_topic_details()`.
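
For each document, `ldamodel[corpus]` yields a list of `(topic_id, probability)` pairs, and `get_topic_details()` keeps the pair with the highest probability. That selection step can be sketched on its own with hand-written pairs instead of a trained model; the numbers and the helper `dominant_topic` are invented for illustration.

```python
# Invented per-document topic distributions, shaped like gensim's output
doc_topics = [
    [(0, 0.05), (3, 0.90), (4, 0.05)],
    [(1, 0.60), (2, 0.40)],
]

def dominant_topic(topic_probs):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topic_probs, key=lambda pair: pair[1])

dominant = [dominant_topic(row) for row in doc_topics]
print(dominant)
```

The second document illustrates why the score is worth keeping next to the topic number: a dominant topic at 0.60 is a much weaker assignment than one at 0.90.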

[1]:https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

In [63]:
def get_topic_details(ldamodel, corpus):
    rows = []
    for doc_topics in ldamodel[corpus]:
        # Sort by probability so that the first entry is the dominant topic
        topic_num, prop_topic = sorted(doc_topics, key=lambda x: x[1], reverse=True)[0]
        wp = ldamodel.show_topic(topic_num)
        topic_keywords = ', '.join(str(word) for word, _ in wp)
        # DataFrame.append was removed in pandas 2.0, so collect rows in a list
        rows.append([float(topic_num), prop_topic, topic_keywords])
    return pd.DataFrame(rows, columns=['Dominant_Topic', '% Score', 'Topic_Keywords'])

In [64]:
# Run get_topic_details function and check the results
print(get_topic_details(ldamodel, corpus))

   Dominant_Topic   % Score                                     Topic_Keywords
0             4.0  0.999229  investools, stock, go, see, company, say, year...
1             2.0  0.998953  enron, mg, plc, rudolph, jane, lon, allen, wol...
2             0.0  0.994390  league, due, message, love, phillip, david, bu...
3             3.0  0.993535  enron, employee, company, million, fund, consu...
4             3.0  0.993427  enron, employee, company, million, fund, consu...


In [66]:
# Add original text to topic details in a dataframe
contents = pd.DataFrame({'Original text': text_clean})
topic_details = pd.concat([get_topic_details(ldamodel, corpus), contents], axis=1)
topic_details.head()

|  | Dominant_Topic | % Score | Topic_Keywords | Original text |
|---|---|---|---|---|
| 0 | 4.0 | 0.999229 | investools, stock, go, see, company, say, year... | [investools, advisory, free, digest, trusted, ... |
| 1 | 2.0 | 0.998953 | enron, mg, plc, rudolph, jane, lon, allen, wol... | [forwarded, richard, b, sander, hou, pm, justi... |
| 2 | 0.0 | 0.99439 | league, due, message, love, phillip, david, bu... | [hey, wearing, target, purple, shirt, today, m... |
| 3 | 3.0 | 0.993535 | enron, employee, company, million, fund, consu... | [leslie, milosevich, santa, clara, avenue, ala... |
| 4 | 3.0 | 0.993427 | enron, employee, company, million, fund, consu... | [rini, twait, e, th, ave, longmont, co, rtwait... |


In [68]:
# Create flag for text highest associated with topic 3
topic_details['flag'] = np.where((topic_details['Dominant_Topic'] == 3.0), 1, 0)
topic_details.head(10)

|  | Dominant_Topic | % Score | Topic_Keywords | Original text | flag |
|---|---|---|---|---|---|
| 0 | 4.0 | 0.999229 | investools, stock, go, see, company, say, year... | [investools, advisory, free, digest, trusted, ... | 0 |
| 1 | 2.0 | 0.998953 | enron, mg, plc, rudolph, jane, lon, allen, wol... | [forwarded, richard, b, sander, hou, pm, justi... | 0 |
| 2 | 0.0 | 0.99439 | league, due, message, love, phillip, david, bu... | [hey, wearing, target, purple, shirt, today, m... | 0 |
| 3 | 3.0 | 0.993535 | enron, employee, company, million, fund, consu... | [leslie, milosevich, santa, clara, avenue, ala... | 1 |
| 4 | 3.0 | 0.993427 | enron, employee, company, million, fund, consu... | [rini, twait, e, th, ave, longmont, co, rtwait... | 1 |


**You have now flagged all data that is most strongly associated with topic 3, which seems to cover internal conversation about Enron stock options**. You are a true detective. With these exercises you have demonstrated that text mining and topic modeling can be a powerful tool for fraud detection.

## 5. Fraud detection in Python Recap

### 1. Working with imbalanced data
- Worked with highly imbalanced fraud data
- Learned how to resample your data
- Learned about different resampling methods

### 2. Fraud detection with labeled data
- Refreshed supervised learning techniques to detect fraud
- Learned how to get reliable performance metrics and worked with the precision recall trade-off
- Explored how to optimise your model parameters to handle fraud data
- Applied ensemble methods to fraud detection

### 3. Fraud detection without labels
- Learned about the importance of segmentation
- Refreshed your knowledge on clustering methods
- Learned how to detect fraud using outliers and small clusters with K-means clustering
- Applied a DB-scan clustering model for fraud detection

### 4. Text mining for fraud detection
- Know how to augment fraud detection analysis with text mining techniques
- Applied word searches to flag use of certain words, and learned how to apply topic modelling for fraud detection
- Learned how to effectively clean messy text data

### 5. Further learning for fraud detection
- Network analysis to detect fraud
- Different supervised and unsupervised learning techniques (e.g. Neural Networks)
- Working with very large data