## **Fraud Detection in Python**
A typical organization loses an estimated 5% of its yearly revenue to fraud. In this course, learn to fight fraud by using data. Apply supervised learning algorithms to detect fraudulent behavior based upon past fraud, and use unsupervised learning methods to discover new types of fraud activities. 

Fraudulent transactions are rare compared to the norm.  As such, learn to properly classify imbalanced datasets.

This notebook provides technical and theoretical insights and demonstrates how to implement fraud detection models. Finally, get tips and advice from real-life experience to help prevent common mistakes in fraud analytics.

**Imports**

In [1]:
import warnings

warnings.filterwarnings("ignore")
warnings.simplefilter("ignore")

In [4]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from pprint import pprint as pp
import csv
from pathlib import Path
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

from sklearn.metrics import (
    r2_score,
    classification_report,
    confusion_matrix,
    accuracy_score,
    roc_auc_score,
    roc_curve,
    precision_recall_curve,
    average_precision_score,
)

import gensim
from gensim import corpora

**Pandas Configuration Options**

In [5]:
pd.set_option("display.max_columns", 700)
pd.set_option("display.max_rows", 400)
pd.set_option("display.min_rows", 10)
pd.set_option("display.expand_frame_repr", True)

**Data Files Location**

* Most data files for the exercises can be found on [this site](https://www.datacamp.com/courses/fraud-detection-in-python):
    * [Chapter 4](https://assets.datacamp.com/production/repositories/2162/datasets/94f2356652dc9ea8f0654b5e9c29645115b6e77f/chapter_4.zip)

**Data File Objects**

In [6]:
data = Path.cwd() / "data"

ch4 = data / "chapter_4"
enron_emails_clean_file = ch4 / "enron_emails_clean.csv"
cleantext_file = ch4 / "cleantext.pickle"
corpus_file = ch4 / "corpus.pickle"
dict_file = ch4 / "dict.pickle"
ldamodel_file = ch4 / "ldamodel.pickle"

## Introduction to fraud detection

* Types:
    * Insurance
    * Credit card
    * Identity theft
    * Money laundering
    * Tax evasion
    * Healthcare
    * Product warranty
* e-commerce businesses must continuously assess the legitimacy of client transactions
* Detecting fraud is challenging:
    * Uncommon; < 0.01% of transactions
    * Attempts are made to conceal fraud
    * Behavior evolves
    * Fraudulent activities perpetrated by networks - organized crime
* Fraud detection requires training an algorithm to distinguish concealed fraudulent observations from normal observations
* Fraud analytics teams:
    * Often use rule-based systems, built on manually set thresholds and experience
    * Check the news
    * Receive external lists of fraudulent accounts and names
        * e.g. suspicious names, or an external hit list from the police to cross-check against the client base
    * Sometimes use machine learning algorithms to detect fraud or suspicious behavior
        * Existing sources can be used as inputs into the ML model
        * Verify the veracity of rules based labels
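
The rarity of fraud is also why plain accuracy is a misleading metric. The following sketch uses made-up numbers (one million transactions with a 0.01% fraud rate, the upper bound quoted above) and a hypothetical "model" that never predicts fraud:

```python
import numpy as np

n_transactions = 1_000_000
n_fraud = 100  # 0.01% of the transactions (illustrative assumption)

# Ground truth: the first 100 transactions are fraudulent
y_true = np.zeros(n_transactions, dtype=int)
y_true[:n_fraud] = 1

# A useless "model" that always predicts "not fraud"
y_pred = np.zeros(n_transactions, dtype=int)

accuracy = (y_true == y_pred).mean()
recall = y_pred[y_true == 1].mean()  # share of fraud cases that were caught
print(f"accuracy={accuracy:.4f}, recall={recall:.1f}")  # accuracy=0.9999, recall=0.0
```

The accuracy looks excellent even though not a single fraud case is caught, which is why metrics such as recall and precision matter for imbalanced data.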

# Fraud detection using text

Use text data, text mining and topic modeling to detect fraudulent behavior.

## Using text data

* Types of useful text data:
    1. Emails from employees and/or clients
    1. Transaction descriptions
    1. Employee notes
    1. Insurance claim form description box
    1. Recorded telephone conversations
* Text mining techniques for fraud detection
    1. Word search
    1. Sentiment analysis
    1. Word frequencies and topic analysis
    1. Style
* Word search for fraud detection
    * Flagging suspicious words:
        1. Simple, straightforward and easy to explain
        1. Match results can be used as a filter on top of machine learning model
        1. Match results can be used as a feature in a machine learning model

#### Word counts to flag fraud with pandas

```python
# Use a string method to find a phrase (na=False treats missing text as "no match")
df['email_body'].str.contains('money laundering', na=False)

# Select data that matches
df.loc[df['email_body'].str.contains('money laundering', na=False)]

# Create a list of words to search for
list_of_words = ['police', 'money laundering']
df.loc[df['email_body'].str.contains('|'.join(list_of_words), na=False)]

# Create a fraud flag: 1 where any term matches, 0 otherwise
df['flag'] = np.where(df['email_body'].str.contains('|'.join(list_of_words), na=False), 1, 0)
```
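
A minimal, self-contained sketch of the same pattern on a toy dataframe. The column name `email_body` follows the snippet above, and the sample rows are invented for illustration. Passing `na=False` maps missing bodies to `False`, so the mask stays strictly boolean:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, including one missing email body
df_toy = pd.DataFrame(
    {
        "email_body": [
            "please call the police about this transfer",
            "lunch at noon?",
            None,
            "this looks like money laundering to me",
        ]
    }
)

list_of_words = ["police", "money laundering"]

# Join the terms with "|" so the pattern matches any of them
pattern = "|".join(list_of_words)

# na=False turns missing values into False instead of NaN
mask = df_toy["email_body"].str.contains(pattern, na=False)

# Convert the boolean mask into a 0/1 flag column
df_toy["flag"] = np.where(mask, 1, 0)
print(df_toy["flag"].tolist())  # [1, 0, 0, 1]
```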

### Word search with dataframes

In this section, you're going to work with text data, containing emails from Enron employees. The **Enron scandal** is a famous fraud case. Enron employees covered up the bad financial position of the company, thereby keeping the stock price artificially high. Enron employees sold their own stock options, and when the truth came out, Enron investors were left with nothing. The goal is to find all emails that mention specific words, such as "sell enron stock".

By using string operations on dataframes, you can easily sift through messy email data and create flags based on word-hits. The Enron email data has been put into a dataframe called `df` so let's search for suspicious terms. Feel free to explore `df` in the Console before getting started.

**Instructions 1/2**

* Check the head of `df` in the console and look for any emails mentioning 'sell enron stock'.

In [7]:
df = pd.read_csv(enron_emails_clean_file)

In [12]:
df

Unnamed: 0,Message-ID,From,To,Date,content,clean_content
0,<8345058.1075840404046.JavaMail.evans@thyme>,('advdfeedback@investools.com'),('advdfeedback@investools.com'),2002-01-29 23:20:55,INVESTools Advisory\nA Free Digest of Trusted ...,investools advisory free digest trusted invest...
1,<1512159.1075863666797.JavaMail.evans@thyme>,('richard.sanders@enron.com'),('richard.sanders@enron.com'),2000-09-20 19:07:00,----- Forwarded by Richard B Sanders/HOU/ECT o...,forwarded richard b sanders hou ect pm justin ...
2,<26118676.1075862176383.JavaMail.evans@thyme>,('m..love@enron.com'),('m..love@enron.com'),2001-10-30 16:15:17,hey you are not wearing your target purple shi...,hey wearing target purple shirt today mine wan...
3,<10369289.1075860831062.JavaMail.evans@thyme>,('leslie.milosevich@kp.org'),('leslie.milosevich@kp.org'),2002-01-30 17:54:18,Leslie Milosevich\n1042 Santa Clara Avenue\nAl...,leslie milosevich santa clara avenue alameda c...
4,<26728895.1075860815046.JavaMail.evans@thyme>,('rtwait@graphicaljazz.com'),('rtwait@graphicaljazz.com'),2002-01-30 19:36:01,"Rini Twait\n1010 E 5th Ave\nLongmont, CO 80501...",rini twait e th ave longmont co rtwait graphic...
...,...,...,...,...,...,...
2085,<19039088.1075851547721.JavaMail.evans@thyme>,('andy.zipper@enron.com'),('andy.zipper@enron.com'),2001-10-22 14:00:17,"i bot 1,000/d at 3.175 apr/oct02. put it again...",bot apr oct put digital gas x thanks
2086,<6813352.1075842016977.JavaMail.evans@thyme>,('andy.zipper@enron.com'),('andy.zipper@enron.com'),2002-01-25 17:39:38,I'm okay. How are you ?,okay
2087,<4833106.1075842022184.JavaMail.evans@thyme>,('tradersummary@syncrasy.com'),('tradersummary@syncrasy.com'),2002-02-01 16:15:17,\n[IMAGE]=09\n\n\n[IMAGE] [IMAGE][IMAGE][IMAGE...,image image image image image image image imag...
2088,<3550151.1075842023814.JavaMail.evans@thyme>,('lmrig@qwest.net'),('lmrig@qwest.net'),2002-01-29 02:01:00,\n\nTransmission Expansion and Systems in Tran...,transmission expansion systems transition conf...


In [13]:
mask = df["clean_content"].str.contains("sell enron stock", na=False)

**Instructions 2/2**

* Locate the data in `df` that meets the condition we created earlier.

In [14]:
# Select the data from df using the mask
df[mask]

Unnamed: 0,Message-ID,From,To,Date,content,clean_content
154,<6336501.1075841154311.JavaMail.evans@thyme>,('sarah.palmer@enron.com'),('sarah.palmer@enron.com'),2002-02-01 14:53:35,\nJoint Venture: A 1997 Enron Meeting Belies O...,joint venture enron meeting belies officers cl...


**You see that searching for particular string values in a dataframe can be relatively easy, and allows you to include textual data into your model or analysis. You can use this word search as an additional flag, or as a feature in your fraud detection model. Let's look at how to filter the data using multiple search terms.**

### Using list of terms

Oftentimes you don't want to search on just one term. You probably can create a full **"fraud dictionary"** of terms that could potentially **flag fraudulent clients** and/or transactions. Fraud analysts often will have an idea what should be in such a dictionary. In this section, you're going to **flag a multitude of terms**, and in the next section you'll create a new flag variable out of it. The 'flag' can be used either directly in a machine learning model as a feature, or as an additional filter on top of your machine learning model results. Let's first use a list of terms to filter our data on. The dataframe containing the cleaned emails is again available as `df`.

**Instructions**

* Create a list to search for including 'enron stock', 'sell stock', 'stock bonus', and 'sell enron stock'.
* Join the string terms in the search conditions.
* Filter data using the emails that match with the list defined under `searchfor`.

In [15]:
# Create a list of terms to search for
searchfor = ["enron stock", "sell stock", "stock bonus", "sell enron stock"]

# Filter cleaned emails on searchfor list and select from df
filtered_emails = df[df.clean_content.str.contains("|".join(searchfor), na=False)]
filtered_emails.head()

Unnamed: 0,Message-ID,From,To,Date,content,clean_content
0,<8345058.1075840404046.JavaMail.evans@thyme>,('advdfeedback@investools.com'),('advdfeedback@investools.com'),2002-01-29 23:20:55,INVESTools Advisory\nA Free Digest of Trusted ...,investools advisory free digest trusted invest...
1,<1512159.1075863666797.JavaMail.evans@thyme>,('richard.sanders@enron.com'),('richard.sanders@enron.com'),2000-09-20 19:07:00,----- Forwarded by Richard B Sanders/HOU/ECT o...,forwarded richard b sanders hou ect pm justin ...
2,<26118676.1075862176383.JavaMail.evans@thyme>,('m..love@enron.com'),('m..love@enron.com'),2001-10-30 16:15:17,hey you are not wearing your target purple shi...,hey wearing target purple shirt today mine wan...
3,<10369289.1075860831062.JavaMail.evans@thyme>,('leslie.milosevich@kp.org'),('leslie.milosevich@kp.org'),2002-01-30 17:54:18,Leslie Milosevich\n1042 Santa Clara Avenue\nAl...,leslie milosevich santa clara avenue alameda c...
4,<26728895.1075860815046.JavaMail.evans@thyme>,('rtwait@graphicaljazz.com'),('rtwait@graphicaljazz.com'),2002-01-30 19:36:01,"Rini Twait\n1010 E 5th Ave\nLongmont, CO 80501...",rini twait e th ave longmont co rtwait graphic...


**By joining the search terms with the 'or' sign, i.e. |, you can search on a multitude of terms in your dataset very easily. Let's now create a flag from this which you can use as a feature in a machine learning model.**
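
One caveat: `str.contains` treats the joined string as a regular expression by default, so a term containing a character such as `.` can match more than intended, and matching is case-sensitive and substring-based. The sketch below shows one possible way to tighten the search with `re.escape`, word boundaries and `case=False`. The terms and sample emails are invented, and the word-boundary approach assumes every term starts and ends with a letter or digit:

```python
import re

import pandas as pd

# Hypothetical emails
emails = pd.Series(
    [
        "We should SELL STOCK before the news breaks",
        "the resell stockroom inventory is ready",
        "mail me at ken@enron.com",
        "typo: enronxcom",
        None,
    ]
)

terms = ["sell stock", "enron.com"]

# Naive pattern: case-sensitive, substring-based, "." matches any character
naive = emails.str.contains("|".join(terms), na=False)

# Stricter pattern: escape metacharacters and require whole-term matches
pattern = r"\b(?:" + "|".join(re.escape(term) for term in terms) + r")\b"
strict = emails.str.contains(pattern, case=False, na=False)

print(naive.tolist())   # [False, True, True, True, False]
print(strict.tolist())  # [True, False, True, False, False]
```

The naive search misses the upper-case hit and raises two false alarms, while the stricter pattern flags only the two emails that actually contain the terms.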

### Creating a flag

This time you are going to **create an actual flag** variable that gives a **1 when the emails get a hit** on the search terms of interest, and 0 otherwise. This is the last step you need to make in order to actually use the text data content as a feature in a machine learning model, or as an actual flag on top of model results. You can continue working with the dataframe `df` containing the emails, and the `searchfor` list is the one defined in the last section.

**Instructions**

* Use a numpy where condition to flag '1' where the cleaned email contains words on the `searchfor` list and 0 otherwise.
* Join the words on the `searchfor` list with an "or" indicator.
* Count the values of the newly created flag variable.

In [16]:
# Create flag variable where the emails match the searchfor terms
df["flag"] = np.where(
    df["clean_content"].str.contains("|".join(searchfor), na=False), 1, 0
)

# Count the values of the flag variable
count = df["flag"].value_counts()
print(count)

0    1776
1     314
Name: flag, dtype: int64


**You have now managed to search for a list of strings in several lines of text data. These skills come in handy when you want to flag certain words based on what you discovered in your topic model, or when you know beforehand what you want to search for. In the next sections you're going to learn how to clean text data and to create your own topic model to further look for indications of fraud in your text data.**
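
As a rough illustration of the two uses mentioned above, the sketch below combines a keyword flag with the output of some classifier. Both arrays are hypothetical, and the combination rules are only one possible interpretation of "filter" and "additional flag":

```python
import numpy as np

# Hypothetical outputs for six emails
model_pred = np.array([0, 1, 0, 0, 1, 0])  # predictions from some classifier
word_flag = np.array([1, 1, 0, 0, 0, 1])   # keyword flag from the word search

# As a filter: keep only model hits that also contain a suspicious term
both = model_pred & word_flag

# As an additional flag: investigate anything raised by either source
either = model_pred | word_flag

# As a feature: stack the flag next to other (here, made-up) model inputs
other_feature = np.array([0.2, 0.9, 0.1, 0.3, 0.8, 0.4])
X = np.column_stack([other_feature, word_flag])

print(both.tolist())    # [0, 1, 0, 0, 0, 0]
print(either.tolist())  # [1, 1, 0, 0, 1, 1]
print(X.shape)          # (6, 2)
```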

## Text mining to detect fraud

#### Cleaning your text data

**Must dos when working with textual data:**

1. Tokenization
    * Split the text into sentences and the sentences into words
    * Transform everything to lowercase
    * Remove punctuation
1. Remove all stopwords
1. Lemmatize 
    * Change from third person into first person
    * Change past and future tense verbs to present tense
    * This makes it possible to combine all words that point to the same thing
1. Stem the words
    * Reduce words to their root form
    * e.g. walking and walked to walk

* **Unprocessed Text**
    * ![](https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/fraud_detection/text_df.JPG)
* **Processed Text**
    * ![](https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/fraud_detection/text_processed.JPG)
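
To make the stemming step concrete, here is a small sketch with NLTK's `PorterStemmer`, which needs no corpus download. The word list is arbitrary. Keep in mind that stems are not guaranteed to be dictionary words, which is one reason lemmatization is often applied as well:

```python
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

# Different inflections of the same verb collapse to one root form
words = ["walking", "walked", "walks"]
stems = [porter.stem(word) for word in words]
print(stems)  # ['walk', 'walk', 'walk']
```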

#### Data Preprocessing I

* Tokenizers divide strings into lists of substrings
* The nltk word tokenizer can be used to find the words and punctuation in a string
    * It splits the words on whitespace and separates the punctuation out

```python
from nltk import word_tokenize  # requires the NLTK tokenizer data (e.g. nltk.download("punkt"))
from nltk.corpus import stopwords
import string

# 1. Tokenization (shown for a single email; loop or apply for the whole column)
text = df["email_body"].iloc[0]
text = text.rstrip()  # remove trailing whitespace
text = text.lower()  # replace with lowercase
# Optionally keep letters only: text = re.sub(r'[^a-zA-Z]', ' ', text)
tokens = word_tokenize(text)  # list of words and punctuation marks

# 2. Remove all stopwords, digits and punctuation
exclude = set(string.punctuation)
stop = set(stopwords.words('english'))
stop_free = " ".join(word for word in tokens if (word not in stop) and (not word.isdigit()))
punc_free = "".join(char for char in stop_free if char not in exclude)
```
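
To illustrate what "separating the punctuation out" means without requiring NLTK's tokenizer data, the following rough sketch approximates the behaviour with a regular expression. It is a simplification, not a replacement for `word_tokenize`, and the sample sentence and the tiny stopword set `toy_stop` are invented:

```python
import re
import string

sample = "Hello, please sell the stock today!  "

# Strip trailing whitespace, lowercase, then split into word and punctuation tokens
tokens = re.findall(r"\w+|[^\w\s]", sample.rstrip().lower())
print(tokens)  # ['hello', ',', 'please', 'sell', 'the', 'stock', 'today', '!']

# Drop stopwords, punctuation and pure digits
toy_stop = {"the", "please"}  # hypothetical, much smaller than NLTK's list
exclude = set(string.punctuation)
cleaned = [
    token
    for token in tokens
    if token not in toy_stop and token not in exclude and not token.isdigit()
]
print(cleaned)  # ['hello', 'sell', 'stock', 'today']
```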

#### Data Preprocessing II

```python
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

# Lemmatize words
lemma = WordNetLemmatizer()
normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())

# Stem words
porter = PorterStemmer()
cleaned_text = " ".join(porter.stem(token) for token in normalized.split())
print(cleaned_text)

['philip','going','street','curious','hear','perspective','may','wish',
'offer','trading','floor','enron','stock','lower','joined','company',
'business','school','imagine','quite','happy','people','day','relate',
'somewhat','stock','around','fact','broke','day','ago','knowing',
'imagine','letting','event','get','much','taken','similar',
'problem','hope','everything','else','going','well','family','knee',
'surgery','yet','give','call','chance','later']
```

### Removing stopwords

In the following sections, you're going to **clean the Enron emails**, in order to be able to use the data in a topic model. Text cleaning can be challenging, so you'll learn some steps to do this well. The dataframe containing the emails `df` is available. In a first step you need to **define the list of stopwords and punctuations** that are to be removed in the next section from the text data. Let's give it a try.

**Instructions**

* Import the stopwords from `nltk`.
* Define 'english' words to use as stopwords under the variable `stop`.
* Get the punctuation set from the `string` package and assign it to `exclude`.

In [17]:
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("omw-1.4")

[nltk_data] Downloading package stopwords to /home/masoud/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/masoud/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/masoud/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [18]:
# Define stopwords to exclude
stop = set(stopwords.words("english"))
stop.update(
    (
        "to",
        "cc",
        "subject",
        "http",
        "from",
        "sent",
        "ect",
        "u",
        "fwd",
        "www",
        "com",
        "html",
    )
)

# Define punctuation to exclude
exclude = set(string.punctuation)

### Cleaning text data

Now that you've defined the **stopwords and punctuation**, let's use these to clean the Enron emails in the dataframe `df` further. The sets containing stopwords and punctuation are available under `stop` and `exclude`. There are a few more steps to take before you have cleaned data, such as **"lemmatization"** of words and **stemming the verbs**. The verbs in the email data are already stemmed, and the lemmatization is already done for you in this section.

**Instructions 1/2**

* Use the previously defined variables `stop` and `exclude` to finish off the function: strip trailing whitespace from the text using `rstrip`, and exclude stopwords and punctuation. Finally, lemmatize the words and assign the result to `normalized`.

In [19]:
# Instantiate the lemmatizer imported from nltk
lemma = WordNetLemmatizer()


def clean(text, stop):
    text = str(text).rstrip()
    stop_free = " ".join(
        [i for i in text.lower().split() if ((i not in stop) and (not i.isdigit()))]
    )
    punc_free = "".join(i for i in stop_free if i not in exclude)
    normalized = " ".join(lemma.lemmatize(i) for i in punc_free.split())
    return normalized

**Instructions 2/2**

* Apply the function `clean(text,stop)` on each line of text data in our dataframe, and take the column `df['clean_content']` for this.

In [21]:
# Clean the emails in df and print results
text_clean = []
for text in df["clean_content"]:
    text_clean.append(clean(text, stop).split())

In [26]:
df["clean_content"].iloc[0]

'investools advisory free digest trusted investment advice unsubscribe free newsletter please see issue fried sells stocks gains months km rowe january index confirms bull market aloy small cap advisor earns lbix compounding returns pine trees pcl undervalued high yield bank puts customers first aso word sponsor top wall street watcher ben zacks year year gain moving best brightest wall street big money machines earned ben zacks five year average annual gain start outperforming long term get zacks latest stock buylist free day trial http www investools com c go zaks mtxtu zakstb investools advisory john brobst investools com fried sells stocks locks months km david fried knows stock undervalued company management buys back shares open market latest triumph pocketing impressive gain three short months selling four buyback stocks include gain auto retailer automation incorporated gain digital phone system purveyor inter tel intl fried recent move buy kmart corporation km beleaguered disc

In [22]:
text_clean[0][:10]

['investools',
 'advisory',
 'free',
 'digest',
 'trusted',
 'investment',
 'advice',
 'unsubscribe',
 'free',
 'newsletter']

**You have now cleaned your data with all the necessary steps, including splitting the text into words, removing stopwords and punctuation, and lemmatizing your words, so you are ready to run a topic model on this data. In the following sections you're going to explore how to do that.**

## Topic modeling on fraud

1. Discovering topics in text data
1. "What is the text about"
1. Conceptually similar to clustering data
1. Compare topics of fraud cases to non-fraud cases and use as a feature or flag
1. Or.. is there a particular topic in the data that seems to point to fraud?

#### Latent Dirichlet Allocation (LDA)

* With LDA you obtain:
    * "topics per text item" model (i.e. probabilities)
    * "words per topic" model
* Creating your own topic model:
    * Clean your data
    * Create a bag of words with dictionary and corpus
        * Dictionary: contains the words and word frequencies from the entire text
        * Corpus: word count for each line of text
    * Feed dictionary and corpus into the LDA model
* LDA:
    * ![lda](https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/fraud_detection/lda.JPG)
    1. [LDA2vec: Word Embeddings in Topic Models](https://www.datacamp.com/community/tutorials/lda2vec-topic-model)
    1. see how each word in the dataset is associated with each topic
    1. see how each text item in the data associates with topics (in the form of probabilities)
        1. image on the right 

#### Bag of words: dictionary and corpus

* use the `Dictionary` function in `corpora` to create a `dict` from the text data
    * contains word counts
* filter out words that appear in fewer than 5 emails and keep only the 50000 most frequent words
    * this is a way of cleaning the outlier noise
* create the corpus, which for each email, counts the number of words and the count for each word (`doc2bow`)
* `doc2bow`
    * Document to Bag of Words
    * converts text data into bag-of-words format
    * each row is now a list of words with the associated word count
    
```python
from gensim import corpora

# Create a dictionary that maps each word to an id and tracks how often it appears
dictionary = corpora.Dictionary(cleaned_emails)

# Filter out words that appear in fewer than 5 emails; keep at most 50000 words
dictionary.filter_extremes(no_below=5, keep_n=50000)

# Create corpus
corpus = [dictionary.doc2bow(text) for text in cleaned_emails]
```
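
To make the `(word id, count)` format concrete, here is a pure-Python sketch that mimics the idea behind the dictionary, the `no_below` filter and `doc2bow` on three invented documents. It is not gensim's implementation: ids are simply assigned alphabetically here, so they may differ from what `corpora.Dictionary` would produce, and the helper name `to_bow` is made up:

```python
from collections import Counter

# Hypothetical, already cleaned and tokenized documents
docs = [
    ["sell", "stock", "stock", "bonus"],
    ["meeting", "stock", "price"],
    ["sell", "price", "meeting"],
]

# Document frequency: in how many documents does each token appear?
doc_freq = Counter(token for doc in docs for token in set(doc))

# Keep tokens that occur in at least `no_below` documents
no_below = 2
vocab = sorted(token for token, n_docs in doc_freq.items() if n_docs >= no_below)
token2id = {token: idx for idx, token in enumerate(vocab)}


def to_bow(doc):
    """Convert a list of tokens into sorted (token_id, count) pairs."""
    counts = Counter(token for token in doc if token in token2id)
    return sorted((token2id[token], count) for token, count in counts.items())


corpus_toy = [to_bow(doc) for doc in docs]
print(token2id)    # {'meeting': 0, 'price': 1, 'sell': 2, 'stock': 3}
print(corpus_toy)  # [[(2, 1), (3, 2)], [(0, 1), (1, 1), (3, 1)], [(0, 1), (1, 1), (2, 1)]]
```

The word `bonus` appears in only one document, so it is filtered out and never shows up in the corpus.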

#### Latent Dirichlet Allocation (LDA) with gensim

* Run the LDA model after cleaning the text data and creating the dictionary and corpus
* Pass the corpus and dictionary into the model
* As with K-means, pick the number of topics beforehand, even if there is uncertainty about which topics exist
* The fitted LDA model contains the associated words for each topic, and topic scores per email
* Use `print_topics` to obtain the top words from the topics

```python
import gensim

# Define the LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = 3, 
id2word=dictionary, passes=15)

# Print the three topics from the model with top words
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)

>>> (0, '0.029*"email" + 0.016*"send" + 0.016*"results" + 0.016*"invoice"')
>>> (1, '0.026*"price" + 0.026*"work" + 0.026*"management" + 0.026*"sell"')
>>> (2, '0.029*"distribute" + 0.029*"contact" + 0.016*"supply" + 0.016*"fast"')
```

### Create dictionary and corpus

In order to run an LDA topic model, you first need to **define your dictionary and corpus**, as those need to go into the model. You're going to continue working on the text data that you cleaned in the previous sections. That means that `text_clean` is already available for you to continue working with, and you'll use that to create your dictionary and corpus.

This section will take a little longer to execute than usual.

**Instructions**

* Import the gensim package and corpora from gensim separately.
* Define your dictionary by running the correct function on your clean data `text_clean`.
* Define the corpus by running `doc2bow` on each piece of text in `text_clean`.
* Print your results so you can see what `dictionary` and `corpus` look like.

In [27]:
# Define the dictionary
dictionary = corpora.Dictionary(text_clean)

# Define the corpus
corpus = [dictionary.doc2bow(text) for text in text_clean]

In [29]:
print(dictionary)

Dictionary<33980 unique tokens: ['account', 'accurate', 'acquiring', 'acre', 'address']...>


In [32]:
corpus[0][:10]

[(0, 2),
 (1, 1),
 (2, 1),
 (3, 1),
 (4, 1),
 (5, 6),
 (6, 1),
 (7, 2),
 (8, 4),
 (9, 1)]

**These are the two ingredients you need to run your topic model on the Enron emails. You are now ready for the final step: creating your first fraud detection topic model.**

### LDA model

Now it's time to **build the LDA model**. Using the `dictionary` and `corpus`, you are ready to discover which topics are present in the Enron emails. With a quick print of words assigned to the topics, you can do a first exploration about whether there are any obvious topics that jump out. Be mindful that the topic model is **heavy to calculate** so it will take a while to run. Let's give it a try!

**Instructions**

* Build the LDA model from gensim models, by inserting the `corpus` and `dictionary`.
* Save the 5 topics by running `print_topics` on the model results, and select the top 5 words.

In [33]:
# Define the LDA model
ldamodel = gensim.models.ldamodel.LdaModel(
    corpus, num_topics=5, id2word=dictionary, passes=5
)

# Save the topics and top 5 words
topics = ldamodel.print_topics(num_words=5)

# Print the results
for topic in topics:
    print(topic)

(0, '0.008*"email" + 0.008*"enron" + 0.006*"request" + 0.005*"time" + 0.005*"day"')
(1, '0.037*"image" + 0.036*"td" + 0.026*"net" + 0.025*"money" + 0.024*"tr"')
(2, '0.032*"enron" + 0.009*"company" + 0.006*"market" + 0.005*"energy" + 0.005*"development"')
(3, '0.029*"enron" + 0.012*"hou" + 0.011*"pm" + 0.010*"e" + 0.007*"outage"')
(4, '0.013*"enron" + 0.011*"message" + 0.008*"pm" + 0.008*"original" + 0.007*"please"')


**You have now successfully created your first topic model on the Enron email data. However, the print of words doesn't really give you enough information to find a topic that might lead you to signs of fraud. You'll therefore need to closely inspect the model results in order to be able to detect anything that can be related to fraud in your data.**

## Flagging fraud based on topic

#### Using your LDA model results for fraud detection

1. Are there any suspicious topics? (no labels)
    1. if you don't have labels, first check for the frequency of suspicious words within topics and check whether topics seem to describe the fraudulent behavior
    1. for the Enron email data, a suspicious topic would be one where employees are discussing stock bonuses, selling stock, stock price, and perhaps mentions of accounting or weak financials
    1. Defining suspicious topics does require some pre-knowledge about the fraudulent behavior
    1. If the fraudulent topic is noticeable, *flag all instances that have a high probability for this topic*
1. Are the topics in fraud and non-fraud cases similar? (with labels)
    1. If there are previous cases of fraud, run a topic model on the fraud text only, and on the non-fraud text
    1. Check whether the results are similar
        1. Whether the frequency of the topics is the same in fraud vs non-fraud
1. Are fraud cases associated more with certain topics? (with labels)
    1. Check whether fraud cases have a higher probability score for certain topics
        1. If so, run a topic model on new data and create a flag directly on the instances that score high on those topics
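
When labels are available, the comparison described above can be sketched with a simple cross-tabulation. The dataframe below is entirely hypothetical: `dominant_topic` stands in for the topic assigned to each document and `is_fraud` for a known label:

```python
import pandas as pd

# Hypothetical labelled results: dominant topic per document plus a fraud label
results = pd.DataFrame(
    {
        "dominant_topic": [0, 0, 1, 2, 2, 2, 1, 0],
        "is_fraud": [0, 0, 0, 1, 1, 0, 0, 0],
    }
)

# Share of each topic within the non-fraud (0) and fraud (1) groups
topic_share = pd.crosstab(
    results["is_fraud"], results["dominant_topic"], normalize="index"
)
print(topic_share)
```

In this made-up data every fraud case falls under topic 2, while topic 2 accounts for only one sixth of the non-fraud cases, so topic 2 would be the one to flag on new data.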

#### To understand topics, you need to visualize

```python
import pyLDAvis
from pyLDAvis import gensim_models as gensimvis
pyLDAvis.enable_notebook()
lda_display = gensimvis.prepare(ldamodel, corpus, dictionary, sort_topics=False)
```

![topics](https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/fraud_detection/topics2.jpg)

* Each bubble on the left-hand side represents a topic
* The larger the bubble, the more prevalent that topic is
* Click on each topic to get the details per topic in the right panel
* The words shown are the most important keywords that form the selected topic
* A good topic model will have fairly big, non-overlapping bubbles scattered throughout the chart
* A model with too many topics will typically have many overlapping or small bubbles clustered in one region
* In the case of the model above, there is a slight overlap between topic 2 and 3, which may point to 1 topic too many

In [37]:
import pyLDAvis
from pyLDAvis import gensim_models as gensimvis

pyLDAvis.enable_notebook()

In [38]:
lda_display = gensimvis.prepare(ldamodel, corpus, dictionary, sort_topics=False)

In [39]:
pyLDAvis.display(lda_display)


#### Assign topics to your original data

* One practical application of topic modeling is to determine what topic a given text is about
* To find that, find the topic number that has the highest percentage contribution in that text
* The function, `get_topic_details` shown here, nicely aggregates this information in a presentable table
* Combine the original text data with the output of the `get_topic_details` function
* Each row contains the dominant topic number, the probability score with that topic and the original text data

```python
def get_topic_details(ldamodel, corpus):
    topic_details_df = pd.DataFrame()
    for i, row in enumerate(ldamodel[corpus]):
        row = sorted(row, key=lambda x: (x[1]), reverse=True)
        for j, (topic_num, prop_topic) in enumerate(row):
            if j == 0:  # => dominant topic
                wp = ldamodel.show_topic(topic_num)
                topic_details_df = topic_details_df.append(pd.Series([topic_num, prop_topic]), ignore_index=True)
    topic_details_df.columns = ['Dominant_Topic', '% Score']
    return topic_details_df


contents = pd.DataFrame({'Original text':text_clean})
topic_details = pd.concat([get_topic_details(ldamodel,
                           corpus), contents], axis=1)
topic_details.head()


     Dominant_Topic    % Score     Original text
0    0.0              0.989108    [investools, advisory, free, ...
1    0.0              0.993513    [forwarded, richard, b, ...
2    1.0              0.964858    [hey, wearing, target, purple, ...
3    0.0              0.989241    [leslie, milosevich, santa, clara, ...
```
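
Note that `DataFrame.append`, used in the function above, was removed in pandas 2.0. The following is a sketch of an equivalent helper that collects rows in a list first. The name `get_dominant_topics` and the sample topic distributions are made up, and the input is assumed to have the shape that iterating over `ldamodel[corpus]` yields, namely one list of `(topic_id, probability)` pairs per document:

```python
import pandas as pd


def get_dominant_topics(doc_topics):
    """Return the dominant topic and its score for each document.

    `doc_topics` is any iterable of [(topic_id, probability), ...] lists.
    """
    rows = []
    for topics in doc_topics:
        # The dominant topic is the pair with the highest probability
        topic_num, prop_topic = max(topics, key=lambda pair: pair[1])
        rows.append({"Dominant_Topic": topic_num, "% Score": prop_topic})
    return pd.DataFrame(rows, columns=["Dominant_Topic", "% Score"])


# Hypothetical topic distributions standing in for ldamodel[corpus]
fake_doc_topics = [
    [(0, 0.10), (3, 0.85), (4, 0.05)],
    [(1, 0.60), (2, 0.40)],
    [(3, 0.55), (0, 0.45)],
]
details = get_dominant_topics(fake_doc_topics)

# Flag documents whose dominant topic is the suspicious one with a high score
suspicious_topic, threshold = 3, 0.8  # illustrative values only
details["flag"] = (
    (details["Dominant_Topic"] == suspicious_topic)
    & (details["% Score"] >= threshold)
).astype(int)
print(details)
```

Unlike the original function, this version keeps the topic numbers as integers rather than floats. The threshold is a judgment call that depends on how many alerts can realistically be investigated.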

### Interpreting the topic model

* Use the visualization results from the pyLDAvis library
* Have a look at topics from the LDA model on the Enron email data. Which one would you research further for fraud detection purposes?

**Possible Answers**

* __**Topic 4.**__
* ~~Topic 3.~~
* ~~None of these topics seem related to fraud.~~


**Topic 4 seems to discuss the employee share option program, and seems to point to internal conversation (with "please, may, know" etc), so this is more likely to be related to the internal accounting fraud and trading stock with insider knowledge. Topic 3 seems to be more related to general news around Enron.**

### Finding fraudsters based on topic

In this section, you're going to **link the results** from the topic model **back to your original data**.
You have now learned that you want to **flag everything related to topic 3**.
Topic numbers are specific to a fitted model, so confirm which index corresponds to the topic of interest before flagging.
As you will see, this is not entirely straightforward.
You'll be given the function `get_topic_details()`, which takes the arguments `ldamodel` and `corpus`.
It retrieves the dominant topic and its score for each line of text.
With that function, you can append the results back to your original data.
Working with the model results in more detail is beyond the scope of this notebook; if you want to learn more,
you're highly encouraged to read this [article](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/).

Available for you are the `dictionary` and `corpus`, the text data `text_clean` as well as your model results `ldamodel`. Also defined is `get_topic_details()`.

**Instructions 1/3**

* Print and inspect the results from the `get_topic_details()` function by inserting your LDA model results and `corpus`.

#### def get_topic_details

In [40]:
def get_topic_details(ldamodel, corpus):
    # Collect one [topic_num, probability] row per document, then build the
    # DataFrame once (DataFrame.append was removed in pandas 2.0).
    rows = []
    for doc_topics in ldamodel[corpus]:
        # Sort (topic_num, probability) pairs so the dominant topic comes first
        doc_topics = sorted(doc_topics, key=lambda x: x[1], reverse=True)
        if doc_topics:
            topic_num, prop_topic = doc_topics[0]
            rows.append([float(topic_num), prop_topic])
    return pd.DataFrame(rows, columns=["Dominant_Topic", "% Score"])

In [42]:
# Run get_topic_details function and check the results
topic_details_df = get_topic_details(ldamodel, corpus)

In [43]:
topic_details_df.head()

Unnamed: 0,Dominant_Topic,% Score
0,2.0,0.992876
1,2.0,0.800692
2,0.0,0.428277
3,2.0,0.993486
4,2.0,0.993389


In [44]:
topic_details_df.tail()

Unnamed: 0,Dominant_Topic,% Score
2085,2.0,0.908144
2086,4.0,0.599751
2087,1.0,0.999322
2088,2.0,0.998145
2089,4.0,0.98842


**Instructions 2/3**

* Concatenate column-wise the results from the previously defined function `get_topic_details()` to the original text data contained under `contents` and inspect the results.

In [45]:
# Add original text to topic details in a dataframe
contents = pd.DataFrame({"Original text": text_clean})
topic_details = pd.concat([get_topic_details(ldamodel, corpus), contents], axis=1)
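
One detail worth keeping in mind: `pd.concat(..., axis=1)` matches rows by index label, not by position. The sketch below illustrates this on two small frames; `toy_topics`, `toy_contents`, and their values are hypothetical stand-ins for the real topic details and cleaned text.

```python
import pandas as pd

# Hypothetical toy frames standing in for the topic details and the cleaned text.
toy_topics = pd.DataFrame(
    {"Dominant_Topic": [2.0, 0.0, 3.0], "% Score": [0.99, 0.43, 0.87]}
)
toy_contents = pd.DataFrame(
    {"Original text": [["joint", "venture"], ["hey", "lunch"], ["stock", "option"]]}
)

# axis=1 places the frames side by side, matching rows on the index labels 0, 1, 2.
toy_details = pd.concat([toy_topics, toy_contents], axis=1)
print(toy_details)

# If the two indexes did not line up, the unmatched rows would contain NaN,
# so a quick sanity check is that no missing values were introduced.
print(toy_details.isna().any().any())  # False
```

Both frames here carry the default `0..n-1` index, so the rows pair up as intended. If one frame had been filtered or re-indexed beforehand, calling `reset_index(drop=True)` on it first would be one way to restore positional alignment.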

In [46]:
topic_details.sort_values(by=["% Score"], ascending=False).head(10).head()

Unnamed: 0,Dominant_Topic,% Score,Original text
154,2.0,0.999957,"[joint, venture, enron, meeting, belies, offic..."
135,2.0,0.999953,"[lawyer, agree, order, safeguard, document, ho..."
107,2.0,0.999907,"[sample, article, original, message, schmidt, ..."
849,3.0,0.999874,"[original, message, received, thu, aug, cdt, e..."
149,2.0,0.999769,"[electricity, trading, build, oh, slowly, fran..."


In [47]:
topic_details.sort_values(by=["% Score"], ascending=False).head(10).tail()

Unnamed: 0,Dominant_Topic,% Score,Original text
81,2.0,0.999721,"[brazil, scramble, energy, new, york, time, wo..."
2081,1.0,0.999631,"[unsubscribe, mailing, please, go, money, net,..."
297,2.0,0.999498,"[karen, yes, like, sauce, reference, sound, li..."
478,2.0,0.999354,"[greeting, jeff, thanks, make, copy, bring, cl..."
2087,1.0,0.999322,"[image, image, image, image, image, image, ima..."


**Instructions 3/3**

* Create a flag with the `np.where()` function that marks all content with topic 3 as its dominant topic with a 1, and everything else with a 0.

In [48]:
# Create flag for text most strongly associated with topic 3
topic_details["flag"] = np.where((topic_details["Dominant_Topic"] == 3.0), 1, 0)
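
As a small self-contained illustration of the flagging step, the sketch below applies `np.where()` to a toy frame. The frame `toy`, the column `confident_flag`, and the 0.9 score cut-off are hypothetical choices for illustration only; the notebook itself flags on the dominant topic alone.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same columns as topic_details.
toy = pd.DataFrame(
    {
        "Dominant_Topic": [2.0, 3.0, 3.0, 1.0],
        "% Score": [0.99, 0.95, 0.55, 0.80],
    }
)

# Flag rows whose dominant topic is 3 with a 1, and 0 otherwise.
toy["flag"] = np.where(toy["Dominant_Topic"] == 3.0, 1, 0)

# Optional variant (an assumption, not part of the exercise): also require a
# minimum topic score, so weakly assigned documents are not flagged.
toy["confident_flag"] = np.where(
    (toy["Dominant_Topic"] == 3.0) & (toy["% Score"] >= 0.9), 1, 0
)

print(toy)
```

The second flag shows why flagging by topic is less straightforward than it looks: a document can have topic 3 as its dominant topic while being only weakly associated with it.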

In [49]:
topic_details_1 = topic_details[topic_details.flag == 1]

In [50]:
topic_details_1.sort_values(by=["% Score"], ascending=False).head(10)

Unnamed: 0,Dominant_Topic,% Score,Original text,flag
849,3.0,0.999874,"[original, message, received, thu, aug, cdt, e...",1
2014,3.0,0.997888,"[forwarded, chris, h, foster, hou, pm, enron, ...",1
2039,3.0,0.997664,"[w, e, e, k, e, n, e, v, l, b, l, f, r, decemb...",1
2023,3.0,0.997091,"[fyi, kim, original, message, schoolcraft, dar...",1
2020,3.0,0.995853,"[jerry, remove, distribution, nng, outage, rep...",1
2054,3.0,0.995691,"[sound, like, deny, commodity, logic, please, ...",1
653,3.0,0.994966,"[reservation, status, changed, received, see, ...",1
2057,3.0,0.994935,"[reservation, status, changed, detail, reserva...",1
2031,3.0,0.993948,"[forwarded, eric, boyt, corp, enron, pm, jason...",1
2015,3.0,0.993089,"[sound, good, let, know, time, gonna, work, or...",1


**You have now flagged all data that is most strongly associated with topic 3, which seems to cover internal conversation about Enron stock options. You are a true detective. With these sections you have demonstrated that text mining and topic modeling can be powerful tools for fraud detection.**

### Text mining for fraud detection

* Learned how to augment fraud detection analysis with text mining techniques
* Applied word searches to flag the use of certain words, and applied topic modeling for fraud detection
* Learned how to effectively clean messy text data

### Further learning for fraud detection

* Network analysis to detect fraud
* Different supervised and unsupervised learning techniques (e.g. Neural Networks)
* Working with very large data