# Lab 12: Topic Modeling

In this lab, we will try to predict fraudulent statements from the CEO/CFO of Northwest Pipe Co., a company that overstated revenues
for several years before being charged with securities fraud ([Story](https://www.columbian.com/news/2011/jan/04/northwest-pipe-faces-new-allegations/)).

The data was originally used in [this study](https://journals.sagepub.com/doi/abs/10.1177/0261927X15586792). It contains each sentence
from several earnings calls that took place during the period of fraud. Each sentence is labeled according to whether it is relevant to the fraud,
i.e., whether it concerns something that was later restated in revised financial statements.

The original study concluded that the executives used different linguistic and verbal patterns when they were discussing fraud-related items.
However, the study did not look at the topics of each sentence. This type of analysis could lead to greater insight into how
fraudulent financial earnings could be presented to analysts and investors.

Today, we will use Latent Dirichlet Allocation (LDA) to summarize the topics of the text. LDA is one of the most popular methods for
identifying the topics in a corpus of text.

## Before you begin
To conduct LDA analysis, we will use the `gensim` library in Python. This does not come with Anaconda by default, so we will need
to download it. You can use the Anaconda Navigator. If you prefer the command line (faster method), you can follow 
[these instructions](https://docs.anaconda.com/anaconda/user-guide/getting-started/#open-anaconda-prompt).

To install `gensim` in the Anaconda Prompt, just type `conda install gensim` and confirm the install when it asks to do so.

# Data Exploration
The data is on Blackboard. Import the data and run a few summary statistics. The
primary column of interest is `Restatement Topic`, which indicates whether the sentence
was related to the fraud.

In [None]:
import numpy as np
import pandas as pd
from nltk.sentiment import vader

In [None]:
call_data = pd.read_csv('data/earningscall_fraud.csv')

print(call_data.describe())

In [None]:
# percent of fraud
print(call_data['Restatement Topic'].mean())

In [None]:
# are there more restatement topic sentences in the presentation or Q&A session?
print(call_data.groupby('PRES')['Restatement Topic'].mean())

# Clean the text
As in the previous lab, we will need to clean the raw text before fitting the models.
The `preprocess_string` function performs many common tasks for us, including converting to lowercase,
removing punctuation, removing stopwords, and [stemming](https://www.tutorialspoint.com/natural_language_toolkit/natural_language_toolkit_stemming_lemmatization.htm).

In [None]:
from gensim.parsing.preprocessing import preprocess_string
from gensim import corpora

call_data['clean_text'] = call_data['Sentence'].apply(preprocess_string)
print(call_data.loc[1, ['Sentence', 'clean_text']])


Next, create a dictionary with all of the words in the dataset.


In [None]:
dictionary = corpora.Dictionary(call_data['clean_text'])
print(dictionary)

## Count words
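Now, we can create a word count matrix. Conceptually, it has one column for
each word and one row for each document (a sentence, in this case).
The entries in the matrix are the counts of a word in a document.
`gensim` stores each row sparsely, as a list of `(word id, count)` pairs that omits words with a count of zero.
Words are indexed numerically, so the raw lists are not easy for a human
to read until the ids are linked back to words through the dictionary.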

BOW is an acronym for bag-of-words. 

In [None]:
bow_corpus = [dictionary.doc2bow(text) for text in call_data['clean_text']]
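To make the bag-of-words representation concrete, here is a minimal pure-Python sketch of the same idea on two made-up, already-stemmed sentences. The helper names (`build_vocab`, `to_bow`) and the toy tokens are hypothetical, and ids are assigned in order of first appearance, which may differ from how `gensim` numbers its words.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for tok in doc:
            if tok not in token2id:
                token2id[tok] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Return a sorted list of (token id, count) pairs for one document."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

# Toy documents (hypothetical, not taken from the lab data)
toy_docs = [["revenu", "grew", "quarter"],
            ["quarter", "revenu", "revenu", "margin"]]

token2id = build_vocab(toy_docs)
id2token = {i: tok for tok, i in token2id.items()}

toy_bow = [to_bow(doc, token2id) for doc in toy_docs]
# Re-link the numeric ids to words so the counts are readable
readable = [[(id2token[i], count) for i, count in doc] for doc in toy_bow]

print(toy_bow)
print(readable)
```

With the lab's objects, the analogous re-linking step would look up each id in `dictionary`, for example `dictionary[word_id]`.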

# Fitting an LDA model
Next, we will fit the model. One important consideration with LDA is that you must
choose the number of topics in advance. The total number of topics allowed is not
restricted, but with too few topics they will be too general to interpret, and with too many topics
there may be considerable overlap. Later, we will see how to measure fit for different
topic counts. This dataset is small, so model fitting is fast. Larger datasets could
take minutes, hours, or days.

For this first example, let's try with 10 topics.

In [None]:
from gensim import models

lda_10 = models.LdaModel(bow_corpus, num_topics=10, id2word=dictionary)

## Viewing topics
First, let's view the topics the model determined. The topics show the top words associated with each topic, by probability. 
In practice, we can use this to summarize or label what each topic corresponds to. Often, this has some 

In [None]:
for topic in lda_10.show_topics():
    print("Topic", topic[0], ":", topic[1])

## Topics per sentence
After we fit the model, we can see the topic probabilities for each sentence. Here we print them for the first nine sentences.

In [None]:
for doc in bow_corpus[0:9]:
    print(lda_10.get_document_topics(doc))
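Each printed item is a list of `(topic id, probability)` pairs. As a small sketch of how such a list might be summarized, the hypothetical helper below picks the most likely topic for one sentence. The example probabilities are made up for illustration and are not output from the model.

```python
def dominant_topic(doc_topics):
    """Return the (topic id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])

# Hypothetical topic distribution for a single sentence
example_doc_topics = [(0, 0.10), (3, 0.72), (7, 0.18)]

top_id, top_prob = dominant_topic(example_doc_topics)
print("Most likely topic:", top_id, "with probability", top_prob)
```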

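## Evaluation
To evaluate topic model fit, we can use perplexity or coherence. As computed below, both scores are typically
negative, and values closer to 0 indicate a better fit. Note that `log_perplexity` returns a per-word likelihood
bound rather than the perplexity itself, so a bound closer to 0 corresponds to a lower (better) perplexity.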

In [None]:
print('Perplexity: ', lda_10.log_perplexity(bow_corpus))
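The value printed above is a per-word likelihood bound, not the perplexity itself. According to the `gensim` documentation, the perplexity is reported as 2 raised to the negative of that bound. A minimal sketch of the conversion, using made-up bound values rather than results from this dataset:

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word likelihood bound (base 2) into a perplexity."""
    return 2.0 ** (-per_word_bound)

# Hypothetical bounds for two models; not results from this lab
bound_a = -7.0
bound_b = -8.0

print("Model A perplexity:", perplexity_from_bound(bound_a))
print("Model B perplexity:", perplexity_from_bound(bound_b))
```

The bound closer to 0 gives the smaller perplexity, which is why a less negative `log_perplexity` value indicates a better fit.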

In [None]:
from gensim.models import CoherenceModel
coherence_model_lda = CoherenceModel(model=lda_10, texts=call_data['clean_text'], dictionary=dictionary, coherence='u_mass')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
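To give some intuition for why `u_mass` coherence is usually negative, here is a simplified pure-Python sketch of the UMass score for a single topic, following the definition in Mimno et al. (2011) with add-one smoothing. It is an illustration only: `gensim` may use a different smoothing constant and averages across topics, and the function name and toy documents are hypothetical. The sketch assumes every top word occurs in at least one document.

```python
import math

def umass_coherence(top_words, docs):
    """Sum of log((D(w_m, w_l) + 1) / D(w_l)) over ordered pairs of top words.

    D(...) counts the documents that contain all of the given words.
    Assumes each word in top_words appears in at least one document.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        return sum(all(w in d for w in words) for d in doc_sets)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            joint = doc_freq(top_words[m], top_words[l])
            score += math.log((joint + 1) / doc_freq(top_words[l]))
    return score

# Toy corpus (hypothetical)
toy_docs = [["pipe", "steel", "order"],
            ["pipe", "steel"],
            ["steel", "price"],
            ["order", "backlog"]]

coherent = umass_coherence(["steel", "pipe"], toy_docs)       # words that co-occur
incoherent = umass_coherence(["steel", "backlog"], toy_docs)  # words that never co-occur

print("co-occurring pair:", coherent)
print("non-co-occurring pair:", incoherent)
```

Words that rarely appear in the same document pull the score further below 0, so scores closer to 0 suggest more coherent topics.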

# Remove additional stopwords 

In the model above, some words, like *million, quarter,* and *year*, appear among the highest-probability
words in each topic. They add very little to our ability to interpret the topics generated by this model.
These types of words are called `stopwords`, and it is common practice in LDA to remove them.

We will import a pre-defined list of stopwords that is based on financial reports to get rid of these words.
Then, we will define a function to re-process the sentences and remove these additional stopwords (many of
which were not included in the original stopword removal).

In [None]:
# load the custom stopword list (one word per line); a set makes membership checks fast
with open('data/stoplist.txt', 'r') as f:
    stopwords = set(f.read().splitlines())

In [None]:
def remove_stopwords(text):
    """ preprocess string and remove words from custom stopword list. """
    result = []

    for word in preprocess_string(text):
        if word not in stopwords:
            result.append(word)
    return result

call_data['clean_newstop'] = call_data['Sentence'].apply(remove_stopwords)

Define a new dictionary and corpus based on the new results.


In [None]:
new_dictionary = corpora.Dictionary(call_data['clean_newstop'])
print(new_dictionary)

new_corpus = [new_dictionary.doc2bow(text) for text in call_data['clean_newstop']]

## Fit new model

In [None]:
lda_new = models.LdaModel(new_corpus, num_topics=10, id2word=new_dictionary)

for topic in lda_new.show_topics():
    print("Topic", topic[0], ":", topic[1])

This new topic list is much better, though still not great. Perhaps adjusting the number of topics could 
be beneficial. 

In [None]:
for doc in new_corpus[0:9]:
    print(lda_new.get_document_topics(doc))

In [None]:
print('Perplexity: ', lda_new.log_perplexity(new_corpus))

# Classification with topic models
We can use the results of this analysis to build a classification model. The features used
to predict fraud will consist of the topic probabilities for each sentence. For example,
a sentence might have a 10% probability of belonging to Topic 0, a 3.1% probability of belonging to Topic 1, etc.
The next step will calculate the topic probabilities for each document and add them to our original data frame.
The columns named `Topic 0`, `Topic 1`, and so on hold the probability of each topic.

In [None]:
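from gensim.matutils import corpus2csc

# minimum_probability=0.0 keeps low-probability topics instead of dropping them
all_topics = lda_new.get_document_topics(new_corpus, minimum_probability=0.0)
# corpus2csc returns a sparse (topics x sentences) matrix; transpose so rows are sentences
all_topics_csc = corpus2csc(all_topics)
all_topics_numpy = all_topics_csc.T.toarray()
all_topics_df = pd.DataFrame(all_topics_numpy)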

# make topic names easier to read
topic_names = ['Topic ' + str(x) for x in all_topics_df.columns]
all_topics_df.columns = topic_names


classification_df = pd.concat([call_data, all_topics_df], axis=1)
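By default, `get_document_topics` leaves out topics whose probability falls below a threshold, so each sentence can come back with a list of a different length. That is why the cell above passes `minimum_probability=0.0` before building a fixed-width table. A minimal pure-Python sketch of the same padding idea, using a hypothetical helper and made-up probabilities:

```python
def to_dense_row(doc_topics, num_topics):
    """Expand a sparse list of (topic id, probability) pairs into a fixed-length row."""
    row = [0.0] * num_topics
    for topic_id, prob in doc_topics:
        row[topic_id] = prob
    return row

# Hypothetical sparse output for two sentences and a 3-topic model
sparse_docs = [[(0, 0.9), (2, 0.1)],
               [(1, 1.0)]]

dense = [to_dense_row(doc, num_topics=3) for doc in sparse_docs]
print(dense)
```

Every row now has one entry per topic, which is the shape a classifier expects for its feature matrix.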

In [None]:
classification_df.describe()

## Build Classifier
We can now build a classifier to see how well the topics perform at predicting whether a sentence is 
related to the fraud or not. 

Since this data is small, we will estimate performance with 10 stratified shuffle splits, i.e., repeated random train/test splits that preserve the class balance.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import cross_validate

n_splits = 10

pred_vars = topic_names + ['PRES', 'WORDCOUNT', 'CEO', 'TURN_AT_TALK']

scoring = ['accuracy', 'neg_log_loss', 'f1', 'roc_auc']

rf = RandomForestClassifier()
nb = GaussianNB()

classifiers = [rf, nb]

for clf in classifiers:
    cv_clf = cross_validate(
        clf,
        classification_df[pred_vars],
        classification_df['Restatement Topic'],
        cv=StratifiedShuffleSplit(n_splits),
        scoring=scoring,
    )
    # print(cv_clf)
    print('------', clf.__class__.__name__, '------')
    print("Mean Accuracy:", cv_clf['test_accuracy'].mean())
    print("Mean F1:", cv_clf['test_f1'].mean())
    print("Mean ROC AUC:", cv_clf['test_roc_auc'].mean())
    # scikit-learn negates log loss so that higher is better; flip the sign back
    print("Mean Log Loss:", -cv_clf['test_neg_log_loss'].mean())
    print()
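`StratifiedShuffleSplit` draws repeated random train/test splits rather than disjoint folds, while keeping the share of each class roughly the same in both parts. The sketch below illustrates that idea in plain Python. It is a simplified stand-in, not scikit-learn's implementation, and the function name, labels, and seed are hypothetical.

```python
import random

def stratified_shuffle_split(labels, test_size=0.1, seed=0):
    """Return (train_idx, test_idx) with roughly test_size of each class in the test set."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        n_test = max(1, round(len(shuffled) * test_size))
        test_idx.extend(shuffled[:n_test])
        train_idx.extend(shuffled[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical labels: 20 fraud-related sentences out of 100
toy_labels = [1] * 20 + [0] * 80

train_idx, test_idx = stratified_shuffle_split(toy_labels, test_size=0.1, seed=42)
print("test size:", len(test_idx))
print("fraud share in test:", sum(toy_labels[i] for i in test_idx) / len(test_idx))
```

Calling the function again with a different seed produces a new random split, and test sets from different splits may overlap, which is the main difference from k-fold cross-validation.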

# Exercises
1. Try fitting LDA with just 5 topics instead of 10. How does this affect human interpretability, perplexity, coherence, and classification performance?
2. Try fitting LDA with 15 topics. How does this affect human interpretability, perplexity, coherence, and classification performance?
3. In addition to Random Forest and Naive Bayes, try another classifier of your choosing. How does it compare to the Random Forest?

## Optional Exercises 
1. Run sentiment analysis on this data. Does adding that to a classifier improve performance?
2. See the section below on weighted word counts. Does using tf-idf improve human interpretability?

# Weighted word counts
The models above used actual word counts. We can also weight word counts by how many documents (sentences) they appear in.
This gives rare words a higher weight per appearance, and common words very little weight.
The most common method for this is term frequency-inverse document frequency (tf-idf). 
You can see a description of this method on [Wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).

In [None]:
tfidf = models.TfidfModel(new_corpus)
tfidf_corpus = tfidf[new_corpus]

lda_tfidf = models.LdaModel(tfidf_corpus, num_topics=10, id2word=new_dictionary)
for topic in lda_tfidf.show_topics():
    print("Topic", topic[0], ":", topic[1])
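To show what the tf-idf weighting does to raw counts, here is a minimal pure-Python sketch on three made-up sentences. It multiplies each count by log2(N / document frequency) and then scales each sentence to unit length, which is my understanding of the default behavior of `TfidfModel`; treat those details as assumptions. The function name and toy tokens are hypothetical.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {token: weight} dict per document using tf * log2(N / df), L2-normalized."""
    n_docs = len(docs)
    # document frequency: number of documents containing each token
    df = Counter(tok for doc in docs for tok in set(doc))

    weighted = []
    for doc in docs:
        tf = Counter(doc)
        raw = {tok: count * math.log2(n_docs / df[tok]) for tok, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append({tok: (w / norm if norm else 0.0) for tok, w in raw.items()})
    return weighted

# Toy documents (hypothetical): "quarter" appears in every sentence
toy_docs = [["quarter", "revenu", "pipe"],
            ["quarter", "margin"],
            ["quarter", "backlog", "backlog"]]

weights = tfidf_weights(toy_docs)
for doc_weights in weights:
    print(doc_weights)
```

A word that appears in every sentence, like "quarter" here, receives a weight of zero, while words concentrated in a few sentences keep most of the weight.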



In [None]:
print('Perplexity: ', lda_tfidf.log_perplexity(new_corpus))