# Introduction to Topic Modeling


* *What is Topic Modeling?*
    * A statistical model used to uncover latent/abstract topic structure within a corpus of documents. The output of these models provides a means of interpreting and/or representing each document within a corpus as a collection of *k* topics. 
    
    
* *Why Topic Modeling?*   
    * Intuitively, topic modeling provides a "sum of parts (topics)" representation of documents in a given corpus. These representations of documents can further be used for other tasks of interest, such as search, close reading, labeling, supervised machine learning, etc. In this workshop, we will focus on two topic modeling approaches, namely LDA and NMF.
    
    
* *LDA*:
    * **Latent Dirichlet Allocation** [(Blei, Ng and Jordan; 2003)](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf) - A generative probabilistic model that treats each document as a mixture of topics and each topic as a distribution over words, with Dirichlet priors on these distributions.
    
    
* *NMF*:
    * **Non-Negative Matrix Factorization** [(Lee and Seung, 1999)](https://www.nature.com/articles/44565) - A matrix decomposition method that approximates a non-negative matrix as the product of two smaller non-negative matrices.
    

## I. The Data
The superbly well-organized and high-quality data used for this workshop comes from the wonderful people working on the [case.law](https://case.law/) project at the [Harvard Library Innovation Lab](https://lil.law.harvard.edu/). 

We will be implementing topic modeling on New Mexico's court decisions. 




In [None]:
import lzma
import pandas as pd
import numpy as np
import json

In [None]:
# specify file path
file_path = "data/data.jsonl.xz" 

# open and read 
with lzma.open(file_path) as f, open(file_path[:-3], 'wb') as fout:
    file_content = f.read() 
    fout.write(file_content)

### Loading data: Option 1 - if you don't have enough memory and/or space
The uncompressed New Mexico dataset is about 350 MB and contains 18,338 cases going back as far as 1852. If your computer has limited memory or disk space, you should go for this option.

In addition, for the purposes of this workshop, it is advised that you stick with Option 1 as building actual topic models on a larger corpus will take some time.


In [None]:
# specify the maximum amount of records
max_records = 1000
data = []

with open(file_path[:-3]) as f:
    for i, line in enumerate(f):
        data.append(json.loads(line))
        if i >= max_records - 1:
            break

data = pd.DataFrame(data)     
data.head()
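
If disk space is the constraint, the decompression step can be skipped altogether, since `lzma.open` in text mode yields lines straight from the compressed file. The sketch below shows one way to do that. `read_jsonl_xz` is a hypothetical helper that is not part of the original notebook, and the demo writes a tiny made-up file so that the block runs on its own.

```python
import json
import lzma
import os
import tempfile
from itertools import islice


def read_jsonl_xz(path, max_records=None):
    """Parse up to max_records JSON lines directly from an .xz file (hypothetical helper)."""
    with lzma.open(path, mode="rt", encoding="utf-8") as f:
        # islice stops after max_records lines; None means read everything
        return [json.loads(line) for line in islice(f, max_records)]


# Demo on a tiny made-up file so the block is self-contained
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo.jsonl.xz")
with lzma.open(demo_path, mode="wt", encoding="utf-8") as f:
    for i in range(5):
        f.write(json.dumps({"id": i, "name_abbreviation": "Case %d" % i}) + "\n")

records = read_jsonl_xz(demo_path, max_records=3)
print(len(records), records[0]["name_abbreviation"])
```

With the workshop data, the equivalent call would be something like `pd.DataFrame(read_jsonl_xz(file_path, max_records))`, which avoids writing the uncompressed copy to disk.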

### (Optional) Loading Data: Option 2 - if you have enough memory/space
If you have a computer with enough memory, you can use pandas' built-in *read_json()* function to load all 18,338 cases from New Mexico. 




In [None]:
# since this data is jsonl, make sure lines is set to True
#data = pd.read_json(file_path[:-3], lines = True)

#view the dataframe
data.head()

In [None]:
data.shape

In [None]:
# Select the columns we want
data  = data[['decision_date','name_abbreviation', 'court', 'casebody']]
data.head()

## II. Preprocessing 
Preprocessing methods such as stemming, lemmatization, stopword removal/down-weighting, etc. are commonly used in natural language processing research and applications. However, it is important to point out that fundamentally, using a certain preprocessing method(s) is a *choice*, and whenever we make choices as researchers, we must be able to justify our choices in the context of our domain and/or research objective. 
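
To see how much each choice changes the text, it can help to apply the steps one at a time to a single sentence. The sketch below does this with the standard library only. The sample sentence and the tiny `STOPWORDS` set are made up for illustration, and the regular expressions only approximate what the gensim functions used in this notebook do.

```python
import re

# Tiny illustrative stopword list (an assumption; gensim ships a much longer one)
STOPWORDS = {"the", "of", "in", "was", "by", "a", "and", "to"}

raw = "The judgment of the District Court, entered in 1912, was affirmed by the Court."

steps = {}
text = raw.lower()
steps["lowercase"] = text
text = re.sub(r"[^a-z0-9\s]", " ", text)   # drop punctuation and symbols
steps["strip_non_alphanum"] = text
text = re.sub(r"[0-9]+", " ", text)        # drop digits
steps["strip_numeric"] = text
tokens = [t for t in text.split() if t not in STOPWORDS]
steps["remove_stopwords"] = " ".join(tokens)

# Print the text after each step, with whitespace collapsed
for name, value in steps.items():
    print("%-20s %s" % (name, " ".join(value.split())))
```

Note that "court" survives every step. In a corpus of court decisions such a word carries little information, which is one reason a researcher might also choose to filter very frequent terms later on.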

In [None]:
from gensim.parsing.preprocessing import remove_stopwords, strip_non_alphanum, strip_numeric

opinion_texts = [] # create empty list to store text of opinions

for i in range(len(data)):
    opinions = data.iloc[i]['casebody']['data']['opinions']
    # cases without an opinion get an empty string, so opinion_texts stays aligned with the rows of data
    text = opinions[0]['text'].lower() if opinions else '' # lowercase 
    text = strip_non_alphanum(text) # remove non-alphanumeric characters like #,@,¶ etc
    text = strip_numeric(text)      # remove numbers
    text = remove_stopwords(text)   # remove stopwords
    opinion_texts.append(' '.join(text.split())) # collapse repeated whitespace

In [None]:
opinion_texts[1]

In [None]:
# putting the text back into the dataframe, in case you need to save it
data['text'] = opinion_texts
#data.head()

## III. Latent Dirichlet Allocation (LDA)
### A. Fit a Topic Model using sklearn's LDA

[Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) is a generative model and is probably the most popular topic modeling approach in research and other applications. We will be using sklearn's LatentDirichletAllocation class. See [here](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html) for more information about this class. 

This part of the workshop borrows heavily from [Laura Nelson's](https://github.com/lknelson) course on computational text analysis.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

##This is a function to print out the top words for each topic in a pretty way.
#Don't worry too much about understanding every line of this code.
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()

In [None]:
# Vectorize our text using CountVectorizer
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df = 0.60, # ignore terms that appear in more than 60% of documents, i.e. corpus-specific stopwords
                                min_df = 50,   # ignore terms that appear in fewer than 50 documents, i.e. remove very rare terms
                                max_features = 10000, # consider only 10k top words by frequency
                                stop_words='english' # remove stopwords
                                )

tf = tf_vectorizer.fit_transform(opinion_texts)
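
The `max_df` and `min_df` arguments both filter the vocabulary by document frequency, that is, the number of documents a term appears in. The sketch below reproduces that idea on a made-up four-document corpus using only the standard library. `filter_vocabulary` is a hypothetical helper and a simplified stand-in, not sklearn's implementation.

```python
from collections import Counter

docs = [
    "court appeal water rights",
    "court appeal contract damages",
    "court water irrigation rights",
    "court contract breach damages",
]


def filter_vocabulary(docs, min_df, max_df):
    """Keep terms whose document frequency lies in [min_df, max_df * n_docs].

    min_df is an absolute count and max_df a proportion, mirroring how the
    two arguments are passed in the cell above.
    """
    n_docs = len(docs)
    # Count each term once per document
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    return sorted(t for t, df in doc_freq.items()
                  if df >= min_df and df <= max_df * n_docs)


vocab = filter_vocabulary(docs, min_df=2, max_df=0.60)
print(vocab)
```

Here "court" is dropped because it appears in every document, and "irrigation" and "breach" are dropped because they each appear in only one.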

In [None]:
# visualize the document-term matrix 
tf_matrix = tf.todense()
tf_matrix = pd.DataFrame(tf_matrix, 
                         columns = tf_vectorizer.get_feature_names_out()) # scikit-learn >= 1.0; older versions use get_feature_names()
tf_matrix

In [None]:
n_samples = len(data)
n_topics = 10
n_top_words = 20

print("Fitting LDA models with tf features, "
      "n_samples=%d and n_topics=%d..."
      % (n_samples, n_topics))


#define the lda function, with desired options
#Check the documentation, linked above, to look through the options
lda = LatentDirichletAllocation(n_components = n_topics, # how many topics we want 
                                max_iter = 20, # maximum learning iterations 
                                learning_method = 'online',
                                learning_offset = 80., 
                                total_samples = n_samples,
                                random_state = 0)
#fit the model
lda.fit(tf)

In [None]:
#print the top words per topic, using the function defined above.

print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names_out() # scikit-learn >= 1.0; older versions use get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)

### Document by Topic Distribution

One thing we may want to do with the output is find the most representative texts for each topic. A simple way to do this is to merge the topic distribution back into the Pandas dataframe.
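
To make the idea concrete, the sketch below works on a small made-up document-by-topic matrix rather than the fitted model. Each row is one document, each column one topic, and the rows are assumed to sum to 1. The names `topic_dist_demo` and `top_documents` are hypothetical and used only for illustration.

```python
# Hypothetical document-by-topic matrix: 4 documents, 3 topics, rows sum to 1
topic_dist_demo = [
    [0.70, 0.20, 0.10],
    [0.05, 0.90, 0.05],
    [0.30, 0.30, 0.40],
    [0.60, 0.10, 0.30],
]
names = ["Case A", "Case B", "Case C", "Case D"]


def top_documents(dist, names, topic, n=2):
    """Return the n (name, weight) pairs with the largest weight on `topic`."""
    ranked = sorted(zip(names, (row[topic] for row in dist)),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


# Dominant topic of each document = column index of its largest weight
dominant = [max(range(len(row)), key=row.__getitem__) for row in topic_dist_demo]

print(dominant)
print(top_documents(topic_dist_demo, names, topic=0))
```

Sorting a dataframe column by topic weight, as done in the cells that follow, is the pandas equivalent of `top_documents`.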


In [None]:
# get the distribution array
topic_dist = lda.transform(tf)
#topic_dist

In [None]:
# Merge back in with the original dataframe.
pd.options.display.max_colwidth = 100

topic_dist_df = pd.DataFrame(topic_dist)
df_w_topics = topic_dist_df.join(data)
df_w_topics

Now we can sort the dataframe by the topic of interest and view the top documents for that topic.



In [None]:
topic_of_interest = 8

df_w_topics[['name_abbreviation', 
             'decision_date', 
             'text',
              topic_of_interest]].sort_values(by=[topic_of_interest], ascending=False)

In [None]:
topic_of_interest = 1

df_w_topics[['name_abbreviation', 
             'decision_date', 
             'text',
              topic_of_interest]].sort_values(by=[topic_of_interest], ascending=False)

In [None]:
import sys
!{sys.executable} -m pip install pyldavis

In [None]:
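import pyLDAvis
try:
    import pyLDAvis.lda_model as pyldavis_sklearn # name of the sklearn helper in newer pyLDAvis releases
except ImportError:
    import pyLDAvis.sklearn as pyldavis_sklearn   # name in older pyLDAvis releases
pyLDAvis.enable_notebook()

lda_display = pyldavis_sklearn.prepare(lda, 
                                       tf, 
                                       tf_vectorizer)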

pyLDAvis.save_html(lda_display, 'lda_visualization.html')
# See lda_visualization.html to explore the LDA based topics

lda_display # smaller lambda shows more unique/rare terms for the topic 

### B. Fitting an LDA Topic Model using gensim library


In [None]:
from gensim.utils import simple_preprocess
from gensim import corpora

## Create DTM using Gensim
# Tokenize the documents
tokenized_list = [simple_preprocess(doc) for doc in opinion_texts] # tokenize

# Create a Dictionary, that is a mapping between words and their integer ids.
dictionary = corpora.Dictionary(tokenized_list)

# Convert each document in the corpus into the bag-of-words (BoW) format = list of (token_integer_id, token_count) tuples.
corpus = [dictionary.doc2bow(line) for line in tokenized_list]
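
The bag-of-words format can be illustrated without gensim. The sketch below builds a token-to-id mapping and converts two made-up tokenized documents into lists of `(token_id, count)` tuples. The helper `to_bow` is hypothetical, and the ids are assigned in order of first appearance purely for illustration, which may differ from the ids gensim assigns.

```python
from collections import Counter

tokenized_demo = [
    ["water", "rights", "water", "court"],
    ["court", "contract", "damages"],
]

# Assign each distinct token an integer id in order of first appearance
# (illustrative; gensim's own id assignment may differ)
token2id = {}
for doc in tokenized_demo:
    for token in doc:
        token2id.setdefault(token, len(token2id))


def to_bow(doc, token2id):
    """Convert a token list into sorted (token_id, count) tuples."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


bow_corpus = [to_bow(doc, token2id) for doc in tokenized_demo]
print(token2id)
print(bow_corpus)
```

Only tokens that occur in a document appear in its list, so the representation stays compact even when the vocabulary is large.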


In [None]:
dictionary.token2id

In [None]:
from gensim.models import LdaModel, LdaMulticore

n_topics = 10
lda_model = LdaMulticore(corpus = corpus,
                         id2word = dictionary,
                         random_state = 100,
                         num_topics = n_topics,
                         passes=2, # Number of passes through the corpus during training.
                         per_word_topics=True)


# See the topics
lda_model.print_topics(-1)

In [None]:
top_words = 10
topics = lda_model.show_topics(formatted = False,  
                               num_topics = -1,
                               num_words = top_words)

for t in range(len(topics)):
    print("Topic {}, top {} words:".format( t+1, top_words))
    print(", ".join([w[0] for w in topics[t][1]]))

In [None]:
### takes a bit longer due to passes in lda_model
#import pyLDAvis.gensim 

#pyLDAvis.enable_notebook()

#vis = pyLDAvis.gensim.prepare(lda_model, 
#                              corpus, 
#                              dictionary=lda_model.id2word)
#vis

## IV. Non-negative Matrix Factorization (NMF)
[Non-Negative Matrix Factorization](https://en.wikipedia.org/wiki/Non-negative_matrix_factorization) is another approach to topic modeling. It is a matrix decomposition method and does not assume a prior probability distribution. With a deterministic initialization such as NNDSVD (used below), NMF often gives highly interpretable results with minimal hyperparameter tuning. NMF topic models are also fast to fit and are typically fit on TF-IDF-normalized document-term matrices (DTMs).

This part of the workshop borrows heavily from [Derek Greene's](https://github.com/derekgreene) tutorial on topic modeling using NMF.
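
To give a feel for what the factorization does, the sketch below fits V ≈ W·H on a tiny made-up matrix using the multiplicative update rules associated with Lee and Seung. This is a swapped-in, dependency-free illustration. It is not what sklearn runs by default (its default solver is coordinate descent), and all names here (`nmf_multiplicative`, `V_demo`, and so on) are hypothetical.

```python
import random


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frobenius_error(V, W, H):
    """Frobenius norm of V - W*H."""
    WH = matmul(W, H)
    return sum((v - wh) ** 2
               for rv, rwh in zip(V, WH) for v, wh in zip(rv, rwh)) ** 0.5


def nmf_multiplicative(V, k, n_iter=200, seed=0, eps=1e-9):
    """Factor V (n x m) into non-negative W (n x k) and H (k x m)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
    return W, H


# Made-up 4 documents x 4 terms matrix with two obvious "topics"
V_demo = [[2, 1, 0, 0], [4, 2, 0, 0], [0, 0, 1, 3], [0, 0, 2, 6]]

W0, H0 = nmf_multiplicative(V_demo, k=2, n_iter=0)      # random start only
W_demo, H_demo = nmf_multiplicative(V_demo, k=2, n_iter=200)
print("error before:", round(frobenius_error(V_demo, W0, H0), 3))
print("error after: ", round(frobenius_error(V_demo, W_demo, H_demo), 3))
```

Because every update multiplies by a ratio of non-negative quantities, W and H never become negative. Rows of `H_demo` play the role of topics (weights over terms) and rows of `W_demo` the role of document-by-topic weights.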

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a TFIDF vectorizer
vectorizer = TfidfVectorizer(min_df = 2)

V = vectorizer.fit_transform(opinion_texts)
V.shape
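
TF-IDF down-weights terms that occur in many documents and up-weights terms that are concentrated in a few. The sketch below computes it by hand on three made-up documents, using a smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. This is intended to mirror sklearn's default settings, but it is a simplified illustration and `tfidf` is a hypothetical helper.

```python
import math
from collections import Counter

docs = ["water rights water", "contract damages", "water contract"]


def tfidf(docs):
    """TF-IDF with smoothed idf and L2-normalized rows."""
    tokenized = [d.split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(t for doc in tokenized for t in set(doc))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0  # avoid dividing by zero
        rows.append([x / norm for x in row])
    return vocab, rows


vocab, rows = tfidf(docs)
print(vocab)
for row in rows:
    print([round(x, 3) for x in row])
```

In the second document, "damages" receives a larger weight than "contract" even though both occur once, because "damages" appears in fewer documents.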

In [None]:
# get terms/features in our new matrix
features = vectorizer.get_feature_names_out() # scikit-learn >= 1.0; older versions use get_feature_names()
len(features)
features[:10]

In [None]:
n_topics = 10
top_words = 10

# create NMF model
from sklearn import decomposition
model = decomposition.NMF(init = "nndsvd", 
                          n_components = n_topics)


In [None]:
# apply the model and extract the two W and H matrices -> V ~= W*H 
W = model.fit_transform(V)
H = model.components_

In [None]:
# Define get_descriptor function which will show top words for a given topic
def get_descriptor( features, H, topic_index, top ):
    top_indices = np.argsort( H[topic_index,:] )[::-1]
    top_terms = []
    for term_index in top_indices[0:top]:
        top_terms.append( features[term_index] )
    return top_terms

# define get_top_documents function which will show us top cases associated with topics
def get_top_documents( cases, W, topic_index, top ):
    top_indices = np.argsort( W[:,topic_index] )[::-1]
    top_documents = []
    for doc_index in top_indices[0:top]:
        top_documents.append(cases[doc_index])
    return top_documents

In [None]:
# show topics and words in those topics
descriptors = []
for topic_index in range( n_topics ):
    descriptors.append( get_descriptor( features, H, topic_index, top_words) )  # Top 10 words
    str_descriptor = ", ".join( descriptors[topic_index] )
    print("Topic %02d: %s" % ( topic_index+1, str_descriptor ) )

In [None]:
case_names = data.name_abbreviation.tolist()

topic_of_interest = 2
n_docs = 10

#Print top documents for a given topic
topic_documents = get_top_documents(case_names, W, topic_of_interest, n_docs) 
for i, doc in enumerate(topic_documents):
    print("%02d. %s" % ((i+1), doc))



### NMF with gensim
To tokenize, we use the same code as in the gensim LDA section. 

In [None]:
from gensim.models import nmf
from gensim.models import TfidfModel

In [None]:
## Create DTM using Gensim
# Tokenize the documents
tokenized_list = [simple_preprocess(doc) for doc in opinion_texts] # tokenize

# Create a Dictionary, that is a mapping between words and their integer ids.
dictionary = corpora.Dictionary(tokenized_list)

# Convert each document in the corpus into the bag-of-words (BoW) format = list of (token_integer_id, token_count) tuples.
corpus = [dictionary.doc2bow(line) for line in tokenized_list]

# An important benefit of NMF is its use of TFIDF document term matrix out of the box 
model_tfidf = TfidfModel(corpus) 

In [None]:
n_topics = 10
nmf_model = nmf.Nmf(model_tfidf[corpus], 
                    id2word = dictionary,
                    num_topics = n_topics,
                    passes = 20,
                    random_state = 100)

In [None]:
nmf_model.show_topics(num_topics=10, 
                      num_words=10)

In [None]:
top_words = 10
topics_nmf = nmf_model.show_topics(formatted = False,  
                               num_topics = -1,
                               num_words = top_words)

for t in range(len(topics_nmf)):
    print("Topic {}, top {} words:".format( t+1, top_words))
    print(", ".join([w[0] for w in topics_nmf[t][1]]))

## LDA as dimensionality reduction

Now that we have obtained a distribution of topic weights for each document, we can represent our corpus with a dense document-weight matrix instead of our initial sparse DTM. The weights can then replace tokens as features for any subsequent task (classification, prediction, etc.). A simple example is measuring cosine similarity between documents. For instance, which case is most similar to the first case in our corpus? Let's use pairwise cosine similarity to find out. 

NB: cosine similarity is the cosine of the angle between two vectors, which gives a measure of similarity that is robust to documents of different lengths (total number of tokens).
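
The length-robustness claim can be checked directly. The sketch below implements cosine similarity with the standard library and compares a made-up term-count vector with a version of itself that is ten times longer, and with a vector that shares no terms. The `cosine` helper and the three vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


short_doc = [1, 2, 0]    # hypothetical term counts
long_doc = [10, 20, 0]   # same proportions, ten times as many tokens
other_doc = [0, 0, 5]    # no terms in common with short_doc

print(cosine(short_doc, long_doc))
print(cosine(short_doc, other_doc))
```

The first pair points in the same direction, so its similarity is at the maximum regardless of document length, while the second pair has no overlap at all.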

First, let's turn the DTM into a readable dataframe.

In [None]:
dtm = pd.DataFrame(tf_vectorizer.fit_transform(opinion_texts).toarray(), columns=tf_vectorizer.get_feature_names_out()) # scikit-learn >= 1.0; older versions use get_feature_names()

Next, let's import the cosine_similarity function from sklearn and print the cosine similarity between the first and second cases and between the first and third cases.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity 
import numpy as np

document_1 = np.array([dtm.iloc[0,:]])
document_2 = np.array([dtm.iloc[1,:]])
document_3 = np.array([dtm.iloc[2,:]])

print("Cosine similarity between first and second cases: " + str(cosine_similarity(document_1, document_2)))
print("Cosine similarity between first and third cases: " + str(cosine_similarity(document_1, document_3)))


What if we use the topic weights instead of word frequencies?

In [None]:
dwm = df_w_topics.iloc[:,:10]

In [None]:
dwm

In [None]:
document_with_topics_1 = np.array([dwm.iloc[0,:]])
document_with_topics_2 = np.array([dwm.iloc[1,:]])
document_with_topics_3 = np.array([dwm.iloc[2,:]])

print("Cosine similarity between first and second cases: " + str(cosine_similarity(document_with_topics_1, document_with_topics_2)))
print("Cosine similarity between first and third cases: " + str(cosine_similarity(document_with_topics_1, document_with_topics_3)))

In [None]:
comparison_document = 0
matrix = dtm # use dwm instead to compare cases by topic weights

# cosine similarity between the comparison case and every case from the 2nd onward
sim = cosine_similarity(matrix.iloc[[comparison_document]].to_numpy(), 
                        matrix.iloc[1:].to_numpy()).ravel()

In [None]:
most_similar = np.argmax(sim) + 1 # sim starts at the 2nd case, so shift the index by one

print("Max similarity: " + str(np.max(sim)) + '\n'
      + "Index of most similar case: " + str(most_similar) + '\n' + '\n'
      + "Comparison case: " + str(df_w_topics['name_abbreviation'][comparison_document]) + '\n' 
      + "Most similar case: " + str(df_w_topics['name_abbreviation'][most_similar]))

## V. Conclusion
Topic modeling is an important part of NLP research and there are many applications where representing documents as topics is extremely useful. This notebook introduces sklearn and gensim based LDA and NMF topic modeling using [case.law](https://case.law) data. 

A key practical insight to be gained from this is the fundamental importance of the underlying data, domain knowledge, and the centrality of text vectorization for many NLP tasks. Another practical consideration worth keeping in mind is that the topics will be somewhat different for NMF and LDA models depending on hyperparameters and the underlying library used (sklearn vs gensim). 

There are a number of questions that this workshop does not address, largely due to its introductory nature. Firstly, the question that constantly comes up in topic modeling applications is how to determine the number of topics **k** to be used in these algorithms: how do we really know that a corpus has **k** topics? Secondly, how coherent are the word lists for a given topic, and how do we make sure that our judgement on the interpretability of the top-n words in a topic is not entirely subjective? These questions are outside the scope of this introductory workshop, but one should nevertheless keep them in mind when working with topic modeling. 



## VI. Acknowledgements

In addition to my own code, this notebook utilizes some (modified) code and insights from the following repositories:

* [CAP's example code notebooks for the case.law dataset](https://github.com/harvard-lil)

* [Laura Nelson computational text analysis course](https://github.com/lknelson)

* [Derek Greene topic modeling tutorial](https://github.com/derekgreene/)

* [Geoffrey Boushey work on case.law dataset](https://github.com/gboushey/)

* For more on topic modeling: [Topic modeling with Textacy](https://github.com/repmax/topic-model/blob/master/topic-modelling.ipynb)


