# Topic Modeling with Binder, Gensim and Mallet

This notebook uses [Gensim](https://radimrehurek.com/gensim/) and [Mallet](http://mallet.cs.umass.edu/index.php) for topic modeling on the [Binder](https://mybinder.org/) platform. The README is available in the [Binder + Gensim + Mallet GitHub repository](https://github.com/polsci/binder-gensim-mallet).

## Setup

In [1]:
# check java version - used to make sure OpenJDK installed ok
!java -version

openjdk version "1.8.0_312"
OpenJDK Runtime Environment (build 1.8.0_312-8u312-b07-0ubuntu1~18.04-b07)
OpenJDK 64-Bit Server VM (build 25.312-b07, mixed mode)


In [2]:
# download and unzip mallet
!wget http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip
!unzip -q mallet-2.0.8.zip

--2022-06-27 12:27:07--  http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip
Resolving mallet.cs.umass.edu (mallet.cs.umass.edu)... 128.119.246.70
Connecting to mallet.cs.umass.edu (mallet.cs.umass.edu)|128.119.246.70|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://mallet.cs.umass.edu/dist/mallet-2.0.8.zip [following]
--2022-06-27 12:27:07--  https://mallet.cs.umass.edu/dist/mallet-2.0.8.zip
Connecting to mallet.cs.umass.edu (mallet.cs.umass.edu)|128.119.246.70|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16184794 (15M) [application/zip]
Saving to: ‘mallet-2.0.8.zip’


2022-06-27 12:27:09 (9.93 MB/s) - ‘mallet-2.0.8.zip’ saved [16184794/16184794]



In [3]:
# testing we can get some output from mallet - should see a list of Mallet 2.0 commands
!mallet-2.0.8/bin/mallet

Unrecognized command: 
Mallet 2.0 commands: 

  import-dir         load the contents of a directory into mallet instances (one per file)
  import-file        load a single file into mallet instances (one per line)
  import-svmlight    load SVMLight format data files into Mallet instances
  info               get information about Mallet instances
  train-classifier   train a classifier from Mallet data files
  classify-dir       classify data from a single file with a saved classifier
  classify-file      classify the contents of a directory with a saved classifier
  classify-svmlight  classify data from a single file in SVMLight format
  train-topics       train a topic model from Mallet data files
  infer-topics       use a trained topic model to infer topics for new documents
  evaluate-topics    estimate the probability of new documents under a trained model
  prune              remove features based on frequency or information gain
  split              divide data i

## Upload and extract corpus

Use Jupyter's file browser to upload a zip file of your corpus. The zip file should contain a single directory of .txt files. Make sure `path_to_zip_file` below is correct, then run the cell to unzip your corpus. Check Jupyter's file browser to confirm that the corpus was extracted correctly.

In [4]:
import zipfile

path_to_zip_file = 'ted-transcripts.zip' # change this to your zip file name

with zipfile.ZipFile(path_to_zip_file, 'r') as zip_ref:
    zip_ref.extractall('.')

FileNotFoundError: [Errno 2] No such file or directory: 'ted-transcripts.zip'
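
If the zip file has not been uploaded yet, the cell above stops with a `FileNotFoundError`. As a minimal sketch, the extraction could be wrapped in a check that also reports which .txt files the archive contains. The function name `extract_corpus` and its `target_dir` parameter are hypothetical and not part of the original notebook.

```python
import os
import zipfile

def extract_corpus(path_to_zip_file, target_dir='.'):
    """Extract the zip and return the names of the .txt files it contains.

    Returns an empty list (instead of raising) when the zip file is missing.
    """
    if not os.path.isfile(path_to_zip_file):
        print('Zip file not found:', path_to_zip_file)
        return []
    with zipfile.ZipFile(path_to_zip_file, 'r') as zip_ref:
        zip_ref.extractall(target_dir)
        # keep only the text files; directory entries and other files are ignored
        return [name for name in zip_ref.namelist() if name.endswith('.txt')]
```

Calling `extract_corpus('ted-transcripts.zip')` before the upload would then print a message and return an empty list rather than raising.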

## Import required libraries for topic modeling

In [None]:
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
# note: gensim.models.wrappers was removed in Gensim 4.0, so this import needs gensim < 4.0
from gensim.models.wrappers import LdaMallet
from gensim.models.coherencemodel import CoherenceModel
from gensim import similarities

import os.path
import re
import glob

import nltk
nltk.download('stopwords')

from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

## Set the path to the Mallet binary and set the path to the corpus

In [None]:
# you should NOT need to change this 
mallet_path = 'mallet-2.0.8/bin/mallet' 

# you need to change this path to the directory containing your corpus of .txt files
corpus_path = 'transcripts' 

## Functions to load and preprocess the corpus and create the document-term matrix

The following cell contains functions to load a corpus from a directory of text files, preprocess the corpus and create the bag of words document-term matrix. 

In [None]:
def load_data_from_dir(path):
    file_list = glob.glob(path + '/*.txt')

    # create document list:
    documents_list = []
    source_list = []
    for filename in file_list:
        with open(filename, 'r', encoding='utf8') as f:
            # the with block closes the file automatically
            documents_list.append(f.read())
        source_list.append(os.path.basename(filename))
    print("Total Number of Documents:",len(documents_list))
    return documents_list, source_list

def preprocess_data(doc_set, extra_stopwords=None):
    # adapted from https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python
    # replace newlines and runs of whitespace with a single space
    doc_set = [re.sub(r'\s+', ' ', doc) for doc in doc_set]
    # initialize regex tokenizer
    tokenizer = RegexpTokenizer(r'\w+')
    # create English stop words list
    en_stop = set(stopwords.words('english'))
    # add any extra stopwords
    if extra_stopwords:
        en_stop = en_stop.union(extra_stopwords)
    
    # list for tokenized documents in loop
    texts = []
    # loop through document list
    for doc in doc_set:
        # clean and tokenize document string
        raw = doc.lower()
        tokens = tokenizer.tokenize(raw)
        # remove stop words from tokens
        stopped_tokens = [token for token in tokens if token not in en_stop]
        # add tokens to list
        texts.append(stopped_tokens)
    return texts

def prepare_corpus(doc_clean):
    # adapted from https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python
    # Create the term dictionary of our corpus, where every unique term is assigned an index.
    dictionary = corpora.Dictionary(doc_clean)
    
    # Drop terms that appear in fewer than 5 documents or in more than 50% of documents.
    dictionary.filter_extremes(no_below=5, no_above=0.5)
    # Convert the list of documents (corpus) into a document-term matrix using the dictionary prepared above.
    doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
    return dictionary, doc_term_matrix
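
To make the bag-of-words document-term matrix concrete, the following sketch builds one in plain Python for two toy documents. It mirrors the idea behind `corpora.Dictionary` and `doc2bow` but is not Gensim's implementation. The helper names and the toy data are assumptions for illustration, and Gensim may assign different ids to the same tokens.

```python
from collections import Counter

def build_vocabulary(tokenized_docs):
    """Assign an integer id to every unique token, in order of first appearance."""
    vocabulary = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary)
    return vocabulary

def to_bag_of_words(doc, vocabulary):
    """Return sorted (token_id, count) pairs for the tokens found in the vocabulary."""
    counts = Counter(vocabulary[token] for token in doc if token in vocabulary)
    return sorted(counts.items())

# hypothetical toy corpus, already tokenized and stripped of stop words
toy_docs = [['topic', 'model', 'topic'], ['model', 'corpus']]
toy_vocabulary = build_vocabulary(toy_docs)
toy_matrix = [to_bag_of_words(doc, toy_vocabulary) for doc in toy_docs]

print(toy_vocabulary)  # {'topic': 0, 'model': 1, 'corpus': 2}
print(toy_matrix)      # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Each document becomes a sparse list of (id, count) pairs, so word order is discarded and only term frequencies remain.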

## Load and pre-process the corpus
Load the corpus, preprocess with additional stop words and output dictionary and document-term matrix.

In [None]:
# corpus_path (set above) should point to the directory containing your .txt files
document_list, source_list = load_data_from_dir(corpus_path)

# I've added extra stopwords here in addition to NLTK's stopword list - you could look at adding others.
doc_clean = preprocess_data(document_list,{'laughter','applause'})
dictionary, doc_term_matrix = prepare_corpus(doc_clean)

## LDA model with 30 topics
The following cell sets the number of topics to train the model with and the number of words to display for each topic.

In [None]:
number_of_topics = 30 # adjust this to alter the number of topics
words = 20 # adjust this to alter the number of words output for each topic below

The following cell runs Mallet's LDA through Gensim's wrapper with the `number_of_topics` specified above. This might take a few minutes!

In [None]:
ldamallet30 = LdaMallet(mallet_path, corpus=doc_term_matrix, num_topics=number_of_topics, id2word=dictionary, workers=1)

The following cell outputs the topics.

In [None]:
ldamallet30.show_topics(num_topics=number_of_topics,num_words=words)

## Convert to Gensim model format
Convert the Mallet model to Gensim format.

In [None]:
gensimmodel30 = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(ldamallet30)

## Get a coherence score

In [None]:
coherencemodel = CoherenceModel(model=gensimmodel30, texts=doc_clean, dictionary=dictionary, coherence='c_v')
print(coherencemodel.get_coherence())

## Get the ID of a specific document

In [None]:
lookup_doc_id = source_list.index('2017-09-20-zeynep_tufekci_we_re_building_a_dystopia_just_to_make_people_click_on_ads.txt')
print('Document ID from lookup:', lookup_doc_id)
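
`source_list.index` needs the exact filename and raises a `ValueError` otherwise. As a hedged alternative, the sketch below searches for a fragment of the filename and returns every match with its document ID. The function `find_documents` and the sample filenames are hypothetical and only illustrate the idea.

```python
def find_documents(source_list, fragment):
    """Return (doc_id, filename) pairs whose filename contains the fragment (case-insensitive)."""
    fragment = fragment.lower()
    return [(doc_id, name) for doc_id, name in enumerate(source_list)
            if fragment in name.lower()]

# hypothetical filenames standing in for the real source_list
sample_sources = ['2017-01-01-first_talk.txt',
                  '2017-09-20-second_talk_about_ads.txt',
                  '2018-03-05-third_talk.txt']

print(find_documents(sample_sources, 'ADS'))    # [(1, '2017-09-20-second_talk_about_ads.txt')]
print(find_documents(sample_sources, 'missing'))  # []
```

An empty list signals that nothing matched, which avoids the exception when the filename is mistyped.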

## Preview a document

Preview a document - you can change the doc_id to view another document.

In [None]:
doc_id = lookup_doc_id # index of document to explore - this can be an id number or set to lookup_doc_id
print(re.sub(r'\s+', ' ', document_list[doc_id])) 

## Output the distribution of topics for the document

The next cell outputs the distribution of topics on the document specified above.

In [None]:
document_topics = gensimmodel30.get_document_topics(doc_term_matrix[doc_id])
document_topics = sorted(document_topics, key=lambda x: x[1], reverse=True) # sorts document topics

for topic, prop in document_topics:
    topic_words = [word[0] for word in gensimmodel30.show_topic(topic, 10)]
    print("%.2f" % prop, topic, topic_words)

## Find similar documents
This will find the 5 most similar documents to the document specified above based on their topic distribution.

In [None]:
# gensimmodel30[doc_term_matrix] below represents the documents in the corpus in LDA vector space
lda_index = similarities.MatrixSimilarity(gensimmodel30[doc_term_matrix])

# query for our doc_id from above
similarity_index = lda_index[gensimmodel30[doc_term_matrix[doc_id]]]

# Sort the similarity index
similarity_index = sorted(enumerate(similarity_index), key=lambda item: -item[1])

# start at 1 because position 0 is normally the query document itself
for i in range(1, 6):
    document_id, similarity_score = similarity_index[i]

    print('Document Index:',document_id)
    print('Document:', source_list[document_id])
    print('Similarity Score:',similarity_score)
    
    print(re.sub(r'\s+', ' ', document_list[document_id][:500]), '...') # preview first 500 characters
    
    document_topics = gensimmodel30[doc_term_matrix[document_id]]
    document_topics = sorted(document_topics, key=lambda x: x[1], reverse=True)
    for topic, prop in document_topics:
        topic_words = [word[0] for word in gensimmodel30.show_topic(topic, 10)]
        print("%.2f" % prop, topic, topic_words)
    
    print()
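
`MatrixSimilarity` compares documents by the cosine similarity of their vectors, which here are topic distributions. The sketch below reproduces that ranking in plain Python on three toy distributions. It is an illustration under stated assumptions, not Gensim's implementation: the helper names and the toy proportions are hypothetical.

```python
import math

def to_dense(topic_pairs, number_of_topics):
    """Turn sparse (topic_id, proportion) pairs into a dense list of proportions."""
    vector = [0.0] * number_of_topics
    for topic, prop in topic_pairs:
        vector[topic] = prop
    return vector

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vector, vectors):
    """Return (doc_id, score) pairs sorted from most to least similar to the query."""
    scores = [(doc_id, cosine_similarity(query_vector, vector))
              for doc_id, vector in enumerate(vectors)]
    return sorted(scores, key=lambda item: -item[1])

# hypothetical topic distributions for three documents over three topics
toy_topics = [[(0, 0.9), (1, 0.1)], [(0, 0.8), (2, 0.2)], [(1, 0.5), (2, 0.5)]]
toy_vectors = [to_dense(pairs, 3) for pairs in toy_topics]

# query with document 0; it ranks first because it is identical to itself
ranked = rank_by_similarity(toy_vectors[0], toy_vectors)
for doc_id, score in ranked:
    print(doc_id, "%.2f" % score)
```

Document 1 shares its dominant topic with the query and so ranks above document 2, which is the same reasoning behind the similar-documents list produced by the cell above.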