# Data-Driven Research Assignment 2: Topic Modeling
This notebook contains the second, collaborative, graded assignment of the 2023 Data-Driven Research course. In this assignment you'll use a topic modeling tool to uncover the "topics" of a large set of reviews of popular films.

To complete the assignment, complete **Part 1, Part 2, Part 3 and Part 4** of the **Your Model** section at the end.

This is a collaborative assignment. In the text cell below, please include all the names of your group members.

If you used code or a solution from the internet (such as StackOverflow) or another external resource, please make reference to it (in any format). Unattributed copied code will be considered plagiarism and therefore fraud.


**Authors of this answer:**

# 1. Introduction

You'll use a topic modelling tool from Gensim, a popular Python library for topic modelling that is these days mainly known for its Word2Vec implementation for training word embeddings (dense representations). Using this library, you will model topics based on reviews of popular films. The reviews are stored in plain text files, organized by film and rating. The aim of this exercise is to familiarize you with the topic modeling process and its output, and to gain insight into what kinds of topics are modeled.

# 2. Preparation

This assignment comes with the following files:


1.   The reviews of the films. This is the data in which we want to find topics. They are found in the `movie2k/txt_sentoken` directory. There are two types: negative reviews (`neg` directory) and positive reviews (`pos` directory). The reviews are already tokenized.
2.   Stopword list files. They are found in the `stopwords` directory.

Let's start by loading the movie reviews from the files (I'll do it for you):

In [76]:
import os

def load_reviews(folder_path):
    reviews = [] #Make a list to put the reviews in
    reviewnames = [] # Make a list to put the review filenames in (to be able to look them up later)
    tokens = 0 #Make a counter for the number of tokens
    
    for file in os.listdir(folder_path):
        #Loop through all the text files in the folder, each containing one review
        
        if not file.endswith('.txt'):  #Only read text files
            continue

        file_path = os.path.join(folder_path, file)

        #Open the text file and read its contents
        with open(file_path, encoding='utf-8') as infile:
            review = infile.read()
        reviewnames.append(file)
            
        # Turn the string with the review into a list of words (this is easy because it is already tokenized)
        review = review.split()
        # And add it to the list
        reviews.append(review)
        # To count the number of tokens processed so far
        tokens = tokens + len(review)

    print(f"Loaded reviews from {folder_path} containing {tokens} tokens in total.") 
    return reviews, reviewnames
        
folder_path = "movie2k/txt_sentoken"
    
movie_reviews_pos, movie_reviewnames_pos = load_reviews(folder_path + "/pos") #Load the positive reviews
movie_reviews_neg, movie_reviewnames_neg = load_reviews(folder_path + "/neg") #Load the negative reviews

movie_reviews = movie_reviews_pos + movie_reviews_neg #Combine the lists of positive and negative reviews into one
movie_reviewnames = movie_reviewnames_pos + movie_reviewnames_neg #The same for the list of filenames

Loaded reviews from movie2k/txt_sentoken/pos containing 787051 tokens in total.
Loaded reviews from movie2k/txt_sentoken/neg containing 705630 tokens in total.


If you are working on Google Colab, you will probably have to change the path to the files to something that Google Colab has access to. For example, you could put the files on your Google Drive and then load them from there, as we did in Coding the Humanities. For more details about how to work with files in Python and load them from Google Drive, have a look at the Coding the Humanities course notebook on Files: https://github.com/bloemj/2023-coding-the-humanities/blob/main/notebooks/4_ReadingAndWritingFiles.ipynb

How to load files off Google Drive is explained at the beginning there.

## Preprocessing

Now that we have loaded the text, you might want to perform some pre-processing steps to be able to create a better bag-of-words model in which all forms of a word are mapped to a single number. For example, you could remove the punctuation characters, or you could perform lemmatization or stemming, which we discussed in the lecture. This would be the place to do it by writing a preprocessing function that accepts a list of movie reviews as its argument and returns a preprocessed list of movie reviews. Feel free to use your knowledge of text normalization from Coding the Humanities or the functions you wrote then. Here is some information on how to perform stemming with NLTK: https://www.nltk.org/howto/stem.html

You can also try other forms of preprocessing, if you are able to do it.

Make sure to also keep the unmodified reviews, so you can compare the results with preprocessing and without preprocessing.
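
A preprocessing function of the kind described above could look roughly like the sketch below. It only drops tokens that consist entirely of punctuation characters and uses the standard library alone; the function name `remove_punctuation_tokens` and the toy reviews are made up for illustration, and a stemming or lemmatization step (for example with NLTK) would have to be added separately. It returns a new list, so the unmodified reviews stay available for comparison.

```python
import string

PUNCTUATION = set(string.punctuation)

def remove_punctuation_tokens(reviews):
    """Return a new list of reviews without tokens made up only of punctuation."""
    cleaned_reviews = []
    for review in reviews:
        # Keep a token if at least one of its characters is not punctuation
        cleaned = [token for token in review
                   if token and not all(char in PUNCTUATION for char in token)]
        cleaned_reviews.append(cleaned)
    return cleaned_reviews

# Toy data standing in for movie_reviews (hypothetical example)
toy_reviews = [["a", "great", "film", ",", "really", "--", "great", "!"],
               ["(", "not", "bad", ")", "."]]
toy_cleaned = remove_punctuation_tokens(toy_reviews)
print(toy_cleaned)
```

Checking every character of a token, rather than testing the token against `string.punctuation` as a whole, also removes multi-character tokens such as `--`, while tokens like `it's` or `'80s` are kept.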

**Part 1: Preprocessing**

You can also skip this part for now - it is not required to perform the topic modelling, but you will get better results.

In [77]:
import string

import nltk
from nltk.stem import PorterStemmer

nltk.download('stopwords')  # Fetch NLTK's stopword lists in case you want to use them later

stemmer = PorterStemmer()

# Reviews with punctuation tokens removed and every remaining token stemmed
preprocessed_movie_reviews = []
# Non-stemmed movie reviews ("nsmr"): punctuation tokens removed, words left as they are
nsmr = []

# Note: `token not in string.punctuation` is a substring test, so it removes
# single punctuation characters but keeps tokens such as '--' that are not a
# contiguous part of that string.

# With stemming
for review in movie_reviews:
    review = [stemmer.stem(token) for token in review if token not in string.punctuation and token != '']
    preprocessed_movie_reviews.append(review)

# Without stemming
for review in movie_reviews:
    review = [token for token in review if token not in string.punctuation and token != '']
    nsmr.append(review)

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/n1tr0maverick/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# 3. Topic Modelling using Gensim

Gensim offers an implementation of Latent Dirichlet Allocation (LDA), the most popular topic modelling algorithm, which we discussed in the lecture. If you are working on Google Colab, it is normally already installed there. Otherwise, you can install it with `pip install --upgrade gensim` or if you are using Conda, `conda install -c conda-forge gensim`.

Let's load it, and some other things we use:

In [115]:
import gensim
import gensim.corpora as corpora
import gensim.models as models
import itertools
from operator import itemgetter
print(gensim.__version__)

4.3.1


## Constructing the bag-of-words model

The `gensim.corpora.Dictionary()` class allows you to map words to numbers, which is what we need to make a bag-of-words model. In particular, its `doc2bow()` method converts a list of words to a bag-of-words representation:

In [79]:
movie_dictionary = corpora.Dictionary(movie_reviews)
movie_bow_corpus = [movie_dictionary.doc2bow(d) for d in movie_reviews]

Let's see what happened:

In [80]:
print('Number of unique tokens in the dataset:', len(movie_dictionary))

#Checking the first 12 words in the bag-of-words model
print('\nThe first 12 words in the bag-of-words model:')
print(dict(itertools.islice(movie_dictionary.token2id.items(), 12)))

#Checking the first 100 words of the first review
print('\nThe start of the first review:')
print(movie_reviews[0][:100])
#And the filename of that review is...
print('\nThe filename of the first review:')
print(movie_reviewnames[0])

#Which words are used in that review?
print('\nMost frequent words in the first review:')
for i, freq in sorted(movie_bow_corpus[0], key=itemgetter(1), reverse=True)[:20]:
    print(movie_dictionary[i], "-->", freq)
print("...")

Number of unique tokens in the dataset: 50920

The first 12 words in the bag-of-words model:
{'"': 0, "'80s": 1, '(': 2, ')': 3, ',': 4, '-': 5, '.': 6, '00': 7, '102': 8, '12-part': 9, '1888': 10, '2': 11}

The start of the first review:
['films', 'adapted', 'from', 'comic', 'books', 'have', 'had', 'plenty', 'of', 'success', ',', 'whether', "they're", 'about', 'superheroes', '(', 'batman', ',', 'superman', ',', 'spawn', ')', ',', 'or', 'geared', 'toward', 'kids', '(', 'casper', ')', 'or', 'the', 'arthouse', 'crowd', '(', 'ghost', 'world', ')', ',', 'but', "there's", 'never', 'really', 'been', 'a', 'comic', 'book', 'like', 'from', 'hell', 'before', '.', 'for', 'starters', ',', 'it', 'was', 'created', 'by', 'alan', 'moore', '(', 'and', 'eddie', 'campbell', ')', ',', 'who', 'brought', 'the', 'medium', 'to', 'a', 'whole', 'new', 'level', 'in', 'the', 'mid', "'80s", 'with', 'a', '12-part', 'series', 'called', 'the', 'watchmen', '.', 'to', 'say', 'moore', 'and', 'campbell', 'thoroughly', 'r
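
As an aside, the mapping that `Dictionary` and `doc2bow()` perform can be imitated in plain Python. The sketch below is not Gensim's implementation (Gensim assigns its ids differently), and the names `build_token2id` and `to_bow` are made up for illustration; it only shows the idea of giving every distinct token a number and then counting those numbers per document.

```python
from collections import Counter

def build_token2id(documents):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for document in documents:
        for token in document:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(document, token2id):
    """Count the known tokens of one document and return sorted (id, count) pairs."""
    counts = Counter(token2id[token] for token in document if token in token2id)
    return sorted(counts.items())

# Two toy documents (hypothetical data)
toy_documents = [["comic", "book", "film", "comic"], ["film", "about", "a", "book"]]
toy_token2id = build_token2id(toy_documents)
toy_bow = [to_bow(document, toy_token2id) for document in toy_documents]
print(toy_token2id)
print(toy_bow)
```

Word order is lost in this representation: each document is reduced to which tokens occur in it and how often.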

## The topic model

Now we can train our LDA model on this bag-of-words data by using `gensim.models.ldamodel.LdaModel()`.

This model can take various parameters that specify what kind of model gets made. Some important ones:


* `num_topics`: how many topics we want. In what follows, we set the number of topics to 5, because we want to have a few topics that we can interpret, but the right number of topics is data- and application-dependent;
* `id2word`: our bag-of-words dictionary, needed to map ids back to strings;
* `passes`: how often we iterate over the entire corpus (default = 1). In general, more passes give the model more opportunity to improve its estimate, at the cost of a longer training time. This number is also called the number of epochs in Artificial Intelligence and Machine Learning.

Let's first make a model that finds 5 topics, and tries 25 times to improve its estimate. This code may take a while to run, as it is the process that creates the topic model. If it takes too long, you can reduce the number of passes, but the topics might be worse.

In [81]:
reviews_ldamodel = models.ldamodel.LdaModel(movie_bow_corpus, num_topics=5, id2word = movie_dictionary, passes=25)

And let's have a look! An easy way to inspect the created topics is the `show_topics()` method, which returns the most representative words for each topic along with their probabilities.

In [82]:
reviews_ldamodel.show_topics(num_words=8) #Show the top 8 words for each topic

[(0,
  '0.061*"," + 0.042*"the" + 0.038*"." + 0.024*"a" + 0.024*"and" + 0.019*"of" + 0.017*"to" + 0.013*"in"'),
 (1,
  '0.057*"the" + 0.056*"," + 0.038*"." + 0.027*"of" + 0.025*"and" + 0.021*"a" + 0.019*"to" + 0.015*"in"'),
 (2,
  '0.022*"=" + 0.002*"pokemon" + 0.001*"powder" + 0.001*"dalmatians" + 0.001*"jesus" + 0.001*"impact" + 0.001*"christ" + 0.001*"stuart"'),
 (3,
  '0.050*"the" + 0.048*"," + 0.047*"." + 0.026*"a" + 0.023*"and" + 0.023*"to" + 0.022*"of" + 0.018*"is"'),
 (4,
  '0.001*"bilko" + 0.001*"dinosaurs" + 0.001*"twister" + 0.001*"spoon" + 0.001*"alessa" + 0.001*"stretch" + 0.001*"rocky" + 0.001*"doubtfire"')]
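
To quantify how much of a topic is taken up by uninformative tokens, the strings returned by `show_topics()` can be split back into words and weights. The sketch below does this with plain string operations; `parse_topic`, `uninformative_share` and the small set of tokens are illustrative names and values, not part of Gensim. (Passing `formatted=False` to `show_topics()` should give (word, probability) pairs directly, which avoids the parsing.)

```python
def parse_topic(topic_string):
    """Split a string like '0.061*"," + 0.042*"the"' into (word, weight) pairs."""
    pairs = []
    for part in topic_string.split(" + "):
        weight, word = part.split("*", 1)
        pairs.append((word[1:-1], float(weight)))  # strip the surrounding quotes
    return pairs

def uninformative_share(topic_string, uninformative):
    """Return the fraction of a topic's listed words that are in `uninformative`."""
    words = [word for word, _ in parse_topic(topic_string)]
    return sum(word in uninformative for word in words) / len(words)

# Topic 0 and topic 4 as printed above
topic_0 = '0.061*"," + 0.042*"the" + 0.038*"." + 0.024*"a" + 0.024*"and" + 0.019*"of" + 0.017*"to" + 0.013*"in"'
topic_4 = '0.001*"bilko" + 0.001*"dinosaurs" + 0.001*"twister" + 0.001*"spoon" + 0.001*"alessa" + 0.001*"stretch" + 0.001*"rocky" + 0.001*"doubtfire"'

# A hand-picked set of uninformative tokens (assumption, for illustration only)
uninformative_tokens = {",", ".", "the", "a", "and", "of", "to", "in", "is"}

share_topic_0 = uninformative_share(topic_0, uninformative_tokens)
share_topic_4 = uninformative_share(topic_4, uninformative_tokens)
print(share_topic_0, share_topic_4)
```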

There we go, we have a topic model. However, you can probably see that it is far from perfect and some uninteresting 'words' appear there. Now, it is your turn to make it better!

## Your model

**Part 1: Preprocessing**

Show the effect of your preprocessing by also making a topic model for your preprocessed_movie_reviews. First, you make a bag-of-words model and then the LdaModel, as above. Feel free to go back to your preprocessing code above and update it based on what you saw from the show_topics function applied to the initial model.

Try to make a model with 8 topics, and show the top 8 words for each topic. **Assign the model to a new variable with a sensible name** (avoid overwriting the previous models).

Also for the dictionary and corpus, **give the variables different and expressive names to avoid overwriting the other ones**. Otherwise, you will get confused between your different topic models.

In [83]:
# your code here
movie_dictionary1 = corpora.Dictionary(preprocessed_movie_reviews)
movie_bow_corpus_1 = [movie_dictionary1.doc2bow(d) for d in preprocessed_movie_reviews]

reviews_ldamodel_1 = models.ldamodel.LdaModel(movie_bow_corpus_1, num_topics=8, id2word = movie_dictionary1, passes=25)
reviews_ldamodel_1.show_topics(num_words=8) 

[(0,
  '0.053*"the" + 0.028*"of" + 0.027*"a" + 0.020*"to" + 0.019*"and" + 0.015*"in" + 0.014*"is" + 0.009*"that"'),
 (1,
  '0.059*"the" + 0.030*"of" + 0.029*"and" + 0.027*"a" + 0.023*"to" + 0.019*"in" + 0.016*"is" + 0.010*"that"'),
 (2,
  '0.003*"hous" + 0.003*"de" + 0.003*"mandingo" + 0.003*"bont" + 0.003*"hammond" + 0.002*"slave" + 0.002*"neeson" + 0.002*"_the"'),
 (3,
  '0.056*"the" + 0.028*"and" + 0.027*"a" + 0.026*"to" + 0.026*"is" + 0.021*"of" + 0.016*"in" + 0.010*"it"'),
 (4,
  '0.070*"the" + 0.032*"of" + 0.028*"a" + 0.028*"and" + 0.022*"to" + 0.021*"is" + 0.017*"in" + 0.010*"as"'),
 (5,
  '0.055*"the" + 0.029*"a" + 0.026*"and" + 0.025*"to" + 0.024*"of" + 0.017*"is" + 0.016*"in" + 0.014*"that"'),
 (6,
  '0.002*"the" + 0.002*"alessa" + 0.002*"haunt" + 0.002*"heal" + 0.002*"spencer" + 0.002*"bulli" + 0.002*"michel" + 0.002*"cyborsuit"'),
 (7,
  '0.040*"the" + 0.034*"a" + 0.026*"to" + 0.026*"and" + 0.021*"is" + 0.017*"of" + 0.016*"in" + 0.012*"that"')]

**Part 2: Stopwords**

The topics you saw so far are probably mostly made up of stopwords such as "the". As discussed in the lecture, our results will probably be more interesting if we get rid of them.

We have included 3 generic lists of stopwords: the default list of the tool Mallet, a shorter frequent word list used in search applications (Snowball stemmer), and the top 10,000 words based on Google n-grams (in frequency order, select as many lines as you want). Gensim and NLTK also have stopword lists.

Make a function that accepts the path to a stopwords file (e.g. `stopwords/standard-mallet-en.txt`), and returns a list of stopwords.

In [84]:
def load_stopwords(filename):
    #Load a file from disk and return a list of stopwords (one per line)
    stopword_list = []
    with open(filename, encoding='utf-8') as infile:
        for line in infile:
            stopword_list.append(line.strip())

    return stopword_list

stopword_list = load_stopwords("stopwords/standard-mallet-en.txt")

Then, make a function that takes a stopword list and a list of reviews (e.g. `preprocessed_movie_reviews`). The function should remove all stopwords from all the reviews, returning a list of the reviews without stopwords. This code may be a bit slow if you have many stopwords, since there is a lot of data to process.

In [85]:
def filter_stopwords(stopword_list, movie_reviews):
    #Remove stopwords from the list of movie reviews
    filtered_reviews = []
    for review in movie_reviews:
        filtered_review = []
        for token in review:
            if token not in stopword_list:
                filtered_review.append(token)
        filtered_reviews.append(filtered_review)
    return filtered_reviews
    
filtered_movie_reviews = filter_stopwords(stopword_list, preprocessed_movie_reviews)
#print the first 10 reviews
print(filtered_movie_reviews[:10])

#filter and print the first 10 reviews of not stemmed
filtered_notstemmed_movie_reviews = filter_stopwords(stopword_list, nsmr)
print(filtered_notstemmed_movie_reviews[:10])

[['film', 'adapt', 'comic', 'book', 'plenti', 'success', "they'r", 'superhero', 'batman', 'superman', 'spawn', 'gear', 'kid', 'casper', 'arthous', 'crowd', 'ghost', 'world', "there'", 'realli', 'comic', 'book', 'hell', 'befor', 'starter', 'wa', 'creat', 'alan', 'moor', 'eddi', 'campbel', 'brought', 'medium', 'level', 'mid', "'80", '12-part', 'seri', 'call', 'watchmen', 'moor', 'campbel', 'thoroughli', 'research', 'subject', 'jack', 'ripper', 'michael', 'jackson', 'start', 'littl', 'odd', 'book', 'graphic', '500', 'page', 'long', 'includ', 'nearli', '30', 'consist', 'noth', 'footnot', 'word', "don't", 'dismiss', 'thi', 'film', 'becaus', 'sourc', 'past', 'comic', 'book', 'thing', 'find', 'anoth', 'stumbl', 'block', "hell'", 'director', 'albert', 'allen', 'hugh', 'hugh', 'brother', 'direct', 'thi', 'ludicr', 'cast', 'carrot', 'top', 'anyth', 'riddl', 'thi', 'direct', 'film', "that'", 'set', 'ghetto', 'featur', 'realli', 'violent', 'street', 'crime', 'mad', 'genius', 'menac', 'ii', 'societ
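
The filtering can be slow because `token not in stopword_list` scans the whole list for every token. A possible speed-up, sketched below, is to convert the stopword list to a set once, since membership tests on a set take roughly constant time. The name `filter_stopwords_with_set` and the toy data are made up for illustration; the result should be the same as with the list-based version.

```python
def filter_stopwords_with_set(stopword_list, reviews):
    """Remove all stopwords from all reviews, using a set for fast lookups."""
    stopword_set = set(stopword_list)  # built once, reused for every token
    return [[token for token in review if token not in stopword_set]
            for review in reviews]

# Toy data (hypothetical example)
toy_stopwords = ["the", "a", "of", "is"]
toy_reviews = [["the", "plot", "of", "the", "film", "is", "thin"],
               ["a", "great", "cast"]]
toy_filtered = filter_stopwords_with_set(toy_stopwords, toy_reviews)
print(toy_filtered)
```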

Lastly, let's make another topic model with this filtered data! Again, you make a bag-of-words model and then the LdaModel, as above.

Try to make a model with 8 topics, and show the top 8 words for each topic. Assign the model to a new variable with a sensible name (avoid overwriting the previous models).

In [86]:
# your code here
movie_dictionary2 = corpora.Dictionary(filtered_movie_reviews)
movie_bow_corpus_2 = [movie_dictionary2.doc2bow(d) for d in filtered_movie_reviews]

reviews_ldamodel_2 = models.ldamodel.LdaModel(movie_bow_corpus_2, num_topics=5, id2word = movie_dictionary2, passes=25)
reviews_ldamodel_2.show_topics(num_words=8) 


[(0,
  '0.016*"thi" + 0.013*"film" + 0.012*"movi" + 0.007*"wa" + 0.006*"ha" + 0.005*"time" + 0.005*"it\'" + 0.005*"charact"'),
 (1,
  '0.011*"film" + 0.007*"thi" + 0.006*"movi" + 0.006*"star" + 0.005*"wa" + 0.005*"ha" + 0.004*"charact" + 0.003*"war"'),
 (2,
  '0.019*"film" + 0.017*"thi" + 0.009*"movi" + 0.008*"ha" + 0.007*"wa" + 0.006*"it\'" + 0.006*"charact" + 0.005*"make"'),
 (3,
  '0.013*"film" + 0.012*"thi" + 0.012*"movi" + 0.008*"wa" + 0.007*"ha" + 0.006*"it\'" + 0.006*"charact" + 0.005*"make"'),
 (4,
  '0.014*"film" + 0.008*"thi" + 0.007*"ha" + 0.006*"wa" + 0.006*"movi" + 0.005*"charact" + 0.004*"it\'" + 0.004*"make"')]

Making a model without stemming, because the stemmed words are hard to read:

In [87]:
#modelling with not stemmed words
movie_dictionary3 = corpora.Dictionary(filtered_notstemmed_movie_reviews)
movie_bow_corpus_3 = [movie_dictionary3.doc2bow(d) for d in filtered_notstemmed_movie_reviews]

reviews_ldamodel_3 = models.ldamodel.LdaModel(movie_bow_corpus_3, num_topics=5, id2word = movie_dictionary3, passes=25)
reviews_ldamodel_3.show_topics(num_words=8)

[(0,
  '0.009*"movie" + 0.009*"film" + 0.006*"it\'s" + 0.003*"story" + 0.003*"time" + 0.003*"life" + 0.003*"good" + 0.003*"character"'),
 (1,
  '0.006*"movie" + 0.005*"film" + 0.004*"it\'s" + 0.003*"time" + 0.003*"big" + 0.002*"funny" + 0.002*"--" + 0.002*"story"'),
 (2,
  '0.019*"film" + 0.009*"movie" + 0.006*"it\'s" + 0.004*"good" + 0.004*"time" + 0.004*"story" + 0.003*"characters" + 0.003*"films"'),
 (3,
  '0.010*"film" + 0.008*"movie" + 0.006*"it\'s" + 0.004*"good" + 0.003*"time" + 0.003*"character" + 0.003*"story" + 0.003*"he\'s"'),
 (4,
  '0.015*"film" + 0.009*"movie" + 0.006*"it\'s" + 0.004*"story" + 0.003*"good" + 0.003*"time" + 0.003*"characters" + 0.003*"character"')]

**Part 3: Experimentation**

Are these general stopword lists sufficient? We are working in the movie review domain, so there may be uninformative words that are not stopwords in the general domain, such as the word 'movie'. A key experiment is to add stopwords specific to the movie review domain, i.e. words that occur frequently in all (or most) of the topics. Note that removing words does not just hide them: it can lead to (even very) different topics and different top-ranked reviews.

**Make your own domain-specific stopwords file** by taking one of the existing ones and adding your own stopwords (make sure that the stopword file is saved as a plain text file). Think about what stopwords are in this domain (e.g., the word film is not a stopword in general, but it will occur in essentially every film review).

Re-use the functions you previously made to load your own stopwords file and filter the movie reviews. Then, make another topic model with your new filtering and show the top 8 words for each topic.
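
One way to find candidates for such a domain-specific list is to count in how many reviews each token occurs: tokens that appear in a large share of all reviews are unlikely to distinguish topics. The sketch below does this with the standard library; the function name `frequent_in_most_reviews`, the threshold of 0.5 and the toy reviews are assumptions chosen for illustration, and the candidates still need to be judged by hand before they go into the stopword file.

```python
from collections import Counter

def frequent_in_most_reviews(reviews, min_share=0.5):
    """Return (token, share) pairs for tokens occurring in at least min_share of the reviews."""
    if not reviews:
        return []
    document_frequency = Counter()
    for review in reviews:
        document_frequency.update(set(review))  # count each token once per review
    total = len(reviews)
    return [(token, count / total)
            for token, count in document_frequency.most_common()
            if count / total >= min_share]

# Toy data standing in for the filtered reviews (hypothetical example)
toy_reviews = [["film", "plot", "alien"],
               ["film", "comedy", "plot"],
               ["film", "war"],
               ["movie", "film", "love"]]
candidates = frequent_in_most_reviews(toy_reviews, min_share=0.5)
print(candidates)
```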

In [136]:
stpwords = load_stopwords('stopwords/stopwords-empty.txt')

#Filter movie reviews with my stopwords
domainfiltered_movie_reviews = filter_stopwords(stpwords, filtered_movie_reviews)
domainfilteredunstemmed_movie_reviews = filter_stopwords(stpwords, filtered_notstemmed_movie_reviews)

#Make a model with 8 topics
movie_dictionary4 = corpora.Dictionary(domainfiltered_movie_reviews)
movie_bow_corpus_4 = [movie_dictionary4.doc2bow(d) for d in domainfiltered_movie_reviews]

reviews_ldamodel_4 = models.ldamodel.LdaModel(movie_bow_corpus_4, num_topics=8, id2word = movie_dictionary4, passes=25)
reviews_ldamodel_4.show_topics(num_words=8) 

[(0,
  '0.016*"thi" + 0.015*"movi" + 0.008*"ha" + 0.005*"charact" + 0.004*"onli" + 0.004*"play" + 0.003*"thing" + 0.003*"ani"'),
 (1,
  '0.015*"thi" + 0.007*"ha" + 0.006*"charact" + 0.005*"movi" + 0.005*"stori" + 0.004*"play" + 0.004*"becaus" + 0.004*"onli"'),
 (2,
  '0.007*"thi" + 0.006*"ha" + 0.005*"charact" + 0.004*"stori" + 0.003*"movi" + 0.003*"onli" + 0.003*"harri" + 0.003*"famili"'),
 (3,
  '0.019*"thi" + 0.013*"movi" + 0.008*"ha" + 0.006*"charact" + 0.004*"onli" + 0.004*"action" + 0.004*"play" + 0.004*"veri"'),
 (4,
  '0.012*"thi" + 0.012*"movi" + 0.007*"ha" + 0.005*"charact" + 0.004*"stori" + 0.004*"onli" + 0.003*"play" + 0.003*"alien"'),
 (5,
  '0.006*"scream" + 0.005*"ha" + 0.004*"thi" + 0.004*"charact" + 0.003*"ape" + 0.002*"stori" + 0.002*"love" + 0.002*"end"'),
 (6,
  '0.012*"thi" + 0.010*"movi" + 0.007*"ha" + 0.006*"charact" + 0.005*"stori" + 0.004*"play" + 0.004*"veri" + 0.003*"love"'),
 (7,
  '0.015*"thi" + 0.007*"ha" + 0.007*"charact" + 0.007*"movi" + 0.004*"play" + 0

Model with 8 topics without stemming:

In [137]:
stpwords = load_stopwords('stopwords/stopwords-empty.txt')

domainfilteredunstemmed_movie_reviews = filter_stopwords(stpwords, filtered_notstemmed_movie_reviews)
#Make a model with 8 topics for not stemmed words
moviedictionary5 = corpora.Dictionary(domainfilteredunstemmed_movie_reviews)
movie_bow_corpus_5 = [moviedictionary5.doc2bow(d) for d in domainfilteredunstemmed_movie_reviews]

reviews_ldamodel_5 = models.ldamodel.LdaModel(movie_bow_corpus_5, num_topics=8, id2word = moviedictionary5, passes=25)
reviews_ldamodel_5.show_topics(num_words=15)

[(0,
  '0.003*"funny" + 0.002*"back" + 0.002*"big" + 0.002*"love" + 0.002*"comedy" + 0.002*"original" + 0.002*"great" + 0.002*"man" + 0.002*"years" + 0.002*"work" + 0.002*"made" + 0.002*"makes" + 0.002*"director" + 0.002*"there\'s" + 0.002*"end"'),
 (1,
  '0.003*"action" + 0.003*"star" + 0.003*"series" + 0.002*"trek" + 0.002*"great" + 0.002*"back" + 0.002*"godzilla" + 0.002*"makes" + 0.002*"big" + 0.002*"made" + 0.002*"find" + 0.002*"acting" + 0.001*"years" + 0.001*"director" + 0.001*"end"'),
 (2,
  '0.003*"star" + 0.002*"effects" + 0.002*"love" + 0.002*"great" + 0.002*"back" + 0.002*"special" + 0.002*"wars" + 0.002*"end" + 0.002*"there\'s" + 0.002*"audience" + 0.002*"original" + 0.002*"funny" + 0.002*"big" + 0.002*"young" + 0.002*"man"'),
 (3,
  '0.002*"man" + 0.002*"work" + 0.002*"comedy" + 0.002*"action" + 0.002*"role" + 0.002*"great" + 0.002*"end" + 0.002*"john" + 0.002*"love" + 0.002*"director" + 0.002*"family" + 0.002*"makes" + 0.002*"audience" + 0.002*"young" + 0.001*"truman"'),

Now, you should have 3 models (or more): one without any stopword filtering, one with the standard stopword filtering and one with the domain-filtered stopwords using the list you modified yourself. Compare the topics found by the three models (just looking at them is fine, no need to code a comparison).

Do the topics look better with stopword filtering and with domain-specific stopword filtering? At this point, do the resulting topics correspond to particular film genres that you would have expected?

reviews_ldamodel: uses the raw, unprocessed data in movie_reviews.
The topics consist mostly of punctuation marks and other uninformative tokens.

reviews_ldamodel_1: uses the preprocessed data (stemmed, punctuation removed).
The topics are dominated by stopwords such as "a", "an" and "the".

reviews_ldamodel_2: the stemmed model after removing the Mallet stopwords.
The topics contain word stems such as "movi" and "film".

reviews_ldamodel_3: the unstemmed model after removing the Mallet stopwords.
The topics contain words such as "film", "movie", "story" and "life".

reviews_ldamodel_4: the stemmed model after also filtering with the domain-specific stopwords.
The topics contain stems such as "thi", "movi" and "charact"; because of the stemming it is hard to make sense of them.

reviews_ldamodel_5: the unstemmed model after also filtering with the domain-specific stopwords.
The topics contain "good" and "bad" a lot, but also words such as "plot", "people" and "director".

Model 5 appears to have the most legible topics, so we continue with model 5 for further use and optimization.

Increase the number of topics. What happens with the topics if you model very few or very many topics? (answer in a text box). Assign the model(s) to a new variable with a sensible name (avoid overwriting the previous models).

In [138]:
# using lda model 5, with increased topics
# your code here
movie_dictionary6 = corpora.Dictionary(domainfilteredunstemmed_movie_reviews)
movie_bow_corpus6 = [movie_dictionary6.doc2bow(d) for d in domainfilteredunstemmed_movie_reviews]

lda_main = models.ldamodel.LdaModel(movie_bow_corpus6, num_topics=15, id2word = movie_dictionary6, passes=25)
lda_main.show_topics(num_words=10)

[(1,
  '0.002*"man" + 0.002*"work" + 0.002*"years" + 0.002*"there\'s" + 0.002*"made" + 0.002*"back" + 0.002*"director" + 0.002*"love" + 0.002*"great" + 0.002*"action"'),
 (9,
  '0.003*"love" + 0.002*"man" + 0.002*"director" + 0.002*"made" + 0.002*"played" + 0.002*"end" + 0.002*"things" + 0.002*"real" + 0.002*"work" + 0.002*"that\'s"'),
 (2,
  '0.003*"great" + 0.002*"end" + 0.002*"man" + 0.002*"things" + 0.002*"love" + 0.002*"performance" + 0.002*"work" + 0.002*"back" + 0.002*"director" + 0.002*"find"'),
 (8,
  '0.004*"scream" + 0.004*"funny" + 0.003*"comedy" + 0.003*"there\'s" + 0.002*"horror" + 0.002*"director" + 0.002*"kevin" + 0.002*"great" + 0.002*"back" + 0.002*"plays"'),
 (0,
  '0.003*"end" + 0.002*"john" + 0.002*"director" + 0.002*"audience" + 0.002*"high" + 0.002*"made" + 0.002*"great" + 0.002*"fact" + 0.002*"comedy" + 0.002*"back"'),
 (13,
  '0.003*"great" + 0.003*"funny" + 0.002*"makes" + 0.002*"work" + 0.002*"run" + 0.002*"comedy" + 0.002*"director" + 0.002*"big" + 0.002*"ac

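It is hard to draw a clear conclusion here: with 15 topics, many topics share the same frequent words (such as "great", "director" and "back"), and only a few (such as topic 8, with "scream", "comedy" and "horror") stand out.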

(your answer here)

Increase the number of topic words printed to get more information per topic. Is it easier to make sense of a topic if you look further down the list, or are the initial words clearer?

In [101]:
lda_main.show_topics(num_words=20)
# your code here

[(9,
  '0.004*"good" + 0.004*"plot" + 0.002*"man" + 0.002*"bad" + 0.002*"director" + 0.002*"love" + 0.002*"batman" + 0.002*"role" + 0.002*"people" + 0.002*"plays" + 0.002*"family" + 0.002*"things" + 0.002*"action" + 0.002*"real" + 0.002*"back" + 0.002*"fact" + 0.002*"john" + 0.002*"great" + 0.002*"performance" + 0.002*"end"'),
 (1,
  '0.003*"troopers" + 0.003*"starship" + 0.002*"good" + 0.002*"people" + 0.002*"bad" + 0.002*"back" + 0.002*"love" + 0.002*"cast" + 0.002*"makes" + 0.002*"made" + 0.002*"end" + 0.002*"there\'s" + 0.002*"family" + 0.002*"plot" + 0.002*"work" + 0.002*"allen" + 0.002*"funny" + 0.002*"performance" + 0.002*"sex" + 0.002*"home"'),
 (13,
  '0.005*"good" + 0.004*"star" + 0.004*"effects" + 0.003*"action" + 0.003*"bad" + 0.003*"people" + 0.003*"plot" + 0.003*"special" + 0.002*"great" + 0.002*"end" + 0.002*"made" + 0.002*"director" + 0.002*"man" + 0.002*"young" + 0.002*"love" + 0.002*"there\'s" + 0.002*"wars" + 0.002*"original" + 0.002*"makes" + 0.002*"fact"'),
 (12,
 

The topics do become a bit clearer: with more words per topic, it is easier to infer the films, their genres and their overall rating.

If you are interested, you can also experiment with the difference between positive and negative reviews.

**Part 4: Evaluation**

There are a few numbers we can compute that indicate the quality of a topic model, such as [perplexity and coherence](https://github.com/ccs-amsterdam/r-course-material/blob/master/tutorials/R_text_LDA_perplexity.md). For perplexity, a lower number means a better model, and for coherence, a higher number is better. Try computing these scores for your models, and see which is the best one according to the numbers.

In a real project, you should compute these numbers over a separate part of the dataset (the test set) for a proper evaluation, but for simplicity and because we have not talked about this in the lecture we will skip that here.

In [143]:
from gensim.models import CoherenceModel

#Compute the perplexity score for the first preprocessed model (stemmed, no stopword filtering) on its bag-of-words corpus:
print('Perplexity: ', reviews_ldamodel_1.log_perplexity(movie_bow_corpus_1))  

# Compute coherence score on the same:
coherence_model_lda = CoherenceModel(model=reviews_ldamodel_1, texts=preprocessed_movie_reviews, dictionary=movie_dictionary1, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence score: ', coherence_lda)

Perplexity:  -7.093870619261154
Coherence score:  0.2513707999082332


In [141]:
# Compute these numbers for your other models here
# lda_main is the 15-topic version of model 5; movie_bow_corpus6 is the corpus it was trained on
print('Perplexity for Model 5: ', lda_main.log_perplexity(movie_bow_corpus6))

#computing coherence score for model 5
coherence_model_lda = CoherenceModel(model=lda_main, texts=domainfilteredunstemmed_movie_reviews, dictionary=movie_dictionary6, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence score for Model 5: ', coherence_lda)


Perplexity for Model 5:  -9.583329429053814
Coherence score for Model 5:  0.2601086408980652
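
Note that Gensim's `log_perplexity()` does not return the perplexity itself but a per-word likelihood bound; according to the Gensim documentation, the perplexity is 2 to the power of minus that bound. The sketch below applies this conversion to the two values printed above (the helper name `bound_to_perplexity` is made up for illustration). A bound closer to zero therefore corresponds to a lower perplexity. Keep in mind that the two models were trained on differently filtered vocabularies, so their perplexities are probably not directly comparable.

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word likelihood bound (as returned by log_perplexity) to a perplexity."""
    return 2 ** (-per_word_bound)

# Values copied from the outputs above
bounds = {
    "reviews_ldamodel_1": -7.093870619261154,
    "lda_main": -9.583329429053814,
}

perplexities = {name: bound_to_perplexity(bound) for name, bound in bounds.items()}
for name, perplexity in perplexities.items():
    print(f"{name}: bound = {bounds[name]:.3f}, perplexity = {perplexity:.1f}")
```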


In [128]:
print(movie_reviews[:10])
print(domainfilteredunstemmed_movie_reviews[:10])

[['films', 'adapted', 'from', 'comic', 'books', 'have', 'had', 'plenty', 'of', 'success', ',', 'whether', "they're", 'about', 'superheroes', '(', 'batman', ',', 'superman', ',', 'spawn', ')', ',', 'or', 'geared', 'toward', 'kids', '(', 'casper', ')', 'or', 'the', 'arthouse', 'crowd', '(', 'ghost', 'world', ')', ',', 'but', "there's", 'never', 'really', 'been', 'a', 'comic', 'book', 'like', 'from', 'hell', 'before', '.', 'for', 'starters', ',', 'it', 'was', 'created', 'by', 'alan', 'moore', '(', 'and', 'eddie', 'campbell', ')', ',', 'who', 'brought', 'the', 'medium', 'to', 'a', 'whole', 'new', 'level', 'in', 'the', 'mid', "'80s", 'with', 'a', '12-part', 'series', 'called', 'the', 'watchmen', '.', 'to', 'say', 'moore', 'and', 'campbell', 'thoroughly', 'researched', 'the', 'subject', 'of', 'jack', 'the', 'ripper', 'would', 'be', 'like', 'saying', 'michael', 'jackson', 'is', 'starting', 'to', 'look', 'a', 'little', 'odd', '.', 'the', 'book', '(', 'or', '"', 'graphic', 'novel', ',', '"', 'i

Since the coherence score for the model without stemming seems low, we also compute the scores for the stemmed model.

In [139]:
#computing perplexity for model 4
print('Perplexity for Model 4: ', reviews_ldamodel_4.log_perplexity(movie_bow_corpus_4))

#computing coherence score for model 4
coherence_model_lda = CoherenceModel(model=reviews_ldamodel_4, texts=movie_reviews, dictionary=movie_dictionary4, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence score for Model 4: ', coherence_lda)


Perplexity for Model 4:  -8.651938001372137
Coherence score for Model 4:  nan
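
A coherence of `nan` probably means that something went wrong with the inputs rather than that the model is bad. A likely cause here is that `texts=movie_reviews` holds the unstemmed reviews, while the topics of `reviews_ldamodel_4` consist of stems such as "thi" and "movi" that never occur in those texts, so there are no co-occurrence counts for `c_v` to work with. The sketch below checks for this kind of mismatch in plain Python; `missing_topic_words` and the toy data are illustrative. Passing the texts the model was actually trained on (`domainfiltered_movie_reviews`) would be the consistent choice.

```python
def missing_topic_words(topic_words, texts):
    """Return the topic words that do not occur in any of the tokenized texts."""
    vocabulary = set()
    for text in texts:
        vocabulary.update(text)
    return [word for word in topic_words if word not in vocabulary]

# Toy data (hypothetical): stemmed topic words checked against two versions of the texts
toy_topic_words = ["thi", "movi", "charact", "play"]
toy_unstemmed_texts = [["this", "movie", "has", "a", "character"], ["they", "play", "well"]]
toy_stemmed_texts = [["thi", "movi", "ha", "a", "charact"], ["they", "play", "well"]]

missing_unstemmed = missing_topic_words(toy_topic_words, toy_unstemmed_texts)
missing_stemmed = missing_topic_words(toy_topic_words, toy_stemmed_texts)
print(missing_unstemmed)
print(missing_stemmed)
```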


However, just comparing numbers is not very interpretable. We will choose our topic model with the highest coherence score and validate the evaluation.

Using the top 20 topic words for each topic in the model with the highest coherence score, pick at least 5 topic numbers and determine what film genres (in an informal sense) they represent, i.e. think of a meaningful label for the topic. Write down the topic number and your topic label. Is it easy to guess what the topic represents? For how many topics are you fairly confident, for how many do you have to make a guess, and for how many do you have no real clue?

(your answer here)

In [None]:
reviews_ldamodel.get_term_topics("the", minimum_probability = 1e-3)

[(1, 0.052959308), (3, 0.03552808), (4, 0.0030824954)]

Do this for your own best model and the labels you just picked. For each of your topic labels, if the probability for the label is the highest for the topic number you wrote down, your guess was probably correct. Did you guess a suitable label for every topic?

In [112]:
# your code here
reviews_ldamodel_5.get_term_topics("good", minimum_probability = 1e-3)

[(0, 0.0024713383),
 (1, 0.004900068),
 (2, 0.0020549402),
 (3, 0.0034489438),
 (4, 0.004087925),
 (5, 0.0034804526),
 (6, 0.0035401687),
 (7, 0.0042395815)]

(your answer here)

In a real project, you would also want to validate your topics by examining the reviews that are most strongly associated with that topic. You can see what documents have what topics using the get_document_topics() method. Here we look at the topics for the first document in the model (change the name of the model to yours):

In [104]:
reviews_ldamodel_5.get_document_topics(movie_bow_corpus[0], minimum_probability = 0)

[(0, 0.00015640355),
 (1, 0.00015646798),
 (2, 0.00015646858),
 (3, 0.00015643754),
 (4, 0.23685333),
 (5, 0.75400776),
 (6, 0.00015644266),
 (7, 0.008356733)]

Or for the first 20 of them:

In [103]:
for i, doc_topics in enumerate(reviews_ldamodel_5.get_document_topics(movie_bow_corpus)):
    if i >= 20:
        break
    print(f"Topics for the review {movie_reviewnames[i]}: {doc_topics}")

Topics for the review cv000_29590.txt: [(4, 0.23686397), (5, 0.75400853)]
Topics for the review cv001_18431.txt: [(1, 0.17214037), (4, 0.3357013), (5, 0.4835417)]
Topics for the review cv002_15918.txt: [(1, 0.4678076), (4, 0.037359186), (5, 0.49348322)]
Topics for the review cv003_11664.txt: [(1, 0.22001721), (4, 0.15501367), (5, 0.37171966), (6, 0.24740753)]
Topics for the review cv004_11636.txt: [(1, 0.3063768), (4, 0.13730076), (5, 0.48968762), (7, 0.065961674)]
Topics for the review cv005_29443.txt: [(1, 0.21698801), (3, 0.11653629), (4, 0.13869707), (5, 0.42899436), (6, 0.098395616)]
Topics for the review cv006_15448.txt: [(1, 0.10966835), (3, 0.04555083), (4, 0.25933298), (5, 0.54796857), (7, 0.03553256)]
Topics for the review cv007_4968.txt: [(1, 0.17226005), (4, 0.35740015), (5, 0.46585268)]
Topics for the review cv008_29435.txt: [(1, 0.24910447), (4, 0.1768782), (5, 0.49528638), (6, 0.077033974)]
Topics for the review cv009_29592.txt: [(0, 0.026072387), (1, 0.30831927), (4, 0.
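
To find the reviews most strongly associated with a topic, a first step is to pick the topic with the highest probability for each review. The sketch below does this for lists of (topic, probability) pairs like the ones printed above; `dominant_topic` is an illustrative helper, and the values are copied (rounded) from the output for the first two reviews. One caveat: the bag-of-words corpus passed to `get_document_topics()` should be built with the same dictionary as the model (for `reviews_ldamodel_5`, that would be `movie_bow_corpus_5` rather than `movie_bow_corpus`), because word ids from another dictionary refer to different words.

```python
from operator import itemgetter

def dominant_topic(doc_topics):
    """Return the (topic, probability) pair with the highest probability."""
    return max(doc_topics, key=itemgetter(1))

# Rounded values taken from the output above
doc_topics_by_review = {
    "cv000_29590.txt": [(4, 0.2369), (5, 0.7540)],
    "cv001_18431.txt": [(1, 0.1721), (4, 0.3357), (5, 0.4835)],
}

dominant_by_review = {name: dominant_topic(doc_topics)
                      for name, doc_topics in doc_topics_by_review.items()}
for name, (topic, probability) in dominant_by_review.items():
    print(f"{name}: topic {topic} ({probability:.2f})")
```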

But this assignment is already long enough so I will not ask you to report on this too!