
# Data-Driven Research Assignment 2: Topic Modeling
This notebook contains the second, collaborative, graded assignment of the 2023 Data-Driven Research course. In this assignment you'll use a topic modeling tool to uncover the "topics" of a large set of reviews of popular films.

To complete the assignment, complete **Part 1, Part 2, Part 3 and Part 4** of the **Your Model** section at the end.

This is a collaborative assignment. In the text cell below, please include all the names of your group members.

If you used code or a solution from the internet (such as StackOverflow) or another external resource, please make reference to it (in any format). Unattributed copied code will be considered plagiarism and therefore fraud.


**Authors of this answer:**
Aravind Ashok 13149970
Ngan Nguyen 13653830
Ellianna Kim 12718289

# 1. Introduction

You'll use a topic modelling tool from Gensim, a popular Python library for topic modelling, though these days it is mainly known for its implementation of Word2Vec for training word embeddings (dense representations). Using this library, you will model topics based on reviews of popular films. The reviews are stored in plain text files, organized by film and rating. The aim of this exercise is to familiarize you with the topic modeling process and its output, and to give you insight into the kinds of topics that are modeled.

# 2. Preparation

This assignment comes with the following files:


1.   The reviews of the films. This is the data in which we want to find topics. They are found in the movie2k/txt_sentoken directory. There are then two types: negative reviews (neg directory) and positive reviews (pos directory). The reviews are already tokenized.
2.   Stopword list files. They are found in the stopwords directory.

Let's start by loading the movie reviews from the files (I'll do it for you):

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import os

def load_reviews(folder_path):
    reviews = [] #Make a list to put the reviews in
    reviewnames = [] # Make a list to put the review filenames in (to be able to look them up later)
    tokens = 0 #Make a counter for the number of tokens
    
    for file in os.listdir(folder_path):
        #Loop through all the text files in the folder, each containing one review
        
        if not file.endswith('.txt'):  #Only read text files
            continue

        file_path = os.path.join(folder_path, file)

        #Open the text file and read its contents
        with open(file_path, encoding='utf-8') as infile:
            review = infile.read()
        reviewnames.append(file)
            
        # Turn the string with the review into a list of words (this is easy because it is already tokenized)
        review = review.split()
        # And add it to the list
        reviews.append(review)
        # To count the number of tokens processed so far
        tokens = tokens + len(review)

    print(f"Loaded reviews from {folder_path} containing {tokens} tokens in total.") 
    return reviews, reviewnames
        
folder_path = "/content/drive/MyDrive/ddr23-research2-main/2_TopicModelling/movie2k/txt_sentoken"
    
movie_reviews_pos, movie_reviewnames_pos = load_reviews(folder_path + "/pos") #Load the positive reviews
movie_reviews_neg, movie_reviewnames_neg = load_reviews(folder_path + "/neg") #Load the negative reviews

movie_reviews = movie_reviews_pos + movie_reviews_neg #Combine the lists of positive and negative reviews into one
movie_reviewnames = movie_reviewnames_pos + movie_reviewnames_neg #The same for the list of filenames

Loaded reviews from /content/drive/MyDrive/ddr23-research2-main/2_TopicModelling/movie2k/txt_sentoken/pos containing 787051 tokens in total.
Loaded reviews from /content/drive/MyDrive/ddr23-research2-main/2_TopicModelling/movie2k/txt_sentoken/neg containing 705630 tokens in total.


If you are working on Google Colab, you will probably have to change the path to the files to something that Google Colab has access to. For example, you could put the files on your Google Drive and then load them from there, as we did in Coding the Humanities. For more details about how to work with files in Python and load them from Google Drive, have a look at the Coding the Humanities course notebook on Files: https://github.com/bloemj/2023-coding-the-humanities/blob/main/notebooks/4_ReadingAndWritingFiles.ipynb

The beginning of that notebook explains how to load files from Google Drive.

## Preprocessing

Now that we have loaded the text, you might want to perform some pre-processing steps to be able to create a better bag-of-words model in which all forms of a word are mapped to a single number. For example, you could remove the punctuation characters, or you could perform lemmatization or stemming, which we discussed in the lecture. This would be the place to do it by writing a preprocessing function that accepts a list of movie reviews as its argument and returns a preprocessed list of movie reviews. Feel free to use your knowledge of text normalization from Coding the Humanities or the functions you wrote then. Here is some information on how to perform stemming with NLTK: https://www.nltk.org/howto/stem.html

You can also try other forms of preprocessing, if you are able to do it.

Make sure to also keep the unmodified reviews, so you can compare the results with preprocessing and without preprocessing.
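As a rough sketch of the kind of function described above: the names `preprocess_reviews` and `normalise` are made up for this illustration, and dropping punctuation-only tokens plus lowercasing is only one possible choice of normalization. The function returns a new list, so the unmodified reviews stay available for comparison.

```python
import string

def preprocess_reviews(reviews, normalise=str.lower):
    """Return a new list of reviews without punctuation-only tokens.

    `normalise` is applied to every kept token; a stemmer's stem method
    could be passed here to map different word forms to one stem.
    """
    punctuation = set(string.punctuation)
    cleaned = []
    for review in reviews:
        kept = []
        for token in review:
            # Skip empty tokens and tokens made up entirely of punctuation ("," or "--")
            if all(ch in punctuation for ch in token):
                continue
            kept.append(normalise(token))
        cleaned.append(kept)
    return cleaned

sample = [["Vampires", ",", "real-world", "--", "films", "."]]
print(preprocess_reviews(sample))
print(sample)  # the input list is left unmodified
```

Checking every character of a token, rather than testing `token in string.punctuation`, also removes multi-character tokens such as `--`, while hyphenated words such as `real-world` are kept.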

**Part 1: Preprocessing**

You can also skip this part for now - it is not required to perform the topic modelling, but you will get better results.

In [None]:
import string

import nltk
from nltk.stem import PorterStemmer

nltk.download('stopwords')  # NLTK's own stopword list, available as an alternative in Part 2

# PorterStemmer is the stemmer that is actually applied below
stemmer = PorterStemmer()

# With stemming: drop single punctuation tokens, then reduce every word to its stem
preprocessed_movie_reviews = []
for review in movie_reviews:
    review = [stemmer.stem(token) for token in review if token not in string.punctuation]
    preprocessed_movie_reviews.append(review)

# Without stemming (nsmr = not stemmed movie reviews): only drop single punctuation tokens
nsmr = []
for review in movie_reviews:
    review = [token for token in review if token not in string.punctuation]
    nsmr.append(review)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


# 3. Topic Modelling using Gensim

Gensim offers an implementation of Latent Dirichlet Allocation (LDA), the most popular topic modelling algorithm, which we discussed in the lecture. If you are working on Google Colab, it is normally already installed there. Otherwise, you can install it with `pip install --upgrade gensim` or if you are using Conda, `conda install -c conda-forge gensim`.

Let's load it, and some other things we use:

In [None]:
import gensim
import gensim.corpora as corpora
import gensim.models as models
import itertools
from operator import itemgetter
print(gensim.__version__)

4.3.1


## Constructing the bag-of-words model

The `gensim.corpora.Dictionary()` class allows you to map words to numbers, which is what we need to make a bag-of-words model. In particular, the doc2bow() function converts a collection of words to a bag-of-words representation:

In [None]:
movie_dictionary = corpora.Dictionary(movie_reviews)
movie_bow_corpus = [movie_dictionary.doc2bow(d) for d in movie_reviews]

Let's see what happened:

In [None]:
print('Number of unique tokens in the dataset:', len(movie_dictionary))

#Checking the first 12 words in the bag-of-words model
print('\nThe first 12 words in the bag-of-words model:')
print(dict(itertools.islice(movie_dictionary.token2id.items(), 12)))

#Checking the first 100 words of the first review
print('\nThe start of the first review:')
print(movie_reviews[0][:100])
#And the filename of that review is...
print('\nThe filename of the first review:')
print(movie_reviewnames[0])

#Which words are used in that review?
print('\nMost frequent words in the first review:')
for i, freq in sorted(movie_bow_corpus[0], key=itemgetter(1), reverse=True)[:20]:
    print(movie_dictionary[i], "-->", freq)
print("...")

Number of unique tokens in the dataset: 50920

The first 12 words in the bag-of-words model:
{'"': 0, '(': 1, ')': 2, ',': 3, '-': 4, '.': 5, '000': 6, '1': 7, '50': 8, ';': 9, 'a': 10, 'about': 11}

The start of the first review:
['vampire', 'lore', 'and', 'legend', 'has', 'always', 'been', 'a', 'popular', 'fantasy', 'element', ',', 'substantiated', 'by', 'not', 'only', 'the', 'sheer', 'number', 'of', 'movies', 'about', 'the', 'subject', ',', 'but', 'also', 'the', 'proliferation', 'of', 'cults', 'and', 'sects', 'of', 'adherents', '.', 'and', ',', 'unlike', 'any', 'of', 'the', 'more', 'outlandish', 'myths', ',', 'the', 'vampire', 'holds', 'some', 'real-world', 'probability', '(', 'one', 'study', 'claims', '1', ',', '000', 'bloodsuckers', 'exist', 'worldwide', ',', 'and', 'places', '50', 'in', 'los', 'angeles', ')', '.', 'but', 'lest', 'the', 'nasties', 'be', 'mistaken', 'for', 'simple', 'comic', 'book', 'bad', 'guys', ',', 'john', 'carpenter', 'would', 'like', 'to', 'remind', 'us', 'th
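To make the bag-of-words idea concrete, here is a hand-rolled sketch in plain Python of what the dictionary and `doc2bow()` step amount to. It is an illustration only, not Gensim's implementation: the helper names `build_token2id` and `to_bow` are invented, and the ids Gensim assigns will generally differ from the ones below.

```python
from collections import Counter

def build_token2id(documents):
    """Give every distinct token an integer id, in order of first appearance."""
    token2id = {}
    for doc in documents:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Turn one tokenized document into sorted (token_id, count) pairs."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

docs = [["the", "vampire", "the", "film"], ["the", "film", "ends"]]
token2id = build_token2id(docs)
bow_corpus = [to_bow(doc, token2id) for doc in docs]

print(token2id)
print(bow_corpus)
```

Word order is lost in this representation: each review is reduced to which words it contains and how often, which is all the topic model gets to see.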

## The topic model

Now, we can train our LDA model on this bag-of-words data by using `gensim.models.ldamodel.LdaModel()`.

This model can take various parameters that specify what kind of model gets made. Some important ones:


* num_topics: how many topics do we want? In what follows, we set the number of topics to 5, because we want a few topics that we can interpret, but the right number of topics depends on the data and the application;
* id2word: our bag-of-words dictionary needed to map ids to strings;
* passes: how often we iterate over the entire corpus (default = 1). In general, the more passes, the higher the accuracy. This number is also called epochs in Artificial Intelligence and Machine Learning.

Let's first make a model that finds 5 topics, and tries 25 times to improve its estimate. This code may take a while to run, as it is the process that creates the topic model. If it takes too long, you can reduce the number of passes, but the topics might be worse.

In [None]:
reviews_ldamodel = models.ldamodel.LdaModel(movie_bow_corpus, num_topics=5, id2word = movie_dictionary, passes=25)

And let's have a look! An easy way to inspect the created topics is the `show_topics()` method, which returns the most representative words for each topic along with their probabilities.

In [None]:
reviews_ldamodel.show_topics(num_words=8) #Show the top 8 words for each topic

[(0,
  '0.018*"," + 0.015*"." + 0.010*""" + 0.009*"a" + 0.009*"the" + 0.008*"and" + 0.007*"to" + 0.007*")"'),
 (1,
  '0.054*"," + 0.052*"the" + 0.045*"." + 0.026*"a" + 0.024*"and" + 0.023*"of" + 0.021*"to" + 0.017*"is"'),
 (2,
  '0.018*"," + 0.014*"the" + 0.012*"and" + 0.011*"." + 0.010*""" + 0.009*"a" + 0.006*"of" + 0.005*"to"'),
 (3,
  '0.002*"silverman" + 0.001*"osmosis" + 0.001*"chucky" + 0.001*"judith" + 0.001*"darren" + 0.001*"seagal" + 0.001*"pollock\'s" + 0.001*"zahn"'),
 (4,
  '0.043*"the" + 0.040*"." + 0.028*"," + 0.023*"and" + 0.021*"a" + 0.020*"to" + 0.019*"is" + 0.016*"of"')]
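The strings returned by `show_topics()` pack every word and its weight into one formatted line, which is awkward to process further. The sketch below unpacks such a string into (word, weight) pairs with a regular expression. The helper name `parse_topic_string` is made up for this example, and the pattern assumes the plain decimal format seen in the output above.

```python
import re

# One term looks like 0.018*"word"; terms are joined by " + "
TERM_PATTERN = re.compile(r'([0-9.]+)\*"(.+?)"(?= \+ |$)')

def parse_topic_string(topic_string):
    """Split a formatted topic string into a list of (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic_string)]

example = '0.018*"," + 0.015*"." + 0.010*""" + 0.009*"a"'
for word, weight in parse_topic_string(example):
    print(f"{word!r}: {weight}")
```

Gensim's `show_topics()` also accepts `formatted=False`, which returns the word and weight pairs directly and avoids the parsing step.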

There we go, we have a topic model. However, you can probably see that it is far from perfect and some uninteresting 'words' appear there. Now, it is your turn to make it better!

## Your model

**Part 1: Preprocessing**

Show the effect of your preprocessing by also making a topic model for your preprocessed_movie_reviews. First, you make a bag-of-words model and then the LdaModel, as above. Feel free to go back to your preprocessing code above and update it based on what you saw from the show_topics function applied to the initial model.

Try to make a model with 8 topics, and show the top 8 words for each topic. **Assign the model to a new variable with a sensible name** (avoid overwriting the previous models).

Also for the dictionary and corpus, **give the variables different and expressive names to avoid overwriting the other ones**. Otherwise, you will get confused between your different topic models.

In [None]:
# your code here
movie_dictionary1 = corpora.Dictionary(preprocessed_movie_reviews)
movie_bow_corpus_1 = [movie_dictionary1.doc2bow(d) for d in preprocessed_movie_reviews]

reviews_ldamodel_1 = models.ldamodel.LdaModel(movie_bow_corpus_1, num_topics=8, id2word = movie_dictionary1, passes=25)
reviews_ldamodel_1.show_topics(num_words=8) 

[(0,
  '0.040*"the" + 0.036*"a" + 0.025*"and" + 0.024*"to" + 0.021*"of" + 0.020*"is" + 0.016*"in" + 0.012*"that"'),
 (1,
  '0.050*"the" + 0.031*"of" + 0.027*"a" + 0.024*"and" + 0.019*"to" + 0.016*"in" + 0.016*"is" + 0.010*"with"'),
 (2,
  '0.001*"relentless" + 0.001*"dietz" + 0.000*"dietz\'" + 0.000*"rossi" + 0.000*"demoralis" + 0.000*"signi" + 0.000*"investigator\'" + 0.000*"mini-seri"'),
 (3,
  '0.061*"the" + 0.030*"a" + 0.028*"and" + 0.028*"of" + 0.025*"to" + 0.022*"is" + 0.017*"in" + 0.011*"it"'),
 (4,
  '0.041*"the" + 0.028*"a" + 0.028*"and" + 0.026*"to" + 0.021*"of" + 0.018*"is" + 0.018*"hi" + 0.014*"in"'),
 (5,
  '0.063*"the" + 0.027*"a" + 0.027*"and" + 0.025*"of" + 0.024*"to" + 0.018*"is" + 0.016*"in" + 0.013*"that"'),
 (6,
  '0.013*"and" + 0.013*"a" + 0.009*"the" + 0.009*"to" + 0.008*"in" + 0.006*"of" + 0.006*"hi" + 0.004*"is"'),
 (7,
  '0.040*"the" + 0.025*"of" + 0.021*"and" + 0.018*"a" + 0.015*"to" + 0.013*"in" + 0.010*"is" + 0.009*"that"')]

**Part 2: Stopwords**

The topics you saw so far are probably mostly made up of stopwords such as "the". As discussed in the lecture, our results will probably be more interesting if we get rid of them.

We have included 3 generic lists of stopwords: the default list of the tool Mallet, a shorter frequent word list used in search applications (Snowball stemmer), and the top 10,000 words based on Google n-grams (in frequency order, select as many lines as you want). Gensim and NLTK also have stopword lists.

Make a function that accepts the path to a stopwords file (e.g. `stopwords/standard-mallet-en.txt`), and returns a list of stopwords.

In [None]:
def load_stopwords(filename):
    """Load a stopword file (one word per line) and return a list of stopwords."""
    stopword_list = []
    with open(filename, encoding='utf-8') as infile:
        for line in infile:
            word = line.strip()
            if word:  # skip empty lines
                stopword_list.append(word)
    return stopword_list

stopword_list = load_stopwords("/content/drive/MyDrive/ddr23-research2-main/2_TopicModelling/stopwords/standard-mallet-en.txt")

Then, make a function that takes a stopword list and a list of reviews (e.g. `preprocessed_movie_reviews`). The function should remove all stopwords from all the reviews, returning a list of the reviews without stopwords. This code may be a bit slow if you have many stopwords, since there is a lot of data to process.

In [None]:
def filter_stopwords(stopword_list, movie_reviews):
    """Remove all stopwords from a list of tokenized reviews and return a new list."""
    stopword_set = set(stopword_list)  # set lookups are much faster than searching a list
    filtered_reviews = []
    for review in movie_reviews:
        filtered_review = [token for token in review if token not in stopword_set]
        filtered_reviews.append(filtered_review)
    return filtered_reviews
    
filtered_movie_reviews = filter_stopwords(stopword_list, preprocessed_movie_reviews)
#print the first 10 reviews
print(filtered_movie_reviews[:10])

#filter and print the first 10 reviews of not stemmed
filtered_notstemmed_movie_reviews = filter_stopwords(stopword_list, nsmr)
print(filtered_notstemmed_movie_reviews[:10])

[['vampir', 'lore', 'legend', 'ha', 'alway', 'popular', 'fantasi', 'element', 'substanti', 'onli', 'sheer', 'number', 'movi', 'subject', 'prolifer', 'cult', 'sect', 'adher', 'unlik', 'ani', 'outlandish', 'myth', 'vampir', 'hold', 'real-world', 'probabl', 'studi', 'claim', '1', '000', 'bloodsuck', 'exist', 'worldwid', 'place', '50', 'lo', 'angel', 'nasti', 'mistaken', 'simpl', 'comic', 'book', 'bad', 'guy', 'john', 'carpent', 'remind', 'alway', 'truli', 'frighten', 'element', 'thriller', 'genr', 'remind', 'doe', 'latest', 'film', 'vampir', 'wa', 'question', 'halloween', 'weekend', 'approach', 'vampir', 'comparison', 'line', "cinema'", 'immens', 'success', 'blade', 'releas', 'august', 'film', 'notic', 'differ', 'stand', 'vampir', 'issu', "don't", 'agre', 'basic', 'point', 'slay', 'method', 'instanc', "blade'", 'main', 'weapon', 'silver', 'garlic', 'wherea', 'main', 'charact', 'jack', "crow'", 'techniqu', 'wooden', 'stake', 'heart', 'blade', 'give', 'face', 'vampir', 'civil', 'carpent', '

Lastly, let's make another topic model with this filtered data! Again, you make a bag-of-words model and then the LdaModel, as above.

Try to make a model with 8 topics, and show the top 8 words for each topic. Assign the model to a new variable with a sensible name (avoid overwriting the previous models).

In [None]:
# your code here
movie_dictionary2 = corpora.Dictionary(filtered_movie_reviews)
movie_bow_corpus_2 = [movie_dictionary2.doc2bow(d) for d in filtered_movie_reviews]

reviews_ldamodel_2 = models.ldamodel.LdaModel(movie_bow_corpus_2, num_topics=5, id2word = movie_dictionary2, passes=25)
reviews_ldamodel_2.show_topics(num_words=8) 


[(0,
  '0.011*"film" + 0.008*"thi" + 0.006*"ha" + 0.005*"wa" + 0.005*"charact" + 0.004*"make" + 0.003*"stori" + 0.003*"love"'),
 (1,
  '0.013*"thi" + 0.011*"film" + 0.010*"movi" + 0.006*"wa" + 0.006*"ha" + 0.005*"it\'" + 0.005*"charact" + 0.004*"scene"'),
 (2,
  '0.010*"thi" + 0.009*"film" + 0.009*"movi" + 0.007*"ha" + 0.006*"wa" + 0.005*"charact" + 0.004*"make" + 0.004*"play"'),
 (3,
  '0.006*"film" + 0.005*"thi" + 0.004*"ha" + 0.003*"movi" + 0.003*"time" + 0.003*"play" + 0.002*"war" + 0.002*"stori"'),
 (4,
  '0.020*"film" + 0.017*"thi" + 0.012*"movi" + 0.009*"wa" + 0.008*"ha" + 0.007*"it\'" + 0.006*"charact" + 0.005*"make"')]

Making a model without stemming, because the stemmed words are hard to read

In [None]:
#modelling with not stemmed words
movie_dictionary3 = corpora.Dictionary(filtered_notstemmed_movie_reviews)
movie_bow_corpus_3 = [movie_dictionary3.doc2bow(d) for d in filtered_notstemmed_movie_reviews]

reviews_ldamodel_3 = models.ldamodel.LdaModel(movie_bow_corpus_3, num_topics=5, id2word = movie_dictionary3, passes=25)
reviews_ldamodel_3.show_topics(num_words=8)

[(0,
  '0.013*"film" + 0.009*"movie" + 0.006*"it\'s" + 0.004*"time" + 0.003*"character" + 0.003*"good" + 0.003*"story" + 0.003*"plot"'),
 (1,
  '0.015*"film" + 0.008*"movie" + 0.004*"it\'s" + 0.003*"films" + 0.003*"good" + 0.003*"time" + 0.002*"characters" + 0.002*"story"'),
 (2,
  '0.014*"film" + 0.007*"movie" + 0.005*"it\'s" + 0.004*"good" + 0.003*"story" + 0.003*"time" + 0.003*"character" + 0.003*"--"'),
 (3,
  '0.010*"film" + 0.004*"movie" + 0.004*"story" + 0.003*"it\'s" + 0.002*"life" + 0.002*"good" + 0.002*"make" + 0.002*"time"'),
 (4,
  '0.014*"film" + 0.011*"movie" + 0.007*"it\'s" + 0.004*"time" + 0.004*"good" + 0.004*"story" + 0.003*"character" + 0.003*"characters"')]

**Part 3: Experimentation**

Are these general stopword lists sufficient? We are working in the movie review domain, so there may be uninformative words that are not stopwords in general, such as the word 'movie'. A key experiment is to add stopwords specific to the movie review domain, which occur frequently in all (or most) of the clusters. Note that removing words will not just hide them, but can lead to (sometimes very) different topics and different top-ranked reviews.

**Make your own domain-specific stopwords file** by taking one of the existing ones and adding your own stopwords (make sure that the stopword file is saved as a plain text file). Think about what stopwords are in this domain (e.g., the word film is not a stopword in general, but it will occur in essentially every film review).
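One possible way to find candidates for such a list is to count in how many reviews each word occurs and to look at the words that appear in a large share of them. The sketch below does this in plain Python; the function names and the 50% threshold are assumptions chosen for illustration, and the output is meant as a list to review by hand, not as a finished stopword file.

```python
from collections import Counter

def document_frequency(reviews):
    """Count in how many reviews each token occurs (not how often in total)."""
    df = Counter()
    for review in reviews:
        df.update(set(review))  # set(): count each token once per review
    return df

def candidate_stopwords(reviews, min_share=0.5):
    """Return the tokens that occur in at least `min_share` of all reviews."""
    df = document_frequency(reviews)
    threshold = min_share * len(reviews)
    return sorted(token for token, count in df.items() if count >= threshold)

toy_reviews = [
    ["film", "vampire", "horror", "film"],
    ["film", "comedy", "funny"],
    ["movie", "film", "war"],
    ["movie", "comedy", "love"],
]
print(candidate_stopwords(toy_reviews, min_share=0.5))
```

In this toy data the genre word "comedy" crosses the threshold together with "film" and "movie", which shows why the candidates need a manual check before they go into the stopword file. Gensim's `Dictionary.filter_extremes()` offers a related frequency-based filter on the dictionary itself.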

Re-use the functions you previously made to load your own stopwords file and filter the movie reviews. Then, make another topic model with your new filtering and show the top 8 words for each topic.

In [None]:
stopwords = load_stopwords('/content/drive/MyDrive/ddr23-research2-main/2_TopicModelling/stopwords/stopwords-empty.txt')

#Filter movie reviews with my stopwords
domainfiltered_movie_reviews = filter_stopwords(stopwords, filtered_movie_reviews)
domainfilteredunstemmed_movie_reviews = filter_stopwords(stopwords, filtered_notstemmed_movie_reviews)

#Make a model with 8 topics
movie_dictionary4 = corpora.Dictionary(domainfiltered_movie_reviews)
movie_bow_corpus_4 = [movie_dictionary4.doc2bow(d) for d in domainfiltered_movie_reviews]

reviews_ldamodel_4 = models.ldamodel.LdaModel(movie_bow_corpus_4, num_topics=8, id2word = movie_dictionary4, passes=25)
reviews_ldamodel_4.show_topics(num_words=8) 

[(0,
  '0.019*"thi" + 0.009*"movi" + 0.007*"ha" + 0.007*"charact" + 0.004*"play" + 0.004*"onli" + 0.004*"thing" + 0.003*"end"'),
 (1,
  '0.014*"thi" + 0.010*"movi" + 0.008*"ha" + 0.006*"stori" + 0.006*"charact" + 0.004*"play" + 0.003*"onli" + 0.003*"vampir"'),
 (2,
  '0.013*"thi" + 0.008*"movi" + 0.007*"ha" + 0.005*"charact" + 0.004*"onli" + 0.003*"jacki" + 0.003*"stori" + 0.003*"perform"'),
 (3,
  '0.009*"thi" + 0.008*"movi" + 0.007*"ha" + 0.006*"charact" + 0.005*"stori" + 0.004*"play" + 0.003*"onli" + 0.003*"love"'),
 (4,
  '0.008*"thi" + 0.008*"movi" + 0.005*"ha" + 0.004*"charact" + 0.003*"onli" + 0.003*"play" + 0.003*"veri" + 0.003*"work"'),
 (5,
  '0.019*"thi" + 0.013*"movi" + 0.007*"ha" + 0.006*"charact" + 0.004*"onli" + 0.004*"action" + 0.004*"veri" + 0.004*"stori"'),
 (6,
  '0.015*"thi" + 0.012*"movi" + 0.008*"ha" + 0.005*"charact" + 0.005*"onli" + 0.004*"stori" + 0.004*"play" + 0.003*"love"'),
 (7,
  '0.012*"thi" + 0.009*"movi" + 0.007*"ha" + 0.005*"charact" + 0.004*"onli" + 0

Model with 8 topics without stemming

In [None]:
stopwords = load_stopwords('/content/drive/MyDrive/ddr23-research2-main/2_TopicModelling/stopwords/stopwords-empty.txt')

domainfilteredunstemmed_movie_reviews = filter_stopwords(stopwords, filtered_notstemmed_movie_reviews)
#Make a model with 8 topics for not stemmed words
moviedictionary5 = corpora.Dictionary(domainfilteredunstemmed_movie_reviews)
movie_bow_corpus_5 = [moviedictionary5.doc2bow(d) for d in domainfilteredunstemmed_movie_reviews]

reviews_ldamodel_5 = models.ldamodel.LdaModel(movie_bow_corpus_5, num_topics=8, id2word = moviedictionary5, passes=25)
reviews_ldamodel_5.show_topics(num_words=15)

[(0,
  '0.003*"scream" + 0.003*"horror" + 0.002*"back" + 0.002*"original" + 0.002*"end" + 0.002*"made" + 0.002*"director" + 0.002*"man" + 0.002*"action" + 0.002*"great" + 0.002*"played" + 0.002*"work" + 0.002*"evil" + 0.002*"2" + 0.002*"years"'),
 (1,
  '0.003*"love" + 0.002*"man" + 0.002*"family" + 0.002*"men" + 0.002*"godzilla" + 0.002*"war" + 0.002*"end" + 0.002*"makes" + 0.002*"ryan" + 0.002*"made" + 0.002*"home" + 0.002*"back" + 0.001*"mother" + 0.001*"comedy" + 0.001*"things"'),
 (2,
  '0.003*"things" + 0.003*"alien" + 0.002*"director" + 0.002*"man" + 0.002*"love" + 0.002*"made" + 0.002*"find" + 0.002*"there\'s" + 0.002*"real" + 0.002*"back" + 0.002*"makes" + 0.002*"i\'m" + 0.001*"work" + 0.001*"town" + 0.001*"action"'),
 (3,
  '0.005*"jackie" + 0.002*"action" + 0.002*"chan" + 0.002*"big" + 0.002*"kind" + 0.002*"makes" + 0.002*"plays" + 0.002*"great" + 0.001*"man" + 0.001*"takes" + 0.001*"war" + 0.001*"end" + 0.001*"money" + 0.001*"can\'t" + 0.001*"martial"'),
 (4,
  '0.003*"love

Now, you should have 3 models (or more): one without any stopword filtering, one with the standard stopword filtering and one with the domain-filtered stopwords using the list you modified yourself. Compare the topics found by the three models (just looking at them is fine, no need to code a comparison).

Do the topics look better with stopword filtering and with domain-specific stopword filtering? At this point, do the resulting topics correspond to particular film genres you would have expected?

reviews_ldamodel == Trained on the raw, unprocessed data in movie_reviews.
The topics consist mostly of punctuation marks and stopwords.

reviews_ldamodel_1 == Trained on the preprocessed data (punctuation removed, words stemmed).
The topics consist of stopwords such as "a", "an" and "the".

reviews_ldamodel_2 == Trained on the stemmed data after removing the Mallet stopwords.
The topics contain word stems such as "movi" and "film".

reviews_ldamodel_3 == Trained on the unstemmed data after removing the Mallet stopwords.
The topics contain words such as "film", "movie", "story" and "life".

reviews_ldamodel_4 == Trained on the stemmed data filtered with the domain-specific stopwords.
The topics contain stems such as "thi", "movi" and "charact"; because of the stemming it is hard to make sense of what a topic is about.

reviews_ldamodel_5 == Trained on the unstemmed data filtered with the domain-specific stopwords.
The topics contain a lot of "good" and "bad", but also words such as "plot", "people" and "director".

Model 5 produces the most readable topics, so we continue with model 5 for further use and optimization.

Increase the number of topics. What happens with the topics if you model very few or very many topics? (answer in a text box). Assign the model(s) to a new variable with a sensible name (avoid overwriting the previous models).

In [None]:
# using lda model 5, with increased topics
# your code here
movie_dictionary6 = corpora.Dictionary(domainfilteredunstemmed_movie_reviews)
movie_bow_corpus6 = [movie_dictionary6.doc2bow(d) for d in domainfilteredunstemmed_movie_reviews]

lda_main = models.ldamodel.LdaModel(movie_bow_corpus6, num_topics=15, id2word = movie_dictionary6, passes=25)
lda_main.show_topics(num_words=10)

[(11,
  '0.003*"man" + 0.003*"mulan" + 0.003*"love" + 0.002*"disney" + 0.002*"director" + 0.002*"young" + 0.002*"comedy" + 0.002*"place" + 0.002*"simon" + 0.002*"family"'),
 (13,
  '0.002*"back" + 0.002*"man" + 0.002*"league" + 0.002*"love" + 0.002*"made" + 0.002*"jeff" + 0.002*"big" + 0.002*"find" + 0.002*"american" + 0.001*"home"'),
 (10,
  '0.003*"love" + 0.002*"back" + 0.002*"great" + 0.002*"big" + 0.002*"man" + 0.002*"makes" + 0.002*"action" + 0.002*"joe" + 0.002*"family" + 0.002*"made"'),
 (0,
  '0.003*"john" + 0.002*"end" + 0.002*"love" + 0.002*"back" + 0.002*"made" + 0.002*"comedy" + 0.002*"man" + 0.002*"years" + 0.002*"real" + 0.002*"funny"'),
 (5,
  '0.002*"funny" + 0.002*"series" + 0.002*"carter" + 0.002*"there\'s" + 0.002*"van" + 0.002*"young" + 0.002*"work" + 0.002*"comedy" + 0.002*"man" + 0.002*"godzilla"'),
 (6,
  '0.006*"scream" + 0.004*"tarzan" + 0.003*"horror" + 0.003*"2" + 0.002*"director" + 0.002*"man" + 0.002*"end" + 0.002*"action" + 0.002*"played" + 0.002*"harry"'

**Answer:**

With few topics, it is hard to recognize film genres in the output. When we modeled only five topics, words such as "funny," "comedy," "family," "horror," and "action" were good genre indicators, but the remaining top words were too general to be informative.

Increasing the number of topics to ten makes it easier to identify distinct genres and even to guess the title of a specific film. For example, in addition to the genre terms mentioned above, we see words like "Truman" and "Kevin," which are the names of specific characters in films.

However, increasing the number of topics has both positive and negative effects. Too few topics result in low-quality output, while too many cause overlapping topics and performance issues. The number of topics therefore has to be balanced against the quality and interpretability of the topics, so that the results stay meaningful and relevant to the analysis.
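The overlap between topics can also be put into a number. As a rough sketch, the code below computes the average Jaccard similarity between the top-word lists of all topic pairs: 0 means no shared words, 1 means identical lists. The function names are invented for this illustration, and the three word lists are hand-picked from the `lda_main` output above, not the complete topics.

```python
from itertools import combinations

def jaccard(words_a, words_b):
    """Share of words that two lists have in common (0.0 to 1.0)."""
    a, b = set(words_a), set(words_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mean_topic_overlap(topics):
    """Average Jaccard similarity over all pairs of topics."""
    pairs = list(combinations(topics, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

topics = [
    ["scream", "tarzan", "horror", "director", "man"],
    ["funny", "series", "carter", "comedy", "man"],
    ["love", "back", "great", "big", "man"],
]
print(mean_topic_overlap(topics))
```

Running this for models with different numbers of topics would give one way to check whether additional topics mostly repeat the same general words.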

Increase the number of topic words printed to get more information per topic.  Is it easier to make sense of a topic if you look further down the list, or are the initial words more clear?

**Answer:** Increasing the number of topic words displayed gives more detail on each topic. Whether looking further down the list helps, however, depends on the model and on the topic in question.

In the topics we printed above, the words at the top of the list often point clearly to film genres or character names, such as "jackie", "truman" and "tarzan". In other topics, the more specific words further down the list are more helpful in understanding the topic. It is therefore worth reading the entire list of topic words to understand each topic and its relevance to the data.

In [None]:
lda_main.show_topics(num_words=20)
# your code here

[(11,
  '0.003*"great" + 0.002*"war" + 0.002*"action" + 0.002*"black" + 0.002*"joe" + 0.002*"made" + 0.002*"death" + 0.002*"men" + 0.002*"end" + 0.002*"john" + 0.002*"director" + 0.002*"work" + 0.002*"thing" + 0.002*"find" + 0.002*"man" + 0.001*"audience" + 0.001*"cast" + 0.001*"comedy" + 0.001*"ryan" + 0.001*"makes"'),
 (14,
  '0.003*"end" + 0.002*"great" + 0.002*"man" + 0.002*"love" + 0.002*"action" + 0.002*"funny" + 0.002*"isn\'t" + 0.002*"back" + 0.002*"audience" + 0.002*"there\'s" + 0.002*"things" + 0.002*"that\'s" + 0.002*"big" + 0.002*"thing" + 0.002*"made" + 0.002*"director" + 0.002*"day" + 0.002*"makes" + 0.002*"performance" + 0.002*"part"'),
 (8,
  '0.003*"funny" + 0.002*"love" + 0.002*"great" + 0.002*"libby" + 0.002*"played" + 0.002*"family" + 0.002*"plays" + 0.002*"man" + 0.002*"sex" + 0.002*"husband" + 0.002*"carry" + 0.002*"deuce" + 0.002*"jones" + 0.002*"there\'s" + 0.002*"judd" + 0.002*"george" + 0.002*"performance" + 0.002*"kenneth" + 0.002*"made" + 0.002*"double"'),
 

If you are interested, you can also experiment with the difference between positive and negative reviews.

### Part 4: Evaluation

There are a few numbers we can compute that indicate the quality of a topic model, such as [perplexity and coherence](https://github.com/ccs-amsterdam/r-course-material/blob/master/tutorials/R_text_LDA_perplexity.md). For perplexity, a lower number means a better model, and for coherence, a higher number is better. Try computing these scores for your models, and see which is the best one according to the numbers.
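A note on reading the perplexity numbers: as far as the Gensim documentation describes it, `log_perplexity()` does not return the perplexity itself but a per-word likelihood bound, and the perplexity is 2 to the power of minus that bound. The sketch below applies this conversion; the helper name `perplexity_from_bound` is made up, and the two bounds are the values reported for models 1 and 3 in this notebook.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word likelihood bound into perplexity: 2 ** (-bound)."""
    return 2 ** (-per_word_bound)

reported_bounds = [
    ("model 1", -7.097968116127571),
    ("model 3", -9.204168679173886),
]
for name, bound in reported_bounds:
    print(f"{name}: bound {bound:.3f} -> perplexity {perplexity_from_bound(bound):.1f}")
```

A bound closer to zero therefore corresponds to a lower perplexity. Bounds from models trained on differently preprocessed data are also hard to compare directly, because the vocabularies differ.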

In a real project, you should compute these numbers over a separate part of the dataset (the test set) for a proper evaluation, but for simplicity and because we have not talked about this in the lecture we will skip that here.
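For reference, a held-out split of the kind mentioned here can be sketched in a few lines of plain Python. The function name, the 20% test share and the seed are arbitrary choices for this illustration.

```python
import random

def train_test_split(documents, test_share=0.2, seed=42):
    """Randomly split a list of documents into a training and a test part."""
    indices = list(range(len(documents)))
    random.Random(seed).shuffle(indices)  # fixed seed: the same split on every run
    n_test = int(len(documents) * test_share)
    test_indices = set(indices[:n_test])
    train = [doc for i, doc in enumerate(documents) if i not in test_indices]
    test = [doc for i, doc in enumerate(documents) if i in test_indices]
    return train, test

docs = [[f"token{i}"] for i in range(10)]
train_docs, test_docs = train_test_split(docs)
print(len(train_docs), len(test_docs))
```

The dictionary and the model would then be built from the training part only, and the scores computed on the bag-of-words version of the test part.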

In [None]:
from gensim.models import CoherenceModel

# Compute perplexity for model 1 (stemmed, no stopword filtering) on its own bag-of-words corpus:
print('Perplexity: ', reviews_ldamodel_1.log_perplexity(movie_bow_corpus_1))  

# Compute coherence score on the same:
coherence_model_lda = CoherenceModel(model=reviews_ldamodel_1, texts=preprocessed_movie_reviews, dictionary=movie_dictionary1, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence score: ', coherence_lda)

Perplexity:  -7.097968116127571
Coherence score:  0.24518595278157773


In [None]:
#computing perplexity for model 2
print('Perplexity for Model 2: ', reviews_ldamodel_2.log_perplexity(movie_bow_corpus_2))

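#computing coherence score for model 2
#Note: texts should be the token lists the dictionary was built from (filtered_movie_reviews).
#The raw movie_reviews passed here contain no stems such as "thi" or "movi", which is the likely cause of the nan below.
coherence_model_lda = CoherenceModel(model=reviews_ldamodel_2, texts=movie_reviews, dictionary=movie_dictionary2, coherence='c_v')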
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence score for Model 2: ', coherence_lda)

Perplexity for Model 2:  -8.47365523892959
Coherence score for Model 2:  nan


  m_lr_i = np.log(numerator / denominator)
  return cv1.T.dot(cv2)[0, 0] / (_magnitude(cv1) * _magnitude(cv2))
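A plausible explanation for the `nan` (an assumption, not verified in this notebook) is that the model's top words are stems such as "thi" and "movi", while `texts=movie_reviews` contains the raw, unstemmed tokens, so the coherence measure finds no occurrences to count. The sketch below shows a quick check for this kind of mismatch; the function name and the toy data are invented for illustration.

```python
def share_of_top_words_in_texts(top_words, texts):
    """Fraction of a topic's top words that occur anywhere in the texts."""
    vocabulary = set()
    for text in texts:
        vocabulary.update(text)
    if not top_words:
        return 0.0
    found = [word for word in top_words if word in vocabulary]
    return len(found) / len(top_words)

raw_texts = [["this", "movie", "has", "characters"], ["the", "film", "was", "good"]]
stemmed_top_words = ["thi", "movi", "ha", "charact", "film"]
print(share_of_top_words_in_texts(stemmed_top_words, raw_texts))
```

A low share suggests that `texts` is not the data the dictionary and model were built from; passing the matching token lists (for model 2, `filtered_movie_reviews`) would be the first thing to try.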


In [None]:
#computing perplexity for model 3
print('Perplexity for Model 3: ', reviews_ldamodel_3.log_perplexity(movie_bow_corpus_3))

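#computing coherence score for model 3
#Note: texts=movie_reviews is the raw data; the token lists this dictionary was built from are filtered_notstemmed_movie_reviews.
coherence_model_lda = CoherenceModel(model=reviews_ldamodel_3, texts=movie_reviews, dictionary=movie_dictionary3, coherence='c_v')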
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence score for Model 3: ', coherence_lda)


Perplexity for Model 3:  -9.204168679173886
Coherence score for Model 3:  0.26105578270279745


In [None]:
#computing perplexity for model 4
print('Perplexity for Model 4: ', reviews_ldamodel_4.log_perplexity(movie_bow_corpus_4))

#computing coherence score for model 4
coherence_model_lda = CoherenceModel(model=reviews_ldamodel_4, texts=movie_reviews, dictionary=movie_dictionary4, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence score for Model 4: ', coherence_lda)


Perplexity for Model 4:  -8.653652036370891
Coherence score for Model 4:  nan


In [None]:
#computing perplexity for model 5 
print('Perplexity for Model 5: ', lda_main.log_perplexity(movie_bow_corpus_5))

#computing coherence score for model 5
coherence_model_lda = CoherenceModel(model=lda_main, texts=domainfilteredunstemmed_movie_reviews, dictionary=movie_dictionary6, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence score for Model 5: ', coherence_lda)


Perplexity for Model 5:  -9.588741545531217
Coherence score for Model 5:  0.25501913804045984


In [None]:
#computing perplexity for lda main 
print('Perplexity for lda main: ', lda_main.log_perplexity(movie_bow_corpus6))

#computing coherence score for lda main
coherence_model_lda = CoherenceModel(model=lda_main, texts=domainfilteredunstemmed_movie_reviews, dictionary=movie_dictionary6, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence score for lda main: ', coherence_lda)

Perplexity for lda main:  -9.57793832337022
Coherence score for lda main:  0.2715332391817373
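
The values printed by `log_perplexity` are negative because Gensim returns a per-word likelihood bound, not the perplexity itself. The Gensim documentation gives the relation perplexity = 2^(-bound). The sketch below applies that conversion to the numbers printed above and picks out the highest coherence score while skipping the `nan` values. The dictionary `scores` is copied by hand from the outputs. Because the models were trained on differently preprocessed corpora, treat the perplexity comparison as indicative only.

```python
import math

# (per-word bound, c_v coherence) copied from the cell outputs above.
scores = {
    "Model 1": (-7.097968116127571, 0.24518595278157773),
    "Model 2": (-8.47365523892959, float("nan")),
    "Model 3": (-9.204168679173886, 0.26105578270279745),
    "Model 4": (-8.653652036370891, float("nan")),
    "Model 5": (-9.588741545531217, 0.25501913804045984),
    "lda main": (-9.57793832337022, 0.2715332391817373),
}

def bound_to_perplexity(bound):
    """Convert a per-word likelihood bound into a perplexity estimate."""
    return 2 ** (-bound)

perplexities = {name: bound_to_perplexity(b) for name, (b, _) in scores.items()}
# Drop models whose coherence could not be computed (nan).
valid_coherence = {name: c for name, (_, c) in scores.items() if not math.isnan(c)}

lowest_perplexity = min(perplexities, key=perplexities.get)
highest_coherence = max(valid_coherence, key=valid_coherence.get)

for name, value in perplexities.items():
    print(f"{name}: perplexity estimate {value:.1f}")
print("Lowest perplexity:", lowest_perplexity)
print("Highest coherence:", highest_coherence)
```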


In [None]:
print(movie_reviews[:10])
print(domainfilteredunstemmed_movie_reviews[:10])

[['vampire', 'lore', 'and', 'legend', 'has', 'always', 'been', 'a', 'popular', 'fantasy', 'element', ',', 'substantiated', 'by', 'not', 'only', 'the', 'sheer', 'number', 'of', 'movies', 'about', 'the', 'subject', ',', 'but', 'also', 'the', 'proliferation', 'of', 'cults', 'and', 'sects', 'of', 'adherents', '.', 'and', ',', 'unlike', 'any', 'of', 'the', 'more', 'outlandish', 'myths', ',', 'the', 'vampire', 'holds', 'some', 'real-world', 'probability', '(', 'one', 'study', 'claims', '1', ',', '000', 'bloodsuckers', 'exist', 'worldwide', ',', 'and', 'places', '50', 'in', 'los', 'angeles', ')', '.', 'but', 'lest', 'the', 'nasties', 'be', 'mistaken', 'for', 'simple', 'comic', 'book', 'bad', 'guys', ',', 'john', 'carpenter', 'would', 'like', 'to', 'remind', 'us', 'that', 'they', 'are', '-', 'and', 'always', 'have', 'been', '-', 'a', 'truly', 'frightening', 'element', 'of', 'the', 'thriller', 'genre', '.', 'and', 'remind', 'us', 'he', 'does', 'in', 'his', 'latest', 'film', ',', 'vampires', '.'

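After calculating the scores for all the LDA models we created, we found that LDA model 3 has the highest coherence among the numbered models (0.261), while lda main scores slightly higher still (0.272). LDA models 2 and 4 show 'nan' for coherence.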

However, just comparing numbers is not very interpretable. We will choose our topic model with the highest coherence score and validate the evaluation.

Using the top 20 topic words for each topic in the model with the highest coherence score, pick at least 5 topic numbers and determine what film genres (in an informal sense) they represent, i.e. think of a meaningful label for the topic. Write down the topic number and your topic label. Is it easy to guess what the topic represents? For how many topics are you fairly confident, for how many do you have to make a guess, and for how many do you have no real clue?
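
When labelling topics by hand, it can help to strip the weights from the topic strings and look only at the words. The sketch below parses a string in the `weight*"word" + ...` format that `show_topics` prints. The helper `parse_topic` and the shortened example string are ours, for illustration only.

```python
import re

def parse_topic(topic_string):
    """Turn '0.003*"star" + 0.003*"trek"' into [('star', 0.003), ('trek', 0.003)]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

# Shortened example in the same format as the show_topics output.
example = '0.003*"star" + 0.003*"trek" + 0.002*"love" + 0.002*"series" + 0.002*"isn\'t"'
topic_words = [word for word, _ in parse_topic(example)]
print(topic_words)
```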

In [None]:
lda_main.show_topics(num_words=20)

[(0,
  '0.002*"back" + 0.002*"man" + 0.002*"young" + 0.002*"work" + 0.002*"great" + 0.002*"made" + 0.002*"comedy" + 0.002*"years" + 0.002*"audience" + 0.002*"54" + 0.002*"makes" + 0.002*"things" + 0.002*"performance" + 0.001*"role" + 0.001*"watch" + 0.001*"original" + 0.001*"american" + 0.001*"fact" + 0.001*"funny" + 0.001*"big"'),
 (5,
  '0.003*"star" + 0.003*"trek" + 0.002*"love" + 0.002*"series" + 0.002*"man" + 0.002*"back" + 0.002*"cast" + 0.002*"script" + 0.002*"family" + 0.002*"made" + 0.002*"end" + 0.002*"director" + 0.002*"real" + 0.002*"lost" + 0.002*"smith" + 0.002*"batman" + 0.002*"role" + 0.001*"plays" + 0.001*"great" + 0.001*"makes"'),
 (14,
  '0.003*"end" + 0.002*"great" + 0.002*"man" + 0.002*"love" + 0.002*"action" + 0.002*"funny" + 0.002*"isn\'t" + 0.002*"back" + 0.002*"audience" + 0.002*"there\'s" + 0.002*"things" + 0.002*"that\'s" + 0.002*"big" + 0.002*"thing" + 0.002*"made" + 0.002*"director" + 0.002*"day" + 0.002*"makes" + 0.002*"performance" + 0.002*"part"'),
 (2,


**Answer:** In general, it is relatively easier to identify specific movie genres with this version of the LDA model than with the other models.


### Confident topics
7: Star Wars Jedi / Science Fiction action movie ('star', 'wars', 'effects', 'action', 'phantom', 'jedi', 'lucas')


4: Tarzan / Disney family movie ('tarzan', 'performance', 'mother', 'love', 'disney', 'young')

### Unsure
0: American youth comedy movie ('young', 'comedy', 'performance', 'american', 'funny')


1: Romantic action movie ('love', 'action', 'sex', 'earth')

11: comedy action movie ('war', 'action', 'joe', 'comedy', 'ryan', 'john')


8: Romantic comedy family movie ('funny', 'love', 'family', 'sex', 'performance')

### No clue
12: Probably a Science Fiction movie, but the given words are insufficient to decide on a specific genre.


14: Probably a romantic action comedy movie, but the given words are insufficient to decide on a specific genre.

In [None]:
reviews_ldamodel.get_term_topics("the", minimum_probability = 1e-3)

[(0, 0.008768726), (1, 0.051883694), (2, 0.014157759), (4, 0.042809784)]

Do this for your own best model and the labels you just picked. For each of your topic labels, if the probability for the label is the highest for the topic number you wrote down, your guess was probably correct. Did you guess a suitable label for every topic?

In [None]:
# for 7: Star Wars Jedi / Science Fiction action movie
reviews_ldamodel_5.get_term_topics("wars", minimum_probability = 1e-3)

[(7, 0.0012031012)]

In [None]:
# for 4: Tarzan / Disney family movie
reviews_ldamodel_5.get_term_topics("disney", minimum_probability = 1e-3)

[]

In [None]:
# for 0: American youth comedy movie
reviews_ldamodel_5.get_term_topics("young", minimum_probability = 1e-3)

[(0, 0.0012168501),
 (3, 0.0010463247),
 (4, 0.0014712237),
 (5, 0.0012671159),
 (7, 0.0015831874)]

In [None]:
# for 1: Romantic action movie
reviews_ldamodel_5.get_term_topics("love", minimum_probability = 1e-3)

[(0, 0.0011613829),
 (1, 0.0030778323),
 (2, 0.0018016004),
 (3, 0.0012398992),
 (4, 0.0026947956),
 (5, 0.0018552404),
 (6, 0.001812625),
 (7, 0.001192238)]

In [None]:
# for 11: comedy action movie
reviews_ldamodel_5.get_term_topics("comedy", minimum_probability = 1e-3)

[(0, 0.0010416702),
 (1, 0.0014102843),
 (2, 0.0010323751),
 (3, 0.0012745974),
 (6, 0.0018786602),
 (7, 0.0022394771)]

In [None]:
# for 8: Romantic comedy family movie
reviews_ldamodel_5.get_term_topics("sex", minimum_probability = 1e-3)

[(2, 0.0011390533)]

**Answer:**

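For 7 and 1, we guessed a suitable label, as the probability of the label word was the highest for these topics.

For 0, the probability was only the fourth highest of the five topics returned.

For 11 and 8, the chosen topic number was not in the result.

There was no result for 4: the empty list presumably means that no topic gives 'disney' a probability of at least the `minimum_probability` threshold of 1e-3.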
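
The check described above (is the guessed topic the one with the highest probability for the label word?) can be sketched as a small ranking helper. The function `topic_rank` is ours, for illustration only. The input lists are copied from the `get_term_topics` outputs above.

```python
def topic_rank(term_topics, guessed_topic):
    """Return the 1-based rank of guessed_topic by probability, or None if absent."""
    ranked = sorted(term_topics, key=lambda pair: pair[1], reverse=True)
    for rank, (topic_id, _) in enumerate(ranked, start=1):
        if topic_id == guessed_topic:
            return rank
    return None

# Outputs of get_term_topics copied from the cells above.
young = [(0, 0.0012168501), (3, 0.0010463247), (4, 0.0014712237),
         (5, 0.0012671159), (7, 0.0015831874)]
love = [(0, 0.0011613829), (1, 0.0030778323), (2, 0.0018016004),
        (3, 0.0012398992), (4, 0.0026947956), (5, 0.0018552404),
        (6, 0.001812625), (7, 0.001192238)]
comedy = [(0, 0.0010416702), (1, 0.0014102843), (2, 0.0010323751),
          (3, 0.0012745974), (6, 0.0018786602), (7, 0.0022394771)]

print("young, topic 0:", topic_rank(young, 0))
print("love, topic 1:", topic_rank(love, 1))
print("comedy, topic 11:", topic_rank(comedy, 11))
```

A rank of 1 supports the label. A result of `None` means the guessed topic number does not appear in the list at all.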

In a real project, you would also want to validate your topics by examining the reviews that are most strongly associated with that topic. You can see what documents have what topics using the get_document_topics() method. Here we look at the topics for the first document in the model (change the name of the model to yours):

In [None]:
reviews_ldamodel_5.get_document_topics(movie_bow_corpus[0], minimum_probability = 0)

[(0, 0.00020208835),
 (1, 0.00020215847),
 (2, 0.67783415),
 (3, 0.00020212383),
 (4, 0.32095322),
 (5, 0.0002020559),
 (6, 0.00020205133),
 (7, 0.00020213268)]

Or for the first 20 of them:

In [None]:
for i, doc_topics in enumerate(reviews_ldamodel_5.get_document_topics(movie_bow_corpus)):
    if i >= 20:
        break
    print(f"Topics for the review {movie_reviewnames[i]}: {doc_topics}")

Topics for the review cv646_15065.txt: [(2, 0.67783535), (4, 0.32095203)]
Topics for the review cv214_12294.txt: [(0, 0.12752531), (1, 0.28005826), (2, 0.36358535), (4, 0.22777729)]
Topics for the review cv359_6647.txt: [(0, 0.29968885), (1, 0.06936749), (2, 0.48840475), (4, 0.14156759)]
Topics for the review cv427_10825.txt: [(0, 0.23593853), (1, 0.18432239), (2, 0.50961334), (4, 0.06846304)]
Topics for the review cv091_7400.txt: [(0, 0.4947778), (1, 0.1730839), (2, 0.29577956), (4, 0.0353083)]
Topics for the review cv409_29786.txt: [(0, 0.13893992), (1, 0.09574248), (2, 0.6286759), (4, 0.09736471), (5, 0.012029308), (6, 0.026944764)]
Topics for the review cv876_9390.txt: [(0, 0.02980622), (1, 0.14525464), (2, 0.63666195), (4, 0.18678387)]
Topics for the review cv673_24714.txt: [(0, 0.09601555), (1, 0.1317539), (2, 0.46734402), (3, 0.010601408), (4, 0.13929486), (6, 0.15483601)]
Topics for the review cv955_25001.txt: [(0, 0.3235855), (1, 0.07736087), (2, 0.3817947), (4, 0.127152), (6,

But this assignment is already long enough so I will not ask you to report on this too!
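
As a rough illustration of the validation step mentioned above, the sketch below ranks reviews by their probability for a single topic, so that the most strongly associated reviews can be read first. The helper `top_reviews_for_topic` is ours, for illustration only. The input is a hand-copied subset of the document-topic lists printed above.

```python
# A few document-topic lists copied from the printed output above.
doc_topics = {
    "cv646_15065.txt": [(2, 0.67783535), (4, 0.32095203)],
    "cv214_12294.txt": [(0, 0.12752531), (1, 0.28005826), (2, 0.36358535), (4, 0.22777729)],
    "cv359_6647.txt": [(0, 0.29968885), (1, 0.06936749), (2, 0.48840475), (4, 0.14156759)],
    "cv427_10825.txt": [(0, 0.23593853), (1, 0.18432239), (2, 0.50961334), (4, 0.06846304)],
    "cv091_7400.txt": [(0, 0.4947778), (1, 0.1730839), (2, 0.29577956), (4, 0.0353083)],
}

def top_reviews_for_topic(doc_topics, topic_id, n=3):
    """Return the n reviews with the highest probability for topic_id."""
    # Topics missing from a review's list are treated as probability 0.
    scored = [(dict(topics).get(topic_id, 0.0), name) for name, topics in doc_topics.items()]
    scored.sort(reverse=True)
    return [(name, prob) for prob, name in scored[:n]]

for name, prob in top_reviews_for_topic(doc_topics, topic_id=2):
    print(f"{name}: {prob:.3f}")
```

Reading the top few reviews for a topic is a quick way to check whether the label chosen from the topic words actually fits the documents.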