# Assignment 1 for FIT5212, Semester 1

**Student Name:**  Phakhanan Rataphaibul

**Student ID:**    33654735

## Part 2: Topic Modelling
### Chosen configurations:
#### 1) With and without bigrams usage
#### 2) Use different number of topics: k = 10 for models with bigrams and k = 40 for models without bigrams

####Justification: Models without are typically simpler because they only consider individual words, hence a lower number of topics (like 10) might suffice to capture the essential themes without overcomplicating the model, while models with bigrams allows the model to capture more nuanced and specific themes because it considers word pairs as single tokens. therefore, setting a higher number of topics (like 40) might be more appropriate to explore the detailed thematic structures that bigrams can reveal.


In [7]:
# Please uncomment if any of the libraries are not yet installed

#!pip3 install pyldavis
#!pip3 install scikit-learn
# nltk.download('stopwords')

In [12]:
# Importing necessary libraries and modules

%matplotlib inline

from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.tokenize import wordpunct_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np
import nltk

import time
from gensim.models import Phrases

import re, spacy, string

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from pprint import pprint
from gensim.corpora import Dictionary

import pyLDAvis
import pyLDAvis.lda_model
import matplotlib.pyplot as plt
%matplotlib inline

from plotly.offline import plot
import plotly.graph_objects as go
import plotly.express as px

import seaborn as sns

In [5]:
# Download necessary resources for NLTK
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [6]:
# Define file paths for the training, testing, and dev datasets
train_path = 'train_set.csv'
test_path = 'dev_set.csv'
validation_path = 'test_set.csv'

In [7]:
# Read the CSV files
train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)
validation_df = pd.read_csv(validation_path)

In [9]:
# Create a copy of the training dataframe
df = train_df.copy()

# Combine the 'title' and 'abstract' columns into a single column named 'combined'
df['combined'] = df['title'] + ' ' + df['abstract']

# Create subsets of the dataframe with different numbers of training samples
train_1000_df = df.iloc[:1000]
train_20000_df = df.iloc[:20000]

# Extract the 'title' column from the subsets and convert to lists
docs_1000 = train_1000_df['title'].tolist()
docs_20000 = train_20000_df['title'].tolist()

# Create copies of the lists to preserve the original data
raw_docs_1000 = docs_1000.copy()
raw_docs_20000 = docs_20000.copy()


## Text pre-processing inludes tokenization, stopwords removal, and stemming.

In [10]:
# Define a function for preprocessing documents
def preprocess_docs(docs):
    # Initialize the tokenizer and stemmer
    tokenizer = RegexpTokenizer(r'\w+')
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words('english'))  # Load stopwords
    # Tokenize, remove stopwords, and clean the documents
    processed_docs = []
    for idx in range(len(docs)):
        docs[idx] = docs[idx].lower()  # Convert to lowercase
        tokens = tokenizer.tokenize(docs[idx])  # Split into words
        # Remove stopwords and apply stemming
        filtered_tokens = [stemmer.stem(token) for token in tokens if token not in stop_words and not token.isnumeric() and len(token) > 1]
        # Append processed tokens to list
        processed_docs.append(filtered_tokens)

    return processed_docs


In [11]:
# Get the preprocessed data without bigrams
processed_docs_1000 = preprocess_docs(docs_1000)
processed_docs_20000 = preprocess_docs(docs_20000)


In [None]:
# Define function to add bigrams and trigrams to docs (only ones that appear 20 times or more).
def add_bigrams(docs):
  bigram = Phrases(docs, min_count = 20)
  for idx in range(len(docs)):
      for token in bigram[docs[idx]]:
          if '_' in token:
              # Token is a bigram, add to document.
              docs[idx].append(token)
  return docs



`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.



In [None]:
# Get the preprocessed data with bi-grams
bigram_1000_docs = add_bigrams(processed_docs_1000)
bigram_20000_docs = add_bigrams(processed_docs_20000)



`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.



In [None]:
# Define function to create a dictionary representation of the documents
def get_dictionary(docs):
    # Create a dictionary representation of the documents.
    dictionary = Dictionary(docs)
    # Filter out words that occur less than 20 documents, or more than 50% of the documents.
    dictionary.filter_extremes(no_below=20, no_above=0.5)
    return dictionary

# Define function to create a bag-of-words representation of the documents using the passed
def get_corpus(docs, dictionary):
    # Bag-of-words representation of the documents using the passed dictionary.
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    return corpus



`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.



In [None]:
# Get the dictionary representation
dictionary_1000 = get_dictionary(processed_docs_1000)
dictionary_20000 = get_dictionary(processed_docs_20000)
bigram_dictionary_1000 = get_dictionary(bigram_1000_docs)
bigram_dictionary_20000 = get_dictionary(bigram_20000_docs)

# Get the bag of words representation using the respective dictionary
corpus_1000 = get_corpus(processed_docs_1000, dictionary_1000)
corpus_20000 = get_corpus(processed_docs_20000, dictionary_20000)
bigram_corpus_1000 = get_corpus(bigram_1000_docs, bigram_dictionary_1000)
bigram_corpus_20000 = get_corpus(bigram_20000_docs, bigram_dictionary_20000)


`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.



In [None]:
# Print the number of unique tokens and documents for unigram and bigram representations
# For the first 1000 rows
print('Number of unique tokens: %d' % len(dictionary_1000))
print('Number of documents: %d' % len(corpus_1000))

# For the first 20000 rows
print('Number of unique tokens: %d' % len(dictionary_20000))
print('Number of documents: %d' % len(corpus_20000))

# For the first 1000 rows
print('Number of unique tokens: %d' % len(bigram_dictionary_1000))
print('Number of documents: %d' % len(bigram_corpus_1000))

# For the first 20000 rows
print('Number of unique tokens: %d' % len(bigram_dictionary_20000))
print('Number of documents: %d' % len(bigram_corpus_20000))

Number of unique tokens: 61
Number of documents: 1000
Number of unique tokens: 1296
Number of documents: 20000
Number of unique tokens: 61
Number of documents: 1000
Number of unique tokens: 1296
Number of documents: 20000



`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.



In [None]:
import logging
from gensim.models import LdaModel

# Define function to rrain an LDA topic model
def train_topic_model(dictionary, corpus, NUM_TOPICS, chunksize, passes, iterations, output_filename):
  # Set training parameters.
  NUM_TOPICS = NUM_TOPICS
  chunksize = chunksize
  passes = passes
  iterations = iterations
  eval_every = None
  # Make a index to word dictionary.
  temp = dictionary[0]
  id2word = dictionary.id2token
  # Initialize the model
  model = LdaModel(
      corpus = corpus,
      id2word = id2word,
      chunksize = chunksize,
      alpha = 'auto',
      eta = 'auto',
      iterations = iterations,
      num_topics = NUM_TOPICS,
      passes = passes,
      eval_every = eval_every
  )
  # Save the model
  model.save(output_filename)
  print(f"Saving the model in {output_filename}")
  return model


`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.



In [None]:
# Train LDA models for different datasets with and without bigrams
# For the first 1000 rows
model_1000_noBigrams = train_topic_model(dictionary_1000, corpus_1000, 10, 2000, 30, 500, 'model_1000_noBigrams') # Number of topics = 10
model_1000_bigrams = train_topic_model(bigram_dictionary_1000, bigram_corpus_1000, 40, 2000, 30, 500, 'model_1000_bigrams') # Number of topics = 40

# For the first 20000 rows
model_20000_noBigrams = train_topic_model(dictionary_20000, corpus_20000, 10, 40000, 30, 500, 'model_20000_noBigrams') # Number of topics = 10
model_20000_bigrams = train_topic_model(bigram_dictionary_20000, bigram_corpus_20000, 40, 40000, 30, 500, 'model_20000_bigrams') # Number of topics = 40


`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.



Saving the model in model_1000_noBigrams
Saving the model in model_20000_noBigrams
Saving the model in model_1000_bigrams
Saving the model in model_20000_bigrams


In [None]:
# Define function to evaluate the model
def evaluate_model(model, corpus, model_name):
    # Get the list of topic coherence scores and the topics themselves
    top_topics = model.top_topics(corpus)
    # Calculate the average topic coherence across all topics
    avg_topic_coherence = sum([t[1] for t in top_topics]) / model.num_topics
    print(f"Average topic coherence for {model_name}: {avg_topic_coherence:.4f}")
    # Print all the topics with the words
    topics = model.print_topics(num_words=20)
    for topic_number, topic in enumerate(topics):
        print(f"Topic #{topic_number + 1} for {model_name}:")
        print(topic)
    return topics


`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.



In [None]:
# Evaluate each LDA model and store the top topics
# For the first 1000 rows
top_topics_1000_noBigrams = evaluate_model(model_1000_noBigrams, corpus_1000, "model_1000_noBigrams") # Number of topics = 10
top_topics_1000_bigrams = evaluate_model(model_1000_bigrams, bigram_corpus_1000, "model_1000_bigrams") # Number of topics = 40

# For the first 1000 rows
top_topics_20000_noBigrams = evaluate_model(model_20000_noBigrams, corpus_20000, "model_20000_noBigrams") # Number of topics = 10
top_topics_20000_bigrams = evaluate_model(model_20000_bigrams, bigram_corpus_20000, "model_20000_bigrams") # Number of topics = 40


`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.



Average topic coherence for model_1000_noBigrams: -7.0351
Topic #1 for model_1000_noBigrams:
(0, '0.189*"network" + 0.135*"neural" + 0.135*"deep" + 0.098*"neural_network" + 0.072*"convolut" + 0.063*"learn" + 0.057*"imag" + 0.054*"use" + 0.050*"approach" + 0.045*"inform" + 0.029*"base" + 0.028*"data" + 0.012*"train" + 0.010*"classif" + 0.008*"estim" + 0.002*"segment" + 0.001*"graph" + 0.001*"detect" + 0.000*"unsupervis" + 0.000*"analysi"')
Topic #2 for model_1000_noBigrams:
(1, '0.141*"supervis" + 0.110*"self" + 0.100*"learn" + 0.098*"featur" + 0.095*"local" + 0.095*"attent" + 0.065*"predict" + 0.061*"imag" + 0.052*"visual" + 0.035*"use" + 0.034*"deep" + 0.030*"base" + 0.024*"segment" + 0.022*"improv" + 0.013*"via" + 0.010*"adapt" + 0.001*"recognit" + 0.001*"unsupervis" + 0.000*"network" + 0.000*"semant"')
Topic #3 for model_1000_noBigrams:
(2, '0.165*"multi" + 0.115*"graph" + 0.092*"adversari" + 0.085*"learn" + 0.080*"via" + 0.077*"task" + 0.076*"gener" + 0.061*"knowledg" + 0.058*"netw

In [None]:
import pyLDAvis.gensim

# Define function to display the LDA visualization
def lda_vis_display(model, corpus, dictionary):
  lda_display = pyLDAvis.gensim.prepare(model, corpus, dictionary, sort_topics = False)
  return lda_display


`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.



## Display interactive LDA Visualization

In [None]:
# Generate and display the interactive LDA visualization for the model trained on 1000 documents without bigrams, number of topics = 10
vis_1000_noBigrams = lda_vis_display(model_1000_noBigrams, corpus_1000, dictionary_1000)
pyLDAvis.display(vis_1000_noBigrams)


`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.



In [None]:
# Generate and display the interactive LDA visualization for the model trained on 20000 documents without bigrams, number of topics = 10
vis_20000_noBigrams = lda_vis_display(model_20000_noBigrams, corpus_20000, dictionary_20000)
pyLDAvis.display(vis_20000_noBigrams)


`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.



In [None]:
# Generate and display the interactive LDA visualization for the model trained on 1000 documents with bigrams, number of topics = 40
vis_1000_bigrams = lda_vis_display(model_1000_bigrams, bigram_corpus_1000, bigram_dictionary_1000)
pyLDAvis.display(vis_1000_bigrams)


`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.



In [None]:
# Generate and display the interactive LDA visualization for the model trained on 20000 documents with bigrams, number of topics = 40
vis_20000_bigrams = lda_vis_display(model_20000_bigrams, bigram_corpus_20000, bigram_dictionary_20000)
pyLDAvis.display(vis_20000_bigrams)


`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.

