**Student Name:**  Anh Huy Phung

**Student ID:**    34140298

# Task 2

In this task, I will run two variations of the LDA model, with the key **difference being the inclusion or exclusion of bigrams** in the training data (for both 1000 and 20000 articles).

## Connect to Google Drive and install the necessary libraries

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
# !pip3 install scikit-learn plotly gensim -q
# !pip3 uninstall patsy -y
# !pip3 install patsy
# !pip3 uninstall seaborn -y
# !pip3 install seaborn
import re
import nltk
import spacy
import string
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import time
import csv

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix,
    precision_score, recall_score, f1_score, matthews_corrcoef,
    precision_recall_curve
)

import plotly.graph_objects as go
import plotly.express as px

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer

import copy  # Import the copy module for deep copy


In [4]:
# !pip3 install pyldavis
import pyLDAvis
import pyLDAvis.lda_model



In [7]:
# !pip3 uninstall -y torch torchtext torchvision torchaudio
# !pip3 install torch==2.0.0+cu118 torchvision==0.15.1+cu118 torchaudio==2.0.0+cu118 torchtext==0.15.1 --index-url https://download.pytorch.org/whl/cu118

import torch
print("torch version:", torch.__version__)
import torchtext
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.functional import to_map_style_dataset

torch version: 2.0.0+cu118


## 1. Data Preprocessing and loading data from file

Since the topics in the dataset (such as Computational Linguistics, Machine Learning, and Human-Computer Interaction) share many words, such as "computer" and "learning", the Title field alone is too short and too vague to distinguish them well. I therefore use the Abstract field, which provides much more detailed and specific information and makes it easier to separate the topics clearly.

In this task, I will use **torchtext** (tokenizer) and **nltk** (stopword list and WordNet lemmatizer) for text preprocessing, including:
- Tokenization
- Stopword removal
- Lemmatization

The reason is that stopwords typically carry little meaningful information for topic analysis, so I keep only the non-stopword tokens. I also want the remaining words to be in meaningful base forms, which lemmatization provides. Unlike stemming, lemmatization maps words to their dictionary forms, which preserves semantic meaning and makes the resulting topics easier to interpret, so it suits this experimental setup.
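
As a rough illustration of the filtering steps only (not the lemmatization), here is a minimal sketch in plain Python. The stopword set `TOY_STOPWORDS` and the function `toy_preprocess` are hypothetical stand-ins: the notebook itself uses the torchtext tokenizer, which also handles punctuation, and NLTK's full English stopword list.

```python
# Hypothetical, deliberately tiny stopword list; the notebook uses NLTK's English list.
TOY_STOPWORDS = {"the", "is", "a", "of", "and", "to", "in"}

def toy_preprocess(text, stopwords=TOY_STOPWORDS, min_len=3):
    """Lowercase, split on whitespace, then apply the stopword, alphabetic and length filters."""
    tokens = text.lower().split()
    # Keep alphabetic tokens that are not stopwords.
    tokens = [t for t in tokens if t.isalpha() and t not in stopwords]
    # Drop very short tokens (the notebook keeps tokens longer than 2 characters).
    return [t for t in tokens if len(t) >= min_len]

example = toy_preprocess("The discovery of a molecule is pivotal in AI research")
print(example)
```

Short tokens such as "ai" are removed by the length filter, which is worth keeping in mind when reading the topics later.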


In [8]:
tokenizer = get_tokenizer('basic_english')

# Ensure required NLTK resources are available
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')

# Initialize tools
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


In [9]:
# Tokenize the text with 2 options: return a list of tokens, or the processed text after stopword removal and lemmatization
def tokenize_text(text, remove_stopwords=False, lemmatize=False, stem=False, pre_process=False, is_return_text=False):
    # Tokenize the text
    tokens = tokenizer(text)

    # Preprocess = remove stopwords and keep only alphabetic tokens
    if pre_process or remove_stopwords:
        tokens = [t for t in tokens if t not in stop_words and t.isalpha()]

    # Lemmatization
    if lemmatize:
        tokens = [lemmatizer.lemmatize(t) for t in tokens]

    # Stemming
    if stem:
        tokens = [stemmer.stem(t) for t in tokens]

    # Drop tokens of one or two characters
    tokens = [token for token in tokens if len(token) > 2]
    if is_return_text:
        return ' '.join(tokens)

    return tokens



In [10]:
# text = """  Molecule discovery is a pivotal research field, impacting everything from the
# medicines we take to the materials we use. Recently, Large Language Models
# (LLMs) have been widely adopted in molecule understanding and generation, yet
# the alignments between molecules and their corresponding captions remain a
# significant challenge. Previous endeavours often treat the molecule as a
# general SMILES string or molecular graph, neglecting the fine-grained
# alignments between the molecular sub-structures and the descriptive textual
# phrases, which are crucial for accurate and explainable predictions. In this
# case, we introduce MolReFlect, a novel teacher-student framework designed to
# contextually perform the molecule-caption alignments in a fine-grained way. Our
# approach initially leverages a larger teacher LLM to label the detailed
# alignments by directly extracting critical phrases from molecule captions or
# SMILES strings and implying them to corresponding sub-structures or
# characteristics. To refine these alignments, we propose In-Context Selective
# Reflection, which retrieves previous extraction results as context examples for
# teacher LLM to reflect and lets a smaller student LLM select from in-context
# reflection and previous extraction results. Finally, we enhance the learning
# process of the student LLM through Chain-of-Thought In-Context Molecule Tuning,
# integrating the fine-grained alignments and the reasoning processes within the
# Chain-of-Thought format. Our experimental results demonstrate that MolReFlect
# enables LLMs like Mistral-7B to significantly outperform the previous
# baselines, achieving SOTA performance on the ChEBI-20 dataset. This advancement
# not only enhances the generative capabilities of LLMs in the molecule-caption
# translation task, but also contributes to a more explainable framework.
# """
# a = processed_text = tokenize_text(
#                 text,
#                 remove_stopwords=True,
#                 lemmatize=True,
#                 stem=False,
#                 is_return_text=False
#             )
# a

After that, we load the file and process it as described above. The result is a new DataFrame with a column named **processed_abstract** that holds the list of tokens for each abstract.
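
The row-limited loading can be sketched with the standard library alone. This is a simplified illustration under assumptions, not the loader used below: the in-memory CSV, its `title` column, and the helper `read_column` are hypothetical.

```python
import csv
import io
from itertools import islice

# Hypothetical two-row CSV standing in for train_set.csv.
toy_csv = io.StringIO(
    "title,abstract\n"
    "Paper A,First abstract text\n"
    "Paper B,Second abstract text\n"
)

def read_column(handle, col_name, first_row_num=None):
    """Return the values of one column, optionally limited to the first N rows."""
    reader = csv.DictReader(handle)
    rows = islice(reader, first_row_num)  # islice(reader, None) reads every row
    return [row[col_name].strip() for row in rows]

abstracts = read_column(toy_csv, "abstract", first_row_num=1)
print(abstracts)
```

Using `islice` stops reading as soon as the limit is reached, so the 1000-article subset does not require parsing the whole file.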

In [11]:
def load_data2(file_path, col_name, label_name=None, remove_stopwords=True, lemmatize=False, stem=False, first_row_num=None):
    """
    Load and preprocess text data from a CSV file. Always processes the text using tokenize_text.

    Parameters:
    - file_path (str): Path to the CSV file.
    - col_name (str): Column name containing the text.
    - label_name (str, optional): Column name for the label. If None, labels are not used.
    - remove_stopwords (bool): Whether to remove stopwords.
    - lemmatize (bool): Whether to apply lemmatization.
    - stem (bool): Whether to apply stemming.
    - first_row_num (int, optional): The number of rows to retrieve. If None, all rows are used.

    Returns:
    - pd.DataFrame with columns ['abstract', 'processed_abstract'] if label_name is None;
      otherwise a list of (label, processed_text) tuples.
    """
    data = []

    with open(file_path, 'r', newline='', encoding='utf-8') as f:
        reader = csv.reader(f)
        headers = next(reader)

        col_index = headers.index(col_name)
        label_index = headers.index(label_name) if label_name else None

        # Counter to limit rows
        row_count = 0

        for row in reader:
            if first_row_num and row_count >= first_row_num:
                break  # Stop reading after first_row_num rows

            text = row[col_index].strip()
            processed_text = tokenize_text(
                text,
                remove_stopwords=remove_stopwords,
                lemmatize=lemmatize,
                stem=stem,
                is_return_text=False
            )

            if label_name:
                label = int(row[label_index])
                data.append((label, processed_text))
            else:
                data.append((text, processed_text))

            row_count += 1

    if label_name is None:
        return pd.DataFrame(data, columns=['abstract', 'processed_abstract'])
    return data



In [12]:
# train_url = '/content/drive/MyDrive/FIT5212/Ass1/Dataset_Assignment1/train_set.csv'
# dev_url = '/content/drive/MyDrive/FIT5212/Ass1/Dataset_Assignment1/dev_set.csv'
# test_url = '/content/drive/MyDrive/FIT5212/Ass1/Dataset_Assignment1/test_set.csv'
train_url = 'train_set.csv'
dev_url = 'dev_set.csv'
test_url = 'test_set.csv'

train_abs_1000_df = load_data2(train_url, 'abstract', remove_stopwords = True, lemmatize = True, first_row_num = 1000)
train_abs_20000_df = load_data2(train_url, 'abstract', remove_stopwords = True, lemmatize = True, first_row_num = 20000)


In [13]:
train_abs_1000_df.head(3)

Unnamed: 0,abstract,processed_abstract
0,Molecule discovery is a pivotal research field...,"[molecule, discovery, pivotal, research, field..."
1,Counterfactual (CF) explanations for machine l...,"[counterfactual, explanation, machine, learnin..."
2,"The gauge function, closely related to the ato...","[gauge, function, closely, related, atomic, no..."


In [14]:
docs_train_abs_1000 = train_abs_1000_df['processed_abstract'].tolist()
docs_train_abs_20000 = train_abs_20000_df['processed_abstract'].tolist()
docs_train_abs_1000_raw = copy.deepcopy(train_abs_1000_df['abstract'].tolist())
docs_train_abs_20000_raw = copy.deepcopy(train_abs_20000_df['abstract'].tolist())


In [15]:
docs_train_abs_20000_raw[1][:50]

'Counterfactual (CF) explanations for machine learn'

## 2. Vocabulary selection


In this section, we build vocabularies with and without bigrams for the two training sets.
For the single-word vocabulary, since we are using the abstract field, many tokens are expected to appear only once or a handful of times. We therefore keep only tokens that appear in **at least 20 documents** (*no_below*=20) to filter out infrequent tokens.
For the bigrams, we require a minimum count of **10 occurrences** (*min_count*=10). This threshold is relatively small for both dataset sizes (1000 and 20000), so the same setting can be applied to each.
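
A minimal sketch of the minimum document-frequency filter, assuming the threshold is applied per document rather than per occurrence. The helper `keep_frequent_tokens` and the toy documents are hypothetical; the notebook relies on gensim's dictionary filtering instead.

```python
from collections import Counter

def keep_frequent_tokens(docs, min_docs):
    """Return tokens that occur in at least `min_docs` distinct documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # set(): count a token once per document
    return {tok for tok, df in doc_freq.items() if df >= min_docs}

toy_docs = [
    ["graph", "node", "graph"],
    ["graph", "image"],
    ["policy", "image"],
]
kept = keep_frequent_tokens(toy_docs, min_docs=2)
print(sorted(kept))
```

Note that "graph" appearing twice in the first document still counts as one document, which is why document frequency is a stricter criterion than raw token counts.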

In [16]:
# !pip3 install gensim
from gensim.models import Phrases
def add_biagram(docs, docs_name):
  print(f"Processing document: {docs_name}")
  bigram = Phrases(docs, min_count=10)
  # bigram.vocab counts every unigram and candidate bigram seen, not only the accepted bigrams
  print(f'Total bigrams (vocab size): {len(bigram.vocab)}')
  for idx in range(len(docs)):
      for token in bigram[docs[idx]]:
          if '_' in token:
              # Token is a bigram, add to document.
              docs[idx].append(token)
  return docs
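
The idea of appending frequent bigrams to each document can be sketched without gensim. This is a simplification under a stated assumption: it accepts a pair purely on its raw count, whereas gensim's `Phrases` additionally applies a scoring threshold. The function `add_frequent_bigrams` is hypothetical.

```python
from collections import Counter

def add_frequent_bigrams(docs, min_count):
    """Append 'w1_w2' tokens for adjacent pairs seen at least `min_count` times overall."""
    pair_counts = Counter()
    for doc in docs:
        pair_counts.update(zip(doc, doc[1:]))
    augmented = []
    for doc in docs:
        extra = [f"{a}_{b}" for a, b in zip(doc, doc[1:]) if pair_counts[(a, b)] >= min_count]
        augmented.append(doc + extra)  # new list; the input documents are left untouched
    return augmented

toy_docs = [
    ["deep", "neural", "network"],
    ["neural", "network", "training"],
    ["graph", "network"],
]
result = add_frequent_bigrams(toy_docs, min_count=2)
print(result[0])
```

As in the function above, the unigrams are kept and the bigram is added alongside them, so a document containing "neural network" contributes to "neural", "network" and "neural_network".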

In [17]:
docs_train_abs_1000_bi = add_biagram(copy.deepcopy(docs_train_abs_1000), 'docs_train_abs_1000')
docs_train_abs_20000_bi = add_biagram(copy.deepcopy(docs_train_abs_20000), 'docs_train_abs_20000')

train_list = [docs_train_abs_1000, docs_train_abs_1000_bi, docs_train_abs_20000, docs_train_abs_20000_bi]
train_list_with_name = ["docs_train_abs_1000",  "docs_train_abs_1000_bi", "docs_train_abs_20000", "docs_train_abs_20000_bi"]

Processing document: docs_train_abs_1000
Total bigrams (vocab size): 86472
Processing document: docs_train_abs_20000
Total bigrams (vocab size): 992550


In [18]:
import numpy as np
from collections import Counter
from gensim.corpora import Dictionary

def dictionary_with_doc_percentile(train_docs, doc_names):
    for docs, doc_name in zip(train_docs, doc_names):
        # Create a dictionary for the documents
        dictionary = Dictionary(docs)
        # Filter out words that occur in fewer than 20 documents
        # (filter_extremes also applies its default no_above=0.5 here)
        dictionary.filter_extremes(no_below=20)

        # Initialize a Counter for document frequency
        doc_freqs = Counter()

        # Count document frequency: in how many documents each word appears
        for doc in docs:
            word_ids = set(dictionary.doc2idx(doc, unknown_word_index=-1))
            word_ids.discard(-1)  # remove unknowns
            for word_id in word_ids:
                doc_freqs[word_id] += 1

        # Calculate document frequency percentiles
        doc_freq_values = list(doc_freqs.values())
        percentile_levels = [5, 20, 30, 40, 50, 60, 70, 80, 95]
        percentiles = np.percentile(doc_freq_values, percentile_levels)

        # Display results
        print(f'Train data: {doc_name}')
        print(f'  Unique tokens in corpus: {len(dictionary)}')
        print(f'  Document frequency percentiles:')
        for level, value in zip(percentile_levels, percentiles):
            print(f'    {level}th percentile: {value:.1f}')
        print()

# Example usage
dictionary_with_doc_percentile(train_list, train_list_with_name)


Train data: docs_train_abs_1000
  Unique tokens in corpus: 860
  Document frequency percentiles:
    5th percentile: 21.0
    20th percentile: 24.0
    30th percentile: 28.0
    40th percentile: 31.6
    50th percentile: 37.0
    60th percentile: 44.0
    70th percentile: 55.0
    80th percentile: 73.0
    95th percentile: 171.1

Train data: docs_train_abs_1000_bi
  Unique tokens in corpus: 902
  Document frequency percentiles:
    5th percentile: 21.0
    20th percentile: 24.0
    30th percentile: 27.0
    40th percentile: 31.0
    50th percentile: 36.5
    60th percentile: 43.6
    70th percentile: 54.0
    80th percentile: 72.0
    95th percentile: 166.7

Train data: docs_train_abs_20000
  Unique tokens in corpus: 5304
  Document frequency percentiles:
    5th percentile: 22.0
    20th percentile: 32.0
    30th percentile: 42.0
    40th percentile: 56.0
    50th percentile: 79.0
    60th percentile: 111.0
    70th percentile: 173.0
    80th percentile: 296.0
    95th percentile: 106

I chose *no_above*=0.6, so only tokens that appear in more than 60% of the documents are removed. The document frequency percentiles above show how far most of the vocabulary sits below this ceiling: in the 1000-article sets, the 60th-percentile token appears in about 44 documents (**roughly 4-5% of the corpus**) and the 95th-percentile token in about 170 documents (about 17%). The cutoff therefore affects only a small number of extremely common terms.

By setting *no_above*=0.6, we strike a balanced trade-off:

*   I exclude only the most frequent words, which often lack topic-distinguishing power.
*   I retain a richer and more diverse vocabulary, which is especially important for the smaller dataset, where overly strict filtering could shrink the token set too much.
*   Because *no_above* is a proportion of documents rather than an absolute count, the same criterion applies to both dataset sizes (1000 vs. 20000), giving a uniform rule for vocabulary pruning.

Overall, this slightly looser cutoff (compared with the gensim default of 0.5) supports both topic interpretability and model flexibility, particularly when experimenting with different LDA configurations.
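
To make the relationship between *no_above* and the percentiles concrete, the arithmetic below converts the values printed for the 1000-article unigram set into shares of the corpus. It is only a worked check of the numbers above; the variable names are my own.

```python
# Values copied from the percentile output above (docs_train_abs_1000).
n_docs = 1000
no_above = 0.6
p60_doc_freq = 44.0    # 60th percentile of document frequency
p95_doc_freq = 171.1   # 95th percentile of document frequency

max_docs_allowed = no_above * n_docs   # ceiling imposed by no_above, in documents
p60_share = p60_doc_freq / n_docs      # share of documents containing a 60th-percentile token
p95_share = p95_doc_freq / n_docs      # share of documents containing a 95th-percentile token

print(f"no_above ceiling: {max_docs_allowed:.0f} documents")
print(f"60th percentile token: {p60_share:.1%} of documents")
print(f"95th percentile token: {p95_share:.1%} of documents")
```

Even the 95th-percentile token is well below the ceiling, which is consistent with the vocabulary sizes being unchanged after the *no_above* filter is applied.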

Furthermore, we save the corpus and dictionary of every variation in a dictionary named **dic_with_corpus_glodbaldic** for later use.

In [19]:
from gensim.corpora import Dictionary
def dictionary_for_doc(train_docs, doc_names):
    """
    Create a dictionary representation of the documents and filter rare and common tokens.

    Args:
    train_docs (list of list): List of tokenized documents (list of words for each document).
    doc_names (list of str): List of names or identifiers for each document.

    Returns:
    dic_with_doc2bow (dict): Dictionary mapping document names to their Bag-of-Words representation and the dictionary.
    """
    dic_with_corpus_glodbaldic = {}


    # Iterate over the training sets and their names
    for doc_name, docs in zip(doc_names, train_docs):
        # Create a dictionary for this training set
        dictionary = Dictionary(docs)

        # Filter out words that occur in fewer than 20 documents or in more than 60% of the documents
        dictionary.filter_extremes(no_below=20, no_above=0.6)

        # Create Bag-of-Words representation for the document
        corpus = [dictionary.doc2bow(doc) for doc in docs]
        # Print document-wise statistics (optional for debugging)
        print(f'Train data: {doc_name}')
        print(f'  Number of unique tokens in corpus: {len(dictionary)}')
        print('\n')

        # Store both the corpus and the dictionary for each document in a list
        dic_with_corpus_glodbaldic[doc_name] = [corpus, dictionary]

    return dic_with_corpus_glodbaldic
dic_with_corpus_glodbaldic = dictionary_for_doc(train_list, train_list_with_name)

Train data: docs_train_abs_1000
  Number of unique tokens in corpus: 860


Train data: docs_train_abs_1000_bi
  Number of unique tokens in corpus: 902


Train data: docs_train_abs_20000
  Number of unique tokens in corpus: 5304


Train data: docs_train_abs_20000_bi
  Number of unique tokens in corpus: 7314




## 3. Training

We are now ready to train the LDA models with 10 topics. The rest of the setup is similar to the tutorial. After training, we save the LDA model of each variation.

In [20]:
from gensim.models import LdaModel
def LDA_model(corpus, dictionary, doc_name):
    # Set training parameters.
    NUM_TOPICS = 10
    chunksize = 5000
    passes = 20
    iterations = 400
    eval_every = None  # Don't evaluate model perplexity, takes too much time.

    # Make an index to word dictionary.
    temp = dictionary[0]  # This is only to "load" the dictionary.
    id2word = dictionary.id2token

    model = LdaModel(
        corpus=corpus,
        id2word=id2word,
        chunksize=chunksize,
        alpha='auto',
        eta='auto',
        iterations=iterations,
        num_topics=NUM_TOPICS,
        passes=passes,
        eval_every=eval_every
    )

    # Define the output file name based on document name
    outputfile = f'{doc_name}.gensim'
    print(f"Saving model for {doc_name} in {outputfile}")
    model.save(outputfile)
    return model

# Assume dic_with_corpus_glodbaldic contains the corpus and dictionary for each document
# Loop through each document in train_list_with_name and its corresponding corpus/dictionary
for doc_name, (corpus, dictionary) in dic_with_corpus_glodbaldic.items():
    # Train and save the model for each document
    model = LDA_model(corpus, dictionary, doc_name)
    dic_with_corpus_glodbaldic[doc_name].append(model)


Saving model for docs_train_abs_1000 in docs_train_abs_1000.gensim
Saving model for docs_train_abs_1000_bi in docs_train_abs_1000_bi.gensim
Saving model for docs_train_abs_20000 in docs_train_abs_20000.gensim
Saving model for docs_train_abs_20000_bi in docs_train_abs_20000_bi.gensim
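
To get a feel for what the `chunksize` and `passes` settings mean for the two dataset sizes, the sketch below counts how many chunks are processed. It assumes gensim's default behaviour of updating the model after every chunk; the helper `updates_per_run` is hypothetical.

```python
import math

def updates_per_run(n_docs, chunksize, passes):
    """Chunks processed per pass and over the whole training run."""
    chunks_per_pass = math.ceil(n_docs / chunksize)
    return chunks_per_pass, chunks_per_pass * passes

for n_docs in (1000, 20000):
    per_pass, total = updates_per_run(n_docs, chunksize=5000, passes=20)
    print(f"{n_docs} documents: {per_pass} chunk(s) per pass, {total} chunk updates in total")
```

With these settings the 1000-article corpus fits into a single chunk, so each pass is effectively a batch update, while the 20000-article corpus is processed in several chunks per pass.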


In [21]:
# Print the top words for each topic.
# Note: `model` is the last model trained in the loop above (docs_train_abs_20000_bi).
num_topics = 10
for topic_id in range(num_topics):
    print(f"Topic {topic_id}:")
    print(model.print_topic(topic_id, topn=15))  # top 15 words per topic
    print("\n")

Topic 0:
0.019*"language" + 0.009*"research" + 0.009*"system" + 0.008*"study" + 0.007*"text" + 0.006*"paper" + 0.006*"human" + 0.006*"dataset" + 0.006*"question" + 0.006*"evaluation" + 0.006*"analysis" + 0.006*"task" + 0.005*"user" + 0.005*"work" + 0.005*"present"


Topic 1:
0.023*"algorithm" + 0.018*"problem" + 0.014*"function" + 0.013*"learning" + 0.009*"method" + 0.009*"distribution" + 0.008*"optimization" + 0.008*"show" + 0.007*"bound" + 0.006*"result" + 0.006*"optimal" + 0.005*"sample" + 0.005*"gradient" + 0.005*"theoretical" + 0.005*"set"


Topic 2:
0.021*"llm" + 0.020*"task" + 0.015*"learning" + 0.011*"performance" + 0.011*"agent" + 0.010*"policy" + 0.010*"method" + 0.009*"knowledge" + 0.008*"language" + 0.008*"reinforcement" + 0.008*"large" + 0.008*"approach" + 0.007*"reasoning" + 0.007*"framework" + 0.007*"large_language"


Topic 3:
0.013*"system" + 0.013*"data" + 0.013*"time" + 0.009*"approach" + 0.009*"dynamic" + 0.008*"prediction" + 0.007*"method" + 0.006*"using" + 0.006*"p

## 4. Experiment with documents within each topic

In this section, we find the dominant topic of each document for every variation, and then list the most representative documents and the top words of each topic.
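
The two steps (pick each document's dominant topic, then keep the strongest documents per topic) can be sketched in plain Python. The toy distributions and both helper functions are hypothetical; they only mirror the shape of gensim's per-document `(topic_id, probability)` pairs.

```python
from collections import defaultdict

def dominant_topic(topic_probs):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topic_probs, key=lambda pair: pair[1])

def top_k_docs_per_topic(doc_topic_lists, k):
    """Group documents by dominant topic and keep the k strongest per topic."""
    grouped = defaultdict(list)
    for doc_id, topic_probs in enumerate(doc_topic_lists):
        topic_id, prob = dominant_topic(topic_probs)
        grouped[topic_id].append((doc_id, prob))
    return {
        topic_id: sorted(members, key=lambda m: m[1], reverse=True)[:k]
        for topic_id, members in grouped.items()
    }

# Hypothetical per-document topic distributions.
toy = [
    [(0, 0.70), (1, 0.30)],
    [(0, 0.20), (1, 0.80)],
    [(0, 0.90), (1, 0.10)],
]
ranking = top_k_docs_per_topic(toy, k=1)
print(ranking)
```

The pandas-based functions in the following cells do the same thing, with the topic keywords and the original abstract attached to each row.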

In [36]:
def get_document_topics(ldamodel, corpus, texts, topn=15):
    # Collect one row per document
    data = []

    # Get the main topic of each document
    for i, row in enumerate(ldamodel[corpus]):  # iterate over each document's topic distribution
        row = sorted(row, key=lambda x: (x[1]), reverse=True)
        # Get the Dominant topic, Perc Contribution and Keywords for each document
        for j, (topic_num, prop_topic) in enumerate(row):
            if j == 0:  # => dominant topic
                wp = ldamodel.show_topic(topic_num, topn=topn)
                topic_keywords = ", ".join([word for word, prop in wp]) # get all word
                data.append([int(topic_num), round(prop_topic,4), topic_keywords])
            else:
                break

    document_topics_df = pd.DataFrame(data, columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'])

    # Add original text to the end of the output
    document_topics_df['Original_Text'] = pd.Series(texts)

    return document_topics_df



In [37]:
def find_top_k_doc(doc_topic_df, k=5):

  doc_topics_sorted_df = pd.DataFrame()

  doc_topic_df_grpd = doc_topic_df.groupby('Dominant_Topic')

  for i, grp in doc_topic_df_grpd:
      doc_topics_sorted_df = pd.concat([doc_topics_sorted_df,
                                              grp.sort_values('Perc_Contribution', ascending=False).head(k)],
                                              axis=0)

  doc_topics_sorted_df.reset_index(drop=True, inplace=True)
  doc_topics_sorted_df.columns = ['Topic_Num', "Topic_Perc_Contrib", "Keywords", "Text"]
  return doc_topics_sorted_df


In [38]:
train_raw_list = [docs_train_abs_1000_raw, docs_train_abs_1000_raw, docs_train_abs_20000_raw, docs_train_abs_20000_raw]

# Initialize the dictionary to store results
document_topics_df_dic = {}

# Iterate over both `dic_with_corpus_glodbaldic.items()` and `train_raw_list` simultaneously using zip
for (doc_name, (corpus, dictionary, model)), raw_doc in zip(dic_with_corpus_glodbaldic.items(), train_raw_list):
    # Initialize an empty list for each doc_name in the dictionary
    if doc_name not in document_topics_df_dic:
        document_topics_df_dic[doc_name] = []

    # Get document topics
    document_topics_df = get_document_topics(model, corpus, raw_doc)
    document_topics_sorted_df = find_top_k_doc(document_topics_df, 5)

    # Append the result to the dictionary
    document_topics_df_dic[doc_name].append(document_topics_df)
    document_topics_df_dic[doc_name].append(document_topics_sorted_df)



In [39]:
document_topics_df_dic['docs_train_abs_20000'][0].head(5)

Unnamed: 0,Dominant_Topic,Perc_Contribution,Topic_Keywords,Original_Text
0,5,0.68,"language, task, llm, data, performance, method...",Molecule discovery is a pivotal research field...
1,4,0.2917,"algorithm, learning, problem, optimization, me...",Counterfactual (CF) explanations for machine l...
2,2,0.5669,"data, method, distribution, function, problem,...","The gauge function, closely related to the ato..."
3,4,0.5186,"algorithm, learning, problem, optimization, me...",Reinforcement learning (RL) is a promising met...
4,4,0.5604,"algorithm, learning, problem, optimization, me...",Acceleration and momentum are the de facto sta...


In [40]:
document_topics_df_dic['docs_train_abs_20000'][1].head(50)

Unnamed: 0,Topic_Num,Topic_Perc_Contrib,Keywords,Text
0,0,0.9925,"performance, training, architecture, method, m...",The Mixture of Experts (MoE) framework has bec...
1,0,0.9908,"performance, training, architecture, method, m...",Designing accurate and efficient convolutional...
2,0,0.9897,"performance, training, architecture, method, m...",Ensembling is a simple and popular technique f...
3,0,0.9817,"performance, training, architecture, method, m...",We study the problem of compressing recurrent ...
4,0,0.9735,"performance, training, architecture, method, m...",In this work we introduce a new transformer ar...
5,1,0.9911,"network, neural, training, data, learning, dee...","In this paper, we address the issue of how to ..."
6,1,0.9901,"network, neural, training, data, learning, dee...",It is well-known that a deep neural network ha...
7,1,0.9898,"network, neural, training, data, learning, dee...",Adversarial attacks hamper the functionality a...
8,1,0.9842,"network, neural, training, data, learning, dee...",The commercialization of deep learning creates...
9,1,0.9797,"network, neural, training, data, learning, dee...",Increasingly machine learning systems are bein...


In [41]:
def split_topic_words(df, k = 15):
    """
    Split the 'Keywords' column of the DataFrame into one column per top word.

    Parameters:
    - df: DataFrame containing 'Topic_Num' and 'Keywords' columns
    - k: Number of top words to display for each topic

    Returns:
    - DataFrame with one row per topic and each word in 'Keywords' as a separate column
    """
    df = df.copy()
    # Split the 'Keywords' column by comma and space, then expand it into separate columns
    words_split = df['Keywords'].str.split(', ', expand=True)

    # If there are more words than 'k', truncate the words to 'k' columns
    words_split = words_split.iloc[:, :k]  # Keep only the first 'k' words

    # Concatenate the original 'Topic_Num' column with the new words columns
    df_words = pd.concat([df['Topic_Num'], words_split], axis=1)

    # Rename columns to match the word columns (e.g., Word 0, Word 1, etc.)
    df_words.columns = ['Topic_Num'] + [f'Word {i}' for i in range(df_words.shape[1] - 1)]
    df_words = df_words.drop_duplicates(subset='Topic_Num', keep='first')  # Remove duplicates
    return df_words

In [42]:
# Iterate through the dictionary items to retrieve the data
for train_name, topic_data_list in document_topics_df_dic.items():
    topic_word_df = split_topic_words(topic_data_list[1], 15)
    topic_data_list.append(topic_word_df)

In [43]:
# Print out the topic words of each topic in each variation
for train_name, topic_data_list in document_topics_df_dic.items():
    print(f'Train data: {train_name}:')

    # Select the topic-word DataFrame (index 2, appended in the previous cell)
    topic_df = topic_data_list[2].head(10)

    # Display the dataframe
    display(topic_df)
    print('\n')

Train data: docs_train_abs_1000:


Unnamed: 0,Topic_Num,Word 0,Word 1,Word 2,Word 3,Word 4,Word 5,Word 6,Word 7,Word 8,Word 9,Word 10,Word 11,Word 12,Word 13,Word 14
0,0,network,data,image,neural,method,learning,deep,training,approach,proposed,input,using,result,based,show
5,1,language,llm,task,large,knowledge,text,data,information,method,performance,benchmark,system,result,paper,question
10,2,learning,data,prediction,uncertainty,accuracy,deep,method,query,system,challenge,task,synthetic,trajectory,result,propose
15,3,task,agent,learning,policy,reward,environment,reinforcement,human,action,system,control,approach,interaction,robot,graph
20,4,data,learning,problem,datasets,training,representation,method,online,performance,feature,domain,propose,instance,algorithm,time
25,5,user,data,system,method,study,task,information,approach,based,recommendation,learning,analysis,review,using,context
30,6,training,network,architecture,performance,neural,paper,transformer,show,data,inference,also,analysis,using,information,large
35,7,algorithm,learning,problem,method,function,optimization,network,result,show,approach,new,sample,paper,optimal,work
40,8,method,graph,feature,learning,task,network,performance,propose,classification,image,datasets,training,proposed,node,representation
45,9,entity,data,bias,word,approach,technique,used,sparse,game,embeddings,performance,method,inference,semantic,existing




Train data: docs_train_abs_1000_bi:


Unnamed: 0,Topic_Num,Word 0,Word 1,Word 2,Word 3,Word 4,Word 5,Word 6,Word 7,Word 8,Word 9,Word 10,Word 11,Word 12,Word 13,Word 14
0,0,learning,image,method,domain,time,agent,brain,series,training,imaging,representation,detection,game,challenge,adaptation
5,1,llm,language,large,large_language,task,performance,text,response,however,approach,data,benchmark,generation,method,evaluation
10,2,image,patient,method,dataset,word,learning,clinical,task,information,segmentation,analysis,cluster,label,context,framework
15,3,learning,data,graph,task,result,algorithm,system,performance,method,user,machine,technique,recommendation,proposed,based
20,4,network,neural,proposed,result,neural_network,method,data,using,feature,learning,accuracy,approach,deep,different,based
25,5,task,method,representation,data,learning,propose,knowledge,graph,performance,information,approach,training,datasets,entity,experiment
30,6,system,user,human,data,research,paper,feedback,study,learning,work,query,analysis,design,application,interaction
35,7,data,method,network,training,learning,neural,loss,performance,graph,datasets,result,propose,node,proposed,accuracy
40,8,algorithm,learning,problem,method,function,policy,show,optimization,network,sample,set,approach,optimal,propose,machine
45,9,system,approach,attack,prediction,object,network,context,dynamic,using,method,robot,user,paper,propose,present




Train data: docs_train_abs_20000:


Unnamed: 0,Topic_Num,Word 0,Word 1,Word 2,Word 3,Word 4,Word 5,Word 6,Word 7,Word 8,Word 9,Word 10,Word 11,Word 12,Word 13,Word 14
0,0,performance,training,architecture,method,memory,time,parameter,transformer,computational,inference,accuracy,search,cost,speech,efficient
5,1,network,neural,training,data,learning,deep,attack,adversarial,method,performance,classification,image,accuracy,privacy,architecture
10,2,data,method,distribution,function,problem,show,space,approach,result,set,sample,matrix,using,variable,error
15,3,user,human,system,agent,interaction,bias,behavior,robot,study,feedback,task,environment,preference,fairness,participant
20,4,algorithm,learning,problem,optimization,method,policy,gradient,function,reinforcement,show,reward,optimal,bound,approach,performance
25,5,language,task,llm,data,performance,method,large,knowledge,generation,domain,text,benchmark,approach,datasets,dataset
30,6,graph,representation,learning,method,network,feature,information,structure,task,node,propose,data,prediction,datasets,neural
35,7,word,using,language,classification,feature,result,text,dataset,medical,translation,patient,method,sentence,used,clinical
40,8,data,research,learning,machine,system,study,application,paper,analysis,tool,challenge,explanation,development,design,provide
45,9,image,data,method,time,detection,approach,using,object,system,prediction,video,proposed,event,series,based




Train data: docs_train_abs_20000_bi:


Unnamed: 0,Topic_Num,Word 0,Word 1,Word 2,Word 3,Word 4,Word 5,Word 6,Word 7,Word 8,Word 9,Word 10,Word 11,Word 12,Word 13,Word 14
0,0,language,research,system,study,text,paper,human,dataset,question,evaluation,analysis,task,user,work,present
5,1,algorithm,problem,function,learning,method,distribution,optimization,show,bound,result,optimal,sample,gradient,theoretical,set
10,2,llm,task,learning,performance,agent,policy,method,knowledge,language,reinforcement,large,approach,reasoning,framework,large_language
15,3,system,data,time,approach,dynamic,prediction,method,using,process,series,based,event,control,state,simulation
20,4,graph,representation,task,method,information,feature,propose,speech,approach,word,learning,performance,show,text,two
25,5,network,training,neural,method,parameter,performance,architecture,neural_network,layer,inference,show,accuracy,time,computational,weight
30,6,image,data,method,training,domain,label,learning,datasets,performance,task,dataset,sample,approach,class,propose
35,7,learning,data,user,system,algorithm,machine,machine_learning,device,communication,framework,federated,performance,approach,proposed,paper
40,8,attack,adversarial,detection,data,privacy,robustness,training,anomaly,perturbation,method,example,fairness,robust,work,show
45,9,network,learning,neural,deep,feature,method,data,prediction,classification,deep_learning,machine,neural_network,using,performance,accuracy






## Visualization with pyLDAvis

In this section, we use pyLDAvis to visualize all topics, including the word frequency distributions and the intertopic distance map, to better understand the topic coherence and separation of each variation.
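
As far as I understand, the default intertopic distance map in pyLDAvis is based on the Jensen-Shannon divergence between the topics' word distributions, projected to two dimensions. The sketch below shows only the divergence itself, on hypothetical four-word distributions; it is not pyLDAvis's own code.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and 0 for identical distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical topic-word distributions over a four-word vocabulary.
topic_a = [0.5, 0.3, 0.1, 0.1]
topic_b = [0.5, 0.3, 0.1, 0.1]   # identical to topic_a
topic_c = [0.0, 0.0, 0.5, 0.5]   # puts its mass on words that topic_a rarely uses

print(js_divergence(topic_a, topic_b))
print(js_divergence(topic_a, topic_c))
```

Topics whose top words overlap heavily have a small divergence and are drawn close together on the map, which is what "poor separation" refers to in the observations below.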

In [44]:
# Recent pyLDAvis releases expose the gensim helper as pyLDAvis.gensim_models
import pyLDAvis.gensim_models as gensimvis
corpus, dictionary, model = dic_with_corpus_glodbaldic['docs_train_abs_1000']
lda_display = gensimvis.prepare(model, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)

In [45]:
corpus, dictionary, model = dic_with_corpus_glodbaldic['docs_train_abs_1000_bi']
lda_display = gensimvis.prepare(model, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)

In [46]:
corpus, dictionary, model = dic_with_corpus_glodbaldic['docs_train_abs_20000']
lda_display = gensimvis.prepare(model, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)

In [47]:
corpus, dictionary, model = dic_with_corpus_glodbaldic['docs_train_abs_20000_bi']
lda_display = gensimvis.prepare(model, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)



*   **1000 articles, unigrams:** Topics were tightly clustered with poor separation, and dominant terms like data, learning, and system appeared across multiple topics, leading to vague and overlapping groupings.
*   **1000 articles, bigrams:** Topic separation slightly improved, and some domain-specific phrases such as neural_network and large_language emerged. However, overlap remained, and the small dataset limited topic depth.
*   **20000 articles, unigrams:** Topics were more dispersed with improved diversity, but common unigrams like model, language, and data still overlapped across topics, reducing clarity in boundaries.
*   **20000 articles, bigrams:** Topics were clearly separated and semantically rich. Bigrams such as deep_learning, natural_language, and user_interface defined distinct, interpretable clusters aligned with real-world domains.
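
One simple way to put a number on the overlap described above is to count how many topics share each top word. The sketch below uses only the first five top words of three topics from the 1000-article unigram table; the helper `shared_words` is hypothetical, and a fuller check would use all 15 words of all 10 topics.

```python
from collections import Counter

# First five top words of topics 0, 6 and 8 from the docs_train_abs_1000 table above.
top_words = {
    0: ["network", "data", "image", "neural", "method"],
    6: ["training", "network", "architecture", "performance", "neural"],
    8: ["method", "graph", "feature", "learning", "task"],
}

def shared_words(topic_words):
    """Map each word to the number of topics listing it, keeping words found in 2+ topics."""
    counts = Counter(word for words in topic_words.values() for word in set(words))
    return {word: n for word, n in counts.items() if n > 1}

overlap = shared_words(top_words)
print(sorted(overlap.items()))
```

Even in this small excerpt, three of the top words are shared between topics, which matches the impression of overlapping groupings in the smaller unigram model.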

