The final section of this thesis explores the application of large language models (LLMs) to topic modeling. Because LLMs can only process a limited number of tokens at once, applying them directly to an extensive corpus for comprehensive topic extraction is difficult (Devlin et al., 2018). They cannot process a large dataset in a single pass, which complicates distilling and analyzing the topics embedded in it. Furthermore, LLMs typically lack built-in quantitative evaluation metrics for assessing the quality and relevance of the topics they generate, which is a further hurdle for systematic topic analysis.

LLMs nevertheless hold significant potential for augmenting traditional topic modeling techniques such as Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF). Both LDA and NMF represent each topic as a set of keywords with corresponding weights (Blei et al., 2003; Lee & Seung, 1999). Interpreting these topics is often ambiguous and depends on the individual reader, because a list of keywords and weights does not by itself convey the context or thematic depth of a topic.

This is where LLMs can add substantial value. Their language understanding and generation capabilities can be used to interpret and summarize the topics identified by traditional methods (Raffel et al., 2019). Specifically, an LLM can take a topic's keywords and weights as input and synthesize them into a coherent, contextually enriched summary. This not only clarifies the topic representation but also yields a narrative that human readers can understand and use more readily.

Employing LLMs in this capacity could shift topic modeling from a largely quantitative exercise toward a more qualitative and interpretative process. This integration could lead to more precisely labeled topics, interpretations that are easier for human readers to follow, and potentially more actionable insights for researchers and practitioners in various fields. Thus, while LLMs are limited when applied directly to large-scale topic modeling, their strength lies in refining and elucidating the outputs of traditional techniques, adding a layer of depth and usability that purely algorithmic approaches may lack.
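
As a concrete illustration of the interpretation step described above, the sketch below turns one topic's (keyword, weight) pairs into a text prompt that could be passed to an LLM. The function name `build_topic_prompt`, the `top_n` parameter, the prompt wording, and the removal of punctuation-only tokens are all assumptions made for illustration, not part of the thesis pipeline. The call to an LLM is deliberately left out.

```python
def build_topic_prompt(topic_words, top_n=15):
    """Format the top keywords of one topic as a labelling prompt for an LLM.

    topic_words is a list of (word, weight) pairs, e.g. the output of
    gensim's LdaModel.show_topic. The prompt wording is hypothetical.
    """
    # Assumption: drop tokens with no letters or digits (e.g. stray quote marks).
    kept = [(w, float(p)) for w, p in topic_words if any(ch.isalnum() for ch in w)]
    # Put the highest-weighted keywords first.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    keyword_lines = "\n".join(f"- {w} ({p:.4f})" for w, p in kept[:top_n])
    return (
        "The following keywords and weights describe one topic found by a "
        "topic model over paper abstracts.\n"
        f"{keyword_lines}\n"
        "Give a short label for the topic and a two-sentence summary."
    )


# A few (word, weight) pairs in the format returned by show_topic.
example_topic = [
    ("study", 0.00799858),
    ("network", 0.0069343806),
    ("'", 0.006085316),
    ("student", 0.0039881305),
]
prompt = build_topic_prompt(example_topic, top_n=3)
print(prompt)
```

Keeping prompt construction separate from the model call makes it possible to inspect and adjust the text before it is sent to any particular LLM.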

In [1]:
from gensim import models


lda_model = models.LdaModel.load("ProcessedData/ldamodel")  # Load the model

In [7]:
import pandas as pd
Abstract = pd.read_csv('ProcessedData/abstract_data.csv')


In [8]:
Abstract

Unnamed: 0,Abstract
0,Fluorescence in situ hybridization (FISH) is a...
1,Mining association rules is an important issue...
2,With the increasing concern about global envir...
3,Automatic summarization evaluation is very imp...
4,As the number of pages on the web is permanent...
...,...
8837,This article has been retracted at the request...
8838,The unfolding technique is an efficient tool t...
8839,Although potential benefits are frequently men...
8840,Architectural Design Decisions (ADD) form a ke...


In [9]:
# Parameters
num_topics = 20
num_words = 50

# Collect all topics with their words and probabilities
all_topics = []
for topic_id in range(num_topics):
    # Extract words and probabilities for the topic
    topic_words = lda_model.show_topic(topic_id, num_words)
    # Create a string representation of the topic
    topic_string = ", ".join([f"{word} ({prob:.4f})" for word, prob in topic_words])
    all_topics.append([topic_string])

# Convert list to DataFrame
topics_df = pd.DataFrame(all_topics, columns=['Topic Words'])

# Save the DataFrame to a CSV file
output_file = 'ProcessedData/all_topics_lda.csv'
topics_df.to_csv(output_file, index=False)

print(f"All topics saved to '{output_file}'")

All topics saved to 'ProcessedData/all_topics_lda.csv'


Task: write a Python function that takes a paper number as input and returns the paper's abstract, its dominant topic, and its topic loadings.
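
A minimal sketch of such a function is given here. It assumes that the abstracts table and the topic-loadings table are aligned row by row, and that the loadings table has a `dominant_topic` column and per-topic columns named `topic0`, `topic1`, and so on, as in the notebook's data files. The function name `describe_paper`, the `top_k` parameter, and the toy data are hypothetical.

```python
import pandas as pd


def describe_paper(paper_number, abstracts, loadings, top_k=3):
    """Return the abstract, dominant topic, and top topic loadings for one paper.

    Assumes both DataFrames are aligned by row position.
    """
    row = loadings.iloc[paper_number]
    # Select only the per-topic columns (topic0, topic1, ...).
    topic_cols = [
        c for c in loadings.columns
        if c.startswith("topic") and c[5:].isdigit()
    ]
    # Sort loadings from largest to smallest.
    topic_loadings = row[topic_cols].astype(float).sort_values(ascending=False)
    return {
        "abstract": abstracts["Abstract"].iloc[paper_number],
        "dominant_topic": int(row["dominant_topic"]),
        "top_loadings": topic_loadings.head(top_k).to_dict(),
    }


# Toy data standing in for the two CSV files (values are made up).
abstracts = pd.DataFrame(
    {"Abstract": ["First toy abstract.", "Second toy abstract."]}
)
loadings = pd.DataFrame(
    {
        "n_citation": [3, 10],
        "dominant_topic": [0, 2],
        "topic0": [0.6, 0.1],
        "topic1": [0.3, 0.2],
        "topic2": [0.1, 0.7],
    }
)
result = describe_paper(1, abstracts, loadings, top_k=2)
print(result)
```

With the notebook's own objects, the equivalent call would be `describe_paper(100, Abstract, data)`, provided the row-alignment assumption holds for the two files.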

In [20]:
lda_model.show_topic(0, 50)

[('study', 0.00799858),
 ('network', 0.0069343806),
 ('their', 0.006873849),
 ('research', 0.00678384),
 ("'", 0.006085316),
 ('new', 0.005669204),
 ('technology', 0.005425469),
 ('at', 0.0048812977),
 ('knowledge', 0.0047922987),
 ('different', 0.0045970464),
 ('information', 0.004454832),
 ('interaction', 0.00424667),
 ('process', 0.004090605),
 ('student', 0.0039881305),
 ('"', 0.0038622585),
 ('work', 0.0038196004),
 ('between', 0.0038100283),
 ('or', 0.0037261343),
 ('they', 0.0036489153),
 ('design', 0.00362047),
 ('how', 0.0035869998),
 ('learning', 0.003555055),
 ('activity', 0.0033634857),
 ('more', 0.0033417395),
 ('provide', 0.0033188546),
 ('develop', 0.0032589743),
 ('support', 0.0032436778),
 ("'s", 0.0031785183),
 ('datum', 0.0031522776),
 ('its', 0.0031339836),
 ('one', 0.0031283577),
 ('level', 0.0031226769),
 ('will', 0.0030208337),
 ('not', 0.0030019628),
 ('well', 0.0029758718),
 ('development', 0.002974455),
 ('project', 0.002971783),
 ('human', 0.002889741),
 ('th

In [11]:
# Per-paper table with citation counts, t-SNE coordinates, and LDA topic loadings
file_path = 'ProcessedData/lda_for_regression.csv'
data = pd.read_csv(file_path)

In [19]:

data.iloc[100]

n_citation                3.000000
x_tsne                  -15.770838
y_tsne                   31.155502
x_1_topic_probability     0.439834
dominant_topic            0.000000
topic0                    0.439834
topic1                    0.050912
topic2                    0.000000
topic3                    0.000000
topic4                    0.000000
topic5                    0.000000
topic6                    0.000000
topic7                    0.014453
topic8                    0.000000
topic9                    0.000000
topic10                   0.000000
topic11                   0.000000
topic12                   0.000000
topic13                   0.131584
topic14                   0.000000
topic15                   0.161673
topic16                   0.140692
topic17                   0.053284
topic18                   0.000000
topic19                   0.000000
Name: 100, dtype: float64

In [13]:
Abstract['Abstract'][100]

'Neurons of primary sensory cortices are known to have specific responsiveness to elemental features. To express more complex sensory attributes that are embedded in objects or events, the brain must integrate them. This is referred to as feature binding and is reflected in correlated neuronal activity. We investigated how local intracortical circuitry modulates ongoing-spontaneous neuronal activity, which would have a great impact on the processing of subsequent combinatorial input, namely, on the correlating (binding) of relevant features. We simulated a functional, minimal neural network model of primary visual cortex, in which lateral excitatory connections were made in a diffusive manner between cell assemblies that function as orientation columns. A pair of bars oriented at specific angles, expressing a visual corner, was applied to the network. The local intracortical circuitry contributed not only to inducing correlated neuronal activation and thus to binding the paired feature

https://chatgpt.com/share/173fd80b-3a02-4e90-93d8-dd0af8c79a3b