## Topic Modeling with LDA

In this section we develop a topic model using Latent Dirichlet Allocation (LDA) to discover latent themes across papers. Such a model may have practical value in the following ways:

1. Uncovering nontrivial relationships between disparate fields of research 
2. Organizing papers into useful categories
3. Navigating citations based on their usage in papers within & across categories

#### Step 1: Import & Preprocess Data

In [48]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import RegexpTokenizer
from sklearn.decomposition import LatentDirichletAllocation as LDA
from datasets import load_dataset

In [7]:
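# Load the validation split of the PubMed subset of scientific_papers from Hugging Face
df = load_dataset("scientific_papers", "pubmed", split="validation")
# Convert the Hugging Face Dataset to a pandas DataFrame
df_articles = pd.DataFrame(df)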

Downloading and preparing dataset scientific_papers/pubmed to /Users/mattroth/.cache/huggingface/datasets/scientific_papers/pubmed/1.1.1/306757013fb6f37089b6a75469e6638a553bd9f009484938d8f75a4c5e84206f...



Dataset scientific_papers downloaded and prepared to /Users/mattroth/.cache/huggingface/datasets/scientific_papers/pubmed/1.1.1/306757013fb6f37089b6a75469e6638a553bd9f009484938d8f75a4c5e84206f. Subsequent calls will reuse this data.


In [17]:
# Inspect df_articles
df_articles.head()

| | article | abstract | section_names |
|---|---|---|---|
| 0 | approximately , one - third of patients with s... | background and aim : there is lack of substan... | Introduction\nSubjects and Methods\nResults\nD... |
| 1 | there is an epidemic of stroke in low and midd... | backgroundthe questionnaire for verifying str... | 1. Introduction\n2. Methods\n2.1. Study sites\... |
| 2 | \n cardiovascular diseases account for the hig... | \n background : timely access to cardiovascul... | Introduction\nMethods\nResults\nDiscussion\nCo... |
| 3 | results of a liquid culturing system ( bd bact... | to determine differences in the ability of my... | The Study\nConclusions\nSupplementary Material |
| 4 | the need for magnetic resonance imaging ( mri ... | aimsour aim was to evaluate the potential for... | Introduction\nMethods\nPatient selection\nMagn... |


In [51]:
# Initialize regex tokenizer
tokenizer = RegexpTokenizer(r"\w+")

# Encode data with TF-IDF
tfidf = TfidfVectorizer(lowercase=True,
                        stop_words="english",
                        ngram_range=(1,1),
                        tokenizer=tokenizer.tokenize)
vectorized_articles = tfidf.fit_transform(df_articles["article"])

# Save vocab
vocab = tfidf.get_feature_names_out()



#### Step 2: Run LDA

In [59]:
# Instantiate LDA
lda = LDA(n_components=5)

# Run on vectorized_articles
X_topics = lda.fit_transform(vectorized_articles)
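The call above fixes `n_components=5` by hand and sets no `random_state`, so repeated runs can produce different topics. One rough way to compare candidate topic counts is the model's perplexity (lower is better), ideally measured on held-out documents. The sketch below is a minimal illustration under assumptions: the toy `docs` corpus, the candidate values of `k`, the seed `0`, and the use of in-sample perplexity for brevity are not part of the original analysis.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the PubMed articles (assumption).
docs = [
    "stroke patients were enrolled in the clinical study",
    "patients in the study group received treatment",
    "cells were cultured and stained for protein expression",
    "protein expression in cultured cells was measured",
    "mri imaging was performed on stroke patients",
    "imaging of cells showed protein localization",
]
X = CountVectorizer(stop_words="english").fit_transform(docs)

results = {}
for k in (2, 3, 4):
    # A fixed random_state makes the fit repeatable across runs.
    model = LatentDirichletAllocation(n_components=k, random_state=0)
    doc_topic = model.fit_transform(X)  # rows are per-document topic mixtures
    results[k] = model.perplexity(X)    # in-sample here; held-out data is preferable

for k, perplexity in results.items():
    print(f"k={k}: perplexity={perplexity:.1f}")
```

Perplexity does not always track how interpretable the topics are, so inspecting the top terms for each candidate `k` remains worthwhile.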


In [61]:
# Get topic-term weights, shape (n_topics, n_vocab)
topic_words = lda.components_

# Print the 10 highest-weighted terms for each topic
for i, topic_dist in enumerate(topic_words):
    top_terms = sorted(zip(vocab, topic_dist), key=lambda t: t[1], reverse=True)[:10]
    top_terms_list = [term for term, _ in top_terms]
    print(f"Topic {i + 1}: {top_terms_list}")

Topic 1: ['renalase', 'alp7', 'btv', 'angiokeratoma', 'bhasma', 'atfap1', 'fordyce', 'yashada', 'e2f3', 'npst']
Topic 2: ['ctspd', 'ang2', 'noaf', 'dkk3', 'drh', 'seinjoki', 'slominski', 'vaasa', 'killips', 'pirkanmaa']
Topic 3: ['patients', '1', '0', '2', 'study', 'cells', 'patient', '3', '5', 'group']
Topic 4: ['mews', 'pgsn', 'mypt1', 'urussovii', 'dactylogyrus', 'shwas', 'kuthar', 'sartor', 'tungiasis', 'bhuyan']
Topic 5: ['ifx', 'kibra', 'dgcd', 'mtwa', 'senps', 'pyrethrins', 'ceacam1', 'atrx', 'pertactin', 'magnaporthe']
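Since `lda.components_` is a dense array of shape `(n_topics, n_vocab)`, the top terms can also be pulled out with `np.argsort` instead of sorting the whole vocabulary for each topic. The helper below is a hypothetical sketch: `top_terms`, `toy_vocab`, and `toy_components` are illustrative names and values, not objects from this notebook.

```python
import numpy as np

def top_terms(components, vocab, n_terms=10):
    """Return the n_terms highest-weighted vocabulary entries for each topic row."""
    vocab = np.asarray(vocab)
    # argsort is ascending, so take the last n_terms columns and reverse them.
    top_idx = np.argsort(components, axis=1)[:, -n_terms:][:, ::-1]
    return [vocab[row].tolist() for row in top_idx]

# Toy topic-term weights: 2 topics over a 4-word vocabulary (assumption).
toy_vocab = ["cells", "group", "patients", "study"]
toy_components = np.array([
    [0.1, 2.0, 5.0, 3.0],
    [4.0, 0.2, 0.3, 1.0],
])

for i, terms in enumerate(top_terms(toy_components, toy_vocab, n_terms=2), start=1):
    print(f"Topic {i}: {terms}")
```

With the notebook's objects, the equivalent call would be `top_terms(lda.components_, vocab)`.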


In [62]:
# Assign each article its highest-probability topic (1-based, matching the topic printout)
article_topic = lda.transform(vectorized_articles)
df_articles["topic"] = article_topic.argmax(axis=1) + 1

In [63]:
df_articles.head()

| | article | abstract | section_names | topic |
|---|---|---|---|---|
| 0 | approximately , one - third of patients with s... | background and aim : there is lack of substan... | Introduction\nSubjects and Methods\nResults\nD... | 3 |
| 1 | there is an epidemic of stroke in low and midd... | backgroundthe questionnaire for verifying str... | 1. Introduction\n2. Methods\n2.1. Study sites\... | 3 |
| 2 | \n cardiovascular diseases account for the hig... | \n background : timely access to cardiovascul... | Introduction\nMethods\nResults\nDiscussion\nCo... | 3 |
| 3 | results of a liquid culturing system ( bd bact... | to determine differences in the ability of my... | The Study\nConclusions\nSupplementary Material | 3 |
| 4 | the need for magnetic resonance imaging ( mri ... | aimsour aim was to evaluate the potential for... | Introduction\nMethods\nPatient selection\nMagn... | 3 |
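All five rows shown fall under topic 3, so it may be worth checking how the assignments are spread across topics and how strongly each document leans toward its dominant topic. The sketch below uses a made-up `toy_doc_topic` matrix in place of the output of `lda.transform`; the names `dominant`, `confidence`, and `counts` are illustrative assumptions.

```python
import numpy as np

# Toy document-topic matrix: 4 documents, 5 topics, each row sums to 1 (assumption).
toy_doc_topic = np.array([
    [0.05, 0.05, 0.80, 0.05, 0.05],
    [0.10, 0.10, 0.60, 0.10, 0.10],
    [0.70, 0.05, 0.15, 0.05, 0.05],
    [0.02, 0.02, 0.92, 0.02, 0.02],
])

n_topics = toy_doc_topic.shape[1]
dominant = toy_doc_topic.argmax(axis=1) + 1   # 1-based topic label per document
confidence = toy_doc_topic.max(axis=1)        # weight of the dominant topic

# Count documents per topic; index 0 is unused because labels start at 1.
counts = np.bincount(dominant, minlength=n_topics + 1)[1:]

for topic, n_docs in enumerate(counts, start=1):
    print(f"Topic {topic}: {n_docs} documents")
print("Mean dominant-topic weight:", confidence.mean())
```

If nearly every document lands in a single topic, that usually points back to the preprocessing or the number of topics rather than to a genuine structure in the corpus.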
