## Topic Modeling with LDA

In this section we fit a topic model with Latent Dirichlet Allocation (LDA) to discover latent themes across a corpus of PubMed papers. Such a model may have practical value in the following ways:

1. Uncovering nontrivial relationships between disparate fields of research 
2. Organizing papers into useful categories
3. Navigating citations based on their usage in papers within and across categories
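
For reference, the generative process that LDA assumes can be written as follows. This uses the standard textbook notation, which is an addition here and is not used elsewhere in this notebook: $K$ topics, $D$ documents, topic-word distributions $\beta_k$, document-topic proportions $\theta_d$, and Dirichlet hyperparameters $\alpha$ and $\eta$.

$$
\begin{aligned}
\beta_k &\sim \mathrm{Dirichlet}(\eta), & k &= 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n} \mid z_{d,n}, \beta &\sim \mathrm{Categorical}(\beta_{z_{d,n}})
\end{aligned}
$$

Here $w_{d,n}$ is the $n$-th word of document $d$ and $z_{d,n}$ is its latent topic. Roughly speaking, in scikit-learn $K$ corresponds to `n_components`, each row returned by `fit_transform` estimates $\theta_d$, and each row of `components_` is an unnormalized estimate of $\beta_k$.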

#### Step 1: Import & Preprocess Data

In [76]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import RegexpTokenizer
from sklearn.decomposition import LatentDirichletAllocation as LDA
from datasets import load_dataset

In [77]:
# Load the PubMed configuration of scientific_papers from Hugging Face (training split)
articles = load_dataset("scientific_papers", "pubmed", split="train")
df_articles = pd.DataFrame(articles)


In [78]:
# Inspect df_articles
df_articles.head()

| | article | abstract | section_names |
|---|---|---|---|
| 0 | a recent systematic analysis showed that in 20... | background : the present study was carried ou... | INTRODUCTION\nMATERIALS AND METHODS\nParticipa... |
| 1 | it occurs in more than 50% of patients and may... | backgroundanemia in patients with cancer who ... | Introduction\nPatients and methods\nStudy desi... |
| 2 | tardive dystonia ( td ) , a rarer side effect ... | tardive dystonia ( td ) is a serious side eff... | INTRODUCTION\nCASE REPORT\nDISCUSSION\nDeclara... |
| 3 | lepidoptera include agricultural pests that , ... | many lepidopteran insects are agricultural pe... | 1. Introduction\n2. Insect Immunity\n3. Signal... |
| 4 | syncope is caused by transient diffuse cerebra... | we present an unusual case of recurrent cough... | Introduction\nCase report\nDiscussion\nConflic... |


In [80]:
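# Initialize regex tokenizer: keeps lowercase words (letters, hyphens, apostrophes)
# bounded by whitespace, string edges, or common punctuation.
# A raw string is used so escapes such as \s reach the regex engine unchanged.
tokenizer = RegexpTokenizer(
    r"(?:(?<=\s)|(?<=^)|(?<=[>\"]))[a-z-']+(?:(?=\s)|(?=\:\s)|(?=$)|(?=[.!,;\"]))"
)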

# Encode data with TF-IDF
tfidf = TfidfVectorizer(lowercase=True,
                        stop_words="english",
                        max_df=0.95,
                        min_df=2,
                        max_features=1000,
                        tokenizer=tokenizer.tokenize)
vectorized_articles = tfidf.fit_transform(df_articles["article"])

# Save vocab
vocab = tfidf.get_feature_names_out()
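
To see what the tokenizer pattern keeps and drops, it can help to run it on a short string. The sketch below applies the same pattern with `re.findall`, assuming that this mirrors how `RegexpTokenizer` matches tokens by default; the helper name `tokenize` and the sample sentence are hypothetical. The text is lowercased first because `TfidfVectorizer(lowercase=True)` lowercases documents before calling the tokenizer.

```python
import re

# Same pattern as the notebook's RegexpTokenizer, written as a raw string.
TOKEN_PATTERN = r"(?:(?<=\s)|(?<=^)|(?<=[>\"]))[a-z-']+(?:(?=\s)|(?=\:\s)|(?=$)|(?=[.!,;\"]))"


def tokenize(text):
    """Lowercase the text, then return every non-overlapping pattern match."""
    return re.findall(TOKEN_PATTERN, text.lower())


# Hypothetical sample sentence, for illustration only.
sample = "The T-cell count was 50% higher; p < 0.05."
tokens = tokenize(sample)
print(tokens)
```

Tokens containing digits or symbols (`50%`, `<`, `0.05.`) are dropped entirely, while hyphenated words such as `t-cell` survive as single tokens. Single letters such as `p` also pass through, which is consistent with the one-letter terms that show up in some of the topics.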

#### Step 2: Run LDA

In [81]:
# Instantiate LDA
lda = LDA(n_components=10)

# Run on vectorized_articles
X_topics = lda.fit_transform(vectorized_articles)
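
Two variations may be worth trying here; neither is what this notebook does. First, LDA's generative model is defined over word counts, and scikit-learn's own topic-extraction example feeds LDA raw term counts instead of TF-IDF weights. Second, without a fixed `random_state` the fitted topics can change from run to run. The sketch below shows both on a tiny made-up corpus; `docs`, `lda_counts`, and the choice of two topics are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, for illustration only).
docs = [
    "tumor lesion biopsy tumor",
    "insulin diabetes glucose insulin",
    "tumor biopsy lesion mass",
    "glucose insulin diabetes risk",
]

# Raw term counts instead of TF-IDF weights.
counts = CountVectorizer().fit_transform(docs)

# random_state fixes the initialization so repeated runs give the same topics.
lda_counts = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda_counts.fit_transform(counts)

# Refit with the same seed to check that the result is reproducible.
doc_topic_again = LatentDirichletAllocation(
    n_components=2, random_state=0
).fit_transform(counts)

print(doc_topic.round(2))
```

Each row of `doc_topic` is a distribution over topics for one document, so the rows sum to 1.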

In [82]:
# Get topic-term weights (one row per topic, one column per vocabulary term)
topic_words = lda.components_

# Print the 10 highest-weighted terms for each topic
for i, topic_dist in enumerate(topic_words):
    top_terms = sorted(zip(vocab, topic_dist), key=lambda t: t[1], reverse=True)[:10]
    top_terms_list = [str(term) for term, _ in top_terms]
    print(f"Topic {i + 1}: {top_terms_list}")

Topic 1: ['artery', 'patient', 'left', 'patients', 'right', 'cardiac', 'pulmonary', 'coronary', 'pressure', 'abdominal']
Topic 2: ['patient', 'patients', 'mg', 'infection', 'case', 'disease', 'treatment', 'dl', 'day', 'cases']
Topic 3: ['tumor', 'lesion', 'tumors', 'lesions', 'ct', 'patient', 'mass', 'diagnosis', 'cases', 'case']
Topic 4: ['health', 'study', 'patients', 'care', 'children', 'participants', 'students', 'data', 'age', 'women']
Topic 5: ['surface', 'teeth', 'figure', 'using', 'mm', 'tooth', 'used', 'root', 'values', 'water']
Topic 6: ['cells', 'cell', 'expression', 'mice', 'rats', 'inflammatory', 'induced', 't', 'anti', 'levels']
Topic 7: ['gene', 'genes', 'al', 'et', 'protein', 'proteins', 'expression', 'cells', 'cell', 'dna']
Topic 8: ['samples', 'm', 'ml', 'pcr', 'c', 'l', 'isolates', 'h', 'dna', 'g']
Topic 9: ['patients', 'study', 'diabetes', 'p', 'group', 'risk', 'insulin', 'levels', 'subjects', 'age']
Topic 10: ['patients', 'surgery', 'pain', 'group', 'study', 'patie
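
The rows of `lda.components_` are unnormalized weights (scikit-learn describes them as pseudocounts), so they rank terms within a topic but are not probabilities. A minimal sketch of turning them into per-topic word distributions follows, using a small made-up matrix in place of the fitted model; `toy_components` and `toy_vocab` are hypothetical stand-ins.

```python
import numpy as np

# Hypothetical stand-in for lda.components_: 2 topics x 4 vocabulary terms.
toy_components = np.array([
    [6.0, 3.0, 0.5, 0.5],
    [0.2, 0.8, 5.0, 4.0],
])
toy_vocab = np.array(["tumor", "lesion", "insulin", "glucose"])

# Divide each row by its sum so every topic becomes a probability distribution.
topic_word_probs = toy_components / toy_components.sum(axis=1, keepdims=True)

# Indices of the top-2 terms per topic, highest weight first.
top_idx = np.argsort(topic_word_probs, axis=1)[:, ::-1][:, :2]
for k, idx in enumerate(top_idx):
    print(f"Topic {k + 1}: {toy_vocab[idx].tolist()}")
```

Normalizing does not change the ranking of terms within a topic; it only makes the weights comparable across topics.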

In [84]:
# Assign each article its highest-weighted topic
# (1-based, to match the topic numbers printed above)
article_topic = lda.transform(vectorized_articles)
df_articles["topic"] = article_topic.argmax(axis=1) + 1
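
Taking the argmax discards how strongly an article belongs to its assigned topic. The sketch below, on a made-up document-topic matrix, keeps the weight of the dominant topic alongside the label and counts articles per topic; `toy_doc_topic`, `dominant`, and `strength` are hypothetical names, not part of the notebook.

```python
import numpy as np

# Hypothetical document-topic matrix: 4 articles x 3 topics, rows sum to 1.
toy_doc_topic = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.15, 0.75],
    [0.34, 0.33, 0.33],
    [0.05, 0.90, 0.05],
])

# Dominant topic per article (1-based, matching the notebook's labels).
dominant = toy_doc_topic.argmax(axis=1) + 1

# Weight of the dominant topic: a low value flags an ambiguous assignment.
strength = toy_doc_topic.max(axis=1)

# Number of articles assigned to each topic (bin 0 is unused, so drop it).
counts = np.bincount(dominant, minlength=toy_doc_topic.shape[1] + 1)[1:]

print(dominant.tolist(), strength.tolist(), counts.tolist())
```

The third article is labeled topic 1 with a weight of only 0.34, which is the kind of near-tie that a single label hides.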

In [85]:
df_articles.head()

| | article | abstract | section_names | topic |
|---|---|---|---|---|
| 0 | a recent systematic analysis showed that in 20... | background : the present study was carried ou... | INTRODUCTION\nMATERIALS AND METHODS\nParticipa... | 4 |
| 1 | it occurs in more than 50% of patients and may... | backgroundanemia in patients with cancer who ... | Introduction\nPatients and methods\nStudy desi... | 9 |
| 2 | tardive dystonia ( td ) , a rarer side effect ... | tardive dystonia ( td ) is a serious side eff... | INTRODUCTION\nCASE REPORT\nDISCUSSION\nDeclara... | 2 |
| 3 | lepidoptera include agricultural pests that , ... | many lepidopteran insects are agricultural pe... | 1. Introduction\n2. Insect Immunity\n3. Signal... | 7 |
| 4 | syncope is caused by transient diffuse cerebra... | we present an unusual case of recurrent cough... | Introduction\nCase report\nDiscussion\nConflic... | 1 |
