## Topic Modeling with LDA

In this section we develop a topic model using Latent Dirichlet Allocation (LDA) to discover latent themes across papers. Such a model may be useful in three ways:

1. Uncovering nontrivial relationships between disparate fields of research 
2. Organizing papers into useful categories
3. Navigating citations based on their usage in papers within & across categories

#### Step 1: Import & Preprocess Data

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from datasets import load_dataset

In [7]:
# Load the validation split of the PubMed subset of scientific_papers from Hugging Face
df = load_dataset("scientific_papers", "pubmed", split="validation")

Downloading and preparing dataset scientific_papers/pubmed to /Users/mattroth/.cache/huggingface/datasets/scientific_papers/pubmed/1.1.1/306757013fb6f37089b6a75469e6638a553bd9f009484938d8f75a4c5e84206f...



Dataset scientific_papers downloaded and prepared to /Users/mattroth/.cache/huggingface/datasets/scientific_papers/pubmed/1.1.1/306757013fb6f37089b6a75469e6638a553bd9f009484938d8f75a4c5e84206f. Subsequent calls will reuse this data.


In [12]:
# Preview the first 10 records as a DataFrame
pd.DataFrame(df[0:10])

| | article | abstract | section_names |
|---|---|---|---|
| 0 | approximately , one - third of patients with s... | background and aim : there is lack of substan... | Introduction\nSubjects and Methods\nResults\nD... |
| 1 | there is an epidemic of stroke in low and midd... | backgroundthe questionnaire for verifying str... | 1. Introduction\n2. Methods\n2.1. Study sites\... |
| 2 | \n cardiovascular diseases account for the hig... | \n background : timely access to cardiovascul... | Introduction\nMethods\nResults\nDiscussion\nCo... |
| 3 | results of a liquid culturing system ( bd bact... | to determine differences in the ability of my... | The Study\nConclusions\nSupplementary Material |
| 4 | the need for magnetic resonance imaging ( mri ... | aimsour aim was to evaluate the potential for... | Introduction\nMethods\nPatient selection\nMagn... |
| 5 | treatment with statins is highly effective in ... | backgroundstatin use is frequently associated... | Background\nMaterial and Methods\nPatients\nSt... |
| 6 | most adults with autoimmune diabetes non - ins... | objectivesthe optimal treatment of latent aut... | Introduction\nObjective\nSubjects and methods\... |
| 7 | anemia is a global health problem and an ind... | epidemiological evidence suggests that circul... | 1. Introduction\n2. Methods\n3. Results\n4. Di... |
| 8 | following a clinically predictable progression... | the lack of effective \n therapies for bone m... | Introduction\nMaterials\nand Methods\nResults\... |
| 9 | organisms often employ more than one mechanism... | to perform tasks , organisms often use multip... | Introduction\nMaterials and methods\nResults\n... |
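
The preview shows text with spaced-out punctuation (` , `, ` - `) and embedded newlines, so some cleanup is likely needed before modeling. LDA works on raw term counts per document, and the sketch below illustrates one way to get from text like this to a count matrix using only the standard library. It roughly mirrors the default tokenization of `CountVectorizer` (lowercasing, keeping tokens of two or more word characters). The `STOP_WORDS` set, the helper names `tokenize` and `build_count_matrix`, and the two sample strings are assumptions made up for illustration, not part of the dataset or the original notebook.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"the", "of", "and", "in", "to", "is", "with", "for"}

# Tokens of two or more word characters; punctuation and newlines are skipped.
TOKEN_RE = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text, extract word tokens, and drop stop words."""
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def build_count_matrix(docs):
    """Return (vocabulary, rows); rows[i][j] counts vocabulary[j] in docs[i]."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows


# Made-up sample strings imitating the spacing and newlines seen in the preview.
sample_docs = [
    "stroke risk , stroke care \n in adults",
    "statins reduce stroke risk",
]
vocabulary, rows = build_count_matrix(sample_docs)
print(vocabulary)
print(rows)
```

In the notebook itself, the same step would more likely be done by fitting the already-imported `CountVectorizer` on the `article` column; the hand-rolled version is shown only to make the transformation explicit.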


#### Title