# Building a Latent Dirichlet Allocation topic model for PubMed articles

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime, re, string, timeit, nltk, gensim
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import sentiwordnet as swn
from nltk.corpus.reader.wordnet import WordNetError
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn.decomposition import NMF, LatentDirichletAllocation

In [4]:
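df = pd.read_csv("pubmed_cleaned.csv")
# drop rows without an abstract and reset the index so that row positions
# in the vectorized matrices line up with the index of the abstracts
df = df[df['Clean_Abstract'].notnull()].reset_index(drop=True)

# get the abstracts for analysis
abstracts = df['Clean_Abstract']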

I will try two different approaches to discern topics within our dataset: Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF).

What do these approaches do? At a high level, each returns the words that characterize every topic in a corpus and the documents that belong to each topic. LDA uses probabilistic graphical modeling; NMF uses linear algebra.

LDA works on a count-vectorized (raw term count) matrix, whereas NMF can also use a TF-IDF matrix.

The ultimate goal is to produce two smaller matrices: one that maps documents to themes, and another that holds the words for each theme. For NMF, multiplying the two together approximately reproduces the bag-of-words matrix with the lowest reconstruction error; LDA's document-topic and topic-word distributions play the analogous roles.
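
To make the two-matrix idea concrete, here is a minimal sketch on a made-up matrix of 4 documents by 5 words (the numbers are invented for illustration and are not taken from the PubMed data). It factorizes the matrix with scikit-learn's NMF and measures how closely the product of the two factors reproduces the original.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy document-term matrix: rows are documents, columns are words.
# Documents 0-1 only use words 0-2, documents 2-3 only use words 3-4,
# so two "themes" are enough to describe it.
X = np.array([
    [2, 1, 1, 0, 0],
    [4, 2, 2, 0, 0],
    [0, 0, 0, 1, 3],
    [0, 0, 0, 2, 6],
], dtype=float)

model = NMF(n_components=2, init='nndsvd', random_state=0, max_iter=500)
W = model.fit_transform(X)   # documents x themes
H = model.components_        # themes x words

# Relative reconstruction error of the product W @ H
rel_error = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print("W shape:", W.shape, "H shape:", H.shape)
print("relative reconstruction error: {:.4f}".format(rel_error))
```

Each row of W says how strongly a document expresses each theme, and each row of H says how strongly a theme uses each word.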

In [11]:
n_features = 5000

# NMF is able to use tf-idf
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=0.01, 
                                   max_features=n_features, 
                                   stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(abstracts)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()

# LDA needs raw term counts because it is a probabilistic model of word counts
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=0.01, 
                                max_features=n_features, 
                                stop_words='english')

tf = tf_vectorizer.fit_transform(abstracts)
tf_feature_names = tf_vectorizer.get_feature_names_out()  # replaces get_feature_names()
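
As a quick illustration of the difference between the two inputs, the sketch below vectorizes three made-up documents (not real abstracts) both ways. The count matrix holds integer term frequencies, which is what LDA expects, while the TF-IDF matrix holds reweighted values and, with the default settings, every row is scaled to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Invented toy documents, used only to show the shape of the two matrices
toy_docs = [
    "tumor survival tumor",
    "insulin glucose insulin insulin",
    "tumor glucose",
]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(toy_docs)     # sparse matrix of integer counts

tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(toy_docs)    # sparse matrix of TF-IDF weights

# Both vectorizers learn the same vocabulary; only the cell values differ
print(sorted(count_vec.vocabulary_))
print(counts.toarray())
print(np.round(weights.toarray(), 2))

# Row lengths of the TF-IDF matrix (L2 norm is the default normalization)
row_norms = np.linalg.norm(weights.toarray(), axis=1)
print(row_norms)
```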

My aim here is to see whether the search term used to retrieve an article from PubMed is indeed a theme in the returned abstract. If it is, we have some evidence that a search for a disease returns papers that are actually about that disease. If not, new data will have to be scraped, either with fewer returned results per search (results are ordered by relevance) or with more specific search terms.

My hope is that there will be ~33 topics, as that is the number of search terms used to obtain the data.

In [22]:
n_topics = df['disease'].nunique()
print("Number of search terms: {}".format(n_topics))

Number of search terms: 33


In [29]:
# h is the topics-to-words matrix, shape (n_topics, n_features)
# w is the documents-to-topics matrix, shape (n_documents, n_topics)

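# Run NMF
# note: `alpha` was deprecated in scikit-learn 1.0 and removed in 1.2
# (replaced by alpha_W / alpha_H, whose regularization is scaled differently)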
nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5,
          init='nndsvd').fit(tfidf)

nmf_w = nmf.transform(tfidf)
nmf_h = nmf.components_

# Run LDA
lda = LatentDirichletAllocation(n_components=n_topics, max_iter=5, 
                                learning_method='online', learning_offset=50.,
                                random_state=0).fit(tf)

lda_w = lda.transform(tf)
lda_h = lda.components_
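
One possible way to check the aim stated earlier (does each search term map onto its own topic?) is to assign every document to its highest-weight topic and cross-tabulate that against the search term. The sketch below uses small hypothetical stand-ins: `doc_topic` plays the role of `nmf_w` or `lda_w`, and `search_terms` plays the role of `df['disease']`. The values and the `purity` measure are assumptions made for illustration, not results from this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical documents-to-topics weights (stand-in for nmf_w or lda_w)
doc_topic = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.0, 0.2],
    [0.1, 0.7, 0.2],
    [0.0, 0.6, 0.4],
    [0.2, 0.1, 0.7],
    [0.3, 0.5, 0.2],
])
# Hypothetical search terms (stand-in for df['disease'])
search_terms = pd.Series(
    ["asthma", "asthma", "diabetes", "diabetes", "gout", "gout"],
    name="disease",
)

# Dominant topic per document: the column with the largest weight in each row
dominant_topic = pd.Series(doc_topic.argmax(axis=1), name="dominant_topic")

# Rows are search terms, columns are topics, cells are document counts
table = pd.crosstab(search_terms, dominant_topic)
print(table)

# Share of each search term's documents that land in its most common topic;
# values near 1 suggest the search term corresponds to a single topic
purity = table.max(axis=1) / table.sum(axis=1)
print(purity)
```

In the notebook itself, the same two lines would be applied to `nmf_w.argmax(axis=1)` (or `lda_w.argmax(axis=1)`) and `df['disease']`, provided the rows of both are in the same order.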

In [34]:
# from 
# https://medium.com/mlreview/topic-modeling-with-scikit-learn-e80d33668730
        
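def display_topics(h, w, feature_names, documents, n_top_words, n_top_documents):
    """Print the top words and the top documents for every topic."""
    for ix, topic in enumerate(h):
        print("Topic {}:".format(ix))
        print()
        # indices of the n_top_words largest weights, in descending order
        print(" ".join([feature_names[i]
                       for i in topic.argsort()[:-n_top_words - 1:-1]]))
        # row positions of the documents with the highest weight for this topic
        top_doc_indices = np.argsort(w[:, ix])[::-1][0:n_top_documents]
        print()
        for doc_index in top_doc_indices:
            # .iloc looks up by position, which matches the matrix row order
            print(documents.iloc[doc_index])
            print()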

n_top_words = 15
n_top_documents = 2
display_topics(nmf_h, nmf_w, tfidf_feature_names, abstracts, n_top_words, n_top_documents)
display_topics(lda_h, lda_w, tf_feature_names, abstracts, n_top_words, n_top_documents)

Topic 0:

patient wa survival year month median stage rate treated surgery overall outcome follow disease age

based ajcc seventh tnm classification intraglandular tumor subdivided t1a ≤10 t1b difference prognosis remain controversial present study aimed determine clinicopathological feature outcome t1a t1b patient retrospective study patient including t1a t1b patient underwent surgery ptc wa conducted patient preoperative operative diagnosis ptc total thyroidectomy prophylactic macroscopically therapeutic evident lymph node dissection lnd wa performed patient partial thyroidectomy without lnd mean follow time wa year median year range year wa performed patient including lnd patient single lobectomy isthmectomy multifocality bilaterality number tumor sum largest size focus vascular invasion patient lnd metastasis significantly frequent t1b t1a patient patient lnd metastasis including t1a t1b patient metastasis diagnosed prophylactic lnd t1a t1b patient recurrence frequent t1b t1a patie