## Topic Modeling 

Topic modeling is an unsupervised machine learning problem. There are many algorithms for topic modeling, but Latent Dirichlet Allocation (LDA) is the most commonly used one. LDA assumes that each document is a mixture of topics and that each topic is a probability distribution over words [3]. It starts by assigning a topic to each word with a certain probability, and then repeats this word-to-topic assignment until the algorithm reaches a stable point where the assignments fit the documents reasonably well.
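To make the "mixture of topics" idea concrete, here is a minimal sketch on a toy corpus. The four documents and the choice of two topics are assumptions made up for illustration; they are not part of the dataset used later.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus, not taken from the dataset used below.
toy_docs = [
    "liver cancer patient treatment",
    "cancer patient liver metastasis",
    "network algorithm simulation model",
    "algorithm model network method",
]

# Turn the documents into a document-term count matrix.
toy_counts = CountVectorizer().fit_transform(toy_docs)

# Fit LDA with an assumed number of topics (2).
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)

# Each row is one document's mixture over the 2 topics.
toy_doc_topics = toy_lda.fit_transform(toy_counts)
print(toy_doc_topics.round(2))
```

Each row of `toy_doc_topics` sums to 1, which is what "a document is a mixture of topics" means in practice.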

In this project, we will categorize research papers into topics using the LDA algorithm. For simplicity, we will only look at the abstracts of the papers. Each topic will have a numeric label: 0, 1, 2, and so on. We will print the top words in each topic and also the most relevant papers in that topic.

We will use the scikit-learn library for this. Let's start.

In [1]:
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt


## Let's see what the data looks like

I am using the Open Research Corpus dataset from https://labs.semanticscholar.org/corpus/. The full dataset is 36GB. For simplicity, I am using the sample dataset, which is only 90KB.

In [2]:
filename = "sample-S2-records"
df = pd.read_json(filename, lines=True)
df.head()

Unnamed: 0,authors,doi,doiUrl,entities,id,inCitations,journalName,journalPages,journalVolume,outCitations,paperAbstract,pdfUrls,pmid,s2PdfUrl,s2Url,sources,title,venue,year
0,"[{'ids': ['38280253'], 'name': 'Kate Jack'}]",,,[Jack Device Component],f2320c08c7d95bbf8bb72e4d6deaa6845ea4cf27,[],Nursing times,26,109 49-50,[],,[],24568020v1,,https://semanticscholar.org/paper/f2320c08c7d9...,[Medline],60 seconds with Kate Jack.,Nursing times,2013.0
1,"[{'ids': ['5862934'], 'name': 'W N Spellacy'},...",,,"[Decision Making, Laboratory Certification Doc...",5432a99cdd9f8b248c50274cd3d2a6016f3d081e,[],The Journal of reproductive medicine,127-30,31 2,[],The search for new administrators in complex s...,[],3514907v1,,https://semanticscholar.org/paper/5432a99cdd9f...,[Medline],Organizing a search for an academic administra...,The Journal of reproductive medicine,1986.0
2,"[{'ids': ['39900230'], 'name': 'Stefanie Ernst...",,,"[Annexin A1, Annexins, Bacterial Infections, C...",155663331ea93379e99997bd43340eb54ab41a73,"[3738fad17126054f03cfe736b7156b6d6eef0481, 927...",Journal of immunology,7669-76,172 12,"[c2b53b26c004fe57e85424df6ad101d283150648, d30...",The human N-formyl peptide receptor (FPR) is a...,[http://www.jimmunol.org/content/jimmunol/172/...,15187149v1,http://pdfs.semanticscholar.org/cb73/147dc0bf1...,https://semanticscholar.org/paper/155663331ea9...,[Medline],An annexin 1 N-terminal peptide activates leuk...,Journal of immunology,2004.0
3,"[{'ids': ['1801874'], 'name': 'S Yamamoto'}, {...",,,"[Adrenal Cortex Hormones, Bladder Neoplasm, Ca...",b5a25960ebee9a6e5db79196e6b07f0edfcf5313,"[8bcedf8512f672310326a6cc0ec897939d28c6d1, 8b7...",Nihon Rinsho Men'eki Gakkai kaishi = Japanese ...,128-35,19 2,[],Serum CA 19-9 (2-3 sialyl Le(a)) is a marker o...,[],8705689v1,,https://semanticscholar.org/paper/b5a25960ebee...,[Medline],[Serum CA 19-9 levels in rheumatic diseases wi...,Nihon Rinsho Men'eki Gakkai kaishi = Japanese ...,1996.0
4,"[{'ids': ['14380299'], 'name': 'Edwards'}, {'i...",,,"[Cell Nucleus, Dependence, Nucleic Acids]",3b7538465b0559e2d3ff2b65991c8e399e457822,[],"Physical review. A, Atomic, molecular, and opt...",2709-2717,44 4,[],,[],9906253v1,,https://semanticscholar.org/paper/3b7538465b05...,[Medline],Sequence dependence of low-frequency Raman-act...,"Physical review. A, Atomic, molecular, and opt...",1991.0


We only care about the abstract column, so let's extract it.

In [3]:
abstract = df['paperAbstract']

# Keep only the non-empty abstracts; papers without an abstract
# are stored as empty strings in this dataset.
absData = [text for text in abstract if text != '']

#print(len(absData))
absData

['The search for new administrators in complex systems is an important activity. The special requirements of academic organizations, particularly those with health centers, present some unique considerations that can confound this important and difficult process. Typically, national searches attract a sizable candidate list composed of persons with diverse backgrounds and experiences, and a committee is empowered to sort through their qualifications. A critical step in the planning of each search is the development of a process that allows participatory decision making while not requiring too much time. Too often the search becomes an unmanageable activity that confuses the searchers and frustrates the administration. A seven-step process has proven successful for use by committees to attract and sort through written candidate applications, to agree upon a preliminary ranking of candidates and to reach a consensus on a final list of recommendations. The process could be applied in almo

## Data Cleanup

The data cleanup code is courtesy of Susan Li [2].

In [4]:
from spacy.lang.en import English
# A blank English pipeline is enough here: we only use its tokenizer.
parser = English()
def tokenize(text):
    lda_tokens = []
    tokens = parser(text)
    for token in tokens:
        if token.orth_.isspace():
            continue
        elif token.like_url:
            lda_tokens.append('URL')
        elif token.orth_.startswith('@'):
            lda_tokens.append('SCREEN_NAME')
        else:
            lda_tokens.append(token.lower_)
    return lda_tokens

In [5]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /Users/nahalam/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [6]:
from nltk.corpus import wordnet as wn
def get_lemma(word):
    lemma = wn.morphy(word)
    if lemma is None:
        return word
    else:
        return lemma
    
from nltk.stem.wordnet import WordNetLemmatizer
def get_lemma2(word):
    return WordNetLemmatizer().lemmatize(word)

In [7]:
nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/nahalam/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [8]:
def prepare_text_for_lda(text):
    tokens = tokenize(text)
    tokens = [token for token in tokens if len(token) > 4]
    tokens = [token for token in tokens if token not in en_stop]
    tokens = [get_lemma(token) for token in tokens]
    return tokens

#### Clean abstract data using the above techniques

In [9]:
textTokens = []
for item in absData:
    tokens = prepare_text_for_lda(item)
    #print(tokens)
    #convert the tokens from a list of strings to a string and then append it to textToken
    temp = " ".join(tokens)
    textTokens.append(temp)
textTokens


['search administrator complex system important activity special requirement academic organization particularly health center present unique consideration confound important difficult process typically national search attract sizable candidate compose person diverse background experience committee empower qualification critical planning search development process allow participatory decision making require often search become unmanageable activity confuse searcher frustrate administration seven process prove successful committee attract write candidate application agree preliminary ranking candidate reach consensus final recommendation process could apply almost organizational setting',
 'human formyl peptide receptor modulator chemotaxis direct granulocyte toward site bacterial infection founding member subfamily protein couple receptor thought function inflammatory process member fprl)1 fprl2 greatly reduce affinity bacterial peptide fprl2 consider orphan receptor study peptide deriv

We will use the scikit-learn API to apply the LDA algorithm to our dataset.

First, we will use scikit-learn's `CountVectorizer` class to convert the collection of text documents into a matrix of token counts. This class has many parameters (surprise!), and we have to think carefully when choosing them. See the CountVectorizer documentation for details: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

The `max_features` parameter of `CountVectorizer` is a critical one. It keeps only the most frequent terms across the corpus, so you have to choose it so that the vectorizer doesn't include less frequent words. Those less frequent words create unwanted dimensions that make the topics harder to tell apart. More on how to choose `max_features` is here https://stackoverflow.com/questions/46118910/scikit-learn-vectorizer-max-features but a high-level approach is to pick a lower number if you have a smaller dataset and a higher number if you have a larger one.

`max_df`: When building the vocabulary, ignore terms that have a document frequency strictly higher than `max_df`. A float is read as a proportion of documents and an integer as an absolute count.

`min_df`: When building the vocabulary, ignore terms that have a document frequency strictly lower than `min_df`. A float is read as a proportion of documents and an integer as an absolute count.
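As a small illustration of how `max_df` and `min_df` prune the vocabulary, consider the sketch below. The toy corpus and the thresholds are assumptions chosen so the effect is easy to see; they are not the values used in this project.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus for illustration only.
toy_docs = [
    "study liver cancer",
    "study liver protein",
    "study network model",
    "study network algorithm",
]

# "study" appears in 4 of 4 documents (more than max_df=0.9), so it is dropped.
# "cancer", "protein", "model" and "algorithm" appear in only 1 document
# (fewer than min_df=2), so they are dropped as well.
toy_vectorizer = CountVectorizer(max_df=0.9, min_df=2)
toy_vectorizer.fit(toy_docs)

toy_vocab = sorted(toy_vectorizer.vocabulary_)
print(toy_vocab)  # ['liver', 'network']
```

Only the words that are neither too common nor too rare survive, which is the behavior we want before handing the counts to LDA.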



In [10]:
from sklearn.feature_extraction.text import CountVectorizer

#no_features will be used as the max_features parameter in creating a CountVectorizer object later. 
no_features = 100

tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')
#training. fit_transform returns a Document-term matrix.
tf = tf_vectorizer.fit_transform(textTokens)
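#Array mapping from feature integer indices to feature name
#(get_feature_names_out requires scikit-learn 1.0 or later; older versions used get_feature_names)
tf_feature_names = tf_vectorizer.get_feature_names_out()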

In [11]:
from sklearn.decomposition import LatentDirichletAllocation

#how many topics we want to find
no_topics = 20
# Run LDA
# http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html
# Note: the number of topics is passed as n_components (older scikit-learn releases called it n_topics).
lda = LatentDirichletAllocation(n_components=no_topics, max_iter=5, learning_method='online', learning_offset=50., random_state=0).fit(tf)



## Evaluating Topics

In [12]:
def display_topics_old(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print ("Topic %d:" % (topic_idx))
        print (" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))

no_top_words = 10
display_topics_old(lda, tf_feature_names, no_top_words)

Topic 0:
metastasis liver patient medicine medical academic protein result control study
Topic 1:
paper phase compare significantly model consider positive structure propose algorithm
Topic 2:
child immunize result symptom study measure compare present consider suggest
Topic 3:
paper method simulation different consider obtain improve using problem respectively
Topic 4:
lower model increase concentration decrease state expression measure effect years
Topic 5:
pressure decrease induce activity expression group model disease control different
Topic 6:
level energy determine simulation different state phase method positive algorithm
Topic 7:
ratio expression population concentration result plasma factor activity species using
Topic 8:
treatment peptide result study disease present activity effect consider local
Topic 9:
phase result medicine state findings years reveal structure ratio control
Topic 10:
control network presence simulation stroke result model obtain method propose
Topic 11:
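The slice `topic.argsort()[:-no_top_words - 1:-1]` in `display_topics_old` can look cryptic. Here is a minimal sketch of what it does, using a made-up five-word vocabulary and made-up topic weights (both are assumptions for illustration).

```python
import numpy as np

# Hypothetical weights of one topic over a 5-word vocabulary.
toy_feature_names = ["liver", "model", "patient", "phase", "study"]
toy_topic = np.array([0.9, 0.1, 0.7, 0.05, 0.3])
n_top = 3

# argsort() orders the indices from smallest to largest weight;
# the slice [:-n_top - 1:-1] walks backwards from the end and takes n_top of them,
# so we get the indices of the n_top largest weights, largest first.
top_idx = toy_topic.argsort()[:-n_top - 1:-1]
top_words = [toy_feature_names[i] for i in top_idx]
print(top_words)  # ['liver', 'patient', 'study']
```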

## Displaying both Top Words and Documents in a Topic


In [13]:
def display_topics(H, W, feature_names, documents, no_top_words, no_top_documents):
    for topic_idx, topic in enumerate(H):
        print ("Topic %d:" % (topic_idx))
        print (" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
        top_doc_indices = np.argsort( W[:,topic_idx] )[::-1][0:no_top_documents]
        for doc_index in top_doc_indices:
            print("Documents")
            print (documents[doc_index])

In [14]:
lda = LatentDirichletAllocation(n_components=no_topics, max_iter=5, learning_method='online', learning_offset=50., random_state=0)
lda_model = lda.fit(tf)
#document-topic matrix: one row per document, one column per topic
lda_W = lda_model.transform(tf)

#topic-word matrix: one row per topic, one column per vocabulary word
lda_H = lda_model.components_

no_top_words = 4
no_top_documents = 4
display_topics(lda_H, lda_W, tf_feature_names, textTokens, no_top_words, no_top_documents)

Topic 0:
metastasis liver patient medicine
Documents
background liver common target organ metastasis colorectal cancer synchronous liver metastasis confer poor prognosis metachronous metastasis genetic alteration inflammatory response associate prognosis case liver metastasis arise however study examine relationship mutation inflammatory status especially respect liver metastasis method effect activate mitogen activate protein kinase pathway another protein involve inflammation reactive protein liver metastasis examine aim determine impact specific single nucleotide polymorphism rs7553007 liver metastasis specific survival patient colorectal liver metastasectomy result found significant difference genotype distribution allele frequency rs7553007 patient liver metastasis control group rates subgroup patient synchronous metastasis allele rs7553007 mutate liver metastatic specimen furthermore rs7553007 hazard ratio 1.101 confidence interval 1.011 1.200 0.027 mutation 2.377 1.293 4.368 0.0
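The way `display_topics` picks the most relevant documents for a topic can be sketched on a tiny document-topic matrix. The matrix below is invented for illustration and is not the output of the model above.

```python
import numpy as np

# Hypothetical document-topic matrix: 4 documents (rows) x 2 topics (columns).
toy_W = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
    [0.3, 0.7],
])
n_top_docs = 2

top_docs_per_topic = {}
for topic_idx in range(toy_W.shape[1]):
    # Sort documents by their weight for this topic, highest first,
    # and keep the first n_top_docs indices.
    top_docs = np.argsort(toy_W[:, topic_idx])[::-1][:n_top_docs]
    top_docs_per_topic[topic_idx] = top_docs.tolist()

print(top_docs_per_topic)  # {0: [0, 2], 1: [1, 3]}
```

Documents 0 and 2 carry the most weight for topic 0, and documents 1 and 3 for topic 1, so those are the ones that would be printed under each topic.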



In [18]:
#save the model for future use
import pickle

filename = 'ldaModel.sav'
# Use a context manager so the file is closed after writing.
with open(filename, 'wb') as f:
    pickle.dump(lda_model, f)

## Visualization

We will use pyLDAvis (https://pyldavis.readthedocs.io/en/latest/) for visualizing the topics.


In [22]:
import pyLDAvis
# In pyLDAvis 3.4 and later this module is named pyLDAvis.lda_model.
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

# load the model from disk
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)

pyLDAvis.sklearn.prepare(loaded_model, tf, tf_vectorizer)




## Ways to Improve

#### Choosing the Right Parameter
We can apply a methodical approach when choosing the `max_features` parameter of the `CountVectorizer` class. If we know the word frequencies across the documents, we can choose a number that ensures we don't include less frequent words in our trained model. As described in [4]: "imagine you have set a threshold of 50, and your data corpus consists of 100 words. After looking at the word frequencies 20 words occur less than 50 times. Thus, you set max_features=80 and you are good to go".
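The frequency-based approach described above could be sketched as follows. The toy documents and the `min_count` threshold are assumptions for illustration; in this project the input would be the cleaned abstracts in `textTokens`.

```python
from collections import Counter

# Hypothetical cleaned documents and threshold, for illustration only.
toy_docs = [
    "liver metastasis patient liver",
    "patient liver protein",
    "network model patient",
]
min_count = 2  # assumed cutoff: keep words that occur at least this often

# Count how often each word occurs across the whole corpus.
word_counts = Counter(word for doc in toy_docs for word in doc.split())

# The number of words at or above the cutoff is a candidate for max_features.
suggested_max_features = sum(1 for c in word_counts.values() if c >= min_count)
print(suggested_max_features)  # 2
```

Here only "liver" and "patient" occur at least twice, so the sketch suggests `max_features=2` for this toy corpus.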

## References
1. https://towardsdatascience.com/improving-the-interpretation-of-topic-models-87fd2ee3847d
2. https://towardsdatascience.com/topic-modelling-in-python-with-nltk-and-gensim-4ef03213cd21
3. https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/
4. https://stackoverflow.com/questions/46118910/scikit-learn-vectorizer-max-features