# Topic Modelling using Latent Dirichlet Allocation (LDA)
Major assumptions:
1. Each document is as homogeneous as possible in terms of topics, i.e. each document is explained by as few topics as possible.
2. Each word is as homogeneous as possible in terms of topics, i.e. each word is as specific as possible to a single topic description.
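
In LDA these sparsity assumptions are usually expressed through Dirichlet priors with a small concentration parameter. The following is a minimal sketch using only the standard library; the helper names, the alpha values and the number of simulated documents are assumptions made for illustration and are not part of the notebook's pipeline.

```python
import random

def sample_dirichlet(alpha, size, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over `size` topics."""
    # A Dirichlet draw is a vector of independent Gamma(alpha, 1) draws,
    # normalized so that the components sum to 1.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_top_share(alpha, size=30, n_docs=500, seed=2022):
    """Average probability mass held by the single most likely topic."""
    rng = random.Random(seed)
    shares = [max(sample_dirichlet(alpha, size, rng)) for _ in range(n_docs)]
    return sum(shares) / n_docs

sparse = mean_top_share(alpha=0.1)  # small alpha: few topics dominate a document
dense = mean_top_share(alpha=5.0)   # large alpha: mass is spread over many topics
print(f"alpha=0.1 -> top topic holds {sparse:.2f} of the mass on average")
print(f"alpha=5.0 -> top topic holds {dense:.2f} of the mass on average")
```

With a small alpha, most simulated documents put the bulk of their probability on one or two topics, which is the behaviour assumption 1 describes.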

## Load necessary libraries for this notebook

In [1]:
# Download nltk stopwords for stopword removal

import nltk
nltk.download("stopwords")

# LDA visualization using pyLDAvis

!pip install pyLDAvis

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.



In [2]:
# Syntactic processing libraries
from nltk.corpus import stopwords
import spacy
import re

# Numpy and pandas to read the data
import numpy as np
import pandas as pd

# LDA from gensim

import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# Visualization
import pyLDAvis
import pyLDAvis.gensim_models

# Suppress deprecation-related warnings
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)



## Load and pre-process data

In [3]:
bc_incl = pd.read_csv('/content/BreastCancer_incl.csv',index_col=[0], encoding="utf-8")
print("Number of protocols in the file", bc_incl['Trial_ID'].nunique())
print("Total number of inclusion criteria in the file", bc_incl.shape[0])

Number of protocols in the file 213
Total number of inclusion criteria in the file 1979


In [4]:
# Check for missing values related to encodings
bc_incl[bc_incl.isnull().any(axis=1)]

Unnamed: 0,Trial_ID,Incl_crit


There are no missing records.

### Pre-process
- No cleaning
- Simple cleaning: replace non-alphanumeric characters with spaces and convert to lower case
- Super cleaning:
  - Simple cleaning
  - Remove stopwords
  - Reduce words to their base form using lemmatization

**No cleaning and simple cleaning**

In [5]:
no_clean= list(bc_incl['Incl_crit'])
simple_clean = [re.sub(r'[^\w]',' ',sent.lower()) for sent in bc_incl['Incl_crit']]

In [6]:
# Alternative cleaning: delete ASCII punctuation only (not used later in this notebook)
import string

test_clean = [sent.translate(str.maketrans('', '', string.punctuation)).lower() for sent in bc_incl['Incl_crit']]

In [7]:
print('Original text\n', bc_incl['Incl_crit'][7])
print('Cleaned text after removing non-alphanumeric symbols and converting to lower case\n', simple_clean[7])

Original text
 TNBC patients （ HER2-neu 0-1+ by IHC or FISH-negative by ASCO CAP guidelines）
Cleaned text after removing non-alphanumeric symbols and converting to lower case
 tnbc patients   her2 neu 0 1  by ihc or fish negative by asco cap guidelines 


Note that we might well use the data without cleaning: abbreviations such as TNBC and FISH are short forms of a diagnosis or a test type, and cleaning changes their meaning (e.g. once lower-cased, fish might be interpreted as an animal rather than a test type).
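
To make the effect of cleaning on such abbreviations concrete, the sketch below applies the two cleaning variants used above (the regex behind `simple_clean` and the punctuation table behind `test_clean`) to a short fragment of the sample criterion. The fragment and the variable names `regex_clean` and `punct_clean` are chosen here for illustration only.

```python
import re
import string

text = "HER2-neu 0-1+ by IHC or FISH-negative"

# Variant 1 (as in simple_clean): every non-word character becomes a space
regex_clean = re.sub(r"[^\w]", " ", text.lower())

# Variant 2 (as in test_clean): ASCII punctuation is deleted without leaving a space
punct_clean = text.translate(str.maketrans("", "", string.punctuation)).lower()

print(repr(regex_clean))  # 'her2 neu 0 1  by ihc or fish negative'
print(repr(punct_clean))  # 'her2neu 01 by ihc or fishnegative'
```

The first variant splits hyphenated terms into separate words, while the second fuses them into a single token; neither keeps `FISH-negative` recognizable as a test result. Because `string.punctuation` only covers ASCII, the second variant would also leave full-width brackets such as `（` in place.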

**Super Cleaning**

In [8]:
# Load the English stop-word list (a separate name avoids shadowing the imported `stopwords` module)
stop_words = stopwords.words("english")

In [9]:
# POS-tag-specific lemmatization
def lemmatize_text(texts, lemma_tags=("NOUN", "ADJ", "VERB", "ADV")):
  """Return each text as a string of lemmas, keeping only tokens whose POS tag is in lemma_tags."""
  # Load the spaCy model without the parser and NER components to make it faster
  nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
  texts_out = []
  for text in texts:
    doc = nlp(text)
    # Keep the lemma of every token with an allowed POS tag
    lemmas = [token.lemma_ for token in doc if token.pos_ in lemma_tags]
    texts_out.append(" ".join(lemmas))
  return texts_out

In [10]:
lemmatized_texts= lemmatize_text(simple_clean)



In [11]:
print('Original text\n', no_clean[7])
print('Cleaned text after removing non-alphanumeric symbols and converting to lower case\n', simple_clean[7])
print('Lemmatized version\n', lemmatized_texts[7])

Original text
 TNBC patients （ HER2-neu 0-1+ by IHC or FISH-negative by ASCO CAP guidelines）
Cleaned text after removing non-alphanumeric symbols and converting to lower case
 tnbc patients   her2 neu 0 1  by ihc or fish negative by asco cap guidelines 
Lemmatized version
 tnbc patient fish negative guideline


Lemmatization with POS filtering can be dangerous: here it has removed key information such as `ASCO CAP`.
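
One possible mitigation, sketched here under assumptions rather than taken from the notebook, is to let a whitelist of domain abbreviations bypass the POS filter. The `PROTECTED` set, the `filter_tokens` helper and the hand-written POS tags below are all hypothetical; real tags would come from spaCy.

```python
KEEP_POS = {"NOUN", "ADJ", "VERB", "ADV"}
# Hypothetical whitelist of domain abbreviations that must survive filtering
PROTECTED = {"asco", "cap", "ihc", "her2"}

def filter_tokens(tagged_tokens, keep_pos=KEEP_POS, protected=PROTECTED):
    """Keep a lemma if its POS tag is allowed or its token is whitelisted."""
    return [lemma for token, lemma, pos in tagged_tokens
            if pos in keep_pos or token in protected]

# Hand-written (token, lemma, POS) triples; the tags are illustrative, not spaCy output
tagged = [
    ("patients", "patient", "NOUN"),
    ("by", "by", "ADP"),
    ("asco", "asco", "PROPN"),
    ("cap", "cap", "PROPN"),
    ("guidelines", "guideline", "NOUN"),
]
print(filter_tokens(tagged))  # ['patient', 'asco', 'cap', 'guideline']
```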

In [12]:
# Tokenize the lemmatized texts with gensim's simple_preprocess
def gen_words(texts):
  final=[]
  for text in texts:
    # simple_preprocess lower-cases and tokenizes the text; deacc=True strips accent marks
    new= gensim.utils.simple_preprocess(text, deacc=True)
    final.append(new)
  return final

In [13]:
super_clean= gen_words(lemmatized_texts)

In [14]:
print('Original text\n', no_clean[7])
print('Cleaned text after removing non-alphanumeric symbols and converting to lower case\n', simple_clean[7])
print('Super clean version\n', super_clean[7])

Original text
 TNBC patients （ HER2-neu 0-1+ by IHC or FISH-negative by ASCO CAP guidelines）
Cleaned text after removing non-alphanumeric symbols and converting to lower case
 tnbc patients   her2 neu 0 1  by ihc or fish negative by asco cap guidelines 
Super clean version
 ['tnbc', 'patient', 'fish', 'negative', 'guideline']



Ideas for further improvement:
1. Use the no_clean version with stop-word removal only.
2. Use the simple_clean version with stop-word removal.
3. Try bi-grams or tri-grams to keep expressions like 'FISH-positive' intact.
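
As a rough illustration of the first and third ideas, the sketch below removes stop words from the uncleaned text while keeping hyphenated expressions as single tokens. It is not a bi-gram model but a simpler regex-based alternative; the tiny `STOP_WORDS` set, the `TOKEN_PATTERN` regex and the `tokenize_keep_hyphens` helper are assumptions made for this example.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration
STOP_WORDS = {"by", "or", "the", "of", "and"}
# A token is a run of ASCII letters/digits, optionally joined by single hyphens
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")

def tokenize_keep_hyphens(text, stop_words=STOP_WORDS):
    """Tokenize without lower-casing and drop stop words (compared case-insensitively)."""
    tokens = TOKEN_PATTERN.findall(text)
    return [t for t in tokens if t.lower() not in stop_words]

sample = "TNBC patients （ HER2-neu 0-1+ by IHC or FISH-negative by ASCO CAP guidelines）"
print(tokenize_keep_hyphens(sample))
# ['TNBC', 'patients', 'HER2-neu', '0-1', 'IHC', 'FISH-negative', 'ASCO', 'CAP', 'guidelines']
```

Compared with the super clean version of the same criterion, this keeps `HER2-neu`, `IHC`, `FISH-negative` and `ASCO CAP`, at the cost of a larger and noisier vocabulary.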

This notebook works only with the super clean version.

**Data dictionary**

This step creates a bag-of-words representation, i.e. the unique words in each document and their frequencies.
*corpora.Dictionary()*: a mapping between words and their integer ids <br>
*corpora.Dictionary.doc2bow()*: converts a document (a list of words) into the bag-of-words format, a list of (token_id, token_count) 2-tuples. Each word is assumed to be a tokenized and normalized string (either unicode or utf8-encoded). No further preprocessing is done on the words in the document.
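
The idea behind these two calls can be sketched in plain Python. This is not gensim's implementation: the `build_vocab_and_bow` helper, the toy documents and the order in which ids are handed out are assumptions made for illustration.

```python
from collections import Counter

def build_vocab_and_bow(documents):
    """Map tokens to integer ids and convert each document to (id, count) pairs."""
    token2id = {}
    corpus = []
    for doc in documents:
        counts = Counter(doc)
        # Give the next free id to each unseen token (alphabetical order within a document)
        for token in sorted(counts):
            if token not in token2id:
                token2id[token] = len(token2id)
        # Bag of words: sorted list of (token_id, token_count) tuples
        corpus.append(sorted((token2id[t], n) for t, n in counts.items()))
    return token2id, corpus

docs = [
    ["status", "negative"],
    ["receptor", "status", "receptor", "positive", "receptor", "negative"],
]
token2id, bow = build_vocab_and_bow(docs)
print(token2id)  # {'negative': 0, 'status': 1, 'positive': 2, 'receptor': 3}
print(bow[1])    # [(0, 1), (1, 1), (2, 1), (3, 3)]
```

Word order inside a document is discarded; only which words occur and how often is kept.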

In [15]:
#a mapping between words and their integer ids
id2word= corpora.Dictionary(super_clean)

corpus = []

for text in super_clean:
  #Bag of words representation
  new= id2word.doc2bow(text)
  corpus.append(new)


Demonstration

In [16]:
print('Super clean version\n', super_clean[101])

Super clean version
 ['receptor', 'status', 'receptor', 'progesterone', 'receptor', 'positive', 'human', 'epidermal', 'growth', 'factor', 'receptor', 'negative']


In [17]:
print('Every word in the complete set of texts (i.e. all sentences) is mapped to a unique ID\n')
print('The frequency of each ID in each sentence is counted. The final representation is a list of (ID, frequency) tuples for each sentence')

print(corpus[101])

Every word in the complete set of texts (i.e. all sentences) is mapped to a unique ID

The frequency of each ID in each sentence is counted. The final representation is a list of (ID, frequency) tuples for each sentence
[(25, 1), (28, 1), (42, 1), (387, 1), (388, 1), (389, 1), (390, 1), (391, 1), (392, 4)]


In [18]:
print('We can see that the word receptor appears 4 times')
print('According to the logic stated above, the word RECEPTOR should map to ID 392')
print('word map for ID 392:', id2word[392])
print('Great job !!')

We can see that the word receptor appears 4 times
According to the logic stated above, the word RECEPTOR should map to ID 392
word map for ID 392: receptor
Great job !!


## LDA model

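Initial run with super clean data. <br>
- Number of topics (num_topics): 30
- Seed (random_state): 2022
- Update the model (update_every) after every chunk (i.e., 1), which corresponds to online learning
- Number of sentences (chunksize) processed per training chunk (i.e. batch size): 128
- Number of passes over the whole corpus (passes, i.e. epochs): 30
- Document-topic prior (alpha): 'auto', i.e. learned from the data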
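
To get a feel for what these settings imply, the number of model updates can be estimated from the corpus size reported earlier (1979 inclusion criteria). This is back-of-the-envelope arithmetic that assumes one update per `update_every` chunks; it approximates the training schedule and is not taken from gensim's code.

```python
import math

n_documents = 1979  # inclusion criteria in the file
chunksize = 128     # documents per training chunk
passes = 30         # full passes over the corpus
update_every = 1    # chunks processed between two model updates

chunks_per_pass = math.ceil(n_documents / chunksize)
updates_per_pass = math.ceil(chunks_per_pass / update_every)
total_updates = updates_per_pass * passes

print(chunks_per_pass)  # 16
print(total_updates)    # 480
```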

In [19]:
num_topics=30

lda_model = gensim.models.ldamodel.LdaModel(
    corpus=corpus,
    id2word=id2word,
    num_topics=num_topics,
    random_state=2022,
    update_every=1,
    chunksize=128,
    passes=30,
    alpha='auto',
)

## Visualizing Data

In [20]:
#enable automatic D3 display of prepared model data in the IPython notebook
pyLDAvis.enable_notebook()

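`pyLDAvis.gensim_models.prepare` parameters:
- R: Number of words to display per topic
- mds: Multidimensional scaling method used to project the topics onto a 2D map
  - `pcoa`: principal coordinate analysis (PCA-like)
  - `tsne`: t-SNE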
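
Whichever scaling method is chosen, it needs a distance between topics as input. To my understanding pyLDAvis uses the Jensen-Shannon divergence between topic-word distributions by default; the sketch below computes that divergence for three toy topics. The function names and the toy distributions are assumptions for illustration, not pyLDAvis internals.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    """Symmetric divergence: average KL of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Toy topic-word distributions over a three-word vocabulary
topic_a = [0.7, 0.2, 0.1]
topic_b = [0.1, 0.2, 0.7]
topic_c = [0.6, 0.3, 0.1]

print(jensen_shannon(topic_a, topic_b))  # large: the topics favour different words
print(jensen_shannon(topic_a, topic_c))  # small: the topics favour similar words
```

Topics with a small divergence end up close together on the 2D map, which is why overlapping bubbles suggest topics that share much of their vocabulary.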

In [21]:
vis_pca = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word, R=20, mds='pcoa')



In [22]:
vis_pca

A few topics are well separated, but there is still overlap between topics.
Let's try a 2D mapping with t-SNE.

In [23]:
vis_tsne = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word, R=20, mds='tsne')



In [24]:
vis_tsne

### Extract all topic descriptions

In [36]:
# Store topic descriptions in a dictionary
topic_dict = {'ID': [], 'Words': []}

for topicid in range(num_topics):
  # (word_id, probability) pairs of the 20 most probable words in the topic
  topic_desc = lda_model.get_topic_terms(topicid, topn=20)
  # Look up the actual word for each word_id
  topic_word = [id2word[word_id] for word_id, _ in topic_desc]
  topic_dict['ID'].append(topicid)
  topic_dict['Words'].append(topic_word)

In [77]:
# Convert results to a data frame
pd.DataFrame.from_dict(topic_dict)

Unnamed: 0,ID,Words
0,0,"[study, woman, contraception, undergo, entry, ..."
1,1,"[willing, control, protocol, comply, birth, st..."
2,2,"[platelet, treat, radiation, grade, provide, p..."
3,3,"[history, able, procedure, active, other, rela..."
4,4,"[breast, cancer, confirm, stage, invasive, his..."
5,5,"[least, month, at, trial, life, surgical, expe..."
6,6,"[eligible, creatinine, cell, cytologically, re..."
7,7,"[lymph, node, less, currently, physical, axill..."
8,8,"[negative, positive, receptor, progesterone, h..."
9,9,"[use, potential, childbeare, method, agree, bl..."


Demonstration of function `get_topic_terms` <br>
Return a list of (word_id, probability) 2-tuples for the most probable words in topic topicid.

Only return 2-tuples for the topn most probable words.


In [79]:
print('Topic description of topic 1 with top 20 words\n ', lda_model.get_topic_terms(topicid=1, topn=20))

Topic description of topic 1 with top 20 words
  [(262, 0.12935804), (293, 0.10291982), (321, 0.07453537), (528, 0.067721136), (354, 0.062605284), (89, 0.05939475), (319, 0.058500625), (363, 0.029915432), (1913, 0.029367482), (1912, 0.029367482), (318, 0.023051871), (247, 0.016041677), (357, 0.015150953), (361, 0.012715697), (597, 0.012375045), (356, 0.011786299), (337, 0.010371798), (362, 0.010063686), (94, 0.009644796), (1772, 0.009384186)]


Note that each word is returned as a word_id; the actual word has to be looked up in the id2word dictionary.
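
The lookup itself is a one-liner. The sketch below uses a toy vocabulary and toy probabilities, both hypothetical, in place of the real `id2word` and `get_topic_terms` output; `describe_topic` is a helper name introduced here.

```python
# Hypothetical toy vocabulary and topic terms; real values come from
# id2word and lda_model.get_topic_terms
toy_id2word = {0: "willing", 1: "protocol", 2: "comply"}
toy_topic_terms = [(0, 0.5), (2, 0.3), (1, 0.2)]

def describe_topic(topic_terms, id2word):
    """Replace each word_id with its word, keeping the probability."""
    return [(id2word[word_id], float(prob)) for word_id, prob in topic_terms]

print(describe_topic(toy_topic_terms, toy_id2word))
# [('willing', 0.5), ('comply', 0.3), ('protocol', 0.2)]
```

gensim also offers `lda_model.show_topic(topicid, topn)`, which returns (word, probability) pairs directly.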

### Identify the top 3 topics for each inclusion criterion

In [73]:
# Top 3 topics for each inclusion criterion
top1 = []
top2 = []
top3 = []

for bow in corpus:
  topic_dist = lda_model.get_document_topics(bow, minimum_probability=1e-3)
  # Sort by topic probability, highest first
  topic_dist = sorted(topic_dist, key=lambda tup: tup[1], reverse=True)
  # Assumes at least 3 topics pass the threshold; otherwise indexing raises IndexError
  top1.append(topic_dist[0])
  top2.append(topic_dist[1])
  top3.append(topic_dist[2])


Note that `get_document_topics` returns a list of (topic_id, probability) pairs for the topics whose probability is higher than the minimum_probability threshold.
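
Because the threshold can filter out topics, a document may come back with fewer than three pairs. A defensive variant of the top-3 selection, sketched here with a hypothetical `top_k_topics` helper and a `(None, 0.0)` padding value chosen for illustration, could look like this:

```python
import heapq

def top_k_topics(topic_dist, k=3, pad=(None, 0.0)):
    """Return the k most probable (topic_id, probability) pairs, padded to length k."""
    best = heapq.nlargest(k, topic_dist, key=lambda pair: pair[1])
    # Pad when fewer than k topics passed the probability threshold
    return best + [pad] * (k - len(best))

print(top_k_topics([(22, 0.043348685), (6, 0.18787658), (29, 0.1785286)]))
# [(6, 0.18787658), (29, 0.1785286), (22, 0.043348685)]
print(top_k_topics([(4, 0.9)]))
# [(4, 0.9), (None, 0.0), (None, 0.0)]
```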

In [74]:
# store in a data frame
topic_df= pd.DataFrame(zip(top1, top2, top3), columns=['Topic1','Topic2', 'Topic3'])

In [75]:
# merge with original inclusion criteria file
lda_bc_result = bc_incl.join(topic_df)

In [76]:
lda_bc_result.head()

Unnamed: 0,Trial_ID,Incl_crit,Topic1,Topic2,Topic3
0,NCT05376241,Female,"(14, 0.22667772)","(22, 0.05178042)","(4, 0.050488904)"
1,NCT05376241,Between 39-49 years of age,"(18, 0.3489918)","(22, 0.043348685)","(4, 0.042267475)"
2,NCT05376241,No history of breast cancer,"(4, 0.31641614)","(3, 0.15250365)","(22, 0.037278406)"
3,NCT05376241,No known BRCA 1/2 mutation,"(22, 0.45737976)","(4, 0.036348604)","(19, 0.034764476)"
4,NCT05600257,individuals were diagnosed with breast cancer ...,"(27, 0.38971403)","(4, 0.24804328)","(28, 0.092094205)"


Most of the inclusion criteria are described by 1 or 2 topics
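
One way to check this observation is to count, per criterion, how many of its top topics exceed a probability cut-off. The sketch below does this for the five rows shown above; the cut-off of 0.10 is a hypothetical choice, and the full result table would be needed for a real conclusion.

```python
from collections import Counter

THRESHOLD = 0.10  # hypothetical cut-off for "this topic describes the criterion"

# (Topic1, Topic2, Topic3) for the first five rows of the result table
rows = [
    [(14, 0.22667772), (22, 0.05178042), (4, 0.050488904)],
    [(18, 0.3489918), (22, 0.043348685), (4, 0.042267475)],
    [(4, 0.31641614), (3, 0.15250365), (22, 0.037278406)],
    [(22, 0.45737976), (4, 0.036348604), (19, 0.034764476)],
    [(27, 0.38971403), (4, 0.24804328), (28, 0.092094205)],
]

# Number of topics above the cut-off for each criterion
n_topics = [sum(prob >= THRESHOLD for _, prob in row) for row in rows]
print(n_topics)           # [1, 1, 2, 1, 2]
print(Counter(n_topics))  # Counter({1: 3, 2: 2})
```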

Demonstration of `get_document_topics`

In [83]:
topic_dist = lda_model.get_document_topics(corpus[10], minimum_probability=1e-3)

A small threshold (minimum_probability) is chosen so that at least 3 topics are normally returned per criterion.
The input text must be in a BOW (bag of words) representation in order to perform topic prediction.

In [84]:
# Sorted by probability - descending order
topic_dist= sorted(topic_dist,key=lambda tup: tup[1],reverse=True)

In [86]:
print('Original text\n', no_clean[10])

print('Lemmatized final version of the text\n', super_clean[10])

# top 3 topics
topic_dist[:3]

Original text
 Any nodal status
Lemmatized final version of the text
 ['nodal', 'status']


[(6, 0.18787658), (29, 0.1785286), (22, 0.043348685)]

Not bad: the clusters are well separated in the t-SNE representation. But do they really have meaning?
The two steps above address this question:
1. Extract the words explaining each topic
2. Map topic IDs to the original inclusion criteria to study them better

References (implementation):
1. [Topic modelling part 1](https://www.youtube.com/watch?v=TKjjlp5_r7o)
2. [Topic modelling part 2](https://www.youtube.com/watch?v=UEn3xHNBXJU)
3. [Topic modelling part 3](https://www.youtube.com/watch?v=i74DVqMsRWY)

References (concept):
1. [LDA part 1](https://www.youtube.com/watch?v=T05t-SqKArY)
2. [LDA part 2](https://www.youtube.com/watch?v=BaM1uiCpj_E)
