### 4. LATENT DIRICHLET ALLOCATION

Latent Dirichlet Allocation (LDA) is an unsupervised machine learning algorithm used to identify topics within a body of text.  In this case, I'm using LDA with a corpus of ICD-10 codes from the underlying and multiple cause fields to see if the algorithm can separate the records into a predetermined number of causes of death.

LDA works by assuming each document (here, the series of ICD-10 codes on each death record) is composed of multiple topics, with some topics being more dominant than others.  Each topic is in turn made up of key words, with certain words within each topic being more dominant.

When I specify the number of topics I want the algorithm to find, it iteratively reassigns words to topics and topics to documents until it reaches the best groupings of records.
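The generative assumption described above can be illustrated with a small standard-library sketch. It is not part of the analysis: the two topics, their codes, and their weights are invented for illustration, and `generate_record` is a hypothetical helper rather than anything gensim provides.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_record(topic_code_probs, alpha, n_codes, rng):
    """Generate one toy death record under LDA's generative assumptions."""
    n_topics = len(topic_code_probs)
    # 1. Each document gets its own mixture over topics.
    topic_mix = sample_dirichlet([alpha] * n_topics, rng)
    record = []
    for _ in range(n_codes):
        # 2. Each code position first picks a topic from that mixture...
        topic = rng.choices(range(n_topics), weights=topic_mix)[0]
        # 3. ...then picks a code from that topic's distribution over codes.
        codes = list(topic_code_probs[topic])
        weights = [topic_code_probs[topic][c] for c in codes]
        record.append(rng.choices(codes, weights=weights)[0])
    return topic_mix, record

# Hypothetical topics: the codes and weights are illustrative only.
toy_topics = [
    {"C78": 0.6, "C79": 0.4},              # a "metastatic cancer" topic
    {"N17": 0.5, "N18": 0.4, "J18": 0.1},  # a "kidney disease" topic
]

rng = random.Random(0)
topic_mix, record = generate_record(toy_topics, alpha=0.5, n_codes=4, rng=rng)
print(topic_mix, record)
```

Fitting LDA runs this story in reverse: given only the records, it estimates the topic mixtures and the per-topic code distributions that most plausibly produced them.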

In this section, I used LDA with three versions of the ICD-10 corpus and compared their performance based on a 'coherence score' to determine which type of corpus gives the algorithm the most information. In addition to examining relative performance, I also looked at the most salient codes in each topic to see if the clustering made sense in terms of conditions that might co-occur on death records.

In [49]:
import re
import ast
import numpy as np
import pandas as pd
from pprint import pprint
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import text 

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
from gensim.parsing.preprocessing import STOPWORDS

from collections import Counter

# nltk
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
from tokenize import tokenize

# Plotting tools
import pyLDAvis
import pyLDAvis.gensim  # don't skip this
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('whitegrid')


**Load data**

In [14]:
ds = pd.read_csv(r'Y:/DQSS/Death/MBG/py/capstone2/data/d1619_clean.csv',
                low_memory=False)

In [15]:
ds.gc_cat.value_counts()

0    211924
6      3868
5      2973
2      2767
3      2376
1      2149
4       534
8       368
9        38
Name: gc_cat, dtype: int64

**Keep records with garbage underlying cause codes**

In [18]:
ds = ds.loc[ds['gc_cat']!=0, :]
len(ds)

15073

**Tokenize text and create 3 versions of the corpus**

The three versions of the ICD-10 corpus are:

1. Short ICD-10 unigrams without garbage codes in any multiple cause field
2. Short ICD-10 unigrams with garbage codes
3. Short ICD-10 bigrams without garbage codes in any multiple cause field
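To make the three versions concrete, here is a rough sketch of how one record might look in each. It is an assumption about the general idea only: `build_versions` is a hypothetical stand-in, not the `make_corpi` function imported from `myfunction_2`, and the toy record and garbage list are invented.

```python
def shorten(code):
    """Truncate an ICD-10 code to its 3-character category, e.g. 'J449' -> 'J44'."""
    return code[:3]

def build_versions(codes, garbage_codes):
    """Return (unigrams without garbage, unigrams with garbage, bigrams without garbage)."""
    garbage = set(garbage_codes)
    with_gc = [shorten(c) for c in codes]
    without_gc = [shorten(c) for c in codes if c not in garbage]
    # Bigrams join each code to the code that follows it in the record.
    bigrams = [a + "_" + b for a, b in zip(without_gc, without_gc[1:])]
    return without_gc, with_gc, bigrams

# Toy record and garbage list; both are illustrative, not taken from the data.
toy_record = ["E119", "J449", "J969", "N179"]
toy_garbage = ["J969", "R99"]

uni_no_gc, uni_with_gc, bi_no_gc = build_versions(toy_record, toy_garbage)
print(uni_no_gc)    # ['E11', 'J44', 'N17']
print(uni_with_gc)  # ['E11', 'J44', 'J96', 'N17']
print(bi_no_gc)     # ['E11_J44', 'J44_N17']
```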

In [19]:
## call stored object 'gc_all' created in 2_PrepData_ICD10.ipynb.  Will only work if you run that notebook first.
%store -r gc_all  
# add respiratory failure (J96x) and tobacco use (F179) codes to the list of stopwords
respfail_tobac_codes = ['J960','J961','J969','F179']
gc_plus = gc_all + respfail_tobac_codes


In [20]:
from myfunction_2 import make_corpi
ds['clean_mc'], ds['short_mc'], ds['clean_mcgc'] = make_corpi(ds.loc[:,'AllMC'], gc_plus)

In [40]:
def make_bigrams(doc):
    bi = []
    for i in range(len(doc)-1):
        bigrm = doc[i] + "_" + doc[i+1]
        bi.append(bigrm)
    return bi

In [41]:
ds['bigrams'] = ds.loc[:,'short_mc'].apply(lambda row: make_bigrams(row))

In [60]:
# remove rows with empty lists in the 'bigrams' column (records with fewer than two codes left)
ds = ds[ds.bigrams.astype(bool)]
len(ds)

4570

**CONSTRUCT DICTIONARY**

In this step I create dictionaries for each version of the corpus.  A dictionary is a pairing of each unique word with a unique ID.  It is a necessary input for the LDA algorithm.
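As a rough illustration of what a dictionary holds and what pruning by document frequency does, the sketch below builds a token-to-ID mapping in plain Python. It approximates the idea behind `filter_extremes(no_below, no_above)` and is not gensim's implementation; `build_dictionary` and the toy records are hypothetical, and the IDs gensim assigns will differ.

```python
from collections import Counter

def build_dictionary(docs, no_below=1, no_above=1.0):
    """Map each kept token to an integer ID, pruning by document frequency."""
    # Document frequency: number of records in which each code appears.
    doc_freq = Counter(code for doc in docs for code in set(doc))
    max_docs = no_above * len(docs)
    kept = sorted(c for c, df in doc_freq.items() if no_below <= df <= max_docs)
    return {code: idx for idx, code in enumerate(kept)}

toy_docs = [
    ["J44", "N17"],
    ["J44", "I48"],
    ["J44", "N17", "C78"],
    ["J44", "I48", "N17"],
]

# Keep codes that appear in at least 2 records and in at most 75% of records.
token2id = build_dictionary(toy_docs, no_below=2, no_above=0.75)
print(token2id)  # {'I48': 0, 'N17': 1}
```

Here `J44` is dropped because it appears in every record and `C78` because it appears in only one, which is the same trade-off the pruning thresholds make on the real corpus.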

In [61]:
'''create dictionary for later use and also create bag of words - prune words that occur in fewer than 5 records
or in more than 50% of records'''

dictionary_uni_short = gensim.corpora.Dictionary(ds.short_mc)
dictionary_uni_short.filter_extremes(no_below=5, no_above=0.5)

dictionary_mcgc = gensim.corpora.Dictionary(ds.clean_mcgc)
dictionary_mcgc.filter_extremes(no_below=5, no_above=0.5)

dictionary_bi = gensim.corpora.Dictionary(ds.bigrams)
dictionary_bi.filter_extremes(no_below=5, no_above=0.5)

Below is a print out of the first 11 ID-word pairs to show the contents of one of the dictionaries.

In [46]:
count = 0
for k, v in dictionary_uni_short.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

0 N17
1 W80
2 D46
3 I80
4 J18
5 Y83
6 B18
7 C78
8 I12
9 I48
10 J44


In [62]:
len(dictionary_uni_short),len(dictionary_mcgc),len(dictionary_bi)

(231, 305, 361)

**Construct bag of words for each corpus**

In [63]:
'''bag of words - each document becomes a list of (token_id, count) pairs.  
Word counts can be looked up from these pairs.'''

bow_corpus_uni_short = [dictionary_uni_short.doc2bow(doc) for doc in ds.short_mc]
bow_corpus_mcgc = [dictionary_mcgc.doc2bow(doc) for doc in ds.clean_mcgc]
bow_corpus_bi = [dictionary_bi.doc2bow(doc) for doc in ds.bigrams]
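For readers unfamiliar with the bag-of-words format, this small sketch mimics the shape of what `doc2bow` returns: sorted `(token_id, count)` pairs, with codes missing from the dictionary ignored. `doc_to_bow` and the toy mapping are hypothetical stand-ins written in plain Python.

```python
from collections import Counter

def doc_to_bow(doc, token2id):
    """Convert a list of codes into sorted (token_id, count) pairs, skipping unknown codes."""
    counts = Counter(token2id[code] for code in doc if code in token2id)
    return sorted(counts.items())

toy_token2id = {"C78": 0, "C79": 1, "N17": 2}
toy_bow = doc_to_bow(["C78", "C78", "C79", "R99"], toy_token2id)
print(toy_bow)  # [(0, 2), (1, 1)]
```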

**Create TF-IDF for each corpus**

The text (ICD-10 codes) in the corpus cannot be used directly for modeling and needs to be converted into vectors of numbers.  TF-IDF is a weighting that combines how often each word (in this case, each ICD-10 code) occurs within a document with how rare it is across all documents in the data set, so that codes appearing on most records carry less weight.  These values are provided to the models as input.
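The weighting can be sketched in a few lines of plain Python. This assumes the commonly used form, count times log2(N / document frequency) followed by scaling each document to unit length, which to my understanding matches gensim's `TfidfModel` defaults; `tfidf_weights` is a hypothetical helper and the toy corpus is invented.

```python
import math

def tfidf_weights(bow_corpus):
    """Weight each (token_id, count) pair by count * log2(N / df), then L2-normalize."""
    n_docs = len(bow_corpus)
    doc_freq = {}
    for bow in bow_corpus:
        for token_id, _count in bow:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

    weighted_corpus = []
    for bow in bow_corpus:
        raw = [(t, count * math.log2(n_docs / doc_freq[t])) for t, count in bow]
        norm = math.sqrt(sum(w * w for _t, w in raw))
        # Codes present in every record get weight 0 and are dropped.
        weighted_corpus.append([(t, w / norm) for t, w in raw if w > 0])
    return weighted_corpus

# Toy bag-of-words corpus: token 0 is on every record, token 3 on only one.
toy_bow_corpus = [
    [(0, 1), (1, 1)],
    [(0, 1), (2, 1)],
    [(0, 1), (1, 1), (3, 1)],
    [(0, 1)],
]

toy_weights = tfidf_weights(toy_bow_corpus)
for doc in toy_weights:
    print(doc)
```

In the third toy record, the rarer token 3 ends up with a larger weight than token 1, and token 0 disappears entirely because it carries no information for telling records apart.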

In [64]:
from gensim import corpora, models

tfidf_uni_short = models.TfidfModel(bow_corpus_uni_short)
corpus_uni_short = tfidf_uni_short[bow_corpus_uni_short]

tfidf_mcgc = models.TfidfModel(bow_corpus_mcgc)
corpus_mcgc = tfidf_mcgc[bow_corpus_mcgc]

tfidf_bi = models.TfidfModel(bow_corpus_bi)
corpus_bi = tfidf_bi[bow_corpus_bi]

For each of the LDA models I built, I specified that the algorithm should find 6 topics. I also kept the other parameters constant so that I could compare the models.

**LDA MODEL WITH UNIGRAMS AND SHORT ICD-10 CODES EXCLUDING GARBAGE CODES**

In [30]:
ldamodel_all_uni_short = gensim.models.LdaMulticore(corpus_uni_short,
                                              id2word=dictionary_uni_short,
                                              num_topics=6,
                                              passes=30,
                                              alpha = 0.5,
                                              eta = 0.1,
                                              workers=4
                                             )

In [31]:
for idx, topic in ldamodel_all_uni_short.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx,topic))

Topic: 0 
Words: 0.288*"J44" + 0.050*"K74" + 0.043*"J98" + 0.039*"J80" + 0.036*"W80" + 0.035*"Y83" + 0.033*"K76" + 0.030*"E43" + 0.029*"J84" + 0.027*"K55"
Topic: 1 
Words: 0.214*"J18" + 0.194*"N17" + 0.181*"N18" + 0.058*"F10" + 0.049*"J90" + 0.039*"F19" + 0.033*"I42" + 0.031*"E88" + 0.028*"K70" + 0.028*"I80"
Topic: 2 
Words: 0.143*"C79" + 0.112*"G30" + 0.101*"N28" + 0.090*"N19" + 0.088*"J69" + 0.055*"K92" + 0.054*"I73" + 0.044*"I95" + 0.039*"I35" + 0.037*"E03"
Topic: 3 
Words: 0.236*"I48" + 0.170*"C78" + 0.112*"E11" + 0.083*"E66" + 0.041*"G47" + 0.039*"D64" + 0.037*"J81" + 0.029*"I27" + 0.024*"I63" + 0.023*"F17"
Topic: 4 
Words: 0.179*"E14" + 0.130*"E78" + 0.097*"I25" + 0.067*"I71" + 0.064*"I64" + 0.063*"I21" + 0.055*"K72" + 0.052*"I45" + 0.030*"G20" + 0.028*"J15"
Topic: 5 
Words: 0.175*"F03" + 0.145*"A49" + 0.142*"G93" + 0.116*"F01" + 0.064*"N39" + 0.037*"I12" + 0.036*"E46" + 0.026*"I47" + 0.023*"X59" + 0.022*"F32"


In [33]:
coherence_ldamodel_all_uni_short = CoherenceModel(model = ldamodel_all_uni_short, 
                                      texts = ds.short_mc, 
                                      dictionary = dictionary_uni_short, 
                                      coherence = 'c_v')

coherence_ldamodel_all_uni_short = coherence_ldamodel_all_uni_short.get_coherence()

print("Coherence score for LDA model: short ICD-10 unigrams, 6 topics, 30 passes: ", coherence_ldamodel_all_uni_short)


Coherence score for LDA model: short ICD-10 unigrams, 6 topics, 30 passes:  0.5315311618041512


This LDA model had a coherence score of 0.53 (on a scale of 0 to 1, with 1 being the best score).  The coherence score provides an estimate of how closely related the top ICD-10 codes in each topic are, based on how they co-occur in the records. Based on the experience of other NLP practitioners, it would seem that achieving a score of 0.7 is a good goal and that 0.8 is very unlikely to happen unless words are identical.
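As a simplified illustration of the co-occurrence idea behind coherence, the sketch below computes normalized pointwise mutual information (NPMI) for one pair of codes at the record level. This is only one ingredient and not gensim's `c_v` calculation, which as I understand it adds sliding windows and a cosine-similarity step on top; `npmi` is a hypothetical helper and the records are invented.

```python
import math

def npmi(code_a, code_b, docs):
    """Document-level NPMI of two codes: 1 = always together, -1 = never together."""
    n = len(docs)
    p_a = sum(code_a in d for d in docs) / n
    p_b = sum(code_b in d for d in docs) / n
    p_ab = sum(code_a in d and code_b in d for d in docs) / n
    if p_ab == 0:
        return -1.0  # the pair never co-occurs
    if p_ab == 1:
        return 1.0   # the pair is on every record
    pmi = math.log(p_ab / (p_a * p_b))
    return pmi / -math.log(p_ab)

toy_docs = [
    ["C78", "C79"],
    ["C78", "C79"],
    ["N17"],
    ["N17", "N18"],
]

print(npmi("C78", "C79", toy_docs))  # close to 1: the codes always appear together
print(npmi("C78", "N17", toy_docs))  # -1.0: the codes never appear together
```

A topic whose top codes mostly score high against one another would be considered coherent under this kind of measure.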



In [34]:
import pyLDAvis
import pyLDAvis.gensim  # gensim adapter (renamed pyLDAvis.gensim_models in pyLDAvis 3.3+); the sklearn adapter is not needed
pyLDAvis.enable_notebook()
ldamodel_all_uni_short_vis = pyLDAvis.gensim.prepare(ldamodel_all_uni_short, corpus_uni_short, dictionary_uni_short)
ldamodel_all_uni_short_vis

The plot shows a visualization of the model's attempt at finding 6 topics (as I specified).  Ideally, the circles that represent the topics should be large and well separated.  In this instance, we see two overlapping groups. Topics 1 and 5 have a high occurrence of cancer codes C78 and C79 (secondary malignant neoplasm of respiratory and digestive organs, and secondary malignant neoplasm of other and unspecified sites). In the cluster in the top left quadrant (topics 2, 3, and 6), there appear to be distinct groups based on the 1 to 3 most salient ICD-10 codes, which is puzzling given that their circles overlap.

**LDA MODEL WITH SHORT ICD-10 UNIGRAMS INCLUDING GARBAGE CODES**

In [38]:
ldamodel_mcgc = gensim.models.LdaMulticore(corpus_mcgc,
                                              id2word=dictionary_mcgc,
                                              num_topics=6,
                                              passes=30,
                                              alpha = 0.5,
                                              eta = 0.1,
                                              workers=4
                                             )

for idx, topic in ldamodel_mcgc.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx,topic))


Topic: 0 
Words: 0.136*"I51" + 0.120*"I26" + 0.107*"I70" + 0.103*"R09" + 0.081*"R62" + 0.050*"G93" + 0.038*"N28" + 0.022*"K92" + 0.021*"R58" + 0.020*"J90"
Topic: 1 
Words: 0.190*"I10" + 0.092*"F17" + 0.078*"J44" + 0.060*"E14" + 0.052*"A49" + 0.045*"E78" + 0.043*"E11" + 0.038*"G30" + 0.032*"I25" + 0.027*"R06"
Topic: 2 
Words: 0.155*"A41" + 0.086*"R68" + 0.071*"E87" + 0.071*"J18" + 0.066*"R54" + 0.042*"F01" + 0.027*"E86" + 0.024*"R56" + 0.023*"I64" + 0.023*"R53"
Topic: 3 
Words: 0.502*"R99" + 0.138*"I49" + 0.083*"F03" + 0.053*"R95" + 0.045*"N19" + 0.016*"G20" + 0.016*"E88" + 0.014*"K70" + 0.014*"G31" + 0.012*"K55"
Topic: 4 
Words: 0.312*"C80" + 0.085*"C78" + 0.070*"I48" + 0.066*"F17" + 0.064*"C79" + 0.033*"J69" + 0.031*"R00" + 0.025*"K72" + 0.021*"R91" + 0.017*"R19"
Topic: 5 
Words: 0.282*"I50" + 0.278*"I46" + 0.111*"J96" + 0.057*"R57" + 0.053*"N18" + 0.040*"N17" + 0.020*"C76" + 0.018*"N39" + 0.015*"I45" + 0.014*"F17"


In [42]:
coherence_ldamodel_mcgc = CoherenceModel(model = ldamodel_mcgc, 
                                      texts = ds.clean_mcgc, 
                                      dictionary = dictionary_mcgc, 
                                      coherence = 'c_v')

coherence_ldamodel_mcgc = coherence_ldamodel_mcgc.get_coherence()

print("Coherence score for LDA model: short ICD-10 unigrams with garbage codes, 6 topics, 30 passes: ", coherence_ldamodel_mcgc)

Coherence score for LDA model: short ICD-10 unigrams with garbage codes, 6 topics, 30 passes:  0.37550018208718233


This LDA model, built on a corpus of short ICD-10 codes that includes garbage codes, performed poorly compared with the first model (coherence of 0.38 versus 0.53).  It is likely that the frequently occurring garbage codes obscured any meaningful information that might have helped the algorithm distinguish between topics.

In [43]:
import pyLDAvis
import pyLDAvis.gensim  # gensim adapter; the sklearn adapter is not needed
pyLDAvis.enable_notebook()
ldamodel_mcgc_vis = pyLDAvis.gensim.prepare(ldamodel_mcgc, corpus_mcgc, dictionary_mcgc)
ldamodel_mcgc_vis

**LDA WITH BIGRAMS EXCLUDING GARBAGE CODES**

In [65]:
ldamodel_bi = gensim.models.LdaMulticore(corpus_bi,
                                              id2word=dictionary_bi,
                                              num_topics=6,
                                              passes=30,
                                              alpha = 0.5,
                                              eta = 0.1,
                                              workers=4
                                             )

for idx, topic in ldamodel_bi.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx,topic))


Topic: 0 
Words: 0.085*"I48_J44" + 0.054*"J18_N17" + 0.050*"F03_I48" + 0.049*"J44_N18" + 0.048*"E11_E78" + 0.032*"E11_E66" + 0.031*"J44_N17" + 0.030*"I48_I63" + 0.026*"K92_N17" + 0.020*"I48_N17"
Topic: 1 
Words: 0.054*"F01_I48" + 0.047*"J18_N18" + 0.037*"N18_N28" + 0.035*"J18_J90" + 0.032*"N17_N39" + 0.031*"I25_J44" + 0.031*"G93_I45" + 0.030*"E11_I25" + 0.025*"J18_N39" + 0.025*"K72_N19"
Topic: 2 
Words: 0.114*"N17_N18" + 0.048*"I25_I48" + 0.046*"I48_I64" + 0.033*"E78_I48" + 0.029*"E11_I48" + 0.029*"E11_G30" + 0.028*"G93_J18" + 0.026*"I48_N18" + 0.023*"I21_N17" + 0.022*"J44_N28"
Topic: 3 
Words: 0.162*"C78_C79" + 0.139*"C78_C78" + 0.032*"E66_G47" + 0.030*"G30_I48" + 0.028*"K72_N17" + 0.027*"E14_F03" + 0.026*"E14_I48" + 0.024*"F03_J44" + 0.023*"E11_F03" + 0.023*"I21_I25"
Topic: 4 
Words: 0.055*"E14_E66" + 0.052*"I48_J18" + 0.041*"J18_J80" + 0.039*"F03_J18" + 0.031*"I73_J44" + 0.030*"E66_E78" + 0.028*"J18_N28" + 0.026*"E78_J44" + 0.024*"G93_I48" + 0.021*"G47_I48"
Topic: 5 
Words: 0.083*"C

In [66]:
coherence_ldamodel_bi = CoherenceModel(model = ldamodel_bi, 
                                      texts = ds.bigrams, 
                                      dictionary = dictionary_bi, 
                                      coherence = 'c_v')

coherence_ldamodel_bi = coherence_ldamodel_bi.get_coherence()

print("Coherence score for LDA model: ICD-10 bigrams, 6 topics, 30 passes: ", coherence_ldamodel_bi)

Coherence score for LDA model: ICD-10 bigrams, 6 topics, 30 passes:  0.7706705544662625


The coherence score for the LDA model built with bigrams and no garbage codes (0.77) was much higher than that of either of the two previous models (0.53 and 0.38).

In [67]:
import pyLDAvis
import pyLDAvis.gensim  # gensim adapter; the sklearn adapter is not needed
pyLDAvis.enable_notebook()
ldamodel_all_bi_vis = pyLDAvis.gensim.prepare(ldamodel_bi, corpus_bi, dictionary_bi)
ldamodel_all_bi_vis

Topic 1 includes mostly C78-C79 pairs (secondary malignant neoplasm of respiratory and digestive organs and secondary malignant neoplasm of other and unspecified sites). Topic 2 was dominated by I48-J44 pairs (atrial fibrillation and COPD) and to a lesser extent by J18-N17 (bronchopneumonia and acute kidney failure). Topic 3 was dominated by E14-E66 (unspecified diabetes and obesity) and I48-J18 (atrial fibrillation and bronchopneumonia). Topic 5 mostly relates to kidney disease, and topic 6 relates to multi-infarct dementia and atrial fibrillation, indicating cerebrovascular disease.  There is overlap between some of the topics, indicating that there may be more than 6 topics in this corpus.

**FINDING IDEAL NUMBER OF TOPICS - ICD-10 BIGRAMS**

I started by arbitrarily setting the number of topics to 6.  In the following function, I attempt to identify the optimal number of topics for the bigram corpus used in the last LDA model by refitting the model with varying numbers of topics and comparing coherence scores.

In [68]:
#define function to compute a series of coherence scores based on using different numbers of topics

def compute_coherence(dictionary,texts, corpus, limit, start=2, step=3):
    
    '''
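    parameters:
    dictionary: Gensim dictionary
    texts: list of tokenized texts
    corpus: Gensim corpus
    limit: max number of topics (exclusive upper bound of the range)
    start: smallest number of topics to try
    step: increment between numbers of topics
    output:
    model_list: list of LDA topic models
    coherence_values: coherence values corresponding to each LDA model with number of topics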
    '''
    
    model_list = []
    coherence_values = []
    
    for num_topics in range(start, limit, step):
        model = gensim.models.LdaMulticore(corpus = corpus,
                                           id2word = dictionary,
                                           num_topics = num_topics,
                                           passes = 30,
                                           workers = 4)
        model_list.append(model)
        
        coherencemodel = CoherenceModel(model=model, texts = texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())
        
    return model_list, coherence_values
    

In [None]:
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [None]:
# call the function defined above (compute_coherence) with its 'texts' keyword
model_list, coherence_values = compute_coherence(dictionary = dictionary_bi, 
                                                 corpus = corpus_bi,
                                                 texts = ds.bigrams,
                                                 start=2, 
                                                 limit=40,
                                                 step=6
                                                )

In [None]:
limit=40; start=2; step=6
x=range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Number of topics")
plt.ylabel("Coherence Score")
plt.legend(["coherence_values"], loc='best')  # pass a list; a bare string is split into characters
plt.show()
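Once the scores are computed, the best-scoring number of topics can be read off programmatically instead of from the plot. The helper below is a hypothetical sketch; the scores passed to it are placeholders for illustration, not results from this notebook.

```python
def best_num_topics(coherence_values, start, limit, step):
    """Return (number of topics, coherence) for the highest-scoring model."""
    candidates = list(range(start, limit, step))
    if len(candidates) != len(coherence_values):
        raise ValueError("expected one coherence value per candidate number of topics")
    best_index = max(range(len(candidates)), key=lambda i: coherence_values[i])
    return candidates[best_index], coherence_values[best_index]

# Placeholder scores for illustration only; these are not results from the notebook.
toy_scores = [0.41, 0.55, 0.62, 0.58]
print(best_num_topics(toy_scores, start=2, limit=26, step=6))  # (14, 0.62)
```

Because the scores usually flatten out rather than peak sharply, it may be worth preferring the smallest number of topics whose score is close to the maximum.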

**Additional strategies with LDA for future implementation**

While LDA provides some interesting results, there may be ways of improving on the performance of the models created above. LDA is intended to be used with longer texts relative to the short list of ICD-10 codes for each record. Other variations of and alternatives to LDA, such as Gibbs Sampling Dirichlet Mixture Model (GSDMM) and bi-term topic modeling, are intended to be used for short text topic modeling. GSDMM starts with the assumption that each document contains only one topic which may be more applicable to this use case. Bi-term topic modeling examines pairs of words and their co-occurrence across the corpus in finding topics.
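To show how biterms differ from the adjacent-code bigrams used above, here is a minimal sketch of biterm extraction, assuming a biterm is any unordered pair of distinct codes on the same record. `extract_biterms` is a hypothetical helper and covers only the extraction step, not the biterm topic model itself.

```python
from collections import Counter
from itertools import combinations

def extract_biterms(doc):
    """Return every unordered pair of distinct codes in one record."""
    return list(combinations(sorted(set(doc)), 2))

toy_docs = [["J44", "I48", "N17"], ["I48", "J44"], ["C78", "C79"]]

# Count each biterm across the whole toy corpus.
biterm_counts = Counter(b for doc in toy_docs for b in extract_biterms(doc))
print(biterm_counts.most_common(1))  # [(('I48', 'J44'), 2)]
```

Unlike `make_bigrams`, this pairs every code with every other code on the record, so the result does not depend on the order in which codes were entered.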

Another approach I'd like to try with the current LDA model is to add a random sample of records with valid underlying cause codes to the corpus. This would increase the amount of information available to the algorithm. With more examples of how codes are distributed within topics and how topics are distributed across the corpus, the model may separate the records more accurately.