In [23]:
#| hide
#| default_exp discovering_topics

In [24]:
#| hide
%matplotlib inline
from nbdev.showdoc import *

In [1]:
#| export
import matplotlib.pyplot as plt
import pandas as pd
import spacy
import re
from gensim.models import Phrases
from gensim.corpora import Dictionary
from gensim.models import LdaModel
import pyLDAvis.gensim_models
import warnings

# Discovering and Visualizing Topics in Texts
(follows: https://github.com/nlptown/nlp-notebooks/blob/master/Discovering%20and%20Visualizing%20Topics%20in%20Texts%20with%20LDA.ipynb)

Often texts are just that: texts, without metadata or labels that tell us what they are about. In such cases we can use topic models, a form of unsupervised ML, to find out which topics the texts discuss.

Topics are groups of related words that often occur together in texts. Topic models can find such clusters of related words; humans then interpret the clusters and assign them labels. So a "natural workflow" seems to present itself:

- Contextualize: group publications according to a context (person, affiliation, research topic).
- Use unsupervised ML to get at the topics.
- Have humans interpret these topics and assign them labels (thesaurus-like broader terms).
- Use the assigned labels to optimize training sets for supervised ML.

A popular topic model is Latent Dirichlet Allocation (LDA). It assumes a prior, in the form of a Dirichlet probability distribution, over the mix of topics a text will have. LDA is often used to model answers to open-ended survey questions.
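
To get a feel for what the Dirichlet prior does, the sketch below draws topic mixtures for imaginary documents using only the standard library. It is an illustration of the prior, not of how Gensim fits a model; the helper names `sample_dirichlet` and `mean_largest_share` and the two concentration values are assumptions chosen for the example. (Gensim documents its default as a symmetric prior of 1/num_topics, which would be 0.1 for 10 topics.)

```python
import random


def sample_dirichlet(alpha, num_topics, rng):
    """Draw one topic mixture from a symmetric Dirichlet(alpha) prior."""
    # Independent Gamma(alpha, 1) draws, normalized to sum to 1,
    # are distributed as a symmetric Dirichlet.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_largest_share(alpha, num_topics=10, num_docs=1000, seed=1):
    """Average share of the single biggest topic over many sampled documents."""
    rng = random.Random(seed)
    shares = (max(sample_dirichlet(alpha, num_topics, rng)) for _ in range(num_docs))
    return sum(shares) / num_docs


sparse = mean_largest_share(alpha=0.1)   # small alpha: few topics dominate a document
flat = mean_largest_share(alpha=10.0)    # large alpha: topics are spread evenly
print(f"alpha=0.1 -> mean largest topic share {sparse:.2f}")
print(f"alpha=10  -> mean largest topic share {flat:.2f}")
```

With a small concentration value most of a document's probability mass lands on one or two topics, which is the behaviour we usually want from a topic model.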

In [2]:
#| export
faculty = 'Erasmus School of Health Policy & Management'
f = '/home/peter/Documents/repub/projects/ESHPM/data/faculty-abstracts-20210906.csv'
df = pd.read_csv(f, low_memory=False).dropna(subset=['abstract'])
# Filter on faculty and drop duplicate DOIs in one chained step (avoids in-place edits on a slice)
df = df.loc[df['faculty_name'] == faculty].drop_duplicates(subset=['doi'])

In [3]:
df.to_csv('eshpm.csv', index=False)

In [27]:
#| export
df.columns

Index(['year', 'faculty_name', 'department_name', 'pure_id', 'wos_id', 'doi',
       'issued', 'doctype', 'language', 'title', 'abstract', 'keywords',
       'subjects'],
      dtype='object')

We will focus on the contents of the `abstract` column:

In [28]:
#| export
abstract = "abstract"
df[abstract].head(10)

30     Conclusion: We observed an increase in inciden...
60     In this commentary, we reflect on Rinaldi and ...
79     Preference heterogeneity is one of the central...
97     Conclusion Adherence to these EULAR PtC will i...
107    In the evaluation of well-being, it is not onl...
128    Background In many healthcare systems, physici...
215    Healthcare organisations rely on their financi...
277    Patient summary: We observed no differences in...
287    Objectives: To assess the cost-effectiveness o...
346    Medical interventions that increase life expec...
Name: abstract, dtype: object

## Preprocessing

Before we can train a model, we need to tokenize the texts. For this we use the spaCy NLP library. The original notebook uses a blank model, which no longer works for this workflow, so we load the pretrained `en_core_web_lg` pipeline instead.

In [29]:
#| export
nlp = spacy.load('en_core_web_lg')

Rows without an abstract were already dropped when the data was loaded. The filter below is a safeguard against any remaining NaNs; we keep all the texts in the target column.

In [30]:
#| export
texts = df[df[abstract].notnull()][abstract]

Next we use spaCy to perform the first pre-processing pass:

In [31]:
#| export
spacy_docs = list(nlp.pipe(texts))

Now we have a list of spaCy documents that we need to transform into a list of tokens. We work with lemmas so that inflected forms of the same word are counted together. These are the pre-processing steps:

- remove all words of 3 characters or fewer (interesting for sentiment analysis, but not so much for topic analysis)
- drop all stopwords
- take the lemmas of all remaining words and lowercase them

In [32]:
#| export
docs = [[t.lemma_.lower() for t in doc if len(t.orth_) > 3 and not t.is_stop] for doc in spacy_docs]

docs is a list of lists. Each inner list contains the lemmas of one abstract.
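
The filter can be tried out without loading a spaCy model. The sketch below applies the same condition (`len(...) > 3` and not a stopword) to a hand-made list of tokens; `ToyToken` is a hypothetical stand-in for a spaCy token, and the lemmas and stopword flags are filled in by hand rather than produced by spaCy.

```python
from typing import NamedTuple


class ToyToken(NamedTuple):
    """Hypothetical stand-in for a spaCy token, with hand-assigned attributes."""
    orth: str       # surface form as it appears in the text
    lemma: str      # dictionary form
    is_stop: bool   # stopword flag


def keep_lemmas(tokens):
    """Same condition as the list comprehension above, applied to toy tokens."""
    return [t.lemma.lower() for t in tokens if len(t.orth) > 3 and not t.is_stop]


tokens = [
    ToyToken("The", "the", True),
    ToyToken("costs", "cost", False),
    ToyToken("were", "be", True),
    ToyToken("low", "low", False),          # exactly 3 characters: dropped
    ToyToken("for", "for", True),
    ToyToken("Patients", "patient", False),
    ToyToken("with", "with", True),         # long enough, but a stopword
    ToyToken("COPD", "COPD", False),
]

print(keep_lemmas(tokens))  # ['cost', 'patient', 'copd']
```

Note that the length test is applied to the surface form, so a three-letter content word such as "low" is lost even though it is not a stopword.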

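But we also want to take frequent bigrams into account when topic modelling, because they often carry meaning that the individual words lack. The original notebook gives a French example ("poids lourds" = "trucks"); in our corpus, think of "breast_cancer" or "cost_effectiveness".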

For this we use the Python Gensim library:

- identify frequent bigrams in the corpus
- append these to the list of tokens for the documents in which they appear

In [33]:
#| export
bigram = Phrases(docs, min_count=10)

for idx in range(len(docs)):
  for token in bigram[docs[idx]]:
    if '_' in token: # bigrams can be picked out by using the '_' that joins the individual words
      docs[idx].append(token) # appended to the end, but topic modelling is BoW, so order is not important!
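
To make the idea concrete, here is a deliberately simplified stand-in for the bigram step using only the standard library. It keeps every adjacent word pair that occurs at least `min_count` times and appends the joined form to the document. Gensim's `Phrases` does more than this (it also scores candidate pairs against a threshold and merges tokens left to right), so treat the helper names and the raw count cut-off as assumptions of this sketch, not as the library's algorithm.

```python
from collections import Counter


def frequent_bigrams(docs, min_count):
    """Return the set of adjacent word pairs seen at least min_count times."""
    counts = Counter()
    for doc in docs:
        counts.update(zip(doc, doc[1:]))
    return {pair for pair, n in counts.items() if n >= min_count}


def append_bigrams(docs, bigrams):
    """Return copies of the documents with frequent bigrams appended as 'a_b'."""
    result = []
    for doc in docs:
        joined = ["_".join(pair) for pair in zip(doc, doc[1:]) if pair in bigrams]
        result.append(doc + joined)
    return result


toy_docs = [
    ["breast", "cancer", "treatment", "cost"],
    ["cost", "breast", "cancer", "trial"],
    ["cancer", "trial", "cost"],
]

pairs = frequent_bigrams(toy_docs, min_count=2)
with_bigrams = append_bigrams(toy_docs, pairs)
print(with_bigrams[1])
# ['cost', 'breast', 'cancer', 'trial', 'breast_cancer', 'cancer_trial']
```

As in the notebook code, the original unigrams are kept and the bigrams are added at the end of each document.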

Let's have a look at the fifth document:

In [34]:
#| export
docs[4]

['evaluation',
 'important',
 'people',
 'absolute',
 'term',
 'compare',
 'reference',
 'point',
 'relative',
 'term',
 'explore',
 'relevance',
 'relative',
 'comparison',
 'test',
 'effect',
 'people',
 'self',
 'rate',
 'position',
 'potential',
 'reference',
 'point',
 'income',
 'health',
 'subjective',
 'multiple',
 'discrepancies',
 'theory',
 'framework',
 'identify',
 'seven',
 'potentially',
 'relevant',
 'reference',
 'point',
 'income',
 'health',
 'representative',
 'sample',
 'netherlands',
 'assess',
 'income',
 'health',
 'relative',
 'reference',
 'point',
 'addition',
 'elicit',
 'monthly',
 'household',
 'income',
 'health',
 'status',
 'eq-5d-5l',
 'subjective',
 'swls',
 'line',
 'literature',
 'find',
 'negative',
 'convex',
 'relationship',
 'subjective',
 'positive',
 'relationship',
 'employ',
 'income',
 'health',
 'income',
 'subjective',
 'associate',
 'current',
 'income',
 'compare',
 'respondent',
 'need',
 'progression',
 'time',
 'health',
 'especially

Perfect, we have found two bigrams that are frequent across the corpus in this particular document.

Next, the final Gensim-specific pre-processing steps:

- create a dictionary representation of the documents; the dictionary maps each word to a unique ID so that we can make BoW representations of each document. It contains the ids of the words in the documents and their frequencies;
- remove the least and most frequent words from the vocabulary (faster training, better quality). The minimum frequency is expressed as an absolute number of documents, the maximum frequency as the proportion of documents a word is allowed to occur in:

In [35]:
#| export
dictionary = Dictionary(docs)
print(f"Number of unique words in original documents: {len(dictionary)}")

dictionary.filter_extremes(no_below=3, no_above=0.25)
print(f"Number of unique words after removing rare and common words: {len(dictionary)}")

# Let's look at an example document:
print(f"Example representation of document 5: {dictionary.doc2bow(docs[5])}")

Number of unique words in original documents: 10540
Number of unique words after removing rare and common words: 4057
Example representation of document 5: [(34, 1), (37, 1), (40, 1), (46, 1), (55, 3), (65, 6), (82, 1), (86, 1), (103, 1), (110, 1), (122, 3), (139, 1), (152, 1), (177, 1), (188, 1), (196, 1), (197, 2), (198, 8), (199, 2), (200, 1), (201, 3), (202, 1), (203, 1), (204, 1), (205, 4), (206, 2), (207, 1), (208, 1), (209, 1), (210, 1), (211, 1), (212, 3), (213, 1), (214, 1), (215, 1), (216, 3), (217, 1), (218, 9), (219, 2), (220, 1), (221, 2), (222, 1), (223, 2), (224, 2), (225, 1), (226, 1), (227, 1), (228, 1), (229, 1), (230, 1), (231, 1), (232, 2), (233, 1), (234, 1), (235, 1), (236, 1), (237, 1), (238, 1), (239, 2), (240, 1), (241, 1), (242, 1), (243, 1), (244, 1), (245, 1), (246, 1), (247, 1), (248, 1), (249, 2), (250, 1), (251, 1), (252, 1), (253, 1), (254, 1), (255, 1), (256, 1), (257, 1), (258, 1), (259, 1), (260, 10), (261, 6), (262, 1), (263, 1), (264, 7), (265, 1), 
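
The effect of the two thresholds can be reproduced on a toy corpus. The sketch below follows the documented meaning of `no_below` (minimum number of documents) and `no_above` (maximum fraction of documents); the function name `filter_vocabulary` is hypothetical, and the real `filter_extremes` also caps the vocabulary size, which is left out here.

```python
from collections import Counter


def filter_vocabulary(docs, no_below, no_above):
    """Keep words that occur in at least no_below documents and in at most
    a fraction no_above of all documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each word once per document
    max_docs = no_above * len(docs)
    return sorted(w for w, df in doc_freq.items() if no_below <= df <= max_docs)


toy_docs = [
    ["health", "cost", "liver"],
    ["health", "cost", "policy"],
    ["health", "policy", "trial"],
    ["health", "cost", "utility"],
]

vocabulary = filter_vocabulary(toy_docs, no_below=2, no_above=0.75)
print(vocabulary)  # ['cost', 'policy']
```

"health" occurs in every document and is removed as too common; "liver", "trial" and "utility" occur only once and are removed as too rare.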

Next, we create bag-of-words (BoW) representations for each of the documents in the corpus:

In [36]:
#| export
corpus = [dictionary.doc2bow(doc) for doc in docs]
corpus[5]

[(34, 1),
 (37, 1),
 (40, 1),
 (46, 1),
 (55, 3),
 (65, 6),
 (82, 1),
 (86, 1),
 (103, 1),
 (110, 1),
 (122, 3),
 (139, 1),
 (152, 1),
 (177, 1),
 (188, 1),
 (196, 1),
 (197, 2),
 (198, 8),
 (199, 2),
 (200, 1),
 (201, 3),
 (202, 1),
 (203, 1),
 (204, 1),
 (205, 4),
 (206, 2),
 (207, 1),
 (208, 1),
 (209, 1),
 (210, 1),
 (211, 1),
 (212, 3),
 (213, 1),
 (214, 1),
 (215, 1),
 (216, 3),
 (217, 1),
 (218, 9),
 (219, 2),
 (220, 1),
 (221, 2),
 (222, 1),
 (223, 2),
 (224, 2),
 (225, 1),
 (226, 1),
 (227, 1),
 (228, 1),
 (229, 1),
 (230, 1),
 (231, 1),
 (232, 2),
 (233, 1),
 (234, 1),
 (235, 1),
 (236, 1),
 (237, 1),
 (238, 1),
 (239, 2),
 (240, 1),
 (241, 1),
 (242, 1),
 (243, 1),
 (244, 1),
 (245, 1),
 (246, 1),
 (247, 1),
 (248, 1),
 (249, 2),
 (250, 1),
 (251, 1),
 (252, 1),
 (253, 1),
 (254, 1),
 (255, 1),
 (256, 1),
 (257, 1),
 (258, 1),
 (259, 1),
 (260, 10),
 (261, 6),
 (262, 1),
 (263, 1),
 (264, 7),
 (265, 1),
 (266, 1),
 (267, 1),
 (268, 1),
 (269, 1),
 (270, 1),
 (271, 8),
 (272,
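
Each pair in this output is a word id followed by the number of times that word occurs in the document. A minimal standard-library sketch of the same idea is shown below; the helpers `build_token2id` and `to_bow` are hypothetical, and the ids they assign will not match the ones Gensim produces.

```python
from collections import Counter


def build_token2id(docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs; tokens outside the vocabulary are skipped."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


toy_docs = [["income", "health", "income"], ["health", "reference", "point"]]
token2id = build_token2id(toy_docs)

bow = to_bow(["income", "health", "income", "unseen"], token2id)
print(bow)  # [(0, 2), (1, 1)]
```

Word order is lost in this representation, which is why appending the bigrams to the end of each document earlier did no harm.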

## Training

In [46]:
#| export
model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, chunksize=1000, passes=5, random_state=1)

## Results

What did the model learn? We start by printing out the 10 words that are most characteristic of each topic. Some of the topics are general, but others are more precise:

In [47]:
#| export
for (topic, words) in model.print_topics():
  print(topic + 1, ":", words)

1 : 0.033*"treatment" + 0.026*"cancer" + 0.023*"cost" + 0.015*"trial" + 0.014*"year" + 0.013*"technique" + 0.012*"breast" + 0.012*"breast_cancer" + 0.012*"datum" + 0.010*"effectiveness"
2 : 0.015*"utility" + 0.014*"life" + 0.013*"cost" + 0.012*"model" + 0.012*"value" + 0.011*"method" + 0.010*"time" + 0.010*"quality" + 0.009*"analysis" + 0.009*"measure"
3 : 0.012*"practice" + 0.011*"work" + 0.011*"management" + 0.010*"intervention" + 0.010*"professional" + 0.010*"support" + 0.009*"process" + 0.008*"need" + 0.008*"quality" + 0.008*"approach"
4 : 0.049*"cost" + 0.027*"economic" + 0.017*"treatment" + 0.015*"clinical" + 0.013*"policy" + 0.012*"quality" + 0.012*"cost_effectiveness" + 0.012*"effectiveness" + 0.011*"term" + 0.011*"effective"
5 : 0.039*"cost" + 0.031*"liver" + 0.027*"failure" + 0.026*"chronic" + 0.017*"high" + 0.017*"acute" + 0.017*"medical" + 0.014*"effectiveness" + 0.012*"copd" + 0.011*"production"
6 : 0.033*"health_care" + 0.020*"system" + 0.015*"elsevier" + 0.013*"right" + 
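
The printed topics are plain strings of the form `weight*"word" + ...`. If the weights are needed as numbers, for example to compare topics, the strings can be parsed with a regular expression. The sketch below does that for the start of topic 1 as printed above; `parse_topic` is a hypothetical helper, and Gensim's own `show_topic` method should give similar (word, weight) pairs without any parsing.

```python
import re

# One term looks like 0.033*"treatment": a weight, an asterisk, a quoted word.
TERM = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(topic_string):
    """Turn a printed topic string into a list of (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM.findall(topic_string)]


topic_1 = '0.033*"treatment" + 0.026*"cancer" + 0.023*"cost" + 0.015*"trial"'
terms = parse_topic(topic_1)
print(terms)
# [('treatment', 0.033), ('cancer', 0.026), ('cost', 0.023), ('trial', 0.015)]
```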

In [48]:
#| export
pyLDAvis.enable_notebook()
warnings.filterwarnings("ignore", category=DeprecationWarning)

pyLDAvis.gensim_models.prepare(model, corpus, dictionary, sort_topics=False)


Let's check the topics the model assigns to some individual documents. LDA assigns a high probability to a small number of topics for each document:

In [49]:
#| export
for (text, doc) in zip(texts[:10], docs[:10]):
    print(text)
    print([(topic+1, prob) for (topic, prob) in model[dictionary.doc2bow(doc)] if prob > 0.1])

Conclusion: We observed an increase in incidence for patients with stage I and III and an improvement in survival for patients with stage II, III and IV. These trends can be partly explained by the introduction of the SLNB and the novel drugs. (C) 2021 The Authors. Published by Elsevier Ltd.
[(1, 0.40028837), (3, 0.26206908), (6, 0.16136518), (10, 0.14764579)]
In this commentary, we reflect on Rinaldi and Bekker's scoping review of the literature on populist radical right (PRR) parties and welfare policies. We argue that their review provides political scientists and healthcare scholars with a firm basis to further explore the relationships between populism and welfare policies in different political systems. In line with the authors, we furthermore (re)emphasize the need for additional empirical inquiries into the relationship between populism and healthcare. But instead of expanding the research agenda suggested - for instance by adding categories or niches in which this relationship
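
The list printed under each abstract is the result of a simple filter on (topic, probability) pairs. The sketch below isolates that step and also sorts the result so the dominant topic comes first. The first four probabilities are taken from the output for the first abstract above (converted back to zero-based topic ids); the fifth pair and the helper name `significant_topics` are made up for illustration.

```python
def significant_topics(topic_probs, threshold=0.1):
    """Keep topics above the threshold, renumber them from 1,
    and sort by probability with the dominant topic first."""
    kept = [(topic + 1, prob) for topic, prob in topic_probs if prob > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Zero-based (topic, probability) pairs, as a model would return them.
first_abstract = [
    (0, 0.40028837),
    (2, 0.26206908),
    (4, 0.02),          # hypothetical low-probability topic, filtered out
    (5, 0.16136518),
    (9, 0.14764579),
]

topics = significant_topics(first_abstract)
print(topics)
print("dominant topic:", topics[0][0])
```

Raising the threshold keeps only the strongest assignments, which can help when one label per document is wanted.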

In [41]:
#| hide
import nbdev; nbdev.nbdev_export()