# Case study example -- vaccination-related literature topic detection

In the first case study, we explore whether the existing COVID-19-related literature that focuses on vaccines and treatment programs clusters into well-defined topics that could help medical professionals further explore the existing studies. The case study is structured as follows.

1. Description of the preprocessing machinery and the request-ready API we designed around the CORD-19 data set.
2. An example run that is completely interactive.
3. Discussion of the results.

In [1]:
## the code corresponding to c1.
import json
import requests
import nltk
from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer
from gensim import corpora
from gensim.models.ldamodel import LdaModel

## fetch the NLTK resources used below
nltk.download('wordnet')
nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))

def get_lemma(word):
    ## reduce a word to its WordNet base form; keep it unchanged if none is found
    lemma = wn.morphy(word)
    if lemma is None:
        return word
    else:
        return lemma

def get_lemma2(word):
    ## alternative lemmatizer (not used below)
    return WordNetLemmatizer().lemmatize(word)

def prepare_text_for_lda(text):
    ## split on spaces, keep tokens longer than 4 characters,
    ## drop English stop words, and lemmatize what remains
    tokens = text.split(" ")
    tokens = [token for token in tokens if len(token) > 4]
    tokens = [token for token in tokens if token not in en_stop]
    tokens = [get_lemma(token) for token in tokens]
    return tokens


[nltk_data] Downloading package wordnet to /home/blazs/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /home/blazs/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
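
Because `prepare_text_for_lda` splits on single spaces and does not change case, punctuation stays attached to tokens and capitalised variants are kept apart; the topic listing later in this notebook contains tokens such as `vaccines.` and `Symptoma`. The following is a minimal sketch of a stricter tokenizer, not the notebook's method. The name `normalize_tokens`, the regular expression, and the `stop_words` parameter are assumptions made for illustration, and lemmatization is left out to keep the example free of NLTK downloads.

```python
import re

# Hypothetical helper, not part of the original notebook.
# Matches lowercase alphanumeric runs, allowing inner hyphens (e.g. "covid-19").
_TOKEN_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")

def normalize_tokens(text, stop_words=frozenset(), min_len=5):
    """Lowercase the text, drop punctuation, short tokens and stop words."""
    tokens = _TOKEN_RE.findall(text.lower())
    # min_len=5 mirrors the notebook's `len(token) > 4` filter.
    return [t for t in tokens if len(t) >= min_len and t not in stop_words]

example = "Vaccines. Several vaccines are tested against COVID-19 (see above)."
print(normalize_tokens(example, stop_words={"above", "against"}))
# ['vaccines', 'several', 'vaccines', 'tested', 'covid-19']
```

In the notebook, `en_stop` could be passed as `stop_words`, and `get_lemma` could be applied to the returned tokens. Doing so would change the learned topics, so the output shown further down would no longer match.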


Above, we first define some functions that are used for preprocessing. Next, we simply query the API to fetch the documents most related to the user-specified keywords (interesting keywords).

In [10]:
interesting_keywords = ["covid-19","vaccines","treatment"]
json_data_all = []
for keyword in interesting_keywords:
    example_query = "http://covid19explorer.ijs.si/gp/api?keyword={}".format(keyword)
    response = requests.get(example_query)
    json_data = json.loads(response.text)
    json_data_all+=json_data

## keep the abstracts that are longer than 30 characters
top_docs = []
for hit in json_data_all:
    title, abstract = hit['article_title'], hit['article_abstract']
    if len(abstract) > 30:
        top_docs.append(abstract)

## clean: tokenize, filter and lemmatize each abstract
clean_text = []
for el in top_docs:
    tokens = prepare_text_for_lda(el)
    clean_text.append(tokens)
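
Since the results of all keyword queries are concatenated, the same article could in principle appear more than once in `json_data_all`; whether the API actually returns overlapping hits is an assumption here. The sketch below shows one way to collect abstracts while skipping duplicates and missing values. The helper `collect_abstracts` and the sample hits are hypothetical; only the field names `article_title` and `article_abstract` come from the notebook.

```python
def collect_abstracts(hits, min_chars=30):
    """Return unique abstracts longer than min_chars, in first-seen order."""
    seen = set()
    abstracts = []
    for hit in hits:
        # Treat a missing or null abstract as an empty string.
        abstract = hit.get("article_abstract") or ""
        if len(abstract) > min_chars and abstract not in seen:
            seen.add(abstract)
            abstracts.append(abstract)
    return abstracts

# Made-up hits that only mimic the two fields read by the notebook.
fake_hits = [
    {"article_title": "A", "article_abstract": "A long enough abstract about vaccine safety."},
    {"article_title": "B", "article_abstract": "Too short."},
    {"article_title": "A", "article_abstract": "A long enough abstract about vaccine safety."},
    {"article_title": "C", "article_abstract": None},
]
print(len(collect_abstracts(fake_hits)))  # 1
```

Calling `collect_abstracts(json_data_all)` would play the role of the loop that fills `top_docs`, with the difference that repeated abstracts are counted once.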


Finally, let's use Gensim's LDA implementation to detect some topics!

In [11]:
dictionary = corpora.Dictionary(clean_text)
corpus = [dictionary.doc2bow(text) for text in clean_text]

NUM_TOPICS = 10
ldamodel = LdaModel(corpus, num_topics=NUM_TOPICS, id2word=dictionary, passes=15)

topics = ldamodel.print_topics(num_words=5)
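
The `corpus` passed to the model is a bag-of-words encoding: each document becomes a list of `(token_id, count)` pairs. The pure-Python sketch below illustrates that idea on a toy input. It is a conceptual stand-in, not Gensim's implementation; the functions `build_vocabulary` and `to_bow` are hypothetical, and the way ids are assigned here (alphabetical order) need not match what `corpora.Dictionary` does.

```python
from collections import Counter

def build_vocabulary(docs):
    """Map each distinct token to an integer id (alphabetical order here)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    return {tok: i for i, tok in enumerate(vocab)}

def to_bow(doc, token2id):
    """Encode one tokenized document as sorted (token_id, count) pairs."""
    counts = Counter(tok for tok in doc if tok in token2id)
    return sorted((token2id[tok], n) for tok, n in counts.items())

docs = [["vaccine", "trial", "vaccine"], ["treatment", "trial"]]
token2id = build_vocabulary(docs)
print(token2id)
# {'treatment': 0, 'trial': 1, 'vaccine': 2}
print([to_bow(doc, token2id) for doc in docs])
# [[(1, 1), (2, 2)], [(0, 1), (1, 1)]]
```

Word order is discarded in this representation; LDA only sees how often each token occurs in each document.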

What are the representative words in these topics?

In [12]:
for topic in topics:
    print(topic)

(0, '0.012*"Symptoma" + 0.011*"treatment" + 0.009*"case" + 0.009*"clinical" + 0.008*"number"')
(1, '0.011*"treatment" + 0.009*"relapse" + 0.008*"remain" + 0.007*"disease" + 0.007*"treat"')
(2, '0.027*"vaccine" + 0.009*"periodontal" + 0.009*"gulae" + 0.007*"disease" + 0.007*"edible"')
(3, '0.017*"treatment" + 0.015*"vaccine" + 0.009*"group" + 0.007*"veterinary" + 0.007*"current"')
(4, '0.018*"vaccine" + 0.017*"disease" + 0.013*"infectious" + 0.009*"control" + 0.008*"potential"')
(5, '0.042*"vaccine" + 0.016*"animal" + 0.014*"human" + 0.010*"method" + 0.009*"vaccines."')
(6, '0.013*"vaccine" + 0.007*"infection" + 0.005*"pneumonia" + 0.005*"lupus" + 0.005*"problem"')
(7, '0.014*"vaccine" + 0.011*"disease" + 0.009*"week" + 0.008*"safety" + 0.007*"COVID-19"')
(8, '0.018*"vaccine" + 0.009*"staff" + 0.008*"health" + 0.008*"medical" + 0.007*"include"')
(9, '0.018*"treatment" + 0.012*"vaccine" + 0.008*"reaction" + 0.008*"acute" + 0.007*"patient"')
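
The topics above are printed as formatted strings, which is convenient for reading but awkward for further processing. As a small sketch, the hypothetical helper `parse_topic` below turns one such string into `(word, weight)` pairs with a regular expression; the pattern is an assumption based on the format of the output shown here. Gensim's `show_topic` method should give similar pairs directly from the model.

```python
import re

# Matches terms of the form 0.027*"vaccine" as they appear in the printed topics.
_TERM_RE = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Convert a printed topic string into a list of (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in _TERM_RE.findall(topic_string)]

line = '0.027*"vaccine" + 0.009*"periodontal" + 0.009*"gulae"'
print(parse_topic(line))
# [('vaccine', 0.027), ('periodontal', 0.009), ('gulae', 0.009)]
```

Applied to every entry of `topics`, this would allow, for example, counting in how many topics a given word appears among the top five.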


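As we can see, the top words point to various sub-topics that might be of interest to a medical professional, for example veterinary use (topic 3), vaccine safety (topic 7), and medical staff (topic 8). To explore them in more detail, the abstracts themselves can be inspected too!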
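
To connect topics back to individual abstracts, one option is to group documents by their most probable topic. The sketch below assumes each document comes with a list of `(topic_id, probability)` pairs, which is the shape that `ldamodel.get_document_topics(bow)` is expected to return for one bag-of-words vector. The helpers `dominant_topic` and `group_by_topic` and the sample distributions are hypothetical and are not output of the model above.

```python
def dominant_topic(topic_probs):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topic_probs, key=lambda pair: pair[1])

def group_by_topic(doc_topic_probs):
    """Map each topic id to the indices of documents it dominates."""
    groups = {}
    for doc_index, probs in enumerate(doc_topic_probs):
        if not probs:
            continue  # skip documents without any topic assignment
        topic_id, _ = dominant_topic(probs)
        groups.setdefault(topic_id, []).append(doc_index)
    return groups

# Made-up per-document topic distributions, for illustration only.
fake_distributions = [
    [(0, 0.1), (2, 0.9)],
    [(5, 0.6), (2, 0.4)],
    [(2, 0.7), (7, 0.3)],
]
print(group_by_topic(fake_distributions))
# {2: [0, 2], 5: [1]}
```

With such a grouping, a reader could pick a topic and then look up its abstracts by index in `top_docs`.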

In [14]:
top_docs[-1]

'Currently, Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2, formerly known as 2019-nCoV, the causative pathogen of Coronavirus Disease 2019 (COVID-19)) has rapidly spread across China and around the world, causing an outbreak of acute infectious pneumonia. No specific anti-virus drugs or vaccines are available for the treatment of this sudden and lethal disease. The supportive care and non-specific treatment to ameliorate the symptoms of the patient are the only options currently. At the top of these conventional therapies, greater than 85% of SARS-CoV-2 infected patients in China are receiving Traditional Chinese Medicine (TCM) treatment. In this article, relevant published literatures are thoroughly reviewed and current applications of TCM in the treatment of COVID-19 patients are analyzed. Due to the homology in epidemiology, genomics, and pathogenesis of the SARS-CoV-2 and SARS-CoV, and the widely use of TCM in the treatment of SARS-CoV, the clinical evidence showing t