# Case study example: Literature related to vaccination and treatment plans

In this first case study, we explore whether the existing COVID-19-related literature on vaccines and treatment programs clusters into well-defined topics that could help medical professionals navigate the existing studies. The case study is structured as follows.

1. Using unsupervised keyword detection, we annotated all existing COVID-19-related literature.
2. **MeDEP** offers simple API access to the annotated documents.
3. An example that explores vaccination-related literature is shown next.

In [2]:
## Imports and preprocessing helpers for the first case study.
import json
import requests
import nltk
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet as wn
from gensim import corpora
from gensim.models.ldamodel import LdaModel

# Download the WordNet and stop-word resources (skipped if already up to date).
nltk.download('wordnet')
nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))

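def get_lemma(word):
    """Return the WordNet base form of `word`, or `word` itself if none is found."""
    lemma = wn.morphy(word)
    return word if lemma is None else lemma

def get_lemma2(word):
    """Alternative lemmatizer based on WordNetLemmatizer (not used below)."""
    return WordNetLemmatizer().lemmatize(word)

def prepare_text_for_lda(text):
    """Tokenize an abstract: keep tokens longer than 4 characters, drop stop words, lemmatize."""
    tokens = text.split(" ")
    tokens = [token for token in tokens if len(token) > 4]
    tokens = [token for token in tokens if token not in en_stop]
    tokens = [get_lemma(token) for token in tokens]
    return tokens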


[nltk_data] Downloading package wordnet to /home/blazs/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /home/blazs/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
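
To make the effect of the preprocessing concrete, here is a minimal, dependency-free sketch of the same filtering steps (split on whitespace, keep tokens longer than four characters, drop stop words). The tiny `TOY_STOP_WORDS` set and the function name `filter_tokens` are illustrative assumptions, not part of the notebook. Lemmatization is omitted, and the lowercasing and punctuation stripping are additions that the original `prepare_text_for_lda` does not perform.

```python
import string

# Hypothetical, tiny stop-word list for illustration only; the notebook uses
# NLTK's full English list (en_stop) instead.
TOY_STOP_WORDS = {"about", "there", "which", "these", "their"}

def filter_tokens(text, stop_words=TOY_STOP_WORDS, min_len=5):
    """Split on whitespace, normalise, and keep only long, non-stop-word tokens."""
    tokens = []
    for raw in text.split():
        # Assumption: lowercase and strip surrounding punctuation so that
        # "Vaccines," and "vaccines" are treated as the same token.
        token = raw.lower().strip(string.punctuation)
        if len(token) >= min_len and token not in stop_words:
            tokens.append(token)
    return tokens

sample = "There are several Vaccines, which target the spike protein."
print(filter_tokens(sample))
# ['several', 'vaccines', 'target', 'spike', 'protein']
```

Without the normalisation step, a capitalised stop word such as "There" would not match a lowercase stop-word list, and "Vaccines," would be kept with its trailing comma.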


The first code cell defines helper functions used for preprocessing. Next, we query the API to fetch the documents most related to the user-specified keywords (`interesting_keywords`).

In [3]:
interesting_keywords = ["covid-19", "vaccines", "treatment"]
json_data_all = []
for keyword in interesting_keywords:
    # One request per keyword; the hits are concatenated into a single list.
    example_query = "http://covid19explorer.ijs.si/gp/api?keyword={}".format(keyword)
    response = requests.get(example_query)
    json_data = json.loads(response.text)
    json_data_all += json_data

## collect abstracts longer than 30 characters
top_docs = []
for hit in json_data_all:
    abstract = hit['article_abstract']
    if len(abstract) > 30:
        top_docs.append(abstract)

## clean
clean_text = []
for el in top_docs:
    tokens = prepare_text_for_lda(el)
    clean_text.append(tokens)
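
Because the hits for all three keywords are concatenated, an article that matches more than one keyword may end up in `top_docs` several times. Whether the API actually returns overlapping results is not verified here. The following is a minimal sketch of de-duplicating by title, assuming each hit is a dictionary with the `article_title` and `article_abstract` keys used above. The function name `unique_abstracts` and the sample hits are hypothetical.

```python
def unique_abstracts(hits, min_len=30):
    """Return abstracts longer than min_len, keeping one per distinct title."""
    seen_titles = set()
    abstracts = []
    for hit in hits:
        abstract = hit.get("article_abstract") or ""
        if len(abstract) <= min_len:
            continue
        # Normalise the title so case/whitespace differences do not matter.
        key = (hit.get("article_title") or "").strip().lower()
        if key:
            if key in seen_titles:
                continue
            seen_titles.add(key)
        abstracts.append(abstract)
    return abstracts

# Made-up hits, only to show the behaviour.
hits = [
    {"article_title": "Paper A", "article_abstract": "A" * 40},
    {"article_title": "paper a ", "article_abstract": "A" * 40},  # duplicate title
    {"article_title": "Paper B", "article_abstract": "too short"},
    {"article_title": "Paper C", "article_abstract": "C" * 50},
]
print(len(unique_abstracts(hits)))  # 2
```

Duplicated abstracts would otherwise be counted more than once when the topic model is fitted, which can inflate the weight of their words.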


Finally, let's use Gensim's LDA implementation to detect some topics!

In [4]:
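# Map each token to an integer id and convert documents to bag-of-words vectors.
dictionary = corpora.Dictionary(clean_text)
corpus = [dictionary.doc2bow(text) for text in clean_text]

NUM_TOPICS = 10
ldamodel = LdaModel(corpus, num_topics=NUM_TOPICS, id2word=dictionary, passes=15)

# Each entry is (topic_id, formatted string of the 5 highest-weighted words).
topics = ldamodel.print_topics(num_words=5)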

What are the representative words in these topics?

In [20]:
for enx, topic in enumerate(topics):
    print("TOPIC {} KEYWORDS: ".format(enx)+"AND ".join(topic[1].split("+")[0:3]))

TOPIC 0 KEYWORDS: 0.012*"staff" AND  0.010*"disease" AND  0.009*"treatment" 
TOPIC 1 KEYWORDS: 0.020*"vaccine" AND  0.006*"animal" AND  0.006*"treatment" 
TOPIC 2 KEYWORDS: 0.014*"disease" AND  0.011*"infectious" AND  0.008*"vaccine" 
TOPIC 3 KEYWORDS: 0.023*"vaccine" AND  0.010*"respiratory" AND  0.007*"delivery" 
TOPIC 4 KEYWORDS: 0.010*"treatment" AND  0.010*"group" AND  0.006*"disease" 
TOPIC 5 KEYWORDS: 0.036*"vaccine" AND  0.010*"animal" AND  0.009*"disease" 
TOPIC 6 KEYWORDS: 0.000*"vaccine" AND  0.000*"animal" AND  0.000*"human" 
TOPIC 7 KEYWORDS: 0.014*"treatment" AND  0.014*"week" AND  0.009*"patient" 
TOPIC 8 KEYWORDS: 0.024*"vaccine" AND  0.015*"treatment" AND  0.010*"vaccination" 
TOPIC 9 KEYWORDS: 0.020*"vaccine" AND  0.010*"animal" AND  0.010*"treatment" 
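
The formatted topic strings shown above mix weights and words, which is awkward for further processing. As a minimal sketch, the string can be parsed into `(word, weight)` pairs with a regular expression. The helper name `parse_topic` is hypothetical, and the example string is abbreviated to the three terms of Topic 0 printed above.

```python
import re

# Matches one term of the form 0.012*"staff" in a formatted topic string.
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic_string, top_n=3):
    """Turn '0.012*"staff" + 0.010*"disease"' into [('staff', 0.012), ...]."""
    pairs = [(word, float(weight))
             for weight, word in TERM_PATTERN.findall(topic_string)]
    return pairs[:top_n]

example = '0.012*"staff" + 0.010*"disease" + 0.009*"treatment"'
for word, weight in parse_topic(example):
    print("{}: {:.3f}".format(word, weight))
# staff: 0.012
# disease: 0.010
# treatment: 0.009
```

Gensim can also return the pairs directly via `ldamodel.show_topics(formatted=False)`, which avoids string parsing altogether.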


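As we can see, the topics emphasize different aspects of the literature (for example, staff, animal studies, and vaccine delivery) that might be of interest to a medical professional. Note that Topic 6 assigns near-zero weight to all of its top words, which suggests it is effectively unused. To explore the results in more detail, the abstracts collected in `top_docs` can be inspected directly.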

In [6]:
top_docs[-1]

'Currently, Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2, formerly known as 2019-nCoV, the causative pathogen of Coronavirus Disease 2019 (COVID-19)) has rapidly spread across China and around the world, causing an outbreak of acute infectious pneumonia. No specific anti-virus drugs or vaccines are available for the treatment of this sudden and lethal disease. The supportive care and non-specific treatment to ameliorate the symptoms of the patient are the only options currently. At the top of these conventional therapies, greater than 85% of SARS-CoV-2 infected patients in China are receiving Traditional Chinese Medicine (TCM) treatment. In this article, relevant published literatures are thoroughly reviewed and current applications of TCM in the treatment of COVID-19 patients are analyzed. Due to the homology in epidemiology, genomics, and pathogenesis of the SARS-CoV-2 and SARS-CoV, and the widely use of TCM in the treatment of SARS-CoV, the clinical evidence showing t

In [7]:
top_docs[3]

'COVID-19 is a novel coronavirus with an outbreak of unusual viral pneumonia in Wuhan, China, and then pandemic. Based on its phylogenetic relationships and genomic structures the COVID-19 belongs to genera Betacoronavirus. Human Betacoronaviruses (SARS-CoV-2, SARS-CoV, and MERS-CoV) have many similarities, but also have differences in their genomic and phenotypic structure that can influence their pathogenesis. COVID-19 is containing single-stranded (positive-sense) RNA associated with a nucleoprotein within a capsid comprised of matrix protein. A typical CoV contains at least six ORFs in its genome.'
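
To link an individual abstract back to the topics, one option is to look at its topic distribution. In Gensim, `ldamodel.get_document_topics(bow)` returns a list of `(topic_id, probability)` pairs for a bag-of-words document. The sketch below picks the most probable topic from such a list; the helper name `dominant_topic` and the example numbers are hypothetical and not produced by the model above.

```python
def dominant_topic(topic_distribution):
    """Return the (topic_id, probability) pair with the highest probability."""
    if not topic_distribution:
        return None
    return max(topic_distribution, key=lambda pair: pair[1])

# Made-up distribution in the (topic_id, probability) format described above.
example_distribution = [(1, 0.12), (5, 0.61), (8, 0.27)]
topic_id, probability = dominant_topic(example_distribution)
print(topic_id, probability)  # 5 0.61
```

In the notebook, the input would be something like `ldamodel.get_document_topics(corpus[3])` for the abstract shown above; this call is not executed here.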