## Project Design

The knowledge gap between the public and specialists, together with scientific uncertainty, is an important driver of pandemic anxiety. In this task, we examine papers that discuss some of the controversial topics that contribute to rumours and anxiety.

Here I develop a search system that extracts sentences from abstracts that are relevant to a question. The questions are associated with rumours and uncertain information circulating among the public. Different questions can be tried here. If we want to take this work forward as a paper, an important step is to evaluate the search system against a human annotation baseline.

### Step 1:
The search system first extracts the abstracts that contain a keyword (e.g. ‘mask’), then uses LDA to group the abstracts into topics. We identify the topic that is most relevant to the question and keep the abstracts that contain this target topic. The system then extracts the sentences that contain the keyword from the relevant abstracts. The standard approach in a search system is to rank documents with TF-IDF; here we instead apply LDA topic modelling to the nouns, verbs and adjectives of each abstract. Users can decide which topic is relevant once they see the most frequent keywords in each topic.
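
For comparison with the LDA route, the TF-IDF ranking mentioned above can be sketched in a few lines. This is only an illustrative baseline and is not part of the notebook's pipeline; the function name `tfidf_rank` and the toy abstracts are assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_rank(docs, query):
    """Rank document ids by the summed TF-IDF weight of the query terms."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term in tf:
                idf = math.log(n_docs / df[term])
                score += (tf[term] / len(tokens)) * idf
        scores[doc_id] = score
    # highest score first
    return sorted(scores, key=scores.get, reverse=True)


# toy abstracts, invented for illustration only
toy_docs = {
    'a': 'mask wearing reduces transmission',
    'b': 'incubation period of the virus',
    'c': 'the public and the virus',
}
ranking = tfidf_rank(toy_docs, 'mask transmission')
```

A ranking like this needs the user to already know good query terms, which is the limitation the topic keywords are meant to ease.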

The benefit of this approach is that, when we look for content relevant to a question, we often do not know which keywords in the articles are most relevant to it, because users are usually not familiar with academic papers. In our system, the topic keywords serve as a prime for the query in the next step, which extracts sentences from the abstracts.
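
The sentence extraction step can be sketched as follows. The notebook's own functions split abstracts on every full stop, which also cuts decimals such as "5.2 days" in two; the regular expression below is one possible heuristic to avoid that, not the method used to produce the results in this notebook. The function name `sentences_with_keywords` is hypothetical.

```python
import re


def sentences_with_keywords(text, keywords):
    """Return the sentences of `text` that mention any of `keywords`."""
    # split after ., ! or ? only when whitespace follows, so "5.2 days" stays whole
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    lowered = [k.lower() for k in keywords]
    # substring match, as in the notebook's keyword check
    return [s for s in sentences if any(k in s.lower() for k in lowered)]


example = 'The mean incubation period was 5.2 days. Masks were not studied.'
hits = sentences_with_keywords(example, ['day'])
```

The heuristic still splits after abbreviations such as "e.g." when a space follows, so it is a rough tool rather than a full sentence segmenter.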

### Step 2:
We manually annotate the extracted key sentences to identify the information they contain.


Keywords:
Incubation period, asymptomatic, mask, death rate, paracetamol


### Annotation
To understand the answer to each question, we annotate the stance of the results, i.e. whether the abstract is for or against the statement.

To evaluate the search system, we annotate the relevance of the retrieved results. Please refer to each section for the annotation guideline.

Retrieved results and annotations are in this document 
https://docs.google.com/spreadsheets/d/1-eWEqji7mLXNF0Z9KH8RE5djcxK-97dUHzPWY7GEhI8/edit?usp=sharing

The document contains:

1. annotation of stance:
sheets: mask, incubation, asymtomatic, seasonality; column 'stance'

2. annotation of relevance:
sheets: mask, incubation, asymtomatic, seasonality; column 'relevance'

3. annotation for system evaluation:
sheet: system_eval_varname; column 'relevance'


### Evaluation of the system:
To evaluate the system, we first use a keyword-only approach to extract the abstracts that contain the keywords. We then manually annotate whether the extracted abstracts are relevant to the question asked. We compute the precision and recall of our system based on this annotation.
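
A minimal sketch of the precision and recall computation is given below. It assumes that the set of relevant documents is taken from the annotated keyword baseline, and the ids are hypothetical; the function name `precision_recall` does not appear elsewhere in this notebook.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved id set against annotated relevant ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# hypothetical ids: relevant_ids stands for baseline abstracts annotated as relevant
system_ids = ['d1', 'd2', 'd3', 'd4']
relevant_ids = ['d1', 'd2', 'd5']
p, r = precision_recall(system_ids, relevant_ids)
```

Here two of the four retrieved documents are relevant and two of the three relevant documents are retrieved.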

## Search System 

In [2]:
import pandas as pd 
from collections import defaultdict
import string
from gensim.models import CoherenceModel
import gensim
from pprint import pprint
import spacy,en_core_web_sm
from nltk.stem import PorterStemmer
import os
import json
from gensim.models import Word2Vec
import nltk
import re
import collections

### Read metadata into dictionary format

In [4]:
class MetaData:
    def __init__(self):
        """Define variables."""
        # path and data
        self.path = '/afs/inf.ed.ac.uk/user/s16/s1690903/share/cov19_2/'
        self.meta_data = pd.read_csv(self.path + 'metadata.csv')

    def data_dict(self):
        """Convert df to dictionary. """
        mydict = lambda: defaultdict(mydict)
        meta_data_dict = mydict()

        for cord_uid, abstract, title, sha in zip(self.meta_data['cord_uid'], self.meta_data['abstract'], self.meta_data['title'], self.meta_data['sha']):
            meta_data_dict[cord_uid]['title'] = title
            meta_data_dict[cord_uid]['abstract'] = abstract
            meta_data_dict[cord_uid]['sha'] = sha

        return meta_data_dict


### Extract documents that contain keywords, with preprocessing

In [235]:
class ExtractText:
    """Extract text according to keywords or phrases"""

    def __init__(self, metaDict, keyword, variable):
        """Define variables."""
        self.path = '/afs/inf.ed.ac.uk/user/s16/s1690903/share/cov19_2/'
        self.metadata = metaDict
        self.keyword = keyword
        self.variable = variable


    def simple_preprocess(self):
        """Simple text process: lower case, remove punc. """
        mydict = lambda: defaultdict(mydict)
        cleaned = mydict()
        for k, v in self.metadata.items():
            sent = v[self.variable]
            sent = str(sent).lower().translate(str.maketrans('', '', string.punctuation))
            cleaned[k]['processed_text'] = sent
            cleaned[k]['sha'] = v['sha']
            cleaned[k]['title'] = v['title']

        return cleaned

    def very_simple_preprocess(self):
        """Minimal text process: cast to string, keep case and punctuation. """
        mydict = lambda: defaultdict(mydict)
        cleaned = mydict()
        for k, v in self.metadata.items():
            sent = v[self.variable]
            sent = str(sent)
            #sent = str(sent).lower()
            cleaned[k]['processed_text'] = sent
            cleaned[k]['sha'] = v['sha']
            cleaned[k]['title'] = v['title']

        return cleaned
     

    def extract_w_keywords(self):
        """Select content with keywords."""
        ps = PorterStemmer()
        mydict = lambda: defaultdict(mydict)
        selected = mydict()
        textdict = self.simple_preprocess()
        
        for k, v in textdict.items():
            if self.keyword in v['processed_text'].split():
                #print(ps.stem(str(self.keyword)))
                selected[k]['processed_text'] = v['processed_text']
                selected[k]['sha'] = v['sha']
                selected[k]['title'] = v['title']
        return selected

    def extract_w_keywords_punc(self):
        """Select content with keywords, with punctuations in text"""
        ps = PorterStemmer()
        mydict = lambda: defaultdict(mydict)
        selected = mydict()
        textdict = self.very_simple_preprocess()
        
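        for k, v in textdict.items():
            # stem() is applied to the string form of the whole token list, so this is
            # approximately a case-insensitive substring match on the stemmed keyword
            if ps.stem(str(self.keyword)) in ps.stem(str(v['processed_text'].split())):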
                selected[k]['processed_text'] = v['processed_text']
                selected[k]['sha'] = v['sha']
                selected[k]['title'] = v['title']
        return selected

    def get_noun_verb(self, text):
        """Get nouns, stemmed verbs and adjectives for the LDA model.
        Change the noun and verb parts to decide what
        you want to use as input for LDA."""
        ps = PorterStemmer()
      
        # find nouns, verbs and adjectives
        nlp = en_core_web_sm.load()
        all_extracted = {}
        for k, v in text.items():
            #v = v.replace('incubation period', 'incubation_period')
            doc = nlp(v)
            # compare POS tags with ==; `is` tests identity and is unreliable for strings
            nouns = ' '.join(str(tok) for tok in doc if tok.pos_ == 'NOUN').split()
            verbs = ' '.join(ps.stem(str(tok)) for tok in doc if tok.pos_ == 'VERB').split()
            adj = ' '.join(str(tok) for tok in doc if tok.pos_ == 'ADJ').split()
            all_w = nouns + verbs + adj
            all_extracted[k] = all_w
      
        return all_extracted

    def get_noun_verb2(self, text):
        """Get stemmed nouns, stemmed verbs and adjectives for the LDA model.
        Change the noun and verb parts to decide what
        you want to use as input for LDA."""
        ps = PorterStemmer()
      
        # find nouns, verbs and adjectives
        nlp = en_core_web_sm.load()
        all_extracted = {}
        for k, v in text.items():
            #v = v.replace('incubation period', 'incubation_period')
            doc = nlp(v['processed_text'])
            # compare POS tags with ==; `is` tests identity and is unreliable for strings
            nouns = ' '.join(ps.stem(str(tok)) for tok in doc if tok.pos_ == 'NOUN').split()
            verbs = ' '.join(ps.stem(str(tok)) for tok in doc if tok.pos_ == 'VERB').split()
            adj = ' '.join(str(tok) for tok in doc if tok.pos_ == 'ADJ').split()
            all_w = nouns + verbs + adj
            all_extracted[k] = all_w
      
        return all_extracted

    def tokenization(self, text):
        """Tokenize each text with spaCy and return
        the list of token strings per document,
        to use as input for the next step"""
        nlp = spacy.load("en_core_web_sm")

        all_extracted = {}
        for k, v in text.items():
            doc = nlp(v)
            all_extracted[k] = [w.text for w in doc]
      
        return all_extracted



## Using LDA to rank documents
The LDA hyperparameters (alpha and eta) are tuned using the u_mass coherence score.
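
The grid search over alpha and eta can be illustrated without rerunning LDA. The sketch below reuses the 25 u_mass scores printed by the 'mask' run later in this notebook (rounded to three decimals, in loop order with alpha as the outer loop) and shows which pair each selection rule picks. u_mass values closer to zero are generally read as more coherent, so `max` is the usual choice; the variable names are assumptions for the example.

```python
from itertools import product

alphas = [0.1, 0.3, 0.5, 0.7, 0.9]
etas = [0.1, 0.3, 0.5, 0.7, 0.9]

# u_mass scores from the 'mask' run, rounded; alpha is the outer loop, eta the inner
mask_scores = [
    -4.337, -3.901, -4.389, -4.585, -4.659,
    -4.182, -3.823, -4.116, -4.406, -4.894,
    -3.763, -4.055, -3.975, -3.605, -3.218,
    -3.923, -4.570, -2.954, -3.278, -3.631,
    -3.813, -3.766, -3.472, -3.070, -2.990,
]

# map each (alpha, eta) pair to its coherence score
grid = dict(zip(product(alphas, etas), mask_scores))

most_coherent = max(grid, key=grid.get)   # score closest to zero
least_coherent = min(grid, key=grid.get)  # most negative score
```

In this run the most coherent pair is alpha 0.7 and eta 0.5, while the most negative score belongs to alpha 0.3 and eta 0.9.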

In [219]:
class LDATopic:
    def __init__(self, processed_text, topic_num, alpha, eta):
        """Define variables."""
        self.path = '/afs/inf.ed.ac.uk/user/s16/s1690903/share/cov19_2/'
        self.text = processed_text
        self.topic_num = topic_num
        self.alpha = alpha
        self.eta = eta

    def get_lda_score_eval(self, dictionary, bow_corpus):
        """LDA model and coherence score."""

        lda_model = gensim.models.ldamodel.LdaModel(bow_corpus, num_topics=self.topic_num, id2word=dictionary, passes=10,  update_every=1, random_state = 300, alpha=self.alpha, eta=self.eta)
        #pprint(lda_model.print_topics())

        # get coherence score
        cm = CoherenceModel(model=lda_model, corpus=bow_corpus, coherence='u_mass')
        coherence = cm.get_coherence()
        print('coherence score is {}'.format(coherence))

        return lda_model, coherence

    def get_score_dict(self, bow_corpus, lda_model_object):
        """
        get lda score for each document
        """
        all_lda_score = {}
        for i in range(len(bow_corpus)):
            # topic id -> probability for document i
            lda_score = dict(lda_model_object[bow_corpus[i]])
            # order by topic id; built outside any inner loop so it is always defined
            all_lda_score[i] = collections.OrderedDict(sorted(lda_score.items()))
        return all_lda_score


    def topic_modeling(self):
        """Get LDA topic modeling."""
        # generate dictionary
        dictionary = gensim.corpora.Dictionary(self.text.values())
        bow_corpus = [dictionary.doc2bow(doc) for doc in self.text.values()]
        # modeling
        model, coherence = self.get_lda_score_eval(dictionary, bow_corpus)

        lda_score_all = self.get_score_dict(bow_corpus, model)

        all_lda_score_df = pd.DataFrame.from_dict(lda_score_all)
        all_lda_score_dfT = all_lda_score_df.T
        all_lda_score_dfT = all_lda_score_dfT.fillna(0)

        return model, coherence, all_lda_score_dfT

    def get_ids_from_selected(self, text):
        """Get unique id from text """
        id_l = []
        for k, v in text.items():
            id_l.append(k)
            
        return id_l


## Select document (abstract/ article body) according to search result

In [218]:
class MatchArticleBody:
    def __init__(self, path, selected_id):
        """Define variables."""
        self.path = path
        self.selected_id = selected_id


    def read_folder(self):
        """
        Creates a dictionary that maps each file id under rootdir to its parsed JSON content
        """
        rootdir = self.path.rstrip(os.sep)

        article_dict = {}
        for path, dirs, files in os.walk(rootdir):
            for fname in files:
                file_id = fname.split('.')[0]
                #print(file_id)
                try:
                    # load json file according to id
                    with open(os.path.join(path, fname)) as fh:
                        data = json.load(fh)
                except (OSError, ValueError):
                    # skip files that cannot be read or parsed as JSON
                    continue
                article_dict[file_id] = data

        return article_dict


    def extract_bodytext(self):
        """Unpack nested dictionary and extract body of the article"""
        body = {}
        article_dict = self.read_folder()
        for k, v in article_dict.items():
            # concatenate every paragraph of the body, including the last one
            body[k] = ''.join(entry['text'] for entry in v['body_text'])
        return body


    def get_title_by_bodykv(self, article_dict, keyword):
        """Search keyword in article body and return title"""

        article_dict = self.read_folder()
        selected_id = self.extract_id_list()

        result = {}
        for k, v in article_dict.items():
            for entry in v['body_text']:
                if (keyword in entry['text'].split()) and (k in selected_id):
                    result[k] = v['metadata']['title']

        return result


    def extract_id_list(self):
        """Extract ids from the selected text. """
        selected_id = []
        for k, v in self.selected_id.items():
            # 'sha' may hold several ids separated by ';', so keep all of them
            selected_id.extend(s.strip() for s in str(v['sha']).split(';'))

        return selected_id


    def select_text_w_id(self):
        body_text = self.extract_bodytext()
        selected_id = self.extract_id_list()
        selected_text = {}
        for k, v in body_text.items():
            if k in selected_id:
                selected_text[k] = v
        return selected_text

In [217]:
# Now we extract the articles that contain the most relevant topic

def selected_best_LDA(keyword, varname):
        """Select the best lda model with extracted text """
        # convert data to dictionary format
        m = MetaData()
        metaDict = m.data_dict()

        #process text and extract text with keywords
        et = ExtractText(metaDict, keyword, varname)
        text1 = et.extract_w_keywords()


        # extract nouns, verbs and adjectives
        text = et.get_noun_verb2(text1)

        # candidate values for alpha and beta (beta is passed to LDA as eta)
        alpha = [0.1, 0.3, 0.5, 0.7, 0.9]
        beta = [0.1, 0.3, 0.5, 0.7, 0.9]

        mydict = lambda: defaultdict(mydict)
        cohere_dict = mydict()
        for a in alpha:
            for b in beta:
                lda = LDATopic(text, 20, a, b)
                model, coherence, scores = lda.topic_modeling()
                cohere_dict[coherence]['a'] = a
                cohere_dict[coherence]['b'] = b
    
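        # sort coherence scores in ascending order to pick a, b
        # NOTE: index [0] is the lowest (most negative) u_mass score; u_mass values
        # closer to zero usually indicate more coherent topics, so use
        # max(cohere_dict.keys()) to select the most coherent model instead
        sort = sorted(cohere_dict.keys())[0]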
        a = cohere_dict[sort]['a']
        b = cohere_dict[sort]['b']
        
        # run LDA with the optimized values
        lda = LDATopic(text, 20, a, b)
        model, coherence, scores_best = lda.topic_modeling()
        pprint(model.print_topics())

        # attach the cord_uid ids to the LDA topic scores
        id_l = lda.get_ids_from_selected(text)
        scores_best['cord_uid'] = id_l

        return scores_best




def select_text_from_LDA_results(keyword, varname, scores_best, topic_num):
        # choose papers with the most relevant topic
        # convert data to dictionary format
        m = MetaData()
        metaDict = m.data_dict()

        # process text and extract text with keywords
        et = ExtractText(metaDict, keyword, varname)
        # extract text together with punctuation
        text1 = et.extract_w_keywords_punc()
        # need to decide which topic to choose after training
        sel = scores_best[scores_best[topic_num] > 0] 
        

        mydict = lambda: defaultdict(mydict)
        selected = mydict()
        for k, v in text1.items():
            if k in sel.cord_uid.tolist():
                selected[k]['title'] = v['title']
                selected[k]['processed_text'] = v['processed_text']
                selected[k]['sha'] = v['sha']
    
        return selected

def extract_relevant_sentences(cor_dict, search_keywords):
    """Extract sentences that contain a keyword from the relevant articles. """

    mydict = lambda: defaultdict(mydict)
    sel_sentence = mydict()
    
    for k, v in cor_dict.items():
        keyword_sentence = []
        sentences = v['processed_text'].split('.')
        for sentence in sentences:
            # for each sentence, check if keyword exist
            # append sentences contain keyword to list
            keyword_sum = sum(1 for word in search_keywords if word in sentence)
            if keyword_sum > 0:
                keyword_sentence.append(sentence)         

        # store results
        if not keyword_sentence:
            pass
        else:
            sel_sentence[k]['sentences'] = keyword_sentence
            sel_sentence[k]['sha'] = v['sha']
            sel_sentence[k]['title'] = v['title']
    print('{} articles are relevant to the topic you choose'.format(len(sel_sentence)))

    path = '/afs/inf.ed.ac.uk/user/s16/s1690903/share/cov19_2/'
    df = pd.DataFrame.from_dict(sel_sentence, orient='index')
    df.to_csv(path + 'search_results_{}.csv'.format(search_keywords))
    sel_sentence_df = pd.read_csv(path + 'search_results_{}.csv'.format(search_keywords))
    return sel_sentence, sel_sentence_df

def extract_relevant_sentences2(cor_dict, search_keywords):
    """Extract sentences that contain a keyword from the keyword-matched articles (evaluation baseline). """

    mydict = lambda: defaultdict(mydict)
    sel_sentence = mydict()
    
    for k, v in cor_dict.items():
        keyword_sentence = []
        sentences = v['processed_text'].split('.')
        for sentence in sentences:
            # for each sentence, check if keyword exist
            # append sentences contain keyword to list
            keyword_sum = sum(1 for word in search_keywords if word in sentence)
            if keyword_sum > 0:
                keyword_sentence.append(sentence)         

        # store results
        if not keyword_sentence:
            pass
        else:
            sel_sentence[k]['sentences'] = keyword_sentence
            sel_sentence[k]['sha'] = v['sha']
            sel_sentence[k]['title'] = v['title']
    print('{} articles contain keyword {}'.format(len(sel_sentence),  search_keywords))

    path = '/afs/inf.ed.ac.uk/user/s16/s1690903/share/cov19_2/eval/'
    df = pd.DataFrame.from_dict(sel_sentence, orient='index')
    df.to_csv(path + 'eval_results_{}.csv'.format(search_keywords))
    sel_sentence_df = pd.read_csv(path + 'eval_results_{}.csv'.format(search_keywords))
    return sel_sentence, sel_sentence_df


def evaluation(keyword, varname, search_keywords):
        #process text and extract text with keywords
        m = MetaData()
        metaDict = m.data_dict()
        et = ExtractText(metaDict, keyword, varname)
        text1 = et.extract_w_keywords_punc()
        
        sel_sentence, sel_sentence_df = extract_relevant_sentences2(text1, search_keywords)

        

## Question 1: Is wearing a mask an effective way to control the pandemic?

In [236]:
# here we select the LDA model with the lowest coherence score
scores_best_mask = selected_best_LDA('mask', 'abstract')

coherence score is -4.336722639525242
coherence score is -3.9010410239440367
coherence score is -4.389157036678158
coherence score is -4.585109650895275
coherence score is -4.658858479495897
coherence score is -4.181983373771045
coherence score is -3.8225572164182773
coherence score is -4.115948791902841
coherence score is -4.405996653207033
coherence score is -4.893575872869723
coherence score is -3.763406611071013
coherence score is -4.054636612549744
coherence score is -3.974976478023649
coherence score is -3.605098511028769
coherence score is -3.218469889363284
coherence score is -3.922771443759005
coherence score is -4.56961216145051
coherence score is -2.9539869626024027
coherence score is -3.2776086868166603
coherence score is -3.6310570980589993
coherence score is -3.8129640801122164
coherence score is -3.7657510362336026
coherence score is -3.4720471247415454
coherence score is -3.0704332896336615
coherence score is -2.990024667297296
coherence score is -4.893575872869723
[(0,

In [237]:
scores_best_mask.shape

(170, 21)

We observe that topic No. 1 is the most relevant to the public wearing masks.
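
The selection rule used in the next cell keeps every abstract whose score for the chosen topic is greater than zero. As far as I am aware, gensim omits topics below its `minimum_probability` setting (0.01 by default) from a document's topic list, so missing entries become 0 in the score table. The sketch below mirrors that rule on hypothetical weights and adds an optional stricter threshold; `docs_for_topic` and `min_weight` are names invented for the example.

```python
def docs_for_topic(doc_topics, topic_id, min_weight=0.0):
    """Return ids of documents whose weight for `topic_id` exceeds `min_weight`."""
    return [doc_id for doc_id, weights in doc_topics.items()
            if weights.get(topic_id, 0.0) > min_weight]


# hypothetical per-document topic weights (topic id -> weight)
toy_doc_topics = {
    'doc_a': {1: 0.62, 4: 0.35},
    'doc_b': {0: 0.97},
    'doc_c': {1: 0.05, 7: 0.93},
}
selected = docs_for_topic(toy_doc_topics, topic_id=1)
strict = docs_for_topic(toy_doc_topics, topic_id=1, min_weight=0.5)
```

With a zero threshold an abstract that touches the topic only marginally is still kept, which may explain some of the irrelevant results found during annotation.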

In [238]:
# topic number 1 is most relevant to public wearing mask
# which topic do you think is most relevant to your search
cor_dict_mask = select_text_from_LDA_results('mask', 'abstract', scores_best_mask, 1)
print ("There are {} abstracts selected". format(len(cor_dict_mask)))

There are 40 abstracts selected


In [239]:
# extract relevant sentences  #search keywords can be a list
sel_sentence_mask, sel_sentence_df_mask = extract_relevant_sentences(cor_dict_mask, ['mask'])

40 articles are relevant to the topic you choose


In [240]:
#read extracted article
sel_sentence_df_mask.head(10)

Unnamed: 0.1,Unnamed: 0,sentences,sha,title
0,8o3l3rsf,"[', escalatory quarantine, mask wearing when g...",,Effectiveness of control strategies for Corona...
1,1mu1z4xd,[' Wearing a mask when going out and avoiding ...,5bb89950ec5a06e2b7f69b2a9c4213dda19b1ab0,Prediction of New Coronavirus Infection Based ...
2,kkpaovhh,"[' For symptomatic, unconfirmed patients, doct...",,Covid-19: What’s the current advice for UK doc...
3,ht88wu6s,[' CONCLUSION: To early end of the COVID-19 ep...,,Estimating the reproductive number and the out...
4,le0ogx1s,"[""The army of the men of death, in John Bunyan...",,A new recruit for the army of the men of death
5,nzh87aux,"[' On the other hand, the model predicts that ...",9b7a0ad7b6c7f59e7a6cf1dc9d07912a273d19b5,The Waiting Time for Inter-Country Spread of P...
6,n2r4pzan,"[', wearing face mask in public venues (73', '...",b7c8e73cf095e30552a32cea04a398331c55ab41,Anticipated and current preventive behaviors i...
7,ywb9krdp,"['2%), and wear a face mask (59']",16627f4c7134394da448b1417a771d13ad7cca4a,Pandemic influenza in Australia: Using telepho...
8,bhnh2dq4,[' If an infected person will not use a mask a...,bb9f6cef633c9baf595daae5166b11f88c1271cb,Risk of transmission of airborne infection dur...
9,49xvz389,['3%) were carrying out one of prevention meas...,545def8771357b4cb2875f5795a0760e97534cc9,Knowledge and attitudes of university students...


### Annotation guideline for question 1
We extracted 33 papers that are expected to discuss whether using masks is useful. We annotate whether the key sentences suggest that using a mask can reduce the risk of infection.

#### Stance Annotation 
* '1' sentences that support the view that using a mask during a pandemic is useful
* '2' papers that assume masks are useful and examine the public's willingness to comply with the rules
* '0' no obvious evidence that using a mask is protective, or the protection is very small
* '3' not relevant to the above stances

#### relevance annotation
* '1' the result is relevant to the question  
* '0' the result is not relevant to the question
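
Once the stance column is annotated, the labels can be tallied with a short helper. This is a sketch on made-up labels, not the real spreadsheet column; `STANCE_LABELS` and `tally_stance` are hypothetical names, and the label meanings follow the stance guideline above.

```python
from collections import Counter

# label meanings follow the stance guideline for question 1
STANCE_LABELS = {
    1: 'supports mask use',
    2: 'assumes masks are useful, studies compliance',
    0: 'no or little evidence of protection',
    3: 'not relevant',
}


def tally_stance(labels):
    """Count annotated stance labels and report them by meaning."""
    counts = Counter(labels)
    return {STANCE_LABELS[label]: counts.get(label, 0) for label in STANCE_LABELS}


# hypothetical annotations, invented for illustration
toy_labels = [1, 1, 2, 0, 3, 2, 1]
summary = tally_stance(toy_labels)
```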

In [None]:
# here we need to add the stats analysis
path = '/afs/inf.ed.ac.uk/user/s16/s1690903/share/cov19_2/annotation/'
# TODO: add the annotation file name, then uncomment
# annotation_mask = pd.read_csv(path + ...)

## Results
According to the key sentences in the 33 abstracts that discuss the public use of masks:
* Only one paper suggests that there is not enough evidence to show that masks are useful.
* 14 papers report results showing that using surgical masks during a pandemic is effective in reducing infection.
* 14 papers treat mask use by the public as necessary for reducing the risk of infection and examine whether the public is willing to comply with the rules. (X papers are from Hong Kong, based on the region of the first author.)
* 5 papers are not relevant to the topic.

Conclusion:
Governments in some regions advocate masks as a standard approach to reducing the risk of infection, and papers from these regions focus on whether people comply with the rules. Although some governments state that there is little evidence that masks are effective in controlling the pandemic, nearly half of the papers in our search results treat mask wearing as a standard practice that the public should comply with, and nearly half report evidence that wearing masks is effective in controlling the pandemic.


## Question 2: How long is the incubation period? In some regions (e.g. China), there is a rumour circulating that the incubation period is longer than 14 days

### Annotation guideline for question 2:

#### stance annotation
Here we want to identify papers that report a result that aligns with the incubation period reported by governments.
UK government advice: 2-14 days, mean 5
* '1' same as the government advice
* '0' different from the government advice
*  Not relevant to the question 

#### relevance annotation
* '1' the result is relevant to the question  
* '2' the result is not relevant to the question

In [244]:
scores_best_incu = selected_best_LDA('incubation', 'abstract')

coherence score is -4.402954777107732
coherence score is -7.622641764950241
coherence score is -9.667138416026404
coherence score is -8.840829560303707
coherence score is -7.75012700862416
coherence score is -4.690846825022652
coherence score is -7.537936150644841
coherence score is -8.204060613000161
coherence score is -7.780107529965858
coherence score is -7.243947472108755
coherence score is -4.860236911790262
coherence score is -6.818539222229658
coherence score is -8.020048375633895
coherence score is -6.931077713393968
coherence score is -6.426623602144103
coherence score is -4.8710456253984065
coherence score is -6.629118614124333
coherence score is -5.956842964386887
coherence score is -5.355865377504925
coherence score is -4.764257961119005
coherence score is -4.585814626682113
coherence score is -6.216045484929411
coherence score is -5.112704786045706
coherence score is -4.641328661092419
coherence score is -3.607818546962199
coherence score is -9.667138416026404
[(0,
  '0.02

In [246]:
# topic number 0 is most relevant to the incubation period
# which topic do you think is most relevant to your search
cor_dict_incu = select_text_from_LDA_results('incubation', 'abstract', scores_best_incu, 0)
print ("There are {} abstracts selected". format(len(cor_dict_incu)))

There are 213 abstracts selected


In [259]:
# extract relevant sentences  #search keywords can be a list
sel_sentence_incu, sel_sentence_df_incu = extract_relevant_sentences(cor_dict_incu, ['day'])

124 articles are relevant to the topic you choose


In [260]:
#read extracted article
sel_sentence_df_incu.head(10)

Unnamed: 0.1,Unnamed: 0,sentences,sha,title
0,h89scli5,"[' For monitored individuals, we identified un...",b5161b031c7f720562e94735a018d1c3c8be3ae5,Quantifying the Risk and Cost of Active Monito...
1,ykofrn9i,[' Our results show that the incubation period...,cbc05d14c57b91081970a232ab83bc993f998fe2,Incubation Period and Other Epidemiological Ch...
2,u8goc7io,"['7, 95% CI) days, ranging from 2', '1 days (2']",12fac9aedb1a09a3922a3c084ce4723708e463d6,The incubation period of 2019-nCoV infections ...
3,vspnuxz9,"['0 days (95% credible interval [CrI]: 3', '6 ...",a1bff76ce360e8990b0a4ee2a5228a6e6e63d9c1,Serial interval of novel coronavirus (2019-nCo...
4,ra3t6kmm,"['9 days (95% credible interval [CrI], 2 days-...",c85f571a674c7fed0ccb9176e9cf9f3d3659ca32,Analysis of the epidemic growth of the early 2...
5,tovfd9lw,"['0 days (range, 0 to 24', '0 days)']",dfb0fedbeed56bd2b795a67faab28295afc14c96,Clinical characteristics of 2019 novel coronav...
6,45g12waw,"['4 days, and the R0 value is likely to be bet...",36a5f6d55d7c5f67d4344e36da0a72856ad3dda0,"The Novel Coronavirus, 2019-nCoV, is Highly Co..."
7,rcbw54xc,"['5days', ' Cumulation number of patients at t...",57e01ad2a4961cd5cc6a3733f5f8c013a8946f3c,A model simulation study on effects of interve...
8,fmymklz6,[' Incubation time ranged from one to twenty d...,f3ff1ecae96700f41b83d2a034a3a959428388b0,The cross-sectional study of hospitalized coro...
9,dbzrd23n,['6) days and the mean onset-admission interva...,eb8ac60527db35b10881cb4fd86b8a6e21983d02,A descriptive study of the impact of diseases ...


## Question 3: Are asymptomatic patients infectious?


### Annotation guideline for question 3:
Here we want to identify whether asymptomatic cases contribute to the spread of the virus

#### stance annotation
* '1' there is clear evidence that asymptomatic cases contribute to the spread of the virus
* '0' it is unlikely that asymptomatic cases contribute to the spread of the virus
* '3' Not relevant to the question

#### relevance annotation
* '1' the result is relevant to the question  
* '0' the result is not relevant to the question

In [249]:
scores_best_asym = selected_best_LDA('asymptomatic', 'abstract')

coherence score is -2.333875362764049
coherence score is -4.70447389062267
coherence score is -6.939599491635526
coherence score is -8.440505363564451
coherence score is -9.591134816334584
coherence score is -2.2594033960424147
coherence score is -5.105876829676597
coherence score is -6.9371475741261035
coherence score is -8.606789439278637
coherence score is -9.453634935286304
coherence score is -2.638911020212841
coherence score is -4.846954903243043
coherence score is -6.8515540183404955
coherence score is -8.924401332546264
coherence score is -8.268359620682485
coherence score is -2.4049378700075303
coherence score is -4.651142459964543
coherence score is -6.56690398681833
coherence score is -7.81408598379605
coherence score is -5.9898250531948944
coherence score is -2.334124552642998
coherence score is -4.078897796588015
coherence score is -6.8391754668949485
coherence score is -6.1568698549644205
coherence score is -4.795885092862488
coherence score is -9.591134816334584
[(0,
  '

In [252]:
# topic number 19 is most relevant to asymptomatic transmission
# which topic do you think is most relevant to your search
cor_dict_asym = select_text_from_LDA_results('asymptomatic', 'abstract', scores_best_asym, 19)
print ("There are {} abstracts selected". format(len(cor_dict_asym)))

There are 404 abstracts selected


In [253]:
# extract relevant sentences  #search keywords can be a list
sel_sentence_asym, sel_sentence_df_asym = extract_relevant_sentences(cor_dict_asym, ['transmission'])

143 articles are relevant to the topic you choose


In [254]:
sel_sentence_df_asym.tail(10)

Unnamed: 0.1,Unnamed: 0,sentences,sha,title
133,3w63yt7f,"[' Of the 28 cases, 16 were index cases import...",,Early Epidemiological and Clinical Characteris...
134,ue6e3ua3,"[' Dromedary camels, hosts for MERS-CoV, are i...",72076bc07694d7ba7e9fd2adfcb10b11fde1c9ba; 76b7...,Middle East respiratory syndrome
135,1nhlu89c,"[' However, the recent report on asymptomatic ...",,Coronavirus disease-2019: is fever an adequate...
136,kwq2y3il,"[' Therefore, there is still a theoretical ris...",a9a4101b25236a4fc0e14a9cbdd904ca8b2baffd,Coronavirus Disease 2019: Coronaviruses and Bl...
137,6kuh4njb,[' Conclusion Being able to protect healthcare...,5ec1bf2fc5d286672feb316e70accdd302d7ed50,MERS-CoV infection among healthcare workers an...
138,k3f7ohzg,[' The measures to prevent transmission was ve...,14dbf1c01f2c422c1aefee32f094cc524ea03af1,Characteristics of COVID-19 infection in Beijing
139,kiq6xb6k,[' Interpretation Person-to-person transmissio...,ad0e9c151402df00786e0aa6dd30987004966deb,First known person-to-person transmission of s...
140,626ch774,['We simulated 100 2019-nCoV infected travelle...,09e25e413faba97b87efc701d1ab8d2a18386efb; 4e55...,Effectiveness of airport screening at detectin...
141,pth2d40p,"[' In addition, nosocomial infection of hospit...",89a8918f7e3044b89642aaa74defc7381abef482; 1f5c...,"Asymptomatic carrier state, acute respiratory ..."
142,hfkzu18p,[' Here we highlight nine most important resea...,,SARS-CoV-2 and COVID-19: The most important re...


## Question 4: Will the virus disappear in the summer? 

### Annotation guideline for question 4
* '1' the result is relevant to the question  
* '0' the result is not relevant to the question

In [255]:
scores_best_sea = selected_best_LDA('seasonality', 'abstract')

coherence score is -3.0018444354246148
coherence score is -3.7713565323755445
coherence score is -4.853842927051636
coherence score is -5.115316283418001
coherence score is -5.275887027502832
coherence score is -2.7310261720813687
coherence score is -3.365557896859708
coherence score is -3.823341883449962
coherence score is -4.286592442380965
coherence score is -4.657320289835465
coherence score is -2.686902420165036
coherence score is -3.455505499881666
coherence score is -3.4104528813580304
coherence score is -3.9792818874763447
coherence score is -4.118397656468215
coherence score is -2.7034845774008556
coherence score is -3.4266148606542437
coherence score is -3.2756509301082404
coherence score is -3.5675747343801505
coherence score is -3.359507266843969
coherence score is -2.7106744166911487
coherence score is -2.9796069633608218
coherence score is -2.9585235165206862
coherence score is -3.077832723977046
coherence score is -2.9506491538081954
coherence score is -5.275887027502832

In [268]:
# topic number 0 is most relevant to seasonality
# which topic do you think is most relevant to your search
cor_dict_sea = select_text_from_LDA_results('season', 'abstract', scores_best_sea, 0)
print ("There are {} abstracts selected". format(len(cor_dict_sea)))

There are 130 abstracts selected


In [269]:
# extract relevant sentences  #search keywords can be a list
sel_sentence_sea , sel_sentence_df_sea  = extract_relevant_sentences(cor_dict_sea, ['summer'])

21 articles are relevant to the topic you choose


In [172]:
sel_sentence_df_sea.tail(10)

Unnamed: 0.1,Unnamed: 0,sentences,sha,title
11,rgrp73ca,"[' we found that, whilst summer influenza epid...",,Increasing similarity in the dynamics of influ...
12,lxff5c9i,[' the results confirm that school term versus...,,Parameterizing state–space models for infectio...
13,gkia3rx4,"['2 later in the summer, suggesting changes in...",384d87ada46fa690603de7cfbe1286e1d99d6fc9,The effective reproduction number of pandemic ...
14,5ji6512w,[' birth weight conditional on gestation lengt...,,Within-mother analysis of seasonal patterns in...
15,whtqlu1y,"[' in seasonal analysis, human and bovine viru...",742e4f9080c4ff3a0211c8ef20dfb8594b911f69,"Hydrologic, land cover, and seasonal patterns ..."
16,t8vmow9s,[' although overall rates of respiratory illne...,2219620e84342b84e1ac0b8cd1e7be4703dd5799,The seasonality of rhinovirus infections and i...
17,c83xyev5,[' although increased detection of human enter...,d55fd7533a77803778e93b4a2fcd13076f622585,Respiratory viruses are continuously detected ...
18,zg17f7bd,"['abstract influenza a and b, and many unrelat...",68ac63121bded12e1db3178a0c9b050154f81ab8,Seasonality and selective trends in viral acut...
19,6vbhgwsi,"[' hpiv-3 was detected at varying levels, but ...",cfeda05ff998d6973560c4f55577a5feaeccb59b,A molecular epidemiological study of human par...
20,s5c0grz0,[' bacteria were commoner in spring and summer...,e52b14466f05891d49a5ebcdb53b4381f82a0b36,Multiplex PCR reveals that viruses are more fr...


## Extract keyword search entries and annotate the data for evaluation

In this part, we search the data using keywords only. This keyword search serves as the baseline for our model.

In [229]:
evaluation('seasonality', 'abstract', ['summer'])
evaluation('mask', 'abstract', ['mask'])
evaluation('incubation', 'abstract', ['incubation','day'])
evaluation('asymptomatic', 'abstract', ['transmission'])

105 articles contain keyword ['summer']
349 articles contain keyword ['mask']
468 articles contain keyword ['incubation', 'day']
151 articles contain keyword ['transmission']
