## Project Design

The knowledge gap between the public and specialists, together with scientific uncertainty, is an important driver of pandemic anxiety. In this task, we examine papers that discuss some of the controversial topics that contribute to rumours and anxiety.

Here I develop a search system that extracts sentences relevant to a question from paper abstracts. The questions concern rumours and uncertain information circulating among the public. Different questions can be tried here. If this work is to be developed into a paper, an important step is to evaluate the search system against a human-annotation baseline.

Step 1:
The search system first extracts the abstracts that contain a keyword (e.g. ‘mask’), then uses LDA to group these abstracts into topics. We identify the topic that is most relevant to the question and keep the abstracts in which that topic is present. The system then extracts the sentences that contain the keyword from these relevant abstracts. The standard approach in a search system is to rank documents with TF-IDF; here we instead apply LDA topic modelling to the nouns, verbs and adjectives of each abstract. Users can judge which topic is relevant by inspecting the most frequent keywords in each topic.
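
For comparison, the TF-IDF baseline mentioned above can be sketched in a few lines. This is a minimal, dependency-free illustration and not part of the system described here; the toy abstracts, the `tfidf_rank` name and the smoothed IDF formula are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_rank(docs, query):
    """Rank documents by the summed TF-IDF weight of the query terms."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term]:
                # smoothed inverse document frequency (an assumed variant)
                idf = math.log((1 + n_docs) / (1 + df[term])) + 1
                score += (tf[term] / len(tokens)) * idf
        scores[doc_id] = score
    # highest score first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy abstracts, for illustration only
toy_docs = {
    'a': 'wearing a mask reduces transmission',
    'b': 'incubation period of the virus',
    'c': 'mask mask compliance in public venues',
}
ranking = tfidf_rank(toy_docs, 'mask')
print([doc_id for doc_id, _ in ranking])
```

TF-IDF rewards documents that repeat the query term, whereas the LDA step groups abstracts by theme, which is why the two can return different sets of abstracts for the same keyword.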

Step 2:
We manually annotate the key sentences to identify the information they contain.

Keywords:
Incubation period, asymptomatic, mask, death rate, paracetamol
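
Two of the keywords above ('incubation period' and 'death rate') are multi-word phrases, and a membership test on whitespace-split tokens cannot match them. One possible way to handle this, sketched here under the assumption that whole-word matching is what is wanted, is a regular expression with word boundaries. The `contains_keyword` helper and the example sentence are hypothetical.

```python
import re


def contains_keyword(text, keyword):
    """Return True if `keyword` (one or more words) occurs in `text` as whole words."""
    # allow any run of whitespace between the words of a multi-word keyword
    words = [re.escape(w) for w in keyword.lower().split()]
    pattern = r'\b' + r'\s+'.join(words) + r'\b'
    return re.search(pattern, text.lower()) is not None


# Hypothetical sentence, for illustration only
sentence = 'the median incubation period was estimated'
print(contains_keyword(sentence, 'incubation period'))   # True
print('incubation period' in sentence.split())           # False: tokens are single words
print(contains_keyword('unmasked faces', 'mask'))        # False: not a whole word
```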


## Search System 

In [1]:
import pandas as pd 
from collections import defaultdict
import string
from gensim.models import CoherenceModel
import gensim
from pprint import pprint
import spacy,en_core_web_sm
from nltk.stem import PorterStemmer
import os
import json
from gensim.models import Word2Vec
import nltk
import re
import collections

### Read metadata into dictionary format

In [2]:
class MetaData:
    def __init__(self):
        """Define variables."""
        # path and data
        self.path = '/afs/inf.ed.ac.uk/user/s16/s1690903/share/cov19_2/'
        self.meta_data = pd.read_csv(self.path + 'metadata.csv')

    def data_dict(self):
        """Convert df to dictionary. """
        mydict = lambda: defaultdict(mydict)
        meta_data_dict = mydict()

        for cord_uid, abstract, title, sha in zip(self.meta_data['cord_uid'], self.meta_data['abstract'], self.meta_data['title'], self.meta_data['sha']):
            meta_data_dict[cord_uid]['title'] = title
            meta_data_dict[cord_uid]['abstract'] = abstract
            meta_data_dict[cord_uid]['sha'] = sha

        return meta_data_dict
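
The `mydict = lambda: defaultdict(mydict)` idiom used in `data_dict` builds a dictionary whose missing keys are created on demand as further nested dictionaries. The sketch below shows the same idea with a named function; `nested_dict` and the record values are hypothetical and only serve the illustration.

```python
from collections import defaultdict


def nested_dict():
    """Return a defaultdict whose missing keys are themselves nested dicts."""
    return defaultdict(nested_dict)


records = nested_dict()
# no need to create records['uid_1'] first: it appears on first access
records['uid_1']['title'] = 'A hypothetical title'
records['uid_1']['abstract'] = 'A hypothetical abstract'

# Caveat: merely reading a missing key also creates an (empty) entry
_ = records['uid_2']['title']
print('uid_2' in records)   # True
```

Because of this caveat, membership checks such as `k in records` are safer than indexing when testing whether an id is present.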


### Extract documents containing keywords, with preprocessing

In [3]:
class ExtractText:
    """Extract text according to keywords or phrases"""

    def __init__(self, metaDict, keyword, variable):
        """Define variables."""
        self.path = '/afs/inf.ed.ac.uk/user/s16/s1690903/share/cov19_2/'
        self.metadata = metaDict
        self.keyword = keyword
        self.variable = variable


    def simple_preprocess(self):
        """Simple text process: lower case, remove punc. """
        mydict = lambda: defaultdict(mydict)
        cleaned = mydict()
        for k, v in self.metadata.items():
            sent = v[self.variable]
            sent = str(sent).lower().translate(str.maketrans('', '', string.punctuation))
            cleaned[k]['processed_text'] = sent
            cleaned[k]['sha'] = v['sha']
            cleaned[k]['title'] = v['title']

        return cleaned

    def very_simple_preprocess(self):
        """Simple text process: lower case only. """
        mydict = lambda: defaultdict(mydict)
        cleaned = mydict()
        for k, v in self.metadata.items():
            sent = v[self.variable]
            sent = str(sent).lower()
            cleaned[k]['processed_text'] = sent
            cleaned[k]['sha'] = v['sha']
            cleaned[k]['title'] = v['title']

        return cleaned
     

    def extract_w_keywords(self):
        """Select content with keywords."""
        mydict = lambda: defaultdict(mydict)
        selected = mydict()
        textdict = self.simple_preprocess()
        for k, v in textdict.items():
            if self.keyword in v['processed_text'].split():
                #print(v['sha'])
                selected[k]['processed_text'] = v['processed_text']
                selected[k]['sha'] = v['sha']
                selected[k]['title'] = v['title']
        return selected

    def extract_w_keywords_punc(self):
        """Select content with keywords, with punctuations in text"""
        mydict = lambda: defaultdict(mydict)
        selected = mydict()
        textdict = self.very_simple_preprocess()
        for k, v in textdict.items():
            if self.keyword in v['processed_text'].split():
                    #print(v['sha'])
                selected[k]['processed_text'] = v['processed_text']
                selected[k]['sha'] = v['sha']
                selected[k]['title'] = v['title']
        return selected

    def get_noun_verb(self, text):
        """Get nouns, stemmed verbs and adjectives for the LDA model.

        Change the part-of-speech filters below to decide what
        you want to use as input for LDA. `text` maps id -> string.
        """
        ps = PorterStemmer()

        # load the spaCy model used for part-of-speech tagging
        nlp = en_core_web_sm.load()
        all_extracted = {}
        for k, v in text.items():
            #v = v.replace('incubation period', 'incubation_period')
            doc = nlp(v)
            # compare POS tags with == (string identity via `is` is unreliable)
            nouns = ' '.join(str(t) for t in doc if t.pos_ == 'NOUN').split()
            verbs = ' '.join(ps.stem(str(t)) for t in doc if t.pos_ == 'VERB').split()
            adj = ' '.join(str(t) for t in doc if t.pos_ == 'ADJ').split()
            all_w = nouns + verbs + adj
            all_extracted[k] = all_w

        return all_extracted

    def get_noun_verb2(self, text):
        """Get stemmed nouns, stemmed verbs and adjectives for the LDA model.

        Change the part-of-speech filters below to decide what
        you want to use as input for LDA. `text` maps id -> dict
        with a 'processed_text' entry.
        """
        ps = PorterStemmer()

        # load the spaCy model used for part-of-speech tagging
        nlp = en_core_web_sm.load()
        all_extracted = {}
        for k, v in text.items():
            #v = v.replace('incubation period', 'incubation_period')
            doc = nlp(v['processed_text'])
            # compare POS tags with == (string identity via `is` is unreliable)
            nouns = ' '.join(ps.stem(str(t)) for t in doc if t.pos_ == 'NOUN').split()
            verbs = ' '.join(ps.stem(str(t)) for t in doc if t.pos_ == 'VERB').split()
            adj = ' '.join(str(t) for t in doc if t.pos_ == 'ADJ').split()
            all_w = nouns + verbs + adj
            all_extracted[k] = all_w

        return all_extracted

    def tokenization(self, text):
        """Tokenize each text with spaCy.

        `text` maps id -> string; the result maps id -> list of
        token strings to use as input for the next step."""
        nlp = spacy.load("en_core_web_sm")

        all_extracted = {}
        for k, v in text.items():
            doc = nlp(v)
            all_extracted[k] = [w.text for w in doc]
      
        return all_extracted



## Using LDA to rank documents
The LDA hyperparameters (alpha and eta) are tuned using the u_mass coherence score.
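
To make the score concrete, here is a minimal sketch of a UMass-style coherence for a single topic, computed from document co-occurrence counts. It follows the commonly cited definition (sum over ranked word pairs of log((D(w_i, w_j) + 1) / D(w_i))), and is only an illustration: gensim's `u_mass` implementation may differ in smoothing and in how it averages over pairs and topics, so the numbers will not match the ones printed in this notebook. The function name and toy documents are assumptions.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """UMass-style coherence of one topic.

    top_words: the topic's words, most probable first; each is assumed
               to occur in at least one document.
    documents: iterable of token lists.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # number of documents that contain all the given words
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    # every pair (higher-ranked word, lower-ranked word)
    for earlier, later in combinations(top_words, 2):
        score += math.log((doc_freq(earlier, later) + 1) / doc_freq(earlier))
    return score


# Hypothetical toy corpus, for illustration only
docs = [['mask', 'wear', 'public'], ['mask', 'wear'], ['virus', 'period'], ['mask', 'public']]
coherent = umass_coherence(['mask', 'wear'], docs)     # words that co-occur
incoherent = umass_coherence(['mask', 'virus'], docs)  # words that never co-occur
print(coherent, incoherent)
```

Under this definition, words that co-occur often give a score closer to zero, and words that rarely co-occur give a more negative score.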

In [8]:
class LDATopic:
    def __init__(self, processed_text, topic_num, alpha, eta):
        """Define variables."""
        self.path = '/afs/inf.ed.ac.uk/user/s16/s1690903/share/cov19_2/'
        self.text = processed_text
        self.topic_num = topic_num
        self.alpha = alpha
        self.eta = eta

    def get_lda_score_eval(self, dictionary, bow_corpus):
        """LDA model and coherence score."""

        lda_model = gensim.models.ldamodel.LdaModel(bow_corpus, num_topics=self.topic_num, id2word=dictionary, passes=10,  update_every=1, random_state = 300, alpha=self.alpha, eta=self.eta)
        #pprint(lda_model.print_topics())

        # get coherence score
        cm = CoherenceModel(model=lda_model, corpus=bow_corpus, coherence='u_mass')
        coherence = cm.get_coherence()
        print('coherence score is {}'.format(coherence))

        return lda_model, coherence

    def get_score_dict(self, bow_corpus, lda_model_object):
        """
        Get the LDA topic scores for each document,
        as {document index: {topic id: score}} ordered by topic id.
        """
        all_lda_score = {}
        for i in range(len(bow_corpus)):
            # lda_model_object[bow] yields (topic id, score) pairs
            topic_scores = dict(lda_model_object[bow_corpus[i]])
            # a document with no reported topics gets an empty entry
            all_lda_score[i] = collections.OrderedDict(sorted(topic_scores.items()))
        return all_lda_score


    def topic_modeling(self):
        """Get LDA topic modeling."""
        # generate dictionary
        dictionary = gensim.corpora.Dictionary(self.text.values())
        bow_corpus = [dictionary.doc2bow(doc) for doc in self.text.values()]
        # modeling
        model, coherence = self.get_lda_score_eval(dictionary, bow_corpus)

        lda_score_all = self.get_score_dict(bow_corpus, model)

        all_lda_score_df = pd.DataFrame.from_dict(lda_score_all)
        all_lda_score_dfT = all_lda_score_df.T
        all_lda_score_dfT = all_lda_score_dfT.fillna(0)

        return model, coherence, all_lda_score_dfT

    def get_ids_from_selected(self, text):
        """Get unique id from text """
        id_l = []
        for k, v in text.items():
            id_l.append(k)
            
        return id_l
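
The combination of `get_score_dict` and the `fillna(0)` call in `topic_modeling` turns each document's sparse list of (topic id, score) pairs into a dense row with one column per topic. A dependency-free sketch of that conversion is shown below; `dense_topic_rows` and the sample scores are hypothetical, and the assumption is that topics missing from a document's list should count as 0.

```python
def dense_topic_rows(doc_topics, num_topics):
    """Expand sparse (topic_id, score) lists into fixed-length rows."""
    rows = []
    for pairs in doc_topics:
        row = [0.0] * num_topics      # topics not reported default to 0
        for topic_id, score in pairs:
            row[topic_id] = score
        rows.append(row)
    return rows


# Hypothetical model output for three documents and four topics
sparse = [[(1, 0.9), (3, 0.1)], [(0, 1.0)], []]
dense = dense_topic_rows(sparse, 4)
for row in dense:
    print(row)
```

A row-per-document layout like this is what makes the later filter `scores_best[topic_num] > 0` possible.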


## Select document (abstract/ article body) according to search result

In [20]:
class MatchArticleBody:
    def __init__(self, path, selected_id):
        """Define variables."""
        self.path = path
        self.selected_id = selected_id


    def read_folder(self):
        """
        Walk rootdir and return a dictionary that maps each
        file id (file name without extension) to its parsed JSON content
        """
        rootdir = self.path.rstrip(os.sep)

        article_dict = {}
        for path, dirs, files in os.walk(rootdir):
            for fname in files:
                file_id = fname.split('.')[0]
                #print(file_id)
                try:
                    # load the json file from the directory being walked
                    with open(os.path.join(path, fname)) as fh:
                        data = json.load(fh)
                except (OSError, ValueError):
                    # skip unreadable or non-JSON files
                    continue
                article_dict[file_id] = data

        return article_dict


    def extract_bodytext(self):
        """Unpack nested dictionary and extract body of the article"""
        body = {}
        article_dict = self.read_folder()
        for k, v in article_dict.items():
            # join all paragraphs of the body, including the last one
            body[k] = ' '.join(entry['text'] for entry in v['body_text'])
        return body


    def get_title_by_bodykv(self, article_dict, keyword):
        """Search keyword in article body and return title"""

        article_dict = self.read_folder()
        selected_id = self.extract_id_list()

        result = {}
        for k, v in article_dict.items():
            for entry in v['body_text']:
                if (keyword in entry['text'].split()) and (k in selected_id):
                    result[k] = v['metadata']['title']

        return result


    def extract_id_list(self):
        """Extract ids from the selected text. """
        selected_id = []
        for k, v in self.selected_id.items():
            # the sha field may hold several ';'-separated hashes
            selected_id.extend(s.strip() for s in str(v['sha']).split(';'))

        return selected_id


    def select_text_w_id(self):
        body_text = self.extract_bodytext()
        selected_id = self.extract_id_list()
        selected_text = {}
        for k, v in body_text.items():
            if k in selected_id:
                selected_text[k] = v
        return selected_text
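
Two small operations carry most of the logic in `MatchArticleBody`: splitting a `sha` field that may list several hashes, and concatenating the paragraphs stored under `body_text`. The sketch below isolates them on a hypothetical record shaped like the JSON accessed in the class (a `metadata` title and a `body_text` list of `text` entries). The helper names and sample values are assumptions, and treating a missing `sha` as an empty list is a design choice made for the example.

```python
def split_sha(sha_field):
    """Split a metadata `sha` value that may hold several ';'-separated hashes."""
    # a missing value read by pandas shows up as NaN, whose string form is 'nan'
    if sha_field is None or str(sha_field).lower() == 'nan':
        return []
    return [part.strip() for part in str(sha_field).split(';') if part.strip()]


def join_body_text(article):
    """Concatenate every paragraph in an article's `body_text` list."""
    return ' '.join(entry['text'] for entry in article['body_text'])


# Hypothetical record, for illustration only
article = {
    'metadata': {'title': 'A hypothetical title'},
    'body_text': [{'text': 'First paragraph.'}, {'text': 'Second paragraph.'}],
}
print(split_sha('abc123; def456'))
print(join_body_text(article))
```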

In [24]:
# Now we extract the articles that contain the most relevant topic

def selected_best_LDA(keyword, varname):
        """Select the best lda model with extracted text """
        # convert data to dictionary format
        m = MetaData()
        metaDict = m.data_dict()

        #process text and extract text with keywords
        et = ExtractText(metaDict, keyword, varname)
        text1 = et.extract_w_keywords()


        # extract nouns, verbs and adjectives
        text = et.get_noun_verb2(text1)

        # candidate values for alpha and eta (called beta here)
        alpha = [0.1, 0.3, 0.5, 0.7, 0.9]
        beta = [0.1, 0.3, 0.5, 0.7, 0.9]

        mydict = lambda: defaultdict(mydict)
        cohere_dict = mydict()
        for a in alpha:
            for b in beta:
                lda = LDATopic(text, 20, a, b)
                model, coherence, scores = lda.topic_modeling()
                cohere_dict[coherence]['a'] = a
                cohere_dict[coherence]['b'] = b
    
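        # sort result dictionary to identify the best a, b
        # NOTE: sorted(...)[0] picks the LOWEST (most negative) u_mass score.
        # u_mass is usually read as better when closer to zero; use
        # max(cohere_dict.keys()) instead to select the highest score.
        sort = sorted(cohere_dict.keys())[0] 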
        a = cohere_dict[sort]['a']
        b = cohere_dict[sort]['b']
        
        # run LDA with the optimized values
        lda = LDATopic(text, 20, a, b)
        model, coherence, scores_best = lda.topic_modeling()
        pprint(model.print_topics())

        # select merge ids with the LDA topic scores
        id_l = lda.get_ids_from_selected(text)
        scores_best['cord_uid'] = id_l

        return scores_best



def select_text_from_LDA_results(keyword, varname, scores_best, topic_num):
        # choose papers with the most relevant topic
        # convert data to dictionary format
        m = MetaData()
        metaDict = m.data_dict()

        # process text and extract text with keywords
        et = ExtractText(metaDict, keyword, varname)
        # extract text together with punctuation
        text1 = et.extract_w_keywords_punc()
        # need to decide which topic to choose after training
        sel = scores_best[scores_best[topic_num] > 0] 
        

        mydict = lambda: defaultdict(mydict)
        selected = mydict()
        for k, v in text1.items():
            if k in sel.cord_uid.tolist():
                selected[k]['title'] = v['title']
                selected[k]['processed_text'] = v['processed_text']
                selected[k]['sha'] = v['sha']

        return selected

def extract_relevant_sentences(cor_dict, search_keywords):
    """Extract the sentences that contain a keyword from the relevant articles. """

    mydict = lambda: defaultdict(mydict)
    sel_sentence = mydict()
    
    for k, v in cor_dict.items():
        keyword_sentence = []
        sentences = v['processed_text'].split('.')
        for sentence in sentences:
            # for each sentence, check if keyword exist
            # append sentences contain keyword to list
            keyword_sum = sum(1 for word in search_keywords if word in sentence)
            if keyword_sum > 0:
                keyword_sentence.append(sentence)
            

        # store results
        sel_sentence[k]['sentences'] = keyword_sentence
        sel_sentence[k]['sha'] = v['sha']
        sel_sentence[k]['title'] = v['title']
    print('{} articles are relevant to the topic you choose'.format(len(sel_sentence)))

    path = '/afs/inf.ed.ac.uk/user/s16/s1690903/share/cov19_2/'
    df = pd.DataFrame.from_dict(sel_sentence, orient='index')
    df.to_csv(path + 'search_results_{}.csv'.format(search_keywords))
    sel_sentence_df = pd.read_csv(path + 'search_results_{}.csv'.format(search_keywords))
    return sel_sentence, sel_sentence_df
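
`extract_relevant_sentences` splits on every '.', which also cuts sentences at decimal points, and it matches keywords as substrings. A possible refinement, sketched here as an assumption rather than as the method used in this notebook, splits only where sentence punctuation is followed by whitespace and matches keywords as whole words. The helper names and the sample text are hypothetical.

```python
import re


def split_sentences(text):
    """Split after '.', '!' or '?' only when followed by whitespace,
    so that decimal numbers such as 73.2 stay intact."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]


def sentences_with_keywords(text, keywords):
    """Return the sentences that contain any keyword as a whole word."""
    patterns = [re.compile(r'\b' + re.escape(k.lower()) + r'\b') for k in keywords]
    return [s for s in split_sentences(text)
            if any(p.search(s.lower()) for p in patterns)]


# Hypothetical abstract text, for illustration only
text = ('most respondents wore a face mask in public venues (73.2%). '
        'hand washing was also common.')
hits = sentences_with_keywords(text, ['mask'])
print(hits)
```

This simple rule still splits after abbreviations such as "e.g. ", so a dedicated sentence segmenter may be preferable for a final evaluation.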





## Question 1: Is wearing a mask an effective way to control a pandemic?

In [11]:
#here we select the LDA model with the lowest coherence score
scores_best = selected_best_LDA('mask', 'abstract')

coherence score is -4.336722639525242
coherence score is -3.9010410239440367
coherence score is -4.389157036678158
coherence score is -4.585109650895275
coherence score is -4.658858479495897
coherence score is -4.181983373771045
coherence score is -3.8225572164182773
coherence score is -4.115948791902841
coherence score is -4.405996653207033
coherence score is -4.893575872869723
coherence score is -3.763406611071013
coherence score is -4.054636612549744
coherence score is -3.974976478023649
coherence score is -3.605098511028769
coherence score is -3.218469889363284
coherence score is -3.922771443759005
coherence score is -4.56961216145051
coherence score is -2.9539869626024027
coherence score is -3.2776086868166603
coherence score is -3.6310570980589993
coherence score is -3.8129640801122164
coherence score is -3.7657510362336026
coherence score is -3.4720471247415454
coherence score is -3.0704332896336615
coherence score is -2.990024667297296
coherence score is -4.893575872869723
[(0,
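
The 25 grid-search scores printed above can be mapped back to their (alpha, eta) pairs. The sketch below assumes the scores were printed in the order of the nested loops in `selected_best_LDA` (alpha outer, eta inner) and rounds them to four decimals; the variable names are hypothetical. The last printed value (about -4.8936) equals the lowest grid score, which is consistent with `sorted(...)[0]` selecting the minimum.

```python
from itertools import product

alphas = [0.1, 0.3, 0.5, 0.7, 0.9]
etas = [0.1, 0.3, 0.5, 0.7, 0.9]

# the 25 grid-search scores from the output, rounded, in printed order
scores = [
    -4.3367, -3.9010, -4.3892, -4.5851, -4.6589,
    -4.1820, -3.8226, -4.1159, -4.4060, -4.8936,
    -3.7634, -4.0546, -3.9750, -3.6051, -3.2185,
    -3.9228, -4.5696, -2.9540, -3.2776, -3.6311,
    -3.8130, -3.7658, -3.4720, -3.0704, -2.9900,
]

# assumed ordering: alpha varies slowest, eta fastest
grid = dict(zip(product(alphas, etas), scores))

lowest = min(grid, key=grid.get)    # what sorted(...)[0] selects
highest = max(grid, key=grid.get)   # score closest to zero
print('lowest :', lowest, grid[lowest])
print('highest:', highest, grid[highest])
```

Keying the results by (alpha, eta) rather than by the score itself also avoids overwriting an entry when two settings happen to produce the same coherence value.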

We observe that topic No. 1 is the most relevant to the public wearing masks.

In [16]:
# topic number 1 is the most relevant to the public wearing masks
# choose the topic you think is most relevant to your search
cor_dict = select_text_from_LDA_results('mask', 'abstract', scores_best, 1)
print("There are {} abstracts selected".format(len(cor_dict)))

There are 33 abstracts selected


In [25]:
# extract relevant sentences (search keywords are passed as a list)
sel_sentence, sel_sentence_df = extract_relevant_sentences(cor_dict, ['mask'])

33 articles are relevant to the topic you choose


In [27]:
#read extracted article
sel_sentence_df.head(10)

Unnamed: 0.1,Unnamed: 0,sentences,sha,title
0,8o3l3rsf,"[', escalatory quarantine, mask wearing when g...",,Effectiveness of control strategies for Corona...
1,1mu1z4xd,[' wearing a mask when going out and avoiding ...,5bb89950ec5a06e2b7f69b2a9c4213dda19b1ab0,Prediction of New Coronavirus Infection Based ...
2,ht88wu6s,[' conclusion: to early end of the covid-19 ep...,,Estimating the reproductive number and the out...
3,nzh87aux,"[' on the other hand, the model predicts that ...",9b7a0ad7b6c7f59e7a6cf1dc9d07912a273d19b5,The Waiting Time for Inter-Country Spread of P...
4,n2r4pzan,"[', wearing face mask in public venues (73', '...",b7c8e73cf095e30552a32cea04a398331c55ab41,Anticipated and current preventive behaviors i...
5,ywb9krdp,"['2%), and wear a face mask (59']",16627f4c7134394da448b1417a771d13ad7cca4a,Pandemic influenza in Australia: Using telepho...
6,bhnh2dq4,[' if an infected person will not use a mask a...,bb9f6cef633c9baf595daae5166b11f88c1271cb,Risk of transmission of airborne infection dur...
7,49xvz389,['3%) were carrying out one of prevention meas...,545def8771357b4cb2875f5795a0760e97534cc9,Knowledge and attitudes of university students...
8,r3in76wm,[' preventive behavior such as handwashing and...,24d7fe6bbb9945f1536fef5b281d074fe69cfc6a,Avian Influenza Risk Perception and Preventive...
9,e94synjc,['this research assessed factors associated wi...,14d04f36cb13550aa7769b61a079fa54031a21eb,Public health measures during an anticipated i...


## Annotation 
We extracted 33 papers that are expected to discuss whether using masks is useful. We annotate whether the key sentences suggest that using a mask can reduce the risk of infection.

Annotation scheme:
1. ‘1’: sentences that support the claim that using a mask during a pandemic is useful
2. ‘2’: papers that assume masks are useful and examine the public’s willingness to comply with the rules
3. ‘0’: no obvious evidence that using a mask is protective, or the protection is very small
4. Not relevant to the above points
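
Once every key sentence has a label, the counts and proportions per category can be tallied as in the sketch below. The label list is hypothetical and does not reproduce the annotations of this study; `'NA'` is an assumed code for the "not relevant" category, which has no code in the scheme above.

```python
from collections import Counter


def summarise_labels(labels):
    """Return {label: (count, proportion)} for a list of annotation labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}


# Hypothetical annotations, for illustration only
labels = ['1', '1', '2', '0', 'NA', '2', '1', '2']
summary = summarise_labels(labels)
for label, (n, share) in sorted(summary.items()):
    print('{}: {} ({:.1%})'.format(label, n, share))
```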

In [None]:
#here we need to add the stats analysis 

## Results
According to the key sentences in the 33 abstracts that discuss the public use of masks, only one paper suggests that there is not enough evidence that masks are useful.
14 papers report results showing that using a surgical mask during a pandemic is effective in reducing infection.
14 papers treat public mask use as necessary for reducing the risk of infection, and examine whether the public is willing to comply with the rules. (X papers are from Hong Kong, based on the region of the first author.)
5 papers are not relevant to the topic.

Conclusion:
Governments in some regions advocate using masks as a standard approach to reducing the risk of infection, and papers from these regions focus on whether people comply with the rules. Although some governments maintain that there is little evidence that masks are effective in controlling the pandemic, nearly half of the papers in our search results treat mask wearing as a standard practice that the public should comply with, and nearly half report evidence that wearing masks is effective in controlling the pandemic.
