# COVID19 Search Engine

Our goal is to create a search engine that can aid people in finding information from all the papers available.

## Preprocessing
- remove stop words ("the", "and", "a", etc.)
- stemming and lemmatization (remove "-s", "-es", "-ing", "-ed") 
- Manual pattern matching --> slow, but we can use it for exploring/validation

## fast pattern matching a.
1. represent all papers as bow with words + freq counts
2. do pattern matching by looking for word in bow representations
3. Once relevant papers are found --> use context (let go of bow) to get better results and extract sentences.

## fast pattern matching b.
1. express paper, abstract, paragraphs, sentences in word embeddings. (and save somewhere)
2. express user input ass word embedding
3. comute cosine similarity of user input and paper (sentences, paragraphs, full text (whatever works best))


--> TF-IDF --> retreive pattern matches

--> pattern matching is slow. Maybe we can make a cache somehow?

--> we can use manual pattern matching as validation method


In [0]:
# unpack data
import os

import zipfile
#with zipfile.ZipFile('CORD-19-research-challenge.zip', 'r') as zip_ref:
#    zip_ref.extractall('.')

In [0]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

%matplotlib inline

import matplotlib.pyplot as plt
import spacy

import glob, json, os

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

# Any results you write to the current directory are saved as output.

In [0]:
root_path = '../content/'
metadata_path = f'{root_path}/metadata.csv'
meta_df = pd.read_csv(metadata_path, dtype={
    'pubmed_id': str,
    'Microsoft Academic Paper ID': str, 
    'doi': str
})
meta_df.head()

Unnamed: 0,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_full_text,full_text_file
0,,Elsevier,Intrauterine virus infections and congenital h...,10.1016/0002-8703(72)90077-4,,4361535,els-covid,Abstract The etiologic basis for the vast majo...,1972-12-31,"Overall, James C.",American Heart Journal,,,False,custom_license
1,,Elsevier,Coronaviruses in Balkan nephritis,10.1016/0002-8703(80)90355-5,,6243850,els-covid,,1980-03-31,"Georgescu, Leonida; Diosi, Peter; Buţiu, Ioan;...",American Heart Journal,,,False,custom_license
2,,Elsevier,Cigarette smoking and coronary heart disease: ...,10.1016/0002-8703(80)90356-7,,7355701,els-covid,,1980-03-31,"Friedman, Gary D",American Heart Journal,,,False,custom_license
3,aecbc613ebdab36753235197ffb4f35734b5ca63,Elsevier,Clinical and immunologic studies in identical ...,10.1016/0002-9343(73)90176-9,,4579077,els-covid,"Abstract Middle-aged female identical twins, o...",1973-08-31,"Brunner, Carolyn M.; Horwitz, David A.; Shann,...",The American Journal of Medicine,,,True,custom_license
4,,Elsevier,Epidemiology of community-acquired respiratory...,10.1016/0002-9343(85)90361-4,,4014285,els-covid,Abstract Upper respiratory tract infections ar...,1985-06-28,"Garibaldi, Richard A.",The American Journal of Medicine,,,False,custom_license


In [0]:
all_json = glob.glob(f'{root_path}/**/*.json', recursive=True)
len(all_json)

29315

In [0]:
class FileReader:
    def __init__(self, file_path):
        with open(file_path) as file:
            content = json.load(file)
            self.paper_id = content['paper_id']
            self.abstract = []
            self.body_text = []
            # Abstract
            for entry in content['abstract']:
                self.abstract.append(entry['text'])
            # Body text
            for entry in content['body_text']:
                self.body_text.append(entry['text'])
            self.abstract = '\n'.join(self.abstract)
            self.body_text = '\n'.join(self.body_text)
    def __repr__(self):
        return f'{self.paper_id}: {self.abstract[:200]}... {self.body_text[:200]}...'
first_row = FileReader(all_json[0])

In [0]:
from spacy.matcher import PhraseMatcher

dict_ = {'paper_id': [], 'abstract': [], 'body_text': [], 'authors': [], 'title': [], 'journal': []}
for idx, entry in enumerate(all_json):
    if idx % 100 == 0:
        print(f'Processing index: {idx} of {len(all_json)}')
    if idx == 10000:
        break
    
    content = FileReader(entry)
    
    # get metadata information
    meta_data = meta_df.loc[meta_df['sha'] == content.paper_id]
    # no metadata, skip this paper
    if len(meta_data) == 0:
        continue
    
    dict_['paper_id'].append(content.paper_id)
    dict_['abstract'].append(content.abstract)
    dict_['body_text'].append(content.body_text)
        
    # get metadata information
    meta_data = meta_df.loc[meta_df['sha'] == content.paper_id]

    # if only one author - or Null valie
    dict_['authors'].append(meta_data['authors'].values[0])
    
    dict_['title'].append(str(meta_data['title'])[9:-27])
    
    # add the journal information
    dict_['journal'].append(meta_data['journal'].values[0])
    
df_covid = pd.DataFrame(dict_, columns=['paper_id', 'abstract', 'body_text', 'authors', 'title', 'journal'])
df_covid.head()

Processing index: 0 of 29315
Processing index: 100 of 29315
Processing index: 200 of 29315
Processing index: 300 of 29315
Processing index: 400 of 29315
Processing index: 500 of 29315
Processing index: 600 of 29315
Processing index: 700 of 29315
Processing index: 800 of 29315
Processing index: 900 of 29315
Processing index: 1000 of 29315
Processing index: 1100 of 29315
Processing index: 1200 of 29315
Processing index: 1300 of 29315
Processing index: 1400 of 29315
Processing index: 1500 of 29315
Processing index: 1600 of 29315
Processing index: 1700 of 29315
Processing index: 1800 of 29315
Processing index: 1900 of 29315
Processing index: 2000 of 29315
Processing index: 2100 of 29315
Processing index: 2200 of 29315
Processing index: 2300 of 29315
Processing index: 2400 of 29315
Processing index: 2500 of 29315
Processing index: 2600 of 29315
Processing index: 2700 of 29315
Processing index: 2800 of 29315
Processing index: 2900 of 29315
Processing index: 3000 of 29315
Processing index: 31

Unnamed: 0,paper_id,abstract,body_text,authors,title,journal
0,0c27c0ddaa4761f6155838df81a88d24619720f8,"Genome Detective is a web-based, user-friendly...",We are currently faced with a potential global...,"Cleemput, S.; Dumon, W.; Fonseca, V.; Abdool K...",Genome Detective Coronavirus Typing Tool for r...,
1,92860a0f4425d12aafb8e096d8e65d38eb67dff5,"Ebolaviruses are non-segmented, negative-sense...","Ebolaviruses are non-segmented, negative-sense...","Morwitzer, M. J.; Corona, A.; Zinzula, L.; Fan...",Mutation of Ebola virus VP35 Ser129 uncouples ...,
2,0ab795fc615df6457551a8e231dce1f268eef9d2,With the emergence of 4 rd generation transmis...,"Since December 2019, there has been an outbrea...",Zhaowei Chen; Jijia Hu; Zongwei Zhang; Shan Ji...,Caution: The clinical characteristics of COVID...,
3,9af0b8e13e1e5ac714135dca8ba6d5216966066b,Intrinsically disordered regions (IDRs) challe...,Intrinsically disordered regions (IDRs) featur...,"Cohan, M. C.; Posey, A. E.; Grigsby, S. J.; Mi...",Evolved sequence features within the intrinsic...,
4,ab1692d56f32fdb0922e5f323f6ca385fb1e0808,Virophages are satellite-like double stranded ...,Host range is defined as the number and nature...,"Mougari, S.; Chelkha, N.; Sahmi-Bounsiar, D.; ...",First evidence of host range expansion in viro...,


# Pattern matching
Print papers that contain the word transmission, incubation or covid-19.

In [0]:
terms = ['transmission','incubation','covid-19']

t_found = np.zeros((len(df_covid),len(terms)))

nlp = spacy.load('en')
nlp.max_length = 1000000

patterns = [nlp(text) for text in terms]

for i,text in enumerate(df_covid['abstract']):
    doc = nlp(text)
    matcher = PhraseMatcher(nlp.vocab, attr='LOWER')
    matcher.add("TerminologyList", None, *patterns)
    matches = matcher(doc)
    if len(matches)>0:
        for match in matches:
            word = str(doc[match[1]:match[2]]).lower()
            t_found[i,terms.index(word)] = 1

In [0]:
pd.options.display.max_colwidth = 300

print('papers containing ')
df_covid[(t_found[:,0]==1)][['title','authors']]

Unnamed: 0,title,authors
2,Caution: The clinical characteristics of COVID...,Zhaowei Chen; Jijia Hu; Zongwei Zhang; Shan Jiang; Tao Wang; Zhengli Shi; Zhan Zhang
5,Imaging Striatal Dopamine Release Using a Non-...,"Beyene, A. G.; Delevich, K.; Del Bonis ODonnell, J. T.; Piekarski, D. J.; Lin, W. C.; Thomas, A. W.; Yang, S. J.; Kosillo, P.; Yang, D.; Wilbrecht, L.; Landry, M. P."
6,Quantifying transmission of emerging zoonoses:...,"Ambrose, M.; Kucharski, A. J.; Formenty, P.; Muyembe-Tamfum, J.-J.; Rimoin, A. W.; Lloyd-Smith, J. O."
9,Risk estimation and prediction by modeling the...,Hui Wan; Jing-an Cui; Guo-Jing Yang
10,DEN-IM: Dengue Virus identification from shotg...,"Mendes, C. I.; Lizarazo, E.; Machado, M. P.; Silva, D. N.; Tami, A.; Ramirez, M.; Couto, N.; Rossen, J. W. A.; Carrico, J. A."
...,...,...
9715,Probable Secondary Infections in Households of...,"Lau, Joseph T.F.; Lau, Mason; Kim, Jean H.; Wong, Eric; Tsui, Hi-Yi; Tsang, Thomas; Wong, Tze Wai"
9721,Assessing hospital emergency management plans:...,"Rebmann, Terri"
9753,Detecting a trend change in cross-border epide...,"Maeno, Yoshiharu"
9776,Evaluation of Ultra-Microscopic Changes and Pr...,"Ali-Saeed, Rola; Alabsi, Aied M; Ideris, Aini; Omar, Abdul Rahman; Yusoff, Khatijah; Ali, Abdul Manaf"


In [0]:
df_covid[(t_found[:,1]==1)][['title','authors']]

Unnamed: 0,title,authors
31,Transmission interval estimates suggest pre-sy...,Lauren Tindale; Michelle Coombe; Jessica E Stockdale; Emma Garlock; Wing Yin Venus Lau; Manu Saraswat; Yen-Hsiang Brian Lee; Louxin Zhang; Dongxuan Chen; Jacco Wallinga; Caroline Colijn
49,Mapping a viral phylogeny onto outbreak trees ...,Stephen P Velsko;Jonathan E Allen;
61,A Generalized Discrete Dynamic Model for Human...,"Zhang, W.; Chen, Z.; Lu, Y.; Guo, Z.; Qi, Y.; Wang, G.; Lu, J."
63,The Viral Protein Corona Directs Viral Pathoge...,"Ezzat, K.; Pernemalm, M.; Palsson, S.; Roberts, T. C.; Jarver, P.; Dondalska, A.; Bestas, B.; Sobkowiak, M. J.; Levanen, B.; Skold, M.; Thompson, E. A.; Saher, O.; Kari, O. K.; Lajunen, T.; Sverremark Ekstrom, E.; Nilsson, C.; Ishchenko, Y.; Malm, T.; Wood, M. J. A.; Power, U. F.; Masich, S.; Li..."
67,Epidemiological and clinical features of COVID...,Penghui Yang; Yibo Ding; Zhe Xu; Rui Pu; Ping Li; Jin Yan; Jiluo Liu; Fanping Meng; Lei Huang; Lei Shi; Tianjun Jiang; Enqiang Qin; Min Zhao; Dawei Zhang; Peng Zhao; Lingxiang Yu; Zhaohai Wang; Zhixian Hong; Zhaohui Xiao; Qing Xi; Dexi Zhao; Peng Yu; Caizhong Zhu; Zhu Chen; Shaogeng Zhang; Junsh...
...,...,...
8688,DGHM 2011,
8781,election of staphylococcal enterotoxin B (SEB...,"Soykut, Esra Acar; Dudak, Fahriye Ceyda; Boyacı, İsmail Hakkı"
8807,lapsing subacute demyelinating encephalomyel...,"Wege, Helmut; Watanabe, Rihito; ter Meulen, Volker"
9135,High fatality rates and associated factors in ...,"Nam, Hae-Sung; Park, Jung Wan; Ki, Moran; Yeon, Mi-Yeon; Kim, Jin; Kim, Seung Woo"


In [0]:
df_covid[(t_found[:,2]==1)&(t_found[:,0]==1)][['title','authors']]

Unnamed: 0,title,authors
2,Caution: The clinical characteristics of COVID...,Zhaowei Chen; Jijia Hu; Zongwei Zhang; Shan Jiang; Tao Wang; Zhengli Shi; Zhan Zhang
9,Risk estimation and prediction by modeling the...,Hui Wan; Jing-an Cui; Guo-Jing Yang
22,Prediction of the COVID-19 outbreak based on a...,Yuan Zhang; Chong You; Zhenghao Cai; Jiarui Sun; Wenjie Hu; Xiao-Hua Zhou
31,Transmission interval estimates suggest pre-sy...,Lauren Tindale; Michelle Coombe; Jessica E Stockdale; Emma Garlock; Wing Yin Venus Lau; Manu Saraswat; Yen-Hsiang Brian Lee; Louxin Zhang; Dongxuan Chen; Jacco Wallinga; Caroline Colijn
70,A spatial model of CoVID-19 transmission in En...,Leon Danon; Ellen Brooks-Pollock; Mick Bailey; Matt J Keeling
...,...,...
7067,écanismes d’émergence virale et transmission ...,"Gessain, Antoine"
7686,Can we contain the COVID-19 outbreak with the ...,"Wilder-Smith, Annelies; Chiew, Calvin J; Lee, Vernon J"
7725,Inactivation photochimique des pathogènes des ...,"Cazenave, J.-P."
9301,Early dynamics of transmission and control of ...,"Kucharski, Adam J; Russell, Timothy W; Diamond, Charlie; Liu, Yang; Edmunds, John; Funk, Sebastian; Eggo, Rosalind M; Sun, Fiona; Jit, Mark; Munday, James D; Davies, Nicholas; Gimma, Amy; van Zandvoort, Kevin; Gibbs, Hamish; Hellewell, Joel; Jarvis, Christopher I; Clifford, Sam; Quilty, Billy J;..."


# Word Embedding
Compute cosine similarity between questions and paper abstracts

In [0]:
nlp = spacy.load('en_core_web_lg')

OSError: ignored

In [0]:
with nlp.disable_pipes():
    vectors = np.array([nlp(text).vector for text in df_covid['abstract']])
v_mean = np.mean(vectors,axis=0)

In [0]:
from sklearn.metrics.pairwise import cosine_similarity

In [0]:
q1 = 'What is known about transmission, incubation, and environmental stability?'

'Range of incubation periods for the disease in humans (and how this varies across age and health status) and how long individuals are contagious, even after recovery.'
'Prevalence of asymptomatic shedding and transmission (e.g., particularly children).''
'Seasonality of transmission.''
'Physical science of the coronavirus (e.g., charge distribution, adhesion to hydrophilic/phobic surfaces, environmental survival to inform decontamination efforts for affected areas and provide information about viral shedding).'
'Persistence and stability on a multitude of substrates and sources (e.g., nasal discharge, sputum, urine, fecal matter, blood).'
'Persistence of virus on surfaces of different materials (e,g., copper, stainless steel, plastic).'
'Natural history of the virus and shedding of it from an infected person'
'Implementation of diagnostics and products to improve clinical processes'
'Disease models, including animal models for infection, disease and transmission'
'Tools and studies to monitor phenotypic change and potential adaptation of the virus'
'Immune response and immunity'
'Effectiveness of movement control strategies to prevent secondary transmission in health care and community settings'
'Effectiveness of personal protective equipment (PPE) and its usefulness to reduce risk of transmission in health care and community settings'
'Role of the environment in transmission'

In [0]:
qv = nlp(q1).vector
cs = cosine_similarity(vectors-v_mean,qv.reshape(1,-1)-v_mean)
cs.argmax()

In [0]:
df_covid['abstract'][cs.argmax()]