# Covid19 Semantic-based Search using Word Embedding


# Goal


With over 57,000 scholarly articles on the coronavirus family, it is extremely difficult for medical researchers to go through this volume of research, and therefore to extract useful insights about the Covid-19 pandemic. **The main goal** is to implement a **semantic-based search** rather than a *keyword-based* search.


# Approach


Instead of comparing occurrences and counts, we use *gensim's word2vec* to generate word embeddings, with the abstract texts as our corpus. For each document, we compute the centroid of the word vectors in its abstract. For each query, we map every query word to its vector and compute the word centroid similarity between the query and each document's abstract. The top-ranked papers are then selected and displayed.
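
To make the idea concrete, here is a minimal sketch of word centroid similarity in its standard form (query centroid against document centroid). The 2-D vectors, the document names and the helper functions `centroid` and `cosine` are invented for illustration; in the notebook the vectors come from the trained word2vec model.

```python
import math

# Toy 2-D embeddings, invented for illustration only.
toy_vectors = {
    "virus": [1.0, 0.1],
    "infection": [0.9, 0.2],
    "outbreak": [0.8, 0.3],
    "protein": [0.1, 1.0],
    "structure": [0.2, 0.9],
}

def centroid(words, vectors):
    """Average the vectors of the words that have an embedding."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None  # no word of the text is in the vocabulary
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = {
    "doc_epidemiology": ["virus", "outbreak", "infection"],
    "doc_structural": ["protein", "structure"],
}
query_centroid = centroid(["infection", "virus"], toy_vectors)

# Rank documents by similarity of their centroid to the query centroid.
ranking = sorted(
    docs,
    key=lambda name: cosine(query_centroid, centroid(docs[name], toy_vectors)),
    reverse=True,
)
print(ranking)
```

The document whose words point in roughly the same direction as the query words ranks first, even though it need not share every keyword with the query.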

# Dataset Loading and Preprocessing


Some of the loading and pre-processing steps are adapted from Kaggle notebooks by **Ivan Ega Pratama** and **Maksim Ekin**.



**Citation:** [COVID EDA: Initial Exploration Tool](https://www.kaggle.com/ivanegapratama/covid-eda-initial-exploration-tool) and [COVID-19 Literature Clustering](https://www.kaggle.com/maksimeren/covid-19-literature-clustering#Loading-the-Data)

Since the full dataset is too large to work with, we will use the abstracts only. We start by loading the metadata file and extracting the paper_id, title, abstract and doi from it.

In [1]:
# Imports 
!pip install langdetect

import spacy
import string
import warnings

import numpy as np
import pandas as pd

from pprint import pprint
from IPython.utils import io
from tqdm.notebook import tqdm
from gensim.models import Word2Vec
from langdetect import DetectorFactory, detect
from IPython.core.display import HTML, display
from spacy.lang.en.stop_words import STOP_WORDS

warnings.filterwarnings('ignore')



In [3]:
root_path = 'kaggle/input/CORD-19-research-challenge'
metadata_path = f'{root_path}/metadata.csv'
meta_df = pd.read_csv(metadata_path, dtype={
    'pubmed_id': str,
    'Microsoft Academic Paper ID': str, 
    'doi': str,
    'abstract': str
})
meta_df.head()

Unnamed: 0,cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_pdf_parse,has_pmc_xml_parse,full_text_file,url
0,xqhn0vbp,1e1286db212100993d03cc22374b624f7caee956,PMC,Airborne rhinovirus detection and effect of ul...,10.1186/1471-2458-3-5,PMC140314,12525263,no-cc,"BACKGROUND: Rhinovirus, the most common cause ...",2003-01-13,"Myatt, Theodore A; Johnston, Sebastian L; Rudn...",BMC Public Health,,,True,True,custom_license,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...
1,gi6uaa83,8ae137c8da1607b3a8e4c946c07ca8bda67f88ac,PMC,Discovering human history from stomach bacteria,10.1186/gb-2003-4-5-213,PMC156578,12734001,no-cc,Recent analyses of human pathogens have reveal...,2003-04-28,"Disotell, Todd R",Genome Biol,,,True,True,custom_license,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...
2,le0ogx1s,,PMC,A new recruit for the army of the men of death,10.1186/gb-2003-4-7-113,PMC193621,12844350,no-cc,"The army of the men of death, in John Bunyan's...",2003-06-27,"Petsko, Gregory A",Genome Biol,,,False,True,custom_license,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...
3,fy4w7xz8,0104f6ceccf92ae8567a0102f89cbb976969a774,PMC,Association of HLA class I with severe acute r...,10.1186/1471-2350-4-9,PMC212558,12969506,no-cc,BACKGROUND: The human leukocyte antigen (HLA) ...,2003-09-12,"Lin, Marie; Tseng, Hsiang-Kuang; Trejaut, Jean...",BMC Med Genet,,,True,True,custom_license,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2...
4,0qaoam29,5b68a553a7cbbea13472721cd1ad617d42b40c26,PMC,A double epidemic model for the SARS propagation,10.1186/1471-2334-3-19,PMC222908,12964944,no-cc,BACKGROUND: An epidemic of a Severe Acute Resp...,2003-09-10,"Ng, Tuen Wai; Turinici, Gabriel; Danchin, Antoine",BMC Infect Dis,,,True,True,custom_license,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2...


In [3]:
df_covid = pd.DataFrame(columns=['paper_id', 'title','abstract', 'doi'])
df_covid['paper_id'] = meta_df.sha
df_covid['title'] = meta_df.title
df_covid['abstract'] = meta_df.abstract
df_covid['doi'] = meta_df.doi

df_covid.head()

Unnamed: 0,paper_id,title,abstract,doi
0,b2897e1277f56641193a6db73825f707eed3e4c9,Sequence requirements for RNA strand transfer ...,Nidovirus subgenomic mRNAs contain a leader se...,10.1093/emboj/20.24.7220
1,e3d0d482ebd9a8ba81c254cc433f314142e72174,"Crystal structure of murine sCEACAM1a[1,4]: a ...",CEACAM1 is a member of the carcinoembryonic an...,10.1093/emboj/21.9.2076
2,00b1d99e70f779eb4ede50059db469c65e8c1469,Synthesis of a novel hepatitis C virus protein...,Hepatitis C virus (HCV) is an important human ...,10.1093/emboj/20.14.3840
3,cf584e00f637cbd8f1bb35f3f09f5ed07b71aeb0,Structure of coronavirus main proteinase revea...,The key enzyme in coronavirus polyprotein proc...,10.1093/emboj/cdf327
4,dde02f11923815e6a16a31dd6298c46b109c5dfa,Discontinuous and non-discontinuous subgenomic...,"Arteri-, corona-, toro- and roniviruses are ev...",10.1093/emboj/cdf635


## Duplicates and Null values.

We will look into the data and check if we have any null values.

In [4]:
df_covid.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59887 entries, 0 to 59886
Data columns (total 4 columns):
paper_id    45763 non-null object
title       59724 non-null object
abstract    48757 non-null object
doi         55801 non-null object
dtypes: object(4)
memory usage: 1.8+ MB


In [5]:
df_covid.drop_duplicates(['abstract'], inplace=True)
df_covid.dropna(inplace=True)
df_covid.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 38584 entries, 0 to 59886
Data columns (total 4 columns):
paper_id    38584 non-null object
title       38584 non-null object
abstract    38584 non-null object
doi         38584 non-null object
dtypes: object(4)
memory usage: 1.5+ MB


## Dropping non-English articles.

We have now dropped the null values and removed the duplicates. Next, we count the non-English articles and see whether we can drop them for the sake of simplicity.

In [6]:
# set seed
DetectorFactory.seed = 0

# hold label - language
languages = []

# go through each text
for ii in tqdm(range(0,len(df_covid))):
    # split the abstract on spaces into a list of words
    text = df_covid.iloc[ii]['abstract'].split(" ")
    
    lang = "en"
    try:
        # detect the language from the first 50 words (or all of them if fewer)
        lang = detect(" ".join(text[:50]))
    # the beginning of the document was not in a usable format
    except Exception:
        all_words = set(text)
        try:
            # retry on the set of unique words
            lang = detect(" ".join(all_words))
        except Exception:
            lang = "unknown"
    
    # get the language    
    languages.append(lang)



Let's look at the numbers of articles for each language.

In [7]:
languages_dict = {}
for lang in set(languages):
    languages_dict[lang] = languages.count(lang)
    
print("Total: {}\n".format(len(languages)))
pprint(languages_dict)

Total: 38584

{'af': 1,
 'ca': 4,
 'de': 57,
 'en': 38071,
 'es': 187,
 'et': 1,
 'fr': 205,
 'it': 11,
 'nl': 43,
 'pt': 2,
 'unknown': 2}
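
The same tally can also be sketched with `collections.Counter`, which counts in a single pass instead of calling `list.count` once per language. The `sample_languages` list below is a made-up stand-in for the `languages` list built above.

```python
from collections import Counter

# Hypothetical stand-in for the `languages` list built above.
sample_languages = ["en", "en", "fr", "en", "es", "fr", "unknown"]

language_counts = Counter(sample_languages)
print("Total: {}".format(sum(language_counts.values())))

# most_common() lists languages from most to least frequent.
for lang, count in language_counts.most_common():
    print(lang, count)

# Share of English entries in the sample.
english_share = language_counts["en"] / len(sample_languages)
```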


Since 38,071 of the 38,584 articles (about 98.7%) are in English, we can safely drop the non-English ones.

In [8]:
df_covid['language'] = languages
df_covid = df_covid[df_covid['language'] == 'en'] 
df_covid.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 38071 entries, 0 to 59886
Data columns (total 5 columns):
paper_id    38071 non-null object
title       38071 non-null object
abstract    38071 non-null object
doi         38071 non-null object
language    38071 non-null object
dtypes: object(5)
memory usage: 1.7+ MB


In [9]:
df_covid = df_covid.drop(['language'], axis = 1) 
df_covid.head()

Unnamed: 0,paper_id,title,abstract,doi
0,b2897e1277f56641193a6db73825f707eed3e4c9,Sequence requirements for RNA strand transfer ...,Nidovirus subgenomic mRNAs contain a leader se...,10.1093/emboj/20.24.7220
1,e3d0d482ebd9a8ba81c254cc433f314142e72174,"Crystal structure of murine sCEACAM1a[1,4]: a ...",CEACAM1 is a member of the carcinoembryonic an...,10.1093/emboj/21.9.2076
2,00b1d99e70f779eb4ede50059db469c65e8c1469,Synthesis of a novel hepatitis C virus protein...,Hepatitis C virus (HCV) is an important human ...,10.1093/emboj/20.14.3840
3,cf584e00f637cbd8f1bb35f3f09f5ed07b71aeb0,Structure of coronavirus main proteinase revea...,The key enzyme in coronavirus polyprotein proc...,10.1093/emboj/cdf327
4,dde02f11923815e6a16a31dd6298c46b109c5dfa,Discontinuous and non-discontinuous subgenomic...,"Arteri-, corona-, toro- and roniviruses are ev...",10.1093/emboj/cdf635


# Spacy Parser and Tokenizer

We will use spaCy for the pre-processing. We use en_core_sci_lg, a scispaCy model for scientific and medical documents, to create a "processed abstract" feature. It will later be used to calculate the centroid for each abstract.

In [10]:
# Download the spacy bio parser
with io.capture_output() as captured:
    !pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_lg-0.2.4.tar.gz

In [11]:
#NLP 
import en_core_sci_lg  # model downloaded in previous step

In [12]:
punctuations = string.punctuation

stopwords = list(STOP_WORDS)
custom_stop_words = [
    'doi', 'preprint', 'copyright', 'peer', 'reviewed', 'org', 'https', 'et', 'al', 'author', 'figure', 
    'rights', 'reserved', 'permission', 'used', 'using', 'biorxiv', 'medrxiv', 'license', 'fig', 'fig.', 
    'al.', 'Elsevier', 'PMC', 'CZI', 'www'
]

for w in custom_stop_words:
    if w not in stopwords:
        stopwords.append(w)

In [13]:
# Parser
parser = en_core_sci_lg.load(disable=["tagger", "ner"])
parser.max_length = 7000000

def spacy_tokenizer(sentence):
    mytokens = parser(sentence)
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]
    mytokens = [ word for word in mytokens if word not in stopwords and word not in punctuations ]
    mytokens = " ".join(mytokens)
    return mytokens


tqdm.pandas()
df_covid["processed_abstract"] = df_covid["abstract"].progress_apply(spacy_tokenizer)
df_covid.head()





Unnamed: 0,paper_id,title,abstract,doi,processed_abstract
0,b2897e1277f56641193a6db73825f707eed3e4c9,Sequence requirements for RNA strand transfer ...,Nidovirus subgenomic mRNAs contain a leader se...,10.1093/emboj/20.24.7220,nidovirus subgenomic mrnas contain leader sequ...
1,e3d0d482ebd9a8ba81c254cc433f314142e72174,"Crystal structure of murine sCEACAM1a[1,4]: a ...",CEACAM1 is a member of the carcinoembryonic an...,10.1093/emboj/21.9.2076,ceacam1 member carcinoembryonic antigen cea fa...
2,00b1d99e70f779eb4ede50059db469c65e8c1469,Synthesis of a novel hepatitis C virus protein...,Hepatitis C virus (HCV) is an important human ...,10.1093/emboj/20.14.3840,hepatitis c virus hcv important human pathogen...
3,cf584e00f637cbd8f1bb35f3f09f5ed07b71aeb0,Structure of coronavirus main proteinase revea...,The key enzyme in coronavirus polyprotein proc...,10.1093/emboj/cdf327,key enzyme coronavirus polyprotein process vir...
4,dde02f11923815e6a16a31dd6298c46b109c5dfa,Discontinuous and non-discontinuous subgenomic...,"Arteri-, corona-, toro- and roniviruses are ev...",10.1093/emboj/cdf635,arteri- corona- toro- roniviruses evolutionari...
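
For readers without the scispaCy model installed, the filtering part of `spacy_tokenizer` can be approximated as below. This is only a rough sketch: `simple_tokenizer` and `toy_stopwords` are hypothetical names, the stop-word set is a tiny illustrative subset, and unlike the real tokenizer it does not lemmatize.

```python
import string

# Small illustrative stop-word set; the notebook uses spaCy's STOP_WORDS
# plus the custom list defined above.
toy_stopwords = {"the", "is", "a", "of", "in", "et", "al", "doi"}

def simple_tokenizer(sentence, stopwords=toy_stopwords):
    """Lowercase, strip surrounding punctuation, and drop stop words.

    Unlike spacy_tokenizer, this does NOT lemmatize.
    """
    tokens = []
    for raw in sentence.split():
        token = raw.strip(string.punctuation).lower()
        if token and token not in stopwords:
            tokens.append(token)
    return " ".join(tokens)

# Made-up example sentence.
processed = simple_tokenizer("The structure of a viral protein is described in Smith et al.")
print(processed)
```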


# Sentence Tokenization

gensim's word2vec expects its corpus as a list of sentences, each given as a list of tokens. We therefore use spaCy's sentencizer to split the corpus (all the abstracts) into sentences, and then tokenize each sentence.

In [14]:
#sentence tokenization to prepare the corpus
abstracts = df_covid['abstract'].values

nlp = en_core_sci_lg.load(disable = ['ner', 'tagger'])
nlp.add_pipe(nlp.create_pipe('sentencizer'), before="parser")
word2vec_corpus = []

for i in tqdm(range(0, len(abstracts))):
    raw_text = abstracts[i]
    doc = nlp(raw_text)
    sentences = [sent.string.strip() for sent in doc.sents]
    
    for sent in sentences:
        processed_sent = spacy_tokenizer(sent)
        processed_sent_list = processed_sent.split()
        if processed_sent_list:  # skip sentences left empty after filtering
            word2vec_corpus.append(processed_sent_list)
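
The target shape of the corpus is a list of sentences, each a list of tokens. The sketch below produces that shape with a naive regex-based splitter; `naive_corpus` and the toy abstracts are assumptions made for illustration, and the regex is much cruder than spaCy's sentencizer.

```python
import re

def naive_corpus(abstracts):
    """Split abstracts into sentences, then each sentence into tokens."""
    corpus = []
    for text in abstracts:
        # Naive split on ., ! or ? followed by whitespace (an assumption;
        # the notebook uses spaCy's sentencizer instead).
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            # Keep alphanumeric tokens, allowing internal hyphens.
            tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sentence.lower())
            if tokens:  # skip empty sentences
                corpus.append(tokens)
    return corpus

toy_abstracts = [
    "Coronaviruses infect many hosts. Copy-choice recombination is common.",
    "Transmission was studied.",
]
toy_corpus = naive_corpus(toy_abstracts)
print(toy_corpus)
```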


Let's check out the corpus.

In [15]:
word2vec_corpus[:10]

[['nidovirus',
  'subgenomic',
  'mrnas',
  'contain',
  'leader',
  'sequence',
  'derive',
  '5′',
  'end',
  'genome',
  'fuse',
  'different',
  'sequence',
  '‘',
  'body',
  '’',
  'derive',
  '3′',
  'end'],
 ['generation',
  'involve',
  'unique',
  'mechanism',
  'discontinuous',
  'subgenomic',
  'rna',
  'synthesis',
  'resemble',
  'copy-choice',
  'rna',
  'recombination'],
 ['process',
  'nascent',
  'rna',
  'strand',
  'transfer',
  'site',
  'template',
  'plus',
  'minus',
  'strand',
  'synthesis',
  'yield',
  'subgenomic',
  'rna',
  'molecule'],
 ['central',
  'process',
  'transcription-regulating',
  'sequence',
  'trss',
  'present',
  'template',
  'site',
  'ensure',
  'fidelity',
  'strand',
  'transfer'],
 ['present',
  'result',
  'comprehensive',
  'co-variation',
  'mutagenesis',
  'study',
  'equine',
  'arteritis',
  'virus',
  'trss',
  'demonstrate',
  'discontinuous',
  'rna',
  'synthesis',
  'depend',
  'base',
  'pair',
  'sense',
  'leader',
  ...],
 ...]

# Word2vec Training

We train gensim's word2vec on the corpus we prepared. `min_count` is the minimum number of times a word must occur in the corpus to be assigned a vector. `size` is the dimensionality of the vectors produced (gensim 4 and later call this parameter `vector_size`). `workers` is the number of worker threads. `window` is the context size to consider. `sg=1` selects the skip-gram model. The values of `min_count`, `size` and `window` were chosen empirically.

In [16]:
# Train the gensim word2vec model on our own custom corpus
model = Word2Vec(word2vec_corpus, min_count=3, size=50, workers=4, window=5, sg=1)
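
To give an intuition for what `window` and `min_count` control, the sketch below generates the (center, context) pairs that skip-gram learns from. It is a simplified illustration, not gensim's implementation: `skipgram_pairs` is a hypothetical helper, and it ignores gensim's internal sampling and optimization details.

```python
from collections import Counter

def skipgram_pairs(corpus, window=2, min_count=1):
    """Build (center, context) pairs the way skip-gram frames its training data."""
    counts = Counter(word for sentence in corpus for word in sentence)
    pairs = []
    for sentence in corpus:
        # Words rarer than min_count are dropped before windows are formed.
        kept = [w for w in sentence if counts[w] >= min_count]
        for i, center in enumerate(kept):
            lo = max(0, i - window)
            hi = min(len(kept), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, kept[j]))
    return pairs

toy_corpus = [["rna", "strand", "transfer"], ["rna", "synthesis"]]
pairs_window_1 = skipgram_pairs(toy_corpus, window=1, min_count=1)
print(pairs_window_1)
```

A larger `window` pairs each word with more distant neighbours, while a larger `min_count` removes rare words, and with them their pairs, before training.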


# Word Centroid Similarity (WCS)

Now that we have word embeddings for our corpus, our approach is to measure the cosine similarity between the centroid of each document and the query. This idea is inspired by https://github.com/lgalke/vec4ir#word-centroid-similarity-wcs, which states:

>An intuitive approach to use word embeddings in information retrieval is the word centroid similarity (WCS). The representation for each document is the centroid of its respective word vectors. Since word vectors carry semantic information of the words, one could assume that the centroid of the word vectors within a document encodes its meaning to some extent. At query time, the centroid of the query’s word vectors is computed. The cosine similarity to the centroids of the (matching) documents is used as a measure of relevance. When the initial word frequencies of the queries and documents are first re-weighted according to inverse-document frequency (i.e. frequent words in the corpus are discounted), the technique is labeled IDF re-weighted word centroid similarity (IWCS).


For this purpose, we will calculate the centroid of each abstract from the vectors of all the words in the abstract.

In [17]:
# calculate the centroid for each abstract

a = [0.0]*50
df_covid["centroid"] = [a]*df_covid.shape[0]

for index, row in df_covid.iterrows():
    abstract = row['processed_abstract']
    words = abstract.split(" ")
    # Sum of the word vectors; it points in the same direction as their mean,
    # so the cosine similarity against it is the same.
    centroid = np.zeros(50)
    for word in words:
        try:
            b = model.wv[word]
        except KeyError:
            # word is not in the model's vocabulary (e.g. below min_count)
            continue
        centroid = np.add(centroid, b)

    df_covid.at[index,'centroid'] = centroid.tolist()

df_covid.head()

Unnamed: 0,paper_id,title,abstract,doi,processed_abstract,centroid
0,b2897e1277f56641193a6db73825f707eed3e4c9,Sequence requirements for RNA strand transfer ...,Nidovirus subgenomic mRNAs contain a leader se...,10.1093/emboj/20.24.7220,nidovirus subgenomic mrnas contain leader sequ...,"[42.89778371725697, 17.383791841566563, -46.81..."
1,e3d0d482ebd9a8ba81c254cc433f314142e72174,"Crystal structure of murine sCEACAM1a[1,4]: a ...",CEACAM1 is a member of the carcinoembryonic an...,10.1093/emboj/21.9.2076,ceacam1 member carcinoembryonic antigen cea fa...,"[30.22426011785865, 13.17461103014648, -33.199..."
2,00b1d99e70f779eb4ede50059db469c65e8c1469,Synthesis of a novel hepatitis C virus protein...,Hepatitis C virus (HCV) is an important human ...,10.1093/emboj/20.14.3840,hepatitis c virus hcv important human pathogen...,"[37.079803220462054, 12.635716843418777, -43.3..."
3,cf584e00f637cbd8f1bb35f3f09f5ed07b71aeb0,Structure of coronavirus main proteinase revea...,The key enzyme in coronavirus polyprotein proc...,10.1093/emboj/cdf327,key enzyme coronavirus polyprotein process vir...,"[33.258001547306776, 11.860749502433464, -31.2..."
4,dde02f11923815e6a16a31dd6298c46b109c5dfa,Discontinuous and non-discontinuous subgenomic...,"Arteri-, corona-, toro- and roniviruses are ev...",10.1093/emboj/cdf635,arteri- corona- toro- roniviruses evolutionari...,"[47.382160579785705, 14.095707904547453, -42.6..."
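
The quoted passage also mentions IDF re-weighting (IWCS), which this notebook does not apply. As a minimal sketch of what that variant could look like, the code below weights each word vector by a smoothed inverse document frequency before averaging. The helper names, the toy data and the specific smoothing formula are assumptions chosen for illustration.

```python
import math
from collections import Counter

def idf_weights(tokenized_docs):
    """Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(word for doc in tokenized_docs for word in set(doc))
    return {w: math.log((1 + n_docs) / (1 + df)) + 1.0 for w, df in doc_freq.items()}

def weighted_centroid(words, vectors, idf):
    """IDF-weighted average of the word vectors; zero vector if nothing matches."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for w in words:
        if w in vectors and w in idf:
            for i in range(dim):
                total[i] += idf[w] * vectors[w][i]
            weight_sum += idf[w]
    return [t / weight_sum for t in total] if weight_sum else total

# Toy data, invented for illustration: "virus" occurs in every document.
toy_docs = [["virus", "protein"], ["virus", "outbreak"], ["virus", "structure"]]
toy_vectors = {
    "virus": [1.0, 0.0],
    "protein": [0.0, 1.0],
    "outbreak": [1.0, 1.0],
    "structure": [0.0, 2.0],
}
idf = idf_weights(toy_docs)
iwcs_centroid = weighted_centroid(toy_docs[0], toy_vectors, idf)
print(iwcs_centroid)
```

Because "virus" appears in every toy document, it receives the lowest weight, so the centroid of the first document leans towards the rarer word "protein".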


# Ranking documents

Now we will create a function that, given a query, ranks the documents from most similar to least similar.

In [18]:
def rank_docs(model, query, df_covid, num):
    # Returns the top `num` papers as (title, doi, similarity) tuples.
    cosine_list = []

    # Map each query word to its vector, skipping out-of-vocabulary words.
    a = []
    query = query.split(" ")
    for q in query:
        try:
            a.append(model.wv[q])
        except KeyError:
            continue

    for index, row in df_covid.iterrows():
        centroid = row['centroid']
        centroid_norm = np.linalg.norm(centroid)
        total_sim = 0
        # Sum the cosine similarity of every query word with the centroid.
        for a_i in a:
            denom = np.linalg.norm(a_i) * centroid_norm
            if denom == 0:
                continue  # abstract has no known words; avoid NaN scores
            total_sim += np.dot(a_i, centroid) / denom
        cosine_list.append((row['title'], row['doi'], total_sim))

    cosine_list.sort(key=lambda x: x[2], reverse=True)  # in descending order

    return cosine_list[:num]
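
Since only the top few papers are needed, sorting the full list is not strictly necessary. A possible alternative, sketched below with a hypothetical `top_k` helper and made-up scores, is `heapq.nlargest`. The sketch also filters out NaN scores, which can arise from a zero-norm vector if the division is unguarded and which would otherwise disturb the ordering.

```python
import heapq
import math

def top_k(scored_items, k):
    """Return the k highest-scoring (title, doi, score) tuples, ignoring NaN scores."""
    valid = (item for item in scored_items if not math.isnan(item[2]))
    return heapq.nlargest(k, valid, key=lambda item: item[2])

# Made-up titles, DOIs and scores for illustration.
toy_scores = [
    ("paper A", "doi-a", 0.42),
    ("paper B", "doi-b", float("nan")),
    ("paper C", "doi-c", 0.91),
    ("paper D", "doi-d", 0.17),
]
best_two = top_k(toy_scores, 2)
print(best_two)
```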

# Saving the model and the dataframe

## Saving the model and the data

In [19]:
model.save("./model.model")
df_covid.to_pickle("./df_covid.pkl")

## Loading the model and the data
The save/load steps are done in order to avoid re-training the model each time.

In [20]:
saved_model = Word2Vec.load("./model.model")
saved_df_covid = pd.read_pickle("./df_covid.pkl")
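
The same "load if present, otherwise build and save" idea can be wrapped in a small helper. The sketch below uses `pickle` and a temporary directory so that it is self-contained; `load_or_build` and `expensive_build` are hypothetical names, and for the gensim model itself `model.save` and `Word2Vec.load`, as used above, remain the appropriate calls. Pickle files should only be loaded from trusted sources.

```python
import os
import pickle
import tempfile

def load_or_build(path, build_fn):
    """Load a pickled object from `path`, or build and save it if missing."""
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)
    obj = build_fn()
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)
    return obj

build_calls = []

def expensive_build():
    build_calls.append(1)  # track how often the slow path runs
    return {"centroids": [[0.1, 0.2], [0.3, 0.4]]}

with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "artifact.pkl")
    first = load_or_build(cache_path, expensive_build)   # builds and saves
    second = load_or_build(cache_path, expensive_build)  # loads from disk

print(first == second, len(build_calls))
```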

# Results

Now let's see some results from our model. The query() function takes the query string and the number of top matches to return, and displays the titles of the most relevant articles for that query. Clicking a title opens the paper in a new tab.

In [21]:
def query(the_query, top_matches=10):
    q = spacy_tokenizer(the_query)
    # Fall back to the saved copies if the in-memory objects are not defined.
    try:
        model_to_use = model
    except NameError:
        model_to_use = saved_model
    try:
        df_covid_to_use = df_covid
    except NameError:
        df_covid_to_use = saved_df_covid
    results = rank_docs(model_to_use, q, df_covid_to_use, top_matches)
    html = """
    <html>
    <style>
        body {
            margin: 0;
            padding: 0;
            background: #ffffff;
            font-family: sans-serif;
        }

        ul {
            position: relative;
            width: 100%;
            margin: 100px auto 0;
            padding: 20px;
            box-sizing: border-box;
            background: rgba(0, 0, 0, 0);
            box-shadow: inset 0 0 15px rgba(0, 0, 0, .2);
            border-radius: 5px;
            overflow: hidden;
        }

        ul li {
            display: flex;
            background: rgba(255, 0, 0, 0.25);
            padding: 10px 20px;
            color: #fff;
            margin: 5px 0;
            transition: .6s;
            border-radius: 5px;
        }

        ul li:hover {
            transform: scale(1.02);
            background: rgba(255, 0, 0, 0.5);
        }
    </style>
        
        <body>
        <ul>
    """
    for i in range(len(results)):
        paper_name = results[i][0]
        paper_doi = results[i][1]
        paper_link = "https://doi.org/" + str(paper_doi)
        html += """
                <li>
                <a href='""" + str(paper_link) + """' target="_blank">
                    <div>
                        <h3>
                            """ + str(i+1) + "&emsp;" +str(paper_name) + """ 
                        </h3>
                        <cite>
                            doi
                            <span >
                                > """ + str(paper_doi) + """
                            </span>
                        </cite>
                    </div>
                </a>
                </li>
        """
        
    html += "</ul></body></html>"
    display(HTML(html))
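
Paper titles can contain characters such as `<` or `&` that have special meaning in HTML. As a hedged sketch of one way to guard against that, the hypothetical helper `render_result` below escapes the title and DOI with the standard library's `html.escape` before building a list entry. The title and DOI in the example are placeholders.

```python
import html

def render_result(rank, title, doi):
    """Build one <li> entry with escaped text and a DOI link."""
    link = "https://doi.org/" + str(doi)
    return (
        '<li><a href="{link}" target="_blank">'
        "<h3>{rank}&emsp;{title}</h3>"
        "<cite>doi: {doi}</cite>"
        "</a></li>"
    ).format(
        link=html.escape(link, quote=True),
        rank=rank,
        title=html.escape(str(title)),
        doi=html.escape(str(doi)),
    )

# Placeholder title and DOI, invented for illustration.
entry = render_result(1, "Crystal structure <of> a toy & example", "10.0000/example")
print(entry)
```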

> # What do we know about virus genetics, origin, and evolution?

In [22]:
query('origin of coronavirus')

In [23]:
query('covid19 genetics', top_matches=7)

> # What is known about transmission, incubation, and environmental stability?

In [24]:
query('transmission')

In [25]:
query('incubation period of covid19')

In [26]:
query('environmental stability of coronavirus', top_matches=5)

> # What do we know about diagnostics and surveillance?

In [27]:
query('diagnostics')

## Thank you for your time. Your reviews and suggestions are highly appreciated!