# COVID-19-BERT-ResearchPapers-Semantic-Search

This work builds a **semantic search engine using BERT** that runs a query against the research papers provided as part of [Kaggle's CORD-19-research-challenge competition](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge). We would like to thank Kaggle and all of the competition sponsors for organizing this effort to fight the virus.

This work:
1.   first divides the dataset into paragraphs
2.   then uses BERT to embed the paragraphs of each paper with the pretrained bert-base-nli-mean-tokens model
3.   finally runs a query and returns the top 5 paragraphs together with their papers' title, abstract, and abstract_summary

We built this notebook to run seamlessly on Google Colab: it connects to Google Drive and downloads the data using the [Kaggle API](https://github.com/Kaggle/kaggle-api), so no data is downloaded to your device and you do not need a powerful GPU, since everything runs for free on Google Colab. We would like to thank Google for providing the research community with Google Colab.

**Code** is available [on GitHub](https://github.com/theamrzaki/COVID-19-BERT-ResearchPapers-Semantic-Search). We truly hope that this work has a positive impact in the fight against this evil virus, and we pray that everyone is able to win this fight.


**References** :

*   We use the [sentence-transformers](https://github.com/UKPLab/sentence-transformers) library provided by [UKPLab](https://github.com/UKPLab). It makes it easy to use BERT and other architectures such as ALBERT and XLNet for sentence embedding, and it also provides a simple interface to query and cluster data.
*   We used the data-processing code from [maksimeren](https://www.kaggle.com/maksimeren/covid-19-literature-clustering), and we would like to thank him.
*   We used the way of drawing BERT discussed by [Jay Alammar](http://jalammar.github.io/) to illustrate how our architecture works; his blog posts are extremely informative and easy to understand.
*   We used pre-trained models trained on NLI data. Conneau et al., 2017, show in the InferSent paper (Supervised Learning of Universal Sentence Representations from Natural Language Inference Data) that training on Natural Language Inference (NLI) data can produce universal sentence embeddings.






## Architecture

Each paper in the dataset is stored as JSON that is already divided into paragraphs. We keep this division and pass the paragraphs to the pre-trained BERT model [bert-base-nli-mean-tokens](https://github.com/UKPLab/sentence-transformers/blob/master/docs/pretrained-models/nli-models.md) to be embedded.


![alt text](https://github.com/theamrzaki/COVID-19-BERT-ResearchPapers-Semantic-Search/blob/master/assets/Bert%20Information%20Retrival_Train.jpg?raw=true)
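The paragraph split described above can be sketched on a toy record. The dictionary below is a made-up stand-in that only mimics the `body_text` layout this notebook reads from each JSON file, and the helper name `paragraphs_of` is hypothetical.

```python
import json

# A made-up record that mimics the fields this notebook reads from each JSON file.
raw = json.dumps({
    "paper_id": "toy-0001",
    "abstract": [{"text": "A short abstract."}],
    "body_text": [
        {"text": "First paragraph of the paper."},
        {"text": "Second paragraph of the paper."},
    ],
})


def paragraphs_of(record):
    """Return the body paragraphs of one parsed paper, in document order."""
    return [entry["text"] for entry in record["body_text"]]


paper = json.loads(raw)
paragraphs = paragraphs_of(paper)
print(len(paragraphs), "paragraphs for", paper["paper_id"])
```

Each element of `paragraphs` is one unit that would later be embedded on its own.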


After the paragraphs are embedded, we embed the query using the same BERT model.

We then compare the two embedding representations (paragraphs and query) using cosine similarity and return the most similar paragraphs with their paper details (title, abstract, abstract_summary).

![alt text](https://github.com/theamrzaki/COVID-19-BERT-ResearchPapers-Semantic-Search/blob/master/assets/Bert%20Information%20Retrival_Test.jpg?raw=true)
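The comparison step can be illustrated with a minimal sketch of cosine-similarity ranking in NumPy. The function name `top_k_cosine` and the 3-dimensional toy vectors are assumptions made for illustration; they are not the notebook's actual embeddings or query code.

```python
import numpy as np


def top_k_cosine(query_vec, paragraph_vecs, k=5):
    """Return (indices, scores) of the k paragraphs most similar to the query.

    Assumes that no vector is all zeros, since the norms are used as divisors.
    """
    query = np.asarray(query_vec, dtype=float)
    matrix = np.asarray(paragraph_vecs, dtype=float)
    # Cosine similarity = dot product divided by the product of the norms.
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    scores = matrix @ query / norms
    # Sort by descending score and keep the first k positions.
    order = np.argsort(-scores)[:k]
    return order, scores[order]


# Toy 3-dimensional "embeddings"; real BERT sentence vectors are much longer.
paragraph_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
query_vec = [2.0, 0.0, 0.0]
indices, scores = top_k_cosine(query_vec, paragraph_vecs, k=2)
print(indices, scores)
```

Because cosine similarity ignores vector length, the query `[2, 0, 0]` matches the first paragraph vector exactly even though their magnitudes differ.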

In [None]:
#first install the library that would help us use BERT in an easy to use interface
#https://github.com/UKPLab/sentence-transformers/tree/master/sentence_transformers
!pip install -U sentence-transformers

In [None]:
# download the Kaggle data to Google Colab
# https://github.com/Kaggle/kaggle-api#api-credentials
!pip install kaggle
import os
# the Kaggle CLI reads its API token from /root/.kaggle/kaggle.json
!mkdir -p /root/.kaggle
!cp "/content/kaggle.json" /root/.kaggle/
!chmod 600 /root/.kaggle/kaggle.json
!kaggle datasets download -d allen-institute-for-ai/CORD-19-research-challenge
!unzip CORD-19-research-challenge.zip -d /content/CORD-19-research-challenge

## Data Processing

Built using the code from
https://www.kaggle.com/maksimeren/covid-19-literature-clustering

In [None]:
import glob
import json
import pandas as pd
from tqdm import tqdm
root_path = '/content/CORD-19-research-challenge/'
all_json = glob.glob(f'{root_path}/**/*.json', recursive=True)
len(all_json)

In [None]:
metadata_path = f'{root_path}/metadata.csv'
meta_df = pd.read_csv(metadata_path, dtype={
    'pubmed_id': str,
    'Microsoft Academic Paper ID': str, 
    'doi': str
})
meta_df.head()

Unnamed: 0,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_full_text,full_text_file
0,,Elsevier,Intrauterine virus infections and congenital h...,10.1016/0002-8703(72)90077-4,,4361535,els-covid,Abstract The etiologic basis for the vast majo...,1972-12-31,"Overall, James C.",American Heart Journal,,,False,custom_license
1,,Elsevier,Coronaviruses in Balkan nephritis,10.1016/0002-8703(80)90355-5,,6243850,els-covid,,1980-03-31,"Georgescu, Leonida; Diosi, Peter; Buţiu, Ioan;...",American Heart Journal,,,False,custom_license
2,,Elsevier,Cigarette smoking and coronary heart disease: ...,10.1016/0002-8703(80)90356-7,,7355701,els-covid,,1980-03-31,"Friedman, Gary D",American Heart Journal,,,False,custom_license
3,aecbc613ebdab36753235197ffb4f35734b5ca63,Elsevier,Clinical and immunologic studies in identical ...,10.1016/0002-9343(73)90176-9,,4579077,els-covid,"Abstract Middle-aged female identical twins, o...",1973-08-31,"Brunner, Carolyn M.; Horwitz, David A.; Shann,...",The American Journal of Medicine,,,True,custom_license
4,,Elsevier,Epidemiology of community-acquired respiratory...,10.1016/0002-9343(85)90361-4,,4014285,els-covid,Abstract Upper respiratory tract infections ar...,1985-06-28,"Garibaldi, Richard A.",The American Journal of Medicine,,,False,custom_license


### Read Data (Helpers)

In [None]:
class FileReader:
    def __init__(self, file_path):
        with open(file_path) as file:
            content = json.load(file)
            self.paper_id = content['paper_id']
            self.abstract = []
            self.body_text = []
            # Abstract
            for entry in content['abstract']:
                self.abstract.append(entry['text'])
            # Body text
            for entry in content['body_text']:
                self.body_text.append(entry['text'])
            self.abstract = '\n'.join(self.abstract)
            self.body_text = '\n'.join(self.body_text)
    def __repr__(self):
        return f'{self.paper_id}: {self.abstract[:200]}... {self.body_text[:200]}...'
first_row = FileReader(all_json[0])
print(first_row)

dcd7a1235ea74e3ef71d051103bf8a64c3c8f457: 12 Background 13 After the outbreak of novel coronavirus (2019-nCoV) starting in late 2019, a number 14 of researchers have reported the predicted the virus transmission dynamics. However, 15 under th... A novel coronavirus (2019-nCoV) appeared in December 2019 in Wuhan, Hubei 33 Province in central China had triggered city closure on Jan. 23, 2020, and lockdown 34 of all major cities in the province ...


In [None]:
def get_breaks(content, length):
    data = ""
    words = content.split(' ')
    total_chars = 0

    # add break every length characters
    for i in range(len(words)):
        total_chars += len(words[i])
        if total_chars > length:
            data = data + "<br>" + words[i]
            total_chars = 0
        else:
            data = data + " " + words[i]
    return data
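As a point of comparison, a similar line-breaking effect can be sketched with the standard library's `textwrap`. This is a different technique from `get_breaks`, not a drop-in replacement: `get_breaks` counts only the characters of the words and leaves a leading space, whereas this hypothetical `wrap_with_br` helper counts spaces as well and caps every line at `width` characters.

```python
import textwrap


def wrap_with_br(content, width):
    """Join lines of at most `width` characters with an HTML line break."""
    return "<br>".join(textwrap.wrap(content, width=width))


wrapped = wrap_with_br("semantic search over research paper paragraphs", 20)
print(wrapped)
```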

In [None]:
dict_ = {'paper_id': [], 'abstract': [], 'body_text': [], 'authors': [], 'title': [], 'journal': [], 'abstract_summary': []}
for idx, entry in enumerate(all_json):
    if idx % (len(all_json) // 10) == 0:
        print(f'Processing index: {idx} of {len(all_json)}')
    content = FileReader(entry)
    
    # get metadata information
    meta_data = meta_df.loc[meta_df['sha'] == content.paper_id]
    # no metadata, skip this paper
    if len(meta_data) == 0:
        continue
    
    dict_['paper_id'].append(content.paper_id)
    dict_['abstract'].append(content.abstract)
    dict_['body_text'].append(content.body_text)
    
    # also create a column for the summary of abstract to be used in a plot
    if len(content.abstract) == 0: 
        # no abstract provided
        dict_['abstract_summary'].append("Not provided.")
    elif len(content.abstract.split(' ')) > 100:
        # abstract provided is too long for plot, take first 100 words append with ...
        info = content.abstract.split(' ')[:100]
        summary = get_breaks(' '.join(info), 40)
        dict_['abstract_summary'].append(summary + "...")
    else:
        # abstract is short enough
        summary = get_breaks(content.abstract, 40)
        dict_['abstract_summary'].append(summary)
        
    # meta_data was already looked up above and is reused for authors, title and journal
    
    try:
        # if more than one author
        authors = meta_data['authors'].values[0].split(';')
        if len(authors) > 2:
            # more than 2 authors, may be problem when plotting, so take first 2 append with ...
            dict_['authors'].append(". ".join(authors[:2]) + "...")
        else:
            # authors will fit in plot
            dict_['authors'].append(". ".join(authors))
    except Exception as e:
        # authors is null (NaN), so .split is not available; keep the raw value
        dict_['authors'].append(meta_data['authors'].values[0])
    
    # add the title information, add breaks when needed
    try:
        title = get_breaks(meta_data['title'].values[0], 40)
        dict_['title'].append(title)
    # if title was not provided
    except Exception as e:
        dict_['title'].append(meta_data['title'].values[0])
    
    # add the journal information
    dict_['journal'].append(meta_data['journal'].values[0])
    
df_covid = pd.DataFrame(dict_, columns=['paper_id', 'abstract', 'body_text', 'authors', 'title', 'journal', 'abstract_summary'])
df_covid.head()

Processing index: 0 of 29315
Processing index: 2931 of 29315
Processing index: 5862 of 29315
Processing index: 8793 of 29315
Processing index: 11724 of 29315
Processing index: 14655 of 29315
Processing index: 17586 of 29315
Processing index: 20517 of 29315
Processing index: 23448 of 29315
Processing index: 26379 of 29315
Processing index: 29310 of 29315


Unnamed: 0,paper_id,abstract,body_text,authors,title,journal,abstract_summary
0,dcd7a1235ea74e3ef71d051103bf8a64c3c8f457,12 Background 13 After the outbreak of novel c...,A novel coronavirus (2019-nCoV) appeared in De...,Xinhai Li. Xumao Zhao...,The lockdown of Hubei Province causing<br>dif...,,12 Background 13 After the outbreak of novel<...
1,86b6b0c1b2777541feb83116bcb7a5cb12a52310,,Firstly informed to World Health Organization ...,"Jung, Y. J.. Park, G.-S....",Comparative analysis of primer-probe sets for...,,Not provided.
2,73d80c8f5780d70bd8d343188c56e898e91557b6,Middle East respiratory syndrome coronavirus (...,Coronaviruses (CoVs) comprise a family of enve...,"Straus, M. R.. Tang, T....",Ca2+ ions promote fusion of Middle East<br>Re...,,Middle East respiratory syndrome coronavirus<...
3,70cc2e5152d3dc4d44494124ff556c9bbe9e6f41,"1 Background: A new virus broke out in Wuhan, ...","In December 2019, a new type of unexplained pn...",Yafei Wang. Ying Zhou...,Clinical Characteristics of Patients with<br>...,,"1 Background: A new virus broke out in Wuhan,..."
4,3b22eecad8a582436c52284a4db2198a98a94e18,The host antiviral response involves the induc...,Respiratory syncytial virus (RSV) belongs to t...,"Robitaille, A. C.. Caron, E....","DUSP1 regulates apoptosis and cell migration,...",,The host antiviral response involves the<br>i...
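The per-paper metadata lookup in the loop above can also be expressed as a single join. The sketch below uses made-up toy frames rather than the real `metadata.csv`, and it is an alternative formulation, not the notebook's method. One difference to keep in mind: the loop takes the first matching metadata row, while a merge would emit one row per match if a `sha` were repeated.

```python
import pandas as pd

# Toy stand-ins for the parsed papers and for metadata.csv (values are made up).
papers = pd.DataFrame({
    "paper_id": ["aaa", "bbb", "ccc"],
    "body_text": ["text a", "text b", "text c"],
})
meta = pd.DataFrame({
    "sha": ["aaa", "ccc", None],
    "title": ["Title A", "Title C", "No full text"],
    "journal": ["Journal A", None, "Journal X"],
})

# An inner join keeps only papers whose paper_id appears as a sha in the metadata,
# which mirrors the `continue` for papers without metadata in the loop above.
joined = papers.merge(
    meta.dropna(subset=["sha"]), left_on="paper_id", right_on="sha", how="inner"
).drop(columns="sha")
print(joined)
```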


### Handle Possible Duplicates

In [None]:
df_covid.drop_duplicates(['abstract', 'body_text'], inplace=True)
df_covid['abstract'].describe(include='all')

count     27663
unique    20191
top            
freq       7444
Name: abstract, dtype: object

In [None]:
df_covid['body_text'].describe(include='all')

count                                                 27663
unique                                                27662
top       In a global world, knowledge of imported infec...
freq                                                      2
Name: body_text, dtype: object
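The `freq` of 2 for `body_text` may look surprising right after dropping duplicates. One possible explanation, which the notebook itself does not confirm, is that `drop_duplicates` with a column subset only removes rows where every listed column matches, so a body text can survive twice when the abstracts differ. A toy example with made-up values:

```python
import pandas as pd

toy = pd.DataFrame({
    "abstract": ["abs 1", "abs 1", "abs 2"],
    "body_text": ["same body", "same body", "same body"],
})
# Rows 0 and 1 match on both columns, so one is dropped; row 2 differs in abstract.
deduped = toy.drop_duplicates(["abstract", "body_text"])
print(deduped["body_text"].describe())
```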

### Take a Look at the Data

In [None]:
df_covid.head()

Unnamed: 0,paper_id,abstract,body_text,authors,title,journal,abstract_summary
0,dcd7a1235ea74e3ef71d051103bf8a64c3c8f457,12 Background 13 After the outbreak of novel c...,A novel coronavirus (2019-nCoV) appeared in De...,Xinhai Li. Xumao Zhao...,The lockdown of Hubei Province causing<br>dif...,,12 Background 13 After the outbreak of novel<...
1,86b6b0c1b2777541feb83116bcb7a5cb12a52310,,Firstly informed to World Health Organization ...,"Jung, Y. J.. Park, G.-S....",Comparative analysis of primer-probe sets for...,,Not provided.
2,73d80c8f5780d70bd8d343188c56e898e91557b6,Middle East respiratory syndrome coronavirus (...,Coronaviruses (CoVs) comprise a family of enve...,"Straus, M. R.. Tang, T....",Ca2+ ions promote fusion of Middle East<br>Re...,,Middle East respiratory syndrome coronavirus<...
3,70cc2e5152d3dc4d44494124ff556c9bbe9e6f41,"1 Background: A new virus broke out in Wuhan, ...","In December 2019, a new type of unexplained pn...",Yafei Wang. Ying Zhou...,Clinical Characteristics of Patients with<br>...,,"1 Background: A new virus broke out in Wuhan,..."
4,3b22eecad8a582436c52284a4db2198a98a94e18,The host antiviral response involves the induc...,Respiratory syncytial virus (RSV) belongs to t...,"Robitaille, A. C.. Caron, E....","DUSP1 regulates apoptosis and cell migration,...",,The host antiviral response involves the<br>i...


In [None]:
df_covid.describe()

Unnamed: 0,paper_id,abstract,body_text,authors,title,journal,abstract_summary
count,27663,27663.0,27663,26917,27619,26769,27663
unique,27663,20191.0,27662,25568,27240,3323,20184
top,ce248b901191d45f3e56f1e6664a0239738aa148,,"In a global world, knowledge of imported infec...","Domingo, Esteban",Index,PLoS One,Not provided.
freq,1,7444.0,2,14,68,1511,7444


### Data Pre-Process

In [None]:
df_covid.dropna(inplace=True)
df_covid.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26043 entries, 885 to 27677
Data columns (total 7 columns):
paper_id            26043 non-null object
abstract            26043 non-null object
body_text           26043 non-null object
authors             26043 non-null object
title               26043 non-null object
journal             26043 non-null object
abstract_summary    26043 non-null object
dtypes: object(7)
memory usage: 1.6+ MB


In [None]:
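# keep the first 12500 papers; .copy() avoids pandas' SettingWithCopyWarning in the cells below
df_covid = df_covid.head(12500).copy()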

In [None]:
import re

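# keep only letters, digits and whitespace
# note: the range A-z (rather than A-Z) also matches [ \ ] ^ _ and `, so those characters are kept
df_covid['body_text'] = df_covid['body_text'].apply(lambda x: re.sub(r'[^a-zA-z0-9\s]', '', x))
df_covid['abstract'] = df_covid['abstract'].apply(lambda x: re.sub(r'[^a-zA-z0-9\s]', '', x))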
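The character class in the cell above is written with the range `A-z`, which in ASCII also covers `[`, `\`, `]`, `^`, `_` and the backtick. That is why citation markers such as `[1]` are still visible in the cleaned text further down. The sketch below compares that pattern with a stricter `A-Z` variant on a made-up sentence; whether the stricter form is preferable for this dataset is a judgment call.

```python
import re

sample = "Foot-and-mouth disease (FMD) [1] has a_b"

# Pattern as written in the notebook: brackets and underscores survive.
as_written = re.sub(r"[^a-zA-z0-9\s]", "", sample)
# Stricter variant: only letters, digits and whitespace survive.
strict = re.sub(r"[^a-zA-Z0-9\s]", "", sample)

print(as_written)
print(strict)
```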

In [None]:
def lower_case(input_str):
    input_str = input_str.lower()
    return input_str

df_covid['body_text'] = df_covid['body_text'].apply(lambda x: lower_case(x))
df_covid['abstract'] = df_covid['abstract'].apply(lambda x: lower_case(x))

In [None]:
df_covid.head(4)

Unnamed: 0,paper_id,abstract,body_text,authors,title,journal,abstract_summary
885,db5333b01a10f165ae516d30f9d1fbf96ab4b841,footandmouth disease virus fmdv represses host...,footandmouth disease fmd an acute highly conta...,"Gao, Yuan. Sun, Shi-Qi...",Biological function of Foot-and-mouth<br>dise...,Virol J,Foot-and-mouth disease virus (FMDV)<br>repres...
886,335b0a3f21f764adcbe20ff71e422d823c410098,background gray wolves canis lupus were reintr...,several highmortality disease outbreaks among ...,"Almberg, Emily S.. Mech, L. David...",A Serological Survey of Infectious Disease in...,PLoS One,Background: Gray wolves (Canis lupus) were<br...
887,bad0e9f737316570c33138d5cc95cc233cd937ab,in niger acute respiratory infections aris are...,acute respiratory infections aris are responsi...,"Lagare, Adamou. Ousmane, Sani...",Molecular detection of respiratory pathogens<...,Health Sci Rep,"In Niger, acute respiratory infections (ARIs)..."
888,007bf75961da42a7e0cc8e2855e5c208a5ec65c1,the hemagglutininesterases hes envelope glycop...,to initiate infection viruses must bind to an ...,"Langereis, Martijn A.. Zeng, Qinghong...",The Murine Coronavirus<br>Hemagglutinin-ester...,PLoS Pathog,"The hemagglutinin-esterases (HEs), envelope<b..."


In [None]:
df_covid.to_csv("/content/drive/My Drive/BertSentenceSimilarity/Data/covid.csv")

In [None]:
df_covid_test = pd.read_csv("/content/drive/My Drive/BertSentenceSimilarity/Data/covid.csv")
text = df_covid_test.drop(["authors", "journal", "Unnamed: 0"], axis=1)
text.head(5)

Unnamed: 0,paper_id,abstract,body_text,title,abstract_summary
0,db5333b01a10f165ae516d30f9d1fbf96ab4b841,footandmouth disease virus fmdv represses host...,footandmouth disease fmd an acute highly conta...,Biological function of Foot-and-mouth<br>dise...,Foot-and-mouth disease virus (FMDV)<br>repres...
1,335b0a3f21f764adcbe20ff71e422d823c410098,background gray wolves canis lupus were reintr...,several highmortality disease outbreaks among ...,A Serological Survey of Infectious Disease in...,Background: Gray wolves (Canis lupus) were<br...
2,bad0e9f737316570c33138d5cc95cc233cd937ab,in niger acute respiratory infections aris are...,acute respiratory infections aris are responsi...,Molecular detection of respiratory pathogens<...,"In Niger, acute respiratory infections (ARIs)..."
3,007bf75961da42a7e0cc8e2855e5c208a5ec65c1,the hemagglutininesterases hes envelope glycop...,to initiate infection viruses must bind to an ...,The Murine Coronavirus<br>Hemagglutinin-ester...,"The hemagglutinin-esterases (HEs), envelope<b..."
4,d6a325260dac29bfe718f1e57160583cb23b5908,emerging evidence suggests that dipeptidyl pep...,the global burden of diabetes is escalating at...,The role of renal dipeptidyl peptidase-4 in<b...,Emerging evidence suggests that dipeptidyl<br...


In [None]:
text_dict = text.to_dict()
len_text = len(text_dict["paper_id"])

In [None]:
paper_id_list  = []
body_text_list = []

title_list = []
abstract_list = []
abstract_summary_list = []
for i in tqdm(range(0,len_text)):
  paper_id = text_dict["paper_id"][i]
  body_text = text_dict["body_text"][i].split("\n")
  title = text_dict["title"][i]
  abstract = text_dict["abstract"][i]
  abstract_summary = text_dict["abstract_summary"][i]
  for b in body_text:
    paper_id_list.append(paper_id)
    body_text_list.append(b)
    title_list.append(title)
    abstract_list.append(abstract)
    abstract_summary_list.append(abstract_summary)

100%|██████████| 12500/12500 [00:00<00:00, 23067.45it/s]
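The loop above repeats each paper's fields once per paragraph. As a hedged alternative, the same reshaping can be sketched with pandas' `str.split` followed by `DataFrame.explode` (available in pandas 0.25 and later). The toy frame is made up and only carries two of the notebook's columns besides `body_text`.

```python
import pandas as pd

toy = pd.DataFrame({
    "paper_id": ["p1", "p2"],
    "title": ["Title 1", "Title 2"],
    "body_text": ["para a\npara b", "para c"],
})

# Split each body into a list of paragraphs, then emit one row per list element.
paragraph_rows = (
    toy.assign(body_text=toy["body_text"].str.split("\n"))
       .explode("body_text")
       .reset_index(drop=True)
)
print(paragraph_rows)
```

The other columns are copied onto every paragraph row automatically, which is what the parallel lists in the loop achieve by hand.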


In [None]:
df_sentences = pd.DataFrame({"paper_id":paper_id_list},index=body_text_list)
df_sentences.to_csv("/content/drive/My Drive/BertSentenceSimilarity/Data/covid_sentences.csv")
df_sentences.head()

In [None]:
df_sentences = pd.DataFrame({"paper_id":paper_id_list,"title":title_list,"abstract":abstract_list,"abstract_summary":abstract_summary_list},index=body_text_list)
df_sentences.to_csv("/content/drive/My Drive/BertSentenceSimilarity/Data/covid_sentences_Full.csv")
df_sentences.head()

Unnamed: 0,paper_id,title,abstract,abstract_summary
footandmouth disease fmd an acute highly contagious viral disease in susceptible clovenhoofed animals was described 100 years ago the etiologic agent fmd virus fmdv is a positivesense singlestranded rna virus that belongs to the aphthovirus genus picornaviridae family fmdv is one of the most contagious viruses in clovenhoofed animals and can cause both acute and prolonged asymptomatic but persistent infection [1] upon infection of susceptible species fmdv proliferates rapidly and causes vesicular disease in feet and mouth,db5333b01a10f165ae516d30f9d1fbf96ab4b841,Biological function of Foot-and-mouth<br>dise...,footandmouth disease virus fmdv represses host...,Foot-and-mouth disease virus (FMDV)<br>repres...
the rna virus genome of fmdv displays a very high mutation rate because the virusencoded rna polymerase lacks a proofreading mechanism [2 3] the high mutation rate of fmdv coupled with its rapid proliferation and extensive population result in the rapid evolution of this virus [4] which contributes to the existence of seven main serotypes a o c asia1 south african territories sat 1 sat2 and sat3 in addition numerous variants and subtypes have been further evolved from each serotype [1] given that crossreactivity varies antigenic diversity among these serotypes have to be considered during vaccine development [5],db5333b01a10f165ae516d30f9d1fbf96ab4b841,Biological function of Foot-and-mouth<br>dise...,footandmouth disease virus fmdv represses host...,Foot-and-mouth disease virus (FMDV)<br>repres...
fmdv virion has a symmetric protein shell or capsid enclosing the genomic rna genome rna contains a positive singlestrand chain approximately 83 kb long and encodes a single long open reading frame orf of about 7 kb with two alternative initiation sites the orf is flanked by a long 5untranslated region 5utr and a short 3utr and ends with a genetically encoded polya tail [6] a genomelinked viral nonstructural protein nsp 3b also known as vpg containing 2324 amino acid aa residues is covalently bound to its 5 end although this protein is rapidly released into an infected cell and is deemed to play no part in translation initiation [7] the viral orf can be translated into a polyprotein of about 250 kda which is subsequently cleaved by two virusencoded proteinases leader l pro and 3c pro to yield structural and nsps [8 9] fig 1,db5333b01a10f165ae516d30f9d1fbf96ab4b841,Biological function of Foot-and-mouth<br>dise...,footandmouth disease virus fmdv represses host...,Foot-and-mouth disease virus (FMDV)<br>repres...
the fmdv genome was completely sequenced and all cleavage sites involved in the processing of polypeptides were also identified in the past two decades generally the orf region in fmdv genome is artificially divided into four functional areas due to the different functions of mature polypeptides [10] which are shown as follows fig 1 l region which is located at 5 end to the capsid component and codes for l pro p1 region encoding a precursor for capsid polypeptide which can generate four mature capsid proteins vp4 vp2 vp3 and vp1 upon cleavage by viral protease p2 region encodes three viral proteins 2a 2b and 2c in the middle region of the genome and p3 region which encodes four viral proteins 3a 3b 3c pro and 3d pol in which 3c is a viral protease and 3d an rnadependent rna polymerase [11] actually primary polyprotein is not strictly processed into four products as the functional regions by initial protease but l pro p12a 2bc and p3 by l pro 2a and 3c pro the precursors p12a 2bc and p3 are further processed into mature viral proteins and some cleavage intermediates with relative stability such as vp0 or 1ab 3abc 3bcd 3ab and 3cd by 3c pro fig 1 usually the intermediates may perform functions other than those of their individual constituents with two alternative initiation sites the orf is flanked by a long 5untranslated region 5utr and a short 3utr 3b vpg is covalently bound to its 5 end the orf region is generally divided into four functional areas l p1 p2 and p3 due to the different functions of mature polypeptides orfencoded polyprotein is processed into four products l pro p12a 2bc and p3 by l pro 2a and 3c pro the precursors p12a 2bc and p3 are further processed into mature viral proteins and some cleavage intermediates with relative stability such as vp0 or 1ab 3abc 3bcd 3ab and 3cd by 3c pro structural proteins form the biological protomer and viral capsid,db5333b01a10f165ae516d30f9d1fbf96ab4b841,Biological function of Foot-and-mouth<br>dise...,footandmouth disease virus fmdv represses host...,Foot-and-mouth disease virus (FMDV)<br>repres...
the virus capsid consists of 60 copies of each of the four structural polypeptides vp1 to vp4 which are selfassembled into an icosahedral structure with a diameter of 30 nm [12 13] fig 1 studies on structural information and protein interaction have shown that the structural protein or the precursor products vp0 vp24 or 1ab vp1 1d and vp3 1c which are encoded by p1 region form immature protomers through weak chemical bond interaction then pentamers are assembled by five protomers [14] after selfassembly of pentamers to generate an empty capsid the viral genomic rna covalently linked to vpg at the 5 end enters the capsid to produce provirion then the provirion is eventually processed into a mature virion following the rnatriggered autocleavage of vp0 [15] finally the virion particles with complete assembly are released from the infected host cells fig 2,db5333b01a10f165ae516d30f9d1fbf96ab4b841,Biological function of Foot-and-mouth<br>dise...,footandmouth disease virus fmdv represses host...,Foot-and-mouth disease virus (FMDV)<br>repres...


## Preparing Data for Embedding

In [None]:
import pandas as pd
from tqdm import tqdm

df_sentences = pd.read_csv("/content/drive/My Drive/BertSentenceSimilarity/Data/covid_sentences.csv")
df_sentences = df_sentences.set_index("Unnamed: 0")

In [None]:
df_sentences.head()

Unnamed: 0_level_0,paper_id
Unnamed: 0,Unnamed: 1_level_1
footandmouth disease fmd an acute highly contagious viral disease in susceptible clovenhoofed animals was described 100 years ago the etiologic agent fmd virus fmdv is a positivesense singlestranded rna virus that belongs to the aphthovirus genus picornaviridae family fmdv is one of the most contagious viruses in clovenhoofed animals and can cause both acute and prolonged asymptomatic but persistent infection [1] upon infection of susceptible species fmdv proliferates rapidly and causes vesicular disease in feet and mouth,db5333b01a10f165ae516d30f9d1fbf96ab4b841
the rna virus genome of fmdv displays a very high mutation rate because the virusencoded rna polymerase lacks a proofreading mechanism [2 3] the high mutation rate of fmdv coupled with its rapid proliferation and extensive population result in the rapid evolution of this virus [4] which contributes to the existence of seven main serotypes a o c asia1 south african territories sat 1 sat2 and sat3 in addition numerous variants and subtypes have been further evolved from each serotype [1] given that crossreactivity varies antigenic diversity among these serotypes have to be considered during vaccine development [5],db5333b01a10f165ae516d30f9d1fbf96ab4b841
fmdv virion has a symmetric protein shell or capsid enclosing the genomic rna genome rna contains a positive singlestrand chain approximately 83 kb long and encodes a single long open reading frame orf of about 7 kb with two alternative initiation sites the orf is flanked by a long 5untranslated region 5utr and a short 3utr and ends with a genetically encoded polya tail [6] a genomelinked viral nonstructural protein nsp 3b also known as vpg containing 2324 amino acid aa residues is covalently bound to its 5 end although this protein is rapidly released into an infected cell and is deemed to play no part in translation initiation [7] the viral orf can be translated into a polyprotein of about 250 kda which is subsequently cleaved by two virusencoded proteinases leader l pro and 3c pro to yield structural and nsps [8 9] fig 1,db5333b01a10f165ae516d30f9d1fbf96ab4b841
the fmdv genome was completely sequenced and all cleavage sites involved in the processing of polypeptides were also identified in the past two decades generally the orf region in fmdv genome is artificially divided into four functional areas due to the different functions of mature polypeptides [10] which are shown as follows fig 1 l region which is located at 5 end to the capsid component and codes for l pro p1 region encoding a precursor for capsid polypeptide which can generate four mature capsid proteins vp4 vp2 vp3 and vp1 upon cleavage by viral protease p2 region encodes three viral proteins 2a 2b and 2c in the middle region of the genome and p3 region which encodes four viral proteins 3a 3b 3c pro and 3d pol in which 3c is a viral protease and 3d an rnadependent rna polymerase [11] actually primary polyprotein is not strictly processed into four products as the functional regions by initial protease but l pro p12a 2bc and p3 by l pro 2a and 3c pro the precursors p12a 2bc and p3 are further processed into mature viral proteins and some cleavage intermediates with relative stability such as vp0 or 1ab 3abc 3bcd 3ab and 3cd by 3c pro fig 1 usually the intermediates may perform functions other than those of their individual constituents with two alternative initiation sites the orf is flanked by a long 5untranslated region 5utr and a short 3utr 3b vpg is covalently bound to its 5 end the orf region is generally divided into four functional areas l p1 p2 and p3 due to the different functions of mature polypeptides orfencoded polyprotein is processed into four products l pro p12a 2bc and p3 by l pro 2a and 3c pro the precursors p12a 2bc and p3 are further processed into mature viral proteins and some cleavage intermediates with relative stability such as vp0 or 1ab 3abc 3bcd 3ab and 3cd by 3c pro structural proteins form the biological protomer and viral capsid,db5333b01a10f165ae516d30f9d1fbf96ab4b841
the virus capsid consists of 60 copies of each of the four structural polypeptides vp1 to vp4 which are selfassembled into an icosahedral structure with a diameter of 30 nm [12 13] fig 1 studies on structural information and protein interaction have shown that the structural protein or the precursor products vp0 vp24 or 1ab vp1 1d and vp3 1c which are encoded by p1 region form immature protomers through weak chemical bond interaction then pentamers are assembled by five protomers [14] after selfassembly of pentamers to generate an empty capsid the viral genomic rna covalently linked to vpg at the 5 end enters the capsid to produce provirion then the provirion is eventually processed into a mature virion following the rnatriggered autocleavage of vp0 [15] finally the virion particles with complete assembly are released from the infected host cells fig 2,db5333b01a10f165ae516d30f9d1fbf96ab4b841


In [None]:
df_sentences = df_sentences["paper_id"].to_dict()
df_sentences_list = list(df_sentences.keys())
len(df_sentences_list)

403341
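Because a dictionary holds each key only once, converting a paragraph-indexed column with `to_dict()` collapses paragraphs whose text is identical, so the key count above may be smaller than the number of rows in the CSV. A small sketch with made-up values shows the effect:

```python
import pandas as pd

# Toy frame indexed by paragraph text; the first and last rows share the same text.
toy = pd.DataFrame(
    {"paper_id": ["p1", "p2", "p3"]},
    index=["shared paragraph", "unique paragraph", "shared paragraph"],
)

# Later rows overwrite earlier ones that have the same key.
by_text = toy["paper_id"].to_dict()
print(len(toy), "rows ->", len(by_text), "keys")
```

A side effect worth knowing is that a repeated paragraph ends up attributed to the last paper that contains it.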

In [None]:
list(df_sentences.keys())[:5]

['footandmouth disease fmd an acute highly contagious viral disease in susceptible clovenhoofed animals was described 100 years ago the etiologic agent fmd virus fmdv is a positivesense singlestranded rna virus that belongs to the aphthovirus genus picornaviridae family fmdv is one of the most contagious viruses in clovenhoofed animals and can cause both acute and prolonged asymptomatic but persistent infection [1]  upon infection of susceptible species fmdv proliferates rapidly and causes vesicular disease in feet and mouth',
 'the rna virus genome of fmdv displays a very high mutation rate because the virusencoded rna polymerase lacks a proofreading mechanism [2 3]  the high mutation rate of fmdv coupled with its rapid proliferation and extensive population result in the rapid evolution of this virus [4]  which contributes to the existence of seven main serotypes a o c asia1 south african territories sat 1 sat2 and sat3 in addition numerous variants and subtypes have been further evo

In [None]:
df_sentences_list = [str(d) for d in tqdm(df_sentences_list)]

100%|██████████| 403341/403341 [00:00<00:00, 1913170.91it/s]


In [None]:
import pandas as pd
df = pd.read_csv("/content/drive/My Drive/BertSentenceSimilarity/Data/covid_sentences_Full.csv", index_col=0)
df.head()

Unnamed: 0,paper_id,title,abstract,abstract_summary
footandmouth disease fmd an acute highly contagious viral disease in susceptible clovenhoofed animals was described 100 years ago the etiologic agent fmd virus fmdv is a positivesense singlestranded rna virus that belongs to the aphthovirus genus picornaviridae family fmdv is one of the most contagious viruses in clovenhoofed animals and can cause both acute and prolonged asymptomatic but persistent infection [1] upon infection of susceptible species fmdv proliferates rapidly and causes vesicular disease in feet and mouth,db5333b01a10f165ae516d30f9d1fbf96ab4b841,Biological function of Foot-and-mouth<br>dise...,footandmouth disease virus fmdv represses host...,Foot-and-mouth disease virus (FMDV)<br>repres...
the rna virus genome of fmdv displays a very high mutation rate because the virusencoded rna polymerase lacks a proofreading mechanism [2 3] the high mutation rate of fmdv coupled with its rapid proliferation and extensive population result in the rapid evolution of this virus [4] which contributes to the existence of seven main serotypes a o c asia1 south african territories sat 1 sat2 and sat3 in addition numerous variants and subtypes have been further evolved from each serotype [1] given that crossreactivity varies antigenic diversity among these serotypes have to be considered during vaccine development [5],db5333b01a10f165ae516d30f9d1fbf96ab4b841,Biological function of Foot-and-mouth<br>dise...,footandmouth disease virus fmdv represses host...,Foot-and-mouth disease virus (FMDV)<br>repres...
fmdv virion has a symmetric protein shell or capsid enclosing the genomic rna genome rna contains a positive singlestrand chain approximately 83 kb long and encodes a single long open reading frame orf of about 7 kb with two alternative initiation sites the orf is flanked by a long 5untranslated region 5utr and a short 3utr and ends with a genetically encoded polya tail [6] a genomelinked viral nonstructural protein nsp 3b also known as vpg containing 2324 amino acid aa residues is covalently bound to its 5 end although this protein is rapidly released into an infected cell and is deemed to play no part in translation initiation [7] the viral orf can be translated into a polyprotein of about 250 kda which is subsequently cleaved by two virusencoded proteinases leader l pro and 3c pro to yield structural and nsps [8 9] fig 1,db5333b01a10f165ae516d30f9d1fbf96ab4b841,Biological function of Foot-and-mouth<br>dise...,footandmouth disease virus fmdv represses host...,Foot-and-mouth disease virus (FMDV)<br>repres...
the fmdv genome was completely sequenced and all cleavage sites involved in the processing of polypeptides were also identified in the past two decades generally the orf region in fmdv genome is artificially divided into four functional areas due to the different functions of mature polypeptides [10] which are shown as follows fig 1 l region which is located at 5 end to the capsid component and codes for l pro p1 region encoding a precursor for capsid polypeptide which can generate four mature capsid proteins vp4 vp2 vp3 and vp1 upon cleavage by viral protease p2 region encodes three viral proteins 2a 2b and 2c in the middle region of the genome and p3 region which encodes four viral proteins 3a 3b 3c pro and 3d pol in which 3c is a viral protease and 3d an rnadependent rna polymerase [11] actually primary polyprotein is not strictly processed into four products as the functional regions by initial protease but l pro p12a 2bc and p3 by l pro 2a and 3c pro the precursors p12a 2bc and p3 are further processed into mature viral proteins and some cleavage intermediates with relative stability such as vp0 or 1ab 3abc 3bcd 3ab and 3cd by 3c pro fig 1 usually the intermediates may perform functions other than those of their individual constituents with two alternative initiation sites the orf is flanked by a long 5untranslated region 5utr and a short 3utr 3b vpg is covalently bound to its 5 end the orf region is generally divided into four functional areas l p1 p2 and p3 due to the different functions of mature polypeptides orfencoded polyprotein is processed into four products l pro p12a 2bc and p3 by l pro 2a and 3c pro the precursors p12a 2bc and p3 are further processed into mature viral proteins and some cleavage intermediates with relative stability such as vp0 or 1ab 3abc 3bcd 3ab and 3cd by 3c pro structural proteins form the biological protomer and viral capsid,db5333b01a10f165ae516d30f9d1fbf96ab4b841,Biological function of Foot-and-mouth<br>dise...,footandmouth disease virus fmdv represses host...,Foot-and-mouth disease virus (FMDV)<br>repres...
the virus capsid consists of 60 copies of each of the four structural polypeptides vp1 to vp4 which are selfassembled into an icosahedral structure with a diameter of 30 nm [12 13] fig 1 studies on structural information and protein interaction have shown that the structural protein or the precursor products vp0 vp24 or 1ab vp1 1d and vp3 1c which are encoded by p1 region form immature protomers through weak chemical bond interaction then pentamers are assembled by five protomers [14] after selfassembly of pentamers to generate an empty capsid the viral genomic rna covalently linked to vpg at the 5 end enters the capsid to produce provirion then the provirion is eventually processed into a mature virion following the rnatriggered autocleavage of vp0 [15] finally the virion particles with complete assembly are released from the infected host cells fig 2,db5333b01a10f165ae516d30f9d1fbf96ab4b841,Biological function of Foot-and-mouth<br>dise...,footandmouth disease virus fmdv represses host...,Foot-and-mouth disease virus (FMDV)<br>repres...
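
Because `index_col=0` makes the paragraph text the DataFrame index, the metadata of the paper a paragraph came from can be fetched by label. A minimal sketch on a toy frame follows; the rows and the helper name `paper_metadata` are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy stand-in for covid_sentences_Full.csv: the paragraph text is the index.
toy = pd.DataFrame(
    {
        "paper_id": ["p1", "p1", "p2"],
        "title": ["Paper one", "Paper one", "Paper two"],
    },
    index=["first paragraph", "second paragraph", "third paragraph"],
)

def paper_metadata(frame, paragraph):
    """Return the metadata of the first row whose index equals `paragraph`."""
    rows = frame.loc[[paragraph]]  # a list selector always yields a DataFrame
    return rows.iloc[0].to_dict()

meta = paper_metadata(toy, "second paragraph")
print(meta["paper_id"], meta["title"])
```

Taking the first matching row is an assumption made here for the case where the same paragraph text appears in more than one paper.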


## BERT

In [None]:
#https://github.com/UKPLab/sentence-transformers/blob/master/examples/application_semantic_search.py
"""
This is a simple application for sentence embeddings: semantic search
We have a corpus with various sentences. Then, for a given query sentence,
we want to find the most similar sentence in this corpus.
This script outputs for various queries the top 5 most similar sentences in the corpus.
"""

from sentence_transformers import SentenceTransformer
import scipy.spatial
import pickle as pkl
embedder = SentenceTransformer('bert-base-nli-mean-tokens')

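# Corpus: the list of paragraphs extracted from the papers
corpus = df_sentences_list
# Encoding the full corpus is slow, so the embeddings were computed once with the
# line below and are loaded from a pickle on Google Drive on later runs
#corpus_embeddings = embedder.encode(corpus,show_progress_bar=True)
with open("/content/drive/My Drive/BertSentenceSimilarity/Pickles/corpus_embeddings.pkl" , "rb") as file_:
  corpus_embeddings = pkl.load(file_)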

# Query sentences:
queries = ['What has been published about medical care?',
           'Knowledge of the frequency, manifestations, and course of extrapulmonary manifestations of COVID-19, including, but not limited to, possible cardiomyopathy and cardiac arrest',
           'Use of AI in real-time health care delivery to evaluate interventions, risk factors, and outcomes in a way that could not be done manually',
           'Resources to support skilled nursing facilities and long term care facilities.',
           'Mobilization of surge medical staff to address shortages in overwhelmed communities .',
           'Age-adjusted mortality data for Acute Respiratory Distress Syndrome (ARDS) with/without other organ failure – particularly for viral etiologies .']
query_embeddings = embedder.encode(queries,show_progress_bar=True)

# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
closest_n = 5
print("\nTop 5 most similar sentences in corpus:")
for query, query_embedding in zip(queries, query_embeddings):
    distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]

    results = zip(range(len(distances)), distances)
    results = sorted(results, key=lambda x: x[1])

    print("\n\n=========================================================")
    print("==========================Query==============================")
    print("===",query,"=====")
    print("=========================================================")


    # results is sorted by ascending cosine distance, so the first closest_n entries are the best matches
    for idx, distance in results[0:closest_n]:
        # cosine similarity = 1 - cosine distance
        print("Score:   ", "(Score: %.4f)" % (1-distance) , "\n" )
        print("Paragraph:   ", corpus[idx].strip(), "\n" )
        # df is indexed by paragraph text, so the paper metadata is looked up by that text
        row_dict = df.loc[df.index== corpus[idx]].to_dict()
        print("paper_id:  " , row_dict["paper_id"][corpus[idx]] , "\n")
        print("Title:  " , row_dict["title"][corpus[idx]] , "\n")
        print("Abstract:  " , row_dict["abstract"][corpus[idx]] , "\n")
        print("Abstract_Summary:  " , row_dict["abstract_summary"][corpus[idx]] , "\n")
        print("-------------------------------------------")

Batches: 100%|██████████| 1/1 [00:00<00:00, 27.80it/s]



Top 5 most similar sentences in corpus:


=== What has been published about medical care? =====
Score:    (Score: 0.8296) 

Paragraph:    how may state authorities require persons to undergo medical treatment 

paper_id:   1950c30fea7ef227129d94831df3fd0c57b9802c 

Title:    Chapter 10 Legal Aspects of Biosecurity 

Abstract:   when bad men combine the good must associate else they will fall one by one an unpitied sacrifice in a contemptible struggle
the study of this chapter will enable you to
1 discuss the definitions of terrorism and weapons of mass destruction and their relation to the illicit use of biological agents
2 list all legislative and administrative documents that address the legal aspects of the unlawful use of biological agents
4 discuss the prohibited uses of biological agents under international law 5 list and briefly discuss the homeland security presidential directives that apply to biosecurity and biodefense 

Abstract_Summary:    When bad men combine, the good mu
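
The cell above ranks paragraphs by sorting every cosine distance in the corpus. As a hedged alternative sketch (not the notebook's code), the same top-n ranking can be done in NumPy with `argpartition`, which avoids a full sort. The function name `top_k_cosine` and the tiny 2-D vectors are illustrative assumptions, and the sketch assumes no embedding is the zero vector.

```python
import numpy as np

def top_k_cosine(query_vec, corpus_vecs, k=5):
    """Return (indices, scores) of the k corpus rows most similar to query_vec."""
    corpus_vecs = np.asarray(corpus_vecs, dtype=float)
    query_vec = np.asarray(query_vec, dtype=float)
    # Cosine similarity is the dot product of L2-normalised vectors.
    corpus_norm = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    query_norm = query_vec / np.linalg.norm(query_vec)
    scores = corpus_norm @ query_norm
    k = min(k, len(scores))
    # argpartition finds the k largest scores without sorting the whole array.
    top = np.argpartition(-scores, k - 1)[:k]
    top = top[np.argsort(-scores[top])]  # order those k by descending score
    return top, scores[top]

# Tiny 2-D example in place of real BERT embeddings.
toy_corpus = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
idx, sims = top_k_cosine(np.array([1.0, 0.0]), toy_corpus, k=2)
print(idx, sims)
```

With the notebook's variables, `query_vec` would be one entry of `query_embeddings` and `corpus_vecs` would be `corpus_embeddings`; the returned indices point into `corpus`.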

In [None]:
# Run once after computing corpus_embeddings to cache them on Google Drive
#import pickle as pkl
#with open("/content/drive/My Drive/BertSentenceSimilarity/Pickles/corpus_embeddings.pkl" , "wb") as file_:
#  pkl.dump(corpus_embeddings,file_)
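
The commented-out dump above and the `pkl.load` in the BERT cell together form a compute-once cache. A minimal sketch of that pattern as a single helper follows; the name `load_or_compute` and the stand-in `fake_encode` are assumptions for illustration, with `fake_encode` playing the role of the slow `embedder.encode(corpus)` call.

```python
import os
import pickle as pkl
import tempfile

def load_or_compute(path, compute):
    """Load a pickled object from `path`, or call `compute()` and save the result."""
    if os.path.exists(path):
        with open(path, "rb") as file_:
            return pkl.load(file_)
    result = compute()
    with open(path, "wb") as file_:
        pkl.dump(result, file_)
    return result

# Demo with a stand-in for the expensive encoding step.
calls = []
def fake_encode():
    calls.append(1)
    return [[0.1, 0.2], [0.3, 0.4]]

cache_path = os.path.join(tempfile.mkdtemp(), "corpus_embeddings.pkl")
first = load_or_compute(cache_path, fake_encode)   # computes and writes the file
second = load_or_compute(cache_path, fake_encode)  # reads the file back
print(first == second, len(calls))
```

In the notebook, `compute` could be `lambda: embedder.encode(corpus, show_progress_bar=True)` and `path` the pickle path on Google Drive.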