In [0]:
import torch
torch.cuda.get_device_name(0)
# 'Tesla P100-PCIE-16GB' worked great for me

'Tesla P100-PCIE-16GB'

In [0]:
"""
INSTRUCTIONS:
- Upload data.py, util.py and (optionally) preprocessing.py to ./content and edit them so that their relative imports resolve.
- I built the engine by creating the corpus (a DataFrame of abstracts) in a local Python console, serializing it
  with pickle, uploading the file to /content, and then passing it together with paper_ids to the query_engine.fit method.
- The rest is mostly a matter of running the code given here, possibly with minor tweaks.
"""

In [0]:
!pip install sentence-transformers

Collecting sentence-transformers
  Downloading https://files.pythonhosted.org/packages/b9/46/b7d6c37d92d1bd65319220beabe4df845434930e3f30e42d3cfaecb74dc4/sentence-transformers-0.2.6.1.tar.gz (55kB)
Collecting transformers>=2.8.0
  Downloading https://files.pythonhosted.org/packages/a3/78/92cedda05552398352ed9784908b834ee32a0bd071a9b32de287327370b7/transformers-2.8.0-py3-none-any.whl (563kB)
Collecting tokenizers==0.5.2
  Downloading https://files.pythonhosted.org/packages/d1/3f/73c881ea4723e43c1e9acf317cf407fab3a278daab3a69c98dcac511c04f/tokenizers-0.5.2-cp36-cp36m-manylinux1_x86_64.whl (3.7MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)

In [0]:
from sentence_transformers import SentenceTransformer
from sentence_transformers import models
from nltk import sent_tokenize
import numpy as np

In [0]:
def normalize(embeddings):
    """
    Normalizes embeddings using L2 normalization.
    Args:
        embeddings: input embeddings, either a matrix (one embedding per row) or a single vector
    Returns:
        normalized embeddings
    """
    # A matrix is normalized row by row; a vector is divided by its own norm
    if len(embeddings.shape) > 1:
        return embeddings / np.linalg.norm(embeddings, axis=1).reshape(-1, 1)

    else:
        return embeddings / np.linalg.norm(embeddings)
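
The point of normalizing is that, for unit-length vectors, a plain dot product equals the cosine similarity. A minimal sketch of that identity follows; `l2_normalize` is a hypothetical stand-in for the row-wise branch of `normalize`, and the two vectors are made-up values.

```python
import numpy as np


def l2_normalize(embeddings):
    """Row-wise L2 normalization of a 2-D array (illustrative helper)."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / norms


a = np.array([[3.0, 4.0]])
b = np.array([[4.0, 3.0]])

a_unit, b_unit = l2_normalize(a), l2_normalize(b)

# Cosine similarity computed directly from the unnormalized vectors
cosine = (a @ b.T) / (np.linalg.norm(a) * np.linalg.norm(b))

# After normalization the dot product alone gives the same number
dot_of_units = a_unit @ b_unit.T
```

This is why the similarity cells further down can use `np.dot` on the returned embeddings without dividing by any norms.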

In [0]:
def BERT_sentence_embeddings(data, query=False):
    
    """
    Input:
        data: DataFrame containing information about paragraphs (paper_id, section, text),
              or a single query string when query is True
        query: if True, the input is one sentence (a query)
    Returns:
        corpus embeddings: numpy array containing one embedding per text paragraph in the input,
        obtained by averaging over the paragraph's sentence embeddings (a first attempt, probably not ideal)
        - dimensions: n x 768, where n is the number of input paragraphs (1 x 768 for a query)
    
    References
    ----------
    {
    reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "http://arxiv.org/abs/1908.10084",
    }

    """
    
    # Pre-trained model fine-tuned on a semantic textual similarity task.
    # It is cached on the function object so it is loaded once rather than on every call.
    if not hasattr(BERT_sentence_embeddings, '_model'):
        BERT_sentence_embeddings._model = SentenceTransformer('bert-base-nli-stsb-mean-tokens')
    model = BERT_sentence_embeddings._model

    if query:
        return normalize(np.array(model.encode([data])).reshape(1, 768))

    else:
        text_paragraphs = list(data['text'])
        n = len(text_paragraphs)

        corpus_embeddings = []
        for paragraph in text_paragraphs:
            sentences = sent_tokenize(paragraph)
            # shape: number of sentences in the paragraph x 768
            sent_embeddings = normalize(np.array(model.encode(sentences)).reshape(-1, 768))
            corpus_embeddings.append(np.mean(sent_embeddings, axis=0).reshape(1, 768))

        return normalize(np.array(corpus_embeddings).reshape(n, 768))
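
The loop above calls `model.encode` once per paragraph. One possible alternative, sketched here rather than benchmarked, is to encode all sentences in a single call and then average each paragraph's block of rows. `pool_paragraphs` and `toy_encode` are hypothetical names; `toy_encode` is only a stand-in for `model.encode` so the sketch runs without downloading the model, and the sketch assumes every paragraph has at least one sentence.

```python
import numpy as np


def pool_paragraphs(sentences_per_paragraph, encode):
    """Encode every sentence in one call, then average per paragraph.

    sentences_per_paragraph: list of non-empty lists of sentence strings
    encode: callable mapping a list of strings to a (num_sentences, dim) array
    """
    flat = [s for sentences in sentences_per_paragraph for s in sentences]
    embeddings = np.array(encode(flat), dtype=np.float64)
    # L2-normalize each sentence embedding, as the loop version does
    embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

    counts = np.array([len(s) for s in sentences_per_paragraph])
    starts = np.concatenate(([0], np.cumsum(counts)[:-1]))
    # Sum each paragraph's contiguous block of rows, then divide by its size
    sums = np.add.reduceat(embeddings, starts, axis=0)
    means = sums / counts[:, None]
    # Normalize the paragraph embeddings as well
    return means / np.linalg.norm(means, axis=1, keepdims=True)


def toy_encode(sentences):
    # Stand-in encoder for illustration only: [length, vowel count, 1]
    return [[len(s), sum(c in 'aeiou' for c in s), 1.0] for s in sentences]


paragraphs = [['Short one.', 'A second sentence.'], ['Only sentence here.']]
pooled = pool_paragraphs(paragraphs, toy_encode)
```

With the real model, `encode` would be `model.encode` and `sentences_per_paragraph` would come from `sent_tokenize`; whether this is faster depends on batch size and GPU memory.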

In [0]:
emb1 = BERT_sentence_embeddings('What do we know about COVID-19 risk factors?', query=True)
emb2 = BERT_sentence_embeddings('Do co-existing respiratory/viral infections make the virus more transmissible or virulent?', query=True)
emb3 = BERT_sentence_embeddings('The text is small and will load quickly and easily fit into memory.', query=True)

100%|██████████| 405M/405M [00:24<00:00, 16.3MB/s]


In [0]:
# Testing sentence embeddings: the vectors are unit length, so the dot product is the cosine similarity
np.dot(emb1,emb2.T)

array([[0.32969183]], dtype=float32)

In [0]:
np.dot(emb1,emb3.T)

array([[-0.01860096]], dtype=float32)

In [0]:
np.dot(emb3,emb2.T)

array([[0.11308034]], dtype=float32)

In [0]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [0]:
from data import CovidDataLoader
from nltk import sent_tokenize

import pickle
import pandas as pd
import numpy as np
import time

In [0]:
class QueryEngine_BERT():

    def __init__(self):
        self.corpus = None
        self.ids = None

    def fit(self, corpus, document_ids=None):
        """
        Builds the query engine on the given corpus.

        Args:
            corpus: list of documents to build the model on
            document_ids: optional, if given it will associate the given id's to each document given in the corpus
        """
        self.ids = document_ids
        self.corpus = [paragraph for paragraph in corpus['text']]
        self.corpus_embeddings = BERT_sentence_embeddings(corpus, query=False)
        
    def __create_query_result(self, query, similarities, n):
        """
        Builds the result table for a query.

        Args:
            query: the query string
            similarities: array containing cosine similarities between the query vector and documents from corpus
            n: number of most similar documents to include in the result

        Returns:
            pandas DataFrame containing query, document, similarity
        """

        result = {
            # One copy of the query per row; multiplying the string itself would concatenate it
            'query': [query] * len(similarities),
            'text': self.corpus,
            # Flatten to 1-D (this also works when the corpus holds a single document)
            'sim': np.ravel(similarities)
        }
        if self.ids is not None:
            result.update({'id': self.ids})

        result = pd.DataFrame(result).sort_values(by='sim', ascending=False)[:n]

        return result[result['sim'] > 0]

    def run_query(self, query, n=5):
        """
        Runs the given query, returns max n most similar documents from the corpus on which the model was built.

        Args:
            query: query to run
            n: max number of results returned

        Returns:
            n (or fewer) most similar documents from the corpus on which the model was built
        """
        if self.corpus is None:
            raise AttributeError('Model not built yet, please call the fit method before running queries!')

        assert isinstance(query, str)


        query_embedding = BERT_sentence_embeddings(query, query=True)
        # np.dot does not sort anything; the results are sorted in __create_query_result
        similarities = np.dot(self.corpus_embeddings, query_embedding.T)
        
        return self.__create_query_result(query, similarities, n)

    def save(self, dir_path, name):
        """
        Serializes the object to the file name.dat in the directory defined by dir_path.

        Args:
            dir_path: path of the directory to save the object to (must end with a path separator, e.g. './')
            name: name of the file without any extensions
        """
        pickle_path = dir_path + name + '.dat'
        print('Writing object to %s' % pickle_path)
        with open(pickle_path, 'wb') as f:
            pickle.dump(self, f)

    @staticmethod
    def load(pickle_path):
        """
        Loads(de-serializes) QueryEngine object from the given path.

        Args:
            pickle_path: path to QueryEngine pickle

        Returns:
            QueryEngine object
        """
        with open(pickle_path, 'rb') as f:
            query_engine = pickle.load(f)
            if type(query_engine) != QueryEngine_BERT:
                raise ValueError('Path to non QueryEngine_BERT object!')
            return query_engine
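
`__create_query_result` sorts the similarities of the whole corpus and then keeps the first n rows. If that sort ever becomes a bottleneck, a partial selection is one option. The sketch below is an assumption about how that could look, not part of the engine; `top_n_indices` is a hypothetical helper and the scores are made-up values.

```python
import numpy as np


def top_n_indices(similarities, n):
    """Return the indices of the n largest similarities, best first."""
    scores = np.asarray(similarities).ravel()
    n = min(n, scores.size)
    if n == 0:
        return np.array([], dtype=int)
    # argpartition puts the n largest scores in the last n slots, unordered
    candidates = np.argpartition(scores, -n)[-n:]
    # Sort only those n candidates, descending by score
    return candidates[np.argsort(scores[candidates])[::-1]]


sims = np.array([[0.10], [0.63], [-0.02], [0.61], [0.33]])
best = top_n_indices(sims, 3)  # array([1, 3, 4])
```

The returned indices could then be used to pick the matching rows from the corpus and ids instead of sorting the full DataFrame.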

In [0]:
with open("test.txt", "rb") as fp:
  abstracts = pickle.load(fp)
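
The pickle loaded above was produced locally, as described in the instructions. A rough sketch of that round trip is given below. The column names `paper_id` and `text` follow the rest of this notebook, while the toy rows and the file name `corpus_example.pkl` are made up for illustration.

```python
import os
import pickle
import tempfile

import pandas as pd

# Hypothetical toy corpus; the real one holds one abstract per row
corpus = pd.DataFrame({
    'paper_id': ['id-0', 'id-1'],
    'text': ['First abstract. It has two sentences.', 'Second abstract.'],
})

# Locally: serialize the DataFrame, then upload the resulting file
pickle_path = os.path.join(tempfile.gettempdir(), 'corpus_example.pkl')
with open(pickle_path, 'wb') as f:
    pickle.dump(corpus, f)

# In the notebook: load it back and extract the ids for query_engine.fit
with open(pickle_path, 'rb') as f:
    restored = pickle.load(f)
paper_ids = restored['paper_id'].tolist()
```

Pickled DataFrames are sensitive to the pandas version, so it probably helps to use the same version locally and in the notebook.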

In [0]:
len(abstracts)

29306

In [0]:
# abstracts is a DataFrame, so take the column directly (iterating over a DataFrame yields column names)
paper_ids = abstracts['paper_id'].tolist()

In [0]:
query_engine = QueryEngine_BERT()
query_engine.fit(abstracts, paper_ids)
    
query_engine.save('./', 'abstracts_query_engine')

Writing object to ./abstracts_query_engine.dat


In [0]:
query_engine = QueryEngine_BERT.load('abstracts_query_engine.dat')

In [0]:
pd.set_option("display.max_rows", None, "display.max_columns", None)
pd.set_option("display.max_colwidth",None)

In [0]:
query_engine.run_query('Physical science of the coronavirus',10)[['text','sim']].style.set_properties(**{'font-size': '7pt'})

Unnamed: 0,text,sim
25204,"Coronaviruses (CoVs), enveloped positive-sense RNA viruses, are characterized by club-like spikes that project from their surface, an unusually large RNA genome, and a unique replication strategy. Coronaviruses cause a variety of diseases in mammals and birds ranging from enteritis in cows and pigs and upper respiratory disease in chickens to potentially lethal human respiratory infections. Here we provide a brief introduction to coronaviruses discussing their replication and pathogenicity, and current prevention and treatment strategies. We also discuss the outbreaks of the highly pathogenic Severe Acute Respiratory Syndrome Coronavirus (SARS-CoV) and the recently identifi ed Middle Eastern Respiratory Syndrome Coronavirus (MERS-CoV). Coronavirus virions are spherical with diameters of approximately 125 nm as depicted in recent studies by cryo-electron tomography and cryo-electron microscopy [ 2 , 3 ]. The most prominent feature of coronaviruses is the club-shaped spike projections emanating from the surface of the virion. These spikes are a defi ning feature of the virion and give them the appearance of a solar corona, prompting the name, coronaviruses. Within the envelope of the virion is the nucleocapsid. Coronaviruses have helically symmetrical nucleocapsids, which is uncommon among positive-sense RNA viruses, but far more common for negative-sense RNA viruses.",0.631149
12368,"Coronaviruses are etiologic agents of respiratory and enteric diseases in humans and in animals. In this study, a one-step real-time reverse transcriptionpolymerase chain reaction (RT-PCR) assay based on SYBR Green chemistry and degenerate primers was developed for the generic detection of coronaviruses. The primers, designed in the open reading frame 1b, enabled the detection of 32 animal coronaviruses including strains of canine coronavirus, feline coronavirus, transmissible gastroenteritis virus (TGEV), bovine coronavirus (BCoV), murine hepatitis virus (MHV) and infectious bronchitis virus (IBV). A specific amplification was also observed with the human coronaviruses (HCoV) HCoV-NL63, HCoV-OC43, HCoV-229E and severe acute respiratory syndrome coronavirus (SARS-CoV). The real-time RT-PCR detected down to 10 cRNA copies from TGEV, BCoV, SARS-CoV and IBV. In addition, the assay exhibited a high sensitivity and specificity on clinical samples from different animal species. The developed assay represents a potential tool for laboratory diagnostics and for detecting still uncharacterized coronaviruses.",0.61151
4371,"In 2012, a novel coronavirus, initially named as human coronavirus EMC (HCoV-EMC) but recently renamed as Middle East respiratory syndrome human coronavirus (MERS-CoV), was identified in patients who suffered severe acute respiratory infection and subsequent renal failure that resulted in death. Ongoing epidemiological investigations together with retrospective studies have found 61 laboratory-confirmed cases of infection with this novel coronavirus, including 34 deaths to date. This novel coronavirus is culturable and two complete genome sequences are now available. Furthermore, molecular detection and indirect immunofluorescence assay have been developed. The present paper summarises the limited recent advances of this novel human coronavirus, including its discovery, genomic characterisation and detection. HCoV-EMC, MERS-CoV, genomic characterisation, molecular detection Geng H Y, Tan W J. A novel human coronavirus: Middle East respiratory syndrome human coronavirus.",0.609226
17116,"Severe acute respiratory syndrome (SARS) was first described during a 2002-2003 global outbreak of severe pneumonia associated with human deaths and person-toperson disease transmission. The etiologic agent was initially identified as a coronavirus by thin-section electron microscopic examination of a virus isolate. Virions were spherical, 78 nm in mean diameter, and composed of a helical nucleocapsid within an envelope with surface projections. We show that infection with the SARS-associated coronavirus resulted in distinct ultrastructural features: double-membrane vesicles, nucleocapsid inclusions, and large granular areas of cytoplasm. These three structures and the coronavirus particles were shown to be positive for viral proteins and RNA by using ultrastructural immunogold and in situ hybridization assays. In addition, ultrastructural examination of a bronchiolar lavage specimen from a SARS patient showed numerous coronavirus-infected cells with features similar to those in infected culture cells. Electron microscopic studies were critical in identifying the etiologic agent of the SARS outbreak and in guiding subsequent laboratory and epidemiologic investigations.",0.600576
14400,"Despite years of research, the precise determinants of coronavirus replication and pathogenesis remain unidentified. What is known of the pathogenesis of the severe acute respiratory syndrome coronavirus (SARS-CoV) is limited, but clinical observations suggest that both viral-induced cytotoxicity and host immune-mediated destruction contribute to the severity of disease. This summary discusses recent advances in coronavirus research that will facilitate the identification of crucial molecular targets for the rational design of SARS therapeutics. When the SARS coronavirus came dramatically to attention as a fatal and potentially pandemic respiratory disease, the scientific response was swift and effective. The etiological agent was identified, the genome was cloned and sequenced, and large strides have been made in understanding key points in the viral life cycle, including recent identification of a functional host cell receptor and solution of the three-dimensional structure of the viral attachment protein. Much of this progress was possible only because a foundation of basic coronavirus biology existed. The Denison laboratory has been at the forefront of the small group of investigators who have dissected the basic features of coronaviruses. Denison has discovered critical mechanisms in the life cycle of the coronaviruses and is a key figure in the effective and urgent application of this knowledge to SARS.",0.594483
13216,"Doughnut-shaped particles, 55-65 nm in diameter, were revealed by electron microscopy in the cisterns of the rough endoplasmic reticulum of cells from an active lesion in autopsied brain tissue from a multiple sclerosis patient. The morphology of the particles closely resembled that of coronaviruses.",0.59129
24715,"Coronaviruses have been closely related with mankind for thousands of years. Communityacquired human coronaviruses have long been recognized to cause common cold. However, zoonotic coronaviruses are now becoming more a global concern with the discovery of highly pathogenic severe acute respiratory syndrome (SARS) and Middle East respiratory syndrome (MERS) coronaviruses causing severe respiratory diseases. Infections by these emerging human coronaviruses are characterized by less robust interferon production. Treatment of patients with recombinant interferon regimen promises beneficial outcomes, suggesting that compromised interferon expression might contribute at least partially to the severity of disease. The mechanisms by which coronaviruses evade host innate antiviral response are under intense investigations. This review focuses on the fierce arms race between host innate antiviral immunity and emerging human coronaviruses. Particularly, the host pathogen recognition receptors and the signal transduction pathways to mount an effective antiviral response against SARS and MERS coronavirus infection are discussed. On the other hand, the counter-measures evolved by SARS and MERS coronaviruses to circumvent host defense are also dissected. With a better understanding of the dynamic interaction between host and coronaviruses, it is hoped that insights on the pathogenesis of newly-identified highly pathogenic human coronaviruses and new strategies in antiviral development can be derived.",0.585042
24504,"Development of vaccination strategies for emerging pathogens are particularly challenging because of the sudden nature of the emergence of these viruses and the long process needed for traditional vaccine development. Therefore, there is a need for development of a rapid method of vaccine development that can respond to emerging pathogens in a short time frame. The emergence of severe acute respiratory syndrome coronavirus (SARS-CoV) in 2003 and Middle East respiratory syndrome (MERS-CoV) in late 2012 demonstrate the importance of coronaviruses as emerging pathogens. The spike glycoproteins of coronaviruses reside on the surface of the virion and are responsible for virus entry. The spike glycoprotein is the major immunodominant antigen of coronaviruses and has proven to be an excellent target for vaccine designs that seek to block coronavirus entry and promote antibody targeting of infected cells. Vaccination strategies for coronaviruses have involved live attenuated virus, recombinant viruses, non-replicative virus-like particles expressing coronavirus proteins or DNA plasmids expressing coronavirus genes. None of these strategies has progressed to an approved human coronavirus vaccine in the ten years since SARS-CoV emerged. Here we describe a novel method for generating MERS-CoV and SARS-CoV full-length spike nanoparticles, which in combination with adjuvants are able to produce high titer antibodies in mice.",0.581003
23009,"Human coronaviruses are known to be a common cause of respiratory infections in man. However, the diagnosis of human coronavirus infections is not carried out routinely, primarily because the isolation and propagation of these viruses in tissue culture is difficult and time consuming. The aim of this study was to evaluate the use of recombinant, bacterial expressed proteins in the serodiagnosis of coronavirus infections. Two proteins were examined: the human coronavirus 229E nucleocapsid protein (N), expressed as a fusion protein in the vector pUR and the coronavirus 229E surface glycoprotein (S), expressed as a fusion protein in the vector PROS. The recombinant proteins were used as antigens in Western blot (WB) assays to detect the 229E-specific IgG antibodies and the results were compared with a standard serological method, indirect immunofluorescence. Serum samples of 51 paediatric patients, suffering from acute respiratory illness, and 10 adults, voluntarily infected with human coronavirus, were tested. The semm samples of the adult group had coronavims-specific IgG antibodies in both test systems. In contrast, only 8/51 sera of the paediatric group were positive for coronavirus-specific IgG by both WB and IF and 20/51 sera were positive by WB, but not by IF. The overall incidence of human coronavims infections in the paediatric age group was 55% evaluated by WB analysis and 16% evaluated by IF. This study shows that recombinant human coronavirus 229E proteins are suitable reagents for the epidemiological screening of coronavirus 229E infections. 0166-0934/95/$09.50 0 1995 Elsevier Science B.V. All rights reserved SSDIO166-0934(95)00041-O",0.578708
14512,"Development of vaccination strategies for emerging pathogens are particularly challenging because of the sudden nature of their emergence and the long process needed for traditional vaccine development. Therefore, there is a need for development of a rapid method of vaccine development that can respond to emerging pathogens in a short time frame. The emergence of severe acute respiratory syndrome coronavirus (SARS-CoV) in 2003 and Middle East Respiratory Syndrome Coronavirus (MERS-CoV) in late 2012 demonstrate the importance of coronaviruses as emerging pathogens. The spike glycoproteins of coronaviruses reside on the surface of the virion and are responsible for virus entry. The spike glycoprotein is the major immunodominant antigen of coronaviruses and has proven to be an excellent target for vaccine designs that seek to block coronavirus entry and promote antibody targeting of infected cells. Vaccination strategies for coronaviruses have involved live attenuated virus, recombinant viruses, nonreplicative virus-like particles expressing coronavirus proteins or DNA plasmids expressing coronavirus genes. None of these strategies has progressed to an approved human coronavirus vaccine in the ten years since SARS-CoV emerged. Here we describe a novel method for generating MERS-CoV and SARS-CoV full-length spike nanoparticles, which in combination with adjuvants are able to produce high titer antibodies in mice.",0.575677


In [0]:
"""
TODO: do the same for body_texts
"""