# Find Similar Documents From Scientific Corpus Using Deep Learning With SciBERT       
This kernel gives an overview of semantic similarity search over documents, using SciBERT embeddings with cosine similarity and k-NN search.

# Introduction  
When reading an interesting article, you might want to find similar articles among a large number of candidate publications. Doing this manually does not scale. Why not take advantage of Artificial Intelligence to solve the problem? 
In this article, you will learn to use SciBERT and cosine similarity to find the articles that are most similar in meaning to a specific query.  

# Approach    
Here are the different steps performed 
* Data extraction and cleaning   
* Data Processing 
    * Load the pretrained model  
    * Vectorize documents
     
* Semantic Similarity search 
    * Cosine Similarity   
    * k-NN with Faiss

# Useful Libraries

In [1]:
"""
Data Loading and other libraries
"""
import warnings
import pandas as pd
import numpy as np
from tqdm import tqdm

"""
Transformer libraries useful to using the pretrained model and data preprocessing
"""
import torch
from keras.preprocessing.sequence import pad_sequences
from transformers import BertTokenizer,  AutoModelForSequenceClassification

"""
Similarity search section: cosine similarity search and facebook AI research library
"""
from sklearn.metrics.pairwise import cosine_similarity
!pip install faiss-gpu # needed only the first time you run the notebook; comment this line out afterwards
import faiss

Collecting faiss-gpu
  Downloading faiss_gpu-1.7.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (85.5 MB)
Installing collected packages: faiss-gpu
Successfully installed faiss-gpu-1.7.2


In [2]:
warnings.filterwarnings("ignore")

# About the data   
- The CORD-19 data set is a resource of over 59,000 scholarly articles, including over 48,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. 
- It is downloadable from Kaggle. 
- Further details about the dataset can be found on this [page](https://www.kaggle.com/danielwolffram/discovid-ai-a-search-and-recommendation-engine/data). 

#### Data loading

In [3]:
data = pd.read_csv("../input/cord19createdataframe/cord19_df.csv") 
print("Data Shape: {}".format(data.shape))

Data Shape: (47110, 16)


In [4]:
# Percentage of missing column values
percent_missing = data.isnull().sum() * 100 / len(data)
percent_missing

paper_id         0.000000
body_text        0.000000
methods         58.439822
results         63.984292
source          10.365103
title           10.439397
doi             12.604543
abstract        12.838039
publish_time    10.365103
authors         11.676926
journal         18.199958
arxiv_id        98.681809
url             10.617703
publish_year     0.000000
is_covid19       0.000000
study_design     0.000000
dtype: float64

There are 47110 articles overall, each described by 16 columns. 

**Note**: We will focus our analysis on the **abstract** column for simplicity. About 12.8% of the abstracts are missing, so we first drop those rows. You could use another text column instead, such as **body_text**, which has no missing values; it is up to you. 
We will also use only 2000 observations in order to speed up the processing. 

In [5]:
# remove articles with missing abstract
data = data.dropna(subset = ['abstract'])
data = data.reset_index(drop = True)
percent_missing = data.isnull().sum() * 100 / len(data)
percent_missing

paper_id         0.000000
body_text        0.000000
methods         53.796698
results         59.466173
source           8.645463
title            8.723394
doi             11.095417
abstract         0.000000
publish_time     8.645463
authors          9.432078
journal         17.473577
arxiv_id        98.487653
url              8.927963
publish_year     0.000000
is_covid19       0.000000
study_design     0.000000
dtype: float64

In [6]:
# Show the first n words (default 100) of the abstract of total_number random articles
def show_random_articles(total_number, df, n=100):
    
    # Draw total_number random articles
    sampled = df.sample(total_number)
    
    # Print each of the sampled articles (idx is the row's index label)
    for idx, row in sampled.iterrows():
        print("Article #{}".format(idx))
        print(" --> Title: {}".format(row["title"]))
        print(" --> Abstract: {} ...".format(" ".join(row["abstract"].split()[:n])))
        print("\n")
        
# Show 3 random articles
show_random_articles(3, data)

Article #3252
 --> Title: Tuning antiviral CD8 T-cell response via proline-altered peptide ligand vaccination
 --> Abstract: AbstractViral escape from CD8+ cytotoxic T lymphocyte responses correlates with disease progression and represents a significant challenge for vaccination. Here, we demonstrate that CD8+ T cell recognition of the naturally occurring MHC-I-restricted LCMV-associated immune escape variant Y4F is restored following vaccination with a proline-altered peptide ligand (APL). The APL increases MHC/peptide (pMHC) complex stability, rigidifies the peptide and facilitates T cell receptor (TCR) recognition through reduced entropy costs. Structural analyses of pMHC complexes before and after TCR binding, combined with biophysical analyses, revealed that although the TCR binds similarly to all complexes, the p3P modification alters the conformations of a ...


Article #36667
 --> Title: A highly conserved WDYPKCDRA epitope in the RNA directed RNA polymerase of human coronaviru

# Data Processing & Vectorization     
Data processing vectorizes each article's abstract so that we can perform the similarity analysis. Since we are dealing with scientific documents, we use the SciBERT model and tokenizer to generate an embedding for each article from its text.  
SciBERT is a language model pretrained on scientific text. You can find more information about it on [Semantic Scholar](https://www.semanticscholar.org/paper/SciBERT%3A-A-Pretrained-Language-Model-for-Scientific-Beltagy-Lo/5e98fe2163640da8ab9695b9ee9c433bb30f5353).   
Here is how we proceed:  

## Load model artifacts   
Load the pretrained model and tokenizer. When loading the model, we set `output_hidden_states` to `True` so that it returns the hidden states of every layer, from which we extract the embeddings.  

In [7]:
# Get the SciBERT pretrained model path from Allen AI repo
pretrained_model = 'allenai/scibert_scivocab_uncased'

# Get the tokenizer from the previous path
sciBERT_tokenizer = BertTokenizer.from_pretrained(pretrained_model, 
                                          do_lower_case=True)

# Get the model
model = AutoModelForSequenceClassification.from_pretrained(pretrained_model,
                                                          output_attentions=False,
                                                          output_hidden_states=True)

Downloading:   0%|          | 0.00/223k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/385 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/422M [00:00<?, ?B/s]

Some weights of the model checkpoint at allenai/scibert_scivocab_uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification we

## Transform text data to embeddings   
The function *convert_single_abstract_to_embedding* is largely inspired by Chris McCormick's [BERT Word Embeddings Tutorial](https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/#3-extracting-embeddings). 

It creates an embedding for a given text using the pretrained SciBERT model. 

In [8]:
def convert_single_abstract_to_embedding(tokenizer, model, in_text, MAX_LEN = 510):
    
    input_ids = tokenizer.encode(
                        in_text, 
                        add_special_tokens = True, 
                        max_length = MAX_LEN,
                        truncation = True,  # truncate explicitly to max_length
                   )    

    results = pad_sequences([input_ids], maxlen=MAX_LEN, dtype="long", 
                              truncating="post", padding="post")
    
    # Remove the outer list.
    input_ids = results[0]

    # Create attention masks    
    attention_mask = [int(i>0) for i in input_ids]
    
    # Convert to tensors.
    input_ids = torch.tensor(input_ids)
    attention_mask = torch.tensor(attention_mask)

    # Add an extra dimension for the "batch" (even though there is only one 
    # input in this batch.)
    input_ids = input_ids.unsqueeze(0)
    attention_mask = attention_mask.unsqueeze(0)
    
    # Put the model in "evaluation" mode, meaning feed-forward operation.
    model.eval()

    #input_ids = input_ids.to(device)
    #attention_mask = attention_mask.to(device)
    
    # Run the text through BERT, and collect all of the hidden states produced
    # from all 12 layers. 
    with torch.no_grad():        
        logits, encoded_layers = model(
                                    input_ids = input_ids, 
                                    token_type_ids = None, 
                                    attention_mask = attention_mask,
                                    return_dict=False)

    layer_i = 12 # The last BERT layer before the classifier.
    batch_i = 0 # Only one input in the batch.
    token_i = 0 # The first token, corresponding to [CLS]
        
    # Extract the embedding.
    embedding = encoded_layers[layer_i][batch_i][token_i]

    # Move to the CPU and convert to numpy ndarray.
    embedding = embedding.detach().cpu().numpy()

    return embedding
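
The padding and masking step inside the function can be illustrated without loading the model. The sketch below is plain Python; `pad_and_mask` and the token ids are made up for illustration, and it assumes, as the function above does, that the padding id is 0.

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Truncate or right-pad token ids to max_len and build the attention mask."""
    # Keep the first max_len ids ("post" truncation), then pad at the end.
    ids = list(token_ids[:max_len])
    ids += [pad_id] * (max_len - len(ids))
    # 1 marks a real token, 0 marks padding that the model should ignore.
    mask = [int(i != pad_id) for i in ids]
    return ids, mask


# Hypothetical ids: 102 and 103 stand in for the [CLS] and [SEP] tokens.
ids, mask = pad_and_mask([102, 2567, 4498, 103], max_len=6)
print(ids)   # [102, 2567, 4498, 103, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

Because the mask is zero over the padded positions, padding a short abstract does not let the extra positions take part in attention.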

### Test on a single text data  
Here we test the function on the "abstract" field of the article at position 30 (the 31st article, since positions start at 0). You can choose any position you want, as long as it exists in the data.

In [9]:
input_abstract = data.abstract.iloc[30]

# Use the model and tokenizer to generate an embedding for the input_abstract
abstract_embedding = convert_single_abstract_to_embedding(sciBERT_tokenizer, model, input_abstract)

print('Embedding shape: {}'.format(abstract_embedding.shape))


Embedding shape: (768,)


**(768,)** means that the embedding is composed of 768 values. Now that the conversion from text to embedding works, we can apply it to all of our text data. Before that, we remove some columns from the data so that the final query results have fewer columns. Also, I selected only 2000 articles for the analysis so that the overall processing does not take too long.

In [10]:
def get_min_viable_data(df, sample_size=2000):
    
    # Columns that are not needed for the analysis
    useless_cols = ['methods', 'results', 'source', 'doi',
           'body_text', 'publish_time', 'authors', 'journal', 'arxiv_id',
           'publish_year', 'is_covid19', 'study_design']

    # drop() returns a new DataFrame, so the caller's DataFrame is left untouched
    df = df.drop(columns=useless_cols)

    # Running the analysis on the whole dataset took too much time, so we keep
    # a random subset (2000 observations by default) to speed up the processing.
    df = df.sample(sample_size)
    
    return df

In [11]:
def convert_overall_text_to_embedding(df):
    
    # The list of all the embeddings
    embeddings = []
    
    # Get the abstracts of the DataFrame passed in (not the global `data`)
    overall_text_data = df.abstract.values
    
    # Loop over all the abstracts and get their embeddings
    for abstract in tqdm(overall_text_data):
        
        # Get the embedding 
        embedding = convert_single_abstract_to_embedding(sciBERT_tokenizer, model, abstract)
        
        # Add it to the list
        embeddings.append(embedding)
        
    print("Conversion Done!")
    
    return embeddings

In [12]:
"""
# This task can take a lot of time depending on the sample_size value 
in the "get_min_viable_data" function
"""
data = get_min_viable_data(data)
embeddings = convert_overall_text_to_embedding(data)

100%|██████████| 2000/2000 [1:03:36<00:00,  1.91s/it]

Conversion Done!





In [13]:
# Create a new column that will contain the embedding of each abstract
def create_final_embeddings(df, embeddings):
    
    df["embeddings"] = embeddings
    df["embeddings"] = df["embeddings"].apply(lambda emb: np.array(emb))
    df["embeddings"] = df["embeddings"].apply(lambda emb: emb.reshape(1, -1))
    
    return df

In [14]:
data = create_final_embeddings(data, embeddings)
data.head(3)

| | paper_id | title | abstract | url | embeddings |
|---|---|---|---|---|---|
| 11754 | 9778c2bdf6be32053f9a450194f767b47c87c891 | | Graphical Abstract Highlights d The cryo-EM st... | | [[-0.3626312, 0.34195867, 0.40458974, -0.65087... |
| 20720 | 98fc84b407a8632ee5ac42431d12c6acf7dfe3d8 | Genetic Loci That Influence Cause of Death in ... | A genome scan was conducted to seek evidence f... | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7... | [[-0.49350545, -1.1899799, -0.0245932, -0.8623... |
| 15193 | fd0fcdc8d711665338facb36ba94489c3b63a0d7 | New Pre-pandemic Influenza Vaccines: An Egg-an... | Highly pathogenic avian H5N1 influenza viruses... | http://europepmc.org/articles/pmc2793094?pdf=r... | [[0.3071721, -0.85410595, 0.43242905, 0.003505... |


# Similarity Search   
Each abstract now has a corresponding embedding, so we can perform the similarity analysis between a given ***query*** vector and all the embedding vectors. The scope of this article is limited to:  
- Cosine similarity     
- k-Nearest Neighbor (k-NN) search 
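
As a reminder of what the first measure computes, here is a minimal, dependency-free sketch of cosine similarity: the dot product of two vectors divided by the product of their lengths. The helper name `cosine_sim` and the toy vectors are illustrative and are not part of the notebook, which relies on scikit-learn's `cosine_similarity` instead.

```python
import math


def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine_sim([1.0, 2.0], [2.0, 4.0])   # close to 1.0
orthogonal = cosine_sim([1.0, 0.0], [0.0, 1.0])       # 0.0
opposite = cosine_sim([1.0, 0.0], [-1.0, 0.0])        # -1.0
print(same_direction, orthogonal, opposite)
```

The score depends only on the direction of the vectors, not on their length, which is why `[1, 2]` and `[2, 4]` count as maximally similar.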

## Utility functions

In [15]:
def process_query(query_text):
    """
    Create a vector for the given query and reshape it to (1, n_features)
    for the similarity search
    """

    query_vect = convert_single_abstract_to_embedding(sciBERT_tokenizer, model, query_text)
    query_vect = np.array(query_vect)
    query_vect = query_vect.reshape(1, -1)
    return query_vect


def get_top_N_articles_cosine(query_text, data, top_N=5):
    """
    Retrieve the top_N (default 5) articles most similar to the query
    """
    query_vect = process_query(query_text)
    relevant_cols = ["title", "abstract", "url", "cos_sim"]
    
    # Run the similarity search
    data["cos_sim"] = data["embeddings"].apply(lambda x: cosine_similarity(query_vect, x))
    data["cos_sim"] = data["cos_sim"].apply(lambda x: x[0][0])
    
    # Sort by cosine similarity in descending order. The slice starts at 1
    # because the query used here is itself an article in `data`, so the top
    # hit is the query with a similarity of 1.
    most_similar_articles = data.sort_values(by='cos_sim', ascending=False)[1:top_N+1]
    
    return most_similar_articles[relevant_cols]
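
Skipping the first sorted row only works when the query is itself one of the indexed articles. A possible alternative, sketched below in plain Python under that assumption, is to exclude the query by its identifier before ranking. The names `top_n_similar` and `toy_vectors` are hypothetical and the two-dimensional vectors merely stand in for the 768-dimensional embeddings.

```python
import math


def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_n_similar(query_id, vectors, top_n=2):
    """Rank every document except query_id by cosine similarity to it."""
    query = vectors[query_id]
    scores = [
        (doc_id, cosine_sim(query, vec))
        for doc_id, vec in vectors.items()
        if doc_id != query_id  # exclude the query itself explicitly
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]


toy_vectors = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
ranked_ids = [doc_id for doc_id, _ in top_n_similar("a", toy_vectors)]
print(ranked_ids)  # ['b', 'c']
```

With this approach, a free-text query that is not in the dataset would simply be ranked against every article, with nothing dropped.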

### Similarity Search with Cosine

In [16]:
query_text_test = data.iloc[0].abstract

top_articles = get_top_N_articles_cosine(query_text_test, data)

In [17]:
top_articles

| | title | abstract | url | cos_sim |
|---|---|---|---|---|
| 493 | | Highlights d ENDU-2 nuclease regulates nucleot... | | 0.825622 |
| 17460 | Regioselective synthesis of 6-substituted-2-am... | Abstract A series of 2-amino-5-bromo-4(3H)-pyr... | https://doi.org/10.1016/j.ejmech.2013.06.036 | 0.773827 |
| 8943 | Synthesis of 4-aminoquinoline–pyrimidine hybri... | Abstract One of the most viable options to tac... | https://doi.org/10.1016/j.ejmech.2013.05.046 | 0.761577 |
| 25022 | Australia was indeed the “lucky country” in th... | Anton Y Peleg, Wendy J Munckhof\nAustralia was... | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7... | 0.757679 |
| 30241 | Antigen delivery systems for veterinary vaccin... | Abstract The recent advances in molecular gene... | https://doi.org/10.1016/j.vaccine.2008.09.044 | 0.754393 |


In [18]:
top_articles.iloc[0].abstract

'Highlights d ENDU-2 nuclease regulates nucleotide metabolism and germ cell proliferation in worms d ENDU-2 expression is induced by nucleotide imbalance and other genotoxic stresses d ENDU-2 inhibits CTP synthase phosphorylation by repressing PKA and HDA-1 in the gut d ENDU-2 function may be conserved in mammalian cells'

In [19]:
top_articles.iloc[1].abstract

'Abstract A series of 2-amino-5-bromo-4(3H)-pyrimidinone derivatives bearing different substituents at the C-6 position were synthesized using a highly regioselective lithiation–substitution protocol, and the effect of structural variation at the C-6 position on their antiviral activity in cell culture was evaluated. Although some of the derivatives were found to be active against various virus strains, they were effective only close to their toxicity threshold.'

In [20]:
top_articles.iloc[2].abstract

'Abstract One of the most viable options to tackle the growing resistance to the antimalarial drugs such as artemisinin is to resort to synthetic drugs. The multi-target strategy involving the use of hybrid drugs has shown promise. In line with this, new hybrids of quinoline with pyrimidine have been synthesized and evaluated for their antiplasmodial activity against both CQS and CQR strains of Plasmodium falciparum. These depicted activity in nanomolar range and were found to bind to heme as well as AT rich pUC18 DNA.'

In [21]:
top_articles.iloc[3].abstract

'Anton Y Peleg, Wendy J Munckhof\nAustralia was indeed the "lucky country" in the recent worldwide SARS epidemic 229'

### Similarity Search Using KNN with Faiss   

Faiss is a library developed by [Facebook AI Research](https://research.facebook.com/research-areas/facebook-ai-research-fair/). According to its [wiki page](https://github.com/facebookresearch/faiss/wiki), 
> Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM

Here are the steps to build the search engine using the previously built embeddings:  
- create the flat index: a "flat" index stores the vectors as they are, without compression, and compares the query against every one of them. `IndexFlatL2` uses the L2 (Euclidean) distance to measure how close the query vector is to each stored vector (embedding). 
- add all the vectors to the index 
- define the number **K** of similar documents we want 
- run the similarity search  
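
To make the steps above concrete, here is a minimal pure-Python sketch of what an exhaustive L2 search does conceptually: compute the distance from the query to every stored vector and keep the K smallest. It is not Faiss's implementation. `flat_l2_search` and the toy vectors are hypothetical, and the sketch returns squared distances, which is also what `IndexFlatL2` reports.

```python
def flat_l2_search(vectors, query, k):
    """Exhaustive k-NN: squared L2 distance from query to every stored vector."""
    dists = [
        (sum((q - v) ** 2 for q, v in zip(query, vec)), pos)
        for pos, vec in enumerate(vectors)
    ]
    dists.sort()  # smallest distance first; ties are broken by position
    top = dists[:k]
    return [d for d, _ in top], [pos for _, pos in top]


stored = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0], [6.0, 8.0]]
D_toy, I_toy = flat_l2_search(stored, query=[0.0, 0.0], k=3)
print(I_toy)  # [0, 2, 1]
print(D_toy)  # [0.0, 2.0, 25.0]
```

As with the real index, the returned positions refer to the order in which the vectors were added, so they must be mapped back to articles by position rather than by DataFrame label.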

In [22]:
embedding_dimension = len(embeddings[0])

In [23]:
indexFlatL2 = faiss.IndexFlatL2(embedding_dimension)

# Convert the embeddings list of vectors into a 2D array.
vectors = np.stack(embeddings)

indexFlatL2.add(vectors)

In [24]:
print("Total Added Number of Vectors: {}".format(indexFlatL2.ntotal))

Total Added Number of Vectors: 2000


### Perform Query    
We will use the same query as previously. Change it to another one if you want.  

In [25]:
# Get query vector
query_text = data.iloc[0].abstract 
query_vector = process_query(query_text)

K = 5

# Run the search
D, I = indexFlatL2.search(query_vector, K)

In [26]:
I # this contains the positions (row offsets in the index) of the K most similar articles

array([[   0, 1683, 1021,  348, 1712]])

In [27]:
D # this contains the distances to those articles (IndexFlatL2 reports squared L2 distances)

array([[  0.     , 181.52379, 230.63765, 244.74084, 245.87946]],
      dtype=float32)

**Note**:  
I broke the process down into separate steps on purpose, to make each one easy to follow. You can also put everything together into a single function.  
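
One possible shape for such a single function is sketched below. `search_similar_articles` is a hypothetical helper, and `ToyIndex` is only a stand-in so that the sketch runs without Faiss installed; it assumes an index object whose `search(query, k)` method returns `(distances, positions)` with one row per query, as in the notebook.

```python
def search_similar_articles(index, query_vector, abstracts, k=5):
    """Return (position, distance, abstract) for the k nearest stored vectors."""
    distances, positions = index.search(query_vector, k)
    return [
        (int(pos), float(dist), abstracts[int(pos)])
        for dist, pos in zip(distances[0], positions[0])
    ]


class ToyIndex:
    """Stand-in for a Faiss index: exhaustive squared-L2 search over a list."""

    def __init__(self, vectors):
        self.vectors = vectors

    def search(self, queries, k):
        query = queries[0]
        scored = sorted(
            (sum((q - v) ** 2 for q, v in zip(query, vec)), pos)
            for pos, vec in enumerate(self.vectors)
        )[:k]
        return [[d for d, _ in scored]], [[p for _, p in scored]]


toy_abstracts = ["first", "second", "third"]
toy_index = ToyIndex([[0.0, 0.0], [5.0, 5.0], [1.0, 0.0]])
hits = search_similar_articles(toy_index, [[0.0, 0.0]], toy_abstracts, k=2)
print(hits)  # [(0, 0.0, 'first'), (2, 1.0, 'third')]
```

In the notebook, the equivalent call would presumably be `search_similar_articles(indexFlatL2, process_query(query_text), data.abstract.values, k=K)`.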

In [28]:
for i in range(I.shape[1]):
    
    article_index = I[0, i]
    
    abstract = data.iloc[article_index].abstract
    print("** Article #{} **".format(article_index))
    print("** --> Abstract : \n{}**".format(abstract))
    print("** --> L2 Distance: %.2f**" % D[0, i])
    print("\n")

** Article #0 **
** --> Abstract : 
Graphical Abstract Highlights d The cryo-EM structure of full-length human NPC1 was determined at 4.4 Å resolution d Structure-guided biochemical analysis of cholesterol transfer from NPC2 to NPC1 d Low-resolution cryo-EM structure of NPC1 bound to GPcl of Ebola virus was obtained d A trimeric GPcl binds to one NPC1 through the crystal structure-revealed interface**
** --> L2 Distance: 0.00**


** Article #1683 **
** --> Abstract : 
Highlights d ENDU-2 nuclease regulates nucleotide metabolism and germ cell proliferation in worms d ENDU-2 expression is induced by nucleotide imbalance and other genotoxic stresses d ENDU-2 inhibits CTP synthase phosphorylation by repressing PKA and HDA-1 in the gut d ENDU-2 function may be conserved in mammalian cells**
** --> L2 Distance: 181.52**


** Article #1021 **
** --> Abstract : 
Abstract A series of 2-amino-5-bromo-4(3H)-pyrimidinone derivatives bearing different substituents at the C-6 position were synthesiz

**Observation**   
- The lower the distance, the more similar the article is to the query.   
- The first document has a distance of 0, meaning it is identical to the query. This is expected, because the query was compared with itself. 
- We can simply remove it from the results.
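
The distances above are on a different scale from the cosine scores obtained earlier, because L2 distance also depends on vector length. For vectors scaled to unit length the two measures are directly related: the squared L2 distance equals `2 - 2 * cosine`, so both produce the same ranking. The sketch below checks this identity on toy vectors; the helper names are illustrative, and the notebook itself does not normalize its embeddings.

```python
import math


def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


a, b = [3.0, 4.0], [4.0, 3.0]
lhs = squared_l2(normalize(a), normalize(b))  # squared L2 on unit vectors
rhs = 2 - 2 * cosine_sim(a, b)                # same quantity via cosine
print(math.isclose(lhs, rhs))  # True
```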

Made with ♥️ by Zoumana   
Did you like it? Give it an upvote and [let's connect](https://medium.com/@zoumanakeita)