# SciBERT + Cosine Similarity on Neuroscience Data

Apply the pretrained SciBERT transformer model and cosine similarity to recommend reviewers: the authors of neuroscience papers whose abstracts are semantically similar to the abstract the user submits as a query.

# Approach
- Load the pretrained SciBERT model and tokenizer
- Vectorize the abstracts by creating embeddings
- Run a semantic similarity search using cosine similarity
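
To make the last step concrete: the cosine similarity of two embedding vectors u and v is their dot product divided by the product of their lengths, so it ranges from -1 to 1 and ignores vector magnitude. The following is a minimal pure-Python sketch on toy 3-dimensional vectors. The helper name `cosine` and the vectors are illustrative assumptions, not part of the notebook, which uses scikit-learn's `cosine_similarity` on 768-dimensional SciBERT embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

query = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]   # a scaled copy of the query
orthogonal = [0.0, 0.0, 3.0]       # shares no component with the query

print(cosine(query, same_direction))  # very close to 1.0
print(cosine(query, orthogonal))      # 0.0
```

A scaled copy of the query scores (up to floating-point rounding) 1, while a vector with no shared components scores 0, which is why the measure suits comparing abstracts of different lengths.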


# Libraries

In [1]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import warnings

# Hugging Face Transformer libraries
!pip install transformers
import torch
from transformers import BertTokenizer, AutoModelForSequenceClassification

# Cosine similarity for the semantic search
from sklearn.metrics.pairwise import cosine_similarity

warnings.filterwarnings("ignore")

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting transformers
  Downloading transformers-4.20.1-py3-none-any.whl (4.4 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.8.1 tokenizers-0.12.1 transformers-4.20.1

## Data loading

In [2]:
data = pd.read_json("bioarxiv_parsed.json") 
print("Data Shape: {}".format(data.shape))

Data Shape: (3948, 5)


In [3]:
# Percentage of missing column values
null_check_percent = data.isnull().sum() * 100 / len(data)
null_check_percent

title       0.0
abstract    0.0
doi         0.0
authors     0.0
source      0.0
dtype: float64

In [4]:
# remove articles with missing abstract
data = data.dropna(subset = ['abstract'])
data = data.reset_index(drop = True)
null_check_percent = data.isnull().sum() * 100 / len(data)
null_check_percent

title       0.0
abstract    0.0
doi         0.0
authors     0.0
source      0.0
dtype: float64

## Load the pretrained SciBERT model and tokenizer

Set `output_hidden_states` to `True` so that the model returns the hidden states from which the embeddings are extracted.

In [5]:
# Hugging Face Hub identifier of the pretrained SciBERT model from AllenAI
pretrained_model = 'allenai/scibert_scivocab_uncased'

# Load the matching tokenizer
sciBERT_tokenizer = BertTokenizer.from_pretrained(pretrained_model, 
                                          do_lower_case=True)

# Load the model. Its classification head is newly initialized and unused here;
# only the encoder's hidden states are needed for the embeddings.
model = AutoModelForSequenceClassification.from_pretrained(pretrained_model,
                                                          output_attentions=False,
                                                          output_hidden_states=True)


Some weights of the model checkpoint at allenai/scibert_scivocab_uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification we

In [6]:
model.eval()

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(31090, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

## Create an embedding for a given text using the pretrained SciBERT model

Reference: [3, 4]

In [7]:
def convert_single_abstract_to_embedding(tokenizer, model, in_text, MAX_LEN = 510):
    
    # Tokenize; truncation=True makes the cut-off at max_length explicit
    input_ids = tokenizer.encode(
                        in_text, 
                        add_special_tokens = True, 
                        max_length = MAX_LEN,
                        truncation = True,
                   )    

    results = pad_sequences([input_ids], maxlen=MAX_LEN, dtype="long", 
                              truncating="post", padding="post")
    
    # Remove the outer list.
    input_ids = results[0]

    # Create attention masks    
    attention_mask = [int(i>0) for i in input_ids]
    
    # Convert to tensors.
    input_ids = torch.tensor(input_ids)
    attention_mask = torch.tensor(attention_mask)

    # Add an extra dimension for the "batch" (even though there is only one 
    # input in this batch.)
    input_ids = input_ids.unsqueeze(0)
    attention_mask = attention_mask.unsqueeze(0)
    
    # Put the model in evaluation mode, which disables dropout so the
    # forward pass is deterministic.
    model.eval()

 
    # Run the text through BERT and collect the hidden states: the embedding
    # output followed by the outputs of all 12 encoder layers.
    with torch.no_grad():        
        logits, encoded_layers = model(
                                    input_ids = input_ids, 
                                    token_type_ids = None, 
                                    attention_mask = attention_mask,
                                    return_dict=False)

    layer_i = 12 # The last BERT layer before the classifier.
    batch_i = 0 # Only one input in the batch.
    token_i = 0 # The first token, corresponding to [CLS]
        
    # Extract the embedding.
    embedding = encoded_layers[layer_i][batch_i][token_i]

    # Move to the CPU and convert to numpy ndarray.
    embedding = embedding.detach().cpu().numpy()

    return embedding
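
The padding and masking steps inside `convert_single_abstract_to_embedding` can be sketched without Keras. The helper below, `pad_and_mask`, is a hypothetical pure-Python stand-in for `pad_sequences(..., truncating="post", padding="post")` followed by the attention-mask comprehension. It assumes, as the function above does, that the padding token id is 0. The token ids in the example are made up for illustration.

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Truncate or right-pad token_ids to max_len and build the attention mask."""
    # "post" truncation: keep the first max_len ids
    ids = list(token_ids[:max_len])
    # "post" padding: append pad_id until the sequence is max_len long
    ids += [pad_id] * (max_len - len(ids))
    # 1 for real tokens, 0 for padding positions
    mask = [int(i != pad_id) for i in ids]
    return ids, mask

# Made-up token ids standing in for a short tokenized abstract
ids, mask = pad_and_mask([102, 2746, 1331, 103], max_len=6)
print(ids)   # [102, 2746, 1331, 103, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

The mask tells the model which positions carry real tokens, so the padding added to reach a fixed length does not influence the attention computation.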

In [8]:
!pip install keras

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting keras
  Downloading keras-2.9.0-py2.py3-none-any.whl (1.6 MB)
Installing collected packages: keras
Successfully installed keras-2.9.0

In [9]:
!pip3 install tensorflow

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting tensorflow
  Downloading tensorflow-2.9.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (511.7 MB)
Collecting gast<=0.4.0,>=0.2.1
  Downloading gast-0.4.0-py3-none-any.whl (9.8 kB)
Collecting astunparse>=1.6.0
  Downloading astunparse-1.6.3-py2.py3-none-any.whl (12 kB)
Collecting flatbuffers<2,>=1.12
  Downloading flatbuffers-1.12-py2.py3-none-any.whl (15 kB)
Collecting opt-einsum>=2.3.2
  Downloading opt_einsum-3.3.0-py3-none-any.whl (65 kB)
Collecting keras-preprocessing>=1.1.1
  Downloading Keras_Preprocessing-1.1.2-py2.py3-none-any.whl (42 kB)

## Use the model and tokenizer to generate an embedding for one abstract (row index 3)


In [10]:
from keras_preprocessing.sequence import pad_sequences

input_abstract = data.abstract.iloc[3]

abstract_embedding = convert_single_abstract_to_embedding(sciBERT_tokenizer, model, input_abstract)

print('Embedding shape: {}'.format(abstract_embedding.shape))

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Embedding shape: (768,)


The embedding is a vector of 768 values, matching SciBERT's hidden size.
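
Where that 768-value vector comes from: with `output_hidden_states=True`, a 12-layer BERT model returns a tuple of 13 tensors (the embedding-layer output plus one per encoder layer), each of shape `(batch, tokens, 768)`. The sketch below mimics that structure with random NumPy arrays rather than real model output, so the values are meaningless and the sequence length is shortened. Only the indexing `[layer][batch][token]` mirrors the function above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, batch, seq_len, hidden = 12, 1, 16, 768  # seq_len shortened for the sketch

# Stand-in for the model's hidden states: index 0 is the embedding output,
# indices 1..12 are the encoder layers.
hidden_states = tuple(
    rng.standard_normal((batch, seq_len, hidden)) for _ in range(n_layers + 1)
)

layer_i, batch_i, token_i = 12, 0, 0   # last layer, only input, [CLS] position
cls_embedding = hidden_states[layer_i][batch_i][token_i]

print(len(hidden_states))     # 13
print(cls_embedding.shape)    # (768,)
```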

## Create embeddings for all the abstracts

In [11]:
def convert_all_abstract_text_to_embedding(df):
    
    # The list of all the embeddings
    embeddings = []
    
    # Get all abstract texts from the DataFrame passed in
    overall_text_data = df.abstract.values
    
    # Loop over all the abstracts and get the embeddings
    for abstract in tqdm(overall_text_data):
        
        # Get the embedding 
        embedding = convert_single_abstract_to_embedding(sciBERT_tokenizer, model, abstract)
        
        # Add it to the list
        embeddings.append(embedding)
        
    print("Conversion Done!")
    
    return embeddings

In [12]:
# This can take a while depending on the number of abstracts (about 9 minutes for the 3,948 here)

embeddings = convert_all_abstract_text_to_embedding(data)

100%|██████████| 3948/3948 [08:58<00:00,  7.34it/s]

Conversion Done!





In [None]:
embeddings = np.array(embeddings)
np.save('embeddings.npy', embeddings)

In [14]:
embeddings = np.load('embeddings.npy')
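
Because computing the embeddings took roughly nine minutes, it can help to compute them only when no saved file exists yet. A minimal sketch, assuming a hypothetical helper `load_or_compute` and a stand-in `fake_embed` function in place of the real SciBERT pass (neither is part of the notebook):

```python
import os
import tempfile
import numpy as np

def load_or_compute(path, compute_fn):
    """Return the array saved at `path`, computing and saving it first if absent."""
    if os.path.exists(path):
        return np.load(path)
    array = np.asarray(compute_fn())
    np.save(path, array)
    return array

calls = []

def fake_embed():
    # Stand-in for the slow embedding loop; records that it was invoked
    calls.append(1)
    return [np.zeros(768, dtype=np.float32) for _ in range(3)]

cache_path = os.path.join(tempfile.mkdtemp(), "embeddings.npy")
first = load_or_compute(cache_path, fake_embed)    # computes and saves
second = load_or_compute(cache_path, fake_embed)   # loads from disk
print(first.shape, len(calls))
```

In the notebook, `compute_fn` would be something like `lambda: convert_all_abstract_text_to_embedding(data)`. The cache should be deleted whenever the data or the model changes.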

In [16]:
embeddings_dim = len(embeddings[0])
embeddings_dim

768

In [14]:
# Load the pickle library
import pickle

# Open embeddings.pkl in binary write mode and serialize the embeddings

with open('embeddings.pkl', 'wb') as files:
    pickle.dump(embeddings, files)

In [16]:
# Load the saved embeddings
with open('embeddings.pkl', 'rb') as f:
    embeddings = pickle.load(f) 

In [17]:
embeddings_dim = len(embeddings[0])
embeddings_dim

768

In [19]:
# Create a new column that holds the embedding of each abstract
def create_final_embeddings(df, embeddings):
    
    # list(...) yields one 1-D vector per row, whether `embeddings` is a list or a 2-D array
    df["embeddings"] = list(embeddings)
    # Reshape each vector to (1, 768), the 2-D shape cosine_similarity expects
    df["embeddings"] = df["embeddings"].apply(lambda emb: np.array(emb).reshape(1, -1))
    
    return df

In [20]:
data = create_final_embeddings(data, embeddings)
data.head(3)

| | title | abstract | doi | authors | source | embeddings |
|---|---|---|---|---|---|---|
| 0 | The natverse: a versatile computational toolbo... | To analyse neuron data at scale, neuroscientis... | 10.1101/006353 | [{'author': 'Bates, A. S.', 'number on Paper':... | bioarxiv | [[0.75996983, -0.70971274, 0.76068234, -0.2982... |
| 1 | Long-range functional coupling predicts perfor... | The integration of sensory signals from differ... | 10.1101/014423 | [{'author': 'Wang, P.', 'number on Paper': 1, ... | bioarxiv | [[0.90075547, 0.055804443, 0.05721068, -0.1960... |
| 2 | Medial prefrontal cortex population activity i... | Cortical population activity may represent sam... | 10.1101/027102 | [{'author': 'Singh, A.', 'number on Paper': 1,... | bioarxiv | [[-0.11369413, -0.3029109, 0.44304308, -0.6325... |


# Cosine Similarity Search 

## Utility functions

In [22]:
def process_query(query_text):
    """ 
    Create a vector for given query and adjust it for cosine similarity search
    """

    query_vect = convert_single_abstract_to_embedding(sciBERT_tokenizer, model, query_text)
    query_vect = np.array(query_vect)
    query_vect = query_vect.reshape(1, -1)
    return query_vect



In [23]:
def get_top_N_similar_articles_cosine(query_text, data, top_N=10):
    """
    Retrieve the top_N (default 10) articles whose abstracts are most similar to the query
    """
    query_vect = process_query(query_text)
    relevant_cols = ["title", "abstract", "doi", "authors", "source", "cosine_similarity"]
    
    # Run the similarity search: score every stored embedding against the query
    data["cosine_similarity"] = data["embeddings"].apply(lambda x: cosine_similarity(query_vect, x))
    data["cosine_similarity"] = data["cosine_similarity"].apply(lambda x: x[0][0])
    
    # Sort by cosine similarity in descending order.
    # The slice starts at 1 to drop the top hit, which is the query itself
    # (similarity 1) when the query abstract comes from `data`. For a query
    # that is not in `data`, this also drops the best match.
    most_similar_articles = data.sort_values(by='cosine_similarity', ascending=False)[1:top_N+1]
    
    return most_similar_articles[relevant_cols]
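
The row-by-row `apply` above calls `cosine_similarity` once per article. One possible alternative, sketched here with plain NumPy on toy data, stacks the embeddings into a single matrix and scores every row in one matrix product. The function name `top_n_cosine` and its `skip_self` flag are assumptions for illustration, not part of the notebook. `skip_self=False` is meant for queries that are not themselves rows of the dataset, where dropping the top hit would discard the best match.

```python
import numpy as np

def top_n_cosine(query_vec, matrix, top_n=10, skip_self=False):
    """Return (row indices, scores) of the top_n rows most similar to query_vec."""
    query = np.asarray(query_vec, dtype=float).ravel()
    rows = np.asarray(matrix, dtype=float)
    # Normalise so that a dot product equals cosine similarity
    query = query / np.linalg.norm(query)
    rows = rows / np.linalg.norm(rows, axis=1, keepdims=True)
    scores = rows @ query
    order = np.argsort(-scores)            # descending similarity
    start = 1 if skip_self else 0          # optionally drop the query's own row
    picked = order[start:start + top_n]
    return picked, scores[picked]

# Toy 2-D "embeddings"; real SciBERT vectors have 768 dimensions
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
idx, sims = top_n_cosine(toy[0], toy, top_n=2, skip_self=True)
print(idx, sims)
```

With the notebook's data, `matrix` would be the `(3948, 768)` array of embeddings, and the returned indices could be passed to `data.iloc[...]` to recover titles and authors.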

In [24]:
query_text_test = data.iloc[0].abstract # query abstract input

top_articles = get_top_N_similar_articles_cosine(query_text_test, data) # 10 similar recommendations in descending order

In [25]:
top_articles

| | title | abstract | doi | authors | source | cosine_similarity |
|---|---|---|---|---|---|---|
| 779 | Open Source Brain: a collaborative resource fo... | Computational models are powerful tools for in... | 10.1101/229484 | [{'author': 'Gleeson, P.', 'number on Paper': ... | bioarxiv | 0.897324 |
| 282 | AFQ-Browser: Supporting reproducible human neu... | Human neuroscience research faces several chal... | 10.1101/182402 | [{'author': 'Yeatman, J. D.', 'number on Paper... | bioarxiv | 0.895607 |
| 281 | AFQ-Browser: Supporting reproducible human neu... | Human neuroscience research faces several chal... | 10.1101/182402 | [{'author': 'Yeatman, J. D.', 'number on Paper... | bioarxiv | 0.895607 |
| 3933 | Real-time experimental control using network-b... | Modern neuroscience research often requires th... | 10.1101/392654 | [{'author': 'Kim, B.', 'number on Paper': 1, '... | bioarxiv | 0.894662 |
| 2440 | A low-cost hyperspectral scanner for natural i... | Hyperspectral imaging is a widely used technol... | 10.1101/322172 | [{'author': 'Nevala, N. E.', 'number on Paper'... | bioarxiv | 0.892926 |
| 1801 | AutonoMouse: High throughput automated operant... | Operant conditioning is a crucial tool in neur... | 10.1101/291815 | [{'author': 'Erskine, A.', 'number on Paper': ... | bioarxiv | 0.885081 |
| 1622 | Regional protein expression in human Alzheimer... | Alzheimers disease (AD) is a progressive neuro... | 10.1101/283705 | [{'author': 'Xu, J.', 'number on Paper': 1, 'i... | bioarxiv | 0.884527 |
| 1203 | Variation among intact tissue samples reveals ... | It is widely assumed that cells must be physic... | 10.1101/265397 | [{'author': 'Kelley, K. W.', 'number on Paper'... | bioarxiv | 0.877315 |
| 1591 | An Open-source Tool for Analysis and Automatic... | Synaptic plasticity, the cellular basis for le... | 10.1101/281667 | [{'author': 'Smirnov, M. S.', 'number on Paper... | bioarxiv | 0.874579 |
| 1133 | Inferring and validating mechanistic models of... | The interpretation of neuronal spike train rec... | 10.1101/261016 | [{'author': 'Ladenbauer, J.', 'number on Paper... | bioarxiv | 0.874049 |


In [26]:
top_articles.iloc[0].abstract

'Computational models are powerful tools for investigating brain function in health and disease. However, biologically detailed neuronal and circuit models are complex and implemented in a range of specialized languages, making them inaccessible and opaque to many neuroscientists. This has limited critical evaluation of models by the scientific community and impeded their refinement and widespread adoption. To address this, we have combined advances in standardizing models, open source software development and web technologies to develop Open Source Brain, a platform for visualizing, simulating, disseminating and collaboratively developing standardized models of neurons and circuits from a range of brain regions. Model structure and parameters can be visualized and their dynamical properties explored through browser-controlled simulations, without writing code. Open Source Brain makes neural models transparent and accessible and facilitates testing, critical evaluation and refinement, 

In [27]:
top_articles.iloc[0].authors

[{'author': 'Gleeson, P.',
  'number on Paper': 1,
  'institution': 'University College London'},
 {'author': ' Cantarelli, M.',
  'number on Paper': 2,
  'institution': 'University College London'},
 {'author': ' Marin, B.',
  'number on Paper': 3,
  'institution': 'University College London'},
 {'author': ' Quintana, A.',
  'number on Paper': 4,
  'institution': 'University College London'},
 {'author': ' Earnshaw, M.',
  'number on Paper': 5,
  'institution': 'University College London'},
 {'author': ' Piasini, E.',
  'number on Paper': 6,
  'institution': 'University College London'},
 {'author': ' Birgiolas, J.',
  'number on Paper': 7,
  'institution': 'University College London'},
 {'author': ' Cannon, R. C.',
  'number on Paper': 8,
  'institution': 'University College London'},
 {'author': ' Cayco-Gajic, N. A.',
  'number on Paper': 9,
  'institution': 'University College London'},
 {'author': ' Crook, S.',
  'number on Paper': 10,
  'institution': 'University College London'}

In [28]:
top_articles.iloc[1].abstract

'Human neuroscience research faces several challenges with regards to reproducibility. While scientists are generally aware that data sharing is an important component of reproducible research, it is not always clear how to usefully share data in a manner that allows other labs to understand and reproduce published findings. Here we describe a new open source tool, AFQ-Browser, that builds an interactive website as a companion to a published diffusion MRI study. Because AFQ-browser is portable -- it runs in any modern web-browser -- it can facilitate transparency and data sharing. Moreover, by leveraging new web-visualization technologies to create linked views between different dimensions of a diffusion MRI dataset (anatomy, quantitative diffusion metrics, subject metadata), AFQ-Browser facilitates exploratory data analysis, fueling new scientific discoveries based on previously published datasets. In an era where Big Data is playing an increasingly prominent role in scientific discov

In [29]:
top_articles.iloc[2].abstract

'Human neuroscience research faces several challenges with regards to reproducibility. While scientists are generally aware that data sharing is an important component of reproducible research, it is not always clear how to usefully share data in a manner that allows other labs to understand and reproduce published findings. Here we describe a new open source tool, AFQ-Browser, that builds an interactive website as a companion to a published diffusion MRI study. Because AFQ-browser is portable -- it runs in any modern web-browser -- it can facilitate transparency and data sharing. Moreover, by leveraging new web-visualization technologies to create linked views between different dimensions of a diffusion MRI dataset (anatomy, quantitative diffusion metrics, subject metadata), AFQ-Browser facilitates exploratory data analysis, fueling new scientific discoveries based on previously published datasets. In an era where Big Data is playing an increasingly prominent role in scientific discov

In [30]:
top_articles.iloc[3].abstract

'Modern neuroscience research often requires the coordination of multiple processes such as stimulus generation, real-time experimental control, as well as behavioral and neural measurements. The technical demands required to simultaneously manage these processes with high temporal fidelity limits the number of labs capable of performing such work. Here we present an open-source network-based parallel processing framework that eliminates these barriers. The Real-Time Experimental Control with Graphical User Interface (REC-GUI) framework offers multiple advantages: (i) a modular design agnostic to coding language(s) and operating system(s) that maximizes experimental flexibility and minimizes researcher effort, (ii) simple interfacing to connect measurement and recording devices, (iii) high temporal fidelity by dividing task demands across CPUs, and (iv) real-time control using a fully customizable and intuitive GUI. Testing results demonstrate that the REC-GUI framework facilitates tec

# Cosine Similarity for a User-Input Abstract

In [31]:
query_text_test = input()  # read the query abstract from the user
print("--------------------------------------------------------------------------------------------------------------------\n")
print("********************     RECOMMENDATIONS     *************\n")

top_articles = get_top_N_similar_articles_cosine(query_text_test, data)  # recommend the top 10 articles by cosine similarity


Our visual environment impacts multiple aspects of cognition including perception, attention and memory, yet most studies traditionally remove or control the external environment. As a result, we have a limited understanding of neurocognitive processes beyond the controlled lab environment. Here, we aim to study neural processes in real-world environments, while also maintaining a degree of control over perception. To achieve this, we combined mobile EEG (mEEG) and augmented reality (AR), which allows us to place virtual objects into the real world. We validated this AR and mEEG approach using a well-characterised cognitive response-the face inversion effect. Participants viewed upright and inverted faces in three EEG tasks (1) a lab-based computer task, (2) walking through an indoor environment while seeing face photographs, and (3) walking through an indoor environment while seeing virtual faces. We find greater low frequency EEG activity for inverted compared to upright faces in all

In [32]:
top_articles # top 10 recommendations

| | title | abstract | doi | authors | source | cosine_similarity |
|---|---|---|---|---|---|---|
| 3102 | Pins & Needles: Towards Limb Disownership in A... | The seemingly stable construct of our bodily s... | 10.1101/349795 | [{'author': 'Kannape, O. A.', 'number on Paper... | bioarxiv | 0.915672 |
| 906 | Decoding digits and dice with Magnetoencephalo... | Numerical format describes the way magnitude i... | 10.1101/249342 | [{'author': 'Teichmann, L.', 'number on Paper'... | bioarxiv | 0.913674 |
| 1984 | word2brain | Mapping brain functions to their underlying ne... | 10.1101/299024 | [{'author': 'Nunes, A.', 'number on Paper': 1,... | bioarxiv | 0.912345 |
| 2343 | More is Better: Using Machine Learning Techniq... | A basic aim of marketing research is to predic... | 10.1101/317073 | [{'author': 'Hakim, A.', 'number on Paper': 1,... | bioarxiv | 0.911026 |
| 2344 | Pathways to Consumers Minds: Using Machine Lea... | A basic aim of marketing research is to predic... | 10.1101/317073 | [{'author': 'Hakim, A.', 'number on Paper': 1,... | bioarxiv | 0.911026 |
| 1081 | Real-time decoding of selective attention from... | Humans are highly skilled at analysing complex... | 10.1101/259853 | [{'author': 'Etard, O.', 'number on Paper': 1,... | bioarxiv | 0.910582 |
| 1082 | Decoding of selective attention to continuous ... | Humans are highly skilled at analysing complex... | 10.1101/259853 | [{'author': 'Etard, O.', 'number on Paper': 1,... | bioarxiv | 0.910582 |
| 1232 | Alpha-band oscillations track the retrieval of... | A hallmark of episodic memory is the phenomeno... | 10.1101/207860 | [{'author': 'Sutterer, D. W.', 'number on Pape... | bioarxiv | 0.909132 |
| 3152 | Non-assortative community structure in resting... | Brain networks exhibit community structure tha... | 10.1101/355016 | [{'author': 'Betzel, R. F.', 'number on Paper'... | bioarxiv | 0.908134 |
| 2038 | Spotting the path that leads nowhere: Modulati... | The capacity to take efficient detours and exp... | 10.1101/301697 | [{'author': 'Javadi, A.-H.', 'number on Paper'... | bioarxiv | 0.907643 |


In [33]:
# Save the recommendations; to_json returns None when given a path, so there is nothing to assign
top_articles.to_json('cos_sim_top_10.json')


In [34]:
df_cos_sim = pd.read_json("cos_sim_top_10.json") 
print("Data Shape: {}".format(df_cos_sim.shape))
df_cos_sim.head(10)

Data Shape: (10, 6)


| | title | abstract | doi | authors | source | cosine_similarity |
|---|---|---|---|---|---|---|
| 3102 | Pins & Needles: Towards Limb Disownership in A... | The seemingly stable construct of our bodily s... | 10.1101/349795 | [{'author': 'Kannape, O. A.', 'number on Paper... | bioarxiv | 0.915672 |
| 906 | Decoding digits and dice with Magnetoencephalo... | Numerical format describes the way magnitude i... | 10.1101/249342 | [{'author': 'Teichmann, L.', 'number on Paper'... | bioarxiv | 0.913674 |
| 1984 | word2brain | Mapping brain functions to their underlying ne... | 10.1101/299024 | [{'author': 'Nunes, A.', 'number on Paper': 1,... | bioarxiv | 0.912345 |
| 2343 | More is Better: Using Machine Learning Techniq... | A basic aim of marketing research is to predic... | 10.1101/317073 | [{'author': 'Hakim, A.', 'number on Paper': 1,... | bioarxiv | 0.911026 |
| 2344 | Pathways to Consumers Minds: Using Machine Lea... | A basic aim of marketing research is to predic... | 10.1101/317073 | [{'author': 'Hakim, A.', 'number on Paper': 1,... | bioarxiv | 0.911026 |
| 1081 | Real-time decoding of selective attention from... | Humans are highly skilled at analysing complex... | 10.1101/259853 | [{'author': 'Etard, O.', 'number on Paper': 1,... | bioarxiv | 0.910582 |
| 1082 | Decoding of selective attention to continuous ... | Humans are highly skilled at analysing complex... | 10.1101/259853 | [{'author': 'Etard, O.', 'number on Paper': 1,... | bioarxiv | 0.910582 |
| 1232 | Alpha-band oscillations track the retrieval of... | A hallmark of episodic memory is the phenomeno... | 10.1101/207860 | [{'author': 'Sutterer, D. W.', 'number on Pape... | bioarxiv | 0.909132 |
| 3152 | Non-assortative community structure in resting... | Brain networks exhibit community structure tha... | 10.1101/355016 | [{'author': 'Betzel, R. F.', 'number on Paper'... | bioarxiv | 0.908134 |
| 2038 | Spotting the path that leads nowhere: Modulati... | The capacity to take efficient detours and exp... | 10.1101/301697 | [{'author': 'Javadi, A.-H.', 'number on Paper'... | bioarxiv | 0.907643 |


In [252]:
top_articles.iloc[0].abstract  # read abstract to see research similarity

'The seemingly stable construct of our bodily self depends on the continued, successful integration of multisensory feedback about our body, rather than its purely physical composition. Accordingly, pathological disruption of such neural processing is linked to striking alterations of the bodily self, ranging from limb misidentification to disownership, and even the desire to amputate a healthy limb. While previous embodiment research has relied on experimental setups using supernumerary limbs in variants of the Rubber Hand Illusion, we here used Augmented Reality to directly manipulate the feeling of ownership for ones own, biological limb. Using a Head-Mounted Display, participants received visual feedback about their own arm, from an embodied first-person perspective. In a series of three studies, in independent cohorts, we altered embodiment by providing visuotactile feedback that could be synchronous (control condition) or asynchronous (400ms delay, Real Hand Illusion). During the

In [35]:
top_articles.iloc[0].authors  # top recommended reviewers

[{'author': 'Kannape, O. A.',
  'number on Paper': 1,
  'institution': 'Swiss Institute of Technology Lausanne (EPFL)'},
 {'author': ' Smith, E. J.',
  'number on Paper': 2,
  'institution': 'Swiss Institute of Technology Lausanne (EPFL)'},
 {'author': ' Moseley, P.',
  'number on Paper': 3,
  'institution': 'Swiss Institute of Technology Lausanne (EPFL)'},
 {'author': ' Roy, M.',
  'number on Paper': 4,
  'institution': 'Swiss Institute of Technology Lausanne (EPFL)'},
 {'author': ' Lenggenhager, B.',
  'number on Paper': 5,
  'institution': 'Swiss Institute of Technology Lausanne (EPFL)'}]

In [36]:
top_articles.iloc[2].authors

[{'author': 'Nunes, A.',
  'number on Paper': 1,
  'institution': 'Dalhousie University'}]

In [37]:
top_articles.authors

3102    [{'author': 'Kannape, O. A.', 'number on Paper...
906     [{'author': 'Teichmann, L.', 'number on Paper'...
1984    [{'author': 'Nunes, A.', 'number on Paper': 1,...
2343    [{'author': 'Hakim, A.', 'number on Paper': 1,...
2344    [{'author': 'Hakim, A.', 'number on Paper': 1,...
1081    [{'author': 'Etard, O.', 'number on Paper': 1,...
1082    [{'author': 'Etard, O.', 'number on Paper': 1,...
1232    [{'author': 'Sutterer, D. W.', 'number on Pape...
3152    [{'author': 'Betzel, R. F.', 'number on Paper'...
2038    [{'author': 'Javadi, A.-H.', 'number on Paper'...
Name: authors, dtype: object
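
To turn these author lists into a flat list of candidate reviewers, something along the following lines could work. The helper `candidate_reviewers` is hypothetical. It assumes each record has the `doi` and `authors` fields shown above, strips the leading spaces visible in some author names, and skips repeated DOIs (the top-10 list contains the same DOI more than once). The sample records are abbreviated from the output above.

```python
def candidate_reviewers(records):
    """Collect unique author names, in ranking order, from article records."""
    seen_dois, seen_names, reviewers = set(), set(), []
    for record in records:
        if record["doi"] in seen_dois:      # same paper listed twice
            continue
        seen_dois.add(record["doi"])
        for entry in record["authors"]:
            name = entry["author"].strip()  # names may carry a leading space
            if name not in seen_names:
                seen_names.add(name)
                reviewers.append(name)
    return reviewers

# Abbreviated records modelled on the notebook's output
sample = [
    {"doi": "10.1101/349795",
     "authors": [{"author": "Kannape, O. A."}, {"author": " Smith, E. J."}]},
    {"doi": "10.1101/317073", "authors": [{"author": "Hakim, A."}]},
    {"doi": "10.1101/317073", "authors": [{"author": "Hakim, A."}]},
]
print(candidate_reviewers(sample))
# ['Kannape, O. A.', 'Smith, E. J.', 'Hakim, A.']
```

With the real DataFrame, the records could be obtained with `top_articles.to_dict("records")`.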

# References:

1. Beltagy, Iz, Kyle Lo, and Arman Cohan. "SciBERT: A Pretrained Language Model for Scientific Text." EMNLP, Association for Computational Linguistics, 2019, https://www.aclweb.org/anthology/D19-1371.

2. Johnson, Jeff, Matthijs Douze, and Hervé Jégou. "Billion-scale similarity search with GPUs." IEEE Transactions on Big Data, vol. 7, no. 3, 2019, pp. 535-547.

3. McCormick, Chris. "BERT Word Embeddings Tutorial." 14 May 2019, https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/#3-extracting-embeddings.

4. Keita, Zoumana. "Scientific Documents Similarity Search with Deep Learning Using Transformers (SciBERT)." Medium, Towards Data Science, 17 Jan. 2022, https://towardsdatascience.com/scientific-documents-similarity-search-with-deep-learning-using-transformers-scibert-d47c4e501590.

5. Beltagy, Iz, Matthew E. Peters, and Arman Cohan. "Longformer: The Long-Document Transformer." arXiv:2004.05150, 2020.
