<a href="https://colab.research.google.com/github/masadeghi/journal_finder/blob/main/finetuned_BERT_journal_recommender.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Workflow:**

Our intention is to use a pretrained BERT model to generate emeddings from a given abstract and a list of scopes. Then we'll determine the closest scope to our abstract using their cosine similarity.

To improve performance, we want to finetune the model on a dataset of [scope, abstract] pairs. As a result, the first step is to generate these pairs through scraping PubMed.

**Note:** If you're not interested in the scraping part, you could skip to the "Alternatively, use the ..." section and use the dataset available at this link:

https://drive.google.com/file/d/18a8qnM37rwKbAXZEHOdQZhp9dldbjtCt/view?usp=share_link


# **1.1. Scrape journal title and abstract information from PubMed**

## Clone github repo to access datasets scraped form scimagojr.com

In [265]:
!git clone https://github.com/masadeghi/journal_finder

fatal: destination path 'journal_finder' already exists and is not an empty directory.


## Load the scraped datasets containing journal name and scope data

In [266]:
import pandas as pd

pharma_toxico_dataset = pd.read_csv("/content/journal_finder/scraped_from_scimago/pharma_toxico_journals.csv")

pharma_toxico_dataset["Title"] = pharma_toxico_dataset["Title"].str.replace('"', '')

pharma_toxico_dataset.head()

Unnamed: 0,Rank,Sourceid,Title,SJR,SJR Best Quartile,URL,Scope
0,1,20425,Nature Reviews Drug Discovery,11.296,Q1,https://www.scimagojr.com/journalsearch.php?q=...,nature reviews drug discovery is a monthly jou...
1,2,21191,Pharmacological Reviews,5.54,Q1,https://www.scimagojr.com/journalsearch.php?q=...,pharmacological reviews presents important rev...
2,3,19479,Annual Review of Pharmacology and Toxicology,4.002,Q1,https://www.scimagojr.com/journalsearch.php?q=...,the annual review of pharmacology and toxicolo...
3,4,4700152457,Nano Today,3.89,Q1,https://www.scimagojr.com/journalsearch.php?q=...,nano today publishes original articles on all ...
4,5,12611,Drug Resistance Updates,3.845,Q1,https://www.scimagojr.com/journalsearch.php?q=...,drug resistance updates is a bimonthly publica...


## Make desired Pubmed URLs from journal titles

The desired Pubmed URLs refer to Pubmed's output page following advanced search with the journal title (in the journal field) and the following modifiers:

- Filter: Only papers with abstracts
- Display: Papers sorted by data (descending)
- Display: Show 100 items on page
- Display: Show paper title and abstract

In [3]:
# Make URL-list using the Title of each journal
def dataset_to_url(dataset):
  """
  Given a dataset with a column "Title", add a "Pubmed URL" column to the
  dataset containing the URLs for each journal on Pubmed

  Args:
    dataset (DataFrame): A DataFrame containing a "Title" column.
  
  Returns:
    dataset (DataFrame): The input dataset with an added column "Pubmed URL" 
    containing the URLs for each journal
  """

  URL_list = ['https://pubmed.ncbi.nlm.nih.gov/?term=%22' + i.replace(' ', '+') + '"%5BJournal%5D&filter=simsearch1.fha&format=abstract&sort=date&size=100' for i in dataset["Title"]]
  dataset["Pubmed URL"] = URL_list

  return dataset

In [4]:
pharma_toxico_dataset = dataset_to_url(pharma_toxico_dataset) # The function could also be applied on other datasets in the repo

pharma_toxico_dataset.head()

Unnamed: 0,Rank,Sourceid,Title,SJR,SJR Best Quartile,URL,Scope,Pubmed URL
0,1,20425,Nature Reviews Drug Discovery,11.296,Q1,https://www.scimagojr.com/journalsearch.php?q=...,nature reviews drug discovery is a monthly jou...,https://pubmed.ncbi.nlm.nih.gov/?term=%22Natur...
1,2,21191,Pharmacological Reviews,5.54,Q1,https://www.scimagojr.com/journalsearch.php?q=...,pharmacological reviews presents important rev...,https://pubmed.ncbi.nlm.nih.gov/?term=%22Pharm...
2,3,19479,Annual Review of Pharmacology and Toxicology,4.002,Q1,https://www.scimagojr.com/journalsearch.php?q=...,the annual review of pharmacology and toxicolo...,https://pubmed.ncbi.nlm.nih.gov/?term=%22Annua...
3,4,4700152457,Nano Today,3.89,Q1,https://www.scimagojr.com/journalsearch.php?q=...,nano today publishes original articles on all ...,https://pubmed.ncbi.nlm.nih.gov/?term=%22Nano+...
4,5,12611,Drug Resistance Updates,3.845,Q1,https://www.scimagojr.com/journalsearch.php?q=...,drug resistance updates is a bimonthly publica...,https://pubmed.ncbi.nlm.nih.gov/?term=%22Drug+...


## Scrape abstracts from each Pubmed URL

In [5]:
# Setting up the scraping 
from bs4 import BeautifulSoup
import requests

# Setting the User Agent for requests to prevent IP blocking
headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (Linux; Android 12; SM-T875) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36',
})

In [6]:
# Function for scraping all abstracts in a Pubmed search page
def abstract_finder(journal_url):

  page = requests.get(journal_url)

  soup = BeautifulSoup(page.content, 'html.parser')

  abstract_list = []

  for i in range(1, 101):

    id_text = 'search-result-1-' + f'{i}' + '-eng-abstract'

    abstract = soup.find(id = id_text)

    if abstract != None:
      abstract = abstract.get_text()
      abstract = abstract.replace("\n", "")
      abstract = abstract.strip()
      abstract = abstract.lower()
      abstract = " ".join(abstract.split())

    abstract_list.append(abstract)
  
  return abstract_list

In [7]:
# Function for scraping abstracts for all Pubmed URLs in a dataset
import random

def dataset_to_abstract(dataset):

  dataset_abstracts = []

  for i in range(len(dataset)):

    journal_url = dataset.loc[i, 'Pubmed URL']

    abstract_list = abstract_finder(journal_url)

    random.shuffle(abstract_list)

    dataset_abstracts.append(abstract_list)

    if i % 50 == 0 and i != 0:
      print(i)
  
  return dataset_abstracts    

In [8]:
pharma_toxico_abstracts = dataset_to_abstract(pharma_toxico_dataset) # The function could also be applied on other datasets in the repo

50
100
150
200
250
300
350
400
450
500
550
600
650
700


## Save scraped abstracts

In [13]:
pharma_toxico_abstracts_df = pd.DataFrame(pharma_toxico_abstracts, index = pharma_toxico_dataset["Title"]).transpose()

pharma_toxico_abstracts_df.to_csv('./pharma_toxico_abstracts_df.csv')

# **1.2. Alternatively, use the presaved dataset containing abstracts**

In [None]:
# Download the file from the link given at the beginning of the notebook and
# upload it to your Google Drive to use the following two cells.
from google.colab import drive

drive.mount('/content/gdrive')

In [11]:
import pandas as pd

pharma_toxico_abstracts_df = pd.read_csv('/content/gdrive/MyDrive/Coding projects/journal_finder/pharma_toxico_abstracts_df.csv')

pharma_toxico_abstracts_df = pharma_toxico_abstracts_df.drop(columns = "Unnamed: 0")

pharma_toxico_abstracts_df.head()

Unnamed: 0,Nature Reviews Drug Discovery,Pharmacological Reviews,Annual Review of Pharmacology and Toxicology,Nano Today,Drug Resistance Updates,Journal for ImmunoTherapy of Cancer,Trends in Pharmacological Sciences,Emerging Microbes and Infections,Pharmacology and Therapeutics,Molecular Therapy,...,Pharmacognosy Research (discontinued),Pharmacologyonline (discontinued),Reference Series in Phytochemistry,Research Journal of Pharmacognosy,Revista de Chimie (discontinued),Science and Technology Indonesia,Systematic Reviews in Pharmacy (discontinued),Trends in Phytochemical Research,"Tromboz, Gemostaz i Reologiya",World Economics
0,an amendment to this paper has been published ...,relapse to drug use during abstinence is a def...,investigations in pharmacology and toxicology ...,thrombosis is a principle cause of various lif...,,"merkel cell carcinoma is a rare, highly aggres...",irritable bowel syndrome (ibs) is a chronic ga...,,,,...,,,,,,,,,,
1,the accumulation of misfolded proteins in the ...,cancer in children is rare with approximately ...,"over the past two decades, deadly coronaviruse...",drug delivery systems (ddss) face several chal...,,background: the addition of cetuximab signific...,pseudouridine (ψ) is the most abundant post-tr...,,,,...,,,,,,,,,,
2,atopic dermatitis (ad) is a common chronic inf...,"rna-based therapies, including rna molecules a...",antiplatelet therapy is used in the treatment ...,many diseases and conditions affect a relative...,,"in january 2022, the us food and drug administ...",the cases of pancreatic cancer and associated ...,,,,...,,,,,,,,,,
3,diabetes mellitus is a metabolic disorder that...,endogenous ions play important roles in the fu...,cognitive impairment is a core feature of schi...,the pneumonia outbreak of coronavirus disease ...,,background: the complete response rate of cerv...,chronic pain remains a major burden and is dif...,,,,...,,,,,,,,,,
4,oligonucleotides can be used to modulate gene ...,epilepsy is a chronic neurologic disorder that...,camkii (the multifunctional ca2+ and calmoduli...,nanoparticles (nps) have emerged as an effecti...,,why the gut microbiome is critical for the suc...,chronic liver diseases (clds) caused by viral ...,,,,...,,,,,,,,,,


In [12]:
# Remove columns with NaNs
pharma_toxico_abstracts_df = pharma_toxico_abstracts_df.dropna(axis = 1)

# **2. Redistribute the abstracts into [scope, abstract] pairs**

In [13]:
all_pairs = []

for j in pharma_toxico_abstracts_df.columns:
  for i in range(len(pharma_toxico_abstracts_df)): 
    scope = pharma_toxico_dataset.loc[pharma_toxico_dataset["Title"] == j, "Scope"].item()
    pair = [scope, pharma_toxico_abstracts_df.loc[i, j]]
    all_pairs.append(pair)

In [14]:
# Remove pairs were either the scope or abstract is nan
import math

removals = []

for i in range(len(all_pairs)):
  if type(all_pairs[i][0]) != str:
    if math.isnan(all_pairs[i][0]):
      removals = removals + [i]

for i in range(len(all_pairs)):
  if type(all_pairs[i][1]) != str:
    if math.isnan(all_pairs[i][1]):
      removals = removals + [i]

for index in sorted(removals, reverse = True):
  del all_pairs[index]

# **3. Finetune a pretrained model**

In [18]:
# Install sentence-transformers from huggingface
%%capture
pip install -U sentence-transformers

## Prepare data for input into transformers

In [21]:
from sentence_transformers import InputExample

train_examples = []

for i in range(len(all_pairs)):
  example = all_pairs[i]
  train_examples.append(InputExample(texts = [example[0], example[1]]))

## Define cosine similarity function

In [289]:
# Define cosine similarity function
def cosine_similarity(list_1, list_2):
  solution = np.dot(list_1, list_2)/(np.linalg.norm(list_1) * np.linalg.norm(list_2))
  return solution

## Model_A - with default batch_sampler

The difference between model_A and model_B is in their DataLoaders. Model_A uses the default batch_sampler, hence there is a possibility that abstract-scope pairs from the same journal are present in each batch. However, this possibility is eliminated in the DataLoader for model_B.

In [19]:
# Load model
from sentence_transformers import SentenceTransformer

model_A = SentenceTransformer('pritamdeka/PubMedBERT-mnli-snli-scinli-scitail-mednli-stsb')

Downloading:   0%|          | 0.00/1.43k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/190 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/4.46k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/720 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/124 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/768 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/438M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/299 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/706k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/441 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/229 [00:00<?, ?B/s]

In [20]:
# Set the loss functions
from sentence_transformers import losses

train_loss_A = losses.MultipleNegativesRankingLoss(model = model_A)

In [217]:
# Construct the dataloader
from torch.utils.data import DataLoader

data_loader_A = DataLoader(
    dataset = train_examples,
    shuffle = True,
    batch_size = 32
)

In [221]:
# Fit the model
model_A.fit(train_objectives = [(data_loader_A, train_loss_A)], epochs = 5)

Epoch:   0%|          | 0/5 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1007 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1007 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1007 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1007 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1007 [00:00<?, ?it/s]

## Model_B - with custom batch_sampler

In [None]:
# Load model
from sentence_transformers import SentenceTransformer

model_B = SentenceTransformer('pritamdeka/PubMedBERT-mnli-snli-scinli-scitail-mednli-stsb')

In [None]:
# Set the loss functions
from sentence_transformers import losses

train_loss_B = losses.MultipleNegativesRankingLoss(model = model_B)

In [23]:
# Define custom batch sampler
class CustomBatchSampler():

  def __init__(self, y, batch_size):
    n_batches = int(len(y)/batch_size)
    self.journals_in_sample = np.random.choice(range(int(len(y)/100)), size = batch_size, replace = False)
    self.y = y
    self.n_batches = n_batches
  
  def __iter__(self):
    idx = []
    for j in self.journals_in_sample:
        sample_for_each_journal = np.random.randint(0, 100) + j * 100
        idx = idx + [sample_for_each_journal]
    yield idx

  def __len__(self):
    return self.n_batches

In [17]:
# Define the indexes list which is used for stratifying in CustomBatchSampler
import numpy as np

journal_index = []

for i in range(int(len(all_pairs)/100)):
  journal_index = journal_index + [i] * 100

journal_index = np.array(journal_index)

In [25]:
# Construct the dataloader
from torch.utils.data import DataLoader

data_loader_B = DataLoader(
    dataset = train_examples,
    batch_sampler = CustomBatchSampler(y = journal_index, batch_size = 32)
)

In [267]:
model_B.fit(train_objectives = [(data_loader_B, train_loss_B)], epochs = 5)

Epoch:   0%|          | 0/5 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1006 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1006 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1006 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1006 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1006 [00:00<?, ?it/s]

# **4. Compare a sample abstract with the scopes in a dataset**

In [291]:
# Encode the scopes
scope_encodings = []
journal_names = []

for i in range(len(pharma_toxico_dataset)):
  scope = pharma_toxico_dataset.loc[i, "Scope"]
  if type(scope) == str: # Automatically removes the NaNs as they are considered floats
    encoded_scope = model_B.encode([scope])
    scope_encodings.append(encoded_scope)  

    journal_name = pharma_toxico_dataset.loc[i, "Title"]
    journal_names.append(journal_name)

In [281]:
# Encode the abstract of the following article: https://pubmed.ncbi.nlm.nih.gov/36174841/
user_input_encoding = model_A.encode(["While SSRIs are the current first-line pharmacotherapies against post-traumatic stress disorder (PTSD), they suffer from delayed onset of efficacy and low remission rates. One solution is to combine SSRIs with other treatments. Neuronal nitric oxide synthase (nNOS) has been shown to play a role in serotonergic signaling, and there is evidence of synergism between nNOS modulation and SSRIs in models of other psychiatric conditions. Therefore, in this study, we combined subchronic fluoxetine (Flx) with 7-nitroindazole (NI), a selective nNOS inhibitor, and evaluated their efficacy against anxiety-related behavior in an animal model of PTSD. We used the underwater trauma model to induce PTSD in rats. Animals underwent the open field (OFT) and elevated plus maze tests on days 14 (baseline) and 21 (post-treatment) after PTSD induction to assess anxiety-related behaviors. Between the two tests, the rats received daily intraperitoneal injections of 10 mg/kg Flx or saline, and were injected intraperitoneally before the second test with either 15 mg/kg NI or saline. The change in behaviors between the two tests was compared between treatment groups. Individual treatment with both Flx and NI had anxiogenic effects in the OFT. These effects were associated with modest increases in cFOS expression in the hippocampus. Combination therapy with Flx + NI did not show any anxiogenic effects, while causing even higher expression levels of cFOS. In conclusion, addition of NI treatment to subchronic Flx therapy accelerated the abrogation of Flx's anxiogenic properties. Furthermore, hippocampal activity, as evidenced by cFOS expression, was biphasically related to anxiety-related behavior."])

In [302]:
# Compute similarities between abstract and the scopes and sort
similarities = []

for i in scope_encodings:
  similarity = cosine_similarity(i, user_input_encoding.T).item()
  similarities.append(similarity)

sorted_indexes = np.argsort(similarities)

sorted_indexes = sorted_indexes[::-1]

In [305]:
print(pharma_toxico_dataset["Title"][sorted_indexes][:20])

244                        Future Medicinal Chemistry
37                                    Clinical Trials
298    Naunyn-Schmiedeberg's Archives of Pharmacology
108              Expert Opinion on Biological Therapy
120                           Molecular Pharmaceutics
201                       Cardiovascular Therapeutics
306                         Drug Development Research
30                          Current Neuropharmacology
148      Experimental and Clinical Psychopharmacology
186       Contemporary Clinical Trials Communications
159                      Food and Chemical Toxicology
161                                Toxicology Reports
217            Regulatory Toxicology and Pharmacology
280                      Current Radiopharmaceuticals
318              Journal of Experimental Pharmacology
67                             Molecular Pharmacology
375                                        Toxicon: X
204                             Journal of Toxicology
96            European Journ