<a href="https://colab.research.google.com/github/masadeghi/journal_finder/blob/main/journal_finder.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Clone GitHub repo to have access to datasets

**Note: precomputed journal encodings are available in the 'journal_scope_encodings' directory of the GitHub repo. If you load the encodings from these .pkl files, you can skip to the 'Define function that computes similarity ...' section.**

In [27]:
!git clone https://github.com/masadeghi/journal_finder

Cloning into 'journal_finder'...
remote: Enumerating objects: 19, done.
remote: Counting objects: 100% (19/19), done.
remote: Compressing objects: 100% (15/15), done.
remote: Total 19 (delta 3), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (19/19), done.


# Read datasets containing the scope of each journal

These datasets were generated from our scimago_scrape.py script.

In [1]:
import pandas as pd

# Read datasets and drop journals without a scope
def read_journal_dataset(file_path):
  """
  Given the file_path of a journal information dataset, read the file as a pandas
  DataFrame and drop journals without a scope.

  Args:
    file_path (str): The path of a .csv file containing the journal information
  
  Returns:
    dataset (DataFrame): A pandas DataFrame of journal information.
  """
  dataset = pd.read_csv(file_path)
  dataset.dropna(axis = 0, subset = ["Scope"], inplace = True) # Drop rows without a scope
  dataset = dataset.reset_index(drop = True) # Reset indices
  return dataset

In [2]:
# biochem_molbio_dataset = read_journal_dataset(file_path)
# immuno_micro_dataset = read_journal_dataset(file_path)
# med_dataset = read_journal_dataset(file_path)
neuro_dataset = read_journal_dataset("/content/journal_finder/scraped_from_scimago/neuro_journals.csv")
pharma_toxico_dataset = read_journal_dataset("/content/journal_finder/scraped_from_scimago/pharma_toxico_journals.csv")

In [30]:
pharma_toxico_dataset.head()

Unnamed: 0,Rank,Sourceid,Title,SJR,SJR Best Quartile,URL,Scope
0,1,20425,"""Nature Reviews Drug Discovery""",11.296,Q1,https://www.scimagojr.com/journalsearch.php?q=...,nature reviews drug discovery is a monthly jou...
1,2,21191,"""Pharmacological Reviews""",5.54,Q1,https://www.scimagojr.com/journalsearch.php?q=...,pharmacological reviews presents important rev...
2,3,19479,"""Annual Review of Pharmacology and Toxicology""",4.002,Q1,https://www.scimagojr.com/journalsearch.php?q=...,the annual review of pharmacology and toxicolo...
3,4,4700152457,"""Nano Today""",3.89,Q1,https://www.scimagojr.com/journalsearch.php?q=...,nano today publishes original articles on all ...
4,5,12611,"""Drug Resistance Updates""",3.845,Q1,https://www.scimagojr.com/journalsearch.php?q=...,drug resistance updates is a bimonthly publica...


# Encoding the journal scopes

To do this, we use the pretrained BERT model by Google, which has been trained on MEDLINE/PubMed data. The model description can be found here:

https://tfhub.dev/google/experts/bert/pubmed/2

## Import model from TensorFlow Hub

In [3]:
!pip3 install "tensorflow-text==2.8.*"

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
# Imports TF ops for preprocessing.
import tensorflow_text as text

# Load the BERT encoder and preprocessing models
preprocess = hub.load('https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3')
bert = hub.load('https://tfhub.dev/google/experts/bert/pubmed/2')

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


# Define function for encoding a single string

In [4]:
def get_string_encoding(text):
  """
  Given a string, calculate its encoding using BERT.

  Args:
    text (str): The text to be encoded.

  Returns:
    encoding (tensor): A tensor of shape (1, 768) containing the pooled_output
    of the BERT model.
  """
  text_list = [text]
  
  # Convert the text to bert inputs
  bert_inputs = preprocess(text_list)

  # Feed the inputs to the model to get the pooled output
  bert_outputs = bert(bert_inputs, training=False)
  pooled_output = bert_outputs['pooled_output']

  return pooled_output

# Define function for encoding a list of strings

In [5]:
def get_journal_encodings(dataset):
  """
  Given a journal information dataset, encode the scope of each journal.

  Args:
    dataset (DataFrame): Journal information with 'Title' and 'Scope' columns.

  Returns:
    journal_encodings (dict): Maps each journal title to its scope encoding.
  """
  # Initialize a dictionary for the journal encoding tensors
  journal_encodings = {}

  # Convert journal scopes and titles from DataFrame columns to lists to reduce RAM usage
  journal_scopes = dataset.loc[:, 'Scope'].tolist()
  journal_names = dataset.loc[:, 'Title'].tolist()

  # Due to limitations in RAM capacity on the free version of Colab,
  # we run the code for a maximum of 2000 journals
  n_journals = min(len(journal_scopes), 2000)

  for i in range(n_journals):
    journal_encodings[journal_names[i]] = get_string_encoding(journal_scopes[i])

    # Report progress every 100 scopes (this check must sit inside the loop)
    if i % 100 == 0 and i != 0:
      print(f"Successfully gone through {i} scopes")

  return journal_encodings
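
The loop above encodes one scope per call. Since `get_string_encoding` already passes a list of strings to `preprocess`, scopes could in principle be encoded several at a time. Below is a minimal sketch of only the batching step; `batched` is a hypothetical helper that is not part of this notebook, and the example scopes are stand-ins.

```python
def batched(items, batch_size):
  """Yield consecutive slices of `items` with at most `batch_size` elements.

  Hypothetical helper for illustration; not part of the notebook.
  """
  if batch_size < 1:
    raise ValueError("batch_size must be at least 1")
  for start in range(0, len(items), batch_size):
    yield items[start:start + batch_size]

# Stand-in scopes; in the notebook these would come from the 'Scope' column.
example_scopes = ["scope a", "scope b", "scope c", "scope d", "scope e"]
example_batches = list(batched(example_scopes, 2))
```

Each batch could then be passed to `preprocess` and `bert` in one call, which should return one row of `pooled_output` per scope. Whether this is faster, or fits in Colab RAM for a given batch size, was not measured here.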

## Get the encodings for each dataset

In [6]:
# biochem_molbio_encodings = get_journal_encodings(biochem_molbio_dataset)
# immuno_micro_encodings = get_journal_encodings(immuno_micro_dataset)
# med_encodings = get_journal_encodings(med_dataset)
# neuro_encodings = get_journal_encodings(neuro_dataset)
pharma_toxico_encodings = get_journal_encodings(pharma_toxico_dataset)

## Save journal encodings

In [35]:
import pickle

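def save_dict(dictionary, name):
  # Pickle the encodings dictionary to '<name>.pkl'.
  # The parameter is not called `dict`, to avoid shadowing the built-in type.
  output_file = name + '.pkl'
  with open(output_file, "wb") as f:
    pickle.dump(dictionary, f)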

In [36]:
# save_dict(biochem_molbio_encodings, "biochem_molbio_encodings")
# save_dict(immuno_micro_encodings, "immuno_micro_encodings")
# save_dict(med_encodings, "med_encodings")
# save_dict(neuro_encodings, "neuro_encodings")
save_dict(pharma_toxico_encodings, "pharma_toxico_encodings")
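
The saved .pkl files can be read back with `pickle.load`. Below is a minimal sketch of that round trip, assuming a hypothetical `load_dict` helper (not defined in this notebook) and a stand-in dictionary of plain lists in place of the BERT tensors.

```python
import os
import pickle
import tempfile

def load_dict(path):
  """Load a pickled dictionary from `path` (hypothetical counterpart to save_dict)."""
  with open(path, "rb") as f:
    return pickle.load(f)

# Stand-in encodings: plain lists instead of tensors, for illustration only.
example_encodings = {"Journal A": [0.1, 0.2], "Journal B": [0.3, 0.4]}

# Write to a temporary directory so the example does not touch the working directory.
example_path = os.path.join(tempfile.mkdtemp(), "example_encodings.pkl")
with open(example_path, "wb") as f:
  pickle.dump(example_encodings, f)

restored_encodings = load_dict(example_path)
```

The notebook's own encodings are TensorFlow tensors, so TensorFlow most likely needs to be importable when they are loaded. As with any pickle file, only load files from a source you trust.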

## Define the cosine similarity function as a measure of similarity between two vectors

In [7]:
def cosine_similarity(abstract_encoding, journal_encoding):
  abstract_encoding = tf.cast(abstract_encoding, dtype = tf.float32)
  journal_encoding = tf.cast(journal_encoding, dtype = tf.float32)
  numerator = tf.matmul(abstract_encoding, tf.transpose(journal_encoding))
  denominator = tf.norm(abstract_encoding) * tf.norm(journal_encoding)
  return numerator/denominator
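
The function above divides the dot product of the two encodings by the product of their norms. Below is a dependency-free sketch of the same formula on plain Python lists; `cosine_similarity_lists` is an illustrative name and not part of the notebook.

```python
import math

def cosine_similarity_lists(a, b):
  """Cosine similarity of two equal-length lists of numbers (illustrative helper)."""
  if len(a) != len(b):
    raise ValueError("vectors must have the same length")
  dot = sum(x * y for x, y in zip(a, b))
  norm_a = math.sqrt(sum(x * x for x in a))
  norm_b = math.sqrt(sum(y * y for y in b))
  return dot / (norm_a * norm_b)

# Parallel vectors point in the same direction, so the similarity is close to 1.
same_direction = cosine_similarity_lists([1.0, 2.0], [2.0, 4.0])

# Perpendicular vectors have a zero dot product, so the similarity is 0.
orthogonal = cosine_similarity_lists([1.0, 0.0], [0.0, 3.0])
```

The score depends only on the angle between the vectors, not on their lengths, which is why scaling `[1.0, 2.0]` to `[2.0, 4.0]` leaves the similarity unchanged.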

## Define function that computes similarity between abstract and journal list

In [9]:
def similar_journals():
  """
  Ask the user to input their manuscript abstract, desired subject category, and
  the number of journals they would like to see.
  
  Returns:
    output (DataFrame): Journal titles and their similarity scores, sorted
    from most to least similar.
  """
  abstract = input("Paste your manuscript abstract:")
  category = input("Choose your manuscript subject category:\n1) Biochemistry, Genetics, and Molecular Biology\n2) Immunology and Microbiology\n3) Medicine\n4) Neuroscience\n5) Pharmacology, Toxicology and Pharmaceutics [1/2/3/4/5]")
  number = input("How many journals would you like to see?")

  if category == '1':
    dataset = biochem_molbio_encodings
  elif category == '2':
    dataset = immuno_micro_encodings
  elif category == '3':
    dataset = med_encodings
  elif category == '4':
    dataset = neuro_encodings
  elif category == '5':
    dataset = pharma_toxico_encodings
  else:
    # Stop here: without a valid category, `dataset` would be undefined below
    print("Please rerun the program and enter a valid category.")
    return None

  # Get bert encoding of abstract
  abstract_encoding = get_string_encoding(abstract)

  # Compute similarity scores between abstract encodings and journal scope encodings
  similarity_scores = []

  for name, journal_encoding in dataset.items():
    similarity = cosine_similarity(abstract_encoding, journal_encoding)
    similarity = similarity[0][0]
    similarity_scores.append(similarity)

  sorted_indices = sorted(
      range(len(similarity_scores)),
      key = lambda index: similarity_scores[index],
      reverse = True
  )

  sorted_journal_names = [list(dataset)[index] for index in sorted_indices]
  sorted_similarity_scores = [similarity_scores[index] for index in sorted_indices]

  output = pd.DataFrame.from_dict({'Title' : sorted_journal_names,
                                  'Similarity score' : sorted_similarity_scores})
  
  print(output[:int(number)])

  return output
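
The ranking step in `similar_journals` sorts journals by descending similarity and then shows the first few. Below is a minimal sketch of that step in isolation, assuming a hypothetical `top_k_journals` helper and made-up scores that are not real model output.

```python
def top_k_journals(scores, k):
  """Return the k (title, score) pairs with the highest scores (illustrative helper)."""
  ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
  return ranked[:k]

# Made-up similarity scores for illustration only.
example_scores = {"Journal A": 0.91, "Journal B": 0.97, "Journal C": 0.88}
top_two = top_k_journals(example_scores, 2)
```

Sorting the (title, score) pairs together keeps each title attached to its score, so no separate index lookup is needed.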

## Test the function

In [10]:
recommended_journals = similar_journals()

Paste your manuscript abstract:Current pharmacological treatments against post-traumatic stress disorder (PTSD) lack adequate efficacy. As a result, intense research has focused on identifying other molecular pathways mediating the pathogenesis of this condition. One such pathway is neuroinflammation, which has demonstrated a role in PTSD pathogenesis by causing synaptic dysfunction, neuronal death, and functional impairment in the hippocampus. Phosphodiesterase (PDE) inhibitors (PDEIs) have emerged as promising therapeutic agents against neuroinflammation in other neurological conditions. Furthermore, PDEIs have shown some promise in animal models of PTSD. However, the current model of PTSD pathogenesis, which is based on dysregulated fear learning, implies that PDE inhibition in neurons should enhance the acquisition of fear memory from the traumatic event. As a result, we hypothesized that PDEIs may improve PTSD symptoms through inhibiting neuroinflammation rather than learning and 