# Topic Modeling Scientific Text

*This template and workflow were developed by Margaret Gratian. This set of notebooks can be used to find topics in scientific text.*
____________________________________
## 2. Produce PubMedBERT Embeddings

**Notebook Goals**
- Demonstrate how to embed NIH grant abstracts with the PubMedBERT model so that they can be used for topic modeling.

**Major Caveats**
- When using Transformer-based models to embed text, preprocessing steps such as removing stop words are not necessary (and in fact, should not be done because Transformer models use context to produce embeddings). However, abstracts from NIH RePORTER often begin with phrases such as "Project Summary" or "Abstract." You might optionally consider removing these with regular expressions, though it is not essential and is not done here.
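If you do choose to strip these leading headers, the following is a minimal sketch of one way to do it with a regular expression. The pattern and the helper name `strip_leading_header` are illustrative assumptions, not part of this workflow; adjust the pattern to the headers you actually see in your data.

```python
import re

# Assumed pattern: a leading "Project Summary", "Project Summary/Abstract", or
# "Abstract" header, followed by optional punctuation and whitespace.
LEADING_HEADER = re.compile(
    r"^\s*(?:project\s+summary(?:\s*/\s*abstract)?|abstract)\b\s*[:.\-]*\s*",
    flags=re.IGNORECASE,
)


def strip_leading_header(text):
    """Remove a leading 'Project Summary' / 'Abstract' header, if present."""
    if not isinstance(text, str):
        return text  # leave missing values (None/NaN) untouched
    return LEADING_HEADER.sub("", text, count=1)


print(strip_leading_header("Project Summary: This study examines sleep."))
```

Note that this simple pattern also trims a sentence that genuinely starts with the word "Abstract" (for example, "Abstract thinking is..."), so inspect a sample of results before applying it to the whole column, e.g. with `embedded_df["abstract_text"].apply(strip_leading_header)`.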

**Requirements**
- This notebook requires the sentence-transformers library. Learn more about it here: https://sbert.net/.
- Please see the README for instructions and recommendations on proper installation.

**Embedding Details**
- This notebook uses the PubMedBERT model to produce embeddings of size 768: https://huggingface.co/NeuML/pubmedbert-base-embeddings
- This model is a PubMedBERT-base model fine-tuned using Sentence Transformers. The base model was originally built by Microsoft: https://huggingface.co/microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext

**References**

We make use of work from the following papers:
- Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint: [arXiv:1908.10084](https://arxiv.org/abs/1908.10084).

**Inputs**

The following assumes you used the recommended path for saving your data in Notebook 1. If you modified it, be sure to modify the input path here.

- Input Filepath 1: "../data/reporter_results.csv"
    - Table of awards from a RePORTER request.

**Outputs**

The following is a recommended path for saving your data. If you modify it, be sure to modify the inputs and outputs of subsequent notebooks.

- Output Filepath 1: "../data/pubmedbert_embeddings.csv"
     - Grant abstracts embedded with the PubMedBERT model.

## Import Packages

In [None]:
import pandas as pd
from sentence_transformers import SentenceTransformer

## Functions

In [None]:
def embed(text, model):
    """
    Takes as input text and an embedding model and returns the embedding as a 1-D NumPy array.
    Note that the model must already be loaded for this to work.

    Parameters:
    -----------
    text: string
        A string of the text to embed
    model: SentenceTransformer object
        A pre-trained embedding model

    Returns:
    ---------
    embedding: numpy.ndarray of floats, or None
        Vector embedding of the text. None is returned if the text is missing.
    """
    
    # First check that text is not missing
    if pd.isna(text):
        # Return None, not possible to embed
        return None

    # Get the embedding
    embedding = model.encode([text])
    
    # Return the embedding
    # Note encode() returns a 2-D array with one row per input text
    # Return the first row because we are embedding one text at a time
    return embedding[0]

## Read in Data

In [None]:
# Read in the tabular RePORTER data
input_df = pd.read_csv("../data/reporter_results.csv", index_col=0)
print(input_df.shape)

# Preview
input_df.head()

## Load Pre-Trained Embedding Model

https://huggingface.co/NeuML/pubmedbert-base-embeddings

In [None]:
# Load the PubMedBERT embeddings model with SentenceTransformers
model = SentenceTransformer("NeuML/pubmedbert-base-embeddings")

## Dataset Development

### Prep Data for Embedding

- First make a copy of input_df
- Drop exact duplicate rows
- Drop rows with missing abstracts
- Check that the data is unique for Appl Id (the NIH application ID that uniquely identifies awarded applications)

In [None]:
# Make a copy of the input data 
embedded_df = input_df.copy()
print(embedded_df.shape)

In [None]:
# Drop any exact duplicate rows
embedded_df = embedded_df.drop_duplicates()
print(embedded_df.shape)

In [None]:
# Drop any with missing abstracts
embedded_df = embedded_df.dropna(subset=["abstract_text"])
print(embedded_df.shape)

In [None]:
# Check if the data is unique for Appl Id, which uniquely identifies records
# The number of distinct IDs should equal the row count printed above
embedded_df[["appl_id"]].nunique()

In [None]:
# Preview the data
embedded_df[["abstract_text"]].head()

### Embed Abstracts

In [None]:
# Apply the embedding model to each abstract and save as a new column
embedded_df["abstract_text_embedding"] = embedded_df["abstract_text"].apply(embed, args=(model,))

# See column info
embedded_df["abstract_text_embedding"].info()
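
Embedding one abstract per call is simple but can be slow on large datasets. As an optional alternative, `SentenceTransformer.encode` accepts a list of texts and a `batch_size` argument, so many abstracts can be encoded in one call. The sketch below illustrates the idea; `embed_batch` is a hypothetical helper, and `FakeModel` is a made-up stand-in used only so the example runs without downloading a model.

```python
def embed_batch(texts, model, batch_size=32):
    """Embed many texts in one encode() call; missing texts map to None."""
    # Remember the positions of usable (string) texts
    positions = [i for i, t in enumerate(texts) if isinstance(t, str)]
    results = [None] * len(texts)
    if positions:
        vectors = model.encode([texts[i] for i in positions], batch_size=batch_size)
        # Put each vector back at the position of its source text
        for i, vec in zip(positions, vectors):
            results[i] = vec
    return results


class FakeModel:
    """Hypothetical stand-in for a SentenceTransformer, for illustration only."""

    def encode(self, sentences, batch_size=32):
        # Returns one 3-dimensional dummy vector per sentence
        return [[float(len(s)), 0.0, 1.0] for s in sentences]


demo = embed_batch(["short text", None, "a longer text"], FakeModel())
print(demo)
```

With the real model, you would pass a plain list such as `embedded_df["abstract_text"].tolist()` together with the loaded `model`, then assign the returned list to a new column.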

### Preview Data

In [None]:
embedded_df.head()

In [None]:
# Check that we have no missing embedded abstracts
embedded_df[["abstract_text", "abstract_text_embedding"]].info()

In [None]:
# Look at an example - we have vectors of size 768
# Use .iloc because index label 0 may have been dropped above
print(len(embedded_df["abstract_text_embedding"].iloc[0]))

## Save Outputs

In [None]:
# Save the DataFrame with embeddings
embedded_df.to_csv("../data/pubmedbert_embeddings.csv")
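
When a column holds NumPy arrays, `to_csv` writes each cell as the array's printed text (space-separated values with no commas), which takes extra parsing to read back. If that becomes a problem, one option is to serialize each embedding as a JSON string before saving. The helpers below are a hypothetical sketch of that idea and are not part of this workflow; if later notebooks expect the current format, leave the save cell as is.

```python
import json


def embedding_to_json(vec):
    """Serialize a vector (list or NumPy array) to a JSON string; None stays None."""
    if vec is None:
        return None
    return json.dumps([float(x) for x in vec])


def embedding_from_json(cell):
    """Parse a JSON string produced by embedding_to_json back into a list of floats."""
    if not isinstance(cell, str):
        return None  # missing value
    return json.loads(cell)


original = [0.25, -1.5, 3.0]
restored = embedding_from_json(embedding_to_json(original))
print(restored)
```

Under this approach, `embedding_to_json` would be applied to the embedding column before saving, and `embedding_from_json` would be applied to the same column after reading the CSV back in.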