# Topic Modeling Scientific Text

*This template and workflow were developed by Margaret Gratian. This set of notebooks can be used to find topics in scientific text.*
____________________________________
## 3. Produce SPECTER Embeddings

**Notebook Goals**
- Demonstrate the process to embed titles and abstracts from scientific publications using the SPECTER model so that they can be used for topic modeling.

**Requirements**
- This notebook requires the sentence-transformers library. Learn more about it here: https://sbert.net/.
- Please see the README for instructions and recommendations on proper installation.

**Embedding Details**
- This notebook uses Sentence Transformers and the SPECTER embedding model to produce embeddings of size 768.
- For additional implementation examples, see:
    - https://huggingface.co/sentence-transformers/allenai-specter
    - https://github.com/allenai/specter
    - https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/semantic-search/semantic_search_publications.py

**Major Caveats**
- Data should be subset to the articles that have titles and abstracts.
- The text to embed must follow the format title + [SEP] + abstract, which is the input format the SPECTER model expects.
- The SPECTER embedding model is one of many Transformer-based embedding models that could be used. Another model well suited to scientific text is https://huggingface.co/NeuML/pubmedbert-base-embeddings, a sentence-transformers model fine-tuned from https://huggingface.co/microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext.

**References**

We make use of work from the following papers:
- Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, and Daniel Weld. 2020. SPECTER: Document-level Representation Learning using Citation-informed Transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2270–2282, Online. Association for Computational Linguistics.
- Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv preprint: [arXiv:1908.10084](https://arxiv.org/abs/1908.10084).

**Inputs**

The following assumes you used the recommended path for saving your data in Notebook 2. If you modified it, be sure to modify the input path here.

- Input Filepath 1: "../data/pubmed_text_tabular.csv"
    - Table of formatted PubMed articles, with columns PMID, title, and abstract

**Outputs**

The following is a recommended path for saving your data. If you modify it, be sure to modify the inputs and outputs of subsequent notebooks.

- Output Filepath 1: "../data/SPECTER_embeddings.csv"
    - Titles and abstracts embedded with the SPECTER model

## Import Packages

In [None]:
import pandas as pd
from sentence_transformers import SentenceTransformer

## Functions

In [None]:
def embed(text, model):
    """
    Takes as input text and an embedding model and returns the embedding as a 1-D NumPy array. Note that the model must
    already be loaded for this to work.

    Parameters:
    -----------
    text: string
        A string of the text to embed
    model: SentenceTransformer object
        A pre-trained embedding model

    Returns:
    ---------
    embedding: numpy.ndarray of floats, or None
        Vector embedding of the text. None is returned if the text is missing.
    """

    # First check that text is not missing
    if pd.isna(text):
        # Return None, not possible to embed
        return None

    # Get the embedding
    embedding = model.encode([text])

    # Return the embedding
    # Note that encode() returns a 2-D array with one row per input text
    # Return the first row because we are embedding one text at a time
    return embedding[0]

## Read in Data

In [None]:
# Read in the tabular PubMed data
input_df = pd.read_csv("../data/pubmed_text_tabular.csv", index_col=0)
print(input_df.shape)

# Preview
input_df.head()

## Load Pre-Trained Embedding Model

Model card: https://huggingface.co/sentence-transformers/allenai-specter

In [None]:
# Load the allenai-specter model with SentenceTransformers
model = SentenceTransformer("allenai-specter")

## Dataset Development

### Prep Data for Embedding

- First make a copy of input_df
- Check and drop any duplicate rows
- Drop rows with missing titles and/or abstracts
- Some titles and abstracts are surrounded by [], so we strip these out
- Then add a new column that combines the title and abstract, joined by '[SEP]'

In [None]:
# Make a copy of the input data 
embedded_df = input_df.copy()
print(embedded_df.shape)

In [None]:
# Drop any duplicate rows
embedded_df = embedded_df.drop_duplicates()
print(embedded_df.shape)

In [None]:
# Drop any rows with missing values
embedded_df = embedded_df.dropna()
print(embedded_df.shape)

In [None]:
# Preview the data
# Note the [] around some titles
embedded_df.head()

In [None]:
# Strip any []
embedded_df["title"] = embedded_df["title"].str.strip("[]")
embedded_df["abstract"] = embedded_df["abstract"].str.strip("[]")

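# We also strip "]" and "." from the end because sometimes a period follows the closing bracket, as in "[Title]."
# Note that rstrip removes any trailing run of these characters, so a final period is removed as well
# If needed, we could add further text processing here, potentially using regexes to further clean the data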
embedded_df["title"] = embedded_df["title"].str.rstrip("].")
embedded_df["abstract"] = embedded_df["abstract"].str.rstrip("].")

# Preview
embedded_df.head()
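
Because `str.strip("[]")` removes every leading and trailing bracket character and `str.rstrip("].")` also removes a final period, the cell above can change text that was never wrapped in brackets. A regex-based alternative could look like the following sketch. It assumes the only pattern of interest is a whole string wrapped as `[text]` or `[text].` with no other brackets inside; the helper name `unwrap_brackets` and the example titles are made up for illustration.

```python
import re

# Matches a string fully wrapped in one pair of square brackets, optionally
# followed by a period, e.g. "[Some title]." or "[Some title]".
# Strings with additional brackets inside are deliberately left alone.
_WRAPPED = re.compile(r"^\[([^\[\]]*)\]\.?$")


def unwrap_brackets(text):
    """Remove one wrapping [ ] pair (and a period after it); otherwise return text unchanged."""
    if not isinstance(text, str):
        return text  # leave missing values (NaN/None) untouched
    match = _WRAPPED.match(text.strip())
    return match.group(1).strip() if match else text


examples = [
    "[Treatment of asthma in children].",
    "[A bracketed title]",
    "An ordinary title that ends with a period.",
    "Dose [mg] reported in brackets",
]
for example in examples:
    print(repr(unwrap_brackets(example)))
```

In the notebook this could be applied with `embedded_df["title"].apply(unwrap_brackets)` and likewise for the `abstract` column.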

In [None]:
# Add a column that is the combination of title + abstract
embedded_df["title_abstract"] = embedded_df["title"] + "[SEP]" + embedded_df["abstract"]

# Preview
# Uncomment the option below to display the full content of each row and column
# pd.set_option('display.max_colwidth', None)
embedded_df.head()

In [None]:
# Data checks - any empty rows?
embedded_df.info()
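
`info()` reports only missing (NaN) values, so a title or abstract that became an empty string after stripping would still count as non-null. Below is a minimal sketch of an extra check, assuming the `title` and `abstract` column names used in this notebook. The helper name `count_blank_strings` and the small example table are hypothetical.

```python
import pandas as pd


def count_blank_strings(df, columns):
    """Count values per column that are empty or whitespace-only strings."""
    counts = {}
    for col in columns:
        # Missing values are handled by dropna() elsewhere, so skip them here
        non_missing = df[col].dropna().astype(str)
        counts[col] = int((non_missing.str.strip() == "").sum())
    return counts


# Made-up rows: two blank titles, one missing abstract
toy_df = pd.DataFrame(
    {
        "title": ["A title", "", "  "],
        "abstract": ["Some text", "More text", None],
    }
)
print(count_blank_strings(toy_df, ["title", "abstract"]))
```

In the notebook, `count_blank_strings(embedded_df, ["title", "abstract", "title_abstract"])` would flag rows worth dropping before embedding.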

### Embed Titles and Abstracts

In [None]:
# Apply the embedding model to each title + abstract and save as a new column
embedded_df["title_abstract_embedding"] = embedded_df["title_abstract"].apply(embed, args=(model,))

# See column info
embedded_df["title_abstract_embedding"].info()
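
Calling the model once per row works but can be slow for large tables. `SentenceTransformer.encode` also accepts a list of texts together with a `batch_size` argument, so a whole column can be encoded in one call. The following is a sketch of that alternative, not the approach used in this notebook. The helper name `embed_column` and the stand-in model are hypothetical, and the stand-in only mimics the shape of the real output.

```python
import numpy as np
import pandas as pd


def embed_column(texts, model, batch_size=32):
    """Encode a Series of non-missing strings in batches; return a Series of 1-D vectors."""
    vectors = model.encode(texts.tolist(), batch_size=batch_size)
    # Keep the original index so the result aligns with the source DataFrame
    return pd.Series(list(vectors), index=texts.index)


class _StandInModel:
    """Mimics the encode() output shape so the sketch runs without a real model."""

    def encode(self, sentences, batch_size=32):
        return np.zeros((len(sentences), 768), dtype=np.float32)


# Made-up inputs with a non-contiguous index, as after dropping rows
toy_texts = pd.Series(["Title A[SEP]Abstract A", "Title B[SEP]Abstract B"], index=[3, 7])
toy_embeddings = embed_column(toy_texts, _StandInModel())
print(toy_embeddings.apply(len).tolist())
```

With the real model, this would be `embed_column(embedded_df["title_abstract"], model)`, assuming rows with missing text have already been dropped as in the cells above.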

### Preview Data

In [None]:
embedded_df.head()

In [None]:
# Check that we have no missing values
embedded_df.info()

In [None]:
# Look at an example - we have vectors of size 768
# Use positional access because the row with index label 0 may have been dropped above
print(len(embedded_df["title_abstract_embedding"].iloc[0]))

## Save Outputs

In [None]:
# Save the DataFrame with embeddings
embedded_df.to_csv("../data/SPECTER_embeddings.csv")
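
One thing to keep in mind when reading this file back: a CSV stores each embedding as text, so `pd.read_csv` returns strings rather than numeric vectors. The sketch below shows one way to convert such a string back into an array, assuming the vectors were written in NumPy's default printed form (whitespace-separated numbers inside square brackets, possibly spanning several lines). The helper name `parse_embedding` and the sample string are hypothetical; inspect a saved value from your own file before relying on it.

```python
import numpy as np


def parse_embedding(cell):
    """Convert a bracketed, whitespace-separated string of numbers into a 1-D float array."""
    # Drop the surrounding brackets, then split on any whitespace (spaces or newlines)
    tokens = cell.strip().strip("[]").split()
    return np.array([float(token) for token in tokens])


# Made-up example of how a short vector might look after a CSV round trip
sample_cell = "[ 0.12 -0.5\n  1.0e-03  2. ]"
vector = parse_embedding(sample_cell)
print(vector.shape, vector.dtype)
```

After reloading the file, the helper could be applied with `df["title_abstract_embedding"].apply(parse_embedding)`.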