__NOTE__ This notebook is adapted from a tutorial by Chris McCormick; you can find the original [here](https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial).

## A playground for comparing PLM embeddings


### Introduction

We have added a couple of functions and ideas and removed certain sections throughout, so the original may be easier to follow; please do check it out. The majority, however, is taken directly from the referenced tutorial. The main goal of this repo is to explore how different BERT/RoBERTa models produce embeddings depending on the pre-training and fine-tuning they have received.

If your text data is domain-specific (e.g. legal, financial, academic, industry-specific) or otherwise different from the "standard" text corpus used to train BERT and other language models, you might want to consider either continuing to train BERT with some of your text data or looking for a domain-specific language model.

Faced with this issue, a number of researchers have created their own domain-specific language models. These models are typically created by pre-training the BERT architecture on a domain-specific corpus rather than on the general-purpose English text used for the original BERT model: some (such as SciBERT) are trained *from scratch* with their own vocabulary, while others (such as BioBERT) continue pre-training from the original BERT weights. The result is a model whose word embeddings, and in the from-scratch case vocabulary, are better suited than the original BERT's to domain-specific NLP problems. Some examples include:

- SciBERT (biomedical and computer science literature corpus)
- FinBERT (financial services corpus)
- BioBERT (biomedical literature corpus)
- ClinicalBERT (clinical notes corpus)
- mBERT (corpora from multiple languages)
- patentBERT (patent corpus)

In this notebook, we will explore the embedding space created by differently trained RoBERTa models with a focus on patient safety incident reports.


#### Why not do my own pre-training?

If you think your text is too domain-specific for the generic BERT, your first thought might be to train BERT from scratch on your own dataset. (Just to be clear: BERT was "Pre-Trained" by Google, and we download and "Fine-Tune" Google's pre-trained model on our own data. When I say "train BERT from scratch", I mean specifically re-doing BERT's *pre-training*).

Chances are you won't be able to pre-train BERT on your own dataset, though, for the following reasons. 

**1. Pre-training BERT requires a huge corpus**

BERT-base is a 12-layer neural network with roughly 110 million weights. This enormous size is key to BERT's impressive performance. To train such a complex model, though, (and expect it to work) requires an enormous dataset, on the order of 1B words. Wikipedia is a suitable corpus, for example, with its ~10 million articles. For the majority of applications I assume you won't have a dataset with that many documents. 
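
The "roughly 110 million" figure can be reproduced with some arithmetic. The sketch below assumes the published `bert-base-uncased` configuration (these values do not appear elsewhere in this notebook) and counts the encoder plus pooler that `AutoModel` loads, without any pre-training heads.

```python
# Configuration values assumed from the published bert-base-uncased config.
VOCAB_SIZE = 30522      # WordPiece vocabulary size
HIDDEN = 768            # hidden size
MAX_POSITIONS = 512     # learned position embeddings
TOKEN_TYPES = 2         # segment A / segment B
LAYERS = 12             # transformer layers
INTERMEDIATE = 3072     # feed-forward inner size


def linear(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def layer_norm(size: int) -> int:
    """Scale and shift vectors of a LayerNorm."""
    return 2 * size


# Token, position and segment embedding tables, followed by a LayerNorm.
embeddings = (VOCAB_SIZE + MAX_POSITIONS + TOKEN_TYPES) * HIDDEN + layer_norm(HIDDEN)

per_layer = (
    4 * linear(HIDDEN, HIDDEN)        # query, key, value and attention output projections
    + layer_norm(HIDDEN)
    + linear(HIDDEN, INTERMEDIATE)    # feed-forward expansion
    + linear(INTERMEDIATE, HIDDEN)    # feed-forward contraction
    + layer_norm(HIDDEN)
)

pooler = linear(HIDDEN, HIDDEN)       # dense layer applied to the [CLS] position

total = embeddings + LAYERS * per_layer + pooler
print(f"embeddings: {embeddings:,}")
print(f"per layer:  {per_layer:,}")
print(f"total:      {total:,}")
```

Under these assumptions the total comes to 109,482,240 weights, most of them in the twelve transformer layers, with the vocabulary embedding table as the next largest contributor.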

**2. Huge Model + Huge Corpus = Lots of GPUs**

Pre-Training BERT is expensive. The cost of pre-training is a whole subject of discussion, and there's been a lot of work done on bringing the cost down, but a *single* pre-training experiment could easily cost you thousands of dollars in GPU or TPU time. 

That's why these domain-specific pre-trained models are so interesting. Other organizations have footed the bill to produce and share these models which, while not pre-trained on your specific dataset, may at least be much closer to yours than "generic" BERT.


### Example Code for Importing

If you're interested in a BERT variant from the community models in the transformers library, importing can be incredibly simple--you just supply the name of the model as it appears in the library page.

First, we'll need to install the `transformers` library.


In [None]:
# !pip install transformers
# for plotting
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
from scipy.spatial.distance import cosine
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm
from transformers import AutoModel, AutoTokenizer

%matplotlib inline

import networkx as nx
from d3blocks import D3Blocks



The `transformers` library includes classes for different model architectures (e.g., `BertModel`, `AlbertModel`, `RobertaModel`, ...). Whatever model you're using, it needs to be loaded with the correct class (based on its architecture), which may not be immediately apparent. 

Luckily, the `transformers` library has a solution for this, demonstrated in the following cell. These "Auto" classes will choose the correct architecture for you! 

That's a nice feature, but I'd still prefer to know what I'm working with, so I'm printing out the class names (which show that SciBERT uses the original BERT classes).


In [None]:
cache_dir = ".cache"  # if you want to use a non-default cache directory for downloaded models - change this
scibert_tokenizer = AutoTokenizer.from_pretrained(
    "allenai/scibert_scivocab_uncased", cache_dir=cache_dir
)
scibert_model = AutoModel.from_pretrained(
    "allenai/scibert_scivocab_uncased", cache_dir=cache_dir
)

print("scibert_tokenizer is type:", type(scibert_tokenizer))
print("    scibert_model is type:", type(scibert_model))

### Comparing different transformer based PLMs

#### Comparing Vocabularies

The most apparent difference between SciBERT and the original BERT should be the model's vocabulary, since they were trained on such different corpora.

Both tokenizers have a vocabulary of roughly 30,000 tokens that was automatically built from the most frequently seen words and subword units in their respective corpora. 

The authors of SciBERT note:

> "The resulting token overlap between [BERT vocabulary] and
[SciBERT vocabulary] is 42%, illustrating a substantial difference in frequently used words between scientific and general domain texts."

Let's load the original BERT as well and do some of our own comparisons.
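
One comparison worth sketching is the token overlap quoted above. The helper below is our own illustration (`vocab_overlap` is a hypothetical name, and the quote does not say exactly how the authors defined overlap); it reports the shared tokens as a fraction of each vocabulary, using toy vocabularies so it runs without downloading anything.

```python
def vocab_overlap(vocab_a, vocab_b):
    """Return the shared tokens and their share of each vocabulary."""
    set_a, set_b = set(vocab_a), set(vocab_b)
    shared = set_a & set_b
    return shared, len(shared) / len(set_a), len(shared) / len(set_b)


# Toy vocabularies, invented for illustration only.
general_vocab = ["the", "of", "movie", "##ing", "king", "polymer"]
science_vocab = ["the", "of", "polymer", "##ing", "hydrogel", "protein", "##ase", "assay"]

shared, frac_general, frac_science = vocab_overlap(general_vocab, science_vocab)
print(sorted(shared))
print(f"{frac_general:.0%} of the general vocabulary is shared")
print(f"{frac_science:.0%} of the science vocabulary is shared")
```

Once both tokenizers are loaded, passing `bert_tokenizer.vocab` and `scibert_tokenizer.vocab` to the same function should give the real figures, since iterating over a vocabulary dictionary yields its tokens.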

*Side note: BERT used a "WordPiece" model for tokenization, whereas SciBERT employs a newer approach called "SentencePiece", but the difference is mostly cosmetic. I cover SentencePiece in more detail in our [ALBERT eBook](https://www.chrismccormick.ai/offers/HaABTJQH).*


In [None]:
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", cache_dir=cache_dir)
bert_model = AutoModel.from_pretrained("bert-base-uncased", cache_dir=cache_dir)

Let's apply both tokenizers to some biomedical text and see how they compare. 

I took the below sentence from the 2001 paper [Hydrogels for biomedical applications](http://yunus.hacettepe.edu.tr/~damlacetin/kmu407/index_dosyalar/Hoffman,%202012.pdf), which seems to be one of the most-cited papers in the field of biomedical applications (if I'm interpreting these [Google Scholar results](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=biomedical+applications&btnG=) correctly).

In [None]:
text = (
    "Hydrogels are hydrophilic polymer networks which may absorb from "
    "10–20% (an arbitrary lower limit) up to thousands of times their "
    "dry weight in water."
)

# Split the sentence into tokens, with both BERT and SciBERT.
bert_tokens = bert_tokenizer.tokenize(text)
scibert_tokens = scibert_tokenizer.tokenize(text)

# Pad out the scibert list to be the same length.
while len(scibert_tokens) < len(bert_tokens):
    scibert_tokens.append("")

# Label the columns.
print("{:<12} {:<12}".format("BERT", "SciBERT"))
print("{:<12} {:<12}".format("----", "-------"))

# Display the tokens.
for tup in zip(bert_tokens, scibert_tokens):
    print("{:<12} {:<12}".format(tup[0], tup[1]))

SciBERT apparently has embeddings for the words 'hydrogels' and 'hydrophilic', whereas BERT had to break these down into three subwords each. (Remember that the '##' in a token is just a way to flag it as a subword that is not the first subword). Apparently BERT does have "polymer", though!
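
To make the '##' convention concrete, here is a minimal sketch that glues continuation pieces back onto the token before them. `merge_wordpieces` is a hypothetical helper, and the token list is hand-written for illustration rather than real tokenizer output.

```python
def merge_wordpieces(tokens):
    """Glue '##' continuation pieces back onto the preceding token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Continuation piece: drop the '##' flag and append to the last word.
            words[-1] += token[2:]
        else:
            # First piece of a new word.
            words.append(token)
    return words


# Hand-written pieces for illustration; not actual tokenizer output.
pieces = ["hydro", "##gel", "##s", "are", "hydro", "##phil", "##ic", "polymer"]
print(merge_wordpieces(pieces))
```

Counting the pieces per merged word is one simple way to compare how heavily each tokenizer fragments a piece of text.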

I skimmed the paper and pulled out some other esoteric terms--check out the different numbers of tokens required by each model.

In [None]:
# Some strange terms from the paper.
words = ["polymerization", "2,2-azo-isobutyronitrile", "multifunctional crosslinkers"]

# For each term...
for word in words:

    # Print it out
    print("\n\n", word, "\n")

    # Start a list of tokens for each model, with the first one being the model name.
    list_a = ["BERT:"]
    list_b = ["SciBERT:"]

    # Run both tokenizers.
    list_a.extend(bert_tokenizer.tokenize(word))
    list_b.extend(scibert_tokenizer.tokenize(word))

    # Pad the lists to the same length.
    while len(list_a) < len(list_b):
        list_a.append("")
    while len(list_b) < len(list_a):
        list_b.append("")

    # Wrap them in a DataFrame to display a pretty table.
    df = pd.DataFrame([list_a, list_b])

    display(df)

The fact that SciBERT is able to represent all of these terms in fewer tokens seems like a good sign!

##### Vocab Dump

It can be pretty interesting just to dump the full vocabulary of a model into a text file and skim it to see what stands out.

This cell will write out SciBERT's vocab to 'vocabulary.txt', which you can open in Colab by going to the 'Files' tab in the pane on the left and double clicking the .txt file.

In [None]:
with open("vocabulary.txt", "w", encoding="utf-8") as f:

    # For each token in SciBERT's vocabulary...
    for token in scibert_tokenizer.vocab.keys():

        # Write it out, one per line.
        f.write(token + "\n")

You'll see that roughly the first 100 tokens are reserved, and then it looks like the rest of the vocabulary is sorted by frequency... The first actual tokens are:

`t`, `a`, `##in`, `##he`, `##re`, `##on`, `the`, `s`, `##ti`

> *Because the tokenizer breaks down "unknown" words into subtokens, it makes sense that some individual characters and subwords would be higher in frequency even than the most common words like "the".*

##### Numbers and Symbols

There seem to be a lot of number-related tokens in SciBERT--you see them constantly as you scroll through the vocabulary. Here are some examples:

In [None]:
"##.2%)" in scibert_tokenizer.vocab

In [None]:
"0.36" in scibert_tokenizer.vocab

In the below loops, we'll tally up the number of tokens which include a digit, and show a random sample of these tokens. We'll do this for both SciBERT and BERT for comparison.


In [None]:
import random

# ======== BERT ========
bert_examples = []

count = 0

# For each token in the vocab...
for token in bert_tokenizer.vocab:

    # If there's a digit in the token...
    # (But don't count those reserved tokens, e.g. "[unused59]")
    if any(i.isdigit() for i in token) and not ("unused" in token):
        # Count it.
        count += 1

        # Keep ~1% as examples to print.
        if random.randint(0, 100) == 1:
            bert_examples.append(token)

# Calculate the count as a percentage of the total vocab.
prcnt = float(count) / len(bert_tokenizer.vocab)

# Print the result.
print("In BERT:    {:>5,} tokens ({:.2%}) include a digit.".format(count, prcnt))

# ======== SciBERT ========
scibert_examples = []
count = 0

# For each token in the vocab...
for token in scibert_tokenizer.vocab:

    # If there's a digit in the token...
    # (But don't count those reserved tokens, e.g. "[unused59]")
    if any(i.isdigit() for i in token) and not ("unused" in token):
        # Count it.
        count += 1

        # Keep ~1% as examples to print.
        if random.randint(0, 100) == 1:
            scibert_examples.append(token)


# Calculate the count as a percentage of the total vocab.
prcnt = float(count) / len(scibert_tokenizer.vocab)

# Print the result.
print("In SciBERT: {:>5,} tokens ({:.2%}) include a digit.".format(count, prcnt))

print("")
print("Examples from BERT:", bert_examples)
print("Examples from SciBERT:", scibert_examples)

So it looks like:
- SciBERT has about 3x as many tokens with digits. 
- BERT's tokens are whole integers, and many look like they could be dates. (In [another Notebook](https://colab.research.google.com/drive/1fCKIBJ6fgWQ-f6UKs7wDTpNTL9N-Cq9X#scrollTo=-M1biDEVYjaL), I showed that BERT contains 384 of the integers in the range 1600 - 2021).
- SciBERT's number tokens are much more diverse. They are often subwords, and many include decimal places or  other symbols like `%` or `(`.
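
The distinction between whole integers and more varied number tokens could be quantified with a crude bucketing like the one below. This is only a sketch: `digit_token_kind` is a hypothetical helper, and apart from `##.2%)` and `0.36` (checked above) the sample tokens are made up.

```python
from collections import Counter


def digit_token_kind(token):
    """Crude bucket for a vocabulary token that contains a digit."""
    if token.isdigit():
        return "whole integer"
    if token.startswith("##"):
        return "subword"
    return "other"


# Sample tokens in the style of each vocabulary (mostly invented).
sample = ["1997", "2010", "12", "##.2%)", "0.36", "##00", "(n=5"]

kinds = Counter(digit_token_kind(token) for token in sample)
for kind, count in kinds.items():
    print(f"{kind:<14} {count}")
```

Running the same bucketing over the digit-containing tokens of each real vocabulary would turn the impression above into counts.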

Random -- check out token 17740!:

⎝

Looks like something is stuck to your monitor!  o_O

### Comparing Embeddings


**Semantic Similarity on Scientific Text**

To create a simple demonstration of the differences in the word- and sentence-level embeddings, we can investigate how each model creates embeddings for domain-specific words and documents. It may be a stretch to call one embedding space better than another, but we predict that a domain-trained PLM will produce embeddings that cluster together more appropriately than those of a more general model. A simple approach is to calculate the cosine similarity between embeddings of domain words/documents that should or should not be close together based on human intuition.

We think our code and results are worth sharing all the same. 

For a paper that more rigorously compares a BERT model trained on scientific text with the original, see the SciBERT paper [here](https://arxiv.org/abs/1903.10676). 




**Our Approach**

In our semantic similarity task, we have three pieces of text--call them "query", "A", and "B", that are all on scientific topics. We pick these such that the query text is always more similar to A than to B. 

Here's an example:

* query: "Mitochondria (mitochondrion, singular) are membrane-bound cell organelles."
* A: "These powerhouses of the cell produce adenosine triphosphate (ATP)."
* B: "Ribosomes contain RNA and are responsible for synthesizing the proteins needed for many cellular functions."

`query` and `A` are both about mitochondria, whereas `B` is about ribosomes. However, to recognize the similarity between `query` and `A`, you would need to know that mitochondria are responsible for producing ATP.  

Our intuition was that SciBERT, being trained on biomedical text, would better distinguish the similarities than BERT. 





**Interpreting Cosine Similarities**

When comparing two different models for semantic similarity, it's best to look at how well they *rank* the similarities, and not to compare the specific cosine similarity *values* across the two models.

It's for this reason that we've structured our example as "is `query` more similar to `A` or to `B`?"
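
The point about ranks versus raw values can be illustrated with made-up numbers. In the sketch below, the two "models" and their 3-d vectors are invented, and plain Python is used so the arithmetic stays visible; the notebook itself relies on `scipy.spatial.distance.cosine` for the same calculation.

```python
import math


def cosine_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def prefers_a(emb):
    """True when the query embedding is closer to A than to B."""
    return cosine_sim(emb["query"], emb["A"]) > cosine_sim(emb["query"], emb["B"])


# Invented embeddings from two hypothetical models for the same three texts.
model_1 = {"query": [0.9, 0.1, 0.0], "A": [0.8, 0.3, 0.1], "B": [0.1, 0.9, 0.2]}
model_2 = {"query": [0.5, 0.5, 0.4], "A": [0.5, 0.4, 0.5], "B": [0.4, 0.5, 0.6]}

for name, emb in [("model_1", model_1), ("model_2", model_2)]:
    sim_a = cosine_sim(emb["query"], emb["A"])
    sim_b = cosine_sim(emb["query"], emb["B"])
    print(f"{name}: sim(query, A)={sim_a:.2f}  sim(query, B)={sim_b:.2f}  prefers A: {prefers_a(emb)}")
```

Both toy models rank `A` above `B`, yet the second one gives `B` a much higher raw similarity than the first does, which is why comparing values across models can mislead.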


**Embedding Functions**

In order to try out different examples, we've defined a `get_embedding` function below. It takes the average of the token embeddings from either the final layer (`method="last_hidden"`, the default) or the second-to-last layer (`method="second_last"`) of the model to use as a sentence embedding.

`get_embedding` also supports calculating an embedding for a specific word or sequence of words within the sentence. 

To locate the indices of the tokens for these words, we've also defined the `get_word_indeces` helper function below. 

To calculate the word embedding, we again take the average of its token embeddings from the same layer of the model.
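
The masking trick that `get_word_indeces` relies on is easier to see with a toy tokenizer. Everything in this sketch is a stand-in: `toy_tokenize` simply splits on whitespace, whereas a real tokenizer also produces subwords and adds special tokens (which shift every index).

```python
MASK = "[MASK]"


def toy_tokenize(text):
    """Hypothetical stand-in tokenizer: lowercase and split on whitespace."""
    return [tok if tok == MASK else tok.lower() for tok in text.split()]


def toy_word_indices(text, word):
    """Find the token positions of `word` by swapping it for mask tokens."""
    # One mask per token that `word` is split into.
    n_tokens = len(toy_tokenize(word))
    masked_text = text.replace(word, " ".join([MASK] * n_tokens))
    # The masks now mark the positions the word's tokens occupied.
    return [i for i, tok in enumerate(toy_tokenize(masked_text)) if tok == MASK]


text = "the cell biology lecture covered cell walls"
print(toy_word_indices(text, "cell biology"))
print(toy_word_indices(text, "cell"))
```

Two caveats carry over to the real helper: `str.replace` is case-sensitive, and it replaces every occurrence, so a word that appears twice yields indices for both positions.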


#### get_word_indeces


In [None]:
def get_word_indeces(tokenizer, text, word):
    """
    Determines the index or indeces of the tokens corresponding to `word`
    within `text`. `word` can consist of multiple words, e.g., "cell biology".

    Determining the indeces is tricky because words can be broken into multiple
    tokens. I've solved this with a rather roundabout approach--I replace `word`
    with the correct number of `[MASK]` tokens, and then find these in the
    tokenized result.
    """
    # Tokenize the 'word'--it may be broken into multiple tokens or subwords.
    word_tokens = tokenizer.tokenize(word)
    # Create a sequence of `[MASK]` tokens to put in place of `word`.
    # get the tokenizers version of MASK - it differs between roberta and bert
    mask_token = tokenizer.mask_token

    masks_str = " ".join([mask_token] * len(word_tokens))

    # Replace the word with mask tokens.
    text_masked = text.replace(word, masks_str)
    # `encode` performs multiple functions:
    #   1. Tokenizes the text
    #   2. Maps the tokens to their IDs
    #   3. Adds the special [CLS] and [SEP] tokens.
    input_ids = tokenizer.encode(text_masked)
    # print(f"input ids are: {input_ids}")
    # print(f"tokenizer mask id is:{tokenizer.mask_token_id}")
    # Use numpy's `where` function to find all indeces of the [MASK] token.
    mask_token_indeces = np.where(np.array(input_ids) == tokenizer.mask_token_id)[0]

    return mask_token_indeces

#### get_embedding

In [None]:
def get_embedding(b_model, b_tokenizer, text, word="", method="last_hidden"):
    """
    Uses the provided model and tokenizer to produce an embedding for the
    provided `text`, and a "contextualized" embedding for `word`, if provided.

    NOTE this only works on one sample at a time - not a batch
    """

    # If a word is provided, figure out which tokens correspond to it.
    if not word == "":
        word_indeces = get_word_indeces(b_tokenizer, text, word)
        # print(f"Word indeces are:{word_indeces}")

    # Encode the text, adding the (required!) special tokens, and converting to
    # PyTorch tensors.
    encoded_dict = b_tokenizer.encode_plus(
        text,  # Sentence to encode.
        add_special_tokens=True,  # Add '[CLS]' and '[SEP]'
        return_tensors="pt",  # Return pytorch tensors.
    )

    input_ids = encoded_dict["input_ids"]

    b_model.eval()

    # Run the text through the model once, without tracking gradients.
    with torch.no_grad():

        outputs = b_model(input_ids)

        # The model returns a different number of objects depending on how it
        # was configured in the `from_pretrained` call. When it was loaded with
        # `output_hidden_states=True`, the third item holds the hidden states
        # from the embedding layer and every transformer layer. See the
        # documentation for more details:
        # https://huggingface.co/transformers/model_doc/bert.html#AutoModel

        # print(f"outputs are:{outputs}")
        if method == "last_hidden":
            # the 0th element will be the last layer hidden states for each token (1,sequence_length, embed_dims)
            # we just grab the 0th element of this, essentially equivalent to torch.squeeze to get (sequence_length, embed_dims)
            token_vecs = outputs[0][0]
            # print(f"token_vecs shape: {token_vecs.shape}")

        elif method == "second_last":
            # get all layers hidden states
            hidden_states = outputs[2]
            # `hidden_states` has shape [13 x 1 x <sentence length> x 768]

            # Select the embeddings from the second to last layer.
            # `token_vecs` is a tensor with shape [<sent length> x 768]
            token_vecs = hidden_states[-2][0]

        else:
            raise NotImplementedError

    # Calculate the average of all token vectors.
    sentence_embedding = torch.mean(token_vecs, dim=0)

    # Convert to numpy array.
    sentence_embedding = sentence_embedding.detach().numpy()

    # If `word` was provided, compute an embedding for those tokens.
    if not word == "":
        # Take the average of the embeddings for the tokens in `word`.
        word_embedding = torch.mean(token_vecs[word_indeces], dim=0)

        # Convert to numpy array.
        word_embedding = word_embedding.detach().numpy()

        return (sentence_embedding, word_embedding)
    else:
        return sentence_embedding

Retrieve the models and tokenizers

In [None]:
# Retrieve SciBERT.
scibert_model = AutoModel.from_pretrained(
    "allenai/scibert_scivocab_uncased", output_hidden_states=True, cache_dir=cache_dir
)
scibert_tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")

scibert_model.eval()

# Retrieve generic BERT.
bert_model = AutoModel.from_pretrained(
    "bert-base-uncased", output_hidden_states=True, cache_dir=cache_dir
)
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

bert_model.eval()

Make a function to load and store multiple models and tokenizers.

___WARNING___ this is RAM heavy so use with caution

In [None]:
def get_model_tokenizers(model_names_or_paths: list = [], cache_dir: str = None):

    """
    Function:
        Read in and store models/tokenizers based on the provided
        model_names_or_paths, using the Transformers AutoModel/AutoTokenizer classes.

    Args:
        model_names_or_paths: list -> list of model names found on the HF hub or local paths
        cache_dir: str -> optional directory in which downloaded models are cached

    Returns:
        A separate model dictionary and tokenizer dictionary, each containing an
        entry per model under the same name. Access with model/tokenizer[model_name].
    """

    # create an empty dict to store the different models/tokenizers
    model_dict = {}
    tokenizer_dict = {}

    # load in each model and add to the dictionary for later use
    for model_name in tqdm(model_names_or_paths):
        # load model
        model = AutoModel.from_pretrained(
            model_name, output_hidden_states=True, cache_dir=cache_dir
        )
        tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)

        # add to dict
        # if a local path was provided we will want to extract a more useful name for the dictionary
        # crude substring matching for now - based on how we created these models so very hard coded
        if "declutr_nhs" in model_name:
            model_name = "declutr_nhs_roberta"
        elif "roberta-base-NHS-incidents" in model_name:
            model_name = "roberta_NHS_incidents"
        # Add any further rules you want here to rename model_name after loading
        # the model/tokenizer; otherwise the name is kept as provided, so hub
        # names such as "roberta-base" remain valid dictionary keys.
        # add the model_name to the dictionaries
        model_dict[model_name] = model
        tokenizer_dict[model_name] = tokenizer

    return model_dict, tokenizer_dict

In [None]:
# load in multiple models for later use - this will take up a lot of RAM if many models
model_names_or_paths = [
    "roberta-base",
    "allenai/scibert_scivocab_uncased",
    "bert-base-uncased",
    "johngiorgi/declutr-sci-base",
    "johngiorgi/declutr-base",
]

cache_dir = "CACHE_DIR"

all_models, all_tokenizers = get_model_tokenizers(model_names_or_paths, cache_dir)

In [None]:
# look at the model keys i.e. the model_names
all_models.keys()

#### quick example with one model

In [None]:
model_name = "roberta-base"

text = "hydrogels are hydrophilic polymer networks which may absorb from 10–20% (an arbitrary lower limit) up to thousands of times their dry weight in water."
word = "hydrogels"

# Get the embedding for the sentence, as well as an embedding for 'hydrogels'.
(sen_emb, word_emb) = get_embedding(
    all_models[model_name], all_tokenizers[model_name], text, word
)

print("Embedding sizes:")
print(sen_emb.shape)
print(word_emb.shape)

Here's the code for calculating cosine similarity. We'll test it by comparing the word embedding with the sentence embedding--not a very interesting comparison, but a good sanity check.

In [None]:
# Calculate the cosine similarity of the two embeddings.
sim = 1 - cosine(sen_emb, word_emb)

print("Cosine similarity: {:.2}".format(sim))

#### Alternative method for calculating embeddings

This method is slightly different: it takes the last layer's hidden states and uses the attention mask to compute the mean over the real (non-padding) tokens of each sequence, which makes it easier to process batches of inputs.
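
To see why the attention mask matters, here is the same pooling written in NumPy on a toy batch. The values are made up, and NumPy stands in for PyTorch purely to keep the sketch light.

```python
import numpy as np

# Toy batch: 2 sequences, 3 token positions, 2-d token vectors (made-up values).
token_vecs = np.array(
    [
        [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],  # all three tokens are real
        [[2.0, 2.0], [4.0, 0.0], [9.0, 9.0]],  # the last position is padding
    ]
)
attention_mask = np.array([[1, 1, 1], [1, 1, 0]])

# Zero out the padded positions, then divide by the number of real tokens.
mask = attention_mask[..., np.newaxis]                      # shape (batch, seq, 1)
summed = (token_vecs * mask).sum(axis=1)                    # shape (batch, dim)
counts = np.clip(attention_mask.sum(axis=1, keepdims=True), 1e-9, None)
masked_mean = summed / counts

# A plain mean would let the padding vector leak into the second embedding.
naive_mean = token_vecs.mean(axis=1)

print("masked mean:\n", masked_mean)
print("naive mean:\n", naive_mean)
```

For the first sequence the two means agree; for the second, only the masked mean ignores the padding vector.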


In [None]:
model_name = "roberta-base"

text = [
    "hydrogels are hydrophilic polymer networks which may absorb from 10–20% (an arbitrary lower limit) up to thousands of times their dry weight in water.",
    "hydrogels smell",
    "cheese is really nice",
]

inputs = all_tokenizers[model_name](
    text, padding=True, truncation=True, return_tensors="pt"
)

# embed text
with torch.no_grad():
    # get last hidden states for each token
    sequence_output = all_models[model_name](**inputs)[0]

# mean pool the token-level embeddings to get sentence-level
embeddings = torch.sum(
    sequence_output * inputs["attention_mask"].unsqueeze(-1), dim=1
) / torch.clamp(torch.sum(inputs["attention_mask"], dim=1, keepdims=True), min=1e-9)

In [None]:
embeddings.shape

In [None]:
# now get similarity via cosine distance for the first and second sentence
semantic_sim = 1 - cosine(embeddings[0], embeddings[1])

In [None]:
semantic_sim

In [None]:
# sanity check: compute the pairwise cosine similarities for all possible pairs;
# the [0][1] and [1][0] elements of the matrix should equal semantic_sim from the cell above
cosine_similarity(embeddings)

### Sentence Comparison Examples

#### Compare an anchor to multiple queries

Here we provide a set of sentences as a list (or batch). The first (0th) element of that list is used as the *anchor* and the rest are treated as *queries*. An embedding is calculated for each, along with the cosine similarity between the *anchor* and every *query*. The entire pairwise similarity matrix is also returned for inspection.
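
The anchor-versus-queries bookkeeping can be checked on a tiny hand-made example. In this sketch `anchor_similarities` is a hypothetical helper and the 2-d embeddings are invented; it assumes row 0 is the anchor, as described above.

```python
import numpy as np


def anchor_similarities(embeddings):
    """Cosine similarity of row 0 (the anchor) to every other row, plus the full matrix."""
    # Normalise each row so a dot product equals the cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    pairwise = unit @ unit.T
    return pairwise[0, 1:], pairwise


# Made-up embeddings: row 0 is the anchor, rows 1-3 are the queries.
embeddings = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])

anchor_sims, pairwise = anchor_similarities(embeddings)

# Row numbers of the queries, ordered from most to least similar to the anchor.
ranking = np.argsort(anchor_sims)[::-1] + 1

print("anchor vs queries:", np.round(anchor_sims, 3))
print("ranking:", ranking.tolist())
```

The first row of the pairwise matrix (minus its leading 1.0) is exactly the anchor-versus-query column that ends up in the results table.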

In [None]:
# function to take in a batch of sentences, compute embeddings and calculate similarity
def batch_compare_sentence_embs(
    texts: list = [],
    model_names: list = None,
    cache_dir: str = None,
    device: str = None,
):
    """
    Function:
        Takes in a number of models and produces embeddings for the provided texts.
        The first text is treated as the anchor and the rest as queries. For each
        model, the cosine similarity between the anchor and every query is stored
        in a dataframe, along with the full pairwise cosine similarity matrix.

    NOTE: `device` is currently unused; everything runs on the CPU.
    """

    # if a single string is provided for model names put into a list
    if isinstance(model_names, str):
        model_names = [model_names]

    # separate the anchor and query texts
    anchor_text = texts[0]
    query_texts = texts[1:]

    # create initial dataframe to store the texts
    results_df = pd.DataFrame({"anchor_text": anchor_text, "query_text": query_texts})

    # create dict to store model specific full cosine similarity matrix
    cosine_sim_matrices = {}

    print(f"Anchor text:\n\n{texts[0]}")
    print("=" * 50)

    # cycle through models provided:

    for model_name in tqdm(model_names, desc="Processing"):

        model = AutoModel.from_pretrained(model_name, cache_dir=cache_dir)
        tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)

        # if a path was provided we will want to extract a more useful name for the dictionary
        # crude substring matching for now - based on how we created these models so very hard coded
        if "declutr_nhs" in model_name:
            model_name = "declutr_roberta_nhs_incident"
        elif "roberta-base-NHS-incidents" in model_name:
            model_name = "roberta_NHS_incidents"
        elif "incident/declutr-base/" in model_name:
            model_name = "declutr_base_nhs_incident"
        # add any further rules to rename model_names here if desired; otherwise
        # the name is kept as provided (e.g. "roberta-base")

        inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

        # embed text
        with torch.no_grad():
            # get last hidden states for each token
            sequence_output = model(**inputs)[0]

        # mean pool the token-level embeddings to get sentence-level
        embeddings = torch.sum(
            sequence_output * inputs["attention_mask"].unsqueeze(-1), dim=1
        ) / torch.clamp(
            torch.sum(inputs["attention_mask"], dim=1, keepdims=True), min=1e-9
        )

        # print(f"Embeddings calculated of shape: {embeddings.shape}")

        # presume that the anchor text is the 0th element of the provided texts list
        anchor_emb = embeddings[0]

        # assign the non-anchor embeddings to another variable
        query_embs = embeddings[1:]

        # now get the similarity of each query to the anchor (1 - cosine distance)
        cosine_sims = [
            (1 - cosine(query_embs[i], anchor_emb)) for i in range(query_embs.size(0))
        ]

        # add similarities to a table or something
        results_df[f"{model_name}_cs"] = cosine_sims

        # add full cosine similarity matrix to the dictionary
        cosine_sim_matrices[model_name] = cosine_similarity(embeddings)

    return results_df, cosine_sim_matrices

Function to find the **most** similar documents based on cosine similarity

In [None]:
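# Print every other document, ranked from most to least similar to texts[doc_id].
def most_similar(texts, doc_id, similarity_matrix, matrix):
    print(f"Document: {texts[doc_id]}")
    print("\n")
    print(f"Similar Documents using {matrix}:")
    if matrix == "Cosine Similarity":
        # Higher similarity means closer, so sort in descending order.
        similar_ix = np.argsort(similarity_matrix[doc_id])[::-1]
    elif matrix == "Euclidean Distance":
        # Lower distance means closer, so sort in ascending order.
        similar_ix = np.argsort(similarity_matrix[doc_id])
    else:
        raise ValueError(f"Unknown matrix type: {matrix}")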
    for ix in similar_ix:
        if ix == doc_id:
            continue
        print("\n")
        print(f"Document: {texts[ix]}")
        print(f"{matrix} : {similarity_matrix[doc_id][ix]}")

### Compare an incident-report-specific anchor with similar and dissimilar in-domain sentences, and with general-domain sentences

In [None]:
# TODO  - have an incident specific anchor with:

texts = [
    "The patient fell off the bed and was left unattended for 3 hours.",
    "The man was running for cover as it was raining and fell.",
    "Tension grew between the two nations as stocks ran low.",
    "Bedside incident. Pt left prone for 4 hours. Breach.",
    "RTT for patient with high blood pressure gone unchecked.",
    "The patient fell off the bed and was laying unattended for 3 hours",
    "Left unattended after patient fell off bed.",
    "The movie was amazing with some great acting performances.",
    "Awful film with subpar performances from all actors involved.",
]

# supply list of models of interest
model_names_or_paths = ["roberta-base", "bert-base-uncased", "johngiorgi/declutr-base"]

embed_results, cosine_sim_matrices = batch_compare_sentence_embs(
    texts=texts, model_names=model_names_or_paths, cache_dir=cache_dir
)

In [None]:
# allow pandas dataframe to display whole columns for easier reading
with pd.option_context("display.max_colwidth", None):
    display(embed_results)

In [None]:
embed_results.iloc[7, :]

In [None]:
# retrieve the full cosine similarity matrix for one of the models using the model name
cosine_sim_matrices["roberta-base"]

Plot the cosine similarity matrix:

In [None]:
fig, ax = plt.subplots()

in_arr = cosine_sim_matrices["roberta-base"]

ax.matshow(in_arr)

for i in range(in_arr.shape[0]):
    for j in range(in_arr.shape[1]):
        c = in_arr[j, i]
        ax.text(i, j, str(c)[:4], va="center", ha="center")

We can also draw a graph with networkx, using a cosine similarity threshold to decide whether a link is made between two documents. This is very simple and can be explored further.
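
The thresholding idea itself does not need a graph library. As a minimal sketch (with a hypothetical helper `edges_above` and a made-up similarity matrix), an edge is kept for each pair of documents whose similarity exceeds the lower bound:

```python
def edges_above(similarity, lower_bound):
    """List (i, j, weight) for each pair i < j whose similarity exceeds the bound."""
    n = len(similarity)
    return [
        (i, j, similarity[i][j])
        for i in range(n)
        for j in range(i + 1, n)  # upper triangle only: skips self-loops and duplicates
        if similarity[i][j] > lower_bound
    ]


# Made-up symmetric similarity matrix for three documents.
similarity = [
    [1.0, 0.8, 0.3],
    [0.8, 1.0, 0.65],
    [0.3, 0.65, 1.0],
]

edges = edges_above(similarity, lower_bound=0.6)
for i, j, weight in edges:
    print(f"{i} -- {j}  (similarity {weight})")
```

Raising the lower bound prunes weaker links, so nodes that stay connected at a high threshold are the documents the model considers closest.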

In [None]:
G = nx.from_numpy_array(cosine_sim_matrices["roberta-base"])

In [None]:
sorted(G.edges(data="weight"), key=lambda x: x[2])[0]

In [None]:
LOWER_BOUND = 0.6

for n, text in enumerate(texts):
    print(f"{n}: {text}")

print(f"\nLower bound cutoff for connections: {LOWER_BOUND}")

fig, ax = plt.subplots(figsize=(4, 4))

pos = nx.spring_layout(G, seed=0)
nx.draw(G, pos, with_labels=True, width=0)

edge_minimum = sorted(G.edges(data="weight"), key=lambda x: x[2])[0][2]

for edge in G.edges(data="weight"):
    if edge[2] > LOWER_BOUND:
        nx.draw_networkx_edges(G, pos, edgelist=[edge], width=2)
    else:
        nx.draw_networkx_edges(G, pos, edgelist=[edge], width=0)

We can also make a more interactive version of the above using [d3blocks](https://github.com/d3blocks/d3blocks).

In [None]:
# Initialize
d3 = D3Blocks()
# Convert the graph to an edge list and rescale the weights to the 0-100 range
df = pd.DataFrame(nx.to_pandas_edgelist(G))
df["weight"] = df["weight"].apply(lambda x: x * 100)

# Create network using default
d3.d3graph(df, filepath="./d3graph.html", slider=[0, 100])

In [None]:
most_similar(texts, 0, cosine_sim_matrices["roberta-base"], "Cosine Similarity")

### Try some interesting examples for each model



In [None]:
texts = [
    "The patient fell off the bed and was left unattended for 3 hours.",
    "Left unattended after patient fell off bed.",
]

# load multiple models for later use; this can take up a lot of RAM if there are many models
model_names_or_paths = ["roberta-base", "<LOCAL_MODEL_PATH>", "johngiorgi/declutr-base"]

embed_results, cosine_sim_matrices = batch_compare_sentence_embs(
    texts=texts, model_names=model_names_or_paths, cache_dir="<CACHE_DIR>"
)

In [None]:
with pd.option_context("display.max_colwidth", None):
    display(embed_results)

In [None]:
texts = [
    "The patient fell off the bed.",  # and was left unattended for 3 hours.",
    "The movie was amazing.",  # with some great acting performances."
]

# load multiple models for later use; this can take up a lot of RAM if there are many models
model_names_or_paths = ["roberta-base", "<LOCAL_MODEL_PATH>", "johngiorgi/declutr-base"]

embed_results, cosine_sim_matrices = batch_compare_sentence_embs(
    texts=texts, model_names=model_names_or_paths, cache_dir="<CACHE_DIR>"
)

In [None]:
with pd.option_context("display.max_colwidth", None):
    display(embed_results)

#### Word Comparison Examples
We can also look at the embeddings of single words in different contexts. 

In [None]:
# all_models.keys()

This notebook is a bit messy at this point: in its current form, some functions work better on one sample at a time, while others work better with batches.


In [None]:
# this loop relies on models and tokenizers having been loaded and stored in the all_models and all_tokenizers dictionaries in an earlier cell
for model_name in tqdm(all_models.keys()):
    text = "Pt was acting aggressive towards another patient."

    print('"' + text + '"\n')

    # Get contextualized embeddings for "pt", "patient", and "aggressive"
    (emb_sen, emb_pt) = get_embedding(
        all_models[model_name], all_tokenizers[model_name], text, word="pt"
    )
    (emb_sen, emb_patient) = get_embedding(
        all_models[model_name], all_tokenizers[model_name], text, word="patient"
    )
    (emb_sen, emb_aggressive) = get_embedding(
        all_models[model_name], all_tokenizers[model_name], text, word="aggressive"
    )

    print(f"{model_name}:")

    # Compare the embeddings

    print("  sim(pt, patient): {:}".format((1 - cosine(emb_pt, emb_patient))))
    print(
        "  sim(patient, aggressive): {:}".format(
            1 - cosine(emb_patient, emb_aggressive)
        )
    )

    print("")
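
The loop above reports similarity as `1 - cosine(...)`. Assuming `cosine` is `scipy.spatial.distance.cosine` (the import is not shown in this section), that function returns a cosine *distance*, so subtracting it from 1 recovers the cosine similarity. The sketch below computes the similarity directly with NumPy to make the quantity explicit. `cosine_similarity` and the toy vectors are hypothetical stand-ins for real word embeddings.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical 3-d vectors standing in for word embeddings.
vec_a = np.array([1.0, 2.0, 3.0])
vec_b = np.array([2.0, 4.0, 6.0])   # same direction as vec_a
vec_c = np.array([3.0, 0.0, -1.0])  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))  # close to 1: same direction
print(cosine_similarity(vec_a, vec_c))  # close to 0: orthogonal
```

Values near 1 mean the two vectors point in almost the same direction, which is how a high `sim(pt, patient)` score would be read here.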

Let us know if you find some more interesting examples to try!

# Appendix: BioBERT vs. SciBERT

I don't have much insight into the merits of BioBERT versus SciBERT, but I thought I would at least share what I do know.

**Publish Dates & Authors**

* *BioBERT*
    * First submitted to arXiv: `Jan 25, 2019`
        * [arXiv](https://arxiv.org/abs/1901.08746)
    * First Author: Jinhyuk Lee
    * Organization: Korea University, Clova AI (also Korean)

* *SciBERT*
    * First submitted to arXiv: `Mar 26, 2019`
        * [arXiv](https://arxiv.org/abs/1903.10676), [pdf](https://arxiv.org/pdf/1903.10676.pdf)
    * First Author: Iz Beltagy
    * Organization: Allen AI

**Differences**

* BioBERT used the same tokens as the original BERT, rather than choosing a new vocabulary of tokens based on their corpus. Their justification was "to maintain compatibility", which I don't entirely understand.
* SciBERT learned a new vocabulary of tokens, but they also found that this detail is less important--it's training on the specialized corpus that really makes the difference.
* SciBERT is more recent, and outperforms BioBERT on many, but not all, scientific NLP benchmarks.
* The difference in naming seems unfortunate--SciBERT is also trained primarily on biomedical research papers, but the name "BioBERT" was already taken, so....

**huggingface transformers**

* Allen AI published their SciBERT models for the transformers library, and you can see their popularity:
    * [SciBERT uncased](https://huggingface.co/allenai/scibert_scivocab_uncased): ~16.7K downloads (from 5/22/20 - 6/22/20)
        * `allenai/scibert_scivocab_uncased`
    * [SciBERT cased](https://huggingface.co/allenai/scibert_scivocab_cased): ~3.8K downloads (from 5/22/20 - 6/22/20)
        * `allenai/scibert_scivocab_cased`
* The BioBERT team has published their models, but not for the `transformers` library, as far as I can tell.
    * The most popular BioBERT model in the huggingface community appears to be [this one](https://huggingface.co/monologg/biobert_v1.1_pubmed): `monologg/biobert_v1.1_pubmed`, with ~8.6K downloads (from 5/22/20 - 6/22/20)
        * You could also download BioBERT's pre-trained weights yourself from https://github.com/naver/biobert-pretrained, but I'm not sure exactly what it would take to pull these into the `transformers` library.

