
# BERT Word Search:
# Measuring Word Similarity with BERT

By The BERT for Humanists Team
<br></br>

How can we say that two words are similar? Well, one approach is to use BERT.

BERT projects tokens in a document into vectors. We can then use the geometric similarity between the resulting token vectors as a way to represent varying types of similarity between words. In this example, we'll take a collection of poems and look for words that have a similar vector to a query word from the collection.

The result will be illustrative of what BERT vectors represent, but also of the limitations of the tokenization scheme that it uses.

## **Import necessary Python libraries and modules**

First, we will import a few necessary Python libraries and modules.

In [None]:
# For downloading large files from Google Drive
# https://github.com/wkentaro/gdown
import gdown

# For data manipulation and analysis
import pandas as pd
import numpy as np

To use the HuggingFace [`transformers` Python library](https://huggingface.co/transformers/installation.html), we will install it with `pip`.

In [None]:
!pip3 install transformers



In [None]:
from transformers import DistilBertTokenizerFast, DistilBertModel

## **Load Poetry dataset**

The dataset that we're going to use in this notebook contains most of the poems found on PoetryFoundation.org, which were collected and uploaded to [Kaggle](https://www.kaggle.com/johnhallman/complete-poetryfoundationorg-dataset?select=kaggle_poem_dataset.csv) by John Hallman. We downloaded this dataset from Kaggle and then uploaded it to Google Drive.

To download a file from Google Drive and use it in a Colab notebook environment, we need to do two things:
1. Alter a Google Drive share link so that it can be downloaded
2. Use `gdown.download()` to download the data with this new link

**Here's how to convert a Google Drive share link into a download link (generalized example)**:

Google Drive share and view link: `https://drive.google.com/file/d/UNIQUE-ID/view?usp=sharing`

Google Drive download link: `https://drive.google.com/uc?export=download&id=UNIQUE-ID`



---
**Here's how to convert a Google Drive share link into a download link (specific example)**:

Google Drive share and view link (Poetry Foundation example): https://drive.google.com/file/d/1PERnd0l6QRmuu-Nt9cyvAU8E0ffIpCdv/view?usp=sharing

Google Drive download link (Poetry Foundation example): https://drive.google.com/uc?export=download&id=1PERnd0l6QRmuu-Nt9cyvAU8E0ffIpCdv
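The conversion above can also be done programmatically. The following is a minimal sketch, assuming the share link follows the `/file/d/UNIQUE-ID/` pattern shown above; the helper name `drive_share_to_download` is our own invention and is not part of `gdown`.

```python
import re

def drive_share_to_download(share_url):
    """Convert a Google Drive share link into a direct-download link."""
    # The unique file ID sits between "/file/d/" and the next "/" or "?"
    match = re.search(r"/file/d/([^/?]+)", share_url)
    if match is None:
        raise ValueError(f"Not a recognized Google Drive share link: {share_url}")
    return f"https://drive.google.com/uc?export=download&id={match.group(1)}"

share_link = "https://drive.google.com/file/d/1PERnd0l6QRmuu-Nt9cyvAU8E0ffIpCdv/view?usp=sharing"
download_link = drive_share_to_download(share_link)
print(download_link)
```

The resulting string is the kind of link that can be passed as the first argument to `gdown.download()`.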



Here we download the CSV file from Google Drive and save it as `poetry_foundation.csv`:

In [None]:
gdown.download("https://drive.google.com/uc?export=download&id=1PERnd0l6QRmuu-Nt9cyvAU8E0ffIpCdv", output="poetry_foundation.csv", quiet=False)

Downloading...
From:  https://drive.google.com/uc?export=download&id=1PERnd0l6QRmuu-Nt9cyvAU8E0ffIpCdv
To: /content/poetry_foundation.csv
23.4MB [00:00, 87.6MB/s]


'poetry_foundation.csv'

**NOTE**: Another handy way to upload files to Colab is to use the following:

In [None]:
#from google.colab import files
#uploaded = files.upload()

Here we're using Pandas to read in our CSV file and inspect the first few rows. To be clear, knowledge of Pandas is not necessary to use BERT. This is simply how we've chosen to load our data, and you can load your own data however you are most comfortable.

In [None]:
poetry_df = pd.read_csv("poetry_foundation.csv", encoding='utf-8')
poetry_df.head()

Unnamed: 0.1,Unnamed: 0,Author,Title,Poetry Foundation ID,Content
0,0,Wendy Videlock,!,55489,"Dear Writers, I’m compiling the first in what ..."
1,1,Hailey Leithauser,0,41729,"Philosophic\nin its complex, ovoid emptiness,\..."
2,2,Jody Gladding,1-800-FEAR,57135,We'd like to talk with you about fear t...
3,3,Joseph Brodsky,1 January 1965,56736,The Wise Men will unlearn your name.\nAbove yo...
4,4,Ted Berrigan,3 Pages,51624,For Jack Collom\n10 Things I do Every Day\n\np...


Let's check to see how many poems are in this dataset:

In [None]:
len(poetry_df)

15652

Let's check to see which authors show up the most in this dataset to get a sense of its contours:

In [None]:
poetry_df['Author'].value_counts()[:20]

William Shakespeare           85
Anonymous                     82
Alfred, Lord Tennyson         78
Rae Armantrout                62
William Wordsworth            59
Emily Dickinson               57
William Butler Yeats          47
John Ashbery                  46
Yusef Komunyakaa              43
Percy Bysshe Shelley          43
John Donne                    42
Walt Whitman                  41
Kay Ryan                      40
Algernon Charles Swinburne    39
Sir Philip Sidney             39
Robert Browning               39
William Blake                 38
Henry Wadsworth Longfellow    38
Thomas Hardy                  38
Samuel Menashe                38
Name: Author, dtype: int64

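There are a lot of poems here, so we're going to keep only the poems that are shorter than 2,000 characters and then take a random sample of 5,000 of them. Because `sample()` is called without a fixed `random_state`, you will get a different sample, and therefore slightly different results, each time you run the notebook.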

In [None]:
poetry_df = poetry_df[poetry_df['Content'].str.len() < 2000]

In [None]:
poetry_df = poetry_df.sample(5000)

Next we're going to take this Pandas DataFrame and make three lists: `poetry_authors`, `poetry_titles`, `poetry_texts`.

In [None]:
poetry_authors = poetry_df['Author'].tolist()
poetry_titles = poetry_df['Title'].tolist()
poetry_texts = poetry_df['Content'].tolist()

Let's check the lengths of the lists that we created:

In [None]:
len(poetry_authors), len(poetry_titles), len(poetry_texts)

(5000, 5000, 5000)

Let's examine a poem in our dataset:

In [None]:
print(poetry_titles[8], '\n\n', poetry_authors[8], '\n\n', poetry_texts[8])

Extempore Effusion upon the Death of James Hogg 

 William Wordsworth 

 When first, descending from the moorlands,
I saw the Stream of Yarrow glide
Along a bare and open valley,
The Ettrick Shepherd was my guide.

When last along its banks I wandered,
Through groves that had begun to shed
Their golden leaves upon the pathways,
My steps the Border-minstrel led.

The mighty Minstrel breathes no longer,
'Mid mouldering ruins low he lies;
And death upon the braes of Yarrow,
Has closed the Shepherd-poet's eyes:

Nor has the rolling year twice measured,
From sign to sign, its stedfast course,
Since every mortal power of Coleridge
Was frozen at its marvellous source;

The rapt One, of the godlike forehead,
The heaven-eyed creature sleeps in earth:
And Lamb, the frolic and the gentle,
Has vanished from his lonely hearth.

Like clouds that rake the mountain-summits,
Or waves that own no curbing hand,
How fast has brother followed brother,
From sunshine to the sunless land!

Yet I, whose lids f

<br><br>

## **Encode data for BERT**

We're going to transform our poems into a format that BERT (via HuggingFace) will understand. This is called *encoding* the data.

Here are the steps we need to follow:

1. The texts (in this case, poems) need to be truncated if they are longer than 512 tokens, BERT's maximum input length, and padded if they are shorter than the longest text in the batch. The words in the texts also need to be split into "word pieces."

2. We need to add special tokens to help BERT:

| BERT special token | Explanation |
| --------------| ---------|
| [CLS] | Start token of every document |
| [SEP] | Separator between sentences; also marks the end of the document |
| [PAD] | Padding added at the end of shorter documents so that every document in a batch has the same length (at most 512 tokens) |
| `##` | Prefix marking a "word piece" that continues the previous piece rather than starting a new word |
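To make the role of `##` and the special tokens concrete, here is a toy sketch of greedy longest-match word-piece splitting. It is an illustration only: the six-entry vocabulary and the helper names `wordpiece_tokenize` and `encode` are hypothetical, and the real DistilBERT tokenizer also handles punctuation, text normalization, truncation, and a vocabulary of tens of thousands of pieces.

```python
def wordpiece_tokenize(word, vocab):
    """Greedily split one lowercase word into the longest pieces found in vocab."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the ## prefix
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matched: the whole word becomes an unknown token
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

def encode(text, vocab):
    """Wrap the word pieces of a text in [CLS] ... [SEP]."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    tokens.append("[SEP]")
    return tokens

# Hypothetical toy vocabulary
toy_vocab = {"the", "stream", "moor", "##lands", "min", "##strel"}

tokens = encode("the moorlands", toy_vocab)
# Pad to a fixed length of 8 positions
padded = tokens + ["[PAD]"] * (8 - len(tokens))
print(padded)
```

Here "moorlands" is not in the toy vocabulary, so it is split into `moor` and the continuation piece `##lands`.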

Here we will load `DistilBertTokenizerFast` from the HuggingFace library, which will help us transform and encode the texts so they can be used with BERT.

In [None]:
from transformers import DistilBertTokenizerFast

In [None]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

The `tokenizer()` will break word tokens into word pieces, truncate to 512 tokens, and add padding and special BERT tokens.

In [None]:
tokenized_poems = tokenizer(poetry_texts, truncation=True, padding=True, return_tensors="pt")

Let's examine one of the tokenized poems: the same poem (index 8) that we printed above. We can see that the special BERT tokens have been inserted where necessary.

In [None]:
' '.join(tokenized_poems[8].tokens)

'[CLS] when first , descending from the moor ##lands , i saw the stream of ya ##rrow glide along a bare and open valley , the et ##trick shepherd was my guide . when last along its banks i wandered , through groves that had begun to shed their golden leaves upon the pathways , my steps the border - min ##strel led . the mighty min ##strel breathe ##s no longer , \' mid mo ##uld ##ering ruins low he lies ; and death upon the bra ##es of ya ##rrow , has closed the shepherd - poet \' s eyes : nor has the rolling year twice measured , from sign to sign , its ste ##df ##ast course , since every mortal power of cole ##ridge was frozen at its marvel ##lous source ; the rap ##t one , of the god ##like forehead , the heaven - eyed creature sleeps in earth : and lamb , the fr ##olic and the gentle , has vanished from his lonely hearth . like clouds that rake the mountain - summit ##s , or waves that own no curb ##ing hand , how fast has brother followed brother , from sunshine to the sun ##less 

<br><br>

## **Load pre-trained BERT model**

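Here we will load a pre-trained BERT model (specifically DistilBERT, a smaller and faster version of BERT). To speed things up we will use a GPU, which involves a few extra steps.
The command `.to("cuda")` moves data from regular memory to the GPU's memory, so it requires a runtime with a GPU enabled (in Colab: Runtime > Change runtime type).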




In [None]:
from transformers import DistilBertModel

In [None]:
model = DistilBertModel.from_pretrained('distilbert-base-uncased').to("cuda")

In [None]:
num_docs = len(poetry_texts)

In [None]:
num_docs

5000

Now we will loop through the poems and convert each poem from a single string into the BERT input representation using the `tokenizer()`.

Then we will ask for all of the token-level vectors in the poem.

Finally, we will save vectors for every token *except* the 0th one (the `CLS` token) and the last one (the `SEP` token), which is at position `-1`.

In [None]:
import torch

doc_token_vectors = []
doc_tokens = []

for poem in poetry_texts:
    # Tokenize one poem at a time with the DistilBERT tokenizer.
    # padding=True has no effect on a single text, so the last position is always [SEP].
    inputs = tokenizer(poem, return_tensors="pt", truncation=True, padding=True)
    # Save the token IDs, dropping [CLS] (position 0) and [SEP] (position -1)
    doc_tokens.append(inputs.input_ids[0].numpy()[1:-1])
    # Send the tokenized poem to the GPU
    inputs = inputs.to("cuda")
    # Run the tokenized poem through the DistilBERT model.
    # no_grad() skips gradient tracking, which is not needed when we only read the outputs.
    with torch.no_grad():
        outputs = model(**inputs)

    first_document = 0
    second_position = 1
    last_position = -1

    # From the first (0th) document in the batch, keep the vectors from the
    # second position up to, but not including, the last position
    doc_token_vectors.append(outputs.last_hidden_state[first_document,second_position:last_position,:].detach().cpu().numpy())
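The `[1:-1]` slice only removes `[CLS]` and `[SEP]` because each poem is tokenized on its own, so no padding is added. The sketch below illustrates what would change with padded input. It uses plain lists rather than tensors, the helper name `strip_special` is hypothetical, and the ID `0` for `[PAD]` is our assumption about the `distilbert-base-uncased` vocabulary (`101` and `102` are the `[CLS]` and `[SEP]` IDs).

```python
def strip_special(input_ids, attention_mask):
    """Drop [CLS], [SEP], and any padding, using the attention mask to find the real length."""
    real_length = sum(attention_mask)  # number of non-padding positions
    return input_ids[1:real_length - 1]

# One short text, unpadded: [CLS] bank [SEP]
unpadded_ids = [101, 2924, 102]

# The same text padded to length 5 (0 assumed to be the [PAD] ID)
padded_ids = [101, 2924, 102, 0, 0]
padded_mask = [1, 1, 1, 0, 0]

print(unpadded_ids[1:-1])                      # plain slicing is fine without padding
print(padded_ids[1:-1])                        # with padding, [SEP] and a [PAD] leak through
print(strip_special(padded_ids, padded_mask))  # mask-aware slicing recovers just the word
```

If you adapt the loop to process poems in padded batches, a mask-aware slice like this one is one way to keep padding vectors out of the collection.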


Confirm that we have the same number of documents for both the tokens and the vectors:

In [None]:
len(doc_tokens), len(doc_token_vectors)

(5000, 5000)

In [None]:
doc_tokens[0], doc_token_vectors[0]

(array([ 3904,  1997,  2149, 19821,  2256,  2466,  2488,  2084,  2023,
         3904, 16778,  3723,  1010,  9787,  7540,  1997,  3267,  1010,
         9690,  2256,  2691,  6687, 29454,  5844,  2012,  1996,  3953,
         1997,  1996,  2712,  1012,  1996,  6687,  1010,  2205,  1010,
         1997, 24318,  1998, 15606,  1010,  1997, 21484,  1998,  4731,
         1010,  2320,  2009,  2001,  4326,  3432,  2005,  2108,  2025,
         1037,  2600,  2993,  1517,  2025,  2600,  1010,  2021,  1037,
         8744,  3637,  2006,  1037,  2600, 11142,  1012,  1998,  1010,
         6171, 13026,  2000,  1996,  2600,  1010,  2000,  2712,  1998,
         2712,  1011,  6497, 12699,  2083,  2049,  3096,  1010,  2009,
         4282,  1010,  2348,  2009,  2987,  1521,  1056,  2113,  2009,
         4282,  1010,  2008,  9273,  1998,  2037, 23689, 17301,  2595,
         2024,  2035,  2028,  2518,  1012,  2070,  2156,  2049,  2126,
         1997,  3241,  1025,  2087,  1010,  2025,  2664,  1012,  2145,
      

Each element of these lists contains all the tokens or vectors for one document. Now we concatenate them into one giant collection and check that the number of vectors still matches the number of tokens.

In [None]:
all_token_vectors = np.concatenate(doc_token_vectors, axis=0)

In [None]:
all_token_vectors.shape

(979560, 768)

We want to make comparisons between vectors quickly. One common option is *cosine similarity*, which measures the angle between vectors but ignores their length. We can speed this computation up by rescaling every token vector to have length 1.0, so that the cosine similarity of two vectors is simply their dot product.
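As a small check of this idea, the sketch below normalizes three made-up 2-dimensional vectors (hypothetical values, not real BERT output) and confirms that dot products between unit-length vectors behave like cosine similarities.

```python
import numpy as np

# Toy stand-ins for token vectors
toy_vectors = np.array([
    [3.0, 4.0],   # reference vector
    [6.0, 8.0],   # same direction, twice as long
    [4.0, -3.0],  # perpendicular to the reference
])

# Length of each row, then divide each row by its own length
toy_norms = np.sqrt(np.sum(toy_vectors ** 2, axis=1))
unit_vectors = toy_vectors / toy_norms[:, np.newaxis]

# Dot product of every unit vector with the first one
cosines = unit_vectors.dot(unit_vectors[0])
print(cosines)
```

The second vector points in the same direction as the first, so its cosine is 1 even though it is longer; the third is perpendicular, so its cosine is 0.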

In [None]:
row_norms = np.sqrt(np.sum(all_token_vectors ** 2, axis=1))
all_token_vectors /= row_norms[:,np.newaxis]

In [None]:
all_token_ids = np.concatenate(doc_tokens)

In [None]:
all_token_ids.shape

(979560,)

We can use these arrays to find all the instances of a word in the collection. Here we'll get the ID for **bank** and then find the position of every token with that ID.

In [None]:
search_keyword = "bank"

In [None]:
word_positions = np.where(all_token_ids == tokenizer.vocab[search_keyword])[0]

In [None]:
word_positions

array([ 33342,  52132,  77144,  88668, 114341, 119622, 142652, 158890,
       161170, 164239, 168784, 219253, 232196, 275022, 298212, 375956,
       385199, 392194, 437021, 437167, 451502, 519265, 533188, 569669,
       577342, 580189, 605425, 624431, 628761, 653525, 689306, 691542,
       705508, 723727, 729783, 745631, 766997, 780894, 795393, 843536,
       850933, 856114, 861446, 915122, 935153, 968510])
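These positions index into the concatenated collection, not into any single poem. If you want to know which poem a position came from, one option is to keep running totals of the document lengths. The sketch below uses a hypothetical three-document toy corpus of plain lists; `2924` is the ID for "bank", the other IDs are arbitrary placeholders, and `position_to_doc` is our own helper name.

```python
from bisect import bisect_right
from itertools import accumulate

# Toy corpus: three "poems" as lists of token IDs
toy_doc_tokens = [[2043, 2924, 1012], [1996, 2314], [2924, 1997, 1037, 2314]]

# Flatten into one long sequence, like all_token_ids
toy_all_ids = [token_id for doc in toy_doc_tokens for token_id in doc]

# Positions of every "bank" token in the flattened sequence
toy_positions = [i for i, token_id in enumerate(toy_all_ids) if token_id == 2924]

# Running totals of document lengths: document k covers positions below doc_ends[k]
doc_ends = list(accumulate(len(doc) for doc in toy_doc_tokens))

def position_to_doc(position):
    """Return the index of the document that contains a flattened position."""
    return bisect_right(doc_ends, position)

toy_doc_indices = [position_to_doc(p) for p in toy_positions]
print(toy_positions, toy_doc_indices)
```

Applied to the notebook's data, the same approach with `doc_ends` built from `doc_tokens` would give an index into `poetry_titles` and `poetry_authors`.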

In [None]:
# create an array so that we can go backwards from numeric token IDs to strings

word_lookup = np.empty(tokenizer.vocab_size, dtype="O")

for word, index in tokenizer.vocab.items():
  word_lookup[index] = word

In [None]:
tokenizer("bank")

{'input_ids': [101, 2924, 102], 'attention_mask': [1, 1, 1]}

In [None]:
word_lookup[[101, 2924, 102]]

array(['[CLS]', 'bank', '[SEP]'], dtype=object)

To help distinguish between the word positions for **bank** above, let's make a function `get_context()` that gets a "keyword in context" view for a given word position. This function will return 10 tokens on either side of the search keyword. Word pieces in the context view will be slightly modified for readability, replacing `##` with `-`.

In [None]:
def get_context(keyword_token_id, n=10):

  # The token where we will start the context view
  start_pos = max(0, keyword_token_id - n)
  # The token where we will end the context view
  end_pos = min(keyword_token_id + n + 1, len(all_token_ids))

  # Make a list called tokens and use word_lookup to get the words for given token IDs from starting position up to the keyword
  tokens = [word_lookup[word] for word in all_token_ids[start_pos:keyword_token_id] ]

  # Append to the list tokens with the search keyword surrounded by **asterisks**
  tokens.append("**" + word_lookup[all_token_ids[keyword_token_id]] + "**")

  # Extend the list tokens and use word_lookup to get the words for given token IDs from the keyword to the end position
  tokens.extend([word_lookup[word] for word in all_token_ids[keyword_token_id+1:end_pos] ])
  
  # Make wordpieces slightly more readable by replacing the ## special character with a dash - or nothing, and fixing punctuation
  context_wordpieces = " ".join(tokens).replace(" ##", "-")
  context_wordpieces = context_wordpieces.replace('##', '')
  context_wordpieces = context_wordpieces.replace(' ,', ',')
  context_wordpieces = context_wordpieces.replace(' .', '.')


  return context_wordpieces
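To see the windowing and clean-up steps in isolation, here is a compact toy version that works on a short list of token strings instead of `all_token_ids` and `word_lookup`. The function name `kwic` and the token list are illustrative assumptions; the clamping and replacement logic mirrors `get_context()`.

```python
def kwic(tokens, position, n=2):
    """Keyword-in-context view: up to n tokens on each side of tokens[position]."""
    start = max(0, position - n)                # do not run past the beginning
    end = min(position + n + 1, len(tokens))    # do not run past the end
    window = (
        tokens[start:position]
        + ["**" + tokens[position] + "**"]
        + tokens[position + 1:end]
    )
    text = " ".join(window).replace(" ##", "-")  # rejoin word pieces with a dash
    text = text.replace("##", "")                # a window may start mid-word
    text = text.replace(" ,", ",").replace(" .", ".")
    return text

toy_tokens = ["the", "mighty", "min", "##strel", "breathe", "##s", "no", "longer", ","]

print(kwic(toy_tokens, 2))  # keyword followed by its own continuation piece
print(kwic(toy_tokens, 0))  # window clamped at the start of the sequence
print(kwic(toy_tokens, 7))  # window that begins in the middle of a word
```

Because `all_token_ids` concatenates every poem, a window near the beginning or end of one poem can include tokens from a neighboring poem.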

We're also going to make a nearly identical function that returns a clean version of the context view, removing the `##` prefix so that word pieces are rejoined into whole words.

In [None]:
def get_context_clean(keyword_token_id, n=10):

  # The token where we will start the context view
  start_pos = max(0, keyword_token_id - n)
  # The token where we will end the context view
  end_pos = min(keyword_token_id + n + 1, len(all_token_ids))

  # Make a list called tokens and use word_lookup to get the words for given token IDs from starting position up to the keyword
  tokens = [word_lookup[word] for word in all_token_ids[start_pos:keyword_token_id] ]

  # Append to the list tokens with the search keyword surrounded by **asterisks**
  tokens.append("**" + word_lookup[all_token_ids[keyword_token_id]] + "**")

  # Extend the list tokens and use word_lookup to get the words for given token IDs from the keyword to the end position
  tokens.extend([word_lookup[word] for word in all_token_ids[keyword_token_id+1:end_pos] ])
  
  # Make wordpieces more readable by removing the ## prefix so that pieces rejoin into whole words, and fixing punctuation
  context_wordpieces = " ".join(tokens).replace(" ##", "")
  context_wordpieces = context_wordpieces.replace('##', '')
  context_wordpieces = context_wordpieces.replace(' ,', ',')
  context_wordpieces = context_wordpieces.replace(' .', '.')


  return context_wordpieces

To visualize the search keyword more easily, we're going to import a couple of Python modules that will allow us to output text with bolded words and other styling. Here we will make a function `print_md()` that will allow us to print with Markdown styling.

In [None]:
from IPython.display import Markdown, display

def print_md(string):
    display(Markdown(string))

In [None]:
for position in word_positions:

  print_md(f"<br> {position}:  {get_context(position)} <br>")

<br> 33342:  eli-sion forever eli-ding. today is a fog **bank** in which i am hiding. love is a burn <br>

<br> 52132:  surface, anti - tank missiles swarm-ed through numbered **bank** accounts like o-vid ’ s see-thing knotted seed <br>

<br> 77144:  thee. dad dead, mom — back in the **bank**, teller-ing — started dressing in cute skirts and <br>

<br> 88668:  the road. kneeling i plunge the spoon into the **bank** : chicken bro-th & amp ; rice. rain <br>

<br> 114341:  rain ! a ya-wn-ing soldier knelt against the **bank**, staring across the morning b-lea-r with fog <br>

<br> 119622:  the blurred reflections of the willow-s on the opposite **bank** received it. welcome, flowers. write your name <br>

<br> 142652:  of the weed, time of bram-ble along the **bank** of a canal muddy with old newspaper close - held <br>

<br> 158890:  a straight - backed chair at a desk and his **bank** account is empty and he wishes for death one lies <br>

<br> 161170:  read. read us again. but for a low **bank** of cloud, clear morning, empty sky. the <br>

<br> 164239:  t take. it is not like going to the **bank**. there are no hard candi-es in a basket <br>

<br> 168784:  1 water roared everywhere around us, yet from the **bank** all we could see of it were quick sp-ume <br>

<br> 219253:  off root cl-ump and cave-d - in grassy **bank**. last year ’ s downed trees, slash piles <br>

<br> 232196:  to be alone. beneath a dove and rainbow some **bank** their fire, wrap their er-ogen-ous zones in <br>

<br> 275022:  have fallen down ) fell in my dream beside the **bank** of england ’ s wall to be, me with <br>

<br> 298212:  gli-ness, even that which you ’ ve long **bank**-roll-ed. ladies, who of my lord would <br>

<br> 375956:  . he wished his only daughter to work in the **bank** but he ’ d given her a source to sustain <br>

<br> 385199:  tun-dra tires, lifted as if on wires, **bank**-ed over ice and rocked its wings to land. <br>

<br> 392194:  ifer-s in their shove and jo-stle down a **bank**, drinking, mud ca-king lips. eyelids, <br>

<br> 437021:  the otter, the beaver. i will climb the **bank** where the willow never dies. behind the fa-uve <br>

<br> 437167:  cars is one giant cai-man bas-king on the **bank**. the jaguar ’ s all swimming stealth now — <br>

<br> 451502:  ly sc-ut-tle back into mud holes drilling the **bank**. bending down to look, i could smell the <br>

<br> 519265:  that he goes to. it could be at a **bank** or a library or turning a piece of flat land <br>

<br> 533188:  stream grove of reed-s heron-s watching from the **bank** hen-ges whole fields honey-combe-d with so-uter <br>

<br> 569669:  by battleships, flickering of a grey tail on the **bank**, motionless hull-s enormous under a dead grey sky <br>

<br> 577342:  hid in holes at the br-im of the clay **bank** as the creek eased up pe-l-vic bones, <br>

<br> 580189:  and feel, lick the ic-icle broken from the **bank** and still say nothing at all, only cry pretty <br>

<br> 605425:  . it ’ s the fourth of july on the **bank** of hi-nk-son creek fifty years ago, the <br>

<br> 624431:  of production. ( hence the failure of the medici **bank**. ) none of it ’ s there that you <br>

<br> 628761:  a-y here it is, stuck close beside the **bank** beneath the bunch of grass that spin-dles rank its <br>

<br> 653525:  into dash-es that spread without prints onto the screaming **bank**. “ now, tell me one difference, ” <br>

<br> 689306:  though. no watch, no sleeping bag, no **bank**-book. the apartment looks the way it feels to <br>

<br> 691542:  the smallest branch. my ab-ode is at the **bank** of a river, a river that comes out of <br>

<br> 705508:  a co-uga-r to come by. i always **bank** on something parc-hed and am-bling to make my <br>

<br> 723727:  from cubic content realms of atmosphere at play beyond the **bank** and sho-al of time. then resonance begins, <br>

<br> 729783:  red paint on their shaft, or the iron turkey **bank** and the porcelain coffee cup that disappeared a while back <br>

<br> 745631:  going steady on the will-ame-tte. along the **bank**, i lift my pace from devil - may - <br>

<br> 766997:  . i remember a gang of friends racing a fog **bank** ’ s onslaught along the beach. seal - slick <br>

<br> 780894:  chain, imp-ass-ive, stepping to the farther **bank** — continuing their march, as if by word, <br>

<br> 795393:  worst one of my life, the woman at the **bank** tells me. though i ’ d like to be <br>

<br> 843536:  song, the heron ’ s great rise from the **bank**. last a carp leaps, voices and a lantern <br>

<br> 850933:  the burn comes down, and roar-s fra-e **bank** to bra-e ; and bird and beast in covert <br>

<br> 856114:  border a man jo-gs. the river ’ s **bank** green, small trees bob in the wind like tr <br>

<br> 861446:  off being young until you retire, and however you **bank** your screw, the money you save won ’ t <br>

<br> 915122:  ugh where i cl-ing. a range of clouds **bank**-ed up behind the peak of that ap-oc-ry <br>

<br> 935153:  , one ear tuned to the creek ’ s far **bank**, one dish-ed towards him. her un-star <br>

<br> 968510:  words almost our own as we come sliding down the **bank**. last night, we covered the gardens in plastic <br>

Here we make a list of all the context views for our keyword.

In [None]:
keyword_contexts = []

for position in word_positions:

  keyword_contexts.append(get_context_clean(position))

### **Plotting with SVD**

From here we can use a [Singular Value Decomposition](https://numpy.org/doc/stable/reference/generated/numpy.linalg.svd.html) to project these 768-dimensional vectors to the two axes of most variation.

In [None]:
U, S, Vt = np.linalg.svd(all_token_vectors[word_positions,:])

In [None]:
all_token_vectors[word_positions,:].shape

(46, 768)

For convenience, we put them into a Pandas DataFrame.

In [None]:
df = pd.DataFrame({"x": U[:,0], "y": U[:,1], "context": keyword_contexts})
df.head()

Unnamed: 0,x,y,context
0,-0.146669,0.031306,elision forever eliding. today is a fog **bank...
1,-0.142731,0.218669,"surface, anti - tank missiles swarmed through ..."
2,-0.147445,0.196902,"thee. dad dead, mom — back in the **bank**, te..."
3,-0.14694,-0.018884,the road. kneeling i plunge the spoon into the...
4,-0.154485,-0.094999,rain ! a yawning soldier knelt against the **b...


Then we will plot them from this DataFrame with the Python data viz library [Altair](https://altair-viz.github.io/gallery/scatter_tooltips.html).

In [None]:
import altair as alt

In [None]:
alt.Chart(df).mark_circle(size=200).encode(
    x="x", y="y",
    tooltip=['context']
    ).interactive().properties(
    width=500,
    height=500
)
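To build some intuition for why the first two columns of `U` can separate different uses of a word, here is a toy sketch with four made-up 3-dimensional vectors (hypothetical values, not BERT output). Two of them lean one way along the second dimension and two lean the other way.

```python
import numpy as np

# Four toy "token vectors": rows 0-1 form one group, rows 2-3 another
toy_vectors = np.array([
    [1.0, 0.5, 0.0],
    [1.0, 0.5, 0.0],
    [1.0, -0.5, 0.0],
    [1.0, -0.5, 0.0],
])
# Rescale each row to length 1.0, as we did for the real token vectors
toy_vectors /= np.sqrt(np.sum(toy_vectors ** 2, axis=1))[:, np.newaxis]

U, S, Vt = np.linalg.svd(toy_vectors)

# Plot coordinates: first and second columns of U
x, y = U[:, 0], U[:, 1]
print(np.round(x, 3))
print(np.round(y, 3))
```

In this toy case every row gets the same `x` value, because the first axis captures what all the vectors share, while `y` takes opposite signs for the two groups. The sign of each axis is arbitrary, so only the grouping is meaningful, not which group lands on the positive side.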

### **Put it all together**

In [None]:
search_keyword = 'bank'

word_positions = np.where(all_token_ids == tokenizer.vocab[search_keyword])[0]

# create an array so that we can go backwards from numeric token IDs to strings
word_lookup = np.empty(tokenizer.vocab_size, dtype="O")

for word, index in tokenizer.vocab.items():
  word_lookup[index] = word

def get_context_clean(keyword_token_id, n=10):

  # The token where we will start the context view
  start_pos = max(0, keyword_token_id - n)
  # The token where we will end the context view
  end_pos = min(keyword_token_id + n + 1, len(all_token_ids))

  # Make a list called tokens and use word_lookup to get the words for given token IDs from starting position up to the keyword
  tokens = [word_lookup[word] for word in all_token_ids[start_pos:keyword_token_id] ]

  # Append to the list tokens with the search keyword surrounded by **asterisks**
  tokens.append("**" + word_lookup[all_token_ids[keyword_token_id]] + "**")

  # Extend the list tokens and use word_lookup to get the words for given token IDs from the keyword to the end position
  tokens.extend([word_lookup[word] for word in all_token_ids[keyword_token_id+1:end_pos] ])
  
  # Make wordpieces more readable by removing the ## prefix so that pieces rejoin into whole words, and fixing punctuation
  context_wordpieces = " ".join(tokens).replace(" ##", "")
  context_wordpieces = context_wordpieces.replace('##', '')
  context_wordpieces = context_wordpieces.replace(' ,', ',')
  context_wordpieces = context_wordpieces.replace(' .', '.')


  return context_wordpieces

keyword_contexts = []

for position in word_positions:

  keyword_contexts.append(get_context_clean(position))
  
U, S, Vt = np.linalg.svd(all_token_vectors[word_positions,:])
df = pd.DataFrame({"x": U[:,0], "y": U[:,1], "context": keyword_contexts})
alt.Chart(df).mark_circle(size=200).encode(
    x="x", y="y",
    tooltip=['context']
    ).interactive().properties(
    width=500,
    height=500
)

It looks like there are two distinct clusters, well-separated on the $y$-axis.

In [None]:
positions_sorted_by_second_factor = np.argsort(U[:,1])
for position in positions_sorted_by_second_factor:
  print(U[position,1], get_context(word_positions[position], n=8))

-0.2020301 reflections of the willow-s on the opposite **bank** received it. welcome, flowers. write
-0.2012551 imp-ass-ive, stepping to the farther **bank** — continuing their march, as if by
-0.16963395 ear tuned to the creek ’ s far **bank**, one dish-ed towards him. her
-0.14708208 branch. my ab-ode is at the **bank** of a river, a river that comes
-0.1434011 here it is, stuck close beside the **bank** beneath the bunch of grass that spin-dles
-0.14265151 ’ s the fourth of july on the **bank** of hi-nk-son creek fifty years ago
-0.14099596 on the will-ame-tte. along the **bank**, i lift my pace from devil -
-0.14031844 weed, time of bram-ble along the **bank** of a canal muddy with old newspaper close
-0.13302709 our own as we come sliding down the **bank**. last night, we covered the gardens
-0.13197082 es that spread without prints onto the screaming **bank**. “ now, tell me one difference
-0.12787624 holes at the br-im of the clay **bank** as the creek eased up pe-l-vic
-0.12548

We can also search *all* of the vectors for words similar to a query word. 

In [None]:
def get_nearest(query_vector, n=30):
  cosines = all_token_vectors.dot(query_vector)
  ordering = np.flip(np.argsort(cosines))
  return ordering[:n]
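The ranking logic in `get_nearest()` can be checked on a handful of made-up unit vectors. This is a sketch under our own assumptions: `get_nearest_toy` takes the vector collection as an explicit argument instead of reading the global `all_token_vectors`, and the four 2-dimensional vectors are hypothetical.

```python
import numpy as np

# Four toy vectors, each already of length 1.0
toy_unit_vectors = np.array([
    [1.0, 0.0],
    [0.8, 0.6],
    [0.0, 1.0],
    [-1.0, 0.0],
])

def get_nearest_toy(query_vector, vectors, n=3):
    """Return the row indices of the n vectors with the highest cosine to the query."""
    cosines = vectors.dot(query_vector)      # one cosine per row
    ordering = np.flip(np.argsort(cosines))  # argsort is ascending, so reverse it
    return ordering[:n]

nearest = get_nearest_toy(toy_unit_vectors[0], toy_unit_vectors, n=3)
print(nearest)
```

When the query vector is itself a row of the collection, it has a cosine of 1.0 with itself and therefore comes back as the first result.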

To run this search, we first need to find the word position of our desired search keyword.

In [None]:
search_keyword = 'plunge'
word_positions = np.where(all_token_ids == tokenizer.vocab[search_keyword])[0]
for position in word_positions:

  print_md(f"<br> {position}: {get_context(position)} <br>")

<br> 10905: quickly, quickly cut him off from the known. **plunge** your source into the strange, the invisible wells gone <br>

<br> 14071: oran-ts describe the chop in grunt-s, then **plunge** through thirty feet of grease. i try to hold <br>

<br> 22382: oran-t inc-ong-ru-ous flights parallel and merging **plunge** into slap out of tidal pools the fresh kills beak <br>

<br> 88663: out on the side of the road. kneeling i **plunge** the spoon into the bank : chicken bro-th & <br>

<br> 146724: ; wounds the bird of paradise. on paths that **plunge** into pri-mo-rdial green, echo ’ s laughter <br>

<br> 181825: of sunrise before his mind bell-ows start. he **plunge**-s in to tame the water before the water tame <br>

<br> 190693: and down - and hit a world, at every **plunge**, and finished knowing - then - turning to watch <br>

<br> 250516: fold all its loose - flowing garments into one, **plunge**-s upon the shore, and floods the dun pale <br>

<br> 263690: d buffet-ings gan-nets alternately appear and vanish, **plunge**, rise, and loft and give their heads a <br>

<br> 304352: of star-light bear broken messages among mountains where shadows **plunge** yet our brightness is un-wave-ring ken-nst du <br>

<br> 316146: in all my dreams before my helpless sight, he **plunge**-s at me, gut-tering, choking, drowning <br>

<br> 331272: converge : warmth, humidity, temperature ’ s sudden **plunge** ; a child ’ s brain, objects, sound <br>

<br> 545234: make my way up-hill past a startled horse who **plunge**-s in the pad-dock above the nun-nery. <br>

<br> 594050: pirate ’ s plank, the diving board, the **plunge**, nor with the moon whether she be zombie or <br>

<br> 596775: . the prophet stands eye level with the ve-nding **plunge**, a here and now mechanism he would need to <br>

<br> 636251: old ferry along the banks of the ar-no, **plunge**-s his wooden bail-er into the bottom of the <br>

<br> 654514: _ _ _ _ _ _ having once taken the **plunge** the situation that preceded it becomes obsolete which a moment <br>

<br> 715675: what we make to keep making ) the concrete saw **plunge**-s and res-ur-face-s, precise as a <br>

<br> 789728: turkey sheds. ii the small world of the car **plunge**-s through the deep fields of the night, on <br>

<br> 837053: , 1969 - 74. a headache makes your mouth **plunge**, then it pulls away. the smell of diesel <br>

<br> 852433: you said, sliding a needle, watching do-pe **plunge**, the body ' s rush and tow until you <br>

In [None]:
keyword_position = 10905

In [None]:
contexts = [get_context(token_id) for token_id in get_nearest(all_token_vectors[keyword_position,:])]
for context in contexts:
  print_md(context)

quickly, quickly cut him off from the known. **plunge** your source into the strange, the invisible wells gone

out on the side of the road. kneeling i **plunge** the spoon into the bank : chicken bro-th &

really trying 1. first things first : surprise, **catch** your source off balance when he least expects it :

oran-ts describe the chop in grunt-s, then **plunge** through thirty feet of grease. i try to hold

. i, with a shift of my skin, **dive**-st my self to become the rock that shadows it

more sen-su-ous than the rest ) about to **dive** into the deep - blue waiting — call it the

man ' s eyes saves no one, but to **fling** them with a grace you did not know you knew

so carefully without breaking a single pearl-y cell to **slide** each piece into a cold blue china bowl the juice

; wounds the bird of paradise. on paths that **plunge** into pri-mo-rdial green, echo ’ s laughter

turkey sheds. ii the small world of the car **plunge**-s through the deep fields of the night, on

, a res-urgent stretch of store-front-s to **dive** into, com-pad-re, col leg-no,

, knee-l and drive the old nails home, **slide** another shin-gle into place, pound, toes bent

old ferry along the banks of the ar-no, **plunge**-s his wooden bail-er into the bottom of the

his daughter, who died after a car wreck. **wedge** her into the smoky path & amp ; cover her

blaze of day behind blank eyes. sound : birds **stab** greedy beak-s into trunk and seed, spill hu

you said, sliding a needle, watching do-pe **plunge**, the body ' s rush and tow until you

i have seen it melt out of his eyes it **dive**-s into the por-es of the earth when they

still water to deny them. but whale, you **dive** down until the ocean ’ s ground begs you solid

of sunrise before his mind bell-ows start. he **plunge**-s in to tame the water before the water tame

d buffet-ings gan-nets alternately appear and vanish, **plunge**, rise, and loft and give their heads a

then. we ’ d wrap the copper pipe and **drop** it in, then use the telephone battery to make

lone bulls at home when they smell pasture. they **thrust** their bone skulls under bar-bs, tongues quivering for

, craving earth like an after-tas-te. to **discover** in one ' s hand two local stones the size

morning while he shit-s on the can. det **ain** and con-fine, quickly, quickly cut him off

birds stab greedy beak-s into trunk and seed, **spill** hu-sk onto the heap where my dreaming and my

of the feet. ni-bble earl-obe-s, **dip** my tongue in the salt fold of shoulder and throat

angel - fat, steal it in mouthful-s, **store** it away where you save the face that you touched

peas, to cup water in our hands, to **seek** the right screw under the sofa for hours this gives

grief as greasy children reach deep into my fever to **scoop** out their revenge in double - dip-s..

of the feet. ni-bble earl-obe-s, **dip** my tongue in the salt fold of shoulder and throat