<a href="https://colab.research.google.com/github/kuberiitb/artificial_intelligence/blob/main/notebooks/visualize_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook shows how the surrounding words change the embedding that BERT produces for a given word.

To show this, I use the word "bank", which has several meanings:
- The edge of a river
- A financial institution
- To rely on something

Workflow:
* Create a set of sentences, each using one specific meaning of the word "bank".
* Run each sentence through a BERT model to get a vector for every token.
* Look up the token id of "bank" and extract the vector at that token's position in the sentence.
* Since the vector is high-dimensional, project it to 2-D with the t-SNE algorithm and plot it.

In [1]:
# install transformers to load the model
!pip install transformers



In [2]:
import numpy as np
from transformers import AutoTokenizer, AutoModelForMaskedLM

In [3]:
# load the BERT tokenizer and model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name) # converts words or sentences to token ids
model = AutoModelForMaskedLM.from_pretrained(model_name) # BERT with its masked-LM head; .logits holds one vocabulary-sized vector per token


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [4]:
# create the corpus of texts and one context label per sentence (used to colour the plot)
texts = [
    # financial institution
    "I deposited my salary in the bank yesterday.",
    "The bank approved my loan within a week.",
    "She opened a savings account at the bank.",
    "The bank closes at 4 PM on weekdays.",
    "The bank provides both online and offline services to its customers.",

    # edge of a river
    "We sat on the grassy bank of the river to enjoy the sunset.",
    "Children were playing near the river bank, building sand castles.",
    "The fisherman cast his net from the bank of the wide river.",
    "Wildflowers grew in clusters along the steep bank of the stream.",
    "The boat was tied securely to a post on the river bank.",

    # to rely on something
    "You can bank on her to finish the project on time.",
    "We always bank on his experience when making tough decisions.",
    "We can bank on the weather being sunny this weekend.",
    "They bank on his team to deliver results.",
    "The companies bank on innovation to stay ahead.",
]
context = ["finance"]*5 + ["river"]*5 + ["rely"]*5

In [5]:
bank_token_embeddings = []

print(context)
print(len(context))

# token id of the word "bank" (identical for every sentence, so look it up once)
bank_token_id = tokenizer("bank", add_special_tokens=False)["input_ids"][0]

for text in texts:
  inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
  # .logits of the masked-LM model: one vocabulary-sized vector per token in the sentence
  embeddings = model(**inputs).logits[0].detach().numpy()
  # locate the position of the bank token in the sentence
  bank_token_index = [idx for idx, x in enumerate(inputs["input_ids"][0]) if x == bank_token_id]
  # extract the vector at that position
  bank_token_embedding = embeddings[bank_token_index]
  # flatten to 1-D and collect for plotting
  bank_token_embeddings.append(bank_token_embedding.reshape(-1))

['finance', 'finance', 'finance', 'finance', 'finance', 'river', 'river', 'river', 'river', 'river', 'rely', 'rely', 'rely', 'rely', 'rely']
15
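
The loop above assumes that the token id of "bank" occurs exactly once in every sentence. If it is missing, or appears twice, the flattened vector has a different length and the later conversion to a matrix breaks. A minimal sketch of a guard for that assumption follows; the helper name `find_single_position` is hypothetical and not part of the notebook.

```python
def find_single_position(input_ids, token_id):
    """Return the index of token_id in input_ids, requiring exactly one match."""
    # int() lets this work for plain ints as well as 0-d tensor elements
    positions = [idx for idx, tok in enumerate(input_ids) if int(tok) == token_id]
    if len(positions) != 1:
        raise ValueError(
            f"expected exactly one occurrence of token {token_id}, "
            f"found {len(positions)}"
        )
    return positions[0]
```

Inside the loop it could replace the list comprehension, for example `find_single_position(inputs["input_ids"][0], bank_token_id)`, so that a problematic sentence fails loudly at the point where it is processed.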


In [6]:
#check size of each embedding
for x in bank_token_embeddings:
  print(x.shape)
  break

(30522,)


In [7]:
# convert to a numpy matrix for dimensionality reduction
bank_token_embeddings_np = np.array(bank_token_embeddings)
bank_token_embeddings_np.shape

(15, 30522)
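
The 30522 columns equal the vocabulary size of `bert-base-uncased`: the loop reads `.logits` from `AutoModelForMaskedLM`, which are per-token scores over the vocabulary rather than hidden states. If the 768-dimensional contextual hidden state is wanted instead, one option is to load the plain encoder with `AutoModel` and read `last_hidden_state`. The sketch below is an alternative under that assumption, not what produced the outputs in this notebook; the function name `contextual_vector` is hypothetical, and the tokenizer and model are passed in as arguments.

```python
def contextual_vector(text, word, tokenizer, model):
    """Return the last-layer hidden state at the position of `word` in `text`.

    `tokenizer` and `model` are expected to behave like the objects returned by
    AutoTokenizer.from_pretrained(...) and AutoModel.from_pretrained(...).
    """
    # special tokens ([CLS], [SEP]) are kept here, as BERT saw them in pre-training
    inputs = tokenizer(text, return_tensors="pt")
    word_id = tokenizer(word, add_special_tokens=False)["input_ids"][0]
    ids = [int(i) for i in inputs["input_ids"][0]]
    position = ids.index(word_id)  # raises ValueError if the word is absent
    hidden = model(**inputs).last_hidden_state[0].detach().numpy()
    return hidden[position]


# Usage sketch (downloads the model, so it is left commented out):
# from transformers import AutoModel, AutoTokenizer
# tok = AutoTokenizer.from_pretrained("bert-base-uncased")
# enc = AutoModel.from_pretrained("bert-base-uncased")
# vec = contextual_vector("The bank approved my loan.", "bank", tok, enc)
# vec.shape  # one hidden-size vector for the "bank" token
```

Stacking these vectors for all sentences would give a much narrower matrix than the logits, which is also cheaper to feed into t-SNE.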

Reduce the embedding dimension to 2 and plot

In [8]:
from sklearn.manifold import TSNE

In [9]:
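# create the TSNE object
# n_components: number of output dimensions
# random_state: for reproducibility
# perplexity: roughly how many neighbours each point is compared with; it must be smaller than the number of samples (15 here)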
tsne = TSNE(n_components=2, random_state=42, perplexity=2)
bank_tsne = tsne.fit_transform(bank_token_embeddings_np)

In [10]:
bank_tsne.shape

(15, 2)

In [11]:
tsne.kl_divergence_

0.18724213540554047
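
A t-SNE layout depends on `perplexity` and `random_state`, so it can help to compare it with a deterministic linear projection. The sketch below uses PCA computed through an SVD in NumPy as that cross-check. It is a different technique from t-SNE, and the function name `pca_project` is hypothetical.

```python
import numpy as np


def pca_project(vectors, n_components=2):
    """Project row vectors onto their first principal components."""
    X = np.asarray(vectors, dtype=float)
    X = X - X.mean(axis=0, keepdims=True)  # centre each column
    # rows of Vt are the principal directions; U * S are the projected coordinates
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :n_components] * S[:n_components]
```

Calling `pca_project(bank_token_embeddings_np)` would return a `(15, 2)` array that can be plotted the same way as `bank_tsne`. If the three senses also separate there, the grouping is less likely to be an artefact of the t-SNE settings.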

In [12]:
bank_tsne.shape[0]

15

In [13]:
for x in range(bank_tsne.shape[0]):
  print(bank_tsne[x,:], "\t\t",texts[x])

[-45.18385    -1.2572727] 		 I deposited my salary in the bank yesterday.
[-29.408592  -5.54569 ] 		 The bank approved my loan within a week.
[-58.027332    1.3294841] 		 She opened a savings account at the bank.
[-25.119524 -14.227649] 		 The bank closes at 4 PM on weekdays.
[-1.970891e+01 -8.197819e-03] 		 The bank provides both online and offline services to its customers.
[ 54.951145 -36.343815] 		 We sat on the grassy bank of the river to enjoy the sunset.
[ 33.283722 -49.48029 ] 		 Children were playing near the river bank, building sand castles.
[ 48.01334  -30.287382] 		 The fisherman cast his net from the bank of the wide river.
[ 49.648666 -16.12431 ] 		 Wildflowers grew in clusters along the steep bank of the stream.
[ 36.842228 -36.172993] 		 The boat was tied securely to a post on the river bank.
[-138.79288   -16.625975] 		 You can bank on her to finish the project on time.
[-127.43147   -12.413934] 		 We always bank on his experience when making tough decisions.
[-146.98

In [14]:
import plotly.express as px

fig = px.scatter(x=bank_tsne[:, 0], y=bank_tsne[:, 1], color=context)
fig.update_layout(
    title="t-SNE projection of the 'bank' token vectors, coloured by context",
    xaxis_title="t-SNE dimension 1",
    yaxis_title="t-SNE dimension 2",
)
fig.show()
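
The plot gives a visual impression only. A rough numeric check is to compare the average cosine similarity between vectors of the same context with the average across different contexts, computed on the original high-dimensional vectors rather than on the t-SNE output. The sketch below is one way to do that; the function name `mean_cosine_by_label` is hypothetical.

```python
import numpy as np


def mean_cosine_by_label(vectors, labels):
    """Return {(label_a, label_b): mean cosine similarity between the two groups}."""
    X = np.asarray(vectors, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-length rows
    sims = X @ X.T                                    # pairwise cosine similarities
    labels = np.asarray(labels)
    names = sorted(set(labels.tolist()))
    result = {}
    for a in names:
        for b in names:
            mask = np.outer(labels == a, labels == b)
            if a == b:
                # exclude each vector's similarity with itself (always 1);
                # a group with a single member would leave nothing to average
                mask &= ~np.eye(len(labels), dtype=bool)
            result[(a, b)] = float(sims[mask].mean())
    return result
```

With the notebook's variables this would be called as `mean_cosine_by_label(bank_token_embeddings_np, context)`. Higher values on the same-label pairs than on the mixed pairs would support what the scatter plot suggests.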

**Issues to fix further**

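- If a sentence uses a variation of the word, such as "banks" or "banked", the tokenizer may map it to a different token or split it into several subword pieces. The lookup by the token id of "bank" then no longer finds the word, so the pieces would have to be aggregated before comparing.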
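
One common way to aggregate subword pieces is to average their vectors. BERT's WordPiece tokenizer marks continuation pieces with a leading `##`, so pieces can be grouped back into words on that basis. The sketch below assumes the token strings (for example from `tokenizer.convert_ids_to_tokens`) and the per-token vectors are already available; the function name `pool_wordpieces` and the toy tokens in the usage note are hypothetical, and mean pooling is only one possible choice.

```python
import numpy as np


def pool_wordpieces(tokens, vectors):
    """Merge WordPiece continuation tokens ("##...") into words and average their vectors."""
    vectors = np.asarray(vectors, dtype=float)
    words, groups = [], []
    for idx, tok in enumerate(tokens):
        if tok.startswith("##") and groups:
            words[-1] += tok[2:]      # glue the piece onto the previous word
            groups[-1].append(idx)
        else:
            words.append(tok)         # start a new word
            groups.append([idx])
    pooled = np.stack([vectors[g].mean(axis=0) for g in groups])
    return words, pooled
```

For a toy input such as `["the", "plane", "bank", "##ed", "left"]`, the third and fourth vectors would be averaged into a single vector for "banked", which could then be compared with the vectors of "bank" from the other sentences.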