<a href="https://colab.research.google.com/github/mr-cri-spy/crisbee-chatbot/blob/main/01_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Understanding Embeddings with BERT
This notebook uses a pretrained BERT model to extract contextual embeddings and compare how the same word is represented in different sentences.

Setup and Installation
Install the Hugging Face transformers library and SciPy. PyTorch is also required because the tokenizer returns PyTorch tensors; it comes preinstalled on Colab.

In [4]:
!pip install transformers
!pip install scipy



Importing Libraries
Import the BERT model class, the tokenizer loader, and SciPy's cosine distance function.

In [7]:
from transformers import BertModel, AutoTokenizer
from scipy.spatial.distance import cosine

Model Setup
Load the pretrained BERT model and tokenizer. This model will help us extract embeddings for our analysis.

In [8]:
#defining model name
model_name = "bert-base-cased"

#loading the pretrained model and tokenizer
model = BertModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)


Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



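Defining the predict Function

Define a function that tokenizes the input text into tensors and feeds them to the model. It returns both the token embeddings (the last hidden state) and the encoded inputs, because the encoded inputs are needed later to locate individual words.

In [21]:
# Encode the input text and return the token embeddings with the encoding

def predict(text):
    # Tokenize the text into PyTorch tensors (input_ids, attention_mask, ...)
    encoded_inputs = tokenizer(text, return_tensors='pt')
    # Index 0 of the model output is the last hidden state,
    # with shape (batch_size, sequence_length, hidden_size)
    return model(**encoded_inputs)[0], encoded_inputs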

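Defining the Sentences
Set up two sentences that both contain the word "fly". Both use it as a noun; the commented-out alternative uses it as a verb, for comparison.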

In [24]:
#Defining the Sentences

sentence1 = "There was a fly drinking from my soup"
sentence2 = "There is a fly swimming in my juice"
# sentence2 = "To become a commercial pilot, he had to fly for 1500 hours." # second fly example

#Tokenizing the sentences
# sentence1 = tokenizer.tokenize(sentence1)
# sentence2 = tokenizer.tokenize(sentence2)

Tokenization and Model Predictions

Tokenize the sentences and obtain predictions (embeddings) from the model.

In [22]:
# Getting model predictions for the sentences
out1, encoded_sentence1 = predict(sentence1)
out2, encoded_sentence2 = predict(sentence2)

Extracting Embeddings
Extract embeddings specifically for the word "fly" from both sentences.

In [30]:
# Extracting embeddings for the word 'fly' in both sentences.
# The tokenizer adds special tokens such as [CLS] and [SEP], so the token
# position of 'fly' differs from its position in the whitespace-split sentence.
# out1/out2 and encoded_sentence1/encoded_sentence2 come from the previous cell.


# word_ids() maps each token position to the index of the word it came from
# (None for special tokens); it requires a fast tokenizer
word_ids1 = encoded_sentence1.word_ids()
word_ids2 = encoded_sentence2.word_ids()

# Find the index of the word 'fly' in the whitespace-split sentences
# (this matches word_ids here because the sentences contain no punctuation)
fly_word_index1 = sentence1.split().index("fly")
fly_word_index2 = sentence2.split().index("fly")

print("Sentence 1 word_ids:", word_ids1)
print("Sentence 2 word_ids:", word_ids2)
print("Sentence 1 tokens:", tokenizer.convert_ids_to_tokens(encoded_sentence1['input_ids'][0].tolist()))
print("Sentence 2 tokens:", tokenizer.convert_ids_to_tokens(encoded_sentence2['input_ids'][0].tolist()))
print("Fly word index in sentence 1:", fly_word_index1)
print("Fly word index in sentence 2:", fly_word_index2)


# Find the token index corresponding to the word 'fly'
fly_token_index1 = -1
for i, word_id in enumerate(word_ids1):
    if word_id is not None and word_id == fly_word_index1:
        fly_token_index1 = i
        break

fly_token_index2 = -1
for i, word_id in enumerate(word_ids2):
    if word_id is not None and word_id == fly_word_index2:
        fly_token_index2 = i
        break


if fly_token_index1 == -1 or fly_token_index2 == -1:
    raise ValueError("Could not find the token for the word 'fly' in one or both sentences.")

# Select batch item 0 and the 'fly' token position: a vector of size hidden_size
emb1 = out1[0, fly_token_index1].detach()
emb2 = out2[0, fly_token_index2].detach()

Sentence 1 word_ids: [None, 0, 1, 2, 3, 4, 5, 6, 7, None]
Sentence 2 word_ids: [None, 0, 1, 2, 3, 4, 5, 6, 7, None]
Sentence 1 tokens: ['[CLS]', 'There', 'was', 'a', 'fly', 'drinking', 'from', 'my', 'soup', '[SEP]']
Sentence 2 tokens: ['[CLS]', 'There', 'is', 'a', 'fly', 'swimming', 'in', 'my', 'juice', '[SEP]']
Fly word index in sentence 1: 3
Fly word index in sentence 2: 3
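
The lookup loops above stop at the first matching token, which is enough here because "fly" is a single token. As a minimal sketch of the same idea for words that split into several word pieces, the helper below collects every token position for a word index. The name token_indices_for_word and the second word_ids list are hypothetical, made up for illustration; the first list is the one printed above.

```python
from typing import List, Optional


def token_indices_for_word(word_ids: List[Optional[int]], word_index: int) -> List[int]:
    """Return every token position whose word id equals word_index."""
    return [i for i, word_id in enumerate(word_ids) if word_id == word_index]


# word_ids printed above for sentence 1; None marks [CLS] and [SEP].
word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 7, None]
fly_positions = token_indices_for_word(word_ids, 3)
print(fly_positions)  # [4]

# Hypothetical word_ids for a sentence whose second word splits into two pieces.
split_word_ids = [None, 0, 1, 1, 2, None]
print(token_indices_for_word(split_word_ids, 1))  # [2, 3]
```

When a word maps to several positions, one common choice is to average the embeddings at those positions to get a single vector for the word; taking only the first piece, as this notebook does, is another.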


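Comparing the Embeddings
Calculate the cosine distance between the two embeddings of the word "fly" to measure how much context changes its representation. SciPy's cosine function returns a distance, defined as 1 minus the cosine similarity, so values near 0 mean the embeddings point in nearly the same direction. The result below, about 0.106, corresponds to a cosine similarity of about 0.894, which fits the fact that "fly" is an insect in both sentences.

In [31]:
# Calculating the cosine distance (1 - cosine similarity) between the embeddings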
cosine(emb1, emb2)

np.float32(0.10581541)
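
Note that scipy.spatial.distance.cosine returns the cosine distance, which is 1 minus the cosine similarity, so the value above is a distance. The sketch below spells out that relationship in plain Python on toy vectors; the function names cosine_similarity and cosine_distance are illustrative and not part of SciPy, and the toy vectors are made up.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Same convention as SciPy's cosine: 1 minus the cosine similarity."""
    return 1.0 - cosine_similarity(u, v)


# Toy vectors (not real embeddings)
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors: 1.0
print(cosine_distance([1.0, 0.0], [2.0, 0.0]))  # same direction: 0.0

# Converting the distance reported in this notebook back to a similarity
reported_distance = 0.10581541
print(round(1.0 - reported_distance, 3))  # 0.894
```

A distance near 0 means the two "fly" embeddings are close. Swapping in the commented-out pilot sentence, where "fly" is a verb, would be expected to give a larger distance, though the exact value is not shown here.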