
# 🧠 **Understanding Embeddings with BERT**

In this notebook, we go into the world of embeddings, utilizing the BERT model to understand the semantic relationships between words in different contexts.

## 🛠️ Setup and Installation

Start by installing the necessary libraries to ensure all functionalities are available.

In [2]:
!pip install transformers==4.29.2
!pip install scipy==1.7.3

## 📚 Importing Libraries

Import essential modules for our tasks.

In [None]:
from transformers import BertModel, AutoTokenizer
from scipy.spatial.distance import cosine

## 🤖 Model Setup

Load the pre-trained BERT model and tokenizer. This model will help us extract embeddings for our analysis.

In [None]:
# Defining the model name
model_name = "bert-base-cased"

# Loading the pre-trained model and tokenizer
model = BertModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- Th


## 📝 Function Definition: Predict

Define a function that encodes input text into tensors, which are then fed to the model to obtain embeddings.

In [None]:
# Defining a function to encode the input text and get model predictions
def predict(text):
    encoded_inputs = tokenizer(text, return_tensors="pt")
    return model(**encoded_inputs)[0]

## 📃 Defining the Sentences

Set up sentences to analyze. The

In [None]:
# Defining the sentences
sentence1 = "There was a fly drinking from my soup"
sentence2 = "There is a fly swimming in my juice"
# sentence2 = "To become a commercial pilot, he had to fly for 1500 hours." # second fly example

# Tokenizing the sentences
tokens1 = tokenizer.tokenize(sentence1)
tokens2 = tokenizer.tokenize(sentence2)

## 🔍 Tokenization and Model Predictions

Tokenize the sentences and obtain predictions (embeddings) from the model.

In [None]:
# Getting model predictions for the sentences
out1 = predict(sentence1)
out2 = predict(sentence2)

## 🔄 Extracting Embeddings

Extract embeddings specifically for the word "fly" from both sentences.

In [None]:
# Extracting embeddings for the word 'fly' in both sentences
emb1 = out1[0:, tokens1.index("fly"), :].detach()[0]
emb2 = out2[0:, tokens2.index("fly"), :].detach()[0]

# emb1 = out1[0:, 3, :].detach()
# emb2 = out2[0:, 3, :].detach()

## 📊 Calculating Cosine Similarity

Calculate the cosine similarity between the embeddings of the word "fly" from both sentences to measure how context affects meaning.

In [None]:
# Calculating the cosine similarity between the embeddings
cosine(emb1, emb2)

0.06798791885375977

## 🌟 Conclusion

This notebook has guided you through the process of extracting and comparing word embeddings using BERT. Such techniques are fundamental in understanding word semantics and their usage across different contexts.

Experiment by changing the sentences or focusing on different words to see how the embeddings and their similarities vary!