# 🧠✨ **Text Encoding with Pretrained Language Models (BERT)**
In this notebook, I demonstrate how to tokenize text and extract embeddings using a **pretrained BERT model** from Hugging Face's `transformers` library.

## 🎯 **Task Objective**

- Understand how text is tokenized using a pretrained tokenizer.

- Pass the tokens through a pretrained BERT model.

- Extract the `[CLS]` token embedding as a representation of the sentence.

- Compare the similarity of two sentences using cosine similarity.

## 📦 **Importing Libraries**

In [1]:
from transformers import BertTokenizer, BertModel
import torch
import torch.nn.functional as F

## 🧰 **Loading the Pretrained BERT Model**

In this step, I load the `bert-base-uncased` model and its tokenizer.

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

## 🧪 **Tokenizing a Sentence and Extracting the Embedding**
In this step, I tokenize a sentence using BERT and extract the `[CLS]` embedding. I also display:

- The tokenized words

- Their corresponding token IDs

- A sample of the `[CLS]` embedding vector

In [3]:
text = "Text encoding with BERT is powerful."

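# Tokenize into PyTorch tensors; the tokenizer adds [CLS] and [SEP] automatically.
inputs = tokenizer(text, return_tensors="pt")
input_ids = inputs["input_ids"][0]

tokens = tokenizer.convert_ids_to_tokens(input_ids)
print("Tokens:", tokens)
print("Token IDs:", input_ids.tolist())

# Inference only, so skip gradient tracking.
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch, sequence_length, hidden_size);
# position 0 along the sequence axis is the [CLS] token.
cls_embedding = outputs.last_hidden_state[:, 0, :]
print("\nCLS embedding shape:", cls_embedding.shape)

print("\nSample CLS embedding (first 10 values):")
print(cls_embedding[0][:10])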

Tokens: ['[CLS]', 'text', 'encoding', 'with', 'bert', 'is', 'powerful', '.', '[SEP]']
Token IDs: [101, 3793, 17181, 2007, 14324, 2003, 3928, 1012, 102]

CLS embedding shape: torch.Size([1, 768])

Sample CLS embedding (first 10 values):
tensor([-0.5005, -0.1889, -0.1415, -0.0087, -0.5979, -0.5189, -0.0960,  0.1999,
         0.0647, -0.4416])
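
In the run above every word maps to a single token, because each one happens to be in the `bert-base-uncased` vocabulary. Words that are not in the vocabulary are split into subword pieces, with continuation pieces marked by a `##` prefix. The following is a minimal sketch of that greedy longest-match-first splitting for a single, already lowercased word. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical and chosen only for illustration; this is not the real BERT vocabulary, and the sketch leaves out the lowercasing and punctuation splitting that the real tokenizer performs first.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercased word (illustrative sketch)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, NOT the real bert-base-uncased vocabulary.
toy_vocab = {"text", "bert", "encod", "##ing", "power", "##ful"}

print(wordpiece_tokenize("encoding", toy_vocab))  # ['encod', '##ing']
print(wordpiece_tokenize("powerful", toy_vocab))  # ['power', '##ful']
print(wordpiece_tokenize("zebra", toy_vocab))     # ['[UNK]']
```

With the real vocabulary, `encoding` and `powerful` are each a single token, as the output above shows; the toy vocabulary deliberately omits them to force a split.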


## 📏 **Comparing Multiple Sentences Using Cosine Similarity**
In this step, I compare three sentences: two on a related topic, and a third that is identical to the first. For each, I show the tokens and token IDs, along with the pairwise cosine similarity of the `[CLS]` embeddings.

In [4]:
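def get_cls_embedding(text):
    """Return the L2-normalized [CLS] embedding and the token IDs for `text`."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    cls_embedding = outputs.last_hidden_state[:, 0, :]  # [CLS] sits at position 0
    return F.normalize(cls_embedding, p=2, dim=1), inputs["input_ids"][0]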

# Input sentences
text1 = "I love machine learning."
text2 = "Machine learning is fascinating."
text3 = "I love machine learning."  # exactly the same as text1


emb1, ids1 = get_cls_embedding(text1)
emb2, ids2 = get_cls_embedding(text2)
emb3, ids3 = get_cls_embedding(text3)

tokens1 = tokenizer.convert_ids_to_tokens(ids1)
tokens2 = tokenizer.convert_ids_to_tokens(ids2)
tokens3 = tokenizer.convert_ids_to_tokens(ids3)


sim12 = F.cosine_similarity(emb1, emb2)
sim13 = F.cosine_similarity(emb1, emb3)

print(f"Cosine similarity (text1 vs text2): {sim12.item():.4f}")
print(f"Cosine similarity (text1 vs text3): {sim13.item():.4f} (should be ~1.0)")


print("\nTokens and IDs:")
print("Text 1 Tokens :", tokens1)
print("Text 1 IDs    :", ids1.tolist())
print("\nText 2 Tokens :", tokens2)
print("Text 2 IDs    :", ids2.tolist())
print("\nText 3 Tokens :", tokens3)
print("Text 3 IDs    :", ids3.tolist())

print("\nExplanation:")
print("- text1 and text3 are identical, so their cosine similarity is 1.0 (up to floating-point rounding).")
print("- text1 and text2 share the phrase 'machine learning', and their [CLS] embeddings also score high.")
print("- Raw BERT [CLS] vectors tend to score high for most sentence pairs, so check an unrelated sentence before reading this as semantic closeness.")


Cosine similarity (text1 vs text2): 0.9456
Cosine similarity (text1 vs text3): 1.0000 (should be ~1.0)

Tokens and IDs:
Text 1 Tokens : ['[CLS]', 'i', 'love', 'machine', 'learning', '.', '[SEP]']
Text 1 IDs    : [101, 1045, 2293, 3698, 4083, 1012, 102]

Text 2 Tokens : ['[CLS]', 'machine', 'learning', 'is', 'fascinating', '.', '[SEP]']
Text 2 IDs    : [101, 3698, 4083, 2003, 17160, 1012, 102]

Text 3 Tokens : ['[CLS]', 'i', 'love', 'machine', 'learning', '.', '[SEP]']
Text 3 IDs    : [101, 1045, 2293, 3698, 4083, 1012, 102]

Explanation:
- text1 and text3 are identical, so their cosine similarity is 1.0 (up to floating-point rounding).
- text1 and text2 share the phrase 'machine learning', and their [CLS] embeddings also score high.
- Raw BERT [CLS] vectors tend to score high for most sentence pairs, so check an unrelated sentence before reading this as semantic closeness.
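
To make the similarity score concrete, here is a minimal pure-Python sketch of what cosine similarity computes: the dot product of two vectors divided by the product of their lengths. The three-dimensional vectors `a`, `b`, and `c` are hypothetical stand-ins for the 768-dimensional `[CLS]` embeddings, and the helper names are illustrative rather than part of PyTorch.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional stand-ins for 768-dimensional [CLS] vectors.
a = [1.0, 2.0, 2.0]
b = [2.0, 4.0, 4.0]   # same direction as a, twice the length
c = [2.0, -1.0, 0.0]  # orthogonal to a

print(cosine_similarity(a, b))  # 1.0
print(cosine_similarity(a, c))  # 0.0

# After L2 normalization, the plain dot product already equals the cosine.
a_unit, b_unit = l2_normalize(a), l2_normalize(b)
dot_of_units = sum(x * y for x, y in zip(a_unit, b_unit))
print(math.isclose(dot_of_units, cosine_similarity(a, b)))  # True
```

Because `get_cls_embedding` already returns L2-normalized vectors, the later call to `F.cosine_similarity` normalizes unit vectors a second time. That is harmless, and it means a plain dot product of the two embeddings would give the same score.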


## 📝 **Summary**
In this task, I explored how a pretrained BERT model converts text into numerical embeddings. Specifically, I:

- **Tokenized raw text** into model-readable input using BERT's tokenizer, which lowercases the text, maps it to WordPiece tokens framed by `[CLS]` and `[SEP]`, and assigns each token its vocabulary ID.

- **Extracted the `[CLS]` token embedding**, a 768-dimensional vector taken from the first position of the final hidden layer and used here as a compact representation of the whole sentence.

- **Compared sentence embeddings** by computing cosine similarity: the identical pair scored 1.0000 and the two related sentences scored 0.9456.

This hands-on exercise deepened my understanding of how transformer-based language models encode text, which underpins NLP applications such as text classification, semantic search, and clustering.