# Create Embeddings out of an LLM

### Problem Statement
Your mission, should you choose to accept it, is to extract **meaningful sentence-level embeddings** using a pre-trained **causal language model (SmolLM2-135M)** on Amazon Reviews.

You're working with a **generative language model**, but you‚Äôre not here to generate Shakespeare. Instead, you‚Äôll tap into its **hidden states** to get semantic embeddings that capture the essence of a review ‚Äî the good, the bad, and the brutally honest.

---

### Requirements

1. **Load and Tokenize Text**
   - Use the `McAuley-Lab/Amazon-Reviews-2023` dataset (subset: `raw_review_All_Beauty`).
   - Load ~10 sample reviews for testing.
   - Tokenize them using `"HuggingFaceTB/SmolLM2-135M"` tokenizer.

2. **Extract Embeddings**
   - Run the tokenized batch through the model with `output_hidden_states=True`.
   - Access the **last hidden layer** from `outputs.hidden_states[-1]`.

3. **Compute Sentence Embeddings**
   - Options:
     - If the model uses a classification token (e.g., `[CLS]`), extract its embedding.
     - For causal models (which typically don‚Äôt), **average the token embeddings** from the final layer, **excluding padding tokens**.

4. **Find the cosine similarity for a given keyword** 
   - Compute the cosine similarity between the average embeddings of the reviews and a keyword.

---

### Constraints

- ‚ùå Do **not** use sentence-transformers or pre-built embedding tools like `bert-as-service`.
- ‚ùå Do **not** generate text (no `.generate()`).
- ‚úÖ Use only Hugging Face's `AutoModelForCausalLM` and `AutoTokenizer`.
- ‚úÖ Exclude padding tokens when computing average embeddings.
- ‚úÖ Ensure everything runs on `cuda` if available.

---

<details>
  <summary>üí° Hint</summary>

```python
# Run model with hidden states
outputs = model(**tokenized_inputs, output_hidden_states=True, return_dict=True)

# Get the last hidden layer (batch_size, seq_len, hidden_dim)
last_hidden = outputs.hidden_states[-1]

# Use the attention mask to avoid averaging over padding
attention_mask = tokenized_inputs['attention_mask']  # (batch_size, seq_len)

# Compute masked average: zero out padding tokens
masked_embeddings = last_hidden * attention_mask.unsqueeze(-1)  # broadcast mask
summed = masked_embeddings.sum(dim=1)  # sum across tokens
count = attention_mask.sum(dim=1, keepdim=True)  # count of non-padding tokens

# Final sentence-level embeddings
sentence_embeddings = summed / count  # (batch_size, hidden_dim)


In [7]:
import torch
import torch.nn as nn
import torch.optim as optim

In [None]:
# Create sample reviews for testing
# Using synthetic reviews instead of loading from dataset due to compatibility issues
reviews = [
    "This product has amazing quality and works perfectly!",
    "Terrible quality, broke after one use. Very disappointed.",
    "Great value for money. The quality exceeded my expectations.",
    "The quality is okay but nothing special. Average product.",
    "Absolutely love the quality! Best purchase I've made.",
    "Poor quality materials. Would not recommend to anyone.",
    "Decent quality for the price point. Does what it needs to do.",
    "Outstanding quality and craftsmanship. Worth every penny!",
    "The quality is questionable. Mine arrived damaged.",
    "Exceptional quality! This product will last for years."
]

print(f"Loaded {len(reviews)} sample reviews")
print(f"\nFirst review: {reviews[0]}")

In [None]:
# Sample reviews loaded successfully
# Ready to tokenize and extract embeddings

In [None]:
# Load SmolLM2-135M model and tokenizer
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")

# Set padding token (required for batch processing)
tokenizer.pad_token = tokenizer.eos_token

print(isinstance(model, torch.nn.Module))  # Should print: True
print(f"Model loaded: {model.__class__.__name__}")
print(f"Padding token set to: {tokenizer.pad_token}")

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
print(f"Using device: {device}")

In [None]:
import torch.nn.functional as F

# Tokenize the reviews with padding for batch processing
encodings = tokenizer(reviews, return_tensors="pt", padding=True, truncation=True)
input_ids = encodings['input_ids'].to(device)
attention_mask = encodings['attention_mask'].to(device)

print(f"Tokenized {len(reviews)} reviews")
print(f"Input shape: {input_ids.shape}")

# Forward pass with output_hidden_states=True to get all hidden states
with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask, output_hidden_states=True)

# Extract last hidden states (batch_size, seq_len, hidden_dim)
last_hidden_states = outputs.hidden_states[-1]
print(f"Last hidden states shape: {last_hidden_states.shape}")

# Compute sentence embeddings by averaging token embeddings excluding padding tokens
# attention_mask has 1 for real tokens, 0 for padding
expanded_mask = attention_mask.unsqueeze(-1).expand(last_hidden_states.size()).float()
sum_embeddings = torch.sum(last_hidden_states * expanded_mask, dim=1)
sum_mask = torch.clamp(expanded_mask.sum(dim=1), min=1e-9)  # avoid division by zero
sentence_embeddings = sum_embeddings / sum_mask  # (batch_size, hidden_dim)

print(f"Sentence embeddings shape: {sentence_embeddings.shape}")

# --- Cosine similarity for a given keyword ---
print("\n" + "="*60)
print("Computing cosine similarity with keyword: 'quality'")
print("="*60)

keyword = "quality"

# Tokenize and embed the keyword the same way
keyword_enc = tokenizer(keyword, return_tensors="pt")
keyword_input_ids = keyword_enc['input_ids'].to(device)
keyword_attention_mask = keyword_enc['attention_mask'].to(device)

with torch.no_grad():
    keyword_outputs = model(keyword_input_ids, attention_mask=keyword_attention_mask, output_hidden_states=True)

keyword_last_hidden = keyword_outputs.hidden_states[-1]
keyword_mask = keyword_attention_mask.unsqueeze(-1).expand(keyword_last_hidden.size()).float()
keyword_embedding = (keyword_last_hidden * keyword_mask).sum(dim=1) / torch.clamp(keyword_mask.sum(dim=1), min=1e-9)

# Compute cosine similarity between keyword embedding and each review embedding
cosine_similarities = F.cosine_similarity(sentence_embeddings, keyword_embedding)

print(f"\nCosine similarities (higher = more similar to '{keyword}'):\n")
for i, (review, sim) in enumerate(zip(reviews, cosine_similarities)):
    print(f"Review #{i+1} [similarity: {sim.item():.4f}]: {review[:60]}...")

print("\n‚úÖ Success! Extracted embeddings and computed similarities")