# Embedding Generation for Relevance Scoring

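This notebook builds a Sentence-BERT embedding generator and saves it so the backend can score relevance quickly.

**Goal**: Create `embeddings.pkl` for semantic similarity calculations.

## What This Does:
1. Loads a pre-trained Sentence-BERT model (`all-MiniLM-L6-v2`)
2. Wraps it in an `EmbeddingGenerator` class with an in-memory cache
3. Pickles the generator (model and cache) to `embeddings.pkl` for offline use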

**Upload this notebook to Kaggle and run it there!**

## 1. Install Dependencies

In [None]:
!pip install -q sentence-transformers transformers torch scikit-learn numpy

## 2. Import Libraries

In [None]:
import os
import pickle

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

print("✅ Libraries imported successfully!")

## 3. Load Sentence-BERT Model

In [None]:
# Load pre-trained Sentence-BERT model
model_name = 'all-MiniLM-L6-v2'  # Fast and accurate
print(f"Loading model: {model_name}...")

embedder = SentenceTransformer(model_name)

print("✅ Model loaded!")
print(f"Embedding dimension: {embedder.get_sentence_embedding_dimension()}")

## 4. Create Embedding Wrapper Class

In [None]:
class EmbeddingGenerator:
    """Wrapper for sentence embeddings with caching."""
    
    def __init__(self, model_name: str = 'all-MiniLM-L6-v2'):
        self.model = SentenceTransformer(model_name)
        self.cache = {}
    
    def encode(self, texts, use_cache: bool = True):
        """Generate embeddings for text(s)."""
        if isinstance(texts, str):
            texts = [texts]
            single = True
        else:
            single = False
        
        embeddings = []
        for text in texts:
            if use_cache and text in self.cache:
                embeddings.append(self.cache[text])
            else:
                emb = self.model.encode(text, convert_to_numpy=True)
                if use_cache:
                    self.cache[text] = emb
                embeddings.append(emb)
        
        embeddings = np.array(embeddings)
        return embeddings[0] if single else embeddings
    
    def similarity(self, text1: str, text2: str) -> float:
        """Calculate cosine similarity between two texts."""
        emb1 = self.encode(text1)
        emb2 = self.encode(text2)
        # Convert the NumPy scalar to a plain float to match the annotation.
        return float(cosine_similarity([emb1], [emb2])[0][0])
    
    def clear_cache(self):
        """Clear embedding cache."""
        self.cache.clear()

print("✅ EmbeddingGenerator class defined!")
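
The `encode` method above calls the model once per text. `SentenceTransformer.encode` also accepts a list of sentences, so cache misses could be sent in a single batched call instead. The sketch below shows one way to do that. `encode_with_cache` and `fake_encode` are hypothetical names introduced for illustration; `fake_encode` stands in for the real model so the example runs without downloading weights.

```python
import numpy as np


def encode_with_cache(texts, cache, encode_fn):
    """Return one embedding row per text, encoding cache misses in one batch.

    `encode_fn` is assumed to take a list of strings and return a 2-D array
    with one row per string, as `SentenceTransformer.encode` does for a list.
    """
    # Deduplicate while preserving order, then keep only uncached texts.
    missing = [t for t in dict.fromkeys(texts) if t not in cache]
    if missing:
        new_rows = encode_fn(missing)  # single batched call
        for text, row in zip(missing, new_rows):
            cache[text] = row
    return np.array([cache[t] for t in texts])


def fake_encode(batch):
    """Hypothetical stand-in for the model: maps each text to a 3-d vector."""
    return np.array(
        [[len(t), t.count(" "), 1.0] for t in batch], dtype=np.float32
    )


cache = {}
rows = encode_with_cache(["a b", "abc", "a b"], cache, fake_encode)
print(rows.shape)  # (3, 3)
print(len(cache))  # 2
```

With the real generator, the same helper could be called as `encode_with_cache(texts, generator.cache, lambda batch: generator.model.encode(batch, convert_to_numpy=True))`.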

## 5. Test the Embedding Generator

In [None]:
# Initialize generator
generator = EmbeddingGenerator()

# Test with sample texts
topic = "Introduction to Machine Learning"
video_titles = [
    "Machine Learning Tutorial for Beginners",
    "Deep Learning Explained",
    "Introduction to ML - Complete Course",
    "Python Programming Basics",
    "What is Machine Learning? ML Explained"
]

print(f"\n📊 Similarity scores for topic: '{topic}'\n")
for title in video_titles:
    score = generator.similarity(topic, title)
    print(f"{score:.4f} - {title}")

print(f"\n✅ Cache size: {len(generator.cache)} embeddings")
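
The scores printed above are cosine similarities: the dot product of two embeddings divided by the product of their norms. A minimal sketch of that calculation follows, using made-up toy vectors rather than real model output; `cosine` is a hypothetical helper name.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy vectors (made up for illustration, not model output).
same_direction = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])

print(round(same_direction, 6))  # 1.0
print(round(orthogonal, 6))      # 0.0
```

Vectors pointing the same way score 1 regardless of length, which is why the metric suits comparing a short topic string against longer titles.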

## 6. Batch Embedding Test

In [None]:
# Test batch encoding
sample_texts = [
    "Linear Regression in Machine Learning",
    "Neural Networks and Deep Learning",
    "Classification Algorithms Tutorial",
    "Gradient Descent Optimization"
]

batch_embeddings = generator.encode(sample_texts)

print("\n📦 Batch encoding results:")
print(f"Input: {len(sample_texts)} texts")
print(f"Output shape: {batch_embeddings.shape}")
print(f"Embedding dimension: {batch_embeddings.shape[1]}")
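
Once a batch of embeddings is available as a 2-D array, a topic can be scored against every candidate in one matrix operation instead of a Python loop. The sketch below assumes toy 2-d vectors in place of real 384-dimensional embeddings; `rank_by_similarity` is a hypothetical helper, not part of the notebook.

```python
import numpy as np


def rank_by_similarity(query_vec, candidate_matrix):
    """Return (row indices sorted best-first, cosine score per row)."""
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_matrix / np.linalg.norm(candidate_matrix, axis=1, keepdims=True)
    scores = c @ q               # one cosine score per candidate row
    order = np.argsort(-scores)  # highest score first
    return order, scores


# Toy 2-d vectors standing in for real embeddings.
query = np.array([1.0, 0.0])
candidates = np.array([
    [0.0, 1.0],  # orthogonal to the query
    [1.0, 0.0],  # identical direction
    [1.0, 1.0],  # 45 degrees away
])

order, scores = rank_by_similarity(query, candidates)
print(order.tolist())  # [1, 2, 0]
```

With the notebook's objects, `query` would be `generator.encode(topic)` and `candidates` would be `generator.encode(video_titles)`.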

## 7. Save the Model as .pkl File

In [None]:
# Save the embedding generator
output_path = 'embeddings.pkl'

with open(output_path, 'wb') as f:
    pickle.dump(generator, f)

print(f"\n✅ Model saved to: {output_path}")
print(f"File size: {os.path.getsize(output_path) / (1024*1024):.2f} MB")

# Test loading
with open(output_path, 'rb') as f:
    loaded_generator = pickle.load(f)

# Verify loaded model works
test_score = loaded_generator.similarity(
    "Machine Learning Basics",
    "Introduction to ML Tutorial"
)

print("\n✅ Model loaded successfully!")
print(f"Test similarity score: {test_score:.4f}")
print("\n📥 Download this file and place it in: ml_models/nlp/embeddings.pkl")
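
Pickle stores a class instance by reference to its module and class name, so whatever loads `embeddings.pkl` needs `EmbeddingGenerator` to be importable under the same name it had when saved (`__main__` when defined in a notebook). If the consumer only needs vectors for a known set of texts, one alternative is to pickle just the cache dictionary, which depends only on NumPy. This is a sketch under that assumption, not a description of what the backend expects; the file name `embedding_cache.pkl` and the toy vectors are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np

# Hypothetical cache contents; in the notebook this would be generator.cache.
cache = {
    "Machine Learning Basics": np.array([0.1, 0.2, 0.3], dtype=np.float32),
    "Python Programming Basics": np.array([0.3, 0.2, 0.1], dtype=np.float32),
}

# Write to a temporary directory so the example does not touch embeddings.pkl.
path = os.path.join(tempfile.mkdtemp(), "embedding_cache.pkl")
with open(path, "wb") as f:
    pickle.dump(cache, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored dict maps the same texts to the same vectors.
print(sorted(restored) == sorted(cache))  # True
```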

## 8. Performance Benchmark

In [None]:
import time

# Benchmark encoding speed
test_texts = [f"Sample text number {i}" for i in range(100)]

start = time.perf_counter()  # monotonic, high-resolution timer for benchmarking
embeddings = loaded_generator.encode(test_texts, use_cache=False)
elapsed = time.perf_counter() - start

print("\n⚡ Performance Benchmark:")
print(f"Encoded {len(test_texts)} texts in {elapsed:.2f} seconds")
print(f"Speed: {len(test_texts)/elapsed:.1f} texts/second")
print(f"Average: {elapsed/len(test_texts)*1000:.1f} ms per text")
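
A single timed run can be skewed by one-off costs such as lazy initialization on the first call. The sketch below shows one way to get a steadier number: a warm-up call followed by the best of several timed runs. `benchmark` and `fake_encode` are hypothetical names; `fake_encode` replaces the model so the example runs anywhere.

```python
import time


def benchmark(fn, items, repeats=3):
    """Return the best wall-clock time (seconds) of `fn(items)` over several runs."""
    fn(items)  # warm-up run, excluded from timing
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(items)
        timings.append(time.perf_counter() - start)
    return min(timings)


def fake_encode(texts):
    """Hypothetical stand-in for the model's encode call."""
    return [len(t) for t in texts]


texts = [f"Sample text number {i}" for i in range(100)]
best = benchmark(fake_encode, texts)
print(f"Best of 3: {best * 1000:.3f} ms for {len(texts)} texts")
```

For the real generator, the call would look like `benchmark(lambda t: loaded_generator.encode(t, use_cache=False), test_texts)`.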

## Next Steps

1. ✅ Download `embeddings.pkl` from Kaggle
2. 📁 Place it in: `c:\Users\Acer\Documents\GitHub\AutoYT-Playlist\ml_models\nlp\embeddings.pkl`
3. 🚀 The backend will use this for fast relevance scoring!

---

**Model Info:**
- Model: `all-MiniLM-L6-v2`
- Embedding Size: 384 dimensions
- Speed: ~100-200 texts/second on CPU (approximate; varies with hardware and batch size)
- Use Case: Semantic similarity for video relevance scoring