# Embedding Experiment

**Goal:** Learn how to create embeddings with the OpenAI API.

**What we'll do:**
1. Connect to the OpenAI API
2. Create an embedding for a single text
3. Compare embeddings of similar vs different texts
4. Batch embed multiple texts
5. Embed a few real chunks from our PDFs
6. Estimate the cost of embedding every chunk

## Setup: Import Libraries & Load API Key

In [None]:
from openai import OpenAI
from dotenv import load_dotenv
import os
import numpy as np

# Load environment variables from .env file
load_dotenv()

# Initialize OpenAI client
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

print("✅ OpenAI client initialized")
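
`os.getenv` returns `None` when the variable is missing, and the problem may only surface later as an authentication error. A minimal sketch of a fail-fast check, assuming a hypothetical helper named `require_env` (not part of the OpenAI SDK or python-dotenv):

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it in your shell"
        )
    return value
```

With this in place, the client could be created as `OpenAI(api_key=require_env("OPENAI_API_KEY"))`, so a missing key fails in the setup cell rather than at the first API call.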

## Step 1: Create Embedding for Single Text

Let's create an embedding for a simple automotive text.

In [None]:
# Sample text about CAN protocol
text = "CAN protocol is used in automotive networks for communication between ECUs"

# Create embedding
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=text
)

# Extract embedding vector
embedding = response.data[0].embedding

print(f"Text: {text}")
print(f"\nEmbedding dimensions: {len(embedding)}")
print(f"First 10 values: {embedding[:10]}")
print(f"\nEmbedding type: {type(embedding)}")

## Step 2: Compare Similar vs Different Texts

**Cosine Similarity:**
- Measures how closely two vectors point in the same direction, regardless of their length
- Range: -1 to 1
- 1 = same direction (most similar), 0 = orthogonal (unrelated), -1 = opposite direction

**Formula:**
```
similarity = (A · B) / (||A|| * ||B||)
```

In [None]:
def cosine_similarity(vec1, vec2):
    """Calculate cosine similarity between two vectors."""
    vec1 = np.array(vec1)
    vec2 = np.array(vec2)
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Test texts
text1 = "CAN protocol is used in automotive networks"
text2 = "Controller Area Network enables vehicle communication"  # Similar!
text3 = "Apple pie recipe with cinnamon"  # Different!

# Create embeddings
emb1 = client.embeddings.create(model="text-embedding-3-small", input=text1).data[0].embedding
emb2 = client.embeddings.create(model="text-embedding-3-small", input=text2).data[0].embedding
emb3 = client.embeddings.create(model="text-embedding-3-small", input=text3).data[0].embedding

# Calculate similarities
sim_1_2 = cosine_similarity(emb1, emb2)
sim_1_3 = cosine_similarity(emb1, emb3)

print("Text 1:", text1)
print("Text 2:", text2)
print("Text 3:", text3)
print("\n" + "="*60)
print(f"Similarity (Text 1 ↔ Text 2): {sim_1_2:.4f}  ← High! (similar topics)")
print(f"Similarity (Text 1 ↔ Text 3): {sim_1_3:.4f}  ← Low! (different topics)")
print("="*60)
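
To see where the three landmark values (1, 0, -1) come from, it can help to work the formula by hand on tiny vectors. The sketch below uses made-up 2-D vectors rather than real embeddings, and a pure-Python `cosine` helper (a hypothetical name) so every term of the formula is visible:

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors, written out term by term."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 2-D vectors (made up for illustration, not real embeddings)
east = [1.0, 0.0]
far_east = [3.0, 0.0]  # same direction, different length
north = [0.0, 1.0]     # perpendicular
west = [-2.0, 0.0]     # opposite direction

print(cosine(east, far_east))  # 1.0
print(cosine(east, north))     # 0.0
print(cosine(east, west))      # -1.0
```

Note that `far_east` is three times longer than `east` yet still scores 1.0: only direction matters, not length.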

## Step 3: Batch Embeddings

**Why batch?**
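- One API call instead of many, so fewer network round trips
- Faster overall, and fewer requests counted against rate limits
- Token cost is unchanged: pricing is per token, not per request

**Limit:** Max 2048 texts per request; each text must also fit within the model's per-input token limit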

In [None]:
# Multiple texts
texts = [
    "CAN bus uses twisted pair cables",
    "OBD-II diagnostic connector in vehicles",
    "Infotainment system user interface",
    "Electronic Control Unit programming",
    "Vehicle network architecture"
]

# Batch embed
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=texts  # Pass list of texts
)

# Extract all embeddings
embeddings = [item.embedding for item in response.data]

print(f"Embedded {len(texts)} texts in a single API call")
print(f"Each embedding has {len(embeddings[0])} dimensions")
print("\nTexts:")
for i, t in enumerate(texts, start=1):  # `t` avoids overwriting `text` from Step 1
    print(f"  {i}. {t}")
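
When a list is longer than the per-request limit, it has to be split before sending. A minimal sketch of that splitting step, assuming a hypothetical helper named `batched` and the 2048-input limit quoted above:

```python
from typing import Iterator, List

MAX_INPUTS_PER_REQUEST = 2048  # limit quoted above


def batched(texts: List[str], batch_size: int = MAX_INPUTS_PER_REQUEST) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


# 5000 placeholder strings split into requests of at most 2048 inputs
sizes = [len(batch) for batch in batched([f"text {i}" for i in range(5000)])]
print(sizes)  # [2048, 2048, 904]
```

Each yielded batch could then be passed as `input=batch` to `client.embeddings.create`, with the resulting vectors appended in order.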

## Step 4: Real Chunk Embedding Test

Let's embed a few real chunks from our PDFs.

In [None]:
# Load a few chunks from our PDF loader
import sys
sys.path.append('..')
from src.pdf_loader import load_pdfs_from_directory

# Load chunks (limit to first 5 for testing)
print("Loading PDF chunks...")
chunks = load_pdfs_from_directory("../data/automotive")
test_chunks = chunks[:5]

print(f"\nEmbedding {len(test_chunks)} chunks...")

# Extract text from chunks
chunk_texts = [chunk.page_content for chunk in test_chunks]

# Create embeddings
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=chunk_texts
)

chunk_embeddings = [item.embedding for item in response.data]

print(f"\n✅ Successfully embedded {len(chunk_embeddings)} chunks")
print(f"✅ Each embedding: {len(chunk_embeddings[0])} dimensions")

# Show first chunk
print("\n" + "="*60)
print("First Chunk Preview:")
print("="*60)
print(test_chunks[0].page_content[:200] + "...")
print(f"\nEmbedding (first 10 values): {chunk_embeddings[0][:10]}")
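
One way to sanity-check chunk embeddings is to rank them against a query vector and confirm the most relevant chunk comes first. The sketch below shows only the ranking step, using stand-in 3-D vectors; `rank_by_similarity`, `toy_chunks`, and `toy_query` are hypothetical names, and with real data the inputs would be `chunk_embeddings` and an embedded question.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def rank_by_similarity(query_vec, vectors):
    """Return (index, score) pairs sorted from most to least similar to query_vec."""
    scores = [(i, cosine(query_vec, v)) for i, v in enumerate(vectors)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Stand-in vectors (made up); real ones would come from the embeddings API
toy_chunks = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
toy_query = [1.0, 0.0, 0.0]

for index, score in rank_by_similarity(toy_query, toy_chunks):
    print(index, round(score, 3))
```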

## Step 5: Cost Estimation

Let's estimate the cost for embedding all 635 chunks.

In [None]:
import tiktoken

# Initialize tokenizer
encoding = tiktoken.get_encoding("cl100k_base")  # Encoding used by text-embedding-3-small

# Count tokens in all chunks
total_tokens = 0
for chunk in chunks:
    tokens = encoding.encode(chunk.page_content)
    total_tokens += len(tokens)

# Cost calculation
# text-embedding-3-small: $0.00002 per 1K tokens ($0.02 per 1M); check current pricing before relying on this
cost_per_1k = 0.00002
total_cost = (total_tokens / 1000) * cost_per_1k

print("="*60)
print("Cost Estimation for Full Embedding")
print("="*60)
print(f"Total chunks: {len(chunks)}")
print(f"Total tokens: {total_tokens:,}")
print("Model: text-embedding-3-small")
print(f"Cost per 1K tokens: ${cost_per_1k:.5f}")
print(f"\n💰 Estimated cost: ${total_cost:.4f}")
print("="*60)
print("\n✅ Very affordable!")
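
The arithmetic is easier to reuse as a small function. The sketch below assumes a hypothetical helper named `estimate_embedding_cost` and expresses the same rate per million tokens ($0.00002 per 1K equals $0.02 per 1M). The token counts are made-up corpus sizes, not measurements from our PDFs.

```python
def estimate_embedding_cost(total_tokens: int, usd_per_million_tokens: float = 0.02) -> float:
    """Estimated cost in USD for embedding `total_tokens` tokens at a flat per-token price."""
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical corpus sizes, chosen only to show the scale of the numbers
for tokens in (100_000, 500_000, 5_000_000):
    print(f"{tokens:>9,} tokens -> ${estimate_embedding_cost(tokens):.4f}")
```

Passing the price in as a parameter keeps the estimate easy to update if the rate changes or a different model is used.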

## Summary

**What you learned:**
1. How to create embeddings with the OpenAI API
2. How to measure similarity with cosine similarity
3. How to batch-embed multiple texts in a single API call
4. That real PDF chunks embed with the same batch call
5. That the cost is very low (~$0.01 for 635 chunks)

**Next step:** Create `src/embeddings.py` with reusable functions!
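
As a starting point, such a module might look something like the sketch below. The names `embed_texts`, `embed_text`, and `DEFAULT_MODEL` are hypothetical. The client is passed in as a parameter, so the functions work with the `OpenAI` client from the setup cell or with a stand-in object in tests.

```python
from typing import Any, List, Sequence

DEFAULT_MODEL = "text-embedding-3-small"


def embed_texts(client: Any, texts: Sequence[str], model: str = DEFAULT_MODEL) -> List[List[float]]:
    """Embed `texts` in one request and return vectors in the same order as the inputs.

    `client` is expected to be an `openai.OpenAI` instance (or anything exposing
    the same `embeddings.create(model=..., input=...)` call).
    """
    if not texts:
        return []
    response = client.embeddings.create(model=model, input=list(texts))
    # Each item carries the position of its input; sort on it to make the ordering explicit
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]


def embed_text(client: Any, text: str, model: str = DEFAULT_MODEL) -> List[float]:
    """Embed a single string by delegating to embed_texts."""
    return embed_texts(client, [text], model=model)[0]
```

This sketch sends everything in a single request, so lists longer than the per-request limit would need to be split by the caller first.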