## 🤔 What are Embeddings?

**Embeddings** convert text into a list of numbers (called a **vector**) that represents the text's **meaning**.

```
"I love dogs"     →  [0.2, -0.5, 0.8, 0.1, ...]  (384 numbers)
"I adore puppies" →  [0.3, -0.4, 0.7, 0.2, ...]  (similar numbers!)
"I hate rain"     →  [-0.8, 0.3, -0.2, 0.9, ...] (different numbers)
```

**Key insight:** Similar meanings → Similar numbers!
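
To make this concrete, here is a minimal sketch that scores the toy vectors above with cosine similarity (covered in Step 4). The four-number vectors are the truncated illustrations from the block above, not real model output, and `cosine` is a hypothetical helper name.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length lists of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Illustrative 4-number stand-ins (real embeddings here have 384 numbers)
love_dogs     = [0.2, -0.5, 0.8, 0.1]
adore_puppies = [0.3, -0.4, 0.7, 0.2]
hate_rain     = [-0.8, 0.3, -0.2, 0.9]

sim_close = cosine(love_dogs, adore_puppies)
sim_far = cosine(love_dogs, hate_rain)
print(f"dogs vs puppies: {sim_close:.2f}")
print(f"dogs vs rain:    {sim_far:.2f}")
```

The first pair scores close to 1, while the second scores below 0, which matches the intuition that the first two sentences mean nearly the same thing.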

## Why Do We Need Embeddings?

**Problem:** Computers can't compare the *meaning* of text directly.

```
❌ Computer can't do: "qualitative research" ≈ "interview methods"
✅ Computer can do:   [0.2, 0.5, ...] ≈ [0.3, 0.4, ...]  (number comparison!)
```

**Solution:** Convert text to numbers, then compare numbers!

---

## Step 1: Set Up (Load chunks from previous notebooks)

In [None]:
# Load documents and create chunks (from Notebooks 1 & 2)
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
import os

# Set up paths
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
data_folder = os.path.join(project_root, 'data')

# Load PDFs
print("📚 Loading documents...")
all_pages = []
for pdf_name in os.listdir(data_folder):
    if pdf_name.endswith('.pdf'):
        loader = PyPDFLoader(os.path.join(data_folder, pdf_name))
        all_pages.extend(loader.load())

# Split into chunks
print("✂️ Creating chunks...")
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_documents(all_pages)

print(f"\n✅ Ready! {len(chunks)} chunks to embed.")

---

## Step 2: Load the Embedding Model

In [None]:
# Import the embedding model
from langchain_community.embeddings import HuggingFaceEmbeddings

print("🔢 Loading embedding model...")
print("   Model: all-MiniLM-L6-v2")
print("   (This is a free, local model - no API key needed!)\n")

embeddings = HuggingFaceEmbeddings(
    model_name="all-MiniLM-L6-v2",
    model_kwargs={'device': 'cpu'}
)

print("✅ Embedding model loaded!")

### 💡 About the Model

**all-MiniLM-L6-v2** is:
- Free and open source
- Runs locally (no internet needed after download)
- Fast and efficient
- Creates 384-dimensional vectors

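**Other options:**
- `text-embedding-ada-002` (OpenAI) - Often better quality with 1536-dimensional vectors, but it is a hosted API that needs an API key and costs money
- `all-mpnet-base-v2` - Higher quality with 768-dimensional vectors, but slower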
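
Vector size also affects storage. The sketch below estimates the raw size of an embedding table, assuming each number is stored as a 4-byte float32 (a common choice in vector stores, but an assumption here). The dimension counts are the output sizes of each model; `embedding_table_bytes` and the 10,000-chunk corpus are hypothetical.

```python
def embedding_table_bytes(n_vectors, dims, bytes_per_value=4):
    """Raw storage for n_vectors embeddings, ignoring index and metadata overhead."""
    return n_vectors * dims * bytes_per_value

MODEL_DIMS = {
    "all-MiniLM-L6-v2": 384,
    "all-mpnet-base-v2": 768,
    "text-embedding-ada-002": 1536,
}

n_chunks = 10_000  # hypothetical corpus size
for model, dims in MODEL_DIMS.items():
    size_mb = embedding_table_bytes(n_chunks, dims) / 1_000_000
    print(f"{model:<24} {dims:>5} dims  ~{size_mb:.1f} MB")
```

Doubling the dimension count doubles the storage, which is one reason a smaller model is a reasonable default for a first project.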

---

## Step 3: Create Your First Embedding

In [None]:
# Create an embedding for a simple text
sample_text = "qualitative research methods"

print(f"📝 Text: '{sample_text}'")
print("\n🔄 Converting to embedding...\n")

# Create the embedding
vector = embeddings.embed_query(sample_text)

print("✅ Created embedding!")
print("\n📊 Vector details:")
print(f"   Length: {len(vector)} numbers")
print(f"   First 5 numbers: {[round(v, 4) for v in vector[:5]]}")
print(f"   Last 5 numbers:  {[round(v, 4) for v in vector[-5:]]}")

### 💡 What Just Happened?

```
"qualitative research methods"
              ↓
        Embedding Model
              ↓
[0.023, -0.051, 0.089, ..., 0.012]  ← 384 numbers!
```

These 384 numbers encode the **meaning** of the text in a way computers can process.

---

## Step 4: Understanding Similarity

In [None]:
import numpy as np

def cosine_similarity(vec1, vec2):
    """Calculate how similar two vectors are (-1 to 1; higher means more similar)."""
    dot_product = np.dot(vec1, vec2)
    norm1 = np.linalg.norm(vec1)
    norm2 = np.linalg.norm(vec2)
    return dot_product / (norm1 * norm2)

# Test with similar and different texts
texts = [
    "qualitative research methods",
    "interview techniques in research",  # Similar meaning
    "data analysis approaches",           # Somewhat similar
    "the weather is nice today"           # Very different
]

# Create embeddings for all texts
print("🔍 Comparing text similarity:\n")
print(f"📝 Base text: '{texts[0]}'\n")

base_vector = embeddings.embed_query(texts[0])

for text in texts[1:]:
    vector = embeddings.embed_query(text)
    similarity = cosine_similarity(base_vector, vector)

    # Visual bar (clamped to 0-20 because cosine similarity can be negative)
    bar_length = max(0, min(20, int(similarity * 20)))
    bar = "█" * bar_length + "░" * (20 - bar_length)

    print(f"   '{text}'")
    print(f"   Similarity: {bar} {similarity:.2%}\n")

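### 💡 Understanding Cosine Similarity

**Cosine similarity** measures how "aligned" two vectors are. It ranges from -1 (opposite directions) to 1 (same direction); for text embeddings, most scores fall between 0 and 1. The bands below are rough rules of thumb, and the exact cutoffs vary by model: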

| Score | Meaning |
|-------|----------|
| 1.0 (100%) | Identical meaning |
| 0.8+ | Very similar |
| 0.5-0.8 | Somewhat related |
| < 0.5 | Different topics |

This is how we find relevant chunks for a question!
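
A few edge cases help build intuition for the score. The sketch below uses hand-made 2-D vectors (not model output) and a pure-Python `cosine` helper, a hypothetical stand-in for the `cosine_similarity` function defined above, to show that the score depends on direction only, not on length.

```python
import math

def cosine(a, b):
    """Cosine similarity for 2-D vectors given as [x, y] lists."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

v = [3.0, 4.0]

same_direction = cosine(v, [6.0, 8.0])    # v scaled by 2
perpendicular  = cosine(v, [-4.0, 3.0])   # v rotated 90 degrees
opposite       = cosine(v, [-3.0, -4.0])  # v negated

print(f"same direction: {same_direction:+.2f}")
print(f"perpendicular:  {perpendicular:+.2f}")
print(f"opposite:       {opposite:+.2f}")
```

Scaling a vector leaves the score at 1, a right angle gives 0, and flipping the vector gives -1.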

---

## Step 5: Embed a Document Chunk

In [None]:
# Embed one of our actual chunks
sample_chunk = chunks[0]

print("📄 Sample Chunk:")
print("="*60)
print(sample_chunk.page_content[:300] + "...")
print("="*60)

# Create embedding
chunk_vector = embeddings.embed_query(sample_chunk.page_content)

print("\n✅ Chunk embedded!")
print(f"   Vector length: {len(chunk_vector)} numbers")

---

## Step 6: Batch Embedding (Multiple Texts at Once)

In [None]:
# Embed multiple texts efficiently
print("🔢 Embedding multiple chunks at once...\n")

# Take first 5 chunks as example
sample_texts = [chunk.page_content for chunk in chunks[:5]]

# Embed all at once (more efficient than one by one)
vectors = embeddings.embed_documents(sample_texts)

print(f"✅ Created {len(vectors)} embeddings!")
print(f"\n📊 Each vector has {len(vectors[0])} dimensions")

# Show the shape
print(f"\n🔍 Result shape: {len(vectors)} texts × {len(vectors[0])} dimensions")
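
Because `embed_documents` returns a list of equal-length lists, the result converts directly into a 2-D NumPy array. As a rough sketch, the snippet below uses small made-up vectors in place of the real `vectors` (so it runs without the model) and computes every pairwise cosine similarity in one matrix product.

```python
import numpy as np

# Made-up stand-in for `vectors` (3 texts x 4 dimensions instead of 5 x 384)
toy_vectors = [
    [0.2, -0.5, 0.8, 0.1],
    [0.3, -0.4, 0.7, 0.2],
    [-0.8, 0.3, -0.2, 0.9],
]

matrix = np.array(toy_vectors)
print(matrix.shape)

# Scale each row to unit length so a dot product equals cosine similarity
unit_rows = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
pairwise = unit_rows @ unit_rows.T  # entry [i, j] compares text i with text j

print(np.round(pairwise, 2))
```

The diagonal is all ones because every text is identical to itself, and the matrix is symmetric because comparing text i with text j gives the same score as comparing j with i.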

### 💡 embed_query vs embed_documents

| Method | Use For | Example |
|--------|---------|----------|
| `embed_query(text)` | Single text (user's question) | "What is coding?" |
| `embed_documents(list)` | Multiple texts (batch) | All your chunks |

`embed_documents` is more efficient for many texts because it sends them through the model in batches instead of making one call per text.
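
For a large corpus it can help to feed `embed_documents` in fixed-size groups, for example to report progress. The following is only a sketch: `batched` and `fake_embed_documents` are hypothetical helpers, and the fake embedder stands in for `embeddings.embed_documents` so the snippet runs without the model. The underlying library may already batch internally, so treat this as a convenience rather than a required optimization.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def fake_embed_documents(texts):
    """Stand-in embedder: returns one tiny made-up vector per text."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]

texts = [f"chunk number {i}" for i in range(7)]

all_vectors = []
for batch in batched(texts, batch_size=3):
    # In the notebook this call would be embeddings.embed_documents(batch)
    all_vectors.extend(fake_embed_documents(batch))
    print(f"Embedded {len(all_vectors)}/{len(texts)} texts")
```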

---

## 🧪 Experiment: Find Similar Chunks

In [None]:
# Let's manually find the most similar chunk to a question
question = "What is qualitative data coding?"

print(f"❓ Question: '{question}'\n")

# Embed the question
question_vector = embeddings.embed_query(question)

# Embed the first 20 chunks in one batch and score each against the question
print("🔍 Searching through chunks...\n")

candidates = chunks[:20]
chunk_vectors = embeddings.embed_documents([c.page_content for c in candidates])

similarities = []
for i, (chunk, chunk_vector) in enumerate(zip(candidates, chunk_vectors)):
    sim = cosine_similarity(question_vector, chunk_vector)
    similarities.append((i, sim, chunk))

# Sort by similarity
similarities.sort(key=lambda x: x[1], reverse=True)

# Show top 3
print("📊 Top 3 Most Similar Chunks:")
print("="*60)

for i, (idx, sim, chunk) in enumerate(similarities[:3], 1):
    print(f"\n#{i} Similarity: {sim:.2%}")
    print(f"   {chunk.page_content[:150]}...")

### 💡 This is Exactly How RAG Works!

```
1. User asks: "What is qualitative data coding?"
                    ↓
2. Convert question to vector
                    ↓
3. Compare with all chunk vectors
                    ↓
4. Return most similar chunks
                    ↓
5. Send chunks to AI for answer
```
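
Steps 2 to 4 can be sketched as a single function. This is a minimal illustration, not the retrieval code a vector store uses: `top_k_chunks` is a hypothetical helper, and the toy vectors stand in for real question and chunk embeddings.

```python
import numpy as np

def top_k_chunks(question_vector, chunk_vectors, k=3):
    """Return (index, cosine similarity) for the k most similar chunk vectors."""
    q = np.asarray(question_vector, dtype=float)
    m = np.asarray(chunk_vectors, dtype=float)
    scores = (m @ q) / (np.linalg.norm(m, axis=1) * np.linalg.norm(q))
    best = np.argsort(scores)[::-1][:k]  # highest score first
    return [(int(i), float(scores[i])) for i in best]

# Toy 3-D vectors in place of real 384-D embeddings
toy_question = [1.0, 0.0, 0.0]
toy_chunks = [
    [0.0, 1.0, 0.0],   # unrelated direction
    [0.9, 0.1, 0.0],   # close to the question
    [0.5, 0.5, 0.0],   # partly related
    [-1.0, 0.0, 0.0],  # opposite direction
]

for idx, score in top_k_chunks(toy_question, toy_chunks, k=2):
    print(f"chunk {idx}: {score:.2f}")
```

With real data, the same function would take `question_vector` and the list returned by `embeddings.embed_documents(...)`. Scoring all chunks in one matrix operation avoids the Python loop, though it still compares the question against every chunk.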

In the next notebook, we'll store these vectors in a database for fast searching!

---

## ✅ Summary

In this notebook, you learned:

1. **What embeddings are** - Converting text to meaningful numbers
2. **Why we need them** - Computers can compare numbers, not text
3. **HuggingFace embeddings** - Free, local embedding model
4. **Cosine similarity** - Measuring how similar two texts are
5. **Finding similar chunks** - The core of RAG retrieval

## ➡️ Next Step

In **Notebook 4: Vector Store**, you'll learn how to store these embeddings in ChromaDB for fast, efficient searching.

---

**Key Components:**
- `embeddings` - The HuggingFace embedding model
- `chunks` - Your document chunks ready to be stored