# LangChain Embeddings: Text to Vectors

## Overview
This notebook introduces embeddings - the foundation of semantic search and advanced RAG systems. You'll learn how to convert text into numerical vectors that capture semantic meaning.

## Learning Objectives
- Understand what embeddings are and why they're important
- Convert text into numerical vector representations
- Work with both single queries and document collections
- Learn the relationship between embeddings and similarity search
- Prepare for advanced RAG with vector databases

## Key Concepts
- **Embeddings**: Numerical vector representations of text
- **Semantic Similarity**: Texts with similar meaning map to vectors that lie close together in vector space
- **Vector Dimensions**: The number of values in each embedding vector
- **Batch Processing**: Handling multiple documents efficiently

## What Are Embeddings?

Embeddings convert text into numerical vectors that capture semantic meaning. Think of them as "coordinates" in a high-dimensional space where similar concepts are located near each other.

### Real-World Analogy:
- **GPS Coordinates**: [40.7128, -74.0060] represents New York City
- **Text Embeddings**: [0.2, -0.5, 0.8, ...] represents "artificial intelligence"
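
To make the "coordinates" analogy concrete, here is a minimal sketch that measures how far apart a few hand-made 3-dimensional vectors are. The vector values are invented for illustration and are not the output of any embedding model; real embeddings have hundreds or thousands of dimensions.

```python
import math

# Hand-made 3-dimensional "embeddings" (illustrative values, not model output).
toy_vectors = {
    "artificial intelligence": [0.2, -0.5, 0.8],
    "machine learning": [0.25, -0.4, 0.85],
    "cooking recipes": [-0.7, 0.6, 0.1],
}

anchor = toy_vectors["artificial intelligence"]

# math.dist (Python 3.8+) returns the Euclidean distance between two points.
distances = {text: math.dist(anchor, vector) for text, vector in toy_vectors.items()}

for text, distance in sorted(distances.items(), key=lambda item: item[1]):
    print(f"{text!r}: distance {distance:.3f}")
```

In this toy space, "machine learning" sits much closer to "artificial intelligence" than "cooking recipes" does, which is the property a trained embedding model aims to produce for real text.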

### Why Embeddings Matter:
- **Semantic Search**: Find content by meaning, not just keywords
- **Similarity Comparison**: Measure how related two texts are
- **RAG Enhancement**: Retrieve the most relevant context automatically
- **Clustering**: Group similar documents together

## Step 1: Environment Setup Options

For embeddings, this notebook shows two approaches:
- **Azure OpenAI** (commented out) - Cloud-based embedding models
- **Simulated embeddings** (active) - A deterministic fake embedding generator for learning

To avoid depending on an embedding model being available in the local LM Studio setup, we'll use a deterministic fake embedding generator that returns the same vector every time for the same text.

In [1]:
# # with Azure OpenAI and LangChain

# from langchain_openai import AzureOpenAIEmbeddings

# # Environment and configuration
# from dotenv import load_dotenv
# import os

# # Load environment variables from .env file
# load_dotenv()

# embeddings = AzureOpenAIEmbeddings(
#     deployment="text-embedding-3-large",          # Deployment name in Azure
#     azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),      # Your Azure endpoint
#     api_version=os.getenv("AZURE_OPENAI_API_VERSION"),      # API version
#     api_key=os.getenv("AZURE_OPENAI_API_KEY"),              # Authentication key
# )

### Option 1: Azure OpenAI Embeddings (Commented Out)
Azure provides high-quality embedding models like `text-embedding-3-large` with 3072 dimensions for production use.

### Option 2: Deterministic Fake Embeddings (Active)
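For learning purposes, we use simulated embeddings that generate consistent 4096-dimensional vectors. This lets us practice the embedding API without an external embedding service. Keep in mind that the simulated vectors are pseudo-random: they are useful for checking shapes and wiring, but distances between them do not reflect meaning.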

In [47]:
# Simulated embeddings with LangChain (no embedding service required)

from langchain_core.embeddings import DeterministicFakeEmbedding

embeddings = DeterministicFakeEmbedding(size=4096)
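
To build intuition for what "deterministic fake" means, the following stdlib-only sketch seeds a random generator from a hash of the input text. The function `toy_deterministic_embedding` is a hypothetical stand-in written for this notebook; it is not LangChain's implementation, which may differ in its hashing and sampling details.

```python
import hashlib
import random

def toy_deterministic_embedding(text, size=8):
    """Hypothetical stand-in: derive a pseudo-random vector from a hash of the text."""
    # The same text always hashes to the same seed, so the vector is reproducible.
    seed = int(hashlib.sha256(text.encode("utf-8")).hexdigest(), 16)
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]

first = toy_deterministic_embedding("vector databases")
second = toy_deterministic_embedding("vector databases")
other = toy_deterministic_embedding("vector database")  # one character shorter

print("Same text gives same vector:", first == second)
print("Near-identical text gives same vector:", first == other)
```

The sketch illustrates the trade-off: the output is repeatable, which suits tests and tutorials, but two nearly identical texts receive unrelated vectors, so no semantic similarity is preserved.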

## Step 2: Creating Single Query Embeddings

The `embed_query()` method converts a single text string into a vector representation. This is typically used for:
- **Search queries**: Converting user questions into vectors
- **Single document processing**: Getting vectors for individual texts
- **Similarity comparisons**: Measuring closeness between texts

In [48]:
sample_text = "This is a small text document to be embedded."
response = embeddings.embed_query(sample_text)
dimensions = len(response)
print(f"📏 Total dimensions in embedding: {dimensions}")

📏 Total dimensions in embedding: 4096


## Step 3: Batch Document Embeddings

The `embed_documents()` method efficiently processes multiple texts at once. This is essential for:
- **Document databases**: Converting entire document collections
- **Batch processing**: Handling large amounts of text efficiently
- **Vector databases**: Preparing documents for similarity search systems

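### Efficiency Benefits:
- **Parallel Processing**: Embedding backends can process a batch of texts together rather than one at a time
- **API Optimization**: With hosted providers, many texts can be sent per request, so large datasets need fewer API calls
- **Resource Utilization**: Per-request overhead is shared across the whole batch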

In [53]:
documents = [
    "This is a small text document to be embedded.",
    "This is another small text document to be embedded.",
    "This is yet another small text document to be embedded.",
    "This is the last small text document to be embedded."
]

# Generate one embedding per document (results come back in input order)
document_embeddings = embeddings.embed_documents(documents)
print(f"📄 Total documents: {len(documents)}")
print(f"📄 Dimensions per document embedding: {len(document_embeddings[0])}")

📄 Total documents: 4
📄 Dimensions per document embedding: 4096
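
For larger collections, texts are often sent in fixed-size batches. The sketch below shows one way to do that under stated assumptions: `embed_in_batches`, `batched`, and `counting_embed` are hypothetical helpers, and the embedding function is a trivial stand-in that returns two numbers per text so the example runs without any service.

```python
from typing import Callable, Iterator, List

def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(
    texts: List[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 2,
) -> List[List[float]]:
    """Call embed_fn once per batch and concatenate the results in input order."""
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors

calls: List[int] = []  # records how many texts each call received

def counting_embed(texts: List[str]) -> List[List[float]]:
    """Hypothetical stand-in embedder: [character count, space count] per text."""
    calls.append(len(texts))
    return [[float(len(t)), float(t.count(" "))] for t in texts]

docs = ["first document", "second document", "third document", "fourth document", "fifth"]
vectors = embed_in_batches(docs, counting_embed, batch_size=2)
print(f"{len(vectors)} vectors from {len(calls)} calls with batch sizes {calls}")
```

Any callable that maps a list of strings to a list of vectors can be passed as `embed_fn`, including the `embed_documents` method of a LangChain embeddings object.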


## Understanding Vector Dimensions

### What Are Dimensions?
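Each number in an embedding vector is one coordinate of the text's position in the embedding space. Individual coordinates are learned by the model and are usually not interpretable on their own; the meaning is carried by the vector as a whole. More dimensions generally mean:
- **Higher Precision**: Room for more nuanced distinctions in meaning
- **Better Similarity**: Potentially more accurate comparisons between texts
- **Increased Complexity**: More storage and computation per vector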

### Common Dimension Sizes:
- **384 dimensions**: Lightweight sentence-transformers models such as `all-MiniLM-L6-v2`
- **1536 dimensions**: OpenAI text-embedding-ada-002
- **3072 dimensions**: OpenAI text-embedding-3-large
- **4096 dimensions**: Our simulation (learning purposes)
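
The cost side of this trade-off is easy to estimate. The sketch below computes the raw storage needed for a collection of vectors, assuming each value is stored as a 4-byte float32 and ignoring any index overhead a vector database would add. The helper `index_size_mib` and the collection size of 100,000 are illustrative choices.

```python
BYTES_PER_FLOAT32 = 4  # assumption: values stored as 32-bit floats

def index_size_mib(num_vectors, dimensions, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw storage in MiB for num_vectors embeddings of the given dimension."""
    return num_vectors * dimensions * bytes_per_value / (1024 ** 2)

for dims in (384, 1536, 3072, 4096):
    size = index_size_mib(100_000, dims)
    print(f"{dims:>5} dims: {size:8.1f} MiB for 100,000 vectors")
```

Storage grows linearly with the number of dimensions, so a 3072-dimensional model needs eight times the space of a 384-dimensional one for the same collection.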

## Summary

You've covered the fundamentals of embeddings:

✅ **Embedding Concepts**: Understood how text becomes numerical vectors  
✅ **Single Query Processing**: Converted individual texts to embeddings  
✅ **Batch Processing**: Efficiently handled multiple documents  
✅ **Vector Dimensions**: Learned about embedding size and complexity  
✅ **Practical Applications**: Prepared for similarity search and RAG  

## Key Concepts Learned

- **Embeddings**: Numerical representations that capture semantic meaning
- **embed_query()**: Method for single text-to-vector conversion
- **embed_documents()**: Efficient batch processing for multiple texts
- **Vector Dimensions**: The size and complexity of embedding representations
- **Deterministic Fake Embeddings**: Consistent vectors for learning and testing

## How Embeddings Enable Advanced AI

### Semantic Search:
```
Query: "artificial intelligence"
Similar Documents: "machine learning", "neural networks", "AI"
Different Documents: "cooking recipes", "travel guides"
```
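
The search pattern above can be sketched as a top-k ranking by cosine similarity. The 3-dimensional vectors here are hand-made so that related topics point in similar directions; they are illustrative values, not model output, and `top_k` is a hypothetical helper rather than a LangChain function.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, corpus, k=3):
    """Return the k (score, text) pairs with the highest cosine similarity."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in corpus.items()]
    scored.sort(reverse=True)
    return scored[:k]

# Toy corpus: hand-made vectors standing in for real document embeddings.
corpus = {
    "machine learning": [0.9, 0.1, 0.0],
    "neural networks": [0.8, 0.3, 0.1],
    "AI": [0.95, 0.05, 0.05],
    "cooking recipes": [0.0, 0.9, 0.3],
    "travel guides": [0.1, 0.2, 0.95],
}
query_vector = [1.0, 0.1, 0.0]  # stands in for the query "artificial intelligence"

for score, text in top_k(query_vector, corpus, k=3):
    print(f"{score:.3f}  {text}")
```

A vector database performs the same ranking, but uses index structures so that it does not have to compare the query against every stored vector.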

### Similarity Measurement:
- Embeddings allow measuring how "close" texts are in meaning
- Similar concepts have vectors pointing in similar directions
- Distance calculations reveal semantic relationships
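
One common way to make "pointing in similar directions" precise is cosine similarity. For two embedding vectors $\mathbf{a}$ and $\mathbf{b}$ with $n$ dimensions:

$$
\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
= \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}}
$$

The value ranges from $-1$ (opposite directions) to $1$ (same direction). When both vectors are scaled to unit length, Euclidean distance carries the same information, since

$$
\lVert \mathbf{a} - \mathbf{b} \rVert^2 = 2 - 2\cos(\mathbf{a}, \mathbf{b}),
$$

so ranking neighbors by smallest distance or by largest cosine similarity gives the same order in that case.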

## Real-World Applications

- **Search Engines**: Find relevant content by meaning, not just keywords
- **Recommendation Systems**: Suggest similar products or content
- **Document Clustering**: Group related documents automatically
- **RAG Systems**: Retrieve the most relevant context for AI responses
- **Question Answering**: Match questions to relevant information

## Next Steps

In the next notebook, you'll explore:
- **Vector Databases**: Storing and searching large embedding collections
- **Similarity Search**: Finding the most relevant documents automatically
- **Production RAG**: Building complete retrieval systems with FAISS
- **Web Data Integration**: Processing real websites for RAG systems