# Week 4: Embeddings and RAG Concepts

## Learning Objectives
- Understand what embeddings are and how they work
- Generate embeddings using OpenAI's API
- Calculate semantic similarity between texts
- Build a simple semantic search system
- Understand the foundation for RAG systems

In [None]:
import os
import numpy as np
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

print("✓ Ready to explore embeddings!")

## Part 1: What Are Embeddings?

### Concept
- Embeddings are **numerical representations** of text
- Text → Vector of numbers (e.g., 1536 dimensions for OpenAI's text-embedding-3-small)
- Semantically similar texts have similar vectors
- Unlike keywords, embeddings capture **meaning**

### Why Embeddings?
- Traditional search: keyword matching ("cat" ≠ "feline")
- Embeddings: semantic matching ("cat" ≈ "feline")
- Enable finding related content even with different words
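
To make the contrast concrete, here is a minimal offline sketch. The three-dimensional vectors in `toy_vectors` are invented for illustration (real embeddings come from a model and have far more dimensions), and `keyword_match` and `toy_similarity` are hypothetical helper names, not part of any library.

```python
import numpy as np

# Toy 3-dimensional "embeddings", invented for illustration only.
# Real embeddings are produced by a model, not written by hand.
toy_vectors = {
    "cat":    np.array([0.90, 0.10, 0.00]),
    "feline": np.array([0.85, 0.15, 0.05]),
    "car":    np.array([0.05, 0.10, 0.95]),
}

def keyword_match(a, b):
    """Exact string comparison: the keyword-search view of similarity."""
    return a == b

def toy_similarity(a, b):
    """Cosine similarity between two of the toy vectors."""
    va, vb = toy_vectors[a], toy_vectors[b]
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

print("keyword match cat/feline:", keyword_match("cat", "feline"))
print(f"vector similarity cat/feline: {toy_similarity('cat', 'feline'):.3f}")
print(f"vector similarity cat/car:    {toy_similarity('cat', 'car'):.3f}")
```

The keyword comparison treats "cat" and "feline" as unrelated, while the vector comparison scores them as close and "car" as distant.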

## Part 2: Creating Embeddings

In [None]:
def get_embedding(text, model="text-embedding-3-small"):
    """Get embedding for a piece of text"""
    text = text.replace("\n", " ")  # Replace newlines with spaces
    response = client.embeddings.create(
        input=[text],
        model=model
    )
    return response.data[0].embedding

# Get an embedding
text = "Machine learning is transforming healthcare."
embedding = get_embedding(text)

print(f"Text: {text}")
print(f"Embedding dimensions: {len(embedding)}")
print(f"First 10 values: {embedding[:10]}")
print(f"Embedding type: {type(embedding)}")
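
The embeddings endpoint also accepts a list of strings, so several texts can be embedded in one request. The sketch below is one possible way to wrap that: `get_embeddings_batch` is a hypothetical helper, and `FakeEmbeddings` is an offline stand-in invented here so the example runs without an API key. With the real `client` from the setup cell, you would pass `client` instead of `fake_client`.

```python
from types import SimpleNamespace

def get_embeddings_batch(client, texts, model="text-embedding-3-small"):
    """Embed several texts in one request (sketch; expects an OpenAI-style client)."""
    cleaned = [t.replace("\n", " ") for t in texts]
    response = client.embeddings.create(input=cleaned, model=model)
    # Each returned item carries an `index` into the input list; sort to be safe.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]

class FakeEmbeddings:
    """Offline stand-in for `client.embeddings`, for demonstration only.
    It returns [text length, position] instead of a real embedding."""
    def create(self, input, model):
        data = [
            SimpleNamespace(index=i, embedding=[float(len(text)), float(i)])
            for i, text in enumerate(input)
        ]
        return SimpleNamespace(data=data)

fake_client = SimpleNamespace(embeddings=FakeEmbeddings())
vectors = get_embeddings_batch(fake_client, ["fever", "high\ntemperature"])
print(vectors)
```

Batching reduces the number of network round trips when indexing many documents.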

## Part 3: Measuring Similarity

### Cosine Similarity
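Measures the angle between two vectors, ignoring their lengths (range: -1 to 1)
- 1 = same direction (very similar meaning)
- 0 = orthogonal (unrelated)
- -1 = opposite direction

In practice, scores between real text embeddings tend to fall in a narrower band, so treat them as relative rankings; any fixed cutoff needs tuning on your own data.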
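
The formula is cos(θ) = (a · b) / (‖a‖ ‖b‖). As a quick check of the three reference values, the sketch below applies it to small hand-picked 2-D vectors. `cos_sim` is a hypothetical helper name used only for this illustration.

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = cos_sim([1, 2], [2, 4])    # parallel vectors of different lengths
orthogonal = cos_sim([1, 0], [0, 1])        # perpendicular vectors
opposite = cos_sim([1, 2], [-1, -2])        # same line, reversed direction

print(f"same direction: {same_direction:.2f}")
print(f"orthogonal:     {orthogonal:.2f}")
print(f"opposite:       {opposite:.2f}")
```

Note that `[1, 2]` and `[2, 4]` score as fully similar even though their lengths differ, because only direction matters.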

In [None]:
def cosine_similarity(vec1, vec2):
    """Calculate cosine similarity between two vectors"""
    vec1 = np.array(vec1)
    vec2 = np.array(vec2)
    
    dot_product = np.dot(vec1, vec2)
    magnitude = np.linalg.norm(vec1) * np.linalg.norm(vec2)
    
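    # A zero-length vector has no direction; return 0.0 instead of dividing by zero
    if magnitude == 0:
        return 0.0
    return float(dot_product / magnitude)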

# Compare similar texts
text1 = "The patient has a fever."
text2 = "The person is running a temperature."
text3 = "I enjoy playing basketball."

emb1 = get_embedding(text1)
emb2 = get_embedding(text2)
emb3 = get_embedding(text3)

print(f"Similarity between '{text1}' and '{text2}':")
print(f"  {cosine_similarity(emb1, emb2):.4f}")
print()

print(f"Similarity between '{text1}' and '{text3}':")
print(f"  {cosine_similarity(emb1, emb3):.4f}")

## Part 4: Building a Document Store

In [None]:
class SimpleDocumentStore:
    """Store documents with their embeddings"""
    
    def __init__(self):
        self.documents = []
        self.embeddings = []
    
    def add_document(self, text):
        """Add a document and compute its embedding"""
        embedding = get_embedding(text)
        self.documents.append(text)
        self.embeddings.append(embedding)
        return len(self.documents) - 1  # Return index
    
    def add_documents(self, texts):
        """Add multiple documents"""
        for text in texts:
            self.add_document(text)
        print(f"Added {len(texts)} documents")
    
    def search(self, query, top_k=3):
        """Search for most similar documents"""
        query_embedding = get_embedding(query)
        
        # Calculate similarities
        similarities = [
            cosine_similarity(query_embedding, doc_emb)
            for doc_emb in self.embeddings
        ]
        
        # Get top k indices
        top_indices = np.argsort(similarities)[::-1][:top_k]
        
        # Return results
        results = [
            {
                "document": self.documents[i],
                "similarity": float(similarities[i]),
                "index": int(i)  # plain Python types, not NumPy scalars
            }
            for i in top_indices
        ]
        
        return results
    
    def __len__(self):
        return len(self.documents)

print("✓ Document store class created")
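
The store above compares the query against each document in a Python loop, which is fine for a handful of documents. For larger collections, one common alternative is to stack the embeddings into a matrix and score them all with a single matrix-vector product. The sketch below assumes that approach. `top_k_cosine` is a hypothetical helper and the 2-D vectors are invented for illustration.

```python
import numpy as np

def top_k_cosine(query_vec, doc_matrix, top_k=3):
    """Return (indices, scores) of the top_k rows most similar to query_vec."""
    query = np.asarray(query_vec, dtype=float)
    docs = np.asarray(doc_matrix, dtype=float)
    # Normalize once so a single matrix-vector product yields cosine scores.
    query = query / np.linalg.norm(query)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    scores = docs @ query
    order = np.argsort(scores)[::-1][:top_k]
    return order.tolist(), scores[order].tolist()

# Toy 2-D vectors, invented for illustration.
toy_matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
indices, scores = top_k_cosine([2.0, 0.1], toy_matrix, top_k=2)
print(indices, [round(s, 3) for s in scores])
```

This sketch assumes no embedding is all zeros; a production version would need to handle that case.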

### Test the Document Store

In [None]:
# Create a document store
store = SimpleDocumentStore()

# Add medical knowledge documents
medical_docs = [
    "Hypertension is high blood pressure that can lead to heart disease.",
    "Diabetes is a condition where blood sugar levels are too high.",
    "Asthma is a respiratory condition causing breathing difficulties.",
    "Migraine is a neurological condition causing severe headaches.",
    "Arthritis causes inflammation and pain in the joints.",
    "Pneumonia is a lung infection causing cough and fever.",
    "Depression is a mental health disorder affecting mood.",
    "Eczema is a skin condition causing itchy, inflamed patches."
]

store.add_documents(medical_docs)
print(f"\nStore contains {len(store)} documents")

### Search the Store

In [None]:
# Search with different queries
queries = [
    "What conditions affect the heart?",
    "Tell me about breathing problems",
    "Mental health issues"
]

for query in queries:
    print(f"\nQuery: {query}")
    print("=" * 70)
    
    results = store.search(query, top_k=3)
    
    for i, result in enumerate(results, 1):
        print(f"{i}. [Similarity: {result['similarity']:.3f}]")
        print(f"   {result['document']}")

## Part 5: Understanding RAG

### What is RAG?
**Retrieval-Augmented Generation** combines:
1. **Retrieval**: Find relevant documents using embeddings
2. **Augmentation**: Add retrieved docs to the prompt
3. **Generation**: LLM generates answer using the context
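
The three steps can be sketched offline as three small functions. Everything here is a stand-in chosen so the example runs without an API key: `retrieve` ranks by shared words rather than embeddings, `generate` is a placeholder for an LLM call, and all names are hypothetical.

```python
def retrieve(query, documents, top_k=1):
    """Toy retriever: rank documents by words shared with the query.
    A stand-in for embedding search so the sketch runs offline."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

def augment(query, retrieved):
    """Put the retrieved documents into the prompt as context."""
    context = "\n".join(retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

def generate(prompt):
    """Placeholder for an LLM call; it only reports the prompt size."""
    return f"[LLM would answer here, given a {len(prompt)}-character prompt]"

toy_docs = [
    "Asthma is a respiratory condition causing breathing difficulties.",
    "Arthritis causes inflammation and pain in the joints.",
]
toy_query = "respiratory breathing problems"

retrieved = retrieve(toy_query, toy_docs)
prompt = augment(toy_query, retrieved)
print(prompt)
print(generate(prompt))
```

Keeping the steps as separate functions makes it easy to swap the toy retriever for embedding search later without touching the other two.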

### Why RAG?
- Gives the LLM access to specific knowledge it was not trained on
- Can reduce hallucinations by grounding answers in retrieved text
- Works with private or recent data
- Often cheaper and quicker to update than fine-tuning

## Part 6: Simple RAG Example

In [None]:
def simple_rag(query, document_store, top_k=2):
    """Simple RAG: retrieve documents and generate answer"""
    
    # Step 1: Retrieve relevant documents
    results = document_store.search(query, top_k=top_k)
    
    # Step 2: Build context from retrieved documents
    context = "\n\n".join([
        f"Document {i+1}: {result['document']}"
        for i, result in enumerate(results)
    ])
    
    # Step 3: Create prompt with context
    prompt = f"""Answer the question using only the context below. If the context does not contain the answer, say so.
    
Context:
{context}

Question: {query}

Answer:"""
    
    # Step 4: Generate answer
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3
    )
    
    return {
        "answer": response.choices[0].message.content,
        "sources": results
    }

# Test RAG
question = "What should I know about conditions that affect breathing?"
result = simple_rag(question, store)

print(f"Question: {question}")
print("\n" + "="*70 + "\n")
print(f"Answer: {result['answer']}")
print("\n" + "="*70 + "\n")
print("Sources used:")
for i, source in enumerate(result['sources'], 1):
    print(f"{i}. {source['document']}")
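
Top-k retrieval always returns k documents, even when none are relevant to the question. One possible safeguard is to drop low-scoring results before building the context. In the sketch below, `build_context` is a hypothetical helper, the `min_similarity=0.3` cutoff is an arbitrary assumption that would need tuning on real data, and the similarity scores in `sample_results` are invented.

```python
def build_context(results, min_similarity=0.3):
    """Format retrieved documents, skipping any below a similarity cutoff."""
    kept = [r for r in results if r["similarity"] >= min_similarity]
    if not kept:
        return None  # caller can decline to answer instead of guessing
    return "\n\n".join(
        f"Document {i}: {r['document']}" for i, r in enumerate(kept, 1)
    )

# Hand-written results in the same shape that SimpleDocumentStore.search returns.
# The similarity values are made up for this example.
sample_results = [
    {"document": "Asthma is a respiratory condition causing breathing difficulties.",
     "similarity": 0.62, "index": 2},
    {"document": "Eczema is a skin condition causing itchy, inflamed patches.",
     "similarity": 0.12, "index": 7},
]

print(build_context(sample_results))
print(build_context(sample_results, min_similarity=0.9))
```

Returning `None` when nothing passes the cutoff lets the calling code respond with "no relevant documents found" rather than sending an empty context to the model.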

## Part 7: Document Chunking

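Embedding models have a maximum input length, and a single vector for a long document blurs its topics together. Splitting long documents into overlapping chunks keeps each piece focused and searchable: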

In [None]:
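def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping chunks of up to chunk_size words"""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # How far the window advances each time
    
    for i in range(0, len(words), step):
        chunks.append(' '.join(words[i:i + chunk_size]))
        # Stop once a chunk reaches the end; later chunks would only repeat its tail
        if i + chunk_size >= len(words):
            break
    
    return chunks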

# Test chunking
long_text = """Machine learning in healthcare has revolutionized diagnostic processes. 
Computer vision models can now analyze medical images with accuracy comparable to expert radiologists. 
Natural language processing helps extract insights from clinical notes and research papers. 
Predictive models identify patients at risk of developing certain conditions. 
However, challenges remain in model interpretability, data privacy, and clinical integration. 
The future of AI in healthcare depends on collaboration between clinicians, data scientists, and patients."""

chunks = chunk_text(long_text, chunk_size=20, overlap=5)

print(f"Original text ({len(long_text.split())} words):")
print(long_text)
print("\n" + "="*70 + "\n")
print(f"Created {len(chunks)} chunks:")
for i, chunk in enumerate(chunks, 1):
    print(f"\nChunk {i}:")
    print(chunk)
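
Fixed-size word windows can cut a sentence in half. A different technique, sketched here as an alternative to the word-window approach, groups whole sentences until a word budget is reached. `chunk_by_sentences` is a hypothetical helper, the regex sentence splitter is deliberately naive (it will mis-split abbreviations such as "e.g."), and this variant has no overlap between chunks.

```python
import re

def chunk_by_sentences(text, max_words=40):
    """Group whole sentences into chunks of at most max_words words.
    A single sentence longer than max_words becomes its own chunk."""
    # Naive split: break after ., ! or ? when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk if adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

sample_text = (
    "Asthma affects the airways. It can cause wheezing. "
    "Inhalers often help. Triggers vary by person."
)
sentence_chunks = chunk_by_sentences(sample_text, max_words=8)
for chunk in sentence_chunks:
    print(chunk)
```

Each chunk now ends on a sentence boundary, which tends to give the embedding model more coherent input.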

## Part 8: Practical Example - Research Paper Search

In [None]:
# Simulate research paper abstracts
papers = [
    {
        "title": "Deep Learning for Medical Image Analysis",
        "abstract": "This paper reviews convolutional neural networks for analyzing CT scans and MRI images. We demonstrate state-of-the-art performance in tumor detection."
    },
    {
        "title": "Natural Language Processing in Clinical Documentation",
        "abstract": "We present a transformer-based model for extracting structured information from electronic health records. Our approach reduces manual data entry time."
    },
    {
        "title": "Predictive Models for Patient Outcomes",
        "abstract": "Machine learning models predict hospital readmission risk using demographic and clinical data. Results show 85% accuracy in identifying high-risk patients."
    },
    {
        "title": "Genomic Data Analysis with AI",
        "abstract": "Deep learning identifies genetic variants associated with disease susceptibility. Our model processes whole-genome sequencing data efficiently."
    }
]

# Create store with paper abstracts
paper_store = SimpleDocumentStore()

for paper in papers:
    # Combine title and abstract for better searchability
    doc_text = f"{paper['title']}. {paper['abstract']}"
    paper_store.add_document(doc_text)

print(f"Indexed {len(paper_store)} research papers")

In [None]:
# Search for papers
research_query = "What papers discuss using AI for analyzing patient records?"

print(f"Query: {research_query}\n")
results = paper_store.search(research_query, top_k=2)

for i, result in enumerate(results, 1):
    print(f"{i}. Similarity: {result['similarity']:.3f}")
    print(f"   {result['document']}")
    print()
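
Because each search result includes the document's `index`, it can be mapped back to the original record to show structured fields such as the title. The sketch below assumes documents were added to the store in the same order as the source list. `attach_titles` is a hypothetical helper, and the similarity score in `sample_hits` is invented.

```python
def attach_titles(results, records):
    """Look up each result's title using the index stored with it.
    Assumes records were indexed in list order, so index i maps to records[i]."""
    return [
        {
            "title": records[int(r["index"])]["title"],
            "similarity": r["similarity"],
        }
        for r in results
    ]

sample_records = [
    {"title": "Deep Learning for Medical Image Analysis"},
    {"title": "Natural Language Processing in Clinical Documentation"},
]
# A hand-written result in the shape returned by SimpleDocumentStore.search.
sample_hits = [
    {"document": "Natural Language Processing in Clinical Documentation. ...",
     "similarity": 0.55, "index": 1},
]

for hit in attach_titles(sample_hits, sample_records):
    print(f"{hit['title']} (similarity {hit['similarity']:.2f})")
```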

## Key Takeaways

1. **Embeddings capture meaning** - Similar concepts have similar vectors
2. **Cosine similarity measures relevance** - Higher values = more similar
3. **Semantic search goes beyond keywords** - Finds related content with different words
4. **RAG = Retrieve + Augment + Generate** - Give LLMs relevant context
5. **Chunk long documents** - Break into searchable pieces

## Next Week

We'll build complete RAG systems:
- Processing document collections
- Advanced retrieval strategies
- Combining multiple sources
- Production-ready patterns