[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kgweber-cwru/coding-with-ai-wn26/blob/main/series-2-coding-llms/week-4-embeddings-and-rag-concepts/concepts.ipynb)

# Week 4: Embeddings and RAG Concepts

## Learning Objectives
- Understand what embeddings are and how they work
- Generate embeddings using Vertex AI's API
- Calculate semantic similarity between texts
- Build a simple semantic search system
- Understand the foundation for RAG systems

In [1]:
import os
import sys
from pathlib import Path

IN_COLAB = "google.colab" in sys.modules

if IN_COLAB:
    !pip install -q google-genai google-auth python-dotenv numpy
    from google.colab import auth
    auth.authenticate_user()
    try:
        PROJECT_ID = input("Enter your Google Cloud Project ID (press Enter to use default ADC): ").strip()
    except Exception:
        PROJECT_ID = ""
    if PROJECT_ID:
        os.environ["GOOGLE_CLOUD_PROJECT"] = PROJECT_ID
else:
    def find_service_account_json(max_up=6):
        p = Path.cwd()
        for _ in range(max_up):
            candidate = p / "series-2-coding-llms" / "creds"
            if candidate.exists():
                for f in candidate.glob("*.json"):
                    return str(f.resolve())
            candidate2 = p / "creds"
            if candidate2.exists():
                for f in candidate2.glob("*.json"):
                    return str(f.resolve())
            p = p.parent
        return None

    sa_path = find_service_account_json()
    if sa_path:
        os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = sa_path
    else:
        try:
            from dotenv import load_dotenv
            load_dotenv()
        except Exception:
            pass


In [2]:
import numpy as np
from google import genai
from google.genai import types
import google.auth

creds, project = google.auth.default()
project = os.environ.get("GOOGLE_CLOUD_PROJECT", project)
client = genai.Client(vertexai=True, project=project, location="us-central1")
print(f"Using project: {project}")

print("✅ Environment loaded successfully!")

Using project: coding-with-ai-wn-26
✅ Environment loaded successfully!


## Part 1: What Are Embeddings?

### Concept
- Embeddings are **numerical representations** of text
- Text → Vector of numbers (e.g., 768 dimensions for Vertex AI's `gemini-embedding-001`)
- Semantically similar texts have similar vectors
- Unlike keywords, embeddings capture **meaning**

### Why Embeddings?
- Traditional search: keyword matching ("cat" ≠ "feline")
- Embeddings: semantic matching ("cat" ≈ "feline")
- Enable finding related content even with different words
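
To make the keyword limitation concrete, here is a minimal sketch of exact-token matching. The `keyword_overlap` helper is hypothetical (it is not part of this notebook). It uses Jaccard overlap of word sets as a simple stand-in for traditional keyword search, so treat it as an illustration, not a model of any particular search engine.

```python
def keyword_overlap(query: str, document: str) -> float:
    """Jaccard overlap between the lowercase word sets of two texts."""
    q = set(query.lower().split())
    d = set(document.lower().split())
    if not q or not d:
        return 0.0
    return len(q & d) / len(q | d)


# Same meaning, different words: exact-token matching finds nothing in common.
print(keyword_overlap("cat", "feline"))                     # 0.0
# Shared tokens are the only signal this kind of matching can see.
print(keyword_overlap("my cat sleeps", "the cat sleeps"))   # 0.5
```

An embedding-based comparison of "cat" and "feline" would instead be expected to give a clearly non-zero similarity, which is the gap the rest of this notebook explores.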

### Example
A classic example is the analogy "king is to man as queen is to woman". In an embedding space, this can be expressed as:

`vector('king') - vector('man') + vector('woman') ≈ vector('queen')`

This means if you take the vector for 'king', subtract the vector for 'man' (to remove the "maleness" concept), and add the vector for 'woman' (to add the "femaleness" concept), you should get a vector that is very close to the vector for 'queen'.


![king-queen](images/king-queen.png)

_credit: https://ai.engin.umich.edu/2018/07/23/word-embeddings-and-how-they-vary/_
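
The arithmetic can be tried on a toy example before calling a real model. The sketch below uses hand-made two-dimensional vectors whose axes are assumed to mean "royalty" and "gender". These values are invented for illustration. Real embeddings are learned and have hundreds of dimensions with no such clean labels.

```python
# Hand-made 2-D "embeddings": [royalty, gender] with male = +1, female = -1.
# These values are invented for illustration; real embeddings are learned.
toy = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
}


def analogy(a, b, c):
    """Return vector(a) - vector(b) + vector(c), element by element."""
    return [x - y + z for x, y, z in zip(toy[a], toy[b], toy[c])]


result = analogy("king", "man", "woman")
print(result)                  # [1.0, -1.0]
print(result == toy["queen"])  # True
```

In this idealized space the result lands exactly on 'queen'. With learned embeddings the result is only approximate.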

## Part 2: Creating Embeddings and Measuring Similarity

In [3]:
# Some useful functions

# One to calculate the embeddings using Google's embedding model
# Locally, you could use GloVe or Word2Vec for single words

def get_embedding(text, model="gemini-embedding-001"):
    """Get embedding for a piece of text"""
    text = text.replace("\n", " ")  # Clean text
    response = client.models.embed_content(
        model=model,
        contents=text
    )
    return response.embeddings[0].values

# We need a cosine similarity function to find the most similar vector
def cosine_similarity(vec1, vec2):
    """Calculate cosine similarity between two vectors"""
    vec1 = np.array(vec1)
    vec2 = np.array(vec2)
    dot_product = np.dot(vec1, vec2)
    magnitude = np.linalg.norm(vec1) * np.linalg.norm(vec2)
    if magnitude == 0:
        return 0.0  # A zero-length vector has no direction, so report 0.0
    return dot_product / magnitude
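
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on the angle between them and not on their magnitude. As a small worked check, the sketch below re-implements the same formula in plain Python (the name `cosine` is just for this example) and applies it to vectors whose angles are easy to reason about.

```python
import math


def cosine(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


print(round(cosine([1, 0], [2, 0]), 4))   # 1.0     same direction, different length
print(round(cosine([1, 0], [1, 1]), 4))   # 0.7071  45 degrees apart
print(round(cosine([1, 0], [0, 1]), 4))   # 0.0     perpendicular
print(round(cosine([1, 0], [-1, 0]), 4))  # -1.0    opposite direction
```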


Let's look at our king/queen example

In [4]:
# Get embeddings for the words
emb_king = np.array(get_embedding("king"))
emb_man = np.array(get_embedding("man"))
emb_woman = np.array(get_embedding("woman"))
emb_queen = np.array(get_embedding("queen"))

# Perform the vector arithmetic
result_vec = emb_king - emb_man + emb_woman

# Let's see how close our result is to 'queen'
similarity = cosine_similarity(result_vec, emb_queen)
print(f"Similarity between (king - man + woman) and queen: {similarity:.4f}")

# To make this more robust, we could find the most similar word from a vocabulary
# For this demo, we'll use a small list of candidate words.
vocabulary = ["queen", "prince", "princess", "king", "man", "woman", "girl", "boy", "tomato", "car"]
vocab_embeddings = {word: np.array(get_embedding(word)) for word in vocabulary}

# Find the word in our vocab with the highest similarity to our result vector
similarities = {
    word: cosine_similarity(result_vec, vocab_emb)
    for word, vocab_emb in vocab_embeddings.items()
}

# Find the most similar word
most_similar_word = max(similarities, key=similarities.get)

print(f"\nThe most similar word in our vocabulary to (king - man + woman) is: '{most_similar_word}'")
print("Similarities:", {word: f"{sim:.3f}" for word, sim in sorted(similarities.items(), key=lambda item: item[1], reverse=True)})

Similarity between (king - man + woman) and queen: 0.6087

The most similar word in our vocabulary to (king - man + woman) is: 'king'
Similarities: {'king': '0.794', 'woman': '0.737', 'queen': '0.609', 'girl': '0.597', 'princess': '0.580', 'prince': '0.537', 'boy': '0.487', 'tomato': '0.470', 'car': '0.467', 'man': '0.305'}


### Interpreting the Results

You might notice that the result isn't 'queen'. In the run above, 'king' is the most similar word and 'woman' comes second. Why is that?

- **Model Optimization**: The classic `king - man + woman ≈ queen` example is usually demonstrated with static word-embedding models like `word2vec` or `GloVe`, which learn one vector per word and are well known for handling these word-level analogies.
- **Modern Models**: Newer models like Vertex AI's `gemini-embedding-001` are optimized for understanding the semantic meaning of **phrases and sentences**, not just single words. The simple vector arithmetic doesn't always hold up as cleanly for individual words.

**However, the key takeaway is still valid.** Once you set aside 'king' and 'woman', which are inputs to the arithmetic, 'queen' (0.609) is the top-scoring word. It is ahead of 'girl' (0.597) and 'princess' (0.580), and clearly ahead of unrelated words like 'tomato' (0.470) and 'car' (0.467). It also scores far higher than 'man' (0.305), which we subtracted. The margins over 'girl' and 'princess' are small, so treat this as suggestive rather than conclusive: the embedding space reflects the "royal" and "female" concepts, even if the single-word arithmetic isn't clean.
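
Analogy evaluations conventionally leave the query words themselves out of the candidate set, because the result vector tends to stay close to its own inputs. The sketch below applies that convention to the scores printed above. The numbers are copied from this notebook's output, and the name `query_words` is introduced only for this example.

```python
# Scores copied from the output above (cosine similarity to king - man + woman).
similarities = {
    "king": 0.794, "woman": 0.737, "queen": 0.609, "girl": 0.597,
    "princess": 0.580, "prince": 0.537, "boy": 0.487, "tomato": 0.470,
    "car": 0.467, "man": 0.305,
}

# The three words used to build the query vector are excluded as candidates.
query_words = {"king", "man", "woman"}
candidates = {w: s for w, s in similarities.items() if w not in query_words}

best = max(candidates, key=candidates.get)
print(best)  # queen
```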

### Let's work with longer text

In [5]:

# Get an embedding
text = "Machine learning is transforming healthcare."
embedding = get_embedding(text)

print(f"Text: {text}")
print(f"Embedding dimensions: {len(embedding)}")
print(f"First 10 values: {embedding[:10]}")
print(f"Embedding type: {type(embedding)}")

Text: Machine learning is transforming healthcare.
Embedding dimensions: 768
First 10 values: [0.050925444811582565, -0.0007302068406715989, 0.01596071571111679, -0.016186965629458427, -0.05082755535840988, 0.050884198397397995, -0.0360821932554245, 0.005750035401433706, 0.000686395273078233, -0.001051548053510487]
Embedding type: <class 'list'>


In [6]:
# Compare similar texts
text1 = "The patient has a fever."
text2 = "The person is running a temperature."
text3 = "I enjoy playing basketball."

emb1 = get_embedding(text1)
emb2 = get_embedding(text2)
emb3 = get_embedding(text3)

print(f"Similarity between '{text1}' and '{text2}':")
print(f"  {cosine_similarity(emb1, emb2):.4f}")
print()

print(f"Similarity between '{text1}' and '{text3}':")
print(f"  {cosine_similarity(emb1, emb3):.4f}")

Similarity between 'The patient has a fever.' and 'The person is running a temperature.':
  0.7732

Similarity between 'The patient has a fever.' and 'I enjoy playing basketball.':
  0.2746


## Part 3: Building a Document Store

We will build a temporary vector-based document store: we'll store the text and embeddings for each document and make it possible to search the store as well.

This is an ephemeral document store: it will disappear as soon as you stop this runtime. We'll look at more permanent ones next week. For the time being, it's helpful to look directly under the hood and see what it's doing.

In [7]:
class SimpleDocumentStore:
    """Store documents with their embeddings"""
    
    def __init__(self):
        self.documents = []
        self.embeddings = []
    
    def add_document(self, text):
        """Add a document and compute its embedding"""
        embedding = get_embedding(text)
        self.documents.append(text)
        self.embeddings.append(embedding)
        return len(self.documents) - 1  # Return index
    
    def add_documents(self, texts):
        """Add multiple documents"""
        for text in texts:
            self.add_document(text)
        print(f"Added {len(texts)} documents")
    
    def search(self, query, top_k=3):
        """Search for most similar documents"""
        query_embedding = get_embedding(query)
        
        # Calculate similarities
        similarities = [
            cosine_similarity(query_embedding, doc_emb)
            for doc_emb in self.embeddings
        ]
        
        # Get top k indices
        top_indices = np.argsort(similarities)[::-1][:top_k]
        
        # Return results
        results = [
            {
                "document": self.documents[i],
                "similarity": similarities[i],
                "index": i
            }
            for i in top_indices
        ]
        
        return results
    
    def __len__(self):
        return len(self.documents)

print("✅ Document store class created")

✅ Document store class created
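
The `search` method is a linear scan: it scores every stored vector against the query and keeps the best `top_k`. The sketch below isolates that ranking step so it can be run without any API calls. The three-dimensional vectors and the one-word "documents" are invented stand-ins for real embeddings and texts.

```python
import math


def cosine(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Invented 3-D vectors standing in for real embeddings.
docs = ["heart", "lungs", "skin"]
doc_vectors = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
query_vector = [0.8, 0.3, 0.0]

# Score every stored vector, then keep the indices of the top_k highest scores.
scores = [cosine(query_vector, v) for v in doc_vectors]
top_k = 2
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:top_k]

for i in ranked:
    print(docs[i], round(scores[i], 3))  # "heart" first, then "lungs"
```

Because every query touches every stored vector, the cost grows linearly with the number of documents. That is fine for a handful of documents like the ones in this notebook.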


### Test the Document Store

In [8]:
# Create a document store
store = SimpleDocumentStore()

# Add medical knowledge documents
medical_docs = [
    "Hypertension is high blood pressure that can lead to heart disease.",
    "Diabetes is a condition where blood sugar levels are too high.",
    "Asthma is a respiratory condition causing breathing difficulties.",
    "Migraine is a neurological condition causing severe headaches.",
    "Arthritis causes inflammation and pain in the joints.",
    "Pneumonia is a lung infection causing cough and fever.",
    "Depression is a mental health disorder affecting mood.",
    "Eczema is a skin condition causing itchy, inflamed patches."
]

store.add_documents(medical_docs)
print(f"\nStore contains {len(store)} documents")

Added 8 documents

Store contains 8 documents


### Search the Store

In [9]:
# Search with different queries
queries = [
    "What conditions affect the heart?",
    "Tell me about breathing problems",
    "Mental health issues"
]

for query in queries:
    print(f"\nQuery: {query}")
    print("=" * 70)
    
    results = store.search(query, top_k=3)
    
    for i, result in enumerate(results, 1):
        print(f"{i}. [Similarity: {result['similarity']:.3f}]")
        print(f"   {result['document']}")


Query: What conditions affect the heart?
1. [Similarity: 0.529]
   Hypertension is high blood pressure that can lead to heart disease.
2. [Similarity: 0.456]
   Asthma is a respiratory condition causing breathing difficulties.
3. [Similarity: 0.433]
   Depression is a mental health disorder affecting mood.

Query: Tell me about breathing problems
1. [Similarity: 0.634]
   Asthma is a respiratory condition causing breathing difficulties.
2. [Similarity: 0.431]
   Pneumonia is a lung infection causing cough and fever.
3. [Similarity: 0.351]
   Diabetes is a condition where blood sugar levels are too high.

Query: Mental health issues
1. [Similarity: 0.549]
   Depression is a mental health disorder affecting mood.
2. [Similarity: 0.377]
   Migraine is a neurological condition causing severe headaches.
3. [Similarity: 0.344]
   Eczema is a skin condition causing itchy, inflamed patches.
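
Notice that `top_k=3` always returns three documents, even when only one is clearly relevant (asthma and depression appear for the heart query). One common mitigation is a similarity cutoff. The sketch below applies one to the heart-query results copied from the output above. The helper `filter_by_similarity` and the 0.5 threshold are assumptions made for this example, and a useful cutoff would need tuning for your model and data.

```python
# Top-3 results copied from the "What conditions affect the heart?" query above.
results = [
    {"document": "Hypertension is high blood pressure that can lead to heart disease.", "similarity": 0.529},
    {"document": "Asthma is a respiratory condition causing breathing difficulties.", "similarity": 0.456},
    {"document": "Depression is a mental health disorder affecting mood.", "similarity": 0.433},
]


def filter_by_similarity(results, min_similarity=0.5):
    """Keep only results at or above a cutoff (0.5 is an arbitrary choice here)."""
    return [r for r in results if r["similarity"] >= min_similarity]


# Only the hypertension document survives the 0.5 cutoff.
for r in filter_by_similarity(results):
    print(f"{r['similarity']:.3f}  {r['document']}")
```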


## Part 5: Understanding RAG

### What is RAG?
**Retrieval-Augmented Generation** combines:
1. **Retrieval**: Find relevant documents using embeddings
2. **Augmentation**: Add retrieved docs to the prompt
3. **Generation**: LLM generates answer using the context

### Why RAG?
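- Gives the LLM access to specific knowledge it was not trained on
- Can reduce hallucinations by grounding answers in retrieved text
- Works with private/recent data
- Often cheaper and faster to update than fine-tuning, since adding knowledge means adding documents rather than retraining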

## Part 6: Simple RAG Example

In [10]:
def simple_rag(query, document_store, top_k=2, debug=False):
    """Simple RAG: retrieve documents and generate answer"""
    
    # Step 1: Retrieve relevant documents
    results = document_store.search(query, top_k=top_k)
    
    # Step 2: Build context from retrieved documents
    context = "\n\n".join([
        f"Document {i+1}: {result['document']}"
        for i, result in enumerate(results)
    ])
    
    # Step 3: Create prompt with context
    prompt = f"""Answer the question based on the context below.
    
Context:
{context}

Question: {query}

Answer:"""
    
    # Step 4: Generate answer
    response = client.models.generate_content(
        model="gemini-2.5-flash-lite",
        contents=prompt,
        config=types.GenerateContentConfig(temperature=0.3)
    )
    
    return {
        "answer": response.text,
        "sources": results,
        "prompt": prompt if debug else "use debug mode to view prompt"
    }

# Test RAG
question = "What should I know about conditions that affect breathing?"
result = simple_rag(question, store, debug=True)

print(f"Question: {question}")
print("\n" + "="*70 + "\n")
print(f"Answer: {result['answer']}")
print("\n" + "="*70 + "\n")
print("Sources used:")
for i, source in enumerate(result['sources'], 1):
    print(f"{i}. {source['document']}")
print("\n" + "="*70 + "\n")
print("Prompt sent to model:")
print(result['prompt'])

Question: What should I know about conditions that affect breathing?


Answer: Based on the provided documents, you should know that:

*   **Asthma** is a respiratory condition that causes breathing difficulties.
*   **Pneumonia** is a lung infection that causes cough and fever.


Sources used:
1. Asthma is a respiratory condition causing breathing difficulties.
2. Pneumonia is a lung infection causing cough and fever.


Prompt sent to model:
Answer the question based on the context below.

Context:
Document 1: Asthma is a respiratory condition causing breathing difficulties.

Document 2: Pneumonia is a lung infection causing cough and fever.

Question: What should I know about conditions that affect breathing?

Answer:
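
The augmentation step is plain string assembly, so it can be examined without calling the model. The sketch below is one possible variation on the prompt above, with an added instruction to stay within the context. The helper name `build_grounded_prompt` and the wording of the instruction are assumptions for illustration, and how closely a model follows such an instruction is not tested here.

```python
def build_grounded_prompt(query, documents):
    """Assemble a RAG prompt that asks the model to stay within the context."""
    context = "\n\n".join(
        f"Document {i}: {doc}" for i, doc in enumerate(documents, 1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        "Answer:"
    )


docs = [
    "Asthma is a respiratory condition causing breathing difficulties.",
    "Pneumonia is a lung infection causing cough and fever.",
]
print(build_grounded_prompt("What conditions affect breathing?", docs))
```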


## Part 7: Document Chunking

For longer documents, we need to split them into chunks:

In [11]:
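def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping chunks of up to `chunk_size` words"""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    chunks = []
    
    # Each chunk starts (chunk_size - overlap) words after the previous one
    step = chunk_size - overlap
    for i in range(0, len(words), step):
        chunk = ' '.join(words[i:i + chunk_size])
        if chunk:  # Don't add empty chunks
            chunks.append(chunk)
    
    return chunks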

# Test chunking
long_text = """Machine learning in healthcare has revolutionized diagnostic processes. 
Computer vision models can now analyze medical images with accuracy comparable to expert radiologists. 
Natural language processing helps extract insights from clinical notes and research papers. 
Predictive models identify patients at risk of developing certain conditions. 
However, challenges remain in model interpretability, data privacy, and clinical integration. 
The future of AI in healthcare depends on collaboration between clinicians, data scientists, and patients."""

chunks = chunk_text(long_text, chunk_size=20, overlap=5)

print(f"Original text ({len(long_text.split())} words):")
print(long_text)
print("\n" + "="*70 + "\n")
print(f"Created {len(chunks)} chunks:")
for i, chunk in enumerate(chunks, 1):
    print(f"\nChunk {i}:")
    print(chunk)

Original text (70 words):
Machine learning in healthcare has revolutionized diagnostic processes. 
Computer vision models can now analyze medical images with accuracy comparable to expert radiologists. 
Natural language processing helps extract insights from clinical notes and research papers. 
Predictive models identify patients at risk of developing certain conditions. 
However, challenges remain in model interpretability, data privacy, and clinical integration. 
The future of AI in healthcare depends on collaboration between clinicians, data scientists, and patients.


Created 5 chunks:

Chunk 1:
Machine learning in healthcare has revolutionized diagnostic processes. Computer vision models can now analyze medical images with accuracy comparable to

Chunk 2:
images with accuracy comparable to expert radiologists. Natural language processing helps extract insights from clinical notes and research papers. Predictive

Chunk 3:
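notes and research papers. Predictive models identify patients at risk of developing certain conditions. However, challenges remain in model interpretability,

Chunk 4:
challenges remain in model interpretability, data privacy, and clinical integration. The future of AI in healthcare depends on collaboration between

Chunk 5:
healthcare depends on collaboration between clinicians, data scientists, and patients.
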
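
The chunk count follows from the window arithmetic: each chunk starts `chunk_size - overlap` words after the previous one. The sketch below lists the word-index ranges a sliding window visits for the parameters used above. The helper `chunk_windows` is hypothetical and exists only to expose the arithmetic.

```python
def chunk_windows(n_words, chunk_size, overlap):
    """Return the (start, end) word-index pairs that a sliding window visits."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    return [(i, min(i + chunk_size, n_words)) for i in range(0, n_words, step)]


# 70 words, 20-word chunks, 5 words shared between neighbours -> step of 15.
print(chunk_windows(70, 20, 5))
# [(0, 20), (15, 35), (30, 50), (45, 65), (60, 70)]
```

Five start positions (0, 15, 30, 45, 60) give five chunks, and the last one is shorter because the text runs out at word 70.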

## Part 8: Practical Example - Research Paper Search

In [12]:
# Simulate research paper abstracts
papers = [
    {
        "title": "Deep Learning for Medical Image Analysis",
        "abstract": "This paper reviews convolutional neural networks for analyzing CT scans and MRI images. We demonstrate state-of-the-art performance in tumor detection."
    },
    {
        "title": "Natural Language Processing in Clinical Documentation",
        "abstract": "We present a transformer-based model for extracting structured information from electronic health records. Our approach reduces manual data entry time."
    },
    {
        "title": "Predictive Models for Patient Outcomes",
        "abstract": "Machine learning models predict hospital readmission risk using demographic and clinical data. Results show 85% accuracy in identifying high-risk patients."
    },
    {
        "title": "Genomic Data Analysis with AI",
        "abstract": "Deep learning identifies genetic variants associated with disease susceptibility. Our model processes whole-genome sequencing data efficiently."
    }
]

# Create store with paper abstracts
paper_store = SimpleDocumentStore()

for paper in papers:
    # Combine title and abstract for better searchability
    doc_text = f"{paper['title']}. {paper['abstract']}"
    paper_store.add_document(doc_text)

print(f"Indexed {len(paper_store)} research papers")

Indexed 4 research papers


In [13]:
# Search for papers
research_query = "What papers discuss using AI for analyzing patient records?"

print(f"Query: {research_query}\n")
results = paper_store.search(research_query, top_k=2)

for i, result in enumerate(results, 1):
    print(f"{i}. Similarity: {result['similarity']:.3f}")
    print(f"   {result['document']}")
    print()

Query: What papers discuss using AI for analyzing patient records?

1. Similarity: 0.678
   Natural Language Processing in Clinical Documentation. We present a transformer-based model for extracting structured information from electronic health records. Our approach reduces manual data entry time.

2. Similarity: 0.597
   Deep Learning for Medical Image Analysis. This paper reviews convolutional neural networks for analyzing CT scans and MRI images. We demonstrate state-of-the-art performance in tumor detection.
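
Each search hit also carries an `index` into the store, which can be used to look up metadata such as the paper title. The sketch below shows that lookup using the titles from the `papers` list. The `index` values are inferred from the order in which the papers were added (they were not printed above), so treat them as an assumption.

```python
# Titles from the `papers` list above, in the order they were indexed.
titles = [
    "Deep Learning for Medical Image Analysis",
    "Natural Language Processing in Clinical Documentation",
    "Predictive Models for Patient Outcomes",
    "Genomic Data Analysis with AI",
]

# Shape of the hits returned by `paper_store.search` above; the index values
# are inferred from insertion order, the similarities are from the output.
results = [
    {"index": 1, "similarity": 0.678},
    {"index": 0, "similarity": 0.597},
]

# Map each hit back to its title using the stored position.
lines = [
    f"{rank}. {titles[hit['index']]} ({hit['similarity']:.3f})"
    for rank, hit in enumerate(results, 1)
]
print("\n".join(lines))
```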



## Key Takeaways

1. **Embeddings capture meaning** - Similar concepts have similar vectors
2. **Cosine similarity measures relevance** - Higher values = more similar
3. **Semantic search beats keywords** - Finds related content with different words
4. **RAG = Retrieve + Augment + Generate** - Find relevant documents, add them to the prompt, then let the LLM answer
5. **Chunk long documents** - Break into searchable pieces

## Next Week

We'll build complete RAG systems:
- Processing document collections
- Advanced retrieval strategies
- Combining multiple sources
- Production-ready patterns