# 📓 The GenAI Revolution Cookbook

**Title:** Master Semantic Search: Build Custom AI Knowledge Base Tools

**Description:** Learn to create AI tools with semantic search capabilities using vector databases. This step-by-step guide empowers you to enhance AI retrieval with real-world applications.

---

*This Jupyter notebook contains executable code examples. Run the cells below to try out the code yourself!*



# Introduction

Retrieval-Augmented Generation (RAG) improves the answers a language model produces by retrieving relevant passages from an external knowledge base and supplying them as context at generation time. This addresses two limitations of standalone language models: their knowledge stops at the training cutoff, and they lack context-specific information about your domain. Because answers are grounded in retrieved documents, responses tend to be more accurate and relevant, and the knowledge base can be updated without retraining the model. For a deeper understanding of how to tailor these models to specific domains, see our [guide on customizing large language models](/blog/44830763/mastering-domain-specific-llm-customization-techniques-and-tools-unveiled).

## 1. Environment Setup and Installation

A well-configured development environment is crucial for building RAG systems. Begin by installing the necessary libraries:

In [None]:
# Install core libraries required for building RAG systems
# LangChain: Framework for building LLM applications with retrieval capabilities
# Transformers: Hugging Face library for pre-trained language models and embeddings
# ChromaDB: Vector database for efficient similarity search and storage
# sentence-transformers, torch, pandas, scikit-learn, datasets: needed by later cells
!pip install langchain transformers chromadb sentence-transformers torch pandas scikit-learn datasets

- [LangChain Documentation](https://python.langchain.com/docs/introduction/)
- [Hugging Face Documentation](https://huggingface.co/docs)
- [ChromaDB Documentation](https://docs.trychroma.com/)

These libraries provide the building blocks used in the rest of this notebook: embedding models and language models from the Hugging Face ecosystem, and ChromaDB for vector storage and similarity search. For best practices on fine-tuning models with these tools, explore our [fine-tuning guide with Hugging Face Transformers](/blog/44830763/mastering-fine-tuning-of-large-language-models-with-hugging-face).

## 2. Data Preparation and Knowledge Base Construction

Prepare your data by loading and preprocessing documents to build a high-quality knowledge base for RAG:

In [None]:
import pandas as pd

def prepare_knowledge_base(file_path):
    """
    Load and preprocess documents from a CSV file for RAG system.
    
    Args:
        file_path (str): Path to the CSV file containing documents
        
    Returns:
        pd.DataFrame: Preprocessed dataframe with cleaned text
        
    Raises:
        FileNotFoundError: If the CSV file doesn't exist
        ValueError: If required columns are missing
    """
    try:
        data = pd.read_csv(file_path)
    except FileNotFoundError:
        raise FileNotFoundError(f"Document file not found at {file_path}")
    
    if 'text' not in data.columns:
        raise ValueError("CSV must contain a 'text' column with document content")
    
    data = data.dropna(subset=['text'])
    # Cast to str first so numeric or mixed-type cells do not raise AttributeError
    data['text'] = data['text'].astype(str).str.lower().str.strip()
    data = data.drop_duplicates(subset=['text'], keep='first')
    data = data[data['text'].str.len() >= 10]
    
    print(f"Loaded {len(data)} documents after preprocessing")
    
    return data

# Example usage
data = prepare_knowledge_base('documents.csv')
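
The cell above expects a `documents.csv` file with a `text` column. If you do not have one yet, a small sample file can be generated first. The sketch below is only an illustration: the rows and the `write_sample_csv` helper are hypothetical and not part of the original dataset. The last three rows exist to exercise the cleaning rules (duplicate removal, the 10-character minimum, and missing text), and the first five rows are ordered to line up with the placeholder ground truth used in the evaluation section.

```python
import csv

# Hypothetical sample rows; replace with your own documents.
SAMPLE_ROWS = [
    {"id": 1, "text": "Retrieval-Augmented Generation combines a retriever with a language model."},
    {"id": 2, "text": "RAG systems pass retrieved passages to the generator as context."},
    {"id": 3, "text": "Semantic search ranks documents by embedding similarity rather than keyword overlap."},
    {"id": 4, "text": "Vector databases index high-dimensional embeddings for nearest-neighbor lookup."},
    {"id": 5, "text": "ChromaDB stores embeddings together with documents and metadata."},
    {"id": 6, "text": "Sentence transformers map sentences to fixed-size dense vectors."},
    # Duplicate of row 6 once lowercased and stripped
    {"id": 7, "text": "  SENTENCE TRANSFORMERS map sentences to fixed-size dense vectors.  "},
    # Fewer than 10 characters
    {"id": 8, "text": "Too short"},
    # Empty text, which pandas reads as a missing value
    {"id": 9, "text": ""},
]

def write_sample_csv(path="documents.csv"):
    """Write the sample rows to a CSV file with 'id' and 'text' columns."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "text"])
        writer.writeheader()
        writer.writerows(SAMPLE_ROWS)
    return path

write_sample_csv()
```

Run this before calling `prepare_knowledge_base('documents.csv')` when you want to experiment without your own data.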

## 3. Embedding and Vectorization Techniques

Convert text documents into dense vector embeddings for semantic search:

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np

def generate_embeddings(texts, model_name='all-MiniLM-L6-v2', batch_size=32):
    """
    Generate embeddings for a list of text documents using sentence transformers.
    
    Args:
        texts (list): List of text strings to embed
        model_name (str): Name of the sentence transformer model to use
        batch_size (int): Number of texts to process in each batch for memory efficiency
        
    Returns:
        np.ndarray: Array of embeddings with shape (num_texts, embedding_dim)
        
    Raises:
        ValueError: If texts list is empty
    """
    if len(texts) == 0:
        raise ValueError("Text list cannot be empty")
    
    model = SentenceTransformer(model_name)
    embeddings = model.encode(
        texts,
        batch_size=batch_size,
        show_progress_bar=True,
        convert_to_numpy=True
    )
    
    print(f"Generated {len(embeddings)} embeddings with dimension {embeddings.shape[1]}")
    
    return embeddings

# Example usage: Generate embeddings from preprocessed data
embeddings = generate_embeddings(data['text'].tolist())

# Verify embedding quality by checking shape and data type
assert embeddings.shape[0] == len(data), "Mismatch between number of texts and embeddings"
assert embeddings.dtype == np.float32, "Embeddings should be float32 for efficiency"
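
To build intuition for how embeddings support semantic search, here is a minimal sketch of cosine similarity ranking in plain Python. The three-dimensional vectors and document names are made up for illustration; real `all-MiniLM-L6-v2` embeddings have many more dimensions, and `cosine_sim` is a hypothetical helper, not a library function.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings (values are invented)
toy_embeddings = {
    "doc_about_cats": [0.9, 0.1, 0.0],
    "doc_about_dogs": [0.8, 0.3, 0.1],
    "doc_about_taxes": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.2, 0.0]

# Rank documents from most to least similar to the query
ranked = sorted(
    toy_embeddings,
    key=lambda name: cosine_sim(query_vector, toy_embeddings[name]),
    reverse=True,
)
print(ranked)
# ['doc_about_cats', 'doc_about_dogs', 'doc_about_taxes']
```

Vectors that point in similar directions score close to 1, which is why documents on related topics rank above unrelated ones.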

## 4. Integrating Vector Databases for Retrieval

Store the embeddings in a ChromaDB collection for efficient similarity search:

In [None]:
import chromadb
from chromadb.config import Settings

def setup_vector_database(embeddings, texts, metadata=None, collection_name="rag_documents"):
    """
    Initialize ChromaDB and store document embeddings with metadata.
    
    Args:
        embeddings (np.ndarray): Document embeddings to store
        texts (list): Original text documents corresponding to embeddings
        metadata (list): Optional list of metadata dicts for each document
        collection_name (str): Name of the ChromaDB collection
        
    Returns:
        chromadb.Collection: ChromaDB collection object for querying
        
    Raises:
        ValueError: If embeddings and texts have different lengths
    """
    if len(embeddings) != len(texts):
        raise ValueError("Number of embeddings must match number of texts")
    
    client = chromadb.PersistentClient(
        path="./vector_db_path",
        settings=Settings(anonymized_telemetry=False)
    )
    
    collection = client.get_or_create_collection(
        name=collection_name,
        metadata={"description": "RAG document embeddings for semantic search"}
    )
    
    ids = [f"doc_{i}" for i in range(len(texts))]
    
    if metadata is None:
        metadata = [{"source": "documents.csv", "index": i} for i in range(len(texts))]
    
    # upsert instead of add so re-running the cell does not collide with existing IDs
    collection.upsert(
        embeddings=embeddings.tolist(),
        documents=texts,
        metadatas=metadata,
        ids=ids
    )
    
    print(f"Stored {len(embeddings)} embeddings in collection '{collection_name}'")
    
    return collection

def query_vector_database(collection, query_text, model, top_k=5):
    """
    Query the vector database to retrieve most similar documents.
    
    Args:
        collection (chromadb.Collection): ChromaDB collection to query
        query_text (str): Text query to search for
        model (SentenceTransformer): Model to encode the query
        top_k (int): Number of top results to return
        
    Returns:
        dict: Query results containing documents, distances, and metadata
    """
    query_embedding = model.encode([query_text])[0]
    
    results = collection.query(
        query_embeddings=[query_embedding.tolist()],
        n_results=top_k
    )
    
    return results

# Example usage: Setup vector database and perform a query
collection = setup_vector_database(embeddings, data['text'].tolist())

# Test retrieval with a sample query
model = SentenceTransformer('all-MiniLM-L6-v2')
results = query_vector_database(collection, "What is machine learning?", model, top_k=3)
print(f"Retrieved {len(results['documents'][0])} relevant documents")
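
Query results come back as nested lists, with one inner list per query embedding, which is why the code above indexes `results['documents'][0]`. The sketch below flattens that structure into one record per hit. The `flatten_results` helper is hypothetical, and `mock_results` is a hand-written stand-in with invented values so the example runs without a database.

```python
def flatten_results(results, query_index=0):
    """Turn nested query results into a list of per-document dicts."""
    ids = results["ids"][query_index]
    documents = results["documents"][query_index]
    distances = results["distances"][query_index]
    metadatas = results["metadatas"][query_index]
    return [
        {"id": i, "document": d, "distance": dist, "metadata": m}
        for i, d, dist, m in zip(ids, documents, distances, metadatas)
    ]

# Hand-written stand-in for a query response (values are made up)
mock_results = {
    "ids": [["doc_4", "doc_0"]],
    "documents": [[
        "vector databases index embeddings",
        "rag combines retrieval and generation",
    ]],
    "distances": [[0.42, 0.87]],
    "metadatas": [[
        {"source": "documents.csv", "index": 4},
        {"source": "documents.csv", "index": 0},
    ]],
}

hits = flatten_results(mock_results)
for hit in hits:
    print(f"{hit['id']}  distance={hit['distance']:.2f}  {hit['document']}")
```

Smaller distances mean closer matches, so the first hit is the best one.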

## 5. Building the RAG Pipeline

Build an end-to-end RAG pipeline integrating retrieval and generation:

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

class RAGPipeline:
    """
    Retrieval-Augmented Generation pipeline combining vector search with LLM generation.
    
    This pipeline retrieves relevant documents from a vector database and uses them
    as context for a language model to generate informed, accurate responses.
    """
    
    def __init__(self, collection, embedding_model, llm_model_name="gpt2", top_k=3):
        """
        Initialize the RAG pipeline with retriever and generator components.
        
        Args:
            collection (chromadb.Collection): Vector database collection for retrieval
            embedding_model (SentenceTransformer): Model for encoding queries
            llm_model_name (str): Name of the Hugging Face LLM to use for generation
            top_k (int): Number of documents to retrieve for context
        """
        self.collection = collection
        self.embedding_model = embedding_model
        self.top_k = top_k
        
        self.tokenizer = AutoTokenizer.from_pretrained(llm_model_name)
        self.llm = AutoModelForCausalLM.from_pretrained(llm_model_name)
        
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        
        print(f"RAG Pipeline initialized with {llm_model_name}")
    
    def retrieve(self, query):
        query_embedding = self.embedding_model.encode([query])[0]
        
        results = self.collection.query(
            query_embeddings=[query_embedding.tolist()],
            n_results=self.top_k
        )
        
        documents = results['documents'][0] if results['documents'] else []
        
        return documents
    
    def generate(self, query, context_docs):
        context = "\n\n".join([f"Document {i+1}: {doc}" for i, doc in enumerate(context_docs)])
        prompt = f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"
        
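        # Truncate from the left so the question and "Answer:" cue at the end of the prompt survive
        self.tokenizer.truncation_side = "left"
        inputs = self.tokenizer(
            prompt,
            return_tensors="pt",
            truncation=True,
            max_length=512
        )
        
        with torch.no_grad():
            outputs = self.llm.generate(
                inputs['input_ids'],
                attention_mask=inputs['attention_mask'],
                max_new_tokens=150,
                temperature=0.7,
                do_sample=True,
                pad_token_id=self.tokenizer.pad_token_id
            )
        
        # Decode only the newly generated tokens; slicing the decoded string by len(prompt)
        # breaks when the prompt was truncated or does not round-trip through the tokenizer
        prompt_length = inputs['input_ids'].shape[1]
        answer = self.tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()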
        
        return answer
    
    def query(self, query_text):
        retrieved_docs = self.retrieve(query_text)
        
        if not retrieved_docs:
            return {
                "answer": "No relevant documents found to answer the query.",
                "sources": []
            }
        
        answer = self.generate(query_text, retrieved_docs)
        
        return {
            "answer": answer,
            "sources": retrieved_docs
        }

# Example usage: Initialize and test the RAG pipeline
pipeline = RAGPipeline(
    collection=collection,
    embedding_model=model,
    llm_model_name="gpt2",
    top_k=3
)

# Test the pipeline with a sample query
response = pipeline.query("What is RAG?")
print(f"Answer: {response['answer']}")
print(f"Retrieved {len(response['sources'])} source documents")
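
Small models such as GPT-2 have a limited context window, so long retrieved documents can crowd out the question. One way to guard against this is to cap the context before the prompt is tokenized. The sketch below is an illustration under stated assumptions: `build_prompt` and `max_context_chars` are hypothetical names, and a character budget is only a rough proxy for a token budget.

```python
def build_prompt(query, context_docs, max_context_chars=1500):
    """Assemble a RAG prompt, trimming context so the question always survives."""
    parts = []
    used = 0
    for i, doc in enumerate(context_docs, start=1):
        remaining = max_context_chars - used
        if remaining <= 0:
            break
        entry = f"Document {i}: {doc}"
        # Cut the entry short if it would exceed the remaining budget
        if len(entry) > remaining:
            entry = entry[:remaining]
        parts.append(entry)
        used += len(entry)
    context = "\n\n".join(parts)
    return f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"

# Usage with short, invented documents
example_prompt = build_prompt(
    "What is RAG?",
    ["rag combines retrieval and generation", "vector databases store embeddings"],
)
print(example_prompt)
```

Because the budget applies only to the context, the `Question:` and `Answer:` lines are always present at the end of the prompt.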

## 6. Advanced Techniques: Agentic Systems and Fine-Tuning

Enhancing RAG systems with advanced techniques like agentic systems and model fine-tuning can significantly improve performance. Agents enable multi-step reasoning and query refinement, while fine-tuning adapts models to specific domains. For a comprehensive walkthrough on fine-tuning, refer to our [fine-tuning guide with Hugging Face Transformers](/blog/44830763/mastering-fine-tuning-of-large-language-models-with-hugging-face).

In [None]:
import re

class QueryRefinementAgent:
    """
    Agent for multi-step query refinement to improve retrieval quality.
    
    This agent analyzes queries, expands them with synonyms, and reformulates
    them for better semantic matching with documents.
    """
    
    def __init__(self, embedding_model):
        """
        Initialize the query refinement agent.
        
        Args:
            embedding_model (SentenceTransformer): Model for semantic similarity
        """
        self.embedding_model = embedding_model
    
    def refine_query(self, query):
        refined = re.sub(r'\s+', ' ', query.strip())
        refined = re.sub(r'[^\w\s?]', '', refined)
        
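        # Expansions omit the acronym itself so it is not repeated in the refined query
        expansion_map = {
            'RAG': 'Retrieval-Augmented Generation',
            'AI': 'artificial intelligence machine learning',
            'ML': 'machine learning',
        }
        
        # Match whole words only, so 'AI' does not fire on "explain" or 'ML' on "HTML"
        base_query = refined
        for term, expansion in expansion_map.items():
            if re.search(rf'\b{re.escape(term)}\b', base_query, flags=re.IGNORECASE):
                refined = f"{refined} {expansion}"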
        
        if '?' not in refined and not any(word in refined.lower() for word in ['what', 'how', 'why', 'when', 'where']):
            refined = f"What is {refined}?"
        
        return refined
    
    def multi_query_retrieval(self, original_query, collection, top_k=5):
        queries = [
            original_query,
            self.refine_query(original_query),
            f"Explain {original_query}",
            f"Definition of {original_query}"
        ]
        
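        # Track the best (smallest) distance seen for each document across all queries
        best_distance = {}
        
        for query in queries:
            query_embedding = self.embedding_model.encode([query])[0]
            results = collection.query(
                query_embeddings=[query_embedding.tolist()],
                n_results=top_k
            )
            
            for doc, distance in zip(results['documents'][0], results['distances'][0]):
                if doc not in best_distance or distance < best_distance[doc]:
                    best_distance[doc] = distance
        
        # Rank merged candidates by distance; truncating in query order would
        # return only the first query's results and ignore the variants
        ranked_docs = sorted(best_distance, key=best_distance.get)
        return ranked_docs[:top_k]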

def fine_tune_embedding_model(texts, labels, model_name='all-MiniLM-L6-v2', output_dir='./fine_tuned_model'):
    from sentence_transformers import SentenceTransformer, InputExample, losses
    from torch.utils.data import DataLoader
    
    model = SentenceTransformer(model_name)
    
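    # Each (text, label) pair is treated as a positive pair: `labels` holds the
    # passage or paraphrase that should end up close to the corresponding text
    train_examples = [
        InputExample(texts=[text, label]) 
        for text, label in zip(texts, labels)
    ]
    
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
    
    # MultipleNegativesRankingLoss fits positive-only pairs. CosineSimilarityLoss needs a
    # float similarity score per pair; without one, every pair is trained toward 0
    train_loss = losses.MultipleNegativesRankingLoss(model)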
    
    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=3,
        warmup_steps=100,
        output_path=output_dir,
        show_progress_bar=True
    )
    
    print(f"Model fine-tuned and saved to {output_dir}")
    
    return model

# Example usage: Query refinement agent
agent = QueryRefinementAgent(embedding_model=model)
refined_query = agent.refine_query("RAG")
print(f"Original: 'RAG' -> Refined: '{refined_query}'")

# Example usage: Multi-query retrieval
results = agent.multi_query_retrieval("What is RAG?", collection, top_k=5)
print(f"Retrieved {len(results)} documents using multi-query approach")

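# Example usage: Fine-tuning (requires paired training data)
# Placeholder only: each text is paired with itself so the cell runs end to end.
# Replace sample_labels with passages or paraphrases that should match each text.
sample_texts = data['text'].tolist()[:100]
sample_labels = data['text'].tolist()[:100]
fine_tuned_model = fine_tune_embedding_model(sample_texts, sample_labels)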
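
When several query variants each return a ranked list, the lists need to be merged into one ranking. One widely used option, different from the merging done in the class above, is reciprocal rank fusion (RRF), which rewards documents that rank well across many lists. The sketch below is a minimal illustration: the function name and document IDs are hypothetical, and `k=60` is the conventional smoothing constant.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60, top_k=5):
    """Merge ranked lists by summing 1 / (k + rank) for each document."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Highest fused score first
    fused = sorted(scores, key=lambda doc: scores[doc], reverse=True)
    return fused[:top_k]

# Invented rankings, one per query variant
rankings = [
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_d", "doc_a"],
    ["doc_b", "doc_a", "doc_e"],
]
fused_top3 = reciprocal_rank_fusion(rankings, top_k=3)
print(fused_top3)
# ['doc_b', 'doc_a', 'doc_d']
```

RRF uses only rank positions, so it works even when the distances returned for different queries are not directly comparable.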

## 7. Testing, Validation, and Performance Evaluation

Evaluate RAG system performance using multiple metrics and test cases:

In [None]:
import time
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

class RAGEvaluator:
    """
    Comprehensive evaluation framework for RAG systems.
    
    Measures retrieval quality, generation accuracy, and system performance
    using industry-standard metrics.
    """
    
    def __init__(self, pipeline):
        """
        Initialize evaluator with a RAG pipeline.
        
        Args:
            pipeline (RAGPipeline): RAG pipeline to evaluate
        """
        self.pipeline = pipeline
    
    def evaluate_retrieval_quality(self, test_queries, ground_truth_docs):
        precisions = []
        recalls = []
        reciprocal_ranks = []
        
        for query, relevant_docs in zip(test_queries, ground_truth_docs):
            retrieved = self.pipeline.retrieve(query)
            
            relevant_retrieved = sum(1 for doc in retrieved if doc in relevant_docs)
            precision = relevant_retrieved / len(retrieved) if retrieved else 0
            precisions.append(precision)
            
            recall = relevant_retrieved / len(relevant_docs) if relevant_docs else 0
            recalls.append(recall)
            
            rank = next((i + 1 for i, doc in enumerate(retrieved) if doc in relevant_docs), 0)
            reciprocal_ranks.append(1 / rank if rank > 0 else 0)
        
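        mean_precision = np.mean(precisions)
        mean_recall = np.mean(recalls)
        # F1 is computed from the averaged precision and recall, not averaged per query
        denominator = mean_precision + mean_recall
        f1_score = 2 * mean_precision * mean_recall / denominator if denominator > 0 else 0
        
        return {
            'precision': mean_precision,
            'recall': mean_recall,
            'mrr': np.mean(reciprocal_ranks),
            'f1_score': f1_score
        }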
    
    def evaluate_generation_quality(self, test_queries, reference_answers):
        similarities = []
        
        for query, reference in zip(test_queries, reference_answers):
            response = self.pipeline.query(query)
            generated = response['answer']
            
            gen_embedding = self.pipeline.embedding_model.encode([generated])
            ref_embedding = self.pipeline.embedding_model.encode([reference])
            
            similarity = cosine_similarity(gen_embedding, ref_embedding)[0][0]
            similarities.append(similarity)
        
        return {
            'avg_semantic_similarity': np.mean(similarities),
            'min_similarity': np.min(similarities),
            'max_similarity': np.max(similarities)
        }
    
    def evaluate_latency(self, test_queries, num_runs=10):
        latencies = []
        
        for query in test_queries:
            query_latencies = []
            
            for _ in range(num_runs):
                # perf_counter is a monotonic, high-resolution clock suited to timing
                start_time = time.perf_counter()
                self.pipeline.query(query)
                end_time = time.perf_counter()
                
                latency_ms = (end_time - start_time) * 1000
                query_latencies.append(latency_ms)
            
            latencies.append(np.median(query_latencies))
        
        return {
            'avg_latency_ms': np.mean(latencies),
            'p50_latency_ms': np.percentile(latencies, 50),
            'p95_latency_ms': np.percentile(latencies, 95),
            'p99_latency_ms': np.percentile(latencies, 99)
        }
    
    def run_full_evaluation(self, test_data):
        print("Starting RAG system evaluation...")
        
        print("Evaluating retrieval quality...")
        retrieval_metrics = self.evaluate_retrieval_quality(
            test_data['queries'],
            test_data['ground_truth_docs']
        )
        
        print("Evaluating generation quality...")
        generation_metrics = self.evaluate_generation_quality(
            test_data['queries'],
            test_data['reference_answers']
        )
        
        print("Evaluating system latency...")
        latency_metrics = self.evaluate_latency(test_data['queries'])
        
        results = {
            'retrieval': retrieval_metrics,
            'generation': generation_metrics,
            'latency': latency_metrics
        }
        
        return results

def print_evaluation_results(results):
    print("\n" + "="*50)
    print("RAG SYSTEM EVALUATION RESULTS")
    print("="*50)
    
    print("\nRetrieval Metrics:")
    print(f"  Precision: {results['retrieval']['precision']:.3f} (higher is better, max 1.0)")
    print(f"  Recall: {results['retrieval']['recall']:.3f} (higher is better, max 1.0)")
    print(f"  F1 Score: {results['retrieval']['f1_score']:.3f} (harmonic mean of precision and recall)")
    print(f"  MRR: {results['retrieval']['mrr']:.3f} (Mean Reciprocal Rank, higher is better)")
    
    print("\nGeneration Metrics:")
    print(f"  Avg Semantic Similarity: {results['generation']['avg_semantic_similarity']:.3f}")
    print(f"  Min Similarity: {results['generation']['min_similarity']:.3f}")
    print(f"  Max Similarity: {results['generation']['max_similarity']:.3f}")
    
    print("\nLatency Metrics:")
    print(f"  Average Latency: {results['latency']['avg_latency_ms']:.2f} ms")
    print(f"  P50 Latency: {results['latency']['p50_latency_ms']:.2f} ms")
    print(f"  P95 Latency: {results['latency']['p95_latency_ms']:.2f} ms")
    print(f"  P99 Latency: {results['latency']['p99_latency_ms']:.2f} ms")
    print("="*50 + "\n")

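# Example usage: Prepare test data
# The ground-truth documents below are placeholders picked by position from the dataset;
# replace them with documents you have judged relevant to each query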
test_data = {
    'queries': [
        "What is RAG?",
        "How does semantic search work?",
        "What are vector databases?"
    ],
    'ground_truth_docs': [
        [data['text'].tolist()[0], data['text'].tolist()[1]],
        [data['text'].tolist()[2]],
        [data['text'].tolist()[3], data['text'].tolist()[4]]
    ],
    'reference_answers': [
        "RAG is Retrieval-Augmented Generation, combining retrieval with language models.",
        "Semantic search uses embeddings to find semantically similar content.",
        "Vector databases store and retrieve high-dimensional embeddings efficiently."
    ]
}

# Run evaluation
evaluator = RAGEvaluator(pipeline)
results = evaluator.run_full_evaluation(test_data)

# Display results with interpretation
print_evaluation_results(results)
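
To make the retrieval metrics concrete, here is a small worked example for a single query in plain Python. The document IDs are invented and `retrieval_metrics` is a hypothetical helper; it applies the same definitions of precision, recall, and reciprocal rank as the evaluator above.

```python
def retrieval_metrics(retrieved, relevant):
    """Precision, recall, and reciprocal rank for one query."""
    relevant = set(relevant)
    hits = [doc in relevant for doc in retrieved]
    precision = sum(hits) / len(retrieved) if retrieved else 0.0
    recall = sum(hits) / len(relevant) if relevant else 0.0
    # Rank (1-based) of the first relevant document, or 0 if none was retrieved
    first_hit_rank = next((i + 1 for i, hit in enumerate(hits) if hit), 0)
    reciprocal_rank = 1.0 / first_hit_rank if first_hit_rank else 0.0
    return precision, recall, reciprocal_rank

retrieved = ["doc_7", "doc_2", "doc_9"]   # what the system returned, in order
relevant = ["doc_2", "doc_5"]             # what a human judged relevant

precision, recall, rr = retrieval_metrics(retrieved, relevant)
print(f"precision={precision:.3f} recall={recall:.3f} reciprocal_rank={rr:.3f}")
# precision=0.333 recall=0.500 reciprocal_rank=0.500
```

One of three retrieved documents is relevant (precision 1/3), one of two relevant documents was found (recall 1/2), and the first relevant document sits at rank 2 (reciprocal rank 1/2). MRR is the mean of the reciprocal rank over all test queries.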

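This guide walked through setting up a RAG system: preparing data, generating embeddings, storing them in ChromaDB, connecting retrieval to a language model, refining queries, and evaluating retrieval quality, generation quality, and latency. GPT-2 keeps the example small; for a production system, plan to swap in a stronger generation model and evaluate against a labeled test set from your own domain.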