# 📓 The GenAI Revolution Cookbook

**Title:** Master LangChain: Build a RAG-Based Question Answering App

**Description:** Unlock the power of LangChain for precise question answering. Learn to integrate retrieval-augmented generation with real-world data sources in this step-by-step guide.

---

*This Jupyter notebook contains executable code examples. Run the cells below to try the code yourself.*



# Building a Retrieval-Augmented Generation (RAG) System with LangChain and ChromaDB

## Introduction

Retrieval-Augmented Generation (RAG) improves a language model's answers by retrieving relevant text from an external store and passing it to the model as context. This tutorial walks through a small RAG system built with LangChain and ChromaDB: embedding documents, storing them in a vector database, retrieving the closest match for a question, and generating an answer from it. By the end, you'll understand how to connect a language model to a vector database for tasks such as question answering.

## Installation

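To get started, install the required libraries by running the following command in a code cell. The `sentence-transformers` package is included because the ChromaDB embedding function used later depends on it:

In [7]:
!pip install langchain transformers torch chromadb langchain-openai sentence-transformers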


## Project Setup

Before diving into the code, ensure you have the following prerequisites:

- An OpenAI API key for accessing the `gpt-3.5-turbo` model used below.
- A data source for creating embeddings.

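The code in this notebook reads the key from a Colab secret named `OPENAI_API_KEY`. Outside Colab, store the key in an environment variable of the same name rather than hard-coding it.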
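
One way to support both Colab and local runs is a small helper that tries Colab secrets first and then falls back to the environment. This is a minimal sketch: the helper name `get_openai_api_key` is hypothetical, and the fallback order is an assumption rather than something the notebook prescribes.

```python
import os


def get_openai_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key from Colab secrets if available, else from the environment."""
    try:
        # google.colab only exists inside a Colab runtime
        from google.colab import userdata
        key = userdata.get(env_var)
        if key:
            return key
    except Exception:
        # Not running in Colab, or the secret is missing or not shared with this notebook
        pass
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable or Colab secret.")
    return key
```

The returned value could then be passed as `api_key` when the language model is created later in the notebook.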

## Step-by-Step Build

### Data Ingestion and Embedding Creation

First, we'll ingest data and create embeddings for storage in a vector database.

In [3]:
from transformers import AutoTokenizer, AutoModel
import torch

def load_data(source):
    """Load data from the specified source.

    Args:
        source (str): The path or identifier for the data source.

    Returns:
        list: A list of text data loaded from the source.
    """
    # Placeholder for data loading logic
    return ["Sample text 1", "Sample text 2"]

# Load your data
data = load_data('your_data_source')

# Initialize tokenizer and model for embedding creation
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

# Preprocess and create embeddings
embeddings = []
for text in data:
    # Tokenize the text (truncated to BERT's 512-token limit) and mean-pool the token embeddings
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
    with torch.no_grad():
        embedding = model(**inputs).last_hidden_state.mean(dim=1)
    embeddings.append(embedding)
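
The loop above embeds one text at a time, so `mean(dim=1)` averages only real tokens. If texts were batched with padding instead, the padded positions would enter the average unless they are masked out. The plain-Python sketch below illustrates the masked mean on toy two-dimensional vectors; `masked_mean` is a hypothetical helper, not part of `transformers`.

```python
def masked_mean(token_vectors, attention_mask):
    """Average token vectors, ignoring positions where the mask is 0."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy example: three real tokens followed by one padding position
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
mask = [1, 1, 1, 0]

pooled = masked_mean(tokens, mask)         # averages the first three rows only
naive = masked_mean(tokens, [1, 1, 1, 1])  # the padding row drags the mean down

print(pooled)  # [3.0, 4.0]
print(naive)   # [2.25, 3.0]
```

With tensors, the same idea is usually expressed by multiplying the hidden states by the tokenizer's attention mask before summing and dividing by the mask total.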


### Language Model Integration

Integrate a language model using LangChain to handle the generation aspect of RAG. For those interested in tailoring language models to specific domains, our article on [fine-tuning large language models for domain-specific applications](/blog/44830763/mastering-fine-tuning-of-large-language-models-for-domain-applications) provides valuable insights.

In [17]:
from langchain.chains import LLMChain
from langchain_openai import ChatOpenAI
from langchain.prompts import PromptTemplate
from google.colab import userdata

# Read the OpenAI API key from Colab secrets instead of hard-coding it
openai_api_key = userdata.get('OPENAI_API_KEY')

# Initialize the ChatOpenAI language model
llm = ChatOpenAI(model="gpt-3.5-turbo", api_key=openai_api_key)

# Define a simple prompt template
template = """Use the following context to answer the question:
{context}

Question: {question}

Answer:"""
prompt = PromptTemplate.from_template(template)

# Initialize the LLMChain with the language model and prompt
llm_chain = LLMChain(llm=llm, prompt=prompt)
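
To see the text the model actually receives, the template can be filled by hand. The sketch below uses Python's built-in `str.format`, which handles the same `{context}` and `{question}` placeholders; the context sentence is an illustrative value, not data from this notebook.

```python
template = """Use the following context to answer the question:
{context}

Question: {question}

Answer:"""

# str.format fills the same {context} and {question} placeholders
rendered = template.format(
    context="Paris is the capital of France.",
    question="What is the capital of France?",
)
print(rendered)
```

Printing the rendered prompt this way is a quick check that retrieved context lands where you expect before any API call is made.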

### Vector Database Setup

Set up a vector database for storing and querying embeddings.
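
Conceptually, a vector store answers a query by finding the stored vectors closest to the query vector. The toy sketch below does this with a brute-force scan and squared Euclidean distance. The function names and vectors are hypothetical, and ChromaDB itself relies on an approximate nearest-neighbor index rather than a linear scan.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query_vec, stored, k=1):
    """Return the k (doc_id, distance) pairs closest to query_vec."""
    scored = [(doc_id, squared_l2(query_vec, vec)) for doc_id, vec in stored.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy two-dimensional "embeddings"
stored = {"doc_1": [0.0, 1.0], "doc_2": [1.0, 0.0], "doc_3": [0.9, 0.1]}
top = nearest([1.0, 0.2], stored, k=2)
print([doc_id for doc_id, _ in top])  # ['doc_3', 'doc_2']
```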

In [14]:
import chromadb
from chromadb.utils import embedding_functions

# Initialize an in-memory ChromaDB client
client = chromadb.Client()

# Wrap the same bert-base-uncased checkpoint so that query texts are embedded
# with the model used for the documents (requires the sentence-transformers package)
embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="bert-base-uncased")

# Delete the collection if it already exists to avoid an embedding function conflict
try:
    client.delete_collection("my_rag_collection")
    print("Deleted existing collection 'my_rag_collection'.")
except Exception:
    # Raised when the collection does not exist yet
    print("Collection 'my_rag_collection' did not exist or could not be deleted.")

# Get or create a collection with the specified embedding function
collection = client.get_or_create_collection("my_rag_collection", embedding_function=embedding_function)
print("Created or got collection 'my_rag_collection'.")

# Store embeddings in the vector database.
# ChromaDB requires a unique id for each entry. The documents here mirror the
# sample data loaded earlier; replace doc_texts and doc_ids with your own
# document texts and identifiers.
doc_texts = ["Sample text 1", "Sample text 2"]
doc_ids = ["doc_1", "doc_2"]

# Convert each (1, hidden_size) tensor to a flat list of floats for ChromaDB
embeddings_list = [embedding.squeeze().tolist() for embedding in embeddings]

# Add documents and embeddings to the collection.
# We pass the pre-computed embeddings directly. To let ChromaDB compute them
# with the collection's embedding function instead, omit the embeddings argument:
# collection.add(documents=doc_texts, ids=doc_ids)
collection.add(
    embeddings=embeddings_list,
    documents=doc_texts,
    ids=doc_ids
)

print(f"Added {len(doc_ids)} documents to the collection.")

# Query the database with a sample query
# Replace 'your_query' with an actual query string
query_text = 'your_query'
# When querying, ChromaDB will use the collection's embedding function to embed the query text
query_result = collection.query(query_texts=[query_text], n_results=1)

print("Query Result:", query_result)

Deleted existing collection 'my_rag_collection'.
Created or got collection 'my_rag_collection'.
Added 2 documents to the collection.
Query Result: {'ids': [['doc_1']], 'embeddings': None, 'documents': [['Sample text 1']], 'uris': None, 'included': ['metadatas', 'documents', 'distances'], 'data': None, 'metadatas': [[None]], 'distances': [[0.4598293900489807]]}
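
Each field in the query result is a list of lists: the outer list has one entry per query text, and the inner list has one entry per returned document. The sketch below flattens the result for a single query into a list of records; `hits_for_query` is a hypothetical helper, and the dictionary is copied from the output above with the unused fields left out.

```python
query_result = {
    "ids": [["doc_1"]],
    "documents": [["Sample text 1"]],
    "metadatas": [[None]],
    "distances": [[0.4598293900489807]],
}


def hits_for_query(result, query_index=0):
    """Flatten one query's results into a list of dicts."""
    return [
        {"id": doc_id, "document": doc, "distance": dist}
        for doc_id, doc, dist in zip(
            result["ids"][query_index],
            result["documents"][query_index],
            result["distances"][query_index],
        )
    ]


hits = hits_for_query(query_result)
print(hits)
```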


### Full End-to-End Application

Now, let's put all components together to build a complete RAG application.

In [19]:
def answer_question(question):
    """Answer a question using retrieval-augmented generation.

    Args:
        question (str): The question to be answered.

    Returns:
        str: The generated answer to the question.
    """
    # Retrieve the most relevant document from the 'collection' created in the previous cell
    retrieved_result = collection.query(query_texts=[question], n_results=1)
    documents = retrieved_result.get("documents") or [[]]
    # Fall back to a notice when the collection returns no documents
    retrieved_docs = documents[0] or ["No relevant information found."]

    # Pass the retrieved context and the question to the LLMChain
    response = llm_chain.invoke({"context": "\n".join(retrieved_docs), "question": question})

    # LLMChain.invoke returns a dict; the generated text is stored under the 'text' key
    return response['text']

# Example usage of the question-answering function
print(answer_question("What is the capital of France?"))

The capital of France is Paris.
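
The function above passes the top document to the model regardless of how close it is to the question. One possible refinement is to request several results and drop those beyond a distance cutoff before building the context. The sketch below assumes a hypothetical helper `build_context`; the `max_distance` default of 1.0 is arbitrary and would need tuning for a real embedding model, and the documents and distances are made-up inputs.

```python
def build_context(documents, distances, max_distance=1.0, separator="\n---\n"):
    """Join documents whose distance is within max_distance; fall back to a notice."""
    kept = [doc for doc, dist in zip(documents, distances) if dist <= max_distance]
    if not kept:
        return "No relevant information found."
    return separator.join(kept)


# Made-up retrieval output for illustration
docs = ["Sample text 1", "Sample text 2", "Unrelated text"]
dists = [0.46, 0.80, 1.70]

context = build_context(docs, dists)
print(context)
```

Inside `answer_question`, the inputs would come from `retrieved_result['documents'][0]` and `retrieved_result['distances'][0]` after raising `n_results`.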


### Testing & Validation

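Test the application with several queries. Because the collection holds only the placeholder sample texts, the answers below come from the model's general knowledge rather than from retrieved context; load real documents to exercise the retrieval step.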

In [21]:
# Test cases for the question-answering application
test_queries = [
    "What is the capital of France?",
    "Why is the sky blue?"
]

for query in test_queries:
    print(f"Query: {query}")
    print(f"Response: {answer_question(query)}")

Query: What is the capital of France?
Response: The capital of France is Paris.
Query: Why is the sky blue?
Response: The sky appears blue because of the way the Earth's atmosphere scatters sunlight. Blue light is scattered in all directions by the gases and particles in the atmosphere, which is why we see the sky as blue.
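
Printed responses have to be checked by eye. A lightweight alternative is to assert that each answer mentions the keywords you expect. The sketch below is one way to do that; `run_checks` and `fake_answer` are hypothetical names, and `fake_answer` stands in for `answer_question` so the example runs without an API key.

```python
def run_checks(answer_fn, cases):
    """Return (query, missing_keywords) pairs for answers lacking an expected keyword."""
    failures = []
    for query, keywords in cases.items():
        answer = answer_fn(query).lower()
        missing = [kw for kw in keywords if kw.lower() not in answer]
        if missing:
            failures.append((query, missing))
    return failures


def fake_answer(question):
    # Stand-in for answer_question so the sketch runs without an API key
    canned = {"What is the capital of France?": "The capital of France is Paris."}
    return canned.get(question, "I don't know.")


cases = {
    "What is the capital of France?": ["Paris"],
    "Why is the sky blue?": ["scatter"],
}

failures = run_checks(fake_answer, cases)
print(failures)  # [('Why is the sky blue?', ['scatter'])]
```

Swapping `fake_answer` for `answer_question` would run the same checks against the real pipeline. Keyword matching is a coarse check, so treat a pass as a smoke test rather than proof of correctness.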


## Conclusion

In this tutorial, we built a small RAG system with LangChain and ChromaDB: we embedded documents, stored them in a vector database, retrieved the closest match for a question, and passed it to a language model as context. From here, consider replacing the placeholder data with your own documents, retrieving more than one result per question, and measuring answer quality and latency as the collection grows.