<a href="https://colab.research.google.com/github/NataHsH/GenerativeAI-II-Project/blob/Nataliia_Honcharova/RAG_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📚 RAG System with LangChain, ChromaDB, and Gemini 2.0
This notebook implements a simple Retrieval-Augmented Generation (RAG) system. It uses ChromaDB for document storage, LangChain for workflow management, and the gemini-2.0-flash model for natural language generation. The system is designed to retrieve relevant documents from a knowledge base and generate human-like responses to user queries based on that data.

Key components:

- ChromaDB: A vector database for storing and retrieving document embeddings.
- LangChain: A framework for building custom chains and workflows.
- Gemini 2.0 (gemini-2.0-flash): The chat model that generates the answer from the retrieved context.

# 📦 Install Necessary Libraries

In [1]:
!pip install --quiet --upgrade langchain langchain-community langchain-text-splitters langgraph
!pip install --quiet --upgrade chromadb langsmith langchain-google-genai gradio
# Needed by HuggingFaceEmbeddings and by the Wikipedia scraping cell.
!pip install --quiet sentence-transformers beautifulsoup4 requests


In [2]:
import os
from google.colab import userdata
from langchain.chat_models import init_chat_model
# The embedding and vector store integrations live in langchain-community.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langsmith import Client
from langchain_core.prompts import PromptTemplate
from langchain.chains import RetrievalQA
from langgraph.graph import START, StateGraph


# 🔧 Set Environment Variables
Set up API keys for LangSmith and other services.


In [3]:
import os
from google.colab import userdata

os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = userdata.get("LANGSMITH_API_KEY")
os.environ["TAVILY_API_KEY"] = userdata.get("TAVILY_API_KEY")
os.environ["GOOGLE_API_KEY"] = userdata.get("GOOGLE_API_KEY")
os.environ["HUGGINGFACE_API_KEY"] = userdata.get("HUGGINGFACE_API_KEY")

# 📝 Extract Text from Wikipedia
Scrape and parse text from the selected Wikipedia page.


In [4]:
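import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/2025_stock_market_crash"
# Fail fast on network problems or a non-2xx response instead of parsing an error page.
response = requests.get(url, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.content, "html.parser")

# Keep headings and paragraphs in document order; skip empty tags.
content = []
for tag in soup.find_all(["h1", "h2", "h3", "p"]):
    text = tag.get_text().strip()
    if text:
        content.append(text)

document_text = "\n".join(content)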
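The chunks printed at the end of this notebook still carry footnote markers such as `[1]` and `[37]` from the page markup. A minimal sketch of one way to strip them before chunking is shown below. The helper name `strip_citation_markers` is hypothetical and not part of the original notebook, and the pattern only covers purely numeric markers.

```python
import re

# Matches bracketed numeric citation markers such as "[1]" or "[37]".
CITATION_PATTERN = re.compile(r"\[\d+\]")


def strip_citation_markers(text: str) -> str:
    """Remove numeric footnote markers left over from the page markup."""
    return CITATION_PATTERN.sub("", text)


sample = "pandemic.[1] It went into effect by midnight.[37]"
cleaned = strip_citation_markers(sample)
print(cleaned)
```

If this fits the use case, it could be applied as `document_text = strip_citation_markers(document_text)` before the text is split.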


# ⚙️ Automatic Text Chunking with RecursiveCharacterTextSplitter
Split the text into chunks for easier processing.


In [5]:
chunk_size = 500
chunk_overlap = 100

text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
chunks = text_splitter.split_text(document_text)

assert len(chunks) >= 50, f"Expected at least 50 chunks, but got {len(chunks)} chunks."
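To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and sentences); the function `chunk_with_overlap` is a hypothetical illustration of the sliding-window idea only.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


demo_text = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = chunk_with_overlap(demo_text, chunk_size=10, chunk_overlap=4)
print(demo_chunks)
```

Each chunk starts with the last four characters of the previous one, which is what lets a sentence that straddles a boundary still appear intact in at least one chunk.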


# 🔧 Store Document Chunks in Chroma Vector Store
Store the chunks in Chroma for fast retrieval.


In [6]:
embedding_function = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

vector_store = Chroma(
    collection_name="stock_market_crash_2025",
    embedding_function=embedding_function,
    persist_directory="./chroma_db"
)

_ = vector_store.add_documents(documents=[Document(page_content=chunk) for chunk in chunks])
vector_store.persist()
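What "fast retrieval" means here is a nearest-neighbor search over embedding vectors. The sketch below ranks a few toy vectors by cosine similarity. The vectors and texts are made up for illustration, and cosine similarity is an assumed metric: the distance function Chroma actually uses is configurable and may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=2):
    """Return the texts of the k stored vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy 3-dimensional "embeddings"; real ones have hundreds of dimensions.
toy_store = [
    ("tariffs announced", [1.0, 0.0, 0.0]),
    ("bond market sell-off", [0.0, 1.0, 0.0]),
    ("tariff retaliation", [0.9, 0.1, 0.0]),
]
nearest = top_k([1.0, 0.0, 0.0], toy_store, k=2)
print(nearest)
```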



# 🛠️ Implement LangSmith Logging
Log interactions through LangSmith to track system behavior.


In [7]:
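from datetime import datetime, timezone

logger = Client(api_key=os.getenv("LANGSMITH_API_KEY"))

def log_interaction(input_text: str, output_text: str):
    # langsmith.Client has no log() method; record the exchange as a completed run instead.
    now = datetime.now(timezone.utc)
    logger.create_run(
        name="qa_interaction",
        run_type="chain",
        inputs={"question": input_text},
        outputs={"answer": output_text},
        start_time=now,
        end_time=now,
        extra={"metadata": {"project": "RAG System", "phase": "QA Chain"}},
    )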


# 🔍 Metadata Filtering Implementation
Implement metadata filtering to refine retrieval results.


In [8]:
from typing import Dict

def retrieve_with_metadata(question: str, metadata: Dict):
    # Chroma's similarity_search takes the metadata constraint as `filter`.
    results = vector_store.similarity_search(question, filter=metadata)
    return results
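A metadata filter only helps if the stored documents carry metadata. The chunks added earlier are created with `Document(page_content=chunk)`, so their metadata is empty and any non-empty filter would match nothing. The sketch below shows the equality-matching idea on plain dictionaries. The `section` key and its values are hypothetical and not present in the original notebook.

```python
def matches_filter(metadata: dict, wanted: dict) -> bool:
    """True when every key in `wanted` is present in `metadata` with the same value."""
    return all(metadata.get(key) == value for key, value in wanted.items())


# Toy documents; the "section" labels are assumed for illustration.
toy_docs = [
    {"page_content": "Tariffs were announced on April 2.", "metadata": {"section": "Background"}},
    {"page_content": "The S&P 500 lost 6.65% on April 3.", "metadata": {"section": "Timeline"}},
    {"page_content": "The Dow fell 1,679 points.", "metadata": {"section": "Timeline"}},
    {"page_content": "A chunk stored without metadata.", "metadata": {}},
]

timeline_docs = [d for d in toy_docs if matches_filter(d["metadata"], {"section": "Timeline"})]
print([d["page_content"] for d in timeline_docs])
```

In the notebook itself, the equivalent step would be to pass a `metadata` dictionary when building each `Document`, so that the filter has something to match against.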


# 🔁 Retrieve and Generate Functions  
Implement functions to retrieve relevant documents and generate answers using the retrieved context.


In [60]:
from typing import TypedDict, List
from langchain_core.documents import Document

class State(TypedDict):
    question: str
    context: List[Document]
    answer: str


In [61]:
from langchain_core.prompts import PromptTemplate

qa_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template="""
Answer the question based only on the following context:

{context}

Question: {question}
Answer:"""
)


In [62]:
def retrieve(state: State) -> State:
    # Fetch the chunks most similar to the question.
    context = vector_store.similarity_search(state["question"])
    return {**state, "context": context}


def generate(state: State):
    # `llm` is created in the "Initialize LLM Model" cell, which must run before the graph is invoked.
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    prompt = qa_prompt.format(context=docs_content, question=state["question"])
    response = llm.invoke(prompt)
    return {"answer": response.content}




In [63]:
graph_builder = StateGraph(State)
graph_builder.add_node("retrieve", retrieve)
graph_builder.add_node("generate", generate)

graph_builder.set_entry_point("retrieve")
graph_builder.add_edge("retrieve", "generate")

graph = graph_builder.compile()
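Conceptually, the compiled graph runs `retrieve` and then `generate`, merging each node's returned keys into a shared state. The sketch below mimics that flow in plain Python with stub nodes. It is a simplification that assumes returned keys simply overwrite the matching keys in the state; `run_pipeline`, `fake_retrieve`, and `fake_generate` are hypothetical names, and no retrieval or model call takes place.

```python
def run_pipeline(initial_state: dict, steps: list) -> dict:
    """Run each step in order, merging the keys it returns into the shared state."""
    state = dict(initial_state)
    for step in steps:
        state.update(step(state))
    return state


def fake_retrieve(state: dict) -> dict:
    # Stand-in for the vector store lookup.
    return {"context": ["chunk about tariffs", "chunk about bonds"]}


def fake_generate(state: dict) -> dict:
    # Stand-in for the LLM call; it only echoes what it was given.
    joined = " | ".join(state["context"])
    return {"answer": f"Answer to '{state['question']}' using: {joined}"}


final_state = run_pipeline({"question": "What happened?"}, [fake_retrieve, fake_generate])
print(final_state["answer"])
```

The final state carries all three keys (`question`, `context`, `answer`), which is why the test loop later in the notebook can read both `result["context"]` and `result["answer"]`.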


# 🧠 Initialize LLM Model
Initialize the Gemini 2.0 model for answer generation.


In [64]:
llm = init_chat_model("gemini-2.0-flash", model_provider="google_genai")


# 🧐 Define Questions for Testing
Create meaningful questions for the system to answer, testing the retrieval mechanism.


In [67]:
questions = [
    "What were the primary causes of the 2025 stock market crash?",
    "How did global markets react to the announcement of new tariffs by President Trump on April 2, 2025?",
    "What was the impact of the bond market sell-off, and how did it differ from the stock market decline?",
    "How did the Trump administration respond to the market turmoil caused by the tariffs?",
    "Which countries and sectors were most affected by the 2025 tariffs imposed by the United States?",
]

In [68]:
for question in questions:
    result = graph.invoke({"question": question})
    print(f'Question: {question}')
    print(f'Context: {result["context"]}')
    print(f'Answer: {result["answer"]}\n\n')

Question: What were the primary causes of the 2025 stock market crash?
Context: [Document(metadata={}, page_content='Contents\n2025 stock market crash'), Document(metadata={}, page_content='across global stock markets, including those in the United States. It became the largest global market decline since the 2020 stock market crash, which occurred during the recession caused by the COVID-19 pandemic.[1]'), Document(metadata={}, page_content='On April 3, the Nasdaq Composite lost 1,600 points, the worst sell-off since the start of the COVID-19 pandemic. The S&P 500 lost 6.65% of its value on April 3, nearly initiating a trading curb. The Dow also fell 1,679 points or 3.98%. The Russell 2000 lead losses by falling 6.59%, entering a bear market.'), Document(metadata={}, page_content='which was expected to go into effect by midnight.[37] As a result, the Dow Jones index lost all of its morning gains and fell by around 300 points, as did the S&P 500 and Nasdaq, which erased their gains and