
# # 📚 RAG System with LangChain, ChromaDB, and Gemini 2
# This notebook implements a simple Retrieval-Augmented Generation (RAG) system.
# It uses ChromaDB for document storage, LangChain for workflow management, and the `gemini-2.0-flash` model.


# ## 📥 Import libraries

In [1]:
import os

from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.memory import ConversationBufferMemory
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.chains import ConversationalRetrievalChain

# Alternative embedding backends (disabled):
#from langchain.embeddings import GooglePalmEmbeddings
#from langchain.embeddings import OpenAIEmbeddings
#from langchain_huggingface import HuggingFaceEmbeddings

USER_AGENT environment variable not set, consider setting it to identify your requests.


In [2]:
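# Identify the HTTP requests made by WebBaseLoader.
# The warning above appears when the loader module is imported, so this
# assignment may need to run before that import to take effect.
os.environ["USER_AGENT"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"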

# ## 🌐 Step 1: Load the Wikipedia page

In [3]:
url = "https://en.wikipedia.org/wiki/2025_in_science"
loader = WebBaseLoader(url)
documents = loader.load()
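
# To make the loading step more concrete, here is a minimal standard-library sketch of turning HTML into plain text.
# It is only an illustration, not how `WebBaseLoader` is implemented; the names `TextExtractor`, `html_to_text`, and `sample_html` are hypothetical and the HTML snippet is made up.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text and skip the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Made-up HTML standing in for a downloaded page.
sample_html = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>2025 in science</h1><p>Some <b>events</b> happened.</p>"
    "<script>var x = 1;</script></body></html>"
)
print(html_to_text(sample_html))  # 2025 in science Some events happened.
```

# In the notebook itself, `documents` holds the loader's result; each entry carries the extracted text plus metadata such as the source URL.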

# ## ✂️ Step 2: Split the text into chunks

In [4]:
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100
)
chunks = splitter.split_documents(documents)
print(f"Number of chunks: {len(chunks)}")

Number of chunks: 89
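
# The effect of `chunk_size=500` and `chunk_overlap=100` can be illustrated with a simplified sliding window.
# This is a sketch under a strong assumption: it cuts at fixed character offsets, whereas `RecursiveCharacterTextSplitter` prefers to break at separators such as paragraphs and sentences. The function `sliding_window_chunks` and the variable `demo_chunks` are hypothetical names.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    result = []
    for start in range(0, len(text), step):
        result.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return result


# Synthetic 1200-character text, not the Wikipedia page.
demo_text = "".join(str(i % 10) for i in range(1200))
demo_chunks = sliding_window_chunks(demo_text, chunk_size=500, chunk_overlap=100)

print([len(c) for c in demo_chunks])              # [500, 500, 400]
print(demo_chunks[0][-100:] == demo_chunks[1][:100])  # True
```

# The overlap repeats the tail of one chunk at the head of the next, so a sentence that straddles a boundary still appears whole in at least one chunk.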


# ## 🧠 Step 3: Vectorization and storage in ChromaDB

In [7]:
embedding = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")



In [8]:
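#embedding = GooglePalmEmbeddings()
#embedding = OpenAIEmbeddings()

persist_directory = "chroma_db"
if os.path.exists(persist_directory) and os.listdir(persist_directory):
    # Reuse the index already on disk; the chunks are not re-embedded in this branch.
    vectorstore = Chroma(persist_directory=persist_directory, embedding_function=embedding)
else:
    # First run: embed all chunks and write the index to disk.
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embedding,
        persist_directory=persist_directory
    )
    # Explicit persist for older Chroma versions; newer ones persist automatically.
    vectorstore.persist()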
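
# What the vector store does at query time can be sketched with a few toy vectors: embed the query, score every stored chunk, and return the best matches.
# This is an illustration only. Cosine similarity is one common scoring choice, and the metric Chroma actually applies depends on its configuration. The 3-dimensional vectors, the chunk labels, and the names `cosine_similarity`, `top_k`, and `toy_store` are all made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the texts of the k stored entries most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy "embeddings" standing in for the real model's much longer vectors.
toy_store = [
    ("chunk about Saturn's moons", [0.9, 0.1, 0.0]),
    ("chunk about NIH funding", [0.1, 0.9, 0.0]),
    ("chunk about fusion energy", [0.0, 0.2, 0.9]),
]
query = [0.8, 0.2, 0.1]

print(top_k(query, toy_store, k=2))
# ["chunk about Saturn's moons", 'chunk about NIH funding']
```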




In [22]:
# Optional: record traces of the chain runs under this LangSmith project.
from langchain.callbacks.tracers import LangChainTracer

os.environ["LANGCHAIN_PROJECT"] = "RAG Wikipedia 2025"
os.environ["LANGCHAIN_TRACING_V2"] = "true"

tracer = LangChainTracer(project_name="RAG Wikipedia 2025")
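
# Tracing and the Gemini model both need credentials, which the notebook does not show being set.
# The following sketch checks for them up front. The variable names `GOOGLE_API_KEY` and `LANGCHAIN_API_KEY` are assumptions about this setup, and `missing_env_vars` is a hypothetical helper.

```python
import os


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in env (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Assumed credential variables; adjust to whatever your setup actually uses.
required = ["GOOGLE_API_KEY", "LANGCHAIN_API_KEY"]
missing = missing_env_vars(required)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

# Failing early here is cheaper than discovering a missing key only when the first question is sent to the chain.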



# ## 💬 Step 4: Dialogue system with memory

In [23]:
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)

# Retriever over the vector store (default search settings) and the Gemini chat model.
retriever = vectorstore.as_retriever()
llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash")

rag_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    memory=memory, 
    callbacks=[tracer]
)
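
# The role of the memory can be sketched without any LLM: earlier turns are kept in a buffer, and a follow-up question is combined with that history so it can be rewritten as a standalone question before retrieval.
# This is a simplified illustration, not LangChain's implementation. `SimpleBufferMemory`, `build_condense_prompt`, the prompt wording, and the shortened first question are all hypothetical.

```python
class SimpleBufferMemory:
    """Keep every question/answer pair of the conversation in order."""

    def __init__(self):
        self.messages = []  # list of (role, text) tuples

    def save_turn(self, question, answer):
        self.messages.append(("human", question))
        self.messages.append(("ai", answer))

    def as_text(self):
        return "\n".join(f"{role}: {text}" for role, text in self.messages)


def build_condense_prompt(memory, follow_up):
    """Build a prompt asking a model to rewrite follow_up as a standalone question."""
    return (
        "Given the conversation below, rephrase the follow-up question "
        "as a standalone question.\n\n"
        f"Chat history:\n{memory.as_text()}\n\n"
        f"Follow-up question: {follow_up}\n"
        "Standalone question:"
    )


demo_memory = SimpleBufferMemory()
demo_memory.save_turn("When was the NIH freeze imposed?", "On January 22, 2025.")
demo_prompt = build_condense_prompt(
    demo_memory,
    "How much of the institute's operations were affected by this decision?",
)
print(demo_prompt)
```

# Without the history, "this decision" in the follow-up would have nothing to refer to, and the retriever would search on an ambiguous query.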

# ## 🧪 Step 5: Example dialogue

In [24]:

fragen = [
    "When did the second Trump administration impose an immediate freeze on research grants, communications, hiring, and meetings at the National Institutes of Health?",
    "How much of the institute's operations were affected by this decision?",
    "When did astronomers report the discovery of Saturn's new moons?",
    "Which countries' telescopes were used? How many new moons of Saturn were discovered?",
    "What is the total number of confirmed satellites of Saturn currently known?" 
]

for frage in fragen:
    # The chain returns a dict; the generated text is under the "answer" key.
    antwort = rag_chain.invoke({"question": frage})["answer"]
    print(f"\n🙋 Question: {frage}\n🤖 Answer: {antwort}")


🙋 Question: When did the second Trump administration impose an immediate freeze on research grants, communications, hiring, and meetings at the National Institutes of Health?
🤖 Answer: According to the provided text, the second Trump administration imposed an immediate freeze on scientific grants, communications, hiring, and meetings at the National Institutes of Health (NIH) on January 22, 2025.

🙋 Question: How much of the institute's operations were affected by this decision?
🤖 Answer: The decision impacted $47.4 billion worth of activities at the National Institutes of Health (NIH).

🙋 Question: When did astronomers report the discovery of Saturn's new moons?
🤖 Answer: Astronomers reported the discovery of 128 new moons of Saturn on March 11, 2025.

🙋 Question: Which countries' telescopes were used? How many new moons of Saturn were discovered?
🤖 Answer: The new moons of Saturn were discovered by astronomers using the Canada-France-Hawaii Telescope, and there were 128 new moons discovered