### LangChain Vectorstore RAG Implementation
---
This notebook demonstrates a Retrieval-Augmented Generation (RAG) system using LangChain with local models via Ollama. The implementation follows a multi-step reasoning process:

1. **Setup**: Loads two Ollama models (phi4-mini for reasoning and Gemma3:1b for synthesis) to handle different parts of the process.

2. **Question Analysis**: Uses the reasoning model to break down complex questions into logical sub-steps that can be individually researched.

3. **Document Processing**: Loads local `.txt`, `.csv`, and `.pdf` files (here, documents on U.S. presidents and space policy), splits them into chunks of roughly 500 characters, and creates vector embeddings using the nomic-embed-text model.

4. **Knowledge Retrieval**: For each identified reasoning step, performs a similarity search in the Chroma vectorstore to find the most relevant information from the knowledge base.

5. **Answer Synthesis**: Feeds the original question and all retrieved contextual information to the synthesis model, which generates a cohesive, factual response.

This approach aims to improve the quality of AI-generated answers by combining structured reasoning with targeted retrieval from a domain-specific knowledge base, so that responses can be grounded in the local documents rather than in the LLM's memory alone.
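To make the retrieval step more concrete, here is a minimal sketch of what a similarity search does conceptually: rank stored chunks by how close their vectors are to the query vector. It uses cosine similarity over hand-written three-dimensional vectors, which are purely illustrative assumptions; Chroma's actual distance metric and indexing may differ, and the function names `cosine_similarity` and `top_k` are not part of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed_chunks, k=1):
    """Return the k chunk texts whose vectors are most similar to query_vec."""
    ranked = sorted(
        indexed_chunks,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Hypothetical 3-dimensional "embeddings"; the real model in this notebook returns 768 dimensions.
toy_index = [
    ("Apollo 11 landed on the Moon in July 1969.", [0.9, 0.1, 0.0]),
    ("The Space Shuttle first flew in 1981.", [0.1, 0.9, 0.0]),
    ("Presidents are inaugurated on January 20.", [0.0, 0.1, 0.9]),
]
query_vector = [0.8, 0.2, 0.1]  # stand-in for the embedding of one reasoning step

best = top_k(query_vector, toy_index, k=1)
print(best)
```

With `k=1`, as in the lookup cell later in this notebook, only the single closest chunk is kept per reasoning step.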

#### To Do List

Amend the loader to read from the correct document store, i.e. not the one in the `src` folder.
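One possible way to approach this item is to resolve the document directory at runtime instead of hard-coding an absolute path. The sketch below is only a suggestion under stated assumptions: the environment variable name `RAG_DOCS_DIR`, the helper `resolve_docs_dir`, and the default folder name `docs` are all hypothetical and not part of this project.

```python
import os
from pathlib import Path


def resolve_docs_dir(default_subdir="docs", env_var="RAG_DOCS_DIR", start=None):
    """Pick the document directory without hard-coding an absolute path.

    Order of precedence (all names here are illustrative assumptions):
    1. the environment variable named by env_var, if it is set;
    2. the nearest default_subdir folder found by walking up from start,
       ignoring one that sits directly inside a folder called "src".
    """
    override = os.environ.get(env_var)
    if override:
        return Path(override).expanduser()

    current = Path(start) if start is not None else Path.cwd()
    for folder in [current, *current.parents]:
        candidate = folder / default_subdir
        # Skip a docs folder that lives directly inside src
        if candidate.is_dir() and folder.name != "src":
            return candidate

    # Nothing found: fall back to a path relative to the starting folder
    return current / default_subdir
```

The result could then be passed to `load_documents_from_directory(str(resolve_docs_dir()))` in the document-loading cell.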

In [18]:
# Install the required libraries from requirements.txt 
# pip install -r requirements.txt

In [19]:
try:
    # Import the built-in regular expressions module for pattern matching and text processing
    import re

    # Import the Ollama LLM class from the LangChain community package (often used for integrating local LLMs)
    from langchain_community.llms import Ollama

    # Import the PromptTemplate class used to define and structure prompts for LLMs
    # (imported from langchain_core, where the class is defined)
    from langchain_core.prompts import PromptTemplate

    # Import RunnableMap, a utility for composing and executing a sequence of runnable components
    from langchain_core.runnables import RunnableMap

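    # Import Chroma vector store, used for storing and searching vector embeddings (RAG retrieval)
    # (the langchain.vectorstores path is deprecated; the class lives in langchain_community)
    from langchain_community.vectorstores import Chroma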

    # Import Ollama-specific embeddings and LLM classes for use with LangChain
    from langchain_ollama import OllamaEmbeddings, OllamaLLM

    # Import loaders for plain text, CSV, and PDF files
    # (the langchain.document_loaders path is deprecated; the loaders live in langchain_community)
    from langchain_community.document_loaders import TextLoader, CSVLoader, PyPDFLoader

    # Import CharacterTextSplitter to split large documents into smaller chunks based on character count
    from langchain_text_splitters import CharacterTextSplitter
    
    # Import the pathlib module for handling filesystem paths in a cross-platform way
    from pathlib import Path
    
    # Import the os module for interacting with the operating system, such as file and directory operations
    import os

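except ImportError as e:
    # Report the missing package, then re-raise so later cells do not fail with a confusing NameError
    print(f"Import error: {e}")
    raise
except Exception as e:
    print(f"Unexpected error during imports: {e}")
    raise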


In [20]:
# --- Step 1: Ensure Ollama models are available ---
# Create the reasoning and synthesis LLM wrappers for Ollama.
# If this fails, print setup instructions for both models and stop.
# Note: depending on the langchain_ollama version, constructing these objects may not
# contact the server, in which case connection errors first appear when a chain is invoked.
try:
    reasoning_llm = OllamaLLM(model="phi4-mini")
    synthesis_llm = OllamaLLM(model="Gemma3:1b")
except Exception as e:
    print("❌ Failed to connect to Ollama or load models 'phi4-mini' and 'Gemma3:1b'.")
    print("💡 Make sure Ollama is running and both models are available:")
    print("    ollama pull phi4-mini")
    print("    ollama pull gemma3:1b")
    print(f"Error details: {e}")
    raise SystemExit(1)

In [21]:
# --- Step 2: Prompt to break down the question ---
# Define a prompt template to break down a question into logical steps.
# This uses the reasoning LLM to generate a step-by-step breakdown.
reasoning_prompt = PromptTemplate.from_template("""
You are a reasoning assistant. Break the following question into logical steps to help answer it:

Question: {question}

Step-by-step breakdown:
""")
step_chain = reasoning_prompt | reasoning_llm

In [30]:
# --- Step 3: Set up vectorstore with Chroma ---
def load_documents_from_directory(directory="C:\\Python\\Agent-School\\docs\\docs2"):
    """Loads .txt, .csv, and .pdf files from the specified directory."""
    supported_extensions = {
        ".txt": TextLoader,
        ".csv": CSVLoader,
        ".pdf": PyPDFLoader,
    }

    documents = []
    directory_path = Path(directory)
    
    # Check if directory exists
    if not directory_path.exists():
        print(f"Warning: Directory '{directory}' does not exist. Creating it...")
        directory_path.mkdir(parents=True, exist_ok=True)
        return documents

    for filepath in directory_path.rglob("*"):
        ext = filepath.suffix.lower()
        loader_cls = supported_extensions.get(ext)
        if loader_cls:
            try:
                print(f"Loading {filepath}...")
                # Some loaders like CSVLoader and PyPDFLoader require file paths, not file handles
                loader = loader_cls(str(filepath))
                loaded_docs = loader.load()
                print(f"  Successfully loaded {len(loaded_docs)} documents from {filepath}")
                documents.extend(loaded_docs)
            except Exception as e:
                print(f"Failed to load {filepath}: {e}")

    if not documents:
        print("Warning: No documents were loaded. Please add some documents to the specified directory.")
        # Add a dummy document to prevent embedding errors
        # (Document is imported from langchain_core; the langchain.schema path is a legacy alias)
        from langchain_core.documents import Document
        documents = [Document(page_content="Dummy content for testing", metadata={"source": "dummy.txt"})]
    
    return documents

# Load and split documents
print("\n🔍 Loading documents from 'C:\\Python\\Agent-School\\docs\\docs2' directory...")
documents = load_documents_from_directory("C:\\Python\\Agent-School\\docs\\docs2")
print(f"Total documents loaded: {len(documents)}")

# Split documents
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)
split_docs = text_splitter.split_documents(documents)
print(f"Documents split into {len(split_docs)} chunks")

# Create embeddings and vectorstore with error handling
try:
    print("\n📊 Creating embeddings with Ollama 'nomic-embed-text' model...")
    embeddings = OllamaEmbeddings(model="nomic-embed-text")
    
    # Test if embeddings work before creating the vectorstore
    test_embedding = embeddings.embed_query("Test query")
    if not test_embedding or len(test_embedding) == 0:
        raise ValueError("Embedding model returned empty embeddings")
    
    print(f"Embedding test successful - vector dimension: {len(test_embedding)}")
    print("\n💾 Creating Chroma vectorstore...")
    vectorstore = Chroma.from_documents(split_docs, embeddings)
    print("Vectorstore successfully created!")
except Exception as e:
    print(f"❌ Error creating embeddings or vectorstore: {e}")
    print("\n⚠️ Falling back to a simple keyword-based retrieval approach...")
    
    # Define a simple fallback retrieval function
    def simple_retrieval(query, documents, top_k=1):
        # Simple keyword-based scoring
        scores = []
        for doc in documents:
            content = doc.page_content.lower()
            query_terms = query.lower().split()
            score = sum(term in content for term in query_terms)
            scores.append((doc, score))
        
        # Sort by score, descending
        scores.sort(key=lambda x: x[1], reverse=True)
        return [doc for doc, _ in scores[:top_k]]
    
    # This will be used instead of vectorstore.similarity_search
    def fallback_similarity_search(query, k=1):
        return simple_retrieval(query, split_docs, top_k=k)
    
    # Create a dummy vectorstore object with our fallback method
    class FallbackVectorstore:
        def similarity_search(self, query, k=1):
            return fallback_similarity_search(query, k)
    
    vectorstore = FallbackVectorstore()
    print("Fallback retrieval system ready.")


🔍 Loading documents from 'C:\Python\Agent-School\docs\docs2' directory...
Loading C:\Python\Agent-School\docs\docs2\Presidents.csv...
  Successfully loaded 47 documents from C:\Python\Agent-School\docs\docs2\Presidents.csv
Loading C:\Python\Agent-School\docs\docs2\US Developments in Space.txt...
  Successfully loaded 1 documents from C:\Python\Agent-School\docs\docs2\US Developments in Space.txt
Loading C:\Python\Agent-School\docs\docs2\US Space Policy.pdf...
  Successfully loaded 3 documents from C:\Python\Agent-School\docs\docs2\US Space Policy.pdf
Total documents loaded: 51
Documents split into 50 chunks

📊 Creating embeddings with Ollama 'nomic-embed-text' model...
Embedding test successful - vector dimension: 768

💾 Creating Chroma vectorstore...
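To illustrate what `chunk_size=500` and `chunk_overlap=50` mean, here is a simplified fixed-window splitter. It is a stand-in written for this explanation, not LangChain's algorithm: `CharacterTextSplitter` first splits on a separator (by default a blank line) and then merges the pieces up to the chunk size, so its chunk boundaries will differ. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=50):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# 1200 characters of synthetic text: the digits 0-9 repeated
sample = "".join(str(i % 10) for i in range(1200))
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])  # → [500, 500, 300]
```

The last 50 characters of each chunk reappear at the start of the next one, which reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.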

In [23]:
# --- Step 4: Prompt for final synthesis ---
# Define a prompt template for synthesizing a final answer.
# This uses the synthesis LLM to generate a complete and informative response.
# Note: this is only a template; the {question} and {facts} placeholders are filled in when the chain is invoked.

synthesis_prompt = PromptTemplate.from_template("""
Based on the following question and information, write a complete, informative answer. 
For each piece of information used, explicitly list the source document.

Question: {question}

Information:
{facts}

Answer (include sources for each fact):
""")
synthesis_chain = synthesis_prompt | synthesis_llm
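To show what the synthesis model eventually receives, the sketch below fills the same two placeholders using plain `str.format`. This is a stand-in for illustration, not the LangChain call itself; `render_synthesis_prompt` and the example fact are hypothetical.

```python
# Plain-string stand-in for the notebook's synthesis template (same placeholders)
SYNTHESIS_TEMPLATE = """
Based on the following question and information, write a complete, informative answer.
For each piece of information used, explicitly list the source document.

Question: {question}

Information:
{facts}

Answer (include sources for each fact):
"""


def render_synthesis_prompt(question, facts):
    """Join the retrieved facts into one block and substitute both placeholders."""
    return SYNTHESIS_TEMPLATE.format(question=question, facts="\n".join(facts))


# Hypothetical retrieved fact, formatted the way the lookup step formats its entries
example_facts = ["- Identify the date: [Source: example.txt] Example retrieved passage."]
rendered = render_synthesis_prompt("Who was president during the moon landing?", example_facts)
print(rendered)
```

Printing the rendered prompt before invoking the chain can be a quick way to check that the retrieved passages, and not just the reasoning steps, actually reach the synthesis model.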

In [24]:
# --- Step 5: Ask a question ---
# Define the question to be answered and invoke the reasoning chain to get step-by-step reasoning.
question = "Who was the U.S. president during the moon landing, and what was his policy on space exploration?"
steps_text = step_chain.invoke({"question": question})

# Print the reasoning steps generated by the LLM.
print("\n🧠 Reasoning steps:\n")
print(steps_text)


🧠 Reasoning steps:

To accurately determine who was the President of the United States at the time of NASA's Apollo 11 mission (which included the first manned Moon landing in July 1969) as well as to understand their policies regarding space exploration during that period, you can follow these logical steps:

1. Identify the specific date and event: The moon landing took place on July 20th, 1969.
2. Determine which U.S. President was serving at this time:
   - Lyndon B. Johnson served as Vice President until January 20, 1969 (after Richard Nixon became president following Kennedy's assassination in November).
   - Richard Nixon succeeded John F. Kennedy after his death on the night of December 14-15.
3. Establish who held office from August to September when Apollo 11 launched:
   - Gerald Ford was Vice President under both Johnson and then briefly before becoming acting President (after Spiro Agnew resigned in October).
4. Recognize Richard Nixon's presidency extended through this p

In [25]:
# --- Step 6: Parse reasoning steps ---
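# Extract individual reasoning steps from the generated text using regex.
# The pattern is anchored to the start of each line so that numbers inside a sentence
# (for example "1969. ") are not mistaken for step markers.
step_lines = re.findall(r"^[ \t]*\d+\.\s+(.*)", steps_text, flags=re.MULTILINE)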
facts = []
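The parsing step can be sketched in isolation on a small sample. The version below is a slightly stricter variant offered as an assumption, not the notebook's own code: besides anchoring to line starts, it strips trailing colons and drops fragments that are too short to make a useful search query. The helper `parse_steps` and its `min_words` parameter are hypothetical.

```python
import re

# A step is a line that starts with optional indentation, a number, a dot, and some text
STEP_PATTERN = re.compile(r"^[ \t]*\d+\.[ \t]+(.+)$", flags=re.MULTILINE)


def parse_steps(steps_text, min_words=3):
    """Return numbered steps as clean strings, dropping fragments too short to search on."""
    steps = []
    for raw in STEP_PATTERN.findall(steps_text):
        step = raw.strip().rstrip(":").strip()
        if len(step.split()) >= min_words:
            steps.append(step)
    return steps


sample_output = """To answer this, follow these steps:

1. Identify the date of the moon landing.
2. Determine which U.S. President was serving at this time:
   - Check the inauguration date in January 1969. It matters.
3. Policy:
"""

print(parse_steps(sample_output))
```

In this sample the bullet line containing "1969. It matters." is ignored because it does not begin with a number, and "Policy:" is dropped because it is a single word.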

In [26]:
# --- Step 7: Lookup each reasoning step with vectorstore ---
# For each reasoning step, perform a similarity search in the vectorstore.
facts = []  # Reset here so re-running this cell does not append duplicate facts
for step in step_lines:
    print(f"\n🔍 Looking up: {step}")
    
    # Retrieve the most relevant document from the vectorstore.
    docs = vectorstore.similarity_search(step, k=1)
    
    if docs:
        # Extract content and source information
        result = docs[0].page_content
        # Get the source path from metadata, falling back to 'Unknown source' when it is missing
        source = docs[0].metadata.get('source', 'Unknown source')
        # Extract just the filename from the path
        source_filename = os.path.basename(source) if source != 'Unknown source' else source
        facts.append(f"- {step.strip()}: [Source: {source_filename}] {result}")
    else:
        facts.append(f"- {step.strip()}: [Source: None] No relevant info found in local knowledge base.")

# Combine all retrieved facts into a single string.
combined_facts = "\n".join(facts)

# Print the retrieved facts.
print("\n📚 Retrieved facts from Vectorstore:\n")
print(combined_facts)


🔍 Looking up: Identify the specific date and event: The moon landing took place on July 20th, 1969.

🔍 Looking up: Determine which U.S. President was serving at this time:

🔍 Looking up: 3. Establish who held office from August to September when Apollo 11 launched:

🔍 Looking up: Recognize Richard Nixon's presidency extended through this period, as he became the incumbent president following Kennedy’s assassination.

🔍 Looking up: Research or recall historical records to ascertain his policy on space exploration:

🔍 Looking up: The U.S. president during the moon landing on July 20th, 1969 was Richard Nixon.

🔍 Looking up: His policy regarding space exploration at this time wasn't entirely clear-cut but centered around continuing and expanding NASA's mission to explore new frontiers in outer space following Kennedy’s ambitious Apollo missions set forth a vision of reaching for higher goals (the famous "moonshot"). However, specific policies relating directly after the moon landing woul

In [27]:
# --- Step 8: Summarize using second LLM ---
# Use the synthesis chain to generate a final answer based on the question and retrieved facts.
final_answer = synthesis_chain.invoke({
    "question": question,
    "facts": combined_facts
})

# Print the final synthesized answer.
print("\n✅ Final synthesized answer:\n")
print(final_answer)


✅ Final synthesized answer:

Okay, here’s a breakdown of the key facts related to the prompt, with sources cited:

1.  **The Space Shuttle Program:** Launched in 1981 under President Ronald Reagan, the Shuttle was designed as a reusable spacecraft to significantly reduce the cost of space travel compared to the Apollo missions. (Source: [NASA – Shuttle Program History](https://www.nasa.gov/shuttle/))

2.  **Kennedy's 1961 Moon Landing Commitment:** President John F. Kennedy established the goal of landing a man on the Moon before the end of the 1960s, setting a pivotal moment in space exploration. (Source: [Wikipedia – Kennedy’s Moon Landing Goal](https://en.wikipedia.org/wiki/Kennedy%27s_moon_landing_goal))

3.  **Apollo Program:**  The Apollo program, launched in 1961, aimed to land humans on the Moon and return them safely to Earth.  It significantly boosted American technological prowess and prestige. (Source: [History.com – Apollo Program](https://www.history.com/topics/apollo-pr