### LangChain Vectorstore RAG Implementation
---
This notebook demonstrates a Retrieval-Augmented Generation (RAG) system using LangChain with local models via Ollama. The implementation follows a multi-step reasoning process:

1. **Setup**: Loads two Ollama models (phi4-mini for reasoning and Gemma3:1b for synthesis) to handle different parts of the process.

2. **Question Analysis**: Uses the reasoning model to break down complex questions into logical sub-steps that can be individually researched.

3. **Document Processing**: Loads local `.txt`, `.csv`, and `.pdf` files about space exploration from the `docs` directory, splits them into chunks of roughly 500 characters with a 50-character overlap, and creates vector embeddings using the nomic-embed-text model.

4. **Knowledge Retrieval**: For each identified reasoning step, runs a similarity search against the Chroma vectorstore and keeps the single most relevant chunk (`k=1`) from the knowledge base. If the vectorstore cannot be built, a simple keyword-based search is used instead.

5. **Answer Synthesis**: Feeds the original question and all retrieved contextual information to the synthesis model, which generates a cohesive, factual response.

This approach combines structured reasoning with targeted retrieval from a domain-specific knowledge base. The aim is to ground each part of the answer in a retrieved source, which tends to give more accurate and contextually relevant responses than using an LLM alone.
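To make the chunking in step 3 concrete, here is a minimal sketch of a fixed-width sliding window with overlap. It is a simplified stand-in, not LangChain's algorithm: `CharacterTextSplitter` first splits on a separator and then merges the pieces, so its chunk boundaries will differ. The function name `chunk_text` and the sample string are hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Cut text into fixed-width windows that share `chunk_overlap` characters.

    Simplified illustration only; a real splitter also respects separators.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Hypothetical 1200-character sample
sample = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(sample)
print([len(c) for c in chunks])           # [500, 500, 300]
print(chunks[0][-50:] == chunks[1][:50])  # True: neighbours share 50 characters
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of embedding some text twice.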

#### To Do List

- Point the document loader at the correct document store rather than the one in the `src` folder.

In [1]:
# Install the required libraries from requirements.txt 
# pip install -r requirements.txt

In [2]:
try:
    # Standard library: regex parsing, OS helpers, and cross-platform filesystem paths
    import os
    import re
    from pathlib import Path

    # PromptTemplate defines prompts with named placeholders for the LLMs
    from langchain_core.prompts import PromptTemplate

    # RunnableMap composes several runnable components over the same input
    from langchain_core.runnables import RunnableMap

    # Chroma vector store, used for storing and searching vector embeddings (RAG retrieval)
    from langchain_community.vectorstores import Chroma

    # Ollama-specific embeddings and LLM classes for use with LangChain
    from langchain_ollama import OllamaEmbeddings, OllamaLLM

    # Loaders for plain text, CSV, and PDF documents
    from langchain_community.document_loaders import TextLoader, CSVLoader, PyPDFLoader

    # CharacterTextSplitter splits large documents into smaller chunks based on character count
    from langchain_text_splitters import CharacterTextSplitter

except ImportError as e:
    # Report the missing package, then re-raise so later cells do not fail with a NameError
    print(f"Import error: {e}")
    raise
except Exception as e:
    print(f"Unexpected error during imports: {e}")
    raise


In [3]:
# --- Step 1: Ensure Ollama models are available ---
# Attempt to load the reasoning and synthesis LLM models from Ollama.
# If the models are not available, provide instructions to the user and exit.
try:
    reasoning_llm = OllamaLLM(model="phi4-mini")
    synthesis_llm = OllamaLLM(model="Gemma3:1b")
except Exception as e:
    print("❌ Failed to connect to Ollama or load models 'phi4-mini' and 'Gemma3:1b'.")
    print("💡 Make sure Ollama is running and both models are available:")
    print("    ollama pull phi4-mini")
    print("    ollama pull Gemma3:1b")
    print(f"Error details: {e}")
    exit(1)
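Constructing the client objects does not necessarily prove that each model has been pulled, so it can help to compare the required model names against what is installed before running the pipeline. The sketch below is a pure-Python helper with hypothetical names (`normalize`, `missing_models`); it assumes the installed names were obtained separately, for example by reading the output of `ollama list`. Treating names as case-insensitive and defaulting to the `:latest` tag are assumptions made for this example.

```python
def normalize(name: str) -> str:
    """Lower-case a model name and append ':latest' when no tag is given.

    Both rules are assumptions for this sketch, not documented guarantees.
    """
    name = name.strip().lower()
    return name if ":" in name else f"{name}:latest"


def missing_models(required: list[str], installed: list[str]) -> list[str]:
    """Return the required model names that do not appear in `installed`."""
    installed_set = {normalize(n) for n in installed}
    return [r for r in required if normalize(r) not in installed_set]


# Hypothetical list of installed models, e.g. copied from `ollama list`
installed = ["phi4-mini:latest", "nomic-embed-text:latest"]
required = ["phi4-mini", "Gemma3:1b", "nomic-embed-text"]

print(missing_models(required, installed))  # ['Gemma3:1b']
```

A non-empty result is a cue to pull the missing models before invoking any chain.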

In [4]:
# --- Step 2: Prompt to break down the question ---
# Define a prompt template to break down a question into logical steps.
# This uses the reasoning LLM to generate a step-by-step breakdown.
reasoning_prompt = PromptTemplate.from_template("""
You are a reasoning assistant. Break the following question into logical steps to help answer it:

Question: {question}

Step-by-step breakdown:
""")
step_chain = reasoning_prompt | reasoning_llm
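The `|` operator above chains the prompt and the model so that the output of one becomes the input of the next. The toy sketch below illustrates that idea in plain Python. It is not how LangChain implements runnables; the `Step` class, the shortened template, and the fake model are all hypothetical stand-ins.

```python
class Step:
    """A tiny callable wrapper that supports `a | b` composition."""

    def __init__(self, fn):
        self.fn = fn

    def invoke(self, value):
        return self.fn(value)

    def __or__(self, other):
        # Run this step first, then feed its result to the next step
        return Step(lambda value: other.invoke(self.invoke(value)))


template = "Break the following question into logical steps:\n\nQuestion: {question}\n"

# Stage 1: fill the template from a dict of inputs
fill_prompt = Step(lambda inputs: template.format(**inputs))
# Stage 2: a fake model that just echoes the prompt it received
fake_llm = Step(lambda prompt: "ECHO: " + prompt)

chain = fill_prompt | fake_llm
result = chain.invoke({"question": "Why is the sky blue?"})
print(result)
```

Swapping the fake model for a real one changes only the second stage; the dict-in, text-out shape of the chain stays the same.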

In [None]:
# --- Step 3: Set up vectorstore with Chroma ---
def load_documents_from_directory(directory="docs"):
    """Loads .txt, .csv, and .pdf files from the specified directory."""
    supported_extensions = {
        ".txt": TextLoader,
        ".csv": CSVLoader,
        ".pdf": PyPDFLoader,
    }

    documents = []
    directory_path = Path(directory)
    
    # Check if directory exists
    if not directory_path.exists():
        print(f"Warning: Directory '{directory}' does not exist. Creating it...")
        directory_path.mkdir(parents=True, exist_ok=True)
        return documents

    for filepath in directory_path.rglob("*"):
        ext = filepath.suffix.lower()
        loader_cls = supported_extensions.get(ext)
        if loader_cls:
            try:
                print(f"Loading {filepath}...")
                # Some loaders like CSVLoader and PyPDFLoader require file paths, not file handles
                loader = loader_cls(str(filepath))
                loaded_docs = loader.load()
                print(f"  Successfully loaded {len(loaded_docs)} documents from {filepath}")
                documents.extend(loaded_docs)
            except Exception as e:
                print(f"Failed to load {filepath}: {e}")

    if not documents:
        print(f"Warning: No documents were loaded. Please add some documents to the '{directory}' directory.")
        # Add a dummy document to prevent embedding errors
        from langchain_core.documents import Document
        documents = [Document(page_content="Dummy content for testing", metadata={"source": "dummy.txt"})]
    
    return documents

# Load and split documents
print("\n🔍 Loading documents from 'docs' directory...")
documents = load_documents_from_directory("docs")
print(f"Total documents loaded: {len(documents)}")

# Split documents
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)
split_docs = text_splitter.split_documents(documents)
print(f"Documents split into {len(split_docs)} chunks")

# Create embeddings and vectorstore with error handling
try:
    print("\n📊 Creating embeddings with Ollama 'nomic-embed-text' model...")
    embeddings = OllamaEmbeddings(model="nomic-embed-text")
    
    # Test if embeddings work before creating the vectorstore
    test_embedding = embeddings.embed_query("Test query")
    if not test_embedding or len(test_embedding) == 0:
        raise ValueError("Embedding model returned empty embeddings")
    
    print(f"Embedding test successful - vector dimension: {len(test_embedding)}")
    print("\n💾 Creating Chroma vectorstore...")
    vectorstore = Chroma.from_documents(split_docs, embeddings)
    print("Vectorstore successfully created!")
except Exception as e:
    print(f"❌ Error creating embeddings or vectorstore: {e}")
    print("\n⚠️ Falling back to a simple keyword-based retrieval approach...")
    
    # Define a simple fallback retrieval function
    def simple_retrieval(query, documents, top_k=1):
        # Simple keyword-based scoring
        scores = []
        for doc in documents:
            content = doc.page_content.lower()
            query_terms = query.lower().split()
            score = sum(term in content for term in query_terms)
            scores.append((doc, score))
        
        # Sort by score, descending
        scores.sort(key=lambda x: x[1], reverse=True)
        return [doc for doc, _ in scores[:top_k]]
    
    # This will be used instead of vectorstore.similarity_search
    def fallback_similarity_search(query, k=1):
        return simple_retrieval(query, split_docs, top_k=k)
    
    # Create a dummy vectorstore object with our fallback method
    class FallbackVectorstore:
        def similarity_search(self, query, k=1):
            return fallback_similarity_search(query, k)
    
    vectorstore = FallbackVectorstore()
    print("Fallback retrieval system ready.")


🔍 Loading documents from 'docs' directory...
Total documents loaded: 0
Documents split into 0 chunks

📊 Creating embeddings with Ollama 'nomic-embed-text' model...
Embedding test successful - vector dimension: 768

💾 Creating Chroma vectorstore...
❌ Error creating embeddings or vectorstore: Expected Embeddings to be non-empty list or numpy array, got [] in upsert.

⚠️ Falling back to a simple keyword-based retrieval approach...
Fallback retrieval system ready.
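The keyword fallback scores each document by how many query terms occur in it as substrings. The self-contained sketch below reproduces that scoring with a hypothetical `Doc` dataclass in place of LangChain's document class, and adds an optional `min_score` filter that is not in the notebook. The sample documents are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a document with content and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def keyword_search(query, docs, top_k=1, min_score=0):
    """Rank docs by the number of query terms found as substrings."""
    terms = query.lower().split()
    scored = [
        (doc, sum(term in doc.page_content.lower() for term in terms))
        for doc in docs
    ]
    # Hypothetical extension: drop documents below a minimum score
    scored = [(doc, score) for doc, score in scored if score >= min_score]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # stable: ties keep input order
    return [doc for doc, _ in scored[:top_k]]


docs = [
    Doc("Apollo 11 landed on the Moon in 1969."),
    Doc("The Space Shuttle flew from 1981 to 2011."),
    Doc("Mars rovers explore the red planet."),
]

print(keyword_search("moon landing 1969", docs)[0].page_content)
# Apollo 11 landed on the Moon in 1969.
print(keyword_search("quantum computing", docs, min_score=1))
# []
```

Without a minimum score, a query that matches nothing still returns the first document, so an irrelevant chunk can reach the synthesis prompt. Substring matching also means "moon" would match "moonlight".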


In [6]:
# --- Step 4: Prompt for final synthesis ---
# Define a prompt template for synthesizing a final answer.
# This uses the synthesis LLM to generate a complete and informative response.
# Remember - this is just a template, with {question} and {facts} as placeholders which are populated later.

synthesis_prompt = PromptTemplate.from_template("""
Based on the following question and information, write a complete, informative answer. 
Include the source document for each piece of information you use in your answer.

Question: {question}

Information:
{facts}

Answer (include sources for each fact):
""")
synthesis_chain = synthesis_prompt | synthesis_llm
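To preview what the synthesis model will actually receive, the placeholders can be filled by hand. The sketch below uses plain `str.format` on the same template text, which behaves like the template's default formatting for simple `{name}` placeholders; the question and the two facts are hypothetical placeholders, not retrieved data.

```python
SYNTHESIS_TEMPLATE = """
Based on the following question and information, write a complete, informative answer.
Include the source document for each piece of information you use in your answer.

Question: {question}

Information:
{facts}

Answer (include sources for each fact):
"""

# Hypothetical facts in the same "- step: [Source: file] content" shape used later
example_facts = [
    "- Step A: [Source: example.txt] Placeholder fact one.",
    "- Step B: [Source: example.csv] Placeholder fact two.",
]

rendered = SYNTHESIS_TEMPLATE.format(
    question="What is RAG?",
    facts="\n".join(example_facts),
)
print(rendered)
```

Printing the rendered prompt before invoking the chain is a quick way to spot empty or malformed facts.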

In [7]:
# --- Step 5: Ask a question ---
# Define the question to be answered and invoke the reasoning chain to get step-by-step reasoning.
question = "Who was the U.S. president during the moon landing, and what was his policy on space exploration?"
steps_text = step_chain.invoke({"question": question})

# Print the reasoning steps generated by the LLM.
print("\n🧠 Reasoning steps:\n")
print(steps_text)


🧠 Reasoning steps:

Sure! Let's break this down step by stepping.

1. Identify who oversaw NASA at the time of the Apollo 11 mission.
2. Determine which administration (presidential term) coincided with that period to find out about their policies related to national security and technological advancement during space exploration efforts, such as those undertaken for landing on the moon in July-August 1969.

Step-by-step breakdown:

1. The U.S. Apollo missions were part of NASA's Space Race against Russia.
2. John F. Kennedy served as President from January 20, 1961 to November 22, 1963 and Lyndon B. Johnson took over after his assassination on a day close enough that the Moon landing occurred during LBJ's presidency.

So:
- The U.S. president at the time of Apollo 11 was Lyndon B. Johnson.
  
As for policy:

Lyndon B. Johnson's administration focused heavily on advancing space exploration as part of its broader goals in technology and national security, which can be summarized b

In [8]:
# --- Step 6: Parse reasoning steps ---
# Extract individual reasoning steps from the generated text using regex.
step_lines = re.findall(r"\d+\.\s+(.*)", steps_text)
facts = []
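The pattern above is not anchored to the start of a line, so a sentence ending in a number and a period can be mistaken for a list marker, and steps repeated in a second list are kept twice. The sketch below shows one possible tightening: a line-anchored pattern followed by order-preserving de-duplication. The sample text and the names `anchored` and `unique_steps` are hypothetical.

```python
import re

# Hypothetical model output: a prose line ending in "1969." and a repeated step
sample_text = (
    "Intro: the landing was in 1969.\n"
    "1. First step.\n"
    "2. Second step.\n"
    "\n"
    "1. First step.\n"
)

# Notebook pattern: "1969." plus the newline is treated as a list marker
loose = re.findall(r"\d+\.\s+(.*)", sample_text)
print(loose)     # ['1. First step.', 'Second step.', 'First step.']

# Anchored variant: the number must start the line (after optional indentation)
anchored = re.findall(r"^[ \t]*\d+\.[ \t]+(.+)$", sample_text, flags=re.MULTILINE)
print(anchored)  # ['First step.', 'Second step.', 'First step.']

# Drop repeats while keeping first-seen order
unique_steps = list(dict.fromkeys(step.strip() for step in anchored))
print(unique_steps)  # ['First step.', 'Second step.']
```

Fewer, cleaner steps also mean fewer similarity searches in the next cell.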

In [10]:
# --- Step 7: Lookup each reasoning step with vectorstore ---
# For each reasoning step, perform a similarity search in the vectorstore.
for step in step_lines:
    print(f"\n🔍 Looking up: {step}")
    
    # Retrieve the most relevant document from the vectorstore.
    docs = vectorstore.similarity_search(step, k=1)
    
    if docs:
        # Extract content and source information
        result = docs[0].page_content
        # Get source from metadata or fallback to document type
        source = docs[0].metadata.get('source', 'Unknown source')
        # Extract just the filename from the path
        source_filename = os.path.basename(source) if source != 'Unknown source' else source
        facts.append(f"- {step.strip()}: [Source: {source_filename}] {result}")
    else:
        facts.append(f"- {step.strip()}: [Source: None] No relevant info found in local knowledge base.")

# Combine all retrieved facts into a single string.
combined_facts = "\n".join(facts)

# Print the retrieved facts.
print("\n📚 Retrieved facts from Vectorstore:\n")
print(combined_facts)


🔍 Looking up: Identify who oversaw NASA at the time of the Apollo 11 mission.

🔍 Looking up: Determine which administration (presidential term) coincided with that period to find out about their policies related to national security and technological advancement during space exploration efforts, such as those undertaken for landing on the moon in July-August 1969.

🔍 Looking up: The U.S. Apollo missions were part of NASA's Space Race against Russia.

🔍 Looking up: John F. Kennedy served as President from January 20, 1961 to November 22, 1963 and Lyndon B. Johnson took over after his assassination on a day close enough that the Moon landing occurred during LBJ's presidency.

📚 Retrieved facts from Vectorstore:

- Identify who oversaw NASA at the time of the Apollo 11 mission.: [Source: US Space Policy.pdf] U.S. Space Exploration Policy: From the Cold War to the Commercial Space Age
The United States' approach to space exploration has evolved

In [11]:
# --- Step 8: Summarize using second LLM ---
# Use the synthesis chain to generate a final answer based on the question and retrieved facts.
final_answer = synthesis_chain.invoke({
    "question": question,
    "facts": combined_facts
})

# Print the final synthesized answer.
print("\n✅ Final synthesized answer:\n")
print(final_answer)


✅ Final synthesized answer:

Okay, here's a breakdown of the facts from the text, with sources cited:

1.  **Kennedy's Commitment to the Moon Landing:**  President John F. Kennedy committed to landing a man on the Moon by the end of the 1960s. (Source: Presidents.csv)

2.  **Apollo Program:** The Apollo program was launched to outpace the Soviet Union in space technology. (Source: Presidents.csv)

3.  **The Apollo 11 Moon Landing:** The Apollo 11 Moon landing in 1969 symbolized U.S. technological superiority and remains a landmark achievement. (Source: Presidents.csv)

4. **The Shuttle Program:** The Nixon administration ended the Apollo program and pivoted to the development of the Space Shuttle. (Source: Presidents.csv)

5.  **Cost and Safety Concerns:** The Shuttle program faced high costs and safety concerns, which led to limitations on its potential. (Source: Presidents.csv)

6.  **International Cooperation:** The Space Shuttle program facilitated international cooperation in