# LangChain RAG Demo with Ollama

A lightweight Retrieval-Augmented Generation (RAG) system demonstrating context-aware question answering with local LLM deployment.

## Main Features

- **Document Retrieval**: In-memory vector store with semantic search
- **Local LLM Integration**: Ollama-powered inference (Mistral 7B)
- **Conversation Modes**: Single-turn Q&A and multi-turn chat
- **Flexible Embeddings**: HuggingFace embeddings, with a `FakeEmbeddings` fallback (random vectors; the notebook still runs, but retrieval results are arbitrary)
- **Text Processing**: Automatic document chunking with overlap

## Architecture

```
Query → Retriever → Vector Search → Context Extraction
                                           ↓
User Query + Context → LLM Prompt → Ollama (Mistral) → Response
```

**RAG Pipeline**:
1. Documents are split into chunks (100 characters, 20-character overlap)
2. Chunks are embedded and stored in an in-memory vector store
3. The user query retrieves the top-2 most similar chunks
4. Context and query are sent to the LLM for a grounded response
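
Steps 2 to 4 can be sketched without any libraries. This is only an illustration: the bag-of-words `embed` function is a stand-in assumption for a real embedding model, and `retrieve` and `build_prompt` are hypothetical helpers, not part of the notebook.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list, k: int = 2) -> list:
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context_chunks: list) -> str:
    """Join the retrieved chunks into a context block ahead of the question."""
    context = "\n".join(context_chunks)
    return f"Context: {context}\n\nQuestion: {query}"


chunks = [
    "LLMs are trained on massive text datasets.",
    "NLP tasks include translation and summarization.",
    "Python is widely used for building LLM systems.",
]
question = "what are llms trained on"
top_chunks = retrieve(question, chunks)
prompt_text = build_prompt(question, top_chunks)
print(prompt_text)
```

A real embedding model replaces word overlap with semantic similarity, but the ranking and prompt-assembly steps keep the same shape.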

## Dependencies

**Core**:
- `langchain` - Framework orchestration
- `langchain-ollama` - Ollama LLM integration
- `langchain-text-splitters` - Document chunking

**Optional (for embeddings)**:
- `langchain-huggingface` - Sentence embeddings
- `sentence-transformers` - Embedding models
- `transformers` + `torch` - Model support

## Prerequisites

```bash
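# Requires Ollama to be installed.
# Start the server (skip if Ollama already runs as a background service)
ollama serve
# In a second terminal, download the model
ollama pull mistral:7b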
```
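
Before running the notebook, it can help to confirm that the server is reachable and the model has been pulled. The sketch below assumes Ollama listens on its default address, `http://localhost:11434`, and queries its `/api/tags` endpoint; `list_local_models` is a hypothetical helper name.

```python
import json
import urllib.error
import urllib.request
from typing import List, Optional

OLLAMA_URL = "http://localhost:11434"  # assumption: default Ollama address


def list_local_models(base_url: str = OLLAMA_URL, timeout: float = 2.0) -> Optional[List[str]]:
    """Return names of locally pulled models, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        return None
    return [model.get("name", "") for model in payload.get("models", [])]


models = list_local_models()
if models is None:
    print("Ollama is not reachable; start it with `ollama serve`.")
elif not any(name.startswith("mistral") for name in models):
    print("Server is up, but no Mistral model was found; run `ollama pull mistral:7b`.")
else:
    print("Ollama is ready:", models)
```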

## Usage

**Single-turn**: `retrieve_and_respond(query)` - Direct Q&A  
**Multi-turn**: `ConversationPipeline.chat(query)` - Conversational context

## Key Components

- **LLM**: ChatOllama (Mistral 7B, temperature=0.7)
- **Vector Store**: InMemoryVectorStore (ephemeral; contents are lost when the kernel restarts)
- **Retriever**: Top-2 similarity search
- **Embeddings**: all-MiniLM-L6-v2 (384 dimensions) or `FakeEmbeddings` fallback

In [1]:
# Core LangChain and related packages
# %pip install -qU langchain langchain-ollama langchain-text-splitters

# Optional: HuggingFace embeddings for real vector support
# %pip install -qU langchain-huggingface

# Optional: Transformers and PyTorch for local models
# %pip install -qU transformers torch

# Optional: SentenceTransformers for sentence embeddings
# %pip install -qU sentence-transformers

In [None]:
from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [None]:
# Step 1: Initialize Ollama Chat Model
# Ensure Ollama is running:
#    ollama serve
#    ollama pull mistral:7b
llm = ChatOllama(
    model="mistral:7b",
    temperature=0.7,
)

In [None]:
# Step 2: Prepare Documents
documents = [
    "Large Language Models (LLMs) are AI systems trained on massive text datasets to understand and generate human language. "
    "They are commonly built using Transformer architectures, which allow them to process context across long sequences of text.",

    "Natural Language Processing (NLP) focuses on enabling computers to interpret and work with human language. "
    "Core NLP tasks include text classification, question answering, sentiment analysis, summarization, and translation.",

    "Python is widely used for building and deploying LLM and NLP systems. "
    "Popular libraries and frameworks include Hugging Face Transformers for model usage, tokenizers for text processing, "
    "and vector databases for retrieval-augmented generation (RAG) workflows."
]

In [None]:
# Step 3: Split Documents
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
)
chunks = text_splitter.split_text("\n".join(documents))
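
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sliding-window sketch. It is a simplification: `RecursiveCharacterTextSplitter` prefers to break on separators such as newlines and spaces, so its chunks are usually shorter than 100 characters and do not fall on fixed offsets. `chunk_text` is a hypothetical helper.

```python
def chunk_text(text: str, chunk_size: int = 100, chunk_overlap: int = 20) -> list:
    """Fixed-width windows; each starts chunk_size - chunk_overlap after the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early enough that the last window is not fully contained in the previous one
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


sample = "".join(str(i % 10) for i in range(250))  # 250 characters of dummy text
pieces = chunk_text(sample)
print([len(p) for p in pieces])
print(pieces[0][-20:] == pieces[1][:20])  # the shared 20-character overlap
```

The overlap means a sentence cut at a chunk boundary still appears intact in at least one neighbouring chunk, at the cost of storing some text twice.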

In [None]:
# Step 4: Create Vector Store
# Prefer real HuggingFace embeddings; fall back to FakeEmbeddings if they are unavailable.
try:
    from langchain_huggingface import HuggingFaceEmbeddings
    import torch  # Used only to pick the device below

    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2",
        model_kwargs={"device": "cuda" if torch.cuda.is_available() else "cpu"},
    )
except (ImportError, RuntimeError) as e:  # RuntimeError covers GPU/device initialization failures
    from langchain_core.embeddings import FakeEmbeddings

    print(f"HuggingFace embeddings unavailable ({e}); using FakeEmbeddings fallback.")
    # FakeEmbeddings returns random vectors: the pipeline runs, but retrieval is not semantic.
    embeddings = FakeEmbeddings(size=384)  # Match the MiniLM dimension
    
vectorstore = InMemoryVectorStore.from_texts(chunks, embeddings)
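# k must be passed through search_kwargs; a bare k=2 keyword is not applied to the search
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})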
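
The fallback path is worth a caveat. Assuming the fake embedder returns random vectors unrelated to the input text, similarity scores carry no meaning. The sketch below imitates that with Gaussian noise; `random_embedding` is a hypothetical stand-in, not the LangChain class.

```python
import math
import random


def random_embedding(rng: random.Random, size: int = 384) -> list:
    """Stand-in for a fake embedder: ignores the text and samples Gaussian noise."""
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


def cosine(a: list, b: list) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


rng = random.Random(0)
query_vec = random_embedding(rng)
scores = [cosine(query_vec, random_embedding(rng)) for _ in range(5)]
print([round(s, 3) for s in scores])  # scores cluster near 0 whatever the text was
```

With scores this close together, the "top-2" chunks are effectively a random pick, so answers produced in fallback mode should not be read as grounded in the documents.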

In [None]:
# Step 5: Define RAG Prompt
prompt = ChatPromptTemplate.from_messages([
    ("system",
        "You are a helpful assistant. Prioritize answering using the provided context. "
        "If the context contains the answer, rely on it strictly. "
        "If the context does not contain the answer, you may use your own verified, "
        "parametric knowledge — but do NOT make up facts. "
        "If you are not confident or the information is unknown, say "
        "'I don't know' or 'The context does not provide this information.'\n\n"
        "Context: {context}"
    ),
    ("human", "{query}")
])


In [None]:
# Step 6: Simple RAG Function
def retrieve_and_respond(query: str) -> str:
    """Retrieve docs → format prompt → invoke LLM"""
    context_docs = retriever.invoke(query)
    context = "\n".join([doc.page_content for doc in context_docs])
    
    messages = prompt.format_messages(context=context, query=query)
    response = llm.invoke(messages)
    return response.content

In [None]:
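# Step 7: Multi-turn Conversation
from langchain_core.messages import AIMessage, HumanMessage


class ConversationPipeline:
    def __init__(self, llm, retriever, prompt):
        self.llm = llm
        self.retriever = retriever
        self.prompt = prompt
        self.chat_history = []

    def chat(self, user_query: str) -> str:
        """Answer one turn, replaying earlier turns so the model sees the conversation."""
        context_docs = self.retriever.invoke(user_query)
        context = "\n".join(doc.page_content for doc in context_docs)

        # Use the pipeline's own prompt (not the global): [system with context, current query]
        formatted = self.prompt.format_messages(context=context, query=user_query)

        # Rebuild earlier turns as messages and place them before the current query
        history = []
        for turn in self.chat_history:
            history.append(HumanMessage(content=turn["user"]))
            history.append(AIMessage(content=turn["assistant"]))
        messages = [formatted[0], *history, *formatted[1:]]

        response = self.llm.invoke(messages)
        response_text = response.content

        # Store in history
        self.chat_history.append({"user": user_query, "assistant": response_text})

        return response_text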
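
Because `chat_history` grows with every turn, one option is to cap how many past turns are replayed to the model. The sketch below converts stored turns into `(role, content)` pairs, a message form LangChain chat models accept; `history_to_messages` and `max_turns` are hypothetical names, not part of the notebook.

```python
def history_to_messages(chat_history: list, max_turns: int = 3) -> list:
    """Convert the last max_turns stored turns into (role, content) pairs, oldest first."""
    # Guard against max_turns == 0: a [-0:] slice would return the whole list
    recent = chat_history[-max_turns:] if max_turns > 0 else []
    messages = []
    for turn in recent:
        messages.append(("human", turn["user"]))
        messages.append(("ai", turn["assistant"]))
    return messages


history = [
    {"user": "What is an LLM?", "assistant": "A model trained on large text corpora."},
    {"user": "What is it built on?", "assistant": "Usually a Transformer architecture."},
]
print(history_to_messages(history, max_turns=1))
```

A small cap keeps the prompt short for a 7B model, though older turns are then forgotten.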

In [None]:
# Step 8: Run Pipeline
if __name__ == "__main__":
    print("=== Single-turn RAG ===")
    result = retrieve_and_respond("What is a Large Language Model?")
    print(f"Response: {result}\n")
    
    print("=== Multi-turn Conversation ===")
    conversation = ConversationPipeline(llm, retriever, prompt)
    
    queries = [
        "What is a Large Language Model?",
        "What are common tasks in Natural Language Processing?",
        "Can I use Python to build and deploy LLM applications?"
    ]
    
    for query in queries:
        response = conversation.chat(query)
        print(f"User: {query}")
        print(f"Assistant: {response}\n")

### With Tools (Function Calling)

In [None]:
from langchain_core.tools import tool
from langchain_ollama import ChatOllama
from langchain_core.messages import HumanMessage

@tool  # Docstring is mandatory as it becomes the tool description given to the LLM
def multiply(a: int, b: int) -> int:
    """Multiply two integers and return the result."""
    return a * b

llm_with_tools = llm.bind_tools([multiply])

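# Ask model
ai_msg = llm_with_tools.invoke([HumanMessage("What is 5 times 3?")])

# Execute the tool only if the model actually requested it
if ai_msg.tool_calls:
    tool_call = ai_msg.tool_calls[0]
    result = multiply.invoke(tool_call["args"])
    print("Tool result:", result)
else:
    print("No tool call requested; model replied:", ai_msg.content)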
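
With more than one tool bound, each requested call has to be routed by name. Below is a minimal dispatch sketch using plain functions and dictionaries shaped like the entries of `tool_calls` (`name`, `args`, `id`); `TOOL_REGISTRY`, `run_tool_calls`, and the sample calls are hypothetical.

```python
def multiply(a: int, b: int) -> int:
    return a * b


def add(a: int, b: int) -> int:
    return a + b


# Map tool names to plain callables (illustrative registry)
TOOL_REGISTRY = {"multiply": multiply, "add": add}


def run_tool_calls(tool_calls: list) -> list:
    """Execute each requested call; unknown tool names are reported instead of raising."""
    results = []
    for call in tool_calls:
        func = TOOL_REGISTRY.get(call["name"])
        if func is None:
            results.append((call.get("id"), f"unknown tool: {call['name']}"))
            continue
        results.append((call.get("id"), func(**call["args"])))
    return results


example_calls = [
    {"name": "multiply", "args": {"a": 5, "b": 3}, "id": "call_1"},
    {"name": "divide", "args": {"a": 6, "b": 2}, "id": "call_2"},
]
print(run_tool_calls(example_calls))
```

Returning the call id alongside each result makes it possible to send the outcome back to the model as a reply to that specific request.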


### Streaming

In [None]:
print("=== Streaming ===")

for chunk in llm.stream("Tell me about Large Language Models"):
    print(chunk.content, end="", flush=True)
print()
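
Streaming prints text as it arrives, but the full response is often needed afterwards as well. A minimal sketch of collecting while printing follows; `fake_stream` is a stand-in for the model stream, and `print_and_collect` is a hypothetical helper.

```python
from typing import Iterable, Iterator


def fake_stream(text: str, size: int = 8) -> Iterator[str]:
    """Stand-in for a model stream: yields the text in fixed-size pieces."""
    for start in range(0, len(text), size):
        yield text[start:start + size]


def print_and_collect(pieces: Iterable[str]) -> str:
    """Print pieces as they arrive and return the joined text."""
    collected = []
    for piece in pieces:
        print(piece, end="", flush=True)
        collected.append(piece)
    print()
    return "".join(collected)


full_text = print_and_collect(fake_stream("LLMs generate text one token at a time."))
```

With the notebook's model, the same helper could presumably be fed `(chunk.content for chunk in llm.stream(...))`, assuming each chunk's `content` is a plain string.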