### LangChain RAG demo with Ollama, in-memory vectors, tool use and streaming

In [1]:
# Core LangChain and related packages
# %pip install -qU langchain langchain-ollama langchain-text-splitters

# Optional: HuggingFace embeddings for real vector support
# %pip install -qU langchain_huggingface

# Optional: Transformers and PyTorch for local models
# %pip install -qU transformers torch

# Optional: SentenceTransformers for sentence embeddings
# %pip install -qU sentence-transformers

In [3]:
from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [4]:
# Step 1: Initialize Ollama Chat Model
# Ensure the Ollama server is running and the model is available locally:
#    ollama serve
#    ollama pull mistral:7b
llm = ChatOllama(
    model="mistral:7b",
    temperature=0.7,
)

In [5]:
# Step 2: Prepare Documents
documents = [
    "Large Language Models (LLMs) are AI systems trained on massive text datasets to understand and generate human language. "
    "They are commonly built using Transformer architectures, which allow them to process context across long sequences of text.",

    "Natural Language Processing (NLP) focuses on enabling computers to interpret and work with human language. "
    "Core NLP tasks include text classification, question answering, sentiment analysis, summarization, and translation.",

    "Python is widely used for building and deploying LLM and NLP systems. "
    "Popular libraries and frameworks include Hugging Face Transformers for model usage, tokenizers for text processing, "
    "and vector databases for retrieval-augmented generation (RAG) workflows."
]

In [6]:
# Step 3: Split Documents
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
)
chunks = text_splitter.split_text("\n".join(documents))
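
To make `chunk_size` and `chunk_overlap` more concrete, here is a minimal sketch of overlapping chunking as a fixed-size sliding window. It is a deliberate simplification: `RecursiveCharacterTextSplitter` first tries to break on separators such as paragraph and line breaks, so its chunks will generally differ from these. The helper name `sliding_chunks` and the sample string are assumptions made up for illustration.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return windows


sample = "abcdefghij" * 3  # 30 characters
demo_chunks = sliding_chunks(sample, chunk_size=12, chunk_overlap=4)
for piece in demo_chunks:
    print(len(piece), repr(piece))
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last `chunk_overlap` characters of one chunk reappear at the start of the next. That repetition is what lets a sentence cut at a boundary still show up whole in at least one chunk.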

In [7]:
# Step 4: Create Vector Store
# Prefer real sentence embeddings; fall back to FakeEmbeddings if the optional packages are missing
try:
    from langchain_huggingface import HuggingFaceEmbeddings
    import torch  # Used only to choose between GPU and CPU below
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2",
        model_kwargs={'device': 'cuda' if torch.cuda.is_available() else 'cpu'}
    )
except (ImportError, RuntimeError) as e:  # RuntimeError covers GPU/device initialization failures
    from langchain_core.embeddings import FakeEmbeddings
    print(f"HuggingFace unavailable ({e}), using FakeEmbeddings fallback.")
    # FakeEmbeddings returns random vectors, so retrieval is not semantically meaningful
    embeddings = FakeEmbeddings(size=384)  # Match MiniLM dimension
    
vectorstore = InMemoryVectorStore.from_texts(chunks, embeddings)
# Pass k through search_kwargs so the retriever returns the top 2 chunks
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})
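
As a rough picture of what the retriever does with a query, the sketch below ranks stored vectors by cosine similarity and keeps the top `k`. `InMemoryVectorStore` is documented as using cosine similarity, but this is not its actual code. The helper names and the three-dimensional toy vectors are assumptions invented for the example.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the k texts whose vectors are most similar to query_vec."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Hypothetical 3-dimensional "embeddings"; all-MiniLM-L6-v2 produces 384 dimensions.
toy_store = [
    ("LLMs use Transformers", [0.9, 0.1, 0.0]),
    ("NLP tasks include translation", [0.1, 0.9, 0.0]),
    ("Python is used for deployment", [0.0, 0.2, 0.9]),
]
toy_query = [1.0, 0.2, 0.0]
print(top_k(toy_query, toy_store, k=2))
```

With real embeddings the ranking follows the meaning of the text. With random fallback vectors the same procedure still runs, but the order it returns is arbitrary.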

In [8]:
# Step 5: Define RAG Prompt
prompt = ChatPromptTemplate.from_messages([
    ("system",
        "You are a helpful assistant. Prioritize answering using the provided context. "
        "If the context contains the answer, rely on it strictly. "
        "If the context does not contain the answer, you may use your own verified, "
        "parametric knowledge — but do NOT make up facts. "
        "If you are not confident or the information is unknown, say "
        "'I don't know' or 'The context does not provide this information.'\n\n"
        "Context: {context}"
    ),
    ("human", "{query}")
])


In [9]:
# Step 6: Simple RAG Function
def retrieve_and_respond(query: str) -> str:
    """Retrieve docs → format prompt → invoke LLM"""
    context_docs = retriever.invoke(query)
    context = "\n".join([doc.page_content for doc in context_docs])
    
    messages = prompt.format_messages(context=context, query=query)
    response = llm.invoke(messages)
    return response.content

In [10]:
# Step 7: Multi-turn Conversation
class ConversationPipeline:
    def __init__(self, llm, retriever, prompt):
        self.llm = llm
        self.retriever = retriever
        self.prompt = prompt
        self.chat_history = []
    
    def chat(self, user_query: str) -> str:
        """Handle multi-turn conversation"""
        context_docs = self.retriever.invoke(user_query)
        context = "\n".join([doc.page_content for doc in context_docs])
        
        # Format messages with the retrieved context. Note: chat_history is
        # recorded below but is not yet passed to the model.
        messages = self.prompt.format_messages(context=context, query=user_query)
        
        response = self.llm.invoke(messages)
        response_text = response.content
        
        # Store in history
        self.chat_history.append({"user": user_query, "assistant": response_text})
        
        return response_text
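
One possible way to make earlier turns visible to the model is to place them between the system prompt and the new question. The sketch below only builds that message list from history entries shaped like the `{"user": ..., "assistant": ...}` dicts stored in `chat_history`. The helper `build_messages` and the sample strings are hypothetical, and passing the resulting `(role, content)` tuples to `llm.invoke` assumes the chat model accepts that shorthand.

```python
def build_messages(
    system_text: str, history: list[dict], user_query: str
) -> list[tuple[str, str]]:
    """Interleave earlier turns between the system prompt and the new question."""
    messages = [("system", system_text)]
    for turn in history:
        messages.append(("human", turn["user"]))
        messages.append(("ai", turn["assistant"]))
    messages.append(("human", user_query))
    return messages


demo_history = [
    {"user": "What is an LLM?", "assistant": "A model trained on large text corpora."}
]
demo_messages = build_messages(
    "Context: (retrieved chunks go here)",
    demo_history,
    "Which architecture do they use?",
)
for role, content in demo_messages:
    print(f"{role}: {content}")
```

Since every stored turn is replayed on each call, a long conversation would eventually need truncation or summarisation to stay within the model's context window.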

In [11]:
# Step 8: Run Pipeline
if __name__ == "__main__":
    print("=== Single-turn RAG ===")
    result = retrieve_and_respond("What is a Large Language Model?")
    print(f"Response: {result}\n")
    
    print("=== Multi-turn Conversation ===")
    conversation = ConversationPipeline(llm, retriever, prompt)
    
    queries = [
        "What is a Large Language Model?",
        "What are common tasks in Natural Language Processing?",
        "Can I use Python to build and deploy LLM applications?"
    ]
    
    for query in queries:
        response = conversation.chat(query)
        print(f"User: {query}")
        print(f"Assistant: {response}\n")

=== Single-turn RAG ===
Response:  A Large Language Model (LLM) is an AI system trained on extensive text datasets to understand and generate human language. It's designed to process context across long sequences of text using architectures such as Transformers. Core NLP tasks for LLMs include text classification and question answering.

=== Multi-turn Conversation ===
User: What is a Large Language Model?
Assistant:  A Large Language Model (LLM) is an AI system trained on vast amounts of text data to understand and generate human language. It utilizes advanced architectures like the Transformer model, which enables it to process context across long sequences of text. Core Natural Language Processing (NLP) tasks that LLMs can perform include text classification and question answering.

User: What are common tasks in Natural Language Processing?
Assistant:  Common tasks in Natural Language Processing (NLP) include:
1. Text Classification: This involves categorizing texts into predefined

### With Tools (Function Calling)

In [12]:
from langchain_core.tools import tool
from langchain_core.messages import HumanMessage

@tool  # Docstring is mandatory as it becomes the tool description given to the LLM
def multiply(a: int, b: int) -> int:
    """Multiply two integers and return the result."""
    return a * b

llm_with_tools = llm.bind_tools([multiply])

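# Ask the model; with tools bound, it may reply with a tool call instead of text
ai_msg = llm_with_tools.invoke([HumanMessage("What is 5 times 3?")])

# Execute the requested tool call, if the model made one
if ai_msg.tool_calls:
    tool_call = ai_msg.tool_calls[0]
    result = multiply.invoke(tool_call["args"])
    print("Tool result:", result)
else:
    print("No tool call; model replied:", ai_msg.content)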


Tool result: 15
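
When more than one tool is bound, each requested call has to be routed to the matching function. The following is a minimal sketch of that dispatch step in plain Python, using hand-written dicts shaped like the entries of `ai_msg.tool_calls` (`name`, `args`, `id`). The registry, the `add_fn` tool, and the `run_tool_calls` helper are hypothetical names introduced for illustration, not LangChain APIs.

```python
def multiply_fn(a: int, b: int) -> int:
    """Plain-Python stand-in for the multiply tool."""
    return a * b


def add_fn(a: int, b: int) -> int:
    """Hypothetical second tool, included only to show dispatch by name."""
    return a + b


TOOL_REGISTRY = {"multiply": multiply_fn, "add": add_fn}


def run_tool_calls(tool_calls: list[dict]) -> list[dict]:
    """Execute each requested call and pair the outcome with the call id."""
    results = []
    for call in tool_calls:
        func = TOOL_REGISTRY.get(call["name"])
        if func is None:
            # The model asked for a tool that was never registered
            results.append({"id": call["id"], "error": f"unknown tool: {call['name']}"})
            continue
        results.append({"id": call["id"], "result": func(**call["args"])})
    return results


# Hand-written calls; a real run would take these from the model's response.
fake_calls = [
    {"name": "multiply", "args": {"a": 5, "b": 3}, "id": "call_1"},
    {"name": "add", "args": {"a": 2, "b": 2}, "id": "call_2"},
    {"name": "divide", "args": {"a": 1, "b": 2}, "id": "call_3"},
]
tool_results = run_tool_calls(fake_calls)
print(tool_results)
```

Keeping the call `id` next to each result matters if the results are later sent back to the model, since that is how a reply is matched to the request it answers.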


### Streaming

In [14]:
print("=== Streaming ===")

for chunk in llm.stream("Tell me about Large Language Models"):
    print(chunk.content, end="", flush=True)
print()

=== Streaming ===
 Large Language Models (LLMs) are a type of artificial intelligence (AI) model designed to process and generate human-like text. They are trained on vast amounts of internet text, learning from it to generate responses that mimic the way people talk and write.

The most popular type of large language models is based on Transformer architecture, introduced by Google's paper "Attention is All You Need" in 2017. These models use self-attention mechanisms to understand the context of words within a sentence, making them capable of generating coherent and contextually appropriate responses.

Examples of large language models include Meena, T5 (Text-to-Text Transfer Transformer), BERT (Bidirectional Encoder Representations from Transformers), and GPT-3 (Generative Pretrained Transformer 3) developed by OpenAI. These models have been trained on a vast amount of text data, allowing them to generate responses that are both informative and engaging.

Large language models have 