**LangChain -** provides a modular framework and tools to integrate LLMs with external data sources and software.

**LLM -** a large language model, trained on vast amounts of text data to understand, generate, and process human-like language.

In [8]:
!pip install -q --upgrade langchain langchain-google-genai langchain-core langchain_community docx2txt pypdf langchain_chroma sentence_transformers


In [9]:
!pip install langchain-google-genai langchain chromadb pandas matplotlib



In [1]:
import getpass    # Prompts for input without echoing it (the default prompt is "Password: ")
import os

Read the Google and LangChain (LangSmith) API keys

In [2]:
if not os.environ.get("GOOGLE_API_KEY"):
  os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter API key for Google Gemini: ")

Enter API key for Google Gemini: ··········


In [3]:
if not os.environ.get("LANGCHAIN_API_KEY"):
  os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("Enter API key for Langchain: ")

Enter API key for Langchain: ··········


In [5]:
import langchain
langchain.__version__

'0.3.27'

In [6]:
os.environ["LANGCHAIN_TRACING_V2"] = "true"     # Enable LangSmith tracing of every step the LangChain app takes
os.environ["LANGCHAIN_PROJECT"] = "BizLens"     # LangSmith project name under which the traces are logged

Initialize Gemini (LLM) with LangChain

In [10]:
from langchain_google_genai import ChatGoogleGenerativeAI

# 'model' -> Gemini variant to call. 'gemini-1.5-flash' is the faster option; a Pro variant is slower and better kept for complex tasks.
# 'temperature' -> controls randomness of output. Lower = more focused/deterministic, higher = more creative/random.
# 'top_p' -> controls nucleus sampling (how diverse the output is). Lower = safer/less diverse, higher = more diverse.
google_llm = ChatGoogleGenerativeAI(
    model="gemini-1.5-flash",   # Gemini model variant
    temperature=0.9,            # High value: more creative/random responses
    top_p=0.8                   # Nucleus sampling: sample only from the top 80% of probability mass
)

In [11]:
llm_response = google_llm.invoke("Tell me a joke")
llm_response

AIMessage(content="Why don't scientists trust atoms? \n\nBecause they make up everything!", additional_kwargs={}, response_metadata={'prompt_feedback': {'block_reason': 0, 'safety_ratings': []}, 'finish_reason': 'STOP', 'model_name': 'gemini-1.5-flash', 'safety_ratings': []}, id='run--5516452c-ba1f-4661-b89f-eac884c41a25-0', usage_metadata={'input_tokens': 4, 'output_tokens': 17, 'total_tokens': 21, 'input_token_details': {'cache_read': 0}})

**Output Parsers** - take the raw output of a model and transform it into a format that is easier to use downstream (for example a plain string, a dict, or a validated object).

**Different Output Parsers:**

- **StrOutputParser:** Returns the model’s output as a plain string.
- **JsonOutputParser:** Expects the model output to be valid JSON and parses it into a Python dict.
- **PydanticOutputParser:** Uses a Pydantic model to validate and structure the output.
- **CommaSeparatedListOutputParser:** Converts output like "apple, banana, orange" into a Python list.
- **ListOutputParser:** Parses structured list-style outputs (like bullet points) into a Python list.
- **RegexParser:** Extracts structured info from the output using a regex pattern.
- **OutputFixingParser:** Wraps another parser and asks an LLM to correct outputs that fail to parse.
- **RetryOutputParser:** Retries parsing if the output doesn’t fit the schema, prompting the LLM again.
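
To make the list above concrete, here is a minimal plain-Python sketch of what two of these parsers conceptually do. The helper names `parse_comma_separated` and `parse_json_output` are hypothetical and are not LangChain APIs; the real classes also supply format instructions and their own error types.

```python
import json


def parse_comma_separated(text: str) -> list[str]:
    """Split 'a, b, c' style model output into a list of trimmed items."""
    return [item.strip() for item in text.split(",") if item.strip()]


def parse_json_output(text: str) -> dict:
    """Parse model output as a JSON object, failing loudly if it is not one."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Model output is not valid JSON: {exc}") from exc
    if not isinstance(parsed, dict):
        raise ValueError("Expected a JSON object at the top level")
    return parsed


# Hypothetical raw model outputs, used only for illustration.
fruits = parse_comma_separated("apple, banana, orange")
review = parse_json_output('{"rating": 4.0, "pros": ["screen"]}')
print(fruits)
print(review["rating"])
```

The point of the sketch is only that a parser is a small, deterministic transformation applied after the model call, which is why it composes so easily with the model in a chain.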

In [12]:
# Extract the plain-text content from the AIMessage returned by the model
from langchain_core.output_parsers import StrOutputParser

output_parser = StrOutputParser()
output_parser.invoke(llm_response)

"Why don't scientists trust atoms? \n\nBecause they make up everything!"

A chain links the model, output parser, and other steps so that a multi-step task runs with a single call.
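
As a rough intuition for how the `|` operator can build such a pipeline, the sketch below overloads `__or__` on a toy class. `Step`, `fake_llm`, and `to_text` are hypothetical names for illustration; this is not how LangChain's Runnable classes are actually implemented, only the same composition idea in miniature.

```python
class Step:
    """Wraps a function so that steps can be chained with the | operator."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other: "Step") -> "Step":
        # a | b builds a new step that runs a, then feeds its result to b.
        return Step(lambda value: other.invoke(self.invoke(value)))


# Hypothetical stand-ins for a model and an output parser.
fake_llm = Step(lambda prompt: {"content": f"Echo: {prompt}"})
to_text = Step(lambda message: message["content"])

toy_chain = fake_llm | to_text
toy_result = toy_chain.invoke("Tell me a joke")
print(toy_result)
```

Because every step exposes the same `invoke` method, the composed object can itself be used as a step in a longer chain.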


In [13]:
# Create a chain that runs the model and the parser in one call instead of separate steps
chain = google_llm | output_parser
chain.invoke("Tell me a Joke")

"Why don't scientists trust atoms? \n\nBecause they make up everything!"

In [14]:
chain.invoke("Recent news in india")

"Recent news from India is varied and fast-paced.  To give you a useful summary, I need some specifics.  What areas are you interested in?  For example, are you interested in:\n\n* **Politics:**  Government decisions, elections, policy changes, etc.?\n* **Business/Economy:**  Market trends, inflation, new investments, etc.?\n* **Social Issues:**  Caste-based violence, religious tensions, women's rights, etc.?\n* **Technology:**  Developments in the Indian tech sector, digital initiatives, etc.?\n* **International Relations:**  India's relationships with other countries, diplomatic efforts, etc.?\n* **Sports:**  Cricket, other sporting events, etc.?\n* **Culture:**  Film releases, festivals, art, etc.?\n* **Disasters/Accidents:**  Natural disasters, major accidents, etc.?\n\n\nPlease tell me what aspects of Indian news you'd like to know more about, and I'll do my best to provide a concise summary of recent events."

### Structured Output

**Uses of Pydantic:**

- **API development:** Validates request/response data in FastAPI and generates automated docs.
- **Data processing & ETL:** Cleans and validates data at ingestion to ensure quality.
- **Configuration management:** Loads and validates settings from env variables or files.
- **Object serialization:** Easily converts models to/from dicts and JSON.
- **Data modeling & integrity:** Defines schemas with validation to ensure reliable domain objects.

In [19]:
# BaseModel - BaseModel provides validation, parsing, serialization, and default/constraint handling for data models.
# Field - Used inside BaseModel to give extra information about each field, like:
    # Default values
    # Metadata (title, description)
    # Validation constraints (min length, max value, regex, etc.)

from typing import List
from pydantic import BaseModel, Field

In [20]:
# Defined Structure for the model
class MobileReview(BaseModel):
  phone_model: str = Field(description = "Name and Model of the phone")
  rating: float = Field(description = "Rating of the phone")
  pros: List[str] = Field(description = "List of positive aspects")
  cons: List[str] = Field(description = "List of negative aspects")
  summary: str = Field(description = "Brief summary of the review")

review_text = """
Just got my hands on the new Galaxy S21 and wow, this thing is slick! The screen is gorgeous, colors pop like crazy. Camera's insane too, especially at night - my Insta game's never been
stronger. Battery life's solid, lasts me all day no problem.

Not gonna lie though, it's pretty pricey. And what's with ditching the charger? C'mon Samsung.
Also, still getting used to the new button layout, keep hitting Bixby by mistake.

Overall I'd say it's a solid 4 out of 5. Great phone, but a few annoying quirks keep it from being perfect.
If you're due for an upgrade, definitely worth checking out!
"""

# To get output in a custom structure, use the with_structured_output method
structured_llm = google_llm.with_structured_output(MobileReview)
output = structured_llm.invoke(review_text)
output

MobileReview(phone_model='Galaxy S21', rating=4.0, pros=['Gorgeous screen', 'Insane camera, especially at night', 'Solid battery life'], cons=['Pricey', 'No charger included', 'New button layout'], summary='Great phone, but a few annoying quirks keep it from being perfect.')

In [21]:
output.pros

['Gorgeous screen', 'Insane camera, especially at night', 'Solid battery life']

In [22]:
output.rating

4.0

### Prompt Template

In [23]:
# ChatPromptTemplate defines and structures prompts with placeholders
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template("tell me a joke about {topic}")
prompt.invoke({"topic": "sports"})    # Template with value substituted

ChatPromptValue(messages=[HumanMessage(content='tell me a joke about sports', additional_kwargs={}, response_metadata={})])

In [24]:
chain = prompt | google_llm | output_parser
chain.invoke({"topic": "kids"})

"Why are kids like bubbles?  Because they're fun to watch until they burst!"

In [25]:
# Putting everything together from start
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Defining the prompt
prompt = ChatPromptTemplate.from_template("tell me a joke about {topic}")

# Initializing the Model(LLM)
google_llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash", temperature=0.2)

# Creating output parser
output_parser = StrOutputParser()

# Compose Chain
chain = prompt | google_llm | output_parser

# Use Chain
result = chain.invoke({"topic": "kids"})
result

"Why are kids like bubbles?  Because they're a lot of fun until they pop!"

### LLM Messages

Different types of messages:

- AI message -> LLM’s response.
- Human message -> user’s input/question.
- System message -> prompt that guides the model's behavior (environment setup / role assignment).

In [27]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.messages import HumanMessage, AIMessage, SystemMessage

# Custom messages
system_message = SystemMessage(content = "You are a helpful assistant that tells jokes.")
human_message = HumanMessage(content = "Tell me about programming")
google_llm.invoke([system_message, human_message])

AIMessage(content='Why do programmers prefer dark mode?\n\nBecause light attracts bugs!', additional_kwargs={}, response_metadata={'prompt_feedback': {'block_reason': 0, 'safety_ratings': []}, 'finish_reason': 'STOP', 'model_name': 'gemini-1.5-flash', 'safety_ratings': []}, id='run--001ca7c4-d641-4fab-aba4-45906e29ccd5-0', usage_metadata={'input_tokens': 13, 'output_tokens': 14, 'total_tokens': 27, 'input_token_details': {'cache_read': 0}})

In [29]:
# Putting system and human messages into a template
template = ChatPromptTemplate([
    ("system", "You are a helpful assistant that tells jokes."),
    ("human", "Tell me about {topic}")
])

# Fill the {topic} placeholder with a value
prompt_value = template.invoke({"topic": "sports"})
prompt_value

ChatPromptValue(messages=[SystemMessage(content='You are a helpful assistant that tells jokes.', additional_kwargs={}, response_metadata={}), HumanMessage(content='Tell me about sports', additional_kwargs={}, response_metadata={})])

## RAG - Retrieval-Augmented Generation

In [32]:
# !pip install "unstructured[excel]" msoffcrypto-tool

In [34]:
# !pip install docx2txt

In [37]:
# !pip install langchain_community

**Different types of text_splitters:**

- **CharacterTextSplitter:**
    - Splits on a single separator (default `\n\n`) and merges the pieces up to a character limit.
    - May break sentences unnaturally.
- **RecursiveCharacterTextSplitter:**
    - Splits hierarchically: tries paragraphs → lines → words → characters.
    - Produces meaningful chunks (preferred in production).
- **TokenTextSplitter:**
    - Splits based on LLM tokens (not characters).
- **NLTKTextSplitter:**
    - Splits on sentence boundaries detected by NLTK.
    - Requires the extra `nltk` dependency (a spaCy-based `SpacyTextSplitter` is also available).
- **MarkdownHeaderTextSplitter:**
    - Splits Markdown docs by headers (#, ##, etc.).
- **HTMLHeaderTextSplitter:**
    - Splits HTML documents by header tags (\<h1>, \<h2>, etc.).
- **Code splitting (`RecursiveCharacterTextSplitter.from_language`):**
    - Splits source code along language-specific separators such as function and class definitions.
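
The roles of `chunk_size` and `chunk_overlap` are easiest to see on a toy input. The sketch below is a fixed-window simplification written for illustration (the function name `split_with_overlap` is hypothetical); the splitters listed above additionally try to cut at natural separators rather than at arbitrary character positions.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[dict]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for chunk in toy_chunks:
    print(chunk)
```

Each chunk repeats the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.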

In [39]:
from langchain_community.document_loaders import PyPDFLoader, Docx2txtLoader  # Loaders for PDF and .docx files
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings     # Embedding model that maps text to dense vectors
from langchain_core.documents import Document     # Common container (page_content + metadata) that every loader returns

In [43]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Configure the splitter: chunks of up to 1000 characters with 200 characters of overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 200,
    length_function = len,
    add_start_index = True
)

In [41]:
docx_loader = Docx2txtLoader("/content/GreenGrow Innovations_ Company History.docx")
documents = docx_loader.load()

In [44]:
documents

[Document(metadata={'source': '/content/GreenGrow Innovations_ Company History.docx'}, page_content="GreenGrow Innovations was founded in 2010 by Sarah Chen and Michael Rodriguez, two agricultural engineers with a passion for sustainable farming. The company started in a small garage in Portland, Oregon, with a simple mission: to make farming more environmentally friendly and efficient.\n\n\n\nIn its early days, GreenGrow focused on developing smart irrigation systems that could significantly reduce water usage in agriculture. Their first product, the WaterWise Sensor, was launched in 2012 and quickly gained popularity among local farmers. This success allowed the company to expand its research and development efforts.\n\n\n\nBy 2015, GreenGrow had outgrown its garage origins and moved into a proper office and research facility in the outskirts of Portland. This move coincided with the development of their second major product, the SoilHealth Monitor, which used advanced sensors to ana

In [45]:
splits = text_splitter.split_documents(documents)
print(f"Number of splits: {len(splits)}")

Number of splits: 2


In [46]:
splits[0]

Document(metadata={'source': '/content/GreenGrow Innovations_ Company History.docx', 'start_index': 0}, page_content='GreenGrow Innovations was founded in 2010 by Sarah Chen and Michael Rodriguez, two agricultural engineers with a passion for sustainable farming. The company started in a small garage in Portland, Oregon, with a simple mission: to make farming more environmentally friendly and efficient.\n\n\n\nIn its early days, GreenGrow focused on developing smart irrigation systems that could significantly reduce water usage in agriculture. Their first product, the WaterWise Sensor, was launched in 2012 and quickly gained popularity among local farmers. This success allowed the company to expand its research and development efforts.\n\n\n\nBy 2015, GreenGrow had outgrown its garage origins and moved into a proper office and research facility in the outskirts of Portland. This move coincided with the development of their second major product, the SoilHealth Monitor, which used advanc

In [47]:
splits[0].metadata

{'source': '/content/GreenGrow Innovations_ Company History.docx',
 'start_index': 0}

In [50]:
# Executed after the folder-wide load in In [49], so splits[0] now comes from a different document
splits[0].page_content

'Company: QuantumNext Systems\n\nHeadquarters: QuantumNext Systems is headquartered in Bangalore, Karnataka, India. The company, specializing in quantum computing and advanced data processing, is situated in the bustling tech metropolis of Bangalore, often referred to as the "Silicon Valley of India." From this technology capital, QuantumNext Systems is well-positioned to tap into India\'s rich pool of engineering talent and growing tech ecosystem, enabling it to push the boundaries of computational innovation.'

#### Load Documents

In [48]:
# Load every supported document (.pdf, .docx) found in a folder
def load_documents(folder_path: str) -> List[Document]:
  documents = []
  for filename in os.listdir(folder_path):
    file_path = os.path.join(folder_path, filename)
    if filename.endswith('.pdf'):
      loader = PyPDFLoader(file_path)
    elif filename.endswith('.docx'):   # Include the dot so names that merely end in "docx" are not matched
      loader = Docx2txtLoader(file_path)
    else:
      print(f"Unsupported file type: {filename}")
      continue
    documents.extend(loader.load())
  return documents

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 200,
    length_function = len,
    add_start_index = True
)

In [49]:
# Load Documents from a folder
folder_path = "/content"
documents = load_documents(folder_path)

print(f"Loaded {len(documents)} documents from the folder.")
splits = text_splitter.split_documents(documents)
print(f"Split the documents into {len(splits)} chunks.")

Unsupported file type: .config
Unsupported file type: sample_data
Loaded 5 documents from the folder.
Split the documents into 8 chunks.


#### Get Embeddings

In [51]:
# Embedding documents

embeddings = GoogleGenerativeAIEmbeddings(model = "models/embedding-001")

# Get embeddings for multiple documents
multi_doc_embeddings = embeddings.embed_documents([split.page_content for split in splits])
print(f"Embeddings shape: {len(multi_doc_embeddings)}")   # Number of vectors, one per chunk


Embeddings shape: 8


In [52]:
multi_doc_embeddings[0]   # Vector representation of the first chunk

[0.020223723724484444,
 -0.019797246903181076,
 -0.0380454882979393,
 -0.04309780150651932,
 0.11615443229675293,
 0.02160935290157795,
 0.015302896499633789,
 -0.00768645154312253,
 0.008013720624148846,
 0.048679474741220474,
 -0.039101000875234604,
 0.008384802378714085,
 -0.035774387419223785,
 -0.024990728124976158,
 0.00047951252781786025,
 -0.012532973662018776,
 -0.016729529947042465,
 0.007844334468245506,
 -0.00662657618522644,
 -0.009797823615372181,
 0.03647154942154884,
 0.0038728113286197186,
 0.013231532648205757,
 0.013576321303844452,
 -0.013882886618375778,
 -0.051442813128232956,
 0.011697376146912575,
 -0.050200991332530975,
 -0.02769986167550087,
 0.010265843011438847,
 -0.020384538918733597,
 0.011846241541206837,
 -0.048316486179828644,
 0.0250454880297184,
 -0.01130848377943039,
 -0.0044659036211669445,
 -0.03487629070878029,
 0.002629740396514535,
 -0.01181736122816801,
 0.01149409357458353,
 0.005383891519159079,
 -0.04249626398086548,
 -0.0227237306535244,
 -

In [54]:
# !pip install langchain_chroma -q

#### Create and Persist chroma vector store

- **Creating =** like filling a database table in memory with embeddings.
- **Persisting =** like hitting “Save to disk” so you don’t lose your work after closing Python.

In [56]:
# To store all these embeddings we use chromaDB
from langchain_community.vectorstores import Chroma

# Initialize Gemini embedding model
embedding_function = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
collection_name = "my_chatbot_collection"

# Create Chroma vector store with Gemini embeddings
vectorstore = Chroma.from_documents(
    collection_name=collection_name,
    documents=splits,
    embedding=embedding_function,
    persist_directory="./chroma_db"
)

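# With chromadb >= 0.4 the store is saved automatically when persist_directory is set;
# this explicit call is only needed on older versions and is deprecated on newer ones.
vectorstore.persist()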
print("Vector store created and persisted to ./chroma_db")

Vector store created and persisted to ./chroma_db


#### Perform Similarity Search - Retrieval and Generation

Similarity search returns the chunks most relevant to the query. Those chunks and the query are then sent together to the LLM so that it can ground its answer in them.
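
The ranking step behind a similarity search can be sketched with cosine similarity over toy vectors. The three-dimensional vectors and the names `cosine_similarity` and `top_k` below are made up for illustration; real embeddings have hundreds of dimensions, and the distance metric a given vector store uses by default may differ from cosine.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the names of the k stored vectors most similar to the query."""
    ranked = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Toy "embeddings" for three chunks and one query (hypothetical values).
toy_vectors = {
    "founding": [0.9, 0.1, 0.0],
    "products": [0.1, 0.9, 0.1],
    "offices": [0.2, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.1]
best = top_k(toy_query, toy_vectors, k=2)
print(best)
```

The retrieved chunk names would then be mapped back to their text and placed in the prompt alongside the question.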

In [58]:
query = "When was GreenGrow Innovations founded?"
search_results = vectorstore.similarity_search(query)   # Returns k=4 chunks by default
print(len(search_results))

print(f"Top {len(search_results)} most relevant chunks for the query: {query}\n")
for i, result in enumerate(search_results, 1):
  print(f"Result {i}:")
  print(f"Source: {result.metadata.get('source', 'Unknown')}")
  print(f"Content: {result.page_content}")
  print()

4
Top 4 most relevant chunks for the query: When was GreenGrow Innovations founded?

Result 1:
Source: /content/GreenGrow Innovations_ Company History.docx
Content: The company's breakthrough came in 2018 with the introduction of the EcoHarvest System, an integrated solution that combined smart irrigation, soil monitoring, and automated harvesting techniques. This system caught the attention of large-scale farmers across the United States, propelling GreenGrow to national prominence.



Today, GreenGrow Innovations employs over 200 people and has expanded its operations to include offices in California and Iowa. The company continues to focus on developing sustainable agricultural technologies, with ongoing projects in vertical farming, drought-resistant crop development, and AI-powered farm management systems.



Despite its growth, GreenGrow remains committed to its original mission of promoting sustainable farming practices. The company regularly partners with universities and resea

A vector store object cannot be piped into a chain directly, so we convert it into a retriever, which can be invoked on its own and composed with the prompt and the LLM.

In [61]:
# Retriever object to get all the relevant docs from vector database
retriever = vectorstore.as_retriever()
retriever.invoke("When was GreenGrow Innovations founded?")

[Document(metadata={'start_index': 979, 'source': '/content/GreenGrow Innovations_ Company History.docx'}, page_content="The company's breakthrough came in 2018 with the introduction of the EcoHarvest System, an integrated solution that combined smart irrigation, soil monitoring, and automated harvesting techniques. This system caught the attention of large-scale farmers across the United States, propelling GreenGrow to national prominence.\n\n\n\nToday, GreenGrow Innovations employs over 200 people and has expanded its operations to include offices in California and Iowa. The company continues to focus on developing sustainable agricultural technologies, with ongoing projects in vertical farming, drought-resistant crop development, and AI-powered farm management systems.\n\n\n\nDespite its growth, GreenGrow remains committed to its original mission of promoting sustainable farming practices. The company regularly partners with universities and research institutions to advance the fiel

In [64]:
# Creating a template with proper prompts
from langchain_core.prompts import ChatPromptTemplate

template = """ Answer the question based only on the following context:
{context}
Question: {question}

Answer = """
prompt = ChatPromptTemplate.from_template(template)

In [65]:
# RunnablePassthrough forwards its input unchanged; here it passes the user's query straight through as "question".
from langchain_core.runnables import RunnablePassthrough

# At this stage the chain only builds the prompt; the retrieved Document objects are inserted as-is.
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
)
rag_chain.invoke("When was GreenGrow innovations founded?")

ChatPromptValue(messages=[HumanMessage(content=' Answer the question based only on the following context:\n[Document(metadata={\'source\': \'/content/GreenGrow Innovations_ Company History.docx\', \'start_index\': 979}, page_content="The company\'s breakthrough came in 2018 with the introduction of the EcoHarvest System, an integrated solution that combined smart irrigation, soil monitoring, and automated harvesting techniques. This system caught the attention of large-scale farmers across the United States, propelling GreenGrow to national prominence.\\n\\n\\n\\nToday, GreenGrow Innovations employs over 200 people and has expanded its operations to include offices in California and Iowa. The company continues to focus on developing sustainable agricultural technologies, with ongoing projects in vertical farming, drought-resistant crop development, and AI-powered farm management systems.\\n\\n\\n\\nDespite its growth, GreenGrow remains committed to its original mission of promoting sus

In [66]:
# Joining all content of all documents into one

def doc2str(docs):
  return "\n\n".join(doc.page_content for doc in docs)

In [67]:
# RAG chain with properly formatted document text placed into the prompt.
rag_chain = (
    {"context": retriever | doc2str, "question": RunnablePassthrough()}
    | prompt
)
rag_chain.invoke("When was GreenGrow innovations founded?")

ChatPromptValue(messages=[HumanMessage(content=" Answer the question based only on the following context:\nThe company's breakthrough came in 2018 with the introduction of the EcoHarvest System, an integrated solution that combined smart irrigation, soil monitoring, and automated harvesting techniques. This system caught the attention of large-scale farmers across the United States, propelling GreenGrow to national prominence.\n\n\n\nToday, GreenGrow Innovations employs over 200 people and has expanded its operations to include offices in California and Iowa. The company continues to focus on developing sustainable agricultural technologies, with ongoing projects in vertical farming, drought-resistant crop development, and AI-powered farm management systems.\n\n\n\nDespite its growth, GreenGrow remains committed to its original mission of promoting sustainable farming practices. The company regularly partners with universities and research institutions to advance the field of agricultu

In [69]:
from pprint import pprint

In [70]:
# This RAG chain also invokes the LLM; StrOutputParser then returns only the text content of the AIMessage
rag_chain = (
    {"context": retriever | doc2str, "question": RunnablePassthrough()}
    | prompt
    | google_llm
    | StrOutputParser()
)
question = "When was GreenGrow innovations founded?"
response = rag_chain.invoke(question)
pprint(response)

('The provided text does not state when GreenGrow Innovations was founded.  It '
 'only mentions that their breakthrough was in 2018.')


### Conversational RAG - Handling follow-up questions

In [81]:
from langchain_core.messages import HumanMessage, AIMessage

chat_history = []
chat_history.extend([
    HumanMessage(content = question),
    AIMessage(content = response)
])

In [82]:
chat_history

[HumanMessage(content='When was GreenGrow innovations founded?', additional_kwargs={}, response_metadata={}),
 AIMessage(content='The provided text does not state when GreenGrow Innovations was founded.  It only mentions that their breakthrough was in 2018.', additional_kwargs={}, response_metadata={})]

In [83]:
# This prompt helps convert a follow-up question into a complete, standalone question
from langchain_core.prompts import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
    MessagesPlaceholder   # Take list of messages dynamically
)

# System prompt that asks the model to rewrite a follow-up into a standalone question (not to answer it).
# Each fragment ends with a space so the concatenated string keeps its word boundaries.
contextualize_q_system_prompt = (
    "Given a chat history and the latest user question "
    "which might reference context in the chat history, "
    "formulate a standalone question which can be understood "
    "without the chat history. Do not answer the question, "
    "just reformulate it if needed and otherwise return it as is."
)

# Creating a template to insert into the retrieval chain.
contextualize_q_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", contextualize_q_system_prompt),
        MessagesPlaceholder("chat_history"),       # List of messages
        ("human", "{input}"),
    ]
)

# Chain that uses the contextualize prompt to rewrite a follow-up into a standalone question
contextualize_chain = contextualize_q_prompt | google_llm | StrOutputParser()
contextualize_chain.invoke({"input": "Where it is headquarted?", "chat_history": chat_history})

'Where is GreenGrow Innovations headquartered?'

In [84]:
chat_history

[HumanMessage(content='When was GreenGrow innovations founded?', additional_kwargs={}, response_metadata={}),
 AIMessage(content='The provided text does not state when GreenGrow Innovations was founded.  It only mentions that their breakthrough was in 2018.', additional_kwargs={}, response_metadata={})]

In [85]:
from langchain.chains import create_history_aware_retriever
history_aware_retriever = create_history_aware_retriever(
    google_llm, retriever, contextualize_q_prompt
)

history_aware_retriever.invoke({"input": "Where it is headquarted?", "chat_history": chat_history})

[Document(metadata={'source': '/content/GreenGrow Innovations_ Company History.docx', 'start_index': 979}, page_content="The company's breakthrough came in 2018 with the introduction of the EcoHarvest System, an integrated solution that combined smart irrigation, soil monitoring, and automated harvesting techniques. This system caught the attention of large-scale farmers across the United States, propelling GreenGrow to national prominence.\n\n\n\nToday, GreenGrow Innovations employs over 200 people and has expanded its operations to include offices in California and Iowa. The company continues to focus on developing sustainable agricultural technologies, with ongoing projects in vertical farming, drought-resistant crop development, and AI-powered farm management systems.\n\n\n\nDespite its growth, GreenGrow remains committed to its original mission of promoting sustainable farming practices. The company regularly partners with universities and research institutions to advance the fiel

In [86]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

# RAG pipeline setup:
# - `qa_prompt`: Defines how the assistant answers using retrieved context and chat history.
# - `question_answer_chain`: Combines retrieved documents to generate responses.
# - `rag_chain`: Links retrieval with the response generator, ensuring context-aware answers.
qa_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful AI assistant. Use the following context to answer the user's question."),
    ("system", "Context: {context}"),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}")
])

# `create_stuff_documents_chain`: Merges retrieved documents into a structured prompt.
# Ensures all relevant content is formatted before passing it to the LLM for response generation.
# This step improves contextual understanding by consolidating multiple sources into a single coherent input.
question_answer_chain = create_stuff_documents_chain(google_llm, qa_prompt)

# `create_retrieval_chain`: Retrieves relevant documents based on the query
# and passes them to the response generation chain for context-aware answers.
rag_chain = create_retrieval_chain(history_aware_retriever, question_answer_chain)

# A plain retriever (such as a vector store retriever) fetches documents for the current query only;
# it does not consider earlier turns of the conversation.
# create_history_aware_retriever adds that step: it uses the chat history to rewrite the follow-up
# into a standalone question before retrieving.
# create_retrieval_chain then takes that retriever plus the question-answer chain and returns
# the input keys together with the retrieved 'context' and the generated 'answer'.
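
The control flow of history-aware retrieval can be sketched without any LLM. In the toy version below, `history_aware_retrieve`, `toy_rewrite`, and `toy_retrieve` are hypothetical stand-ins: the rewrite step is a simple string substitution where the real chain would call the model, and the retriever only records which query it received.

```python
def history_aware_retrieve(question, chat_history, rewrite, retrieve):
    """Rewrite the question only when there is history to resolve it against."""
    search_query = rewrite(question, chat_history) if chat_history else question
    return retrieve(search_query)


def toy_rewrite(question: str, chat_history: list[str]) -> str:
    # Stand-in for the LLM call: replace the pronoun with the company from the history.
    return question.replace(" it ", " GreenGrow Innovations ", 1)


seen_queries = []


def toy_retrieve(search_query: str) -> list[str]:
    # Stand-in for a vector search: just record the query that reached the retriever.
    seen_queries.append(search_query)
    return [f"<documents matching: {search_query}>"]


toy_history = ["When was GreenGrow Innovations founded?", "(previous answer)"]
history_aware_retrieve("Where is it headquartered?", [], toy_rewrite, toy_retrieve)
history_aware_retrieve("Where is it headquartered?", toy_history, toy_rewrite, toy_retrieve)
print(seen_queries)
```

With an empty history the retriever sees the raw follow-up, while with history it sees a self-contained question, which is what makes the vector search meaningful for pronoun-heavy follow-ups.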

In [87]:
# Invoking the created retrieval chain
result = rag_chain.invoke({"input": "Where it is headquarted?", "chat_history": chat_history})

pprint(result)

{'answer': "The provided text doesn't specify the headquarters location of "
           'GreenGrow Innovations.  It only mentions that they have offices in '
           'California and Iowa.',
 'chat_history': [HumanMessage(content='When was GreenGrow innovations founded?', additional_kwargs={}, response_metadata={}),
                  AIMessage(content='The provided text does not state when GreenGrow Innovations was founded.  It only mentions that their breakthrough was in 2018.', additional_kwargs={}, response_metadata={})],
 'context': [Document(metadata={'start_index': 979, 'source': '/content/GreenGrow Innovations_ Company History.docx'}, page_content="The company's breakthrough came in 2018 with the introduction of the EcoHarvest System, an integrated solution that combined smart irrigation, soil monitoring, and automated harvesting techniques. This system caught the attention of large-scale farmers across the United States, propelling GreenGrow to national prominence.\n\n\n\nTod

Building a Multi-User Chatbot

Using a SQLite3 database

In [88]:
import sqlite3
from datetime import datetime

DB_NAME = "rag_app.db"

def get_db_connection():
  conn = sqlite3.connect(DB_NAME)
  conn.row_factory = sqlite3.Row
  return conn

def create_application_logs():
  # One row per question/answer turn; session_id groups the turns of one user.
  conn = get_db_connection()
  conn.execute('''CREATE TABLE IF NOT EXISTS application_logs
                        (id INTEGER PRIMARY KEY AUTOINCREMENT,
                        session_id TEXT,
                        user_query TEXT,
                        llm_response TEXT,
                        model TEXT,
                        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP)
                        ''')
  conn.commit()
  conn.close()

def insert_application_logs(session_id, user_query, llm_response, model):
  conn = get_db_connection()
  conn.execute('INSERT INTO application_logs (session_id, user_query, llm_response, model) VALUES (?, ?, ?, ?)',
               (session_id, user_query, llm_response, model))
  conn.commit()
  conn.close()

def get_chat_history(session_id):
  # Return one session's turns, oldest first, as role/content dicts.
  conn = get_db_connection()
  # CURRENT_TIMESTAMP has one-second resolution, so id breaks ties between rows logged in the same second.
  cursor = conn.execute('SELECT * FROM application_logs WHERE session_id = ? ORDER BY created_at, id', (session_id,))
  messages = []
  for row in cursor.fetchall():
    messages.extend([
        {"role":"human", "content": row['user_query']},
        {"role":"ai", "content": row['llm_response']}
    ])
  conn.close()
  return messages

# Initialize the database
create_application_logs()
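To see how one table can serve several users, the round trip can be tried on a throwaway database. This is a minimal sketch under assumptions: it uses an in-memory SQLite database instead of `rag_app.db`, the helper names `log_turn` and `load_history` and the model name `demo-model` are hypothetical, and rows are ordered by the autoincrementing `id`.

```python
import sqlite3
import uuid

# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute(
    """CREATE TABLE application_logs (
           id INTEGER PRIMARY KEY AUTOINCREMENT,
           session_id TEXT,
           user_query TEXT,
           llm_response TEXT,
           model TEXT,
           created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP)"""
)

def log_turn(session_id, user_query, llm_response, model="demo-model"):
    """Hypothetical helper: store one question/answer turn for a session."""
    conn.execute(
        "INSERT INTO application_logs (session_id, user_query, llm_response, model) "
        "VALUES (?, ?, ?, ?)",
        (session_id, user_query, llm_response, model),
    )
    conn.commit()

def load_history(session_id):
    """Hypothetical helper: return one session's turns as role/content dicts."""
    rows = conn.execute(
        "SELECT user_query, llm_response FROM application_logs "
        "WHERE session_id = ? ORDER BY id",
        (session_id,),
    ).fetchall()
    history = []
    for row in rows:
        history.append({"role": "human", "content": row["user_query"]})
        history.append({"role": "ai", "content": row["llm_response"]})
    return history

# Two users write to the same table, interleaved.
user_a, user_b = str(uuid.uuid4()), str(uuid.uuid4())
log_turn(user_a, "question A1", "answer A1")
log_turn(user_b, "question B1", "answer B1")
log_turn(user_a, "question A2", "answer A2")

history_a = load_history(user_a)
history_b = load_history(user_b)
print(len(history_a), len(history_b))
```

Filtering on `session_id` is what keeps the two conversations apart even though their rows are interleaved in the table.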

In [89]:
# USER 1

import uuid
session_id = str(uuid.uuid4())

chat_history = get_chat_history(session_id)
print(chat_history)

question1 = "Where is this company available?"
answer1 = rag_chain.invoke({"input": question1, "chat_history": chat_history})['answer']
insert_application_logs(session_id, question1, answer1, google_llm.model)

chat_history.extend([
    {"role":"human", "content": question1},
    {"role":"ai", "content": answer1}
])
pprint(chat_history[-1])

[]
{'content': "That depends on which company you're asking about.  To answer "
            'your question, please specify the company name (TechWave '
            'Innovations or GreenFields BioTech).',
 'role': 'ai'}


In [91]:
question2 = "What does it do?"
# Reusing the same session_id reloads this user's stored history, so the new question is answered in context
chat_history = get_chat_history(session_id)
answer2 = rag_chain.invoke({"input": question2, "chat_history": chat_history})['answer']
insert_application_logs(session_id, question2, answer2, google_llm.model)

pprint(answer2)

('Please specify which company you are asking about.  I need to know whether '
 'you want to know about TechWave Innovations or GreenFields BioTech.')


In [92]:
chat_history = get_chat_history(session_id)
pprint(chat_history)

[{'content': 'Where is this company available?', 'role': 'human'},
 {'content': "That depends on which company you're asking about.  To answer "
             'your question, please specify the company name (TechWave '
             'Innovations or GreenFields BioTech).',
  'role': 'ai'},
 {'content': 'What does it do?', 'role': 'human'},
 {'content': "Please specify which company you're asking about (TechWave "
             'Innovations or GreenFields BioTech).  I need to know which '
             "company you're referring to in order to describe what it does.",
  'role': 'ai'},
 {'content': 'What does it do?', 'role': 'human'},
 {'content': 'Please specify which company you are asking about.  I need to '
             'know whether you want to know about TechWave Innovations or '
             'GreenFields BioTech.',
  'role': 'ai'}]


In [93]:
chat_history = get_chat_history(session_id)
print(chat_history)

question1 = "It is GreenFields BioTech"
answer1 = rag_chain.invoke({"input": question1, "chat_history": chat_history})['answer']
insert_application_logs(session_id, question1, answer1, google_llm.model)

chat_history.extend([
    {"role":"human", "content": question1},
    {"role":"ai", "content": answer1}
])
pprint(chat_history[-1])

[{'role': 'human', 'content': 'Where is this company available?'}, {'role': 'ai', 'content': "That depends on which company you're asking about.  To answer your question, please specify the company name (TechWave Innovations or GreenFields BioTech)."}, {'role': 'human', 'content': 'What does it do?'}, {'role': 'ai', 'content': "Please specify which company you're asking about (TechWave Innovations or GreenFields BioTech).  I need to know which company you're referring to in order to describe what it does."}, {'role': 'human', 'content': 'What does it do?'}, {'role': 'ai', 'content': 'Please specify which company you are asking about.  I need to know whether you want to know about TechWave Innovations or GreenFields BioTech.'}]
{'content': 'GreenFields BioTech is a company headquartered in Zurich, '
            'Switzerland.  It conducts groundbreaking research in sustainable '
            'agriculture and biotechnology, focusing on creating eco-friendly '
            'farming solutions

In [94]:
# Generate a new session ID for a new user; this session's history starts empty
session_id = str(uuid.uuid4())

chat_history = get_chat_history(session_id)
print(chat_history)

question = "When was JPMC founded?"
answer = rag_chain.invoke({"input": question, "chat_history": chat_history})['answer']
insert_application_logs(session_id, question, answer, google_llm.model)

chat_history.extend([
    {"role":"human", "content": question},
    {"role": "ai", "content": answer}
])

pprint(chat_history)

[]
[{'content': 'When was JPMC founded?', 'role': 'human'},
 {'content': 'This question cannot be answered from the given context.  The '
             'provided text is about TechWave Innovations and its EcoHarvest '
             'system; it contains no information about JPMorgan Chase (JPMC).',
  'role': 'ai'}]
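Each cell above repeats the same three steps: load the session's history, invoke the chain, and log the turn. One way to package that pattern is a small wrapper. The sketch below is an illustration only: `ask`, `EchoChain`, `store`, and the in-memory `load_history` and `save_turn` are hypothetical stand-ins so the example runs without an LLM or a database.

```python
def ask(chain, session_id, question, load_history, save_turn, model_name):
    """Load the session history, invoke the chain, persist the turn, return the answer."""
    chat_history = load_history(session_id)
    answer = chain.invoke({"input": question, "chat_history": chat_history})["answer"]
    save_turn(session_id, question, answer, model_name)
    return answer

class EchoChain:
    """Hypothetical stand-in for a retrieval chain: reports which turn it is answering."""
    def invoke(self, inputs):
        turns_so_far = len(inputs["chat_history"]) // 2
        return {"answer": f"turn {turns_so_far + 1}: {inputs['input']}"}

# Hypothetical in-memory store keyed by session ID.
store = {}

def load_history(session_id):
    return list(store.get(session_id, []))

def save_turn(session_id, question, answer, model_name):
    store.setdefault(session_id, []).extend([
        {"role": "human", "content": question},
        {"role": "ai", "content": answer},
    ])

chain = EchoChain()
first = ask(chain, "s1", "Where is this company available?", load_history, save_turn, "demo-model")
second = ask(chain, "s1", "What does it do?", load_history, save_turn, "demo-model")
other = ask(chain, "s2", "When was JPMC founded?", load_history, save_turn, "demo-model")
print(first)
print(second)
print(other)
```

In the notebook, the same wrapper could presumably be called as `ask(rag_chain, session_id, question, get_chat_history, insert_application_logs, google_llm.model)`. Passing the helpers as arguments keeps the wrapper testable without a live model.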
