## Indian Banking and Finance Report QA Chatbot: Using LangChain | Pinecone | Redis | OpenAI

In [1]:
import os
from dotenv import load_dotenv
from functools import lru_cache

# Load variables from .env before reading any of them.
load_dotenv()

PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

# This is needed to enable LangSmith logging/tracing for chains, prompts, and LLM calls.
os.environ['LANGCHAIN_API_KEY'] = os.getenv('LANGCHAIN_API_KEY')

# Setting this to 'true' records all chain/agent/LLM activity for debugging and evaluation.
os.environ['LANGCHAIN_TRACING_V2'] = 'true'

# This groups traces under one project so they can be filtered in the LangSmith dashboard.
# Environment variable names are case-sensitive on Linux/macOS, so the key is upper-case.
os.environ['LANGCHAIN_PROJECT'] = os.getenv('LANGCHAIN_PROJECT')
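
Assigning the result of `os.getenv(...)` straight into `os.environ[...]` raises a `TypeError` when the variable is unset, because environment values must be strings. A minimal sketch of a fail-fast check is shown below; `require_env` is a hypothetical helper introduced here for illustration, not part of `dotenv` or LangChain.

```python
import os


def require_env(*names: str) -> dict:
    """Return the named environment variables, failing fast if any is unset or empty.

    Hypothetical helper: call it after load_dotenv() so .env values are visible.
    """
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in names}
```

Calling something like `require_env("OPENAI_API_KEY", "LANGCHAIN_API_KEY")` right after `load_dotenv()` would surface a misconfigured `.env` file in one clear message instead of a later failure inside a chain.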

### Utilize OpenAI embeddings to convert text chunks into vectors

In [2]:
from langchain.embeddings import OpenAIEmbeddings

@lru_cache()
def get_embeddings():
    """
    Lazily load and cache the OpenAI Embeddings instance.
    Only initialized once per process.
    """
    return OpenAIEmbeddings(openai_api_key=os.getenv("OPENAI_API_KEY"))

embeddings = get_embeddings()

In [3]:
from langchain.chat_models import ChatOpenAI

@lru_cache()
def get_llm():
    """
    Lazily load and cache the LLM instance.
    Only initialized once per process.
    """
    return ChatOpenAI(
        temperature=0,
        model_name="gpt-3.5-turbo",
    )
    
llm = get_llm()

### Load and Split Documents into Chunks

In [4]:
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = PyPDFLoader('India Banking and Finance Report 2024.pdf')
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100,
    separators=["\n\n", "\n", ".", " ", ""]
)

text_docs = text_splitter.split_documents(documents)
# Add metadata for source type
for doc in text_docs:
    doc.metadata["source"] = "text"

text_docs[:5]

[Document(metadata={'producer': 'Adobe PDF Library 15.0', 'creator': 'Adobe InDesign CC 2017 (Windows)', 'creationdate': '2024-11-13T13:26:09+05:30', 'moddate': '2024-12-02T14:37:36+05:30', 'trapped': '/False', 'source': 'text', 'total_pages': 248, 'page': 0, 'page_label': 'i'}, page_content='IBFR 2024 is published by Academic Foundation in association with NIBM, Pune and is available for purchase from Amazon\nNATIONAL INSTITUTE OF \nBANK MANAGEMENT\n(NIBM), PUNE\nINDIA BANKING \nAND FINANCE  \nREPORT 2024\nEdited by\nPARTHA RAY\nARINDAM BANDYOPADHYAY\nSANJAY BASU\nINR 1495 (Ind sub)\nUS$ 79.95 (overseas)\nISBN 978-93-327-0655-2\nINR 2595 (Ind sub)\nUS$ 89.95 (overseas)\nNATIONAL INSTITUTE OF \nBANK MANAGEMENT\n(NIBM), PUNEwww.nibmindia.org\nNIBM\nPARTHA RAY\nARINDAM BANDYOPADHYAYSANJAY BASU\nI\nBFR has become the flagship annual publication of NIBM \nPune eagerly awaited by the practitioners, experts, regula-\ntors and students of financial services and insurance \nsector. IBFR 2024 h
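
The `chunk_size` and `chunk_overlap` settings control how much text each chunk holds and how much text consecutive chunks share. The sketch below illustrates only that windowing idea with a fixed-width slider. It is not how `RecursiveCharacterTextSplitter` works internally, since the real splitter also tries to break on the listed separators. `sliding_chunks` is a hypothetical helper, not part of LangChain.

```python
def sliding_chunks(text: str, chunk_size: int = 800, chunk_overlap: int = 100) -> list:
    """Split text into fixed-size windows where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


# Tiny demo: windows of 4 characters that overlap by 1.
print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# → ['abcd', 'defg', 'ghij']
```

The overlap means a sentence that straddles a chunk boundary still appears whole, or nearly whole, in at least one chunk, which tends to help retrieval.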

### Extract Tables from a PDF with pdfplumber

In [5]:
import pdfplumber
import pandas as pd

def extract_tables_from_pdf(pdf_path: str):
    all_tables = []
    
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, start=1):
            tables = page.extract_tables()
            for table_num, table in enumerate(tables):
                if table:  # Skip empty tables
                    df = pd.DataFrame(table[1:], columns=table[0])
                    df["page"] = page_num
                    df["table_number"] = table_num + 1
                    all_tables.append(df)
    
    return all_tables  # List of DataFrames

In [6]:
extracted_tables = extract_tables_from_pdf('India Banking and Finance Report 2024.pdf')

In [7]:
# Display first table
if extracted_tables:
    print(extracted_tables[0].head())
else:
    print("No tables found.")

Empty DataFrame
Columns: [m of articles on the Indian and global
, across sixteen chapters: from latest
talization of Trade and regulatory contours
NIBM MEs. The discussion, in each chapter, is
e the earlier editions, the style remains
n integrated and volatile world of banking INDIA
nd the key lessons for the future.
BANKING
d in the papers on penal action by supervisors,
om global bank failure and deft analysis of
off-beat embedded finance options... All in all,
ly readers’ privilege and BFSI’s pride!
umar Das Director, Institute of Insurance and
AND
ement (IIRM), Hyderabad; former Managing Director
ank of India
FINANCE
ed to see that NIBM has come out with its Annual
and Finance Report 2024. The issue carries well
articles by the faculty... and the scholarship is
ent. I am sure that the banking fraternity would
y benefit from these well-articulated research
REPORT
I am particularly happy that the rich volume
plied research papers on the emerging and critical
ESG, Dynamics of ECL, an
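
The first extracted table above has column names but no rows, which suggests that a block of running text on the cover was detected as a table. One possible way to screen out such results is a small heuristic applied to the raw list-of-rows structure that `page.extract_tables()` returns. `looks_like_table` and its thresholds are assumptions made for illustration and would need tuning against the actual report.

```python
from typing import List, Optional

# pdfplumber returns each table as a list of rows; each row is a list of cell strings or None.
Table = List[List[Optional[str]]]


def looks_like_table(table: Table, min_rows: int = 2, min_cols: int = 2) -> bool:
    """Heuristic: require a header row, at least one data row, and some non-empty data cell."""
    if not table or len(table) < min_rows:
        return False
    if max(len(row) for row in table) < min_cols:
        return False
    data_rows = table[1:]
    return any(cell and cell.strip() for row in data_rows for cell in row)
```

Inside `extract_tables_from_pdf`, the `if table:` check could be replaced with `if looks_like_table(table):` so that header-only or single-column detections never reach the vector store.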

### Convert Extracted Tables to JSON Format

In [8]:
import json

def convert_tables_to_json(tables: list):
    json_tables = []

    for i, df in enumerate(tables):
        # Work on a copy so the caller's DataFrames are not modified
        df = df.copy()
        # Clean column names, then drop empty rows/columns and fill remaining NaNs
        df.columns = [str(col).strip() for col in df.columns]
        df = df.dropna(how='all').dropna(axis=1, how='all').fillna("")

        json_data = df.to_dict(orient="records")

        # Page is None for tables with no usable rows
        has_page = "page" in df.columns and not df.empty
        page = int(df["page"].iloc[0]) if has_page else None

        # Wrap with metadata
        json_tables.append({
            "table_id": f"table_{i+1}",
            "page": page,
            "data": json_data
        })

    return json_tables
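
Because `to_dict(orient="records")` builds one dictionary per row, two columns with the same name cannot both survive: only one value per name is kept. Header rows extracted from PDFs often contain repeated or empty cells, so a sketch of a header-cleaning step is given below. `make_unique_headers` is a hypothetical helper, and the `col_<n>` and `_<n>` naming scheme is an arbitrary choice.

```python
from typing import List, Optional


def make_unique_headers(headers: List[Optional[str]]) -> List[str]:
    """Normalise whitespace, name empty headers, and suffix repeats so every name is distinct."""
    seen = {}
    unique = []
    for position, raw in enumerate(headers):
        # Collapse newlines/extra spaces; fall back to a positional name for empty cells.
        name = " ".join(str(raw).split()) if raw else f"col_{position}"
        count = seen.get(name, 0)
        seen[name] = count + 1
        unique.append(name if count == 0 else f"{name}_{count + 1}")
    return unique


print(make_unique_headers([None, "O/N", "O/N", "Term\nMoney", "O/N"]))
# → ['col_0', 'O/N', 'O/N_2', 'Term Money', 'O/N_3']
```

The header row `table[0]` could be passed through this function before it is used as `columns=` when building each DataFrame. The sketch does not guard against a real header that already looks like a generated name such as `O/N_2`.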

In [9]:
# Convert extracted tables
json_formatted_tables = convert_tables_to_json(extracted_tables)


### Convert Tables to LangChain Documents

In [10]:
from langchain.schema import Document

def table_to_documents(json_tables: list):
    docs = []
    for table in json_tables:
        # Convert table data (list of dicts) back to a DataFrame
        df = pd.DataFrame(table["data"])
        
        # Convert to markdown (GPT-friendly format)
        markdown = df.to_markdown(index=False)
        doc = Document(
            page_content=f"Table ID: {table['table_id']} (Page {table['page']})\n\n{markdown}",
            metadata={
                "source": "table",
                "table_id": table["table_id"],
                "page": table["page"]
            }
        )
        docs.append(doc)
    return docs

In [11]:
table_docs = table_to_documents(json_formatted_tables)
table_docs

[Document(metadata={'source': 'table', 'table_id': 'table_1', 'page': None}, page_content='Table ID: table_1 (Page None)\n\n'),
 Document(metadata={'source': 'table', 'table_id': 'table_2', 'page': None}, page_content='Table ID: table_2 (Page None)\n\n'),
 Document(metadata={'source': 'table', 'table_id': 'table_3', 'page': None}, page_content='Table ID: table_3 (Page None)\n\n'),
 Document(metadata={'source': 'table', 'table_id': 'table_4', 'page': None}, page_content='Table ID: table_4 (Page None)\n\n'),
 Document(metadata={'source': 'table', 'table_id': 'table_5', 'page': 28}, page_content='Table ID: table_5 (Page 28)\n\n| None   | O/N   | O/N     | O/N      | O/N   | Term     | Term   | Notice   | Term    | 3M   | 3M    | 3Ma i   | bl    | 12M   | 1Y    | 5Y    | 10Y   |   page |   table_number |\n|        | TRE   | Mkt     | Call     | MI-   | TREPS    | Mkt    | Money    | Money   | TB   | GOI   | v       | a     | CD    | GOI   | GOI   | GOI   |        |                |\n|     

### Merge Text Chunk Documents with Table Documents

In [12]:
all_docs = text_docs + table_docs  # combine both text and table docs
all_docs[:5]

[Document(metadata={'producer': 'Adobe PDF Library 15.0', 'creator': 'Adobe InDesign CC 2017 (Windows)', 'creationdate': '2024-11-13T13:26:09+05:30', 'moddate': '2024-12-02T14:37:36+05:30', 'trapped': '/False', 'source': 'text', 'total_pages': 248, 'page': 0, 'page_label': 'i'}, page_content='IBFR 2024 is published by Academic Foundation in association with NIBM, Pune and is available for purchase from Amazon\nNATIONAL INSTITUTE OF \nBANK MANAGEMENT\n(NIBM), PUNE\nINDIA BANKING \nAND FINANCE  \nREPORT 2024\nEdited by\nPARTHA RAY\nARINDAM BANDYOPADHYAY\nSANJAY BASU\nINR 1495 (Ind sub)\nUS$ 79.95 (overseas)\nISBN 978-93-327-0655-2\nINR 2595 (Ind sub)\nUS$ 89.95 (overseas)\nNATIONAL INSTITUTE OF \nBANK MANAGEMENT\n(NIBM), PUNEwww.nibmindia.org\nNIBM\nPARTHA RAY\nARINDAM BANDYOPADHYAYSANJAY BASU\nI\nBFR has become the flagship annual publication of NIBM \nPune eagerly awaited by the practitioners, experts, regula-\ntors and students of financial services and insurance \nsector. IBFR 2024 h

In [13]:
metadatas = [doc.metadata for doc in all_docs]
metadatas[:1]

[{'producer': 'Adobe PDF Library 15.0',
  'creator': 'Adobe InDesign CC 2017 (Windows)',
  'creationdate': '2024-11-13T13:26:09+05:30',
  'moddate': '2024-12-02T14:37:36+05:30',
  'trapped': '/False',
  'source': 'text',
  'total_pages': 248,
  'page': 0,
  'page_label': 'i'}]

In [14]:
# Ensure there are no None values in metadata
def clean_metadata(metadata):
    return {key: (value if value is not None else "") for key, value in metadata.items()}
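
`clean_metadata` handles `None`, but metadata can also carry values such as lists or nested dictionaries. Vector stores commonly accept only simple scalar metadata values, so a slightly broader variant is sketched here under that assumption. `sanitize_metadata` is a hypothetical helper, and converting other types with `str()` is one possible policy among several.

```python
from typing import Any, Dict

# Scalar types assumed to be safe as metadata values.
_PRIMITIVES = (str, int, float, bool)


def sanitize_metadata(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Replace None with "" and stringify any value that is not a simple scalar."""
    cleaned = {}
    for key, value in metadata.items():
        if value is None:
            cleaned[key] = ""
        elif isinstance(value, _PRIMITIVES):
            cleaned[key] = value
        else:
            cleaned[key] = str(value)
    return cleaned


print(sanitize_metadata({"page": None, "total_pages": 248, "tags": ["a", "b"]}))
# → {'page': '', 'total_pages': 248, 'tags': "['a', 'b']"}
```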

In [15]:
from langchain.vectorstores import Chroma

vector_db = Chroma(persist_directory="indian_banking_fin_report_index", embedding_function=embeddings)

vector_db.add_texts(
    texts=[doc.page_content for doc in all_docs],
    metadatas=[clean_metadata(doc.metadata) for doc in all_docs]
)

vector_db.persist()


### Prompt Engineering

In [16]:
# from langchain.chat_models import ChatOpenAI
# llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo")

In [17]:
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=vector_db.as_retriever(),
    llm=llm,
    include_original=True
)

In [18]:
# Utility to classify whether a query should use table or text data
def is_table_query(query: str) -> bool:
    table_keywords = ["table", "figure", "amount", "data", "statistics", "reserves", "value", "year", "percentage", "increase", "decrease"]
    return any(keyword.lower() in query.lower() for keyword in table_keywords)
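
Plain substring matching also fires when a keyword is buried inside another word: "stable" contains "table" and "paramount" contains "amount". A possible refinement, sketched below, anchors each keyword at the start of a word with a regular expression while still allowing endings such as "tables" or "increased". `is_table_query_strict` is a hypothetical name, and the keyword list is copied from the function above.

```python
import re

TABLE_KEYWORDS = [
    "table", "figure", "amount", "data", "statistics", "reserves",
    "value", "year", "percentage", "increase", "decrease",
]

# \b before the group requires the keyword to begin at a word boundary;
# no trailing \b, so inflected forms like "tables" or "increased" still match.
_TABLE_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in TABLE_KEYWORDS) + r")",
    re.IGNORECASE,
)


def is_table_query_strict(query: str) -> bool:
    """Return True when any table keyword starts a word in the query."""
    return _TABLE_PATTERN.search(query) is not None


print(is_table_query_strict("Is the banking sector stable?"))  # → False
print(is_table_query_strict("How has credit increased?"))      # → True
```

This remains a keyword heuristic, so queries that ask for numbers without using any listed word would still be routed to the default retriever.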

In [19]:
def get_retriever(query: str):
    if is_table_query(query):
        return vector_db.as_retriever(search_kwargs={"filter": {"source": "table"}})
    else:
        return vector_db.as_retriever()

In [20]:
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.output_parsers import StrOutputParser
from langchain.schema.runnable import RunnableMap

# template with memory placeholder
chat_prompt = ChatPromptTemplate.from_messages([
   ("system",  """You are an expert assistant trained on the India Banking and Finance Report (IBFR) 2024.
Always base your answers only on the given context.
Return exact values if mentioned, especially for numerical queries like inflation, growth rates, etc.
If the user's question has an answer in the context, quote it clearly.
If not, reply: "The report does not specify this."
"""),
   MessagesPlaceholder(variable_name="chat_history"),
   ("user", "Context:\n{context}\n\nQuestion: {input}")
])

parser = StrOutputParser()

# Build the chain: collect the input, chat history, and retrieved context, then pipe them through prompt -> LLM -> parser
context_chain = RunnableMap({
    "input": lambda x: x["input"],
    "chat_history": lambda x: x.get("chat_history", []),
    "context": lambda x: "\n\n".join(
        doc.page_content for doc in multi_query_retriever.invoke(x["input"])
    )
}) | chat_prompt | llm | parser


In [21]:
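import redis

# REDIS_HOST is expected to hold a full Redis URL (e.g. redis://host:port), since it is passed to from_url
REDIS_URL = os.getenv('REDIS_HOST')
# Create a Redis client from the URL
r = redis.Redis.from_url(REDIS_URL, decode_responses=True)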
try:
    # Test the connection
    r.set('foo', 'bar')
    print(r.get('foo'))  # Expected output: 'bar'
except Exception as e:
    print("Redis connection failed:", e)

bar


In [22]:
from langchain_community.chat_message_histories import RedisChatMessageHistory

# Return a Redis-backed chat history for the given session_id
def get_redis_history(session_id: str):
    return RedisChatMessageHistory(
        session_id=session_id,
        url=REDIS_URL
    )

In [23]:
runnable = RunnableWithMessageHistory(
    context_chain,
    get_redis_history,
    input_messages_key="input",
    history_messages_key="chat_history"
)

In [24]:
def get_chat_response(user_input: str, session_id: str) -> str:
    """
    Invoke the history-aware LangChain runnable with the user's input and session ID.
    
    Args:
        user_input(str): The user's query or message
        session_id(str): Unique session identifier for tracking conversation history
    
    Returns:
        str: LLM-generated response
    """
    try:
        response = runnable.invoke(
            {"input": user_input},
            config={"configurable": {"session_id": session_id}}
        )
        return response
    except Exception as e:
        return f"An error occurred: {e}"


In [31]:
response = get_chat_response("who is the Associate Professor of NIBM", "user_45d4")
print(response)

The Associate Professors at NIBM mentioned in the context are Richa Verma Bajaj, Shomi Srivastava, Smita Roy Trivedi, Tasneem Chherawala, Gargi Sanati, and Dipali Krishnakumar.


In [26]:
response = get_chat_response("what was the inflation in April 2022", "user_1")
print(response)

In April 2022, the CPI inflation peaked at 7.8% and the core CPI inflation peaked at 6.95%.


In [27]:
results = vector_db.similarity_search("Director of NIBM", k=3)
for doc in results:
    print(doc.page_content)

NIBM
 
 Her research interests lie in interna-
tional economics, central banking, foreign 
exchange market, macro prudential measures 
and technical analysis for markets
 
Tarun Agarwal is Ex-Director, National Insur -
ance Academy, Pune
 
Tasneem Chherawala is Associate Professor at 
NIBM
 
 Her domain expertise is in the areas of 
risk modelling and management, Basel, IFRS, 
financial derivatives, project finance and struc-
tured finance, in which she conducts executive 
trainings and PGDM courses
at NIBM
 
 He is a Chartered Accountant with 
specialization in Audit, Compliance and Trade 
Finance
 
Richa Verma Bajaj  is Associate Professor at 
NIBM
 
 She has teaching, research and consult -
ing experience of fifteen years in the area of 
Risk Management
 
Shomi Srivastava  is Associate Professor at 
NIBM
 
 His areas of specialization are Leader -
ship, Human Resource Management, Discipline 
Management, Preventive Vigilance, Change 
Management and Organizational Develop-
ment
 
Shru