In [1]:
!git clone https://github.com/jaZabcd/FinTech-Multi-Role-ChatBot-Advance-Rag-.git
%cd FinTech-Multi-Role-ChatBot-Advance-Rag-

Cloning into 'FinTech-Multi-Role-ChatBot-Advance-Rag-'...
remote: Enumerating objects: 25, done.
remote: Counting objects: 100% (25/25), done.
remote: Compressing objects: 100% (21/21), done.
remote: Total 25 (delta 0), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (25/25), 45.14 KiB | 711.00 KiB/s, done.
/content/FinTech-Multi-Role-ChatBot-Advance-Rag-


In [19]:
!pip install -U -q langchain langchain-community google-generativeai chromadb langchain-google-genai rank_bm25

In [4]:
from google.colab import userdata
import os

GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')

os.environ['GOOGLE_API_KEY'] = GOOGLE_API_KEY

In [8]:
import re

def clean_markdown(text):
    text = re.sub(r"^#+ ", "", text, flags=re.MULTILINE)  # Remove header markers at line start only
    text = re.sub(r"`{3}.*?`{3}", "", text, flags=re.DOTALL)  # Remove fenced code blocks
    text = re.sub(r"^[ \t]*[-*] ", "", text, flags=re.MULTILINE)  # Remove bullet markers at line start only
    text = re.sub(r"\n{2,}", "\n", text)  # Collapse runs of blank lines
    return text.strip()

def remove_boilerplate(text):
    lines = text.splitlines()
    clean_lines = [line for line in lines if not re.search(r"(©|last updated|http|Page \d+)", line, re.IGNORECASE)]
    return "\n".join(clean_lines)

def normalize_text(text):
    return " ".join(text.split())

def preprocess_text(text):
    text = clean_markdown(text)
    text = remove_boilerplate(text)
    text = normalize_text(text)
    return text
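
The effect of anchoring a bullet-stripping pattern to the start of a line can be checked on a small string. This is a minimal sketch; the sample text and the anchored variant are illustrative assumptions, not taken from the project data.

```python
import re

# Hypothetical sample: two bullets plus a sentence containing " - ".
sample = "- first item\nrange 10 - 20\n* second item"

# Unanchored: removes "- " or "* " wherever it occurs, including mid-sentence.
unanchored = re.sub(r"[-*] ", "", sample)

# Anchored (assumed variant): removes the marker only when it begins a line.
anchored = re.sub(r"^[ \t]*[-*] ", "", sample, flags=re.MULTILINE)

print(repr(unanchored))
print(repr(anchored))
```

The unanchored pattern also eats the hyphen in `10 - 20`, which matters for financial text that contains ranges.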

In [21]:
from langchain_core.documents import Document
from langchain_community.document_loaders import TextLoader
import pandas as pd


def load_mdfiles(path, role):
    for file in os.listdir(path):
        if file.endswith(".md"):
            loader = TextLoader(os.path.join(path, file))
            for doc in loader.load():
                raw_text = doc.page_content
                cleaned = preprocess_text(raw_text)
                yield Document(page_content=cleaned, metadata={"source": file, "role": role})

def load_csvfiles(path: str, role: str):
    for file in os.listdir(path):
        if file.endswith(".csv"):
            df = pd.read_csv(os.path.join(path, file))

            # Emit one Document per CSV row, rendered as "column: value" pairs
            for _, row in df.iterrows():
                row_text = "\n".join(f"{col}: {row[col]}" for col in df.columns)
                cleaned = preprocess_text(row_text)
                yield Document(
                    page_content=cleaned,
                    metadata={"source": file, "role": role}
                )


In [22]:
# Collect all docs
all_docs = []

# Each loader is a generator: it reads files lazily and can be consumed only once
role_loaders = {
    "marketing": load_mdfiles("data/marketing", "marketing"),
    "engineering": load_mdfiles("data/engineering", "engineering"),
    "general": load_mdfiles("data/general", "general"),
    "finance": load_mdfiles("data/finance", "finance"),
    "hr": load_csvfiles("data/hr", "hr"),
}

# Merge all documents
for role, docs in role_loaders.items():
    all_docs.extend(docs)
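
A quick sanity check after merging is to count documents per role, since a mistyped directory would otherwise go unnoticed until retrieval returns nothing. The sketch below is self-contained and uses a hypothetical `Doc` stand-in for LangChain's `Document`; `count_by_role` is an illustrative helper, not part of the notebook.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in with the same two fields as a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def count_by_role(docs):
    """Return how many documents carry each `role` metadata value."""
    return Counter(d.metadata.get("role", "<missing>") for d in docs)


# Made-up documents for illustration only.
demo_docs = [
    Doc("a", {"role": "finance"}),
    Doc("b", {"role": "hr"}),
    Doc("c", {"role": "finance"}),
]
role_counts = count_by_role(demo_docs)
print(role_counts)
```

In the notebook the same helper could be called as `count_by_role(all_docs)`, because it only reads `metadata`.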

In [23]:
all_docs[0].page_content

'Comprehensive Marketing Report Q1 2024 Executive Summary Q1 2024 marked a foundational quarter for FinNova, as we focused on building robust marketing infrastructure to support aggressive expansion and enhance customer acquisition channels. This report details our marketing strategies, performance metrics, and strategic objectives, emphasizing our efforts to expand into Europe, launch the InstantPay feature, and boost social media engagement. With a $2 million marketing spend, we achieved significant milestones, setting a strong trajectory for the remainder of 2024. Q1 Marketing Overview In Q1 2024, FinNova prioritized establishing a scalable framework for growth, with a focus on strengthening customer acquisition channels and enhancing brand visibility. Key initiatives included: **European Market Entry**: Launched targeted campaigns in the UK, Germany, and France to build brand awareness and capture market share. **InstantPay Launch**: Introduced the InstantPay feature, a seamless pa

In [24]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_community.vectorstores import Chroma


text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

# Embedding model
embedding_model = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

# Split all docs
split_docs = text_splitter.split_documents(all_docs)

# Store in one collection
vectordb = Chroma.from_documents(
    documents=split_docs,
    embedding=embedding_model,
    persist_directory="chroma_store_unified",
    collection_name="role_based_docs"
)

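# Chroma 0.4+ writes to persist_directory automatically;
# call persist() only on versions that still expose it.
if hasattr(vectordb, "persist"):
    vectordb.persist()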
print(f"Stored total chunks: {len(split_docs)}")



Stored total chunks: 305
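
To build intuition for `chunk_size=500` and `chunk_overlap=50`, here is a deliberately simplified fixed-window chunker. It is an assumption-laden sketch: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and spaces, so its real chunks will differ from these. `sliding_chunks` is a hypothetical name.

```python
def sliding_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 50):
    """Cut text into fixed windows where each window repeats the previous tail."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# 1200 characters of a repeating alphabet, purely for illustration.
demo_text = "".join(chr(97 + i % 26) for i in range(1200))
demo_chunks = sliding_chunks(demo_text)
print([len(c) for c in demo_chunks])
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of some duplicated text in the index.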


In [25]:
from rank_bm25 import BM25Okapi
from typing import List
from langchain_core.documents import Document

# Global mapping to match tokenized doc index with original doc
tokenized_docs = []
original_docs = []

def create_bm25_index(documents: List[Document]) -> BM25Okapi:
    global tokenized_docs, original_docs
    tokenized_docs = [doc.page_content.split() for doc in documents]
    original_docs = documents
    return BM25Okapi(tokenized_docs)

bm25 = create_bm25_index(split_docs)
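
Whitespace splitting is case-sensitive and keeps punctuation attached, so `Budget,` in a document would not match `budget` in a query. One possible alternative is a lowercasing regex tokenizer, sketched below; `tokenize` is a hypothetical helper and would have to be applied to both the documents and the query for scores to line up.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep only runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


tokens = tokenize("Budget forecast, Q1-2024: $2 million!")
print(tokens)
```

This drops currency symbols and splits hyphenated terms, which may or may not be desirable for financial data; treat it as a starting point.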

In [38]:
def hybrid_search(query: str, role: str, top_k=5, alpha=0.5):
    """
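    Rank one role's chunks by a weighted sum of BM25 and vector scores.

    alpha: weight on the vector score (0 to 1)
    1.0 = only vector, 0.0 = only BM25

    Note: the two scores are combined as-is, without rescaling. Raw BM25
    scores are unbounded, so they can outweigh the vector term even when
    alpha is above 0.5.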
    """
    # Step 1: BM25 lexical match
    query_tokens = query.split()
    bm25_scores = bm25.get_scores(query_tokens)

    # Step 2: Vector similarity match from Chroma
    vector_results = vectordb.similarity_search_with_score(query, k=len(split_docs), filter={"role": role})

    # Step 3: Combine raw scores and rerank (no normalization is applied here)
    score_map = {}

    # Add BM25 scores
    for i, score in enumerate(bm25_scores):
        doc = original_docs[i]
        if doc.metadata["role"] == role:
            score_map[doc.page_content] = {"bm25": score, "vector": 0}

    # Add vector scores
    for doc, vscore in vector_results:
        if doc.metadata["role"] == role:
            if doc.page_content in score_map:
                score_map[doc.page_content]["vector"] = 1 - vscore  # lower vscore = more similar
            else:
                score_map[doc.page_content] = {"bm25": 0, "vector": 1 - vscore}

    # Weighted hybrid score
    reranked = sorted(
        score_map.items(),
        key=lambda x: alpha * x[1]["vector"] + (1 - alpha) * x[1]["bm25"],
        reverse=True
    )

    """print(f"\n{'-'*30} HYBRID RESULTS {'-'*30}")
    for text, scores in reranked[:top_k]:
        print(f"Role: {role}")
        print(f"BM25 Score:  {scores['bm25']:.4f}")
        print(f"Vector Score: {scores['vector']:.4f}")
        print(f"Combined Score: {alpha * scores['vector'] + (1 - alpha) * scores['bm25']:.4f}")
        print(f"Content: {text[:200]}...\n")"""

    return [Document(page_content=text, metadata={"role": role}) for text, _ in reranked[:top_k]]

In [27]:
results = hybrid_search("budget forecast and spending trends", role="finance", top_k=3, alpha=0.6)

for r in results:
    print(f"[{r.metadata['role']}] {r.page_content}\n")


------------------------------ HYBRID RESULTS ------------------------------
Role: finance
BM25 Score:  6.9140
Vector Score: 0.2299
Combined Score: 2.9035
Content: & HR**: $47 million, reflecting investments in employee development and retention programs. **Software Subscriptions**: $47 million, maintaining high operational tech costs to support scalability. **O...

Role: finance
BM25 Score:  2.1491
Vector Score: 0.3012
Combined Score: 1.0404
Content: Detailed tables and charts are available in the financial statements section of the main report, showcasing YoY comparisons, expense breakdowns, and key ratios....

Role: finance
BM25 Score:  2.2921
Vector Score: 0.1985
Combined Score: 1.0359
Content: R&D, market expansion, and product development. Financing activities provided $110 million to support working capital and long-term growth. Despite risks such as vendor cost inflation and competitive ...

[finance] & HR**: $47 million, reflecting investments in employee development and rete
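
The scores printed above show BM25 values around 2 to 7 next to vector scores below 0.4, so the BM25 term decides the ranking even with `alpha=0.6`. One common remedy is to rescale each score list to the 0 to 1 range before weighting. The sketch below is a possible approach, not the notebook's method; `min_max` and `fuse` are hypothetical helpers, and the inputs are the three score pairs from the printed output.

```python
def min_max(values):
    """Rescale values linearly to [0, 1]; a constant list maps to all zeros."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def fuse(bm25_scores, vector_scores, alpha=0.5):
    """Weighted sum of min-max normalized scores; alpha weights the vector side."""
    b_norm = min_max(bm25_scores)
    v_norm = min_max(vector_scores)
    return [alpha * v + (1 - alpha) * b for b, v in zip(b_norm, v_norm)]


def ranking(scores):
    """Indices of scores sorted from best to worst."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Score pairs copied from the three finance results printed above.
bm25_scores = [6.9140, 2.1491, 2.2921]
vector_scores = [0.2299, 0.3012, 0.1985]

raw = [0.6 * v + 0.4 * b for b, v in zip(bm25_scores, vector_scores)]
fused = fuse(bm25_scores, vector_scores, alpha=0.6)

raw_order = ranking(raw)
fused_order = ranking(fused)
print("raw order:  ", raw_order)
print("fused order:", fused_order)
```

With normalization, the second result (highest vector score) moves ahead of the first, which is the behavior the `alpha` docstring leads a reader to expect. Normalizing over only three values exaggerates the effect; over a full candidate list the shift would usually be smaller.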

In [46]:
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser


llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash")

In [47]:
template = """
You are a helpful assistant for the {role} department. Use the following information to answer the user's question.

Context:
{context}

Question:
{question}

Answer in a professional tone.
"""

prompt = PromptTemplate(
    input_variables=["question", "context", "role"],
    template=template
)

# Compose prompt -> model -> plain string (replaces the deprecated LLMChain)
rag_chain = prompt | llm | StrOutputParser()

In [48]:
def answer_query_with_rag(question: str, role: str, top_k=5, alpha=0.5):
    # Step 1: Hybrid Retrieval
    retrieved_docs = hybrid_search(question, role, top_k=top_k, alpha=alpha)

    # Step 2: Merge context
    context = "\n\n".join([doc.page_content for doc in retrieved_docs])

    # Step 3: Run LLM (invoke returns the answer text via StrOutputParser)
    response = rag_chain.invoke({
        "question": question,
        "context": context,
        "role": role
    })

    return response
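
Because retrieval filters on the `role` string, a misspelled or unknown role produces an empty context and the model is still called. A small guard in front of the pipeline can catch this early. The sketch below is an assumption: `ALLOWED_ROLES` mirrors the five roles used by the loaders in this notebook, and `validate_role` is a hypothetical helper.

```python
# Roles taken from the loader cell; adjust if the data directories change.
ALLOWED_ROLES = {"marketing", "engineering", "general", "finance", "hr"}


def validate_role(role: str) -> str:
    """Normalize a role string and reject anything outside the known set."""
    normalized = role.strip().lower()
    if normalized not in ALLOWED_ROLES:
        raise ValueError(
            f"Unknown role {role!r}; expected one of {sorted(ALLOWED_ROLES)}"
        )
    return normalized


print(validate_role(" Finance "))
```

It could be called at the top of `answer_query_with_rag`, for example `role = validate_role(role)`, before any retrieval happens.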

In [49]:
response = answer_query_with_rag("What is the current year's marketing budget strategy?", role="marketing")
print("\n📢 Final Answer:\n", response)


📢 Final Answer:
 FinSolve Technologies' 2024 marketing budget of $15M prioritized several key strategic areas.  47% ($7M) was allocated to digital marketing, focusing on increasing customer engagement and reducing Customer Acquisition Cost (CAC).  A significant portion of the budget was also dedicated to Public Relations and event organization, supporting brand building and market expansion.  While specific allocations for these areas aren't detailed, the overall strategy aimed to achieve a 15% conversion rate target.  

The current strategy emphasizes:

* **Improved Customer Acquisition:**  Focusing on digital channels, particularly social media (Instagram and LinkedIn), with educational content on financial empowerment and fintech innovation.  Expansion into new markets, specifically Latin America, is also a key component.
* **Enhanced B2B Engagement:**  A 10% increase in Account-Based Marketing (ABM) budget is planned to target enterprise clients in manufacturing and retail sectors

In [50]:
print(answer_query_with_rag("What is the system architecture?", role="finance"))
print(answer_query_with_rag("Attendance of Aadhya Patel?", role="hr"))
print(answer_query_with_rag("What are marketing KPIs?",  role="marketing"))

The provided text focuses on FinSolve Technologies' financial performance and does not contain information about its system architecture.  Therefore, I cannot answer your question using the given context.  To understand the system architecture, a different source of information would be needed.
Aadhya Patel's attendance percentage is 99.31%.
Our key marketing KPIs (Key Performance Indicators) include:

* **Conversion Rate:**  This measures the percentage of website visitors or leads who complete a desired action (e.g., purchase, sign-up).  Targets varied across quarters, ranging from 10% to 15%, reflecting adjustments to our strategies and goals.

* **Return on Investment (ROI):** This indicates the profitability of our marketing campaigns, calculated as the ratio of incremental revenue generated to total marketing spend.  Our ROI targets ranged from 2.4x to 4.4x across different quarters, demonstrating a focus on maximizing marketing efficiency.

* **Customer Retention Rate:** This me