# Financial Document RAG (Retrieval-Augmented Generation) System

## Project Overview
This notebook demonstrates a Retrieval-Augmented Generation (RAG) system for analyzing financial documents using:
- LlamaParse for PDF extraction
- Sentence Transformer for embeddings
- Pinecone for vector storage
- Google Gemini for question answering

## Step 1: Install Required Libraries
In this step, we install the libraries the project depends on: Pinecone for vector storage, LangChain for embeddings and text splitting, LlamaParse for parsing PDFs, and a few supporting packages.

In [None]:
# Required Libraries
!pip install --upgrade --quiet pinecone-client pinecone-text pinecone-notebooks langchain-community langchain-huggingface pdfplumber sentence-transformers google-generativeai llama_parse


## Step 2: Importing Necessary Libraries
In this step, we import the libraries used throughout the workflow: tools for managing the vector store, creating embeddings, parsing PDFs, and splitting text.

- **os**: For environment variable management.
- **nest_asyncio**: To allow asynchronous operations in Jupyter notebooks.
- **google.generativeai**: For utilizing Google's generative AI models.
- **langchain.embeddings**: For handling embeddings using Sentence Transformers.
- **langchain.text_splitter**: For splitting text data into smaller chunks.
- **llama_parse**: For parsing PDF documents.
- **pinecone**: For managing the vector store and querying data.

In [None]:
# Importing necessary libraries
import os
import nest_asyncio
import google.generativeai as genai
from langchain.embeddings import SentenceTransformerEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from llama_parse import LlamaParse
from pinecone import Pinecone, ServerlessSpec

## Step 3: Apply nest_asyncio to Prevent Event Loop Issues
In this step, we apply `nest_asyncio` so that asynchronous code can run inside the notebook. Jupyter already runs an event loop, and a library that tries to start its own loop from within it (as the PDF parsing step later in this notebook may do) can fail with an "event loop is already running" error. `nest_asyncio.apply()` patches the loop to allow this nesting.

In [None]:
# Apply nest_asyncio to prevent event loop issues
nest_asyncio.apply()

## Step 4: Fetch API Keys from Google Colab User Data
In this step, we use the `userdata` module from Google Colab to fetch the API keys for Pinecone, Google Gemini, and LlamaParse. Storing the keys as Colab secrets keeps them out of the notebook source, so they are never hardcoded in the code.

In [None]:
from google.colab import userdata

# Fetch API keys securely from Google Colab user data
pinecone_api_key = userdata.get('pinecone_api_key')
gemini_api_key = userdata.get('gemini_api_key')
llama_key = userdata.get('llama_key')
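
The `google.colab` module is only importable inside a Colab runtime. A minimal sketch of a fallback for other environments follows, assuming the same secret names are exported as environment variables. `get_secret` is a hypothetical helper, not part of the original notebook.

```python
import os

def get_secret(name):
    """Return a secret from Colab user data, or from the environment elsewhere."""
    try:
        # Only importable inside a Google Colab runtime.
        from google.colab import userdata
    except ImportError:
        # Assumption: outside Colab the key is exported as an environment variable.
        value = os.environ.get(name)
    else:
        value = userdata.get(name)
    if not value:
        raise KeyError(f"Secret {name!r} is not set")
    return value
```

With this helper, a call such as `get_secret('pinecone_api_key')` could stand in for `userdata.get('pinecone_api_key')` when the notebook runs locally.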

## Step 5: Pinecone Configuration
Here we configure Pinecone by creating a Pinecone index if it doesn't already exist. Pinecone will be used to store the document embeddings and facilitate semantic search.

- `index_name`: The name of the Pinecone index.
- `dimension`: The dimensionality of the vector embeddings. It is set to 384 to match the output size of the `all-MiniLM-L6-v2` Sentence Transformer model used later.
- `metric`: The similarity metric used for vector comparison, in this case, 'dotproduct'.
- `ServerlessSpec`: Specifies the cloud and region for the Pinecone serverless index.

In [None]:
# Pinecone Configuration
index_name = 'hybrid-search-langchain-pinecone'
pc = Pinecone(api_key=pinecone_api_key)

# Create Pinecone Index if it doesn't exist
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=384,  # dimension of dense vector
        metric='dotproduct',
        spec=ServerlessSpec(cloud='aws', region='us-east-1')
    )

# Initialize Pinecone index
index = pc.Index(index_name)
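
With `metric='dotproduct'`, the ranking matches cosine similarity only when the stored vectors have unit length. The sketch below checks that property on toy vectors. It assumes the embedding model returns normalized vectors, which is worth verifying for the model in use; the helper names are illustrative.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

# Toy vectors standing in for embeddings (assumption: real ones are 384-d).
a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])

# For unit vectors both scores agree, so either metric ranks results the same.
print(round(dot(a, b), 6), round(cosine(a, b), 6))
```

If the vectors were not normalized, longer vectors would score higher under dot product regardless of direction, and `metric='cosine'` would be the safer choice.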

## Step 6: Configure LlamaParse for PDF Parsing
Here, we configure `LlamaParse`, which will be used to parse PDF documents and extract text data in a structured format. The result type is set to "markdown" for cleaner formatting.

In [None]:
# Configure LlamaParse
os.environ["LLAMA_CLOUD_API_KEY"] = llama_key
llama_parser = LlamaParse(result_type="markdown")

# Load PDF document
documents = llama_parser.load_data("/content/Sample Financial Statement.pdf")

Started parsing the file under job_id 571b3c49-c18f-4492-b889-52c51a0ef0b3


## Step 7: Initialize Embedding Model
We initialize the `SentenceTransformer` model, which will be used to create embeddings for document content. The embeddings will represent the semantic meaning of the text, enabling efficient similarity search later.

In [None]:
# Initialize embedding model
embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

## Step 8: Configure Text Splitter
The text splitter breaks documents into smaller chunks so that each one stays within the embedding model's input limit. The overlap between neighbouring chunks helps preserve context across chunk boundaries.

- `chunk_size`: The maximum size of each chunk (500 characters; the splitter measures length in characters by default, not tokens).
- `chunk_overlap`: The number of characters shared between consecutive chunks (50) to maintain context between them.

In [None]:
# Text Splitter configuration
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)
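
`RecursiveCharacterTextSplitter` measures length with `len` by default, so both settings are counted in characters. The sketch below shows how a size and an overlap interact, using a plain sliding window. It is a simplification for illustration: the real splitter also tries to break on separators such as paragraphs and sentences, which this hypothetical `sliding_chunks` function does not do.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "x" * 1200
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])  # [500, 500, 300]
```

Each window starts 450 characters after the previous one, so the final 50 characters of one chunk reappear at the start of the next.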

## Step 9: Prepare Documents for Embedding
In this step, we split the documents into smaller text chunks using the text splitter. We then prepare the documents for embedding by creating a list of dictionaries that contains the text chunks and associated metadata (e.g., source).

In [None]:
# Prepare documents for embedding
docs = []
for doc in documents:
    texts = text_splitter.split_text(doc.text)
    for text in texts:
        docs.append({
            'page_content': text,
            'metadata': {'source': getattr(doc, 'source', 'Unknown')}
        })

## Step 10: Embed Documents for Pinecone Storage
We now create embeddings for each document chunk using the `SentenceTransformer` model. These embeddings represent the semantic content of each chunk, allowing us to store them in Pinecone for later retrieval.

In [None]:
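import hashlib

def embed_documents(docs):
    """Embed document chunks for Pinecone storage"""
    embedded_docs = []
    for doc in docs:
        text = doc['page_content']
        embedding = embeddings.embed_query(text)
        # Python's built-in hash() is salted per process for strings, so the
        # ID is derived from a content digest to stay stable across runs.
        digest = hashlib.sha256(text.encode('utf-8')).hexdigest()[:32]
        embedded_docs.append({
            'id': f"doc_{digest}",
            'values': embedding,
            'metadata': {
                'text': text,
                'source': doc['metadata'].get('source', 'Unknown')
            }
        })
    return embedded_docs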

## Step 11: Store Embedded Documents to Pinecone
This function takes the embedded documents and saves them to the Pinecone index in batches of 100. The `upsert` method inserts each vector, overwriting any existing vector that has the same ID.

In [None]:
def store_to_pinecone(embedded_docs):
    """Save embedded documents to Pinecone in batches"""
    try:
        batch_size = 100
        for i in range(0, len(embedded_docs), batch_size):
            batch = embedded_docs[i:i+batch_size]
            index.upsert(vectors=batch)
        print(f"Successfully uploaded {len(embedded_docs)} document chunks")
    except Exception as e:
        print(f"Error uploading to Pinecone: {e}")

# Embed and save documents
embedded_docs = embed_documents(docs)
store_to_pinecone(embedded_docs)

Successfully uploaded 491 document chunks
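
As a small illustration of the batching above, the sketch below shows how 491 records divide into batches of 100. `batched` is a hypothetical helper and the records are placeholders, not real embeddings.

```python
def batched(items, batch_size=100):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Placeholder records standing in for the embedded document chunks.
vectors = [{"id": f"doc_{i}"} for i in range(491)]
sizes = [len(batch) for batch in batched(vectors)]
print(sizes)  # [100, 100, 100, 100, 91]
```

The final batch carries the 91 leftover records, so five `upsert` calls cover all 491 chunks.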


## Step 12: Retrieve Relevant Information from Pinecone
This function queries Pinecone to retrieve relevant documents based on a given query. It uses the query's embedding and returns the top-k most relevant results.

In [None]:
def retrieve_relevant_info(query, top_k=5):
    """Retrieve relevant documents from Pinecone"""
    query_embedding = embeddings.embed_query(query)

    try:
        results = index.query(
            vector=query_embedding,
            top_k=top_k,
            include_metadata=True
        )
        return results['matches']
    except Exception as e:
        print(f"Error retrieving documents: {e}")
        return []
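
Conceptually, the query scores every stored vector against the query embedding and keeps the best `top_k`. The sketch below does this by brute force in memory with toy two-dimensional vectors. It illustrates the idea only and is not how Pinecone is implemented; `top_k_by_dot` and the sample records are assumptions.

```python
def top_k_by_dot(query_vec, records, top_k=5):
    """Score each record by dot product with the query and return the best top_k."""
    scored = []
    for rec in records:
        score = sum(q * v for q, v in zip(query_vec, rec["values"]))
        scored.append({"id": rec["id"], "score": score, "metadata": rec["metadata"]})
    # Highest score first, mirroring the order of the returned matches.
    scored.sort(key=lambda m: m["score"], reverse=True)
    return scored[:top_k]

# Toy records (assumption: real vectors have 384 dimensions).
records = [
    {"id": "a", "values": [1.0, 0.0], "metadata": {"text": "revenue"}},
    {"id": "b", "values": [0.0, 1.0], "metadata": {"text": "expenses"}},
    {"id": "c", "values": [0.6, 0.8], "metadata": {"text": "net income"}},
]
matches = top_k_by_dot([0.8, 0.6], records, top_k=2)
print([m["id"] for m in matches])  # ['c', 'a']
```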

## Step 13: Configure Google Gemini for Generating Responses
We configure Google Gemini, which will be used to generate detailed financial analysis responses based on the retrieved documents. We set up various parameters for the generation, including temperature and token limits.

In [None]:
# Configure Google Gemini
genai.configure(api_key=gemini_api_key)

def generate_response(query):
    """Generate detailed financial analysis response"""
    relevant_docs = retrieve_relevant_info(query)
    context = "\n".join([doc["metadata"]["text"] for doc in relevant_docs])

    prompt = f"""You are a financial analyst specializing in profit and loss statements. Based on the financial data provided, answer the following question in a **detailed, sentence-based format**:

    **Context:**
    {context}

    **Query:**
    {query}

    **Instructions:**
    - Provide a clear, well-structured answer.
    - If the answer is numerical, explain the context behind the numbers (e.g., percentage increase, variance).
    - Keep the response concise but informative, focusing on key metrics.
    """

    model = genai.GenerativeModel("gemini-pro")
    generation_config = {
        "temperature": 0.0,
        "top_p": 0.8,
        "top_k": 40,
        "max_output_tokens": 1024,
        "candidate_count": 1
    }

    safety_settings = [
        {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_MEDIUM_AND_ABOVE"},
        {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_MEDIUM_AND_ABOVE"},
        {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_MEDIUM_AND_ABOVE"},
        {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_MEDIUM_AND_ABOVE"}
    ]

    response = model.generate_content(
        prompt,
        generation_config=generation_config,
        safety_settings=safety_settings
    )

    return response.text
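
Because `retrieve_relevant_info` returns an empty list on failure, the prompt can end up with no context at all. The sketch below shows one way to guard against that before calling the model. `build_context` and `answer_or_fallback` are hypothetical helpers, and `generate` is a stand-in callable rather than the Gemini API.

```python
def build_context(matches):
    """Join the text of retrieved matches; return None when nothing was retrieved."""
    texts = [match["metadata"]["text"] for match in matches]
    return "\n".join(texts) if texts else None

def answer_or_fallback(matches, generate):
    """Call generate(context) only when there is retrieved context to ground it."""
    context = build_context(matches)
    if context is None:
        return "No relevant passages were retrieved for this query."
    return generate(context)

# Stand-in for the model call (assumption: the real call would build the prompt).
fake_generate = lambda context: f"Answer based on {len(context.splitlines())} passages"

sample_matches = [
    {"metadata": {"text": "Gross profit rose in Q3."}},
    {"metadata": {"text": "Operating expenses were stable."}},
]
print(answer_or_fallback(sample_matches, fake_generate))
print(answer_or_fallback([], fake_generate))
```

Returning a fixed message for an empty retrieval avoids asking the model to answer a financial question with no supporting data.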

## Step 14: Example Queries and Responses
Finally, we will run some example queries to demonstrate how the system generates responses based on the financial data in the documents.

In [None]:
# Test with an example query
query = "What is the gross profit for Q3 2024?"
response = generate_response(query)
print("Response:", response)

Response: The gross profit for Q3 2024 is **$46,257**. This represents a **$1,843** increase from the previous quarter and a **$11,843** increase from the same quarter last year. The gross profit margin for Q3 2024 is **67.8%**, which is a slight decrease from the previous quarter but an improvement from the same quarter last year.


In [None]:
# Test with another example query
query = "How do the net income and operating expenses compare for Q1 2024?"
response = generate_response(query)
print("Response:", response)

Response: In Q1 2024, the company's net income experienced a moderate increase of approximately 8.8%, rising from $24,108 million in Q1 2023 to $26,248 million. This represents an absolute increase of $2,140 million.

On the other hand, the company's total operating expenses remained relatively stable, with a marginal increase of 0.03% from $14,510 million in Q1 2023 to $14,510 million in Q1 2024. This translates to an absolute increase of only $1 million.

Overall, the company's financial performance in Q1 2024 was marked by a modest increase in net income and stable operating expenses, indicating a slight improvement in profitability.
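
Figures in a generated answer are worth recomputing. The short sketch below derives the absolute and percentage changes from the values quoted in the response above (in millions, copied from the output). `change` is a hypothetical helper.

```python
def change(previous, current):
    """Return (absolute change, percentage change) between two values."""
    absolute = current - previous
    percent = absolute / previous * 100
    return absolute, percent

# Values in millions, as quoted in the generated response.
net_income = change(24_108, 26_248)
opex = change(14_510, 14_510)

print(net_income[0], round(net_income[1], 2))
print(opex[0], round(opex[1], 2))
```

The net income figures reproduce the stated increase of $2,140 million. The two operating expense figures as quoted are identical, so the stated 0.03% and $1 million increase cannot be derived from them and should be checked against the source document.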
