<a href="https://colab.research.google.com/github/loryneJoy/Multimodal_Revenue_Analysis_Agent/blob/main/Multimodal_Revenue_Analysis_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### PROJECT TASK
#### Build a multimodal agent that takes an image and text as input and returns an output.

### STEPS I TOOK:
1. Extract images and text from the data source (a PDF file)
2. Split the text into chunks and embed each chunk
3. Embed the images using OpenAI's CLIP model
4. Store the images in base64 format
5. Store both text and image embeddings in a single FAISS vector store
6. Embed the query with CLIP (Contrastive Language-Image Pre-training), which combines a vision transformer with a text transformer
7. Retrieve the closest text chunks and images from the FAISS vector store
8. Convert the retrieved text and image information into the message format the LLM expects
9. Pass to an LLM model
10. Get the multimodal answer
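
Steps 5 to 7 can be pictured with a toy example. The sketch below is not the FAISS store used in this notebook: it uses made-up 3-dimensional vectors in place of 512-dimensional CLIP embeddings and a plain Python list in place of the index. It only illustrates that text chunks and images sit in one shared space and are ranked against the same query vector.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k(query_vec, index, k=2):
    """Return the metadata of the k entries most similar to the query."""
    q = normalize(query_vec)
    scored = [
        (sum(a * b for a, b in zip(q, normalize(vec))), meta)
        for vec, meta in index
    ]
    # Sort on the score only, so metadata dicts are never compared
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [meta for _, meta in scored[:k]]

# Hypothetical embeddings for illustration; the values are invented
toy_index = [
    ([0.9, 0.1, 0.0], {"type": "text", "page": 0}),
    ([0.8, 0.2, 0.1], {"type": "image", "image_id": "page_0_img_0"}),
    ([0.0, 0.1, 0.9], {"type": "text", "page": 7}),
]

hits = top_k([1.0, 0.0, 0.0], toy_index, k=2)
print(hits)
```

A single query can therefore return a mix of text and image entries, which is why the metadata carries a `type` field.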


#### Step 1: Extracting images and text with PyMuPDF (fitz)

In [None]:
%pip install langchain-community

Collecting langchain-community
  Downloading langchain_community-0.3.30-py3-none-any.whl.metadata (3.0 kB)
Collecting requests<3.0.0,>=2.32.5 (from langchain-community)
  Downloading requests-2.32.5-py3-none-any.whl.metadata (4.9 kB)
Collecting dataclasses-json<0.7.0,>=0.6.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7.0,>=0.6.7->langchain-community)
  Downloading marshmallow-3.26.1-py3-none-any.whl.metadata (7.3 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7.0,>=0.6.7->langchain-community)
  Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB)
Collecting mypy-extensions>=0.3.0 (from typing-inspect<1,>=0.4.0->dataclasses-json<0.7.0,>=0.6.7->langchain-community)
  Downloading mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB)
Downloading langchain_community-0.3.30-py3-none-any.whl (2.5 MB)

In [None]:
%pip install PyMuPDF

Collecting PyMuPDF
  Downloading pymupdf-1.26.4-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.26.4-cp39-abi3-manylinux_2_28_x86_64.whl (24.1 MB)
Installing collected packages: PyMuPDF
Successfully installed PyMuPDF-1.26.4


In [None]:
# Importing Libraries
import fitz  # PyMuPDF
from langchain_core.documents import Document
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch
import numpy as np
from langchain.chat_models import init_chat_model
from langchain.prompts import PromptTemplate
from langchain.schema.messages import HumanMessage
from sklearn.metrics.pairwise import cosine_similarity
import os
import base64
import io
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS

#### Load the CLIP Models

In [None]:
# Clip Model
import os
from dotenv import load_dotenv
load_dotenv()

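# Set up the environment; fail early with a clear message if the key is missing
groq_api_key = os.getenv("GROQ_API_KEY")
if not groq_api_key:
    raise RuntimeError("GROQ_API_KEY is not set; add it to .env or the environment")
os.environ["GROQ_API_KEY"] = groq_api_key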

# initialize the Clip Model for unified embeddings - huggingface
clip_model=CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor=CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip_model.eval()

Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]

CLIPModel(
  (text_model): CLIPTextTransformer(
    (embeddings): CLIPTextEmbeddings(
      (token_embedding): Embedding(49408, 512)
      (position_embedding): Embedding(77, 512)
    )
    (encoder): CLIPEncoder(
      (layers): ModuleList(
        (0-11): 12 x CLIPEncoderLayer(
          (self_attn): CLIPAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (layer_norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (mlp): CLIPMLP(
            (activation_fn): QuickGELUActivation()
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
          )
          (layer_norm2): LayerNorm((512,), eps=1e-05,

### EMBEDDING

In [None]:
### Embedding functions
def embed_image(image_data):
    """Embed image using CLIP"""
    if isinstance(image_data, str):  # If path
        image = Image.open(image_data).convert("RGB")
    else:  # If PIL Image
        image = image_data

    inputs=clip_processor(images=image,return_tensors="pt")
    with torch.no_grad():
        features = clip_model.get_image_features(**inputs)
        # Normalize embeddings to unit vector
        features = features / features.norm(dim=-1, keepdim=True)
        return features.squeeze().numpy()

def embed_text(text):
    """Embed text using CLIP."""
    inputs = clip_processor(
        text=text,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=77  # CLIP's max token length
    )
    with torch.no_grad():
        features = clip_model.get_text_features(**inputs)
        # Normalize embeddings
        features = features / features.norm(dim=-1, keepdim=True)
        return features.squeeze().numpy()

In [None]:
## Process PDF
pdf_path="/content/Cost_Revenue_Analysis.pdf"
doc=fitz.open(pdf_path)
# Storage for all documents and embeddings
all_docs = []
all_embeddings = []
image_data_store = {}  # Store actual image data for LLM

# Text splitter
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
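
To see what `chunk_size=500` and `chunk_overlap=100` mean, here is a simplified sketch of overlapping windows. It is an assumption-laden stand-in, not how `RecursiveCharacterTextSplitter` works internally: the real splitter first tries to break on separators such as paragraphs, lines, and spaces, while the hypothetical `sliding_chunks` helper below cuts purely by character position.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=100):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

# 1200 characters of dummy text
sample = "".join(str(i % 10) for i in range(1200))
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])
```

Each chunk repeats the last 100 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.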

In [None]:
doc

Document('/content/Cost_Revenue_Analysis.pdf')

In [None]:
for i,page in enumerate(doc):
    ## process text
    text=page.get_text()
    if text.strip():
        ##create temporary document for splitting
        temp_doc = Document(page_content=text, metadata={"page": i, "type": "text"})
        text_chunks = splitter.split_documents([temp_doc])

        #Embed each chunk using CLIP
        for chunk in text_chunks:
            embedding = embed_text(chunk.page_content)
            all_embeddings.append(embedding)
            all_docs.append(chunk)



    # Process the images in three steps:
    # 1. Convert the PDF image to PIL format
    # 2. Store it as base64 for the multimodal LLM (passed later as a data URL)
    # 3. Create a CLIP embedding for retrieval

    for img_index, img in enumerate(page.get_images(full=True)):
        try:
            xref = img[0]
            base_image = doc.extract_image(xref)
            image_bytes = base_image["image"]

            # Convert to PIL Image
            pil_image = Image.open(io.BytesIO(image_bytes)).convert("RGB")

            # Create unique identifier
            image_id = f"page_{i}_img_{img_index}"

            # Store image as base64 for later use with the multimodal LLM
            buffered = io.BytesIO()
            pil_image.save(buffered, format="PNG")
            img_base64 = base64.b64encode(buffered.getvalue()).decode()
            image_data_store[image_id] = img_base64

            # Embed image using CLIP
            embedding = embed_image(pil_image)
            all_embeddings.append(embedding)

            # Create document for image
            image_doc = Document(
                page_content=f"[Image: {image_id}]",
                metadata={"page": i, "type": "image", "image_id": image_id}
            )
            all_docs.append(image_doc)

        except Exception as e:
            print(f"Error processing image {img_index} on page {i}: {e}")
            continue

doc.close()
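
The base64 strings kept in `image_data_store` are later wrapped in a `data:image/png;base64,...` URL. The sketch below shows that round trip with the standard library only. The helper names `to_data_url` and `from_data_url` are hypothetical, and the 8-byte PNG signature stands in for a real image.

```python
import base64

def to_data_url(raw_bytes, mime="image/png"):
    """Encode raw bytes as a base64 data URL string."""
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"

def from_data_url(data_url):
    """Recover the original bytes from a base64 data URL."""
    _header, encoded = data_url.split(",", 1)
    return base64.b64decode(encoded)

# Stand-in bytes: the 8-byte signature that starts every PNG file
fake_png = b"\x89PNG\r\n\x1a\n"
url = to_data_url(fake_png)
print(url)
```

Base64 makes the payload roughly a third larger than the raw bytes, which is worth keeping in mind when several images are sent in one request.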

In [None]:
all_docs

[Document(metadata={'page': 0, 'type': 'text'}, page_content='|  393 |\nBRUNO DE ROSA – PARTNER E SCIENTIFIC DIRECTOR DYN@MIKA S.R.L.\nCOST AND REVENUE ANALYSIS\nHow to produce par-al proﬁtability informa-on'),
 Document(metadata={'page': 0, 'type': 'image', 'image_id': 'page_0_img_0'}, page_content='[Image: page_0_img_0]'),
 Document(metadata={'page': 0, 'type': 'image', 'image_id': 'page_0_img_1'}, page_content='[Image: page_0_img_1]'),
 Document(metadata={'page': 1, 'type': 'text'}, page_content='|  394 |\nBRUNO DE ROSA – PARTNER E SCIENTIFIC DIRECTOR DYN@MIKA S.R.L.\nEFFICIENCY\n1. OPERATIONAL \nPRODUCTIVITY\n2. FINANCIAL \nPRODUCTIVITY\nOPHYSICAL\nIPHYSICAL\nOREVENUES\nIEXPENSES\na) Partial \nb) Total \na) Partial\nb) Total'),
 Document(metadata={'page': 1, 'type': 'image', 'image_id': 'page_1_img_0'}, page_content='[Image: page_1_img_0]'),
 Document(metadata={'page': 2, 'type': 'text'}, page_content='|  395 |\nBRUNO DE ROSA – PARTNER E SCIENTIFIC DIRECTOR DYN@MIKA S.R.L.\nEFFICIE

In [None]:
# Stack the CLIP embeddings into one array for the unified FAISS vector store
embeddings_array = np.array(all_embeddings)
embeddings_array

array([[-0.04622868,  0.01753438,  0.03629856, ..., -0.08006593,
        -0.01380648, -0.00609296],
       [-0.02515623, -0.02274406,  0.05144298, ...,  0.01638889,
         0.01698227, -0.05855679],
       [-0.0089438 ,  0.00232195, -0.00363326, ...,  0.06655418,
        -0.0269383 , -0.01064131],
       ...,
       [ 0.0050875 ,  0.01034724, -0.00103213, ...,  0.00999359,
         0.02369087,  0.00190299],
       [-0.00482922,  0.00452271, -0.01624018, ..., -0.05643078,
         0.02530663,  0.00100464],
       [-0.02515623, -0.02274406,  0.05144298, ...,  0.01638889,
         0.01698227, -0.05855679]], dtype=float32)

In [None]:

(all_docs,embeddings_array)

([Document(metadata={'page': 0, 'type': 'text'}, page_content='|  393 |\nBRUNO DE ROSA – PARTNER E SCIENTIFIC DIRECTOR DYN@MIKA S.R.L.\nCOST AND REVENUE ANALYSIS\nHow to produce par-al proﬁtability informa-on'),
  Document(metadata={'page': 0, 'type': 'image', 'image_id': 'page_0_img_0'}, page_content='[Image: page_0_img_0]'),
  Document(metadata={'page': 0, 'type': 'image', 'image_id': 'page_0_img_1'}, page_content='[Image: page_0_img_1]'),
  Document(metadata={'page': 1, 'type': 'text'}, page_content='|  394 |\nBRUNO DE ROSA – PARTNER E SCIENTIFIC DIRECTOR DYN@MIKA S.R.L.\nEFFICIENCY\n1. OPERATIONAL \nPRODUCTIVITY\n2. FINANCIAL \nPRODUCTIVITY\nOPHYSICAL\nIPHYSICAL\nOREVENUES\nIEXPENSES\na) Partial \nb) Total \na) Partial\nb) Total'),
  Document(metadata={'page': 1, 'type': 'image', 'image_id': 'page_1_img_0'}, page_content='[Image: page_1_img_0]'),
  Document(metadata={'page': 2, 'type': 'text'}, page_content='|  395 |\nBRUNO DE ROSA – PARTNER E SCIENTIFIC DIRECTOR DYN@MIKA S.R.L.\nE

In [None]:
# Create custom FAISS index since we have precomputed embeddings
vector_store = FAISS.from_embeddings(
    text_embeddings=[(doc.page_content, emb) for doc, emb in zip(all_docs, embeddings_array)],
    embedding=None,  # using precomputed embeddings
    metadatas=[doc.metadata for doc in all_docs]
)
vector_store



<langchain_community.vectorstores.faiss.FAISS at 0x7d8e19099430>

In [None]:
%pip install -U langchain-groq

Collecting langchain-groq
  Downloading langchain_groq-0.3.8-py3-none-any.whl.metadata (2.6 kB)
Collecting groq<1,>=0.30.0 (from langchain-groq)
  Downloading groq-0.32.0-py3-none-any.whl.metadata (16 kB)
Downloading langchain_groq-0.3.8-py3-none-any.whl (16 kB)
Downloading groq-0.32.0-py3-none-any.whl (135 kB)
Installing collected packages: groq, langchain-groq
Successfully installed groq-0.32.0 langchain-groq-0.3.8


In [None]:
# Initialize the Groq-hosted chat model (init_chat_model was imported above)
llm = init_chat_model("groq:meta-llama/llama-4-scout-17b-16e-instruct")
llm

ChatGroq(client=<groq.resources.chat.completions.Completions object at 0x7d8e1885a3c0>, async_client=<groq.resources.chat.completions.AsyncCompletions object at 0x7d8e196cfb60>, model_name='meta-llama/llama-4-scout-17b-16e-instruct', model_kwargs={}, groq_api_key=SecretStr('**********'))

In [None]:

def retrieve_multimodal(query, k=5):
    """Unified retrieval using CLIP embeddings for both text and images."""
    # Embed query using CLIP
    query_embedding = embed_text(query)

    # Search in unified vector store
    results = vector_store.similarity_search_by_vector(
        embedding=query_embedding,
        k=k
    )

    return results
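
As far as I know, LangChain's FAISS wrapper uses Euclidean (L2) distance unless told otherwise. Because `embed_text` and `embed_image` return unit vectors, that should give the same ranking as cosine similarity, since for unit vectors ||q - c||^2 = 2 - 2 (q · c). The sketch below checks this identity on made-up vectors in plain Python; it does not call FAISS.

```python
import math

def unit(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Invented vectors for illustration
query = unit([0.2, 0.9, 0.1])
candidates = [unit(v) for v in ([0.1, 1.0, 0.0], [1.0, 0.1, 0.3], [0.3, 0.5, 0.8])]

# Rank candidate indices: highest cosine first, smallest distance first
by_cosine = sorted(range(len(candidates)), key=lambda i: -dot(query, candidates[i]))
by_l2 = sorted(range(len(candidates)), key=lambda i: squared_l2(query, candidates[i]))

for c in candidates:
    # Identity for unit vectors: ||q - c||^2 = 2 - 2 * (q . c)
    assert math.isclose(squared_l2(query, c), 2 - 2 * dot(query, c), abs_tol=1e-9)

print(by_cosine, by_l2)
```

If the embeddings were not normalized, the two rankings could differ, so the normalization step in the embedding functions matters for retrieval quality.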

In [None]:
def create_multimodal_message(query, retrieved_docs):
    """Create a message with both text and images for the multimodal LLM."""
    content = []

    # Add the query
    content.append({
        "type": "text",
        "text": f"Question: {query}\n\nContext:\n"
    })

    # Separate text and image documents
    text_docs = [doc for doc in retrieved_docs if doc.metadata.get("type") == "text"]
    image_docs = [doc for doc in retrieved_docs if doc.metadata.get("type") == "image"]

    # Add text context
    if text_docs:
        text_context = "\n\n".join([
            f"[Page {doc.metadata['page']}]: {doc.page_content}"
            for doc in text_docs
        ])
        content.append({
            "type": "text",
            "text": f"Text excerpts:\n{text_context}\n"
        })

    # Add images
    for doc in image_docs:
        image_id = doc.metadata.get("image_id")
        if image_id and image_id in image_data_store:
            content.append({
                "type": "text",
                "text": f"\n[Image from page {doc.metadata['page']}]:\n"
            })
            content.append({
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/png;base64,{image_data_store[image_id]}"
                }
            })

    # Add instruction
    content.append({
        "type": "text",
        "text": "\n\nPlease answer the question based on the provided text and images."
    })

    return HumanMessage(content=content)
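
When debugging, printing the message content directly floods the output with base64 text. A small helper can summarize each part instead. The sketch below assumes the same list-of-dicts shape built in `create_multimodal_message`; the name `summarize_parts` and the example payload are hypothetical.

```python
def summarize_parts(parts, max_chars=60):
    """Describe each content part briefly, hiding long base64 payloads."""
    lines = []
    for part in parts:
        if part.get("type") == "text":
            text = part["text"].strip().replace("\n", " ")
            lines.append(f"text: {text[:max_chars]}")
        elif part.get("type") == "image_url":
            url = part["image_url"]["url"]
            # Keep the "data:<mime>;base64" header, report only the payload size
            header, _, payload = url.partition(",")
            lines.append(f"image: {header} ({len(payload)} base64 chars)")
        else:
            lines.append(f"other: {part.get('type')}")
    return lines

# Made-up payload shaped like the content list above
example_parts = [
    {"type": "text", "text": "Question: What is EBIT?\n\nContext:\n"},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBORw0KGgo="}},
]
for line in summarize_parts(example_parts):
    print(line)
```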

In [None]:
def multimodal_pdf_rag_pipeline(query):
    """Main pipeline for multimodal RAG."""
    # Retrieve relevant documents
    context_docs = retrieve_multimodal(query, k=5)

    # Create multimodal message
    message = create_multimodal_message(query, context_docs)

    # Get response from the Groq-hosted LLM
    response = llm.invoke([message])

    # Print retrieved context info
    print(f"\nRetrieved {len(context_docs)} documents:")
    for doc in context_docs:
        doc_type = doc.metadata.get("type", "unknown")
        page = doc.metadata.get("page", "?")
        if doc_type == "text":
            preview = doc.page_content[:100] + "..." if len(doc.page_content) > 100 else doc.page_content
            print(f"  - Text from page {page}: {preview}")
        else:
            print(f"  - Image from page {page}")
    print("\n")

    return response.content

In [None]:
if __name__ == "__main__":
    # Example queries
    queries = [
        "What are the different types of efficiency discussed in the document?",
        "What are the main steps of a typical income statement shown in the document?",
        "What are the different profit margins in the income statement (Sales → EAT breakdown)?",
        "What is the EBIT margin in the example income statement?",
        "In the job order costing table, what is subtracted to reach the ‘First Margin’?"
    ]

    for query in queries:
        print(f"\nQuery: {query}")
        print("-" * 50)
        # Run the pipeline, which calls the Groq-hosted llm object
        answer = multimodal_pdf_rag_pipeline(query)
        print(f"Answer: {answer}")
        print("=" * 70)


Query: What are the different types of efficiency discussed in the document?
--------------------------------------------------

Retrieved 5 documents:
  - Text from page 18: Focused on cost-objects that are considered
particularly relevant for day-by-day decisions
and, ther...
  - Text from page 11: into categories that remind us the type of resources purchased or consumed
(e.g., raw materials, dep...
  - Text from page 6: more modern management control. The
focus is on the outside world.
Those who adopt this perspective ...
  - Text from page 13: more modern management control. The
focus is on the outside world.
Those who adopt this perspective ...
  - Text from page 44: do
not
always
have
these
characteristics. There are, for example, cost objects such as activities
pe...


Answer: ## Types of Efficiency Discussed

The provided text excerpts relate to cost analysis, cost accumulation, and cost assignment in the context of management control. However, they do not explicitly discuss 

In [None]:
# Install git
!apt-get install -qq git

# Clone the repo into Colab
!git clone https://github.com/loryneJoy/Multimodal_AI_Cost_Revenue_Analysis.git

# Copy the notebook into the repo
!cp /content/Revenue_Analysis_Multimodal_Project.ipynb /content/Multimodal_AI_Cost_Revenue_Analysis/

# Move into repo directory
%cd /content/Multimodal_AI_Cost_Revenue_Analysis

# Configure Git
!git config --global user.email "lorynejoynyanchama@gmail.com"
!git config --global user.name "loryneJoy"

# Commit changes
!git add Revenue_Analysis_Multimodal_Project.ipynb
!git commit -m "Added notebook from Colab"

# Push with token (replace YOUR_TOKEN_HERE with your PAT)
!git push https://loryneJoy:YOUR_TOKEN_HERE@github.com/loryneJoy/Multimodal_AI_Cost_Revenue_Analysis.git main
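
Pasting a token into a notebook cell risks committing it by accident. One option is to read it from an environment variable and assemble the push URL in Python. The sketch below assumes a variable named `GITHUB_TOKEN` (my choice, not something the notebook defines) and falls back to a dummy value so the example runs without a real token.

```python
import os

def build_push_url(user, repo, token):
    """Return an HTTPS remote URL that embeds a personal access token."""
    if not token:
        raise ValueError("no token provided; set the GITHUB_TOKEN environment variable")
    return f"https://{user}:{token}@github.com/{user}/{repo}.git"

# GITHUB_TOKEN is an assumed variable name; "dummy-token" is only a stand-in
token = os.environ.get("GITHUB_TOKEN") or "dummy-token"
push_url = build_push_url("loryneJoy", "Multimodal_AI_Cost_Revenue_Analysis", token)

# Mask the secret before printing anything
print(push_url.replace(token, "***"))
```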


In [None]:
from google.colab import drive
drive.mount('/content/drive')


In [None]:
import os

file_path = "/content/drive/MyDrive/Colab Notebooks/Multimodal_Revenue_Analysis_Project.ipynb"
print("Exists:", os.path.exists(file_path))
print("Full path:", os.path.abspath(file_path))


Exists: True
Full path: /content/drive/MyDrive/Colab Notebooks/Multimodal_Revenue_Analysis_Project.ipynb


In [None]:
import nbformat

# Path to your notebook - make sure this is the correct path in your Colab environment
notebook_path = "/content/drive/MyDrive/Colab Notebooks/Multimodal_Revenue_Analysis_Project.ipynb"

# Load notebook
with open(notebook_path, "r", encoding="utf-8") as f:
    nb = nbformat.read(f, as_version=4)

# Remove problematic widget metadata
if "widgets" in nb["metadata"]:
    del nb["metadata"]["widgets"]

# Save cleaned notebook
with open(notebook_path, "w", encoding="utf-8") as f:
    nbformat.write(nb, f)

print(f"Cleaned notebook saved to {notebook_path}")

Cleaned notebook saved to /content/drive/MyDrive/Colab Notebooks/Multimodal_Revenue_Analysis_Project.ipynb
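
A notebook file is plain JSON, so the same cleanup can be pictured on an ordinary dictionary. My understanding is that GitHub's renderer rejects notebooks whose `metadata.widgets` entry lacks a `state` key, which is why removing the entry helps. The sketch below uses a made-up minimal notebook structure and a hypothetical helper `strip_widget_metadata`; it does not touch any file.

```python
import json

def strip_widget_metadata(notebook_json):
    """Return a copy of the notebook dict without metadata.widgets."""
    cleaned = json.loads(json.dumps(notebook_json))  # cheap deep copy
    cleaned.get("metadata", {}).pop("widgets", None)
    return cleaned

# Minimal invented notebook structure for illustration
toy_notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {
        "widgets": {"application/vnd.jupyter.widget-state+json": {}},
        "kernelspec": {"name": "python3"},
    },
    "cells": [],
}

cleaned = strip_widget_metadata(toy_notebook)
print(sorted(cleaned["metadata"]))
```

Working on a copy leaves the original dictionary untouched, which makes it easy to compare the two before overwriting a file.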


In [None]:
!pip install nbstripout
!nbstripout Multimodal_Revenue_Analysis_Project.ipynb

Collecting nbstripout
  Downloading nbstripout-0.8.1-py2.py3-none-any.whl.metadata (19 kB)
Downloading nbstripout-0.8.1-py2.py3-none-any.whl (16 kB)
Installing collected packages: nbstripout
Successfully installed nbstripout-0.8.1
Could not strip 'Multimodal_Revenue_Analysis_Project.ipynb': file not found


In [None]:
!pip install nbstripout
!nbstripout "/content/drive/MyDrive/Colab Notebooks/Multimodal_Revenue_Analysis_Project.ipynb"




In [5]:
import nbformat

# Path to the current notebook
notebook_path = "/content/drive/MyDrive/Colab Notebooks/Multimodal_Revenue_Analysis_Project.ipynb"  # replace with your filename

# Load the notebook
with open(notebook_path, "r", encoding="utf-8") as f:
    nb = nbformat.read(f, as_version=4)

# Fix metadata: make sure the widgets entry has a "state" key
if "widgets" in nb["metadata"]:
    nb["metadata"]["widgets"]["state"] = nb["metadata"]["widgets"].get("state", {})

# Save the fixed notebook
with open(notebook_path, "w", encoding="utf-8") as f:
    nbformat.write(nb, f)

print("Notebook metadata fixed ✅. Reopen the notebook now.")

Notebook metadata fixed ✅. Reopen the notebook now.
