![supermat](../docs/assets/supermat-logo-black-sub.png)
# Supermat Demo

## Introduction
Unlike most parsing solutions, Supermat retains a document's hierarchical structure while parsing it.
This notebook demonstrates the intermediate representation produced by supermat's parser framework.

In [None]:
%cd ..

In [None]:
%reload_ext autoreload
%autoreload 2

In [None]:
from pathlib import Path
import os
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

In [None]:
from supermat import FileProcessor, ParsedDocumentType, ParsedDocument

## Supermat as a Parser

Enter the path to a sample PDF document here.

In [None]:
pdf_file = Path("test_samples/test.pdf")

### Handlers
You can have multiple handlers for a given file type.

You can build your own handler by registering it via `FileProcessor`.

In this example we are going to use the PyMuPDF Handler.

In [None]:
FileProcessor.get_handlers(pdf_file)

In [None]:
handler = FileProcessor.get_handler("PyMuPDFParser")

In [None]:
parsed_document = handler.parse(pdf_file)

In [None]:
print(ParsedDocument.dump_json(parsed_document, indent=2).decode('utf-8'))

> Instead of fetching a handler, you can also call `FileProcessor.parse(pdf_file)` directly.
> 
> This works by fetching the main handler registered in `FileProcessor`.
> 
> Each file type can have only one main handler registered to it.
> 
> Currently, the Adobe handler is registered as the main handler.
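
The idea of "many handlers per file type, exactly one main handler" can be pictured with a small registry. This is only a sketch and not supermat's implementation: `ToyRegistry`, its `register` method, and the handler names `fast` and `accurate` are hypothetical and exist purely for illustration.

```python
from pathlib import Path
from typing import Callable

# Hypothetical toy registry; this is NOT supermat's FileProcessor.
ParseFn = Callable[[Path], list[dict]]


class ToyRegistry:
    def __init__(self) -> None:
        self._handlers: dict[str, dict[str, ParseFn]] = {}  # extension -> name -> parser
        self._main: dict[str, str] = {}  # extension -> name of the main handler

    def register(self, name: str, extension: str, fn: ParseFn, *, main: bool = False) -> None:
        # Refuse a second, different main handler for the same extension.
        current_main = self._main.get(extension)
        if main and current_main not in (None, name):
            raise ValueError(f"{extension} already has main handler {current_main!r}")
        self._handlers.setdefault(extension, {})[name] = fn
        if main:
            self._main[extension] = name

    def get_handlers(self, path: Path) -> list[str]:
        # Every handler name registered for this file's extension.
        return list(self._handlers.get(path.suffix, {}))

    def parse(self, path: Path) -> list[dict]:
        # Dispatch to the single main handler for this extension.
        name = self._main[path.suffix]
        return self._handlers[path.suffix][name](path)


registry = ToyRegistry()
registry.register("fast", ".pdf", lambda p: [{"handler": "fast", "file": p.name}])
registry.register("accurate", ".pdf", lambda p: [{"handler": "accurate", "file": p.name}], main=True)

print(registry.get_handlers(Path("a.pdf")))
print(registry.parse(Path("a.pdf")))
```

Calling `parse` without naming a handler always goes through the main one, which is why only one can be registered per file type in this sketch.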

## Supermat as a Retriever

### Load all pdf files

Enter the path to the directory containing the PDF files.

In [None]:
pdf_files_dir = Path("data/")

In [None]:
pdf_files = list(pdf_files_dir.glob("*.pdf"))

In [None]:
from itertools import chain
from typing import TYPE_CHECKING, cast

from joblib import Parallel, delayed

# This is simply to process all the files in parallel efficiently

parsed_files = Parallel(n_jobs=-1, backend="threading")(
    delayed(handler.parse_file)(path)
    for path in pdf_files
)

if TYPE_CHECKING:
    from supermat.core.models.parsed_document import ParsedDocumentType

    parsed_files = cast(list[ParsedDocumentType], parsed_files)

documents = list(chain.from_iterable(parsed_files))

if TYPE_CHECKING:
    from supermat.core.models.parsed_document import ParsedDocumentType

    documents = cast(ParsedDocumentType, documents)
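
The cell above fans the files out to worker threads and then flattens the per-file results into one list. The same pattern is sketched below with the standard library's `ThreadPoolExecutor` swapped in for joblib, so it runs without extra dependencies. `fake_parse` is a hypothetical stand-in for the handler, and the dictionary keys it returns are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain
from pathlib import Path


def fake_parse(path: Path) -> list[dict]:
    """Hypothetical stand-in for a handler: returns two 'sections' per file."""
    return [{"document": path.name, "structure": str(i)} for i in (1, 2)]


paths = [Path("a.pdf"), Path("b.pdf")]

# executor.map yields results in input order, even if threads finish out of order.
with ThreadPoolExecutor() as executor:
    per_file = list(executor.map(fake_parse, paths))

# Flatten the list of per-file lists into a single list of sections.
flat = list(chain.from_iterable(per_file))
print(len(per_file), len(flat))
```

Threads suit this kind of workload when parsing is dominated by I/O or by native code that releases the GIL; for pure-Python CPU-bound parsing, a process-based backend may be the better choice.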

### Set up the vector DB using `SupermatRetriever`

> The retriever acts like a LangChain retriever.
>
> This makes it an easy drop-in replacement in existing LangChain RAG systems.
>
> `vector_store` accepts any LangChain `VectorStore`. `SupermatRetriever` acts as a wrapper around that vector store, so refactoring an existing RAG system takes little effort.
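
The wrapper idea can be shown with a dependency-free toy. Everything here is an assumption made for illustration: `VectorStoreLike`, `KeywordStore`, and `ToyRetriever` are hypothetical names, the "store" ranks by shared words instead of embeddings, and results are plain strings rather than LangChain `Document` objects. It is not how `SupermatRetriever` is implemented.

```python
from typing import Protocol


class VectorStoreLike(Protocol):
    """Minimal interface the toy retriever relies on (hypothetical)."""

    def add_texts(self, texts: list[str]) -> None: ...
    def similarity_search(self, query: str, k: int = 4) -> list[str]: ...


class KeywordStore:
    """Toy store that ranks texts by shared lowercase words (no embeddings)."""

    def __init__(self) -> None:
        self._texts: list[str] = []

    def add_texts(self, texts: list[str]) -> None:
        self._texts.extend(texts)

    def similarity_search(self, query: str, k: int = 4) -> list[str]:
        words = set(query.lower().split())
        scored = [(len(words & set(t.lower().split())), t) for t in self._texts]
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        # Keep the top k, dropping texts that share no word with the query.
        return [text for score, text in ranked[:k] if score > 0]


class ToyRetriever:
    """Indexes parsed sections into whatever store it is given, then delegates."""

    def __init__(self, parsed_docs: list[dict], vector_store: VectorStoreLike) -> None:
        self._store = vector_store
        self._store.add_texts([doc["text"] for doc in parsed_docs])

    def invoke(self, query: str) -> list[str]:
        return self._store.similarity_search(query)


toy = ToyRetriever(
    parsed_docs=[
        {"text": "Gene editing in bio technology"},
        {"text": "Quarterly revenue report"},
    ],
    vector_store=KeywordStore(),
)
print(toy.invoke("bio technology"))
```

Because the retriever only depends on the small store interface, swapping the backing store does not change the calling code, which is the property the note above describes.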

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings

embedding_model = HuggingFaceEmbeddings(
    model_name="thenlper/gte-base"
)

In [None]:
from supermat.langchain.bindings import SupermatRetriever
from langchain_chroma import Chroma


retriever = SupermatRetriever(
    parsed_docs=documents,
    vector_store=Chroma(
        embedding_function=embedding_model,
        collection_name="PDFS_SUPERMAT_DEMO",
    ),
)

### Invoke the retriever chain

In [None]:
retriever.invoke("bio technology")

### Now let's compare this with LangChain's `SemanticChunker`

In [None]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_community.document_loaders import PyPDFDirectoryLoader

loader = PyPDFDirectoryLoader(pdf_files_dir)
langchain_documents = loader.load()

text_splitter = SemanticChunker(embedding_model, breakpoint_threshold_type="percentile")
dataset_chunks = text_splitter.split_documents(langchain_documents)

langchain_vector_store = Chroma.from_documents(
    documents=dataset_chunks,
    embedding=embedding_model,
    collection_name="PDFS_LANGCHAIN_DEMO",
)

In [None]:
langchain_retriever = langchain_vector_store.as_retriever()
langchain_retriever.invoke("bio technology")
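
To compare the two result lists rank by rank, a small helper can print them side by side. This is a sketch: `side_by_side` is a hypothetical helper, and it only assumes that each result exposes a `page_content` attribute, as LangChain `Document` objects do. The `SimpleNamespace` objects below are stand-ins so the example runs without any retriever.

```python
from types import SimpleNamespace


def side_by_side(left: list, right: list, width: int = 40) -> list[str]:
    """Pair results by rank; a missing entry on either side shows as '-'."""
    rows = []
    for rank in range(max(len(left), len(right))):
        l_text = left[rank].page_content[:width] if rank < len(left) else "-"
        r_text = right[rank].page_content[:width] if rank < len(right) else "-"
        rows.append(f"{rank + 1}. {l_text!r} | {r_text!r}")
    return rows


# Stand-in results; in the notebook these would be the two invoke() outputs.
rows = side_by_side(
    [SimpleNamespace(page_content="alpha")],
    [SimpleNamespace(page_content="beta"), SimpleNamespace(page_content="gamma")],
)
print("\n".join(rows))
```

In the notebook, the two arguments would be `retriever.invoke("bio technology")` and `langchain_retriever.invoke("bio technology")`.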

### Set up any preferred LangChain LLM

In [None]:
ask = "<ask any question related to the pdf>"

In [None]:
from langchain_ollama.llms import OllamaLLM
from supermat.langchain.bindings import get_default_chain

llm_model = OllamaLLM(model="deepseek-r1:8b", temperature=0.0)
chain = get_default_chain(retriever, llm_model, substitute_references=False, return_context=False)

### Invoke the chain to get an answer to the question

In [None]:
answer = chain.invoke(ask)
print(answer)

### Try with langchain

> _NOTE: This is not the most optimized prompt template_

In [None]:
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser

system_prompt = """
Use the below information to answer the subsequent question.
Please cite the section numbers you take data from.
Provide factual, verifiable information and include references to credible sources where possible.
Avoid speculative or unverified content.
If the answer cannot be found, write "I don't know."
Information:
\"\"\"
{context}
\"\"\"
Question: {question}
"""

def format_docs(docs: list[Document]) -> str:
    response = ["{{" f"'text':'{doc.page_content}'" "}}" for doc in docs]
    return f"[{','.join(response)}]"


prompt_template = PromptTemplate.from_template(system_prompt)
langchain_chain = (
    # Use the LangChain baseline retriever here, not the Supermat one.
    RunnableParallel({"context": langchain_retriever | format_docs, "question": RunnablePassthrough()})
    | prompt_template
    | llm_model
    | StrOutputParser()
)


In [None]:
answer = langchain_chain.invoke(ask)
print(answer)