# Semi-structured RAG

Let's evaluate your architecture on a small semi-structured Q&A dataset. The dataset consists of question-answer pairs over PDFs that contain tables.

## Pre-requisites

We will install quite a few prerequisites for this example since we are comparing various techniques and models.

In [None]:
# %pip install -U langchain_benchmarks
# %pip install -U langchain langsmith langchainhub unstructured chromadb openai huggingface pandas langchain_experimental

For this code to work, please configure LangSmith environment variables with your credentials.

In [1]:
import os

os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = "sk-..."  # Your API key

# Silence warnings from HuggingFace
os.environ["TOKENIZERS_PARALLELISM"] = "false"
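
Before running the remaining cells, it can help to confirm that the required variables are actually set. The following is a minimal sketch: `missing_env_vars` is a hypothetical helper, not part of LangSmith or `langchain_benchmarks`, and the list of names simply mirrors the variables configured above.

```python
import os


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Hypothetical check; the names mirror the variables configured above.
required = ["LANGCHAIN_ENDPOINT", "LANGCHAIN_API_KEY"]
missing = missing_env_vars(required)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```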

## Review Q&A Tasks

The registry provides configurations to test out common architectures on curated datasets.

In [2]:
from langchain_benchmarks import clone_public_dataset, registry

In [None]:
registry

In [None]:
task = registry["Semi-structured Earnings"]
task

In [None]:
clone_public_dataset(task.dataset_id, dataset_name=task.name)

### Now, index the documents

In [None]:
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="thenlper/gte-base")
docs = list(task.get_docs())
retriever_factory = task.retriever_factories["basic"]
# Indexes the documents with the specified embeddings
retriever = retriever_factory(embeddings, docs=docs)

### Time to evaluate

We will compose our retriever with Anthropic's Claude 2 chat model.

In [None]:
from langchain.chat_models import ChatAnthropic
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable.passthrough import RunnableAssign

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "Answer based solely on the retrieved documents below:\n\n<Documents>\n{docs}</Documents>",
        ),
        ("user", "{Question}"),
    ]
)
llm = ChatAnthropic(model="claude-2")


def create_chain(retriever):
    return (
        RunnableAssign({"docs": (lambda x: next(iter(x.values()))) | retriever})
        | prompt
        | llm
        | StrOutputParser()
    )
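
The first step of `create_chain` is compact, so here is a rough plain-Python equivalent of what it appears to do: take the first value of the input dictionary as the query, pass it to the retriever, and add the result under `docs` while keeping the original keys. This sketch assumes a dataset row with a single `Question` key; `assign_docs` and `fake_retriever` are hypothetical stand-ins used only for illustration.

```python
def assign_docs(inputs, retrieve):
    """Mimic the assign step: keep the input keys and add a "docs" key."""
    query = next(iter(inputs.values()))  # first value of the input dict
    return {**inputs, "docs": retrieve(query)}


def fake_retriever(query):
    # Hypothetical stand-in for a real retriever; returns canned text.
    return [f"document matching: {query}"]


result = assign_docs({"Question": "What was Q3 revenue?"}, fake_retriever)
print(result["docs"])
```

Because only the first value is used, this pattern relies on the question being the first (or only) input key of each dataset example.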

In [None]:
from functools import partial

from langsmith.client import Client

from langchain_benchmarks.rag import get_eval_config

client = Client()
RAG_EVALUATION = get_eval_config()
chain = create_chain(retriever)
test_run = client.run_on_dataset(
    dataset_name=task.name,
    llm_or_chain_factory=chain,
    evaluation=RAG_EVALUATION,
    verbose=True,
)

## Example processing the docs

RAG apps are only as good as the information they are able to retrieve. Let's try indexing summaries of the tables to
improve the likelihood that they are retrieved whenever a user asks a relevant question.

We will use unstructured's `partition_pdf` functionality and generate summaries using an LLM.

You can define your own indexing pipeline to see how it impacts the downstream performance.

In [None]:
from operator import itemgetter

from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.document import Document
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable.passthrough import RunnableAssign

# Prompt
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are summarizing semi-structured tables or text in a pdf.\n\n```document\n{doc}\n```",
        ),
        ("user", "Write a concise summary."),
    ]
)

# Summary chain
model = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-16k")


def create_doc(x) -> Document:
    return Document(
        page_content=x["output"],
        metadata=x["doc"].metadata,
    )


summarize_chain = (
    {"doc": lambda x: x}
    | RunnableAssign({"prompt": prompt})
    | {
        "output": itemgetter("prompt") | model | StrOutputParser(),
        "doc": itemgetter("doc"),
    }
    | create_doc
)

In [None]:
summaries = summarize_chain.batch(
    [doc for doc in docs if doc.metadata["element_type"] == "table"]
)
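
To make the data flow above concrete, here is a minimal sketch of the same filter-and-summarize pattern without any LLM calls. `Doc` is a hypothetical stand-in for LangChain's `Document`, `fake_summarize` stands in for the summary chain, and the `"text"` element type and `page` key are assumed sample values. The point is only that each summary is a new document that inherits the metadata of the table it describes.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a document with text and metadata."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def fake_summarize(doc):
    # Stand-in for the LLM summary; reuses the source document's metadata.
    return Doc(
        page_content=f"Summary of: {doc.page_content[:20]}",
        metadata=doc.metadata,
    )


# Assumed sample data: one table element and one text element.
sample_docs = [
    Doc("Revenue | Q1 | Q2", {"element_type": "table", "page": 3}),
    Doc("Management discussion text.", {"element_type": "text", "page": 4}),
]

# Only tables are summarized; other elements are indexed unchanged.
tables = [d for d in sample_docs if d.metadata["element_type"] == "table"]
sample_summaries = [fake_summarize(d) for d in tables]

# The index receives the original documents plus the summaries.
combined = sample_docs + sample_summaries
```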

Index the documents together with the table summaries and create the retriever, reusing the same retriever factory as before.

In [None]:
# Indexes the documents with the specified embeddings
retriever_with_summaries = retriever_factory(
    embeddings,
    docs=docs + summaries,
    # Specify a unique transformation name to avoid local cache collisions with other indices.
    transformation_name="table-summaries",
)

### Evaluate

We'll evaluate the new chain on the same dataset.

In [None]:
chain_2 = create_chain(retriever_with_summaries)

test_run_with_summaries = client.run_on_dataset(
    dataset_name=task.name,
    llm_or_chain_factory=chain_2,
    evaluation=RAG_EVALUATION,
    verbose=True,
)