# Semi-structured RAG

Let's evaluate a RAG architecture on a small semi-structured Q&A dataset. The dataset consists of question-answer pairs over PDFs that contain tables.

## Pre-requisites

We will install quite a few prerequisites for this example, since we are comparing various techniques and models.

In [None]:
# %pip install -U langchain_benchmarks
# %pip install -U langchain langsmith langchainhub unstructured chromadb openai huggingface pandas langchain_experimental

In [1]:
%load_ext autoreload
%autoreload 2

For this code to work, please configure LangSmith environment variables with your credentials.

In [2]:
import os

os.environ["LANGCHAIN_ENDPOINT"] = "http://localhost:1984"  # or "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = "sk-..."  # Your API key

# Silence warnings from HuggingFace
os.environ["TOKENIZERS_PARALLELISM"] = "false"
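
Before running the rest of the notebook, it can help to confirm that the variables set above are actually present. The following is a minimal sketch, not part of `langchain_benchmarks` or `langsmith`: the helper name `missing_langsmith_vars` and the `REQUIRED_VARS` tuple are hypothetical, and the check only tests for presence, so a placeholder value such as `"sk-..."` would still pass.

```python
import os

# Assumption: these are the two variables this notebook sets by hand.
REQUIRED_VARS = ("LANGCHAIN_ENDPOINT", "LANGCHAIN_API_KEY")


def missing_langsmith_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_langsmith_vars()
if missing:
    print("Missing LangSmith settings:", ", ".join(missing))
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to try out without touching the real environment.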

## Review Q&A Tasks

The registry provides configurations to test out common architectures on curated datasets.

In [3]:
from langchain_benchmarks import clone_public_dataset, registry

In [4]:
registry

Name,Dataset ID,Description
Tool Usage - Typewriter (1 func),placeholder,"Environment with a single function that accepts a single letter as input, and ""prints"" it on a piece of paper. The objective of this task is to evaluate the ability to use the provided tools to repeat a given input string. For example, if the string is 'abc', the tools 'a', 'b', and 'c' must be invoked in that order. The dataset includes examples of varying difficulty. The difficulty is measured by the length of the string."
Tool Usage - Typewriter,placeholder,"Environment with 26 functions each representing a letter of the alphabet. In this variation of the typewriter task, there are 26 parameterless functions, where each function represents a letter of the alphabet (instead of a single function that takes a letter as an argument). The object is to evaluate the ability of use the functions to repeat the given string. For example, if the string is 'abc', the tools 'a', 'b', and 'c' must be invoked in that order. The dataset includes examples of varying difficulty. The difficulty is measured by the length of the string."
Tool Usage - Relational Data,e95d45da-aaa3-44b3-ba2b-7c15ff6e46f5,"Environment with fake data about users and their locations and favorite foods. The environment provides a set of tools that can be used to query the data. The objective of this task is to evaluate the ability to use the provided tools to answer questions about relational data. The dataset contains 21 examples of varying difficulty. The difficulty is measured by the number of tools that need to be used to answer the question. Each example is composed of a question, a reference answer, and information about the sequence in which tools should be used to answer the question. Success is measured by the ability to answer the question correctly, and efficiently."
Multiverse Math,placeholder,"An environment that contains a few basic math operations, but with altered results. For example, multiplication of 5*3 will be re-interpreted as 5*3*1.1. The basic operations retain some basic properties, such as commutativity, associativity, and distributivity; however, the results are different than expected. The objective of this task is to evaluate the ability to use the provided tools to solve simple math questions and ignore any innate knowledge about math."
Email Extraction,https://smith.langchain.com/public/36bdfe7d-3cd1-4b36-b957-d12d95810a2b/d,"A dataset of 42 real emails deduped from a spam folder, with semantic HTML tags removed, as well as a script for initial extraction and formatting of other emails from an arbitrary .mbox file like the one exported by Gmail. Some additional cleanup of the data was done by hand after the initial pass. See https://github.com/jacoblee93/oss-model-extraction-evals."
LangChain Docs Q&A,452ccafc-18e1-4314-885b-edd735f17b9d,Questions and answers based on a snapshot of the LangChain python docs. The environment provides the documents and the retriever information. Each example is composed of a question and reference answer. Success is measured based on the accuracy of the answer relative to the reference answer. We also measure the faithfulness of the model's response relative to the retrieved documents (if any).
Semi-structured Earnings,c47d9617-ab99-4d6e-a6e6-92b8daf85a7d,Questions and answers based on PDFs containing tables and charts. The task provides the raw documents as well as factory methods to easily index them and create a retriever. Each example is composed of a question and reference answer. Success is measured based on the accuracy of the answer relative to the reference answer. We also measure the faithfulness of the model's response relative to the retrieved documents (if any).


In [5]:
task = registry["Semi-structured Earnings"]
task

0,1
Name,Semi-structured Earnings
Type,RetrievalTask
Dataset ID,c47d9617-ab99-4d6e-a6e6-92b8daf85a7d
Description,Questions and answers based on PDFs containing tables and charts. The task provides the raw documents as well as factory methods to easily index them and create a retriever. Each example is composed of a question and reference answer. Success is measured based on the accuracy of the answer relative to the reference answer. We also measure the faithfulness of the model's response relative to the retrieved documents (if any).
Retriever Factories,"basic, parent-doc, hyde"
Architecture Factories,


In [6]:
clone_public_dataset(task.dataset_id, dataset_name=task.name)

Dataset Semi-structured Earnings already exists. Skipping.
You can access the dataset at http://localhost/o/00000000-0000-0000-0000-000000000000/datasets/df5a7c83-35d1-4cad-973b-dde6d70e1735.


### Now, index the documents

In [8]:
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="thenlper/gte-base")
docs = list(task.get_docs())
retriever_factory = task.retriever_factories["basic"]
# Indexes the documents with the specified embeddings
retriever = retriever_factory(embeddings, docs=docs)

  0%|          | 0/23 [00:00<?, ?it/s]

### Time to evaluate

We will compose our retriever with a simple chain backed by an Anthropic chat model (`claude-2`).

In [21]:
from langchain.chat_models import ChatAnthropic
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable.passthrough import RunnableAssign

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "Answer based solely on the retrieved documents below:\n\n<Documents>\n{docs}</Documents>",
        ),
        ("user", "{Question}"),
    ]
)
llm = ChatAnthropic(model="claude-2")


def create_chain(retriever):
    return (
        RunnableAssign({"docs": (lambda x: next(iter(x.values()))) | retriever})
        | prompt
        | llm
        | StrOutputParser()
    )
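
The `RunnableAssign` step above takes the first value of the input dictionary (here, the question), sends it to the retriever, and stores the result under a new `docs` key while leaving the original keys in place. The plain-Python sketch below mimics that data flow without LangChain. `fake_retriever`, its two-sentence corpus, and `assign_docs` are hypothetical stand-ins for illustration only, not the real retriever or the library's implementation.

```python
def first_value(inputs: dict) -> str:
    """Same trick as the lambda in create_chain: take the first dict value."""
    return next(iter(inputs.values()))


def fake_retriever(query: str) -> list:
    """Hypothetical keyword matcher standing in for the vector-store retriever."""
    corpus = [
        "Revenue for Q3 2023 was reported in the income table.",
        "Headcount grew during the year.",
    ]
    words = query.lower().split()
    return [doc for doc in corpus if any(word in doc.lower() for word in words)]


def assign_docs(inputs: dict) -> dict:
    """Keep the original keys and add a 'docs' key, as RunnableAssign does."""
    return {**inputs, "docs": fake_retriever(first_value(inputs))}


result = assign_docs({"Question": "revenue Q3"})
print(result["docs"])
```

Because the question is read by position rather than by key name, the same pattern keeps working if a dataset names its input field something other than `Question`, as long as the question is the first (or only) input.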

In [22]:
from functools import partial

from langchain_benchmarks.rag import get_eval_config
from langsmith.client import Client

client = Client()
RAG_EVALUATION = get_eval_config()
chain = create_chain(retriever)
test_run = client.run_on_dataset(
    dataset_name=task.name,
    llm_or_chain_factory=chain,
    evaluation=RAG_EVALUATION,
    verbose=True,
)

View the evaluation results for project 'test-earnest-stretch-74' at:
http://localhost/o/00000000-0000-0000-0000-000000000000/projects/p/2f2c0482-06b2-4219-98d7-3ce64ac621e9?eval=true

View all tests for Dataset Semi-structured Earnings at:
http://localhost/o/00000000-0000-0000-0000-000000000000/datasets/df5a7c83-35d1-4cad-973b-dde6d70e1735
[------------------------------------------------->] 5/5
 Eval quantiles:
                                          inputs.Question  \
count                                                   5   
unique                                                  5   
top     Analyzing the operating expenses for Q3 2023, ...   
freq                                                    1   
mean                                                  NaN   
std                                                   NaN   
min                                                   NaN   
25%                                                   NaN   
50%                               

## Example processing the docs

RAG apps are only as good as the information they are able to retrieve. Let's try indexing summaries of the tables to
improve the likelihood that they are retrieved whenever a user asks a relevant question.

We will use unstructured's `partition_pdf` functionality and generate summaries using an LLM.

You can define your own indexing pipeline to see how it impacts the downstream performance.
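
As a rough illustration of what a custom indexing step can look like, the sketch below selects table elements by their `element_type` metadata (the same key this notebook filters on), summarizes each one, and returns the originals plus the summaries. It uses only the standard library: the `Doc` dataclass is a hypothetical stand-in for LangChain's `Document`, and `stub_summarize` is a placeholder for an LLM call, so treat it as a shape to adapt rather than the notebook's actual pipeline.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def stub_summarize(doc: Doc) -> Doc:
    """Placeholder for an LLM summary: truncates the text and keeps the metadata."""
    return Doc(page_content="Summary: " + doc.page_content[:40], metadata=doc.metadata)


def add_table_summaries(docs: list, summarize=stub_summarize) -> list:
    """Return the original docs followed by one summary per table element."""
    tables = [d for d in docs if d.metadata.get("element_type") == "table"]
    return docs + [summarize(t) for t in tables]


sample_docs = [
    Doc("Quarterly revenue by segment ...", {"element_type": "table"}),
    Doc("Management commentary on the quarter.", {"element_type": "text"}),
]
indexed = add_table_summaries(sample_docs)
print(len(indexed))
```

Keeping the summarizer as a parameter makes it straightforward to swap the stub for a real model call and compare how each variant affects retrieval.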

In [14]:
from operator import itemgetter

from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.document import Document
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable.passthrough import RunnableAssign

# Prompt
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are summarizing semi-structured tables or text in a pdf.\n\n```document\n{doc}\n```",
        ),
        ("user", "Write a concise summary."),
    ]
)

# Summary chain
model = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-16k")


def create_doc(x) -> Document:
    return Document(
        page_content=x["output"],
        metadata=x["doc"].metadata,
    )


summarize_chain = (
    {"doc": lambda x: x}
    | RunnableAssign({"prompt": prompt})
    | {
        "output": itemgetter("prompt") | model | StrOutputParser(),
        "doc": itemgetter("doc"),
    }
    | create_doc
)

In [15]:
summaries = summarize_chain.batch(
    [doc for doc in docs if doc.metadata["element_type"] == "table"]
)

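Index the documents and create the retriever. We will reuse the same retriever factory, this time passing the table summaries along with the original documents.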

In [16]:
# Indexes the documents with the specified embeddings
retriever_with_summaries = retriever_factory(
    embeddings,
    docs=docs + summaries,
    # Specify a unique transformation name to avoid local cache collisions with other indices.
    transformation_name="table-summaries",
)

  0%|          | 0/30 [00:00<?, ?it/s]

### Evaluate

We'll evaluate the new chain on the same dataset.

In [25]:
chain_2 = create_chain(retriever_with_summaries)

test_run_with_summaries = client.run_on_dataset(
    dataset_name=task.name,
    llm_or_chain_factory=chain_2,
    evaluation=RAG_EVALUATION,
    verbose=True,
)

View the evaluation results for project 'test-terrific-smile-42' at:
http://localhost/o/00000000-0000-0000-0000-000000000000/projects/p/ec94b1ee-c37f-42b9-a141-4a5997dd7efb?eval=true

View all tests for Dataset Semi-structured Earnings at:
http://localhost/o/00000000-0000-0000-0000-000000000000/datasets/df5a7c83-35d1-4cad-973b-dde6d70e1735
[------------------------------------------------->] 5/5
 Eval quantiles:
                                          inputs.Question  \
count                                                   5   
unique                                                  5   
top     Analyzing the operating expenses for Q3 2023, ...   
freq                                                    1   
mean                                                  NaN   
std                                                   NaN   
min                                                   NaN   
25%                                                   NaN   
50%                                