# Preparing training data for RAFT

We'll build the training dataset in the following steps:

- **Collect domain-specific documents**: Gather documents relevant to the domain you want the LLM to specialize in (e.g., medical papers from PubMed, legal documents, API documentation for software).
- **Chunk the files into Documents**: Split each file into smaller Document chunks.
- **Generate questions**: For each Document chunk, generate a set of Questions that can be answered from that chunk.
- **Define the documents for each Document-Question pair**:
  - **Golden Document (D*)**: The Document that contains the answer to the question.
  - **Distractor Documents (Dk)**: Documents that do not contain relevant information.
- **Create Question-Answer-Document triplets**: For each Document-Question pair, generate a factual Answer based on the Golden Document.
- **Add distractor documents**: Combine the Golden Document with distractor Documents to form the context of each example.
- **Generate and save the dataset**
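
The steps above can be sketched for a single example as follows. This is a minimal illustration, not the notebook's final implementation: the function name `build_raft_example`, the record keys, and the default of three distractors are all assumptions made for this sketch, and plain strings stand in for Document chunks.

```python
import random


def build_raft_example(question, answer, golden_doc, all_docs,
                       num_distractors=3, seed=None):
    """Assemble one training record: a question, a shuffled context made of
    the golden document plus distractors, and the answer.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)

    # Every chunk other than the golden one is a distractor candidate
    candidates = [doc for doc in all_docs if doc != golden_doc]
    distractors = rng.sample(candidates, k=min(num_distractors, len(candidates)))

    # Shuffle so the golden document does not always appear first
    context = [golden_doc] + distractors
    rng.shuffle(context)

    return {
        "question": question,
        "context": context,
        "golden_document": golden_doc,
        "answer": answer,
    }


# Toy data standing in for real Document chunks
docs = ["chunk about invoices", "chunk about agents", "chunk about retrieval",
        "chunk about planning", "chunk about memory"]

example = build_raft_example(
    question="What does the invoice workflow extract?",
    answer="Key invoice details.",
    golden_doc=docs[0],
    all_docs=docs,
    num_distractors=2,
    seed=0,
)
```

Passing a `seed` makes the sampled distractors reproducible, which helps when regenerating the dataset.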

In [1]:
! pip install langchain langchain-community langchain-openai pypdf


## Import libraries

In [4]:
import random
from langchain_core.documents import Document
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai.chat_models import ChatOpenAI
from google.colab import userdata

In [None]:
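# Read the OpenAI API key from Colab secrets
OPENAI_API_KEY = userdata.get("OPENAI_API_KEY")
# Confirm the key was loaded without echoing the secret in the cell output
print("OpenAI API key loaded:", bool(OPENAI_API_KEY))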
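
The cell above only works inside Google Colab. As a minimal sketch for running the notebook elsewhere, the helper below falls back to an environment variable when `google.colab` cannot be imported. The function name `get_openai_api_key` is hypothetical, and the sketch assumes the key has been exported as `OPENAI_API_KEY`.

```python
import os


def get_openai_api_key(name="OPENAI_API_KEY"):
    """Return the API key from Colab secrets if available,
    otherwise from an environment variable (hypothetical helper)."""
    try:
        # google.colab is only importable inside a Colab runtime
        from google.colab import userdata
    except ImportError:
        userdata = None

    if userdata is not None:
        return userdata.get(name)

    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"Set the {name} environment variable first.")
    return key
```

Calling `OPENAI_API_KEY = get_openai_api_key()` then replaces the direct `userdata.get(...)` call.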

## Setting up the LLM

In [7]:
model = ChatOpenAI(model_name="gpt-4o-mini", api_key=OPENAI_API_KEY)

## Loading and chunking documents

In [8]:

def load_and_chunk_pdf(pdf_path):
    """
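    Load a PDF file and split it into overlapping, character-based chunks
    using LangChain's RecursiveCharacterTextSplitter.

    Args:
        pdf_path (str): Path to the PDF file

    Returns:
        list[Document]: Chunks of at most 3000 characters, with up to 500
        characters of overlap between consecutive chunks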
    """
    # Initialize the PDF loader
    loader = PyPDFLoader(pdf_path)

    # Load the document
    pages = loader.load()

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=3000,
        chunk_overlap=500,
        length_function=len,
        is_separator_regex=False,
    )

    chunks = text_splitter.split_documents(pages)

    return chunks
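
To see what `chunk_size` and `chunk_overlap` control, here is a simplified sliding-window illustration in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on separators such as paragraph and line breaks, so its chunks are usually shorter than `chunk_size`. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows where each window repeats the
    last `chunk_overlap` characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_window_chunks(text, chunk_size=10, chunk_overlap=4)

for piece in pieces:
    print(piece)
```

The overlap keeps a sentence that straddles a chunk boundary readable in at least one chunk. Because `PyPDFLoader` returns one Document per page and `split_documents` splits each Document separately, chunks never span a page boundary, so the last chunk of a page can be much shorter than `chunk_size`.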

In [9]:
# Load and chunk the Agentic RAG PDF
chunks = load_and_chunk_pdf("/content/AgenticRAG.pdf")


In [10]:

print(f"Total number of chunks: {len(chunks)}")

# Print information about the first 7 chunks
for i, chunk in enumerate(chunks[:7], start=1):
    print(f"\nChunk {i}:")
    print(f"Content length: {len(chunk.page_content)}")
    print(f"Metadata: {chunk.metadata}")
    print("-" * 50)

Total number of chunks: 54

Chunk 1:
Content length: 2966
Metadata: {'producer': 'PDFium', 'creator': 'PDFium', 'creationdate': 'D:20260221163449', 'source': '/content/AgenticRAG.pdf', 'total_pages': 39, 'page': 0, 'page_label': '1'}
--------------------------------------------------

Chunk 2:
Content length: 528
Metadata: {'producer': 'PDFium', 'creator': 'PDFium', 'creationdate': 'D:20260221163449', 'source': '/content/AgenticRAG.pdf', 'total_pages': 39, 'page': 0, 'page_label': '1'}
--------------------------------------------------

Chunk 3:
Content length: 2908
Metadata: {'producer': 'PDFium', 'creator': 'PDFium', 'creationdate': 'D:20260221163449', 'source': '/content/AgenticRAG.pdf', 'total_pages': 39, 'page': 1, 'page_label': '2'}
--------------------------------------------------

Chunk 4:
Content length: 2601
Metadata: {'producer': 'PDFium', 'creator': 'PDFium', 'creationdate': 'D:20260221163449', 'source': '/content/AgenticRAG.pdf', 'total_pages': 39, 'page': 1, 'page_label'

Let's look at the contents of a randomly selected chunk.

In [11]:
sample_index = random.randint(0, len(chunks)-1)
chunk = chunks[sample_index]
chunk

Document(metadata={'producer': 'PDFium', 'creator': 'PDFium', 'creationdate': 'D:20260221163449', 'source': '/content/AgenticRAG.pdf', 'total_pages': 39, 'page': 27, 'page_label': '28'}, page_content='Figure 23: An Overview of Agentic Document Workflows (ADW)\n[36]\nUse Case: Invoice Payments Workflow\nPrompt: Generate a payment recommendation report based on the submitted invoice and associated vendor\ncontract terms.\nSystem Process (ADW Workflow):\n1. Parse the invoice to extract key details such as invoice number, date, vendor information, line items,\nand payment terms.\n2. Retrieve the corresponding vendor contract to verify payment terms and identify any applicable\ndiscounts or compliance requirements.\n3. Generate a payment recommendation report that includes original amount due, potential early payment\ndiscounts, budget impact analysis, and strategic payment actions.\nResponse: Integrated Response: "Invoice INV-2025-045 for $15,000.00 has been processed. An early payment\ndi