# RAG-Based Pipeline for Contract Field Extraction
This pipeline uses Retrieval-Augmented Generation (RAG) to extract 13 specific fields (e.g., Labor Category, Bill Rate, Pay Rate) from contract PDFs. The approach combines PDF text extraction, sentence-level chunking, semantic embedding, Azure AI Search (formerly Azure Cognitive Search) for vector storage, and GPT-4 for the final extraction. The cells below walk through the pipeline end to end using a sample PDF, `Data/TheOrangeBook.pdf`.

In [1]:
#pip install tiktoken
#pip install azure-ai-documentintelligence
#pip install pypdf
#pip install pymupdf
#pip install aiohttp
#pip install rich
#pip install tenacity
#pip install --upgrade azure-search-documents
#pip install openai
#pip install python-dotenv requests

In [1]:
#Import libraries
from dotenv import load_dotenv
import os
import asyncio
import requests

from azure.core.credentials import AzureKeyCredential

# Chunking and PDF parsing helpers.
# prepdocslib is not covered by the pip installs above; it must be importable from this notebook's path.
from prepdocslib.textsplitter import SentenceTextSplitter
from prepdocslib.pdfparser import DocumentAnalysisParser

# Azure AI Search: index management, document upload, and querying
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex, SearchField, SearchableField, SearchFieldDataType, SimpleField,
    VectorSearch, VectorSearchProfile, HnswAlgorithmConfiguration
)
from azure.search.documents import SearchClient

# Azure OpenAI client for the answer-generation step
from openai import AzureOpenAI

In [2]:
#load env
load_dotenv()

True

In [3]:
#constants
ai_doc_endpoint = os.getenv('AI_DOCUMENT_ENDPOINT')
ai_doc_key = os.getenv('AI_DOCUMENT_KEY')

ai_embed_endpoint = os.getenv('AI_EMBEDDING_ENDPOINT')
ai_embed_key = os.getenv('AI_EMBEDDING_KEY')

ai_search_endpoint = os.getenv('AI_SEARCH_ENDPOINT')
ai_search_key = os.getenv('AI_SEARCH_KEY')

ai_openai_endpoint = os.getenv('AI_OPENAI_ENDPOINT')
ai_openai_key = os.getenv('AI_OPENAI_KEY')
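
`os.getenv` returns `None` for a variable that is not set, and that only surfaces later as an authentication or URL error. A minimal sketch of an up-front check, using the variable names from the constants cell above (the helper name `missing_env_vars` is hypothetical):

```python
import os

# Variable names taken from the constants cell above.
REQUIRED_ENV_VARS = [
    "AI_DOCUMENT_ENDPOINT", "AI_DOCUMENT_KEY",
    "AI_EMBEDDING_ENDPOINT", "AI_EMBEDDING_KEY",
    "AI_SEARCH_ENDPOINT", "AI_SEARCH_KEY",
    "AI_OPENAI_ENDPOINT", "AI_OPENAI_KEY",
]

def missing_env_vars(names=REQUIRED_ENV_VARS, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]
```

Calling `missing_env_vars()` right after `load_dotenv()` and raising if the list is non-empty would make a misconfigured `.env` file fail fast.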

In [4]:
async def file_processor(
    file_path,
    document_intelligence_service=ai_doc_endpoint,
    document_intelligence_key=ai_doc_key,
    use_content_understanding=False,
    content_understanding_endpoint="tbd",  # placeholder; content understanding is disabled by default here
):
    """Parse a PDF with Azure AI Document Intelligence and split it into sentence-level chunks."""
    # Sentence-level splitter that turns parsed pages into chunks
    sentence_text_splitter = SentenceTextSplitter()

    # Document Intelligence parser that extracts the text page by page
    doc_int_parser = DocumentAnalysisParser(
        endpoint=document_intelligence_service,
        credential=AzureKeyCredential(document_intelligence_key),
        use_content_understanding=use_content_understanding,
        content_understanding_endpoint=content_understanding_endpoint,
    )

    # Open the file as binary and collect the parsed pages
    with open(file_path, "rb") as f:
        pages = [page async for page in doc_int_parser.parse(f)]

    # Chunk the parsed pages
    chunks = list(sentence_text_splitter.split_pages(pages))

    return chunks
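
To make the idea of sentence-level chunking concrete, here is a simplified sketch. It is not the `prepdocslib` implementation (which is not shown in this notebook); the function name, the character budget, and the one-sentence overlap are all assumptions chosen for illustration.

```python
import re

def split_into_chunks(text, max_chars=1000, overlap_sentences=1):
    """Group sentences into chunks of roughly max_chars characters.

    The last overlap_sentences sentences of each chunk are repeated at the
    start of the next one, so context is not lost at chunk boundaries.
    A single sentence longer than max_chars is kept whole.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            # Close the current chunk and seed the next one with the overlap
            chunks.append(" ".join(current))
            current = current[-overlap_sentences:] if overlap_sentences else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

The overlap trades some duplicated text in the index for a better chance that a fact spanning two sentences lands in a single retrieved chunk.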



In [5]:
chunks = await file_processor(file_path="Data/TheOrangeBook.pdf")


In [6]:
chunks[:3]

[<prepdocslib.page.SplitPage at 0x203e6c0af90>,
 <prepdocslib.page.SplitPage at 0x203e0653390>,
 <prepdocslib.page.SplitPage at 0x203e6aa3390>]

In [7]:
len(chunks)

240

In [8]:
for i, chunk in enumerate(chunks[:2]):
    print(f"\n--- Chunk {i+1} ---")
    print(chunk.text)


--- Chunk 1 ---
SKILLSTORM
Training Internal Use Only
The Orange Book
For Training Department Programs and Process Documentation
<figure><table><tr><td colSpan=2>Note: This documentation is to be updated only by training leadership. Contact Valerie Braun vbraun@skillstorm.com if edits are needed.</td></tr><tr><td>Program Checklists &amp; Process Documentation</td><td>6</td></tr><tr><td>Trainer Pre-Course Checklist: TFBD / VetTec</td><td>8</td></tr><tr><td>Process Documentation</td><td>8</td></tr><tr><td>Voucher/Licenses/Books pre-order:</td><td>8</td></tr><tr><td>Learning Platform Access - Onboarding</td><td>9</td></tr><tr><td>Canvas</td><td>9</td></tr><tr><td>Settings:</td><td>10</td></tr><tr><td>Outlook Calendar Invite - Main Meeting</td><td>15</td></tr><tr><td>Outlook Calendar Invite - Quality Audits</td><td>17</td></tr><tr><td>Microsoft Teams</td><td>17</td></tr><tr><td>Creating a Team</td><td>17</td></tr><tr><td>Team Member Permissions</td><td>19</td></tr><tr><td>Lecture Recordin

### Send Text Chunks to the OpenAI Embedding Endpoint

In [9]:
texts = [x.text for x in chunks]

In [10]:
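def get_embedding(text):
    """Return the embedding vector for one string.

    ai_embed_endpoint is posted to directly, so it must be the full request URL
    of the embeddings deployment.
    """
    headers = {
        "Content-Type": "application/json",
        "api-key": ai_embed_key
    }
    data = {
        "input": text,
    }
    # A timeout keeps a stalled request from hanging the notebook
    response = requests.post(ai_embed_endpoint, headers=headers, json=data, timeout=30)
    response.raise_for_status()
    return response.json()["data"][0]["embedding"]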
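
Embedding hundreds of chunks one request at a time can run into throttling, in which case `raise_for_status()` aborts the whole run. `tenacity` appears in the install list but is not used in this notebook; the sketch below shows the same retry-with-exponential-backoff idea using only the standard library. The helper name and its defaults are assumptions, and retrying on every exception is deliberately coarse: a real version would likely retry only on throttling and server errors.

```python
import time

def call_with_retry(func, *args, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(*args), retrying on any exception with exponential backoff.

    Waits base_delay, then 2x, 4x, ... between attempts and re-raises the
    last exception once max_attempts is reached.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args)
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

With this helper, the embedding loop could be written as `[call_with_retry(get_embedding, x) for x in texts]`.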

In [11]:
# Embed every chunk (one HTTP request per chunk, so this can take a while for long documents)
text_embeddings = [get_embedding(x) for x in texts]

In [12]:
text_embeddings[:1]

[[-0.022580983,
  -0.0073415665,
  -0.009807069,
  -0.025341796,
  0.007224816,
  0.010239734,
  -0.024064405,
  -0.022072773,
  -0.010782282,
  -0.028652025,
  0.01457325,
  0.0019538593,
  -0.014793016,
  0.030080507,
  0.0010584836,
  0.020905266,
  0.015053988,
  -0.015301226,
  -0.016372586,
  -0.041947886,
  -0.03477801,
  0.018666396,
  -0.020369586,
  0.010576251,
  -0.0029685614,
  -0.004385024,
  0.015534727,
  -0.036069136,
  0.0035505986,
  0.0013237483,
  0.021949155,
  0.008756312,
  0.012574751,
  -0.020287173,
  -0.019023517,
  -0.019257018,
  -0.014545779,
  -0.001748687,
  0.010837223,
  0.016056672,
  0.008021468,
  0.026262067,
  -0.014463367,
  -0.005161073,
  0.0007125234,
  0.01808951,
  0.009504891,
  -0.0014611023,
  0.01263656,
  0.030739805,
  0.00807641,
  -0.0050855284,
  -0.006572385,
  -0.007197345,
  0.015369902,
  0.00028672628,
  0.014490837,
  0.032030933,
  -0.001061059,
  -0.008069542,
  -0.0018594286,
  -0.018364217,
  -0.019627875,
  -0.012767046,

### Format Records for Azure AI Search

In [13]:
documents = [
    {
        "id": f"chunk-{i}",
        "content": text,
        "embedding": embedding
    }
    for i, (text, embedding) in enumerate(zip(texts, text_embeddings))
]
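
Before uploading, it can help to check that every record is well formed. The sketch below flags records with empty text or an embedding of unexpected length; the expected length of 1536 matches the `vector_search_dimensions` value used in the index definition later in this notebook, and the helper name is hypothetical.

```python
def find_bad_records(documents, expected_dim=1536):
    """Return the ids of records with empty content or a wrong embedding length."""
    return [
        d["id"]
        for d in documents
        if len(d["embedding"]) != expected_dim or not d["content"].strip()
    ]
```

An empty result from `find_bad_records(documents)` means every record has text and a vector of the expected size.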


In [14]:
documents[:1]

[{'id': 'chunk-0',
  'content': 'SKILLSTORM\nTraining Internal Use Only\nThe Orange Book\nFor Training Department Programs and Process Documentation\n<figure><table><tr><td colSpan=2>Note: This documentation is to be updated only by training leadership. Contact Valerie Braun vbraun@skillstorm.com if edits are needed.</td></tr><tr><td>Program Checklists &amp; Process Documentation</td><td>6</td></tr><tr><td>Trainer Pre-Course Checklist: TFBD / VetTec</td><td>8</td></tr><tr><td>Process Documentation</td><td>8</td></tr><tr><td>Voucher/Licenses/Books pre-order:</td><td>8</td></tr><tr><td>Learning Platform Access - Onboarding</td><td>9</td></tr><tr><td>Canvas</td><td>9</td></tr><tr><td>Settings:</td><td>10</td></tr><tr><td>Outlook Calendar Invite - Main Meeting</td><td>15</td></tr><tr><td>Outlook Calendar Invite - Quality Audits</td><td>17</td></tr><tr><td>Microsoft Teams</td><td>17</td></tr><tr><td>Creating a Team</td><td>17</td></tr><tr><td>Team Member Permissions</td><td>19</td></tr><tr>

### Set Up the Vector Database: Create an Index


In [15]:
index_client = SearchIndexClient(
    endpoint=ai_search_endpoint,
    credential=AzureKeyCredential(ai_search_key)
)

In [17]:
index_name = "orange-book-index"

In [19]:
vector_profile = VectorSearchProfile(
    name="my-vector-profile2",
    algorithm_configuration_name="my-hnsw-config"
)

In [20]:
fields = [
    SimpleField(name="id", type=SearchFieldDataType.String, key=True),
    SearchableField(name="text", type=SearchFieldDataType.String),
    SearchField(
        name="embedding",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,
        filterable=False,
        sortable=False,
        facetable=False,
        vector_search_profile_name="my-vector-profile2",
        vector_search_dimensions=1536
    )
]

In [21]:
vector_search_config = VectorSearch(
    algorithms=[HnswAlgorithmConfiguration(name="my-hnsw-config")],
    profiles=[vector_profile]
)
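
HNSW is an approximate nearest-neighbor index. As a mental model of what it approximates, here is an exact brute-force version in plain Python. It assumes cosine similarity as the metric, which the configuration above does not set explicitly, and the function names are illustrative only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, documents, k=3):
    """Exact nearest-neighbor search: score every document and keep the best k ids."""
    scored = [
        (cosine_similarity(query_vector, d["embedding"]), d["id"])
        for d in documents
    ]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]
```

Scoring every document is fine for a few hundred chunks; an HNSW graph exists to avoid that full scan once the index grows large.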

In [22]:
index = SearchIndex(
    name=index_name,
    fields=fields,
    vector_search=vector_search_config
)

In [23]:
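# Create the index. This fails if an index with this name already exists;
# index_client.create_or_update_index(index) can be used to overwrite it instead.
index_client.create_index(index)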

<azure.search.documents.indexes.models._index.SearchIndex at 0x203e6aa2ad0>

### Upload Chunk Data to Azure AI Search

In [24]:
search_client = SearchClient(
    endpoint=ai_search_endpoint,
    index_name=index_name,
    credential=AzureKeyCredential(ai_search_key)
)

In [25]:
docs_to_upload = [
    {
        "id": doc["id"],
        "text": doc["content"],  # the index defined above names the chunk-text field "text"
        "embedding": doc["embedding"]
    }
    for doc in documents
]
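
Azure AI Search accepts at most 1,000 documents per indexing batch, and each returned `IndexingResult` carries `key` and `succeeded` attributes. A hedged sketch of uploading in batches and collecting the keys that failed follows; the helper name is hypothetical, and `upload` stands for any callable with the same shape as `search_client.upload_documents`.

```python
def upload_in_batches(upload, docs, batch_size=1000):
    """Send docs in batches via upload(documents=...) and return the keys that failed."""
    failed_keys = []
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        # Each result is expected to expose .key and .succeeded
        for result in upload(documents=batch):
            if not result.succeeded:
                failed_keys.append(result.key)
    return failed_keys
```

For example, `upload_in_batches(search_client.upload_documents, docs_to_upload)` would return an empty list when every chunk was indexed.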

In [26]:
# Upload the documents to Azure AI Search
search_client.upload_documents(documents=docs_to_upload)

[<azure.search.documents._generated.models._models_py3.IndexingResult at 0x203e6c62900>,
 <azure.search.documents._generated.models._models_py3.IndexingResult at 0x203eede0a50>,
 <azure.search.documents._generated.models._models_py3.IndexingResult at 0x203eede0b90>,
 <azure.search.documents._generated.models._models_py3.IndexingResult at 0x203e67ddba0>,
 <azure.search.documents._generated.models._models_py3.IndexingResult at 0x203e67dfce0>,
 <azure.search.documents._generated.models._models_py3.IndexingResult at 0x203e6c8af90>,
 <azure.search.documents._generated.models._models_py3.IndexingResult at 0x203e6c86be0>,
 <azure.search.documents._generated.models._models_py3.IndexingResult at 0x203e6c86cf0>,
 <azure.search.documents._generated.models._models_py3.IndexingResult at 0x203e6c7d250>,
 <azure.search.documents._generated.models._models_py3.IndexingResult at 0x203e6c7db50>,
 <azure.search.documents._generated.models._models_py3.IndexingResult at 0x203e6c2b3e0>,
 <azure.search.docume

### Query Function: Retrieve Relevant Chunks and Answer a New Question

In [27]:
default_system_message = '''
         You are an assistant that tries to get information out of text chunks.
         Answer the query using only the sources provided below.
         Answer ONLY with the facts listed in the list of sources below.
         Do not repeat the question to me, just provide the answer in a concise way.
         If there isn't enough information below, say you don't know.
         Do not generate answers that don't use the sources below.'''

In [None]:
def AskAndAnswer(query, system_message=default_system_message):
    """Retrieve the top 3 chunks for the query and ask the chat model to answer from them."""
    # Embed the query with the same endpoint used for the chunks
    query_vector = get_embedding(query)

    # Pure vector search (no keyword text) against the "embedding" field
    results = search_client.search(
        search_text=None,
        vector_queries=[{
            "kind": "vector",
            "vector": query_vector,
            "fields": "embedding"
        }],
        top=3
    )
    results = list(results)

    # For debugging: print the retrieved chunks
    for result in results:
        print(result["text"])
    print("\n\n\n\n")

    # Concatenate the retrieved chunks into one context block
    top_chunks = "\n\n".join([result["text"] for result in results])

    client = AzureOpenAI(
        api_key=ai_openai_key,  # Azure OpenAI key
        api_version="2023-05-15",
        azure_endpoint=ai_openai_endpoint
    )

    # Generate the answer from the retrieved chunks
    response = client.chat.completions.create(
        model="gpt-4",  # with Azure OpenAI this is the deployment name
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": f"""Using the following document excerpts, answer this question:
        {query}

        DOCUMENT:
        {top_chunks}
        """}
        ],
        temperature=0.1
    )

    return response.choices[0].message.content
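
The function above answers free-form questions. For the field-extraction goal stated in the introduction, one possible approach is to ask the model for a JSON object and parse the reply defensively. The sketch below is an assumption about how that could look, not the pipeline's actual prompt; it uses only the three field names given in the introduction, since the remaining fields are not listed in this notebook, and both helper names are hypothetical.

```python
import json

# Only the three fields named in the introduction; the full list is not given here.
FIELDS = ["Labor Category", "Bill Rate", "Pay Rate"]

def build_extraction_prompt(chunks, fields=FIELDS):
    """Build a user message asking for the fields as a JSON object."""
    sources = "\n\n".join(f"[source {i + 1}]\n{c}" for i, c in enumerate(chunks))
    field_list = ", ".join(f'"{name}"' for name in fields)
    return (
        f"Return a JSON object with exactly these keys: {field_list}. "
        "Use null for any field the sources do not state.\n\n"
        f"SOURCES:\n{sources}"
    )

def parse_extraction(reply, fields=FIELDS):
    """Parse the model reply, keeping only expected keys and defaulting to None."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    return {name: data.get(name) for name in fields}
```

The prompt from `build_extraction_prompt([r["text"] for r in results])` would take the place of the question prompt in the user message, and `parse_extraction` would be applied to the returned message content.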


In [37]:
AskAndAnswer("What's the requirements after doing an Apprenti competency interview?")

Check Interviews
Apprentices registered through the DOL and Apprenti, will need to complete a competency check interview in alignment with their SOW in order to graduate from their apprenticeship. Assistant account managers will check in with apprentices to schedule their competency check interview at either the 2, 5th, or 10th month of their apprenticeship (once deployed). Normally, apprentices will complete their competencies and graduate from their apprenticeship around the 3-month mark. Trainers are responsible for conducting these interviews and accurately documenting their competency as well as communicating the outcome to the client services team.
The Assistant Account Managers will schedule these interviews with you through your HubSpot meeting link. Please ensure your calendar is up to date to avoid rescheduling.
Preparing for the Interview:
o Download the Apprenti Competency Checklist. This is the interview scorecard you will :unselected: use during the interview.
Open HubSpo

'After completing an Apprenti competency interview, the following requirements must be fulfilled:\n\n1. **Update HubSpot**:\n   - Open the apprentice\'s contact in HubSpot.\n   - Open the pinned PDF under Activities.\n   - Fill out the competency checklist (PDF) included within HubSpot.\n   - Mark if the apprentice has completed their requirements under the Competency Schedule section in the PDF, if applicable.\n   - Update the accurate percentage of competency under the property "Apprenti Competency Completion" in the HubSpot contact.\n\n2. **Email the Assistant Account Manager**:\n   - Include training leadership and Ian Go in the CC.\n   - Provide the interview result.\n   - Report progress on Competency Schedule requirements.\n   - Document 3-5 bullet points on the apprentice\'s current work role and responsibilities (Work Plan).\n\nImportant: Apprentices cannot graduate if they have not completed all the terms.'

In [38]:
AskAndAnswer("What if like... I really don't wanna do all that PIP stuff? Can I just fire them? Tell me why I can do that")

Rights Act</td><td>Writing the PIP</td></tr><tr><td>Engaged</td><td>you&#x27;re not in the</td><td></td><td>of 1964</td><td>Communicating</td></tr><tr><td>Kitchen Show</td><td>room</td><td>Daily assignments</td><td>FMLA</td><td>the PIP</td></tr><tr><td>Magic</td><td>What they</td><td></td><td>Parental /</td><td>Enforcing the PIP</td></tr><tr><td>Battling Demo</td><td>should do on</td><td>Supplementary</td><td>Bereavement Leave</td><td></td></tr><tr><td>Demons</td><td>Mondays during</td><td>material -</td><td>PTO During Training</td><td>Academic</td></tr><tr><td>Surprise</td><td>1/1 blocks</td><td>Official guides,</td><td>Military Leave of</td><td>Probation</td></tr><tr><td>Questions</td><td>Bedside manner</td><td>Oracle Trails,</td><td>Absence</td><td>(VetTec)</td></tr><tr><td>Monitoring Chat</td><td>Energy,</td><td>MDN,</td><td>International Travel</td><td>Attendance</td></tr><tr><td>Raised Hands</td><td>Enthusiasm</td><td>HackerRank,</td><td>During Training</td><td>Performance</td></

'The document indicates that termination is a possible outcome for performance-related issues, but it also emphasizes the importance of providing the individual with a clear understanding of expectations and an opportunity to improve through a Performance Dossier Action Plan. Termination decisions are evaluated by leaders across multiple departments, and the process involves scheduling and communicating the termination appropriately. While termination is possible, the process appears to require following specific steps and evaluations rather than bypassing the improvement plan entirely.'

In [39]:
AskAndAnswer("How do I garnish the wages of the trainees in my batch? Is there a precedent for this? I want it to all be funneled into my checking account because they owe me")

tax forms and paid</td></tr></table></figure>
Top of DocumentSKILLSTORM
Training Internal Use Only
<figure><table><tr><td></td><td>time off, etc. This portal is also available to trainers for the purposes of tax document and time off request management. &quot; Management Portal: This is where you will spend most of your time as a trainer. Each trainer is responsible for approving timecards and PTO requests for each of their candidates and trainees. For more information, see Trainee Time Management. [insert link]</td></tr><tr><td>Performance Dossier</td><td>This is an official, documented conversation to be had with Tech Force by Design candidates when there is a technical performance concern. This document should clearly outline the concern, provide the individual in question with a fair and achievable plan to improve, and a signature from all parties involved in the conversation. Please check OneDrive regularly to make sure you always have the most up-to-date version of this document.

"I don't know. The provided document excerpts do not contain any information about garnishing wages of trainees or any precedent for such actions."