# Contract Review Workflow

<a href="https://colab.research.google.com/github/run-llama/llamacloud-demo/blob/main/examples/document_workflows/contract_review/contract_review.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![](https://github.com/run-llama/llamacloud-demo/blob/main/examples/document_workflows/contract_review/contract_review.png?raw=1)

This tutorial shows how to build an agentic workflow that reviews a contract for compliance with a set of regulations. We parse the contract into a set of key clauses, match them against relevant clauses from a guideline repository (here, GDPR), and then produce a compliance summary.

In [1]:
!pip install llama-index==0.10.68 llama-index-indices-llama-cloud==0.1.0 llama-cloud==0.1.6 llama-parse==0.4.9 # ==0.12.5



In [2]:
import nest_asyncio

nest_asyncio.apply()

## Setup

We set up an index over the guidelines. In this case, it is just the GDPR document.

We also set up our parser.

In [3]:
!mkdir -p data
!wget "https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679" -O data/gdpr.pdf

--2024-12-18 07:16:32--  https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679
Resolving eur-lex.europa.eu (eur-lex.europa.eu)... 143.204.29.13, 143.204.29.129, 143.204.29.7, ...
Connecting to eur-lex.europa.eu (eur-lex.europa.eu)|143.204.29.13|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/pdf]
Saving to: ‘data/gdpr.pdf’

data/gdpr.pdf           [  <=>               ] 959.27K  2.74MB/s    in 0.3s    

2024-12-18 07:16:33 (2.74 MB/s) - ‘data/gdpr.pdf’ saved [982296]



### Setup Index
Here we use LlamaCloud: https://cloud.llamaindex.ai/. If you don't have access yet, you're always welcome to use our open-source VectorStoreIndex.

In [4]:
# option 1
from llama_index.indices.managed.llama_cloud import LlamaCloudIndex

index = LlamaCloudIndex(
  name="gdpr",
  project_name="llamacloud_demo",
  organization_id="cdcb3478-1348-492e-8aa0-25f47d1a3902",
  api_key="llx-..."
)

retriever = index.as_retriever(similarity_top_k=2)

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /usr/local/lib/python3.10/dist-
[nltk_data]     packages/llama_index/core/_static/nltk_cache...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


ImportError: cannot import name 'default_transformations' from 'llama_index.core.ingestion.api_utils' (/usr/local/lib/python3.10/dist-packages/llama_index/core/ingestion/api_utils.py)

In [None]:
!pwd

### Setup Parser

Here we use LlamaParse to parse the vendor agreement.

In [None]:
from llama_parse import LlamaParse

# parse documents into markdown
parser = LlamaParse(result_type="markdown")

### Define Contract Output Schema

We want to extract relevant clauses from the agreement so that we can match them against relevant clauses in the GDPR. This schema defines how the set of extracted clauses is structured.

In [None]:
from typing import List, Optional
from pydantic import BaseModel, Field

class ContractClause(BaseModel):
    clause_text: str = Field(..., description="The exact text of the clause.")
    mentions_data_processing: bool = Field(False, description="True if the clause involves personal data collection or usage.")
    mentions_data_transfer: bool = Field(False, description="True if the clause involves transferring personal data, especially to third parties or across borders.")
    requires_consent: bool = Field(False, description="True if the clause explicitly states that user consent is needed for data activities.")
    specifies_purpose: bool = Field(False, description="True if the clause specifies a clear purpose for data handling or transfer.")
    mentions_safeguards: bool = Field(False, description="True if the clause mentions security measures or other safeguards for data.")

class ContractExtraction(BaseModel):
    vendor_name: Optional[str] = Field(None, description="The vendor's name if identifiable.")
    effective_date: Optional[str] = Field(None, description="The effective date of the agreement, if available.")
    governing_law: Optional[str] = Field(None, description="The governing law of the contract, if stated.")
    clauses: List[ContractClause] = Field(..., description="List of extracted clauses and their relevant indicators.")
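
To make the shape concrete, here is a minimal sketch that uses a standard-library dataclass as a stand-in for `ContractClause`. The stand-in (`ClauseSketch`) and the clause text are assumptions made for illustration so that the snippet runs without Pydantic. It shows roughly the JSON the matching step receives later, when the workflow serializes a clause with `model_dump_json()`.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ClauseSketch:
    """Stand-in for ContractClause with the same field names (illustrative only)."""
    clause_text: str
    mentions_data_processing: bool = False
    mentions_data_transfer: bool = False
    requires_consent: bool = False
    specifies_purpose: bool = False
    mentions_safeguards: bool = False


# Hypothetical clause, not taken from any real contract.
clause = ClauseSketch(
    clause_text="Vendor may transfer personal data to affiliates outside the EU.",
    mentions_data_processing=True,
    mentions_data_transfer=True,
)

# Serialize the clause the way it would be embedded in a prompt.
clause_json = json.dumps(asdict(clause), indent=2)
print(clause_json)
```

Flags left unset default to `False`, so the LLM only needs to mark the indicators it actually finds in a clause.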

### Define Compliance Check Schema

Define a schema that matches clauses with relevant guidelines in GDPR.

In [None]:
from typing import Optional
from pydantic import BaseModel, Field

class GuidelineMatch(BaseModel):
    guideline_text: str = Field(..., description="The single most relevant guideline excerpt related to this clause.")
    similarity_score: float = Field(..., description="Similarity score indicating how closely the guideline matches the clause, e.g., between 0 and 1.")
    relevance_explanation: Optional[str] = Field(None, description="Brief explanation of why this guideline is relevant.")

class ClauseComplianceCheck(BaseModel):
    clause_text: str = Field(..., description="The exact text of the clause from the contract.")
    matched_guideline: Optional[GuidelineMatch] = Field(None, description="The most relevant guideline extracted via vector retrieval.")
    compliant: bool = Field(..., description="Indicates whether the clause is considered compliant with the referenced guideline.")
    notes: Optional[str] = Field(None, description="Additional commentary or recommendations.")

### Define Final Output Schema

This is the schema for the final compliance report. It contains the vendor name, whether the contract is compliant overall, and summary notes.

The report is inferred from the individual checks for each clause.

In [None]:
from typing import Optional, List
from pydantic import BaseModel, Field

class ComplianceReport(BaseModel):
    vendor_name: Optional[str] = Field(None, description="The vendor's name if identified from the contract.")
    overall_compliant: bool = Field(..., description="Indicates if the contract is considered overall compliant.")
    summary_notes: Optional[str] = Field(None, description="General summary or recommendations for achieving full compliance.")
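
As a rough illustration of how a report can be inferred from per-clause checks, the sketch below rolls the checks up deterministically. This is not how the workflow in this tutorial produces the report (it asks the LLM to write it); `CheckSketch` and `summarize` are hypothetical names, and the sample checks are invented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CheckSketch:
    """Stand-in for ClauseComplianceCheck, reduced to the fields used here."""
    clause_text: str
    compliant: bool
    guideline_text: Optional[str] = None  # None when no guideline was matched
    notes: Optional[str] = None


def summarize(checks: List[CheckSketch]) -> dict:
    """Roll per-clause checks up into a report-like dict without calling an LLM."""
    failing = [c for c in checks if not c.compliant]
    lines = [
        f"- {c.clause_text} (guideline: {c.guideline_text or 'none matched'})"
        for c in failing
    ]
    return {
        "overall_compliant": not failing,
        "summary_notes": "\n".join(lines) or "All clauses passed.",
    }


# Hypothetical checks for illustration.
checks = [
    CheckSketch("Data is encrypted at rest.", compliant=True),
    CheckSketch("Subprocessors need no approval.", compliant=False),
]
report = summarize(checks)
print(report)
```

A rule like this treats a single failing clause as overall noncompliance; the LLM-generated report can additionally explain what to change.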

## Setup Contract Review Workflow

Let's define the following contract review workflow:
1. Extract structured data from the vendor agreement.
2. For each clause, retrieve the relevant GDPR guidelines and check whether the clause complies with them.
3. Generate a final summary.
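
The three steps above follow a map-reduce pattern: fan out one check per clause, then gather the results into a single summary. The sketch below shows that pattern with plain `asyncio` rather than the LlamaIndex workflow API. `check_clause` and `review` are hypothetical helpers, and the string rule is a toy placeholder for retrieval plus an LLM call.

```python
import asyncio
from typing import List


async def check_clause(clause: str) -> dict:
    """Placeholder for retrieval plus an LLM compliance check."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    # Toy rule standing in for the model's judgment.
    return {"clause": clause, "compliant": "without approval" not in clause}


async def review(clauses: List[str]) -> dict:
    # Map: launch one check per clause concurrently.
    results = await asyncio.gather(*(check_clause(c) for c in clauses))
    # Reduce: keep the failures and derive an overall verdict.
    failing = [r for r in results if not r["compliant"]]
    return {"overall_compliant": not failing, "non_compliant": failing}


# Hypothetical clauses for illustration.
outcome = asyncio.run(review([
    "Data is encrypted in transit.",
    "Vendor may engage subprocessors without approval.",
]))
print(outcome)
```

In the workflow defined next, the same roles are played by `ctx.send_event` (fan out) and `ctx.collect_events` (gather).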

In [None]:
from llama_index.core.workflow import (
    Event,
    StartEvent,
    StopEvent,
    Context,
    Workflow,
    step,
)
from llama_index.core.llms import LLM
from llama_index.llms.openai import OpenAI  # default LLM used in __init__
from typing import Optional
from pydantic import BaseModel
from llama_index.core import SimpleDirectoryReader
from llama_index.core.schema import Document
from llama_index.core.agent import FunctionCallingAgentWorker
from llama_index.core.prompts import ChatPromptTemplate
from llama_index.core.llms import ChatMessage, MessageRole
from llama_index.core.retrievers import BaseRetriever
from pathlib import Path
import logging
import json
import os

_logger = logging.getLogger(__name__)
_logger.setLevel(logging.INFO)


CONTRACT_EXTRACT_PROMPT = """\
You are given contract data below. \
Please extract out relevant information from the contract into the defined schema - the schema is defined as a function call.\

{contract_data}
"""

CONTRACT_MATCH_PROMPT = """\
Given the following contract clause and the corresponding relevant guideline text, evaluate the compliance \
and provide a JSON object that matches the ClauseComplianceCheck schema.

**Contract Clause:**
{clause_text}

**Matched Guideline Text(s):**
{guideline_text}
"""


COMPLIANCE_REPORT_SYSTEM_PROMPT = """\
You are a compliance reporting assistant. Your task is to generate a final compliance report \
based on the results of clause compliance checks against \
a given set of guidelines.

Analyze the provided compliance results and produce a structured report according to the specified schema.
Ensure that if there are no noncompliant clauses, the report clearly indicates full compliance.
"""

COMPLIANCE_REPORT_USER_PROMPT = """\
A set of clauses within a contract were checked against GDPR compliance guidelines for the following vendor: {vendor_name}.
The set of noncompliant clauses are given below.

Each section includes:
- **Clause:** The exact text of the contract clause.
- **Guideline:** The relevant GDPR guideline text.
- **Compliance Status:** Should be `False` for noncompliant clauses.
- **Notes:** Additional information or explanations.

{compliance_results}

Based on the above compliance results, generate a final compliance report following the `ComplianceReport` schema below.
If there are no noncompliant clauses, the report should indicate that the contract is fully compliant.
"""


class ContractExtractionEvent(Event):
    contract_extraction: ContractExtraction


class MatchGuidelineEvent(Event):
    clause: ContractClause


class MatchGuidelineResultEvent(Event):
    result: ClauseComplianceCheck


class GenerateReportEvent(Event):
    match_results: List[ClauseComplianceCheck]


class LogEvent(Event):
    msg: str
    delta: bool = False


class ContractReviewWorkflow(Workflow):
    """Contract review workflow."""

    def __init__(
        self,
        parser: LlamaParse,
        guideline_retriever: BaseRetriever,
        llm: LLM | None = None,
        similarity_top_k: int = 20,
        output_dir: str = "data_out",
        **kwargs,
    ) -> None:
        """Init params."""
        super().__init__(**kwargs)

        self.parser = parser
        self.guideline_retriever = guideline_retriever

        self.llm = llm or OpenAI(model="gpt-4o-mini")
        self.similarity_top_k = similarity_top_k

        # if not exists, create
        out_path = Path(output_dir) / "workflow_output"
        if not out_path.exists():
            out_path.mkdir(parents=True, exist_ok=True)
            os.chmod(str(out_path), 0o0777)
        self.output_dir = out_path

    @step
    async def parse_contract(
        self, ctx: Context, ev: StartEvent
    ) -> ContractExtractionEvent:
        # load the cached extraction if it exists
        contract_extraction_path = Path(
            f"{self.output_dir}/contract_extraction.json"
        )
        if contract_extraction_path.exists():
            if self._verbose:
                ctx.write_event_to_stream(LogEvent(msg=">> Loading contract from cache"))
            with open(contract_extraction_path, "r") as fp:
                contract_extraction_dict = json.load(fp)
            contract_extraction = ContractExtraction.model_validate(contract_extraction_dict)
        else:
            if self._verbose:
                ctx.write_event_to_stream(LogEvent(msg=">> Reading contract"))

            # no need to parse contract, it's already in markdown
            # you can use LlamaParse to parse more complex PDFs + other docs

            docs = SimpleDirectoryReader(input_files=[ev.contract_path]).load_data()

            # extract from contract
            prompt = ChatPromptTemplate.from_messages([
                ("user", CONTRACT_EXTRACT_PROMPT)
            ])
            contract_extraction = await self.llm.astructured_predict(
                ContractExtraction,
                prompt,
                contract_data="\n".join([d.get_content(metadata_mode="all") for d in docs])
            )
            if not isinstance(contract_extraction, ContractExtraction):
                raise ValueError(f"Invalid extraction from contract: {contract_extraction}")
            # cache the extraction to file
            with open(contract_extraction_path, "w") as fp:
                fp.write(contract_extraction.model_dump_json())
        if self._verbose:
            ctx.write_event_to_stream(LogEvent(msg=f">> Contract data: {contract_extraction.model_dump()}"))

        return ContractExtractionEvent(contract_extraction=contract_extraction)

    @step
    async def dispatch_guideline_match(
        self, ctx: Context, ev: ContractExtractionEvent
    ) -> MatchGuidelineEvent:
        """For each clause in the contract, find relevant guidelines.

        Use a map-reduce pattern.

        """
        await ctx.set("num_clauses", len(ev.contract_extraction.clauses))
        await ctx.set("vendor_name", ev.contract_extraction.vendor_name)

        for clause in ev.contract_extraction.clauses:
            ctx.send_event(MatchGuidelineEvent(clause=clause, vendor_name=ev.contract_extraction.vendor_name))

    @step
    async def handle_guideline_match(
        self, ctx: Context, ev: MatchGuidelineEvent
    ) -> MatchGuidelineResultEvent:
        """Handle matching clause against guideline."""

        # retrieve matching guideline
        query = f"""\
Please find the relevant guideline from {ev.vendor_name} that aligns with the following contract clause:

{ev.clause.clause_text}
"""
        guideline_docs = self.guideline_retriever.retrieve(query)
        guideline_text="\n\n".join([g.get_content() for g in guideline_docs])
        if self._verbose:
            ctx.write_event_to_stream(
                LogEvent(msg=f">> Found guidelines: {guideline_text[:200]}...")
            )

        # check the clause against the retrieved guidelines
        prompt = ChatPromptTemplate.from_messages([
            ("user", CONTRACT_MATCH_PROMPT)
        ])
        compliance_output = await self.llm.astructured_predict(
            ClauseComplianceCheck,
            prompt,
            clause_text=ev.clause.model_dump_json(),
            guideline_text=guideline_text

        )

        if not isinstance(compliance_output, ClauseComplianceCheck):
            raise ValueError(f"Invalid compliance check: {compliance_output}")

        return MatchGuidelineResultEvent(result=compliance_output)

    @step
    async def gather_guideline_match(
        self, ctx: Context, ev: MatchGuidelineResultEvent
    ) -> GenerateReportEvent:
        """Gather all clause results, then trigger report generation."""
        num_clauses = await ctx.get("num_clauses")
        events = ctx.collect_events(ev, [MatchGuidelineResultEvent] * num_clauses)
        if events is None:
            return

        match_results = [e.result for e in events]
        # save match results
        match_results_path = Path(
            f"{self.output_dir}/match_results.jsonl"
        )
        with open(match_results_path, "w") as fp:
            for mr in match_results:
                fp.write(mr.model_dump_json() + "\n")


        return GenerateReportEvent(match_results=match_results)

    @step
    async def generate_output(
        self, ctx: Context, ev: GenerateReportEvent
    ) -> StopEvent:
        if self._verbose:
            ctx.write_event_to_stream(LogEvent(msg=">> Generating Compliance Report"))

        # keep only the noncompliant results; an empty list means full compliance
        non_compliant_results = [r for r in ev.match_results if not r.compliant]

        # generate compliance results string
        result_tmpl = """
1. **Clause**: {clause}
2. **Guideline:** {guideline}
3. **Compliance Status:** {compliance_status}
4. **Notes:** {notes}
"""
        non_compliant_strings = []
        for nr in non_compliant_results:
            non_compliant_strings.append(
                result_tmpl.format(
                    clause=nr.clause_text,
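                    # matched_guideline is Optional, so guard against None
                    guideline=(
                        nr.matched_guideline.guideline_text
                        if nr.matched_guideline
                        else "No guideline matched"
                    ),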
                    compliance_status=nr.compliant,
                    notes=nr.notes
                )
            )
        non_compliant_str = "\n\n".join(non_compliant_strings)

        prompt = ChatPromptTemplate.from_messages([
            ("system", COMPLIANCE_REPORT_SYSTEM_PROMPT),
            ("user", COMPLIANCE_REPORT_USER_PROMPT)
        ])
        compliance_report = await self.llm.astructured_predict(
            ComplianceReport,
            prompt,
            compliance_results=non_compliant_str,
            vendor_name=await ctx.get("vendor_name")
        )

        return StopEvent(result={"report": compliance_report, "non_compliant_results": non_compliant_results})

In [None]:
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o")
workflow = ContractReviewWorkflow(
    parser=parser,
    guideline_retriever=retriever,
    llm=llm,
    verbose=True,
    timeout=None,  # don't worry about timeout to make sure it completes
)

#### Visualize the workflow

In [None]:
from llama_index.utils.workflow import draw_all_possible_flows

draw_all_possible_flows(ContractReviewWorkflow, filename="contract_workflow.html")

## Run the Workflow

Let's run the full workflow and generate the output!

In [None]:
from IPython.display import clear_output

handler = workflow.run(contract_path="data/vendor_agreement.md")
async for event in handler.stream_events():
    if isinstance(event, LogEvent):
        if event.delta:
            print(event.msg, end="")
        else:
            print(event.msg)

response_dict = await handler
print(str(response_dict["report"]))

In [None]:
print(str(response_dict["report"]))

vendor_name='ACME Office Supply, Inc.' overall_compliant=False summary_notes="The contract contains noncompliant clauses regarding subprocessors and data transfer. It allows engaging subprocessors without prior client approval and lacks the client's right to object. Additionally, it does not mention additional safeguards or compliance with standard contractual clauses for data transfer, which are recommended to protect data subjects' rights. To achieve full compliance, these clauses should be revised to align with GDPR guidelines."


In [None]:
response_dict["non_compliant_results"]

[ClauseComplianceCheck(clause_text='- Vendor may engage subprocessors without prior Client approval - Subprocessors may be located in any jurisdiction globally - Notice of new subprocessors provided within 30 days of engagement - Client has no right to object to new subprocessors', matched_guideline=GuidelineMatch(guideline_text='The processor shall not engage another processor without prior specific or general written authorisation of the controller. In the case of general written authorisation, the processor shall inform the controller of any intended changes concerning the addition or replacement of other processors, thereby giving the controller the opportunity to object to such changes.', similarity_score=0.9, relevance_explanation='The guideline specifies that the processor must obtain prior authorization from the controller before engaging subprocessors, and must inform the controller of changes, allowing them to object. The contract clause does not comply with these requirement