# Contract Price Adjustment Detection using IBM Granite Models

**Granite 4 Small × Granite Docling 258M** for clause extraction and classification

In this notebook, we demonstrate how to automatically **detect and classify price adjustment clauses** (such as CPI-based or cost-based increases) in B2B contracts.

Automating this analysis helps procurement, finance, and legal teams quickly identify pricing flexibility and escalation risk across large volumes of supplier agreements.

We use:
- **Granite Docling (`ibm-granite/granite-docling-258M-mlx`)** for PDF-to-Text conversion  
- **Granite 4 Small (`ibm/granite-4-h-small`)** for semantic analysis and clause classification  

This workflow extracts clauses like:
- CPI-linked adjustments (inflation, cost-of-living)
- Cost-based adjustments (materials, energy, logistics)
- Penalty or performance-based price changes
- Explicitly fixed price (no adjustment)


### 1. Install Dependencies

Install all required libraries for document parsing and LLM-based classification.

In [None]:
%pip install "git+https://github.com/ibm-granite-community/utils" \
    docling \
    langchain \
    langchain_ibm \
    langchain_community \
    transformers \
    mlx-vlm

### 2. Import Libraries

Load all dependencies and configure the core components for extraction, logging, and analysis.


In [None]:
import os
import json
import logging
import pandas as pd
from IPython.display import display
from docling.datamodel import vlm_model_specs
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline
from langchain_ibm import WatsonxLLM  # the only LangChain class this notebook uses


### 3. Initialize Granite Docling & WatsonxLLM

Before extracting and classifying contracts, we need to initialize our two main engines:

- **Granite Docling (`ibm-granite/granite-docling-258M-mlx`)**: a multimodal image-text-to-text model that converts complex documents (PDFs, scanned images, etc.) into structured, machine-readable formats such as Markdown, HTML, or JSON.

- **Granite 4 Small (`ibm/granite-4-h-small`)**: the LLM that semantically classifies the extracted clauses.
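
The `-mlx` build of Granite Docling runs on Apple's MLX framework, which targets Apple Silicon. The sketch below is one way to decide which Docling spec to request on a given machine. The helper name `pick_docling_spec_name` is hypothetical, and it assumes the installed Docling version also exposes a `GRANITEDOCLING_TRANSFORMERS` spec next to `GRANITEDOCLING_MLX`; check `docling.datamodel.vlm_model_specs` before relying on it.

```python
import platform


def pick_docling_spec_name(system=None, machine=None):
    """Return the name of the vlm_model_specs attribute to use on this host.

    Hypothetical helper: MLX needs macOS on Apple Silicon, so every other
    platform falls back to the (assumed) Transformers-based spec.
    """
    system = system or platform.system()
    machine = machine or platform.machine()
    if system == "Darwin" and machine == "arm64":
        return "GRANITEDOCLING_MLX"
    return "GRANITEDOCLING_TRANSFORMERS"


spec_name = pick_docling_spec_name()
print(f"Suggested Docling spec: vlm_model_specs.{spec_name}")
```

The returned name could then be resolved with `getattr(vlm_model_specs, spec_name)` when building `VlmPipelineOptions`.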

In [None]:

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)

# Initialize Granite Docling
pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_model_specs.GRANITEDOCLING_MLX,
)

doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        )
    }
)
logger.info("Granite Docling initialized")

# Initialize WatsonxLLM
api_key = os.getenv("WATSON_API_KEY")
project_id = os.getenv("WATSON_PROJECT_ID")
watsonx_url = os.getenv("WATSON_URL", "https://us-south.ml.cloud.ibm.com")


if not api_key or not project_id:
    logger.error("WATSON_API_KEY or WATSON_PROJECT_ID environment variables not set")
    raise ValueError("Missing required environment variables")

llm = WatsonxLLM(
    model_id="ibm/granite-4-h-small",
    apikey=api_key,
    url=watsonx_url,
    project_id=project_id,
    params={"decoding_method": "greedy", "max_new_tokens": 3000},
)

logger.info("WatsonxLLM initialized")
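
The credential check in this cell reports that something is missing but not which variable. A minimal sketch of a stricter loader follows; the function name `load_watsonx_settings` and the `env` parameter are illustrative and not part of the notebook, while the variable names and default URL are the ones used above.

```python
import os

REQUIRED_VARS = ("WATSON_API_KEY", "WATSON_PROJECT_ID")
DEFAULT_URL = "https://us-south.ml.cloud.ibm.com"


def load_watsonx_settings(env=None):
    """Collect watsonx settings and name every missing variable.

    `env` defaults to os.environ; passing a dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return {
        "apikey": env["WATSON_API_KEY"],
        "project_id": env["WATSON_PROJECT_ID"],
        "url": env.get("WATSON_URL") or DEFAULT_URL,
    }


# Demo with placeholder values (not real credentials).
settings = load_watsonx_settings(
    {"WATSON_API_KEY": "demo-key", "WATSON_PROJECT_ID": "demo-project"}
)
print(settings["url"])
```

The returned dictionary uses the same keyword names as the `WatsonxLLM(...)` call above, so it could be unpacked into that constructor alongside `model_id` and `params`.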


### 4. Define Helper Functions

These utility functions perform the following key tasks:

- **`extract_contract_text()`** → Converts a PDF contract into Markdown text using Granite Docling (text beyond `max_chars` is truncated)  
- **`classify_contract()`** → Sends the text to Granite 4 Small and parses the JSON clause classification it returns  
- **`process_contracts()`** → Batch-processes all PDF contracts in a directory  

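Each function logs its progress for traceability. On failure, `extract_contract_text()` and `classify_contract()` log the error and return `None` instead of raising, so one bad file does not stop the batch.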
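
One behaviour of `extract_contract_text()` is worth knowing up front: it keeps only the first `max_chars` characters, so clauses near the end of a long contract never reach the model. A possible alternative, sketched here under the assumption that each chunk would be classified separately, is to split the text into overlapping windows. The helper `chunk_text` and its `overlap` parameter are hypothetical, and merging the per-chunk results is left out.

```python
def chunk_text(text, max_chars=32000, overlap=2000):
    """Split text into windows of at most max_chars characters.

    Consecutive windows share `overlap` characters so that a clause cut at a
    boundary still appears whole in one of the two neighbouring windows.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")

    step = max_chars - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


# Small numbers keep the demo readable; real contracts would use the defaults.
sample = "x" * 70
parts = chunk_text(sample, max_chars=32, overlap=4)
print([len(p) for p in parts])
```

A larger overlap lowers the chance of splitting a clause but increases the number of model calls.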

In [None]:
def extract_contract_text(file_path, max_chars=32000):
    """Extract contract text using Granite Docling"""
    try:
        logger.info(f"Extracting: {file_path}")
        result = doc_converter.convert(file_path)
        text = result.document.export_to_markdown()
        
        # Validate text length
        if len(text) > max_chars:
            logger.warning(f"Contract exceeds {max_chars} chars ({len(text)}), truncating")
            text = text[:max_chars]
        
        logger.info(f"Extracted {len(text)} characters")
        return text
    except Exception as e:
        logger.error(f"Extraction failed: {e}")
        return None

def classify_contract(contract_name, contract_text):
    """Classify contract using Granite 4 with robust JSON-only prompt"""

    prompt = f"""
You are a professional contract analysis model trained to identify all pricing mechanisms in B2B service agreements.

Your goal is to find, classify, and explain **all clauses that describe how prices can change, or confirm that prices cannot change**.

### Classification categories
- **"CPI-based"** – Price changes tied to inflation, CPI-U, CPI-W, cost of living, or similar indices.
- **"Cost-based"** – Price changes tied to supplier costs, fuel, materials, energy, labor, or other input variations.
- **"Penalty-based"** – Adjustments linked to performance, service levels, or penalties (e.g., late payments, SLA breaches).
- **"No price increase"** – Clauses explicitly stating that prices are fixed, capped, or not subject to increase for the term.

### Important rules
1. If a single section contains multiple mechanisms (e.g. CPI + cost), create **separate clause entries**.
2. Only mark `has_price_increases = true` if **at least one** clause allows upward price movement.
3. If the contract says prices are fixed or capped for the term, mark `"No price increase"` and `has_price_increases = false`.
4. If wording is ambiguous (e.g. “subject to market review”), classify as `"Cost-based"` with `"confidence": "Low"`.
5. Confidence levels:
   - **High** – Clear, explicit adjustment wording.
   - **Medium** – Indirect or conditional adjustment language.
   - **Low** – Unclear, inferred, or conflicting statements.

### Output rules
- Output **only one valid JSON object**, starting with '{{' and ending with '}}'.
- Do not include any markdown, code fences, or text outside JSON.
- Use **double quotes only** for strings.
- Return empty lists if no clauses are found.

### Output schema
{{
  "contract_name": "{contract_name}",
  "total_clauses_found": <integer>,
  "price_adjustment_clauses": [
    {{
      "clause_id": <integer>,
      "classification": "<CPI-based|Cost-based|Penalty-based|No price increase>",
      "section": "<section heading or short context>",
      "supporting_clause": "<exact quote from contract>",
      "confidence": "<High|Medium|Low>",
      "explanation": "<brief reasoning>"
    }}
  ],
  "summary": {{
    "has_price_increases": <true|false>,
    "adjustment_types": [<list of strings>],
    "overall_assessment": "<short synthesis of pricing structure and mechanisms>"
  }}
}}

If no price-related clauses are found:
- Set "total_clauses_found": 0,
- "price_adjustment_clauses": [],
- "summary.has_price_increases": false,
- "summary.overall_assessment": "No price adjustment or escalation clauses detected; pricing appears fixed."

### CONTRACT TEXT
{contract_text}
"""

    try:
        logger.info(f"Classifying: {contract_name}")
        # invoke() is the supported entry point; calling the LLM object directly is deprecated
        response = llm.invoke(prompt).strip()

        # Extract JSON if LLM wraps it with extra characters
        if "{" in response:
            start = response.find("{")
            end = response.rfind("}") + 1
            response = response[start:end]

        # Parse JSON
        result = json.loads(response)

        # Validate JSON structure
        if not isinstance(result.get("price_adjustment_clauses"), list):
            logger.error("Invalid price_adjustment_clauses format")
            return None

        if not isinstance(result.get("summary"), dict):
            logger.error("Invalid summary format")
            return None

        clause_count = len(result["price_adjustment_clauses"])
        logger.info(f"Found {clause_count} clause(s) in {contract_name}")
        return result

    except json.JSONDecodeError as e:
        logger.error(f"JSON decode error: {e}")
        logger.error(f"Response snippet: {response[:300]}...")
        return None
    except Exception as e:
        logger.error(f"Classification error: {e}")
        return None

def process_contracts(data_dir):
    """Process all contracts in directory"""
    
    results = []
    
    if not os.path.exists(data_dir):
        logger.error(f"Directory not found: {data_dir}")
        return results
    
    # Sort so contracts are processed in a reproducible order
    pdf_files = sorted(f for f in os.listdir(data_dir) if f.lower().endswith(".pdf"))
    logger.info(f"Processing {len(pdf_files)} contracts...")
    
    for idx, file_name in enumerate(pdf_files, 1):
        file_path = os.path.join(data_dir, file_name)
        logger.info(f"[{idx}/{len(pdf_files)}] {file_name}")
        
        # Extract with Granite Docling
        contract_text = extract_contract_text(file_path)
        
        if not contract_text:
            logger.warning(f"Extraction failed, skipping {file_name}")
            continue
        
        classification = classify_contract(file_name, contract_text)
        
        if classification:
            results.append(classification)
        else:
            logger.warning(f"Classification failed for {file_name}")
    
    return results
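
`classify_contract()` isolates the JSON by slicing from the first `{` to the last `}`. That breaks if the model appends text containing a stray brace after the object. A more tolerant approach, sketched here as an alternative and not as the notebook's method, uses the standard library's `json.JSONDecoder.raw_decode`, which parses one JSON value starting at a given index and ignores whatever follows. The helper name `extract_first_json_object` is hypothetical.

```python
import json


def extract_first_json_object(text):
    """Return the first JSON object embedded in text, or None if there is none."""
    decoder = json.JSONDecoder()
    pos = text.find("{")
    while pos != -1:
        try:
            # raw_decode parses a single value at `pos` and ignores the rest
            obj, _end = decoder.raw_decode(text, pos)
            return obj
        except json.JSONDecodeError:
            # Not a valid object at this brace; try the next one
            pos = text.find("{", pos + 1)
    return None


# Invented model output with prose around the JSON and a stray closing brace.
noisy = (
    'Here is the result: {"total_clauses_found": 0, '
    '"price_adjustment_clauses": []} (end of answer })'
)
print(extract_first_json_object(noisy))
```

With this input, the first-brace-to-last-brace slice would include `(end of answer }` and fail to parse, whereas `raw_decode` stops at the end of the first complete object.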


### 5. Run the Processing Pipeline

Specify the directory containing your contracts. Only `.pdf` files are picked up by `process_contracts()`.
The pipeline will extract text, classify clauses, and store results as structured JSON.

In [None]:
data_dir = "./Price_Detection_Sample_Contracts"
logger.info("="*80)
logger.info("Starting contract classification pipeline")
logger.info("="*80)

# Process all contracts
results = process_contracts(data_dir)

if results:
    # Save JSON output
    output_path = "./outputs/contract_classifications.json"
    os.makedirs("./outputs", exist_ok=True)
    
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)
    
    logger.info(f"JSON output saved: {output_path}")
    logger.info(f"{len(results)} contract(s) processed successfully")
else:
    logger.error("No contracts processed")
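
Because the model's JSON is only checked for its top-level shape, it can help to verify each result against the prompt's own rules before trusting the saved file. The sketch below is one possible check, not part of the notebook: `check_result` is a hypothetical helper, the allowed labels are copied from the prompt, and the sample record is invented for illustration.

```python
ALLOWED_CLASSES = {"CPI-based", "Cost-based", "Penalty-based", "No price increase"}
ALLOWED_CONFIDENCE = {"High", "Medium", "Low"}


def check_result(result):
    """Return a list of human-readable problems found in one classification result."""
    problems = []
    clauses = result.get("price_adjustment_clauses", [])

    if result.get("total_clauses_found") != len(clauses):
        problems.append("total_clauses_found does not match the number of clauses")

    for clause in clauses:
        cid = clause.get("clause_id")
        if clause.get("classification") not in ALLOWED_CLASSES:
            problems.append(f"clause {cid}: unknown classification")
        if clause.get("confidence") not in ALLOWED_CONFIDENCE:
            problems.append(f"clause {cid}: unknown confidence")

    # Rule 2 of the prompt: the flag needs at least one clause that can move prices
    flagged = result.get("summary", {}).get("has_price_increases")
    has_adjustable = any(
        c.get("classification") != "No price increase" for c in clauses
    )
    if flagged is True and not has_adjustable:
        problems.append("has_price_increases is true but no clause allows an adjustment")

    return problems


# Invented record containing two inconsistencies.
sample = {
    "contract_name": "example.pdf",
    "total_clauses_found": 2,
    "price_adjustment_clauses": [
        {"clause_id": 1, "classification": "No price increase", "confidence": "High"},
    ],
    "summary": {"has_price_increases": True},
}
for problem in check_result(sample):
    print(problem)
```

The flag is only checked in one direction, because a "Penalty-based" clause may adjust the price without increasing it.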


### 6. Generate Summary & Clause-Level Tables

The notebook generates two key outputs:

1. **Contract Summary Table**: One row per contract summarizing detected pricing mechanisms.

2. **Clause-Level Table**: Detailed breakdown of each detected clause (classification, text snippet, and confidence).

In [None]:
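# Flatten the contract-level summary (one row per contract)
df_summary = pd.json_normalize(results, sep='_')
logger.info(f"Building tables for {len(results)} contract(s)")
# Flatten the clause list (one row per clause, tagged with its contract)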
df_clauses = pd.json_normalize(
    results,
    record_path=['price_adjustment_clauses'],
    meta=['contract_name'],
    sep='_'
)
# === DISPLAY CLEAN TABLES ===
print("=== Contract Summary ===")
display(
    df_summary[[
        "contract_name",
        "summary_has_price_increases",
        "summary_adjustment_types",
        "summary_overall_assessment"
    ]].style.set_table_styles([
        {"selector": "th", "props": [("text-align", "left")]},
        {"selector": "td", "props": [("text-align", "left")]}
    ])
)
contracts_with_increases = [
    r for r in results 
    if r.get("summary", {}).get("has_price_increases") is True
]

if contracts_with_increases:
    print("=== Clause-Level Details ===")
    display(df_clauses[[
            "contract_name",
            "clause_id",
            "classification",
            "section",
            "confidence",
            "supporting_clause",
            "explanation"
        ]].style.set_table_styles([
            {"selector": "th", "props": [("text-align", "left")]},
            {"selector": "td", "props": [("text-align", "left")]}
        ])
    )
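
The `confidence` field can also drive a manual review queue. The sketch below lists clauses rated "Medium" or lower, least confident first, using plain Python on the `results` structure. The helper `review_queue` is hypothetical and the sample data is invented.

```python
CONFIDENCE_ORDER = {"Low": 0, "Medium": 1, "High": 2}


def review_queue(results, max_confidence="Medium"):
    """List (contract, clause_id, classification) at or below max_confidence."""
    limit = CONFIDENCE_ORDER[max_confidence]
    queue = []
    for contract in results:
        for clause in contract.get("price_adjustment_clauses", []):
            # Unknown or missing labels are treated as least confident
            rank = CONFIDENCE_ORDER.get(clause.get("confidence"), 0)
            if rank <= limit:
                queue.append((
                    rank,
                    contract.get("contract_name"),
                    clause.get("clause_id"),
                    clause.get("classification"),
                ))
    # Stable sort: least confident first, original order kept within a level
    queue.sort(key=lambda item: item[0])
    return [item[1:] for item in queue]


# Invented results in the same shape as the pipeline output.
sample_results = [
    {"contract_name": "a.pdf", "price_adjustment_clauses": [
        {"clause_id": 1, "classification": "CPI-based", "confidence": "High"},
        {"clause_id": 2, "classification": "Cost-based", "confidence": "Medium"},
    ]},
    {"contract_name": "b.pdf", "price_adjustment_clauses": [
        {"clause_id": 1, "classification": "Cost-based", "confidence": "Low"},
    ]},
]
for row in review_queue(sample_results):
    print(row)
```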


### 7. Summary
This workflow demonstrates how IBM Granite models can automatically extract, classify, and structure price adjustment clauses from complex contract documents.

Granite Docling converts each PDF into Markdown, and Granite 4 Small returns a structured JSON classification for every pricing clause it finds. Together they give legal and procurement teams a first-pass screen that can speed up contract review, compliance checks, and financial risk analysis.