# 🔧 Environment Configuration

This section loads environment variables required for Azure OpenAI API access.

## 📋 Required Environment Variables
- `AZURE_OPENAI_ENDPOINT`: Your Azure OpenAI service endpoint URL
- `AZURE_OPENAI_DEPLOYMENT`: The deployment name for your model

> **Note**: Make sure you have a `.env` file in your project root with these variables defined.

---

In [4]:
import os
from dotenv import load_dotenv

load_dotenv()
AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
AZURE_OPENAI_DEPLOYMENT = os.getenv("AZURE_OPENAI_DEPLOYMENT")
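
`os.getenv` returns `None` when a variable is missing, so a typo in `.env` would only surface later as a less obvious client error. The following is a minimal sketch of an early check. The helper name `require_env` and its `env` parameter are hypothetical and not part of the original notebook.

```python
import os
from typing import Dict, Iterable, Mapping, Optional


def require_env(names: Iterable[str], env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return the requested variables, raising if any is missing or blank.

    Hypothetical helper for illustration. `env` defaults to `os.environ` and
    exists mainly so the check can be exercised without touching the real
    environment.
    """
    source = os.environ if env is None else env
    # Treat unset and whitespace-only values the same way.
    values = {name: (source.get(name) or "").strip() for name in names}
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise EnvironmentError(f"Missing environment variables: {', '.join(missing)}")
    return values
```

One possible use, after `load_dotenv()` has run, is `settings = require_env(["AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_DEPLOYMENT"])`, which fails immediately if either value is absent.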

# 🔐 Authentication Setup

Setting up Azure authentication using DefaultAzureCredential for secure access to Azure OpenAI services.

## 🔑 Authentication Method
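This notebook uses **Azure DefaultAzureCredential**, which tries a chain of credential sources in order and uses the first one that succeeds. The chain includes:
- Environment variables (service principal settings)
- Managed Identity (if running on Azure)
- Azure CLI credentials
- Azure PowerShell credentials

The exact chain depends on the installed `azure-identity` version.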

## ✅ Benefits
- **Secure**: No hardcoded secrets in your code
- **Flexible**: Works across different environments
- **Automatic**: Handles token refresh automatically

---

In [5]:
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

token_provider = get_bearer_token_provider(DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default")

# 🤖 Azure OpenAI Client Initialization

Creating the Azure OpenAI client instance with the configured authentication and endpoint settings.

## 🔗 Client Configuration
- **API Version**: `2024-12-01-preview` (a preview API version; check the Azure OpenAI documentation for newer releases)
- **Authentication**: Token-based using Azure AD credentials
- **Endpoint**: Configured from environment variables

## 📝 Usage Notes
- This client instance will be used for all OpenAI API calls
- Token refresh is handled automatically
- Supports the Azure OpenAI operations available in this API version, such as chat completions and embeddings, provided a matching model deployment exists

In [6]:
from openai import AzureOpenAI

client = AzureOpenAI(
    api_version="2024-12-01-preview",
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    azure_ad_token_provider=token_provider,
)

# 📄 PDF Text Extraction

The following function extracts all text content from a PDF file using the PyPDF2 library.

## 🔧 Function Details
- **Purpose**: Reads a PDF file and converts all pages to plain text
- **Input**: File path to the PDF document
- **Output**: Combined text from all pages as a single string
- **Library**: Uses `PyPDF2.PdfReader` for PDF parsing (PyPDF2 is deprecated in favor of its successor, `pypdf`, which also provides `PdfReader`)

## 📝 How it works
1. Opens the PDF file in binary read mode
2. Creates a PDF reader object
3. Iterates through each page in the document
4. Extracts text from each page and concatenates it
5. Returns the complete text content

---

In [7]:
import PyPDF2

def get_text_from_pdf(input_file: str) -> str:
    """Extracts and returns text from a PDF file."""
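    page_texts = []
    with open(input_file, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        for page in reader.pages:
            # extract_text() can return an empty string (e.g. image-only pages)
            page_text = page.extract_text()
            if page_text:
                page_texts.append(page_text)
    # Join with newlines so words at page boundaries are not fused together
    return "\n".join(page_texts)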

In [8]:
import json
from pathlib import Path
from typing import List, Dict, Optional

def detect_pii_with_llm(text: str, output_pdf: Optional[str] = None) -> List[Dict]:
    """Calls the Azure OpenAI chat completion to detect PII in the provided text.

    Returns a list of objects with keys: 'text', 'category', 'confidence'.
    """
    prompt = (
        "Find all personally identifiable information (PII) in the following text that has either name, date, address or phone number only. "
        "Return a JSON array of objects, each with the following fields: "
        "'text' (the exact text span that is PII), 'category' (the type of PII, such as address, phone number, name, etc.), "
        "and 'confidence' (a score from 0 to 1 indicating your confidence in the categorization). "
        "Respond ONLY with the JSON array, no explanation or extra text.\nText: " + text
    )

    try:
        response = client.chat.completions.create(
            messages=[
                {"role": "system", "content": "You are a helpful assistant that extracts PII from text."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=16000,
            temperature=0.0,
            top_p=1.0,
            model=AZURE_OPENAI_DEPLOYMENT
        )
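        # The message content can be None (e.g. filtered responses), so default to ""
        content = (response.choices[0].message.content or "").strip()

        # Strip Markdown fences if present, with or without a "json" language tag
        if content.startswith("```"):
            content = content[3:]
            if content.startswith("json"):
                content = content[len("json"):]
        if content.endswith("```"):
            content = content[:-3]

        redact_words = json.loads(content.strip())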

        # If an expected output PDF path was provided, write the metadata JSON
        # next to that file using the `.meta.json` suffix (e.g. redacted_doc.meta.json).
        if output_pdf:
            # Compute the path before the try block so the warning can always name it.
            # Note: this file stores the detected PII in clear text.
            meta_path = Path(output_pdf).with_suffix('.meta.json')
            try:
                with open(meta_path, 'w', encoding='utf-8') as f:
                    json.dump(redact_words, f, ensure_ascii=False, indent=2)
                print(f"Metadata JSON saved as '{meta_path}'")
            except OSError as e:
                print(f"Warning: failed to write metadata file '{meta_path}': {e}")

        return redact_words
    except Exception as e:
        # Chain the original exception so the full traceback is preserved
        raise RuntimeError(f"Azure OpenAI PII extraction error: {e}") from e
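
The parsing step above depends on the model returning a clean JSON array, possibly wrapped in a Markdown fence. The following is a minimal sketch of a standalone parser that can be tested without calling the API. The name `parse_pii_response` is hypothetical, and the choice to drop entries without a non-empty `text` field is an assumption, not behavior of the original function.

```python
import json
import re
from typing import Dict, List, Optional

# Matches an optional ``` or ```json fence wrapped around the whole reply.
_FENCE_RE = re.compile(r"^```[A-Za-z]*\s*(.*?)\s*```$", re.DOTALL)


def parse_pii_response(content: Optional[str]) -> List[Dict]:
    """Parse a model reply that should contain a JSON array of PII objects."""
    cleaned = (content or "").strip()
    match = _FENCE_RE.match(cleaned)
    if match:
        cleaned = match.group(1).strip()
    items = json.loads(cleaned)  # raises json.JSONDecodeError on malformed input
    if not isinstance(items, list):
        raise ValueError("Expected a JSON array of PII objects")
    # Assumption: entries that are not dicts or lack a 'text' value are dropped.
    return [item for item in items if isinstance(item, dict) and item.get("text")]
```

Keeping the parsing separate from the network call makes it possible to check fence handling and malformed replies with plain strings.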

In [9]:
import fitz  # PyMuPDF

def redact_pdf(input_file: str, redact_words: List[Dict], output_file: Optional[str] = None) -> str:
    """Redacts occurrences of each redact word in the PDF and saves a new PDF.

    Returns the path to the saved redacted PDF.
    """
    inp = Path(input_file)
    if output_file is None:
        output_file = str(inp.with_name(f"redacted_{inp.name}"))

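    doc = fitz.open(input_file)
    try:
        for page in doc:
            for word in redact_words:
                text_to_find = word.get("text")
                if not text_to_find:
                    continue
                # Mark every occurrence of the PII span with a black redaction box
                for inst in page.search_for(text_to_find):
                    page.add_redact_annot(inst, fill=(0, 0, 0))
            # Permanently remove the text under the redaction annotations
            page.apply_redactions()
        doc.save(output_file)
    finally:
        # Release the file handle even if searching or saving fails
        doc.close()
    return output_file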
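
`redact_pdf` searches for each `text` value exactly as the model returned it and ignores the `confidence` field. Some spans, such as multi-line addresses, contain line breaks, and whether `page.search_for` matches them may depend on how the PDF lays out its text. The sketch below pre-processes the PII list into single-line search strings. The name `build_search_terms`, the `min_confidence` default of `0.5`, and the per-line splitting strategy are all assumptions for illustration. Splitting can also over-redact short fragments that appear elsewhere in the document.

```python
from typing import Dict, List


def build_search_terms(redact_words: List[Dict], min_confidence: float = 0.5) -> List[str]:
    """Turn PII items into a de-duplicated list of single-line search strings."""
    terms: List[str] = []
    seen = set()
    for item in redact_words:
        text = item.get("text") or ""
        try:
            confidence = float(item.get("confidence", 1.0))
        except (TypeError, ValueError):
            confidence = 1.0  # assumption: keep items whose score cannot be parsed
        if confidence < min_confidence:
            continue
        # Split multi-line spans into one term per line and collapse runs of
        # whitespace inside each line.
        for line in text.splitlines():
            term = " ".join(line.split())
            if term and term not in seen:
                seen.add(term)
                terms.append(term)
    return terms
```

The resulting strings could be passed to `page.search_for` one at a time in place of the raw `text` values.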

In [12]:
import sys
import os
from pathlib import Path
from typing import Optional

def main(input_file: str, output_file: Optional[str] = None) -> str:
    """Entry point: extract text from `input_file`, detect PII, redact PDF, and return output path.

    Args:
        input_file: path to the source PDF to redact.
        output_file: optional path for the redacted PDF. If None, a default
            filename `redacted_<input_filename>` will be used.
    """
    print(f"Extracting text from: {input_file}")
    text = get_text_from_pdf(input_file)

    if not text.strip():
        raise ValueError(f"No text extracted from {input_file}")

    print("Detecting PII via LLM...")
    # Determine expected output filename so the detector can save metadata next to it
    inp = Path(input_file)
    expected_output = output_file if output_file is not None else str(inp.with_name(f"redacted_{inp.name}"))
    redact_words = detect_pii_with_llm(text, output_pdf=expected_output)
    print(f"PII items found: {redact_words}")

    print("Applying redactions to PDF...")
    out = redact_pdf(input_file, redact_words, output_file=expected_output)
    print(f"Redacted PDF saved as '{out}'")
    return out

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python main.py <input_pdf|input_dir> [output_pdf|output_dir]")
        sys.exit(2)

    input_path = sys.argv[1]
    output_path = sys.argv[2] if len(sys.argv) > 2 else None

    try:
        # If the input is a directory, process all .pdf files inside it
        if os.path.isdir(input_path):
            input_dir = Path(input_path)
            # Determine the output directory: if provided, use it; otherwise create/use a subdir "redacted"
            if output_path:
                out_dir = Path(output_path)
                # If an output file path was passed accidentally, ensure it's a directory for multiple outputs
                if out_dir.exists() and not out_dir.is_dir():
                    raise ValueError(f"When input is a directory, output must be a directory: {out_dir}")
            else:
                out_dir = input_dir / "redacted"

            # Create output directory if it doesn't exist
            out_dir.mkdir(parents=True, exist_ok=True)

            pdf_files = [p for p in sorted(input_dir.iterdir()) if p.is_file() and p.suffix.lower() == ".pdf"]
            if not pdf_files:
                raise ValueError(f"No PDF files found in directory: {input_dir}")

            results = []
            for pdf in pdf_files:
                print(f"\nProcessing file: {pdf}")
                # For each file, construct an output filename within out_dir
                out_file = str(out_dir / f"redacted_{pdf.name}")
                try:
                    res = main(str(pdf), output_file=out_file)
                    results.append(res)
                except Exception as file_err:
                    print(f"Failed to process {pdf}: {file_err}")

            print('\nProcessing complete. Outputs:')
            for r in results:
                print(r)
        elif os.path.isfile(input_path):
            # Single-file mode, as advertised by the usage string above
            main(input_path, output_file=output_path)
        else:
            print(f"Input path is not a file or directory: {input_path}")
    except Exception as e:
        print(e)
        sys.exit(1)
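
The default output name (`redacted_<input_filename>`) is computed in both `redact_pdf` and `main`, and the metadata path is derived from it with the `.meta.json` suffix. The following sketch gathers both rules in one place so they cannot drift apart. The helper name `resolve_output_paths` is hypothetical and not part of the original notebook.

```python
from pathlib import Path
from typing import Optional, Tuple


def resolve_output_paths(input_file: str, output_file: Optional[str] = None) -> Tuple[Path, Path]:
    """Return (redacted_pdf_path, metadata_json_path) for an input PDF."""
    inp = Path(input_file)
    # Same default as the notebook: `redacted_<input_filename>` next to the input.
    if output_file is not None:
        pdf_path = Path(output_file)
    else:
        pdf_path = inp.with_name(f"redacted_{inp.name}")
    # Same metadata rule as the notebook: replace the final suffix with `.meta.json`.
    meta_path = pdf_path.with_suffix(".meta.json")
    return pdf_path, meta_path
```

For `sample.pdf` with no explicit output, this yields `redacted_sample.pdf` and `redacted_sample.meta.json`, which matches the metadata filename printed in the run below.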

In [15]:
# Call the main function to process sample.pdf
result_path = main("sample.pdf")
print(f"Processing complete! Redacted file saved at: {result_path}")

Extracting text from: sample.pdf
Detecting PII via LLM...
Metadata JSON saved as 'redacted_sample.meta.json'
PII items found: [{'text': 'John Michael Rodriguez', 'category': 'name', 'confidence': 0.99}, {'text': '03/15/1985', 'category': 'date', 'confidence': 0.99}, {'text': '(555) 123-4567', 'category': 'phone number', 'confidence': 0.99}, {'text': '1234 Maple Street\nApt 5B\nSpringfield, IL  62704', 'category': 'address', 'confidence': 0.99}, {'text': 'Maria Rodriguez', 'category': 'name', 'confidence': 0.99}, {'text': '(555) 987-6543', 'category': 'phone number', 'confidence': 0.99}, {'text': 'Sarah Michelle Chen', 'category': 'name', 'confidence': 0.99}, {'text': '(555) 876-5432', 'category': 'phone number', 'confidence': 0.99}, {'text': '5678 Oak Avenue, Unit 12, Chicago, IL  60601', 'category': 'address', 'confidence': 0.99}, {'text': 'Michael James Wilson', 'category': 'name', 'confidence': 0.99}, {'text': '(555) 345-6789', 'category': 'phone number', 'confidence': 0.99}, {'text