<a href="https://colab.research.google.com/github/isomjd-code/latin-courthand-correction/blob/main/latin_case_extractor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
# ==============================================================================
#        Legal Case Extractor for Transkribus PAGE XML (Corrected)
# ==============================================================================
#
# DESCRIPTION:
# This script, designed for Google Colab, processes a single page from a
# Transkribus document containing medieval legal records. It performs the
# following actions:
#
# 1.  Downloads the specified page's PAGE XML from Transkribus.
# 2.  Uses the Anthropic Claude 3.7 Sonnet model to intelligently segment the
#     transcribed text into individual legal cases.
# 3.  For each case, it uses Claude again to:
#     a. Extract key metadata (plaintiffs, defendants, plea type, etc.).
#     b. Expand the abbreviated Latin transcription into full Latin.
#     c. Translate the expanded Latin into modern English.
# 4.  Formats the final, structured data as JSON, Wiki Markup, or Markdown,
#     as specified by the user.
#
# HOW TO USE IN GOOGLE COLAB:
# 1.  In the left sidebar, click the key icon (🔑) and create secrets for:
#     (names must match the userdata.get() calls in the configuration section)
#     - TRANKRIBUS_USER
#     - TRANKRIBUS_PASSWORD
#     - anthropic_key
# 2.  Paste this entire script into a single cell in a new Colab notebook.
# 3.  Fill in the document details in the "USER CONFIGURATION" section below.
# 4.  Run the cell.
#
# ==============================================================================

# --- Step 0: Install Dependencies (only runs if in a Colab-like environment) ---
try:
    import google.colab
    print("Installing required libraries for Google Colab...")
    # Use -q for a quieter installation
    !pip install -q anthropic lxml requests
    print("Installation complete.")
except ImportError:
    print("Not running in Google Colab. Assuming libraries are already installed.")

# --- Library Imports ---
import os
import requests
import xml.etree.ElementTree as ET
import json
import anthropic
import time
import re
from typing import List, Dict, Any, Optional, Tuple

# Import userdata for secrets management in Colab
try:
    from google.colab import userdata
except ImportError:
    # Define a dummy class if not in Colab for local testing
    class UserData:
        def get(self, key):
            return os.environ.get(key)
    userdata = UserData()


# ==============================================================================
#                           1. USER CONFIGURATION
# ==============================================================================
# --- Transkribus Credentials (from Colab Secrets) ---
# Ensure you have set these secrets in your Colab environment (left sidebar, key icon)
TRANKRIBUS_USERNAME = userdata.get('TRANKRIBUS_USER')
TRANKRIBUS_PASSWORD = userdata.get('TRANKRIBUS_PASSWORD')
ANTHROPIC_API_KEY = userdata.get('anthropic_key')

# --- Document to Process ---
# Collection and document IDs of the Transkribus document to process
COLLECTION_ID = 2093054  # <<< --- YOUR COLLECTION ID --- >>>
DOCUMENT_ID = 9343917  # <<< --- THE SPECIFIC DOCUMENT ID YOU WANT TO PROCESS --- >>>

# --- Output Configuration ---
# Choose your desired output format: 'json', 'wiki', or 'markdown'
OUTPUT_FORMAT = 'wiki'  # <<< --- SET YOUR DESIRED OUTPUT FORMAT HERE --- >>>

# --- LLM Configuration ---
ANTHROPIC_MODEL_NAME = "claude-3-7-sonnet-20250219" # Pinned Claude 3.7 Sonnet snapshot
LLM_TEMPERATURE = 0.1      # Low temperature for deterministic, structured output

# --- File/Directory Configuration ---
TEMP_XML_FILENAME = "downloaded_page.xml" # Temporary file to store downloaded XML.


# ==============================================================================
#                       2. SYSTEM PROMPTS FOR THE LLM
# ==============================================================================

# This prompt is used to split the full page text into individual cases.
CASE_SEGMENTATION_PROMPT = """
You are an expert archivist specializing in the English Court of Common Pleas records. Your task is to segment a full page of transcribed text into individual legal cases.

A new case typically begins with a marginal note like 'Westm'', 'Resum'', or a similar term, or it starts with a new plaintiff's name followed by a phrase like 'optulit se' or 'per attornatum suum'.

You will be given a JSON object where keys are line numbers and values are the transcribed text for that line.

Your task is to return a JSON list of objects. Each object in the list represents one case and must have two keys: 'start_line' and 'end_line', corresponding to the line numbers of that case.

Example Input:
{
  "1": "Westm'",
  "2": "Hugo Brmetson p' attorn' suu' optulit se...",
  "3": "reddit ei quadraginta solidos...",
  "4": "modo mand' q'd nichil h'ent...",
  "5": "xv dies",
  "6": "Westm'",
  "7": "Iosephus de menburu' p'sona ecl'ie...",
  "8": "q'd reddat ei sexmarcas..."
}

Example Output:
[
  {
    "start_line": 1,
    "end_line": 5
  },
  {
    "start_line": 6,
    "end_line": 8
  }
]

Now, analyze the provided text and produce the segmentation list. Output ONLY the valid JSON list and nothing else.
"""

# This prompt processes a single case to extract all required information.
CASE_PROCESSING_PROMPT = """
You are an expert in medieval Latin legal documents, a paleographer, and a historian. You will be given the abbreviated Latin text of a single legal case from the English Court of Common Pleas.

Your task is to produce a single, valid JSON object with four keys:

1.  `metadata`: An object containing extracted information about the case. It should include:
    *   `plaintiffs`: A list of plaintiff names.
    *   `defendants`: A list of defendant names.
    *   `plea_type`: A concise description of the plea (e.g., "Debt", "Trespass", "Account", "Detinue of chattels").
    *   `county`: The county or location mentioned, if any.
    *   `values_mentioned`: A list of any monetary values or damages (e.g., "40 solidos", "100 solidorum").

2.  `abbreviated_latin`: The original, unmodified abbreviated Latin text that was provided as input.

3.  `expanded_latin`: A full expansion of the abbreviated Latin. Follow standard paleographic conventions:
    *   `p' attorn' suu'` -> `per attornatum suum`
    *   `q'd` -> `quod`
    *   `&c` -> `et cetera`
    *   `d'ni R'` -> `domini Regis`
    *   `Will'm` -> `Willelmum`
    *   `uic'` -> `uicecomes`
    *   `pl'ito` -> `placito`
    *   Always use 'u' (not 'v') and 'i' (not 'j').

4.  `english_translation`: A clear, modern English translation of the fully expanded Latin text.

Ensure your entire output is a single, valid JSON object and nothing else.
"""

# ==============================================================================
#                       3. HELPER & CORE FUNCTIONS
# ==============================================================================
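
# --- Illustrative sketch (not part of the original pipeline) ---
# CASE_SEGMENTATION_PROMPT asks for a list of {'start_line', 'end_line'} objects
# and CASE_PROCESSING_PROMPT asks for one object with four keys. Checks along
# these lines could catch malformed replies before they reach the formatters.
# `validate_segments`, `missing_case_keys` and `REQUIRED_CASE_KEYS` are
# hypothetical names, and the rules enforced here (integer bounds inside the
# page, no overlapping ranges) are assumptions, not documented behavior.
from typing import Any, Dict, List

REQUIRED_CASE_KEYS = ('metadata', 'abbreviated_latin', 'expanded_latin', 'english_translation')

def validate_segments(segments: Any, max_line: int) -> List[Dict[str, int]]:
    """Returns the well-formed, non-overlapping segments, sorted by start_line."""
    if not isinstance(segments, list):
        return []
    valid = []
    for seg in segments:
        if not isinstance(seg, dict):
            continue
        start, end = seg.get('start_line'), seg.get('end_line')
        # bool is a subclass of int, so it is excluded explicitly
        if not all(isinstance(v, int) and not isinstance(v, bool) for v in (start, end)):
            continue
        if 1 <= start <= end <= max_line:
            valid.append({'start_line': start, 'end_line': end})
    valid.sort(key=lambda s: s['start_line'])
    result = []
    last_end = 0
    for seg in valid:
        # Keep a segment only if it starts after the previously kept one ends
        if seg['start_line'] > last_end:
            result.append(seg)
            last_end = seg['end_line']
    return result

def missing_case_keys(case_data: Any) -> List[str]:
    """Lists the required top-level keys absent from one processed case."""
    if not isinstance(case_data, dict):
        return list(REQUIRED_CASE_KEYS)
    return [key for key in REQUIRED_CASE_KEYS if key not in case_data]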

def get_transkribus_session() -> requests.Session:
    """Authenticates with Transkribus and returns a session object."""
    if not TRANKRIBUS_USERNAME or not TRANKRIBUS_PASSWORD:
        raise ValueError("Transkribus username and password must be provided via Colab Secrets.")

    print(f"Authenticating with Transkribus as user: {TRANKRIBUS_USERNAME}...")
    login_url = "https://transkribus.eu/TrpServer/rest/auth/login"
    session = requests.Session()
    try:
        response = session.post(login_url, data={'user': TRANKRIBUS_USERNAME, 'pw': TRANKRIBUS_PASSWORD})
        response.raise_for_status()
        if '<sessionId>' in response.text and 'JSESSIONID' in session.cookies:
            print("✅ Transkribus authentication successful.")
            return session
        else:
            raise ConnectionRefusedError(f"Transkribus login failed. Response: {response.text[:500]}")
    except requests.exceptions.RequestException as e:
        print(f"FATAL: Network error during Transkribus authentication: {e}")
        raise

def get_first_page_nr(session: requests.Session, coll_id: int, doc_id: int) -> int:
    """
    Gets the page number of the document's first page by parsing the JSON
    page list returned by the 'pages' endpoint.
    """
    print(f"Getting page details for document {doc_id}...")
    pages_url = f"https://transkribus.eu/TrpServer/rest/collections/{coll_id}/{doc_id}/pages"
    try:
        # The 'pages' endpoint returns JSON, not XML.
        response = session.get(pages_url, params={"page": 0})
        response.raise_for_status()

        # Parse the response as JSON
        data = response.json()

        page_list = []
        # The API can return a list directly or a dict containing a list
        if isinstance(data, list):
            page_list = data
        elif isinstance(data, dict) and 'trpPage' in data:
            page_list = data['trpPage']

        if page_list and len(page_list) > 0:
            first_page_info = page_list[0]
            if 'pageNr' in first_page_info:
                page_nr = int(first_page_info['pageNr'])
                print(f"✅ Found first page number: {page_nr}")
                return page_nr

        # If we get here, something is wrong with the response structure
        raise ValueError("Could not find 'pageNr' in the JSON response.")

    except (requests.exceptions.RequestException, json.JSONDecodeError, ValueError, IndexError) as e:
        # Catch a wider range of potential errors
        print(f"FATAL: Could not get page number for document {doc_id}: {e}")
        # Add a debug print to help the user if it fails again
        if 'response' in locals():
            print(f"DEBUG: Raw response from server was: {response.text[:500]}")
        raise

def download_page_xml(session: requests.Session, coll_id: int, doc_id: int, page_nr: int, output_path: str) -> bool:
    """Downloads the latest PAGE XML for a specific page."""
    print(f"Downloading latest PAGE XML for doc {doc_id}, page {page_nr}...")
    # The /text endpoint gets the latest transcript version
    xml_url = f"https://transkribus.eu/TrpServer/rest/collections/{coll_id}/{doc_id}/{page_nr}/text"
    try:
        response = session.get(xml_url)
        response.raise_for_status()
        with open(output_path, 'wb') as f:
            f.write(response.content)
        print(f"✅ PAGE XML downloaded successfully to {output_path}")
        return True
    except requests.exceptions.RequestException as e:
        print(f"FATAL: Error downloading PAGE XML: {e}")
        raise

def parse_lines_from_xml(xml_path: str) -> Dict[int, str]:
    """Parses PAGE XML to extract an ordered dictionary of line number to text."""
    print(f"Parsing XML file: {xml_path}...")
    try:
        tree = ET.parse(xml_path)
        root = tree.getroot()
        namespace = {'page': 'http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15'}
        lines_text = {}
        line_counter = 1 # Use 1-based indexing for clarity in prompts

        # Find all TextLine elements in reading order
        for text_line in root.findall('.//page:TextLine', namespace):
            text_equiv = text_line.find('page:TextEquiv/page:Unicode', namespace)
            if text_equiv is not None and text_equiv.text:
                lines_text[line_counter] = text_equiv.text.strip()
            else:
                lines_text[line_counter] = "" # Handle empty lines
            line_counter += 1

        print(f"✅ Successfully parsed XML. Found {len(lines_text)} text lines.")
        return lines_text
    except (ET.ParseError, ValueError) as e:
        print(f"FATAL: Error parsing XML file: {e}")
        raise
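
# --- Illustrative sketch (not part of the original pipeline) ---
# parse_lines_from_xml above hard-codes the 2013-07-15 PAGE namespace. If a file
# declared a different PAGE schema version, the namespaced findall() would match
# no TextLine elements. One possible workaround, shown here as an assumption
# rather than a required change, is to read the namespace from the root tag.
# `detect_page_namespace` is a hypothetical helper name.
import xml.etree.ElementTree as ET

def detect_page_namespace(root: ET.Element) -> str:
    """Returns the namespace URI of the root element, or '' if it has none."""
    # ElementTree stores namespaced tags as '{uri}LocalName'
    if root.tag.startswith('{'):
        return root.tag[1:root.tag.index('}')]
    return ''

# Possible usage inside a parser:
#   namespace = {'page': detect_page_namespace(root)}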

def call_anthropic_api(client: anthropic.Anthropic, system_prompt: str, user_prompt: str) -> Optional[Dict[str, Any]]:
    """Makes a call to the Anthropic API and returns the parsed JSON response."""
    try:
        message = client.messages.create(
            model=ANTHROPIC_MODEL_NAME,
            max_tokens=4096,
            temperature=LLM_TEMPERATURE,
            system=system_prompt,
            messages=[{"role": "user", "content": user_prompt}]
        )
        response_text = message.content[0].text

        # Clean the response to extract only the JSON part
        json_match = re.search(r'\{.*\}|\[.*\]', response_text, re.DOTALL)
        if not json_match:
            print("    ❌ ERROR: LLM did not return a JSON object or list.")
            print(f"    LLM Raw Response: {response_text}")
            return None

        cleaned_json_text = json_match.group(0)
        return json.loads(cleaned_json_text)

    except anthropic.APIError as e:
        print(f"    ❌ FATAL: Anthropic API Error: {e}")
        raise
    except json.JSONDecodeError:
        print("    ❌ ERROR: Failed to decode JSON from LLM response.")
        print(f"    LLM Raw Response: {response_text}")
        return None
    except Exception as e:
        print(f"    ❌ An unexpected error occurred during the API call: {e}")
        return None
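
# --- Illustrative sketch (not part of the original pipeline) ---
# call_anthropic_api above isolates the JSON with a greedy regex, which can
# over-capture when the reply contains extra braces or brackets after the
# payload. A different technique, json.JSONDecoder.raw_decode from the standard
# library, parses the first complete JSON value starting at a given index and
# ignores whatever follows it. `extract_first_json` is a hypothetical helper
# name; it is an alternative sketch, not the method the script uses.
import json
from typing import Any, Optional

def extract_first_json(text: str) -> Optional[Any]:
    """Returns the first JSON object or list embedded in text, or None."""
    decoder = json.JSONDecoder()
    for idx, ch in enumerate(text):
        if ch in '{[':
            try:
                value, _end = decoder.raw_decode(text, idx)
                return value
            except json.JSONDecodeError:
                # Not valid JSON at this position; try the next candidate
                continue
    return None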

def format_as_json(data: List[Dict]) -> str:
    """Formats the processed data as a JSON string."""
    return json.dumps(data, indent=2)

def format_as_wiki(data: List[Dict]) -> str:
    """Formats the processed data as Wiki Markup."""
    output = []
    for i, case in enumerate(data):
        meta = case.get('metadata', {})
        plaintiffs = ", ".join(meta.get('plaintiffs', ['N/A']))
        defendants = ", ".join(meta.get('defendants', ['N/A']))
        plea = meta.get('plea_type', 'N/A')
        values = ", ".join(meta.get('values_mentioned', ['N/A']))

        output.append(f"== Case {i+1}: {plaintiffs} v. {defendants} ==")
        output.append(f"'''Plea Type:''' {plea}")
        output.append(f"'''Values Mentioned:''' {values}\n")

        output.append('{| class="wikitable"')
        output.append('|+ Case Details')
        output.append('|-')
        output.append('! Abbreviated Latin')
        output.append('! Expanded Latin')
        output.append('! English Translation')
        output.append('|-')
        # Pipe characters in text need to be escaped for wikitables
        abbr = case.get('abbreviated_latin', '').replace('|', '{{!}}')
        expd = case.get('expanded_latin', '').replace('|', '{{!}}')
        eng = case.get('english_translation', '').replace('|', '{{!}}')
        output.append(f"| {abbr}")
        output.append(f"| {expd}")
        output.append(f"| {eng}")
        output.append('|}')
        output.append("\n")
    return "\n".join(output)

def format_as_markdown(data: List[Dict]) -> str:
    """Formats the processed data as Markdown."""
    output = []
    for i, case in enumerate(data):
        meta = case.get('metadata', {})
        plaintiffs = ", ".join(meta.get('plaintiffs', ['N/A']))
        defendants = ", ".join(meta.get('defendants', ['N/A']))
        plea = meta.get('plea_type', 'N/A')
        values = ", ".join(meta.get('values_mentioned', ['N/A']))

        output.append(f"## Case {i+1}: {plaintiffs} v. {defendants}")
        output.append(f"**Plea Type:** {plea}")
        output.append(f"**Values Mentioned:** {values}\n")

        output.append("### Abbreviated Latin")
        output.append(f"> {case.get('abbreviated_latin', '')}\n")

        output.append("### Expanded Latin")
        output.append(f"> {case.get('expanded_latin', '')}\n")

        output.append("### English Translation")
        output.append(f"> {case.get('english_translation', '')}\n")
        output.append("---")
    return "\n".join(output)


# ==============================================================================
#                           4. MAIN EXECUTION WORKFLOW
# ==============================================================================

def main():
    """The main function to run the end-to-end extraction process."""
    start_time = time.time()
    print("==============================================================")
    print("        STARTING TRANSKRIBUS LEGAL CASE EXTRACTOR")
    print(f"        Document ID: {DOCUMENT_ID} in Collection: {COLLECTION_ID}")
    print("==============================================================")

    # --- Initialize API Clients ---
    try:
        if not ANTHROPIC_API_KEY or "YOUR_" in ANTHROPIC_API_KEY:
            raise ValueError("Anthropic API Key is not set. Please configure it in Colab Secrets.")
        if not TRANKRIBUS_USERNAME or "YOUR_" in TRANKRIBUS_USERNAME:
            raise ValueError("Transkribus credentials are not set. Please configure them in Colab Secrets.")

        anthropic_client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY)
        print("✅ Anthropic client initialized.")

        session = get_transkribus_session()
    except (ValueError, ConnectionRefusedError, requests.exceptions.RequestException) as e:
        print(f"\n--- SCRIPT HALTED: Could not initialize APIs. Error: {e} ---")
        return

    try:
        # --- Step 1: Get Page Info & Download XML ---
        print("\n--- [Step 1/4] Fetching Document Data from Transkribus ---")
        page_nr = get_first_page_nr(session, COLLECTION_ID, DOCUMENT_ID)
        download_page_xml(session, COLLECTION_ID, DOCUMENT_ID, page_nr, TEMP_XML_FILENAME)

        # --- Step 2: Parse XML and Segment into Cases ---
        print("\n--- [Step 2/4] Parsing XML and Segmenting into Cases ---")
        all_lines = parse_lines_from_xml(TEMP_XML_FILENAME)
        if not all_lines:
            print("\n--- SCRIPT HALTED: The downloaded XML contains 0 text lines. ---")
            return

        print("Asking LLM to segment the page into cases...")
        case_segments = call_anthropic_api(
            anthropic_client,
            CASE_SEGMENTATION_PROMPT,
            json.dumps(all_lines, indent=2)
        )

        if not case_segments or not isinstance(case_segments, list):
            print("\n--- SCRIPT HALTED: Could not segment text into cases. ---")
            return

        print(f"✅ LLM identified {len(case_segments)} potential cases.")

        # --- Step 3: Process Each Case ---
        print("\n--- [Step 3/4] Processing Each Case with LLM ---")
        processed_cases = []
        for i, segment in enumerate(case_segments):
            start = segment.get('start_line')
            end = segment.get('end_line')
            print(f"  Processing Case {i+1}/{len(case_segments)} (lines {start}-{end})...")

            # Bounds must be integers; range() below would raise TypeError otherwise
            if not isinstance(start, int) or not isinstance(end, int):
                print(f"    ⚠️ Skipping invalid segment: {segment}")
                continue

            # Collate the text for the current case
            case_lines = [all_lines[j] for j in range(start, end + 1) if j in all_lines]
            case_text = " ".join(case_lines)

            if not case_text.strip():
                print("    ⚠️ Skipping empty case.")
                continue

            # Call LLM to process this single case
            case_data = call_anthropic_api(
                anthropic_client,
                CASE_PROCESSING_PROMPT,
                case_text
            )

            if case_data:
                processed_cases.append(case_data)
                print(f"    ✅ Successfully processed Case {i+1}.")
            else:
                print(f"    ❌ Failed to process Case {i+1}.")
            time.sleep(1) # Be polite to the API

        # --- Step 4: Format and Print Output ---
        print(f"\n--- [Step 4/4] Formatting Output as '{OUTPUT_FORMAT.upper()}' ---")
        if not processed_cases:
            print("\n--- No cases were successfully processed. Final output is empty. ---")
            return

        final_output = ""
        if OUTPUT_FORMAT.lower() == 'json':
            final_output = format_as_json(processed_cases)
        elif OUTPUT_FORMAT.lower() == 'wiki':
            final_output = format_as_wiki(processed_cases)
        elif OUTPUT_FORMAT.lower() == 'markdown':
            final_output = format_as_markdown(processed_cases)
        else:
            print(f"  ❌ ERROR: Unknown output format '{OUTPUT_FORMAT}'. Defaulting to JSON.")
            final_output = format_as_json(processed_cases)

        print("\n" + "="*25 + " FINAL OUTPUT " + "="*25 + "\n")
        print(final_output)
        print("\n" + "="*64 + "\n")


    except (ValueError, ConnectionRefusedError, requests.exceptions.RequestException, anthropic.APIError, ET.ParseError, IOError) as e:
        print(f"\n--- SCRIPT HALTED DUE TO A CRITICAL ERROR ---")
        print(f"Error: {e}")
    finally:
        # --- Cleanup ---
        print("--- Cleaning up temporary files... ---")
        if os.path.exists(TEMP_XML_FILENAME):
            os.remove(TEMP_XML_FILENAME)
            print(f"Removed {TEMP_XML_FILENAME}")

        end_time = time.time()
        print("\n==============================================================")
        print("            PROCESS COMPLETE")
        print(f"            Total execution time: {end_time - start_time:.2f} seconds.")
        print("==============================================================")


# --- Execute the main function when the script is run ---
if __name__ == "__main__":
    main()

Installing required libraries for Google Colab...
Installation complete.
        STARTING TRANSKRIBUS LEGAL CASE EXTRACTOR
        Document ID: 9343917 in Collection: 2093054
✅ Anthropic client initialized.
Authenticating with Transkribus as user: isomjd@gmail.com...
✅ Transkribus authentication successful.

--- [Step 1/4] Fetching Document Data from Transkribus ---
Getting page details for document 9343917...
✅ Found first page number: 1
Downloading latest PAGE XML for doc 9343917, page 1...
✅ PAGE XML downloaded successfully to downloaded_page.xml

--- [Step 2/4] Parsing XML and Segmenting into Cases ---
Parsing XML file: downloaded_page.xml...
✅ Successfully parsed XML. Found 60 text lines.
Asking LLM to segment the page into cases...
✅ LLM identified 12 potential cases.

--- [Step 3/4] Processing Each Case with LLM ---
  Processing Case 1/12 (lines 1-5)...
    ✅ Successfully processed Case 1.
  Processing Case 2/12 (lines 6-9)...
    ✅ Successfully processed Case 2.