## Claude Requirements Extraction
This script develops a structure for sending large amounts of text/content to LLM APIs through multiple calls. The current approach takes in all JSON files from the Plan Net IG and key narrative information in markdown form (previously extracted from HTML files). The script then analyzes each type of information in batches and creates a meta-list of all requirements extracted from those documents. The goal is to determine whether this approach can produce all the technical information, at an appropriate level of detail, that an LLM would need to help design a test kit for a given IG.

First attempts: We were able to run the script end to end using the Claude API with all JSON and markdown content. The run took over 93 minutes. The output requirements list is saved in processed_output/test_requirements_claude1.json and .csv.

In progress:
- Adding images back in and revising the prompts based on the Inferno requirements extraction process documentation
- Comparing LLM results
- Reviewing LangChain capabilities to improve document loading and summary quality

### Script Organization:
1. Imports and Basic Setup
2. API Configuration
3. File Processing
4. Rate Limiting
5. Batch Processing/API Call Functions
6. Main Execution

### Setup

In [28]:
# 1. IMPORTS AND BASIC SETUP
import base64
import json
import logging
from typing import List, Dict, Tuple, Union, Optional, Any
from dataclasses import dataclass
import os
import time
import threading
from IPython.display import Image
import math
import io
import re
import pandas as pd
from json_repair import repair_json
from langchain_community.document_loaders import BSHTMLLoader
import shutil
from dotenv import load_dotenv
import httpx
from collections import defaultdict
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from tenacity import retry, wait_exponential, stop_after_attempt, retry_if_exception_type
from anthropic import Anthropic, RateLimitError
import google.generativeai as gemini
from openai import OpenAI


# Basic setup
logging.basicConfig(level=logging.INFO)
load_dotenv()

# Constants
CERT_PATH = '/opt/homebrew/etc/openssl@3/cert.pem'

### API configuration

In [None]:
API_CONFIGS = {
    "claude": {
        "model_name": "claude-3-5-sonnet-20240620",
        "max_tokens": 8192,
        "temperature": 0.7,
        "batch_size": 3,
        "delay_between_chunks": 2,
        "delay_between_batches": 10,
        "requests_per_minute": 25,
        "max_requests_per_day": 5000,
        "delay_between_requests": 2
    }
}

SYSTEM_PROMPTS = {
    "claude": (
        "You are a seasoned Healthcare Integration Test Engineer "
        "analyzing a FHIR Implementation Guide to extract precise testable requirements."
    )
}

### Pulling in files of interest
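Source the JSON files from the IG by copying them from the `full-ig/site` directory into the `full-ig/json_only` folder, skipping alternate serializations (`.ttl.json`, `.jsonld.json`, `.xml.json`) and change-history files.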

In [None]:
def copy_json_files(source_folder='full-ig/site', destination_folder='full-ig/json_only'):
    """Copy and filter relevant JSON files"""
    if not os.path.exists(destination_folder):
        os.makedirs(destination_folder)

    json_files = []
    for file_name in os.listdir(source_folder):
        if (file_name.endswith('.json') and 
            not any(file_name.endswith(ext) for ext in [
                '.ttl.json', '.jsonld.json', '.xml.json', '.change.history.json'
            ])):
            json_files.append(file_name)
            shutil.copy(os.path.join(source_folder, file_name), destination_folder)
            
    logging.info(f"Copied {len(json_files)} JSON files to {destination_folder}")
    return json_files


def prepare_json_for_processing(json_file_path: str) -> Union[dict, list]:
    """Read and prepare JSON file for processing, handling UTF-8 BOM"""
    try:
        # First try reading with utf-8-sig encoding to handle BOM
        with open(json_file_path, 'r', encoding='utf-8-sig') as f:
            data = json.load(f)
            return data['entry'] if isinstance(data, dict) and 'entry' in data else data
    except (json.JSONDecodeError, UnicodeError) as e:
        # If that fails, try with regular utf-8
        try:
            with open(json_file_path, 'r', encoding='utf-8') as f:
                data = json.load(f)
                return data['entry'] if isinstance(data, dict) and 'entry' in data else data
        except (json.JSONDecodeError, UnicodeError) as e:
            # If the file is corrupted, log it and skip
            logging.error(f"Error processing {json_file_path}: {str(e)}")
            return []  # Return empty list for corrupted files
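
As a quick illustration of the two ideas in this cell, the sketch below (not part of the original notebook) applies the same suffix filter to a few made-up file names and checks that Python's `utf-8-sig` codec reads JSON both with and without a byte order mark. The helper names `is_plain_json` and `load_json_tolerating_bom` and the sample file names are assumptions introduced only for this example.

```python
import json
import os
import tempfile

# Suffixes the notebook skips; copied from copy_json_files above.
EXCLUDED_SUFFIXES = ('.ttl.json', '.jsonld.json', '.xml.json', '.change.history.json')


def is_plain_json(file_name: str) -> bool:
    """Hypothetical helper: True for .json files that are not alternate serializations."""
    return file_name.endswith('.json') and not file_name.endswith(EXCLUDED_SUFFIXES)


def load_json_tolerating_bom(path: str):
    """Read JSON with utf-8-sig, which strips a leading BOM if one is present."""
    with open(path, 'r', encoding='utf-8-sig') as f:
        return json.load(f)


# Illustrative file names only.
names = ['Bundle-example.json', 'Bundle-example.ttl.json', 'index.html']
kept = [n for n in names if is_plain_json(n)]

with tempfile.TemporaryDirectory() as tmp:
    with_bom = os.path.join(tmp, 'with_bom.json')
    without_bom = os.path.join(tmp, 'without_bom.json')
    # Write the same document twice: once with a UTF-8 BOM, once without.
    with open(with_bom, 'wb') as f:
        f.write(b'\xef\xbb\xbf{"resourceType": "Bundle"}')
    with open(without_bom, 'wb') as f:
        f.write(b'{"resourceType": "Bundle"}')
    loaded = [load_json_tolerating_bom(with_bom), load_json_tolerating_bom(without_bom)]
```

Because `utf-8-sig` already accepts input without a BOM, the plain `utf-8` retry in `prepare_json_for_processing` is unlikely to succeed where the first read failed; it mostly serves as a second attempt before the file is logged and skipped.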

### Preparing Files for LLM

Because there are so many JSON files, we cannot feed them all to an LLM at once. These functions split the combined JSON into chunks that each fit in one LLM call. Chunks are built from whole objects to maintain summarization quality: the LLM should receive all relevant information about an item together rather than in pieces, which helps it understand what it is receiving.

In [None]:
def split_json(json_data: Union[dict, list], max_size: int = 2000) -> List[list]:
    """Split JSON into chunks while maintaining object integrity"""
    if isinstance(json_data, dict):
        json_data = [json_data]
    
    chunks = []
    current_chunk = []
    current_size = 0
    
    for item in json_data:
        item_size = len(json.dumps(item))
        if item_size > max_size:
            if current_chunk:
                chunks.append(current_chunk)
            chunks.append([item])
            current_chunk = []
            current_size = 0
        elif current_size + item_size > max_size:
            chunks.append(current_chunk)
            current_chunk = [item]
            current_size = item_size
        else:
            current_chunk.append(item)
            current_size += item_size
    
    if current_chunk:
        chunks.append(current_chunk)
    return chunks

def clean_markdown(text: str) -> str:
    """Clean markdown content"""
    text = re.sub(r'\n\s*\n', '\n\n', text)
    text = re.sub(r'<!--.*?-->', '', text, flags=re.DOTALL)
    text = re.sub(r'\.{2,}', '.', text)
    text = re.sub(r'\\(.)', r'\1', text)
    text = re.sub(r'\|', ' ', text)
    text = re.sub(r'[-\s]*\n[-\s]*', '\n', text)
    return text.strip()

def split_markdown(content: str, max_size: int = 2000) -> List[str]:
    """Split markdown into manageable chunks"""
    chunks = []
    lines = content.split('\n')
    current_chunk = []
    current_size = 0
    
    for line in lines:
        line_size = len(line)
        if current_size + line_size > max_size:
            if current_chunk:
                chunks.append('\n'.join(current_chunk))
            current_chunk = [line]
            current_size = line_size
        else:
            current_chunk.append(line)
            current_size += line_size
            
    if current_chunk:
        chunks.append('\n'.join(current_chunk))
    return chunks

def consolidate_jsons(base_directory: str = 'full-ig/json_only'):
    """Consolidate related JSON files while maintaining integrity"""
    subdirs = [d for d in os.listdir(base_directory) 
              if os.path.isdir(os.path.join(base_directory, d))]
    
    for subdir in subdirs:
        folder_path = os.path.join(base_directory, subdir)
        combined_data = []
        
        for filename in os.listdir(folder_path):
            if filename.endswith('.json'):
                try:
                    with open(os.path.join(folder_path, filename), 'r') as f:
                        json_content = json.load(f)
                        if isinstance(json_content, dict) and 'entry' in json_content:
                            combined_data.extend(json_content['entry'])
                        else:
                            combined_data.append(json_content)
                except json.JSONDecodeError as e:
                    logging.error(f"Error decoding JSON from {filename}: {e}")
                    continue
        
        if combined_data:
            output_filename = f"{subdir}_combined.json"
            output_path = os.path.join(base_directory, output_filename)
            try:
                with open(output_path, 'w') as outfile:
                    json.dump({
                        "resourceType": subdir,
                        "total": len(combined_data),
                        "entry": combined_data
                    }, outfile, indent=2)
                logging.info(f"Created {output_filename} with {len(combined_data)} entries")
            except Exception as e:
                logging.error(f"Error writing {output_filename}: {e}")
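
To make the chunking behavior concrete, here is a compact, standalone sketch of the greedy packing idea behind `split_json`. It is a simplified illustration rather than the notebook's function: `pack_by_size` and the toy records are hypothetical, and the size budget is deliberately tiny so the grouping is easy to follow.

```python
import json
from typing import Any, List


def pack_by_size(items: List[Any], max_size: int) -> List[List[Any]]:
    """Greedily group items so each group's serialized size stays within max_size.

    An item larger than max_size is never split; it ends up in a chunk of its own.
    """
    chunks: List[List[Any]] = []
    current: List[Any] = []
    current_size = 0
    for item in items:
        size = len(json.dumps(item))
        # Close the current chunk when adding this item would exceed the budget.
        if current and current_size + size > max_size:
            chunks.append(current)
            current, current_size = [], 0
        current.append(item)
        current_size += size
    if current:
        chunks.append(current)
    return chunks


# Toy records standing in for IG resources (sizes chosen for illustration).
records = [{"id": "a"}, {"id": "b"}, {"id": "c" * 50}, {"id": "d"}]
chunks = pack_by_size(records, max_size=25)
chunk_lengths = [len(chunk) for chunk in chunks]
```

The two small records share a chunk, the oversized record is kept whole in its own chunk, and the last record starts a new one. Keeping objects intact trades an occasional over-budget chunk for never handing the LLM half of a resource.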

#### Defining Rate Limiting & Safe Call Functions

Because of the amount of content we send to the APIs, the prompt-chaining process needs rate limiting to avoid exceeding API quotas. This section includes a function that creates a rate limiter and a function that calls the Claude LLM with the rate limiter applied.

In [None]:
def create_rate_limiter():
    """Create a rate limiter state dictionary for all APIs"""
    return {
        api: {
            'requests': [],
            'daily_requests': 0,
            'last_reset': time.time()
        }
        for api in API_CONFIGS.keys()
    }

def check_rate_limits(rate_limiter: dict, api: str):
    """Check and wait if rate limits would be exceeded"""
    if api not in rate_limiter:
        raise ValueError(f"Unknown API: {api}")
        
    now = time.time()
    state = rate_limiter[api]
    config = API_CONFIGS[api]
    
    # Reset daily counts if needed
    day_seconds = 24 * 60 * 60
    if now - state['last_reset'] >= day_seconds:
        state['daily_requests'] = 0
        state['last_reset'] = now
    
    # Check daily limit
    if state['daily_requests'] >= config['max_requests_per_day']:
        raise Exception(f"{api} daily request limit exceeded")
    
    # Remove old requests outside the current minute
    state['requests'] = [
        req_time for req_time in state['requests']
        if now - req_time < 60
    ]
    
    # Wait if at rate limit
    if len(state['requests']) >= config['requests_per_minute']:
        sleep_time = 60 - (now - state['requests'][0])
        if sleep_time > 0:
            time.sleep(sleep_time)
        state['requests'] = state['requests'][1:]
    
    # Add minimum delay between requests
    if state['requests'] and now - state['requests'][-1] < config['delay_between_requests']:
        time.sleep(config['delay_between_requests'])
    
    # Record this request with a fresh timestamp, since we may have slept above
    state['requests'].append(time.time())
    state['daily_requests'] += 1
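
The per-minute check above is a sliding-window limiter. The sketch below isolates that idea in a small class with an injectable clock so it can be exercised without actually sleeping. `SlidingWindowLimiter` and the fake clock are assumptions made for this illustration and are not part of the notebook.

```python
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Allow at most max_calls within any window of window_seconds."""

    def __init__(self, max_calls: int, window_seconds: float,
                 clock: Callable[[], float]) -> None:
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock
        self.calls: Deque[float] = deque()

    def wait_time(self) -> float:
        """Seconds to wait before the next call is allowed (0.0 if allowed now)."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.calls and now - self.calls[0] >= self.window_seconds:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            return 0.0
        return self.window_seconds - (now - self.calls[0])

    def record(self) -> None:
        """Record a call at the current clock time."""
        self.calls.append(self.clock())


# A fake clock keeps the example instant and deterministic.
fake_now = [0.0]
limiter = SlidingWindowLimiter(max_calls=2, window_seconds=60, clock=lambda: fake_now[0])

limiter.record()                    # call at t=0
fake_now[0] = 10.0
limiter.record()                    # call at t=10
fake_now[0] = 20.0
blocked_for = limiter.wait_time()   # window is full; the oldest call expires at t=60
fake_now[0] = 60.0
free_again = limiter.wait_time()    # the t=0 call has aged out of the window
```

In real use the clock would be something like `time.monotonic`, and the caller would sleep for `wait_time()` seconds before calling `record()`.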


In [None]:
@retry(
    wait=wait_exponential(multiplier=1, min=4, max=60),
    stop=stop_after_attempt(5),
    retry=retry_if_exception_type((RateLimitError, TimeoutError))
)
def make_api_request(client, api_type: str, content: Union[str, dict, list], rate_limit_func) -> str:
    """Make rate-limited API request with retries"""
    rate_limit_func()
    
    config = API_CONFIGS[api_type]
    formatted_content = format_content_for_api(content, api_type)
    
    try:
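        if api_type == "claude":
            response = client.messages.create(
                model=config["model_name"],
                max_tokens=config["max_tokens"],
                # Apply the sampling temperature defined in API_CONFIGS
                temperature=config["temperature"],
                messages=[{
                    "role": "user", 
                    "content": formatted_content
                }],
                system=SYSTEM_PROMPTS[api_type]
            )
            return response.content[0].text

        # Fail loudly rather than implicitly returning None for an unsupported API
        raise ValueError(f"Unsupported API type: {api_type}")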
            
    except Exception as e:
        logging.error(f"Error in {api_type} API request: {str(e)}")
        raise
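
The `@retry` decorator above delegates backoff to the tenacity library. For readers unfamiliar with the pattern, the following is a hand-rolled, standard-library sketch of retrying with clamped exponential delays. It is not tenacity's implementation, and its exact delay schedule may differ from tenacity's; `call_with_backoff`, `FakeRateLimitError`, and `flaky` are hypothetical names used only here.

```python
import time
from typing import Callable, List, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(func: Callable[[], T],
                      retry_on: Tuple[Type[BaseException], ...],
                      max_attempts: int = 5,
                      base_delay: float = 1.0,
                      min_delay: float = 4.0,
                      max_delay: float = 60.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func, retrying on the given exceptions with clamped exponential delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Double the delay each attempt, clamped to [min_delay, max_delay].
            delay = max(min_delay, min(base_delay * 2 ** (attempt - 1), max_delay))
            sleep(delay)
    raise RuntimeError("max_attempts must be at least 1")


class FakeRateLimitError(Exception):
    """Stand-in for a provider's rate-limit exception."""


attempts = {"count": 0}


def flaky() -> str:
    """Fail twice with the stand-in error, then succeed."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise FakeRateLimitError("slow down")
    return "ok"


# Capture the delays instead of sleeping so the demo runs instantly.
waits: List[float] = []
result = call_with_backoff(flaky, (FakeRateLimitError,), sleep=waits.append)
```

Only the listed exception types trigger a retry; anything else propagates immediately, which mirrors the intent of `retry_if_exception_type` in the cell above.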

#### Processing Functions
These functions analyze batches of each information type (e.g., JSON, markdown, and images) from individual files to extract requirements, then send the combined requirements lists to the LLM in a single call to request one meta-summarization of the requirements. The requirements are formatted as a CSV with relevant metadata and saved as an output file for review.

In [33]:
def process_content_batch(api_type: str, contents: List[Union[str, dict]], 
                        config: dict, client, rate_limit_func) -> List[str]:
    """Process a batch of content with rate limiting"""
    results = []
    for content in contents:
        result = make_api_request(client, api_type, content, rate_limit_func)
        results.append(result)
        time.sleep(config["delay_between_chunks"])
    return results

In [34]:
def create_requirements_extraction_prompt(content: Union[str, dict, list]) -> str:
    """Create a prompt that aligns with Inferno's requirements extraction process"""
    
    return f"""Analyze this FHIR Implementation Guide content to extract precise requirements following these guidelines:

For each requirement you identify, provide:

1. REQUIREMENT TEXT
- Extract direct quotes from the source
- For compound requirements, split into atomic requirements
- Maintain context when splitting
- Use [...] for added clarifications
- Use ... for removed text
- Format using markdown syntax for code blocks, italics, etc.

2. REQUIREMENT METADATA
- Conformance Level (SHALL, SHOULD, MAY, SHOULD NOT, SHALL NOT)
- Actor(s) the requirement applies to
- Whether the requirement is conditional (True/False)
- Any sub-requirements or referenced requirements

3. SOURCE TRACEABILITY
- Note the specific section or location this requirement comes from
- For JSON content, note the specific resource type and element

When analyzing content, focus on:

a) Making requirements atomic and testable
b) Maintaining the original text while adding necessary context
c) Identifying implicit requirements for each actor
d) Distinguishing between conjunctive ("and") and disjunctive ("or") requirements
e) Capturing terminology bindings and must-support elements
f) Noting RESTful API conformance requirements
g) Identifying conditional requirements

Content to analyze:
{json.dumps(content, indent=2) if isinstance(content, (dict, list)) else content}

Format each requirement as:
```
Requirement Text: <quoted text with [...] for clarifications and ... for elisions>
Conformance: <conformance level>
Actor: <actor name(s)>
Conditional: <True/False>
Sub-Requirements: <list of referenced requirements if any>
Source: <specific location in documentation>
```"""

In [35]:
def process_llm_requirements_output(output: str) -> List[Dict]:
    """Process LLM output into standardized requirements format"""
    requirements = []
    current_req = {}
    
    # Split output into individual requirements
    req_blocks = output.split('\n\n')
    
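    for block in req_blocks:
        # The prompt asks for each requirement inside a ``` fence, so drop fence markers before matching
        block = block.replace('```', '').strip()
        if block.startswith('Requirement Text:'):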
            # Save previous requirement if it exists
            if current_req:
                requirements.append(current_req)
                current_req = {}
            
            # Parse new requirement
            lines = block.strip().split('\n')
            for line in lines:
                if ': ' in line:
                    key, value = line.split(': ', 1)
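                    # Normalize hyphens too, so "Sub-Requirements" becomes "sub_requirements" as the CSV mapping expects
                    key = key.lower().replace(' ', '_').replace('-', '_')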
                    current_req[key] = value.strip()
    
    # Add final requirement
    if current_req:
        requirements.append(current_req)
        
    return requirements

def save_requirements_to_csv(requirements: List[Dict], output_file: str):
    """Save extracted requirements to CSV format matching Inferno's structure"""
    df = pd.DataFrame(requirements)
    
    # Rename columns to match Inferno's format
    column_mapping = {
        'requirement_text': 'Requirement',
        'conformance': 'Conformance',
        'actor': 'Actor',
        'conditional': 'Conditionality',
        'source': 'URL',
        'sub_requirements': 'Sub-Requirement(s)'
    }
    
    df = df.rename(columns=column_mapping)
    
    # Add required columns if missing
    required_columns = ['Req Set', 'Id'] + list(column_mapping.values())
    for col in required_columns:
        if col not in df.columns:
            df[col] = ''
            
    # Generate sequential IDs when none were provided (missing IDs are stored as empty strings)
    if (df['Id'].isna() | (df['Id'] == '')).all():
        df['Id'] = range(1, len(df) + 1)
        
    df.to_csv(output_file, index=False)
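
To show what the parsing step is expected to do with the response format requested in the prompt, here is a standalone sketch that parses `Key: value` lines into one dictionary per requirement. It is a simplified line-based variant, not the notebook's function: `parse_requirement_blocks` is a hypothetical name and the sample response is invented for illustration, not real model output.

```python
from typing import Dict, List


def parse_requirement_blocks(output: str) -> List[Dict[str, str]]:
    """Parse 'Key: value' lines, starting a new record at each 'Requirement Text:' line."""
    requirements: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for raw_line in output.splitlines():
        line = raw_line.strip()
        # Skip blank lines and Markdown fence markers around each block.
        if not line or line.startswith("`" * 3):
            continue
        if ": " not in line:
            continue
        key, value = line.split(": ", 1)
        if key == "Requirement Text" and current:
            requirements.append(current)
            current = {}
        # Normalize "Sub-Requirements" to "sub_requirements", and so on.
        current[key.lower().replace(" ", "_").replace("-", "_")] = value.strip()
    if current:
        requirements.append(current)
    return requirements


fence = "`" * 3  # built programmatically so this example nests cleanly in Markdown
sample_output = "\n".join([
    fence,
    "Requirement Text: The server SHALL return a Bundle",
    "Conformance: SHALL",
    "Actor: Server",
    "Conditional: False",
    "Sub-Requirements: None",
    "Source: Example section",
    fence,
    "",
    fence,
    "Requirement Text: The client SHOULD retry on failure",
    "Conformance: SHOULD",
    "Actor: Client",
    fence,
])
parsed = parse_requirement_blocks(sample_output)
```

One limitation of this sketch is that continuation lines of a multi-line requirement text are dropped unless they contain `": "`, so a production parser would need to handle wrapped values.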

In [None]:
def setup_clients():
    """Initialize clients for LLM service"""
    try:
        # Claude setup
        verify_path = CERT_PATH if os.path.exists(CERT_PATH) else True
        http_client = httpx.Client(verify=verify_path, timeout=60.0)
        claude_client = Anthropic(
            api_key=os.getenv('ANTHROPIC_API_KEY'),
            http_client=http_client
        )
        
        return {
            "claude": claude_client,
        }
        
    except Exception as e:
        logging.error(f"Error setting up clients: {str(e)}")
        raise

def process_all_content(api_type: str, base_directory: str) -> Dict[str, Any]:
    """Process all content and generate requirements in Inferno format"""
    clients = setup_clients()
    client = clients[api_type]
    config = API_CONFIGS[api_type]
    rate_limiter = create_rate_limiter()
    
    def check_limits():
        check_rate_limits(rate_limiter, api_type)
    
    try:
        # Process JSON files
        json_files = copy_json_files()
        all_requirements = []
        
        for json_file in json_files:
            json_data = prepare_json_for_processing(
                os.path.join(base_directory, 'json_only', json_file)
            )
            chunks = split_json(json_data)
            
            for chunk in chunks:
                response = make_api_request(client, api_type, chunk, check_limits)
                chunk_requirements = process_llm_requirements_output(response)
                all_requirements.extend(chunk_requirements)
                time.sleep(config["delay_between_chunks"])
                
        # Process markdown files
        markdown_dir = os.path.join(base_directory, 'markdown')
        if os.path.exists(markdown_dir):
            for md_file in os.listdir(markdown_dir):
                if md_file.endswith('.md'):
                    with open(os.path.join(markdown_dir, md_file), 'r') as f:
                        content = clean_markdown(f.read())
                    chunks = split_markdown(content)
                    
                    for chunk in chunks:
                        response = make_api_request(client, api_type, chunk, check_limits)
                        chunk_requirements = process_llm_requirements_output(response)
                        all_requirements.extend(chunk_requirements)
                        time.sleep(config["delay_between_chunks"])
        
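        # Save requirements to CSV (the .json file with the same stem is written in the execution cell)
        output_directory = 'processed_output'
        os.makedirs(output_directory, exist_ok=True)
        output_file = os.path.join(output_directory, f"test_requirements_{api_type}.csv")
        save_requirements_to_csv(all_requirements, output_file)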
        
        return {
            "requirements": all_requirements,
            "output_file": output_file
        }
        
    except Exception as e:
        logging.error(f"Error processing content: {str(e)}")
        raise

In [None]:
def format_content_for_api(content: Union[str, dict, list], api_type: str) -> Union[str, List[Dict[str, str]]]:
    """Format content for each API: a list of content blocks for Claude, plain text otherwise"""
    
    # Create the base requirements extraction prompt
    base_prompt = create_requirements_extraction_prompt(content)
    
    if api_type == "claude":
        return [{
            "type": "text",
            "text": base_prompt
        }]
    
    # For other APIs, return just the text
    return base_prompt

### Execution

In [None]:
# Define input and output directories
base_directory = 'full-ig'
output_directory = 'processed_output'

# Create output directory if it doesn't exist
os.makedirs(output_directory, exist_ok=True)

# Process with each API
apis = ["claude"]
results = {}

for api_type in apis:
    logging.info(f"Processing with {api_type}...")
    results[api_type] = process_all_content(api_type, base_directory)
    
    # Save results
    output_file = os.path.join(output_directory, f"test_requirements_{api_type}.json")
    with open(output_file, 'w') as f:
        json.dump(results[api_type], f, indent=2)
    logging.info(f"Saved {api_type} results to {output_file}")
    

INFO:root:Processing with claude...
INFO:root:Copied 166 JSON files to full-ig/json_only
INFO:httpx:HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.anthropic