## FHIR Implementation Guide Requirements Extractor

This notebook extracts testable requirements from FHIR Implementation Guides and formats them according to INCOSE Systems Engineering standards.

#### Features
- Processes markdown files from FHIR Implementation Guides
- Extracts clear, testable requirements with proper attribution
- Formats requirements in standardized INCOSE format
- Handles large documents through chunking

#### Usage
1. Set the input markdown directory to `full-ig/markdown7_cleaned` for a limited set of 7 markdown files, or to `full-ig/markdown_cleaned` for the full set of 300+ markdown files
2. The certificate setup in the `setup_clients()` function may need to be adapted to your environment
3. Run all cells in this notebook
4. When prompted, enter IG name and version (or accept defaults)
5. Select the LLM engine to use (Gemini is currently the most consistent)
6. The script will generate two files in the `reqs_extraction/processed_output` directory:
   - Complete SRS document with all requirements
   - Clean requirements list formatted to INCOSE standards

#### Notes
- Supports Claude, Gemini, or GPT-4
- API keys should be stored in a `.env` file in the project root
- GPT-4 works well for smaller IGs but may hit token limits with large documents
- Claude has larger context but may experience server load issues
- Gemini offers a good balance of context size and availability
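
Before running the notebook it can help to confirm that a key is present for the engine you plan to select. The sketch below is illustrative only: the variable names in `ASSUMED_KEY_NAMES` are assumptions (this section does not show which names `setup_clients()` reads), and `missing_api_keys` is a hypothetical helper.

```python
import os
from typing import Dict, List, Mapping

# Hypothetical variable names: adjust them to whatever your .env file
# and setup_clients() actually use.
ASSUMED_KEY_NAMES: Dict[str, str] = {
    "claude": "ANTHROPIC_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "gpt": "OPENAI_API_KEY",
}


def missing_api_keys(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the engines whose assumed key variable is unset or empty."""
    return [
        engine
        for engine, name in ASSUMED_KEY_NAMES.items()
        if not env.get(name, "").strip()
    ]


# Example with an in-memory mapping instead of the real environment
print(missing_api_keys({"GEMINI_API_KEY": "dummy-value"}))
```

With only the Gemini variable set, the helper reports the other two engines as unavailable.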

### Inputs

In [116]:
import os
import logging
from typing import List, Dict, Union, Optional, Any
import time
import json
from datetime import datetime
import re
import pandas as pd
from dotenv import load_dotenv
import httpx
from tenacity import retry, wait_exponential, stop_after_attempt, retry_if_exception_type
from anthropic import Anthropic, RateLimitError
import google.generativeai as gemini
from openai import OpenAI
from pathlib import Path


In [117]:
# Get the current working directory and set up paths
PROJECT_ROOT = Path.cwd().parent  # Go up one level from reqs_extraction to onclaive root
MARKDOWN_DIR = os.path.join(PROJECT_ROOT, 'full-ig', 'markdown7_cleaned')

# Add debug logging
logging.basicConfig(level=logging.INFO)
logging.info(f"Current working directory: {Path.cwd()}")
logging.info(f"Project root: {PROJECT_ROOT}")
logging.info(f"Markdown directory: {MARKDOWN_DIR}")

# Verify the markdown directory exists
if os.path.exists(MARKDOWN_DIR):
    logging.info(f"Found markdown directory at {MARKDOWN_DIR}")
    markdown_files = [f for f in os.listdir(MARKDOWN_DIR) if f.endswith('.md')]
    logging.info(f"Found {len(markdown_files)} markdown files")
else:
    logging.error(f"Markdown directory not found at {MARKDOWN_DIR}")

# Basic setup
load_dotenv(os.path.join(PROJECT_ROOT, '.env'))

INFO:root:Current working directory: /Users/ceadams/Documents/onclaive/onclaive/reqs_extraction
INFO:root:Project root: /Users/ceadams/Documents/onclaive/onclaive
INFO:root:Markdown directory: /Users/ceadams/Documents/onclaive/onclaive/full-ig/markdown7_cleaned
INFO:root:Found markdown directory at /Users/ceadams/Documents/onclaive/onclaive/full-ig/markdown7_cleaned
INFO:root:Found 7 markdown files


True

### API Configuration

In [118]:
# API Configuration
API_CONFIGS = {
    "claude": {
        "model_name": "claude-3-5-sonnet-20241022",
        "max_tokens": 8192,
        "temperature": 0.7,
        "batch_size": 5,
        "delay_between_chunks": 1,
        "delay_between_batches": 3,
        "requests_per_minute": 900,
        "max_requests_per_day": 20000,
        "delay_between_requests": 0.1
    },
    "gemini": {
        "model": "models/gemini-1.5-pro-001",
        "max_tokens": 8192,
        "temperature": 0.7,
        "batch_size": 5,  # More conservative batch size
        "delay_between_chunks": 2,
        "delay_between_batches": 5,
        "requests_per_minute": 900,
        "max_requests_per_day": 50000,
        "delay_between_requests": 0.1,
        "timeout": 60  # Longer timeout for larger content
    },
    "gpt": {
        "model": "gpt-4o",
        "max_tokens": 8192,
        "temperature": 0.7,
        "batch_size": 5,
        "delay_between_chunks": 2,
        "delay_between_batches": 5,
        "requests_per_minute": 450,
        "max_requests_per_day": 20000,
        "delay_between_requests": 0.15
    }
}

# Updated system prompts to produce INCOSE-style output
SYSTEM_PROMPTS = {
    "claude": """You are a seasoned Healthcare Integration Test Engineer with expertise in INCOSE Systems Engineering standards, 
    analyzing a FHIR Implementation Guide to extract precise testable requirements in INCOSE format.""",
    "gemini": """You are a Healthcare Integration Test Engineer with expertise in INCOSE Systems Engineering standards, analyzing FHIR 
    Implementation Guide content to identify and format testable requirements following INCOSE specifications.""",
    "gpt": """As a Healthcare Integration Test Engineer with INCOSE Systems Engineering expertise, analyze this FHIR 
    Implementation Guide content to extract specific testable requirements in INCOSE-compliant format."""
}


### Obtaining and chunking markdown files

In [119]:
def list_markdown_files(markdown_dir: str) -> List[str]:
    """Debug helper: log and return the markdown files in a directory"""
    if not os.path.exists(markdown_dir):
        logging.error(f"Directory does not exist: {markdown_dir}")
        return []  # keep the return type consistent for callers
    
    files = [f for f in os.listdir(markdown_dir) if f.endswith('.md')]
    logging.info(f"Found {len(files)} markdown files:")
    for file in files:
        logging.info(f"  - {file}")
    return files

In [120]:
def calculate_optimal_chunk_size(api_type: str, markdown_content: str) -> int:
    """
    Calculate the optimal chunk size based on API type and content characteristics.
    """
    # Base chunk sizes in characters (they are compared against len() of the text),
    # chosen conservatively relative to each API's context window
    base_chunk_sizes = {
        "claude": 8000,  # Claude has a larger context window
        "gemini": 7000,  # Gemini is also capable of handling larger chunks
        "gpt": 3000      # GPT-4 with smaller context
    }
    
    # Start with the base size for the API
    optimal_size = base_chunk_sizes[api_type]
    
    # Adjust based on content characteristics
    content_length = len(markdown_content)
    
    # For very small content, don't chunk at all
    if content_length <= optimal_size / 2:
        return content_length
    
    # For medium content, use the base size
    if content_length <= optimal_size * 1.5:
        return optimal_size
    
    # For larger content, adjust based on complexity 
    code_blocks = markdown_content.count("```")
    tables = markdown_content.count("|")
    
    # Adjust down if content has complex structures
    complexity_factor = 1.0
    if code_blocks > 5:
        complexity_factor *= 0.9
    if tables > 10:
        complexity_factor *= 0.9
    
    # Make sure we don't exceed API token limits
    return min(int(optimal_size * complexity_factor), base_chunk_sizes[api_type])
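
The three sizing tiers are easier to see with concrete numbers. The following is a condensed restatement for illustration, not a replacement for the function above; `sketch_chunk_size` is a hypothetical name, and sizes are measured in characters.

```python
from typing import Dict

# Base sizes copied from calculate_optimal_chunk_size (characters, not tokens)
BASE_CHUNK_SIZES: Dict[str, int] = {"claude": 8000, "gemini": 7000, "gpt": 3000}
FENCE = "`" * 3  # markdown code-fence marker


def sketch_chunk_size(api_type: str, text: str) -> int:
    """Condensed restatement of the sizing tiers, for illustration only."""
    base = BASE_CHUNK_SIZES[api_type]
    length = len(text)
    if length <= base / 2:       # small: the whole text becomes one chunk
        return length
    if length <= base * 1.5:     # medium: use the base size unchanged
        return base
    factor = 1.0                 # large: shrink for code blocks and tables
    if text.count(FENCE) > 5:
        factor *= 0.9
    if text.count("|") > 10:
        factor *= 0.9
    return int(base * factor)


small = "x" * 1000
medium = "x" * 4000
table_heavy = "| a | b |\n" * 1000   # 10,000 characters with many pipes

print(sketch_chunk_size("gpt", small))        # whole text
print(sketch_chunk_size("gpt", medium))       # base size
print(sketch_chunk_size("gpt", table_heavy))  # base size reduced by 10%
```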

In [121]:
# Markdown Processing Functions
def clean_markdown(text: str) -> str:
    """Clean markdown content"""
    text = re.sub(r'\n\s*\n', '\n\n', text)
    text = re.sub(r'<!--.*?-->', '', text, flags=re.DOTALL)
    text = re.sub(r'\.{2,}', '.', text)
    text = re.sub(r'\\(.)', r'\1', text)
    text = re.sub(r'\|', ' ', text)
    text = re.sub(r'[-\s]*\n[-\s]*', '\n', text)
    return text.strip()

def split_markdown_dynamic(content: str, api_type: str) -> List[str]:
    """
    Split markdown into dynamically sized chunks based on API type and content.
    """
    # If content is very small, don't split it
    if len(content) < 1000:
        return [content]
    
    # Calculate optimal chunk size
    max_size = calculate_optimal_chunk_size(api_type, content)
    
    chunks = []
    lines = content.split('\n')
    current_chunk = []
    current_size = 0
    
    # Try to split at meaningful boundaries like headers or blank lines
    for i, line in enumerate(lines):
        line_size = len(line)
        
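        # Flush only when the chunk already has content, so an oversized first line
        # cannot produce an empty chunk
        if current_chunk and current_size + line_size > max_size: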
            # Look back for a good splitting point (blank line or header)
            split_index = find_good_split_point(current_chunk)
            
            if split_index > 0:
                # Split at the good point
                first_part = current_chunk[:split_index]
                second_part = current_chunk[split_index:]
                chunks.append('\n'.join(first_part))
                current_chunk = second_part
                current_size = sum(len(l) for l in second_part)
            else:
                # If no good splitting point, use the current chunk
                chunks.append('\n'.join(current_chunk))
                current_chunk = []
                current_size = 0
            
            # Add the current line to the new chunk
            current_chunk.append(line)
            current_size += line_size
        else:
            current_chunk.append(line)
            current_size += line_size
    
    # Add the last chunk if there's anything left
    if current_chunk:
        chunks.append('\n'.join(current_chunk))
    
    return chunks

def find_good_split_point(lines: List[str]) -> int:
    """
    Find a good place to split a chunk, preferring blank lines or headers.
    """
    # Go backwards from the end to find a natural splitting point
    for i in range(len(lines) - 1, 0, -1):
        # Prefer blank lines
        if lines[i].strip() == '':
            return i + 1
        
        # Or headers
        if lines[i].startswith('#') or lines[i].startswith('==') or lines[i].startswith('--'):
            return i
    
    # No natural boundary found: fall back to splitting at the midpoint
    return len(lines) // 2

def should_combine_files(files: List[str], markdown_dir: str, api_type: str) -> List[List[str]]:
    """
    Group small files so they can be processed together.

    Returns a list of groups (each a list of file names); large files get a group of their own.
    """
    file_sizes = {}
    
    # Get the size (in characters) of each file
    for file in files:
        file_path = os.path.join(markdown_dir, file)
        with open(file_path, 'r', encoding='utf-8') as f:
            content = f.read()
            file_sizes[file] = len(content)
    
    # Estimate the optimal size based on API
    optimal_sizes = {
        "claude": 12000,
        "gemini": 10000,
        "gpt": 6000
    }
    
    optimal_size = optimal_sizes[api_type]
    combined_files = []
    current_group = []
    current_size = 0
    
    # Sort files by size (ascending) to try combining smaller files first
    sorted_files = sorted(files, key=lambda f: file_sizes[f])
    
    for file in sorted_files:
        size = file_sizes[file]
        
        # If this file is already big, process it individually
        if size > optimal_size * 0.8:
            if current_group:
                combined_files.append(current_group)
                current_group = []
                current_size = 0
            combined_files.append([file])
            continue
        
        # If adding this file would exceed optimal size, start a new group
        if current_size + size > optimal_size:
            if current_group:
                combined_files.append(current_group)
            current_group = [file]
            current_size = size
        else:
            current_group.append(file)
            current_size += size
    
    # Add the last group if there's anything left
    if current_group:
        combined_files.append(current_group)
    
    return combined_files
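
The grouping logic in `should_combine_files` can be traced without touching the file system. A minimal sketch, assuming file sizes are already known; `group_by_size` is a hypothetical stand-in that mirrors the same greedy rules on an in-memory dictionary.

```python
from typing import Dict, List


def group_by_size(sizes: Dict[str, int], optimal_size: int) -> List[List[str]]:
    """Greedy grouping that mirrors should_combine_files on in-memory sizes."""
    groups: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name in sorted(sizes, key=lambda n: sizes[n]):   # smallest files first
        size = sizes[name]
        if size > optimal_size * 0.8:                    # large file: own group
            if current:
                groups.append(current)
                current, current_size = [], 0
            groups.append([name])
        elif current_size + size > optimal_size:         # group full: start a new one
            if current:
                groups.append(current)
            current, current_size = [name], size
        else:                                            # still fits: keep adding
            current.append(name)
            current_size += size
    if current:
        groups.append(current)
    return groups


sizes = {"a.md": 2000, "b.md": 3000, "c.md": 4000, "d.md": 9000}
groups = group_by_size(sizes, optimal_size=6000)
print(groups)
```

With the `gpt` target of 6000 characters, `a.md` and `b.md` share a group, `c.md` no longer fits alongside them, and `d.md` exceeds 80% of the target so it is processed alone.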

### Rate Limiting

In [122]:
# Rate Limiting
def create_rate_limiter():
    """Create a rate limiter state dictionary for all APIs"""
    return {
        api: {
            'requests': [],
            'daily_requests': 0,
            'last_reset': time.time()
        }
        for api in API_CONFIGS.keys()
    }

def check_rate_limits(rate_limiter: dict, api: str):
    """Check and wait if rate limits would be exceeded"""
    if api not in rate_limiter:
        raise ValueError(f"Unknown API: {api}")
        
    now = time.time()
    state = rate_limiter[api]
    config = API_CONFIGS[api]
    
    # Reset daily counts if needed
    day_seconds = 24 * 60 * 60
    if now - state['last_reset'] >= day_seconds:
        state['daily_requests'] = 0
        state['last_reset'] = now
    
    # Check daily limit
    if state['daily_requests'] >= config['max_requests_per_day']:
        raise Exception(f"{api} daily request limit exceeded")
    
    # Remove old requests outside the current minute
    state['requests'] = [
        req_time for req_time in state['requests']
        if now - req_time < 60
    ]
    
    # Wait if at rate limit
    if len(state['requests']) >= config['requests_per_minute']:
        sleep_time = 60 - (now - state['requests'][0])
        if sleep_time > 0:
            time.sleep(sleep_time)
            now = time.time()  # refresh so the recorded timestamp reflects the wait
        state['requests'] = state['requests'][1:]
    
    # Enforce the minimum delay between requests
    if state['requests'] and now - state['requests'][-1] < config['delay_between_requests']:
        time.sleep(config['delay_between_requests'])
        now = time.time()  # refresh so the recorded timestamp reflects the wait
    
    # Record this request
    state['requests'].append(now)
    state['daily_requests'] += 1
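
The per-minute check above is a sliding-window limiter. The sketch below isolates that idea with an injectable clock so it can be exercised without real waiting; `SlidingWindowLimiter` and `FakeClock` are hypothetical names, not part of this notebook.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Allow at most `limit` calls per `window` seconds (illustrative helper)."""

    def __init__(self, limit: int, window: float = 60.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.limit, self.window = limit, window
        self.clock, self.sleep = clock, sleep
        self.calls: Deque[float] = deque()

    def acquire(self) -> float:
        """Block until a slot is free; return the time the call was recorded."""
        now = self.clock()
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()              # drop calls outside the window
        if len(self.calls) >= self.limit:
            self.sleep(self.window - (now - self.calls[0]))
            now = self.clock()                # re-read the clock after waiting
            self.calls.popleft()
        self.calls.append(now)
        return now


class FakeClock:
    """Deterministic clock: sleeping simply advances the time."""

    def __init__(self) -> None:
        self.t = 0.0

    def now(self) -> float:
        return self.t

    def sleep(self, seconds: float) -> None:
        self.t += seconds


fake = FakeClock()
limiter = SlidingWindowLimiter(limit=2, window=60.0, clock=fake.now, sleep=fake.sleep)
stamps = [limiter.acquire() for _ in range(3)]
print(stamps)  # the third call is pushed to the end of the window
```

Re-reading the clock after sleeping matters: otherwise the recorded timestamp predates the wait and the window drifts.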

### Prompt Development

In [123]:
def create_incose_requirements_extraction_prompt(content: str, ig_name: str, ig_version: str, chunk_index: int, total_chunks: int) -> str:
    """
    Create a prompt for extracting requirements in INCOSE format
    
    Args:
        content: The content to analyze
        ig_name: Name of the Implementation Guide
        ig_version: Version of the Implementation Guide
        chunk_index: Index of this chunk in the total content
        total_chunks: Total number of chunks being processed
        
    Returns:
        str: The prompt for the LLM
    """
    return f"""As a Healthcare Integration Test Engineer with INCOSE Systems Engineering expertise, extract and format requirements from this FHIR Implementation Guide content.

# ABOUT THIS TASK
You are analyzing chunk {chunk_index} of {total_chunks} from the {ig_name} Implementation Guide v{ig_version}. Your task is to:
1. Extract specific, testable requirements
2. Format them according to INCOSE Systems Engineering standards
3. Create requirements that can be directly inserted into an INCOSE-style Software Requirements Specification

# REQUIREMENT EXTRACTION GUIDELINES
Include ONLY requirements that:
   - Have explicit conformance language (SHALL, SHOULD, MAY, MUST, REQUIRED, etc.)
   - Describe specific, verifiable behavior or capability
   - Could be objectively tested through software testing or attestation
- Each requirement must be complete, atomic, and testable
- Separate individual requirements
- Identify the actor responsible for implementing each requirement
- Preserve the original conformance level (SHALL, SHOULD, MAY, etc.)
- Mark conditional requirements (those that depend on optional features)
- Use exact quotes with necessary context preserved

# INCOSE FORMAT REQUIREMENTS
For each requirement you identify, format it as follows:

```
## REQ-[ID]

**Summary**: Brief description of the requirement
**Description**: "<exact quote with necessary [clarifications]>"
**Verification**: Recommended verification method (Test/Analysis/Inspection/Demonstration)
**Notes**: Actor responsible, conformance level, conditions, etc.
**Source**: Section reference from the Implementation Guide
```

Where [ID] starts at 1 and follows in sequential order.

# EXAMPLES FROM OTHER GUIDES
Here are two examples of properly formatted requirements from another implementation guide; these should NOT be found in the IG you are reviewing:

```
## REQ-01

**Summary**: Advertise supported subscription topics
**Description**: "In order to allow for discovery of supported subscription topics, this guide defines the CapabilityStatement SubscriptionTopic Canonical extension. The extension allows server implementers to advertise the canonical URLs of topics available to clients."
**Verification**: Test
**Notes**: Actor: Server, Conformance: SHALL, Conditional: False
**Source**: Subscription Discovery Section
```

```
## REQ-02

**Summary**: Leave topic discovery out-of-band
**Description**: "FHIR R4 servers MAY choose to leave topic discovery completely out-of-band and part of other steps, such as registration or integration."
**Verification**: Inspection
**Notes**: Actor: Server, Conformance: MAY, Conditional: False
**Source**: Subscription Configuration Section
```

Content to analyze:
{content}

Generate your INCOSE-style requirements extraction now. For each chunk, list each requirement using the format specified above. If you find no requirements in this chunk, do not add any text and move to the next chunk. Do not include any introductory or conclusion/summary comments in your response. Only include the requirements as a list.
"""

### Functions for API Request

In [124]:
def format_content_for_api(content: Union[str, dict, list], api_type: str, ig_name: str, ig_version: str, chunk_index: int, total_chunks: int) -> Union[str, List[dict], dict]:
    """Format content appropriately for each API"""
    base_prompt = create_incose_requirements_extraction_prompt(content, ig_name, ig_version, chunk_index, total_chunks)
    
    if api_type == "claude":
        return [{
            "type": "text",
            "text": base_prompt
        }]
    elif api_type == "gemini":
        return [{  # List with a single content dict; make_api_request extracts the text
            "parts": [{
                "text": base_prompt
            }]
        }]
    return base_prompt
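
Responses that follow the `## REQ-[ID]` layout requested by `create_incose_requirements_extraction_prompt` can be turned into structured records. This is a minimal sketch, assuming the model adheres to that layout; `parse_requirement` and its regular expressions are illustrative and not part of the notebook.

```python
import re
from typing import Dict

FIELD_RE = re.compile(r"^\*\*(\w+)\*\*:\s*(.*)$", re.MULTILINE)
NOTE_RE = re.compile(r"(Actor|Conformance|Conditional):\s*([^,\n]+)")


def parse_requirement(block: str) -> Dict[str, str]:
    """Parse one '## REQ-[ID]' block into a flat dict (illustrative helper)."""
    parsed: Dict[str, str] = {}
    header = re.search(r"^#{1,3}\s+(REQ-\d+)", block, re.MULTILINE)
    if header:
        parsed["id"] = header.group(1)
    for name, value in FIELD_RE.findall(block):
        parsed[name.lower()] = value.strip().strip('"')
    # The Notes line packs actor, conformance level and conditionality together
    for name, value in NOTE_RE.findall(parsed.get("notes", "")):
        parsed[name.lower()] = value.strip()
    return parsed


sample = """## REQ-02

**Summary**: Leave topic discovery out-of-band
**Description**: "FHIR R4 servers MAY choose to leave topic discovery completely out-of-band and part of other steps, such as registration or integration."
**Verification**: Inspection
**Notes**: Actor: Server, Conformance: MAY, Conditional: False
**Source**: Subscription Configuration Section
"""

req = parse_requirement(sample)
print(req["id"], req["actor"], req["conformance"], req["conditional"])
```

Values stay as strings here (for example `"False"` for the conditional flag); converting them is left to the caller.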

In [125]:
@retry(
    wait=wait_exponential(multiplier=2, min=4, max=480),  # Exponential backoff, capped at 480 s (8 minutes)
    stop=stop_after_attempt(8),  # Give up after 8 attempts
    retry=retry_if_exception_type((RateLimitError, TimeoutError, httpx.HTTPStatusError))  # Rate-limit, timeout and HTTP errors
)
def make_api_request(client, api_type: str, content: str, rate_limit_func, ig_name: str, ig_version: str, chunk_index: int, total_chunks: int) -> str:
    """Make rate-limited API request with retries"""
    rate_limit_func()
    
    config = API_CONFIGS[api_type]
    formatted_content = format_content_for_api(content, api_type, ig_name, ig_version, chunk_index, total_chunks)
    
    try:
        if api_type == "claude":
            response = client.messages.create(
                model=config["model_name"],
                max_tokens=config["max_tokens"],
                temperature=config["temperature"],  # use the configured temperature, as the other engines do
                messages=[{
                    "role": "user", 
                    "content": formatted_content
                }],
                system=SYSTEM_PROMPTS[api_type]
            )
            return response.content[0].text
            
        elif api_type == "gemini":
            # Extract the text content for Gemini
            prompt_text = formatted_content[0]["parts"][0]["text"]
            response = client.generate_content(
                prompt_text,
                generation_config={
                    "max_output_tokens": config["max_tokens"],
                    "temperature": config["temperature"]
                }
            )
            if hasattr(response, 'text'):
                return response.text
            elif response.candidates:
                return response.candidates[0].content.parts[0].text
            else:
                raise ValueError("No response generated from Gemini API")
                    
        elif api_type == "gpt":
            response = client.chat.completions.create(
                model=config["model"],
                messages=[
                    {"role": "system", "content": SYSTEM_PROMPTS[api_type]},
                    {"role": "user", "content": formatted_content}
                ],
                max_tokens=config["max_tokens"],
                temperature=config["temperature"]
            )
            return response.choices[0].message.content
            
    except Exception as e:
        logging.error(f"Error in {api_type} API request: {str(e)}")
        raise
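
To get a feel for how long the retry decorator on `make_api_request` can block, the wait schedule can be computed by hand. The sketch assumes that the wait after the n-th failed attempt is `multiplier * 2**(n - 1)`, clamped to `[min, max]`, which is my understanding of `wait_exponential` in recent tenacity releases; check it against the installed version. `backoff_schedule` is a hypothetical helper.

```python
from typing import List


def backoff_schedule(attempts: int, multiplier: float, min_wait: float,
                     max_wait: float, base: float = 2.0) -> List[float]:
    """Waits between attempts, assuming wait = multiplier * base**(n - 1)
    after the n-th failed attempt, clamped to [min_wait, max_wait]."""
    waits: List[float] = []
    for n in range(1, attempts):          # no wait follows the final attempt
        raw = multiplier * base ** (n - 1)
        waits.append(max(min_wait, min(raw, max_wait)))
    return waits


# Parameters taken from the decorator on make_api_request
schedule = backoff_schedule(attempts=8, multiplier=2, min_wait=4, max_wait=480)
print(schedule)
print(sum(schedule), "seconds of waiting in the worst case")
```

Under that assumption the 480-second cap is never reached within 8 attempts; the longest single wait is 128 seconds.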

### Processing Functions

In [126]:
def process_content_batch(api_type: str, contents: List[str], config: dict, client, 
                          rate_limit_func, ig_name: str, ig_version: str) -> List[str]:
    """Process a batch of content with rate limiting"""
    results = []
    total_chunks = len(contents)
    
    for chunk_idx, content in enumerate(contents, 1):
        result = make_api_request(client, api_type, content, rate_limit_func, 
                                  ig_name, ig_version, chunk_idx, total_chunks)
        results.append(result)
        time.sleep(config["delay_between_chunks"])
    return results

In [127]:
def create_requirements_list_prompt(srs_document: str) -> str:
    """
    Create a prompt to extract just the requirements list from the SRS document
    
    Args:
        srs_document: The full INCOSE SRS document
        
    Returns:
        str: The prompt for the LLM
    """
    return f"""Please extract just the requirements from the following INCOSE SRS document and format them as a clean list.

I want a comprehensive list with only the actual requirements - no introductory text, analysis, or explanatory content. Ensure all requirements are atomic. Separate individual requirements.
 
For each requirement, include:
1. The requirement ID (REQ-XX)
2. The summary
3. The description (quoted text)
4. Verification method
5. Actor, conformance level, and conditions
6. Source reference

Format each requirement as follows:

---
# REQ-XX
**Summary**: [summary text]
**Description**: "[description text]"
**Verification**: [method]
**Actor**: [actor]
**Conformance**: [SHALL/SHOULD/MAY/etc.]
**Conditional**: [True/False]
**Source**: [reference]
---

Do not include any other text in your response output except for the list with actual requirements. 

Here's the SRS document:

{srs_document}
"""

In [128]:
def extract_requirements_list_from_file(client, api_type, srs_file_path, rate_limit_func):
    """Extract requirements from an SRS document file with improved chunking support"""
    logging.info(f"Extracting clean requirements list from file using {api_type}...")
    
    # Maximum chunk size in characters (compared against len() below), by API type
    max_chunk_sizes = {
        "claude": 12000,  # Claude can handle larger contexts
        "gemini": 10000,  # Gemini also has decent context size
        "gpt": 3000       # GPT-4 with smaller context
    }
    max_chunk_size = max_chunk_sizes.get(api_type, 3000)
    
    # Read the file and prepare to extract requirements
    with open(srs_file_path, 'r') as f:
        content = f.read()
    
    # Find all requirements in the document.
    # The pattern accepts different heading levels (# REQ, ## REQ, ### REQ) and consumes the
    # following lines until the next requirement heading. re.DOTALL must not be used here:
    # with it, '.*' would run to the end of the document and return everything as one match.
    req_pattern = r'(?:^|\n)(?:#{1,3}\s+REQ-\d+[^\n]*(?:\n(?!#{1,3}\s+REQ-\d+).*)*)'
    requirements = re.findall(req_pattern, content)
    
    if not requirements:
        logging.warning("No requirements found in the file")
        return "No requirements found in the document."
    
    total_req_count = len(requirements)
    logging.info(f"Found {total_req_count} requirements to process in chunks")
    
    # Process requirements in manageable chunks; each requirement is kept whole
    # and assigned to exactly one chunk, so none is split at a chunk boundary
    all_results = []
    current_size = 0
    current_reqs = []
    chunk_count = 1
    
    for req in requirements:
        req_size = len(req)
        
        # If this requirement is extremely large, split it further
        if req_size > max_chunk_size:
            logging.warning(f"Found extremely large requirement ({req_size} chars) that exceeds chunk size")
            # Process it individually with special handling
            try:
                individual_result = process_large_requirement(client, api_type, req, rate_limit_func)
                all_results.append(individual_result)
            except Exception as e:
                logging.error(f"Error processing large requirement: {str(e)}")
                all_results.append(req)  # Use original as fallback
            
            time.sleep(API_CONFIGS[api_type]["delay_between_chunks"])
            continue
        
        # If adding this requirement would exceed chunk size, process current chunk
        if current_size + req_size > max_chunk_size and current_reqs:
            chunk_text = "\n\n".join(current_reqs)
            logging.info(f"Processing chunk {chunk_count} with {len(current_reqs)} requirements")
            
            try:
                result = process_requirements_chunk(
                    client, api_type, chunk_text, rate_limit_func, 
                    chunk_count, estimate_total_chunks(total_req_count, len(current_reqs))
                )
                all_results.append(result)
            except Exception as e:
                logging.error(f"Error processing chunk {chunk_count}: {str(e)}")
                # Add the raw requirements to ensure nothing is lost
                all_results.append(chunk_text)
            
            # Reset for next chunk
            current_reqs = []
            current_size = 0
            chunk_count += 1
            
            # Add delay between chunks
            time.sleep(API_CONFIGS[api_type]["delay_between_chunks"])
        
        # Add requirement to current chunk
        current_reqs.append(req)
        current_size += req_size
    
    # Process the last chunk if anything remains
    if current_reqs:
        chunk_text = "\n\n".join(current_reqs)
        logging.info(f"Processing final chunk {chunk_count} with {len(current_reqs)} requirements")
        try:
            result = process_requirements_chunk(
                client, api_type, chunk_text, rate_limit_func, 
                chunk_count, chunk_count
            )
            all_results.append(result)
        except Exception as e:
            logging.error(f"Error processing final chunk: {str(e)}")
            all_results.append(chunk_text)
    
    # Combine all processed chunks
    combined_results = "\n\n".join(all_results)
    
    # Validate the final result
    final_req_count = combined_results.count('# REQ-')
    if final_req_count < total_req_count:
        logging.warning(f"Potential requirements loss: Found {total_req_count} in source, but only {final_req_count} in processed output")
    else:
        logging.info(f"Successfully extracted all {total_req_count} requirements")
    
    return combined_results
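
Splitting an SRS document into one block per requirement is the step everything else here depends on, so it is worth checking on a small sample. A minimal sketch of line-by-line matching follows; `split_requirements` is a hypothetical helper, and the pattern is compiled with `re.MULTILINE` only, so `.` stops at line ends and the lookahead is re-checked on every line.

```python
import re
from typing import List

# No re.DOTALL here: '.' must stop at line ends so that the negative
# lookahead is evaluated at the start of every following line.
REQ_BLOCK_RE = re.compile(
    r"^#{1,3}\s+REQ-\d+[^\n]*(?:\n(?!#{1,3}\s+REQ-\d+).*)*",
    re.MULTILINE,
)


def split_requirements(srs_text: str) -> List[str]:
    """Return one string per requirement block, heading line included."""
    return [m.group(0).strip() for m in REQ_BLOCK_RE.finditer(srs_text)]


sample = (
    "# Extracted requirements\n\n"
    "## REQ-01\n\n**Summary**: First\n**Verification**: Test\n\n"
    "### REQ-02\n**Summary**: Second\n"
)

blocks = split_requirements(sample)
print(len(blocks))
for block in blocks:
    print(repr(block))
```

The document title is skipped because it is not a `REQ-` heading, and each block ends where the next requirement heading begins.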

In [129]:
@retry(
    wait=wait_exponential(multiplier=2, min=4, max=120),
    stop=stop_after_attempt(8),
    retry=retry_if_exception_type((RateLimitError, TimeoutError, httpx.HTTPStatusError))
)
def process_requirements_chunk(client, api_type, chunk_text, rate_limit_func, chunk_num, total_chunks):
    """Process a single chunk of requirements with robust retry logic and improved formatting handling"""
    logging.info(f"Processing requirements chunk {chunk_num}/{total_chunks} with {chunk_text.count('REQ-')} requirements")
    
    # Normalize format to ensure consistency
    # Replace all variations of requirement headers with a standardized format
    normalized_chunk = re.sub(r'(#{1,3})\s+(REQ-\d+)', r'# \2', chunk_text)
    
    # Build a more specific prompt that enforces formatting requirements
    prompt = f"""Please extract and format requirements from this PARTIAL list (chunk {chunk_num} of {total_chunks}).
    
    Format each requirement EXACTLY as follows, maintaining the exact structure including dashes, spacing, and headings:
    
    ---
    # REQ-XX
    **Summary**: [summary text]
    **Description**: "[description text]"
    **Verification**: [method]
    **Actor**: [actor]
    **Conformance**: [SHALL/SHOULD/MAY/etc.]
    **Conditional**: [True/False]
    **Source**: [reference]
    ---
    
    IMPORTANT FORMATTING INSTRUCTIONS:
    1. Each requirement MUST start with "---" on its own line 
    2. Each requirement MUST end with "---" on its own line
    3. The requirement ID format MUST be a single # followed by "REQ-XX" where XX is the number
    4. Each attribute MUST be in bold with two asterisks on each side
    5. Description MUST be in quotes
    6. DO NOT SKIP any requirements in the provided chunk
    7. DO NOT CHANGE the original text content of each requirement
    8. Include ALL requirements found in this chunk, up to the maximum token limit
    
    Here are the requirements to process:
    
    {normalized_chunk}
    """
    
    rate_limit_func()
    config = API_CONFIGS[api_type]
    
    try:
        if api_type == "claude":
            response = client.messages.create(
                model=config["model_name"],
                max_tokens=config["max_tokens"],
                temperature=config["temperature"],  # use the configured temperature, as the other engines do
                messages=[{
                    "role": "user", 
                    "content": prompt
                }],
                system="Extract and format requirements EXACTLY as specified, preserving all content and maintaining consistent formatting."
            )
            result = response.content[0].text
            
        elif api_type == "gemini":
            response = client.generate_content(
                prompt,
                generation_config={
                    "max_output_tokens": config["max_tokens"],
                    "temperature": config["temperature"]
                }
            )
            if hasattr(response, 'text'):
                result = response.text
            elif response.candidates:
                result = response.candidates[0].content.parts[0].text
            else:
                raise ValueError("No response generated from Gemini API")
                    
        elif api_type == "gpt":
            response = client.chat.completions.create(
                model=config["model"],
                messages=[
                    {"role": "system", "content": "Extract and format requirements EXACTLY as specified, preserving all content and maintaining consistent formatting."},
                    {"role": "user", "content": prompt}
                ],
                max_tokens=config["max_tokens"],
                temperature=config["temperature"]
            )
            result = response.choices[0].message.content
            
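        # Validate the output by counting requirement headings on both sides
        # (counting bare 'REQ-' would also count cross-references inside the text)
        req_count_in_result = result.count('# REQ-')
        req_count_in_input = normalized_chunk.count('# REQ-')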
        
        if req_count_in_result < req_count_in_input:
            logging.warning(f"Possible requirements loss: {req_count_in_input} in input, but only {req_count_in_result} in result")
            
            # Significant loss (more than 20% missing): placeholder for a reprocessing step.
            # No retry is performed here yet; extend this branch based on your needs.
            if req_count_in_result < req_count_in_input * 0.8 and api_type != "gpt":
                logging.info("Significant requirements loss detected; reprocessing is not implemented in this step")
                
        # Check for proper formatting
        if not result.lstrip().startswith('---') or '**Summary**' not in result:
            logging.warning("Response may not have proper formatting")
            
        return result
        
    except Exception as e:
        logging.error(f"Error processing chunk {chunk_num}: {str(e)}")
        raise
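
Substring counts only indicate that something may be missing. Comparing the requirement IDs found in headings also tells you which requirements were dropped and ignores cross-references inside the text. A minimal sketch, with `requirement_ids` and `missing_ids` as hypothetical helpers:

```python
import re
from typing import List, Set

REQ_HEADER_RE = re.compile(r"^#{1,3}\s+(REQ-\d+)", re.MULTILINE)


def requirement_ids(text: str) -> Set[str]:
    """Collect the IDs that appear as headings (not inline mentions)."""
    return set(REQ_HEADER_RE.findall(text))


def missing_ids(source: str, result: str) -> List[str]:
    """IDs present in the source but absent from the result, in numeric order."""
    missing = requirement_ids(source) - requirement_ids(result)
    return sorted(missing, key=lambda rid: int(rid.split("-")[1]))


source = "# REQ-01\nsee REQ-02 for details\n# REQ-02\n# REQ-10\n"
result = "---\n# REQ-01\n---\n---\n# REQ-10\n---\n"

print(missing_ids(source, result))
```

The inline mention of `REQ-02` in the first requirement is not counted as a heading, and the missing requirement is reported by ID.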

In [130]:
def process_large_requirement(client, api_type, req_text, rate_limit_func):
    """
    Special handling for extremely large requirements that exceed normal chunk sizes.
    
    Args:
        client: The API client (Claude, Gemini, or GPT)
        api_type: The type of API to use (claude, gemini, or gpt)
        req_text: The text of the large requirement
        rate_limit_func: Function to check rate limits
        
    Returns:
        str: The processed requirement text in standardized format
    """
    logging.info("Processing large requirement with special handling")
    
    # Extract the requirement ID from the text
    req_id_match = re.search(r'(#{1,3})\s+(REQ-\d+)', req_text)
    if not req_id_match:
        logging.warning("Could not identify requirement ID in large requirement")
        return req_text
    
    req_id = req_id_match.group(2)
    
    # Try to extract key sections to reduce token usage
    summary_match = re.search(r'\*\*Summary\*\*:\s*([^\n]+)', req_text)
    summary = summary_match.group(1) if summary_match else "[Extract from text]"
    
    # The description is not pre-extracted here; the model pulls it from the
    # original requirement text included at the end of the prompt.
    
    verification_match = re.search(r'\*\*Verification\*\*:\s*([^\n]+)', req_text)
    verification = verification_match.group(1) if verification_match else "[Extract from text]"
    
    actor_match = re.search(r'Actor:\s*([^,\n]+)', req_text) or re.search(r'\*\*Actor\*\*:\s*([^\n]+)', req_text)
    actor = actor_match.group(1) if actor_match else "[Extract from text]"
    
    conformance_match = re.search(r'Conformance:\s*([^,\n]+)', req_text) or re.search(r'\*\*Conformance\*\*:\s*([^\n]+)', req_text)
    conformance = conformance_match.group(1) if conformance_match else "[Extract from text]"
    
    conditional_match = re.search(r'Conditional:\s*([^,\n]+)', req_text) or re.search(r'\*\*Conditional\*\*:\s*([^\n]+)', req_text)
    conditional = conditional_match.group(1) if conditional_match else "[Extract from text]"
    
    source_match = re.search(r'Source:\s*([^\n]+)', req_text) or re.search(r'\*\*Source\*\*:\s*([^\n]+)', req_text)
    source = source_match.group(1) if source_match else "[Extract from text]"
    
    # Create a simplified prompt focused on just this requirement
    prompt = f"""Format the following large requirement with ID {req_id} using the exact format below:
    
    ---
    # {req_id}
    **Summary**: {summary}
    **Description**: "[Extract and preserve the original description text]"
    **Verification**: {verification}
    **Actor**: {actor}
    **Conformance**: {conformance}
    **Conditional**: {conditional}
    **Source**: {source}
    ---
    
    IMPORTANT: 
    1. The requirement MUST start and end with "---" on its own line
    2. Maintain the exact formatting with the # and ** markers as shown
    3. Preserve the exact content from the original requirement
    4. The Description must be in quotes
    
    Here is the original requirement text to extract information from:
    
    {req_text}
    """
    
    rate_limit_func()
    config = API_CONFIGS[api_type]
    
    try:
        if api_type == "claude":
            response = client.messages.create(
                model=config["model_name"],
                max_tokens=config["max_tokens"],
                messages=[{
                    "role": "user", 
                    "content": prompt
                }],
                system="Extract and format this single requirement exactly as specified, preserving all content and maintaining consistent formatting."
            )
            result = response.content[0].text
            
        elif api_type == "gemini":
            response = client.generate_content(
                prompt,
                generation_config={
                    "max_output_tokens": config["max_tokens"],
                    "temperature": config["temperature"]
                }
            )
            if hasattr(response, 'text'):
                result = response.text
            elif response.candidates:
                result = response.candidates[0].content.parts[0].text
            else:
                raise ValueError("No response generated from Gemini API")
                    
        elif api_type == "gpt":
            response = client.chat.completions.create(
                model=config["model"],
                messages=[
                    {"role": "system", "content": "Extract and format this single requirement exactly as specified, preserving all content and maintaining consistent formatting."},
                    {"role": "user", "content": prompt}
                ],
                max_tokens=config["max_tokens"],
                temperature=config["temperature"]
            )
            result = response.choices[0].message.content
        
        # Verify the output contains the requirement ID
        if req_id not in result:
            logging.warning(f"Processed requirement missing ID {req_id}, using original text")
            return req_text
            
        # Verify proper formatting
        if not result.startswith('---') or '**Summary**' not in result:
            logging.warning("Processed requirement has incorrect formatting, using original text")
            return req_text
            
        logging.info(f"Successfully processed large requirement {req_id}")
        return result
        
    except Exception as e:
        logging.error(f"Error processing large requirement {req_id}: {str(e)}")
        # Return the original text in case of failure to ensure no data is lost
        return req_text
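
The field extraction in `process_large_requirement` repeats one pattern per field: search for a labelled value and fall back to a placeholder. A minimal sketch of that pattern as a single helper is shown below. The name `extract_field` is hypothetical, and the sketch reads each value to the end of the line for both the bold (`**Actor**:`) and plain (`Actor:`) label forms, which differs slightly from the cell above, where the plain form stops at the first comma.

```python
import re

PLACEHOLDER = "[Extract from text]"


def extract_field(label: str, text: str, placeholder: str = PLACEHOLDER) -> str:
    """Return the value following '**Label**:' or 'Label:' on the same line,
    or the placeholder when the label is not present."""
    escaped = re.escape(label)
    pattern = rf"(?:\*\*{escaped}\*\*|\b{escaped}):\s*([^\n]+)"
    match = re.search(pattern, text)
    return match.group(1).strip() if match else placeholder


# Example: collect every field the prompt template needs in one pass
sample = "# REQ-07\n**Summary**: Support search\n**Actor**: Server\nConformance: SHALL\n"
fields = {
    label: extract_field(label, sample)
    for label in ("Summary", "Verification", "Actor", "Conformance", "Conditional", "Source")
}
```

Fields missing from the text keep the placeholder, so the prompt still asks the model to fill them in from the original requirement.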

In [131]:
# Main Processing Functions
def setup_clients():
    """Initialize clients for each LLM service"""
    try:
        # Claude setup
        verify_path = '/opt/homebrew/etc/openssl@3/cert.pem'
        http_client = httpx.Client(
            verify=verify_path if os.path.exists(verify_path) else True,
            timeout=60.0
        )
        claude_client = Anthropic(
            api_key=os.getenv('ANTHROPIC_API_KEY'),
            http_client=http_client
        )
        
        # Gemini setup
        gemini_api_key = os.getenv('GEMINI_API_KEY')
        if not gemini_api_key:
            raise ValueError("GEMINI_API_KEY not found")
        gemini.configure(api_key=gemini_api_key)
        gemini_client = gemini.GenerativeModel(
            model_name=API_CONFIGS["gemini"]["model"],
            generation_config={
                "max_output_tokens": API_CONFIGS["gemini"]["max_tokens"],
                "temperature": API_CONFIGS["gemini"]["temperature"]
            }
        )
        
        # OpenAI setup
        openai_api_key = os.getenv('OPENAI_API_KEY')
        if not openai_api_key:
            raise ValueError("OPENAI_API_KEY not found")
        openai_client = OpenAI(
            api_key=openai_api_key,
            timeout=60.0
        )
        
        return {
            "claude": claude_client,
            "gpt": openai_client,
            "gemini": gemini_client
        }
        
    except Exception as e:
        logging.error(f"Error setting up clients: {str(e)}")
        raise
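
`setup_clients()` builds all three clients on every call, so a missing Gemini or OpenAI key stops the run even when another engine is selected, while a missing Anthropic key is not checked at this point. One possible adjustment, sketched under the assumption that only the selected engine's key should be required, is to validate a single key up front. The helper `require_api_key` and the `API_KEY_ENV_VARS` mapping are hypothetical; the environment variable names are the ones used in the cell above.

```python
import os
from typing import Mapping, Optional

# Environment variable names as read in setup_clients()
API_KEY_ENV_VARS = {
    "claude": "ANTHROPIC_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "gpt": "OPENAI_API_KEY",
}


def require_api_key(api_type: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key for api_type, or raise a ValueError naming what is missing."""
    env = os.environ if env is None else env
    if api_type not in API_KEY_ENV_VARS:
        raise ValueError(f"Unknown API type: {api_type!r}")
    var_name = API_KEY_ENV_VARS[api_type]
    key = env.get(var_name)
    if not key:
        raise ValueError(f"{var_name} not found; add it to the .env file")
    return key
```

Passing `env` explicitly keeps the helper easy to test; in the notebook it would default to `os.environ` after `load_dotenv()` has run.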

In [None]:
def process_markdown_content_for_incose_srs(api_type: str, markdown_dir: str = MARKDOWN_DIR, 
                                            ig_name: str = "FHIR", ig_version: str = "1.0.0") -> Dict[str, Any]:
    """
    Process markdown content and generate INCOSE SRS document directly from LLM outputs.
    
    Args:
        api_type: The API to use for processing
        markdown_dir: Directory containing markdown files
        ig_name: Name of the Implementation Guide
        ig_version: Version of the Implementation Guide
        
    Returns:
        Dict containing processing results and SRS document
    """
    logging.info(f"Starting processing with {api_type} on directory: {markdown_dir}")
    
    # List files before processing
    markdown_files = list_markdown_files(markdown_dir)
    if not markdown_files:
        logging.error("No markdown files found to process")
        return {"processed_files": [], "srs_document": "", "output_file": None, "requirements_list": "", "requirements_file": None}
    
    # Initialize API clients and rate limiters
    clients = setup_clients()
    client = clients[api_type]
    config = API_CONFIGS[api_type]
    rate_limiter = create_rate_limiter()
    
    def check_limits():
        check_rate_limits(rate_limiter, api_type)
    
    try:
        all_incose_requirements = []
        processed_files = []
        
        # Group files for potential combination
        file_groups = should_combine_files(markdown_files, markdown_dir, api_type)
        logging.info(f"Organized {len(markdown_files)} files into {len(file_groups)} processing groups")
        
        for group in file_groups:
            # For a single file
            if len(group) == 1:
                file_path = os.path.join(markdown_dir, group[0])
                logging.info(f"Processing single file: {group[0]}")
                
                with open(file_path, 'r', encoding='utf-8') as f:
                    content = clean_markdown(f.read())
                
                # Use dynamic chunk sizing
                chunks = split_markdown_dynamic(content, api_type)
                logging.info(f"Split {group[0]} into {len(chunks)} chunks using dynamic sizing")
                
                for chunk_idx, chunk in enumerate(chunks, 1):
                    logging.info(f"Processing chunk {chunk_idx}/{len(chunks)} of {group[0]}")
                    response = make_api_request(client, api_type, chunk, check_limits, 
                                               ig_name, ig_version, chunk_idx, len(chunks))
                    all_incose_requirements.append(response)
                    time.sleep(config["delay_between_chunks"])
                
                processed_files.append(group[0])
                
            # For multiple combined files
            else:
                logging.info(f"Processing combined group of {len(group)} files")
                combined_content = []
                
                # Prepare combined content with clear file boundaries
                for file in group:
                    file_path = os.path.join(markdown_dir, file)
                    with open(file_path, 'r', encoding='utf-8') as f:
                        file_content = clean_markdown(f.read())
                        combined_content.append(f"## FILE: {file}\n\n{file_content}\n\n")
                
                combined_text = "".join(combined_content)
                chunks = split_markdown_dynamic(combined_text, api_type)
                logging.info(f"Split combined content into {len(chunks)} chunks")
                
                for chunk_idx, chunk in enumerate(chunks, 1):
                    logging.info(f"Processing chunk {chunk_idx}/{len(chunks)} of combined files")
                    response = make_api_request(client, api_type, chunk, check_limits,
                                               ig_name, ig_version, chunk_idx, len(chunks))
                    all_incose_requirements.append(response)
                    time.sleep(config["delay_between_chunks"])
                
                processed_files.extend(group)
            
            # Add delay between file groups
            time.sleep(config["delay_between_batches"])
        
        # Combine all requirements into a full INCOSE SRS document
        srs_sections = []
        for req_section in all_incose_requirements:
            # Drop any introductory text before the first requirement heading
            # to avoid duplication; sections without a "## REQ-" heading
            # (e.g. informational text) are kept whole.
            req_start_idx = req_section.find("## REQ-")
            srs_sections.append(req_section[req_start_idx:] if req_start_idx > 0 else req_section)
        srs_document = "".join(srs_sections)
        
        # Save INCOSE SRS document to markdown file
        output_directory = os.path.join(PROJECT_ROOT, 'reqs_extraction', 'processed_output')
        os.makedirs(output_directory, exist_ok=True)
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        
        # Use API name in the filename
        srs_output_file = os.path.join(output_directory, f"{api_type}_reqs_list_v1_{timestamp}.md")
        
        with open(srs_output_file, 'w', encoding='utf-8') as f:
            f.write(srs_document)
        
        logging.info(f"Completed processing {len(processed_files)} files")
        logging.info(f"Generated requirements document saved to {srs_output_file}")
        
        # Now extract just the requirements list using the file-based approach
        try:
            requirements_list = extract_requirements_list_from_file(client, api_type, srs_output_file, check_limits)
            
            # Save the requirements list to a separate file
            req_list_output_file = os.path.join(output_directory, f"{ig_name.lower().replace(' ', '_')}_{api_type}_requirements_list_{timestamp}.md")
            
            with open(req_list_output_file, 'w', encoding='utf-8') as f:
                f.write(requirements_list)
                
            logging.info(f"Generated clean requirements list saved to {req_list_output_file}")
            
        except Exception as e:
            logging.error(f"Error extracting requirements list: {str(e)}")
            requirements_list = "Error extracting requirements list. See main SRS document instead."
            req_list_output_file = None
        
        return {
            "processed_files": processed_files,
            "srs_document": srs_document,
            "output_file": srs_output_file,
            "requirements_list": requirements_list,
            "requirements_file": req_list_output_file
        }
        
    except Exception as e:
        logging.error(f"Error processing content: {str(e)}")
        raise
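
The assembly step in `process_markdown_content_for_incose_srs` trims introductory text by searching for `## REQ-`, while the validation and large-requirement cells work with `# REQ-` headings wrapped in `---` delimiters. Which heading level the model actually emits depends on the prompt in `make_api_request`, which is not shown here. The sketch below is one tolerant alternative, assuming headings of one to three `#` characters and an optional leading `---` line. The names `strip_intro` and `assemble_srs` are hypothetical.

```python
import re
from typing import Iterable

# A requirement heading with one to three '#' characters,
# optionally preceded by a '---' delimiter line.
REQ_BLOCK_START = re.compile(r"^(?:---[ \t]*\n)?#{1,3}[ \t]+REQ-\d+", re.MULTILINE)


def strip_intro(section: str) -> str:
    """Drop text before the first requirement block; return the section
    unchanged when no requirement heading is found."""
    match = REQ_BLOCK_START.search(section)
    return section[match.start():] if match else section


def assemble_srs(sections: Iterable[str]) -> str:
    """Join sections, ending each with exactly one newline so that
    consecutive requirement blocks do not run together."""
    return "".join(strip_intro(s).rstrip("\n") + "\n" for s in sections)
```

Keeping the `---` line with its heading preserves the block delimiters that the formatting checks elsewhere in the notebook look for.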

### Main Execution

In [133]:
# Main execution script
if __name__ == "__main__":
    # Define input and output directories using absolute paths
    markdown_dir = MARKDOWN_DIR
    output_directory = os.path.join(PROJECT_ROOT, 'reqs_extraction', 'processed_output')

    # Create output directory if it doesn't exist
    os.makedirs(output_directory, exist_ok=True)

    # Get IG name and version from user input or use defaults
    ig_name = input("Enter Implementation Guide name (default 'FHIR Implementation Guide'): ") or "FHIR Implementation Guide"
    ig_version = input("Enter Implementation Guide version (default '1.0.0'): ") or "1.0.0"

    # Choose which API to use
    print("\nSelect the API to use:")
    print("1. Claude")
    print("2. Gemini")
    print("3. GPT-4")
    api_choice = input("Enter your choice (1-3, default 2): ") or "2"
    
    api_mapping = {
        "1": "claude",
        "2": "gemini",
        "3": "gpt"
    }
    
    api_type = api_mapping.get(api_choice, "gemini")
    
    try:
        logging.info(f"Processing with {api_type}...")
        print(f"\nProcessing Implementation Guide with {api_type.capitalize()}...")
        print("This may take several minutes depending on the size of the Implementation Guide.")
        
        # Process the markdown files and generate direct INCOSE SRS document
        api_results = process_markdown_content_for_incose_srs(
            api_type=api_type, 
            markdown_dir=markdown_dir,
            ig_name=ig_name,
            ig_version=ig_version
        )
        
        # Output the results to the user
        print("\n" + "="*80)
        print("Processing complete!")
        print(f"Generated requirements document: {api_results['output_file']}")
        print(f"Generated clean requirements list: {api_results['requirements_file']}")
        print(f"Processed {len(api_results['processed_files'])} files")
        print("="*80)
        
    except Exception as e:
        logging.error(f"Error processing {api_type}: {str(e)}")
        print(f"\nError occurred during processing: {str(e)}")
        print("Check the log file for more details.")


Select the API to use:
1. Claude
2. Gemini
3. GPT-4


INFO:root:Processing with gemini...
INFO:root:Starting processing with gemini on directory: /Users/ceadams/Documents/onclaive/onclaive/full-ig/markdown7_cleaned
INFO:root:Found 7 markdown files:
INFO:root:  - implementation.md
INFO:root:  - examples.md
INFO:root:  - profiles.md
INFO:root:  - ChangeHistory.md
INFO:root:  - artifacts.md
INFO:root:  - index.md
INFO:root:  - CapabilityStatement_plan_net.md
INFO:root:Organized 7 files into 6 processing groups
INFO:root:Processing combined group of 2 files
INFO:root:Split combined content into 1 chunks
INFO:root:Processing chunk 1/1 of combined files



Processing Implementation Guide with Gemini...
This may take several minutes depending on the size of the Implementation Guide.


INFO:root:Processing single file: index.md
INFO:root:Split index.md into 2 chunks using dynamic sizing
INFO:root:Processing chunk 1/2 of index.md
INFO:root:Processing chunk 2/2 of index.md
INFO:root:Processing single file: examples.md
INFO:root:Split examples.md into 3 chunks using dynamic sizing
INFO:root:Processing chunk 1/3 of examples.md
INFO:root:Processing chunk 2/3 of examples.md
INFO:root:Processing chunk 3/3 of examples.md
INFO:root:Processing single file: implementation.md
INFO:root:Split implementation.md into 3 chunks using dynamic sizing
INFO:root:Processing chunk 1/3 of implementation.md
INFO:root:Processing chunk 2/3 of implementation.md
INFO:root:Processing chunk 3/3 of implementation.md
INFO:root:Processing single file: CapabilityStatement_plan_net.md
INFO:root:Split CapabilityStatement_plan_net.md into 5 chunks using dynamic sizing
INFO:root:Processing chunk 1/5 of CapabilityStatement_plan_net.md
INFO:root:Processing chunk 2/5 of CapabilityStatement_plan_net.md
INFO:r


Processing complete!
Generated requirements document: /Users/ceadams/Documents/onclaive/onclaive/reqs_extraction/processed_output/plan_net_gemini_20250402_145733.md
Generated clean requirements list: /Users/ceadams/Documents/onclaive/onclaive/reqs_extraction/processed_output/plan_net_gemini_requirements_list_20250402_145733.md
Processed 7 files
