### This notebook develops a prompt-chaining structure for sending large amounts of text and other content to LLM APIs across multiple calls

#### Note: this is still experimental and a work in progress

JSON file organization:
- Copies relevant JSON files from source to working directory
- Groups related files based on their names
- Creates organized folder structure
- Excludes certain file types (like .ttl.json, .jsonld.json)
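
As a rough illustration of the filtering and grouping rules above, the sketch below works on a plain list of file names rather than a directory. The function names `is_plain_json` and `group_by_base_name` and the sample file names are assumptions made for this example; the excluded suffixes are the ones the notebook's copy step checks for.

```python
from collections import defaultdict

# Compound extensions that the copy step skips
EXCLUDED_SUFFIXES = ('.ttl.json', '.jsonld.json', '.xml.json', '.change.history.json')

def is_plain_json(file_name):
    """True for .json files that do not end in one of the excluded compound suffixes."""
    return file_name.endswith('.json') and not file_name.endswith(EXCLUDED_SUFFIXES)

def group_by_base_name(file_names, delimiter='-'):
    """Group names by the text before the first delimiter; names without it are skipped."""
    groups = defaultdict(list)
    for name in file_names:
        if delimiter in name:
            groups[name.split(delimiter)[0]].append(name)
    return dict(groups)

# Hypothetical file names, for illustration only
names = [
    'Organization-acme.json',
    'Organization-hospital.json',
    'Organization-acme.ttl.json',
    'ValueSet-specialties.json',
    'package.json',
]
kept = [n for n in names if is_plain_json(n)]
groups = group_by_base_name(kept)
```

Passing a tuple to `str.endswith` checks all excluded suffixes in one call. Note that a name without the delimiter, such as `package.json`, passes the filter but ends up in no group.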

JSON Consolidation:
- Combines related JSON files into single consolidated files
- Creates files like Organization_combined.json, StructureDefinition_combined.json, etc.

JSON Processing:
- Splits large JSON files into manageable chunks
- Processes each chunk through Claude
- Gets technical analysis of configurations and structures
- Combines chunk analyses into coherent summaries
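
A compact way to see the chunking idea is a generator that packs whole items greedily until a size budget is reached. This is a sketch under assumptions, not the notebook's own function: the name `chunk_by_size` and the sample items are made up, and size is measured as the character length of each item's serialized JSON.

```python
import json

def chunk_by_size(items, max_size=2000):
    """Yield lists of whole items, packed greedily up to max_size serialized characters.

    An item larger than max_size is yielded alone, so no item is ever split.
    """
    chunk, size = [], 0
    for item in items:
        item_size = len(json.dumps(item))
        # Flush the current chunk if adding this item would exceed the budget
        if chunk and size + item_size > max_size:
            yield chunk
            chunk, size = [], 0
        chunk.append(item)
        size += item_size
    if chunk:
        yield chunk

# Hypothetical items of roughly 60 characters each
items = [{"id": i, "text": "x" * 40} for i in range(5)]
chunks = list(chunk_by_size(items, max_size=130))
```

Character counts are only a proxy for tokens, so a budget chosen this way leaves the real request size approximate.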

Markdown processing:
- Processes documentation files
- Extracts key technical information
- Creates summaries of documentation content
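
The processing code calls `clean_markdown` and `summarize_markdown`, but their definitions are not among the cells shown here. The following is only a guess at what they might look like: the cleaning rules and the prompt wording are assumptions, and the request mirrors the prefill-and-stop-sequence pattern used for the JSON chunks.

```python
import re

def clean_markdown(text):
    """Strip HTML comments and tags, trailing whitespace, and runs of blank lines."""
    text = re.sub(r'<!--.*?-->', '', text, flags=re.DOTALL)  # HTML comments
    text = re.sub(r'<[^>]+>', '', text)                      # inline HTML tags (crude)
    text = re.sub(r'[ \t]+\n', '\n', text)                   # trailing whitespace
    text = re.sub(r'\n{3,}', '\n\n', text)                   # extra blank lines
    return text.strip()

def summarize_markdown(client, content, model_name="claude-3-5-sonnet-20240620"):
    """Ask the model for a technical summary of one cleaned markdown document."""
    prompt = f"Summarize the key technical information in this documentation:\n\n{content}"
    response = client.messages.create(
        model=model_name,
        max_tokens=1000,
        messages=[
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": "Here is the technical summary: <summary>"},
        ],
        stop_sequences=["</summary>"],
    )
    return response.content[0].text
```

The tag-stripping pattern is deliberately simple and would also remove angle-bracket autolinks, so it may need tightening for real documentation.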

Image processing:
- Processes technical diagrams and figures
- Extracts architectural and design information
- Creates descriptions of visual technical content
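
Building the image part of a request comes down to base64-encoding the file and attaching a MIME type. The sketch below infers the type with the standard-library `mimetypes` module, which maps a `.jpg` file to `image/jpeg` rather than `image/jpg`. The helper name `build_image_block` is an assumption; the shape of the returned block follows the one used in this notebook.

```python
import base64
import mimetypes

def build_image_block(image_path):
    """Return a base64 image content block with a MIME type inferred from the file name."""
    media_type, _ = mimetypes.guess_type(image_path)
    if media_type is None or not media_type.startswith('image/'):
        raise ValueError(f"Unsupported or unknown image type: {image_path}")
    with open(image_path, 'rb') as image_file:
        data = base64.b64encode(image_file.read()).decode('utf-8')
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": data},
    }
```

Raising early on an unknown extension keeps a malformed request from reaching the API.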

Meta-analysis creation:
- Combines all processed information
- Creates comprehensive technical analysis covering:
    - Technical requirements and architecture
    - Implementation details
    - Visual documentation analysis
    - Cross-reference analysis
    - Patterns and potential issues

Output generation:
- Creates output directory
- Saves final analysis as markdown file
- Includes comprehensive technical documentation

In [1]:
# import packages
# Standard library
import base64
import io
import json
import logging
import math
import os
import re
import shutil
import threading
import time
from collections import defaultdict
from dataclasses import dataclass
from typing import List, Dict, Tuple, Union, Optional

# Third-party
import httpx
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from anthropic import Anthropic  # required by the client setup below
from dotenv import load_dotenv
from IPython.display import Image
from json_repair import repair_json
from langchain_community.document_loaders import BSHTMLLoader
#import google.generativeai as gemini
#from openai import OpenAI


Read in the Claude API key from the .env file (the Gemini and GPT keys are currently commented out)

In [8]:
load_dotenv()

claude_api_key = os.getenv('ANTHROPIC_API_KEY')
#gemini_api_key = os.getenv('GEMINI_API_KEY')
#OpenAI.api_key = os.getenv('OPENAI_API_KEY')

Setting up Claude

In [9]:
claude = Anthropic(api_key = claude_api_key)
claude_version = "claude-3-5-sonnet-20240620"  # "claude-3-opus-20240229"   "claude-3-5-sonnet-20240620" "claude-3-sonnet-20240229" "claude-3-haiku-20240307"
claude_max_output_tokens = 8192  # claude 3 opus is only 4096 tokens, sonnet is 8192

In [10]:
#CERT_PATH = '/Users/amathur/ca-certificates.crt'
CERT_PATH = '/opt/homebrew/etc/openssl@3/cert.pem'

In [11]:
def create_anthropic_client():
    """Create Anthropic client with proper certificate verification"""
    verify_path = CERT_PATH if os.path.exists(CERT_PATH) else True
    http_client = httpx.Client(
        verify=verify_path,
        timeout=30.0
    )
    return Anthropic(
        api_key=claude_api_key,
        http_client=http_client
    )

### Pulling in files of interest

In [12]:
source_folder = 'full-ig/site'
destination_folder = 'full-ig/json_only'

In [16]:
def copy_json_files():
    """
    Copy JSON files from full-ig/site to full-ig/json_only directory,
    excluding compound extensions and creating the directory if needed
    """
    source_folder = 'full-ig/site'
    destination_folder = 'full-ig/json_only'

    # Create the destination folder if it doesn't exist
    if not os.path.exists(destination_folder):
        os.makedirs(destination_folder)

    json_files = []
    for file_name in os.listdir(source_folder):
        # Check if the file ends with .json but not with compound extensions
        if file_name.endswith('.json') and not (file_name.endswith('.ttl.json') or 
                                             file_name.endswith('.jsonld.json') or 
                                             file_name.endswith('.xml.json') or 
                                             file_name.endswith('.change.history.json')):
            json_files.append(file_name)
            # Copy the file to the destination folder
            shutil.copy(os.path.join(source_folder, file_name), destination_folder)
            
    logging.info(f"Copied {len(json_files)} JSON files to {destination_folder}")
    return json_files

def group_files_by_base_name(directory_path, delimiter='-'):
    """
    Group files in the directory by their base name (portion before a delimiter).
    
    Args:
    directory_path (str): Path to the directory containing files.
    delimiter (str): The delimiter to split the file name on (default is '-').

    Returns:
    dict: A dictionary where keys are base names and values are lists of files that share the same base name.
    """
    grouped_files = defaultdict(list)
    
    # Iterate through the files in the directory
    for filename in os.listdir(directory_path):
        if filename.endswith('.json'):  # Only process .json files
            if delimiter in filename:  # Only consider files with the delimiter
                # Get the base name (before the first delimiter)
                base_name = filename.split(delimiter)[0]
                
                # Append the file to the group corresponding to its base name
                grouped_files[base_name].append(filename)
    
    return grouped_files

def copy_files_to_folders(directory_path, grouped_files):
    """
    Copy each group of files into a folder named after the group's base name.
    Groups with a single file are included, and the original files are left
    in place in the source directory.
    Args:
    directory_path (str): Path to the directory containing files.
    grouped_files (dict): Dictionary of grouped files by base name.
    """
    for base_name, files in grouped_files.items():
        if len(files) >= 1:  # Process groups with one or more files
            # Create a folder for the base name in the same directory
            base_folder = os.path.join(directory_path, base_name)
            if not os.path.exists(base_folder):
                os.makedirs(base_folder)  # Create the folder if it doesn't exist
                logging.info(f"Created folder: {base_folder}")  # Log only when a folder is actually created
            
            # Copy each file in the group to the new folder
            for file in files:
                source_file = os.path.join(directory_path, file)
                destination_file = os.path.join(base_folder, file)
                shutil.copy(source_file, destination_file)  # Copy the file

### Preparing files for LLM

In [17]:
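def split_json(json_data, max_size=2000):
    """
    Split a JSON array into chunks while keeping each JSON object whole.
    max_size is measured in characters of serialized JSON; an object larger
    than max_size is placed in a chunk of its own.
    Returns a list of chunks, where each chunk is a list of complete JSON objects.
    """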
    if isinstance(json_data, dict):
        json_data = [json_data]
    
    chunks = []
    current_chunk = []
    current_size = 0
    
    for item in json_data:
        item_size = len(json.dumps(item))
        
        # Handle large individual items
        if item_size > max_size:
            if current_chunk:
                chunks.append(current_chunk)
                current_chunk = []
                current_size = 0
            chunks.append([item])
            continue
        
        # Start new chunk if current would exceed max_size
        if current_size + item_size > max_size and current_chunk:
            chunks.append(current_chunk)
            current_chunk = []
            current_size = 0
        
        current_chunk.append(item)
        current_size += item_size
    
    if current_chunk:
        chunks.append(current_chunk)
        
    return chunks

def prepare_json_for_processing(json_file_path):
    """Read and prepare JSON file for processing"""
    with open(json_file_path, 'r') as f:
        data = json.load(f)
        
    if isinstance(data, dict) and 'entry' in data:
        return data['entry']
    return data

def create_json_summary_prompt(chunk, chunk_num, total_chunks):
    """Create prompt for summarizing JSON chunk"""
    return f"""Analyze this portion ({chunk_num} of {total_chunks}) of a FHIR Implementation Guide JSON resource bundle.
    Focus on key technical details, requirements, and relationships.
    
    JSON Content:
    {json.dumps(chunk, indent=2)}
    
    Please provide:
    1. Resource Types and Profiles present
    2. Key technical requirements and constraints
    3. Dependencies and relationships between resources
    4. Notable patterns or unique configurations
    
    Focus on new information not covered in previous chunks."""

def process_json_file(client, json_file_path, model_name="claude-3-5-sonnet-20240620"):
    """Process a JSON file while maintaining object integrity"""
    # Load and prepare JSON data
    json_data = prepare_json_for_processing(json_file_path)
    
    # Split into chunks
    chunks = split_json(json_data)
    
    # Process each chunk
    chunk_summaries = []
    for i, chunk in enumerate(chunks):
        prompt = create_json_summary_prompt(chunk, i+1, len(chunks))
        response = client.messages.create(
            model=model_name,
            max_tokens=1000,
            messages=[
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": "Here is the technical summary: <summary>"}
            ],
            stop_sequences=["</summary>"]
        )
        chunk_summaries.append(response.content[0].text)
    
    # Combine summaries, passing the client explicitly instead of relying on a global
    return combine_summaries(client, chunk_summaries, model_name)


def combine_summaries(client, summaries, model_name="claude-3-5-sonnet-20240620"):
    """Combine chunk summaries into a cohesive analysis"""
    prompt = f"""Synthesize these related summaries into a unified technical analysis:

    {json.dumps(summaries, indent=2)}
    
    Create a comprehensive analysis that:
    1. Eliminates redundant information
    2. Maintains technical accuracy
    3. Preserves important relationships
    4. Highlights key patterns
    5. Notes any conflicts or inconsistencies"""
    
    response = client.messages.create(
        model=model_name,
        max_tokens=2000,
        messages=[
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": "Here is the combined analysis: <summary>"}
        ],
        stop_sequences=["</summary>"]
    )
    
    return response.content[0].text

def consolidate_jsons(base_directory='full-ig/json_only'):
    """Consolidate related JSON files while maintaining object integrity"""
    subdirs = [d for d in os.listdir(base_directory) 
              if os.path.isdir(os.path.join(base_directory, d))]
    
    for subdir in subdirs:
        folder_path = os.path.join(base_directory, subdir)
        combined_data = []
        
        for filename in os.listdir(folder_path):
            if filename.endswith('.json'):
                file_path = os.path.join(folder_path, filename)
                try:
                    with open(file_path, 'r') as f:
                        json_content = json.load(f)
                        if isinstance(json_content, dict) and 'entry' in json_content:
                            combined_data.extend(json_content['entry'])
                        else:
                            combined_data.append(json_content)
                except json.JSONDecodeError as e:
                    logging.error(f"Error decoding JSON from {filename}: {e}")
                    continue
        
        if combined_data:
            output_filename = f"{subdir}_combined.json"
            output_path = os.path.join(base_directory, output_filename)
            
            try:
                with open(output_path, 'w') as outfile:
                    json.dump({
                        "resourceType": subdir,
                        "total": len(combined_data),
                        "entry": combined_data
                    }, outfile, indent=2)
                logging.info(f"Created {output_filename} with {len(combined_data)} entries")
            except Exception as e:
                logging.error(f"Error writing {output_filename}: {e}")

def encode_image(image_path):
    """Convert image to base64 encoding for API consumption"""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

def process_image(client, image_path):
    """Process a single image and generate a technical description"""
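    try:
        base64_image = encode_image(image_path)
        # Use a registered MIME type: .jpg files must be sent as "image/jpeg", not "image/jpg"
        extension = os.path.splitext(image_path)[1].lower().lstrip('.')
        media_type = "image/jpeg" if extension in ("jpg", "jpeg") else f"image/{extension}"
        
        messages = [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": media_type,
                            "data": base64_image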
                        }
                    },
                    {
                        "type": "text",
                        "text": "Analyze this technical diagram/figure. Focus on:\n1. Key components and their relationships\n2. Technical workflows or processes shown\n3. Architecture or design patterns illustrated\n4. Important technical details or annotations\nProvide a detailed technical description."
                    }
                ]
            },
            {
                "role": "assistant",
                "content": "Here is the technical analysis of the image: <summary>"
            }
        ]

        response = client.messages.create(
            model="claude-3-5-sonnet-20240620",
            max_tokens=1000,
            messages=messages,
            stop_sequences=["</summary>"]
        )
        
        return response.content[0].text
        
    except Exception as e:
        logging.error(f"Error processing image {image_path}: {str(e)}")
        raise

def process_all_content(client, base_directory='full-ig'):
    """Process all content types (JSON, Markdown, and Images) and create summaries"""
    # First consolidate JSONs
    consolidate_jsons(os.path.join(base_directory, 'json_only'))
    
    # Process consolidated JSONs
    json_summaries = {}
    json_dir = os.path.join(base_directory, 'json_only')
    for filename in os.listdir(json_dir):
        if filename.endswith('_combined.json'):
            file_path = os.path.join(json_dir, filename)
            json_summaries[filename] = process_json_file(client, file_path)
    
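    # Process markdown files
    # clean_markdown and summarize_markdown are expected to be defined elsewhere;
    # their definitions are not part of the code cells shown here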
    markdown_summaries = {}
    markdown_dir = os.path.join(base_directory, 'markdown')
    for filename in os.listdir(markdown_dir):
        if filename.endswith('.md'):
            with open(os.path.join(markdown_dir, filename)) as f:
                content = clean_markdown(f.read())
                markdown_summaries[filename] = summarize_markdown(client, content)
    
    # Process images
    image_summaries = {}
    image_dir = os.path.join(base_directory, 'site/Figures')
    for filename in os.listdir(image_dir):
        if filename.lower().endswith(('.png', '.jpg', '.jpeg')):
            file_path = os.path.join(image_dir, filename)
            image_summaries[filename] = process_image(client, file_path)
    
    # Create final meta-summary
    return create_meta_summary(client, json_summaries, markdown_summaries, image_summaries)

def create_meta_summary(client, json_summaries, markdown_summaries, image_summaries):
    """Create a comprehensive meta-summary incorporating all content types"""
    prompt = f"""Synthesize information from multiple content types into a comprehensive technical analysis:

    JSON Configuration Summaries:
    {json.dumps(json_summaries, indent=2)}

    Documentation Summaries:
    {json.dumps(markdown_summaries, indent=2)}

    Diagram/Figure Analyses:
    {json.dumps(image_summaries, indent=2)}

    Create a comprehensive technical analysis that covers:
    1. Technical Requirements and Architecture
       - Core technical requirements
       - System architecture and patterns
       - Integration points and interfaces
       
    2. Implementation Details
       - Key configurations and settings
       - Resource profiles and extensions
       - Validation rules and constraints
       
    3. Visual Documentation Analysis
       - Technical workflows and processes
       - Component relationships
       - Architecture diagrams insights
       
    4. Cross-Reference Analysis
       - Relationships between documentation, config, and diagrams
       - Consistency validation
       - Dependencies and prerequisites
       
    Highlight any important patterns, potential issues, or implementation considerations."""

    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",  # was "claude-3-sonnet-20240307", which is not a valid model ID
        max_tokens=4000,
        messages=[
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": "Here is the comprehensive technical analysis: <summary>"}
        ],
        stop_sequences=["</summary>"]
    )
    
    return response.content[0].text

def save_processed_content(base_directory='full-ig', output_directory='processed_output'):
    """Save all processed content and summaries in an organized structure"""
    os.makedirs(output_directory, exist_ok=True)
    
    # Create subdirectories for different content types
    summaries_dir = os.path.join(output_directory, 'summaries')
    os.makedirs(summaries_dir, exist_ok=True)
    
    # Initialize client (reuses the certificate-aware helper defined above)
    client = create_anthropic_client()
    
    # Process all content
    final_summary = process_all_content(client, base_directory)
    
    # Save final meta-summary
    with open(os.path.join(output_directory, 'final_technical_analysis.md'), 'w') as f:
        f.write(final_summary)
    
    # Log completion
    logging.info(f"Processing complete. Results saved to {output_directory}")
    
    return final_summary

### Running the processor

In [18]:
# 1. First set up logging and import required packages
import logging
logging.basicConfig(level=logging.INFO)

# 2. Load environment variables and create the client
load_dotenv()  # Load environment variables from .env file
client = create_anthropic_client()

# 3. Copy JSON files from site to json_only folder
copied_files = copy_json_files()
print(f"Copied {len(copied_files)} JSON files")

# 4. Group related files
grouped_files = group_files_by_base_name('full-ig/json_only')
print("Files grouped by base name")

# 5. Create folders and organize files
copy_files_to_folders('full-ig/json_only', grouped_files)
print("Files organized into folders")

# 6. Consolidate related JSONs
consolidate_jsons('full-ig/json_only')
print("JSONs consolidated")

# 7. Process all content types and create final analysis
final_analysis = process_all_content(client, base_directory='full-ig')

# 8. Save the results
output_dir = 'processed_output'
os.makedirs(output_dir, exist_ok=True)

with open(os.path.join(output_dir, 'technical_analysis.md'), 'w') as f:
    f.write(final_analysis)

print(f"Analysis complete and saved to {output_dir}/technical_analysis.md")

INFO:root:Copied 166 JSON files to full-ig/json_only
INFO:root:Created folder: full-ig/json_only/Location
INFO:root:Created folder: full-ig/json_only/StructureDefinition
INFO:root:Created folder: full-ig/json_only/ValueSet
INFO:root:Created folder: full-ig/json_only/CodeSystem
INFO:root:Created folder: full-ig/json_only/OrganizationAffiliation
INFO:root:Created folder: full-ig/json_only/SearchParameter
INFO:root:Created folder: full-ig/json_only/HealthcareService
INFO:root:Created folder: full-ig/json_only/usage
INFO:root:Created folder: full-ig/json_only/Organization
INFO:root:Created folder: full-ig/json_only/CapabilityStatement
INFO:root:Created folder: full-ig/json_only/PractitionerRole
INFO:root:Created folder: full-ig/json_only/ImplementationGuide
INFO:root:Created folder: full-ig/json_only/InsurancePlan
INFO:root:Created folder: full-ig/json_only/Practitioner
INFO:root:Created folder: full-ig/json_only/Endpoint
INFO:root:Created folder: full-ig/json_only/plan
INFO:root:Created O

Copied 166 JSON files
Files grouped by base name
Files organized into folders
JSONs consolidated


INFO:root:Created ValueSet_combined.json with 24 entries
INFO:root:Created HealthcareService_combined.json with 10 entries
INFO:root:Created SearchParameter_combined.json with 51 entries
INFO:root:Created Endpoint_combined.json with 1 entries
INFO:root:Created InsurancePlan_combined.json with 2 entries
INFO:root:Created ImplementationGuide_combined.json with 1 entries
INFO:root:Created PractitionerRole_combined.json with 6 entries
INFO:root:Created OrganizationAffiliation_combined.json with 7 entries
INFO:httpx:HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.anthropic.com/v1/messa

RateLimitError: Error code: 429 - {'type': 'error', 'error': {'type': 'rate_limit_error', 'message': 'Number of request tokens has exceeded your per-minute rate limit (https://docs.anthropic.com/en/api/rate-limits); see the response headers for current usage. Please reduce the prompt length or the maximum tokens requested, or try again later. You may also contact sales at https://www.anthropic.com/contact-sales to discuss your options for a rate limit increase.'}}
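
The run above stops on a 429 because the chunk requests are sent back to back. One common mitigation is to retry with exponential backoff. The sketch below is a generic helper, not part of the notebook: `call_with_backoff` and its parameters are assumptions, and the exception types to retry on are passed in by the caller.

```python
import random
import time

def call_with_backoff(fn, retry_on=(Exception,), max_retries=5, base_delay=2.0, sleep=time.sleep):
    """Call fn(); on a retryable error wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: surface the original error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            sleep(delay)
```

With the Anthropic SDK this could wrap each request, for example `call_with_backoff(lambda: client.messages.create(...), retry_on=(anthropic.RateLimitError,))`. Because the limit reported here is on request tokens per minute, a fixed pause between chunk calls or smaller prompts may also be needed; backoff alone only spaces out the retries.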