# RAG Results Analysis

This notebook supports analysis of the results from the RAG experiments. The results are stored in a single large JSON file; the notebook loads that file, extracts the relevant fields, and exports them in a tabular format suited to further analysis.

It defines one function that flattens each entry's processed requirements into individual rows, and a second function that loads the JSON file and combines every requirement with its parent file information in a pandas DataFrame.

The notebook also categorizes files by their "base file type" (CodeSystem, SearchParameter, etc.) and reports the number of unique files, the number of files per category, and the average number of requirements per file within each category.

The processed data is also exported to a CSV file for further analysis.

In [13]:
import json
import pandas as pd
from typing import List, Dict
from collections import defaultdict
import os
import re

## Processing

In [2]:
def flatten_requirements(requirements: List[Dict]) -> List[Dict]:
    """
    Flattens the processed requirements list into one dict per requirement,
    renaming the source keys (e.g. 'Requirement*') to snake_case column names.
    Parent file information is merged in by the caller (process_json_to_df).
    """
    flattened = []
    for req in requirements:
        flat_req = {
            'requirement': req.get('Requirement*', ''),
            'conformance': req.get('Conformance*', ''),
            'actor': req.get('Actor*', ''),
            'verifiable': req.get('Verifiable?', ''),
            'planning_to_test': req.get('Planning To Test?', ''),
            'grouping': req.get('Grouping', ''),
            'test_plan': req.get('Test Plan', ''),
            'simulation_approach': req.get('Simulation Approach', '')
        }
        flattened.append(flat_req)
    return flattened
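
The key renaming above can also be expressed as a lookup table. The following is a minimal sketch, not the notebook's own implementation: `FIELD_MAP` and `flatten_with_map` are hypothetical names introduced here, and the sample records are invented for illustration.

```python
from typing import Dict, List

# Source key -> output column name, mirroring the keys used in flatten_requirements.
# FIELD_MAP and flatten_with_map are hypothetical names, not part of the notebook.
FIELD_MAP = {
    'Requirement*': 'requirement',
    'Conformance*': 'conformance',
    'Actor*': 'actor',
    'Verifiable?': 'verifiable',
    'Planning To Test?': 'planning_to_test',
    'Grouping': 'grouping',
    'Test Plan': 'test_plan',
    'Simulation Approach': 'simulation_approach',
}


def flatten_with_map(requirements: List[Dict]) -> List[Dict]:
    """Rename keys via FIELD_MAP; missing keys default to an empty string."""
    return [
        {column: req.get(source_key, '') for source_key, column in FIELD_MAP.items()}
        for req in requirements
    ]


# Invented input: the second record omits most fields.
sample = [
    {'Requirement*': 'Server SHALL do X', 'Conformance*': 'SHALL', 'Actor*': 'Server'},
    {'Requirement*': 'Client SHOULD do Y'},
]
rows = flatten_with_map(sample)
print(rows[1])
```

Keeping the mapping in one place means a renamed or added source field needs a single-line change, and every output row always has the same eight columns.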

In [3]:
def process_json_to_df(file_path: str) -> pd.DataFrame:
    """
    Process the JSON file and convert it to a pandas DataFrame.
    """
    # Read the JSON file (explicit encoding avoids platform-dependent defaults)
    with open(file_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    
    # Initialize list to store all rows
    all_rows = []
    
    # Process each entry in the JSON
    for entry in data:
        # Extract base information
        base_info = {
            'file_name': entry.get('file_name', ''),
            'chunk_index': entry.get('chunk_index', ''),
            'total_chunks': entry.get('total_chunks', ''),
            'response': entry.get('response', ''),
            'llm_response': entry.get('llm_response', '')
        }
        
        # Process requirements if they exist
        if 'processed_requirements' in entry and entry['processed_requirements']:
            flattened_reqs = flatten_requirements(entry['processed_requirements'])
            
            # Combine base info with each requirement
            for req in flattened_reqs:
                combined_row = {**base_info, **req}
                all_rows.append(combined_row)
        else:
            # If no requirements, just add the base info
            all_rows.append(base_info)
    
    # Convert to DataFrame
    df = pd.DataFrame(all_rows)
    
    # Reorder columns to put base info first
    base_columns = ['file_name', 'chunk_index', 'total_chunks', 'response', 'llm_response']
    other_columns = [col for col in df.columns if col not in base_columns]
    df = df[base_columns + other_columns]
    
    return df
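
For comparison, pandas offers `pd.json_normalize`, which performs a similar expansion of nested records. The sketch below is an alternative technique, not what the notebook does: the sample entries are invented (only the key names and the first file path come from this notebook), and unlike `process_json_to_df` it drops entries whose `processed_requirements` list is empty and keeps the original key names such as `Requirement*`.

```python
import pandas as pd

# Invented entries that follow the key names read by process_json_to_df.
sample_data = [
    {
        'file_name': '/site/CodeSystem-DeliveryMethodCS.json',
        'chunk_index': 0,
        'total_chunks': 1,
        'processed_requirements': [
            {'Requirement*': 'Placeholder requirement A', 'Conformance*': 'SHALL'},
            {'Requirement*': 'Placeholder requirement B', 'Conformance*': 'SHALL'},
        ],
    },
    {
        'file_name': '/site/hypothetical-file.json',  # hypothetical path
        'chunk_index': 0,
        'total_chunks': 1,
        'processed_requirements': [],  # contributes no rows with json_normalize
    },
]

# One row per requirement, with the listed parent fields repeated on each row.
flat = pd.json_normalize(
    sample_data,
    record_path='processed_requirements',
    meta=['file_name', 'chunk_index', 'total_chunks'],
)
print(flat)
```

Because files without requirements disappear from this result, the explicit loop in `process_json_to_df` remains the better fit when per-file statistics must include files that produced no requirements.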

In [4]:
file_path = 'rag-lite-plannet-reqs-processed_v0.json'

In [5]:
df = process_json_to_df(file_path)

In [7]:
df.head(5)

,file_name,chunk_index,total_chunks,response,llm_response,requirement,conformance,actor,verifiable,planning_to_test,grouping,test_plan,simulation_approach
0,/site/CodeSystem-DeliveryMethodCS.json,0,1,<ANSWER>YES</ANSWER>,"{\n ""Requirement*"": ""Server SHALL support the...",Server SHALL support the CodeSystem 'DeliveryM...,SHALL,Server,Yes,Yes,Terminology,Verify that the server supports the DeliveryMe...,SIMULATED: Inferno will include this CodeSyste...
1,/site/CodeSystem-DeliveryMethodCS.json,0,1,<ANSWER>YES</ANSWER>,"{\n ""Requirement*"": ""Server SHALL support the...",The DeliveryMethodCS CodeSystem SHALL include ...,SHALL,Server,Yes,Yes,Terminology,Verify that the server recognizes and accepts ...,SIMULATED: Inferno will include this code in i...
2,/site/CodeSystem-DeliveryMethodCS.json,0,1,<ANSWER>YES</ANSWER>,"{\n ""Requirement*"": ""Server SHALL support the...",The DeliveryMethodCS CodeSystem SHALL include ...,SHALL,Server,Yes,Yes,Terminology,Verify that the server recognizes and accepts ...,SIMULATED: Inferno will include this code in i...
3,/site/SearchParameter-organizationaffiliation-...,0,1,<ANSWER>YES</ANSWER>,"{\n ""Requirement*"": ""Servers SHALL support se...",Servers SHALL support searching OrganizationAf...,SHALL,Server,Yes,Yes,Search Parameters,1. Retrieve the server's CapabilityStatement\n...,SIMULATED: Inferno will implement support for ...
4,/site/SearchParameter-organizationaffiliation-...,0,1,<ANSWER>YES</ANSWER>,"{\n ""Requirement*"": ""Servers SHALL support se...",The 'specialty' search parameter for Organizat...,SHALL,Server,Yes,Yes,Search Parameters,1. Retrieve the server's CapabilityStatement\n...,SIMULATED: Inferno will implement this search ...


In [6]:
output_path = 'analysis_results.csv'
df.to_csv(output_path, index=False)
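
One point to keep in mind when reloading the exported CSV: the processing functions default missing fields to an empty string, and `pd.read_csv` parses empty fields as NaN unless told otherwise. The sketch below illustrates this with an in-memory buffer and invented data; it is an illustration of pandas behaviour rather than part of the notebook's pipeline.

```python
import io

import pandas as pd

sample = pd.DataFrame({
    'file_name': ['/site/a.json', '/site/b.json'],  # hypothetical paths
    'requirement': ['Server SHALL do X', ''],        # '' mimics a missing field
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

buffer.seek(0)
default_read = pd.read_csv(buffer)  # the empty field is parsed as NaN

buffer.seek(0)
literal_read = pd.read_csv(buffer, keep_default_na=False)  # the empty field stays ''

print(default_read['requirement'].isna().tolist())
print(literal_read['requirement'].tolist())
```

Which behaviour is preferable depends on the downstream analysis: NaN works naturally with `count()` and `dropna()`, while `keep_default_na=False` preserves the exported values exactly.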

## Analysis

In [8]:
unique_files = df['file_name'].unique()

In [11]:
print(f"Number of unique files: {len(unique_files)}")

Number of unique files: 158


In [14]:
def categorize_files(file_paths):
    """
    Categorizes files based on their base type (CodeSystem, SearchParameter, etc.)
    
    Args:
        file_paths: Array-like object containing file paths
        
    Returns:
        dict: Dictionary with base types as keys and lists of files as values
    """
    # Create a defaultdict to store categories
    categories = defaultdict(list)
    
    # Regular expression to extract the base type
    # Matches content between /site/ and the next hyphen; paths that do not
    # match (for example, paths outside /site/) are skipped
    pattern = r'/site/([^-]+)-'
    
    for file_path in file_paths:
        match = re.search(pattern, file_path)
        if match:
            base_type = match.group(1)
            categories[base_type].append(file_path)
    
    # Convert defaultdict to regular dict and sort the lists
    return {k: sorted(v) for k, v in categories.items()}

In [15]:
categorized_files = categorize_files(unique_files)

In [17]:
print(f"Total number of unique files: {len(unique_files)}")
print("Number of files per category:")
for category, files in sorted(categorized_files.items()):
    print(f"  {category}: {len(files)}")

Total number of unique files: 158
Number of files per category:
  CapabilityStatement: 1
  CodeSystem: 10
  SearchParameter: 47
  StructureDefinition: 21
  ValueSet: 1
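
The per-category counts above sum to 80, while 158 unique files were found, so 78 paths do not match the `/site/<type>-` pattern and are left out of `categorized_files`. The `get_category` function later in the notebook handles `/html_only/` paths and other special cases. A minimal sketch for listing the skipped paths follows; `find_uncategorized` is a hypothetical helper and the sample paths are for illustration only.

```python
import re

# Same pattern as categorize_files.
SITE_PATTERN = re.compile(r'/site/([^-]+)-')


def find_uncategorized(file_paths):
    """Return, sorted, the paths that categorize_files would skip."""
    return sorted(p for p in file_paths if not SITE_PATTERN.search(p))


# Illustrative paths; in the notebook this would be called on unique_files.
sample_paths = [
    '/site/CodeSystem-DeliveryMethodCS.json',
    '/html_only/CodeSystem-DeliveryMethodCS.json',
    '/site/expansions.json',
]
skipped = find_uncategorized(sample_paths)
print(skipped)
```

Running `find_uncategorized(unique_files)` in the notebook would show which files account for the difference.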


In [16]:
for category, files in sorted(categorized_files.items()):
    print(f"\n{category} ({len(files)} files):")
    for file in files:
        # Extract just the filename without the /site/ prefix
        filename = os.path.basename(file)
        print(f"  - {filename}")


CapabilityStatement (1 files):
  - CapabilityStatement-plan-net.json

CodeSystem (10 files):
  - CodeSystem-AcceptingPatientsCS.json
  - CodeSystem-DeliveryMethodCS.json
  - CodeSystem-EndpointConnectionTypeCS.json
  - CodeSystem-EndpointPayloadTypeCS.json
  - CodeSystem-HealthcareServiceCategoryCS.json
  - CodeSystem-InsurancePlanTypeCS.json
  - CodeSystem-InsuranceProductTypeCS.json
  - CodeSystem-OrgTypeCS.json
  - CodeSystem-ProviderRoleCS.json
  - CodeSystem-QualificationStatusCS.json

SearchParameter (47 files):
  - SearchParameter-healthcareservice-coverage-area.json
  - SearchParameter-healthcareservice-delivery-method.json
  - SearchParameter-healthcareservice-endpoint.json
  - SearchParameter-healthcareservice-location.json
  - SearchParameter-healthcareservice-name.json
  - SearchParameter-healthcareservice-organization.json
  - SearchParameter-healthcareservice-service-type.json
  - SearchParameter-healthcareservice-specialty.json
  - SearchParameter-insuranceplan-administ

In [25]:
def get_category(file_path):
    """
    Extract category from file path, handling both /site/ and /html_only/ paths
    as well as special cases.
    """
    # Handle special cases first
    if file_path.endswith('expansions.json'):
        return 'Expansions'
    
    # Pattern to match category in both /site/ and /html_only/ paths
    pattern = r'(?:/site/|/html_only/)([^-]+)-'
    match = re.search(pattern, file_path)
    
    if match:
        return match.group(1)
    
    # If no pattern match, use the filename without extension as category
    base_name = os.path.basename(file_path)
    category = os.path.splitext(base_name)[0]
    return category
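
The same categorization rules can be written with pandas string methods instead of a row-wise `apply`. This is a sketch of an alternative, assuming the three rules above (the `expansions.json` special case, the `/site/` or `/html_only/` prefix pattern, and the filename fallback); the sample paths are partly hypothetical, and edge cases such as filenames without an extension have not been checked against `get_category`.

```python
import pandas as pd

paths = pd.Series([
    '/site/CodeSystem-DeliveryMethodCS.json',
    '/html_only/SearchParameter-example.json',  # hypothetical path
    '/site/expansions.json',                    # hypothetical path
    '/site/ImplementationGuide.json',           # hypothetical path
])

# Rule 2: text between the /site/ or /html_only/ prefix and the next hyphen.
category = paths.str.extract(r'(?:/site/|/html_only/)([^-]+)-', expand=False)

# Rule 3: fall back to the filename without its extension.
stem = paths.str.rsplit('/', n=1).str[-1].str.replace(r'\.[^.]+$', '', regex=True)
category = category.fillna(stem)

# Rule 1: the special case overrides everything else.
category = category.mask(paths.str.endswith('expansions.json'), 'Expansions')

print(category.tolist())
```

For a DataFrame of this size the difference in speed is unlikely to matter, so the choice is mainly about readability; the function form keeps the rules together and is easier to test in isolation.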

In [26]:
df['category'] = df['file_name'].apply(get_category)

In [27]:
# Get requirements count by category
# (count() skips rows with no requirement; size() would also count those rows)
category_counts = df.groupby('category')['requirement'].count().sort_values(ascending=False)

In [28]:
# Get requirements count by file within each category
# (count() skips rows with no requirement; size() would also count those rows)
file_counts = df.groupby(['category', 'file_name'])['requirement'].count().reset_index(name='count')


In [29]:
category_stats = df.groupby('category').agg({
    'file_name': 'nunique',  # unique files per category
    'requirement': 'count'   # total requirements per category
}).reset_index()
category_stats['avg_requirements_per_file'] = (
    category_stats['requirement'] / category_stats['file_name']
).round(2)

In [30]:
category_stats

,category,file_name,requirement,avg_requirements_per_file
0,CapabilityStatement,2,14,7.0
1,CodeSystem,19,70,3.68
2,Expansions,2,6,3.0
3,ImplementationGuide,1,0,0.0
4,SearchParameter,89,411,4.62
5,StructureDefinition,42,143,3.4
6,ValueSet,3,7,2.33
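
In the table above, the `file_name` and `requirement` columns hold counts rather than names or text, which is easy to misread. pandas named aggregation lets the output columns be labeled directly. The sketch below uses invented data and hypothetical column names (`n_files`, `n_requirements`); it also shows why a category can report zero requirements, since `count` ignores missing values.

```python
import pandas as pd

# Invented rows: the ImplementationGuide file has no requirement.
sample = pd.DataFrame({
    'category': ['CodeSystem', 'CodeSystem', 'CodeSystem', 'ImplementationGuide'],
    'file_name': ['a.json', 'a.json', 'b.json', 'ig.json'],
    'requirement': ['r1', 'r2', 'r3', None],
})

stats = (
    sample.groupby('category')
    .agg(
        n_files=('file_name', 'nunique'),        # unique files per category
        n_requirements=('requirement', 'count'),  # non-missing requirements only
    )
    .reset_index()
)
stats['avg_requirements_per_file'] = (
    stats['n_requirements'] / stats['n_files']
).round(2)

print(stats)
```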
