# HRM Evaluation & Test Generation System - Complete Demo

## Comprehensive Tutorial and Examples

This notebook provides a complete walkthrough of the HRM (Hierarchical Recurrent Model) evaluation and test generation system. You'll learn how to:

- Load and inspect the HRM v9 Optimized model
- Parse and validate requirements from epics and user stories
- Generate comprehensive test cases using AI
- Use the REST API for test generation
- Integrate with multi-agent systems
- Fine-tune the model on custom data
- Analyze and visualize results

**Author:** Ian Cruickshank  
**Model:** HRM v9 Optimized (28M parameters)  
**Checkpoint:** step_7566 (converged)


## 1. Setup and Installation

First, ensure all dependencies are installed and the environment is configured correctly.
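
Before running the cells below, it can help to confirm which packages are importable. The following is a minimal sketch using only the standard library. The mapping from pip names to import names (for example `pyyaml` to `yaml`) and the helper name `check_dependencies` are illustrative assumptions, not part of `hrm_eval`.

```python
import importlib.util

# pip package name -> import name (assumed mapping for the packages in the install comment)
REQUIRED_PACKAGES = {
    "torch": "torch",
    "transformers": "transformers",
    "fastapi": "fastapi",
    "uvicorn": "uvicorn",
    "pydantic": "pydantic",
    "pyyaml": "yaml",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "pandas": "pandas",
}

def check_dependencies(packages):
    """Return {pip_name: bool} indicating whether each module can be located."""
    return {
        pip_name: importlib.util.find_spec(module_name) is not None
        for pip_name, module_name in packages.items()
    }

status = check_dependencies(REQUIRED_PACKAGES)
missing = sorted(name for name, found in status.items() if not found)
print("Missing packages:", ", ".join(missing) if missing else "none")
```

Only top-level module names are checked, so this does not verify versions or optional extras.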


In [None]:
# Install required packages (uncomment if needed)
# !pip install torch numpy requests transformers fastapi uvicorn pydantic pyyaml matplotlib seaborn pandas

import sys
import os
import json
import logging
from pathlib import Path
import warnings

# Add hrm_eval to path
sys.path.insert(0, str(Path.cwd() / 'hrm_eval'))

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Suppress warnings
warnings.filterwarnings('ignore')

print("Environment setup complete!")
print(f"Python version: {sys.version}")
print(f"Working directory: {Path.cwd()}")


In [None]:
# Import core libraries
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set plotting style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

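# Check device availability: prefer CUDA, then Apple MPS, then CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
elif getattr(torch.backends, "mps", None) is not None and torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
print(f"Using device: {device}")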

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)


## 2. Model Architecture & Loading

The HRM v9 Optimized model is a hierarchical dual-level transformer. It was originally designed for puzzle-solving tasks and is adapted here for test case generation.
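
As a rough cross-check of the reported model size, the parameter count can be converted into a storage footprint. The sketch below assumes 4-byte float32 weights (an assumption, since the checkpoint's dtype is not stated here) and uses the parameter count from this notebook's session summary.

```python
TOTAL_PARAMETERS = 27_990_018   # value printed in the session summary
BYTES_PER_PARAM = 4             # assumption: float32 weights

def estimate_size_mib(num_params, bytes_per_param=BYTES_PER_PARAM):
    """Estimate raw weight storage in MiB (1 MiB = 1024 * 1024 bytes)."""
    return num_params * bytes_per_param / (1024 ** 2)

size_mib = estimate_size_mib(TOTAL_PARAMETERS)
print(f"Estimated float32 weight storage: {size_mib:.2f} MiB")
print(f"Estimated float16 weight storage: {estimate_size_mib(TOTAL_PARAMETERS, 2):.2f} MiB")
```

A checkpoint on disk may be larger than this estimate if it also stores optimizer state or metadata.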


In [None]:
# Load model architecture information
with open('hrm_eval/hrm_checkpoint_eval/model_architecture.json', 'r') as f:
    model_arch = json.load(f)

print("=" * 60)
print("HRM v9 Optimized Model Architecture")
print("=" * 60)
print(f"Total Parameters: {model_arch['summary']['total_parameters']:,}")
print(f"Model Size: {model_arch['summary']['model_size_mb']:.2f} MB")
print(f"Architecture: {model_arch['summary']['architecture_type']}")
print(f"Number of Layers: {model_arch['summary']['num_layers']}")
print("\nComponent Breakdown:")
for component, params in model_arch['component_parameters'].items():
    print(f"  - {component}: {params:,} parameters")
print("=" * 60)


In [None]:
# Visualize parameter distribution
param_data = model_arch['component_parameters']
components = list(param_data.keys())
params = list(param_data.values())

plt.figure(figsize=(12, 6))
plt.barh(components, params, color='steelblue')
plt.xlabel('Number of Parameters')
plt.title('HRM v9 Optimized - Parameter Distribution by Component')
plt.grid(axis='x', alpha=0.3)
for i, v in enumerate(params):
    plt.text(v, i, f' {v:,}', va='center')
plt.tight_layout()
plt.show()


In [None]:
# Load checkpoint analysis
with open('analysis/checkpoint_analysis.json', 'r') as f:
    checkpoint_data = json.load(f)

print("\nCheckpoint Analysis Summary:")
print(f"Best Checkpoint: {checkpoint_data['recommended_checkpoint']}")
print(f"Training Status: {checkpoint_data['training_status']}")
print(f"Health Status: {checkpoint_data['health_status']}")
print(f"Total Training Steps: {checkpoint_data['total_steps']}")
print(f"Weight Stability: {checkpoint_data['weight_change_percent']:.2f}%")


## 3. Requirements Parsing

Parse and validate requirements from epics and user stories to prepare them for test generation.


In [None]:
# Import requirement parsing modules
from requirements_parser.schemas import Epic, UserStory, AcceptanceCriteria
from requirements_parser.requirement_parser import RequirementParser
from requirements_parser.requirement_validator import RequirementValidator

# Initialize parser and validator
parser = RequirementParser()
validator = RequirementValidator()

print("Requirements parsing modules loaded successfully!")


In [None]:
# Create a sample epic with user stories
sample_epic = {
    "epic_id": "EPIC-001",
    "title": "User Authentication System",
    "description": "Implement secure user authentication with OAuth2 and JWT tokens",
    "user_stories": [
        {
            "story_id": "US-001",
            "title": "User Login",
            "description": "As a user, I want to log in with email and password",
            "acceptance_criteria": [
                {
                    "criterion_id": "AC-001",
                    "description": "User can enter email and password",
                    "type": "functional"
                },
                {
                    "criterion_id": "AC-002",
                    "description": "System validates credentials against database",
                    "type": "functional"
                },
                {
                    "criterion_id": "AC-003",
                    "description": "JWT token is generated on successful login",
                    "type": "functional"
                },
                {
                    "criterion_id": "AC-004",
                    "description": "Login response time is less than 500ms",
                    "type": "performance"
                }
            ]
        },
        {
            "story_id": "US-002",
            "title": "Password Reset",
            "description": "As a user, I want to reset my password via email",
            "acceptance_criteria": [
                {
                    "criterion_id": "AC-005",
                    "description": "User can request password reset via email",
                    "type": "functional"
                },
                {
                    "criterion_id": "AC-006",
                    "description": "Reset link expires after 24 hours",
                    "type": "security"
                }
            ]
        }
    ]
}

# Parse the epic
epic = Epic(**sample_epic)
print(f"Created Epic: {epic.epic_id} - {epic.title}")
print(f"User Stories: {len(epic.user_stories)}")
print(f"Total Acceptance Criteria: {sum(len(us.acceptance_criteria) for us in epic.user_stories)}")


In [None]:
# Validate the epic
is_valid, issues = validator.validate_epic(epic)
testability_score, testability_report = validator.check_testability(epic)

print("\nEpic Validation Results:")
print(f"Valid: {is_valid}")
if issues:
    print("Issues found:")
    for issue in issues:
        print(f"  - {issue}")
else:
    print("No issues found!")

print(f"\nTestability Score: {testability_score:.2f}/10.0")
print("\nTestability Report:")
for key, value in testability_report.items():
    print(f"  {key}: {value}")


In [None]:
# Extract test contexts
test_contexts = parser.extract_test_contexts(epic)

print(f"\nExtracted {len(test_contexts)} test contexts:")
for i, context in enumerate(test_contexts[:3], 1):  # Show first 3
    print(f"\n{i}. {context.context_type.upper()}")
    print(f"   Story: {context.user_story_id}")
    print(f"   Criterion: {context.criterion_id}")
    print(f"   Requirement: {context.requirement_text[:80]}...")


## 4. Test Case Generation

Generate comprehensive test cases using the HRM model based on the parsed requirements.
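
When the checkpoint cannot be loaded, the notebook falls back to mock generation. What that mock looks like is not shown, so the following is a hypothetical sketch. `mock_generate_test_case` is an invented helper that fills the same fields as the example test case structure used in this section from a single acceptance criterion, without calling any model. The priority rule is also an assumption.

```python
def mock_generate_test_case(index, story_title, criterion):
    """Build a placeholder test case dict from one acceptance criterion (no model involved)."""
    # Assumed priority rule for the mock: security and performance criteria are P1
    priority = "P1" if criterion["type"] in ("security", "performance") else "P2"
    return {
        "test_id": f"TC-{index:03d}",
        "title": f"Verify: {criterion['description']}",
        "description": f"Mock test for {criterion['criterion_id']} in story '{story_title}'",
        "priority": priority,
        "test_type": criterion["type"],
        "preconditions": ["System under test is running"],
        "test_steps": [
            {
                "step_number": 1,
                "action": f"Exercise the behavior: {criterion['description']}",
                "expected_result": "Behavior matches the acceptance criterion",
            }
        ],
        "expected_outcome": criterion["description"],
        "postconditions": [],
    }

sample_criterion = {
    "criterion_id": "AC-006",
    "description": "Reset link expires after 24 hours",
    "type": "security",
}
mock_case = mock_generate_test_case(1, "Password Reset", sample_criterion)
print(mock_case["test_id"], mock_case["priority"], mock_case["title"])
```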


In [None]:
# Import test generation modules
from test_generator.generator import TestCaseGenerator
from test_generator.template_engine import TestTemplateEngine
from test_generator.coverage_analyzer import CoverageAnalyzer

print("Test generation modules loaded!")


In [None]:
# Initialize test case generator
# Note: This requires the actual model checkpoint
checkpoint_path = "checkpoints_hrm_v9_optimized_step_7566"

try:
    generator = TestCaseGenerator(
        model_path=checkpoint_path,
        device=device
    )
    print(f"TestCaseGenerator initialized with checkpoint: {checkpoint_path}")
    print(f"Using device: {device}")
except Exception as e:
    # Loading fails when the checkpoint is not present locally
    print(f"Note: could not load checkpoint '{checkpoint_path}': {e}")
    print("For demo purposes, we'll use mock generation")
    generator = None


In [None]:
# Show the expected structure of a generated test case (hand-written example; no model call is made here)
example_test_case = {
    "test_id": "TC-001",
    "title": "Test Successful User Login with Valid Credentials",
    "description": "Verify that a user can successfully log in with valid email and password",
    "priority": "P1",
    "test_type": "functional",
    "preconditions": [
        "User account exists in the database",
        "User credentials are valid",
        "Authentication service is running"
    ],
    "test_steps": [
        {
            "step_number": 1,
            "action": "Navigate to login page",
            "expected_result": "Login form is displayed"
        },
        {
            "step_number": 2,
            "action": "Enter valid email address",
            "expected_result": "Email field accepts input"
        },
        {
            "step_number": 3,
            "action": "Enter valid password",
            "expected_result": "Password field accepts masked input"
        },
        {
            "step_number": 4,
            "action": "Click 'Login' button",
            "expected_result": "Authentication request is sent to server"
        }
    ],
    "expected_outcome": "User is successfully authenticated and redirected to dashboard with valid JWT token",
    "postconditions": [
        "JWT token is stored in session",
        "User session is active",
        "Login timestamp is recorded"
    ]
}

print("Example Test Case Structure:")
print(json.dumps(example_test_case, indent=2))


In [None]:
# Initialize the template engine; the Gherkin example below is assembled by hand for illustration
template_engine = TestTemplateEngine()

# Gherkin format
print("\n" + "=" * 60)
print("GHERKIN FORMAT (BDD Style)")
print("=" * 60)
gherkin_test = f"""
Feature: User Authentication
  As a user
  I want to log in with my credentials
  So that I can access my account

  Scenario: {example_test_case['title']}
    Given {example_test_case['preconditions'][0]}
    And {example_test_case['preconditions'][1]}
    When I navigate to the login page
    And I enter my valid email address
    And I enter my valid password
    And I click the 'Login' button
    Then {example_test_case['expected_outcome']}
    And {example_test_case['postconditions'][0]}
"""
print(gherkin_test)
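
The Gherkin text above is assembled by hand for one test case. A small helper can derive the same shape from any test case dictionary. This is a sketch, not the `TestTemplateEngine` implementation. The name `to_gherkin` and its mapping (preconditions to Given/And, step actions to When/And, expected outcome and postconditions to Then/And) are assumptions for illustration.

```python
def to_gherkin(test_case, feature="User Authentication"):
    """Render a test case dict as a Gherkin scenario string."""
    def clause_block(first_keyword, items):
        # The first item gets the leading keyword; the rest are chained with "And"
        return [
            f"    {first_keyword if i == 0 else 'And'} {text}"
            for i, text in enumerate(items)
        ]

    lines = [f"Feature: {feature}", "", f"  Scenario: {test_case['title']}"]
    lines += clause_block("Given", test_case.get("preconditions", []))
    lines += clause_block("When", [s["action"] for s in test_case.get("test_steps", [])])
    outcomes = [test_case["expected_outcome"]] + test_case.get("postconditions", [])
    lines += clause_block("Then", outcomes)
    return "\n".join(lines)

demo_case = {
    "title": "Login with valid credentials",
    "preconditions": ["User account exists", "Authentication service is running"],
    "test_steps": [
        {"step_number": 1, "action": "Navigate to login page", "expected_result": "Login form is displayed"},
        {"step_number": 2, "action": "Submit valid credentials", "expected_result": "Request is sent"},
    ],
    "expected_outcome": "User is redirected to dashboard",
    "postconditions": ["User session is active"],
}
gherkin_text = to_gherkin(demo_case)
print(gherkin_text)
```

Step actions are written in the imperative here, so a real formatter would likely rephrase them into first-person steps as the hand-written example does.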


## 5. API Service Usage

The system provides a REST API for test generation. Here's how to use it programmatically.
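
As a sketch of what a programmatic call might look like, the snippet below builds (but does not send) a POST request using only the standard library. The endpoint path `/api/v1/generate-tests` and the `X-API-Key` header name are placeholders assumed for illustration; consult `hrm_eval/API_USAGE_GUIDE.md` for the actual route and authentication scheme.

```python
import json
import urllib.request

def build_generation_request(base_url, api_key, payload,
                             path="/api/v1/generate-tests"):  # hypothetical route
    """Build a JSON POST request without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=body,
        headers={
            "Content-Type": "application/json",
            "X-API-Key": api_key,  # hypothetical header name
        },
        method="POST",
    )

http_request = build_generation_request(
    "http://localhost:8000",
    "your-api-key-here",
    {"epic": {"epic_id": "EPIC-001"}, "generation_config": {"num_test_cases": 10}},
)
print(http_request.get_method(), http_request.full_url)

# To actually send it (requires a running server):
# with urllib.request.urlopen(http_request, timeout=30) as response:
#     result = json.loads(response.read().decode("utf-8"))
```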


In [None]:
# Example API usage (illustrative: nothing is sent here; a real call needs the server running)
import requests

# API endpoint configuration
API_BASE_URL = "http://localhost:8000"
API_KEY = "your-api-key-here"

# Example API request structure
api_request_example = {
    "epic": sample_epic,
    "generation_config": {
        "num_test_cases": 10,
        "priority_distribution": {
            "P1": 0.3,
            "P2": 0.5,
            "P3": 0.2
        },
        "test_types": ["functional", "security", "performance"],
        "include_edge_cases": True
    }
}

print("Example API Request:")
print(json.dumps(api_request_example, indent=2)[:500] + "...")
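
The `priority_distribution` gives fractions while `num_test_cases` is an integer, so some rounding rule is needed to turn the fractions into per-priority counts. How the service does this is not stated. One common choice is the largest-remainder method, sketched here under that assumption (`allocate_counts` is a hypothetical helper).

```python
import math

def allocate_counts(total, distribution):
    """Split `total` into integer counts proportional to `distribution` (largest-remainder method)."""
    if not math.isclose(sum(distribution.values()), 1.0, abs_tol=1e-9):
        raise ValueError("priority_distribution must sum to 1.0")

    exact = {key: total * share for key, share in distribution.items()}
    # Small epsilon guards against float results such as 2.9999999999999996
    counts = {key: math.floor(value + 1e-9) for key, value in exact.items()}
    leftover = total - sum(counts.values())

    # Hand the remaining slots to the entries with the largest fractional parts
    by_remainder = sorted(exact, key=lambda key: exact[key] - counts[key], reverse=True)
    for key in by_remainder[:leftover]:
        counts[key] += 1
    return counts

demo_distribution = {"P1": 0.3, "P2": 0.5, "P3": 0.2}
print(allocate_counts(10, demo_distribution))
print(allocate_counts(7, demo_distribution))
```

The counts always sum to the requested total, which plain per-entry rounding does not guarantee.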


In [None]:
# Example API response structure
api_response_example = {
    "status": "success",
    "request_id": "req-12345",
    "generated_test_cases": [
        {
            "test_id": "TC-001",
            "title": "Test Successful User Login",
            "priority": "P1",
            "test_type": "functional"
        },
        {
            "test_id": "TC-002",
            "title": "Test Login with Invalid Password",
            "priority": "P1",
            "test_type": "security"
        }
    ],
    "metadata": {
        "num_test_cases": 10,
        "generation_time_ms": 1250,
        "model_version": "hrm_v9_optimized_step_7566",
        "coverage_score": 0.92
    }
}

print("\nExample API Response:")
print(json.dumps(api_response_example, indent=2))


## 6. Multi-Agent Integration

Integrate the test generator with multi-agent systems for collaborative testing.


In [None]:
# Import agent integration modules
from integration.agent_adapter import AgentSystemAdapter
from integration.workflow_connector import WorkflowConnector

print("Agent integration modules loaded!")


In [None]:
# Initialize agent adapter (demonstrative)
# adapter = AgentSystemAdapter(generator, agent_id="test_generator_sqs")

# Agent capabilities
agent_capabilities = [
    "generate_test_cases",
    "analyze_coverage",
    "validate_requirements"
]

print("Agent Capabilities:")
for i, capability in enumerate(agent_capabilities, 1):
    print(f"{i}. {capability}")

# Example agent request
agent_request = {
    "type": "generate_test_cases",
    "request_id": "req-agent-001",
    "requester": "swe_agent",
    "epic": sample_epic,
    "priority": "high"
}

print("\nExample Agent Request:")
print(json.dumps(agent_request, indent=2)[:400] + "...")
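
One way an adapter could route such requests is a dispatch table keyed by the request's `type`. The sketch below is hypothetical and does not reflect the actual `AgentSystemAdapter` interface. The handler functions are stubs that only report what they would do, and the response envelope fields are assumptions.

```python
def handle_generate(request):
    stories = request["epic"].get("user_stories", [])
    return {"action": "generate_test_cases", "stories": len(stories)}

def handle_coverage(request):
    return {"action": "analyze_coverage"}

def handle_validate(request):
    return {"action": "validate_requirements", "epic_id": request["epic"].get("epic_id")}

# Request type -> handler, mirroring the three capabilities listed above
HANDLERS = {
    "generate_test_cases": handle_generate,
    "analyze_coverage": handle_coverage,
    "validate_requirements": handle_validate,
}

def dispatch(request):
    """Route an agent request to its handler and wrap the result in a response envelope."""
    handler = HANDLERS.get(request.get("type"))
    if handler is None:
        return {
            "request_id": request.get("request_id"),
            "status": "error",
            "error": f"unsupported request type: {request.get('type')}",
        }
    return {
        "request_id": request.get("request_id"),
        "status": "success",
        "result": handler(request),
    }

demo_request = {
    "type": "generate_test_cases",
    "request_id": "req-agent-001",
    "requester": "swe_agent",
    "epic": {"epic_id": "EPIC-001", "user_stories": [{"story_id": "US-001"}, {"story_id": "US-002"}]},
}
ok_response = dispatch(demo_request)
bad_response = dispatch({"type": "unknown", "request_id": "req-agent-002"})
print(ok_response)
print(bad_response)
```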


## 7. Fine-Tuning Pipeline

Continuously improve the model by fine-tuning on domain-specific data.


In [None]:
# Import fine-tuning modules
from fine_tuning.data_collector import TrainingDataCollector
from fine_tuning.fine_tuner import HRMFineTuner
from fine_tuning.evaluator import FineTuningEvaluator

# Training data example
training_example = {
    "prompt": "Create test cases for user login functionality with OAuth2 authentication",
    "completion": "Test Case 1: Successful OAuth2 Login\nSteps:\n1. Navigate to login page\n2. Click 'Login with OAuth2'\n3. Enter valid credentials\n4. Verify redirect to dashboard\nExpected: User is authenticated with OAuth2 token"
}

print("Fine-Tuning Pipeline Components:")
print("1. TrainingDataCollector - Collect feedback and examples")
print("2. HRMFineTuner - Fine-tune model on custom data")
print("3. FineTuningEvaluator - Evaluate model improvements")
print("\nTraining Example Format:")
print(json.dumps(training_example, indent=2))
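
Prompt/completion pairs like the one above are commonly stored one JSON object per line (JSONL). Whether `TrainingDataCollector` uses this layout is not stated, so treat the following round trip as a sketch under that assumption. `save_examples` and `load_examples` are hypothetical helpers.

```python
import json
import tempfile
from pathlib import Path

def save_examples(path, examples):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for example in examples:
            handle.write(json.dumps(example, ensure_ascii=False) + "\n")

def load_examples(path):
    """Read JSONL back, checking that each record has the expected fields."""
    examples = []
    with open(path, "r", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, 1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = {"prompt", "completion"} - set(record)
            if missing:
                raise ValueError(f"line {line_number}: missing fields {sorted(missing)}")
            examples.append(record)
    return examples

demo_examples = [
    {"prompt": "Create test cases for user login",
     "completion": "Test Case 1: Successful Login\nSteps:\n1. Navigate to login page"},
    {"prompt": "Create test cases for password reset",
     "completion": "Test Case 1: Reset Link Expiry"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    data_path = Path(tmp_dir) / "training_examples.jsonl"
    save_examples(data_path, demo_examples)
    reloaded = load_examples(data_path)

print(f"Round-tripped {len(reloaded)} examples")
```

Because `json.dumps` escapes newlines inside strings, multi-line completions still occupy a single line in the file.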


## 8. Real-World Example: E-Commerce Testing

Let's walk through a complete real-world example for an e-commerce fulfillment system.


In [None]:
# E-commerce fulfillment epic
ecommerce_epic = {
    "epic_id": "EPIC-ECOM-001",
    "title": "Order Fulfillment System",
    "description": "Implement automated order fulfillment with inventory management",
    "user_stories": [
        {
            "story_id": "US-ECOM-001",
            "title": "Place Order",
            "description": "As a customer, I want to place an order for products",
            "acceptance_criteria": [
                {
                    "criterion_id": "AC-ECOM-001",
                    "description": "Customer can add products to cart",
                    "type": "functional"
                },
                {
                    "criterion_id": "AC-ECOM-002",
                    "description": "System validates inventory availability",
                    "type": "functional"
                },
                {
                    "criterion_id": "AC-ECOM-003",
                    "description": "Order confirmation is sent within 30 seconds",
                    "type": "performance"
                }
            ]
        },
        {
            "story_id": "US-ECOM-002",
            "title": "Track Order",
            "description": "As a customer, I want to track my order status",
            "acceptance_criteria": [
                {
                    "criterion_id": "AC-ECOM-004",
                    "description": "Customer can view real-time order status",
                    "type": "functional"
                },
                {
                    "criterion_id": "AC-ECOM-005",
                    "description": "Tracking updates are sent via email and SMS",
                    "type": "functional"
                }
            ]
        }
    ]
}

# Parse and validate
ecom_epic = Epic(**ecommerce_epic)
is_valid, issues = validator.validate_epic(ecom_epic)
score, report = validator.check_testability(ecom_epic)

print("E-Commerce Epic:")
print(f"Title: {ecom_epic.title}")
print(f"User Stories: {len(ecom_epic.user_stories)}")
print(f"Validation: {'PASSED' if is_valid else 'FAILED'}")
print(f"Testability Score: {score:.1f}/10.0")


In [None]:
# Extract and analyze test contexts
ecom_contexts = parser.extract_test_contexts(ecom_epic)

# Group by type
context_types = {}
for ctx in ecom_contexts:
    context_types[ctx.context_type] = context_types.get(ctx.context_type, 0) + 1

print(f"\nExtracted {len(ecom_contexts)} test contexts:")
for ctx_type, count in context_types.items():
    print(f"  - {ctx_type.title()}: {count}")

# Visualize context distribution
plt.figure(figsize=(10, 6))
plt.bar(list(context_types.keys()), list(context_types.values()), color=['#2ecc71', '#e74c3c', '#f39c12'])
plt.xlabel('Context Type')
plt.ylabel('Count')
plt.title('Test Context Distribution for E-Commerce Epic')
plt.grid(axis='y', alpha=0.3)
for i, (k, v) in enumerate(context_types.items()):
    plt.text(i, v, str(v), ha='center', va='bottom')
plt.tight_layout()
plt.show()


## 9. Analysis & Visualization

Analyze model performance and visualize results.
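
The checkpoint summaries report per-checkpoint weight statistics and a weight-change percentage. The exact definition used by the analysis scripts is not given here. The sketch below assumes the relative L2 change between two flattened weight vectors, expressed in percent, and uses toy values rather than real checkpoint data.

```python
import math
import statistics

def weight_stats(weights):
    """Mean and population standard deviation of a flat list of weights."""
    return {"mean": statistics.fmean(weights), "std": statistics.pstdev(weights)}

def weight_change_percent(previous, current):
    """Relative L2 change ||current - previous|| / ||previous||, in percent (assumed definition)."""
    if len(previous) != len(current):
        raise ValueError("weight vectors must have the same length")
    diff_norm = math.sqrt(sum((c - p) ** 2 for p, c in zip(previous, current)))
    base_norm = math.sqrt(sum(p ** 2 for p in previous))
    return 100.0 * diff_norm / base_norm

# Toy weights standing in for two consecutive checkpoints
step_a = [3.0, 4.0, 0.0, 0.0]
step_b = [3.0, 4.0, 0.3, 0.4]

change = weight_change_percent(step_a, step_b)
print(f"Stats at step A: {weight_stats(step_a)}")
print(f"Weight change A -> B: {change:.2f}%")
```

A small change between late checkpoints is consistent with convergence, though it does not establish it on its own.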


In [None]:
# Load detailed checkpoint analysis
with open('analysis/detailed_checkpoint_analysis.json', 'r') as f:
    detailed_analysis = json.load(f)

# Extract training metrics
checkpoints = detailed_analysis['checkpoints']
steps = [cp['step'] for cp in checkpoints]
avg_weights = [cp['statistics']['mean'] for cp in checkpoints]
std_weights = [cp['statistics']['std'] for cp in checkpoints]

# Plot weight evolution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Weight mean evolution
ax1.plot(steps, avg_weights, marker='o', linewidth=2, markersize=6)
ax1.set_xlabel('Training Step')
ax1.set_ylabel('Average Weight Value')
ax1.set_title('Model Weight Evolution During Training')
ax1.grid(True, alpha=0.3)
ax1.axhline(y=avg_weights[-1], color='r', linestyle='--', alpha=0.5, label='Final Value')
ax1.legend()

# Weight std evolution
ax2.plot(steps, std_weights, marker='s', color='orange', linewidth=2, markersize=6)
ax2.set_xlabel('Training Step')
ax2.set_ylabel('Weight Standard Deviation')
ax2.set_title('Model Weight Variability During Training')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Training converged at step {steps[-1]}")
print(f"Final weight mean: {avg_weights[-1]:.6f}")
print(f"Final weight std: {std_weights[-1]:.6f}")


In [None]:
# Create a test coverage heatmap (simulated data)
test_types = ['Functional', 'Security', 'Performance', 'Integration', 'Edge Cases']
priorities = ['P1 (Critical)', 'P2 (Important)', 'P3 (Nice-to-have)']

# Simulated coverage data
coverage_data = np.array([
    [0.95, 0.88, 0.92, 0.85, 0.78],  # P1
    [0.92, 0.85, 0.88, 0.82, 0.75],  # P2
    [0.85, 0.78, 0.80, 0.75, 0.70]   # P3
])

plt.figure(figsize=(12, 6))
sns.heatmap(coverage_data, annot=True, fmt='.2f', cmap='RdYlGn', 
            xticklabels=test_types, yticklabels=priorities,
            vmin=0.0, vmax=1.0, cbar_kws={'label': 'Coverage Score'})
plt.title('Test Coverage Heatmap by Type and Priority')
plt.xlabel('Test Type')
plt.ylabel('Priority Level')
plt.tight_layout()
plt.show()


## 10. Summary & Next Steps

### Key Takeaways

1. **Model Architecture**: HRM v9 Optimized with 28M parameters, hierarchical dual-level transformer
2. **Requirements Parsing**: Robust validation and testability scoring
3. **Test Generation**: Model-driven generation with no hardcoded test logic
4. **API Integration**: Production-ready REST API with authentication
5. **Multi-Agent Support**: Seamless integration with agent systems
6. **Continuous Improvement**: Fine-tuning pipeline for domain adaptation

### System Capabilities

- Parse and validate requirements from epics and user stories
- Generate comprehensive test cases covering multiple scenarios
- Analyze test coverage and identify gaps
- Format tests in various styles (Gherkin, pytest, etc.)
- Integrate with multi-agent systems for collaborative testing
- Fine-tune models on domain-specific data

### Next Steps

1. **Deploy API**: Start the FastAPI server for production use
2. **Fine-Tune Model**: Collect domain-specific examples and retrain
3. **Integrate with CI/CD**: Add test generation to your pipeline
4. **Agent Mesh**: Connect to your multi-agent system
5. **Monitor Performance**: Track metrics and optimize

### Resources

- **Documentation**: `/docs` directory
- **API Guide**: `hrm_eval/API_USAGE_GUIDE.md`
- **Implementation Guide**: `hrm_eval/IMPLEMENTATION_GUIDE.md`
- **GitHub**: https://github.com/ianshank/HRM_SQE_Agent_Test_Generator


In [None]:
# Final statistics
print("=" * 70)
print("HRM EVALUATION & TEST GENERATION SYSTEM - SESSION SUMMARY")
print("=" * 70)
print("Model: HRM v9 Optimized (step_7566)")
print("Total Parameters: 27,990,018")
print(f"Device: {device}")
print("Epics Processed: 2")
print(f"User Stories Analyzed: {len(sample_epic['user_stories']) + len(ecommerce_epic['user_stories'])}")
print(f"Test Contexts Extracted: {len(test_contexts) + len(ecom_contexts)}")
print(f"Average Testability Score: {(testability_score + score) / 2:.1f}/10.0")
print("=" * 70)
print("\nThank you for using the HRM Test Generation System!")
print("For support: ianshank@gmail.com")
