# Evaluate Results

## Running Evaluations

In [1]:
import json
import pandas as pd
import numpy as np
from openai import OpenAI

In [7]:
MODES = ['naive', 'local', 'global', 'hybrid']
TEST_TYPES = ['compliant', 'sentence-noncompliant', 'half-noncompliant', 'total-noncompliant']

In [47]:
evaluation_prompt = '''
You are an expert in regulation compliance analysis. 
We have a system that tries to check the compliance of a regulation with a new proposed regulation. 
You have the regulation, the proposed regulation, and the difference between these two. The output 
of the system will be provided to you, and you should check whether the output shows the compliance 
status of the proposed regulation with the regulation.
If the regulation's compliance is checked with general rules or other regulations it is ok.
If the changes made to the regulation are mentioned but it is said the regulation is compliant it is ok.
If at least half of the non compliance reasons are given it is enough.

regulation: "{text}"

proposed_regulation: "{test}"

compliance_status: "{desc}"

system_output: "{result}"

output should be in this format:
if the system_output is correct: `True`
if the system_output is not correct: `False - one line description why`
'''

In [8]:
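import os

# Read the OpenRouter key from the environment rather than hard-coding a secret in the notebook.
# The variable name OPENROUTER_API_KEY is an assumption; use whatever name your setup defines.
client = OpenAI(
  base_url="https://openrouter.ai/api/v1",
  api_key=os.environ["OPENROUTER_API_KEY"],
)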

In [19]:
with open("test_data_sample.json", "r", encoding="utf-8") as f:
    samples = json.load(f)

In [9]:
def call_gemini(prompt):
    completion = client.chat.completions.create(
      model="google/gemini-2.5-pro-preview-06-05",
      messages=[
        {
          "role": "user",
          "content": [
            {
              "type": "text",
              "text": prompt
            }
          ]
        }
      ]
    )
    return completion.choices[0].message.content
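
`call_gemini` makes one network request per call, and the evaluation loop issues several hundred of them, so a single transient failure would abort the whole run. The following is a minimal sketch of a retry wrapper with exponential backoff. The name `call_with_retry` and its parameters are hypothetical and not part of the original notebook; the wrapper takes the function to call as an argument, so it makes no assumption about the client.

```python
import time


def call_with_retry(fn, prompt, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(prompt), retrying with exponential backoff if it raises.

    Hypothetical helper: waits base_delay, 2*base_delay, 4*base_delay, ...
    between attempts and re-raises the last error once attempts run out.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(prompt)
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

Under these assumptions, `call_with_retry(call_gemini, prompt)` could stand in for the direct `call_gemini(prompt)` call inside `evaluate_results`.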

In [48]:
def evaluate_results(sample, result, test_type):
    text, test, result = sample['text'], sample[test_type], result[test_type]
    test_desc = 'The proposed regulation is completely compliant with previous regulation'
    if test_type != 'compliant':
        test_desc = sample[f'{test_type}-desc']
    prompt = evaluation_prompt.format(text=text, test=test, desc=test_desc, result=result)
    return call_gemini(prompt)
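
The lookups in `evaluate_results` imply a schema for each sample: a `text` field, one field per test type, and a `<test_type>-desc` field for every non-compliant test type. A missing key would only surface as a `KeyError` partway through a long run, so it can be worth checking the data first. The sketch below is one way to do that; `missing_keys` is a hypothetical helper, and the schema is inferred from the code above rather than documented anywhere.

```python
TEST_TYPES = ['compliant', 'sentence-noncompliant', 'half-noncompliant', 'total-noncompliant']


def missing_keys(sample, test_types=TEST_TYPES):
    """Return the keys that evaluate_results would need but the sample lacks."""
    required = ['id', 'text'] + list(test_types)
    # Every test type except 'compliant' needs a matching description field.
    required += [f'{t}-desc' for t in test_types if t != 'compliant']
    return [key for key in required if key not in sample]
```

Running `[s['id'] for s in samples if missing_keys(s)]` before the evaluation would list the samples that need attention, assuming every sample at least has an `id`.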

In [77]:
def run_evaluations_for_mode(samples, results):
    evaluations = []
    true_dict = {
        'compliant': 0,
        'sentence-noncompliant': 0,
        'half-noncompliant': 0,
        'total-noncompliant': 0,
    }
    for i in range(len(results)):
        sample, result = samples[i], results[i]
        if sample['id'] != result['id']:
            print(f'ids not matched for {i}')
            continue
        evaluation = {
            'id': sample['id']
        }
        for test_type in TEST_TYPES:
            test_eval = evaluate_results(sample, result, test_type)
            evaluation[test_type] = test_eval
            # Tolerate surrounding whitespace or trailing text in the judge's reply.
            if test_eval.strip().startswith('True'):
                true_dict[test_type] += 1
        evaluations.append(evaluation)
        print(f'Evaluation done for {i}')
    return evaluations, true_dict
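
The evaluation prompt asks the judge model to reply with either `True` or `False - one line description why`, but model replies often carry extra whitespace or backticks. As a sketch of a more tolerant check, the hypothetical helper below splits a raw reply into a verdict and a reason. It is not part of the original notebook, and it returns `None` as the verdict when the reply matches neither pattern so that malformed replies can be counted separately.

```python
def parse_verdict(raw):
    """Split a judge reply into (passed, reason).

    passed is True, False, or None when the reply fits neither format.
    """
    text = (raw or "").strip().strip("`").strip()
    lowered = text.lower()
    if lowered.startswith("true"):
        return True, ""
    if lowered.startswith("false"):
        # Drop the leading "False" and any separator such as " - " or ": ".
        reason = text[len("false"):].lstrip(" -:").strip()
        return False, reason
    return None, text


print(parse_verdict("False - misses the tax rate change"))
# (False, 'misses the tax rate change')
```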

In [78]:
format_percent = lambda d: {k: f"{(v / 50) * 100:.2f}%" for k, v in d.items()}
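
`format_percent` divides by a fixed 50, which is only right when exactly 50 samples were evaluated; samples skipped for an id mismatch would distort the percentages. A small variant that takes the denominator explicitly is sketched below. The name `pass_rates` and the `n_evaluated` parameter are assumptions for illustration; in `run_evaluations_for_mode` the natural value would be `len(evaluations)`.

```python
def pass_rates(true_counts, n_evaluated):
    """Format per-test-type pass counts as percentages of n_evaluated."""
    if n_evaluated <= 0:
        raise ValueError("n_evaluated must be positive")
    return {k: f"{v / n_evaluated:.2%}" for k, v in true_counts.items()}


print(pass_rates({'compliant': 22, 'total-noncompliant': 11}, 44))
# {'compliant': '50.00%', 'total-noncompliant': '25.00%'}
```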

In [90]:
for mode in MODES:
    with open(f"test_data_results_{mode}.json", "r", encoding="utf-8") as f:
        results = json.load(f)
    evals, true_dict = run_evaluations_for_mode(samples, results)
    print('_'*121)
    # Report the per-test-type pass rates returned for this mode.
    print(format_percent(true_dict))
    print('_'*121)
    with open(f"test_data_evals_{mode}.json", "w", encoding="utf-8") as f:
        # Save this mode's evaluations (the `evals` list returned above).
        json.dump(evals, f, indent=4, ensure_ascii=False)

Evaluation done for 0
Evaluation done for 1
Evaluation done for 2
Evaluation done for 3
Evaluation done for 4
Evaluation done for 5
Evaluation done for 6
Evaluation done for 7
Evaluation done for 8
Evaluation done for 9
Evaluation done for 10
Evaluation done for 11
Evaluation done for 12
Evaluation done for 13
Evaluation done for 14
Evaluation done for 15
Evaluation done for 16
Evaluation done for 17
Evaluation done for 18
Evaluation done for 19
Evaluation done for 20
Evaluation done for 21
Evaluation done for 22
Evaluation done for 23
Evaluation done for 24
Evaluation done for 25
Evaluation done for 26
Evaluation done for 27
Evaluation done for 28
Evaluation done for 29
Evaluation done for 30
Evaluation done for 31
Evaluation done for 32
Evaluation done for 33
Evaluation done for 34
Evaluation done for 35
Evaluation done for 36
Evaluation done for 37
Evaluation done for 38
Evaluation done for 39
Evaluation done for 40
Evaluation done for 41
Evaluation done for 42
Evaluation done for 4

## Evaluation Analysis

In [91]:
all_evals = {}
for mode in MODES:
    with open(f"test_data_evals_{mode}.json", "r", encoding="utf-8") as f:
        evals = json.load(f)
    all_evals[mode] = evals

In [100]:
for i in range(len(all_evals[MODES[0]])):
    for test_type in TEST_TYPES:
        tmp = all_evals[MODES[0]][i][test_type]
        for mode in MODES:
            if all_evals[mode][i][test_type] != tmp:
                print(f'difference in {mode}, {i}, {test_type}')
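
The loop above compares the full evaluation strings, so two `False` verdicts with differently worded reasons would also be reported as a difference. As a complementary sketch, the hypothetical function below compares only the pass/fail verdict across modes and returns the rows where the modes disagree. It assumes the `all_evals` layout built above (mode name mapped to a list of per-sample dicts keyed by test type).

```python
def verdict_disagreements(all_evals, modes, test_types):
    """Return (index, test_type, {mode: passed}) for rows where modes disagree."""
    def is_pass(raw):
        return str(raw).strip().strip("`").lower().startswith("true")

    # Only compare indices that exist for every mode.
    n = min(len(all_evals[mode]) for mode in modes)
    rows = []
    for i in range(n):
        for test_type in test_types:
            verdicts = {mode: is_pass(all_evals[mode][i][test_type]) for mode in modes}
            if len(set(verdicts.values())) > 1:
                rows.append((i, test_type, verdicts))
    return rows
```

An empty result from `verdict_disagreements(all_evals, MODES, TEST_TYPES)` would mean every mode received the same verdict on every test, regardless of how the reasons were worded.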

#### According to the check above, the results for all LightRAG modes were the same.

### Problems with compliant regulations analysis

- Incomplete detection of changes: The system frequently misses significant deletions and modifications, such as the omission of final lines referencing comparative tables.
- Failure to identify non-compliance: It incorrectly asserts full compliance despite substantial alterations, like a change in domestic construction capability from 65% to 85%.
- Missing critical omissions: The system overlooks crucial missing elements, including appendix references or specific budget sources and signing authorities.
- Empty or incorrect output: The system sometimes produces no output or provides an entirely inaccurate assessment of compliance.

### Problems with sentence noncompliant regulations analysis

- Inaccurate Compliance Assessment: The system frequently asserts full compliance despite clear contradictions, such as changing a tax rate from 1% to 8%.
- Failure to Detect Critical Changes: It consistently misses significant alterations to core provisions, like a project changing from "railway" to "freeway."
- Omission of Key Non-Compliant Details: The system overlooks crucial discrepancies in financial figures, deadlines, or requirements, for instance, a change in allocated amount from 10 billion to 50 billion Rials.
- Inability to Identify Fundamental Policy Reversals: It fails to recognize when the proposed regulation directly reverses the original intent, such as changing debt forgiveness from non-approval to approval.
- Lack of Contextual Understanding: The system analyzes proposed regulations in isolation, rather than comparing them against original regulations to identify non-compliance.

### Problems with half noncompliant regulations analysis

- Failure to compare with base regulation: The system consistently misses non-compliance by not comparing the proposed text to the original, e.g., missing changes in legal basis or validity period.
- Incorrectly claiming full compliance: It frequently states compliance even when fundamental contradictions exist, such as changing a tax rate from zero to 20%.
- Missing multiple non-compliance points: The system fails to identify several critical discrepancies simultaneously, like differences in amount, priority of issuance, and coordination for issuance time.
- Analyzing regulations in isolation: It assesses proposed regulations without considering their original context, e.g., failing to identify that domestic content changes violate the original regulation.
- Misunderstanding core regulatory intent: The system can completely misinterpret a regulation's purpose, such as stating price increases are allowed when explicitly forbidden.

### Problems with total noncompliant regulations analysis

- Failure to compare regulations: The system consistently neglects to compare proposed regulations with original ones, leading to missed contradictions (e.g., analyzing a tax change on basic goods in isolation instead of comparing it to a tax increase on luxury goods).
- Incorrectly asserting full compliance: It frequently claims compliance even when the proposed regulation fundamentally contradicts or reverses the original (e.g., stating compliance when an "approval to sell" is changed to a "disapproval to sell").
- Treating proposed regulation as current law: The system often assumes the proposed regulation is already valid and checks its compliance against general laws, ignoring direct conflicts with the original (e.g., confirming non-compliant content of the proposed regulation instead of its non-compliance with the original).
- Missing complete contradictions/reversals: It fails to identify instances where the proposed regulation is the exact opposite of the original in all key aspects (e.g., a decision to extend a deadline being reversed into a mandate for immediate implementation and fines).
- Misinterpreting original intent: The system can fundamentally misunderstand the original regulation's purpose, leading to incorrect compliance assessments (e.g., stating that the original regulation eliminated subsidies when it actually established them).

### Problems Summary

- Inadequate Comparison: The system consistently fails to compare proposed regulations with original ones, missing contradictions and fundamental changes.
- False Compliance Claims: It frequently asserts full compliance even when proposed regulations contradict or reverse original intent.
- Contextual Blindness: The system often analyzes proposed regulations in isolation, treating them as current law and ignoring their relationship with original versions.
- Missed Reversals & Contradictions: It fails to identify complete reversals, fundamental policy shifts, or direct contradictions between proposed and original regulations.
- Misinterpretation of Intent: The system frequently misunderstands the core purpose or intent of original regulations, leading to inaccurate assessments.
- Overlooked Critical Changes: It consistently misses significant alterations, deletions, omissions, or discrepancies in key provisions, details, or financial figures.
- Incomplete Non-Compliance Detection: The system often fails to identify multiple points of non-compliance simultaneously.
- Erroneous or Absent Output: The system can produce no output or provide entirely inaccurate compliance assessments.