# Test Tool EU Law Citations Extraction

_Last Update:_ 2025.11.08

The goal of this notebook is to create tests that ensure the tool and its associated functions work as intended, going from code that merely runs to actual performance.


## Environment Settings

In [None]:
import os
import sys
from pathlib import Path
import warnings
import pandas as pd
warnings.filterwarnings("ignore")

# PROJECT ROOT to Python paths
ROOT_DIR = Path().resolve().parents[0]
if str(ROOT_DIR) not in sys.path:
    sys.path.insert(0, str(ROOT_DIR))
# RELOAD the custom script
%load_ext autoreload
%autoreload 2

In [3]:
from dotenv import load_dotenv
# Load environment variables from .env file
SECRETS_FILE = Path(r"C:\LocalSecrets\master.env")
load_dotenv(str(SECRETS_FILE))
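# ASSIGN the values of the environment variables (empty string if unset;
# wrapping os.getenv() in str() would turn a missing key into the string "None")
openai_key = os.getenv("OPENAI_UIO24EMC_KEY", "")
pplx_key = os.getenv("PERPLEXITY_UIO24EMC_KEY", "")
claude_key = os.getenv("ANTHROPIC_API_KEY", "")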
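A missing key is easy to overlook at this point, because `os.getenv` does not raise when a variable is unset. The following is a minimal sketch of a fail-fast check; `require_env` and `EXAMPLE_API_KEY` are hypothetical names used only for illustration and are not part of the notebook or of `gptquery`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


# Hypothetical variable, set here only so the example runs on its own.
os.environ["EXAMPLE_API_KEY"] = "sk-example"
example_key = require_env("EXAMPLE_API_KEY")
```

Calling the check right after `load_dotenv()` surfaces a wrong path or a misspelled variable name before any API call is made.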

## S01 System Message and Prompt Constructors

- `validate_citations/*.py`
- `extract_citations/*.py`
- `select_citations/*.py`

### PROMPT Constructor `run_validate()` (using real DataFrame)

Check whether there are any missing citations.

In [20]:
# Test both formats side by side run_validate()
from gptquery.tools.tool_eulaw_citations.validate_citations.prompts.default import (prompt_validate_completeness, 
                                                                                    VALIDATION_SYSTEM_MESSAGE)

# LOAD reliability CSV
df_reliab = pd.read_csv(ROOT_DIR / "data" / "uio24hlp" / "realiability_data_refs.csv")
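# DROP duplicated entries and reset the index so that label 0 is the first row
df_ref_lvl = df_reliab[["uoa_referral_id",
                        "questions_text",
                        "std_potential_citations",
                        "nlang_potential_citations"]].drop_duplicates().reset_index(drop=True)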
df_ref_lvl = df_ref_lvl.rename(columns={"questions_text": "question_text"})  # type: ignore
df_ref_lvl = df_ref_lvl.rename(columns={"nlang_potential_citations": "potential_citations"})  # type: ignore

df_ref_lvl["validate_prompt"] = df_ref_lvl.apply(lambda row: prompt_validate_completeness(row["question_text"], 
                                                                                          row["potential_citations"]), axis=1)

print(VALIDATION_SYSTEM_MESSAGE)
print("-" * 100)
print(df_ref_lvl["validate_prompt"][0])


You are a legal citation completeness validator specializing in EUROPEAN UNION LAW. Your task is to determine whether the provided citations are all and the same as those mentioned in the provided question text.

CRITICAL INSTRUCTIONS:
1. FOCUS ONLY ON EU LEGAL INSTRUMENTS - ignore national or non-EU sources
2. VALIDATE ONLY DIRECT REFERENCES - check if directly mentioned legal provisions are covered  
3. ANSWER ONLY "complete" OR "incomplete" - no explanations needed

CITATION FORMAT UNDERSTANDING:
The citations are in simplified EU legal format:
- Format: "CELEX_NUMBER,document_part,structural_element"
- Examples:
  * 31977L0388,main,body article 9 paragraph 2 point (e)
  * 62005CJ0119,main,body paragraph 1
  * 12006E018,main,body article 18
  * 32002F0584,main,body article 4 paragraph 6

VALIDATION RULES:
1. Identify ONLY the EU legal provisions explicitly mentioned by name in the question
2. For each mentioned provision, check if there is a corresponding citation:
   - Same legal 

### PROMPT Constructor `run_extract()` (using real DataFrame)

Extract EU Law citations.

In [19]:
# TEST prompt construction run_extract()
from gptquery.tools.tool_eulaw_citations.extract_citations.prompts.default import (prompt_extract_basic, 
                                                                                   EXTRACTION_SYSTEM_MESSAGE)
# LOAD reliability CSV
df_reliab = pd.read_csv(ROOT_DIR / "data" / "uio24hlp" / "realiability_data_refs.csv")
df_ref_lvl = df_reliab[["uoa_referral_id",
                        "questions_text",
                        "std_potential_citations",
                        "nlang_potential_citations"]].drop_duplicates().reset_index(drop=True)

# RENAME columns to match prompt constructors
df_ref_lvl = df_ref_lvl.rename(columns={"questions_text": "question_text"})                     # type: ignore
df_ref_lvl = df_ref_lvl.rename(columns={"nlang_potential_citations": "potential_citations"})    # type: ignore

# CREATE prompt constructor 
df_ref_lvl["extract_prompt"] = df_ref_lvl.apply(lambda row: prompt_extract_basic(row["question_text"],
                                                                                 row["potential_citations"]), axis=1)

df_ref_lvl["extract_prompt"] = (df_ref_lvl["uoa_referral_id"] + "\n" + df_ref_lvl["extract_prompt"])
print(EXTRACTION_SYSTEM_MESSAGE)
print("-" * 100)
print(df_ref_lvl["extract_prompt"][0])


You are a legal citation extraction expert specializing in EUROPEAN UNION LAW ONLY. Your task is to identify ALL missing EU legal citations that are DIRECTLY mentioned in the question but not covered by the existing citations.

CRITICAL INSTRUCTIONS:
1. EXTRACT ONLY EU LEGAL INSTRUMENTS - ignore all national or non-EU sources
2. EXTRACT ONLY DIRECT REFERENCES - do not infer or interpret contextually related citations
3. EXTRACT ALL MISSING CITATIONS - return all directly mentioned but missing citations
4. If ALL directly mentioned citations are already covered, return "NONE"

EU LEGAL INSTRUMENTS INCLUDE:
- EU Treaties (TEU, TFEU)
- EU Regulations
- EU Directives  
- EU Decisions
- EU Court of Justice (CJEU) cases
- EU General Court cases
- EU Commission decisions
- EU Council decisions

DOCUMENT PARTS:
- main (default for articles, most common)
- preamble (for recitals)
- annex (for annexes)
- appendix (for appendices)
- protocol (for protocols)
- convention (for conventions)
- agree

### PROMPT Constructor `run_select()` (using real DataFrame)

Select, from a provided list of options, the citations that are mentioned in a text snippet.

In [18]:
# TEST prompt construction run_select()
from gptquery.tools.tool_eulaw_citations.select_citations.prompts.default import (prompt_select_basic, 
                                                                                   SELECTION_SYSTEM_MESSAGE)
# LOAD reliability data
df_reliab = pd.read_csv(ROOT_DIR / "data" / "uio24hlp" / "realiability_data_refs.csv")
# DROP duplicated entries
df_ref_lvl = df_reliab[["uoa_referral_id",
                        "questions_text",
                        "std_potential_citations",
                        "nlang_potential_citations"]].drop_duplicates().reset_index(drop=True)

# RENAME columns to match prompt constructors
df_ref_lvl = df_ref_lvl.rename(columns={"questions_text": "question_text"})                     # type: ignore
df_ref_lvl = df_ref_lvl.rename(columns={"nlang_potential_citations": "potential_citations"})    # type: ignore
# CREATE prompt select
df_ref_lvl["prompt_select"] = df_ref_lvl.apply(lambda row: prompt_select_basic(row["question_text"],
                                                                                 row["potential_citations"]), axis=1)
df_ref_lvl["prompt_select"] = (df_ref_lvl["uoa_referral_id"] + "\n" + df_ref_lvl["prompt_select"])
print(SELECTION_SYSTEM_MESSAGE)
print("=" * 100)
print(df_ref_lvl["prompt_select"][0])

You are a legal citation selector specializing in EUROPEAN UNION LAW. Your task is to identify ALL citations from the provided list that correspond to DIRECT legal references explicitly mentioned in the question text.

CRITICAL INSTRUCTIONS:
1. SELECT ALL CITATIONS - Extract every citation that is directly mentioned in the text
2. DIRECT REFERENCES ONLY - Do not infer, interpret, or select contextually related citations  
3. EXACT MATCHING - Match citations to explicit legal references in the question
4. COMPREHENSIVE EXTRACTION - Return all directly mentioned citations.

CITATION FORMAT UNDERSTANDING:
The citations are in simplified EU legal format:
- Format: CELEX_NUMBER,document_part,structural_element
- Examples: 
  * 31977L0388,main,body article 9 paragraph 2 point (e)
  * 62005CJ0119,main,body paragraph 1  
  * 12008E018,main,body article 18
  * 31971R1408,main,body article 40 paragraph 3 point (b)

CELEX NUMBER PATTERNS:
- Directives: 3YYYYLNNNN (e.g., 31977L0388 = Council Direc

## S02 Reliability Test for EU Law Citation Extraction

Run the pipeline `run_validate()` + `run_extract()` + `run_select()` on golden-record data. This is a representative sample used to validate the approach in a more heuristic way.

In [32]:
# LOAD validation dataset; fill rows that don't have citations with empty strings.
df_validation = pd.read_csv(ROOT_DIR / "data" / "validation" / "validation_data.csv").fillna("")
# RENAME Column for Validation
df_validation = df_validation.rename(columns={'nlang_citations':'potential_citations'})
df_validation.head(5)

| | iuropa_referral_question_id | question_text | potential_citations | citations_flag |
|---|---|---|---|---|
| 0 | REF_2007_0522DE_Q001 | Is additional note 5(b) to Chapter 20 of the C... | 31987R2658,annex I,note 5 point (a)\n31987R265... | Maybe |
| 1 | REF_2007_0522DE_Q002 | Is additional note 5(b) to Chapter 20 of the C... | 31987R2658,annex I,note 5 point (a)\n31987R265... | Maybe |
| 2 | REF_2007_0522DE_Q003 | If both the preceding questions are answered i... | 31987R2658,annex I,note 5 point (a)\n31987R265... | Maybe |
| 3 | REF_2008_0022DE_Q001 | Is Article 24(2) of Directive 2004/38 of the E... | 32004L0038,main,body article 24 paragraph 2\n3... | Maybe |
| 4 | REF_2008_0039DE_Q003 | If the answer is ‘yes’, is the national court ... | 31989L0104,main,body article 3 | Maybe |


### Reliability Test `run_validate_basic()`

In [None]:
from gptquery import run_validate_basic
new_validate_df = run_validate_basic(df_validation, api_key=openai_key, model="gpt-4.1-mini")
new_validate_df

### Reliability Test `run_extract_basic()`

In [None]:
from gptquery import run_extract_basic
new_extract_df = run_extract_basic(df_validation, api_key=openai_key, model="gpt-4.1-mini")
new_extract_df

### Reliability Test `run_select_basic()`

In [None]:
from gptquery import run_select_basic
new_select_df = run_select_basic(df_validation, api_key=openai_key, model="gpt-4.1-mini")
new_select_df

### POST Processing




In [None]:
# POST process outputs
from gptquery.utils.data_prep import (eval_and_explode, split_column_with_names)
## EXPLODE Nicely formatted list into rows
df1 = eval_and_explode(new_select_df, 'selected_citations')
## SPLIT selected citation into cols for standardizing pipeline
df2 = split_column_with_names(df1, source_col='selected_citations', target_cols=['celex','document_part','structural_element'])
df2[(df2['selected_citations']=='') & (df2['citations_flag']=='Maybe')]
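The explode-and-split step can be pictured with the sketch below. It is written against plain Python records rather than a DataFrame to stay dependency-free, and it is not the implementation of `eval_and_explode` or `split_column_with_names`. It assumes that `selected_citations` holds a stringified Python list and that each citation follows the `CELEX_NUMBER,document_part,structural_element` format; `split_citation`, `explode_citations` and the `question_id` key are hypothetical names.

```python
import ast


def split_citation(citation: str) -> dict:
    """Split 'CELEX,document_part,structural_element' on the first two commas only."""
    parts = citation.split(",", 2)
    parts += [""] * (3 - len(parts))  # pad short or empty citations to three fields
    return dict(zip(("celex", "document_part", "structural_element"), parts))


def explode_citations(row_id: str, raw: str) -> list:
    """Parse a stringified list and return one record per citation."""
    # Keep one (empty) record even when nothing was selected.
    citations = ast.literal_eval(raw) or [""]
    return [
        {"question_id": row_id, "selected_citations": c, **split_citation(c)}
        for c in citations
    ]


records = (
    explode_citations(
        "Q1",
        "['31977L0388,main,body article 9 paragraph 2 point (e)', '62005CJ0119,main,']",
    )
    + explode_citations("Q2", "[]")
)
```

Splitting on the first two commas only keeps any comma inside the structural element intact. Rows that end up with an empty `selected_citations` are the ones the final filter in the cell above looks for.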

## S03 Estimate Costs per Tool Call

Before running a tool call, you can use its prompt constructor function, its system message, and an expected output to estimate what the call would cost on a given DataFrame. The only limitation is that, because tool calls are built for running actual queries, you have to import both the prompt function and the system message from the tool itself. On the flip side, this gives you the advantage of estimating the cost for multiple models at once. These costs are estimated using the tokenizers that OpenAI uses for each of the respective models.


In [None]:
# LOAD reliability CSV
df_reliab = pd.read_csv(ROOT_DIR / "data" / "uio24hlp" / "realiability_data_refs.csv")
df_ref_lvl = df_reliab[["uoa_referral_id",
                        "questions_text",
                        "nlang_potential_citations"]].drop_duplicates().reset_index(drop=True)

# RENAME columns to match prompt function
df_ref_lvl = df_ref_lvl.rename(columns={"questions_text": "question_text"})                    # type: ignore
df_ref_lvl = df_ref_lvl.rename(columns={"nlang_potential_citations": "potential_citations"})   # type: ignore
df_ref_lvl.head()

| | uoa_referral_id | question_text | potential_citations |
|---|---|---|---|
| 0 | REF_2007_0522DE | 1.\| Is additional note 5(b) to Chapter 20 of t... | 31987R2658,annex I,body note 5 point (a)\n3198... |
| 1 | REF_2008_0001IT | 1.\| Where — for VAT purposes, and in accordanc... | 31977L0388,main,\n31977L0388,main,body article... |
| 2 | REF_2008_0002IT | 1.\| ‘Does Community law preclude the applicati... | 62005CJ0119,main, |
| 3 | REF_2008_0003BE | 1.\| Are Article 40(3)(b) of Regulation (EEC) N... | 31971R1408,main,body article 40 paragraph 3 po... |
| 4 | REF_2008_0005DK | 1.\| Can the storing and subsequent printing ou... | 32001L0029,main,\n32001L0029,main,body article... |


### Costs for `run_validate()`

This tool call has a dynamic prompt function that accepts parameters, such as `granularity="full"`. These arguments are unpacked via `**prompt_kwrds`.
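The forwarding pattern can be sketched as follows. This is only an illustration of how extra keyword arguments reach a prompt function, not the code of `estimate_costs_for_models`. `estimate_prompt_chars`, `toy_prompt` and the `"instrument"` value are hypothetical, and character counts stand in for token counts.

```python
def estimate_prompt_chars(rows, prompt_func, **prompt_kwargs):
    """Build one prompt per row, forwarding extra keyword arguments to prompt_func."""
    return [len(prompt_func(**row, **prompt_kwargs)) for row in rows]


def toy_prompt(question_text, potential_citations, granularity="full"):
    """Hypothetical prompt constructor with one optional parameter."""
    if granularity == "full":
        header = "Check every citation."
    else:
        header = "Check instruments only."
    return f"{header}\nQUESTION: {question_text}\nCITATIONS: {potential_citations}"


rows = [
    {
        "question_text": "Is Article 18 applicable?",
        "potential_citations": "12006E018,main,body article 18",
    }
]
full = estimate_prompt_chars(rows, toy_prompt, granularity="full")
coarse = estimate_prompt_chars(rows, toy_prompt, granularity="instrument")
```

Because the prompt text changes with the keyword argument, the estimated size (and therefore the estimated cost) changes with it, which is why the argument has to be passed to the estimator as well.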

In [32]:
# IMPORT Cost Estimation Functions & Tool Call
from gptquery.estimation.cost_estimator import (estimate_costs_for_models, create_cost_matrix)
from gptquery.tools.tool_eulaw_citations.validate_citations.prompts.default import (prompt_validate_completeness,
                                                                                    VALIDATION_SYSTEM_MESSAGE)

# MODELS for Cost Analysis
MODELS = ["gpt-4.1-mini","gpt-5-mini", "gpt-5-nano", "gpt-4.1-nano"]

# COST per tool_call cost estimation
costs_run_validate = estimate_costs_for_models(df_ref_lvl,
                                  prompt_validate_completeness,
                                  system_msg=VALIDATION_SYSTEM_MESSAGE,
                                  models=MODELS,
                                  verbose=False,
                                  granularity="full"  # forwarded to the prompt function via **prompt_kwrds
                                  )
# DISPLAY Costs DataFrame
create_cost_matrix(costs_run_validate)

| | Model | Total Cost ($) | Cost Per Row ($) | Total Tokens | Input Cost ($) | Output Cost ($) |
|---|---|---|---|---|---|---|
| 0 | gpt-5-nano | $0.3699 | $0.0001 | 7358179 | $0.3679 | $0.0020 |
| 1 | gpt-4.1-nano | $0.7378 | $0.0001 | 7358179 | $0.7358 | $0.0020 |
| 2 | gpt-5-mini | $1.8496 | $0.0004 | 7358179 | $1.8395 | $0.0100 |
| 3 | gpt-4.1-mini | $2.9513 | $0.0006 | 7358179 | $2.9433 | $0.0080 |
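The input-cost column above can be reproduced with simple per-token arithmetic. The sketch below is a back-of-the-envelope check, not the estimator's code. The rates in `ASSUMED_INPUT_RATES` are not stated anywhere in this notebook; they are inferred by dividing each input cost by the token count. Treating "Total Tokens" as the number of billed input tokens is likewise an assumption.

```python
def token_cost(tokens: int, usd_per_million: float) -> float:
    """Cost in USD for a token count at a given price per one million tokens."""
    return tokens / 1_000_000 * usd_per_million


TOTAL_TOKENS = 7_358_179  # "Total Tokens" column of the run_validate() table

# ASSUMPTION: USD per 1M input tokens, inferred from the table, not from a price list.
ASSUMED_INPUT_RATES = {
    "gpt-5-nano": 0.05,
    "gpt-4.1-nano": 0.10,
    "gpt-5-mini": 0.25,
    "gpt-4.1-mini": 0.40,
}

input_costs = {
    model: f"${token_cost(TOTAL_TOKENS, rate):.4f}"
    for model, rate in ASSUMED_INPUT_RATES.items()
}
print(input_costs)
# {'gpt-5-nano': '$0.3679', 'gpt-4.1-nano': '$0.7358', 'gpt-5-mini': '$1.8395', 'gpt-4.1-mini': '$2.9433'}
```

Input tokens dominate the total for this tool because the expected answer is a single word ("complete" or "incomplete").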


### Costs for `run_extract()`

This tool does not have a standard fixed-length response, so I approximate it with a placeholder string:

- `avg_response_len = "A-X " * 50`, although any string that matches the expected outputs can be used.
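One alternative to a synthetic placeholder is to pick a representative string from responses you have already seen. The sketch below takes the sample whose length is closest to the median. `representative_response` is a hypothetical helper, and the samples are illustrative strings built from the example citations in the system messages, not real model outputs.

```python
from statistics import median


def representative_response(samples):
    """Return the sample whose character length is closest to the median length."""
    target = median(len(s) for s in samples)
    return min(samples, key=lambda s: abs(len(s) - target))


# Illustrative responses only; replace with outputs observed in earlier runs.
samples = [
    "NONE",
    "31977L0388,main,body article 9 paragraph 2 point (e)",
    "62005CJ0119,main,body paragraph 1\n12006E018,main,body article 18",
]
avg_response_len = representative_response(samples)
```

The resulting string can then be passed as `expected_response_length`, in the same way as the `"A-X " * 50` placeholder.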

In [33]:
# IMPORT Cost Estimation Functions & Tool Call
from gptquery.estimation.cost_estimator import (estimate_costs_for_models, create_cost_matrix)
from gptquery.tools.tool_eulaw_citations.extract_citations.prompts.default import (prompt_extract_basic,
                                                                                   EXTRACTION_SYSTEM_MESSAGE)
# MODELS for Cost Analysis
MODELS = ["gpt-4.1-mini", "gpt-4o-mini","gpt-4.1-nano","gpt-5-mini", "gpt-5-nano","gpt-5"]

# ESTIMATED avg response length for tool_call
avg_response_len = "A-X " * 50

# COST per tool_call cost estimation
costs_run_extract = estimate_costs_for_models(
    df_ref_lvl,
    prompt_extract_basic,
    system_msg=EXTRACTION_SYSTEM_MESSAGE,
    models=MODELS,
    expected_response_length=avg_response_len,
    verbose=False)

# DISPLAY Costs DataFrame
create_cost_matrix(costs_run_extract)

| | Model | Total Cost ($) | Cost Per Row ($) | Total Tokens | Input Cost ($) | Output Cost ($) |
|---|---|---|---|---|---|---|
| 0 | gpt-5-nano | $0.5322 | $0.0001 | 6584689 | $0.3292 | $0.2030 |
| 1 | gpt-4.1-nano | $0.8614 | $0.0002 | 6584689 | $0.6585 | $0.2030 |
| 2 | gpt-4o-mini | $1.2922 | $0.0003 | 6584689 | $0.9877 | $0.3045 |
| 3 | gpt-5-mini | $2.6610 | $0.0005 | 6584689 | $1.6462 | $1.0148 |
| 4 | gpt-4.1-mini | $3.4458 | $0.0007 | 6584689 | $2.6339 | $0.8119 |
| 5 | gpt-5 | $13.3051 | $0.0026 | 6584689 | $8.2309 | $5.0742 |


### Costs for `run_select()`

In [36]:
# LOAD Functions prompt creation & system prompt
from gptquery.estimation.cost_estimator import (estimate_costs_for_models, create_cost_matrix)
from gptquery.tools.tool_eulaw_citations.select_citations.prompts import (prompt_select_citations,
                                                                          SELECTION_SYSTEM_MESSAGE)
# MODELS for Cost Analysis
MODELS = ["gpt-4.1-mini", "gpt-4o-mini","gpt-4.1-nano","gpt-5-mini", "gpt-5-nano","gpt-5"]

avg_response_len = "A-X " * 50

# COST estimation
costs_run_select = estimate_costs_for_models(df_ref_lvl,
                                  prompt_select_citations,
                                  models=MODELS,
                                  system_msg=SELECTION_SYSTEM_MESSAGE,
                                  expected_response_length=avg_response_len,
                                  verbose=False)
# DISPLAY Costs DataFrame
create_cost_matrix(costs_run_select)

| | Model | Total Cost ($) | Cost Per Row ($) | Total Tokens | Input Cost ($) | Output Cost ($) |
|---|---|---|---|---|---|---|
| 0 | gpt-5-nano | $0.5262 | $0.0001 | 6464010 | $0.3232 | $0.2030 |
| 1 | gpt-4.1-nano | $0.8494 | $0.0002 | 6464010 | $0.6464 | $0.2030 |
| 2 | gpt-4o-mini | $1.2741 | $0.0003 | 6464010 | $0.9696 | $0.3045 |
| 3 | gpt-5-mini | $2.6309 | $0.0005 | 6464010 | $1.6160 | $1.0148 |
| 4 | gpt-4.1-mini | $3.3975 | $0.0007 | 6464010 | $2.5856 | $0.8119 |
| 5 | gpt-5 | $13.1543 | $0.0026 | 6464010 | $8.0800 | $5.0742 |
