# LLM-Assisted PRA COREP Reporting Assistant (Prototype)

This notebook demonstrates a constrained end-to-end prototype of an
LLM-assisted regulatory reporting assistant for the PRA COREP
Own Funds template.

The system performs:
- Semantic retrieval of PRA/CRR rules
- Context-grounded generation
- Structured reporting output
- Deterministic validation
- Audit traceability


In [23]:
!pip install sentence-transformers faiss-cpu transformers accelerate





In [1]:
documents = [
    {
        "id": "CRR_Article_26",
        "text": "Common Equity Tier 1 (CET1) capital consists primarily of ordinary shares, stock surplus, retained earnings, and accumulated other comprehensive income, subject to regulatory adjustments under the PRA Rulebook."
    },
    {
        "id": "CRR_Article_51",
        "text": "Additional Tier 1 (AT1) capital includes subordinated perpetual instruments that meet eligibility criteria, including permanence, loss absorbency, and absence of incentives to redeem."
    },
    {
        "id": "CRR_Article_62",
        "text": "Tier 2 capital includes subordinated debt instruments with an original maturity of at least five years that meet the regulatory eligibility requirements set out in the PRA Rulebook."
    },
    {
        "id": "CRR_Article_72",
        "text": "Total Own Funds shall be calculated as the sum of Common Equity Tier 1 capital, Additional Tier 1 capital, and Tier 2 capital, after applying all regulatory deductions and adjustments."
    }
]

print(f"Loaded {len(documents)} rule sections.")


Loaded 4 rule sections.


For demonstration purposes, this prototype ingests a constrained subset of Own Funds provisions derived from relevant CRR Articles (26, 51, 62, 72) as applied within the PRA Rulebook. The objective is to demonstrate end-to-end LLM-assisted regulatory reporting behaviour rather than full regulatory coverage.

In [26]:
from sentence_transformers import SentenceTransformer
import numpy as np

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

texts = [doc["text"] for doc in documents]
doc_embeddings = embedding_model.encode(texts)

print("Embedding shape:", doc_embeddings.shape)



BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Embedding shape: (4, 384)


In [27]:
import faiss

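dimension = doc_embeddings.shape[1]
# IndexFlatL2 performs exact (brute-force) search over squared L2 distances.
index = faiss.IndexFlatL2(dimension)
# FAISS expects float32 input.
index.add(np.asarray(doc_embeddings, dtype="float32"))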

print("FAISS index built with", index.ntotal, "documents.")


FAISS index built with 4 documents.
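To make the retrieval step concrete, the following is a minimal, dependency-free sketch of what an exact L2 index computes: the squared Euclidean distance from the query to every stored vector, followed by a sort. The function names and the toy two-dimensional vectors are illustrative assumptions, not part of FAISS or of this notebook's data.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_search(query_vec, doc_vecs, k=3):
    """Return (distance, index) pairs for the k nearest vectors, closest first."""
    k = min(k, len(doc_vecs))  # never ask for more hits than vectors exist
    scored = sorted((squared_l2(query_vec, vec), i) for i, vec in enumerate(doc_vecs))
    return scored[:k]


# Toy 2-dimensional "embeddings" (made-up values, not real model output).
toy_docs = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
hits = brute_force_search([0.9, 0.1], toy_docs, k=2)
print(hits)
```

With only four rule sections, exhaustive search of this kind is cheap; approximate indexes only become relevant for much larger rule corpora.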


In [28]:

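def retrieve_rules(query, k=3):
    """Return the k rule sections whose embeddings are closest to the query."""
    k = min(k, index.ntotal)  # FAISS pads results with -1 when k exceeds the index size
    query_embedding = embedding_model.encode([query])
    _, indices = index.search(np.asarray(query_embedding, dtype="float32"), k)

    # indices has shape (1, k); row 0 holds the hits for the single query
    return [documents[idx] for idx in indices[0]]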




In [29]:
query = "How is total own funds calculated under PRA rules?"

results = retrieve_rules(query)

for r in results:
    print(r["id"], "->", r["text"])


CRR_Article_72 -> Total Own Funds shall be calculated as the sum of Common Equity Tier 1 capital, Additional Tier 1 capital, and Tier 2 capital, after applying all regulatory deductions and adjustments.
CRR_Article_26 -> Common Equity Tier 1 (CET1) capital consists primarily of ordinary shares, stock surplus, retained earnings, and accumulated other comprehensive income, subject to regulatory adjustments under the PRA Rulebook.
CRR_Article_62 -> Tier 2 capital includes subordinated debt instruments with an original maturity of at least five years that meet the regulatory eligibility requirements set out in the PRA Rulebook.


In [30]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_name = "google/flan-t5-base"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

print("LLM loaded successfully.")





LLM loaded successfully.


In [31]:
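def generate_structured_response(query, scenario_text):
    """Retrieve supporting rules, build a grounded prompt, and return the raw model text."""

    retrieved = retrieve_rules(query)

    # Concatenate the retrieved rules into the prompt context and keep their IDs.
    context = ""
    rule_ids = []
    for r in retrieved:
        context += f"{r['id']}: {r['text']}\n"
        rule_ids.append(r["id"])

    # Caveats: the justification slots below are filled by retrieval rank, not by
    # which rule governs each field, and the example values in the format block
    # match the demo scenario, so the model can reproduce them by copying.
    prompt = f"""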
You are a regulatory reporting assistant.

Use ONLY the scenario values provided.
Do NOT invent new fields.
Do NOT add explanations.
Return ONLY valid JSON.

Regulatory Context:
{context}

Scenario:
{scenario_text}

Compute:
Total = CET1 + AT1 + Tier2

Return STRICT JSON in this format:

{{
  "template": "Own Funds",
  "fields": {{
      "CET1": 100,
      "AT1": 20,
      "Tier2": 10,
      "Total": 130
  }},
  "justification": {{
      "CET1": "{rule_ids[0]}",
      "AT1": "{rule_ids[1]}",
      "Tier2": "{rule_ids[2]}",
      "Total": "{rule_ids[0]}"
  }}
}}
"""

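    # truncation=True clips the prompt to the tokenizer's maximum input length,
    # so a long regulatory context can be cut off silently.
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)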
    outputs = model.generate(**inputs, max_new_tokens=300)

    return tokenizer.decode(outputs[0], skip_special_tokens=True)
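
Because the prompt fills the justification slots by retrieval rank, the attribution depends on the order in which rules happen to be returned; in the recorded run in this notebook, CET1 is attributed to CRR_Article_72. One alternative, sketched here under the assumption that each field maps to exactly one of the four loaded rule sections, is to assign justifications deterministically in Python and only flag whether retrieval surfaced the governing rule. `FIELD_TO_RULE` and `build_justification` are hypothetical names introduced for illustration.

```python
# Hypothetical mapping, derived from the four rule texts loaded earlier in this notebook.
FIELD_TO_RULE = {
    "CET1": "CRR_Article_26",
    "AT1": "CRR_Article_51",
    "Tier2": "CRR_Article_62",
    "Total": "CRR_Article_72",
}


def build_justification(retrieved_ids):
    """Map each field to its rule, flagging rules that retrieval did not return."""
    retrieved = set(retrieved_ids)
    return {
        field: {"rule_id": rule_id, "retrieved": rule_id in retrieved}
        for field, rule_id in FIELD_TO_RULE.items()
    }


# Rule IDs in the rank order returned for the demo query in this notebook.
justification = build_justification(
    ["CRR_Article_72", "CRR_Article_26", "CRR_Article_51"]
)
for field, info in justification.items():
    print(field, info)
```

A `retrieved` flag of `False` signals that the model never saw the rule governing that field, which is worth surfacing to a reviewer rather than hiding.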


In [32]:
import re
import json

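def clean_and_structure_output(raw_output):
    """Extract field values from raw model text and recompute the total in Python."""

    def extract_int(field):
        # The model output may not be valid JSON, so match '"<field>": <digits>' directly.
        match = re.search(rf'"{field}":\s*(\d+)', raw_output)
        return int(match.group(1)) if match else None

    # Missing components default to 0, which can mask extraction failures.
    cet1 = extract_int("CET1") or 0
    at1 = extract_int("AT1") or 0
    tier2 = extract_int("Tier2") or 0
    reported_total = extract_int("Total")

    # The total is recomputed deterministically rather than trusted from the model.
    total = cet1 + at1 + tier2

    structured = {
        "template": "Own Funds",
        "fields": {
            "CET1": cet1,
            "AT1": at1,
            "Tier2": tier2,
            "Total": total
        },
        "validation": {
            # True only if the model reported a total and it matches the recomputed sum.
            "total_correct": reported_total == total
        }
    }

    return structured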


In [33]:
# Recreate query and scenario

query = "How should own funds be reported?"

scenario = """
Bank reports:
CET1 capital = 100
AT1 capital = 20
Tier 2 capital = 10
"""

# Generate LLM output
response = generate_structured_response(query, scenario)

print("Raw LLM Output:")
print(response)


Raw LLM Output:
"template": "Own Funds", "fields":  "CET1": 100, "AT1": 20, "Tier2": 10, "Total": 130, "justification":  "CET1": "CRR_Article_72", "AT1": "CRR_Article_26", "Tier2": "CRR_Article_51", "Total": "CRR_Article_72"


In [34]:
structured_output = clean_and_structure_output(response)
print(json.dumps(structured_output, indent=2))


{
  "template": "Own Funds",
  "fields": {
    "CET1": 100,
    "AT1": 20,
    "Tier2": 10,
    "Total": 130
  },
  "validation": {
    "total_correct": true
  }
}


In [35]:
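def build_audit_log(query):
    """List the rules retrieved for a query.

    Retrieval is re-run here, so pass the same query that was used for generation.
    """

    retrieved = retrieve_rules(query)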

    audit_log = []

    for r in retrieved:
        audit_log.append({
            "rule_id": r["id"],
            "rule_text": r["text"]
        })

    return audit_log
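
An audit trail is easier to verify when it records the rules that were actually passed to the prompt, together with the raw model text, instead of repeating retrieval afterwards. The sketch below is one possible shape for such a record, using only the standard library; `make_audit_record`, its field names, and the placeholder rule text are assumptions for illustration, not part of the prototype.

```python
import hashlib
import json
from datetime import datetime, timezone


def make_audit_record(query, retrieved, raw_output):
    """Bundle one generation run into a JSON-serialisable audit record."""
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "query": query,
        "rules": [
            {
                "rule_id": r["id"],
                # A hash pins the exact rule wording that was shown to the model.
                "text_sha256": hashlib.sha256(r["text"].encode("utf-8")).hexdigest(),
            }
            for r in retrieved
        ],
        "raw_output": raw_output,
    }


# Placeholder inputs; in the notebook these would come from the generation step.
example_rules = [{"id": "CRR_Article_72", "text": "placeholder rule text"}]
record = make_audit_record(
    "How should own funds be reported?", example_rules, '"Total": 130'
)
print(json.dumps(record, indent=2))
```

Hashing the rule text means a later change to the rule corpus can be detected when an old report is reviewed.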



In [36]:
audit = build_audit_log(query)

for entry in audit:
    print("\nRule ID:", entry["rule_id"])
    print("Text:", entry["rule_text"])



Rule ID: CRR_Article_72
Text: Total Own Funds shall be calculated as the sum of Common Equity Tier 1 capital, Additional Tier 1 capital, and Tier 2 capital, after applying all regulatory deductions and adjustments.

Rule ID: CRR_Article_26
Text: Common Equity Tier 1 (CET1) capital consists primarily of ordinary shares, stock surplus, retained earnings, and accumulated other comprehensive income, subject to regulatory adjustments under the PRA Rulebook.

Rule ID: CRR_Article_51
Text: Additional Tier 1 (AT1) capital includes subordinated perpetual instruments that meet eligibility criteria, including permanence, loss absorbency, and absence of incentives to redeem.


In [37]:
def display_template(data):
    print("\nCOREP OWN FUNDS TEMPLATE (Extract)")
    print("----------------------------------")
    for field, value in data["fields"].items():
        print(f"{field:<10} | {value}")


## System Architecture Overview

User Query
→ FAISS-based semantic retrieval of regulatory rules
→ Context injection into LLM prompt
→ Structured JSON extraction
→ Deterministic validation (Python layer)
→ Template mapping
→ Audit log generation

The system intentionally separates:
- LLM reasoning (interpretation)
- Deterministic computation and validation (Python)

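Because the total is recomputed in Python rather than taken from the model, an arithmetic error in the generated text cannot reach the template, and the audit log ties each run to the rule text supplied as context. This reduces hallucination risk and improves traceability.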
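
The deterministic layer can be illustrated with a small standalone check. This is a minimal sketch, assuming the parsed fields arrive as a plain dictionary; `validate_own_funds` is a hypothetical helper, and the only rule it enforces is the sum stated in the prompt (Total = CET1 + AT1 + Tier2).

```python
def validate_own_funds(fields):
    """Return a list of human-readable validation errors (empty if all checks pass)."""
    errors = []
    for name in ("CET1", "AT1", "Tier2", "Total"):
        value = fields.get(name)
        if value is None:
            errors.append(f"{name} is missing")
        elif isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append(f"{name} is not numeric")
    if errors:
        return errors  # skip the arithmetic check when inputs are unusable

    expected = fields["CET1"] + fields["AT1"] + fields["Tier2"]
    if fields["Total"] != expected:
        errors.append(f"Total is {fields['Total']}, expected {expected}")
    return errors


print(validate_own_funds({"CET1": 100, "AT1": 20, "Tier2": 10, "Total": 130}))
# → []
print(validate_own_funds({"CET1": 100, "AT1": 20, "Tier2": 10, "Total": 140}))
# → ['Total is 140, expected 130']
```

Returning a list of errors, rather than a single boolean, lets a missing field be distinguished from an arithmetic mismatch.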



In [38]:
display_template(structured_output)



COREP OWN FUNDS TEMPLATE (Extract)
----------------------------------
CET1       | 100
AT1        | 20
Tier2      | 10
Total      | 130
