# POC-1: Cybersecurity Threat Intelligence Extraction

This notebook demonstrates automated extraction of structured threat intelligence from cybersecurity reports using the Yera AI agent framework. The system processes raw text from emails, reports, and briefings and outputs standardized JSON, including attack techniques mapped to the MITRE ATT&CK framework.

## Overview

- **Framework**: Yera (Python-based AI agent orchestration)
- **Purpose**: Extract technical indicators, MITRE techniques, and recommended actions from threat reports
- **Output**: Structured `ThreatReport` objects with validated schemas

## Key Features

- Comprehensive extraction of technical indicators (IPs, domains, file hashes, URLs)
- MITRE ATT&CK technique mapping
- Validation and type safety using Pydantic-style models
- Anti-hallucination safeguards for accurate intelligence extraction

In [1]:
import yera as yr
from typing import Optional
from pathlib import Path
from datetime import datetime

## Data Model Definitions

The following Pydantic-style models define the structure for extracted threat intelligence. These models ensure type safety and validation throughout the extraction pipeline.

### Model Hierarchy

1. **TechnicalIndicator**: Individual IOCs (Indicators of Compromise)
2. **MitreAttackTechnique**: ATT&CK framework techniques
3. **Incident**: Individual security incidents with full context
4. **CountryOverview**: Geographic impact summary
5. **ThreatReport**: Top-level report aggregating all intelligence

In [2]:
class TechnicalIndicator(yr.Struct):
    indicator_type: str
    value: str

class MitreAttackTechnique(yr.Struct):
    tactic: Optional[str] = None
    # Possible extension: validate with yr.Field(pattern=r'^T\d{4}(\.\d{3})?$')
    # and mark the field for "is in prompt" checks.
    technique_id: str
    technique_name: str
    description: Optional[str] = None

class Incident(yr.Struct):
    incident_date: Optional[str] = None
    target: Optional[str] = None
    country: Optional[str] = None
    sector: Optional[str] = None
    attack_methods: list[str] = []
    technical_indicators: list[TechnicalIndicator] = yr.Field(min_length=0)
    mitre_attack_techniques: list[MitreAttackTechnique] = yr.Field(min_length=0)

class CountryOverview(yr.Struct):
    country: Optional[str] = None
    incidents: Optional[int] = None
    sector: Optional[str] = None
    ransomware_variant: Optional[str] = None
    estimated_impact_usd: Optional[str] = None

class ThreatReport(yr.Struct):
    date: Optional[datetime] = None
    threat_level: Optional[str] = yr.Field(pattern=r'^(LOW|MEDIUM|HIGH|CRITICAL|NOTSPECIFIED)$')
    threat_actor: Optional[str] = None
    total_incidents: Optional[int] = None
    total_impact_usd: Optional[str] = None
    affected_countries_count: Optional[int] = None
    country_overview: list[CountryOverview] = []
    incidents: list[Incident] = yr.Field(min_length=1)
    immediate_actions: list[str] = []
    short_term_actions: list[str] = []
    long_term_initiatives: list[str] = []
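
The technique ID pattern noted in the comment on `MitreAttackTechnique.technique_id` can be exercised on its own. The following is a minimal sketch using only the standard library `re` module. `is_valid_technique_id` is a hypothetical helper, not part of Yera, and it checks only the shape of the ID, not whether the technique exists in ATT&CK.

```python
import re

# Same pattern as in the comment on MitreAttackTechnique.technique_id:
# "T" + four digits, optionally followed by "." + three digits (sub-technique).
TECHNIQUE_ID_RE = re.compile(r"^T\d{4}(\.\d{3})?$")


def is_valid_technique_id(technique_id: str) -> bool:
    """Return True if the string has the shape T#### or T####.### (hypothetical helper)."""
    return TECHNIQUE_ID_RE.fullmatch(technique_id) is not None


print(is_valid_technique_id("T1566.002"))  # True
print(is_valid_technique_id("T1566"))      # True
print(is_valid_technique_id("T1566.2"))    # False
```

A check like this could run after extraction to flag malformed IDs without changing the schema.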

## System Prompt Configuration

The system prompt defines the extraction behavior for the AI agent. Key principles:

- **Accuracy over completeness**: Only extract explicitly stated information
- **No hallucination**: Avoid inferring tactics or details not present in source
- **Comprehensive IOC extraction**: Capture every technical indicator mentioned
- **Strict MITRE mapping**: Only include tactics when explicitly stated

The prompt includes detailed examples and edge cases to guide extraction quality.

In [3]:
SYS_PROMPT = """Threat Intelligence Extraction Agent

You are a precise threat intelligence analyst that extracts structured information from cybersecurity reports, emails, and threat briefings. Your primary goal is ACCURACY over completeness - only extract information that is explicitly stated in the document.

## Core Principles

1. **Never hallucinate or infer** - If information is not explicitly present, leave it as None or []
2. **Extract only what you see** - Do not interpret, expand, or generate content
3. **Be conservative** - When in doubt, omit rather than guess
4. **Preserve exact wording** - Copy technical indicators, technique names, and descriptions verbatim
5. **Handle ambiguity gracefully** - If a field is unclear or ambiguous, set it to None

## Extraction Guidelines

### ThreatReport
- **date**: Extract from email "Date:" header (e.g., "January 1, 2026 7:21 PM"). If not found, set to None.
- **threat_level**: Only extract if explicitly stated as one of: LOW, MEDIUM, HIGH, CRITICAL. Do not infer from words like "urgent".
- **threat_actor**: Extract only if named (e.g., "APT-DARKECHO", "Lazarus Group"). Generic descriptions are not threat actor names.
- **total_incidents**: Only extract if a specific total number is mentioned.
- **total_impact_usd**: Extract the total estimated financial impact if stated. Preserve the format (e.g., "$2.3 billion USD", "$847M").
- **affected_countries_count**: Count distinct countries mentioned, but only if the document provides this information.

### Incident (Case Study)
- **incident_date**: Extract as written (e.g., "December 8, 2025", "Nov 22, 2025")
- **target**: Extract if explicitly named (company name, organization)
- **country**: Extract if stated
- **sector**: Extract industry/sector if explicitly mentioned (e.g., "finance", "healthcare", "finance/accounting teams")
- **attack_methods**: Extract HIGH-LEVEL attack methodology (2-5 items)
  ✅ Good: "Business Email Compromise targeting wire transfers", "Credential harvesting via phishing links"
  ❌ Bad: Email subject lines, specific wording, UI details
  If document has "Attack Vector" or "Methodology" sections, extract those
- **technical_indicators**: MUST extract ALL indicators present - see indicator type mapping below
- **mitre_attack_techniques**: See MITRE extraction rules below

### TechnicalIndicator - COMPREHENSIVE EXTRACTION

**CRITICAL**: Extract EVERY technical indicator mentioned in the document. Map to these types:

| Indicator Type | Examples | What to Extract |
|---------------|----------|-----------------|
| **Domain** | example[.]com, malicious-site.net | Any domain name, including defanged (with [.]) |
| **URL** | hxxps://malicious[.]com/login | Full URLs, including defanged (hxxps, [.]) |
| **IPv4** | 192.168.1.1, 45.79.143.26 | Any IP address |
| **SHA256** | 7a3c9f2e5b8d... | File hashes labeled as SHA256 |
| **MD5** | 5d41402abc4b... | File hashes labeled as MD5 |
| **SHA1** | aaf4c61ddcc5... | File hashes labeled as SHA1 |
| **File Path** | C:\\Users\\...\\malware.exe | Any file system path |
| **File Name** | invoice.pdf.exe | Suspicious file names |
| **Email Address** | attacker@evil.com | Email addresses |
| **Registry Key** | HKLM\\SOFTWARE\\... | Windows registry keys |
| **Mutex** | Global\\MalwareMutex | Mutex names |

**Defanging**: Preserve indicators exactly as written. If defanged (hxxps, [.]), keep the defanged format.

**Indicator Type Determination**:
- Use explicit labels if provided (e.g., "Malicious IP", "SHA-256")
- Otherwise, infer type from format: domains have [.], URLs start with hxxp/http, IPs are numeric, hashes are 32/40/64 hex chars, etc.
- Use the table above for standard type names

### MitreAttackTechnique - STRICT EXTRACTION RULES

**CRITICAL ANTI-HALLUCINATION RULE**: 
- **tactic**: ONLY extract if explicitly stated in the document. If the document lists technique IDs but NOT tactics, set tactic to None or empty string.
- **technique_id**: Must be explicitly listed and match pattern T#### or T####.###
- **technique_name**: Copy exactly as written
- **description**: Copy from document if present, otherwise use technique_name or set to None

**Common mistake**: Do NOT look up or infer tactics from technique IDs. If document shows:
```
T1566.002 - Phishing: Spearphishing Link
```
And does NOT specify the tactic, then extract with tactic=None or tactic="".

### Recommended Actions - CATEGORY RULES
CRITICAL: Each action should appear in ONLY ONE category (no duplicates)

- **immediate_actions**: Urgent tasks for next 24-48 hours
  Triggers: "warn", "alert", "immediately", "urgent", "now"
  Example: "warn your finance team"
  
- **short_term_actions**: Tasks for next 1-2 weeks
  Triggers: "review", "update", "implement", "next 7 days", "short-term"
  Example: "Update email filtering rules to block sender domains"
  
- **long_term_initiatives**: Strategic/ongoing initiatives
  Triggers: "strategy", "program", "long-term", "develop", "establish"
  Example: "Implement security awareness training program"

Each action should be a complete sentence or bullet point as written in the document.

## Document Type Handling

### Email Format
- **Date extraction**: ALWAYS check the email header for the "Date:" field
  Example: "Date: January 1, 2026 7:21 PM" → date: "January 1, 2026 7:21 PM"
- **Sender**: Extract from "From:" field
- **Subject**: Extract from "Subject:" field
- **Target/Sector**: Look for phrases like "targeting X teams", "affecting Y sector"
- **Indicators**: Often appear in bulleted sections or labeled lists
- **MITRE techniques**: Often listed with just IDs and names, rarely with tactics in informal emails
- Look for threat level in subject line or opening paragraphs
- Case studies often appear as numbered or titled sections

### PDF/Word Reports
- Check executive summary for high-level metrics
- Look for tables for country overviews and IOCs
- MITRE ATT&CK information often appears in dedicated sections with full tactic/technique/procedure details

## Quality Checks

Before finalizing extraction:
1. **Did I extract EVERY technical indicator mentioned?** (domains, IPs, URLs, hashes, file paths)
2. **Did I avoid hallucinating MITRE tactics if they weren't explicitly stated?**
3. **Did I extract the email date from the header?**
4. **Are all technical indicators exact copies from the document?**
5. **Did I check for target/sector information in descriptive text?**
6. **Would another analyst extract the same information?**

## Example Scenarios

**Scenario 1**: Email lists "T1566.002 - Phishing: Spearphishing Link" without mentioning tactics
-> Extract with `tactic: None`, `technique_id: "T1566.002"`, `technique_name: "Phishing: Spearphishing Link"`

**Scenario 2**: Document shows defanged domain: "malicious[.]com"
-> Extract exactly as: `indicator_type: "Domain"`, `value: "malicious[.]com"`

**Scenario 3**: Document lists IPs, domains, URLs, and hashes
-> Create TechnicalIndicator entries for EACH ONE with appropriate type

**Scenario 4**: Email says "targeting finance teams"
-> Extract `sector: "finance"` or `sector: "finance teams"`

**Scenario 5**: Email subject says "URGENT" but doesn't specify threat level
-> Set `threat_level: None` (don't infer "CRITICAL" from "URGENT")

## Output Format

Return the extracted information as a valid ThreatReport object. All fields follow the structure defined in your schema. 

**Remember**: 
- Extract EVERY indicator (domains, IPs, URLs, hashes, paths)
- NEVER hallucinate MITRE tactics
- Accuracy over completeness
"""

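The prompt asks the model to infer an indicator's type from its format when no explicit label is given. The same rules can be approximated in plain Python as a cross-check on the model's `indicator_type` values. This is a rough sketch, not part of Yera: `guess_indicator_type` and the regular expressions are assumptions made for illustration, and they cover only the URL, IPv4, hash, email, and domain rows of the table.

```python
import re
from typing import Optional

# Hex digest length -> type name used in the prompt's table.
HASH_TYPES = {32: "MD5", 40: "SHA1", 64: "SHA256"}

IPV4_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")
HEX_RE = re.compile(r"^[0-9a-fA-F]+$")
# Dot-separated labels ending in an alphabetic label of two or more letters.
DOMAIN_RE = re.compile(r"^[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}$")


def guess_indicator_type(value: str) -> Optional[str]:
    """Guess an indicator type from its format; None when no rule matches."""
    # Undo "[.]" defanging for matching only; the stored value stays as written.
    plain = value.strip().replace("[.]", ".")
    if plain.lower().startswith(("http://", "https://", "hxxp://", "hxxps://")):
        return "URL"
    if IPV4_RE.fullmatch(plain) and all(int(p) <= 255 for p in plain.split(".")):
        return "IPv4"
    if HEX_RE.fullmatch(plain) and len(plain) in HASH_TYPES:
        return HASH_TYPES[len(plain)]
    if "@" in plain:
        return "Email Address"
    if DOMAIN_RE.fullmatch(plain):
        return "Domain"
    return None


print(guess_indicator_type("malicious[.]com"))                # Domain
print(guess_indicator_type("hxxps://malicious[.]com/login"))  # URL
print(guess_indicator_type("192.0.2.10"))                     # IPv4
```

Shape alone cannot separate every type: a file name such as `invoice.pdf.exe` also matches the domain rule, so explicit labels in the document should take precedence, as the prompt states.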
## Agent Definition

The `entity_extractor` agent is decorated with `@yr.agent` and configured with our system prompt. The agent:

1. Takes raw text input (from emails, PDFs, reports)
2. Calls `ThreatReport.fill(text)` to perform structured extraction
3. Returns a validated ThreatReport object

The `fill()` method uses LLM-powered extraction with schema validation.

In [4]:
@yr.agent(sys_prompt=SYS_PROMPT)
def entity_extractor(text: str) -> ThreatReport:
    report = ThreatReport.fill(text)
    return report
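
The "is in prompt" check mentioned in the model definitions can be approximated after extraction by confirming that each extracted value occurs verbatim in the source text. The sketch below is one possible form of such a check, not a Yera feature. `ungrounded_values` is a hypothetical helper that works on a plain dictionary shaped like the `ThreatReport` JSON, and the sample text and report are invented for illustration.

```python
def ungrounded_values(source_text: str, report: dict) -> list[str]:
    """List indicator values and technique IDs that never appear verbatim in the source."""
    missing = []
    for incident in report.get("incidents", []):
        for indicator in incident.get("technical_indicators", []):
            if indicator["value"] not in source_text:
                missing.append(indicator["value"])
        for technique in incident.get("mitre_attack_techniques", []):
            if technique["technique_id"] not in source_text:
                missing.append(technique["technique_id"])
    return missing


# Hypothetical source text and extraction result, for illustration only.
source = "Block 192.0.2.10 and malicious[.]com. Technique observed: T1190."
report = {
    "incidents": [
        {
            "technical_indicators": [
                {"indicator_type": "IPv4", "value": "192.0.2.10"},
                # Refanged by the model, so it no longer matches the source.
                {"indicator_type": "Domain", "value": "malicious.com"},
            ],
            "mitre_attack_techniques": [
                {"technique_id": "T1190", "technique_name": "Exploit Public-Facing Application"}
            ],
        }
    ]
}

print(ungrounded_values(source, report))  # ['malicious.com']
```

With the notebook's objects, the dictionary could be obtained with `json.loads(res.model_dump_json())`. Substring matching is deliberately coarse: it catches invented or reformatted values, but a short value can still match inside a longer, unrelated string.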
    

## Test Data

For testing, we'll use sample cybersecurity emails from the `data/cybersecurity/` directory. These contain:

- Threat actor information
- Technical indicators (IPs, domains, hashes)
- MITRE ATT&CK techniques
- Recommended actions

The test file `cybersecurity-email-8.md` contains a simulated security advisory.

## Execution

Running the agent on a sample cybersecurity email to extract structured intelligence.

In [6]:
# Read the sample advisory and run the extraction agent on its full text.
with open(Path.cwd() / "data" / "cybersecurity" / "cybersecurity-email-8.md", "r", encoding="utf-8") as f:
    res = entity_extractor(f.read())

Starting entity_extractor:
Agent: entity_extractor
Module: __main__
Identifier: __main__.entity_extractor

Parameters (1):
  - text: <class 'str'>

Return Type: ThreatReport


[entity_extractor]
[SYSTEM_PROMPT]
Threat Intelligence Extraction Agent

You are a precise threat intelligence analyst that extracts structured information from cybersecurity reports, emails, and threat briefings. Your primary goal is ACCURACY over completeness - only extract information that is explicitly stated in the document.

## Core Principles

1. **Never hallucinate or infer** - If information is not explicitly present, leave it as None or []
2. **Extract only what you see** - Do not interpret, expand, or generate content
3. **Be conservative** - When in doubt, omit rather than guess
4. **Preserve exact wording** - Copy technical indicators, technique names, and descriptions verbatim
5. **Handle ambiguity gracefully** - If a field is unclear or ambiguous, set it to None

## Extra

In [7]:
print(res.model_dump_json(indent=2))

{
  "date": "1101-01-01T00:00:00Z",
  "threat_level": null,
  "threat_actor": null,
  "total_incidents": null,
  "total_impact_usd": null,
  "affected_countries_count": null,
  "country_overview": [],
  "incidents": [
    {
      "incident_date": null,
      "target": null,
      "country": null,
      "sector": null,
      "attack_methods": [
        "Active exploitation of unpatched web apps",
        "SQL injection in popular CMS plugin (CVE-2024-45789)",
        "Web shells deployed on compromised servers",
        "Post-exploitation activity including local account creation and use (admin2, webmaster, support)"
      ],
      "technical_indicators": [
        {
          "indicator_type": "IPv4",
          "value": "159.203.45.187"
        },
        {
          "indicator_type": "IPv4",
          "value": "167.99.218.72"
        },
        {
          "indicator_type": "IPv4",
          "value": "134.209.156.41"
        },
        {
          "indicator_type": "IPv4",
          "