# Key Prompts

In [None]:
rel_type_definition='''
###Main Topic:Attack and Compromise Relationships###
Name:'exploits'
Definition:Represents an entity leveraging a specific flaw or weakness within another entity (typically a Vulnerability) to achieve a malicious objective.
Note:This is a more specific form of 'uses'. When the action involves taking advantage of a known vulnerability, 'exploits' should be used instead of 'uses'.
Example:The malware exploits the CVE-2021-44228 vulnerability.

Name:'bypasses'
Definition:Indicates that an offensive entity (e.g., malware, exploit) successfully evades or circumvents a defensive measure. Where 'mitigates' (e.g., patching a vulnerability) describes a successful action by defenders, 'bypasses' (e.g., using obfuscation techniques to evade a sandbox) describes a successful action by attackers. It is used specifically for the behavior of circumventing defensive measures.
Example:The malware's obfuscation technique bypasses sandbox analysis.

Name:'malicious-investigates-track-detects'
Definition:Represents a malicious action where one entity (typically malware or a tool) performs either a discrete investigation, continuous tracking, or active detection of another entity to gather information or for evasive purposes.
Note:This relationship now covers three types of malicious information gathering and reconnaissance: Investigating: One-time reconnaissance of an entity (e.g., a system scan). Tracking: Long-term, continuous surveillance of an entity (e.g., keystroke logging). Detecting (Malicious): Evasion-focused discovery, such as identifying a sandbox, debugger, or specific security tool to alter behavior.
Example:Example 1 (Investigates): A malware implant malicious-investigates-track-detects local system configuration files. Example 2 (Tracks): A spyware module malicious-investigates-track-detects the user's web browsing history. Example 3 (Detecting): The malware malicious-investigates-track-detects the presence of a virtual machine environment.

Name:'impersonates'
Definition:Indicates that one entity actively masquerades as another, distinct entity to deceive or gain trust.
Note:This is distinct from 'alias-of'. 'impersonates' is a deceptive action between two separate entities. In contrast, 'alias-of' links two different names for the very same entity. For example, a hacker 'impersonates' the CEO in an email, whereas "APT28" is an 'alias-of' "Fancy Bear."
Example:A threat actor impersonates a trusted IT administrator to trick users.

Name:'targets'
Definition:Describes an offensive entity directing its actions against another entity. It expresses the intent and direction of an attack.
Note:'targets' describes intent, while 'compromises' describes a successful outcome. An actor might 'target' the financial industry for years but 'compromise' a specific bank in a single operation. 'targets' is also broader than 'exploits'; an actor can 'target' an organization, whereas they 'exploit' a specific vulnerability within that organization's systems.
Example:A phishing campaign targets employees in the financial sector.

Name:'compromises'
Definition:Represents that an offensive entity has successfully violated the confidentiality, integrity, or availability of a target, achieving some form of unauthorized access or control.
Note:See the note under 'targets' for a direct comparison.
Example:The threat actor compromised the company's domain controller.

Name:'leads-to'
Definition:Describes a causal relationship where one entity or event directly results in another outcome or state, often used in attack chains. Relationships such as 'exploits', 'delivers', and 'executes' are all “points” in the attack chain, and 'leads-to' is the “line” connecting these points, clearly showing the logic of “vulnerability exploit leads to remote code execution”. When a relationship fits one of the more specific relationships 'exploits', 'delivers', or 'executes', choose that relationship instead of 'leads-to'.
Example:Exploitation of a vulnerability leads-to remote code execution.

###Main Topic:Data and Payload Movement###
Name:'drops'
Definition:Represents an entity creating a new file on the local filesystem from its own embedded or internal resources.
Note:This relationship exclusively describes the action of Local -> Local file creation, with no network communication involved. This is distinct from 'downloads', which is an External -> Local action.
Example:The installer drops a malicious DLL file into the System32 folder.

Name:'downloads'
Definition:Represents an entity retrieving a file or data from an external, remote source and saving it to the local system.
Note:This relationship exclusively describes the action of External -> Local data transfer. It is the direct opposite of 'drops', which involves no network communication.
Example:The dropper downloads a second-stage payload from a malicious URL.

Name:'executes'
Definition:Signifies that one entity (e.g., a loader, script) runs or initiates another entity (e.g., a malicious executable).
Example:A dropper executes a second-stage payload.

Name:'delivers'
Definition:Represents a higher-level, abstract relationship where one attack component is responsible for 'bringing' a malicious payload or tool to the target environment.
Note:This describes the abstract "bringing" action within an attack chain, answering "How did the payload get here?" at a tactical level. For example, a phishing email 'delivers' malware; this delivery might be achieved when the user 'downloads' an attachment, which then 'drops' an executable.
Example:A phishing campaign delivers the Ursnif malware.

Name:'beacons-to'
Definition:Specifically indicates that malware or an implant periodically sends 'beacon' or 'heartbeat' signals to its Command and Control (C2) server.
Example:Malware beacons-to a Command and Control URL.

Name:'exfiltrate-to'
Definition:Specifically describes the act of stealing data from a compromised system and transmitting it outward to a target location specified by the attacker, such as a server or IP address.
Note:The core of this relationship is purposeful, outbound data transmission. Its distinction from other network relationships lies in intent and direction: (1) Versus 'communicates-with': 'exfiltrate-to' is a specific type of 'communicates-with'. If the purpose of the communication is confirmed to be data theft, 'exfiltrate-to' should be preferred for more precise semantics. If the purpose is unknown, the more general 'communicates-with' should be used. (2) Versus 'downloads': The data flow direction is the opposite of 'downloads'. 'downloads' refers to fetching files from an external source into the victim system, while 'exfiltrate-to' refers to uploading data from the victim system to an external source. (3) Versus 'leaks': 'exfiltrate-to' typically describes a targeted, covert transfer from a victim to an attacker. In contrast, 'leaks' (if used as a custom relationship) usually refers to a broader, potentially public or semi-public data disclosure.
Example:A spyware implant (Malware) exfiltrate-to a specific FTP server (Infrastructure) to upload stolen documents.

Name:'leaks'
Definition:Represents the unauthorized disclosure or public release of sensitive resources. This includes confidential data (e.g., documents, credentials) as well as operational assets like malware source code or vulnerability details. 
Note:'exfiltrate-to' (malware steals data to a server) describes the directed transfer of data from the victim to the attacker. 'leaks' (internal threat actors leak company documents) describes the unauthorized public or semi-public disclosure of sensitive resources (data, source code, etc.). The core difference between the two lies in the direction of information flow and the degree of disclosure.
Example:An insider threat leaks confidential corporate documents online. The source code for a prominent banking trojan leaks onto a public repository.

Name:'communicates-with'
Definition:Describes the occurrence of network communication between two entities. It is a general relationship for network interactions.
Note:'beacons-to', 'downloads', and 'exfiltrate-to' are all specific types of 'communicates-with'. If the traffic is a periodic heartbeat, 'beacons-to' is more precise. If the purpose is to retrieve a file, use 'downloads'. If it is to send data out, use 'exfiltrate-to'. Use 'communicates-with' for general descriptions or when the specific purpose is unknown.
Example:The implant communicates-with a C2 server every hour.

###Main Topic:Infrastructure and Provisioning###
Name:'resolves-to'
Definition:A specific technical relationship describing a domain name being resolved to one or more IP addresses via the Domain Name System (DNS).
Note:See the note under 'hosts'. This relationship is a core technical link for establishing network infrastructure associations.
Example:The malicious domain evil-phishing.com resolves-to the IP address 198.51.100.10.

Name:'hosts'
Definition:Indicates that an infrastructure entity 'carries' or provides the runtime environment for another object, such as a malicious payload, website, or C2 service.
Note:This relationship describes 'carrying' at the infrastructure level. It is distinct from 'delivers', which describes a tactical action, and 'provides', which is more general. For example, a server 'hosts' a malware file, which a victim then 'downloads' after a phishing link 'delivers' it.
Example:A bulletproof hosting provider hosts malware command and control servers.

Name:'provides'
Definition:A general relationship where one entity supplies another with a resource, service, or capability.
Note:This is the most abstract supply relationship and should be used when a more specific term is not applicable. Follow the priority: use 'delivers' for tactical delivery or 'hosts' for infrastructure hosting first. Use 'provides' only when the relationship is more general than these options.
Example:A bulletproof hosting service provides infrastructure for a phishing campaign.

###Main Topic:Attribution and Association###
Name:'authored-by'
Definition:Defines the creator or development source of an entity, such as malware, a tool, a report, or an attack pattern. It is used to trace the provenance of an object.
Note:The core of this relationship is to clarify "who created it". It has key distinctions from other relationships: (1) Versus 'attributed-to': 'authored-by' focuses on the act of creation itself, while 'attributed-to' focuses on assigning responsibility for an attack campaign. A tool can be 'authored-by' one organization, while the campaign that uses the tool is 'attributed-to' another group. (2) Versus 'owns': 'owns' describes the state of ownership over infrastructure or tools, while 'authored-by' describes their creation source.
Example:A custom backdoor malware (Malware) authored-by the Lazarus Group (Identity).

Name:'owns'
Definition:Describes a real-world entity (e.g., an organization, team, or an individual) having ownership or de facto dominion over another entity (e.g., infrastructure, a domain name, or a tool).
Note:The core of this relationship is ownership by a real-world entity. Its distinction from 'controls' lies in the nature of the subject: the subject of 'owns' is a real-world entity (a team, an individual), while the subject of 'controls' is software. This is a critical distinction as it separates the real-world actor from their digital-world proxy tools.
Example:The APT41 group (Identity) owns the domain name evil-domain.com and the C2 server.

Name:'controls'
Definition:Specifically describes the relationship where one software entity (e.g., a trojan, backdoor, RAT) commands and controls another software entity (e.g., a hijacked process, a browser plugin).
Note:The core of this relationship is software-level control. Its key distinction from 'owns' is the level of the controller: the subject of 'controls' is a piece of software (e.g., a RAT), while the subject of 'owns' is a real-world entity (e.g., a team). For example, a team can 'own' a domain name, and the RAT program on the C2 server pointed to by that domain then 'controls' another process on the victim host.
Example:A Remote Access Trojan (RAT) controls a compromised browser process to steal cookies.

Name:'attributed-to'
Definition:Formally assigns the responsibility for a threat activity, such as an Intrusion Set or Campaign, to one or more Threat Actors. This is typically the conclusion derived from intelligence analysis and attribution efforts.
Note:It differs from 'authored-by' and 'affiliated-with'. 'attributed-to' focuses on the responsibility for an attack, while 'authored-by' pertains to the creation of an entity, like malware. An organization might 'author' a tool, but if another affiliated group uses it in an attack, the attack activity is 'attributed-to' the latter. 'affiliated-with' describes a broader organizational or social connection (e.g., membership, employment), whereas 'attributed-to' is a specific assignment of culpability for an action.
Example:Intrusion Set "Sandworm" is attributed-to Russian GRU Unit 74455.

Name:'affiliated-with'
Definition:Describes an affiliation, employment, or membership relationship between individuals and organizations. 'authored-by' refers to the creation relationship, 'attributed-to' refers to the responsibility for the attack, and 'owns' refers to the ownership of the infrastructure. 'affiliated-with' describes an 'affiliation' relationship at the organizational or social level, which is not necessarily creation, attack, or ownership. When a relationship fits the more specific 'attributed-to' or 'owns', choose that relationship instead of 'affiliated-with'.
Example:A security researcher is affiliated-with a university.

Name:'cooperates-with'
Definition:Describes active, non-hierarchical collaboration between two or more peer entities, such as threat groups working together. 'affiliated-with' describes an affiliation. 'cooperates-with' (threat A cooperates with threat B) describes a collaborative relationship between peer entities.
Example:Threat Actor A cooperates-with Threat Actor B in a joint operation.

###Main Topic:Composition, Capability and State###
Name:'is-part-of'
Definition:Used when one entity is a component, member, or constituent of a larger entity. It is the inverse of 'consists-of'.
Example:A malicious module is-part-of a larger malware family.

Name:'consists-of'
Definition:Describes the compositional relationship where a complex entity is made up of its structural subcomponents.
Note:This relationship should be used to detail an object's "bill of materials" or internal architecture. It is distinct from 'has', which is used to attribute abstract features or capabilities rather than constituent parts. Use 'consists-of' to answer the question, "What is it made of?"
Example:The TrickBot malware framework consists-of numerous distinct modules, such as a password grabber and a VNC module.

Name:'has'
Definition:Indicates that an entity possesses a specific feature, function, or capability, which may be abstract in nature.
Note:This relationship is best used for attributing characteristics or functions to an object. It differs from 'consists-of', which is used for deconstructing an object into its physical or logical components. Use 'has' to answer the question, "What can it do?" or "What properties does it possess?"
Example:A backdoor Trojan has a persistence capability.

Name:'depends-on'
Definition:Signifies that one entity requires another entity to exist or function correctly.
Note:This describes a state of prerequisite or dependency. It differs from uses, which describes an action. For example, malware uses PowerShell to execute commands, but it depends-on a specific library to run. It covers terms like requires and is required for.
Example:A malware depends-on a specific version of the .NET Framework.

Name:'creates-or-generates'
Definition:An entity dynamically creates or generates another entity, such as a file, process, or data.
Note:This is more general than 'authored-by' (which is about original creation by an identity) and 'drops' (which is specific to malware placing a file). It describes the runtime action of creation. It covers terms like create, creates, and generates. If the relationship is more specific, such as a malware creating a file, use 'drops' instead; if it is about the original creation by an identity, use 'authored-by'. Otherwise, use 'creates-or-generates' to capture the action of creation or generation in a broader sense.
Example:A malware creates-or-generates a new registry key. A malware creates-or-generates notification popups.

Name:'modifies-or-removes-or-replaces'
Definition:Indicates that an entity alters, replaces, or removes another entity or its components, such as changing a registry key.
Example:A ransomware modifies (modifies-or-removes-or-replaces) the Master Boot Record.

Name:'uses'
Definition:Represents that an entity employs or leverages another entity to achieve its objectives. It is a highly general, active relationship describing "A uses B to do something."
Note:Differentiated from 'depends-on' and 'exploits'. 'uses' is an active behavior (e.g., malware uses PowerShell to execute commands), while 'depends-on' is a static, prerequisite state (e.g., the malware's execution depends-on the .NET Framework). 'exploits' is a special case of 'uses' that specifically involves leveraging a 'vulnerability'; if a vulnerability is leveraged, 'exploits' should be preferred.
Example:Threat Actor APT41 uses the Cobalt Strike framework.

###Main Topic:Classification and Lineage###
Name:'variant-of'
Definition:Indicates that one entity is a direct evolutionary version of another, typically sharing a lineage in code or core functionality.
Note:This is distinct from 'derived-from' and 'compares-to'. 'variant-of' implies direct derivation, often at the code level (e.g., the Zeus malware has countless 'variants'). 'derived-from' is more abstract, signifying conceptual or technical inspiration without direct code reuse. 'compares-to' is for a general comparison of attributes without implying any lineage.
Example:The Gootkit malware is a variant-of the earlier Gozi trojan.

Name:'derived-from'
Definition:Indicates that an entity is conceptually, technically, or philosophically inspired by or based on another, but is not a direct code-level evolution.
Note:See the note under 'variant-of'. 'derived-from' represents a more abstract, "intellectual lineage" relationship.
Example:The techniques used in the Triton malware were derived-from the know-how developed for the Stuxnet attack.

Name:'alias-of'
Definition:Indicates that one entity is an alternative name or identifier for another.
Note:This provides a direct and explicit way to link known aliases, which is more specific than the broader compares-to relationship. It is a bidirectional relationship. This covers terms like has alias and is alias of.
Example:APT28 alias-of Fancy Bear.

Name:'compares-to'
Definition:Indicates a comparative relationship between two entities based on their features, behavior, complexity, or other attributes. 'variant-of' means two entities have a direct evolution or code-derived variant relationship, while 'compares-to' is broader and can include any form of comparison. When a relationship meets both criteria, 'variant-of' should be used instead.
Example:Malware A compares-to Malware B in its propagation method.

Name:'categorized-as'
Definition:Links an entity to its formal classification or type within a given taxonomy. 'variant-of' is a specific evolutionary classification. 'categorized-as' is a more formal, ontological classification relationship, for example, used to link an instance to a category in a taxonomy.
Example:The threat activity is categorized-as a form of ransomware attack.

###Main Topic:Geographic Relationships###
Name:'located-at'
Definition:Specifies the current or known geographic location of an entity.
Note:This is distinct from 'originates-from'. 'located-at' refers to the present location, while 'originates-from' refers to the place of origin or provenance. For example, a threat actor may 'originates-from' Iran, but the server they use is 'located-at' a data center in the Netherlands.
Example:A command and control server is located-at a data center in Germany.

Name:'originates-from'
Definition:Specifies the place of origin or provenance of an entity.
Note:See the note under 'located-at' for a direct comparison.
Example:The Stuxnet malware is believed to originate-from the United States and Israel.

###Main Topic:Analysis and Defense Relationships###
Name:'indicates'
Definition:Represents an inferential relationship where the presence of one entity (typically an Indicator) serves as evidence or a sign of another threat entity. It expresses that "if A is observed, it likely signifies that B exists or is occurring."
Note:The core of this relationship is analytical inference. It is distinct from the 'detecting' function within other relationships (e.g., 'research-describes-analysis-of-characterizes-detects'). The 'detecting' function represents an active, confirmed discovery, whereas 'indicates' represents a probabilistic link ("this likely means that"). This relationship is fundamental for operationalizing threat intelligence, as it directly connects a detectable artifact (the IOC) to the threat it helps to identify.
Example:An IP address (indicator) indicates a malware.

Name:'mitigates'
Definition:Indicates that a defensive measure or Course of Action effectively counters, reduces, or remediates the threat posed by an Attack Pattern, Vulnerability, or Malware.
Note:This is the inverse of 'bypasses'. 'mitigates' is a successful action for the defender (e.g., a patch mitigates a vulnerability), whereas 'bypasses' is a successful action for the attacker (e.g., an obfuscation technique bypasses a sandbox).
Example:Applying the MS17-010 patch mitigates the EternalBlue exploit.

Name:'based-on'
Definition:Indicates that an object (e.g., report, indicator, signature) is derived from or based on the information or analysis of another object (e.g., observed data, another report, malware sample).
Example:An Indicator is based-on Observed Data.

Name:'research-describes-analysis-of-characterizes-detects'
Definition:A comprehensive research and defense relationship that signifies a document describing a subject, an actor analyzing a subject, a formal analysis object characterizing a subject's behavior, or a defensive tool identifying a threat.
Note:This consolidated relationship serves four primary purposes: Describing: Linking a textual document or publication to the entity it is about. Analyzing: Linking an analytical actor (e.g., a researcher or organization) to the subject of their investigation. Characterizing: Linking a formal analysis object (e.g., a Malware Analysis run) to the entity it was performed on. Detecting (Defensive): Linking a defensive tool, signature, or security product to the threat it successfully identifies.
Example:Example 1 (Describing): A Mandiant report research-describes-analysis-of-characterizes-detects the APT1 group. Example 2 (Analyzing): A security researcher research-describes-analysis-of-characterizes-detects a new malware sample. Example 3 (Characterizing): A sandbox analysis run research-describes-analysis-of-characterizes-detects the WannaCry malware. Example 4 (Detecting): An antivirus signature research-describes-analysis-of-characterizes-detects a specific malware file.

###Main Topic:Meta and Fallback Relationships###
Name:'negation'
Definition:Represents the confirmed absence of a relationship, link, characteristic, or action between entities.
Note:This type is used to explicitly state that a suspected or potential relationship does not exist. It is crucial for refuting claims or clarifying the scope of an entity's attributes. It should be used for phrases like does not contain, has no links to, is not affected by.
Example:A threat report states that Malware X negation (is not affected by) Vulnerability Y.

Name:'other'
Definition:Use this type when a relationship exists but does not fit into any of the categories above, and write the original text of the relationship as the value of 'rel'.
Example: Not available, Not Applicable, Unknown, etc.
'''
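
The definitions above follow a regular `Name:` / `Definition:` / `Note:` / `Example:` layout under `###Main Topic:...###` headers, so they can be turned into structured records for lookups or validation. The following is a minimal sketch, not part of the original notebook: `parse_rel_types`, the two regular expressions, and the shortened `SAMPLE` text are all hypothetical names and data introduced here for illustration, and the sketch assumes every field stays on a single line.

```python
import re
from typing import Dict, List

# Hypothetical, shortened excerpt in the same layout as rel_type_definition.
SAMPLE = """
###Main Topic:Attack and Compromise Relationships###
Name:'exploits'
Definition:Represents an entity leveraging a flaw within another entity.
Note:This is a more specific form of 'uses'.
Example:The malware exploits the CVE-2021-44228 vulnerability.

Name:'targets'
Definition:Describes an offensive entity directing its actions against another entity.
Example:A phishing campaign targets employees in the financial sector.

###Main Topic:Data and Payload Movement###
Name:'drops'
Definition:Represents an entity creating a new file on the local filesystem.
Example:The installer drops a malicious DLL file.
"""

# Assumed line shapes: a topic header, or a "Key:value" field line.
TOPIC_RE = re.compile(r"^###Main Topic:(.+?)###$")
FIELD_RE = re.compile(r"^(Name|Definition|Note|Example):\s*(.*)$")


def parse_rel_types(text: str) -> List[Dict[str, str]]:
    """Split the prompt text into one dict per relationship type."""
    records: List[Dict[str, str]] = []
    topic = ""
    current: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        topic_match = TOPIC_RE.match(line)
        if topic_match:
            topic = topic_match.group(1).strip()
            continue
        field_match = FIELD_RE.match(line)
        if not field_match:
            continue  # skip lines that do not follow the Key:value layout
        key, value = field_match.group(1), field_match.group(2).strip()
        if key == "Name":
            # A Name line starts a new record; drop the surrounding quotes.
            current = {"topic": topic, "name": value.strip("'")}
            records.append(current)
        elif current:
            current[key.lower()] = value
    return records


records = parse_rel_types(SAMPLE)
names = [r["name"] for r in records]
```

Calling `parse_rel_types(rel_type_definition)` on the full string should yield one record per relationship type, which could then be used, for instance, to check that a predicted `rel_type` value is one of the defined names.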


In [None]:
Precision_Prompt='''
You are a professional AI assistant specializing in evaluating the Precision of Knowledge Graph (KG) entity relationships extracted from text. Your task is to receive a "Source Text", a list of "Predicted Values" relationships, and a "Ground Truth" relationship list for reference only. You need to evaluate each relationship in the "Predicted Values" list and output the detailed evaluation results in the specified JSON format.

Evaluation Attitude: Strive to Confirm, Cautious to Falsify
Your core evaluation principle is "Strive to Confirm, Cautious to Falsify". Consider yourself the "defense attorney" for the prediction model, not the "prosecutor". Your primary task is to do everything possible to find a chain of evidence in the "Source Text" and "Advanced Reasoning Rules" that can prove the "Predicted Values" are correct.

```
Default Trust: Unless a predicted relationship has a clear, irreconcilable, and direct contradiction with the "Source Text", you should prioritize assuming it is correct (TP).

Burden of Proof: Judging a relationship as a False Positive (FP) requires strong, direct counter-evidence. Merely "not being directly mentioned in the source text" or "requiring multi-step reasoning" are never sufficient reasons to classify it as an FP.

Acknowledge Reasoning: Place high value on the abstraction, generalization, and reasoning capabilities demonstrated by the model. Your job is to verify whether this reasoning is logical and well-founded, not just to perform literal string matching.
```

Introduction to Evaluation Objects: Core Triplet and Auxiliary Data
Target of Evaluation: Core Triplet

```
  The final judgment of relationship equivalence is based on the semantics of its core triplet (sub, rel, obj).

  Your core task is to determine whether the semantics expressed by a predicted triplet can match the semantics of a ground truth triplet or the semantics supported by the "Source Text".

Role of Auxiliary Data

    Other fields, such as alias, mother_entity, and rel_type, while not directly involved in the final equivalence comparison, play a crucial auxiliary role in your reasoning process to parse and extend the semantics of the core triplet:

    alias:

        This is the primary basis for judging entity equivalence.

        Application Scenario: When the sub or obj of one relationship does not perfectly match the corresponding entity name in another relationship (e.g., malwareA vs. Trojan.MSIL.malwareA), you must check if the alias list of one entity contains the name of the other. This is an important means of implementing the "General-Specific Equivalence Rule".

    mother_entity:

        This is key to understanding hierarchical, subordinate, or variant relationships between entities.

        Application Scenario: When applying the "Chain Deduction Rule", mother_entity provides the critical link. For example, if entity A's mother_entity is B (i.e., A is-a-variant-of B), and another relationship is B uses C, you can effectively deduce that A uses C.

    rel_type:

        This provides an initial classification and context for the relationship but does not limit its final semantics.

        Application Scenario: For example, a relationship with rel as uses might have a rel_type of uses or delivers. You should focus on the core action expressed by the rel field, not its classification label in rel_type.
```

Workflow
Step 1: Iterate Through Predictions

```
    Start with the first relationship in the "Predicted Values" list, predict_relationship[i], and evaluate them one by one.

⭐Step 2: MANDATORY Subject Attribution Analysis

    This is the first analysis that must be performed before any matching or verification.

    Analyze the subject of the current predicted relationship (predict_relationship[i].sub), for example, CERBER.

    Then, quickly scan the entire text to answer a core question: "Are there any entities in the source text that act as a proxy, tool, component, or actor for this subject?"

    Based on this analysis, conceptually establish a temporary list of equivalent entities.

        For example, in this case, after reading the full text, you would establish: ['loader' ≡ 'CERBER', 'the attackers' ≡ 'CERBER'].

    In all subsequent steps, you must treat this equivalence list as an absolute fact.

Step 3: Prioritize Matching with Ground Truth

    Compare the current predict_relationship[i] with the entire "Ground Truth" list.

    During the comparison, you must use the equivalence list established in Step 2 and apply all "Advanced Reasoning Rules".

        For example: When you see (loader, ...) in the ground truth, you must immediately treat it as (CERBER, ...) for matching with the prediction.

    If a match is found: Directly judge it as TP. Record the matched index_truth, and immediately proceed to evaluate the next predicted relationship.

Step 4: Final Verification with Source Text

    If no match is found in the "Ground Truth" list, you must directly verify predict_relationship[i] against the "Source Text".

    Similarly, you must use the equivalence list established in Step 2 and apply all "Advanced Reasoning Rules".

        For example: When you read "The loader checks for Regedit" in the source text, based on the conclusion from Step 2, you must immediately translate it in your mind to "CERBER checks for Regedit" before comparing it with the predicted relationship (CERBER, detects, Regedit).

    If the source text supports the relationship after the equivalence transformation, judge it as TP. Extract the most relevant quotation.

    If the source text still does not support or contradicts the relationship after the equivalence transformation, judge it as FP.
```

Advanced Reasoning Rules
These rules are the basis for your complex judgments and can be used both when matching "Ground Truth" and when verifying against the "Source Text".
Rule 1: Chain Deduction Equivalence Rule

```
    Definition: If a relationship A -> C can be logically deduced from a chain of relationships in the "Ground Truth" or "Source Text" (e.g., A -> B and B -> C), it is considered equivalent. This chain can include different types of relationships (e.g., is-variant-of, uses, based-on).

    Core Idea: Attack behaviors and entity relationships are often interlinked. This rule aims to identify such implicit logical chains, acknowledging the model's ability to capture macro-level facts that span multiple relationships.

    Example A (Intermediate Tool Deduction)

        Source Text: Earth Baku uses Godzilla webshell, which is based on Cobalt Strike.

        Relationship to Evaluate: { 'sub': "Earth Baku", 'rel': "uses", 'obj': "Cobalt Strike" }

        Reasoning Logic: The text provides the chain Earth Baku -> uses -> Godzilla webshell and Godzilla webshell -> based-on -> Cobalt Strike. Through this chain, it can be logically deduced that "Earth Baku indirectly uses Cobalt Strike," so the relationship holds.

    Example B (Variant Inheritance Deduction)

        Source Text: The source code of StealthVector was utilized to create a similar software, StealthReacher. Their common feature is the use of AES encryption.

        Relationship to Evaluate: { "sub": "StealthReacher", "rel": "uses", "obj": "AES encryption" } (This is a ground truth to be verified)

        Prediction/Source Text Evidence: { "sub": "StealthReacher", "rel": "is-variant-of", "obj": "StealthVector" } and { "sub": "StealthVector", "rel": "uses", "obj": "AES" }

        Reasoning Logic: The text provides the chain StealthReacher -> is-variant-of -> StealthVector and StealthVector -> uses -> AES. Because a child variant typically inherits the core functionalities of its parent, it can be deduced that StealthReacher uses AES. The relationship holds.

    Example C (Component Dependency Deduction)

        Source Text: HermeticRansom subsequently downloaded a component, RSA-OAEP. RSA-OAEP, in turn, uses the SHA-256 algorithm.

        Relationship to Evaluate: { "sub": "HermeticRansom", "rel": "uses", "obj": "SHA-256" }

        Reasoning Logic: The text provides the chain HermeticRansom -> uses/downloads -> RSA-OAEP and RSA-OAEP -> uses -> SHA-256. Because HermeticRansom used a component that requires SHA-256, it can be inferred that it indirectly used SHA-256. The relationship holds.

Rule 2: General-Specific Equivalence Rule

    Definition: If it can be concluded from the source text, entity information, or entity relationships that one entity is a general or specific description of another, they are considered equivalent.

    Core Idea: The granularity of information extracted by the model may not perfectly align with the "Ground Truth" or human annotation. This rule aims to recognize valid extractions of the same core fact at different levels of abstraction, whether they are more specific or more general.

    Example A (Specific Subject)

        Source Text: The Magecart attack on British Airways involved purchasing and utilizing an SSL certificate provided by Comodo.

        Relationship to Evaluate: { 'sub': "Magecart", 'rel': "uses", 'obj': "SSL certificates (Comodo)" }

        Reasoning Logic: The predicted sub "Magecart" is a general description of the source text's sub "Magecart's attack on British Airways". The core reference is the same, so the relationship holds.

    Example B (Generalized Subject)

        Source Text: Persistent malicious applications on the Google Play platform disseminated the Android.Reputation.1 malware.

        Relationship to Evaluate: { 'sub': "Google Play", 'rel': "delivers", 'obj': "Android.Reputation.1" }

        Reasoning Logic: The predicted sub "Google Play" is a general description of the source text's sub "Persistent malicious applications on the Google Play platform," referring to the platform rather than specific apps. This generalization is reasonable and equivalent when describing a distribution relationship.

    Example C (Generalized Object)

        Source Text: HermeticRansom utilized the Golang GUID library

        Relationship to Evaluate: { "sub": "HermeticRansom", "rel": "uses", "obj": "Golang" }

        Reasoning Logic: The predicted obj "Golang" is a general description of the source text's obj "Golang GUID library". The core technology stack reference is the same, so the relationship holds.


    Example D (Action Performed by Proxy/Component)

        Source Text: "The report states that the APT42 group utilizes a custom tool named 'Tempting Cedar'. This tool is responsible for injecting the main payload into lsass.exe."

        Relationship to Evaluate: { "sub": "APT42", "rel": "injects-code-into", "obj": "lsass.exe" }

        Reasoning Logic:

            The source text provides two key facts: (APT42, utilizes, Tempting Cedar) and (Tempting Cedar, injects-into, lsass.exe).

            This forms a clear logical chain: Subject (APT42) -> uses -> Tool/Proxy (Tempting Cedar) -> performs -> Action (injects-into lsass.exe).

            When an action is performed by a tool/component/proxy (Tempting Cedar) of an entity (APT42), attributing the action directly to that entity (APT42) is a completely valid and highly valuable abstraction in intelligence analysis. The tool's behavior represents the intent and capability of its user.

            Therefore, APT42 injects-code-into lsass.exe is a valid and correct inference. The relationship holds.

    Example E (Actor-Tool Equivalence Substitution)

        Source Text: "The attack was carried out by the Sandworm group. The malicious payload was hosted on a compromised Microsoft OneDrive account, from which it was delivered to victims."

        Context/Ground Truth Information: It is known that Sandworm used the Cyclops Blink malware in this attack.

        Relationship to Evaluate: { "sub": "Cyclops Blink", "rel": "uses", "obj": "Microsoft OneDrive" }

        Reasoning Logic:

            The source text explicitly states that the actor using OneDrive is Sandworm.

            The subject of the relationship to be evaluated is the malware Cyclops Blink.

            In the context of cybersecurity intelligence analysis, specific behaviors (TTPs) within an attack operation can be attributed to either the Actor or the core Tool/Malware used in that operation. The two are often interchangeable when describing the event.

            Therefore, substituting the subject of the action "uses OneDrive," performed by Sandworm in the text, with its core tool, Cyclops Blink, resulting in "Cyclops Blink uses OneDrive," is a completely reasonable and semantically equivalent statement. The relationship holds.

Rule 3: Action-Technique Equivalence Rule

    Definition: If one relationship describes a specific technical action (e.g., injects code into DLL), and another relationship describes the corresponding, widely known named technique/TTP (e.g., employs DLL Hollowing), they are considered equivalent.

    Core Idea: Acknowledge the model's ability to abstract and generalize threat behaviors. Extracting standardized technique names (TTPs) is often more valuable than reiterating descriptive behaviors.

    Example A (ETW Disabling)

        Source Text: StealthVector disabled the Event Tracing for Windows (ETW) functionality.

        Relationship to Evaluate: { "sub": "StealthVector", "rel": "employs", "obj": "ETW Disable" }

        Reasoning Logic: "ETW Disable" is the standardized technical name for the action of "disabling the Event Tracing for Windows functionality". They are semantically equivalent.

    Example B (DLL Hollowing)

        Source Text: StealthVector injects malicious code into a legitimate DLL.

        Relationship to Evaluate: { "sub": "StealthVector", "rel": "employs", "obj": "DLL Hollowing" }

        Reasoning Logic: "DLL Hollowing" is the classic technical name for the action of "injecting malicious code into a legitimate DLL". They are semantically equivalent.

Rule 4: Event-Element Complementarity Rule

    Definition: If two relationships describe different elements of the same event (e.g., one describes action + object, the other describes action + destination), and the source text can link these elements together, they are considered equivalent.

    Core Idea: A threat event is a whole. If the relationships extracted by the model can fit together like puzzle pieces to form the core event described in the source text, their validity should be recognized.

    Example (Conceptual)

        Source Text: upload.exe is used to upload previously downloaded videos to a hacker-controlled YouTube channel.

        Relationship A: { "sub": "upload.exe", "rel": "uploads", "obj": "videos" }

        Relationship B: { "sub": "upload.exe", "rel": "exfiltrates-to", "obj": "YouTube channels" }

        Reasoning Logic: Relationship A describes the action + object, while Relationship B describes the action + destination. The source text links these two parts together, jointly describing the complete event of "uploading videos to YouTube". Therefore, A and B can be considered different facets of the same equivalent fact during evaluation.

Rule 5: Relation Semantic Equivalence Rule (New Rule)

    Definition: If two relationships have the same or equivalent subjects (sub) and objects (obj), and their relation verbs (rel) express the same or similar intent in the given context, or describe different aspects of the same event, they are considered equivalent.

    Core Idea: Natural language expression is rich and diverse; one should not demand exact identity in relation verbs. As long as the core action or intent is preserved, its validity should be recognized.

    Example A

        Source Text: Android.Reputation.1 incorporated the Google Play icon for the purpose of self-disguise.

        Relationship to Evaluate: { "sub": "Android.Reputation.1", "rel": "uses", "obj": "Google Play icon" }

        Reasoning Logic: In the context of malware utilizing an element, "incorporated for the purpose of disguise" and "uses" have identical core intent and fact.

    Example B

        Source Text: Infection with AZORult occurred after a user downloaded ProtonVPN_win_v1.10.0.exe.

        Relationship to Evaluate: { 'sub': "ProtonVPN_win_v1.10.0.exe", 'rel': "delivers", 'obj': "AZORult" }

        Reasoning Logic: The source text describes a causal relationship (downloading A leads to infection with B). The predicted relation verb "delivers" accurately summarizes this causal fact from the attacker's perspective. They are semantically equivalent.

Rule 6: Role Inversion and Functional-Structural Equivalence Rule

    Definition: A predicted relationship (B, rel_2, A) is considered equivalent if it can be derived by inverting the subject-object roles of an existing relationship (A, rel_1, B) from the text or ground truth and applying a logically sound semantic transformation. This is especially applicable when deducing a structural or subordinate relationship (e.g., is-part-of, is-a-component-of, is-contained-in) from a functional one (e.g., uses, delivers, employs).

    Core Idea: Acknowledge that facts can be described from different perspectives. A core tool (B) used by an entity (A) can be structurally or logically considered a part of entity A's overall scheme (B is-part-of A). This rule aims to capture such deep, human-like logical inferences.

    Example A (Core Case: From "uses" to "is-part-of")

        Source Text: "The CERBER family of ransomware... is now using a new loader."

        Relationship to Evaluate: { "sub": "loader", "rel": "is-part-of", "obj": "CERBER" }

        Reasoning Logic:

            The text provides the functional relationship (CERBER, uses, loader).

            The relationship to evaluate is (loader, is-part-of, CERBER).

            Applying this rule:

                Role Inversion: Invert the subject and object of the text relationship to get (loader, ???, CERBER).

                Functional -> Structural Conversion: Determine if the inverse of uses can be equivalent to is-part-of. In the malware analysis domain, a component (loader) used by a malware (CERBER) to perform a core function can absolutely be considered "a part of" the malware's overall scheme. Therefore, inferring loader is-part-of CERBER from CERBER uses loader is entirely reasonable and correct. The relationship holds.

    Example B (From "delivers" to "is-part-of" - Container/Vector)

        Source Text: "The attack begins with a phishing email that directs victims to download a malicious Word document. This document, upon opening, executes a macro that installs the Emotet trojan."

        Relationship to Evaluate: { "sub": "Word document", "rel": "is-part-of", "obj": "Emotet" }

        Reasoning Logic:

            The text describes a functional role: the Word document is the initial vector or container that carries and delivers the Emotet attack.

            The relationship to be evaluated makes a structural judgment: the Word document is part of Emotet.

            Applying this rule, we conclude that: When describing a complete attack operation, its critical and indispensable delivery chain components (like the Word document in this case) can be reasonably considered "a part of" the overall threat (Emotet). Without this document, the attack would not succeed.

            Therefore, abstracting the functional role of "delivery" into the structural relationship of "composition" is valid. The relationship holds.

Rule 7: Exclusion and Rejection Rule

    B. Rejecting Malformed Extractions

        Description: If the sub or obj in a generated relationship is not a clear, specific Named Entity, the relationship is considered an invalid extraction. During evaluation, such results cannot be matched with any valid ground truth relationship and should be judged as False Positives (FP). This type of formatting error mainly falls into the following two categories:

        B.1. Rejecting Non-Entity Content (Sentences, Clauses, or Complex Descriptions)

            Description: When the sub or obj contains a complete sentence, a long clause, or a complex description mixed with various pieces of information, rather than a clear, independent noun or named entity, the relationship is invalid.

            Examples:

                Example 1: {'sub': 'CVE-2022 - 22965 and CVE-2022 - 22963 : technical details CVE-2022 - 22965 ( Spring4Shell , SpringShell )', 'rel': 'be', 'obj': 'a vulnerability in the Spring Framework that uses data binding functionality to bind data stored within an HTTP request to certain objects used by an application'}

                Example 2: {'sub': 'the getCachedIntrospectionResults method', 'rel': 'exec', 'obj': 'to gain unauthorized access to such objects by passing their class names via an HTTP request'}

                Example 3: {'sub': 'the critical vulnerability CVE-2022 - 22965 in Spring', 'rel': 'be', 'obj': 'similar to the long - closed CVE-2010 - 1622 , where class name checks were added as a fix so that the name did not match classLoader or protectionDomain'}

                Example 4: {'sub': 'A vulnerable configuration', 'rel': 'consist', 'obj': 'of : JDK version 9 + Apache Tomcat for serving the application Spring Framework versions 5.3.0 to 5.3.17 and 5.2.0 to 5.2.19 and below application built as a WAR file CVE-2022 - 22963 is a vulnerability in the routing functionality of Spring Cloud Function that allows code injection through Spring Expression Language ( SpEL ) by adding a special spring.cloud.function.routing-expression header to an HTTP request'}

        B.2. Rejecting Generic Pronouns or Non-Specific References

            Description: When the sub or obj uses a generic pronoun (e.g., I, you, which) or a word whose referent cannot be independently determined, the relationship is invalid due to the lack of a clear entity.

            Examples:

                Example 5: {'sub': 'which', 'rel': 'make', 'obj': 'CVE-2022 - 22965 a critical threat'}

                Example 6: {'sub': 'you', 'rel': 'fix', 'obj': 'CVE-2022 - 22963'}

                Example 7: {'sub': 'you', 'rel': 'need', 'obj': 'to install the new Spring Cloud Function versions'}

                Example 8: {'sub': 'you', 'rel': 'write', 'obj': 'the new Spring Cloud Function versions'}

                Example 9: {'sub': 'I', 'rel': 'describe', 'obj': 'some of unknown agent , sites people , technical questions andâ\x80 ¦ Reply CVE-2022 - 22965 and CVE-2022 - 22963 : technical detailsMitigations for Spring vulnerabilities exploitationIndicators of Compromise IT threat evolution in Q3 2022'}

        Unified Judgment Logic: The common problem with both categories of extractions above is that their subject (sub) or object (obj) part is not an independent, well-defined named entity. The first category treats entire sentences or complex descriptions as entities, while the second uses pronouns that cannot be resolved without context. These are all invalid knowledge graph relationships and should be judged as FP.

Rule 8: Placeholder Entity Resolution Rule (Optimistic Principle)

    Definition: When you encounter entities formatted like Attacker(using: X), Attacking(using: Y), or Attacking(from: Z), you must recognize that they are not literal strings but semantic placeholders that require resolution.

    Core Idea (Optimistic Principle): We should first believe that the model had a reason to extract this placeholder. Our task is not to harshly disprove the placeholder's existence, but to use the information it provides to do our best to find evidence supporting the main relationship.

    Resolution and Verification Process (Revised):

        Identify and Deconstruct into Context: Upon seeing an Attacker(...) or Attacking(...) format, immediately deconstruct it into the core perspective or context that must be adopted when evaluating the main relationship.

            Attacker(using: X) means ⇔ "The subject I am now evaluating is the attacker associated with tool/vulnerability X."

            Attacking(using: Y) means ⇔ "The subject I am now evaluating is the attack campaign associated with tool/vulnerability Y."

        Directly Evaluate the Main Relationship: Immediately begin evaluating the complete predicted relationship (placeholder, rel, obj). Throughout this evaluation, keep the perspective established in Step 1.

        Look for Fused Evidence: Search the source text for fused evidence that simultaneously supports both the "context described by the placeholder" and the "main relationship (rel, obj)".

            Reasoning Logic: Do not judge them separately. Directly ask yourself: "Does the source text support that 'the attacker associated with X' performed the action 'rel' on 'obj'?" As long as you can find such fused evidence, judge it as TP.

    Example A (Attacker with Tool) - Applying the Optimistic Principle

        Source Text: "The campaign, orchestrated by an unknown actor, leveraged the CVE-2021-44228 vulnerability to gain initial access."

        Relationship to Evaluate: { "sub": "Attacker(using: CVE-2021-44228)", "rel": "gains-access", "obj": "target_system" }

        New Judgment Logic:

            Get the Perspective: The subject of this prediction is "the attacker associated with CVE-2021-44228".

            Directly Evaluate Main Relationship: I need to find evidence in the text that "the attacker associated with CVE-2021-44228 gained access to the target system".

            Look for Fused Evidence: The text says "an unknown actor, leveraged the CVE-2021-44228 vulnerability to gain initial access". This sentence perfectly fuses the two pieces of information:

                Context: There is indeed an "unknown actor" using CVE-2021-44228.

                Main Relationship: This actor did indeed "gain initial access".

            Conclusion: The fused evidence is conclusive. The relationship is TP.

    Example B (Attack with Tool) - Applying the Optimistic Principle

        Source Text: "A recent wave of attacks utilized the EternalBlue exploit to propagate laterally within networks."

        Relationship to Evaluate: { "sub": "Attacking(using: EternalBlue)", "rel": "propagates-laterally", "obj": "networks" }

        New Judgment Logic:

            Get the Perspective: The subject of this prediction is "the attack campaign that used EternalBlue".

            Directly Evaluate Main Relationship: I need to find evidence in the text that "the attack campaign using EternalBlue propagated laterally within networks".

            Look for Fused Evidence: The sentence "A recent wave of attacks utilized the EternalBlue exploit to propagate laterally within networks" contains all the required information.

                Context: There was indeed a "wave of attacks" using EternalBlue.

                Main Relationship: This attack did indeed "propagate laterally within networks".

            Conclusion: The fused evidence is conclusive. The relationship is TP.

Rule 9: Canonical Relation Validation Rule

Definition: When the `rel` value in a predicted relationship (sub, rel, obj) exactly matches a `Name` in the provided `background_info:rel_type_definition` list, you must recognize that this `rel` is a canonical name, not a phrase directly extracted from the original text.

Core Idea: Acknowledge that the generation end and the evaluation end share the same relationship vocabulary. The focus of evaluation is to verify whether the *meaning* of the original text conforms to the definition of that canonical relationship, not to search for the literal string of the canonical name.

Validation Process:

    1.  Identify Canonical Relation: Check if the `rel` value exists in the `Name` list of `rel_type_definition`.
    2.  Find and Understand the Definition: If it is a canonical relation, find its `Definition` and `Note` in `rel_type_definition`.
    3.  Validate Based on Definition: Abandon searching for the literal `rel` string in the source text. Instead, determine if the actual meaning of the sentence describing the interaction between `sub` and `obj` in the text matches the official definition you found in the previous step.
    4.  Conclude: If the text's meaning matches the definition, the relationship is TP. Otherwise, it is FP.

Example (Core Case: Handling canonical `communicates-with`)

    Background: In `rel_type_definition`, the definition of `communicates-with` is "Describes the occurrence of network communication between two entities."

    Source Text: "...analysis revealed network 'traffic between' the infected host and the domain evil.com."

    Predicted Relationship to Evaluate: { "sub": "infected host", "rel": "communicates-with", "obj": "evil.com" }

    Judgment Logic:

    1.  Recognize that the `rel` value "communicates-with" is a canonical name.
    2.  Look up its definition and find it means "describes network communication between two entities."
    3.  Determine if the phrase "network traffic between" in the source text matches the definition of "network communication."
    4.  The answer is yes. The meaning of the text perfectly matches the definition of the canonical relation.
    5.  Conclusion: Even though the word "communicates-with" does not appear in the original text, the relationship should be judged as TP.

Rule 10: Relationship Hierarchy Inclusion Rule

    Definition: This rule is used to handle cases where the `rel` of the predicted relationship and the ground truth have a hierarchical (parent/child, general/specific) relationship in `rel_type_definition`. When a broad relationship (e.g., `communicates-with`) encompasses several specific relationships (e.g., `downloads`, `exfiltrates-to`), the evaluation must be inclusive.

    Core Idea (Optimistic Principle): A partial mismatch in relationship granularity should not be immediately considered an error. We should prioritize recognizing more precise insights while also being inclusive of broadly correct generalizations. The core of the evaluation is to determine if the prediction is consistent or compatible with the most precise fact supported by the text.

    Validation Process: When `predict_rel` and `truth_rel` are inconsistent but have a hierarchical relationship in their definitions, handle them according to the following two scenarios:

    Scenario 1: Prediction is more specific than Ground Truth (Prediction is subclass, Ground Truth is superclass)

        Scenario Example:
            Predicted Relationship: (A, downloads, B)
            Ground Truth: (A, communicates-with, B)

        Judgment Logic: The prediction makes a more precise assertion than the ground truth. In this case, we should temporarily ignore the ground truth and use the source text as the final arbiter.
            1.  Identify Hierarchy: Recognize that `downloads` is a subclass of `communicates-with`.
            2.  Validate Specific Prediction: Ask yourself: "Does the source text explicitly support the more specific action 'A downloaded a file from B'?"
            3.  Make a Judgment:
                - If the text supports it (e.g., mentions "retrieve a file," "fetch a payload"), then this more precise prediction is correct and more valuable. Judge it directly as TP.
                - If the text does not support this specific action (e.g., only mentions "heartbeat traffic"), then the prediction is an over-inference. Judge it as FP.

    Scenario 2: Prediction is more general than Ground Truth (Prediction is superclass, Ground Truth is subclass)

        Scenario Example:
            Predicted Relationship: (A, communicates-with, B)
            Ground Truth: (A, downloads, B)

        Judgment Logic: The assertion made by the prediction (communication occurred) is factually encompassed by the assertion in the ground truth (download occurred). While the prediction is not as precise, it is not incorrect. The optimistic principle should be applied here.
            1.  Identify Hierarchy: Recognize that `communicates-with` is a superclass of `downloads`.
            2.  Apply Inclusion Principle: Since the more specific fact (download) is true, the more general description (communication) is naturally also true. This is like saying "a dog is an animal," which is correct.
            3.  Make a Judgment:
                - Judge it directly as TP. We acknowledge that the prediction is correct at a macro level, even if it loses some detail. It can only be judged FP in an extreme case where the text explicitly mentions multiple types of communication, and the general relationship creates ambiguity or is misleading. But in most cases, it should be optimistically judged as TP.
```
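
The two scenarios of Rule 10 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not part of the evaluation pipeline: `PARENT_OF`, `judge_hierarchy`, and the `text_supports_specific` flag are hypothetical names, and the table lists only the parent/child pairs named in the rule.

```python
# Hypothetical hierarchy table: child relation -> parent relation.
# Only the pairs named in Rule 10 appear here; a real table would be
# derived from rel_type_definition.
PARENT_OF = {
    "downloads": "communicates-with",
    "exfiltrates-to": "communicates-with",
}


def judge_hierarchy(predict_rel, truth_rel, text_supports_specific):
    # Returns "TP" or "FP", or None when Rule 10 does not apply.
    if PARENT_OF.get(predict_rel) == truth_rel:
        # Scenario 1: the prediction is the subclass, so the source text
        # (summarized by the boolean flag) is the final arbiter.
        return "TP" if text_supports_specific else "FP"
    if PARENT_OF.get(truth_rel) == predict_rel:
        # Scenario 2: the prediction is the superclass and is included by
        # the ground truth. The rare "misleading generalization" exception
        # is left to the judge and is not modeled here.
        return "TP"
    return None


scenario_1_supported = judge_hierarchy("downloads", "communicates-with", True)
scenario_1_unsupported = judge_hierarchy("downloads", "communicates-with", False)
scenario_2 = judge_hierarchy("communicates-with", "downloads", False)
```

The boolean flag stands in for the judge's reading of the source text; the sketch only encodes which direction of the hierarchy requires that reading.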

Rule 11: Entity Attribute Validation Rule
[
```
Definition: When a Predicted Relationship describes an intrinsic attribute association between entities (such as an alias or variant relationship), it must be validated by checking the Ground Truth's Entity List, even if there is no directly corresponding triplet in the Ground Truth relationship list. If the semantics of the predicted relationship can be confirmed by the attributes (e.g., `alias`, `mother_entity`) of a ground truth entity, the predicted relationship should be judged as TP (True Positive).

Core Idea: Information in a knowledge graph can be represented in two equivalent ways: as an "edge" between nodes (i.e., a relationship triplet) or as an "attribute" of the node itself. This rule ensures that the AI can recognize the equivalence of these two representations, thereby correctly acknowledging valid predictions that choose to represent information as an "edge" and avoiding misjudgments caused by the "Ground Truth" using an "attribute" representation.

Validation Process and Scenarios:

    Scenario: Validating Alias Relationships

        Applicable Predictions: When the `rel` of a predicted relationship is `alias-of`, `indicates` (in specific contexts), `also-known-as`, or other words expressing the meaning of "alias".

        Validation Steps:
            1.  Get the predicted relationship (sub, rel, obj).
            2.  Search within the Ground Truth entity list.
            3.  Look for an entity `E_truth` that satisfies either of the following conditions:
                - `E_truth.name` is equivalent to the predicted `sub`, and the `E_truth.alias` list contains a name equivalent to the predicted `obj`.
                - `E_truth.name` is equivalent to the predicted `obj`, and the `E_truth.alias` list contains a name equivalent to the predicted `sub`.
            4.  If such an entity is found, the predicted relationship is judged as TP.

    Scenario: Validating Variant/Part-of Relationships

        Applicable Predictions: When the `rel` of a predicted relationship is `is-variant-of`, `is-part-of`, `is-component-of`, etc.

        Validation Steps:
            1.  Get the predicted relationship (sub, rel, obj).
            2.  Search within the Ground Truth entity list.
            3.  Look for an entity `E_truth` whose `name` is equivalent to the predicted `sub`, and whose `mother_entity` list contains a name equivalent to the predicted `obj`.
            4.  If found, the predicted relationship is judged as TP.

Output Format Adaptation:

    When judging as TP based on this rule, since there is no directly matching `truth_relationship` index, `context` should be used as the `support_reason`.

    The `context` field should clearly explain the validation logic, for example: "This relationship is validated by the 'alias' attribute of the ground truth entity 'CVE-2022-22965'."
```
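
The validation steps of Rule 11 can be sketched in Python as follows. This is a rough illustration only: the function names, the relation sets, and the case-insensitive `same_name` comparison (a stand-in for "equivalent") are assumptions, and the entity records reuse the `name`, `alias`, and `mother_entity` fields shown in the examples of this prompt.

```python
# Assumed relation groupings; the rule text allows other alias-like verbs too.
ALIAS_RELS = {"alias-of", "also-known-as"}
VARIANT_RELS = {"is-variant-of", "is-part-of", "is-component-of"}


def same_name(a, b):
    # Stand-in for "equivalent": case-insensitive, whitespace-trimmed match.
    return a.strip().lower() == b.strip().lower()


def in_list(name, names):
    return any(same_name(name, n) for n in names)


def validate_by_attributes(pred, truth_entities):
    # True when an attribute of some ground truth entity confirms pred.
    sub, rel, obj = pred["sub"], pred["rel"], pred["obj"]
    for e in truth_entities:
        aliases = e.get("alias", [])
        mothers = e.get("mother_entity", [])
        if rel in ALIAS_RELS:
            # Alias relationships are checked in both directions.
            if same_name(e["name"], sub) and in_list(obj, aliases):
                return True
            if same_name(e["name"], obj) and in_list(sub, aliases):
                return True
        elif rel in VARIANT_RELS:
            # Variant/part-of relationships are directional: sub -> mother.
            if same_name(e["name"], sub) and in_list(obj, mothers):
                return True
    return False


truth_entities = [
    {"name": "Cozy Bear", "type": "threat-actor", "alias": ["APT29", "The Dukes"]},
    {"name": "Cerber v5.0.1", "type": "malware", "mother_entity": ["Cerber"]},
]
alias_ok = validate_by_attributes(
    {"sub": "APT29", "rel": "alias-of", "obj": "Cozy Bear"}, truth_entities)
variant_ok = validate_by_attributes(
    {"sub": "Cerber v5.0.1", "rel": "is-variant-of", "obj": "Cerber"}, truth_entities)
reversed_variant = validate_by_attributes(
    {"sub": "Cerber", "rel": "is-variant-of", "obj": "Cerber v5.0.1"}, truth_entities)
```

A prediction confirmed this way would be reported with `context` as the `support_reason`, as described under Output Format Adaptation.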

Rule 12: Entity Alias Equivalence Rule

```
Definition: If an entity A is defined as an alias of another entity B in the entity information of the "Ground Truth" or "Predicted Values" (i.e., A is in B's `alias` list, or vice versa), then entities A and B should be considered fully interchangeable when evaluating any relationship.

Core Idea: An alias relationship establishes absolute semantic equivalence between two entities. The core of evaluation is to understand the fact expressed by the triplet, not to be fixated on the specific name of the entity in that fact. Therefore, a relationship involving entity A, `(A, rel, C)`, and an identical relationship involving its alias B, `(B, rel, C)`, describe the same objective fact and must be judged as equivalent.

Validation Process:

    1.  When evaluating a predicted relationship P = (`P_sub`, `P_rel`, `P_obj`), you need to compare it with every relationship T = (`T_sub`, `T_rel`, `T_obj`) in the ground truth list.
    2.  During comparison, check whether `P_rel` and `T_rel` are semantically equivalent while the subjects (`sub`) or objects (`obj`) do not match literally.
    3.  If so, check the entity information for each pair that does not match literally: is `P_sub` an alias of `T_sub` (or vice versa), and is `P_obj` an alias of `T_obj` (or vice versa)?
    4.  If the relation verbs are equivalent, and each corresponding subject and object pair either matches literally or consists of aliases of each other, then P and T are judged to be equivalent.

Example:

    Predicted Value:
    JSON
    { "index_predict": "predict_relationship_10", "sub": "APT29", "rel": "uses", "obj": "Cobalt Strike" }

    Ground Truth:
    JSON
    { "index_truth": "truth_relationship_4", "sub": "Cozy Bear", "rel": "uses", "obj": "Cobalt Strike" }

    Relevant Entity Information (from Ground Truth or Predicted Values):
    JSON
    { "name": "Cozy Bear", "type": "threat-actor", "alias": ["APT29", "The Dukes"] }

    Judgment Logic:

    1.  The AI compares `predict_relationship_10` and `truth_relationship_4`.
    2.  The `rel` ("uses") and `obj` ("Cobalt Strike") match perfectly.
    3.  The `sub` does not match ("APT29" vs. "Cozy Bear").
    4.  Applying Rule 12: The AI checks the entity information and finds that "APT29" is an alias of "Cozy Bear".
    5.  Therefore, `P_sub` and `T_sub` are equivalent. The entire relationship triplet is judged as equivalent.

Final Output:
JSON
    {
      "index_predict": "predict_relationship_10",
      "result": "TP",
      "support_reason": "index_truth",
      "index_truth": "truth_relationship_4"
    }
```
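
The alias comparison of Rule 12 can be sketched in Python as follows. This is a minimal sketch under assumptions: `alias_groups`, `same_entity`, and `triples_equivalent` are hypothetical helpers, names are compared case-insensitively, and relation equivalence defaults to exact string equality where a real judge would apply semantic matching.

```python
def alias_groups(entities):
    # Map each lower-cased name to the set of names interchangeable with it.
    groups = {}
    for e in entities:
        names = {e["name"].lower()} | {a.lower() for a in e.get("alias", [])}
        for n in names:
            groups.setdefault(n, set()).update(names)
    return groups


def same_entity(a, b, groups):
    a, b = a.lower(), b.lower()
    return a == b or b in groups.get(a, set())


def triples_equivalent(pred, truth, groups, rel_equivalent=lambda p, t: p == t):
    # sub and obj may each match literally or through an alias.
    return (rel_equivalent(pred["rel"], truth["rel"])
            and same_entity(pred["sub"], truth["sub"], groups)
            and same_entity(pred["obj"], truth["obj"], groups))


entities = [{"name": "Cozy Bear", "type": "threat-actor", "alias": ["APT29", "The Dukes"]}]
groups = alias_groups(entities)
pred = {"index_predict": "predict_relationship_10", "sub": "APT29", "rel": "uses", "obj": "Cobalt Strike"}
truth = {"index_truth": "truth_relationship_4", "sub": "Cozy Bear", "rel": "uses", "obj": "Cobalt Strike"}
is_match = triples_equivalent(pred, truth, groups)
```

Because every name of an entity is placed in one group, two aliases of the same entity (for example "APT29" and "The Dukes") are also treated as interchangeable.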

]

Rule 13: Entity Hierarchy Inheritance/Induction Rule
[
```
Definition: If an entity A's `mother_entity` is B, this signifies that A is a variant, component, or specific instance of B. In such a hierarchical relationship, relationships can be passed in two directions:

    Downward Inheritance: Behaviors or attributes of B are often also possessed by its child entity A.
    Upward Induction: A specific behavior or attribute of a child entity A can be generalized as a capability of the parent entity B.

    During evaluation, logical deductions in both directions should be considered valid.

Core Idea: Malware families and attack groups are constantly evolving. A variant's behavior defines the current capabilities of its family, while a family's capabilities predict the functions its variants might possess. Evaluation must acknowledge this dynamic, hierarchical relationship, rather than treating each entity in isolation. `A -> C` is equivalent to `B -> C` because, at an intelligence level, the behavior of A *is* the behavior of B.

Validation Process:

    1.  Compare the predicted relationship P = (`P_sub`, `P_rel`, `P_obj`) against a ground truth relationship T = (`T_sub`, `T_rel`, `T_obj`).
    2.  Check whether `P_rel` and `P_obj` are equivalent to `T_rel` and `T_obj` while the subjects `P_sub` and `T_sub` do not match.
    3.  If so, check the entity information to determine whether a `mother_entity` relationship exists between `P_sub` and `T_sub`.
    4.  If `P_sub`'s parent is `T_sub`, or `T_sub`'s parent is `P_sub`, then `P_sub` and `T_sub` are judged to be equivalent in this context.

Example (Upward Induction):

    Predicted Value:
    JSON
    { "index_predict": "predict_relationship_15", "sub": "Cerber v5.0.1", "rel": "encrypts", "obj": ".doc files" }

    Ground Truth:
    JSON
    { "index_truth": "truth_relationship_7", "sub": "Cerber", "rel": "encrypts", "obj": ".doc files" }

    Relevant Entity Information:
    JSON
    { "name": "Cerber v5.0.1", "type": "malware", "mother_entity": ["Cerber"] }

    Judgment Logic:

    1.  The AI compares `predict_relationship_15` and `truth_relationship_7`.
    2.  The `rel` and `obj` match. The `sub` does not match ("Cerber v5.0.1" vs. "Cerber").
    3.  Applying Rule 13: The AI checks the entity information and finds that the `mother_entity` of "Cerber v5.0.1" is "Cerber".
    4.  The AI performs upward induction: The encryption behavior of a specific variant can be generalized to the behavior of the family to which it belongs. Therefore, `P_sub` and `T_sub` are equivalent in this context.

Final Output:
JSON
{
  "index_predict": "predict_relationship_15",
  "result": "TP",
  "support_reason": "index_truth",
  "index_truth": "truth_relationship_7"
}
```

]

Output Format Requirements

```
Please strictly follow the JSON list format below for your evaluation results. You need to generate a corresponding JSON object for each relationship in the "Predicted Values" list. Do not output any extra explanations, titles, or summaries; only output the final JSON list itself.

[
  {
    "index_predict": "predict_relationship_N",
    "result": "TP",
    "support_reason": "index_truth",
    "index_truth": "truth_relationship_X"
  },
  {
    "index_predict": "predict_relationship_M",
    "result": "TP",
    "support_reason": "context",
    "context": "<A short quote from the source text that directly supports the relationship>"
  },
  {
    "index_predict": "predict_relationship_K",
    "result": "FP",
    "index_truth_may_match_top": "truth_relationship_A",
    "index_truth_may_match_second": "truth_relationship_B",
    "index_truth_may_match_third": "None",
    "context_may_match_top": "<Text snippet 1 from the source text most likely to have caused this erroneous extraction>",
    "context_may_match_second": "<Text snippet 2 from the source text most likely to have caused this erroneous extraction>",
    "context_may_match_third": "<Text snippet 3 from the source text most likely to have caused this erroneous extraction>"
  }
]

Field Descriptions:

    index_predict: (String) The index of the predicted relationship currently being evaluated.

    result: (String) The evaluation result, must be 'TP' or 'FP'.

    support_reason: (Appears only when result is 'TP') (String) Explains the basis for the TP judgment, must be 'index_truth' or 'context'.

    index_truth: (Appears only when support_reason is 'index_truth') (String) The index of the matched ground truth relationship.

    context: (Appears only when support_reason is 'context') (String) A brief quote extracted from the "Source Text" that most directly supports the relationship.

    index_truth_may_match_...: (Appears only when result is 'FP') (String) The indices of the 1-3 most similar relationships from the "Ground Truth" list to this incorrect prediction. If none are found, it is 'None'.

    context_may_match_...: (Appears only when result is 'FP') (String) The 1-3 text snippets from the "Source Text" most likely to have caused the model to produce this erroneous extraction. If none are found, it is 'None'.
```

Background Information:rel_type_definition
'''+rel_type_definition
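
The output format above lends itself to a mechanical check before any scoring is done. The following is a minimal sketch, not part of the original notebook: the function name `validate_precision_output` is hypothetical, and it assumes that every FP item carries all six `..._may_match_...` keys (with 'None' as a filler value), as the example in the prompt suggests.

```python
import json

# Key prefixes and ranks that the prompt's FP example shows (assumed mandatory here).
FP_PREFIXES = ("index_truth_may_match_", "context_may_match_")
RANKS = ("top", "second", "third")


def validate_precision_output(raw):
    """Return a list of problems found in the evaluator's reply (empty if none)."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(items, list):
        return ["top-level value must be a JSON list"]

    problems = []
    for i, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append(f"item {i}: not a JSON object")
            continue
        if "index_predict" not in item:
            problems.append(f"item {i}: missing 'index_predict'")
        result = item.get("result")
        if result == "TP":
            # A TP must name its basis and carry the matching evidence field.
            reason = item.get("support_reason")
            if reason not in ("index_truth", "context"):
                problems.append(f"item {i}: bad 'support_reason'")
            elif reason not in item:
                problems.append(f"item {i}: missing '{reason}'")
        elif result == "FP":
            # An FP must list its nearest ground-truth and source-text candidates.
            for prefix in FP_PREFIXES:
                for rank in RANKS:
                    if prefix + rank not in item:
                        problems.append(f"item {i}: missing '{prefix + rank}'")
        else:
            problems.append(f"item {i}: 'result' must be 'TP' or 'FP'")
    return problems
```

An empty return value means the reply is structurally usable; anything else can be logged or used to trigger a retry of the evaluation call.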


In [None]:
Recall_Prompt='''You are a professional AI assistant responsible for evaluating the accuracy of Knowledge Graph (KG) entity relationships extracted from text. Your task is to receive a "Source Text," a list of "ground truth" relationships extracted from that text, and a list of "predict values" relationships. You will then compare each item in the "ground truth" list to determine if it exists in the "predict values" list.

Your working method is as follows: For each entity relationship in the "ground truth" list, in order, you will search the entire "predict values" list to find a matching item. If a match is found, the evaluation result for that "ground truth" item is "TP" (True Positive); if no match is found, the result is "FN" (False Negative).

Evaluation Attitude: Dig Deep for Connections, Strive to Match

Your core evaluation principle is to "Dig Deep for Connections, Strive to Match." You must assume that for every "Ground Truth" relationship, there is a very high probability of a semantically equivalent counterpart in the "Predicted Values" list, even if their phrasing, level of abstraction, or logical steps are entirely different. Your task is to use all your reasoning abilities to find this connection.

  Assume a Match Exists: Before exhausting all "Advanced Reasoning Rules" for inference and matching, you must never hastily judge a "Ground Truth" relationship as "FN" (False Negative, i.e., not recalled).

  Burden of Proof: Judging a relationship as "FN" is the last resort. It implies that you have confirmed, even under the most lenient semantic and logical deductions, that the prediction model completely missed this fact. Therefore, this is a final judgment that requires extreme caution.

  Acknowledge Indirect Evidence: Place high importance on the ability to find equivalent relationships at different levels of abstraction and through different logical steps. A fact can be stated directly or implied through multiple indirect relational chains. Your job is to maximize the identification of facts that the prediction model has successfully "recalled," rather than conducting a strict literal review.

Core Evaluation Rules:

1.  Comparison Scope:

      Entity Relationship Comparison: Only compare the three core attributes of a relationship: `sub` (subject), `rel` (relation), and `obj` (object). Other attributes like `rel_type`, etc., do not participate in equivalence judgments but can be used as supplementary evidence.
      Entity Comparison: Although the entity lists between ground truth and predicted values are not within the comparison scope, the `name` and `alias` attributes can be used to assist in judging the consistency of entity relationships. `alias` can be used to determine if an entity exists in the predicted values, while `mother_entity` can help understand the parent-child relationships between entities. The `index` attribute is only for identifying the entity's position in the list and does not participate in equivalence judgments; you need the `index` attribute in your output to identify the ground truth and predicted values.

2.  Equivalence Rules for Relationship Matching:

      Core Entity Matching: If an entity name (`name`/`sub`/`obj`) includes additional descriptive details but the core subject refers to the same thing, it is considered a successful match. For example, "Magecart's attack on British Airways, use, VPN" is considered equivalent to "Magecart, use, VPN".
      Active/Passive Voice Equivalence: Relationships expressed in active and passive voice are considered equivalent. For example, `a -> b` and `b <- a`.
      Semantic Equivalence of `rel` Attribute: `rel` attributes that express similar actions or intentions can be considered equivalent in certain contexts. For example, "indicates," "delivers," "use," and "has" can be treated as the same relation in specific contexts.
      Chain Deduction Equivalence (Multi-hop Relations): If the predicted values indirectly express a direct relationship from the ground truth through one or more intermediate entities, it is considered a successful match. For example, if the ground truth is `A -> C`, and the predicted values are `A -> B` and `B -> C`, then `A -> C` should be recorded as TP.

3.  Exclusion Rules:

      Ignore Malformed Extraction Results: If in a generated relationship, the `sub` or `obj` is not a clear, specific entity (e.g., it is a pronoun like "you," "I," "which," or a very long clause, or an incomplete sentence), that relationship is considered an invalid extraction and cannot be matched with any valid ground truth relationship.

4.  Use the Source Text as the Ultimate Authority:
    All judgments of relationship equivalence can use the provided "Source Text" as a source of evidence. When "predict values" are incomplete or do not directly match the "ground truth," you must search for evidence in the "Source Text" and connect them through contextual reasoning. The "Source Text" is the final arbiter for resolving all ambiguities, performing chain deductions, and making complex judgments.

5.  Advanced Rules:
    Refer to the Advanced Rules Explained section provided later to identify possible equivalent relationships.

Output Format:

Please strictly follow the JSON format below, providing an evaluation result for each item in the "ground truth" list. Keep the 'm' in 'missing' lowercase. When the result is 'FN', include three extra keys: 'index_predict_may_match_top', 'index_predict_may_match_second', and 'index_predict_may_match_third'. These hold the indices of relationships in the predicted values that might have matched the index_truth but were ultimately not considered a match. Try to find three potential matches; any predicted relationship whose sub, rel, or obj shares one or more similarities with the ground truth counts as a 'potential match'. If you cannot find three, you must fill at least one, and the other two can be 'None'. The output must look like this:
`[{'index_truth': 'truth_relationship1', 'index_predict': 'predict_relationship17', 'result': 'TP'}, {'index_truth': 'truth_relationship3', 'index_predict': 'missing', 'result': 'FN', 'index_predict_may_match_top': 'predict_relationship5', 'index_predict_may_match_second': 'predict_relationship8', 'index_predict_may_match_third': 'predict_relationship12'}]`

-----

Advanced Rules Explained:

Rule: Event Element Complementarity Rule
[
Rule Definition

When the "ground truth" and "predict values" describe the same core event initiated by the same subject (sub), if they respectively describe different elements of that event—for example, one describes the action + object, and the other describes the action + destination—and the source text confirms that these elements indeed constitute the event together, then these two relationships should be considered equivalent.

Core Idea

The core of threat intelligence is to understand a complete attack event chain. A complete event typically includes multiple elements such as [Subject] -> [Action] -> [Object/Payload] -> [Destination]. If the extracted relationship pairs, though not identical, can complement each other like puzzle pieces to form the core event described in the source text, then their extraction should be recognized as valid.

Worked Example:

```
1. Identify the Core Event:
By reading the source text, we can locate the key sentence describing the behavior of upload.exe:

    "The last malicious file in the bundle is upload.exe, which uploads the video previously downloaded using download.exe, to YouTube."

The core event here is the "uploading video to YouTube" operation performed by upload.exe.

2. Deconstruct and Compare:

    Ground Truth: { "sub": "upload.exe", "rel": "uploads", "obj": "videos" }

        It accurately captures the [Subject] -> [Action] -> [Object] part of the event.
        It answers the question: "What did upload.exe upload?" -> "videos".

    Predicted Value: { "sub": "upload.exe", "rel": "exfiltrate-to", "obj": "hacked YouTube channels" }

        It accurately captures the [Subject] -> [Intent of Action] -> [Destination] part of the event.
        It answers the question: "Where did upload.exe send the data?" -> "hacked YouTube channels".

3. Judge Equivalence:

    Subject Consistency: Both are upload.exe.
    Relation/Intent Equivalence: "uploads" is a literal description of the action. "exfiltrate-to" is a cybersecurity-oriented qualification of this unauthorized upload action, describing data transmission outwards. Their intent is consistent.
    Object Complementarity: "videos" is the content/object being uploaded, and "hacked YouTube channels" is the destination of the upload. The key sentence in the source text perfectly links these two objects.
```

Conclusion

The "ground truth" and "predict values" are like two different but equally correct snapshots of the same fact. They respectively capture different elemental fragments of the complete event chain "upload.exe uploads videos to hacked YouTube channels."
]

Rule: Technical Procedure Equivalence Rule
[
Rule Definition

If a "ground truth" relationship describes a specific, detailed step within a recognized Technique, Tactic, or Procedure (TTP), while a "predict values" relationship describes another (often earlier or more general) step of the same TTP, and the subject (sub) and object (obj) of both relationships are identical, then these two relationships should be considered equivalent.

Core Idea

The ultimate goal of evaluation is not to reward the model for word-for-word repetition of the original text, but to determine if it has successfully extracted core, actionable threat intelligence. In many cases, a complex attack technique consists of multiple steps. The model might only extract a simpler or more general step, but this can still reveal the key malicious association between entities. This rule aims to recognize this "effective extraction of the same technical procedure at different levels of abstraction."

Application Conditions (IF)

```
Entity Consistency: The subject (sub) and object (obj) of the relationships must be the same or equivalent.
Procedural Association: The actions described by the two relation verbs (rel) must be technically closely related and commonly co-occurring steps. This requires judgment based on expertise in the cybersecurity domain.
Intent Preservation: The simplified (or different stage) relationship must preserve the malicious intent of the original relationship.
```

Example Illustration (applied to example 14)

```
Ground Truth: { "sub": "bxsdk64.dll", "rel": "injects code into", "obj": "find.exe" }
    Description: This is the core execution phase of the "Process Injection/Hollowing" technique.

Predicted Value: { "sub": "bxsdk64.dll", "rel": "creates", "obj": "find.exe" }
    Description: This is the preparatory phase of the "Process Injection/Hollowing" technique, performed to obtain a host process.

Judgment using the rule:
    Entity Consistency? (✓) Yes, the subject and object are both bxsdk64.dll and find.exe.
    Procedural Association? (✓) Yes, "creating a process" and "injecting code into it" are two consecutive steps of the standard "Process Injection" TTP.
    Intent Preservation? (✓) Yes, whether it's "creates" or "injects," it clearly indicates that the malicious file bxsdk64.dll is abnormally using the legitimate program find.exe, and its malicious intent is clear.
```

Final Conclusion: Since all conditions are met, the "predict values" successfully captured a key stage of the complex technique described by the "ground truth." According to the Technical Procedure Equivalence Rule, the two should be considered equivalent relationships.
]

Rule: Entity-Attribute Equivalence Rule
[
Rule Definition

If an entity (subject or object) in the "ground truth" is a generic description of something (e.g., "a remote admin tool"), while the corresponding entity in the "predict values" is the specific name or key attribute of that thing (e.g., "NetSupport Ltd" or "AnyDesk"), and the source text explicitly links the two, then these two relationships can be considered equivalent.

Core Idea

The purpose of our evaluation is to determine if the model has captured the core facts of the intelligence. In the NetSupport example analyzed under this rule, the core fact is "the attacker used the remote admin tool from NetSupport Ltd." Whether it says "used a remote admin tool" or "used NetSupport Ltd," both point to the same fact given the context. The model may have extracted the "attribute" of the thing rather than its "category" due to the diversity of language expression.

Worked Example:

```
1. Locate Key Evidence:
The source text contains this key sentence:

    "at one point we observed a legitimate remote admin client tool by NetSupport Ltd being used to install components during these attacks."

This sentence clearly and directly links the two entities "legitimate remote admin client tool" and "NetSupport Ltd," making it clear that the former is a product of the latter.

2. Item-by-item Comparison and Analysis:

    Subject (sub):
        Ground Truth: Attacker(using: Sodinokibi)
        Predicted Value: Sodinokibi ransomware campaign
        Conclusion: These are equivalent. The former refers to the specific "attacker," while the latter refers to the "attack campaign." In describing attack behavior, they refer to the same core entity.

    Relation (rel):
        Ground Truth: uses
        Predicted Value: uses
        Conclusion: Perfect match.

    Object (obj):
        Ground Truth: legitimate remote admin client tool (Generic description)
        Predicted Value: NetSupport Ltd (Specific name/attribute)
        Conclusion: According to the Entity-Attribute Equivalence Rule and the key evidence from the source text, these are equivalent. The ground truth extracted the category of the tool, while the predicted value extracted the brand/provider of the tool. Both point to the same tool being exploited.
```

Final Conclusion

These two relationships should be judged as TP (True Positive).
]

Rule: General-Specific Equivalence Rule
[
Rule Definition

If an entity (subject or object) or relation in the "ground truth" is a specific, definite instance or type (specific), while the corresponding entity or relation in the "predict values" is a more general, broader description of that specific entity (general), and both have the same core reference, then the two relationships should be considered equivalent. The reverse is also true.

Core Idea

The core question in evaluating the model is whether it has captured the key intelligence. If the information extracted by the prediction is less (or more) detailed than the ground truth but its core identity and intent are correct, we should not penalize it. This rule aims to recognize the model's effective extraction of the same core fact at different granularities. This includes conceptual subsumption (e.g., "PowerShell command" is a type of "malicious command") and hierarchical naming conventions (e.g., the prefix and full name of a security detection signature).

Example A (Conceptual Specificity)

```
Ground Truth: { "sub": "Attacker", "rel": "launch", "obj": "malicious commands" } (General description)
Predicted Value: { "sub": "Sodinokibi", "rel": "uses", "obj": "Encoded PowerShell commands" } (Specific instance)
Analysis:
    In the source text and security context, "Sodinokibi" is a proxy for "Attacker," and the intent of "uses" and "launch" is consistent.
    Object (obj): "Encoded PowerShell commands" is a specific type of "malicious commands." The predicted value provides more precise intelligence and, according to this rule, should be considered a successful match.
```

Example B (Entity Naming Specificity)

```
Ground Truth: { "sub": "PDM:Exploit.Win32.Generic", "rel": "indicates exploitation of", "obj": "CVE-2022-30190" }
Predicted Value: { "sub": "PDM:Exploit", "rel": "indicates", "obj": "CVE-2022-30190" }
Analysis:
    Object (obj) & Relation (rel): The object matches perfectly, and the relations are semantically equivalent.
    Subject (sub): This is the key to the judgment.
        In the security domain, detection names (like antivirus signatures) often follow a hierarchical structure: "Category:Family.Platform.Variant".
        The ground truth `PDM:Exploit.Win32.Generic` is a specific name that includes platform and generic identifiers (specific).
        The predicted value `PDM:Exploit` is its core category prefix (general).
    Conclusion: Although the predicted value omits details, it accurately captures the core identity of the entity. According to this rule, this general description and its specific instance should be considered equivalent. Therefore, the entire relationship is judged as TP.
```

]

Rule: Entity Property Equivalence Rule
[
Definition:
When a "ground truth" relationship T = (T_sub, T_rel, T_obj) (especially one expressing an alias or hierarchical relationship) cannot find a direct match in the "predict values" relationship list, this secondary check procedure must be initiated. This procedure aims to verify if the semantics of T are fully expressed by an internal property (`alias` or `mother_entity`) of a "predict values" entity. If the verification is successful, T is also considered a successful match (TP).

Core Idea:
Information in a knowledge graph can be represented in two equivalent ways: as an "edge" between nodes (an explicit relationship) or as an "attribute" of the node itself (an intrinsic property). This rule aims to recognize the equivalence of these two forms of expression, avoiding unfairly penalizing the model for choosing the more compact and efficient "property" representation.

Verification Process:
This is a mandatory check to be performed after a regular relationship match has failed.

```
1. Prerequisite: The "ground truth" relationship T being evaluated has failed to find a match in the "predict values" relationship list.

2. Step 1: Identify Relationship Type
    Determine if T.rel belongs to one of the following two categories:
        Alias class: `alias-of`, `indicates` (in specific contexts), `also-known-as`, etc.
        Hierarchy class: `is-variant-of`, `is-part-of`, `is-component-of`, `version-of`, etc.
    If T.rel does not belong to these categories, this rule does not apply, and the result for T is FN.

3. Step 2: Perform Property Check
    Iterate through each entity `P_entity` in the "predict values" entity list:
        If T.rel is in the "Alias class":
            Check if either of the following conditions is met:
                `P_entity.name` is equivalent to `T.sub`, AND `T.obj` is in the `P_entity.alias` list.
                `P_entity.name` is equivalent to `T.obj`, AND `T.sub` is in the `P_entity.alias` list.
            If a condition is met, proceed immediately to Step 3.
        If T.rel is in the "Hierarchy class":
            Check if the following condition is met:
                `P_entity.name` is equivalent to `T.sub`, AND `T.obj` is in the `P_entity.mother_entity` list.
            If the condition is met, proceed immediately to Step 3.

4. Step 3: Determine and Record the Result
    As soon as a matching `P_entity` is found in Step 2, stop iterating.
    Judge the result for the "ground truth" relationship T as TP.
    Record the index of the matching predicted entity, `P_entity.index`.
```

Special Note on Output Format:

  Strict Adherence: When and only when this rule is applied, the `index_predict` field in the final JSON output for T must be filled with the matching predicted entity index (e.g., 'predict_entity5'), NOT a relationship index.

Example A: Verifying an Alias Relationship

````
Ground Truth to be evaluated:
    { "index_truth": "truth_relationship2", "sub": "REvil", "rel": "alias-of", "obj": "Sodinokibi" }

Relevant Predicted Entity:
    { "index": "predict_entity5", "name": "Sodinokibi", "alias": ["REvil", "Ransom.Sodinokibi"] }

AI Judgment Process:
    1.  Regular relationship matching fails.
    2.  This rule is initiated. `rel` is `alias-of`, which belongs to the "Alias class".
    3.  Iterate through the predicted entity list and find `predict_entity5`.
    4.  Check condition 2: `predict_entity5.name` ("Sodinokibi") matches `T.obj`, and `T.sub` ("REvil") exists in its `alias` list.
    5.  Condition is met. Judge `truth_relationship2` as TP.

Final Output:
```json
{
  "index_truth": "truth_relationship2",
  "index_predict": "predict_entity5",
  "result": "TP"
}
```
````

Example B: Verifying a Hierarchical Relationship

````
Ground Truth to be evaluated:
    { "index_truth": "truth_relationship9", "sub": "Cerber v5.0.1", "rel": "is-variant-of", "obj": "Cerber" }

Relevant Predicted Entity:
    { "index": "predict_entity12", "name": "Cerber v5.0.1", "mother_entity": ["Cerber"] }

AI Judgment Process:
    1.  Regular relationship matching fails.
    2.  This rule is initiated. `rel` is `is-variant-of`, which belongs to the "Hierarchy class".
    3.  Iterate through the predicted entity list and find `predict_entity12`.
    4.  Check condition: `predict_entity12.name` ("Cerber v5.0.1") matches `T.sub`, and `T.obj` ("Cerber") exists in its `mother_entity` list.
    5.  Condition is met. Judge `truth_relationship9` as TP.

Final Output:
```json
{
  "index_truth": "truth_relationship9",
  "index_predict": "predict_entity12",
  "result": "TP"
}
```
````

]

Rule: Source Text-Based Fusion & Deduction Rule
[
Rule Definition

When no single "predict value" relationship or simple relationship chain can directly match the "ground truth," if you can logically connect one or more discrete "predict value" relationship fragments by incorporating contextual information from the "Source Text" to prove the correctness of the "ground truth," it should also be considered a successful match (TP).

Core Idea

This rule simulates the analytical process of a human expert: we do not view each extracted relationship in isolation. Instead, we place them back into the original text, using the full context to build a complete logical chain to verify a more complex or abstract conclusion. This rule encompasses various modes of reasoning.

Example A (Fusing Multi-Source Predicted Values)

```
Ground Truth: (Spring4Shell, exists in, getCachedIntrospectionResults method)
Predicted Values to be fused:
    (Spring Framework, consists-of, getCachedIntrospectionResults)
    (Spring Framework, has, CVE-2022-22965)
Key Evidence in Source Text:
    "By analogy with the infamous Log4Shell threat, the vulnerability was named Spring4Shell."
    "The bug exists in the getCachedIntrospectionResults method..."
AI's Required Reasoning Process:
    1.  Source Text Analysis: The text explicitly states that Spring4Shell is an alias for the vulnerability CVE-2022-22965 and that this bug exists in the `getCachedIntrospectionResults` method. Based on the source text alone, I can directly infer that the ground truth relationship is valid.
    2.  Predicted Value Verification: Multiple discrete predicted relationships, such as "the framework contains the vulnerability" and "the framework contains the method," are consistent with this fact and collectively paint the full technical background.
    3.  Final Conclusion: Because the "ground truth" relationship can be directly deduced from the "Source Text," and the "predict values" provide supporting context, it is judged as TP.
```

Example B (Fusing Tool and Executor)

```
Ground Truth: (attacker, to execute, arbitrary commands)
Predicted Value to be fused: (JSP web shell, leads-to, Remote code execution)
Key Evidence in Source Text:
    "So an attacker can ... upload a JSP web shell to execute arbitrary commands on a server..."
AI's Required Reasoning Process:
    1.  Source Text Analysis: The text explicitly links the "executor" (attacker), the "tool" (JSP web shell), and the "goal/action" (execute arbitrary commands). It explains that the executor achieves the goal by using the tool.
    2.  Semantic Equivalence Analysis: I recognize that the object of the "ground truth," `arbitrary commands`, and the object of the "predict value," `Remote code execution`, are equivalent in this context, both referring to the technical outcome of "remotely executing arbitrary code."
    3.  Fusion and Deduction:
        The "ground truth" describes the relationship [Executor] -> [Goal].
        The "predict value" describes the relationship [Tool] -> [Goal].
        The source text provides the crucial bridging information: [Executor] uses [Tool].
        Therefore, the prediction describing the tool's capability can be considered a valid and specific explanation for the ground truth describing the attacker's action, with the support of the source text.
    4.  Final Conclusion: According to the rule, judge as TP.
```

]

Rule: Exclusion and Rejection Rule

B. Rejecting Malformed Extractions

Description: If the `sub` or `obj` in a generated relationship is not a clear, specific Named Entity, the relationship is considered an invalid extraction. During evaluation, such results cannot be matched with any valid ground truth relationship, leading to an FN result for the corresponding ground truth. These formatting errors are mainly divided into the following two categories:

B.1. Rejecting Non-Entity Content (Sentences, Clauses, or Complex Descriptions)

  Description: When the `sub` or `obj` contains a full sentence, a long clause, or a complex description mixed with various pieces of information, rather than a clear, independent noun or named entity, the relationship is invalid.
  Examples:
      Example 1: {'sub': 'CVE-2022 - 22965 and CVE-2022 - 22963 : technical details CVE-2022 - 22965 ( Spring4Shell , SpringShell )', 'rel': 'be', 'obj': 'a vulnerability in the Spring Framework that uses data binding functionality to bind data stored within an HTTP request to certain objects used by an application'}
      Example 2: {'sub': 'the getCachedIntrospectionResults method', 'rel': 'exec', 'obj': 'to gain unauthorized access to such objects by passing their class names via an HTTP request'}
      Example 3: {'sub': 'the critical vulnerability CVE-2022 - 22965 in Spring', 'rel': 'be', 'obj': 'similar to the long - closed CVE-2010 - 1622 , where class name checks were added as a fix so that the name did not match classLoader or protectionDomain'}
      Example 4: {'sub': 'A vulnerable configuration', 'rel': 'consist', 'obj': 'of : JDK version 9 + Apache Tomcat for serving the application Spring Framework versions 5.3.0 to 5.3.17 and 5.2.0 to 5.2.19 and below application built as a WAR file CVE-2022 - 22963 is a vulnerability in the routing functionality of Spring Cloud Function that allows code injection through Spring Expression Language ( SpEL ) by adding a special spring.cloud.function.routing-expression header to an HTTP request'}

B.2. Rejecting General Pronouns or Non-Specific References

  Description: When the `sub` or `obj` uses general pronouns (e.g., I, you, which) or words whose reference cannot be independently determined, the relationship is invalid due to the lack of a clear entity.
  Examples:
      Example 5: {'sub': 'which', 'rel': 'make', 'obj': 'CVE-2022 - 22965 a critical threat'}
      Example 6: {'sub': 'you', 'rel': 'fix', 'obj': 'CVE-2022 - 22963'}
      Example 7: {'sub': 'you', 'rel': 'need', 'obj': 'to install the new Spring Cloud Function versions'}
      Example 8: {'sub': 'you', 'rel': 'write', 'obj': 'the new Spring Cloud Function versions'}
      Example 9: {'sub': 'I', 'rel': 'describe', 'obj': 'some of unknown agent , sites people , technical questions andâ\\x80¦ Reply CVE-2022 - 22965 and CVE-2022 - 22963 : technical detailsMitigations for Spring vulnerabilities exploitationIndicators of Compromise IT threat evolution in Q3 2022'}

Unified Judgment Logic: The common problem with both types of extraction results above is that their subject (sub) or object (obj) part is not an independent, well-defined named entity. The first type treats entire sentences or complex descriptions as entities, while the second type uses pronouns that cannot be independently resolved without context. These are all invalid knowledge graph relationships and cannot participate in equivalence judgments.

Rule: Placeholder Entity Resolution Rule (Optimistic Principle)

  Definition: When you encounter an entity formatted like `Attacker(using: X)`, `Attacking(using: Y)`, or `Attacking(from: Z)`, you must recognize that they are not literal strings but semantic placeholders that need to be resolved.
  Core Idea (Optimistic Principle): We first trust that the model had a reason to extract this placeholder. Our task is not to strictly disprove the placeholder's existence, but to use the information provided by the placeholder to do our best to find evidence that supports the main relationship.
  Resolution and Verification Process (Revised):
    1.  Identify and Deconstruct into Context: Upon seeing the `Attacker(...)` or `Attacking(...)` format, immediately deconstruct it into the core perspective or context that must be adopted when evaluating the main relationship.
          `Attacker(using: X)` means ⇔ "The subject I am now evaluating is the attacker associated with tool/vulnerability X."
          `Attacking(using: Y)` means ⇔ "The subject I am now evaluating is the attack campaign associated with tool/vulnerability Y."
    2.  Directly Evaluate the Main Relationship: Immediately begin evaluating the complete predicted relationship (placeholder, rel, obj). Throughout the evaluation, you must look for evidence through the "lens" obtained in step 1.
    3.  Look for Fused Evidence: In the source text, search for fused evidence that can simultaneously support both the "context described by the placeholder" and the "main relationship (rel, obj)."
          Judgment Logic: Do not judge them separately. Directly ask yourself: "Does the source text support the idea that 'the attacker associated with X' performed the `rel` action on `obj`?" As long as such fused evidence can be found, judge it as TP.

Example A (Attacker with Tool) - Applying the Optimistic Principle

  Source Text: "The campaign, orchestrated by an unknown actor, leveraged the CVE-2021-44228 vulnerability to gain initial access."
  Relationship to Evaluate: { "sub": "Attacker(using: CVE-2021-44228)", "rel": "gains-access", "obj": "target_system" }
  New Judgment Logic:
    1.  Get Perspective: The subject of this prediction is "the attacker associated with CVE-2021-44228."
    2.  Directly Evaluate Main Relationship: I need to find evidence in the source text that proves "the attacker associated with CVE-2021-44228 gained access to the target system."
    3.  Look for Fused Evidence: The text says "an unknown actor, leveraged the CVE-2021-44228 vulnerability to gain initial access." This sentence perfectly fuses the two pieces of information:
          Context: There is indeed an `unknown actor` using `CVE-2021-44228`.
          Main Relationship: This attacker did indeed `gain initial access`.
    4.  Conclusion: Fused evidence is conclusive. The relationship is TP.

Example B (Attack with Tool) - Applying the Optimistic Principle

  Source Text: "A recent wave of attacks utilized the EternalBlue exploit to propagate laterally within networks."
  Relationship to Evaluate: { "sub": "Attacking(using: EternalBlue)", "rel": "propagates-laterally", "obj": "networks" }
  New Judgment Logic:
    1.  Get Perspective: The subject of this prediction is "the attack campaign that used EternalBlue."
    2.  Directly Evaluate Main Relationship: I need to find evidence in the source text that proves "the attack campaign that used EternalBlue propagated laterally within networks."
    3.  Look for Fused Evidence: The sentence "A recent wave of attacks utilized the EternalBlue exploit to propagate laterally within networks" contains all the necessary information.
          Context: There was indeed a `wave of attacks` using `EternalBlue`.
          Main Relationship: This attack did indeed `propagate laterally within networks`.
    4.  Conclusion: Fused evidence is conclusive. The relationship is TP.

Rule: Canonical Relation Validation Rule

  Definition: When the `rel` value in a predicted relationship (sub, rel, obj) is an exact match for a `Name` in the provided `background_info:rel_type_definition` list, you must recognize that this `rel` is a canonical name, not a phrase directly extracted from the original text.
  Core Idea: Acknowledge that the generation side and the evaluation side share the same vocabulary of relations. The focus of the evaluation is to verify whether the meaning of the original text conforms to the definition of that canonical relation, not to find the literal value of the canonical name.
  Verification Process:
    1.  Identify Canonical Relation: Check if the `rel` value exists in the `Name` list of `rel_type_definition`.
    2.  Find and Understand Definition: If it is a canonical relation, find its `Definition` and `Note` in `rel_type_definition`.
    3.  Validate Against the Definition: Abandon the idea of a perfect match between the ground truth and the prediction. That is, the `rel` in the ground truth is sometimes the actual phrase from the text or a summary phrase, while the prediction uses a canonical name. If the two relations are equivalent, even if the predicted `rel` does not appear in the original text, it should be considered a TP. Conversely, if the predicted `rel` is a phrase from the text or a summary based on the text, and the ground truth is a canonical name, and they express the same/similar relationship, it should also be considered a TP.

Example (Core Case: Handling canonical `communicates-with`)

  Background: In `rel_type_definition`, the definition of `communicates-with` is "Describes the occurrence of network communication between two entities."
  Source Text: "...analysis revealed network 'traffic between' the infected host and the domain evil.com."
  Predicted Relationship to Evaluate: { "sub": "infected host", "rel": "communicates-with", "obj": "evil.com" }
  Judgment Logic:
    1.  Recognize that the `rel` value "communicates-with" is a canonical name.
    2.  Look up its meaning in the provided definitions: "Describes the occurrence of network communication between two entities."
    3.  Determine if the phrase from the source text "network traffic between" fits the definition of "network communication."
    4.  The answer is yes. The meaning of the text perfectly matches the definition of the canonical relation.
    5.  Conclusion: Even though the words "communicates-with" do not appear in the original text, this relationship should be judged as TP.

Rule: Relationship Hierarchy Inclusion Rule

  Definition: This rule is used to handle cases where the `rel` of the predicted relationship and the ground truth have a hierarchical (parent-child / general-specific) relationship in the `rel_type_definition`. When a broad relationship (e.g., `communicates-with`) includes several specific relationships (e.g., `downloads`, `exfiltrate-to`), the evaluation must be inclusive.
  Core Idea (Optimistic Principle): An incomplete match in relationship granularity should not be directly considered an error. We should prioritize recognizing more precise insights and be tolerant of fundamentally correct general statements. The core of the evaluation is to determine if the prediction is consistent with or compatible with the most precise fact supported by the text.
  Verification Process: When `predict_rel` and `truth_rel` are inconsistent but have a hierarchical relationship in the definitions, handle them according to the following two scenarios:

Scenario One: Prediction is more specific than Ground Truth (Prediction is subclass, Ground Truth is superclass)

  Scenario Example:
      Predicted Relationship: (A, downloads, B)
      Ground Truth: (A, communicates-with, B)
  Judgment Logic: The prediction provides a more precise assertion than the ground truth. In this case, we should temporarily ignore the ground truth and use the source text as the final arbiter.
    1.  Identify Hierarchy: Recognize that `downloads` is a subclass of `communicates-with`.
    2.  Validate the Specific Prediction: Ask yourself: "Does the source text explicitly support the more specific action 'A downloaded a file from B'?"
    3.  Make a Judgment:
          If the source text clearly does not support this specific action (e.g., it only mentions heartbeat traffic), then the prediction is an over-inference. Judge as FN. Otherwise, optimistically judge it as TP.

Scenario Two: Prediction is more general than Ground Truth (Prediction is superclass, Ground Truth is subclass)

  Scenario Example:
      Predicted Relationship: (A, communicates-with, B)
      Ground Truth: (A, downloads, B)
  Judgment Logic: The prediction's assertion (communication occurred) is factually contained within the ground truth's assertion (a download occurred). Although the prediction is less precise, it is not wrong. The optimistic principle should be adopted here.
    1.  Identify Hierarchy: Recognize that `communicates-with` is a superclass of `downloads`.
    2.  Apply Inclusion Principle: Since the more specific fact (download) is established, the more general description (communication) is naturally also true. This is like saying "a dog is an animal," which is correct.
    3.  Make a Judgment:
          Directly judge as TP. We recognize that the prediction is correct at a macro level, even if it loses some detail.

Rule: Entity Alias Equivalence Rule

  Definition: If an entity A is defined as an alias of another entity B in the entity information of either the "ground truth" or "predict values" (i.e., A is in B's `alias` list, or vice versa), then entities A and B should be considered fully interchangeable when evaluating any relationship.
  Core Idea: An alias relationship establishes absolute semantic equivalence between two entities. The core of the evaluation is to understand the fact expressed by the triplet, not to be fixated on the specific name of the entity in that fact. Therefore, a relationship involving entity A, `(A, rel, C)`, and an identical relationship involving its alias B, `(B, rel, C)`, describe the same objective fact and must be judged as equivalent.
  Verification Process:
    1.  When evaluating a predicted relationship `P = (P_sub, P_rel, P_obj)`, compare it with every relationship `T = (T_sub, T_rel, T_obj)` in the ground truth list.
    2.  During comparison, `P_rel` and `T_rel` may be semantically equivalent while the subjects (`sub`) or objects (`obj`) do not match literally.
    3.  In that case, check the entity information: Is `P_sub` an alias of `T_sub` (or vice versa)? Is `P_obj` an alias of `T_obj` (or vice versa)?
    4.  If the relation verbs are equivalent, and each corresponding subject and object either matches literally or is an alias of the other, then judge `P` and `T` as equivalent.

Example:

  Predicted Value:

    ```json
    { "index_predict": "predict_relationship_10", "sub": "APT29", "rel": "uses", "obj": "Cobalt Strike" }
    ```

  Ground Truth:

    ```json
    { "index_truth": "truth_relationship_4", "sub": "Cozy Bear", "rel": "uses", "obj": "Cobalt Strike" }
    ```

  Relevant Entity Information (from either ground truth or predicted values):

    ```json
    { "name": "Cozy Bear", "type": "threat-actor", "alias": ["APT29", "The Dukes"] }
    ```

  Judgment Logic:

    1.  The AI compares `predict_relationship_10` and `truth_relationship_4`.
    2.  The `rel` ("uses") and `obj` ("Cobalt Strike") match perfectly.
    3.  The `sub` does not match ("APT29" vs. "Cozy Bear").
    4.  Applying the rule: The AI checks the entity information and finds that "APT29" is an alias of "Cozy Bear."
    5.  Therefore, `P_sub` and `T_sub` are equivalent. The entire relationship triplet is judged as equivalent.

  Final Output:

    ```json
    {
      "index_predict": "predict_relationship_10",
      "result": "TP",
      "support_reason": "index_truth",
      "index_truth": "truth_relationship_4"
    }
    ```

]

Rule: Entity Hierarchy Inheritance/Induction Rule
[

  Definition: If an entity A's `mother_entity` is B, it means A is a variant, component, or specific instance of B. Under this hierarchical relationship, their relations can be passed in both directions:
      Downward Inheritance: The behaviors or attributes of B are often also possessed by its child entity A.
      Upward Induction: The specific behaviors or attributes of a child entity A can be generalized as a capability of the parent entity B.
      During evaluation, logical deductions in both directions should be considered valid.
  Core Idea: Malware families and attack groups are constantly evolving. The behavior of a variant defines the current capabilities of its family, and the capabilities of the family predict the potential functions of its variants. Evaluation must acknowledge this dynamic, hierarchical relationship, rather than treating each entity in isolation. The core reason `A -> C` is equivalent to `B -> C` is that, from an intelligence perspective, the behavior of A *is* the behavior of B.
  Verification Process:
    1.  Evaluate a predicted relationship `P = (P_sub, P_rel, P_obj)` against a ground truth relationship `T = (T_sub, T_rel, T_obj)`.
    2.  `P_rel` and `P_obj` may be equivalent to `T_rel` and `T_obj` while the subjects `P_sub` and `T_sub` do not match.
    3.  In that case, check the entity information to determine whether a `mother_entity` relationship exists between `P_sub` and `T_sub`.
    4.  If the mother entity of `P_sub` is `T_sub`, or the mother entity of `T_sub` is `P_sub`, then judge `P_sub` and `T_sub` as equivalent in this context.

Example (Upward Induction):

  Predicted Value:

    ```json
    { "index_predict": "predict_relationship_15", "sub": "Cerber v5.0.1", "rel": "encrypts", "obj": ".doc files" }
    ```

  Ground Truth:

    ```json
    { "index_truth": "truth_relationship_7", "sub": "Cerber", "rel": "encrypts", "obj": ".doc files" }
    ```

  Relevant Entity Information:

    ```json
    { "name": "Cerber v5.0.1", "type": "malware", "mother_entity": ["Cerber"] }
    ```

  Judgment Logic:

    1.  The AI compares `predict_relationship_15` and `truth_relationship_7`.
    2.  The `rel` and `obj` match. The `sub` does not match ("Cerber v5.0.1" vs. "Cerber").
    3.  Applying Rule Thirteen: The AI checks the entity information and finds that the `mother_entity` of "Cerber v5.0.1" is "Cerber."
    4.  The AI performs upward induction: The encryption behavior of a specific variant can be generalized to the behavior of its family. Therefore, `P_sub` and `T_sub` are equivalent in this relationship.

  Final Output:

    ```json
    {
      "index_predict": "predict_relationship_15",
      "result": "TP",
      "support_reason": "index_truth",
      "index_truth": "truth_relationship_7"
    }
    ```

]

The following are demonstrative examples for you to study carefully and apply the judgment logic therein:

\#Demonstrative Example (General-Specific Equivalence Rule)
Ground Truth Relationship:
'sub': "Magecart"
'rel': "purchased SSL certificates from" ('rel_type':["other"])
'obj': "Comodo"
Relationship for Evaluation:
'sub': "Magecart's attack on British Airways"
'rel': "uses" ('rel_type':["uses"], 'tactic':"Defense Evasion")
'obj': "SSL certificates (Comodo)"
The AI assistant should recognize during judgment:
The 'sub' descriptions differ, but both point to Magecart and its attack campaign;
The 'rel' descriptions use different words, but combined with the context and the 'obj', both convey the fact of utilizing SSL certificates from Comodo;
'rel_type' and 'tactic' are different, but these two values do not participate in the equivalence judgment of 'rel';
Therefore, the two should be considered equivalent relationships.

\#Demonstrative Example (Core Rule 2: Semantic Equivalence of `rel` Attribute)
Original Relationship:
'sub': "Android.Reputation.1"
'rel': "has" ('rel_type':["has"])
'obj': "Google Play icon"
Corresponding Predicted Relationship:
'sub': "Android.Reputation.1"
'rel': "uses" ('rel_type':["uses"], 'tactic':["Persistence", "Defense Evasion"])
'obj': "Google Play icon"
The AI assistant should recognize during judgment:
Although the 'rel' in the original relationship is "has" and in the predicted result is "uses," both describe the usage or possession relationship between Android.Reputation.1 and the Google Play icon in context;
Therefore, the two should be considered equivalent relationships.

\#Demonstrative Example (Core Rule 2: Semantic Equivalence of `rel` Attribute)
Original Relationship:
{
"sub": "ProtonVPN_win_v1.10.0.exe",
"rel": "indicates",
"rel_type": ["indicates"],
"obj": "AZORult"
},

Corresponding Predicted Relationship:
'sub': "ProtonVPN_win_v1.10.0.exe"
'rel': "delivers" ('rel_type':["delivers"], 'tactic':["Initial Access"])
'obj': "AZORult"

The AI assistant should recognize during judgment:
Although the 'rel' in one relationship is "indicates" and in the other is "delivers," both convey the same meaning in context, i.e., the presence of ProtonVPN_win_v1.10.0.exe reveals the presence of AZORult; the subject (sub) and object (obj) are identical in both relationships, pointing to the same entities.

\#Demonstrative Example: Chain Deduction Equivalence Rule and Behavior-Technique Equivalence Rule
Original Relationship:
{
"sub": "Earth Baku",
"rel": "uses",
"rel_type": ["uses"],
"obj": "Cobalt Strike"
}

Corresponding Predicted Relationships:
{
'sub': "Earth Baku",
'rel': "uses",
'rel_type': ["uses"],
'tactic': ["Execution"],
'obj': "Godzilla webshell"
}

{
'sub': "Godzilla webshell",
'rel': "delivers",
'rel_type': ["delivers"],
'tactic': ["Execution"],
'obj': "Cobalt Strike"
}

The AI assistant should recognize during judgment: Although the ground truth directly states "Earth Baku uses Cobalt Strike," the prediction splits this into "Earth Baku uses Godzilla webshell" and "Godzilla webshell delivers Cobalt Strike," both convey the same information in the overall context, i.e., Earth Baku is associated with Cobalt Strike through the intermediate entity Godzilla webshell; after chain deduction, the subject (Earth Baku) and object (Cobalt Strike) of the ground truth are consistent with the final relationship in the prediction; therefore, these relationships should be considered equivalent.

Original Relationship:
{
"sub": "StealthReacher",
"rel": "uses",
"rel_type": ["uses"],
"obj": "AES encryption"
}
Corresponding Predicted Relationships:
{
"sub": "StealthReacher",
"rel": "is a variant of",
"rel_type": ["variant-of"],
"tactic": ["other"],
"obj": "StealthVector"
}
{
"sub": "StealthVector",
"rel": "uses",
"rel_type": ["uses"],
"tactic": ["Defense Evasion"],
"obj": "AES"
}
The AI assistant should recognize during judgment:
Although the ground truth directly states "StealthReacher uses AES encryption," the prediction splits this into "StealthReacher is a variant of StealthVector" and "StealthVector uses AES," both convey the same information in the overall context, i.e., there is an association between StealthReacher and AES;
After chain deduction, the subject (StealthReacher) and object (AES encryption) of the ground truth are consistent with the final relationship in the prediction, where "AES encryption" and "AES" can be considered the same entity;
Additional information like 'rel_type' and 'tactic' does not participate in the semantic equivalence judgment of the relationship;
Therefore, these relationships should be considered equivalent.

\#Demonstrative Example: (Core Rule 3: Exclusion Rule)
{"sub": "Avast Threat Research", "rel": "published tweet about", "rel_type": ["research-describes-analysis-of-characterizes-detects"], "obj": "HermeticRansom"}
Corresponding Predicted Relationship: Any.
The AI assistant should recognize during judgment: This relationship is related to self-promotion/introduction and therefore should not be counted. The purpose of this relationship is that Avast Threat Research published a tweet about HermeticRansom on social media, which is a self-promotional relationship, not one related to threat information or attack activity. Similarly, all relationships where an organization/company/research team discovered/published/announced/reported/researched/analyzed/investigated/introduced/commented on/discussed/mentioned something are self-promotional/introductory relationships and should not be counted. For example: an organization-discovered-a malware, a company-published-a security report, a research team-analyzed-a vulnerability, etc.

\#Demonstrative Example: (General-Specific Equivalence Rule)
Ground Truth Relationship:
{"sub": "HermeticRansom", "rel": "uses", "rel_type": ["uses"], "obj": "Golang GUID library"}
Corresponding Predicted Relationship:
{"sub": "HermeticRansom", "rel": "uses", "rel_type": ["uses"], "tactic": ["Resource Development"], "obj": "Golang"}
The AI assistant should recognize during judgment:
Although the original relationship directly describes "HermeticRansom uses Golang GUID library," and the prediction describes "HermeticRansom uses Golang," both convey the same information in the overall context, i.e., HermeticRansom uses Golang-related libraries or tools; therefore, these two relationships should be considered equivalent.

\#Demonstrative Example: (Source Text-Based Fusion & Deduction Rule)
Ground Truth Relationship: {
"sub": "RSA-OAEP",
"rel": "uses",
"rel_type": ["uses"],
"obj": "SHA-256"
},
Corresponding Predicted Relationships:
{
"sub": "HermeticRansom",
"rel": "uses",
"rel_type": ["uses"],
"tactic": ["Impact"],
"obj": "RSA-OAEP"
},
{
"sub": "HermeticRansom",
"rel": "uses",
"rel_type": ["uses"],
"tactic": ["Impact"],
"obj": "SHA-256"
},

The AI assistant should recognize during judgment:
Although the original relationship directly describes "RSA-OAEP uses SHA-256," and the prediction describes "HermeticRansom uses RSA-OAEP" and "HermeticRansom uses SHA-256," both convey the same information in the overall context, i.e., HermeticRansom uses RSA-OAEP and SHA-256 for encryption or data processing;
After chain deduction, the subject (RSA-OAEP) and object (SHA-256) of the ground truth are consistent with the final relationship in the prediction;
Therefore, these relationships should be considered equivalent.

\#Demonstrative Example: (Core Rule 3: Exclusion Rule)
{
"sub": "John Doe",
"rel": "analyzed",
"rel_type": ["research-describes-analysis-of-characterizes-detects"],
"obj": "Tomiris malware"
}

Corresponding Predicted Relationship:
{
"sub": "John Doe",
"rel": "published report about",
"rel_type": ["research-describes-analysis-of-characterizes-detects"],
"obj": "Tomiris malware"
}

The AI assistant should recognize during judgment:
Whether it is "analyzed" or "published report about a malware," both are expressions of self-promotion or content publication, providing no intelligence about the malware's behavior, technology usage, or attack targets. The AI should exclude such relationships from evaluation.

\#Demonstrative Example: (Relationship Not Belonging to General-Specific Equivalence Rule)
Original Relationship:
{
"sub": "Win32.Generic",
"rel": "indicates exploitation of",
"rel_type": ["indicates"],
"obj": "CVE-2022-22965"
},
Corresponding Predicted Relationship:
{"sub": "an attacker", "rel": "exec", "obj": "CVE-2022-22965"}
The AI assistant should recognize during judgment:
The focus of this original relationship is "Win32.Generic," whereas the corresponding prediction does not contain an equivalent obj/sub. Without using chain deduction, these two relationships are not equivalent. If chain deduction is used based on evidence that "attacker uses Win32.Generic" or "attacker uses some_malware, malware uses Win32.Generic," then this relationship could be considered equivalent. However, without evidence for chain deduction, these two relationships are not equivalent.

\#Demonstrative Example: (Core Rule 3: Exclusion Rule)
{'sub': 'which', 'rel': 'make', 'obj': 'CVE-2022 - 22965 a critical threat'}{'sub': 'CVE-2022 - 22965 and CVE-2022 - 22963 : technical details CVE-2022 - 22965 ( Spring4Shell , SpringShell )', 'rel': 'be', 'obj': 'a vulnerability in the Spring Framework that uses data binding functionality to bind data stored within an HTTP request to certain objects used by an application'}{'sub': 'the getCachedIntrospectionResults method', 'rel': 'exec', 'obj': 'to gain unauthorized access to such objects by passing their class names via an HTTP request'}, {'sub': 'the critical vulnerability CVE-2022 - 22965 in Spring', 'rel': 'be', 'obj': 'similar to the long - closed CVE-2010 - 1622 , where class name checks were added as a fix so that the name did not match classLoader or protectionDomain'}, {'sub': 'A vulnerable configuration', 'rel': 'consist', 'obj': 'of : JDK version 9 + Apache Tomcat for serving the application Spring Framework versions 5.3.0 to 5.3.17 and 5.2.0 to 5.2.19 and below application built as a WAR file CVE-2022 - 22963 is a vulnerability in the routing functionality of Spring Cloud Function that allows code injection through Spring Expression Language ( SpEL ) by adding a special spring.cloud.function.routing-expression header to an HTTP request'}{'sub': 'you', 'rel': 'fix', 'obj': 'CVE-2022 - 22963'}, {'sub': 'you', 'rel': 'need', 'obj': 'to install the new Spring Cloud Function versions'}, {'sub': 'you', 'rel': 'write', 'obj': 'the new Spring Cloud Function versions'}, {'sub': 'I', 'rel': 'describe', 'obj': 'some of unknown agent , sites people , technical questions andâ¦ Reply CVE-2022 - 22965 and CVE-2022 - 22963 : technical detailsMitigations for Spring vulnerabilities exploitationIndicators of Compromise IT threat evolution in Q3 2022'}

The AI assistant should recognize during judgment: It is clear that these are all failed Entity Relationship extraction results. Some use very long clauses for sub/obj, some have sub/obj that are not specific entities (like you, I, which), some have sub/obj that are clauses, and some have sub/obj that are incomplete sentences. None of these are specific entity relationships, so these relationships will not match the ground truth and will not be counted as TP.

\#Demonstrative Example: (Core Rule 3: Exclusion Rule)
Corresponding Predicted Relationship:
{ "sub": "CVE-2022 - 22965 and CVE-2022 - 22963 : technical details CVE-2022 - 22965 ( Spring4Shell , SpringShell )", "rel": "be", "obj": "a vulnerability in the Spring Framework that uses data binding functionality to bind data stored within an HTTP request to certain objects used by an application" }
{ "sub": "the critical vulnerability CVE-2022 - 22965 in Spring", "rel": "be", "obj": "similar to the long - closed CVE-2010 - 1622 , where class name checks were added as a fix so that the name did not match classLoader or protectionDomain" },
The AI assistant should recognize during judgment:
The `obj` field in the relationship is clearly verbose, almost an entire descriptive sentence from the original text. Normal entity relationships should be concise. The `obj` is too long, containing complete clauses (a `that`-clause and a `to`-clause), and lacks a precise, independent entity definition. Both of these should be failed extraction results. These two relationships will not match any ground truth relationship and will not be counted as TP.

Background Information: rel_type_definition
''' + rel_type_definition
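
The Entity Alias Equivalence Rule and the Entity Hierarchy Inheritance/Induction Rule in the prompt above describe a lookup that can also be sketched deterministically, for example as a sanity check on the LLM judge's output. The following is a minimal sketch under stated assumptions, not part of the notebook's pipeline: the helper names `build_entity_index`, `entities_equivalent`, and `triples_equivalent` are hypothetical, entity records are assumed to carry the `name`, `alias`, and `mother_entity` fields shown in the prompt's examples, and `rel` is compared literally, whereas the prompt leaves semantic `rel` equivalence to the judge.

```python
from typing import Dict, List


def build_entity_index(entities: List[dict]) -> Dict[str, dict]:
    """Index entity records by their `name` field (hypothetical helper)."""
    return {entity["name"]: entity for entity in entities}


def entities_equivalent(a: str, b: str, index: Dict[str, dict]) -> bool:
    """Return True if a and b match literally, by alias, or by mother_entity."""
    if a == b:
        return True
    rec_a, rec_b = index.get(a, {}), index.get(b, {})
    # Alias rule: either name appears in the other's alias list.
    if b in rec_a.get("alias", []) or a in rec_b.get("alias", []):
        return True
    # Hierarchy rule: either entity lists the other as its mother_entity
    # (covers both downward inheritance and upward induction).
    if b in rec_a.get("mother_entity", []) or a in rec_b.get("mother_entity", []):
        return True
    return False


def triples_equivalent(pred: dict, truth: dict, index: Dict[str, dict]) -> bool:
    """Compare two triples; `rel` is matched literally in this sketch."""
    return (
        pred["rel"] == truth["rel"]
        and entities_equivalent(pred["sub"], truth["sub"], index)
        and entities_equivalent(pred["obj"], truth["obj"], index)
    )


# Entity records taken from the two examples in the prompt.
entity_info = [
    {"name": "Cozy Bear", "type": "threat-actor", "alias": ["APT29", "The Dukes"]},
    {"name": "Cerber v5.0.1", "type": "malware", "mother_entity": ["Cerber"]},
]
index = build_entity_index(entity_info)

pred = {"sub": "APT29", "rel": "uses", "obj": "Cobalt Strike"}
truth = {"sub": "Cozy Bear", "rel": "uses", "obj": "Cobalt Strike"}
print(triples_equivalent(pred, truth, index))
```

One limitation of this sketch: two aliases of the same entity (for example "APT29" and "The Dukes") are not recognized as equivalent to each other, because neither is a record name in the index.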

# Key functions

In [None]:

import os
import re
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import List, Dict, Any, Union
from openai import OpenAI, OpenAIError
from langchain_text_splitters import RecursiveCharacterTextSplitter
try:
    from tqdm import tqdm
except ImportError:
    tqdm = None

def ask_group_link(
    prompt_list: List[List[Dict[str, str]]],
    model: str,
    token: int,
    temp: float,
    max_workers: int = 64,
    api_key: str = None,
    api_base: str = None
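) -> List[Union[str, None]]:
    """
    Send every prompt in `prompt_list` to the chat completions API concurrently.

    Returns one entry per prompt, in input order; an entry is None when its request failed.
    Raises ValueError when no API key is available.
    """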
    effective_api_key = api_key or os.getenv("OPENAI_API_KEY")
    if not effective_api_key:
        raise ValueError("API key must be provided either as an argument or as an OPENAI_API_KEY environment variable.")

    def _call_openai_api(prompt: List[Dict[str, str]]) -> Union[str, None]:
        """A helper function to call the API for a single prompt."""
        try:
            client = OpenAI(api_key=effective_api_key, base_url=api_base)
            response = client.chat.completions.create(
                model=model,
                messages=prompt,
                max_tokens=token,
                temperature=temp
            )
            return response.choices[0].message.content
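        except OpenAIError as e:
            # Not every OpenAIError carries a dict `body`; guard so this handler cannot raise.
            body = getattr(e, "body", None)
            message = body.get("message") if isinstance(body, dict) else "No message"
            print(f"An API error occurred: {e.__class__.__name__} - {message}")
            return None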
        except Exception as e:
            print(f"An unexpected error occurred during an API call: {e}")
            return None

    results = [None] * len(prompt_list)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_index = {
            executor.submit(_call_openai_api, prompt): i
            for i, prompt in enumerate(prompt_list)
        }

        iterable = as_completed(future_to_index)
        if tqdm:
            iterable = tqdm(iterable, total=len(prompt_list), desc="Processing API Requests")

        for future in iterable:
            index = future_to_index[future]
            try:
                results[index] = future.result()
            except Exception as e:
                print(f"Request for prompt at index {index} generated an exception: {e}")
                results[index] = None
    
    return results
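
The order-preserving fan-out used by `ask_group_link` (submit every item, remember each future's index, then fill a pre-sized result list as futures complete) can be tried without an API key. The sketch below isolates that pattern; `run_in_order` and `fake_llm` are hypothetical names introduced only for illustration, and `fake_llm` stands in for the real API call.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable, List, Optional


def run_in_order(
    items: List[Any],
    worker: Callable[[Any], Any],
    max_workers: int = 4,
) -> List[Optional[Any]]:
    """Run `worker` on every item concurrently and return results in input order."""
    results: List[Optional[Any]] = [None] * len(items)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the position of the item that produced it.
        future_to_index = {
            executor.submit(worker, item): i for i, item in enumerate(items)
        }
        for future in as_completed(future_to_index):
            index = future_to_index[future]
            try:
                results[index] = future.result()
            except Exception:
                # A failed item leaves None in its slot, as in ask_group_link.
                results[index] = None
    return results


def fake_llm(prompt: List[dict]) -> str:
    """Stand-in for the API call: upper-cases the first message's content."""
    if not prompt:
        raise ValueError("empty prompt")
    return prompt[0]["content"].upper()


prompts = [
    [{"role": "user", "content": "a"}],
    [],  # triggers the failure path
    [{"role": "user", "content": "c"}],
]
outputs = run_in_order(prompts, fake_llm)
print(outputs)
```

Because results are written by index rather than appended in completion order, callers can zip the output with the original prompt list even when requests finish out of order.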



In [None]:
import sys
import random
import time
import json
import traceback
import importlib
import os
import re
from json_repair import loads as json_repair_loads
from typing import List, Dict, Union

# It's assumed that the 'tools' module contains the 'ask_group_link' function,
# which may use the following libraries.
from concurrent.futures import ThreadPoolExecutor, as_completed
from openai import OpenAI, OpenAIError

MY_API_KEY = os.getenv("MY_API_KEY", "your_api_key_here")
MY_API_BASE = os.getenv("MY_API_BASE", "your_api_base_url_here")


def generate_evaluation_prompts(
    knowledge_graph_collection: dict,
    method_name: str,
    start_index: int,
    end_index: int,
    llm_params: dict,
    evaluation_target: str = 'all',  # Can be 'precision', 'recall', or 'all'
    ground_truth_key: str = 'ground_truth',
    context_parent_key: str = None,
    context_child_key: str = None
):
    """
    Generates prompts for evaluation and calls the LLM to get results.

    - `evaluation_target` parameter controls the generation of tasks for precision, recall, or both.
    - The returned list contains dictionaries with an 'eval_type' field to distinguish tasks.
    """
    def prompt_maker(ground_truth: str, prediction: str, context: str, eval_type: str) -> List[Dict[str, str]]:
        # Assumes the tools module has these two prompt templates defined.
        if eval_type == 'recall':
            prompt_template = Precision_Prompt
            main_content = (
                f"--- Ground Truth ---\n{ground_truth}\n\n"
                f"--- Prediction Pool ---\n{prediction}"
            )
        elif eval_type == 'precision':
            prompt_template = Recall_Prompt
            main_content = (
                f"--- Prediction to Evaluate ---\n{prediction}\n\n"
                f"--- Reference Standard (Ground Truth) ---\n{ground_truth}"
            )
        else:
            return None

        final_content = f"{prompt_template}\n\n{main_content}"
        
        if context:
            final_content += (
                f"\n\n--- Accompanying Context ---\n{context}"
            )
        # Treat a missing (None) context like an empty one instead of raising on len().
        if len(context or "") <= 5:
            print(f"Warning: Context is too short ({len(context or '')} chars), may not provide enough information.")
        return [{"role": "user", "content": final_content}]

    tasks_to_run = []
    
    try:
        full_data_len = len(knowledge_graph_collection[ground_truth_key])
        start = max(0, start_index)
        end = min(full_data_len, end_index)
    except (KeyError, TypeError) as e:
        print(f"Error: Failed to access the knowledge graph dictionary: {e}.")
        return None

    print(f"Preparing evaluation prompts for method '{method_name}' for {end - start} indices (range: {start}-{end-1})...")

    for i in range(start, end):
        try:
            truth_kg_string = knowledge_graph_collection[ground_truth_key][i]
            predicted_kg_string = knowledge_graph_collection[method_name][i]
            
            context_str = ""
            if context_parent_key and context_child_key:
                context_str = knowledge_graph_collection.get(context_parent_key, [{}])[i].get(context_child_key, "")

            if not isinstance(truth_kg_string, str) or not isinstance(predicted_kg_string, str):
                continue
            
            if evaluation_target in ['recall', 'all']:
                prompt = prompt_maker(truth_kg_string, predicted_kg_string, context_str, 'recall')
                if prompt:
                    tasks_to_run.append({'index': i, 'type': 'recall', 'prompt': prompt})
            
            if evaluation_target in ['precision', 'all']:
                if predicted_kg_string and "#Relationship_List_Start##Relationship_List_End#" not in predicted_kg_string:
                    prompt = prompt_maker(truth_kg_string, predicted_kg_string, context_str, 'precision')
                    if prompt:
                        tasks_to_run.append({'index': i, 'type': 'precision', 'prompt': prompt})

        except (KeyError, IndexError) as e:
            print(f"Warning: Error processing index {i}, skipped. Error: {e}")
            continue

    if not tasks_to_run:
        print("No executable prompts were generated, terminating task.")
        return []

    prompts_to_run = [task['prompt'] for task in tasks_to_run]
    print(f"✅ Successfully generated {len(prompts_to_run)} evaluation prompts. Calling LLM...")
    
    # Forward the caller-supplied settings (model, token, temp, max_workers,
    # api_key, api_base) instead of hard-coding them, so the judge
    # configuration built in main() is actually used.
    llm_results = ask_group_link(
        prompt_list=prompts_to_run,
        **llm_params,
    )

    ans = []
    if llm_results and len(llm_results) == len(tasks_to_run):
        for i, result_str in enumerate(llm_results):
            original_idx = tasks_to_run[i]['index']
            eval_type = tasks_to_run[i]['type']
            ans.append({'original_index': original_idx, 'eval_type': eval_type, 'result': result_str})
    else:
        print(f"Warning: Mismatch between LLM results ({len(llm_results) if llm_results else 0}) and tasks ({len(tasks_to_run)}).")

    print("Evaluation prompt generation and LLM calls are complete.")
    return ans

def extract_kg_components(kg_string: str):
    """Helper function: Extracts entity and relationship lists from a KG string.

    Returns an (entity_content, rel_content) tuple; either element is None
    when its marker pair is not found.
    """
    if not isinstance(kg_string, str):
        return None, None

    # Markers may carry an optional "Final_" prefix, e.g. #Final_Entity_List_Start#.
    entity_pattern = re.compile(r"#(?:Final_)?Entity_List_Start#(.*?)#(?:Final_)?Entity_List_End#", re.DOTALL)
    rel_pattern = re.compile(r"#(?:Final_)?Relationship_List_Start#(.*?)#(?:Final_)?Relationship_List_End#", re.DOTALL)

    entity_match = entity_pattern.search(kg_string)
    rel_match = rel_pattern.search(kg_string)

    entity_content = entity_match.group(1).strip() if entity_match else None
    rel_content = rel_match.group(1).strip() if rel_match else None
    return entity_content, rel_content

def process_and_save_precision_results(
    knowledge_graph_collection: dict,
    raw_evaluation_results: list,
    prediction_key: str,
    ground_truth_key: str,
    metadata_key: str,
    output_directory: str
) -> dict:
    """
    Processes precision evaluation results, calculates TP/FP, saves detailed reports,
    and returns a dictionary of precision scores for each index.
    """
    def create_local_lookup(kg_string):
        lookup_dict = {}
        if not kg_string: return lookup_dict
        entity_content, rel_content = extract_kg_components(kg_string)
        try:
            entities = json_repair_loads(entity_content) if entity_content else []
            relationships = json_repair_loads(rel_content) if rel_content else []
            for item in entities + relationships:
                if isinstance(item, dict) and 'index' in item:
                    lookup_dict[item['index']] = item
        except Exception as e:
            print(f"   - [Warning] Failed to parse and create local lookup table: {e}")
        return lookup_dict

    method_output_dir = os.path.join(output_directory, prediction_key)
    os.makedirs(method_output_dir, exist_ok=True)
    print(f"Precision evaluation reports will be saved to: {method_output_dir}")

    precision_scores_by_index = {}
    total_processed_count = 0
    
    truth_source_list = knowledge_graph_collection.get(ground_truth_key, [])
    predicted_source_list = knowledge_graph_collection.get(prediction_key, [])
    
    precision_results = [item for item in raw_evaluation_results if item.get('eval_type') == 'precision']

    print("\n--- Starting to process precision evaluation results, calculate precision, and save files ---")
    
    for item in precision_results:
        original_index = item['original_index']
        result_str = item['result']
        
        detailed_report_list = []
        tp_count, fp_count = 0, 0
        try:
            current_truth_kg = truth_source_list[original_index]
            current_predicted_kg = predicted_source_list[original_index]
            local_truth_lookup = create_local_lookup(current_truth_kg)
            local_prediction_lookup = create_local_lookup(current_predicted_kg)
            
            cleaned_str = result_str.strip().removeprefix('```json').strip().removesuffix('```').strip()
            eval_data = json.loads(cleaned_str)

            for eval_item in eval_data:
                result_type = eval_item.get('result')
                predict_idx = eval_item.get('index_predict')
                
                if result_type == 'TP':
                    tp_count += 1
                    report_entry = {
                        'prediction': local_prediction_lookup.get(predict_idx, {'error': f'Index {predict_idx} not found'}),
                        'outcome': 'TP',
                        'justification': eval_item.get('support_reason')
                    }
                    if eval_item.get('support_reason') == 'index_truth':
                        truth_idx = eval_item.get('index_truth')
                        report_entry['matching_ground_truth'] = local_truth_lookup.get(truth_idx, {'error': f'Index {truth_idx} not found'})
                    elif eval_item.get('support_reason') == 'context':
                        report_entry['supporting_quote'] = eval_item.get('context')
                    detailed_report_list.append(report_entry)

                elif result_type == 'FP':
                    fp_count += 1
                    report_entry = {
                        'prediction': local_prediction_lookup.get(predict_idx, {'error': f'Index {predict_idx} not found'}),
                        'outcome': 'FP'
                    }
                    for i, suffix in enumerate(['top', 'second', 'third']):
                        match_idx = eval_item.get(f'index_truth_may_match_{suffix}')
                        match_ctx = eval_item.get(f'context_may_match_{suffix}')
                        if match_idx and match_idx != 'None':
                            report_entry[f'potential_match_truth_{i+1}'] = local_truth_lookup.get(match_idx, {'error': f'Index {match_idx} not found'})
                        if match_ctx and match_ctx != 'None':
                            report_entry[f'potentially_misleading_context_{i+1}'] = match_ctx
                    detailed_report_list.append(report_entry)

            precision = tp_count / (tp_count + fp_count) if (tp_count + fp_count) > 0 else 0.0
            precision_scores_by_index[original_index] = precision
            
            translated_text = knowledge_graph_collection[metadata_key][original_index].get('context_translation', 'Error: Context translation not found.')
            
            final_output_data = {
                "evaluation_details": detailed_report_list,
                "source_text_and_translation": translated_text
            }

            output_filename = os.path.join(method_output_dir, f"{prediction_key}_index_{original_index}.json")
            with open(output_filename, 'w', encoding='utf-8') as f:
                json.dump(final_output_data, f, ensure_ascii=False, indent=2)
            
            print(f"Index {original_index}: TP={tp_count}, FP={fp_count}, Precision={precision:.2f} -> Saved")
            total_processed_count += 1

        except Exception as e:
            print(f"Index {original_index}: Processing failed, skipped. Error: {e}")
            precision_scores_by_index[original_index] = 0.0
            
    average_precision = sum(precision_scores_by_index.values()) / len(precision_scores_by_index) if precision_scores_by_index else 0
    print("\n" + "="*50)
    print(f"Processing complete for all {total_processed_count} valid results.")
    print(f"Average Precision: {average_precision:.4f}")
    print("="*50)
    
    return precision_scores_by_index

def process_and_save_recall_results(
    knowledge_graph_collection: dict,
    raw_evaluation_results: list,
    prediction_key: str,
    ground_truth_key: str,
    metadata_key: str,
    output_directory: str
) -> dict:
    """
    Processes recall evaluation results, calculates TP/FN, saves detailed reports,
    and returns a dictionary of recall scores for each index.
    """
    def create_local_lookup(kg_string):
        lookup_dict = {}
        if not kg_string: return lookup_dict
        entity_content, rel_content = extract_kg_components(kg_string)
        try:
            entities = json_repair_loads(entity_content) if entity_content else []
            relationships = json_repair_loads(rel_content) if rel_content else []
            for item in entities + relationships:
                if isinstance(item, dict) and 'index' in item:
                    lookup_dict[item['index']] = item
        except Exception as e:
            print(f"   - [Warning] Failed to parse and create local lookup table: {e}")
        return lookup_dict

    method_output_dir = os.path.join(output_directory, prediction_key)
    os.makedirs(method_output_dir, exist_ok=True)
    print(f"Recall evaluation reports will be saved to: {method_output_dir}")

    recall_scores_by_index = {}
    total_processed_count = 0
    
    truth_source_list = knowledge_graph_collection.get(ground_truth_key, [])
    predicted_source_list = knowledge_graph_collection.get(prediction_key, [])

    recall_results = [item for item in raw_evaluation_results if item.get('eval_type') == 'recall']

    print("\n--- Starting to process recall evaluation results, calculate recall, and save files ---")
    
    for item in recall_results:
        original_index = item['original_index']
        result_str = item['result']
        
        detailed_report_list = []
        tp_count, fn_count = 0, 0
        try:
            current_truth_kg = truth_source_list[original_index]
            current_predicted_kg = predicted_source_list[original_index]
            local_truth_lookup = create_local_lookup(current_truth_kg)
            local_prediction_lookup = create_local_lookup(current_predicted_kg)
            
            cleaned_str = result_str.strip().removeprefix('```json').strip().removesuffix('```').strip()
            eval_data = json.loads(cleaned_str)

            for eval_item in eval_data:
                result_type = eval_item.get('result')
                truth_idx = eval_item.get('index_truth')
                
                if result_type == 'TP':
                    tp_count += 1
                    predict_idx = eval_item.get('index_predict')
                    detailed_report_list.append({
                        'ground_truth': local_truth_lookup.get(truth_idx, {'error': f'Index {truth_idx} not found'}),
                        'prediction': local_prediction_lookup.get(predict_idx, {'error': f'Index {predict_idx} not found'}),
                        'outcome': 'TP'
                    })
                elif result_type == 'FN':
                    fn_count += 1
                    report_entry = {
                        'ground_truth': local_truth_lookup.get(truth_idx, {'error': f'Index {truth_idx} not found'}),
                        'prediction': 'missing',
                        'outcome': 'FN'
                    }
                    
                    for i, suffix in enumerate(['top', 'second', 'third']):
                        match_idx = eval_item.get(f'index_predict_may_match_{suffix}')
                        if match_idx and match_idx != 'None':
                            report_entry[f'potential_matching_prediction_{i+1}'] = local_prediction_lookup.get(match_idx, {'error': f'Index {match_idx} not found'})
                    
                    detailed_report_list.append(report_entry)

            recall = tp_count / (tp_count + fn_count) if (tp_count + fn_count) > 0 else 0.0
            recall_scores_by_index[original_index] = recall
            
            translated_text = knowledge_graph_collection[metadata_key][original_index].get('context_translation', 'Error: Context translation not found.')
            
            final_output_data = {
                "evaluation_details": detailed_report_list,
                "source_text_and_translation": translated_text
            }

            output_filename = os.path.join(method_output_dir, f"{prediction_key}_index_{original_index}.json")
            with open(output_filename, 'w', encoding='utf-8') as f:
                json.dump(final_output_data, f, ensure_ascii=False, indent=2)
            
            print(f"Index {original_index}: TP={tp_count}, FN={fn_count}, Recall={recall:.2f} -> Saved")
            total_processed_count += 1

        except Exception as e:
            print(f"Index {original_index}: Processing failed, skipped. Error: {e}")
            recall_scores_by_index[original_index] = 0.0
            
    average_recall = sum(recall_scores_by_index.values()) / len(recall_scores_by_index) if recall_scores_by_index else 0
    print("\n" + "="*50)
    print(f"Processing complete for all {total_processed_count} valid results.")
    print(f"Average Recall: {average_recall:.4f}")
    print("="*50)
    
    return recall_scores_by_index

def print_summary_report(summary_data: dict):
    """
    Prints a clear, final summary report grouped by the judge model.
    """
    print("\n" + "="*90)
    print("--- F I N A L   E V A L U A T I O N   S U M M A R Y   R E P O R T ---")
    print("="*90)

    if not summary_data:
        print("No evaluation results were generated.")
        return

    for judge_name, methods_assessed in summary_data.items():
        print(f"\n--- Judge Model: {judge_name} ---")
        
        header = f"{'Method Name':<65} | {'Avg Precision':<15} | {'Avg Recall':<15}"
        print(header)
        print("-" * len(header))

        for method_name, results_by_index in methods_assessed.items():
            if not results_by_index:
                continue

            precision_list = [v['precision'] for v in results_by_index.values() if 'precision' in v]
            recall_list = [v['recall'] for v in results_by_index.values() if 'recall' in v]

            avg_precision = sum(precision_list) / len(precision_list) if precision_list else 0.0
            avg_recall = sum(recall_list) / len(recall_list) if recall_list else 0.0
            
            print(f"{method_name:<65} | {avg_precision:<15.4f} | {avg_recall:<15.4f}")
    
    print("\n" + "="*90)

def main():
    """
    Main execution function using a nested loop for evaluation with robust error handling.
    """
    # --- 1. Initialization and Parameter Definition ---
    
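    # Load KNOWLEDGE_GRAPH_DATA here. It is expected to be a dict with these keys:
    #   "ground_truth": list of ground-truth KG strings, one per sample
    #   "metadata":     list of per-sample dicts (read for "context" and "context_translation")
    #   one key per method name, each holding a list of predicted KG strings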

    
    evaluation_summary = {}
    failed_combinations = []

    JUDGE_MODELS = [
        {'name': 'gemini-2.5-pro', 'max_tokens_k': 128},
    ]

    methods_to_evaluate = [key for key in KNOWLEDGE_GRAPH_DATA.keys() if key not in ['ground_truth', 'metadata']]
    
    start_index, end_index = 0, 60
    print(f"Methods to be evaluated: {methods_to_evaluate}")
    print(f"Judge models to be used: {[j['name'] for j in JUDGE_MODELS]}\n")

    # --- 2. Start Nested Loop for Evaluation ---
    for judge_model_config in JUDGE_MODELS:
        judge_model_name = judge_model_config['name']
        max_tokens_k = judge_model_config['max_tokens_k']
        
        print(f"\n{'='*25} Starting evaluation with Judge: {judge_model_name} (Token Limit: {max_tokens_k}K) {'='*25}")
        
        # Updated LLM parameters to match the new 'ask_group_link' function
        llm_api_params = {
            "model": judge_model_name,
            "token": max_tokens_k * 1024,
            "temp": 0.2,
            "max_workers": 6,
            "api_key": MY_API_KEY,
            "api_base": MY_API_BASE,
        }

        for method_to_evaluate in methods_to_evaluate:
            try:
                print(f"\n>>> Using '{judge_model_name}' to evaluate method: '{method_to_evaluate}' <<<")

                # Stage 1: Get LLM Evaluation Results
                print(f"[{method_to_evaluate}] Stage 1: Acquiring evaluation results...")
                raw_results = generate_evaluation_prompts(
                    knowledge_graph_collection=KNOWLEDGE_GRAPH_DATA, method_name=method_to_evaluate,
                    start_index=start_index, end_index=end_index, llm_params=llm_api_params,
                    evaluation_target='all', ground_truth_key='ground_truth',
                    context_parent_key='metadata', context_child_key='context',
                )
                if not raw_results:
                    print(f"  [Warning] Stage 1 failed for ('{judge_model_name}', '{method_to_evaluate}'). Skipping.")
                    continue
                print(f"[{method_to_evaluate}] Stage 1 complete, {len(raw_results)} records obtained.")
                
                # Stage 2a: Calculate Precision
                print(f"[{method_to_evaluate}] Stage 2a: Calculating and saving precision...")
                precision_save_dir = f'./evaluation_results/precision/{os.path.basename(judge_model_name)}'
                precision_scores = process_and_save_precision_results(
                    knowledge_graph_collection=KNOWLEDGE_GRAPH_DATA, raw_evaluation_results=raw_results, prediction_key=method_to_evaluate,
                    ground_truth_key='ground_truth', metadata_key='metadata',
                    output_directory=precision_save_dir
                )
                print(f"[{method_to_evaluate}] Precision calculation complete.")

                # Stage 2b: Calculate Recall
                print(f"[{method_to_evaluate}] Stage 2b: Calculating and saving recall...")
                recall_save_dir = f'./evaluation_results/recall/{os.path.basename(judge_model_name)}'
                recall_scores = process_and_save_recall_results(
                    knowledge_graph_collection=KNOWLEDGE_GRAPH_DATA, raw_evaluation_results=raw_results, prediction_key=method_to_evaluate,
                    ground_truth_key='ground_truth', metadata_key='metadata',
                    output_directory=recall_save_dir
                )
                print(f"[{method_to_evaluate}] Recall calculation complete.")

                # Stage 3: Aggregate scores
                print(f"[{method_to_evaluate}] Stage 3: Aggregating scores...")
                judge_dict = evaluation_summary.setdefault(judge_model_name, {})
                method_results = judge_dict.setdefault(method_to_evaluate, {})
                
                if precision_scores:
                    for idx, score in precision_scores.items():
                        method_results.setdefault(f'index_{idx}', {})['precision'] = score
                
                if recall_scores:
                    for idx, score in recall_scores.items():
                        method_results.setdefault(f'index_{idx}', {})['recall'] = score
                
                print(f"--- ✓ Method '{method_to_evaluate}' successfully evaluated by '{judge_model_name}' ---")
            
            except Exception as e:
                print("\n" + "!"*80)
                print(f"!!! CRITICAL ERROR processing Judge='{judge_model_name}', Method='{method_to_evaluate}'.")
                print(f"!!! Error Type: {type(e).__name__}, Message: {e}")
                print("--- Error Traceback ---")
                traceback.print_exc()
                print("!"*80 + "\n>>> Continuing to the next task.")
                
                failed_combinations.append({
                    'judge': judge_model_name,
                    'method': method_to_evaluate,
                    'error': f"{type(e).__name__}: {e}"
                })
                continue
    
    # --- 3. Print Final Summary Report ---
    print_summary_report(evaluation_summary)
    
    summary_path = './evaluation_results/evaluation_summary_full.json'
    print(f"\nSaving the complete evaluation summary to: {summary_path}")
    os.makedirs(os.path.dirname(summary_path), exist_ok=True)
    with open(summary_path, 'w', encoding='utf-8') as f:
        json.dump(evaluation_summary, f, ensure_ascii=False, indent=4)
    print("Save complete.")

    if failed_combinations:
        print("\n" + "!"*40)
        print("--- The following combinations failed with errors ---")
        for failure in failed_combinations:
            print(f"  - Judge: {failure['judge']:<35} | Method: {failure['method']:<60} | Error: {failure['error']}")
        print("!"*40)
    else:
        print("\n🎉 All task combinations completed successfully without fatal errors.")


if __name__ == "__main__":
    main()
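
The precision and recall post-processing functions above share the same core steps: strip an optional Markdown code fence from the judge's reply, parse the JSON list, count the outcomes, and divide. The block below is a minimal, self-contained sketch of just that scoring step. `score_judge_reply`, `FENCE`, and the sample reply are hypothetical names and made-up data for illustration; only the `result` labels (`TP`, `FP`, `FN`) and the fence-stripping chain are taken from the code above. It assumes Python 3.9+ for `str.removeprefix`.

```python
import json

# Built at runtime so this example contains no literal Markdown fence.
FENCE = "`" * 3


def score_judge_reply(reply: str, negative_label: str) -> float:
    """Return TP / (TP + negatives) for one judge reply.

    `negative_label` is 'FP' when scoring precision and 'FN' when scoring recall.
    Hypothetical helper for illustration; not part of the pipeline above.
    """
    # Strip an optional fenced-JSON wrapper, mirroring the pipeline's cleanup.
    cleaned = (
        reply.strip()
        .removeprefix(FENCE + "json").strip()
        .removesuffix(FENCE).strip()
    )
    items = json.loads(cleaned)

    tp = sum(1 for item in items if item.get("result") == "TP")
    negatives = sum(1 for item in items if item.get("result") == negative_label)
    total = tp + negatives
    # An empty verdict list scores 0.0, as in the functions above.
    return tp / total if total > 0 else 0.0


# Made-up judge reply: two supported predictions and one unsupported one.
verdicts = [
    {"index_predict": "R1", "result": "TP"},
    {"index_predict": "R2", "result": "TP"},
    {"index_predict": "R3", "result": "FP"},
]
sample_reply = f"{FENCE}json\n{json.dumps(verdicts, indent=2)}\n{FENCE}"

precision = score_judge_reply(sample_reply, negative_label="FP")
print(f"precision = {precision:.4f}")
```

Passing `negative_label="FN"` scores a recall reply the same way. The full pipeline additionally resolves each index against the parsed knowledge graphs to build the detailed report.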