# Entity and Relationship Extraction for Threat Intelligence

## Overview
This notebook implements entity and relationship extraction from threat intelligence text using LLM-based approach.

### Task Description
- **Input**: Threat intelligence text content
- **Output**: Named entities and relationships in structured format
- **Entity Types**: malware, threat type, attacker, vulnerability, tool, etc.
- **Relationship Types**: use, target, exploit, etc.

### Example
**Input**: A hitherto unknown attack group has been observed targeting a materials research organization in Asia. The group, which Symantec calls Clasiopa, is characterized by a distinct toolset, which includes one piece of custom malware (Backdoor.Atharvan).

**Output**:
- Named Entities: (Clasiopa, attacker), (custom malware, malware), (Backdoor.Atharvan, malware)
- Relationships: (Clasiopa, use, custom malware), (custom malware, name, Backdoor.Atharvan)


In [2]:
import json
import os
import re
from pathlib import Path
from typing import Dict, List, Tuple, Any
from collections import defaultdict
import datetime

# Load environment and model setup
from dotenv import load_dotenv
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# Load environment variables
load_dotenv()

print("🔧 Setting up Entity & Relationship Extraction Pipeline")
print("=" * 60)


🔧 Setting up Entity & Relationship Extraction Pipeline


In [3]:
def load_data(input_file: str) -> list:
    """
    Load threat intelligence data from JSON file.
    """
    try:
        with open(input_file, 'r', encoding='utf-8') as f:
            data = json.load(f)
        print(f"✅ Loaded {len(data)} records from {input_file}")
        return data
    except Exception as e:
        print(f"❌ Error loading {input_file}: {e}")
        return []

# Load threat intelligence data
data_path = '../data/processed/merged_threat_intelligence.json'
data = load_data(data_path)

if data:
    print(f"📊 Sample data structure:")
    print(f"   Keys: {list(data[0].keys())}")
    print(f"   Title: {data[0]['title'][:100]}...")


✅ Loaded 427 records from ../data/processed/merged_threat_intelligence.json
📊 Sample data structure:
   Keys: ['title', 'content', 'link']
   Title: FortiGuard Labs Threat Research...


In [4]:
# Device setup
device = (
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)

print(f"🖥️  Using device: {device.upper()}")
print(f"🔧 PyTorch version: {torch.__version__}")

# Memory cleanup
if device == "cuda":
    torch.cuda.empty_cache()
elif device == "mps":
    import gc
    gc.collect()
    if hasattr(torch.mps, 'empty_cache'):
        torch.mps.empty_cache()


🖥️  Using device: MPS
🔧 PyTorch version: 2.7.1


In [5]:
# Get configuration from environment
HF_TOKEN = os.getenv('HF_TOKEN')
DEFAULT_MODEL = os.getenv('DEFAULT_MODEL', 'Qwen/Qwen2.5-1.5B-Instruct')
FALLBACK_MODEL = os.getenv('FALLBACK_MODEL', 'gpt2')

def setup_model_for_extraction(model_name: str = None, hf_token: str = None):
    """
    Tải model từ Hugging Face với token từ environment variables.
    """
    model_name = model_name or DEFAULT_MODEL
    hf_token = hf_token or HF_TOKEN

    print(f"🤖 Đang tải mô hình: {model_name}")
    print(f"📱 Thiết bị: {device.upper()}")
    print(f"🔑 Token: {'✅ Found' if hf_token else '❌ Missing'}")

    try:
        # Load tokenizer
        tokenizer = AutoTokenizer.from_pretrained(
            model_name,
            token=hf_token,
            trust_remote_code=True
        )
        tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token

        # Thiết lập kiểu dữ liệu và bản đồ thiết bị
        torch_dtype = torch.float16 if device == "cuda" else torch.float32
        device_map = "auto" if device == "cuda" else None

        # Load model
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            token=hf_token,
            trust_remote_code=True,
            torch_dtype=torch_dtype,
            device_map=device_map,
            use_cache=False
        )

        if device_map is None and device in ["mps", "cuda"]:
            model.to(device)

        if device_map is None:
            # Nếu không sử dụng device_map="auto", có thể chỉ định device
            pipe = pipeline(
                "text-generation",
                model=model,
                tokenizer=tokenizer,
                device=0 if device != "cpu" else -1,
                torch_dtype=torch_dtype,
                model_kwargs={"use_cache": False}
            )
        else:
            # Nếu sử dụng device_map="auto", không chỉ định device
            pipe = pipeline(
                "text-generation",
                model=model,
                tokenizer=tokenizer,
                torch_dtype=torch_dtype,
                model_kwargs={"use_cache": False}
            )

        print(f"✅ Đã tải thành công {model_name} trên {device.upper()}")
        return pipe

    except Exception as e:
        print(f"❌ Lỗi khi tải {model_name}: {e}")
        return setup_fallback_model(hf_token)

def setup_fallback_model(hf_token: str = None):
    """
    Tải fallback model nếu model chính lỗi.
    """
    fallback_name = FALLBACK_MODEL
    hf_token = hf_token or HF_TOKEN
    print(f"🔄 Đang tải mô hình dự phòng: {fallback_name}")

    try:
        tokenizer = AutoTokenizer.from_pretrained(fallback_name, token=hf_token)
        tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token

        model = AutoModelForCausalLM.from_pretrained(
            fallback_name,
            token=hf_token,
            torch_dtype=torch.float32,
            use_cache=False
        )

        if device in ["cuda", "mps"]:
            model.to(device)

        pipe = pipeline(
            "text-generation",
            model=model,
            tokenizer=tokenizer,
            device=0 if device != "cpu" else -1,
            model_kwargs={"use_cache": False}
        )

        print(f"✅ {FALLBACK_MODEL} đã sẵn sàng trên {device.upper()}")
        return pipe

    except Exception as e:
        print(f"❌ Lỗi khi tải {FALLBACK_MODEL} fallback: {e}")
        return None

# Load model
extraction_model = setup_model_for_extraction()


🤖 Đang tải mô hình: Qwen/Qwen2.5-1.5B-Instruct
📱 Thiết bị: MPS
🔑 Token: ✅ Found


Device set to use mps:0


✅ Đã tải thành công Qwen/Qwen2.5-1.5B-Instruct trên MPS


In [6]:
def create_entity_extraction_prompt(text: str) -> str:
    """
    Create prompt for entity and relationship extraction focusing on core cybersecurity entity types.
    """
    # Truncate text to avoid token limits
    text_truncated = (text[:1500] if text else "").replace('\n', ' ').strip()
    
    prompt = f"""Instruction: Please identify the following types of entities and then extract the relationships between these extracted entities:

Entity Types (focus on these only):
- Malware: Malicious software (e.g., 'Stuxnet', 'Emotet', 'Backdoor.Atharvan')
- Threat Type: Category of threats (e.g., 'Ransomware', 'APT', 'Botnet')
- Attacker: Threat actors/groups (e.g., 'APT28', 'Lazarus Group', 'Shuckworm')
- Technique: Attack techniques/TTPs (e.g., 'T1057: Process Discovery', 'Privilege Escalation', 'Phishing')
- Tool: Security tools or attack tools (e.g., 'PowerShell', 'Cobalt Strike', 'EHole')
- Vulnerability: Security weaknesses (e.g., 'CVE-2020-1472', 'CVE-2021-44228')
- IP: IP addresses (e.g., '45.153.243.93', '192.168.1.100')
- Domain: Domain names (e.g., 'malicious-domain[.]com', 'evil[.]example[.]com')
- URL: URLs (e.g., 'hxxp://178.73.192[.]15/cal.exe')
- File: File names (e.g., 'rtk.lnk', 'payload.exe', 'shtasks.exe')
- Hash: File hashes (e.g., '2aee8bb2a953124803bc42e5c42935c9', MD5/SHA1/SHA256)

Relationship Types:
- use, hash, aka, execute, used by, download, resolved to, IP, drop, associated with, deploy, communicate with, connect to, install, exploit, contain, run, launch, target, linked to

If there are no entities and relationships pertaining to the specified types, please state 'No related entities and relations'. Make sure to follow the output format shown in the following examples.

Example 1:
Input: A hitherto unknown attack group has been observed targeting a materials research organization in Asia. The group, which Symantec calls Clasiopa, is characterized by a distinct toolset, which includes one piece of custom malware (Backdoor.Atharvan).
Output: Named Entities: (Clasiopa, Attacker), (Backdoor.Atharvan, Malware)\\nRelationships: (Clasiopa, uses, Backdoor.Atharvan)

Example 2:
Input: The Emotet malware has been observed using new phishing techniques to target banking institutions. The malware exploits CVE-2021-1234 vulnerability in Microsoft Office.
Output: Named Entities: (Emotet, Malware), (phishing, Technique), (CVE-2021-1234, Vulnerability), (Microsoft Office, Tool)\\nRelationships: (Emotet, uses, phishing), (Emotet, exploits, CVE-2021-1234)

Example 3:
Input: The threat actor downloaded malicious payload from hxxp://malicious-domain[.]com/payload.exe and used hash 2aee8bb2a953124803bc42e5c42935c9 to verify file integrity. The attack targeted IP address 192.168.1.100.
Output: Named Entities: (threat actor, Attacker), (malicious payload, File), (hxxp://malicious-domain[.]com/payload.exe, URL), (2aee8bb2a953124803bc42e5c42935c9, Hash), (192.168.1.100, IP)\\nRelationships: (threat actor, uses, hxxp://malicious-domain[.]com/payload.exe), (threat actor, targets, 192.168.1.100)

Example 4:
Input: H2Miner botnet uses Kinsing malware and Cobalt Strike to deploy XMRig miners. The campaign communicates with C2 server at evil[.]domain[.]com and is attributed to APT group.
Output: Named Entities: (H2Miner, Threat Type), (Kinsing, Malware), (Cobalt Strike, Tool), (XMRig, Tool), (evil[.]domain[.]com, Domain), (APT group, Attacker)\\nRelationships: (H2Miner, uses, Kinsing), (H2Miner, uses, Cobalt Strike), (H2Miner, uses, XMRig), (Kinsing, communicatesWith, evil[.]domain[.]com), (H2Miner, attributedTo, APT group)

Example 5:
Input: The weather forecast shows sunny skies and moderate temperatures for the weekend.
Output: No related entities and relations

Now extract entities and relationships from the following text:
Input: {text_truncated}
Output:"""
    
    return prompt

# Test the prompt creation
if data:
    sample_prompt = create_entity_extraction_prompt(data[0]['content'])
    print("📝 Sample prompt (first 500 chars):")
    print(sample_prompt[:500] + "...")


📝 Sample prompt (first 500 chars):
Instruction: Please identify the following types of entities and then extract the relationships between these extracted entities:

Entity Types (focus on these only):
- Malware: Malicious software (e.g., 'Stuxnet', 'Emotet', 'Backdoor.Atharvan')
- Threat Type: Category of threats (e.g., 'Ransomware', 'APT', 'Botnet')
- Attacker: Threat actors/groups (e.g., 'APT28', 'Lazarus Group', 'Shuckworm')
- Technique: Attack techniques/TTPs (e.g., 'T1057: Process Discovery', 'Privilege Escalation', 'Phishi...


In [7]:
def extract_entities_and_relationships(pipe, text: str) -> Dict[str, Any]:
    """
    Extract entities and relationships from text using the LLM.
    """
    try:
        prompt = create_entity_extraction_prompt(text)
        
        # Generate response
        response = pipe(
            prompt,
            max_new_tokens=300,
            do_sample=False,
            temperature=0.1,
            pad_token_id=pipe.tokenizer.eos_token_id,
        )
        
        # Extract generated text
        generated_text = response[0]['generated_text']
        answer = generated_text[len(prompt):].strip()
        
        print(f"🔍 Raw model output: {answer[:200]}...")
        
        # Parse the response
        entities, relationships = parse_extraction_output(answer)
        
        return {
            "raw_output": answer,
            "entities": entities,
            "relationships": relationships,
            "has_entities": len(entities) > 0
        }
        
    except Exception as e:
        print(f"❌ Error in extraction: {e}")
        return {
            "raw_output": "",
            "entities": [],
            "relationships": [],
            "has_entities": False,
            "error": str(e)
        }

def parse_extraction_output(output: str) -> Tuple[List[Tuple], List[Tuple]]:
    """
    Parse the model output to extract entities and relationships.
    """
    entities = []
    relationships = []
    
    # Check for "No related entities" case
    if "no related entities" in output.lower():
        return entities, relationships
    
    try:
        # Split output into lines
        lines = [line.strip() for line in output.split('\n') if line.strip()]
        
        current_section = None
        for line in lines:
            line_lower = line.lower()
            
            if "named entities:" in line_lower:
                current_section = "entities"
                # Extract entities from the same line
                entity_part = line.split(":", 1)[1] if ":" in line else ""
                entities.extend(extract_tuples_from_text(entity_part))
                
            elif "relationships:" in line_lower:
                current_section = "relationships"
                # Extract relationships from the same line
                rel_part = line.split(":", 1)[1] if ":" in line else ""
                relationships.extend(extract_tuples_from_text(rel_part))
                
            elif current_section == "entities":
                entities.extend(extract_tuples_from_text(line))
                
            elif current_section == "relationships":
                relationships.extend(extract_tuples_from_text(line))
    
    except Exception as e:
        print(f"⚠️  Error parsing output: {e}")
    
    return entities, relationships

def extract_tuples_from_text(text: str) -> List[Tuple]:
    """
    Extract tuples from text using regex pattern matching.
    """
    tuples = []
    
    # Pattern to match (item1, item2) or (item1, item2, item3)
    pattern = r'\(([^)]+)\)'
    matches = re.findall(pattern, text)
    
    for match in matches:
        # Split by comma and clean up
        parts = [part.strip() for part in match.split(',')]
        if len(parts) >= 2:
            tuples.append(tuple(parts))
    
    return tuples

# Test the extraction function
if extraction_model and data:
    print("\n🧪 Testing entity extraction on sample data...")
    test_result = extract_entities_and_relationships(extraction_model, data[0]['content'])
    
    print(f"\n📊 Extraction Results:")
    print(f"   Entities found: {len(test_result['entities'])}")
    print(f"   Relationships found: {len(test_result['relationships'])}")
    
    if test_result['entities']:
        print("\n🏷️  Sample Entities:")
        for entity in test_result['entities'][:5]:
            print(f"     {entity}")
    
    if test_result['relationships']:
        print("\n🔗 Sample Relationships:")
        for rel in test_result['relationships'][:5]:
            print(f"     {rel}")


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.



🧪 Testing entity extraction on sample data...
🔍 Raw model output: Named Entities: (NailaoLocker, Malware), (SM2, Technique), (Lcrypt0rx, Malware), (Dark 101, Malware), (FortiCNAPP Composite Alerts, Tool), (Lcrypt0rx, Malware), (FortiCNAPP Labs, Tool), (FortiSandbox ...

📊 Extraction Results:
   Entities found: 12
   Relationships found: 9

🏷️  Sample Entities:
     ('NailaoLocker', 'Malware')
     ('SM2', 'Technique')
     ('Lcrypt0rx', 'Malware')
     ('Dark 101', 'Malware')
     ('FortiCNAPP Composite Alerts', 'Tool')

🔗 Sample Relationships:
     ('NailaoLocker', 'uses', 'SM2')
     ('NailaoLocker', 'contains', 'Lcrypt0rx')
     ('Lcrypt0rx', 'uses', 'Dark 101')
     ('FortiCNAPP Composite Alerts', 'linksWeakSignalsIntoClearTimelines')
     ('Lcrypt0rx', 'uses', 'FortiCNAPP Labs')


In [8]:
def process_articles_for_extraction(data: List[Dict], pipe, start: int = 0, offset:int=5) -> List[Dict]:
    """
    Process multiple articles for entity and relationship extraction.
    """
    end = min(start + offset, len(data))
    articles_to_process = data[start:end]
    results = []

    print(f"🔍 Processing {len(articles_to_process)} articles for entity extraction...")

    for i, article in enumerate(articles_to_process):
        print(f"\nProcessing {i+1}/{len(articles_to_process)}: {article.get('title', 'Unknown')[:60]}...")

        # Extract entities and relationships
        extraction_result = extract_entities_and_relationships(pipe, article.get('content', ''))

        # Combine with original article data
        result = {
            "title": article.get('title', ''),
            "link": article.get('link', ''),
            "content": article.get('content', ''),
            "extraction": extraction_result,
            "entity_count": len(extraction_result['entities']),
            "relationship_count": len(extraction_result['relationships'])
        }

        results.append(result)
        
        # Progress update
        if (i + 1) % 5 == 0:
            print(f"  ✅ Processed {i+1}/{len(articles_to_process)} articles")

    return results

# Process a small batch first for testing
print("\n🚀 Processing first 5 articles for entity extraction...")
extraction_results = process_articles_for_extraction(data, extraction_model, start=0, offset = 5)



🚀 Processing first 5 articles for entity extraction...


In [9]:
def save_extraction_results(results: List[Dict], output_file: str = "entity-extraction.json"):
    """
    Save extraction results to files.
    """
    try:
        # Create output directory
        output_path = Path(output_file)
        output_path.parent.mkdir(parents=True, exist_ok=True)
        
        # Convert to an absolute path to avoid relative path issues
        absolute_path = output_path.resolve()
        print(f"💾 Saving to: {absolute_path}")
        
        with open(absolute_path, 'w', encoding='utf-8') as f:
            json.dump(results, f, ensure_ascii=False, indent=2)
        
        print(f"\n💾 SAVED EXTRACTION RESULTS to {output_file}")

        
    except Exception as e:
        print(f"❌ Error saving results: {e}")
        print(f"   Attempted path: {output_path}")
        print(f"   Current working directory: {os.getcwd()}")
        print(f"   Absolute path would be: {Path(output_path).resolve()}")


In [10]:
# test save results
import datetime
today = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
offset = 2
start = 0
end = min(len(data), start+offset)

results = process_articles_for_extraction(data, extraction_model,start=start,offset=offset)


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Processing 2 articles for entity extraction...

Processing 1/2: FortiGuard Labs Threat Research...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (NailaoLocker, Malware), (SM2, Technique), (Lcrypt0rx, Malware), (Dark 101, Malware), (FortiCNAPP Composite Alerts, Tool), (Lcrypt0rx, Malware), (FortiCNAPP Labs, Tool), (FortiSandbox ...

Processing 2/2: NailaoLocker Ransomware’s “Cheese”...
🔍 Raw model output: Named Entities: (FortiGuard Labs Threat Research, Entity), (NailaoLocker, Malware), (AES-256-CBC, Technique), (SM2 cryptographic key, Vulnerability), (Windows, Platform), (user files, File), (high sev...


In [11]:
output_path = f"../data/entity-extraction/entity-extraction_{today}_{start}_{end}.json"
save_extraction_results(results, output_path)

💾 Saving to: /Users/huynguyen/Documents/UIT/2nd/NLP/LLM-TKIG/data/entity-extraction/entity-extraction_2025-08-04_18-30-40_0_2.json

💾 SAVED EXTRACTION RESULTS to ../data/entity-extraction/entity-extraction_2025-08-04_18-30-40_0_2.json


In [12]:
import datetime
today = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
offset = 50


In [13]:
start = 0
end = min(len(data), start+offset)
output_path = f"../data/entity-extraction/entity_extraction_results_{os.getenv('DEFAULT_MODEL', 'Qwen/Qwen2.5-1.5B-Instruct')}_test_{today}_{start}_{end}.json"

results = process_articles_for_extraction(data, extraction_model,start=start,offset=offset)
save_extraction_results(results, output_path)

In [None]:
start = 50
end = min(len(data), start+offset)
output_path = f"../data/entity-extraction/entity_extraction_results_{os.getenv('DEFAULT_MODEL', 'Qwen/Qwen2.5-1.5B-Instruct')}_test_{today}_{start}_{end}.json"

results = process_articles_for_extraction(data, extraction_model,start=start,offset=offset)
save_extraction_results(results, output_path)

In [15]:
start = 100
end = min(len(data), start+offset)
output_path = f"../data/entity-extraction/entity_extraction_results_{os.getenv('DEFAULT_MODEL', 'Qwen/Qwen2.5-1.5B-Instruct')}_test_{today}_{start}_{end}.json"

results = process_articles_for_extraction(data, extraction_model,start=start,offset=offset)
save_extraction_results(results, output_path)

In [16]:
start = 150
end = min(len(data), start+offset)
output_path = f"../data/entity-extraction/entity_extraction_results_{os.getenv('DEFAULT_MODEL', 'Qwen/Qwen2.5-1.5B-Instruct')}_test_{today}_{start}_{end}.json"

results = process_articles_for_extraction(data, extraction_model,start=start,offset=offset)
save_extraction_results(results, output_path)

In [17]:
start = 200
end = min(len(data), start+offset)
output_path = f"../data/entity-extraction/entity_extraction_results_{os.getenv('DEFAULT_MODEL', 'Qwen/Qwen2.5-1.5B-Instruct')}_test_{today}_{start}_{end}.json"

results = process_articles_for_extraction(data, extraction_model,start=start,offset=offset)
save_extraction_results(results, output_path)

In [18]:
start = 250
end = min(len(data), start+offset)
output_path = f"../data/entity-extraction/entity_extraction_results_{os.getenv('DEFAULT_MODEL', 'Qwen/Qwen2.5-1.5B-Instruct')}_test_{today}_{start}_{end}.json"

results = process_articles_for_extraction(data, extraction_model,start=start,offset=offset)
save_extraction_results(results, output_path)

In [19]:
start = 300
end = min(len(data), start+offset)
output_path = f"../data/entity-extraction/entity_extraction_results_{os.getenv('DEFAULT_MODEL', 'Qwen/Qwen2.5-1.5B-Instruct')}_test_{today}_{start}_{end}.json"

results = process_articles_for_extraction(data, extraction_model,start=start,offset=offset)
save_extraction_results(results, output_path)

In [20]:
start = 350
end = min(len(data), start+offset)
output_path = f"../data/entity-extraction/entity_extraction_results_{os.getenv('DEFAULT_MODEL', 'Qwen/Qwen2.5-1.5B-Instruct')}_test_{today}_{start}_{end}.json"

results = process_articles_for_extraction(data, extraction_model,start=start,offset=offset)
save_extraction_results(results, output_path)

The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Processing 50 articles for entity extraction...

Processing 1/50: symantec latest intelligence refresh...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Symantec, Company), (PDF report, Document), (financial Trojan, Threat Type), (Ramnit, Malware), (September, Month), (August, Month), (IoT device, Device)
Relationships: (Symantec, pub...

Processing 2/50: formjacking attacks retailers...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Magecart, Attacker), (Ticketmaster, Domain), (British Airways, Domain), (Feedify, Domain), (Newegg, Domain), (formjacking, Threat Type), (Malicious JavaScript, Tool), (payment card de...

Processing 3/50: microsoft patch tuesday september 2018...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Microsoft, Tool), (Chakra Scripting Engine, Tool), (Memory Corruption Vulnerability, Vulnerability), (Internet Explorer, Tool), (PDF Remote Code Execution Vulnerability, Vulnerability...

Processing 4/50: wmic download malware...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Windows Management Instrumentation Command-line, Tool), (eXtensible Stylesheet Language, Tool), (WMIC, Tool), (XSL, Tool), (Malware, Malware), (XML, File), (WMI, Technique), (eXtensib...

Processing 5/50: mirai cross platform infection...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Mirai botnet, Threat Type), (Linux.Mirai, Malware), (Mirai, Attacker), (shell script, Tool), (vulnerable device, Target), (executables, File), (remote server, Host), (July, Date)
Rela...
  ✅ Processed 5/50 articles

Processing 6/50: jrat new anti parsing techniques...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (jRAT, Malware), (Trojan.Maljava, Malware), (MZ, Tool), (JAR file, File), (spam email, Threat Type), (social engineering, Technique)
Relationships: (jRAT, uses, Trojan.Maljava), (jRAT,...

Processing 7/50: microsoft patch tuesday august 2018...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Microsoft, Tool), (Browsers, Tool), (Chakra, Tool), (Scripting Engine, Tool), (Memory Corruption Vulnerability, Vulnerability), (Information Disclosure Vulnerability, Vulnerability), ...

Processing 8/50: hacked mikrotik router...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Cryptocurrency coinminers, Malware), (ransomware, Threat Type), (Symantec, Attacker), (MikroTik routers, Tool), (Brazil, Country), (August 2018, Time Period), (Figure 2, Image), (Figu...

Processing 9/50: leafminer espionage middle east...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Leafminer, Attacker), (Malware, Malware), (e-qht.az, Domain), (publicly accessible, Linked To)
Relationships: (Leafminer, uses, Malware), (Leafminer, downloads, e-qht.az), (e-qht.az, ...

Processing 10/50: evolution emotet trojan distributor...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Mealybug, Attacker), (Trojan.Emotet, Malware), (WannaCry, Threat Type), (Petya/NotPetya, Threat Type), (Conficker, Threat Type), (W32.Downadup, Threat Type)
Relationships: (Mealybug, ...
  ✅ Processed 10/50 articles

Processing 11/50: powershell threats grow further and operate plain sight...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Windows PowerShell, Tool), (WMI, Tool), (PsExec, Tool), (PowerSploit, Tool), (Empire, Tool), (living off the land, Technique), (fileless, Technique), (PowerShell, Tool), (PowerShell s...

Processing 12/50: microsoft patch tuesday july 2018...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Cumulative Security Update for Microsoft Browsers Scripting Engine Memory Corruption Vulnerability(CVE-2018-8242), Technique), (Scripting Engine Memory Corruption Vulnerability(CVE-20...

Processing 13/50: thrip hits satellite telecoms defense targets...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Living off the land, Technique), (operating system features, Tool), (legitimate network administration tools, Tool), (victim's network, Target), (Sunny skies, Weather), (moderate temp...

Processing 14/50: microsoft patch tuesday june 2018...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Cumulative Security Update for Microsoft Browsers, Tool), (Internet Explorer Memory Corruption Vulnerability, Vulnerability), (Chakra Scripting Engine Memory Corruption Vulnerability,...

Processing 15/50: industry and law enforcement cooperation bears fruit fight a...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Business email attacks, Threat Type), (419 scams, Threat Type), (FBI, Organization), (Symantec, Company), (Operation Wire-Wire, Name), (BEC attackers, Attacker), (private sector compa...
  ✅ Processed 15/50 articles

Processing 16/50: vpnfilter iot malware...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Cisco Talos, Attacker), (Stage 3 module, Malware), (ssler, Malware), (VPNFilter, Malware), (Modbus SCADA, Vulnerability), (SCADA industrial control systems, Vulnerability), (Ukraine, ...

Processing 17/50: scan4you masterminds guilty...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Scan4You, Threat Type), (Jurijs Martisevs, Attacker), (Ruslans Bondars, Attacker), (Malware, Malware), (credit and debit card numbers, Vulnerable Data), (FBI, Organization), (undergro...

Processing 18/50: latest intelligence march 2018...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Inception Framework, Attacker), (email malware, Threat Type), (Trojan.Coinminer, Malware), (browser-based cryptocurrency mining, Technique), (Agriculture, Sector), (1 in 1,394, Vulner...

Processing 19/50: coin mining without browser...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (browser-based cryptocurrency mining, Threat Type), (JavaScript, Tool), (WebAssembly (WASM), Tool), (Portable Executable file (.NET), Tool), (Coinhive, Tool), (Form1, File), (script ta...

Processing 20/50: istr 23 cyber security threat landscape...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Coin mining, Threat Type), (ransomware, Threat Type), (targeted attacks, Threat Type), (mobile security, Threat Type), (software supply chain, Threat Type)\nRelationships: (Coin minin...
  ✅ Processed 20/50 articles

Processing 21/50: fakebank intercepts calls banks...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Fakebank, Threat Type), (Android, Platform), (command and control (C&C), Technique), (phone number, Vulnerability), (fake UI, File), (system alert window, Permission), (scammer, Attac...

Processing 22/50: inception framework hiding behind proxies...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Inception Framework, Attacker), (stealthy new tools, Technique), (cloud, Tool), (Internet of Things (IoT), Tool), (advanced, Technique), (automated framework, Tool), (spear-phishing e...

Processing 23/50: microsoft patch tuesday march 2018...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: No related entities and relations

Explanation: The input text does not contain any named entities or relationships that match the provided entity types and relationship types. It appears to be a list...

Processing 24/50: latest intelligence february 2018...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Chafer attack group, Attacker), (email malware, Threat Type), (Necurs botnet, Tool), (Facebook account, Target), (Finance, Industry), (Mining, Industry), (Ne, Domain)
Relationships: (...

Processing 25/50: chafer latest attacks reveal heightened ambitions...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Chafer, Attacker), (Iran, Country), (targeted attack group, Entity Type), (seven new tools, Technique), (nine new target organizations, Target), (Israel, Country), (Jordan, Country), ...
  ✅ Processed 25/50 articles

Processing 26/50: android malware harvests facebook details...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Facebook, Website), (Android.Fakeapp, Malware), (third-party markets, Tool), (English speakers, Target), (C&C server, IP)
Relationships: (Facebook, contains, Android.Fakeapp), (Facebo...

Processing 27/50: microsoft patch tuesday february 2018...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Microsoft, Tool), (Browsers, Vulnerability), (Edge, Tool), (CVE-2018-0763, Vulnerability), (Critical, MS Rating), (Scripting Engine, Tool), (Memory Corruption, Vulnerability), (CVE-20...

Processing 28/50: meltdown spectre cpu bugs...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Meltdown, Vulnerability), (Spectre, Vulnerability), (kernel, File), (JavaScript, Tool), (operating system, Tool), (Symantec, Company), (personal computer, Device), (virtual machine, D...

Processing 29/50: android malware uber credentials deep links...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Android.Fakeappmalware, Malware), (Uber, Threat Type), (deep link URI, Technique), (current location, Vulnerability), (Ride Request activity, Technique), (victim, Target)
Relationship...

Processing 30/50: browser mining cryptocurrency...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Browser-based cryptocurrency mining, Threat Type), (Coinhive, Malware), (JavaScript, Tool), (BitcoinPlus.com, Domain), (Monero, Vulnerable Software), (ASIC mining, Technique)
Relation...
  ✅ Processed 30/50 articles

Processing 31/50: triton malware ics...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Symantec, Company), (Trojan, Malware), (Triton, Malware), (Trojan.Trisis, Malware), (Safety Instrumented Systems, Vulnerability), (Windows, Tool), (Industrial Control System, Threat T...

Processing 32/50: microsoft patch tuesday december...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Cumulative Security Update for Microsoft Browsers, Tool), (CVE-2017-11888, Vulnerability), (Critical, Severity), (Scripting Engine Memory Corruption Vulnerability, Malware), (CVE-2017...

Processing 33/50: mailsploit email exploit spoofing...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Mailsploit, Malware), (RFC-1342, Vulnerability), (Yahoo Mail for iOS and Android, Tool), (Sabri Haddouche, Attacker), (Domain-based Message Authentication, Reporting and Conformance (...

Processing 34/50: surge adwind distribution emails...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Adwind, Malware), (JAR, File), (ZIP, File), (Symantec, Tool), (August 2017, Month), (October 2017, Month), (November 2017, Month), (Holiday/Shopping Season, Event)
Relationships: (Adw...

Processing 35/50: latest intel november 2017...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (spam, Threat Type), (Black Friday, Event), (Cyber Monday, Event), (Necurs, Malware), (SMS, Tool), (legitimate company, Victim), (personal information, Target), (Malware, Malware), (SM...
  ✅ Processed 35/50 articles

Processing 36/50: doublehidden android malware google play...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Android.Trojan, Malware), (Google Play Store, Tool), (photograph by fiery, File), (i.r.r developer, Attacker), (com.aseee.apptec.treeapp, File), (Device Administrator, Vulnerability),...

Processing 37/50: android malware porn apps chinese...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Android.Rootnik.B, Malware), (Android.Reputation.1, Malware), (app-centric websites, Technique), (forums, Technique), (torrent sites, Technique), (social messaging networks, Technique...

Processing 38/50: ms patch tuesday november 2017...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Cumulative Security Update for Microsoft Browsers Scripting Engine Memory Corruption Vulnerability(CVE-2017-11858), Tool), (Scripting Engine Memory Corruption Vulnerability(CVE-2017-1...

Processing 39/50: tech support scams aes...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (tech support scam, Threat Type), (code obfuscation, Technique), (string-based detection engines, Tool), (JavaScript, Language), (SMB, Protocol), (ransomware, Malware), (Microsoft, Org...

Processing 40/50: sowbug cyber espionage south america asia...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Sowbug, Attacker), (Felismus, Malware), (South America, Region), (Southeast Asia, Region), (Argentina, Country), (Brazil, Country), (Ecuador, Country), (Peru, Country), (Brunei, Count...
  ✅ Processed 40/50 articles

Processing 41/50: ransomeware risks 2017...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (ransomware, Threat Type), (WannaCry, Malware), (Petya, Malware), (EternalBlue, Vulnerability), (Windows SMB protocol, Tool), (SMB protocol, Tool)
Relationships: (ransomware, contains,...

Processing 42/50: petya ransomware wiper...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Petya, Malware), (EternalBlue, Technique), (MEDoc, Tool), (Norton products, Tool), (Symantec Endpoint Protection, Tool)
Relationships: (Petya, uses, EternalBlue), (Petya, spreadsAcros...

Processing 43/50: wannacry ransomware attack...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Symantec, Company), (WannaCry, Malware), (Lazarus group, Attacker), (Eternal Blue, Technique), (Shadow Brokers, Threat Actor), (Equation cyber espionage group, Threat Actor), (SEP, To...

Processing 44/50: dragonfly energy sector cyber attacks...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Cyber attack, Threat Type), (Dragonfly, Attacker), (Ukraine's power system, Target), (operational systems, Target), (Nuclear facility, Target), (Symantec, Tool), (energy sector, Targe...

Processing 45/50: bachosens cyber crime investigation...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Trojan.Bachosens, Malware), (Igor, Attacker), (international airline, Target), (Chinese auto-tech company, Target), (car diagnostics software, Vulnerability), (underground forums and ...
  ✅ Processed 45/50 articles

Processing 46/50: longhorn cyberespionage vault7...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Longhorn, Attacker), (Vault 7, Document), (back door Trojans, Tool), (zero-day vulnerabilities, Vulnerability), (United States, Country)
Relationships: (Longhorn, uses, back door Troj...

Processing 47/50: bayrob suspects extradited...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Bayrob, Threat Type), (Bogdan Nicolescu, Attacker), (Danet Tiberiu, Attacker), (Radu Miclaus, Attacker), (Masterfraud, Attacker), (Amy, Attacker), (Minolta, Attacker), (Amightysa, Att...

Processing 48/50: shamoon back destructive...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Shamoon, Malware), (W32.Disttrack, Malware), (W32.Disttrack.B, Malware), (Alan Kurdi, Victim), (Saudi Arabian, Location), (working week, Time Period), (Thursday, Day)
Relationships: (...

Processing 49/50: gatak healthcare...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Gatak Trojan, Malware), (healthcare sector, Target), (insurance sector, Target), (enterprise computers, Device), (product key generator, File), (software, Software), (website, Website...

Processing 50/50: odinaff trojan financial attacks...
🔍 Raw model output: Named Entities: (Trojan.Odinaff, Malware), (Odinaff, Malware), (Carbanak, Threat Type), (Backdoor.Batel, Malware), (Carbanak, Threat Type)
Relationships: (Trojan.Odinaff, uses, Backdoor.Batel), (Carba...
  ✅ Processed 50/50 articles
💾 Saving to: /Users/huynguyen/Documents/UIT/2nd/NLP/LLM-TKIG/data/entity-extraction/entity_extraction_results_Qwen/Qwen2.5-1.5B-Instruct_test_2025-08-04_18-31-25_350_400.json

💾 SAVED EXTRACTION RESULTS to ../data/entity-extraction/entity_extraction_results_Qwen/Qwen2.5-1.5B-Instruct_test_2025-08-04_18-31-25_350_400.json


In [21]:
start = 400
end = min(len(data), start+offset)
output_path = f"../data/entity-extraction/entity_extraction_results_{os.getenv('DEFAULT_MODEL', 'Qwen/Qwen2.5-1.5B-Instruct')}_test_{today}_{start}_{end}.json"

results = process_articles_for_extraction(data, extraction_model,start=start,offset=offset)
save_extraction_results(results, output_path)

The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Processing 27 articles for entity extraction...

Processing 1/27: buckeye cyberespionage hong kong...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Buckeye, Attacker), (APT3, Attacker), (Gothic Panda, Attacker), (UPS Team, Attacker), (TG-0110, Attacker), (Hong Kong, Location), (US, Location), (backdoor.pirpi, Malware), (spear-phi...

Processing 2/27: equation cyberespionage group breached...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Shadow Brokers, Attacker), (Equation, Threat Type), (Malware, Tool), (router, Device), (firewall appliance, Device), (exploit, Technique)
Relationships: (Shadow Brokers, uses, Equatio...

Processing 3/27: strider cyberespionage sauron...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Strider, Attacker), (Remsec, Malware), (Sauron, Threat Type), (Regin, Threat Type), (Flamer, Threat Type), (Lua, Technique)
Relationships: (Strider, uses, Remsec), (Strider, linkedTo,...

Processing 4/27: swift malware financial attacks...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (bank in the Philippines, Target), (Bangladesh central bank, Target), (Tien Phong Bank, Target), (Vietnam's Tien Phong Bank, Target), (Banco del Austro, Target), (Ecuador's Banco del A...

Processing 5/27: tick cyberespionage japan...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Tick, Attacker), (Backdoor.Daserf, Malware), (Gofarer, Tool), (Flash(.swf), Vulnerability), (Japanese, Location), (technology, Sector), (aquatic engineering, Sector), (broadcasting, S...
  ✅ Processed 5/27 articles

Processing 6/27: taiwan cyberespionage backdoor trojan...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Backdoor.Dripion, Malware), (Budminer, Attacker), (Trojan.Taidoor, Malware), (file hashes, Hash), (Taiwan, Location), (Brazil, Location), (United States, Location), (command and contr...

Processing 7/27: operation blockbuster lazarus...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Operation Blockbuster, Initiative), (Lazarus, Attacker), (Novetta, Company), (Symantec, Company), (u, Technique)
Relationships: (Operation Blockbuster, launchedBy, Novetta), (Operatio...

Processing 8/27: dridex financial trojan spam...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Dridex, Malware), (Symantec, Company), (W32.Cridex, Malware), (English, Language), (financial, Threat Type), (banking, Target), (Symantec, Company), (whitepaper, Document), (Symantec,...

Processing 9/27: dyre bank fraud group takedown...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Dyre, Malware), (Upatre, Malware), (Downloader.Upatre, Tool), (email spam campaigns, Technique), (November, Date), (November 18, Date), (Downloader.Upatre, usedBy, Dyre), (Updater.Upa...

Processing 10/27: destructive disakil malware ukraine...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Disakil, Malware), (BlackEnergy, Malware), (Sandworm, Attacker), (Apress, Organization), (SBU, Organization), (Ukraine, Location), (energy sector, Threat Type)
Relationships: (Disakil...
  ✅ Processed 10/27 articles

Processing 11/27: dridex takedown...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Dridex, Threat Type), (W32.Cridex, Malware), (Bugat, Malware), (financial threat, Threat Type), (malicious macros, Technique), (Microsoft Office, Tool), (Symantec, Company), (State of...

Processing 12/27: regin mysteries cyberespionage...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Regin, Malware), (Symantec, Attacker), (technical whitepaper, Document), (command-and-control (C&C) infrastructure, Infrastructure)
Relationships: (Regin, uses, technical whitepaper),...

Processing 13/27: black vine cyberespionage aerospace healthcare...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Anthem, Threat Type), (Black Vine, Attacker), (zero-day vulnerability, Vulnerability), (watering-hole attack, Technique), (custom malware, Malware), (legitimate website, Domain), (rem...

Processing 14/27: forkmeiamfamous seaduke duke...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Seaduke, Malware), (Cozyduke, Malware), (Cyberespionage group, Attacker), (United States, Country), (Europe, Continent)
Relationships: (Seaduke, uses, Cozyduke), (Cyberespionage group...

Processing 15/27: butterfly corporate attacks...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Butterfly, Attacker), (Windows, OS), (Apple, OS), (zero-day vulnerability, Vulnerability), (Twitter, Domain), (Facebook, Domain), (Apple, Company), (Microsoft, Company)
Relationships:...
  ✅ Processed 15/27 articles

Processing 16/27: dyre financial trojan...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Dyre, Malware), (Symantec, Tool), (Infostealer.Dyre, Hash), (spam emails, Technique), (Malicious website, URL)
Relationships: (Dyre, uses, Infostealer.Dyre), (Dyre, spreadsUsing, spam...

Processing 17/27: duqu 20 cyberespionage...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Duqu 2.0, Malware), (Kaspersky Lab, Attacker), (Stuxnet, Malware), (Iranian nuclear development program, Target), (European telecoms operator, Target), (North African telecoms operato...

Processing 18/27: equation cyberespionage group...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Equation, Attacker), (Malware, Malware), (Wipbot, Malware), (Trojan Turla, Malware), (Infostealer.Micstus, Malware), (Trojan.Tripfant, Malware), (Grayphish, Malware), (GrayFish, Malwa...

Processing 19/27: carbanak cybercrime gang...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Carbanak, Threat Type), (Trojan.Carberp.B, Malware), (Trojan.Carberp, Malware), (Silicon, Attacker), (Anunak, Attacker), (ATM, Object), (money mule, Object)
Relationships: (Carbanak, ...

Processing 20/27: destover destructive malware south korea...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Backdoor.Destover, Malware), (FBI, Attacker), (Trojan.Volgmer, Malware), (Volgmer, Malware), (Jokra, Malware), (Shamoon, Malware), (commercially available drivers, Tool), (Destover, M...
  ✅ Processed 20/27 articles

Processing 21/27: regin espionage surveillance...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Regin, Malware), (Backdoor.Regin, Malware)
Relationships: (Regin, uses, Backdoor.Regin) ```...

Processing 22/27: turla espionage diplomats...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Wipbot, Malware), (Turla, Malware), (Spear phishing, Technique), (Watering hole, Technique), (IP address, IP), (Legitimate website, Domain), (Compromised website, Domain), (Malware, F...

Processing 23/27: dragonfly energy companies sabotage...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Dragonfly, Attacker), (Stuxnet, Malware), (remote access type Trojan, Malware), (ICS equipment providers, Target), (software, File), (ICS equipment, File), (ICS computers, Target), (S...

Processing 24/27: hidden lynx professional hackers hire...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Hidden Lynx, Attacker), (Advanced Persistent Threats, Threat Type), (Watering hole, Technique), (zero-day vulnerabilities, Vulnerability), (supply chain, Tool), (intelligent hunter, A...

Processing 25/27: darkseoul cyberattacks south korea...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (DarkSeoul gang, Attacker), (DDoS, Technique), (Trojan.Castov, Malware), (Jokra attacks, Threat Type), (United States Independence Day, Event), (South Korean independence day, Event), ...
  ✅ Processed 25/27 articles

Processing 26/27: duqu next stuxnet...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🔍 Raw model output: Named Entities: (Stuxnet, Malware), (Duqu, Malware), (Industrial Control System, Vulnerability), (remote access trojan, Tool), (telemetry, Data)
Relationships: (Stuxnet, contains, Duqu), (Duqu, uses, ...

Processing 27/27: stuxnet dossier espionage...
🔍 Raw model output: Named Entities: (Stuxnet, Malware), (VirusBlokada, Attacker), (unpatched vulnerability, Vulnerability), (removable drive, Device), (industrial control systems, Target)
Relationships: (Stuxnet, uses, u...
💾 Saving to: /Users/huynguyen/Documents/UIT/2nd/NLP/LLM-TKIG/data/entity-extraction/entity_extraction_results_Qwen/Qwen2.5-1.5B-Instruct_test_2025-08-04_18-31-25_400_427.json

💾 SAVED EXTRACTION RESULTS to ../data/entity-extraction/entity_extraction_results_Qwen/Qwen2.5-1.5B-Instruct_test_2025-08-04_18-31-25_400_427.json
