[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Hawksight-AI/semantica/blob/main/cookbook/use_cases/cybersecurity/02_Threat_Intelligence_Hybrid_RAG.ipynb)

# Threat Intelligence Hybrid RAG - Vector + Graph Retrieval

## Overview

This notebook demonstrates **threat intelligence hybrid RAG** using Semantica with focus on **hybrid search**, **vector + graph retrieval**, and **context-aware queries**. The pipeline combines vector search with temporal knowledge graphs for advanced threat intelligence querying.

### Key Features

- **Hybrid RAG**: Combines vector similarity search with knowledge graph traversal
- **Vector + Graph Retrieval**: Uses both vector embeddings and graph relationships
- **Context-Aware Queries**: Provides context-aware retrieval for threat intelligence
- **Temporal Knowledge Graphs**: Builds temporal KGs for threat timeline analysis
- **Multi-hop Reasoning**: Follows relationships across the graph for deeper context
- **Comprehensive Data Sources**: Multiple threat intelligence feeds, APIs, and databases
- **Modular Architecture**: Direct use of Semantica modules without core orchestrator

### Learning Objectives

- Ingest threat intelligence data from multiple sources
- Extract threat entities (IOCs, Campaigns, Threats, Actors, TTPs, Malware)
- Build temporal threat intelligence knowledge graphs
- Generate embeddings and populate vector stores
- Perform hybrid vector + graph queries
- Analyze threat networks using graph analytics
- Store and query threat intelligence using vector stores and graph stores

### Pipeline Flow

```mermaid
graph TD
    A[Data Ingestion] --> B[Document Parsing]
    B --> C[Text Processing]
    C --> D[Entity Extraction]
    D --> E[Relationship Extraction]
    E --> F[Deduplication]
    F --> G[Conflict Detection]
    G --> H[Temporal Knowledge Graph]
    H --> I[Embeddings]
    I --> J[Vector Store]
    H --> K[Temporal Queries]
    K --> L[Graph Analytics]
    L --> M[GraphRAG Queries]
    J --> M
    H --> N[Reasoning & Threat]
    M --> O[Visualization]
    N --> O
    H --> P[Graph Store]
    P --> O
    O --> Q[Export]
```


## Installation


In [1]:
%pip install -qU semantica networkx matplotlib plotly pandas faiss-cpu beautifulsoup4 groq sentence-transformers scikit-learn


Note: you may need to restart the kernel to use updated packages.




## Configuration & Setup


In [2]:
import os

os.environ["GROQ_API_KEY"] = os.getenv("GROQ_API_KEY", "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")

# Configuration constants
EMBEDDING_DIMENSION = 384
EMBEDDING_MODEL = "all-MiniLM-L6-v2"
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200
TEMPORAL_GRANULARITY = "day"


## Ingesting Threat Intelligence Data


In [3]:
from semantica.ingest import FeedIngestor, WebIngestor, FileIngestor
import os
from contextlib import redirect_stderr
from io import StringIO

os.makedirs("data", exist_ok=True)

feed_sources = [
    # Threat Intelligence RSS Feeds
    ("US-CERT Alerts", "https://www.us-cert.gov/ncas/alerts.xml"),
    ("SANS ISC", "https://isc.sans.edu/rssfeed.xml"),
    ("Krebs on Security", "https://krebsonsecurity.com/feed/"),
    ("ThreatPost", "https://threatpost.com/feed/"),
    ("BleepingComputer", "https://www.bleepingcomputer.com/feed/"),
    ("SecurityWeek", "https://www.securityweek.com/rss"),
]

feed_ingestor = FeedIngestor()
all_documents = []

print(f"Ingesting from {len(feed_sources)} feed sources...")
for i, (feed_name, feed_url) in enumerate(feed_sources, 1):
    try:
        with redirect_stderr(StringIO()):
            feed_data = feed_ingestor.ingest_feed(feed_url, validate=False)
        
        feed_count = 0
        for item in feed_data.items:
            if not item.content:
                item.content = item.description or item.title or ""
            if item.content:
                if not hasattr(item, 'metadata'):
                    item.metadata = {}
                item.metadata['source'] = feed_name
                all_documents.append(item)
                feed_count += 1
        
        if feed_count > 0:
            print(f"  [{i}/{len(feed_sources)}] {feed_name}: {feed_count} documents")
    except Exception:
        continue

if not all_documents:
    threat_data = """
    IOC: IP address 192.168.1.50 associated with APT28 campaign.
    Threat actor APT28 uses TTP: Spear phishing and credential harvesting.
    Campaign Operation GhostShell targets financial institutions.
    Malware sample hash: abc123def456 linked to APT28 infrastructure.
    IOC: Domain example-malicious.com linked to APT29 operations.
    Threat actor APT29 uses TTP: Watering hole attacks and lateral movement.
    Campaign Operation SolarWinds targets technology companies.
    IOC: File hash xyz789ghi012 associated with ransomware group.
    """
    with open("data/threat_intel.txt", "w") as f:
        f.write(threat_data)
    file_ingestor = FileIngestor()
    all_documents = file_ingestor.ingest("data/threat_intel.txt")

documents = all_documents
print(f"Ingested {len(documents)} documents")


Ingesting from 6 feed sources...


Status,Action,Module,Submodule,Progress,ETA,Rate,Time
‚ùå,Semantica is parsing,üîç parse,DocumentParser,-,-,-,0.00s
‚ùå,Semantica is parsing,üîç parse,DocumentParser,-,-,-,0.00s
‚ùå,Semantica is parsing,üîç parse,DocumentParser,-,-,-,0.00s
‚úÖ,Semantica is normalizing,üîß normalize,TextNormalizer,100.0%,-,-,0.00s
‚úÖ,Semantica is extracting,üéØ semantic_extract,NERExtractor,100.0%,-,-,0.70s
‚úÖ,Semantica is extracting,üéØ semantic_extract,RelationExtractor,100.0%,-,-,0.47s
‚úÖ,Semantica is deduplicating,üîÑ deduplication,DuplicateDetector,100.0% (3291/3291),-,516.7/s,3.41s
‚úÖ,Semantica is deduplicating,üîÑ deduplication,SimilarityCalculator,100.0% (26870/26870),-,4219.7/s,1.83s
‚úÖ,Semantica is deduplicating,üîÑ deduplication,EntityMerger,100.0% (111/111),-,17.4/s,6.38s
‚úÖ,Semantica is deduplicating,üîÑ deduplication,MergeStrategyManager,100.0% (4/4),-,144.6/s,0.02s


üß† Semantica is ingesting: Ingested 10 items |‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100.0% ‚úÖüì•                                                 [1/6] US-CERT Alerts: 10 documents
üß† Semantica is ingesting: Ingested 10 items |‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100.0% ‚úÖüì• | üß† Semantica is ingesting: Ingested 10 items |‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100.0% ‚úÖüì•  [2/6] SANS ISC: 10 documents
üß† Semantica is ingesting: Ingested 10 items |‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100.0% ‚úÖüì• | üß† Semantica is ingesting: Ingested 10 items |‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100.0% ‚úÖüì•  [3/6] Krebs on Security: 10 documents
üß† Semantica is ingesting: Ingested 10 items |‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100.0% ‚úÖüì• | üß† Semantica is ingesting: Ingested 10 items |‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100.0% ‚úÖüì•  [4/6] ThreatPost: 10 documents
üß† Semantica is ingesting: 403 Client 

## Parsing Threat Intelligence Documents


In [4]:
from semantica.parse import DocumentParser

parser = DocumentParser()

print(f"Parsing {len(documents)} documents...")
parsed_documents = []
for i, doc in enumerate(documents, 1):
    try:
        parsed = parser.parse(
            doc.content if hasattr(doc, 'content') else str(doc),
            content_type="text"
        )
        parsed_documents.append(parsed)
    except Exception:
        parsed_documents.append(doc)
    if i % 50 == 0 or i == len(documents):
        print(f"  Parsed {i}/{len(documents)} documents...")

documents = parsed_documents


Parsing 40 documents...
üß† Semantica is ingesting: 403 Client Error: Forbidden for url: https://www.securityweek.com/rss ‚ùåüì• (0.4s) | üß† Semantica is parsing: Document: p>
üß† Semantica is parsing: Document: p>                                                                                   
 üîÑüîç (0.0s) | üß† Semantica is parsing: Document: p>
üß† Semantica is parsing: Document: p>                                                                                   
 üîÑüîç (0.0s) | üß† Semantica is parsing: Document file not found: Direct navigation -- the act of visiting a website by manually typing a domain name in a web browser -- has never been riskier: A new study finds the vast majority of "parked" domains -- mostly expired or dormant domain names, or common misspellings of popular websites -- are now configured to redirect visitors to sites that foist scams and malware. ‚ùåüîç (0.0s)  Parsed 40/40 documents...


## Normalizing and Chunking Threat Intelligence Data


In [5]:
from semantica.normalize import TextNormalizer
from semantica.split import TextSplitter

normalizer = TextNormalizer()
# Use entity-aware chunking to preserve threat entity boundaries for GraphRAG
splitter = TextSplitter(
    method="entity_aware",
    ner_method="spacy",
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP
)

print(f"Normalizing {len(documents)} documents...")
normalized_documents = []
for i, doc in enumerate(documents, 1):
    normalized_text = normalizer.normalize(
        doc.content if hasattr(doc, 'content') else str(doc),
        clean_html=True,
        normalize_entities=True,
        remove_extra_whitespace=True,
        lowercase=False
    )
    normalized_documents.append(normalized_text)
    if i % 50 == 0 or i == len(documents):
        print(f"  Normalized {i}/{len(documents)} documents...")

print(f"Chunking {len(normalized_documents)} documents...")
chunked_documents = []
for i, doc_text in enumerate(normalized_documents, 1):
    try:
        with redirect_stderr(StringIO()):
            chunks = splitter.split(doc_text)
        chunked_documents.extend(chunks)
    except Exception:
        simple_splitter = TextSplitter(method="recursive", chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)
        chunks = simple_splitter.split(doc_text)
        chunked_documents.extend(chunks)
    if i % 50 == 0 or i == len(normalized_documents):
        print(f"  Chunked {i}/{len(normalized_documents)} documents ({len(chunked_documents)} chunks so far)")

print(f"Created {len(chunked_documents)} chunks from {len(normalized_documents)} documents")


Normalizing 40 documents...
üß† Semantica is parsing: Document file not found: Direct navigation -- the act of visiting a website by manually typing a domain name in a web browser -- has never been riskier: A new study finds the vast majority of "parked" domains -- mostly expired or dormant domain names, or common misspellings of popular websites -- are now configured to redirect visitors to sites that foist scams and malware. ‚ùåüîç (0.0s) | üß† Semantica is normalizing |‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100.0% ‚úÖüîß  Normalized 40/40 documents...
Chunking 40 documents...
üß† Semantica is normalizing |‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100.0% ‚úÖüîß | üß† Semantica is extracting: Extracted 3 entities using spacy |‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100.0% ‚úÖüéØÔøΩÔøΩ  Chunked 40/40 documents (85 chunks so far)
Created 85 chunks from 40 documents


In [6]:
from semantica.semantic_extract import NERExtractor

entity_extractor = NERExtractor(
    method="ml",  
    model="en_core_web_sm"
)

all_entities = []
print(f"Extracting entities from {len(chunked_documents)} chunks using ML-based extraction...")
for i, chunk in enumerate(chunked_documents, 1):
    chunk_text = chunk.text if hasattr(chunk, 'text') else str(chunk)
    try:
        entities = entity_extractor.extract_entities(chunk_text)
        # Filter entities by threat intelligence types
        filtered_entities = [
            e for e in entities 
            if any(entity_type.lower() in e.label.lower() for entity_type in ["IOC", "Campaign", "Threat", "Actor", "TTP", "Malware", "ORG", "PERSON", "GPE"])
        ]
        all_entities.extend(filtered_entities)
    except Exception:
        continue
    
    if i % 20 == 0 or i == len(chunked_documents):
        print(f"  Processed {i}/{len(chunked_documents)} chunks ({len(all_entities)} entities found)")

# Map spaCy entity types to threat intelligence types
iocs = [e for e in all_entities if "ioc" in e.label.lower() or e.text.startswith(("http", "192", "10.", "172."))]
actors = [e for e in all_entities if e.label in ["PERSON", "ORG"] or "actor" in e.label.lower()]
campaigns = [e for e in all_entities if "campaign" in e.label.lower() or "campaign" in e.text.lower()]
ttps = [e for e in all_entities if "ttp" in e.label.lower() or "technique" in e.label.lower()]

print(f"Extracted {len(iocs)} IOCs, {len(actors)} actors, {len(campaigns)} campaigns, {len(ttps)} TTPs")


Extracting entities from 85 chunks using ML-based extraction...
üß† Semantica is normalizing |‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100.0% ‚úÖüîß | üß† Semantica is extracting: Extracted 8 entities using ml |‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100.0% ‚úÖüéØÔøΩ  Processed 20/85 chunks (248 entities found)
üß† Semantica is normalizing |‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100.0% ‚úÖüîß | üß† Semantica is extracting: Extracted 2 entities using ml |‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100.0% ‚úÖüéØÔøΩ  Processed 40/85 chunks (504 entities found)
üß† Semantica is normalizing |‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100.0% ‚úÖüîß | üß† Semantica is extracting: Extracted 4 entities using ml |‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100.0% ‚úÖüéØÔøΩÔøΩ  Processed 60/85 chunks (811 entities found)
üß† Semantica is normalizing |‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100.0% ‚úÖüîß | üß† Semantica is extracting: Extracted 

## Extracting Threat Relationships


In [7]:
from semantica.semantic_extract import RelationExtractor

relation_extractor = RelationExtractor(
    method="dependency",  
    model="en_core_web_sm",  
    confidence_threshold=0.5,  
    max_distance=50  
)

all_relationships = []
print(f"Extracting relationships from {len(chunked_documents)} chunks using ML-based dependency parsing...")
for i, chunk in enumerate(chunked_documents, 1):
    chunk_text = chunk.text if hasattr(chunk, 'text') else str(chunk)
    try:
        # Extract relationships using dependency parsing
        relationships = relation_extractor.extract_relations(
            chunk_text,
            entities=all_entities,
            relation_types=["associated_with", "uses", "targets", "linked_to", "part_of", "employs"]
        )
        all_relationships.extend(relationships)
    except Exception:
        continue
    
    if i % 20 == 0 or i == len(chunked_documents):
        print(f"  Processed {i}/{len(chunked_documents)} chunks ({len(all_relationships)} relationships found)")

print(f"Extracted {len(all_relationships)} relationships")


Extracting relationships from 85 chunks using ML-based dependency parsing...
üß† Semantica is extracting: Extracted 3 entities using ml |‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100.0% ‚úÖüéØ | üß† Semantica is extracting: Extracted 11 relations using dependency |‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100.0% ‚úÖüéØ  Processed 20/85 chunks (359 relationships found)
üß† Semantica is extracting: Extracted 3 entities using ml |‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100.0% ‚úÖüéØ | üß† Semantica is extracting: Extracted 9 relations using dependency |‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100.0% ‚úÖüéØÔøΩ  Processed 40/85 chunks (630 relationships found)
üß† Semantica is extracting: Extracted 3 entities using ml |‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100.0% ‚úÖüéØ | üß† Semantica is extracting: Extracted 7 relations using dependency |‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100.0% ‚úÖüéØÔøΩ  Processed 60/85 chunks (830 relationships fo

## Resolving Duplicate IOCs and Actors

**Best Approach & Methods:**

‚Ä¢ **Multi-Factor Detection**: `DuplicateDetector` with Jaro-Winkler similarity (0.85 threshold) + property/type matching for high-precision duplicate identification

‚Ä¢ **Keep Most Complete Merge**: `EntityMerger` with `strategy="keep_most_complete"` preserves entities with maximum information (properties, relationships, metadata)


In [8]:
from semantica.deduplication import DuplicateDetector, EntityMerger
from semantica.semantic_extract import Entity

# Convert Entity objects to dictionaries for deduplication module
print(f"Converting {len(all_entities)} entities to dictionaries...")
entity_dicts = [
    {
        "id": f"entity_{i}",
        "name": e.text,
        "type": e.label,
        "start_char": e.start_char,
        "end_char": e.end_char,
        "confidence": e.confidence,
        "metadata": e.metadata if hasattr(e, 'metadata') else {}
    }
    for i, e in enumerate(all_entities)
]

# Use DuplicateDetector with similarity threshold for duplicate detection
# Progress tracking is built-in: automatically shows similarity calculation, 
# duplicate candidate creation, and group formation progress
duplicate_detector = DuplicateDetector(
    similarity_threshold=0.85,  # Jaro-Winkler similarity threshold
    confidence_threshold=0.7  # Minimum confidence for duplicate candidates
)

print(f"Detecting duplicates in {len(entity_dicts)} entities...")
# Progress tracking automatically displays:
# - Similarity calculation progress (comparing entity pairs)
# - Duplicate candidate creation progress
# - Duplicate group formation progress
duplicate_groups = duplicate_detector.detect_duplicate_groups(entity_dicts)

print(f"Detected {len(duplicate_groups)} duplicate groups")

# Use EntityMerger to merge duplicates using keep_most_complete strategy
# Progress tracking is built-in: automatically shows duplicate detection 
# and merge operations progress
entity_merger = EntityMerger(preserve_provenance=True)

print(f"Merging duplicates using keep_most_complete strategy...")
# Progress tracking automatically displays:
# - Duplicate group detection progress
# - Merge operations progress (for each group being merged)
merge_operations = entity_merger.merge_duplicates(
    entity_dicts,
    strategy="keep_most_complete",  # Preserve entity with most information
    threshold=0.85
)

# Extract merged entities from merge operations
merged_entity_dicts = []
merged_ids = set()

for op in merge_operations:
    merged_entity_dicts.append(op.merged_entity)
    # Track all source entity IDs that were merged
    for source in op.source_entities:
        merged_ids.add(source.get("id") or source.get("name"))

# Add entities that weren't merged (singletons)
for entity in entity_dicts:
    entity_id = entity.get("id") or entity.get("name")
    if entity_id not in merged_ids:
        merged_entity_dicts.append(entity)

# Convert back to Entity objects
merged_entities = [
    Entity(
        text=e.get("name", ""),
        label=e.get("type", ""),
        start_char=e.get("start_char", 0),
        end_char=e.get("end_char", 0),
        confidence=e.get("confidence", 1.0),
        metadata=e.get("metadata", {})
    )
    for e in merged_entity_dicts
]

print(f"Deduplicated {len(entity_dicts)} entities to {len(merged_entities)} unique entities")


Converting 857 entities to dictionaries...
Detecting duplicates in 857 entities...
üß† Semantica is deduplicating: Detected 3291 duplicate candidates |‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100.0% [3291/3291] ‚úÖüîÑ (922.5/s) | üß† Semantica is deduplicating: Found 3291 similar pairs across 27 blocks |‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100.0% [26870/26870] ‚úÖüîÑ (12335.6/s)0/26870] ‚úÖüîÑ (12335.6/s) (12335.6/s)))Detected 111 duplicate groups
Merging duplicates using keep_most_complete strategy...
üß† Semantica is deduplicating: Completed 111 merge operations |‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100.0% [111/111] ‚úÖüîÑ (17.4/s) | üß† Semantica is deduplicating: Building merged entity... (4/4, remaining: 0 steps) |‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100.0% [4/4] üîÑüîÑ (283.6/s)ÔøΩ (283.6/s)ÔøΩ (283.6/s) üîÑüîÑ (ETA: 0.0s | 247.9/s)124.7/s)s)Deduplicated 857 entities to 414 unique entities


## Detecting Threat Intelligence Conflicts

**Best Approach & Methods:**

‚Ä¢ **Type Conflict Detection**: `method="type"` identifies conflicting entity classifications (e.g., IOC as both "Malware" and "Threat")

‚Ä¢ **Highest Confidence Resolution**: `strategy="highest_confidence"` automatically resolves conflicts by prioritizing the most confident source


In [14]:
from semantica.conflicts import ConflictDetector, ConflictResolver

conflict_detector = ConflictDetector()
conflict_resolver = ConflictResolver()

# Convert Entity objects to dictionaries for conflict detection
entity_dicts = [
    {
        "id": e.text if hasattr(e, 'text') else str(e),
        "text": e.text if hasattr(e, 'text') else str(e),
        "label": e.label if hasattr(e, 'label') else "ENTITY",
        "type": e.label if hasattr(e, 'label') else "ENTITY",
        "confidence": e.confidence if hasattr(e, 'confidence') else 1.0,
        "metadata": e.metadata if hasattr(e, 'metadata') else {}
    }
    for e in all_entities
]

print(f"Detecting type conflicts in {len(entity_dicts)} entities...")
conflicts = conflict_detector.detect_type_conflicts(entity_dicts)

print(f"Detected {len(conflicts)} type conflicts")

if conflicts:
    print(f"Resolving conflicts using highest_confidence strategy...")
    resolved = conflict_resolver.resolve_conflicts(
        conflicts,
        strategy="highest_confidence"
    )
    print(f"Resolved {len(resolved)} conflicts")
else:
    print("No conflicts detected")

Detecting type conflicts in 857 entities...
üß† Semantica is deduplicating: Building merged entity... (4/4, remaining: 0 steps) |‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100.0% [4/4] üîÑüîÑ (283.6/s) | üß† Semantica is resolving: Grouping entities... 96/857 (remaining: 761) |‚ñà‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë| 11.2% [96/857] üîÑ‚ö†Ô∏è (ETA: 0.4s | 1921.4/s)

üß† Semantica is deduplicating: Building merged entity... (4/4, remaining: 0 steps) |‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100.0% [4/4] üîÑüîÑ (283.6/s) | üß† Semantica is resolving: Checking entity groups for type conflicts... 0/445 (remaining: 445) |‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë| 0.0% [0/445] üîÑ‚ö†Ô∏ès)))

Type conflict detected: China Chopper conflicting types: ['PERSON', 'ORG']


üß† Semantica is deduplicating: Building merged entity... (4/4, remaining: 0 steps) |‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100.0% [4/4] üîÑüîÑ (283.6/s) | üß† Semantica is resolving: Checking entity groups for type conflicts... 28/445 (remaining: 417) |‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë| 6.3% [28/445] üîÑ‚ö†Ô∏è (ETA: 5.9s | 69.9/s)

Type conflict detected: PowerShell Empire conflicting types: ['GPE', 'ORG']


üß† Semantica is deduplicating: Building merged entity... (4/4, remaining: 0 steps) |‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100.0% [4/4] üîÑüîÑ (283.6/s) | üß† Semantica is resolving: Checking entity groups for type conflicts... 184/445 (remaining: 261) |‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë| 41.3% [184/445] üîÑ‚ö†Ô∏è (ETA: 0.6s | 413.2/s)

Type conflict detected: Dridex conflicting types: ['PERSON', 'ORG']
Type conflict detected: CISA conflicting types: ['GPE', 'ORG']


üß† Semantica is deduplicating: Building merged entity... (4/4, remaining: 0 steps) |‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100.0% [4/4] üîÑüîÑ (283.6/s) | üß† Semantica is resolving: Checking entity groups for type conflicts... 304/445 (remaining: 141) |‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñë‚ñë‚ñë‚ñë‚ñë| 68.3% [304/445] üîÑ‚ö†Ô∏è (ETA: 0.2s | 644.9/s)

Type conflict detected: IRGC conflicting types: ['GPE', 'ORG']


üß† Semantica is deduplicating: Building merged entity... (4/4, remaining: 0 steps) |‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100.0% [4/4] üîÑüîÑ (283.6/s) | üß† Semantica is resolving: Detected 5 type conflicts |‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100.0% [436/445] ‚úÖ‚ö†Ô∏è (862.7/s)‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñë| 98.0% [436/445] üîÑ‚ö†Ô∏è (ETA: 0.0s | 867.9/s)))Detected 5 type conflicts
Resolving conflicts using highest_confidence strategy...
üß† Semantica is resolving: Detected 5 type conflicts |‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100.0% [436/445] ‚úÖ‚ö†Ô∏è (862.7/s) | üß† Semantica is resolving: Resolving conflicts... 0/5 (remaining: 5) |‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë| 0.0% [0/5] üîÑ‚ö†Ô∏èResolved 5 conflicts


## Building Temporal Threat Intelligence Knowledge Graph


In [17]:
from semantica.kg import GraphBuilder

graph_builder = GraphBuilder(
    merge_entities=False,
    resolve_conflicts=False,
    entity_resolution_strategy="fuzzy",
    enable_temporal=True,
    temporal_granularity=TEMPORAL_GRANULARITY
)

print(f"Building knowledge graph...")
kg_sources = [{
    "entities": [{"text": e.text, "type": e.label, "confidence": e.confidence} for e in merged_entities],
    "relationships": [{"source": r.subject.text, "target": r.object.text, "type": r.predicate, "confidence": r.confidence} for r in all_relationships]
}]

kg = graph_builder.build(kg_sources)

entities_count = len(kg.get('entities', []))
relationships_count = len(kg.get('relationships', []))
print(f"Graph: {entities_count} entities, {relationships_count} relationships")


Building knowledge graph...
üß† Semantica is resolving: Resolving conflicts... 0/5 (remaining: 5) |‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë‚ñë| 0.0% [0/5] üîÑ‚ö†Ô∏è | üß† Semantica is building: Processing relationships... 896/896 |‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100.0% [896/896] üîÑüß† (18605.5/s) 17715.1/s)Building graph structure...
‚úÖ Graph structure built (0.00s)

‚úÖ Knowledge Graph Build Complete
   Entities: 414
   Relationships: 896
   Total time: 0.08s
Graph: 414 entities, 896 relationships


## Generating Embeddings for IOCs and Threats


In [18]:
from semantica.embeddings import EmbeddingGenerator

embedding_gen = EmbeddingGenerator(
    provider="sentence_transformers",
    model=EMBEDDING_MODEL
)

ioc_texts = [f"{ioc.text} {getattr(ioc, 'description', '')}" for ioc in iocs]
ioc_embeddings = embedding_gen.generate_embeddings(ioc_texts)

actor_texts = [f"{actor.text} {getattr(actor, 'description', '')}" for actor in actors]
actor_embeddings = embedding_gen.generate_embeddings(actor_texts)

print(f"Generated {len(ioc_embeddings)} IOC embeddings and {len(actor_embeddings)} actor embeddings")


fastembed not available. Install with: pip install fastembed. Using fallback embedding method.


Generated 2 IOC embeddings and 741 actor embeddings


## Populating Vector Store


In [19]:
from semantica.vector_store import VectorStore

vector_store = VectorStore(backend="faiss", dimension=EMBEDDING_DIMENSION)

print(f"Storing {len(ioc_embeddings)} IOC vectors and {len(actor_embeddings)} actor vectors...")
ioc_ids = vector_store.store_vectors(
    vectors=ioc_embeddings,
    metadata=[{"type": "ioc", "name": ioc.text, "label": ioc.label} for ioc in iocs]
)

actor_ids = vector_store.store_vectors(
    vectors=actor_embeddings,
    metadata=[{"type": "actor", "name": actor.text, "label": actor.label} for actor in actors]
)

print(f"Stored {len(ioc_ids)} IOC vectors and {len(actor_ids)} actor vectors")


fastembed not available. Install with: pip install fastembed. Using fallback embedding method.


Storing 2 IOC vectors and 741 actor vectors...
üß† Semantica is building: Processing relationships... 896/896 |‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100.0% [896/896] üîÑüß† (18605.5/s) | üß† Semantica is indexing: Storing 2 vectors üîÑüìä (0.0s)Stored 2 IOC vectors and 741 actor vectors


## Temporal Graph Queries


In [21]:
from semantica.kg import TemporalGraphQuery

temporal_query = TemporalGraphQuery(
    enable_temporal_reasoning=True,
    temporal_granularity=TEMPORAL_GRANULARITY
)

query_results = temporal_query.query_at_time(
    kg,
    query={"type": "Campaign"},
    at_time="2024-01-01"
)

evolution = temporal_query.analyze_evolution(kg)
temporal_patterns = temporal_query.query_temporal_pattern(kg, pattern="sequence")

print(f"Temporal queries: {len(query_results)} campaigns at query time")
print(f"Temporal patterns detected: {temporal_patterns.get('num_patterns', 0)}")


Temporal queries: 6 campaigns at query time
Temporal patterns detected: 0


## Analyzing Threat Network Structure


In [22]:
from semantica.kg import GraphAnalyzer, CentralityCalculator, CommunityDetector

graph_analyzer = GraphAnalyzer()
centrality_calc = CentralityCalculator()
community_detector = CommunityDetector()

analysis = graph_analyzer.analyze_graph(kg)

degree_centrality = centrality_calc.calculate_degree_centrality(kg)
betweenness_centrality = centrality_calc.calculate_betweenness_centrality(kg)

communities = community_detector.detect_communities(kg, method="louvain")
connectivity = graph_analyzer.analyze_connectivity(kg)

print(f"Graph analytics:")
print(f"  - Communities: {len(communities)}")
print(f"  - Connected components: {len(connectivity.get('components', []))}")
print(f"  - Graph density: {analysis.get('density', 0):.3f}")
print(f"  - Central nodes (degree): {len(degree_centrality)}")


üß† Semantica is building: Calculating degree centrality üîÑüß† (0.0s) | üß† Semantica is building: Detecting communities with NetworkX... üîÑüß† (0.0s)üß†Graph analytics:
  - Communities: 4
  - Connected components: 11
  - Graph density: 0.000
  - Central nodes (degree): 4


## GraphRAG: Hybrid Vector + Graph Queries


In [26]:
from semantica.context import AgentContext
from semantica.llms import Groq
import os

context = AgentContext(
    vector_store=vector_store, 
    knowledge_graph=kg,
    max_expansion_hops=3,
    hybrid_alpha=0.7
)

llm = Groq(model="llama-3.1-8b-instant", api_key=os.getenv("GROQ_API_KEY"))

# First, explore what threat actors/entities are in the graph
print("Exploring knowledge graph for threat actors and entities...\n")
kg_entities = kg.get('entities', [])
kg_relationships = kg.get('relationships', [])

# Find threat-related entities
threat_keywords = ['APT', 'malware', 'threat', 'actor', 'campaign', 'attack', 'vulnerability', 'exploit']
threat_entities = []
for e in kg_entities:
    text = str(e.get('text', '') or e.get('name', '')).upper()
    entity_type = str(e.get('type', '')).upper()
    if any(kw in text or kw in entity_type for kw in threat_keywords):
        threat_entities.append(e)

print(f"Found {len(threat_entities)} threat-related entities:")
for e in threat_entities[:10]:
    print(f"  - {e.get('text') or e.get('name')} (type: {e.get('type')})")

# Check for APT28 specifically
apt28_entities = [
    e for e in kg_entities 
    if 'APT28' in str(e.get('text', '')).upper() or 
       'APT28' in str(e.get('name', '')).upper() or
       'FANCY BEAR' in str(e.get('text', '')).upper()
]

print(f"\nSearching for APT28/Fancy Bear: {'Found' if apt28_entities else 'Not found in knowledge graph'}")
if apt28_entities:
    for e in apt28_entities:
        print(f"  - {e.get('text') or e.get('name')} (type: {e.get('type')})")

# Try broader query if APT28 not found
query = "What threats are associated with APT28?" if apt28_entities else "What are the main cybersecurity threats and threat actors mentioned?"

print(f"\n{'='*80}")
print(f"Query: {query}")
print(f"{'='*80}\n")

# Use multi-hop reasoning with improved prompt
result = context.query_with_reasoning(
    query=query,
    llm_provider=llm,
    max_results=20,
    max_hops=3,
    min_score=0.15
)

print("=" * 80)
print("Generated Answer (with Multi-hop Reasoning):")
print("=" * 80)
response = result.get('response', 'No response generated')

# If APT28 not found, enhance the response
if not apt28_entities and 'APT28' in query:
    response += f"\n\nNote: APT28 (Fancy Bear) was not found in the current knowledge graph. "
    response += f"The graph contains {len(threat_entities)} threat-related entities. "
    response += "Consider ingesting more threat intelligence feeds that mention APT28."

print(response)
print("\n" + "=" * 80)

print(f"\nReasoning Details:")
print(f"- Confidence: {result.get('confidence', 0):.3f}")
print(f"- Sources: {result.get('num_sources', 0)}")
print(f"- Reasoning Paths: {result.get('num_reasoning_paths', 0)}")
print(f"- Total entities in graph: {len(kg_entities)}")
print(f"- Total relationships in graph: {len(kg_relationships)}")

if result.get('sources'):
    print(f"\nTop Sources:")
    for i, source in enumerate(result['sources'][:5], 1):
        content = source.get('content', '')[:200] if isinstance(source, dict) else str(source)[:200]
        score = source.get('score', 0) if isinstance(source, dict) else 0
        print(f"  {i}. Score: {score:.3f}")
        print(f"     {content}...")


Exploring knowledge graph for threat actors and entities...

Found 2 threat-related entities:
  - APT (type: ORG)
  - APT TA423 (type: ORG)

Searching for APT28/Fancy Bear: Not found in knowledge graph

Query: What are the main cybersecurity threats and threat actors mentioned?

üß† Semantica is embedding: Generating text embedding: observe Credential Guard.</p>                                     antica is processing: Searching vector store... üîÑüîó (0.0s)0s)0.0s)
üß† Semantica is embedding: Generating text embedding: href="https://learn.microsoft.com is a ORG.... üîÑüíæ (0.0s) | üß† Semantica is processing: Searching vector store... üîÑüîó (0.0s) (0.0s)

Embedding generation failed: Text cannot be empty or whitespace-only
Using random fallback embedding


Generated Answer (with Multi-hop Reasoning):
Based on the retrieved context and reasoning paths, the main cybersecurity threats and threat actors mentioned are:

1. **Malware**: Mentioned in the context of "Guide to Malware Incident Prevention and Handling for Desktops" (Context 2) and "Securing Active" (Reasoning Paths 1-2). This suggests that malware is a significant cybersecurity threat.
2. **Cyber attacks**: Implied by the involvement of the Cybersecurity and Infrastructure Security Agency (Context 2) and the National Cybersecurity and Communications Integration Center (Context 4).
3. **Phishing**: Suggested by the mention of "redirection" (Context 4) and the involvement of the Department of the Treasury's Financial Crimes Enforcement Network (Context 4).
4. **Social engineering**: Implied by the mention of Microsoft Office (Context 4) and the involvement of the National Cybersecurity and Communications Integration Center (Context 4).

As for threat actors, the following are mentio

## Reasoning and Threat Analysis


In [28]:
from semantica.reasoning import Reasoner
from semantica.kg import ConnectivityAnalyzer

# Rule-based inference
reasoner = Reasoner()
reasoner.add_rule("IF IOC associated_with Campaign AND Campaign uses TTP THEN IOC linked_to TTP")
reasoner.add_rule("IF Actor uses TTP AND TTP targets Campaign THEN Actor part_of Campaign")

inferred_facts = reasoner.infer_facts(kg)
print(f"Inferred {len(inferred_facts)} facts from rules")

# Analyze connectivity using ConnectivityAnalyzer class
connectivity = ConnectivityAnalyzer()
connectivity_result = connectivity.analyze_connectivity(kg)
print(f"\nGraph connectivity: {connectivity_result.get('num_components', 0)} components")
print(f"Graph density: {connectivity_result.get('density', 0):.3f}")

# Find paths between entity types
kg_entities = kg.get('entities', [])
actors = [e for e in kg_entities if 'actor' in str(e.get('type', '')).lower()][:3]
iocs = [e for e in kg_entities if 'ioc' in str(e.get('type', '')).lower()][:3]

threat_paths = []
for actor in actors:
    for ioc in iocs:
        actor_id = actor.get('id') or actor.get('text')
        ioc_id = ioc.get('id') or ioc.get('text')
        if actor_id and ioc_id:
            path_result = connectivity.calculate_shortest_paths(
                kg,
                source=actor_id,
                target=ioc_id
            )
            if path_result.get('exists'):
                threat_paths.append(path_result)

print(f"\nFound {len(threat_paths)} threat paths between Actor and IOC entities")
if threat_paths:
    for i, path in enumerate(threat_paths[:3], 1):
        print(f"  Path {i}: distance={path.get('distance', -1)}")


üß† Semantica is processing: Searching vector store... üîÑüîó (0.0s) | üß† Semantica is reasoning: Inferring facts üîÑü§î (0.0s)Inferred 0 facts from rules

Graph connectivity: 11 components
Graph density: 0.023

Found 0 threat paths between Actor and IOC entities


## Storing Threat Intelligence Graph (Optional)


In [29]:
from semantica.graph_store import GraphStore

# Optional: Store to persistent graph database
# graph_store = GraphStore(backend="neo4j", uri="bolt://localhost:7687", user="neo4j", password="password")
# graph_store.store_graph(kg)

print("Graph store configured (commented out for demo)")


Graph store configured (commented out for demo)


## Visualizing the Threat Intelligence Knowledge Graph


In [32]:
from semantica.visualization import KGVisualizer
import plotly.io as pio

# Configure Plotly for Colab
pio.renderers.default = "colab"

visualizer = KGVisualizer(layout="force", node_size=20)
fig = visualizer.visualize_network(
    kg,
    output="interactive",
    node_color_by="type"
)

# Display in Colab
if fig:
    fig.show()


üß† Semantica is reasoning: Inferring facts üîÑü§î (0.0s) | üß† Semantica is visualizing: Visualization generated: 414 nodes, 739 edges |‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100.0% ‚úÖüìà

## Exporting Results


In [34]:
from semantica.export import GraphExporter

exporter = GraphExporter()
exporter.export(kg, output_path="threat_intelligence_kg.json", format="json")
exporter.export(kg, output_path="threat_intelligence_kg.graphml", format="graphml")

print("Exported threat intelligence knowledge graph to JSON and GraphML formats")


üß† Semantica is visualizing: Visualization generated: 414 nodes, 739 edges |‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100.0% ‚úÖüìà | üß† Semantica is exporting: Exporting graph to json: threat_intelligence_kg.json üîÑüíæ (0.0s)Exported threat intelligence knowledge graph to JSON and GraphML formats
