[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Hawksight-AI/semantica/blob/main/cookbook/use_cases/cybersecurity/02_Threat_Intelligence_Hybrid_RAG.ipynb)

# Threat Intelligence Hybrid RAG - Vector + Graph Retrieval

## Overview

This notebook demonstrates **threat intelligence hybrid RAG** with Semantica, focusing on **hybrid search**, **vector + graph retrieval**, and **context-aware queries**. The pipeline combines vector search with a temporal knowledge graph, so threat intelligence queries can draw on both semantic similarity and explicit relationships between entities.

### Key Features

- **Hybrid RAG**: Combines vector similarity search with knowledge graph traversal
- **Vector + Graph Retrieval**: Uses both vector embeddings and graph relationships
- **Context-Aware Queries**: Enriches retrieved results with related entities and relationships from the graph
- **Temporal Knowledge Graphs**: Builds temporal KGs for threat timeline analysis
- **Multi-hop Reasoning**: Follows relationships across the graph for deeper context
- **Comprehensive Data Sources**: Multiple threat intelligence feeds, APIs, and databases
- **Modular Architecture**: Uses Semantica modules directly, without the core orchestrator

### Learning Objectives

- Ingest threat intelligence data from multiple sources
- Extract threat entities (IOCs, Campaigns, Threats, Actors, TTPs, Malware)
- Build temporal threat intelligence knowledge graphs
- Generate embeddings and populate vector stores
- Perform hybrid vector + graph queries
- Analyze threat networks using graph analytics
- Store and query threat intelligence using vector stores and graph stores

### Pipeline Flow

```mermaid
graph TD
    A[Data Ingestion] --> B[Document Parsing]
    B --> C[Text Processing]
    C --> D[Entity Extraction]
    D --> E[Relationship Extraction]
    E --> F[Deduplication]
    F --> G[Conflict Detection]
    G --> H[Temporal Knowledge Graph]
    H --> I[Embeddings]
    I --> J[Vector Store]
    H --> K[Temporal Queries]
    K --> L[Graph Analytics]
    L --> M[GraphRAG Queries]
    J --> M
    H --> N[Reasoning & Threat Analysis]
    M --> O[Visualization]
    N --> O
    H --> P[Graph Store]
    P --> O
    O --> Q[Export]
```

### Data Sources

#### Threat Intelligence RSS Feeds
- **US-CERT Alerts**: https://www.us-cert.gov/ncas/alerts.xml
- **SANS Internet Storm Center**: https://isc.sans.edu/rssfeed.xml
- **Krebs on Security**: https://krebsonsecurity.com/feed/
- **ThreatPost**: https://threatpost.com/feed/
- **BleepingComputer**: https://www.bleepingcomputer.com/feed/
- **SecurityWeek**: https://www.securityweek.com/rss

#### Threat Intelligence APIs
- **MITRE ATT&CK**: https://attack.mitre.org/
- **VirusTotal API**: https://www.virustotal.com/gui/join-us
- **AbuseIPDB API**: https://www.abuseipdb.com/api
- **AlienVault OTX**: https://otx.alienvault.com/api
- **Shodan API**: https://www.shodan.io/
- **ThreatCrowd API**: https://www.threatcrowd.org/

#### IOC Sources
- **MISP (Malware Information Sharing Platform)**: https://www.misp-project.org/
- **OpenCTI**: https://www.opencti.io/
- **ThreatFox**: https://threatfox.abuse.ch/
- **URLhaus**: https://urlhaus.abuse.ch/

#### Database Links
- **MITRE ATT&CK**: https://attack.mitre.org/
- **CVE Database**: https://cve.mitre.org/
- **NVD (National Vulnerability Database)**: https://nvd.nist.gov/
- **Exploit-DB**: https://www.exploit-db.com/
- **CAPEC (Common Attack Pattern)**: https://capec.mitre.org/

---

## Installation


In [None]:
%pip install -qU semantica networkx matplotlib plotly pandas faiss-cpu beautifulsoup4 groq sentence-transformers scikit-learn


## Configuration & Setup


In [None]:
import os

os.environ["GROQ_API_KEY"] = os.getenv("GROQ_API_KEY", "your-key-here")

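# Configuration constants
EMBEDDING_MODEL = "all-MiniLM-L6-v2"
EMBEDDING_DIMENSION = 384  # output size of all-MiniLM-L6-v2; must match the vector store dimension
CHUNK_SIZE = 1000          # passed to TextSplitter as chunk_size
CHUNK_OVERLAP = 200        # passed to TextSplitter as chunk_overlap
TEMPORAL_GRANULARITY = "day"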


## Ingesting Threat Intelligence Data


In [None]:
from semantica.ingest import FeedIngestor, WebIngestor, FileIngestor
import os
from contextlib import redirect_stderr
from io import StringIO

os.makedirs("data", exist_ok=True)

feed_sources = [
    # Threat Intelligence RSS Feeds
    ("US-CERT Alerts", "https://www.us-cert.gov/ncas/alerts.xml"),
    ("SANS ISC", "https://isc.sans.edu/rssfeed.xml"),
    ("Krebs on Security", "https://krebsonsecurity.com/feed/"),
    ("ThreatPost", "https://threatpost.com/feed/"),
    ("BleepingComputer", "https://www.bleepingcomputer.com/feed/"),
    ("SecurityWeek", "https://www.securityweek.com/rss"),
]

feed_ingestor = FeedIngestor()
all_documents = []

for feed_name, feed_url in feed_sources:
    try:
        with redirect_stderr(StringIO()):
            feed_data = feed_ingestor.ingest_feed(feed_url, validate=False)
        for item in feed_data.items:
            if not item.content:
                item.content = item.description or item.title or ""
            if item.content:
                # Also covers items whose metadata attribute exists but is None
                if not getattr(item, 'metadata', None):
                    item.metadata = {}
                item.metadata['source'] = feed_name
                all_documents.append(item)
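    except Exception as exc:
        # Skip feeds that are unreachable or malformed, but report which one failed
        print(f"Skipping feed '{feed_name}': {exc}")
        continue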

if not all_documents:
    threat_data = """
    IOC: IP address 192.168.1.50 associated with APT28 campaign.
    Threat actor APT28 uses TTP: Spear phishing and credential harvesting.
    Campaign Operation GhostShell targets financial institutions.
    Malware sample hash: abc123def456 linked to APT28 infrastructure.
    IOC: Domain example-malicious.com linked to APT29 operations.
    Threat actor APT29 uses TTP: Watering hole attacks and lateral movement.
    Campaign Operation SolarWinds targets technology companies.
    IOC: File hash xyz789ghi012 associated with ransomware group.
    """
    with open("data/threat_intel.txt", "w") as f:
        f.write(threat_data)
    file_ingestor = FileIngestor()
    all_documents = file_ingestor.ingest("data/threat_intel.txt")

documents = all_documents
print(f"Ingested {len(documents)} documents")
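
The per-item fallback in the loop above (use `content`, else `description`, else `title`, and drop items with no text) can be sketched with the standard library alone. This is an illustration, not Semantica's implementation: the `FeedItem` dataclass, the `parse_rss` helper, and the sample XML are hypothetical stand-ins, and `FeedIngestor` may parse feeds differently.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class FeedItem:
    """Hypothetical stand-in for a feed item; not Semantica's actual class."""
    title: str = ""
    description: str = ""
    content: str = ""
    metadata: dict = field(default_factory=dict)


def parse_rss(xml_text: str, source: str) -> list[FeedItem]:
    """Parse RSS 2.0 <item> elements and apply the content fallback."""
    items = []
    for node in ET.fromstring(xml_text).iter("item"):
        item = FeedItem(
            title=(node.findtext("title") or "").strip(),
            description=(node.findtext("description") or "").strip(),
        )
        # Same fallback order as the ingestion loop: content, description, title
        item.content = item.content or item.description or item.title
        if item.content:  # items with no usable text are dropped
            item.metadata["source"] = source
            items.append(item)
    return items


# Made-up feed with one full item, one title-only item, and one empty item
SAMPLE_RSS = """<rss version="2.0"><channel>
  <item><title>APT28 phishing wave</title><description>Credential harvesting observed.</description></item>
  <item><title>Title only advisory</title></item>
  <item><title></title><description></description></item>
</channel></rss>"""

sample_docs = parse_rss(SAMPLE_RSS, source="Sample Feed")
```

Here the empty third item is discarded, and the title-only item keeps its title as content, which mirrors why the notebook checks `item.content` twice.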


## Parsing Threat Intelligence Documents


In [None]:
from semantica.parse import DocumentParser

parser = DocumentParser()

parsed_documents = []
for doc in documents:
    try:
        parsed = parser.parse(
            doc.content if hasattr(doc, 'content') else str(doc),
            content_type="text"
        )
        parsed_documents.append(parsed)
    except Exception:
        parsed_documents.append(doc)

documents = parsed_documents


## Normalizing and Chunking Threat Intelligence Data


In [None]:
from semantica.normalize import TextNormalizer
from semantica.split import TextSplitter

normalizer = TextNormalizer()
# Use entity-aware chunking to preserve threat entity boundaries for GraphRAG
splitter = TextSplitter(
    method="entity_aware",
    ner_method="spacy",
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP
)

normalized_documents = []
for doc in documents:
    normalized_text = normalizer.normalize(
        doc.content if hasattr(doc, 'content') else str(doc),
        clean_html=True,
        normalize_entities=True,
        remove_extra_whitespace=True,
        lowercase=False
    )
    normalized_documents.append(normalized_text)

chunked_documents = []
for doc_text in normalized_documents:
    try:
        with redirect_stderr(StringIO()):
            chunks = splitter.split(doc_text)
        chunked_documents.extend(chunks)
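    except Exception:
        # Entity-aware splitting failed (for example, no spaCy model available);
        # fall back to recursive splitting with the same size and overlap
        simple_splitter = TextSplitter(method="recursive", chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)
        chunks = simple_splitter.split(doc_text)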
        chunked_documents.extend(chunks)
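
To see how `CHUNK_SIZE` and `CHUNK_OVERLAP` interact, here is a minimal sketch of a fixed-size sliding window measured in characters. It is deliberately simpler than both the `entity_aware` and `recursive` methods used above, and `sliding_window_chunks` is a hypothetical helper, not part of Semantica. Whether `TextSplitter` counts characters or tokens is not stated here, so treat the unit as an assumption.

```python
def sliding_window_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    stride = chunk_size - chunk_overlap  # each window starts this far after the previous one
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += stride
    return chunks


# 2500 characters of repeating digits so that overlaps can be compared
sample_text = "".join(str(i % 10) for i in range(2500))
sample_chunks = sliding_window_chunks(sample_text)
```

With a size of 1000 and an overlap of 200, each window starts 800 characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next. That repetition is what keeps an entity mention near a boundary visible in at least one chunk with surrounding context.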


## Extracting Threat Entities


In [None]:
from semantica.semantic_extract import NERExtractor

entity_extractor = NERExtractor(
    method="llm",
    provider="groq",
    llm_model="llama-3.1-8b-instant",
    temperature=0.0
)

all_entities = []
for chunk in chunked_documents:
    chunk_text = chunk.text if hasattr(chunk, 'text') else str(chunk)
    try:
        entities = entity_extractor.extract_entities(
            chunk_text,
            entity_types=["IOC", "Campaign", "Threat", "Actor", "TTP", "Malware"]
        )
        all_entities.extend(entities)
    except Exception:
        continue

# Group entities by label; the case-insensitive substring match also catches
# label variants such as "ioc" or "Threat Actor"
iocs = [e for e in all_entities if "ioc" in e.label.lower()]
actors = [e for e in all_entities if "actor" in e.label.lower()]
campaigns = [e for e in all_entities if "campaign" in e.label.lower()]
ttps = [e for e in all_entities if "ttp" in e.label.lower()]

print(f"Extracted {len(iocs)} IOCs, {len(actors)} actors, {len(campaigns)} campaigns, {len(ttps)} TTPs")
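
LLM-based extraction can miss or mislabel indicators, so a deterministic pass over the same text is a useful cross-check for well-structured IOC types. The sketch below is an assumption-laden illustration rather than part of the Semantica pipeline: `extract_network_iocs` and both regular expressions are hypothetical, and they only cover IPv4 addresses and domain names.

```python
import ipaddress
import re

# Candidate patterns; the IPv4 pattern over-matches and is validated afterwards
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
DOMAIN_RE = re.compile(
    r"\b(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,}\b",
    re.IGNORECASE,
)


def extract_network_iocs(text: str) -> dict[str, list[str]]:
    """Return IPv4 addresses and domain names found in text, in order of appearance."""
    ips = []
    for candidate in IPV4_RE.findall(text):
        try:
            ipaddress.IPv4Address(candidate)  # rejects octets above 255
        except ValueError:
            continue
        ips.append(candidate)
    return {"ipv4": ips, "domain": DOMAIN_RE.findall(text)}


# Two sentences taken from the notebook's fallback sample data
SAMPLE_INTEL = (
    "IOC: IP address 192.168.1.50 associated with APT28 campaign. "
    "IOC: Domain example-malicious.com linked to APT29 operations."
)
sample_iocs = extract_network_iocs(SAMPLE_INTEL)
```

Comparing `sample_iocs` against the `iocs` list from the LLM extractor shows which indicators only one of the two methods found. Note that the IPv4 pattern would also match dotted version strings such as `1.2.3.4`, so results still need review.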


## Extracting Threat Relationships


In [None]:
from semantica.semantic_extract import RelationExtractor

relation_extractor = RelationExtractor(
    method="llm",
    provider="groq",
    llm_model="llama-3.1-8b-instant",
    temperature=0.0
)

all_relationships = []
for chunk in chunked_documents:
    chunk_text = chunk.text if hasattr(chunk, 'text') else str(chunk)
    try:
        relationships = relation_extractor.extract_relations(
            chunk_text,
            entities=all_entities,
            relation_types=["associated_with", "uses", "targets", "linked_to", "part_of", "employs"]
        )
        all_relationships.extend(relationships)
    except Exception:
        continue

print(f"Extracted {len(all_relationships)} relationships")


## Resolving Duplicate IOCs and Actors


In [None]:
from semantica.deduplication import DuplicateDetector

duplicate_detector = DuplicateDetector(
    similarity_threshold=0.85,
    method="semantic"
)

deduplicated_entities = duplicate_detector.detect_duplicates(all_entities)
merged_entities = duplicate_detector.merge_duplicates(deduplicated_entities)

print(f"Deduplicated {len(all_entities)} entities to {len(merged_entities)} unique entities")
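
The effect of `similarity_threshold=0.85` is easier to judge with a concrete scoring function. The sketch below swaps in character-level similarity from `difflib` for the semantic similarity used above, so its scores will differ from Semantica's. `group_near_duplicates` is a hypothetical helper that greedily assigns each name to the first group whose representative is similar enough.

```python
from difflib import SequenceMatcher


def group_near_duplicates(names: list[str], threshold: float = 0.85) -> list[list[str]]:
    """Greedily group names whose similarity to a group's first member meets the threshold."""
    groups: list[list[str]] = []
    for name in names:
        for group in groups:
            representative = group[0]
            score = SequenceMatcher(None, name.lower(), representative.lower()).ratio()
            if score >= threshold:
                group.append(name)
                break
        else:
            # No existing group was similar enough, so this name starts a new one
            groups.append([name])
    return groups


sample_groups = group_near_duplicates(
    ["APT28", "apt28", "APT 28", "APT29", "Operation GhostShell"]
)
```

In this example "APT 28" scores about 0.91 against "APT28" and is merged, while "APT29" scores 0.80 and stays separate. The narrow margin illustrates why the threshold matters for actor names that differ by a single character.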


## Detecting Threat Intelligence Conflicts


In [None]:
from semantica.conflicts import ConflictDetector

conflict_detector = ConflictDetector()

conflicts = conflict_detector.detect_conflicts(merged_entities, all_relationships)

if conflicts:
    resolved = conflict_detector.resolve_conflicts(conflicts, strategy="highest_confidence")
    print(f"Detected {len(conflicts)} conflicts, resolved {len(resolved)}")
else:
    print("No conflicts detected")
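
As a rough illustration of what a `highest_confidence` strategy means, the sketch below keeps the most confident claim for each entity attribute. The claim dictionaries, the helper names, and the sample values are all hypothetical; `ConflictDetector` may represent and resolve conflicts differently.

```python
from collections import defaultdict


def find_conflicts(claims: list[dict]) -> list[tuple[str, str]]:
    """Return (entity, attribute) keys that have more than one distinct value."""
    values = defaultdict(set)
    for claim in claims:
        values[(claim["entity"], claim["attribute"])].add(claim["value"])
    return sorted(key for key, seen in values.items() if len(seen) > 1)


def resolve_by_highest_confidence(claims: list[dict]) -> dict[tuple[str, str], dict]:
    """For each (entity, attribute) key, keep the claim with the highest confidence."""
    best: dict[tuple[str, str], dict] = {}
    for claim in claims:
        key = (claim["entity"], claim["attribute"])
        if key not in best or claim["confidence"] > best[key]["confidence"]:
            best[key] = claim
    return best


# Made-up claims: two sources disagree about the sector a campaign targets
sample_claims = [
    {"entity": "Operation GhostShell", "attribute": "target_sector", "value": "financial", "confidence": 0.9},
    {"entity": "Operation GhostShell", "attribute": "target_sector", "value": "technology", "confidence": 0.6},
    {"entity": "APT28", "attribute": "ttp", "value": "Spear phishing", "confidence": 0.8},
]
sample_conflicts = find_conflicts(sample_claims)
sample_resolved = resolve_by_highest_confidence(sample_claims)
```

Picking the highest confidence is simple and deterministic, but it discards the losing claim entirely, so it may be worth logging the conflicts before resolving them.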


## Building Temporal Threat Intelligence Knowledge Graph


In [None]:
from semantica.kg import GraphBuilder

graph_builder = GraphBuilder(
    merge_entities=True,
    resolve_conflicts=True,
    entity_resolution_strategy="fuzzy",
    enable_temporal=True,
    temporal_granularity=TEMPORAL_GRANULARITY
)

kg_sources = [{
    "entities": [{"text": e.text, "type": e.label, "confidence": e.confidence} for e in merged_entities],
    "relationships": [{"source": r.source, "target": r.target, "type": r.label, "confidence": r.confidence} for r in all_relationships]
}]

kg = graph_builder.build(kg_sources)

entities_count = len(kg.get('entities', []))
relationships_count = len(kg.get('relationships', []))
print(f"Graph: {entities_count} entities, {relationships_count} relationships")


## Generating Embeddings for IOCs and Threats


In [None]:
from semantica.embeddings import EmbeddingGenerator

embedding_gen = EmbeddingGenerator(
    provider="sentence_transformers",
    model=EMBEDDING_MODEL
)

ioc_texts = [f"{ioc.text} {getattr(ioc, 'description', '')}" for ioc in iocs]
ioc_embeddings = embedding_gen.generate_embeddings(ioc_texts)

actor_texts = [f"{actor.text} {getattr(actor, 'description', '')}" for actor in actors]
actor_embeddings = embedding_gen.generate_embeddings(actor_texts)

print(f"Generated {len(ioc_embeddings)} IOC embeddings and {len(actor_embeddings)} actor embeddings")


## Populating Vector Store


In [None]:
from semantica.vector_store import VectorStore

vector_store = VectorStore(backend="faiss", dimension=EMBEDDING_DIMENSION)

ioc_ids = vector_store.store_vectors(
    vectors=ioc_embeddings,
    metadata=[{"type": "ioc", "name": ioc.text, "label": ioc.label} for ioc in iocs]
)

actor_ids = vector_store.store_vectors(
    vectors=actor_embeddings,
    metadata=[{"type": "actor", "name": actor.text, "label": actor.label} for actor in actors]
)

print(f"Stored {len(ioc_ids)} IOC vectors and {len(actor_ids)} actor vectors")
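
Conceptually, a similarity query against the store ranks the stored vectors by closeness to a query embedding. The sketch below does this by brute force with cosine similarity in plain Python. It is not how FAISS works internally, `cosine` and `top_k` are hypothetical helpers, and the three-dimensional vectors are toy values standing in for real 384-dimensional embeddings.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], vectors: list[list[float]], metadata: list[dict], k: int = 3):
    """Return the k (score, metadata) pairs with the highest cosine similarity."""
    scored = [(cosine(query, vec), meta) for vec, meta in zip(vectors, metadata)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Toy vectors paired with metadata shaped like the records stored above
toy_vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
toy_metadata = [
    {"type": "ioc", "name": "192.168.1.50"},
    {"type": "ioc", "name": "example-malicious.com"},
    {"type": "actor", "name": "APT29"},
]
toy_hits = top_k([1.0, 0.0, 0.0], toy_vectors, toy_metadata, k=2)
```

Because the metadata list is stored in the same order as the vectors, each hit carries the `type` and `name` needed to look the entity up in the knowledge graph afterwards.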


## Temporal Graph Queries


In [None]:
from semantica.kg import TemporalGraphQuery

temporal_query = TemporalGraphQuery(
    enable_temporal_reasoning=True,
    temporal_granularity=TEMPORAL_GRANULARITY
)

query_results = temporal_query.query_at_time(
    kg,
    query={"type": "Campaign"},
    at_time="2024-01-01"
)

evolution = temporal_query.analyze_evolution(kg)
temporal_patterns = temporal_query.detect_temporal_patterns(kg, pattern_type="sequence")

print(f"Temporal queries: {len(query_results)} campaigns at query time")
print(f"Temporal patterns detected: {len(temporal_patterns)}")
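
A point-in-time query such as `query_at_time(..., at_time="2024-01-01")` can be thought of as filtering relationships by a validity interval. The sketch below shows that idea with the standard library. The `TemporalEdge` dataclass, the `edges_at` helper, and every date are hypothetical, and Semantica's temporal model may store validity differently.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class TemporalEdge:
    """Hypothetical relationship with a validity interval."""
    source: str
    relation: str
    target: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the edge is still valid


def edges_at(edges: list[TemporalEdge], at_time: str) -> list[TemporalEdge]:
    """Return edges whose validity interval contains the ISO date at_time."""
    when = date.fromisoformat(at_time)
    return [
        e for e in edges
        if e.valid_from <= when and (e.valid_to is None or when <= e.valid_to)
    ]


# Made-up validity dates attached to relationships from the sample data
sample_edges = [
    TemporalEdge("APT28", "uses", "Spear phishing", date(2023, 6, 1)),
    TemporalEdge("192.168.1.50", "associated_with", "APT28", date(2023, 9, 1), date(2023, 12, 15)),
    TemporalEdge("Operation GhostShell", "targets", "financial institutions", date(2024, 2, 1)),
]
active_edges = edges_at(sample_edges, "2024-01-01")
```

On 2024-01-01 only the first edge is active: the IOC association has already expired and the campaign has not started yet. With day-level granularity, comparing `date` objects is sufficient.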


## Analyzing Threat Network Structure


In [None]:
from semantica.kg import GraphAnalyzer, CentralityCalculator, CommunityDetector

graph_analyzer = GraphAnalyzer()
centrality_calc = CentralityCalculator()
community_detector = CommunityDetector()

analysis = graph_analyzer.analyze_graph(kg)

degree_centrality = centrality_calc.calculate_degree_centrality(kg)
betweenness_centrality = centrality_calc.calculate_betweenness_centrality(kg)

communities = community_detector.detect_communities(kg, method="louvain")
connectivity = graph_analyzer.analyze_connectivity(kg)

print("Graph analytics:")
print(f"  - Communities: {len(communities)}")
print(f"  - Connected components: {len(connectivity.get('components', []))}")
print(f"  - Graph density: {analysis.get('density', 0):.3f}")
print(f"  - Central nodes (degree): {len(degree_centrality)}")
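
For reference, degree centrality and density have simple closed forms on an undirected graph without self-loops: a node's degree centrality is its degree divided by n - 1, and density is 2m / (n(n - 1)) for n nodes and m edges. The sketch below computes both in plain Python on a toy edge list. The helper names are hypothetical, and whether Semantica's analyzers use exactly these normalizations is an assumption.

```python
from collections import defaultdict


def degree_centrality(edges: list[tuple[str, str]]) -> dict[str, float]:
    """Degree of each node divided by (n - 1), for an undirected graph without self-loops."""
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    n = len(neighbours)
    if n < 2:
        return {node: 0.0 for node in neighbours}
    return {node: len(adjacent) / (n - 1) for node, adjacent in neighbours.items()}


def density(edges: list[tuple[str, str]]) -> float:
    """Fraction of possible undirected edges that are present: 2m / (n * (n - 1))."""
    nodes = {node for edge in edges for node in edge}
    unique_edges = {frozenset(edge) for edge in edges}  # ignores direction and duplicates
    n = len(nodes)
    return 0.0 if n < 2 else 2 * len(unique_edges) / (n * (n - 1))


# Toy threat graph: six nodes, four edges
toy_edges = [
    ("APT28", "192.168.1.50"),
    ("APT28", "Spear phishing"),
    ("APT28", "Operation GhostShell"),
    ("APT29", "example-malicious.com"),
]
toy_centrality = degree_centrality(toy_edges)
toy_density = density(toy_edges)
```

In this toy graph APT28 touches three of the five other nodes, so it has the highest degree centrality. The low density reflects that the APT28 and APT29 clusters are not connected to each other.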


## GraphRAG: Hybrid Vector + Graph Queries


In [None]:
from semantica.context import AgentContext

context = AgentContext(vector_store=vector_store, knowledge_graph=kg)

query = "What threats are associated with APT28?"
results = context.retrieve(
    query,
    max_results=10,
    use_graph=True,
    expand_graph=True,
    include_entities=True,
    include_relationships=True
)

print(f"GraphRAG query: '{query}'")
print(f"\nRetrieved {len(results)} results:\n")
for i, result in enumerate(results[:5], 1):
    print(f"{i}. Score: {result.get('score', 0):.3f}")
    print(f"   Content: {result.get('content', '')[:200]}...")
    if result.get('related_entities'):
        print(f"   Related entities: {len(result['related_entities'])}")
    print()
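
The idea behind combining the two retrieval modes can be sketched independently of Semantica: vector search supplies seed entities with similarity scores, and graph expansion adds their neighbours with a discounted score. Everything below is an assumption made for illustration, including the `hybrid_retrieve` helper, the per-hop `decay` of 0.5, and the toy data. `AgentContext.retrieve` may score and merge results quite differently.

```python
from collections import deque


def hybrid_retrieve(vector_hits, graph, max_hops=1, decay=0.5):
    """Expand vector hits along graph edges, multiplying the score by decay per hop.

    vector_hits maps entity name -> similarity score.
    graph maps entity name -> list of (relation, neighbour) pairs.
    Returns (name, score, reason) tuples sorted by descending score.
    """
    best = {name: (score, "vector match") for name, score in vector_hits.items()}
    queue = deque((name, score, 0) for name, score in vector_hits.items())
    while queue:
        node, score, hops = queue.popleft()
        if hops >= max_hops:
            continue
        for relation, neighbour in graph.get(node, []):
            candidate = score * decay
            # Keep only the best-scoring route to each neighbour
            if candidate > best.get(neighbour, (0.0, ""))[0]:
                best[neighbour] = (candidate, f"{node} -[{relation}]-> {neighbour}")
                queue.append((neighbour, candidate, hops + 1))
    ranked = sorted(best.items(), key=lambda item: item[1][0], reverse=True)
    return [(name, score, reason) for name, (score, reason) in ranked]


# Toy inputs: one vector hit and a small adjacency list
toy_seed_hits = {"APT28": 0.92}
toy_graph = {
    "APT28": [("uses", "Spear phishing"), ("associated_with", "192.168.1.50")],
    "192.168.1.50": [("linked_to", "Operation GhostShell")],
}
toy_results = hybrid_retrieve(toy_seed_hits, toy_graph, max_hops=2)
```

Raising `max_hops` from 1 to 2 brings in "Operation GhostShell", which has no direct vector match and is reachable only through the IOC. The `reason` field records the edge that introduced each result, which is the kind of context a pure vector search cannot provide.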


## Reasoning and Threat Analysis


In [None]:
from semantica.reasoning import Reasoner

reasoner = Reasoner()

reasoner.add_rule("IF IOC associated_with Campaign AND Campaign uses TTP THEN IOC linked_to TTP")
reasoner.add_rule("IF Actor uses TTP AND TTP targets Campaign THEN Actor part_of Campaign")

inferred_facts = reasoner.infer_facts(kg)

threat_paths = reasoner.find_paths(
    kg,
    source_type="Actor",
    target_type="IOC",
    max_hops=3
)

print(f"Inferred {len(inferred_facts)} facts")
print(f"Found {len(threat_paths)} threat paths")
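
The first rule above has the shape "IF a r1 b AND b r2 c THEN a r3 c", which can be applied to a set of triples in a single pass. The sketch below is a simplified illustration: `apply_chain_rule` is a hypothetical helper, it ignores the entity types named in the rule (IOC, Campaign, TTP), and it does not repeat until a fixed point the way a full forward-chaining reasoner would.

```python
Triple = tuple[str, str, str]


def apply_chain_rule(triples: set[Triple], first: str, second: str, conclusion: str) -> set[Triple]:
    """IF a -first-> b AND b -second-> c THEN infer a -conclusion-> c (single pass)."""
    # Index the objects of every `second` relation by their subject
    objects_by_subject: dict[str, list[str]] = {}
    for subject, predicate, obj in triples:
        if predicate == second:
            objects_by_subject.setdefault(subject, []).append(obj)

    inferred = set()
    for subject, predicate, obj in triples:
        if predicate == first:
            for target in objects_by_subject.get(obj, []):
                fact = (subject, conclusion, target)
                if fact not in triples:  # report only facts that are new
                    inferred.add(fact)
    return inferred


# Toy triples based on the notebook's sample entities
toy_triples = {
    ("192.168.1.50", "associated_with", "Operation GhostShell"),
    ("Operation GhostShell", "uses", "Spear phishing"),
    ("APT28", "uses", "Spear phishing"),
}
toy_inferred = apply_chain_rule(toy_triples, "associated_with", "uses", "linked_to")
```

The only new fact is that the IOC is linked to the spear phishing TTP. The APT28 triple does not fire the rule because nothing is `associated_with` APT28 in this toy set.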


## Storing Threat Intelligence Graph (Optional)


In [None]:
from semantica.graph_store import GraphStore

# Optional: Store to persistent graph database
# graph_store = GraphStore(backend="neo4j", uri="bolt://localhost:7687", user="neo4j", password="password")
# graph_store.store_graph(kg)

print("Graph store step skipped (uncomment the lines above to persist the graph to Neo4j)")


## Visualizing the Threat Intelligence Knowledge Graph


In [None]:
from semantica.visualization import KGVisualizer

visualizer = KGVisualizer()
visualizer.visualize(
    kg,
    output_path="threat_intelligence_kg.html",
    layout="spring",
    node_size=20
)

print("Visualization saved to threat_intelligence_kg.html")


## Exporting Results


In [None]:
from semantica.export import GraphExporter

exporter = GraphExporter()
exporter.export(kg, output_path="threat_intelligence_kg.json", format="json")
exporter.export(kg, output_path="threat_intelligence_kg.graphml", format="graphml")
exporter.export(kg, output_path="threat_intelligence_iocs.csv", format="csv")

print("Exported threat intelligence knowledge graph to JSON, GraphML, and CSV formats")
