[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Hawksight-AI/semantica/blob/main/cookbook/use_cases/cybersecurity/01_Real_Time_Anomaly_Detection.ipynb)

# Real-Time Anomaly Detection - Stream Processing & Temporal KGs

## Overview

This notebook demonstrates **real-time anomaly detection** with Semantica, focusing on **stream ingestion**, **temporal knowledge graphs**, and **pattern detection**. The pipeline ingests security data (a CVE feed plus a simulated log stream), builds a temporal knowledge graph from it, and flags anomalies by matching graph and temporal patterns.

### Key Features

- **Stream Processing**: Ingests log data as a stream (simulated from a file here; a Kafka source is shown commented out)
- **Temporal Knowledge Graphs**: Builds a temporal KG to track events over time
- **Pattern Detection**: Uses graph and temporal patterns to identify anomalies
- **Automated Alerting**: Flags suspicious IPs that are linked to alert relationships
- **Real-Time Processing**: Queries the graph at minute-level granularity to match the pace of incoming logs

### Pipeline Architecture

1. **Phase 0**: Setup & Configuration
2. **Phase 1**: Data Ingestion (CVE RSS Feed & Simulated Log Stream)
3. **Phase 2**: Text Normalization & Chunking
4. **Phase 3**: Entity Extraction (Log, Event, IP, User, Alert, Attack)
5. **Phase 4**: Temporal Knowledge Graph Construction
6. **Phase 5**: Pattern Detection
7. **Phase 6**: Anomaly Detection
8. **Phase 7**: Visualization & Alert Generation

---

## Installation


In [None]:
%pip install -qU semantica networkx matplotlib plotly pandas groq


---

## Phase 0: Setup & Configuration


In [None]:
import os
from semantica.core import Semantica, ConfigManager
from semantica.ingest import StreamIngestor

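# Reuse the key from the environment if present; "your-key" is only a placeholder
# and must be replaced with a real Groq API key before extraction will work
os.environ["GROQ_API_KEY"] = os.getenv("GROQ_API_KEY", "your-key")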

config_dict = {
    "project_name": "Real_Time_Anomaly_Detection",
    "extraction": {"provider": "groq", "model": "llama-3.1-8b-instant"},
    "knowledge_graph": {"backend": "networkx", "temporal": True}
}

config = ConfigManager().load_from_dict(config_dict)
core = Semantica(config=config)
print("Configured for real-time anomaly detection with stream processing focus")
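
Because the setup cell falls back to a placeholder key, a missing key only surfaces later, when the first extraction call fails. A minimal sketch of a fail-fast check follows; `require_api_key` and `PLACEHOLDER` are hypothetical helpers written for this example and are not part of Semantica.

```python
import os
from typing import Mapping, Optional

PLACEHOLDER = "your-key"  # the default value used in the setup cell


def require_api_key(name: str = "GROQ_API_KEY",
                    env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key, or raise if it is missing or still the placeholder."""
    source = os.environ if env is None else env
    value = source.get(name, "").strip()
    if not value or value == PLACEHOLDER:
        raise RuntimeError(f"{name} is not set; export a real key before running the pipeline")
    return value


# Usage with an explicit mapping so the example does not depend on the real environment
demo_key = require_api_key(env={"GROQ_API_KEY": "example-key"})
```

Calling `require_api_key()` with no arguments reads from `os.environ`, so it can run right after the setup cell.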


---

## Phase 1: Real Data Ingestion (CVE RSS Feed & Kafka Stream)

Ingest security data from a CVE feed, then add a sample log file that stands in for a Kafka stream. The Kafka ingestion itself is left commented out.


In [None]:
from semantica.ingest import FeedIngestor, StreamIngestor, FileIngestor
import os

os.makedirs("data", exist_ok=True)

# Option 1: Ingest from the CVE feed (real data source).
# If the request fails, the notebook continues with the sample logs below.
cve_rss_url = "https://cve.mitre.org/data/downloads/allitems.xml"  # CVE feed

documents = []
try:
    feed_ingestor = FeedIngestor()
    feed_documents = feed_ingestor.ingest(cve_rss_url, method="rss")
    print(f"Ingested {len(feed_documents)} documents from CVE RSS feed")
    documents.extend(feed_documents)
except Exception as e:
    print(f"CVE RSS feed ingestion failed: {e}")

# Option 2: Kafka stream (uncomment when a Kafka broker is available)
# stream_ingestor = StreamIngestor()
# stream_documents = stream_ingestor.ingest("kafka://localhost:9092/security-logs", method="kafka")

# Fallback: write sample security logs to a file and ingest it as a simulated stream
security_logs = """
2024-01-01 10:00:00 - Login attempt from IP 192.168.1.100 user admin
2024-01-01 10:01:00 - Failed login from IP 192.168.1.100 user admin
2024-01-01 10:02:00 - Multiple failed logins from IP 192.168.1.100 user admin
2024-01-01 10:03:00 - Unusual activity detected from IP 192.168.1.100
2024-01-01 10:04:00 - Alert: Potential brute force attack from IP 192.168.1.100
2024-01-01 10:05:00 - Login attempt from IP 192.168.1.101 user test
"""

with open("data/security_logs.txt", "w") as f:
    f.write(security_logs)

stream_documents = FileIngestor().ingest("data/security_logs.txt")
documents.extend(stream_documents)
print(f"Added {len(stream_documents)} documents from simulated stream")
print(f"Total documents: {len(documents)}")
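
The sample log lines above share a fixed shape (a timestamp, then ` - `, then a free-text message), so they can be parsed into structured events one line at a time, the way a stream consumer would receive them. The sketch below uses only the standard library. `LogEvent`, `parse_line`, and `stream_events` are hypothetical names for this example, not Semantica APIs, and the regular expressions assume the sample format shown above.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Iterator, Optional

# Assumed line shape: "YYYY-MM-DD HH:MM:SS - message"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - (.+)$")
IP_RE = re.compile(r"\bIP (\d{1,3}(?:\.\d{1,3}){3})\b")
USER_RE = re.compile(r"\buser (\w+)")


@dataclass
class LogEvent:
    timestamp: datetime
    message: str
    ip: Optional[str]
    user: Optional[str]


def parse_line(line: str) -> Optional[LogEvent]:
    """Parse one log line; return None for blank or malformed lines."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    ts_text, message = match.groups()
    ip = IP_RE.search(message)
    user = USER_RE.search(message)
    return LogEvent(
        timestamp=datetime.strptime(ts_text, "%Y-%m-%d %H:%M:%S"),
        message=message,
        ip=ip.group(1) if ip else None,
        user=user.group(1) if user else None,
    )


def stream_events(lines: Iterable[str]) -> Iterator[LogEvent]:
    """Yield parsed events lazily, skipping lines that do not parse."""
    for line in lines:
        event = parse_line(line)
        if event is not None:
            yield event


sample = """
2024-01-01 10:01:00 - Failed login from IP 192.168.1.100 user admin
2024-01-01 10:03:00 - Unusual activity detected from IP 192.168.1.100
"""
events = list(stream_events(sample.splitlines()))
```

Because `stream_events` is a generator, the same function could wrap any line iterator, such as an open file handle or a message consumer, without loading the whole log into memory.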


---

## Phase 2: Text Normalization & Advanced Chunking

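Normalize the ingested text, then split it into chunks. Sentence chunking is used by default; recursive chunking is shown as a commented alternative for hierarchical log structures.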


In [None]:
from semantica.normalize import TextNormalizer
from semantica.split import TextSplitter

# Normalize log data
normalizer = TextNormalizer()
normalized_documents = []
for doc in documents:
    normalized_text = normalizer.normalize(
        doc.content if hasattr(doc, 'content') else str(doc),
        clean_html=True,
        normalize_entities=True,
        remove_extra_whitespace=True
    )
    normalized_documents.append(normalized_text)

print(f"Normalized {len(normalized_documents)} documents")

# Use sentence chunking for log line boundaries (structured logs)
splitter = TextSplitter(method="sentence", chunk_size=500, chunk_overlap=50)
# Alternative: recursive for hierarchical log structures
# splitter = TextSplitter(method="recursive", chunk_size=500, chunk_overlap=50)

chunked_docs = []
for doc_text in normalized_documents:
    chunks = splitter.split(doc_text)
    chunked_docs.extend(
        chunk.content if hasattr(chunk, 'content') else str(chunk)
        for chunk in chunks
    )

print(f"Created {len(chunked_docs)} chunks using sentence chunking")
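
Log lines rarely end in sentence punctuation, so a sentence-based splitter may not always cut exactly at line boundaries. As a point of comparison, here is a minimal sketch of line-based chunking with overlap, written in plain Python. `chunk_log_lines` is a hypothetical helper for illustration and is not a Semantica splitter method; it assumes newlines are still present in the text being split.

```python
from typing import List


def chunk_log_lines(text: str, lines_per_chunk: int = 3, overlap: int = 1) -> List[str]:
    """Group non-empty log lines into chunks that share `overlap` lines."""
    if not 0 <= overlap < lines_per_chunk:
        raise ValueError("overlap must be >= 0 and smaller than lines_per_chunk")
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    step = lines_per_chunk - overlap
    chunks = []
    for start in range(0, len(lines), step):
        window = lines[start:start + lines_per_chunk]
        chunks.append("\n".join(window))
        if start + lines_per_chunk >= len(lines):
            break  # the last window already reached the end of the log
    return chunks


demo = "\n".join(f"line {i}" for i in range(1, 6))
demo_chunks = chunk_log_lines(demo, lines_per_chunk=3, overlap=1)
```

The overlap keeps a little context across chunk boundaries, which can help when a burst of related events (for example, repeated failed logins) would otherwise be split between two chunks.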


---

## Phase 3-4: Temporal Knowledge Graph Construction

Extract entities and build a temporal knowledge graph in one call, then query it with `TemporalGraphQuery` at a specific point in time.


In [None]:
from semantica.kg import TemporalGraphQuery

# Build temporal knowledge graph
result = core.build_knowledge_base(
    sources=chunked_docs,
    custom_entity_types=["Log", "Event", "IP", "User", "Alert", "Attack"],
    graph=True,
    temporal=True
)

kg = result["knowledge_graph"]

# Initialize temporal graph query engine
temporal_query = TemporalGraphQuery(
    enable_temporal_reasoning=True,
    temporal_granularity="minute"  # Fine-grained for real-time logs
)

# Query graph at specific time point
query_results = temporal_query.query_at_time(
    kg,
    query={"type": "Alert"},
    at_time="2024-01-01 10:04:00"
)

# Analyze temporal evolution
evolution = temporal_query.analyze_evolution(kg)

print(f"Built temporal KG with {len(kg.get('entities', []))} entities")
print(f"Temporal queries: {len(query_results)} alerts at query time")
print("Focus: Stream processing, temporal KGs, pattern detection")
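
The idea behind a point-in-time query can be illustrated without the library: keep only what was known at or before the query time. The sketch below assumes that this is roughly what a query such as `query_at_time` does; the actual Semantica semantics and schema may differ. The entity records, the `first_seen` field, and `snapshot_at` are all hypothetical.

```python
from datetime import datetime
from typing import Dict, List, Optional

FMT = "%Y-%m-%d %H:%M:%S"

# Hypothetical entity records; the real KG schema may differ.
entities = [
    {"id": "ip_1", "type": "IP", "first_seen": "2024-01-01 10:00:00"},
    {"id": "alert_1", "type": "Alert", "first_seen": "2024-01-01 10:04:00"},
    {"id": "ip_2", "type": "IP", "first_seen": "2024-01-01 10:05:00"},
]


def snapshot_at(items: List[Dict], at_time: str,
                entity_type: Optional[str] = None) -> List[Dict]:
    """Return entities first seen at or before `at_time`, optionally filtered by type."""
    cutoff = datetime.strptime(at_time, FMT)
    return [
        item for item in items
        if datetime.strptime(item["first_seen"], FMT) <= cutoff
        and (entity_type is None or item["type"] == entity_type)
    ]


alerts = snapshot_at(entities, "2024-01-01 10:04:00", entity_type="Alert")
early = snapshot_at(entities, "2024-01-01 10:03:00")
```

Here `alerts` contains only `alert_1`, and `early` contains only `ip_1`, since the alert and the second IP had not yet appeared at 10:03.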


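---

## Phase 5-6: Pattern & Anomaly Detection

Detect anomaly patterns with graph reasoning, find temporal sequences, and flag IPs that are the target of alert relationships.


In [None]: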
from semantica.reasoning import GraphReasoner

# Detect anomaly patterns (e.g., multiple failed logins)
reasoner = GraphReasoner(kg)
anomaly_patterns = reasoner.find_patterns(pattern_type="anomaly")

# Temporal pattern detection
temporal_patterns = temporal_query.detect_temporal_patterns(kg, pattern_type="sequence")

# Identify suspicious IPs: IP entities that are the target of a relationship
# whose predicate mentions "alert"
suspicious_ips = [
    e for e in kg.get("entities", [])
    if e.get("type") == "IP"
    and any(
        "alert" in str(r.get("predicate", "")).lower()
        for r in kg.get("relationships", [])
        if r.get("target") == e.get("id")
    )
]

print(f"Pattern detection: {len(anomaly_patterns)} anomaly patterns found")
print(f"Temporal patterns: {len(temporal_patterns)} temporal patterns detected")
print(f"Anomaly detection: {len(suspicious_ips)} suspicious IPs identified")
print("This cookbook emphasizes stream processing, temporal KGs, and pattern detection")
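
The "multiple failed logins" pattern mentioned in the cell above can also be expressed as a sliding-window rule over the raw event stream. This is a minimal standard-library sketch, not the algorithm `GraphReasoner` uses; `detect_bruteforce`, the threshold, and the window size are assumptions chosen to fit the sample logs.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta
from typing import Deque, Dict, Iterable, List, Tuple

FMT = "%Y-%m-%d %H:%M:%S"


def detect_bruteforce(
    events: Iterable[Tuple[str, str, str]],
    threshold: int = 2,
    window: timedelta = timedelta(minutes=5),
) -> List[Dict]:
    """Flag an IP once it has `threshold` failed logins inside `window`.

    `events` yields (timestamp, ip, message) tuples in time order.
    """
    recent: Dict[str, Deque[datetime]] = defaultdict(deque)
    flagged = set()
    alerts = []
    for ts_text, ip, message in events:
        if "failed login" not in message.lower():
            continue
        ts = datetime.strptime(ts_text, FMT)
        hits = recent[ip]
        hits.append(ts)
        # Drop failures that have slid out of the time window.
        while hits and ts - hits[0] > window:
            hits.popleft()
        if len(hits) >= threshold and ip not in flagged:
            flagged.add(ip)
            alerts.append({"ip": ip, "count": len(hits), "at": ts_text})
    return alerts


log_events = [
    ("2024-01-01 10:00:00", "192.168.1.100", "Login attempt"),
    ("2024-01-01 10:01:00", "192.168.1.100", "Failed login"),
    ("2024-01-01 10:02:00", "192.168.1.100", "Multiple failed logins"),
    ("2024-01-01 10:05:00", "192.168.1.101", "Login attempt"),
]
bruteforce_alerts = detect_bruteforce(log_events)
```

Each IP keeps its own deque of recent failure times, so the check runs in a single pass and needs memory proportional only to the failures inside the current window.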


---

## Phase 7: Visualization & Alert Generation

Render the temporal knowledge graph to an HTML file and print a summary of the pipeline run, including the detected anomaly patterns and suspicious IPs.


In [None]:
from semantica.visualization import KGVisualizer

visualizer = KGVisualizer()
visualizer.visualize(kg, output_path="anomaly_detection_kg.html", layout="temporal")

print("Real-time anomaly detection analysis complete")
print("\n=== Pipeline Summary ===")
print(f"✓ Ingested {len(documents)} documents from CVE RSS feed and stream")
print(f"✓ Normalized {len(normalized_documents)} documents")
print(f"✓ Created {len(chunked_docs)} chunks using sentence chunking")
print(f"✓ Built temporal KG with {len(kg.get('entities', []))} entities")
print(f"✓ Detected {len(anomaly_patterns)} anomaly patterns and {len(suspicious_ips)} suspicious IPs")
print("✓ Emphasizes: Stream processing, temporal KGs, pattern detection")
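
The summary above reports counts only. To hand the findings to another system, the suspicious IPs could be turned into structured alert records. The following is a minimal sketch under stated assumptions: `build_alerts`, the alert fields, and the severity rule are hypothetical, and the entity keys (`id`, `text`, `name`) are guesses at the KG schema rather than a documented format.

```python
import json
from datetime import datetime, timezone
from typing import Dict, List


def build_alerts(suspicious: List[Dict], pattern_count: int) -> List[Dict]:
    """Turn suspicious-IP entities into alert records ready for serialization."""
    # Hypothetical severity rule: escalate when any anomaly pattern was also found.
    severity = "high" if pattern_count > 0 else "medium"
    generated_at = datetime.now(timezone.utc).isoformat()
    return [
        {
            "alert_type": "suspicious_ip",
            "entity_id": entity.get("id"),
            "ip": entity.get("text") or entity.get("name"),
            "severity": severity,
            "generated_at": generated_at,
        }
        for entity in suspicious
    ]


example_ips = [{"id": "ip_1", "type": "IP", "text": "192.168.1.100"}]
example_alerts = build_alerts(example_ips, pattern_count=1)
alerts_json = json.dumps(example_alerts, indent=2)
```

In the notebook, the same function could be called as `build_alerts(suspicious_ips, len(anomaly_patterns))`, with the resulting JSON written to a file or posted to an alerting endpoint.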
