In [1]:
!pip install pandas networkx






# Graph Type and Purpose

You are constructing a **heterogeneous directed multigraph** using `NetworkX`’s `MultiDiGraph()` to model complex cyber network interactions. This design is particularly effective for advanced cybersecurity applications such as:

- **Graph-based threat detection**
- **Anomaly identification in multi-modal behaviors**
- **Learning embeddings for heterogeneous entities**

### Key Characteristics

- **Heterogeneous nodes**  
  Represents diverse entities: IP addresses, domain names, HTTP URIs, SSL certificate subjects/issuers, protocol violation types, etc.

- **Multi-view relationships**  
  Multiple directed edge types between the same pair of nodes allow different interaction views (e.g., flows, DNS queries, HTTP requests).

- **Directed edges**  
  Encode **temporal or causal flow** (e.g., `src_ip ➝ dst_ip`, `IP ➝ domain`), reflecting who initiated what.
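
A minimal sketch of the multi-view and directed properties on toy data. The node names and the `other_view` key are made up for illustration and do not come from the dataset.

```python
import networkx as nx

G_demo = nx.MultiDiGraph()
# Two views between the same ordered pair, told apart by the edge key.
G_demo.add_edge("10.0.0.1", "10.0.0.2", key="flow", proto="tcp")
G_demo.add_edge("10.0.0.1", "10.0.0.2", key="other_view")  # hypothetical second view
# Direction matters: the reverse pair is stored as a separate edge.
G_demo.add_edge("10.0.0.2", "10.0.0.1", key="flow", proto="udp")

views = sorted(k for _, _, k in G_demo.edges(keys=True))
forward_edges = G_demo.number_of_edges("10.0.0.1", "10.0.0.2")
print(views)          # ['flow', 'flow', 'other_view']
print(forward_edges)  # 2
```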

# Node Types (Entities)

Each node represents a real-world entity, extracted from one or more dataset columns:

| Node Type         | Source Column(s)    | Description                                                                 |
|-------------------|---------------------|-----------------------------------------------------------------------------|
| **IP Address**     | `src_ip`, `dst_ip`  | Devices or interfaces on the network (e.g., `192.168.1.37`).                |
| **Domain Name**    | `dns_query`         | Fully qualified domain names queried by IPs (e.g., `www.example.com`).      |
| **HTTP URI**       | `http_uri`          | HTTP resource paths (e.g., `/login`, `/index.html`).                        |
| **SSL Subject**    | `ssl_subject`       | Distinguished Name of the certificate subject (e.g., `/C=US/O=Let's Encrypt`). |
| **SSL Issuer**     | `ssl_issuer`        | Distinguished Name of the certificate issuer (e.g., `/C=US/O=Google Trust Services`). |
| **Protocol Violation** | `weird_name`     | Descriptive label of detected anomalies (e.g., `bad_TCP_checksum`).         |
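
Node types are not stored on the nodes by default, so one option is to tag each node when it is added. The sketch below assumes a hypothetical `node_type` attribute (not used elsewhere in this notebook) and shows how it allows selecting IP nodes without guessing from the string.

```python
import networkx as nx

G_typed = nx.MultiDiGraph()
# `node_type` is an assumed attribute name; the values mirror the table above.
G_typed.add_node("192.168.1.37", node_type="ip")
G_typed.add_node("www.example.com", node_type="domain")
G_typed.add_node("/login", node_type="http_uri")
G_typed.add_edge("192.168.1.37", "www.example.com", key="dns_query")
G_typed.add_edge("192.168.1.37", "/login", key="http_request")

# Select nodes by type instead of inspecting the node name.
ip_only = [n for n, t in G_typed.nodes(data="node_type") if t == "ip"]
print(ip_only)  # ['192.168.1.37']
```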

---

# Edge Types (Views)

Each directed edge represents an interaction or behavioral relationship, often enriched with protocol metadata:

## 1. `flow` — (IP ➝ IP)

Represents a network flow between two IP addresses.

- **Source:** `src_ip`  
- **Target:** `dst_ip`  
- **Attributes:**
  - `proto`, `service`, `duration`, `conn_state`
  - `src_bytes`, `dst_bytes`
  - `label`, `attack_type`

**Usefulness:**  
Defines the **structural backbone** of the graph, enabling analysis of traffic patterns and attack topologies.

## 2. `dns_query` — (IP ➝ Domain Name)

Represents a DNS lookup initiated by a host.

- **Source:** `src_ip`  
- **Target:** `dns_query`  
- **Attributes:**
  - `qclass`, `qtype`, `rcode`
  - `dns_AA`, `dns_RD`, `dns_RA`, `dns_rejected`

**Usefulness:**  
Reveals **host intent** and can indicate access to suspicious or malicious domains.

## 3. `http_request` — (IP ➝ HTTP URI)

Captures web resource requests made by a host.

- **Source:** `src_ip`  
- **Target:** `http_uri`  
- **Attributes:**
  - `method`, `version`, `status_code`
  - `trans_depth`, `req_body_len`, `resp_body_len`
  - `user_agent`, `orig_mime`, `resp_mime`

**Usefulness:**  
Reflects **web behavior**; useful for detecting scanning, reconnaissance, and probing activity.

## 4. `protocol_violation` — (IP ➝ Violation Label)

Links an IP to a protocol anomaly observed during communication.

- **Source:** `src_ip`  
- **Target:** `weird_name`  
- **Attributes:**
  - `weird_addl`, `weird_notice`

**Usefulness:**  
Highlights **anomalous or misconfigured hosts**. Many such events are early indicators of compromise or malicious activity.

# Semantic Graph Properties

- **IP nodes are central:**  
  Most interaction types originate from or are directed to IP addresses, making them critical in graph topology.

- **Multi-modal behavioral modeling:**  
  Combines HTTP, DNS, SSL, and flow-level information into one unified representation.

- **Multi-view learning ready:**  
  The graph supports training models on **protocol-specific subgraphs or jointly across views**.

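- **Temporal/causal interpretation:**  
  Directed edges preserve **who initiated the interaction**, enabling traceability and behavioral profiling. No timestamp is stored on the edges, so direction captures initiation rather than ordering in time.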
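
A protocol-specific subgraph can be pulled out by filtering on the edge key. The helper below is a sketch; `view_subgraph` is a hypothetical name, and the toy graph stands in for the real one.

```python
import networkx as nx

def view_subgraph(G, view):
    """Return a copy of G restricted to the edges whose key equals `view`."""
    edges = [(u, v, k) for u, v, k in G.edges(keys=True) if k == view]
    return G.edge_subgraph(edges).copy()

G_demo = nx.MultiDiGraph()
G_demo.add_edge("10.0.0.1", "10.0.0.2", key="flow")
G_demo.add_edge("10.0.0.1", "example.org", key="dns_query")
G_demo.add_edge("10.0.0.3", "example.org", key="dns_query")

dns_view = view_subgraph(G_demo, "dns_query")
print(sorted(dns_view.nodes))  # ['10.0.0.1', '10.0.0.3', 'example.org']
```

Only nodes touched by the selected edges are kept, so `10.0.0.2` drops out of the DNS view.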


In [3]:
import pandas as pd
import networkx as nx

df = pd.read_csv("../datasets/train_test_network.csv")

G = nx.MultiDiGraph()

for _, row in df.iterrows():
    src_ip = row['src_ip']
    dst_ip = row['dst_ip']

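    # Note: with a fixed key, a repeated (src_ip, dst_ip) pair updates the
    # attributes of the existing edge instead of adding a parallel edge.
    # The same holds for every view added below.
    G.add_edge(
        src_ip, dst_ip,
        key="flow",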
        proto=row.get("proto"),
        service=row.get("service"),
        duration=row.get("duration"),
        src_bytes=row.get("src_bytes"),
        dst_bytes=row.get("dst_bytes"),
        conn_state=row.get("conn_state"),
        label=row.get("label"),
        attack_type=row.get("type")
    )

    if pd.notna(row.get("dns_query")):
        dns_domain = row["dns_query"]
        G.add_edge(
            src_ip, dns_domain,
            key="dns_query",
            qclass=row.get("dns_qclass"),
            qtype=row.get("dns_qtype"),
            rcode=row.get("dns_rcode"),
            dns_AA=row.get("dns_AA"),
            dns_RD=row.get("dns_RD"),
            dns_RA=row.get("dns_RA"),
            dns_rejected=row.get("dns_rejected")
        )

    if pd.notna(row.get("http_uri")):
        http_target = row["http_uri"]
        G.add_edge(
            src_ip, http_target,
            key="http_request",
            method=row.get("http_method"),
            version=row.get("http_version"),
            status_code=row.get("http_status_code"),
            trans_depth=row.get("http_trans_depth"),
            req_body_len=row.get("http_request_body_len"),
            resp_body_len=row.get("http_response_body_len"),
            user_agent=row.get("http_user_agent"),
            orig_mime=row.get("http_orig_mime_types"),
            resp_mime=row.get("http_resp_mime_types")
        )

    if pd.notna(row.get("ssl_subject")):
        G.add_edge(
            src_ip, row["ssl_subject"],
            key="ssl_subject",
            ssl_version=row.get("ssl_version"),
            ssl_cipher=row.get("ssl_cipher"),
            ssl_resumed=row.get("ssl_resumed"),
            ssl_established=row.get("ssl_established")
        )

    if pd.notna(row.get("ssl_issuer")):
        G.add_edge(
            src_ip, row["ssl_issuer"],
            key="ssl_issuer",
            ssl_version=row.get("ssl_version"),
            ssl_cipher=row.get("ssl_cipher"),
            ssl_resumed=row.get("ssl_resumed"),
            ssl_established=row.get("ssl_established")
        )

    if pd.notna(row.get("weird_name")):
        G.add_edge(
            src_ip, row["weird_name"],
            key="protocol_violation",
            weird_addl=row.get("weird_addl"),
            weird_notice=row.get("weird_notice")
        )
        
print(f"Graph built with {G.number_of_nodes()} nodes and {G.number_of_edges()} edges.")
print("Edge types (views) include:", set(k for _, _, k in G.edges(keys=True)))

Graph built with 1605 nodes and 2554 edges.
Edge types (views) include: {'protocol_violation', 'flow', 'http_request', 'ssl_issuer', 'dns_query', 'ssl_subject'}
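
The edge count above depends on how keys are used. In NetworkX, calling `add_edge` with a key that already exists for a node pair updates that edge, so each (source, target, view) triple ends up as one edge. The sketch below contrasts this with a possible alternative that keeps the view name in a `view` attribute (an assumed name) and lets NetworkX assign keys.

```python
import networkx as nx

fixed = nx.MultiDiGraph()
fixed.add_edge("10.0.0.1", "10.0.0.2", key="flow", duration=1.0)
fixed.add_edge("10.0.0.1", "10.0.0.2", key="flow", duration=2.0)  # same key: updates the edge

auto = nx.MultiDiGraph()
# Hypothetical alternative: store the view as an attribute and let NetworkX
# pick integer keys, so every record becomes its own parallel edge.
auto.add_edge("10.0.0.1", "10.0.0.2", view="flow", duration=1.0)
auto.add_edge("10.0.0.1", "10.0.0.2", view="flow", duration=2.0)

print(fixed.number_of_edges())  # 1
print(auto.number_of_edges())   # 2
print(fixed["10.0.0.1"]["10.0.0.2"]["flow"]["duration"])  # 2.0
```

Which behavior is preferable depends on whether per-record detail or a compact one-edge-per-view graph is wanted.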


In [4]:
import networkx as nx
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
import numpy as np


In [10]:
def print_detailed_classification_report(report):
    print("Classification Report Summary\n")

    for label in ["Normal", "Attack"]:
        print(f"🔹 Class: {label}")
        precision = report[label]['precision']
        recall = report[label]['recall']
        f1 = report[label]['f1-score']
        support = report[label]['support']

        print(f"  - Number of true samples (support): {int(support)}")
        print(f"  - Precision: {precision:.2f} -> Of all predicted '{label}', {precision:.0%} were correct.")
        print(f"  - Recall:    {recall:.2f} -> Of all actual '{label}', {recall:.0%} were found.")
        print(f"  - F1 Score:  {f1:.2f} -> Harmonic mean of precision and recall.\n")

    print("Overall Performance")
    print(f"  - Accuracy:         {report['accuracy']:.2%} -> Total correct predictions out of all samples.\n")
    
    print("  - Macro Avg (equal weight per class):")
    print(f"    - Precision: {report['macro avg']['precision']:.2f}")
    print(f"    - Recall:    {report['macro avg']['recall']:.2f}")
    print(f"    - F1 Score:  {report['macro avg']['f1-score']:.2f}")

    print("\n  - Weighted Avg (weighted by class size):")
    print(f"    - Precision: {report['weighted avg']['precision']:.2f}")
    print(f"    - Recall:    {report['weighted avg']['recall']:.2f}")
    print(f"    - F1 Score:  {report['weighted avg']['f1-score']:.2f}")

## 1. Community Detection
Apply clustering or community detection algorithms on specific views:
- flow → group IPs that communicate frequently
- dns_query → group IPs that query similar domains (suspicious beaconing behavior?)
- http_request → group clients based on similar URLs
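
One way to group IPs that communicate frequently is modularity-based community detection on an undirected projection of the `flow` view. This is a sketch on toy data; the choice of `greedy_modularity_communities` is an assumption, not a method prescribed by this notebook.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy flow view: two tightly connected groups joined by a single link.
flow_demo = nx.Graph()
flow_demo.add_edges_from([
    ("10.0.0.1", "10.0.0.2"), ("10.0.0.2", "10.0.0.3"), ("10.0.0.1", "10.0.0.3"),
    ("10.0.1.1", "10.0.1.2"), ("10.0.1.2", "10.0.1.3"), ("10.0.1.1", "10.0.1.3"),
    ("10.0.0.3", "10.0.1.1"),  # bridge between the groups
])

communities = greedy_modularity_communities(flow_demo)
groups = sorted(sorted(c) for c in communities)
for group in groups:
    print(group)
```

On the real graph, the input would be the `flow` edges converted to an undirected simple graph.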

In [11]:
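# Crude IP filter: any string node containing a dot. This also matches dotted
# domain names and skips IPv6 addresses such as ff02::fb.
ip_nodes = [n for n in G.nodes if isinstance(n, str) and '.' in n]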

# Create a feature matrix: in-degree and out-degree from 'flow' edges
features = []
labels = []

for ip in ip_nodes:
    out_deg = sum(1 for _, _, k in G.out_edges(ip, keys=True) if k == "flow")
    in_deg = sum(1 for _, _, k in G.in_edges(ip, keys=True) if k == "flow")
    label = None
    for _, _, k, d in G.out_edges(ip, keys=True, data=True):
        if k == "flow" and d.get("label"):
            label = d["label"]
            break
    features.append([in_deg, out_deg])
    labels.append(label if label else "Normal")

X = StandardScaler().fit_transform(features)
kmeans = KMeans(n_clusters=2, random_state=0, n_init="auto").fit(X)

# Interpret labels
cluster_labels = kmeans.labels_
true_labels = ["Attack" if str(l).lower() != "normal" else "Normal" for l in labels]

# Map each cluster index to the majority true label among its members.
# A separate name avoids overwriting the dataset DataFrame `df`.
cluster_df = pd.DataFrame({'cluster': cluster_labels, 'true': true_labels})
mapping = {}

for cluster_id in np.unique(cluster_labels):
    majority_class = cluster_df[cluster_df['cluster'] == cluster_id]['true'].mode()[0]
    mapping[cluster_id] = majority_class

# Translate numeric cluster labels to "Normal"/"Attack"
predicted_labels = [mapping[c] for c in cluster_labels]

# Classification report. `labels` pins the class order; without it, scikit-learn
# sorts the labels ("Attack" before "Normal") and `target_names` is applied to
# the wrong classes.
report = classification_report(
    true_labels, predicted_labels,
    labels=["Normal", "Attack"], target_names=["Normal", "Attack"],
    output_dict=True,
)
print_detailed_classification_report(report)

Classification Report Summary

🔹 Class: Normal
  - Number of true samples (support): 1303
  - Precision: 0.99 -> Of all predicted 'Normal', 99% were correct.
  - Recall:    1.00 -> Of all actual 'Normal', 100% were found.
  - F1 Score:  1.00 -> Harmonic mean of precision and recall.

🔹 Class: Attack
  - Number of true samples (support): 19
  - Precision: 1.00 -> Of all predicted 'Attack', 100% were correct.
  - Recall:    0.32 -> Of all actual 'Attack', 32% were found.
  - F1 Score:  0.48 -> Harmonic mean of precision and recall.

Overall Performance
  - Accuracy:         99.02% -> Total correct predictions out of all samples.

  - Macro Avg (equal weight per class):
    - Precision: 1.00
    - Recall:    0.66
    - F1 Score:  0.74

  - Weighted Avg (weighted by class size):
    - Precision: 0.99
    - Recall:    0.99
    - F1 Score:  0.99
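
The gap between 99.02% accuracy and a macro recall of 0.66 comes from class imbalance: accuracy is dominated by the larger class, while macro recall weights both classes equally. The toy example below (labels `A` and `B` are placeholders, not dataset values) works through the same arithmetic by hand.

```python
def macro_recall(y_true, y_pred):
    """Average of per-class recall, each class weighted equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)

# 8 majority-class samples, all correct; 2 minority samples, 1 correct.
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 8 + ["B", "A"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro = macro_recall(y_true, y_pred)
print(accuracy)  # 0.9
print(macro)     # 0.75
```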


## 2. Node Centrality Analysis
Compute betweenness centrality, eigenvector centrality, or PageRank on:
- Flow view → who routes/relays the most traffic?
- DNS view → which domains are queried the most?

Flow view → who routes/relays the most traffic?

In [15]:
import networkx as nx

# FLOW VIEW: Create subgraph for 'flow' edges only
flow_edges = [(u, v) for u, v, k in G.edges(keys=True) if k == "flow"]
flow_G = nx.DiGraph()
flow_G.add_edges_from(flow_edges)

print("Flow-based Centrality Analysis")

# Betweenness Centrality
flow_betweenness = nx.betweenness_centrality(flow_G)
top_flow_betweenness = sorted(flow_betweenness.items(), key=lambda x: x[1], reverse=True)[:10]
print("\nTop 10 by Betweenness Centrality (Flow):")
for node, score in top_flow_betweenness:
    print(f"  {node}: {score:.4f}")

# PageRank
flow_pagerank = nx.pagerank(flow_G, alpha=0.85)
top_flow_pagerank = sorted(flow_pagerank.items(), key=lambda x: x[1], reverse=True)[:10]
print("\nTop 10 by PageRank (Flow):")
for node, score in top_flow_pagerank:
    print(f"  {node}: {score:.4f}")

# Eigenvector centrality (power iteration may fail to converge, e.g. when the graph is not strongly connected)
try:
    flow_eigen = nx.eigenvector_centrality(flow_G, max_iter=1000)
    top_flow_eigen = sorted(flow_eigen.items(), key=lambda x: x[1], reverse=True)[:10]
    print("\nTop 10 by Eigenvector Centrality (Flow):")
    for node, score in top_flow_eigen:
        print(f"  {node}: {score:.4f}")
except nx.PowerIterationFailedConvergence:
    print("\nEigenvector centrality failed to converge on flow view.")

Flow-based Centrality Analysis

Top 10 by Betweenness Centrality (Flow):
  192.168.1.190: 0.0126
  192.168.1.152: 0.0055
  192.168.1.193: 0.0044
  192.168.1.31: 0.0040
  192.168.1.30: 0.0034
  192.168.1.195: 0.0034
  192.168.1.34: 0.0018
  192.168.1.37: 0.0010
  192.168.1.33: 0.0010
  192.168.1.1: 0.0004

Top 10 by PageRank (Flow):
  ff02::fb: 0.0081
  224.0.0.251: 0.0074
  192.168.1.190: 0.0063
  192.168.1.255: 0.0049
  ff02::1:3: 0.0048
  192.168.1.193: 0.0030
  192.168.1.31: 0.0029
  192.168.1.37: 0.0029
  192.168.1.33: 0.0029
  224.0.0.252: 0.0028

Top 10 by Eigenvector Centrality (Flow):
  192.168.1.195: 0.2238
  192.168.1.152: 0.1960
  192.168.1.190: 0.1865
  224.0.0.251: 0.1824
  192.168.1.255: 0.1669
  117.18.237.29: 0.1616
  192.168.1.193: 0.1438
  13.35.146.12: 0.1305
  192.168.1.1: 0.1162
  239.255.255.250: 0.1158
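
To help read the betweenness scores, the toy example below shows why a relay scores high: every shortest path from a client to the server passes through it. The addresses are made up.

```python
import networkx as nx

relay_G = nx.DiGraph()
for client in ["10.0.0.1", "10.0.0.2", "10.0.0.3"]:
    relay_G.add_edge(client, "10.0.0.254")    # clients reach the relay
relay_G.add_edge("10.0.0.254", "10.0.1.5")    # relay forwards to a server

bc = nx.betweenness_centrality(relay_G)
top_node = max(bc, key=bc.get)
print(top_node, bc[top_node])  # 10.0.0.254 0.25
```

Three client-to-server paths pass through the relay, and the directed normalization divides by (n-1)(n-2) = 12 for n = 5 nodes, giving 0.25. Endpoints score 0.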


DNS view → which domains are queried the most?

In [24]:
# DNS VIEW: Create bipartite-like graph of IPs querying domain names
dns_edges = [(u, v) for u, v, k in G.edges(keys=True) if k == "dns_query"]
dns_G = nx.DiGraph()
dns_G.add_edges_from(dns_edges)

print("\nDNS-based Centrality Analysis")

# In-degree: number of distinct source nodes that queried each domain (the DiGraph collapses parallel edges)
dns_indegree = dns_G.in_degree()
top_domains = sorted(dns_indegree, key=lambda x: x[1], reverse=True)[:10]
print("\nTop 10 Queried Domains (by in-degree):")
for domain, deg in top_domains:
    print(f"  {domain}: {deg} queries")

# PageRank for domains
dns_pagerank = nx.pagerank(dns_G, alpha=0.85)
top_dns_pagerank = sorted(dns_pagerank.items(), key=lambda x: x[1], reverse=True)[:10]
print("\nTop 10 by PageRank (DNS):")
for node, score in top_dns_pagerank:
    print(f"  {node}: {score:.4f}")


DNS-based Centrality Analysis

Top 10 Queried Domains (by in-degree):
  -: 36 queries
  _googlecast._tcp.local: 9 queries
  shavar.services.mozilla.com: 8 queries
  wpad: 7 queries
  services.addons.mozilla.org: 7 queries
  versioncheck-bg.addons.mozilla.org: 7 queries
  detectportal.firefox.com: 6 queries
  aus5.mozilla.org: 6 queries
  firefox.settings.services.mozilla.com: 6 queries
  blocklists.settings.services.mozilla.com: 6 queries

Top 10 by PageRank (DNS):
  -: 0.0210
  _sleep-proxy._udp.local: 0.0027
  android.local: 0.0026
  isatap: 0.0025
  _googlecast._tcp.local: 0.0023
  _fb._tcp.local: 0.0022
  _raop._tcp.local: 0.0022
  desktop-18ss3ba: 0.0021
  _ipps._tcp.local: 0.0020
  _companion-link._tcp.local: 0.0020
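
The top entry `-` looks like a placeholder for a missing value rather than a real domain, and `pd.notna` does not filter it out. Assuming that reading is correct, one option is to map `-` to NaN for the `dns_query` column at load time. The sketch uses an in-memory CSV with made-up rows.

```python
import io
import pandas as pd

csv_text = "src_ip,dns_query\n10.0.0.1,-\n10.0.0.2,example.org\n"

# Treat "-" as missing only in `dns_query`; other columns are left untouched.
df_demo = pd.read_csv(io.StringIO(csv_text), na_values={"dns_query": ["-"]})
kept = df_demo[df_demo["dns_query"].notna()]
print(kept["dns_query"].tolist())  # ['example.org']
```

The same per-column mapping could be extended to other columns if they use the same placeholder; that would need checking against the dataset.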


## TODO: DIMKA
Write the code for:
### 1. Community Detection
Apply clustering or community detection algorithms on specific views:
- flow → group IPs that communicate frequently
- dns_query → group IPs that query similar domains (suspicious beaconing behavior?)
- http_request → group clients based on similar URLs
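
For the `dns_query` item, a possible starting point is the Jaccard similarity between the sets of domains each IP queried. The sketch is plain Python; `domains_by_ip` is a hypothetical mapping that would be collected from `dns_query` out-edges.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical per-IP domain sets.
domains_by_ip = {
    "10.0.0.1": {"example.org", "example.net"},
    "10.0.0.2": {"example.org", "example.net", "example.com"},
    "10.0.0.3": {"updates.example.edu"},
}

sim_12 = jaccard(domains_by_ip["10.0.0.1"], domains_by_ip["10.0.0.2"])
sim_13 = jaccard(domains_by_ip["10.0.0.1"], domains_by_ip["10.0.0.3"])
print(round(sim_12, 3))  # 0.667
print(sim_13)            # 0.0
```

Pairwise similarities like these could feed a clustering step or define edges of an IP-to-IP similarity graph.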

### 2. Node Centrality Analysis
Compute betweenness centrality, eigenvector centrality, or PageRank on:
- Flow view → who routes/relays the most traffic?
- DNS view → which domains are queried the most?

### 3. Node Feature Extraction for Classification  
Use the graph structure to extract features for IP nodes and apply supervised machine learning to classify them as normal or malicious.

- Use the existing `label` and `attack_type` attributes from the `flow` view as ground truth.
- Generate per-node features across different views to capture behavioral patterns.
- Train a classifier (e.g., Random Forest, XGBoost, or MLP) to detect malicious IPs.

Feature ideas for each IP node:

| Feature                               | Description                                                   |
|--------------------------------------|---------------------------------------------------------------|
| Degree / in-degree / out-degree      | Number of total/initiated/received connections (flow view)   |
| Number of distinct queried domains   | From `dns_query` edges — indicates domain diversity           |
| Number of protocol violations        | From `protocol_violation` edges — potential misbehavior       |
| Most common HTTP status codes        | From `http_request` edges — could signal probing or scanning  |
| Avg. bytes sent/received             | Captures traffic volume per connection                        |
| PageRank / betweenness / eigenvector | Centrality in communication network (flow view)               |
| Clustering coefficient               | Measures tightness of local communication                     |
| Number of SSL subjects or issuers    | Captures breadth of contacted certs (SSL-related views)       |

**Usefulness:** Enables explainable, graph-based threat classification using traditional ML pipelines. Serves as a strong baseline and complements centrality/community detection.
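
A few of the features in the table can be read directly from the edge keys. The sketch below is one possible shape for such an extractor; `ip_features` and the feature names are hypothetical, and the toy graph stands in for the real one.

```python
import networkx as nx

def ip_features(G, ip):
    """Count per-view neighbors of `ip` using edge keys as view names."""
    out_by_view = {}
    for _, target, view in G.out_edges(ip, keys=True):
        out_by_view.setdefault(view, set()).add(target)
    flow_in = sum(1 for _, _, k in G.in_edges(ip, keys=True) if k == "flow")
    return {
        "flow_out": len(out_by_view.get("flow", ())),
        "flow_in": flow_in,
        "n_domains": len(out_by_view.get("dns_query", ())),
        "n_violations": len(out_by_view.get("protocol_violation", ())),
    }

G_demo = nx.MultiDiGraph()
G_demo.add_edge("10.0.0.1", "10.0.0.2", key="flow")
G_demo.add_edge("10.0.0.3", "10.0.0.1", key="flow")
G_demo.add_edge("10.0.0.1", "example.org", key="dns_query")
G_demo.add_edge("10.0.0.1", "example.net", key="dns_query")
G_demo.add_edge("10.0.0.1", "bad_TCP_checksum", key="protocol_violation")

features = ip_features(G_demo, "10.0.0.1")
print(features)
# {'flow_out': 1, 'flow_in': 1, 'n_domains': 2, 'n_violations': 1}
```

Stacking one such dictionary per IP into a table gives a feature matrix that a classifier such as Random Forest can consume.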