# Detecting DNS Exfiltration via Traffic Analytics
By Herbert Maosa  
Cybersecurity Consultant | PhD | CISSP | OSCP

---

*"Most networks let DNS traffic pass without a second thought - and that's exactly what attackers are counting on."*  

DNS exfiltration is a stealthy attack in which sensitive data - passwords, source code, intellectual property, or trade secrets - is smuggled out of a network inside DNS queries. Because DNS is rarely blocked or scrutinized, attackers can encode payloads into domain lookups and quietly bypass firewalls.

Detecting DNS exfiltration requires looking for unusual patterns such as:
- Excessively long or random-looking domain names
- High volumes of DNS queries to suspicious domains
- Anomalies in query types and response behavior

In this post, we'll dive into hands-on analytics and visualizations using real PCAP data - and expose how stealthy DNS traffic can reveal a breach in progress.

---
## Imports

```python
# !pip install tldextract
import json
import math
import warnings
from collections import defaultdict

import ipywidgets as widgets
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import tldextract
from ipyfilechooser import FileChooser
from IPython.display import display

sns.set(style='whitegrid')
plt.close('all')
# mpl.style.use('seaborn-v0_8')
warnings.filterwarnings("ignore", message=".*use_inf_as_na.*")
```

---
## 1. Dataset and Parsing
For this notebook, we analyze the [CIC-Bell-DNS-EXF-2021 dataset](https://www.unb.ca/cic/datasets/dns-exf-2021.html) from the Canadian Institute for Cybersecurity, a well-known benchmark in cybersecurity research, particularly for DNS exfiltration detection. The dataset is provided in two formats:
- A CSV file with ~30 pre-engineered features for machine learning
- A raw PCAP file containing full packet captures

Although the PCAP includes other protocols, we focus exclusively on DNS traffic to investigate potential exfiltration activity. The parsed DNS packets are stored in a structured JSON file, which we load into a Pandas DataFrame for analysis. The PCAP parser and a link to the dataset are available on this project's GitHub page.
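
For orientation, here is a minimal sketch of the record shape that the loading code in this section appears to expect. The field names are taken from the DataFrame columns shown in the sample output; the exact nesting under `questions` is an assumption about the parser's output, not a documented schema.

```python
import json

# Hypothetical example of one parsed DNS packet. Field names mirror the
# DataFrame columns used in this post; the nesting under "questions" is an
# assumption about the parser's output.
sample_json = """
[
  {
    "timestamp": "2020-11-24 00:32:38.062066",
    "src_ip": "192.168.20.38",
    "dst_ip": "8.8.8.8",
    "qr": 0,
    "rcode": 0,
    "answers": [],
    "questions": [
      {"qname": "v10.events.data.microsoft.com.", "qtype": 1, "qclass": 1}
    ]
  }
]
"""

records = json.loads(sample_json)

# Flatten: one output row per (packet, question) pair, which is roughly what
# explode() followed by json_normalize() achieves on the DataFrame.
rows = [
    {**{k: v for k, v in pkt.items() if k != "questions"}, **q}
    for pkt in records
    for q in (pkt.get("questions") or [])
]
print(rows[0]["qname"], len(rows[0]["qname"]))
```

Packets whose `questions` list is empty or missing produce no rows here, which is why the notebook keeps them in a separate DataFrame.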

```python
def shannon_entropy(s):
    """Shannon entropy (bits per character) of the string s."""
    if not s:
        return 0
    prob = [float(s.count(c)) / len(s) for c in set(s)]
    entropy = -sum(p * math.log2(p) for p in prob)
    return entropy

def map_dns_subdomains(domain_series):
    """Map each root domain to the set of subdomains seen under it."""
    subdomain_map = defaultdict(set)
    for fqdn in domain_series.dropna().unique():
        extracted = tldextract.extract(fqdn)
        root_domain = f"{extracted.domain}.{extracted.suffix}"
        if extracted.subdomain:
            subdomain_map[root_domain].add(extracted.subdomain)
    return subdomain_map
```
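
To make the entropy feature concrete, the following self-contained sketch recomputes it with `collections.Counter` and applies it to two names. `entropy_bits` is a hypothetical helper name, and the second query name is made up to resemble encoded data.

```python
import math
from collections import Counter

def entropy_bits(s: str) -> float:
    """Shannon entropy of the characters in s, in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())

# Illustrative names only; the second one is invented to look like encoded data.
for name in ["dns.google.", "mzxw6ytboi2gk3tfnzsa.example.com."]:
    print(f"{name:40s} {entropy_bits(name):.3f}")
```

For `dns.google.` this yields about 2.914 bits per character, matching the `qname_entropy` value in the sample output. Names built from encoded payloads tend to use more distinct characters more evenly, which pushes the value up.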

```python
# upload = widgets.FileUpload(accept='.json', multiple=False)
# display(upload)
fc = FileChooser()
display(fc)

def process_uploaded_dns(input_file):
    """Load the parsed DNS JSON file and return (df, empty_q_df)."""
    if not input_file:
        return None, None

    with open(input_file) as f:
        dns_data = json.load(f)

    df = pd.json_normalize(dns_data)

    # Preserve empty-question entries for analysis
    empty_q_df = df[df['questions'].isna() | df['questions'].apply(lambda q: isinstance(q, list) and len(q) == 0)].copy()

    # Proceed with entries that have questions
    df = df[df['questions'].notna() & df['questions'].apply(lambda q: isinstance(q, list) and len(q) > 0)].copy()
    df = df.explode('questions').reset_index(drop=True)

    # If exploded questions still exist, normalize them
    if not df.empty and df['questions'].notna().any():
        q_exp = pd.json_normalize(df['questions'])
        df = pd.concat([df.drop(columns=['questions']), q_exp], axis=1)
    else:
        print("No valid question entries to normalize.")
        return None, empty_q_df

    # Safely handle qname-based features
    if 'qname' in df.columns:
        df = df[df['qname'].notna()].copy()
        df['qname_length'] = df['qname'].apply(len)
        df['qname_entropy'] = df['qname'].apply(shannon_entropy)
    else:
        print("'qname' not found in normalized questions. Skipping length/entropy features.")
        df['qname_length'] = None
        df['qname_entropy'] = None

    # Convert timestamp
    df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')

    return df, empty_q_df
```

```python
df, empty_q_df = process_uploaded_dns(fc.selected)
df.head(3) if df is not None else "Upload a file to begin analysis."
```

| | timestamp | src_ip | dst_ip | ip_protocol | ip_header_length | packet_size | type | src_port | src_service | dst_port | ... | id | qr | opcode | rcode | answers | qname | qtype | qclass | qname_length | qname_entropy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-11-24 00:32:38.062066 | 192.168.20.38 | 8.8.8.8 | UDP | 20 | 75 | other | 52433 | | 53 | ... | 53331 | 0 | 0 | 0 | [] | v10.events.data.microsoft.com. | 1 | 1 | 30 | 3.80291 |
| 1 | 2020-11-24 00:32:38.086978 | 8.8.8.8 | 192.168.20.38 | UDP | 20 | 193 | other | 53 | domain | 52433 | ... | 53331 | 1 | 0 | 0 | [] | v10.events.data.microsoft.com. | 1 | 1 | 30 | 3.80291 |
| 2 | 2020-11-24 00:37:20.374794 | 192.168.20.38 | 8.8.8.8 | UDP | 20 | 56 | other | 51698 | | 53 | ... | 11770 | 0 | 0 | 0 | [] | dns.google. | 1 | 1 | 11 | 2.913977 |


---
## 2. Static Analysis
Static analysis focuses on features derived from what is being queried and the responses received. This data can be extracted directly from the DNS header or from the query and response payloads. In this post we examine:
- DNS response codes (RCODEs)
- Query length distribution
- Most frequently queried domain names
- Subdomain cardinality (how many subdomains per root domain)

---
### 2.1 RCODE Analysis
In DNS, the **rcode** field in a response indicates the result of the query. Analyzing these response codes helps identify suspicious or anomalous behavior, especially in the context of DNS exfiltration.

Common RCODE Values:

**0** — NoError: Normal response

**1** — FormErr: Malformed query

**2** — ServFail: Server failure

**3** — NXDomain: Domain does not exist

**5** — Refused: Query denied by server.

### Why RCODE Matters in Exfiltration Analysis
In many DNS exfiltration campaigns, the attacker's server replies with **RCODE=0** (NoError) but returns no records in the answer section. This combination is unusual and may indicate:
- A stealth tunnel that confirms query receipt without revealing actual records
- DNS servers under attacker control that acknowledge all queries without serving real data

Conversely, repeated NXDomain (RCODE=3) or Refused (RCODE=5) responses may signal probing or command-and-control attempts via non-existent domains.

### In Our Dataset
We broke down the frequency of RCODEs across the dataset to identify response patterns. As the chart below shows, the vast majority of responses returned NoError, but with empty answer sections, which raises suspicion of covert behavior.

```python
# Filter DNS responses only
responses_df = df[df["qr"] == 1].copy()

# Map RCODEs to descriptive names
rcode_labels = {
    0: "NoError",
    1: "FormErr",
    2: "ServFail",
    3: "NXDomain",
    4: "NotImp",
    5: "Refused"
}
responses_df["rcode_name"] = responses_df["rcode"].map(rcode_labels).fillna("Other")

# Count occurrences
rcode_counts = responses_df["rcode_name"].value_counts().reset_index()
rcode_counts.columns = ['rcode_name', 'count']
rcode_counts["hue"] = rcode_counts["rcode_name"]  # Dummy hue to suppress warning

# Plot with hue and no legend
plt.figure(figsize=(10, 5))
sns.barplot(
    data=rcode_counts,
    x='rcode_name', y='count',
    hue='hue', palette='coolwarm', edgecolor='black', legend=False
)
plt.title("DNS Response Code Breakdown")
plt.xlabel("RCODE (Response Code)")
plt.ylabel("Count")
plt.tight_layout()
plt.savefig('rcode_breakdown.png')
plt.show()
```

![DNS RCODE Breakdown](rcode_breakdown.png)

**Figure 2.1:** Breakdown of DNS response codes (RCODE). The overwhelming majority of responses are `NoError` (rcode=0), yet many have *empty answer sections*, a potential sign of DNS-based data exfiltration.

In our dataset, the majority of responses with RCODE=0 (NoError) carry no actual answers. This is a known pattern in DNS exfiltration, where the attacker's server acknowledges queries to maintain stealth without serving real DNS data. We can already flag these queries as suspicious, warranting further investigation.
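
The bar chart counts RCODEs but does not itself show how many NoError responses are empty. A minimal sketch of that calculation follows, working on plain dictionaries so it runs on its own; `empty_noerror_share` is a hypothetical helper and the sample records are made up.

```python
def empty_noerror_share(records):
    """Return (empty_noerror, total_noerror, share) over DNS records.

    Each record is a dict with 'qr', 'rcode' and 'answers' keys, matching
    the column names used in this post.
    """
    noerror = [r for r in records if r.get("qr") == 1 and r.get("rcode") == 0]
    empty = [
        r for r in noerror
        if isinstance(r.get("answers"), list) and not r["answers"]
    ]
    share = len(empty) / len(noerror) if noerror else 0.0
    return len(empty), len(noerror), share

# Tiny made-up sample, for illustration only.
sample = [
    {"qr": 1, "rcode": 0, "answers": []},                            # NoError, empty
    {"qr": 1, "rcode": 0, "answers": [{"rdata": "93.184.216.34"}]},  # NoError, answered
    {"qr": 1, "rcode": 3, "answers": []},                            # NXDomain
    {"qr": 0, "rcode": 0, "answers": []},                            # a query, ignored
]
print(empty_noerror_share(sample))  # (1, 2, 0.5)
```

With the notebook's DataFrame this could be called as `empty_noerror_share(df.to_dict("records"))`. Note that an empty NoError response is not malicious by itself: a resolver also returns one when the name exists but has no record of the requested type, so the share is best read alongside the other indicators.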


---
### 2.2 DNS Query Length Distribution

### Why This Matters
Analyzing DNS query length and entropy helps detect abnormal patterns. Long or high-entropy query names may contain encoded or obfuscated data, often used in DNS-based data exfiltration or tunneling attacks.

```python
plt.figure(figsize=(12, 5))
plt.hist(df['qname_length'], bins=50, color='skyblue', edgecolor='black')
plt.title("DNS Query Length Distribution")
plt.xlabel("Length")
plt.ylabel("Frequency")
plt.grid(True)
plt.savefig('query_len.png')
plt.show()
```
![Query Lengths](query_len.png)

**Figure 2.2:** Distribution of DNS query lengths (`qname_length`). Abnormally long query names may suggest data encoding for exfiltration, while uniform lengths could indicate automated beaconing or command-and-control activity.

The distribution of query lengths reveals structural irregularities in DNS traffic. In typical user-driven DNS usage, query lengths vary widely depending on the websites and services accessed.

However, in our dataset:

- We observe a peak at unusually long query lengths, which may suggest that data is being encoded directly into the query name.
- Where queries cluster tightly around the same length, it may point to automated scripts or malware generating structured, periodic DNS requests.

These findings reinforce the earlier indicators of DNS misuse, particularly when correlated with high subdomain cardinality or suspicious destination domains, as is the case with our dataset.
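
The histogram can be turned into a simple filter. The sketch below flags names whose total length or longest label exceeds a threshold; the helper names and both threshold values are illustrative assumptions, not values tuned on this dataset. For reference, DNS itself caps a label at 63 octets and a full name at 255 octets, so encoded payloads have to be split across labels and queries.

```python
def longest_label(qname: str) -> int:
    """Length of the longest dot-separated label in a query name."""
    return max(len(label) for label in qname.rstrip(".").split("."))

def flag_long_names(qnames, max_total=100, max_label=40):
    """Return names whose total or per-label length exceeds the thresholds.

    Both thresholds are illustrative assumptions.
    """
    return [
        q for q in qnames
        if len(q) > max_total or longest_label(q) > max_label
    ]

# Made-up names: one ordinary, one with a 50-character label.
names = ["dns.google.", "a" * 50 + ".example.com."]
print(flag_long_names(names))
```

In the notebook, the same filter could be applied to `df['qname'].unique()`; combining it with `qname_entropy` would help separate long but readable names from long, random-looking ones.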


---
### 2.3 Top Queried Domain Names
Alongside subdomain patterns (covered in Section 2.4), it is equally important to examine which domains are queried most frequently. Domains with an unusually high number of total queries - especially when paired with high subdomain cardinality - are prime suspects for data exfiltration.
These domains often:
- Receive thousands of queries within short periods
- Appear obscure, newly registered, or unrelated to business activity
- Encode data into subdomains to bypass detection

```python

# Filter to rcode=0 responses with empty answers
empty_success_responses = df[
    (df['qr'] == 1) &
    (df['rcode'] == 0) &
    (df['answers'].apply(lambda a: isinstance(a, list) and len(a) == 0))
].copy()

# Drop rows without qname
empty_success_responses = empty_success_responses[empty_success_responses['qname'].notna()].copy()

# Show top qnames
top_qnames = empty_success_responses['qname'].value_counts().head(10)
top_qnames_df = top_qnames.reset_index()
top_qnames_df.columns = ['qname', 'count']
top_qnames_df['hue'] = top_qnames_df['qname']

# Plot
plt.figure(figsize=(12, 6))
sns.barplot(data=top_qnames_df, x='count', y='qname', hue='hue', palette='flare', legend=False)
plt.title("Top Queried Domain Names (rcode=0, no answers)")
plt.xlabel("Query Count")
plt.ylabel("QNAME")
plt.tight_layout()
plt.savefig('top_queried_domain_names.png')
plt.show()
```
![Top Queried Domain Names](top_queried_domain_names.png)

**Figure 2.3:** Query names with the most NoError responses that carry no answers. A high query count, especially in combination with subdomain encoding, may reveal command-and-control or exfiltration endpoints.

In our dataset, the *microsoft.com* domain was queried under 20 different subdomain names (subdomain counts per root domain are examined in Section 2.4). Many benign domains do not exhibit this level of subdomain diversity. Sudden spikes in subdomain diversity under a single domain, especially with high entropy, may be a strong indicator of data exfiltration in progress.
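
One way to look at query volume and subdomain diversity together is to build a small per-domain profile. The sketch below is self-contained and uses a deliberately naive root-domain rule (the last two labels), which is an assumption made to avoid the `tldextract` dependency; it mishandles suffixes such as `co.uk`, which `tldextract` resolves correctly. The helper names and sample names are hypothetical.

```python
from collections import defaultdict

def naive_root(qname: str) -> str:
    """Last two labels of a name (simplified stand-in for tldextract)."""
    labels = qname.rstrip(".").lower().split(".")
    return ".".join(labels[-2:])

def domain_profile(qnames):
    """Map root domain -> (total queries, unique subdomain count)."""
    totals = defaultdict(int)
    subs = defaultdict(set)
    for q in qnames:
        root = naive_root(q)
        totals[root] += 1
        # Whatever precedes the root domain is treated as the subdomain.
        sub = q.rstrip(".").lower()[: -len(root)].rstrip(".")
        if sub:
            subs[root].add(sub)
    return {root: (totals[root], len(subs[root])) for root in totals}

# Made-up query names, for illustration only.
sample = ["a.example.com.", "b.example.com.", "a.example.com.", "example.com."]
print(domain_profile(sample))  # {'example.com': (4, 2)}
```

A domain with many queries but few distinct subdomains is usually just popular; many queries spread across many distinct subdomains is the combination worth a closer look.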


---
### 2.4 Subdomain Cardinality

Another hallmark of DNS-based exfiltration is the use of many unique subdomains under the same root domain. Attackers often encode sensitive data into subdomains — and then repeatedly query these dynamically generated names to leak information bit by bit.

For example, a root domain like malicious-domain.com might receive dozens or hundreds of queries like:

```
abc123.malicious-domain.com  
x7dfg.malicious-domain.com  
9dkaei.malicious-domain.com  
```
These names often appear random, and the domains behind them are not typically among a network's high-traffic destinations.
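
To see why such names look random, it helps to generate a few synthetic ones for testing the detectors in this post. The sketch below splits a payload into base32 chunks and places one chunk in each query name. The use of base32, the chunk size, the index prefix, and both function names are illustrative assumptions; real tools vary in how they encode data.

```python
import base64

def encode_to_qnames(data: bytes, root: str, label_len: int = 30):
    """Split data into base32 chunks and return one query name per chunk.

    Intended for generating synthetic test names only.
    """
    encoded = base64.b32encode(data).decode("ascii").rstrip("=").lower()
    chunks = [encoded[i:i + label_len] for i in range(0, len(encoded), label_len)]
    # Prefix each chunk with its index so the receiver can reorder them.
    return [f"{i}-{chunk}.{root}." for i, chunk in enumerate(chunks)]

def decode_from_qnames(qnames, root: str) -> bytes:
    """Reassemble the payload from names produced by encode_to_qnames."""
    parts = {}
    for q in qnames:
        label = q[: -len(root) - 2]            # strip the trailing ".<root>."
        idx, chunk = label.split("-", 1)
        parts[int(idx)] = chunk
    encoded = "".join(parts[i] for i in sorted(parts)).upper()
    encoded += "=" * (-len(encoded) % 8)        # restore base32 padding
    return base64.b32decode(encoded)

names = encode_to_qnames(b"user=alice;password=hunter2", "malicious-domain.com")
for n in names:
    print(n)
```

Every payload produces a fresh set of never-before-seen subdomains under the same root, which is exactly what the subdomain count in this section measures.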


```python
subdomain_map = map_dns_subdomains(df['qname'])

records = [{"root_domain": root, "subdomain_count": len(subs)} for root, subs in subdomain_map.items()]
subdomain_count_df = pd.DataFrame(records).sort_values(by="subdomain_count", ascending=False)

plt.figure(figsize=(12, 6))
sns.barplot(
    data=subdomain_count_df.head(10),
    x="subdomain_count",
    y="root_domain",
    hue="root_domain",   # color each bar by its root domain
    palette="mako",
    edgecolor="black",
    legend=False,        # the y-axis already labels each bar
)

plt.title("Top Root Domains by Number of Subdomains")
plt.xlabel("Subdomain Count")
plt.ylabel("Root Domain")
plt.tight_layout()
plt.savefig('tld_cardinality.png')
plt.show()
```
![Subdomain Cardinality](tld_cardinality.png)

**Figure 2.4**: Root domains with the highest number of unique subdomains. Unusual subdomain diversity, especially for obscure domains, may suggest that DNS is being used to encode and exfiltrate data.

---
## 3. Temporal Analysis
Temporal analysis focuses on time-series characteristics of DNS activity, such as the time window, duration, and frequency of queries. This includes:
- Time series trends (e.g. spikes or regular intervals)
- Interarrival times between queries

These patterns help uncover **beaconing behavior, bursts of activity, or unusual timing** often associated with automated exfiltration tools.

---
### 3.1 Stacked Time Series Analysis
To uncover possible data exfiltration, we analyze the temporal distribution of DNS traffic. Specifically, we compare:

- All DNS queries over time
- Suspicious queries with rcode=0 and empty answer sections

This time-series visualization helps us spot bursts, regular intervals, or anomalous surges in suspicious traffic.
This stacked chart highlights when suspicious queries occur. Spikes in rcode=0 queries with empty answers — especially in tight bursts or repeated patterns — may signal automated beaconing or payload transfer intervals.

Look for sharp surges in red overlay (suspicious traffic) as a potential indicator of data leakage in progress.

```python
# Convert timestamp and group by minute
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['minute'] = df['timestamp'].dt.floor('min')

# Total DNS queries per minute
all_volume = df.groupby('minute').size().reset_index(name='all_queries')

# Filter suspicious queries: NoError responses with no answers
suspicious_df = df[
    (df['qr'] == 1) &
    (df['rcode'] == 0) &
    (df['answers'].apply(lambda a: isinstance(a, list) and len(a) == 0))
].copy()
suspicious_df['minute'] = suspicious_df['timestamp'].dt.floor('min')
suspicious_volume = suspicious_df.groupby('minute').size().reset_index(name='suspicious_queries')

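# Merge and fill gaps (minutes without suspicious traffic get a count of 0)
volume_df = pd.merge(all_volume, suspicious_volume, on='minute', how='outer').fillna(0)
volume_df = volume_df.sort_values('minute')

# Suspicious packets are a subset of all packets, so stack them on top of the
# remainder; stacking them on the full total would count them twice.
volume_df['other_queries'] = volume_df['all_queries'] - volume_df['suspicious_queries']

# Plot stacked area chart
plt.figure(figsize=(14, 6))
plt.stackplot(
    volume_df['minute'],
    volume_df['other_queries'],
    volume_df['suspicious_queries'],
    labels=["Other DNS Traffic", "Suspicious (rcode=0, no answer)"],
    colors=["#9ecae1", "#de2d26"],
    alpha=0.8
)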
plt.legend(loc="upper left")
plt.title("Stacked DNS Query Volume Over Time")
plt.xlabel("Time")
plt.ylabel("Query Count")
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig("stacked_time_series.png")
plt.show()
```
![Stacked Query Volume](stacked_time_series.png)

**Figure 3.1:** Stacked time series showing total DNS queries vs. suspicious queries (`rcode=0` with no answers). Repeating bursts or periodic spikes may indicate *automated exfiltration or beaconing behavior*.

In addition to traffic volume spikes, this chart reveals another subtle signal: periodicity. Attackers often configure malware to exfiltrate data at regular intervals to avoid triggering volume-based alerts. These time-based patterns, sometimes referred to as beaconing behavior, can show up as:
- Evenly spaced bursts of suspicious DNS queries
- Consistent minute-by-minute activity, even during low-traffic periods

In our dataset, we observe such recurring intervals in the rcode=0 responses with no answers, reinforcing the likelihood of automated exfiltration mechanisms at play.
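
Periodicity is easier to confirm with numbers than by eye. As a rough sketch, the function below buckets timestamps by minute and counts the gaps between consecutive active minutes; a single dominant gap hints at evenly spaced bursts. The function name and the synthetic events are assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta

def minute_gap_histogram(timestamps):
    """Count gaps (in minutes) between consecutive minutes that saw traffic."""
    minutes = sorted({ts.replace(second=0, microsecond=0) for ts in timestamps})
    gaps = [
        int((later - earlier).total_seconds() // 60)
        for earlier, later in zip(minutes, minutes[1:])
    ]
    return Counter(gaps)

start = datetime(2020, 11, 24, 0, 30)
# Synthetic example: a burst of three events every 5 minutes, plus one stray event.
events = [
    start + timedelta(minutes=5 * i, seconds=s)
    for i in range(6)
    for s in (1, 2, 3)
]
events.append(start + timedelta(minutes=12))

hist = minute_gap_histogram(events)
print(hist.most_common(1))  # [(5, 4)]
```

On the notebook's data, `suspicious_df['timestamp']` could be passed in directly, since pandas timestamps support the same `replace` call as `datetime` objects.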

---
### 3.2 Inter-arrival Time Distribution
Attackers often attempt to evade detection by spacing out DNS queries at regular intervals — a technique known as beaconing. To analyze this, we calculate the time between successive queries from the same source IP.

If many queries are tightly clustered, or if they follow a repetitive pattern, it could indicate automated data exfiltration behavior.

```python
# Sort by source and time
empty_success_responses = empty_success_responses.sort_values(by=['src_ip', 'timestamp'])

# Calculate time difference from previous packet by the same src_ip
empty_success_responses['interarrival'] = empty_success_responses.groupby('src_ip')['timestamp'].diff().dt.total_seconds()

# Drop nulls (first request per src_ip)
interarrival_df = empty_success_responses.dropna(subset=['interarrival'])

# Basic distribution plot
plt.figure(figsize=(12, 5))
sns.histplot(interarrival_df['interarrival'], bins=50, kde=True)
plt.title("Distribution of Inter-arrival Times (seconds) for Suspicious DNS Queries")
plt.xlabel("Seconds Between Queries (per src_ip)")
plt.ylabel("Frequency")
plt.tight_layout()
plt.savefig('inter_arrival_times.png')
plt.show()
```
![Interarrival Time Distribution](inter_arrival_times.png)

**Figure 3.2:** Histogram of DNS query interarrival times. Sharp peaks or evenly spaced intervals may signal regular, automated query behavior — consistent with malware beacons or data tunneling.

This distribution reveals whether traffic is bursty, periodic, or random. Peaks at regular intervals may indicate automated beaconing, while a large number of queries with very short gaps may suggest data bursts. Interarrival times are computed per source IP address. In our dataset, we observe both patterns, supporting the hypothesis of controlled, possibly malicious DNS communication.
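
A histogram shows the overall shape, but a per-host regularity score can help rank candidates. One simple option, sketched below, is the coefficient of variation (standard deviation divided by mean) of each host's interarrival times: values near zero mean very regular spacing. The metric, the `min_events` cutoff, the function name, and the sample hosts are all illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev

def interarrival_cv(events, min_events=5):
    """Per-host coefficient of variation of interarrival times.

    events: iterable of (host, unix_time_seconds) pairs.
    Hosts with fewer than min_events events are skipped.
    """
    by_host = defaultdict(list)
    for host, t in events:
        by_host[host].append(t)

    scores = {}
    for host, times in by_host.items():
        if len(times) < min_events:
            continue
        times.sort()
        gaps = [later - earlier for earlier, later in zip(times, times[1:])]
        avg = mean(gaps)
        if avg > 0:
            scores[host] = pstdev(gaps) / avg
    return scores

# Made-up traffic: one host on a strict 60-second timer, one irregular host.
events = [("10.0.0.5", 60 * i) for i in range(10)]
events += [("10.0.0.9", t) for t in (0, 1, 2, 100, 101, 500)]
scores = interarrival_cv(events)
print(scores)
```

When adapting this to the notebook, the choice of host column matters: the suspicious rows are responses, so `src_ip` holds the responding server, while the querying client is in `dst_ip`.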

---
## Conclusion: What We Learned

In this post, we walked through a real-world example of how to detect DNS exfiltration using Python and Pandas for traffic analytics. We explored a combination of:

### Static Features
- *RCODE patterns*, especially NoError responses with no answers
- *Query length distribution*, revealing data being encoded in subdomain labels
- *Subdomain cardinality* as a proxy for data encoding activity
- *Top queried domains* to isolate suspicious focal points in traffic

### Temporal Features
- *Time series visualizations* to detect query bursts and irregularities
- *Interarrival times* to uncover signs of automated beaconing

Together, these indicators revealed patterns consistent with stealthy data exfiltration: consistent traffic to obscure domains, unusually high subdomain counts, and tightly timed queries with minimal response data.

## What’s Next?
In our next post, we’ll take these handcrafted features and explore how to:
- Train machine learning models to detect DNS exfiltration automatically
- Use both supervised and unsupervised approaches
- Apply anomaly detection techniques to real PCAP datasets

Stay tuned. In the meantime, you can grab the source code and notebook from the project's GitHub page to try it out on your own traffic captures.

---
## Dataset Reference

Lashkari, A.H., Gil, G., Mamun, M.S.I., & Ghorbani, A.A. (2021). CIC-Bell DNS EXF Dataset [Data set]. Canadian Institute for Cybersecurity.
Available at: https://www.unb.ca/cic/datasets/dns-exf.html  
 
Samaneh Mahdavifar, Amgad Hanafy Salem, Princy Victor, Miguel Garzon, Amir H. Razavi, Natasha Hellberg, Arash Habibi Lashkari, "Lightweight Hybrid Detection of Data Exfiltration using DNS based on Machine Learning", The 11th IEEE International Conference on Communication and Network Security (ICCNS), Dec. 3–5, 2021, Beijing Jiaotong University, Weihai, China.