# C2 over DNS

## How to detect C2 channels over DNS?

Using DNS as a means of communication for a C2 channel has a high chance of success because DNS is one of the few protocols that is almost always allowed in outbound connections.  

Hunters can investigate C2 communications over DNS by collecting the queries made by internal endpoints for external resources over a period of 12 to 24 hours, then looking for high-entropy FQDNs and for domains with a huge number of subdomains.  

**But, why is that?**

A C2 channel over DNS relies on agents beaconing for instructions at a fixed time interval (attackers may add jitter to mimic user activity).
The requests are encoded or encrypted and sent as subdomains of an attacker-controlled domain, just as the answers from the C2 server are encoded or encrypted.

Over a 12 to 24 hour packet capture, the total number of subdomains per domain simply sticks out, because very few companies in the world have hundreds of subdomains (Microsoft, Amazon, Google, Akamai) and hardly any has more than 1000.  
Also, encoding and encryption produce high-entropy strings (entropy measures the amount of randomness in a string).
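
To make this concrete, here is a minimal sketch of how an agent might pack data into DNS labels. The domain `c2.example`, the helper `to_fqdn` and the choice of Base32 are assumptions made for illustration only; real tools use their own encodings.

```python
import base64

MAX_LABEL = 63  # a single DNS label cannot exceed 63 bytes (RFC 1035)

def to_fqdn(payload: bytes, domain: str = "c2.example") -> str:
    """Hypothetical helper: encode a payload as subdomain labels of `domain`."""
    # Base32 only produces letters and digits, which are valid in hostnames
    encoded = base64.b32encode(payload).decode("ascii").rstrip("=").lower()
    # Split the encoded text into chunks that fit in one label each
    labels = [encoded[i:i + MAX_LABEL] for i in range(0, len(encoded), MAX_LABEL)]
    return ".".join(labels + [domain])

# Every distinct message yields a distinct FQDN under the same domain
messages = [f"host=ws01;seq={n}".encode() for n in range(5)]
fqdns = {to_fqdn(m) for m in messages}
for fqdn in sorted(fqdns):
    print(fqdn)
```

Five beacons already produce five unique, random-looking subdomains, which is why both the subdomain counter and the entropy score grow quickly for this kind of traffic.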

## Let's import some data

This Jupyter Notebook takes Zeek DNS logs and parses them to extract useful information.  
If you only have a PCAP file, you can install a Zeek container from [ActiveCountermeasures' GitHub](https://github.com/activecm/docker-zeek/) and use the command `zeek readpcap <absolute path of the source file> <absolute path of the destination folder>` to generate the logs from it.  

From the destination folder upload the file `dns.log`.

In [1]:
import ipywidgets as widgets
from IPython.display import display

# Upload a dns.log file
button = widgets.FileUpload(accept=".log", multiple=False)
display(button)

FileUpload(value=(), accept='.log', description='Upload')

## Ingestion
All right, now it is time to parse the log file and extract what we need: the DNS queries.  
First of all, the content of the uploaded file is a [memoryview](https://docs.python.org/3/library/stdtypes.html#memory-views) so, in order to access its content, we need to decode the stream of bytes and get the *text*.

### We are not done yet
Since strings are strings, there is a little more work to do before we can actually do something.  
What we have now is just a wall of text, so we need to split it at each *newline* character (`\n`), which gives us a list of *rows*.  

After that, each row has to be split at *tab* characters (`\t`) to get a list of values per row; this happens in the parsing loop further down.

In [2]:
# Get content of the file
import codecs
log: str = codecs.decode(button.value[0].content, encoding="utf-8")
# Split lines at newline character
rows: list = log.split("\n")
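
The parsing loop further down reads the query from a fixed column (index 9). As a hedged alternative, the column position can be looked up from the `#fields` header line that Zeek writes at the top of its tab-separated logs. `field_index` is a hypothetical helper and the sample rows are hand-written, not real traffic.

```python
def field_index(rows: list, name: str) -> int:
    """Return the column index of `name` using the Zeek `#fields` header line."""
    for row in rows:
        if row.startswith("#fields"):
            # The first token is the literal "#fields", column names follow
            return row.split("\t")[1:].index(name)
    raise ValueError("no #fields header found")

# Tiny hand-written sample with only a few columns
sample_rows = [
    "#separator \\x09",
    "#fields\tts\tuid\tid.orig_h\tquery",
    "1700000000.0\tCabc123\t10.0.0.5\twww.example.com",
]
query_idx = field_index(sample_rows, "query")
print(sample_rows[2].split("\t")[query_idx])
```

Looking the index up this way keeps the notebook working if the log was produced with a different set of columns.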

### ! Shannon Entropy !

Hold on! Before we continue there is one last but important thing to do... define a function to calculate the [Shannon Entropy](https://en.wikipedia.org/wiki/Entropy_(information_theory)).  
As stated earlier, strings made of random characters produce a high level of entropy, which we can leverage to filter legit traffic out of our analysis and spot faster what needs more attention.

In [3]:
import math
from collections import Counter

def shannon_entropy(data):
    if not data:
        return 0
    entropy = 0
    length = len(data)
    counts = Counter(data)
    for count in counts.values():
        p_x = count / length
        entropy += -p_x * math.log(p_x, 2)
    return entropy
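
A quick sanity check helps to build intuition for the scores. The sketch below repeats the same calculation in compact form so that it runs on its own; the labels are illustrative and not taken from real traffic.

```python
import math
from collections import Counter

def shannon_entropy(data: str) -> float:
    # Same calculation as the cell above, repeated so this snippet is self-contained
    if not data:
        return 0.0
    length = len(data)
    return sum(-(c / length) * math.log2(c / length) for c in Counter(data).values())

for label in ["www", "mail", "nbswy3dpeb3w64tmmq"]:
    # log2(len) is the ceiling, reached only when every character is different
    ceiling = math.log2(len(label))
    print(f"{label!r}: entropy={shannon_entropy(label):.2f}, max={ceiling:.2f}")
```

Because the score can never exceed `log2(len(label))`, a label needs at least 16 distinct characters to reach 4.0, so short random labels will not stand out on entropy alone.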


### Safelisting

Remember to safelist legit domains to remove them from the final output.  
Safelisting is useful to reduce the noise from non-malicious traffic and should be performed each time a new log file is analyzed.  

Each run takes less work than the previous one because the safelist already contains many entries, and the analysis gets faster as the noise shrinks.
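
One possible way to carry the safelist from one analysis to the next is to keep it in a plain text file. This is only a sketch: the file name `safelist.txt`, its location in the temporary directory and the helpers `load_safelist` and `save_safelist` are assumptions, not part of the notebook.

```python
import tempfile
from pathlib import Path

def load_safelist(path: Path) -> set:
    """Read one domain per line, ignoring blank lines and # comments."""
    if not path.exists():
        return set()
    lines = path.read_text(encoding="utf-8").splitlines()
    return {l.strip().lower() for l in lines if l.strip() and not l.lstrip().startswith("#")}

def save_safelist(path: Path, domains: set) -> None:
    """Write the safelist back, sorted, one domain per line."""
    path.write_text("\n".join(sorted(domains)) + "\n", encoding="utf-8")

# Hypothetical location, chosen here only so the sketch runs anywhere
safelist_path = Path(tempfile.gettempdir()) / "safelist.txt"
persisted = load_safelist(safelist_path)
persisted |= {"windowsupdate.com", "microsoft.com"}  # domains vetted in this run
save_safelist(safelist_path, persisted)
print(sorted(load_safelist(safelist_path)))
```

The loaded set can then be used in place of the hard-coded list, since membership tests with `in` work the same way on a set.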

In [4]:
queries: dict = {}
# List of domain names to ignore
safelist = ["_tcp.local", "_udp.local", "windowsupdate.com", "msedge.net", "googlevideo.com", "youtube.com", "microsoft.com", "ubuntu.com", "msn.com"]
for row in rows:
    # Skip Zeek metadata lines (header and footer), which start with a # symbol
    if not row.startswith("#"):
        splitted_line: list = row.split("\t")
        
        # Skip unknown entries, too short to be DNS entries 
        if len(splitted_line) < 24:
            continue
        
        # Extract queries and domain names
        query: str = splitted_line[9]
        
        domain: str = ".".join(query.split(".")[-2:])
        
        # Skip safelisted domains and queries
        if domain in safelist or query in safelist:
            continue
        
        domain_entropy: float = shannon_entropy(domain)
        
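        # Everything left of the last two labels; joining labels avoids
        # str.replace() also removing the domain text if it appears inside the query
        subdomain: str = ".".join(query.split(".")[:-2])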
        if domain not in queries.keys():
            queries[domain] = {}
            queries[domain]["count"] = 0
            queries[domain]["subdomains"] = {}
            
        if subdomain not in queries[domain]["subdomains"].keys():
            queries[domain]["subdomains"][subdomain] = {}
            queries[domain]["subdomains"][subdomain]["count"] = 0
            queries[domain]["subdomains"][subdomain]["entropy"] = shannon_entropy(subdomain)

        queries[domain]["count"] += 1
        queries[domain]["subdomains"][subdomain]["count"] += 1

## Visualization

Now that the cells above have done their job, we have a dictionary with the following structure:

```
Domain: dict
| - "count": int
| - "subdomains": dict
| - - subdomain: dict
| - - - "count": int
| - - - "entropy": float
```

Each key is a domain name whose value is a dictionary with two keys, **count** and **subdomains**.  
The former is an integer counter, the latter is a dictionary whose keys are the subdomains of that domain.  
Each subdomain has two keys: **count**, the number of times it was queried, and **entropy**, the Shannon Entropy of the subdomain.  
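
As an illustration, a hand-written instance of this layout could look like the sketch below. The values are made up for the example and do not come from a real log.

```python
# Hand-written example of the structure (made-up values, not real traffic)
queries = {
    "example.com": {
        "count": 3,
        "subdomains": {
            "www": {"count": 2, "entropy": 0.0},
            "mail": {"count": 1, "entropy": 2.0},
        },
    },
}

for domain, info in queries.items():
    unique = len(info["subdomains"])
    # Subdomain with the highest entropy score for this domain
    top = max(info["subdomains"].items(), key=lambda kv: kv[1]["entropy"])
    print(f"{domain}: {info['count']} lookups, {unique} unique subdomains, "
          f"highest entropy: {top[0]} ({top[1]['entropy']})")
```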

Knowing the dictionary layout, we can parse it and produce data for [pandas](https://pandas.pydata.org/docs/index.html).   

### Domains ordered by unique subdomains counter

The purpose of this DataFrame is to determine which domains have the highest number of unique subdomains.  
Domains with hundreds, or even thousands, of subdomains need further investigation because, unless they belong to a well-known provider (Google, Akamai, Microsoft, Amazon...), they may be related to a C2 channel over DNS.  

#### DataFrame layout

|Domain|Unique Subdomains|Times looked up|
|-|-|-|
|Domain name|Total number of unique subdomains|Total number of queries in which the domain has been found|

In [5]:
import pandas as pd
data: list = [{"Domain": x[0], "Unique Subdomains": len(x[1]["subdomains"].keys()), "Times looked up": x[1]['count']} for x in queries.items()]
df = pd.DataFrame(data)

In [None]:
df.sort_values(by=["Unique Subdomains", "Times looked up"], ascending=False)

### Subdomains ordered by entropy score

The purpose of this DataFrame is to determine what subdomains have the highest entropy score.  
Entropy measures the amount of randomness in a string, and a high value may be an indication of encoding or encryption.


#### DataFrame layout

|Domain|Subdomain|Times looked up|Entropy|
|-|-|-|-|
|Domain name|Subdomain name|Total number of queries in which the subdomain has been found|Entropy score|

In [7]:
subdomain_data: list = []
for domain in queries.items():
    for subdomain in domain[1]["subdomains"].items():
        subdomain_data.append({"Domain": domain[0],
                               "Subdomain": subdomain[0],
                               "Times looked up": subdomain[1]['count'],
                               "Entropy": subdomain[1]['entropy']})
        
df_subdomains = pd.DataFrame(subdomain_data)

In [None]:
#pd.options.display.max_rows = 100
#pd.options.display.max_colwidth = 200
df_subdomains.sort_values(by=["Entropy"], ascending=False)

## Found something?

- Are there domains with hundreds or thousands of subdomains? => **Investigate further**
- Are there subdomains with high entropy score (4.0+)? => **Investigate further**
- Are there domains with few subdomains but have a *strange/uncommon* name? => **Listen to your intuition and investigate further**
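
The first two checks can be turned into filters on the DataFrames. The sketch below uses small hand-made frames with the same columns so that it runs on its own; all values are invented, and `MIN_SUBDOMAINS = 100` is an assumed cut-off for "hundreds of subdomains" that should be tuned to your environment.

```python
import pandas as pd

# Hand-made rows with the same columns as the DataFrames above (invented values)
df = pd.DataFrame([
    {"Domain": "example.com", "Unique Subdomains": 3, "Times looked up": 40},
    {"Domain": "c2.example", "Unique Subdomains": 850, "Times looked up": 900},
])
df_subdomains = pd.DataFrame([
    {"Domain": "example.com", "Subdomain": "www", "Times looked up": 30, "Entropy": 0.0},
    {"Domain": "c2.example", "Subdomain": "nbswy3dpeb3w64tmmqqhi2dj", "Times looked up": 1, "Entropy": 4.1},
])

MIN_SUBDOMAINS = 100  # assumed cut-off, not a value from the notebook
MIN_ENTROPY = 4.0     # threshold from the checklist

many_subdomains = df[df["Unique Subdomains"] >= MIN_SUBDOMAINS]
high_entropy = df_subdomains[df_subdomains["Entropy"] >= MIN_ENTROPY]

# Domains flagged by either check deserve a closer look first
suspects = sorted(set(many_subdomains["Domain"]) | set(high_entropy["Domain"]))
print(suspects)
```

The third check, strange or uncommon names, still needs a human eye; sorting the tables and reading through the top rows remains part of the hunt.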