# ISOT HTTP Botnet Dataset

The collection comprises two datasets: a botnet dataset consisting of malicious DNS traffic
generated by different botnets, and a benign dataset consisting of DNS traffic generated by
well-known software applications.

### Malicious Botnet DNS Traffic
The botnet dataset contains full DNS packets from 9 exploit kits, collected in a virtual
environment. One of the bots (Atom)\* did not generate traffic, apparently because
it detects virtual machines. Each bot was deployed in a Windows XP virtual machine that ran for
several days. The virtual environment was fully monitored, from the DNS server to the router. We
configured and deployed an authoritative DNS server in the lab, **ns1.botnet.isot**, with IP
**192.168.50.88**. We configured each bot to communicate with a dedicated command and
control (C&C) server that we set up. For example, for the Zeus botnet, we set up a C&C server
named zeus.botnet.isot.

List of botnets and their VMs, where the following naming convention is used
{**Exploit_kit_name**}.botnet.isot:
```
192.168.50.14 zyklon.botnet.isot
192.168.50.15 blue.botnet.isot
192.168.50.16 liphyra.botnet.isot
192.168.50.17 gaudox.botnet.isot gdox.botnet.isot dox.botnet.isot **
192.168.50.18 blackout.botnet.isot
192.168.50.30 citadel.botnet.isot ***
192.168.50.31 citadel.botnet.isot ***
192.168.50.32 be.botnet.isot black energy
192.168.50.34 zeus.botnet.isot
```

**Notes:**

\* We also deployed the Atom exploit kit, the third generation of Zeus, but saw no traffic
from it. We suspect that it detects when it is running in a virtual machine.

** dox, gdox, and gaudox all refer to the same exploit kit, “Gaudox”. We wrote three zone
files (dox, gdox, gaudox) because at least three C&C channels were required to deploy the
botnet. We used slightly different names so that we could track how communication flows
between the channels.

*** The Citadel bot was installed on two different machines to check that it shows the same
behaviour on both.
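
The host list above follows a hosts-file layout (an IP followed by one or more names), so it can be turned into a lookup table. The snippet below is a minimal sketch of that idea; `parse_hosts` and the `HOSTS` string are hypothetical names introduced for illustration, and only a few entries from the list are reproduced.

```python
# Hypothetical helper: parse "IP hostname [hostname ...]" lines into a dict.
HOSTS = """\
192.168.50.17 gaudox.botnet.isot gdox.botnet.isot dox.botnet.isot
192.168.50.30 citadel.botnet.isot
192.168.50.31 citadel.botnet.isot
192.168.50.34 zeus.botnet.isot
"""

def parse_hosts(text):
    """Map each IP to the list of C&C hostnames assigned to it."""
    mapping = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        ip, *names = fields
        # Keep only tokens that follow the {name}.botnet.isot convention
        mapping[ip] = [n for n in names if n.endswith(".botnet.isot")]
    return mapping

bot_hosts = parse_hosts(HOSTS)
print(bot_hosts["192.168.50.17"])
```

Filtering on the `.botnet.isot` suffix drops free-text annotations and footnote markers that share the line with the hostnames.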

### Application DNS Dataset
The ISOT application dataset was collected from individual known (benign/normal) applications
to profile their DNS behaviour. This allowed us to passively classify DNS traffic and
differentiate malicious traffic from normal traffic. The data was collected in a virtual
environment. Each software application was installed on its own virtual machine
running Windows 7. The DNS resolver of each machine was pointed to our DNS server
**192.168.50.88**. The collected data is considered normal traffic since it comes from known
applications.
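
Because every VM resolves through **192.168.50.88**, client DNS queries can be isolated by filtering on that destination address and the standard DNS port 53. The following is a minimal sketch on a toy table; `client_dns_queries`, `RESOLVER_IP` and `DNS_PORT` are hypothetical names, and the column names assume the edge layout used in the processing code of this notebook.

```python
import pandas as pd

RESOLVER_IP = "192.168.50.88"  # lab DNS server named in the text
DNS_PORT = 53                  # standard DNS port

# Toy rows, illustrative only
sample = pd.DataFrame(
    [
        ["192.168.50.51", "192.168.50.88", 53],    # client -> lab resolver
        ["192.168.50.88", "8.8.4.4", 53],          # resolver -> upstream DNS
        ["192.168.50.50", "192.168.50.255", 137],  # NetBIOS broadcast
    ],
    columns=["ip_source", "ip_destination", "port_destination"],
)

def client_dns_queries(edges):
    """Keep packets sent by clients to the lab resolver on the DNS port."""
    to_resolver = edges["ip_destination"] == RESOLVER_IP
    on_dns_port = edges["port_destination"] == DNS_PORT
    return edges[to_resolver & on_dns_port]

queries = client_dns_queries(sample)
print(queries)
```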

List of applications and their VMs:
```
192.168.50.19 dropbox.com Dropbox
192.168.50.50 Avast
192.168.50.51 Adobe Reader
192.168.50.52 Adobe Software Suite
192.168.50.54 Chrome
192.168.50.55 Firefox
192.168.50.56 Malwarebyte
192.168.50.57 WPS office
192.168.50.58 Windows update
192.168.50.59 utorrent.com bittorrent.com
192.168.50.60 fosshub.com audacity
192.168.50.61 Bytefence-com
192.168.50.63 Thunderbird Mozila
192.168.50.64 Avast
192.168.50.65 Skype
192.168.50.66 Facebook massager
192.168.50.67 CCleaner
192.168.50.68 Win update
192.168.50.69 Hitmanpro.com
-> background data from windows
> time.windows.com
> time.microsoft.akadns.net
> dns.msftncsi.com
```

#### Timeline
The data was collected over the following period:
- Start: 2017-06-14
- End: 2017-06-21 18:31

See:
- https://www.uvic.ca/ecs/ece/isot/assets/docs/ISOT%20HTTP%20Botnet%20Dataset.pdf

In [None]:
#import sys
#!{sys.executable} -m pip install pandas pyshark

In [None]:
import glob
import nest_asyncio
nest_asyncio.apply()
import numpy as np
import re
from tqdm import tqdm

import pandas as pd
import pyshark
from pyshark.capture.capture import TSharkCrashException

In [None]:
EDGES_COLUMNS = ["ip_source", "ip_destination", "port_source", "port_destination", "length"]

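def extract_packet_info(pcap_file):
    """Extract source/destination IP, UDP ports and length from each packet of a pcap file.

    Packets without an IP layer or a UDP layer (e.g. ARP or TCP) are skipped.
    """
    cap = pyshark.FileCapture(pcap_file)
    packet_info = []
    i = 0  # defined up front so the error message works even if tshark crashes before the first packet
    try:
        for i, packet in tqdm(enumerate(cap)):
            try:
                try:
                    # IPv4
                    ip_src = packet.ip.src
                    ip_dst = packet.ip.dst
                except AttributeError:
                    # No IPv4 layer: fall back to IPv6
                    ip_src = packet.ipv6.src
                    ip_dst = packet.ipv6.dst

                port_src = packet.udp.srcport
                port_dst = packet.udp.dstport
                length = packet.length

                packet_info.append([ip_src, ip_dst, port_src, port_dst, length])
            except AttributeError:
                # No IP layer or no UDP layer: skip this packet
                continue
    except TSharkCrashException as e:
        print(f"TSharkCrashException on file {pcap_file}, packet {i}: {e}")
    return packet_info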

def extract_packet_info_from_multiple_pcap(dir_path):
    """Extract ip/port source and destination from packets of multiple pcap files"""
    packet_info = []
    for pcap_file in glob.glob(dir_path):
        packet_info.extend(extract_packet_info(pcap_file))
    packet_info = pd.DataFrame(packet_info, columns=EDGES_COLUMNS)
    return packet_info
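
The per-packet logic can be checked without a pcap file or tshark by isolating it in a small function and feeding it stand-in objects. This is a sketch only: `packet_to_row` is a hypothetical helper, not part of pyshark, and it assumes that a missing layer shows up as a missing attribute on the packet object.

```python
from types import SimpleNamespace

def packet_to_row(packet):
    """Return [ip_src, ip_dst, port_src, port_dst, length], or None if not IP/UDP.

    Hypothetical helper: works on any object exposing the attribute names
    read by the extraction loop (ip or ipv6, udp, length).
    """
    # Prefer IPv4, fall back to IPv6
    ip_layer = getattr(packet, "ip", None)
    if ip_layer is None:
        ip_layer = getattr(packet, "ipv6", None)
    udp_layer = getattr(packet, "udp", None)
    if ip_layer is None or udp_layer is None:
        return None  # e.g. ARP (no IP layer) or TCP (no UDP layer)
    return [ip_layer.src, ip_layer.dst, udp_layer.srcport, udp_layer.dstport, packet.length]

# Stand-in packets built by hand; field values are strings, as in the extracted rows
udp_v4 = SimpleNamespace(
    ip=SimpleNamespace(src="192.168.50.51", dst="192.168.50.88"),
    udp=SimpleNamespace(srcport="61485", dstport="53"),
    length="84",
)
no_udp = SimpleNamespace(
    ip=SimpleNamespace(src="192.168.50.51", dst="192.168.50.88"),
    length="60",
)

rows = [row for row in map(packet_to_row, [udp_v4, no_udp]) if row is not None]
print(rows)
```

The second stand-in is dropped, which mirrors how the extraction above keeps only UDP traffic.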

## I. Read raw data of malicious traffic

In [None]:
# preprocessing pcap
dir_path = r'/Users/martin/Downloads/isot_app_and_botnet_dataset 2/botnet_data/*.pcap'
packet_info = extract_packet_info_from_multiple_pcap(dir_path)
packet_info.to_csv('isot_edges_malicious_traffic.csv', index=False)

In [None]:
df_bot = pd.read_csv('isot_edges_malicious_traffic.csv')
df_bot['malicious'] = True
df_bot

Unnamed: 0,ip_source,ip_destination,port_source,port_destination,length,malicious
0,fe80::891f:ff8f:8660:beff,ff02::c,58618,1900,181,True
1,192.168.50.11,239.255.255.250,58620,1900,167,True
2,fe80::891f:ff8f:8660:beff,ff02::c,58618,1900,179,True
3,192.168.50.11,239.255.255.250,58620,1900,165,True
4,fe80::3403:4993:3b2c:2ae9,ff02::c,55152,1900,181,True
...,...,...,...,...,...,...
3182863,192.168.50.50,192.168.50.255,137,137,92,True
3182864,192.168.50.88,8.8.4.4,60657,53,97,True
3182865,192.168.50.51,192.168.50.88,61485,53,84,True
3182866,192.168.50.50,192.168.50.255,137,137,92,True


## II. Read raw data of normal traffic

In [None]:
# preprocessing pcap
dir_path = r'/Users/martin/Downloads/isot_app_and_botnet_dataset 2/application_data/*.pcap'
packet_info = extract_packet_info_from_multiple_pcap(dir_path)
packet_info.to_csv('isot_edges_normal_traffic.csv', index=False)

In [None]:
df_normal = pd.read_csv('isot_edges_normal_traffic.csv')
df_normal['malicious'] = False
df_normal

Unnamed: 0,ip_source,ip_destination,port_source,port_destination,length,malicious
0,fe80::e14d:8fc5:b840:50b8,ff02::1:2,546,547,148,False
1,fe80::e14d:8fc5:b840:50b8,ff02::1:3,62730,5355,86,False
2,192.168.50.59,224.0.0.252,51017,5355,66,False
3,fe80::e14d:8fc5:b840:50b8,ff02::1:3,62730,5355,86,False
4,192.168.50.59,224.0.0.252,51017,5355,66,False
...,...,...,...,...,...,...
419242,192.168.50.51,192.168.50.255,137,137,92,False
419243,192.168.50.19,192.168.50.88,62476,53,81,False
419244,192.168.50.88,8.8.4.4,55706,53,92,False
419245,192.168.50.51,192.168.50.255,137,137,92,False


## III. Merge these two datasets in one

In [None]:
df = pd.concat([df_bot, df_normal], ignore_index=True)  # rebuild the index so row labels stay unique
df['malicious'].value_counts()

True     3182868
False     419247
Name: malicious, dtype: int64
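
Note that the `malicious` flag is assigned per capture directory, not per host: the tail of `df_bot` above contains rows from 192.168.50.50 and 192.168.50.51, which the application list assigns to Avast and Adobe Reader. The sketch below shows one possible sanity check that flags rows labelled malicious whose endpoints are not among the listed bot VMs. `BOT_IPS` and `involves_bot` are hypothetical names, and whether such rows should be relabelled is left open.

```python
import pandas as pd

# Bot VM addresses copied from the host list in the dataset description
BOT_IPS = {
    "192.168.50.14", "192.168.50.15", "192.168.50.16", "192.168.50.17",
    "192.168.50.18", "192.168.50.30", "192.168.50.31", "192.168.50.32",
    "192.168.50.34",
}

def involves_bot(edges):
    """Boolean Series: True where source or destination is a listed bot VM."""
    return edges["ip_source"].isin(BOT_IPS) | edges["ip_destination"].isin(BOT_IPS)

# Toy frame: both rows are labelled malicious because of the capture they came from
toy = pd.DataFrame(
    {
        "ip_source": ["192.168.50.34", "192.168.50.50"],
        "ip_destination": ["192.168.50.88", "192.168.50.255"],
        "malicious": [True, True],
    }
)
toy["involves_bot"] = involves_bot(toy)

# Rows labelled malicious that do not touch any listed bot VM
suspect = toy[toy["malicious"] & ~toy["involves_bot"]]
print(suspect)
```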

## IV. Aggregate
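Connections are aggregated to reduce the number of edges: packets that share the same source IP, destination IP, destination port and label are merged into a single edge carrying the total length and packet count. Creating one edge per transmitted packet would produce nodes with a very large number of edges (super nodes) and make the graph very difficult to read.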

In [None]:
df['packets'] = 1  # one row per packet, so summing this column counts packets
df_agg = (
    df.groupby(["ip_source", "ip_destination", "port_destination", "malicious"], dropna=False)[["length", "packets"]]
    .sum()
)
# Vectorised division instead of a row-wise apply
df_agg['mean_length_by_packet'] = (df_agg['length'] / df_agg['packets']).round(2)
df_agg.reset_index(inplace=True)
df_agg

Unnamed: 0,ip_source,ip_destination,port_destination,malicious,length,packets,mean_length_by_packet
0,192.168.50.1,224.0.0.251,5353,False,14529,167,87.00
1,192.168.50.1,224.0.0.251,5353,True,228660,1325,172.57
2,192.168.50.101,192.168.50.88,53,True,32968,439,75.10
3,192.168.50.102,192.168.50.88,53,True,33258,438,75.93
4,192.168.50.103,192.168.50.88,53,True,33384,444,75.19
...,...,...,...,...,...,...,...
228207,fe80::f566:6465:d2bd:8834,ff02::1:2,547,False,1474200,9828,150.00
228208,fe80::f566:6465:d2bd:8834,ff02::1:3,5355,False,3096,36,86.00
228209,fe80::f566:6465:d2bd:8834,ff02::c,3702,False,8400,8,1050.00
228210,fe80::f956:19dc:8e16:e81b,ff02::1:2,547,True,11906790,76818,155.00
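
To make the aggregation concrete, here is a small worked example on made-up packets (the 10.0.0.x addresses and lengths are illustrative and not taken from the dataset). Three packets on one edge collapse into a single row with the summed length, the packet count and the mean length per packet.

```python
import pandas as pd

# Toy packets: three on one edge, one on another
packets = pd.DataFrame(
    {
        "ip_source": ["10.0.0.1", "10.0.0.1", "10.0.0.1", "10.0.0.2"],
        "ip_destination": ["10.0.0.9"] * 4,
        "port_destination": [53, 53, 53, 53],
        "malicious": [True, True, True, False],
        "length": [80, 90, 100, 70],
    }
)
packets["packets"] = 1  # each row counts as one packet

keys = ["ip_source", "ip_destination", "port_destination", "malicious"]
edges = packets.groupby(keys, dropna=False)[["length", "packets"]].sum().reset_index()
edges["mean_length_by_packet"] = (edges["length"] / edges["packets"]).round(2)
print(edges)
# The 10.0.0.1 edge ends up with length 270, packets 3, mean 90.0
```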


## V. Save result as a new csv

**Save edges**

In [None]:
df_agg.to_csv('isot_edges_unique_with_dest_port.csv', index=False)
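
Writing the CSV without the row index matters when the file is read back: if the index is written, `pd.read_csv` returns it as an extra column named `Unnamed: 0`. The sketch below illustrates this with in-memory buffers instead of files; the one-row frame is a toy example.

```python
import io
import pandas as pd

edges = pd.DataFrame({"ip_source": ["192.168.50.34"], "port_destination": [53]})

# Default behaviour (index=True) writes the row index as an unnamed first column
text_with_index = edges.to_csv()
# index=False writes only the data columns
text_without_index = edges.to_csv(index=False)

cols_with = list(pd.read_csv(io.StringIO(text_with_index)).columns)
cols_without = list(pd.read_csv(io.StringIO(text_without_index)).columns)
print(cols_with)     # ['Unnamed: 0', 'ip_source', 'port_destination']
print(cols_without)  # ['ip_source', 'port_destination']
```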

**Save nodes**

In [None]:
info = pd.DataFrame([
    ['192.168.50.14', 'zyklon.botnet.isot', True],
    ['192.168.50.15', 'blue.botnet.isot', True],
    ['192.168.50.16', 'liphyra.botnet.isot', True],
    ['192.168.50.17', 'gaudox.botnet.isot gdox.botnet.isot dox.botnet.isot **', True],
    ['192.168.50.18', 'blackout.botnet.isot', True],
    ['192.168.50.30', 'citadel.botnet.isot ***', True],
    ['192.168.50.31', 'citadel.botnet.isot ***', True],
    ['192.168.50.32', 'be.botnet.isot black energy', True],
    ['192.168.50.34', 'zeus.botnet.isot', True],
    ['192.168.50.19', 'dropbox.com Dropbox', False],
    ['192.168.50.50', 'Avast', False],
    ['192.168.50.51', 'Adobe Reader', False],
    ['192.168.50.52', 'Adobe Software Suite', False],
    ['192.168.50.54', 'Chrome', False],
    ['192.168.50.55', 'Firefox', False],
    ['192.168.50.56', 'Malwarebyte', False],
    ['192.168.50.57', 'WPS office', False],
    ['192.168.50.58', 'Windows update', False],
    ['192.168.50.59', 'utorrent.com bittorrent.com', False],
    ['192.168.50.60', 'fosshub.com audacity', False],
    ['192.168.50.61', 'Bytefence-com', False],
    ['192.168.50.63', 'Thunderbird Mozila', False],
    ['192.168.50.64', 'Avast', False],
    ['192.168.50.65', 'Skype', False],
    ['192.168.50.66', 'Facebook massager', False],
    ['192.168.50.67', 'CCleaner', False],
    ['192.168.50.68', 'Win update', False],
    ['192.168.50.69', 'Hitmanpro.com', False]
], columns=['ip', 'label', 'malicious'])
info.to_csv('isot_nodes.csv', index=False)
info

Unnamed: 0,ip,label,malicious
0,192.168.50.14,zyklon.botnet.isot,True
1,192.168.50.15,blue.botnet.isot,True
2,192.168.50.16,liphyra.botnet.isot,True
3,192.168.50.17,gaudox.botnet.isot gdox.botnet.isot dox.botnet...,True
4,192.168.50.18,blackout.botnet.isot,True
5,192.168.50.30,citadel.botnet.isot ***,True
6,192.168.50.31,citadel.botnet.isot ***,True
7,192.168.50.32,be.botnet.isot black energy,True
8,192.168.50.34,zeus.botnet.isot,True
9,192.168.50.19,dropbox.com Dropbox,False
