In [1]:
from math import floor
import math
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import OneClassSVM
from sklearn.metrics import confusion_matrix, classification_report, make_scorer
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_predict
import numpy as np
import common # type: ignore
from sklearn.feature_selection import VarianceThreshold
from mlxtend.feature_selection import SequentialFeatureSelector
from IPython.display import display, Markdown, Latex, HTML
import matplotlib.pyplot as plt
from sqlalchemy import create_engine, text
import quantiphy as qq
import warnings
import enum

# SEC_PER_SLOT = 12 * 60 * 60
# EPS_TH = 0.5
DATASET = "CTU-13"
database = common.Database()
# dataset = common.Dataset()
# slot = common.Slot(database, SEC_PER_SLOT, EPS_TH, DATASET)

DF = pd.read_sql(f"""
    SELECT
        PCAP.*,
        MW.DGA
    FROM PCAP 
    JOIN MALWARE AS MW
    ON MW.ID = PCAP.MALWARE_ID
    WHERE PCAP.DATASET = '{DATASET}'
""", database.engine)

DF["dga"] = DF["dga"].replace([0,2], ["not-infected", "infected"])


In [36]:
import os
import importlib
import latex
importlib.reload(latex)

from latex import dm, pp, ptime, Cites, AC
from latex import Tables, Table, Figure, Figures
from latex import is_latex, set_latex, unset_latex

In [3]:

dm(
f"""

# Dataset

## Data set requirements

Network data sets for malware detection based on {AC.DNS} are very limited. As
noted in {Cites.CTU_SME_11}, not all malware behaves in the same way, and the
choice of malware used to infect a machine is of great importance during the
design phase of a data set.

For example, in the design of the {Cites.CTU_SME_11} network data set, the
malicious activity was chosen according to its capacity to generate network
traffic; malware that generated none was not included in the data set.

In addition, since our work is based on malware trying to establish a connection
with the {AC.DNS} server, we have two further requirements:

- the malware must establish a connection with the {AC.CC} server,
- it must use a {AC.DGA} algorithm.

Hence, these requirements reduce the number of data sets compatible with our
experiment.


""")



# Dataset

## Data set requirements

Network data sets for malware detection based on _DNS_ are very limited. As
noted in _[CTU-SME-11]_, not all malware behaves in the same way, and the
choice of malware used to infect a machine is of great importance during the
design phase of a data set.

For example, in the design of the _[CTU-SME-11]_ network data set, the
malicious activity was chosen according to its capacity to generate network
traffic; malware that generated none was not included in the data set.

In addition, since our work is based on malware trying to establish a connection
with the _DNS_ server, we have two further requirements:

- the malware must establish a connection with the _CC_ server,
- it must use a _DGA_ algorithm.

Hence, these requirements reduce the number of data sets compatible with our
experiment.




In [4]:

pp(f"""

## DNS Data set

The data set uses network traffic captures provided by the {AC.MCFP},
developed by {Cites.STSPH}, a repository of network traffic captured from
infected and not-infected machines.

The project provides hundreds of captures, divided into two groups: the so-called
_"normal"_ captures, here named {AC.NIC}, and the {AC.IC}.

For each capture, we have:

- A {AC.PCAP} file.
- The malware, if the capture is infected.
- Other files generated by network analysis tools like Argus.

Given this repository, we need to check whether each capture is compatible
with our purpose. The compatibility check consists of analysing the amount of
{AC.DNS} traffic in each capture:

- For a {AC.NIC}, we can only hope that the amount of traffic is as large as
  possible.
- For an {AC.IC}, we check whether the malware produces {AC.DGA} traffic.

Given these requirements, we finally built the DNS Data Set. For each capture
available in {AC.MCFP} we performed the following steps:

1. We generated a new "DNS-{AC.PCAP}" consisting only of the {AC.DNS} packets of
the original {AC.PCAP}.
2. We checked the number of {AC.DNS} packets and, if it was on the order of tens, we
discarded the capture.
3. We inserted each {AC.DNS} packet of the capture into a relational database.

The DNS Data Set also includes the non-{AC.DGA} captures; the decision
regarding their use will be made at a later point.


""")



## DNS Data set

The data set uses network traffic captures provided by the _MCFP_,
developed by _[STSPH]_, a repository of network traffic captured from
infected and not-infected machines.

The project provides hundreds of captures, divided into two groups: the so-called
_"normal"_ captures, here named _NIC_, and the _IC_.

For each capture, we have:

- A _PCAP_ file.
- The malware, if the capture is infected.
- Other files generated by network analysis tools like Argus.

Given this repository, we need to check whether each capture is compatible
with our purpose. The compatibility check consists of analysing the amount of
_DNS_ traffic in each capture:

- For a _NIC_, we can only hope that the amount of traffic is as large as
  possible.
- For an _IC_, we check whether the malware produces _DGA_ traffic.

Given these requirements, we finally built the DNS Data Set. For each capture
available in _MCFP_ we performed the following steps:

1. We generated a new "DNS-_PCAP_" consisting only of the _DNS_ packets of
the original _PCAP_.
2. We checked the number of _DNS_ packets and, if it was on the order of tens, we
discarded the capture.
3. We inserted each _DNS_ packet of the capture into a relational database.

The DNS Data Set also includes the non-_DGA_ captures; the decision
regarding their use will be made at a later point.
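
The three steps above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the pipeline actually used: packets are modelled as plain dictionaries, DNS traffic is recognised by port 53, the cutoff `MIN_DNS_PACKETS` is a hypothetical value (the text only says "on the order of tens"), and the single `packet` table is a simplified stand-in for the real database.

```python
import sqlite3

DNS_PORT = 53
MIN_DNS_PACKETS = 100  # hypothetical cutoff, not taken from the original work


def build_dns_capture(packets):
    """Step 1: keep only the packets that travel to or from the DNS port."""
    return [p for p in packets if DNS_PORT in (p["sport"], p["dport"])]


def is_usable(dns_packets, min_packets=MIN_DNS_PACKETS):
    """Step 2: a capture with too few DNS packets is discarded."""
    return len(dns_packets) >= min_packets


def store(conn, capture_name, dns_packets):
    """Step 3: insert every DNS packet of the capture into a relational table."""
    conn.executemany(
        "INSERT INTO packet (capture, ts, qname) VALUES (?, ?, ?)",
        [(capture_name, p["ts"], p["qname"]) for p in dns_packets],
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE packet (capture TEXT, ts REAL, qname TEXT)")

# Toy capture: 150 DNS queries mixed with 50 non-DNS packets.
packets = [
    {"ts": float(i), "sport": 40000, "dport": 53, "qname": f"host{i}.example"}
    for i in range(150)
]
packets += [
    {"ts": float(i), "sport": 40000, "dport": 443, "qname": None}
    for i in range(50)
]

dns_packets = build_dns_capture(packets)
if is_usable(dns_packets):
    store(conn, "toy-capture", dns_packets)

stored = conn.execute("SELECT COUNT(*) FROM packet").fetchone()[0]
print(stored)
```

In a real setting the filtering would be done on the PCAP file itself with a packet-analysis tool; the dictionaries here only keep the example self-contained.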




In [5]:
pp(f"""


### Database

As mentioned earlier, we opted for a relational database to store each captured
record. This choice promotes a well-organized and centralized data repository,
ensures a consistent procedure for adding new captures, and brings the benefits
of the SQL language.

The main tables of the database are:

- DN-table (Domain Name table), containing information about all unique domain
  names that have appeared among all PCAPs.
- Malware-table, storing information about the malware that has infected one
  or multiple captures.
- PCAP-table, storing information about captures. It is related to the
  Malware-table.
- Packet-table, storing information about DNS packets. Each packet is related to
  its parent PCAP and to the DN record the packet contains.

- ~NN-table (Neural Network table), indicating each LSTM neural network used to
  predict {AC.DGA} domain names.~
- ~DN-NN-table, a many-to-many relationship which relates each domain name of
DN-table to a neural network of NN-table, including the prediction value
$\\varepsilon_i = O_i(d_j)$ where $i$ indicates the NN record and $j$ the DN
record.~

Using this methodology, we avoid:

- Duplication of work: the prediction for each packet is performed just once
  for each neural network.
- Duplication of data: the information about the same domain name (and its
  predictions) is not duplicated each time it appears.

Further advantages are:

- Savings in data storage.
- Use of the SQL language.
- Better data management compared to a test bed made of *sparse* CSV files.
- Each capture inserted into the database follows the same data processing steps
  and fits into the database data structure.


""")




### Database

As mentioned earlier, we opted for a relational database to store each captured
record. This choice promotes a well-organized and centralized data repository,
ensures a consistent procedure for adding new captures, and brings the benefits
of the SQL language.

The main tables of the database are:

- DN-table (Domain Name table), containing information about all unique domain
  names that have appeared among all PCAPs.
- Malware-table, storing information about the malware that has infected one
  or multiple captures.
- PCAP-table, storing information about captures. It is related to the
  Malware-table.
- Packet-table, storing information about DNS packets. Each packet is related to
  its parent PCAP and to the DN record the packet contains.

- ~NN-table (Neural Network table), indicating each LSTM neural network used to
  predict _DGA_ domain names.~
- ~DN-NN-table, a many-to-many relationship which relates each domain name of
DN-table to a neural network of NN-table, including the prediction value
$\varepsilon_i = O_i(d_j)$ where $i$ indicates the NN record and $j$ the DN
record.~

Using this methodology, we avoid:

- Duplication of work: the prediction for each packet is performed just once
  for each neural network.
- Duplication of data: the information about the same domain name (and its
  predictions) is not duplicated each time it appears.

Further advantages are:

- Savings in data storage.
- Use of the SQL language.
- Better data management compared to a test bed made of *sparse* CSV files.
- Each capture inserted into the database follows the same data processing steps
  and fits into the database data structure.
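
To make the relationships between the main tables concrete, the sketch below builds a reduced version of the schema with SQLite. All column names and the helper `add_packet` are assumptions made for illustration; the real schema is not shown in this document. The point is only that a domain name is stored once in the DN table, however many packets contain it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE malware (id INTEGER PRIMARY KEY, name TEXT, dga INTEGER);
CREATE TABLE pcap (
    id INTEGER PRIMARY KEY,
    name TEXT,
    malware_id INTEGER REFERENCES malware(id)
);
CREATE TABLE dn (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE packet (
    id INTEGER PRIMARY KEY,
    pcap_id INTEGER NOT NULL REFERENCES pcap(id),
    dn_id INTEGER NOT NULL REFERENCES dn(id),
    ts REAL
);
""")


def add_packet(conn, pcap_id, qname, ts):
    """Insert a DNS packet, reusing the DN record if the name is already known."""
    conn.execute("INSERT OR IGNORE INTO dn (name) VALUES (?)", (qname,))
    (dn_id,) = conn.execute(
        "SELECT id FROM dn WHERE name = ?", (qname,)
    ).fetchone()
    conn.execute(
        "INSERT INTO packet (pcap_id, dn_id, ts) VALUES (?, ?, ?)",
        (pcap_id, dn_id, ts),
    )


# Toy data: one capture whose three packets contain two distinct domain names.
conn.execute("INSERT INTO malware (id, name, dga) VALUES (1, 'toy-malware', 2)")
conn.execute("INSERT INTO pcap (id, name, malware_id) VALUES (1, 'toy-capture', 1)")
for ts, qname in [(0.0, "a.example"), (1.0, "b.example"), (2.0, "a.example")]:
    add_packet(conn, 1, qname, ts)

n_dn = conn.execute("SELECT COUNT(*) FROM dn").fetchone()[0]
n_packets = conn.execute("SELECT COUNT(*) FROM packet").fetchone()[0]
print(n_dn, n_packets)
```

Because the DN table carries a uniqueness constraint on the name, any per-domain information, such as a prediction, needs to be computed and stored only once.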




In [66]:
DF = DF.rename(columns={"id": "count", "duration": "d"})

totals = {
    "count": DF[["count", "dga"]]
            .groupby("dga")
            .count()
}
# Per-class sum, max and mean of requests (q), uniques (u) and duration (d).
for c in ["q", "u", "d"]:
    totals[c] = (
        DF[[c, "dga"]]
            .groupby("dga")
            .agg({c: ["sum", "max", "mean"]})
    )
    totals[c].index.rename(None, inplace=True)
    totals[c].columns = totals[c].columns.droplevel(0)
    # Render the values in SI notation (e.g. 13.1M).
    totals[c] = totals[c].map(lambda x: qq.Quantity(x).render(prec=2))

    totals[c].rename(columns={
        "sum": f"{c}_sum",
        "max": f"{c}_max",
        "mean": f"{c}_mean",
    }, inplace=True)

totals = [ totals[c] for c in ["count", "q", "u", "d"] ]
totals = pd.concat(totals, axis=1)

totals = Table(totals, Tables.TOTALS, "Request, uniques, duration statistics for each class.")

renames = {
    "count": "$N^{\\bullet/\\star}$"
}
for c in ["q", "u", "d"]:
    renames[f"{c}_sum"] = f"$\\sum {c}$"
    renames[f"{c}_max"] = f"$\\max {c}$"
    renames[f"{c}_mean"] = f"$\\text{{avg}}\\: {c}$"
    pass
totals.rename(renames)


# # total_q.drop(columns="u", inplace=True)
# total_q["q"] = total_q["q"].map(lambda x: qq.Quantity(x).render(prec=2))
# total_q["duration"] = total_q["duration"].div(60 * 60).map(ptime)
# total_q = Table(total_q, Tables.TOTAL_Q, "Amount of requests for each class.")
# total_q.rename({"q": "$\\sum^{N^{\\bullet}}_i q_i$",
#                  "duration": "$\\sum^{N^{\\star}}_i d_i$",
#                  "count": "$N^{\\bullet/\\star}$"})
# total_q.show()

In [67]:

pp(f"""

### DNS Data set analysis

#### Notation

From now on, we refer to the DNS data set simply as the data set. We indicate with:

- $q_i$ the total number of requests of the $i$-th capture.
- $d_i$ the duration of the $i$-th capture.
- $N^\\bullet$ the number of {AC.NIC}.
- $N^\\star$ the number of {AC.IC}.

The data set consists of {DF.shape[0]} captures: {totals.df.loc["not-infected", "count"]}
{AC.NIC} and {totals.df.loc["infected", "count"]} {AC.IC}.

Table {totals.label.ref()} shows the amount of requests for each class. We
can see an imbalance towards {AC.IC} in the number of captures, the number of
requests $q$, and the duration $d$.

This is because generating network traffic requires far less human effort for an
{AC.IC} than for a {AC.NIC}:

- In the {AC.IC} case, the traffic is generated solely by the malware activity.
- In the {AC.NIC} case, the traffic is generated by human action over multiple
  hours.

This introduces another problem, the **_lack of mixed traffic_**, i.e.
traffic generated by an infected machine that is also producing other kinds of
traffic, such as human, web server, or database server traffic. The reason is
that the topology of each capture, citing {Cites.STSPH}, _"was designed to be as
simple as possible. It uses VirtualBox to
execute Windows virtual machines on Linux Hosts"_.

""")


totals.show('400px')



### DNS Data set analysis

#### Notation

From now on, we refer to the DNS data set simply as the data set. We indicate with:

- $q_i$ the total number of requests of the $i$-th capture.
- $d_i$ the duration of the $i$-th capture.
- $N^\bullet$ the number of _NIC_.
- $N^\star$ the number of _IC_.

The data set consists of 50 captures: 17
_NIC_ and 33 _IC_.

Table _[totals]_ shows the amount of requests for each class. We
can see an imbalance towards _IC_ in the number of captures, the number of
requests $q$, and the duration $d$.

This is because generating network traffic requires far less human effort for an
_IC_ than for a _NIC_:

- In the _IC_ case, the traffic is generated solely by the malware activity.
- In the _NIC_ case, the traffic is generated by human action over multiple
  hours.

This introduces another problem, the **_lack of mixed traffic_**, i.e.
traffic generated by an infected machine that is also producing other kinds of
traffic, such as human, web server, or database server traffic. The reason is
that the topology of each capture, citing _[STSPH]_, _"was designed to be as
simple as possible. It uses VirtualBox to
execute Windows virtual machines on Linux Hosts"_.



class,$N^{\bullet/\star}$,$\sum q$,$\max q$,$\text{avg}\: q$,$\sum u$,$\max u$,$\text{avg}\: u$,$\sum d$,$\max d$,$\text{avg}\: d$
infected,33,13.1M,4.08M,396k,113k,41.7k,3.42k,52.2M,6.14M,1.58M
not-infected,17,299k,52k,17.6k,29k,4.81k,1.71k,147k,23.3k,8.62k


In [71]:


# Durations are stored in seconds; convert to hours for the not-infected plot.
df_duration_1hr = DF[["d", "dga"]].copy()
df_duration_1hr["d"] = df_duration_1hr["d"] / (1 * 60 * 60)
df_duration_nic = df_duration_1hr[df_duration_1hr["dga"] == "not-infected"]

# Convert to days for the infected plot.
df_duration_1day = DF[["d", "dga"]].copy()
df_duration_1day["d"] = df_duration_1day["d"] / (24 * 60 * 60)
df_duration_ic = df_duration_1day[df_duration_1day["dga"] == "infected"]



fig, axs = plt.subplots(1,2, figsize=(7, 2))
ax_nic = axs[0]
ax_ic = axs[1]

df_duration_nic.plot.hist(
    ax=ax_nic,
    bins=30,
    legend=False,
    title="not-infected",
    color="#0000FF66")
ax_nic.set_xlabel("1 Hour")
# ax_nic.set_yticks([0,1,2,3])
ax_nic.set_xticks([0,1,2,3,4,5,6,7])


df_duration_ic.plot.hist(
    ax=ax_ic,
    legend=False,
    bins=30,
    title="infected",
    color="#FF000066"
)
ax_ic.set_xlabel("1 Day")
ax_ic.set_ylabel(None)
ax_ic.set_xticks([0,25,50,71])#,100,125,150])

fig_duration = Figure(fig, axs, Figures.DURATION,
             f"Capture duration distribution for each class. Note "
             f"that the x-scale is 1 hour for {AC.NICs} and 1 day for {AC.ICs}.")
fig_duration.ycaption = -0.2
plt.close()

df_duration = (
    df_duration_1hr
       .groupby("dga")
       .describe()
       .drop(columns=("d", "count"))
       .map(ptime))
df_duration.columns = df_duration.columns.droplevel(0)

tab_duration = Table(df_duration, Tables.DURATION, "Statistics of captures duration for each class.")

In [73]:
q_per_s = DF.copy()
qi_hr = "q^{hr}_i"

per_time = 60 * 60
ratio_label = "q/hr"

q_per_s[ratio_label] = q_per_s["q"] / (q_per_s["d"] / per_time)


fig, axs = plt.subplots(1,2, figsize=(7,2))
ax_pdf = axs[1]
ax_hist = axs[0]
ax = q_per_s[q_per_s["dga"]=="infected"][[ratio_label, "dga"]].plot(kind="kde", ax=ax_pdf, color="#FF000066", legend=False)
q_per_s[q_per_s["dga"]=="not-infected"][[ratio_label, "dga"]].plot(kind="kde", ax=ax_pdf, color="#0000FF66", legend=False)
fig.legend(["infected", "not-infected"])
ax = q_per_s[q_per_s["dga"]=="infected"][[ratio_label, "dga"]].plot.hist(ax=ax_hist, bins=30, color="#FF000066", legend=False)
ax = q_per_s[q_per_s["dga"]=="not-infected"][[ratio_label, "dga"]].plot.hist(ax=ax_hist, bins=30, color="#0000FF66", legend=False)

# ax_pdf.set_xlim([-1000, 20000])
# Vertical lines mark the per-class mean of q/hr (8.95k not-infected, 4.21k infected).
ax_pdf.axvline(8950, color="#0000FF66")
ax_pdf.axvline(4210, color="#FF000066")
fig_q_per_s = Figure(fig, axs, Figures.Q_PER_S, f"Histogram and density distribution of ${ratio_label}$ per class.")

q_per_s = q_per_s[[ratio_label, "dga"]].groupby("dga").describe()
q_per_s.index.rename(None, inplace=True)
q_per_s.columns = q_per_s.columns.droplevel(0)
q_per_s = Table(
    q_per_s.map(lambda x: qq.Quantity(x).render(prec=2)),
    Tables.Q_PER_S,
    caption=f"Statistics for each class of the requests ratio ${qi_hr}$."
)

plt.close()


# tmp = DF.copy()
# for col in ["u"]:
#     tmp[f"{col}/s"] = tmp[f"{col}"] / tmp["duration"]
# tmp = tmp[[ "u/s", "dga"]].groupby("dga").describe()
# tmp.index.rename(None, inplace=True)

In [74]:

pp(f"""

#### Captures duration

A deeper analysis of the duration, as can be seen in Table {tab_duration.label.ref()}, shows that:

- The maximum duration for {AC.NICs} is just 6.5 hours, compared with 71 days for {AC.ICs}.
- The average {AC.NIC} duration is just 2.4 hours, while the average {AC.IC} duration is 18 days.

Figure {Figures.DURATION.ref()} shows the capture durations for each class.

""")

fig_duration.show()
tab_duration.show()

pp(f"""
##### Requests ratio

To further investigate the behaviour of each capture over time, we analyze the
requests ratio ${qi_hr}=q_i/d_i$, with $d_i$ expressed in hours, i.e. the average
number of requests per hour of the $i$-th capture.

Given the large difference between the duration distributions of {AC.NICs} and {AC.ICs}, we analyze
whether the number of requests per unit of time differs between the two classes.

Figure {Figures.Q_PER_S.ref()} shows on the left the histogram of the ratio and on the
right its estimated probability density function.

We can observe that:

- ${qi_hr}$ presents more variability for the {AC.NIC} than for the {AC.IC}. Most of the {AC.IC}
fall in the first three bins of the histogram. This can also be seen in
Table {q_per_s.label.ref()}, where the median value is 390 for the {AC.IC} and
6.36k for the {AC.NIC}.

- On average, ${qi_hr}$ is larger for {AC.NIC} than for {AC.IC} (a mean of 8.95k against 4.21k).

We can conclude that in this data set the **{AC.NIC} requests ratio is higher than the {AC.IC} one**.
""")


fig_q_per_s.show()

q_per_s.show()


pp(f"""

#### Data set observation
This analysis shows the multiple aspects that can influence the final results.
The durations and the requests ratios of each capture class
can be very different in other applications or in real scenarios.
The limited number of available data sets and the lack of _mixed captures_
are an obstacle to the generalization of the results obtained here.
Nevertheless, we will try to understand whether detection is possible on such a data set.
""")



#### Captures duration

A deeper analysis of the duration, as can be seen in Table _[duration]_, shows that:

- The maximum duration for _NICs_ is just 6.5 hours, compared with 71 days for _ICs_.
- The average _NIC_ duration is just 2.4 hours, while the average _IC_ duration is 18 days.

Figure _[duration]_ shows the capture durations for each class.



![](duration.svg "Example")

dga,mean,std,min,25%,50%,75%,max
infected,18 days,17 days,1.4 min,5.6 days,13 days,30 days,71 days
not-infected,2.4 hr,1.8 hr,19 min,1.1 hr,1.9 hr,3.9 hr,6.5 hr



##### Requests ratio

To further investigate the behaviour of each capture over time, we analyze the
requests ratio $q^{hr}_i=q_i/d_i$, with $d_i$ expressed in hours, i.e. the average
number of requests per hour of the $i$-th capture.

Given the large difference between the duration distributions of _NICs_ and _ICs_, we analyze
whether the number of requests per unit of time differs between the two classes.

Figure _[q-per-s]_ shows on the left the histogram of the ratio and on the
right its estimated probability density function.

We can observe that:

- $q^{hr}_i$ presents more variability for the _NIC_ than for the _IC_. Most of the _IC_
fall in the first three bins of the histogram. This can also be seen in
Table _[q-per-s]_, where the median value is 390 for the _IC_ and
6.36k for the _NIC_.

- On average, $q^{hr}_i$ is larger for _NIC_ than for _IC_ (a mean of 8.95k against 4.21k).

We can conclude that in this data set the **_NIC_ requests ratio is higher than the _IC_ one**.


![](q-per-s.svg "Example")

class,count,mean,std,min,25%,50%,75%,max
infected,33,4.21k,8.85k,6.75,48,390,2.75k,36.2k
not-infected,17,8.95k,9.27k,1.12k,3.02k,6.36k,10.5k,35.7k
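
As a worked illustration of the requests ratio, the sketch below computes the number of requests per hour for a handful of captures and summarises it per class. The capture values are hypothetical and chosen only to make the arithmetic easy to follow; durations are assumed to be expressed in seconds, as in the notebook code that divides them by 60 * 60.

```python
from statistics import mean, median

SECONDS_PER_HOUR = 60 * 60

# Hypothetical captures: (class, requests q_i, duration d_i in seconds).
captures = [
    ("infected", 1_200, 10 * 24 * 3600),   # 10 days
    ("infected", 48_000, 20 * 24 * 3600),  # 20 days
    ("not-infected", 9_000, 2 * 3600),     # 2 hours
    ("not-infected", 30_000, 3 * 3600),    # 3 hours
]


def requests_per_hour(q, d_seconds):
    """Requests ratio: number of requests divided by the duration in hours."""
    return q / (d_seconds / SECONDS_PER_HOUR)


# Collect the ratio of every capture under its class label.
rates = {}
for label, q, d in captures:
    rates.setdefault(label, []).append(requests_per_hour(q, d))

for label, values in rates.items():
    print(label, mean(values), median(values))
```

Even in this toy case, a long-running infected capture with many requests in total can still have a much lower hourly ratio than a short not-infected one, which is the effect discussed above.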




#### Data set observation
This analysis shows the multiple aspects that can influence the final results.
The durations and the requests ratios of each capture class
can be very different in other applications or in real scenarios.
The limited number of available data sets and the lack of _mixed captures_
are an obstacle to the generalization of the results obtained here.
Nevertheless, we will try to understand whether detection is possible on such a data set.
