In [3]:
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', 200)

## Loading the TON_IoT dataset and previewing the data

In [4]:
df = pd.read_csv('./dataset/toniot.csv.gz')
df.head()

Unnamed: 0,src_ip,src_port,dst_ip,dst_port,proto,service,duration,src_bytes,dst_bytes,conn_state,missed_bytes,src_pkts,src_ip_bytes,dst_pkts,dst_ip_bytes,dns_query,dns_qclass,dns_qtype,dns_rcode,dns_AA,dns_RD,dns_RA,dns_rejected,ssl_version,ssl_cipher,ssl_resumed,ssl_established,ssl_subject,ssl_issuer,http_trans_depth,http_method,http_uri,http_version,http_request_body_len,http_response_body_len,http_status_code,http_user_agent,http_orig_mime_types,http_resp_mime_types,weird_name,weird_addl,weird_notice,label,type
0,192.168.1.37,4444,192.168.1.193,49178,tcp,-,290.371539,101568,2592,OTH,0,108,108064,31,3832,-,0,0,0,-,-,-,-,-,-,-,-,-,-,-,-,-,-,0,0,0,-,-,-,-,-,-,1,backdoor
1,192.168.1.193,49180,192.168.1.37,8080,tcp,-,0.000102,0,0,REJ,0,1,52,1,40,-,0,0,0,-,-,-,-,-,-,-,-,-,-,-,-,-,-,0,0,0,-,-,-,-,-,-,1,backdoor
2,192.168.1.193,49180,192.168.1.37,8080,tcp,-,0.000148,0,0,REJ,0,1,52,1,40,-,0,0,0,-,-,-,-,-,-,-,-,-,-,-,-,-,-,0,0,0,-,-,-,-,-,-,1,backdoor
3,192.168.1.193,49180,192.168.1.37,8080,tcp,-,0.000113,0,0,REJ,0,1,48,1,40,-,0,0,0,-,-,-,-,-,-,-,-,-,-,-,-,-,-,0,0,0,-,-,-,-,-,-,1,backdoor
4,192.168.1.193,49180,192.168.1.37,8080,tcp,-,0.00013,0,0,REJ,0,1,52,1,40,-,0,0,0,-,-,-,-,-,-,-,-,-,-,-,-,-,-,0,0,0,-,-,-,-,-,-,1,backdoor


#### Available attributes in the dataset

In [22]:
print(df.columns.values, len(df.columns.values))

['src_ip' 'src_port' 'dst_ip' 'dst_port' 'proto' 'service' 'duration'
 'src_bytes' 'dst_bytes' 'conn_state' 'missed_bytes' 'src_pkts'
 'src_ip_bytes' 'dst_pkts' 'dst_ip_bytes' 'dns_query' 'dns_qclass'
 'dns_qtype' 'dns_rcode' 'dns_AA' 'dns_RD' 'dns_RA' 'dns_rejected'
 'ssl_version' 'ssl_cipher' 'ssl_resumed' 'ssl_established' 'ssl_subject'
 'ssl_issuer' 'http_trans_depth' 'http_method' 'http_uri' 'http_version'
 'http_request_body_len' 'http_response_body_len' 'http_status_code'
 'http_user_agent' 'http_orig_mime_types' 'http_resp_mime_types'
 'weird_name' 'weird_addl' 'weird_notice' 'label' 'type'] 44


#### The total number of samples

In [7]:
print(f"There are {df.shape[0]} flow-based samples")

There are 211043 flow-based samples


#### Checking the available classes

In [5]:
df['type'].value_counts()

type
normal        50000
backdoor      20000
ddos          20000
dos           20000
injection     20000
password      20000
ransomware    20000
scanning      20000
xss           20000
mitm           1043
Name: count, dtype: int64

In [24]:
df['label'].value_counts(), df['label'].value_counts(normalize=True)

(label
 1    161043
 0     50000
 Name: count, dtype: int64,
 label
 1    0.763081
 0    0.236919
 Name: proportion, dtype: float64)

---
#### Signature- or Anomaly-based Discussion

The attribute `label` has only two values: `0` for a normal sample and `1` for a sample belonging to any of the nine attack types in the dataset. It supports binary classification (normal vs. attack) as well as anomaly detection, where `0` defines normal behavior and everything else is treated as anomalous.
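
The relationship between `label` and `type` can be checked with a short sketch like the one below. The small frame is a made-up stand-in for `df` (the rows are hypothetical, only the column names come from the dataset); on the real data the same comparison would run on `df['type']` and `df['label']` directly.

```python
import pandas as pd

# Toy stand-in for the real DataFrame (hypothetical rows, real column names).
toy = pd.DataFrame({
    "type": ["normal", "backdoor", "ddos", "normal", "xss"],
    "label": [0, 1, 1, 0, 1],
})

# Binary target derived from the multi-class column: 1 for any attack type.
derived_label = (toy["type"] != "normal").astype(int)

# True when `label` agrees with `type` on every row.
labels_consistent = bool((derived_label == toy["label"]).all())
print(labels_consistent)
```

If the check fails on the real data, the mismatching rows are worth inspecting before training on either column.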

As the proportions show, the dataset is imbalanced: 76.3% of the samples are attacks and 23.7% are normal. The attack classes are also uneven among themselves, since `mitm` has only 1043 samples against 20000 for each of the other attack types. This imbalance deserves attention because it is likely the reverse of what a deployed ML model sees: during operation, most of the data presented for inference would be normal samples.
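
One common way to account for the imbalance is to weight each class inversely to its frequency. The sketch below uses the "balanced" heuristic `n_samples / (n_classes * count)`; it is only one option among several (resampling is another), and the notebook does not state which, if any, is applied. The counts are copied from the `value_counts()` output above.

```python
import pandas as pd

# Label counts copied from the value_counts() output above.
label_counts = pd.Series({1: 161043, 0: 50000})

n_samples = label_counts.sum()
n_classes = len(label_counts)

# "Balanced" heuristic: weight_c = n_samples / (n_classes * count_c).
# The rarer class (here: normal, label 0) receives the larger weight.
class_weights = n_samples / (n_classes * label_counts)
print(class_weights.round(3).to_dict())
```

Weights like these can typically be passed to a classifier's class-weight option so that errors on the minority class cost more during training.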

---

#### Tuple

Each flow in this dataset is identified by a 6-tuple made up of the following attributes:
- src_ip
- src_port
- dst_ip
- dst_port
- proto
- service

In [25]:
unique_tuples = df[['src_ip', 'src_port', 'dst_ip', 'dst_port', 'proto', 'service']].drop_duplicates().shape[0]
print(f"There are only {unique_tuples} unique 6-tuples among the flow-based samples.")

There are only 124152 unique 6-tuples among the flow-based samples.
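
Since many samples share a 6-tuple, it can help to see how the samples are distributed across tuples. The sketch below is one possible way to do that with `groupby`; the toy frame reuses the first four rows of the preview above (with most columns omitted), and the same calls would apply to `df` itself.

```python
import pandas as pd

tuple_cols = ["src_ip", "src_port", "dst_ip", "dst_port", "proto", "service"]

# Toy stand-in for `df`: the first four rows of the preview, reduced to a few columns.
toy = pd.DataFrame({
    "src_ip": ["192.168.1.37", "192.168.1.193", "192.168.1.193", "192.168.1.193"],
    "src_port": [4444, 49180, 49180, 49180],
    "dst_ip": ["192.168.1.193", "192.168.1.37", "192.168.1.37", "192.168.1.37"],
    "dst_port": [49178, 8080, 8080, 8080],
    "proto": ["tcp"] * 4,
    "service": ["-"] * 4,
    "label": [1, 1, 1, 1],
})

# Number of samples sharing each 6-tuple.
samples_per_tuple = toy.groupby(tuple_cols).size()

# Tuples that occur more than once.
repeated = samples_per_tuple[samples_per_tuple > 1]

# Distinct labels per tuple; a value above 1 would mean that a single
# tuple carries both normal and attack samples.
labels_per_tuple = toy.groupby(tuple_cols)["label"].nunique()

print(len(samples_per_tuple), len(repeated), int(labels_per_tuple.max()))
```

Repeated tuples matter when splitting the data: if samples of the same tuple land in both the training and the test set, the evaluation may look more optimistic than it should.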


#### Preprocessing

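Before using the dataset, remove all missing (`NaN`) and infinite values. Note that the preview above shows absent fields encoded as `-`, which `dropna()` does not treat as missing, so the cleaning step leaves those rows untouched. It is also good practice to scale the features; see [feature scaling](https://en.wikipedia.org/wiki/Feature_scaling).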

In [18]:
# Replace infinite values with NaN, then drop every row that contains a NaN
# (which now includes the former infinities)
df = df.replace([np.inf, -np.inf], np.nan).dropna()
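
The scaling step mentioned above is not shown in the notebook. As a minimal sketch, the block below applies z-score standardization, which is only one of several feature-scaling options and is an assumption here, not the notebook's stated method. The toy frame takes a few numeric values from the preview and adds one hypothetical infinite value to exercise the cleaning step.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a few numeric columns of `df`
# (values from the preview, plus one hypothetical row with an infinity).
toy = pd.DataFrame({
    "duration": [290.371539, 0.000102, 0.000148, np.inf],
    "src_pkts": [108, 1, 1, 1],
    "dst_pkts": [31, 1, 1, 1],
})

# Same cleaning as above: infinities become NaN, then incomplete rows are dropped.
clean = toy.replace([np.inf, -np.inf], np.nan).dropna()

# Z-score standardization: zero mean and unit variance per column.
# ddof=0 uses the population standard deviation.
scaled = (clean - clean.mean()) / clean.std(ddof=0)
print(scaled.round(3))
```

When the data is split for training and testing, the mean and standard deviation should be computed on the training split only and then reused on the test split, so that no test information leaks into the model. A column with a single constant value would produce a zero standard deviation and needs to be dropped or handled separately.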