# BoT-IoT Dataset

BoT-IoT description (to be done)

| Feature | Description |
|---|---|
| pkSeqID | Row Identifier |
| stime | Record start time |
| flgs | Flow state flags seen in transactions |
| proto | Textual representation of transaction protocols present in network flow |
| saddr | Source IP address |
| sport | Source port number |
| daddr | Destination IP address |
| dport | Destination port number |
| pkts | Total count of packets in transaction |
| bytes | Total number of bytes in transaction |
| state | Transaction state |
| ltime | Record last time |
| seq | Argus sequence number |
| dur | Record total duration |
| mean | Average duration of aggregated records |
| stddev | Standard deviation of aggregated records |
| smac | Source MAC address |
| dmac | Destination MAC address |
| sum | Total duration of aggregated records |
| min | Minimum duration of aggregated records |
| max | Maximum duration of aggregated records |
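| soui | Source OUI (organizationally unique identifier of the source MAC address) |
| doui | Destination OUI (organizationally unique identifier of the destination MAC address) |
| sco | Source country code |
| dco | Destination country code |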
| spkts | Source-to-destination packet count |
| dpkts | Destination-to-source packet count |
| sbytes | Source-to-destination byte count |
| dbytes | Destination-to-source byte count |
| rate | Total packets per second in transaction |
| srate | Source-to-destination packets per second |
| drate | Destination-to-source packets per second |
| attack | Class label: 0 for normal traffic, 1 for attack traffic |
| category | Traffic category |
| subcategory | Traffic subcategory |

Initial feature selection: https://docs.google.com/spreadsheets/d/1yyDf0Jsi0t6rCKHLaHbrPcSXwqDzRNw_OuhpH2sUUGU/edit?usp=sharing

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing
import seaborn as sns # statistical graphics package
import matplotlib.pyplot as plt # plot package
import pandasql as ps # sql package
from skmultiflow.trees import HoeffdingTreeClassifier

In [2]:
pd.set_option("display.max_columns", 30)
pd.set_option("display.max_rows", 1000)

In [3]:
csv_file = "../processed-data/botiot-25-fs.csv"
df = pd.read_csv(csv_file)

In [4]:
df = df.rename(columns = {'ipv6-icmp':'ipv6icmp'})
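The renamed column `ipv6-icmp`, together with the `icmp`, `tcp`, `udp` and `ACC` ... `URP` columns listed below, suggests that the processed CSV holds one-hot encodings of the `proto` and `state` features from the table above. The notebook does not show that preprocessing step, so the following is only a sketch of how such columns could be produced with `pd.get_dummies`. The frame `raw` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical raw records; the values are illustrative, not taken from the dataset.
raw = pd.DataFrame({
    "proto": ["tcp", "udp", "ipv6-icmp", "icmp"],
    "state": ["CON", "INT", "NRS", "URP"],
    "pkts": [10, 2, 1, 1],
})

# One-hot encode both categorical columns, using the category value itself as the column name.
encoded = pd.get_dummies(
    raw, columns=["proto", "state"], prefix="", prefix_sep="", dtype=int
)

# A hyphen is not valid in a Python identifier, so rename the column
# to allow attribute access such as encoded.ipv6icmp.
encoded = encoded.rename(columns={"ipv6-icmp": "ipv6icmp"})

print(encoded.columns.tolist())
```

After the rename, `encoded.ipv6icmp` works the same way `df.ipv6icmp` does in the later cells.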

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9639397 entries, 0 to 9639396
Data columns (total 29 columns):
 #   Column    Dtype  
---  ------    -----  
 0   sport     float64
 1   dport     float64
 2   pkts      int64  
 3   bytes     int64  
 4   ltime     float64
 5   mean      float64
 6   sum       float64
 7   min       float64
 8   max       float64
 9   spkts     int64  
 10  dpkts     int64  
 11  sbytes    int64  
 12  dbytes    int64  
 13  rate      float64
 14  srate     float64
 15  drate     float64
 16  icmp      int64  
 17  ipv6icmp  int64  
 18  tcp       int64  
 19  udp       int64  
 20  ACC       int64  
 21  CON       int64  
 22  FIN       int64  
 23  INT       int64  
 24  NRS       int64  
 25  REQ       int64  
 26  RST       int64  
 27  URP       int64  
 28  attack    int64  
dtypes: float64(10), int64(19)
memory usage: 2.1 GB
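
Every column is stored in a 64-bit dtype, so 29 columns at 8 bytes each over 9,639,397 rows come to roughly 2.2e9 bytes, which matches the reported figure. If memory becomes a constraint, the integer columns could be downcast. The sketch below uses a small made-up frame and a hypothetical helper `downcast_integers`. Float columns are left alone on purpose, because `ltime` holds epoch timestamps that would lose precision in `float32`.

```python
import numpy as np
import pandas as pd

def downcast_integers(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which int64 columns use the smallest signed integer dtype that fits."""
    out = frame.copy()
    for col in out.select_dtypes(include=["int64"]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    return out

# Toy frame with the same dtypes as three of the real columns.
toy = pd.DataFrame({
    "pkts": np.array([10, 2, 322677], dtype="int64"),
    "tcp": np.array([1, 0, 1], dtype="int64"),
    "ltime": np.array([1526346000.0, 1526344000.0, 1529382000.0]),
})

small = downcast_integers(toy)
print(small.dtypes)
print(toy.memory_usage(deep=True).sum(), "->", small.memory_usage(deep=True).sum())
```

The 0/1 indicator columns fit in `int8`, so they would shrink to one eighth of their current size.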


In [6]:
df.shape

(9639397, 29)

In [7]:
df.attack.value_counts()

1    9630325
0       9072
Name: attack, dtype: int64
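
The counts above show a heavy class imbalance. As a quick illustration, the sketch below turns the two counts printed above into class shares and an attack-to-normal ratio. It rebuilds the counts by hand rather than reading the CSV, and the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the df.attack.value_counts() output above.
counts = pd.Series({1: 9630325, 0: 9072}, name="attack")

# Fraction of records in each class.
share = counts / counts.sum()

# Number of attack records per normal record.
imbalance_ratio = counts.loc[1] / counts.loc[0]

print(share)
print(f"attack records per normal record: {imbalance_ratio:.1f}")
```

Normal traffic makes up less than 0.1% of the records, so plain accuracy would be a misleading metric for any classifier trained on this frame.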

In [8]:
df.head(n=10)

Unnamed: 0,sport,dport,pkts,bytes,ltime,mean,sum,min,max,spkts,dpkts,sbytes,dbytes,rate,srate,drate,icmp,ipv6icmp,tcp,udp,ACC,CON,FIN,INT,NRS,REQ,RST,URP,attack
0,139.0,36390.0,10,680,1526346000.0,2.8e-05,0.000138,2.2e-05,4.2e-05,5,5,350,330,0.00619,0.002751,0.002751,0,0,1,0,0,1,0,0,0,0,0,0,0
1,51838.0,123.0,2,180,1526344000.0,0.048565,0.048565,0.048565,0.048565,1,1,90,90,20.59096,0.0,0.0,0,0,0,1,0,1,0,0,0,0,0,0,0
2,58999.0,53.0,4,630,1526345000.0,0.098505,0.197011,0.018356,0.178655,2,2,174,456,0.005264,0.001755,0.001755,0,0,0,1,0,1,0,0,0,0,0,0,0
3,58360.0,53.0,2,172,1526344000.0,0.0,0.0,0.0,0.0,2,0,172,0,0.399984,0.399984,0.0,0,0,0,1,0,0,0,1,0,0,0,0,0
4,37214.0,53.0,2,172,1526344000.0,0.0,0.0,0.0,0.0,2,0,172,0,0.399824,0.399824,0.0,0,0,0,1,0,0,0,1,0,0,0,0,0
5,138.0,138.0,4,1086,1526345000.0,7.7e-05,0.000154,6.1e-05,9.3e-05,4,0,1086,0,0.004119,0.004119,0.0,0,0,0,1,0,0,0,1,0,0,0,0,0
6,57950.0,53.0,2,172,1526344000.0,0.007523,0.007523,0.007523,0.007523,1,1,86,86,132.925705,0.0,0.0,0,0,0,1,0,1,0,0,0,0,0,0,0
7,36138.0,53.0,2,172,1526344000.0,0.0,0.0,0.0,0.0,2,0,172,0,0.399805,0.399805,0.0,0,0,0,1,0,0,0,1,0,0,0,0,0
8,34295.0,53.0,2,172,1526344000.0,0.007698,0.007698,0.007698,0.007698,1,1,86,86,129.90387,0.0,0.0,0,0,0,1,0,1,0,0,0,0,0,0,0
9,43735.0,53.0,2,172,1526344000.0,0.0,0.0,0.0,0.0,2,0,172,0,0.399839,0.399839,0.0,0,0,0,1,0,0,0,1,0,0,0,0,0


In [9]:
df.tail(n=10)

Unnamed: 0,sport,dport,pkts,bytes,ltime,mean,sum,min,max,spkts,dpkts,sbytes,dbytes,rate,srate,drate,icmp,ipv6icmp,tcp,udp,ACC,CON,FIN,INT,NRS,REQ,RST,URP,attack
9639387,138.0,138.0,2,543,1528102000.0,0.000209,0.000209,0.000209,0.000209,2,0,543,0,4784.688965,4784.688965,0.0,0,0,0,1,0,0,0,1,0,0,0,0,0
9639388,50458.0,123.0,2,180,1528102000.0,0.007083,0.007083,0.007083,0.007083,1,1,90,90,141.183105,0.0,0.0,0,0,0,1,0,1,0,0,0,0,0,0,0
9639389,80.0,80.0,9904,9267118,1528102000.0,4.940684,123.517105,3.766563,4.998064,9904,0,9267118,0,79.974998,79.974998,0.0,0,0,1,0,0,1,0,0,0,0,0,0,0
9639390,3456.0,80.0,19806,19267852,1528102000.0,4.940639,123.515961,3.754065,4.997637,9903,9903,9410581,9857271,159.957977,79.974945,79.974945,0,0,0,1,0,1,0,0,0,0,0,0,0
9639391,8080.0,80.0,19806,19301382,1528102000.0,4.940453,123.51133,3.778993,4.991025,9903,9903,9961841,9339541,159.957977,79.974953,79.974953,0,0,1,0,0,1,0,0,0,0,0,0,0
9639392,80.0,80.0,9903,9501175,1528102000.0,4.93996,123.499001,3.701398,4.998852,9903,0,9501175,0,79.977333,79.977333,0.0,0,0,0,1,0,0,0,1,0,0,0,0,0
9639393,0.0,0.0,5942,4223674,1528102000.0,4.929386,123.23465,3.691942,4.999439,5942,0,4223674,0,47.984787,47.984787,0.0,0,0,1,0,0,1,0,0,0,0,0,0,0
9639394,365.0,565.0,5323,319380,1528102000.0,4.928208,123.2052,3.641112,4.999921,5323,0,319380,0,42.989494,42.989494,0.0,0,0,0,1,0,0,0,1,0,0,0,0,0
9639395,80.0,80.0,3342,989232,1528102000.0,4.915251,122.881271,3.555233,4.999967,3342,0,989232,0,26.992807,26.992807,0.0,0,0,1,0,0,1,0,0,0,0,0,0,0
9639396,41307.0,8883.0,96,40916,1528102000.0,2.712452,65.098846,0.006126,4.999908,48,48,37748,3168,0.821691,0.4085,0.406521,0,0,1,0,0,1,0,0,0,0,0,0,0


In [10]:
df.describe()

Unnamed: 0,sport,dport,pkts,bytes,ltime,mean,sum,min,max,spkts,dpkts,sbytes,dbytes,rate,srate,drate,icmp,ipv6icmp,tcp,udp,ACC,CON,FIN,INT,NRS,REQ,RST,URP,attack
count,9639315.0,9639315.0,9639397.0,9639397.0,9639397.0,9639397.0,9639397.0,9639397.0,9639397.0,9639397.0,9639397.0,9639397.0,9639397.0,9639397.0,9639397.0,9639397.0,9639397.0,9639397.0,9639397.0,9639397.0,9639397.0,9639397.0,9639397.0,9639397.0,9639397.0,9639397.0,9639397.0,9639397.0,9639397.0
mean,32731.54,111.5997,8.018178,2342.449,1528098000.0,2.010042,5.951869,1.059225,2.953331,7.290935,0.7272436,1780.371,562.0782,0.4836517,1.098909,0.00877463,8.506756e-06,9.129202e-06,0.5073191,0.4926633,0.0004304211,0.0005007575,1.400503e-05,0.4923285,9.129202e-06,0.3941934,0.1125152,8.506756e-06,0.9990589
std,18929.36,1166.514,390.4737,367161.8,33405.55,1.584799,11.05643,1.637311,1.952601,279.2644,157.5743,250278.3,155008.1,120.3655,629.2718,1.168055,0.002916622,0.003021443,0.4999465,0.4999462,0.02074213,0.02237201,0.003742303,0.4999412,0.003021443,0.4886768,0.3159993,0.002916622,0.03066353
min,0.0,0.0,1.0,60.0,1526344000.0,0.0,0.0,0.0,0.0,1.0,0.0,60.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,16352.0,80.0,3.0,308.0,1528097000.0,0.019337,0.056746,0.0,0.047647,3.0,0.0,308.0,0.0,0.156481,0.150158,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
50%,32713.0,80.0,5.0,582.0,1528098000.0,2.207738,4.853511,0.0,3.933735,5.0,0.0,540.0,0.0,0.230108,0.217676,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,49128.0,80.0,8.0,676.0,1528100000.0,3.350504,10.21427,2.539142,4.493843,8.0,0.0,660.0,0.0,0.399317,0.380848,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
max,65535.0,65489.0,322677.0,314459600.0,1529382000.0,4.999999,3502.99,4.999999,5.027652,234761.0,161338.0,225239200.0,152179500.0,333333.3,1000000.0,2178.649,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [11]:
df.sample(frac=0.0001, random_state=1)

Unnamed: 0,sport,dport,pkts,bytes,ltime,mean,sum,min,max,spkts,dpkts,sbytes,dbytes,rate,srate,drate,icmp,ipv6icmp,tcp,udp,ACC,CON,FIN,INT,NRS,REQ,RST,URP,attack
51821,43569.0,80.0,1,154,1528096000.0,0.0,0.0,0.0,0.0,1,0,154,0,0.0,0.0,0.0,0,0,1,0,0,0,0,0,0,1,0,0,1
1743582,20587.0,80.0,5,676,1528096000.0,1.445994,4.337983,0.0,4.337983,4,1,616,60,0.249839,0.187379,0.0,0,0,1,0,0,0,0,0,0,0,1,0,1
1694159,59314.0,80.0,4,616,1528096000.0,3.424843,6.849686,3.143486,3.7062,4,0,616,0,0.291946,0.291946,0.0,0,0,1,0,0,0,0,0,0,1,0,0,1
8635718,20894.0,80.0,7,420,1528101000.0,0.0,0.0,0.0,0.0,7,0,420,0,0.062562,0.062562,0.0,0,0,0,1,0,0,0,1,0,0,0,0,1
634590,36686.0,80.0,5,770,1528096000.0,0.0,0.0,0.0,0.0,5,0,770,0,0.172951,0.172951,0.0,0,0,1,0,0,0,0,0,0,1,0,0,1
2076120,7536.0,80.0,5,770,1528097000.0,2.108761,6.326283,0.0,3.660232,5,0,770,0,0.194594,0.194594,0.0,0,0,1,0,0,0,0,0,0,1,0,0,1
7905984,43176.0,80.0,1,60,1528100000.0,0.0,0.0,0.0,0.0,1,0,60,0,0.0,0.0,0.0,0,0,0,1,0,0,0,1,0,0,0,0,1
1085514,38802.0,80.0,9,1010,1528096000.0,1.196215,4.784859,0.0,4.24764,7,2,890,120,0.331895,0.248921,0.041487,0,0,1,0,0,0,0,0,0,0,1,0,1
7057293,31035.0,80.0,7,420,1528100000.0,3.718233,11.154697,2.680125,4.7594,7,0,420,0,0.361038,0.361038,0.0,0,0,0,1,0,0,0,1,0,0,0,0,1
1465897,37858.0,80.0,8,1044,1528096000.0,3.986108,11.958323,3.774672,4.10726,7,1,984,60,0.351303,0.301117,0.0,0,0,1,0,0,0,0,0,0,0,1,0,1


In [12]:
df.isnull().sum()

sport       82
dport       82
pkts         0
bytes        0
ltime        0
mean         0
sum          0
min          0
max          0
spkts        0
dpkts        0
sbytes       0
dbytes       0
rate         0
srate        0
drate        0
icmp         0
ipv6icmp     0
tcp          0
udp          0
ACC          0
CON          0
FIN          0
INT          0
NRS          0
REQ          0
RST          0
URP          0
attack       0
dtype: int64
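
The 82 missing values in `sport` and `dport` equal the number of rows with `icmp == 1` reported further down. ICMP has no port numbers, so it is plausible that these are the same rows, but the notebook does not verify it. The following sketch shows one way to check the overlap, using a made-up frame in place of `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df: two ICMP rows without ports, two rows with ports.
toy = pd.DataFrame({
    "sport": [139.0, np.nan, 58999.0, np.nan],
    "dport": [36390.0, np.nan, 53.0, np.nan],
    "icmp": [0, 1, 0, 1],
})

missing_port = toy["sport"].isna() | toy["dport"].isna()
is_icmp = toy["icmp"].astype(bool)

n_missing = int(missing_port.sum())            # rows with a missing port
n_both = int((missing_port & is_icmp).sum())   # rows that are also ICMP
same_rows = bool((missing_port == is_icmp).all())

print(n_missing, n_both, same_rows)
```

If `same_rows` came out `True` on the real frame, filling the missing ports with a sentinel value would lose no information, since the `icmp` column already marks those rows.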

In [13]:
df.nunique()

sport         65536
dport          8690
pkts           1668
bytes          2720
ltime       8990451
mean        2070759
sum         3396149
min         1066449
max         1766138
spkts          1408
dpkts           501
sbytes         2587
dbytes          923
rate         438020
srate        378255
drate         41238
icmp              2
ipv6icmp          2
tcp               2
udp               2
ACC               2
CON               2
FIN               2
INT               2
NRS               2
REQ               2
RST               2
URP               2
attack            2
dtype: int64

In [14]:
df.icmp.value_counts()

0    9639315
1         82
Name: icmp, dtype: int64

In [15]:
df.ipv6icmp.value_counts()

0    9639309
1         88
Name: ipv6icmp, dtype: int64

In [16]:
df.tcp.value_counts()

1    4890250
0    4749147
Name: tcp, dtype: int64

In [17]:
df.udp.value_counts()

0    4890420
1    4748977
Name: udp, dtype: int64

In [18]:
df.ACC.value_counts()

0    9635248
1       4149
Name: ACC, dtype: int64

In [19]:
df.CON.value_counts()

0    9634570
1       4827
Name: CON, dtype: int64

In [20]:
df.FIN.value_counts()

0    9639262
1        135
Name: FIN, dtype: int64

In [21]:
df.INT.value_counts()

0    4893647
1    4745750
Name: INT, dtype: int64

In [22]:
df.NRS.value_counts()

0    9639309
1         88
Name: NRS, dtype: int64

In [23]:
df.REQ.value_counts()

0    5839610
1    3799787
Name: REQ, dtype: int64

In [24]:
df.RST.value_counts()

0    8554818
1    1084579
Name: RST, dtype: int64

In [25]:
df.URP.value_counts()

0    9639315
1         82
Name: URP, dtype: int64
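
The twelve `value_counts()` calls above can be condensed into one summary. In the outputs above, the four protocol counts and the eight state counts each add up to 9,639,397, which is what one-hot groups should do. A minimal sketch on a made-up frame, with illustrative column lists:

```python
import pandas as pd

# Toy stand-in for df with two indicator columns and the label.
toy = pd.DataFrame({
    "tcp": [1, 0, 1, 1],
    "udp": [0, 1, 0, 0],
    "attack": [1, 1, 0, 1],
})
proto_cols = ["tcp", "udp"]

# Number and share of ones per indicator column.
summary = pd.DataFrame({
    "ones": toy[proto_cols].sum(),
    "share": toy[proto_cols].mean(),
})

# In a one-hot group, every row has exactly one indicator set.
exactly_one = bool((toy[proto_cols].sum(axis=1) == 1).all())

print(summary)
print("one-hot group:", exactly_one)
```

On the real frame, `proto_cols` would be `["icmp", "ipv6icmp", "tcp", "udp"]`, and the same check could be repeated for the state columns.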

In [26]:
df.corr()

Unnamed: 0,sport,dport,pkts,bytes,ltime,mean,sum,min,max,spkts,dpkts,sbytes,dbytes,rate,srate,drate,icmp,ipv6icmp,tcp,udp,ACC,CON,FIN,INT,NRS,REQ,RST,URP,attack
sport,1.0,-0.045929,-0.008837,-0.007993,-0.010128,0.003627,-0.007856,0.004985,0.000977,-0.009242,-0.005519,-0.008283,-0.0055579,0.000243,0.000945,-0.004475,,-0.005203,-9.822867e-05,0.0001296754,-0.004227,-0.004749,0.002957,-4.7e-05,-0.005203,0.004219,-0.005822,,0.001016
dport,-0.045929,1.0,0.006519,0.006572,-0.010232,-0.009803,0.004303,-0.008563,-0.004279,0.004494,0.00819,0.004737,0.00791818,0.000233,-3.8e-05,0.006489,,-0.000289,0.02632641,-0.02632467,-0.000405,0.015042,0.004231,-0.026321,-0.000289,-0.021001,0.073032,,-0.017115
pkts,-0.008837,0.006519,1.0,0.978431,-0.082407,0.012947,0.639165,0.005355,0.009538,0.942858,0.807029,0.93577,0.8066645,0.002981,0.000355,0.111067,0.00331,-5.4e-05,-0.004079676,0.004060696,-0.000133,0.191342,0.000309,0.001209,-5.4e-05,-0.008814,-0.001854,0.00331,-0.175622
bytes,-0.007993,0.006572,0.978431,1.0,-0.076913,0.00901,0.57253,0.004367,0.00535,0.891158,0.845207,0.943599,0.8451145,0.002948,0.000344,0.115771,0.000233,-1.9e-05,0.0005530089,-0.0005542526,-9.1e-05,0.186821,0.000197,-0.003564,-1.9e-05,-0.003893,-0.001565,0.000233,-0.160593
ltime,-0.010128,-0.010232,-0.082407,-0.076913,1.0,0.043258,-0.136136,0.024475,0.046473,-0.084359,-0.0547,-0.07878,-0.05498135,-0.005862,-0.052334,-0.033813,-0.010136,-0.090435,-0.02940946,0.03001516,-0.001044,-0.469505,-0.093933,0.049365,-0.090435,-0.018907,-0.013482,-0.010136,0.789499
mean,0.003627,-0.009803,0.012947,0.00901,0.043258,1.0,0.340582,0.783961,0.831716,0.014829,0.005801,0.009236,0.006428247,-0.00048,-0.001186,0.004839,0.002193,-0.003832,-0.4171693,0.4171799,-0.009321,-0.003541,-0.003816,0.417793,-0.003832,-0.289834,-0.211851,0.002193,0.014223
sum,-0.007856,0.004303,0.639165,0.57253,-0.136136,0.340582,1.0,0.147862,0.359995,0.685133,0.369628,0.610378,0.3706058,0.001054,-0.000351,0.037375,0.001638,-0.001627,-0.2260372,0.2260376,-0.004672,0.174002,-0.00163,0.224913,-0.001627,-0.180842,-0.088164,0.001638,-0.167823
min,0.004985,-0.008563,0.005355,0.004367,0.024475,0.783961,0.147862,1.0,0.413033,0.00587,0.002867,0.00432,0.003368267,8.3e-05,-0.000494,0.002241,0.00132,-0.001955,-0.277633,0.2776373,-0.008558,-7.4e-05,-0.001809,0.277912,-0.001955,-0.184786,-0.153327,0.00132,0.005335
max,0.000977,-0.004279,0.009538,0.00535,0.046473,0.831716,0.359995,0.413033,1.0,0.011539,0.003185,0.005562,0.003691973,-0.001167,-0.001563,0.002936,0.001767,-0.00457,-0.3385596,0.3385771,-0.002754,-0.011549,-0.00416,0.339388,-0.00457,-0.232339,-0.176569,0.001767,0.024346
spkts,-0.009242,0.004494,0.942858,0.891158,-0.084359,0.014829,0.685133,0.00587,0.011539,1.0,0.564159,0.958089,0.5639089,0.002664,0.000347,0.077581,0.004636,-6.8e-05,-0.00626749,0.00624086,-0.000217,0.177419,0.000151,0.004255,-6.8e-05,-0.010374,-0.00328,0.004636,-0.179789
