# Data

## AWID3 Dataset

AWID3 dataset consists of 13 captures of traffic in a wireless network. Of these 7 that focus on attacks on IEEE 802.11 MAC layer. The attacks chosen are: 

* Deauth
* Disass
* (Re)Assoc
* RogueAP
* Krack
* Kr00k
* Evil Twin

## Data preprocessing

For training of the model 

Features selection was based on [Pick Quality Over Quantity: Expert Feature Selection and Data Preprocessing for 802.11 Intrusion Detection Systems](https://ieeexplore.ieee.org/document/9797689) by the authors of the AWID3 dataset:

### Features chosen by Chatzoglou et. al. and their preprocessing

| Feature                    | Preprocessing    |
|----------------------------|------------------|
| frame.len                  | Min-Max Scaling  |
| radiotap.len               | Min-Max Scaling  |
| radiotap.dbm_antsignal     | Min-Max Scaling  |
| wlan.duration              | Min-Max Scaling  |
| radiotap.present.tsft      | One Hot Encoding |
| radiotap.channel.freq      | One Hot Encoding |
| radiotap.channel.type.cck  | One Hot Encoding |
| radiotap.channel.type.ofdm | One Hot Encoding |
| wlan.fc.type               | One Hot Encoding |
| wlan.fc.subtype            | One Hot Encoding |
| wlan.fc.ds                 | One Hot Encoding |
| wlan.fc.frag               | One Hot Encoding |
| wlan.fc.retry              | One Hot Encoding |
| wlan.fc.pwrmgt             | One Hot Encoding |
| wlan.fc.moredata           | One Hot Encoding |
| wlan.fc.protected          | One Hot Encoding |

The chosen features were prepocessed with following differences:
* frame.delta_time was was was added, as it is crucial for analyzing temproal patterns.
* features expressed by 0/1 values, such as IEEE 802.11 Frame Control flags were left unchanged (e.g. frag, retry), not One Hot Encoded as in above mentioned article.
* the authors  One Hot Encoding of channel frequencies, however such approach is only feasable, when there rather few categories (3 frequencies were used in the above mentioned files), so a more genral method of preprocessing frequncy is proposed:
    * create two binary features `2ghz_spectrum` and `5ghz_spectrum` to indicate in which band a frame was sent
    * apply Min-Max Scaling to frequency, using the highest and the lowest channel frequency in given band as Min and Max values, as seen below

In [None]:
def preporcess_frequency(radiotap_channel_freq):
    lower_2ghz, higer_2ghz = 2412, 2472
    lower_5ghz, higher_5gzh = 5160, 5885
    
    if (lower_2ghz <= radiotap_channel_freq <= higer_2ghz):
        _2ghz_spectrum = 1
        _5gzh_spectrum = 0
        freq = (radiotap_channel_freq - lower_2ghz) / (higer_2ghz - lower_2ghz)
    elif (lower_5ghz <= radiotap_channel_freq <= higer_5ghz):
        _2ghz_spectrum = 0
        _5gzh_spectrum = 1
        freq = (radiotap_channel_freq - lower_5ghz) / (higer_5ghz - lower_5ghz)
    else:
        _2ghz_spectrum = 0
        _5gzh_spectrum = 0
        freq = -1
        
    return _2ghz_spectrum, _5gzh_spectrum, freq
     

### Features and used preprocessing

| Feature                   | Type        | Preprocessing       | Values                    | Description                                                                                                       |
|---------------------------|-------------|---------------------|---------------------------|-------------------------------------------------------------------------------------------------------------------|
| frame.len                 | numeric     | Min-Max Scaling     | from 70 to 3220           | Length of frame, in bytes                                                                                         |
| frame.time_delta          | numeric     | Min-Max Scaling     | from 0 to 0.001817        | Time interval since previous frame, in seconds                                                                    |
| radiotap.len              | numeric     | Min-Max Scaling     | from 48 to 64             | Length of Radiotap header, in bytes                                                                               |
| radiotap.dbm_antsignal    | numeric     | Min-Max Scaling     | form -255 to -78          | Strength of recieved signal, in dBm. In AWID3 CSV files the value was summarized for antennas, hence very low values |
| wlan.duration             | numeric     | Min-Max Scaling     | from 0 to 726             | Duration/Id field in IEEE 802.11 header                                                                           |
| wlan.fc.type              | categorical | One Hot Encoding    | 0, 1, 2                   | IEEE 802.11 frame type: Management, Control or Data                                                               |
| wlan.fc.subtype           | categorical | One Hot Encoding    | 0, 1, ... 14, 15          | IEEE 802.11 frame subtype                                                                                         |
| wlan.fc.ds                | categorical | One Hot Encoding    | 0, 1, 2, 3                | Indicates whete a frame was sent to (tods) or from (fromds) a Distribution System (ds), neither or both           |
| radiotap.present.tsft     | binary      | convert to 0,1      | '0-0-0', '1-0-0'          | Presence of TSFT (Time Synchroniztion Function) in Radiotap header                                                |
| radiotap.channel.type.cck | binary      | None                | 0, 1                      | Whether frame was sent using CCK (Complemantary Code Keying), used in IEEE 802.11b                                |
| radiotap.channel.type.ofdm | binary     | None                | 0, 1                      | Whether frame was sent using OFDM (Orthognal Frequency Division Multiplexing) modulation                          |
| wlan.fc.frag              | binary      | None                | 0, 1                      | Frame Control flag, indicates if the frame was fragmented                                                         |
| wlan.fc.retry             | binary      | None                | 0, 1                      | Frame Control flag, indicates if the frame is retransmission of a previous frame                                  |
| wlan.fc.pwrmgt            | binary      | None                | 0, 1                      | Frame Control flag, used when station enters power management state                                               |
| wlan.fc.moredata          | binary      | None                | 0, 1                      | Frame Control flag, indicates that AP has buffered for the station                                                |
| wlan.fc.protected         | binary      | None                | 0, 1                      | Frame Control flag, indicates that frame has been encrypted                                                       |


Classes:

| Class | name           | Attacks                                                               |
|-------|----------------|-----------------------------------------------------------------------|
| 0     | normal         |                                                                       |
| 1     | flooding       | Deauth, Disass, (Re)Assoc, Kr00k                                      |
| 2     | impersonation  | RogueAP, Krack, Evil Twin                                             |


Ranges were calculated across all AWID3 csv files using IQR, 

In [15]:
import tensorflow as tf
import os
import numpy as np
import random

random.seed(42)

tfrecords_dir='dataset/AWID3_tfrecords'
tfrecords_balanced_dir='dataset/AWID3_tfrecords_balanced'

In [16]:
import data_utils

files = os.listdir(tfrecords_dir)

train_ratio = 0.25
train_files, test_files = data_utils.train_test_split(files, train_ratio=train_ratio)

print("Training set:")
for t in train_files:
    print(t.split('.')[0], end=', ')

print("\nTest set:")
for t in test_files:
    print(t.split('.')[0], end=', ')
    
train_seq_set = [os.path.join(tfrecords_dir, file) for file in train_files]
train_balanced_set = [os.path.join(tfrecords_balanced_dir, file) for file in train_files]
test_set = [os.path.join(tfrecords_dir, file) for file in train_files]

Training set:
RogueAP_37
Deauth_23
Disas_29
Deauth_31
Kr00k_33
Evil_Twin_58
Kr00k_42
(Re)Assoc_22
(Re)Assoc_26
(Re)Assoc_31
Kr00k_34
Evil_Twin_64
Disas_40
Evil_Twin_68
Evil_Twin_32
Kr00k_55
Evil_Twin_31
Deauth_24
RogueAP_27
Kr00k_40
Kr00k_32
RogueAP_38
Evil_Twin_49
Kr00k_54
Evil_Twin_43
Disas_31
Disas_28
Evil_Twin_39
Evil_Twin_53
Evil_Twin_46
Evil_Twin_71
Krack_26
Evil_Twin_69
(Re)Assoc_34
RogueAP_29

Test set:
Krack_25
Kr00k_49
RogueAP_28
Evil_Twin_45
Deauth_32
(Re)Assoc_29
Kr00k_38
Evil_Twin_55
Evil_Twin_44
(Re)Assoc_24
Disas_37
Kr00k_48
Kr00k_37
Krack_28
RogueAP_36
Kr00k_56
Kr00k_46
Kr00k_44
Deauth_22
(Re)Assoc_27
Kr00k_51
Evil_Twin_30
(Re)Assoc_23
Evil_Twin_59
Evil_Twin_36
Disas_35
Evil_Twin_75
(Re)Assoc_33
Evil_Twin_48
Disas_30
Evil_Twin_41
Disas_36
Evil_Twin_61
(Re)Assoc_28
Evil_Twin_63
Disas_39
Kr00k_45
RogueAP_39
RogueAP_32
RogueAP_33
Evil_Twin_47
Kr00k_41
Deauth_30
Evil_Twin_42
Kr00k_52
Deauth_28
RogueAP_25
Evil_Twin_33
Deauth_29
Disas_32
Evil_Twin_29
(Re)Assoc_36
Deauth_21
Ev

In [17]:
sequence_length = 64
sequence_shift = 56
batch_size = 50
n_feautes = 39

import data_utils

raw_train_ds = tf.data.TFRecordDataset(train_balanced_set)
train_ds = raw_train_ds.map(data_utils.parse_record).shuffle(100000).batch(batch_size).prefetch(tf.data.AUTOTUNE)

raw_test_ds = tf.data.TFRecordDataset(test_set)
test_ds = raw_train_ds.map(data_utils.parse_record).batch(batch_size).prefetch(tf.data.AUTOTUNE)

train_seq_ds = data_utils.create_sequential_dataset(train_seq_set, seq_length=sequence_length, seq_shift=sequence_shift, batch_size=batch_size, shuffle=False)
test_seq_ds = data_utils.create_sequential_dataset(test_set, seq_length=sequence_length, seq_shift=sequence_shift, batch_size=batch_size, shuffle=False)

In [18]:
import cnn_lstm

cnn_lstm_model = cnn_lstm.CNN_LSTM_model()

cnn_lstm_model.fit(
        train_seq_ds,
        epochs=1,
        callbacks = [cnn_lstm.checkpoint_callback],
    )

    241/Unknown [1m150s[0m 573ms/step - accuracy: 0.8606 - loss: nan - precision: 0.0255 - recall: 0.0074
Epoch 1: accuracy improved from 0.82287 to 0.82597, saving model to saved_models/cnn_lstm.keras
[1m241/241[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m150s[0m 573ms/step - accuracy: 0.8604 - loss: nan - precision: 0.0255 - recall: 0.0074


2024-11-16 17:15:25.467607: I tensorflow/core/framework/local_rendezvous.cc:424] Local rendezvous recv item cancelled. Key hash: 15410661272650719235
2024-11-16 17:15:25.467662: I tensorflow/core/framework/local_rendezvous.cc:424] Local rendezvous recv item cancelled. Key hash: 16107512087902888839
2024-11-16 17:15:25.467676: I tensorflow/core/framework/local_rendezvous.cc:424] Local rendezvous recv item cancelled. Key hash: 2131296300501960936
2024-11-16 17:15:25.467685: I tensorflow/core/framework/local_rendezvous.cc:424] Local rendezvous recv item cancelled. Key hash: 13897202906965758814
2024-11-16 17:15:25.467694: I tensorflow/core/framework/local_rendezvous.cc:424] Local rendezvous recv item cancelled. Key hash: 11288505370683823994


<keras.src.callbacks.history.History at 0x7f4bcec4b140>

In [19]:
cnn_lstm_model.summary()

cnn_lstm_model.evaluate(test_seq_ds)

[1m241/241[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m121s[0m 489ms/step - accuracy: 0.8775 - loss: nan - precision: 0.0000e+00 - recall: 0.0000e+00


2024-11-16 17:17:25.996617: I tensorflow/core/framework/local_rendezvous.cc:424] Local rendezvous recv item cancelled. Key hash: 3411042345109326146
2024-11-16 17:17:25.996801: I tensorflow/core/framework/local_rendezvous.cc:424] Local rendezvous recv item cancelled. Key hash: 4316426749130279535
2024-11-16 17:17:25.996825: I tensorflow/core/framework/local_rendezvous.cc:424] Local rendezvous recv item cancelled. Key hash: 2676948891736928057
2024-11-16 17:17:25.996872: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
	 [[{{node IteratorGetNext}}]]
	 [[IteratorGetNext/_4]]


[nan, 0.8289610743522644, 0.0, 0.0]

In [20]:
import cnn1d

cnn1d_model = cnn1d.CNN1D_model()


cnn1d_model.fit(
        train_ds,
        epochs=15,
        callbacks = [cnn1d.checkpoint_callback],
    )


Epoch 1/15


ValueError: The total size of the tensor must be unchanged. Received: input_shape=(39,), target_shape=(1, 40)

In [7]:
cnn1d_model.summary()

cnn1d_model.evaluate(test_ds)

[1m105506/105506[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m181s[0m 2ms/step - accuracy: 0.9274 - f1_score: 0.1351 - loss: nan - precision: 0.0000e+00 - recall: 0.0000e+00


  self.gen.throw(typ, value, traceback)


[nan, 0.9271121621131897, 0.0, 0.0, 0.13587220013141632]

In [9]:
import dnn


dnn_model = dnn.DNN_model()

dnn_model.fit(
        train_ds,
        epochs=15,
        callbacks = [dnn.checkpoint_callback],
    )


Epoch 1/2
 105500/Unknown [1m238s[0m 2ms/step - accuracy: 0.9269 - f1_score: 0.1345 - loss: 0.0678 - precision: 0.3228 - recall: 0.0139
Epoch 1: f1_score improved from -inf to 0.13587, saving model to saved_models/dnn.keras
[1m105506/105506[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m238s[0m 2ms/step - accuracy: 0.9269 - f1_score: 0.1345 - loss: 0.0678 - precision: 0.3228 - recall: 0.0139
Epoch 2/2


  self.gen.throw(typ, value, traceback)


[1m105500/105506[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 2ms/step - accuracy: 0.9278 - f1_score: 0.1344 - loss: 0.0684 - precision: 0.0000e+00 - recall: 0.0000e+00
Epoch 2: f1_score did not improve from 0.13587
[1m105506/105506[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m234s[0m 2ms/step - accuracy: 0.9278 - f1_score: 0.1344 - loss: 0.0684 - precision: 0.0000e+00 - recall: 0.0000e+00


<keras.src.callbacks.history.History at 0x2ae04848710>

In [10]:
dnn_model.summary()

dnn_model.evaluate(test_ds)

[1m105506/105506[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m155s[0m 1ms/step - accuracy: 0.9274 - f1_score: 0.1350 - loss: 0.0756 - precision: 0.0000e+00 - recall: 0.0000e+00


  self.gen.throw(typ, value, traceback)


[0.07589024305343628, 0.9271121621131897, 0.0, 0.0, 0.13587220013141632]

In [None]:
from tensorflow.keras.layers import Input, Dense, TimeDistributed, Concatenate

input_shape = (sequence_length, n_feautes)
input = Input(input_shape)

# making per packet models take  
td_cnn1d_model = TimeDistributed(cnn1d_model)
td_dnn_model = TimeDistributed(dnn_model)

base_classifiers = [
    cnn_lstm_model,
    td_cnn1d_model,
    td_dnn_model,
]

# disabling training of base classifiers
for bc in base_classifiers:
    bc.trainable=False

base_classifiers_outputs = [bc(input) for bc in base_classifiers]
combined_output = Concatenate()(base_classifiers_outputs)

# Logistic Regression for outputs of every timestep
y = TimeDistributed(Dense(1, activation='sigmoid'))(combined_output)

stacked_model = tf.keras.Model(input, y)

metaclassifier_epochs = 2

loss = tf.keras.losses.BinaryCrossentropy(
    from_logits=True
)

stacked_model.compile(
    optimizer = 'adam',
    loss = 'binary_crossentropy',
    metrics=['accuracy', 'precision', 'recall']
)


In [13]:
stacked_model.fit(
    train_seq_ds,
    epochs=metaclassifier_epochs,
)

Epoch 1/2
[1m795/795[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m253s[0m 80ms/step - accuracy: 0.8273 - loss: nan - precision: 0.0000e+00 - recall: 0.0000e+00
Epoch 2/2


  self.gen.throw(typ, value, traceback)


[1m795/795[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m245s[0m 78ms/step - accuracy: 0.8265 - loss: nan - precision: 0.0000e+00 - recall: 0.0000e+00


<keras.src.callbacks.history.History at 0x2ae06940710>

In [15]:
stacked_model.summary()

stacked_model.evaluate(test_seq_ds)

[1m164/164[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m57s[0m 83ms/step - accuracy: 0.8321 - loss: nan - precision: 0.0000e+00 - recall: 0.0000e+00


  self.gen.throw(typ, value, traceback)


[nan, 0.8313971757888794, 0.0, 0.0]