# Introduction
I developed a custom model to generate network packet data and designed its full architecture myself. The model has an encoder that encodes the dataset's columns, and it generates packet data conditioned on any given column. It is implemented entirely in TensorFlow, and all preprocessing steps are included in this notebook. The dataset has over 100k samples and 32 columns.

After training, the model was used to generate over 300,000 unique network packet data points with minimal error.

In [None]:
import tensorflow as tf
import numpy as np
import pandas as pd
import os
from tensorflow.keras.layers.experimental import preprocessing
import time
from tensorflow.keras.layers import Dropout

## Setup
The code below changes the working directory to FILE_PATH, a directory on my Google Drive where I store all the resources and the model that this notebook needs. To make this work for you, create a directory, upload the 'output.csv' file into it, and set FILE_PATH to the path of that directory.

All the code in this notebook has been carefully written and documented.

In [None]:
FILE_PATH = '/content/drive/MyDrive/Deep Learning/Dataset generation'
os.chdir(FILE_PATH)

packet_data = pd.read_csv('output.csv')

In [None]:
packet_data.head(10)

Unnamed: 0,ip.id,ip.flags,ip.flags.rb,_ws.col.Protocol,ip.flags.df,ip.flags.mf,ip.frag_offset,ip.ttl,ip.proto,ip.checksum,ip.src,ip.dst,ip.len,ip.dsfield,tcp.srcport,tcp.dstport,tcp.seq,tcp.ack,tcp.len,tcp.hdr_len,tcp.flags,tcp.flags.fin,tcp.flags.syn,tcp.flags.reset,tcp.flags.push,tcp.flags.ack,tcp.flags.urg,tcp.flags.cwr,tcp.window_size,tcp.checksum,tcp.urgent_pointer,tcp.options.mss_val
0,0x00009314,0x00004000,0.0,MPEG PES,1.0,0.0,0.0,64.0,17.0,0x0000d5d2,10.0.0.23,192.168.2.7,1344.0,0x00000000,,,,,,,,,,,,,,,,,,
1,0x000047d3,0x00004000,0.0,TCP,1.0,0.0,0.0,64.0,6.0,0x0000272a,10.0.0.16,192.168.1.7,60.0,0x00000000,32968.0,1883.0,0.0,0.0,0.0,40.0,0x00000002,0.0,1.0,0.0,0.0,0.0,0.0,0.0,29200.0,0x000073e5,0.0,1460.0
2,0x0000bb2e,0x00004000,0.0,TCP,1.0,0.0,0.0,64.0,6.0,0x0000b3d4,10.0.0.10,192.168.1.7,60.0,0x00000000,42990.0,1883.0,0.0,0.0,0.0,40.0,0x00000002,0.0,1.0,0.0,0.0,0.0,0.0,0.0,29200.0,0x0000d261,0.0,1460.0
3,0x00000000,0x00004000,0.0,TCP,1.0,0.0,0.0,63.0,6.0,0x00006ffd,192.168.1.7,10.0.0.16,60.0,0x00000000,1883.0,32968.0,0.0,1.0,0.0,40.0,0x00000012,0.0,1.0,0.0,0.0,1.0,0.0,0.0,28960.0,0x0000f36b,0.0,1460.0
4,0x00000000,0x00004000,0.0,TCP,1.0,0.0,0.0,63.0,6.0,0x00007003,192.168.1.7,10.0.0.10,60.0,0x00000000,1883.0,42990.0,0.0,1.0,0.0,40.0,0x00000012,0.0,1.0,0.0,0.0,1.0,0.0,0.0,28960.0,0x00002858,0.0,1460.0
5,0x0000bb2f,0x00004000,0.0,TCP,1.0,0.0,0.0,64.0,6.0,0x0000b3db,10.0.0.10,192.168.1.7,52.0,0x00000000,42990.0,1883.0,1.0,1.0,0.0,32.0,0x00000010,0.0,0.0,0.0,0.0,1.0,0.0,0.0,29200.0,0x0000c120,0.0,
6,0x000047d4,0x00004000,0.0,TCP,1.0,0.0,0.0,64.0,6.0,0x00002731,10.0.0.16,192.168.1.7,52.0,0x00000000,32968.0,1883.0,1.0,1.0,0.0,32.0,0x00000010,0.0,0.0,0.0,0.0,1.0,0.0,0.0,29200.0,0x00008c33,0.0,
7,0x000047d5,0x00004000,0.0,MQTT,1.0,0.0,0.0,64.0,6.0,0x00002700,10.0.0.16,192.168.1.7,100.0,0x00000000,32968.0,1883.0,1.0,1.0,48.0,32.0,0x00000018,0.0,0.0,0.0,1.0,1.0,0.0,0.0,29200.0,0x00001194,0.0,
8,0x0000bb30,0x00004000,0.0,MQTT,1.0,0.0,0.0,64.0,6.0,0x0000b3aa,10.0.0.10,192.168.1.7,100.0,0x00000000,42990.0,1883.0,1.0,1.0,48.0,32.0,0x00000018,0.0,0.0,0.0,1.0,1.0,0.0,0.0,29200.0,0x00004284,0.0,
9,0x00005cc5,0x00004000,0.0,TCP,1.0,0.0,0.0,63.0,6.0,0x00001340,192.168.1.7,10.0.0.16,52.0,0x00000000,1883.0,32968.0,1.0,49.0,0.0,32.0,0x00000010,0.0,0.0,0.0,0.0,1.0,0.0,0.0,28960.0,0x00008c12,0.0,


Examining the data, there are 32 columns and 112,373 samples in the dataset, as we can see below.

In [None]:
len(packet_data)

112373

## Model Architecture
The model uses an LSTM as its core generative component and generates one character at every time step.

Given the nature of the dataset (32 columns), there are two ways to model the data:

- A basic (vanilla) generator system.
- An encoder-based generator system.

In both cases, the system attempts to learn the probability distribution that produced the real samples in the dataset.

### Basic generator system

Each sample in the dataset is modelled as one long sequence by concatenating its column values. During concatenation, a unique separator (e.g. ' ') is placed between column values. Each column value can be viewed as a word in a full sentence (the result of the full concatenation).
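As a minimal sketch of this concatenation, the helper below joins one row into a single character sequence. The name `row_to_sequence` and the `|` separator are assumptions made for illustration. A space would not be a safe separator for this dataset, because values such as 'MPEG PES' already contain one.

```python
# Hypothetical helper, not part of the notebook's pipeline.
SEPARATOR = "|"  # assumed separator; it must not occur inside any column value


def row_to_sequence(row_values, separator=SEPARATOR):
    """Join the string form of every column value and split it into characters."""
    text = separator.join(str(value) for value in row_values)
    return list(text)


row = ["TCP", 64.0, "0x00000002"]
sequence = row_to_sequence(row)
print("".join(sequence))  # TCP|64.0|0x00000002
print(len(sequence))      # 19
```

A full row of 32 columns produces a much longer sequence, which is one reason this approach is harder to train.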

This model then predicts the next character in the sequence given the previous characters, and the prediction error is used to update the model weights. This architecture has:
- An embedding layer: converts every character to a vector (a numerical representation).

- An LSTM: takes the previous embedded character as input and outputs a prediction of the next character at every time step.

- A fully connected layer: makes the final prediction over all tokens in the model's vocabulary given the LSTM's output.
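The next-character objective can be illustrated by pairing each character with the one that follows it. This is only a sketch of how the training targets line up. The helper name `next_char_pairs` and the `|` separator in the example are assumptions for illustration.

```python
def next_char_pairs(sequence):
    """Return (input, target) pairs: each character is used to predict the next one."""
    return list(zip(sequence[:-1], sequence[1:]))


pairs = next_char_pairs(list("TCP|6"))
print(pairs)  # [('T', 'C'), ('C', 'P'), ('P', '|'), ('|', '6')]
```

A sequence of n characters therefore yields n - 1 training predictions.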

The downside of this architecture is that it introduces a lot of uncertainty and error during training and testing. The model is forced to handle longer character sequences and to generalize over all columns at once: it learns a single character distribution that does not depend on which column is being generated. Each column, however, only admits a particular set of characters. For example, some columns are plain numbers (e.g. 1.0), others mix letters and digits (e.g. 0x00000014), and others are capital letters only (e.g. TCP). Because nothing constrains or conditions the output on the current column, every time step has more possible outputs than necessary, and the error increases.

### Encoder-based generator system

This model is made up of:
- An encoder: the columns are represented using one-hot encoding and then passed through a 2-layer neural network to obtain a unique representation for each column (logit values, i.e. the linear output of the last layer).

- An LSTM: takes as input both this column representation and the embedding of the previous character.

- A fully connected layer: makes the final prediction over all tokens in the model's vocabulary given the LSTM's output.

In this architecture, the encoder constrains the model output by telling the model which column it is currently generating characters for. This in turn constrains the type of characters that the model generates. Consequently, the loss decreases, because the model learns to generate only characters that are relevant to the column it is generating for at time step t.
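To make the shapes concrete, here is a NumPy-only sketch of how the LSTM input for one time step can be assembled. It uses random stand-ins for the encoder output and the character embedding, and it also concatenates the previous hidden state, as the training code in this notebook does. The layer sizes are the defaults used by the notebook's model classes; the batch size of 4 is an arbitrary assumption.

```python
import numpy as np

batch_size = 4  # arbitrary, for illustration only
dense_units, embed_dim, lstm_units = 512, 256, 1024  # default sizes of the model classes

rng = np.random.default_rng(0)
column_encoding = rng.normal(size=(batch_size, dense_units))  # stand-in for the encoder output
hidden_state = np.zeros((batch_size, lstm_units))             # zeros at the start of a batch
char_embedding = rng.normal(size=(batch_size, embed_dim))     # stand-in for the embedding layer

# Concatenate along the feature axis, then add a time axis of length 1.
lstm_input = np.concatenate([column_encoding, hidden_state, char_embedding], axis=-1)
lstm_input = lstm_input[:, np.newaxis, :]
print(lstm_input.shape)  # (4, 1, 1792)
```

The column encoding is the same at every time step of a sequence, so the model always knows which column it is generating for.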

This is the model used in this notebook. There are two ways to train it, and both are defined as classes in the next cell:

- PacketGeneratorSlow: iterates through each column (first loop), encodes the column and then iteratively generates the sequence of characters (second loop) based on the encoded column representation. Because the hidden state (output) of the LSTM is passed on as input at every time step t + 1 and across all columns, there is some dependency between columns. However, training and data generation are slower here, and optimization may also be worse.


- PacketGeneratorModel: parallelizes data generation by removing the first loop over columns. It generates each sequence from its own encoded column representation. The hidden state of the LSTM is not carried across columns; it is passed as input only when generating the next character within a column. This method is faster, but it models no relationship or dependency between columns.
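The parallel variant relies on folding the column axis into the batch axis. The NumPy sketch below, with small made-up sizes, shows that after the reshape each flattened row still lines up with the one-hot vector of its own column. It illustrates the reshaping idea only and is not the model code.

```python
import numpy as np

batch_size, num_columns, max_length = 2, 3, 4  # toy sizes, for illustration only

# Toy integer data with shape (batch_size, num_columns, max_length).
data = np.arange(batch_size * num_columns * max_length).reshape(
    batch_size, num_columns, max_length)

# One one-hot vector per column, repeated for every sample in the batch.
one_hot = np.eye(num_columns, dtype=np.float32)
batch_categorical = np.repeat(one_hot[np.newaxis], batch_size, axis=0)

# Fold the column axis into the batch axis.
flat_data = data.reshape(batch_size * num_columns, max_length)
flat_categorical = batch_categorical.reshape(batch_size * num_columns, num_columns)

# Row k of the flattened data belongs to sample k // num_columns, column k % num_columns.
for k in range(flat_data.shape[0]):
    sample, column = divmod(k, num_columns)
    assert (flat_data[k] == data[sample, column]).all()
    assert flat_categorical[k].argmax() == column

print(flat_data.shape, flat_categorical.shape)  # (6, 4) (6, 3)
```

Every (sample, column) pair becomes an independent sequence, so one pass over the time steps covers all columns at once.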


The code for this model follows. We only use the second class during training. We use 256 units for the embedding layer, 512 units for the last encoder layer and 1024 units for the LSTM layer. These values can be tuned to get the best performance.

We also define a custom loss class that computes the loss during training while ignoring padded positions.
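The idea behind the masked loss can be sketched in plain NumPy: compute the cross-entropy for every position, zero it out where the target is the padding id (assumed to be 0 here), and average. The function name `masked_sparse_cross_entropy` is hypothetical, and this is a simplified illustration, not the Keras implementation.

```python
import numpy as np


def masked_sparse_cross_entropy(y_true, logits, pad_id=0):
    """Mean cross-entropy over all positions, with padded positions contributing zero."""
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of the true token at each position.
    per_item = -log_probs[np.arange(len(y_true)), y_true]
    # Zero the loss where the target is the padding id.
    mask = (y_true != pad_id).astype(np.float32)
    return float((per_item * mask).mean())


y_true = np.array([2, 0])  # the second target is padding
logits = np.zeros((2, 4))  # uniform prediction over a vocabulary of 4 tokens
loss = masked_sparse_cross_entropy(y_true, logits)
print(round(loss, 4))  # 0.6931
```

The mean here divides by all positions, padded ones included, so heavily padded batches produce a smaller loss value.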

In [None]:
# Define the loss class responsible for calculating the error loss
class MaskedLoss(tf.keras.losses.Loss):
  def __init__(self):
    self.name = 'masked_loss'
    # Initialize categorical cross entropy class:
    self.loss = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction='none')

  def __call__(self, y_true, y_pred):
    # Calculate the loss for each item in the batch.
    loss = self.loss(y_true, y_pred)

    # Mask off the losses on the elements that were padded in the sequences.
    mask = tf.cast(y_true != 0, tf.float32)
    loss *= mask

    # Return the mean loss.
    return tf.reduce_mean(loss)



# First class method:
class PacketGeneratorSlow(tf.keras.Model):
  def __init__(self,vocab_size, categorical_columns, char_id, id_char,
               dense_units=512, embed_dim=256, lstm_units=1024, dropout=0.15):
    # Initialize parent class:
    super().__init__()
    # save character look up objects:
    self.char_id = char_id
    self.id_char = id_char
    # Defines number of lstm units:
    self.units = lstm_units
    # Store the one-hot encoding representation of the 32 columns in the dataset:
    self.cat_columns = tf.expand_dims(categorical_columns ,axis=0)
    # Get the number of columns in the dataset which is 32 in this case:
    self.num_of_columns = categorical_columns.shape[0]
    # Define the embedding layer:
    self.embedding = tf.keras.layers.Embedding(vocab_size, embed_dim)
    # Define LSTM layer:
    self.lstm = tf.keras.layers.LSTM(lstm_units,
                                   return_sequences=True,
                                   return_state=True)
    # Define a second lstm layer
    # self.lstm2 = tf.keras.layers.LSTM(lstm_units,
    #                                return_sequences=True,
    #                                return_state=True)
    # Define the 2- layered neural network that encodes the column representation
    self.d1 = tf.keras.layers.Dense(256)
    self.d2 = tf.keras.layers.Dense(dense_units,)
    # Define dropout layer that will be used after the lstm layer
    self.drop_out = Dropout(dropout)
    #self.dense = tf.keras.layers.Dense(dense_units)
    # Define the fully connected layer that makes the prediction for each character at a given time step:
    self.fc2 = tf.keras.layers.Dense(vocab_size)

  # Define a function that resets the hidden_state of the lstm model every time it is fed a new batch of samples to learn from:
  def reset_state(self, batch_size):
    return tf.zeros((batch_size, self.units))

  # Define the training function: it takes the dataset as input and calculates all its requirements dynamically:
  @tf.function
  def train_step(self,data):
    # Define loss variable:
    loss = tf.constant(0.0)
    # Define loss variable for every column:
    category_loss= tf.constant(0.0)
    # data shape: (batch size, number of columns, max_length of sequences):
    # Get batch_size dynamically from the data fed:
    batch_size = tf.shape(data)[0]
    # Get the maximum length of the sequences from the data fed:
    max_length = tf.shape(data)[-1]
    # Create a dynamic column one-hot encoding for each sample in the batch using the self.cat_columns variable defined during initialization of the model's class:
    # batch_categorical shape : (batch_size, number of columns , number of columns )
    batch_categorical = tf.repeat(self.cat_columns, batch_size, axis=0)

    # Now, train model and update weights with respect to the total_loss per batch.
    with tf.GradientTape() as tape:
      # Reset hidden state of the lstm model for every batch:
      # This ensures that prediction between samples are not dependent.
      hidden = self.reset_state(batch_size)
      #hidden2 = self.reset_state(batch_size)
      # For each column in the batch calculate the encoding representation:
      for i in range(self.num_of_columns):
        # Each column
        column = batch_categorical[:,i]
        # Calculate the encoding using the 2-layered network:
        logit = self.d1(column)
        # logit shape == (batch_size, dense_units)
        logit = self.d2(logit)
        #Get the first character in that column for the current batch of samples:
        x = data[:,i,0]
        # Predict the next character; iterate through all characters:
        for j in range(1,max_length):
          # Get embedding representation of the index character:
          # x shape == (batch_size, embed_dim)
          x = self.embedding(x)
          # Concatenate the column encoding representation and vector representation of the previous character.
          # The 2 will be the input to the lstm model at every time step t:
          # x shape == (batch_size, 1, dense_units + lstm_units + embed_dim units)
          x = tf.expand_dims(tf.concat((logit, hidden, x), axis=-1) , axis=1)
          #  Pass as input to the lstm model: output shape == (batch_size, 1, lstm_units)
          output, hidden, cell_state = self.lstm(x)
          #output, hidden, _ = self.lstm2(output)
          # Use dropout:
          output = self.drop_out(output)
          #output = self.dense(output)
          # Make predictions (logits) over the vocabulary using the final fully connected layer given the output of the lstm layer:
          # Predictions shape == (batch_size, 1, vocab_size)
          predictions = self.fc2(output)

          # Calculate loss on each character and add to the previous value of the loss variable:
          loss  = loss + self.loss(data[:,i,j],predictions)

          # Using teacher forcing:
          # Teacher forcing continually feeds the next correct character to the model
          # instead of passing what the model predicted back into the model.
          x = data[:, i,j]

        # Average the loss over all characters in the current column category.
        category_loss = category_loss  + loss/ tf.cast(max_length, tf.float32)
      # Calculate total loss over all columns:
      total_loss = category_loss/ tf.cast(self.num_of_columns, tf.float32)

    # Apply an optimization step
    variables = self.trainable_variables 
    gradients = tape.gradient(total_loss, variables)
    # Apply error gradients to the full model:
    self.optimizer.apply_gradients(zip(gradients, variables))

    # Return a dict mapping metric names to current value
    return {'batch_loss': total_loss}

# The second method: parallelizes the computation of the predictions.
# It is largely the same as the one above; the main difference is in the train step function.
# The train step function of this model leads to faster training and data generation.
class PacketGeneratorModel(tf.keras.Model):
    def __init__(self,vocab_size, categorical_columns, char_id , id_char, 
                 dense_units=512, embed_dim=256, lstm_units=1024, dropout=0.12):
      super().__init__()
       # save character look up objects:
      self.char_id = char_id
      self.id_char = id_char
      self.units = lstm_units
      self.cat_columns = tf.expand_dims(categorical_columns ,axis=0)
      self.num_of_columns = categorical_columns.shape[0]
      self.embedding = tf.keras.layers.Embedding(vocab_size, embed_dim)
      self.lstm = tf.keras.layers.LSTM(lstm_units,
                                    return_sequences=True,
                                    return_state=True)
      # self.lstm2 = tf.keras.layers.LSTM(lstm_units,
      #                                return_sequences=True,
      #                                return_state=True)
      self.d1 = tf.keras.layers.Dense(256, activation='relu')
      self.d2 = tf.keras.layers.Dense(dense_units)
      self.drop_out = Dropout(dropout)
      #self.dense = tf.keras.layers.Dense(dense_units)
      self.fc2 = tf.keras.layers.Dense(vocab_size)

    def reset_state(self, batch_size):
      return tf.zeros((batch_size, self.units))

    @tf.function
    def train_step(self,data):
      #print(data.shape)
      # Define loss variable
      loss = tf.constant(0.0)
      # Define total loss variable
      total_loss= tf.constant(0.0)
      # data shape: (batch size, number of columns, max_length of sequences):
      # Get batch size and maximum sequence length dynamically:
      batch_size = tf.shape(data)[0]
      max_length = tf.shape(data)[-1]
      # batch_categorical shape : (batch_size, number of columns , number of columns )
      # Propagate/Extend the one-hot encoding of the columns to fit the number of samples in the batch:
      # batch_categorical shape == (batch_size , num of columns , num of columns)
      batch_categorical = tf.repeat(self.cat_columns, batch_size, axis=0)
      # Reshape for parallelization: new shape == (batch_size * number of columns in the dataset , number of columns in the dataset (one-hot encoding size))
      batch_categorical = tf.reshape(batch_categorical, (batch_size * self.num_of_columns, self.num_of_columns))
      # Also reshape the batched data for parallelization.
      # New data shape == (batch_size * number of columns in the dataset, maximum length of sequences):
      # This way, the prediction of character sequences in each column can be calculated and optimized at once!
      # Data shape == (batch_size * num of columns, maximum length of characters)
      data = tf.reshape(data, (batch_size * self.num_of_columns, max_length))

      with tf.GradientTape() as tape:
        # Reset hidden state of lstm model per batch:
        hidden = self.reset_state(batch_size*self.num_of_columns)
        #hidden2 = self.reset_state(batch_size)
        # Calculate the Encoding representation of all column category in the batch at once:
        logit = self.d1(batch_categorical)
        # logit shape == (batch_size * number of columns, dense units)
        logit = self.d2(logit)
        # Get the first character in all columns for all samples in the batch:
        x = data[:,0]
        # Iterate through subsequent characters whilst predicting the next character and calculating the error loss:
        for i in range(1,max_length):
          # Embed character : x shape == (batch_size * num of columns, embed_dim)
          x = self.embedding(x)
          # x shape == (batch_size * num of columns, 1, dense_units + lstm_units + embed_dim)
          x = tf.expand_dims(tf.concat((logit, hidden, x), axis=-1) , axis=1)
          # Pass to lstm: output shape == (batch_size * num of columns, 1, lstm_units)
          output, hidden, _= self.lstm(x)
          #output, hidden, _ = self.lstm2(output)
          # Dropout some units:
          output = self.drop_out(output)
          #output = self.dense(output)
          # Make predictions on lstm output over vocabulary size:
          predictions = self.fc2(output)

          # Calculate loss for each index character in all column categories at the same time:
          loss  = loss + self.loss(data[:,i],predictions)

          # Using teacher forcing:
          # Teacher forcing continually feeds the next correct character to the model
          # instead of passing what the model predicted back into the model.
          x = data[:, i]

        # Calculate total loss over all categorical columns:
        total_loss = total_loss + loss/ tf.cast(max_length, tf.float32)

      # Apply an optimization step (outside the tape context, so the gradient computation itself is not recorded):
      variables = self.trainable_variables
      gradients = tape.gradient(total_loss, variables)
      self.optimizer.apply_gradients(zip(gradients, variables))

      # Return a dict mapping metric names to current value
      return {'batch_loss': total_loss}

## Data Exploration

Based on our model architecture, let's get the number of unique values in each column. This tells us which columns can be modelled as a categorical feature and which as a 'continuous feature'. In this notebook, a column with at most 10 unique values is treated as a categorical feature (generated as a class); otherwise it is treated as a 'continuous feature' (generated as a character sequence). We do this to reduce the error incurred during training.

For example, the _ws.col.Protocol column is treated as a categorical column: instead of training the model to generate the full sequence 'M', 'Q', 'T', 'T' to get 'MQTT', we make it generate the class 'MQTT' in one step.
This way, at most one error can occur per prediction (for example, generating 'TCP' instead of 'MQTT'), as opposed to a possible wrong character at every position of the sequence 'MQTT'.
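The difference between the two treatments can be sketched with a small tokenizer. The function name `tokenize` is hypothetical, and the '<' and '>' markers are the start and end markers this notebook wraps around every value during preprocessing.

```python
def tokenize(value, is_categorical):
    """Wrap a value in start/end markers, as one token or as single characters."""
    text = str(value)
    body = [text] if is_categorical else list(text)
    return ["<"] + body + [">"]


print(tokenize("MQTT", is_categorical=True))   # ['<', 'MQTT', '>']
print(tokenize("MQTT", is_categorical=False))  # ['<', 'M', 'Q', 'T', 'T', '>']
print(tokenize(1883.0, is_categorical=False))  # ['<', '1', '8', '8', '3', '.', '0', '>']
```

A categorical value always occupies three positions, whatever its length, while a sequence value needs one prediction per character.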

The threshold of 10 is a heuristic and can be raised or lowered to tune model performance.

First, we extract all the column names into a list, then count the unique values in each column and store the counts in a dictionary.



In [None]:
# Get the number of unique values for each column
columns = list(packet_data.columns)
# Create a dictionary to store the number of unique values per column as key, value pairs:
unique_values_per_column = { col : 0 for col in columns}

# For each column, calculate and store the number of unique values:
for column in columns : 
  unique_values_per_column[column] = len(packet_data[column].unique())

print(len(columns))
unique_values_per_column

32


{'_ws.col.Protocol': 8,
 'ip.checksum': 52643,
 'ip.dsfield': 19,
 'ip.dst': 23,
 'ip.flags': 3,
 'ip.flags.df': 3,
 'ip.flags.mf': 2,
 'ip.flags.rb': 2,
 'ip.frag_offset': 2,
 'ip.id': 52444,
 'ip.len': 74,
 'ip.proto': 4,
 'ip.src': 21,
 'ip.ttl': 27,
 'tcp.ack': 68,
 'tcp.checksum': 46175,
 'tcp.dstport': 5526,
 'tcp.flags': 10,
 'tcp.flags.ack': 3,
 'tcp.flags.cwr': 2,
 'tcp.flags.fin': 3,
 'tcp.flags.push': 3,
 'tcp.flags.reset': 3,
 'tcp.flags.syn': 3,
 'tcp.flags.urg': 3,
 'tcp.hdr_len': 6,
 'tcp.len': 58,
 'tcp.options.mss_val': 3,
 'tcp.seq': 143,
 'tcp.srcport': 5677,
 'tcp.urgent_pointer': 2,
 'tcp.window_size': 27}

From the result above, we extract the so-called 'categorical columns' and store them in the variable 'cat_columns' below. We only include the columns that have at most 10 unique values.

In [None]:
cat_columns= ['_ws.col.Protocol', 'ip.flags',  'ip.flags.df',  'ip.flags.mf',  'ip.flags.rb',  'ip.frag_offset',  'ip.proto',
               'tcp.flags', 'tcp.flags.ack', 'tcp.flags.cwr', 'tcp.flags.fin', 'tcp.flags.push', 'tcp.flags.reset', 'tcp.flags.syn', 'tcp.flags.urg', 
              'tcp.hdr_len', 'tcp.options.mss_val', 'tcp.urgent_pointer']
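The hand-written list above could also be derived from the unique-value counts. The sketch below applies the threshold to a small excerpt of the counts reported earlier. The names `MAX_CATEGORICAL_UNIQUE` and `unique_counts` are assumptions for illustration.

```python
MAX_CATEGORICAL_UNIQUE = 10  # the heuristic threshold used in this notebook

# Excerpt of the unique-value counts reported earlier.
unique_counts = {
    '_ws.col.Protocol': 8,
    'ip.checksum': 52643,
    'ip.dsfield': 19,
    'tcp.flags': 10,
    'tcp.hdr_len': 6,
    'tcp.seq': 143,
}

categorical = sorted(
    col for col, count in unique_counts.items() if count <= MAX_CATEGORICAL_UNIQUE)
print(categorical)  # ['_ws.col.Protocol', 'tcp.flags', 'tcp.hdr_len']
```

Deriving the list this way keeps it in sync with the data if the threshold is changed later.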

Intuitively, for the first model class defined above, starting data generation with one of the categorical columns may increase the diversity and quality of the generated data.

Therefore, I use 'tcp.flags' as the first column in the dataset. This step can be skipped entirely, as it may have very little effect on model performance. For the second model, this step is not required.

In [None]:
index = columns.index('tcp.flags')
columns.pop(index)

'tcp.flags'

In [None]:
index

20

In [None]:
# Insert 'tcp.flags' at the first position of the list.
columns.insert(0,'tcp.flags')

In [None]:
print(columns)

['tcp.flags', 'ip.id', 'ip.flags', 'ip.flags.rb', '_ws.col.Protocol', 'ip.flags.df', 'ip.flags.mf', 'ip.frag_offset', 'ip.ttl', 'ip.proto', 'ip.checksum', 'ip.src', 'ip.dst', 'ip.len', 'ip.dsfield', 'tcp.srcport', 'tcp.dstport', 'tcp.seq', 'tcp.ack', 'tcp.len', 'tcp.hdr_len', 'tcp.flags.fin', 'tcp.flags.syn', 'tcp.flags.reset', 'tcp.flags.push', 'tcp.flags.ack', 'tcp.flags.urg', 'tcp.flags.cwr', 'tcp.window_size', 'tcp.checksum', 'tcp.urgent_pointer', 'tcp.options.mss_val']


Let's get the data type of each column in the dataset. All the columns have a datatype of 'float64' except for 9 columns, e.g. '_ws.col.Protocol', 'ip.flags' etc., which have the 'object' datatype.

We use this understanding to record the datatype of each column.

In [None]:
# Create a list to store the datatypes of each column:
columns_dtype = []
for col in columns:
  # For each column, if it is among the columns with a 'float64' datatype, store 'float64' in the list, else store 'object'
  if col not in ('_ws.col.Protocol', 'ip.flags', 'tcp.flags', 'ip.id',
                 'ip.checksum', 'ip.src', 'ip.dst', 'ip.dsfield', 'tcp.checksum'):
    columns_dtype.append( 'float64')
  else:
    columns_dtype.append('object')

In [None]:
len(columns)

32

Let's check the number of empty cells for each column.

In [None]:
packet_data.isna().sum()

ip.id                      1
ip.flags                   1
ip.flags.rb                1
_ws.col.Protocol           0
ip.flags.df                1
ip.flags.mf                1
ip.frag_offset             1
ip.ttl                     1
ip.proto                   1
ip.checksum                1
ip.src                     1
ip.dst                     1
ip.len                     1
ip.dsfield                 1
tcp.srcport             7290
tcp.dstport             7290
tcp.seq                 7290
tcp.ack                 7290
tcp.len                 7354
tcp.hdr_len             7290
tcp.flags               7290
tcp.flags.fin           7290
tcp.flags.syn           7290
tcp.flags.reset         7290
tcp.flags.push          7290
tcp.flags.ack           7290
tcp.flags.urg           7290
tcp.flags.cwr           7290
tcp.window_size         7290
tcp.checksum            7290
tcp.urgent_pointer      7290
tcp.options.mss_val    80886
dtype: int64

## Preprocess data
Some of the samples in the dataset are very incomplete, while others have only a few missing cells, as seen above. The 'tcp.options.mss_val' column alone has 80,886 missing cells. Therefore, to keep as many samples as possible, we fill the null values in each column with that column's mode instead of deleting the rows that contain an empty cell.

In [None]:
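mode_values = { col : packet_data[col].mode()[0] for col in columns}
for column in columns :
  # Assign the result back instead of using inplace=True on a column selection,
  # which may not modify the DataFrame in recent pandas versions:
  packet_data[column] = packet_data[column].fillna(mode_values[column])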
  
packet_data.isna().sum()

ip.id                  0
ip.flags               0
ip.flags.rb            0
_ws.col.Protocol       0
ip.flags.df            0
ip.flags.mf            0
ip.frag_offset         0
ip.ttl                 0
ip.proto               0
ip.checksum            0
ip.src                 0
ip.dst                 0
ip.len                 0
ip.dsfield             0
tcp.srcport            0
tcp.dstport            0
tcp.seq                0
tcp.ack                0
tcp.len                0
tcp.hdr_len            0
tcp.flags              0
tcp.flags.fin          0
tcp.flags.syn          0
tcp.flags.reset        0
tcp.flags.push         0
tcp.flags.ack          0
tcp.flags.urg          0
tcp.flags.cwr          0
tcp.window_size        0
tcp.checksum           0
tcp.urgent_pointer     0
tcp.options.mss_val    0
dtype: int64
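The effect of mode filling can be seen on a toy column. This is a pandas-free sketch in which `None` stands in for a missing cell. The helper name `fill_with_mode` and the values are made up for illustration.

```python
from collections import Counter


def fill_with_mode(values):
    """Replace missing entries (None) with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    mode_value = Counter(observed).most_common(1)[0][0]
    return [mode_value if v is None else v for v in values]


toy_column = [1460.0, None, 1460.0, 1412.0, None]
filled = fill_with_mode(toy_column)
print(filled)  # [1460.0, 1460.0, 1460.0, 1412.0, 1460.0]
```

For a column such as 'tcp.options.mss_val', where roughly 72% of the cells were missing (80,886 of 112,373), the filled column is dominated by the mode. That keeps every row, but it reduces the variety the model sees in that column.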

Empty cells have been successfully dealt with. Now, we create a one hot encoding of the different columns in the dataset.

In [None]:
one_hot_columns = tf.keras.utils.to_categorical(range(len(columns)))
one_hot_columns

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]], dtype=float32)
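One-hot encoding the indices 0 to n - 1 yields the n x n identity matrix, so the same array can be reproduced in plain NumPy. This is a small check, assuming 32 columns as in this dataset.

```python
import numpy as np

num_columns = 32  # number of columns in this dataset

# Row i is the one-hot vector for column i.
one_hot = np.eye(num_columns, dtype=np.float32)

print(one_hot.shape)         # (32, 32)
print(int(one_hot[5].argmax()))  # 5
```

Each row identifies exactly one column, which is what the encoder network receives as input.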

Now, we preprocess the dataset by creating a list of columns. Each column is a list of the values it contains across the rows of the dataset.

In the process, the code below also calculates the maximum sequence length and extracts the unique tokens (for both 'categorical' and 'continuous' columns) that make up the vocabulary. We also prepend '<' and append '>' to each value in the respective columns. They serve as the 'start' and 'end' markers during training.

In [None]:
# Create variable to hold the maximum length of sequences:
max_length = 0
# Define the vocabulary container:
vocab = set()
# list of columns:
train_data = [[] for i in range(len(columns))]

# This function does not treat the categorical column values as just one whole character:
def extract_data(column, col, index, max_len):
  # For each value in the column, prepend '<' and append '>'.
  for each in col:
    # change to a list of characters first:
    char_list = list(str(each))
    # prepend '<' and append '>' to the list
    char_list.insert(0,'<')
    char_list.append('>')
    # Extract unique characters and store them in the vocabulary:
    for i in char_list:
      vocab.add(i)
    # Store processed values appropriately in the list of columns container.
    train_data[index].append(char_list)
    # Get the maximum length of sequences across all rows and columns:
    max_len = max(max_len, len(char_list))
  # Return the maximum length:
  return max_len
    
# This function treats the categorical column values as one whole character as discussed earlier:
def extract_data2(column, col, index, max_len):
  # If column is a categorical column, do this:
  if column in cat_columns:
    # Just append '>' and prepend '<' to the value and store in the list of columns container:
    for each in col:
        temp = []
        temp.append('<')
        temp.append(str(each))
        temp.append('>')
        train_data[index].append(temp)
        #items_list.append(each)
        # Store unique values in vocabulary container.
        vocab.add(str(each))
  # Else:
  else:
    # Else just do exactly what the first function does.
    for each in col:
      char_list = list(str(each))
      char_list.insert(0,'<')
      char_list.append('>')
      for i in char_list:
        vocab.add(i)
      train_data[index].append(char_list)
      max_len = max(max_len, len(char_list))
  return max_len
  
# Generate processed dataset here:
for index, column in enumerate(columns):
  max_length = extract_data2(column,packet_data[column],index, max_length)

In [None]:
len(vocab)

50

The maximum sequence length is 14 and the vocabulary contains 50 unique tokens, as shown below. Let's examine the contents of the vocabulary. In this notebook, no foreign letter/character was added: every token in the vocabulary comes from the dataset.

Adding some extra letters might increase the variety of the generated data.

In [None]:
max_length, len(vocab)

(14, 50)

In [None]:
#{'.',  '0',  '0.0',  '0x00000000',  '0x00000002',  '0x00000004',  '0x00000010',  '0x00000011',   '0x00000012',  '0x00000014',  '0x00000018',  '0x00000019',
# '0x00000029',  '0x00004000',  '1',  '1.0',  '1460.0',  '17.0',  '2',  '20.0',  '24.0',  '265.0',  '3',  '32.0',  '4',  '40.0',  '44.0',  '5',  '6',  '6.0',
# '7',  '8',  '9',  '<',  '>',  'DNS',  'ICMP',  'MDNS',  'MPEG PES',  'MPEG TS',  'MQTT',  'TCP',  'UDP',  'a',  'b',  'c',  'd',  'e',  'f',  'x'}


print(vocab)

{'17.0', 'f', '20.0', '6', '0x00000012', 'TCP', '0x00000011', '0x00004000', 'MPEG TS', '32.0', '4', 'MDNS', '0x00000004', '<', 'a', 'ICMP', '24.0', 'd', '0x00000019', '9', '3', '0x00000000', '6.0', '1460.0', 'c', '0x00000010', '40.0', '265.0', 'UDP', '2', '0', 'MPEG PES', '>', '0.0', 'x', 'MQTT', '44.0', '5', '1.0', 'b', '0x00000029', '0x00000002', '0x00000018', '0x00000014', 'e', '7', '8', '1', '.', 'DNS'}


Next, we define two separate StringLookup objects that (1) convert the tokens to unique integer values and (2) convert these integer values back to their respective tokens. We also add a pad value (i.e. mask_token) that represents the padding token for all the sequences.

This is done because machine learning models operate on numbers, not strings.

In [None]:
# object converts characters to indexes.
ids_from_chars = tf.keras.layers.StringLookup(
    vocabulary=list(vocab), mask_token='<p>')

# object converts unique integers to their respective characters.
chars_from_ids = tf.keras.layers.StringLookup(
    vocabulary=list(vocab), invert=True, mask_token='<p>')

print('MQTT :- ',ids_from_chars(['MQTT']).numpy()[0])
print(' 37 :- ', chars_from_ids([37]).numpy()[0])

MQTT :-  37
 37 :-  b'MQTT'
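
For intuition, the two lookups behave like a pair of dictionaries. The sketch below is a plain-Python stand-in, not the Keras layer itself. It assumes the layer's default layout, in which index 0 is the mask token and index 1 is the out-of-vocabulary token '[UNK]', followed by the vocabulary in the order given. The toy vocabulary and the names `build_lookups`, `encode` and `decode` are illustrative only.

```python
def build_lookups(vocabulary, mask_token='<p>', oov_token='[UNK]'):
    """Return (char_to_id, id_to_char) dictionaries for a list of tokens."""
    tokens = [mask_token, oov_token] + list(vocabulary)
    char_to_id = {tok: i for i, tok in enumerate(tokens)}
    id_to_char = {i: tok for tok, i in char_to_id.items()}
    return char_to_id, id_to_char


# Sorting gives the same ids on every run (a toy vocabulary, not the real one).
toy_vocab = sorted({'<', '>', 'TCP', 'MQTT', '0', '1'})
char_to_id, id_to_char = build_lookups(toy_vocab)


def encode(tokens):
    # Unknown tokens fall back to the '[UNK]' id.
    return [char_to_id.get(t, char_to_id['[UNK]']) for t in tokens]


def decode(ids):
    return [id_to_char[i] for i in ids]


encoded = encode(['<', 'MQTT', '>', '<p>', 'z'])
print(encoded)
print(decode(encoded))
```

One design note: `vocab` is a Python set, so the order of `list(vocab)` may differ between sessions. Sorting the vocabulary, as in the sketch, would keep the ids stable when previously saved weights are loaded.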


Next, we pad each sequence with the pad token '<p>' so that they all have the same length.

In [None]:
# For each column in train_data:
for column in train_data:
  for each in column:
    # Number of tokens in this value:
    len_chars = len(each)
    # If the sequence is shorter than max_length, pad it in place with '<p>':
    if len_chars < max_length:
      each.extend(['<p>'] * (max_length - len_chars))
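
The padding step can also be written as a small function that returns a new list instead of modifying the sequence in place. This is only a sketch; `pad_tokens` is a hypothetical name that does not appear elsewhere in the notebook.

```python
def pad_tokens(tokens, target_length, pad_token='<p>'):
    """Return a copy of `tokens` right-padded with `pad_token` up to `target_length`."""
    missing = max(0, target_length - len(tokens))
    return list(tokens) + [pad_token] * missing


padded = pad_tokens(['<', 'TCP', '>'], target_length=14)
print(len(padded), padded[:4])
```

Sequences that are already at least `target_length` long are returned unchanged, which matches the behaviour of the loop above.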

In [None]:
train_data = np.array(train_data)
train_data = np.transpose(train_data,(1,0,2))
train_data.shape

(112373, 32, 14)

In [None]:
# convert the dataset to their respective integer values:
data = ids_from_chars(train_data)
data.shape

TensorShape([112373, 32, 14])

Let's examine a random row to see what the preprocessed data looks like:

In [None]:
train_data[55004]

array([['<', '0x00000010', '>', '<p>', '<p>', '<p>', '<p>', '<p>', '<p>',
        '<p>', '<p>', '<p>', '<p>', '<p>'],
       ['<', '0', 'x', '0', '0', '0', '0', 'e', 'c', '8', 'b', '>',
        '<p>', '<p>'],
       ['<', '0x00004000', '>', '<p>', '<p>', '<p>', '<p>', '<p>', '<p>',
        '<p>', '<p>', '<p>', '<p>', '<p>'],
       ['<', '0.0', '>', '<p>', '<p>', '<p>', '<p>', '<p>', '<p>', '<p>',
        '<p>', '<p>', '<p>', '<p>'],
       ['<', 'TCP', '>', '<p>', '<p>', '<p>', '<p>', '<p>', '<p>', '<p>',
        '<p>', '<p>', '<p>', '<p>'],
       ['<', '1.0', '>', '<p>', '<p>', '<p>', '<p>', '<p>', '<p>', '<p>',
        '<p>', '<p>', '<p>', '<p>'],
       ['<', '0.0', '>', '<p>', '<p>', '<p>', '<p>', '<p>', '<p>', '<p>',
        '<p>', '<p>', '<p>', '<p>'],
       ['<', '0.0', '>', '<p>', '<p>', '<p>', '<p>', '<p>', '<p>', '<p>',
        '<p>', '<p>', '<p>', '<p>'],
       ['<', '6', '4', '.', '0', '>', '<p>', '<p>', '<p>', '<p>', '<p>',
        '<p>', '<p>', '<p>'],
       ['<', '6

## Define and Train Model

Now let's initialize and train a model from the second model class definition, and then evaluate its performance informally. We train the second model first because it is faster and has less overhead during training and random generation.

In [None]:
# Define the number of epochs for both kinds of model:
epochs_model1 = 20
epochs_model2 = 2
TRAIN_BUFFER_SIZE = len(data)
BATCH_SIZE = 256


# This Dataset class shuffles the data and precreates batches for it.
dataset = tf.data.Dataset.from_tensor_slices(data).shuffle(TRAIN_BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE)
dataset = dataset.prefetch(buffer_size=tf.data.AUTOTUNE)
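
As a quick check on this pipeline: with the 112,373 rows seen in the shapes above and a batch size of 256, one epoch consists of ceil(112373 / 256) batches, and the last batch is only partly filled. The variable names in this sketch are illustrative.

```python
import math

num_rows = 112373      # number of rows in the processed dataset
batch_size = 256       # same value as BATCH_SIZE above

steps_per_epoch = math.ceil(num_rows / batch_size)
last_batch_rows = num_rows - (steps_per_epoch - 1) * batch_size
print(steps_per_epoch, last_batch_rows)
```

The step count is the number Keras shows as the total in its training progress bar.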

In [None]:
# Initialize model from the second class definition: Pass the vocabulary size and the one_hot_encoded column representation.
model = PacketGeneratorModel(len(ids_from_chars.get_vocabulary()), one_hot_columns, ids_from_chars, chars_from_ids)

In [None]:
# Compile model with the respective loss class and optimizer:
model.compile(
    optimizer=tf.optimizers.Adam(),
    loss=MaskedLoss(),
)

In [None]:
# path to save model weight checkpoint
checkpoint_path = os.path.join(FILE_PATH, "Train2")
# ckpt = tf.train.Checkpoint(model)
# ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=3)

# start_epoch = 0
# if ckpt_manager.latest_checkpoint:
#   # If you retrain this model then uncomment the next line of code:
#   #start_epoch = int(ckpt_manager.latest_checkpoint.split('-')[-1])
#   start_epoch = int(ckpt_manager.checkpoints[0].split('-')[-1])

#   # restoring the latest checkpoint in checkpoint_path
#   ckpt.restore(ckpt_manager.latest_checkpoint)
# #ckpt.restore(ckpt_manager.checkpoints[0])

In [None]:
# Load previous model weights from directory
model.load_weights(checkpoint_path)

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7f0e84615dd0>

In [None]:
# Define a Keras ModelCheckpoint callback for this training run.
# With an integer save_freq, the weights are considered for saving every BATCH_SIZE (256) batches, not after every batch:
cp = tf.keras.callbacks.ModelCheckpoint(
    checkpoint_path, monitor='batch_loss', verbose=0, save_best_only=True,
    save_weights_only=True, mode='min', save_freq= BATCH_SIZE
)

In [None]:
# Train the model 
model.fit(dataset,epochs=epochs_model2, callbacks=[cp])

Epoch 1/2
Epoch 2/2
 37/439 [=>............................] - ETA: 28:32 - batch_loss: 0.0778

KeyboardInterrupt: ignored

While training the model above, the loss dropped below 0.1 during the second epoch, so training was stopped manually, as the KeyboardInterrupt above shows. The first kind of model can also be trained with the code below. However, it was skipped because it trained slowly (about 45 mins per epoch, with a total loss of about 5.2) and optimized more poorly than the second model, given the GPU usage limit imposed on notebooks.

To train that model, uncomment and run the next 2 cells below.

However, we focus on the model trained above for the rest of the notebook.

In [None]:
# model = PacketGeneratorModel(len(ids_from_chars.get_vocabulary()), one_hot_columns, ids_from_chars, chars_from_ids)
# #model = PacketGeneratorModel(len(ids_from_chars.get_vocabulary()), one_hot_columns)
# model.compile(
#     optimizer=tf.optimizers.Adam(),
#     loss=MaskedLoss(),
# )
# checkpoint_path = os.path.join(FILE_PATH, "Train")
# #model.load_weights(checkpoint_path)
# cp = tf.keras.callbacks.ModelCheckpoint(
#     checkpoint_path, monitor='batch_loss', verbose=0, save_best_only=True,
#     save_weights_only=True, mode='min', save_freq= BATCH_SIZE
# )

In [None]:

#model.fit(dataset,epochs=epochs_model1, callbacks=[cp])


In [None]:
# Packet generator for model:
class PacketGenerator(tf.Module):
  def __init__(self, model, column_names, columns_dtype):
    super(PacketGenerator, self).__init__()
    # Initialize model, ids_from_chars, chars_from_ids, cat_columns , num_of_columns, column_names, columns_dtype:
    self.model = model
    self.char_id = model.char_id
    self.id_char = model.id_char
    self.cat_columns = self.model.cat_columns
    self.num_of_columns = self.model.num_of_columns
    self.column_names = column_names
    #self.cat_column_names = cat_column_names
    self.columns_dtype = columns_dtype

    # The output should never generate the padding, unknown, or start tokens.
    # Therefore, get the indexes of '<p>', '[UNK]' and '<' from the vocabulary:
    token_mask_ids = self.char_id(['<p>', '[UNK]', '<']).numpy()

    # Create a boolean mask over the vocabulary, initialized to False,
    # then set the indexes of '<p>', '[UNK]' and '<' to True.
    # (np.bool was removed in NumPy 1.24; the builtin bool is equivalent.)
    token_mask = np.zeros([self.char_id.vocabulary_size()], dtype=bool)
    token_mask[np.array(token_mask_ids)] = True

    # set as a property of the class
    self.token_mask = token_mask

    # Get the indexes of the start ('<') and end ('>') tokens:
    
    self.start_token = self.char_id(['<']).numpy()[0]
    self.end_token = self.char_id(['>']).numpy()[0]

  #@tf.function
  def tokens_to_text(self, result_tokens):
    # This method converts index tokens to the string tokens by using the stringlookup we initialized in the init() method:
    result_text_tokens = self.id_char(result_tokens)

    return result_text_tokens#.numpy()

  #@tf.function
  def sample(self, logits, temperature):

    # Add 2 leading axes to token_mask so it broadcasts against the logits:
    token_mask = self.token_mask[tf.newaxis, tf.newaxis, :]

    # Set the logits of all masked tokens ('<p>', '[UNK]', '<') to -inf, so they are never chosen.
    # The logits hold one unnormalized score per vocabulary token for each sequence in the batch.
    logits = tf.where(token_mask, -np.inf, logits)
    # If temperature is 0, take the argmax of the logits for each sample in the batch:
    # this is the index of the token with the highest score.
    if temperature == 0.0:
      new_tokens = tf.argmax(logits, axis=-1)
    # Else, sample from the categorical distribution defined by the temperature-scaled logits:
    else: 
      logits = tf.squeeze(logits, axis=1)
      new_tokens = tf.random.categorical(logits/temperature,
                                          num_samples=1)
    return new_tokens
    
  @tf.function
  def generate_samples(self,batch_size=500, max_length=14, temperature=1.0):

    batch_categorical = tf.repeat(self.cat_columns, batch_size, axis=0)
    batch_categorical = tf.reshape(batch_categorical, (batch_size * self.num_of_columns, self.num_of_columns))

    # Initialize the accumulators for the decoder output:
    #result_tokens = tf.TensorArray()
    
    hidden = self.model.reset_state(batch_size*self.num_of_columns)
    logit = self.model.d1(batch_categorical)
    logit = self.model.d2(logit)
    #x = data[:,i,0]
    # Initialize
    new_tokens = tf.fill((batch_size * self.num_of_columns,), self.start_token)

    # Initialize a tf array that tracks the end_token of all samples in the batch:
    done = tf.zeros([batch_size*self.num_of_columns, 1], dtype=tf.bool)

    # Initialize the accumulators for the decoder output:
    categorical_tokens = tf.TensorArray(tf.int64, size=1, dynamic_size=True)

    for t in tf.range(1,max_length):
      # Embed the index character:
      x = self.model.embedding(new_tokens)
      # Concatenate the logit, previous hidden state and index embedding representaton x:
      x = tf.expand_dims(tf.concat((logit, hidden, x), axis=-1) , axis=1)
      # Pass result through the lstm model.
      output, hidden, _ = self.model.lstm(x)
      #output = self.dense(output)
      predictions = self.model.fc2(output)

      new_tokens = self.sample(predictions, temperature)
      #print('After self.sample',new_tokens.shape)
      # If a sequence produces an `end_token`, set it `done`
      done = done | (new_tokens == self.end_token)
      # Once a sequence is done it only produces 0-padding.
      new_tokens = tf.where(done, tf.constant(0, dtype=tf.int64), new_tokens)

      # Collect the generated tokens
      categorical_tokens = categorical_tokens.write(t, new_tokens)
      new_tokens = tf.squeeze(new_tokens)


      if tf.reduce_all(done):
        break

    # Convert the list of generated token ids to a list of strings.
    categorical_tokens = categorical_tokens.stack()
    #cate
    categorical_tokens = tf.squeeze(categorical_tokens, -1)
    categorical_tokens = tf.transpose(categorical_tokens, [1, 0])
    # change shape of categorical tokens:
    #print(categorical_tokens.shape)
    categorical_tokens = tf.reshape(categorical_tokens, (batch_size, self.num_of_columns, -1))
    result_text_array = self.tokens_to_text(categorical_tokens)


    return result_text_array


def create_tabular_data(dataset, column_names, tcp_flags_index = 20, file_name = 'Model_packet_data_output', save=False):
  print('Creating tabular data...')
  # Work on a copy so the caller's list of column names is not modified between calls:
  column_names = list(column_names)
  # Create a pandas dataframe from the processed dataset:
  dataframe = pd.DataFrame(dataset, columns= column_names)
  # Restore the natural order of columns to match the original dataset:
  # move the first column name ('tcp.flags') back to its original position.
  column  = column_names.pop(0)
  column_names.insert(tcp_flags_index,column)
  dataframe = dataframe[column_names]
  # Decode all strings in each column from bytes to 'utf-8':
  for col in column_names:
    dataframe[col] = dataframe[col].str.decode('utf-8')
  # If save is True, write the dataframe to the provided file.
  print(file_name)
  if save:
    dataframe.to_csv(file_name)
  return dataframe

def process_data(dataset):
  # Replace all occurrences of '<p>', '>' and '<' with ''.
  dataset = tf.strings.regex_replace(dataset, b'<p>', b'')
  dataset = tf.strings.regex_replace(dataset, b'>', b'')
  dataset = tf.strings.regex_replace(dataset, b'<', b'')
  # In the innermost axis, join all the elements of the list together to form one long string:
  dataset = tf.strings.reduce_join(dataset, axis = -1)
  
  
  return dataset.numpy()

def generate(model,data_size, batch_size=2000, temperature=1.0, save=False, file_name='packet_data_output.csv'):
  print('Generating packet data ...')
  # If data_size < batch_size, make data_size the batch_size:
  if data_size < batch_size:
    data = model.generate_samples(batch_size=data_size, temperature =temperature)
  else:
    # Else use the batch_size to generate subsequent batches of generated datasets till the data_size is met:
    incorrect_size = True
    data = None
    while (incorrect_size):
      data = model.generate_samples(batch_size=batch_size, temperature =temperature)
      if data.shape[-1] < 14:
        continue
      else:
        incorrect_size = False
    for i in range(batch_size, data_size, batch_size):
      unequal_size = True
      while (unequal_size):
        temp_data = model.generate_samples(batch_size=batch_size, temperature=temperature)
        if temp_data.shape[-1] < 14:
          continue
        else:
          data  = tf.concat([data, temp_data], axis=0)
          unequal_size = False

  # process_data: remove '>', '<', '<p>' strings from all items:
  data = process_data(data)
  #print(data.shape)
  # Pass file_name by keyword; passed positionally it would be taken as tcp_flags_index:
  data = create_tabular_data(data, model.column_names, file_name=file_name, save=save)
  return data

# Function wraps the model in a class for tabular data generation:
def get_data_generator(model, column_names, columns_dtype):
  return PacketGenerator(model,  column_names, columns_dtype)
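
To make the sampling step in `PacketGenerator.sample` easier to follow, here is a plain-Python sketch of the same idea: block some token ids by setting their logits to -inf, then either take the argmax (temperature 0) or sample from the temperature-scaled distribution. It is a stand-in for the `tf.where` and `tf.random.categorical` calls, not the code the model runs. The name `sample_token` and the example logits are made up for illustration.

```python
import math
import random


def sample_token(logits, blocked_ids, temperature=1.0, rng=random):
    """Pick one token id from `logits`, never choosing an id in `blocked_ids`."""
    # Blocked tokens get a score of -inf, so their probability becomes 0.
    masked = [-math.inf if i in blocked_ids else x for i, x in enumerate(logits)]
    if temperature == 0.0:
        # Greedy decoding: index of the highest remaining score.
        return max(range(len(masked)), key=lambda i: masked[i])
    scaled = [x / temperature for x in masked]
    peak = max(scaled)
    # Unnormalized softmax weights; math.exp(-inf) evaluates to 0.0.
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


logits = [5.0, 1.0, 2.0, 0.5]   # token 0 has the highest raw score
blocked = {0}                   # but it is blocked, like '<p>', '[UNK]' and '<'
greedy = sample_token(logits, blocked, temperature=0.0)
sampled = sample_token(logits, blocked, temperature=1.0, rng=random.Random(0))
print(greedy, sampled)
```

A lower temperature concentrates the samples on the highest-scoring tokens, while a higher temperature spreads them out.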


Save the column names and their respective data types to local storage.

In [None]:
import pickle

with open('columns.pkl', 'wb') as f:
  pickle.dump(columns,f)

with open('column_dtypes.pkl', 'wb') as f:
  pickle.dump(columns_dtype, f)

In [None]:
# Create an instance of the packet generator:
data_generator = get_data_generator(model,  columns, columns_dtype)

Let's save the trained model along with all its dependencies to our local directory using the tf.saved_model.save() function.

In [None]:
tf.saved_model.save(data_generator, 'packet_generator2')





INFO:tensorflow:Assets written to: packet_generator2/assets



In [None]:
m = tf.saved_model.load('packet_generator2')

The code below defines the custom function (and its dependencies) that will be used with the saved trained model (saved via tf.saved_model).

In [None]:
# Function that loads the column names from storage:
def get_column_names():
  with open('columns.pkl','rb') as f:
    columns = pickle.load(f)
  return columns

# Function that loads the column datatypes from storage:
def get_columns_dtype():
  with open('column_dtypes.pkl', 'rb') as f:
    columns_dtype = pickle.load(f)
  return columns_dtype

def generate(model,data_size, column_names=None, batch_size=2000, temperature=1.0, save=False, file_name='packet_data_output.csv'):
  '''
  Generate packet samples in batches and return them as a pandas dataframe.
  '''
  print('Generating packet data ...')
  # If batch_size is not 100, 500, 1000, or 2000, print the message below and stop:
  if batch_size not in [100, 500, 1000, 2000]:
      print('Model only takes batch sizes of : 100, 500, 1000, 2000. Please, use one of these predefined sizes.')
      return 
  # If data_size < batch_size, use data_size as the batch size:
  if data_size < batch_size:
    try:
      data = model.generate_samples(batch_size=data_size, temperature =temperature)
    except Exception:
      print('Model only takes batch sizes of : 100, 500, 1000, 2000. Please, use one of these predefined sizes or a size larger and that is a multiple of the batch size used.')
      return
  else:
    # Else use the batch_size to generate subsequent batches of generated datasets till the data_size is met:
    incorrect_size = True
    data = None
    while (incorrect_size):
      data = model.generate_samples(batch_size=batch_size, temperature =temperature)
      if data.shape[-1] < 14:
        continue
      else:
        incorrect_size = False
    for i in range(batch_size, data_size, batch_size):
      unequal_size = True
      while (unequal_size):
        temp_data = model.generate_samples(batch_size=batch_size, temperature=temperature)
        if temp_data.shape[-1] < 14:
          continue
        else:
          data  = tf.concat([data, temp_data], axis=0)
          unequal_size = False

  # process_data: remove '>', '<', '<p>' strings from all items:
  data = process_data(data)
  # Fall back to the saved column names if none were provided:
  if column_names is None:
    column_names = get_column_names()
  data = create_tabular_data(data, column_names=column_names, file_name=file_name, save=save)
  return data
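
The loop structure in `generate` produces one initial batch plus one more for every step of `range(batch_size, data_size, batch_size)`. The sketch below counts those calls so we can see how many rows come out for a given request. `planned_batches` is a hypothetical helper, and it ignores the retries that happen when a batch comes back shorter than 14 tokens.

```python
def planned_batches(data_size, batch_size):
    """Number of generate_samples calls implied by the loops in generate()."""
    if data_size < batch_size:
        # A single call that uses data_size as the batch size.
        return 1
    return 1 + len(range(batch_size, data_size, batch_size))


for size in (100000, 150000, 2500):
    calls = planned_batches(size, 2000)
    print(size, calls, calls * 2000)
```

When `data_size` is not a multiple of `batch_size`, the result is rounded up to whole batches, so a request for 2,500 rows with a batch size of 2,000 yields 4,000 rows.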


Now, we use the model to generate 100,000 samples below and we also record the time it took to generate these 100,000 samples.

In [None]:
n_samples= 100000
# Get the start time
start_time = time.time()
# Generate data:
#check = data_generator.generate(n_samples,save=True ,temperature=1.0)
check = generate(m, n_samples, columns, save=True ,temperature=1.0)
# print(check.shape)
# check.head()
print(f'Time taken to generate {n_samples} is : {(time.time() - start_time)/60} min(s)')
check[:5]


Generating packet data ...
Creating tabular data...
packet_data_output.csv
Time taken to generate 100000 is : 9.640553319454193 min(s)


Unnamed: 0,tcp.flags,ip.id,ip.flags,ip.flags.rb,_ws.col.Protocol,ip.flags.df,ip.flags.mf,ip.frag_offset,ip.ttl,ip.proto,ip.checksum,ip.src,ip.dst,ip.len,ip.dsfield,tcp.srcport,tcp.dstport,tcp.seq,tcp.ack,tcp.len,tcp.hdr_len,tcp.flags.fin,tcp.flags.syn,tcp.flags.reset,tcp.flags.push,tcp.flags.ack,tcp.flags.urg,tcp.flags.cwr,tcp.window_size,tcp.checksum,tcp.urgent_pointer,tcp.options.mss_val
0,0x00000012,0x0000d40b,0x00000000,0.0,TCP,0.0,0.0,0.0,33.0,6.0,0x00005afa,10.0.0.15,192.168.2.5,44.0,0x00000000,46192.0,35087.0,50.0,164.0,4.0,32.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,29200.0,0x0000a73d,0.0,1460.0
1,0x00000002,0x000064b2,0x00004000,0.0,TCP,1.0,0.0,0.0,63.0,6.0,0x000065e1,192.168.1.7,192.168.2.7,52.0,0x00000000,35087.0,1883.0,1.0,1.0,0.0,40.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1024.0,0x000037b5,0.0,1460.0
2,0x00000010,0x0000d927,0x00004000,0.0,TCP,1.0,0.0,0.0,64.0,6.0,0x0000271d,10.0.0.23,10.0.0.7,56.0,0x00000000,59992.0,56756.0,0.0,1.0,0.0,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1024.0,0x00001bc1,0.0,1460.0
3,0x00000010,0x0000588f,0x00004000,0.0,TCP,1.0,0.0,0.0,64.0,6.0,0x00000c3d,192.168.1.7,892.168.1.7,52.0,0x00000000,1883.0,3824.0,1.0,1.0,0.0,24.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,29200.0,0x00008892,0.0,1460.0
4,0x00000010,0x0000b502,0x00004000,0.0,TCP,1.0,0.0,0.0,63.0,6.0,0x00008c35,192.168.2.5,192.168.2.5,101.0,0x00000000,1883.0,1730.0,0.0,74.0,0.0,20.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1024.0,0x0000cfe1,0.0,1460.0


It took about 9.6 mins for this model to generate a dataset of 100,000 samples. Some errors can be seen in the model's output, for example the invalid address 892.168.1.7 in row 3. This is expected, since a trained model rarely fits its data perfectly.
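
One simple way to find such errors in the address columns is to try parsing each value with Python's standard `ipaddress` module. This is a sketch rather than a step the notebook performs; `is_valid_ipv4` is a hypothetical helper, and the three addresses are taken from the rows displayed above.

```python
import ipaddress


def is_valid_ipv4(text):
    """Return True if `text` parses as an IPv4 address."""
    try:
        ipaddress.IPv4Address(text)
        return True
    except ValueError:
        return False


generated_ips = ['192.168.1.7', '892.168.1.7', '10.0.0.23']
invalid = [ip for ip in generated_ips if not is_valid_ipv4(ip)]
print(invalid)
```

The same check could be applied to the whole `ip.src` and `ip.dst` columns to count how many generated addresses are malformed.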

Let's create another 150,000 samples.

In [None]:
n_samples= 150000
# Get the start time
start_time = time.time()
# Generate data:
#check = data_generator.generate(n_samples,save=True ,temperature=1.0)
new_samples = generate(m, n_samples, columns, save=True ,temperature=1.0, file_name='model_packet_data_output2.csv')
# print(check.shape)
# check.head()
print(f'Time taken to generate {n_samples} is : {(time.time() - start_time)/60} min(s)')
new_samples[:5]


Generating packet data ...
Creating tabular data...
model_packet_data_output2.csv
Time taken to generate 150000 is : 14.23298218647639 min(s)


Unnamed: 0,ip.flags,ip.flags.rb,_ws.col.Protocol,ip.flags.df,ip.flags.mf,ip.frag_offset,ip.ttl,ip.proto,ip.checksum,ip.src,ip.dst,ip.len,ip.dsfield,tcp.srcport,tcp.dstport,tcp.seq,tcp.ack,tcp.len,tcp.hdr_len,tcp.flags,ip.id,tcp.flags.fin,tcp.flags.syn,tcp.flags.reset,tcp.flags.push,tcp.flags.ack,tcp.flags.urg,tcp.flags.cwr,tcp.window_size,tcp.checksum,tcp.urgent_pointer,tcp.options.mss_val
0,0x0000a005,0x00004000,0.0,TCP,0.0,0.0,0.0,64.0,6.0,0x0000312f,10.0.0.15,192.168.2.5,40.0,0x00000000,1809.0,33192.0,1.0,1.0,0.0,24.0,0x00000014,1.0,0.0,1.0,0.0,1.0,0.0,0.0,28960.0,0x0000eb3b,0.0,1460.0
1,0x00006936,0x00004000,0.0,TCP,1.0,0.0,0.0,63.0,6.0,0x0000e304,10.0.0.14,192.168.1.7,52.0,0x00000000,1883.0,33772.0,0.0,5.0,0.0,32.0,0x00000010,0.0,0.0,0.0,0.0,1.0,0.0,0.0,29200.0,0x0000e2fe,0.0,1460.0
2,0x000027e5,0x00000000,0.0,MPEG TS,0.0,0.0,0.0,64.0,6.0,0x0000dd72,10.0.0.1,892.168.2.5,52.0,0x00000000,35487.0,1.0,0.0,5.0,0.0,32.0,0x00000011,0.0,0.0,0.0,0.0,1.0,0.0,0.0,29200.0,0x00003f79,0.0,1460.0
3,0x00005067,0x00004000,0.0,TCP,0.0,0.0,0.0,64.0,6.0,0x00006a8e,10.0.0.7,192.168.2.5,60.0,0x00000000,1883.0,1883.0,0.0,5.0,2.0,20.0,0x00000010,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0x000024f6,0.0,1460.0
4,0x0000b5a4,0x00004000,0.0,TCP,1.0,0.0,0.0,64.0,6.0,0x00005dca,192.168.2.5,10.0.0.1,52.0,0x00000000,3700.0,35089.0,50.0,94.0,0.0,24.0,0x00000014,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0x00000735,0.0,1460.0


Now, let's check the unique values in the columns treated as categorical columns.
We expect some errors, since no model reaches a loss of exactly 0 and fits the data perfectly.

In [None]:
new_samples.columns

Index(['ip.flags', 'ip.flags.rb', '_ws.col.Protocol', 'ip.flags.df',
       'ip.flags.mf', 'ip.frag_offset', 'ip.ttl', 'ip.proto', 'ip.checksum',
       'ip.src', 'ip.dst', 'ip.len', 'ip.dsfield', 'tcp.srcport',
       'tcp.dstport', 'tcp.seq', 'tcp.ack', 'tcp.len', 'tcp.hdr_len',
       'tcp.flags', 'ip.id', 'tcp.flags.fin', 'tcp.flags.syn',
       'tcp.flags.reset', 'tcp.flags.push', 'tcp.flags.ack', 'tcp.flags.urg',
       'tcp.flags.cwr', 'tcp.window_size', 'tcp.checksum',
       'tcp.urgent_pointer', 'tcp.options.mss_val'],
      dtype='object')

In [None]:
for col in cat_columns:
  print(col)
  print(check[col].unique())
  print()

_ws.col.Protocol
['TCP' 'DNS' 'MQTT' 'MPEG TS' 'MDNS' 'MPEG PES' 'ICMP' '0.0' '6.0'
 'TCP32.02' 'UDP' '17.0' '0x00000010' '1.0' '0x00000004' '0x00000000'
 '32.0' 'd' '0x00000018' '2' '0x00000019' 'MPEG TS20.0' 'c' 'MPEG TSICMP2'
 '0x00004000' '0x00000002' '20.0' 'b' '40.0' '0x00000011' 'TCP2TCP'
 'MQTT1.0MQTT' 'TCPe2' '0x00000012' '9' '44.0' '1' '0x00000014' '265.0'
 'TCPb2' '6.0DNSMPEG PESMQTT' 'MQTTDNS24.0TCP' 'TCP928MQTT'
 'MQTTTCP21460.0' 'TCP0x000000292' 'TCPTCP2.MQTT' 'e' 'f' '0x00000029'
 'MPEG TSMPEG PESMQTT' 'ICMP1.0' 'MPEG TSb2' '24.0' 'TCP0x00000029MQTT'
 'ICMPTCP' 'MQTTTCP' 'TCP0x00000011' 'MDNS0x0000400032.0' 'TCP17.0MQTT'
 'MPEG TSTCP2MDNSTCP' '8' 'MQTTMQTT2' 'DNS0x00004000MQTT' 'TCP1.0' ''
 '0x000000000x000000102' 'MPEG TS40.02TCP']

ip.flags
['0x00000000' '0x00004000' 'MQTT' 'TCP' '0.0'
 '0x000040000x000040000x00004000' '32.0' '0x000040001.0.0'
 '0x000040000x00004000.0' '1' '0x00004000MQTT0x00004000' '0x00004000.0'
 '1460.0' '1.0' 'UDP' '0x0000400060x00004000' '2' '0x00

As we can see, these 'categorical columns' still contain some values that were not among their original unique values, for example 'TCP32.02' and 'TCPb2' in _ws.col.Protocol. Note that the 'continuous columns' were not examined here.
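
To put a number on this, we could measure the share of generated values that fall outside the set of values a column is allowed to take. The sketch below does this for a small hand-picked list. The set `allowed_protocols` is an assumption built from the protocol names in the vocabulary, `invalid_value_report` is a hypothetical helper, and the toy sample does not reflect the actual error rate of the model.

```python
from collections import Counter


def invalid_value_report(values, allowed):
    """Return (invalid_rate, counts_of_invalid_values) for a list of values."""
    counts = Counter(v for v in values if v not in allowed)
    rate = sum(counts.values()) / len(values) if values else 0.0
    return rate, counts


# Assumed set of valid protocol names (taken from the vocabulary printed earlier).
allowed_protocols = {'TCP', 'UDP', 'DNS', 'MDNS', 'ICMP', 'MQTT', 'MPEG TS', 'MPEG PES'}
# Toy sample mixing valid values with some of the invalid ones shown above.
sample = ['TCP', 'MQTT', 'TCP32.02', 'DNS', 'TCPb2', 'TCP', 'UDP', '0.0']

rate, counts = invalid_value_report(sample, allowed_protocols)
print(f'{rate:.3f}', dict(counts))
```

Applied to a full generated column, the same report would show both how often the model produces an invalid value and which invalid values are most common.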