## Dataset Combination
---------------
This notebook provides a framework for combining the common features of multiple datasets.

In [1]:
#imports
from DatasetTools import loadDataset,Datasets,findCommonFeatures,combineDatasets,saveDataset
from Loader.CombinedDataset import CombinationMethod
import pandas as pd

### Global variables

In [2]:
calcFeatures = True
outputPath = "../datasets/Combined/sample.csv"

### Loading datasets
To load a dataset, call the *loadDataset* function from DatasetTools and specify which dataset is being loaded along with the path to the file. You can also provide a name for the dataset, which is used to identify it in the combined dataset.

In [6]:

usb1 = loadDataset(Datasets.USBIDS_2021,'../datasets/USBIDS21/Slowhttptest-NoDefense.csv',"USB-IDS Slowhttptest",calcFeatures=calcFeatures)
usb2 = loadDataset(Datasets.USBIDS_2021,'../datasets/USBIDS21/REGULAR.csv',"USB-IDS Regular",calcFeatures=calcFeatures)

UNSW1 = loadDataset(Datasets.UNSW_NB15_2015,"../datasets/UNSW-NB15/UNSW-NB15_1.csv","UNSW-NB15_1",calcFeatures=calcFeatures)
UNSW2 = loadDataset(Datasets.UNSW_NB15_2015,"../datasets/UNSW-NB15/UNSW-NB15_2.csv","UNSW-NB15_2",calcFeatures=calcFeatures)

loading Datasets.USBIDS_2021 from "../datasets/USBIDS21/Slowhttptest-NoDefense.csv" with dataset name: USB-IDS Slowhttptest
loading Datasets.USBIDS_2021 from "../datasets/USBIDS21/REGULAR.csv" with dataset name: USB-IDS Regular
loading Datasets.UNSW_NB15_2015 from "../datasets/UNSW-NB15/UNSW-NB15_1.csv" with dataset name: UNSW-NB15_1
loading Datasets.UNSW_NB15_2015 from "../datasets/UNSW-NB15/UNSW-NB15_2.csv" with dataset name: UNSW-NB15_2


Each dataset can have additional preprocessors added to its processing chain at three different stages. Each preprocessor is passed a pandas chunk of the dataset to process and should return the modified chunk when finished. An example of this can be found in *UNSWNB15.py*.

```python
# Adds a preprocessor that runs before unit conversion and column renaming
def addPreConvertProcessor(self, preprocess): ...

# Adds a preprocessor that runs before additional feature calculation
def addPreCalculateProcessor(self, preprocess): ...

# Adds a preprocessor at the end of the processing chain, before the loaded dataset is returned
def addPreprocessor(self, preprocess): ...
```
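
As a rough illustration of the contract described above (chunk in, modified chunk out), a preprocessor could look like the following. This is a minimal sketch: `dropUnlabelledFlows` and the toy columns are hypothetical and not part of DatasetTools, and the registration call is shown only as a comment because it assumes the methods listed above are called on a loaded dataset object.

```python
import pandas as pd


def dropUnlabelledFlows(chunk: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical preprocessor: tidy column names and drop rows without a label."""
    # Strip stray whitespace from column names so later lookups are predictable
    chunk = chunk.rename(columns=lambda name: name.strip())
    # Remove rows whose 'Label' value is missing, then hand the chunk back
    return chunk.dropna(subset=["Label"])


# Stand-alone check on a toy chunk (no DatasetTools required)
toyChunk = pd.DataFrame(
    {" Flow Duration": [10, 20, 30], "Label": ["BENIGN", None, "ATTACK"]}
)
cleaned = dropUnlabelledFlows(toyChunk)
print(cleaned)

# Assumed registration on a loaded dataset (requires DatasetTools):
# usb1.addPreprocessor(dropUnlabelledFlows)
```

The important part is that the function returns the chunk; a preprocessor that modifies a copy and returns nothing would leave the chain without data to pass on.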

### Combining datasets
There are two methods of combining datasets: sequential and interlaced.
- Sequentially: datasets are concatenated together in the order they are passed in.
- Interlaced: datasets are mixed together based on the relative timestamp in each dataset.
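
The difference between the two methods can be sketched with plain pandas on two toy frames. This is not the library's implementation: `sequentialSketch` and `interlaceSketch` are hypothetical helpers, and "relative timestamp" is assumed here to mean the time elapsed since the first row of each dataset.

```python
import pandas as pd

# Two toy datasets recorded at very different absolute times
a = pd.DataFrame({"Timestamp": [100.0, 105.0, 130.0], "Label": ["a0", "a1", "a2"]})
b = pd.DataFrame({"Timestamp": [9000.0, 9010.0, 9020.0], "Label": ["b0", "b1", "b2"]})


def sequentialSketch(*frames):
    """Concatenate the frames in the order they were passed in."""
    return pd.concat(frames, ignore_index=True)


def interlaceSketch(*frames):
    """Mix rows by time elapsed since the start of each frame (an assumption)."""
    shifted = []
    for frame in frames:
        frame = frame.copy()
        frame["Relative"] = frame["Timestamp"] - frame["Timestamp"].min()
        shifted.append(frame)
    merged = pd.concat(shifted, ignore_index=True)
    # A stable sort keeps the input order for rows with equal relative times
    return merged.sort_values("Relative", kind="stable").reset_index(drop=True)


print(sequentialSketch(a, b)["Label"].tolist())  # ['a0', 'a1', 'a2', 'b0', 'b1', 'b2']
print(interlaceSketch(a, b)["Label"].tolist())   # ['a0', 'b0', 'a1', 'b1', 'b2', 'a2']
```

Even though `b` was captured much later than `a` in absolute time, its rows are woven in between those of `a` once both are measured from their own start.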

#### Simple combination
To combine datasets, call `combineDatasets` with all the datasets to combine and the method of combination. Below are two examples that combine the same four datasets with each method.

In [7]:
#combine Sequentially
comb = combineDatasets(usb1, usb2, UNSW1, UNSW2, method = CombinationMethod.SEQUENTIAL)

Combined Dataset has 28 features: ['ID', 'Src IP', 'Src Port', 'Dst IP', 'Dst Port', 'Protocol', 'Timestamp', 'Flow Duration', 'Packet Length Mean', 'Fwd Packet Length Mean', 'Bwd Packet Length Mean', 'Flow Bytes/s', 'Fwd Flow Byte/s', 'Bwd Flow Byte/s', 'Flow Packets/s', 'Fwd Flow Packets/s', 'Bwd Flow Packets/s', 'Total Length of Packets', 'Total Length of Fwd Packets', 'Total Length of Bwd Packets', 'Total Packets', 'Total Fwd Packets', 'Total Bwd Packets', 'Flow IAT Mean', 'Fwd IAT Mean', 'Bwd IAT Mean', 'Down/Up Ratio', 'Label']
USB-IDS Slowhttptest had 50 features, missing 22
USB-IDS Regular had 50 features, missing 22
UNSW-NB15_1 had 29 features, missing 1
UNSW-NB15_2 had 29 features, missing 1


In [4]:
#combine Interlaced
comb = combineDatasets(usb1, usb2, UNSW1, UNSW2, method = CombinationMethod.INTERLACE)
comb.chunksize = 10**6 #recommended to increase the chunksize for interlaced datasets to prevent artifacts between chunks

Combined Dataset has 28 features: ['ID', 'Src IP', 'Src Port', 'Dst IP', 'Dst Port', 'Protocol', 'Timestamp', 'Flow Duration', 'Packet Length Mean', 'Fwd Packet Length Mean', 'Bwd Packet Length Mean', 'Flow Bytes/s', 'Fwd Flow Byte/s', 'Bwd Flow Byte/s', 'Flow Packets/s', 'Fwd Flow Packets/s', 'Bwd Flow Packets/s', 'Total Length of Packets', 'Total Length of Fwd Packets', 'Total Length of Bwd Packets', 'Total Packets', 'Total Fwd Packets', 'Total Bwd Packets', 'Flow IAT Mean', 'Fwd IAT Mean', 'Bwd IAT Mean', 'Down/Up Ratio', 'Label']
USB-IDS Slowhttptest had 50 features, missing 22
USB-IDS Regular had 50 features, missing 22
UNSW-NB15_1 had 29 features, missing 1
UNSW-NB15_2 had 29 features, missing 1


#### Multiple combinations
When combining multiple datasets, more than one method may be needed. For example, multiple files from the same dataset can be concatenated and the result interlaced with another dataset.

In [4]:
combUSB = combineDatasets(usb1, usb2, method = CombinationMethod.SEQUENTIAL, name = "USB_Combined")
combUNSW = combineDatasets(UNSW1, UNSW2, method = CombinationMethod.SEQUENTIAL, name = "UNSW_Combined")
comb = combineDatasets(combUNSW, combUSB, method = CombinationMethod.INTERLACE)

Combined Dataset has 49 features: ['ID', 'Src IP', 'Src Port', 'Dst IP', 'Dst Port', 'Protocol', 'Timestamp', 'Flow Duration', 'Packet Length Min', 'Packet Length Max', 'Packet Length Mean', 'Packet Length Std', 'Packet Length Variance', 'Fwd Packet Length Min', 'Fwd Packet Length Max', 'Fwd Packet Length Mean', 'Fwd Packet Length Std', 'Bwd Packet Length Min', 'Bwd Packet Length Max', 'Bwd Packet Length Mean', 'Bwd Packet Length Std', 'Flow Bytes/s', 'Fwd Flow Byte/s', 'Bwd Flow Byte/s', 'Flow Packets/s', 'Fwd Flow Packets/s', 'Bwd Flow Packets/s', 'Total Length of Packets', 'Total Length of Fwd Packets', 'Total Length of Bwd Packets', 'Total Packets', 'Total Fwd Packets', 'Total Bwd Packets', 'Flow IAT Mean', 'Flow IAT Std', 'Flow IAT Max', 'Flow IAT Min', 'Fwd IAT Min', 'Fwd IAT Max', 'Fwd IAT Mean', 'Fwd IAT Std', 'Fwd IAT Total', 'Bwd IAT Min', 'Bwd IAT Max', 'Bwd IAT Mean', 'Bwd IAT Std', 'Bwd IAT Total', 'Down/Up Ratio', 'Label']
USB-IDS Slowhttptest had 50 features, missing 1
U

### Using the combined dataset
Once a dataset is loaded or a combined dataset is created, its data is read in chunks to minimize RAM usage when combining several large datasets. The default chunk size for each dataset is 100000. To change it, set the dataset's `chunksize` attribute to the new chunk size.
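
The effect of a chunk size can be seen with pandas' own chunked CSV reader. The loader's internals are not shown in this notebook, so this is only a sketch of the general technique on an in-memory CSV; `csvText` is made-up sample data.

```python
import io

import pandas as pd

# Ten rows of made-up flow data held in memory instead of a file on disk
csvText = "Flow Duration,Label\n" + "\n".join(f"{i},BENIGN" for i in range(10))

# With chunksize set, read_csv returns an iterator of DataFrames
# rather than loading the whole file at once
reader = pd.read_csv(io.StringIO(csvText), chunksize=4)
chunkLengths = [len(chunk) for chunk in reader]

print(chunkLengths)  # [4, 4, 2]
```

Only one chunk is held in memory at a time, so a larger chunk size trades memory for fewer read operations.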

In [None]:
#get the next chunk of the combined dataset
temp = next(comb)

In [None]:
#get all the chunks and do something with them
for ds in comb:
    # do something with each chunk
    pass
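
Because only one chunk is in memory at a time, anything computed over the whole dataset has to be accumulated across iterations. A minimal sketch of that pattern follows; `fakeChunks` is a hypothetical stand-in for a combined dataset and simply yields small DataFrames.

```python
from collections import Counter

import pandas as pd


def fakeChunks():
    """Hypothetical stand-in for iterating over a combined dataset."""
    yield pd.DataFrame({"Label": ["BENIGN", "ATTACK", "BENIGN"]})
    yield pd.DataFrame({"Label": ["ATTACK", "ATTACK"]})


# Accumulate per-label row counts without keeping any chunk around
labelCounts = Counter()
for chunk in fakeChunks():
    labelCounts.update(chunk["Label"].value_counts().to_dict())

print(dict(labelCounts))
```

With a real combined dataset, the loop header would presumably be `for chunk in comb:` as in the cell above.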

### Saving the dataset
Saving the combined dataset is as simple as calling `saveDataset` with the path to save the result to.
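
For intuition, writing a chunked dataset to a single CSV generally means writing the header once and appending every later chunk. The sketch below shows that pattern with plain pandas; it is not the implementation of `saveDataset`, and the chunks and temporary output path are made up for the example.

```python
import os
import tempfile

import pandas as pd

# Made-up chunks standing in for the output of a combined dataset
chunks = [
    pd.DataFrame({"ID": [1, 2], "Label": ["BENIGN", "ATTACK"]}),
    pd.DataFrame({"ID": [3], "Label": ["BENIGN"]}),
]

outPath = os.path.join(tempfile.mkdtemp(), "sample.csv")
for i, chunk in enumerate(chunks):
    # First chunk creates the file with a header; later chunks are appended without one
    chunk.to_csv(outPath, mode="w" if i == 0 else "a", header=(i == 0), index=False)

saved = pd.read_csv(outPath)
print(len(saved))  # 3
```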

In [8]:
saveDataset("Sample/sample1.csv", comb, overwriteExisting=True)

Saving Combined with 29 features: ['ID', 'Src IP', 'Src Port', 'Dst IP', 'Dst Port', 'Protocol', 'Timestamp', 'Flow Duration', 'Packet Length Mean', 'Fwd Packet Length Mean', 'Bwd Packet Length Mean', 'Flow Bytes/s', 'Fwd Flow Byte/s', 'Bwd Flow Byte/s', 'Flow Packets/s', 'Fwd Flow Packets/s', 'Bwd Flow Packets/s', 'Total Length of Packets', 'Total Length of Fwd Packets', 'Total Length of Bwd Packets', 'Total Packets', 'Total Fwd Packets', 'Total Bwd Packets', 'Flow IAT Mean', 'Fwd IAT Mean', 'Bwd IAT Mean', 'Down/Up Ratio', 'Label', 'Dataset']
Reached end of USB-IDS Slowhttptest with 6737 flows, switching to USB-IDS Regular
Reached end of USB-IDS Regular with 12658 flows, switching to UNSW-NB15_1


  df = self.reader.get_chunk(rows)
  df = self.reader.get_chunk(rows)


Reached end of UNSW-NB15_1 with 12659 flows, switching to UNSW-NB15_2


  df = self.reader.get_chunk(rows)
  df = self.reader.get_chunk(rows)
  df = self.reader.get_chunk(rows)


Reached end of UNSW-NB15_2 with 12660 flows, end of combined dataset
Saved 1712660 flows!


In [16]:
comb.reset()

#### Saving datasets with common features
Existing datasets can have their common features extracted and saved separately, so that results can be compared between datasets without combining them.
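
Conceptually, the common features are the columns that appear in every dataset. A minimal sketch of that idea on toy DataFrames is shown below; `commonColumns` is a hypothetical helper and is not claimed to match how `findCommonFeatures` works internally.

```python
import pandas as pd


def commonColumns(*frames):
    """Return the columns present in every frame, in the order of the first frame."""
    first, *rest = frames
    return [col for col in first.columns if all(col in f.columns for f in rest)]


usbToy = pd.DataFrame(columns=["ID", "Flow Duration", "Flow IAT Std", "Label"])
unswToy = pd.DataFrame(columns=["ID", "Flow Duration", "Service", "Label"])

shared = commonColumns(usbToy, unswToy)
print(shared)  # ['ID', 'Flow Duration', 'Label']

# Restricting each frame to the shared columns makes them directly comparable
usbShared = usbToy[shared]
unswShared = unswToy[shared]
```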

In [10]:
datasets = [usb1,usb2,UNSW1,UNSW2]
features = findCommonFeatures(usb1,usb2,UNSW1,UNSW2)
for ds in datasets:
    ds.reset() # reset the dataset back to the beginning for a potential second read
    saveDataset(f'Sample/separate/{ds.name}.csv', ds, features=features)

Saving USB-IDS Slowhttptest with 28 features: ['ID', 'Src IP', 'Src Port', 'Dst IP', 'Dst Port', 'Protocol', 'Timestamp', 'Flow Duration', 'Packet Length Mean', 'Fwd Packet Length Mean', 'Bwd Packet Length Mean', 'Flow Bytes/s', 'Fwd Flow Byte/s', 'Bwd Flow Byte/s', 'Flow Packets/s', 'Fwd Flow Packets/s', 'Bwd Flow Packets/s', 'Total Length of Packets', 'Total Length of Fwd Packets', 'Total Length of Bwd Packets', 'Total Packets', 'Total Fwd Packets', 'Total Bwd Packets', 'Flow IAT Mean', 'Fwd IAT Mean', 'Bwd IAT Mean', 'Down/Up Ratio', 'Label']
Saved 6737 flows!
Saving USB-IDS Regular with 28 features: ['ID', 'Src IP', 'Src Port', 'Dst IP', 'Dst Port', 'Protocol', 'Timestamp', 'Flow Duration', 'Packet Length Mean', 'Fwd Packet Length Mean', 'Bwd Packet Length Mean', 'Flow Bytes/s', 'Fwd Flow Byte/s', 'Bwd Flow Byte/s', 'Flow Packets/s', 'Fwd Flow Packets/s', 'Bwd Flow Packets/s', 'Total Length of Packets', 'Total Length of Fwd Packets', 'Total Length of Bwd Packets', 'Total Packets', 

  df = self.reader.get_chunk(rows)
  df = self.reader.get_chunk(rows)


Saved 700001 flows!
Saving UNSW-NB15_2 with 28 features: ['ID', 'Src IP', 'Src Port', 'Dst IP', 'Dst Port', 'Protocol', 'Timestamp', 'Flow Duration', 'Packet Length Mean', 'Fwd Packet Length Mean', 'Bwd Packet Length Mean', 'Flow Bytes/s', 'Fwd Flow Byte/s', 'Bwd Flow Byte/s', 'Flow Packets/s', 'Fwd Flow Packets/s', 'Bwd Flow Packets/s', 'Total Length of Packets', 'Total Length of Fwd Packets', 'Total Length of Bwd Packets', 'Total Packets', 'Total Fwd Packets', 'Total Bwd Packets', 'Flow IAT Mean', 'Fwd IAT Mean', 'Bwd IAT Mean', 'Down/Up Ratio', 'Label']


  df = self.reader.get_chunk(rows)
  df = self.reader.get_chunk(rows)
  df = self.reader.get_chunk(rows)
  df = self.reader.get_chunk(rows)


Saved 700001 flows!
