# Description

This notebook demonstrates how to download and combine the CICIDS2017 dataset.

*Author*: **Mahendra Data** mahendra.data@dbms.cs.kumamoto-u.ac.jp

License: **BSD 3 clause**

# Mounting Google Drive

We will save the downloaded dataset to Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Downloading the dataset

The description of the CICIDS2017 dataset is available at https://www.unb.ca/cic/datasets/ids-2017.html

There are three versions available:

1. Raw network captured data (PCAPs),
2. Generated Labelled Flows, and
3. Machine Learning CSV.

In this notebook, we will download the `GeneratedLabelledFlows.zip` version of this dataset.

In [None]:
!wget -nc -O GeneratedLabelledFlows.zip http://205.174.165.80/CICDataset/CIC-IDS-2017/Dataset/GeneratedLabelledFlows.zip

--2020-08-06 01:53:30--  http://205.174.165.80/CICDataset/CIC-IDS-2017/Dataset/GeneratedLabelledFlows.zip
Connecting to 205.174.165.80:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 283876488 (271M) [application/zip]
Saving to: ‘GeneratedLabelledFlows.zip’


2020-08-06 01:55:12 (2.67 MB/s) - ‘GeneratedLabelledFlows.zip’ saved [283876488/283876488]



# Integrity check

Download the `GeneratedLabelledFlows.md5` file to check the integrity of the downloaded archive.

In [None]:
!wget -nc http://205.174.165.80/CICDataset/CIC-IDS-2017/Dataset/GeneratedLabelledFlows.md5

--2020-08-06 01:55:13--  http://205.174.165.80/CICDataset/CIC-IDS-2017/Dataset/GeneratedLabelledFlows.md5
Connecting to 205.174.165.80:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 61
Saving to: ‘GeneratedLabelledFlows.md5’


2020-08-06 01:55:14 (8.46 MB/s) - ‘GeneratedLabelledFlows.md5’ saved [61/61]



Check the file integrity.

In [None]:
!md5sum -c GeneratedLabelledFlows.md5

GeneratedLabelledFlows.zip: OK


If the download is intact, the output should look like this:

`GeneratedLabelledFlows.zip: OK`
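
Where `md5sum` is not available, the same check can be sketched in Python with the standard `hashlib` module. The helper names `md5_of_file` and `matches_md5_file` are hypothetical, and the sketch assumes the `.md5` file uses the usual `md5sum` layout (the digest first, then the file name).

```python
import hashlib


def md5_of_file(path, blocksize=1048576):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as source:
        # Read in blocks so a large archive does not have to fit in memory.
        for block in iter(lambda: source.read(blocksize), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_md5_file(data_path, md5_path):
    """Compare a file's digest with the first field of an md5sum-style file.

    Assumption: the first whitespace-separated token in `md5_path`
    is the expected digest.
    """
    with open(md5_path, "r") as source:
        expected = source.read().split()[0].lower()
    return md5_of_file(data_path) == expected
```

Calling `matches_md5_file("GeneratedLabelledFlows.zip", "GeneratedLabelledFlows.md5")` should return `True` when the archive is intact.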

# Unzip the dataset

Unzip `GeneratedLabelledFlows.zip` and remove the extra space character at the end of the extracted folder name.

In [None]:
!unzip -n GeneratedLabelledFlows.zip
!mv TrafficLabelling\ / TrafficLabelling

Archive:  GeneratedLabelledFlows.zip
   creating: TrafficLabelling /
  inflating: TrafficLabelling /Wednesday-workingHours.pcap_ISCX.csv  
  inflating: TrafficLabelling /Tuesday-WorkingHours.pcap_ISCX.csv  
  inflating: TrafficLabelling /Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv  
  inflating: TrafficLabelling /Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv  
  inflating: TrafficLabelling /Monday-WorkingHours.pcap_ISCX.csv  
  inflating: TrafficLabelling /Friday-WorkingHours-Morning.pcap_ISCX.csv  
  inflating: TrafficLabelling /Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv  
  inflating: TrafficLabelling /Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv  


Eight files are extracted from this zip file:

1. `Monday-WorkingHours.pcap_ISCX.csv`
2. `Tuesday-WorkingHours.pcap_ISCX.csv`
3. `Wednesday-workingHours.pcap_ISCX.csv`
4. `Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv`
5. `Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv`
6. `Friday-WorkingHours-Morning.pcap_ISCX.csv`
7. `Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv`
8. `Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv`
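
As a quick sanity check, the list above can be compared with the contents of the extracted folder. This is a minimal sketch; `EXPECTED_FILES` and `missing_files` are hypothetical names introduced here, and the folder is assumed to be the renamed `TrafficLabelling` directory.

```python
import os

# The eight file names listed above, spelled exactly as in the archive.
EXPECTED_FILES = [
    "Monday-WorkingHours.pcap_ISCX.csv",
    "Tuesday-WorkingHours.pcap_ISCX.csv",
    "Wednesday-workingHours.pcap_ISCX.csv",
    "Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv",
    "Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv",
    "Friday-WorkingHours-Morning.pcap_ISCX.csv",
    "Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv",
    "Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv",
]


def missing_files(folder, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in `folder`."""
    return [
        name for name in expected
        if not os.path.isfile(os.path.join(folder, name))
    ]
```

An empty list from `missing_files("TrafficLabelling")` means all eight files were extracted.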

# Change the encoding to utf-8

The file `Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv` is encoded in latin1. We should convert it to utf-8 so that it matches the other files.

Now, import the libraries.

In [None]:
import os
import codecs
import pandas as pd
import numpy as np

# Change display.max_rows to show all features.
pd.set_option('display.max_rows', 85)

Change the encoding to utf-8.

In [None]:
def _to_utf8(filename: str, encoding="latin1", blocksize=1048576):
    """ This function changee the encoding of a file to utf-8.

    Args:
        filename (str): The path of the source file.
        encoding (str): The encoding of the source file.
        blocksize (int): The blocksize when reading the source file.
    """
    tmpfilename = filename + ".tmp"
    with codecs.open(filename, "r", encoding) as source:
        with codecs.open(tmpfilename, "w", "utf-8") as target:
            while True:
                contents = source.read(blocksize)
                if not contents:
                    break
                target.write(contents)

    # Replace the original file (os.replace also overwrites on Windows).
    os.replace(tmpfilename, filename)


# The location of the extracted dataset.
datasets_path = 'TrafficLabelling'

_to_utf8(os.path.join(datasets_path, "Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv"))
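
One way to confirm the conversion is to test whether a file decodes cleanly as utf-8. The sketch below is not part of the original workflow. `is_utf8` is a hypothetical helper that uses the standard library's incremental decoder so that multi-byte characters split across block boundaries are handled correctly.

```python
import codecs


def is_utf8(path, blocksize=1048576):
    """Return True if the whole file at `path` decodes as utf-8."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        with open(path, "rb") as source:
            for block in iter(lambda: source.read(blocksize), b""):
                decoder.decode(block)
            # Flush the decoder: raises if the file ends mid-character.
            decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True
```

Running this check before calling `_to_utf8` also helps avoid converting a file twice, since re-reading an already converted file as latin1 would garble its non-ASCII characters.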

Save the zip and extracted files to Google Drive.

In [None]:
!mkdir -p '/content/drive/My Drive/CICIDS2017/'

!cp GeneratedLabelledFlows.zip '/content/drive/My Drive/CICIDS2017/'

!cp -r 'TrafficLabelling/' '/content/drive/My Drive/CICIDS2017/'

The dataset is now saved to your Google Drive in the `CICIDS2017` folder.