# Data preprocessing

This notebook was tested on a standard Amazon SageMaker notebook (ml.m5.2xlarge) with a "conda_python3" kernel.

## Initial data extraction

Before running the notebook, upload the `htmldata.tar.gz` file to the notebook instance. If the file is in S3, you can copy it from there (replace `YOURBUCKET` with the name of your bucket):
```python 
%%sh
aws s3 cp s3://YOURBUCKET/htmldata.tar.gz .
```

Once the data is in your notebook, you can untar it:
```python
%%sh
tar xvfz htmldata.tar.gz
```

After that, the `html` folder with the decompressed files will be available in your notebook.
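
To confirm the extraction worked, you can count the files in each sub-folder. This is a minimal sketch: the four sub-folder paths are taken from the `data_recordio.py` commands used later in this notebook, while the helper name `count_extracted_files` is hypothetical and introduced only for illustration.

```python
from pathlib import Path

# Sub-folders read by the data_recordio.py commands later in this notebook.
EXPECTED_DIRS = [
    "benign_files/training",
    "benign_files/validation",
    "malicious_files/training",
    "malicious_files/validation",
]


def count_extracted_files(root="html"):
    """Return {sub-folder: number of regular files}, or None where a folder is missing."""
    counts = {}
    for sub in EXPECTED_DIRS:
        folder = Path(root) / sub
        if folder.is_dir():
            # Count regular files at any depth below the sub-folder.
            counts[sub] = sum(1 for p in folder.rglob("*") if p.is_file())
        else:
            counts[sub] = None
    return counts


for sub, n in count_extracted_files().items():
    print(f"{sub}: {'missing' if n is None else n}")
```

A folder reported as `missing` or with a count of 0 suggests that the archive was not fully extracted or was unpacked in a different directory.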

The `data_recordio.py` script must be in the same folder as this notebook.

## Libraries installation

First we install two libraries that are used by the `data_recordio.py` script.

In [4]:
!pip install mmh3

Collecting mmh3
  Downloading mmh3-2.5.1.tar.gz (9.8 kB)
Building wheels for collected packages: mmh3
  Building wheel for mmh3 (setup.py) ... done
  Created wheel for mmh3: filename=mmh3-2.5.1-cp36-cp36m-linux_x86_64.whl size=24895 sha256=14869a05a1e3d03588a920065ebf421b18585e040cc4ba3402f78ac9c3cde047
  Stored in directory: /home/ec2-user/.cache/pip/wheels/cc/3a/98/fc5e7f8e1840cf6dcf2435260b29661db90a0b22dbd2739df6
Successfully built mmh3
Installing collected packages: mmh3
Successfully installed mmh3-2.5.1
You should consider upgrading via the '/home/ec2-user/anaconda3/envs/python3/bin/python -m pip install --upgrade pip' command.


In [6]:
!pip install mxnet

Collecting mxnet
  Downloading mxnet-1.6.0-py2.py3-none-any.whl (68.7 MB)
     |████████████████████████████████| 68.7 MB 92 kB/s
Collecting graphviz<0.9.0,>=0.8.1
  Downloading graphviz-0.8.4-py2.py3-none-any.whl (16 kB)
Installing collected packages: graphviz, mxnet
Successfully installed graphviz-0.8.4 mxnet-1.6.0
You should consider upgrading via the '/home/ec2-user/anaconda3/envs/python3/bin/python -m pip install --upgrade pip' command.
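
After both installs finish, a quick check can confirm that the two packages are visible to the current kernel before the script is run. This is a sketch using only the standard library; the helper name `missing_packages` is hypothetical.

```python
import importlib
import importlib.util


def missing_packages(names=("mmh3", "mxnet")):
    """Return the subset of `names` that cannot be found in the current environment."""
    # Refresh the import system's caches so freshly installed packages are seen.
    importlib.invalidate_caches()
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages()
print("Missing packages:", missing if missing else "none")
```

If a package is still reported as missing, restarting the kernel after the `pip install` cells usually resolves it.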


## Execution of `data_recordio.py` script

We execute the script for the training files. It writes three files to the root folder of your notebook:
- `output-file_train.lst`
- `output-train.idx`
- `output-train.rec`

In [45]:
!python data_recordio.py --benign html/benign_files/training --malicious html/malicious_files/training --output output- --shuffle --num-thread 8 --train

usage: data_recordio.py [-h] --benign BENIGN --malicious MALICIOUS --output
                        OUTPUT [--train] [--val] [--feature-size FEATURE_SIZE]
                        [--shuffle] [--seed SEED] [--num-thread NUM_THREAD]
data_recordio.py: error: the following arguments are required: --output


In the same way, we execute the script for the validation files. It writes three files to the root folder of your notebook:
- `output-file_validation.lst`
- `output-val.idx`
- `output-val.rec`

In [41]:
!python data_recordio.py --benign html/benign_files/validation --malicious html/malicious_files/validation --output output- --shuffle --num-thread 8 --val

INFO:root:Namespace(benign='html/benign_files/validation', feature_size=1024, malicious='html/malicious_files/validation', num_thread=8, output='output-', seed=999, shuffle=True, train=False, val=True)
INFO:root:Creating a .lst file...
INFO:root:Creating a .rec and .idx file using multiprocessing...
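
Before uploading anything, it can help to check which files the script actually produced and whether any of them are empty. The sketch below assumes only that the files start with the `output-` prefix passed via `--output` and end in `.lst`, `.idx`, or `.rec`; the helper name `list_outputs` is hypothetical.

```python
from pathlib import Path


def list_outputs(prefix="output-", folder="."):
    """Return (name, size in bytes) for each .lst/.idx/.rec file starting with `prefix`."""
    wanted = {".lst", ".idx", ".rec"}
    found = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.name.startswith(prefix) and path.suffix in wanted:
            found.append((path.name, path.stat().st_size))
    return found


for name, size in list_outputs():
    flag = "" if size > 0 else "  <-- empty"
    print(f"{name}: {size} bytes{flag}")
```

If the training files are missing from the listing, rerun the training cell and check its output for an argument error.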


## Upload of the new files to S3

Once the files have been created in your notebook, upload the `.idx` and `.rec` files to your S3 bucket (replace `YOURBUCKET` with the name of your bucket).

In [44]:
%%sh
aws s3 cp output-train.rec s3://YOURBUCKET
aws s3 cp output-train.idx s3://YOURBUCKET
aws s3 cp output-val.rec s3://YOURBUCKET
aws s3 cp output-val.idx s3://YOURBUCKET

upload: ./output-train.rec to s3://sagemaker-nnmwd/output-train.rec 
upload: ./output-train.idx to s3://sagemaker-nnmwd/output-train.idx
upload: ./output-val.rec to s3://sagemaker-nnmwd/output-val.rec   
upload: ./output-val.idx to s3://sagemaker-nnmwd/output-val.idx   
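
The same upload can also be done from Python instead of the AWS CLI. The following is a sketch, not part of the original workflow: it assumes `boto3` is available in the kernel (it is imported lazily inside the function), that the notebook's role can write to the bucket, and that every `.rec` and `.idx` file starting with `output-` should be uploaded to the bucket root. The helper name `upload_outputs` is hypothetical.

```python
from pathlib import Path


def upload_outputs(bucket, folder=".", prefix="output-", client=None):
    """Upload every .rec/.idx file starting with `prefix` to the root of `bucket`.

    Returns the list of uploaded file names.
    """
    if client is None:
        # Imported lazily so the function can also be exercised with a stub client.
        import boto3
        client = boto3.client("s3")
    uploaded = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.name.startswith(prefix) and path.suffix in {".rec", ".idx"}:
            # upload_file(Filename, Bucket, Key): the object key is the bare file name.
            client.upload_file(str(path), bucket, path.name)
            uploaded.append(path.name)
    return uploaded
```

Calling `upload_outputs("YOURBUCKET")` from the notebook's root folder would then mirror the four `aws s3 cp` commands above.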
