<a href="https://colab.research.google.com/github/noise-lab/netssm/blob/main/example/train_netssm_from_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

This notebook covers how to use the NetSSM repository to train a new NetSSM model from scratch on toy PCAP data, including preprocessing, tokenization, and training. It also demonstrates how to convert the raw output generated by NetSSM into a parsable PCAP.

NetSSM can train on and generate both **single- and multi-flow network traffic sessions**. The process is exactly the same for either type of data, and follows the steps detailed in this notebook.

# Setup dependencies

### Setup uv

In [None]:
! curl -LsSf https://astral.sh/uv/install.sh | sh

### Setup repo

In [None]:
! git clone https://github.com/noise-lab/netssm.git

### Python dependencies

If Colab asks you to restart the runtime to use newly installed versions, you can dismiss the message.

In [None]:
import os

base_dir = os.path.join(os.getcwd(), "netssm")
os.chdir(base_dir)
! uv sync

#### Install mamba-ssm and causal-conv1d

To speed things up, we install prebuilt wheels for both packages directly, selecting them based on our system's CUDA version and C++ ABI.

In [None]:
import re
import subprocess

nvcc_cmd = ["/usr/local/cuda/bin/nvcc", "--version"]
nvcc_result = subprocess.run(nvcc_cmd, capture_output=True, text=True)
match = re.search(r"release (\d+)", nvcc_result.stdout)
cuda_major_version = match.group(1) if match else "Unknown"

torch_cmd = [".venv/bin/python", "-c", "import torch;print(torch._C._GLIBCXX_USE_CXX11_ABI)"]
torch_result = subprocess.run(torch_cmd, capture_output=True, text=True)
use_cxx11_abi = torch_result.stdout.strip().upper()

mamba_ssm_whl = f"mamba_ssm-2.2.6.post3+cu{cuda_major_version}torch2.7cxx11abi{use_cxx11_abi}-cp310-cp310-linux_x86_64.whl"
causalconv_whl = f"causal_conv1d-1.5.4+cu{cuda_major_version}torch2.7cxx11abi{use_cxx11_abi}-cp310-cp310-linux_x86_64.whl"

subprocess.run(["wget", f"https://github.com/state-spaces/mamba/releases/download/v2.2.6.post3/{mamba_ssm_whl}"])
subprocess.run(["wget", f"https://github.com/Dao-AILab/causal-conv1d/releases/download/v1.5.4/{causalconv_whl}"])

! uv pip install {mamba_ssm_whl} --python .venv
! uv pip install {causalconv_whl} --python .venv
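
The wheel file names above are assembled from a few pieces of system information. As a rough sketch of that naming scheme, the hypothetical helper below (`wheel_name` is not part of NetSSM) builds the same file name and fails early when the CUDA version could not be detected, rather than requesting a URL containing `cuUnknown`. The Python tag is passed explicitly; the cell above hardcodes `cp310`, which assumes the virtual environment runs Python 3.10.

```python
def wheel_name(package, version, cuda_major, torch_version, cxx11_abi, py_tag="cp310"):
    """Build a release wheel file name following the pattern used in the cell above."""
    if not str(cuda_major).isdigit():
        raise ValueError(f"could not detect CUDA major version: {cuda_major!r}")
    if cxx11_abi not in ("TRUE", "FALSE"):
        raise ValueError(f"unexpected C++11 ABI flag: {cxx11_abi!r}")
    return (
        f"{package}-{version}+cu{cuda_major}torch{torch_version}"
        f"cxx11abi{cxx11_abi}-{py_tag}-{py_tag}-linux_x86_64.whl"
    )

# Example values only; the real ones come from nvcc and torch as shown above.
example_whl = wheel_name("mamba_ssm", "2.2.6.post3", "12", "2.7", "FALSE")
print(example_whl)
```

If the helper raises, it is likely that `nvcc` is not at `/usr/local/cuda/bin/nvcc` or that `torch` is not yet installed in `.venv`.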

### Setup Go

In [None]:
# Remove any stale download, then fetch and unpack Go 1.24.2
! rm -f go1.24.2.linux-amd64.tar.gz
! wget https://go.dev/dl/go1.24.2.linux-amd64.tar.gz
! rm -rf /usr/local/go && tar -C /usr/local -xzf go1.24.2.linux-amd64.tar.gz
# Make the `go` binary visible to later shell commands
os.environ['PATH'] += ':/usr/local/go/bin'

### Setup libpcap

In [None]:
! apt update
! apt install -y libpcap-dev tshark

# Preparing training data

Creating training data for NetSSM consists of three steps:
  1. Preprocessing PCAPs to raw training samples
  2. Creating a custom tokenizer
  3. Tokenizing the raw training samples

In this notebook, we will be using sample PCAPs representing traffic from various video streaming services, sourced from the [nPrint project](https://nprint.github.io).

## Preprocessing PCAPs

NetSSM trains on sequences of raw packet bytes, represented in a string-based format. We provide a preprocessor, written in Go, that converts PCAPs to this representation.


### Build the preprocessor

The following commands build the Go preprocessor to a binary called `netssm_preprocessor`.

In [None]:
example_dir = os.path.join(base_dir, "example")
preprocessing_dir = os.path.join(base_dir, "preprocessing")
os.chdir(preprocessing_dir)
! go mod init netssm_preprocessor
! go mod tidy
! go build
os.chdir(example_dir)

### Run the preprocessor on sample data

Our sample data for this example lives at `input`, and contains four PCAP files corresponding to video streaming data for four different platforms:

In [None]:
! ls input

The file `labels.csv` contains a mapping of how `netssm_preprocessor` should "label" each PCAP for training.

In [None]:
! cat input/labels.csv

Let's run our preprocessor on the sample data.

We pass `input` as the input directory for the processor, `input/labels.csv` for the labels, and write the parsed representations to `output/training_raw.jsonl`.

In [None]:
! ../preprocessing/netssm_preprocessor -in-dir input -label-csv input/labels.csv -out output/training_raw.jsonl

`output/training_raw.jsonl` contains four lines, each corresponding to a training sample parsed from a PCAP in `input`.

In [None]:
! wc -l output/training_raw.jsonl

Each training sample follows the format `<|label|> RAW_BYTES <|pkt|> RAW_BYTES <|pkt|>...`, where `<|label|>` and `<|pkt|>` are special tokens for the NetSSM model. Let's see exactly what this looks like:

In [None]:
! head -c 500 output/training_raw.jsonl

Here, we see that the first sample in the `training_raw.jsonl` file we just created is Amazon streaming traffic.
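
To make the sample layout concrete, the sketch below splits a string in the `<|label|> RAW_BYTES <|pkt|> ...` format into its label and per-packet byte lists. It is only an illustration: `split_sample` is a hypothetical helper, the toy string is made up, and it assumes each byte is written as a space-separated decimal value between 0 and 255.

```python
import re

def split_sample(sample):
    """Split '<|label|> b b b <|pkt|> b b <|pkt|>' into (label, packets)."""
    match = re.match(r"\s*<\|(\w+)\|>", sample)
    if match is None:
        raise ValueError("sample does not start with a label token")
    label = match.group(1)
    packets = []
    # Everything after the label is a run of packets, each closed by <|pkt|>.
    for chunk in sample[match.end():].split("<|pkt|>"):
        values = chunk.split()
        if values:
            packets.append([int(v) for v in values])
    return label, packets

# Made-up toy sample, not taken from the real training data.
toy = "<|amazon|> 69 0 0 52 <|pkt|> 69 0 0 40 <|pkt|>"
label, packets = split_sample(toy)
print(label, len(packets), packets[0])  # amazon 2 [69, 0, 0, 52]
```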

## Creating a custom tokenizer

NetSSM treats generating network traffic data as an unsupervised sequence generation problem, where each byte in a packet is represented by a corresponding token. In this section, we first create a custom tokenizer for this task, and then apply it to our preprocessed data prior to model training.

The tokenizer allows NetSSM to understand the representation we created in the previous step. It maps each byte value to a corresponding token ID, and also contains the special tokens NetSSM needs for the specific data (i.e., video streaming traffic) that we use in this example.

We'll use the `create_tokenizer.py` script to do this. Specifically, the following command creates a tokenizer contained in directory `tokenizers/video_streaming_tok` that has special tokens corresponding to the four different video streaming sources we use in our example data.

In [None]:
os.chdir(os.path.join(base_dir, "tokenizers"))

! uv run create_tokenizer.py \
    --special_tokens "netflix amazon youtube twitch" \
    --tokenizer_name video_streaming_tok

If we take a look at `video_streaming_tok/tokenizer.json`, we can see the special tokens we defined, along with the string values 0 - 255, each of which is mapped to its own token ID.

In [None]:
! sed -n '42,98 p' video_streaming_tok/tokenizer.json
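
The idea behind this vocabulary can be sketched as follows. This is a simplified illustration, not the logic of `create_tokenizer.py`: `build_byte_vocab` is a hypothetical helper, and the real tokenizer may order IDs differently or include further special tokens (for example, for padding).

```python
def build_byte_vocab(labels):
    """Map special tokens and the strings '0'..'255' to consecutive token IDs."""
    special = ["<|pkt|>"] + [f"<|{name}|>" for name in labels]
    byte_tokens = [str(value) for value in range(256)]
    return {token: idx for idx, token in enumerate(special + byte_tokens)}

vocab = build_byte_vocab(["netflix", "amazon", "youtube", "twitch"])
print(len(vocab))  # 261 (5 special tokens + 256 byte values)
print(vocab["<|netflix|>"], vocab["0"], vocab["255"])  # 1 5 260
```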

## Tokenizing our preprocessed data

We'll tokenize our data from the preprocessing step before giving it to the model, to help speed up the training pipeline.

This can be helpful if you are using a cluster with a workload manager (e.g., Slurm) that has a time limit per job, and may preempt jobs after this duration has been reached. In this case, you may want to use the GPU on a cluster node as efficiently as possible.

We'll use the script at `preprocessing/create_tok_dataset.py` to do this. This script will use the tokenizer provided to it to convert the raw training JSONL data to the tokenized format (stored as .arrow files) expected by NetSSM.

In the example below, we specify a maximum sample length of 1,000 tokens and enable padding. This truncates any training sample longer than 1,000 tokens and pads shorter samples up to this length. The right `max_len` value depends on your training hardware and data. In our paper, we use a maximum length of 100,000 tokens on an NVIDIA A40 with 48GB VRAM, corresponding to multi-flow traces comprised of over 1,000 packets. Here, we scale this down to 1,000 to accommodate the 12GB VRAM that the Google Colab T4 GPU has, as well as to reduce noise from padding (none of our training PCAPs have more than ~170 packets).

We also specify the path to the custom tokenizer we created, as well as the raw input data at `example/output/training_raw.jsonl`, and an output directory.

In [None]:
os.chdir(preprocessing_dir)

! uv run create_tok_dataset.py \
    --padding \
    --max_len 1000 \
    --tokenizer ../tokenizers/video_streaming_tok/ \
    --data_path ../example/output/training_raw.jsonl \
    --out_path ../example/output/train_data
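
The combined effect of `--max_len` and `--padding` can be sketched in a few lines. This is an illustrative stand-in rather than the logic in `create_tok_dataset.py`: the function name, the pad token ID of `0`, and padding on the right are all assumptions.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Cut a sequence to max_len tokens, or right-pad it up to max_len."""
    clipped = list(token_ids[:max_len])
    return clipped + [pad_id] * (max_len - len(clipped))

short = pad_or_truncate([7, 8, 9], max_len=5)
long_ = pad_or_truncate(list(range(10)), max_len=5)
print(short)  # [7, 8, 9, 0, 0]
print(long_)  # [0, 1, 2, 3, 4]
```

Every sample ends up with the same length, which is what allows samples to be stacked into fixed-size tensors.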

We're now ready to move on to training a NetSSM model!

# Training

We'll train a NetSSM model using the `train.py` script in the root directory of the repo. This script can take several arguments, some of which are specified in the command below.

## Training from scratch

`--batch_size` specifies the batch size, or the number of samples that are presented to the model to train on at a time. A larger batch size will typically reduce training time. In our paper, we use a batch size of 1 to present NetSSM with the largest session representation possible, disregarding training time.

`--output` specifies the output folder/path to save the model checkpoint.

`--tokenizer` specifies the location of the tokenizer the model should use. This should be the same path used to tokenize the data in the prior steps.

`--data_path` specifies the location of the training dataset. Although the steps above first convert the raw JSONL data to a tokenized dataset in `.arrow` format, the training script also accepts a path to the raw JSONL file. As noted above, tokenizing at training time can take a long time and is an inefficient use of a GPU node, especially for a large dataset.

`--num_epochs` specifies the number of epochs -- passes over the entire dataset -- that the model should train for.

`--torch_dtype` specifies the data precision used to represent the tensors during training. The commands in this notebook pass `float32`, which works on the T4 GPU; in our paper we use `bfloat16`, a precision only available on newer GPUs.

The script accepts a number of other parameters that are not shown here (e.g., `learning_rate`, `gradient_accumulation_steps`). These are training hyperparameters that influence how frequently, and by how much, model parameters are updated during gradient descent.
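
As a rough guide to how these settings interact, the number of optimizer updates is approximately the dataset size divided by the effective batch size, multiplied by the number of epochs. The helper below is a hypothetical back-of-the-envelope sketch, not code from `train.py`; the exact rounding at epoch boundaries depends on the training loop.

```python
import math

def optimizer_steps(num_samples, batch_size, num_epochs, grad_accum_steps=1):
    """Estimate the number of parameter updates over a full training run."""
    # One update happens after batch_size * grad_accum_steps samples.
    steps_per_epoch = math.ceil(num_samples / (batch_size * grad_accum_steps))
    return steps_per_epoch * num_epochs

print(optimizer_steps(4, batch_size=1, num_epochs=30))     # 120
print(optimizer_steps(5882, batch_size=1, num_epochs=30))  # 176460
```

The first call corresponds to the four toy samples in this notebook. The second uses the 5,882 captures and 30 epochs of the pretrained Netflix model described later; assuming a batch size of 1 and no gradient accumulation, the result is consistent with the `checkpoint-176460` folder name in that model's archive.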

In [None]:
os.chdir(base_dir)

! uv run train.py \
    --batch_size 1 \
    --output checkpoints/notebook_example \
    --tokenizer tokenizers/video_streaming_tok \
    --data_path example/output/train_data \
    --num_epochs 30 \
    --torch_dtype float32

Great! If you take a look at `checkpoints/notebook_example`, you'll see checkpoint folders:

In [None]:
! ls checkpoints/notebook_example

## Resuming training from a checkpoint

The training script saves the most recent two checkpoints. Inside each folder there is the model binary itself in PyTorch format, and the training state.

These directories are useful for resuming training. Specifically, training can be picked up from a checkpoint by running the same `train.py` script with the same arguments used to begin training from scratch. The script looks at the location provided by `--output` and finds the most recent checkpoint to load from. Let's try this out now by taking our `30_epochs` checkpoint and training for another 20 epochs:

In [None]:
! uv run train.py \
    --batch_size 1 \
    --output checkpoints/notebook_example \
    --tokenizer tokenizers/video_streaming_tok \
    --data_path example/output/train_data \
    --num_epochs 50 \
    --torch_dtype float32

Here, we see the model pick up from the 30 epoch checkpoint, and continue training for the specified amount. The output directory also now has the 50 epoch checkpoint:

In [None]:
! ls checkpoints/notebook_example
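
The "find the most recent checkpoint" step can be pictured with the sketch below. It is not the logic in `train.py`: `latest_checkpoint` is a hypothetical helper that assumes checkpoint folders are named `<N>_epochs`, as with `30_epochs` and `50_epochs` in this notebook, and it runs against a temporary directory rather than the real output folder.

```python
import re
import tempfile
from pathlib import Path

def latest_checkpoint(output_dir):
    """Return the '<N>_epochs' subdirectory with the largest N, or None."""
    best = None
    for child in Path(output_dir).iterdir():
        match = re.fullmatch(r"(\d+)_epochs", child.name)
        if child.is_dir() and match:
            epochs = int(match.group(1))
            if best is None or epochs > best[0]:
                best = (epochs, child)
    return best[1] if best else None

# Demonstrate on a throwaway directory that mimics the layout above.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("30_epochs", "50_epochs"):
        (Path(tmp) / name).mkdir()
    latest_name = latest_checkpoint(tmp).name
print(latest_name)  # 50_epochs
```

Comparing the epoch counts as integers matters here: a plain string sort would rank `5_epochs` above `30_epochs`.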

# Generation

Now that we've trained a model, let's try generating data with it using the `generate.py` script.

This script also takes various arguments:

`--prompt` specifies the starting string/prompt for kickstarting generation. Typically, this is the label special token, or the first packet of a flow whose remaining packets should be generated.

`--model` specifies the path to the model checkpoint to generate with.

`--tokenizer` specifies the location of the tokenizer the model should use. This should be the same path used to tokenize the data in the prior steps.

`--genlen` specifies the number of tokens to generate (including the tokenized representation passed in the prompt).

`--gen_len_pkts` changes the behavior of `--genlen` to instead correspond to the number of packets (e.g., 100 packets).


In [None]:
! uv run generation/generate.py \
    --prompt "<|netflix|>" \
    --model checkpoints/notebook_example/50_epochs \
    --tokenizer tokenizers/video_streaming_tok \
    --genlen 100 \
    --gen_len_pkts \
    --torch_dtype float32

Let's take a look at the raw generated output. By default this will live at `inference/EXP_1/RUN_1`, where `EXP_1` and `RUN_1` can be changed with the arguments `experiment_base_dir` and `experiment`, respectively:

In [None]:
! cat inference/EXP_1/RUN_1/generated.txt

## Generating using a pretrained model

Let's now try loading a pretrained checkpoint that was used for our paper, which was trained on a dataset of Netflix multi-flow traffic comprised of 5,882 captures for 30 epochs. Each capture is represented by 238,500 tokens (2,250 packets), with each packet being 106 tokens.

We change `experiment_base_dir` and `experiment` such that the output should be at `inference/notebook/netflix`.

Additionally, we pass the flag `gen_len_pkts`, which will cause `genlen` to be treated as the number of ***packets*** to generate (i.e., 100), instead of tokens.
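
To get a feel for the difference between the two units, the arithmetic below uses the 106 tokens per packet quoted above for the pretrained Netflix model. This is only an illustration: `packets_to_tokens` is a hypothetical helper, and the tokens-per-packet figure is specific to that model's data and may differ elsewhere.

```python
TOKENS_PER_PACKET = 106  # figure quoted above for the pretrained Netflix model

def packets_to_tokens(num_packets, tokens_per_packet=TOKENS_PER_PACKET):
    """Convert a length in packets to the equivalent length in tokens."""
    return num_packets * tokens_per_packet

print(packets_to_tokens(2250))  # 238500, one full training capture
print(packets_to_tokens(100))   # 10600, the 100 packets requested here
```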

There also are a number of generation parameters that can influence how the model outputs tokens. In the next cells, we'll use the generation parameters found in the NetSSM paper, for the Semantic Similarity section (5.3):

### Generation parameters

`--repetition-penalty` penalizes repetition of tokens during generation. Values > 1.0 discourage the model from repeating the same tokens, which can help improve diversity and reduce looping outputs.

`--temperature` controls the randomness of predictions by scaling the logits before sampling. Lower values (e.g., < 1.0) make the output more deterministic; higher values increase diversity.

`--minp` sets the minimum probability threshold for tokens to be considered during sampling. Tokens with probabilities below this value are filtered out, encouraging more confident predictions.

`--topk` limits the sampling pool to the top-k most probable tokens. Only the top-k tokens are considered for sampling, reducing randomness and encouraging more likely continuations.

`--topp` enables nucleus (top-p) sampling, where tokens are sampled from the smallest possible set whose cumulative probability exceeds `p`. This allows for dynamic control over token diversity based on probability mass.
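
To show how these parameters can combine, here is a minimal pure-Python sketch of a single sampling step. It is not NetSSM's or `mamba-ssm`'s implementation: `sample_next` is hypothetical, the order in which the filters are applied is one of several reasonable choices, and `min_p` is interpreted relative to the most likely token, which is an assumption.

```python
import math
import random

def _softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, prev_tokens, temperature=0.5, top_k=25, top_p=0.9,
                min_p=0.0, repetition_penalty=1.8, rng=random):
    """Pick the next token ID; also return the candidate IDs that survived."""
    logits = list(logits)
    # Repetition penalty: make tokens that were already generated less likely.
    for tok in set(prev_tokens):
        if logits[tok] > 0:
            logits[tok] /= repetition_penalty
        else:
            logits[tok] *= repetition_penalty
    # Temperature: values below 1.0 sharpen the distribution.
    probs = _softmax([x / temperature for x in logits])
    # Rank token IDs from most to least probable.
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    # Top-k: keep only the k most probable tokens.
    if top_k > 0:
        ranked = ranked[:top_k]
    # Min-p (relative form, an assumption): drop tokens far below the best one.
    if min_p > 0.0:
        floor = min_p * probs[ranked[0]]
        ranked = [t for t in ranked if probs[t] >= floor]
    # Top-p: keep the smallest prefix whose renormalized mass reaches top_p.
    if 0.0 < top_p < 1.0:
        mass = sum(probs[t] for t in ranked)
        kept, cumulative = [], 0.0
        for t in ranked:
            kept.append(t)
            cumulative += probs[t] / mass
            if cumulative >= top_p:
                break
        ranked = kept
    # Sample among the survivors in proportion to their probabilities.
    weights = [probs[t] for t in ranked]
    return rng.choices(ranked, weights=weights, k=1)[0], ranked

token, candidates = sample_next([5.0, 1.0, 0.5, 0.1], prev_tokens=[])
print(token, candidates)
```

With these toy logits and a temperature of 0.5, the first token holds well over 90% of the probability mass, so top-p leaves it as the only candidate.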

In [None]:
# Download and setup the pretrained checkpoint
! gdown 1koMbDyaTi0buF1eoDplqOFtJLX-ssS6a
! mv netflix_multi_100k_30_epochs.zip ./checkpoints && cd ./checkpoints && unzip netflix_multi_100k_30_epochs.zip && mv checkpoint-176460 netflix_multi_100k_30_epochs && cd ..

# Generate using the checkpoint
! uv run generation/generate.py \
    --prompt "<|netflix|>" \
    --model "./checkpoints/netflix_multi_100k_30_epochs" \
    --tokenizer "./tokenizers/nm_tokenizer_multi_netflix" \
    --experiment_base_dir notebook \
    --experiment netflix \
    --genlen 100 \
    --gen_len_pkts \
    --repetition-penalty 1.8 \
    --temperature 0.5 \
    --minp 0.0 \
    --topp 0.9 \
    --topk 25 \
    --torch_dtype float32

Let's look again now at the generated output:

In [None]:
! cat inference/notebook/netflix/generated.txt

# Converting raw generation to a PCAP

Next, you'll want to convert the raw text output to a PCAP for it to be useful. We can do this using the `conversion.py` script like so:

In [None]:
! uv run generation/conversion.py inference/notebook/netflix/generated.txt inference/notebook/netflix/generated.pcap

This script uses `scapy` to build the binary PCAP from the raw text output. We can confirm that the synthetic data is parsable:

In [None]:
! capinfos inference/notebook/netflix/generated.pcap

In [None]:
! tshark -r inference/notebook/netflix/generated.pcap
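
To illustrate what the conversion step involves, the sketch below turns a generated string into the bytes of a classic PCAP file using only the standard library. It is not `conversion.py`, which uses `scapy`; here the file is written by hand with `struct`. The helper name, the placeholder timestamps, the Ethernet link type (`1`), and the assumption that the generated text uses the same `<|label|> RAW_BYTES <|pkt|>` layout as the training data are all illustrative.

```python
import re
import struct

def text_to_pcap_bytes(generated, linktype=1):
    """Encode '<|label|> b b <|pkt|> b b <|pkt|>' text as classic pcap bytes."""
    # Global header: magic, version 2.4, timezone offset, sigfigs, snaplen, link type.
    out = [struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, linktype)]
    # Drop a leading label token such as <|netflix|>, if present.
    body = re.sub(r"^\s*<\|(?!pkt\|>)\w+\|>", "", generated)
    for chunk in body.split("<|pkt|>"):
        values = [int(v) for v in chunk.split()]
        if not values:
            continue
        payload = bytes(values)  # raises ValueError for values outside 0..255
        ts_sec = len(out) - 1    # placeholder timestamps, one second apart
        # Record header: seconds, microseconds, captured length, original length.
        out.append(struct.pack("<IIII", ts_sec, 0, len(payload), len(payload)) + payload)
    return b"".join(out)

pcap = text_to_pcap_bytes("<|netflix|> 1 2 3 <|pkt|> 4 5 <|pkt|>")
print(len(pcap))  # 61 = 24-byte file header + (16 + 3) + (16 + 2)
```

Writing the result with `open(path, "wb")` produces a file that PCAP readers can open, although the toy packets above are far too short to decode as real traffic.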