## Install Dependencies

1. Installs the latest development version of Hugging Face's `diffusers` library directly from GitHub. This library is used for working with diffusion models such as Stable Diffusion.

2. Installs and upgrades several core Hugging Face and optimization libraries:
   - `transformers`: For using pre-trained language and vision models
   - `accelerate`: Simplifies training across CPUs, GPUs, and distributed setups
   - `wandb`: For logging experiments and visualizing training progress
   - `bitsandbytes`: Provides 8-bit optimizers (used by `--use_8bit_adam` during training) and 8-bit model loading for memory-efficient inference/training
   - `peft`: For Parameter-Efficient Fine-Tuning of large models

3. Installs additional Python libraries needed for data handling, model input/output, and networking or computer vision tasks:
   - `pandas`: For data manipulation and tabular processing
   - `torchvision`: For image transformations and loading datasets (used with PyTorch)
   - `pyarrow`: For efficient I/O and working with Apache Arrow / Parquet formats
   - `sentencepiece`: For subword tokenization used in many NLP models
   - `controlnet_aux`: Adds support functions for ControlNet like HED, Canny, Depth, etc.
   - `scapy`: For packet parsing and crafting, often used in networking/PCAP analysis
   - `gdown`: For downloading files from Google Drive using file IDs
   - `opencv-python`: For computer vision tasks and image manipulation

In [None]:
!pip install -q -U git+https://github.com/huggingface/diffusers

In [None]:
!pip install -q -U \
    transformers \
    accelerate \
    wandb \
    bitsandbytes \
    peft

In [None]:
!pip install pandas torchvision pyarrow sentencepiece controlnet_aux scapy gdown opencv-python
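
Because `diffusers` is installed from its development branch, it can help to record which versions were actually resolved. The sketch below uses only the standard library's `importlib.metadata`; the helper name `installed_versions` is illustrative and not part of this project.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names exactly as passed to pip in the cells above.
PACKAGES = [
    "diffusers", "transformers", "accelerate", "wandb", "bitsandbytes", "peft",
    "pandas", "torchvision", "pyarrow", "sentencepiece", "controlnet_aux",
    "scapy", "gdown", "opencv-python",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name:16s} {ver or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` points to a failed `pip install` cell that should be re-run before continuing.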


## Hugging Face Authentication

As SD3 is gated, you need to:
1. Go to the [Stable Diffusion 3 Medium Hugging Face page](https://huggingface.co/stabilityai/stable-diffusion-3-medium-diffusers)
2. Fill in the form and accept the gate
3. Log in to authenticate your access

If you are not already logged in, run the command below and enter your access token when prompted:

In [None]:
# If not logged in, run this command and enter your token when prompted
!huggingface-cli login
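
The same login can also be done from Python, which is convenient when the notebook runs non-interactively. This is a minimal sketch, assuming the token is exported in the `HF_TOKEN` environment variable; the helper names `resolve_hf_token` and `ensure_logged_in` are hypothetical. It relies on `login` and `whoami` from `huggingface_hub`, which is installed as a dependency of `diffusers`.

```python
import os


def resolve_hf_token(env=os.environ):
    """Return the token stored in HF_TOKEN, or None when it is unset or blank."""
    token = env.get("HF_TOKEN", "").strip()
    return token or None


def ensure_logged_in():
    """Return the Hub username, logging in first if no valid token is stored."""
    # Imported lazily so the helper above works without huggingface_hub.
    from huggingface_hub import login, whoami

    try:
        return whoami()["name"]
    except Exception:
        # With token=None, login() falls back to an interactive prompt.
        login(token=resolve_hf_token())
        return whoami()["name"]
```

Calling `ensure_logged_in()` once before loading the model should be enough. Accepting the gate on the model page is still required, because a valid token alone does not grant access.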

In [None]:
# Test if shell commands work and show output
!echo "This should show output"
!pwd
!ls -la | head -5

In [None]:
# Alternative: Use subprocess to capture output explicitly
import subprocess

def run_command(cmd):
    """Run a shell command and print its return code, stdout, and stderr."""
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    print(f"Command: {cmd}")
    print(f"Return code: {result.returncode}")
    if result.stdout:
        print(f"Output: {result.stdout}")
    if result.stderr:
        print(f"Error: {result.stderr}")

# Test with a simple command
run_command("echo 'Testing subprocess output'")

In [None]:
# Check that Python and shell output are displayed in the notebook (e.g. in VS Code)
import os
import subprocess
import sys

# Test basic output
print("Python version:", sys.version)
print("Current directory:", os.getcwd())

# Run a shell command and print its captured output
result = subprocess.run(['echo', 'VS Code test'], capture_output=True, text=True)
print("Shell output:", result.stdout.strip())

## Download Example Dataset

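The cell below downloads a preprocessed example dataset from Google Drive and moves its nPrint files into `nprint_traffic`. Alternatively, create your own dataset: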
1. Install nPrint in your environment (see https://nprint.github.io/nprint/)
2. Collect PCAPs belonging to the same service/application you plan to model
3. Convert each PCAP to an nPrint file with the following command:
   ```
   nprint -F -1 -P {pcap} -4 -i -6 -t -u -p 0 -c 1024 -W {output_file}
   ```
4. Place all resulting nPrint files in a folder named "nprint_traffic"
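
For more than a handful of captures, steps 3 and 4 can be scripted. The sketch below wraps the exact `nprint` invocation shown above. The helper names, the `*.pcap` glob, and the `.nprint` output suffix are assumptions, and `nprint` must already be on the `PATH`.

```python
import subprocess
from pathlib import Path


def build_nprint_command(pcap, output_file):
    """Return the nprint invocation from step 3 as an argument list."""
    return [
        "nprint", "-F", "-1", "-P", str(pcap),
        "-4", "-i", "-6", "-t", "-u",
        "-p", "0", "-c", "1024", "-W", str(output_file),
    ]


def convert_folder(pcap_dir, out_dir="nprint_traffic"):
    """Convert every *.pcap in pcap_dir to an nPrint file inside out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for pcap in sorted(Path(pcap_dir).glob("*.pcap")):
        target = out / (pcap.stem + ".nprint")
        # check=True raises CalledProcessError if nprint exits with an error.
        subprocess.run(build_nprint_command(pcap, target), check=True)
```

For example, `convert_folder("my_pcaps")` would fill `nprint_traffic/` with one nPrint file per capture, where `my_pcaps` is a placeholder directory name.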

In [None]:
!gdown 1vvneSH0a1WZFPHTafKusOUjNhg7oQioq --output preprocessed_dataset.zip
!unzip -q preprocessed_dataset.zip
!mkdir -p nprint_traffic
!mv amazon_nprint_traffic/* nprint_traffic/

## Convert nPrint to PNG Images

This script transforms `.nprint` files (tabular feature representations of network packets) into fixed-size PNG images that can be used as input for Stable Diffusion fine-tuning.

**Input:** `.nprint` files generated from packet captures (PCAPs) using the nPrint tool. Each file is a CSV-like matrix where each row represents a single packet and columns correspond to extracted features.

**Preprocessing:**
- Drops IP address-related columns to avoid injecting identifiable or non-generalizable information into the model
- Maps integer values in the remaining columns to RGBA color tuples to visualize numeric features as colored pixels

**Padding:** Pads each image to a uniform height (default 1024) using a solid background to ensure model input consistency across varying packet counts.

**Output:** Saves a PNG file for each `.nprint` file, preserving the packet structure as a vertically stacked color-coded image (rows = packets, cols = features).
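
The preprocessing and padding steps can be illustrated with a small pure-Python sketch. Everything specific here is an assumption for illustration: the color map, the background color, the column-name prefixes used to detect IP address columns, and truncation of captures longer than `height`. `scripts/nprint_to_png.py` may differ on all of these.

```python
import csv
import io

# Hypothetical color map for nPrint bit values: 1, 0, and -1 (field absent).
COLOR_MAP = {1: (255, 0, 0, 255), 0: (0, 255, 0, 255), -1: (0, 0, 255, 255)}
BACKGROUND = (0, 0, 0, 255)  # assumed padding color
# Assumed prefixes of columns that carry IP addresses.
IP_PREFIXES = ("src_ip", "ipv4_src", "ipv4_dst", "ipv6_src", "ipv6_dst")


def nprint_to_pixels(csv_text, height=1024):
    """Return a height x n_features grid of RGBA tuples (rows = packets)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    keep = [c for c in reader.fieldnames if not c.startswith(IP_PREFIXES)]
    rows = [[COLOR_MAP.get(int(r[c]), BACKGROUND) for c in keep] for r in reader]
    rows = rows[:height]  # truncate long captures (assumption)
    # Pad short captures with solid background rows up to the fixed height.
    rows += [[BACKGROUND] * len(keep) for _ in range(height - len(rows))]
    return rows
```

The resulting grid can be turned into a PNG with, for example, `PIL.Image.fromarray` after converting it to a `uint8` NumPy array.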

In [None]:
!python ./scripts/nprint_to_png.py -i ./nprint_traffic/ -o ./nprint_traffic_images

## Compute Embeddings

Generates text prompt embeddings via a Stable Diffusion 3 pipeline and T5 text encoder. Maps each local PNG image to a unique SHA-256 hash and associates it with the computed embeddings. Stores the resulting image-hash-to-embedding data in a .parquet file for further processing.

Here we are using the default instance prompt "pixelated network data for type-0 application traffic". You can configure this by referring to the compute_embeddings.py script for details on other supported arguments.

In [None]:
!python ./scripts/compute_embeddings.py
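
The image-to-hash mapping described above can be sketched as follows. This version hashes the raw file bytes, which is an assumption: `compute_embeddings.py` may hash the decoded image instead. The helper names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def index_images(folder):
    """Map each PNG's SHA-256 hash to its file name."""
    return {sha256_of_file(p): p.name for p in sorted(Path(folder).glob("*.png"))}
```

A table built from this index, with one row per hash plus the prompt embeddings, could then be written with `pandas.DataFrame.to_parquet`.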


## Train LoRA Adapter on Stable Diffusion 3 (Miniature Setup)

This command launches training using `accelerate` with DreamBooth-style LoRA tuning, optimized for quick experimentation or demo runs.

**⚠️ Current configuration uses:**
- Only 1 training step
- Small batch size
- No warmup
- High learning rate

**Intended only for testing or verifying training scripts**, NOT for quality results. For actual training, increase `max_train_steps`, adjust learning rate, and consider enabling full validation and saving checkpoints.

In [None]:
!accelerate launch ./scripts/train_dreambooth_lora_sd3_miniature.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-3-medium-diffusers"  \
  --instance_data_dir="nprint_traffic_images" \
  --data_df_path="sample_embeddings.parquet" \
  --output_dir="trained-sd3-lora-miniature" \
  --mixed_precision="fp16" \
  --instance_prompt="pixelated network data for type-0 application traffic" \
  --train_batch_size=2 \
  --gradient_accumulation_steps=1 --gradient_checkpointing \
  --use_8bit_adam \
  --learning_rate=5e-5 \
  --report_to="wandb" \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=1 \
  --seed="0"
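
When sizing a real run, it helps to translate the flags above into how many images the optimizer actually sees. The arithmetic below is a sketch. The number of processes (set by your `accelerate` configuration) and the 100-image dataset are assumptions for illustration.

```python
import math


def effective_batch_size(train_batch_size, gradient_accumulation_steps, num_processes=1):
    """Images consumed per optimizer step across all processes."""
    return train_batch_size * gradient_accumulation_steps * num_processes


def steps_for_epochs(num_images, epochs, batch):
    """Optimizer steps needed to pass over the dataset `epochs` times."""
    return math.ceil(num_images / batch) * epochs


batch = effective_batch_size(train_batch_size=2, gradient_accumulation_steps=1)
print(batch)                             # 2 images per step in the demo config
print(steps_for_epochs(100, 10, batch))  # 500 steps for 10 epochs over 100 images
```

With `--max_train_steps=1`, the demo configuration therefore trains on only two images in total, which is why it is suitable only for checking that the script runs.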

## Inference Pipeline for SD3 + ControlNet + LoRA

The cells below demonstrate the full generation process:
- Loads Stable Diffusion 3 Medium base model
- Loads ControlNet (Canny edge-based guidance)
- Loads LoRA weights fine-tuned on pixelated network traffic
- Applies edge-conditioned generation using a sample input

In [None]:
import gc
import os

import cv2
import torch
from PIL import Image
from diffusers import StableDiffusion3ControlNetPipeline, SD3ControlNetModel


def flush():
    """Free Python and CUDA memory left over from earlier cells (e.g. training)."""
    gc.collect()
    torch.cuda.empty_cache()


flush()

# Make sure our output folder exists
os.makedirs("generated_traffic_images", exist_ok=True)

In [None]:
# Base SD 3.0 model
base_model_path = "stabilityai/stable-diffusion-3-medium-diffusers"

# Canny-based ControlNet
controlnet_path = "InstantX/SD3-Controlnet-Canny"

# Load the ControlNet and pipeline
controlnet = SD3ControlNetModel.from_pretrained(
    controlnet_path, torch_dtype=torch.float16
)
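pipe = StableDiffusion3ControlNetPipeline.from_pretrained(
    base_model_path,
    controlnet=controlnet,
    torch_dtype=torch.float16,  # load in half precision, matching the ControlNet
)

# Load LoRA weights
lora_output_path = "trained-sd3-lora-miniature"
pipe.load_lora_weights(lora_output_path)

# Keep weights on the CPU and move each submodule to the GPU only while it runs.
# This trades speed for a much lower peak GPU memory footprint, so the pipeline
# is not moved to "cuda" explicitly beforehand.
pipe.enable_sequential_cpu_offload()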

In [None]:
# Convert original control image to Canny edges via OpenCV
orig_path = "./scripts/traffic_conditioning_image.png"
orig_bgr = cv2.imread(orig_path)
if orig_bgr is None:
    raise ValueError(f"Could not load file: {orig_path}")

# Convert to grayscale
gray = cv2.cvtColor(orig_bgr, cv2.COLOR_BGR2GRAY)

# Generate Canny edge map (tweak thresholds as needed)
edges = cv2.Canny(gray, 100, 200)

# Convert single-channel edge map to 3-channel RGB
edges_rgb = cv2.cvtColor(edges, cv2.COLOR_GRAY2RGB)

# Convert to PIL for use in ControlNet pipeline
control_image = Image.fromarray(edges_rgb)
orig_width, orig_height = control_image.size
target_width = 1024
if orig_width > target_width:
    # (left, upper, right, lower)
    control_image = control_image.crop((0, 0, target_width, orig_height))
    
print("Displaying Canny control image:")
display(control_image)

In [None]:
# Generate an image with the `pipe` and `control_image` prepared in the cells above
prompt = "pixelated network data for type-0 application traffic"
generator = torch.manual_seed(0)  # reproducibility

# Generate a 1088x1024 (width x height) image conditioned on the Canny control image
image = pipe(
    prompt=prompt,
    num_inference_steps=20,
    generator=generator,
    height=1024,
    width=1088,
    control_image=control_image,
    controlnet_conditioning_scale=0.5,  # increase to adhere more strongly to edges
).images[0]

# Save the generated image
output_path = os.path.join("generated_traffic_images", "generated_traffic.png")
image.save(output_path)
print(f"Generated image saved to: {output_path}")


## Post-Generation Processing Pipeline

The cells below perform a 3-stage transformation of the generated PNG images:

1. **Color Augmentation:** Applies standardized color shifts to improve nPrint reconstruction accuracy.
2. **Image-to-nPrint Conversion:** Converts the augmented PNG images back into the `.nprint` tabular format, using a reference `.nprint` file to maintain consistent structure and column order. This step allows diffusion-generated visual traffic to be fed into analysis tools.
3. **Heuristic Correction & PCAP Reconstruction:** Reconstructs a valid and replayable `.pcap` file from the diffusion-generated `.nprint` representation. It applies intra-packet corrections (fixes within individual packets), enforces inter-packet dependencies, and then:
   - saves the corrected `.nprint` to disk
   - calls `nprint -W` to convert the `.nprint` into a `.pcap` using the external tool
   - runs Scapy-based checksum updates to ensure IPv4 validity
   - reconverts the final `.pcap` back to `.nprint` (with fixed layout) for downstream tasks

This pipeline enables turning synthetic traffic images back into replayable network traffic for evaluation or simulation.

In [None]:
# Step 1: Color Augmentation
!python ./scripts/color_processor.py \
  --input_dir="./generated_traffic_images" \
  --output_dir="./color_corrected_generated_traffic_images"

In [None]:
# Step 2: Image-to-nPrint Conversion
!python ./scripts/image_to_nprint.py \
  --org_nprint ./scripts/column_example.nprint \
  --input_dir ./color_corrected_generated_traffic_images \
  --output_dir ./generated_nprint

In [None]:
# Step 3: Heuristic Correction & PCAP Reconstruction
!python ./scripts/mass_reconstruction.py \
  --input_dir ./generated_nprint \
  --output_pcap_dir ./replayable_generated_pcaps \
  --output_nprint_dir ./replayable_generated_nprints \
  --formatted_nprint_path ./scripts/correct_format.nprint
# Final Pcap is stored in replayable_generated_pcaps
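
The checksum update in Step 3 matters because a generated header almost never carries a checksum that matches its other fields, and receivers drop such packets. As a reference for what "IPv4 validity" means here, the sketch below computes the standard Internet checksum (RFC 1071) over a header in pure Python. It is not the code used by `mass_reconstruction.py`, which relies on Scapy. The sample header bytes are an illustrative example.

```python
def ipv4_header_checksum(header: bytes) -> int:
    """RFC 1071 checksum over an IPv4 header whose checksum field is zeroed."""
    if len(header) % 2:
        header += b"\x00"  # pad to a whole number of 16-bit words
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) | header[i + 1]
    # Fold carries back into the low 16 bits (one's-complement addition).
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# 20-byte header with the checksum field (bytes 10-11) set to zero.
header = bytes.fromhex("45000073000040004011" "0000" "c0a80001c0a800c7")
print(hex(ipv4_header_checksum(header)))  # 0xb861
```

Recomputing the checksum over a header that already contains the correct value yields 0, which gives a quick validity check for reconstructed packets.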