# Example: Dataset Anonymisation Framework

This notebook demonstrates how to use the Python reference implementation of the anonymisation scheme described in the paper.  

It shows how logical network groups are defined in a YAML configuration, and how IP addresses are anonymised while preserving logical structure.

## 1. Setup

Make sure your working directory contains:

- `dataset_anonymiser.py` (the implementation)
- `config.yaml` (the configuration file with group definitions; running this notebook creates an example `config.yaml`)

Install dependencies if needed:
```bash
pip install -r requirements.txt
```

In [None]:
from dataset_anonymiser import DatasetAnonymiser
import pandas as pd

## 2. Example Configuration

Below is an example configuration (`config.yaml`) defining logical network groups and their output address ranges.  
Each group has one or more IP prefixes and a start–end range defining where anonymised addresses will be mapped.

In [8]:
%%writefile config.yaml
groups:
  servers:
    prefixes:
      - "149.171.126.0/24"
      - "59.166.0.0/24"
    output_range:
      start: "10.0.0.0"
      end: "10.0.0.255"

  users:
    prefixes:
      - "192.168.0.0/16"
      - "10.40.170.0/24"
    output_range:
      start: "10.0.1.0"
      end: "10.0.1.255"

  external:
    prefixes:
      - "175.45.176.0/24"
    output_range:
      start: "10.0.2.0"
      end: "10.0.2.255"

Writing config.yaml
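
To make the prefix-to-group relationship concrete, here is a minimal sketch of how an address could be matched against the prefixes above using only the standard library. The `GROUPS` dictionary and the `find_group` helper are hypothetical names introduced for illustration; they are not part of `dataset_anonymiser.py`, whose internal lookup may work differently.

```python
import ipaddress

# Hypothetical in-memory copy of the prefixes from the example config.yaml.
GROUPS = {
    "servers": ["149.171.126.0/24", "59.166.0.0/24"],
    "users": ["192.168.0.0/16", "10.40.170.0/24"],
    "external": ["175.45.176.0/24"],
}


def find_group(address, groups=GROUPS):
    """Return the first group whose prefixes contain `address`, or None."""
    ip = ipaddress.ip_address(address)
    for name, prefixes in groups.items():
        if any(ip in ipaddress.ip_network(prefix) for prefix in prefixes):
            return name
    return None


print(find_group("149.171.126.10"))  # servers
print(find_group("192.168.3.4"))     # users
print(find_group("8.8.8.8"))         # None
```

An address that matches no prefix returns `None` in this sketch; how the reference implementation treats such addresses is not shown here.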


## 3. Create Example Data

Here we construct a small Pandas DataFrame simulating network flow records with source and destination IP addresses.

In [9]:
df = pd.DataFrame({
    "srcip": ["149.171.126.10", "192.168.1.15", "175.45.176.5", "59.166.0.22"],
    "dstip": ["10.40.170.2", "59.166.0.2", "149.171.126.44", "192.168.3.4"]
})

print("Original dataset:")
display(df)

Original dataset:


| | srcip | dstip |
|---|---|---|
| 0 | 149.171.126.10 | 10.40.170.2 |
| 1 | 192.168.1.15 | 59.166.0.2 |
| 2 | 175.45.176.5 | 149.171.126.44 |
| 3 | 59.166.0.22 | 192.168.3.4 |


## 4. Apply the Anonymisation Process

We instantiate the anonymiser, apply it to the chosen columns, and save the pseudonym mapping to disk.

On subsequent runs, the mapping is reloaded so pseudonyms remain consistent between sessions. 

If `include_logical_groups=True`, columns holding the logical group label of each address are appended in addition to the anonymised address columns.

In [None]:
anonymiser = DatasetAnonymiser(config_path="config.yaml", state_path="pseudonym_table.json")

# Anonymise the addresses
df_anon = anonymiser.apply_to_dataframe(df, ["srcip", "dstip"], include_logical_groups=True)

# Export the mapping table
anonymiser.save_state()

display(df_anon)
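
The following sketch illustrates one way a per-host pseudonym could be folded into a group's output range. It is an assumption-laden illustration, not the reference implementation: the use of `uuid.uuid4()`, the modulo reduction, and the helper names `pseudonym_for`, `range_size`, and `map_into_range` are all hypothetical.

```python
import ipaddress
import uuid

# Hypothetical in-memory pseudonym table: host address -> 128-bit integer.
pseudonyms = {}


def pseudonym_for(host):
    """Assign a random 128-bit value on first observation, then reuse it."""
    if host not in pseudonyms:
        pseudonyms[host] = uuid.uuid4().int  # assumption: random (version 4) UUID
    return pseudonyms[host]


def range_size(start, end):
    """Number of addresses in the inclusive range [start, end]."""
    return int(ipaddress.ip_address(end)) - int(ipaddress.ip_address(start)) + 1


def map_into_range(pseudonym, start, end):
    """Reduce the pseudonym modulo the range size and offset it from `start`."""
    offset = pseudonym % range_size(start, end)
    return str(ipaddress.ip_address(int(ipaddress.ip_address(start)) + offset))


anon = map_into_range(pseudonym_for("149.171.126.10"), "10.0.0.0", "10.0.0.255")
print(anon)  # some address between 10.0.0.0 and 10.0.0.255
```

A plain modulo reduction can map two hosts to the same output address, so a real scheme would need some collision handling; this sketch omits it.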

## 5. Examine the Mapping Table

You can inspect the saved pseudonym table (`pseudonym_table.json`) to view the persistent host-to-pseudonym mapping.

This enables consistent pseudonymisation across multiple anonymisation runs.

In [11]:
display(pd.read_json("pseudonym_table.json").head())

| | host | pseudonym |
|---|---|---|
| 0 | 149.171.126.10 | 1.061744e+38 |
| 1 | 192.168.1.15 | 3.078779e+38 |
| 2 | 175.45.176.5 | 1.35537e+38 |
| 3 | 59.166.0.22 | 2.2103050000000002e+38 |
| 4 | 10.40.170.2 | 2.9780820000000002e+38 |
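
The pseudonyms above are displayed in floating-point notation, and a 64-bit float cannot represent every 128-bit integer exactly. If exact values matter, the standard `json` module keeps arbitrarily large integers intact. The sketch below demonstrates this on a hypothetical sample file; the record layout (`host` and `pseudonym` keys), the file name, and the sample values are assumptions, and the real `pseudonym_table.json` may be structured differently.

```python
import json
import os
import tempfile

# Hypothetical record layout mirroring the columns shown above.
sample = [
    {"host": "198.51.100.7", "pseudonym": 106174400000000000000000000000000000001},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_table.json")
    with open(path, "w") as fh:
        json.dump(sample, fh)
    with open(path) as fh:
        loaded = json.load(fh)  # integers are parsed as exact Python ints

exact = loaded[0]["pseudonym"]
print(exact == sample[0]["pseudonym"])  # True: the round trip is lossless
print(int(float(exact)) == exact)       # False: a float drops low-order bits
```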


## 6. Adding Group Information to an Existing Anonymised Trace

If you already have an anonymised dataset that **does not include group columns**, you can enrich it by calling `add_groups(df)`.
This appends group information for every anonymised address that belongs to a group defined in the configuration file.

In [12]:
# Suppose we have an anonymised trace without group columns
df_minimal = df_anon[["srcip_anon", "dstip_anon"]].copy()
print("Anonymised dataset without groups:")
display(df_minimal)

# Re-add logical group columns using add_groups()
df_with_groups = anonymiser.add_groups(df_minimal)

print("Dataset after re-adding group information:")
display(df_with_groups)

Anonymised dataset without groups:


| | srcip_anon | dstip_anon |
|---|---|---|
| 0 | 10.0.0.35 | 10.0.1.87 |
| 1 | 10.0.1.15 | 10.0.0.189 |
| 2 | 10.0.2.9 | 10.0.0.136 |
| 3 | 10.0.0.130 | 10.0.1.187 |


Dataset after re-adding group information:


| | srcip_anon | dstip_anon | srcip_anon_group | dstip_anon_group |
|---|---|---|---|---|
| 0 | 10.0.0.35 | 10.0.1.87 | servers | users |
| 1 | 10.0.1.15 | 10.0.0.189 | users | servers |
| 2 | 10.0.2.9 | 10.0.0.136 | external | servers |
| 3 | 10.0.0.130 | 10.0.1.187 | servers | users |
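
Because each group in the example configuration has its own output range, a group label can in principle be recovered from an anonymised address alone. The sketch below shows that idea; it assumes the lookup is range-based, which may not be how `add_groups()` is actually implemented, and `OUTPUT_RANGES` and `group_of_anonymised` are hypothetical names.

```python
import ipaddress

# Hypothetical copy of the output ranges from the example config.yaml.
OUTPUT_RANGES = {
    "servers": ("10.0.0.0", "10.0.0.255"),
    "users": ("10.0.1.0", "10.0.1.255"),
    "external": ("10.0.2.0", "10.0.2.255"),
}


def group_of_anonymised(address, ranges=OUTPUT_RANGES):
    """Return the group whose inclusive output range contains `address`, or None."""
    ip = int(ipaddress.ip_address(address))
    for name, (start, end) in ranges.items():
        low = int(ipaddress.ip_address(start))
        high = int(ipaddress.ip_address(end))
        if low <= ip <= high:
            return name
    return None


print(group_of_anonymised("10.0.0.35"))  # servers
print(group_of_anonymised("10.0.1.87"))  # users
print(group_of_anonymised("10.0.2.9"))   # external
```

These results agree with the group columns in the output above. The approach relies on the output ranges not overlapping.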


## 7. Summary

- **Logical roles** are defined by prefix groups in YAML.
- Each host is assigned a 128-bit UUID pseudonym on first observation.
- Pseudonyms are mapped into group-specific address ranges.
- The mapping is persisted so that pseudonyms remain consistent across runs.
- `add_groups()` can be used to restore logical group context for anonymised data after the fact.