In [1]:
from src.flowmia import FlowMIA

## FlowMIA: Evaluating Membership Inference Attacks on Generative Models of Network Flow Data

Two datasets were used: CIDDS-001 and TON-IoT, both of which contain network flow data. CIDDS-001 was split into a training set, stored at datasets/real/cidds_train.csv, and a test set, stored at datasets/real/cidds_test.csv. The TON-IoT dataset is at datasets/reference/ton.csv.

Three generative models for tabular data were trained on cidds_train: CTGAN, NetShare, and Tabula. The synthetic datasets generated by each model are in the datasets/synthetic folder.

A membership inference attack (MIA) is carried out against these models: given the synthetic data, the goal is to identify which samples were used for training.

The attacker has two pieces of information: the synthetic dataset generated by the model and a reference dataset from the same domain (that is, the attacker knows the data domain).

The test dataset, cidds_test, is used only for the utility analysis.

Definitions:

- **Members**: cidds_train.csv: data used to train the models

- **Non-members**: ton.csv: reference data from the same domain

- **Synthetic**: ctgan.csv, netshare.csv, tabula.csv: synthetic data generated by each model

- **Test**: cidds_test.csv: test data sampled from the same distribution as the members, used for the utility evaluation


The default configuration is shown below. For each attacked model, synth_path and save_path are adjusted.

In [2]:
import pandas as pd

df = pd.read_csv('datasets/real/cidds_train.csv')
df.head()

Unnamed: 0,srcip,dstip,srcport,dstport,proto,ts,td,pkt,byt,label
0,3232291855,3232261125,48888,445,TCP,1489536000000000.0,0.004,2,174.0,0
1,3232291855,3232261125,48888,445,TCP,1489536000000000.0,0.004,2,174.0,0
2,3232261125,3232291855,445,48888,TCP,1489536000000000.0,0.0,1,108.0,0
3,3232261125,3232291855,445,48888,TCP,1489536000000000.0,0.0,1,108.0,0
4,3232291856,3232261125,58844,445,TCP,1489536000000000.0,0.004,2,174.0,0
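
The `srcip` and `dstip` values in this preview look like IPv4 addresses stored as 32-bit integers, although the notebook does not state the encoding. Assuming that is the case, a minimal sketch using Python's standard `ipaddress` module converts them back to dotted notation (`int_to_ipv4` is a hypothetical helper, not part of FlowMIA):

```python
import ipaddress


def int_to_ipv4(value: int) -> str:
    """Convert a 32-bit integer to dotted-quad IPv4 notation.

    Assumption: the srcip/dstip columns store IPv4 addresses as integers.
    """
    return str(ipaddress.IPv4Address(value))


# Values taken from the first rows of cidds_train.csv shown above.
for raw in (3232291855, 3232261125):
    print(raw, "->", int_to_ipv4(raw))
```

Under this assumption the two addresses decode to 192.168.220.15 and 192.168.100.5.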


In [None]:
{
    'member_path': 'datasets/real/cidds_train.csv', # path to the members
    'non_member_path': 'datasets/reference/ton.csv', # path to the non-members
    'synth_path': 'datasets/synthetic/netshare.csv', # path to the synthetic data
    'test_path': 'datasets/real/cidds_test.csv', # path to the test set
    'categorical_cols': ['proto'], # categorical columns
    'numerical_cols': ['srcport', 'dstport', 'td', 'pkt', 'byt'], # numerical columns
    'ip_cols': ['srcip', 'dstip'], # IP address columns
    'label_col': 'label', # name of the label column
    'batch_size': 200, # number of samples per batch
    'num_epochs': 500, # number of epochs
    'fcheckpoint': 100, # checkpoint saving frequency, in epochs
    'save_path': 'results/netshare'    # folder where results are saved
}

{'member_path': 'datasets/real/cidds_train.csv',
 'non_member_path': 'datasets/reference/ton.csv',
 'synth_path': 'datasets/synthetic/netshare.csv',
 'test_path': 'datasets/real/cidds_test.csv',
 'categorical_cols': ['proto'],
 'numerical_cols': ['srcport', 'dstport', 'td', 'pkt', 'byt'],
 'ip_cols': ['srcip', 'dstip'],
 'label_col': 'label',
 'batch_size': 200,
 'num_epochs': 500,
 'fcheckpoint': 100,
 'save_path': 'results/netshare'}
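
Since only `synth_path` and `save_path` change between models, the per-model configurations can be derived from the default one instead of being retyped. The following is a minimal sketch; `BASE_CONFIG`, `make_config`, and the `results_root` parameter are hypothetical names introduced here, and the path layout is assumed from the values shown above.

```python
from copy import deepcopy

# Default settings copied from the configuration output above.
BASE_CONFIG = {
    'member_path': 'datasets/real/cidds_train.csv',
    'non_member_path': 'datasets/reference/ton.csv',
    'synth_path': 'datasets/synthetic/netshare.csv',
    'test_path': 'datasets/real/cidds_test.csv',
    'categorical_cols': ['proto'],
    'numerical_cols': ['srcport', 'dstport', 'td', 'pkt', 'byt'],
    'ip_cols': ['srcip', 'dstip'],
    'label_col': 'label',
    'batch_size': 200,
    'num_epochs': 500,
    'fcheckpoint': 100,
    'save_path': 'results/netshare',
}


def make_config(model_name: str, base: dict = BASE_CONFIG,
                results_root: str = 'results') -> dict:
    """Return a copy of the base config pointing at one model's files.

    Assumption: synthetic data lives at datasets/synthetic/<model>.csv
    and results go to <results_root>/<model>.
    """
    config = deepcopy(base)  # deep copy so the column lists are not shared
    config['synth_path'] = f'datasets/synthetic/{model_name}.csv'
    config['save_path'] = f'{results_root}/{model_name}'
    return config


configs = {name: make_config(name) for name in ('netshare', 'ctgan', 'tabula')}
print(configs['ctgan']['synth_path'], configs['ctgan']['save_path'])
```

Deriving every configuration from one base keeps the column lists and training settings identical across models, so differences in the results cannot come from a mistyped entry.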

### NetShare

In [None]:
config_netshare = {
    'member_path': 'datasets/real/cidds_train.csv', # path to the members
    'non_member_path': 'datasets/reference/ton.csv', # path to the non-members
    'synth_path': 'datasets/synthetic/netshare.csv', # path to the synthetic data
    'test_path': 'datasets/real/cidds_test.csv', # path to the test set
    'categorical_cols': ['proto'], # categorical columns
    'numerical_cols': ['srcport', 'dstport', 'td', 'pkt', 'byt'], # numerical columns
    'ip_cols': ['srcip', 'dstip'], # IP address columns
    'label_col': 'label', # name of the label column
    'batch_size': 200, # number of samples per batch
    'num_epochs': 10, # number of epochs
    'fcheckpoint': 5, # checkpoint saving frequency, in epochs
    'save_path': 'teste/netshare'    # folder where results are saved
}

In [28]:
flowmia_netshare = FlowMIA(config=config_netshare) # create a FlowMIA object

Run the MIA. A GAN is trained on the synthetic data generated by the target model. The GAN's preprocessor is fitted on the synthetic data plus the non-members, which reflects the attacker's knowledge: the synthetic samples and the problem domain.

During training, the generator produces fake samples while the discriminator tries to distinguish them from the synthetic training samples. The discriminator assigns each sample a score between 0 and 1: the closer to 1, the more the sample resembles the synthetic data; the closer to 0, the less it resembles them and the closer it is to random noise.

After training, the discriminator is used for membership inference. It is fed the members, the non-members, the synthetic samples themselves, and noisy samples, and its output scores are observed.

The function returns the training history (generator and discriminator losses) along with the MIA results.
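
Once the discriminator has scored members and non-members, the attack reduces to a binary decision on those scores. The sketch below shows one way to compute an AUC and pick a decision threshold from two score lists using only the standard library. It is an illustration, not FlowMIA's code: the notebook does not show how its `'optimal'` threshold is chosen, so Youden's J (TPR minus FPR) is used here as one common choice, and both function names are hypothetical.

```python
from typing import Sequence, Tuple


def auc_from_scores(members: Sequence[float],
                    non_members: Sequence[float]) -> float:
    """Probability that a random member outscores a random non-member.

    Ties count as 0.5. This pairwise loop is quadratic, so it only suits
    small examples; sklearn.metrics.roc_auc_score is the practical choice
    for arrays with tens of thousands of scores.
    """
    wins = 0.0
    for m in members:
        for n in non_members:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(members) * len(non_members))


def best_threshold(members: Sequence[float],
                   non_members: Sequence[float]) -> Tuple[float, float]:
    """Threshold maximising Youden's J, predicting 'member' if score >= t."""
    best_t, best_j = 0.0, -1.0
    for t in sorted(set(members) | set(non_members)):
        tpr = sum(s >= t for s in members) / len(members)
        fpr = sum(s >= t for s in non_members) / len(non_members)
        if tpr - fpr > best_j:
            best_t, best_j = t, tpr - fpr
    return best_t, best_j


# Toy discriminator scores, made up for illustration only.
member_scores = [0.9, 0.8, 0.7, 0.2]
non_member_scores = [0.6, 0.3, 0.1, 0.05]

print(auc_from_scores(member_scores, non_member_scores))
print(best_threshold(member_scores, non_member_scores))
```

An AUC near 0.5 would mean the discriminator's scores carry no membership signal, while values closer to 1 indicate that members are systematically scored higher than non-members.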

In [29]:
history, mia_results = flowmia_netshare.flowmiagan(plot=True) 

Starting FlowMIA privacy evaluation...
Training FlowMIA GAN...
Fitting the pre processors...
Preprocessors fitted. Starting GAN training...


Training GAN:  20%|██        | 100/500 [05:58<23:53,  3.58s/epoch, D_loss=0.5082, G_loss=2.9792]


✓ Model checkpoint 100 saved to results/netshare/checkpoints/checkpoint_epoch_100.pth


Training GAN:  40%|████      | 200/500 [11:56<17:41,  3.54s/epoch, D_loss=0.4915, G_loss=3.0834]


✓ Model checkpoint 200 saved to results/netshare/checkpoints/checkpoint_epoch_200.pth


Training GAN:  60%|██████    | 300/500 [17:56<12:10,  3.65s/epoch, D_loss=0.4911, G_loss=3.0767]


✓ Model checkpoint 300 saved to results/netshare/checkpoints/checkpoint_epoch_300.pth


Training GAN:  80%|████████  | 400/500 [23:58<05:56,  3.57s/epoch, D_loss=0.4901, G_loss=3.0907]


✓ Model checkpoint 400 saved to results/netshare/checkpoints/checkpoint_epoch_400.pth


Training GAN: 100%|██████████| 500/500 [29:57<00:00,  3.60s/epoch, D_loss=0.4831, G_loss=3.1341]



✓ Model checkpoint 500 saved to results/netshare/checkpoints/checkpoint_epoch_500.pth
FlowMIA GAN inference results: {'score_members': array([0.02009393, 0.02009393, 0.8932337 , ..., 0.8998758 , 0.89889723,
       0.89889723], shape=(65000,), dtype=float32), 'score_non_members': array([1.        , 0.00656263, 0.00639131, ..., 0.8784728 , 0.8785762 ,
       0.8792584 ], shape=(64998,), dtype=float32), 'score_random': array([0.02772027, 0.01902272, 0.01271395, ..., 0.02131341, 0.02165896,
       0.00738676], shape=(64998,), dtype=float32), 'score_synthetic': array([0.9514512 , 0.9103348 , 0.9049635 , ..., 0.89275396, 0.12277587,
       0.8994425 ], shape=(15911,), dtype=float32), 'auc': 0.7790276142224494, 'accuracy': 0.7611270942629886, 'precision': np.float64(0.6767907175368976), 'recall': np.float64(0.9996615384615385), 'threshold': np.float64(0.014999092556536198), 'threshold_method': 'optimal', 'mean_score_members': np.float32(0.61675704), 'mean_score_non_members': np.float32(0.242
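
As a consistency check on the NetShare output above, the confusion matrix can be recovered from the reported precision and recall together with the array shapes (65000 members, 64998 non-members). This sketch assumes that members are the positive class; `confusion_from_metrics` is a hypothetical helper, not part of FlowMIA.

```python
def confusion_from_metrics(precision: float, recall: float,
                           n_members: int, n_non_members: int) -> dict:
    """Recover confusion-matrix counts from precision, recall and class sizes.

    Assumption: 'member' is the positive class.
    """
    tp = round(recall * n_members)      # members correctly flagged
    fn = n_members - tp                 # members missed
    fp = round(tp / precision) - tp     # non-members wrongly flagged
    tn = n_non_members - fp             # non-members correctly rejected
    return {'tp': tp, 'fp': fp, 'fn': fn, 'tn': tn}


# Values copied from the NetShare results printed above.
counts = confusion_from_metrics(
    precision=0.6767907175368976,
    recall=0.9996615384615385,
    n_members=65000,
    n_non_members=64998,
)
accuracy = (counts['tp'] + counts['tn']) / (65000 + 64998)
false_positive_rate = counts['fp'] / 64998

print(counts)
print(round(accuracy, 4), round(false_positive_rate, 4))
```

Under that assumption the reconstructed counts reproduce the reported accuracy of about 0.761. They also show that, at the very low threshold of roughly 0.015, nearly every member is flagged but so are about 48% of the non-members, which is worth keeping in mind when reading the high recall.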

In [30]:
dcr_results = flowmia_netshare.compute_dcr(n_sample=5000)

Starting DCR evaluation...




DCR evaluation results: {'score': np.float64(0.6632), 'synthetic_data_percentages': {'closer_to_training': np.float64(0.6684), 'closer_to_holdout': np.float64(0.3316)}}


In [31]:
flowmia_netshare.dcr_results

{'score': np.float64(0.6632),
 'synthetic_data_percentages': {'closer_to_training': np.float64(0.6684),
  'closer_to_holdout': np.float64(0.3316)}}
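
The reported numbers are consistent with `score = min(1, 2 * closer_to_holdout)`, since 2 × 0.3316 = 0.6632. That formula and the output keys resemble the DCR overfitting-protection metric in SDMetrics, but the notebook does not show how `compute_dcr` is implemented, so the following is only a sketch of the idea. It assumes purely numeric records and Euclidean distance, and counts ties as closer to the holdout; all function names are hypothetical.

```python
import math
from typing import Sequence

Record = Sequence[float]


def nearest_distance(point: Record, records: Sequence[Record]) -> float:
    """Distance from a point to its closest record (DCR)."""
    return min(math.dist(point, r) for r in records)


def dcr_overfitting(synthetic: Sequence[Record],
                    training: Sequence[Record],
                    holdout: Sequence[Record]) -> dict:
    """Share of synthetic rows closer to training vs. holdout data.

    Assumption: score = min(1, 2 * fraction closer to holdout), so a
    50/50 split (no memorisation signal) yields the maximum score of 1.
    """
    closer_train = sum(
        nearest_distance(s, training) < nearest_distance(s, holdout)
        for s in synthetic
    )
    frac_train = closer_train / len(synthetic)
    frac_holdout = 1 - frac_train
    return {
        'score': min(1.0, 2 * frac_holdout),
        'closer_to_training': frac_train,
        'closer_to_holdout': frac_holdout,
    }


# Toy 2-D records, made up for illustration only.
training = [(0.0, 0.0), (1.0, 1.0)]
holdout = [(5.0, 5.0), (6.0, 6.0)]
synthetic = [(0.1, 0.0), (0.9, 1.0), (1.2, 1.0), (5.1, 5.0)]

print(dcr_overfitting(synthetic, training, holdout))
```

Read this way, a score well below 1 means the synthetic rows sit disproportionately close to the training data, which suggests that the generator reproduces training records rather than the general distribution.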

In [32]:
fidelity = flowmia_netshare.evaluate_fidelity(plot=True)

In [9]:
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

classifiers = [MLPClassifier(hidden_layer_sizes=(20,10), max_iter=100, random_state=42),
                    DecisionTreeClassifier(max_depth=12, min_samples_leaf=50),
                    KNeighborsClassifier(n_neighbors=5),
                    RandomForestClassifier(max_depth=12, min_samples_leaf=50, n_estimators=100, class_weight="balanced")]

In [None]:
utility_dict = flowmia_netshare.evaluate_utility(classifiers=classifiers, plot=True)

Utility for classifier: MLPClassifier
Running RTR evaluation...
Running TSTR evaluation...
Utility for classifier: DecisionTreeClassifier
Running RTR evaluation...
Running TSTR evaluation...
Utility for classifier: KNeighborsClassifier
Running RTR evaluation...
Running TSTR evaluation...
Utility for classifier: RandomForestClassifier
Running RTR evaluation...
Running TSTR evaluation...
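
The log above mentions two evaluations per classifier. Assuming RTR means "train on real, test on real" and TSTR means "train on synthetic, test on real" (the notebook does not spell these out), the comparison can be sketched as follows. `rtr_and_tstr` is a hypothetical helper that accepts any factory returning an object with `fit` and `predict`, and `MajorityClassifier` is a trivial stand-in used only to keep the example dependency-free.

```python
from typing import Callable, Dict, Sequence, Tuple

Dataset = Tuple[Sequence, Sequence]  # (features, labels)


def accuracy(y_true: Sequence, y_pred: Sequence) -> float:
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def rtr_and_tstr(make_clf: Callable[[], object], real_train: Dataset,
                 synth_train: Dataset, real_test: Dataset) -> Dict[str, float]:
    """Train one fresh classifier per training set; test both on real data."""
    X_test, y_test = real_test
    results = {}
    for name, (X_train, y_train) in (('RTR', real_train),
                                     ('TSTR', synth_train)):
        clf = make_clf()  # fresh, unfitted model for each setting
        clf.fit(X_train, y_train)
        results[name] = accuracy(y_test, clf.predict(X_test))
    return results


class MajorityClassifier:
    """Stand-in model: always predicts the most frequent training label."""

    def fit(self, X, y):
        labels = list(y)
        self.label_ = max(set(labels), key=labels.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


# Toy data, made up for illustration only.
real_train = ([[0], [1], [2]], [0, 0, 1])
synth_train = ([[0], [1], [2]], [1, 1, 0])
real_test = ([[0], [1], [2], [3]], [0, 0, 0, 1])

print(rtr_and_tstr(MajorityClassifier, real_train, synth_train, real_test))
```

With scikit-learn, the factory could be something like `lambda: DecisionTreeClassifier(max_depth=12, min_samples_leaf=50)`. The closer the TSTR score is to the RTR score, the more useful the synthetic data is as a replacement for the real training set.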


### CTGAN

In [None]:
config_ctgan = {
    'member_path': 'datasets/real/cidds_train.csv', # path to the members
    'non_member_path': 'datasets/reference/ton.csv', # path to the non-members
    'synth_path': 'datasets/synthetic/ctgan.csv', # path to the synthetic data
    'test_path': 'datasets/real/cidds_test.csv', # path to the test set
    'categorical_cols': ['proto'], # categorical columns
    'numerical_cols': ['srcport', 'dstport', 'td', 'pkt', 'byt'], # numerical columns
    'ip_cols': ['srcip', 'dstip'], # IP address columns
    'label_col': 'label', # name of the label column
    'batch_size': 200, # number of samples per batch
    'num_epochs': 10, # number of epochs
    'fcheckpoint': 5, # checkpoint saving frequency, in epochs
    'save_path': 'teste/ctgan'    # folder where results are saved
}

In [None]:
flowmia_ctgan = FlowMIA(config=config_ctgan) # create a FlowMIA object

In [15]:
history, mia_results = flowmia_ctgan.flowmiagan(plot=True) 

Starting FlowMIA privacy evaluation...
Training FlowMIA GAN...
Fitting the pre processors...
Preprocessors fitted. Starting GAN training...


Training GAN:  20%|██        | 100/500 [04:40<18:42,  2.81s/epoch, D_loss=0.5069, G_loss=3.0091]


✓ Model checkpoint 100 saved to results/netshare/checkpoints/checkpoint_epoch_100.pth


Training GAN:  40%|████      | 200/500 [09:20<13:54,  2.78s/epoch, D_loss=0.5006, G_loss=3.0434]


✓ Model checkpoint 200 saved to results/netshare/checkpoints/checkpoint_epoch_200.pth


Training GAN:  60%|██████    | 300/500 [13:59<10:00,  3.00s/epoch, D_loss=0.4970, G_loss=3.0613]


✓ Model checkpoint 300 saved to results/netshare/checkpoints/checkpoint_epoch_300.pth


Training GAN:  80%|████████  | 400/500 [18:27<04:27,  2.68s/epoch, D_loss=0.4886, G_loss=3.0758]


✓ Model checkpoint 400 saved to results/netshare/checkpoints/checkpoint_epoch_400.pth


Training GAN: 100%|██████████| 500/500 [22:59<00:00,  2.76s/epoch, D_loss=0.4863, G_loss=3.1227]



✓ Model checkpoint 500 saved to results/netshare/checkpoints/checkpoint_epoch_500.pth
FlowMIA GAN inference results: {'score_members': array([0.01922364, 0.01922364, 0.90120333, ..., 0.89372754, 0.8933396 ,
       0.8933396 ], shape=(65000,), dtype=float32), 'score_non_members': array([1.        , 0.03211676, 0.0316948 , ..., 0.8939851 , 0.8940108 ,
       0.89418024], shape=(64998,), dtype=float32), 'score_random': array([0.03231132, 0.0286195 , 0.0247515 , ..., 0.02201831, 0.0200127 ,
       0.01375713], shape=(64998,), dtype=float32), 'score_synthetic': array([0.9312855 , 0.90149903, 0.89504254, ..., 0.8944685 , 0.1173097 ,
       0.90036887], shape=(15911,), dtype=float32), 'auc': 0.6390984594792266, 'accuracy': 0.733565131771258, 'precision': np.float64(0.7471511362896399), 'recall': np.float64(0.7060923076923077), 'threshold': np.float64(0.7706919312477112), 'threshold_method': 'optimal', 'mean_score_members': np.float32(0.63484806), 'mean_score_non_members': np.float32(0.268293

In [16]:
dcr_results = flowmia_ctgan.compute_dcr(n_sample=5000)

Starting DCR evaluation...




DCR evaluation results: {'score': np.float64(0.6632), 'synthetic_data_percentages': {'closer_to_training': np.float64(0.6684), 'closer_to_holdout': np.float64(0.3316)}}


In [17]:
fidelity = flowmia_ctgan.evaluate_fidelity(plot=True)

In [18]:
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

classifiers = [MLPClassifier(hidden_layer_sizes=(20,10), max_iter=100, random_state=42),
                    DecisionTreeClassifier(max_depth=12, min_samples_leaf=50),
                    KNeighborsClassifier(n_neighbors=5),
                    RandomForestClassifier(max_depth=12, min_samples_leaf=50, n_estimators=100, class_weight="balanced")]


utility_dict = flowmia_ctgan.evaluate_utility(classifiers=classifiers, plot=True)

Utility for classifier: MLPClassifier
Running RTR evaluation...
Running TSTR evaluation...
Utility for classifier: DecisionTreeClassifier
Running RTR evaluation...
Running TSTR evaluation...
Utility for classifier: KNeighborsClassifier
Running RTR evaluation...
Running TSTR evaluation...
Utility for classifier: RandomForestClassifier
Running RTR evaluation...
Running TSTR evaluation...


### Tabula

In [None]:
config_tabula = {
    'member_path': 'datasets/real/cidds_train.csv', # path to the members
    'non_member_path': 'datasets/reference/ton.csv', # path to the non-members
    'synth_path': 'datasets/synthetic/tabula.csv', # path to the synthetic data
    'test_path': 'datasets/real/cidds_test.csv', # path to the test set
    'categorical_cols': ['proto'], # categorical columns
    'numerical_cols': ['srcport', 'dstport', 'td', 'pkt', 'byt'], # numerical columns
    'ip_cols': ['srcip', 'dstip'], # IP address columns
    'label_col': 'label', # name of the label column
    'batch_size': 200, # number of samples per batch
    'num_epochs': 10, # number of epochs
    'fcheckpoint': 5, # checkpoint saving frequency, in epochs
    'save_path': 'teste/tabula'    # folder where results are saved
}

In [20]:
flowmia_tabula = FlowMIA(config=config_tabula) # create a FlowMIA object

In [21]:
history, mia_results = flowmia_tabula.flowmiagan(plot=True) 

Starting FlowMIA privacy evaluation...
Training FlowMIA GAN...
Fitting the pre processors...
Preprocessors fitted. Starting GAN training...


Training GAN:  20%|██        | 100/500 [04:14<16:41,  2.50s/epoch, D_loss=0.6246, G_loss=2.5514]


✓ Model checkpoint 100 saved to results/tabula/checkpoints/checkpoint_epoch_100.pth


Training GAN:  40%|████      | 200/500 [08:36<13:05,  2.62s/epoch, D_loss=0.5877, G_loss=2.6624]


✓ Model checkpoint 200 saved to results/tabula/checkpoints/checkpoint_epoch_200.pth


Training GAN:  60%|██████    | 300/500 [12:57<08:40,  2.60s/epoch, D_loss=0.5841, G_loss=2.6992]


✓ Model checkpoint 300 saved to results/tabula/checkpoints/checkpoint_epoch_300.pth


Training GAN:  80%|████████  | 400/500 [17:21<04:22,  2.63s/epoch, D_loss=0.5868, G_loss=2.7242]


✓ Model checkpoint 400 saved to results/tabula/checkpoints/checkpoint_epoch_400.pth


Training GAN: 100%|██████████| 500/500 [21:35<00:00,  2.59s/epoch, D_loss=0.5818, G_loss=2.7499]



✓ Model checkpoint 500 saved to results/tabula/checkpoints/checkpoint_epoch_500.pth
FlowMIA GAN inference results: {'score_members': array([0.23251016, 0.23251016, 0.8885667 , ..., 0.90419865, 0.9068586 ,
       0.9068586 ], shape=(65000,), dtype=float32), 'score_non_members': array([1.        , 0.02758368, 0.04271372, ..., 0.84895486, 0.8486485 ,
       0.8466144 ], shape=(64998,), dtype=float32), 'score_random': array([0.04931453, 0.03791369, 0.02430568, ..., 0.0411238 , 0.04512173,
       0.02845111], shape=(64998,), dtype=float32), 'score_synthetic': array([0.03330705, 0.04232586, 0.8904092 , ..., 0.8854648 , 0.886783  ,
       0.22686757], shape=(15000,), dtype=float32), 'auc': 0.74063120131507, 'accuracy': 0.781765873321128, 'precision': np.float64(0.7260943633804904), 'recall': np.float64(0.9048923076923077), 'threshold': np.float64(0.03766358271241188), 'threshold_method': 'optimal', 'mean_score_members': np.float32(0.6845134), 'mean_score_non_members': np.float32(0.31330922),

In [22]:
dcr_results = flowmia_tabula.compute_dcr(n_sample=5000)

Starting DCR evaluation...




DCR evaluation results: {'score': np.float64(0.010399999999999965), 'synthetic_data_percentages': {'closer_to_training': np.float64(0.9948), 'closer_to_holdout': np.float64(0.005199999999999982)}}


In [23]:
fidelity = flowmia_tabula.evaluate_fidelity(plot=True)

In [24]:
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

classifiers = [MLPClassifier(hidden_layer_sizes=(20,10), max_iter=100, random_state=42),
                    DecisionTreeClassifier(max_depth=12, min_samples_leaf=50),
                    KNeighborsClassifier(n_neighbors=5),
                    RandomForestClassifier(max_depth=12, min_samples_leaf=50, n_estimators=100, class_weight="balanced")]


utility_dict = flowmia_tabula.evaluate_utility(classifiers=classifiers, plot=True)

Utility for classifier: MLPClassifier
Running RTR evaluation...
Running TSTR evaluation...
Utility for classifier: DecisionTreeClassifier
Running RTR evaluation...
Running TSTR evaluation...
Utility for classifier: KNeighborsClassifier
Running RTR evaluation...
Running TSTR evaluation...
Utility for classifier: RandomForestClassifier
Running RTR evaluation...
Running TSTR evaluation...
