# 03 - Federated Learning

## Defines

Define the available types of federated learning.

 - 'STRATIFIED': Stratified sampling of the data. The data is split into a number of shards, and each shard is assigned to a client. The data is split in a stratified manner, meaning that the distribution of the labels is approximately the same in each shard.
 - 'MISSING_1_ATTACK' - Each client is assigned a shard of data, each shard is missing one of the attack labels. Other clients in the network are exposed to the attack label, but the specific client is not. This demonstrates the ability of federated learning to protect against unknown attacks.
 - '1_ATTACK_ONLY' - Each client is assigned a shard of data, each shard contains only one of the attack labels.
 - 'HALF_BENIGN_ONLY' - Half of the clients are exposed to Benign data only, the other half are exposed to all data.


In [None]:
### THIS SECTION NEEDS TO BE SET TO DETERMINE WHICH CONFIGURATION METHOD TO UTILISE

SPLIT_AVAILABLE_METHODS = ['STRATIFIED','MISSING_1_ATTACK', '1_ATTACK_ONLY', 'HALF_BENIGN_ONLY' ]
METHOD = 'MISSING_1_ATTACK'
NUM_OF_STRATIFIED_CLIENTS = 10  # only applies to stratified method
NUM_OF_ROUNDS = 10              # Number of FL rounds


The above test method in conjunction with the below classification selection will determine the number of clients.

EG: 
`STRATIFIED` with:
 - `ALL TYPES` - Results in `NUM_OF_STRATIFIED_CLIENTS` clients. Each client will have a stratified sample of the data.

`MISSING_1_ATTACK` with:
 - `individual_classifier` - Results in 33 clients. Each client will have benign traffic and 32 attack labels.
 - `group_classifier` - Results in 7 clients. Each client will have benign traffic and 6 attack groups.
 - `binary_classifier` - Results in 10 clients. Five clients will have benign traffic only and the other will have Benign and malicious attack labels.

`1_ATTACK_ONLY` with:
 - `individual_classifier` - Results in 33 clients. Each client will have benign traffic and 1 attack label.
 - `group_classifier` - Results in 7 clients. Each client will have benign traffic and 1 attack groups.
 - `binary_classifier` - Results in 10 clients. Five clients will have benign traffic only and the other will have Benign and malicious attack labels.

`HALF_BENIGN_ONLY` with:
 - `individual_classifier` - Results in 10 clients. Five clients will have benign traffic only and the other will have Benign and 33 malicious attack labels.
 - `group_classifier` - Results in 10 clients. Five clients will have benign traffic only and the other will have Benign and 7 malicious attack groups.
 - `binary_classifier` - Results in 10 clients. Five clients will have benign traffic only and the other will have Benign and Malicious attack labels.

In [44]:
individual_classifier = True
group_classifier = False
binary_classifier = False


Include the defines for the dataframe columns and the attack labels and their mappings

In [45]:
from enum import Enum
from includes import *

##  Imports

In [46]:
%%capture
%pip install flwr[simulation] torch torchvision matplotlib sklearn openml

In [47]:
import os
import pandas as pd
import numpy as np
import flwr as fl
from tqdm import tqdm
import warnings
#warnings.filterwarnings('ignore')

import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms as transforms
from flwr.common import Metrics
from torch.utils.data import DataLoader, random_split


In [48]:
print("flwr", fl.__version__)
print("numpy", np.__version__)
print("torch", torch.__version__)

DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f"Training on {DEVICE}")

flwr 1.4.0
numpy 1.23.5
torch 2.0.1+cpu
Training on cpu


## Load the Dataset

In [49]:
DATASET_DIRECTORY = '../datasets/CICIoT2023/'

## Training data

Either read the training pickle file if it exists, or process the dataset from scratch.

In [50]:
# Check to see if the file 'training_data.pkl' exists in the directory. If it does, load it. If not, print an error.
if os.path.isfile('training_data.pkl'):
    print("File exists, loading data...")
    train_df = pd.read_pickle('training_data.pkl')
    print("Training data loaded from pickle file.")

else:
    df_sets = [k for k in os.listdir(DATASET_DIRECTORY) if k.endswith('.csv')]
    df_sets.sort()
    training_sets = df_sets[:int(len(df_sets)*.8)]
    test_sets = df_sets[int(len(df_sets)*.8):]

    # Print the number of files in each set
    print('Training sets: {}'.format(len(training_sets)))
    print('Test sets: {}'.format(len(test_sets)))

    ######################
    # HACK TEMP CODE
    ######################
    # Set training_sets to the last entry of training_sets
    training_sets = training_sets[-5:]
    print(f"HACK TO REPLICATE ORIGINAL AUTHORS CODE WITH ONE FILE TRAIN - {training_sets}")
    ######################
    # HACK END TEMP CODE
    ######################

    # Concatenate all training sets into one dataframe
    dfs = []
    print("Reading training data...")
    for train_set in tqdm(training_sets):
        df_new = pd.read_csv(DATASET_DIRECTORY + train_set)
        dfs.append(df_new)
    train_df = pd.concat(dfs, ignore_index=True)

    # Map y column to the dict_34_classes values - The pickle file already has this done.
    train_df['label'] = train_df['label'].map(dict_34_classes)

    # Save the output to a pickle file
    print("Writing training data to pickle file...")
    train_df.to_pickle('training_data.pkl')

print("Training data size: {}".format(train_df.shape))


File exists, loading data...
Training data loaded from pickle file.
Training data size: (1425287, 47)


In [51]:
train_df

Unnamed: 0,flow_duration,Header_Length,Protocol Type,Duration,Rate,Srate,Drate,fin_flag_number,syn_flag_number,rst_flag_number,...,Std,Tot size,IAT,Number,Magnitue,Radius,Covariance,Variance,Weight,label
0,0.000000,182.00,17.00,64.00,22.362751,22.362751,0.0,0.0,0.0,0.0,...,0.000000,182.00,8.300743e+07,9.5,19.078784,0.000000,0.000000,0.00,141.55,13
1,2.437778,129.60,6.00,64.00,0.978382,0.978382,0.0,0.0,1.0,0.0,...,0.000000,54.00,8.336252e+07,9.5,10.392305,0.000000,0.000000,0.00,141.55,7
2,0.000000,54.00,6.00,64.00,0.000000,0.000000,0.0,1.0,0.0,1.0,...,0.000000,54.00,8.334496e+07,9.5,10.392305,0.000000,0.000000,0.00,141.55,1
3,0.453670,39173.00,17.00,64.00,4967.422026,4967.422026,0.0,0.0,0.0,0.0,...,0.000000,50.00,8.310643e+07,9.5,10.000000,0.000000,0.000000,0.00,141.55,4
4,0.000000,54.00,6.00,64.00,166.930829,166.930829,0.0,0.0,0.0,0.0,...,0.000000,54.00,8.331469e+07,9.5,10.392305,0.000000,0.000000,0.00,141.55,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1425282,0.000000,54.00,6.00,64.00,19.582485,19.582485,0.0,0.0,0.0,0.0,...,0.000000,54.00,8.331443e+07,9.5,10.392305,0.000000,0.000000,0.00,141.55,2
1425283,0.037146,78.22,36.21,63.18,24.542045,24.542045,0.0,0.0,0.0,0.0,...,110.233513,453.78,8.358187e+07,9.5,30.338676,154.660856,23401.960226,0.53,141.55,18
1425284,3.293075,1025996.92,17.00,64.00,572.160392,572.160392,0.0,0.0,0.0,0.0,...,0.000000,554.00,8.378910e+07,9.5,33.286634,0.000000,0.000000,0.00,141.55,19
1425285,0.047343,35223.00,17.00,64.00,15083.107398,15083.107398,0.0,0.0,0.0,0.0,...,0.000000,50.00,8.309852e+07,9.5,10.000000,0.000000,0.000000,0.00,141.55,4


---
## Test Data
Concat the test data into a single dataframe

In [52]:
# Check to see if the file 'test_data.pkl' exists in the directory. If it does, load it. If not, print an error.
testing_data_pickle_file = 'testing_data.pkl'

if os.path.isfile(testing_data_pickle_file):
    print(f"File {testing_data_pickle_file} exists, loading data...")
    test_df = pd.read_pickle(testing_data_pickle_file)
    print("Test data loaded from pickle file.")

else:
    print(f"File {testing_data_pickle_file} does not exist, constructing data...")

    df_sets = [k for k in os.listdir(DATASET_DIRECTORY) if k.endswith('.csv')]
    df_sets.sort()
    training_sets = df_sets[:int(len(df_sets)*.8)]
    test_sets = df_sets[int(len(df_sets)*.8):]

    ############################################
    ############################################
    # HACK - Make things quicker for now
    ############################################
    ############################################

    test_sets = df_sets[int(len(df_sets)*.95):]
    
    # Set training_sets to the last entry of training_sets
    test_sets = test_sets[-5:]
    
    ############################################
    ############################################
    # END HACK 
    ############################################
    ############################################

    # Print the number of files in each set
    print('Test sets: {}'.format(len(test_sets)))
    
    # Concatenate all testing sets into one dataframe
    dfs = []
    print("Reading test data...")
    for test_set in tqdm(test_sets):
        df_new = pd.read_csv(DATASET_DIRECTORY + test_set)
        dfs.append(df_new)
    test_df = pd.concat(dfs, ignore_index=True)

    # Map y column to the dict_34_classes values - The pickle file already has this done.
    test_df['label'] = test_df['label'].map(dict_34_classes)

    # Save the output to a pickle file
    print(f"Writing test data to pickle file {testing_data_pickle_file}...")
    test_df.to_pickle(testing_data_pickle_file)

print("Testing data size: {}".format(test_df.shape))

File testing_data.pkl exists, loading data...
Test data loaded from pickle file.
Testing data size: (1803497, 47)


---
# Scale the test and train data

### Scale the training data input features

In [53]:
scaler = StandardScaler()
train_df[X_columns] = scaler.fit_transform(train_df[X_columns])

### Scale the testing data input features

In [54]:
test_df[X_columns] = scaler.fit_transform(test_df[X_columns])

---
# Define the classification problem - (2 classes, 8 classes or 34 classes)
Change the following cell to select the classification type

If the METHOD == STRATIFIED, then we can use any classifier
If the METHOD == ATTACK_GROUP then we must use Group Classifier.

In [55]:

class_size_map = {2: "Binary", 8: "Group", 34: "Individual"}

if group_classifier:
    print("Group 8 Class Classifier... - Adjusting labels in test and train dataframes")
    # Map y column to the dict_7_classes values
    test_df['label'] = test_df['label'].map(dict_8_classes)
    train_df['label'] = train_df['label'].map(dict_8_classes)
    class_size = "8"
        
elif binary_classifier:
    print("Binary 2 Class Classifier... - Adjusting labels in test and train dataframes")
    # Map y column to the dict_2_classes values
    test_df['label'] = test_df['label'].map(dict_2_classes)
    train_df['label'] = train_df['label'].map(dict_2_classes)
    class_size = "2"

else:
    print ("Individual 34 Class classifier... - No adjustments to labels in test and train dataframes")
    class_size = "34"

Individual 34 Class classifier... - No adjustments to labels in test and train dataframes


---
# Split the Training Data into partitions for the Federated Learning clients depending on the test required
As a reminder:

`STRATIFIED` with:
 - `ALL TYPES` - Results in `NUM_OF_STRATIFIED_CLIENTS` clients. Each client will have a stratified sample of the data.

`MISSING_1_ATTACK` with:
 - `individual_classifier` - Results in 33 clients. Each client will have benign traffic and 32 attack labels.
 - `group_classifier` - Results in 7 clients. Each client will have benign traffic and 6 attack groups.
 - `binary_classifier` - Results in 10 clients. Five clients will have benign traffic only and the other will have Benign and malicious attack labels.

`1_ATTACK_ONLY` with:
 - `individual_classifier` - Results in 33 clients. Each client will have benign traffic and 1 attack label.
 - `group_classifier` - Results in 7 clients. Each client will have benign traffic and 1 attack groups.
 - `binary_classifier` - Results in 10 clients. Five clients will have benign traffic only and the other will have Benign and malicious attack labels.

`HALF_BENIGN_ONLY` with:
 - `individual_classifier` - Results in 10 clients. Five clients will have benign traffic only and the other will have Benign and 33 malicious attack labels.
 - `group_classifier` - Results in 10 clients. Five clients will have benign traffic only and the other will have Benign and 7 malicious attack groups.
 - `binary_classifier` - Results in 10 clients. Five clients will have benign traffic only and the other will have Benign and Malicious attack labels.

In [56]:
from sklearn.model_selection import StratifiedKFold

# Define fl_X_train and fl_y_train
fl_X_train = []
fl_y_train = []

if METHOD == 'STRATIFIED':
    print(f"{Colours.YELLOW.value}STRATIFIED METHOD{Colours.NORMAL.value} with {class_size} class classifier")
    # We are going to split the training data into 'NUM_OF_STRATIFIED_CLIENTS' smaller groups using StratifiedKFold
    skf = StratifiedKFold(n_splits=NUM_OF_STRATIFIED_CLIENTS, shuffle=True, random_state=42)
    for train_index, test_index in skf.split(train_df[X_columns], train_df[y_column]):
        fl_X_train.append(train_df[X_columns].iloc[test_index])
        fl_y_train.append(train_df[y_column].iloc[test_index])

elif METHOD == 'MISSING_1_ATTACK':
    print(f"{Colours.YELLOW.value}MISSING_1_ATTACK METHOD{Colours.NORMAL.value} with {class_size} class classifier")

    if individual_classifier or group_classifier:
        # Set the number of splits required to the number of classes - 1
        num_splits = int(class_size) - 1
    else:
        # For binary classifier, set the number of splits to 10
        num_splits = 10

    skf = StratifiedKFold(n_splits=num_splits, shuffle=True, random_state=42)

    # When creating the clients, we will remove one attack class from the training data
    # For the binary classifier, evey other client will have the benign class removed
    for i, (train_index, test_index) in enumerate(skf.split(train_df[X_columns], train_df[y_column])):
        print(f"Train index: {train_index}, Test index: {test_index}")
        if binary_classifier:
            if i % 2 == 0:
                # Create a new dataframe for the client data
                client_df = pd.concat([train_df.iloc[test_index][train_df[y_column] != 1]], ignore_index=True)
                fl_X_train.append(client_df[X_columns])
                fl_y_train.append(client_df[y_column])
            else:
                # Create a new dataframe for the client data
                client_df = pd.concat(train_df.iloc[test_index][train_df[y_column]], ignore_index=True)
                fl_X_train.append(client_df[X_columns])
                fl_y_train.append(client_df[y_column])
        else:
            # Create a new dataframe for the client data
            client_df = pd.concat([train_df.iloc[test_index][train_df[y_column] != i+1]], ignore_index=True)
            fl_X_train.append(client_df[X_columns])
            fl_y_train.append(client_df[y_column])

elif METHOD == 'ATTACK_GROUP':
    print(f"{Colours.YELLOW.value}ATTACK_GROUP METHOD{Colours.NORMAL.value}")
    # With this method we split the data so that each client data sees all attacks except one. 
    # All clients will see attack traffic BenignTraffic - 0.
    # EG:
    # client 0 will see attacks 2-7
    # client 1 will see attacks 1, 3-7
    # client 2 will see attacks 1-2, 4-7
    
    # There are 7 attack groups + 1 benign class, so we will create 7 clients
    skf = StratifiedKFold(n_splits=7, shuffle=True, random_state=42)
    for i, (train_index, test_index) in enumerate(skf.split(train_df[X_columns], train_df[y_column])):
        # Create a new dataframe for the client data
        client_df = pd.concat([train_df.iloc[test_index][train_df[y_column] != i+1]], ignore_index=True)
        fl_X_train.append(client_df[X_columns])
        fl_y_train.append(client_df[y_column])
        
    pass  



[33mMISSING_1_ATTACK METHOD[0m with 34 class classifier
Train index: [      0       1       2 ... 1425284 1425285 1425286], Test index: [      4      48      64 ... 1425244 1425254 1425260]
Train index: [      0       1       2 ... 1425284 1425285 1425286], Test index: [     68     140     148 ... 1425139 1425145 1425163]
Train index: [      0       1       2 ... 1425283 1425284 1425286], Test index: [     27      31      36 ... 1425259 1425282 1425285]
Train index: [      0       1       2 ... 1425284 1425285 1425286], Test index: [     28      70     107 ... 1425162 1425220 1425235]
Train index: [      0       1       2 ... 1425283 1425285 1425286], Test index: [      5      12      13 ... 1425226 1425278 1425284]


  client_df = pd.concat([train_df.iloc[test_index][train_df[y_column] != i+1]], ignore_index=True)
  client_df = pd.concat([train_df.iloc[test_index][train_df[y_column] != i+1]], ignore_index=True)
  client_df = pd.concat([train_df.iloc[test_index][train_df[y_column] != i+1]], ignore_index=True)
  client_df = pd.concat([train_df.iloc[test_index][train_df[y_column] != i+1]], ignore_index=True)
  client_df = pd.concat([train_df.iloc[test_index][train_df[y_column] != i+1]], ignore_index=True)
  client_df = pd.concat([train_df.iloc[test_index][train_df[y_column] != i+1]], ignore_index=True)
  client_df = pd.concat([train_df.iloc[test_index][train_df[y_column] != i+1]], ignore_index=True)
  client_df = pd.concat([train_df.iloc[test_index][train_df[y_column] != i+1]], ignore_index=True)


Train index: [      0       1       2 ... 1425284 1425285 1425286], Test index: [     43      72     142 ... 1425109 1425168 1425268]
Train index: [      0       1       2 ... 1425284 1425285 1425286], Test index: [     33      93     190 ... 1425079 1425155 1425176]
Train index: [      0       1       2 ... 1425284 1425285 1425286], Test index: [     10      66     100 ... 1425206 1425236 1425246]
Train index: [      0       1       2 ... 1425284 1425285 1425286], Test index: [     15      45     112 ... 1425198 1425255 1425275]


  client_df = pd.concat([train_df.iloc[test_index][train_df[y_column] != i+1]], ignore_index=True)
  client_df = pd.concat([train_df.iloc[test_index][train_df[y_column] != i+1]], ignore_index=True)
  client_df = pd.concat([train_df.iloc[test_index][train_df[y_column] != i+1]], ignore_index=True)
  client_df = pd.concat([train_df.iloc[test_index][train_df[y_column] != i+1]], ignore_index=True)
  client_df = pd.concat([train_df.iloc[test_index][train_df[y_column] != i+1]], ignore_index=True)


Train index: [      0       1       2 ... 1425284 1425285 1425286], Test index: [     84      85     121 ... 1425186 1425229 1425234]
Train index: [      0       1       2 ... 1425284 1425285 1425286], Test index: [     44      65      83 ... 1425231 1425253 1425261]
Train index: [      1       2       3 ... 1425284 1425285 1425286], Test index: [      0      40      61 ... 1425112 1425187 1425250]
Train index: [      0       1       3 ... 1425284 1425285 1425286], Test index: [      2       9      52 ... 1425178 1425196 1425262]
Train index: [      0       2       3 ... 1425284 1425285 1425286], Test index: [      1      47      58 ... 1425200 1425240 1425273]
Train index: [      0       1       2 ... 1425284 1425285 1425286], Test index: [     49      67     132 ... 1425227 1425245 1425251]
Train index: [      0       1       2 ... 1425284 1425285 1425286], Test index: [     29      41     103 ... 1425065 1425105 1425181]
Train index: [      0       1       2 ... 1425284 1425285 1425

  client_df = pd.concat([train_df.iloc[test_index][train_df[y_column] != i+1]], ignore_index=True)
  client_df = pd.concat([train_df.iloc[test_index][train_df[y_column] != i+1]], ignore_index=True)
  client_df = pd.concat([train_df.iloc[test_index][train_df[y_column] != i+1]], ignore_index=True)
  client_df = pd.concat([train_df.iloc[test_index][train_df[y_column] != i+1]], ignore_index=True)
  client_df = pd.concat([train_df.iloc[test_index][train_df[y_column] != i+1]], ignore_index=True)
  client_df = pd.concat([train_df.iloc[test_index][train_df[y_column] != i+1]], ignore_index=True)
  client_df = pd.concat([train_df.iloc[test_index][train_df[y_column] != i+1]], ignore_index=True)
  client_df = pd.concat([train_df.iloc[test_index][train_df[y_column] != i+1]], ignore_index=True)


Train index: [      0       1       2 ... 1425284 1425285 1425286], Test index: [     42      50      95 ... 1425212 1425264 1425271]
Train index: [      0       1       2 ... 1425284 1425285 1425286], Test index: [      7      14      19 ... 1425213 1425214 1425263]
Train index: [      0       1       2 ... 1425284 1425285 1425286], Test index: [     51      63      92 ... 1425123 1425147 1425258]
Train index: [      0       1       2 ... 1425284 1425285 1425286], Test index: [     30      86     111 ... 1425191 1425238 1425247]


  client_df = pd.concat([train_df.iloc[test_index][train_df[y_column] != i+1]], ignore_index=True)
  client_df = pd.concat([train_df.iloc[test_index][train_df[y_column] != i+1]], ignore_index=True)
  client_df = pd.concat([train_df.iloc[test_index][train_df[y_column] != i+1]], ignore_index=True)
  client_df = pd.concat([train_df.iloc[test_index][train_df[y_column] != i+1]], ignore_index=True)


Train index: [      0       1       2 ... 1425284 1425285 1425286], Test index: [     32      53      55 ... 1425170 1425241 1425270]
Train index: [      0       1       2 ... 1425284 1425285 1425286], Test index: [     25      38      57 ... 1425265 1425279 1425280]
Train index: [      0       1       2 ... 1425284 1425285 1425286], Test index: [     23      34      76 ... 1425242 1425249 1425256]
Train index: [      0       1       2 ... 1425284 1425285 1425286], Test index: [     22      60     127 ... 1425223 1425224 1425283]


  client_df = pd.concat([train_df.iloc[test_index][train_df[y_column] != i+1]], ignore_index=True)
  client_df = pd.concat([train_df.iloc[test_index][train_df[y_column] != i+1]], ignore_index=True)
  client_df = pd.concat([train_df.iloc[test_index][train_df[y_column] != i+1]], ignore_index=True)
  client_df = pd.concat([train_df.iloc[test_index][train_df[y_column] != i+1]], ignore_index=True)
  client_df = pd.concat([train_df.iloc[test_index][train_df[y_column] != i+1]], ignore_index=True)


Train index: [      0       1       2 ... 1425284 1425285 1425286], Test index: [     77     105     108 ... 1425216 1425274 1425281]
Train index: [      0       1       2 ... 1425284 1425285 1425286], Test index: [     24      26     104 ... 1425221 1425267 1425276]
Train index: [      0       1       2 ... 1425284 1425285 1425286], Test index: [     17      35     116 ... 1425194 1425243 1425257]
Train index: [      0       1       2 ... 1425284 1425285 1425286], Test index: [      3      39      94 ... 1425154 1425165 1425272]
Train index: [      0       1       2 ... 1425283 1425284 1425285], Test index: [     11      21      46 ... 1425228 1425233 1425286]
Train index: [      0       1       2 ... 1425284 1425285 1425286], Test index: [     73      97     184 ... 1425209 1425269 1425277]
Train index: [      0       1       2 ... 1425284 1425285 1425286], Test index: [      6      18      71 ... 1425179 1425199 1425237]


  client_df = pd.concat([train_df.iloc[test_index][train_df[y_column] != i+1]], ignore_index=True)
  client_df = pd.concat([train_df.iloc[test_index][train_df[y_column] != i+1]], ignore_index=True)
  client_df = pd.concat([train_df.iloc[test_index][train_df[y_column] != i+1]], ignore_index=True)


In [61]:
for i in range(len(fl_X_train)):
    # Show the unique values in the y column
    print(f"fl_X_train[{i}].shape: {fl_X_train[i].shape}")  
    print(f"fl_y_train[{i}].value_counts():\n{fl_y_train[i].value_counts()}")
    print(f"fl_y_train[{i}].unique(): {fl_y_train[i].unique()}")

fl_X_train[0].shape: (39445, 46)
fl_y_train[0].value_counts():
6     6644
4     4978
5     4172
2     3800
3     3756
7     3349
13    3075
15    2474
14    1877
0     1015
17     918
19     818
18     690
10     421
26     280
9      267
8      263
25     163
24     125
21      92
22      78
16      67
23      36
12      26
11      22
33      12
27       7
32       6
31       5
29       3
28       3
20       2
30       1
Name: label, dtype: int64
fl_y_train[0].unique(): [ 2  8  7 15  5 18  4  6  0 13  3 19 25 17 14  9 26 10 22 12 29 16 11 24
 23 21 27 33 32 31 20 28 30]
fl_X_train[1].shape: (39391, 46)
fl_y_train[1].value_counts():
6     6644
4     4978
5     4172
3     3756
1     3746
7     3349
13    3075
15    2474
14    1877
0     1015
17     918
19     818
18     690
10     421
26     280
9      267
8      263
25     163
24     125
21      92
22      78
16      66
23      36
12      26
11      23
33      12
27       7
32       6
31       5
29       3
28       3
20       2
30     

### Convert the training dataset

In [62]:
# Convert the testing daya to X_test and y_test ndarrays
X_test = test_df[X_columns].to_numpy()
y_test = test_df[y_column].to_numpy()

---
### Data check

In [63]:
NUM_OF_CLIENTS = len(fl_X_train)
print("NUM_CLIENTS:", NUM_OF_CLIENTS)

print("NUM_ROUNDS:", NUM_OF_ROUNDS)
print()
print("Original train_df size: {}".format(train_df.shape))

print("Checking training data split groups")
for i in range(len(fl_X_train)):
    print(i, ":", "X Shape", fl_X_train[i].shape, "Y Shape", fl_y_train[i].shape)


# Print the sizes of X_test and y_test
print("\nChecking testing data")
print("X_test size: {}".format(X_test.shape))
print("y_test size: {}".format(y_test.shape))

print("\nDeploy Simulation")

NUM_CLIENTS: 33
NUM_ROUNDS: 10

Original train_df size: (1425287, 47)
Checking training data split groups
0 : X Shape (39445, 46) Y Shape (39445,)
1 : X Shape (39391, 46) Y Shape (39391,)
2 : X Shape (39435, 46) Y Shape (39435,)
3 : X Shape (38213, 46) Y Shape (38213,)
4 : X Shape (39019, 46) Y Shape (39019,)
5 : X Shape (36547, 46) Y Shape (36547,)
6 : X Shape (39842, 46) Y Shape (39842,)
7 : X Shape (42928, 46) Y Shape (42928,)
8 : X Shape (42925, 46) Y Shape (42925,)
9 : X Shape (42771, 46) Y Shape (42771,)
10 : X Shape (43169, 46) Y Shape (43169,)
11 : X Shape (43164, 46) Y Shape (43164,)
12 : X Shape (40116, 46) Y Shape (40116,)
13 : X Shape (41314, 46) Y Shape (41314,)
14 : X Shape (40718, 46) Y Shape (40718,)
15 : X Shape (43125, 46) Y Shape (43125,)
16 : X Shape (42273, 46) Y Shape (42273,)
17 : X Shape (42499, 46) Y Shape (42499,)
18 : X Shape (42371, 46) Y Shape (42371,)
19 : X Shape (43188, 46) Y Shape (43188,)
20 : X Shape (43097, 46) Y Shape (43097,)
21 : X Shape (43112, 4

----
# Federated Learning
## Import the libraries and print the versions

In [64]:
import os
import flwr as fl
import numpy as np
import tensorflow as tf

# Make TensorFlow log less verbose
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Activation
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.layers import Dropout

label = train_df[y_column]

Define the Client and Server code

In [65]:
import os
import flwr as fl
import numpy as np
import tensorflow as tf

print('scikit-learn {}.'.format(sklearn.__version__))
print("flwr", fl.__version__)
print("numpy", np.__version__)
print("tf", tf.__version__)
# Make TensorFlow log less verbose
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Activation
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.layers import Dropout

import datetime

class NumpyFlowerClient(fl.client.NumPyClient):
    def __init__(self, cid, model, train_data, train_labels):
        self.model = model
        self.cid = cid
        self.train_data = train_data
        self.train_labels = train_labels

    def get_parameters(self, config):
        return self.model.get_weights()

    def fit(self, parameters, config):
        self.model.set_weights(parameters)
        print ("Client ", self.cid, "Training...")
        self.model.fit(self.train_data, self.train_labels, epochs=10, batch_size=64)
        print ("Client ", self.cid, "Training complete...")
        return self.model.get_weights(), len(self.train_data), {}

    def evaluate(self, parameters, config):
        self.model.set_weights(parameters)
        print ("Client ", self.cid, "Evaluating...")
        loss, accuracy = self.model.evaluate(self.train_data, self.train_labels, batch_size=64)
        print(f"{Colours.YELLOW.value}Client {self.cid} evaluation complete - Accuracy: {accuracy:.6f}, Loss: {loss:.6f}{Colours.NORMAL.value}")

        return loss, len(self.train_data), {"accuracy": accuracy}
    
    def predict(self, incoming):
        prediction = np.argmax( self.model.predict(incoming) ,axis=1)
        return prediction

def client_fn(cid: str) -> NumpyFlowerClient:
    """Create a Flower client representing a single organization."""

    # Load model
    #model = tf.keras.applications.MobileNetV2((32, 32, 3), classes=10, weights=None)
    #model.compile("adam", "sparse_categorical_crossentropy", metrics=["accuracy"])

    print ("Client ID:", cid)

    model = Sequential([
      #Flatten(input_shape=(79,1)),
      Flatten(input_shape=(fl_X_train[0].shape[1] , 1)),
      Dense(50, activation='relu'),  
      Dense(25, activation='relu'),  
      Dense(len(label.unique()), activation='softmax')
    ])
    
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

   
    partition_id = int(cid)
    X_train_c = fl_X_train[partition_id]
    y_train_c = fl_y_train[partition_id]

    # Create a  single Flower client representing a single organization
    return NumpyFlowerClient(cid, model, X_train_c, y_train_c)


from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
eval_count = 0

def get_evaluate_fn(server_model):
    global eval_count
    """Return an evaluation function for server-side evaluation."""
    # The `evaluate` function will be called after every round
    
    
    def evaluate(server_round, parameters, config):
        global eval_count
        
        # Update model with the latest parameters
        server_model.set_weights(parameters)
        print (f"Server Evaluating... Evaluation Count:{eval_count}")
        loss, accuracy = server_model.evaluate(X_test, y_test)
        
        y_pred = server_model.predict(X_test)
        print ("Prediction: ", y_pred, y_pred.shape)
        #cmatrix = confusion_matrix(y_test, np.rint(y_pred))
        #print ("confusion_matrix:", cmatrix, cmatrix.shape)
                        
        print(f"{Colours.YELLOW.value}Server evaluation complete - Accuracy: {accuracy:.4f}, Loss: {loss:.4f}{Colours.NORMAL.value}")
        
        np.save("y_pred-" + str(eval_count) + ".npy", y_pred)
        #np.save("cmatrix-" + str(eval_count) + ".npy", cmatrix)
        eval_count = eval_count + 1
        
        return loss, {"accuracy": accuracy}
    return evaluate



server_model = Sequential([
    #Flatten(input_shape=(79,1)),
    Flatten(input_shape=(fl_X_train[0].shape[1] , 1)),
    Dense(50, activation='relu'),  
    Dense(25, activation='relu'),  
    Dense(len(label.unique()), activation='softmax')
])


server_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Create FedAvg strategy
strategy = fl.server.strategy.FedAvg(
        fraction_fit=1.0,
        fraction_evaluate=0.5,
        min_fit_clients=2, #10,
        min_evaluate_clients=2, #5,
        min_available_clients=2, #10,
        evaluate_fn=get_evaluate_fn(server_model),
        #evaluate_metrics_aggregation_fn=weighted_average,
)

scikit-learn 1.2.2.
flwr 1.4.0
numpy 1.23.5
tf 2.12.0


In [66]:
%%time
print (f"\n{Colours.YELLOW.value} Deploy simulation... {class_size_map[len(label.unique())]} ({class_size}) Classifier\n{Colours.NORMAL.value}")

start_time = datetime.datetime.now()

# Start simulation
fl.simulation.start_simulation(
    client_fn=client_fn,
    num_clients=NUM_OF_CLIENTS,
    config=fl.server.ServerConfig(num_rounds=NUM_OF_ROUNDS),
    strategy=strategy,
)

end_time = datetime.datetime.now()
print("Total time taken: ", end_time - start_time)

INFO flwr 2023-07-09 12:59:51,411 | app.py:146 | Starting Flower simulation, config: ServerConfig(num_rounds=10, round_timeout=None)



[33m Deploy simulation... Individual (34) Classifier
[0m


2023-07-09 12:59:57,729	INFO worker.py:1636 -- Started a local Ray instance.
INFO flwr 2023-07-09 13:00:00,761 | app.py:180 | Flower VCE: Ray initialized with resources: {'GPU': 1.0, 'object_store_memory': 5855638732.0, 'node:127.0.0.1': 1.0, 'memory': 11711277467.0, 'CPU': 12.0}
INFO flwr 2023-07-09 13:00:00,762 | server.py:86 | Initializing global parameters
INFO flwr 2023-07-09 13:00:00,763 | server.py:273 | Requesting initial parameters from one random client
INFO flwr 2023-07-09 13:00:07,811 | server.py:277 | Received initial parameters from one random client
INFO flwr 2023-07-09 13:00:07,812 | server.py:88 | Evaluating initial parameters


[2m[36m(launch_and_get_parameters pid=21948)[0m Client ID: 11
Server Evaluating... Evaluation Count:0
Prediction:  [[0.02793223 0.02579061 0.03100688 ... 0.02779605 0.0501412  0.02199134]
 [0.032065   0.02384374 0.03484754 ... 0.03450171 0.03505244 0.01904592]
 [0.02784782 0.02554777 0.03082235 ... 0.02770456 0.05004438 0.02206242]
 ...
 [0.03375848 0.02338876 0.03199707 ... 0.03124009 0.03488163 0.02570772]
 [0.02790414 0.02577431 0.03070454 ... 0.02777931 0.05011694 0.02208443]
 [0.04853022 0.01809155 0.04331232 ... 0.04121283 0.03077046 0.02368784]] (1803497, 34)
[33mServer evaluation complete - Accuracy: 0.0172, Loss: 3.6450[0m


INFO flwr 2023-07-09 13:02:08,032 | server.py:91 | initial parameters (loss, other metrics): 3.644986629486084, {'accuracy': 0.017153341323137283}
INFO flwr 2023-07-09 13:02:08,033 | server.py:101 | FL starting
DEBUG flwr 2023-07-09 13:02:08,034 | server.py:218 | fit_round 1: strategy sampled 33 clients (out of 33)


[2m[36m(launch_and_fit pid=21948)[0m Client ID: 14
[2m[36m(launch_and_fit pid=21948)[0m Client  14 Training...
[2m[36m(launch_and_fit pid=21948)[0m Epoch 1/10
[2m[36m(launch_and_fit pid=21948)[0m 
[2m[36m(launch_and_fit pid=21948)[0m   1/637 [..............................] - ETA: 17:49 - loss: 3.7132 - accuracy: 0.0000e+00
[2m[36m(launch_and_fit pid=21948)[0m 
[2m[36m(launch_and_fit pid=21948)[0m  20/637 [..............................] - ETA: 1s - loss: 3.3972 - accuracy: 0.1461       
[2m[36m(launch_and_fit pid=21948)[0m 
[2m[36m(launch_and_fit pid=21948)[0m  41/637 [>.............................] - ETA: 1s - loss: 3.1787 - accuracy: 0.3388
[2m[36m(launch_and_fit pid=21948)[0m  63/637 [=>............................] - ETA: 1s - loss: 2.8859 - accuracy: 0.4524
[2m[36m(launch_and_fit pid=21948)[0m 
[2m[36m(launch_and_fit pid=21948)[0m  85/637 [===>..........................] - ETA: 1s - loss: 2.5397 - accuracy: 0.5175
[2m[36m(launch_and_fit pid=

DEBUG flwr 2023-07-09 13:04:03,866 | server.py:232 | fit_round 1 received 33 results and 0 failures


Server Evaluating... Evaluation Count:1
Prediction:  [[4.24820399e-07 8.96152596e-06 3.70100606e-05 ... 2.61601207e-09
  3.14178436e-07 7.43643000e-07]
 [9.45491706e-07 1.85074589e-09 8.00353384e-09 ... 2.34661868e-09
  5.07945614e-08 1.30668116e-08]
 [4.16272627e-07 7.61970205e-06 4.69943952e-05 ... 2.72722112e-09
  3.19153258e-07 7.72383714e-07]
 ...
 [1.18127453e-07 2.06156892e-09 1.03255363e-04 ... 9.37267615e-08
  4.05320151e-07 2.35537158e-07]
 [4.23082327e-07 8.86899852e-06 4.49683466e-05 ... 2.70639444e-09
  3.25660892e-07 7.47104934e-07]
 [1.61362010e-07 6.97964811e-11 9.99930859e-01 ... 1.11052705e-08
  6.28640606e-08 2.75813250e-07]] (1803497, 34)
[33mServer evaluation complete - Accuracy: 0.8119, Loss: 0.4258[0m


INFO flwr 2023-07-09 13:06:04,265 | server.py:119 | fit progress: (1, 0.42582088708877563, {'accuracy': 0.8119353652000427}, 236.2305626999878)
DEBUG flwr 2023-07-09 13:06:04,266 | server.py:168 | evaluate_round 1: strategy sampled 16 clients (out of 33)


[2m[36m(launch_and_evaluate pid=5920)[0m Client ID: 3
[2m[36m(launch_and_fit pid=5920)[0m   1/675 [..............................] - ETA: 0s - loss: 0.5109 - accuracy: 0.7969[32m [repeated 10x across cluster][0m
[2m[36m(launch_and_fit pid=5920)[0m  41/675 [>.............................] - ETA: 0s - loss: 0.4066 - accuracy: 0.8155[32m [repeated 11x across cluster][0m
[2m[36m(launch_and_fit pid=5920)[0m  60/675 [=>............................] - ETA: 1s - loss: 0.4098 - accuracy: 0.8096[32m [repeated 9x across cluster][0m
[2m[36m(launch_and_fit pid=5920)[0m 122/675 [====>.........................] - ETA: 0s - loss: 0.4016 - accuracy: 0.8171[32m [repeated 10x across cluster][0m
[2m[36m(launch_and_fit pid=28876)[0m 143/675 [=====>........................] - ETA: 0s - loss: 0.4186 - accuracy: 0.8097[32m [repeated 11x across cluster][0m
[2m[36m(launch_and_fit pid=5920)[0m  82/675 [==>...........................] - ETA: 0s - loss: 0.3992 - accuracy: 0.8182[32m

DEBUG flwr 2023-07-09 13:06:13,997 | server.py:182 | evaluate_round 1 received 16 results and 0 failures
DEBUG flwr 2023-07-09 13:06:13,999 | server.py:218 | fit_round 2: strategy sampled 33 clients (out of 33)


[2m[36m(launch_and_fit pid=5920)[0m Client  29 Training...
[2m[36m(launch_and_fit pid=5920)[0m Epoch 1/10
[2m[36m(launch_and_fit pid=21948)[0m Client ID: 10[32m [repeated 11x across cluster][0m
[2m[36m(launch_and_fit pid=5920)[0m   1/675 [..............................] - ETA: 26:45 - loss: 0.3234 - accuracy: 0.8906[32m [repeated 6x across cluster][0m
[2m[36m(launch_and_evaluate pid=5920)[0m  31/671 [>.............................] - ETA: 1s - loss: 0.4121 - accuracy: 0.8286  [32m [repeated 5x across cluster][0m
[2m[36m(launch_and_evaluate pid=5920)[0m  60/671 [=>............................] - ETA: 1s - loss: 0.5258 - accuracy: 0.8193[32m [repeated 5x across cluster][0m
[2m[36m(launch_and_evaluate pid=5920)[0m 123/671 [====>.........................] - ETA: 0s - loss: 0.4603 - accuracy: 0.8238[32m [repeated 5x across cluster][0m
[2m[36m(launch_and_evaluate pid=5920)[0m 152/671 [=====>........................] - ETA: 0s - loss: 0.4530 - accuracy: 0.821

[2m[33m(raylet)[0m [2023-07-09 13:06:29,280 C 9996 18240] (raylet.exe) dlmalloc.cc:129:  Check failed: *handle != nullptr CreateFileMapping() failed. GetLastError() = 1455
[2m[33m(raylet)[0m *** StackTrace Information ***
[2m[33m(raylet)[0m unknown
[2m[33m(raylet)[0m unknown
[2m[33m(raylet)[0m unknown
[2m[33m(raylet)[0m unknown
[2m[33m(raylet)[0m unknown
[2m[33m(raylet)[0m unknown
[2m[33m(raylet)[0m unknown
[2m[33m(raylet)[0m unknown
[2m[33m(raylet)[0m unknown
[2m[33m(raylet)[0m unknown
[2m[33m(raylet)[0m unknown
[2m[33m(raylet)[0m unknown
[2m[33m(raylet)[0m unknown
[2m[33m(raylet)[0m unknown
[2m[33m(raylet)[0m unknown
[2m[33m(raylet)[0m unknown
[2m[33m(raylet)[0m unknown
[2m[33m(raylet)[0m unknown
[2m[33m(raylet)[0m unknown
[2m[33m(raylet)[0m unknown
[2m[33m(raylet)[0m unknown
[2m[33m(raylet)[0m unknown
[2m[33m(raylet)[0m unknown
[2m[33m(raylet)[0m unknown
[2m[33m(raylet)[0m configthreadlocale
[2m[33m(r



ERROR flwr 2023-07-09 13:06:31,746 | ray_client_proxy.py:87 | The task's local raylet died. Check raylet.out for more information.
ERROR flwr 2023-07-09 13:06:31,749 | ray_client_proxy.py:87 | The task's local raylet died. Check raylet.out for more information.
ERROR flwr 2023-07-09 13:06:31,749 | ray_client_proxy.py:87 | The task's local raylet died. Check raylet.out for more information.
ERROR flwr 2023-07-09 13:06:31,749 | ray_client_proxy.py:87 | The task's local raylet died. Check raylet.out for more information.
ERROR flwr 2023-07-09 13:06:31,749 | ray_client_proxy.py:87 | The task's local raylet died. Check raylet.out for more information.
ERROR flwr 2023-07-09 13:06:31,750 | ray_client_proxy.py:87 | The task's local raylet died. Check raylet.out for more information.
ERROR flwr 2023-07-09 13:06:32,269 | ray_client_proxy.py:87 | The task's local raylet died. Check raylet.out for more information.
ERROR flwr 2023-07-09 13:06:32,274 | ray_client_proxy.py:87 | The task's local rayl