# Federated Learning: Attack and Defense Demonstration

In this notebook, we will demonstrate a potential attack on a federated learning system and how to defend against it. Federated learning is a machine learning approach where a model is trained across multiple devices or servers while keeping the data localized. This approach is beneficial for scenarios where data privacy is crucial, and transferring data to a central server is not feasible.

However, federated learning is susceptible to attacks, especially when malicious devices introduce noisy or poisoned data. In this demonstration, we will simulate such an attack and then introduce a defense mechanism to mitigate its effects.

Let's begin by setting up the necessary libraries and defining our simple linear regression model.

In [None]:
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Define a simple linear regression model
class LinearRegression:
    def __init__(self, learning_rate=0.0001, weight=None, bias=None):
        self.learning_rate = learning_rate
        self.weight = weight if weight is not None else np.random.randn(10)  # Initialize weight as a vector of size 10
        self.bias = bias if bias is not None else np.random.randn()

    def predict(self, x):
        return np.dot(self.weight, x) + self.bias

    def update(self, gradient_weight, gradient_bias):
        self.weight -= self.learning_rate * gradient_weight
        self.bias -= self.learning_rate * gradient_bias

    def mse(self, data):
        total_error = 0
        for x, y in data:
            y_pred = self.predict(x)
            total_error += (y - y_pred) ** 2
        return total_error / len(data)

## Linear Regression Model
We've defined a simple linear regression model with the following components:
- **Initialization**: The model initializes with a learning rate, weights, and bias. If no weights or bias are provided, they are randomly initialized. The weight vector is of size 10.
- **Predict**: This function predicts the output for a given input `x` using the linear equation $y = w \cdot x + b$, where `w` is the weight vector and `b` is the bias.
- **Update**: This function updates the model's weights and bias based on the provided gradients. The learning rate determines the step size for the update.
- **Mean Squared Error (MSE)**: This function calculates the mean squared error for the given data. It's a measure of the model's performance, with lower values indicating better performance.

Next, we'll define functions to simulate local training on devices and federated learning.
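Before moving on, the prediction and MSE formulas described above can be checked by hand on a tiny example. The following is a minimal sketch, not part of the notebook's model class: `predict_batch` and `mse_batch` are hypothetical helper names, and they operate on a whole feature matrix at once instead of looping over samples.

```python
import numpy as np

def predict_batch(weight, bias, X):
    """Return w . x + b for every row x of X."""
    return X @ weight + bias

def mse_batch(weight, bias, X, y):
    """Mean of the squared prediction errors over all rows."""
    return float(np.mean((y - predict_batch(weight, bias, X)) ** 2))

# Two samples with two features each (illustrative values)
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 0.0])
w = np.array([0.5, -1.0])
b = 2.0

preds = predict_batch(w, b, X)   # 0.5*1 - 1*2 + 2 = 0.5 and 0.5*3 - 1*4 + 2 = -0.5
error = mse_batch(w, b, X, y)    # ((1 - 0.5)**2 + (0 + 0.5)**2) / 2 = 0.25
print(preds, error)
```

The loop-based `mse` method in the class computes the same quantity one sample at a time.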

In [None]:
# Simulate local training on devices
def local_training(model, data, epochs=100):
    for _ in range(epochs):
        gradient_weight = 0
        gradient_bias = 0
        for x, y in data:
            y_pred = model.predict(x)
            gradient_weight += -2 * x * (y - y_pred)
            gradient_bias += -2 * (y - y_pred)
        model.update(gradient_weight, gradient_bias)

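# Federated learning simulation
def federated_learning(global_model, local_data_list, epochs=100):
    num_devices = len(local_data_list)
    for _ in range(epochs):
        global_gradient_weight = 0
        global_gradient_bias = 0
        for local_data in local_data_list:
            # Copy the weights: update() modifies them in place, so sharing the array
            # would let local training change the global model directly and make
            # the difference computed below always zero
            local_model = LinearRegression(weight=global_model.weight.copy(), bias=global_model.bias)
            local_training(local_model, local_data)
            global_gradient_weight += (local_model.weight - global_model.weight) / num_devices  # Average difference
            global_gradient_bias += (local_model.bias - global_model.bias) / num_devices  # Average difference
        # Move the global model toward the average local model. Calling update() here
        # would subtract the difference (scaled by the learning rate) instead of adding it
        global_model.weight = global_model.weight + global_gradient_weight
        global_model.bias = global_model.bias + global_gradient_bias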

# Generate random data for devices
def generate_data(num_points, x_range, y_range):
    x = np.random.uniform(x_range[0], x_range[1], num_points)
    y = np.random.uniform(y_range[0], y_range[1], num_points)
    return list(zip(x, y))

## Local Training and Federated Learning
### Local Training
The `local_training` function simulates the training of a model on a local device. It performs the following steps:

1. For each epoch, it initializes the gradients for weight and bias to zero.
2. For each data point, it predicts the output using the model and calculates the gradients for weight and bias based on the prediction error.
3. After processing all data points, it updates the model's weight and bias using the accumulated gradients.
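The per-sample terms accumulated in step 2 follow from differentiating the summed squared error. As a short derivation, writing $\hat{y}_i = w \cdot x_i + b$, with $\eta$ for the learning rate and $N$ for the number of local samples (symbols introduced here for illustration):

$$
L(w, b) = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2
$$

$$
\frac{\partial L}{\partial w} = -2 \sum_{i=1}^{N} x_i \, (y_i - \hat{y}_i), \qquad
\frac{\partial L}{\partial b} = -2 \sum_{i=1}^{N} (y_i - \hat{y}_i)
$$

$$
w \leftarrow w - \eta \, \frac{\partial L}{\partial w}, \qquad
b \leftarrow b - \eta \, \frac{\partial L}{\partial b}
$$

Because the code sums the gradients rather than averaging them, the effective step size grows with $N$, which is one reason the default learning rate is so small.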

### Federated Learning
The `federated_learning` function simulates the federated learning process. It performs the following steps:

1. For each epoch, it initializes the global gradients for weight and bias to zero.
2. For each local dataset (representing a device's data):
   - A local model is created with the same weight and bias as the global model.
   - The local model is trained using the `local_training` function.
   - The difference between the local model's weight and bias and the global model's weight and bias is calculated and added to the global gradients. This difference is averaged over the number of devices.
3. After processing all local datasets, the global model's weight and bias are updated using the accumulated global gradients.
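The steps above can be sketched in a self-contained form on synthetic data. This is an illustrative sketch under stated assumptions rather than the notebook's implementation: `local_update` and `federated_round` are hypothetical names, the local gradient is averaged over samples (the notebook sums it), and the learning rate, epoch counts, and data are made up for the example.

```python
import numpy as np

def local_update(weight, bias, X, y, lr=0.01, epochs=20):
    """Gradient descent on one device, starting from copies of the global parameters."""
    w, b = weight.copy(), float(bias)  # copy so the caller's array is never modified
    for _ in range(epochs):
        residual = y - (X @ w + b)
        w -= lr * (-2.0 * X.T @ residual) / len(y)
        b -= lr * (-2.0 * residual.sum()) / len(y)
    return w, b

def federated_round(weight, bias, shards):
    """Average the (local - global) differences and add them to the global parameters."""
    deltas_w, deltas_b = [], []
    for X, y in shards:
        w_local, b_local = local_update(weight, bias, X, y)
        deltas_w.append(w_local - weight)
        deltas_b.append(b_local - bias)
    return weight + np.mean(deltas_w, axis=0), bias + float(np.mean(deltas_b))

# Four synthetic devices drawn from the same linear relationship (assumed for illustration)
rng = np.random.default_rng(0)
true_w, true_b = np.array([2.0, -1.0, 0.5]), 4.0
shards = []
for _ in range(4):
    X = rng.normal(size=(50, 3))
    shards.append((X, X @ true_w + true_b + rng.normal(scale=0.1, size=50)))

w, b = np.zeros(3), 0.0
for _ in range(50):
    w, b = federated_round(w, b, shards)
print(w, b)
```

Two details matter here: each device starts from a copy of the global weights, and the averaged difference is added to the global parameters rather than passed through a gradient-descent step.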

### Data Generation
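The `generate_data` function simulates the generation of random data for devices. It draws `x` and `y` values uniformly at random within the specified ranges and returns them as pairs. Each `x` is a single scalar rather than a 10-feature vector, so its output does not match the model defined above; the remaining cells use the diabetes dataset instead and never call this helper.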

Next, we'll load a real dataset, preprocess it, and simulate the federated learning process.

In [None]:
# Load the diabetes dataset
diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target

# Standardize the dataset
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Split the dataset into chunks to simulate data on different devices
num_devices = 4
device_data = []
for _ in range(num_devices):
    X_train, X, y_train, y = train_test_split(X, y, test_size=0.75)
    device_data.append(list(zip(X_train, y_train)))

# Initialize a global model
global_model = LinearRegression()

# Evaluate model's metrics before federated learning
all_data = [item for sublist in device_data for item in sublist]
initial_mse = global_model.mse(all_data)
print(f'Initial MSE: {initial_mse}')

# Perform federated learning with the devices
federated_learning(global_model, device_data)

# Evaluate model's metrics after federated learning
post_devices_mse = global_model.mse(all_data)
print(f'MSE after training: {post_devices_mse}')

## Data Preprocessing and Initial Federated Learning
In the above code, we perform the following steps:

1. **Load the Diabetes Dataset**: We use the diabetes dataset from the `sklearn.datasets` module. This dataset contains ten baseline variables (age, sex, body mass index, average blood pressure, and six blood serum measurements) for 442 diabetes patients. The target variable is a quantitative measure of disease progression one year after baseline.

2. **Data Standardization**: We standardize the dataset using the `StandardScaler` from `sklearn.preprocessing`. Standardization of a dataset is a common requirement for many machine learning estimators. They might behave badly if the individual features do not more or less look like standard normally distributed data (e.g., Gaussian with 0 mean and unit variance).

3. **Simulate Data on Different Devices**: We split the dataset into chunks to simulate the scenario where data is distributed across different devices. Each device gets a subset of the data. This is a common scenario in federated learning where each device (e.g., mobile phones) has its local data and doesn't share it directly due to privacy concerns.

4. **Initialize a Global Model**: We create an instance of our `LinearRegression` class, which will act as our global model.

5. **Evaluate Model Before Training**: Before starting the federated learning process, we evaluate the model's performance on all data using the Mean Squared Error (MSE) metric. This gives us a baseline to compare against after training.

6. **Federated Learning**: We then perform federated learning using the `federated_learning` function. After training, we evaluate the model's performance again to see the improvement.
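One detail of step 3 is worth noting: each pass of the split loop takes about a quarter of whatever data remains, so the chunks shrink from device to device (roughly 110 samples for the first device down to roughly 46 for the fourth), and `X` and `y` are overwritten with the leftover samples. If equal, non-overlapping chunks are preferred, one possible alternative is sketched below. The helper name `partition_for_devices` and its `seed` parameter are assumptions for illustration and do not appear in the notebook.

```python
import numpy as np

def partition_for_devices(X, y, num_devices, seed=0):
    """Shuffle the row indices once, then split them into disjoint, near-equal shards."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    return [(X[idx], y[idx]) for idx in np.array_split(order, num_devices)]

# Small stand-in arrays; in the notebook these would be the standardized X and y
X_demo = np.arange(40, dtype=float).reshape(20, 2)
y_demo = np.arange(20, dtype=float)

shards = partition_for_devices(X_demo, y_demo, num_devices=4)
sizes = [len(y_shard) for _, y_shard in shards]
print(sizes)
```

This variant leaves the original arrays untouched, so they can be reused by later cells.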

The outputs show the model's MSE before and after federated learning. Next, we'll introduce an adversarial attack by adding a device with noisy data and observe its impact on the model.

In [None]:
# Introduce a fifth device with noisy data
num_devices += 1
X_noisy, _, y_noisy, _ = train_test_split(X, y, test_size=0.75)
X_noisy += np.random.normal(0, 5, X_noisy.shape)  # Adding large noise to features
y_noisy += np.random.normal(0, 100, y_noisy.shape)  # Adding large noise to targets
device_data.append(list(zip(X_noisy, y_noisy)))

# Re-evaluate the model with the new data
federated_learning(global_model, device_data)

# Evaluate model's metrics after federated learning with 5 devices
all_data += list(zip(X_noisy, y_noisy))
post_5_devices_mse = global_model.mse(all_data)
print(f'MSE after poisoning: {post_5_devices_mse}')

## Adversarial Attack: Introducing Noisy Data
In the above code, we simulate an adversarial attack by introducing a fifth device with noisy data. Here's a breakdown of the steps:

1. **Introduce a Fifth Device**: We add another device to our simulation. This device will contain data that has been intentionally corrupted or poisoned.

2. **Generate Noisy Data**: For this device, we add significant noise to the features (`X_noisy`) and targets (`y_noisy`). This noise is generated from a normal distribution with a mean of 0 and standard deviations of 5 for features and 100 for targets. This simulates a scenario where an adversary might try to sabotage the federated learning process by introducing corrupted data.

3. **Re-evaluate the Model with Noisy Data**: We then perform federated learning again, this time including the noisy data from the fifth device.

4. **Evaluate Model's Metrics After Poisoning**: After training with the noisy data, we evaluate the model's performance using the MSE metric. This will help us understand the impact of the adversarial attack on the model's performance.
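The noise injection in step 2 can be isolated in a small helper, which makes it easier to see how far the poisoned targets move relative to the clean ones. This is a sketch on synthetic data: `poison_shard` is a hypothetical name, the default standard deviations mirror the values used above (5 for features, 100 for targets), and the clean data is invented for the example.

```python
import numpy as np

def poison_shard(X, y, feature_std=5.0, target_std=100.0, seed=0):
    """Return noisy copies of X and y; the inputs themselves are left unchanged."""
    rng = np.random.default_rng(seed)
    X_noisy = X + rng.normal(0.0, feature_std, size=X.shape)
    y_noisy = y + rng.normal(0.0, target_std, size=y.shape)
    return X_noisy, y_noisy

# Synthetic clean shard: 30 samples, 10 standardized-looking features
rng = np.random.default_rng(1)
X_clean = rng.normal(size=(30, 10))
y_clean = X_clean @ np.ones(10) + 150.0

X_bad, y_bad = poison_shard(X_clean, y_clean)

# Compare the size of the injected target noise with the natural spread of the targets
shift = float(np.std(y_bad - y_clean))
spread = float(np.std(y_clean))
print(shift, spread)
```

When the injected noise is much larger than the natural spread of the targets, as here, the poisoned device's local training pulls its model toward essentially random values.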

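The output shows the model's MSE after the adversarial attack, which can be compared with the MSE printed after training on the four clean devices to see how much the noisy data affects the model. Keep in mind that `all_data` now also contains the noisy samples, so part of any increase comes from evaluating on corrupted points rather than from damage to the model itself. Next, we'll explore a defense mechanism to mitigate the impact of such attacks.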

## Averaged Federated Learning with Trust Scores
In this section, we'll explore a defense mechanism against adversarial attacks in federated learning using a trust-based model averaging approach. The main idea is to assign a trust score to each device and use these scores to weigh each device's contribution during the model aggregation phase. Devices whose updates are repeatedly larger than a fixed threshold will have their trust scores reduced, thus diminishing their influence on the global model over time.

Here's a step-by-step breakdown of the process:

In [None]:
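# Redo but with averaged federated learning
# Reload and standardize the dataset first: the earlier split loop overwrote X and y
# with the leftover samples, so splitting again would only reuse that remainder
X, y = diabetes.data, diabetes.target
X = scaler.fit_transform(X)

# Split the dataset into chunks to simulate data on different devices
num_devices = 4
device_data = []
for _ in range(num_devices):
    X_train, X, y_train, y = train_test_split(X, y, test_size=0.75)
    device_data.append(list(zip(X_train, y_train)))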

# Initialize a global model
global_model = LinearRegression()

# Evaluate model's metrics before federated learning
all_data = [item for sublist in device_data for item in sublist]
initial_mse = global_model.mse(all_data)
print(f'Initial MSE: {initial_mse}')

1. **Data Splitting for Devices:**
   - We start by simulating a scenario where the dataset is split across multiple devices. This is done by dividing the dataset into chunks, with each chunk representing the data available on a particular device.

2. **Global Model Initialization:**
   - A global model is initialized, which will be updated based on the local models' updates from each device.

3. **Evaluation Before Training:**
   - Before starting the federated learning process, we evaluate the performance of the global model using the Mean Squared Error (MSE) metric. This gives us a baseline to compare against after the training.

### Federated Learning with Model Averaging

Next, we introduce a federated learning approach that incorporates model averaging based on trust scores. The trust scores are used to weigh the contributions of each device during the model aggregation phase. The main steps involved in this approach are:

In [None]:
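# Federated learning simulation with model averaging.
# Uses the notebook-level `trust_scores` and `trust_threshold`, which must be
# defined before this function is called.
def averaged_federated_learning(global_model, local_data_list, epochs=100):
    for _ in range(epochs):
        global_gradient_weight = 0
        global_gradient_bias = 0
        # Normalize over the devices that actually take part in this round
        total_trust = sum(score for score in trust_scores if score >= trust_threshold)

        for device_id, local_data in enumerate(local_data_list):
            # Skip devices with trust score below threshold
            if trust_scores[device_id] < trust_threshold:
                continue

            # Copy the weights so local training cannot modify the global model in place
            local_model = LinearRegression(weight=global_model.weight.copy(), bias=global_model.bias)
            local_training(local_model, local_data)

            # Weighted averaging using trust scores
            weight_diff = (local_model.weight - global_model.weight)
            bias_diff = (local_model.bias - global_model.bias)

            global_gradient_weight += (weight_diff * trust_scores[device_id]) / total_trust
            global_gradient_bias += (bias_diff * trust_scores[device_id]) / total_trust

            # Adjust trust scores based on the magnitude of the update
            if np.linalg.norm(weight_diff) > 0.05:  # Example threshold; tune it to the scale of honest updates
                trust_scores[device_id] *= 0.8  # Decrease trust score
            else:
                # Increase trust score, capped at the initial value of 1.0 (assumed cap)
                # so that it doesn't grow indefinitely
                trust_scores[device_id] = min(1.0, trust_scores[device_id] * 1.05)

        # Move the global model toward the trust-weighted average of the local models.
        # Calling update() here would subtract the difference instead of adding it
        global_model.weight = global_model.weight + global_gradient_weight
        global_model.bias = global_model.bias + global_gradient_bias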

4. **Model Averaging Based on Trust Scores:**
   - In the federated learning process, each device trains a local model on its data and then sends the model updates to the server (or the global model).
   - Instead of simply averaging the model updates from all devices, we introduce a trust score for each device. This trust score determines how much the server should trust the updates from a particular device.
   - The trust scores are initialized to 1.0 for all devices, meaning that initially, all devices are equally trusted.
   - During the aggregation phase, the server computes a weighted average of the model updates based on the trust scores of the devices.
   - If the magnitude of the model update from a device exceeds a certain threshold (in this case, 0.05), the trust score of that device is decreased. This is because a large update might indicate that the device's data is noisy or that the device is malicious. Conversely, if the update is small, the trust score is slightly increased.
   - Devices with trust scores below a certain threshold (in this case, 0.5) are considered untrustworthy and their updates are ignored.

5. **Training with Averaged Federated Learning:**
   - The global model is trained using the averaged federated learning approach. After training, we evaluate the model's performance using the MSE metric.
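The aggregation and trust-adjustment rules from step 4 can be sketched in isolation, which makes them easy to check on hand-picked numbers. The function names `trust_weighted_delta` and `update_trust`, the `cap` parameter, and the example updates are assumptions for illustration; the thresholds and factors (0.5, 0.05, 0.8, 1.05) are the ones quoted above.

```python
import numpy as np

def trust_weighted_delta(deltas, trust, threshold=0.5):
    """Trust-weighted average of the updates, ignoring devices below the threshold."""
    active = [i for i, score in enumerate(trust) if score >= threshold]
    if not active:
        return np.zeros_like(deltas[0])  # nobody is trusted: leave the model unchanged
    total = sum(trust[i] for i in active)
    return sum(deltas[i] * (trust[i] / total) for i in active)

def update_trust(trust, deltas, max_norm=0.05, decay=0.8, growth=1.05, cap=1.0):
    """Shrink trust after a large update, otherwise grow it up to an assumed cap."""
    return [
        score * decay if np.linalg.norm(delta) > max_norm else min(cap, score * growth)
        for score, delta in zip(trust, deltas)
    ]

# Two small updates from trusted devices and one large update from a distrusted device
deltas = [np.array([0.01, 0.0]), np.array([0.0, 0.02]), np.array([5.0, -5.0])]
trust = [1.0, 1.0, 0.4]

combined = trust_weighted_delta(deltas, trust)  # third device is skipped (0.4 < 0.5)
new_trust = update_trust(trust, deltas)         # third device decays to 0.4 * 0.8
print(combined, new_trust)
```

Note that the fixed `max_norm` compares against the absolute size of an update, so a suitable value depends on the learning rate, the number of local epochs, and how far the model is from convergence.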

In [None]:
# Initialize trust scores for all devices
trust_scores = [1.0 for _ in range(num_devices)]
trust_threshold = 0.5  # Threshold below which a device is considered untrustworthy

# Perform federated learning with the devices
averaged_federated_learning(global_model, device_data)

# Evaluate model's metrics after federated learning
post_devices_mse = global_model.mse(all_data)
print(f'MSE after averaging: {post_devices_mse}')

6. **Introducing a Noisy Device:**
   - To demonstrate the robustness of the averaged federated learning approach, we introduce a fifth device with noisy data. This device can be thought of as a malicious or compromised device that aims to degrade the performance of the global model.
   - The data for this device is generated by adding large noise to the features and targets. This simulates a scenario where the device's data is significantly different from the other devices.
   - The noisy device is added to the list of devices, and its trust score is initialized to 1.0.

7. **Training with the Noisy Device:**
   - The global model is re-trained using the averaged federated learning approach, which now includes the noisy device.
   - After training, we evaluate the model's performance using the MSE metric. This allows us to assess the impact of the noisy device on the global model's performance.

In [None]:
# Introduce a fifth device with noisy data
X_noisy, _, y_noisy, _ = train_test_split(X, y, test_size=0.75)
X_noisy += np.random.normal(0, 5, X_noisy.shape)  # Adding large noise to features
y_noisy += np.random.normal(0, 100, y_noisy.shape)  # Adding large noise to targets
device_data.append(list(zip(X_noisy, y_noisy)))
# Add a trust score for the new device
trust_scores.append(1.0)

# Re-evaluate the model with the poisoned data but with averaged federated learning
averaged_federated_learning(global_model, device_data)

# Evaluate model's metrics after federated learning with 5 devices
all_data += list(zip(X_noisy, y_noisy))
post_5_devices_mse = global_model.mse(all_data)
print(f'MSE after poisoning but with averaged federated learning: {post_5_devices_mse}')

8. **Analysis and Conclusion:**
   - All figures below come from a single recorded run. The notebook sets no random seed, so re-running it will produce different values.
   - In the first experiment (plain federated learning), the MSE was `29482.5` before training and `26469.9` after training on the four clean devices, indicating an improvement in the model's performance.
   - After introducing the noisy device and re-training, the MSE was `29356.4`. This value is higher than the `26469.9` reached on clean data (though still slightly below the untrained `29482.5`), suggesting that the model's performance was negatively impacted by the introduction of adversarial data. Part of the increase also reflects that the evaluation set now includes the noisy samples.
   - When re-instantiated for the second experiment, the initial model had an MSE of `28427.9`, and after averaged federated learning on the four clean devices the model's MSE was `26383.7`. These are very similar results to the first time through.
   - However, when the model was poisoned but trained using the trust system present in the averaged federated learning approach, the MSE was `23259.9`. This is the ***lowest MSE of all***.
   - The averaged federated learning approach uses trust scores to weigh the contributions of each device. By lowering the trust score of devices whose updates are large, it is designed to reduce the influence of malicious devices over time.
   - This result is consistent with the trust system limiting the poisoned device's influence, but it should be read with care: the model also received a further 100 rounds of training before this measurement, so the lower MSE cannot be attributed to the defense alone on the basis of one run.
   - In conclusion, the averaged federated learning approach is a promising defense mechanism against data poisoning attacks in federated learning scenarios. Repeated runs with fixed seeds and evaluation on clean held-out data would be needed to confirm how much protection it provides.
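To make comparisons like these easier to interpret, it can help to report the MSE on clean data separately from the MSE on data that includes poisoned samples, and to compare both against a trivial baseline that always predicts the mean target (whose MSE equals the target variance). The sketch below illustrates the idea on synthetic data; the helper `mse`, the data, and the stand-in model parameters are all assumptions for illustration and are unrelated to the figures reported above.

```python
import numpy as np

def mse(weight, bias, X, y):
    """Mean squared error of a linear model on the given samples."""
    return float(np.mean((y - (X @ weight + bias)) ** 2))

rng = np.random.default_rng(0)
true_w, true_b = np.array([1.0, -2.0, 0.5]), 3.0

# Clean evaluation data drawn from a known linear relationship
X_clean = rng.normal(size=(200, 3))
y_clean = X_clean @ true_w + true_b + rng.normal(scale=0.1, size=200)

# Poisoned samples: copies of some clean samples with large noise added
X_noisy = X_clean[:50] + rng.normal(scale=5.0, size=(50, 3))
y_noisy = y_clean[:50] + rng.normal(scale=100.0, size=50)

# Evaluate the same model on both sets (the true parameters stand in for a trained model)
mse_clean = mse(true_w, true_b, X_clean, y_clean)
mse_mixed = mse(true_w, true_b,
                np.vstack([X_clean, X_noisy]),
                np.concatenate([y_clean, y_noisy]))

# Baseline: always predicting the mean target gives an MSE equal to the target variance
baseline = float(np.var(y_clean))
print(mse_clean, mse_mixed, baseline)
```

Even a model that fits the clean data well scores poorly on the mixed set, which is why the choice of evaluation set matters when judging the effect of an attack or a defense.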