# Robustness of a Regression Model

In this notebook, we evaluate the robustness of a regression model trained on the "US Crimes" dataset. We use gradient-descent-based poisoners (first the Linear Regression variant, then the Ridge variant) to generate poisoned samples, add them to the training data, and measure how much the model's performance on the test data degrades. The Mean Squared Error (MSE) is the evaluation metric.

In [1]:
import warnings

import numpy as np
import pandas as pd
from sklearn import linear_model
from holisticai.datasets import load_dataset
from sklearn.metrics import mean_squared_error
from holisticai.robustness.attackers import LinRegGDPoisoner

warnings.filterwarnings("ignore")


## Loading the dataset

We will use the "US Crimes" dataset, which contains statistics for communities across US states. The target variable is the number of crimes per 100,000 people, and the features are demographic and socio-economic statistics such as population, household size, and age distribution.

We will use the preprocessed version of the dataset, which is available through the `load_dataset` function of the `holisticai.datasets` module.

In [2]:
dataset = load_dataset('us_crime', preprocessed=True)
train_test = dataset.train_test_split(test_size=0.25, random_state=42)

train = train_test['train']
test = train_test['test']
train, test

(<holisticai.datasets._dataset.Dataset at 0x7d728860a560>,
 <holisticai.datasets._dataset.Dataset at 0x7d7235939750>)

## Preprocessing the data

Since the 'fold' column is used for stratified cross-validation, we will remove it from the input features of the dataset.

In [3]:
X_train = train['X'].drop(columns=['fold'])
X_test = test['X'].drop(columns=['fold'])

y_train = train['y']
y_test = test['y']

X_train.head()

| | state | population | householdsize | racepctblack | racePctAsian | racePctHisp | agePct12t21 | agePct12t29 | agePct16t24 | agePct65up | ... | NumStreet | PctForeignBorn | PctBornSameState | PctSameHouse85 | PctSameCity85 | PctSameState85 | LandArea | PopDens | PctUsePubTrans | LemasPctOfficDrugUn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 0.0 | 0.42 | 0.04 | 0.13 | 0.03 | 0.25 | 0.42 | 0.23 | 0.3 | ... | 0.0 | 0.2 | 0.69 | 0.6 | 0.72 | 0.7 | 0.03 | 0.08 | 0.18 | 0.0 |
| 1 | 42 | 0.0 | 0.5 | 0.04 | 0.03 | 0.01 | 0.37 | 0.39 | 0.26 | 0.36 | ... | 0.0 | 0.07 | 0.91 | 0.85 | 0.88 | 0.86 | 0.04 | 0.06 | 0.02 | 0.0 |
| 2 | 34 | 0.0 | 0.67 | 0.03 | 0.4 | 0.03 | 0.35 | 0.33 | 0.23 | 0.4 | ... | 0.0 | 0.31 | 0.53 | 0.77 | 0.64 | 0.69 | 0.02 | 0.16 | 0.54 | 0.0 |
| 3 | 25 | 0.01 | 0.41 | 0.01 | 0.03 | 0.02 | 0.29 | 0.33 | 0.22 | 0.6 | ... | 0.0 | 0.24 | 0.77 | 0.78 | 0.8 | 0.78 | 0.01 | 0.37 | 0.54 | 0.0 |
| 4 | 6 | 0.05 | 0.51 | 0.08 | 0.47 | 0.47 | 0.41 | 0.53 | 0.34 | 0.33 | ... | 0.0 | 0.47 | 0.52 | 0.49 | 0.85 | 0.78 | 0.02 | 0.52 | 0.16 | 0.0 |
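The column removal above can be made safe to re-run. The following is a minimal sketch on a made-up toy frame (the values and the three-column layout are assumptions for illustration, not the real dataset); it relies only on the `errors="ignore"` option of `DataFrame.drop`.

```python
import pandas as pd

# Toy frame standing in for train['X']; values are invented for illustration.
X = pd.DataFrame({
    "state": [25, 42, 34],
    "population": [0.00, 0.00, 0.01],
    "fold": [1, 2, 3],
})

# errors="ignore" turns the drop into a no-op when the column is already
# absent, so re-executing the cell does not raise a KeyError.
features = X.drop(columns=["fold"], errors="ignore")
features_again = features.drop(columns=["fold"], errors="ignore")

print(list(features.columns))
```

Dropping the bookkeeping column before training keeps the cross-validation fold identifier from being treated as a predictor.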


## Training the regression model

We will train a linear regression model on the training data to use as a baseline model.

In [4]:
clf = linear_model.LinearRegression()
clf.fit(X_train, y_train)

baseline_error = mean_squared_error(y_test, clf.predict(X_test))

print(f"Baseline error: {baseline_error}")

Baseline error: 0.018695360634785115
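For reference, the MSE reported above is the average of the squared differences between targets and predictions, $\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$. The sketch below checks this by hand against `mean_squared_error` on a few invented values (the arrays are illustrative, not taken from the dataset).

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Invented targets and predictions, for illustration only.
y_true = np.array([0.20, 0.67, 0.43, 0.12])
y_pred = np.array([0.25, 0.60, 0.40, 0.20])

# Average squared residual, computed directly with NumPy.
manual_mse = float(np.mean((y_true - y_pred) ** 2))

# The same quantity from scikit-learn.
sklearn_mse = float(mean_squared_error(y_true, y_pred))

print(manual_mse, sklearn_mse)
```

Because the errors are squared, a few badly mispredicted samples can dominate the metric, which is one reason poisoned training points can move it.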


## Gradient Descent Based Poisoning

### Linear-based Poisoning

We will now generate poison data using the Linear Regression Gradient Descent Based poisoner. Starting from the training data, this poisoner runs gradient descent with a linear regressor at its core to craft poisoned samples. We then use those samples to evaluate the model's robustness, that is, how much its performance degrades when it is trained on them. A recommended practice is to use only a small fraction of the training data (no more than 0.2) to generate the poison data, which is appended to the training data at the end.

To do that, we first create a categorical mask that tells the poisoner which features to treat as categorical (in our case, only the `state` feature, which is the first column). We then create the poisoner object and call its `generate` method to produce the poison data.

In [5]:
categorical_mask = np.zeros(X_train.shape[1])
categorical_mask[0] = 1
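Setting the mask by position works here because `state` is the first column. As a hedged alternative, the mask can be derived from column names so it stays correct if the column order changes. The toy frame and the `categorical_columns` list below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for X_train; values are invented for illustration.
X = pd.DataFrame({
    "state": [25, 42, 34],
    "population": [0.00, 0.00, 0.01],
    "householdsize": [0.42, 0.50, 0.67],
})

# Hypothetical list of the columns we want treated as categorical.
categorical_columns = ["state"]

# 1.0 where the column is categorical, 0.0 elsewhere; same shape and dtype
# as the np.zeros-based mask built by position.
mask = X.columns.isin(categorical_columns).astype(float)

print(mask)
```

The result is a float array with one entry per feature, matching what the positional version produces for this column layout.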

In [6]:
poiser = LinRegGDPoisoner(poison_proportion=0.2, num_inits=1)

# Poison the training data
x_poised, y_poised = poiser.generate(X_train, y_train,
                                     categorical_mask=categorical_mask,
                                     return_only_poisoned=True)

2024-10-03 10:15:44,934 - INFO - Training Error: 0.044647
2024-10-03 10:15:44,937 - INFO - Best initialization error: 0.044647
2024-10-03 10:15:44,938 - INFO - Poison Count: 374.000000
2024-10-03 10:15:45,045 - INFO - Iteration 0:
2024-10-03 10:15:45,047 - INFO - Objective Value: 0.04464659130536795 Change: 0.04464659130536795
2024-10-03 10:16:32,419 - INFO - Iteration 1:
2024-10-03 10:16:32,422 - INFO - Objective Value: 0.04472660434882227 Change: 8.001304345432031e-05
2024-10-03 10:16:32,424 - INFO - Y pushed out of bounds: 361/374
2024-10-03 10:17:23,469 - INFO - Iteration 2:
2024-10-03 10:17:23,471 - INFO - Objective Value: 0.0452637976663006 Change: 0.0005371933174783267
2024-10-03 10:17:23,472 - INFO - Y pushed out of bounds: 361/374
2024-10-03 10:18:14,753 - INFO - Iteration 3:
2024-10-03 10:18:14,758 - INFO - Objective Value: 0.048907387030308606 Change: 0.003643589364008007
2024-10-03 10:18:14,761 - INFO - Y pushed out of bounds: 360/374
2024-10-03 10:18:59,858 - INFO - Iterat

**Evaluating Model Robustness**

We evaluate the model's robustness by retraining it on the poisoned data and measuring its performance on the test data. To do so, we concatenate the poisoned samples with the training data, fit a new model, and compare its test error with the baseline error.

In [7]:
clfp = linear_model.LinearRegression()

poisedx = np.concatenate((X_train, x_poised), axis=0)
poisedy = np.concatenate([y_train, y_poised])

clfp.fit(poisedx, poisedy)

poised_err = mean_squared_error(y_test, clfp.predict(X_test))

print("Error before poisoning:", baseline_error)
print("Error after poisoning (Linear):", poised_err)

Error before poisoning: 0.018695360634785115
Error after poisoning (Linear): 0.019366176179256695


As the results show, the model trained on the poisoned data performs slightly worse on the test data: the MSE rises from about 0.0187 to about 0.0194, an increase of roughly 3.6%. Even this modest degradation shows why it is important to evaluate a model's robustness to poisoned data.
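The concatenate, retrain, and compare pattern used above can be reproduced on a synthetic problem. The sketch below is not the `LinRegGDPoisoner` algorithm: the poison points are hand-crafted (high feature values paired with a target of zero), and the data-generating line, noise level, and sample sizes are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

def make_clean(n):
    """Draw n samples from an assumed clean relationship y = 0.2 + 0.5 x + noise."""
    x = rng.uniform(0.0, 1.0, size=(n, 1))
    y = 0.2 + 0.5 * x[:, 0] + rng.normal(0.0, 0.02, size=n)
    return x, y

X_tr, y_tr = make_clean(200)
X_te, y_te = make_clean(100)

# Hand-crafted poison (NOT gradient-based): 50 points with large x and y = 0,
# which contradicts the clean upward trend.
n_poison = 50
X_po = rng.uniform(0.9, 1.0, size=(n_poison, 1))
y_po = np.zeros(n_poison)

# Baseline model trained on clean data only.
clean_model = LinearRegression().fit(X_tr, y_tr)

# Poisoned model trained on clean data plus the appended poison points.
X_mix = np.concatenate([X_tr, X_po], axis=0)
y_mix = np.concatenate([y_tr, y_po])
poisoned_model = LinearRegression().fit(X_mix, y_mix)

# Both models are scored on the same clean test set.
clean_mse = mean_squared_error(y_te, clean_model.predict(X_te))
poisoned_mse = mean_squared_error(y_te, poisoned_model.predict(X_te))

print("Error before poisoning:", clean_mse)
print("Error after poisoning:", poisoned_mse)
```

Scoring both models on the same untouched test set is what makes the two errors comparable; only the training data differs.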

### Ridge-based Poisoning

We will now generate poison data using the same gradient-descent-based approach, this time with a Ridge regressor at its core (`RidgeGDPoisoner`). As before, we evaluate the model's robustness by training it on the poisoned data and measuring its performance on the test data.
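As background on what distinguishes a Ridge regressor from ordinary least squares, the sketch below fits both on synthetic data and compares coefficient norms. It uses only scikit-learn; the data, the true weights, and `alpha=10.0` are arbitrary assumptions, and nothing here reflects the regularization strength `RidgeGDPoisoner` uses internally.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# Synthetic regression problem with assumed true weights.
X = rng.normal(size=(50, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + rng.normal(scale=0.1, size=50)

# Ordinary least squares: minimizes squared error only.
ols = LinearRegression().fit(X, y)

# Ridge: adds an L2 penalty alpha * ||w||^2 to the squared error.
ridge = Ridge(alpha=10.0).fit(X, y)

ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)

print("OLS coefficient norm:  ", ols_norm)
print("Ridge coefficient norm:", ridge_norm)
```

The L2 penalty shrinks the coefficients toward zero, so the Ridge norm is smaller than the least-squares norm.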

In [8]:
from holisticai.robustness.attackers import RidgeGDPoisoner

poiser = RidgeGDPoisoner(poison_proportion=0.2, num_inits=1)

# Poison the training data
x_poised, y_poised = poiser.generate(X_train, y_train,
                                     categorical_mask=categorical_mask,
                                     return_only_poisoned=True)

2024-10-03 10:27:16,775 - INFO - Training Error: 0.042605
2024-10-03 10:27:16,778 - INFO - Best initialization error: 0.042605
2024-10-03 10:27:16,788 - INFO - Poison Count: 374.000000
2024-10-03 10:27:16,896 - INFO - Iteration 0:
2024-10-03 10:27:16,902 - INFO - Objective Value: 0.15132651830050933 Change: 0.15132651830050933
2024-10-03 10:28:18,228 - INFO - Iteration 1:
2024-10-03 10:28:18,231 - INFO - Objective Value: 0.47055352441836296 Change: 0.3192270061178536
2024-10-03 10:28:18,235 - INFO - Y pushed out of bounds: 330/374
2024-10-03 10:29:29,596 - INFO - Iteration 2:
2024-10-03 10:29:29,598 - INFO - Objective Value: 0.7847624381404186 Change: 0.3142089137220556
2024-10-03 10:29:29,600 - INFO - Y pushed out of bounds: 329/374
2024-10-03 10:30:31,277 - INFO - Iteration 3:
2024-10-03 10:30:31,278 - INFO - Objective Value: 0.8602627436585353 Change: 0.0755003055181167
2024-10-03 10:30:31,279 - INFO - Y pushed out of bounds: 333/374
2024-10-03 10:31:34,036 - INFO - Iteration 4:
202

Finally, we compare the baseline model with a model trained on the data poisoned by the Ridge poisoner.

In [9]:
clfp = linear_model.LinearRegression()

poisedx = np.concatenate((X_train, x_poised), axis=0)
poisedy = np.concatenate([y_train, y_poised])

clfp.fit(poisedx, poisedy)

poised_ridge_err = mean_squared_error(y_test, clfp.predict(X_test))

print("Error before poisoning:", baseline_error)
print("Error after poisoning (Ridge):", poised_ridge_err)

Error before poisoning: 0.018695360634785115
Error after poisoning (Ridge): 0.01986438344146899


As with the Linear-based poisoning, the model trained on the poisoned data performs slightly worse on the test data: the MSE rises from about 0.0187 to about 0.0199, an increase of roughly 6.3%, compared with roughly 3.6% for the Linear-based poisoner.
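To put the two results side by side, the relative increase in MSE can be computed from the errors printed above. The helper name `relative_increase` is hypothetical; the numbers are copied from the notebook outputs.

```python
# Errors copied from the notebook outputs above.
baseline_error = 0.018695360634785115
poisoned_errors = {
    "Linear": 0.019366176179256695,
    "Ridge": 0.01986438344146899,
}

def relative_increase(baseline, poisoned):
    """Fractional increase of the poisoned error over the baseline error."""
    return (poisoned - baseline) / baseline

for name, err in poisoned_errors.items():
    print(f"{name}: {relative_increase(baseline_error, err):.1%}")
# Linear: 3.6%
# Ridge: 6.3%
```

A relative measure like this makes it easier to compare attacks across models or datasets whose baseline errors differ.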