# Robustness of a Regression Model

In this notebook, we evaluate the robustness of a regression model trained on the "US Crimes" dataset. We use the gradient-descent-based linear regression poisoner (`LinRegGDPoisoner`) to poison the training data, then measure how much the poisoned data degrades the model. The evaluation metric is the Mean Squared Error (MSE).

In [1]:
import warnings

import numpy as np
import pandas as pd
from sklearn import linear_model
from holisticai.datasets import load_dataset
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from holisticai.robustness.attackers import LinRegGDPoisoner
from holisticai.robustness.attackers.regression.utils import one_hot_encode_columns

warnings.filterwarnings("ignore")


## Loading the dataset

We will use the "US Crimes" dataset. Each row describes a community, tagged with its state code, and the features are demographic and socio-economic attributes such as population, household size, and age distribution. The target variable is the number of crimes per 100,000 people.

We will use the preprocessed version of the dataset, which is available through the `load_dataset` function of the `holisticai` library.

In [2]:
dataset = load_dataset('us_crime', preprocessed=True)
dataset['X'].head()

| | state | fold | population | householdsize | racepctblack | racePctAsian | racePctHisp | agePct12t21 | agePct12t29 | agePct16t24 | ... | NumStreet | PctForeignBorn | PctBornSameState | PctSameHouse85 | PctSameCity85 | PctSameState85 | LandArea | PopDens | PctUsePubTrans | LemasPctOfficDrugUn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | 1 | 0.19 | 0.33 | 0.02 | 0.12 | 0.17 | 0.34 | 0.47 | 0.29 | ... | 0.0 | 0.12 | 0.42 | 0.5 | 0.51 | 0.64 | 0.12 | 0.26 | 0.2 | 0.32 |
| 1 | 53 | 1 | 0.0 | 0.16 | 0.12 | 0.45 | 0.07 | 0.26 | 0.59 | 0.35 | ... | 0.0 | 0.21 | 0.5 | 0.34 | 0.6 | 0.52 | 0.02 | 0.12 | 0.45 | 0.0 |
| 2 | 24 | 1 | 0.0 | 0.42 | 0.49 | 0.17 | 0.04 | 0.39 | 0.47 | 0.28 | ... | 0.0 | 0.14 | 0.49 | 0.54 | 0.67 | 0.56 | 0.01 | 0.21 | 0.02 | 0.0 |
| 3 | 34 | 1 | 0.04 | 0.77 | 1.0 | 0.12 | 0.1 | 0.51 | 0.5 | 0.34 | ... | 0.0 | 0.19 | 0.3 | 0.73 | 0.64 | 0.65 | 0.02 | 0.39 | 0.28 | 0.0 |
| 4 | 42 | 1 | 0.01 | 0.55 | 0.02 | 0.09 | 0.05 | 0.38 | 0.38 | 0.23 | ... | 0.0 | 0.11 | 0.72 | 0.64 | 0.61 | 0.53 | 0.04 | 0.09 | 0.02 | 0.0 |


## Preprocessing the data

The 'fold' column only assigns rows to cross-validation folds, so we remove it from the features. We also convert the categorical 'state' column to a one-hot encoding.

In [3]:
X = dataset['X']
y = dataset['y']

X = X.drop(columns=['fold'])
columns_to_encode = ['state']
column_mapping, X = one_hot_encode_columns(X, columns_to_encode)

In [4]:
X.shape, y.shape

((1993, 145), (1993,))

In [5]:
X.head()

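| | population | householdsize | racepctblack | racePctAsian | racePctHisp | agePct12t21 | agePct12t29 | agePct16t24 | agePct65up | numbUrban | ... | state_46 | state_47 | state_48 | state_49 | state_50 | state_51 | state_53 | state_54 | state_55 | state_56 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.19 | 0.33 | 0.02 | 0.12 | 0.17 | 0.34 | 0.47 | 0.29 | 0.32 | 0.2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0.0 | 0.16 | 0.12 | 0.45 | 0.07 | 0.26 | 0.59 | 0.35 | 0.27 | 0.02 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0.0 | 0.42 | 0.49 | 0.17 | 0.04 | 0.39 | 0.47 | 0.28 | 0.32 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0.04 | 0.77 | 1.0 | 0.12 | 0.1 | 0.51 | 0.5 | 0.34 | 0.21 | 0.06 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0.01 | 0.55 | 0.02 | 0.09 | 0.05 | 0.38 | 0.38 | 0.23 | 0.36 | 0.02 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |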
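
The effect of the encoding step can be sketched with plain pandas on a toy frame. Here `pd.get_dummies` is used as a stand-in, and the toy values are made up for illustration; `one_hot_encode_columns` may differ in details such as column order, dtypes, and the `column_mapping` it returns alongside the data.

```python
import pandas as pd

# Toy frame with one numeric feature and a categorical state code (made-up values).
toy = pd.DataFrame({
    "population": [0.19, 0.00, 0.00, 0.04],
    "state": [8, 53, 24, 8],
})

# One indicator column per distinct state code; the original 'state' column is dropped.
encoded = pd.get_dummies(toy, columns=["state"], prefix="state", dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Each row ends up with exactly one `state_*` column set to 1, which is the pattern visible in the `X.head()` output above.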


Next, we will split the dataset into training and testing sets.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

X_train.shape, X_test.shape

((1494, 145), (499, 145))

## Training the regression model

We will train a linear regression model on the training data to use as a baseline model.

In [7]:
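# Fit the baseline model on the clean training data
clf = linear_model.LinearRegression()
clf.fit(np.asarray(X_train), y_train)

# Score on the held-out test set; pass an array, matching the input type used in fit
baseline_error = mean_squared_error(y_test, clf.predict(np.asarray(X_test)))

print(f"Baseline error: {baseline_error}")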

Baseline error: 0.018911873372045453
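
For reference, the MSE is the mean of the squared differences between targets and predictions. The following is a minimal pure-Python sketch of that definition on made-up values; it is meant to illustrate the metric, not to replace `mean_squared_error`.

```python
def mse(y_true, y_pred):
    """Mean of squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up targets and predictions: squared errors are 0.01, 0.0 and 0.01
example = mse([0.2, 0.5, 0.9], [0.1, 0.5, 1.0])
print(f"MSE: {example:.4f}")
```

Because the errors are squared, a few large misses weigh more than many small ones, which is why poison points that pull the fit far from the true trend show up clearly in this metric.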


## Gradient Descent Based Poisoning

We will now generate poison data with the gradient-descent-based linear regression poisoner, `LinRegGDPoisoner`, and use it to measure how poisoning affects the model's performance. The poisoner uses the training data and gradient descent to generate poison points, which are appended to the training data at the end. A recommended practice is to keep the poison proportion small (no more than 0.2).

Setting the `return_data` parameter to `True` makes the poisoner return the training data joined with the poison data.
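
To build intuition for why appended points can hurt a linear model, here is a toy sketch on synthetic one-feature data. It is not the `LinRegGDPoisoner` algorithm: the poison points are placed by hand rather than optimized by gradient descent, and all names and values are assumptions made for illustration.

```python
import random

def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x

def mse(xs, ys, slope, intercept):
    """Mean squared error of the fitted line on (xs, ys)."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)

def sample(n):
    """Synthetic data: y = 0.8 * x + 0.1 plus small Gaussian noise."""
    xs = [rng.random() for _ in range(n)]
    ys = [0.8 * x + 0.1 + rng.gauss(0, 0.05) for x in xs]
    return xs, ys

x_train, y_train = sample(200)
x_test, y_test = sample(100)

# Hand-placed poison points at the corners opposite to the true trend.
# 50 poison points make up 20% of the combined 250-point training set.
x_poison = [0.0] * 25 + [1.0] * 25
y_poison = [1.0] * 25 + [0.0] * 25

clean_model = fit_line(x_train, y_train)
poisoned_model = fit_line(x_train + x_poison, y_train + y_poison)

clean_mse = mse(x_test, y_test, *clean_model)
poisoned_mse = mse(x_test, y_test, *poisoned_model)
print(f"clean test MSE:    {clean_mse:.4f}")
print(f"poisoned test MSE: {poisoned_mse:.4f}")
```

The poison points flatten the fitted slope, so the error on clean test data rises. A gradient-based poisoner searches for such damaging points automatically instead of relying on a hand-picked placement.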

In [8]:
poisoner = LinRegGDPoisoner(column_mapping, poison_proportion=0.2)

# Poison the training data; return_data=True returns the training data joined with the poison data
x_poised, y_poised = poisoner.poison_data(X_train, y_train, return_data=True)

2024-09-13 10:26:25,377 - INFO - Training Error: 0.042428
2024-09-13 10:26:25,379 - INFO - Best initialization error: 0.042428
2024-09-13 10:26:25,382 - INFO - Poison Count: 374.000000
2024-09-13 10:26:25,481 - INFO - Iteration 0:
2024-09-13 10:26:25,482 - INFO - Objective Value: 0.04242842855916918 Change: 0.04242842855916918
2024-09-13 10:27:06,017 - INFO - Iteration 1:
2024-09-13 10:27:06,018 - INFO - Objective Value: 0.04282704141334596 Change: 0.0003986128541767775
2024-09-13 10:27:06,025 - INFO - Y pushed out of bounds: 353/374
2024-09-13 10:27:46,748 - INFO - Iteration 2:
2024-09-13 10:27:46,750 - INFO - Objective Value: 0.04397853422802394 Change: 0.0011514928146779818
2024-09-13 10:27:46,751 - INFO - Y pushed out of bounds: 355/374
2024-09-13 10:28:27,611 - INFO - Iteration 3:
2024-09-13 10:28:27,613 - INFO - Objective Value: 0.04330797572080405 Change: -0.0006705585072198927
2024-09-13 10:28:27,614 - INFO - Y pushed out of bounds: 343/374
2024-09-13 10:28:27,616 - INFO - no p
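
The log above reports 374 poison points for 1494 training rows, which is more than 0.2 × 1494 ≈ 299. One reading consistent with that number is that `poison_proportion` is the share of poison points in the combined dataset rather than a fraction of the training set. This is an inference from the log, not documented behaviour, and the rounding used below is an assumption.

```python
import math
from fractions import Fraction

n_train = 1494
poison_proportion = Fraction(1, 5)  # 0.2 as an exact fraction to avoid float rounding

# Naive reading: poison count is a fraction of the training set
naive_count = math.ceil(poison_proportion * n_train)

# Alternative reading (assumption): poison points are a share p of the combined set,
# so n_poison / (n_train + n_poison) = p  =>  n_poison = p / (1 - p) * n_train
combined_count = math.ceil(poison_proportion / (1 - poison_proportion) * n_train)
share = combined_count / (n_train + combined_count)

print(naive_count)      # 299
print(combined_count)   # 374
print(round(share, 4))  # 0.2002
```

Only the second reading reproduces the logged count of 374.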

## Evaluating Model Robustness

We will evaluate the robustness of the model by training a new model on the poisoned data and comparing its MSE on the clean test data with the baseline error.

In [9]:
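# Fit a fresh model on the training data joined with the poison data
clfp = linear_model.LinearRegression()
clfp.fit(np.asarray(x_poised), y_poised)

# Score on the same clean test set used for the baseline
poised_err = mean_squared_error(y_test, clfp.predict(np.asarray(X_test)))

print("Error before poisoning:", baseline_error)
print("Error after poisoning:", poised_err)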

Error before poisoning: 0.018911873372045453
Error after poisoning: 0.04884968204931966


As the results show, training on the poisoned data raises the test MSE from about 0.019 to about 0.049, roughly 2.6 times the baseline error. This demonstrates the importance of evaluating a model's robustness to poisoned data.
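
The size of the degradation can be computed directly from the two errors printed above. This small sketch hard-codes those values so that it runs on its own; in the notebook, `baseline_error` and `poised_err` could be used instead.

```python
# Values copied from the notebook output above
baseline_error = 0.018911873372045453
poisoned_error = 0.04884968204931966

ratio = poisoned_error / baseline_error  # how many times larger the error is
increase_pct = (ratio - 1) * 100         # relative increase in percent

print(f"Poisoned MSE is {ratio:.2f}x the baseline (+{increase_pct:.0f}%)")
```

Reporting the ratio alongside the raw errors makes results comparable across datasets whose targets are on different scales.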