# Homework 4: Fairness and bias interventions

## Regression: Download the "wine quality" dataset:

https://archive.ics.uci.edu/dataset/186/wine+quality

## Unzip the file "wine+quality.zip" to obtain:

- winequality.names
- winequality-red.csv
- winequality-white.csv

Predefine the answers:

In [11]:
answers = {}

### Implement a linear regressor using all continuous attributes (i.e., everything except color) to predict the wine quality. Use an 80/20 train/test split and sklearn’s `linear_model.LinearRegression`.

In [28]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np

scaler = StandardScaler()

# Load datasets
winequality_red = pd.read_csv("winequality-red.csv", sep=';')
winequality_white = pd.read_csv("winequality-white.csv", sep=';')

# Concatenate the datasets
wine_data = pd.concat([winequality_red, winequality_white], axis=0).reset_index(drop=True)

# Set a random seed and split the train/test subsets
random_seed = 42
train_data, test_data = train_test_split(wine_data, test_size=0.2, random_state=random_seed)

# Display the shapes of the train and test subsets
print(f"Train data shape: {train_data.shape}")
print(f"Test data shape: {test_data.shape}")

# Separate the features from the target
X_train = train_data.drop(columns=['quality'])
y_train = train_data['quality']
X_test = test_data.drop(columns=['quality'])
y_test = test_data['quality']

# Standardize the features (the scaler is fit on the training set only)
X_train_normalized = scaler.fit_transform(X_train)
X_test_normalized = scaler.transform(X_test)

Train data shape: (5197, 12)
Test data shape: (1300, 12)


In [3]:
y_test

3103    8
1419    5
4761    7
4690    6
4032    6
       ..
889     5
2850    5
4917    7
5198    6
5643    7
Name: quality, Length: 1300, dtype: int64

In [2]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [6]:
# Fit the linear regression model on the standardized training features
model = LinearRegression()
model.fit(X_train_normalized, y_train)

1. Report the feature with the largest coefficient value and the corresponding coefficient (not including any offset term).

In [12]:
feature = model.coef_.argmax()
corresponding_coefficient = model.coef_[feature]

In [13]:
answers['Q1'] = [feature, corresponding_coefficient]

2. On the first example in the test set, determine which feature has the largest effect and report its effect (see "Explaining predictions using weight plots & effect plots").

In [21]:
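# Per-feature effects for the first test sample: coefficient * standardized value
effects = X_test_normalized[0] * model.coef_
feature = effects.argmax()
# Despite its name, this variable holds the effect, not the raw coefficient
corresponding_coefficient = effects[feature]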

feature, corresponding_coefficient

(10, 0.46457652617367895)

In [22]:
answers['Q2'] = [feature, corresponding_coefficient]
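
The effect of a feature on a single prediction is the product of its coefficient and its (standardized) value for that sample. The sketch below illustrates this with made-up numbers rather than the wine data. `feature_effects` is a hypothetical helper, and ranking by absolute value is only one possible reading of "largest effect" when some effects are negative.

```python
import numpy as np

def feature_effects(coef, x):
    """Return the per-feature effects coef_j * x_j for one sample."""
    coef = np.asarray(coef, dtype=float)
    x = np.asarray(x, dtype=float)
    return coef * x

# Made-up coefficients and one standardized sample (illustrative only)
coef = np.array([0.5, -2.0, 1.0])
x = np.array([1.0, 0.8, -0.3])

effects = feature_effects(coef, x)
largest_signed = int(np.argmax(effects))       # largest positive effect
largest_abs = int(np.argmax(np.abs(effects)))  # largest magnitude, either sign
print(effects, largest_signed, largest_abs)
```

The two rankings can disagree, as they do here, so it is worth stating which one a reported answer uses.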

3. (2 marks) Based on the MSE, compute ablations of the model including every feature (other than the offset). Find the most important feature (i.e., such that the ablated model has the highest MSE) and report the value of MSE_ablated - MSE_full.

In [37]:
MSE = mean_squared_error(y_test, model.predict(X_test_normalized))

# ablation
mse_diffs = {}

for i, feature_name in enumerate(X_train.columns):
    X_train_ablated = np.delete(X_train_normalized, i, axis=1)
    X_test_ablated = np.delete(X_test_normalized, i, axis=1)
    ablated_model = LinearRegression()
    ablated_model.fit(X_train_ablated, y_train)
    ablated_MSE = mean_squared_error(y_test, ablated_model.predict(X_test_ablated))
    mse_diffs[feature_name] = ablated_MSE - MSE

max_diff = max(mse_diffs, key=mse_diffs.get)
max_diff_value = mse_diffs[max_diff]
max_diff, max_diff_value

('volatile acidity', 0.023537285288143472)

In [44]:
most_important_feature = max_diff
mse_diff = max_diff_value

In [45]:
answers['Q3'] = [most_important_feature, mse_diff]

4. (2 marks) Implement a full backward selection pipeline and report the sequence of MSE values for each model as a list (of increasing MSEs).

In [51]:
list(range(X_train_normalized.shape[1]))


[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

In [81]:
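y_pred = model.predict(X_test_normalized)
mse = mean_squared_error(y_test, y_pred)

# Start from the full model's MSE. The first loop iteration refits the
# full model, so this value appears twice in mse_list.
mse_list = [mse]

column_indices = list(range(X_train_normalized.shape[1]))
column_names = list(X_train.columns)

while len(column_indices) > 0:
    X_train_subset = X_train_normalized[:, column_indices]
    X_test_subset = X_test_normalized[:, column_indices]

    # Refit on the remaining features and record the test MSE
    ablated_model = LinearRegression()
    ablated_model.fit(X_train_subset, y_train)

    y_pred_ablated = ablated_model.predict(X_test_subset)
    mse_ablated = mean_squared_error(y_test, y_pred_ablated)

    mse_list.append(mse_ablated)
    # Drop the feature with the smallest (most negative) coefficient.
    # This is a shortcut: it does not compare the MSE of each candidate removal.
    column_indices.pop(ablated_model.coef_.argmin())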

In [82]:
mse_list

[0.5466964419580584,
 0.5466964419580584,
 0.5702337272462019,
 0.5782492638012302,
 0.5792879779808473,
 0.585372169949457,
 0.5874576675906084,
 0.5873423679660886,
 0.5907474498436729,
 0.5900730472109091,
 0.5971827548020747,
 0.6044380755188254]

In [83]:
answers['Q4'] = mse_list 
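
A fuller backward selection refits one model per candidate removal and drops the feature whose removal gives the lowest MSE. The following is a minimal sketch of that idea on synthetic data, using NumPy least squares in place of `LinearRegression` so that it is self-contained. `fit_predict_mse` and `backward_selection` are hypothetical helpers. Selecting on the test split mirrors the notebook; a separate validation split would normally be preferable.

```python
import numpy as np

def fit_predict_mse(X_tr, y_tr, X_te, y_te, cols):
    """Fit least squares with an intercept on `cols`; return the test MSE."""
    A_tr = np.column_stack([np.ones(len(X_tr)), X_tr[:, cols]])
    A_te = np.column_stack([np.ones(len(X_te)), X_te[:, cols]])
    theta, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return float(np.mean((A_te @ theta - y_te) ** 2))

def backward_selection(X_tr, y_tr, X_te, y_te):
    """Greedily drop the feature whose removal yields the lowest MSE."""
    remaining = list(range(X_tr.shape[1]))
    mses = [fit_predict_mse(X_tr, y_tr, X_te, y_te, remaining)]
    dropped = []
    while len(remaining) > 1:
        candidates = []
        for j in remaining:
            cols = [c for c in remaining if c != j]
            candidates.append((fit_predict_mse(X_tr, y_tr, X_te, y_te, cols), j))
        best_mse, best_j = min(candidates)  # least harmful removal
        remaining.remove(best_j)
        dropped.append(best_j)
        mses.append(best_mse)
    return mses, dropped

# Synthetic data: features 0 and 1 carry signal, features 2 and 3 are noise
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + 0.1 * rng.normal(size=500)
X_tr, X_te, y_tr, y_te = X[:400], X[400:], y[:400], y[400:]

mses, dropped = backward_selection(X_tr, y_tr, X_te, y_te)
print(mses, dropped)
```

The returned list has one MSE per model size, from all features down to a single feature, so its length equals the number of features.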

5. (2 marks) Change your model to use an l1 regularizer. Increasing the regularization strength will cause variables to gradually be removed (coefficient reduced to zero) from the model. Which is the first and the last variable to be eliminated via this process?

In [128]:
from sklearn.linear_model import Lasso

# Note: alpha=0 is plain least squares; sklearn warns that Lasso is not
# advised in that case, which explains the warnings on the first iteration.
alpha = 0.0
feature_indices = list(range(X_train_normalized.shape[1]))
features = list(X_train.columns)
found_first = False
last_feature = -1
first_feature = -1


while len(feature_indices) > 0:

    X_train_subset = X_train_normalized[:, feature_indices]

    lasso_model = Lasso(alpha=alpha, random_state=42)
    lasso_model.fit(X_train_subset, y_train)
    print(len(feature_indices))
    zero_positions = np.where(lasso_model.coef_ == 0)[0]
    if len(zero_positions) > 0:
        # Position (within the current subset) of the first zeroed coefficient
        zero_idx = zero_positions[0]
        if not found_first:
            found_first = True
            first_feature = features[zero_idx]
        if len(feature_indices) == 1:
            last_feature = features[zero_idx]

        print("index where coef is 0", np.where(lasso_model.coef_ == 0), zero_idx)
        # Remove the eliminated feature before increasing alpha further
        feature_indices.pop(zero_idx)
        features.pop(zero_idx)
        print(feature_indices)
    alpha += 0.001
    
first_feature, last_feature

11  (printed 8 times)
index where coef is 0 (array([7]),) 7
[0, 1, 2, 3, 4, 5, 6, 8, 9, 10]
10  (printed 2 times)
index where coef is 0 (array([0, 2]),) 0
[1, 2, 3, 4, 5, 6, 8, 9, 10]
9  (printed 1 time)
index where coef is 0 (array([1]),) 1
[1, 3, 4, 5, 6, 8, 9, 10]
8  (printed 12 times)
index where coef is 0 (array([2]),) 2
[1, 3, 5, 6, 8, 9, 10]
7  (printed 2 times)
index where coef is 0 (array([4]),) 4
[1, 3, 5, 6, 9, 10]
6  (printed 9 times)
index where coef is 0 (array([3]),) 3
[1, 3, 5, 9, 10]
5  (printed 24 times)
index where coef is 0 (array([1]),) 1
[1, 5, 9, 10]
4  (printed 2 times)
index where coef is 0 (array([1]),) 1
[1, 9, 10]
3  (printed 15 times)
index where coef is 0 (array([1]),) 1
[1, 10]
2  (printed 151 times)
index where c

('density', 'alcohol')

In [130]:
answers['Q5'] = [first_feature, last_feature]
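
An alternative reading of this question keeps every feature in the model and records the regularization strength at which each coefficient first becomes exactly zero. The sketch below illustrates that on synthetic data. It swaps in a small hand-written coordinate descent solver (using the same objective scaling as sklearn's `Lasso`, an assumption worth checking against the sklearn documentation) so that the example depends only on NumPy. `soft_threshold`, `lasso_cd`, and `elimination_order` are hypothetical helpers, and the alpha grid is arbitrary.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold`, clipping at exactly zero."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_cd(X, y, alpha, n_iter=100):
    """Minimize (1/(2n))||y - Xw - b||^2 + alpha * ||w||_1 by coordinate descent.

    Assumes the columns of X are already centered, so the intercept is y.mean().
    """
    n, d = X.shape
    w = np.zeros(d)
    y_centered = y - y.mean()
    col_scale = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(d):
            # Residual with feature j's current contribution added back
            partial_residual = y_centered - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ partial_residual / n
            w[j] = soft_threshold(rho, alpha) / col_scale[j]
    return w

def elimination_order(X, y, alphas):
    """Return feature indices ordered by the first alpha that zeroes them."""
    first_zero = {}
    for alpha in alphas:
        w = lasso_cd(X, y, alpha)
        for j in np.flatnonzero(w == 0.0):
            first_zero.setdefault(int(j), float(alpha))
    return sorted(first_zero, key=first_zero.get)

# Synthetic data: feature 0 is strong, feature 1 moderate, feature 2 weak
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(1000, 3))
y = (2.0 * X_raw[:, 0] + 0.5 * X_raw[:, 1] + 0.05 * X_raw[:, 2]
     + 0.1 * rng.normal(size=1000))
X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

order = elimination_order(X, y, np.linspace(0.01, 2.5, 60))
print(order)
```

The first entry of `order` is the first variable eliminated and the last entry is the last one. This approach can give a different ordering from a loop that removes features and refits, because here all features compete in the same model at every alpha.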

### Implement a classifier to predict the wine color (red / white), again using an 80/20 train/test split, and including only continuous variables.

In [133]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load datasets
winequality_red = pd.read_csv("winequality-red.csv", sep=';')
winequality_white = pd.read_csv("winequality-white.csv", sep=';')

# Add a column to distinguish red and white wines
winequality_red['type'] = 0  # Red wine (encoded as 0)
winequality_white['type'] = 1  # White wine (encoded as 1)

# Concatenate the datasets
wine_data = pd.concat([winequality_red, winequality_white], axis=0)

# Separate features (and drop "quality" to get continuous variables) and target
X = wine_data.drop(columns=['quality', 'type'])  # Drop the target column
y = wine_data['type']  # Target column (wine type)

# Perform train/test split
random_seed = 42
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_seed)

# Display shapes of the resulting splits
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

X_train shape: (5197, 11)
X_test shape: (1300, 11)
y_train shape: (5197,)
y_test shape: (1300,)


6. Report the odds ratio associated with the first sample in the test set.

In [169]:
from sklearn.linear_model import LogisticRegression

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train_scaled, y_train)

first_sample = log_reg.predict_proba(X_test_scaled[0].reshape(1, -1))
odds_ratio = first_sample[0][1] / first_sample[0][0]

In [170]:
answers['Q6'] = odds_ratio
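
For a logistic model, the quantity computed above, p / (1 - p), equals the exponential of the linear score (the logit). The sketch below checks that identity on a made-up logit value. The function names are hypothetical and no wine data is involved.

```python
import math

def sigmoid(z):
    """Logistic function: maps a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def odds_from_probability(p):
    """Odds of the positive class given its probability."""
    return p / (1.0 - p)

def odds_from_logit(z):
    """Odds of the positive class given the linear score theta . x + b."""
    return math.exp(z)

z = 2.0  # made-up logit, for illustration only
p = sigmoid(z)
print(odds_from_probability(p), odds_from_logit(z))
```

With the fitted model above, the logit would correspond to `log_reg.decision_function(X_test_scaled[0].reshape(1, -1))`. Exponentiating it avoids dividing by a probability that is very close to zero, which matters when the odds are as large as they are for this sample.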

7. Find the 50 nearest neighbors (in the training set) to the first datapoint in the test set, based on the l2 distance. Train a classifier using only those 50 points, and report the largest value of e^theta_j (see “odds ratio” slides).

In [182]:
from sklearn.neighbors import NearestNeighbors

first_sample = X_test_scaled[0].reshape(1, -1)

knn = NearestNeighbors(n_neighbors=50, metric='l2')
knn.fit(X_train_scaled)
distances, indices = knn.kneighbors(first_sample)
X_nearest = X_train_scaled[indices[0]]
y_nearest = y_train.iloc[indices[0]]
local_model = LogisticRegression(random_state=42)
local_model.fit(X_nearest, y_nearest)
ethetaj = np.exp(local_model.coef_[0])
ethetaj.max()

1.638976333020081

In [183]:
value = ethetaj.max()

In [184]:
answers['Q7'] = value

In [185]:
import json

with open("answers_hw4.txt", "w") as file:
    json.dump(answers, file, indent=4, default=str)


In [186]:
answers

{'Q1': [10, 0.3224373794887738],
 'Q2': [10, 0.46457652617367895],
 'Q3': ['volatile acidity', 0.023537285288143472],
 'Q4': [0.5466964419580584,
  0.5466964419580584,
  0.5702337272462019,
  0.5782492638012302,
  0.5792879779808473,
  0.585372169949457,
  0.5874576675906084,
  0.5873423679660886,
  0.5907474498436729,
  0.5900730472109091,
  0.5971827548020747,
  0.6044380755188254],
 'Q5': ['density', 'alcohol'],
 'Q6': 416242.1392524223,
 'Q7': 1.638976333020081}