
# CS 3110/5110: Data Privacy
## Final Project

## Project Plan

My question is: can I predict income levels (above a threshold or within categories) from demographic features (age, education, etc.) in a privacy-preserving way? I plan to train a differentially private logistic regression model on census data using private gradient descent, and to evaluate the trade-off between accuracy and privacy by varying epsilon and the clipping parameter.

In [1]:
# Load the data and libraries
import pandas as pd
import numpy as np
import random
from scipy import stats
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

def laplace_mech(v, sensitivity, epsilon):
    return v + np.random.laplace(loc=0, scale=sensitivity / epsilon)

def laplace_mech_vec(vec, sensitivity, epsilon):
    return [v + np.random.laplace(loc=0, scale=sensitivity / epsilon) for v in vec]

def gaussian_mech(v, sensitivity, epsilon, delta):
    return v + np.random.normal(loc=0, scale=sensitivity * np.sqrt(2*np.log(1.25/delta)) / epsilon)

def gaussian_mech_vec(vec, sensitivity, epsilon, delta):
    return [v + np.random.normal(loc=0, scale=sensitivity * np.sqrt(2*np.log(1.25/delta)) / epsilon)
            for v in vec]

def pct_error(orig, priv):
    return np.abs(orig - priv)/orig * 100.0

## Collecting and Setting up Datasets

Note:
- Some features use 'N' as a possible value. I either drop those columns or replace 'N' with 0 when 0 is not already a valid value

- The columns where 'N' is possible are: MSP, NOC, NPF, INDP_CAT, EDU, PINCP, PINCP_DECILE, POVPIP, DVET, DREM, DPHY
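
As a minimal sketch of the cleaning step described in the note, the example below replaces the 'N' placeholder and then converts every column to a numeric dtype. The toy frame and its values are made up for illustration; only the column names come from the dataset. The `pd.to_numeric` conversion is an assumption on my part and is not a step the notebook itself performs.

```python
import pandas as pd

# Toy frame standing in for the census file; the values are invented.
toy = pd.DataFrame({
    "AGEP": [18, 3, 45],
    "MSP": ["6", "N", "1"],    # 'N' marks "not applicable"
    "EDU": ["5", "N", "12"],
})

# Replace the placeholder, then coerce each column to a numeric dtype
# so that no string values reach the model.
cleaned = toy.replace("N", "0").apply(pd.to_numeric)

print(cleaned.dtypes)
print(cleaned)
```

Converting to numeric dtypes up front avoids carrying an `object` array into training, which is why the later cells have to cast each row with `astype(int)`.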

In [12]:
# Dataset I am using - NIST data with income information for Massachusetts
data = pd.read_csv('NIST Datasets/ma2019.csv')


# Of the columns listed above, INDP_CAT and NOC already use 0 as a value, so I drop them
data = data.drop(columns=['PUMA', 'NOC', 'INDP_CAT'])   # PUMA is dropped too because it is a string

# Clean the data so that all features are numeric and can be used in the model
data.replace('N', '0', inplace=True)

np.array(data)

array([[18, 1, '6', ..., 2, 72, 0],
       [21, 2, '6', ..., 2, 6, 0],
       [22, 2, '6', ..., 2, 80, 0],
       ...,
       [3, 1, '0', ..., 2, 69, 75],
       [1, 2, '0', ..., 2, 64, 75],
       [0, 1, '0', ..., 2, 107, 145]], dtype=object)

I will add a column to the data that will be the target feature
- This column will be binary 0 for bottom 50% of income
- and 1 for top 50% of income

This will allow logistic regression to perform the classification
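
The thresholding described above can be sketched on a few made-up decile values. The example also shows the alternative {-1, +1} encoding that appears in the commented-out line of the next cell; the variable names here are hypothetical.

```python
import pandas as pd

# Made-up decile values stored as strings, as in the raw file.
deciles = pd.Series(["0", "3", "4", "5", "9"], name="PINCP_DECILE")

# {0, 1} encoding: 1 when the decile is 5 or higher.
binary = (deciles.astype(int) >= 5).astype(int)

# {-1, +1} encoding of the same split.
plus_minus = 2 * binary - 1

print(binary.tolist())
print(plus_minus.tolist())
```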

In [13]:
#adds new column to data
data['PINCP_BIN'] = (data['PINCP_DECILE'].astype(int) >= 5).astype(int)
#data['PINCP_BIN'] = np.where(data['PINCP_DECILE'].astype(int) >= 5, 1, -1)

#checking to make sure it worked
data['PINCP_BIN'].value_counts()

PINCP_BIN
0    4022
1    3612
Name: count, dtype: int64

In [14]:
#now defining the features that our model will be built on
#Demographic, household, and housing features (10 in total)
training_features = ['AGEP', 'SEX', 'MSP', 'HISP', 'RAC1P', 'NPF', 'HOUSING_TYPE', 'OWN_RENT', 'DENSITY', 'EDU']
X = data[training_features]
print(X)
#the goal of this project is to predict income percentile so income is our target variable
#I made this a binary variable so people in the top half of income bracket have a score of 1
# and people in the bottom half have a score of 0
y = data['PINCP_BIN']

#here is our split into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train = np.array(X_train)
X_test = np.array(X_test)
y_train = np.array(y_train)
y_test = np.array(y_test)

print("Training set shape:", X_train.shape, y_train.shape)
print("Testing set shape:", X_test.shape, y_test.shape)


      AGEP  SEX MSP  HISP  RAC1P NPF  HOUSING_TYPE  OWN_RENT  DENSITY EDU
0       18    1   6     0      1   0             3         0   2872.7   5
1       21    2   6     0      1   0             3         0   2195.3   7
2       22    2   6     0      6   0             3         0   2872.7   9
3       58    1   6     0      1   0             2         0   1457.2   8
4       18    2   6     0      1   0             3         0   2195.3   7
...    ...  ...  ..   ...    ...  ..           ...       ...      ...  ..
7629     8    1   0     0      1   3             1         1   1457.2   3
7630    14    1   0     0      1   3             1         1   2195.3   3
7631     3    1   0     0      6   4             1         1   2872.7   1
7632     1    2   0     0      6   4             1         1   2872.7   0
7633     0    1   0     0      1   3             1         1   3683.9   0

[7634 rows x 10 columns]
Training set shape: (6107, 10) (6107,)
Testing set shape: (1527, 10) (1527,)


## Building the Model

- I am using a logistic regression model to solve a binary classification problem: predicting whether a person's income falls in the top or bottom half for this dataset
- My target variable is `PINCP_BIN`, which is derived from income decile
- My training features are the 10 columns in `training_features`, including age, sex, marital status, race, and education

In [15]:
def train_model():
    model = LogisticRegression( solver='lbfgs', max_iter=10000)
    model.fit(X_train, y_train)
    return model

model = train_model()
print('Model coefficients:', model.coef_[0])
print('Model accuracy:', np.sum(model.predict(X_test) == y_test)/X_test.shape[0])

Model coefficients: [ 8.22157105e-03 -1.06179114e+00 -6.98554478e-02  3.89381720e-02
 -6.96846150e-02 -1.19875694e-01 -2.54253051e+00 -3.83714897e-01
  6.78163639e-07  5.03893976e-01]
Model accuracy: 0.7786509495743288


In [16]:
# Get the feature names and coefficients
X_train = pd.DataFrame(X_train, columns=training_features)
feature_importance = pd.DataFrame({
    'Feature': X_train.columns,  # Feature names taken from training_features
    'Importance': np.abs(model.coef_[0])  # Absolute value of coefficients
})

# Sort features by importance
feature_importance = feature_importance.sort_values(by='Importance', ascending=False)
print(feature_importance)
X_train = np.array(X_train)

        Feature    Importance
6  HOUSING_TYPE  2.542531e+00
1           SEX  1.061791e+00
9           EDU  5.038940e-01
7      OWN_RENT  3.837149e-01
5           NPF  1.198757e-01
2           MSP  6.985545e-02
4         RAC1P  6.968462e-02
3          HISP  3.893817e-02
0          AGEP  8.221571e-03
8       DENSITY  6.781636e-07


## Gradient Descent

Train the same logistic regression model with a hand-written gradient descent loop and measure its accuracy on the test set

In [17]:
# The loss function measures how good our model is. The training goal is to minimize the loss.
# This is the logistic loss function.
def loss(theta, xi, yi):
    exponent = - yi * (xi.dot(theta))
    return np.log(1 + np.exp(exponent))

# This is the gradient of the logistic loss
# The gradient is a vector that indicates the rate of change of the loss in each direction
def gradient(theta, xi, yi):
    xi = np.array(xi).astype(int)
    yi = np.array(yi).astype(int)  # Ensure yi is a scalar integer
    
    exponent = yi * (xi.dot(theta))
    exponent = np.clip(exponent, -700, 700)
    return - (yi*xi) / (1+np.exp(exponent))
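
One way to sanity-check a hand-written gradient is to compare it against a finite-difference estimate of the loss. The sketch below does this for the logistic loss in the form used above, which assumes labels in {-1, +1}. The random inputs, the step size `h`, and the helper `numeric_gradient` are assumptions made for illustration.

```python
import numpy as np

def loss(theta, xi, yi):
    # Logistic loss for a label yi in {-1, +1}
    return np.log1p(np.exp(-yi * xi.dot(theta)))

def gradient(theta, xi, yi):
    # Analytic gradient of the loss above with respect to theta
    return -(yi * xi) / (1.0 + np.exp(yi * xi.dot(theta)))

def numeric_gradient(theta, xi, yi, h=1e-6):
    # Central finite differences, one coordinate at a time
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = h
        grad[j] = (loss(theta + step, xi, yi) - loss(theta - step, xi, yi)) / (2 * h)
    return grad

rng = np.random.default_rng(0)
theta = rng.normal(size=4)
xi = rng.normal(size=4)

for yi in (-1, 1):
    gap = np.max(np.abs(numeric_gradient(theta, xi, yi) - gradient(theta, xi, yi)))
    print(f"label {yi:+d}: largest difference = {gap:.2e}")
```

If the two estimates disagree by much more than the finite-difference error, the analytic gradient or the loss has a mistake.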

In [18]:
theta = [0 for _ in range(len(training_features))] # one weight per training feature (10 here)
i=1

y_train = np.reshape(y_train,(y_train.size, 1))
y_train.shape
gradient(theta, X_train[i], y_train[i]).shape  # X_train is a NumPy array, so the row is indexed directly


(10,)

In [19]:
def avg_grad(theta, X, y):
    #list of per-example gradient vectors; each has one entry per feature (10 here)
    all_grads = [gradient(theta, X[i], y[i]) for i in range(len(X))]
    #compute the column-wise average
    avg_grad = np.mean(all_grads, axis=0)
    return avg_grad

avg_grad(theta, X_train, y_train)

array([-1.20949730e+01, -3.40674636e-01, -5.60258720e-01, -2.12870477e-02,
       -3.45668905e-01, -5.72130342e-01, -2.36368102e-01, -2.75339774e-01,
       -5.85370313e+02, -1.99869003e+00])

In [20]:
# Prediction: take a model (theta) and a single example (xi) and return its predicted label
def predict(xi, theta, bias=0):
    xi = np.array(xi).astype(int)
    label = np.sign(xi @ theta + bias)
    return label

def accuracy(theta):
    return np.sum(predict(X_test, theta) == y_test)/X_test.shape[0]

In [21]:
def gradient_descent(iterations):
    theta = [0 for _ in range(len(training_features))] #initial model
    for _ in range(iterations):
        theta = theta - avg_grad(theta, X_train, y_train)
    return theta

theta = gradient_descent(10)
accuracy(theta)

0.48526522593320237

## Adding Noise

So, we have implemented gradient descent for the regression model that predicts income based on the data
- Now we must add differential privacy to the gradient descent algorithm

In [22]:
# First we must clip

# L2 Clipping function we will use for our gradient descent
def L2_clip(v, b):
    norm = np.linalg.norm(v, ord=2)
    
    if norm > b:
        return b * (v / norm)
    else:
        return v
    

#Helper function for our noisy gradient descent
def gradient_sum(theta, X, y, b):
    gradients = [L2_clip(gradient(theta, x_i, y_i), b) for x_i, y_i in zip(X,y)]
        
    # sum query
    # L2 sensitivity is b (by clipping performed above)
    return np.sum(gradients, axis=0)
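
The sensitivity claim in the comment above can be checked numerically. The sketch below clips random vectors (stand-ins for per-example gradients, not real ones) and confirms that removing one row changes the sum by at most `b` in L2 norm. The data and the sizes are assumptions for illustration.

```python
import numpy as np

def L2_clip(v, b):
    # Rescale v so that its L2 norm is at most b
    norm = np.linalg.norm(v, ord=2)
    return b * (v / norm) if norm > b else v

rng = np.random.default_rng(1)
b = 20
fake_grads = rng.normal(scale=50.0, size=(1000, 10))   # stand-in gradients
clipped = np.array([L2_clip(g, b) for g in fake_grads])

# Every clipped row has norm at most b (up to floating-point error).
max_norm = np.linalg.norm(clipped, axis=1).max()

# Dropping one row changes the sum by exactly that row, so by at most b.
change = np.linalg.norm(clipped.sum(axis=0) - clipped[1:].sum(axis=0))

print("largest clipped norm:", max_norm)
print("change in sum after removing one row:", change)
```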

In [23]:
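# Noisy gradient descent
# Each of the k iterations uses (epsilon/k, delta/k), so the k noisy gradients together satisfy
# (epsilon, delta)-differential privacy by sequential composition.
# The noisy count would cost an extra epsilon if it were used in place of len(X_train).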
def noisy_gradient_descent(iterations, epsilon, delta):
    theta = np.array([0 for _ in range(len(training_features))]) #resets theta to zeros
    b = 20      #dont do 50 or more
    epsilon_i = epsilon/iterations
    delta_i = delta/iterations


    noisy_count = laplace_mech(X_train.shape[0], 1, epsilon)

    for i in range(iterations):
        clipped_gradient_sum = gradient_sum(theta, X_train, y_train, b)
        noisy_gradient_sum = np.array(gaussian_mech_vec(clipped_gradient_sum, b, epsilon_i, delta_i))
        noisy_avg_gradient = np.array(noisy_gradient_sum) / len(X_train) #noisy_count
        theta = theta - noisy_avg_gradient
    return theta

#New accuracy with noise added
theta = noisy_gradient_descent(10, .1, 1e-5)
print('Final accuracy:', accuracy(theta))

Final accuracy: 0.48526522593320237


## A Little Analysis

- With a larger epsilon, I find that adding noise does not change the accuracy of the model at all
- The more I decrease epsilon, the lower the accuracy gets. This is a good sign that the noise is being added correctly and that the expected trade-off appears: more noise, less accuracy.
- My clipping parameter is relatively small because I adjusted it for the size of the data I was working with. My dataset is not small, but it is not as large as the Adult dataset and does not require as much clipping.
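
To make the epsilon trade-off concrete, the sketch below computes the per-iteration Gaussian noise scale using the same formula as `gaussian_mech`, with the budget split evenly over 10 iterations as in `noisy_gradient_descent`. The list of epsilon values is an arbitrary choice for illustration; `n = 6107` is the training set size reported earlier.

```python
import numpy as np

def gaussian_sigma(sensitivity, epsilon, delta):
    # Noise scale of the Gaussian mechanism, as in gaussian_mech
    return sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon

iterations, delta, b, n = 10, 1e-5, 20, 6107

for epsilon in (0.01, 0.1, 1.0, 10.0):
    sigma = gaussian_sigma(b, epsilon / iterations, delta / iterations)
    # Dividing by n gives the noise scale on the averaged gradient
    print(f"epsilon={epsilon:>5}: sigma on sum = {sigma:12.1f}, on average = {sigma / n:10.4f}")
```

The noise scale is inversely proportional to epsilon, so each tenfold decrease in epsilon makes the noise on every gradient coordinate ten times larger.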

## Alternative Methods

I wanted to try out some of the more advanced Differential Privacy methods as well and see how the model performs with those.

In [24]:
def gaussian_mech_RDP_vec(vec, sensitivity, alpha, epsilon):
    sigma = np.sqrt((sensitivity**2 * alpha) / (2 * epsilon))
    return [v + np.random.normal(loc=0, scale=sigma) for v in vec]

In [None]:
def noisy_gradient_descent_RDP(iterations, alpha, epsilon_bar):
    theta = np.array([0 for _ in range(len(training_features))]) #resets theta to zeros
    b = 20   # clipping bound
    # RDP composition: alpha stays fixed across iterations and the epsilons add up,
    # so only epsilon_bar is split across iterations
    epsilon_i = epsilon_bar/iterations

    for i in range(iterations):
        #gradient sum:
        # 1. computes the gradients for all of X_train
        # 2. clips them
        # 3. and sums them
        grad_sum  = gradient_sum(theta, X_train, y_train, b) 

        #now we add noise calibrated with Renyi differential privacy
        noisy_grad_sum = gaussian_mech_RDP_vec(grad_sum, sensitivity=b, alpha=alpha, epsilon=epsilon_i)

        noisy_grad = np.array(noisy_grad_sum)/len(X_train)
        theta = theta - noisy_grad
    return theta

theta = noisy_gradient_descent_RDP(10, 20, 0.000001)
print(theta)
print('Final accuracy:', accuracy(theta))
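
To compare an RDP guarantee with the (epsilon, delta) results above, it can be converted using the standard rule that (alpha, epsilon_bar)-RDP implies (epsilon_bar + log(1/delta)/(alpha - 1), delta)-DP. The sketch below is a hedged illustration: the helper names and the example budget are assumptions, and it relies on the RDP composition rule in which alpha is held fixed while the per-iteration epsilons add up.

```python
import numpy as np

def rdp_gaussian_sigma(sensitivity, alpha, epsilon_bar):
    # Gaussian noise scale that satisfies (alpha, epsilon_bar)-RDP
    return np.sqrt(sensitivity**2 * alpha / (2 * epsilon_bar))

def rdp_to_dp(alpha, epsilon_bar, delta):
    # Convert an (alpha, epsilon_bar)-RDP guarantee to (epsilon, delta)-DP
    return epsilon_bar + np.log(1 / delta) / (alpha - 1)

iterations, alpha, delta, b = 10, 20, 1e-5, 20
epsilon_bar_total = 1.0                                 # example budget (assumption)
epsilon_bar_step = epsilon_bar_total / iterations       # alpha is not divided

sigma = rdp_gaussian_sigma(b, alpha, epsilon_bar_step)
epsilon_total = rdp_to_dp(alpha, epsilon_bar_total, delta)

print("per-iteration sigma:", sigma)
print("total epsilon at delta = 1e-5:", epsilon_total)
```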

In [None]:
def gaussian_mech_zCDP_vec(vec, sensitivity, rho):
    sigma = np.sqrt((sensitivity**2) / (2 * rho))
    return [v + np.random.normal(loc=0, scale=sigma) for v in vec]

In [None]:
def noisy_gradient_descent_zCDP(iterations, rho):
    theta = np.array([0 for _ in range(len(training_features))]) #resets theta to zeros
    b = 20
    rho_i = rho/iterations

    for i in range(iterations):
        #gradient sum:
        # 1. computes the gradients for all of X_train
        # 2. clips them
        # 3. and sums them
        grad_sum  = gradient_sum(theta, X_train, y_train, b) 

        #now we add noise calibrated for zCDP
        noisy_grad_sum = gaussian_mech_zCDP_vec(grad_sum, sensitivity=b, rho=rho_i)

        noisy_grad = np.array(noisy_grad_sum)/len(X_train)
        theta = theta - noisy_grad
    return theta

theta = noisy_gradient_descent_zCDP(10, 0.001)
print('Final accuracy:', accuracy(theta))

In [None]:
def noisy_gradient_descent_zCDP_ED(iterations, epsilon, delta):
    theta = np.array([0 for _ in range(len(training_features))]) #resets theta to zeros
    b = 20
    # Convert (epsilon, delta) to rho by binary search on the zCDP-to-DP conversion
    # epsilon = rho + 2*sqrt(rho*log(1/delta)), which is increasing in rho
    rho_low, rho_high = 0, epsilon  # rho can never exceed epsilon
    for _ in range(100):
        rho_mid = (rho_high + rho_low) / 2
        calculated_epsilon = rho_mid + 2 * np.sqrt(rho_mid * np.log(1 / delta))
        if calculated_epsilon < epsilon:
            rho_low = rho_mid
        else:
            rho_high = rho_mid

    rho = rho_low  # use the lower bound so the converted epsilon stays within budget
    rho_i = rho/iterations


    for i in range(iterations):
        #gradient sum:
        # 1. computes the gradients for all of X_train
        # 2. clips them
        # 3. and sums them
        grad_sum  = gradient_sum(theta, X_train, y_train, b) 

        #now we add noise calibrated for zCDP
        noisy_grad_sum = gaussian_mech_zCDP_vec(grad_sum, sensitivity=b, rho=rho_i)

        noisy_grad = np.array(noisy_grad_sum)/len(X_train)
        theta = theta - noisy_grad
    return theta



theta = noisy_gradient_descent_zCDP_ED(10, .0001, 1e-5)
print('Final accuracy:', accuracy(theta))
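
The conversion from (epsilon, delta) to rho also has a closed form, which can serve as a cross-check on a numerical search. The sketch assumes the standard conversion that rho-zCDP implies (rho + 2*sqrt(rho*log(1/delta)), delta)-DP; solving that quadratic in sqrt(rho) gives the formula below. The function names are hypothetical.

```python
import numpy as np

def zcdp_to_dp(rho, delta):
    # Epsilon implied by rho-zCDP at a given delta
    return rho + 2 * np.sqrt(rho * np.log(1 / delta))

def dp_to_zcdp(epsilon, delta):
    # Largest rho whose converted epsilon does not exceed the target
    log_term = np.log(1 / delta)
    return (np.sqrt(log_term + epsilon) - np.sqrt(log_term)) ** 2

delta = 1e-5
for epsilon in (0.1, 1.0, 5.0):
    rho = dp_to_zcdp(epsilon, delta)
    print(f"epsilon={epsilon}: rho={rho:.6g}, round trip epsilon={zcdp_to_dp(rho, delta):.6g}")
```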

## More Analysis

These methods all give similar results; each one trains the same model under a different variant of differential privacy.

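The model I have developed is not very accurate. The gradient descent versions predict the correct income bracket a little less than 50% of the time, while the scikit-learn baseline reaches about 78%. A likely cause is a label mismatch: the loss, gradient, and `predict` functions assume labels in {-1, +1}, but `PINCP_BIN` uses {0, 1}. Examples labeled 0 therefore contribute a zero gradient, and `np.sign` returns -1 or +1 for any nonzero score, which never equals 0.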
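
To illustrate how label encoding affects a hand-written trainer of this kind, the sketch below trains the same sign-based logistic regression twice on synthetic, well-separated data: once with labels in {-1, +1} and once with labels in {0, 1}. The data, the learning rate of 0.1, and the helper names are assumptions for illustration, and accuracy is measured on the training points for brevity. This is not a rerun of the census experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Two Gaussian blobs standing in for the two income classes (synthetic data)
X = np.vstack([rng.normal(loc=2.0, size=(n, 2)), rng.normal(loc=-2.0, size=(n, 2))])
y01 = np.concatenate([np.ones(n), np.zeros(n)])   # {0, 1} encoding
ypm = 2 * y01 - 1                                  # {-1, +1} encoding

def avg_grad(theta, X, y):
    # Average gradient of the logistic loss log(1 + exp(-y * x.theta))
    margins = y * (X @ theta)
    return np.mean(-(y[:, None] * X) / (1.0 + np.exp(margins))[:, None], axis=0)

def train(X, y, iterations=100, lr=0.1):
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        theta = theta - lr * avg_grad(theta, X, y)
    return theta

def accuracy(theta, X, y):
    # Sign-based prediction, compared directly with the given labels
    return np.mean(np.sign(X @ theta) == y)

acc_pm = accuracy(train(X, ypm), X, ypm)
acc_01 = accuracy(train(X, y01), X, y01)
print("labels in {-1, +1}:", acc_pm)
print("labels in {0, 1}:  ", acc_01)
```

With {0, 1} labels the rows labeled 0 can never be counted as correct, so accuracy is capped at the share of rows labeled 1.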