# Privacy-Preserving Techniques for Credit Data Analysis

**Author**: Paul Belland

This notebook explores privacy-preserving techniques applied to the German Credit dataset, including:
- Randomized response mechanisms for binary data
- Differential privacy using Laplace noise
- Privacy-accuracy tradeoff analysis

In [4]:
# Import necessary libraries
import numpy as np
from numpy.random import default_rng
rng = default_rng()
import pandas as pd
from scipy.optimize import minimize
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)

## Data Loading and Exploration

In [5]:
# Load the German Credit dataset
df = pd.read_csv('german_credit.csv')

print("Shape: ", df.shape)
df.head(5)

Shape:  (1000, 21)


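| | Creditability | Account Balance | Duration of Credit (month) | Payment Status of Previous Credit | Purpose | Credit Amount | Value Savings/Stocks | Length of current employment | Instalment per cent | Sex & Marital Status | ... | Duration in Current address | Most valuable available asset | Age (years) | Concurrent Credits | Type of apartment | No of Credits at this Bank | Occupation | No of dependents | Telephone | Foreign Worker |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 18 | 4 | 2 | 1049 | 1 | 2 | 4 | 2 | ... | 4 | 2 | 21 | 3 | 1 | 1 | 3 | 1 | 1 | 1 |
| 1 | 1 | 1 | 9 | 4 | 0 | 2799 | 1 | 3 | 2 | 3 | ... | 2 | 1 | 36 | 3 | 1 | 2 | 3 | 2 | 1 | 1 |
| 2 | 1 | 2 | 12 | 2 | 9 | 841 | 2 | 4 | 2 | 2 | ... | 4 | 1 | 23 | 3 | 1 | 1 | 2 | 1 | 1 | 1 |
| 3 | 1 | 1 | 12 | 4 | 0 | 2122 | 1 | 3 | 3 | 3 | ... | 2 | 1 | 39 | 3 | 1 | 2 | 2 | 2 | 1 | 2 |
| 4 | 1 | 1 | 12 | 4 | 0 | 2171 | 1 | 3 | 4 | 3 | ... | 4 | 2 | 38 | 1 | 2 | 2 | 2 | 1 | 1 | 2 |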


## Dataset Overview

The data describes German credit applicants and contains a variety of factors that can be used to determine their credit status. One binary variable is Creditability. One non-binary (categorical) variable is Purpose.

## Data Preprocessing

In [6]:
# Remove any missing values
df.dropna(inplace = True)

## Randomized Response Implementation

### Binary Feature Query

My binary feature is already encoded as a regular binary feature with only 1s and 0s, so no conversion is needed. I'd retrieve it by just doing this:

bf = df['Creditability']

### Dice-Based Randomized Response Mechanism

Instead of the coin method from class, we use a more efficient mechanism that needs only one action: a single roll of an 8-sided die. If the roll is even (probability 1/2), we return the person's actual response. If it is odd, we return 1 when the roll is 4 or below (a 1 or 3) and 0 otherwise (a 5 or 7), so a forced 1 and a forced 0 each occur with probability 1/4.

In [7]:
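import random

def rand_resp(x):
    """Return a randomized response for the true binary value x."""
    dice_roll = random.randint(1, 8)  # uniform roll of an 8-sided die
    if dice_roll % 2 == 0:
        return x                      # even roll: report the truth
    # Odd roll: 1 or 3 forces a 1, 5 or 7 forces a 0
    return 1 if dice_roll <= 4 else 0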
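
As a sanity check on the mechanism described above, the reporting probabilities can be estimated by simulation. This is a minimal sketch, not part of the original notebook: `rand_resp_vectorized` is a hypothetical NumPy re-implementation of the same die rule, and the seed is fixed only for repeatability.

```python
import numpy as np

def rand_resp_vectorized(x, rng):
    """Apply the 8-sided-die randomized response to an array of 0/1 values."""
    x = np.asarray(x)
    rolls = rng.integers(1, 9, size=x.shape)   # uniform on 1..8
    forced = np.where(rolls <= 4, 1, 0)        # used on odd rolls: 1, 3 -> 1; 5, 7 -> 0
    return np.where(rolls % 2 == 0, x, forced) # even rolls keep the true value

rng = np.random.default_rng(0)  # seed fixed only for reproducibility
n_trials = 200_000
p_one_given_one = rand_resp_vectorized(np.ones(n_trials, dtype=int), rng).mean()
p_one_given_zero = rand_resp_vectorized(np.zeros(n_trials, dtype=int), rng).mean()

print(f"P(report 1 | truth 1) ~ {p_one_given_one:.3f}")   # theory: 3/4
print(f"P(report 1 | truth 0) ~ {p_one_given_zero:.3f}")  # theory: 1/4
```

The two estimates should land near 0.75 and 0.25, which are the probabilities the count estimator relies on.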

### Applying Randomized Response

In [8]:
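# Apply randomized response to each creditability value.
# Using apply avoids mixing positional (iloc) and label-based (loc) indexing,
# which can misalign rows if dropna removed any and left gaps in the index.
df['rrc1'] = df['Creditability'].apply(rand_resp)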

### Estimating True Count from Randomized Responses

In [9]:
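# Calculate estimated vs actual true count
count_true_yes = sum(df['Creditability'])
n = len(df)
z = (df['rrc1'] == 1).sum()   # number of randomized responses equal to 1
# P(report 1) = p/2 + 1/4, so the true proportion p is estimated by 2*z/n - 1/2
x = (2 * z / n) - 1 / 2
count_est_true_yes = x * n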

print("Estimated true yes count: ", count_est_true_yes)
print("True yes count: ", count_true_yes)

Estimated true yes count:  726.0
True yes count:  700
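
The estimator used in the cell above follows from the die probabilities. As a short derivation (the symbols $p$ for the true fraction of 1s and $q$ for the probability of a reported 1 are introduced here for illustration; $z$ and $n$ are the variables from the code):

```latex
\begin{aligned}
q &= \Pr[\text{report} = 1]
   = \underbrace{\tfrac{1}{2}\,p}_{\text{even roll, truth}}
   + \underbrace{\tfrac{1}{4}}_{\text{odd roll} \le 4}, \\
\hat{p} &= 2\,\frac{z}{n} - \frac{1}{2}
   \quad\text{(solve for } p \text{ with } q \approx z/n\text{)}, \\
\widehat{\text{count}} &= n\,\hat{p} = 2z - \frac{n}{2}.
\end{aligned}
```

With $n = 1000$, the printed estimate of 726 corresponds to $z = 613$ reported 1s, since $2 \cdot 613 - 500 = 726$.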


### Privacy-Accuracy Analysis

My randomized response mechanism is effective: it obfuscates the individual 1 and 0 responses while keeping the estimated true yes count (726) close to the actual true yes count (700), an error of 26, or about 3.7%. As such we gain much more privacy for a small trade-off in accuracy.
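
Both sides of this trade-off can be put into numbers. The sketch below assumes the standard local differential privacy definition (the log of the worst-case ratio between reporting probabilities) and treats the true data as fixed; it is an illustration rather than part of the original analysis, and the variable names are hypothetical.

```python
import math

n = 1000                  # number of records, from the shape printed earlier
p_report_matches = 0.75   # P(report = v | truth = v): even roll, or a matching odd roll
p_report_differs = 0.25   # P(report = v | truth != v)

# Local differential privacy level: log of the worst-case likelihood ratio.
epsilon = math.log(p_report_matches / p_report_differs)

# Each report is a Bernoulli draw with success probability 0.75 or 0.25, so its
# variance is 0.1875 either way. The count estimate is 2*z - n/2, so its
# standard deviation is 2 * sqrt(n * 0.1875).
sd_count_estimate = 2 * math.sqrt(n * p_report_matches * p_report_differs)

print(f"epsilon = ln(3) ~ {epsilon:.3f}")
print(f"std. dev. of the estimated count ~ {sd_count_estimate:.1f}")
```

Under these assumptions the observed error of 26 is about one standard deviation, so it is in line with what the mechanism should produce at this sample size.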

### Aggregate Statistics Comparison

In [10]:
# Calculate mean and median for both true and randomized responses
mean_est_true_yes = df['rrc1'].mean()
mean_true_yes = df['Creditability'].mean()
median_est_true_yes = df['rrc1'].median()
median_true_yes = df['Creditability'].median()
print("Mean: ", mean_est_true_yes, mean_true_yes)
print("Median: ", median_est_true_yes, median_true_yes)

Mean:  0.613 0.7
Median:  1.0 1.0


The randomized statistics are reasonably close to the true ones. The median is identical (1.0), and the mean of the randomized responses (0.613) is below the true mean (0.7). This gap is systematic rather than random: half of the responses are replaced by a fair 0/1 draw, which pulls the reported mean toward 0.5, and because the original data is a 70:30 split rather than 50:50, the reported mean lands below the true one (its expected value is 0.5 × 0.7 + 0.25 = 0.6). The results are still usable, and the accuracy trade-off is worth the gained privacy. The bias can be corrected with the same rescaling used for the count estimate, but the question does not ask for this.
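
For reference, the correction mentioned above is a one-line rescaling. The helper names below are hypothetical; the formulas follow from the die probabilities (truth with probability 1/2, forced 1 with probability 1/4).

```python
def expected_reported_mean(p):
    """Expected mean of the randomized responses when a fraction p of true answers are 1."""
    return 0.5 * p + 0.25

def debias_reported_mean(reported_mean):
    """Invert expected_reported_mean to get an unbiased estimate of the true mean."""
    return 2 * reported_mean - 0.5

raw_expected = expected_reported_mean(0.7)    # what the raw mean should be near
corrected = debias_reported_mean(0.613)       # corrected estimate from the observed raw mean

print(f"expected raw mean for p = 0.7: {raw_expected:.3f}")
print(f"debiased estimate from 0.613:  {corrected:.3f}")
```

The debiased value matches the proportion implied by the estimated count of 726 out of 1000.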

## Differential Privacy with Laplace Noise

### Global Sensitivity Analysis

As our notes mention, global sensitivity is how much one user's data can change the query output in the absolute worst case. Since we're querying counts, the answer is S(f) = 1, because adding or removing one user changes exactly one count by 1.
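
The sensitivity claim can be checked by brute force on a small example. This sketch uses a hypothetical toy column in place of `df['Purpose']` and removes each record in turn, measuring the total (L1) change across all category counts.

```python
import pandas as pd

# Hypothetical toy column standing in for df['Purpose'].
purpose = pd.Series([0, 0, 2, 3, 3, 3, 9])
categories = sorted(purpose.unique())

def histogram(series):
    """Counts per category, with categories missing from the series counted as 0."""
    return series.value_counts().reindex(categories, fill_value=0)

full = histogram(purpose)
l1_changes = []
for idx in purpose.index:
    neighbour = histogram(purpose.drop(idx))      # dataset with one person removed
    l1_changes.append(int((full - neighbour).abs().sum()))

print("largest L1 change from removing one record:", max(l1_changes))
```

Every removal changes exactly one category count by 1, which is why a single Laplace scale of 1/epsilon can be applied to each count.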

### Laplace Mechanism Implementation

In [11]:
def add_laplace_noise(v, s, e):
    """Return v plus Laplace noise with scale s / e (s: sensitivity, e: epsilon)."""
    noise = np.random.laplace(0, s/e)
    return v + noise
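
To see how the scale `s/e` controls the amount of noise, the sketch below draws many samples for a few epsilon values and compares them with the known properties of a Laplace distribution with scale b (mean absolute value b, standard deviation sqrt(2)·b). The helper `laplace_noise_samples` and the seed are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)   # seeded only so the sketch is repeatable

def laplace_noise_samples(sensitivity, epsilon, size):
    """Draw Laplace(0, sensitivity / epsilon) noise values."""
    return rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=size)

for epsilon in (0.1, 1.0, 10.0):
    noise = laplace_noise_samples(1.0, epsilon, 100_000)
    scale = 1.0 / epsilon
    print(f"epsilon={epsilon}: scale={scale:.2f}, "
          f"mean |noise|={np.abs(noise).mean():.3f} (theory {scale:.3f}), "
          f"std={noise.std():.3f} (theory {np.sqrt(2) * scale:.3f})")
```

Smaller epsilon values give proportionally larger noise, which is the lever behind the privacy-accuracy trade-off.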

### Applying Differential Privacy to Purpose Categories

In [12]:
# Get count of each unique purpose value
values = df['Purpose'].value_counts()

In [13]:
# Calculate statistics before adding noise
mean_count = values.mean()
median_count = values.median()
total_count = len(df['Purpose'])

print(f"Mean of true counts: {mean_count}")
print(f"Median of true counts: {median_count}")
print(f"Total count of true records: {total_count}")

Mean of true counts: 100.0
Median of true counts: 73.5
Total count of true records: 1000


In [14]:
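# Add Laplace noise to each count (sensitivity = 1, epsilon = 1)
# Cast to float first so the noisy values are not written into an integer Series
noisy_values = values.astype(float)

for i in range(len(values)):
    x = values.iloc[i]
    noisy_count = add_laplace_noise(x, 1, 1)
    noisy_values.iloc[i] = noisy_count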

In [15]:
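# Calculate statistics after adding noise
mean_noisy_count = noisy_values.mean()
median_noisy_count = noisy_values.median()
# Note: this is the number of records, which the noise does not touch;
# a noisy total would be noisy_values.sum()
total_noisy_count = len(df['Purpose'])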

print(f"Noisy mean of true counts: {mean_noisy_count}")
print(f"Noisy median of true counts: {median_noisy_count}")
print(f"Total noisy count of true records: {total_noisy_count}")

Noisy mean of true counts: 100.28172050207407
Noisy median of true counts: 73.52415360747608
Total noisy count of true records: 1000


### Differential Privacy Analysis

The differences between the true and noisy aggregate statistics are very small: the mean moves from 100.0 to about 100.28 and the median from 73.5 to about 73.52. The total is unchanged because it is computed from the number of records, not from the noisy counts. As such the noisy values are very useful. On the other hand, because they are still so close to the true values, the privacy could be improved by reducing epsilon.
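
The size of these differences is roughly what the Laplace distribution predicts. As a back-of-the-envelope sketch (assuming independent noise per category and 10 categories, which is what a total of 1000 and a mean count of 100 imply):

```python
import math

epsilon = 1.0
sensitivity = 1.0
k = 10  # number of Purpose categories implied by total 1000 / mean count 100

scale = sensitivity / epsilon                 # Laplace scale b
sd_single_count = math.sqrt(2) * scale        # std. dev. of the noise on one count
sd_mean_of_counts = sd_single_count / math.sqrt(k)  # averaging k independent noises

print(f"std. dev. of noise on one count:     {sd_single_count:.3f}")
print(f"std. dev. of noise on the mean count: {sd_mean_of_counts:.3f}")
```

A shift of about 0.28 in the mean is well within this range, while the counts themselves are on the order of 100, so the relative error is tiny at epsilon = 1.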

## Privacy-Accuracy Tradeoff with Varying Epsilon

As epsilon decreases, the Laplace noise scale (sensitivity / epsilon) grows, so the aggregate statistics become less and less useful but much more private. When we increase epsilon, the opposite happens: the noisy statistics become almost identical to the true counts, but the privacy guarantee becomes weaker and weaker.
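
A small sweep over epsilon makes the trade-off concrete. This is a sketch under stated assumptions: `true_counts` is an illustrative set of ten counts chosen to sum to 1000 with mean 100 and median 73.5 (the notebook does not print the individual Purpose counts), and `mean_abs_error` is a hypothetical helper that averages the absolute error over repeated noise draws.

```python
import numpy as np

rng = np.random.default_rng(7)   # seeded only so the sketch is repeatable

# Illustrative category counts (assumption, not taken from the notebook output).
true_counts = np.array([280, 234, 181, 103, 97, 50, 22, 12, 12, 9], dtype=float)

def mean_abs_error(counts, epsilon, sensitivity=1.0, n_repeats=2000):
    """Average absolute error of Laplace-noised counts over repeated draws."""
    noise = rng.laplace(0.0, sensitivity / epsilon, size=(n_repeats, counts.size))
    noisy = counts + noise                      # one noisy histogram per row
    return np.abs(noisy - counts).mean()

epsilons = [0.01, 0.1, 1.0, 10.0]
errors = [mean_abs_error(true_counts, eps) for eps in epsilons]

for eps, err in zip(epsilons, errors):
    print(f"epsilon={eps:<5} mean absolute error per count ~ {err:.2f}")
```

The error per count scales as roughly 1/epsilon, so at epsilon = 0.01 the noise is comparable to or larger than most of the counts, while at epsilon = 10 it is negligible.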