<h1>MICRODATA PROTECTION</h1>

Snippets of code to protect your microdata. To run the code, you also need the adult and iris datasets (adult.csv and iris.data) in the same directory.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import numpy.random as npr
from scipy.linalg import cholesky
from sklearn.decomposition import PCA

In [None]:
adultdf = pd.read_csv("adult.csv")

adultdf

The histogram for the hours-per-week column. Show similar histograms for your attributes in HW1.

In [None]:
sns.histplot(data=adultdf, x='hours-per-week', bins=30)
plt.show()

Check <url>https://www.kaggle.com/datasets/wenruliu/adult-income-dataset</url> to understand what each variable corresponds to.

<h2>MACRODATA</h2>

In [None]:
#convert the gender/occupation columns into a double-entry table

macrodata_occupation = adultdf[['gender', 'occupation']]

#the following line computes the frequencies, renames the columns and rotates the table
macrodata_occupation = pd.DataFrame(macrodata_occupation.groupby('gender').value_counts())\
                         .reset_index().rename({0:'count'}, axis=1)\
                         .pivot_table(index='gender', columns='occupation', values='count')\
                         .fillna(0)

#print the table
macrodata_occupation


Which cells are sensitive? Which threshold should we use? If our threshold is 500, we have to remove the values of all cells with fewer than 500 subjects before releasing our data.

In [None]:
#work on an object-typed copy so that cells can hold the string marker
privatized_mo = macrodata_occupation.astype(object)
privatized_mo[macrodata_occupation<500] = "SUPPRESSED"
privatized_mo

*SOME SUGGESTIONS FOR HW1*<br>
Show on your data how to carry out:
<ul>
    <li> cell suppression </li>
    <li> rounding </li>
    <li> roll up categories </li>
</ul>
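The three techniques listed above can be sketched on a toy frequency table. Everything in this example is made up for illustration and is not taken from the adult dataset: the counts, the threshold of 10, the rounding base of 5 and the `rollup` mapping.

```python
import pandas as pd

# toy frequency table (hypothetical counts)
toy = pd.DataFrame(
    {"Sales": [12, 48], "Tech-support": [3, 27], "Armed-Forces": [1, 4]},
    index=["Female", "Male"],
)

# cell suppression: hide the cells below the threshold
threshold = 10
suppressed = toy.astype(object)
suppressed[toy < threshold] = "SUPPRESSED"

# rounding: round every count to the nearest multiple of the base
base = 5
rounded = (toy / base).round().astype(int) * base

# roll up: merge columns into coarser categories and sum their counts
rollup = {"Sales": "business", "Tech-support": "other", "Armed-Forces": "other"}
rolled = toy.T.groupby(rollup).sum().T

print(suppressed, rounded, rolled, sep="\n\n")
```

Rolling up is often preferable to suppression when many small cells belong to related categories, since the merged cell can still be published.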

<h3>(n-k) rule</h3>

To determine which cells are sensitive using the n-k rule, we need to determine how much each individual contributes to the value of the cell.

The aggregation is the sum: what is the contribution of each individual?

In [None]:
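def nkrule(values, n=3, k=0.3):
    """Flag a cell as sensitive when fewer than n individuals each contribute more than a share k of its total."""
    contribution = values/np.sum(values) #share of the cell total due to each individual
    print(f"contributions: {contribution}")
    print(f"number of individuals contributing more than {k}: {np.sum(contribution>k)}")
    return np.sum(contribution>k)<n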
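For comparison, a common textbook formulation of the (n, k) dominance rule flags a cell as sensitive when its n largest contributors together account for more than a fraction k of the cell total. The sketch below implements that variant. The name `dominance_rule` is hypothetical, and its results need not agree with `nkrule`, which instead counts the individuals whose single share exceeds k.

```python
import numpy as np

def dominance_rule(values, n=3, k=0.3):
    """Return True if the n largest contributions exceed a fraction k of the total."""
    values = np.asarray(values, dtype=float)
    total = values.sum()
    if total == 0:
        return False  # an empty cell has no dominant contributor
    top_n = np.sort(values)[::-1][:n]  # the n largest contributions
    return bool(top_n.sum() / total > k)

print(dominance_rule([20, 10, 10, 0], n=1, k=0.4))   # largest share is 0.5
print(dominance_rule([10, 10, 10, 10], n=1, k=0.4))  # largest share is 0.25
```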

Consider the following scenario: we want to compute the time spent studying for the various cybersecurity courses during the week.

We report the results in a table with one cell for each course.

We are interested in determining whether, according to the 3-0.3 rule, the cell for the PPIA course is sensitive.

Alice spends 20 hours studying PPIA, Bob and Claire both spend 10 hours, and David spends 0 hours.

In [None]:
n = 3
k = 0.3

print(f"is the cell sensitive according to {n}-{k} rule?: {nkrule(np.array([20, 10, 10, 0]), n, k)}\n")

Alice differs too much from the "standard student": this cell is much more informative about her behaviour than about the others'.

In [None]:
# Now we slightly modify the scenario: we assume that Alice also spends 10 hours
# studying, making her more similar to the rest of the students.

print(f"is the cell sensitive according to {n}-{k} rule?: {nkrule(np.array([10, 10, 10, 0]), n, k)}\n")

We go back to the adult dataset: we want to verify whether the cell 'Other'-'Priv-house-serv' is sensitive (i.e., contains subjects that differ too much from the average).

In [None]:
macrodata_hpw = adultdf[['race', 'hours-per-week', 'occupation']]\
                       .groupby(["race", "occupation"]).mean().reset_index()\
                       .pivot_table(index='race', columns='occupation', values='hours-per-week')\
                       .fillna(0)
macrodata_hpw

In [None]:
n = 3
k = 0.25


contr = np.array(adultdf[(adultdf['race']=='Other') &
                         (adultdf['occupation']=='Priv-house-serv')]['hours-per-week'])
print(f"is the cell sensitive according to {n}-{k} rule?: {nkrule(contr, n, k)}\n")
k = 0.2
print(f"is the cell sensitive according to {n}-{k} rule?: {nkrule(contr, n, k)}\n")

*SOME SUGGESTIONS FOR HW1*<br>
Show, on your data or on the adult dataset, which cells are sensitive according to:
<ul>
    <li> p-percentage </li>
    <li> pq-rule </li>
</ul>
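As a starting point, both rules can be sketched as follows, using their usual textbook statements. Let x1 be the largest contribution to a cell and R the cell total minus the two largest contributions. The p% rule flags the cell when R < (p/100)·x1. The pq-rule additionally assumes that contributions are known a priori to within q% (with p < q) and flags the cell when (q/100)·R < (p/100)·x1. The function names and the default values of p and q are hypothetical, and the functions assume at least one contributor per cell.

```python
import numpy as np

def p_percent_rule(values, p=10):
    """Sensitive if the total minus the two largest contributions is below p% of the largest."""
    x = np.sort(np.asarray(values, dtype=float))[::-1]  # contributions, largest first
    remainder = x[2:].sum()
    return bool(remainder < p / 100 * x[0])

def pq_rule(values, p=10, q=50):
    """Like the p% rule, but the remainder is discounted by the prior precision q%."""
    x = np.sort(np.asarray(values, dtype=float))[::-1]
    remainder = x[2:].sum()
    return bool(q / 100 * remainder < p / 100 * x[0])

cell = [100, 50, 12, 0]
print(p_percent_rule(cell, p=10))   # remainder 12 vs threshold 10
print(pq_rule(cell, p=10, q=50))    # discounted remainder 6 vs threshold 10
```

With q = 100 the pq-rule reduces to the p% rule.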

<h2>MICRODATA: MASKING</h2>

<h3>Sampling</h3>

In [None]:
#select a sample of the individuals, by randomly sampling the indexes
sampled_individuals = npr.choice(adultdf.index.to_list(), int(len(adultdf.index.to_list())*0.3), replace=False)

#filter out individuals whose index is not among the sampled ones
sampled_adultdf = adultdf.loc[sampled_individuals, :]
sampled_adultdf
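The same kind of sampling can also be written with pandas' built-in `DataFrame.sample`. The sketch below uses a made-up toy table instead of the adult dataset, and the `random_state` value is an arbitrary choice that only makes the draw reproducible.

```python
import pandas as pd

# toy table standing in for the real dataset (hypothetical values)
toy = pd.DataFrame({"age": range(20, 30), "hours-per-week": [40] * 10})

# keep a random 30% of the rows, without replacement
sampled_toy = toy.sample(frac=0.3, random_state=0)
print(sampled_toy)
```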

What challenges are associated with the sampling strategy?

<h3> Local Suppression </h3>

We considered the individuals falling in the cell 'race'=='Other' & 'occupation'=='Priv-house-serv' as sensitive.

In [None]:
adultdf[(adultdf['race']=='Other') &(adultdf['occupation']=='Priv-house-serv')]['hours-per-week']

More specifically, we treat as problematic the records in this cell whose number of hours per week is 40. We will suppress them.

In [None]:
adultdf.loc[(adultdf['race']=='Other')
             & (adultdf['occupation']=='Priv-house-serv')
             & (adultdf['hours-per-week']==40), 'hours-per-week']

In [None]:
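#cast to object so the column can hold both numbers and the string marker
adultdf['hours-per-week'] = adultdf['hours-per-week'].astype(object)
adultdf.loc[(adultdf['race']=='Other')
             & (adultdf['occupation']=='Priv-house-serv')
             & (adultdf['hours-per-week']==40), 'hours-per-week'] = "SUPPRESSED"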

In [None]:
adultdf.loc[(adultdf['race']=='Other')
             & (adultdf['occupation']=='Priv-house-serv'), 'hours-per-week']

Notice that the cell remains sensitive because only two users contribute 50% of its value: we have to verify that and, if necessary, suppress other values.

In [None]:
# reset the adult dataframe
adultdf = pd.read_csv("adult.csv")

<h3>Global Recoding</h3>

In [None]:
#equal-width bins
adultdf['hpw'] = pd.cut(adultdf['hours-per-week'], bins=5)
adultdf[['hours-per-week', 'hpw']]

In [None]:
sns.histplot(data=adultdf, x='hours-per-week')
plt.show()
labels = ['very low', 'low', 'medium', 'high', 'very high']
adultdf['hpw'] = pd.cut(adultdf['hours-per-week'], bins=5, labels=labels)
sns.histplot(data=adultdf, x='hpw')
plt.show()

In [None]:
adultdf['hpw']

In [None]:
sns.histplot(data=adultdf, x='hours-per-week')

In [None]:
adultdf['hpw'] = pd.qcut(adultdf['hours-per-week'], 10)

Why does it fail?
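One likely reason can be explored on synthetic data: when a single value is very frequent, several quantiles coincide, so the bin edges that `pd.qcut` computes are no longer unique. The column below is made up (70% of the values are exactly 40) and only mimics a distribution with a strong peak.

```python
import numpy as np
import pandas as pd

# synthetic column: 70 values equal to 40, plus the values 10..39 (made-up data)
hours = pd.Series([40] * 70 + list(range(10, 40)))

# the decile edges that qcut would use as bin boundaries
edges = hours.quantile(np.linspace(0, 1, 11))
print(edges.to_list())
print(f"distinct edges: {edges.nunique()} out of {len(edges)}")

try:
    pd.qcut(hours, 10)
    qcut_failed = False
except ValueError as err:
    qcut_failed = True
    print(f"qcut failed: {err}")
```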

In [None]:
#quantile-based bins

adultdf['hpw'] = pd.qcut(adultdf['hours-per-week'], 10, duplicates='drop')
adultdf[['hours-per-week', 'hpw']]

What problems does global recoding present? Are all the bins equal in size? Do all the bins contain the same number of subjects?
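To explore these questions, it can help to count the subjects in each bin. The sketch below does so on a small made-up series with one extreme value: equal-width bins (`pd.cut`) end up very unevenly populated, while quantile bins (`pd.qcut`) hold the same number of subjects but have very different widths.

```python
import pandas as pd

# made-up skewed data: nine small values and one outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# equal-width bins: same interval length, different populations
equal_width = pd.cut(values, bins=3).value_counts().sort_index()
print(equal_width)

# quantile bins: same population, different interval lengths
equal_count = pd.qcut(values, 2).value_counts().sort_index()
print(equal_count)
```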

<h3>Top & Bottom Coding</h3>

In [None]:
#top coding
greater_than_60 = adultdf[adultdf['hours-per-week']>60].index.to_list()

In [None]:
#bottom coding
smaller_than_20 = adultdf[adultdf['hours-per-week']<20].index.to_list()

In [None]:
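#cast to object so the column can hold both numbers and string labels
adultdf['hours-per-week'] = adultdf['hours-per-week'].astype(object)
adultdf.loc[greater_than_60, 'hours-per-week'] = ">60"
adultdf.loc[smaller_than_20, 'hours-per-week'] = "<20"
adultdf.loc[(adultdf['hours-per-week']==">60") | (adultdf['hours-per-week']=="<20"), ['workclass', 'education', 'native-country', 'hours-per-week']]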
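A variant of top and bottom coding keeps the column numeric by replacing extreme values with the threshold itself instead of a string label. This is a sketch on made-up values, reusing the thresholds 20 and 60 from the cells above. Note that it is not identical to the labelling approach: a value equal to a threshold is indistinguishable from a coded one.

```python
import pandas as pd

# made-up weekly hours
hours = pd.Series([5, 18, 20, 40, 60, 75, 99])

# values below 20 become 20, values above 60 become 60
coded = hours.clip(lower=20, upper=60)
print(coded.to_list())  # [20, 20, 20, 40, 60, 60, 60]
```

Keeping the column numeric means that histograms and summary statistics still work, at the cost of biasing them towards the thresholds.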

In [None]:
# reset the adult dataframe
adultdf = pd.read_csv("adult.csv")

<h3>Generalization</h3>

In [None]:
adultdf['occupation'].unique()

In [None]:
#prepare a generalization map
gen_map = {'Machine-op-inspct': 'manual','Farming-fishing': 'manual', 'Armed-Forces':'military',
           'Protective-serv':'military', '?': 'others-or-unknown', 'Other-service':'others-or-unknown',
           'Prof-specialty':'business', 'Craft-repair': 'manual', 'Adm-clerical': 'business',
           'Exec-managerial':'business', 'Tech-support': 'manual', 'Sales': 'business',
           'Priv-house-serv': 'manual', 'Transport-moving':'business', 'Handlers-cleaners': 'manual'}
adultdf['occupation_gen'] = adultdf['occupation'].replace(gen_map)
adultdf[['occupation', 'occupation_gen']].head()

<h3>Resampling</h3>

In [None]:
#slides example
M = np.array([10, 18, 20, 8, 11, 14])
print(f"original values: {M}")

print(f"argsort: {np.argsort(M)}")

sM = np.array([npr.choice(M, size=len(M)) for s in range(4)]).T
print("\nsampled table:")
print(sM)

rows, cols = sM.shape

sMs = np.array([sorted(sM[:, c]) for c in range(cols)]).T
print("\nsorted table:")
print(sMs)

means = np.mean(sMs, axis=1)
print(f"\nmeans: {means}")
released = np.zeros(len(M))
for e, i in enumerate(np.argsort(M)):
    released[i] = means[e]
print(f"released values: {released}")

In [None]:
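def resampling_values(data, k=10):
    """Draw k bootstrap samples, sort each one, average them rank by rank and give each record the mean at its own rank."""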
    sM = npr.choice(data, size=(len(data), k))
    rows, cols = sM.shape
    sMs = np.array([sorted(sM[:, c]) for c in range(cols)]).T
    means = np.mean(sMs, axis=1)
    released = np.zeros(len(data))
    for e, i in enumerate(np.argsort(data)):
        released[i] = means[e]
    return released

In [None]:
values = resampling_values(adultdf['fnlwgt'])
print(f"original: {adultdf['fnlwgt'].to_list()[:10]}")
print(f"released: {list(values[:10])}")


In [None]:
iris = pd.read_csv("iris.data", names=["sl", "sw", "pl", "pw", "class"], header=None)
iris.head()

In [None]:
#identify the variable that you want to apply the noise on
variables = ["sl", "sw", "pl", "pw"]

In [None]:
corr = iris[variables].corr()
sns.heatmap(corr)
plt.show()

<h3>Random noise: uncorrelated additive noise</h3>

In [None]:
sigmas2 = iris[variables].std()**2 #variance of each variable
print(f"variances: {sigmas2}")
alpha = 1.4
scaled_sigmas2 = sigmas2*alpha #variance of the noise for each variable

In [None]:
# generate the random noise with mean 0 and the computed (scaled) variances
noise = npr.normal(
                size=(len(iris.index), len(variables)), #set the size of the data
                loc=0, #set the mean
                scale=np.sqrt(scaled_sigmas2) #scale expects standard deviations: take the square root of the variances
                  )

noised_idf = iris.copy()
noised_idf[variables] += noise

In [None]:
noised_idf[variables].head()

In [None]:
iris[variables].head()

In [None]:
corr = noised_idf[variables].corr()
sns.heatmap(corr)
plt.show()
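The effect of uncorrelated noise on the correlations can be checked numerically. The sketch below assumes that the noise added to each variable has variance alpha times the variance of that variable. It uses made-up, strongly correlated two-column data rather than the iris data. Under this assumption each variance grows by a factor of about (1 + alpha), and the correlation shrinks by roughly the same factor.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# synthetic, strongly correlated columns (made-up data)
x = rng.normal(size=n)
data = np.column_stack([x, x + 0.1 * rng.normal(size=n)])

alpha = 1.4
variances = data.var(axis=0, ddof=1)

# independent noise per column; scale is a standard deviation
noise = rng.normal(loc=0, scale=np.sqrt(alpha * variances), size=data.shape)
noised = data + noise

corr_before = np.corrcoef(data.T)[0, 1]
corr_after = np.corrcoef(noised.T)[0, 1]
print(f"variance ratio: {noised.var(axis=0, ddof=1) / variances}")
print(f"correlation before: {corr_before:.3f}, after: {corr_after:.3f}")
```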

<h3>Random noise: correlated additive noise</h3>

In [None]:
#compute the covariance matrix
covm = iris[variables].cov()
print(covm.shape)
print(np.diag(covm)) # the diagonal of the covariance matrix, contains the variance

In [None]:
#generate the noise using the covariance matrix
covnoise = npr.multivariate_normal(np.zeros(len(variables)), covm, size=len(iris.index))

In [None]:
#add the noise to the data
noisedc_idf = iris.copy()
noisedc_idf[variables] += covnoise
noisedc_idf.head()

In [None]:
corr = noisedc_idf[variables].corr()
sns.heatmap(corr)
plt.show()
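With correlated noise, the covariance of the noise is proportional to the covariance of the data, so the correlation structure should be approximately preserved. The following sketch checks this on made-up data. The scaling factor `alpha` is a hypothetical parameter: the cells above use the covariance matrix as is, which corresponds to alpha = 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# synthetic correlated columns (made-up data)
x = rng.normal(size=n)
data = np.column_stack([x, x + 0.5 * rng.normal(size=n)])

alpha = 0.5  # hypothetical noise level
cov = np.cov(data.T)

# noise drawn with covariance alpha * cov
noise = rng.multivariate_normal(np.zeros(2), alpha * cov, size=n)
noised = data + noise

corr_before = np.corrcoef(data.T)[0, 1]
corr_after = np.corrcoef(noised.T)[0, 1]
print(f"correlation before: {corr_before:.3f}, after: {corr_after:.3f}")
```

The covariance of the noised data is about (1 + alpha) times the original one, so an analyst who knows alpha can rescale it.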

<h2>MICRODATA: GENERATION</h2>

<h3>Cholesky decomposition</h3>

In [None]:
covm = iris[variables].cov()
U = cholesky(covm)
print(U.T.conj()@U)

In [None]:
R =  npr.multivariate_normal(np.zeros(len(variables)), np.eye(len(variables)), size=len(iris.index))
generated_iris = R@U

In [None]:
generated_iris.shape

In [None]:
print(np.cov(generated_iris.T))
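Standard normal draws multiplied by a Cholesky factor reproduce the covariance but are centred on zero. If the synthetic data should also reproduce the means, the mean vector can be added afterwards. A minimal sketch, with a made-up mean vector and covariance matrix; it uses NumPy's `np.linalg.cholesky`, which returns the lower factor L (hence `Z @ L.T`), whereas SciPy's `cholesky` used above returns the upper factor by default.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical target statistics
mean = np.array([5.0, 3.0])
cov = np.array([[0.7, 0.2],
                [0.2, 0.3]])

L = np.linalg.cholesky(cov)               # lower triangular, cov == L @ L.T
Z = rng.standard_normal(size=(50000, 2))  # independent N(0, 1) draws
generated = Z @ L.T + mean                # impose the covariance, then shift

print(generated.mean(axis=0))
print(np.cov(generated.T))
```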

<h3>Blank and Impute</h3>

In [None]:
#compute the means of each variable
means = iris[variables].mean()


proportion = 0.5
# sample proportion*100% cells (rows and columns)
sampled = [(r, c) for r in range(len(iris.index)) for c in range(len(variables)) if npr.random() < proportion]


#make a copy of the dataset
iris_bi = iris[variables].copy()

#for each sampled row, column pair, replace the value with the column mean
for r, c in sampled:
    iris_bi.iloc[r, c] = means.iloc[c] #means is indexed by column name, so access it by position

In [None]:
iris_bi.head()

In [None]:
iris.head()
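The same blank-and-impute step can be written without an explicit loop by drawing a boolean mask and replacing the masked cells in one call. This is a sketch on a made-up two-column table; `mask` and `imputed` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# toy numeric table (made-up values)
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 40.0]})
means = toy.mean()

proportion = 0.5
# True marks a cell to blank; each cell is selected independently
mask = rng.random(toy.shape) < proportion

# where the mask is True take the column mean, otherwise keep the original value
imputed = pd.DataFrame(
    np.where(mask, means.to_numpy(), toy.to_numpy()),
    index=toy.index, columns=toy.columns,
)
print(imputed)
```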

<h2>UNIQUENESS</h2>

In [None]:
considered_variables = ['age', 'workclass', 'education', 'educational-num',
       'marital-status', 'occupation', 'relationship', 'race', 'gender',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country']
PU = (adultdf[considered_variables].value_counts() == 1).sum()/len(adultdf.index)
print(f"population uniqueness: {PU:.3f}")

If the population uniqueness is too high, reduce it with microdata protection techniques.
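To see how a protection technique affects uniqueness, the measure can be wrapped in a small helper and computed before and after the transformation. The sketch below uses a made-up four-record table and a hypothetical `age_band` generalization; the helper name `uniqueness` is also an assumption.

```python
import pandas as pd

def uniqueness(df, columns):
    """Fraction of records whose combination of values in `columns` appears exactly once."""
    group_sizes = df.groupby(columns, dropna=False).size()
    return (group_sizes == 1).sum() / len(df)

toy = pd.DataFrame({
    "age": [25, 27, 31, 47],
    "gender": ["F", "F", "M", "F"],
    "age_band": ["20-39", "20-39", "20-39", "40-59"],  # generalized age
})

print(uniqueness(toy, ["age", "gender"]))       # 1.0: every record is unique
print(uniqueness(toy, ["age_band", "gender"]))  # 0.5: two records now share a combination
```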

*SOME SUGGESTIONS FOR HW1*<br>
Measure the sample uniqueness of your data. Try also to carry out a simple record linkage analysis.