# Mini-project 4: Privacy-Preserving Data Mining in Banking Applications

## Classification performance with K-Anonymity

<div style="text-align: right"> Amen Memmi</div>
<div style="text-align: right"> amen.memmi@mail.mcgill.ca</div>
<div style="text-align: right">  ID: 260755070</div>

- The jupyter notebook summarizes the second part of my participation in the 4th mini project. It presents the _classifiaction algorithms_ for evaluation of performance _before_ and _after_ anonymization.
- K-Anonymity has been applied in the first part using freeware software, ARX which could be downloaded from [here](http://arx.deidentifier.org/downloads/), please refer to the [report](Report.pdf) and the [presentation](Project_4_updated) for further details on the anonymization operation
- Anononymized data could be found in csv format for diffenet values of k in [k_anonymity](data\secured_files\k_anonymity) folder.
- I will be  testing out 3 common machine learning algorithms for data before and after applying data privacy techniques
    *  Logistic regression
    *  Naive Bayes
    *  Random Forest


- The test will follow these ideas:
    * The dataset will be divided into 2 sets: train and test sets.
    * Stratified sampling is used as our target value is unbalanced. In each set, we maintain the ratio of zeros over ones, the same ratio as it is in the full dataset (About 3/4).
    * Accuracy (ACC), area-under-curve (AUC), Precision (PRE) and Recall (REC)  are used as performance metrics.

In [1]:
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB
import statsmodels.api as sm
from sklearn.metrics import accuracy_score
import types
from sklearn.model_selection import train_test_split
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score, roc_auc_score, precision_score, recall_score
import pandas as pd 
from scipy.stats import pointbiserialr, spearmanr
from os import listdir
from os.path import isfile, join





In [4]:
data = pd.read_csv("data/default of credit card clients.csv")   #original data
data.head()

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
0,1,20000,2,2,1,24,2,2,-1,-1,...,0,0,0,0,689,0,0,0,0,1
1,2,120000,2,2,2,26,-1,2,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,3,90000,2,2,2,34,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,4,50000,2,2,1,37,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,5,50000,1,2,1,57,-1,0,-1,0,...,20940,19146,19131,2000,36681,10000,9000,689,679,0


In [5]:
# check correlation to see attribute importance - feature selection
col_names = data.columns

param=[]
correlation=[]
abs_corr=[]

for c in col_names:
    #Check if binary or continuous
    if c != "default payment next month":
        if len(data[c].unique()) <= 2:
            corr = spearmanr(data["default payment next month"],data[c])[0]
        else:
            corr = pointbiserialr(data["default payment next month"],data[c])[0]
        param.append(c)
        correlation.append(corr)
        abs_corr.append(abs(corr))

#Create dataframe for visualization
param_df=pd.DataFrame({'correlation':correlation,'parameter':param, 'abs_corr':abs_corr})

#Sort by absolute correlation
param_df=param_df.sort_values(by=['abs_corr'], ascending=False)

#Set parameter name as index
param_df=param_df.set_index('parameter')

param_df

Unnamed: 0_level_0,abs_corr,correlation
parameter,Unnamed: 1_level_1,Unnamed: 2_level_1
PAY_0,0.324794,0.324794
PAY_2,0.263551,0.263551
PAY_3,0.235253,0.235253
PAY_4,0.216614,0.216614
PAY_5,0.204149,0.204149
PAY_6,0.186866,0.186866
LIMIT_BAL,0.15352,-0.15352
PAY_AMT1,0.072929,-0.072929
PAY_AMT2,0.058579,-0.058579
PAY_AMT4,0.056827,-0.056827


In [6]:
best_features=param_df.index[0:7].values
print 'Best features:\t',best_features

Best features:	['PAY_0' 'PAY_2' 'PAY_3' 'PAY_4' 'PAY_5' 'PAY_6' 'LIMIT_BAL']


# Classification algorithms



## logistic regression


In [8]:
predictors = ['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4',
                 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
                 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']

#predictors = [''PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL']

credible = data[data["default payment next month"] == 1]
not_credible = data[data["default payment next month"] == 0]

# stratified sampling
#80% to train set
train = pd.concat([credible.sample(frac=0.8, random_state=1),
                   not_credible.sample(frac=0.8, random_state=1)]) 
y_train = train["default payment next month"]
X_train = train[predictors]

#20% to test set
test = pd.concat([credible.sample(frac=0.2, random_state=2), 
                  not_credible.sample(frac=0.2, random_state=2)])
y_test = test["default payment next month"]
X_test = test[predictors]



In [30]:
#train
print("train set result\n")
logit_train = sm.Logit(y_train, X_train) 
result_train = logit_train.fit()

y_train_pred = result_train.predict(X_train) 
y_train_pred = (y_train_pred > 0.5).astype(int) 
acc = accuracy_score(y_train, y_train_pred) 
print("ACC= %.2f %%" % (100*acc))
auc = roc_auc_score(y_train, y_train_pred) 
print("AUC= %.2f %%" % (100*auc))
pre = precision_score(y_train, y_train_pred)
print("PRE= %.2f %%" % (100*pre))
rec = recall_score(y_train, y_train_pred)
print("REC= %.2f %%" % (100*rec))


print("\n test set result\n")
y_test_pred = result_train.predict(X_test) 
y_test_pred = (y_test_pred > 0.5).astype(int) 
acc = accuracy_score(y_test, y_test_pred)
print("ACC= %.2f %%" % (100*acc))
auc = roc_auc_score(y_test, y_test_pred)
print("AUC= %.2f %%" % (100*auc))
pre = precision_score(y_test, y_test_pred)
print("PRE= %.2f %%" % (100*pre))
rec = recall_score(y_test, y_test_pred)
print("REC= %.2f %%" % (100*rec))

train set result

Optimization terminated successfully.
         Current function value: 0.465550
         Iterations 7
ACC= 80.98 %
AUC= 60.56 %
PRE= 70.75 %
REC= 23.92 %

 CV set result

ACC= 81.63 %
AUC= 61.15 %
PRE= 76.78 %
REC= 24.40 %

 test set result

ACC= 81.00 %
AUC= 61.07 %
PRE= 69.42 %
REC= 25.30 %


## Naive Bayes


In [18]:
def calc_NB(dataset):
    """
    Computes accuracy ACC and area under curve AUC for Naive Bayes classifier
    """
    # Convert categorical to numerical representations
    
    category_columns = ['LIMIT_BAL', 'SEX', 'EDUCATION',  'AGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4',
                 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'PAY_AMT1','PAY_AMT2']
    for column in category_columns:
        if not isinstance(dataset[column][0], np.int64):
            aux1, aux2 = np.unique(dataset[column], return_inverse=True) 
            dataset[column] = aux2
    #########
    predictors = ['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4',
                 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
                 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']
    #predictors = [''PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL']
    clf = BernoulliNB()
    pred_data = dataset[predictors] #X
    target = dataset["default payment next month"] #y
    split_nbr=50
    sss = StratifiedShuffleSplit(target, split_nbr, test_size=0.1, random_state=np.random.randint(50))
    acc=0
    auc=0
    pre=0
    rec=0
    for train_index, test_index in sss:
        train_data = dataset.iloc[train_index]
        test_data = dataset.iloc[test_index]
         
        X_train, X_test = train_data[predictors], test_data[predictors] 
        y_train, y_test = train_data["default payment next month"], test_data["default payment next month"]
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        acc_score = accuracy_score(y_test, y_pred) 
        acc += acc_score
        auc_score = roc_auc_score(y_test, y_pred) 
        auc += auc_score
        pre_score = precision_score(y_test, y_pred)
        pre+=pre_score
        rec_score = recall_score(y_test, y_pred)
        rec+=rec_score
    acc=acc/split_nbr
    auc=auc/split_nbr
    pre=pre/split_nbr
    rec=rec/split_nbr
    return acc, auc, pre, rec

In [33]:
data_5_anonymity = pd.read_csv("5ano.csv")
a, b, c , d= calc_NB(data_5_anonymity)
print ("## ACC=%.2f %%" % ( a*100))
print ("## AUC=%.2f %%" % ( b*100))
print ("## PRE=%.2f %%" % ( c*100))
print ("## REC=%.2f %%" % ( d*100))

## ACC=74.81 %
## AUC=52.80 %
## PRE=32.96 %
## REC=13.32 %


In [32]:
untouched_data = pd.read_csv("default of credit card clients.csv")
a, b, c , d= calc_NB(untached_data)
print ("## ACC=%.2f %%" % ( a*100))
print ("## AUC=%.2f %%" % ( b*100))
print ("## PRE=%.2f %%" % ( c*100))
print ("## REC=%.2f %%" % ( d*100))

## ACC=76.98 %
## AUC=67.17 %
## PRE=48.09 %
## REC=49.57 %


In [34]:
#load the data files in secured_files\k_anonymity directory
files = [join('secured_files/k_anonymity', f) for f in listdir('secured_files/k_anonymity') if isfile(join('secured_files/k_anonymity', f))]

In [45]:
acc=np.zeros(len(files)) #Accuracy 
auc=np.zeros(len(files)) #Area Under Curve 
pre=np.zeros(len(files)) #Accuracy 
rec=np.zeros(len(files)) #Area Under Curve 
for i in range(len(files)):
    data = pd.read_csv(files[i])
    acc[i], auc[i], pre[i], rec[i] = calc_NB(data)    

In [46]:
print 'ACC:\t',100*acc
print 'AUC:\t',100*auc
print 'PRE:\t',100*pre
print 'REC:\t',100*rec

ACC:	[ 77.13        76.098       74.79733333  74.46733333  76.00266667
  76.37733333  76.46133333  73.96933333  74.21466667]
AUC:	[ 67.2407885   54.23783834  52.84319814  53.78038352  55.71376774
  57.83428062  57.76102183  63.90047347  68.09597809]
PRE:	[ 48.39198755  39.45117213  33.08191551  34.23942597  41.06753449
  43.98687894  44.21458528  41.94717026  43.70522227]
REC:	[ 49.49698795  15.01506024  13.45180723  16.6626506   19.31024096
  24.56325301  24.20783133  45.83433735  57.11746988]


## Random Forest

In [26]:
data=data21

In [53]:
# for random forest

def calc_RF(dataset):
    """
    Computes accuracy ACC and area under curve AUC for Naive Random Forest classifier
    """
    # Convert categorical to numerical representations
    
    category_columns = ['LIMIT_BAL', 'SEX', 'EDUCATION',  'AGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4',
                 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'PAY_AMT1','PAY_AMT2']
    for column in category_columns:
        if not isinstance(dataset[column][0], np.int64):
            aux1, aux2 = np.unique(dataset[column], return_inverse=True) 
            dataset[column] = aux2
    #########
    predictors = ['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4',
                 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
                 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']
    pred_data = dataset[predictors] #X
    target = dataset["default payment next month"] #y
    #Bagging
    tree_count = 10
    bag_proportion = 0.2 
    predictions = []
    sss = StratifiedShuffleSplit(target, 1, test_size=0.1, random_state=np.random.randint(50)) 
    for train_index, test_index in sss:
        train_data = data.iloc[train_index] 
        test_data = data.iloc[test_index]

        for i in range(tree_count):
            bag = train_data.sample(frac=bag_proportion, replace = True, random_state=i)
            X_train, X_test = bag[predictors], test_data[predictors]
            y_train, y_test = bag["default payment next month"], test_data["default payment next month"]
            clf = DecisionTreeClassifier(random_state=1, min_samples_leaf=75) 
            clf.fit(X_train, y_train) 
            predictions.append(clf.predict_proba(X_test)[:,1])

    combined = np.sum(predictions, axis=0)/10 
    rounded= np.round(combined)
    z=accuracy_score(rounded, y_test)
    zz=roc_auc_score(rounded, y_test)
    zzz=precision_score(rounded, y_test)
    zzzz = recall_score(rounded, y_test)
    return (z,zz,zzz,zzzz) 
   

In [48]:
#load the data files in secured_files\k_anonymity directory
files = [join('data/secured_files/k_anonymity', f) for f in listdir('secured_files/k_anonymity') if isfile(join('secured_files/k_anonymity', f))]

In [54]:
acc=np.zeros(len(files)) #Accuracy 
auc=np.zeros(len(files)) #Area Under Curve 
pre=np.zeros(len(files)) #Accuracy 
rec=np.zeros(len(files)) #Area Under Curve 
for i in range(len(files)):
    data = pd.read_csv(files[i])
    acc[i], auc[i], pre[i], rec[i] = calc_RF(data)    

In [55]:
print 'ACC:\t',100*acc
print 'AUC:\t',100*auc
print 'PRE:\t',100*pre
print 'REC:\t',100*rec

ACC:	[ 81.9         80.96666667  80.7         82.16666667  80.66666667
  79.76666667  79.86666667  79.33333333  78.53333333]
AUC:	[ 76.06470165  73.99586362  72.72713806  76.52180001  72.20453876
  70.76921839  71.19906622  69.97643308  67.7578986 ]
PRE:	[ 33.58433735  29.96987952  32.22891566  34.93975904  35.69277108
  26.80722892  26.05421687  22.28915663  15.21084337]
REC:	[ 68.61538462  65.24590164  62.39067055  69.25373134  60.76923077
  59.53177258  60.48951049  58.73015873  55.49450549]
