#Set-Up

### SOURCE: **Mitigating Subgroup Unfairness in Machine Learning Classifiers: A Data-Driven Approach**  

This notebook will **output a .csv file named compas.csv** with the fairness and accuracy result metrics of the different remedy methods if **all cells are run**.  

The **Set-Up** section includes processing the data, running the baseline divexplorer functions and various helper functions used for **Identification** (and its relevant preprocessing methods) and helper functions for the **Remedy Algorithms**. 

The rest of the notebook is divided by the different **4 remedy algorithms** of: 

1.   Preferential Sampling
2.   Duplication/Oversampling
3.   Down-sampling/Undersampling
4. Massaging

Each of these Remedy Algorithms will have the 3 subgroups and the subsequent results:
1. Lattice
2. Leaf
3. Top

 For more information about the methods, refer to the paper: Mitigating Subgroup Unfairness in Machine Learning Classifiers: A Data-Driven Approach

##Imports and Dataset processing

In [120]:
# Import all neccessary packages 
import pandas as pd
import time
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn import metrics
from sklearn.metrics import accuracy_score, make_scorer
import copy
from sympy import Symbol
from sympy.solvers import solve
pd.options.mode.chained_assignment = None 
import csv

In [121]:
# read in the data, change url accordingly
url = "https://raw.githubusercontent.com/niceIrene/remedy/main/datasets/compas_numerical.csv"
data = pd.read_csv(url)
data

Unnamed: 0,age,charge,race,sex,#prior,stay,class,predicted
0,0,0,0,0,0,0,0,0
1,1,0,1,0,0,1,1,0
2,2,0,1,0,2,0,1,0
3,1,1,0,0,0,0,0,0
4,1,0,2,0,2,0,1,0
...,...,...,...,...,...,...,...,...
6167,2,0,1,0,0,0,0,0
6168,2,0,1,0,0,0,0,0
6169,0,0,0,0,0,0,0,0
6170,1,1,1,1,1,0,0,0


In [122]:
# get training and testing set

# column names that will be used as protected class
columns_compas = ['stay', 'age', 'charge', 'sex', '#prior', 'race']

#all colummns except the 
columns_all = columns_compas

#y_label 
compas_y = 'class'

# Used in get_train_test as helper function
def split_train_test(data,test_ratio):
    np.random.seed(42)
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices],data.iloc[test_indices]
    
"""
Split the data into test and train dataset and get the results
Input: 
data = whole dataset (dataframe)
split = float between 0.0-1.0 for size of test set
list_cols = which columns to use (list)
y_label = str containing column name of y_label 
"""
def get_train_test(data, split, list_cols, y_label):
  all_list = copy.deepcopy(list_cols)
  all_list.append(y_label)
  data = pd.DataFrame(data, columns = all_list)
  train_set,test_set = split_train_test(data,split)
  print(len(train_set), "train +", len(test_set), "test")
  train_x = pd.DataFrame(train_set, columns = list_cols)
  train_label = train_set[y_label]
  test_x = pd.DataFrame(test_set, columns = list_cols)
  test_label = test_set[y_label]
  return train_x, test_x, train_label, test_label, train_set, test_set

In [123]:
# Get test and train datasets
train_x, test_x, train_label, test_label, train_set, test_set  = get_train_test(data, 0.3, columns_all, compas_y)


4321 train + 1851 test


In [124]:

"""
Model Settings and Other Global Variables
"""

# The minimum size of the group (used in the run algorithm cells)
filter_count = 30

scoring = make_scorer(accuracy_score)

# #####################

#Logisitic Regression Settings

# #####################
param_gridlg = {
    'penalty': ['l1', 'l2'],
    'C': [0.01, 0.1, 1, 10, 100]
}
logreg = LogisticRegression(random_state=42, max_iter=1000)
gridlg = GridSearchCV(logreg, param_grid=param_gridlg, scoring=scoring, cv=5)


# #####################

#Decision Tree Classifier Settings

# #####################
param_griddt = {
    'max_depth': [2, 5, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
dt = DecisionTreeClassifier(random_state=42)
griddt = GridSearchCV(dt, param_grid=param_griddt, scoring=scoring, cv=5)

# #####################

#Random Forest Classifier Settings

# #####################
param_gridrf = {'criterion': ['gini', 'entropy'], 'max_depth': [10, 20, 30, 40, 50, 100], 'random_state':[17]}
rf = RandomForestClassifier(random_state=42)
gridrf = GridSearchCV(rf, param_grid=param_gridrf, scoring=scoring, cv=5)

# #####################

# SVM Settings

# #####################

clf = SVC(kernel='rbf', C=1.0, gamma = 'scale', random_state =42)




In [125]:
# train model for SVC, used for baseline results for divexplorer
clf.fit(train_x, train_label)
test_predict = clf.predict(test_x)
test_set['predicted'] = test_predict
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy is " , accuracy)

accuracy is  0.6661264181523501


## Divexplorer

In [126]:
#Install divexplorer if needed

In [127]:
pip install divexplorer

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [128]:
# Computes the fairness scoring in terms of the support for unfair group
# input of the dataframe from divexplorer and str of metric used
def fairness_score_computation(d, metrics):
    sum_of_score = 0
    for idx, row in d.iterrows():
      sum_of_score += row['support'] * row[metrics]
    return sum_of_score



In [129]:
# Gets the fpr of a single group 
# Takes input of the true and predicted values as list
def fpr_onegroup(true, predict):
    fp = 0
    tn = 0
    for i in range(len(true)):
        if (true[i] == 0 and predict[i] == 1):
            fp += 1 
        if(true[i] == 0 and predict[i] == 0):
            tn += 1
    return fp/(fp+tn)

# Gets the fpr of a single group 
# Takes input of the true and predicted values as list
def fnr_onegroup(true, predict):
    fn = 0
    tp = 0
    for i in range(len(true)):
        if (true[i] == 1 and predict[i] == 0):
            fn += 1 
        if(true[i] == 1 and predict[i] == 1):
            tp += 1
    return fn/(fn+tp)

In [130]:
# run divexplorer to find unfair groups
from divexplorer.FP_DivergenceExplorer import FP_DivergenceExplorer
from divexplorer.FP_Divergence import FP_Divergence


######
# Data-preprocessing before divexplorer results
class_map={'N': 0, 'P': 1}
columns_compas.extend([compas_y, "predicted"])

df = pd.DataFrame(test_set, columns = columns_compas)

columns_compas.remove(compas_y)
columns_compas.remove('predicted')

######

#Support metrics for the group must be greater than 0.1
min_sup=0.1

fp_diver=FP_DivergenceExplorer(df,compas_y, "predicted", class_map=class_map)
FP_fm=fp_diver.getFrequentPatternDivergence(min_support=min_sup, metrics=["d_fpr", "d_fnr", "d_accuracy"])

fp_divergence_fpr=FP_Divergence(FP_fm, "d_fpr")
fp_divergence_fnr=FP_Divergence(FP_fm, "d_fnr")
fp_divergence_acc=FP_Divergence(FP_fm, "d_accuracy")

INFO_VIZ=["support", "itemsets",  fp_divergence_fpr.metric, fp_divergence_fpr.t_value_col]
INFO_VIZ2=["support", "itemsets",  fp_divergence_fnr.metric, fp_divergence_fnr.t_value_col]
INFO_VIZ3=["support", "itemsets",  fp_divergence_acc.metric, fp_divergence_acc.t_value_col]

#Setting for output of divexplorer results
eps=0.01
K=1000

d = fp_divergence_fpr.getDivergence(th_redundancy=eps)[INFO_VIZ].head(K)
d2 = fp_divergence_fnr.getDivergence(th_redundancy=eps)[INFO_VIZ2].head(K)
d3 = fp_divergence_acc.getDivergence(th_redundancy=eps)[INFO_VIZ3].head(K)



pd.options.display.max_rows = 200
d = fp_divergence_fpr.getDivergence(th_redundancy=0)[INFO_VIZ].head(K)
# summerization

d = fp_divergence_fpr.getDivergence(th_redundancy=eps)[INFO_VIZ].head(K)
d= d[d['d_fpr'] > 0]
d2= d2[d2['d_fnr'] > 0]
d3= d3[d3['d_accuracy'] > 0]

dfpr = fairness_score_computation(d, 'd_fpr')
dfnr = fairness_score_computation(d2, 'd_fnr')
dacc = fairness_score_computation(d3, 'd_accuracy')

print("dfpr: ", dfpr)
print("dfnr: ", dfnr)
print("dacc: ", dacc)
print()

#print the d_fpr dataframe results and the most unfair group
d.head(), list(d["itemsets"].iloc[0])

dfpr:  1.194963899198641
dfnr:  3.1294379205954597
dacc:  0.1642495580381887



(      support                          itemsets     d_fpr  t_value_fp
 28   0.289573                        (#prior=2)  0.745098   50.438466
 105  0.136143          (charge=0, sex=0, age=2)  0.323529    6.417386
 61   0.183684                    (sex=0, age=2)  0.269236    6.205293
 148  0.104268  (charge=0, sex=0, age=2, stay=0)  0.256462    4.722968
 75   0.162075                 (charge=0, age=2)  0.252911    5.513975,
 ['#prior=2'])

In [131]:
# If save results to a new file: compas.csv, use this cell
with open('compas.csv', 'w', newline='') as file:
  writer = csv.writer(file)
  writer.writerow(["Dataset","Remedy", "Algorithm","d_fpr","d_fnr", "d_acc", "model_acc"])
  writer.writerow(["COMPAS", "Original", "SVM", dfpr, dfnr, dacc, accuracy])

In [132]:
# # If file: compas.csv already exists, use this cell

# with open('compas.csv', 'a+', newline='') as file:
#   writer = csv.writer(file)
#   writer.writerow(["COMPAS", "Original", "SVM", dfpr, dfnr, dacc, accuracy])

## Helper Functions

These functions will be used within different remedy methods 

### General Helper Functions


In [133]:
import itertools

# give list_parse more parameters if smaller subset is desired
# unnecessary to use otherwise
def get_unfair_group(list_parse, entire = 1):
  unfair_group = []
  unfair_dict = {}
  names = []
  for col in columns_compas:
    found = False
    for item in list_parse:
      attr_given = item.split("=")[0]
      if col == attr_given:
        unfair_group.append(int(item.split("=")[1]))
        names.append(attr_given)
        unfair_dict[attr_given] = int(item.split("=")[1])
        found = True
  # if use the entire dataset (usually the case)
  if entire:
    return unfair_group, names, columns_compas, unfair_dict
  return unfair_group, names, list(set(columns_compas).symmetric_difference(set(names))), unfair_dict

# Return the combinations of the skew_candidates in dict form
# Can use the output from get_unfair_group() to create filtered dict of combos
def candidate_groups(skew_candidates, unfair_dict, ordering, names):
  candidate_combos = []
  candidate_ind = {}
  num = 0
  for i in range(len(skew_candidates)+1):
    temp_candidate = list(itertools.combinations(skew_candidates, i))
    for tc in temp_candidate:
      candidate_ind[num] = list(tc)
      num += 1
  return candidate_ind

  

In [134]:
#Runs functions to output the correct format for all candidate group combinations

unfair_group, unfair_names, skew_candidates, unfair_dict = get_unfair_group([])
all_names = candidate_groups(skew_candidates, unfair_dict, columns_compas, unfair_names)

# put into list form
all_names_lst = list(all_names.keys())[len(columns_compas)+1:]

# reverse list for desired format 
all_names_lst.reverse()

In [135]:

# Create a new subset dataframe using the desired unfair group candidates 
# Dataframe will have a new category "cnt", groupby counts needed for filtering
# and computing differences between groups 

def get_temp_g(train_set, names, y_label):
  names2 = copy.deepcopy(names)
  names2.append(y_label)
  temp = train_set[names2]
  temp['cnt'] = 0
  temp_g = temp.groupby(names)['cnt'].count().reset_index()
  return temp, temp_g

In [136]:
# Create a new subset dataframe using the desired unfair group candidates 
# Dataframe will have a new category "cnt", groupby counts that are summed up
# needed for filtering and computing differences between groups 
def get_temp(train_set, names, y_label):
  names2 = copy.deepcopy(names)
  names2.append(y_label)
  temp = train_set[names2]
  temp['cnt'] = 0
  temp2 = temp.groupby(names2)['cnt'].count().reset_index()
  temp2['cnt'].sum()
  return temp2, names
temp2, names = get_temp(train_set, columns_compas, compas_y)

# Example output
temp2

Unnamed: 0,stay,age,charge,sex,#prior,race,class,cnt
0,0,0,0,0,0,0,0,11
1,0,0,0,0,0,0,1,4
2,0,0,0,0,0,1,0,11
3,0,0,0,0,0,1,1,9
4,0,0,0,0,0,2,0,40
...,...,...,...,...,...,...,...,...
483,2,2,1,0,1,1,0,1
484,2,2,1,0,1,1,1,1
485,2,2,1,1,0,1,0,1
486,2,2,1,1,2,1,0,1


In [137]:

"""
Function used to find all groups that belong in "top" group
(Groups of size 2)
"""
def find_top(all_names):
  all_names_lst_top = []
  for all in range(len(all_names)):
    if len(all_names[all]) == 2:
      all_names_lst_top.append(all)
  return all_names_lst_top

In [138]:
"""
Finds all of the closest neighbors by one degree to a group 
Naive Method
"""
def get_one_degree_neighbors(temp2, names, group_lst):
    result = []
    for i in range(len(group_lst)):
        d = copy.copy(temp2)
        for k in range(len(group_lst)):
            if k != i:
                d = d[d[names[k]] == group_lst[k]]
            else:
                d = d[d[names[k]] != group_lst[k]]
        result.append(d)
    return result

In [139]:
"""
 After finding the closest neighbors, compute the 
 pos/neg ratio of these neighbors 
 Naive method
"""
def compute_neighbors(group_lst, result):
    # compute the ratio of positive and negative records
    start2 = time.time()
    pos = 0
    neg = 0 
    for r in result:
        total  = r['cnt'].sum()
        r = r[r[compas_y] == 1]
        pos += r['cnt'].sum()
        neg += total - r['cnt'].sum()
    if(neg == 0):
        return (pos, neg, -1)
    end2 = time.time()
    return(pos, neg, pos/neg)

In [140]:
"""
Used in Preferential Sampling and Massaging Algorithms
These algorithms choose to add and remove records depending on if
more positive or negative values or less positive or negative value are needed  

"""

def compute_diff_add_and_remove(group_lst, temp2, need_positive_or_negative, label, names):
    d = copy.copy(temp2)
    for i in range(len(group_lst)):
        d = d[d[names[i]] == group_lst[i]]
    total =  d['cnt'].sum()
    # Total here was 0: here, errors when this is commented out
    if total == 0:
      return -1
    d = d[d[label] == 1]
    pos = d['cnt'].sum()

    neg = total - pos
    result = get_one_degree_neighbors(temp2,names, group_lst)
    neighbors = compute_neighbors(group_lst, result)
    if(need_positive_or_negative == 1):
        # need pos
        x = Symbol('x')
        try:
          diff = solve((pos + x)/ (neg - x) - neighbors[2])[0]
        except:
          return -1       
    else:
        #need negative
        x = Symbol('x')
        try:
          diff = solve((pos - x)/ (neg + x) - neighbors[2])[0]
        except:
          return -1
    return diff

In [141]:
"""
Used in Duplication Algorithms
This algorithm choose to add records depending on if
more positive or negative values are needed  

"""
def compute_diff_add(group_lst, temp2, names, label_y, need_positive_or_negative):
    d = copy.copy(temp2)
    for i in range(len(group_lst)):
        d = d[d[names[i]] == group_lst[i]]
    total =  d['cnt'].sum()
    d = d[d[label_y] == 1]
    pos = d['cnt'].sum()
    neg = total - pos
    result = get_one_degree_neighbors(temp2, names, group_lst)
    neighbors = compute_neighbors(group_lst, result)
    if(need_positive_or_negative == 1):
        # need pos
        x = Symbol('x')
        try:
          diff = solve((pos + x)/ neg -  neighbors[2])[0]
        except:
          return -1
        print(neighbors[2], pos, neg, diff)
    else:
        #need negative
        x = Symbol('x')
        try:
          diff = solve(pos/ (neg + x) -  neighbors[2])[0]
        except:
          return -1
    print(neighbors[2], pos, neg, diff)
    return diff

""" 
Used in Downsampling Algorithms
This algorithm choose to remove records depending on if
more/less positive or negative values are needed  

"""
def compute_diff_remove(group_lst, temp2, names, label_y, need_positive_or_negative):
    d = copy.copy(temp2)
    for i in range(len(group_lst)):
        d = d[d[names[i]] == group_lst[i]]
    total =  d['cnt'].sum()
    d = d[d[label_y] == 1]
    pos = d['cnt'].sum()
    neg = total - pos
    result = get_one_degree_neighbors(temp2, names, group_lst)
    neighbors = compute_neighbors(group_lst, result)
    if(need_positive_or_negative == 1):
        # need pos, remove some neg
        x = Symbol('x')
        try:
          diff = solve( pos/ (neg - x) -  neighbors[2])[0]
        except:
          return -1
        print(neighbors[2], pos, neg, diff)
    else:
        #need negative
        x = Symbol('x')
        try:
          diff = solve((pos -x )/ neg -  neighbors[2])[0]
        except:
          return -1
        print(neighbors[2], pos, neg, diff)
    return diff


In [142]:
from divexplorer.FP_DivergenceExplorer import FP_DivergenceExplorer

"""
Generalized function to output the results from the divexplorer functions
prints the model accuracy, d_fpr, d_fnr, d_acc scores and writes to csv
"""
def div_results(db, remedy, algo):
  columns_compas.extend([compas_y, "predicted"])

  df = pd.DataFrame(test_set, columns = columns_compas)

  columns_compas.remove(compas_y)
  columns_compas.remove('predicted')
  class_map={'N': 0, 'P': 1}
  
  min_sup=0.1


  fp_diver=FP_DivergenceExplorer(df,compas_y, "predicted", class_map=class_map)
  FP_fm=fp_diver.getFrequentPatternDivergence(min_support=min_sup, metrics=["d_fpr", "d_fnr", "d_accuracy"])
  from divexplorer.FP_Divergence import FP_Divergence
  fp_divergence_fpr=FP_Divergence(FP_fm, "d_fpr")
  fp_divergence_fnr=FP_Divergence(FP_fm, "d_fnr")
  fp_divergence_acc=FP_Divergence(FP_fm, "d_accuracy")

  INFO_VIZ=["support", "itemsets",  fp_divergence_fpr.metric, fp_divergence_fpr.t_value_col]
  INFO_VIZ2=["support", "itemsets",  fp_divergence_fnr.metric, fp_divergence_fnr.t_value_col]
  INFO_VIZ3=["support", "itemsets",  fp_divergence_acc.metric, fp_divergence_acc.t_value_col]

  K=200
  d = fp_divergence_fpr.getDivergence(th_redundancy=0)[INFO_VIZ].head(K)
  # summerization
  eps=0.01

  d = fp_divergence_fpr.getDivergence(th_redundancy=eps)[INFO_VIZ].head(K)
  d2 = fp_divergence_fnr.getDivergence(th_redundancy=eps)[INFO_VIZ2].head(K)
  d3 = fp_divergence_acc.getDivergence(th_redundancy=eps)[INFO_VIZ3].head(K)

  d= d[d['d_fpr'] > 0]
  d2= d2[d2['d_fnr'] > 0]
  d3= d3[d3['d_accuracy'] > 0]
  
  dfpr = fairness_score_computation(d, 'd_fpr')
  dfnr = fairness_score_computation(d2, 'd_fnr')
  dacc = fairness_score_computation(d3, 'd_accuracy')

  print("dfpr", dfpr)
  print("dfnr", dfnr)
  print("dacc", dacc)
  accuracy = accuracy_score(test_label, test_set['predicted'])
  print("accuracy is " , accuracy)

  writelist = [db,remedy,algo, dfpr, dfnr, dacc, accuracy]
  with open('compas.csv', 'a', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(writelist)
  # print(writelist)
  return d,d2,d3

### Optimized Helper Function

Used for Identification Method of Pos/Neg groups for all Algorithms

In [143]:
# helper function for optimized
"""
 After finding the closest neighbors, compute the 
 pos/neg ratio of these neighbors 
 Optimized method
"""
def compute_neighbors_opt(group_lst,lst_of_counts, pos, neg):
    times = len(group_lst)
    pos_cnt = 0
    neg_cnt = 0
    for i in range(times):
        df_groupby = lst_of_counts[i]
        temp_group_lst_pos = copy.copy(group_lst)
        temp_group_lst_neg = copy.copy(group_lst)
        del temp_group_lst_pos[i]
        del temp_group_lst_neg[i]
        # count positive
        temp_group_lst_pos.append(1)
        group_tuple_pos = tuple(temp_group_lst_pos)
        if group_tuple_pos in df_groupby.keys():
            pos_cnt += df_groupby[group_tuple_pos]
        else:
            pos_cnt += 0
        # count negative
        temp_group_lst_neg.append(0)
        group_tuple_neg = tuple(temp_group_lst_neg)
        if group_tuple_neg in df_groupby.keys():
            neg_cnt += df_groupby[group_tuple_neg]
        else:
            neg_cnt += 0
    pos_val = pos_cnt - times* pos
    neg_val = neg_cnt - times* neg

    if neg_val == -1 or (neg_val == 0 and pos_val == 0):
        return (pos_val, neg_val, -1)
    if pos_val == 0 or neg_val == 0:
        return (pos_val, neg_val, 0)

    return (pos_val, neg_val, pos_val/neg_val)

In [144]:
# get the list of neighbors
"""
Finds all of the closest neighbors by one degree to a group 
Optimized Method
"""
def get_one_degree_neighbors_opt(group_lst):
    start1 = time.time()
    result = []
    for i in range(len(group_lst)):
        d = copy.copy(group_lst)
        d[i] = 'x'
        result.append(d)
    end1 = time.time()
    return result

In [145]:
"""
Function to determine based on the neighbors if the group is positive or negative
"""
def determine_problematic_opt(group_lst, names, temp2, lst_of_counts, label, threshold= 0.3):
    #0: ok group, 1: need negative records, 2: need positive records
    d = copy.copy(temp2)
    for i in range(len(group_lst)):
        d = d[d[names[i]] == group_lst[i]]
    total =  d['cnt'].sum()
    d = d[d[label] == 1]
    pos = d['cnt'].sum()
    neg = total - pos
    neighbors = compute_neighbors_opt(group_lst,lst_of_counts, pos, neg)
    if(neighbors[2] == -1):
        # there is no neighbors
        return 0
    if(total > 30):
        # need to be large enough, need to adjust with different datasets.
        if neg == 0:
            if (pos > neighbors[2]):
                return 1
            if(pos <= neighbors[2]):
                return 0
        if (pos/(neg) - neighbors[2] > threshold):
            # too many positive records
            return 1
        if (neighbors[2] - pos/(neg) > threshold):
            return 2
    return 0

In [146]:
"""
Function to determine based on the neighbors if the group is positive or negative
"""
def compute_problematic_opt(temp2, temp_g, names, label, lst_of_counts):
    need_pos = []
    need_neg = []
    for index, row in temp_g.iterrows():
        group_lst = []
        for n in names:
            group_lst.append(row[n])
        problematic = determine_problematic_opt(group_lst, names, temp2, lst_of_counts,label)
        if(problematic == 1):
            if group_lst not in need_neg:
                need_neg.append(group_lst)
        if(problematic == 2):
            if group_lst not in need_pos:
                need_pos.append(group_lst)
    return need_pos, need_neg

In [147]:
# build the list of X00
def compute_lst_of_counts(temp, names, label):
    # get the list of group-by attributes
    lst_of_counts = []
    for i in range(len(names)):
        grp_names = copy.copy(names)
        del grp_names[i]
        grp_names.append(label)
        temp_df = temp.groupby(grp_names)['cnt'].count()
        lst_of_counts.append(temp_df)
    return lst_of_counts
    
def get_tuple(group_lst):
    return tuple(group_lst) 


# Preferential Sampling 

In [148]:
#####################################

#  Preferential Sampling Algorithm

#####################################
def pref_sampling_opt(train_set, cols_given, label, need_pos, need_neg):
    if len(need_pos)+ len(need_neg) > 0:
        temp_train_x = pd.DataFrame(train_set, columns = columns_all)
        temp_train_label = pd.DataFrame(train_set, columns = [label])
        temp_train_label = temp_train_label[label]
        temp_train_label = temp_train_label.astype('int')
        mnb = MultinomialNB()
        mnb = mnb.fit(temp_train_x, temp_train_label)
        probs = mnb.predict_proba(temp_train_x)[:,0]
        train_set["prob"] = abs(probs - 0.5)
    new_train_set = pd.DataFrame(columns = list(train_set.columns))
    updated_pos = 0
    for i in need_pos:
        # needs to updated more positive records
        temp_df = copy.deepcopy(train_set)
        for n in range(len(i)):
          temp_df = temp_df[temp_df[cols_given[n]] == i[n]]
        # update the skew and diff
        idx = list(temp_df.index)
        train_set.loc[idx, 'skewed'] = 1
        idx_pos = list(temp_df[(getattr(temp_df, label) == 1)].index)
        if(len(idx_pos) == 0):
          # if there is no positive
          idx_neg = list(temp_df[(getattr(temp_df, label) == 0)].index)
          neg_ranked = train_set.loc[idx_neg].sort_values(by="prob", ascending=True)
          new_train_set = pd.concat([new_train_set, neg_ranked], ignore_index=True)
          continue
        idx_neg = list(temp_df[(getattr(temp_df, label) == 0)].index)
        pos_ranked = train_set.loc[idx_pos].sort_values(by="prob", ascending=True)
        neg_ranked = train_set.loc[idx_neg].sort_values(by="prob", ascending=True)
        diff = compute_diff_add_and_remove(i, temp2,  1, compas_y, names)
        if diff == -1:
          new_train_set = pd.concat([new_train_set, pos_ranked], ignore_index=True)
          new_train_set = pd.concat([new_train_set, neg_ranked], ignore_index=True)
          continue
        train_set.loc[idx, 'diff'] = int(diff)
        cnt = int(train_set.loc[idx_pos[0]]["diff"])
        updated_pos += cnt * 2 
        # add more records when there are not enough available records
        new_train_set = pd.concat([new_train_set, pos_ranked], ignore_index=True)
        temp_cnt = cnt
        if len(pos_ranked) >= temp_cnt:
            new_train_set = pd.concat([new_train_set,pos_ranked[0:cnt]], ignore_index=True)
        else:
            while(temp_cnt > 0 ):
                new_train_set = pd.concat([new_train_set,pos_ranked[0:temp_cnt]], ignore_index=True) 
            # duplicate the dataframe
                temp_cnt = temp_cnt - len(pos_ranked)
        # duplicate the top cnt records from the pos
        # remove the top cnt records from the neg
        if cnt == 0:
          new_train_set = pd.concat([new_train_set, neg_ranked], ignore_index=True)
        else:
          new_train_set = pd.concat([new_train_set, neg_ranked[cnt-1:-1]], ignore_index=True)
    print("updated {} positive records".format(str(updated_pos)))
    updated_neg = 0
    # adding more records to the need_neg set
    for i in need_neg:
        # list of idx belongs to this group
        temp_df = copy.deepcopy(train_set)
        for n in range(len(i)):
          temp_df = temp_df[temp_df[cols_given[n]] == i[n]]
        # update the skew and diff
        idx = list(temp_df.index)
        train_set.loc[idx, 'skewed'] = 1
        idx_pos = list(temp_df[(getattr(temp_df, label) == 1)].index)
        idx_neg = list(temp_df[(getattr(temp_df, label) == 0)].index)
        if(len(idx_neg) == 0):
          pos_ranked = train_set.loc[idx_pos].sort_values(by="prob", ascending=True)
          new_train_set = pd.concat([new_train_set, pos_ranked], ignore_index=True)
          continue
        pos_ranked = train_set.loc[idx_pos].sort_values(by="prob", ascending=True)
        neg_ranked = train_set.loc[idx_neg].sort_values(by="prob", ascending=True)
        diff = compute_diff_add_and_remove(i, temp2, 0, compas_y, names)
        if diff == -1:
          new_train_set = pd.concat([new_train_set, neg_ranked], ignore_index=True)
          new_train_set = pd.concat([new_train_set, pos_ranked], ignore_index=True)
          continue
        train_set.loc[idx, 'diff'] = int(diff)
        cnt = int(train_set.loc[idx_pos[0]]["diff"])
        updated_neg += cnt * 2 
        # add more records when there are not enough available records
        new_train_set = pd.concat([new_train_set, neg_ranked], ignore_index=True)
        temp_cnt = cnt
        if len(neg_ranked) >= temp_cnt:
            new_train_set = pd.concat([new_train_set,neg_ranked[0:cnt]], ignore_index=True)
        else:
            while(temp_cnt > 0 ):
                new_train_set = pd.concat([new_train_set,neg_ranked[0:temp_cnt]], ignore_index=True) 
            # duplicate the dataframe
                temp_cnt = temp_cnt - len(neg_ranked)
        # duplicate the top cnt records from the pos
        # remove the top cnt records from the neg
        if cnt ==0:
          new_train_set = pd.concat([new_train_set, pos_ranked], ignore_index=True)       
        else:
          new_train_set = pd.concat([new_train_set, pos_ranked[cnt-1:-1]], ignore_index=True)
   
    print("updated {} negative records".format(str(updated_neg)))
    # add the other irrelavant items:
    idx_irr = list(train_set[train_set['skewed'] == 0].index)
    irr_df = train_set.loc[idx_irr]
    new_train_set = pd.concat([new_train_set, irr_df], ignore_index=True)
    print("The new dataset contains {} rows.".format(str(len(new_train_set))))
    new_train_set.reset_index()
    return new_train_set



## Run Algorithm Lattice


In [149]:
#get all of the candidate groups possible with the combos and names

new_train_data = copy.deepcopy(train_set)

#iterate over all the names to get the temp2 df for each name
for a in all_names_lst:
  print("//////"+str(a)+"///////")

  # Group pre-processing methods
  temp2, names = get_temp(new_train_data, all_names[a], compas_y)
  temp, temp_g = get_temp_g(new_train_data, names, compas_y)
  temp_g = temp_g[temp_g['cnt'] > filter_count]
  lst_of_counts = compute_lst_of_counts(temp, names, compas_y)

  #call opt identification function  
  need_pos, need_neg = compute_problematic_opt(temp2, temp_g, names, compas_y, lst_of_counts)

  print("The sets of need pos and neg are")
  print(need_pos)
  print(need_neg)
  new_train_data['skewed'] = 0
  new_train_data["diff"] = 0
  print("started pref sampling")

  #call opt preferential sampling algorithm
  new_train_data = pref_sampling_opt(new_train_data, names, compas_y, need_pos, need_neg)

  print(new_train_data[compas_y].value_counts())

#create new dataframe using the results of preferential sampling 
new_train_x = pd.DataFrame(new_train_data, columns = columns_all)
new_train_label = pd.DataFrame(new_train_data, columns = [compas_y])
new_train_label = new_train_label[compas_y]
new_train_label = new_train_label.astype('int')

//////63///////
The sets of need pos and neg are
[[0, 0, 0, 0, 0, 2], [0, 0, 0, 0, 1, 1], [0, 1, 0, 0, 0, 1], [0, 1, 0, 1, 2, 1], [1, 1, 0, 0, 1, 1]]
[[0, 1, 0, 0, 2, 1], [0, 1, 1, 0, 2, 1], [0, 1, 1, 0, 2, 2], [0, 2, 0, 0, 1, 1], [0, 2, 1, 0, 0, 1], [1, 1, 0, 0, 2, 1], [1, 1, 0, 0, 2, 2], [2, 1, 0, 0, 2, 1]]
started pref sampling
updated 104 positive records
updated 218 negative records
The new dataset contains 4321 rows.
0    2400
1    1921
Name: class, dtype: int64
//////62///////
The sets of need pos and neg are
[[0, 0, 0, 1, 2]]
[[0, 0, 0, 2, 2], [1, 0, 0, 2, 2], [1, 0, 1, 1, 2], [1, 0, 1, 2, 1], [1, 1, 0, 2, 1], [2, 0, 0, 1, 2], [2, 0, 0, 2, 1], [2, 1, 0, 1, 1]]
started pref sampling
updated 24 positive records
updated 142 negative records
The new dataset contains 4321 rows.
0    2459
1    1862
Name: class, dtype: int64
//////61///////
The sets of need pos and neg are
[[0, 0, 1, 0, 1]]
[[1, 0, 0, 1, 1], [1, 0, 0, 2, 1], [1, 0, 0, 2, 2], [1, 1, 0, 2, 1], [2, 0, 0, 2, 1]]
started p

### Preferential Sampling Results Lattice

In [150]:
print()
print("dt")
griddt.fit(new_train_x, new_train_label)
print("best", griddt.best_score_)
test_predict = griddt.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
r,r2,r3 = div_results("COMPAS","Preferential Sampling-Lattice","DT")

print()
print("rf")
gridrf.fit(new_train_x, new_train_label)
print("best", gridrf.best_score_)
test_predict = gridrf.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
r,r2,r3 = div_results("COMPAS","Preferential Sampling-Lattice","RF")

print()
print("logistic")
gridlg.fit(new_train_x, new_train_label)
print("best", gridlg.best_score_)
test_predict = gridlg.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
r,r2,r3 = div_results("COMPAS","Preferential Sampling-Lattice","LG")

print()
print("svm")
clf.fit(new_train_x, new_train_label)
test_predict = clf.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
s,s2,s3 = div_results("COMPAS","Preferential Sampling-Lattice","SVM")


dt
best 0.6753066795118818
fpr and fnr
0.0
1.0
accuracy
0.6661264181523501
dfpr 0
dfnr 0
dacc 0.8613400089486869
accuracy is  0.5510534846029174

rf
best 0.6387417041318775
fpr and fnr
0.0696078431372549
0.9121540312876053
accuracy
0.5510534846029174
dfpr 0.30472896327769233
dfnr 0.589635534178638
dacc 0.825726219798547
accuracy is  0.5521339816315505

logistic


25 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
25 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/linear_model/_logistic.py", line 1461, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/linear_model/_logistic.py", line 447, in _check_solver
    raise ValueError(
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.

        nan 0.67530668        nan 0.6753

best 0.6753066795118818
fpr and fnr
0.0
1.0
accuracy
0.5521339816315505
dfpr 0
dfnr 0
dacc 0.8613400089486869
accuracy is  0.5510534846029174

svm
fpr and fnr
0.0
1.0
accuracy
0.5510534846029174
dfpr 0
dfnr 0
dacc 0.8613400089486869
accuracy is  0.5510534846029174


## Run Algorithm Leaf

In [151]:

#get all of the candidate groups possible with the combos and names

new_train_data = copy.deepcopy(train_set)

#iterate over all the names to get the temp2 df for each name
for a in [all_names_lst[0]]:
  print("/////////////")
  print(a)
  temp2, names = get_temp(new_train_data, all_names[a], compas_y)
  temp, temp_g = get_temp_g(new_train_data, names, compas_y)
  # only keep the records in temp_g that have a size > 30.
  temp_g = temp_g[temp_g['cnt'] > filter_count]
  lst_of_counts = compute_lst_of_counts(temp, names, compas_y)
   
  #id function
  need_pos, need_neg = compute_problematic_opt(temp2, temp_g, names, compas_y, lst_of_counts)

  print("The sets of need pos and neg are")
  print(need_pos)
  print(need_neg)
  new_train_data['skewed'] = 0
  new_train_data["diff"] = 0
  print("started pref sampling")

  #pref sampling function
  new_train_data = pref_sampling_opt(new_train_data, names, compas_y, need_pos, need_neg)

  print(new_train_data[compas_y].value_counts())
new_train_x = pd.DataFrame(new_train_data, columns = columns_all)
new_train_label = pd.DataFrame(new_train_data, columns = [compas_y])
new_train_label = new_train_label[compas_y]
new_train_label = new_train_label.astype('int')


/////////////
63
The sets of need pos and neg are
[[0, 0, 0, 0, 0, 2], [0, 0, 0, 0, 1, 1], [0, 1, 0, 0, 0, 1], [0, 1, 0, 1, 2, 1], [1, 1, 0, 0, 1, 1]]
[[0, 1, 0, 0, 2, 1], [0, 1, 1, 0, 2, 1], [0, 1, 1, 0, 2, 2], [0, 2, 0, 0, 1, 1], [0, 2, 1, 0, 0, 1], [1, 1, 0, 0, 2, 1], [1, 1, 0, 0, 2, 2], [2, 1, 0, 0, 2, 1]]
started pref sampling
updated 104 positive records
updated 218 negative records
The new dataset contains 4321 rows.
0    2400
1    1921
Name: class, dtype: int64


### Preferential Sampling Results Leaf

Output of all of the model results from Preferential Sampling-Leaf Algorithm

In [152]:
print()
print("dt")
griddt.fit(new_train_x, new_train_label)
print("best", griddt.best_score_)
test_predict = griddt.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
r,r2,r3 = div_results("COMPAS","Preferential Sampling-Leaf","DT")

print()
print("rf")
gridrf.fit(new_train_x, new_train_label)
print("best", gridrf.best_score_)
test_predict = gridrf.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
r,r2,r3 = div_results("COMPAS","Preferential Sampling-Leaf","RF")

print()
print("logistic")
gridlg.fit(new_train_x, new_train_label)
print("best", gridlg.best_score_)
test_predict = gridlg.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
r,r2,r3 = div_results("COMPAS","Preferential Sampling-Leaf","LG")

print()
print("svm")
clf.fit(new_train_x, new_train_label)
test_predict = clf.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
s,s2,s3 = div_results("COMPAS","Preferential Sampling-Leaf","SVM")


dt
best 0.6253492292870906
fpr and fnr
0.19509803921568628
0.48736462093862815
accuracy
0.5510534846029174
dfpr 1.830902622725156
dfnr 2.2985472926720902
dacc 0.15384678248590786
accuracy is  0.6736898973527823

rf
best 0.5785886319845857
fpr and fnr
0.2696078431372549
0.4464500601684717
accuracy
0.6736898973527823
dfpr 2.559271701477016
dfnr 3.511384012793787
dacc 0.19443371827864173
accuracy is  0.6509994597514857

logistic


25 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
25 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/linear_model/_logistic.py", line 1461, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/linear_model/_logistic.py", line 447, in _check_solver
    raise ValueError(
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.

        nan 0.63761106        nan 0.6376

best 0.6376110575893813
fpr and fnr
0.23627450980392156
0.4428399518652226
accuracy
0.6509994597514857
dfpr 3.0514811651636378
dfnr 3.156926896160785
dacc 0.15150687306436472
accuracy is  0.6709886547811994

svm
fpr and fnr
0.24705882352941178
0.43200962695547535
accuracy
0.6709886547811994
dfpr 1.0969663472342601
dfnr 2.8144272413525213
dacc 0.15150745680127928
accuracy is  0.6699081577525662


## Run Algorithm Top

In [153]:

#get all of the candidate groups possible with the combos and names

new_train_data = copy.deepcopy(train_set)

# Uses helper function to find all names in top group
all_names_lst_top = find_top(all_names)

#iterate over all the names to get the temp2 df for each name
for a in all_names_lst_top:
  print("/////////////")
  print(a)

  temp2, names = get_temp(new_train_data, all_names[a], compas_y)
  temp, temp_g = get_temp_g(new_train_data, names, compas_y)
  temp_g = temp_g[temp_g['cnt'] > filter_count]
  lst_of_counts = compute_lst_of_counts(temp, names, compas_y)
  
  # id function   
  need_pos, need_neg = compute_problematic_opt(temp2, temp_g, names, compas_y, lst_of_counts)

  print("The sets of need pos and neg are")
  print(need_pos)
  print(need_neg)
  new_train_data['skewed'] = 0
  new_train_data["diff"] = 0
  print("started pref sampling")

  # pref sampling function
  new_train_data = pref_sampling_opt(new_train_data, names, compas_y, need_pos, need_neg)

  print(new_train_data[compas_y].value_counts())
new_train_x = pd.DataFrame(new_train_data, columns = columns_all)
new_train_label = pd.DataFrame(new_train_data, columns = [compas_y])
new_train_label = new_train_label[compas_y]
new_train_label = new_train_label.astype('int')


/////////////
7
The sets of need pos and neg are
[[0, 0]]
[[0, 2], [1, 1], [1, 2], [2, 0], [2, 1], [2, 2]]
started pref sampling
updated 256 positive records
updated 422 negative records
The new dataset contains 4321 rows.
0    2426
1    1895
Name: class, dtype: int64
/////////////
8
The sets of need pos and neg are
[[0, 1], [2, 1]]
[]
started pref sampling
updated 302 positive records
updated 0 negative records
The new dataset contains 4321 rows.
0    2275
1    2046
Name: class, dtype: int64
/////////////
9
The sets of need pos and neg are
[[0, 1]]
[]
started pref sampling
updated 254 positive records
updated 0 negative records
The new dataset contains 4321 rows.
1    2173
0    2148
Name: class, dtype: int64
/////////////
10
The sets of need pos and neg are
[[0, 0], [1, 1], [2, 1], [2, 2]]
[[0, 2], [1, 2]]
started pref sampling
updated 1166 positive records
updated 812 negative records
The new dataset contains 4321 rows.
1    2350
0    1971
Name: class, dtype: int64
/////////////
11
T

### Preferential Sampling Results Top

In [154]:
print()
print("dt")
griddt.fit(new_train_x, new_train_label)
print("best", griddt.best_score_)
test_predict = griddt.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
r,r2,r3 = div_results("COMPAS","Preferential Sampling-Top","DT")

print()
print("rf")
gridrf.fit(new_train_x, new_train_label)
print("best", gridrf.best_score_)
test_predict = gridrf.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
r,r2,r3 = div_results("COMPAS","Preferential Sampling-Top","RF")

print()
print("logistic")
gridlg.fit(new_train_x, new_train_label)
print("best", gridlg.best_score_)
test_predict = gridlg.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
r,r2,r3 = div_results("COMPAS","Preferential Sampling-Top","LG")

print()
print("svm")
clf.fit(new_train_x, new_train_label)
test_predict = clf.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
s,s2,s3 = div_results("COMPAS","Preferential Sampling-Top","SVM")


dt
best 0.49393197388139587
fpr and fnr
1.0
0.0
accuracy
0.6699081577525662
dfpr 0
dfnr 0
dacc 1.0206966841700187
accuracy is  0.44894651539708263

rf
best 0.42906658103189893
fpr and fnr
0.7362745098039216
0.22021660649819494
accuracy
0.44894651539708263
dfpr 1.7926356268395065
dfnr 1.0308611976690114
dacc 0.35809778819164434
accuracy is  0.495407887628309

logistic


25 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
25 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/linear_model/_logistic.py", line 1461, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/linear_model/_logistic.py", line 447, in _check_solver
    raise ValueError(
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.

        nan 0.39969493        nan 0.3996

best 0.5358028259473346
fpr and fnr
0.9823529411764705
0.006016847172081829
accuracy
0.495407887628309
dfpr 0.02811516827152262
dfnr 0.005421244405038754
dacc 0.9706937800788688
accuracy is  0.4559697460831983

svm
fpr and fnr
0.7509803921568627
0.17328519855595667
accuracy
0.4559697460831983
dfpr 1.5872818502278114
dfnr 0.8451614022635038
dacc 0.5629494591823418
accuracy is  0.5083738519719071


# Duplication

In [155]:
def round_int(x):
    if x in [float("-inf"),float("inf")]: return 0
    return int(round(x))
    

def make_duplicate(d, group_lst, diff, label_y, names, need_positive_or_negative):

    selected = copy.deepcopy(d)

    for i in range(len(group_lst)):
        att_name = names[i]
        selected = selected[(selected[att_name] == group_lst[i])]
    selected = selected[(selected[label_y] == need_positive_or_negative)]

    if len(selected) == 0:
        return pd.DataFrame()

    # randomly generated diff samples:
    while(len(selected) < diff):
        # duplicate the dataframe
        select_copy = selected.copy(deep=True)
        selected = pd.concat([selected, select_copy])

        # the number needed is more than the not needed numbers.

    generated = selected.sample(n = diff, replace = False, axis = 0)

    return generated 


def naive_duplicate(d, temp2, names, need_pos, need_neg, label_y):
    # add more records for all groups
    # The smote algorithm to boost the coverage
    for r in need_pos:
   
    # add more positive records
        # determine how many points to add
 
        diff = compute_diff_add(r, temp2, names, label_y, 1)
        if diff == -1:
          continue
        diff = round_int(diff)
        # add more records
        print("Adding " + str(diff) +" positive records")
        samples_to_add = make_duplicate(d, r, diff, label_y, names, need_positive_or_negative = 1)
        d = pd.concat([d, samples_to_add], ignore_index=True) 
    for k in need_neg:

   
        diff = compute_diff_add(k, temp2, names, label_y, need_positive_or_negative = 0)
        if diff == -1:
          continue
        diff = round_int(diff)
        print("Adding " + str(diff) +" negative records")
        samples_to_add = make_duplicate(d, k, diff, label_y, names, need_positive_or_negative = 0)
        d = pd.concat([d, samples_to_add], ignore_index=True)
    return d

## Run Algorithm Lattice

In [156]:
#get all of the candidate groups possible with the combos and names

new_train_data = copy.deepcopy(train_set)

#iterate over all the names to get the temp2 df for each name
for a in all_names_lst:
  print("?????/////")
  print(a)
  temp2, names = get_temp(new_train_data, all_names[a], compas_y)
  temp, temp_g = get_temp_g(new_train_data, names, compas_y)
  temp_g = temp_g[temp_g['cnt'] > filter_count]
  lst_of_counts = compute_lst_of_counts(temp, names, compas_y)
  
  # id function   
  need_pos, need_neg = compute_problematic_opt(temp2, temp_g, names, compas_y, lst_of_counts)
  
  print("The sets of need pos and neg are")
  print(need_pos)
  print(need_neg)
  new_train_data['skewed'] = 0
  new_train_data["diff"] = 0
 
 # Duplication function
  new_train_data = naive_duplicate(new_train_data, temp2, names, need_pos, need_neg, compas_y)
 
  print("label y ", new_train_data[compas_y].value_counts())
new_train_x = pd.DataFrame(new_train_data, columns = columns_all)
new_train_label = pd.DataFrame(new_train_data, columns = [compas_y])
new_train_label = new_train_label[compas_y]
new_train_label = new_train_label.astype('int')

?????/////
63
The sets of need pos and neg are
[[0, 0, 0, 0, 0, 2], [0, 0, 0, 0, 1, 1], [0, 1, 0, 0, 0, 1], [0, 1, 0, 1, 2, 1], [1, 1, 0, 0, 1, 1]]
[[0, 1, 0, 0, 2, 1], [0, 1, 1, 0, 2, 1], [0, 1, 1, 0, 2, 2], [0, 2, 0, 0, 1, 1], [0, 2, 1, 0, 0, 1], [1, 1, 0, 0, 2, 1], [1, 1, 0, 0, 2, 2], [2, 1, 0, 0, 2, 1]]
0.432 3 40 14.2800000000000
0.432 3 40 14.2800000000000
Adding 14 positive records
1.050420168067227 8 32 25.6134453781513
1.050420168067227 8 32 25.6134453781513
Adding 26 positive records
0.9322429906542056 28 83 49.3761682242991
0.9322429906542056 28 83 49.3761682242991
Adding 49 positive records
1.8057553956834533 27 22 12.7266187050360
1.8057553956834533 27 22 12.7266187050360
Adding 13 positive records
1.306878306878307 17 18 6.52380952380953
1.306878306878307 17 18 6.52380952380953
Adding 7 positive records
1.2645348837209303 187 74 73.8804597701150
Adding 74 negative records
1.3547008547008548 75 38 17.3627760252368
Adding 17 negative records
0.8894230769230769 23 8 17.85945

### Duplication Results Lattice


In [157]:
print()
print("dt")
griddt.fit(new_train_x, new_train_label)
print("best", griddt.best_score_)
test_predict = griddt.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
r,r2,r3 = div_results("COMPAS","Duplication-Lattice","DT")

print()
print("rf")
gridrf.fit(new_train_x, new_train_label)
print("best", gridrf.best_score_)
test_predict = gridrf.predict(test_x)
print("fpr and fnr")
print("fpr", fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
r,r2,r3 = div_results("COMPAS","PDuplication-Lattice","RF")

print()
print("logistic")
gridlg.fit(new_train_x, new_train_label)
print("best", gridlg.best_score_)
test_predict = gridlg.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
r,r2,r3 = div_results("COMPAS","Duplication-Lattice","LG")

print()
print("svm")
clf.fit(new_train_x, new_train_label)
test_predict = clf.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
s,s2,s3 = div_results("COMPAS","Duplication-Lattice","SVM")


dt
best 0.507718120805369
fpr and fnr
0.12745098039215685
0.6762936221419976
accuracy
0.5083738519719071
dfpr 2.035272689711184
dfnr 0.7195321917643135
dacc 0.4175525604014481
accuracy is  0.6261480280929227

rf
best 0.4656040268456376
fpr and fnr
fpr 0.1568627450980392
0.717208182912154
accuracy
0.6261480280929227
dfpr 1.8997780303865845
dfnr 1.9662288417590035
dacc 0.4921795306229842
accuracy is  0.5915721231766613

logistic


25 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
25 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/linear_model/_logistic.py", line 1461, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/linear_model/_logistic.py", line 447, in _check_solver
    raise ValueError(
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.

        nan 0.48573826        nan 0.4857

best 0.5021812080536913
fpr and fnr
0.03333333333333333
0.9157641395908543
accuracy
0.5915721231766613
dfpr 1.0113248929155323
dfnr 0.2790614844975958
dacc 0.6562285750310619
accuracy is  0.5705024311183144

svm
fpr and fnr
0.09117647058823529
0.7641395908543923
accuracy
0.5705024311183144
dfpr 2.7064111994318636
dfnr 0.5884226654913153
dacc 0.5488723516221033
accuracy is  0.6066990815775256


## Run Algorithm Leaf

In [158]:
#get all of the candidate groups possible with the combos and names

new_train_data = copy.deepcopy(train_set)

#iterate over all the names to get the temp2 df for each name
for a in [all_names_lst[0]]:
  print("?????/////")
  print(a)
  temp2, names = get_temp(new_train_data, all_names[a], compas_y)
  temp, temp_g = get_temp_g(new_train_data, names, compas_y)
  temp_g = temp_g[temp_g['cnt'] > filter_count]
  lst_of_counts = compute_lst_of_counts(temp, names, compas_y)
  
  #id function  
  need_pos, need_neg = compute_problematic_opt(temp2, temp_g, names, compas_y, lst_of_counts)

  print("The sets of need pos and neg are")
  print(need_pos)
  print(need_neg)
  new_train_data['skewed'] = 0
  new_train_data["diff"] = 0
  
  # Duplication Function
  new_train_data = naive_duplicate(new_train_data, temp2, names, need_pos, need_neg, compas_y)

  print("label y", new_train_data[compas_y].value_counts())
new_train_x = pd.DataFrame(new_train_data, columns = columns_all)
new_train_label = pd.DataFrame(new_train_data, columns = [compas_y])
new_train_label = new_train_label[compas_y]
new_train_label = new_train_label.astype('int')

?????/////
63
The sets of need pos and neg are
[[0, 0, 0, 0, 0, 2], [0, 0, 0, 0, 1, 1], [0, 1, 0, 0, 0, 1], [0, 1, 0, 1, 2, 1], [1, 1, 0, 0, 1, 1]]
[[0, 1, 0, 0, 2, 1], [0, 1, 1, 0, 2, 1], [0, 1, 1, 0, 2, 2], [0, 2, 0, 0, 1, 1], [0, 2, 1, 0, 0, 1], [1, 1, 0, 0, 2, 1], [1, 1, 0, 0, 2, 2], [2, 1, 0, 0, 2, 1]]
0.432 3 40 14.2800000000000
0.432 3 40 14.2800000000000
Adding 14 positive records
1.050420168067227 8 32 25.6134453781513
1.050420168067227 8 32 25.6134453781513
Adding 26 positive records
0.9322429906542056 28 83 49.3761682242991
0.9322429906542056 28 83 49.3761682242991
Adding 49 positive records
1.8057553956834533 27 22 12.7266187050360
1.8057553956834533 27 22 12.7266187050360
Adding 13 positive records
1.306878306878307 17 18 6.52380952380953
1.306878306878307 17 18 6.52380952380953
Adding 7 positive records
1.2645348837209303 187 74 73.8804597701150
Adding 74 negative records
1.3547008547008548 75 38 17.3627760252368
Adding 17 negative records
0.8894230769230769 23 8 17.85945

### Duplication Results Leaf

In [159]:
print()
print("dt")
griddt.fit(new_train_x, new_train_label)
print("best", griddt.best_score_)
test_predict = griddt.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
r,r2,r3 = div_results("COMPAS","Duplication-Leaf","DT")

print()
print("rf")
gridrf.fit(new_train_x, new_train_label)
print("best", gridrf.best_score_)
test_predict = gridrf.predict(test_x)
print("fpr and fnr")
print("fpr", fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
r,r2,r3 = div_results("COMPAS","Duplication-Leaf","RF")

print()
print("logistic")
gridlg.fit(new_train_x, new_train_label)
print("best", gridlg.best_score_)
test_predict = gridlg.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
r,r2,r3 = div_results("COMPAS","Duplication-Leaf","LG")

print()
print("svm")
clf.fit(new_train_x, new_train_label)
test_predict = clf.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
s,s2,s3 = div_results("COMPAS","Duplication-Leaf","SVM")


dt
best 0.6116590875274337
fpr and fnr
0.22156862745098038
0.4681107099879663
accuracy
0.6066990815775256
dfpr 1.8892362480592937
dfnr 2.5165630964591306
dacc 0.16207542990034735
accuracy is  0.6677471636952999

rf
best 0.5905297957817208
fpr and fnr
fpr 0.27941176470588236
0.4536702767749699
accuracy
0.6677471636952999
dfpr 2.723390637653868
dfnr 3.3508135325004695
dacc 0.27655382740242024
accuracy is  0.6423554835224203

logistic


25 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
25 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/linear_model/_logistic.py", line 1461, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/linear_model/_logistic.py", line 447, in _check_solver
    raise ValueError(
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.

        nan 0.63063809        nan 0.6306

best 0.6310700535654503
fpr and fnr
0.2284313725490196
0.4584837545126354
accuracy
0.6423554835224203
dfpr 3.2425214733366707
dfnr 2.948308955897536
dacc 0.15983738257037464
accuracy is  0.6682874122096164

svm
fpr and fnr
0.24509803921568626
0.43561973525872444
accuracy
0.6682874122096164
dfpr 1.5502220353138638
dfnr 2.7784135633982134
dacc 0.1540189848756683
accuracy is  0.6693679092382496


## Run Algorithm Top

In [160]:
#get all of the candidate groups possible with the combos and names

new_train_data = copy.deepcopy(train_set)

#iterate over all the names to get the temp2 df for each name
all_names_lst_top = find_top(all_names)
for a in all_names_lst_top:
  print("?????/////")
  print(a)
  temp2, names = get_temp(new_train_data, all_names[a], compas_y)
  temp, temp_g = get_temp_g(new_train_data, names, compas_y)
  temp_g = temp_g[temp_g['cnt'] > filter_count]
  lst_of_counts = compute_lst_of_counts(temp, names, compas_y)
  
  #id function    
  need_pos, need_neg = compute_problematic_opt(temp2, temp_g, names, compas_y, lst_of_counts)
  
  print("The sets of need pos and neg are")
  print(need_pos)
  print(need_neg)
  new_train_data['skewed'] = 0
  new_train_data["diff"] = 0
  
  # Duplication Results 
  new_train_data = naive_duplicate(new_train_data, temp2, names, need_pos, need_neg, compas_y)

  print("Label y", new_train_data[compas_y].value_counts())
new_train_x = pd.DataFrame(new_train_data, columns = columns_all)
new_train_label = pd.DataFrame(new_train_data, columns = [compas_y])
new_train_label = new_train_label[compas_y]
new_train_label = new_train_label.astype('int')

?????/////
7
The sets of need pos and neg are
[[0, 0]]
[[0, 2], [1, 1], [1, 2], [2, 0], [2, 1], [2, 2]]
0.842139175257732 190 506 236.122422680412
0.842139175257732 190 506 236.122422680412
Adding 236 positive records
0.6729970326409496 386 332 241.553791887125
Adding 242 negative records
0.8258513931888545 276 174 160.200562324274
Adding 160 negative records
1.2990033222591362 109 58 25.9104859335036
Adding 26 negative records
0.583203732503888 28 27 21.0106666666667
Adding 21 negative records
0.8615733736762481 65 41 34.4433713784021
Adding 34 negative records
1.2838427947598254 31 12 12.1462585034013
Adding 12 negative records
Label y 0    2838
1    2214
Name: class, dtype: int64
?????/////
8
The sets of need pos and neg are
[[0, 1]]
[]
0.8696883852691218 514 938 301.767705382436
0.8696883852691218 514 938 301.767705382436
Adding 302 positive records
Label y 0    2838
1    2516
Name: class, dtype: int64
?????/////
9
The sets of need pos and neg are
[[0, 1]]
[]
0.9776760160274757 284

### Duplication Results Top

In [161]:
print()
print("dt")
griddt.fit(new_train_x, new_train_label)
print("best", griddt.best_score_)
test_predict = griddt.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
r,r2,r3 = div_results("COMPAS","Duplication-Top","DT")

print()
print("rf")
gridrf.fit(new_train_x, new_train_label)
print("best", gridrf.best_score_)
test_predict = gridrf.predict(test_x)
print("fpr and fnr")
print("fpr", fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
r,r2,r3 = div_results("COMPAS","Duplication-Top","RF")

print()
print("logistic")
gridlg.fit(new_train_x, new_train_label)
print("best", gridlg.best_score_)
test_predict = gridlg.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
r,r2,r3 = div_results("COMPAS","Duplication-Top","LG")

print()
print("svm")
clf.fit(new_train_x, new_train_label)
test_predict = clf.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
s,s2,s3 = div_results("COMPAS","Duplication-Top","SVM")


dt
best 0.47038058466629895
fpr and fnr
0.6823529411764706
0.3104693140794224
accuracy
0.6693679092382496
dfpr 1.762707558633074
dfnr 0.8907361768859389
dacc 0.7432313515756959
accuracy is  0.4846029173419773

rf
best 0.47159404302261443
fpr and fnr
fpr 0.7049019607843138
0.28760529482551145
accuracy
0.4846029173419773
dfpr 1.5938455018542461
dfnr 0.8993106528485432
dacc 0.7645345383998194
accuracy is  0.48244192328471097

logistic


25 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
25 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/linear_model/_logistic.py", line 1461, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/linear_model/_logistic.py", line 447, in _check_solver
    raise ValueError(
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.

        nan 0.37661335        nan 0.3766

best 0.38852730281301706
fpr and fnr
0.9990196078431373
0.0012033694344163659
accuracy
0.48244192328471097
dfpr 0
dfnr 0
dacc 1.0636915347348272
accuracy is  0.44894651539708263

svm
fpr and fnr
0.8382352941176471
0.13598074608904934
accuracy
0.44894651539708263
dfpr 1.2650288664672282
dfnr 0.767207659695063
dacc 0.5742459359506347
accuracy is  0.4770394381415451


#Down-sampling

In [162]:
def round_int(x):
    if x in [float("-inf"),float("inf")]: return 0
    return int(round(x))


def make_remove(d, group_lst, diff, names, label_y, need_positive_or_negative):
    temp = copy.deepcopy(d)
    for i in range(len(group_lst)):
        att_name = names[i]
        temp = temp[(temp[att_name] == group_lst[i])]
    temp = temp[(temp[label_y] == need_positive_or_negative)]
    # randomly generated diff samples
        #generated = temp
        # the number needed is more than the not needed numbers.

    if(diff>len(temp)):
        diff = len(temp)
    generated = temp.sample(n = diff, replace = False, axis = 0)
    return generated.index


def naive_downsampling(d, temp2, names, need_pos, need_neg, label_y):
    # add more records for all groups
    # The smote algorithm to boost the coverage
    for r in need_pos:
        print("removing more negative")
    # add more positive records
        # determine how many points to add
        print(r)
        diff = compute_diff_remove(r, temp2, names, label_y, need_positive_or_negative = 1)
        if diff == -1:
          continue
        diff = round_int(diff)
        # add more records
        print("Removed " + str(diff) +" negative records")
        samples_to_remove = make_remove(d, r, diff, names, label_y, need_positive_or_negative = 0)
        d.drop(index  = samples_to_remove, inplace = True)
    for k in need_neg:
        print(k)
        diff = compute_diff_remove(k, temp2, names, label_y, need_positive_or_negative = 0)
        if diff == -1:
          continue
        diff = round_int(diff)
        print("Removed " + str(diff) +" positive records")
        samples_to_remove = make_remove(d, k, diff, names, label_y, need_positive_or_negative = 1)
        d.drop(index  = samples_to_remove, inplace = True)
    return d

## Run Algorithm Lattice

In [163]:
#get all of the candidate groups possible with the combos and names

new_train_data = copy.deepcopy(train_set)

#iterate over all the names to get the temp2 df for each name
for a in all_names_lst:
  print("?????/////")
  print(a)
  temp2, names = get_temp(new_train_data, all_names[a], compas_y)
  temp, temp_g = get_temp_g(new_train_data, names, compas_y)
  temp_g = temp_g[temp_g['cnt'] > filter_count]
  lst_of_counts = compute_lst_of_counts(temp, names, compas_y)
  
  #id function   
  need_pos, need_neg = compute_problematic_opt(temp2, temp_g, names, compas_y, lst_of_counts)

  print("The sets of need pos and neg are")
  print(need_pos)
  print(need_neg)
  new_train_data['skewed'] = 0
  new_train_data["diff"] = 0
  
  #Down-sampling function
  new_train_data = naive_downsampling(new_train_data, temp2, names, need_pos, need_neg, compas_y)

  print(new_train_data[compas_y].value_counts())
  
new_train_x = pd.DataFrame(new_train_data, columns = columns_all)
new_train_label = pd.DataFrame(new_train_data, columns = [compas_y])
new_train_label = new_train_label[compas_y]
new_train_label = new_train_label.astype('int')

?????/////
63
The sets of need pos and neg are
[[0, 0, 0, 0, 0, 2], [0, 0, 0, 0, 1, 1], [0, 1, 0, 0, 0, 1], [0, 1, 0, 1, 2, 1], [1, 1, 0, 0, 1, 1]]
[[0, 1, 0, 0, 2, 1], [0, 1, 1, 0, 2, 1], [0, 1, 1, 0, 2, 2], [0, 2, 0, 0, 1, 1], [0, 2, 1, 0, 0, 1], [1, 1, 0, 0, 2, 1], [1, 1, 0, 0, 2, 2], [2, 1, 0, 0, 2, 1]]
removing more negative
[0, 0, 0, 0, 0, 2]
0.432 3 40 33.0555555555556
Removed 33 negative records
removing more negative
[0, 0, 0, 0, 1, 1]
1.050420168067227 8 32 24.3840000000000
Removed 24 negative records
removing more negative
[0, 1, 0, 0, 0, 1]
0.9322429906542056 28 83 52.9649122807018
Removed 53 negative records
removing more negative
[0, 1, 0, 1, 2, 1]
1.8057553956834533 27 22 7.04780876494024
Removed 7 negative records
removing more negative
[1, 1, 0, 0, 1, 1]
1.306878306878307 17 18 4.99190283400813
Removed 5 negative records
[0, 1, 0, 0, 2, 1]
1.2645348837209303 187 74 93.4244186046514
Removed 93 positive records
[0, 1, 1, 0, 2, 1]
1.3547008547008548 75 38 23.5213675213675

### Down-sampling Results Lattice

In [164]:
print()
print("dt")
griddt.fit(new_train_x, new_train_label)
print("best", griddt.best_score_)
test_predict = griddt.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
r,r2,r3 = div_results("COMPAS","Down Sampling-Lattice","DT")

print()
print("rf")
gridrf.fit(new_train_x, new_train_label)
print("best", gridrf.best_score_)
test_predict = gridrf.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
r,r2,r3 = div_results("COMPAS","Down Sampling-Lattice","RF")

print()
print("logistic")
gridlg.fit(new_train_x, new_train_label)
print("best", gridlg.best_score_)
test_predict = gridlg.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
r,r2,r3 = div_results("COMPAS","Down Sampling-Lattice","LG")

print()
print("svm")
clf.fit(new_train_x, new_train_label)
test_predict = clf.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
s,s2,s3 = div_results("COMPAS","Down Sampling-Lattice","SVM")


dt
best 0.7206338655113063
fpr and fnr
0.0
1.0
accuracy
0.4770394381415451
dfpr 0
dfnr 0
dacc 0.8613400089486869
accuracy is  0.5510534846029174

rf
best 0.7020274200249274
fpr and fnr
0.016666666666666666
0.9771359807460891
accuracy
0.5510534846029174
dfpr 0.044888568609463574
dfnr 0.03148005823324494
dacc 0.7904641321393582
accuracy is  0.5521339816315505

logistic


25 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
25 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/linear_model/_logistic.py", line 1461, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/linear_model/_logistic.py", line 447, in _check_solver
    raise ValueError(
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.

        nan 0.72063387        nan 0.7206

best 0.7206338655113063
fpr and fnr
0.0
1.0
accuracy
0.5521339816315505
dfpr 0
dfnr 0
dacc 0.8613400089486869
accuracy is  0.5510534846029174

svm
fpr and fnr
0.0
1.0
accuracy
0.5510534846029174
dfpr 0
dfnr 0
dacc 0.8613400089486869
accuracy is  0.5510534846029174


## Run Algorithm Leaf

In [165]:
#get all of the candidate groups possible with the combos and names

new_train_data = copy.deepcopy(train_set)

#iterate over all the names to get the temp2 df for each name
for a in [all_names_lst[0]]:
  print("?????/////")
  print(a)
  temp2, names = get_temp(new_train_data, all_names[a], compas_y)
  temp, temp_g = get_temp_g(new_train_data, names, compas_y)
  temp_g = temp_g[temp_g['cnt'] > filter_count]

  lst_of_counts = compute_lst_of_counts(temp, names, compas_y)

  # id function 
  need_pos, need_neg = compute_problematic_opt(temp2, temp_g, names, compas_y, lst_of_counts)

  print("The sets of need pos and neg are")
  print(need_pos)
  print(need_neg)
  new_train_data['skewed'] = 0
  new_train_data["diff"] = 0
  
  #Down-sampling function
  new_train_data = naive_downsampling(new_train_data, temp2, names, need_pos, need_neg, compas_y)
  print(new_train_data[compas_y].value_counts())
  
new_train_x = pd.DataFrame(new_train_data, columns = columns_all)
new_train_label = pd.DataFrame(new_train_data, columns = [compas_y])
new_train_label = new_train_label[compas_y]
new_train_label = new_train_label.astype('int')


?????/////
63
The sets of need pos and neg are
[[0, 0, 0, 0, 0, 2], [0, 0, 0, 0, 1, 1], [0, 1, 0, 0, 0, 1], [0, 1, 0, 1, 2, 1], [1, 1, 0, 0, 1, 1]]
[[0, 1, 0, 0, 2, 1], [0, 1, 1, 0, 2, 1], [0, 1, 1, 0, 2, 2], [0, 2, 0, 0, 1, 1], [0, 2, 1, 0, 0, 1], [1, 1, 0, 0, 2, 1], [1, 1, 0, 0, 2, 2], [2, 1, 0, 0, 2, 1]]
removing more negative
[0, 0, 0, 0, 0, 2]
0.432 3 40 33.0555555555556
Removed 33 negative records
removing more negative
[0, 0, 0, 0, 1, 1]
1.050420168067227 8 32 24.3840000000000
Removed 24 negative records
removing more negative
[0, 1, 0, 0, 0, 1]
0.9322429906542056 28 83 52.9649122807018
Removed 53 negative records
removing more negative
[0, 1, 0, 1, 2, 1]
1.8057553956834533 27 22 7.04780876494024
Removed 7 negative records
removing more negative
[1, 1, 0, 0, 1, 1]
1.306878306878307 17 18 4.99190283400813
Removed 5 negative records
[0, 1, 0, 0, 2, 1]
1.2645348837209303 187 74 93.4244186046514
Removed 93 positive records
[0, 1, 1, 0, 2, 1]
1.3547008547008548 75 38 23.5213675213675

### Down-sampling Results Leaf

In [166]:
print()
print("dt")
griddt.fit(new_train_x, new_train_label)
print("best", griddt.best_score_)
test_predict = griddt.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
r,r2,r3 = div_results("COMPAS","Down Sampling-Leaf","DT")

print()
print("rf")
gridrf.fit(new_train_x, new_train_label)
print("best", gridrf.best_score_)
test_predict = gridrf.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
r,r2,r3 = div_results("COMPAS","Down Sampling-Leaf","RF")

print()
print("logistic")
gridlg.fit(new_train_x, new_train_label)
print("best", gridlg.best_score_)
test_predict = gridlg.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
r,r2,r3 = div_results("COMPAS","Down Sampling-Leaf","LG")

print()
print("svm")
clf.fit(new_train_x, new_train_label)
test_predict = clf.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
s,s2,s3 = div_results("COMPAS","Down Sampling-Leaf","SVM")


dt
best 0.6426202439386219
fpr and fnr
0.19509803921568628
0.4897713598074609
accuracy
0.5510534846029174
dfpr 1.830902622725156
dfnr 2.272430033368377
dacc 0.16091963080975125
accuracy is  0.6726094003241491

rf
best 0.6271264004540793
fpr and fnr
0.28431372549019607
0.44765342960288806
accuracy
0.6726094003241491
dfpr 2.6686762557146304
dfnr 3.4572695041471087
dacc 0.29274581380368514
accuracy is  0.6423554835224203

logistic


25 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
25 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/linear_model/_logistic.py", line 1461, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/linear_model/_logistic.py", line 447, in _check_solver
    raise ValueError(
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.

        nan 0.64769735        nan 0.6476

best 0.647697353569102
fpr and fnr
0.246078431372549
0.4404332129963899
accuracy
0.6423554835224203
dfpr 2.9184364873000423
dfnr 3.010392566801359
dacc 0.16855753646677485
accuracy is  0.6666666666666666

svm
fpr and fnr
0.24509803921568626
0.43441636582430804
accuracy
0.6666666666666666
dfpr 1.1265062029685533
dfnr 2.7888665853963444
dacc 0.15757948818531073
accuracy is  0.6699081577525662


## Run Algorithm Top

In [167]:
#get all of the candidate groups possible with the combos and names

new_train_data = copy.deepcopy(train_set)

#iterate over all the names to get the temp2 df for each name
all_names_lst_top = find_top(all_names)
for a in all_names_lst_top:
  print("?????/////")
  print(a)
  temp2, names = get_temp(new_train_data, all_names[a], compas_y)
  temp, temp_g = get_temp_g(new_train_data, names, compas_y)
  temp_g = temp_g[temp_g['cnt'] > filter_count]

  lst_of_counts = compute_lst_of_counts(temp, names, compas_y)
  
  # id function  
  need_pos, need_neg = compute_problematic_opt(temp2, temp_g, names, compas_y, lst_of_counts)
 
  print("The sets of need pos and neg are")
  print(need_pos)
  print(need_neg)
  new_train_data['skewed'] = 0
  new_train_data["diff"] = 0
  
  # Down-sampling function
  new_train_data = naive_downsampling(new_train_data, temp2, names, need_pos, need_neg, compas_y)
  print(new_train_data[compas_y].value_counts())
new_train_x = pd.DataFrame(new_train_data, columns = columns_all)
new_train_label = pd.DataFrame(new_train_data, columns = [compas_y])
new_train_label = new_train_label[compas_y]
new_train_label = new_train_label.astype('int')


?????/////
7
The sets of need pos and neg are
[[0, 0]]
[[0, 2], [1, 1], [1, 2], [2, 0], [2, 1], [2, 2]]
removing more negative
[0, 0]
0.842139175257732 190 506 280.384085692425
Removed 280 negative records
[0, 2]
0.6729970326409496 386 332 162.564985163205
Removed 163 positive records
[1, 1]
0.8258513931888545 276 174 132.301857585139
Removed 132 positive records
[1, 2]
1.2990033222591362 109 58 33.6578073089701
Removed 34 positive records
[2, 0]
0.583203732503888 28 27 12.2534992223950
Removed 12 positive records
[2, 1]
0.8615733736762481 65 41 29.6754916792739
Removed 30 positive records
[2, 2]
1.2838427947598254 31 12 15.5938864628821
Removed 16 positive records
0    2063
1    1591
Name: class, dtype: int64
?????/////
8
The sets of need pos and neg are
[[0, 1]]
[]
removing more negative
[0, 1]
0.8658420551855376 388 714 265.881318681319
Removed 266 negative records
0    1797
1    1591
Name: class, dtype: int64
?????/////
9
The sets of need pos and neg are
[[0, 1]]
[]
removing more n

### Down-sampling Results Top

In [168]:
print()
print("dt")
griddt.fit(new_train_x, new_train_label)
print("best", griddt.best_score_)

test_predict = griddt.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
r,r2,r3 = div_results("COMPAS","Down Sampling-Top","DT")

print()
print("rf")
gridrf.fit(new_train_x, new_train_label)
print("best", gridrf.best_score_)
test_predict = gridrf.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
r,r2,r3 = div_results("COMPAS","Down Sampling-Top","RF")

print()
print("logistic")
gridlg.fit(new_train_x, new_train_label)
print("best", gridlg.best_score_)
test_predict = gridlg.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
r,r2,r3 = div_results("COMPAS","Down Sampling-Top","LG")

print()
print("svm")
clf.fit(new_train_x, new_train_label)
test_predict = clf.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
s,s2,s3 = div_results("COMPAS","Down Sampling-Top","SVM")


dt
best 0.5634660098703256
fpr and fnr
0.05392156862745098
0.8592057761732852
accuracy
0.6699081577525662
dfpr 0.9104801161528263
dfnr 0.28995383761466764
dacc 0.4868152802477146
accuracy is  0.5845488924905456

rf
best 0.5415849500510059
fpr and fnr
0.42254901960784313
0.6546329723225031
accuracy
0.5845488924905456
dfpr 1.2188209103874605
dfnr 1.764597455353002
dacc 0.2590951318968153
accuracy is  0.473257698541329

logistic


25 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
25 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/linear_model/_logistic.py", line 1461, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/linear_model/_logistic.py", line 447, in _check_solver
    raise ValueError(
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.

        nan 0.53087464        nan 0.5308

best 0.5321557563114023
fpr and fnr
0.03725490196078431
0.9253910950661853
accuracy
0.473257698541329
dfpr 0.38911614211769197
dfnr 0.4355413996347597
dacc 0.7054288408648532
accuracy is  0.5640194489465153

svm
fpr and fnr
0.453921568627451
0.6534296028880866
accuracy
0.5640194489465153
dfpr 1.7393511410177387
dfnr 2.4182032576591137
dacc 0.3683096817729024
accuracy is  0.45650999459751485


# Massaging 

In [169]:
from sklearn.naive_bayes import MultinomialNB
def round_int(x):
    if x in [float("-inf"),float("inf")]: return 0
    return int(round(x))

def get_depromotion(d, diff, group_lst, names, label_y, flag_depro):

    input_test = pd.DataFrame(d, columns = columns_compas)
    clf = MultinomialNB()
    temp_train_label = pd.DataFrame(d, columns = [label_y])
    temp_train_label = temp_train_label[label_y]
    temp_train_label = temp_train_label.astype('int')
    clf = clf.fit(input_test, temp_train_label)
    prob  = clf.predict_proba(input_test)[:,0]
    select = copy.deepcopy(d)
    select['prob'] = prob # the higher the probablity is, the more likely for it to be 0
    # filter out those belongs to this group
    for i in range(len(group_lst)):
        att_name = names[i]
        select = select[(select[att_name] == group_lst[i])]
    select = select[(select[label_y] == flag_depro)]
    # rank them according to the probability
    # filp the records and remove the records from d
    if (flag_depro == 0):
        select.sort_values(by="prob", ascending=True, inplace=True)
        select[label_y] = 1
    else:
        select.sort_values(by="prob", ascending=False, inplace=True)
        select[label_y] = 0
    head = select.head(diff)
    index_list = []
    index_list = list(head.index)
    d.drop(index_list,inplace = True)
    head.drop(columns = ['prob'],inplace = True)
    return head



def naive_massaging(d, temp2, names, need_pos, need_neg,label_y):
    # add more records for all groups
    # The smote algorithm to boost the coverage
    for r in need_pos:
        print("adding more positive")
    # add more positive records
        # determine how many points to add
        print(r)
        diff = compute_diff_add_and_remove(r, temp2, 1, label_y, names)
        diff =  round_int(diff)
        # add more records
        #0 for promotion
        samples_to_add = get_depromotion(d, diff, r, names, label_y, flag_depro = 0)
        print("Changed " + str(len(samples_to_add)) +" records")
        d = pd.concat([d, samples_to_add])
        print(len(d))
    for k in need_neg:
        print(k)
        print("adding more negative")
        diff = compute_diff_add_and_remove(k, temp2, 0, label_y, names)
        diff =  round_int(diff)
        #1 for demotion
        samples_to_add = get_depromotion(d, diff, k, names, label_y, flag_depro = 1)
        print("Changed " + str(len(samples_to_add)) +" records")
        d = pd.concat([d, samples_to_add])
        print(len(d))
    return d

## Run Algorithm Lattice

In [171]:
#get all of the candidate groups possible with the combos and names

new_train_data = copy.deepcopy(train_set)

#iterate over all the names to get the temp2 df for each name
for a in all_names_lst:
  print("?????/////")
  print(a)
  temp2, names = get_temp(new_train_data, all_names[a], compas_y)
  temp, temp_g = get_temp_g(new_train_data, names, compas_y)
  temp_g = temp_g[temp_g['cnt'] > filter_count]
  lst_of_counts = compute_lst_of_counts(temp, names, compas_y)
   
   #id function
  need_pos, need_neg = compute_problematic_opt(temp2, temp_g, names, compas_y, lst_of_counts)

  print("The sets of need pos and neg are")
  print(need_pos)
  print(need_neg)
  new_train_data['skewed'] = 0
  new_train_data["diff"] = 0

  # massaging function
  new_train_data = naive_massaging(new_train_data, temp2, names, need_pos, need_neg, compas_y)

  print(new_train_data[compas_y].value_counts())
new_train_x = pd.DataFrame(new_train_data, columns = columns_all)
new_train_label = pd.DataFrame(new_train_data, columns = [compas_y])
new_train_label = new_train_label[compas_y]
new_train_label = new_train_label.astype('int')

?????/////
63
The sets of need pos and neg are
[[0, 0, 0, 0, 0, 2], [0, 0, 0, 0, 1, 1], [0, 1, 0, 0, 0, 1], [0, 1, 0, 1, 2, 1], [1, 1, 0, 0, 1, 1]]
[[0, 1, 0, 0, 2, 1], [0, 1, 1, 0, 2, 1], [0, 1, 1, 0, 2, 2], [0, 2, 0, 0, 1, 1], [0, 2, 1, 0, 0, 1], [1, 1, 0, 0, 2, 1], [1, 1, 0, 0, 2, 2], [2, 1, 0, 0, 2, 1]]
adding more positive
[0, 0, 0, 0, 0, 2]
Changed 10 records
4321
adding more positive
[0, 0, 0, 0, 1, 1]
Changed 12 records
4321
adding more positive
[0, 1, 0, 0, 0, 1]
Changed 26 records
4321
adding more positive
[0, 1, 0, 1, 2, 1]
Changed 5 records
4321
adding more positive
[1, 1, 0, 0, 1, 1]
Changed 3 records
4321
[0, 1, 0, 0, 2, 1]
adding more negative
Changed 41 records
4321
[0, 1, 1, 0, 2, 1]
adding more negative
Changed 10 records
4321
[0, 1, 1, 0, 2, 2]
adding more negative
Changed 8 records
4321
[0, 2, 0, 0, 1, 1]
adding more negative
Changed 30 records
4321
[0, 2, 1, 0, 0, 1]
adding more negative
Changed 6 records
4321
[1, 1, 0, 0, 2, 1]
adding more negative
Changed 11 reco

### Massaging Results Lattice 

In [172]:
print()
print("dt")
griddt.fit(new_train_x, new_train_label)
print("best", griddt.best_score_)
test_predict = griddt.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
r,r2,r3 = div_results("COMPAS","Massage-Lattice","DT")

print()
print("rf")
gridrf.fit(new_train_x, new_train_label)
print("best", gridrf.best_score_)
test_predict = gridrf.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
r,r2,r3 = div_results("COMPAS","Massage-Lattice","RF")

print()
print("logistic")
gridlg.fit(new_train_x, new_train_label)
print("best", gridlg.best_score_)
test_predict = gridlg.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
r,r2,r3 = div_results("COMPAS","Massage-Lattice","LG")

print()
print("svm")
clf.fit(new_train_x, new_train_label)
test_predict = clf.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
s,s2,s3 = div_results("COMPAS","Massage-Lattice","SVM")


dt
best 0.7949512952258617
fpr and fnr
0.0
1.0
accuracy
0.45650999459751485
dfpr 0
dfnr 0
dacc 0.8613400089486869
accuracy is  0.5510534846029174

rf
best 0.7859227146221366
fpr and fnr
0.06764705882352941
0.9025270758122743
accuracy
0.5510534846029174
dfpr 0.7479327437364252
dfnr 0.48184544385839373
dacc 0.7116412609768078
accuracy is  0.5575364667747164

logistic


25 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
25 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/linear_model/_logistic.py", line 1461, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/linear_model/_logistic.py", line 447, in _check_solver
    raise ValueError(
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.

        nan 0.81022907        nan 0.8102

best 0.8102290730036396
fpr and fnr
0.0
1.0
accuracy
0.5575364667747164
dfpr 0
dfnr 0
dacc 0.8613400089486869
accuracy is  0.5510534846029174

svm
fpr and fnr
0.04411764705882353
0.9169675090252708
accuracy
0.5510534846029174
dfpr 0.7608212562028613
dfnr 0.3789451626427168
dacc 0.7324395153699396
accuracy is  0.5640194489465153


## Run Algorithm Leaf

In [173]:
#get all of the candidate groups possible with the combos and names

new_train_data = copy.deepcopy(train_set)

#iterate over all the names to get the temp2 df for each name
for a in [all_names_lst[0]]:
  print("?????/////")
  print(a)
  temp2, names = get_temp(new_train_data, all_names[a], compas_y)
  temp, temp_g = get_temp_g(new_train_data, names, compas_y)
  temp_g = temp_g[temp_g['cnt'] > filter_count]
  lst_of_counts = compute_lst_of_counts(temp, names, compas_y)
  
  #id function  
  need_pos, need_neg = compute_problematic_opt(temp2, temp_g, names, compas_y, lst_of_counts)

  print("The sets of need pos and neg are")
  print(need_pos)
  print(need_neg)
  new_train_data['skewed'] = 0
  new_train_data["diff"] = 0

  # massaging function
  new_train_data = naive_massaging(new_train_data, temp2, names, need_pos, need_neg, compas_y)

  print(new_train_data[compas_y].value_counts())
new_train_x = pd.DataFrame(new_train_data, columns = columns_all)
new_train_label = pd.DataFrame(new_train_data, columns = [compas_y])
new_train_label = new_train_label[compas_y]
new_train_label = new_train_label.astype('int')

?????/////
63
The sets of need pos and neg are
[[0, 0, 0, 0, 0, 2], [0, 0, 0, 0, 1, 1], [0, 1, 0, 0, 0, 1], [0, 1, 0, 1, 2, 1], [1, 1, 0, 0, 1, 1]]
[[0, 1, 0, 0, 2, 1], [0, 1, 1, 0, 2, 1], [0, 1, 1, 0, 2, 2], [0, 2, 0, 0, 1, 1], [0, 2, 1, 0, 0, 1], [1, 1, 0, 0, 2, 1], [1, 1, 0, 0, 2, 2], [2, 1, 0, 0, 2, 1]]
adding more positive
[0, 0, 0, 0, 0, 2]
Changed 10 records
4321
adding more positive
[0, 0, 0, 0, 1, 1]
Changed 12 records
4321
adding more positive
[0, 1, 0, 0, 0, 1]
Changed 26 records
4321
adding more positive
[0, 1, 0, 1, 2, 1]
Changed 5 records
4321
adding more positive
[1, 1, 0, 0, 1, 1]
Changed 3 records
4321
[0, 1, 0, 0, 2, 1]
adding more negative
Changed 41 records
4321
[0, 1, 1, 0, 2, 1]
adding more negative
Changed 10 records
4321
[0, 1, 1, 0, 2, 2]
adding more negative
Changed 8 records
4321
[0, 2, 0, 0, 1, 1]
adding more negative
Changed 30 records
4321
[0, 2, 1, 0, 0, 1]
adding more negative
Changed 6 records
4321
[1, 1, 0, 0, 2, 1]
adding more negative
Changed 11 reco

### Massaging Results Leaf

In [174]:
print()
print("dt")
griddt.fit(new_train_x, new_train_label)
print("best", griddt.best_score_)
test_predict = griddt.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
r,r2,r3 = div_results("COMPAS","Massage-Leaf","DT")

print()
print("rf")
gridrf.fit(new_train_x, new_train_label)
print("best", gridrf.best_score_)
test_predict = gridrf.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
r,r2,r3 = div_results("COMPAS","Massage-Leaf","RF")

print()
print("logistic")
gridlg.fit(new_train_x, new_train_label)
print("best", gridlg.best_score_)
test_predict = gridlg.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
r,r2,r3 = div_results("COMPAS","Massage-Leaf","LG")

print()
print("svm")
clf.fit(new_train_x, new_train_label)
test_predict = clf.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
s,s2,s3 = div_results("COMPAS","Massage-Leaf","SVM")


dt
best 0.621149111539285
fpr and fnr
0.19411764705882353
0.4897713598074609
accuracy
0.5640194489465153
dfpr 2.223247068811258
dfnr 2.275118418495093
dacc 0.1604053585881271
accuracy is  0.6731496488384657

rf
best 0.6086493791479339
fpr and fnr
0.2735294117647059
0.4500601684717208
accuracy
0.6731496488384657
dfpr 2.4803519161042393
dfnr 3.4909766650703578
dacc 0.23710167617136318
accuracy is  0.6472177201512695

logistic


25 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
25 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/linear_model/_logistic.py", line 1461, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/linear_model/_logistic.py", line 447, in _check_solver
    raise ValueError(
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.

        nan 0.64266645        nan 0.6426

best 0.6426675230143438
fpr and fnr
0.23529411764705882
0.4428399518652226
accuracy
0.6472177201512695
dfpr 3.0625152182392226
dfnr 3.156926896160785
dacc 0.15393609423381752
accuracy is  0.671528903295516

svm
fpr and fnr
0.24705882352941178
0.43200962695547535
accuracy
0.671528903295516
dfpr 1.0969663472342601
dfnr 2.8144272413525213
dacc 0.15150745680127928
accuracy is  0.6699081577525662


## Run Algorithm Top

In [175]:
#get all of the candidate groups possible with the combos and names

new_train_data = copy.deepcopy(train_set)

#iterate over all the names to get the temp2 df for each name
all_names_lst_top = find_top(all_names)
for a in all_names_lst_top:
  print("?????/////")
  print(a)
  temp2, names = get_temp(new_train_data, all_names[a], compas_y)
  temp, temp_g = get_temp_g(new_train_data, names, compas_y)
  temp_g = temp_g[temp_g['cnt'] > filter_count]
  lst_of_counts = compute_lst_of_counts(temp, names, compas_y)
  
  # id function   
  need_pos, need_neg = compute_problematic_opt(temp2, temp_g, names, compas_y, lst_of_counts)
 
  print("The sets of need pos and neg are")
  print(need_pos)
  print(need_neg)
  new_train_data['skewed'] = 0
  new_train_data["diff"] = 0
  
  # massaging function
  new_train_data = naive_massaging(new_train_data, temp2, names, need_pos, need_neg, compas_y)

  print(new_train_data[compas_y].value_counts())
new_train_x = pd.DataFrame(new_train_data, columns = columns_all)
new_train_label = pd.DataFrame(new_train_data, columns = [compas_y])
new_train_label = new_train_label[compas_y]
new_train_label = new_train_label.astype('int')

?????/////
7
The sets of need pos and neg are
[[0, 0]]
[[0, 2], [1, 1], [1, 2], [2, 0], [2, 1], [2, 2]]
adding more positive
[0, 0]
Changed 128 records
4321
[0, 2]
adding more negative
Changed 97 records
4321
[1, 1]
adding more negative
Changed 72 records
4321
[1, 2]
adding more negative
Changed 15 records
4321
[2, 0]
adding more negative
Changed 8 records
4321
[2, 1]
adding more negative
Changed 16 records
4321
[2, 2]
adding more negative
Changed 7 records
4321
0    2430
1    1891
Name: class, dtype: int64
?????/////
8
The sets of need pos and neg are
[[0, 1]]
[[0, 0]]
adding more positive
[0, 1]
Changed 255 records
4321
[0, 0]
adding more negative
Changed 261 records
4321
0    2436
1    1885
Name: class, dtype: int64
?????/////
9
The sets of need pos and neg are
[[0, 1]]
[[0, 0]]
adding more positive
[0, 1]
Changed 236 records
4321
[0, 0]
adding more negative
Changed 381 records
4321
0    2581
1    1740
Name: class, dtype: int64
?????/////
10
The sets of need pos and neg are
[[0, 0],

### Massaging Results Top

In [176]:
print()
print("dt")
griddt.fit(new_train_x, new_train_label)
print("best", griddt.best_score_)
test_predict = griddt.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
r,r2,r3 = div_results("COMPAS","Massage-Top","DT")

print()
print("rf")
gridrf.fit(new_train_x, new_train_label)
print("best", gridrf.best_score_)
test_predict = gridrf.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
r,r2,r3 = div_results("COMPAS","Massage-Top","RF")

print()
print("logistic")
gridlg.fit(new_train_x, new_train_label)
print("best", gridlg.best_score_)
test_predict = gridlg.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
r,r2,r3 = div_results("COMPAS","Massage-Top","LG")

print()
print("svm")
clf.fit(new_train_x, new_train_label)
test_predict = clf.predict(test_x)
print("fpr and fnr")
print(fpr_onegroup(list(test_label), test_predict))
print(fnr_onegroup(list(test_label), test_predict))
accuracy = accuracy_score(test_label, test_set['predicted'])
print("accuracy")
print(accuracy)
test_set['predicted'] = test_predict
s,s2,s3 = div_results("COMPAS","Massage-Top","SVM")


dt
best 0.834049721687005
fpr and fnr
0.22549019607843138
0.7436823104693141
accuracy
0.6699081577525662
dfpr 0.8901317055556609
dfnr 1.2506835775488587
dacc 0.3718001366528118
accuracy is  0.5418692598595354

rf
best 0.7752411153928495
fpr and fnr
0.31470588235294117
0.6979542719614922
accuracy
0.5418692598595354
dfpr 0.9527189020743742
dfnr 1.703938178835546
dacc 0.30697381735630835
accuracy is  0.5132360886007563

logistic


25 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
25 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/linear_model/_logistic.py", line 1461, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/linear_model/_logistic.py", line 447, in _check_solver
    raise ValueError(
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.

        nan 0.85789285        nan 0.8578

best 0.8578928494968958
fpr and fnr
0.21862745098039216
0.7099879663056559
accuracy
0.5132360886007563
dfpr 0.7141845146034491
dfnr 1.1550471253780277
dacc 0.40275132719884194
accuracy is  0.5607779578606159

svm
fpr and fnr
0.31862745098039214
0.6847172081829122
accuracy
0.5607779578606159
dfpr 1.0638524793865747
dfnr 1.829136870553926
dacc 0.3284684698883693
accuracy is  0.5170178282009724
