##### Deanonymising group B's dataset 

Anonymisation techniques applied by group B:
- PRAM on evote, zip, citizenship and party
- Grouping age into age groups
- 10% chance of each record switching to the age group one above or one below
- Grouping education levels into broader categories
- Recoding citizenship into a boolean Danish citizenship variable
- Removed name and dob
- Masking the party variable, and randomly morphing the 2 entries marked as "Invalid Vote" (?)
- Masking the zip variable by assigning a random zip region to each zip code
- k-anonymity of 2 (suppression)

Importing libraries 

In [2]:
import pandas as pd
import functions 
from datetime import date

Loading in their data

In [3]:
# global recoding on age and pram on sex 
anonymised_data = pd.read_csv("deanon_data/anonymised_dataB.csv")
# global recoding on age, global recoding on marital status 
register_data = pd.read_excel("deanon_data/public_data_registerB.xlsx")
# global recoding on age, global recoding on marital status, and pram on sex
results_data = pd.read_excel("deanon_data/public_data_resultsB.xlsx")

Preparing the register data
- Converting dob -> age -> age groups
- Converting citizenship into boolean Danish_Citizenship
- Masking zip code into region (?) (not applied below)

In [4]:
# convert dob to age 
def calculate_age(born):
    today = date.today()
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))
register_data["dob"] = pd.to_datetime(register_data["dob"])
register_data['age'] = register_data['dob'].apply(calculate_age)

# convert age to age groups 
age_bins = [18, 29, 39, 49, 59, 69, 1000]  # Adjust as needed
age_labels = ["18-29", "30-39", "40-49", "50-59", "60-69", "70+"] 
register_data['age_group'] = pd.cut(register_data['age'], bins=age_bins, right=True,
                                    labels=age_labels, include_lowest=True)
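
The bin edges are easy to get off by one, so a quick check on boundary ages can help. The snippet below is a minimal sketch that reuses the same `age_bins` and `age_labels`; the sample ages are made up purely to probe the edges and are not taken from the register.

```python
import pandas as pd

# Same edges and labels as in the cell above
age_bins = [18, 29, 39, 49, 59, 69, 1000]
age_labels = ["18-29", "30-39", "40-49", "50-59", "60-69", "70+"]

# Hypothetical boundary ages, chosen only to test the bin edges
sample_ages = pd.Series([17, 18, 29, 30, 69, 70])
sample_groups = pd.cut(sample_ages, bins=age_bins, right=True,
                       labels=age_labels, include_lowest=True)

# Ages below 18 fall outside every bin and come back as NaN
print(pd.DataFrame({"age": sample_ages, "age_group": sample_groups}))
```

With `right=True` each bin includes its upper edge, and `include_lowest=True` makes the first bin include 18 itself.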

In [5]:
# True if the registered citizenship is Denmark, False otherwise
register_data["Danish_Citizenship"] = register_data["citizenship"] == "Denmark"

Getting survey voters from register data 

In [6]:
quasi = ['sex', 'marital_status', 'age_group', 'Danish_Citizenship']
quasi2 = ['sex', 'marital_status']

with open("deanon_data/survey_listB.txt", "r") as my_file:
    # Read the file content
    data = my_file.read()
    
    # Split the text into a list by newline ('\n')
    data_into_list = data.split("\n")

survey_voters = register_data.query('name in @data_into_list')
survey_voters = survey_voters[quasi+["name"]]
survey_voters

Unnamed: 0,sex,marital_status,age_group,Danish_Citizenship,name
0,Female,Never married,18-29,False,"Dang, Lila"
12,Female,Never married,18-29,True,"Rivera, Gabriela"
14,Male,Never married,18-29,True,"Vogel, William"
18,Female,Never married,18-29,True,"Palacios, Mireya"
25,Male,Never married,18-29,False,"Mcclain, Vaughn"
...,...,...,...,...,...
1503,Male,Never married,60-69,True,"Lau, Francis"
1517,Male,Divorced,60-69,True,"Hanshaw, William"
1518,Female,Widowed,70+,True,"al-Dar, Waneesa"
1521,Male,Married/separated,70+,True,"al-Tabatabai, Tammaam"


Computing k-anonymity violations

In [7]:
functions.k_anonymity_violations(anonymised_data, quasi)

{2: (19, 9.595959595959595),
 3: (45, 22.727272727272727),
 5: (62, 31.313131313131315)}

In [8]:
k2_violations_anon = functions.identify_k_anonymity_violations(anonymised_data, quasi)
k2_violations_anon

Unnamed: 0,sex,evote,marital_status,party,age_group,zip_region,education_category,Danish_Citizenship
0,Male,0,Divorced,Green,18-29,Region 3,Vocational and Short-Cycle Education,True
1,Male,0,Divorced,Green,50-59,Region 1,Vocational and Short-Cycle Education,False
2,Male,0,Married/separated,Green,70+,Region 3,Higher Education,False
3,Male,0,Never married,Green,40-49,Region 2,Basic Education,False
4,Male,1,Never married,Green,30-39,Region 3,Basic Education,False
5,Male,0,Never married,Red,60-69,Region 1,Basic Education,True
6,Female,1,Divorced,Green,60-69,Region 2,Vocational and Short-Cycle Education,False
7,Male,1,Widowed,Green,50-59,Region 2,Vocational and Short-Cycle Education,True
8,Male,0,Widowed,Green,60-69,Region 1,Vocational and Short-Cycle Education,False
9,Female,1,Divorced,Green,50-59,Region 4,Vocational and Short-Cycle Education,False


In [9]:
df_matches = []
for _, record in survey_voters.iterrows():
    matches = anonymised_data[
                        (anonymised_data['sex'] == record['sex']) &
                        (anonymised_data['age_group'] == record['age_group']) &
                        (anonymised_data['Danish_Citizenship'] == record['Danish_Citizenship']) &
                        (anonymised_data['marital_status'] == record['marital_status'])]
    matches = matches.copy()  # Avoid SettingWithCopyWarning
    matches["name"] = record["name"]
    #print(f"Matches for record {record.to_dict()}:")
    #print(matches)
    df_matches.append(matches)

# Combine all matches into a single DataFrame
df_matches = pd.concat(df_matches, ignore_index=True)

# Number of distinct parties among each voter's candidate records
num_unique = df_matches.groupby("name")[["party"]].nunique()
# Keep voters whose candidates all share one party, so the vote can be inferred
unique = num_unique[num_unique["party"] == 1].reset_index()
filtered_df = df_matches[df_matches['name'].isin(unique['name'])][["name", "party"]]
# Collapse to one (name, party) row per identified voter
sim_identifiable = filtered_df.groupby(["name", "party"]).count().reset_index()
sim_identifiable

Unnamed: 0,name,party
0,"Adams, Samantha",Green
1,"Anderson, Quianah",Green
2,"Brodie, Mariah",Red
3,"Brown, Shnika",Red
4,"Bullard, Matthew",Green
5,"Cardoza, Spring",Green
6,"Dahlberg, Chelsea",Red
7,"Dang, Lila",Green
8,"Garcia, Shamika",Green
9,"Hanshaw, William",Green
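
The linkage above amounts to an inner join on the quasi-identifiers followed by a homogeneity check: a voter's party is inferred only when every candidate record carries the same party. A compact sketch of the same idea with `DataFrame.merge` is shown below on made-up toy frames; `toy_voters` and `toy_anon` are hypothetical stand-ins for `survey_voters` and `anonymised_data`. On the real data the key columns may need compatible dtypes first (for example casting `age_group` to `str`), and because group B applied PRAM to party, an inferred party is a likely value rather than a certain one.

```python
import pandas as pd

quasi = ["sex", "marital_status", "age_group", "Danish_Citizenship"]

# Toy stand-ins (made-up values, not from the real files)
toy_voters = pd.DataFrame({
    "name": ["A, Ann", "B, Bob", "C, Cat"],
    "sex": ["Female", "Male", "Female"],
    "marital_status": ["Divorced", "Widowed", "Never married"],
    "age_group": ["18-29", "50-59", "30-39"],
    "Danish_Citizenship": [True, True, False],
})
toy_anon = pd.DataFrame({
    "sex": ["Female", "Female", "Male", "Male"],
    "marital_status": ["Divorced", "Divorced", "Widowed", "Widowed"],
    "age_group": ["18-29", "18-29", "50-59", "50-59"],
    "Danish_Citizenship": [True, True, True, True],
    "party": ["Green", "Green", "Red", "Green"],
})

# Every (voter, anonymised record) pair that agrees on all quasi-identifiers
toy_matches = toy_voters.merge(toy_anon, on=quasi, how="inner")

# Voters whose candidate records all share a single party
parties_per_name = toy_matches.groupby("name")["party"].nunique()
certain_names = parties_per_name[parties_per_name == 1].index

toy_identified = (
    toy_matches[toy_matches["name"].isin(certain_names)][["name", "party"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
print(toy_identified)
```

In the toy data, Ann has two candidates that are both Green and is therefore listed, Bob has candidates from two parties and is dropped, and Cat has no candidates at all.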


Exporting the result to a .csv file for hand-in

In [None]:
final_df = sim_identifiable.copy()
# Convert "Last, First" into "First Last"
final_df['name'] = final_df['name'].apply(lambda x: ' '.join(x.split(', ')[::-1]))
final_df

Unnamed: 0,name,party
0,Samantha Adams,Green
1,Quianah Anderson,Green
2,Mariah Brodie,Red
3,Shnika Brown,Red
4,Matthew Bullard,Green
5,Spring Cardoza,Green
6,Chelsea Dahlberg,Red
7,Lila Dang,Green
8,Shamika Garcia,Green
9,William Hanshaw,Green


In [14]:
final_df.to_csv('deanon/deanonymised.csv', index=False)