##### Deanonymising group B's dataset 

Anonymisation techniques:
- PRAM on evote, zip, citizenship and party
- Grouping age into age groups
- Grouping education levels into broader categories
- Recoding citizenship into a boolean Danish citizenship variable
- Removed name and dob
- Masked the party variable and randomly recoded the 2 entries marked as "Invalid Vote" (not confirmed)
- Masked the zip variable by assigning a random zip region to each zip code
- k-anonymity of 2 (suppression)
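
To make the PRAM item above concrete, here is a minimal sketch of post-randomisation on a single categorical variable. It is not group B's implementation: the function name `pram_sketch`, the `p_keep` parameter, and the uniform choice among the other categories are all assumptions made for illustration.

```python
import random

def pram_sketch(values, categories, p_keep=0.8, seed=0):
    """Hypothetical PRAM sketch: keep each value with probability p_keep,
    otherwise replace it with a different category chosen uniformly."""
    rng = random.Random(seed)
    perturbed = []
    for value in values:
        if rng.random() < p_keep:
            perturbed.append(value)
        else:
            others = [c for c in categories if c != value]
            perturbed.append(rng.choice(others))
    return perturbed

parties = ["Green", "Red", "Green", "Red", "Green"]
unchanged = pram_sketch(parties, ["Green", "Red"], p_keep=1.0)
all_flipped = pram_sketch(parties, ["Green", "Red"], p_keep=0.0)
```

Because PRAM changes some values at random, a record that links uniquely to a person may still carry a perturbed party or citizenship value.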

Importing libraries 

In [32]:
import pandas as pd
import functions 
from datetime import date

Loading in their data

In [33]:
# global recoding on age and PRAM on sex
anonymised_data = pd.read_csv("deanon_data/anonymised_dataB.csv")
# global recoding on age, global recoding on marital status
register_data = pd.read_excel("deanon_data/public_data_registerB.xlsx")
# global recoding on age, global recoding on marital status, and PRAM on sex
results_data = pd.read_excel("deanon_data/public_data_resultsB.xlsx")

Preparing the register data
- Converting age -> age groups
- Converting citizenship into boolean Danish_Citizenship
- Masking zip code into region (open question; not done below)

In [34]:
# Convert dob to age in whole years, as of today's date
def calculate_age(born):
    today = date.today()
    # Subtract one year if this year's birthday has not happened yet
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))

register_data["dob"] = pd.to_datetime(register_data["dob"])
register_data["age"] = register_data["dob"].apply(calculate_age)

# Convert age to age groups; intervals are right-closed and 18 is included via include_lowest
age_bins = [18, 29, 39, 49, 59, 69, 1000]  # Adjust as needed
age_labels = ["18-29", "30-39", "40-49", "50-59", "60-69", "70+"]
register_data["age_group"] = pd.cut(
    register_data["age"], bins=age_bins, right=True, labels=age_labels, include_lowest=True
)
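
As a sanity check on the bin edges, the following sketch applies the same `pd.cut` settings to a few boundary ages. The sample ages are made up for illustration; ages below 18 fall outside the bins and come out as missing.

```python
import pandas as pd

age_bins = [18, 29, 39, 49, 59, 69, 1000]
age_labels = ["18-29", "30-39", "40-49", "50-59", "60-69", "70+"]

# Illustrative boundary ages, not taken from the register
sample_ages = pd.Series([17, 18, 29, 30, 69, 70])
sample_groups = pd.cut(
    sample_ages, bins=age_bins, right=True, labels=age_labels, include_lowest=True
)

# Which ages were left unbinned, and the labels assigned to the rest
unbinned = sample_groups.isna().tolist()
assigned = sample_groups.dropna().astype(str).tolist()
```

Note that the age is computed relative to the day the notebook is run, so a voter close to a bin edge may land in a different group than in the anonymised data if that data was prepared on an earlier date.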

In [35]:
# True if the registered citizenship is Denmark, False otherwise
register_data["Danish_Citizenship"] = register_data["citizenship"] == "Denmark"

Getting survey voters from register data 

In [36]:
quasi = ['sex', 'marital_status', 'age_group', 'Danish_Citizenship']

# Read the survey participants' names, one per line, skipping blank lines
with open("deanon_data/survey_listB.txt", "r") as my_file:
    data_into_list = [line for line in my_file.read().split("\n") if line]

# Keep only the register rows for people who took part in the survey
survey_voters = register_data.query('name in @data_into_list')
survey_voters = survey_voters[quasi + ["name"]]
survey_voters

Unnamed: 0,sex,marital_status,age_group,Danish_Citizenship,name
0,Female,Never married,18-29,False,"Dang, Lila"
12,Female,Never married,18-29,True,"Rivera, Gabriela"
14,Male,Never married,18-29,True,"Vogel, William"
18,Female,Never married,18-29,True,"Palacios, Mireya"
25,Male,Never married,18-29,False,"Mcclain, Vaughn"
...,...,...,...,...,...
1503,Male,Never married,60-69,True,"Lau, Francis"
1517,Male,Divorced,60-69,True,"Hanshaw, William"
1518,Female,Widowed,70+,True,"al-Dar, Waneesa"
1521,Male,Married/separated,70+,True,"al-Tabatabai, Tammaam"


Computing k-anonymity violations

In [37]:
functions.k_anonymity_violations(anonymised_data, quasi)

{2: (19, 9.595959595959595),
 3: (45, 22.727272727272727),
 5: (62, 31.313131313131315)}
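
The `functions` module is not shown here, so the exact behaviour of `k_anonymity_violations` is not known. The output above is consistent with a mapping from k to (number of records in equivalence classes smaller than k, percentage of all records), which would imply 198 records in total. A minimal sketch under that assumption, with the hypothetical name `k_anonymity_violations_sketch` and toy data:

```python
import pandas as pd

def k_anonymity_violations_sketch(df, quasi_identifiers, ks=(2, 3, 5)):
    """Assumed behaviour: for each k, count the rows that sit in an
    equivalence class (same quasi-identifier values) smaller than k."""
    # Rows with missing quasi-identifier values are dropped by groupby's default
    class_sizes = df.groupby(quasi_identifiers).size()
    total = len(df)
    result = {}
    for k in ks:
        n_rows = int(class_sizes[class_sizes < k].sum())
        result[k] = (n_rows, float(100 * n_rows / total))
    return result

# Toy data: class sizes are 3, 1 and 2
toy = pd.DataFrame({
    "sex": ["Female", "Female", "Female", "Male", "Female", "Female"],
    "age_group": ["18-29", "18-29", "18-29", "18-29", "30-39", "30-39"],
})
toy_violations = k_anonymity_violations_sketch(toy, ["sex", "age_group"])
```

On the four quasi-identifiers used here, 19 records are unique even though the dataset is described as 2-anonymous, which suggests the suppression was applied over a different set of quasi-identifiers.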

In [38]:
k2_violations_anon = functions.identify_k_anonymity_violations(anonymised_data, quasi)

In [45]:
# Merge the datasets
merged_data = pd.merge(survey_voters, anonymised_data, 
                       on=quasi, 
                       how="inner")

# Count how many anonymised records match each survey voter (identified by name)
match_counts = merged_data['name'].value_counts()

# Filter survey_voters with exactly one match
valid_matches = merged_data[merged_data['name'].isin(match_counts[match_counts == 1].index)]
valid_matches[quasi+["name", "party"]]

Unnamed: 0,sex,marital_status,age_group,Danish_Citizenship,name,party
519,Female,Divorced,30-39,True,"al-Yasin, Rumaana",Green
520,Male,Divorced,18-29,True,"Samoy, Michael",Green
930,Male,Widowed,50-59,True,"al-Moradi, Shaheer",Green
1245,Female,Divorced,60-69,False,"Garcia, Shamika",Green
1246,Female,Married/separated,18-29,True,"Adams, Samantha",Green
1280,Male,Never married,60-69,True,"Kanherkar, Tue Yer",Red
1281,Male,Never married,60-69,True,"Lau, Francis",Red
1292,Female,Never married,50-59,False,"Silverston, Aylissa",Red
1297,Female,Married/separated,40-49,False,"Stripling, Liana",Green
1304,Male,Never married,40-49,False,"Bullard, Matthew",Green
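
The linkage step can be illustrated on toy data. The names, columns and values below are invented for the example; only the pattern (inner merge on the quasi-identifiers, then keep people with exactly one matching record) mirrors the cell above.

```python
import pandas as pd

quasi_demo = ["sex", "age_group"]

# Hypothetical survey participants with known quasi-identifiers
survey_demo = pd.DataFrame({
    "name": ["A", "B", "C"],
    "sex": ["Female", "Male", "Female"],
    "age_group": ["18-29", "30-39", "40-49"],
})

# Hypothetical anonymised records: no names, but the sensitive value is present
anon_demo = pd.DataFrame({
    "sex": ["Female", "Male", "Male"],
    "age_group": ["18-29", "30-39", "30-39"],
    "party": ["Green", "Red", "Green"],
})

merged_demo = survey_demo.merge(anon_demo, on=quasi_demo, how="inner")

# Number of anonymised records each person links to
n_matches = merged_demo.groupby("name")["party"].transform("size")
unique_demo = merged_demo[n_matches == 1]
```

Here only A is linked to a single record; B matches two records and C matches none. A unique link is a candidate re-identification rather than a certain one, since PRAM may have altered the party or citizenship value and suppression may have removed the person's true record.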
