# Data anonymisation in Python
**James Tait, Marina Berger, Yuju Ahn, Robert Campbell**

---

## 1. Background

The CEO of the insurance company iInsureU123 is collaborating with researchers at Imperial to investigate the association between the presence of the "Wanderlust" gene variant, *DRD4*, and travel behavior in order to assess customer risk profiles. The CEO is also a part of a larger insurance project with the government that explores the relationship between the *DRD4* variant, geographical location, and educational attainment. She would like to anonymize her data so that it can be shared with the researchers at Imperial and with the government. She knows that the data given to the government will be released to the public. 

In order to better meet the disparate needs of the two groups we create two separate anonymized datasets - one suitable for the iInsureU123 project, and one for the government that is suitable for public release. These datasets are created with a balance between providing the most usable information while still preserving an acceptable degree of privacy.  

We use the packages pandas (1), numpy (2), datetime (3), iso3166 (4), hdx.location.country (5), and hashlib (6).

In [1]:
# Importing required packages
import pandas as pd
import numpy as np
from datetime import datetime
from iso3166 import countries
from hdx.location.country import Country
import hashlib

---

## 2. Anonymisation

### a. Removal of direct identifiers

The first step in anonymising the customer_information dataset for both datasets is to remove direct identifiers. Based on HIPAA guidelines (7), the following direct identifiers are removed:

- Given_name
- Surname
- Birthdate
- Phone_number
- Bank_account_number

In compliance with HIPAA, the birthdate variable is converted to age to reduce the identifiability of the variable. National Insurance Number is preserved so that it can be hashed and used as a pseudonymised unique identifier. 

In [2]:
# Loading a csv file with customer information - dataset to be anonymised 
df = pd.read_csv("Data/customer_information.csv")

In [3]:
#Banding birthdate into an age variable
today = datetime.today()
df["birthdate"] = pd.DatetimeIndex(df["birthdate"])
df['age'] = df['birthdate'].apply(lambda x: today.year - x.year - ((today.month, today.day) < (x.month, x.day)))

In [4]:
# Removing direct identifiers
df = df.drop(columns=['given_name', 'surname', 'birthdate', 'phone_number', 'bank_account_number'])

In [5]:
# Remaining columns in the dataset
list(df.columns)

['gender',
 'country_of_birth',
 'current_country',
 'postcode',
 'national_insurance_number',
 'cc_status',
 'weight',
 'height',
 'blood_group',
 'avg_n_drinks_per_week',
 'avg_n_cigret_per_week',
 'education_level',
 'n_countries_visited',
 'age']

### b. Pseudonymisation


The national insurance number (NIN) is a direct identifier that cannot be published by the government or shared with the researchers at Imperial. However, trusted parties in the government may wish to have access to NIN data in order to link datasets or contact individuals for the purpose of public health interventions. In order to preserve this information for trusted parties while maintaining public anonymity, the NIN column is anonymized using a hash function and a de-anonymizing hash table is created. We include the hash table in the data given to the government with the understanding that they can publish the dataset but not the hash table. The SHA-256 hashing algorithm is utilised per recommendations by the the National Institute of Standards and Technology (8).

In [6]:
# Applying a hash function to the national insurance number and generating a hash table mapping between the original and hashed values 
df['nat_insurance_nb_hash'] = df['national_insurance_number'].apply(lambda x:hashlib.sha256(x.encode()).hexdigest())

hash_table = df[['national_insurance_number', 'nat_insurance_nb_hash']]

In [7]:
# Removing the original national insurance number column
df = df.drop(columns=['national_insurance_number'])


In [8]:
# Exporting the hash table into a csv file
hash_table.to_csv("hash_table.csv")

#### c. Banding

The following variables will be banded in an attempt to augment the anonymity of the dataset and not share any information in a raw format:

* postcode
* education_level
* age
* weight
* height
* blood_group
* avg_n_drinks_per_week
* avg_n_cigret_per_week
* n_countries_visited

Variables including weight, height, blood_group, avg_n_drinks_per_week, avg_n_cigret_per_week, n_countries_visited are not considered to be quasi-identifiers because they are generally not publically avaliable. Despite this, we lightly band them as a precaution against the possibility of future data breaches, which may transform them into quasi-identifiers retroactively. We have made efforts to ensure this banding creates a minimal loss of data.

For age, weight, height, avg_n_drinks_per_week, avg_n_cigret_per_week and n_countries_visited, the minimum and maximum values in the dataset are determined. This allows us to create adequate bands, ensuring every value in the dataset is banded.

For blood group, the rhesus status is removed.


#### c1. postcode

We transform the direct identifier 'postcode' into a quasi-identifier 'region' by extracting the postcode area and assigning it to a region value of 'north' or 'south' based on geographical location. Postcodes beyond the contiguous United Kingdom are dropped. 

In [9]:
## Postcode

# function to obtain a partial postcode from full postcode information
def partial_postcode(postcode):

    # get the first half of the postcode by splitting the full postcode with a space
    postcode = postcode.split(' ')[0]

    # partial postcode (regional information) 
    partial_postcode = ''
    for chr in postcode:
        if chr.isnumeric():
            break
        partial_postcode += chr

    return partial_postcode

#apply this function to the postcode 
df['postcode'] = df['postcode'].apply(lambda x: partial_postcode(x))

#establish lists of prefixes to regions
north = ['AB','BB', 'BD', 'BL','CA','DD','DG','DH','DL','EH','FK','L','LA','IV','KA','HS','NE','ML','HU','M','PA','PR','PH','HX','FY','G','HD','TD','HG','SR','WA','WN','YO','ZE','B','CF', 'CH', 'CV','CW','DE','DN','DY','GL','HR','IP','LD','LE','LL','LN','LS','NG','NN','NP','OL','PE','S','SA','SK','SK','ST','SY','TF','TS','WF','WR','WS','WV']
south = ['AL','BA','BH','BN','BR','BS','CB','CM','CO','CR','IG','CT','KT','KW','KY','DA','N','MK','ME','RH','SE','SG','RM','PO','NR','SL','SW','TA','TN','TW','UB','WC','WD','W','TQ','TR','SM','SN','SO','SS','SP','RG','OX','PL','NW','LU','DT','E','EC','EN','EX','GU','HA','HP']
regions = ['north','south']

#map prefixes to regions
df['region'] = df['postcode'].apply(lambda x: 'north' if x in north else x)
df['region'] = df['region'].apply(lambda x: 'south' if x in south else x)
df['region'] = df['region'].apply(lambda x: x if x in regions else np.nan)

# Printing the region group counts
print(df.groupby(['region']).size().reset_index(name="Count").sort_values(["Count"]))

  region  Count
1  south    419
0  north    563


#### c2. education

As primary and PhD education levels are the least common, they are both banded. PhD and master's students are compared into a 'graduate' level, primary and secondary education are banded to 'no_college' and bachelor education is changed to 'undergraduate' for clarity.  

In [10]:
# Printing the education bin counts
print(df.groupby(['education_level']).size().reset_index(name="Count").sort_values(["Count"]))

  education_level  Count
3             phD     52
4         primary     61
2           other    108
1         masters    112
0        bachelor    209
5       secondary    458


In [11]:
#banding education into no college, undergraduate, and graduate
df.loc[df.education_level == 'phD', 'education_level'] = 'graduate'
df.loc[df.education_level == 'masters', 'education_level'] = 'graduate'
df.loc[df.education_level == 'bachelor', 'education_level'] = 'undergraduate'
df.loc[df.education_level == 'primary', 'education_level'] = 'no_college'
df.loc[df.education_level == 'secondary', 'education_level'] = 'no_college'

In [12]:
# Printing the education bin counts
print(df.groupby(['education_level']).size().reset_index(name="Count").sort_values(["Count"]))

  education_level  Count
2           other    108
0        graduate    164
3   undergraduate    209
1      no_college    519


#### c3. Age

In [13]:
# Printing the 10 lowest age counts
print(df.groupby(['age']).size().reset_index(name="Count").head(10).sort_values(["Count"]))

   age  Count
3   22     16
4   23     16
0   19     17
1   20     19
2   21     19
8   27     20
6   25     21
5   24     24
9   28     24
7   26     26


In [14]:
# Assessing the minimum and maximum age values
print('The minimum age of individual in the dataset is', min(df['age']))
print('The maximum age of individual in the dataset is', max(df['age']))

The minimum age of individual in the dataset is 19
The maximum age of individual in the dataset is 67


In [15]:
# Banding age values into 3 age categories
bins= [18, 30, 50, 70]
labels = ['19-30','31-50','51-70']
df['age'] = pd.cut(df['age'], bins=bins, labels=labels, right=False)

In [16]:
# Printing the age group counts
print(df.groupby(['age']).size().reset_index(name="Count").sort_values(["Count"]))

     age  Count
0  19-30    222
2  51-70    369
1  31-50    409


#### c4. Weight

In [17]:
# Printing the 10 lowest weight counts
print(df.groupby(['weight']).size().reset_index(name="Count").head(10).sort_values(["Count"]))

   weight  Count
6    35.7      1
7    35.8      1
9    36.0      1
0    35.0      2
1    35.1      2
5    35.6      2
8    35.9      2
4    35.5      3
2    35.2      4
3    35.4      4


In [18]:
# Assessing the minimum and maximum weight values
print('The minimum weight of individual in the data set is', min(df['weight']))
print('The maximum weight of individual in the data set is', max(df['weight']))

The minimum weight of individual in the data set is 35.0
The maximum weight of individual in the data set is 100.0


In [19]:
# Banding weight values into 7 weight categories
bins= [30, 40, 50, 60, 70, 80, 90, 100]
labels = ['31-40','41-50', '51-60', '61-70', '71-80', '81-90', '91-100']
df['weight'] = pd.cut(df['weight'], bins=bins, labels=labels, right=False)

In [20]:
# Printing the weight group counts
print(df.groupby(['weight']).size().reset_index(name="Count").sort_values(["Count"]))

   weight  Count
0   31-40     84
2   51-60    136
6  91-100    149
5   81-90    151
3   61-70    155
4   71-80    158
1   41-50    166


#### c5. Height

In [21]:
# Printing the 10 lowest height counts
print(df.groupby(['height']).size().reset_index(name="Count").head(50).sort_values(["Count"]))

    height  Count
32    1.72      9
0     1.40     10
36    1.76     10
23    1.63     10
13    1.53     10
33    1.73     11
39    1.79     12
24    1.64     12
49    1.89     13
47    1.87     13
26    1.66     13
20    1.60     13
3     1.43     14
18    1.58     14
17    1.57     14
7     1.47     15
22    1.62     15
21    1.61     15
1     1.41     15
15    1.55     15
10    1.50     15
4     1.44     16
14    1.54     16
41    1.81     16
37    1.77     17
35    1.75     17
34    1.74     17
44    1.84     17
27    1.67     17
11    1.51     17
31    1.71     17
40    1.80     18
28    1.68     18
48    1.88     18
5     1.45     18
16    1.56     18
9     1.49     18
42    1.82     19
45    1.85     19
8     1.48     20
19    1.59     20
25    1.65     21
38    1.78     21
30    1.70     21
43    1.83     21
2     1.42     23
12    1.52     25
29    1.69     26
6     1.46     26
46    1.86     27


In [22]:
# Assessing the minimum and maximum height values
print('The minimum height of individual in the data set is', min(df['height']))
print('The maximum height of individual in the data set is', max(df['height']))

The minimum height of individual in the data set is 1.4
The maximum height of individual in the data set is 2.0


In [23]:
# Banding height values into 6 height categories
bins= [1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.1]
labels = ['140-150', '150-160', '160-170', '170-180', '180-190', '190-200']
df['height'] = pd.cut(df['height'], bins=bins, labels=labels, right=False)

#### c6. Blood group

In [24]:
# Printing the 10 lowest blood group counts
print(df.groupby(['blood_group']).size().reset_index(name="Count").head(10).sort_values(["Count"]))

  blood_group  Count
3         AB-      6
5          B-     18
2         AB+     25
7          O-     65
1          A-     70
4          B+     90
0          A+    361
6          O+    365


In [25]:
# Banding blood group values by removing the rhesus information
df['blood_group']= df['blood_group'].apply(lambda x:x[:-1])

In [26]:
# Printing the blood group counts
print(df.groupby(['blood_group']).size().reset_index(name="Count").head(10).sort_values(["Count"]))

  blood_group  Count
1          AB     31
2           B    108
3           O    430
0           A    431


#### c7. Average number of drinks per week

In [27]:
# Printing the 10 lowest average number of drinks per week counts
print(df.groupby(['avg_n_drinks_per_week']).size().reset_index(name="Count").head(10).sort_values(["Count"]))

   avg_n_drinks_per_week  Count
0                    0.0      5
5                    0.5      5
8                    0.8      7
1                    0.1      8
9                    0.9      9
2                    0.2     10
3                    0.3     10
6                    0.6     11
7                    0.7     11
4                    0.4     13


In [28]:
# Assessing the minimum and maximum average number of drinks per week values
print('The minimum average number of drinks per week of individual in the data set is', min(df['avg_n_drinks_per_week']))
print('The maximum average number of drinks per week of individual in the data set is', max(df['avg_n_drinks_per_week']))

The minimum average number of drinks per week of individual in the data set is 0.0
The maximum average number of drinks per week of individual in the data set is 10.0


In [29]:
# Banding average number of drinks per week values into 5 average number of drinks per week categories
bins= [0.0, 2.0, 4.0, 6.0, 8.0, 10.1]
labels = ['0--2', '2--4', '4--6', '6--8', '8--10']
df['avg_n_drinks_per_week'] = pd.cut(df['avg_n_drinks_per_week'], bins=bins, labels=labels, right=False)

In [30]:
# Printing the average number of drinks per week counts
print(df.groupby(['avg_n_drinks_per_week']).size().reset_index(name="Count").sort_values(["Count"]))

  avg_n_drinks_per_week  Count
3                  6--8    172
4                 8--10    190
2                  4--6    206
0                  0--2    214
1                  2--4    218


#### c8. Average number of cigarettes per week

In [31]:
# Printing the 10 lowest average number of cigarettes per week counts
print(df.groupby(['avg_n_cigret_per_week']).size().reset_index(name="Count").head(10).sort_values(["Count"]))

   avg_n_cigret_per_week  Count
0                    0.3      1
1                    0.7      1
2                    0.8      1
4                    2.0      1
5                    2.2      1
6                    2.6      1
7                    4.2      1
8                    4.3      1
9                    4.6      1
3                    1.0      2


In [32]:
# Assessing the minimum and maximum average number of cigarettes per week values
print('The minimum average number of cigarettes per week of individual in the data set is', min(df['avg_n_cigret_per_week']))
print('The maximum average number of cigarettes per week of individual in the data set is', max(df['avg_n_cigret_per_week']))

The minimum average number of cigarettes per week of individual in the data set is 0.3
The maximum average number of cigarettes per week of individual in the data set is 500.0


In [33]:
# Banding average number of cigarettes per week values into 5 average number of cigarettes per week categories
bins= [0.0, 100.0, 200.0, 300.0, 400.0, 500.1]
labels = ['0-100', '100-200', '200-300', '300-400', '400-500']
df['avg_n_cigret_per_week'] = pd.cut(df['avg_n_cigret_per_week'], bins=bins, labels=labels, right=False)

In [34]:
# Printing the average number of cigarettes per week counts
print(df.groupby(['avg_n_cigret_per_week']).size().reset_index(name="Count").sort_values(["Count"]))

  avg_n_cigret_per_week  Count
4               400-500    190
2               200-300    191
1               100-200    194
3               300-400    199
0                 0-100    226


#### c9. The number of countries visited

In [35]:
# Printing the 10 lowest number of countries visited counts
print(df.groupby(['n_countries_visited']).size().reset_index(name="Count").head(10).sort_values(["Count"]))

   n_countries_visited  Count
9                   11      9
0                    2     12
4                    6     14
6                    8     16
2                    4     19
3                    5     19
5                    7     24
8                   10     24
1                    3     26
7                    9     30


In [36]:
# Assessing the minimum and maximum number of countries visited values
print('The minimum number of countries visited in the data set is', min(df['n_countries_visited']))
print('The maximum number of countries visited in the data set is', max(df['n_countries_visited']))

The minimum number of countries visited in the data set is 2
The maximum number of countries visited in the data set is 50


In [37]:
# Banding number of countries visited values into 5 number of countries visited categories
bins= [0, 10, 20, 30, 40, 50]
labels = ['1--10', '11--20', '21--30', '31--40', '41--50']
df['n_countries_visited'] = pd.cut(df['n_countries_visited'], bins=bins, labels=labels, right=False)

In [38]:
# Printing the number of countries visited counts
print(df.groupby(['n_countries_visited']).size().reset_index(name="Count").sort_values(["Count"]))

  n_countries_visited  Count
0               1--10    160
1              11--20    184
4              41--50    197
2              21--30    221
3              31--40    232


#### c10. Country of birth 

To anonymize the country of birth column, birth countries of customers are banded to continents using the Python package *iso3166 and Country*. The module iso3166 contains name, alpha-2, alpha-3, and numeric codes of countries and this allows for a look-up between country name and alpha3 code. With obtained alpha-3 codes of country of birth, a continent of each country is then retrieved from a GitHub repository known as 'ISO-3166-Countries-with-Regional-Codes' (9). This contains an 'all.csv' file which houses country name, alpha-2 code, alpha-3 code, country-code, region, and other variables as columns and a continent look up dictionary is generated using country alpha-3 codes as keys and regions as values. The 'all.csv' file is loaded in as the 'country_df' dataframe.

In [39]:
# Saves url of GitHub all.csv file as vector, before loading it in as the country_df dataframe 
url = "https://raw.githubusercontent.com/lukes/ISO-3166-Countries-with-Regional-Codes/master/all/all.csv"
country_df = pd.read_csv(url,index_col=0,parse_dates=[0])


In [40]:
# Defining a look up continent dictionary with alpha-3 codes of countries as keys and continents as values
look_up_continent = dict(zip(country_df['alpha-3'], country_df['region']))

In [41]:
# Defining a list with countries existing in iso3166 package
existing_countries = [country.name for country in countries]

In [42]:
# Function to convert country names into the relevant continent
def country_to_continent(country_name):
    # If country name exists in iso3166 package, find a continent with a look up dictionary
    if country_name in existing_countries:
        iso3_code = countries.get(country_name).alpha3
        continent = look_up_continent[iso3_code]
    # Perform a fuzzy search if it does not exist in iso3166 package, this allows to find country names that match defined name approximately
    # Eg. Korea -> matches with South Korea/North Korea
    else:
        iso3_code = Country.get_iso3_country_code_fuzzy(country_name)[0]
        continent = look_up_continent[iso3_code]
    return continent

In [43]:
# Applying a defined banding strategy to a country of birth column and dropping the column country of birth
df['continent_of_birth'] = df['country_of_birth'].map(lambda x:country_to_continent(x))
df = df.drop(columns=['country_of_birth'])

In [44]:
# Checking if the country_of_birth column is sufficiently banded by counting each distinct value in the column
print(df.groupby(['continent_of_birth']).size().reset_index(name="Count").sort_values(["Count"]))


  continent_of_birth  Count
4            Oceania    135
2               Asia    193
1           Americas    212
0             Africa    227
3             Europe    229


---

## 3. Attribution of variables

We create two datasets, each containing the relevant variables for their research program. 

The iInsureU123 research question is limited to Wanderlust gene status and number of countries visited. We include in their dataset the hashed NIN to uniquely identify rows, 'cc_status' as an exposure variable, 'n_countries_visited' as an outcome variable, and 'age', 'current_country', and 'continent_of_birth' as plausible confounders. Note that variables like 'age' are not plausible confounders under random sampling because the expoure variable (ie genetic data) is antecedant to age. However, we do not know how this data was gathered, which leaves open the possibility that it may confound the outcome within the dataset if sampling was non-random. 

The government research question includes the Wanderlust gene, geographic data, and educational attainment. As the government's dataset will be made publicly available, external researchers may wish to utilise the dataset to investigate a variety of research questions. With this in mind, we include 'cc_status,' all geographic and educational variables, and all other non-identifiying variables. 

In [45]:
# Selecting variables of interest for the company iInsureU123 - researchers at Imperial
df_c = df[['nat_insurance_nb_hash', 'age', 'region', 'current_country', 'cc_status', 'continent_of_birth', 'n_countries_visited']]
df_c.to_csv('Data/data_company.csv')

In [47]:
# Selecting variables of interest for the government - public
df_g = df[['nat_insurance_nb_hash', 'region', 'age', 'current_country', 'cc_status', 'continent_of_birth', 'n_countries_visited', 'education_level', 'weight', 'height', 'blood_group', 'avg_n_drinks_per_week', 'avg_n_cigret_per_week' ]]
df_g.to_csv('Data/data_government.csv')

---

## 4. Identification of k-anonymity 


The first step in calculating k-anonymity is to select quasi-identifiers. Quasi-identifiers contain information available in public datasets that could be used to deanonymize individuals in our dataset. We include the following variables as quasi-identifiers:
- gender
- age
- continent_of_birth
- current_country
- education_level

We exclude all medical data, including height, weight, blood_group, cc_status because this information is not publicly avaliable. We also exclude the hashed national_insurance_number because it has been pseudonymised. 

#### 4a. iInsureU123 dataset

In [48]:
# Defining quasi-identifiers of the company dataset
quasi_identifiers_c = ['age', 'region','continent_of_birth', 'current_country']

In [49]:
# Identifying k-anonymity by counting the unique combination of quasi-identifiers in the dataset
df_c_count = df.groupby(quasi_identifiers_c).size().reset_index(name='Count')
print(df_c_count.sort_values(by='Count'))
print("The dataset is ", min(df_c_count['Count']), "- anonymous.")

      age region continent_of_birth current_country  Count
9   19-30  south            Oceania  United Kingdom     13
4   19-30  north            Oceania  United Kingdom     14
8   19-30  south             Europe  United Kingdom     15
6   19-30  south           Americas  United Kingdom     18
7   19-30  south               Asia  United Kingdom     19
19  31-50  south            Oceania  United Kingdom     20
29  51-70  south            Oceania  United Kingdom     23
5   19-30  south             Africa  United Kingdom     24
2   19-30  north               Asia  United Kingdom     26
3   19-30  north             Europe  United Kingdom     27
27  51-70  south               Asia  United Kingdom     28
1   19-30  north           Americas  United Kingdom     29
22  51-70  north               Asia  United Kingdom     31
14  31-50  north            Oceania  United Kingdom     31
25  51-70  south             Africa  United Kingdom     32
26  51-70  south           Americas  United Kingdom     

#### 4b. Government dataset

In [50]:
# Defining quasi-identifiers of the government dataset
quasi_identifiers_g = ['current_country', 'region', 'education_level', 'continent_of_birth']

In [51]:
# Identifying k-anonymity by counting the unique combination of quasi-identifiers in the dataset
df_g_count = df_g.groupby(quasi_identifiers_g).size().reset_index(name='Count')

# Removing counts for the combination of all quasi-identifiers with 0 individual corresponding to that combination
df_g_count.drop(df_g_count[df_g_count['Count'] == 0].index, inplace=True)
print(df_g_count.sort_values(by='Count'))
print("The dataset is ", min(df_g_count['Count']), "- anonymous.")

   current_country region education_level continent_of_birth  Count
34  United Kingdom  south           other            Oceania      4
30  United Kingdom  south           other             Africa      6
14  United Kingdom  north           other            Oceania      7
10  United Kingdom  north           other             Africa      7
33  United Kingdom  south           other             Europe      8
4   United Kingdom  north        graduate            Oceania      9
32  United Kingdom  south           other               Asia     10
36  United Kingdom  south   undergraduate           Americas     11
31  United Kingdom  south           other           Americas     12
11  United Kingdom  north           other           Americas     13
12  United Kingdom  north           other               Asia     13
37  United Kingdom  south   undergraduate               Asia     13
2   United Kingdom  north        graduate               Asia     13
35  United Kingdom  south   undergraduate       

---

## 4. Safe data sharing strategy

The OneDrive for Business cloud storage software is used to safely share each dataset. In the case of the dataset for researchers at Imperial, the option "people in Imperial College London with the link" enables the view access to people with the Imperial College London account and link. For the government dataset, the link enabling view access can be shared. This dataset can be downloaded by users with the link, and edited on their downloaded version. Two different links to these dataset csv files are saved in a txt file, submitted together in a zip folder.

---

## 5. Conclusion

In this coursework, we performed the anonymisation of the customer_information dataset using several strategies including removal of direct identifiers, pseudonymisation with a hash function, banding, and splitting the original dataset into relevant targeted datasets.

We find our Imperial dataset has a k-anonymity of 13 and our government dataset has a k-anonymity of 4. We consider both dataset anonymities to be adequate. The government dataset presented a particularly challenging tradeoff between data usability and anonymity. We anonymized the government data several different ways in search of the best balance between usability and anonymity. We find that this is the best tradeoff, and that the anonymity can be further improved only by removing one of the core variables for the government research program or by draconian banding. 


---

## 6. References
We use the packages pandas[1], numpy[2], datetime[3], iso3166[4], hdx.location.country[5], and hashlib[6].

(1) Python Package Index - PyPI. (n.d.). Python Software Foundation. Retrieved from https://pypi.org/

(2) Harris, C.R., Millman, K.J., van der Walt, S.J. et al. (2020) Array programming with NumPy. (Nature 585, pp. 357–362).

(3) Dataflake, hannosch, icemac, tseaver (n.d.) DateTime 4.7. Retreived from https://pypi.org/project/DateTime/

(4) Spindel, Mike (n.d.) iso3166. Retreived from https://pypi.org/project/iso3166/

(5) mcarans (n.d.) hdx-python-country. Retreived from https://pypi.org/project/hdx-python-country/

(6) Smith, P. Gregory (n.d.) hashlib 20081119. Retreived from https://pypi.org/project/hashlib/

(7) U.S. Department of Health and Human Services. (n.d.). Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. Retreived from https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html

(8) Lowery, J. M. (2020, March 26). MD5 vs SHA-1 vs SHA-2 - Which is the Most Secure Encryption Hash and How to Check Them. Retreived from https://www.freecodecamp.org/news/md5-vs-sha-1-vs-sha-2-which-is-the-most-secure-encryption-hash-and-how-to-check-them/

(9) Duncalfe, L. (2019). ISO-3166-Countries-with-Regional-Codes, GitHub repository. Retrieved from https://github.com/lukes/ISO-3166-Countries-with-Regional-Codes/blob/master/all/all.csv
