**Notebook for 5-anonymizing a redacted version of the EdX dataset.**

**Instructions**: Assuming that each field other than the user-id and the course name is a quasi-identifier, determine the level of k-anonymity in the file. Then, make the file 5-anonymous using only record suppression; how many records need to be deleted to do this? Try making the file 5-anonymous using only column suppression; how many columns need to be deleted to do this, and which ones are they? Finally, try to produce a 5-anonymous data set using generalization. Finally, see if you can use some combination of these mechanisms to produce a 5-anonymous data set.

In [205]:
# import the dataset as a pandas dataframe
import pandas as pd
import numpy as np

df = pd.read_csv("reduced_qi_filled.csv")
df

Unnamed: 0,course_id,user_id,cc_by_ip,city,postalCode,LoE,YoB,gender,nforum_posts,nforum_votes,nforum_endorsed,nforum_threads,nforum_comments,nforum_pinned,nforum_events
0,HarvardX/PH525.1x/1T2018,29940,US,Austin,78713,,,,0,0,0,0,0,0,0
1,HarvardX/PH525.1x/1T2018,37095,BD,Dhaka,,b,1991.0,m,0,0,0,0,0,0,0
2,HarvardX/PH525.1x/1T2018,45634,CO,Medellín,,m,1982.0,m,0,0,0,0,0,0,0
3,HarvardX/PH525.1x/1T2018,52234,SE,Skanör,,p,1988.0,m,0,0,0,0,0,0,0
4,HarvardX/PH525.1x/1T2018,52238,MX,León,,,,,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
199994,HarvardX/Hum3.1x/1T2016,15291085,AU,Silverdale,2752,jhs,2002.0,,0,0,0,0,0,0,0
199995,HarvardX/Hum3.1x/1T2016,15292716,RU,Yekaterinburg,620000,,,,0,0,0,0,0,0,0
199996,HarvardX/Hum3.1x/1T2016,15295130,TR,Istanbul,,b,1996.0,f,0,0,0,0,0,0,0
199997,HarvardX/Hum3.1x/1T2016,15296396,US,Marshfield,02050,,2000.0,,1,0,0,0,1,0,0


In [117]:
def get_quasi_ids(df):
    """
    Finds the quasi identifiers in the given dataframe,
    which is all column names excluding "course_id" and
    "user_id"
    """
    quasi_ids = df.columns.to_list()
    quasi_ids.remove("course_id")
    quasi_ids.remove("user_id")
    return quasi_ids

In [119]:
# get quasi_ids
quasi_ids = get_quasi_ids(df)
quasi_ids

['cc_by_ip',
 'city',
 'postalCode',
 'LoE',
 'YoB',
 'gender',
 'nforum_posts',
 'nforum_votes',
 'nforum_endorsed',
 'nforum_threads',
 'nforum_comments',
 'nforum_pinned',
 'nforum_events']

In [142]:
# Function for determining level of k-anonymity in a file
# 
def level_k_anon(df, quasi_ids=quasi_ids):
    """
    Determines the level of k anonymity the given dataframe has
    for the given list of quasi identifier names.

    Parameters:
    -----------
    df: pandas DataFrame
        df to find the level of anonymity of

    quasi_ids: list
        list of names of quasi identifiers (col names that correspond
        to quasi identifiers)
        NOTE: default is all column names except "course_id" and "user_id"
    """
    # Group by set of quasi id values
    quasi_id_grouped_df = df.groupby(quasi_ids, dropna=False)
    # Get number of rows in each gruop
    grouped_row_counts = quasi_id_grouped_df.size()
    # Min number of rows in a group = level of k-anonymity
    level_k_anon_num = min(grouped_row_counts)
    return level_k_anon_num
    
    

**1. Determine the level of k-anonymity in the file**

In [63]:
original_level_k_anon = level_k_anon(df)
original_level_k_anon

1

1. **Result:** the datframe is currently 1-anonymous

**2. Make file anonymous using record supression.**  (Assuming records are rows)

**Question:** how many records need to be deleted to do this?

In [92]:
num_records_dropped = 0
# Group the dataframe by each unique set of quasi id values
df_grouped_by_quasi_ids = df.groupby(quasi_ids, dropna=False)
# Iterate through each group, removing any with < 5 entries
for name, group in df_grouped_by_quasi_ids:
    # print(name)
    # print(group)
    # Get size of group
    group_size = group.shape[0]
    # print(f"num of rows in group: {group_size}")
    if group_size < 5:
        # print("group size < 0")
        # Record how many entries were removed when the group was removed
        num_records_dropped += group_size
    # print("\n")

print(f"total num of records dropped: {num_records_dropped}")

total num of records dropped: 150286


In [99]:
# Verify results
# Group by all the quasi ids
group_by_quasi_ids_df = df.groupby(quasi_ids, dropna=False)
# Get number of rows in each gruop
grouped_row_counts = group_by_quasi_ids_df.size()
print(grouped_row_counts)

# Get number of rows who belong to a group of quasi ids with less than 5 members
num_rows_less_than_5 = grouped_row_counts[grouped_row_counts < 5 ].sum(skipna=False) 
print(f"num rows with less than 5 duplicates: {num_rows_less_than_5}")
num_rows_greater_than_4 = grouped_row_counts[grouped_row_counts >= 5 ].sum(skipna=False) 
print(f"num rows with > or = to 5 duplicates: {num_rows_greater_than_4}")

cc_by_ip  city              postalCode  LoE  YoB     gender  nforum_posts  nforum_votes  nforum_endorsed  nforum_threads  nforum_comments  nforum_pinned  nforum_events
AD        Andorra La Vella  NaN         m    1972.0  f       0             0             0                0               0                0              0                 1
          Engordany         NaN         a    1973.0  m       0             0             0                0               0                0              0                 1
                                        m    1984.0  m       0             0             0                0               0                0              0                 1
AE        Abu Dhabi         NaN         a    1988.0  m       0             0             0                0               0                0              0                 1
                                             1992.0  f       0             0             0                0               0             

2. **Result**: 150286 records were deleted to make the dataset 5-anonymous using record supression.

**3. Make the file 5-anonymous using only column suppression**

**Question:** How many columns are needed to do this, and which ones are they? 

In [53]:
# Step 1: group by each col, throw out cols that results in groups with < 5 rows
cols_removed = []
for col in quasi_ids:
    print(col)
    # Group the dataframe by column and filter out groups with < 5 rows
    col_filtered_group_df = df.groupby(col).filter(lambda x: len(x) > 4)
    # print(col_filtered_group_df)
    # print("--------------------------------")
    # Check if any groups were dropped (means there were groups with < 5 rows)
    if (df.shape[0] - col_filtered_group_df.shape[0] != df.shape[0]):
        cols_removed.append(col)

print(f"cols removed: {cols_removed}")
print(f"num cols removed: {len(cols_removed)}")

cc_by_ip
                       course_id   user_id cc_by_ip           city postalCode  \
0       HarvardX/PH525.1x/1T2018     29940       US         Austin      78713   
1       HarvardX/PH525.1x/1T2018     37095       BD          Dhaka        NaN   
2       HarvardX/PH525.1x/1T2018     45634       CO       Medellín        NaN   
3       HarvardX/PH525.1x/1T2018     52234       SE         Skanör        NaN   
4       HarvardX/PH525.1x/1T2018     52238       MX           León        NaN   
...                          ...       ...      ...            ...        ...   
199994   HarvardX/Hum3.1x/1T2016  15291085       AU     Silverdale       2752   
199995   HarvardX/Hum3.1x/1T2016  15292716       RU  Yekaterinburg     620000   
199996   HarvardX/Hum3.1x/1T2016  15295130       TR       Istanbul        NaN   
199997   HarvardX/Hum3.1x/1T2016  15296396       US     Marshfield      02050   
199998   HarvardX/Hum3.1x/1T2016  15299318       AU       Brisbane        NaN   

        LoE     Yo

In [139]:
# Step 1: group by each col, throw out cols that results in groups with < 5 rows
cols_removed = []
for quasi_id in quasi_ids:
    # Isolate just the column for a particular quasi id
    col_df = df[quasi_id]
    # Count the number of occurrences (counts) of each unique value in it
    count = df[quasi_id].value_counts()
    # print(count)
    # Get the minimum number of occurrences of a unique value (min count at end)... 
    # if this is < 5, this record must be dropped
    min_count = count.iloc[-1]
    print(f"col for {quasi_id} has a min value count of {min_count}")
    if (min_count < 5):
        cols_removed.append(quasi_id)

print(f"cols removed: {cols_removed}")
print(f"num cols removed: {len(cols_removed)}")

col for cc_by_ip has a min value count of 1
col for city has a min value count of 1
col for postalCode has a min value count of 1
col for LoE has a min value count of 535
col for YoB has a min value count of 1
col for gender has a min value count of 978
col for nforum_posts has a min value count of 1
col for nforum_votes has a min value count of 1
col for nforum_endorsed has a min value count of 1
col for nforum_threads has a min value count of 1
col for nforum_comments has a min value count of 1
col for nforum_pinned has a min value count of 1
col for nforum_events has a min value count of 1
cols removed: ['cc_by_ip', 'city', 'postalCode', 'YoB', 'nforum_posts', 'nforum_votes', 'nforum_endorsed', 'nforum_threads', 'nforum_comments', 'nforum_pinned', 'nforum_events']
num cols removed: 11


In [143]:
# Step 2: check if reached 5-anonymity (otherwise have to del more cols)
init_cols_supressed_df = df.drop(cols_removed,axis=1)
# Checking that dropped all 11 desired cols
print(f"init cols supressed df: \n{init_cols_supressed_df}")
# Check level of anonymity in remaining df
sup_quasi_ids = get_quasi_ids(init_cols_supressed_df)
print(f"Quasi ids of col supressed df: {sup_quasi_ids}")
init_col_sup_lvl_anon = level_k_anon(init_cols_supressed_df, quasi_ids=sup_quasi_ids)
print(f"Level of k anonymity of dataframe after step 1: {init_col_sup_lvl_anon}")

init cols supressed df: 
                       course_id   user_id  LoE gender
0       HarvardX/PH525.1x/1T2018     29940  NaN    NaN
1       HarvardX/PH525.1x/1T2018     37095    b      m
2       HarvardX/PH525.1x/1T2018     45634    m      m
3       HarvardX/PH525.1x/1T2018     52234    p      m
4       HarvardX/PH525.1x/1T2018     52238  NaN    NaN
...                          ...       ...  ...    ...
199994   HarvardX/Hum3.1x/1T2016  15291085  jhs    NaN
199995   HarvardX/Hum3.1x/1T2016  15292716  NaN    NaN
199996   HarvardX/Hum3.1x/1T2016  15295130    b      f
199997   HarvardX/Hum3.1x/1T2016  15296396  NaN    NaN
199998   HarvardX/Hum3.1x/1T2016  15299318  NaN      f

[199999 rows x 4 columns]
Quasi ids of col supressed df: ['LoE', 'gender']
Level of k anonymity of dataframe after step 1: 2


In [155]:
# Step 3: brute force search to make 5 anonymous using col supression: 
# find all permutations of remaining cols, delete one at a time, see which
# permutation leads to least cols deleted to get to 5 anonymity
min_num_col_del = None
min_cols_del = []
from itertools import permutations
for p in permutations(sup_quasi_ids):
    try_cols_supressed_df = init_cols_supressed_df
    num_col_del = 0
    cols_del = []
    for quasi_id in p:
        # print(quasi_id)
        try_cols_supressed_df = try_cols_supressed_df.drop([quasi_id], axis=1)
        num_col_del += 1
        # print(cols_del)
        try_quasi_ids = get_quasi_ids(try_cols_supressed_df)
        # print(try_quasi_ids)
        if (len(try_quasi_ids) > 0):
            cols_del.append(quasi_id)
            try_cols_lvl_anon = level_k_anon(try_cols_supressed_df, quasi_ids=try_quasi_ids)
            if (try_cols_lvl_anon >= 5):
                if (min_num_col_del is None or num_col_del < min_num_col_del):
                    min_num_col_del = num_col_del
                    min_cols_del = cols_del

print(f"Minimum number of columns deleted in Step 3: {min_num_col_del}")
print(f"The cols deleted in Step 3 to achieve this: {min_cols_del}")

Minimum number of columns deleted in Step 3: 1
The cols deleted in Step 3 to achieve this: ['LoE']


In [156]:
# Check what happens if dropped gender col: does this result in 5-anonymity?
try_df = init_cols_supressed_df.drop(["gender"],axis=1)
# get level of anonymity
try_counts = try_df["LoE"].value_counts(dropna=False)
try_counts

b        59422
m        48204
hs       35706
NaN      27991
p         9114
a         8224
jhs       4542
other     3721
p_se      1105
p_oth      732
el         703
none       535
Name: LoE, dtype: int64

In [None]:
# each value when drop the gender col has more than 5 duplicates, so can
# either drop LoE or gender, doesn't matter

3. **Result:** 12 columns have to be dropped.  These are:
* 'cc_by_ip', 'city', 'postalCode', 'YoB', 'nforum_posts', 'nforum_votes', 'nforum_endorsed', 'nforum_threads', 'nforum_comments', 'nforum_pinned', 'nforum_events'
* and one of either 'LoE' or 'gender'

**4. Produce a 5-anonymous dataset using generalization**

4.1: Step 1: generalize the 11 columns that did not have at least 5 entries per unique value

In [218]:
def get_grouped_row_counts(df, cols):
    grouped_df = df.groupby(cols, dropna=False)
    return grouped_df.size()

# Declare df for generalization to occur in
gen_df = df.copy()
gen_df

Unnamed: 0,course_id,user_id,cc_by_ip,city,postalCode,LoE,YoB,gender,nforum_posts,nforum_votes,nforum_endorsed,nforum_threads,nforum_comments,nforum_pinned,nforum_events
0,HarvardX/PH525.1x/1T2018,29940,US,Austin,78713,,,,0,0,0,0,0,0,0
1,HarvardX/PH525.1x/1T2018,37095,BD,Dhaka,,b,1991.0,m,0,0,0,0,0,0,0
2,HarvardX/PH525.1x/1T2018,45634,CO,Medellín,,m,1982.0,m,0,0,0,0,0,0,0
3,HarvardX/PH525.1x/1T2018,52234,SE,Skanör,,p,1988.0,m,0,0,0,0,0,0,0
4,HarvardX/PH525.1x/1T2018,52238,MX,León,,,,,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
199994,HarvardX/Hum3.1x/1T2016,15291085,AU,Silverdale,2752,jhs,2002.0,,0,0,0,0,0,0,0
199995,HarvardX/Hum3.1x/1T2016,15292716,RU,Yekaterinburg,620000,,,,0,0,0,0,0,0,0
199996,HarvardX/Hum3.1x/1T2016,15295130,TR,Istanbul,,b,1996.0,f,0,0,0,0,0,0,0
199997,HarvardX/Hum3.1x/1T2016,15296396,US,Marshfield,02050,,2000.0,,1,0,0,0,1,0,0


4.1.1 generalize cc_by_ip

In [219]:
cc_by_ip_counts_dict = get_grouped_row_counts(df, ["cc_by_ip"]).to_dict()
print(cc_by_ip_counts_dict)
# Easiest way is to probably group things into regions
# Got a csv mapping codes to subregions and regions, doing subregions first
# to try to preserve info
country_code_df = pd.read_csv("country_code_region.csv")
country_code_df

{'AD': 3, 'AE': 718, 'AF': 51, 'AG': 16, 'AI': 1, 'AL': 194, 'AM': 116, 'AO': 33, 'AR': 1086, 'AS': 1, 'AT': 535, 'AU': 3703, 'AW': 10, 'AX': 2, 'AZ': 99, 'BA': 113, 'BB': 47, 'BD': 569, 'BE': 963, 'BF': 32, 'BG': 353, 'BH': 85, 'BI': 10, 'BJ': 24, 'BM': 13, 'BN': 26, 'BO': 169, 'BQ': 2, 'BR': 6615, 'BS': 49, 'BT': 25, 'BW': 59, 'BY': 152, 'BZ': 27, 'CA': 6475, 'CD': 43, 'CF': 1, 'CG': 4, 'CH': 833, 'CI': 74, 'CL': 952, 'CM': 177, 'CN': 3510, 'CO': 2330, 'CR': 304, 'CU': 24, 'CV': 19, 'CW': 15, 'CY': 93, 'CZ': 474, 'DE': 3425, 'DJ': 24, 'DK': 505, 'DM': 11, 'DO': 272, 'DZ': 316, 'EC': 498, 'EE': 138, 'EG': 1629, 'ER': 6, 'ES': 3220, 'ET': 282, 'FI': 363, 'FJ': 34, 'FM': 3, 'FO': 4, 'FR': 2801, 'GA': 8, 'GB': 7638, 'GD': 18, 'GE': 182, 'GF': 1, 'GG': 4, 'GH': 597, 'GI': 3, 'GL': 4, 'GM': 25, 'GN': 15, 'GP': 7, 'GQ': 2, 'GR': 1621, 'GT': 203, 'GU': 17, 'GW': 1, 'GY': 40, 'HK': 1378, 'HN': 127, 'HR': 285, 'HT': 138, 'HU': 448, 'ID': 1296, 'IE': 746, 'IL': 575, 'IM': 5, 'IN': 14752, 'IQ': 

Unnamed: 0,name,alpha-2,alpha-3,country-code,iso_3166-2,region,sub-region,intermediate-region,region-code,sub-region-code,intermediate-region-code
0,Afghanistan,AF,AFG,4,ISO 3166-2:AF,Asia,Southern Asia,,142.0,34.0,
1,Åland Islands,AX,ALA,248,ISO 3166-2:AX,Europe,Northern Europe,,150.0,154.0,
2,Albania,AL,ALB,8,ISO 3166-2:AL,Europe,Southern Europe,,150.0,39.0,
3,Algeria,DZ,DZA,12,ISO 3166-2:DZ,Africa,Northern Africa,,2.0,15.0,
4,American Samoa,AS,ASM,16,ISO 3166-2:AS,Oceania,Polynesia,,9.0,61.0,
...,...,...,...,...,...,...,...,...,...,...,...
244,Wallis and Futuna,WF,WLF,876,ISO 3166-2:WF,Oceania,Polynesia,,9.0,61.0,
245,Western Sahara,EH,ESH,732,ISO 3166-2:EH,Africa,Northern Africa,,2.0,15.0,
246,Yemen,YE,YEM,887,ISO 3166-2:YE,Asia,Western Asia,,142.0,145.0,
247,Zambia,ZM,ZMB,894,ISO 3166-2:ZM,Africa,Sub-Saharan Africa,Eastern Africa,2.0,202.0,14.0


In [223]:
# Map each cc_by_ip = alpha-2 to sub-region-code through this csv
# Create mappings between cc and sub-region
mappings = {}
for cc, count in cc_by_ip_counts_dict.items():
    try:
        # TODO may have to change to region instead
        sub_region = country_code_df[country_code_df['alpha-2'] == cc]['sub-region'].values[0]
        mappings[cc] = sub_region
    except:
        # hit end of series
        pass
print(mappings)
gen_df["gen_cc_by_ip_to_sub-region"] = gen_df["cc_by_ip"].map(mappings)
gen_df = gen_df.drop(["cc_by_ip"],axis=1)

{'AD': 'Southern Europe', 'AE': 'Western Asia', 'AF': 'Southern Asia', 'AG': 'Latin America and the Caribbean', 'AI': 'Latin America and the Caribbean', 'AL': 'Southern Europe', 'AM': 'Western Asia', 'AO': 'Sub-Saharan Africa', 'AR': 'Latin America and the Caribbean', 'AS': 'Polynesia', 'AT': 'Western Europe', 'AU': 'Australia and New Zealand', 'AW': 'Latin America and the Caribbean', 'AX': 'Northern Europe', 'AZ': 'Western Asia', 'BA': 'Southern Europe', 'BB': 'Latin America and the Caribbean', 'BD': 'Southern Asia', 'BE': 'Western Europe', 'BF': 'Sub-Saharan Africa', 'BG': 'Eastern Europe', 'BH': 'Western Asia', 'BI': 'Sub-Saharan Africa', 'BJ': 'Sub-Saharan Africa', 'BM': 'Northern America', 'BN': 'South-eastern Asia', 'BO': 'Latin America and the Caribbean', 'BQ': 'Latin America and the Caribbean', 'BR': 'Latin America and the Caribbean', 'BS': 'Latin America and the Caribbean', 'BT': 'Southern Asia', 'BW': 'Sub-Saharan Africa', 'BY': 'Eastern Europe', 'BZ': 'Latin America and the 

In [224]:
gen_df

Unnamed: 0,course_id,user_id,city,postalCode,LoE,YoB,gender,nforum_posts,nforum_votes,nforum_endorsed,nforum_threads,nforum_comments,nforum_pinned,nforum_events,gen_cc_by_ip_to_sub-region
0,HarvardX/PH525.1x/1T2018,29940,Austin,78713,,,,0,0,0,0,0,0,0,Northern America
1,HarvardX/PH525.1x/1T2018,37095,Dhaka,,b,1991.0,m,0,0,0,0,0,0,0,Southern Asia
2,HarvardX/PH525.1x/1T2018,45634,Medellín,,m,1982.0,m,0,0,0,0,0,0,0,Latin America and the Caribbean
3,HarvardX/PH525.1x/1T2018,52234,Skanör,,p,1988.0,m,0,0,0,0,0,0,0,Northern Europe
4,HarvardX/PH525.1x/1T2018,52238,León,,,,,0,0,0,0,0,0,0,Latin America and the Caribbean
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
199994,HarvardX/Hum3.1x/1T2016,15291085,Silverdale,2752,jhs,2002.0,,0,0,0,0,0,0,0,Australia and New Zealand
199995,HarvardX/Hum3.1x/1T2016,15292716,Yekaterinburg,620000,,,,0,0,0,0,0,0,0,Eastern Europe
199996,HarvardX/Hum3.1x/1T2016,15295130,Istanbul,,b,1996.0,f,0,0,0,0,0,0,0,Western Asia
199997,HarvardX/Hum3.1x/1T2016,15296396,Marshfield,02050,,2000.0,,1,0,0,0,1,0,0,Northern America


In [226]:
# Validate 5 anon of this col
gen_lvl_anon = level_k_anon(gen_df, ["gen_cc_by_ip_to_sub-region"])
gen_lvl_anon # Anon is 8, this is properly generalized!

8

4.1.2 generalize city

In [229]:
cities_counts_dict = get_grouped_row_counts(df, ["city"]).to_dict()
print(cities_counts_dict)
# Easiest way is to probably group cities into countries (by country codes)... if that fails:
# do sub-regions
# and if that fails: go back and do regions
# Got a csv mapping cities to countries
# to try to preserve info
cities_df = pd.read_csv("cities.csv")
cities_df

{"'s-hertogenbosch": 3, "'t Horntje": 1, 'A Coruña': 21, 'Aabenraa': 1, 'Aach': 1, 'Aachen': 30, 'Aalsmeer': 4, 'Aalst': 5, 'Aarburg': 1, 'Aarhus': 36, 'Aartselaar': 1, 'Aas': 3, 'Abakaliki': 1, 'Abakan': 2, 'Abancay': 2, 'Abano Terme': 2, 'Abarán': 1, 'Abasolo': 1, 'Abbekerk': 1, 'Abbeville': 3, 'Abbots Langley': 2, 'Abbotsford': 16, 'Abbottabad': 7, 'Abdullah': 1, 'Abeokuta': 5, 'Aberdare': 4, 'Aberdeen': 49, 'Abernethy': 1, 'Aberystwyth': 4, 'Abha': 1, 'Abidjan': 65, 'Abiko': 4, 'Abilene': 14, 'Abingdon': 13, 'Abira': 1, 'Abrams': 1, 'Absam': 1, 'Absecon': 59, 'Abu Dhabi': 174, 'Abuja': 176, 'Abymes': 1, 'Acacia Ridge': 1, 'Acacías': 2, 'Acapulco': 10, 'Acate': 1, 'Accra': 438, 'Accrington': 2, 'Acerra': 1, 'Acheng': 1, 'Achrafieh': 1, 'Acme': 1, 'Acolman': 1, 'Acton': 30, 'Actopan': 2, 'Acushnet': 1, 'Acworth': 15, 'Ad Dammam': 1, 'Ada': 4, 'Adamantina': 1, 'Adams': 2, 'Adana': 27, 'Addis Ababa': 178, 'Addison': 14, 'Addlestone': 3, 'Adeje': 1, 'Adelaide': 160, 'Adelsheim': 1, 'Adl

Unnamed: 0,id,name,state_id,state_code,state_name,country_id,country_code,country_name,latitude,longitude,wikiDataId
0,52,Ashkāsham,3901,BDS,Badakhshan,1,AF,Afghanistan,36.68333,71.53333,Q4805192
1,68,Fayzabad,3901,BDS,Badakhshan,1,AF,Afghanistan,37.11664,70.58002,Q156558
2,78,Jurm,3901,BDS,Badakhshan,1,AF,Afghanistan,36.86477,70.83421,Q10308323
3,84,Khandūd,3901,BDS,Badakhshan,1,AF,Afghanistan,36.95127,72.31800,Q3290334
4,115,Rāghistān,3901,BDS,Badakhshan,1,AF,Afghanistan,37.66079,70.67346,Q2670909
...,...,...,...,...,...,...,...,...,...,...,...
150548,131496,Redcliff,1957,MI,Midlands Province,247,ZW,Zimbabwe,-19.03333,29.78333,Q584001
150549,131502,Shangani,1957,MI,Midlands Province,247,ZW,Zimbabwe,-19.78333,29.36667,Q32017959
150550,131503,Shurugwi,1957,MI,Midlands Province,247,ZW,Zimbabwe,-19.67016,30.00589,Q32019023
150551,131504,Shurugwi District,1957,MI,Midlands Province,247,ZW,Zimbabwe,-19.75000,30.16667,Q7505444


In [230]:
# Map each city = name to country code through this csv
# Create mappings between cc and sub-region
mappings = {}
for city, count in cities_counts_dict.items():
    try:
        # TODO may have to change to region instead
        cc = cities_df[cities_df['name'] == city]['country_code'].values[0]
        mappings[city] = cc
    except:
        # hit end of series
        pass
print(mappings)
gen_df["gen_city_to_country_code"] = gen_df["city"].map(mappings)
gen_df = gen_df.drop(["city"],axis=1)
gen_df

{"'t Horntje": 'NL', 'Aabenraa': 'DK', 'Aach': 'DE', 'Aachen': 'DE', 'Aalsmeer': 'NL', 'Aalst': 'BE', 'Aarburg': 'CH', 'Aarhus': 'DK', 'Aartselaar': 'BE', 'Abakaliki': 'NG', 'Abakan': 'RU', 'Abancay': 'PE', 'Abano Terme': 'IT', 'Abarán': 'ES', 'Abasolo': 'MX', 'Abbekerk': 'NL', 'Abbeville': 'FR', 'Abbots Langley': 'GB', 'Abbotsford': 'AU', 'Abbottabad': 'PK', 'Abeokuta': 'NG', 'Aberdare': 'AU', 'Aberdeen': 'AU', 'Abernethy': 'GB', 'Aberystwyth': 'GB', 'Abha': 'SA', 'Abidjan': 'CI', 'Abiko': 'JP', 'Abilene': 'US', 'Abingdon': 'GB', 'Absam': 'AT', 'Absecon': 'US', 'Abuja': 'NG', 'Acacia Ridge': 'AU', 'Acacías': 'CO', 'Acate': 'IT', 'Accra': 'GH', 'Accrington': 'GB', 'Acerra': 'IT', 'Acheng': 'CN', 'Acton': 'AU', 'Actopan': 'MX', 'Acushnet': 'US', 'Acworth': 'US', 'Ada': 'US', 'Adamantina': 'BR', 'Adams': 'PH', 'Addis Ababa': 'ET', 'Addison': 'US', 'Addlestone': 'GB', 'Adeje': 'ES', 'Adelaide': 'AU', 'Adelsheim': 'DE', 'Adliswil': 'CH', 'Adrar': 'DZ', 'Adrian': 'RO', 'Aerzen': 'DE', 'Aesc

Unnamed: 0,course_id,user_id,postalCode,LoE,YoB,gender,nforum_posts,nforum_votes,nforum_endorsed,nforum_threads,nforum_comments,nforum_pinned,nforum_events,gen_cc_by_ip_to_sub-region,gen_city_to_country_code
0,HarvardX/PH525.1x/1T2018,29940,78713,,,,0,0,0,0,0,0,0,Northern America,US
1,HarvardX/PH525.1x/1T2018,37095,,b,1991.0,m,0,0,0,0,0,0,0,Southern Asia,BD
2,HarvardX/PH525.1x/1T2018,45634,,m,1982.0,m,0,0,0,0,0,0,0,Latin America and the Caribbean,CO
3,HarvardX/PH525.1x/1T2018,52234,,p,1988.0,m,0,0,0,0,0,0,0,Northern Europe,
4,HarvardX/PH525.1x/1T2018,52238,,,,,0,0,0,0,0,0,0,Latin America and the Caribbean,MX
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
199994,HarvardX/Hum3.1x/1T2016,15291085,2752,jhs,2002.0,,0,0,0,0,0,0,0,Australia and New Zealand,AU
199995,HarvardX/Hum3.1x/1T2016,15292716,620000,,,,0,0,0,0,0,0,0,Eastern Europe,RU
199996,HarvardX/Hum3.1x/1T2016,15295130,,b,1996.0,f,0,0,0,0,0,0,0,Western Asia,
199997,HarvardX/Hum3.1x/1T2016,15296396,02050,,2000.0,,1,0,0,0,1,0,0,Northern America,GB


In [232]:
# Validate each unique val in this col has at least 5 entries
gen_lvl_anon = level_k_anon(gen_df, ["gen_city_to_country_code"])
gen_lvl_anon # Anon is 1, this is not properly generalized!

1

In [235]:
# That wasn't enough generalization... go from country code to sub-region again
gen_cities_counts_dict = get_grouped_row_counts(gen_df, ["gen_city_to_country_code"]).to_dict()
print(cities_counts_dict)
# Map each cc_by_ip = alpha-2 to sub-region-code through this csv
# Create mappings between cc and sub-region
mappings = {}
for cc, count in gen_cities_counts_dict.items():
    try:
        # TODO may have to change to region instead
        sub_region = country_code_df[country_code_df['alpha-2'] == cc]['sub-region'].values[0]
        mappings[cc] = sub_region
    except:
        # hit end of series
        pass
print(mappings)
gen_df["gen_city_to_sub-region"] = gen_df["gen_city_to_country_code"].map(mappings)
# gen_df = gen_df.drop(["cc_by_ip"],axis=1)

{"'s-hertogenbosch": 3, "'t Horntje": 1, 'A Coruña': 21, 'Aabenraa': 1, 'Aach': 1, 'Aachen': 30, 'Aalsmeer': 4, 'Aalst': 5, 'Aarburg': 1, 'Aarhus': 36, 'Aartselaar': 1, 'Aas': 3, 'Abakaliki': 1, 'Abakan': 2, 'Abancay': 2, 'Abano Terme': 2, 'Abarán': 1, 'Abasolo': 1, 'Abbekerk': 1, 'Abbeville': 3, 'Abbots Langley': 2, 'Abbotsford': 16, 'Abbottabad': 7, 'Abdullah': 1, 'Abeokuta': 5, 'Aberdare': 4, 'Aberdeen': 49, 'Abernethy': 1, 'Aberystwyth': 4, 'Abha': 1, 'Abidjan': 65, 'Abiko': 4, 'Abilene': 14, 'Abingdon': 13, 'Abira': 1, 'Abrams': 1, 'Absam': 1, 'Absecon': 59, 'Abu Dhabi': 174, 'Abuja': 176, 'Abymes': 1, 'Acacia Ridge': 1, 'Acacías': 2, 'Acapulco': 10, 'Acate': 1, 'Accra': 438, 'Accrington': 2, 'Acerra': 1, 'Acheng': 1, 'Achrafieh': 1, 'Acme': 1, 'Acolman': 1, 'Acton': 30, 'Actopan': 2, 'Acushnet': 1, 'Acworth': 15, 'Ad Dammam': 1, 'Ada': 4, 'Adamantina': 1, 'Adams': 2, 'Adana': 27, 'Addis Ababa': 178, 'Addison': 14, 'Addlestone': 3, 'Adeje': 1, 'Adelaide': 160, 'Adelsheim': 1, 'Adl

In [238]:
gen_df = gen_df.drop(["gen_city_to_country_code"], axis=1)

In [239]:
gen_df

Unnamed: 0,course_id,user_id,postalCode,LoE,YoB,gender,nforum_posts,nforum_votes,nforum_endorsed,nforum_threads,nforum_comments,nforum_pinned,nforum_events,gen_cc_by_ip_to_sub-region,gen_city_to_sub-region
0,HarvardX/PH525.1x/1T2018,29940,78713,,,,0,0,0,0,0,0,0,Northern America,Northern America
1,HarvardX/PH525.1x/1T2018,37095,,b,1991.0,m,0,0,0,0,0,0,0,Southern Asia,Southern Asia
2,HarvardX/PH525.1x/1T2018,45634,,m,1982.0,m,0,0,0,0,0,0,0,Latin America and the Caribbean,Latin America and the Caribbean
3,HarvardX/PH525.1x/1T2018,52234,,p,1988.0,m,0,0,0,0,0,0,0,Northern Europe,
4,HarvardX/PH525.1x/1T2018,52238,,,,,0,0,0,0,0,0,0,Latin America and the Caribbean,Latin America and the Caribbean
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
199994,HarvardX/Hum3.1x/1T2016,15291085,2752,jhs,2002.0,,0,0,0,0,0,0,0,Australia and New Zealand,Australia and New Zealand
199995,HarvardX/Hum3.1x/1T2016,15292716,620000,,,,0,0,0,0,0,0,0,Eastern Europe,Eastern Europe
199996,HarvardX/Hum3.1x/1T2016,15295130,,b,1996.0,f,0,0,0,0,0,0,0,Western Asia,
199997,HarvardX/Hum3.1x/1T2016,15296396,02050,,2000.0,,1,0,0,0,1,0,0,Northern America,Northern Europe


In [241]:
# Validate each unique val in this col has at least 5 entries
gen_lvl_anon = level_k_anon(gen_df, ["gen_city_to_sub-region"])
gen_lvl_anon # Anon is 2, this is not properly generalized!

2

In [243]:
# That wasn't enough generalization... go from sub-region to region
gen_cities2_counts_dict = get_grouped_row_counts(gen_df, ["gen_city_to_sub-region"]).to_dict()
print(gen_cities2_counts_dict)
# Map each sub-region to region through this csv
# Create mappings between sub-region and region
mappings = {}
for subregion, count in gen_cities2_counts_dict.items():
    try:
        # TODO may have to change to region instead
        region = country_code_df[country_code_df['sub-region'] == subregion]['region'].values[0]
        mappings[subregion] = region
    except:
        # hit end of series
        pass
print(mappings)
gen_df["gen_city_to_region"] = gen_df["gen_city_to_sub-region"].map(mappings)
# gen_df = gen_df.drop(["cc_by_ip"],axis=1)

{'Australia and New Zealand': 15200, 'Central Asia': 185, 'Eastern Asia': 5936, 'Eastern Europe': 4383, 'Latin America and the Caribbean': 20437, 'Melanesia': 54, 'Micronesia': 4, 'Northern Africa': 1186, 'Northern America': 40779, 'Northern Europe': 7283, 'Polynesia': 2, 'South-eastern Asia': 7937, 'Southern Asia': 13344, 'Southern Europe': 4893, 'Sub-Saharan Africa': 4264, 'Western Asia': 2525, 'Western Europe': 7172, nan: 64415}
{'Australia and New Zealand': 'Oceania', 'Central Asia': 'Asia', 'Eastern Asia': 'Asia', 'Eastern Europe': 'Europe', 'Latin America and the Caribbean': 'Americas', 'Melanesia': 'Oceania', 'Micronesia': 'Oceania', 'Northern Africa': 'Africa', 'Northern America': 'Americas', 'Northern Europe': 'Europe', 'Polynesia': 'Oceania', 'South-eastern Asia': 'Asia', 'Southern Asia': 'Asia', 'Southern Europe': 'Europe', 'Sub-Saharan Africa': 'Africa', 'Western Asia': 'Asia', 'Western Europe': 'Europe'}


In [244]:
gen_df

Unnamed: 0,course_id,user_id,postalCode,LoE,YoB,gender,nforum_posts,nforum_votes,nforum_endorsed,nforum_threads,nforum_comments,nforum_pinned,nforum_events,gen_cc_by_ip_to_sub-region,gen_city_to_sub-region,gen_city_to_region
0,HarvardX/PH525.1x/1T2018,29940,78713,,,,0,0,0,0,0,0,0,Northern America,Northern America,Americas
1,HarvardX/PH525.1x/1T2018,37095,,b,1991.0,m,0,0,0,0,0,0,0,Southern Asia,Southern Asia,Asia
2,HarvardX/PH525.1x/1T2018,45634,,m,1982.0,m,0,0,0,0,0,0,0,Latin America and the Caribbean,Latin America and the Caribbean,Americas
3,HarvardX/PH525.1x/1T2018,52234,,p,1988.0,m,0,0,0,0,0,0,0,Northern Europe,,
4,HarvardX/PH525.1x/1T2018,52238,,,,,0,0,0,0,0,0,0,Latin America and the Caribbean,Latin America and the Caribbean,Americas
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
199994,HarvardX/Hum3.1x/1T2016,15291085,2752,jhs,2002.0,,0,0,0,0,0,0,0,Australia and New Zealand,Australia and New Zealand,Oceania
199995,HarvardX/Hum3.1x/1T2016,15292716,620000,,,,0,0,0,0,0,0,0,Eastern Europe,Eastern Europe,Europe
199996,HarvardX/Hum3.1x/1T2016,15295130,,b,1996.0,f,0,0,0,0,0,0,0,Western Asia,,
199997,HarvardX/Hum3.1x/1T2016,15296396,02050,,2000.0,,1,0,0,0,1,0,0,Northern America,Northern Europe,Europe


In [245]:
gen_df = gen_df.drop(["gen_city_to_sub-region"],axis=1)
gen_df

Unnamed: 0,course_id,user_id,postalCode,LoE,YoB,gender,nforum_posts,nforum_votes,nforum_endorsed,nforum_threads,nforum_comments,nforum_pinned,nforum_events,gen_cc_by_ip_to_sub-region,gen_city_to_region
0,HarvardX/PH525.1x/1T2018,29940,78713,,,,0,0,0,0,0,0,0,Northern America,Americas
1,HarvardX/PH525.1x/1T2018,37095,,b,1991.0,m,0,0,0,0,0,0,0,Southern Asia,Asia
2,HarvardX/PH525.1x/1T2018,45634,,m,1982.0,m,0,0,0,0,0,0,0,Latin America and the Caribbean,Americas
3,HarvardX/PH525.1x/1T2018,52234,,p,1988.0,m,0,0,0,0,0,0,0,Northern Europe,
4,HarvardX/PH525.1x/1T2018,52238,,,,,0,0,0,0,0,0,0,Latin America and the Caribbean,Americas
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
199994,HarvardX/Hum3.1x/1T2016,15291085,2752,jhs,2002.0,,0,0,0,0,0,0,0,Australia and New Zealand,Oceania
199995,HarvardX/Hum3.1x/1T2016,15292716,620000,,,,0,0,0,0,0,0,0,Eastern Europe,Europe
199996,HarvardX/Hum3.1x/1T2016,15295130,,b,1996.0,f,0,0,0,0,0,0,0,Western Asia,
199997,HarvardX/Hum3.1x/1T2016,15296396,02050,,2000.0,,1,0,0,0,1,0,0,Northern America,Europe


In [247]:
# Validate each unique val in this col has at least 5 entries
gen_lvl_anon = level_k_anon(gen_df, ["gen_city_to_region"])
gen_lvl_anon # Anon is 5450, this is properly generalized!

5450

4.1.3 generalize "postalCode"

In [257]:
postal_counts_dict = get_grouped_row_counts(df, ["postalCode"]).to_dict()
print(postal_counts_dict)
# Easiest way is to probably to just go to three digits bc of regional
# (add 4th digit will give too much localization)... note that these location info are all getting redundant
# Tried to fourth digit first
truncated_postal_code = gen_df['postalCode'].str.slice(0,4).astype(str)

truncated_postal_code_df = truncated_postal_code.to_frame("gen_postalCode_to_4_digits")
truncated_postal_code_df

{'0002': 1, '00041': 1, '00045': 1, '00046': 1, '00053': 1, '00100': 7, '00123': 1, '00127': 1, '00133': 1, '00135': 24, '00136': 1, '00137': 1, '00141': 4, '00142': 3, '00143': 1, '00144': 1, '00146': 4, '00148': 3, '00149': 1, '00152': 1, '00169': 3, '00170': 1, '00172': 1, '00180': 1, '00183': 1, '00195': 8, '00196': 4, '00197': 11, '00198': 4, '0020': 1, '00270': 1, '0028': 1, '00300': 1, '00320': 1, '00380': 4, '00390': 1, '004': 1, '00400': 1, '00410': 1, '00444': 1, '0048': 1, '00510': 4, '00530': 4, '00550': 1, '00807': 1, '0081': 1, '0084': 1, '00940': 2, '00980': 20, '00990': 2, '01000': 1, '01001': 2, '01002': 22, '01003': 12, '010055': 1, '01007': 3, '01010': 1, '01012': 1, '01013': 9, '01020': 7, '01027': 7, '01028': 5, '01030': 4, '01033': 1, '01035': 1, '01036': 1, '01038': 1, '01040': 7, '0105': 1, '01050': 1, '01056': 4, '01057': 3, '01060': 9, '01062': 5, '010624': 1, '01063': 3, '01073': 1, '01075': 6, '01077': 1, '01085': 5, '01089': 7, '01090': 2, '01095': 3, '0110

Unnamed: 0,gen_postalCode_to_4_digits
0,7871
1,
2,
3,
4,
...,...
199994,2752
199995,6200
199996,
199997,0205


In [258]:
# Validate each unique val in this col has at least 5 entries
gen_lvl_anon = level_k_anon(truncated_postal_code_df, ["gen_postalCode_to_4_digits"])
gen_lvl_anon # Anon is 1, this is not properly generalized!

1

In [271]:
# Trying again with 3 digits
truncated_postal_code = gen_df['postalCode'].str.slice(0,3).astype(str)
truncated_postal_code = truncated_postal_code.str.upper()
print(truncated_postal_code)
truncated_postal_code_df = truncated_postal_code.to_frame("gen_postalCode_to_3_digits")
gen_lvl_anon = level_k_anon(truncated_postal_code_df, ["gen_postalCode_to_3_digits"])
gen_lvl_anon # Anon is 1, this is not properly generalized!

0         787
1         NAN
2         NAN
3         NAN
4         NAN
         ... 
199994    275
199995    620
199996    NAN
199997    020
199998    NAN
Name: postalCode, Length: 199999, dtype: object


1

In [272]:
# Trying again with 2 digits
truncated_postal_code = gen_df['postalCode'].str.slice(0,2).astype(str)
truncated_postal_code = truncated_postal_code.str.upper()
print(truncated_postal_code)
truncated_postal_code_df = truncated_postal_code.to_frame("gen_postalCode_to_2_digits")
gen_lvl_anon = level_k_anon(truncated_postal_code_df, ["gen_postalCode_to_2_digits"])
gen_lvl_anon # Anon is 1, this is not properly generalized!

0          78
1         NAN
2         NAN
3         NAN
4         NAN
         ... 
199994     27
199995     62
199996    NAN
199997     02
199998    NAN
Name: postalCode, Length: 199999, dtype: object


1

In [273]:
# Trying again with 1 digit
truncated_postal_code = gen_df['postalCode'].str.slice(0,1).astype(str)
# Need upper because there seem to be errors with people typing lower?
truncated_postal_code = truncated_postal_code.str.upper()
# print(truncated_postal_code)
print(truncated_postal_code.value_counts())
truncated_postal_code_df = truncated_postal_code.to_frame("gen_postalCode_to_1_digits")
gen_lvl_anon = level_k_anon(truncated_postal_code_df, ["gen_postalCode_to_1_digits"])
gen_lvl_anon # Anon is 9, this is properly generalized!

NAN    116390
9       12036
1       10919
0       10290
2        8695
3        7538
4        6140
7        5983
6        5341
5        4568
8        4182
K         920
L         904
M         877
V         747
S         688
N         607
T         494
B         431
H         401
E         363
W         272
C         228
G         210
R         186
P         140
D         115
J         111
O          70
A          60
I          24
U          22
F          19
Y          19
X           9
Name: postalCode, dtype: int64


9

In [274]:
# Since properly generalized, add this to gen_df in place of postalcode
gen_df["gen_postalCode_to_1st_digit"] = truncated_postal_code
gen_df

Unnamed: 0,course_id,user_id,postalCode,LoE,YoB,gender,nforum_posts,nforum_votes,nforum_endorsed,nforum_threads,nforum_comments,nforum_pinned,nforum_events,gen_cc_by_ip_to_sub-region,gen_city_to_region,gen_postalCode_to_1st_digit
0,HarvardX/PH525.1x/1T2018,29940,78713,,,,0,0,0,0,0,0,0,Northern America,Americas,7
1,HarvardX/PH525.1x/1T2018,37095,,b,1991.0,m,0,0,0,0,0,0,0,Southern Asia,Asia,NAN
2,HarvardX/PH525.1x/1T2018,45634,,m,1982.0,m,0,0,0,0,0,0,0,Latin America and the Caribbean,Americas,NAN
3,HarvardX/PH525.1x/1T2018,52234,,p,1988.0,m,0,0,0,0,0,0,0,Northern Europe,,NAN
4,HarvardX/PH525.1x/1T2018,52238,,,,,0,0,0,0,0,0,0,Latin America and the Caribbean,Americas,NAN
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
199994,HarvardX/Hum3.1x/1T2016,15291085,2752,jhs,2002.0,,0,0,0,0,0,0,0,Australia and New Zealand,Oceania,2
199995,HarvardX/Hum3.1x/1T2016,15292716,620000,,,,0,0,0,0,0,0,0,Eastern Europe,Europe,6
199996,HarvardX/Hum3.1x/1T2016,15295130,,b,1996.0,f,0,0,0,0,0,0,0,Western Asia,,NAN
199997,HarvardX/Hum3.1x/1T2016,15296396,02050,,2000.0,,1,0,0,0,1,0,0,Northern America,Europe,0


In [276]:
gen_df = gen_df.drop(["postalCode"], axis = 1)
gen_df

Unnamed: 0,course_id,user_id,LoE,YoB,gender,nforum_posts,nforum_votes,nforum_endorsed,nforum_threads,nforum_comments,nforum_pinned,nforum_events,gen_cc_by_ip_to_sub-region,gen_city_to_region,gen_postalCode_to_1st_digit
0,HarvardX/PH525.1x/1T2018,29940,,,,0,0,0,0,0,0,0,Northern America,Americas,7
1,HarvardX/PH525.1x/1T2018,37095,b,1991.0,m,0,0,0,0,0,0,0,Southern Asia,Asia,NAN
2,HarvardX/PH525.1x/1T2018,45634,m,1982.0,m,0,0,0,0,0,0,0,Latin America and the Caribbean,Americas,NAN
3,HarvardX/PH525.1x/1T2018,52234,p,1988.0,m,0,0,0,0,0,0,0,Northern Europe,,NAN
4,HarvardX/PH525.1x/1T2018,52238,,,,0,0,0,0,0,0,0,Latin America and the Caribbean,Americas,NAN
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
199994,HarvardX/Hum3.1x/1T2016,15291085,jhs,2002.0,,0,0,0,0,0,0,0,Australia and New Zealand,Oceania,2
199995,HarvardX/Hum3.1x/1T2016,15292716,,,,0,0,0,0,0,0,0,Eastern Europe,Europe,6
199996,HarvardX/Hum3.1x/1T2016,15295130,b,1996.0,f,0,0,0,0,0,0,0,Western Asia,,NAN
199997,HarvardX/Hum3.1x/1T2016,15296396,,2000.0,,1,0,0,0,1,0,0,Northern America,Europe,0


4.1.4 generalize 'YoB'

In [316]:
# Split into 19 quantiles
quant_YoB_col = pd.qcut(gen_df['YoB'], q=10)
quant_YoB_col #... there seems to be a noisy data point of 512.999...

0                      NaN
1         (1989.0, 1991.0]
2         (1980.0, 1984.0]
3         (1987.0, 1989.0]
4                      NaN
                ...       
199994    (1995.0, 2018.0]
199995                 NaN
199996    (1995.0, 2018.0]
199997    (1995.0, 2018.0]
199998    (1995.0, 2018.0]
Name: YoB, Length: 199999, dtype: category
Categories (10, interval[float64, right]): [(512.999, 1966.0] < (1966.0, 1975.0] < (1975.0, 1980.0] < (1980.0, 1984.0] ... (1989.0, 1991.0] < (1991.0, 1993.0] < (1993.0, 1995.0] < (1995.0, 2018.0]]

In [317]:
# Determine level of anonymity
quant_YoB_col_df = quant_YoB_col.to_frame("gen_YoB_to_10_quantiles")
gen_lvl_anon = level_k_anon(quant_YoB_col_df, ["gen_YoB_to_10_quantiles"])
gen_lvl_anon # Anon is 13181, this is properly generalized!

13181

In [318]:
## add to dataset
gen_df["gen_YoB_to_10_quantiles"] = quant_YoB_col
gen_df = gen_df.drop(["YoB"], axis=1)
gen_df

Unnamed: 0,course_id,user_id,LoE,gender,nforum_posts,nforum_votes,nforum_endorsed,nforum_threads,nforum_comments,nforum_pinned,nforum_events,gen_cc_by_ip_to_sub-region,gen_city_to_region,gen_postalCode_to_1st_digit,gen_YoB_to_4_quantiles,gen_YoB_to_10_quantiles
0,HarvardX/PH525.1x/1T2018,29940,,,0,0,0,0,0,0,0,Northern America,Americas,7,,
1,HarvardX/PH525.1x/1T2018,37095,b,m,0,0,0,0,0,0,0,Southern Asia,Asia,NAN,"(1990.0, 1992.0]","(1989.0, 1991.0]"
2,HarvardX/PH525.1x/1T2018,45634,m,m,0,0,0,0,0,0,0,Latin America and the Caribbean,Americas,NAN,"(1979.0, 1982.0]","(1980.0, 1984.0]"
3,HarvardX/PH525.1x/1T2018,52234,p,m,0,0,0,0,0,0,0,Northern Europe,,NAN,"(1987.0, 1989.0]","(1987.0, 1989.0]"
4,HarvardX/PH525.1x/1T2018,52238,,,0,0,0,0,0,0,0,Latin America and the Caribbean,Americas,NAN,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
199994,HarvardX/Hum3.1x/1T2016,15291085,jhs,,0,0,0,0,0,0,0,Australia and New Zealand,Oceania,2,"(1997.0, 2018.0]","(1995.0, 2018.0]"
199995,HarvardX/Hum3.1x/1T2016,15292716,,,0,0,0,0,0,0,0,Eastern Europe,Europe,6,,
199996,HarvardX/Hum3.1x/1T2016,15295130,b,f,0,0,0,0,0,0,0,Western Asia,,NAN,"(1994.0, 1997.0]","(1995.0, 2018.0]"
199997,HarvardX/Hum3.1x/1T2016,15296396,,,1,0,0,0,1,0,0,Northern America,Europe,0,"(1997.0, 2018.0]","(1995.0, 2018.0]"


4.1.5 gneralize 'nforum_posts'

In [298]:
# Split into 5 quantiles
nforum_posts = gen_df['nforum_posts']
# Reasonable splits seem to be 0, 1-3, 4-9, 10-19, 20-
print(nforum_posts.value_counts().to_dict())
# quantile_col = pd.qcut(gen_df['nforum_posts'], q=2, duplicates='drop')]
posts_quantile_col = pd.qcut(gen_df['nforum_posts'],
                              q=[0, 0.97, 0.98, 0.99, 0.995, 0.9975, 1])
posts_quantile_col

{0: 192059, 1: 3228, 2: 1143, 3: 662, 4: 482, 5: 300, 6: 244, 7: 214, 8: 151, 9: 125, 10: 102, 11: 97, 12: 92, 13: 75, 15: 61, 16: 60, 14: 57, 19: 54, 20: 43, 18: 41, 17: 38, 22: 36, 24: 32, 25: 29, 21: 25, 27: 24, 33: 24, 23: 23, 28: 23, 29: 22, 26: 21, 36: 21, 37: 20, 38: 20, 48: 18, 32: 18, 34: 17, 30: 17, 49: 17, 31: 16, 35: 16, 44: 15, 47: 15, 40: 13, 41: 11, 50: 10, 43: 10, 39: 9, 51: 8, 46: 8, 52: 6, 45: 6, 55: 6, 53: 6, 54: 5, 42: 5, 60: 5, 68: 5, 67: 4, 57: 4, 56: 4, 66: 4, 76: 4, 58: 3, 61: 3, 69: 3, 75: 3, 82: 3, 64: 3, 74: 2, 72: 2, 92: 2, 70: 2, 94: 2, 85: 2, 110: 2, 91: 2, 99: 2, 81: 2, 88: 1, 77: 1, 116: 1, 136: 1, 363: 1, 95: 1, 86: 1, 130: 1, 80: 1, 101: 1, 100: 1, 237: 1, 465: 1, 65: 1, 87: 1, 122: 1, 106: 1, 299: 1, 71: 1, 90: 1, 111: 1, 198: 1, 117: 1, 285: 1, 125: 1, 150: 1, 62: 1, 73: 1, 83: 1, 153: 1, 142: 1}


0         (-0.001, 1.0]
1         (-0.001, 1.0]
2         (-0.001, 1.0]
3         (-0.001, 1.0]
4         (-0.001, 1.0]
              ...      
199994    (-0.001, 1.0]
199995    (-0.001, 1.0]
199996    (-0.001, 1.0]
199997    (-0.001, 1.0]
199998    (-0.001, 1.0]
Name: nforum_posts, Length: 199999, dtype: category
Categories (6, interval[float64, right]): [(-0.001, 1.0] < (1.0, 2.0] < (2.0, 6.0] < (6.0, 14.0] < (14.0, 27.0] < (27.0, 465.0]]

In [299]:
# Determine level of anonymity
posts_quantile_df = posts_quantile_col.to_frame("gen_nforum_posts_to_5_quantiles")
gen_lvl_anon = level_k_anon(posts_quantile_df, ["gen_nforum_posts_to_5_quantiles"])
gen_lvl_anon # Anon is 481, this is properly generalized!

481

In [319]:
# Add to dataset
gen_df["gen_nforum_posts_to_5_quantiles"] = posts_quantile_col
gen_df = gen_df.drop(["nforum_posts"], axis=1)
gen_df

Unnamed: 0,course_id,user_id,LoE,gender,nforum_votes,nforum_endorsed,nforum_threads,nforum_comments,nforum_pinned,nforum_events,gen_cc_by_ip_to_sub-region,gen_city_to_region,gen_postalCode_to_1st_digit,gen_YoB_to_4_quantiles,gen_YoB_to_10_quantiles,gen_nforum_posts_to_5_quantiles
0,HarvardX/PH525.1x/1T2018,29940,,,0,0,0,0,0,0,Northern America,Americas,7,,,"(-0.001, 1.0]"
1,HarvardX/PH525.1x/1T2018,37095,b,m,0,0,0,0,0,0,Southern Asia,Asia,NAN,"(1990.0, 1992.0]","(1989.0, 1991.0]","(-0.001, 1.0]"
2,HarvardX/PH525.1x/1T2018,45634,m,m,0,0,0,0,0,0,Latin America and the Caribbean,Americas,NAN,"(1979.0, 1982.0]","(1980.0, 1984.0]","(-0.001, 1.0]"
3,HarvardX/PH525.1x/1T2018,52234,p,m,0,0,0,0,0,0,Northern Europe,,NAN,"(1987.0, 1989.0]","(1987.0, 1989.0]","(-0.001, 1.0]"
4,HarvardX/PH525.1x/1T2018,52238,,,0,0,0,0,0,0,Latin America and the Caribbean,Americas,NAN,,,"(-0.001, 1.0]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
199994,HarvardX/Hum3.1x/1T2016,15291085,jhs,,0,0,0,0,0,0,Australia and New Zealand,Oceania,2,"(1997.0, 2018.0]","(1995.0, 2018.0]","(-0.001, 1.0]"
199995,HarvardX/Hum3.1x/1T2016,15292716,,,0,0,0,0,0,0,Eastern Europe,Europe,6,,,"(-0.001, 1.0]"
199996,HarvardX/Hum3.1x/1T2016,15295130,b,f,0,0,0,0,0,0,Western Asia,,NAN,"(1994.0, 1997.0]","(1995.0, 2018.0]","(-0.001, 1.0]"
199997,HarvardX/Hum3.1x/1T2016,15296396,,,0,0,0,1,0,0,Northern America,Europe,0,"(1997.0, 2018.0]","(1995.0, 2018.0]","(-0.001, 1.0]"


4.1.6: nforum_votes

In [323]:
# Split into quantiles
nforum_votes = gen_df['nforum_votes']
# Reasonable splits seem to be 0, 1-3, 4-9, 10-19, 20-
print(nforum_votes.value_counts().to_dict())
# quantile_col = pd.qcut(gen_df['nforum_posts'], q=2, duplicates='drop')]
votes_quantile_col = pd.qcut(gen_df['nforum_votes'], 
                              q=[0, 0.97, 0.98, 0.99, 0.995, 0.9975, 1],
                              duplicates='drop')
votes_quantile_col

{0: 197690, 1: 973, 2: 377, 3: 224, 4: 136, 5: 89, 6: 67, 7: 66, 8: 44, 9: 30, 10: 28, 11: 23, 12: 21, 13: 17, 16: 16, 17: 14, 15: 14, 14: 12, 20: 10, 18: 10, 19: 10, 23: 8, 21: 6, 22: 6, 38: 4, 34: 4, 32: 4, 27: 4, 25: 4, 42: 4, 24: 3, 33: 3, 28: 3, 55: 3, 35: 3, 63: 2, 30: 2, 39: 2, 62: 2, 69: 2, 46: 2, 51: 2, 26: 2, 29: 2, 139: 2, 37: 2, 41: 2, 43: 2, 40: 1, 31: 1, 116: 1, 193: 1, 61: 1, 53: 1, 86: 1, 81: 1, 49: 1, 72: 1, 36: 1, 84: 1, 140: 1, 45: 1, 56: 1, 65: 1, 157: 1, 136: 1, 135: 1, 636: 1, 67: 1, 177: 1, 57: 1, 138: 1, 121: 1, 224: 1, 74: 1, 50: 1, 70: 1, 47: 1, 96: 1, 113: 1, 102: 1, 164: 1, 107: 1, 158: 1, 148: 1, 123: 1, 79: 1, 94: 1, 111: 1, 200: 1, 64: 1}


0         (-0.001, 1.0]
1         (-0.001, 1.0]
2         (-0.001, 1.0]
3         (-0.001, 1.0]
4         (-0.001, 1.0]
              ...      
199994    (-0.001, 1.0]
199995    (-0.001, 1.0]
199996    (-0.001, 1.0]
199997    (-0.001, 1.0]
199998    (-0.001, 1.0]
Name: nforum_votes, Length: 199999, dtype: category
Categories (4, interval[float64, right]): [(-0.001, 1.0] < (1.0, 2.0] < (2.0, 6.0] < (6.0, 636.0]]

In [324]:
# Determine level of anonymity
votes_quantile_df = votes_quantile_col.to_frame("gen_nforum_votes_to_4_quantiles")
gen_lvl_anon = level_k_anon(votes_quantile_df, ["gen_nforum_votes_to_4_quantiles"])
gen_lvl_anon # Anon is 337, this is properly generalized!

377

In [325]:
# Add to dataset
gen_df["gen_nforum_votes_to_4_quantiles"] = votes_quantile_col
gen_df = gen_df.drop(["nforum_votes"], axis=1)
gen_df

Unnamed: 0,course_id,user_id,LoE,gender,nforum_endorsed,nforum_threads,nforum_comments,nforum_pinned,nforum_events,gen_cc_by_ip_to_sub-region,gen_city_to_region,gen_postalCode_to_1st_digit,gen_YoB_to_4_quantiles,gen_YoB_to_10_quantiles,gen_nforum_posts_to_5_quantiles,gen_nforum_votes_to_4_quantiles
0,HarvardX/PH525.1x/1T2018,29940,,,0,0,0,0,0,Northern America,Americas,7,,,"(-0.001, 1.0]","(-0.001, 1.0]"
1,HarvardX/PH525.1x/1T2018,37095,b,m,0,0,0,0,0,Southern Asia,Asia,NAN,"(1990.0, 1992.0]","(1989.0, 1991.0]","(-0.001, 1.0]","(-0.001, 1.0]"
2,HarvardX/PH525.1x/1T2018,45634,m,m,0,0,0,0,0,Latin America and the Caribbean,Americas,NAN,"(1979.0, 1982.0]","(1980.0, 1984.0]","(-0.001, 1.0]","(-0.001, 1.0]"
3,HarvardX/PH525.1x/1T2018,52234,p,m,0,0,0,0,0,Northern Europe,,NAN,"(1987.0, 1989.0]","(1987.0, 1989.0]","(-0.001, 1.0]","(-0.001, 1.0]"
4,HarvardX/PH525.1x/1T2018,52238,,,0,0,0,0,0,Latin America and the Caribbean,Americas,NAN,,,"(-0.001, 1.0]","(-0.001, 1.0]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
199994,HarvardX/Hum3.1x/1T2016,15291085,jhs,,0,0,0,0,0,Australia and New Zealand,Oceania,2,"(1997.0, 2018.0]","(1995.0, 2018.0]","(-0.001, 1.0]","(-0.001, 1.0]"
199995,HarvardX/Hum3.1x/1T2016,15292716,,,0,0,0,0,0,Eastern Europe,Europe,6,,,"(-0.001, 1.0]","(-0.001, 1.0]"
199996,HarvardX/Hum3.1x/1T2016,15295130,b,f,0,0,0,0,0,Western Asia,,NAN,"(1994.0, 1997.0]","(1995.0, 2018.0]","(-0.001, 1.0]","(-0.001, 1.0]"
199997,HarvardX/Hum3.1x/1T2016,15296396,,,0,0,1,0,0,Northern America,Europe,0,"(1997.0, 2018.0]","(1995.0, 2018.0]","(-0.001, 1.0]","(-0.001, 1.0]"


4.1.7: generalize nforum_endorsed

In [331]:
# Split into quantiles
nforum_endorsed = gen_df['nforum_endorsed']
# Reasonable splits seem to be 0, 1-3, 4-9, 10-19, 20-
print(nforum_endorsed.value_counts().to_dict())
# quantile_col = pd.qcut(gen_df['nforum_posts'], q=2, duplicates='drop')]
endorsed_quantile_col = pd.qcut(gen_df['nforum_endorsed'], 
                              q=[0, 0.99982, 1],
                              duplicates='drop')
endorsed_quantile_col

{0: 199904, 1: 59, 2: 18, 3: 8, 4: 4, 8: 1, 5: 1, 14: 1, 7: 1, 46: 1, 6: 1}


0         (-0.001, 1.0]
1         (-0.001, 1.0]
2         (-0.001, 1.0]
3         (-0.001, 1.0]
4         (-0.001, 1.0]
              ...      
199994    (-0.001, 1.0]
199995    (-0.001, 1.0]
199996    (-0.001, 1.0]
199997    (-0.001, 1.0]
199998    (-0.001, 1.0]
Name: nforum_endorsed, Length: 199999, dtype: category
Categories (2, interval[float64, right]): [(-0.001, 1.0] < (1.0, 46.0]]

In [332]:
# Determine level of anonymity
endorsed_quantile_col_df = endorsed_quantile_col.to_frame("gen_nforum_votes_to_2_quantiles")
gen_lvl_anon = level_k_anon(endorsed_quantile_col_df, ["gen_nforum_votes_to_2_quantiles"])
gen_lvl_anon # Anon is 36, this is properly generalized!

36

In [333]:
# Add to dataset
gen_df["gen_nforum_endorsed_to_2_quantiles"] = endorsed_quantile_col
gen_df = gen_df.drop(["nforum_endorsed"], axis=1)
gen_df

Unnamed: 0,course_id,user_id,LoE,gender,nforum_threads,nforum_comments,nforum_pinned,nforum_events,gen_cc_by_ip_to_sub-region,gen_city_to_region,gen_postalCode_to_1st_digit,gen_YoB_to_4_quantiles,gen_YoB_to_10_quantiles,gen_nforum_posts_to_5_quantiles,gen_nforum_votes_to_4_quantiles,gen_nforum_endorsed_to_2_quantiles
0,HarvardX/PH525.1x/1T2018,29940,,,0,0,0,0,Northern America,Americas,7,,,"(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
1,HarvardX/PH525.1x/1T2018,37095,b,m,0,0,0,0,Southern Asia,Asia,NAN,"(1990.0, 1992.0]","(1989.0, 1991.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
2,HarvardX/PH525.1x/1T2018,45634,m,m,0,0,0,0,Latin America and the Caribbean,Americas,NAN,"(1979.0, 1982.0]","(1980.0, 1984.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
3,HarvardX/PH525.1x/1T2018,52234,p,m,0,0,0,0,Northern Europe,,NAN,"(1987.0, 1989.0]","(1987.0, 1989.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
4,HarvardX/PH525.1x/1T2018,52238,,,0,0,0,0,Latin America and the Caribbean,Americas,NAN,,,"(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
199994,HarvardX/Hum3.1x/1T2016,15291085,jhs,,0,0,0,0,Australia and New Zealand,Oceania,2,"(1997.0, 2018.0]","(1995.0, 2018.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
199995,HarvardX/Hum3.1x/1T2016,15292716,,,0,0,0,0,Eastern Europe,Europe,6,,,"(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
199996,HarvardX/Hum3.1x/1T2016,15295130,b,f,0,0,0,0,Western Asia,,NAN,"(1994.0, 1997.0]","(1995.0, 2018.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
199997,HarvardX/Hum3.1x/1T2016,15296396,,,0,1,0,0,Northern America,Europe,0,"(1997.0, 2018.0]","(1995.0, 2018.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"


4.1.8: nforum_threads

In [339]:
# Split into quantiles
nforum_threads = gen_df['nforum_threads']
# Reasonable splits seem to be 0, 1-3, 4-9, 10-19, 20-
print(nforum_threads.value_counts().to_dict())
# quantile_col = pd.qcut(gen_df['nforum_posts'], q=2, duplicates='drop')]
nforum_threads_col = pd.qcut(gen_df['nforum_threads'], 
                              q=[0, 0.98, 1],
                              duplicates='drop')
# Decisions often come to splitting between 0 versus above 1 for forum activity, etc
nforum_threads_col

{0: 195576, 1: 2306, 2: 673, 3: 368, 4: 213, 5: 151, 6: 117, 7: 83, 8: 72, 9: 52, 10: 40, 12: 37, 11: 32, 13: 31, 16: 21, 15: 19, 21: 18, 14: 17, 17: 16, 20: 16, 19: 14, 18: 13, 24: 13, 23: 12, 22: 8, 28: 6, 48: 6, 30: 6, 31: 5, 25: 4, 36: 4, 43: 4, 26: 4, 44: 4, 46: 3, 37: 3, 47: 3, 27: 3, 29: 3, 39: 3, 38: 2, 45: 2, 34: 2, 49: 2, 35: 2, 40: 1, 33: 1, 80: 1, 41: 1, 54: 1, 71: 1, 142: 1, 61: 1, 50: 1, 42: 1}


0         (-0.001, 1.0]
1         (-0.001, 1.0]
2         (-0.001, 1.0]
3         (-0.001, 1.0]
4         (-0.001, 1.0]
              ...      
199994    (-0.001, 1.0]
199995    (-0.001, 1.0]
199996    (-0.001, 1.0]
199997    (-0.001, 1.0]
199998    (-0.001, 1.0]
Name: nforum_threads, Length: 199999, dtype: category
Categories (2, interval[float64, right]): [(-0.001, 1.0] < (1.0, 142.0]]

In [341]:
# Determine level of anonymity
nforum_threads_df = nforum_threads_col.to_frame("gen_nforum_threads_to_2_quantiles")
gen_lvl_anon = level_k_anon(nforum_threads_df, ["gen_nforum_threads_to_2_quantiles"])
gen_lvl_anon # Anon is 2117, this is properly generalized!

2117

In [342]:
# Add to dataset
gen_df["gen_nforum_threads_to_2_quantiles"] = nforum_threads_df
gen_df = gen_df.drop(["nforum_threads"], axis=1)
gen_df

Unnamed: 0,course_id,user_id,LoE,gender,nforum_comments,nforum_pinned,nforum_events,gen_cc_by_ip_to_sub-region,gen_city_to_region,gen_postalCode_to_1st_digit,gen_YoB_to_4_quantiles,gen_YoB_to_10_quantiles,gen_nforum_posts_to_5_quantiles,gen_nforum_votes_to_4_quantiles,gen_nforum_endorsed_to_2_quantiles,gen_nforum_threads_to_2_quantiles
0,HarvardX/PH525.1x/1T2018,29940,,,0,0,0,Northern America,Americas,7,,,"(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
1,HarvardX/PH525.1x/1T2018,37095,b,m,0,0,0,Southern Asia,Asia,NAN,"(1990.0, 1992.0]","(1989.0, 1991.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
2,HarvardX/PH525.1x/1T2018,45634,m,m,0,0,0,Latin America and the Caribbean,Americas,NAN,"(1979.0, 1982.0]","(1980.0, 1984.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
3,HarvardX/PH525.1x/1T2018,52234,p,m,0,0,0,Northern Europe,,NAN,"(1987.0, 1989.0]","(1987.0, 1989.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
4,HarvardX/PH525.1x/1T2018,52238,,,0,0,0,Latin America and the Caribbean,Americas,NAN,,,"(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
199994,HarvardX/Hum3.1x/1T2016,15291085,jhs,,0,0,0,Australia and New Zealand,Oceania,2,"(1997.0, 2018.0]","(1995.0, 2018.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
199995,HarvardX/Hum3.1x/1T2016,15292716,,,0,0,0,Eastern Europe,Europe,6,,,"(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
199996,HarvardX/Hum3.1x/1T2016,15295130,b,f,0,0,0,Western Asia,,NAN,"(1994.0, 1997.0]","(1995.0, 2018.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
199997,HarvardX/Hum3.1x/1T2016,15296396,,,1,0,0,Northern America,Europe,0,"(1997.0, 2018.0]","(1995.0, 2018.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"


4.1.9: nforum_comments

In [343]:
# Split into quantiles
nforum_comments = gen_df['nforum_comments']
# Reasonable splits seem to be 0, 1-3, 4-9, 10-19, 20-
print(nforum_comments.value_counts().to_dict())
# quantile_col = pd.qcut(gen_df['nforum_posts'], q=2, duplicates='drop')]
nforum_comments_col = pd.qcut(gen_df['nforum_comments'], 
                              q=[0, 0.98, 1],
                              duplicates='drop')
# Decisions often come to splitting between 0 versus above 1 for forum activity, etc
nforum_comments_col

{0: 194537, 1: 2309, 2: 754, 3: 471, 4: 306, 5: 219, 6: 157, 7: 144, 8: 98, 9: 74, 10: 62, 11: 61, 13: 49, 12: 48, 15: 41, 16: 38, 19: 31, 20: 30, 18: 29, 17: 27, 22: 25, 28: 25, 48: 25, 14: 22, 27: 21, 36: 19, 24: 18, 25: 18, 35: 17, 21: 17, 26: 16, 23: 16, 37: 15, 32: 15, 33: 15, 49: 13, 47: 13, 30: 13, 34: 12, 40: 12, 29: 10, 31: 10, 44: 10, 38: 8, 52: 7, 43: 7, 41: 7, 66: 6, 53: 6, 39: 5, 60: 5, 51: 4, 45: 4, 42: 4, 55: 4, 46: 4, 54: 4, 56: 3, 50: 3, 62: 2, 67: 2, 91: 2, 100: 2, 90: 2, 64: 2, 82: 2, 70: 2, 92: 2, 57: 2, 81: 2, 63: 1, 103: 1, 80: 1, 76: 1, 220: 1, 65: 1, 99: 1, 335: 1, 58: 1, 88: 1, 61: 1, 116: 1, 110: 1, 136: 1, 71: 1, 281: 1, 74: 1, 87: 1, 121: 1, 157: 1, 59: 1, 72: 1, 444: 1, 198: 1, 117: 1, 150: 1, 77: 1, 75: 1, 94: 1, 97: 1, 68: 1, 83: 1, 153: 1, 135: 1}


0         (-0.001, 1.0]
1         (-0.001, 1.0]
2         (-0.001, 1.0]
3         (-0.001, 1.0]
4         (-0.001, 1.0]
              ...      
199994    (-0.001, 1.0]
199995    (-0.001, 1.0]
199996    (-0.001, 1.0]
199997    (-0.001, 1.0]
199998    (-0.001, 1.0]
Name: nforum_comments, Length: 199999, dtype: category
Categories (2, interval[float64, right]): [(-0.001, 1.0] < (1.0, 444.0]]

In [344]:
# Determine level of anonymity
nforum_comments_df = nforum_comments_col.to_frame("gen_nforum_comments_to_2_quantiles")
gen_lvl_anon = level_k_anon(nforum_comments_df, ["gen_nforum_comments_to_2_quantiles"])
gen_lvl_anon # Anon is 3153, this is properly generalized!

3153

In [345]:
# Add to dataset
gen_df["gen_nforum_comments_to_2_quantiles"] = nforum_threads_df
gen_df = gen_df.drop(["nforum_comments"], axis=1)
gen_df

Unnamed: 0,course_id,user_id,LoE,gender,nforum_pinned,nforum_events,gen_cc_by_ip_to_sub-region,gen_city_to_region,gen_postalCode_to_1st_digit,gen_YoB_to_4_quantiles,gen_YoB_to_10_quantiles,gen_nforum_posts_to_5_quantiles,gen_nforum_votes_to_4_quantiles,gen_nforum_endorsed_to_2_quantiles,gen_nforum_threads_to_2_quantiles,gen_nforum_comments_to_2_quantiles
0,HarvardX/PH525.1x/1T2018,29940,,,0,0,Northern America,Americas,7,,,"(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
1,HarvardX/PH525.1x/1T2018,37095,b,m,0,0,Southern Asia,Asia,NAN,"(1990.0, 1992.0]","(1989.0, 1991.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
2,HarvardX/PH525.1x/1T2018,45634,m,m,0,0,Latin America and the Caribbean,Americas,NAN,"(1979.0, 1982.0]","(1980.0, 1984.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
3,HarvardX/PH525.1x/1T2018,52234,p,m,0,0,Northern Europe,,NAN,"(1987.0, 1989.0]","(1987.0, 1989.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
4,HarvardX/PH525.1x/1T2018,52238,,,0,0,Latin America and the Caribbean,Americas,NAN,,,"(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
199994,HarvardX/Hum3.1x/1T2016,15291085,jhs,,0,0,Australia and New Zealand,Oceania,2,"(1997.0, 2018.0]","(1995.0, 2018.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
199995,HarvardX/Hum3.1x/1T2016,15292716,,,0,0,Eastern Europe,Europe,6,,,"(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
199996,HarvardX/Hum3.1x/1T2016,15295130,b,f,0,0,Western Asia,,NAN,"(1994.0, 1997.0]","(1995.0, 2018.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
199997,HarvardX/Hum3.1x/1T2016,15296396,,,0,0,Northern America,Europe,0,"(1997.0, 2018.0]","(1995.0, 2018.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"


4.1.10: nforum_pinned

In [354]:
# Split into quantiles
nforum_pinned = gen_df['nforum_pinned']
# Reasonable splits seem to be 0, 1-3, 4-9, 10-19, 20-
print(nforum_pinned.value_counts().to_dict())
# quantile_col = pd.qcut(gen_df['nforum_posts'], q=2, duplicates='drop')]
nforum_pinned_col = pd.qcut(gen_df['nforum_pinned'], 
                              q=[0, 0.99995, 1],
                              duplicates='drop')
# Decisions often come to splitting between 0 versus above 1 for forum activity, etc
nforum_pinned_col

{0: 199987, 1: 5, 6: 1, 7: 1, 3: 1, 4: 1, 2: 1, 18: 1, 16: 1}


0         (-0.001, 1.0]
1         (-0.001, 1.0]
2         (-0.001, 1.0]
3         (-0.001, 1.0]
4         (-0.001, 1.0]
              ...      
199994    (-0.001, 1.0]
199995    (-0.001, 1.0]
199996    (-0.001, 1.0]
199997    (-0.001, 1.0]
199998    (-0.001, 1.0]
Name: nforum_pinned, Length: 199999, dtype: category
Categories (2, interval[float64, right]): [(-0.001, 1.0] < (1.0, 18.0]]

In [355]:
# Determine level of anonymity
nforum_pinned_df = nforum_pinned_col.to_frame("gen_nforum_pinned")
gen_lvl_anon = level_k_anon(nforum_pinned_df, ["gen_nforum_pinned"])
gen_lvl_anon # Anon is 7, this is properly generalized!

7

In [356]:
# Add to dataset
gen_df["gen_nforum_pinned"] = nforum_pinned_df
gen_df = gen_df.drop(["nforum_pinned"], axis=1)
gen_df

Unnamed: 0,course_id,user_id,LoE,gender,nforum_events,gen_cc_by_ip_to_sub-region,gen_city_to_region,gen_postalCode_to_1st_digit,gen_YoB_to_4_quantiles,gen_YoB_to_10_quantiles,gen_nforum_posts_to_5_quantiles,gen_nforum_votes_to_4_quantiles,gen_nforum_endorsed_to_2_quantiles,gen_nforum_threads_to_2_quantiles,gen_nforum_comments_to_2_quantiles,gen_nforum_pinned
0,HarvardX/PH525.1x/1T2018,29940,,,0,Northern America,Americas,7,,,"(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
1,HarvardX/PH525.1x/1T2018,37095,b,m,0,Southern Asia,Asia,NAN,"(1990.0, 1992.0]","(1989.0, 1991.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
2,HarvardX/PH525.1x/1T2018,45634,m,m,0,Latin America and the Caribbean,Americas,NAN,"(1979.0, 1982.0]","(1980.0, 1984.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
3,HarvardX/PH525.1x/1T2018,52234,p,m,0,Northern Europe,,NAN,"(1987.0, 1989.0]","(1987.0, 1989.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
4,HarvardX/PH525.1x/1T2018,52238,,,0,Latin America and the Caribbean,Americas,NAN,,,"(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
199994,HarvardX/Hum3.1x/1T2016,15291085,jhs,,0,Australia and New Zealand,Oceania,2,"(1997.0, 2018.0]","(1995.0, 2018.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
199995,HarvardX/Hum3.1x/1T2016,15292716,,,0,Eastern Europe,Europe,6,,,"(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
199996,HarvardX/Hum3.1x/1T2016,15295130,b,f,0,Western Asia,,NAN,"(1994.0, 1997.0]","(1995.0, 2018.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
199997,HarvardX/Hum3.1x/1T2016,15296396,,,0,Northern America,Europe,0,"(1997.0, 2018.0]","(1995.0, 2018.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"


4.1.11: nforum_events

In [363]:
# Split into quantiles
nforum_events = gen_df['nforum_events']
print(nforum_events.value_counts().to_dict())
# quantile_col = pd.qcut(gen_df['nforum_posts'], q=2, duplicates='drop')]
nforum_events_col = pd.qcut(gen_df['nforum_events'], 
                              q=[0, 0.93, 0.965, 1],
                              duplicates='drop')
# Decisions often come to splitting between 0 versus above 1 for forum activity, etc
nforum_events_col

{0: 184430, 1: 3114, 2: 1846, 3: 1049, 4: 861, 5: 616, 6: 582, 7: 453, 8: 430, 9: 322, 10: 289, 11: 280, 12: 243, 13: 226, 14: 192, 15: 183, 16: 168, 18: 142, 19: 138, 17: 137, 20: 122, 21: 116, 23: 103, 22: 101, 27: 86, 24: 86, 26: 79, 28: 78, 25: 77, 31: 77, 37: 68, 29: 67, 32: 67, 33: 64, 36: 64, 35: 64, 34: 62, 30: 59, 40: 57, 53: 46, 39: 46, 44: 45, 43: 43, 46: 42, 42: 41, 41: 41, 38: 39, 48: 36, 57: 36, 52: 34, 59: 34, 45: 33, 62: 33, 58: 33, 51: 31, 47: 31, 50: 30, 61: 29, 49: 28, 68: 27, 65: 26, 56: 26, 66: 25, 54: 24, 63: 24, 64: 23, 60: 23, 73: 22, 74: 20, 80: 20, 55: 20, 71: 19, 67: 19, 90: 19, 86: 19, 88: 18, 93: 17, 111: 17, 79: 17, 85: 16, 122: 16, 96: 16, 109: 16, 70: 15, 76: 14, 102: 14, 91: 14, 69: 14, 77: 14, 95: 14, 72: 14, 98: 13, 94: 13, 75: 12, 78: 12, 106: 12, 105: 12, 143: 12, 100: 12, 150: 12, 127: 12, 116: 12, 82: 12, 125: 12, 101: 11, 141: 11, 139: 11, 81: 11, 97: 11, 120: 11, 158: 10, 89: 10, 132: 10, 133: 10, 83: 10, 104: 10, 153: 10, 108: 10, 129: 10, 119:

0         (-0.001, 1.0]
1         (-0.001, 1.0]
2         (-0.001, 1.0]
3         (-0.001, 1.0]
4         (-0.001, 1.0]
              ...      
199994    (-0.001, 1.0]
199995    (-0.001, 1.0]
199996    (-0.001, 1.0]
199997    (-0.001, 1.0]
199998    (-0.001, 1.0]
Name: nforum_events, Length: 199999, dtype: category
Categories (3, interval[float64, right]): [(-0.001, 1.0] < (1.0, 8.0] < (8.0, 9192.0]]

In [364]:
# Determine level of anonymity
nforum_events_df = nforum_events_col.to_frame("gen_nforum_events")
gen_lvl_anon = level_k_anon(nforum_events_df, ["gen_nforum_events"])
gen_lvl_anon # Anon is 5837, this is properly generalized!

5837

In [365]:
# Add to dataset
gen_df["gen_nforum_events"] = nforum_threads_df
gen_df = gen_df.drop(["nforum_events"], axis=1)
gen_df

Unnamed: 0,course_id,user_id,LoE,gender,gen_cc_by_ip_to_sub-region,gen_city_to_region,gen_postalCode_to_1st_digit,gen_YoB_to_4_quantiles,gen_YoB_to_10_quantiles,gen_nforum_posts_to_5_quantiles,gen_nforum_votes_to_4_quantiles,gen_nforum_endorsed_to_2_quantiles,gen_nforum_threads_to_2_quantiles,gen_nforum_comments_to_2_quantiles,gen_nforum_pinned,gen_nforum_events
0,HarvardX/PH525.1x/1T2018,29940,,,Northern America,Americas,7,,,"(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
1,HarvardX/PH525.1x/1T2018,37095,b,m,Southern Asia,Asia,NAN,"(1990.0, 1992.0]","(1989.0, 1991.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
2,HarvardX/PH525.1x/1T2018,45634,m,m,Latin America and the Caribbean,Americas,NAN,"(1979.0, 1982.0]","(1980.0, 1984.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
3,HarvardX/PH525.1x/1T2018,52234,p,m,Northern Europe,,NAN,"(1987.0, 1989.0]","(1987.0, 1989.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
4,HarvardX/PH525.1x/1T2018,52238,,,Latin America and the Caribbean,Americas,NAN,,,"(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
199994,HarvardX/Hum3.1x/1T2016,15291085,jhs,,Australia and New Zealand,Oceania,2,"(1997.0, 2018.0]","(1995.0, 2018.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
199995,HarvardX/Hum3.1x/1T2016,15292716,,,Eastern Europe,Europe,6,,,"(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
199996,HarvardX/Hum3.1x/1T2016,15295130,b,f,Western Asia,,NAN,"(1994.0, 1997.0]","(1995.0, 2018.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
199997,HarvardX/Hum3.1x/1T2016,15296396,,,Northern America,Europe,0,"(1997.0, 2018.0]","(1995.0, 2018.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"


4.1.12: LoE (generalizing this over gender as there are more categories in the dataset)
... note that this is choosing to reduce the variance of the data
* needed to generalize this because found that even with all 11 prior cols dropped, this one needed to be dropped too to bring to 5-anonymous

In [375]:
# Get value counts
LoE = gen_df['LoE']
LoE_dict = LoE.value_counts().to_dict()
print(LoE_dict)

{'b': 59422, 'm': 48204, 'hs': 35706, 'p': 9114, 'a': 8224, 'jhs': 4542, 'other': 3721, 'p_se': 1105, 'p_oth': 732, 'el': 703, 'none': 535}


In [376]:
# Reasonable splits seem to be pre-high school, college, post-college; other/none
# pre-high-school: 'jhs', 'el'
# high-school: 'hs'
# college: 'b', 'a'
# post college: 'm', 'p', 'p_se', 'p_oth'
# other or none: 'other', 'none' ... this seems most reasonable... would there be anyone with none? but also adding noise through this method
# when do my own way, I'd drop none
# LoE_df = LoE.to_frame("LoE")
# print(LoE_df)
LoE = LoE.str.replace('p', 'post-college')
LoE = LoE.str.replace('jhs', 'pre-hs')
LoE = LoE.str.replace('el', 'pre-hs')
LoE = LoE.str.replace('b', 'college')
LoE = LoE.str.replace('a', 'college')
LoE = LoE.str.replace('m', 'post-college')
LoE = LoE.str.replace('p_se', 'post-college')
LoE = LoE.str.replace('p_oth', 'post-college')
# LoE = LoE.str.replace('other', 'other')
LoE = LoE.str.replace('none', 'other')
LoE
# LoE = np.where((LoE =='jhs'), 'pre-hs', LoE)
LoE

0                  NaN
1              college
2         post-college
3         post-college
4                  NaN
              ...     
199994          pre-hs
199995             NaN
199996         college
199997             NaN
199998             NaN
Name: LoE, Length: 199999, dtype: object

In [377]:
# Determine level of anonymity
LoE_df = LoE.to_frame("gen_LoE")
gen_lvl_anon = level_k_anon(LoE_df, ["gen_LoE"])
gen_lvl_anon # Anon is 732, this is properly generalized!

732

In [None]:
# Add to df
gen_df["gen_LoE"] = LoE
gen_df.drop(["LoE"],axis=1)

4.2 Step 2: check that generalization resulted in 5-anonymous dataframe

In [387]:
# gen_df = gen_df.drop(["gen_YoB_to_4_quantiles"], axis=1)

In [397]:
gen_df

Unnamed: 0,LoE,gender,gen_cc_by_ip_to_sub-region,gen_city_to_region,gen_postalCode_to_1st_digit,gen_YoB_to_10_quantiles,gen_nforum_posts_to_5_quantiles,gen_nforum_votes_to_4_quantiles,gen_nforum_endorsed_to_2_quantiles,gen_nforum_threads_to_2_quantiles,gen_nforum_comments_to_2_quantiles,gen_nforum_pinned,gen_nforum_events
0,,,Northern America,Americas,7,,"(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
1,b,m,Southern Asia,Asia,NAN,"(1989.0, 1991.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
2,m,m,Latin America and the Caribbean,Americas,NAN,"(1980.0, 1984.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
3,p,m,Northern Europe,,NAN,"(1987.0, 1989.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
4,,,Latin America and the Caribbean,Americas,NAN,,"(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...
199994,jhs,,Australia and New Zealand,Oceania,2,"(1995.0, 2018.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
199995,,,Eastern Europe,Europe,6,,"(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
199996,b,f,Western Asia,,NAN,"(1995.0, 2018.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
199997,,,Northern America,Europe,0,"(1995.0, 2018.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"


In [398]:
# Had to drop a col due to memory issue (dropping this one bc generalzes to gen_city_to_region)
gen_df_mod = gen_df.drop(["gen_cc_by_ip_to_sub-region"],axis=1)
gen_df_mod

Unnamed: 0,LoE,gender,gen_city_to_region,gen_postalCode_to_1st_digit,gen_YoB_to_10_quantiles,gen_nforum_posts_to_5_quantiles,gen_nforum_votes_to_4_quantiles,gen_nforum_endorsed_to_2_quantiles,gen_nforum_threads_to_2_quantiles,gen_nforum_comments_to_2_quantiles,gen_nforum_pinned,gen_nforum_events
0,,,Americas,7,,"(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
1,b,m,Asia,NAN,"(1989.0, 1991.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
2,m,m,Americas,NAN,"(1980.0, 1984.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
3,p,m,,NAN,"(1987.0, 1989.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
4,,,Americas,NAN,,"(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
...,...,...,...,...,...,...,...,...,...,...,...,...
199994,jhs,,Oceania,2,"(1995.0, 2018.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
199995,,,Europe,6,,"(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
199996,b,f,,NAN,"(1995.0, 2018.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
199997,,,Europe,0,"(1995.0, 2018.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"


In [401]:
gen_df_mod = gen_df_mod.drop(["gen_postalCode_to_1st_digit"],axis=1)
# print(gen_quasi_ids)
gen_quasi_ids = ['gen_LoE', 'gender', 'gen_city_to_region', 'gen_YoB_to_10_quantiles', 'gen_nforum_posts_to_5_quantiles', 'gen_nforum_votes_to_4_quantiles', 'gen_nforum_endorsed_to_2_quantiles', 'gen_nforum_threads_to_2_quantiles', 'gen_nforum_comments_to_2_quantiles', 'gen_nforum_pinned', 'gen_nforum_events']
# Group by set of quasi id values
quasi_id_grouped_df = gen_df_mod.groupby(gen_quasi_ids, dropna=False)
# Get number of rows in each gruop
grouped_row_counts = quasi_id_grouped_df.size()
# Min number of rows in a group = level of k-anonymity
level_k_anon_num = min(grouped_row_counts)

In [403]:
print(level_k_anon_num)

0


In [406]:
gen_df_mod

Unnamed: 0,LoE,gender,gen_city_to_region,gen_YoB_to_10_quantiles,gen_nforum_posts_to_5_quantiles,gen_nforum_votes_to_4_quantiles,gen_nforum_endorsed_to_2_quantiles,gen_nforum_threads_to_2_quantiles,gen_nforum_comments_to_2_quantiles,gen_nforum_pinned,gen_nforum_events
0,,,Americas,,"(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
1,b,m,Asia,"(1989.0, 1991.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
2,m,m,Americas,"(1980.0, 1984.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
3,p,m,,"(1987.0, 1989.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
4,,,Americas,,"(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
...,...,...,...,...,...,...,...,...,...,...,...
199994,jhs,,Oceania,"(1995.0, 2018.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
199995,,,Europe,,"(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
199996,b,f,,"(1995.0, 2018.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"
199997,,,Europe,"(1995.0, 2018.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]"


In [410]:
gen_df_mod

Unnamed: 0,LoE,gender,gen_city_to_region,gen_YoB_to_10_quantiles,gen_nforum_posts_to_5_quantiles,gen_nforum_votes_to_4_quantiles,gen_nforum_endorsed_to_2_quantiles,gen_nforum_threads_to_2_quantiles,gen_nforum_comments_to_2_quantiles,gen_nforum_pinned,gen_nforum_events,gen_LoE
0,,,Americas,,"(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]",
1,b,m,Asia,"(1989.0, 1991.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]",college
2,m,m,Americas,"(1980.0, 1984.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]",post-college
3,p,m,,"(1987.0, 1989.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]",post-college
4,,,Americas,,"(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]",
...,...,...,...,...,...,...,...,...,...,...,...,...
199994,jhs,,Oceania,"(1995.0, 2018.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]",pre-hs
199995,,,Europe,,"(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]",
199996,b,f,,"(1995.0, 2018.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]",college
199997,,,Europe,"(1995.0, 2018.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]","(-0.001, 1.0]",


In [412]:
# gen_df_mod = gen_df_mod.drop(["LoE"], axis=1)

In [414]:
gen_quasi_ids = ['gen_LoE', 'gender', 'gen_city_to_region', 'gen_YoB_to_10_quantiles', 'gen_nforum_posts_to_5_quantiles', 'gen_nforum_votes_to_4_quantiles', 'gen_nforum_endorsed_to_2_quantiles', 'gen_nforum_threads_to_2_quantiles', 'gen_nforum_comments_to_2_quantiles', 'gen_nforum_pinned', 'gen_nforum_events']
# Group by set of quasi id values
min_rows = None
quasi_id_grouped_df = gen_df_mod.groupby(gen_quasi_ids, dropna=False)
for name, group in quasi_id_grouped_df:
    print(group.shape[0])
    if (min_rows is None or group.shape[0] < min_rows):
        min_rows = group.shape[0]
# # Get number of rows in each gruop
# grouped_row_counts = quasi_id_grouped_df.size()
# print(grouped_row_counts)
# # rouped_row_counts2 = quasi_id_grouped_df.size()
# # Min number of rows in a group = level of k-anonymity
# level_k_anon_num = min(grouped_row_counts)

21
1
1
27
1
2
1
1
48
1
1
97
2
3
1
1
1
111
1
4
1
1
2
1
1
1
99
1
1
1
76
2
1
1
1
71
1
1
1
31
16
1
1201
12
5
9
5
1
1
2
6
5
1
1
1
1
1
5
1
2
1
3
3
2
2
2
2
880
1
9
3
6
5
1
3
4
1
2
1
1
1
736
8
1
4
5
1
1
2
2
1
2
2
1
2
3
1
915
2
1
7
5
1
1
3
1
2
1
1
2
2
1
1
1055
2
1
6
6
1
3
2
1
1
1
1
1
1
1019
5
2
8
1
1
1
2
1
1
1
1
1
2
1147
2
4
2
1
1
4
1
2
1
2
1
797
3
2
4
1
2
1
1
1
1
377
1
1
2
1
1
227
1
1
94
1
1
1
1
1
174
1
2
1
1
1
1
177
1
2
1
1
1
271
2
1
1
1
1
415
3
1
1
2
2
2
1
1
452
2
3
2
2
1
1
1
688
3
2
1
1
1
1
1
1
632
3
4
2
2
1
1
1
364
1
1
1
155
1
1
229
2
2
1
1
1
1
1
228
1
2
2
4
2
1
1
1
1
205
1
286
1
3
1
1
1
1
1
338
2
1
1
1
1
323
2
1
2
1
2
1
1
2
425
1
1
1
1
2
1
1
288
1
1
1
1
124
1
3
1
54
283
5
2
1
3
1
2
1
1
2
1
1
1
1
1
2
232
1
5
2
2
1
191
1
1
3
1
265
1
1
2
1
1
1
298
1
5
2
2
1
332
1
2
4
1
1
309
1
1
2
1
1
1
1
192
1
1
1
2
79
1
29
668
1
3
1
5
2
2
2
1
2
1
1
798
4
1
1
4
2
1
1
1
1
785
1
1
1
1
7
1
3
1
1
1
1
3
915
1
3
1
4
1
1
1
1
1
1
1
1
1
1
1116
3
2
3
2
1
1
1
1
1
1
2
1
1027
2
1
3
2
1
3
1
1
1
1
1
1
1
1362
4
1
1
1
6
2
2

In [415]:
min_rows

1

**5. See if you can use some combination of these mechanisms to produce a 5-anonymous data set**