## Re-identification/De-identification 


# Question 1
### In class, we looked at how we could identify anyone in the United States based on their birthday, age, gender, and zip code, on the assumption that all of these features were uniformly distributed. Using legal means, find data on the actual distribution of some of these features. 

- According to the Henry J Kaiser Family Foundation, in 2017 a majority of the US population was between the ages of 0-18 and 35-54 years old. This trend is also consistent at the state level. [https://www.kff.org/other/state-indicator/distribution-by-age/?currentTimeframe=0&sortModel=%7B%22colId%22:%22Location%22,%22sort%22:%22asc%22%7D]
- According to the Henry J Kaiser Family Foundation, in 2017 51% of the US population was female and 49% was male. This split is also consistent at the state level. [https://www.kff.org/other/state-indicator/distribution-by-gender/?currentTimeframe=0&sortModel=%7B%22colId%22:%22Location%22,%22sort%22:%22asc%22%7D]
- According to FiveThirtyEight, from 2003-2014 the most common birthdays were in September (~12,000 births/day), and the least common birthdays were around Christmas and New Year's (~6,000 births/day). [https://github.com/fivethirtyeight/data/blob/master/births/US_births_2000-2014_SSA.csv] 
- There are over 42,000 zip codes in the US, and 100 of them have fewer than 10 residents each [http://localistica.com/usa/zipcodes/least-populated-zipcodes/]. In 2016 there were 17 zip codes so sparsely populated that the first three digits of the code could uniquely identify an individual. According to HIPAA regulations, if the first three digits of a zip code correspond to a population of 20,000 or fewer people, those digits are treated as a quasi-identifier. The average population per zip code in 2019 is 7,844 people, and the average number of changes to a zip code per year is 4,707.6 people [https://www.zip-codes.com/zip-code-statistics.asp] 

### What does this mean to the ability to identify using just these bits of data?
- These identifiers are not uniformly distributed. The ability to identify individuals is easiest for data entries that contain uncommon birthdays and ages, and sparsely populated zip codes. 
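
To make the effect of non-uniformity concrete, here is a minimal sketch with made-up numbers. The population size and cell probabilities are illustrative assumptions, not the KFF or FiveThirtyEight figures, and `expected_unique` is a hypothetical helper. It assumes each person independently falls into cell i with probability p_i, so the expected number of people who are alone in their cell is N * sum(p_i * (1 - p_i)^(N - 1)).

```python
def expected_unique(population, cell_probs):
    """Expected number of people who are the only member of their cell."""
    return population * sum(p * (1 - p) ** (population - 1) for p in cell_probs)

# Hypothetical toy numbers: 10 people spread over 4 quasi-identifier cells
uniform = expected_unique(10, [0.25, 0.25, 0.25, 0.25])
skewed = expected_unique(10, [0.85, 0.05, 0.05, 0.05])

print("Expected unique individuals (uniform): %.2f" % uniform)
print("Expected unique individuals (skewed):  %.2f" % skewed)
```

In this toy case the skewed distribution leaves more people alone in a rare cell, which is the same effect that makes uncommon birthdays and sparsely populated zip codes easier to re-identify.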


In [1]:
def read_config(config_file):
    """ 
    Read a configuration file containing the labels of the quasi-identifier columns in the csv. 
    Returns the labels as a sorted list of strings.
    """
    # The with-block closes the file automatically, so no explicit close() is needed
    with open(config_file) as file:
        identifiers = file.read().split()
    return sorted(identifiers)

### Import data

In [2]:
import pandas as pd

Many entries contain NaN where the user did not enter information. Fill these values with 0 in order to filter them during analysis. 

In [3]:
NA_FILL_VALUE = 0

In [4]:
df_raw = pd.read_csv("mid_sample_set.csv", dtype='unicode')
# Set user id as index
df_raw.index = df_raw.user_id
df_raw = df_raw.drop('user_id', axis = 1)

In [5]:
# Remove NA columns
original_columns = set(df_raw.columns.values)
df = df_raw.dropna(axis = 1, how = 'all').fillna(NA_FILL_VALUE)
new_columns = set(df.columns.values)
print("Removed columns", original_columns - new_columns)

Removed columns {'roles_isLibrary', 'roles_isCCX', 'forumRoles_isCommunityTA'}


In [6]:
df.shape

(199999, 87)

Upon loading the dataset, there are 199,999 entries (users) and 87 fields (identifiers) for each row entry.

### Direct Identifiers
- Can uniquely identify an individual and should be removed
- This includes ip


In [7]:
df = df.drop('ip', axis = 1)

### Quasi-Identifiers
- Can uniquely identify an individual when linked to other datasets. 
- These include: 'countryLabel', 'continent', 'city', 'region', 'subdivision', 'postalCode', 'LoE', 'YoB', 'gender', 'nforum_posts', 'nforum_votes', 'nforum_endorsed', 'nforum_threads', 'nforum_comments', 'nforum_pinned', and 'nforum_events', and are listed by column label in the configuration file. 'user_id' also appears in the configuration file, but it serves only as a key and is removed before analysis.
- Redundant quasi-identifiers are not included 
- course_id was dropped 
- Create a version of the dataset that only contains the quasi-identifiers.
- <b>Categorical Quasi-Identifiers</b>: 'countryLabel', 'continent', 'city', 'region', 'subdivision', 'postalCode', 'LoE', 'gender'
- <b>Continuous Quasi-Identifiers</b>: 'YoB', 'nforum_posts', 'nforum_votes', 'nforum_endorsed', 'nforum_threads', 'nforum_comments', 'nforum_pinned', and 'nforum_events'
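
As a minimal sketch of why these columns matter in combination, the toy example below counts how many rows are unique on a set of quasi-identifiers. The records are invented for illustration; only the column names are borrowed from the list above.

```python
import pandas as pd

# Hypothetical toy records; the column names mirror a few of the quasi-identifiers above
toy = pd.DataFrame({
    "countryLabel": ["US", "US", "US", "FR", "FR"],
    "gender":       ["f",  "f",  "m",  "m",  "m"],
    "YoB":          [1990, 1990, 1985, 1990, 1990],
})
labels = ["countryLabel", "gender", "YoB"]

# Count the members of each group of identical quasi-identifier values
sizes = toy.groupby(labels).size().reset_index(name="group_size")
# Attach the group size to every row, then keep the rows that are alone in their group
toy_sized = toy.merge(sizes, on=labels, how="left")
unique_rows = toy_sized[toy_sized.group_size == 1]

print("Rows that are unique on the quasi-identifiers: %d of %d" % (len(unique_rows), len(toy)))
```

No single column isolates anyone in this toy table, but the combination of all three singles out one row.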

# Question 2
### Create a version of the dataset that only contains the quasi-identifiers.

In [8]:
quasi_identifiers = read_config("config_file.txt")
# Remove user_id because it's not a quasi-identifier, just a key, and course_id because Waldo said to drop it
quasi_identifier_labels = list(set(quasi_identifiers) - set(["user_id", "course_id"]))
df_quasi = df[quasi_identifier_labels]


In [9]:
# df filtered by quasi-identifiers
df_quasi;

# Question 3
### What happens to the size and completion rate of the dataset when made to be 3-, 4-, and 5-anonymous?

In [41]:
def getKAnonFeatureRate(df_quasi, df_feature, quasi_identifier_labels, k, featurePrintableName = "Completion", 
                        techniquesUsed = "suppression"):
    """ 
    Suppresses every group of quasi-identifier values that contains fewer than k individuals,
    then calculates the percentage of the remaining individuals for whom the feature
    (for example, course completion) is 'True'. Prints the size of the k-anonymous
    data set and the feature rate, and returns the feature rate as a percentage.

    df_quasi : A DataFrame of quasi-identifiers
    df_feature : A Series of 'True'/'False' strings taken from the full DataFrame
    quasi_identifier_labels : A list of the quasi-identifier column labels
    k : k in k-anonymity (int)
    featurePrintableName : Name of the feature, used in the printed output
    techniquesUsed : Description of the anonymization techniques, used in the printed output

    """
    
    k_anonymous = df_quasi.groupby(quasi_identifier_labels)\
                  .size().reset_index(name = 'ct').set_index(quasi_identifier_labels)
    k_anonymous = k_anonymous[k_anonymous.ct >= k]

    feature = df_feature.name
    featureCount = "%sCount" % df_feature.name
    
    completedCount = df_quasi.join(df_feature).groupby(quasi_identifier_labels + [feature])\
                       .size()\
                       .reset_index(name = featureCount)\
                       .set_index(quasi_identifier_labels)

    k_anonymous = k_anonymous.reset_index()\
                  .merge(completedCount[completedCount[feature] == 'True'].reset_index(), how = "left")

    k_AnonTotalCount = k_anonymous["ct"].sum()
    print("Size of %d-anonymous dataset after %s: %d" % (k, techniquesUsed, k_AnonTotalCount))

    if k_AnonTotalCount == 0:
        # There are no records that are k-anonymous, so set the rate to 0 instead of dividing by zero
        completionRate = 0
    else:
        completionRate = (k_anonymous[featureCount].fillna(0).sum() / k_AnonTotalCount * 100)

    print("%d-anonymous %s Rate: %.2f%%" % (k, featurePrintableName, completionRate))
    
    return completionRate 

In [11]:
overallCompletionRate = df.completed.apply(lambda x: 1 if x == 'True' else 0).sum() / float(len(df))
print("The unanonymized completion rate was: %.2f%%" % (overallCompletionRate * 100))

The unanonymized completion rate was: 2.78%


In [12]:
completionRate3Anon = getKAnonFeatureRate(df_quasi, df.completed, quasi_identifier_labels, 3)

Size of 3-anonymous dataset after suppression: 58163
3-anonymous Completion Rate: 1.55%


In [13]:
completionRate4Anon = getKAnonFeatureRate(df_quasi, df.completed, quasi_identifier_labels, 4)

Size of 4-anonymous dataset after suppression: 48890
4-anonymous Completion Rate: 1.27%


In [14]:
completionRate5Anon = getKAnonFeatureRate(df_quasi, df.completed, quasi_identifier_labels, 5)

Size of 5-anonymous dataset after suppression: 43586
5-anonymous Completion Rate: 1.19%
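
As a rough answer in numbers, the printed sizes above can be turned into retention percentages. The sketch below simply reuses the figures from the cell outputs (199,999 original rows and the three suppressed sizes); the variable names are illustrative and not part of the notebook.

```python
# Figures copied from the printed output above
original_size = 199999
suppressed_sizes = {3: 58163, 4: 48890, 5: 43586}

# Percentage of the original rows that survive suppression at each k
retained = {k: size / original_size * 100 for k, size in suppressed_sizes.items()}

for k, pct in retained.items():
    print("%d-anonymous: %.1f%% of rows retained, %.1f%% suppressed" % (k, pct, 100 - pct))
```

Under suppression alone, roughly 29%, 24%, and 22% of the rows survive at k = 3, 4, and 5, while the completion rate falls from 2.78% to 1.55%, 1.27%, and 1.19%.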


# Question 4
### Synthetic Records
- Make the data k-anonymous
- Find the number of synthetic records needed for each case
- Compute completion rates and compare to dataset without synthetic records

When k = 3

In [15]:
def addSyntheticRows(df_quasi, quasi_identifier_labels, k):
    """
    Adds artificial rows of data in order to achieve the desired k-anonymity level. 
    Each synthetic row duplicates the quasi-identifier values of a group that is not yet k-anonymous.
    Returns the data set with the artificial rows appended. 
    """
    print("Size of original dataset: %d" % len(df_quasi))

    # create a new copy of the dataset that's passed in so that the original isn't modified because
    # DataFrames are passed by reference, not value
    synthetic_k_anon_df = df_quasi.copy() 
    
    # Create groupings based on quasi identifiers and count number of students in each group
    not_k_anonymous = synthetic_k_anon_df.groupby(quasi_identifier_labels)\
                      .size().reset_index(name = 'studentCount')
        
    # Keep only the groupings where studentCount is less than the desired level for k-anonymity 
    not_k_anonymous = not_k_anonymous[not_k_anonymous.studentCount < k]

    # Duplicate rows where groupings of quasi-identifiers are not k-anonymous.
    # A group with i members needs k - i synthetic copies to reach size k.
    syntheticRows = []
    for i in range(1, k):
        nonKAnonymousRows = not_k_anonymous[not_k_anonymous.studentCount == i]
        if len(nonKAnonymousRows) > 0:
            syntheticRows.extend([nonKAnonymousRows] * (k - i))

    # DataFrame.append was removed in pandas 2.0, so concatenate all pieces in one call
    synthetic_k_anon_df = pd.concat([synthetic_k_anon_df] + syntheticRows, ignore_index = True, sort = False)
            
    # Calculate difference between dataset and k-anonymized dataset with synthetic rows
    numRowsAdded = synthetic_k_anon_df.shape[0] - len(df_quasi)
    print("Size of synthetic dataset: %d\nSynthetic Rows Added: %d" % (len(synthetic_k_anon_df), numRowsAdded))
    
    # Drop the studentCount column from the dataset we return because df_quasi does not
    # have it.  This column is an artifact of the grouped rows appended above, and it is
    # absent when no rows had to be added
    synthetic_k_anon_df = synthetic_k_anon_df.drop("studentCount", axis = 1, errors = "ignore")
    return synthetic_k_anon_df    

Add synthetic data when k = 3

In [16]:
synthetic3AnonDf = addSyntheticRows(df, quasi_identifier_labels, 3)

Size of original dataset: 199999
Size of synthetic dataset: 453071
Synthetic Rows Added: 253072


Add synthetic data when k = 4

In [17]:
synthetic4AnonDf = addSyntheticRows(df, quasi_identifier_labels, 4)

Size of original dataset: 199999
Size of synthetic dataset: 587798
Synthetic Rows Added: 387799


Add synthetic data when k = 5

In [18]:
synthetic5AnonDf = addSyntheticRows(df, quasi_identifier_labels, 5)

Size of original dataset: 199999
Size of synthetic dataset: 723851
Synthetic Rows Added: 523852
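
The number of synthetic rows can also be computed directly, without building the enlarged data set: every group with fewer than k members needs k minus its size additional rows. The sketch below shows this on an invented toy table; `count_synthetic_rows` is a hypothetical helper, not a function from the notebook.

```python
import pandas as pd

def count_synthetic_rows(df, labels, k):
    """Number of rows needed to bring every group up to at least k members."""
    sizes = df.groupby(labels).size()
    # Shortfall of each group that is still smaller than k
    shortfall = k - sizes[sizes < k]
    return int(shortfall.sum())

# Hypothetical toy records with group sizes 3, 2, and 1
toy = pd.DataFrame({
    "gender": ["f", "f", "f", "m", "m", "o"],
    "YoB":    [1990, 1990, 1990, 1985, 1985, 1970],
})

needed = {k: count_synthetic_rows(toy, ["gender", "YoB"], k) for k in [3, 4, 5]}
print(needed)
```

In the notebook output above, the synthetic rows added (253,072, 387,799, and 523,852) exceed the original 199,999 rows in every case, so most of each synthetic data set is artificial.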


# Question 5
### K-Anonymity via generalization, blurring, and suppression 
- Generalization
    - YoB, nforum_* 
- Blurring
    - Last 3 digits of postalCode
- Suppression 
    - Remaining 
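
A minimal sketch of the first two techniques on toy values follows. The helper names `generalize_toy` and `blur_toy` and the sample inputs are assumptions for illustration. This version deliberately extends the bin edges past the maximum value so that no value falls outside the bins, which may differ from the binning used in the notebook cells.

```python
import pandas as pd

def generalize_toy(values, bucket_size):
    """Bucket integers into equal-width intervals that cover the full range."""
    values = pd.Series(values).astype(int)
    # Round the maximum up to the next multiple of bucket_size
    top = -(-int(values.max()) // bucket_size) * bucket_size
    # Edges start one bucket below zero so that a fill value of 0 lands in a bin
    edges = list(range(-bucket_size, top + bucket_size, bucket_size))
    return pd.cut(values, edges)

def blur_toy(code, n=3):
    """Replace the last n characters of a string with stars."""
    code = str(code)
    return code[:-n] + "*" * n if len(code) > n else "*" * n

years = generalize_toy([0, 1987, 1990, 2001], 10)
blurred = blur_toy("02138")
print(years.tolist())
print(blurred)
```

Generalization maps 1987 and 1990 to the same decade-wide interval, and blurring keeps only the leading digits of the postal code, so more records end up sharing identical quasi-identifier values.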


In [26]:
df_quasi_q5 = df_quasi.copy()

In [27]:
"""
Partially replaces categorical or numerical data with "*" for a specified column.
Returns data set with updated values.
"""
def blurring(df_quasi, col, blurStr = "***"):
    # Blur the trailing characters of the string with stars (the last 3 by default)
    df_quasi[col] = df_quasi[col]\
                    .astype(str)\
                    .apply(lambda data: data[0:(len(data) - len(blurStr))] + blurStr 
                           if len(data) > len(blurStr) else blurStr)
    return df_quasi

In [28]:
"""
Categorizes data into groupings of specified and equal sizes. Returns 
categorized data set with ranges for specified columns.
"""
def generalizeField(df, colChange, bucketSize, maxVal):
    col = colChange
    df_quasi_gen = df
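    # Calculate the number of categories from the maximum value in the column and the size of each interval 
    binNum = int(maxVal / bucketSize)
    # Create the list of bin edges, starting one bucket below NA_FILL_VALUE so that filled-in zeros land in a bin.
    # Note: the last edge is bucketSize * (binNum - 1), so values above it fall outside every bin and become NaN.
    bins = [bucketSize*i for i in range(NA_FILL_VALUE-1, binNum)]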
    # Convert data to integers and assign each value to specified list of categories 
    df_quasi_gen.loc[:,col] = df_quasi_gen[col].astype(int)
    df_quasi_gen.loc[:,col] = pd.cut(df_quasi_gen[col], bins)
    return df_quasi_gen

In [31]:
_ = blurring(df_quasi_q5, "postalCode")

In [32]:
fieldsToGeneralize = [("YoB", int), ("nforum_posts", int), ("nforum_votes", int), ("nforum_endorsed", int), 
                      ("nforum_threads", int), ("nforum_comments", int), ("nforum_pinned", int)]

for (field, fieldType) in fieldsToGeneralize:
    # Select the column as a Series so that max() returns a scalar instead of a one-element Series
    fieldMaxValue = df_quasi_q5[field].astype(fieldType).max()
    generalizeField(df_quasi_q5, field, 10, fieldMaxValue)

### Q5. Compare the number of students who complete and explore the course in the original and in the k-anonymous sets. What does this tell you about the process of de-identification?
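The process of de-identification adds noise to the dataset, eliminates statistically relevant information, changes the size of the dataset (suppression shrinks it, while synthetic records inflate it), and decreases utility. The completion rates in the 3-, 4-, and 5-anonymous sets were 1.66%, 1.56%, and 1.51%, compared with 2.78% in the original dataset, and the explored rates were 12.78%, 12.44%, and 12.13%, compared with 13.43%.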

In [46]:
overallCompletionRate = df.completed.apply(lambda x: 1 if x == 'True' else 0).sum() / float(len(df))
print("The unanonymized completion rate was: %.2f%%" % (overallCompletionRate * 100))

The unanonymized completion rate was: 2.78%


In [47]:
for k in [3,4,5]:
    getKAnonFeatureRate(df_quasi_q5, df.completed, quasi_identifier_labels, k, 
                        techniquesUsed="suppression, blurring, and generalization")
    print("")

Size of 3-anonymous dataset after suppression, blurring, and generalization: 104702
3-anonymous Completion Rate: 1.66%

Size of 4-anonymous dataset after suppression, blurring, and generalization: 94181
4-anonymous Completion Rate: 1.56%

Size of 5-anonymous dataset after suppression, blurring, and generalization: 86925
5-anonymous Completion Rate: 1.51%



In [49]:
overallExploredRate = df.explored.apply(lambda x: 1 if x == 'True' else 0).sum() / float(len(df))
print("The unanonymized explored rate was: %.2f%%" % (overallExploredRate * 100))

The unanonymized explored rate was: 13.43%


In [50]:
for k in [3,4,5]:
    getKAnonFeatureRate(df_quasi_q5, df.explored, quasi_identifier_labels, k, featurePrintableName="Explored",
                       techniquesUsed="suppression, blurring, and generalization")
    print("")

Size of 3-anonymous dataset after suppression, blurring, and generalization: 104702
3-anonymous Explored Rate: 12.78%

Size of 4-anonymous dataset after suppression, blurring, and generalization: 94181
4-anonymous Explored Rate: 12.44%

Size of 5-anonymous dataset after suppression, blurring, and generalization: 86925
5-anonymous Explored Rate: 12.13%



In [54]:
(278 - 151) / 278

0.4568345323741007

The exploration rates derived from the 3-, 4-, and 5-anonymous datasets were between 4.8% and 9.7% lower, in relative terms, than the unanonymized exploration rate (13.43%), while the completion rates derived from the same datasets were between 40% and 46% lower than the unanonymized completion rate (2.78%). Because suppression removes the rows whose combination of quasi-identifiers is shared by fewer than k individuals, this suggests that the individuals who completed the edX courses are more likely to have rare combinations of quasi-identifiers than a typical member of the population. By anonymizing the dataset, it seems that a disproportionate share of the rows that were suppressed were those corresponding to individuals who completed the course, because the anonymous completion estimates were so much lower than that of the true dataset.  
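
The relative differences can be reproduced from the printed rates. The sketch below reuses only the percentages shown in the cell outputs above; the dictionary names are illustrative.

```python
# Rates (in percent) copied from the printed output above
original = {"completion": 2.78, "explored": 13.43}
anonymized = {
    "completion": {3: 1.66, 4: 1.56, 5: 1.51},
    "explored":   {3: 12.78, 4: 12.44, 5: 12.13},
}

# Relative drop of each anonymized rate, as a percentage of the original rate
relative_drop = {
    feature: {k: (original[feature] - rate) / original[feature] * 100
              for k, rate in rates.items()}
    for feature, rates in anonymized.items()
}

for feature, drops in relative_drop.items():
    for k, drop in drops.items():
        print("%s, k = %d: %.1f%% lower than the original rate" % (feature, k, drop))
```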

# Question 6
### L-Diversity
- Measures how many distinct values of a sensitive attribute appear among records that are indistinguishable on their quasi-identifiers
- Sensitive attributes include: grade
- Determine level of l-diversity in order to strengthen k-anonymity

In [35]:
df_quasi_q6 = df_quasi.copy()
# Add the column of sensitive values (grade) to the quasi-identifier dataset
df_quasi_q6["grade"] = df["grade"]

In [36]:
lDiverseLevels = df_quasi_q6.groupby(quasi_identifier_labels)["grade"].nunique().reset_index(name = "l-diversity-grade")
# Find level of diversity (number of unique grades) for each grouping of quasi-identifiers
lDiverseLevels["l-diversity-grade"].head()

0     7
1    18
2     2
3     1
4     1
Name: l-diversity-grade, dtype: int64

In [38]:
sorted(lDiverseLevels["l-diversity-grade"].unique())

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 16, 18, 19]

### Q6. What fields in the data set might be considered "sensitive" 
- A student's grade may be considered sensitive
### In your datasets in 5), what level of l-diversity do the data sets have with respect to these fields?
- The level of l-diversity for each set of quasi-identifiers is indicated under the "l-diversity-grade" column in lDiverseLevels. The levels range from 1 to 19, so the dataset as a whole is only 1-diverse with respect to grade.
### How might you get to a higher level of l-diversity?
- Generalize both categorical and numerical data
- Test for differential privacy 
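
Another option, sketched below on invented toy data, is to suppress the groups whose sensitive values are not diverse enough. This is one possible approach rather than the method used in the notebook, and the helper names `l_diversity` and `suppress_low_diversity` are hypothetical.

```python
import pandas as pd

# Hypothetical toy records: both groups are 3-anonymous, but one has a single grade value
toy = pd.DataFrame({
    "gender": ["f", "f", "f", "m", "m", "m"],
    "YoB":    [1990, 1990, 1990, 1985, 1985, 1985],
    "grade":  ["0.9", "0.5", "0.1", "0.0", "0.0", "0.0"],
})
labels = ["gender", "YoB"]

def l_diversity(df, labels, sensitive):
    """Smallest number of distinct sensitive values found in any group."""
    return int(df.groupby(labels)[sensitive].nunique().min())

def suppress_low_diversity(df, labels, sensitive, l):
    """Keep only the groups that contain at least l distinct sensitive values."""
    return df.groupby(labels).filter(lambda g: g[sensitive].nunique() >= l)

before = l_diversity(toy, labels, "grade")
diverse = suppress_low_diversity(toy, labels, "grade", 2)
after = l_diversity(diverse, labels, "grade")
print("l-diversity before: %d, after: %d, rows kept: %d" % (before, after, len(diverse)))
```

The toy table shows why k-anonymity alone is not enough: every group has three members, yet anyone known to be in the second group has their grade revealed. As with k-anonymity, raising l by suppression costs rows.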
