# COGS 108 - Assignment 3: Data Privacy

### Written By: Liz Izhikevich and Harshita Mangal

## Important

- Rename this file to 'A3_A########.ipynb' (filled in with your student ID) before you submit it. Submit it to TritonED.
- Do not change / update / delete any existing cells with 'assert' in them. These are the tests used to check your assignment. 
    - Changing these will be flagged for attempted cheating. 
- This assignment has hidden tests: tests that are not visible here, but that will be run on your submitted assignment. 
    - This means passing all the tests you can see in the notebook here does not guarantee you have the right answer!

## Overview

We have discussed in lecture the importance and the mechanics of protecting individuals' privacy when they are included in datasets. In particular, in Lecture 11 (April 26th) we introduced the concept of the Safe Harbor Method. The Safe Harbor Method specifies how to protect individuals' identities by telling us which information to remove from a dataset in order to avoid accidentally disclosing personal information. 

In this assignment, we will explore how identity can be decoded from badly anonymized datasets, and also explore using Safe Harbor to anonymize datasets properly. 

### Import Statements

In [2]:
# Import Pandas
# Note: Pandas is all you need! Do not import any other functions / packages.
import pandas as pd

## Part 1: Identifying Data

Data Files:
- anon_user_dat.json
- employee_info.json

You will first be working with a file called 'anon_user_dat.json'. This file contains information about some (fake) Tinder users. When creating an account, each Tinder user was asked to provide their first name, last name, work email (to verify the disclosed workplace), age, gender, phone # and zip code. Before releasing this data, a data scientist cleaned the data to protect the privacy of Tinder's users by removing the obvious personal identifiers: phone #, zip code, and IP address. However, the data scientist chose to keep each user's email address because, when they visually skimmed a couple of the email addresses, none of them seemed to contain the user's actual name. This is where the data scientist made a huge mistake!

We will take advantage of having the work email addresses by finding the employee information of different companies and matching that employee information with the information we have, in order to identify the names of the secret Tinder users!

In [3]:
##################################
# 1a) Load in the 'cleaned' data #
##################################

# Load the json file into a pandas dataframe. Call it 'df_personal'.

df_personal = pd.read_json('anon_user_dat.json')
print(df_personal)

     age                           email  gender
0     46     gshoreson0@seattletimes.com    Male
1     56              eweaben1@salon.com  Female
2     30         akillerby2@gravatar.com    Male
3     87               gsainz3@zdnet.com    Male
4     58        bdanilewicz4@4shared.com    Male
5     39       sdeerness5@wikispaces.com  Female
6     43          jstillwell6@ustream.tv  Female
7     37          mpriestland7@opera.com    Male
8     35        nerickssen8@hatena.ne.jp  Female
9     40              hparsell9@xing.com    Male
10     9                 acopasa@fda.gov    Male
11    38        bdanielovitchb@jigsy.com    Male
12    42              cwestbergc@psu.edu  Female
13    39          jlarived@goodreads.com  Female
14    37             mchallisse@ning.com  Female
15    66            cbrognotf@ebay.co.uk    Male
16    79              aphearg@tumblr.com    Male
17    67             askogginsh@jugem.jp  Female
18    30       eondraseki@deviantart.com  Female
19    39            

In [4]:
assert isinstance(df_personal, pd.DataFrame)


In [5]:
#################################
# 1b) Check the first 10 emails #
#################################

# Save the first 10 emails to a Series, and call it 'sample_emails'. 
# You should then print out this Series. 
# The purpose of this is to get a sense of how these work emails are structured
#   and how we could possibly extract where each anonymous user seems to work


#sample_emails = pd.Series(df_personal.iloc[0:10, 1])
sample_emails = df_personal['email'].head(10)
print(sample_emails)

0    gshoreson0@seattletimes.com
1             eweaben1@salon.com
2        akillerby2@gravatar.com
3              gsainz3@zdnet.com
4       bdanilewicz4@4shared.com
5      sdeerness5@wikispaces.com
6         jstillwell6@ustream.tv
7         mpriestland7@opera.com
8       nerickssen8@hatena.ne.jp
9             hparsell9@xing.com
Name: email, dtype: object


In [6]:
assert isinstance(sample_emails, pd.Series)


In [7]:
###############################################
# 1c) Extract the Company Name From the Email #
###############################################

# Create a function with the following specifications:
#   Function Name: extract_company
#   Purpose: to extract the company of the email 
#          (i.e., everything after the @ sign but before the .com )
#   Parameter(s): email (string)
#   Returns: The extracted part of the email (string)
#   Hint: This should take 1 line of code. Look into the find('') method. 
#
# You can start with this outline:
#   def extract_company(email):
#      return 
#
# Example Usage: 
#   extract_company("larhe@uber.com") should return "uber"


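def extract_company(email):
    s_index = email.find('@') + 1
    # Search for the first '.' after the '@' so that dots in the
    # local part (e.g. "first.last@uber.com") are ignored.
    e_index = email.find('.', s_index)
    return email[s_index:e_index]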
    

In [8]:
assert extract_company("gshoreson0@seattletimes.com") == "seattletimes"
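
Addresses whose local part contains a dot are a case worth checking for any `find`-based version of this function. As an alternative sketch (not the form the assignment asks for), splitting on the '@' first guarantees that only the domain is searched. The name `extract_company_split` is hypothetical, and the sketch assumes the address contains an '@'.

```python
def extract_company_split(email):
    """Return the part of the domain before its first dot.

    Hypothetical alternative to extract_company; assumes the
    address contains an '@'.
    """
    domain = email.split('@', 1)[1]   # everything after the '@'
    return domain.split('.', 1)[0]    # everything before the first '.'


print(extract_company_split("larhe@uber.com"))            # uber
print(extract_company_split("first.last@hatena.ne.jp"))   # hatena
```

Multi-part domains such as `hatena.ne.jp` still yield only the first label, which matches the "everything after the @ sign but before the first dot" convention used in this part.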


With a little bit of basic sleuthing (aka googling) and web-scraping (aka selectively reading in HTML code), it turns out that you've been able to collect information about all the present employees/interns of the companies you are interested in. Specifically, on each company website, you have found the name, gender, and age of its employees. You have saved that info in employee_info.json and plan to see if, using this new information, you can match the Tinder accounts to actual names.

In [9]:
#############################
# 1d) Load in employee data #
#############################

# Load the json file into a pandas dataframe. Call it 'df_employee'.

df_employee = pd.read_json('employee_info.json')
print(df_employee)

     age      company first_name  gender        last_name
0     40      123-reg  Inglebert    Male         Falconer
1     32          163      Penny  Female          Pennone
2     45          163       Elva  Female         Crighton
3     49          163     Lemuel    Male             Lind
4     79          163     Rafael    Male         Bedenham
5     30         1688   Herminia  Female            Sisse
6     31        1und1       Toby  Female           Nisuis
7     54        1und1     Kylynn  Female         Vedikhov
8     56        1und1     Mychal    None          Denison
9     41          360     Ilario    Male          Mannagh
10    44          360    Angelle  Female           Kupisz
11    54          360     Farley    Male        Mullenger
12    40      4shared   Ginnifer  Female           Jarret
13    58      4shared      Brody    Male         Pinckard
14    34           51     Samara    None           Soares
15    37           51    Stanton    Male            Rehme
16    55      

In [10]:
assert isinstance(df_personal, pd.DataFrame)


In [11]:
#########################################################
# 1e) Match the employee name with company, age, gender #
#########################################################

# Create a function with the following specifications:
#   Function name: employee_matcher
#   Purpose: to match the employee name with the provided company, age, and gender
#   Parameter(s): company (string), age (int), gender (string)
#   Returns: The employee first_name and last_name like this: return first_name, last_name 
#   Note: If there are multiple employees that fit the same description, first_name and 
#         last_name should each be a list of all possible first names and last names
#         i.e., ['Desmund', 'Kelby'], ['Shepley', 'Tichner']
#
# Hint:
# There are many different ways to code this.
# 1) An inelegant solution is to loop through df_employee 
#    and for each data item see if the company, age, and gender match
#    i.e., for i in range(0, len(df_employee)):
#              if (company == df_employee.loc[i,'company']):
#
# However! The solution above is very inefficient and long, 
# so you should try to look into this:
# 2) Google the df.loc method: It extracts pieces of the dataframe
#    if it fulfills a certain condition.
#    i.e., df_employee.loc[df_employee['company'] == company]
#    If you need to convert your pandas data series into a list,
#    you can do list(result) where result is a pandas "series"
# 
# You can start with this outline:
#   def employee_matcher(company, age, gender):
#      return first_name, last_name

# YOUR CODE HERE

def employee_matcher(company, age, gender):
    df_company_set = df_employee.loc[df_employee['company'] == company]
    df_age_set = df_company_set.loc[df_company_set['age'] == age]
    df_gender_final_set = df_age_set.loc[df_age_set['gender'] == gender]
    first_name = list(df_gender_final_set['first_name'])
    last_name = list(df_gender_final_set['last_name'])
    return first_name, last_name
    

In [12]:
assert employee_matcher("google", 41, "Male") == (['Ab'], ['Tetley'])
assert employee_matcher("google", 42, "Male") == (['Desmund', 'Kelby'],
                                                  ['Shepley', 'Tichner'])
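
The three successive `.loc` filters can also be written as one combined boolean mask. The sketch below shows that variant on a small made-up table. `toy_employees` and `match_in` are hypothetical names used only for illustration, and the rows are not taken from employee_info.json.

```python
import pandas as pd

# Toy stand-in for df_employee; these rows are made up for illustration.
toy_employees = pd.DataFrame({
    'company':    ['acme', 'acme', 'acme', 'globex'],
    'age':        [30, 30, 41, 30],
    'gender':     ['Male', 'Male', 'Female', 'Male'],
    'first_name': ['Al', 'Bo', 'Cy', 'Di'],
    'last_name':  ['Ames', 'Burr', 'Cole', 'Dunn'],
})


def match_in(df, company, age, gender):
    """Return (first_names, last_names) for rows matching all three fields."""
    # Each comparison yields a boolean Series; '&' combines them row by row.
    mask = (
        (df['company'] == company)
        & (df['age'] == age)
        & (df['gender'] == gender)
    )
    matches = df.loc[mask]
    return list(matches['first_name']), list(matches['last_name'])


print(match_in(toy_employees, 'acme', 30, 'Male'))
# (['Al', 'Bo'], ['Ames', 'Burr'])
```

When nothing matches, both lists come back empty, which is the same behavior as filtering in three steps.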

In [13]:
####################################
# 1f) Extract all the private Data #
####################################

# - Create 2 empty lists called 'first_names' and 'last_names'
# - Loop through all the people we are trying to identify in df_personal
# - Call the extract_company function (i.e., extract_company(df_personal.loc[i, 'email']) )
# - Call the employee_matcher function 
# - Append the results of employee_matcher to the appropriate lists (first_names and last_names)

# YOUR CODE HERE

first_names = list()
last_names = list()

for index, row in df_personal.iterrows():
    company = extract_company(row['email'])
    # Call employee_matcher once and unpack both lists of names.
    matched_first, matched_last = employee_matcher(company, row['age'], row['gender'])
    first_names.append(matched_first)
    last_names.append(matched_last)
    




In [14]:
assert first_names[45:50]== [['Justino'], ['Tadio'], ['Kennith'], ['Cedric'], ['Amargo']]
assert last_names[45:50] == [['Corro'], ['Blackford'], ['Milton'], ['Yggo'], ['Grigor']]


We have now just discovered the 'anonymous' identities of all the registered Tinder users...awkward.
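
The matching done in Part 1 is essentially a join on three shared attributes: company, age, and gender. As a rough sketch of the same idea (an alternative to the loop, not what the assignment requires), `DataFrame.merge` can link two tables in one step. All rows and the names `toy_users`, `toy_employees`, and `linked` below are made up for illustration.

```python
import pandas as pd

# Made-up "anonymized" users and scraped employee records.
toy_users = pd.DataFrame({
    'age':    [30, 41],
    'email':  ['x1@acme.com', 'y2@globex.com'],
    'gender': ['Male', 'Female'],
})
toy_employees = pd.DataFrame({
    'company':    ['acme', 'globex', 'globex'],
    'age':        [30, 41, 52],
    'gender':     ['Male', 'Female', 'Male'],
    'first_name': ['Al', 'Cy', 'Di'],
    'last_name':  ['Ames', 'Cole', 'Dunn'],
})

# Derive the company from the email domain (text between '@' and the next '.').
toy_users['company'] = toy_users['email'].str.extract(r'@([^.]+)\.', expand=False)

# A left join keeps every user and attaches any employee with the same
# company, age, and gender.
linked = toy_users.merge(toy_employees, on=['company', 'age', 'gender'], how='left')
print(linked[['email', 'first_name', 'last_name']])
```

A user who matches several employees would appear on several rows of `linked`, which corresponds to the multi-name lists returned by `employee_matcher`.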

## Part 2: Anonymize Data

You are hopefully now convinced that with some seemingly harmless data a hacker can pretty easily discover the identities of certain users. Thus, we will now clean the original Tinder data ourselves according to the Safe Harbor Method in order to make sure that it has been *properly* cleaned...

In [15]:
#############################
# 2a) Load in personal data #
#############################

# Load the user_dat.json file into a pandas dataframe. Call it 'df_users'.
# Note: You might find that using the same method as A2 (or above) leads to an error.
# The file has a slightly different organization. 
#   Try googling the error and finding the fix for it.
# Hint: you can still use 'pd.read_json', you just need to add another argument.


df_users = pd.read_json('user_dat.json', lines=True)
print(df_users)



     age                           email first_name  gender       ip_address  \
0     46     gshoreson0@seattletimes.com     Gordon    Male    230.97.219.70   
1     56              eweaben1@salon.com    Elenore  Female   202.253.80.173   
2     30         akillerby2@gravatar.com       Abbe    Male    15.120.128.79   
3     87               gsainz3@zdnet.com      Guido    Male   71.234.147.178   
4     58        bdanilewicz4@4shared.com      Brody    Male   68.192.188.136   
5     39       sdeerness5@wikispaces.com     Shalne  Female    204.227.6.124   
6     43          jstillwell6@ustream.tv      Joell  Female   37.146.221.194   
7     37          mpriestland7@opera.com    Manfred    Male     67.64.181.77   
8     35        nerickssen8@hatena.ne.jp     Neille  Female   180.183.192.79   
9     40              hparsell9@xing.com      Henri    Male    32.181.36.170   
10     9                 acopasa@fda.gov    Alyosha    Male   36.177.179.182   
11    38        bdanielovitchb@jigsy.com

In [16]:
assert isinstance(df_users, pd.DataFrame)
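
The extra argument matters because of how the file is laid out. A file that needs `lines=True` holds one JSON object per line (often called JSON Lines) instead of a single JSON document, and without the flag pandas typically stops with a "Trailing data" error. The sketch below illustrates this with two made-up records read from an in-memory string instead of user_dat.json.

```python
import io
import pandas as pd

# Two made-up records in JSON Lines form: one JSON object per line.
json_lines_text = (
    '{"age": 46, "gender": "Male", "zip": 48157}\n'
    '{"age": 56, "gender": "Female", "zip": 88414}\n'
)

# lines=True tells pandas to parse each line as a separate record (row).
df_lines = pd.read_json(io.StringIO(json_lines_text), lines=True)
print(df_lines)
```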


In [17]:
################################
# 2b) Drop personal attributes #
################################

# Remove any personal information, following the Safe Harbor method.
# Based on lecture 11, remove any columns from df_users that contain personal information.


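# Keep only the non-identifying columns by name; selecting by name does not
# depend on the column order that read_json happens to produce.
df_users = df_users[['age', 'gender', 'zip']]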
print(df_users)


     age  gender    zip
0     46    Male  48157
1     56  Female  88414
2     30    Male  74026
3     87    Male  73002
4     58    Male  41861
5     39  Female  30045
6     43  Female  82432
7     37    Male  80745
8     35  Female   1537
9     40    Male   2559
10     9    Male  39480
11    38    Male  98537
12    42  Female  62916
13    39  Female  49415
14    37  Female  13144
15    66    Male  10199
16    79    Male  50424
17    67  Female  41513
18    30  Female  85609
19    39  Female  95135
20    22    Male  57030
21    42  Female  72945
22    42    Male  29150
23    56    Male  20141
24    64    Male  19006
25    58    Male  95616
26    44  Female  41819
27    62  Female  93592
28    45  Female  76573
29    31  Female  16201
..   ...     ...    ...
970   62  Female  41041
971   32    Male   2905
972   66  Female  12770
973   67  Female  32334
974   30    Male  19129
975   13    Male  52732
976   30  Female  58503
977   64  Female  24318
978   63  Female  99011
979   97  Female

In [18]:
assert len(df_users.columns) == 3


In [19]:
###################################
# 2c) Drop ages that are above 90 #
###################################

# Safe Harbor rule C:
#   Drop all the rows which have age greater than 90 from df_users

df_users = df_users[df_users['age']<=90]

print(df_users)

     age  gender    zip
0     46    Male  48157
1     56  Female  88414
2     30    Male  74026
3     87    Male  73002
4     58    Male  41861
5     39  Female  30045
6     43  Female  82432
7     37    Male  80745
8     35  Female   1537
9     40    Male   2559
10     9    Male  39480
11    38    Male  98537
12    42  Female  62916
13    39  Female  49415
14    37  Female  13144
15    66    Male  10199
16    79    Male  50424
17    67  Female  41513
18    30  Female  85609
19    39  Female  95135
20    22    Male  57030
21    42  Female  72945
22    42    Male  29150
23    56    Male  20141
24    64    Male  19006
25    58    Male  95616
26    44  Female  41819
27    62  Female  93592
28    45  Female  76573
29    31  Female  16201
..   ...     ...    ...
969   46  Female  12939
970   62  Female  41041
971   32    Male   2905
972   66  Female  12770
973   67  Female  32334
974   30    Male  19129
975   13    Male  52732
976   30  Female  58503
977   64  Female  24318
978   63  Female

In [20]:
assert df_users.shape==(990, 3)


In [21]:
#############################
# 2d) Load in zip code data #
#############################

# Load the zip_pop.csv file into a (different) pandas dataframe. Call it 'df_zip'.

df_zip = pd.read_csv('zip_pop.csv')
print(df_zip)

         zip  population
0       1001       16769
1       1002       29049
2       1003       10372
3       1005        5079
4       1007       14649
5       1008        1263
6       1009         741
7       1010        3609
8       1011        1370
9       1012         661
10      1013       23188
11      1020       29668
12      1022        2451
13      1026         946
14      1027       17660
15      1028       15720
16      1029         789
17      1030       11669
18      1031        1308
19      1032         570
20      1033        6227
21      1034        2021
22      1035        5250
23      1036        5109
24      1037         838
25      1038        2545
26      1039        1336
27      1040       39880
28      1050        2530
29      1053        1685
...      ...         ...
33062  99786         259
33063  99788          69
33064  99789         402
33065  99790          20
33066  99791         237
33067  99801       29164
33068  99820         479
33069  99824        2111


In [22]:
assert isinstance(df_zip, pd.DataFrame)


In [23]:
###################################################
# 2e) Sort zipcodes into "Geographic Subdivision" #
###################################################

# The Safe Harbor Method applies to "Geographic Subdivisions"
#   as opposed to each zipcode itself. 
# Geographic Subdivision:
#   All areas which share the first 3 digits of a zip code
#
# Count the total population for each geographic subdivision
# Warning: you have to be savvy with a dictionary here
# To understand how a dictionary works, check the section materials,
#   use google and go to discussion sections!
#
# Instructions: 
# - Create an empty dictionary: zip_dict = {}
# - Loop through all the zip_codes in df_zip
# - Create a dictionary key for the first 3 digits of a zip_code in zip_dict
# - Continually add population counts to the key that contains the 
#     same first 3 digits of the zip code
#
# To extract the population you will find this code useful:
#   population = list(df_zip.loc[df_zip['zip'] == zip_code]['population'])
# To extract the first 3 digits of a zip_code you will find this code useful:
#   int(str(zip_code)[:3])






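zip_dict = {}
for zip_code in df_zip['zip']:
    key = int(str(zip_code)[:3])
    # The lookup returns a list (one entry per matching row), so sum it
    # to get an integer before accumulating; adding lists would concatenate them.
    population = sum(df_zip.loc[df_zip['zip'] == zip_code]['population'])
    zip_dict[key] = zip_dict.get(key, 0) + population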

In [24]:
assert isinstance(zip_dict, dict)
assert zip_dict[100] == 1580423
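
The per-subdivision totals can also be computed without an explicit loop. The following is a minimal sketch using `groupby` on a made-up table. `toy_zip` and `toy_zip_dict` are hypothetical names, and the numbers are not from zip_pop.csv.

```python
import pandas as pd

# Made-up zip codes and populations for illustration.
toy_zip = pd.DataFrame({
    'zip':        [10001, 10002, 10101, 99801],
    'population': [100, 200, 50, 7],
})

# Same key convention as the hint: first three characters of the zip, as an int.
prefix = toy_zip['zip'].astype(str).str[:3].astype(int)

# Sum the population within each prefix and convert the result to a dict.
toy_zip_dict = toy_zip.groupby(prefix)['population'].sum().to_dict()
print(toy_zip_dict)
```

One thing to keep in mind with this key convention: zip codes stored as integers lose their leading zeros, so a value such as 1001 produces the key 100, the same key as 10001. The sketch mirrors the hinted expression, so it behaves the same way.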



In [None]:
#################################
# 2f) Explain this Code Excerpt #
#################################

# In the cell below, explain in words what the following line of code is doing:
population = list(df_zip.loc[df_zip['zip'] == zip_code]['population'])

In [None]:
# First, df_zip['zip'] == zip_code compares every entry in the zip column of df_zip
#   with zip_code, producing a boolean mask that is True for the matching rows.
# df_zip.loc[...] keeps only those rows, and ['population'] selects their population column.
# Finally, list(...) converts that Series into a plain Python list (one entry per matching row).
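
To make the explanation concrete, the line can be broken into its intermediate steps. This sketch uses a three-row table built from the first rows of the zip_pop.csv printout above. `toy_zip`, `mask`, and `rows` are names introduced only for illustration.

```python
import pandas as pd

toy_zip = pd.DataFrame({'zip': [1001, 1002, 1003],
                        'population': [16769, 29049, 10372]})
zip_code = 1002

mask = toy_zip['zip'] == zip_code       # boolean Series: False, True, False
rows = toy_zip.loc[mask]                # DataFrame with only the matching rows
population = list(rows['population'])   # one-element list holding 29049
print(population)
```

Because the result is a list and not a number, it has to be reduced (for example by indexing or summing) before it is added to a running total.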

In [None]:
#############################
# 2g) Masking the Zip Codes #
#############################

# Go through each user, and update their zip code to Safe Harbor specifications:
#   If the user is from a zip code for which the population of the
#     "Geographic Subdivision" is less than or equal to 20000:
#        - Change the zip code to 0 
#   Otherwise:
#         - Change the zip code to be only the first 3 numbers of the full zip code
# Do all this by rewriting the zip column of the 'df_users' DataFrame
#
# Hints:
#  - This will be several lines of code, looping through the DataFrame, 
#      getting each zip code, checking the geographic subdivision with 
#      the population in zip_dict, and setting the zip code accordingly. 


for index, row in df_users.iterrows():
    # Use this user's own zip code and keep its first three digits.
    first_three = int(str(row['zip'])[:3])
    if zip_dict[first_three] <= 20000:
        df_users.loc[index, 'zip'] = 0
    else:
        df_users.loc[index, 'zip'] = first_three

In [None]:
assert len(df_users) == 990
assert sum(df_users.zip == 0) == 2
assert df_users.ix[671, 'zip'] == 0
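
The masking rule from 2g can also be applied to the whole column at once. The sketch below is one possible vectorized version on made-up data. `toy_users` and `toy_zip_dict` are hypothetical, and every prefix is assumed to be present in the dictionary.

```python
import pandas as pd

# Made-up users and subdivision populations for illustration.
toy_users = pd.DataFrame({'age': [46, 56, 30], 'zip': [48157, 88414, 10001]})
toy_zip_dict = {481: 250000, 884: 1500, 100: 1580000}

# First three digits of each zip code, using the same convention as zip_dict.
prefix = toy_users['zip'].astype(str).str[:3].astype(int)

# Look up each subdivision's population, then keep the prefix only where
# the population exceeds 20000; everything else becomes 0.
subdivision_pop = prefix.map(toy_zip_dict)
toy_users['zip'] = prefix.where(subdivision_pop > 20000, 0)
print(toy_users)
```

Here the second user's subdivision has only 1500 people, so that zip code is replaced by 0 while the other two are shortened to 481 and 100.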


In [None]:
##########################################################
# 2h) Save out the properly anonymized data to json file #
##########################################################

# Save out df_users as a json file, called 'real_anon_user_dat.json'
df_users.to_json('real_anon_user_dat.json')

In [None]:
assert isinstance(pd.read_json('real_anon_user_dat.json'), pd.DataFrame)
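
As a quick sanity check on what `to_json` and `read_json` do together, the round trip can be tried in memory. This is only a sketch with a made-up two-row table. When `to_json` is called without a path, it returns the JSON text instead of writing a file.

```python
import io
import pandas as pd

# Made-up, already-masked rows for illustration.
toy_users = pd.DataFrame({'age': [46, 56],
                          'gender': ['Male', 'Female'],
                          'zip': [481, 0]})

# With no path argument, to_json returns the JSON text as a string.
json_text = toy_users.to_json()

# Reading the text back should reproduce the same rows and columns.
restored = pd.read_json(io.StringIO(json_text))
print(restored)
```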

Congrats, you're done! The users' identities are much more protected now. 

Submit this notebook file to TritonED.