# COGS 108 - Assignment 3: Data Privacy

## Important Reminders

- Do not change / update / delete any existing cells with 'assert' in them. These are the tests used to check your assignment. 
    - Changing these will be flagged for attempted cheating. 
- Do not rename this file.
- This assignment has hidden tests: tests that are not visible here, but that will be run on your submitted file. 
    - This means passing all the tests you can see in the notebook here does not guarantee you have the right answer!

## Overview

We have briefly discussed in lecture the importance and the mechanics of protecting individuals' privacy when they are included in datasets. 

One method to do so is the Safe Harbor Method. The Safe Harbor Method specifies how to protect individuals' identities by telling us which information to remove from a dataset in order to avoid accidentally disclosing personal information. 

In this assignment, we will explore web scraping, which can often include personally identifiable information, how identity can be decoded from badly anonymized datasets, and also explore using Safe Harbour to anonymize datasets properly. 

The topics covered in this assignment are mainly covered in the 'DataGathering' and 'DataPrivacy&Anonymization' Tutorial notebooks.

In [1]:
# Imports - these are provided for you. Do not import any other packages
import pandas as pd
import requests
import bs4
from bs4 import BeautifulSoup

## Part 1: Web Scraping 

### Scraping Rules

1) If you are using another organization's website for scraping, make sure to check the website's terms & conditions. 

2) Do not request data from the website too aggressively (quickly) with your program (also known as spamming), as this may break the website. Make sure your program behaves in a reasonable manner (i.e. acts like a human). One request for one webpage per second is good practice.

3) The layout of a website may change from time to time. Because of this, if you're scraping a website, make sure to revisit the site and rewrite your code as needed.
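
Rule 2 can be sketched in code. This is a minimal illustration and not part of the assignment: `fetch_politely`, `fake_fetch`, and the `delay_seconds` parameter are hypothetical names, and the fetching function is passed in as an argument so the pause logic can be tried without contacting a real website.

```python
import time

def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    responses = []
    for i, url in enumerate(urls):
        if i > 0:
            # Wait before every request except the first one
            time.sleep(delay_seconds)
        responses.append(fetch(url))
    return responses

# Stand-in for requests.get so this sketch runs offline
def fake_fetch(url):
    return "contents of " + url

pages = fetch_politely(["page-1", "page-2"], fake_fetch, delay_seconds=0.1)
print(pages)
```

With the real library, `fetch` would be `requests.get` and `delay_seconds` would stay at one second or more.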

### 1a) Web Scrape

We will first retrieve the contents on a page and examine them a bit.

Make a variable called `wiki`, that stores the following URL (as a string):
https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population

Now, to open the URL, use `requests.get()` and provide `wiki` as its input. Store this in a variable called `page`.

After that, make a variable called `soup` to parse the HTML using `BeautifulSoup`. Note that `BeautifulSoup` needs the content of the response, so you'll need to access the right attribute of `page` to get it. 


In [2]:

# YOUR CODE HERE
wiki = 'https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population'
page = requests.get(wiki)

page.status_code  # Check that the download succeeded (200 means OK)

soup = BeautifulSoup(page.content, 'html.parser')  # page.content is an attribute holding the raw HTML of the response


print(soup)

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of states and territories of the United States by population - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_states_and_territories_of_the_United_States_by_population","wgTitle":"List of states and territories of the United States by population","wgCurRevisionId":924018091,"wgRevisionId":924018091,"wgArticleId":87525,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["All articles with dead external links","Articles with dead external links from December 2017","Articles with permanently dead external links","CS1 maint: archived copy as title","Use mdy dates from September 2019","Articles with short description","Lists by population","Lists of states of the United States","Lists of subdivisi

In [3]:
assert wiki
assert page
assert soup


### 1b) Checking Scrape Contents

Extract the title from the page and save it in a variable called `title_page`. 

Make sure you extract it as a string.

To do so, you have to use the soup object created in the above cell. 
Hint: from your soup variable, you can access this with `.title.string`.

Make sure you print out and check the contents of `title_page`.

Note that it should not have any tags (such as `<title>`) included in it.

In [4]:
# YOUR CODE HERE
title_page = soup.title.string

title_page

'List of states and territories of the United States by population - Wikipedia'

In [5]:
assert title_page
assert isinstance(title_page, str)


### 1c) Extracting Tables

In order to extract the data we want, we'll start with extracting a data table of interest.

Note that you can see this table by going to look at the link we scraped.

Use the `soup` object and call a method called `find`, which will find and extract the first matching table in the scraped webpage. Store this in the variable `right_table`. 

Note: you need to search for the name `table`, and set the `class_` argument as `wikitable sortable`.
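
To see how `find` and `.title.string` behave without touching the network, here is a small sketch on a made-up HTML string. The HTML content and the `demo_` names are assumptions for illustration only; note that passing a multi-word string to `class_` matches the exact value of the `class` attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML with two tables; only the second has the class we search for
html = """
<html><head><title>Demo page</title></head><body>
<table class="plain"><tr><td>skip me</td></tr></table>
<table class="wikitable sortable"><tr><td>California</td><td>39,557,045</td></tr></table>
</body></html>
"""

demo_soup = BeautifulSoup(html, "html.parser")

# find returns the first matching tag (or None if nothing matches)
demo_table = demo_soup.find("table", class_="wikitable sortable")

# Collect the stripped text of each cell in the table's first row
demo_cells = [td.get_text(strip=True) for td in demo_table.find("tr").find_all("td")]

print(demo_soup.title.string)
print(demo_cells)
```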

In [6]:

# YOUR CODE HERE
right_table = soup.find('table', class_ = 'wikitable sortable')
right_table

<table class="wikitable sortable" style="width:100%; text-align:center;">
<tbody><tr style="vertical-align: top;">
<th style="vertical-align: middle">Rank in the fifty states, 2018
</th>
<th style="vertical-align: middle">Rank in states &amp; territories, 2010
</th>
<th style="vertical-align: middle">Name
</th>
<th style="vertical-align: middle">Population estimate, July 1, 2018<br/><sup class="reference" id="cite_ref-5"><a href="#cite_note-5">[5]</a></sup>
</th>
<th style="vertical-align: middle">Census population, April 1, 2010<br/><sup class="reference" id="cite_ref-6"><a href="#cite_note-6">[6]</a></sup>
</th>
<th>Percent change, 2010–2018<br/><sup class="reference" id="cite_ref-7"><a href="#cite_note-7">[note 1]</a></sup>
</th>
<th>Absolute change, 2010-2018
</th>
<th style="vertical-align: middle">Total seats in the <a href="/wiki/United_States_House_of_Representatives" title="United States House of Representatives">U.S. House of Representatives</a>, 2013–2023
</th>
<th style="ve

In [7]:
assert right_table
assert isinstance(right_table, bs4.element.Tag)
assert right_table.name == 'table'

Now, extract the data from the table into lists.

Note: This code is provided for you. Do read through it and try to see how it works.

In [8]:
# CODE PROVIDED
# YOU SHOULD NOT HAVE TO EDIT
# BUT YOU WILL WANT TO UNDERSTAND
lst_a, lst_b, lst_c = [], [], []

for row in right_table.findAll('tr'):
    
    cells = row.findAll('td')
    
    # Skips rows that aren't 12 cells long (like the heading)
    if len(cells) != 12:
        continue

    # The try/except catches the point where the name cell stops containing a link
    #  and ends the loop there, skipping the final summary rows
    try:
        lst_a.append(cells[2].find('a').text)
        lst_b.append(cells[3].find(text=True))
        lst_c.append(cells[4].find(text=True))
    except:
        break

In [9]:
lst_a
lst_b
lst_c

['37,254,523\n',
 '25,145,561\n',
 '18,801,310\n',
 '19,378,102\n',
 '12,702,379\n',
 '12,830,632\n',
 '11,536,504\n',
 '9,687,653\n',
 '9,535,483\n',
 '9,883,640\n',
 '8,791,894\n',
 '8,001,024\n',
 '6,724,540\n',
 '6,392,017\n',
 '6,547,629\n',
 '6,346,105\n',
 '6,483,802\n',
 '5,988,927\n',
 '5,773,552\n',
 '5,686,986\n',
 '5,029,196\n',
 '5,303,925\n',
 '4,625,364\n',
 '4,779,736\n',
 '4,533,372\n',
 '4,339,367\n',
 '3,831,074\n',
 '3,751,351\n',
 '3,574,097\n',
 '3,725,789\n',
 '2,763,885\n',
 '3,046,355\n',
 '2,700,551\n',
 '2,915,918\n',
 '2,967,297\n',
 '2,853,118\n',
 '2,059,179\n',
 '1,826,341\n',
 '1,852,994\n',
 '1,567,582\n',
 '1,360,301\n',
 '1,316,470\n',
 '1,328,361\n',
 '989,415\n',
 '1,052,567\n',
 '897,934\n',
 '814,180\n',
 '672,591\n',
 '710,231\n',
 '601,723\n',
 '625,741\n',
 '563,626\n',
 '159,358',
 '106,405',
 '55,519',
 '53,883',
 '306,675,006\n']

### 1d) Collecting into a dataframe

Create a dataframe `my_df` and add the data from the lists above to it. 
- `lst_a` is the state or territory name. Set the column name as `State`, and make this the index
- `lst_b` is the population estimate. Add it to the dataframe, and set the column name as `Population Estimate`
- `lst_c` is the census population. Add it to the dataframe, and set the column name as `Census Population`
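
One possible way to go from parallel lists to an indexed dataframe is sketched below on made-up data. The lists `names`, `estimates`, and `census` are hypothetical stand-ins for `lst_a`, `lst_b`, and `lst_c`; this is only one of several equivalent constructions.

```python
import pandas as pd

# Hypothetical stand-ins for lst_a, lst_b, and lst_c
names = ["Alpha", "Beta"]
estimates = ["1,200\n", "900\n"]
census = ["1,000\n", "950\n"]

# Build the columns from a dict and name the index in the same step
demo_df = pd.DataFrame(
    {"Population Estimate": estimates, "Census Population": census},
    index=pd.Index(names, name="State"),
)

print(demo_df.loc["Beta", "Census Population"])
```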

In [10]:

# YOUR CODE HERE
my_df = pd.DataFrame()

my_df['State'] = lst_a
my_df['Population Estimate'] = lst_b
my_df['Census Population'] = lst_c

my_df = my_df.set_index('State')
my_df

State,Population Estimate,Census Population
California,39557045,37254523
Texas,28701845,25145561
Florida,21299325,18801310
New York,19542209,19378102
Pennsylvania,12807060,12702379
Illinois,12741080,12830632
Ohio,11689442,11536504
Georgia,10519475,9687653
North Carolina,10383620,9535483
Michigan,9995915,9883640


In [11]:
assert isinstance (my_df, pd.DataFrame)
assert my_df.index.name == 'State'
assert list(my_df.columns) == ['Population Estimate', 'Census Population']


### 1e) Using the data
What is the Population Estimate of Texas? Save this answer to a variable called `texas_pop`
Notes:
- Extract this value programmatically from your dataframe (as in, don't set it explicitly, as `texas_pop = 123`)
- You can use `.loc` to extract a particular value from a dataframe.
- The data in your dataframe will be strings - that's fine, leave them as strings (don't typecast).
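
The assignment asks you to leave the values as strings. Purely as an aside, if one later wanted to do arithmetic on them, the scraped strings carry thousands separators and sometimes a trailing newline, so they would need cleaning first. A minimal sketch, where `parse_population` is a hypothetical helper name:

```python
def parse_population(raw):
    """Convert a scraped, comma-grouped number string to an int."""
    # strip removes surrounding whitespace such as a trailing newline
    return int(raw.strip().replace(",", ""))

print(parse_population("28,701,845\n"))
```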

In [12]:
# YOUR CODE HERE
texas_pop = my_df.loc['Texas','Population Estimate']
texas_pop

'28,701,845\n'

In [13]:
assert texas_pop


## Part 2: Identifying Data

Data Files:
- anon_user_dat.json
- employee_info.json

You will first be working with a file called 'anon_user_dat.json'. This file contains information about some (fake) Tinder users. When creating an account, each Tinder user was asked to provide their first name, last name, work email (to verify the disclosed workplace), age, gender, phone # and zip code. Before releasing this data, a data scientist cleaned the data to protect the privacy of Tinder's users by removing the obvious personal identifiers: phone #, zip code, and IP address. However, the data scientist chose to keep each user's email address because, when they visually skimmed a couple of the email addresses, none of them seemed to contain any of the users' actual names. This is where the data scientist made a huge mistake!

We will take advantage of having the work email addresses by finding the employee information of different companies and matching that employee information with the information we have, in order to identify the names of the secret Tinder users!

### 2a) Load in the 'cleaned' data 

Load the `anon_user_dat.json` json file into a pandas dataframe. Call it `df_personal`.

In [14]:
# YOUR CODE HERE
df_personal = pd.read_json('anon_user_dat.json')
df_personal.head()

Unnamed: 0,age,email,gender
0,60,gshoreson0@seattletimes.com,Male
1,47,eweaben1@salon.com,Female
2,27,akillerby2@gravatar.com,Male
3,46,gsainz3@zdnet.com,Male
4,72,bdanilewicz4@4shared.com,Male


In [15]:
assert isinstance(df_personal, pd.DataFrame)


### 2b) Check the first 10 emails 

Save the first 10 emails to a Series, and call it `sample_emails`. 
You should then print out this Series. 

The purpose of this is to get a sense of how these work emails are structured and how we could possibly extract where each anonymous user seems to work.


In [16]:
# YOUR CODE HERE
sample_emails = df_personal.loc[0:9, 'email']
sample_emails

0    gshoreson0@seattletimes.com
1             eweaben1@salon.com
2        akillerby2@gravatar.com
3              gsainz3@zdnet.com
4       bdanilewicz4@4shared.com
5      sdeerness5@wikispaces.com
6         jstillwell6@ustream.tv
7         mpriestland7@opera.com
8       nerickssen8@hatena.ne.jp
9             hparsell9@xing.com
Name: email, dtype: object

In [17]:
assert isinstance(sample_emails, pd.Series)


### 2c) Extract the Company Name From the Email 

Create a function with the following specifications:
- Function Name: extract_company
- Purpose: to extract the company of the email (i.e., everything after the @ sign but before the .)
- Parameter(s): email (string)
- Returns: The extracted part of the email (string)
- Hint: This should take 1 line of code. Look into the find('') method. 

You can start with this outline:
```python 
def extract_company(email):
    return
```

Example Usage: 
- extract_company("larhe@uber.com") should return "uber"
- extract_company("ds@cogs.edu") should return "cogs"
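
One detail worth keeping in mind is that the `.` must be searched for after the `@`, since a username may itself contain dots. The sketch below shows one alternative to `find` that handles this, using `str.partition`; the name `company_from_email` is hypothetical and this is not the only valid approach.

```python
def company_from_email(email):
    """Return the text between the '@' and the first '.' that follows it."""
    # partition splits at the first occurrence and returns (before, sep, after)
    domain = email.partition("@")[2]
    return domain.partition(".")[0]

print(company_from_email("larhe@uber.com"))
print(company_from_email("first.last@cogs.edu"))
```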



In [18]:
# Extract company email 
def extract_company(email):
    
    extract_start = email.find('@')
    extract_end = email.find('.',extract_start)
    
    company_email = email[extract_start + 1: extract_end]
    
    return company_email

extract_company('gshoreson0@seattletimes.com')

'seattletimes'

In [19]:
assert extract_company("gshoreson0@seattletimes.com") == "seattletimes"


With a little bit of basic sleuthing (aka googling) and web-scraping (aka selectively reading in html code) it turns out that you've been able to collect information about all the present employees/interns of the companies you are interested in. Specifically, on each company website, you have found the name, gender, and age of its employees. You have saved that info in employee_info.json and plan to see if, using this new information, you can match the Tinder accounts to actual names.

### 2d) Load in employee data 

Load the json file into a pandas dataframe. Call it `df_employee`.

In [20]:
# YOUR CODE HERE
df_employee = pd.read_json('employee_info.json')
df_employee.head()

Unnamed: 0,age,company,first_name,gender,last_name
0,42,123-reg,Inglebert,Male,Falconer
1,14,163,Rafael,Male,Bedenham
2,31,163,Lemuel,Male,Lind
3,45,163,Penny,Female,Pennone
4,52,163,Elva,Female,Crighton


In [21]:
assert isinstance(df_employee, pd.DataFrame)


### 2e) Match the employee name with company, age, gender 

Create a function with the following specifications:
- Function name: employee_matcher
- Purpose: to match the employee name with the provided company, age, and gender
- Parameter(s): company (string), age (int), gender (string)
- Returns: The employee first_name and last_name like this: return first_name, last_name 
- Note: If there are multiple employees that fit the same description, first_name and last_name should each be a list of all possible first names and last names, i.e., ['Desmund', 'Kelby'], ['Shepley', 'Tichner']. Note that the names of the individuals that would produce this output are 'Desmund Shepley' and 'Kelby Tichner'.

Hint:
There are many different ways to code this. An inelegant solution is to loop through `df_employee` 
   and for each data item see if the company, age, and gender match
   i.e., 
   ```python
   for i in range(0, len(df_employee)):
             if (company == df_employee.loc[i,'company']):
   ```
   
However, the solution above is slow and verbose, so you should look into the `df.loc` method instead.
Google the `df.loc` method: it extracts the rows of a dataframe
   that fulfill a certain condition.
   i.e., 
   
```python
df_employee.loc[df_employee['company'] == company]
```

If you need to convert your pandas data series into a list, you can do ```list(result)``` where result is a pandas "series"
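
To illustrate how several conditions combine inside `.loc`, here is a small sketch on a made-up dataframe. The `demo_employees` data is hypothetical; only the `&` and `.loc` mechanics carry over to `df_employee`.

```python
import pandas as pd

# Hypothetical employee table for illustration
demo_employees = pd.DataFrame({
    "company": ["acme", "acme", "globex"],
    "age": [30, 30, 41],
    "gender": ["Female", "Female", "Male"],
    "first_name": ["Ana", "Bea", "Cal"],
})

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than ==.
mask = (demo_employees["company"] == "acme") & (demo_employees["age"] == 30)

# Select the matching rows and one column, then convert the Series to a list
matches = list(demo_employees.loc[mask, "first_name"])
print(matches)
```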

You can start with this outline:
```python
def employee_matcher(company, age, gender):
    return first_name, last_name
```


In [22]:
def extract_company(email):
    # Start the search for '.' at the '@' so dots in the username are ignored
    at = email.find('@')
    return email[at + 1 : email.find('.', at)]

def employee_matcher(company, age, gender):
    x = df_employee.loc[(df_employee['company'] == company) & (df_employee['age'] == age) & (df_employee['gender'] == gender)]
    return x.first_name.tolist(), x.last_name.tolist()

employee_matcher("salon", 47, "Female")

(['Elenore'], ['Gravett'])

In [23]:
# match employee name with the provided company, age and gender 

def extract_company(email):
    # Start the search for '.' at the '@' so dots in the username are ignored
    at = email.find('@')
    return email[at + 1 : email.find('.', at)]

def employee_matcher(company, age, gender):
    
    match = df_employee.loc[(df_employee['company'] == company) &
                                 (df_employee['age'] == age) &
                                 (df_employee['gender'] == gender)]
    
    
    return (match.first_name.tolist(), match.last_name.tolist())



employee_matcher("salon", 47, "Female")


(['Elenore'], ['Gravett'])

In [24]:
assert employee_matcher("google", 41, "Male") == (['Maxwell'], ['Jorio'])
assert employee_matcher("salon", 47, "Female") == (['Elenore'], ['Gravett'])


### 2f) Extract all the private data 

- Create 2 empty lists called `first_names` and `last_names`
- Loop through all the people we are trying to identify in df_personal
- Call the `extract_company` function (i.e., `extract_company(df_personal.loc[i, 'email'])`)
- Call the `employee_matcher` function 
- Append the results of `employee_matcher` to the appropriate lists (`first_names` and `last_names`)
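
The loop described above is what the tests expect. As a side note on why this kind of re-identification is so easy, the same linkage can be expressed as a single table join on the shared attributes. The sketch below uses made-up data and hypothetical `demo_` names; unlike the loop, it produces one row per match rather than lists of names, so it is not a drop-in answer for this question.

```python
import pandas as pd

# Hypothetical "anonymized" users and scraped employee records
demo_personal = pd.DataFrame({
    "email": ["x1@acme.com", "y2@globex.com"],
    "age": [30, 41],
    "gender": ["Female", "Male"],
})
demo_staff = pd.DataFrame({
    "company": ["acme", "globex", "globex"],
    "age": [30, 41, 52],
    "gender": ["Female", "Male", "Male"],
    "first_name": ["Ana", "Cal", "Dov"],
    "last_name": ["Reyes", "Moss", "Hale"],
})

# Derive the join key from the email: the text between '@' and the next '.'
demo_personal["company"] = demo_personal["email"].str.extract(r"@([^.]+)\.", expand=False)

# Match on all three attributes at once; a left join keeps every user row
linked = demo_personal.merge(demo_staff, on=["company", "age", "gender"], how="left")
print(linked[["email", "first_name", "last_name"]])
```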




In [25]:
# YOUR CODE HERE
first_names = []
last_names = []

# Loop over the default 0..n-1 row labels without importing extra packages
for index in range(len(df_personal)):
    company = extract_company(df_personal.loc[index, 'email'])
    age = df_personal.loc[index, 'age']
    gender = df_personal.loc[index, 'gender']
    first_name, last_name = employee_matcher(company, age, gender)
    first_names.append(first_name)
    last_names.append(last_name)
    

In [26]:
assert first_names[45:50]== [['Justino'], ['Tadio'], ['Kennith'], ['Cedric'], ['Amargo']]
assert last_names[45:50] == [['Corro'], ['Blackford'], ['Milton'], ['Yggo'], ['Grigor']]


### 2g) Add the names to the original 'secure' dataset! 

We have done this last step for you below, all you need to do is run this cell.

For your own personal enjoyment, you should also print out the new `df_personal` with the identified people. 

In [27]:
df_personal['first_name'] = first_names
df_personal['last_name'] = last_names

In [28]:
df_personal.head()

Unnamed: 0,age,email,gender,first_name,last_name
0,60,gshoreson0@seattletimes.com,Male,[Gordon],[DelaField]
1,47,eweaben1@salon.com,Female,[Elenore],[Gravett]
2,27,akillerby2@gravatar.com,Male,[Abbe],[Stockdale]
3,46,gsainz3@zdnet.com,Male,[Guido],[Comfort]
4,72,bdanilewicz4@4shared.com,Male,[Brody],[Pinckard]


We have now just discovered the 'anonymous' identities of all the registered Tinder users...awkward.

## Part 3: Anonymize Data

You are hopefully now convinced that with some seemingly harmless data a hacker can pretty easily discover the identities of certain users. Thus, we will now clean the original Tinder data ourselves according to the Safe Harbor Method in order to make sure that it has been *properly* cleaned...

### 3a) Load in personal data 

Load the `user_dat.csv` file into a pandas dataframe. Call it `df_users`.

In [29]:
# YOUR CODE HERE
df_users = pd.read_csv('user_dat.csv')

df_users

Unnamed: 0,age,email,first_name,gender,last_name,ip_address,phone,zip
0,34,clilleymanlm@irs.gov,Carly,Female,Duckels,229.46.197.198,(445)515-0719,70397
1,87,parnecke9a@furl.net,Prisca,,Le Friec,60.255.20.98,(962)747-5149,71965
2,60,ldankersley7j@mysql.com,Lauree,Female,Meineking,65.148.56.18,(221)690-1264,47946
3,47,kcattrollma@msn.com,Karoly,,Hoyles,207.40.101.214,(203)282-1167,29063
4,85,rchestney60@dailymotion.com,Rona,Female,St. Quentin,177.12.128.156,(703)482-9159,68872
5,83,hfranzkebc@dion.ne.jp,Hall,Male,Belsham,102.156.192.168,(787)475-2094,94923
6,57,qspurdonmk@ezinearticles.com,Quent,Male,Alejandro,148.146.2.237,(544)622-2751,75935
7,45,mteek2m@barnesandnoble.com,Morie,Male,Fassam,128.192.0.151,(784)435-3147,23967
8,83,lcutforddj@drupal.org,Leo,Male,Hattersley,7.200.26.247,(265)826-2030,55734
9,50,arearie4q@liveinternet.ru,Alexandro,Male,Elion,75.121.184.44,(571)282-8078,78343


In [30]:
assert isinstance(df_users, pd.DataFrame)


### 3b) Drop personal attributes 

Remove any personal information, following the Safe Harbour method.
Based on the Safe Harbour method, remove any columns from `df_users` that contain personal information. 

Note that details on the Safe Harbour method are covered in the Tutorials.



In [31]:
# YOUR CODE HERE
# Drop every column that directly identifies a person
df_users = df_users.drop(columns=['email', 'first_name', 'last_name', 'ip_address', 'phone'])


df_users


Unnamed: 0,age,gender,zip
0,34,Female,70397
1,87,,71965
2,60,Female,47946
3,47,,29063
4,85,Female,68872
5,83,Male,94923
6,57,Male,75935
7,45,Male,23967
8,83,Male,55734
9,50,Male,78343


In [32]:
assert len(df_users.columns) == 3


### 3c) Drop ages that are above 90 

Safe Harbour rule C: Drop all the rows which have age greater than 90 from `df_users`.

In [33]:
# YOUR CODE HERE
df_users = df_users.drop(df_users[df_users.age > 90].index)


df_users

Unnamed: 0,age,gender,zip
0,34,Female,70397
1,87,,71965
2,60,Female,47946
3,47,,29063
4,85,Female,68872
5,83,Male,94923
6,57,Male,75935
7,45,Male,23967
8,83,Male,55734
9,50,Male,78343


In [34]:
assert df_users.shape == (943, 3)


### 3d) Load in zip code data 

Load the zip_pop.csv file into a (different) pandas dataframe. Call it `df_zip`.

Note that the zip data should be read in as strings, not ints, as would be the default. 

In read_csv, use the parameter `dtype` to specify to read `zip` as str, and `population` as int.
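
The reason for reading `zip` as a string is that zip codes can start with zeros, which are lost when the column is parsed as integers. A small sketch, using an in-memory CSV built from the first two rows shown in the output below as a stand-in for the real file:

```python
import io
import pandas as pd

# Two-row stand-in for zip_pop.csv
csv_text = "zip,population\n01001,16769\n01002,29049\n"

# Default parsing infers integers and drops the leading zeros
as_default = pd.read_csv(io.StringIO(csv_text))

# An explicit dtype keeps the zip codes as 5-character strings
as_strings = pd.read_csv(io.StringIO(csv_text), dtype={"zip": str, "population": int})

print(as_default["zip"].tolist())
print(as_strings["zip"].tolist())
```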

In [35]:
# YOUR CODE HERE
df_zip = pd.read_csv('zip_pop.csv', dtype = {'zip': str, 'population': int})
df_zip

Unnamed: 0,zip,population
0,01001,16769
1,01002,29049
2,01003,10372
3,01005,5079
4,01007,14649
5,01008,1263
6,01009,741
7,01010,3609
8,01011,1370
9,01012,661


In [36]:
assert isinstance(df_zip, pd.DataFrame)


### 3e) Sort zipcodes into "Geographic Subdivision" 

The Safe Harbour Method applies to "Geographic Subdivisions" as opposed to each zipcode itself. 

Geographic Subdivision: All areas which share the first 3 digits of a zip code

Count the total population for each geographic subdivision

Warning: you have to be savvy with a dictionary here

To understand how a dictionary works, check the section materials, use google and go to discussion sections!

Instructions: 
- Create an empty dictionary: ```zip_dict = {}```
- Loop through all the zip_codes in df_zip
- Create a dictionary key for the first 3 digits of a zip_code in zip_dict
- Continually add population counts to the key that contains the 
    same first 3 digits of the zip code

To extract the population you will find this code useful:

```python
population = list(df_zip.loc[df_zip['zip'] == zip_code]['population'])
```

To extract the first 3 digits of a zip_code you will find this code useful:
```python
int(str(zip_code)[:3])
```
**Note**: this code may take some time (many seconds, up to a minute or two) to run
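
The accumulation pattern in the instructions can be tried on a few made-up values before running it on the full data. In this sketch, `pairs` and `subdivision_pop` are hypothetical names and the populations are invented; it keeps a running sum per key with `dict.get` rather than collecting lists.

```python
# Hypothetical (zip, population) pairs; the real values come from df_zip
pairs = [("01001", 100), ("01002", 250), ("90210", 400)]

subdivision_pop = {}
for zip_code, population in pairs:
    # First 3 digits as an int, so the prefix "010" becomes the key 10
    key = int(str(zip_code)[:3])
    # dict.get supplies 0 the first time a prefix is seen
    subdivision_pop[key] = subdivision_pop.get(key, 0) + population

print(subdivision_pop)
```

Note that converting the prefix to `int` drops its leading zeros, which is why zip codes beginning with `010` end up under the key `10`.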



In [37]:
# YOUR CODE HERE

zip_dict = {}

for zip_code in df_zip['zip']:
    key = int(str(zip_code)[:3])
    population = list(df_zip.loc[df_zip['zip'] == zip_code]['population'])
    
    if key in zip_dict:
        zip_dict[key] += population
    else:
        zip_dict[key] = population
    

In [38]:
# Add up all the population values that we got from above 
for key in zip_dict.keys():
    zip_dict[key] = sum(zip_dict[key])

In [39]:
assert isinstance(zip_dict, dict)
assert zip_dict[100] == 1502501


### 3f) Explain this code excerpt 

```python
# In the cell below, explain in words what the following line of code is doing:
population = list(df_zip.loc[df_zip['zip'] == zip_code]['population'])
```

Note: you do not have to *use* this line of code at this point in the assignment.

It is one of the lines provided to you in 3e. Here, just write a quick comment on what it does. This question will not be graded, but it's important to be able to read other people's code.

YOUR ANSWER HERE: This line selects the rows of `df_zip` whose `zip` value equals `zip_code`, takes the `population` column of those rows, and converts the result to a Python list. 

### 3g) Masking the Zip Codes 

In this part, you should write a for loop, updating the df_users dataframe.

Go through each user, and update their zip-code, to Safe Harbour specifications:

- If the user is from a zip code for which the "Geographic Subdivision" population is less than or equal to 20000, change the zip code to 0 
- Otherwise, change the zip code to be only the first 3 numbers of the full zip code
- Do all this by rewriting the `zip` column of the `df_users` DataFrame

Hints: This will be several lines of code, looping through the DataFrame, getting each zip code, checking the geographic subdivision with the population in `zip_dict`, and setting the `zip` value accordingly. 
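
The masking rule for a single zip code can be sketched as a small function. Everything here is an illustration under stated assumptions: `mask_zip`, `demo_pop`, and the `threshold` parameter are hypothetical names, the populations are made up, the `zfill` padding is an extra step the assignment does not ask for, and treating an unknown prefix as population 0 is a choice made only for this sketch.

```python
def mask_zip(zip_code, subdivision_pop, threshold=20000):
    """Return the 3-digit prefix of zip_code, or 0 for small subdivisions."""
    # Assumption: zfill restores leading zeros lost if the zip was stored as an int
    key = int(str(zip_code).zfill(5)[:3])
    # Assumption: a prefix missing from the dictionary counts as population 0
    if subdivision_pop.get(key, 0) <= threshold:
        return 0
    return key

# Hypothetical subdivision populations
demo_pop = {703: 150000, 36: 12000}
masked = [mask_zip(z, demo_pop) for z in (70397, "03601", 3601)]
print(masked)
```

Computing every masked value from the original zip codes first, and only then writing them back to the column, avoids comparing against values that were already masked earlier in the loop.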

In [40]:

# YOUR CODE HERE
for user_zip in df_users.zip:
    key = int(str(user_zip)[:3])
    
    if zip_dict[key] <= 20000:
        df_users.loc[df_users[df_users.zip == user_zip].index, 'zip'] = 0
    else:
        df_users.loc[df_users[df_users.zip == user_zip].index, 'zip'] = key
        
    
df_users     

Unnamed: 0,age,gender,zip
0,34,Female,703
1,87,,719
2,60,Female,479
3,47,,290
4,85,Female,688
5,83,Male,949
6,57,Male,759
7,45,Male,239
8,83,Male,557
9,50,Male,783


In [41]:
assert len(df_users) == 943
assert sum(df_users.zip == 0) == 5 or sum(df_users.zip == 0) == 6
assert df_users.loc[671, 'zip'] == 285


### 3h) Save out the properly anonymized data to json file 

Save out df_users as a json file, called `real_anon_user_dat.json`

In [42]:
# YOUR CODE HERE
df_users.to_json('real_anon_user_dat.json')

In [43]:
assert isinstance(pd.read_json('real_anon_user_dat.json'), pd.DataFrame)

Congrats, you're done! The users' identities are much more protected now.

## Re-start & run all cells to be sure that everything passes, validate, and submit on DataHub!