# Lab 2: EEP 595


## Introduction to Privacy Engineering, Spring 2022


## Data Mining

##### Installation:

Same as Lab 1. <br>No additional packages are required.

##### This lab consists of 7 questions.

We will be using the 'Strava cycling data of multiple segments in Jeddah' dataset for this lab. 

##### Grading rubric: for each question,
100% of the points - Correct code, correct output<br>
50% of the points - Minor logical error, partial output<br>
0% of the points - No attempt, incomplete code, wrong output

##### Submission instructions

Submit the completed Jupyter notebook file (.ipynb) on Canvas.

### Note: The last cell of this .ipynb is a markdown cell for your answers. <br>Please fill in this cell to complete your lab submission. 

The coding cells can be used for computations or observations. <br>


***
***

##### Dataset

##### Strava cycling data of multiple segments in Jeddah, Saudi Arabia

Strava is a social platform for cyclists and runners to share their activity and form groups.

A Strava Segment is a fixed stretch of a route that anybody can attempt while recording an activity. Each attempt is placed on a leaderboard that shows who is the fastest.

Features:

- user_age_group: The age group of the participant
- user_id: The User ID of the participant
- attempt_date: The date of attempt of the entry in the leaderboard
- gender: The gender of the participant
- smt_rank: The participant's rank in the leaderboard
- smt_avg_spd: The participant's average speed in km/h within the segment
- smt_finish_seconds: The time taken for the participant to complete the segment in seconds
- smt_name: The name of the segment
- user_weight_category: The weight category of the participant
- act_title: The title of the activity which included the segment attempt
- act_avg_spd: The participant's average speed in km/h during the activity
- act_max_spd: The participant's maximum speed in km/h during the activity
- act_total_km: The total distance of the participant's activity in kilometers
- act_moving_seconds: The total time which the participant spent moving during the activity in seconds
- act_total_seconds: The total time of the activity (including stop times) in seconds
- has_hr_data: Whether heart rate data was recorded for the participant during the segment attempt

https://www.kaggle.com/baghlafturki/strava-jeddah-segments-leaderboard

This dataset is available in your HW zip file. 


***



#### Loading the file

In [None]:
# Loading libraries 

import numpy as np
import math
import pandas as pd

In [None]:
# Reading the data file
# Each row represents a user activity entry 

df = pd.read_csv("jeddah_strava_segments.csv")
df.head(5)

Each row of this dataframe represents a user activity entry. <br>
Each column of this dataframe represents a feature associated with the user entry. <br><br>

In [None]:
# Printing the rows and columns

print(f"The dataframe contains {df.shape[0]} rows.")
print(f"The dataframe contains {df.shape[1]} columns.")

In [None]:
# Printing the names of the columns

for col in df.columns: 
    print(col) 

The descriptions of these features can be found in the first markdown cell (at the top) of this notebook.
<br><br>
Among these columns, let's look at the user age group.

In [None]:
# The filter function is used here to extract a subset of columns from the dataset.

df.filter(['user_age_group'])

What are the user age groups present in this dataset? <br><br>
We can use Pandas' "unique" function to list each distinct value in a column once. <br>

In [None]:
# Printing the unique values 
for grp in df.user_age_group.unique():
    print(grp)

How many user entries are present per age group? 
 
<br>Let's find out the number of user entries per age group.
<br> Since each row represents a user entry, <br>
We want to count the number of rows for each of the 'user_age_group' categories. 

<br> Using Pandas,
<br> We can do this by grouping the data based on the column values in 'user_age_group' <br> and counting the number of rows for each group.
<br> The functions used are 'groupby' and 'count'.

In [None]:
# Using grouping and counting to find the number of user entries per age group.

df.groupby('user_age_group').count()


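We can observe that counts have been computed for each category. Note that 'count' reports the number of non-null values in each column, which equals the row count only for columns without missing values.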
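
A minimal sketch of the difference between counting values and counting rows, assuming a small toy dataframe (the column names `group` and `score` are hypothetical and not part of the lab dataset). It contrasts `count` with `size`, which reports the number of rows per group regardless of missing values:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical, not from the lab dataset); one 'score' value is missing.
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "score": [1.0, np.nan, 3.0, 4.0, 5.0],
})

# count() reports non-null values per remaining column, so group 'a' shows 1.
non_null_counts = toy.groupby("group").count()

# size() reports the number of rows per group, so group 'a' shows 2.
row_counts = toy.groupby("group").size()

print(non_null_counts)
print(row_counts)
```

When a column has no missing values, both approaches give the same numbers for that column.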

#### Question 1 ( 3 points )

##### Which named segment has the fewest user activity entries?

Named segments are the named path segments for the cycling/activity routes.
<br> Identify the column that contains this information
<br> Use this column to find out the number (count) of user activity entries (rows) for each named segment. 
<br> From these counts, find out which named segment has the least count.

In [None]:
## You can use this cell for Question 1
## YOUR CODE HERE:



#### Question 2 ( 3 points )

##### Which user weight category has the most user activity entries?
<br> Similar to Q.1

In [None]:
## You can use this cell for Question 2
## YOUR CODE HERE:



#### Question 3 ( 3 points )

##### Which numerical feature has the highest magnitude of correlation with 'smt_rank' ?
'smt_rank' refers to the participant's rank in the leaderboard

The correlation of a variable with itself is always 1. <br>
We are looking for a numerical feature other than 'smt_rank' itself. 
<br> For this question, the correlation matrix is provided to you as a dataframe.

In [None]:
#Standard correlation for the numerical features in this dataframe.

df_corr = df.select_dtypes(['number']).corr(method='pearson')
df_corr 
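
"Highest magnitude" means the largest absolute value, so a strongly negative correlation can outrank a positive one. A minimal sketch on toy data (the columns `x`, `y`, `z` are hypothetical and unrelated to the lab dataset) of ranking correlations by magnitude while leaving out the self-correlation:

```python
import pandas as pd

# Toy data (hypothetical): 'y' falls as 'x' rises; 'z' is loosely related to 'x'.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [10, 8, 6, 4, 2],
    "z": [1, 3, 2, 5, 4],
})

# One column of the Pearson correlation matrix: every feature against 'x'.
corr_with_x = toy.corr(method="pearson")["x"]

# Drop the self-correlation (always 1), then rank by absolute value.
magnitudes = corr_with_x.drop("x").abs()
strongest = magnitudes.idxmax()

print(strongest, corr_with_x[strongest])
```

In this toy case the negatively correlated column `y` is selected, even though the correlation of `z` with `x` is the larger signed value.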

In [None]:
## You can use this cell for Question 3
## YOUR CODE HERE:



#### Question 4 ( 5 points)

The unique identifiers ( values with frequency = 1 ) present in the columns/features of this dataset can be used to deterministically identify a user segment entry or a user ID. <br>The unique identifiers can result in de-anonymization of the data entry or the user. 

##### Among the 16 features/columns of this dataset, which feature(s) have the highest number of unique identifiers associated with the row (user entry) ?
##### How many unique identifiers do these column(s) have?

To find the number of unique identifiers,
<br> For each column,
<br> We identify the values that have frequency (row-count) = 1
<br> These are the unique identifiers present in that column
<br> Count these unique identifiers for each column. 
<br> Once this is done, we have the number of unique identifiers for each column.
<br> The column(s) with the highest number of unique identifiers can now be determined.
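
The steps above can be sketched for a single column as follows. This is only an illustration on a toy Series with made-up values, not the lab dataset, and other approaches are possible:

```python
import pandas as pd

# Toy column (hypothetical values): 'b' and 'd' each occur exactly once.
values = pd.Series(["a", "a", "b", "c", "c", "c", "d"])

# Frequency (row count) of each distinct value.
freq = values.value_counts()

# Values with frequency 1 are the unique identifiers in the sense used above.
n_unique_identifiers = int((freq == 1).sum())

print(n_unique_identifiers)
```

Note that this is not the same as `values.nunique()`, which counts every distinct value, including those that repeat.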
 


In [None]:
## You can use this cell for Question 4
## YOUR CODE HERE:

    

#### Question 5 ( 3 points )

Binning or discretization can reduce the risk of de-anonymization in a dataset. 

##### If the 'act_moving_seconds' is replaced with the nearest integer-valued minute, <br> what is the modified number of unique identifiers present in this column? 
Steps:
- replace 'act_moving_seconds' with the nearest integer-valued minute.
- find the new/modified number of unique identifiers for this column ( same steps as Q.4 )

<br> To replace the values in a column, take a look at the following example:

In [None]:
## Example for replacing values in a column
## In this example, we are replacing the user_ID with 42 x user_ID

## We don't want to disturb the original dataframe.
df_example = df.copy()

print(df_example.filter(['user_id']).head(5))
print("\n \n Example: replacing the user_ID with 42 x user_ID  \n \n")

## Pay attention to this: 
df_example['user_id'] = df_example.apply (lambda row: row['user_id']*42 , axis=1)

print(df_example.filter(['user_id']).head(5))
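
The row-wise `apply` in the example above is easy to adapt to other transformations. For simple arithmetic, the same replacement can also be written as an operation on the whole column. A minimal sketch, assuming a toy dataframe with made-up values in place of the lab data:

```python
import pandas as pd

# Toy frame (hypothetical values), standing in for a copy of the lab dataframe.
toy = pd.DataFrame({"user_id": [1, 2, 3]})

# Row-wise version, as in the example above: the lambda sees one row at a time.
row_wise = toy.apply(lambda row: row["user_id"] * 42, axis=1)

# Column-wise version: arithmetic on a Series applies to every element.
column_wise = toy["user_id"] * 42

# Both versions produce the same values.
print(row_wise.tolist() == column_wise.tolist())
```

Either form can be assigned back to the column; the column-wise form is usually faster on large dataframes.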

In [None]:
## You can use this cell for Question 5
## YOUR CODE HERE:



You will notice that for this column ( 'act_moving_seconds' ), the modified number of unique identifiers is significantly lower than the number of unique identifiers before modification.

#### Question 6 ( 4 points )

Consider the information present in these 3 columns : <br>'user_id', 'user_age_group' and 'gender'


In [None]:
df6 = df.filter(['user_id','user_age_group','gender'])
df6.head(10)

Given only this data subset, among the different 'gender'-'user-age-group' pairings in this subset,

#####  Which pairing poses the highest risk of de-anonymization for the associated users and their 'user_id'?
##### Which pairing poses the least risk?

Pairing: refers to all the possible 'gender'-'user-age-group' pairs in this dataset. 
<br> Hint: Pandas grouping can be done using more than 1 column. <br>
<br> Risk: For this specific question, building from (Q.4), risk refers to the amount of data that can be de-anonymized using unique identifiers. <br>
<br> If a pairing has more unique identifiers, the users associated with that pairing have higher risk of being de-anonymized.
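
As a sketch of the hint about grouping on more than one column, the example below uses a toy dataframe (the column names `team`, `level`, and `member_id` are hypothetical and not part of the lab dataset). It only shows how pairs are formed and counted; how to measure risk for each pairing is left to the definition given above:

```python
import pandas as pd

# Toy data (hypothetical column names and values, not the lab dataset).
toy = pd.DataFrame({
    "team": ["red", "red", "red", "blue", "blue"],
    "level": ["junior", "junior", "senior", "junior", "senior"],
    "member_id": [101, 102, 103, 104, 105],
})

# Passing a list of columns groups rows by each observed (team, level) pair.
pair_sizes = toy.groupby(["team", "level"]).size()

print(pair_sizes)
```

The result is indexed by both columns, so a single pairing can be looked up with a tuple such as `pair_sizes[("red", "junior")]`.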

In [None]:
## You can use this cell for Question 6
## YOUR CODE HERE:



#### Question 7 ( 4 points )

For the dataframe ( 'df6' containing 3 features )  in Question 6, 


##### Recommend a method to ensure that all the users associated with the different 'gender'-'user-age-group' pairings are equally protected. 
( expected: 2-3 sentences ) 

##### Does your recommendation contribute to any additional risk of de-anonymization?   
( expected: 2-3 sentences)

<br> Hint: Observing the counts of unique identifiers associated with the pairings (Q.6) can be helpful. 

Enter your answer in the 'Submitted answers' cell.

***

( Double click the cell below to edit it )

***

# Submitted answers:


* Q.1 : [3] Least number of user activity entries - <b></b>


* Q.2 : [3] Weight category with most entries - <b></b>


* Q.3 : [3] Feature with highest magnitude of correlation - <b></b>


* Q.4 : 
   * (a) [3] Column names -  <b></b>

   * (b) [2] No. of unique identifiers - <b></b>


* Q.5 : [3] Modified number of unique identifiers - <b></b>


* Q.6 : 
   * (a) [2] Pairing with highest risk -  <b></b>

   * (b) [2] Pairing with lowest risk - <b></b>


* Q.7 :
   * (a) [2] Recommendation -  <b></b>

   * (b) [2] Additional risk - <b></b>
