## Assignment #2-3: Anonymisation
- Dataset: CrossFit [Dataset](https://data.world/bgadoci/crossfit-data) (only the athletes file is used in this assignment)
- Credits: Dataset was put together by Sam Swift
- ToDo: To run the Jupyter notebook, the packages listed in requirements.txt must be installed (`pip install -r requirements.txt`)

In [1]:
import pandas as pd

# Read csv as dataframe
df = pd.read_csv("athletes.csv", low_memory=False)

## First Step: Revisit the data set to remind ourselves what we are working with
- For a better understanding of the structure of the dataset, we display the attribute values
    - What columns does the dataset contain and in what format are the attribute values?
        - To answer this, each column is printed together with its first value that is not empty or null
- We've already worked with this dataset, so we won't go into detail

In [2]:
def get_first_non_empty_value(df_column):
    """Return the first non-null value of a column, or None if the column is entirely empty."""
    non_null = df_column.dropna()
    return non_null.iloc[0] if not non_null.empty else None

# Iterate over each column and print one example value
for column in df.columns:
    first_value = get_first_non_empty_value(df[column])
    print(f"Column: '{column}', Example Data: {first_value}")

Column: 'athlete_id', Example Data: 2554.0
Column: 'name', Example Data: Pj Ablang
Column: 'region', Example Data: South West
Column: 'team', Example Data: Double Edge
Column: 'affiliate', Example Data: Double Edge CrossFit
Column: 'gender', Example Data: Male
Column: 'age', Example Data: 24.0
Column: 'height', Example Data: 70.0
Column: 'weight', Example Data: 166.0
Column: 'fran', Example Data: 211.0
Column: 'helen', Example Data: 645.0
Column: 'grace', Example Data: 300.0
Column: 'filthy50', Example Data: 1053.0
Column: 'fgonebad', Example Data: 0.0
Column: 'run400', Example Data: 61.0
Column: 'run5k', Example Data: 1081.0
Column: 'candj', Example Data: 220.0
Column: 'snatch', Example Data: 200.0
Column: 'deadlift', Example Data: 400.0
Column: 'backsq', Example Data: 305.0
Column: 'pullups', Example Data: 25.0
Column: 'eat', Example Data: I eat 1-3 full cheat meals per week|
Column: 'train', Example Data: I workout mostly at a CrossFit Affiliate|I have a coach who determines my prog

## 3.1 Anonymisation: Bare Bones – 10 marks
To make the proposed algorithm concrete, we apply it directly to the given dataset further down in this file and explain each step alongside the code.
The goal of k-anonymity is to modify a dataset such that any given record cannot be distinguished from at least k−1 other records with respect to certain "quasi-identifier" attributes.
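
The definition above can be checked mechanically. The following is a minimal sketch, assuming a pandas DataFrame without missing values; the helper `is_k_anonymous` and the toy records are hypothetical and not part of the assignment code.

```python
import pandas as pd

def is_k_anonymous(frame, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs at least k times."""
    group_sizes = frame.groupby(quasi_identifiers).size()
    return bool(group_sizes.min() >= k)

# Hypothetical toy records, not taken from athletes.csv
toy = pd.DataFrame({
    "gender": ["Male", "Male", "Female", "Female", "Female"],
    "age":    ["0-30", "0-30", "31-60", "31-60", "0-30"],
})

# The last record is the only ("Female", "0-30") record, so k=2 is violated
print(is_k_anonymous(toy, ["gender", "age"], k=2))
# Without that record, every combination occurs twice
print(is_k_anonymous(toy.iloc[:4], ["gender", "age"], k=2))
```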

### 3.1.1 Our Algorithm Steps: 
1. Identify the direct identifier attributes in the data set.
2. Identify the quasi-identifier attributes in the dataset.
3. Apply k-anonymity: Choose a value for k (the minimum size of each group of indistinguishable records)
   - The smaller k, the weaker the anonymity, but the lower the information loss
4. Use Generalization, Aggregation and Suppression as the transformation methods to transform each of the quasi-identifiers
   - Start by aggregating numeric attributes (these are well suited to aggregation)
   - Proceed by suppressing values that occur very rarely (these might identify an individual directly)
   - Map categorical attributes to broader categories (to reduce the number of unique values per column)
   - Goal in this step: Get the number of unique values per column as low as possible without losing too much information
5. Ensure that there are at least k records for each combination of quasi-identifiers
   - If not, drop the records whose combination of quasi-identifier values occurs fewer than k times
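
The five steps can be sketched end to end on a tiny example. This is only an illustration under stated assumptions: the toy records, the column choices and the bin edges are hypothetical, and the real notebook cells further down differ in detail.

```python
import pandas as pd

# Hypothetical toy records, not taken from athletes.csv
toy = pd.DataFrame({
    "name":   ["A", "B", "C", "D", "E"],
    "gender": ["Male", "Male", "Male", "Female", "Female"],
    "age":    [24, 29, 45, 33, 38],
})

direct_identifiers = ["name"]          # step 1
quasi_identifiers = ["gender", "age"]  # step 2
k = 2                                  # step 3

# Step 1: remove the direct identifiers
anon = toy.drop(columns=direct_identifiers)

# Step 4: generalise the numeric age into intervals
anon["age"] = pd.cut(anon["age"], bins=[0, 30, 60], labels=["0-30", "31-60"]).astype(str)

# Step 5: keep only records whose combination occurs at least k times
group_size = anon.groupby(quasi_identifiers)["age"].transform("size")
anon = anon[group_size >= k]

print(anon)
```

In this toy example the single ("Male", "31-60") record is dropped, which is exactly the information loss discussed in the next section.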

### 3.1.2 Discussion of pros and cons of the algorithm with respect to the dataset
1. Pros
   - The primary advantage of applying k-anonymity is the significant enhancement of privacy, which is the main goal: individual athletes cannot easily be identified based on quasi-identifiers. Because attributes are generalized and categorized manually rather than automatically, the dataset has the potential to retain much of its information.
2. Cons
   - The major drawback is the loss of information. In the pursuit of anonymity, detailed data is generalized or suppressed, which can lead to the loss of potentially valuable insights. A lot of information might be lost especially in the last step, where rows not meeting the k-criterion are removed. Also, suppressing rarely occurring data might introduce biases, as certain groups or unusual data points might be disproportionately suppressed.

## 3.2.1 Applying the algorithm to our dataset

#### 1. Identify direct identifier attributes
- By inspecting the columns and their data formats, we can identify several attributes that may contain explicit personally identifiable information:
    - `athlete_id`
        - This depends on how the id is used. Considerations to take into account:
            - Is the `athlete_id` only an internal id of this dataset, or does it refer to an official id?
            - Are there other datasets from a similar source that may use the same `athlete_id`?
    - `name`
        - The name identifies an individual directly
    - `team`
        - Depending on the size of the team, this could identify a specific athlete
    - `affiliate`
        - Depending on the affiliate and the number of contracted athletes, this could identify an individual
    - All stats of the athletes
        - If an athlete has remarkable stats (maybe even a world record in a category), this could identify the individual
    - `train`
        - If an athlete has a distinctive, well-known training routine, this could identify them
    - `background`
        - If an athlete has a well-known background or mentions names, this could identify them
    - `experience`
        - If an athlete mentions concrete information about their experience (e.g. name of current coach), this could identify them

-> As can be seen, all columns could potentially contain outliers which could then be used to identify an individual.
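
As a rough aid to the manual inspection above, one could compute how close each column is to being unique per record and drop the flagged columns before release. This is a sketch only: the helper `uniqueness_ratio`, the toy records and the threshold of 0.9 are assumptions, and a high ratio is a hint rather than proof that a column is a direct identifier.

```python
import pandas as pd

# Hypothetical toy records, not taken from athletes.csv
toy = pd.DataFrame({
    "athlete_id": [1, 2, 3, 4],
    "name": ["A. One", "B. Two", "C. Three", "D. Four"],
    "gender": ["Male", "Male", "Female", "Female"],
})

def uniqueness_ratio(frame):
    """Share of distinct non-null values per column (1.0 means every value is unique)."""
    return frame.nunique() / frame.count()

ratios = uniqueness_ratio(toy)

# Assumed threshold: columns that are (almost) unique per record are flagged
threshold = 0.9
direct_identifier_candidates = ratios[ratios >= threshold].index.tolist()

# Dropping the flagged columns removes them from the released data
without_identifiers = toy.drop(columns=direct_identifier_candidates)
print(direct_identifier_candidates)
print(without_identifiers.columns.tolist())
```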

#### 2. Identify the quasi-identifier attributes in the dataset
- In this step, we use the following script to search for any attributes that qualify as quasi-identifiers and were not flagged as PII in the previous step.

In [3]:
# identify potential quasi-identifiers
def identify_quasi_identifiers(dataframe, sensitive_columns):
    quasi_identifiers = []
    for column in dataframe.columns:
        # Skip sensitive attributes
        if column in sensitive_columns:
            continue
        
        unique_count = dataframe[column].nunique()
        # Assume a column could be a quasi-identifier if it has more than one
        # distinct value but is not unique for each record.
        if 1 < unique_count < len(dataframe):
            quasi_identifiers.append(column)
    
    return quasi_identifiers

# column names we know are PII
sensitive_columns = ['athlete_id', 'name', 'team', 'affiliate', 'train', 'background', 'experience', 'fran', 'helen',  'grace', 'filthy50', 'fgonebad', 'run400', 'run5k', 'candj', 'snatch', 'deadlift', 'backsq', 'pullups']

# Identify potential quasi-identifiers
potential_quasi_identifiers = identify_quasi_identifiers(df, sensitive_columns)
print("Potential Quasi-Identifiers:", potential_quasi_identifiers)

Potential Quasi-Identifiers: ['region', 'gender', 'age', 'height', 'weight', 'eat', 'schedule', 'howlong', 'retrieved_datetime']


#### 3. Apply k-anonymity: Choose a value for k
- In this step we choose a value for k.
- For example, if we choose k = 3, then each combination of quasi-identifier values must apply to at least three records in the given dataset
- A higher k value strengthens privacy by making re-identification more difficult, but it also reduces the utility of the data by increasing information loss.
- The choice of k thus represents a trade-off between privacy and utility that must be considered in the context of how the data will be used.
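
To make the trade-off tangible, the number of records that would have to be dropped can be computed for several candidate values of k. A minimal sketch on hypothetical toy data; the helper `rows_lost` is not part of the assignment code.

```python
import pandas as pd

def rows_lost(frame, quasi_identifiers, k):
    """Count the records that sit in groups smaller than k."""
    sizes = frame.groupby(quasi_identifiers).size()
    return int(sizes[sizes < k].sum())

# Hypothetical toy records with group sizes 5, 3 and 1
toy = pd.DataFrame({
    "gender": ["Male"] * 5 + ["Female"] * 3 + ["Male"],
    "age":    ["0-30"] * 5 + ["31-60"] * 4,
})

loss_per_k = {k: rows_lost(toy, ["gender", "age"], k) for k in (2, 4, 6)}
for k, lost in loss_per_k.items():
    print(f"k={k}: {lost} of {len(toy)} records would be dropped")
```

On this toy data the loss grows from one record at k=2 to all nine records at k=6, which illustrates why k cannot be chosen independently of the group sizes in the data.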

#### 4. Use Generalization, Aggregation and Suppression as the transformation methods
- Some data exploration was already done in Assignment 1, so the intervals and replacement values for several attributes can be reused. Some additional exploration is needed on top of that.
- The attribute 'retrieved_datetime' can be removed, since all of its entries are empty. We also standardize empty values.
- The quasi-identifiers 'age', 'height' and 'weight' will be anonymized using aggregation
- Every attribute value occurring fewer than 100 times in the entire dataset will be suppressed (100 is the threshold used in the suppression code)
- The 'region' attribute will be mapped to broader categories ('North America', 'International', 'Other')
- The 'schedule' attribute will be mapped to meaningful strings
- The 'eat' attribute will be mapped to 4 different categories
- The 'howlong' attribute will be mapped to 'Novice' and 'Experienced'

In [4]:
#drop the 'retrieved_datetime' column
df = df.drop(columns=['retrieved_datetime'])

In [5]:
# Iterate over all columns in the DataFrame
for column in df.columns:
    # Replace empty strings with 'NA' in the column
    df[column] = df[column].replace({'': 'NA'})

#### Now, the aggregation
- Using aggregation for attributes like 'age', 'weight' and 'height' makes sense in this context.
- It provides a more concise representation of the data distribution and helps to achieve k-anonymity.

In [6]:
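# Aggregate age, height and weight into intervals.
# pd.cut uses right-closed bins by default, e.g. (0, 30], (30, 60], (60, 100].
# Note: values above the last bin edge (e.g. a weight above 220) fall outside
# every bin and become NaN, even though the last label reads like an open-ended range.
bins_age = [0, 30, 60, 100]
labels_age = ['0-30', '31-60', '61+']

bins_height = [0, 40, 70, 90]
labels_height = ['0-40', '41-70', '71+']

bins_weight = [0, 169, 199, 220]
labels_weight = ['0-169', '170-199', '200+']

# Apply binning
df['age'] = pd.cut(df['age'], bins=bins_age, labels=labels_age)
df['height'] = pd.cut(df['height'], bins=bins_height, labels=labels_height)
df['weight'] = pd.cut(df['weight'], bins=bins_weight, labels=labels_weight)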
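
A possible variation, not what the cell above does: if the last label is meant to be open-ended (such as '61+'), the last bin edge can be set to infinity so that large values are not turned into NaN. A minimal sketch with hypothetical ages:

```python
import pandas as pd

# Hypothetical ages, not taken from athletes.csv
ages = pd.Series([24, 30, 31, 75, 104])
labels = ["0-30", "31-60", "61+"]

# Capped last edge: 104 lies outside (60, 100] and becomes NaN
capped = pd.cut(ages, bins=[0, 30, 60, 100], labels=labels)

# Open-ended last edge: 104 falls into (60, inf] and is labelled '61+'
open_ended = pd.cut(ages, bins=[0, 30, 60, float("inf")], labels=labels)

print(capped.tolist())
print(open_ended.tolist())
```

Whether out-of-range values should be kept in the top interval or treated as missing is a design decision; the open-ended variant keeps more records but also keeps implausible entries.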

#### Now, the suppression
- By suppressing rare values (here: removing the rows that contain them), we reduce the risk of someone being able to link the data back to a specific individual

In [7]:
# List of columns to apply the suppression
columns_to_suppress = ['region', 'gender', 'age', 'height', 'weight', 'eat', 'schedule', 'howlong']

for column in columns_to_suppress:
    # Counting the frequency of each unique value in the column
    value_counts = df[column].value_counts()

    # Identifying values that occur fewer than 100 times
    values_to_remove = value_counts[value_counts < 100].index

    # Removing rows with these values
    df = df[~df[column].isin(values_to_remove)]
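
The cell above suppresses by removing whole rows. An alternative, shown here only as a sketch, is to suppress at the value level by replacing rare values with a placeholder so that the rest of the record is kept. The helper `suppress_rare`, the placeholder `"*"` and the toy values are assumptions.

```python
import pandas as pd

def suppress_rare(column, min_count, placeholder="*"):
    """Replace values that occur fewer than min_count times with a placeholder."""
    counts = column.value_counts()
    rare = counts[counts < min_count].index
    # Keep the value where it is not rare, otherwise use the placeholder
    return column.where(~column.isin(rare), placeholder)

# Hypothetical toy values, not taken from athletes.csv
regions = pd.Series(["Europe", "Europe", "Europe", "Asia", "Africa"])
suppressed = suppress_rare(regions, min_count=2)
print(suppressed.tolist())
```

Value-level suppression loses less data per record, but the placeholder group itself still has to contain at least k records to be safe.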


#### Now, the mapping: 
- The goal is to create broader categories that encapsulate the essence of the individual schedules without being overly specific.


In [8]:
unique_regions = df['region'].unique()
print("Unique regions:", unique_regions)

Unique regions: ['South West' nan 'Southern California' 'South Central' 'Central East'
 'Europe' 'North East' 'Africa' 'South East' 'Australia'
 'Northern California' 'Latin America' 'Canada East' 'North Central'
 'North West' 'Mid Atlantic' 'Canada West' 'Asia']


In [9]:
# mapping of regions 
region_to_continent = {
    'South West': 'North America',
    'Southern California': 'North America',
    'South Central': 'North America',
    'Central East': 'North America',
    'Europe': 'International',
    'North East': 'North America',
    'Africa': 'International',
    'South East': 'North America',
    'Australia': 'International',
    'Northern California': 'North America',
    'Latin America': 'International',
    'Canada East': 'North America',
    'North Central': 'North America',
    'North West': 'North America',
    'Mid Atlantic': 'North America',
    'Canada West': 'North America',
    'Asia': 'International',
    'NA': 'Other'  # 'NA' categorized as 'Other'
}
# Apply the mapping to the 'region' column
df['region'] = df['region'].map(region_to_continent)
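
When a dictionary is passed to `Series.map`, any value without a matching key becomes NaN, so a typo in a key silently turns data into missing values. The following sketch shows one way to list values the mapping does not cover; the toy series and the reduced mapping are hypothetical.

```python
import pandas as pd

# Hypothetical values; "Antarctica" has no key in the mapping, None is a missing entry
regions = pd.Series(["Europe", "Asia", "Antarctica", None])
mapping = {"Europe": "International", "Asia": "International"}

mapped = regions.map(mapping)

# Values that were present before mapping but are NaN afterwards were not covered
unmapped = regions[mapped.isna() & regions.notna()].unique()
print("Values without a mapping:", list(unmapped))
```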

In [10]:
#Data exploration to map the schedules in a meaningful way
unique_schedules = df['schedule'].unique()
print("Unique schedules:", unique_schedules)

Unique schedules: ['I do multiple workouts in a day 2x a week|' nan
 'I usually only do 1 workout a day|'
 'I usually only do 1 workout a day|I strictly schedule my rest days|'
 'I usually only do 1 workout a day|I typically rest 4 or more days per month|'
 'I do multiple workouts in a day 3+ times a week|I typically rest fewer than 4 days per month|'
 'I do multiple workouts in a day 3+ times a week|'
 'I usually only do 1 workout a day|I do multiple workouts in a day 1x a week|I typically rest 4 or more days per month|'
 'I do multiple workouts in a day 1x a week|I typically rest 4 or more days per month|'
 'I do multiple workouts in a day 1x a week|'
 'I typically rest 4 or more days per month|'
 'I do multiple workouts in a day 3+ times a week|I strictly schedule my rest days|I typically rest 4 or more days per month|'
 'I strictly schedule my rest days|'
 'I do multiple workouts in a day 2x a week|I strictly schedule my rest days|I typically rest 4 or more days per month|'
 'I do 

In [11]:
#first mapping of the schedules to meaningful strings
schedule_generalization = {
    'I usually only do 1 workout a day|': 'Mixed Workout Frequency',
    'I do multiple workouts in a day 1x a week|': 'Multiple Weekly Workouts',
    'I do multiple workouts in a day 2x a week|': 'Multiple Weekly Workouts',
    'I do multiple workouts in a day 3+ times a week|': 'Mixed Workout Frequency',
    'I typically rest 4 or more days per month|': 'Regular Rest with Strict Scheduling',
    'I typically rest fewer than 4 days per month|': 'Mixed Workout Frequency',
    'I strictly schedule my rest days|': 'Regular Rest with Strict Scheduling',
    'Decline to answer|': 'Other/Declined to Answer',
    'I usually only do 1 workout a day|I strictly schedule my rest days|': 'Regular Rest with Strict Scheduling',
    'I usually only do 1 workout a day|I typically rest 4 or more days per month|': 'Regular Rest with Strict Scheduling',
    'I do multiple workouts in a day 3+ times a week|I typically rest fewer than 4 days per month|': 'Mixed Workout Frequency',
    'I usually only do 1 workout a day|I do multiple workouts in a day 1x a week|I typically rest 4 or more days per month|': 'Mixed Workout Frequency',
    'I do multiple workouts in a day 1x a week|I typically rest 4 or more days per month|': 'Regular Rest with Strict Scheduling',
    'I do multiple workouts in a day 3+ times a week|I strictly schedule my rest days|I typically rest 4 or more days per month|': 'Regular Rest with Strict Scheduling',
    'I do multiple workouts in a day 2x a week|I strictly schedule my rest days|I typically rest 4 or more days per month|': 'Regular Rest with Strict Scheduling',
    'I do multiple workouts in a day 3+ times a week|I typically rest 4 or more days per month|': 'Regular Rest with Strict Scheduling',
    'I usually only do 1 workout a day|I do multiple workouts in a day 3+ times a week|I typically rest 4 or more days per month|': 'Mixed Workout Frequency',
    'I do multiple workouts in a day 2x a week|I typically rest 4 or more days per month|': 'Regular Rest with Strict Scheduling',
    'I do multiple workouts in a day 1x a week|I typically rest fewer than 4 days per month|': 'Regular Rest with Strict Scheduling',
    'I do multiple workouts in a day 1x a week|I strictly schedule my rest days|I typically rest 4 or more days per month|': 'Regular Rest with Strict Scheduling',
    'I usually only do 1 workout a day|I typically rest fewer than 4 days per month|': 'Regular Rest with Strict Scheduling',
    'I usually only do 1 workout a day|I do multiple workouts in a day 2x a week|I typically rest 4 or more days per month|': 'Mixed Workout Frequency',
    'I usually only do 1 workout a day|I do multiple workouts in a day 1x a week|I strictly schedule my rest days|': 'Mixed Workout Frequency',
    'I do multiple workouts in a day 1x a week|I strictly schedule my rest days|': 'Mixed Workout Frequency',
    'I usually only do 1 workout a day|I do multiple workouts in a day 2x a week|I strictly schedule my rest days|I typically rest 4 or more days per month|': 'Regular Rest with Strict Scheduling',
    'I usually only do 1 workout a day|I strictly schedule my rest days|I typically rest fewer than 4 days per month|': 'Regular Rest with Strict Scheduling',
    'I strictly schedule my rest days|I typically rest 4 or more days per month|': 'Regular Rest with Strict Scheduling',
    'I do multiple workouts in a day 2x a week|I strictly schedule my rest days|I typically rest fewer than 4 days per month|': 'Regular Rest with Strict Scheduling',
    'Multiple Weekly Workouts': 'Mixed Workout Frequency',
    'nan': 'Mixed Workout Frequency',
    'Single Daily Workout': 'Mixed Workout Frequency',
    'Strictly Scheduled Rest': 'Regular Rest with Strict Scheduling',
    'Regular Rest Days': 'Regular Rest with Strict Scheduling',
    'Frequent Workouts': 'Mixed Workout Frequency',
    'Fewer Rest Days': 'Mixed Workout Frequency',
    'Other/Declined to Answer': 'Mixed Workout Frequency',
    'Mixed Workout Frequency': 'Mixed Workout Frequency',
    'Scheduled Multiple Workouts': 'Mixed Workout Frequency',
    'Regular Workout with Strict Rest Days': 'Regular Rest with Strict Scheduling',
    'Regular Rest with Strict Scheduling': 'Regular Rest with Strict Scheduling',
    'Frequent Workouts with Strict Rest Days': 'Regular Rest with Strict Scheduling',
    'NA': 'NA'
}

# Apply the generalization to the 'schedule' column
df['schedule'] = df['schedule'].map(schedule_generalization)
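
Because the answers are pipe-delimited multi-select strings, an alternative to listing every combination is to split each string and apply a few rules to the individual statements. The sketch below uses a hypothetical rule set with an assumed precedence; it does not reproduce the dictionary above entry by entry and is not the method used in this notebook.

```python
def split_answers(raw):
    """Split a pipe-delimited multi-select answer into its individual statements."""
    return [part.strip() for part in raw.split("|") if part.strip()]

def generalise_schedule(raw):
    """Map a raw schedule answer to a broad category (hypothetical rules)."""
    answers = split_answers(raw)
    if not answers or answers == ["Decline to answer"]:
        return "Other/Declined to Answer"
    # Assumed precedence: any statement about rest days wins
    if any("rest" in answer for answer in answers):
        return "Regular Rest with Strict Scheduling"
    if any("multiple workouts" in answer for answer in answers):
        return "Multiple Weekly Workouts"
    return "Mixed Workout Frequency"

examples = [
    "I usually only do 1 workout a day|",
    "I do multiple workouts in a day 2x a week|",
    "I usually only do 1 workout a day|I strictly schedule my rest days|",
    "Decline to answer|",
]
for raw in examples:
    print(generalise_schedule(raw), "<-", raw)
```

A rule-based function also handles combinations that never appeared during exploration, whereas a dictionary lookup leaves them unmapped.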

In [12]:
#Second Data exploration to map the schedules in a meaningful way
unique_schedules = df['schedule'].unique()
print("Unique schedules:", unique_schedules)

Unique schedules: ['Multiple Weekly Workouts' nan 'Mixed Workout Frequency'
 'Regular Rest with Strict Scheduling' 'Other/Declined to Answer']


In [13]:
#Data exploration to map the eat attribute in a meaningful way
unique_eat = df['eat'].unique()
print("Unique eat attributes:", unique_eat)

Unique eat attributes: [nan 'I eat 1-3 full cheat meals per week|'
 "I eat quality foods but don't measure the amount|" 'I eat strict Paleo|'
 "I eat quality foods but don't measure the amount|I eat 1-3 full cheat meals per week|"
 'I eat whatever is convenient|'
 "I eat strict Paleo|I eat quality foods but don't measure the amount|"
 'I eat strict Paleo|I eat 1-3 full cheat meals per week|'
 "I eat quality foods but don't measure the amount|I eat whatever is convenient|I eat 1-3 full cheat meals per week|"
 "I eat quality foods but don't measure the amount|I eat whatever is convenient|"
 'I eat whatever is convenient|I eat 1-3 full cheat meals per week|'
 'Decline to answer|' 'I weigh and measure my food|'
 'I weigh and measure my food|I eat strict Paleo|I eat 1-3 full cheat meals per week|'
 "I eat strict Paleo|I eat quality foods but don't measure the amount|I eat 1-3 full cheat meals per week|"
 'I weigh and measure my food|I eat strict Paleo|'
 "I weigh and measure my food|I eat q

In [14]:
# Mapping of eating habits to meaningful strings
eat_generalization = {
    'I eat 1-3 full cheat meals per week|': 'Cheat Meals/Other',
    "I eat quality foods but don't measure the amount|": 'Quality Focused',
    'I eat strict Paleo|': 'Diet-Conscious',
    "I eat quality foods but don't measure the amount|I eat 1-3 full cheat meals per week|": 'Quality Focused',
    'I eat whatever is convenient|': 'Convenience Eating',
    "I eat strict Paleo|I eat quality foods but don't measure the amount|": 'Diet-Conscious',
    'I eat strict Paleo|I eat 1-3 full cheat meals per week|': 'Diet-Conscious',
    "I eat quality foods but don't measure the amount|I eat whatever is convenient|I eat 1-3 full cheat meals per week|": 'Quality Focused',
    "I eat quality foods but don't measure the amount|I eat whatever is convenient|": 'Quality Focused',
    'I eat whatever is convenient|I eat 1-3 full cheat meals per week|': 'Convenience Eating',
    'Decline to answer|': 'Cheat Meals/Other',
    'I weigh and measure my food|': 'Diet-Conscious',
    'I weigh and measure my food|I eat strict Paleo|I eat 1-3 full cheat meals per week|': 'Diet-Conscious',
    "I eat strict Paleo|I eat quality foods but don't measure the amount|I eat 1-3 full cheat meals per week|": 'Diet-Conscious',
    'I weigh and measure my food|I eat strict Paleo|': 'Diet-Conscious',
    "I weigh and measure my food|I eat quality foods but don't measure the amount|I eat 1-3 full cheat meals per week|": 'Diet-Conscious',
    'I weigh and measure my food|I eat 1-3 full cheat meals per week|': 'Diet-Conscious',
    "I weigh and measure my food|I eat quality foods but don't measure the amount|": 'Diet-Conscious',
    'I weigh and measure my food|I eat whatever is convenient|': 'Diet-Conscious',
    'NA': 'NA'
}

# Apply the generalization to the 'eat' column
df['eat'] = df['eat'].map(eat_generalization)

In [15]:
#Data exploration to map the howlong attribute in a meaningful way
unique_howlong = df['howlong'].unique()
print("Unique howlong attributes:", unique_howlong)

Unique howlong attributes: ['4+ years|' nan '1-2 years|' '2-4 years|' '6-12 months|'
 'Less than 6 months|' '1-2 years|2-4 years|'
 'Less than 6 months|1-2 years|' '2-4 years|4+ years|'
 'Decline to answer|' '6-12 months|1-2 years|']


In [16]:
# mapping for the 'howlong' attribute
howlong_generalization = {
    '4+ years|': 'Experienced',
    '1-2 years|': 'Experienced',
    '2-4 years|': 'Experienced',
    '6-12 months|': 'Novice',
    'Less than 6 months|': 'Novice',
    '1-2 years|2-4 years|': 'Experienced',
    'Less than 6 months|1-2 years|': 'Novice',
    '2-4 years|4+ years|': 'Experienced',
    'Decline to answer|': 'NA',
    '6-12 months|1-2 years|': 'Novice',
    'NA': 'NA'
}

# Map the 'howlong' values to the categories
df['howlong'] = df['howlong'].map(howlong_generalization)


#### 5. Ensure that there are at least k records for each combination of quasi-identifiers
- First, the lower the number of unique values per column, the closer we get to achieving k-anonymity in this algorithm.

In [17]:
#Check how many unique values there are in each column
num_rows = len(df)
print("Number of rows:", num_rows)

columns_of_interest = ['region', 'gender', 'age', 'height', 'weight', 'eat', 'schedule', 'howlong']
selected_df = df[columns_of_interest]

unique_values = selected_df.nunique()
print("Unique values in each column:\n", unique_values)

Number of rows: 421097
Unique values in each column:
 region      2
gender      2
age         2
height      3
weight      3
eat         4
schedule    4
howlong     3
dtype: int64
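
The unique counts above give an upper bound on the number of distinct quasi-identifier combinations, namely their product. A small sketch using the numbers from the output above; note that `nunique` ignores missing values, so this is a rough estimate, and the average says nothing about the smallest group.

```python
import math

# Unique counts and row count copied from the output above
unique_counts = {
    "region": 2, "gender": 2, "age": 2, "height": 3,
    "weight": 3, "eat": 4, "schedule": 4, "howlong": 3,
}
num_rows = 421097

# Upper bound on the number of combinations of non-missing values
max_combinations = math.prod(unique_counts.values())
average_group_size = num_rows / max_combinations

print("Maximum number of combinations:", max_combinations)
print("Average records per combination if all occur:", round(average_group_size, 1))
```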


##### 5.1 Last steps
1. As a last step, we filter out the non-compliant rows, i.e. rows whose combination of quasi-identifier values occurs only once.
2. After this step, we've reached k-anonymity (with k = 2) in our dataframe!

In [18]:
# Quasi-identifier columns and the target k
quasi_identifiers = ['region', 'gender', 'age', 'height', 'weight', 'eat', 'schedule', 'howlong']
k = 2

# Group by quasi-identifiers and count the records in each group
grouped_df = df.groupby(quasi_identifiers, observed=True).size().reset_index(name='count')

# Groups with fewer than k records violate k-anonymity
non_compliant_groups = grouped_df[grouped_df['count'] < k]

# Number of rows in those groups (for k=2 every such group holds exactly one row)
rows_dropped = non_compliant_groups['count'].sum()

# Merge with the original DataFrame to flag non-compliant rows
flagged_df = pd.merge(df, non_compliant_groups, on=quasi_identifiers, how='left', indicator=True)

# Keep only rows that did not match a non-compliant group
compliant_df = flagged_df[flagged_df['_merge'] == 'left_only'].drop(columns=['count', '_merge'])

# Group again to verify the result
grouped_compliant_df = compliant_df.groupby(quasi_identifiers, observed=True).size().reset_index(name='count')

# Check the minimum group size
min_count = grouped_compliant_df['count'].min()

# Verify if k-anonymity is achieved
if min_count >= k:
    print("Number of rows dropped in this final step: " , rows_dropped)
    print(f"The dataset satisfies k={k} anonymity.")
else:
    print(f"The dataset does NOT satisfy k={k} anonymity.")

    # Display groups that occur only once
    only_once = grouped_df[grouped_df['count'] == 1]
    print("Groups that occur only once:\n", only_once)

    num_non_compliant = len(non_compliant_groups)
    print(f"Number of groups not satisfying k={k} anonymity:", num_non_compliant)


Number of rows dropped in this final step:  285
The dataset satisfies k=2 anonymity.
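
One caveat when verifying: by default, pandas `groupby` leaves out rows whose grouping keys contain missing values, so such rows are not counted in any group. Passing `dropna=False` includes them. A minimal sketch on hypothetical toy data showing the difference:

```python
import pandas as pd

# Hypothetical toy records; the last one has a missing gender
toy = pd.DataFrame({
    "gender": ["Male", "Male", "Female", None],
    "age":    ["0-30", "0-30", "31-60", "31-60"],
})

# Default behaviour: the record with the missing key is ignored
default_sizes = toy.groupby(["gender", "age"]).size()

# dropna=False: the record with the missing key forms its own group
all_sizes = toy.groupby(["gender", "age"], dropna=False).size()

print("Records counted by default:", int(default_sizes.sum()))
print("Records counted with dropna=False:", int(all_sizes.sum()))
```

If the anonymised dataframe still contains missing quasi-identifier values, a check with `dropna=False` may therefore report smaller groups than the default check.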
