## Assignment #2-3: Anonymisation
- Dataset: CrossFit [Dataset](https://data.world/bgadoci/crossfit-data) (only the athletes file is used in this assignment)
- Credits: Dataset was put together by Sam Swift
- ToDo: To run the Jupyter notebook, the packages listed in requirements.txt need to be installed (`pip install -r requirements.txt`)

In [280]:
import pandas as pd

# Read csv as dataframe
df = pd.read_csv("athletes.csv", low_memory=False)

## First Step: Revisit the data set to remind ourselves what we are working with
- To better understand the structure of the dataset, we display example attribute values
    - Which columns does the dataset contain, and in what format are the attribute values?
        - To answer this, each column is printed together with its first non-empty (non-null) value

In [281]:
def get_first_non_empty_value(df_column):
    # Drop missing values once and return the first remaining entry
    # (None if the column is entirely empty)
    non_null = df_column.dropna()
    return non_null.iloc[0] if not non_null.empty else None

# Iterate over each column and print an example value
for column in df.columns:
    first_value = get_first_non_empty_value(df[column])
    print(f"Column: '{column}', Example Data: {first_value}")

Column: 'athlete_id', Example Data: 2554.0
Column: 'name', Example Data: Pj Ablang
Column: 'region', Example Data: South West
Column: 'team', Example Data: Double Edge
Column: 'affiliate', Example Data: Double Edge CrossFit
Column: 'gender', Example Data: Male
Column: 'age', Example Data: 24.0
Column: 'height', Example Data: 70.0
Column: 'weight', Example Data: 166.0
Column: 'fran', Example Data: 211.0
Column: 'helen', Example Data: 645.0
Column: 'grace', Example Data: 300.0
Column: 'filthy50', Example Data: 1053.0
Column: 'fgonebad', Example Data: 0.0
Column: 'run400', Example Data: 61.0
Column: 'run5k', Example Data: 1081.0
Column: 'candj', Example Data: 220.0
Column: 'snatch', Example Data: 200.0
Column: 'deadlift', Example Data: 400.0
Column: 'backsq', Example Data: 305.0
Column: 'pullups', Example Data: 25.0
Column: 'eat', Example Data: I eat 1-3 full cheat meals per week|
Column: 'train', Example Data: I workout mostly at a CrossFit Affiliate|I have a coach who determines my prog

## 3.1 Anonymisation: Bare Bones – 10 marks
Goal: k-anonymity modifies a dataset so that every record is indistinguishable from at least k−1 other records with respect to a chosen set of "quasi-identifier" attributes.
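
To make the definition concrete, the following minimal sketch computes the anonymity level of a small table: the size of the smallest group of records that share the same quasi-identifier values. The helper `anonymity_level` and the toy data are hypothetical and only serve as an illustration; they are not part of the assignment code.

```python
import pandas as pd

def anonymity_level(df: pd.DataFrame, quasi_identifiers: list) -> int:
    """Return the size of the smallest group of records sharing the same
    quasi-identifier values, i.e. the largest k for which df is k-anonymous."""
    group_sizes = df.groupby(quasi_identifiers, dropna=False, observed=True).size()
    return int(group_sizes.min())

# Hypothetical toy data, not taken from the CrossFit dataset
toy = pd.DataFrame({
    "gender": ["Male", "Male", "Female", "Female", "Female"],
    "age":    ["19-30", "19-30", "31-45", "31-45", "31-45"],
    "fran":   [211, 250, 300, 280, 310],
})

k_level = anonymity_level(toy, ["gender", "age"])
print(k_level)
```

Here the two quasi-identifier combinations cover two and three records, so the toy table is 2-anonymous but not 3-anonymous.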

### Algorithm Steps: 
1. Identify the direct identifier attributes in the dataset.
2. Identify the quasi-identifier attributes in the dataset.
3. Apply k-anonymity: choose a value for k (the minimum size of each group of indistinguishable records).
4. Use generalization and suppression to transform the quasi-identifiers.
5. Ensure that there are at least k records for each combination of quasi-identifier values.

#### 1. Identify direct identifier attributes
- By inspecting the columns and their data formats, we can identify several attributes that may contain explicit personally identifiable information (PII)
    - `athlete_id`
        - This really depends on how the id is used! Considerations to take into account:
            - Is the `athlete_id` only an internal id of this dataset, or does it refer to an official id?
            - Are other datasets from a similar source available? If so, they may use the same `athlete_id`
    - `name`
        - The name directly identifies an individual
    - `team`
        - Depending on the size of the team, this could make it possible to identify a specific athlete
    - `affiliate`
        - Depending on the affiliate and its number of contracted athletes, this could make it possible to identify an individual
    - All stats of the athletes
        - If an athlete has truly remarkable stats (maybe even a world record in a category), these could identify the individual
    - `train`
        - If an athlete has a distinctive, well-known training routine, this could identify them
    - `background`
        - If an athlete has a well-known background or mentions names, this could identify them
    - `experience`
        - If an athlete mentions concrete details about their experience (e.g. the name of their current coach), this could identify them

-> As can be seen, all columns could potentially contain outliers that could then be used to identify an individual.
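
One common way to handle direct identifiers before release is to drop free-text identifiers such as `name` and replace ids with pseudonyms. The sketch below shows this on toy data; the salt, the helper `pseudonymise` and the truncation to 16 characters are assumptions made for illustration and are not part of this notebook. Note that pseudonymisation on its own does not make a dataset anonymous, since the remaining columns can still single out individuals.

```python
import hashlib
import pandas as pd

# Hypothetical secret salt; in practice it would be kept out of the notebook
SALT = "replace-with-a-secret-value"

def pseudonymise(value, salt: str = SALT) -> str:
    """Map an identifier to a salted SHA-256 digest (truncated for readability)."""
    digest = hashlib.sha256(f"{salt}:{value}".encode("utf-8")).hexdigest()
    return digest[:16]

# Hypothetical toy records
toy = pd.DataFrame({
    "athlete_id": [2554.0, 3517.0],
    "name": ["Athlete A", "Athlete B"],
    "gender": ["Male", "Female"],
})

# Drop the free-text direct identifier and replace the id with a pseudonym
released = toy.drop(columns=["name"])
released["athlete_id"] = released["athlete_id"].map(pseudonymise)
print(released)
```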

#### 2. Identify the quasi-identifier attributes in the dataset
- In this step, we use the following script to search for attributes that qualify as quasi-identifiers and were not flagged as PII in the previous step.

In [282]:
# identify potential quasi-identifiers
def identify_quasi_identifiers(dataframe, sensitive_columns):
    quasi_identifiers = []
    for column in dataframe.columns:
        # Skip sensitive attributes
        if column in sensitive_columns:
            continue
        
        unique_count = dataframe[column].nunique()
        # Treat a column as a potential quasi-identifier if it has more than one
        # distinct value but is not unique for every record.
        if 1 < unique_count < len(dataframe):
            quasi_identifiers.append(column)
    
    return quasi_identifiers

# Column names flagged as potentially identifying (PII) in step 1
sensitive_columns = ['athlete_id', 'name', 'team', 'affiliate', 'train', 'background', 'experience', 'fran', 'helen',  'grace', 'filthy50', 'fgonebad', 'run400', 'run5k', 'candj', 'snatch', 'deadlift', 'backsq', 'pullups']

# Identify potential quasi-identifiers
potential_quasi_identifiers = identify_quasi_identifiers(df, sensitive_columns)
print("Potential Quasi-Identifiers:", potential_quasi_identifiers)

Potential Quasi-Identifiers: ['region', 'gender', 'age', 'height', 'weight', 'eat', 'schedule', 'howlong', 'retrieved_datetime']
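
The heuristic above accepts almost every column that is neither constant nor fully unique. As a complementary view, the share of distinct values per column gives a rough ranking of how identifying a column may be. The helper `distinct_ratio` and the toy data below are hypothetical, a sketch rather than part of the assignment code.

```python
import pandas as pd

def distinct_ratio(df: pd.DataFrame) -> pd.Series:
    """Share of distinct non-null values per column
    (1.0 means every non-null value is unique)."""
    non_null = df.count()
    # Avoid division by zero for columns that are entirely empty
    return (df.nunique() / non_null.where(non_null > 0)).sort_values(ascending=False)

# Hypothetical toy data
toy = pd.DataFrame({
    "athlete_id": [1, 2, 3, 4],
    "gender": ["Male", "Female", "Male", "Female"],
    "age": [24, 31, 24, 45],
})

ratios = distinct_ratio(toy)
print(ratios)
```

Columns with a ratio close to 1.0 behave like direct identifiers, while columns with a low ratio are only identifying in combination with others.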


#### 3. Apply k-anonymity: Choose a value for k
- In this step we choose a value for k.
- For example, if we choose k = 3, each combination of quasi-identifier values must apply to at least three records in the dataset.
- A higher k strengthens privacy by making re-identification more difficult, but it also reduces the utility of the data by increasing information loss.
- The choice of k is therefore a trade-off between privacy and utility that must be weighed in the context of how the data will be used.
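
The trade-off can be illustrated by counting how many records survive when all groups smaller than k are suppressed. The helper `suppress_small_groups` and the toy data are hypothetical; the sketch assumes that the quasi-identifier columns contain no missing values.

```python
import pandas as pd

def suppress_small_groups(df, quasi_identifiers, k):
    """Drop every record whose quasi-identifier combination occurs fewer than k times."""
    group_size = (
        df.groupby(quasi_identifiers, observed=True)[quasi_identifiers[0]]
        .transform("size")
    )
    return df[group_size >= k]

# Hypothetical toy data with group sizes 4, 3, 2 and 1
toy = pd.DataFrame({
    "region": ["Europe"] * 4 + ["Asia"] * 3 + ["Africa"] * 2 + ["Oceania"],
    "gender": ["Male"] * 4 + ["Female"] * 3 + ["Male"] * 2 + ["Female"],
})

# Number of records retained for each choice of k
retained = {k: len(suppress_small_groups(toy, ["region", "gender"], k)) for k in (2, 3, 4, 5)}
print(retained)
```

On this toy table, 9 of 10 records remain for k = 2, 7 for k = 3, 4 for k = 4 and none for k = 5, which shows how quickly utility can drop as k grows.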

#### 4. Use Generalization and Suppression as the transformation methods
- Some data exploration was already done in Assignment 1, so the intervals and replacement values for several attributes can be reused. Some additional exploration is needed on top of it.
- The attribute 'retrieved_datetime' can be removed, since all the entries are empty. Also, we standardize empty values.
- The quasi-identifiers 'age', 'height' and 'weight' will be generalized by binning them into intervals
- Every attribute value occurring fewer than 100 times in the entire dataset will be suppressed by removing the affected rows
- The values of 'region' will be mapped to continents

In [283]:
# Drop the 'retrieved_datetime' column
df = df.drop(columns=['retrieved_datetime'])

In [284]:
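# Iterate over all columns in the DataFrame.
# Note: pd.read_csv parses empty fields as NaN by default, so this only
# affects literal empty strings; NaN values are left unchanged.
for column in df.columns:
    # Replace empty strings with 'NA' in the column
    df[column] = df[column].replace({'': 'NA'})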

In [285]:
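# Aggregate age, height and weight into intervals.
# pd.cut uses right-closed bins by default (e.g. age 18 falls into '0-18');
# values outside the outermost bin edges become NaN.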
bins_age = [0, 18, 30, 45, 60, 100]
labels_age = ['0-18', '19-30', '31-45', '46-60', '60+']

bins_height = [0, 20, 40, 60, 70, 80, 90]
labels_height = ['0-20', '20-40', '40-60', '60-70', '70-80', '81+']

bins_weight = [0, 159, 169, 179, 189, 199, 220]
labels_weight = ['0-159', '160-169', '170-179', '180-189', '190-199', '200+']

# Apply binning
df['age'] = pd.cut(df['age'], bins=bins_age, labels=labels_age)
df['height'] = pd.cut(df['height'], bins=bins_height, labels=labels_height)
df['weight'] = pd.cut(df['weight'], bins=bins_weight, labels=labels_weight)
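
Because `pd.cut` assigns NaN to values outside the outermost bin edges, a label such as '200+' only covers values up to the last edge. The sketch below compares a bounded last bin with an open-ended one on hypothetical weights. Using `float("inf")` as the final edge is an assumption made for illustration, not the binning used in this notebook.

```python
import pandas as pd

# Hypothetical weights
weights = pd.Series([150, 159, 160, 220, 250])
labels = ["0-159", "160-199", "200+"]

# Bounded last bin: values above the final edge (220) fall outside all bins
bounded = pd.cut(weights, bins=[0, 159, 199, 220], labels=labels)

# Open-ended last bin: float("inf") as the final edge keeps heavier values in '200+'
open_ended = pd.cut(weights, bins=[0, 159, 199, float("inf")], labels=labels)

print(bounded.tolist())
print(open_ended.tolist())
```

With the bounded edges the value 250 becomes NaN, while the open-ended variant assigns it to '200+'.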

In [286]:
# List of columns to apply the suppression
columns_to_suppress = ['region', 'gender', 'age', 'height', 'weight', 'eat', 'schedule', 'howlong']

for column in columns_to_suppress:
    # Counting the frequency of each unique value in the column
    value_counts = df[column].value_counts()

    # Identifying values that occur fewer than 100 times
    values_to_remove = value_counts[value_counts < 100].index

    # Removing rows with these values
    df = df[~df[column].isin(values_to_remove)]

# After this loop, df contains your DataFrame with rare values removed in each specified column
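
Removing whole rows is one form of suppression. An alternative that keeps the record count unchanged is to replace rare values with a placeholder. The helper `mask_rare_values` and the placeholder 'Other' are hypothetical, and the sketch assumes a plain string column (a categorical column would need the placeholder added as a category first).

```python
import pandas as pd

def mask_rare_values(column: pd.Series, min_count: int, placeholder: str = "Other") -> pd.Series:
    """Replace values occurring fewer than min_count times with a placeholder."""
    counts = column.value_counts()
    rare = counts[counts < min_count].index
    # Keep frequent values, overwrite rare ones
    return column.where(~column.isin(rare), placeholder)

# Hypothetical toy column
toy = pd.Series(["Europe", "Europe", "Europe", "Asia", "Asia", "Africa"])
masked = mask_rare_values(toy, min_count=2)
print(masked.tolist())
```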

In [287]:
# Data exploration to map the regions in a meaningful way
unique_regions = df['region'].unique()
print("Unique regions:", unique_regions)

Unique regions: ['South West' nan 'Southern California' 'South Central' 'Central East'
 'Europe' 'North East' 'Africa' 'South East' 'Australia'
 'Northern California' 'Latin America' 'Canada East' 'North Central'
 'North West' 'Mid Atlantic' 'Canada West' 'Asia']


In [288]:
# Example mapping of regions to continents, including 'NA'
region_to_continent = {
    'South West': 'North America',
    'Southern California': 'North America',
    'South Central': 'North America',
    'Central East': 'North America',
    'Europe': 'Europe',
    'North East': 'North America',
    'Africa': 'Africa',
    'South East': 'North America',
    'Australia': 'Oceania',
    'Northern California': 'North America',
    'Latin America': 'South America',
    'Canada East': 'North America',
    'North Central': 'North America',
    'North West': 'North America',
    'Mid Atlantic': 'North America',
    'Canada West': 'North America',
    'Asia': 'Asia',
    'NA': 'NA'  # Preserving 'NA' as is
}
# Apply the mapping to the 'region' column
df['region'] = df['region'].map(region_to_continent)
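
`Series.map` with a dictionary returns NaN for every value that has no matching key, and this includes missing values themselves. A `'NA': 'NA'` entry therefore only applies to the literal string 'NA'. The sketch below shows this on hypothetical data, with an explicit `fillna` as one possible way to keep missing regions as their own group; by default `groupby` leaves out rows whose keys are NaN.

```python
import pandas as pd

# Shortened, hypothetical mapping and input values
region_to_continent = {"Europe": "Europe", "Canada East": "North America"}
regions = pd.Series(["Europe", None, "Canada East", "Atlantis"])

# Missing and unmapped values both become NaN
mapped = regions.map(region_to_continent)

# Optional: give them an explicit label so they form their own group
filled = mapped.fillna("NA")

print(mapped.tolist())
print(filled.tolist())
```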

#### 5. Ensure that there are at least k records for each combination of quasi-identifiers


In [289]:
# Check how many unique values there are in each column
num_rows = len(df)
print("Number of rows:", num_rows)

columns_of_interest = ['region', 'gender', 'age', 'height', 'weight', 'eat', 'schedule', 'howlong']
selected_df = df[columns_of_interest]

unique_values = selected_df.nunique()
print("Unique values in each column:\n", unique_values)

Number of rows: 420955
Unique values in each column:
 region       6
gender       2
age          4
height       4
weight       6
eat         19
schedule    40
howlong     10
dtype: int64
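
The unique counts above give an upper bound on the number of distinct quasi-identifier combinations. A short calculation, using only the numbers printed above, compares this bound with the number of rows:

```python
import math

# Distinct values per quasi-identifier, copied from the output above
unique_counts = {
    "region": 6, "gender": 2, "age": 4, "height": 4,
    "weight": 6, "eat": 19, "schedule": 40, "howlong": 10,
}
n_rows = 420955

# Upper bound on the number of distinct combinations
possible_combinations = math.prod(unique_counts.values())
print(possible_combinations)           # 8755200
print(n_rows / possible_combinations)  # average records per possible combination
```

With roughly 8.8 million possible combinations for about 0.42 million rows, many combinations can be expected to be sparsely populated, so further generalization or suppression is likely needed to reach even a small k. The number of combinations that actually occur may be much smaller than this bound.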


In [290]:
# Group by quasi-identifiers and count the records in each group
k = 2
quasi_identifiers = ['region', 'gender', 'age', 'height', 'weight', 'eat', 'schedule', 'howlong']
grouped_df = df.groupby(quasi_identifiers, observed=True).size().reset_index(name='count')

# k-anonymity holds only if the smallest group contains at least k records
min_count = grouped_df['count'].min()
non_compliant_groups = grouped_df[grouped_df['count'] < k]
# Verify if k-anonymity is achieved
if min_count >= k:
    print(f"The dataset satisfies k={k} anonymity.")
else:
    print(f"The dataset does NOT satisfy k={k} anonymity.")

    # Display groups that occur only once
    only_once = grouped_df[grouped_df['count'] == 1]
    print("Groups that occur only once:\n", only_once)
    
    num_non_compliant = len(non_compliant_groups)
    print(f"Number of groups not satisfying k={k} anonymity:", num_non_compliant)


The dataset does NOT satisfy k=2 anonymity.
Groups that occur only once:
               region  gender    age height   weight  \
0             Africa  Female   0-18  60-70    0-159   
1             Africa  Female   0-18  60-70    0-159   
2             Africa  Female  19-30  40-60    0-159   
3             Africa  Female  19-30  40-60    0-159   
4             Africa  Female  19-30  60-70    0-159   
...              ...     ...    ...    ...      ...   
22926  South America    Male  46-60  70-80  180-189   
22927  South America    Male  46-60  70-80  190-199   
22928  South America    Male  46-60  70-80  190-199   
22929  South America    Male  46-60  70-80     200+   
22930  South America    Male  46-60  70-80     200+   

                                                     eat  \
0      I eat quality foods but don't measure the amount|   
1                           I weigh and measure my food|   
2                   I eat 1-3 full cheat meals per week|   
3      I eat quality food