## Assignment #2-3: Anonymisation
- Dataset: Crossfit [Dataset](https://data.world/bgadoci/crossfit-data) (only the athletes file is used in this assignment)
- Credits: Dataset was put together by Sam Swift
- ToDo: Before running the Jupyter notebook, install the dependencies listed in requirements.txt (`pip install -r requirements.txt`)

In [35]:
import pandas as pd

# Read csv as dataframe
df = pd.read_csv("athletes.csv", low_memory=False)

## 3.1 Anonymisation: Bare Bones – 10 marks
Goal: k-anonymity modifies a dataset so that every record is indistinguishable from at least k−1 other records with respect to a chosen set of "quasi-identifier" attributes.

### Algorithm Steps: 
1. Identify the direct identifier attributes in the dataset.
2. Identify the quasi-identifier attributes in the dataset.
3. Apply k-anonymity: choose a value for k, the minimum size of each group of indistinguishable records.
4. Use generalization and suppression to transform the quasi-identifiers.
5. Ensure that every combination of quasi-identifier values occurs in at least k records.
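
The steps above can be sketched on a toy example. This is a minimal sketch, not the notebook's implementation: the helper names `generalize_age` and `k_anonymize`, the bucket width of 10 years, and the toy records are assumptions made for illustration. Generalization is done by bucketing ages, and suppression by dropping rows whose quasi-identifier combination occurs fewer than k times. The toy data contains no missing values; real data would need a decision on how to treat them.

```python
import pandas as pd


def generalize_age(age, width=10):
    """Generalization: replace an exact age with a bucket label such as '20-29'."""
    low = (int(age) // width) * width
    return f"{low}-{low + width - 1}"


def k_anonymize(frame, quasi_identifiers, k):
    """Suppression: drop rows whose quasi-identifier combination occurs fewer than k times."""
    # Size of the group each row belongs to, aligned with the original index
    group_sizes = frame.groupby(quasi_identifiers)[quasi_identifiers[0]].transform("size")
    return frame[group_sizes >= k]


# Toy records (made up for illustration, not taken from athletes.csv)
toy = pd.DataFrame({
    "gender": ["Male", "Male", "Male", "Female", "Female", "Male"],
    "age": [23, 27, 25, 31, 34, 52],
    "deadlift": [400, 380, 420, 250, 260, 300],
})

toy["age"] = toy["age"].map(generalize_age)             # step 4: generalization
anonymized = k_anonymize(toy, ["gender", "age"], k=2)   # steps 4-5: suppression
print(anonymized)
```

In this toy data the single 52-year-old record forms a group of size 1 and is suppressed, while the remaining groups each contain at least two records.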

#### 1. Identify direct identifier attributes
- Inspecting the columns and their data formats reveals several attributes that could contain explicit personally identifiable information (PII):
    - `athlete_id`
        - Whether this is identifying depends on how the id is used. Considerations to take into account are:
            - Is the `athlete_id` only an internal id of this dataset, or does it refer to an official id?
            - Are there other datasets from a similar source that may use the same `athlete_id`?
    - `name`
        - The name directly identifies an individual
    - `team`
        - Depending on the size of the team, this could identify a specific athlete
    - `affiliate`
        - Depending on the affiliate and the number of contracted athletes, this could identify an individual
    - All stats of the athletes
        - If an athlete has truly remarkable stats (perhaps even a world record in a category), this could identify the individual
    - `train`
        - If an athlete has a distinctive and well-known training routine, this could identify the athlete
    - `background`
        - If an athlete has a well-known background or mentions names, this could identify the athlete
    - `experience`
        - If an athlete gives concrete details about their experience (e.g. the name of their current coach), this could identify the athlete

-> As can be seen, any of these columns could contain outliers that could then be used to identify an individual.
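
One way to look for such outliers in a numeric stat column is to flag values outside a pair of quantiles. The following is a rough sketch under stated assumptions: the helper name `flag_extremes`, the 10%/90% thresholds, and the toy values are hypothetical and would need tuning for the real data. The check is symmetric, so unusually low values are flagged as well as unusually high ones.

```python
import pandas as pd


def flag_extremes(series, lower=0.1, upper=0.9):
    """Return a boolean mask marking values outside the [lower, upper] quantile range."""
    low, high = series.quantile([lower, upper])
    return (series < low) | (series > high)


# Toy deadlift values (made up for illustration); 1000 plays the role of a record-level lift
toy_deadlift = pd.Series([300, 310, 320, 330, 340, 350, 360, 370, 380, 1000])

flagged = flag_extremes(toy_deadlift)
print(toy_deadlift[flagged].tolist())
```

Flagged values are candidates for suppression or for capping at a threshold, since a single extreme stat can be enough to single out an athlete.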

#### 2. Identify the quasi-identifier attributes in the dataset
- In this step, we use the following script to search the attributes that were not flagged as PII in the previous step for potential quasi-identifiers.

In [36]:
# identify potential quasi-identifiers
def identify_quasi_identifiers(dataframe, sensitive_columns):
    quasi_identifiers = []
    for column in dataframe.columns:
        # Skip sensitive attributes
        if column in sensitive_columns:
            continue
        
        unique_count = dataframe[column].nunique()
        # Treat a column as a potential quasi-identifier if it has more than one
        # distinct value but is not unique for every record.
        if 1 < unique_count < len(dataframe):
            quasi_identifiers.append(column)
    
    return quasi_identifiers

# Column names flagged in step 1 as potentially containing PII
sensitive_columns = ['athlete_id', 'name', 'team', 'affiliate', 'train', 'background', 'experience', 'fran', 'helen',  'grace', 'filthy50', 'fgonebad', 'run400', 'run5k', 'candj', 'snatch', 'deadlift', 'backsq', 'pullups']

# Identify potential quasi-identifiers
potential_quasi_identifiers = identify_quasi_identifiers(df, sensitive_columns)
print("Potential Quasi-Identifiers:", potential_quasi_identifiers)

Potential Quasi-Identifiers: ['region', 'gender', 'age', 'height', 'weight', 'eat', 'schedule', 'howlong', 'retrieved_datetime']
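
The heuristic above accepts almost every remaining column, so it can help to rank the candidates by how identifying they are. The sketch below is one possible way to do that, not part of the original notebook: the helper name `uniqueness_report` and the toy records are assumptions. For each column it counts the distinct values and the "singletons", i.e. values that occur in exactly one record and would therefore fail k-anonymity for any k > 1 on their own.

```python
import pandas as pd


def uniqueness_report(frame, columns):
    """Count distinct values and values occurring exactly once for each column."""
    rows = []
    for column in columns:
        counts = frame[column].value_counts()  # missing values are ignored by default
        rows.append({
            "column": column,
            "distinct": len(counts),
            "singletons": int((counts == 1).sum()),
        })
    return pd.DataFrame(rows).set_index("column")


# Toy records (made up for illustration, not taken from athletes.csv)
toy = pd.DataFrame({
    "region": ["Europe", "Europe", "Asia", "Europe"],
    "gender": ["Male", "Female", "Male", "Male"],
    "height": [70, 70, 64, 72],
})

report = uniqueness_report(toy, ["region", "gender", "height"])
print(report)
```

Columns with many singletons are the ones most in need of generalization before the k-anonymity check is applied.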
