# Assignment #1: Pseudonymisation Techniques and Considerations
- Dataset: CrossFit [Dataset](https://data.world/bgadoci/crossfit-data) (only the athletes file was used in this assignment)
- Credits: Dataset was put together by Sam Swift
- ToDo: To run the Jupyter notebook, the packages listed in requirements.txt need to be installed (`pip install -r requirements.txt`)

In [6]:
!pip install numpy
import pandas as pd

# Read csv as dataframe
dataframe = pd.read_csv("athletes.csv")


## 3.1 Pseudonymisation
Goals:
-  Determine which attributes qualify as explicit personally identifiable information (3.1.1)
    - Why? 
    - What method was used to identify these? 
- Generate pseudonymous values for the identified attributes (3.1.2)

### 3.1.1 Identifying Attributes containing Personally Identifiable Information (PII)
- To identify these attributes, a better understanding of the structure of the dataset is needed first
    - What columns does the dataset contain and in what format are the attribute values?
        - Therefore, each column is printed together with its first non-empty (non-null) value

In [7]:
def get_first_not_not_empty_value(df_column):
    return df_column.dropna().iloc[0] if not df_column.dropna().empty else None

# Iterate each column 
for column in dataframe.columns:
    first_value = get_first_not_not_empty_value(dataframe[column])
    print(f"Column: '{column}', Example Data: {first_value}")

Column: 'athlete_id', Example Data: 2554.0
Column: 'name', Example Data: Pj Ablang
Column: 'region', Example Data: South West
Column: 'team', Example Data: Double Edge
Column: 'affiliate', Example Data: Double Edge CrossFit
Column: 'gender', Example Data: Male
Column: 'age', Example Data: 24.0
Column: 'height', Example Data: 70.0
Column: 'weight', Example Data: 166.0
Column: 'fran', Example Data: 211.0
Column: 'helen', Example Data: 645.0
Column: 'grace', Example Data: 300.0
Column: 'filthy50', Example Data: 1053.0
Column: 'fgonebad', Example Data: 0.0
Column: 'run400', Example Data: 61.0
Column: 'run5k', Example Data: 1081.0
Column: 'candj', Example Data: 220.0
Column: 'snatch', Example Data: 200.0
Column: 'deadlift', Example Data: 400.0
Column: 'backsq', Example Data: 305.0
Column: 'pullups', Example Data: 25.0
Column: 'eat', Example Data: I eat 1-3 full cheat meals per week|
Column: 'train', Example Data: I workout mostly at a CrossFit Affiliate|I have a coach who determines my prog

- Inspecting the columns and their data formats reveals several attributes that could contain explicit personally identifiable information
    - `athlete_id`
        - This depends on how the id is used. Considerations to take into account are:
            - Is the `athlete_id` only an internal id of this dataset, or does it refer to an official id?
            - Are there other datasets from a similar source that may use the same `athlete_id`?
    - `name`
        - The name directly identifies an individual
    - `team`
        - Depending on the size of the team, this could make it possible to identify a specific athlete
    - `affiliate`
        - Depending on the affiliate and the number of contracted athletes, this could make it possible to identify an individual
    - All stats of the athletes
        - If an athlete has remarkable stats (maybe even a world record in a category), this could make it possible to identify the individual
    - `train`
        - If an athlete has a distinctive and well-known training routine, this could make it possible to identify them
    - `background`
        - If an athlete has a well-known background or mentions names, this could make it possible to identify them
    - `experience`
        - If an athlete mentions concrete information about their experience (e.g. the name of the current coach), this could make it possible to identify them

-> As can be seen, all columns could potentially contain outliers that could then be used to identify an individual. As agreed, one anonymization technique was chosen for each column potentially leaking PII (only those holding numbers or names), keeping the data loss as small as possible without leaking obvious PII. It is important to note that some columns could still contain PII; this residual risk from outliers was accepted in this assignment. Columns containing free-text descriptions (e.g. `train`, `background` and `experience`) were not anonymized, so the risk of outliers in them was accepted as well.

### 3.1.2 Pseudonymizing
- The `name` column is especially suitable for pseudonymisation because it can easily be replaced with a fake name without the replacement being obvious or destroying the purpose of the dataset
- With the help of the `anonymizedf` library new fake names can be created for the `name` column

In [13]:
#!pip install numpy
#!pip install anonymizedf

from anonymizedf import anonymizedf

# Prepare the data to be anonymized
an = anonymizedf.anonymize(dataframe)

In [None]:
# Add new column with fake name
an.fake_names("name")

# Drop old original name column
dataframe = dataframe.drop(columns=['name'])

# Rename the Fake_name column to name
dataframe = dataframe.rename(columns={'Fake_name': 'name'})

print(dataframe.iloc[:4])
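
As an alternative to a fake-name library, pseudonyms can also be derived from a simple lookup so that the same original value always receives the same replacement. The following is a minimal sketch under assumptions: the helper `pseudonymize_column`, the `Athlete_` prefix and the toy rows are hypothetical and not part of the original notebook. Sequential pseudonyms reveal the order of first appearance, so this is only an illustration of the idea.

```python
import pandas as pd

def pseudonymize_column(df, column, prefix="Athlete"):
    """Replace each distinct value in `column` with a stable pseudonym.

    Returns a pseudonymized copy and the lookup table (original -> pseudonym).
    `prefix` is an arbitrary, hypothetical label.
    """
    unique_values = df[column].dropna().unique()
    lookup = {value: f"{prefix}_{i:05d}" for i, value in enumerate(unique_values, start=1)}
    result = df.copy()
    # Values without an entry in the lookup (i.e. missing values) stay missing
    result[column] = result[column].map(lookup)
    return result, lookup

# Toy data standing in for athletes.csv (assumption, not the real file)
toy = pd.DataFrame({"name": ["Pj Ablang", "Jane Doe", "Pj Ablang", None]})
pseudonymized, lookup = pseudonymize_column(toy, "name")
print(pseudonymized)
```

The lookup dictionary has to be stored separately and protected, because it allows the pseudonymisation to be reversed.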

## 3.2 Randomisation
Goal:
- Use randomisation technique to generate random strings (no meaning) (3.2.1)
- Use randomisation technique to generate random but meaningful replacements (3.2.2)
- Create a lookup table to keep track of the changes (3.2.3)

### 3.2.1 Random Strings
1. Identify columns to replace with random strings
    - Here columns containing words can be used 
        - `name`
        - `region`
        - `team`
        - `affiliate` 
2. Write a function that generates random replacement values
3. Overwrite each value in the respective attributes with a generated value

In [None]:
import random
import string

# Define function to create random string
def generate_random_string(length):
    letters = string.ascii_letters
    return ''.join(random.choice(letters) for _ in range(length))

# Apply randomization on specified columns
dataframe['name'] = dataframe['name'].apply(lambda x: generate_random_string(10))
dataframe['region'] = dataframe['region'].apply(lambda x: generate_random_string(15))
dataframe['team'] = dataframe['team'].apply(lambda x: generate_random_string(20))
dataframe['affiliate'] = dataframe['affiliate'].apply(lambda x: generate_random_string(25))

print(dataframe.iloc[:4])

### 3.2.2 Randomization with meaningful strings
1. Identify suitable columns
    - `name`
        - When randomizing the name, the first letter is kept equal to the original to preserve some similarity
    - `region`, `team`, `affiliate`
        - The values of each column are shuffled among the rows. Thus, every value still occurs in the dataset; for example, a team with the respective name still exists
2. Write functionality to randomize the values

In [None]:
from anonymizedf import anonymizedf

# To see the changes properly we have to reload the initial data
dataframe = pd.read_csv("athletes.csv")
        
# Randomize the name
# TODO (Emirkan): insert the name randomization here


# Interchange values for region, team and affiliate
def interchange_values(column_name):
    dataframe[column_name] = dataframe[column_name].sample(frac=1).reset_index(drop=True)
        
interchange_values('region')
interchange_values('team')
interchange_values('affiliate')

print(dataframe.iloc[:4])
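
The name randomization is left open in the cell above. One possible way to keep the first letter while still producing meaningful values is to shuffle names only among rows that share the same initial. This is a sketch under assumptions: the helper `shuffle_names_within_initial`, the seed and the toy rows are hypothetical, the index is assumed to be unique, and a name that is the only one with its initial stays unchanged.

```python
import numpy as np
import pandas as pd

def shuffle_names_within_initial(names, seed=0):
    """Shuffle names among rows whose names start with the same letter.

    Missing values are left untouched. Assumes a unique index.
    """
    rng = np.random.default_rng(seed)
    result = names.copy()
    valid = names.dropna()
    # Group the non-missing names by their first character
    for _, group in valid.groupby(valid.str[0]):
        # Permute the values inside the group and write them back to the same rows
        result.loc[group.index] = rng.permutation(group.to_numpy())
    return result

# Toy data standing in for the name column (assumption, not the real file)
toy = pd.DataFrame({"name": ["Pj Ablang", "Paula Smith", "Jane Doe", "John Roe", None]})
shuffled = shuffle_names_within_initial(toy["name"], seed=1)
print(shuffled)
```

Because the names are only permuted, every replacement is a name that really occurs in the dataset, in line with the approach used for `region`, `team` and `affiliate`.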

### 3.2.3 Save mappings
- Rewrite the previous functionality (only for 3.2.2) so that the mapping between old and new values is saved in an additional table

In [None]:
# To see the changes properly we have to reload the initial data
dataframe = pd.read_csv("athletes.csv")

# Copy the original values in mapping dataframe
dataframe_mapping = dataframe[['name', 'region', 'team', 'affiliate']].copy()

# Randomize name
# TODO (Emirkan): insert the functionality to randomize the name here

# Randomize region, team and affiliate
interchange_values('region')
interchange_values('team')
interchange_values('affiliate')

# Copy the modified values in mapping dataframe
dataframe_mapping[['randomized_region', 'randomized_team', 'randomized_affiliate']] = dataframe[['region', 'team', 'affiliate']].copy()

print(dataframe_mapping.iloc[:4])
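
Since shuffling assigns values per row rather than per value, the mapping table is row-based: each row holds the original and the randomized value side by side. The sketch below shows how such a table could be built without relying on a global dataframe and how it allows the originals to be restored. The helpers `shuffle_with_lookup` and `restore`, the seed and the toy rows are hypothetical.

```python
import pandas as pd

def shuffle_with_lookup(df, columns, seed=0):
    """Shuffle each column independently and record original vs. randomized values."""
    shuffled = df.copy()
    lookup = df[columns].copy()
    for i, column in enumerate(columns):
        # to_numpy() drops the index so the shuffled values are assigned by position
        shuffled[column] = df[column].sample(frac=1, random_state=seed + i).to_numpy()
        lookup[f"randomized_{column}"] = shuffled[column]
    return shuffled, lookup

def restore(shuffled, lookup, columns):
    """Undo the shuffle by copying the original values back from the lookup table."""
    restored = shuffled.copy()
    for column in columns:
        restored[column] = lookup[column]
    return restored

# Toy data standing in for athletes.csv (assumption, not the real file)
toy = pd.DataFrame({
    "region": ["South West", "North East", "Europe"],
    "team": ["Double Edge", "Team B", "Team C"],
})
shuffled, lookup = shuffle_with_lookup(toy, ["region", "team"])
restored = restore(shuffled, lookup, ["region", "team"])
print(lookup)
```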

## 3.3 Aggregation
Goal:
- Determine attributes which qualify for aggregation (3.3.1)
- Write a function to perform that aggregation process (3.3.2)

### 3.3.1 Determine Attributes for Aggregation

In [None]:
# Print each column with its first non-empty value
for column in dataframe.columns:
    first_value = get_first_not_not_empty_value(dataframe[column])
    print(f"Column: '{column}', Example Data: {first_value}")

- Numerical attributes can be easily aggregated
- Especially those related to information about the individual are suitable for aggregation
    - `age`
    - `height` 
    - `weight`
- To be able to identify different levels of values, the extremes have to be identified 

In [None]:
print(f"Minimum age: {dataframe['age'].nsmallest(5)}, Maximum age: {dataframe['age'].nlargest(5)}")
print(f"Minimum height: {dataframe['height'].nsmallest(10)}, Maximum height: {dataframe['height'].nlargest(10)}")
print(f"Minimum weight: {dataframe['weight'].nsmallest(10)}, Maximum weight: {dataframe['weight'].nlargest(10)}")

- Because even the ten most extreme values for height and weight did not make sense, the bins below are based on typical value ranges of the general population instead

In [None]:
bins_age = [0, 18, 30, 45, 60, 100]
labels_age = ['0-18', '19-30', '31-45', '46-60', '60+']

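# NOTE: the example row printed in 3.1.1 shows height 70.0 and weight 166.0, which suggests inches and pounds;
# the height and weight bins below look metric (cm, kg), so the units should be checked before binning
bins_height = [0, 159, 169, 179, 189, 199, 220]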
labels_height = ['0-159', '160-169', '170-179', '180-189', '190-199', '200+']

bins_weight = [0, 49, 59, 69, 79, 89, 99, 130]
labels_weight = ['0-49', '50-59', '60-69', '70-79', '80-89', '90-99', '100+']

dataframe['age'] = pd.cut(dataframe['age'], bins=bins_age, labels=labels_age)
dataframe['height'] = pd.cut(dataframe['height'], bins=bins_height, labels=labels_height)
dataframe['weight'] = pd.cut(dataframe['weight'], bins=bins_weight, labels=labels_weight)

print(dataframe)
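
One detail of `pd.cut` is worth illustrating: values outside the outermost bin edges are not assigned to any bin and become `NaN`. The sketch below compares a closed upper edge with an open-ended one using `np.inf`. The sample ages and the `61+` label are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample ages, including one value above the last bin edge
ages = pd.Series([17, 18, 19, 45, 60, 61, 101])

# Closed upper edge at 100: the value 101 falls outside all bins and becomes NaN
closed = pd.cut(
    ages,
    bins=[0, 18, 30, 45, 60, 100],
    labels=['0-18', '19-30', '31-45', '46-60', '60+'],
)

# Open-ended last bin: every value above 60 is captured
open_ended = pd.cut(
    ages,
    bins=[0, 18, 30, 45, 60, np.inf],
    labels=['0-18', '19-30', '31-45', '46-60', '61+'],
)

print(pd.DataFrame({"age": ages, "closed": closed, "open_ended": open_ended}))
```

By default the bins are right-inclusive, so an age of 18 lands in `0-18` and an age of 19 in `19-30`.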

## 3.4 Perturbation
Goal:
- Select attributes to add noise to (3.4.1)
- Implement the functionality to add the noise (3.4.2)
    - The original distribution of values should be preserved 
- Analyse distribution of original data and then with noise added (3.4.3)

### 3.4.1 Select attributes
- Like aggregation, perturbation works best for attributes with numeric values
- Because the athletes' stats are the main subject of the dataset, noise should normally not be added to them
    - Moreover, the correctness of the values matters for comparing different athletes, so even a small amount of noise is harmful
- However, as already mentioned in 3.1, this kind of information loss is accepted in this assignment
- Therefore, the stats will be perturbed

### 3.4.2 Perturb Values
- Gaussian noise with mean 0 is added; unless specified otherwise, its standard deviation equals the standard deviation of the respective column

In [None]:
import numpy as np

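def add_noise(df, column, std=None):
    """Return a copy of df with zero-mean Gaussian noise added to `column`."""
    # Default to the column's own standard deviation
    if std is None:
        std = df[column].std()

    with_noise = df[column].add(np.random.normal(0, std, df.shape[0]))
    copy = df.copy()
    copy[column] = with_noise
    return copy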

# Add noise to each stats column in turn; add_noise returns a copy, so dataframe itself is not modified
columns_to_perturb = [
    "helen", "grace", "filthy50", "fgonebad", "run400", "run5k",
    "candj", "snatch", "deadlift", "backsq", "pullups",
]
dataframe_pertubation = dataframe
for column in columns_to_perturb:
    dataframe_pertubation = add_noise(dataframe_pertubation, column)

### 3.4.3 Show Distributions

In [None]:
import matplotlib.pyplot as plt

fig, (ax1,ax2) = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(20,10))
ax1.hist(
    'helen', 
    data=dataframe, 
    bins = np.arange(start=100, stop=2_000, step=200),
)
ax1.title.set_text('Original Distribution')
ax2.hist(
    'helen', 
    data=dataframe_pertubation, 
    bins = np.arange(start=100, stop=2_000, step=200),
)
ax2.title.set_text('Distribution with noise added')
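
Besides the histograms, a rough numerical check can show how the noise affects the distribution. Adding independent zero-mean noise leaves the mean nearly unchanged, but if the noise has the same standard deviation as the column, the variances add up and the standard deviation grows by a factor of about sqrt(2). The sketch below uses synthetic, normally distributed data as an assumption instead of the real `helen` column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for a stats column (assumption, not the real data)
original = pd.Series(rng.normal(loc=600, scale=100, size=100_000))

# Zero-mean Gaussian noise with the same standard deviation as the column
noise = rng.normal(0, original.std(), size=len(original))
perturbed = original + noise

# Compare mean and standard deviation before and after adding noise
summary = pd.DataFrame({
    "original": original.describe(),
    "perturbed": perturbed.describe(),
})
print(summary.loc[["mean", "std"]])

ratio = perturbed.std() / original.std()
print(f"std ratio: {ratio:.3f}")
```

A smaller `std` argument keeps the perturbed distribution closer to the original, at the cost of weaker protection.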



## 3.5 Data Analysis
Goal:
- Determine data loss using function (3.5.1)
- Discuss pros and cons (3.5.2)
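
The notebook does not define how data loss is measured. As a starting point, the sketch below shows two simple, hypothetical measures: the mean absolute change for perturbed numeric columns and the share of distinct values that remain after aggregation. Both helper functions and the toy values are assumptions, not the method used in the assignment.

```python
import pandas as pd

def mean_absolute_change(original, modified):
    """Average absolute difference between two numeric columns (missing values are ignored)."""
    return (modified - original).abs().mean()

def distinct_value_ratio(original, modified):
    """Share of distinct values that remain after a transformation (1.0 means no loss)."""
    return modified.nunique() / original.nunique()

# Toy numeric column and two transformed versions (assumption, not the real data)
original = pd.Series([10.0, 20.0, 30.0, 40.0])
perturbed = original + pd.Series([1.0, -1.0, 2.0, -2.0])
binned = pd.cut(original, bins=[0, 25, 50], labels=["low", "high"])

print(mean_absolute_change(original, perturbed))
print(distinct_value_ratio(original, binned))
```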