# Introduction

In this tutorial, we explore the inner workings of the `k_anonymize` function, a tool for achieving k-Anonymity in datasets. A dataset is k-anonymous when every combination of quasi-identifier values is shared by at least k rows. We focus on the function's parameters, its underlying logic, and its role in protecting privacy while preserving data utility.


## Step 1: Dataset Familiarization
Start by reading the dataset's description. This step is needed to understand the data's structure and to identify quasi-identifiers: attributes that, alone or in combination, could reveal individual identities. The header and first rows of the dataset are shown below.

```
Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
40,M,ATA,140,289,0,Normal,172,N,0,Up,0
49,F,NAP,160,180,0,Normal,156,N,1,Flat,1
37,M,ATA,130,283,0,ST,98,N,0,Up,0
48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
54,M,NAP,150,195,0,Normal,122,N,0,Up,0
39,M,NAP,120,339,0,Normal,170,N,0,Up,0
```
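One rough way to spot candidate quasi-identifiers is to count how many distinct values each column takes, since columns with many distinct values tend to narrow down individuals more. The sketch below does this for the six sample rows above, embedded as a string so it runs on its own. The helper name `distinct_counts` is hypothetical, and a high count is only a hint: whether a column is a quasi-identifier also depends on what an attacker could plausibly know.

```python
import csv
import io

# The six sample rows shown above, embedded so the example runs standalone
SAMPLE = """Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
40,M,ATA,140,289,0,Normal,172,N,0,Up,0
49,F,NAP,160,180,0,Normal,156,N,1,Flat,1
37,M,ATA,130,283,0,ST,98,N,0,Up,0
48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
54,M,NAP,150,195,0,Normal,122,N,0,Up,0
39,M,NAP,120,339,0,Normal,170,N,0,Up,0
"""

def distinct_counts(csv_text):
    """Return {column name: number of distinct values} for a CSV string."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = list(reader)
    return {name: len({row[i] for row in rows}) for i, name in enumerate(header)}

counts = distinct_counts(SAMPLE)
for name, n in counts.items():
    print(f"{name}: {n} distinct value(s)")
```

In this small sample, Age and Cholesterol are unique per row, which is consistent with the notebook later treating them as quasi-identifiers.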

## Step 2: Understanding the k_anonymize Function
In this step, we will dive into the provided code for the `microaggregate` and `k_anonymize` functions, understanding their purpose and how they contribute to achieving k-Anonymity.

- import csv: This line imports the "csv" module, which provides functionality to work with Comma-Separated Values (CSV) files. CSV files are commonly used to store tabular data, where each line of the file represents a row and the values within each line are separated by commas.

- from itertools import combinations: This line imports the "combinations" function from the "itertools" module. The "combinations" function generates all possible combinations of a given length from a given iterable (like a list or a set).

- import matplotlib.pyplot as plt: This line imports the "pyplot" module from the "matplotlib" library, which is used for creating visualizations like plots and graphs. The imported module is given an alias "plt" to make it easier to refer to.

- import numpy as np: This line imports the "numpy" library, which provides support for working with arrays and matrices of numerical data. The imported library is given an alias "np" for convenience.

In [2]:
# Import the necessary libraries
import csv
from itertools import combinations
import matplotlib.pyplot as plt
import numpy as np

### Microaggregation
Microaggregation is a statistical disclosure control technique: records are grouped, and each value is replaced by an aggregate (typically the mean) of its group, so that released data no longer exposes individual values. Here's how the provided code works:

microaggregate(data, column, k): This function takes three arguments:

- data: A pandas DataFrame containing the data to be microaggregated.
- column: The column in the dataset on which microaggregation is performed.
- k: A parameter meant to control the level of aggregation (in microaggregation, usually the minimum group size). Note that the code below accepts k but never uses it.

The function begins with two setup steps:

- unique_vals = np.unique(data[column]): This line uses the np.unique function from the NumPy library to find the unique values in the specified column of the dataset. Each unique value defines one group.
- aggregated_vals = {}: This initializes an empty dictionary that will store the aggregated value for each unique value.

The following loop iterates through each unique value found in the specified column:

- indices = np.where(data[column] == val)[0]: It finds the indices of all rows in the dataset where the column value matches the current unique value.
- mean = np.mean(data.iloc[indices][column]): It calculates the mean of the values in the specified column for the rows corresponding to the current unique value.
- aggregated_vals[val] = mean: It stores the calculated mean in the aggregated_vals dictionary, using the unique value as the key.

The next loop iterates through each row in the dataset:

- for index, row in data.iterrows(): This loop allows for iterating through both the index and the row contents of the dataset.
- data.at[index, column] = aggregated_vals[row[column]]: It replaces the original value in the specified column with the aggregated value from the aggregated_vals dictionary, based on the value present in the current row.

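Finally, the dataset, which has been modified in place, is returned. Because each group here consists of rows that share exactly the same value, the mean of a group equals that value, so as written the function leaves the column essentially unchanged (apart from conversion to floating point and possible rounding).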


In [3]:
# Define the microaggregate function
def microaggregate(data, column, k):
    # Find unique values in the column
    unique_vals = np.unique(data[column])
    # Create a dictionary to store the aggregated values
    aggregated_vals = {}
    for val in unique_vals:
        # Find the indices of all rows with the same value
        indices = np.where(data[column] == val)[0]
        # Aggregate the values by finding the mean
        mean = np.mean(data.iloc[indices][column])
        # Store the aggregated value in the dictionary
        aggregated_vals[val] = mean
    # Replace the original values with the aggregated values
    for index, row in data.iterrows():
        data.at[index, column] = aggregated_vals[row[column]]
    return data
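
The cell above groups rows by exact value, so every group mean equals the value itself and `k` is never used. For comparison, here is a minimal sketch of one common variant, fixed-size microaggregation, in which sorted values are cut into consecutive groups of at least `k` members and each value is replaced by its group mean. The function name `microaggregate_sorted` is hypothetical, it works on a plain list rather than a DataFrame, and it is not the notebook author's method.

```python
def microaggregate_sorted(values, k):
    """Replace each value with the mean of its group of at least k neighbours.

    Hypothetical helper: values are ranked, cut into consecutive groups of
    size k, and a short tail is merged into the last group. If there are
    fewer than k values in total, they form a single group.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    n = len(values)
    # Positions of the values in ascending order
    order = sorted(range(n), key=lambda i: values[i])
    # Start offset of each group in the sorted order
    starts = list(range(0, n, k))
    # Merge a tail with fewer than k members into the previous group
    if len(starts) > 1 and n - starts[-1] < k:
        starts.pop()
    result = [None] * n
    for g, start in enumerate(starts):
        end = starts[g + 1] if g + 1 < len(starts) else n
        group = order[start:end]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            result[i] = mean
    return result

ages = [40, 49, 37, 48, 54, 39]  # Age column of the sample rows
print(microaggregate_sorted(ages, 2))
# → [44.0, 51.5, 38.0, 44.0, 51.5, 38.0]
```

With `k = 2`, the sorted ages pair up as (37, 39), (40, 48), and (49, 54), so every released value is shared by at least two rows.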

### The k_anonymize function

In summary, the k_anonymize function takes a CSV file path, a k value, a list of columns to protect (named sensitive_features in the signature, although the calls later in this notebook pass quasi-identifiers), and optional generalization intervals. It first generalizes any columns listed in generalization_intervals, then replaces the protected values with asterisks in every row whose combination of values occurs fewer than k times. The function writes the result to a new CSV file and returns the percentage of rows that were masked in this way.

Let's break down the code step by step to understand how the function works:

def k_anonymize(file_path, k, sensitive_features, generalization_intervals={}):
    # Read the CSV file
    anon_rows = 0
    with open(file_path, 'r') as file:
        reader = csv.reader(file)
        header = next(reader)
        data = [row for row in reader]
        row_count = len(data)
        
    # Find the indices of the sensitive columns
    sensitive_indices = [header.index(feature) for feature in sensitive_features]

The function takes several parameters:

- file_path: The path to the CSV file containing the dataset.
- k: The desired k value for k-anonymization.
- sensitive_features: A list of column headers whose value combinations are counted and, when a combination is too rare, masked. In this notebook these are the quasi-identifiers.
- generalization_intervals: A dictionary that maps column headers to generalization intervals (if any). Generalization involves replacing attribute values with broader categories to reduce identifiability.

The code reads the CSV file specified by file_path using the csv.reader and extracts the header row and data rows. It also counts the total number of rows in the dataset (row_count). sensitive_indices is a list that stores the indices of the sensitive columns within the dataset, based on their headers.

The code then checks if any generalization is specified for attributes. If generalization intervals are provided in the generalization_intervals dictionary, it iterates through the data and performs generalization on the corresponding attributes. Generalization replaces attribute values with ranges based on the specified intervals.
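
The generalization step can be sketched in isolation as follows. This is a minimal illustration of the same bucketing rule used in the full function later in the notebook; the helper name `generalize` is hypothetical, and `math.floor` stands in for `np.floor` so the example needs no third-party imports.

```python
import math

def generalize(value, interval):
    """Map a numeric string to the 'lower-upper' label of its bucket."""
    bucket = math.floor(float(value) / interval)
    return f"{bucket * interval}-{(bucket + 1) * interval}"

# Ages from the sample rows, generalized with an interval of 15
for age in ["40", "49", "37", "54"]:
    print(age, "->", generalize(age, 15))
```

Note that a value on a boundary falls into the upper bucket: with an interval of 15, an age of 45 is labeled `45-60`, so the upper bound of each label is exclusive.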

    # Create a dictionary to store the count of each combination of sensitive values
    counts = {}

    # Count the occurrences of each combination of sensitive values
    for row in data:
        value = tuple([row[i] for i in sensitive_indices])
        counts[value] = counts.get(value, 0) + 1

The code creates a dictionary named counts to store how often each combination of values in the selected columns occurs.

It then iterates through the data rows and, for each row, builds a tuple of the values in the selected columns. This tuple is used as a key in the counts dictionary, and the count for that combination is incremented.
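
The same counting can be written more compactly with `collections.Counter`. The sketch below is an alternative formulation, not the notebook's code, and the toy rows are an assumption: they are the six sample rows reduced to three columns, with Age already generalized using an interval of 15.

```python
from collections import Counter

header = ["Age", "Sex", "Cholesterol"]
# Toy rows (assumption): sample data with Age generalized in steps of 15
data = [
    ["30-45", "M", "289"],
    ["45-60", "F", "180"],
    ["30-45", "M", "283"],
    ["45-60", "F", "214"],
    ["45-60", "M", "195"],
    ["30-45", "M", "339"],
]

# Column positions of the attributes whose combinations we count
sensitive_indices = [header.index(name) for name in ["Age", "Sex"]]

# One tuple per row, tallied by Counter
counts = Counter(tuple(row[i] for i in sensitive_indices) for row in data)

for combination, n in counts.items():
    print(combination, n)
```

Here the combination `("45-60", "M")` occurs only once, so it would fall below any k greater than 1.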

    # Perform the k-anonymization
    for row in data:
        value = tuple([row[i] for i in sensitive_indices])
        if counts[value] < k:
            for i in sensitive_indices:
                row[i] = '*' * len(row[i])
            anon_rows = anon_rows + 1
    percentage = (anon_rows/row_count)*100

The code then iterates through the data rows again. For each row, it checks whether the count of its combination of values (counts[value]) is less than the desired k value. If so, the row belongs to a group that is too small to satisfy k-anonymity, and the code masks the selected columns by replacing every character of each value with an asterisk (*), so the masked string keeps the length of the original value. It also increments the count of masked rows (anon_rows).

The variable percentage is the share of rows that had to be masked, expressed as a percentage of all rows. A lower percentage means more of the original data is retained.
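
The masking step and the percentage can be tried on a few in-memory rows. The following is a minimal sketch under stated assumptions: the helper name `suppress_rare` is hypothetical, the toy rows are the sample data with Age generalized in steps of 15, and, unlike the notebook's code, the helper works on copies instead of modifying the rows in place.

```python
def suppress_rare(rows, indices, k):
    """Mask the columns at `indices` in rows whose combination occurs < k times.

    Returns (masked_rows, percentage_masked). Hypothetical helper that leaves
    the input rows untouched.
    """
    # First pass: count each combination of values
    counts = {}
    for row in rows:
        key = tuple(row[i] for i in indices)
        counts[key] = counts.get(key, 0) + 1

    # Second pass: mask rows whose combination is too rare
    masked_rows, masked_count = [], 0
    for row in rows:
        key = tuple(row[i] for i in indices)
        new_row = list(row)
        if counts[key] < k:
            for i in indices:
                new_row[i] = "*" * len(new_row[i])
            masked_count += 1
        masked_rows.append(new_row)

    percentage = 100 * masked_count / len(rows) if rows else 0.0
    return masked_rows, percentage

rows = [
    ["30-45", "M", "289"],
    ["45-60", "F", "180"],
    ["30-45", "M", "283"],
    ["45-60", "F", "214"],
    ["45-60", "M", "195"],
    ["30-45", "M", "339"],
]
masked, pct = suppress_rare(rows, [0, 1], k=3)
for row in masked:
    print(row)
print(f"{pct:.1f}% of rows masked")
```

Only the combination `("30-45", "M")` reaches three occurrences, so the other three rows are masked. Because each mask has the same length as the value it replaces, the length of the original value remains visible in the output.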

    # Write the anonymous data to a new CSV file
    with open('/content/sample_data/anonymised.csv', 'w', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(header)
        writer.writerows(data)
    return percentage
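
Finally, the code writes the masked data to a new CSV file and returns the calculated percentage. The output path `/content/sample_data/anonymised.csv` is hard-coded, so each call overwrites the result of the previous one.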
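
If you want to keep the results of several runs, one option is to pass the output location in explicitly. The sketch below shows the same `csv.writer` calls with a caller-supplied path; the helper name `write_csv` and the temporary directory are assumptions for illustration and are not part of the notebook's function.

```python
import csv
import os
import tempfile

def write_csv(path, header, rows):
    """Write a header and data rows to `path` (hypothetical helper)."""
    with open(path, "w", newline="") as file:
        writer = csv.writer(file)
        writer.writerow(header)
        writer.writerows(rows)

header = ["Age", "Cholesterol"]
rows = [["30-45", "289"], ["*****", "***"]]

# Write to a temporary directory, then read the file back to check it
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "anonymised.csv")
    write_csv(out_path, header, rows)
    with open(out_path, newline="") as file:
        read_back = list(csv.reader(file))

print(read_back)
```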

You can run the code below:

In [4]:
def k_anonymize(file_path, k, sensitive_features, generalization_intervals={}):
    # Read the CSV file
    anon_rows = 0
    with open(file_path, 'r') as file:
        reader = csv.reader(file)
        header = next(reader)
        data = [row for row in reader]
        row_count = len(data)
    # Find the indices of the sensitive columns
    sensitive_indices = [header.index(feature) for feature in sensitive_features]

    # Perform the generalization of attributes if necessary
    for attribute, interval in generalization_intervals.items():
        attribute_index = header.index(attribute)
        for row in data:
            # Replace the value with the "lower-upper" bounds of its bucket
            bucket = int(np.floor(float(row[attribute_index]) / interval))
            row[attribute_index] = str(bucket * interval) + "-" + str((bucket + 1) * interval)

    # Create a dictionary to store the count of each combination of sensitive values
    counts = {}

    # Count the occurrences of each combination of sensitive values
    for row in data:
        value = tuple([row[i] for i in sensitive_indices])
        counts[value] = counts.get(value, 0) + 1

    # Perform the k-anonymization
    for row in data:
        value = tuple([row[i] for i in sensitive_indices])
        if counts[value] < k:
            for i in sensitive_indices:
                row[i] = '*' * len(row[i])
            anon_rows = anon_rows + 1
    percentage = (anon_rows/row_count)*100

    # Write the anonymous data to a new CSV file
    with open('/content/sample_data/anonymised.csv', 'w', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(header)
        writer.writerows(data)
    return percentage

## Step 3: Scrutinizing k_anonymize Function Attributes
In this step, we look at the arguments passed to the k_anonymize function and the role each one plays in achieving k-Anonymity.

In [11]:
# Parameter explanations
file_path = "/content/sample_data/heart.csv"  # Path to the CSV dataset
k = 10  # Desired k-Anonymity level
quasiidentifiers = ["Age","Cholesterol"]  # Quasi-identifier columns to protect
generalization_intervals = {"Age": 15}  # Optional bucket width per column for generalization

# ... (include the provided code for experimenting with k_anonymize function calls)
k_anonymize('/content/sample_data/heart.csv', 3, quasiidentifiers)



83.2244008714597

## Step 4: Practical Implementation
Run the provided code in the notebook, then change the argument values and run it again. Inspect the anonymized dataset that each run produces and the percentage of rows that had to be masked.

In [12]:
# Step 4: Practical Implementation

# Execute the provided code to anonymize the dataset
anon_percentage = k_anonymize(file_path, k, quasiidentifiers, generalization_intervals)
# Display the percentage of anonymized rows
print("Must be anonymized:", anon_percentage, "% of the dataset")

Must be anonymized: 81.26361655773421 % of the dataset
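
To analyze an anonymized dataset, one simple check is to count how many unmasked rows share each combination of quasi-identifier values and how many rows were masked. The sketch below does this on a small CSV string; the data, the helper names `is_masked` and `group_sizes`, and the choice of columns are assumptions for illustration, not output from the notebook.

```python
import csv
import io
from collections import Counter

# Hypothetical anonymized output: Age generalized, rare rows masked
ANONYMIZED = """Age,Sex,Cholesterol
30-45,M,289
*****,*,180
30-45,M,283
*****,*,214
*****,*,195
30-45,M,339
"""

def is_masked(value):
    """True if the value consists only of asterisks."""
    return value != "" and set(value) == {"*"}

def group_sizes(csv_text, columns):
    """Return (Counter of unmasked combinations, number of masked rows)."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    indices = [header.index(name) for name in columns]
    sizes = Counter()
    masked = 0
    for row in reader:
        key = tuple(row[i] for i in indices)
        if all(is_masked(v) for v in key):
            masked += 1
        else:
            sizes[key] += 1
    return sizes, masked

sizes, masked = group_sizes(ANONYMIZED, ["Age", "Sex"])
print("Group sizes:", dict(sizes))
print("Masked rows:", masked)
print("Smallest group:", min(sizes.values(), default=0))
```

If the smallest group is at least k, every unmasked combination satisfies the k-anonymity requirement for the chosen columns.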


## Step 5: Experimentation and Customization
Experiment with attribute adjustments to explore diverse scenarios:

- Modify k to observe its effects on anonymity.
- Investigate different combinations of sensitive_features to discern their individual significance.
- Customize generalization_intervals to fine-tune the degree of attribute generalization.
- Take note of the trade-off between ensuring privacy and retaining data utility across various scenarios.
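
To try feature combinations systematically rather than listing them by hand, `itertools.combinations` can generate them. The sketch below is one possible approach; the helper name `feature_scenarios` and the candidate column list are assumptions chosen to match the columns used in this notebook.

```python
from itertools import combinations

# Candidate columns (assumption): the ones used in this notebook's experiments
candidates = ["Age", "Cholesterol", "FastingBS"]

def feature_scenarios(columns, min_size=2):
    """List every combination of `columns` with at least `min_size` members."""
    scenarios = []
    for size in range(min_size, len(columns) + 1):
        scenarios.extend(list(combo) for combo in combinations(columns, size))
    return scenarios

scenarios = feature_scenarios(candidates)
for scenario in scenarios:
    print(scenario)
```

Each resulting list can be passed as the third argument of `k_anonymize`. The number of combinations grows quickly with the number of candidate columns, so it helps to keep the candidate list short.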

In [None]:
k_anonymize('/content/sample_data/heart.csv', 3, quasiidentifiers)
anon_percentage=k_anonymize('/content/sample_data/heart.csv', 10, quasiidentifiers, generalization_intervals={"Age": 15})

In [14]:
# Step 5: Experimentation and Customization

# Experiment with different values of k and observe their effects on anonymity
def experiment_k_values():
    k_values = [3, 5, 10, 15, 20]
    for k in k_values:
        anon_percentage = k_anonymize(file_path, k, quasiidentifiers, generalization_intervals)
        print(f"For k = {k}, must be anonymized: {anon_percentage:.2f}% of the dataset")

# Experiment with different combinations of sensitive features
def experiment_quasiidentifiers():
    scenarios = [
        ["Age", "Cholesterol"],
        ["Cholesterol", "FastingBS"],
        ["Age", "Cholesterol", "FastingBS"]
    ]
    for quasiidentifiers in scenarios:
        anon_percentage = k_anonymize(file_path, k, quasiidentifiers, generalization_intervals)
        print(f"Sensitive features: {', '.join(quasiidentifiers)}, must be anonymized: {anon_percentage:.2f}% of the dataset")

# Experiment with customizing generalization_intervals
def experiment_generalization():
    customized_generalization_intervals = {"Age": 10, "Cholesterol": 50}
    anon_percentage = k_anonymize(file_path, k, quasiidentifiers, customized_generalization_intervals)
    print(f"Customized generalization, must be anonymized: {anon_percentage:.2f}% of the dataset")

# Call the experimentation functions
experiment_k_values()
experiment_quasiidentifiers()
experiment_generalization()


For k = 3, must be anonymized: 42.05% of the dataset
For k = 5, must be anonymized: 70.15% of the dataset
For k = 10, must be anonymized: 81.26% of the dataset
For k = 15, must be anonymized: 81.26% of the dataset
For k = 20, must be anonymized: 81.26% of the dataset
Sensitive features: Age, Cholesterol, must be anonymized: 81.26% of the dataset
Sensitive features: Cholesterol, FastingBS, must be anonymized: 81.26% of the dataset
Sensitive features: Age, Cholesterol, FastingBS, must be anonymized: 82.14% of the dataset
Customized generalization, must be anonymized: 9.15% of the dataset


# Conclusion
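This notebook walked through the k_anonymize function step by step: optional generalization of numeric columns into intervals, counting each combination of quasi-identifier values, and masking the rows whose combination occurs fewer than k times. The experiments illustrate the trade-off between privacy and data utility. With Age generalized in intervals of 15, raising k from 3 to 10 increased the share of masked rows from 42.05% to 81.26%, while also generalizing Cholesterol (Age in intervals of 10, Cholesterol in intervals of 50) reduced it to 9.15% at k = 10. Choosing k and the generalization intervals together is therefore what lets you balance privacy preservation against data utility in real-world applications.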