# Data Anonymization Techniques

- Masking: Hiding part of sensitive data (e.g., email, SSN, phone).
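- Hashing: Replacing data with one-way hash values (useful for IDs that must still match across tables; low-entropy values such as SSNs need a salt or secret key, because an unsalted hash can be guessed by hashing every candidate).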
- Tokenization/Pseudonymization: Replacing sensitive values with random tokens that can be reversed later using a lookup table.
- Redaction/Removal: Completely removing sensitive columns or rows.
- Generalization: Reducing precision to hide exact values (e.g., age ranges instead of exact ages).
- Noise Addition: Altering values slightly to hide exact data (esp. for analytics).
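- Named Entity Recognition (NER): Automatically detecting sensitive entities (names, emails, SSNs) so they can be replaced or redacted.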
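
Tokenization is the one technique in this list that the cells below do not exercise. A minimal pure-Python sketch of the idea, assuming an in-memory dictionary stands in for a secured token vault; `TokenVault`, `tokenize`, and `detokenize` are hypothetical names chosen for illustration, not part of any library used here:

```python
import secrets


class TokenVault:
    """Hypothetical in-memory vault; a real one would sit in a secured store."""

    def __init__(self):
        self._value_to_token = {}
        self._token_to_value = {}

    def tokenize(self, value: str) -> str:
        # Reuse the existing token so the same input always maps to the same token.
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = "tok_" + secrets.token_hex(8)  # random, carries no information about value
        self._value_to_token[value] = token
        self._token_to_value[token] = value
        return token

    def detokenize(self, token: str) -> str:
        # Reversal is only possible for whoever holds the lookup table.
        return self._token_to_value[token]


vault = TokenVault()
token = vault.tokenize("285-01-2616")
original = vault.detokenize(token)
```

Unlike a hash, the token is not derived from the value, so it cannot be reversed by guessing inputs; access to the lookup table is what has to be protected.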

[Microsoft Presidio](https://microsoft.github.io/presidio/)

In [1]:
# setup

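from pyspark.sql import SparkSession
# explicit imports avoid shadowing Python built-ins such as sum, min, and max
from pyspark.sql.functions import col, rand, regexp_replace, sha2, when

spark = SparkSession.builder.appName("UsersAnonymization").getOrCreate()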



Picked up JAVA_TOOL_OPTIONS: -XX:+UseContainerSupport -XX:ActiveProcessorCount=1
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/07/03 19:00:21 WARN Utils: Your hostname, krishnagopi-trng2224dat-g3q9nc1wf47, resolves to a loopback address: 127.0.0.1; using 10.0.5.2 instead (on interface eth0)
25/07/03 19:00:21 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/07/03 19:00:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/07/03 19:00:23 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


In [2]:
# load the user data

users_df = spark.read.csv("/workspace/TRNG-2224-data-engineering/week2/datasets/user_data.csv", header=True, inferSchema=True)


In [3]:
# masking email 

df_anonymous = users_df.withColumn("email_masked", regexp_replace("email", r"(^[^@]+)", "*****")).drop("email")

df_anonymous.show(9)


+---------------+-------------------+-----------+--------------------+----------+----------------+---------+---+--------------------+
|           name|              phone|        ssn|             address|       dob|     credit_card|   salary|age|        email_masked|
+---------------+-------------------+-----------+--------------------+----------+----------------+---------+---+--------------------+
|   Allison Hill|       890.838.6379|285-01-2616|2351 Noah Knolls ...|1997-04-12|3593103413164756|106731.22| 19|    *****@miller.com|
| Henry Santiago|       503.056.4139|898-92-5156|7242 Julie Plain ...|2005-12-07|    502022691666|118986.06| 33| *****@gray-mayo.net|
|     Julie King|(270)482-8148x93252|553-68-0010|15430 Natalie Com...|1990-05-30|3538346578713317| 56785.29| 65|*****@mack-peters...|
|     Kevin Hall|       834-738-2997|249-61-6670|656 Owens Stream,...|1977-08-04|2243387262473170| 42299.42| 65|     *****@jones.com|
|Savannah Garcia| (026)064-7468x7234|223-08-9490|500 Shaw Walk
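
The same partial-masking idea applies to the other fields the list mentions, such as SSNs. A small pure-Python sketch of the regex logic, assuming SSNs follow the `NNN-NN-NNNN` layout seen in the output above; `mask_email` and `mask_ssn` are hypothetical helper names, and the sample email address is made up:

```python
import re


def mask_email(email: str) -> str:
    """Replace everything before the @ with a fixed mask, keeping the domain."""
    return re.sub(r"^[^@]+", "*****", email)


def mask_ssn(ssn: str) -> str:
    """Mask the first five digits of an NNN-NN-NNNN value, keeping the last four."""
    return re.sub(r"^\d{3}-\d{2}", "***-**", ssn)


masked_email = mask_email("someone@miller.com")
masked_ssn = mask_ssn("285-01-2616")
```

In the Spark cell, the SSN pattern should work the same way with `regexp_replace("ssn", r"^\d{3}-\d{2}", "***-**")`, though that variant was not run as part of this notebook.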

In [4]:
# hashing SSN and credit card number

# credit_card is inferred as a numeric column, so cast it to string before hashing
df_anonymous = df_anonymous.withColumn("ssn_hashed", sha2(col("ssn"), 256)) \
    .withColumn("credit_card_hashed", sha2(col("credit_card").cast("string"), 256)) \
    .drop("ssn", "credit_card")

df_anonymous.show(9)




+---------------+-------------------+--------------------+----------+---------+---+--------------------+--------------------+--------------------+
|           name|              phone|             address|       dob|   salary|age|        email_masked|          ssn_hashed|  credit_card_hashed|
+---------------+-------------------+--------------------+----------+---------+---+--------------------+--------------------+--------------------+
|   Allison Hill|       890.838.6379|2351 Noah Knolls ...|1997-04-12|106731.22| 19|    *****@miller.com|008086ac7ca8b9a6e...|0c2d6bde8ca8b897c...|
| Henry Santiago|       503.056.4139|7242 Julie Plain ...|2005-12-07|118986.06| 33| *****@gray-mayo.net|2c3bcc4e0548a22d0...|d90061ff1ffcae495...|
|     Julie King|(270)482-8148x93252|15430 Natalie Com...|1990-05-30| 56785.29| 65|*****@mack-peters...|10f969724f1d084a2...|950ba19626f680c22...|
|     Kevin Hall|       834-738-2997|656 Owens Stream,...|1977-08-04| 42299.42| 65|     *****@jones.com|5319fce4d03ef7
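
A plain SHA-256 of an SSN can be reversed by hashing every possible nine-digit value and comparing, since the input space is only about a billion candidates. One common mitigation is a keyed hash (HMAC). The sketch below uses only the Python standard library; `keyed_hash` is a hypothetical helper and `DEMO_KEY` is a placeholder, on the assumption that a real key would come from a secrets manager:

```python
import hashlib
import hmac

DEMO_KEY = b"demo-key-do-not-use"  # placeholder; load a real key from a secret store


def plain_hash(value: str) -> str:
    """Unkeyed SHA-256 hex digest of the UTF-8 encoded value."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def keyed_hash(value: str, key: bytes) -> str:
    """HMAC-SHA256 hex digest; cannot be recomputed without the key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


ssn = "285-01-2616"
digest = keyed_hash(ssn, DEMO_KEY)
```

The keyed digest is still deterministic, so equal inputs still produce equal outputs and joins on the hashed column keep working.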

In [5]:
# generalization

# conditions are checked in order, so each branch only sees ages not matched above it
df_anonymous = df_anonymous.withColumn(
    "age_range",
    when(col("age") < 18, "<18")
    .when(col("age") < 30, "18-29")
    .when(col("age") < 50, "30-49")
    .otherwise("50+")
).drop("age", "dob")
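
The bucket boundaries in the cell above are easy to get wrong by one, so it can help to mirror the logic in plain Python and check the edges. This is only a sketch: `age_range` reproduces the same thresholds, and `birth_decade` is a hypothetical alternative to dropping `dob` outright, assuming dates are `YYYY-MM-DD` strings as in the output shown earlier:

```python
def age_range(age: int) -> str:
    """Same thresholds as the Spark expression; the first matching branch wins."""
    if age < 18:
        return "<18"
    if age < 30:
        return "18-29"
    if age < 50:
        return "30-49"
    return "50+"


def birth_decade(dob: str) -> str:
    """Reduce a YYYY-MM-DD date to its decade, e.g. a 1997 date becomes '1990s'."""
    year = int(dob[:4])
    return f"{year // 10 * 10}s"


buckets = [age_range(a) for a in (17, 18, 29, 30, 49, 50)]
decade = birth_decade("1997-04-12")
```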




In [6]:
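# noise addition to salary
# note: rand() * 100000 - 500 draws an offset from [-500, 99500), so salaries shift upward by up to ~99,500;
# a symmetric +/-500 perturbation would be rand() * 1000 - 500

df_anonymous = df_anonymous.withColumn("salary_noise", (col("salary") + (rand() * 100000 - 500)).cast("double")).drop("salary")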

df_anonymous.show(9)



+---------------+-------------------+--------------------+--------------------+--------------------+--------------------+---------+------------------+
|           name|              phone|             address|        email_masked|          ssn_hashed|  credit_card_hashed|age_range|      salary_noise|
+---------------+-------------------+--------------------+--------------------+--------------------+--------------------+---------+------------------+
|   Allison Hill|       890.838.6379|2351 Noah Knolls ...|    *****@miller.com|008086ac7ca8b9a6e...|0c2d6bde8ca8b897c...|    18-29|168393.67545680012|
| Henry Santiago|       503.056.4139|7242 Julie Plain ...| *****@gray-mayo.net|2c3bcc4e0548a22d0...|d90061ff1ffcae495...|    30-49| 163552.3417348249|
|     Julie King|(270)482-8148x93252|15430 Natalie Com...|*****@mack-peters...|10f969724f1d084a2...|950ba19626f680c22...|      50+| 80278.35866307226|
|     Kevin Hall|       834-738-2997|656 Owens Stream,...|     *****@jones.com|5319fce4d03ef76
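
The offset used in the cell above, `rand() * 100000 - 500`, ranges from -500 to just under 99,500, which is why the noisy salaries in the output sit well above the originals. If the goal is to alter values only slightly while keeping averages usable for analytics, a zero-mean perturbation is one option. A pure-Python sketch, where `add_noise` and the `spread` parameter are hypothetical names:

```python
import random


def add_noise(value: float, spread: float = 500.0, rng=None) -> float:
    """Add uniform noise drawn from [-spread, +spread] to value."""
    rng = rng or random
    return value + rng.uniform(-spread, spread)


rng = random.Random(0)  # seeded only so the example is repeatable
salary = 106731.22
noisy = [add_noise(salary, rng=rng) for _ in range(10_000)]
mean_noisy = sum(noisy) / len(noisy)
```

Because the noise is centered on zero, the mean of many noisy values stays close to the true mean, while each individual value is off by at most `spread`. In the Spark expression the symmetric equivalent would be `rand() * 1000 - 500`.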

In [7]:
# save the changes

df_anonymous.write.saveAsTable("users_cleaned")


                                                                                

In [8]:
! pip install presidio-analyzer presidio-structured presidio-anonymizer faker pandas && python -m spacy download en_core_web_lg


Defaulting to user installation because normal site-packages is not writeable
/home/gitpod/.pyenv/versions/3.12.11/bin/python: No module named spacy


In [9]:
# NER

import pandas as pd
from faker import Faker
from presidio_structured import StructuredEngine, PandasAnalysisBuilder
from presidio_anonymizer.entities import OperatorConfig
from datetime import datetime


pandas_users_df = users_df.toPandas()


In [10]:

pandas_users_df = pandas_users_df.astype("str")

In [12]:
pandas_engine = StructuredEngine()
tabular_analysis = PandasAnalysisBuilder().generate_analysis(pandas_users_df)



In [14]:
fake = Faker()

operators = {
    "PERSON": OperatorConfig("replace", {"new_value": "REDACTED"}),
    "EMAIL_ADDRESS": OperatorConfig("custom", {"lambda": lambda x: fake.safe_email()}),
    "US_SSN": OperatorConfig("custom", {"lambda": lambda x: fake.ssn()}),
    "DATE_TIME": OperatorConfig("custom", {
        "lambda": lambda x: fake.date_between_dates(
            date_start=datetime(1940, 1, 1), date_end=datetime(2010, 1, 1))
    }),
    "US_BANK_NUMBER": OperatorConfig("replace", {"new_value": "REDACTED"}),
}

In [15]:
anonymized_pd_df = pandas_engine.anonymize(pandas_users_df, tabular_analysis, operators=operators)



In [17]:
anonymized_users_df = spark.createDataFrame(anonymized_pd_df)

anonymized_users_df.drop("age").show()

+--------+--------------------+------+-----------+-------+----------+-----------+------+
|    name|               email| phone|        ssn|address|       dob|credit_card|salary|
+--------+--------------------+------+-----------+-------+----------+-----------+------+
|REDACTED|vincentjennifer@e...|<None>|627-82-2874| <None>|1970-07-11|   REDACTED|<None>|
|REDACTED| hmonroe@example.net|<None>|178-13-5068| <None>|2004-11-21|   REDACTED|<None>|
|REDACTED|nicolereeves@exam...|<None>|787-18-9687| <None>|1949-08-10|   REDACTED|<None>|
|REDACTED|pjohnston@example...|<None>|617-04-1243| <None>|2007-07-04|   REDACTED|<None>|
|REDACTED| ntucker@example.net|<None>|209-10-1273| <None>|1959-04-30|   REDACTED|<None>|
|REDACTED|kimberlymoreno@ex...|<None>|459-49-4490| <None>|2002-11-23|   REDACTED|<None>|
|REDACTED|bradley52@example...|<None>|842-09-0562| <None>|1989-11-10|   REDACTED|<None>|
|REDACTED| david57@example.com|<None>|756-88-0021| <None>|1969-10-23|   REDACTED|<None>|
|REDACTED|jefferywats