### 3.1.3 Using the techniques you applied in Assignment #1, apply a masking or transformation mechanism to modify the detected PII elements and substitute with suitable replacements.
In the following section, we will apply techniques similar to those used in Assignment #1 to mask or transform Personally Identifiable Information (PII) detected in a dataset. The goal is to substitute these sensitive elements with suitable replacements while maintaining the overall structure and coherence of the data. 

We start by anonymizing the eight categories with help from the Faker library:
- PERSON: Replaces names of people with fake names generated by Faker.
- GPE (Geopolitical Entities): Substitutes names of countries, cities, states, etc., with random city names using Faker.
- DATE: Transforms dates into random future dates. Relative expressions like 'today', 'tomorrow', or 'next week' are replaced with concrete dates that correspond to these descriptions.
- ORG (Organizations): Changes names of organizations, companies, agencies, etc., to random company names generated by Faker.
- NORP (Nationalities, Religious or Political Groups): Replaces nationalities, religions, and political group names with random country names, implying a change in nationality.
- CARDINAL (Numerals): Alters numerical values to be close to the original number but not exact, within a ±10% range or ±3, whichever is greater.
- ORDINAL (Ordinal Numbers): Generates random ordinal numbers (like 1st, 2nd, 3rd, etc.) to replace existing ordinal numbers.
- TIME: Changes time mentions to random times. Specific times of the day like 'morning' or 'evening' are replaced with times corresponding to those periods.

In [1]:
import pandas as pd
df = pd.read_csv("PII_tweet_emotions.csv")

In [2]:
import spacy
nlp = spacy.load('en_core_web_lg')

In [3]:
from datetime import datetime, timedelta
import random
from faker import Faker
import re

fake = Faker()

def close_number(original_number):
    try:
        num = int(original_number)
        # Generate a number within ±10% of the original number or ±3, whichever is greater
        percentage_variation = int(num * 0.1)
        min_variation = 3  # Minimum variation
        variation = max(min_variation, percentage_variation)
        return str(random.randint(max(0, num - variation), num + variation))
    except ValueError:
        # Return the original number if it's not an integer
        return original_number
    
def fake_ordinal():
    number = fake.random_int(min=1, max=100)
    suffix = ["th", "st", "nd", "rd"] + ["th"] * 6
    return str(number) + suffix[number % 10 if number % 100 not in [11, 12, 13] else 0]
    
def replace_date(entity_text):
    today = datetime.today()
    if entity_text.lower() in ['today']:
        new_date = fake.date_between(start_date=today, end_date=today)
    elif entity_text.lower() in ['tomorrow']:
        new_date = fake.date_between(start_date=today + timedelta(days=1), end_date=today + timedelta(days=1))
    elif entity_text.lower() in ['next week']:
        new_date = fake.date_between(start_date=today + timedelta(days=7), end_date=today + timedelta(days=14))
    else:
        # For general date entities, return a random future date
        new_date = fake.future_date()
    
    return new_date.strftime("%Y-%m-%d")  # Convert the date to a string


def replace_time(entity_text):
    # Define time ranges
    morning_times = [f"{hour:02d}:{minute:02d} AM" for hour in range(6, 12) for minute in range(0, 60)]
    evening_times = [f"{hour:02d}:{minute:02d} PM" for hour in range(6, 12) for minute in range(0, 60)]

    if entity_text.lower() in ['morning']:
        return random.choice(morning_times)
    elif entity_text.lower() in ['evening', 'tonight']:
        return random.choice(evening_times)
    else:
        # For general time entities, return any random time
        return fake.time()

def replace_pii_with_fake(text):
    # Replace Twitter @username with @ followed by a fake first name
    text = re.sub(r'@(\w+)', lambda x: '@' + fake.first_name(), text)

    # Process the text using spaCy to identify named entities
    doc = nlp(text)

    # Iterate over the identified entities
    for ent in doc.ents:
        # Replace with fake data based on the entity type
        if ent.label_ == 'PERSON':
            text = re.sub(re.escape(ent.text), fake.name(), text)
        elif ent.label_ == 'GPE':
            text = re.sub(re.escape(ent.text), fake.city(), text)
        elif ent.label_ == 'DATE':
            text = re.sub(re.escape(ent.text), replace_date(ent.text), text)
        elif ent.label_ == 'ORG':
            text = re.sub(re.escape(ent.text), fake.company(), text)
        elif ent.label_ == 'NORP':
            text = re.sub(re.escape(ent.text), fake.country(), text)
        elif ent.label_ == 'CARDINAL':
            text = re.sub(re.escape(ent.text), lambda x: close_number(ent.text), text)
        elif ent.label_ == 'ORDINAL':
            text = re.sub(re.escape(ent.text), fake_ordinal(), text)
        elif ent.label_ == 'TIME':
            text = re.sub(re.escape(ent.text), replace_time(ent.text), text)
    return text

# Apply the function to the DataFrame
df['content'] = df['content'].apply(replace_pii_with_fake)

# Save the modified DataFrame to a new CSV file
df.to_csv("Anonymized_PII_tweet_emotions.csv", index=False)
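One caveat of the cell above: `re.sub` replaces every occurrence of the entity text in the tweet, including occurrences inside longer tokens (a CARDINAL `5` also matches the `5` in `2025`). A minimal sketch of an alternative is shown below, assuming the entity positions are available as character offsets (spaCy exposes them as `ent.start_char` and `ent.end_char`). The helper name `replace_spans` and the example sentence are hypothetical and not part of the notebook.

```python
def replace_spans(text, spans):
    """Replace (start, end, replacement) character spans in text.

    Spans are applied from right to left so that the offsets of
    earlier spans stay valid after each substitution.
    """
    for start, end, replacement in sorted(spans, reverse=True):
        text = text[:start] + replacement + text[end:]
    return text


# Hypothetical example: only the standalone "5" is an entity,
# not the "5" inside "2025".
text = "Room 2025 seats 5 people"
spans = [(16, 17, "7")]
result = replace_spans(text, spans)
print(result)
```

With spaCy, the span list could be built as `[(ent.start_char, ent.end_char, replacement) for ent in doc.ents]`, with `replacement` chosen per entity label as in `replace_pii_with_fake`.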

After anonymizing the eight categories, we run PII detection on the anonymized dataset, reusing the spaCy pipeline from task 3.1.2, and take a look at the result:

In [4]:
import spacy

nlp = spacy.load('en_core_web_lg')

# Function to identify PII using spaCy
def identify_pii(text):
    # Process the text using spaCy to identify named entities
    doc = nlp(text)
    pii_entities = [(ent.text, ent.label_) for ent in doc.ents]
    return pii_entities

# Entities detected in the anonymized text
pii_anonymized = df['content'].apply(identify_pii)

df['PII'] = pii_anonymized
df.to_csv("Anonymized_PII_tweet_emotions.csv", index=False)
df = pd.read_csv("Anonymized_PII_tweet_emotions.csv")
print(df.iloc[:8])

     tweet_id   sentiment                                            content  \
0  1956967341       empty  @Vincent i know  i was listenin to bad habit e...   
1  1956967666     sadness  Layin n bed with a headache  ughhhh...waitin o...   
2  1956967696     sadness            Funeral ceremony...gloomy 2024-01-29...   
3  1956967789  enthusiasm               wants to hang out with friends SOON!   
4  1956968416     neutral  @Madeline We want to trade with someone who ha...   
5  1956968477       worry  Re-pinging @Julie: why didn't you go to prom? ...   
6  1956968487     sadness  I should be sleep, but im not! thinking about ...   
7  1956968636       worry               Hmmm. http://www.djhero.com/ is down   

                              PII  
0                              []  
1                              []  
2        [('2024-01-29', 'DATE')]  
3                              []  
4       [('Anthonyburgh', 'ORG')]  
5  [('Jonathan Ayala', 'PERSON')]  
6             [('5', 'CARDI

As we can see, the PII entities identified by the spaCy model now consist of our newly generated replacement values.

### 3.1.4 Analyse the text to determine if any information can be obtained after the transformation process. What conclusions can you draw from this?

The code in this section first calculates the semantic similarity between the texts in the original dataset and their anonymized counterparts, assessing how well the anonymization process has retained the original text's meaning. It then checks whether any Personally Identifiable Information (PII) from the original dataset remains in the anonymized dataset, to verify that the anonymization actually protects privacy.

The first script assesses how well the anonymization process preserved the semantic content of the original dataset. It uses MobileBERT to generate text embeddings, which represent the semantic essence of a text. By calculating the cosine similarity between the embeddings of corresponding entries in the original and anonymized datasets, the script quantifies their semantic similarity. High similarity scores indicate little to no semantic change.
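To make the metric concrete, here is a minimal sketch of mean pooling followed by cosine similarity on made-up two-dimensional token vectors. It uses only the standard library and is not the MobileBERT pipeline itself. The function names and the toy vectors are assumptions chosen for illustration.

```python
import math


def mean_pool(token_vectors):
    """Average a list of equally sized token vectors into one vector."""
    n = len(token_vectors)
    return [sum(component) / n for component in zip(*token_vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "token embeddings" for an original and an anonymized text.
original = mean_pool([[1.0, 0.0], [0.0, 1.0]])    # [0.5, 0.5]
anonymized = mean_pool([[1.0, 0.0], [1.0, 1.0]])  # [1.0, 0.5]
score = cosine_similarity(original, anonymized)
print(round(score, 4))
```

A score of 1 means the two pooled vectors point in the same direction. The further the anonymized embedding drifts from the original, the lower the score.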

In [7]:
import pandas as pd
import torch
from scipy.spatial.distance import cosine
import numpy as np
from transformers import MobileBertTokenizer, MobileBertModel


# Load your datasets
df = pd.read_csv("tweet_emotions.csv")  # Make sure you've loaded the original dataset into 'df'
df_anonymized = pd.read_csv("Anonymized_PII_tweet_emotions.csv")

# Set device to GPU if available, else CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Initialize tokenizer and model
tokenizer = MobileBertTokenizer.from_pretrained('google/mobilebert-uncased')
model = MobileBertModel.from_pretrained('google/mobilebert-uncased').to(device)


# Compute a mean-pooled embedding for a text; inputs are moved to the selected device
def get_embedding(text, tokenizer, model):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze().cpu().numpy()

# Function to calculate semantic similarity
def semantic_similarity(text1, text2, tokenizer, model):
    emb1 = get_embedding(text1, tokenizer, model)
    emb2 = get_embedding(text2, tokenizer, model)
    # Ensure embeddings are 1-D
    emb1 = np.squeeze(emb1)
    emb2 = np.squeeze(emb2)
    # scipy's cosine() returns a distance, so similarity = 1 - distance
    return 1 - cosine(emb1, emb2)

# Calculate similarities
try:
    similarity_scores = [semantic_similarity(orig, anon, tokenizer, model) for orig, anon in zip(df['content'], df_anonymized['content'])]
    df_anonymized['scores'] = similarity_scores
except ValueError as e:
    print(f"Error calculating similarity: {e}")
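The cell above stores the scores but does not summarize them. A minimal sketch of one way to do that follows, using only the standard library. The helper `summarize_scores`, the cut-off of 0.9 and the example values are assumptions for illustration; they are not results from the dataset.

```python
from statistics import mean, median


def summarize_scores(scores, threshold=0.9):
    """Summarize similarity scores.

    `threshold` is an arbitrary, illustrative cut-off below which a
    row is counted as noticeably changed by the anonymization.
    """
    low = [s for s in scores if s < threshold]
    return {
        "mean": mean(scores),
        "median": median(scores),
        "min": min(scores),
        "share_below_threshold": len(low) / len(scores),
    }


# Made-up scores for illustration only; in the notebook one would
# pass the `similarity_scores` list instead.
example_scores = [1.0, 0.5, 0.75, 0.75]
summary = summarize_scores(example_scores)
print(summary)
```

Reporting the share of rows below a cut-off alongside the mean helps, because a high average can hide a small group of tweets whose meaning changed substantially.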

The second script evaluates the effectiveness of the anonymization by comparing the PII fields of the original and anonymized datasets row by row. Using pandas, we load both datasets and check for identical PII entries, which mark rows where anonymization may not have been successful. Rows with an empty PII list in either dataset are excluded. We then report the number of rows whose PII changed versus those whose PII remained identical, giving a concise assessment of the anonymization process's success.

In [6]:
import pandas as pd

# Load the original dataset
original_df = pd.read_csv('PII_tweet_emotions.csv')

# Load the anonymized dataset
anonymized_df = pd.read_csv('Anonymized_PII_tweet_emotions.csv')

# Compare 'PII' columns, excluding empty 'PII' lists
original_df['Refined_PII_Match'] = (original_df['PII'] == anonymized_df['PII']) & \
                                   (original_df['PII'] != '[]') & \
                                   (anonymized_df['PII'] != '[]')

# Calculate matches and non-matches
refined_matches = original_df['Refined_PII_Match'].sum()
refined_total_non_empty = ((original_df['PII'] != '[]') & (anonymized_df['PII'] != '[]')).sum()
refined_non_matches = refined_total_non_empty - refined_matches

# Output the results
print(f"Out of {refined_total_non_empty} rows with non-empty 'PII' values:")
print(f"- {refined_matches} rows have 'PII' values that match between the original and anonymized datasets.")
print(f"- {refined_non_matches} rows have 'PII' values that do not match, indicating successful anonymization.")

Out of 21853 rows with non-empty 'PII' values:
- 1033 rows have 'PII' values that match between the original and anonymized datasets.
- 20820 rows have 'PII' values that do not match, indicating successful anonymization.
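The comparison above works on whole rows: a row counts as a match only if the complete PII string is identical. A finer-grained check could look at individual entities and test whether any original entity text still occurs in the anonymized tweet. The sketch below assumes the `PII` column holds the string form of a list of `(text, label)` tuples, as in the printed output of 3.1.3. The helper `leaked_entities` and the example row are hypothetical.

```python
import ast


def leaked_entities(original_pii, anonymized_text):
    """Return original entity strings that still occur verbatim
    in the anonymized text.

    `original_pii` is the string form of a list of (text, label)
    tuples, as stored in the 'PII' column after a CSV round trip.
    """
    entities = ast.literal_eval(original_pii)
    return [text for text, _label in entities if text in anonymized_text]


# Made-up row for illustration only.
pii = "[('Alice Smith', 'PERSON'), ('Berlin', 'GPE')]"
leaks = leaked_entities(pii, "Met Jonathan Ayala in Berlin yesterday")
print(leaks)
```

Because this is a plain substring test, very short entities such as single digits can produce false positives, so the result is better read as an upper bound on leakage than as an exact count.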


In conclusion, our anonymization algorithm altered the detected PII in over 95% of the tweets with non-empty PII (20820 of 21853 rows). However, in around 5% of these rows (1033) the detected PII was identical before and after anonymization, indicating areas where the process could be improved. There is no direct assessment of whether the anonymized text maintains the original sentiment or meaning, but high semantic similarity scores would suggest that the content's contextual integrity is largely preserved.


Please continue reading in 3.2.ipynb :)