# Assignment #4-5: Anonymising Textual Data and De-Anonymisation
- Dataset:  Tweets Emotions [Dataset](https://www.kaggle.com/datasets/pashupatigupta/emotion-detection-from-text?resource=download)
- Credits: The dataset was put together by Pashupati Gupta
- ToDo: Before running the Jupyter notebook, install the packages listed in requirements.txt (`pip install -r requirements.txt`)

## 3.1 Textual Data Anonymisation – 30 marks

### 3.1.1 Do some research to determine what needs to be anonymised in the data and why.
- To understand the structure of the dataset, we first display its attribute values.
    - Which columns does the dataset contain, and in what format are the attribute values?
        - To answer this, the first four rows of the dataset are printed.

In [1]:
import pandas as pd
df = pd.read_csv("tweet_emotions.csv")
print(df.iloc[:4])

     tweet_id   sentiment                                            content
0  1956967341       empty  @tiffanylue i know  i was listenin to bad habi...
1  1956967666     sadness  Layin n bed with a headache  ughhhh...waitin o...
2  1956967696     sadness                Funeral ceremony...gloomy friday...
3  1956967789  enthusiasm               wants to hang out with friends SOON!


Inspecting the columns and their formats shows that the 'content' attribute can contain explicit personally identifiable information (PII):
1. User Mentions: 
    - Any instance of @username should be anonymised because it directly points to an individual's account, which is considered personally identifiable information (PII).
2. First Names: 
    - If any first names are used in a context that can identify an individual, such as tagging in combination with other identifying information, they should be anonymised.
3. Locations and Specific References: 
    - Any mention of specific locations, addresses, landmarks, or establishments that could help in identifying an individual should be anonymised.
4. Specific Events with Identifiable Information: 
    - References to specific events that may lead to the identification of individuals, like parties or gatherings with a list of names, should be anonymised.
5. Unique Identifiers: 
    - Any other unique identifiers, such as specific dates, times, or unique events, that could potentially be linked back to an individual should be anonymised.
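
Of these categories, user mentions follow a fixed pattern and can be located without an NLP model. The following is a minimal sketch, assuming a mention is an `@` followed by word characters; the helper name `find_mentions` is hypothetical and not part of the notebook.

```python
import re

# Assumption: a mention is "@" followed by letters, digits or underscores,
# and is not preceded by a word character (which would suggest an e-mail address).
MENTION_PATTERN = re.compile(r'(?<!\w)@\w+')

def find_mentions(text):
    """Return all @username mentions found in a tweet."""
    return MENTION_PATTERN.findall(text)

print(find_mentions("@tiffanylue i know"))                    # ['@tiffanylue']
print(find_mentions("wants to hang out with friends SOON!"))  # []
```

The other categories (names, locations, events, dates) have no fixed surface form, which is why a named-entity recogniser is the more suitable tool for them.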


In addition, we explore the 'sentiment' attribute, because we do not yet know how many unique values it contains or whether they would qualify as PII:

In [2]:
print("total length of the dataframe: ", len(df))

# Calculate the number of unique values and the number of entries per unique value
unique_counts = df['sentiment'].nunique()
value_counts = df['sentiment'].value_counts()

print("number of unique values in sentiment: ", unique_counts)
print("counts per unique value in", value_counts)

total length of the dataframe:  40000
number of unique values in sentiment:  13
counts per unique value in sentiment
neutral       8638
worry         8459
happiness     5209
sadness       5165
love          3842
surprise      2187
fun           1776
relief        1526
hate          1323
empty          827
enthusiasm     759
boredom        179
anger          110
Name: count, dtype: int64


The 'sentiment' column contains 13 distinct values across the 40,000 tweets in the dataset. Each value is a coarse emotion label shared by at least 110 tweets, so it does not single out any individual. The 'sentiment' attribute therefore does not need to be anonymised.
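
The group sizes behind this conclusion can be checked directly. The sketch below copies the counts from the `value_counts()` output above into a plain dictionary (so it runs without the CSV file) and reports the rarest label and its share of the dataset.

```python
# Counts copied from the value_counts() output above
sentiment_counts = {
    'neutral': 8638, 'worry': 8459, 'happiness': 5209, 'sadness': 5165,
    'love': 3842, 'surprise': 2187, 'fun': 1776, 'relief': 1526,
    'hate': 1323, 'empty': 827, 'enthusiasm': 759, 'boredom': 179,
    'anger': 110,
}

total = sum(sentiment_counts.values())

# The rarest label determines the smallest group a tweet can fall into
rarest_label, rarest_count = min(sentiment_counts.items(), key=lambda kv: kv[1])

print("total tweets:", total)
print("rarest label:", rarest_label, "with", rarest_count, "tweets")
print("share of dataset:", f"{rarest_count / total:.3%}")
```

Even the rarest label is shared by more than a hundred tweets, so knowing a tweet's sentiment alone narrows it down only to a large group.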

### 3.1.2 Using a Natural Language Processing library (e.g. Python’s spaCy), analyse the text to identify elements of personally identifiable information (PII).
The goal of anonymisation is to remove or obscure details so that the individuals to whom the data pertains cannot be readily identified. The first step is to find the parts of the content that might contain PII.
To do so, we install 'en_core_web_sm', a pre-trained spaCy model for English that recognises named entities, some of which may be PII. The model was trained on a diverse range of web text, including blogs, news and comments. We chose 'en_core_web_sm' over, for example, 'en_core_web_trf' because of its balance between performance and resource usage.



In [3]:
# Run once to install the model (commented out to keep the notebook output clean):
# !python -m spacy download en_core_web_sm

In [4]:
import spacy

#Load the pre-trained spaCy model
nlp = spacy.load('en_core_web_sm')

# Function to identify PII using spaCy
def identify_pii(text):
    # Process the text using spaCy to identify named entities
    doc = nlp(text)
    pii_entities = [(ent.text, ent.label_) for ent in doc.ents]
    return pii_entities

pii_original = df['content'].apply(identify_pii)

df['PII'] = pii_original
df.to_csv("PII_tweet_emotions.csv", index=False)



In [5]:
#load new dataframe containing the PII information
df = pd.read_csv("PII_tweet_emotions.csv")
print(df.iloc[:5])

     tweet_id   sentiment                                            content  \
0  1956967341       empty  @tiffanylue i know  i was listenin to bad habi...   
1  1956967666     sadness  Layin n bed with a headache  ughhhh...waitin o...   
2  1956967696     sadness                Funeral ceremony...gloomy friday...   
3  1956967789  enthusiasm               wants to hang out with friends SOON!   
4  1956968416     neutral  @dannycastillo We want to trade with someone w...   

                        PII  
0  [('@tiffanylue', 'ORG')]  
1                        []  
2      [('friday', 'DATE')]  
3                        []  
4      [('Houston', 'GPE')]  


Each non-empty list in the new 'PII' column means that the spaCy model has found text segments in that row which it considers named entities. Each entity carries a label that classifies its type, e.g. DATE, PERSON, ORG (organisation) or GPE (geopolitical entity). The labels are not always correct: in row 0, the user mention '@tiffanylue' is tagged as ORG. These named entities can be treated as PII, because they might identify an individual either directly or in combination with other information. Consequently, every row with a non-empty list in the 'PII' column needs some form of anonymisation so that the detected entities can no longer be used to identify an individual.
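
Because the DataFrame was written to CSV and read back, each 'PII' cell is now a string such as `"[('friday', 'DATE')]"` rather than a Python list. A minimal sketch of how rows that need anonymisation could be flagged, assuming that string format; the helper name `has_detected_entities` is hypothetical.

```python
import ast

def has_detected_entities(pii_cell):
    """Return True if a 'PII' cell (a stringified list of (text, label) tuples) is non-empty."""
    # literal_eval safely turns the string back into a list of tuples
    entities = ast.literal_eval(pii_cell)
    return len(entities) > 0

# Cell values taken from the first three rows printed above
sample_cells = ["[('@tiffanylue', 'ORG')]", "[]", "[('friday', 'DATE')]"]
print([has_detected_entities(cell) for cell in sample_cells])  # [True, False, True]
```

Applied to the whole column, e.g. with `df['PII'].apply(has_detected_entities)`, this would yield a boolean mask of the rows to process.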

### 3.1.3 Using the techniques you applied in Assignment #1, apply a masking or transformation mechanism to modify the detected PII elements and substitute with suitable replacements.
In this section, we apply techniques similar to those used in Assignment #1 to mask or transform the PII detected in the dataset. The goal is to substitute these sensitive elements with suitable replacements while preserving the overall structure and coherence of the data.

First, we count how many occurrences of each entity category the new 'PII' column contains. This information is needed to plan the subsequent anonymisation steps.

In [15]:
import collections
import ast 

# Function to extract entities from a string and return their labels
def extract_labels(data_string):
    # Convert string representation of list to actual list
    entities = ast.literal_eval(data_string)
    # Extract labels
    return [label for _, label in entities]

# Extract labels from each item in the data
all_labels = [label for item in df['PII'] for label in extract_labels(item)]

# Count occurrences of each label
label_counts = collections.Counter(all_labels)

print(label_counts)
print(len(label_counts))

Counter({'PERSON': 8403, 'ORG': 7517, 'DATE': 7360, 'CARDINAL': 3814, 'GPE': 3264, 'TIME': 2949, 'NORP': 840, 'ORDINAL': 731, 'PRODUCT': 400, 'MONEY': 335, 'WORK_OF_ART': 335, 'LOC': 291, 'FAC': 190, 'QUANTITY': 182, 'EVENT': 79, 'PERCENT': 71, 'LANGUAGE': 65, 'LAW': 29})
18


We now know that the model detected 18 different entity types that should be anonymised in some way. We start with the types that are easiest to replace using the Faker library:

In [16]:
df = pd.read_csv("PII_tweet_emotions.csv")

In [None]:
from faker import Faker
import re

fake = Faker()

# Map each handled spaCy entity label to a Faker generator for its replacement.
# NORP (nationalities, religious or political groups) is approximated by a country name.
FAKE_GENERATORS = {
    'PERSON': fake.name,
    'GPE': fake.city,
    'DATE': fake.date,
    'ORG': fake.company,
    'NORP': fake.country,
    'CARDINAL': lambda: str(fake.random_number()),
    'TIME': fake.time,
}

def replace_pii_with_fake(text):
    # Process the text using spaCy to identify named entities
    doc = nlp(text)
    # Replace every occurrence of each handled entity with fake data of the same type
    for ent in doc.ents:
        generator = FAKE_GENERATORS.get(ent.label_)
        if generator is not None:
            replacement = generator()
            # A callable replacement stops re.sub from interpreting backslashes in the fake value
            text = re.sub(re.escape(ent.text), lambda _: replacement, text)

    # Replace the user name after every remaining Twitter @ with a fake user name
    text = re.sub(r'(?<=@)\w+', fake.user_name(), text)
    return text


df['content'] = df['content'].apply(replace_pii_with_fake)

# Save the modified DataFrame to a new CSV file
df.to_csv("Anonymized_PII_tweet_emotions.csv", index=False)
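
Masking is an alternative to substituting fake values: each detected entity is replaced by a placeholder naming its type. The sketch below uses only the standard library and takes the entity list as input, so it runs without spaCy or Faker. The helper name `mask_entities` and the `<LABEL>` placeholder format are assumptions for illustration, not part of the notebook.

```python
import re

def mask_entities(text, entities):
    """Replace every occurrence of each (entity_text, label) pair with a <LABEL> placeholder."""
    # Longest entities first, so a short entity cannot break up a longer one that contains it
    for entity_text, label in sorted(entities, key=lambda e: len(e[0]), reverse=True):
        text = re.sub(re.escape(entity_text), lambda _: f"<{label}>", text)
    return text

# Tweet text and entity taken from row 2 of the output shown earlier
print(mask_entities("Funeral ceremony...gloomy friday...", [("friday", "DATE")]))
# Funeral ceremony...gloomy <DATE>...
```

Placeholders keep the entity type visible for later analysis but, unlike Faker values, make it obvious to a reader that the text has been altered.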

### 3.1.4 Analyse the text to determine if any information can be obtained after the transformation process. What conclusions can you draw from this?

In [7]:
df_anonymized = pd.read_csv("Anonymized_PII_tweet_emotions.csv")

pii_anonymised = df_anonymized['content'].apply(identify_pii)


# Define the PII labels of interest
pii_labels = {'PERSON', 'GPE', 'DATE', 'ORG', 'NORP'}

# Filter the entities by the labels of interest
pii_original_filtered = [(text, label) for text, label in pii_original if label in pii_labels]
pii_anonymised_filtered = [(text, label) for text, label in pii_anonymised if label in pii_labels]

# Convert lists to sets
set_pii_original = set(pii_original_filtered)
set_pii_anonymised = set(pii_anonymised_filtered)

# Find common elements
common_pii = set_pii_original.intersection(set_pii_anonymised)

# Check if there are any common elements
if len(common_pii) > 0:
    print("The lists have common PII values:", common_pii)
else:
    print("The lists do not have any common PII values.")


ValueError: not enough values to unpack (expected 2, got 1)
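
The `ValueError` arises because `pii_original` and `pii_anonymised` are Series whose elements are lists of `(text, label)` tuples, so iterating over them yields whole lists rather than pairs. The following is a minimal sketch of one way to flatten them before filtering. It uses plain lists in place of the Series so that it runs on its own; the helper name `flatten_filtered` is hypothetical, and the anonymised sample values are made up for illustration.

```python
from itertools import chain

pii_labels = {'PERSON', 'GPE', 'DATE', 'ORG', 'NORP'}

def flatten_filtered(per_tweet_entities, labels):
    """Flatten per-tweet entity lists into one set of (text, label) pairs with the given labels."""
    return {
        (text, label)
        for text, label in chain.from_iterable(per_tweet_entities)
        if label in labels
    }

# Entities per tweet: the original values come from the output shown earlier,
# the anonymised values are made up for this example.
original = [[('@tiffanylue', 'ORG')], [], [('friday', 'DATE')], [('Houston', 'GPE')]]
anonymised = [[('Smith Ltd', 'ORG')], [], [('friday', 'DATE')], []]

common_pii = flatten_filtered(original, pii_labels) & flatten_filtered(anonymised, pii_labels)
print(common_pii)  # {('friday', 'DATE')}
```

Because `chain.from_iterable` accepts any iterable of lists, the same helper should work when passed `pii_original` and `pii_anonymised` directly.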