# Assignment #4-5: Anonymising Textual Data and De-Anonymisation
- Dataset:  Tweets Emotions [Dataset](https://www.kaggle.com/datasets/pashupatigupta/emotion-detection-from-text?resource=download)
- Credits: Dataset was put together by Pashupati Gupta
- ToDo: Before running the Jupyter notebook, install the dependencies listed in requirements.txt (`pip install -r requirements.txt`)

## 3.1 Textual Data Anonymisation – 30 marks

### 3.1.1 Do some research to determine what needs to be anonymised in the data and why.
- To better understand the structure of the dataset, we display its columns and a few sample rows.
- Which columns does the dataset contain, and in what format are the attribute values?
To answer this, the first four rows of the dataset are printed:

In [1]:
import pandas as pd
df = pd.read_csv("tweet_emotions.csv")
print(df.iloc[:4])

     tweet_id   sentiment                                            content
0  1956967341       empty  @tiffanylue i know  i was listenin to bad habi...
1  1956967666     sadness  Layin n bed with a headache  ughhhh...waitin o...
2  1956967696     sadness                Funeral ceremony...gloomy friday...
3  1956967789  enthusiasm               wants to hang out with friends SOON!
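
To additionally list every column with its dtype and its first non-empty value, a minimal sketch could look like the following. The tiny `sample` frame and the helper name `first_non_empty` are assumptions made for illustration; in the notebook one would pass `df` instead.

```python
import pandas as pd

def first_non_empty(frame: pd.DataFrame) -> dict:
    """Map each column to its first value that is neither null nor an empty string."""
    result = {}
    for column in frame.columns:
        values = frame[column].dropna()
        # Treat whitespace-only strings as empty as well
        values = values[values.astype(str).str.strip() != ""]
        result[column] = values.iloc[0] if not values.empty else None
    return result

# Hypothetical stand-in for tweet_emotions.csv
sample = pd.DataFrame({
    "tweet_id": [1956967341, 1956967666],
    "sentiment": [None, "sadness"],
    "content": ["", "Layin n bed with a headache"],
})

for column, value in first_non_empty(sample).items():
    print(f"{column} ({sample[column].dtype}): {value!r}")
```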


Inspecting the columns and their data formats shows that the 'content' attribute can contain explicit personally identifiable information:
1. User Mentions: 
    - Any instance of @username should be anonymised because it directly points to an individual's account, which is considered personally identifiable information (PII).
2. First Names: 
    - If any first names are used in a context that can identify an individual, such as tagging in combination with other identifying information, they should be anonymised.
3. Locations and Specific References: 
    - Any mention of specific locations, addresses, landmarks, or establishments that could help in identifying an individual should be anonymised.
4. Specific Events with Identifiable Information: 
    - References to specific events that may lead to the identification of individuals, like parties or gatherings with a list of names, should be anonymised.
5. Unique Identifiers: 
    - Any other unique identifiers, such as specific dates, times, or unique events, that could potentially be linked back to an individual should be anonymised.
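
User mentions follow a fixed pattern, so they can be detected without an NLP model. The sketch below assumes that a handle is an `@` followed by 1 to 15 letters, digits, or underscores, and that it is not preceded by a word character (so e-mail addresses are skipped). The helper name `find_mentions` is made up for illustration.

```python
import re

# Assumed handle format: '@' plus 1-15 word characters, not preceded by a word character
MENTION_PATTERN = re.compile(r"(?<!\w)@\w{1,15}")

def find_mentions(text: str) -> list[str]:
    """Return all @username mentions found in a tweet."""
    return MENTION_PATTERN.findall(text)

print(find_mentions("@tiffanylue i know  i was listenin"))  # ['@tiffanylue']
print(find_mentions("mail me at someone@example.com"))      # []
```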


Apart from that, we explore the 'sentiment' attribute further, as we do not yet know how many unique values it contains or whether they could qualify as PII:

In [2]:
print("total length of the dataframe: ", len(df))

# Calculate the number of unique values and the number of entries per unique value
unique_counts = df['sentiment'].nunique()
value_counts = df['sentiment'].value_counts()

print("number of unique values in sentiment: ", unique_counts)
print("counts per unique value in", value_counts)

total length of the dataframe:  40000
number of unique values in sentiment:  13
counts per unique value in sentiment
neutral       8638
worry         8459
happiness     5209
sadness       5165
love          3842
surprise      2187
fun           1776
relief        1526
hate          1323
empty          827
enthusiasm     759
boredom        179
anger          110
Name: count, dtype: int64


Further inspection shows that the 'sentiment' column contains 13 distinct values across the 40,000 tweets in the dataset. Even the rarest label ('anger') is shared by 110 tweets, so a sentiment value on its own does not single out an individual. There is therefore no need to anonymise the 'sentiment' attribute.
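
The group sizes behind this decision can be checked explicitly. The sketch below re-uses the counts printed above; the threshold `k = 100` is an assumption chosen for illustration and is not part of the assignment.

```python
# Counts copied from the value_counts() output above
sentiment_counts = {
    "neutral": 8638, "worry": 8459, "happiness": 5209, "sadness": 5165,
    "love": 3842, "surprise": 2187, "fun": 1776, "relief": 1526,
    "hate": 1323, "empty": 827, "enthusiasm": 759, "boredom": 179, "anger": 110,
}

k = 100  # assumed minimum group size, for illustration only

rarest_label = min(sentiment_counts, key=sentiment_counts.get)
smallest_group = sentiment_counts[rarest_label]

print(f"rarest label: {rarest_label} ({smallest_group} tweets)")
print(f"every label is shared by at least {k} tweets: {smallest_group >= k}")
```

In the notebook itself, `df['sentiment'].value_counts().min()` yields the same smallest group size.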

### 3.1.2 Using a Natural Language Processing library (e.g. Python’s spaCy), analyse the text to identify elements of personally identifiable information (PII).
The goal of anonymisation is to remove or obscure details so that the individuals to whom the data pertains cannot be readily identified. The first step is to find the content that might actually contain PII.
To do so, we install 'en_core_web_lg', a pre-trained English spaCy model trained on a diverse range of web text, including blogs, news, and comments. It is suitable for identifying named entities, which can include PII. We chose 'en_core_web_lg' over, for example, 'en_core_web_trf' because of its balance between performance and resource usage.



In [3]:
#!python -m spacy download en_core_web_lg > /dev/null 2>&1 #install model without outputting in console

In [4]:
import spacy

#Load the large, pre-trained spaCy model
nlp = spacy.load('en_core_web_lg')

# Function to identify PII using spaCy
def identify_pii(text):
    # Process the text using spaCy to identify named entities
    doc = nlp(text)
    pii_entities = [(ent.text, ent.label_) for ent in doc.ents]
    return pii_entities

pii_original = df['content'].apply(identify_pii)

df['PII'] = pii_original
df.to_csv("PII_tweet_emotions.csv", index=False)

In [5]:
#load new dataframe containing the PII information
df = pd.read_csv("PII_tweet_emotions.csv")
print(df.iloc[:5])

     tweet_id   sentiment                                            content  \
0  1956967341       empty  @tiffanylue i know  i was listenin to bad habi...   
1  1956967666     sadness  Layin n bed with a headache  ughhhh...waitin o...   
2  1956967696     sadness                Funeral ceremony...gloomy friday...   
3  1956967789  enthusiasm               wants to hang out with friends SOON!   
4  1956968416     neutral  @dannycastillo We want to trade with someone w...   

                           PII  
0  [('@tiffanylue', 'PERSON')]  
1                           []  
2         [('friday', 'DATE')]  
3                           []  
4         [('Houston', 'GPE')]  


Each non-empty list in the new 'PII' column indicates that the spaCy model has identified text segments in that row which it considers to be named entities. Each entity is tagged with a label that classifies its type, e.g. DATE, PERSON, ORG (organisation), or GPE (geopolitical entity). These named entities can be considered PII, as they might identify an individual either directly or in combination with additional information. In conclusion, a non-empty list in the 'PII' column means that we may have to apply an anonymisation mechanism to that row so that the detected entities cannot be used to identify an individual.

As a step towards the next task, we count how often each entity label occurs in the new 'PII' column. This information is crucial for planning the anonymisation steps.

In [6]:
df = pd.read_csv("PII_tweet_emotions.csv")

In [7]:
import collections
import ast 

# Function to extract entities from a string and return their labels
def extract_labels(data_string):
    # Convert string representation of list to actual list
    entities = ast.literal_eval(data_string)
    # Extract labels
    return [label for _, label in entities]

# Extract labels from each item in the data
all_labels = [label for item in df['PII'] for label in extract_labels(item)]

# Count occurrences of each label
label_counts = collections.Counter(all_labels)

print(label_counts)
print(len(label_counts))

Counter({'PERSON': 10340, 'ORG': 8815, 'DATE': 7560, 'CARDINAL': 3430, 'GPE': 3269, 'TIME': 2964, 'NORP': 957, 'PRODUCT': 702, 'ORDINAL': 680, 'WORK_OF_ART': 607, 'MONEY': 353, 'LOC': 244, 'FAC': 210, 'QUANTITY': 191, 'EVENT': 123, 'LANGUAGE': 106, 'PERCENT': 75, 'LAW': 35})
18
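
Because the 'PII' column was written to CSV and read back, each cell is a string such as `"[('Houston', 'GPE')]"` rather than a list, which is why `ast.literal_eval` is needed. A slightly more defensive variant is sketched below. The helper name `parse_entities` and the choice to return an empty list for unreadable cells are assumptions, not part of the notebook.

```python
import ast
import collections

def parse_entities(cell) -> list:
    """Parse a stringified list of (text, label) tuples; return [] if the cell is unreadable."""
    if not isinstance(cell, str):
        return []  # e.g. NaN for a missing cell
    try:
        parsed = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []
    return parsed if isinstance(parsed, list) else []

# Illustrative cells: three valid ones, one missing value, one malformed string
cells = ["[('@tiffanylue', 'PERSON')]", "[]", "[('friday', 'DATE')]", float("nan"), "not a list"]
labels = collections.Counter(label for cell in cells for _, label in parse_entities(cell))
print(labels)
```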


The 18 different categories have the following meaning: 

- PERSON: Names of people.
- ORG: Organizations, including companies, agencies, institutions, etc.
- DATE: Absolute or relative dates or periods.
- CARDINAL: Numerals that do not fall under another type (like dates or quantities).
- GPE: Geopolitical entity, typically referring to countries, cities, states.
- TIME: Times smaller than a day, including specific time periods, durations, or times of day.
- NORP: Nationalities, religious or political groups.
- ORDINAL: "First", "second", etc., used to denote position in an ordered sequence.
- PRODUCT: Objects, vehicles, foods, etc. (not services).
- MONEY: Monetary values, including unit.
- WORK_OF_ART: Titles of books, songs, etc.
- LOC: Non-GPE locations, mountain ranges, bodies of water.
- FAC: Facilities, including buildings, airports, highways, bridges, etc.
- QUANTITY: Measurements, as of weight or distance.
- EVENT: Named hurricanes, battles, wars, sports events, etc.
- PERCENT: Percentage (including "%").
- LANGUAGE: Any named language.
- LAW: Named documents made into laws.

So, we now know that there are 18 different entity types that might have to be anonymised in some way.
We will now take a closer look at what these categories actually contain in practice:

In [8]:
import ast

categories = [
    'PERSON', 'ORG', 'DATE', 'CARDINAL', 'GPE', 'TIME', 'NORP', 'ORDINAL',
    'PRODUCT', 'MONEY', 'WORK_OF_ART', 'LOC', 'FAC', 'QUANTITY', 'EVENT', 
    'PERCENT', 'LANGUAGE', 'LAW'
]

# Initialize a dictionary to hold the first five examples of each category
first_five_examples = {category: [] for category in categories}

# Iterate over the DataFrame to find examples
for index, row in df.iterrows():
    # Convert the string representation of list to actual list
    entities = ast.literal_eval(row['PII'])

    # Check for the first five examples of each category
    for entity, label in entities:
        if label in categories and len(first_five_examples[label]) < 5:
            first_five_examples[label].append(entity)

    # Break the loop if five examples have been found for all categories
    if all(len(examples) == 5 for examples in first_five_examples.values()):
        break

# Print the examples
for category, examples in first_five_examples.items():
    print(f"{category}: {examples}")

# If some categories are missing or have fewer than five examples, print a message
for category, examples in first_five_examples.items():
    if len(examples) < 5:
        print(f"Fewer than five examples found for category: {category}")

PERSON: ['@tiffanylue', 'BC', '@charviray Charlene', 'isaac', '@davidbrussee']
ORG: ['@BrodyJenner if u watch the hills in london u', 'itonlinelol', '@annarosekerr', '@PerezHilton', 'Program']
DATE: ['friday', 'Friday', 'weeks', 'weeks', '2009']
CARDINAL: ['2', '#', '2', '80', '6hrs']
GPE: ['Houston', 'SoCal', 'canada', 'Cali', '@lostluna']
TIME: ['this afternoon', 'this afternoon', '9am', 'nightly', '6am']
NORP: ['@Telstra', 'French-', 'Malay', 'Aussie', 'Christian']
ORDINAL: ['first', 'first', 'first', 'First', 'second']
PRODUCT: ['Blackberry', '@Pokinatcha', 'Windows', 'veronica', '@judyrey']
MONEY: ['#uds', '20', '100', '3wordsaftersex', '#beer #']
WORK_OF_ART: ['Horse Pills', 'Drag Me to Hell', 'The Biggest Loser on Hallmark', 'The Biggest Loser', 'Dead Like Me']
LOC: ['Voobys', 'Torchwood', 'Harpers Island', 'the east coast', 'HII']
FAC: ['the Balisage Markup Conference', 'the Hongkong International Airport', 'WII', '@xMyLifesAStoryx', 'Cork Airport']
QUANTITY: ['400MB', '1000+ m
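
The sample output shows that @handles are not limited to the PERSON label. A quick way to see this is to count, per label, how many of the listed examples start with `@`. The sketch uses only examples copied from the output above; the variable names are illustrative.

```python
# Example entities copied from the printed output (labels without handles omitted)
examples = {
    "PERSON": ["@tiffanylue", "BC", "@charviray Charlene", "isaac", "@davidbrussee"],
    "ORG": ["@BrodyJenner if u watch the hills in london u", "itonlinelol",
            "@annarosekerr", "@PerezHilton", "Program"],
    "GPE": ["Houston", "SoCal", "canada", "Cali", "@lostluna"],
    "NORP": ["@Telstra", "French-", "Malay", "Aussie", "Christian"],
    "PRODUCT": ["Blackberry", "@Pokinatcha", "Windows", "veronica", "@judyrey"],
    "FAC": ["the Balisage Markup Conference", "the Hongkong International Airport",
            "WII", "@xMyLifesAStoryx", "Cork Airport"],
}

# Count the entities per label that look like a user handle
handles_per_label = {
    label: sum(entity.startswith("@") for entity in entities)
    for label, entities in examples.items()
}
print(handles_per_label)
```

Since handles show up under several labels, it may be worth treating @mentions with a dedicated rule rather than relying on the entity label alone.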

After looking at the output for the different categories, we decided to anonymise the following eight categories in the next task, because they might lead to the identification of an individual:

- PERSON, GPE, DATE, ORG, NORP, CARDINAL, ORDINAL, TIME

After further inspection of the dataset, the remaining 10 categories do not appear to be problematic in the context of PII.
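
How much of the detected material the eight selected labels cover can be estimated from the label counts printed earlier. The sketch below copies those counts; `SELECTED_LABELS` and `keep_selected` are illustrative names and not part of the notebook.

```python
from collections import Counter

# Counts copied from the Counter output earlier in this section
label_counts = Counter({
    "PERSON": 10340, "ORG": 8815, "DATE": 7560, "CARDINAL": 3430, "GPE": 3269,
    "TIME": 2964, "NORP": 957, "PRODUCT": 702, "ORDINAL": 680, "WORK_OF_ART": 607,
    "MONEY": 353, "LOC": 244, "FAC": 210, "QUANTITY": 191, "EVENT": 123,
    "LANGUAGE": 106, "PERCENT": 75, "LAW": 35,
})

SELECTED_LABELS = {"PERSON", "GPE", "DATE", "ORG", "NORP", "CARDINAL", "ORDINAL", "TIME"}

def keep_selected(entities):
    """Keep only (text, label) pairs whose label is selected for anonymisation."""
    return [(text, label) for text, label in entities if label in SELECTED_LABELS]

covered = sum(label_counts[label] for label in SELECTED_LABELS)
total = sum(label_counts.values())
print(f"{covered} of {total} entities ({covered / total:.1%}) fall into the selected labels")
print(keep_selected([("Houston", "GPE"), ("Blackberry", "PRODUCT")]))
```

With the counts above, the selected labels account for 38015 of 40661 detected entities.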

Please continue reading in 3.1.3+3.1.4.ipynb  :)