# Assignment #4-5: Anonymising Textual Data and De-Anonymisation
- Dataset: Tweets Emotions [Dataset](https://www.kaggle.com/datasets/pashupatigupta/emotion-detection-from-text?resource=download)
- Credits: Dataset was put together by Pashupati Gupta
- ToDo: To run the Jupyter notebook, the packages in requirements.txt need to be installed (`pip install -r requirements.txt`)

### 3.1.1 Do some research to determine what needs to be anonymised in the data and why.
- To better understand the structure of the dataset, we first inspect its columns and the format of their values.
    - What columns does the dataset contain, and in what format are the attribute values?
        - To answer this, the first four rows of the dataframe are printed.

In [28]:
import pandas as pd
df = pd.read_csv("tweet_emotions.csv")
print(df.iloc[:4])

     tweet_id   sentiment                                            content
0  1956967341       empty  @tiffanylue i know  i was listenin to bad habi...
1  1956967666     sadness  Layin n bed with a headache  ughhhh...waitin o...
2  1956967696     sadness                Funeral ceremony...gloomy friday...
3  1956967789  enthusiasm               wants to hang out with friends SOON!


Inspecting the columns and their data formats shows that the 'content' attribute has the potential to contain explicit personally identifiable information. The following categories can be identified:
1. User Mentions: 
    - Any instance of @username should be anonymised because it directly points to an individual's account, which is considered personally identifiable information (PII).
2. First Names: 
    - If any first names are used in a context that can identify an individual, such as tagging in combination with other identifying information, they should be anonymised.
3. Locations and Specific References: 
    - Any mention of specific locations, addresses, landmarks, or establishments that could help in identifying an individual should be anonymised.
4. Specific Events with Identifiable Information: 
    - References to specific events that may lead to the identification of individuals, like parties or gatherings with a list of names, should be anonymised.
5. Unique Identifiers: 
    - Any other unique identifiers, such as specific dates, times, or unique events, that could potentially be linked back to an individual should be anonymised.
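
Of the categories listed above, user mentions are the easiest to detect mechanically, because they follow a fixed `@username` pattern. The following is a minimal sketch, not part of the original notebook: the helper name `find_mentions` and the exact regular expression are assumptions, and the pattern only approximates Twitter's handle rules (letters, digits and underscores after an `@` that does not follow a word character, so e-mail addresses are skipped).

```python
import re

# Assumption: a mention is "@" followed by ASCII letters, digits or underscores,
# and the "@" must not directly follow a word character (this skips e-mail addresses).
MENTION_RE = re.compile(r"(?<!\w)@(\w+)", re.ASCII)


def find_mentions(text):
    """Return the usernames (without the leading '@') mentioned in a tweet."""
    return MENTION_RE.findall(text)


# First tweet from the dataset preview shown earlier
print(find_mentions("@tiffanylue i know  i was listenin to bad habi"))
# An e-mail address is not treated as a mention
print(find_mentions("mail me at bob@example.com"))
```

A rule-based check like this could complement a statistical model, since general-purpose named entity recognisers are not guaranteed to label handles as persons.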


In addition, we explore the 'sentiment' attribute further, because we do not yet know how many unique values it has or whether they would qualify as PII:

In [29]:
print("total length of the dataframe: ", len(df))

# Calculate the number of unique values and the number of entries per unique value
unique_counts = df['sentiment'].nunique()
value_counts = df['sentiment'].value_counts()

print("number of unique values in sentiment: ", unique_counts)
print("counts per unique value in", value_counts)

total length of the dataframe:  40000
number of unique values in sentiment:  13
counts per unique value in sentiment
neutral       8638
worry         8459
happiness     5209
sadness       5165
love          3842
surprise      2187
fun           1776
relief        1526
hate          1323
empty          827
enthusiasm     759
boredom        179
anger          110
Name: count, dtype: int64


Inspecting the 'sentiment' attribute shows that it takes 13 distinct values across the 40,000 tweets in the dataset. Even the rarest value ('anger') is shared by 110 tweets, so a sentiment label on its own does not single out an individual. Given this information, there is no need to anonymise the 'sentiment' attribute.
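
The argument above can be made concrete with a simple group-size check: an attribute is unlikely to identify anyone by itself if every one of its values is shared by many records. The sketch below uses the counts printed above; the helper names `smallest_group` and `every_group_at_least` and the threshold `k` are illustrative assumptions, not part of the original analysis.

```python
# Counts copied from the value_counts() output above
sentiment_counts = {
    "neutral": 8638, "worry": 8459, "happiness": 5209, "sadness": 5165,
    "love": 3842, "surprise": 2187, "fun": 1776, "relief": 1526,
    "hate": 1323, "empty": 827, "enthusiasm": 759, "boredom": 179,
    "anger": 110,
}


def smallest_group(counts):
    """Return the (value, count) pair with the fewest records."""
    value = min(counts, key=counts.get)
    return value, counts[value]


def every_group_at_least(counts, k):
    """True if every value is shared by at least k records (k is an assumed threshold)."""
    return all(count >= k for count in counts.values())


print(sum(sentiment_counts.values()))   # total number of tweets
print(smallest_group(sentiment_counts))
print(every_group_at_least(sentiment_counts, k=100))
```

This only considers 'sentiment' in isolation; combined with other attributes, the groups could become much smaller.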

### 3.1.2 Using a Natural Language Processing library (e.g. Python’s spaCy), analyse the text to identify elements of personally identifiable information (PII).
The goal of anonymisation is to remove or obscure such details so that the individuals to whom the data pertains cannot be readily identified. The first step is to find the content that might actually contain PII.
To do this, we install 'en_core_web_sm', a pre-trained spaCy model suitable for identifying named entities, which can include PII. 'en_core_web_sm' is an English model trained on a diverse range of web text, including blogs, news, and comments. We chose 'en_core_web_sm' over, for example, 'en_core_web_trf' because of its balance between performance and resource usage.



In [30]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [37]:
import spacy

#Load the pre-trained spaCy model
nlp = spacy.load('en_core_web_sm')

# Function to identify PII using spaCy
def identify_pii(text):
    # Process the text using spaCy to identify named entities
    doc = nlp(text)
    pii_entities = [(ent.text, ent.label_) for ent in doc.ents]
    return pii_entities

# Apply the function to each content entry in the dataframe
df['PII'] = df['content'].apply(identify_pii)
# Save the DataFrame with the new 'PII' column to a new CSV file
df.to_csv("PII_tweet_emotions.csv", index=False)

Each non-empty list in the new 'PII' column indicates that the spaCy model has identified text segments in that row which it believes to be named entities. Each entity is tagged with a label that classifies its type (e.g., PERSON, ORG, GPE). These named entities can be considered potential PII, as they might be used to identify an individual either directly or when combined with additional information.
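
Not every entity type is equally likely to identify a person: a label such as CARDINAL (a plain number) is usually harmless, while PERSON or GPE is not. The sketch below shows one way the `(text, label)` tuples produced by `identify_pii` could be narrowed down. The set `LIKELY_PII_LABELS` and the helper `filter_likely_pii` are assumptions made for illustration, and the example entities are hand-written rather than real model output.

```python
# Assumption: which spaCy labels count as "likely PII" is a judgment call;
# this set is only an example and should be adapted to the use case.
LIKELY_PII_LABELS = {"PERSON", "GPE", "LOC", "FAC", "ORG", "NORP", "DATE", "TIME", "EVENT"}


def filter_likely_pii(entities, labels=LIKELY_PII_LABELS):
    """Keep only the (text, label) tuples whose label is in the given set."""
    return [(text, label) for text, label in entities if label in labels]


# Hand-written example in the same format as the 'PII' column
example_entities = [("Paris", "GPE"), ("two", "CARDINAL"), ("Friday", "DATE")]
print(filter_likely_pii(example_entities))
```

Applied to the dataframe, this could look like `df['PII'].apply(filter_likely_pii)`, assuming the column still holds Python lists rather than strings re-read from the CSV file.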

### 3.1.3 Using the techniques you applied in Assignment #1, apply a masking or transformation mechanism to modify the detected PII elements and substitute with suitable replacements.

### 3.1.4 Analyse the text to determine what if any information can be obtained after the transformation process. What conclusions can you draw from this?