#  ⚙️ Data Processing

## 👀 View Data
We scraped all the data from ***Facebook*** and ***Instagram*** thanks to **CrowdTangle**, a social media analytics tool by Meta. The first thing to do in this cases is always ''look at the data!''.

here we can explain the structure of the data for each platform and which field we keep in the cleaning.

## 🧹 Clean Data

# BUILDING TRAINING DATA FOR SENTIMENT ANALYSIS 

In [16]:
from datasets import load_dataset

emotion_dataset = load_dataset("cardiffnlp/super_tweeteval", "tweet_emotion")
sentiment_dataset = load_dataset("cardiffnlp/super_tweeteval", "tweet_sentiment")

In [23]:
# EMOTION

label2emotion = {0 : "anger", 1 : "anticipation", 2 : "disgust", 3 : "fear", 
                 4 : "joy", 5 : "love", 6 : "optimism", 7 : "pessimism", 8 : "sadness", 9 : "surprise", 10 : "trust"}

def manipulate_emotion_labels(label):
    ris = []
    for i,e in enumerate(label):
        if e == 1:
            ris.append(label2emotion[i])
    return ris

emotion_train_dataset = emotion_dataset['train']
emotion_test_dataset = emotion_dataset['test']
emotion_validation_dataset = emotion_dataset['validation']

emotion_train_texts = emotion_train_dataset['text']
emotion_train_labels = emotion_train_dataset['gold_label_list']

for i in range(15,20):
    print(emotion_train_texts[i])
    print(manipulate_emotion_labels(emotion_train_labels[i]))


People you need to look up the definition of protest. What you are doing is not protesting is called vandalism. #angry #stop
['anger', 'disgust', 'sadness']
@user Look at those teef! #growl
['anger', 'disgust', 'fear']
Star trek online has a update to download oh fuming yay
['anger', 'disgust', 'joy', 'sadness']
The bitter the battle, the sweeter the victory...
['joy', 'optimism']
i cant stop. i finished - dejected. luckily no one is in the bathroom. so i go to a stall and wait until my pants are dry.
['anger', 'disgust', 'sadness']


In [27]:
print(sentiment_dataset)
sentiment_train_dataset = sentiment_dataset['train']
sentiment_test_dataset = sentiment_dataset['test']
sentiment_validation_dataset = sentiment_dataset['validation']

sentiment_train_texts = sentiment_train_dataset['text']
sentiment_train_labels = sentiment_train_dataset['gold_label']
#sentiment_train_targets = sentiment_train_dataset['target']

print(sentiment_train_texts[:5])
print(sentiment_train_labels[:5])
#print(sentiment_train_targets[:5])


DatasetDict({
    train: Dataset({
        features: ['gold_label', 'text', 'target'],
        num_rows: 26632
    })
    test: Dataset({
        features: ['gold_label', 'text', 'target'],
        num_rows: 12379
    })
    validation: Dataset({
        features: ['gold_label', 'text', 'target'],
        num_rows: 4000
    })
})
["dear @Microsoft the newOoffice for Mac is great and all, but no Lync update? C'mon.", "@Microsoft how about you make a system that doesn't eat my friggin discs. This is the 2nd time this has happened and I am so sick of it!", "I may be ignorant on this issue but... should we celebrate @user parental leave changes? Doesn't the gender divide suggest... (1/2)", 'Thanks to @user I just may be switching over to @user', 'If I make a game as a #windows10 Universal App. Will #xboxone owners be able to download and play it in November? @majornelson @Microsoft']
[1, 0, 1, 1, 2]
['@microsoft', '@microsoft', '@microsoft', '@microsoft', '@microsoft']


In [31]:
import json
from tqdm import tqdm

sentiment_prompt = """What is the sentiment of this text? \nText: {text} \nOptions: [ "strongly negative", "negative", "negative or neutral", "positive", "strongly positive"] \nAnswer: {answer}"""
emotion_prompt = """Which emotions from the options below are expressed in the following text? \nText: {text} \nOptions: [ "anger", "anticipation", "disgust", "fear", "joy", "love", "optimism", "pessimism", "sadness", "surprise", "trust" ] \nAnswer: {answer}"""

label2emotion = {0 : "anger", 1 : "anticipation", 2 : "disgust", 3 : "fear", 
                 4 : "joy", 5 : "love", 6 : "optimism", 7 : "pessimism", 8 : "sadness", 9 : "surprise", 10 : "trust"}
label2sentiment = {0 : "strongly negative", 1 : "negative", 2 : "negative or neutral", 3 : "positive", 4 : "strongly positive"}

def manipulate_emotion_labels(label):
    ris = []
    for i,e in enumerate(label):
        if e == 1:
            ris.append(label2emotion[i])
    return ", ".join(ris)


def generate_finetuning_dataset(dataset_type, texts, labels):

    json_data = []
    with open(f"training_{dataset_type}.json", "w") as fw_json:
        for instance_data, instance_gold in tqdm(zip(texts, labels), total=len(labels)):
            if dataset_type=="emotion":
                answer = manipulate_emotion_labels(instance_gold)
            else:
                answer = label2sentiment[instance_gold]
            
            prompt_template = emotion_prompt if dataset_type=="emotion" else sentiment_prompt
            prompt = prompt_template.format(
                    text=instance_data,
                    answer=answer)
            json_elem = {"prompt":prompt}
            json_data.append(json_elem)
        json.dump(json_data, fw_json, indent=4)
        
generate_finetuning_dataset("emotion", emotion_train_texts, emotion_train_labels)

100%|██████████| 6838/6838 [00:00<00:00, 517860.19it/s]


# CLEAN RAW DATA

In [35]:
# UTILS
import re
import pandas as pd
from datetime import datetime

# mi sa che non serve nemmeno questa
def convert_date(date_str: str) -> datetime.date:
    # Cleaning date string: removing timezone, we are interested in the date component.
    cleaned_str = re.sub(r"[A-Z]|T|Z", " ", date_str).strip()

    try:
        dt = pd.to_datetime(cleaned_str, errors="coerce")
        return dt.date()
    except:
        return None
    
def merge_text(data: pd.DataFrame) -> pd.DataFrame:
    text_columns = [col for col in data.columns if "text" in col.lower()]

    def process_text_cols(text_cols: pd.Series) -> str:
        text_values = [value for value in text_cols if pd.notna(value)]
        return " ".join(map(str, text_values))

    if len(text_columns) > 1:
        data["text"] = data[text_columns].apply(process_text_cols, axis=1)
        data = data.drop(columns=text_columns)
    else:
        data["text"] = data["text"].astype(str)

    return data


# IT'S NOT NEEDED ACCORDING TO ME, LET'S SEE
def merge_interactions(data: pd.DataFrame) -> pd.DataFrame:
    interaction_columns = [col for col in data.columns if "interaction" in col.lower()]

    def process_interaction_cols(interaction_cols: pd.Series) -> pd.Series:
        interaction_cols = interaction_cols.fillna(0)
        interaction_cols = interaction_cols.astype(str).str.replace(",", "")
        interaction_cols = interaction_cols.apply(
            lambda x: pd.to_numeric(x, errors="coerce")
        )
        return interaction_cols

    processed_interaction_cols = data[interaction_columns].apply(
        process_interaction_cols, axis=1
    )

    # If multiple interaction columns, sum them up
    if len(interaction_columns) > 1:
        data["interaction"] = processed_interaction_cols.sum(axis=1)
        data = data.drop(columns=interaction_columns)
    else:
        data["interaction"] = processed_interaction_cols[interaction_columns[0]]

    return data

In [34]:
COLUMN_CONFIG = {
    "ig": {
        "User Name": "author_id",
        "Post Created Date": "date",
        "Total Interactions": "interaction",
        "URL": "id",
        "Description": "text_1",
        "Image Text": "text_2",
    },
    "fb": {
        "Facebook Id": "author_id",
        "Total Interactions": "interaction",
        "URL": "id",
        "Post Created Date": "date",
        "Message": "text_1",
        "Description": "text_2",
        "Link Text": "text_3",
    }, 
}

DATE_RANGE = {
    "gpt3": {"start": "2022-11-25", "end": "2023-02-25"},
    "gpt4": {"start": "2023-03-09", "end": "2023-06-09"},
    "apple": {"start": "2024-01-28", "end": "2024-04-28"},
}

In [38]:
import logging
import pandas as pd

#from .helper import (
#    convert_date,
#    merge_text,
#    merge_interactions,
#)


class BaseDataLoader:
    """Base class for data loaders."""

    def __init__(self, file_path: str, topic: str, platform: str):
        self.file_path = file_path # csv file path
        self.topic = topic # gpt3, gpt4 or apple
        self.platform = platform # ig or fb

    def load_data(self):
        """Load data from a given file path."""
        column_mapping = COLUMN_CONFIG[self.platform]
        try:
            self.data = pd.read_csv(
                self.file_path, usecols=column_mapping.keys(), low_memory=False
            )
            self.data = self.data.rename(columns=column_mapping)
            print(
                f"Successfully loaded {self.platform} data from {self.file_path}"
            )
        except Exception as e:
            print(f"Error loading {self.platform} data: {e}")
            


    def transform_data(self):
        """Transform the raw data."""
        
        print(f"Data shape before dropping nan values: {self.data.shape}")
        # drop nan values for id, author_id, and date
        self.data = self.data.dropna(subset=["id", "author_id", "date"]).reset_index(
            drop=True
        )
        print(f"Data shape after dropping nan values: {self.data.shape}")

        print(f"Data shape before dropping duplicates: {self.data.shape}")
        # drop duplicates
        self.data = self.data.drop_duplicates(subset=["id"]).reset_index(drop=True)
        print(f"Data shape after dropping duplicates: {self.data.shape}")

        # convert date column (let's see how it convert it)
        self.data["date"] = self.data["date"].apply(convert_date)
        self.data = self.data.dropna(subset=["date"])
        self.data = self.data.sort_values(by=["date"])

        # select time range
        print(DATE_RANGE[self.topic]["start"])
        start_date = convert_date(DATE_RANGE[self.topic]["start"])
        print(start_date)
        end_date = convert_date(DATE_RANGE[self.topic]["end"])
        self.data = self.data[
            (self.data["date"] >= start_date) & (self.data["date"] <= end_date)
        ] # check of time range but it should be okay
        print(f"Data shape after selecting time range: {self.data.shape}")
        print(f"Min date: {self.data['date'].min()}")
        print(f"Max date: {self.data['date'].max()}")
        print("Processed date column")

        # Process text columns
        # Merge text columns
        self.data = merge_text(self.data) #we merge text_1, text_2 etc.
        print("Processed text columns")

        # Merge interaction columns
        self.data = merge_interactions(self.data)
        print("Processed interaction columns")
        
    def remove_facebook_spam(self):
        """Detect and remove spam Facebook posts from the data."""
        if self.platform != "facebook":
            return

        print("Starting spam detection for Facebook posts.")
        self.data = self.data.copy()

        # Initialize a new spam column with 0
        self.data["spam"] = 0

        # Convert text column to string
        self.data["text"] = self.data["text"].astype(str)

        # Define spam patterns
        spam_patterns = [
            "Video Funny Amazing #fyp #viral",
            "#reeel #cr7# #chatgpt",
            "#reels #chatgpt",
            "https://www.facebook.com/100076267686928/posts/202421482310107",
        ]

        # If a row's text contains any of the spam patterns, set spam = 1
        for pattern in spam_patterns:
            self.data.loc[
                self.data["text"].str.contains(pattern, case=False, na=False), "spam"
            ] = 1

        spam_counts = self.data["spam"].value_counts()
        print(
            f"Detected {spam_counts.get(1, 0)} spam posts and {spam_counts.get(0, 0)} non-spam posts."
        )

        # Filter out spam posts
        self.data = self.data[self.data["spam"] == 0]
        self.data.drop(columns=["spam"], inplace=True)

        print(f"Spam posts removed. Remaining posts count: {len(self.data)}.")


    def process_data(self):
        # load data
        self.load_data()
        # transform data
        self.transform_data()
        if self.platform == "fb":
            self.remove_facebook_spam()


In [39]:
platforms = ["fb", "ig"]
topics = ["gpt3", "gpt4", "apple"]

for topic in topics:
    for platform in platforms:
        data_loader = BaseDataLoader(f"data/raw/{platform}_{topic}.csv", topic, platform)
        data_loader.process_data()
        data_loader.data.to_csv(f"data/cleaned/{platform}_{topic}.csv", index=False)
        print("--------------------------------")

Data shape before dropping nan values: (75811, 7)
Data shape after dropping nan values: (75810, 7)
Data shape before dropping duplicates: (75810, 7)
Data shape after dropping duplicates: (75810, 7)
2022-11-25
2022-11-25
Data shape after selecting time range: (75482, 7)
Min date: 2022-11-25
Max date: 2023-02-23
Processed date column
Processed text columns
Processed interaction columns
--------------------------------
Data shape before dropping nan values: (11071, 6)
Data shape after dropping nan values: (11071, 6)
Data shape before dropping duplicates: (11071, 6)
Data shape after dropping duplicates: (10996, 6)
2022-11-25
2022-11-25
Data shape after selecting time range: (10930, 6)
Min date: 2022-11-25
Max date: 2023-02-23
Processed date column
Processed text columns
Processed interaction columns
--------------------------------
Data shape before dropping nan values: (89348, 7)
Data shape after dropping nan values: (89348, 7)
Data shape before dropping duplicates: (89348, 7)
Data shape 