# Data Preprocessing of Text Datasets

## 1. Diary-style Text Datasets

Datasets:

1. [journal-entries-with-labelled-emotions](https://www.kaggle.com/datasets/madhavmalhotra/journal-entries-with-labelled-emotions): about 1500 journal entries (1473 rows in the CSV loaded below), written by people reflecting on their day and labelled with 18 emotions.
2. [Diary-Entry-To-Rap](https://huggingface.co/datasets/chaudha7/Diary-Entry-To-Rap): 175 diary entries with the writer's age and gender, each paired with a rap-song version of the entry in a given mood and writing manner.

Only the text entries are taken from both datasets.

### Read only diary-style text entries

In [6]:
import pandas as pd


journals_csv_path = 'data/journals.csv'  # path to the Kaggle CSV file, not a directory

journals_df = pd.read_csv(journals_csv_path)
journals_df.head(1)

Unnamed: 0,Answer,Answer.f1.afraid.raw,Answer.f1.angry.raw,Answer.f1.anxious.raw,Answer.f1.ashamed.raw,Answer.f1.awkward.raw,Answer.f1.bored.raw,Answer.f1.calm.raw,Answer.f1.confused.raw,Answer.f1.disgusted.raw,...,Answer.t1.family.raw,Answer.t1.food.raw,Answer.t1.friends.raw,Answer.t1.god.raw,Answer.t1.health.raw,Answer.t1.love.raw,Answer.t1.recreation.raw,Answer.t1.school.raw,Answer.t1.sleep.raw,Answer.t1.work.raw
0,"My family was the most salient part of my day,...",False,False,True,False,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,False


In [8]:
journals = journals_df['Answer'].to_list()
len(journals), journals[:5]

(1473,
 ['My family was the most salient part of my day, since most days the care of my 2 children occupies the majority of my time. They are 2 years old and 7 months and I love them, but they also require so much attention that my anxiety is higher than ever. I am often overwhelmed by the care the require, but at the same, I am so excited to see them hit developmental and social milestones.',
  'Yoga keeps me focused. I am able to take some time for me and breath and work my body. This is important because it sets up my mood for the whole day.',
  'Yesterday, my family and I played a bunch of board games. My husband won most of them which is not surprising in the least. We played all sorts of games including Life, Clue, Mouse Trap and more. It was relaxing and such a happy, fun filled moment.',
  "Yesterday, I visited my parents and had dinner with them.  I hadn't seen them in a few weeks, so it was wonderful to see them and catch up on things.",
  'Yesterday, I really felt the import

In [9]:
from datasets import load_dataset


diaries_ds = load_dataset('chaudha7/Diary-Entry-To-Rap')
diaries_ds

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'Age', 'Gender', 'Mood', 'Writing Manner', 'Diary Entry', 'Rap Song'],
        num_rows: 175
    })
})

In [10]:
diaries = diaries_ds['train']['Diary Entry']
len(diaries), diaries[:5]

(175,
 ['Dear Diary,\n\nI find solace in your pages today, as my weary soul seeks refuge from the relentless demands of the world. Fatigue has woven its heavy tendrils around my body, stifling my spirit and clouding my mind. The weariness seems never-ending, suffocating every ounce of energy that once resided within me.\n\nThe incessant noise and constant motion of life have suffused my being, leaving me utterly drained. The weight of expectations and responsibilities bears down upon my shoulders, making each step feel like a monumental effort. It feels as if I am eternally trapped in a labyrinth of tasks, forever searching for an escape that eludes me.\n\nIn this moment, all I desire is solitude; a chance to retreat into the sanctuary of my introverted nature. To find solace within myself and reconnect with the depths of my being. The mere thought of self-care becomes a distant dream, fading like a mirage in the scorching desert of my exhaustion.\n\nPerhaps tomorrow will bring respite

### Prepare for labeling

In [11]:
diary_texts = journals + diaries
len(diary_texts)

1648
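
Before labeling, it may be worth normalizing the combined list, since the diary entries contain runs of `\n` and a CSV column can hold missing values. The following is a minimal sketch under those assumptions; `clean_texts` is a hypothetical helper that is not part of the notebook, and applying it could change the count of 1648 shown above.

```python
import re


def clean_texts(texts):
    """Hypothetical helper: collapse whitespace, drop empty or
    non-string entries, and remove exact duplicates (order preserved)."""
    seen = set()
    cleaned = []
    for text in texts:
        # Missing CSV cells arrive as float NaN, not as strings.
        if not isinstance(text, str):
            continue
        # Collapse newlines, tabs and repeated spaces into single spaces.
        text = re.sub(r"\s+", " ", text).strip()
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


sample = ["Dear Diary,\n\nI am tired.", "  ", float("nan"), "Dear Diary,\n\nI am tired."]
print(clean_texts(sample))
```

Collapsing whitespace discards paragraph breaks, which is usually acceptable for classification but not if the original layout needs to be kept.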

### Labeling

In [16]:
from transformers import pipeline


# top_k=None returns a score for every label, which multi-label models need;
# it replaces the deprecated return_all_scores=True.
reddit_classifier = pipeline(task="text-classification", model="SamLowe/roberta-base-go_emotions", top_k=None)
twitter_classifier = pipeline(task="text-classification", model="cardiffnlp/twitter-roberta-base-emotion-multilabel-latest", top_k=None)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFRobertaForSequenceClassification: ['roberta.embeddings.position_ids']
- This IS expected if you are initializing TFRobertaForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFRobertaForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForSequenceClassification for predictions without further training.
All model checkpoint layers were used when initializing TFRobertaForSequenceClass

In [17]:
reddit_outputs = reddit_classifier(diary_texts)
twitter_outputs = twitter_classifier(diary_texts)

Token indices sequence length is longer than the specified maximum sequence length for this model (514 > 512). Running this sequence through the model will result in indexing errors


InvalidArgumentError: Exception encountered when calling layer 'embeddings' (type TFRobertaEmbeddings).

{{function_node __wrapped__ResourceGather_device_/job:localhost/replica:0/task:0/device:CPU:0}} indices[0,512] = 514 is not in [0, 514) [Op:ResourceGather] name: 

Call arguments received by layer 'embeddings' (type TFRobertaEmbeddings):
  • input_ids=tf.Tensor(shape=(1, 514), dtype=int32)
  • position_ids=None
  • token_type_ids=tf.Tensor(shape=(1, 514), dtype=int32)
  • inputs_embeds=None
  • past_key_values_length=0
  • training=False
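
The traceback above is consistent with the preceding warning: at least one entry tokenizes to 514 tokens, which exceeds the 512-token limit of the model. One possible way around it is to truncate long inputs. The sketch below assumes that the text-classification pipeline forwards `truncation` and `max_length` to its tokenizer at call time; `classify_truncated` and `batch_size` are hypothetical names, not part of the notebook.

```python
def classify_truncated(classifier, texts, max_length=512, batch_size=16):
    """Hypothetical wrapper: run a classifier over texts in small batches,
    asking the tokenizer to cut inputs longer than max_length tokens."""
    outputs = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # truncation/max_length are assumed to reach the tokenizer unchanged.
        outputs.extend(classifier(batch, truncation=True, max_length=max_length))
    return outputs


# Intended usage with the pipelines defined earlier (not executed here):
# reddit_outputs = classify_truncated(reddit_classifier, diary_texts)
# twitter_outputs = classify_truncated(twitter_classifier, diary_texts)
```

Truncation silently drops the tail of long diary entries. Splitting long entries into chunks and aggregating the scores would be an alternative if the endings matter.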

In [18]:
reddit_outputs[:5]

NameError: name 'reddit_outputs' is not defined

In [None]:
twitter_outputs[:5]

In [None]:
def transform_output_into_lists(output):
    pass
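
The stub above is left unimplemented in the notebook. One possible shape for it is sketched here, assuming each classifier output is a list with one entry per text, where every entry is a list of `{'label', 'score'}` dicts. The helper name `labels_above_threshold` and the 0.5 cut-off are assumptions, not the author's choices.

```python
def labels_above_threshold(outputs, threshold=0.5):
    """Hypothetical helper: for each text, keep the labels whose
    score is at least `threshold` (multi-label selection)."""
    return [
        [item["label"] for item in scores if item["score"] >= threshold]
        for scores in outputs
    ]


# Hand-written example in the assumed output format (not real model scores).
example_outputs = [
    [{"label": "joy", "score": 0.91}, {"label": "sadness", "score": 0.03}],
    [{"label": "joy", "score": 0.12}, {"label": "sadness", "score": 0.77}],
]
print(labels_above_threshold(example_outputs))
```

A text whose scores all fall below the threshold ends up with an empty label list, so the cut-off may need tuning per model.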

In [None]:
# diaries_reddit_classified = pd.DataFrame(list(zip(diary_texts, reddit_labels)), columns=...)
# diaries_twitter_classified = ...

## 2. Reddit and Twitter Posts Datasets

Datasets:

1. [go_emotions](https://huggingface.co/datasets/go_emotions): 58k carefully curated Reddit comments labeled with 27 emotion categories or Neutral.
2. [sem_eval_2018_task_1](https://huggingface.co/datasets/sem_eval_2018_task_1): tweets in Arabic, English (11k entries), and Spanish, collected for automatically determining the intensity of emotions (E) and the intensity of sentiment (valence, V) of tweeters from their tweets.

Both datasets are designed for the multi-label emotion classification task. Only the text entries are taken from both datasets.

TODO: try