# Data Filtering of Emotions Datasets

Datasets:

1. [go_emotions](https://huggingface.co/datasets/go_emotions): 58k carefully curated Reddit comments labeled for 27 emotion categories or Neutral.
2. [sem_eval_2018_task_1](https://huggingface.co/datasets/sem_eval_2018_task_1): tweets in Arabic, English (11k entries), and Spanish, collected for the task of automatically determining the intensity of emotions (E) and the intensity of sentiment (valence, V) of the tweeters from their tweets.

Both datasets are designed for the multi-label emotion classification task.

## Filtering Idea

The idea is to select from these datasets the texts that could plausibly be answers to questions such as:
- How was your day?
- How do you feel today?
- How's it going?
- What's up?
- How do you do?
- What can you say about your day?
- Describe your day?

In other words, choose only diary-style texts.

This will help filter out completely unrelated and junk texts that are not suitable for our main goal: recommending a quote to a user who responded with a description of their day (like an entry from a diary).

In [1]:
from datasets import load_dataset


go_emotions = load_dataset('go_emotions', 'raw')
sem_eval = load_dataset('sem_eval_2018_task_1', 'subtask5.english')


In [2]:
go_emotions

DatasetDict({
    train: Dataset({
        features: ['text', 'id', 'author', 'subreddit', 'link_id', 'parent_id', 'created_utc', 'rater_id', 'example_very_unclear', 'admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'optimism', 'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise', 'neutral'],
        num_rows: 211225
    })
})

Remove unnecessary columns

In [3]:
go = go_emotions['train']

In [4]:
# Drop the metadata columns; only the text and the label columns are needed
unused_columns = ['id', 'author', 'subreddit', 'link_id', 'parent_id', 'created_utc', 'rater_id']

go = go.remove_columns(unused_columns)

In [5]:
go

Dataset({
    features: ['text', 'example_very_unclear', 'admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'optimism', 'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise', 'neutral'],
    num_rows: 211225
})

In [6]:
sum(go['neutral']) / go.num_rows

0.2617966623269026

About 26% of the rows are labeled neutral, so they are almost useless for us.
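If the neutral rows are dropped, the step could look like the sketch below. It uses a few made-up toy rows in plain Python rather than the real dataset, and `neutral_share` and `drop_neutral` are hypothetical helper names. On the actual `Dataset` object the same condition could presumably be passed to `Dataset.filter`.

```python
# Toy rows (invented for illustration) that mimic the layout of the raw data:
# a text column plus 0/1 label columns.
toy_rows = [
    {"text": "I finally passed my exam today!", "joy": 1, "neutral": 0},
    {"text": "The bus leaves at nine.", "joy": 0, "neutral": 1},
    {"text": "Missed my train again, what a morning.", "joy": 0, "neutral": 0},
    {"text": "Had dinner with my sister, it was lovely.", "joy": 1, "neutral": 0},
]


def neutral_share(rows):
    """Return the fraction of rows whose 'neutral' flag is set."""
    return sum(row["neutral"] for row in rows) / len(rows)


def drop_neutral(rows):
    """Keep only the rows that are not flagged as neutral."""
    return [row for row in rows if row["neutral"] == 0]


print(neutral_share(toy_rows))
print(len(drop_neutral(toy_rows)))
```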

In [7]:
sem_eval

DatasetDict({
    train: Dataset({
        features: ['ID', 'Tweet', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'love', 'optimism', 'pessimism', 'sadness', 'surprise', 'trust'],
        num_rows: 6838
    })
    test: Dataset({
        features: ['ID', 'Tweet', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'love', 'optimism', 'pessimism', 'sadness', 'surprise', 'trust'],
        num_rows: 3259
    })
    validation: Dataset({
        features: ['ID', 'Tweet', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'love', 'optimism', 'pessimism', 'sadness', 'surprise', 'trust'],
        num_rows: 886
    })
})

### Idea: use ChatGPT to create a list of keywords to search for in the texts

We asked ChatGPT the following:

```
Help me to create a key words list that can be related to texts that are written to answer the following questions:
- How was your day?
- How do you feel today?
- How its going?
- What's up?
- How do you do?
- What can you say about your day?
- Describe your day?

I need the keywords that describe a possible text that answers the questions I provided.
Take a deep breath and answer in the format of python list: ['word1', 'word2'].
```

Below is the list that the model produced for us.

In [8]:
keywords = set(['good', 'great', 'fine', 'wonderful', 'fantastic', 'awesome', 'terrible', 'bad', 'okay', 'alright', 'not', 'much', 'not much', 'so-so', 'just', 'hanging', 'in', 'there', 'not', 'too', 'shabby', 'feeling', 'positive', 'negative', 'today', 'now', 'not', 'much', 'not', 'a', 'lot', 'just', 'working', 'busy', 'challenging', 'eventful', 'nothing', 'special', 'ordinary', 'usual', 'remarkable', 'pleasant', 'difficult', 'stressful', 'exciting', 'calm', 'quiet', 'productive', 'relaxing', 'unproductive', 'tired', 'energetic', 'happy', 'sad', 'content', 'stressed', 'anxious', 'bored', 'frustrated', 'confused', 'positive', 'negative', 'neutral', 'normal', 'extraordinary', 'routine', 'typical', 'unusual', 'nothing', 'much', 'out', 'the', 'ordinary', 'just', 'the', 'same', 'as', 'usual', 'not', 'a', 'lot', 'to', 'report', 'same', 'old', 'same', 'old', 'nothing', 'special', 'went', 'well', 'went', 'bad', 'went', 'fine', 'went', 'poorly', 'went', 'great', 'went', 'horrible', 'went', 'so-so', 'went', 'as', 'expected', 'went', 'unexpectedly', 'not', 'much', 'to', 'tell', 'interesting', 'exciting', 'ordinary', 'mundane', 'text', 'write', 'response', 'reply', 'communicate', 'communicative', 'express', 'yourself', 'convey', 'your', 'thoughts', 'elaborate', 'tell', 'me', 'about', 'it', 'inform', 'me', 'informative', 'insightful', 'detail', 'thorough', 'brief', 'concise', 'informal', 'formal', 'vivid', 'colorful', 'dull', 'monotonous', 'typical', 'atypical', 'daily', 'usual', 'extraordinary', 'usual', 'routine', 'daily', 'routine', 'ordinary', 'extraordinary', 'common', 'uncommon', 'daily', 'activities', 'daily', 'events', 'daily', 'occurrences', 'daily', 'experiences'])

len(keywords)

113

Let's remove stopwords from this list.

In [9]:
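import nltk
from nltk.corpus import stopwords

nltk.download('stopwords', quiet=True)  # fetch the corpus if it is missing
stop_words = set(stopwords.words('english'))  # build the set once, not per keyword

kw = {k for k in keywords if k not in stop_words}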

len(kw)

96

Now let's try to filter the texts with these keywords.

In [27]:
filtered_samples = []

for text in go['text'][:1000]:
    lowered = text.lower()
    # Substring test: keep the text if any keyword occurs anywhere in it
    if any(k in lowered for k in kw):
        filtered_samples.append(text)

filtered_samples, len(filtered_samples)

(["He isn't as big, but he's still quite popular. I've heard the same thing about his content. Never watched him much.",
  "That's crazy; I went to a super [RELIGION] high school and I think I can remember 2 girls the entire 4 years that became teen moms.",
  "I appreciate it, that's good to know. I hope I'll have to apply that knowledge one day",
  'One time my 1 stopped right in 91st, I was able to get a good photo of the platform since they have some lights along it.',
  'Well then I’d say you have a pretty good chance if it’s any girl lol',
  "Pretty much every Punjabi dude I've met.",
  'Lots, play store or apple store vpn. Nord is good',
  "So happy for [NAME]. So sad he's not here. Imagine this team with [NAME] instead of [NAME]. Ugh.",
  'I’m glad he’s okay but I’m even gladder it’s not that same gif of the guy ski/parachuting down a mountain',
  'I can\'t stand [NAME]. Especially since her "tatooing my own face" video. ',
  "Because the content creators don't deserve to be pai

In [28]:
filtered_samples = []

for text in sem_eval['train']['Tweet'][:1000]:
    lowered = text.lower()
    # Substring test: keep the tweet if any keyword occurs anywhere in it
    if any(k in lowered for k in kw):
        filtered_samples.append(text)

filtered_samples, len(filtered_samples)

(['Whatever you decide to do make sure it makes you #happy.',
  "My roommate: it's okay that we can't spell because we have autocorrect. #terrible #firstworldprobs",
  '@NHLexpertpicks @usahockey USA was embarrassing to watch. When was the last time you guys won a game..? #horrible #joke',
  '#NewYork: Several #Baloch &amp; Indian activists hold demonstrations outside @UN headquarters demanding Pak to stop exporting #terror into India',
  'Your glee filled Normy dry humping of the most recent high profile celebrity break up is pathetic &amp; all that is wrong with the world today.',
  '@FaithHill I remember it well #happy #afraid #Positive',
  "Autocorrect changes ''em' to 'me' which I resent greatly",
  "I mean I'm not done watching the pilot, but it's nice to see a group of actors perform without story lines dripping relentless nihilism.",
  "@WaterboysAS I would never strategically vote for someone I don't agree with. A lot of the Clinton vote based on fear and negativity.",
  "@Rog

Although this approach filters around 70% out of the first 1000 samples in each dataset, the filtered texts are still not very relevant: they do not look like answers to the questions defined above. So we need something different.
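One likely contributor to the noise is that `k in text.lower()` is a substring test, so the keyword 'old' also matches inside 'hold', as in the tweet about activists who "hold demonstrations". A minimal sketch of whole-word matching is shown below. The helper names `build_keyword_pattern` and `matches_keyword` are hypothetical, and the small keyword set is only a subset chosen for illustration.

```python
import re


def build_keyword_pattern(keywords):
    """Compile one regex that matches any keyword as a standalone word."""
    # Longest first, so that a phrase like 'not much' is tried before 'much'.
    ordered = sorted(keywords, key=len, reverse=True)
    alternatives = "|".join(re.escape(k) for k in ordered)
    # The lookarounds reject matches that sit inside a longer word.
    return re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)", re.IGNORECASE)


def matches_keyword(text, pattern):
    """Return True if the text contains at least one whole-word keyword."""
    return pattern.search(text) is not None


pattern = build_keyword_pattern({"old", "sad", "good", "so-so"})

print(matches_keyword("Activists hold demonstrations", pattern))
print(matches_keyword("Feeling so-so today", pattern))
```

Whole-word matching only removes accidental hits. It does not make a text diary-style, so it should be seen as a cleanup of the keyword idea rather than a fix for it.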

### Idea: ask an LLM whether each text is diary-style or not

In [5]:
from langchain.llms import Ollama


# Make sure you do these steps before running it
# 1. https://ollama.ai/download
# 2. ollama serve
# 3. ollama pull mistral
llm = Ollama(model="mistral")

In [11]:
zipped = zip(
    ['My day was great! I have done all the things I planned to do. Also, I met my brother and it was pleasure to see him after 5 years he moved to USA.',
     'Thank you for reporting the issue! We wil think how to approach this problem.',
     'You are not right here! Go away from this chat.'],
     ['+', '-', '-'])
examples = []
for ref, trn in zipped:
    examples.append({
        'text': ref,
        'answer': trn
    })
examples

[{'text': 'My day was great! I have done all the things I planned to do. Also, I met my brother and it was pleasure to see him after 5 years he moved to USA.',
  'answer': '+'},
 {'text': 'Thank you for reporting the issue! We wil think how to approach this problem.',
  'answer': '-'},
 {'text': 'You are not right here! Go away from this chat.', 'answer': '-'}]

In [18]:
from langchain import PromptTemplate, FewShotPromptTemplate, LLMChain


example_template = """
Text: "{text}"
Is above text written in a diary-style? Answer: {answer}
"""

example_prompt = PromptTemplate(
   input_variables=["text", "answer"],
   template=example_template
)

prefix = """
Your task is to decide is the provided text written in a diary-style? Answer ONLY with "+" or "-" as an output!
DO NOT output such phrases as "I understand your task", "I am an AI language model" or something similar.
It COULD NOT be no response, ONLY "+" or "-". DO NOT ASK for additional context information or few more examples!\n
"""
suffix = """
Text: "{text}"
Is above text written in a diary-style? Answer ONLY with "+" or "-": 
"""

few_shot_prompt_template = FewShotPromptTemplate(
   examples=examples,
   example_prompt=example_prompt,
   prefix=prefix,
   suffix=suffix,
   input_variables=["text"],
   example_separator="\n\n"
)

fs_llm_chain = LLMChain(
   prompt=few_shot_prompt_template,
   llm=llm
)

In [19]:
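for text in go['text'][:10]:
    print(f'TEXT: {text}')
    # The chain applies the few-shot template itself, so pass only the raw text;
    # pre-formatting it would nest the whole prompt inside the {text} slot.
    llm_output = fs_llm_chain.run(text=text)
    print(f'PRED: {llm_output}')
    print('------------------------')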

TEXT: That game hurt.
PRED: -
------------------------
TEXT:  >sexuality shouldn’t be a grouping category It makes you different from othet ppl so imo it fits the definition of "grouping" 
PRED: The text is not written in a diary-style. Answer: -
------------------------
TEXT: You do right, if you don't care then fuck 'em!
PRED: The text "You do right, if you don't care then fuck 'em!" is not written in a diary-style.
------------------------
TEXT: Man I love reddit.
PRED: It is difficult to say for certain whether the provided text is written in a diary-style without additional context. Could you provide more information about the purpose of the text or any specific indicators that suggest it might be written in a diary-style?
------------------------
TEXT: [NAME] was nowhere near them, he was by the Falcon. 
PRED: Text: "[NAME] was nowhere near them, he was by the Falcon. "
Is above text written in a diary-style? Answer ONLY with "+" or "-": -
------------------------
TEXT: Right? Co

This approach requires postprocessing of the output. We cannot predict every output an LLM may produce, so there is always a risk of getting an unusable answer. LLMs are also prone to hallucinations, and engineering a better prompt takes more time.
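As a rough illustration of what such postprocessing might involve, the sketch below maps a raw reply to "+", "-", or `None` when no clear label can be found. The function name `parse_diary_label` and the rule "accept a label only if it is the whole reply or its last token" are assumptions made for this example, not part of the notebook.

```python
import re

# A label counts only if it ends the reply and follows whitespace or a colon,
# so the hyphen inside 'diary-style' is never mistaken for an answer.
_TRAILING_LABEL = re.compile(r'(?:^|[\s:])([+-])[\s".]*$')


def parse_diary_label(llm_output):
    """Return '+' or '-' if the reply carries a clear label, else None."""
    reply = llm_output.strip()
    if reply in {"+", "-"}:
        return reply
    match = _TRAILING_LABEL.search(reply)
    if match:
        return match.group(1)
    return None  # unparsable reply: retry or discard it


replies = [
    "-",
    "The text is not written in a diary-style. Answer: -",
    "It is difficult to say for certain whether the provided text is written in a diary-style.",
]
for reply in replies:
    print(repr(parse_diary_label(reply)))
```

Replies that come back as `None` still need a policy (retry, discard, or manual review), which is part of the extra effort this approach costs.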

## Conclusion

Let's try to find more datasets, not necessarily emotion datasets but ones with diary-style texts. Another idea is to keep these emotion datasets anyway, since the data domain is close and we can still use them to fine-tune our future model.