# Data Preprocessing of Quotes Datasets

## 1. Quotes Dataset From Kaggle

[Quotes Dataset](https://www.kaggle.com/datasets/hiteshsuthar101/quotes-dataset): 852 quote entries parsed and labeled from [quotepark.com](https://quotepark.com).

In [1]:
import pandas as pd


quotes_ds_dir = '../data/quotes.csv'

df = pd.read_csv(quotes_ds_dir)
df.head(1)

| | Id | language | Quote | Quote_url | Author | Author_Profile | Tags |
|---|---|---|---|---|---|---|---|
| 0 | 2114238 | en | Mimì never forgets to see the beauty in life. | https://quotepark.com/quotes/2114238-megan-mar... | Megan Marie Hart | https://quotepark.com/authors/megan-marie-hart/ | Life, Beauty, Forgetting |


### 1.1. Dataset preprocessing
Keep only the `Quote` and `Tags` columns, which are all we need for this task, and split each comma-separated `Tags` string into a set of lowercase tags.

In [2]:
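# Keep only the columns needed for tagging; copy to avoid SettingWithCopyWarning.
df = df[['Quote', 'Tags']].copy()
# Split the tag string into a set of lowercase tags (empty set when the cell is missing).
df["Tags"] = df["Tags"].apply(lambda x: {e.strip().lower() for e in x.split(",")} if isinstance(x, str) else set())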
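As a standalone illustration of the tag-splitting step, the same logic can be written as a small helper. The name `split_tags` is hypothetical and not part of the notebook. Unlike the lambda in the cell, this sketch also drops empty fragments left by stray commas, and it returns an empty set for missing values so that every row holds the same type.

```python
import math


def split_tags(raw):
    """Turn a comma-separated tag string into a set of lowercase tags.

    Non-string input (for example the NaN that pandas uses for a
    missing cell) yields an empty set.
    """
    if not isinstance(raw, str):
        return set()
    # Drop empty fragments left by trailing or doubled commas.
    return {part.strip().lower() for part in raw.split(",") if part.strip()}


print(split_tags("Life, Beauty, Forgetting"))
print(split_tags(math.nan))
```

With pandas, such a helper could be passed directly to `df["Tags"].apply(split_tags)`.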

### 1.2. Add tags from the GoEmotions dataset to the quotes dataset
The quotes need labels from the GoEmotions label set, because we will use a model fine-tuned on the GoEmotions dataset.

Here we use the [SamLowe/roberta-base-go_emotions](https://huggingface.co/SamLowe/roberta-base-go_emotions) model.

#### 1.2.1. Use RoBERTa model to predict tags on quote text
Get the tags from the model (score threshold 0.2) and keep only those that appear in the cleaned tag set.

In [4]:
from transformers import pipeline
pipe = pipeline("text-classification", model="SamLowe/roberta-base-go_emotions")

In [6]:
# Subset of GoEmotions labels that we keep as tags for quotes
my_set =  {'admiration',
           'amusement',
           'anger',
           'annoyance',
           'confusion',
           'curiosity',
           'desire',
           'disappointment',
           'disgust',
           'embarrassment',
           'excitement',
           'fear',
           'gratitude',
           'grief',
           'joy',
           'love',
           'nervousness',
           'optimism',
           'pride',
           'realization',
           'relief',
           'remorse',
           'sadness',
           'surprise'}

In [5]:
tags = pipe(list(df["Quote"]), truncation=True, top_k=None)

In [7]:
df["model_tags"] = [{tag["label"] for tag in quote if tag["score"] > 0.2 and tag["label"] in my_set} for quote in tags]
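With `top_k=None` and a list of texts as input, the text-classification pipeline returns, in recent `transformers` versions, one list per text, each holding a `{"label", "score"}` dictionary for every label. The sketch below reproduces the thresholding step on hand-written scores, so it runs without downloading the model. The helper name `filter_labels` and all scores are made up for illustration.

```python
def filter_labels(predictions, allowed, threshold):
    """Keep labels whose score exceeds `threshold` and that are in `allowed`.

    `predictions` has one list per input text; each list holds
    {"label": str, "score": float} dicts, one per label.
    """
    return [
        {p["label"] for p in per_text
         if p["score"] > threshold and p["label"] in allowed}
        for per_text in predictions
    ]


# Made-up scores for two texts, for illustration only.
fake_predictions = [
    [{"label": "love", "score": 0.81},
     {"label": "neutral", "score": 0.30},
     {"label": "joy", "score": 0.12}],
    [{"label": "neutral", "score": 0.95},
     {"label": "anger", "score": 0.02}],
]
allowed = {"love", "joy", "anger"}

result = filter_labels(fake_predictions, allowed, threshold=0.2)
print(result)
```

In this toy input, `neutral` passes the threshold for both texts but is dropped because it is not in the allowed set, which is why the second text ends up with no tags.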

#### 1.2.2. Add dataset tags that are also GoEmotions labels

In [9]:
df["ds_tags"] = df["Tags"].apply(lambda x: my_set.intersection(x))

In [10]:
df

| | Quote | Tags | model_tags | ds_tags |
|---|---|---|---|---|
| 0 | Mimì never forgets to see the beauty in life. | {beauty, life, forgetting} | {} | {} |
| 1 | I will choose the bad guy in every story, I am... | {attraction, story, choosing, guy} | {} | {} |
| 2 | Even perfect people are taught to practice imp... | {people, still, evening, society, practice, pe... | {} | {} |
| 3 | The greater your capacity to love, the greater... | {pain, feeling, capacity, love, feel} | {love} | {love} |
| 4 | We Turks are faithful muslims. | {faithful, faith} | {} | {} |
| ... | ... | ... | ... | ... |
| 847 | Your eyes show the strength of your soul. | {soul, show, eye, strength} | {admiration} | {} |
| 848 | Sometimes it is the people no one can imagine ... | {people, doing, imagination, thing, anything} | {} | {} |
| 849 | A prophet is nothing without a new revelation. | {prophet, revelation, news, nothing} | {} | {} |
| 850 | I’m not mad. I’m in a perfectly happy mood, yo... | {happiness, madness, mood} | {joy, annoyance} | {} |


#### 1.2.3. Use `fasttext` embeddings to find similar tags

In [11]:
import fasttext
from sklearn.metrics.pairwise import cosine_similarity

ft = fasttext.load_model('cc.en.300.bin')



In [12]:
my_set_embed = [ft.get_word_vector(e) for e in my_set]

In [13]:
from tqdm.notebook import tqdm

all_ds_tags = set(df["Tags"].explode().value_counts().keys())
tag_to_tag = {}

for e in tqdm(all_ds_tags):
    for w, emb in zip(my_set, my_set_embed):
        if cosine_similarity([ft.get_word_vector(e)], [emb])[0][0] > 0.5:
            tag_to_tag[e] = w

  0%|          | 0/1650 [00:00<?, ?it/s]
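In the loop above, a tag that exceeds the threshold for several emotions ends up mapped to whichever of them is visited last, because each match overwrites the previous dictionary entry. One possible alternative, sketched here with toy 2-d vectors in place of the 300-d fastText embeddings, keeps the emotion with the highest similarity instead. The helper names `cosine` and `best_match` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_match(tag_vec, emotion_vecs, threshold=0.5):
    """Return the most similar emotion above `threshold`, or None."""
    best_label, best_score = None, threshold
    for label, vec in emotion_vecs.items():
        score = cosine(tag_vec, vec)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy vectors standing in for fastText embeddings.
emotion_vecs = {"joy": [1.0, 0.0], "sadness": [0.0, 1.0], "love": [0.8, 0.6]}

print(best_match([0.9, 0.1], emotion_vecs))
print(best_match([-1.0, -1.0], emotion_vecs))
```

Picking the best match does not remove the need for manual cleaning, since embedding similarity often pairs antonyms, but it makes the result independent of iteration order.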

The automatically generated pairs are noisy (for example, `'hatred': 'admiration'`), so clean them by hand.

In [14]:
tag_to_tag

{'gratitude': 'admiration',
 'amazement': 'admiration',
 'appetite': 'desire',
 'worry': 'fear',
 'agony': 'grief',
 'concern': 'annoyance',
 'imagination': 'curiosity',
 'apprehension': 'excitement',
 'lov': 'love',
 'confusion': 'confusion',
 'realization': 'realization',
 'tragedy': 'grief',
 'joy': 'love',
 'anxiety': 'grief',
 'pain': 'grief',
 'praise': 'admiration',
 'acknowledgment': 'gratitude',
 'happiness': 'optimism',
 'intention': 'desire',
 'desire': 'desire',
 'self-confidence': 'optimism',
 'fear': 'fear',
 'regret': 'remorse',
 'aid': 'relief',
 'laughter': 'excitement',
 'loneliness': 'grief',
 'pleasure': 'joy',
 'sadness': 'excitement',
 'embarrassment': 'disgust',
 'encouragement': 'gratitude',
 'excitement': 'excitement',
 'awe': 'admiration',
 'rage': 'disgust',
 'selfishness': 'anger',
 'pride': 'pride',
 'confidence': 'optimism',
 'hatred': 'admiration',
 'jealousy': 'admiration',
 'admire': 'admiration',
 'danger': 'fear',
 'shame': 'disgust',
 'delight': 'exc

In [15]:
cleaned_tag2tag = {'encouragement': 'gratitude',
 'danger': 'fear',
 'intention': 'desire',
 'regret': 'sadness',
 'disappointment': 'sadness',
 'rage': 'disgust',
 'appetite': 'desire',
 'agony': 'sadness',
 'optimist': 'optimism',
 'kindness': 'gratitude',
 'pleasure': 'joy',
 'pride': 'joy',
 'imagination': 'curiosity',
 'amazement': 'disgust',
 'awe': 'admiration',
 'happiness': 'optimism',
 'acknowledgment': 'gratitude',
 'loneliness': 'sadness',
 'delight': 'joy',
 'admire': 'admiration',
 'sorrow': 'sadness',
 'laughter': 'joy',
 'hatred': 'disgust',
 'want': 'desire',
 'respect': 'admiration',
 'worry': 'fear',
 'pain': 'sadness',
 'lov': 'love',
 'apprehension': 'nervousness',
 'tragedy': 'sadness',
 'aid': 'relief',
 'love': 'joy',
 'confidence': 'optimism',
 'realization': 'realization',
 'praise': 'admiration',
 'anxiety': 'sadness',
 'interest': 'curiosity',
 'confusion': 'nervousness'}

Add the mapped tags to the dataset.

In [16]:
df["tag2tag"] = df["Tags"].apply(lambda x: {cleaned_tag2tag[e] for e in x if e in cleaned_tag2tag})

#### 1.2.4. Use RoBERTa model to predict mapping of tags

In [None]:
tags = []
for e in tqdm(all_ds_tags):
    tags.append(pipe(e, truncation=True, top_k=None))

In [17]:
postpr = [{tag["label"] for tag in quote if tag["score"] > 0.1 and tag["label"] in my_set} for quote in tags]
tag_to_tags = {tag: tags for tag, tags in zip(all_ds_tags, postpr) if len(tags) != 0}

In [18]:
df["tag2tag_by_model"] = df["Tags"].apply(lambda x: {tt for e in x if e in tag_to_tags for tt in tag_to_tags[e] if tt in my_set})
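The two cells above first build a mapping from each dataset tag to the emotions the model predicts for that tag, then expand every quote's tags through that mapping. A minimal sketch of the same idea on hypothetical, already-thresholded predictions follows. The names `expand_tags`, `dataset_tags`, and `predicted` are illustrative. A list is used for the tags so that the pairing done by `zip` is explicit about order.

```python
def expand_tags(quote_tags, tag_to_emotions):
    """Union of the emotion sets of every tag that has a mapping."""
    emotions = set()
    for tag in quote_tags:
        emotions |= tag_to_emotions.get(tag, set())
    return emotions


# Hypothetical per-tag predictions, already thresholded.
dataset_tags = ["pain", "story", "beauty"]
predicted = [{"sadness"}, set(), {"admiration", "joy"}]

# Keep only tags for which at least one emotion survived the threshold.
tag_to_emotions = {t: e for t, e in zip(dataset_tags, predicted) if e}

print(expand_tags({"pain", "beauty", "life"}, tag_to_emotions))
```

Tags without a mapping, such as `life` here, simply contribute nothing to the result.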

In [19]:
df

| | Quote | Tags | model_tags | ds_tags | tag2tag | tag2tag_by_model |
|---|---|---|---|---|---|---|
| 0 | Mimì never forgets to see the beauty in life. | {beauty, life, forgetting} | {} | {} | {} | {} |
| 1 | I will choose the bad guy in every story, I am... | {attraction, story, choosing, guy} | {} | {} | {} | {confusion, realization, curiosity} |
| 2 | Even perfect people are taught to practice imp... | {people, still, evening, society, practice, pe... | {} | {} | {} | {fear, curiosity, disappointment, love, annoya... |
| 3 | The greater your capacity to love, the greater... | {pain, feeling, capacity, love, feel} | {love} | {love} | {joy, sadness} | {confusion, realization, curiosity} |
| 4 | We Turks are faithful muslims. | {faithful, faith} | {} | {} | {} | {} |
| ... | ... | ... | ... | ... | ... | ... |
| 847 | Your eyes show the strength of your soul. | {soul, show, eye, strength} | {admiration} | {} | {} | {love, sadness} |
| 848 | Sometimes it is the people no one can imagine ... | {people, doing, imagination, thing, anything} | {} | {} | {curiosity} | {love, sadness} |
| 849 | A prophet is nothing without a new revelation. | {prophet, revelation, news, nothing} | {} | {} | {} | {} |
| 850 | I’m not mad. I’m in a perfectly happy mood, yo... | {happiness, madness, mood} | {joy, annoyance} | {} | {optimism} | {fear, curiosity} |


#### 1.2.5. Combine all tags
Combine all tags into one column and save the dataset.

In [20]:
df["all_tags"] = df.apply(lambda x: x.model_tags.union(x.ds_tags).union(x.tag2tag).union(x.tag2tag_by_model), axis=1)

In [23]:
# Serialize each tag set as a comma-separated string
df["all_tags"] = df["all_tags"].apply(lambda x: ",".join(x))
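Joining a Python set directly produces the tags in arbitrary order, so the saved strings can differ between runs. If reproducible output matters, one option is to sort before joining, as in this small sketch. The helper name `combine_tags` is hypothetical.

```python
def combine_tags(*tag_sets):
    """Union several tag sets and serialize them in sorted order."""
    merged = set().union(*tag_sets)
    return ",".join(sorted(merged))


print(combine_tags({"joy", "love"}, set(), {"sadness", "joy"}))
```

A quote with no tags at all serializes to an empty string.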

In [24]:
df[['Quote', 'all_tags']].to_csv('../data/quotes_with_predicted_tags.csv', index=False)
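When the saved file is read back, the comma-separated strings have to be turned into sets again, and quotes without tags need special handling: `"".split(",")` returns `[""]`, not an empty list (and `pandas.read_csv` reads empty fields as NaN by default). The following sketch shows the round trip with the standard `csv` module and an in-memory buffer. The two rows are invented placeholders, not entries from the dataset.

```python
import csv
import io

# Invented placeholder rows; the second one has no tags.
rows = [
    {"Quote": "first quote, with a comma", "all_tags": "joy,love,sadness"},
    {"Quote": "second quote", "all_tags": ""},
]

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Quote", "all_tags"])
writer.writeheader()
writer.writerows(rows)

# Read back and rebuild the tag sets; an empty field means "no tags".
buffer.seek(0)
restored = [
    set(row["all_tags"].split(",")) if row["all_tags"] else set()
    for row in csv.DictReader(buffer)
]
print(restored)
```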

### 1.3. Add tags from the SemEval dataset to the quotes dataset
The quotes also need labels from the SemEval emotion label set, because we will use a model fine-tuned on the SemEval dataset.

Here we use the [twitter-roberta-base-emotion-multilabel-latest](https://huggingface.co/cardiffnlp/twitter-roberta-base-emotion-multilabel-latest) model.

#### 1.3.1. Use RoBERTa model to predict tags on quote text
Get the tags from the model (score threshold 0.4) and keep only those that appear in the model's label set.

In [3]:
from transformers import pipeline
pipe = pipeline("text-classification", model="cardiffnlp/twitter-roberta-base-emotion-multilabel-latest")

In [4]:
my_set = set(pipe.model.config.label2id.keys())
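In a Hugging Face model config, `label2id` maps label names to integer class ids, so its keys are the label names the model can emit. The sketch below uses a hand-written stand-in dictionary; the ids are invented and the labels are only a subset chosen to match tags visible in this section's outputs.

```python
# Stand-in for pipe.model.config.label2id; the ids are invented.
label2id = {"anger": 0, "anticipation": 1, "joy": 2, "optimism": 3, "sadness": 4}

# The keys of the mapping form the set of valid labels.
label_set = set(label2id.keys())

# Dataset tags that are already valid model labels pass through unchanged.
quote_tags = {"mood", "joy", "madness"}
print(label_set & quote_tags)
```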

In [5]:
tags = pipe(list(df["Quote"]), truncation=True, top_k=None)

In [11]:
df["model_tags"] = [{tag["label"] for tag in quote if tag["score"] > 0.4 and tag["label"] in my_set} for quote in tags]

#### 1.3.2. Add dataset tags that are also SemEval labels

In [13]:
df["ds_tags"] = df["Tags"].apply(lambda x: my_set.intersection(x))

In [14]:
df

| | Quote | Tags | model_tags | ds_tags |
|---|---|---|---|---|
| 0 | Mimì never forgets to see the beauty in life. | {beauty, life, forgetting} | {optimism, joy} | {} |
| 1 | I will choose the bad guy in every story, I am... | {choosing, story, guy, attraction} | {anticipation, joy} | {} |
| 2 | Even perfect people are taught to practice imp... | {fitness, evening, practice, society, perfecti... | {optimism} | {} |
| 3 | The greater your capacity to love, the greater... | {feel, feeling, pain, capacity, love} | {optimism, sadness, joy} | {love} |
| 4 | We Turks are faithful muslims. | {faith, faithful} | {optimism, joy} | {} |
| ... | ... | ... | ... | ... |
| 847 | Your eyes show the strength of your soul. | {eye, soul, strength, show} | {trust, optimism, joy} | {} |
| 848 | Sometimes it is the people no one can imagine ... | {doing, anything, imagination, people, thing} | {optimism} | {} |
| 849 | A prophet is nothing without a new revelation. | {prophet, nothing, revelation, news} | {optimism, joy} | {} |
| 850 | I’m not mad. I’m in a perfectly happy mood, yo... | {mood, happiness, madness} | {anger, optimism, disgust, joy} | {} |


#### 1.3.3. Use `fasttext` embeddings to find similar tags

In [16]:
import fasttext
from sklearn.metrics.pairwise import cosine_similarity

ft = fasttext.load_model('../cc.en.300.bin')



In [17]:
my_set_embed = [ft.get_word_vector(e) for e in my_set]

In [18]:
from tqdm.notebook import tqdm

all_ds_tags = set(df["Tags"].explode().value_counts().keys())
tag_to_tag = {}

for e in tqdm(all_ds_tags):
    for w, emb in zip(my_set, my_set_embed):
        if cosine_similarity([ft.get_word_vector(e)], [emb])[0][0] > 0.5:
            tag_to_tag[e] = w

  0%|          | 0/1650 [00:00<?, ?it/s]

The automatically generated pairs are noisy (for example, `'sorrow': 'joy'`), so clean them by hand.

In [19]:
tag_to_tag

{'regret': 'sadness',
 'pleasure': 'joy',
 'disappointment': 'surprise',
 'lov': 'love',
 'feeling': 'sadness',
 'passion': 'love',
 'sadness': 'pessimism',
 'happiness': 'optimism',
 'delight': 'surprise',
 'expectation': 'anticipation',
 'agony': 'sadness',
 'tragedy': 'sadness',
 'selfishness': 'anger',
 'self-confidence': 'optimism',
 'confidence': 'optimism',
 'pain': 'sadness',
 'emotion': 'sadness',
 'fear': 'sadness',
 'optimist': 'pessimism',
 'sorrow': 'joy',
 'laughter': 'joy',
 'apprehension': 'anticipation',
 'embarrassment': 'disgust',
 'joy': 'love',
 'hatred': 'disgust',
 'jealousy': 'disgust',
 'anxiety': 'sadness',
 'excitement': 'optimism',
 'gratitude': 'joy',
 'love': 'love',
 'admire': 'love',
 'trust': 'trust',
 'amazement': 'joy',
 'pride': 'joy',
 'danger': 'fear',
 'hate': 'love',
 'rage': 'disgust',
 'shame': 'disgust',
 'worry': 'fear',
 'loneliness': 'sadness'}

In [20]:
cleaned_tag2tag = {'regret': 'sadness',
 'pleasure': 'joy',
 'lov': 'love',
 'passion': 'love',
 'sadness': 'pessimism',
 'happiness': 'optimism',
 'delight': 'surprise',
 'expectation': 'anticipation',
 'agony': 'sadness',
 'tragedy': 'sadness',
 'selfishness': 'anger',
 'self-confidence': 'optimism',
 'confidence': 'optimism',
 'pain': 'sadness',
 'laughter': 'joy',
 'apprehension': 'anticipation',
 'hatred': 'disgust',
 'jealousy': 'disgust',
 'anxiety': 'sadness',
 'excitement': 'optimism',
 'gratitude': 'joy',
 'love': 'love',
 'admire': 'love',
 'trust': 'trust',
 'amazement': 'joy',
 'pride': 'joy',
 'danger': 'fear',
 'hate': 'love',
 'rage': 'disgust',
 'shame': 'disgust',
 'worry': 'fear',
 'loneliness': 'sadness'}

Add the mapped tags to the dataset.

In [21]:
df["tag2tag"] = df["Tags"].apply(lambda x: {cleaned_tag2tag[e] for e in x if e in cleaned_tag2tag})

#### 1.3.4. Use RoBERTa model to predict mapping of tags

In [22]:
tags = []
for e in tqdm(all_ds_tags):
    tags.append(pipe(e, truncation=True, top_k=None))

  0%|          | 0/1650 [00:00<?, ?it/s]

In [26]:
postpr = [{tag["label"] for tag in quote if tag["score"] > 0.5 and tag["label"] in my_set} for quote in tags]
tag_to_tags = {tag: tags for tag, tags in zip(all_ds_tags, postpr) if len(tags) != 0}

In [27]:
df["tag2tag_by_model"] = df["Tags"].apply(lambda x: {tt for e in x if e in tag_to_tags for tt in tag_to_tags[e] if tt in my_set})

#### 1.3.5. Combine all tags
Combine all tags into one column and save the dataset.

In [29]:
df["all_tags"] = df.apply(lambda x: x.model_tags.union(x.ds_tags).union(x.tag2tag).union(x.tag2tag_by_model), axis=1)

In [30]:
# Serialize each tag set as a comma-separated string
df["all_tags"] = df["all_tags"].apply(lambda x: ",".join(x))

In [32]:
df[['Quote', 'all_tags']].to_csv('../data/quotes_with_predicted_tags_semeval.csv', index=False)

In [33]:
df

| | Quote | Tags | model_tags | ds_tags | tag2tag | tag2tag_by_model | all_tags |
|---|---|---|---|---|---|---|---|
| 0 | Mimì never forgets to see the beauty in life. | {beauty, life, forgetting} | {optimism, joy} | {} | {} | {optimism, sadness, disgust, joy} | optimism,sadness,disgust,joy |
| 1 | I will choose the bad guy in every story, I am... | {choosing, story, guy, attraction} | {anticipation, joy} | {} | {} | {optimism, anger} | anticipation,optimism,anger,joy |
| 2 | Even perfect people are taught to practice imp... | {fitness, evening, practice, society, perfecti... | {optimism} | {} | {} | {anger, sadness, disgust, joy, optimism} | optimism,anger,sadness,disgust,joy |
| 3 | The greater your capacity to love, the greater... | {feel, feeling, pain, capacity, love} | {optimism, sadness, joy} | {love} | {love, sadness} | {sadness, disgust, joy, optimism, love, pessim... | sadness,disgust,joy,optimism,love,pessimism |
| 4 | We Turks are faithful muslims. | {faith, faithful} | {optimism, joy} | {} | {} | {optimism, joy} | optimism,joy |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 847 | Your eyes show the strength of your soul. | {eye, soul, strength, show} | {trust, optimism, joy} | {} | {} | {optimism, joy} | trust,optimism,joy |
| 848 | Sometimes it is the people no one can imagine ... | {doing, anything, imagination, people, thing} | {optimism} | {} | {} | {anticipation, anger, disgust, joy, optimism} | optimism,anticipation,anger,disgust,joy |
| 849 | A prophet is nothing without a new revelation. | {prophet, nothing, revelation, news} | {optimism, joy} | {} | {} | {sadness, disgust, joy, optimism, love} | optimism,sadness,love,disgust,joy |
| 850 | I’m not mad. I’m in a perfectly happy mood, yo... | {mood, happiness, madness} | {anger, optimism, disgust, joy} | {} | {optimism} | {anger, sadness, disgust, joy, optimism} | anger,sadness,disgust,joy,optimism |


## 2. More Quotes by Parsing Some Sources

TODO: if needed