# Creating a (Baseline) Dataset of Pairs (diary-style text, quote)

## 1. Baseline Method: Based on Embeddings of Text and Quote (using Cosine Similarity)

Using the embeddings learned by the chosen classifiers (`roberta-base-go_emotions` and `twitter-roberta-base-emotion-multilabel-latest`), we can compare a diary-style text with a quote. To obtain embeddings for the text and the quote, we drop the final classification layer of both models and keep the encoder output. We then compare the text and quote embeddings with cosine similarity. For each diary-style text, a quote is chosen at random from those whose similarity exceeds a threshold; if no quote exceeds the threshold, the top-1 quote by cosine similarity is used.
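As a minimal illustration of the comparison step, the sketch below computes cosine similarity in plain Python on toy 4-dimensional vectors. The vectors and the helper name `cosine_similarity` are made up for this example; the notebook itself uses `torch.nn.functional.cosine_similarity` on the model embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins for real embeddings (assumption: 4 dimensions instead of 768).
text_emb = [1.0, 0.0, 1.0, 0.0]
quote_embs = [
    [2.0, 0.0, 2.0, 0.0],  # same direction as the text, different length
    [0.0, 1.0, 0.0, 1.0],  # orthogonal to the text
    [1.0, 1.0, 1.0, 1.0],  # partially aligned with the text
]

similarities = [cosine_similarity(text_emb, q) for q in quote_embs]
# Index of the most similar quote (the top-1 candidate).
best_index = max(range(len(similarities)), key=lambda i: similarities[i])
```

Because cosine similarity ignores vector length, the first quote scores (up to rounding) 1 even though its vector is twice as long as the text vector.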

### Read the data

In [1]:
import pandas as pd


# diaries = pd.read_csv('../data/diaries_labeled_reddit.csv', index_col=0)['Text'].to_list()
# quotes = pd.read_csv('../data/quotes.csv')['Quote'].to_list()

diaries = pd.read_csv('/kaggle/input/recsys/diaries_labeled_reddit.csv', index_col=0)['Text'].to_list()
quotes = pd.read_csv('/kaggle/input/recsys/quotes.csv')['Quote'].to_list()

### Load models

In [2]:
import torch


device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

In [3]:
from transformers import AutoTokenizer, RobertaModel


tokenizer_reddit = AutoTokenizer.from_pretrained("SamLowe/roberta-base-go_emotions")
model_reddit = RobertaModel.from_pretrained("SamLowe/roberta-base-go_emotions", add_pooling_layer=False, device_map="auto")

tokenizer_twitter = AutoTokenizer.from_pretrained("cardiffnlp/twitter-roberta-base-emotion-multilabel-latest")
model_twitter = RobertaModel.from_pretrained("cardiffnlp/twitter-roberta-base-emotion-multilabel-latest", add_pooling_layer=False, device_map="auto")


In [4]:
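def model_inference(model, tokenizer, text):
    # Tokenize a single text; inputs longer than the model maximum are truncated.
    tokenized_text = tokenizer(text, return_tensors="pt", truncation=True)
    tokenized_text = tokenized_text.to(device)
    # Inference only, so gradient tracking is not needed.
    with torch.no_grad():
        output = model(**tokenized_text)
    # Last hidden state of the first (<s>) token, shape (1, hidden_size).
    return output[0][:, 0, :]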

#### Test inference and Cosine Similarity computation

In [13]:
test_word = 'sunday'

r1 = model_inference(model_reddit, tokenizer_reddit, test_word)
r2 = model_inference(model_reddit, tokenizer_reddit, test_word)
t1 = model_inference(model_twitter, tokenizer_twitter, test_word)
t2 = model_inference(model_twitter, tokenizer_twitter, test_word)

In [14]:
from torch.nn import functional as F


F.cosine_similarity(r1, r2).data, F.cosine_similarity(t1, t2).data

(tensor([1.], device='cuda:0'), tensor([1.0000], device='cuda:0'))

### Create embeddings

In [6]:
quotes_emb_reddit = [model_inference(model_reddit, tokenizer_reddit, q) for q in quotes]



In [19]:
import pickle


with open('./quotes_emb_reddit.pickle', 'wb') as handle:
    pickle.dump(quotes_emb_reddit, handle)

In [14]:
quotes_emb_twitter = [model_inference(model_twitter, tokenizer_twitter, q) for q in quotes]

In [25]:
import pickle


with open('./quotes_emb_twitter.pickle', 'wb') as handle:
    pickle.dump(quotes_emb_twitter, handle)

### Create a Dataset

For each diary-style text, select a quote based on the cosine similarity of their embeddings: a random quote among those above `similarity_threshold`, or the closest quote if none exceeds it.
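The selection rule used in the cells below can be sketched in isolation as follows. This is a plain-Python restatement for illustration only; the function name `pick_quote_index` and the `rng` parameter are hypothetical and do not appear in the notebook.

```python
import random


def pick_quote_index(similarities, threshold, rng=random):
    """Return the index of the quote to pair with one diary text.

    similarities: one cosine similarity per quote.
    threshold: minimum similarity for a quote to count as a candidate.
    rng: object with a .choice method (the random module by default).
    """
    candidates = [i for i, s in enumerate(similarities) if s > threshold]
    if candidates:
        # Several quotes may pass the threshold: sample one uniformly.
        return rng.choice(candidates)
    # No quote passes the threshold: fall back to the single best match.
    return max(range(len(similarities)), key=lambda i: similarities[i])


# No similarity exceeds 0.8, so the best match (index 0) is returned.
fallback_index = pick_quote_index([0.61, 0.42, 0.55], threshold=0.8)
# Indices 0 and 2 exceed 0.8, so one of them is returned at random.
sampled_index = pick_quote_index([0.85, 0.42, 0.91], threshold=0.8)
```

Sampling among all candidates above the threshold, rather than always taking the top-1 quote, spreads the pairs over more quotes, at the cost of sometimes pairing a text with a quote that is not its closest match.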

In [7]:
from torch.nn import functional as F
import numpy as np


similarity_threshold = 0.8

#### Using [SamLowe/roberta-base-go_emotions](https://huggingface.co/SamLowe/roberta-base-go_emotions) model

In [79]:
import pickle


with open('./quotes_emb_reddit.pickle', 'rb') as handle:
    quotes_emb_reddit = pickle.load(handle)

In [8]:
quotes_emb_reddit = np.array([e.cpu().detach().numpy() for e in quotes_emb_reddit])

In [77]:
# Quote embeddings as one (num_quotes, hidden_size) tensor, built once outside the loop.
q_emb = torch.tensor(quotes_emb_reddit).squeeze(1)

for d in diaries[:5]:
    d_emb = model_inference(model_reddit, tokenizer_reddit, d)
    d_emb = d_emb.cpu()  # shape (1, hidden_size); broadcasts against q_emb
    similarities = F.cosine_similarity(d_emb, q_emb)
    above_threshold_indices = (similarities > similarity_threshold).nonzero().flatten().tolist()
    if above_threshold_indices:
        # Several quotes pass the threshold: sample one of them uniformly.
        index = np.random.choice(above_threshold_indices)
        print(f'random out of {len(above_threshold_indices)}: ', similarities[index].item())
    else:
        # No quote passes the threshold: fall back to the single best match.
        index = torch.argmax(similarities).item()
        print('top: ', similarities[index].item())
    print(d)
    print()
    print(quotes[index])
    print()
    print()

top:  0.614059567451477
My family was the most salient part of my day, since most days the care of my 2 children occupies the majority of my time. They are 2 years old and 7 months and I love them, but they also require so much attention that my anxiety is higher than ever. I am often overwhelmed by the care the require, but at the same, I am so excited to see them hit developmental and social milestones.

I'm possessed by love — but isn't everybody?


random out of 2:  0.8722858428955078
Yoga keeps me focused. I am able to take some time for me and breath and work my body. This is important because it sets up my mood for the whole day.

I didn’t grow up in a man’s man world. I grew up with my mum and my sister. But I definitely think in the last two years, I’ve become a lot more content with who I am. I think there’s so much masculinity in being vulnerable and allowing yourself to be feminine, and I’m very comfortable with that. Growing up you don’t even know what those things mean. Y

In [None]:
selected_quotes_reddit = []
random_choose_size = []

# Quote embeddings as one (num_quotes, hidden_size) tensor, built once outside the loop.
q_emb = torch.tensor(quotes_emb_reddit).squeeze(1)

for d in diaries:
    d_emb = model_inference(model_reddit, tokenizer_reddit, d)
    d_emb = d_emb.cpu()  # shape (1, hidden_size); broadcasts against q_emb
    similarities = F.cosine_similarity(d_emb, q_emb)
    above_threshold_indices = (similarities > similarity_threshold).nonzero().flatten().tolist()
    if above_threshold_indices:
        # Sample uniformly among the quotes above the threshold.
        index = np.random.choice(above_threshold_indices)
        random_choose_size.append(len(above_threshold_indices))
    else:
        # Fall back to the single best match.
        index = torch.argmax(similarities).item()
    selected_quotes_reddit.append(quotes[index])

In [12]:
print(f'Random choice: {len(random_choose_size)} / {len(diaries)}, on average from {np.mean(random_choose_size)} samples')

Random choice: 1274 / 1648, on average from 4.638932496075353 samples


In [13]:
diaries_quotes_reddit = pd.DataFrame(zip(diaries, selected_quotes_reddit), columns=['Text', 'Quote'])
diaries_quotes_reddit.to_csv('./diaries_quotes_emb_reddit.csv', index=False)

#### Using [twitter-roberta-base-emotion-multilabel-latest](https://huggingface.co/cardiffnlp/twitter-roberta-base-emotion-multilabel-latest) model

In [None]:
import pickle


with open('./quotes_emb_twitter.pickle', 'rb') as handle:
    quotes_emb_twitter = pickle.load(handle)

In [15]:
quotes_emb_twitter = np.array([e.cpu().detach().numpy() for e in quotes_emb_twitter])

In [17]:
# Quote embeddings as one (num_quotes, hidden_size) tensor, built once outside the loop.
q_emb = torch.tensor(quotes_emb_twitter).squeeze(1)

for d in diaries[:5]:
    d_emb = model_inference(model_twitter, tokenizer_twitter, d)
    d_emb = d_emb.cpu()  # shape (1, hidden_size); broadcasts against q_emb
    similarities = F.cosine_similarity(d_emb, q_emb)
    above_threshold_indices = (similarities > similarity_threshold).nonzero().flatten().tolist()
    if above_threshold_indices:
        # Several quotes pass the threshold: sample one of them uniformly.
        index = np.random.choice(above_threshold_indices)
        print(f'random out of {len(above_threshold_indices)}: ', similarities[index].item())
    else:
        # No quote passes the threshold: fall back to the single best match.
        index = torch.argmax(similarities).item()
        print('top: ', similarities[index].item())
    print(d)
    print()
    print(quotes[index])
    print()
    print()

random out of 2:  0.9532959461212158
My family was the most salient part of my day, since most days the care of my 2 children occupies the majority of my time. They are 2 years old and 7 months and I love them, but they also require so much attention that my anxiety is higher than ever. I am often overwhelmed by the care the require, but at the same, I am so excited to see them hit developmental and social milestones.

You say you love rain, but you use an umbrella to walk under it. You say you love sun, but you seek shelter when it is shining. You say you love wind, but when it comes you close your windows. So that's why I'm scared when you say you love me.


random out of 228:  0.9184310436248779
Yoga keeps me focused. I am able to take some time for me and breath and work my body. This is important because it sets up my mood for the whole day.

Change is good only when you know yourself.


random out of 31:  0.818392813205719
Yesterday, my family and I played a bunch of board games.

In [18]:
selected_quotes_twitter = []
random_choose_size = []

# Quote embeddings as one (num_quotes, hidden_size) tensor, built once outside the loop.
q_emb = torch.tensor(quotes_emb_twitter).squeeze(1)

for d in diaries:
    d_emb = model_inference(model_twitter, tokenizer_twitter, d)
    d_emb = d_emb.cpu()  # shape (1, hidden_size); broadcasts against q_emb
    similarities = F.cosine_similarity(d_emb, q_emb)
    above_threshold_indices = (similarities > similarity_threshold).nonzero().flatten().tolist()
    if above_threshold_indices:
        # Sample uniformly among the quotes above the threshold.
        index = np.random.choice(above_threshold_indices)
        random_choose_size.append(len(above_threshold_indices))
    else:
        # Fall back to the single best match.
        index = torch.argmax(similarities).item()
    selected_quotes_twitter.append(quotes[index])

In [19]:
print(f'Random choice: {len(random_choose_size)} / {len(diaries)}, on average from {np.mean(random_choose_size)} samples')

Random choice: 1627 / 1648, on average from 60.77504609711125 samples


In [20]:
diaries_quotes_twitter = pd.DataFrame(zip(diaries, selected_quotes_twitter), columns=['Text', 'Quote'])
diaries_quotes_twitter.to_csv('./diaries_quotes_emb_twitter.csv', index=False)

## 2. Baseline Method: Based on Overlapping Emotion Labels of Text and Quote

Given the labeled datasets of diaries (diary-style texts) and quotes, we can find the most appropriate quote for each diary entry. To choose the most suitable quote, we use an overlap method that compares the emotion labels of the diary-style text and of each quote: pick the quote that has the maximum number of overlapping emotion labels (and a score greater than some threshold) with the diary-style text.
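One possible reading of the overlap method is sketched below. It rests on assumptions that the source does not confirm: each text and each quote is taken to come with a dictionary mapping emotion labels to scores, a label counts only if its score exceeds `min_score`, and overlap is measured as the number of shared labels. The names `active_labels`, `best_quote_by_overlap` and `min_score`, and the toy scores, are hypothetical.

```python
def active_labels(scores, min_score=0.5):
    """Set of emotion labels whose score exceeds min_score."""
    return {label for label, score in scores.items() if score > min_score}


def best_quote_by_overlap(diary_scores, quotes_scores, min_score=0.5):
    """Return (index, overlap) of the quote sharing the most labels with the diary.

    Returns (None, 0) when no quote shares any label with the diary text.
    Ties are resolved in favor of the earliest quote.
    """
    diary_labels = active_labels(diary_scores, min_score)
    best_index, best_overlap = None, 0
    for i, quote_scores in enumerate(quotes_scores):
        overlap = len(diary_labels & active_labels(quote_scores, min_score))
        if overlap > best_overlap:
            best_index, best_overlap = i, overlap
    return best_index, best_overlap


# Toy scores, invented for illustration only.
diary = {"joy": 0.9, "love": 0.7, "fear": 0.1}
quote_label_scores = [
    {"joy": 0.8, "sadness": 0.6},  # shares "joy"
    {"joy": 0.6, "love": 0.9},     # shares "joy" and "love"
    {"fear": 0.9},                 # "fear" is below min_score in the diary
]
result = best_quote_by_overlap(diary, quote_label_scores)
```

Here the second quote is selected because it shares two labels with the diary text.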

In [None]:
import pandas as pd


diaries_dfs = {
    'reddit': pd.read_csv('../data/diaries_labeled_reddit.csv', index_col=0),
    'twitter': pd.read_csv('../data/diaries_labeled_twitter.csv', index_col=0),
}
# quotes_dfs = {
#     'reddit': pd.read_csv('../data/.csv', index_col=0),
#     'twitter': pd.read_csv('../data/.csv', index_col=0),
# }