# Database creation & Embedding generation

Here we build a vector database of emotionally rich text without emojis, so that we can show a matching text to the user when they enter text containing emojis.

Main steps:
1. Choose dataset (~50000 objects)
2. Choose model to create embeddings from dataset
3. Choose vector database/storage
4. Choose indexing algorithm

## Dataset
The chosen dataset is **Quotes-500K**, which contains ~500000 quotes tagged with categories such as love, life, philosophy, motivation, and family. These tags will help evaluate whether a selected quote is relevant to the input.
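
One way the tags could serve as a relevance check is to compare the tags of a retrieved quote with the tags expected for a query. The sketch below is only an assumption about how such a check might look; `parse_tags` and `tag_overlap` are hypothetical helpers, and the expected tags are made up.

```python
def parse_tags(category: str) -> set[str]:
    """Split a comma-separated `category` string into a set of clean tags."""
    return {tag.strip() for tag in category.split(",") if tag.strip()}


def tag_overlap(category: str, expected: set[str]) -> float:
    """Fraction of the expected tags that appear among the quote's tags."""
    if not expected:
        return 0.0
    return len(parse_tags(category) & expected) / len(expected)


# Category string taken from the dataset preview; the expected tags are made up.
score = tag_overlap("friend, friendship, knowledge, love", {"love", "friendship"})
print(score)  # 1.0
```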

In [1]:
import pandas as pd

df = pd.read_csv('quotes.csv')

In [2]:
print(df.shape)
df.head()

(499709, 3)


Unnamed: 0,quote,author,category
0,"I'm selfish, impatient and a little insecure. ...",Marilyn Monroe,"attributed-no-source, best, life, love, mistak..."
1,You've gotta dance like there's nobody watchin...,William W. Purkey,"dance, heaven, hurt, inspirational, life, love..."
2,You know you're in love when you can't fall as...,Dr. Seuss,"attributed-no-source, dreams, love, reality, s..."
3,A friend is someone who knows all about you an...,Elbert Hubbard,"friend, friendship, knowledge, love"
4,Darkness cannot drive out darkness: only light...,"Martin Luther King Jr., A Testament of Hope: T...","darkness, drive-out, hate, inspirational, ligh..."


Since 500K is much more than we need, we select the 100K shortest quotes that still carry meaning.

In [3]:
quotes = df['quote'].dropna()
print("Number of quotes:", len(quotes))
print("Minimal:", min(quotes, key=len), "Maximal:", max(quotes, key=len))
print("Sorted quotes:", sorted(quotes, key=len)[1000:1100])  # From about the 1000th shortest quote onward, the text becomes more meaningful

Number of quotes: 499708
Minimal: 1 Maximal: I imagined my coffin being closed, and the screws being turned. I was immobile, but I was alive, and I wanted to tell my family that I was seeing everything. I wanted to tell them all that I loved them, but not a sound came out of my mouth. My father and mother were weeping, my wife and my friends were gathered around, but I was completely alone! With all of the people dear to me standing there, no one was able to see that I was alive and that I had not yet accomplished all that I wanted to do in this world. I tried desperately to open my eyes, to give a sign, to beat on the lid of the coffin. But I could not move any part of my body. I felt the coffin being carried toward the grave. I could hear the sound of the handles grinding against their fittings, the steps of those in the procession, and conversations from this side and that. Someone said that he had a date for dinner later on, and another observed that I had died early. The smell of 

We select quotes by category so that the sample preserves the diversity of the original dataset.

In [10]:
df.dropna(subset=['category'], inplace=True)
print("Number of quotes with categories:", len(df))

Number of quotes with categories: 499646


In [13]:
from collections import Counter

# Split the tags in the category column
df['categories'] = df['category'].str.split(',').apply(lambda x: [tag.strip() for tag in x])

# Create a counter for tag frequencies
tag_counts = Counter()
for categories in df['categories']:
    tag_counts.update(categories)

# Compute the proportional quota for each tag
total_quotes = len(df)
samples_per_tag = {tag: int(count / total_quotes * 100000) for tag, count in tag_counts.items()}

In [14]:
samples_per_tag

{'attributed-no-source': 31,
 'best': 84,
 'life': 7019,
 'love': 7766,
 'mistakes': 259,
 'out-of-control': 4,
 'truth': 2367,
 'worst': 18,
 'dance': 130,
 'heaven': 272,
 'hurt': 279,
 'inspirational': 5820,
 'sing': 22,
 'dreams': 944,
 'reality': 621,
 'sleep': 209,
 'friend': 177,
 'friendship': 1000,
 'knowledge': 1286,
 'darkness': 371,
 'drive-out': 0,
 'hate': 321,
 'light': 497,
 'peace': 868,
 'activism': 79,
 'apathy': 37,
 'indifference': 30,
 'opposite': 6,
 'philosophy': 2989,
 'lack-of-friendship': 0,
 'lack-of-love': 2,
 'marriage': 770,
 'unhappy-marriage': 1,
 'poetry': 1437,
 'do-wrong': 0,
 'trust': 501,
 'wrong': 98,
 'courage': 741,
 'deeply-loved': 0,
 'strength': 570,
 'widely-misattributed': 1,
 'friends': 468,
 'heartbreak': 315,
 'sisters': 39,
 'essential': 8,
 'happiness': 2086,
 'katniss': 8,
 'peeta': 5,
 'suzanne-collins': 2,
 'the-hunger-games': 3,
 'doomed': 5,
 'inevitable': 7,
 'oblivion': 11,
 'pleasure': 144,
 'simple': 39,
 'simile': 24,
 'girls

In [15]:
# Sample quotes
sampled_indices = set()
for tag, n_samples in samples_per_tag.items():
    # Find the quotes that carry this tag
    tag_quotes = df[df['categories'].apply(lambda x: tag in x)]
    if len(tag_quotes) < n_samples:
        n_samples = len(tag_quotes)  # Take all of them if there are fewer than needed
    sampled_indices.update(tag_quotes.sample(n_samples, random_state=42).index)

In [16]:
len(sampled_indices)

251458
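
The union is larger than 100000, most likely because the quotas are computed from tag occurrences and most quotes carry several tags, so the per-tag quotas add up to more than the target. A toy example with made-up data illustrates the arithmetic:

```python
from collections import Counter

# Toy data (made up): each quote carries several tags, as in the real dataset.
quote_tags = [
    ["love", "life"],
    ["love", "truth"],
    ["life", "truth"],
    ["love", "life", "truth"],
]
target = 2  # desired sample size for the toy example

tag_counts = Counter(tag for tags in quote_tags for tag in tags)
quotas = {tag: int(count / len(quote_tags) * target) for tag, count in tag_counts.items()}

print(quotas)                # {'love': 1, 'life': 1, 'truth': 1}
print(sum(quotas.values()))  # 3, already above the target of 2
```

This is why the sample has to be trimmed to the target size afterwards.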

In [19]:
import random

if len(sampled_indices) > 100000:
    sampled_indices = random.sample(list(sampled_indices), 100000)

# Build the final dataframe
sampled_df = df.loc[list(sampled_indices)].copy()
sampled_df = sampled_df.drop(columns=['categories'])  # Drop the temporary column

In [20]:
sampled_df.head()

Unnamed: 0,quote,author,category
292948,Never surrender...never give up! Keep walking ...,"Timothy Pina, Bullying Ben: How Benjamin Frank...","inspirational-quote, legend-of-the-peace-panda..."
192662,Failure of your first attempt does not mean yo...,"Israelmore Ayivor, Daily Drive 365","attempt, battle, battles, don-t-give-up, fail,..."
187137,The only time that exists in life is now.,Ken Poirot,"exists, life, life-and-death, life-experience,..."
68822,The goal of the church is to bring the kingdom...,Sunday Adelaja,"blessing, church, employment, finance, goal, j..."
138656,His suffering was no more real than he was.,"Johnny Rich, The Human Script","fiction-vs-reality, fictional-character, reali..."


In [21]:
# Save to new CSV file
sampled_df.to_csv('selected_quotes.csv', index=False)

## Model

`SamLowe/roberta-base-go_emotions` was chosen. It is a RoBERTa-base model fine-tuned on the GoEmotions dataset (27 emotion categories plus neutral), so its representations should distinguish emotional tones better than a general-purpose encoder, which makes it a reasonable fit for emotion-oriented search.

In [22]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("SamLowe/roberta-base-go_emotions")
model = AutoModelForSequenceClassification.from_pretrained("SamLowe/roberta-base-go_emotions")

In [23]:
import torch
import numpy as np

# Function to generate embeddings
def get_embedding(text, tokenizer, model, device='cpu'):
    # Tokenization
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    inputs = {key: val.to(device) for key, val in inputs.items()}
    
    # Get model outputs
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    
    # Extract the last hidden state (batch_size, seq_len, hidden_size)
    hidden_states = outputs.hidden_states[-1]  # Last layer
    # Use the embedding of the first token (<s>, RoBERTa's equivalent of [CLS])
    cls_embedding = hidden_states[:, 0, :].squeeze().cpu().numpy()
    return cls_embedding.astype(np.float32)

In [25]:
# Generate embeddings for the selected quotes
sampled_df['embeddings'] = sampled_df['quote'].apply(lambda x: get_embedding(x, tokenizer, model).tolist())

In [26]:
sampled_df.head()

Unnamed: 0,quote,author,category,embeddings
292948,Never surrender...never give up! Keep walking ...,"Timothy Pina, Bullying Ben: How Benjamin Frank...","inspirational-quote, legend-of-the-peace-panda...","[-0.9500261545181274, 0.8815367817878723, -0.4..."
192662,Failure of your first attempt does not mean yo...,"Israelmore Ayivor, Daily Drive 365","attempt, battle, battles, don-t-give-up, fail,...","[-0.4900580942630768, -0.001421164721250534, -..."
187137,The only time that exists in life is now.,Ken Poirot,"exists, life, life-and-death, life-experience,...","[-0.32679757475852966, 0.4938211441040039, -0...."
68822,The goal of the church is to bring the kingdom...,Sunday Adelaja,"blessing, church, employment, finance, goal, j...","[0.24700643122196198, -0.7337496876716614, -0...."
138656,His suffering was no more real than he was.,"Johnny Rich, The Human Script","fiction-vs-reality, fictional-character, reali...","[-0.26969072222709656, 1.1550625562667847, -1...."


In [28]:
len(sampled_df['embeddings'].iloc[0])

768

In [29]:
sampled_df.to_csv('selected_quotes_embeddings.csv', index=False)
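
Note that `to_csv` stores each list in the `embeddings` column as its string form, so the column comes back as text when the file is read again. The sketch below shows the round trip with the standard `csv` module to stay self-contained; the row is made up and the vector is shortened to 3 values.

```python
import csv
import io
import json

# Made-up row with a shortened 3-dimensional vector instead of 768 values.
row = {"quote": "The only time that exists in life is now.", "embeddings": [-0.32, 0.49, -0.05]}

# Writing: the list is stored as its string form, e.g. "[-0.32, 0.49, -0.05]".
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["quote", "embeddings"])
writer.writeheader()
writer.writerow({"quote": row["quote"], "embeddings": str(row["embeddings"])})

# Reading: the cell is a plain string until it is parsed back into a list.
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
vector = json.loads(loaded["embeddings"])

print(type(loaded["embeddings"]).__name__)  # str
print(vector)                                # [-0.32, 0.49, -0.05]
```

With pandas, applying `json.loads` to the `embeddings` column after `pd.read_csv` should be the equivalent step, assuming the vectors contain only finite numbers.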

## Before running
Install the Qdrant client:
```
pip install qdrant-client
```
Start a local Qdrant server with Docker:
```
docker run -p 6333:6333 qdrant/qdrant
```

In [8]:
# init client
from qdrant_client import QdrantClient

# Using a local Qdrant instance
client = QdrantClient(url="http://localhost:6333", prefer_grpc=False)

In [9]:
# Collection parameters
collection_name = "emoji_sync"
vector_size = 768  # embedding size of the RoBERTa-base model (same as BERT-base)

# Create a new collection
client.create_collection(
    collection_name=collection_name,
    vectors_config={
        "size": vector_size,
        "distance": "Cosine"
    }
)

True
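
The collection is configured with cosine distance, which compares vectors by their angle and ignores their length. A minimal sketch of the measure in plain Python follows; `cosine_similarity` is an illustrative helper, not Qdrant's implementation.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # 1.0, same direction, different length
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # 0.0, orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0, opposite direction
```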

In [10]:
# Add points to the database
from qdrant_client.http.models import PointStruct
import numpy as np

# Use vector_size from the collection parameters to generate example vectors
ids = [1, 2, 3]
# TODO: replace with real embeddings
vectors = [np.random.rand(vector_size).tolist() for _ in ids]
metadatas = [
    {"text": "I love pizza 😊"},
    {"text": "It's raining outside 😢"},
    {"text": "Party time 🎉"}
]
points = [
    PointStruct(id=pid, vector=vec, payload=meta)
    for pid, vec, meta in zip(ids, vectors, metadatas)
]

# Upload the points to Qdrant
client.upsert(
    collection_name=collection_name,
    points=points
)

UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)
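
The upload above uses random placeholder vectors. As a rough sketch of how the real quotes could be prepared instead, the code below maps rows to id / vector / payload records and splits them into batches. The rows are made-up stand-ins for `sampled_df` with shortened vectors, `build_records` and `batched` are hypothetical helpers, and the batch size is arbitrary.

```python
from typing import Iterator

# Made-up rows standing in for `sampled_df`; vectors are shortened to 2 values.
rows = [
    {"id": 292948, "quote": "Never surrender...", "author": "Timothy Pina", "embeddings": [-0.95, 0.88]},
    {"id": 192662, "quote": "Failure of your first attempt...", "author": "Israelmore Ayivor", "embeddings": [-0.49, 0.0]},
    {"id": 187137, "quote": "The only time that exists in life is now.", "author": "Ken Poirot", "embeddings": [-0.33, 0.49]},
]


def build_records(rows: list[dict]) -> list[dict]:
    """Map each row to the id / vector / payload fields that a point needs."""
    return [
        {
            "id": row["id"],
            "vector": row["embeddings"],
            "payload": {"text": row["quote"], "author": row["author"]},
        }
        for row in rows
    ]


def batched(items: list, size: int) -> Iterator[list]:
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


for batch in batched(build_records(rows), size=2):
    # With Qdrant: points = [PointStruct(**record) for record in batch]
    #              client.upsert(collection_name=collection_name, points=points)
    print([record["id"] for record in batch])
```

Keeping the payload key `text` matches the field that the search cell prints.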

In [35]:
# Query embedding
# TODO: compute the real embedding
# query_embedding = embedder.encode("Happy vibes 😄🎶").tolist()
query_embedding = np.random.rand(vector_size).tolist()

response = client.query_points(
    collection_name=collection_name,
    query=query_embedding,  # previously query_vector
    limit=5,
    with_payload=True
)

for point in response.points:
    print(f"ID: {point.id}, Score: {point.score:.4f}, Text: {point.payload['text']}")

ID: 2, Score: 0.7639, Text: It's raining outside 😢
ID: 1, Score: 0.7601, Text: I love pizza 😊
ID: 3, Score: 0.7595, Text: Party time 🎉
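
The three scores are nearly identical because both the stored vectors and the query are random placeholders. For components drawn uniformly from [0, 1), the expected product of two components is 1/4 and the expected squared component is 1/3, so in high dimension the cosine similarity concentrates around (1/4) / (1/3) = 0.75. The sketch below checks this with plain Python; the seed and the number of pairs are arbitrary.

```python
import math
import random

random.seed(0)  # arbitrary seed, only for repeatability
dim = 768       # same size as `vector_size` above


def random_vector(n: int) -> list[float]:
    """Vector with components drawn uniformly from [0, 1), like np.random.rand."""
    return [random.random() for _ in range(n)]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


scores = [cosine_similarity(random_vector(dim), random_vector(dim)) for _ in range(20)]
mean_score = sum(scores) / len(scores)
print(round(mean_score, 2))
```

Meaningful rankings should only be expected once both sides use embeddings from the model.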
