# Database creation & Embedding generation

Here, we create a vector database of emotionally rich text without emojis, so that we can show it to the user when they enter text containing emojis.

Main steps:
1. Choose dataset (~50000 objects)
2. Choose model to create embeddings from dataset
3. Choose vector database/storage
4. Choose indexing algorithm

## Dataset
The chosen dataset is **Quotes-500K**, which contains ~500000 quotes with tags such as love, life, philosophy, motivation, and family. These tags will help evaluate whether a selected quote is relevant to the input.
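
One way such a tag-based check could look is sketched below. It assumes the `category` column holds comma-separated tags, as in the dataset preview; the helper names `parse_tags` and `is_relevant` and the expected tag set are hypothetical and not part of the notebook.

```python
def parse_tags(category: str) -> set[str]:
    """Split a comma-separated tag string into a set of normalized tags."""
    return {tag.strip().lower() for tag in category.split(",") if tag.strip()}


def is_relevant(category: str, expected_tags: set[str]) -> bool:
    """Treat a quote as relevant if it shares at least one tag with the expected set."""
    return bool(parse_tags(category) & expected_tags)


# Hypothetical example: the input is expected to be about love or friendship
expected = {"love", "friendship"}
print(is_relevant("friend, friendship, knowledge, love", expected))  # True
print(is_relevant("darkness, hate, light", expected))  # False
```

A single shared tag is a deliberately loose criterion; a stricter variant could require a minimum overlap ratio instead.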

In [2]:
import pandas as pd

df = pd.read_csv('quotes.csv')

In [10]:
print(df.shape)
df.head()

(499709, 3)


| | quote | author | category |
|---|---|---|---|
| 0 | I'm selfish, impatient and a little insecure. ... | Marilyn Monroe | attributed-no-source, best, life, love, mistak... |
| 1 | You've gotta dance like there's nobody watchin... | William W. Purkey | dance, heaven, hurt, inspirational, life, love... |
| 2 | You know you're in love when you can't fall as... | Dr. Seuss | attributed-no-source, dreams, love, reality, s... |
| 3 | A friend is someone who knows all about you an... | Elbert Hubbard | friend, friendship, knowledge, love |
| 4 | Darkness cannot drive out darkness: only light... | Martin Luther King Jr., A Testament of Hope: T... | darkness, drive-out, hate, inspirational, ligh... |


Since 500K quotes is far more than we need, we select the 100K shortest meaningful quotes.

In [None]:
quotes = df['quote'].dropna()
print("Number of quotes:", len(quotes))
print("Minimal:", min(quotes, key=len), "Maximal:", max(quotes, key=len))
print("Sorted quotes:", sorted(quotes, key=len)[55:100])  # From about the 55th shortest quote onward, the text becomes meaningful

Number of quotes: 499708
Minimal: 1 Maximal: I imagined my coffin being closed, and the screws being turned. I was immobile, but I was alive, and I wanted to tell my family that I was seeing everything. I wanted to tell them all that I loved them, but not a sound came out of my mouth. My father and mother were weeping, my wife and my friends were gathered around, but I was completely alone! With all of the people dear to me standing there, no one was able to see that I was alive and that I had not yet accomplished all that I wanted to do in this world. I tried desperately to open my eyes, to give a sign, to beat on the lid of the coffin. But I could not move any part of my body. I felt the coffin being carried toward the grave. I could hear the sound of the handles grinding against their fittings, the steps of those in the procession, and conversations from this side and that. Someone said that he had a date for dinner later on, and another observed that I had died early. The smell of 

In [21]:
# Final selected quotes
selected_quotes = sorted(quotes, key=len)[55:100056]

# Filter the original dataframe to keep only selected quotes
selected_df = df[df['quote'].isin(selected_quotes)][['quote', 'author', 'category']]

# Save to new CSV file
selected_df.to_csv('selected_quotes.csv', index=False)
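
The selection logic above can be illustrated on a toy list. This is only a sketch: `select_shortest` is a hypothetical helper, and the `skip` and `count` values are chosen for the example rather than taken from the notebook. A slice of the form `[skip:skip + count]` returns exactly `count` items when enough quotes are available.

```python
def select_shortest(quotes, skip, count):
    """Return `count` quotes ordered by length, ignoring the `skip` shortest ones."""
    unique = list(dict.fromkeys(quotes))  # drop duplicates, keep first occurrence
    ordered = sorted(unique, key=len)     # sorted() is stable, so ties keep input order
    return ordered[skip:skip + count]


toy_quotes = ["1", "Hi", "Love all.", "Be kind.", "Love all.", "Stay hungry, stay foolish."]
print(select_shortest(toy_quotes, skip=2, count=2))
# ['Be kind.', 'Love all.']
```

Deduplicating first is a design choice of this sketch; the `isin` filter used in the notebook keeps every row whose quote text matches, including duplicates.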

## Model

`SamLowe/roberta-base-go_emotions` was chosen. This model is based on RoBERTa and fine-tuned on the GoEmotions dataset (27 emotion categories plus neutral), so its representations are expected to separate emotional tones better than a general-purpose encoder, which makes it a reasonable candidate for emotion-oriented search.
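
The checkpoint is a classifier rather than a sentence-transformers model, so the library wraps its encoder with mean pooling, as the load log below reports. The following sketch shows the idea of mean pooling in plain Python on made-up numbers: token vectors are averaged, and padding tokens (mask value 0) are ignored. The function name `mean_pool` and the toy values are illustrative assumptions, not the library's internal code.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    n_real = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            n_real += 1
            for i, value in enumerate(vector):
                totals[i] += value
    if n_real == 0:
        raise ValueError("attention_mask must keep at least one token")
    return [total / n_real for total in totals]


# Three tokens with 2-dimensional vectors; the last one is padding
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0]
```

With the real model, the same kind of averaging over token vectors produces one fixed-size vector per quote.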

In [None]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('SamLowe/roberta-base-go_emotions') # Load the pre-trained model

No sentence-transformers model found with name SamLowe/roberta-base-go_emotions. Creating a new one with mean pooling.


Some weights of RobertaModel were not initialized from the model checkpoint at SamLowe/roberta-base-go_emotions and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [24]:
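# Generate embeddings for the selected quotes in batches (faster than encoding one quote at a time)
embeddings = model.encode(selected_df['quote'].tolist(), batch_size=64, show_progress_bar=True)
selected_df['embeddings'] = embeddings.tolist()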

In [25]:
selected_df.head()

| | quote | author | category | embeddings |
|---|---|---|---|---|
| 3 | A friend is someone who knows all about you an... | Elbert Hubbard | friend, friendship, knowledge, love | [-0.8965908288955688, 0.8911721706390381, 0.07... |
| 5 | We accept the love we think we deserve. | Stephen Chbosky, The Perks of Being a Wallflower | inspirational, love | [-0.6066871285438538, 0.21085752546787262, -0.... |
| 12 | Love all, trust a few, do wrong to none. | William Shakespeare, All's Well That Ends Well | do-wrong, love, trust, wrong | [-0.7522128820419312, 0.6477974653244019, -0.2... |
| 18 | You love me. Real or not real?"I tell him, "Real. | Suzanne Collins, Mockingjay | katniss, love, peeta, suzanne-collins, the-hun... | [-0.11977081745862961, 0.6745093464851379, 0.1... |
| 23 | Love is like the wind, you can't see it but yo... | Nicholas Sparks, A Walk to Remember | love, simile | [-0.581844687461853, 0.6239872574806213, 0.094... |


In [28]:
len(selected_df['embeddings'].iloc[0])

768

In [29]:
selected_df.to_csv('selected_quotes_embeddings.csv', index=False)
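
Writing a column of Python lists to CSV stores each list as text, so the embeddings have to be parsed when the file is read back. The sketch below demonstrates this with the standard library on a toy row; the two-element vector is made up for brevity.

```python
import csv
import io
import json

# Write one toy row the way a list-valued column ends up in CSV: as its string form
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["quote", "embeddings"])
writer.writeheader()
writer.writerow({"quote": "We accept the love we think we deserve.",
                 "embeddings": str([-0.6066, 0.2108])})

# Read it back: the embedding is a string until it is parsed
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw = rows[0]["embeddings"]
print(type(raw).__name__, raw)        # str [-0.6066, 0.2108]
vector = json.loads(raw)              # parse the text back into a list of floats
print(type(vector).__name__, vector)  # list [-0.6066, 0.2108]
```

With pandas, the same parsing can be applied at load time, for example `pd.read_csv('selected_quotes_embeddings.csv', converters={'embeddings': json.loads})`.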

## Before running:
Install the Qdrant client
```
pip install qdrant-client
```
Start a local Qdrant server via Docker
```
docker run -p 6333:6333 qdrant/qdrant
```

In [8]:
# init client
from qdrant_client import QdrantClient

# When using a local Qdrant instance
client = QdrantClient(url="http://localhost:6333", prefer_grpc=False)

In [9]:
# Collection parameters
collection_name = "emoji_sync"
vector_size = 768  # embedding size of the RoBERTa-base model

# Create a new collection
client.create_collection(
    collection_name=collection_name,
    vectors_config={
        "size": vector_size,
        "distance": "Cosine"
    }
)

True

In [10]:
# Add points to the database
from qdrant_client.http.models import PointStruct
import numpy as np

# Use vector_size from the collection parameters to generate example vectors
ids = [1, 2, 3]
# TODO: replace with real embeddings
vectors = [np.random.rand(vector_size).tolist() for _ in ids]
metadatas = [
    {"text": "I love pizza 😊"},
    {"text": "It's raining outside 😢"},
    {"text": "Party time 🎉"}
]
points = [
    PointStruct(id=pid, vector=vec, payload=meta)
    for pid, vec, meta in zip(ids, vectors, metadatas)
]

# Upload the points to Qdrant
client.upsert(
    collection_name=collection_name,
    points=points
)

UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)
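
For the real data, the points would be built from the quote table rather than from random vectors. The sketch below prepares such points in plain Python and splits them into batches. The record fields mirror the columns of `selected_quotes_embeddings.csv`, while `build_points`, `batched`, the batch size, and the toy vectors are assumptions of this example. Each batch could then be converted to `PointStruct` objects and passed to `client.upsert`.

```python
def build_points(records):
    """Turn quote records into point dicts with an id, a vector, and a payload."""
    return [
        {
            "id": idx,
            "vector": record["embeddings"],
            "payload": {"text": record["quote"], "author": record["author"]},
        }
        for idx, record in enumerate(records, start=1)
    ]


def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Toy records with made-up 3-dimensional vectors (the real ones have 768 values)
records = [
    {"quote": "We accept the love we think we deserve.",
     "author": "Stephen Chbosky, The Perks of Being a Wallflower",
     "embeddings": [0.1, 0.2, 0.3]},
    {"quote": "Love all, trust a few, do wrong to none.",
     "author": "William Shakespeare, All's Well That Ends Well",
     "embeddings": [0.4, 0.5, 0.6]},
    {"quote": "Love is like the wind, you can't see it but you can feel it.",
     "author": "Nicholas Sparks, A Walk to Remember",
     "embeddings": [0.7, 0.8, 0.9]},
]

points = build_points(records)
for batch in batched(points, size=2):
    print([point["id"] for point in batch])
# [1, 2]
# [3]
```

Uploading in batches keeps each request small, which matters once the collection holds on the order of 100K vectors.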

In [35]:
# Query embedding
# TODO: compute the real embedding
# query_embedding = embedder.encode("Happy vibes 😄🎶").tolist()
query_embedding = np.random.rand(vector_size).tolist()

response = client.query_points(
    collection_name=collection_name,
    query=query_embedding,  # previously query_vector
    limit=5,
    with_payload=True
)

for point in response.points:
    print(f"ID: {point.id}, Score: {point.score:.4f}, Text: {point.payload['text']}")

ID: 2, Score: 0.7639, Text: It's raining outside 😢
ID: 1, Score: 0.7601, Text: I love pizza 😊
ID: 3, Score: 0.7595, Text: Party time 🎉
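
The near-identical scores are expected here and say nothing about the texts: both the stored vectors and the query are random with components in [0, 1), and such vectors all point in a similar direction. The sketch below, which uses only the standard library and a hypothetical helper `cosine_similarity`, shows that two independent random vectors of this kind have a cosine similarity of about 0.75.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


random.seed(0)  # fixed seed so the run is repeatable
dim = 768
a = [random.random() for _ in range(dim)]
b = [random.random() for _ in range(dim)]

# E[x*y] = 1/4 and E[x^2] = 1/3 for uniform values in [0, 1), so the ratio tends to 0.75
print(round(cosine_similarity(a, b), 2))
```

Meaningful rankings will only appear once the random vectors are replaced with model embeddings for both the quotes and the query.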
