# Database creation & Embedding generation

Here we create a vector database of emotionally rich text without emojis, so that we can show matching text to users when they enter text containing emojis.

Main steps:
1. Choose a dataset (~50,000 objects)
2. Choose a model to create embeddings from the dataset
3. Choose a vector database/storage
4. Choose an indexing algorithm
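
Since the stored texts are meant to be emoji-free, the chosen dataset may need a cleaning pass before embedding. Below is a minimal sketch of such a pass. The helper name `strip_emojis` and the Unicode ranges are assumptions made for illustration; the ranges cover the most common emoji but are not exhaustive.

```python
import re

# Assumption: these ranges catch the most common emoji; they are not a complete list.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector-16 (emoji presentation)
    "\u200D"                 # zero-width joiner used in emoji sequences
    "]+"
)


def strip_emojis(text: str) -> str:
    """Replace emoji runs with a space, then collapse leftover whitespace."""
    without_emojis = EMOJI_PATTERN.sub(" ", text)
    return " ".join(without_emojis.split())


print(strip_emojis("Party time 🎉"))
```

A dedicated emoji library would likely be more thorough; the regex keeps the sketch dependency-free.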

## Before running:
Install the Qdrant client
```
pip install qdrant-client
```
Start a local Qdrant server via Docker
```
docker run -p 6333:6333 qdrant/qdrant
```

In [8]:
# init client
from qdrant_client import QdrantClient

# When using a local Qdrant instance
client = QdrantClient(url="http://localhost:6333", prefer_grpc=False)

In [9]:
# Collection parameters
collection_name = "emoji_sync"
vector_size = 768  # BERT embedding size

# create a new collection
client.create_collection(
    collection_name=collection_name,
    vectors_config={
        "size": vector_size,
        "distance": "Cosine"
    }
)

True
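
The collection above is configured with the `"Cosine"` distance. As a rough illustration of what that metric measures, here is a plain-Python sketch of cosine similarity. The function `cosine_similarity` is a hypothetical helper written for this note, not part of `qdrant_client`, and it is not how Qdrant computes scores internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Same direction, different length: the similarity ignores magnitude.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

The value ranges from -1 (opposite direction) to 1 (same direction), so a higher score means a closer match.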

In [10]:
# add points to the database
from qdrant_client.http.models import PointStruct
import numpy as np

# Use vector_size from the collection parameters to generate example vectors
ids = [1, 2, 3]
# TODO: replace with real embeddings
vectors = [np.random.rand(vector_size).tolist() for _ in ids]
metadatas = [
    {"text": "I love pizza 😊"},
    {"text": "It's raining outside 😢"},
    {"text": "Party time 🎉"}
]
points = [
    PointStruct(id=pid, vector=vec, payload=meta)
    for pid, vec, meta in zip(ids, vectors, metadatas)
]

# Upload the points to Qdrant
client.upsert(
    collection_name=collection_name,
    points=points
)

UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)
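
With three points a single `upsert` call is enough, but for the planned ~50,000 objects it is probably safer to upload in chunks. The sketch below shows one way to split any sequence into batches; the helper `batched` and the batch size are assumptions for illustration, not values taken from this notebook.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Example with plain integers standing in for points.
for batch in batched(range(5), 2):
    print(batch)
```

In this notebook the loop body would be `client.upsert(collection_name=collection_name, points=batch)`, with a batch size tuned to the payload size (a few hundred points is a common starting guess).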

In [35]:
# Query embedding
# TODO: compute the real embedding
# query_embedding = embedder.encode("Happy vibes 😄🎶").tolist()
query_embedding = np.random.rand(vector_size).tolist()

response = client.query_points(
    collection_name=collection_name,
    query=query_embedding,  # previously query_vector
    limit=5,
    with_payload=True
)

for point in response.points:
    print(f"ID: {point.id}, Score: {point.score:.4f}, Text: {point.payload['text']}")

ID: 2, Score: 0.7639, Text: It's raining outside 😢
ID: 1, Score: 0.7601, Text: I love pizza 😊
ID: 3, Score: 0.7595, Text: Party time 🎉
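
All three scores sit close to 0.76 because the vectors are random placeholders, not because the texts are similar. For components drawn uniformly from [0, 1), the expected product of two components is 1/4 and the expected square is 1/3, so the cosine similarity of two such vectors is expected to land near 0.75. The sketch below checks this with the standard library only; the function names, the number of pairs, and the seed are assumptions chosen for illustration.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_random_similarity(dim=768, pairs=50, seed=0):
    """Average cosine similarity over pairs of uniform random vectors."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(pairs):
        a = [rng.random() for _ in range(dim)]
        b = [rng.random() for _ in range(dim)]
        total += cosine_similarity(a, b)
    return total / pairs


print(f"{mean_random_similarity():.4f}")
```

Once real embeddings replace the random vectors, the scores should spread out and the ranking should start to reflect how close the texts are in meaning.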
