### 1. Read data

In [1]:
def read_quotes() -> list[str]:
    with open("rick_and_morty_quotes.txt", "r") as fh:
        return fh.readlines()

In [2]:
rick_and_morty_quotes = read_quotes()
rick_and_morty_quotes[:3]

["Losers look stuff up while the rest of us are carpin' all them diems.\n",
 "He's not a hot girl. He can't just bail on his life and set up shop in someone else's.\n",
 "When you are an a—hole, it doesn't matter how right you are. Nobody wants to give you the satisfaction.\n"]

We can see that some characters could get in the way (\n). How do we make sure they don't affect the result? By comparing embeddings with SentenceTransformer:

In [5]:
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

emb1, emb2 = model.encode([
 "Losers look stuff up while the rest of us are carpin' all them diems.\n",
 "Losers look stuff up while the rest of us are carpin' all them diems."
])

np.allclose(emb1, emb2)

True
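
For context on what `True` means here: `np.allclose` does not test exact equality. It checks element-wise that `|a - b| <= atol + rtol * |b|`, with defaults `rtol=1e-05` and `atol=1e-08`. The toy vectors below are made up for illustration and are not model outputs; this is only a minimal sketch of the tolerance behaviour.

```python
import numpy as np

# Hypothetical vectors, not real embeddings
a = np.array([0.5, -0.25, 1.0])
b = a + 1e-7   # perturbation well inside the default tolerance
c = a + 1e-2   # perturbation far outside the default tolerance

close_ab = np.allclose(a, b)
close_ac = np.allclose(a, c)

print(close_ab)  # True
print(close_ac)  # False
```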

With and without the trailing \n, the two embeddings are close enough to be treated as equal.
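
Even though the newline does not change the embedding, it does show up when the quotes are printed. If that is unwanted, one option is to strip it while reading. The sketch below is a hypothetical variant of `read_quotes` (the name `read_quotes_stripped`, the `path` parameter, and the demo file are assumptions, not part of the notebook) and it runs against a throwaway file rather than the real quotes file.

```python
import tempfile
from pathlib import Path


def read_quotes_stripped(path: str) -> list[str]:
    """Read one quote per line, dropping surrounding whitespace and blank lines."""
    with open(path, "r", encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]


# Demo on a temporary file so the example is self-contained
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_quotes.txt"
    demo_path.write_text("first quote\nsecond quote\n\n", encoding="utf-8")
    demo_quotes = read_quotes_stripped(str(demo_path))

print(demo_quotes)  # ['first quote', 'second quote']
```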

### 2. Generate embeddings from text

In this step I write a function that turns text into embeddings using Sentence Transformers and a [pre-trained model](https://www.sbert.net/docs/pretrained_models.html).


In [6]:
import numpy as np
from sentence_transformers import SentenceTransformer
from typing import Union

MODEL_NAME = 'paraphrase-MiniLM-L6-v2'

def generate_embeddings(input_data: Union[str, list[str]]) -> np.ndarray:
    model = SentenceTransformer(MODEL_NAME)
    embeddings = model.encode(input_data)
    return embeddings

In [7]:
embeddings = generate_embeddings(rick_and_morty_quotes)

In [8]:
# Print the first three sentences with their embeddings
for sentence, embedding in zip(rick_and_morty_quotes[:3], embeddings[:3]):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

Sentence: Losers look stuff up while the rest of us are carpin' all them diems.

Embedding: [ 0.6188345   0.06881799  0.44374305 -0.45357847  0.30271497 -0.10784142
  0.4952488  -0.12448785  0.05482465 -0.04262822  0.04789169 -0.31940377
  0.18216974 -0.27199626 -0.14199588 -0.56009746 -0.35566285 -0.44555157
 -0.03909524  0.42247924 -0.4604969   0.26436502  0.16821663  0.34295756
  0.20552608  0.20994833 -0.07352602 -0.0243092  -0.07486243  0.41356117
 -0.09713856 -0.02470837  0.02246359  0.10461547  0.25205338 -0.05957111
  0.02156227  0.24379666  0.20664053 -0.40555897 -0.18285899  0.13926435
 -0.29004836  0.14936377 -0.17484201 -0.22140746 -0.01152995 -0.1715579
  0.2581103   0.01463477 -0.05509421  0.02583265  0.01430671 -0.13821091
  0.16159977 -0.56482404  0.4062962   0.08129338  0.18729608 -0.06932873
 -0.17729421 -0.1006496   0.3024407  -0.2205626  -0.2050517   0.13730268
  0.32069105  0.22979245 -0.22806767  0.37576777 -0.17270264 -0.17178848
  0.16163573  0.5295059  -0.19358

### 3. Let's put it all together

First I have to encode the question:

In [9]:
query_text = "Are you the cause of your parents' misery?"
# `model` is the SentenceTransformer instance created in In [5] (same model as MODEL_NAME)
query_embedding = model.encode(query_text)

In [10]:
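import numpy as np
from typing import Union

def euclidean_distance(v1: np.ndarray, v2: np.ndarray) -> Union[float, np.ndarray]:
    """
    Compute the Euclidean distance between two vectors.

    Parameters
    ----------
    v1 : np.ndarray
        First vector.
    v2 : np.ndarray
        Second vector, or a 2-D array holding one vector per row.

    Returns
    -------
    float or np.ndarray
        Euclidean distance between `v1` and `v2`, taken over the last axis.
        If `v2` is 2-D, one distance per row is returned.
    """
    # Broadcasting subtracts `v1` from every row when `v2` is 2-D
    dist = v1 - v2
    return np.linalg.norm(dist, axis=-1)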


def find_nearest_neighbors(query: np.ndarray,
                           vectors: np.ndarray,
                           k: int = 1) -> np.ndarray:
    """
    Find k-nearest neighbors of a query vector.

    Parameters
    ----------
    query : np.ndarray
        Query vector.
    vectors : np.ndarray
        Vectors to search.
    k : int, optional
        Number of nearest neighbors to return, by default 1.

    Returns
    -------
    np.ndarray
        Indices of the `k` nearest neighbors of `query` in `vectors`, closest first.
    """
    distances = euclidean_distance(query, vectors)
    return np.argsort(distances)[:k]
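
To check the search logic without loading a model, the same two functions can be run on a tiny hand-made dataset. The 2-D points below are invented for illustration, and the function bodies are condensed copies of the ones above; treat it as a sanity check, not as part of the pipeline.

```python
import numpy as np


def euclidean_distance(v1, v2):
    # Distance over the last axis; broadcasting handles a 2-D `v2`
    return np.linalg.norm(v1 - v2, axis=-1)


def find_nearest_neighbors(query, vectors, k=1):
    # Indices of the k smallest distances, closest first
    return np.argsort(euclidean_distance(query, vectors))[:k]


# Hypothetical 2-D "embeddings"
toy_vectors = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]])
toy_query = np.array([1.0, 1.0])

toy_distances = euclidean_distance(toy_query, toy_vectors)
toy_indices = find_nearest_neighbors(toy_query, toy_vectors, k=2)

print(toy_distances.shape)  # (4,)
print(toy_indices)          # [1 3]
```

Note that the function returns row indices, which is why the notebook looks the quotes up with `rick_and_morty_quotes[i]` afterwards.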

In [11]:
indices = find_nearest_neighbors(query_embedding, embeddings, k=3)

In [12]:
for i in indices:
    print(rick_and_morty_quotes[i])

You're not the cause of your parents' misery. You're just a symptom of it.

Having a family doesn't mean that you stop being an individual. You know the best thing you can do for the people that depend on you? Be honest with them, even if it means setting them free.

B—h, my generation gets traumatized for breakfast.

