# First Sentence Embeddings
In this notebook, we try out our first sentence embeddings.


Install the Hugging Face sentence_transformers package and all its dependencies.

In [None]:
!pip install sentence_transformers

Import the SentenceTransformer class from the sentence_transformers library (https://www.sbert.net/).

In [1]:
from sentence_transformers import SentenceTransformer

Create an instance of the model Snowflake/snowflake-arctic-embed-xs.

The model is hosted on Hugging Face: https://huggingface.co/Snowflake/snowflake-arctic-embed-xs. On first use, the model files are downloaded and cached locally.


In [2]:
model = SentenceTransformer("Snowflake/snowflake-arctic-embed-xs")

Set up two lists of strings:
- two query strings
- two very short document strings

In [3]:
queries = ['what is snowflake?', 'Where can I get the best tacos?']
documents = ['The Data Cloud!', 'Mexico City of Course!']

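Generate the embeddings for all strings and store them in two new arrays.

The prompt_name parameter selects one of the prompts configured for the model, here the one named "query". Only the queries are encoded with it; the documents are encoded without a prompt. The prompts dictionary is documented in the source code: https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/SentenceTransformer.py#L274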
<pre>
prompts (Dict[str, str], optional): A dictionary with prompts for the model. The key is the prompt name, the value is the prompt text.
            The prompt text will be prepended before any text to encode. For example:
            `{"query": "query: ", "passage": "passage: "}` or `{"clustering": "Identify the main category based on the
            titles in "}`
</pre>


In [4]:
query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)
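As the docstring above describes, a prompt is simply text that is prepended to each input before encoding. The following is a minimal sketch of that idea in plain Python. The helper apply_prompt is hypothetical and the prompt texts are taken from the docstring's own example; neither is the library's implementation nor the prompt this model actually uses.

```python
from typing import Dict, List, Optional


def apply_prompt(
    texts: List[str],
    prompts: Dict[str, str],
    prompt_name: Optional[str] = None,
) -> List[str]:
    """Prepend the named prompt to every text (hypothetical helper for illustration)."""
    if prompt_name is None:
        # No prompt requested: the texts are encoded unchanged
        return list(texts)
    prompt = prompts[prompt_name]  # raises KeyError for unknown prompt names
    return [prompt + text for text in texts]


# Assumed example prompts, copied from the docstring's example dictionary
example_prompts = {"query": "query: ", "passage": "passage: "}

prompted = apply_prompt(["what is snowflake?"], example_prompts, prompt_name="query")
print(prompted)
```

In recent versions of sentence_transformers, the prompts configured for a loaded model should be available as model.prompts, which is one way to check what text the "query" prompt actually adds.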

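Find out the dimension of an embedding, then compute the scores:

- The @ operator performs matrix multiplication in Python.
- T (as in document_embeddings.T) transposes the matrix document_embeddings so that the shapes match: a (2, 384) matrix times a (384, 2) matrix yields a 2 x 2 matrix of scores.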

In [5]:
len(query_embeddings[0])

384

In [6]:
scores = query_embeddings @ document_embeddings.T

In [7]:
scores

array([[0.57515126, 0.45798576],
       [0.50448984, 0.563602  ]], dtype=float32)
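To make the matrix product concrete, here is a small NumPy sketch with made-up 3-dimensional vectors standing in for the real 384-dimensional embeddings. All numbers are invented for illustration. The sketch also shows that for unit-length rows the dot product equals the cosine similarity; whether the model's embeddings are already normalized is an assumption that can be checked with np.linalg.norm.

```python
import numpy as np

# Made-up "embeddings": 2 queries and 2 documents with 3 dimensions each
toy_queries = np.array([[1.0, 0.0, 0.0],
                        [0.0, 1.0, 0.0]])
toy_documents = np.array([[0.6, 0.8, 0.0],
                          [0.0, 0.6, 0.8]])

# (2, 3) @ (3, 2) -> (2, 2): entry [i, j] is the dot product of query i and document j
toy_scores = toy_queries @ toy_documents.T
print(toy_scores.shape)
print(toy_scores)


def cosine_similarity_matrix(a, b):
    """Normalize rows to unit length, then take all pairwise dot products."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


# The toy rows already have length 1, so both results agree
print(np.allclose(toy_scores, cosine_similarity_matrix(toy_queries, toy_documents)))
```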

Output the results:

Each query embedding is scored against each document embedding.

For each query, the documents are printed in descending order of score, so all four scores are displayed.

In [8]:
for query, query_scores in zip(queries, scores):
    doc_score_pairs = list(zip(documents, query_scores))
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
    # Output passages & scores
    print("")
    print("Query:", query)
    for document, score in doc_score_pairs:
        print(score, document)


Query: what is snowflake?
0.57515126 The Data Cloud!
0.45798576 Mexico City of Course!

Query: Where can I get the best tacos?
0.563602 Mexico City of Course!
0.50448984 The Data Cloud!
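If only the best match per query is needed, the sorting loop can be replaced by an argmax over each row of the score matrix. The sketch below copies the score values shown above into a literal array so that it runs without loading the model; the names best_indices and best_matches are introduced here for illustration.

```python
import numpy as np

queries = ['what is snowflake?', 'Where can I get the best tacos?']
documents = ['The Data Cloud!', 'Mexico City of Course!']

# Score values copied from the output above (rows: queries, columns: documents)
scores = np.array([[0.57515126, 0.45798576],
                   [0.50448984, 0.563602]], dtype=np.float32)

# Index of the highest-scoring document for each query
best_indices = scores.argmax(axis=1)
best_matches = [(query, documents[i]) for query, i in zip(queries, best_indices)]

for query, document in best_matches:
    print(query, "->", document)
```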


#### Play with other queries and documents:

Repeat the steps above (encode, score, rank) with the new texts.

In [None]:
queries = ['what is snowflake?', 'Where can I get the best tacos?']
documents = ['The Data Cloud!', 'Mexico City of Course!', 'a software company', 'snow in winter']


In [None]:
queries = ['Wie heisst die Hauptstadt der Schweiz?', 'Wie heisst die Hauptstadt Frankreichs', 'Wo isst man die besten Tacos?']
documents = ['Bern ist eine schöne Stadt an der Aare', 'Mexico City of Course!', 
             'Viele Banken sind in Zürich', 'Die UNO ist in Genf', 'Paris ist die Hauptstadt Frankreichs', 
             'In Lyon fliessen zwei Flüsse zusammen']
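Since the same steps are repeated for every new set of texts, they can be wrapped in a function. This is one possible sketch, not part of the original notebook: the name rank_documents is hypothetical, and it assumes the model object offers encode(texts, prompt_name=...) as used earlier.

```python
import numpy as np


def rank_documents(model, queries, documents, prompt_name="query"):
    """Return, for each query, its documents sorted by descending score.

    Hypothetical helper: `model` is assumed to provide the same
    encode(...) method as the SentenceTransformer instance used above.
    """
    query_embeddings = np.asarray(model.encode(queries, prompt_name=prompt_name))
    document_embeddings = np.asarray(model.encode(documents))

    # One row per query, one column per document
    scores = query_embeddings @ document_embeddings.T

    results = {}
    for query, query_scores in zip(queries, scores):
        order = np.argsort(-query_scores)  # indices from highest to lowest score
        results[query] = [(documents[i], float(query_scores[i])) for i in order]
    return results
```

In the notebook, it could be called as rank_documents(model, queries, documents) after running one of the cells above, and the returned dictionary printed query by query.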