### Text Similarity

This notebook will lead you through:

1. Downloading the Jina AI text embedding model `jina-embedding-t-en-v1` from Hugging Face using the `sentence_transformers` Python library.
2. Passing English-language texts to the model and retrieving embeddings for them.
3. Measuring the cosines between embeddings to check that they match intuitions about text similarity.

If you are running this notebook in an environment that may not have all the prerequisites installed, run the cell below first. It installs any of the necessary libraries that are missing:

In [None]:
!pip install numpy torch timm sentence_transformers

Import the `SentenceTransformer` class:

In [None]:
from sentence_transformers import SentenceTransformer

Next, download the `jina-embedding-t-en-v1` model and store it (along with its accompanying Python class) in the variable `model`. Be patient: this may take several minutes the first time you do it, but the model is then cached on your local system for the next time.

In [None]:
model = SentenceTransformer('jinaai/jina-embedding-t-en-v1')

You can get embeddings one at a time, using the `SentenceTransformer.encode()` method.

In [None]:
embedding_car = model.encode("This is a car")

Or you can get several embeddings at once by using batching:

In [None]:
sentences = ["This is a car", "This is a truck", "This is an airplane", "This is a dog"]

# The unpacking assignment below is functionally equivalent, in Python, to:
# embeddings_list = model.encode(sentences)
# embedding_car = embeddings_list[0]
# embedding_truck = embeddings_list[1]
# embedding_airplane = embeddings_list[2]
# embedding_dog = embeddings_list[3]

embedding_car, embedding_truck, embedding_airplane, embedding_dog = model.encode(sentences)

You can see that each embedding has 312 dimensions:

In [None]:
len(embedding_car)

And if you inspect an embedding directly, you see that it is just a vector: a NumPy array of numbers!

In [None]:
embedding_car

Define the cosine function over pairs of vectors, so we can compare embeddings:

In [None]:
from numpy import dot
from numpy.linalg import norm

def cosine(a, b):
    # Dot product of the two vectors divided by the product of their lengths.
    return dot(a, b) / (norm(a) * norm(b))
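
As a quick sanity check, the same formula can be tried on small vectors whose cosines are easy to work out by hand. The sketch below redefines `cosine` so that it runs on its own; the 2-D vectors and their names are made up for illustration and are not model output.

```python
import numpy as np

def cosine(a, b):
    # Dot product divided by the product of the two vector lengths.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 2-D vectors, invented for illustration (not embeddings from the model).
east = np.array([1.0, 0.0])
far_east = np.array([3.0, 0.0])  # same direction as east, different length
north = np.array([0.0, 1.0])     # perpendicular to east
west = np.array([-1.0, 0.0])     # opposite direction to east

same_direction = cosine(east, far_east)
perpendicular = cosine(east, north)
opposite = cosine(east, west)
print(same_direction, perpendicular, opposite)
```

Vectors pointing the same way give 1.0 regardless of their length, perpendicular vectors give 0.0, and opposite vectors give -1.0.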

Now, we can do pairwise comparisons between the embedding vectors. Remember, the closer the cosine is to 1.0, the more similar the two embeddings are.

For example, `embedding_car` and `embedding_truck`:

In [None]:
cosine(embedding_car, embedding_truck)

This is a higher cosine than between `embedding_car` and `embedding_airplane`:

In [None]:
cosine(embedding_car, embedding_airplane)

And much higher than between `embedding_car` and `embedding_dog`:

In [None]:
cosine(embedding_car, embedding_dog)
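
The three comparisons above amount to ranking the other sentences by their similarity to "This is a car". A minimal sketch of that ranking follows. The helper `rank_by_cosine` is hypothetical, and the 3-D vectors are invented stand-ins so the cell runs without the model; in the notebook you would pass `embedding_car` as the query and the other embeddings as candidates.

```python
import numpy as np

def rank_by_cosine(query, candidates):
    """Return (label, cosine) pairs sorted from most to least similar."""
    q = np.asarray(query, dtype=float)
    scored = []
    for label, vec in candidates.items():
        v = np.asarray(vec, dtype=float)
        score = float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
        scored.append((label, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Stand-in vectors, invented for illustration (not real embeddings).
toy_query = [1.0, 0.2, 0.0]
toy_candidates = {
    "truck": [0.9, 0.3, 0.1],
    "airplane": [0.5, 0.5, 0.5],
    "dog": [0.0, 0.1, 1.0],
}

ranking = rank_by_cosine(toy_query, toy_candidates)
for label, score in ranking:
    print(f"{label}: {score:.3f}")
```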

Now, let’s create an additional embedding:

In [None]:
embedding_automobile = model.encode("This is an automobile")

We would expect `embedding_car` and `embedding_automobile` to be nearly the same vector, since the two sentences mean more or less the same thing. This should be reflected in a very high cosine, close to 1.0:

In [None]:
cosine(embedding_car, embedding_automobile)

Try it with your own texts to see if the results match your intuitions about semantic similarity. 
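
When you compare many texts, computing every pair one call at a time gets tedious. One possible approach, sketched here, is to scale each embedding to unit length and take a single matrix product, which yields all pairwise cosines at once. The helper `cosine_matrix` is hypothetical and the rows are toy values; it assumes a 2-D array with one embedding per row, which is the shape you would get from encoding a list of sentences.

```python
import numpy as np

def cosine_matrix(embeddings):
    """Cosine of every row against every other row of a 2-D array."""
    matrix = np.asarray(embeddings, dtype=float)
    # Scale each row to unit length; the dot product of unit vectors is their cosine.
    unit_rows = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return unit_rows @ unit_rows.T

# Stand-in rows, invented for illustration (not real embeddings).
toy_embeddings = [
    [1.0, 0.0],
    [2.0, 0.0],
    [0.0, 3.0],
]

similarities = cosine_matrix(toy_embeddings)
print(np.round(similarities, 3))
```

Entry `[i, j]` of the result is the cosine between row `i` and row `j`, so the diagonal is always 1.0 and the matrix is symmetric.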

The maximum input size of the `jina-embedding-t-en-v1` model is 512 tokens. That is not the same as 512 words, because the tokenizer does not map tokens one-to-one to words, but it is a good rough estimate. This is a relatively small model, so its input length is limited.

If you pass in more text than that, the model automatically truncates it to the maximum length it accepts.
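
If truncation would drop text you care about, one workaround is to split a long text into smaller pieces and embed each piece separately. The sketch below is only an illustration under stated assumptions: `split_into_chunks` is a hypothetical helper, it counts whitespace-separated words as a rough proxy for tokens, and the budget of 300 words is an arbitrary conservative choice below the 512-token limit, not a value taken from the model.

```python
def split_into_chunks(text, max_words=300):
    """Split text into consecutive chunks of at most max_words words each."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]

# Hypothetical long input: 700 repeated words.
long_text = "word " * 700
chunks = split_into_chunks(long_text, max_words=300)
print([len(chunk.split()) for chunk in chunks])
```

Each chunk could then be embedded on its own. Because words and tokens do not match one-to-one, a chunk that fits the word budget may still exceed the token limit for unusual text.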