## Summary

A [semantic text search](https://www.algolia.com/blog/ai/the-past-present-and-future-of-semantic-search/) means that instead of simply looking for the exact keywords from the question, we look for the text whose meaning is most similar to the question, measured across a number of dimensions.

### Keyword vs. Semantic Search
In a keyword search, we look for the exact keywords and return the results that match those keywords. With a semantic text search, we look for the text that is most similar to the question across a number of dimensions.

### Measuring Vector Similarity
If semantic search means finding the most similar text, how do we measure that similarity? We need to define a measurement of "distance", and whichever piece of text has the shortest distance is the most similar.

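- 1D Space
  - simple subtraction (the absolute difference between two numbers)
- 2D Space
  - Manhattan (the sum of the absolute differences along each axis)
  - Euclidean (the straight-line distance between the two points)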
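
The distance measures listed above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names are made up for this example and are not part of any library used in the tutorial.

```python
import math

def distance_1d(a, b):
    """1D: simple subtraction, taken as an absolute value."""
    return abs(a - b)

def manhattan_distance(p, q):
    """Sum of the absolute differences along each axis."""
    return sum(abs(x - y) for x, y in zip(p, q))

def euclidean_distance(p, q):
    """Straight-line distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

print(distance_1d(2, 7))                    # 5
print(manhattan_distance((0, 0), (3, 4)))   # 7
print(euclidean_distance((0, 0), (3, 4)))   # 5.0
```

Both 2D functions also work unchanged for points with more than two coordinates, since they simply loop over every axis.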

### OpenAI's Recommendation: Cosine Similarity
[OpenAI's documentation](https://platform.openai.com/docs/guides/embeddings/which-distance-function-should-i-use) states that:

*We recommend cosine similarity. The choice of distance function typically doesn’t matter much.*

So for the purposes of this tutorial, we'll use cosine similarity as recommended.

### What is Cosine Similarity?
We'll be using tools to calculate cosine similarity, but if you are interested in the details, the formula for cosine similarity is the dot product of the two vectors divided by the product of their Euclidean magnitudes.

![](./img/img43.png)

You can read more about the derivation of the cosine similarity formula here: [Wikipedia: Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity).
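
For readers who prefer code to formulas, here is a minimal sketch of that calculation in plain Python. The function name `cosine_similarity` and the zero-vector check are choices made for this example; in practice a library routine would normally do this work.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    magnitude_a = math.sqrt(sum(x * x for x in a))
    magnitude_b = math.sqrt(sum(y * y for y in b))
    if magnitude_a == 0 or magnitude_b == 0:
        # The formula divides by the magnitudes, so a zero vector has no defined result.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (magnitude_a * magnitude_b)

# The second vector is the first one scaled by 2, so the result is approximately 1.0.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```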

The important thing to remember is that cosine similarity ranges from -1 to 1:

- 1 means the vectors point in exactly the same direction
- -1 means the vectors point in opposite directions
- 0 means the vectors are orthogonal to each other, which means that they are unrelated

![](./img/img44.png)
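
The three landmark values can be checked with small 2D vectors. This is an illustrative sketch only: the vectors are hand-picked rather than real embeddings, and the compact helper relies on `math.hypot` accepting any number of coordinates, which assumes Python 3.8 or later.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the two magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

same_direction = cosine_similarity([1, 0], [2, 0])   # same direction, different length
opposite = cosine_similarity([1, 0], [-1, 0])        # pointing in opposite directions
orthogonal = cosine_similarity([1, 0], [0, 1])       # at right angles

print(same_direction, opposite, orthogonal)  # 1.0 -1.0 0.0
```

Note that the first pair scores 1.0 even though the vectors have different lengths: cosine similarity depends only on the angle between the vectors.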

### Cosine Distance
We need to take one more step before we are ready to implement a semantic text search in Python. We want to find the text that is the shortest distance from our query, but cosine similarity is a similarity score rather than a distance: a larger value means the texts are closer together, not farther apart.

Instead we’ll calculate **cosine distance**.

    cosine distance = 1 - cosine similarity

If you want details about the math, including why cosine distance does not satisfy every property of a formal distance metric, read this article on [cosine distance](https://en.wikipedia.org/wiki/Cosine_similarity#Cosine_Distance).

Sorting by cosine distance works for any kind of data that can be vectorized, and it works especially well for multimedia data (text, images, videos) that produces vectors with many dimensions.
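
To make the sorting step concrete, here is a small sketch that ranks a few items by cosine distance from a query. The three-dimensional vectors and their labels are invented for illustration; real embeddings come from an embedding model and have far more dimensions. The helper assumes Python 3.8 or later for the multi-argument `math.hypot`.

```python
import math

def cosine_distance(a, b):
    # cosine distance = 1 - cosine similarity
    dot = sum(x * y for x, y in zip(a, b))
    return 1 - dot / (math.hypot(*a) * math.hypot(*b))

# Hypothetical toy vectors, NOT real embeddings.
documents = {
    "cats": [0.9, 0.1, 0.0],
    "dogs": [0.7, 0.3, 0.1],
    "stock market": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]

# Smallest distance first, so the most similar item leads the list.
ranked = sorted(documents, key=lambda name: cosine_distance(query, documents[name]))
print(ranked)  # ['cats', 'dogs', 'stock market']
```

Sorting in ascending order is what makes the distance form convenient: the first result is always the closest match.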
