# Text Similarity

## Recap

* Semantically similar texts are embedded closer together in the vector space
* Measuring the distance between embeddings lets us measure similarity
* This enables embedding applications:
    * Semantic search
    * Recommendations
    * Classification

## Measuring Similarity

### Cosine Distance

* Ranges from 0 to 2
* Smaller numbers = Greater Similarity

```
from scipy.spatial import distance

distance.cosine([0,1],[1,0])
```
Result: `1.0` (the two vectors are perpendicular, so the distance sits in the middle of the 0 to 2 range)
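
To make the 0 to 2 range concrete, here is a minimal sketch that computes cosine distance by hand as one minus the cosine of the angle between two vectors. The helper name `cosine_distance` and the toy 2-D vectors are illustrative assumptions, not part of the original example.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine of the angle between a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy vectors chosen to hit the two ends and the middle of the range
same = cosine_distance([0, 1], [0, 2])        # same direction: distance 0
orthogonal = cosine_distance([0, 1], [1, 0])  # perpendicular: distance 1
opposite = cosine_distance([0, 1], [0, -1])   # opposite direction: distance 2

print(same, orthogonal, opposite)
```

Note that vector length does not matter here: `[0, 1]` and `[0, 2]` point the same way, so their distance is 0.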


![](images/cosine_distance.png)

## Example: Comparing headline similarity

```
from utils import *

articles = get_articles()
articles_embedded = embed_articles(articles)

print(articles_embedded)
```
Result (truncated):

```
[{'headline': 'Economic Growth Continues Amid Global Uncertainty', 'topic': 'Business', 'embedding': [0.00631048996001482, 0.008328129537403584, 0.05450693145394325, 0.04758930951356888, 0.02627224288880825, ...
```

```
new_text = ["Python is the best!", "R is the best!"]
new_embedding = create_embeddings(new_text)

print(new_embedding)
```
Result (truncated):

```
[[-0.03197936713695526, -0.013423292897641659, -0.030844006687402725, 0.024552207440137863, 0.005206699948757887, ...
```

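```
from scipy.spatial import distance
import numpy as np

search_text = "computer"
# create_embeddings returns a list, so take the first (and only) embedding
search_embedding = create_embeddings(search_text)[0]

distances = []

# Compare the query against each embedded article
for article in articles_embedded:
    dist = distance.cosine(search_embedding, article['embedding'])
    distances.append(dist)

# The smallest distance marks the most similar headline
min_dist_ind = np.argmin(distances)
print(articles_embedded[min_dist_ind]['headline'])
```
Result:

```
Tech Giant Buys 49% Stake In AI Startup
```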
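
The search above returns only the single closest headline. A possible extension, sketched here under stated assumptions, ranks all articles and returns the `k` nearest ones. The helpers `cosine_distances` and `top_k`, and the toy 2-D articles with placeholder headlines, are hypothetical and invented for illustration; real embeddings have far more dimensions.

```python
import numpy as np

def cosine_distances(query, matrix):
    """Cosine distance from one query vector to every row of matrix."""
    query = np.asarray(query, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    sims = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    return 1.0 - sims

def top_k(query, items, k=3):
    """Return (headline, distance) pairs for the k closest items, nearest first."""
    dists = cosine_distances(query, [item["embedding"] for item in items])
    order = np.argsort(dists)[:k]  # indices sorted by ascending distance
    return [(items[i]["headline"], float(dists[i])) for i in order]

# Toy data (assumption): placeholder headlines with made-up 2-D embeddings
toy_articles = [
    {"headline": "A", "embedding": [1.0, 0.0]},
    {"headline": "B", "embedding": [0.0, 1.0]},
    {"headline": "C", "embedding": [1.0, 1.0]},
]

results = top_k([1.0, 0.1], toy_articles, k=2)
for headline, dist in results:
    print(f"{headline}: {dist:.3f}")
```

Sorting with `np.argsort` instead of taking `np.argmin` keeps the same distance computation but exposes the runner-up matches, which is often what a semantic search or recommendation feature needs.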
