# Similarities

The similarity methods are always available, but the small models do not ship with word vectors, so their similarity scores are based only on tagger, parser and NER information, and spaCy warns that they may not be useful. Below we print the similarity score, the length of the vector on a token, and the length of the vector on that token's lexeme. The last one fails for the small model.

In [1]:
import spacy

In [2]:
nlp_sm = spacy.load("en_core_web_sm")
nlp_lg = spacy.load("en_core_web_lg")

In [3]:
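def similarity_test(nlp, text, n1, n2):
    """Print the similarity between the tokens at positions n1 and n2."""
    doc = nlp(text)
    print('similarity:', doc[n1].similarity(doc[n2]))

def vector_test(nlp, text):
    """Print the vector length of the first token and of its lexeme."""
    doc = nlp(text)
    print('size of vector on token:     ', len(doc[0].vector))
    try:
        # The lexeme vector lookup raises ValueError when the model has no word vectors
        print('size of vector on token.lex: ', len(doc[0].lex.vector))
    except ValueError as e:
        print(e)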

In [4]:
text = "comparing apples and oranges"

In [5]:
# Print the similarity score and the vector sizes using the small model

similarity_test(nlp_sm, text, 1, 3)
vector_test(nlp_sm, text)

similarity: 0.6067285
size of vector on token:      96
[E010] Word vectors set to length 0. This may be because you don't have a model installed or loaded, or because your model doesn't include word vectors. For more info, see the docs:
https://spacy.io/usage/models



In [6]:
# Print the similarity score and the vector sizes using the large model

similarity_test(nlp_lg, text, 1, 3)
vector_test(nlp_lg, text)

similarity: 0.77809423
size of vector on token:      300
size of vector on token.lex:  300
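
The scores above are cosine similarities between vectors, which is spaCy's default similarity estimate. As a minimal sketch of that calculation, independent of spaCy, the snippet below computes the cosine by hand. The function name `cosine_similarity` and the three-dimensional vectors are made up for illustration; real `en_core_web_lg` vectors have 300 dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # A zero vector has no direction; return 0.0 instead of dividing by zero
        return 0.0
    return dot / (norm_u * norm_v)

# Made-up vectors, for illustration only (not taken from any spaCy model)
apples = [1.0, 2.0, 0.5]
oranges = [0.9, 1.8, 1.0]
print('similarity:', cosine_similarity(apples, oranges))
```

Vectors pointing in the same direction score 1.0 and orthogonal vectors score 0.0, regardless of their lengths.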


Documents and spans have vectors too.

In [7]:
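def compare_docs(nlp, text1, text2):
    """Print the similarity between two whole documents."""
    doc1 = nlp(text1)
    doc2 = nlp(text2)
    print(doc1.similarity(doc2))

def compare_spans(nlp, text1, text2, start, end):
    """Print the similarity between the same token slice of two texts."""
    span1 = nlp(text1)[start:end]
    span2 = nlp(text2)[start:end]
    print(span1.similarity(span2))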

In [8]:
compare_docs(nlp_lg, "We are eating pizza", "We are eating paella")

0.9040502655654258


In [9]:
compare_spans(nlp_lg, "We are eating pizza", "We are eating paella", 0, 3)

1.0
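
By default, the vector of a document or span is the average of its token vectors. The sketch below reproduces the pattern of the two results above with plain Python: the full sentences differ only in their last word, so their averaged vectors are close but not identical, while the first three tokens are the same and therefore average to the same vector. The two-dimensional `toy_vectors` and the helper names are invented for illustration and have nothing to do with the real model's values.

```python
import math

def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 2-dimensional "word vectors", for illustration only
toy_vectors = {
    "We": [0.1, 0.3],
    "are": [0.2, 0.1],
    "eating": [0.7, 0.4],
    "pizza": [0.9, 0.8],
    "paella": [0.8, 0.9],
}

def text_vector(words):
    """Average the toy vectors of the given words."""
    return mean_vector([toy_vectors[w] for w in words])

s1 = ["We", "are", "eating", "pizza"]
s2 = ["We", "are", "eating", "paella"]

full = cosine_similarity(text_vector(s1), text_vector(s2))
prefix = cosine_similarity(text_vector(s1[:3]), text_vector(s2[:3]))
print('full sentences:    ', full)
print('first three tokens:', prefix)
```

The first value is high but below 1, and the second is 1 up to floating-point rounding, which mirrors the document and span comparisons above.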


In [10]:
# No warning here: the similarity function first compares the token texts and
# returns 1.0 for identical spans before any vectors are consulted
compare_spans(nlp_sm, "We are eating pizza", "We are eating paella", 0, 3)

1.0
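
The text check that explains this result can be sketched as follows. This is an illustration of the idea only, not spaCy's source code: `similarity_with_shortcut` and `fail_if_called` are hypothetical names, and the `vector_similarity` argument stands in for whatever vector-based comparison would otherwise run.

```python
def similarity_with_shortcut(tokens1, tokens2, vector_similarity):
    """Return 1.0 for identical token sequences, else defer to vectors.

    `vector_similarity` is a hypothetical callable standing in for the
    vector-based comparison; it is only called when the texts differ.
    """
    same_length = len(tokens1) == len(tokens2)
    if same_length and all(a == b for a, b in zip(tokens1, tokens2)):
        return 1.0
    return vector_similarity(tokens1, tokens2)

def fail_if_called(tokens1, tokens2):
    """Stand-in that proves the vector path was never taken."""
    raise RuntimeError("vectors were consulted")

span1 = ["We", "are", "eating"]
span2 = ["We", "are", "eating"]
print(similarity_with_shortcut(span1, span2, fail_if_called))
```

Because the two token lists match, the stand-in is never called, which is why a model without word vectors can still answer this particular comparison without complaint.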


In [11]:
# Noun chunks are Span objects, so they can be compared in the same way
nlp = spacy.load("en_core_web_sm")
for nc in nlp("We are eating pizza").noun_chunks:
    print(type(nc), nc)

<class 'spacy.tokens.span.Span'> We
<class 'spacy.tokens.span.Span'> pizza


In [12]:
nlp_lg("Fido barks.").vector

array([-1.07563362e-02,  3.33379984e-01, -3.28269988e-01, -5.88053286e-01,
        3.62993330e-01,  2.02390000e-01,  1.34178683e-01, -2.58746654e-01,
       -1.32898003e-01,  3.90106648e-01, -1.44556671e-01,  4.27457958e-01,
       -1.66819990e-01, -1.20200336e-01, -1.77853659e-01, -5.72146773e-02,
        2.89449006e-01,  1.45733312e-01,  9.77416709e-02, -3.44655663e-01,
       -2.71650016e-01,  3.74561340e-01,  2.62319326e-01,  2.68040001e-02,
       -3.74459922e-02, -1.40609995e-01, -3.32466692e-01, -3.50136645e-02,
        1.43150330e-01, -1.68436036e-01, -1.11395337e-01, -4.99066599e-02,
       -6.44636676e-02,  2.66300350e-01, -3.86333466e-03, -8.14373270e-02,
        3.79196644e-01, -1.44066676e-01, -3.57766636e-02,  2.55699337e-01,
        3.49776983e-01, -7.86300004e-02,  1.81003332e-01, -3.06970000e-01,
       -9.42760035e-02,  3.29153299e-01, -1.45003334e-01, -7.31186643e-02,
       -1.73419669e-01,  7.76770040e-02,  6.54873326e-02, -5.40900230e-03,
        1.55579999e-01, -