## Similarities
Most models add vectors to each token (tok2vec, or transformer-based for the _trf model),
which are word embeddings, so should compare well for words with similar meaning.

This assists tasks like calculating similarity of larger chunks of text as well.

In [None]:
import spacy

english_lg  = spacy.load('en_core_web_lg')   

In [7]:
# You can generally compare any token/span (spans/sentence/docs will act as their average) using .similarity(). 
# 
# There are some fundamental limitations to this to keep in mind, like 
# - that it does not consider ordering, just words' presence
# - how volatile the meaning of short sentences may be
# - how function words dilute larger-span vectors (and might make them compare well for non-contentful reasons)
# - 'static vectors' basically means a word has the same vector in all contexts.

for one, other in (
        ('ducks are great', 'cats are nice'),
        ('ducks are great', 'goats are cool'),
        ('ducks are great', 'Forks are spoons'),
        ('ducks are great', 'Forks and spoons'),
        ('ducks are great', 'Forks and spoons and knives'),
        ('ducks are great', 'Forks and spoons are knives'),
        ('ducks and blah and blah and blah', 'blue and bleh and bleh and bleh'),
        ('ducks',           'blue'),
    ):
    sim = english_lg( one ).similarity(english_lg( other ))
    print( "%.3f   %50s %50s"%( sim, one, other ))

# These three-world phrases are actually quite contrived - real sentences have a narrower range of 

print()
text = """Because it is smaller, the Moon has less gravity than Earth (only 1/6 of the amount on Earth). So if a person weighs 60 kilograms on Earth, the person would only weigh 10 kilograms on the moon.[nb 2] But even though the Moon's gravity is weaker than the Earth's gravity, it is still there. If a person dropped a ball while standing on the moon, it would still fall down. However, it would fall much more slowly. A person who jumped as high as possible on the moon would jump higher than on Earth, but still fall back to the ground.   Rome ceased to be the capital from the time of the division. In 286, the capital of the Western Roman Empire became Mediolanum (now Milan). In 402, the capital was again moved, this time to Ravenna. In AD 398, Alaric led the Visigoths and began making attacks closer and closer to the capital. By 410, he had sacked the Rome. In 455, the Vandals captured the city. In 476, the Goths captured the capital """
doc = english_lg( text )
sents = list(doc.sents)
for i in range(len(sents)-1):
    one, other = sents[i], sents[i+1] 
    sim = one.similarity( other )
    print( "%.3f  [%50s]  x  [%50s]"%( sim, one, other ))

0.869                                      ducks are great                                      cats are nice
0.872                                      ducks are great                                     goats are cool
0.796                                      ducks are great                                   Forks are spoons
0.383                                      ducks are great                                   Forks and spoons
0.450                                      ducks are great                        Forks and spoons and knives
0.761                                      ducks are great                        Forks and spoons are knives
0.868                     ducks and blah and blah and blah                    blue and bleh and bleh and bleh
0.218                                                ducks                                               blue

0.701  [Because it is smaller, the Moon has less gravity than Earth (only 1/6 of the amount on Earth).]  x  [So if a pe


Spacy has, however, made things a bit complex.
- _sm models don't have vectors.  While similarity() still does something useful at all, you should assume this is extremely basic.
- _md and _lg  models tend to have static word vectors, meaning a word will receive the same vector in all contexts (e.g. for english it's the GroVe )
- _trf do context sensitive embeddings (and put those values in a different place)

This means that similarity() doesn't just pick up vectors -- its implementation will easily vary between models.

It also means you should _not_ assume you can use the vectors directly, 
though you can get away with it if you stick to one model.


Note that
- scores on spans and docs act as the average of their compobnents
- ...which also means e.g. function words can dilute larger-span vectors (and might make them compare well for non-contentful reasons)
- (...so...) similarity() does not consider ordering, just words' presence
- shorter sentences have minimal and more volative meaning

In [6]:
#!pip3 install spacy

!python3 -m spacy download en_core_web_trf   # works better, but can be rather slow without GPU properly set up

In [2]:
import spacy

english_trf = spacy.load('en_core_web_trf')



In [3]:
# while _trf can do contextual word embedding rather than static word embedding, this is not placed in .vector.tensors
# You could fish out the tensors like   toks_vectors, doc_vector = doc._.trf_data.tensors
#   but it's handier to augent spacy to make similarity work with transformer tensors - custom pipeline element defined in our helper module

# this mentioned tensor2attr is not basic spacy, it exists because it is defined in our helpers_spacy
if not english_trf.has_pipe('tensor2attr'):
    print("adding transformer based similarity")
    english_trf.add_pipe('tensor2attr')

adding transformer based similarity


ValueError: [E002] Can't find factory for 'tensor2attr' for language English (en). This usually happens when spaCy calls `nlp.create_pipe` with a custom component name that's not registered on the current language class. If you're using a Transformer, make sure to install 'spacy-transformers'. If you're using a custom component, make sure you've added the decorator `@Language.component` (for function components) or `@Language.factory` (for class components).

Available factories: attribute_ruler, tok2vec, merge_noun_chunks, merge_entities, merge_subtokens, token_splitter, doc_cleaner, parser, beam_parser, lemmatizer, trainable_lemmatizer, entity_linker, ner, beam_ner, entity_ruler, tagger, morphologizer, senter, sentencizer, textcat, spancat, future_entity_ruler, span_ruler, textcat_multilabel, en.lemmatizer

In [1]:
import spacy.tokens
spacy.tokens.Doc

ImportError: No module named spacy.tokens

  sim = english_trf( one ).similarity(english_trf( other ))
  sim = english_trf( one ).similarity(english_trf( other ))


0.000                                      ducks are great                                      cats are nice
0.000                                      ducks are great                                     goats are cool
0.000                                      ducks are great                                   Forks are spoons
0.000                                      ducks are great                                   Forks and spoons
0.000                                      ducks are great                        Forks and spoons and knives
0.000                                      ducks are great                        Forks and spoons are knives
0.000                     ducks and blah and blah and blah                    blue and bleh and bleh and bleh
0.000                                                ducks                                               blue

0.000  [Because it is smaller, the Moon has less gravity than Earth (only 1/6 of the amount on Earth).] x [So if a pers

  sim = one.similarity( other )
  sim = one.similarity( other )


In [None]:
doc = english_trf("The bank stores investment capital in Paris, the capital of France")
first_capital, second_capital = doc[4], doc[9]
print( first_capital, second_capital, round( first_capital.similarity(second_capital), 2) ) 
# without contextual word embeddings, both capitals would be the same, and their similarity 1.0

In [None]:
# figure out what word it's close to
money  = english_trf("money")[0]
city   = english_trf("city")[0]
paris  = english_trf("lyon")[0]

print ('              ','money', 'city', 'lyon')
for in_sent, example in ( (' bank_capital ',first), ('city_capital',second) ):
    print( '%15s %4.2f %4.2f %4.2f'%(in_sent, round(example.similarity(money), 2), round( example.similarity(city), 2), round( example.similarity(paris), 2)) )

In [None]:
import numpy
#vocablist = list(english_lg.vocab.strings)
vvv = []
i=0
for string in list(english_lg.vocab.strings)[::10]:
    if len(string)<4:
        continue
    n = english_lg(string)[0].norm
    if n > 800000000000000000:
        continue

    print( n, string, english_lg.vocab.strings[string])
    
    if string in english_lg.vocab.vectors:
        print( string, english_lg.vocab.strings[string], numpy.abs(english_lg.vocab.vectors[string]))
    
    vvv.append( string )

    i+=1
    if i>1000:
        break
print(i)
print(len(vvv))

vocablist = vvv

#english_lg.vocab.lookups
#for string in 
#print(len(vocablist))
#vocablist[0].prob

In [None]:
#english_md = spacy.load("en_core_web_md")   

allsim1, allsim2 = {}, {}
#vocablist = list(english_lg.vocab.strings)

print( len(vocablist) )

for word in vocablist[::10]:
    isolated = english_trf(word)
    allsim1[word] = first_capital.similarity( isolated )
    allsim2[word] = second_capital.similarity( isolated )

allsim1 = list(allsim1.items())
allsim1.sort(key = lambda x:x[1], reverse=True)
print( allsim1[:10] )

allsim2 = list(allsim2.items())
allsim2.sort(key = lambda x:x[1], reverse=True)
print( allsim2[:10] )




In [None]:
allsim

In [None]:
# Things like 'find similar words within texts' will rely on some variant of 'compare everything to everything'
# I've not found a spacy way to do such mass comparisons other than to call .similarity() a lot, which is a bunch of overhead
# Since it seems to just be cosine similarity, we can use scipy to do a lot more comparisons in one go - code for which is in our helper

print( "SENTENCE SIMILARITY" )
for score, one, two in helpers_spacy.similar_sentences(doc,     thresh=0.5, n=5):   # yes, these these thresholds are chosen to give good results with this example. Play with them to see how messy it actually is.
    print( "    %5.2f  %40r  %40r"%(score, one, two) )
    
print( "TOKEN SIMILARITY" )
for score, one, two in helpers_spacy.similar_chunks(doc, 1,0,0, thresh=0.6, n=5):
    print( "    %5.2f  %40s  %40s"%(score, one, two) )

print( "ENTITY AND NOUN CHUNK SIMILARITY" )
for score, one, two in helpers_spacy.similar_chunks(doc, 0,1,1, thresh=0.7, n=5):
    print( "    %5.2f  %40s  %40s"%(score, one, two) )

# It's generally not so useful to compare tokens with phrases from the same document, in that the top similarities will be phrases with their own head/root.

In [None]:
# Since the average of a sentence or document would be a lot of function words, 
#   direct comparison would still work but be watered down depending on how many of those there are



#   so you might like 
# At the same time, spacy prefers its parsed object immutable, so you would have to work around it
import numpy
from importlib import reload
reload(helpers_spacy)
for sent in paris.sents:
    print( '-'*80 )
    print( sent )
    sg = helpers_spacy.interesting_words( sent )
    print( sg )
    vpt = helpers_spacy.vector_per_tag(sent, average=True) 
    for tag, ary in vpt.items():
        print( tag, numpy.linalg.norm(ary))
