# Sense2Vec overview
sense2vec ([Trask et al.](https://arxiv.org/abs/1511.06388), 2015): https://github.com/explosion/sense2vec

sense2vec is a sense embedding model: it provides vectors for words and multi-word phrases, keyed by part-of-speech tags and entity labels.
It can be used as a standalone module or as a spaCy pipeline component.

In [1]:
!pip install sense2vec



Download the sense2vec model pretrained on 2015 Reddit comments:

In [2]:
import os
if not os.path.exists('./s2v_reddit_2015_md.tar.gz'):
  !wget https://github.com/explosion/sense2vec/releases/download/v1.0.0/s2v_reddit_2015_md.tar.gz

In [3]:
if not os.path.exists('./s2v_old'):
  !tar -xzvf ./s2v_reddit_2015_md.tar.gz

### Standalone usage

In [4]:
from sense2vec import Sense2Vec

In [5]:
s2v = Sense2Vec().from_disk('./s2v_old')

In [6]:
# Query key: a word or phrase plus its sense (part-of-speech tag or entity label), joined by "|"
query = "natural_language_processing|NOUN"
assert query in s2v
vector = s2v[query]
freq = s2v.get_freq(query)
s2v.most_similar(query, n=3)

[('machine_learning|NOUN', 0.8987),
 ('computer_vision|NOUN', 0.8636),
 ('deep_learning|NOUN', 0.8573)]
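
The query keys above follow a `phrase|SENSE` pattern, with underscores standing in for spaces inside a phrase. The sketch below mimics that convention in plain Python so it runs without the model. The helper names `to_key` and `from_key` are hypothetical and are not part of the sense2vec API.

```python
from typing import Tuple


def to_key(phrase: str, sense: str) -> str:
    """Join a phrase and a sense into a 'phrase|SENSE' key (hypothetical helper)."""
    return phrase.replace(" ", "_") + "|" + sense


def from_key(key: str) -> Tuple[str, str]:
    """Split a key into (phrase, sense), splitting on the last '|' only."""
    phrase, sense = key.rsplit("|", 1)
    return phrase.replace("_", " "), sense


key = to_key("natural language processing", "NOUN")
print(key)            # natural_language_processing|NOUN
print(from_key(key))  # ('natural language processing', 'NOUN')
```

Splitting on the last `|` keeps the sense intact even if the phrase itself happens to contain a pipe character.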

The available senses in the table:

In [7]:
s2v.senses

['PUNCT',
 'SYM',
 'MONEY',
 'PERCENT',
 'PRODUCT',
 'X',
 'LANGUAGE',
 'DET',
 'LOC',
 'CARDINAL',
 'CONJ',
 'LAW',
 'ORG',
 'PART',
 'VERB',
 'NUM',
 'EVENT',
 'ADP',
 'PERSON',
 'QUANTITY',
 'INTJ',
 'TIME',
 'SPACE',
 'DATE',
 'ADJ',
 'NOUN',
 'NORP',
 'ORDINAL',
 'WORK OF ART',
 'ADV',
 'FAC',
 'GPE']

Results depend on the sense:

In [8]:
s2v.most_similar("apple|NOUN", n=10)

[('blackberry|NOUN', 0.8481),
 ('apple|ADJ', 0.7543),
 ('banana|NOUN', 0.751),
 ('grape|NOUN', 0.7432),
 ('apple|VERB', 0.7349),
 ('gingerbread|NOUN', 0.733),
 ('jelly_bean|NOUN', 0.7278),
 ('pear|NOUN', 0.7213),
 ('pomegranate|NOUN', 0.7205),
 ('ice_cream_sandwich|NOUN', 0.7161)]

In [9]:
s2v.most_similar("Apple|ORG", n=10)

[('BlackBerry|ORG', 0.9017),
 ('&gt;Apple|NOUN', 0.8947),
 ('even_Apple|ORG', 0.8858),
 ('Blackberry|PERSON', 0.884),
 ('_Apple|ORG', 0.8812),
 ('Blackberry|ORG', 0.8776),
 ('Apple|PERSON', 0.8745),
 ('Android|ORG', 0.8659),
 ('OEMs|NOUN', 0.8608),
 ('Samsung|ORG', 0.8572)]

In [10]:
s2v.get_other_senses("apple|NOUN", ignore_case=True)

['apple|VERB',
 'apple|ADJ',
 'APPLE|ORG',
 'APPLE|VERB',
 'APPLE|ADP',
 'APPLE|INTJ',
 'Apple|PRODUCT',
 'Apple|LOC',
 'Apple|ORG',
 'Apple|PERSON',
 'Apple|ADJ']

In [11]:
s2v.get_best_sense("apple", ["ORG", "PRODUCT", "NOUN"])

'Apple|ORG'
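
One way to picture `get_best_sense` is as a frequency lookup over candidate keys. The sketch below assumes that the most frequent matching key wins and that case-insensitive matching tries a few capitalisation variants; both are assumptions suggested by the toy assertions later in this notebook, not a copy of the library's code. The function name `best_sense` and the frequencies in `FREQS` are invented for illustration.

```python
from typing import Dict, Optional, Sequence

# Hypothetical frequencies, made up for this example only.
FREQS: Dict[str, int] = {"apple|NOUN": 500, "Apple|ORG": 900, "Apple|PRODUCT": 300}


def best_sense(
    word: str,
    senses: Sequence[str],
    freqs: Dict[str, int],
    ignore_case: bool = True,
) -> Optional[str]:
    """Return the most frequent 'word|SENSE' key found in freqs, or None."""
    # Assumption: case-insensitive lookup tries lower, upper and title case.
    forms = {word, word.lower(), word.upper(), word.title()} if ignore_case else {word}
    candidates = [f"{form}|{sense}" for form in sorted(forms) for sense in senses]
    found = [key for key in candidates if key in freqs]
    return max(found, key=lambda k: freqs[k]) if found else None


print(best_sense("apple", ["ORG", "PRODUCT", "NOUN"], FREQS))
print(best_sense("apple", ["ORG", "PRODUCT", "NOUN"], FREQS, ignore_case=False))
```

With these made-up counts the first call picks `Apple|ORG` and the second, restricted to the exact casing, picks `apple|NOUN`.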

Check vector:

In [12]:
s2v[query]

array([-0.02698926,  0.3866803 , -0.66829497, -0.41728875,  0.26364306,
       -0.40081096,  0.6281248 ,  0.14720058,  0.19218649, -0.0998884 ,
        0.26744893, -0.02889291, -0.17782305, -0.11958034, -0.03006067,
       -0.24114996, -0.12906119,  0.19724639,  0.4380696 ,  0.05275216,
        0.15804796,  0.19498187, -0.08526038, -0.46956626, -0.11648716,
        0.07625313, -0.29506105, -0.42849484, -0.40789005, -0.1288717 ,
        0.20095542,  0.61653686, -0.05818588, -0.2014371 , -0.00563217,
       -0.5979889 , -0.21555479,  0.52637964, -0.23618117, -0.27018833,
       -0.39888066, -0.03571676, -0.14596932,  0.06775339,  0.06443068,
        0.02549744, -0.03748453, -0.18575297, -0.2129982 ,  0.5471347 ,
        0.05033882, -0.40439808,  0.10965174, -0.19026929, -0.10089809,
        0.05904212,  0.57891583,  0.185087  , -0.447115  ,  0.09574994,
        0.11977117, -0.20688562,  0.201603  ,  0.30103895, -0.39587796,
       -0.58227926, -0.59210235, -0.34023854, -0.06494252, -0.31

Vector dimensionality:

In [13]:
len(s2v[query])

128

In [14]:
print(type(s2v))

<class 'sense2vec.sense2vec.Sense2Vec'>


### Usage as a spaCy pipeline component

In [15]:
!pip install spacy==3.4.0



In [16]:
!python3 -m spacy validate

✔ Loaded compatibility table

ℹ spaCy installation: /usr/local/lib/python3.9/site-packages/spacy

NAME             SPACY            VERSION
en_core_web_sm   >=3.4.0,<3.5.0   3.4.0   ✔



In [17]:
!python3 -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.4.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.0/en_core_web_sm-3.4.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [18]:
import spacy

nlp = spacy.load("en_core_web_sm")
s2v_p = nlp.add_pipe("sense2vec")
s2v_p.from_disk('./s2v_old')

doc = nlp("A sentence about natural language processing.")
assert doc[3:6].text == "natural language processing"
freq = doc[3:6]._.s2v_freq
vector = doc[3:6]._.s2v_vec
most_similar = doc[3:6]._.s2v_most_similar(3)

In [19]:
print(type(s2v_p))

<class 'sense2vec.component.Sense2VecComponent'>


In [20]:
most_similar

[(('machine learning', 'NOUN'), 0.8987),
 (('computer vision', 'NOUN'), 0.8636),
 (('deep learning', 'NOUN'), 0.8573)]

The spaCy pipeline runs a part-of-speech tagger and a named entity recognizer before the sense2vec component. The sense2vec component uses the output of these components to build word senses.

In [21]:
doc = nlp('Power resides where men believe it resides. It’s a trick, a shadow on the wall.')

In [22]:
for token in doc:
  try:
    print(token, token.pos_, '\n', token._.s2v_most_similar(3))
  except ValueError:
    # A token/POS-tag combination that is missing from the vectors table
    # raises a ValueError, so skip it
    pass

Power NOUN 
 [((' power', 'NOUN'), 0.8701), (('only power', 'NOUN'), 0.8584), (('actual power', 'NOUN'), 0.8424)]
resides VERB 
 [(('resides', 'NOUN'), 0.858), (('reside', 'VERB'), 0.8311), (('residing', 'VERB'), 0.8212)]
men NOUN 
 [(('&gt;Men', 'NOUN'), 0.9238), (('&gt;men', 'NOUN'), 0.9099), ((' men', 'NOUN'), 0.9024)]
believe VERB 
 [(('beleive', 'VERB'), 0.9053), (('that', 'ADP'), 0.8564), (('claim', 'VERB'), 0.8178)]
resides VERB 
 [(('resides', 'NOUN'), 0.858), (('reside', 'VERB'), 0.8311), (('residing', 'VERB'), 0.8212)]
. PUNCT 
 [((',', 'PUNCT'), 0.9312), (('and', 'CONJ'), 0.8561), (('that', 'DET'), 0.851)]
’s VERB 
 [(('isn’t', 'VERB'), 0.9394), (('’s', 'ADV'), 0.9149), (('’s', 'PUNCT'), 0.902)]
a DET 
 [(('another', 'DET'), 0.8344), (('an', 'DET'), 0.8261), (('.', 'PUNCT'), 0.8019)]
trick NOUN 
 [(('good trick', 'NOUN'), 0.8685), (('little trick', 'NOUN'), 0.8457), (('neat trick', 'NOUN'), 0.8286)]
, PUNCT 
 [(('.', 'PUNCT'), 0.9312), (('...', 'PUNCT'), 0.8282), (('(', 'PUN

For entities, the entity labels are used as the "sense" (instead of the token's part-of-speech tag):

In [23]:
doc = nlp("A sentence about Apple and Google.")
for ent in doc.ents:
  assert ent._.in_s2v
  print(ent.text)
  most_similar = ent._.s2v_most_similar(3)
  print(most_similar)

Apple
[(('BlackBerry', 'ORG'), 0.9017), (('&gt;Apple', 'NOUN'), 0.8947), (('even Apple', 'ORG'), 0.8858)]
Google
[((' Google', 'ORG'), 0.8996), (('search engine', 'NOUN'), 0.8486), (('Bing', 'NOUN'), 0.8436)]


**Training your own sense2vec vectors:**

* **01_parse.py** - Use spaCy to parse the raw text and output a binary DocBin collection (DocBin serializes faster and produces smaller files than pickle).

* **02_preprocess.py** - Load the DocBin and output text files in the sense2vec format (one sentence per line, with merged phrases and senses).\
Example output:\
Rats|NOUN ,|PUNCT mould|NOUN and|CCONJ broken_furniture|NOUN :|PUNCT the|DET scandal|NOUN of|ADP the|DET UK|GPE 's|PART refugee_housing|NOUN

* **03_glove_build_counts.py, 04_glove_train_vectors.py** or **04_fasttext_train_vectors.py** - Output a plain-text vectors file.\
GloVe builds on global word-to-word co-occurrence counts gathered over the entire corpus. Word2vec, on the other hand, uses co-occurrence within a local context (neighbouring words). fastText is an extension of word2vec proposed by Facebook in 2016.

* **05_export.py** - Expects a vectors.txt and a vocab file and exports a component that can be loaded with `Sense2Vec.from_disk`.

For more detailed documentation of the scripts, check out the source or run them with --help.
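
To make the idea of global co-occurrence counts concrete, here is a minimal sketch that counts how often two tokens appear within a fixed window of each other across a whole toy corpus in the sense2vec text format. It is a simplification: it uses unweighted counts and a hypothetical function name `cooccurrence_counts`, and it is not the code used by the training scripts.

```python
from collections import Counter
from typing import Iterable, List


def cooccurrence_counts(sentences: Iterable[List[str]], window: int = 2) -> Counter:
    """Count (center, context) pairs within `window` tokens, over all sentences."""
    counts: Counter = Counter()
    for tokens in sentences:
        for i, center in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[(center, tokens[j])] += 1
    return counts


# Toy corpus: one sentence per line, tokens already merged and tagged.
corpus = [
    "the|DET scandal|NOUN of|ADP refugee_housing|NOUN".split(),
    "the|DET scandal|NOUN".split(),
]
counts = cooccurrence_counts(corpus, window=1)
print(counts[("the|DET", "scandal|NOUN")])  # 2
print(counts[("the|DET", "of|ADP")])        # 0
```

The counts are accumulated over every sentence before any training happens, which is the "global" part; a local-context method would instead consume each window as it is encountered.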


In [24]:
s2v.most_similar("Google|ORG", n=10)

[('_Google|ORG', 0.8996),
 ('search_engine|NOUN', 0.8486),
 ('Bing|NOUN', 0.8436),
 ('even_Google|ORG', 0.8404),
 ('google|ORG', 0.8318),
 ('Google_Search|NOUN', 0.8291),
 ('Googles|NOUN', 0.8234),
 ('&gt;Google|NOUN', 0.8138),
 ('DuckDuckGo|NOUN', 0.8127),
 ('Yahoo|ORG', 0.8038)]

Number of rows in the vectors table (the shape is set at creation):

In [25]:
len(s2v)

1195261

Number of items:

In [26]:
len(list(s2v.items()))

1187453

Add new vector manually:

In [27]:
s2v.add("GOOGLE IS EVIL|ORG", s2v['Google|ORG'], 123) # key, vector, freq

In [28]:
s2v.most_similar("Google|ORG", n=10)

[('GOOGLE IS EVIL|ORG', 1.0),
 ('_Google|ORG', 0.8996),
 ('search_engine|NOUN', 0.8486),
 ('Bing|NOUN', 0.8436),
 ('even_Google|ORG', 0.8404),
 ('google|ORG', 0.8318),
 ('Google_Search|NOUN', 0.8291),
 ('Googles|NOUN', 0.8234),
 ('&gt;Google|NOUN', 0.8138),
 ('DuckDuckGo|NOUN', 0.8127)]

Shape unchanged:

In [29]:
len(s2v)

1195261

The number of items increased by one:

In [30]:
len(list(s2v.items()))

1187454
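
The difference between the two numbers above can be pictured with a toy table whose rows are allocated once, while keys are assigned to rows as they are added. This is only an illustrative model under that assumption; the class `FixedTable` is hypothetical and does not reproduce how sense2vec or spaCy store vectors.

```python
from typing import Dict, List, Sequence


class FixedTable:
    """Toy stand-in for a vectors table with a fixed number of rows."""

    def __init__(self, n_rows: int, width: int) -> None:
        # All rows are allocated up front and start out as zeros.
        self.rows: List[List[float]] = [[0.0] * width for _ in range(n_rows)]
        self.key_to_row: Dict[str, int] = {}

    def add(self, key: str, vector: Sequence[float]) -> None:
        """Store a vector under key, claiming the next free row for new keys."""
        if key not in self.key_to_row:
            if len(self.key_to_row) >= len(self.rows):
                raise ValueError("table is full")
            self.key_to_row[key] = len(self.key_to_row)
        self.rows[self.key_to_row[key]] = list(vector)

    def __len__(self) -> int:
        # Length is the number of allocated rows, not the number of keys.
        return len(self.rows)

    def keys(self) -> List[str]:
        return list(self.key_to_row)


table = FixedTable(n_rows=3, width=2)
table.add("a|A", [1.0, 0.0])
table.add("b|A", [0.0, 1.0])
print(len(table), len(table.keys()))  # 3 2
table.add("c|A", [1.0, 1.0])
print(len(table), len(table.keys()))  # 3 3
```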

In [31]:
import numpy as np

Create and modify sense2vec manually:

In [32]:
s2v = Sense2Vec(shape=(5, 4))
s2v.cfg["senses"] = ["A", "B", "C"]
for key, freq in [("a|A", 100), ("a|B", 50), ("a|C", 10), ("b|A", 1), ("B|C", 2)]:
  s2v.add(key, np.asarray([4, 2, 2, 2], dtype=np.float32), freq)
assert s2v.get_best_sense("a") == "a|A"
assert s2v.get_best_sense("b") == "B|C"
assert s2v.get_best_sense("b", ignore_case=False) == "b|A"
assert s2v.get_best_sense("c") is None
s2v.cfg["senses"] = []
assert s2v.get_best_sense("a") is None
assert s2v.get_best_sense("b", ["A"]) == "b|A"
assert s2v.get_best_sense("b", ["A", "C"]) == "B|C"

In [33]:
s2v.most_similar("a|A", n=10)

[('B|C', 1.0), ('b|A', 1.0), ('a|C', 1.0), ('a|B', 1.0)]
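
Every neighbour scores 1.0 here because all five keys were given the same vector. Assuming the scores are cosine similarities, the plain-Python sketch below reproduces that value for the toy vector `[4, 2, 2, 2]`; the function name `cosine` is a local helper, not a sense2vec call.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


vec = [4.0, 2.0, 2.0, 2.0]
print(round(cosine(vec, vec), 4))                   # 1.0
print(round(cosine(vec, [8.0, 4.0, 4.0, 4.0]), 4))  # 1.0
print(cosine([1.0, 0.0], [0.0, 1.0]))               # 0.0
```

Cosine similarity ignores vector length, so a scaled copy of the same vector also scores 1.0, while orthogonal vectors score 0.0.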