# Vocabs, Hashes, Lexemes

spaCy encodes every string in a `Vocab` as a **hash value**. This saves memory because each string is stored only once, in the string store. Hashes cannot be reversed, so if a word's string is not in the string store, there is no way to recover the string from its hash.

Any string can be converted to a hash.

> Each Token -> corresponding Lexeme -> corresponding hash in string store

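To make the idea concrete, here is a toy sketch of a two-way string store in plain Python. It is not spaCy's implementation: `ToyStringStore` and `toy_hash` are hypothetical names, and `hashlib.blake2b` only stands in for whatever hash function spaCy uses internally.

```python
import hashlib


def toy_hash(text: str) -> int:
    """Map a string to a 64-bit integer (stand-in hash, not spaCy's)."""
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class ToyStringStore:
    """Hypothetical store: string -> hash always works, hash -> string only if seen."""

    def __init__(self):
        self._by_hash = {}

    def add(self, text: str) -> int:
        key = toy_hash(text)
        self._by_hash[key] = text  # each distinct string is stored once
        return key

    def __getitem__(self, key):
        if isinstance(key, str):
            return toy_hash(key)   # any string can be converted to a hash
        return self._by_hash[key]  # raises KeyError for a hash never added


store = ToyStringStore()
coffee_hash = store.add("coffee")
print(store[coffee_hash])              # hash -> string
print(store["coffee"] == coffee_hash)  # string -> hash

tea_hash = store["tea"]                # hashing works without adding
try:
    store[tea_hash]
except KeyError:
    print("no string stored for that hash")
```

The last lookup fails because "tea" was hashed but never added, which mirrors the point above: a hash alone is not enough to get the string back.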
## Look up a string's hash value

`nlp.vocab.strings` works *both* ways: index it with a string to get the hash value, or with a hash value to get the string.

In [13]:
nlp.vocab.strings["Chicago"]

3767271857074311788

In [2]:
# Import the English language class
from spacy.lang.en import English

# Create the nlp object
nlp = English()

doc = nlp("I love coffee")

print('hash value:', nlp.vocab.strings['coffee'])
print('string value:', nlp.vocab.strings[3197928453018144401])

hash value: 3197928453018144401
string value: coffee


Alternatively, use `doc.vocab.strings`, which refers to the same string store.

In [3]:
print('hash value:', doc.vocab.strings['coffee'])

hash value: 3197928453018144401

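The reverse lookup only works when the string store has seen the string. A minimal sketch, assuming `StringStore` can be imported from `spacy.strings` and built directly (the names `seen` and `fresh` are illustrative):

```python
from spacy.strings import StringStore

seen = StringStore(["coffee"])  # store that already contains "coffee"
fresh = StringStore()           # empty store

coffee_hash = seen["coffee"]

# Hashing does not depend on the store, so both give the same value.
print(fresh["coffee"] == coffee_hash)

# Only the store that holds the string can map the hash back.
print(seen[coffee_hash])
try:
    fresh[coffee_hash]
except KeyError:
    print("hash not found in the fresh string store")
```

This is why a hash taken from one `nlp` object cannot be resolved by another one unless the second has also stored that string.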

## Lexemes

Lexemes are context-independent entries in the vocabulary. Linguistically, a lexeme such as play covers the forms play, playing, plays, and played; in spaCy, though, each distinct word form gets its own `Lexeme` entry. Lexemes do **not** have POS tags, dependencies, or entity labels because these depend on context.

Index `nlp.vocab` with a string or a hash ID to look up a lexeme.


> See the spaCy [documentation](https://spacy.io/api/lexeme) for more info, including lexeme attributes.

Word vector features, e.g. similarity, need a pipeline that ships with word vectors, such as `en_core_web_lg`. See [here](https://spacy.io/usage/vectors-similarity) for more info.

In [4]:
doc = nlp("I love coffee")

lexeme = nlp.vocab['coffee']

# Print the lexical attributes
print(lexeme.text, lexeme.orth, lexeme.is_alpha)

coffee 3197928453018144401 True


In [5]:
doc = nlp("I love playing basketball")

lexeme = nlp.vocab['playing']

# Print the lexical attributes
print(lexeme.text, lexeme.orth, lexeme.is_alpha)

playing 13803694918078379268 True

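The Token -> Lexeme -> hash chain can be traced in a few lines. This is a small sketch that reuses the blank `English` pipeline from the cells above; the variable names are illustrative.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("I love playing basketball")

token = doc[2]                           # a word in context
lexeme = nlp.vocab[token.text]           # its context-independent vocabulary entry
hash_id = nlp.vocab.strings[token.text]  # the hash kept in the string store

print(token.text, lexeme.text)
print(token.orth == lexeme.orth == hash_id)

# The vocabulary can also be indexed with the hash instead of the string.
print(nlp.vocab[hash_id].text)
```

The token, its lexeme, and the string store all agree on the same hash ID, so any one of them can be used to reach the others.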