<div style="text-align: right; font-style: italic">Lorenz Köhl
<br>
September 2022</div>

# Reverse Dictionary

Writing well is laborious. A good dictionary helps, but it only works in one direction:
you have to think of a word, look it up, chase references, and so on. Even more work!

It would be helpful to look up words by meaning, by what we as writers want to express.
For example, suppose we had a function `find_words("I'm lost for words")`
that presented us with a choice of words:

```
astoundment, bewildered, blank, confus, distraught, perplexly, stagger, stound, unyielded
```

Then we might find the word we want without all the gyrations of traditional dictionary use.

Can we implement this function? The answer is yes, and it's not hard (*if you know your way around Python*)!

*Dependencies for execution:*

- an environment with the following packages, on a computer with enough resources (e.g. an NVIDIA GPU and lots of RAM)
- pytorch<br> `conda install pytorch torchvision torchaudio cudatoolkit=11.6 -c pytorch -c conda-forge`
- [sentence-transformers](https://github.com/UKPLab/sentence-transformers):<br> `conda install -c conda-forge sentence-transformers`
- [ScaNN](https://github.com/google-research/google-research/tree/master/scann):<br> `pip install scann`

You'll also need a cleaned-up version of the Webster 1913 dictionary
[json file](https://www.dropbox.com/s/w62l6pdfl8dtw2z/webst.json?dl=0).
The cleaning script is in the [repo](https://github.com/mye/simple-vector-search), and it depends on
[html5-parser](https://html5-parser.readthedocs.io/en/latest/):
<br> 
`pip install --no-binary lxml html5-parser`

`python cleanwebst.py <webst.json > cleanwebst.json`

In [None]:
import torch, numpy as np
from sentence_transformers import SentenceTransformer
import json
import scann

In [13]:
assert torch.cuda.is_available()

We start off by loading the dictionary, embedding the definitions into vectors (sentence embeddings), and indexing those vectors for approximate nearest neighbor search.

In [5]:
# maps each word to a list of definition strings
with open('cleanwebst.json', encoding='utf-8') as f:
    webst = json.load(f)
webst['neuron']

['The brain and spinal cord; the cerebro-spinal axis; myelencephalon.',
 '[NL., from Gr. νεῦρον nerve.]']

In [19]:
mpnet = SentenceTransformer('all-mpnet-base-v2') # could also use all-MiniLM-L6-v2 for a lighter-weight model

In [7]:
# this takes a while (about 30 minutes on my RTX 3060 Ti)
webst_embs = {word: mpnet.encode(defs) for word, defs in webst.items()} 

In [18]:
dataset = np.concatenate([webst_embs[w] for w in webst_embs])
dataset_words = np.array([w for w in webst_embs for e in webst_embs[w]])
assert len(dataset) == len(dataset_words)
np.save('embs.npy', dataset) # save data so we don't have to recompute when something bad happens
np.save('words.npy', dataset_words)
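
If the kernel dies after the slow encoding step, the two saved arrays can be read back with `np.load` instead of recomputing. The following is a minimal sketch of that round trip on tiny stand-in data; the helper names `save_index` and `load_index` and the temporary directory are illustrative assumptions, not part of the notebook.

```python
import os
import tempfile

import numpy as np


def save_index(directory, embs, words):
    """Write embeddings and their parallel word array (hypothetical helper)."""
    np.save(os.path.join(directory, "embs.npy"), embs)
    np.save(os.path.join(directory, "words.npy"), words)


def load_index(directory):
    """Read both arrays back and check that they still line up row by row."""
    embs = np.load(os.path.join(directory, "embs.npy"))
    words = np.load(os.path.join(directory, "words.npy"))
    assert len(embs) == len(words)
    return embs, words


# Stand-in data: three "definitions" belonging to two words.
with tempfile.TemporaryDirectory() as tmp:
    save_index(tmp, np.zeros((3, 4), dtype=np.float32), np.array(["a", "b", "b"]))
    demo_embs, demo_words = load_index(tmp)
```

In the notebook itself this would amount to `dataset = np.load('embs.npy')` and `dataset_words = np.load('words.npy')`.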

In [20]:
normalized_dataset = dataset / np.linalg.norm(dataset, axis=1)[:, np.newaxis]
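
Why normalize? For unit-length vectors the dot product equals the cosine similarity, so a `"dot_product"` index over the normalized rows ranks definitions by cosine similarity. The sketch below checks this on random stand-in vectors (the data and the helper name `normalize_rows` are made up for illustration). It also shows that leaving the query unnormalized only rescales every score by the same positive factor, so the ranking stays the same.

```python
import numpy as np


def normalize_rows(x):
    """Scale every row to unit Euclidean length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


rng = np.random.default_rng(0)
vectors = rng.normal(size=(5, 8))  # stand-in for definition embeddings
query = rng.normal(size=8)         # stand-in for an encoded description

# Cosine similarity computed directly from the raw vectors.
cosine = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))

# Dot product after normalizing both sides gives the same numbers.
unit_dot = normalize_rows(vectors) @ (query / np.linalg.norm(query))

# With an unnormalized query the scores differ, but their order does not.
raw_query_order = np.argsort(normalize_rows(vectors) @ query)
```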

In [22]:
searcher = scann.scann_ops_pybind.builder(normalized_dataset, 10, "dot_product").tree(
    num_leaves=2000, num_leaves_to_search=100, training_sample_size=250000).score_ah(
    2, anisotropic_quantization_threshold=0.2).reorder(100).build()

This did a lot in a few cells, even if it doesn't look like much!
We loaded a pretrained neural network and encoded the whole dictionary,
which gives us around 270000 vectors to search through.

We now have everything to implement our word finding function.
We simply encode the description (the meaning) into a vector and search for its neighbors!

In [23]:
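def find_words(description: str):
    # embed the description with the same model used for the definitions
    emb = mpnet.encode(description)
    neighbors, distances = searcher.search(emb, final_num_neighbors=10)
    # one word can match through several definitions; the set removes duplicates (and the ranking order)
    return set(dataset_words[neighbors])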
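
For intuition about what the approximate index is doing, the exact version of the same search is just one matrix-vector product followed by a top-k selection. Here is a minimal sketch on toy 2-D vectors; `exact_neighbors` is a hypothetical helper, not a ScaNN function. Running something like it on `normalized_dataset` could serve as a slow reference to compare the approximate results against.

```python
import numpy as np


def exact_neighbors(unit_vectors, query, k=10):
    """Brute-force search: indices and scores of the k highest dot products."""
    scores = unit_vectors @ query
    k = min(k, len(scores))
    # argpartition finds the k best candidates without sorting everything
    top = np.argpartition(-scores, k - 1)[:k]
    # order just those k candidates from best to worst
    top = top[np.argsort(-scores[top])]
    return top, scores[top]


# Toy unit vectors standing in for definition embeddings.
toy = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [-1.0, 0.0]])
idx, sims = exact_neighbors(toy, np.array([1.0, 0.0]), k=2)
```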

In [24]:
find_words("I'm lost for words")

{'amazeful',
 'astoundment',
 'bewildered',
 'blank',
 'confus',
 'distraught',
 'perplexly',
 'stagger',
 'stound',
 'unyielded'}

Of course, what we really want is a more nicely formatted list with definitions.

In [26]:
from IPython.display import HTML

In [72]:
def word_html(word, ndefs=5):
    defs = [f'<i style="font-size: small">{d}</i>' for d in webst[word][:ndefs]]
    html = f'<li><b>{word}</b><br>{"  //  ".join(defs)}</li>'
    return html

def display_words(desc):
    words = find_words(desc)
    htmls = [word_html(word) for word in words]
    return HTML('<ul>' + "".join(htmls) + '</ul>')
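
One caveat with building HTML by string formatting: if a definition happens to contain characters such as `<` or `&`, the rendered list may come out garbled. Whether the cleaned dictionary actually contains such characters is not something checked here, so treat the following as a defensive sketch. `word_html_escaped` is a hypothetical variant of `word_html` that takes the definitions as an argument and escapes them with the standard library's `html.escape`.

```python
from html import escape


def word_html_escaped(word, definitions, ndefs=5):
    """Like word_html, but escapes text before embedding it in markup."""
    defs = [f'<i style="font-size: small">{escape(d)}</i>' for d in definitions[:ndefs]]
    return f'<li><b>{escape(word)}</b><br>{"  //  ".join(defs)}</li>'


# Made-up input containing characters that are special in HTML.
snippet = word_html_escaped("a<b", ["less than & more", "x > y"])
```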

In [73]:
display_words("I'm lost for words")

That's a decent result for the wee bit of code we had to write.
The quality of the words isn't always perfect (false positives happen).
Some words have a lot of definitions and appear too often (e.g. unyielded).
We could, for example, think about how to improve the embeddings,
or we could increase the size of our dataset and balance the number of
definitions used for training. Then we could think about deploying it as a service for others.
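
A cheap mitigation to try before any of that: ask the index for more neighbors than needed and keep only the best-scoring hit per word. A word with many definitions then occupies a single slot, and the results stay ranked (the `set` in `find_words` throws the order away). This is a sketch under the assumption that a higher score means a closer match, as with dot products; `best_per_word` is a hypothetical helper and the scores below are made-up toy values.

```python
def best_per_word(words, scores, limit=10):
    """Collapse repeated words to their best score, ranked best first."""
    best = {}
    for word, score in zip(words, scores):
        if word not in best or score > best[word]:
            best[word] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


# Toy hits: "unyielded" matches twice through different definitions.
hits = best_per_word(
    ["unyielded", "blank", "unyielded", "stound"],
    [0.61, 0.58, 0.55, 0.40],
    limit=3,
)
```

In the notebook this could be fed with `dataset_words[neighbors]` and `distances` from a search that uses a larger `final_num_neighbors`.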

But before we do all that, let's gather some real-world experience of how
useful our model is in practice and get some writing done. Have fun!