### “You shall know a word by the company it keeps” – Firth, J.R. (1957)

This notebook explores various character, word, and sentence embeddings.

What are they?

- Word embeddings represent (embed) words in a continuous vector space where semantically similar words are mapped to nearby points ('are embedded nearby each other').
- Word embeddings allow you to implicitly include external information from the world into your language understanding models.
- Any set of numbers is a valid word vector, but to be useful, a set of word vectors for a vocabulary should capture the meaning of words, the relationship between words, and the context of different words as they are used naturally

There are a few key characteristics of a set of useful word embeddings:

- Every word has a unique embedding: a vector, which is just a list of numbers.
- The embeddings are multidimensional; for a good model, they typically have between 50 and 500 dimensions.
- For each word, the embedding captures the “meaning” of the word.
- Similar words end up with similar embedding values.
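
To make "similar words end up with similar embedding values" concrete, here is a minimal sketch that compares vectors with cosine similarity. The three 4-dimensional vectors are hand-picked placeholders, not the output of any trained model, and `cosine_similarity` is a helper defined here for illustration.

```python
import numpy as np

# Hypothetical 4-dimensional embeddings, hand-picked for illustration only.
embeddings = {
    "king":  np.array([0.8, 0.6, 0.1, 0.0]),
    "queen": np.array([0.7, 0.7, 0.1, 0.1]),
    "apple": np.array([0.0, 0.1, 0.9, 0.5]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_royal = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_fruit = cosine_similarity(embeddings["king"], embeddings["apple"])

# The related pair should score higher than the unrelated pair.
print(f"king vs queen: {sim_royal:.3f}")
print(f"king vs apple: {sim_fruit:.3f}")
```

With real embeddings the vectors would have far more dimensions, but the comparison works the same way.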

### Popular Word Embedding Algorithms (more formally called vector-space models)

- Word2Vec (by far the most popular right now)
    - Word2vec is a particularly computationally efficient predictive model for learning word embeddings from raw text. It comes in two flavors, the Continuous Bag-of-Words model (CBOW) and the Skip-Gram model [(Sections 3.1 and 3.2 in Mikolov et al.)](https://arxiv.org/pdf/1301.3781.pdf).
- GloVe (global vectors)
- [LexVec](https://github.com/alexandres/lexvec)
- fastText
- [ELMo](https://arxiv.org/pdf/1802.05365.pdf)
- BERT
- [State of the art in 2017](http://ruder.io/word-embeddings-2017/index.html)
- [A more recent state of the art, 2018](https://towardsdatascience.com/beyond-word-embeddings-part-2-word-vectors-nlp-modeling-from-bow-to-bert-4ebd4711d0ec)
- [A pretty nice evaluation of existing embedding methods](https://github.com/kudkudak/word-embeddings-benchmarks)
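
The difference between the two Word2vec flavors mentioned above is easiest to see in how training examples are built from a sentence. The sketch below only generates those examples; it does not train a model, and the function names `skipgram_pairs` and `cbow_examples` and the `window` parameter are illustrative choices, not part of any library.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: each center word predicts each surrounding context word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: the surrounding context words jointly predict the center word."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples

tokens = "the dog walks over the road".split()
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)

print(sg[:3])  # (center, context) pairs
print(cb[:2])  # (context list, center) examples
```

Skip-gram produces one example per (center, context) pair, while CBOW produces one example per position with the whole context as input.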

## "traditional methods are frequently overshadowed amid the deep learning craze and their relevancy consequently deserves to be mentioned more often"

## On combining embeddings of different languages

- [This is very relevant](http://ruder.io/cross-lingual-embeddings/index.html)

# ELMo (Embeddings from Language Models) vs word2vec

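At a high level, ELMo works as follows:
1. Train a bidirectional LSTM-based language model on a large corpus.
2. Use the LSTM's hidden states for each token to compute a vector representation of each word. Because the hidden states depend on the surrounding sentence, the same word gets a different vector in different contexts.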
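
The second step can be sketched as a weighted sum over the language model's layers, which is how the ELMo paper combines hidden states into one vector per token. This is a minimal sketch under stated assumptions: random arrays stand in for real LSTM hidden states, and `layer_logits` and `gamma` (which would normally be learned for a downstream task) are fixed by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for real LSTM outputs, shape (num_layers, num_tokens, hidden_dim).
num_layers, num_tokens, hidden_dim = 3, 5, 8
hidden_states = rng.normal(size=(num_layers, num_tokens, hidden_dim))

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def combine_layers(hidden_states, layer_logits, gamma=1.0):
    """Weighted sum over layers: gamma * sum_j softmax(logits)_j * h_j."""
    weights = softmax(layer_logits)  # shape: (num_layers,)
    # Contract the layer axis, leaving (num_tokens, hidden_dim).
    return gamma * np.tensordot(weights, hidden_states, axes=1)

# Zero logits give every layer the same weight (a plain average).
layer_logits = np.zeros(num_layers)
token_vectors = combine_layers(hidden_states, layer_logits)
print(token_vectors.shape)
```

Each row of `token_vectors` is the representation of one token in its sentence context.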

We can:
 1. Embed the whole sentence (and measure the difference with the target)
 2. Embed the individual tokens in the sentence (and measure the difference with the target)
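
The two options can be sketched as follows. The text does not say how the sentence vector is formed or how the difference is measured, so this example assumes mean pooling and cosine distance; the token vectors are random placeholders rather than real model output.

```python
import numpy as np

rng = np.random.default_rng(42)
dim = 16

# Random placeholders for token vectors produced by some embedding model.
sentence_tokens = rng.normal(size=(6, dim))  # one row per token
target_tokens = rng.normal(size=(4, dim))

def cosine_distance(u, v):
    """1 - cosine similarity; 0.0 means the vectors point the same way."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Option 1: pool tokens into one sentence vector, then compare once.
sentence_vec = sentence_tokens.mean(axis=0)
target_vec = target_tokens.mean(axis=0)
sentence_level = cosine_distance(sentence_vec, target_vec)

# Option 2: compare every sentence token with every target token.
token_level = np.array(
    [[cosine_distance(s, t) for t in target_tokens] for s in sentence_tokens]
)

print(sentence_level)
print(token_level.shape)  # (sentence tokens, target tokens)
```

Option 1 yields a single number per sentence pair, while option 2 keeps a full matrix of token-to-token distances that can be aggregated later.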

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
import sys

# Add the parent directory to the path so the haberrspd package can be imported.
sys.path.append(os.path.join(os.getcwd(), ".."))
from haberrspd.embeddings import calculate_bert_embeddings

In [3]:
# Toy input: two groups ('a' and 'b'), each a list of single-sentence lists.
test = {}
test['a'] = [["I am a test sentence"], ["The dog walks over the road"]]
test['b'] = [["Cocaine in the membrane"], ["Wishful thinking is all"]]

In [4]:
out = calculate_bert_embeddings(test)

