## Embeddings

In our previous example, we operated on high-dimensional bag-of-words vectors of length `vocab_size`, and we explicitly converted low-dimensional positional representation vectors into a sparse one-hot representation. This one-hot representation is not memory-efficient. In addition, each word is treated independently of the others, so one-hot encoded vectors cannot express semantic similarity between words.
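Both drawbacks can be checked on a toy case. The sketch below is only an illustration: the three-word vocabulary is made up, and the sizes 30000 and 100 are the vocabulary and embedding sizes used later in this unit.

```python
import numpy as np

vocab = ["cat", "kitten", "car"]   # hypothetical toy vocabulary
one_hot = np.eye(len(vocab))       # row i is the one-hot vector of word i

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Any two distinct one-hot vectors are orthogonal, so "cat" looks exactly as
# unrelated to "kitten" as it does to "car".
sim_cat_kitten = cosine(one_hot[0], one_hot[1])
sim_cat_car = cosine(one_hot[0], one_hot[2])

# Storage per word: vocab_size numbers for one-hot versus embedding_size
# numbers for a dense embedding.
vocab_size, embedding_size = 30000, 100
ratio = vocab_size / embedding_size
```

Both similarities come out as zero, and a dense 100-dimensional vector needs 300 times fewer numbers per word than a 30000-dimensional one-hot vector.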

In this unit, we will continue exploring the **AG News** dataset. To begin, let's load the data and reuse some definitions from the previous unit.


In [2]:
import tensorflow as tf
from tensorflow import keras
import tensorflow_datasets as tfds
import numpy as np

ds_train, ds_test = tfds.load('ag_news_subset').values()

### What is an embedding?

The idea of an **embedding** is to represent words using lower-dimensional dense vectors that reflect the semantic meaning of the word. We will later discuss how to build meaningful word embeddings, but for now let's just think of an embedding as a way to reduce the dimensionality of a word vector.

An embedding layer takes a word as input and produces an output vector of the specified `embedding_size`. In a sense, it is very similar to a `Dense` layer, but instead of taking a one-hot encoded vector as input, it takes a word number (the index of the word in the vocabulary).
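The relationship to a `Dense` layer can be made concrete with NumPy. This is a minimal sketch with made-up sizes and a random matrix `E` standing in for trained weights: multiplying a one-hot vector by the weight matrix selects exactly one row, which is what an index lookup does directly.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_size = 5, 3                   # toy sizes
E = rng.normal(size=(vocab_size, embedding_size))   # stand-in embedding matrix

word_index = 2
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

via_matmul = one_hot @ E     # what a bias-free Dense layer would compute
via_lookup = E[word_index]   # what an embedding layer does: pick one row
```

The two results are identical, but the lookup never materializes the `vocab_size`-long input vector.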

By using an embedding layer as the first layer in our network, we can switch from a bag-of-words to an **embedding bag** model, where we first convert each word in our text into its corresponding embedding, and then compute some aggregate function over all those embeddings, such as `sum`, `average` or `max`.

![Image showing an embedding classifier for five sequence words.](../../../../../translated_images/embedding-classifier-example.b77f021a7ee67eeec8e68bfe11636c5b97d6eaa067515a129bfb1d0034b1ac5b.pcm.png)
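The embedding bag idea can be sketched in a few lines of NumPy. The matrix `E` and the token numbers below are invented for illustration; the point is that each aggregate has a fixed size regardless of the text length.

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(10, 4))         # toy embedding matrix: 10 words, 4 dimensions
tokens = np.array([3, 1, 7, 1, 5])   # a five-token text, given as word numbers

embedded = E[tokens]                 # shape (5, 4): one row per token

# Each aggregate has shape (4,), whatever the number of tokens
bag_sum = embedded.sum(axis=0)
bag_mean = embedded.mean(axis=0)
bag_max = embedded.max(axis=0)
```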

Our classifier neural network consists of the following layers:

* `TextVectorization` layer, which takes a string as input and produces a tensor of token numbers. We will specify a reasonable vocabulary size `vocab_size` and ignore less frequently used words. The input shape will be 1, and the output shape will be $n$, since we get $n$ tokens as a result, each of them an integer from 0 to `vocab_size - 1`.
* `Embedding` layer, which takes $n$ numbers and maps each one to a dense vector of a given length (100 in our example). Thus, an input tensor of shape $n$ is transformed into an $n\times 100$ tensor.
* Aggregation layer, which takes the average of this tensor along the token axis (axis 1 once the batch dimension is included), i.e. it computes the average of all $n$ embedding vectors corresponding to the different words. To implement this layer, we will use a `Lambda` layer and pass it the function that calculates the average. The output will have shape 100, and it will be the numeric representation of the whole input sequence.
* Final `Dense` linear classifier.


In [3]:
vocab_size = 30000
batch_size = 128

vectorizer = keras.layers.experimental.preprocessing.TextVectorization(max_tokens=vocab_size,input_shape=(1,))

model = keras.models.Sequential([
    vectorizer,    
    keras.layers.Embedding(vocab_size,100),
    keras.layers.Lambda(lambda x: tf.reduce_mean(x,axis=1)),
    keras.layers.Dense(4, activation='softmax')
])
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization (TextVec  (None, None)             0         
 torization)                                                     
                                                                 
 embedding (Embedding)       (None, None, 100)         3000000   
                                                                 
 lambda (Lambda)             (None, 100)               0         
                                                                 
 dense (Dense)               (None, 4)                 404       
                                                                 
Total params: 3,000,404
Trainable params: 3,000,404
Non-trainable params: 0
_________________________________________________________________


In the `summary` printout, in the **output shape** column, the first tensor dimension `None` corresponds to the minibatch size, and the second corresponds to the length of the token sequence. Token sequences in a minibatch generally have different lengths. We will discuss how to deal with this in the next section.
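The **Param #** column of the summary can be reproduced by hand. The arithmetic below uses only the sizes that appear in the model definition (30000 tokens, 100 embedding dimensions, 4 output classes); the variable names are chosen for this sketch.

```python
vocab_size, embedding_size, num_classes = 30000, 100, 4

embedding_params = vocab_size * embedding_size              # one row per token
dense_params = embedding_size * num_classes + num_classes   # weights + biases
total_params = embedding_params + dense_params

print(embedding_params, dense_params, total_params)  # 3000000 404 3000404
```

Almost all parameters sit in the embedding matrix, while the vectorization and averaging layers have none.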

Now let's train the network:


In [4]:
def extract_text(x):
    return x['title']+' '+x['description']

def tupelize(x):
    return (extract_text(x),x['label'])

print("Training vectorizer")
vectorizer.adapt(ds_train.take(500).map(extract_text))

model.compile(loss='sparse_categorical_crossentropy',metrics=['acc'])
model.fit(ds_train.map(tupelize).batch(batch_size),validation_data=ds_test.map(tupelize).batch(batch_size))

Training vectorizer


<keras.callbacks.History at 0x22255515100>

> **Note** that we are building the vectorizer from a subset of the data. This is done to speed up the process, and it may mean that not all tokens from our text are present in the vocabulary. In that case, those tokens are all mapped to the same out-of-vocabulary index and are effectively ignored, which may result in slightly lower accuracy. However, in real life a subset of the text often gives a good vocabulary estimate.


### Dealing with variable sequence sizes

Let's understand how training happens in minibatches. In the example above, the input tensor has dimension 1, and we use 128-long minibatches, so the actual size of the tensor is $128 \times 1$. However, the number of tokens in each sentence is different. If we apply the `TextVectorization` layer to a single input, the number of tokens returned varies, depending on how the text is tokenized:


In [5]:
print(vectorizer('Hello, world!'))
print(vectorizer('I am glad to meet you!'))

tf.Tensor([ 1 45], shape=(2,), dtype=int64)
tf.Tensor([ 112 1271    1    3 1747  158], shape=(6,), dtype=int64)


However, when we apply the vectorizer to several sequences, it has to produce a rectangular tensor, so it fills the unused elements with the PAD token (which in our case is zero):


In [6]:
vectorizer(['Hello, world!','I am glad to meet you!'])

<tf.Tensor: shape=(2, 6), dtype=int64, numpy=
array([[   1,   45,    0,    0,    0,    0],
       [ 112, 1271,    1,    3, 1747,  158]], dtype=int64)>
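The padding step itself is simple. The helper below is a hypothetical pure-Python sketch, not the layer's actual implementation; it right-pads each token list with zeros up to the longest length in the batch and reproduces the tensor shown above.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every token list to the length of the longest one."""
    max_len = max(len(s) for s in sequences)
    return [s + [pad_id] * (max_len - len(s)) for s in sequences]

batch = pad_batch([[1, 45], [112, 1271, 1, 3, 1747, 158]])
```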

Here we can see the embeddings:


In [7]:
model.layers[1](vectorizer(['Hello, world!','I am glad to meet you!'])).numpy()

array([[[ 1.53059261e-02,  6.80514947e-02,  3.14026810e-02, ...,
         -8.92002955e-02,  1.52911525e-04, -5.65562584e-02],
        [ 2.57456154e-01,  2.79364467e-01, -2.03605562e-01, ...,
         -2.07474351e-01,  8.31158683e-02, -2.03911960e-01],
        [ 3.98201384e-02, -8.03454965e-03,  2.39790026e-02, ...,
         -7.18549127e-04,  2.66963355e-02, -4.30646613e-02],
        [ 3.98201384e-02, -8.03454965e-03,  2.39790026e-02, ...,
         -7.18549127e-04,  2.66963355e-02, -4.30646613e-02],
        [ 3.98201384e-02, -8.03454965e-03,  2.39790026e-02, ...,
         -7.18549127e-04,  2.66963355e-02, -4.30646613e-02],
        [ 3.98201384e-02, -8.03454965e-03,  2.39790026e-02, ...,
         -7.18549127e-04,  2.66963355e-02, -4.30646613e-02]],

       [[ 1.89674050e-01,  2.61548996e-01, -3.67433839e-02, ...,
         -2.07366899e-01, -1.05442435e-01, -2.36952081e-01],
        [ 6.16133213e-02,  1.80511594e-01,  9.77298319e-02, ...,
         -5.46628237e-02, -1.07340455e-01, -1.06589

> **Note**: To minimize the amount of padding, in some cases it makes sense to sort all sequences in the dataset in order of increasing length (or, more precisely, number of tokens). This ensures that each minibatch contains sequences of similar length.
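The effect of sorting can be estimated with a small calculation. The helper `padding_needed` and the token counts below are hypothetical; the sketch counts how many PAD tokens are required when consecutive groups of texts are padded to their longest member.

```python
def padding_needed(lengths, batch_size):
    """Count PAD tokens when each consecutive batch is padded to its longest text."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        chunk = lengths[i:i + batch_size]
        total += sum(max(chunk) - n for n in chunk)
    return total

lengths = [2, 9, 3, 10, 2, 8]   # hypothetical token counts per text
unsorted_pad = padding_needed(lengths, batch_size=2)
sorted_pad = padding_needed(sorted(lengths), batch_size=2)
```

For these made-up lengths, sorting reduces the padding from 20 tokens to 6.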


## Semantic embeddings: Word2Vec

In our previous example, the embedding layer learned to map words to vector representations, but these representations did not carry semantic meaning. It would be nice to learn a vector representation such that similar words or synonyms correspond to vectors that are close to each other in terms of some vector distance (for example, Euclidean distance).

To do that, we need to pretrain our embedding model on a large collection of text using a technique such as [Word2Vec](https://en.wikipedia.org/wiki/Word2vec). It is based on two main architectures that are used to produce a distributed representation of words:

 - **Continuous bag-of-words** (CBoW), where we train the model to predict a word from the surrounding context. Given the n-gram $(W_{-2},W_{-1},W_0,W_1,W_2)$, the goal of the model is to predict $W_0$ from $(W_{-2},W_{-1},W_1,W_2)$.
 - **Continuous skip-gram** is the opposite of CBoW. The model uses the current word $W_0$ to predict the surrounding window of context words $(W_{-2},W_{-1},W_1,W_2)$.

CBoW is faster, while skip-gram is slower but does a better job of representing infrequent words.
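The difference between the two architectures shows up in the training examples they are fed. The function below is a hypothetical sketch that only generates the examples (it does not train anything): CBoW gets one example per position, mapping the whole context to the centre word, while skip-gram gets one pair per context word, mapping the centre word to that context word.

```python
def training_examples(tokens, window=2):
    """Return (cbow, skipgram) training examples for a list of tokens."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        cbow.append((context, target))                  # context -> centre word
        skipgram.extend((target, c) for c in context)   # centre word -> context word
    return cbow, skipgram

cbow, skipgram = training_examples(["the", "quick", "brown", "fox", "jumps"])
```

For the centre word `brown`, CBoW receives `(["the", "quick", "fox", "jumps"], "brown")`, whereas skip-gram receives four separate pairs such as `("brown", "fox")`.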

![Image showing both CBoW and Skip-Gram algorithms to convert words to vectors.](../../../../../translated_images/example-algorithms-for-converting-words-to-vectors.fbe9207a726922f6f0f5de66427e8a6eda63809356114e28fb1fa5f4a83ebda7.pcm.png)

To experiment with the Word2Vec embedding pretrained on the Google News dataset, we can use the **gensim** library. Below we find the words most similar to 'neural'.

> **Note:** When you first create word vectors, downloading them can take some time!


In [8]:
import gensim.downloader as api
w2v = api.load('word2vec-google-news-300')

In [12]:
for w,p in w2v.most_similar('neural'):
    print(f"{w} -> {p}")

neuronal -> 0.7804799675941467
neurons -> 0.7326500415802002
neural_circuits -> 0.7252851724624634
neuron -> 0.7174385190010071
cortical -> 0.6941086649894714
brain_circuitry -> 0.6923246383666992
synaptic -> 0.6699118614196777
neural_circuitry -> 0.6638563275337219
neurochemical -> 0.6555314064025879
neuronal_activity -> 0.6531826257705688


We can also extract the vector embedding from the word, to be used in training the classification model. The embedding has 300 components, but here we only show the first 20 components of the vector for clarity:


In [13]:
w2v['play'][:20]

array([ 0.01226807,  0.06225586,  0.10693359,  0.05810547,  0.23828125,
        0.03686523,  0.05151367, -0.20703125,  0.01989746,  0.10058594,
       -0.03759766, -0.1015625 , -0.15820312, -0.08105469, -0.0390625 ,
       -0.05053711,  0.16015625,  0.2578125 ,  0.10058594, -0.25976562],
      dtype=float32)

The great thing about semantic embeddings is that you can manipulate the vector encoding based on semantics. For example, we can ask to find a word whose vector representation is as close as possible to the words *king* and *woman*, and as far as possible from the word *man*:


In [14]:
w2v.most_similar(positive=['king','woman'],negative=['man'])[0]

('queen', 0.7118192911148071)

The example above uses some internal gensim magic, but the underlying logic is actually quite simple. An interesting thing about embeddings is that you can perform normal vector operations on embedding vectors, and that reflects operations on word **meanings**. The example above can be expressed in terms of vector operations: we calculate the vector corresponding to **KING-MAN+WOMAN** (the operations `+` and `-` are performed on the vector representations of the corresponding words), and then find the closest word in the dictionary to that vector:


In [15]:
# get the vector corresponding to king-man+woman
qvec = w2v['king']-1.7*w2v['man']+1.7*w2v['woman']
# find the index of the closest embedding vector 
d = np.sum((w2v.vectors-qvec)**2,axis=1)
min_idx = np.argmin(d)
# find the corresponding word
w2v.index_to_key[min_idx]

'queen'

> **NOTE**: We had to add small coefficients to the *man* and *woman* vectors; try removing them to see what happens.

To find the closest vector, we use NumPy to compute a vector of squared Euclidean distances between our vector and all vectors in the vocabulary, and then find the index of the word with the smallest distance using `argmin`.
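The same analogy can be sketched with cosine similarity, which is the measure gensim's `most_similar` is documented to rank by. The two-dimensional vectors below are invented so the arithmetic is easy to follow, and excluding the query words from the candidates is an assumption of this sketch.

```python
import numpy as np

# Hypothetical 2-D embeddings: axis 0 ~ "royalty", axis 1 ~ "gender"
toy = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = toy["king"] - toy["man"] + toy["woman"]   # equals [1, -1]

# Rank the remaining words by similarity to the query vector
candidates = [w for w in toy if w not in ("king", "man", "woman")]
best = max(candidates, key=lambda w: cosine(toy[w], query))
```

With these toy vectors the query lands exactly on `queen`, so `best` is `"queen"`. Real embeddings are far noisier.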


While Word2Vec seems like a great way to express word semantics, it has several disadvantages, including the following:

* Both CBoW and skip-gram models are **predictive embeddings**, and they only take local context into account. Word2Vec does not take advantage of global context.
* Word2Vec does not take into account word **morphology**, i.e. the fact that the meaning of a word can depend on different parts of the word, such as the root.

**FastText** tries to overcome the second limitation, and builds on Word2Vec by learning vector representations for each word and the character n-grams found within each word. The values of the representations are then averaged into one vector at each training step. While this adds a lot of additional computation to pretraining, it enables word embeddings to encode sub-word information.
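To give a feel for what "character n-grams" means here, the sketch below extracts them from a word. The function name is hypothetical, and the `<` and `>` boundary markers and the default range of 3 to 6 characters follow the usual description of FastText, so treat them as assumptions rather than a statement about any particular implementation.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of `word`, with < and > marking the word boundaries."""
    marked = f"<{word}>"
    return [marked[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]

trigrams = char_ngrams("where", n_min=3, n_max=3)
shared = set(char_ngrams("playing")) & set(char_ngrams("played"))
```

Here `trigrams` is `['<wh', 'whe', 'her', 'ere', 're>']`, and `shared` contains pieces such as `pla` and `lay`, which is how related word forms end up with related vectors.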

Another method, **GloVe**, uses a different approach to word embeddings, based on the factorization of the word-context matrix. First, it builds a large matrix that counts the number of word occurrences in different contexts, and then it tries to represent this matrix in lower dimensions in a way that minimizes reconstruction loss.
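The two stages can be illustrated on a toy corpus. This is only a sketch of the general idea: the corpus is made up, and a truncated SVD is used as a simple stand-in for the low-dimensional factorization, which is not GloVe's actual training objective.

```python
import numpy as np

corpus = [["i", "like", "deep", "learning"],
          ["i", "like", "nlp"],
          ["i", "enjoy", "flying"]]   # toy corpus

vocab = sorted({w for sent in corpus for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Stage 1: count how often each pair of words occurs within `window` positions
window = 1
X = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                X[index[w], index[sent[j]]] += 1

# Stage 2: rank-k approximation via truncated SVD (stand-in for GloVe's objective)
k = 2
U, S, Vt = np.linalg.svd(X)
word_vectors = U[:, :k] * S[:k]   # one k-dimensional row per word
reconstruction_error = np.linalg.norm(X - word_vectors @ Vt[:k])
```

Each row of `word_vectors` is a two-dimensional embedding, and `reconstruction_error` measures how much of the co-occurrence matrix is lost by the compression.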

The gensim library supports those word embeddings, and you can experiment with them by changing the model loading code above.


## Using pretrained embeddings in Keras

We can modify the example above to prepopulate the matrix in our embedding layer with semantic embeddings, such as Word2Vec. The vocabularies of the pretrained embedding and the text corpus will likely not match, so we need to choose one. Here we explore the two possible options: using the tokenizer vocabulary, and using the vocabulary from Word2Vec embeddings.

### Using tokenizer vocabulary

When using the tokenizer vocabulary, some of the words from the vocabulary will have corresponding Word2Vec embeddings, and some will be missing. Given that our vocabulary size is `vocab_size`, and the Word2Vec embedding vector length is `embed_size`, the embedding layer will be represented by a weight matrix of shape `vocab_size`$\times$`embed_size`. We will populate this matrix by going through the vocabulary:


In [9]:
embed_size = len(w2v.get_vector('hello'))
print(f'Embedding size: {embed_size}')

vocab = vectorizer.get_vocabulary()
W = np.zeros((vocab_size,embed_size))
print('Populating matrix, this will take some time...',end='')
found, not_found = 0,0
for i,w in enumerate(vocab):
    try:
        W[i] = w2v.get_vector(w)
        found+=1
    except KeyError:  # the word is not in the Word2Vec vocabulary
        # W[i] = np.random.normal(0.0,0.3,size=(embed_size,))
        not_found+=1

print(f"Done, found {found} words, {not_found} words missing")

Embedding size: 300
Populating matrix, this will take some time...Done, found 4551 words, 784 words missing


For words that are not present in the Word2Vec vocabulary, we can either leave them as zeroes, or generate a random vector.
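The two options can be compared side by side on a toy example. In this sketch the dictionary `pretrained` is a hypothetical stand-in for the Word2Vec model, the four-word vocabulary is made up, and the standard deviation of 0.3 is taken from the commented-out line in the code above.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_size = 4
pretrained = {"news": np.ones(embed_size)}   # hypothetical stand-in for w2v
vocab = ["", "[UNK]", "news", "zxqv"]        # toy tokenizer vocabulary

W_zero = np.zeros((len(vocab), embed_size))
W_rand = np.zeros((len(vocab), embed_size))
for i, w in enumerate(vocab):
    if w in pretrained:
        W_zero[i] = W_rand[i] = pretrained[w]
    else:
        # Option 1 keeps the zero row; option 2 draws a small random vector
        W_rand[i] = rng.normal(0.0, 0.3, size=embed_size)
```

With zero rows, missing words contribute nothing to the averaged representation. Random rows give them a distinct but meaningless vector, which is mostly useful when the embedding layer is trainable.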

Now we can define an embedding layer with pretrained weights:


In [10]:
emb = keras.layers.Embedding(vocab_size,embed_size,weights=[W],trainable=False)
model = keras.models.Sequential([
    vectorizer, emb,
    keras.layers.Lambda(lambda x: tf.reduce_mean(x,axis=1)),
    keras.layers.Dense(4, activation='softmax')
])

Now let's train our model.


In [11]:
model.compile(loss='sparse_categorical_crossentropy',metrics=['acc'])
model.fit(ds_train.map(tupelize).batch(batch_size),
          validation_data=ds_test.map(tupelize).batch(batch_size))



<keras.callbacks.History at 0x2220226ef10>

> **Note**: Notice that we set `trainable=False` when creating the `Embedding`, which means that we are not retraining the Embedding layer. This may cause accuracy to be slightly lower, but it speeds up training.

### Using embedding vocabulary

One issue with the previous approach is that the vocabularies used in TextVectorization and Embedding are different. To overcome this problem, we can use one of the following solutions:
* Re-train the Word2Vec model on our vocabulary.
* Load our dataset with the vocabulary from the pretrained Word2Vec model. The vocabulary used to load the dataset can be specified during loading.

The latter approach seems easier, so let's implement it. First of all, we will create a `TextVectorization` layer with the specified vocabulary, taken from the Word2Vec embeddings:


In [12]:
# gensim 3.x attribute; under gensim 4.x use list(w2v.index_to_key) instead
vocab = list(w2v.vocab.keys())
vectorizer = keras.layers.experimental.preprocessing.TextVectorization(input_shape=(1,))
vectorizer.set_vocabulary(vocab)

The gensim word embeddings library contains a convenient function, `get_keras_embedding` (part of the gensim 3.x `KeyedVectors` API), which automatically creates the corresponding Keras embedding layer for you.


In [13]:
model = keras.models.Sequential([
    vectorizer, 
    w2v.get_keras_embedding(train_embeddings=False),
    keras.layers.Lambda(lambda x: tf.reduce_mean(x,axis=1)),
    keras.layers.Dense(4, activation='softmax')
])
model.compile(loss='sparse_categorical_crossentropy',metrics=['acc'])
model.fit(ds_train.map(tupelize).batch(128),validation_data=ds_test.map(tupelize).batch(128),epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x2220ccb81c0>

One of the reasons we are not seeing higher accuracy is that some words from our dataset are missing from the pretrained Word2Vec vocabulary, and thus they are essentially ignored. To overcome this, we can train our own embeddings on our dataset.


## Contextual embeddings

One key limitation of traditional pretrained embedding representations such as Word2Vec is that, even though they can capture some meaning of a word, they cannot differentiate between different meanings of the same word. This can cause problems in downstream models.

For example, the word 'play' has different meanings in these two sentences:
- I went to a **play** at the theater.
- John wants to **play** with his friends.

The pretrained embeddings we have talked about represent both meanings of the word 'play' with the same embedding. To overcome this limitation, we need to build embeddings based on a **language model**, which is trained on a large corpus of text and *knows* how words can be put together in different contexts. Discussing contextual embeddings is outside the scope of this tutorial, but we will come back to them when talking about language models in the next unit.


