## Embeddings

In our previous example, we operated on high-dimensional bag-of-words vectors of length `vocab_size`, explicitly converting low-dimensional positional representation vectors into sparse one-hot representations. This one-hot representation is not memory-efficient. In addition, each word is treated independently of the others, so one-hot encoded vectors do not express any semantic similarity between words.

In this unit, we will continue exploring the **AG News** dataset. To begin, let's load the data and get some definitions from the previous notebook.


In [1]:
import torch
import torchtext
import numpy as np
from torchnlp import *
train_dataset, test_dataset, classes, vocab = load_dataset()
vocab_size = len(vocab)
print("Vocab size = ",vocab_size)

Loading dataset...


d:\WORK\ai-for-beginners\5-NLP\14-Embeddings\data\train.csv: 29.5MB [00:01, 18.8MB/s]                            
d:\WORK\ai-for-beginners\5-NLP\14-Embeddings\data\test.csv: 1.86MB [00:00, 11.2MB/s]                          


Building vocab...
Vocab size =  95812


## What is an embedding?

The idea of **embedding** is to represent words by lower-dimensional dense vectors that reflect the semantic meaning of a word. We will discuss later how to build meaningful word embeddings, but for now let's just think of embeddings as a way to lower the dimensionality of a word vector.

So, an embedding layer takes a word as input and produces an output vector of the specified `embedding_size`. In a sense, it is very similar to a `Linear` layer, but instead of taking a one-hot encoded vector, it can take a word number (its index in the vocabulary) as input.
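
To make the comparison with a `Linear` layer concrete, the following minimal sketch checks that looking up rows by index gives the same result as multiplying one-hot vectors by the same weight matrix. The sizes and word indices are arbitrary toy values chosen for illustration, not taken from the dataset.

```python
import torch

torch.manual_seed(0)  # only to make the random weights reproducible

vocab_size, embedding_size = 10, 4  # toy sizes, assumed for illustration
embedding = torch.nn.Embedding(vocab_size, embedding_size)

# Two made-up word indices; any integers in [0, vocab_size) would work
word_ids = torch.tensor([3, 7])

# Path 1: look the rows up directly by index
looked_up = embedding(word_ids)

# Path 2: build one-hot vectors and multiply them by the same weight matrix,
# which is what a bias-free linear layer would compute
one_hot = torch.nn.functional.one_hot(word_ids, num_classes=vocab_size).float()
via_matmul = one_hot @ embedding.weight

print(looked_up.shape)                        # torch.Size([2, 4])
print(torch.allclose(looked_up, via_matmul))  # True
```

The index lookup avoids ever materializing the `vocab_size`-long one-hot vectors, which is where the memory saving comes from.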

By using an embedding layer as the first layer of our network, we can switch from a bag-of-words to an **embedding bag** model. In this model, we first convert each word in our text into the corresponding embedding, and then compute some aggregate function over all those embeddings, such as `sum`, `average` or `max`.

![Image showing an embedding classifier for five sequence words.](../../../../../translated_images/embedding-classifier-example.b77f021a7ee67eeec8e68bfe11636c5b97d6eaa067515a129bfb1d0034b1ac5b.pcm.png)
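
As a small illustration of the aggregation step, the sketch below embeds two toy "sentences" that contain the same words in a different order and aggregates them. The word indices and their meanings are hypothetical. Because the aggregate ignores position, an embedding bag model, like bag-of-words, cannot tell such texts apart.

```python
import torch

torch.manual_seed(0)
embedding = torch.nn.Embedding(5, 3)  # toy vocabulary of 5 words, 3-dim vectors

# Hypothetical word indices: 1 = "dog", 2 = "bites", 3 = "man"
a = torch.tensor([1, 2, 3])  # "dog bites man"
b = torch.tensor([3, 2, 1])  # "man bites dog"

vecs_a = embedding(a)  # shape: (3 words, 3 dimensions)
vecs_b = embedding(b)

# Three possible aggregate functions over the word axis
mean_a, mean_b = vecs_a.mean(dim=0), vecs_b.mean(dim=0)
sum_a = vecs_a.sum(dim=0)
max_a = vecs_a.max(dim=0).values

# Same words in a different order give the same aggregated vector
print(torch.allclose(mean_a, mean_b, atol=1e-6))  # True
```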

Our classifier neural network will start with an embedding layer, followed by an aggregation layer and a linear classifier on top:


In [2]:
class EmbedClassifier(torch.nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super().__init__()
        self.embedding = torch.nn.Embedding(vocab_size, embed_dim)
        self.fc = torch.nn.Linear(embed_dim, num_class)

    def forward(self, x):
        x = self.embedding(x)
        x = torch.mean(x,dim=1)
        return self.fc(x)

### How to Handle Variable Sequence Size

Because of this architecture, minibatches for our network need to be created in a certain way. In the previous unit, when using bag-of-words, all BoW tensors in a minibatch had the same size `vocab_size`, regardless of the actual length of the text sequence. Once we move to word embeddings, each text sample contains a different number of words, so to combine those samples into minibatches we have to apply some padding.

This can be done using the same technique of providing a `collate_fn` function to the data loader:


In [3]:
def padify(b):
    # b is the list of tuples of length batch_size
    #   - first element of a tuple = label, 
    #   - second = feature (text sequence)
    # build vectorized sequence
    v = [encode(x[1]) for x in b]
    # first, compute max length of a sequence in this minibatch
    l = max(map(len,v))
    return ( # tuple of two tensors - labels and features
        torch.LongTensor([t[0]-1 for t in b]),
        torch.stack([torch.nn.functional.pad(torch.tensor(t),(0,l-len(t)),mode='constant',value=0) for t in v])
    )

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, collate_fn=padify, shuffle=True)
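
To see what padding does to a batch, here is a minimal sketch on made-up word indices. It uses `torch.nn.utils.rnn.pad_sequence` rather than the manual `pad` call above; this is a swapped-in alternative for illustration, not what the notebook itself uses.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three already-encoded "texts" of different lengths (made-up word indices)
seqs = [torch.tensor([5, 2, 9]), torch.tensor([7]), torch.tensor([4, 4, 1, 8])]

# Pad every sequence with zeros up to the longest one in the batch
batch = pad_sequence(seqs, batch_first=True, padding_value=0)
print(batch)
# tensor([[5, 2, 9, 0],
#         [7, 0, 0, 0],
#         [4, 4, 1, 8]])
```

Note that `torch.mean(x, dim=1)` in the classifier above averages over the padded positions as well, so the embedding stored at index 0 contributes to the representation of short texts.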

### Training the embedding classifier

Now that we have defined a proper data loader, we can train the model using the training function defined in the previous unit:


In [4]:
net = EmbedClassifier(vocab_size,32,len(classes)).to(device)
train_epoch(net,train_loader, lr=1, epoch_size=25000)

3200: acc=0.6415625
6400: acc=0.6865625
9600: acc=0.7103125
12800: acc=0.726953125
16000: acc=0.739375
19200: acc=0.75046875
22400: acc=0.7572321428571429


(0.889799795315499, 0.7623160588611644)

> **Note**: We are only training on 25k records here (less than one full epoch) for the sake of time, but you can continue training, write a function to train for several epochs, and experiment with the learning rate parameter to achieve higher accuracy. You should be able to reach an accuracy of about 90%.


### EmbeddingBag Layer and Variable-Length Sequence Representation

In the previous architecture, we needed to pad all sequences to the same length in order to fit them into a minibatch. This is not the most efficient way to represent variable-length sequences. Another approach is to use an **offset** vector, which holds the offsets of all sequences stored in one large vector.

![Image showing an offset sequence representation](../../../../../translated_images/offset-sequence-representation.eb73fcefb29b46eecfbe74466077cfeb7c0f93a4f254850538a2efbc63517479.pcm.png)

> **Note**: The picture above shows a sequence of characters, but in our example we are working with sequences of words. The general principle of representing sequences with an offset vector remains the same.

To work with the offset representation, we use the [`EmbeddingBag`](https://pytorch.org/docs/stable/generated/torch.nn.EmbeddingBag.html) layer. It is similar to `Embedding`, but it takes a content vector and an offset vector as input, and it also includes an aggregation step, which can be `mean`, `sum` or `max`.
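
The sketch below shows the two inputs on a toy example and checks that `EmbeddingBag` in `mean` mode returns the per-sequence average of the underlying embedding rows. The sizes and indices are arbitrary values assumed for illustration.

```python
import torch

torch.manual_seed(0)
bag = torch.nn.EmbeddingBag(10, 4, mode='mean')  # toy sizes

# Two sequences, [1, 2, 3] and [4, 5], stored back to back in one flat vector
content = torch.tensor([1, 2, 3, 4, 5])
offsets = torch.tensor([0, 3])  # start position of each sequence in `content`

out = bag(content, offsets)  # one aggregated vector per sequence
print(out.shape)             # torch.Size([2, 4])

# The same result computed by hand from the layer's weight matrix
manual = torch.stack([
    bag.weight[[1, 2, 3]].mean(dim=0),
    bag.weight[[4, 5]].mean(dim=0),
])
print(torch.allclose(out, manual, atol=1e-6))  # True
```

No padding is needed here: the flat content vector holds exactly as many indices as there are words in the batch.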

Here is the modified network that uses `EmbeddingBag`:


In [5]:
class EmbedClassifier(torch.nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super().__init__()
        self.embedding = torch.nn.EmbeddingBag(vocab_size, embed_dim)
        self.fc = torch.nn.Linear(embed_dim, num_class)

    def forward(self, text, off):
        x = self.embedding(text, off)
        return self.fc(x)

To prepare the dataset for training, we need to provide a conversion function that will prepare the offset vector:


In [6]:
def offsetify(b):
    # first, compute data tensor from all sequences
    x = [torch.tensor(encode(t[1])) for t in b]
    # now, compute the offsets by accumulating the tensor of sequence lengths
    o = [0] + [len(t) for t in x]
    o = torch.tensor(o[:-1]).cumsum(dim=0)
    return ( 
        torch.LongTensor([t[0]-1 for t in b]), # labels
        torch.cat(x), # text 
        o
    )

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, collate_fn=offsetify, shuffle=True)
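
The offset computation in `offsetify` can be traced by hand on a tiny example. This plain-Python sketch uses made-up sequence lengths and token values, and shows that the offsets together with the flat vector are enough to recover the original sequences.

```python
from itertools import accumulate

# Lengths of three hypothetical encoded texts in one minibatch
lengths = [3, 2, 4]

# Each sequence starts where the previous ones end:
# prepend 0, drop the last length, then take the running sum
offsets = list(accumulate([0] + lengths[:-1]))
print(offsets)  # [0, 3, 5]

# The flat "text" vector is just the sequences concatenated (made-up values)
flat = [11, 12, 13, 21, 22, 31, 32, 33, 34]

# Slicing between consecutive offsets recovers the original sequences
bounds = offsets + [len(flat)]
recovered = [flat[start:end] for start, end in zip(bounds, bounds[1:])]
print(recovered)  # [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
```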

Note that, unlike in all previous examples, our network now accepts two parameters: a data vector and an offset vector, which are of different sizes. Similarly, our data loader now provides 3 values instead of 2: both the text and offset vectors are provided as features. Therefore, we need to slightly adjust our training function to take care of that:


In [7]:
net = EmbedClassifier(vocab_size,32,len(classes)).to(device)

def train_epoch_emb(net,dataloader,lr=0.01,optimizer=None,loss_fn = torch.nn.CrossEntropyLoss(),epoch_size=None, report_freq=200):
    optimizer = optimizer or torch.optim.Adam(net.parameters(),lr=lr)
    loss_fn = loss_fn.to(device)
    net.train()
    total_loss,acc,count,i = 0,0,0,0
    for labels,text,off in dataloader:
        optimizer.zero_grad()
        labels,text,off = labels.to(device), text.to(device), off.to(device)
        out = net(text, off)
        loss = loss_fn(out,labels) #cross_entropy(out,labels)
        loss.backward()
        optimizer.step()
        total_loss+=loss
        _,predicted = torch.max(out,1)
        acc+=(predicted==labels).sum()
        count+=len(labels)
        i+=1
        if i%report_freq==0:
            print(f"{count}: acc={acc.item()/count}")
        if epoch_size and count>epoch_size:
            break
    return total_loss.item()/count, acc.item()/count


train_epoch_emb(net,train_loader, lr=4, epoch_size=25000)

3200: acc=0.6153125
6400: acc=0.6615625
9600: acc=0.6932291666666667
12800: acc=0.715078125
16000: acc=0.7270625
19200: acc=0.7382291666666667
22400: acc=0.7486160714285715


(22.771553103007037, 0.7551983365323096)

## Semantic Embeddings: Word2Vec

In the previous example, the model's embedding layer learned to map words to vector representations, but those representations did not carry much semantic meaning. It would be nice to learn a vector representation in which similar words or synonyms correspond to vectors that are close to each other in terms of some vector distance (e.g. Euclidean distance).

To do that, we need to pre-train our embedding model on a large collection of text in a specific way. One of the first ways to train semantic embeddings is called [Word2Vec](https://en.wikipedia.org/wiki/Word2vec). It is based on two main architectures that are used to produce a distributed representation of words:

- **Continuous bag-of-words** (CBoW): in this architecture, we train the model to predict a word from the surrounding context. Given the n-gram $(W_{-2},W_{-1},W_0,W_1,W_2)$, the goal of the model is to predict $W_0$ from $(W_{-2},W_{-1},W_1,W_2)$.
- **Continuous skip-gram** is the opposite of CBoW: the model uses the current word $W_0$ to predict the surrounding window of context words $(W_{-2},W_{-1},W_1,W_2)$.

CBoW is faster, while skip-gram is slower but does a better job of representing infrequent words.
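
The difference between the two architectures is easiest to see in the training examples they produce. The following plain-Python sketch generates them for a toy sentence; the helper name `context_windows` is hypothetical, and real Word2Vec training adds details (such as subsampling and negative sampling) that are left out here. In Word2Vec, CBoW predicts the center word from its context, while skip-gram predicts each context word from the center word.

```python
def context_windows(tokens, window=2):
    """Yield (center, context_words) for every position in the token list."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

tokens = "the quick brown fox jumps".split()

# CBoW: the context words are the input, the center word is the target
cbow_examples = [(context, center) for center, context in context_windows(tokens)]

# Skip-gram: the center word is the input, each context word is a target
skipgram_examples = [
    (center, c) for center, context in context_windows(tokens) for c in context
]

print(cbow_examples[2])       # (['the', 'quick', 'fox', 'jumps'], 'brown')
print(skipgram_examples[:2])  # [('the', 'quick'), ('the', 'brown')]
```

Skip-gram produces one training pair per context word, so it yields more examples per sentence than CBoW, which is consistent with it being the slower of the two.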

![Image showing both CBoW and Skip-Gram algorithms to convert words to vectors.](../../../../../translated_images/example-algorithms-for-converting-words-to-vectors.fbe9207a726922f6f0f5de66427e8a6eda63809356114e28fb1fa5f4a83ebda7.pcm.png)

To experiment with a word2vec embedding pre-trained on the Google News dataset, we can use the **gensim** library. Below, we find the words most similar to 'neural':

> **Note:** When you first create word vectors, downloading them can take some time!


In [8]:
import gensim.downloader as api
w2v = api.load('word2vec-google-news-300')

In [9]:
for w,p in w2v.most_similar('neural'):
    print(f"{w} -> {p}")

neuronal -> 0.7804799675941467
neurons -> 0.7326500415802002
neural_circuits -> 0.7252851724624634
neuron -> 0.7174385190010071
cortical -> 0.6941086649894714
brain_circuitry -> 0.6923246383666992
synaptic -> 0.6699118614196777
neural_circuitry -> 0.6638563275337219
neurochemical -> 0.6555314064025879
neuronal_activity -> 0.6531826257705688


We can also extract the vector embedding of a word, to be used in training a classification model (we only show the first 20 components of the vector for clarity):


In [10]:
w2v.get_vector('play')[:20]

array([ 0.01226807,  0.06225586,  0.10693359,  0.05810547,  0.23828125,
        0.03686523,  0.05151367, -0.20703125,  0.01989746,  0.10058594,
       -0.03759766, -0.1015625 , -0.15820312, -0.08105469, -0.0390625 ,
       -0.05053711,  0.16015625,  0.2578125 ,  0.10058594, -0.25976562],
      dtype=float32)

The great thing about semantic embeddings is that you can manipulate the vector encoding to change the semantics. For example, we can ask to find a word whose vector representation is as close as possible to the words *king* and *woman*, and as far away as possible from the word *man*:


In [10]:
w2v.most_similar(positive=['king','woman'],negative=['man'])[0]

('queen', 0.7118192911148071)
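
To show the arithmetic behind such a query, here is a simplified plain-Python sketch on hand-made 2-D vectors. The vectors and the `analogy` helper are purely illustrative assumptions; this is not gensim's actual implementation, which works on the real 300-dimensional embeddings.

```python
import math

# Hand-made 2-D vectors: (royalty, gender). Illustrative only, not real embeddings.
vectors = {
    "king":   [0.9,  0.8],
    "queen":  [0.9, -0.8],
    "man":    [0.1,  0.9],
    "woman":  [0.1, -0.9],
    "prince": [0.8,  0.7],
}

def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(positive, negative):
    """Return the word closest (by cosine) to sum(positive) - sum(negative)."""
    query = [0.0, 0.0]
    for w in positive:
        query = [q + x for q, x in zip(query, vectors[w])]
    for w in negative:
        query = [q - x for q, x in zip(query, vectors[w])]
    # The query words themselves are excluded from the candidates
    candidates = [w for w in vectors if w not in positive + negative]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

print(analogy(["king", "woman"], ["man"]))  # queen
```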

Both CBoW and Skip-Gram are "predictive" embeddings, in that they only take local context into account. Word2Vec does not take advantage of global context.

**FastText** builds on Word2Vec by learning vector representations for each word and for the character n-grams found within each word. The values of these representations are then averaged into one vector at each training step. While this adds a lot of additional computation to pre-training, it enables word embeddings to encode sub-word information.
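
As a rough illustration of what "character n-grams" means here, the sketch below splits a word into trigrams after wrapping it in the boundary markers `<` and `>`. The helper `char_ngrams` is hypothetical and only covers the splitting step, not how FastText stores or trains the vectors.

```python
def char_ngrams(word, n_min=3, n_max=3):
    """Character n-grams of a word wrapped in boundary markers '<' and '>'."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    ]

print(char_ngrams("where"))  # ['<wh', 'whe', 'her', 'ere', 're>']

# Related words share n-grams, so they can share sub-word information
shared = sorted(set(char_ngrams("where")) & set(char_ngrams("nowhere")))
print(shared)  # ['ere', 'her', 're>', 'whe']
```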

Another method, **GloVe**, leverages the idea of a co-occurrence matrix and uses neural methods to decompose the co-occurrence matrix into more expressive and non-linear word vectors.

You can experiment with the example by switching the embeddings to FastText and GloVe, since gensim supports several different word embedding models.


## Using Pre-Trained Embeddings in PyTorch

We can modify the example above to pre-populate the matrix in our embedding layer with semantic embeddings, such as Word2Vec. We need to take into account that the vocabularies of the pre-trained embedding and of our text corpus will likely not match, so we will initialize the weights for the missing words with random values:


In [11]:
embed_size = len(w2v.get_vector('hello'))
print(f'Embedding size: {embed_size}')

net = EmbedClassifier(vocab_size,embed_size,len(classes))

print('Populating matrix, this will take some time...',end='')
found, not_found = 0,0
for i,w in enumerate(vocab.get_itos()):
    try:
        net.embedding.weight[i].data = torch.tensor(w2v.get_vector(w))
        found+=1
    except KeyError:  # the word is not in the Word2Vec vocabulary
        net.embedding.weight[i].data = torch.normal(0.0,1.0,(embed_size,))
        not_found+=1

print(f"Done, found {found} words, {not_found} words missing")
net = net.to(device)

Embedding size: 300
Populating matrix, this will take some time...Done, found 41080 words, 54732 words missing
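
The same idea can be sketched on a toy scale by first building the whole weight matrix and then handing it to `EmbeddingBag.from_pretrained`. This is an alternative to assigning rows one by one, shown under stated assumptions: `corpus_vocab` and `pretrained` are hypothetical stand-ins for the real vocabulary and the Word2Vec vectors.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-ins: a tiny corpus vocabulary and a tiny
# "pre-trained" word -> vector table
corpus_vocab = ["hello", "world", "zzyzx"]
pretrained = {"hello": [1.0, 0.0], "world": [0.0, 1.0]}
embed_size = 2

weights = torch.empty(len(corpus_vocab), embed_size)
found = 0
for i, word in enumerate(corpus_vocab):
    if word in pretrained:
        weights[i] = torch.tensor(pretrained[word])
        found += 1
    else:
        # Random fallback for words missing from the pre-trained table
        weights[i] = torch.normal(0.0, 1.0, (embed_size,))

# freeze=False keeps the vectors trainable, like a normally created layer
layer = torch.nn.EmbeddingBag.from_pretrained(weights, freeze=False, mode='mean')
print(found, len(corpus_vocab) - found)  # 2 1
```

With `freeze=True` (the default of `from_pretrained`) the vectors would stay fixed during training, which is a design choice worth experimenting with when the pre-trained table covers most of the vocabulary.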


Now let's train our model. Note that training takes significantly longer than in the previous example because of the larger embedding layer size and, thus, the much higher number of parameters. For the same reason, we may need to train the model on more examples if we want to avoid overfitting.


In [12]:
train_epoch_emb(net,train_loader, lr=4, epoch_size=25000)

3200: acc=0.6359375
6400: acc=0.68109375
9600: acc=0.7067708333333333
12800: acc=0.723671875
16000: acc=0.73625
19200: acc=0.7463541666666667
22400: acc=0.7560714285714286


(214.1013875559821, 0.7626759436980166)

In our case, we do not see a big increase in accuracy, which is likely due to the vocabularies being quite different. To overcome the problem of different vocabularies, we can use one of the following solutions:
* Re-train the word2vec model on our vocabulary
* Load our dataset with the vocabulary from the pre-trained word2vec model. The vocabulary used to load the dataset can be specified during loading.

The latter approach seems easier, especially because the PyTorch `torchtext` framework contains built-in support for embeddings. We can, for example, instantiate a GloVe-based vocabulary in the following manner:


In [14]:
vocab = torchtext.vocab.GloVe(name='6B', dim=50)

100%|█████████▉| 399999/400000 [00:15<00:00, 25411.14it/s]


The loaded vocabulary supports the following basic operations:
* `vocab.stoi` is a dictionary that converts a word into its dictionary index
* `vocab.itos` does the opposite: it converts a number into a word
* `vocab.vectors` is the array of embedding vectors, so to get the embedding of a word `s` we use `vocab.vectors[vocab.stoi[s]]`

Here is an example of manipulating embeddings to demonstrate the equation **king-man+woman = queen** (the coefficient had to be tweaked a bit to make it work):


In [15]:
# get the vector corresponding to king-man+woman
qvec = vocab.vectors[vocab.stoi['king']]-vocab.vectors[vocab.stoi['man']]+1.3*vocab.vectors[vocab.stoi['woman']]
# find the index of the closest embedding vector 
d = torch.sum((vocab.vectors-qvec)**2,dim=1)
min_idx = torch.argmin(d)
# find the corresponding word
vocab.itos[min_idx]

'queen'

To train the classifier using those embeddings, we first need to encode our dataset using the GloVe vocabulary:


In [16]:
def offsetify(b):
    # first, compute data tensor from all sequences
    x = [torch.tensor(encode(t[1],voc=vocab)) for t in b] # pass the instance of vocab to encode function!
    # now, compute the offsets by accumulating the tensor of sequence lengths
    o = [0] + [len(t) for t in x]
    o = torch.tensor(o[:-1]).cumsum(dim=0)
    return ( 
        torch.LongTensor([t[0]-1 for t in b]), # labels
        torch.cat(x), # text 
        o
    )

As we have seen above, all vector embeddings are stored in the `vocab.vectors` matrix. This makes it very easy to load those weights into the weights of the embedding layer using simple copying:


In [17]:
net = EmbedClassifier(len(vocab),len(vocab.vectors[0]),len(classes))
net.embedding.weight.data = vocab.vectors
net = net.to(device)

Now let's train our model and see if we get better results:


In [18]:
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, collate_fn=offsetify, shuffle=True)
train_epoch_emb(net,train_loader, lr=4, epoch_size=25000)

3200: acc=0.6271875
6400: acc=0.68078125
9600: acc=0.7030208333333333
12800: acc=0.71984375
16000: acc=0.7346875
19200: acc=0.7455729166666667
22400: acc=0.7529464285714286


(35.53972978646833, 0.7575175943698017)

One of the reasons we are not seeing a significant increase in accuracy is that some words from our dataset are missing from the pre-trained GloVe vocabulary, and thus they are essentially ignored. To overcome this, we can train our own embeddings on our dataset.
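
To illustrate what happens to words that are missing from a pre-trained vocabulary, here is a plain-Python sketch with a made-up vocabulary. The `encode_with_vocab` helper and the `<unk>` entry are assumptions for illustration; the `encode` function used in this notebook may handle unknown words differently.

```python
# Hypothetical pre-trained vocabulary (word -> index); index 0 is reserved for unknowns
stoi = {"<unk>": 0, "the": 1, "market": 2, "rose": 3}

def encode_with_vocab(tokens, stoi, drop_unknown=False):
    """Map tokens to indices, either dropping unknown words or sending them to <unk>."""
    if drop_unknown:
        return [stoi[t] for t in tokens if t in stoi]
    return [stoi.get(t, stoi["<unk>"]) for t in tokens]

tokens = "the nasdaq market rose".split()
print(encode_with_vocab(tokens, stoi))                     # [1, 0, 2, 3]
print(encode_with_vocab(tokens, stoi, drop_unknown=True))  # [1, 2, 3]

# Fraction of tokens covered by the vocabulary
coverage = sum(t in stoi for t in tokens) / len(tokens)
print(coverage)  # 0.75
```

Either way, the information carried by the missing word is lost to the classifier, which is why low vocabulary coverage limits the benefit of pre-trained embeddings.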


## Contextual Embeddings

One key limitation of traditional pretrained embedding representations such as Word2Vec is the problem of word sense disambiguation. While pretrained embeddings can capture some of the meaning of words in context, every possible meaning of a word is encoded into the same embedding. This can cause problems in downstream models, since many words, such as the word 'play', have different meanings depending on the context they are used in.

For example, the word 'play' has quite different meanings in these two sentences:
- I went to a **play** at the theatre.
- John wants to **play** with his friends.

The pretrained embeddings above represent both meanings of the word 'play' in the same embedding. To overcome this limitation, we need to build embeddings based on a **language model**, which is trained on a large corpus of text and *knows* how words can be put together in different contexts. Discussing contextual embeddings is out of scope for this tutorial, but we will come back to them when talking about language models in the next unit.


