## Training a CBoW Model

This notebook is part of the [AI for Beginners Curriculum](http://aka.ms/ai-beginners)

In this example, we will look at training a CBoW language model to obtain our own Word2Vec embedding space. We will use the AG News dataset as the source of text.


In [None]:
import torch
import torchtext
import os
import collections
import builtins
import random
import numpy as np

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

First, let's load our dataset and define the tokenizer and vocabulary. We will set `vocab_size` to 5000 to limit the amount of computation a bit.


In [None]:
def load_dataset(ngrams = 1, min_freq = 1, vocab_size = 5000 , lines_cnt = 500):
    tokenizer = torchtext.data.utils.get_tokenizer('basic_english')
    print("Loading dataset...")
    # AG_NEWS returns its splits in (train, test) order by default
    train_dataset, test_dataset = torchtext.datasets.AG_NEWS(root='./data')
    train_dataset = list(train_dataset)
    test_dataset = list(test_dataset)
    classes = ['World', 'Sports', 'Business', 'Sci/Tech']
    print('Building vocab...')
    # Count tokens over roughly the first lines_cnt training lines only
    counter = collections.Counter()
    for i, (_, line) in enumerate(train_dataset):
        counter.update(torchtext.data.utils.ngrams_iterator(tokenizer(line),ngrams=ngrams))
        if i == lines_cnt:
            break
    # Keep only the vocab_size most frequent tokens
    vocab = torchtext.vocab.Vocab(collections.Counter(dict(counter.most_common(vocab_size))), min_freq=min_freq)
    return train_dataset, test_dataset, classes, vocab, tokenizer

In [None]:
train_dataset, test_dataset, _, vocab, tokenizer = load_dataset()

Loading dataset...
Building vocab...


In [None]:
def encode(x, vocabulary, tokenizer = tokenizer):
    return [vocabulary[s] for s in tokenizer(x)]
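
To make the vocabulary and encoding step concrete, here is a minimal pure-Python sketch of the same idea. It does not use torchtext: `build_vocab`, `encode_toy`, the whitespace tokenizer, and the choice of index 0 for unknown words are all assumptions made for illustration.

```python
import collections

def build_vocab(lines, vocab_size):
    """Map the vocab_size most frequent tokens to ids; id 0 is reserved for unknown words."""
    counter = collections.Counter()
    for line in lines:
        # A plain whitespace split stands in for the real tokenizer here
        counter.update(line.lower().split())
    itos = ['<unk>'] + [tok for tok, _ in counter.most_common(vocab_size)]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos

def encode_toy(text, stoi):
    """Turn text into a list of ids, sending out-of-vocabulary tokens to 0."""
    return [stoi.get(tok, 0) for tok in text.lower().split()]

toy_lines = ["I like to train networks", "I like to read", "networks like data"]
stoi, itos = build_vocab(toy_lines, vocab_size=3)
ids = encode_toy("I like to train", stoi)
print(itos)
print(ids)
```

Because only the three most frequent tokens are kept, the rare word `train` falls outside the toy vocabulary and is encoded as 0.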

## CBoW Model

CBoW learns to predict a word from its $2N$ neighboring words. For example, when $N=1$, the sentence *I like to train networks* yields the following pairs: (like, I), (I, like), (to, like), (like, to), (train, to), (to, train), (networks, train), (train, networks). In each pair, the first word is the neighboring word used as input, and the second is the word we are predicting.

To build a network that predicts the target word, we feed in a neighboring word as input and get a word number as output. The CBoW network architecture is as follows:

* The input word is passed through an embedding layer. This embedding layer will be our Word2Vec embedding, so we define it separately as the `embedder` variable. In this example we use an embedding size of 30, though you may want to experiment with higher dimensions (real Word2Vec uses 300).
* The embedding vector is then passed to a linear layer that predicts the output word, so it has `vocab_size` neurons.

For the output, if we use `CrossEntropyLoss` as the loss function, we also provide the expected results simply as word numbers, without one-hot encoding.
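
The point about word numbers versus one-hot encoding can be checked with a small worked example. This is a sketch in plain Python rather than PyTorch: both helper functions and the logits are made up for illustration, and they only mirror the softmax cross-entropy formula.

```python
import math

def log_sum_exp(logits):
    """Numerically stable log(sum(exp(z))) over a list of scores."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))

def cross_entropy_from_index(logits, target_index):
    """Cross-entropy of softmax(logits) against a class given as an integer index."""
    return log_sum_exp(logits) - logits[target_index]

def cross_entropy_from_one_hot(logits, one_hot):
    """The same loss, with the target written out as a one-hot vector."""
    lse = log_sum_exp(logits)
    log_probs = [z - lse for z in logits]
    return -sum(t * lp for t, lp in zip(one_hot, log_probs))

logits = [2.0, 0.5, -1.0, 0.0]          # scores for a 4-word toy vocabulary
loss_index = cross_entropy_from_index(logits, 1)
loss_one_hot = cross_entropy_from_one_hot(logits, [0, 1, 0, 0])
print(loss_index, loss_one_hot)
```

Both calls give the same value, because the one-hot vector only selects a single log-probability. Passing the index directly avoids building a `vocab_size`-long vector for every training pair.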


In [None]:
vocab_size = len(vocab)

embedder = torch.nn.Embedding(num_embeddings = vocab_size, embedding_dim = 30)
model = torch.nn.Sequential(
    embedder,
    torch.nn.Linear(in_features = 30, out_features = vocab_size),
)

print(model)

Sequential(
  (0): Embedding(5002, 30)
  (1): Linear(in_features=30, out_features=5002, bias=True)
)
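
As a rough picture of what the two layers compute, the sketch below does the forward pass by hand on a tiny model and counts the parameters of the full-size one. The toy sizes, the random weights, and the helpers `forward` and `count_parameters` are assumptions for illustration. The count assumes the default bias term in the linear layer, which the printed model confirms.

```python
import random

random.seed(0)
toy_vocab_size, toy_dim = 6, 3

# One embedding row per word, and one weight row per output word
embedding = [[random.gauss(0, 1) for _ in range(toy_dim)] for _ in range(toy_vocab_size)]
weight = [[random.gauss(0, 1) for _ in range(toy_dim)] for _ in range(toy_vocab_size)]
bias = [0.0] * toy_vocab_size

def forward(word_id):
    """Embedding lookup followed by a linear layer; returns one score per vocabulary word."""
    vec = embedding[word_id]  # the lookup is just selecting a row
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]

def count_parameters(vocab_size, dim):
    """Embedding table + linear weights + linear bias."""
    return vocab_size * dim + dim * vocab_size + vocab_size

scores = forward(2)
print(len(scores))
print(count_parameters(5002, 30))
```

The reported vocabulary size is 5002 rather than 5000, presumably because the vocabulary object adds special tokens on top of the 5000 most frequent words.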


## Preparing Training Data

Let's now write the main function that computes CBoW word pairs from text. This function lets us specify the window size, and returns a set of pairs: an input word and an output word. Note that the function works on words as well as on vectors/tensors, which allows us to encode the text before passing it to the `to_cbow` function.


In [None]:
def to_cbow(sent,window_size=2):
    res = []
    for i,x in enumerate(sent):
        for j in range(max(0,i-window_size),min(i+window_size+1,len(sent))):
            if i!=j:
                res.append([sent[j],x])
    return res

print(to_cbow(['I','like','to','train','networks']))
print(to_cbow(encode('I like to train networks', vocab)))

[['like', 'I'], ['to', 'I'], ['I', 'like'], ['to', 'like'], ['train', 'like'], ['I', 'to'], ['like', 'to'], ['train', 'to'], ['networks', 'to'], ['like', 'train'], ['to', 'train'], ['networks', 'train'], ['to', 'networks'], ['train', 'networks']]
[[232, 172], [5, 172], [172, 232], [5, 232], [0, 232], [172, 5], [232, 5], [0, 5], [1202, 5], [232, 0], [5, 0], [1202, 0], [5, 1202], [0, 1202]]
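
As a sanity check, the function can be run with `window_size=1` to compare against the eight $N=1$ pairs listed in the CBoW Model section. The sketch repeats the definition of `to_cbow` from the cell above so that it runs on its own. The variable names `sentence` and `pairs` are only illustrative.

```python
def to_cbow(sent, window_size=2):
    # Same logic as the notebook cell: pair every word with each neighbor in the window
    res = []
    for i, x in enumerate(sent):
        for j in range(max(0, i - window_size), min(i + window_size + 1, len(sent))):
            if i != j:
                res.append([sent[j], x])
    return res

sentence = ['I', 'like', 'to', 'train', 'networks']
pairs = to_cbow(sentence, window_size=1)
for neighbor, target in pairs:
    print(f"({neighbor}, {target})")
```

Each inner word contributes two pairs and each end word contributes one, which gives eight pairs for this five-word sentence.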


Let's prepare the training dataset. We will go through all the news items, call `to_cbow` to get the list of word pairs, and add those pairs to `X` and `Y`. To save time, we will only consider the first 10k news items; you can easily remove this limit if you have more time to wait and want better embeddings :)


In [None]:
X = []
Y = []
for i, x in zip(range(10000), train_dataset):
    for w1, w2 in to_cbow(encode(x[1], vocab), window_size = 5):
        X.append(w1)
        Y.append(w2)

X = torch.tensor(X)
Y = torch.tensor(Y)
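
To get a feel for how large `X` and `Y` become, the number of pairs produced for a single text can be computed from its length and the window size. The helper `cbow_pair_count` below is a hypothetical name, and the 40-token article length is an assumed example rather than a measured value.

```python
def cbow_pair_count(sentence_length, window_size):
    """Number of (neighbor, target) pairs produced for one tokenized text."""
    total = 0
    for i in range(sentence_length):
        left = min(i, window_size)                         # neighbors available on the left
        right = min(sentence_length - 1 - i, window_size)  # neighbors available on the right
        total += left + right
    return total

print(cbow_pair_count(5, 2))    # the five-word example with the default window
print(cbow_pair_count(40, 5))   # an assumed 40-token news item with window_size = 5
```

Under that assumption, a single news item yields a few hundred pairs, so 10k items would give a training set on the order of millions of pairs.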

We will also convert that data into a single dataset. First, we define a simple iterable dataset class:


In [None]:
class SimpleIterableDataset(torch.utils.data.IterableDataset):
    def __init__(self, X, Y):
        super().__init__()
        # Store (label, feature) pairs: the word to predict first, then the input word
        self.data = []
        for i in range(len(X)):
            self.data.append( (Y[i], X[i]) )
        # Shuffle once, at construction time
        random.shuffle(self.data)

    def __iter__(self):
        return iter(self.data)

Now we create the dataset and a dataloader on top of it:


In [None]:
ds = SimpleIterableDataset(X, Y)
dl = torch.utils.data.DataLoader(ds, batch_size = 256)
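
Conceptually, the dataloader walks through the pre-shuffled list and cuts it into consecutive chunks of `batch_size`. The sketch below imitates that in plain Python. The helper `iterate_batches` and the toy data are assumptions, and the real `DataLoader` additionally collates each chunk into tensors.

```python
import random

def iterate_batches(samples, batch_size):
    """Yield consecutive slices of at most batch_size samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

random.seed(0)
# Toy (label, feature) pairs, stored in the same order as in SimpleIterableDataset
toy_pairs = [(y, x) for x, y in zip(range(10), range(100, 110))]
random.shuffle(toy_pairs)  # shuffled once, as in the dataset constructor

batches = list(iterate_batches(toy_pairs, batch_size=4))
print([len(b) for b in batches])
```

The last batch is smaller when the data size is not a multiple of the batch size. Because the shuffle happens only once in the constructor, every epoch sees the pairs in the same order.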

Now let's do the actual training. We will use the `SGD` optimizer with a fairly high learning rate (`lr = 0.1`). You can also try other optimizers, such as `Adam`. We will train for 10 epochs to begin with, and you can re-run this cell if you want an even lower loss.


In [None]:
def train_epoch(net, dataloader, lr = 0.01, optimizer = None, loss_fn = torch.nn.CrossEntropyLoss(), epochs = 1, report_freq = 1):
    optimizer = optimizer or torch.optim.Adam(net.parameters(), lr = lr)
    # Keep every batch on the same device as the model's parameters
    net_device = next(net.parameters()).device
    loss_fn = loss_fn.to(net_device)
    net.train()

    for i in range(epochs):
        total_loss, j = 0.0, 0
        for labels, features in dataloader:
            optimizer.zero_grad()
            features, labels = features.to(net_device), labels.to(net_device)
            out = net(features)
            loss = loss_fn(out, labels)
            loss.backward()
            optimizer.step()
            # Accumulate a plain float instead of a tensor
            total_loss += loss.item()
            j += 1
        if i % report_freq == 0:
            print(f"Epoch: {i+1}: loss={total_loss/j}")

    # Average batch loss of the last epoch
    return total_loss/j

In [None]:
train_epoch(net = model, dataloader = dl, optimizer = torch.optim.SGD(model.parameters(), lr = 0.1), loss_fn = torch.nn.CrossEntropyLoss(), epochs = 10)

Epoch: 1: loss=5.664632366860172
Epoch: 2: loss=5.632101973960962
Epoch: 3: loss=5.610399051405015
Epoch: 4: loss=5.594621561080262
Epoch: 5: loss=5.582538017415446
Epoch: 6: loss=5.572900234519603
Epoch: 7: loss=5.564951676341915
Epoch: 8: loss=5.558288112064614
Epoch: 9: loss=5.552576955031129
Epoch: 10: loss=5.547634165194347


5.547634165194347
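
One hedged way to read these loss values is to compare them with the loss of a model that guesses uniformly over the vocabulary, and to convert the loss into a perplexity. The calculation below assumes the loss is a natural-log cross-entropy, as `CrossEntropyLoss` computes it. The variable names are illustrative.

```python
import math

final_loss = 5.547634165194347   # copied from the last epoch in the output above
vocab_size = 5002                # size reported when the model was printed

perplexity = math.exp(final_loss)       # effective number of equally likely words
uniform_loss = math.log(vocab_size)     # loss of a uniform guess over the vocabulary

print(f"perplexity = {perplexity:.1f}")
print(f"uniform baseline loss = {uniform_loss:.3f}")
```

The trained model sits well below the uniform baseline, yet the loss decreases only slowly from epoch to epoch, which is in line with the suggestion to re-run the training cell for a lower loss.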

## Trying out Word2Vec

To use Word2Vec, let's extract the vectors corresponding to all words in our vocabulary:


In [None]:
vectors = torch.stack([embedder(torch.tensor(vocab[s])) for s in vocab.itos], 0)

Let's see, for example, how the word **Paris** is encoded into a vector:


In [None]:
paris_vec = embedder(torch.tensor(vocab['paris']))
print(paris_vec)

tensor([-0.0915,  2.1224, -0.0281, -0.6819,  1.1219,  0.6458, -1.3704, -1.3314,
        -1.1437,  0.4496,  0.2301, -0.3515, -0.8485,  1.0481,  0.4386, -0.8949,
         0.5644,  1.0939, -2.5096,  3.2949, -0.2601, -0.8640,  0.1421, -0.0804,
        -0.5083, -1.0560,  0.9753, -0.5949, -1.6046,  0.5774],
       grad_fn=<EmbeddingBackward>)


It is interesting to use Word2Vec to look for synonyms. The following function returns the `n` words closest to a given input. To find them, we compute the norm $\|w_i - v\|$, where $v$ is the vector corresponding to our input word, and $w_i$ is the encoding of the $i$-th word in the vocabulary. We then sort the array and get the corresponding indices using `argsort`, and take the first `n` elements of the list, which encode the positions of the closest words in the vocabulary.


In [None]:
def close_words(x, n = 5):
    vec = embedder(torch.tensor(vocab[x]))
    # Indices of the n vocabulary vectors with the smallest Euclidean distance to vec
    closest = np.linalg.norm(vectors.detach().numpy() - vec.detach().numpy(), axis = 1).argsort()[:n]
    return [ vocab.itos[i] for i in closest ]

close_words('microsoft')

['microsoft', 'quoted', 'lp', 'rate', 'top']

In [None]:
close_words('basketball')

['basketball', 'lot', 'sinai', 'states', 'healthdaynews']

In [None]:
close_words('funds')

['funds', 'travel', 'sydney', 'japan', 'business']
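
The nearest-word search can be followed step by step on a handful of hand-made vectors. This is a sketch only: the three-dimensional `toy_vectors` are invented for illustration and are not taken from the trained embedding, and `closest` is a hypothetical helper that mirrors `close_words` without NumPy.

```python
import math

# Invented 3-d vectors; the three city names are deliberately placed close together
toy_vectors = {
    'paris':  [1.0, 0.9, 0.0],
    'london': [0.9, 1.0, 0.1],
    'berlin': [1.1, 0.8, 0.0],
    'banana': [-1.0, 0.0, 2.0],
}

def closest(word, n=2):
    """Return the n words whose vectors have the smallest Euclidean distance to `word`."""
    v = toy_vectors[word]
    words = list(toy_vectors)
    dists = [math.dist(toy_vectors[w], v) for w in words]
    order = sorted(range(len(words)), key=lambda i: dists[i])  # plays the role of argsort
    return [words[i] for i in order[:n]]

print(closest('paris', n=3))
```

As in the outputs above, the query word always comes first, because its distance to itself is zero. The neighbors returned by the trained model look fairly arbitrary, which is plausibly a consequence of the small embedding and short training run.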

## Takeaway

Using clever techniques such as CBoW, it is possible to train a Word2Vec model. You may also try to train a skip-gram model, which is trained to predict a neighboring word given the central one, and see how well it performs.
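
As a possible starting point for the skip-gram experiment, the pair generation can be adapted so that the central word becomes the input and the neighbor becomes the word to predict. The function `to_skipgram` below is a hypothetical counterpart of `to_cbow`, not part of the original notebook.

```python
def to_skipgram(sent, window_size=2):
    """Return [center, neighbor] pairs: the central word is the input, the neighbor the target."""
    res = []
    for i, center in enumerate(sent):
        for j in range(max(0, i - window_size), min(i + window_size + 1, len(sent))):
            if i != j:
                res.append([center, sent[j]])
    return res

skipgram_pairs = to_skipgram(['I', 'like', 'to'], window_size=1)
print(skipgram_pairs)
```

With this pair-based setup, the only change from `to_cbow` is the order of the two words in each pair, so the same network and training loop could be reused on the new pairs.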



