## The CBOW Idea in Detail

### 1. Building the training data
- For each word in the dataset, take `window_size` words to its left and `window_size` words to its right.
- Together these words form one context (`context_length = window_size * 2`).
- The label (`label`) for this context is the original center word.
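
As a minimal sketch of the steps above, the following pure-Python snippet builds (context, label) pairs from one tokenized sentence. The function name `make_cbow_pairs` and the sample sentence are illustrative assumptions. Positions near the sentence edges simply get a shorter context here, whereas the notebook code further down pads them to a fixed length.

```python
def make_cbow_pairs(tokens, window_size):
    """Return a (context, label) pair for every position in `tokens`."""
    pairs = []
    for index, label in enumerate(tokens):
        # Clamp the window to the sentence boundaries.
        start = max(0, index - window_size)
        end = min(len(tokens), index + window_size + 1)
        # Every word in the window except the center word itself.
        context = [tokens[i] for i in range(start, end) if i != index]
        pairs.append((context, label))
    return pairs


tokens = ["the", "sky", "is", "blue", "and", "beautiful"]
pairs = make_cbow_pairs(tokens, window_size=2)
for context, label in pairs:
    print(context, "->", label)
```

With `window_size=2`, the center word `is` gets the full four-word context `['the', 'sky', 'blue', 'and']`, while the first word `the` only has the two neighbours to its right.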

### 2. The neural network model

The model consists of the following layers:

1. **First layer (Embedding Layer)**:
   - Input: `(batch_size, context_length)`, holding the indices of the context words.
   - Uses an **Embedding Matrix** of shape `(vocab_length, embedding_dim)`, where:
     - `vocab_length`: size of the vocabulary.
     - `embedding_dim`: size of each word's embedding vector.
   - Each input index is mapped to its embedding vector, giving an output of shape `(batch_size, context_length, embedding_dim)`.
   - **Where the embedding matrix comes from**:
     - Initially, the matrix can be randomly initialized.
     - During training, it is updated through backpropagation.
     - If pretrained embeddings are available (e.g. Word2Vec, GloVe), their values can be loaded into the matrix and then either frozen or fine-tuned further.

2. **Second layer (Average Layer)**:
   - Averages the embedding vectors along the `context_length` axis, giving a result of shape `(batch_size, embedding_dim)`.

3. **Third layer (Output Layer)**:
   - A dense layer whose output has shape `(batch_size, vocab_length)`, with a softmax to predict the center word (`target_word`).
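
The three layers can be traced with a few lines of NumPy. This is only a shape-checking sketch under assumed sizes (`batch_size = 3`, `embedding_dim = 8`) with random, untrained parameters. The names `output_weights` and `output_bias` are illustrative and not taken from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

batch_size, context_length = 3, 4
vocab_length, embedding_dim = 31, 8  # small illustrative sizes

# Randomly initialised parameters (hypothetical values, not trained).
embedding_matrix = rng.normal(size=(vocab_length, embedding_dim))
output_weights = rng.normal(size=(embedding_dim, vocab_length))
output_bias = np.zeros(vocab_length)

# Input: indices of the context words, shape (batch_size, context_length).
context_ids = rng.integers(1, vocab_length, size=(batch_size, context_length))

# Layer 1: look up one embedding vector per context word.
embedded = embedding_matrix[context_ids]          # (batch, context, dim)
# Layer 2: average over the context axis.
averaged = embedded.mean(axis=1)                  # (batch, dim)
# Layer 3: dense projection to vocabulary size, then softmax.
logits = averaged @ output_weights + output_bias  # (batch, vocab)
shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

print(embedded.shape, averaged.shape, probs.shape)
```

Each row of `probs` sums to 1 and can be read as the predicted distribution over center words for one context.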

### 3. How it works
- The model learns to map a context to its center word by adjusting the embedding matrix, so that words appearing in similar contexts end up with nearby embedding vectors.
- After training, the embedding matrix can be used to represent the vocabulary in other NLP tasks.
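
"Nearby" is usually measured with cosine similarity. The sketch below shows how trained embeddings could be queried for a word's nearest neighbours. The vectors in `toy_embeddings` are made-up values chosen for illustration, not output of the model, and `nearest_words` is a hypothetical helper.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def nearest_words(word, embeddings, top_k=2):
    """Return the `top_k` words most similar to `word`, best first."""
    query = embeddings[word]
    scores = {
        other: cosine_similarity(query, vec)
        for other, vec in embeddings.items()
        if other != word
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Made-up 3-dimensional vectors, for illustration only.
toy_embeddings = {
    "sky": np.array([0.9, 0.1, 0.0]),
    "blue": np.array([0.8, 0.2, 0.1]),
    "ham": np.array([0.0, 0.9, 0.4]),
    "eggs": np.array([0.1, 0.8, 0.5]),
}

print(nearest_words("sky", toy_embeddings, top_k=1))
```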


In [1]:
import json

with open('sarcasm.json') as f:
    data = json.load(f)

In [2]:
corpus = ['The sky is blue and beautiful.',
          'Love this blue and beautiful sky!',
          'The quick brown fox jumps over the lazy dog.',
          "A king's breakfast has sausages, ham, bacon, eggs, toast and beans",
          'I love green eggs, ham, sausages and bacon!',
          'The brown fox is quick and the blue dog is lazy!',
          'The sky is very blue and the sky is very beautiful today',
          'The dog is lazy but the brown fox is quick!'
]

In [3]:
from tensorflow.keras.preprocessing import text
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing import sequence

tokenizer = text.Tokenizer()
tokenizer.fit_on_texts(corpus)
word2id = tokenizer.word_index

AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'

In [4]:
word2id

{'the': 1,
 'is': 2,
 'and': 3,
 'sky': 4,
 'blue': 5,
 'beautiful': 6,
 'quick': 7,
 'brown': 8,
 'fox': 9,
 'lazy': 10,
 'dog': 11,
 'love': 12,
 'sausages': 13,
 'ham': 14,
 'bacon': 15,
 'eggs': 16,
 'very': 17,
 'this': 18,
 'jumps': 19,
 'over': 20,
 'a': 21,
 "king's": 22,
 'breakfast': 23,
 'has': 24,
 'toast': 25,
 'beans': 26,
 'i': 27,
 'green': 28,
 'today': 29,
 'but': 30}

In [5]:
word2id['<PAD>'] = 0
id2word = {v: k for k, v in word2id.items()}

In [6]:
wids = [[word2id[w] for w in text.text_to_word_sequence(c)] for c in corpus]
wids

[[1, 4, 2, 5, 3, 6],
 [12, 18, 5, 3, 6, 4],
 [1, 7, 8, 9, 19, 20, 1, 10, 11],
 [21, 22, 23, 24, 13, 14, 15, 16, 25, 3, 26],
 [27, 12, 28, 16, 14, 13, 3, 15],
 [1, 8, 9, 2, 7, 3, 1, 5, 11, 2, 10],
 [1, 4, 2, 17, 5, 3, 1, 4, 2, 17, 6, 29],
 [1, 11, 2, 10, 30, 1, 8, 9, 2, 7]]

In [7]:
vocab_size = len(word2id)
embed_size = 300
window_size = 2 # context window size

print('Vocabulary Size:', vocab_size)
print('Vocabulary Sample:', list(word2id.items())[:10])

Vocabulary Size: 31
Vocabulary Sample: [('the', 1), ('is', 2), ('and', 3), ('sky', 4), ('blue', 5), ('beautiful', 6), ('quick', 7), ('brown', 8), ('fox', 9), ('lazy', 10)]


In [11]:
def generate_context_word_pairs(corpus, window_size, vocab_size):
    """Yield one (padded context ids, one-hot label) pair per word in the corpus."""
    context_length = window_size * 2
    for words in corpus:
        sentence_length = len(words)
        for index, word in enumerate(words):
            start = index - window_size
            end = index + window_size + 1

            # Neighbours inside the window, skipping the center word and
            # any position that falls outside the sentence.
            context_words = [[
                words[i]
                for i in range(start, end)
                if 0 <= i < sentence_length and i != index
            ]]
            label_word = [word]

            # Short contexts are left-padded with 0 (the <PAD> id).
            x = sequence.pad_sequences(context_words, maxlen=context_length)
            y = to_categorical(label_word, num_classes=vocab_size)

            yield (x, y)
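
Contexts near a sentence boundary contain fewer than `context_length` words, so `pad_sequences` fills the missing positions with 0 (the `<PAD>` id), by default on the left. The sketch below imitates that behaviour in plain Python for illustration only. `left_pad` is a hypothetical helper, not part of Keras.

```python
def left_pad(ids, maxlen, pad_id=0):
    """Imitate the default 'pre' padding and truncation for one sequence."""
    ids = ids[-maxlen:]  # over-long sequences lose items from the front
    return [pad_id] * (maxlen - len(ids)) + ids


sentence = [1, 4, 2, 5, 3, 6]  # ids of 'the sky is blue and beautiful'
window_size = 2

for index, label in enumerate(sentence):
    context = [
        sentence[i]
        for i in range(index - window_size, index + window_size + 1)
        if 0 <= i < len(sentence) and i != index
    ]
    print(left_pad(context, maxlen=window_size * 2), "->", label)
```

Only the two middle positions of this six-word sentence produce a context without padding, which is why a check such as `0 not in x[0]` can be used to display only complete contexts.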


In [9]:
m = generate_context_word_pairs(corpus=wids, window_size=window_size, vocab_size=vocab_size)

In [10]:
m

<generator object generate_context_word_pairs at 0x000001F9ACBFEC40>

In [12]:
import numpy as np
i = 0
for x, y in generate_context_word_pairs(corpus=wids, window_size=window_size, vocab_size=vocab_size):
    if 0 not in x[0]:
        print(x, y)
        print('Context (X):', [id2word[w] for w in x[0]], '-> Target (Y):', id2word[np.argwhere(y[0])[0][0]])

        if i == 10:
            break
        i += 1

[[1 4 5 3]] [[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0.]]
Context (X): ['the', 'sky', 'blue', 'and'] -> Target (Y): is
[[4 2 3 6]] [[0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0.]]
Context (X): ['sky', 'is', 'and', 'beautiful'] -> Target (Y): blue
[[12 18  3  6]] [[0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0.]]
Context (X): ['love', 'this', 'and', 'beautiful'] -> Target (Y): blue
[[18  5  6  4]] [[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0.]]
Context (X): ['this', 'blue', 'beautiful', 'sky'] -> Target (Y): and
[[ 1  7  9 19]] [[0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0.]]
Context (X): ['the', 'quick', 'fox', 'jumps'] -> Target (Y): brown
[[ 7  8 19 20]] [[0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0

In [22]:
import tensorflow.keras.backend as K
from keras.models import Sequential
from keras.layers import Dense, Embedding, Lambda

# build CBOW architecture
cbow = Sequential()
cbow.add(Embedding(input_dim=vocab_size, output_dim=embed_size, input_length=window_size*2))
# Output: one vector per context word, shape (batch_size, 4, 300)
cbow.add(Lambda(lambda x: K.mean(x, axis=1), output_shape=(embed_size,)))
# Output: the averaged context vector, shape (batch_size, 300)
cbow.add(Dense(vocab_size, activation='softmax'))
# Output: one probability per vocabulary word, shape (batch_size, vocab_size)
cbow.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# view model summary (summary() prints the table itself and returns None)
print(cbow.summary())



None


In [16]:
# visualize model structure
from IPython.display import SVG
from keras.utils import model_to_dot

cbow.build()

In [17]:
SVG(model_to_dot(cbow, show_shapes=True, show_layer_names=False,
                 rankdir='TB').create(prog='dot', format='svg'))

ImportError: You must install pydot (`pip install pydot`) for model_to_dot to work.

In [23]:
for epoch in range(1, 6):
    loss = 0.
    i = 0
    for x, y in generate_context_word_pairs(corpus=wids, window_size=window_size, vocab_size=vocab_size):
        i += 1
        loss += cbow.train_on_batch(x, y)
        if i % 100000 == 0:
            print('Processed {} (context, word) pairs'.format(i))

    print('Epoch:', epoch, '\tLoss:', loss)
    print()

Epoch: 1 	Loss: 250.25726

Epoch: 2 	Loss: 246.75287

Epoch: 3 	Loss: 242.3309

Epoch: 4 	Loss: 237.27925

Epoch: 5 	Loss: 231.91743
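
The number printed per epoch is the sum of the values returned by `train_on_batch`, one per (context, word) pair, and the generator yields one pair per token, which is 73 pairs per epoch for this corpus. A quick sanity check, sketched here with the figures copied from the output above, is to divide by the number of pairs and compare with `log(31)`, the cross-entropy of a uniform guess over the 31 classes.

```python
import math

# Token counts of the 8 sentences in `wids`.
sentence_lengths = [6, 6, 9, 11, 8, 11, 12, 10]
pairs_per_epoch = sum(sentence_lengths)

# Summed losses as printed by the training loop.
epoch_losses = [250.25726, 246.75287, 242.3309, 237.27925, 231.91743]

# Cross-entropy of predicting every one of the 31 classes with equal probability.
uniform_loss = math.log(31)

for epoch, total in enumerate(epoch_losses, start=1):
    mean_loss = total / pairs_per_epoch
    print(f"epoch {epoch}: {mean_loss:.3f} per pair (uniform guess: {uniform_loss:.3f})")
```

The first epoch sits almost exactly at the uniform baseline and the fifth is only slightly below it, which suggests that five passes over eight sentences are far too little for the embeddings to carry much meaning.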



In [24]:
import pandas as pd

# The first weight array is the embedding matrix: row i is the vector of word id i.
weights = cbow.get_weights()[0]
weights = weights[1:]  # drop row 0, the <PAD> vector
print(weights.shape)

# Look the labels up by id: id2word lists '<PAD>' last, so slicing its values
# would shift every label by one row.
pd.DataFrame(weights, index=[id2word[i] for i in range(1, vocab_size)]).head()

(30, 300)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,290,291,292,293,294,295,296,297,298,299
the,-0.016159,-0.001919,-0.034625,-0.000501,-0.068037,0.039652,-0.073711,-0.056429,0.110851,-0.042928,...,0.031391,-0.037141,-0.10258,0.022456,0.040771,-0.004076,-0.028039,-0.095149,-0.023611,-0.011764
is,-0.013875,0.041296,0.043127,-0.031687,-0.035261,0.011198,0.001305,0.008456,0.002694,0.041878,...,0.045352,-0.039327,0.016409,-0.035324,0.059275,-0.010692,-0.016966,-0.075731,0.011736,0.035353
and,-0.023439,0.067804,-0.068437,-0.073149,-0.101756,0.029633,-0.005518,0.021309,-0.022604,-0.06391,...,-0.039347,0.08615,-0.100598,0.093492,0.099698,-0.006067,0.043246,-0.053052,-0.050908,0.072212
sky,-0.086822,0.025009,0.066076,-0.099671,0.010679,0.078916,0.004387,0.060186,-0.044301,0.01377,...,-0.073872,-0.057893,0.046852,0.04657,0.050003,0.045964,0.010325,-0.066373,-0.073589,-0.064363
blue,-0.053409,0.054269,0.016521,-0.052443,-0.083826,0.083107,-0.072217,0.016962,0.060972,-0.052102,...,-0.123333,-0.084623,0.054153,0.084007,0.06516,0.029221,0.015634,-0.09687,-0.058569,0.020105
