First, let's look at how the bAbI dataset is structured and think about how to convert it into a form suitable for training.

In [1]:
with open('./dataset/qa1_single-supporting-fact_train.txt') as f:
    lines = f.readlines()
print(lines[:20])

['1 Mary moved to the bathroom.\n', '2 John went to the hallway.\n', '3 Where is Mary? \tbathroom\t1\n', '4 Daniel went back to the hallway.\n', '5 Sandra moved to the garden.\n', '6 Where is Daniel? \thallway\t4\n', '7 John moved to the office.\n', '8 Sandra journeyed to the bathroom.\n', '9 Where is Daniel? \thallway\t4\n', '10 Mary moved to the hallway.\n', '11 Daniel travelled to the office.\n', '12 Where is Daniel? \toffice\t11\n', '13 John went back to the garden.\n', '14 John moved to the bedroom.\n', '15 Where is Sandra? \tbathroom\t8\n', '1 Sandra travelled to the office.\n', '2 Sandra went to the bathroom.\n', '3 Where is Sandra? \tbathroom\t2\n', '4 Mary went to the bedroom.\n', '5 Daniel moved to the hallway.\n']


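Interspersed among the Story sentences are QA lines whose fields are separated by tabs. Each QA line contains the Question (1), the Answer (2), and the Supporting fact id (3), i.e. the line number of the sentence that justifies the answer. The data must be preprocessed so that Story, Question, and Answer can be fed to the model. Using the line numbers above as an example: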

S:[1,2],Q:[3(1)],A:[3(2)]

S:[1,2,4,5],Q:[6(1)],A:[6(2)]

S:[1,2,4,5,7,8],Q:[9(1)],A:[9(2)]  
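
As a minimal sketch of how one QA line breaks down into those three tab-separated fields, the snippet below uses a hypothetical helper `parse_qa_line` (not part of the notebook) and assumes the line format shown in the output above.

```python
def parse_qa_line(line):
    """Split one bAbI QA line into (line id, question, answer, supporting id)."""
    # The line id is separated from the rest by the first space.
    nid, rest = line.rstrip("\n").split(" ", 1)
    # Question, answer and supporting fact id are separated by tabs.
    question, answer, support = rest.split("\t")
    return int(nid), question.strip(), answer, int(support)

parsed = parse_qa_line("3 Where is Mary? \tbathroom\t1\n")
print(parsed)  # (3, 'Where is Mary?', 'bathroom', 1)
```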


In the actual implementation, we fix the maximum number of sentences a Story can hold (the memory length) and zero-pad the unused slots. If a story is longer than the memory length, its front (oldest) part is cut off.

memory_length=5

S:[1,2,0,0,0],Q:[3(1)],A:[3(2)]

S:[1,2,4,5,0],Q:[6(1)],A:[6(2)]

S:[2,4,5,7,8],Q:[9(1)],A:[9(2)]  
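
The padding and truncation rule can be sketched on the sentence numbers alone. `fit_to_memory` is a hypothetical helper introduced only for illustration, and it assumes `memory_length` is at least 1.

```python
def fit_to_memory(sentence_ids, memory_length, pad=0):
    """Keep the most recent `memory_length` sentences and zero-pad the rest."""
    kept = sentence_ids[-memory_length:]  # drop the oldest sentences if too long
    return kept + [pad] * (memory_length - len(kept))

print(fit_to_memory([1, 2], 5))              # [1, 2, 0, 0, 0]
print(fit_to_memory([1, 2, 4, 5], 5))        # [1, 2, 4, 5, 0]
print(fit_to_memory([1, 2, 4, 5, 7, 8], 5))  # [2, 4, 5, 7, 8]
```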


Each sentence is made up of words, and sentences are likewise zero-padded up to a fixed maximum length. As a result, the dimension of Story will be

batch_size X memory_length X sentence_length X word_embedding_dimension.
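
To make the four dimensions concrete, here is a small NumPy sketch. The sizes and the random embedding table are assumptions chosen for illustration; the notebook itself uses Keras `Embedding` layers for the lookup.

```python
import numpy as np

batch_size, memory_length, sentence_length = 2, 5, 6
vocab_size, word_embedding_dimension = 19, 12

# Word indices; 0 is reserved for padding.
story_idx = np.zeros((batch_size, memory_length, sentence_length), dtype=np.int64)
story_idx[0, 0, :5] = [1, 2, 3, 4, 5]  # one 5-word sentence, padded to length 6

# A random embedding table with one row per index (row 0 belongs to padding).
rng = np.random.default_rng(0)
table = rng.normal(size=(vocab_size + 1, word_embedding_dimension))

story_emb = table[story_idx]  # integer indexing acts as an embedding lookup
print(story_emb.shape)        # (2, 5, 6, 12)
```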


Let's first implement a function that tokenizes a sentence, and then a function that builds the dataset.

In [2]:
import re

def tokenize(sentence):
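    # Split on runs of non-word characters, keeping them (the capture group) so punctuation survives as tokens.
    return [w.strip() for w in re.split(r"(\W+)", sentence) if w.strip()]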

print(tokenize("Mary moved to the bathroom.\n"))
print(tokenize("Daniel travelled to the office.\n"))

['Mary', 'moved', 'to', 'the', 'bathroom', '.']
['Daniel', 'travelled', 'to', 'the', 'office', '.']




In [3]:
def split_SQA(lines):
    """
    input: 
        lines: list. [num_lines]
    return:
        data: list. [num_story*[story,question,answer,supporting id]]
        story_len: list. length of each story (number of sentences in a story)
        sentence_len: list. length of each sentence (number of words in a sentence)
    """
    data = []
    story_len = []
    sentence_len = []
    story = None
    num_questions = None
    for line in lines:
        # keep the original casing (names such as 'Mary' stay capitalized)
        nid, line = line.split(' ',1)
        nid = int(nid)
    
        if nid == 1:
            story = [] # init story
            num_questions = [0] #init num_questions
            question_count = 0
            
        if '\t' not in line: #normal story sentence if '\t' is not in line
            line = tokenize(line)
            line = line[:-1] if line[-1] == '.' else line
            story.append(line)
            sentence_len.append(len(line))
    
        else : #QA sentence if '\t' is in the line
            q, a, sid = line.split('\t')
            q = tokenize(q)
            q = q[:-1] if q[-1] == '?' else q
            sid = int(sid) - num_questions[int(sid)]
            data.append([story[:], q, a, sid -1])
            story_len.append(len(story))
            question_count += 1
            
        num_questions.append(question_count) #need to match sentence index without question index
            
    return data, story_len, sentence_len       

In [4]:
train_data,story_len, sentence_len = split_SQA(lines)
print(train_data[0])
print('\n')
print(train_data[1])
print("\nThe longest story length:{0}\nThe longest sentence length:{1}".format(max(story_len), max(sentence_len)))

[[['Mary', 'moved', 'to', 'the', 'bathroom'], ['John', 'went', 'to', 'the', 'hallway']], ['Where', 'is', 'Mary'], 'bathroom', 0]


[[['Mary', 'moved', 'to', 'the', 'bathroom'], ['John', 'went', 'to', 'the', 'hallway'], ['Daniel', 'went', 'back', 'to', 'the', 'hallway'], ['Sandra', 'moved', 'to', 'the', 'garden']], ['Where', 'is', 'Daniel'], 'hallway', 2]

The longest story length:10
The longest sentence length:6


The longest story has 10 sentences and the longest sentence has 6 words, so the dataset consists of fairly simple sentences.

In [5]:
def make_dictionary(lines):
    word2idx = {}
    idx = 1
    for line in lines:
        # keep the original casing (names such as 'Mary' stay capitalized)
        _, line = line.split(' ',1)
        if '\t' in line:
            line = line.split('\t')[0]
        line = tokenize(line)
        # drop the trailing punctuation token
        line = line[:-1] if line[-1] in ('?', '.') else line
        for w in line:
            if w not in word2idx:
                word2idx[w] = idx
                idx += 1
    return word2idx

In [6]:
dic = make_dictionary(lines)
print(dic)

{'Mary': 1, 'moved': 2, 'to': 3, 'the': 4, 'bathroom': 5, 'John': 6, 'went': 7, 'hallway': 8, 'Where': 9, 'is': 10, 'Daniel': 11, 'back': 12, 'Sandra': 13, 'garden': 14, 'office': 15, 'journeyed': 16, 'travelled': 17, 'bedroom': 18, 'kitchen': 19}


The vocabulary is also very small. Now let's convert the sentences from words to indices and store them. Sentence zero-padding and story zero-padding are done in the same step.

In [7]:
import numpy as np

def data_preprocess(data, sentence_len, memory_len, dic):
    """
    input: 
        data: list. [num_story*[story,question,answer,supporting id]]
        sentence_len: int. maximum sentence_len. 
        memory_len: int. maximum story len
        dic: dictionary. 
    return:
        S(tory): np.array. [num_story, memory_len, sentence_len].  
        Q(uestion): np.array. [num_story, 1, sentence_len]. 
        A(nswer): np.array. [num_story, len(dic) + 1]. one-hot answer vector
        Support: np.array. [num_story, ]
    """
    
    S = [];Q = []; A = []; Support = []
    for story, question, answer, support in data:
        #drop the front part of the story when it exceeds the memory length
        start = max(len(story) - memory_len,0)
        story = story[start:]
        #(1)convert words to idx and (2)zero-pad to match the sentence length
        story_idx = []
        for sentence in story:
            story_idx.append([dic[w] for w in sentence] + [0]*(sentence_len - len(sentence)))
        #zero-pad to match the memory length
        for _ in range(memory_len - len(story_idx)):
            story_idx.append([0]*sentence_len)
        
        question_idx = [[dic[w] for w in question] + [0]*(sentence_len - len(question))]
        
        answer_idx = [0] * (len(dic) + 1)
        answer_idx[dic[answer]] = 1
        
        S.append(story_idx); Q.append(question_idx); A.append(answer_idx); Support.append(support)
    return np.array(S),np.array(Q),np.array(A),np.array(Support)


Let's walk through the internal implementation of the memory network using a batch of size 2 as an example.

In [8]:
ss_len = max(sentence_len)
mem_len = max(story_len)
S,Q,A,Support = data_preprocess(train_data[:2], ss_len, mem_len, dic)
print(S)

[[[ 1  2  3  4  5  0]
  [ 6  7  3  4  8  0]
  [ 0  0  0  0  0  0]
  [ 0  0  0  0  0  0]
  [ 0  0  0  0  0  0]
  [ 0  0  0  0  0  0]
  [ 0  0  0  0  0  0]
  [ 0  0  0  0  0  0]
  [ 0  0  0  0  0  0]
  [ 0  0  0  0  0  0]]

 [[ 1  2  3  4  5  0]
  [ 6  7  3  4  8  0]
  [11  7 12  3  4  8]
  [13  2  3  4 14  0]
  [ 0  0  0  0  0  0]
  [ 0  0  0  0  0  0]
  [ 0  0  0  0  0  0]
  [ 0  0  0  0  0  0]
  [ 0  0  0  0  0  0]
  [ 0  0  0  0  0  0]]]


In [9]:
print(S.shape,Q.shape,A.shape, Support.shape)

(2, 10, 6) (2, 1, 6) (2, 20) (2,)
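
As a quick sanity check of the index conversion, a padded row can be mapped back to words by inverting the dictionary. The sketch below uses a hand-copied excerpt of the dictionary printed earlier; `word2idx`, `idx2word` and `decode` are illustrative names, not part of the notebook. The last dimension of `A`, 20, is `len(dic) + 1`, since index 0 is reserved for padding.

```python
# A small excerpt of the word-to-index dictionary printed above.
word2idx = {'Mary': 1, 'moved': 2, 'to': 3, 'the': 4, 'bathroom': 5}
idx2word = {i: w for w, i in word2idx.items()}

def decode(sentence_idx):
    """Map a padded row of indices back to words, skipping the 0 padding."""
    return [idx2word[i] for i in sentence_idx if i != 0]

decoded = decode([1, 2, 3, 4, 5, 0])
print(decoded)  # ['Mary', 'moved', 'to', 'the', 'bathroom']
```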


In [10]:
import tensorflow as tf
from tensorflow.keras import layers

In [11]:
word_emb_dim = 12

emb_a = layers.Embedding(input_dim = len(dic)+1, output_dim=word_emb_dim)
emb_b = layers.Embedding(input_dim = len(dic)+1, output_dim=word_emb_dim)
emb_c = layers.Embedding(input_dim = len(dic)+1, output_dim=word_emb_dim)

In [12]:
a = emb_a(S)
print(a.shape)
b = emb_b(Q)
print(b.shape)
c = emb_c(S)
print(c.shape)

(2, 10, 6, 12)
(2, 1, 6, 12)
(2, 10, 6, 12)


dimension of Story: batch_size X memory_length X sentence_length X word_embedding_dimension

dimension of Question: batch_size X 1 X sentence_length X word_embedding_dimension

To summarize: the batch size is 2; each example holds 10 memory slots (story sentences); each sentence holds 6 word positions; and each word is represented by a 12-dimensional word embedding. (For the query, each example has a single question sentence.)

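What remains is to turn a set of word embeddings into a single sentence embedding. Let's start with the simplest approach and represent a sentence by pooling its word embeddings. Note that the function below masks out padding and sums the embeddings; dividing by the number of non-padding words would give a true average.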

In [13]:
def get_avg_word_emb(sentence_word_idx, sentence_word_emb):
    '''
    input: 
        sentence_word_idx : [batch_size,memory_length,sentence_length]
        sentence_word_emb : [batch_size,memory_length,sentence_length,word_emb_len]
    output: 
        sentence_emb: [batch_size,memory_length,word_emb_len]
        sum of the word embeddings of each sentence, ignoring padding
        (not divided by the word count, despite the function name)
    '''
    # sentence_word_idx ->  not_zero:[batch_size,memory_length,sentence_length,1]
    # 1 if word index is not zero, else 0
    not_zero = tf.not_equal(sentence_word_idx, 0)
    not_zero = tf.cast(tf.expand_dims(not_zero,-1), tf.float32)
    
    mul = tf.multiply(sentence_word_emb,not_zero)
    return tf.reduce_sum(mul,-2)

In [14]:
keys = get_avg_word_emb(S,a)
query = get_avg_word_emb(Q,b)
values = get_avg_word_emb(S,c)

print(keys.shape, query.shape, values.shape)

(2, 10, 12) (2, 1, 12) (2, 10, 12)
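
`get_avg_word_emb` returns a masked sum. For comparison, a true masked average could look like the NumPy sketch below; `masked_mean` is a hypothetical helper, and clamping the word count to 1 for all-padding sentences is an assumption made here to avoid dividing by zero.

```python
import numpy as np

def masked_mean(word_idx, word_emb):
    """Average word embeddings over non-padding positions.

    word_idx: [..., sentence_length] integer indices, 0 = padding
    word_emb: [..., sentence_length, emb_dim]
    returns:  [..., emb_dim]
    """
    mask = (word_idx != 0).astype(word_emb.dtype)[..., None]  # [..., L, 1]
    total = (word_emb * mask).sum(axis=-2)                    # masked sum
    count = np.maximum(mask.sum(axis=-2), 1.0)                # avoid 0/0 on all-padding rows
    return total / count

idx = np.array([[1, 2, 0], [0, 0, 0]])
emb = np.array([[[1.0, 3.0], [3.0, 5.0], [9.0, 9.0]],
                [[7.0, 7.0], [7.0, 7.0], [7.0, 7.0]]])
result = masked_mean(idx, emb)
print(result)
# [[2. 4.]
#  [0. 0.]]
```

The first sentence averages only its two real words; the second, which is entirely padding, stays at zero.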


In [15]:
def get_attention_score(keys, query):
    '''
    input:
        keys: [batch, mem_len, word_emb_len]
        query: [batch, 1, word_emb_len]
    output:
        attn_score: [batch,mem_len], 
        attention scores for each memory component
    '''
    #calculate dot product
    #dot product-> logits: [batch,mem_len]
    elemwise_mul = tf.multiply(keys, query)
    logits = tf.reduce_sum(elemwise_mul,-1)
    
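    #a logit of exactly 0 is treated as a padding sentence (its masked embedding sum is all zeros);
    #shift it to a large negative value so that softmax gives it ~0 weight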
    logits_pad = logits + tf.cast(tf.equal(logits,0.),tf.float32)*-1e+10
    attn_score = tf.nn.softmax(logits_pad)
    
    return attn_score

In [16]:
attn_score = get_attention_score(keys,query)
print(attn_score)

tf.Tensor(
[[0.5003158  0.49968415 0.         0.         0.         0.
  0.         0.         0.         0.        ]
 [0.24996877 0.24852683 0.24905097 0.25245348 0.         0.
  0.         0.         0.         0.        ]], shape=(2, 10), dtype=float32)
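
`get_attention_score` detects padded slots by testing whether a logit equals zero, which would also mask a real sentence whose dot product happens to be exactly 0. An alternative, sketched here in NumPy under the assumption that the story indices are passed in, derives the mask from the indices instead. `masked_attention` is a hypothetical function, not the notebook's implementation.

```python
import numpy as np

def masked_attention(keys, query, story_idx):
    """Dot-product attention that ignores all-padding memory slots.

    keys:      [batch, mem_len, emb_dim]
    query:     [batch, 1, emb_dim]
    story_idx: [batch, mem_len, sentence_len], 0 = padding
    returns:   [batch, mem_len] attention weights
    """
    logits = (keys * query).sum(axis=-1)                  # [batch, mem_len]
    is_real = (story_idx != 0).any(axis=-1)               # True for non-empty slots
    logits = np.where(is_real, logits, -1e10)             # push padding towards -inf
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

keys = np.array([[[1.0, 0.0], [0.0, 0.0], [5.0, 5.0]]])
query = np.array([[[1.0, 1.0]]])
story_idx = np.array([[[3, 4], [7, 0], [0, 0]]])  # the third slot is padding
attn = masked_attention(keys, query, story_idx)
print(np.round(attn, 3))  # roughly 0.731, 0.269, 0.0
```

In this toy input the second slot is a real sentence with a logit of exactly 0, and it still receives a non-zero weight, while the padded third slot gets none.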


In [17]:
def get_output_memory_represntation(values, attn_score):
    '''
    input:
        values: [batch, mem_len, word_emb_len]
        attn_score: [batch,mem_len], 
    output:
        mem_rep: [batch, word_emb_len]
    '''
    
    #attn_score_expand[batch,mem_size,1]
    attn_score_expand = tf.expand_dims(attn_score, -1)
    #get memory representation
    #mul[batch,mem_size,sentence_emb]
    mul = tf.multiply(values, attn_score_expand)
    mem_rep = tf.reduce_sum(mul, -2)
    
    return mem_rep
    

In [18]:
mem_rep = get_output_memory_represntation(values, attn_score)
print(mem_rep)

tf.Tensor(
[[ 0.01542873 -0.00454633  0.08610645  0.10061193 -0.01662656  0.06419418
   0.0125137   0.03442524 -0.03258358 -0.01858172  0.08948081 -0.02875923]
 [ 0.00693787  0.03351345  0.05264403  0.08770313 -0.00967796  0.05295835
   0.00034816  0.01645008 -0.02345917 -0.00802074  0.0505798  -0.01337768]], shape=(2, 12), dtype=float32)


In [19]:
class MemLayer(layers.Layer):
    def __init__(self, vocab_size, word_emb_dim):
        super(MemLayer, self).__init__()
        # index 0 is reserved for padding, hence vocab_size + 1 rows
        self.emb_a = layers.Embedding(vocab_size+1, word_emb_dim)  # keys
        self.emb_b = layers.Embedding(vocab_size+1, word_emb_dim)  # query
        self.emb_c = layers.Embedding(vocab_size+1, word_emb_dim)  # values
        
    def call(self, story, question):
        a = self.emb_a(story); b = self.emb_b(question); c = self.emb_c(story)

        keys = get_avg_word_emb(story, a)     #[batch_size,memory_length,word_emb_dim]
        query = get_avg_word_emb(question, b) #[batch_size,1,word_emb_dim]
        values = get_avg_word_emb(story, c)   #[batch_size,memory_length,word_emb_dim]
        
        attn_score = get_attention_score(keys,query) #[batch_size,memory_length]
        mem_rep = get_output_memory_represntation(values, attn_score) #[batch_size,word_emb_dim]
        query_squeeze = tf.squeeze(query, axis=1) #[batch_size,word_emb_dim]; fixed axis keeps the batch dim even when batch_size is 1
        out = mem_rep + query_squeeze #[batch_size,word_emb_dim]
        
        return out, attn_score

In [20]:
model = MemLayer(len(dic), 12)
out, attn_score = model(S,Q)
print(out)

tf.Tensor(
[[-0.06885671  0.01596903  0.04857868  0.0793507   0.00252632 -0.05488843
  -0.04818626 -0.0033795  -0.04938336  0.09648502  0.13865146 -0.00148977]
 [-0.02715708  0.11025342  0.04472348  0.10957113 -0.04149476 -0.0640047
  -0.06871129  0.0434243  -0.03897819  0.09421811  0.05388987  0.01761121]], shape=(2, 12), dtype=float32)


In [21]:
class SingleMemN2N(tf.keras.Model):
    def __init__(self, vocab_size, word_emb_dim):
        super(SingleMemN2N, self).__init__()
        self.mem_layer = MemLayer(vocab_size, word_emb_dim)
        self.logit_layer = layers.Dense(vocab_size + 1)
    
    def call(self, story, question):
        out, attn_score = self.mem_layer(story, question)
        logit = self.logit_layer(out)
        
        return logit,attn_score
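
The forward pass of `SingleMemN2N` can be summarized in a few equations. The symbols are notation introduced here for illustration and do not appear in the notebook: $m_i$ and $c_i$ are the key and value embeddings of the $i$-th memory sentence, $u$ is the question embedding, and $W$, $b$ are the weights and bias of `logit_layer`. The final softmax is applied inside the loss (`from_logits=True`).

```latex
\begin{aligned}
p_i &= \operatorname{softmax}_i\left(u^\top m_i\right) \\
o &= \sum_i p_i \, c_i \\
\hat{a} &= \operatorname{softmax}\left(W (o + u) + b\right)
\end{aligned}
```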

In [22]:
def get_data_loader(data, sentence_len, memory_len, dic, batch_size):
    S,Q,A,Support = data_preprocess(data, sentence_len, memory_len, dic)
    loader = tf.data.Dataset.from_tensor_slices((S,Q,A,Support))
    loader = loader.shuffle(buffer_size=len(S)).batch(batch_size)
    return loader

In [23]:
with open('./dataset/qa1_single-supporting-fact_test.txt') as f:
    t_lines = f.readlines()
test_data,_,_ = split_SQA(t_lines)



In [24]:
word_emb_dim = 128
lr = 0.0001
batch_size = 250
epochs = 500
model = SingleMemN2N(len(dic), word_emb_dim)
loss_fn = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr)

In [25]:
train_loader = get_data_loader(train_data, ss_len, mem_len, dic, batch_size)
test_loader = get_data_loader(test_data, ss_len, mem_len, dic, batch_size)

for epoch in range(1,epochs+1):
    total_loss = 0
    for batch_id, batch in enumerate(train_loader):
        story,question,answer,support = batch[0], batch[1], batch[2], batch[3] 
        # Record the forward pass on a GradientTape.
        with tf.GradientTape() as tape:
            # Forward pass.
            logit, attn_score = model(story, question)
            # Loss value for this batch.
            loss = loss_fn(answer,logit)
        # Get gradients of the loss wrt the weights (outside the tape context).
        gradients  = tape.gradient(loss, model.trainable_weights)
        # Update the embedding and output-layer weights.
        optimizer.apply_gradients(zip(gradients, model.trainable_weights))
        total_loss += float(loss)
    if epoch%20 == 0:
        print("epoch{0} loss:{1:.4f}".format(epoch, total_loss))
        correct_ratio = []
        correct_attn_ratio = []
        for batch in test_loader:
            story,question,answer,support = batch[0], batch[1], batch[2], batch[3] 
            logit, attn_score = model(story, question)
            pred_idx = tf.argmax(logit,axis=-1)
            ans_idx = tf.argmax(answer,axis=-1)
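            # cast to the dtype of `support` (int32 or int64 depending on the platform) so tf.equal accepts both
            pred_attn_idx = tf.cast(tf.argmax(attn_score,axis=-1),support.dtype)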
            
            true_list = tf.cast(tf.equal(pred_idx,ans_idx),tf.float32)
            correct_ratio.append(float(tf.reduce_sum(true_list) / true_list.shape[0]))
            
            true_attn_list = tf.cast(tf.equal(pred_attn_idx,support),tf.float32)
            correct_attn_ratio.append(float(tf.reduce_sum(true_attn_list) / true_attn_list.shape[0]))

        correct_ratio = sum(correct_ratio) / len(correct_ratio) 
        correct_attn_ratio = sum(correct_attn_ratio) / len(correct_attn_ratio) 
        print('correct ratio: {0:.4f} correct attn ratio: {1:.4f}'.format(correct_ratio, correct_attn_ratio))
        


epoch20 loss:10.8058
correct ratio: 0.1780 correct attn ratio: 0.2220
epoch40 loss:9.3333
correct ratio: 0.1940 correct attn ratio: 0.2510
epoch60 loss:8.1698
correct ratio: 0.2080 correct attn ratio: 0.2270
epoch80 loss:7.5357
correct ratio: 0.2650 correct attn ratio: 0.2270
epoch100 loss:7.2016
correct ratio: 0.3210 correct attn ratio: 0.2370
epoch120 loss:6.9909
correct ratio: 0.3480 correct attn ratio: 0.2520
epoch140 loss:6.8275
correct ratio: 0.3590 correct attn ratio: 0.2580
epoch160 loss:6.6787
correct ratio: 0.3820 correct attn ratio: 0.2850
epoch180 loss:6.5262
correct ratio: 0.4030 correct attn ratio: 0.3170
epoch200 loss:6.3564
correct ratio: 0.4320 correct attn ratio: 0.3550
epoch220 loss:6.1571
correct ratio: 0.4540 correct attn ratio: 0.4350
epoch240 loss:5.9178
correct ratio: 0.5040 correct attn ratio: 0.5180
epoch260 loss:5.6310
correct ratio: 0.5490 correct attn ratio: 0.5690
epoch280 loss:5.2959
correct ratio: 0.5970 correct attn ratio: 0.5930
epoch300 loss:4.9245
co

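The answer accuracy improves, and so does the accuracy of the attention (how often the highest-scoring memory slot is the supporting sentence)! Can we push it higher? Will it work well on more complex tasks? TODO: multi-hop attention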
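
As a rough sketch of what the multi-hop TODO might involve, the NumPy code below stacks several attention hops, adding each hop's output to the query before the next hop. Everything here is an assumption for illustration: `multi_hop` is a hypothetical function, the keys and values are random stand-ins for per-hop sentence embeddings, and padding is not masked.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_hop(query, keys_per_hop, values_per_hop):
    """Run several attention hops, feeding each hop's output into the next query.

    query:          [batch, emb_dim]
    keys_per_hop:   list of [batch, mem_len, emb_dim], one entry per hop
    values_per_hop: list of [batch, mem_len, emb_dim], one entry per hop
    returns:        final state [batch, emb_dim] and the attention of each hop
    """
    u = query
    attentions = []
    for keys, values in zip(keys_per_hop, values_per_hop):
        p = softmax(np.einsum('bmd,bd->bm', keys, u))  # attention over memory
        o = np.einsum('bm,bmd->bd', p, values)         # weighted sum of values
        u = u + o                                      # query for the next hop
        attentions.append(p)
    return u, attentions

rng = np.random.default_rng(0)
batch, mem_len, emb_dim, hops = 2, 10, 12, 3
keys = [rng.normal(size=(batch, mem_len, emb_dim)) for _ in range(hops)]
values = [rng.normal(size=(batch, mem_len, emb_dim)) for _ in range(hops)]
u, attns = multi_hop(rng.normal(size=(batch, emb_dim)), keys, values)
print(u.shape, len(attns), attns[0].shape)  # (2, 12) 3 (2, 10)
```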

In [26]:
for batch in test_loader:
    story,question,answer,support = batch[0], batch[1], batch[2], batch[3] 
    logit, attn_score = model(story, question)
    pred_attn_idx = tf.cast(tf.argmax(attn_score,axis=-1),support.dtype)  # match the dtype of `support`
    print(support)
    print(pred_attn_idx)
    

tf.Tensor(
[9 0 2 5 0 7 6 1 0 4 7 4 4 5 3 4 4 1 1 0 1 0 0 1 2 8 1 0 3 7 2 5 1 5 1 5 4
 3 6 0 2 9 8 0 6 0 5 6 5 7 3 1 0 5 5 2 3 5 1 3 5 0 9 5 1 5 5 0 4 7 5 6 5 7
 7 5 1 2 6 7 1 3 7 4 7 2 5 7 0 5 3 5 1 0 3 7 2 5 1 5 0 1 3 1 8 7 6 3 0 1 9
 5 5 5 1 0 2 6 4 5 9 1 3 6 8 2 1 3 2 6 3 1 2 1 3 1 4 5 2 9 5 5 5 8 0 3 6 7
 1 0 7 1 1 4 0 7 1 3 7 7 6 9 5 0 0 5 8 1 9 2 4 9 1 5 3 1 6 3 2 5 9 1 3 6 1
 8 6 9 4 1 5 1 3 1 6 7 5 5 4 8 6 6 4 8 3 5 2 3 5 6 7 1 2 7 5 1 7 1 1 1 9 3
 1 7 7 0 0 4 3 1 1 1 0 5 9 8 9 9 0 1 1 3 3 3 0 1 3 7 8 0], shape=(250,), dtype=int32)
tf.Tensor(
[8 0 2 5 0 2 6 1 0 4 7 0 2 5 3 3 1 1 0 0 1 0 0 0 2 7 1 0 3 7 2 5 1 5 1 5 4
 2 5 0 0 0 8 0 6 0 5 5 3 4 3 1 0 3 5 2 3 3 1 3 3 0 1 5 0 3 2 0 4 3 3 6 5 1
 6 5 1 2 3 5 1 3 3 1 7 2 5 2 0 2 3 4 1 0 3 3 2 5 1 5 0 1 2 0 8 7 6 1 0 1 5
 5 1 5 1 0 2 6 4 5 6 1 3 6 1 2 1 3 2 6 3 1 2 1 3 1 4 5 2 5 2 5 5 8 0 2 3 3
 1 0 7 1 1 4 0 4 1 3 5 7 6 5 5 0 0 2 0 1 9 0 4 5 1 5 3 1 6 2 1 0 8 1 3 2 1
 4 3 5 3 1 5 1 2 1 1 1 4 3 3 8 5 3 4 3 3 5 2 0 5 5 1 1 2 7 5 1 3 1 