#HEADS-UP - this is a script I'm converting to use Eugene natural-commands texts :)

Trains two recurrent neural networks based upon a story and a question.
The resulting merged vector is then queried to answer a range of bAbI tasks.

The results are comparable to those for an LSTM model provided in Weston et al.:
"Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks"
http://arxiv.org/abs/1502.05698

Task Number                  | FB LSTM Baseline | Keras QA
---                          | ---              | ---
QA1 - Single Supporting Fact | 50               | 100.0
QA2 - Two Supporting Facts   | 20               | 50.0
QA3 - Three Supporting Facts | 20               | 20.5
QA4 - Two Arg. Relations     | 61               | 62.9
QA5 - Three Arg. Relations   | 70               | 61.9
QA6 - Yes/No Questions       | 48               | 50.7
QA7 - Counting               | 49               | 78.9
QA8 - Lists/Sets             | 45               | 77.2
QA9 - Simple Negation        | 64               | 64.0
QA10 - Indefinite Knowledge  | 44               | 47.7
QA11 - Basic Coreference     | 72               | 74.9
QA12 - Conjunction           | 74               | 76.4
QA13 - Compound Coreference  | 94               | 94.4
QA14 - Time Reasoning        | 27               | 34.8
QA15 - Basic Deduction       | 21               | 32.4
QA16 - Basic Induction       | 23               | 50.6
QA17 - Positional Reasoning  | 51               | 49.1
QA18 - Size Reasoning        | 52               | 90.8
QA19 - Path Finding          | 8                | 9.0
QA20 - Agent's Motivations   | 91               | 90.7

For the resources related to the bAbI project, refer to:
https://research.facebook.com/researchers/1543934539189348

Notes:

- With default word, sentence, and query vector sizes, the GRU model achieves:
  - 100% test accuracy on QA1 in 20 epochs (2 seconds per epoch on CPU)
  - 50% test accuracy on QA2 in 20 epochs (16 seconds per epoch on CPU)
In comparison, the Facebook paper achieves 50% and 20% for the LSTM baseline.

- The task does not traditionally parse the question separately. Parsing it with
its own RNN, as done here, likely improves accuracy and is a good example of
merging two RNNs.

- The word vector embeddings are not shared between the story and question RNNs.

- See how the accuracy changes given 10,000 training samples (en-10k) instead
of only 1,000. The smaller set was used to stay comparable to the original paper.

- Experiment with GRU, LSTM, and JZS1-3 as they give subtly different results.

- The length and noise (i.e. 'useless' story components) impact the ability of
LSTMs / GRUs to provide the correct answer. Given only the supporting facts,
these RNNs can achieve 100% accuracy on many tasks. Memory networks and neural
networks that use attentional processes can efficiently search through this
noise to find the relevant statements, improving performance substantially.
This becomes especially obvious on QA2 and QA3, both far longer than QA1.
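
To make the last note concrete, here is a minimal sketch of what "given only the supporting facts" means in practice. It mirrors the idea behind `only_supporting=True` in `parse_stories`; the story text and the helper name `supporting_only` are made up for illustration and are not part of the bAbI data or of this script.

```python
# Hypothetical bAbI-style story: (line id, sentence) pairs.
story = [
    (1, "Mary moved to the bathroom."),
    (2, "John went to the hallway."),
    (3, "Daniel picked up the milk."),
    (4, "Mary travelled to the office."),
]
# The question "Where is Mary?" is supported by line 4 only.
supporting_ids = [4]


def supporting_only(story, supporting_ids):
    """Keep only the sentences whose line id is listed as supporting."""
    wanted = set(supporting_ids)
    return [sentence for line_id, sentence in story if line_id in wanted]


full_len = sum(len(s.split()) for _, s in story)
reduced = supporting_only(story, supporting_ids)
reduced_len = sum(len(s.split()) for s in reduced)
print(reduced)
print("words: %d -> %d" % (full_len, reduced_len))
```

The RNN then has to read a quarter of the input, with none of the distracting sentences left in it.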

In [1]:
from functools import reduce
import re
import tarfile
import os
import random
import numpy as np
import csv

from pprint import pprint

seed = 80085
np.random.seed(seed)  # for reproducibility
random.seed(seed)

csv.register_dialect('eugene', delimiter=';', quotechar='"', quoting=csv.QUOTE_MINIMAL)

from keras.utils.data_utils import get_file
from keras.layers.embeddings import Embedding
from keras.layers import Dense, Merge, Dropout, RepeatVector
from keras.layers import recurrent
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences

Using TensorFlow backend.


In [2]:
RNN = recurrent.GRU
EMBED_HIDDEN_SIZE = 200
BATCH_SIZE = 128
EPOCHS = 200
PATH_MODEL = "model.hdf5"
print('RNN / Embed = {}, {}'.format(RNN, EMBED_HIDDEN_SIZE))

RNN / Embed = <class 'keras.layers.recurrent.GRU'>, 200


In [3]:
def tokenize(sent):
    '''Return the tokens of a sentence including punctuation.

    >>> tokenize('Bob dropped the apple. Where is the apple?')
    ['Bob', 'dropped', 'the', 'apple', '.', 'Where', 'is', 'the', 'apple', '?']
    '''
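    # Raw string, and no trailing '?': a pattern that can match the empty
    # string triggers a FutureWarning and splits differently on newer Pythons.
    return [x.strip() for x in re.split(r'(\W+)', sent) if x.strip()]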


def parse_stories(lines, only_supporting=False):
    '''Parse stories provided in the bAbI tasks format

    If only_supporting is true, only the sentences that support the answer are kept.
    '''
    data = []
    story = []
    for line in lines:
        line = line.decode('utf-8').strip()
        nid, line = line.split(' ', 1)
        nid = int(nid)
        if nid == 1:
            story = []
        if '\t' in line:
            q, a, supporting = line.split('\t')
            q = tokenize(q)
            substory = None
            if only_supporting:
                # Only select the related substory
                supporting = map(int, supporting.split())
                substory = [story[i - 1] for i in supporting]
            else:
                # Provide all the substories
                substory = [x for x in story if x]
            data.append((substory, q, a))
            story.append('')
        else:
            sent = tokenize(line)
            story.append(sent)
    return data

def get_stories(f, only_supporting=False, max_length=None):
    '''Given an open file object, read it, retrieve the stories, and then convert the sentences into a single story.

    If max_length is supplied, any story with max_length tokens or more is discarded.
    '''
    data = parse_stories(f.readlines(), only_supporting=only_supporting)
    flatten = lambda data: reduce(lambda x, y: x + y, data)
    data = [(flatten(story), q, answer) for story, q, answer in data if not max_length or len(flatten(story)) < max_length]
    return data

def vectorize_stories(data, word_idx, story_maxlen, query_maxlen):
    '''Map (natural, cmd) token pairs to two zero-padded arrays of word ids.'''
    X = []
    Y = []
    for natural, cmd in data:
        X.append([word_idx[w] for w in natural])
        Y.append([word_idx[w] for w in cmd])
    return pad_sequences(X, maxlen=story_maxlen), pad_sequences(Y, maxlen=query_maxlen)
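
The cell above tokenizes text and then relies on `pad_sequences` to bring every id list to a fixed length. The sketch below reproduces both steps with the standard library only, so the effect is easy to inspect. `pad_left` is a hypothetical stand-in that imitates the default behaviour of `pad_sequences` (zeros added at the front, over-long sequences cut at the front); it is not the Keras implementation.

```python
import re


def tokenize(sent):
    """Split on runs of non-word characters and keep them as tokens."""
    return [x.strip() for x in re.split(r'(\W+)', sent) if x.strip()]


def pad_left(seq, maxlen, value=0):
    """Left-pad (or left-truncate) a list of ids to exactly maxlen items."""
    if len(seq) > maxlen:
        seq = seq[len(seq) - maxlen:]
    return [value] * (maxlen - len(seq)) + seq


tokens = tokenize('Bob dropped the apple. Where is the apple?')
# Ids start at 1 so that 0 stays free for padding.
word_idx = {w: i + 1 for i, w in enumerate(sorted(set(tokens)))}
ids = [word_idx[w] for w in tokens]

print(tokens)
print(pad_left(ids, 12))  # two leading zeros, then the ten ids
print(pad_left(ids, 4))   # only the last four ids survive
```

Because padding goes in front, the informative tokens sit at the end of the sequence, which is where a recurrent layer's final state is read from.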

In [4]:
try:
    path = get_file('babi-tasks-v1-2.tar.gz', origin='https://s3.amazonaws.com/text-datasets/babi_tasks_1-20_v1-2.tar.gz')
except Exception:
    print('Error downloading dataset, please download it manually:\n'
          '$ wget http://www.thespermwhale.com/jaseweston/babi/tasks_1-20_v1-2.tar.gz\n'
          '$ mv tasks_1-20_v1-2.tar.gz ~/.keras/datasets/babi-tasks-v1-2.tar.gz')
    raise
print("data path: %s" % path)

data path: /home/peter/.keras/datasets/babi-tasks-v1-2.tar.gz


In [5]:
train = []
test = []
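# newline='' lets the csv module handle line endings inside quoted fields.
with open('../syntethic_data/data.csv', 'r', newline='') as f:
    reader = csv.reader(f, 'eugene')
    chance_test = 0.2
    for row in reader:
        to_append = [tokenize(row[0]), tokenize(row[1])]
        # Each row lands in the test set with probability chance_test.
        if random.uniform(0, 1) <= chance_test:
            test.append(to_append)
        else:
            train.append(to_append)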
random.shuffle(test)
random.shuffle(train)

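
The loading cell draws a random number per row, so the test set is only about 20% of the data and its exact size varies with the seed. As a possible alternative (not what the notebook does), the sketch below shuffles once and slices, which gives an exact split. The rows, the command syntax in them, and the dialect name `eugene_demo` are invented for the example.

```python
import csv
import io
import random

csv.register_dialect('eugene_demo', delimiter=';', quotechar='"',
                     quoting=csv.QUOTE_MINIMAL)

# Hypothetical rows: a natural-language request, then the command it maps to.
raw = io.StringIO(
    'open the file;open 1\n'
    '"copy it; then close";copy 1 && close 1\n'
    'close the file;close 1\n'
    'delete the file;delete 1\n'
    'rename the file;rename 1\n'
)
rows = [row for row in csv.reader(raw, 'eugene_demo') if row]

rng = random.Random(80085)  # local generator, the global state is untouched
rng.shuffle(rows)
n_test = max(1, int(round(0.2 * len(rows))))
test, train = rows[:n_test], rows[n_test:]
print(len(train), len(test))
```

Note how the quoted field keeps its embedded `;` intact, which is the reason for registering a dialect with a `quotechar` in the first place.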


In [15]:
vocab = sorted(reduce(lambda x, y: x | y, (set(natural + cmd) for natural, cmd in train + test)))
# Reserve 0 for masking via pad_sequences
vocab_size = len(vocab) + 1
word_idx = dict((c, i + 1) for i, c in enumerate(vocab))
word_idx_rev = {v: k for k, v in word_idx.items()}
natural_maxlen = max(map(len, (x for x, _ in train + test)))
cmd_maxlen = max(map(len, (x for _, x in train + test)))
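
For readers who want to see what the vocabulary cell produces, here is a small self-contained sketch on two made-up `(natural, cmd)` pairs. The tokens are hypothetical; only the construction (sorted union of tokens, ids starting at 1, id 0 kept for padding) follows the cell above.

```python
# Hypothetical tokenized (natural, cmd) pairs.
pairs = [
    (['open', 'file', 'J'], ['open', 'J_1']),
    (['close', 'file', 'J'], ['close', 'J_1']),
]

# Union of all tokens on both sides, sorted for a stable id assignment.
vocab = sorted(set(tok for natural, cmd in pairs for tok in natural + cmd))
vocab_size = len(vocab) + 1          # +1 because id 0 is the padding value
word_idx = {w: i + 1 for i, w in enumerate(vocab)}
word_idx_rev = {i: w for w, i in word_idx.items()}

encoded = [word_idx[w] for w in pairs[0][0]]
decoded = [word_idx_rev[i] for i in encoded]
print(vocab)
print(encoded, decoded)
```

Both sides of a pair share one vocabulary here, so the same id means the same token whether it appears in the request or in the command.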

In [23]:
X, Y = vectorize_stories(train, word_idx, natural_maxlen, cmd_maxlen)
Xv, Yv = vectorize_stories(test, word_idx, natural_maxlen, cmd_maxlen)

# i = 0
# for Ys in Xqv:
#     if i % 20 == 0:
#         print " ".join(to_words(Ys)), ": " + word_idx_rev[np.argmax(Yv[i])]
#     i+=1

print('vocab({}) = {}'.format(len(vocab), vocab))
print('X.shape = {}'.format(X.shape))
print('Y.shape = {}'.format(Y.shape))

print('Xv.shape = {}'.format(Xv.shape))
print('Yv.shape = {}'.format(Yv.shape))
print('natural_maxlen, cmd_maxlen = {}, {}'.format(natural_maxlen, cmd_maxlen))

vocab(126) = ['&&', '1', '10', '2', '3', '4', '5', '56', '56_1', '56_10', '56_2', '56_3', '56_4', '56_5', '56_6', '56_7', '56_8', '56_9', '6', '7', '8', '9', 'J', 'J_1', 'J_10', 'J_2', 'J_3', 'J_4', 'J_5', 'J_6', 'J_7', 'J_8', 'J_9', 'K7U1IW', 'K7U1IW_1', 'K7U1IW_10', 'K7U1IW_2', 'K7U1IW_3', 'K7U1IW_4', 'K7U1IW_5', 'K7U1IW_6', 'K7U1IW_7', 'K7U1IW_8', 'K7U1IW_9', 'N2', 'N2_1', 'N2_10', 'N2_2', 'N2_3', 'N2_4', 'N2_5', 'N2_6', 'N2_7', 'N2_8', 'N2_9', 'O4SCXPKEU', 'O4SCXPKEU_1', 'O4SCXPKEU_10', 'O4SCXPKEU_2', 'O4SCXPKEU_3', 'O4SCXPKEU_4', 'O4SCXPKEU_5', 'O4SCXPKEU_6', 'O4SCXPKEU_7', 'O4SCXPKEU_8', 'O4SCXPKEU_9', 'PHIHGJ', 'PHIHGJ_1', 'PHIHGJ_10', 'PHIHGJ_2', 'PHIHGJ_3', 'PHIHGJ_4', 'PHIHGJ_5', 'PHIHGJ_6', 'PHIHGJ_7', 'PHIHGJ_8', 'PHIHGJ_9', 'R6W1ED8D', 'R6W1ED8D_1', 'R6W1ED8D_10', 'R6W1ED8D_2', 'R6W1ED8D_3', 'R6W1ED8D_4', 'R6W1ED8D_5', 'R6W1ED8D_6', 'R6W1ED8D_7', 'R6W1ED8D_8', 'R6W1ED8D_9', 'XKZZ3F2', 'XKZZ3F2_1', 'XKZZ3F2_10', 'XKZZ3F2_2', 'XKZZ3F2_3', 'XKZZ3F2_4', 'XKZZ3F2_5', 'XKZZ3F2_6

In [24]:
print('Build model...')

model = Sequential()
model.add(Embedding(vocab_size, EMBED_HIDDEN_SIZE,
                   input_length=natural_maxlen))
model.add(Dropout(0.2))
model.add(Dense(cmd_maxlen, activation='softmax'))
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

Build model...


In [27]:
if os.path.exists(PATH_MODEL):
    model.load_weights(PATH_MODEL)
    print("Loaded existing weights...")

print("Training model....")
validation_data = (Xv, Yv)
train_data = X
for iteration in range(100):        
    print("Iteration %d" % iteration)
    model.fit(train_data, Y, batch_size=BATCH_SIZE, shuffle=False, epochs=3, validation_data=validation_data)        
    model.save_weights(PATH_MODEL, overwrite=True)
print("Saved model to %s..." % PATH_MODEL)

Training model....
Iteration 0


ValueError: Error when checking target: expected dense_4 to have 3 dimensions, but got array with shape (83, 29)
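
A likely reading of this error: the `Embedding` layer emits one vector per time step, so the `Dense` layer stacked directly on it produces a 3-D output (samples, time steps, units), while `Y` is a 2-D array of padded word ids with shape (83, 29). `categorical_crossentropy` also expects one-hot targets rather than integer ids. The sketch below only illustrates the target side of that mismatch in plain Python; `one_hot_targets` and `shape` are hypothetical helpers, and in the notebook itself something like `keras.utils.to_categorical` would probably be the more natural tool.

```python
def one_hot_targets(Y, vocab_size):
    """Turn (samples, timesteps) ids into (samples, timesteps, vocab_size)."""
    return [[[1 if k == idx else 0 for k in range(vocab_size)]
             for idx in row]
            for row in Y]


def shape(nested):
    """Shape of a non-empty rectangular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


# Two padded command sequences of length 4 over a toy vocabulary of 5 ids.
Y = [[0, 0, 3, 1],
     [0, 2, 4, 1]]
Y3 = one_hot_targets(Y, vocab_size=5)
print(shape(Y), '->', shape(Y3))
```

Fixing the targets alone is not enough, since the model's output would still have `natural_maxlen` time steps and `cmd_maxlen` units. One possible direction, which is an assumption and not something this notebook implements, is an encoder-decoder stack: `Embedding`, an `RNN` encoder, `RepeatVector(cmd_maxlen)`, an `RNN` decoder with `return_sequences=True`, and a `TimeDistributed(Dense(vocab_size, activation='softmax'))` output, trained against targets of shape (samples, `cmd_maxlen`, `vocab_size`).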

In [None]:
def to_words(seq):
    return [word_idx_rev[s] for s in seq if s > 0]

def to_seq(words, maxlen):
    if isinstance(words, str):
        words = tokenize(words)
    # Words missing from the vocabulary are silently dropped.
    seq = [word_idx[w] for w in words if w in word_idx]
    return pad_sequences([seq], maxlen=maxlen)

def ask(text, question):
    # Carried over from the bAbI version: expects a two-input (story, question) model.
    seq = to_seq(text, natural_maxlen)
    probs = model.predict([seq, to_seq(question, cmd_maxlen)])
    return word_idx_rev[np.argmax(probs[0])]

In [None]:
print("QA test system\n==========")
QAPairs = [
    (
        """
        John travelled to the kitchen.
        Sandra got the football there.
        """,
        "Where is the football?"
    ), (
        """
        John travelled to the kitchen, then he went to the garden to beat Stan.
        Sandra got the football there.
        """,
        "Where is the football?"
    ), (
        """
        John travelled to the kitchen, then he went to the garden to beat Stan.
        Sandra got the football and the milk there.
        """,
        "Where is the milk?"
    ), (
        """
        John travelled to the kitchen, then he went to the garden to beat Stan.
        Sandra got the football but left the milk in the kitchen.
        """,
        "Where is the milk?"
    ), (
        "John left with a football on his way to the garden. When back in the kitchen, he saw Sandra. Sandra threw the ball through the window in the garden again.",
        ("Where is John?", "Where is the football?")
    ), (
        """
        Mary got a football. She dropped the football.
        """,
        ("How many objects is Mary carrying?")
    )
]

for idx, (story, questions) in enumerate(QAPairs):
    print("Story: %s" % story)
    # A story may come with a single question or a tuple of questions.
    if not isinstance(questions, tuple):
        questions = (questions,)
    for question in questions:
        print("Question: %s" % question)
        print("Answer: %s" % ask(story, question))
    if idx + 1 < len(QAPairs):
        print("\n")