# Generating Sentences with TreeRNNs

This notebook walks through a minimal example of encoding one sentence into a distributed representation using a TreeRNN, and then using this representation to generate another sentence by running a different TreeRNN in reverse. To start, we'll do some data cleaning to make sure we have a good set of sentence pairs to train on. The main goal here is to remove sentences that contain misspelled words, along with sentences longer than 15 tokens and other oddities.
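
To make the idea of encoding a tree into a single vector concrete, here is a minimal sketch of a generic recursive encoder. It is not the `DependencyNetwork` implementation from `pysem`: the single shared weight matrix `W`, the random toy embeddings, and the `(word, children)` tree format are all assumptions chosen for illustration, and plain Python lists are used so the sketch has no dependencies.

```python
import math
import random

random.seed(0)
DIM = 4  # tiny dimension for illustration; the notebook itself uses 300


def random_vector(scale=1.0):
    return [random.gauss(0.0, scale) for _ in range(DIM)]


# Hypothetical toy embeddings and one shared composition matrix.
embeddings = {w: random_vector() for w in ['people', 'are', 'walking']}
W = [random_vector(0.1) for _ in range(DIM)]


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def encode(node):
    # node is (word, [children]); children are encoded first, bottom-up.
    word, children = node
    total = list(embeddings[word])
    for child in children:
        transformed = matvec(W, encode(child))
        total = [a + b for a, b in zip(total, transformed)]
    return [math.tanh(x) for x in total]


# 'walking' is the root, with 'people' and 'are' as its dependents.
tree = ('walking', [('people', []), ('are', [])])
root_vector = encode(tree)
print(len(root_vector))
```

The vector computed at the root summarizes the whole tree, which is the role the root embedding plays when it is handed to the decoder later in this notebook.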

In [1]:
import enchant 
import random
import pickle
import numpy as np

from collections import namedtuple
from pysem.corpora import SNLI
from pysem.networks import DependencyNetwork
from pysem.generatives import EmbeddingGenerator

checker = enchant.Dict('en_US')
TrainingPair = namedtuple('TrainingPair', ['sentence1', 'sentence2', 'label'])

snli = SNLI('/Users/peterblouw/corpora/snli_1.0/')
snli.load_xy_pairs()

def repair(sen):
    # Keep a sentence only if it is short and every token passes the spell check.
    tokens = DependencyNetwork.parser(sen)
    if len(tokens) > 15:
        return None
    for token in tokens:
        if not checker.check(token.text):
            return None
    return sen

def clean_data(data):
    # Drop any pair in which either sentence was rejected by repair().
    clean = []
    for item in data:
        s1 = repair(item.sentence1)
        s2 = repair(item.sentence2)
        if s1 is None or s2 is None:
            continue
        clean.append(TrainingPair(s1, s2, item.label))

    return clean

In [2]:
clean_dev = clean_data(snli.dev_data[:100])
clean_train = clean_data(snli.train_data[:1000])
clean_test = clean_data(snli.test_data[:10])

In [3]:
print(len(clean_dev))
print(len(clean_test))
print(len(clean_train))

41
4
486
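
The filtering rule used above can be tried out without the notebook's dependencies. The sketch below is only an approximation: it assumes whitespace tokenization in place of `DependencyNetwork.parser`, and a small hand-written word set (`KNOWN_WORDS`, hypothetical) in place of the `enchant` dictionary. The example pairs are invented.

```python
from collections import namedtuple

TrainingPair = namedtuple('TrainingPair', ['sentence1', 'sentence2', 'label'])

# Hypothetical stand-in for the enchant dictionary used in the notebook.
KNOWN_WORDS = {'a', 'dog', 'runs', 'an', 'animal', 'moves', 'the', 'cat', 'sleeps'}
MAX_TOKENS = 15  # same length cutoff as repair() above


def is_clean(sentence, known_words=KNOWN_WORDS, max_tokens=MAX_TOKENS):
    # Whitespace tokenization is an assumption; the notebook uses a real parser.
    tokens = sentence.lower().split()
    if len(tokens) > max_tokens:
        return False
    return all(token in known_words for token in tokens)


def clean_pairs(pairs):
    # Keep a pair only when both of its sentences pass the check.
    return [p for p in pairs if is_clean(p.sentence1) and is_clean(p.sentence2)]


pairs = [
    TrainingPair('a dog runs', 'an animal moves', 'entailment'),
    TrainingPair('the catt sleeps', 'the cat sleeps', 'entailment'),
]
kept = clean_pairs(pairs)
print(len(kept))
```

Because both sentences must pass, a single misspelled word discards the whole pair, which is consistent with roughly half of the sampled pairs being dropped in the counts above.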


Next, we'll build a vocab from the set of cleaned sentence pairs. 

In [4]:
def build_vocab(data):
    vocab = set()
    for item in data:
        for sentence in (item.sentence1, item.sentence2):
            # A set ignores duplicates, so no membership check is needed.
            vocab.update(t.text for t in DependencyNetwork.parser(sentence))

    return sorted(vocab)

data = clean_dev + clean_test + clean_train
vocab = build_vocab(data)

In [5]:
print(len(vocab))

988
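
As a dependency-free illustration of the vocabulary step, the sketch below collects unique tokens and sorts them. Whitespace tokenization stands in for the parser, and the `word_to_index` table is a hypothetical example of why a sorted, deterministic word list is convenient; it is not necessarily how `DependencyNetwork` uses the vocabulary internally.

```python
def build_vocab(sentences):
    # Whitespace tokenization is an assumption standing in for the parser.
    vocab = set()
    for sentence in sentences:
        vocab.update(sentence.split())
    return sorted(vocab)


sentences = ['people are walking outdoors .', 'people are outdoors .']
vocab = build_vocab(sentences)

# Hypothetical lookup table: a sorted vocabulary gives each word a stable index.
word_to_index = {word: i for i, word in enumerate(vocab)}

print(vocab)
print(word_to_index['people'])
```

Sorting makes the word order reproducible across runs, which matters if anything downstream depends on a word's position in the list.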


Now we can collect the sentence pairs labeled as entailments, in which the first sentence entails the second.

In [6]:
train_data = [d for d in clean_train if d.label == 'entailment'] # or d.label == 'neutral']
test_data = [d for d in clean_test if d.label == 'entailment'] # or d.label == 'neutral']
dev_data = [d for d in clean_dev if d.label == 'entailment'] # or d.label == 'neutral']

print(len(train_data))
print(len(test_data))
print(len(dev_data))

173
1
15
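
Before filtering on a label, it can help to see how the pairs are distributed across labels. The following sketch shows one way to do that with `collections.Counter`; the pairs are toy examples invented for illustration, not taken from SNLI.

```python
from collections import Counter, namedtuple

TrainingPair = namedtuple('TrainingPair', ['sentence1', 'sentence2', 'label'])

# Toy pairs, invented for illustration only.
pairs = [
    TrainingPair('A man sleeps.', 'A person rests.', 'entailment'),
    TrainingPair('A man sleeps.', 'A man runs.', 'contradiction'),
    TrainingPair('A man sleeps.', 'A man dreams.', 'neutral'),
    TrainingPair('A dog barks.', 'An animal makes noise.', 'entailment'),
]

# Count pairs per label, then keep only the entailment pairs.
label_counts = Counter(p.label for p in pairs)
entailments = [p for p in pairs if p.label == 'entailment']

print(label_counts)
print(len(entailments))
```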


In [7]:
from pysem.utils.snli import InferentialRoleModel

dim = 300
iters = 1
rate = 0.01

vectors = 'w2v_embeddings.pickle'

with open('w2v_dep_vocabs.pickle', 'rb') as pfile:
    subvocabs = pickle.load(pfile)

encoder = DependencyNetwork(dim=dim, vocab=vocab, pretrained=vectors)
decoder = EmbeddingGenerator(dim=dim, subvocabs=subvocabs, vectors=vectors)

model = InferentialRoleModel(encoder=encoder, decoder=decoder, data=train_data)
model.train(iters=iters, rate=rate)

On iteration  0


In [12]:
sample = random.choice(train_data)

print(sample)

model.encode(sample.sentence1)
model.decode(sample.sentence2)

TrainingPair(sentence1='A couple walk through a white brick town.', sentence2='People are walking outdoors.', label='entailment')


'people are walking outdoors .'

In [13]:
def compute_accuracy(data, model):
    # Fraction of decoder tree nodes whose predicted word matches the target word.
    total = 0
    correct = 0

    for item in data:
        model.encoder.forward_pass(item.sentence1)
        model.decoder.forward_pass(item.sentence2, model.encoder.get_root_embedding())

        for node in model.decoder.tree:
            total += 1
            if node.pword.lower() == node.lower_:
                correct += 1

    return correct / total

print(compute_accuracy(train_data, model))
print(compute_accuracy(dev_data, model))

0.5975206611570248
0.29411764705882354
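
The accuracy reported here is a per-word score. The sketch below shows the same comparison on plain token lists, assuming (as the loop above suggests) that each decoder node carries one predicted word and one target word. The function name `token_accuracy` and the example tokens are hypothetical.

```python
def token_accuracy(predicted, target):
    # Compare tokens position by position, ignoring case.
    if not target:
        raise ValueError('target must contain at least one token')
    if len(predicted) != len(target):
        raise ValueError('predicted and target must have the same length')
    correct = sum(p.lower() == t.lower() for p, t in zip(predicted, target))
    return correct / len(target)


predicted = ['people', 'are', 'running', 'outdoors', '.']
target = ['People', 'are', 'walking', 'outdoors', '.']
print(token_accuracy(predicted, target))
```

Note that this sketch scores a single sentence, whereas `compute_accuracy` pools the counts over every node of every sentence before dividing, so longer sentences carry more weight in the notebook's figure.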


In [14]:
model.save('enc_model.pickle','dec_model.pickle')

In [15]:
test_model = InferentialRoleModel(encoder=None, decoder=None, data=train_data)
test_model.load('enc_model.pickle','dec_model.pickle')

print(compute_accuracy(train_data, test_model))
print(compute_accuracy(dev_data, test_model))

0.5975206611570248
0.29411764705882354


## Simple Entailment Generation Examples

With only 173 training pairs, the model probably can't generalize beyond the training set: per-word accuracy drops from about 0.60 on the training data to about 0.29 on the dev data. So we'll first check how well the learned decoder is able to generate the entailments it has been trained on.