# Generating Sentences with TreeRNNs

This notebook goes through a minimal example of encoding one sentence into a distributed representation using a TreeRNN, and then using this distributed representation to generate another sentence using a different TreeRNN in reverse. To start, we'll do some data cleaning to make sure we have a good set of sentence pairs to train on. The main goal here is to remove overly long sentences (more than 15 tokens) and sentences with misspelled words or other oddities.

In [1]:
import enchant 
import random
import pickle
import numpy as np

from collections import namedtuple
from pysem.corpora import SNLI
from pysem.networks import DependencyNetwork
from pysem.generatives import EmbeddingGenerator

checker = enchant.Dict('en_US')
TrainingPair = namedtuple('TrainingPair', ['sentence1', 'sentence2', 'label'])

snli = SNLI('/home/pblouw/snli_1.0/')
snli.load_xy_pairs()

def repair(sen):
    tokens = DependencyNetwork.parser(sen)
    if len(tokens) > 15:
        return None
    for token in tokens:
        if not checker.check(token.text):
            return None
    return sen

def clean_data(data):
    clean = []
    for item in data:
        
        s1 = repair(item.sentence1)
        s2 = repair(item.sentence2)
        if s1 is None or s2 is None:
            continue
        else:
            clean.append(TrainingPair(s1, s2, item.label))
    
    return clean

In [2]:
clean_dev = clean_data(snli.dev_data)
clean_train = clean_data(snli.train_data)
clean_test = clean_data(snli.test_data)

In [3]:
print(len(clean_dev))
print(len(clean_test))
print(len(clean_train))

4955
4839
306651
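
The filtering rule in `repair` can be tried out without PyEnchant or the dependency parser. The sketch below is a stand-in, not the notebook's pipeline: it assumes whitespace tokenization and lowercasing in place of `DependencyNetwork.parser`, and a small hand-written word set (`KNOWN_WORDS`, a hypothetical name) in place of `checker.check`. It keeps the same two rejection rules: more than 15 tokens, or any unrecognized token.

```python
# Stand-in for the enchant dictionary: a tiny, hypothetical set of known words.
KNOWN_WORDS = {'a', 'man', 'is', 'outside', 'the', 'dog', 'runs', '.'}
MAX_TOKENS = 15  # same length cutoff as repair() in the notebook


def repair_sketch(sentence, known_words=KNOWN_WORDS, max_tokens=MAX_TOKENS):
    """Return the sentence if it passes both filters, otherwise None."""
    # Assumption: whitespace tokenization instead of the dependency parser.
    tokens = sentence.lower().split()
    if len(tokens) > max_tokens:
        return None
    if any(token not in known_words for token in tokens):
        return None
    return sentence


print(repair_sketch('A man is outside .'))   # passes both filters
print(repair_sketch('A mann is outside .'))  # rejected: unknown token "mann"
```

A pair is kept only when both of its sentences survive this filter, which is why the cleaned splits are smaller than the raw ones.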


Next, we'll build a vocab from the set of cleaned sentence pairs. 

In [4]:
def build_vocab(data):
    vocab = set()
    for item in data:
        s1 = item.sentence1
        s2 = item.sentence2
        
        t1 = DependencyNetwork.parser(s1)
        t2 = DependencyNetwork.parser(s2)
        
        # Sets ignore duplicates, so no membership check is needed.
        for t in t1:
            vocab.add(t.text)
        for t in t2:
            vocab.add(t.text)

    return sorted(vocab)

data = clean_dev + clean_test + clean_train
vocab = build_vocab(data)

In [5]:
print(len(vocab))

22555


Now we can restrict each split to the sentence pairs labelled as entailments, since these are the pairs used to train and evaluate the generator:

In [6]:
train_data = [d for d in clean_train if d.label == 'entailment'] # or d.label == 'neutral']
test_data = [d for d in clean_test if d.label == 'entailment'] # or d.label == 'neutral']
dev_data = [d for d in clean_dev if d.label == 'entailment'] # or d.label == 'neutral']

print(len(train_data))
print(len(test_data))
print(len(dev_data))

106288
1666
1701


In [7]:
dim = 300
iters = 50
rate = 0.0006
batchsize = 10000

vectors = 'w2v_embeddings.pickle'

with open('w2v_dep_vocabs.pickle', 'rb') as pfile:
    subvocabs = pickle.load(pfile)

encoder = DependencyNetwork(dim=dim, vocab=vocab, pretrained=vectors)
decoder = EmbeddingGenerator(dim=dim, subvocabs=subvocabs, vectors=vectors)

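for i in range(iters):
    print('On iteration ', i)
    # Halve the learning rate late in training.
    if i == 45:
        rate = rate / 2.0
    # Note: with iters = 50 the loop index stops at 49, so this branch is never reached.
    if i == 50:
        rate = rate / 2.0
    
    batch = random.sample(train_data, batchsize)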
    for sample in batch:
        s1 = sample.sentence1
        s2 = sample.sentence2

        encoder.forward_pass(s1)        
        decoder.forward_pass(s2, encoder.get_root_embedding())
        decoder.backward_pass(rate=rate)
        encoder.backward_pass(decoder.pass_grad, rate=rate)

On iteration  0
On iteration  1
On iteration  2
On iteration  3
On iteration  4
On iteration  5
On iteration  6
On iteration  7
On iteration  8
On iteration  9
On iteration  10
On iteration  11
On iteration  12
On iteration  13
On iteration  14
On iteration  15
On iteration  16
On iteration  17
On iteration  18
On iteration  19
On iteration  20
On iteration  21
On iteration  22
On iteration  23
On iteration  24
On iteration  25
On iteration  26
On iteration  27
On iteration  28
On iteration  29
On iteration  30
On iteration  31
On iteration  32
On iteration  33
On iteration  34
On iteration  35
On iteration  36
On iteration  37
On iteration  38
On iteration  39
On iteration  40
On iteration  41
On iteration  42
On iteration  43
On iteration  44
On iteration  45
On iteration  46
On iteration  47
On iteration  48
On iteration  49
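
The step schedule used in the training loop can be inspected on its own. The helper below (`rate_schedule`, a hypothetical name) replays only the halving logic, with the same starting rate and iteration count as the cell above; it does not touch the networks.

```python
def rate_schedule(rate, iters, halve_at=(45, 50)):
    """Return the learning rate in effect at each iteration."""
    rates = []
    for i in range(iters):
        if i in halve_at:
            rate = rate / 2.0
        rates.append(rate)
    return rates


rates = rate_schedule(0.0006, 50)
print(rates[44], rates[45], rates[49])
# With 50 iterations the loop index stops at 49, so the halving at 50 never fires
# and the rate stays the same from iteration 45 to the end.
print(rates[45] == rates[49])
```

Running for 51 or more iterations would be needed for the second halving to take effect.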


## Simple Entailment Generation Examples

This small amount of data probably isn't enough to generalize outside of the training set, so we'll first check how well the learned decoder is able to generate the entailments it has been trained on.

In [102]:
sample_trees = [d for d in test_data if 5 < len(d.sentence2.split()) < 10]
batch = random.sample(test_data, 10)

for sample in batch:
    s1 = sample.sentence1
    s2 = sample.sentence2
    randsen = random.choice(sample_trees)

    encoder.forward_pass(s1)
    decoder.forward_pass(s2, encoder.get_root_embedding())

    predicted = [node.pword for node in decoder.tree]
    true = [node.lower_ for node in decoder.tree]

    print('Sentence: ', s1)
    print('Predicted Entailment: ', ' '.join(predicted))
    print('Actual Entailment: ', ' '.join(true))
    
    decoder.forward_pass(randsen.sentence2, encoder.get_root_embedding())
    alternate = [node.pword for node in decoder.tree]
    print('Random Tree Entailment: ', ' '.join(alternate))
    print('')

Sentence:  A surfer is performing a jumping stunt in the ocean.
Predicted Entailment:  a is in the ocean .
Actual Entailment:  a person in the water .
Random Tree Entailment:  a surfer and a surfboard is outside .

Sentence:  The football player prepares to kick the ball.
Predicted Entailment:  the player is playing the ball .
Actual Entailment:  a person is playing a sport .
Random Tree Entailment:  there playing a player playing outside to the football .

Sentence:  The guitarist performs a rocking solo.
Predicted Entailment:  the guitarist is performs
Actual Entailment:  the musician is performing
Random Tree Entailment:  two guitarist performs loud to perform her musician perform .

Sentence:  A little boy playing outside on the cement.
Predicted Entailment:  a boy playing on cement .
Actual Entailment:  a boy is outside playing .
Random Tree Entailment:  a little boy is playing on the cement cement .

Sentence:  An elderly woman wearing a skirt is picking out vegetables at a local

We can also generate entailments using randomly chosen trees for the decoding network structure. This doesn't work very well.

In [23]:
batch = random.sample(test_data, 20)

for sample in batch:
    s1 = sample.sentence1
    s2 = sample.sentence2
    randsen = random.choice(sample_trees)

    encoder.forward_pass(s1)
    decoder.forward_pass(s2, encoder.get_root_embedding())

    predicted = [node.pword for node in decoder.tree]
    true = [node.lower_ for node in decoder.tree]
    
    print('Sentence: ', s1)
    print('Predicted Entailment: ', ' '.join(predicted))
    print('Actual Entailment: ', ' '.join(true))

    decoder.forward_pass(randsen.sentence2, encoder.get_root_embedding())
    alternate = [node.pword for node in decoder.tree]
    print('Random Tree Entailment: ', ' '.join(alternate))
    print('')

Sentence:  Two women stand in a kitchen and wipe down trays
Predicted Entailment:  two women are in the kitchen .
Actual Entailment:  two women stand in a kitchen .
Random Tree Entailment:  there are a women in the kitchen .

Sentence:  A wet child stands in chest deep ocean water.
Predicted Entailment:  a child is is in the water .
Actual Entailment:  the child s playing on the beach .
Random Tree Entailment:  a is standing in the water

Sentence:  Two little boys are standing in a kitchen.
Predicted Entailment:  the boys are not happy .
Actual Entailment:  the kitchen is not empty .
Random Tree Entailment:  the boys are are up a food .

Sentence:  A couple sits in the grass.
Predicted Entailment:  couple are outside .
Actual Entailment:  people are outside .
Random Tree Entailment:  a couple is are outside to their grass .

Sentence:  3 people on plain boats smiling towards the camera.
Predicted Entailment:  a are camera on the boat
Actual Entailment:  3 smiling people on a boat
Rand

## Generating Entailment Chains (i.e. Inferential Roles)

We can also generate entailment chains by re-encoding a generated sentence, and then generating a new sentence from the subsequent encoding. This is kind of neat because it allows us to distill what the model has learned into a network of inferential relationships between sentences. Philosophers sometimes argue that the meaning of a sentence is determined by its role or location in such a network.

In [10]:
s1 = 'A man curls up in a blanket on the street.'
s2 = 'A dog chases in a field.'
s3 = 'A frog is cold.'

def predict(encoder, decoder, s1, s2, s3):
    encoder.forward_pass(s1)
    decoder.forward_pass(s2, encoder.get_root_embedding())

    true = [node.lower_ for node in decoder.tree]
    predicted = [node.pword for node in decoder.tree]

    print('Sentence: ', s1)
    print('Predicted Entailment: ', ' '.join(predicted))

    encoder.forward_pass(' '.join(predicted))
    decoder.forward_pass(s3, encoder.get_root_embedding())

    predicted = [node.pword for node in decoder.tree]
    print('Next Prediction: ', ' '.join(predicted))
    print('')

predict(encoder, decoder, s1, s2, s3)
    
s1 = 'A group of Asian men pose around a large table after enjoying a meal together.'
s2 = 'Some people pose for a picture'
s3 = 'The group takes a picture.'

predict(encoder, decoder, s1, s2, s3)

s1 = 'Two police officers are sitting on motorcycles in the road.'
s2 = 'Two policemen sit on their bikes.'
s3 = 'The men have big guns.'

predict(encoder, decoder, s1, s2, s3)

s1 = 'Five people are playing in a gymnasium.'
s2 = 'Some people are competing indoors.'
s3 = 'Some people are inside.'

predict(encoder, decoder, s1, s2, s3)

s1 = 'A woman, whose face can only be seen in a mirror, is applying eyeliner in a dimly lit room.'
s2 = 'The woman applies eyeliner.'
s3 = 'The red woman applies green eyeliner.'

predict(encoder, decoder, s1, s2, s3)

Sentence:  A man curls up in a blanket on the street.
Predicted Entailment:  a man is in the street .
Next Prediction:  a man is outside .

Sentence:  A group of Asian men pose around a large table after enjoying a meal together.
Predicted Entailment:  a group eating at a table
Next Prediction:  the group eating a food .

Sentence:  Two police officers are sitting on motorcycles in the road.
Predicted Entailment:  two officers are on their road .
Next Prediction:  the officers are same road .

Sentence:  Five people are playing in a gymnasium.
Predicted Entailment:  the people are are indoors .
Next Prediction:  the people are indoors .

Sentence:  A woman, whose face can only be seen in a mirror, is applying eyeliner in a dimly lit room.
Predicted Entailment:  a woman is makeup .
Next Prediction:  a female woman woman physical makeup .
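
The two-step chain in `predict` extends naturally to longer chains. The sketch below is a generic loop under stated assumptions: `encode` and `decode` are hypothetical callables standing in for the encoder and decoder passes, and the toy `decode` simply looks up a hand-written table seeded with two outputs printed above, so nothing here calls the trained networks.

```python
def entailment_chain(sentence, templates, encode, decode):
    """Repeatedly encode the latest sentence and decode a new one from it.

    templates supplies one decoding tree (given here as a sentence) per step,
    mirroring the role of s2 and s3 in predict().
    """
    chain = [sentence]
    for template in templates:
        embedding = encode(chain[-1])
        chain.append(decode(embedding, template))
    return chain


# Toy stand-ins (assumptions): the "embedding" is just the sentence itself and
# the "decoder" is a lookup table copied from the printed output above.
TOY_TABLE = {
    'A man curls up in a blanket on the street.': 'a man is in the street .',
    'a man is in the street .': 'a man is outside .',
}


def toy_encode(sentence):
    return sentence


def toy_decode(embedding, template):
    # The template is ignored by this toy decoder.
    return TOY_TABLE[embedding]


chain = entailment_chain(
    'A man curls up in a blanket on the street.',
    ['A dog chases in a field.', 'A frog is cold.'],
    toy_encode,
    toy_decode,
)
print(' -> '.join(chain))
```

With the real networks, `encode` would wrap `encoder.forward_pass` followed by `encoder.get_root_embedding()`, and `decode` would wrap `decoder.forward_pass` and join the `pword` of each node in `decoder.tree`.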



## Substitutional Analysis

Finally, it is also possible to examine the effect a given word or phrase has on entailment generation via substitutions. Essentially, this involves looking at the difference made to the most likely entailment when a given word or phrase in the input sentence is replaced with another word or phrase.

In [426]:
s2 = 'the dog is on her phone'
s3 = 'the dog is outside'
s4 = 'the dog is selling the bone'
s5 = 'a dog wearing some clothes is indoors'
s6 = 'a dog are inside a car'
s7 = 'the boy is red'
s8 = 'three people are indoors'
s9 = 'a boy is not indoors'

def sub_predict(encoder, decoder, s1, s2):
    
    encoder.forward_pass(s1)
    decoder.forward_pass(s2, encoder.get_root_embedding())

    true = [node.lower_ for node in decoder.tree]
    predicted = [node.pword for node in decoder.tree]

    print('Sentence: ', s1)
    print('Predicted Entailment: ', ' '.join(predicted))
    print('')    

    
s1 = 'A boy in a beige shirt is sleeping in a car.'
sub_predict(encoder, decoder, s1, s2)

s1 = 'A girl in a beige shirt is sleeping in a car.'
sub_predict(encoder, decoder, s1, s2)

s1 = 'A man in a beige shirt is sleeping in a car.'
sub_predict(encoder, decoder, s1, s2)

s1 = 'A woman in a beige shirt is sleeping in a car.'
sub_predict(encoder, decoder, s1, s2)

s1 = 'A boy in a beige shirt is sleeping in a car.'
sub_predict(encoder, decoder, s1, s3)

s1 = 'A woman in a beige shirt is sleeping in a car.'
sub_predict(encoder, decoder, s1, s3)

s1 = 'A man in a beige shirt is driving in a car.'
sub_predict(encoder, decoder, s1, s4)

s1 = 'A person in a beige shirt is selling her car.'
sub_predict(encoder, decoder, s1, s4)

s1 = 'A boy in a red shirt is waiting in a store.'
sub_predict(encoder, decoder, s1, s5)

s1 = 'Some men in red shirts are waiting in a store.'
sub_predict(encoder, decoder, s1, s6)

s1 = 'Many women in red shirts are waiting in a store.'
sub_predict(encoder, decoder, s1, s6)

s1 = 'A boy and a girl are waiting in a store.'
sub_predict(encoder, decoder, s1, s8)

s1 = 'A boy and a girl are waiting in a playground.'
sub_predict(encoder, decoder, s1, s8)

s1 = 'A boy in a red shirt is sleeping in a car.'
sub_predict(encoder, decoder, s1, s9)

s1 = 'A boy in a red shirt is waiting in a store.'
sub_predict(encoder, decoder, s1, s9)

Sentence:  A boy in a beige shirt is sleeping in a car.
Predicted Entailment:  a boy is in his car

Sentence:  A girl in a beige shirt is sleeping in a car.
Predicted Entailment:  a girl is in her car

Sentence:  A man in a beige shirt is sleeping in a car.
Predicted Entailment:  a man sleeping in his car

Sentence:  A woman in a beige shirt is sleeping in a car.
Predicted Entailment:  a woman is in her car

Sentence:  A boy in a beige shirt is sleeping in a car.
Predicted Entailment:  a boy is asleep

Sentence:  A woman in a beige shirt is sleeping in a car.
Predicted Entailment:  a woman is asleep

Sentence:  A man in a beige shirt is driving in a car.
Predicted Entailment:  a man is driving a car

Sentence:  A person in a beige shirt is selling her car.
Predicted Entailment:  a person is selling a car

Sentence:  A boy in a red shirt is waiting in a store.
Predicted Entailment:  a boy wearing a shirt is indoors

Sentence:  Some men in red shirts are waiting in a store.
Predicted Ent

In [198]:
s1 = 'A fisherman using a cellphone on a boat.'
s2 = 'A man is on the street'
sub_predict(encoder, decoder, s1, s2)

s1 = 'A Man is eating food next to a child on a bench.'
s2 = 'A man is on the street'
sub_predict(encoder, decoder, s1, s2)

s1 = 'A shirtless man skateboards on a ledge.'
s2 = 'A man is on the street'
sub_predict(encoder, decoder, s1, s2)

s1 = 'A man wearing a hat and boots is digging for something in the snow.'
s2 = 'A man is on the street'
sub_predict(encoder, decoder, s1, s2)

s1 = 'A man is on a boat.'
s2 = 'A man is outside'
sub_predict(encoder, decoder, s1, s2)

s1 = 'A man is on a bench.'
s2 = 'A man is outside'
sub_predict(encoder, decoder, s1, s2)

s1 = 'A man is on a skateboard.'
s2 = 'A man is outside'
sub_predict(encoder, decoder, s1, s2)

s1 = 'A man is in the snow.'
s2 = 'A man is outside'
sub_predict(encoder, decoder, s1, s2)

Sentence:  A fisherman using a cellphone on a boat.
Predicted Entailment:  a man is on a boat

Sentence:  A Man is eating food next to a child on a bench.
Predicted Entailment:  a man is on a bench

Sentence:  A shirtless man skateboards on a ledge.
Predicted Entailment:  a man is on a skateboard

Sentence:  A man wearing a hat and boots is digging for something in the snow.
Predicted Entailment:  a man is in the snow

Sentence:  A man is on a boat.
Predicted Entailment:  a man is outside

Sentence:  A man is on a bench.
Predicted Entailment:  a man is outside

Sentence:  A man is on a skateboard.
Predicted Entailment:  a man is outside

Sentence:  A man is in the snow.
Predicted Entailment:  a man is outside
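
One way to summarize a substitution is to list the positions at which two predicted entailments differ. The helper below (`diff_tokens`, a hypothetical name) is a minimal sketch that assumes both predictions were decoded with the same tree, so they have the same number of tokens; the example strings are copied from the outputs above.

```python
def diff_tokens(pred_a, pred_b):
    """Return (position, token_a, token_b) for each position that differs."""
    tokens_a = pred_a.split()
    tokens_b = pred_b.split()
    if len(tokens_a) != len(tokens_b):
        raise ValueError('predictions must have the same number of tokens')
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(tokens_a, tokens_b))
        if a != b
    ]


# Substituting "girl" for "boy" in the input changes both the noun and the
# possessive pronoun in the predicted entailment.
print(diff_tokens('a boy is in his car', 'a girl is in her car'))

# Changing only the location in the input changes only the location word.
print(diff_tokens('a man is on a boat', 'a man is on a bench'))
```

An empty list means the substitution made no difference to the most likely entailment, as with the four "a man is outside" predictions above.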



In [18]:
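# Word-level accuracy on the training set: the fraction of decoder tree
# nodes whose predicted word matches the target word.
total = 0
correct = 0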

for item in train_data:
    encoder.forward_pass(item.sentence1)
    decoder.forward_pass(item.sentence2, encoder.get_root_embedding())
    
    for node in decoder.tree:
        total += 1
        if node.pword.lower() == node.lower_:
            correct += 1
            
accuracy = float(correct / total)
print(accuracy)

0.6669988931826031


In [19]:
# The same word-level accuracy, measured on the held-out test set.
total = 0
correct = 0

for item in test_data:
    encoder.forward_pass(item.sentence1)
    decoder.forward_pass(item.sentence2, encoder.get_root_embedding())
    
    for node in decoder.tree:
        total += 1
        if node.pword.lower() == node.lower_:
            correct += 1
            
accuracy = float(correct / total)
print(accuracy)

0.6176445248556905
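
The two accuracy cells above differ only in the dataset they loop over. A sketch of the shared word-level metric, written against plain lists of predicted and target words, might look like the following. The list-based interface is an assumption made to keep the example self-contained: the real loop reads `node.pword` and `node.lower_` from `decoder.tree`.

```python
def word_accuracy(pairs):
    """Fraction of positions where the predicted word matches the target.

    pairs is an iterable of (predicted_words, target_words) lists of equal
    length, one pair per sentence.
    """
    total = 0
    correct = 0
    for predicted, target in pairs:
        for pword, tword in zip(predicted, target):
            total += 1
            if pword.lower() == tword.lower():
                correct += 1
    # Guard against an empty dataset.
    return correct / total if total else 0.0


# Two predictions copied from the first batch of examples in this notebook.
pairs = [
    ('a is in the ocean .'.split(), 'a person in the water .'.split()),
    ('the guitarist is performs'.split(), 'the musician is performing'.split()),
]
print(word_accuracy(pairs))
```

Because the decoder is given the tree of the target sentence, this metric counts word matches at fixed tree positions; it says nothing about how well the model would choose a tree on its own.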
