# Generating Sentences with TreeRNNs

This notebook walks through a minimal example of encoding one sentence into a distributed representation using a TreeRNN, and then using this representation to generate another sentence by running a different TreeRNN in reverse. To start, we'll do some data cleaning to make sure we have a good set of sentence pairs to train on. The main goal here is to remove sentences that contain misspelled words or other oddities, along with sentences longer than 15 tokens.

In [1]:
import enchant 
import random
import pickle
import numpy as np

from collections import namedtuple
from pysem.corpora import SNLI
from pysem.networks import DependencyNetwork
from pysem.generatives import EmbeddingGenerator

checker = enchant.Dict('en_US')
TrainingPair = namedtuple('TrainingPair', ['sentence1', 'sentence2', 'label'])

snli = SNLI('/home/pblouw/snli_1.0/')
snli.load_xy_pairs()

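def repair(sen):
    # Return the sentence unchanged if it passes the filters, otherwise None.
    tokens = DependencyNetwork.parser(sen)
    # Drop long sentences.
    if len(tokens) > 15:
        return None
    # Drop sentences containing any token the spell checker rejects.
    for token in tokens:
        if not checker.check(token.text):
            return None
    return sen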

def clean_data(data):
    clean = []
    for item in data:
        
        s1 = repair(item.sentence1)
        s2 = repair(item.sentence2)
        if s1 is None or s2 is None:
            continue
        else:
            clean.append(TrainingPair(s1, s2, item.label))
    
    return clean

In [2]:
clean_dev = clean_data(snli.dev_data)
clean_train = clean_data(snli.train_data)
clean_test = clean_data(snli.test_data)

In [3]:
print(len(clean_dev))
print(len(clean_test))
print(len(clean_train))

4955
4839
306651


Next, we'll build a vocabulary from the set of cleaned sentence pairs.

In [4]:
def build_vocab(data):
    vocab = set()
    for item in data:
        s1 = item.sentence1
        s2 = item.sentence2
        
        t1 = DependencyNetwork.parser(s1)
        t2 = DependencyNetwork.parser(s2)
        
        # Adding to a set is idempotent, so no membership check is needed.
        for t in t1:
            vocab.add(t.text)
        for t in t2:
            vocab.add(t.text)

    return sorted(vocab)

data = clean_dev + clean_test + clean_train
vocab = build_vocab(data)

In [5]:
print(len(vocab))

22555


Now we can collect all of the sentence pairs standing in entailment relations to one another.

In [6]:
train_data = [d for d in clean_train if d.label == 'entailment'] # or d.label == 'neutral']
test_data = [d for d in clean_test if d.label == 'entailment'] # or d.label == 'neutral']
dev_data = [d for d in clean_dev if d.label == 'entailment'] # or d.label == 'neutral']

print(len(train_data))
print(len(test_data))
print(len(dev_data))

106288
1666
1701


In [8]:
dim = 300
iters = 100
rate = 0.0006
batchsize = 10000

vectors = 'w2v_embeddings.pickle'

with open('w2v_dep_vocabs.pickle', 'rb') as pfile:
    subvocabs = pickle.load(pfile)

encoder = DependencyNetwork(dim=dim, vocab=vocab, pretrained=vectors)
decoder = EmbeddingGenerator(dim=dim, subvocabs=subvocabs, vectors=vectors)

for i in range(iters):
    print('On iteration ', i)
    # Halve the learning rate at iterations 60, 75, and 90.
    if i in (60, 75, 90):
        rate = rate / 2.0
    
    batch = random.sample(train_data, batchsize)
    for sample in batch:
        s1 = sample.sentence1
        s2 = sample.sentence2

        encoder.forward_pass(s1)        
        decoder.forward_pass(s2, encoder.get_root_embedding())
        decoder.backward_pass(rate=rate)
        encoder.backward_pass(decoder.pass_grad, rate=rate)

On iteration  0
On iteration  1
On iteration  2
On iteration  3
On iteration  4
On iteration  5
On iteration  6
On iteration  7
On iteration  8
On iteration  9
On iteration  10
On iteration  11
On iteration  12
On iteration  13
On iteration  14
On iteration  15
On iteration  16
On iteration  17
On iteration  18
On iteration  19
On iteration  20
On iteration  21
On iteration  22
On iteration  23
On iteration  24
On iteration  25
On iteration  26
On iteration  27
On iteration  28
On iteration  29
On iteration  30
On iteration  31
On iteration  32
On iteration  33
On iteration  34
On iteration  35
On iteration  36
On iteration  37
On iteration  38
On iteration  39
On iteration  40
On iteration  41
On iteration  42
On iteration  43
On iteration  44
On iteration  45
On iteration  46
On iteration  47
On iteration  48
On iteration  49
On iteration  50
On iteration  51
On iteration  52
On iteration  53
On iteration  54
On iteration  55
On iteration  56
On iteration  57
On iteration  58
On iter
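
The training loop above halves the learning rate at iterations 60, 75, and 90. As a rough sketch, the same schedule can be written as a pure function of the iteration number, which makes it easier to inspect or change. The name `step_decay` and its `milestones` and `factor` parameters are illustrative assumptions and are not part of `pysem`.

```python
def step_decay(base_rate, iteration, milestones=(60, 75, 90), factor=0.5):
    """Return the learning rate in effect at `iteration`.

    The base rate is multiplied by `factor` once for every milestone that
    has been reached, mirroring the three halvings in the training loop.
    """
    passed = sum(1 for m in milestones if iteration >= m)
    return base_rate * factor ** passed


# Print the rate just before and at each milestone.
for i in (0, 59, 60, 75, 90, 99):
    print(i, step_decay(0.0006, i))
```

With a function like this, the loop body could call `step_decay(0.0006, i)` instead of mutating `rate` in place.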

## Simple Entailment Generation Examples

This small amount of data probably isn't enough to generalize outside of the training set, so we'll first check how well the learned decoder is able to generate the entailments it has been trained on.

In [27]:
sample_trees = [d for d in dev_data if 5 < len(d.sentence2.split()) < 10]
batch = random.sample(train_data, 5)

for sample in batch:
    s1 = sample.sentence1
    s2 = sample.sentence2

    encoder.forward_pass(s1)
    decoder.forward_pass(s2, encoder.get_root_embedding())

    predicted = [node.pword for node in decoder.tree]
    true = [node.lower_ for node in decoder.tree]

    print('Sentence: ', s1)
    print('Predicted Entailment: ', ' '.join(predicted))
    print('Actual Entailment: ', ' '.join(true))
    print('')

Sentence:  Two young boys running down an upper level breezeway.
Predicted Entailment:  the boys are running
Actual Entailment:  some boys are running

Sentence:  Three people with their back turned are standing inside a building.
Predicted Entailment:  three people are their building inside a building
Actual Entailment:  three people turn their backs inside the building

Sentence:  Young children playing in a pile of toilet paper.
Predicted Entailment:  children are playing together .
Actual Entailment:  kids are playing together .

Sentence:  A man climbing a cliff.
Predicted Entailment:  a man is climbing up a cliff .
Actual Entailment:  a man is climbing up a cliff .

Sentence:  A white dog holds a stick while swimming.
Predicted Entailment:  a white dog is swimming .
Actual Entailment:  the white dog is swimming .



We can also generate entailments using randomly chosen trees for the decoding network structure. This doesn't work very well.

In [10]:
batch = random.sample(dev_data, 5)

for sample in batch:
    s1 = sample.sentence1
    s2 = sample.sentence2
    randsen = random.choice(sample_trees)

    encoder.forward_pass(s1)
    decoder.forward_pass(s2, encoder.get_root_embedding())

    predicted = [node.pword for node in decoder.tree]
    true = [node.lower_ for node in decoder.tree]
    
    print('Sentence: ', s1)
    print('Predicted Entailment: ', ' '.join(predicted))
    print('Actual Entailment: ', ' '.join(true))

    decoder.forward_pass(randsen.sentence2, encoder.get_root_embedding())
    alternate = [node.pword for node in decoder.tree]
    print('Random Tree Entailment: ', ' '.join(alternate))
    print('')

Sentence:  The greyhounds are running quickly in this race.
Predicted Entailment:  the greyhounds are running fast in the race .
Actual Entailment:  the dogs are running quickly in this race .
Random Tree Entailment:  the greyhounds are running a race .

Sentence:  Four black teenagers are playing basketball inside a gymnasium while another one watches.
Predicted Entailment:  four are playing in a gym .
Actual Entailment:  four teenagers playing inside a gym .
Random Tree Entailment:  some people are inside a gym .

Sentence:  Dog herding cows
Predicted Entailment:  dog herding near a livestock .
Actual Entailment:  animals are near each other .
Random Tree Entailment:  a dog herding near a livestock .

Sentence:  A dog runs along the shore of a pond with two elegant geese swimming.
Predicted Entailment:  a dog is at the water of the water outside .
Actual Entailment:  a dog runs along the edge of a pond outdoors .
Random Tree Entailment:  a dog is at the water .

Sentence:  A man runn

## Generating Entailment Chains (i.e. Inferential Roles)

We can also generate entailment chains by re-encoding a generated sentence and then generating a new sentence from the resulting encoding. This is kind of neat because it allows us to distill what the model has learned into a network of inferential relationships between sentences. Philosophers sometimes argue that the meaning of a sentence is determined by its role or location in such a network.

In [11]:
s1 = 'A man curls up in a blanket on the street.'
s2 = 'A dog chases in a field.'
s3 = 'A frog is cold.'

def predict(encoder, decoder, s1, s2, s3):
    encoder.forward_pass(s1)
    decoder.forward_pass(s2, encoder.get_root_embedding())

    true = [node.lower_ for node in decoder.tree]
    predicted = [node.pword for node in decoder.tree]

    print('Sentence: ', s1)
    print('Predicted Entailment: ', ' '.join(predicted))

    encoder.forward_pass(' '.join(predicted))
    decoder.forward_pass(s3, encoder.get_root_embedding())

    predicted = [node.pword for node in decoder.tree]
    print('Next Prediction: ', ' '.join(predicted))
    print('')

predict(encoder, decoder, s1, s2, s3)
    
s1 = 'A group of Asian men pose around a large table after enjoying a meal together.'
s2 = 'Some people pose for a picture'
s3 = 'The group takes a picture.'

predict(encoder, decoder, s1, s2, s3)

s1 = 'Two police officers are sitting on motorcycles in the road.'
s2 = 'Two policemen sit on their bikes.'
s3 = 'The men have big guns.'

predict(encoder, decoder, s1, s2, s3)

s1 = 'Five people are playing in a gymnasium.'
s2 = 'Some people are competing indoors.'
s3 = 'Some people are inside.'

predict(encoder, decoder, s1, s2, s3)

s1 = 'A woman, whose face can only be seen in a mirror, is applying eyeliner in a dimly lit room.'
s2 = 'The woman applies eyeliner.'
s3 = 'The red woman applies green eyeliner.'

predict(encoder, decoder, s1, s2, s3)

Sentence:  A man curls up in a blanket on the street.
Predicted Entailment:  a man is on the blanket .
Next Prediction:  a man is outside .

Sentence:  A group of Asian men pose around a large table after enjoying a meal together.
Predicted Entailment:  a men eating at a table
Next Prediction:  the men eating a food .

Sentence:  Two police officers are sitting on motorcycles in the road.
Predicted Entailment:  two officers are on their road .
Next Prediction:  the officers are same outdoors .

Sentence:  Five people are playing in a gymnasium.
Predicted Entailment:  the people are are together .
Next Prediction:  the people are together .

Sentence:  A woman, whose face can only be seen in a mirror, is applying eyeliner in a dimly lit room.
Predicted Entailment:  a woman is makeup .
Next Prediction:  a female woman makeup new makeup .
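
The `predict` function above chains exactly two generation steps. A minimal sketch of the general pattern, for any number of steps, is given below. Here `generate` is a hypothetical callable that maps a premise string to a generated string. In this notebook it would have to wrap an encoder pass, a decoder pass, and the choice of a sentence that supplies the decoding tree. The `toy_generate` function exists only to make the sketch runnable and does not behave like the trained decoder.

```python
def entailment_chain(generate, sentence, steps=2):
    """Repeatedly feed each generated sentence back in as the next premise.

    `generate` is any callable mapping a premise string to a generated
    string; it is a hypothetical stand-in for an encode-then-decode step.
    """
    chain = [sentence]
    for _ in range(steps):
        chain.append(generate(chain[-1]))
    return chain


# Toy generator used only to make the sketch runnable: it drops the last
# word of the premise, which is not how the trained decoder behaves.
def toy_generate(premise):
    words = premise.split()
    return ' '.join(words[:-1]) if len(words) > 1 else premise


print(entailment_chain(toy_generate, 'a man is outside today', steps=2))
```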



In [12]:
def condition(encoder, decoder, s1, s2, cond):
    encoder.forward_pass(s1)
    decoder.forward_pass(s2, encoder.get_root_embedding() + cond)

    true = [node.lower_ for node in decoder.tree]
    predicted = [node.pword for node in decoder.tree]
    print('Predicted Entailment: ', ' '.join(predicted))
    
s1 = 'A woman, whose face can only be seen in a mirror, is applying eyeliner in a dimly lit room.'
s2 = 'a blond woman applying eyeliner'
cond_sen = 'A person in a hooded shirt is photographing a woman.'
encoder.forward_pass(cond_sen)
cond = encoder.get_root_embedding()

print('Sentence: ', s1)
print('Conditioning Context: ', cond_sen)

encoder.forward_pass('')
condition(encoder, decoder, s1, s2, cond)   

s1 = 'A shirtless man sleeps in his blue boat out on the open waters.'
s2 = 'The red man is in the big boat.'
cond_word = 'fishing'
cond = encoder.vectors[cond_word]

print('')
print('Sentence: ', s1)
print('Conditioning Context: ', cond_word)

encoder.forward_pass('')
condition(encoder, decoder, s1, s2, cond)

s1 = 'Seven women stand and sit around a waters edge and one of them women sitting in the middle with her bare feet in the water drinks from a water bottle.'
s2 = 'a big man is on a boat.'
cond_sen = 'What is in the water bottle?'

encoder.forward_pass(cond_sen)
cond = encoder.get_root_embedding()

print('')
print('Sentence: ', s1)
print('Conditioning Context: ', cond_sen)

condition(encoder, decoder, s1, s2, cond)

Sentence:  A woman, whose face can only be seen in a mirror, is applying eyeliner in a dimly lit room.
Conditioning Context:  A person in a hooded shirt is photographing a woman.
Predicted Entailment:  a human is performing makeup

Sentence:  A shirtless man sleeps in his blue boat out on the open waters.
Conditioning Context:  fishing
Predicted Entailment:  a shirtless man fishing in the blue water .

Sentence:  Seven women stand and sit around a waters edge and one of them women sitting in the middle with her bare feet in the water drinks from a water bottle.
Conditioning Context:  What is in the water bottle?
Predicted Entailment:  a female person sitting in the bottle .
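
The `condition` function above biases generation by adding a second embedding (`cond`) to the root embedding before decoding. The toy sketch below uses made-up 3-dimensional vectors rather than the 300-dimensional embeddings in this notebook, and only illustrates the geometric effect: the summed vector ends up closer, in cosine similarity, to the conditioning vector than the original root embedding was. The `cosine` helper is an assumption introduced for this illustration.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy 3-d vectors standing in for 300-d embeddings (values are made up).
root = np.array([1.0, 0.0, 0.0])   # stand-in for the encoded sentence
cond = np.array([0.0, 1.0, 0.0])   # stand-in for the conditioning context

# Additive conditioning, as in `encoder.get_root_embedding() + cond`.
conditioned = root + cond

print(cosine(root, cond))          # similarity before conditioning
print(cosine(conditioned, cond))   # similarity after conditioning
```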


In [13]:
s1 = 'Several runners compete in a road race.'
s2 = 'the dog ran quickly to the beach.'

def sub_predict(encoder, decoder, s1, s2):
    
    encoder.forward_pass(s1)
    decoder.forward_pass(s2, encoder.get_root_embedding())

    true = [node.lower_ for node in decoder.tree]
    predicted = [node.pword for node in decoder.tree]

    print('Sentence: ', s1)
    print('Predicted Entailment: ', ' '.join(predicted))
    print('')    

sub_predict(encoder, decoder, s1, s2)
    
s1 = 'Many runners compete in a road race.'
sub_predict(encoder, decoder, s1, s2)

s1 = 'Several runners compete in a talent show.'
sub_predict(encoder, decoder, s1, s2)

s1 = 'Several performers compete in a talent show.'
sub_predict(encoder, decoder, s1, s2)

s1 = 'Several performers perform in a music show.'
sub_predict(encoder, decoder, s1, s2)

s1 = 'Several swimmers compete in a fast race.'
sub_predict(encoder, decoder, s1, s2)

s1 = 'The swimmers compete in a road race.'
s2 = 'the boy is on the beach.'
sub_predict(encoder, decoder, s1, s2)

s1 = 'The swimmers compete in a swim race.'
s2 = 'a big man sits near that beach.'
sub_predict(encoder, decoder, s1, s2)

s1 = 'Several runners paint paintings.'
sub_predict(encoder, decoder, s1, s2)

s1 = 'The swimmers race before eating.'
sub_predict(encoder, decoder, s1, s2)

s1 = 'One very slow cyclist is in the indoor arena.'
sub_predict(encoder, decoder, s1, s2)

s1 = 'a little boy is in the water.'
sub_predict(encoder, decoder, s1, s2)

Sentence:  Several runners compete in a road race.
Predicted Entailment:  the runners are outside in a race .

Sentence:  Many runners compete in a road race.
Predicted Entailment:  the runners are outside in a race .

Sentence:  Several runners compete in a talent show.
Predicted Entailment:  the people compete together in a show .

Sentence:  Several performers compete in a talent show.
Predicted Entailment:  the performers compete together in a show .

Sentence:  Several performers perform in a music show.
Predicted Entailment:  the performers performing together in a show .

Sentence:  Several swimmers compete in a fast race.
Predicted Entailment:  the swimmers are fast in a race .

Sentence:  The swimmers compete in a road race.
Predicted Entailment:  the people are in the race .

Sentence:  The swimmers compete in a swim race.
Predicted Entailment:  the swim swimmers are in a race .

Sentence:  Several runners paint paintings.
Predicted Entailment:  the several people paint in th

In [14]:
for x in train_data[70:90]:
    print(x.sentence1)
    print(x.sentence2)
    print('')

Young blond woman putting her foot into a water fountain
A person is dipping her foot into water.

A young woman tries to stick her foot in a fountain.
A woman is near a fountain.

A young woman tries to stick her foot in a fountain.
The woman has one foot in the air.

Woman balancing on edge of fountain while sticking her toe in the water.
A woman stand on a fountain and dips her toes in.

A couple strolls arm and arm and hand in hand down a city sidewalk.
The couple is outdoors.

A man stare at a passing couple while walking down the block.
A man stares at a passing couple.

Two men in wheelchairs are reaching in the air for a basketball.
Two people in wheelchairs are reaching in the air for a basketball.

Three dogs in different shades of brown and white biting and licking each other.
there are three dogs

Three small puppies bite and play together in the grass.
Three puppies are playing outside.

Two dogs biting another dog in a field.
dogs attacking another dog

Tourists waiting a

In [15]:
s1 = 'The little boy is jumping into a puddle on the street.'
s2 = 'A boy plays with my favorite puddle'
sub_predict(encoder, decoder, s1, s2)

# s1 = 'An animal is jumping to catch an object.'
# s2 = 'A white moose walks.'
s1 = 'Young players engage in the sport of Water polo while others watch.'
s2 = 'A man jumps in the big puddle'
sub_predict(encoder, decoder, s1, s2)

# s2 = 'A white moose walks.'
s1 = 'A woman dancing by two men above the ceiling.'
s2 = 'The bed is dirty'
s3 = 'A man stares at some puddle'
sub_predict(encoder, decoder, s1, s3)

Sentence:  The little boy is jumping into a puddle on the street.
Predicted Entailment:  the boy is in his wet puddle

Sentence:  Young players engage in the sport of Water polo while others watch.
Predicted Entailment:  some players watch in an other polo

Sentence:  A woman dancing by two men above the ceiling.
Predicted Entailment:  the woman dancing near the men



In [16]:
for item in decoder.tree:
    print(item.lower_, item.dep_, item.head, item.pword)


a det man the
man nsubj stares woman
stares ROOT stares dancing
at prep stares near
some det puddle the
puddle pobj at men


In [17]:
for item in decoder.tree:
    probs = np.copy(item.probs)
    idx = np.argmax(probs)
    # Indices of the three most probable words (in no particular order).
    indices = np.argpartition(probs.flatten(), -3)[-3:]
    words = decoder.idx_to_wrd[item.dep_]
    print(words[idx], [(words[x], probs[x]) for x in indices])

the [('some', array([ 0.01932157])), ('a', array([ 0.43674835])), ('the', array([ 0.54203248]))]
woman [('people', array([ 0.06433475])), ('women', array([ 0.2472285])), ('woman', array([ 0.5238112]))]
dancing [('are', array([ 0.05906747])), ('is', array([ 0.10831368])), ('dancing', array([ 0.14677156]))]
near [('by', array([ 0.13327161])), ('in', array([ 0.14761176])), ('near', array([ 0.16761699]))]
the [('some', array([ 0.01390214])), ('a', array([ 0.045567])), ('the', array([ 0.9380372]))]
men [('girls', array([ 0.05206402])), ('women', array([ 0.12743021])), ('men', array([ 0.59933989]))]
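
The cell above uses `np.argpartition` to pick the three most probable words at each node. `argpartition` does not guarantee any order among the selected indices, so a small helper that also sorts them can make the output easier to read. The sketch below is one way to do this; `top_k` is a hypothetical helper and the probabilities in the example are made up.

```python
import numpy as np


def top_k(probs, k=3):
    """Return (index, probability) pairs for the k largest entries, highest first."""
    flat = np.asarray(probs).flatten()
    k = min(k, flat.size)
    # argpartition selects the k largest entries but leaves them unordered,
    # so sort just those k indices by descending probability.
    candidates = np.argpartition(flat, -k)[-k:]
    ordered = candidates[np.argsort(flat[candidates])[::-1]]
    return [(int(i), float(flat[i])) for i in ordered]


# Column vector of probabilities, shaped like the per-node `probs` above.
print(top_k(np.array([[0.02], [0.44], [0.54], [0.0]])))
```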


In [18]:
print(subvocabs['aux'])

{'stands', 'WERE', 'was', 'inhaling', 'walks', 'mask', 'am', 'hawk', 'square', 'has', 'outdoors', 're', 'give', 'love', 'would', 'jeep', 'being', 's', 'outdoor', 'unknowing', 'orange', 'practices', 'laying', 'goers', 'having', 'match', 'will', 'turned', 'ride', 'seafarer', 'ware', 'doing', 'goes', 'Men', 'Waves', 'getting', 'looks', 'outside', 'microwave', 'feels', 'stare', 'swimming', 'enjoys', 'both', 'ski', 'hugging', 'Player', 'have', 'handicapped', 'may', 'were', 'sits', 'prepares', 'sunshine', 'dressed', 'bungee', 'concrete', 'help', 'ARE', 'avoiding', 'crew', 'sitting', 'wearing', 'considers', 'band', 'birds', 'be', 'jump', 'been', 'd', 'sleeping', 'busy', 'pretend', 'id', 'shall', 'park', 'tries', 'gets', 'to', 'rock', 'willing', 'n', 'is', 'snow', 'tie', 'can', 'did', 'calmly', 'leans', 'hate', 'floats', 'enjoy', 'walk', 'building', 'had', 'sled', 'Is', 'lap', 'para', 'Will', 'must', 'should', 'are', 'could', 'runs', 'likes', 'TO', 'balancing', 'does', 'pep', 'To', 'walking', 

In [19]:
total = 0 
correct = 0

for item in train_data:
    encoder.forward_pass(item.sentence1)
    decoder.forward_pass(item.sentence2, encoder.get_root_embedding())
    
    for node in decoder.tree:
        total += 1
        if node.pword.lower() == node.lower_:
            correct += 1
            
accuracy = float(correct / total)
print(accuracy)

0.705805488400173


In [20]:
total = 0 
correct = 0

for item in test_data:
    encoder.forward_pass(item.sentence1)
    decoder.forward_pass(item.sentence2, encoder.get_root_embedding())
    
    for node in decoder.tree:
        total += 1
        if node.pword.lower() == node.lower_:
            correct += 1
            
accuracy = float(correct / total)
print(accuracy)

0.6186783837339537
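
The two accuracy cells above differ only in the dataset they iterate over. A minimal sketch of the shared computation as a reusable function follows. It takes plain lists of words rather than decoder trees, so it can run without a trained model; `word_accuracy` and the example data are assumptions made for illustration.

```python
def word_accuracy(pairs):
    """Fraction of positions where the predicted word matches the target.

    `pairs` is an iterable of (predicted_words, target_words) tuples, one
    per sentence. Predicted words are lowercased before comparison, as in
    the accuracy cells above.
    """
    total = 0
    correct = 0
    for predicted, target in pairs:
        for p, t in zip(predicted, target):
            total += 1
            if p.lower() == t:
                correct += 1
    return correct / total if total else 0.0


example = [
    (['the', 'boys', 'are', 'running'], ['some', 'boys', 'are', 'running']),
    (['a', 'white', 'dog'], ['the', 'white', 'dog']),
]
print(word_accuracy(example))
```

In the notebook, each pair could be built from `[node.pword for node in decoder.tree]` and `[node.lower_ for node in decoder.tree]` after a forward pass.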
