# Generating Sentences with TreeRNNs

This notebook walks through a minimal example of encoding a sentence into a distributed representation with a TreeRNN, and then using that representation to generate another sentence by running a different TreeRNN in reverse. To start, we'll do some data cleaning to make sure we have a good set of sentence pairs to train on. The main goal here is to remove sentences with misspelled words and other oddities.

In [4]:
import enchant 
import random
import pickle
import spacy
import nltk
import gensim 
import numpy as np

from collections import namedtuple
from pysem.corpora import SNLI
from pysem.networks import DependencyNetwork
from pysem.generatives import EmbeddingGenerator

checker = enchant.Dict('en_US')
TrainingPair = namedtuple('TrainingPair', ['sentence1', 'sentence2', 'label'])

snli = SNLI('/Users/peterblouw/corpora/snli_1.0/')
snli.load_xy_pairs()

def repair(sen):
    words = nltk.word_tokenize(sen)
    for word in words:
        if not checker.check(word):
            return None
    return sen

def clean_data(data):
    clean = []
    for item in data:
        
        s1 = repair(item.sentence1)
        s2 = repair(item.sentence2)
        if s1 is None or s2 is None:
            continue
        else:
            clean.append(TrainingPair(s1, s2, item.label))
    
    return clean

In [5]:
clean_dev = clean_data(snli.dev_data)
clean_train = clean_data(snli.train_data)
clean_test = clean_data(snli.test_data)
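
The filtering rule above (drop a pair as soon as either sentence contains a word the dictionary does not recognize) can be tried out without the SNLI corpus or `enchant`. The following is only a minimal sketch under stated assumptions: `KNOWN_WORDS`, `is_clean`, and `filter_pairs` are hypothetical names, a plain Python set stands in for the spell checker, and whitespace splitting stands in for `nltk.word_tokenize`.

```python
from collections import namedtuple

TrainingPair = namedtuple('TrainingPair', ['sentence1', 'sentence2', 'label'])

# Hypothetical stand-in for the enchant dictionary: a small set of known words.
KNOWN_WORDS = {'a', 'man', 'is', 'walking', 'outside', 'the', 'dog', 'runs'}

def is_clean(sentence, known=KNOWN_WORDS):
    """Return True if every whitespace-separated token is a known word."""
    tokens = sentence.lower().rstrip('.').split()
    return all(tok in known for tok in tokens)

def filter_pairs(pairs, known=KNOWN_WORDS):
    """Keep only the pairs in which both sentences pass the check."""
    return [p for p in pairs
            if is_clean(p.sentence1, known) and is_clean(p.sentence2, known)]

pairs = [
    TrainingPair('A man is walking.', 'A man is outside.', 'entailment'),
    TrainingPair('The dgo runs.', 'The dog runs.', 'entailment'),  # typo in sentence1
]
kept = filter_pairs(pairs)
print(len(kept))  # the pair containing the typo is dropped
```

Dropping whole pairs rather than repairing words keeps the cleaning step simple, at the cost of discarding some otherwise usable data.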

Now we can check to see whether any words in our cleaned data are missing from our model of Word2Vec embeddings:

In [10]:
with open('w2v_embeddings.pickle', 'rb') as pfile:
    model = pickle.load(pfile)

acc = []
def w2v_check(sen):
    words = nltk.word_tokenize(sen)
    for word in words:
        if word not in model:
            acc.append(word)
            return None

for item in clean_dev:
    s1 = item.sentence1
    s2 = item.sentence2
    
    w2v_check(s1)
    w2v_check(s2)

print(set(acc))
print(len(model))

{'2013', '50', '916', '2012', 'of', '10', 'a', '15', 'to', '40', 'and', '.', '52'}
24910
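
Note that `w2v_check` returns at the first missing word in each sentence, so the printed set may not list every out-of-vocabulary token. A variant that collects all of them could look like the sketch below. It is illustrative only: `missing_words`, `toy_embeddings`, and `toy_sentences` are hypothetical, and the toy table stands in for the pickled Word2Vec model.

```python
def missing_words(sentences, embeddings):
    """Collect every token that has no entry in the embedding table."""
    missing = set()
    for sentence in sentences:
        for token in sentence.split():
            if token not in embeddings:
                missing.add(token)
    return missing

# Toy embedding table (hypothetical values); a real table would map to 300-d vectors.
toy_embeddings = {'two': [0.1], 'dogs': [0.2], 'play': [0.3]}
toy_sentences = ['two dogs play', 'two dogs play in 2013']

print(sorted(missing_words(toy_sentences, toy_embeddings)))
```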


Next, we'll build a vocab from the set of cleaned sentence pairs. 

In [11]:
def build_vocab(data):
    vocab = set()
    for item in data:
        s1 = item.sentence1
        s2 = item.sentence2
        
        t1 = nltk.word_tokenize(s1)
        t2 = nltk.word_tokenize(s2)
        
        # Adding to a set is idempotent, so no membership check is needed.
        vocab.update(t1)
        vocab.update(t2)

    return sorted(vocab)

data = clean_dev + clean_test + clean_train
vocab = build_vocab(data)

In [14]:
print(len(vocab))

25280


In [16]:
depsets = {dep: set() for dep in DependencyNetwork.deps}
depsets[''] = set()

parser = spacy.load('en')
data = clean_dev + clean_test + clean_train

for item in data:
    s1 = item.sentence1
    s2 = item.sentence2

    s1_parse = parser(s1)
    s2_parse = parser(s2)

    # Record each token under its dependency label (set.add ignores duplicates).
    for token in s1_parse:
        depsets[token.dep_].add(token.text)
    for token in s2_parse:
        depsets[token.dep_].add(token.text)
            
with open('w2v_dep_vocabs.pickle', 'wb') as pfile:
    pickle.dump(depsets, pfile)

In [25]:
acc = set()
for x in depsets:
    for y in depsets[x]:
        if y not in acc:
            acc.add(y)
print(len(acc))
print(len(vocab))

count = 0
for x in vocab:
    if x not in model:
        count += 1

print(count)  # number of vocab words missing from the Word2Vec model

25283
25280
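
The two counts differ by three. One way to find the tokens responsible is a set difference between the union of the per-dependency vocabularies and the global vocabulary. The sketch below uses toy data only; `toy_depsets` and `toy_vocab` are hypothetical stand-ins for `depsets` and `vocab`, and it makes no claim about which tokens actually differ in the real data.

```python
# Toy per-dependency vocabularies (hypothetical contents).
toy_depsets = {
    'det': {'a', 'the'},
    'nsubj': {'man', 'dog'},
    'ROOT': {'runs', 'puddle'},
}
toy_vocab = ['a', 'dog', 'man', 'runs', 'the']

# Union of all per-dependency sets, then the difference in each direction.
dep_union = set().union(*toy_depsets.values())
only_in_parser = dep_union - set(toy_vocab)
only_in_vocab = set(toy_vocab) - dep_union

print(sorted(only_in_parser), sorted(only_in_vocab))
```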


In [30]:
train_data = [d for d in clean_train if d.label == 'entailment']
test_data = [d for d in clean_test if d.label == 'entailment']
dev_data = [d for d in clean_dev if d.label == 'entailment']


2415


In [31]:
dim = 300
iters = 20
rate = 0.0002
batchsize = 5000
vectors = 'w2v_embeddings.pickle'

with open('w2v_dep_vocabs.pickle', 'rb') as pfile:
    subvocabs = pickle.load(pfile)

encoder = DependencyNetwork(dim=dim, vocab=vocab, pretrained=vectors)
decoder = EmbeddingGenerator(dim=dim, subvocabs=subvocabs, vectors=vectors)

for _ in range(iters):
    print('On iteration ', _)
    batch = random.sample(train_data, batchsize)
    for sample in batch:
        s1 = sample.sentence1
        s2 = sample.sentence2

        encoder.forward_pass(s1)        
        decoder.forward_pass(s2, encoder.get_root_embedding())
        decoder.backward_pass(rate=rate)
        encoder.backward_pass(decoder.pass_grad, rate=rate)

On iteration  0
On iteration  1
On iteration  2
On iteration  3
On iteration  4
On iteration  5
On iteration  6
On iteration  7
On iteration  8
On iteration  9
On iteration  10
On iteration  11
On iteration  12
On iteration  13
On iteration  14
On iteration  15
On iteration  16
On iteration  17
On iteration  18
On iteration  19


## Simple Entailment Generation Examples

This small amount of data probably isn't enough to generalize outside of the training set, so we'll first check how well the learned decoder is able to generate the entailments it has been trained on.

In [32]:
sample_trees = [d for d in dev_data if 5 < len(d.sentence2.split()) < 10]
batch = random.sample(dev_data, 5)

for sample in batch:
    s1 = sample.sentence1
    s2 = sample.sentence2
    randsen = random.choice(sample_trees)

    encoder.forward_pass(s1)
    decoder.forward_pass(s2, encoder.get_root_embedding())

    predicted = [node.pword for node in decoder.tree]
    true = [node.lower_ for node in decoder.tree]

    print('Sentence: ', s1)
    print('Predicted Entailment: ', ' '.join(predicted))
    print('Actual Entailment: ', ' '.join(true))
    print('')

Sentence:  Two dogs biting each other playfully while jumping
Predicted Entailment:  two dogs are biting .
Actual Entailment:  two dogs are playing .

Sentence:  A group of constructions workers lifting a bathtub.
Predicted Entailment:  are working
Actual Entailment:  humans working

Sentence:  A smiling woman is playing the violin in front of a turquoise background.
Predicted Entailment:  the woman is playing an instrument .
Actual Entailment:  a woman is playing an instrument .

Sentence:  Number 916 is hoping that he is going to win the race.
Predicted Entailment:  the people is going in a race .
Actual Entailment:  a person is competing in a race .

Sentence:  A man in glasses and a striped shirt walks down the street with one hand in his pocket.
Predicted Entailment:  the man is walks outside
Actual Entailment:  a man is walking somewhere
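
Because the predicted and actual words are both read off the same decoder tree, they line up position by position, which makes a simple per-token accuracy easy to compute. The helper below is a rough sketch rather than a metric used in this notebook; `token_accuracy` is a hypothetical name, and the example strings are copied from the first output above.

```python
def token_accuracy(predicted, actual):
    """Fraction of positions at which the predicted word equals the actual word."""
    if len(predicted) != len(actual):
        raise ValueError('sequences must be aligned position by position')
    if not actual:
        return 0.0
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)

predicted = 'two dogs are biting .'.split()
actual = 'two dogs are playing .'.split()
print(token_accuracy(predicted, actual))  # 4 of 5 positions match
```

Exact-match accuracy is a harsh measure here, since a prediction such as "the woman" for "a woman" is counted as an error even though the meaning is preserved.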



We can also generate entailments using randomly chosen trees for the decoding network structure. This doesn't work very well.

In [34]:
batch = random.sample(dev_data, 5)

for sample in batch:
    s1 = sample.sentence1
    s2 = sample.sentence2
    randsen = random.choice(sample_trees)

    encoder.forward_pass(s1)
    decoder.forward_pass(s2, encoder.get_root_embedding())

    predicted = [node.pword for node in decoder.tree]
    true = [node.lower_ for node in decoder.tree]
    
    print('Sentence: ', s1)
    print('Predicted Entailment: ', ' '.join(predicted))
    print('Actual Entailment: ', ' '.join(true))

    decoder.forward_pass(randsen.sentence2, encoder.get_root_embedding())
    alternate = [node.pword for node in decoder.tree]
    print('Random Tree Entailment: ', ' '.join(alternate))
    print('')

Sentence:  Two women in bathing suit on large rocks at the ocean.
Predicted Entailment:  two women are outside
Actual Entailment:  two women are outside
Random Tree Entailment:  the women are on the ocean .

Sentence:  Hiker in blue shirt and red shorts stands on hill near mountain.
Predicted Entailment:  the hiker hiker shirt on .
Actual Entailment:  the hiker has shorts on .
Random Tree Entailment:  there hiker red hiker on the mountain of mountain .

Sentence:  A man in a bar drinks from a pitcher while a man in a green hat looks on and a woman in a black shirt drink from a glass.
Predicted Entailment:  a drinking at other bar .
Actual Entailment:  a women in black drinks .
Random Tree Entailment:  a man is drinking his drink drink .

Sentence:  A woman in an orange shirt is enjoying food in a public setting.
Predicted Entailment:  a woman in a red shirt is is
Actual Entailment:  the lady in the orange shirt is eating
Random Tree Entailment:  a is in a food food .

Sentence:  A man 

## Generating Entailment Chains (i.e. Inferential Roles)

We can also generate entailment chains by re-encoding a generated sentence and then generating a new sentence from the resulting encoding. This is kind of neat because it allows us to distill what the model has learned into a network of inferential relationships between sentences. Philosophers sometimes argue that the meaning of a sentence is determined by its role or location in such a network.

In [35]:
s1 = 'A man curls up in a blanket on the street.'
s2 = 'A dog chases in a field.'
s3 = 'A frog is cold.'

def predict(encoder, decoder, s1, s2, s3):
    encoder.forward_pass(s1)
    decoder.forward_pass(s2, encoder.get_root_embedding())

    true = [node.lower_ for node in decoder.tree]
    predicted = [node.pword for node in decoder.tree]

    print('Sentence: ', s1)
    print('Predicted Entailment: ', ' '.join(predicted))

    encoder.forward_pass(' '.join(predicted))
    decoder.forward_pass(s3, encoder.get_root_embedding())

    predicted = [node.pword for node in decoder.tree]
    print('Next Prediction: ', ' '.join(predicted))
    print('')

predict(encoder, decoder, s1, s2, s3)
    
s1 = 'A group of Asian men pose around a large table after enjoying a meal together.'
s2 = 'Some people pose for a picture'
s3 = 'The group takes a picture.'

predict(encoder, decoder, s1, s2, s3)

s1 = 'Two police officers are sitting on motorcycles in the road.'
s2 = 'Two policemen sit on their bikes.'
s3 = 'The men have big guns.'

predict(encoder, decoder, s1, s2, s3)

s1 = 'Five people are playing in a gymnasium.'
s2 = 'Some people are competing indoors.'
s3 = 'Some people are inside.'

predict(encoder, decoder, s1, s2, s3)

s1 = 'A woman, whose face can only be seen in a mirror, is applying eyeliner in a dimly lit room.'
s2 = 'The woman applies eyeliner.'
s3 = 'The red woman applies green eyeliner.'

predict(encoder, decoder, s1, s2, s3)

Sentence:  A man curls up in a blanket on the street.
Predicted Entailment:  a man is on the street .
Next Prediction:  a man is outside .

Sentence:  A group of Asian men pose around a large table after enjoying a meal together.
Predicted Entailment:  the group are at a table
Next Prediction:  the group are a table .

Sentence:  Two police officers are sitting on motorcycles in the road.
Predicted Entailment:  two officers are on their road .
Next Prediction:  the people are other road .

Sentence:  Five people are playing in a gymnasium.
Predicted Entailment:  the people are are outside .
Next Prediction:  the people are outside .

Sentence:  A woman, whose face can only be seen in a mirror, is applying eyeliner in a dimly lit room.
Predicted Entailment:  a woman is makeup .
Next Prediction:  a female woman makeup black makeup .
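
The chaining idea generalizes beyond two steps: each generated sentence is fed back in as the next input. The sketch below abstracts the encode-then-decode step into a single function so the loop can be shown on its own. Everything in it is hypothetical (`entailment_chain`, `toy_step`, and the lookup table `toy_model`); in the notebook, each step also needs a sentence whose parse tree supplies the decoder structure, which this sketch leaves out.

```python
def entailment_chain(step, sentence, length):
    """Apply `step` repeatedly, feeding each output back in as the next input."""
    chain = [sentence]
    for _ in range(length):
        sentence = step(sentence)
        chain.append(sentence)
    return chain

# Hypothetical stand-in for encode-then-decode: a fixed lookup table.
toy_model = {
    'a man curls up in a blanket on the street .': 'a man is on the street .',
    'a man is on the street .': 'a man is outside .',
}

def toy_step(sentence):
    # Sentences the table does not cover are returned unchanged.
    return toy_model.get(sentence, sentence)

chain = entailment_chain(toy_step, 'a man curls up in a blanket on the street .', 2)
print(' -> '.join(chain))
```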



In [36]:
def condition(encoder, decoder, s1, s2, cond):
    encoder.forward_pass(s1)
    decoder.forward_pass(s2, encoder.get_root_embedding() + cond)

    true = [node.lower_ for node in decoder.tree]
    predicted = [node.pword for node in decoder.tree]
    print('Predicted Entailment: ', ' '.join(predicted))
    
s1 = 'A woman, whose face can only be seen in a mirror, is applying eyeliner in a dimly lit room.'
s2 = 'a blond woman applying eyeliner'
cond_sen = 'A person in a hooded shirt is photographing a woman.'
encoder.forward_pass(cond_sen)
cond = encoder.get_root_embedding()

print('Sentence: ', s1)
print('Conditioning Context: ', cond_sen)

encoder.forward_pass('')
condition(encoder, decoder, s1, s2, cond)   

s1 = 'A shirtless man sleeps in his blue boat out on the open waters.'
s2 = 'The red man is in the big boat.'
cond_word = 'fishing'
cond = encoder.vectors[cond_word]

print('')
print('Sentence: ', s1)
print('Conditioning Context: ', cond_word)

encoder.forward_pass('')
condition(encoder, decoder, s1, s2, cond)

s1 = 'Seven women stand and sit around a waters edge and one of them women sitting in the middle with her bare feet in the water drinks from a water bottle.'
s2 = 'a big man is on a boat.'
cond_sen = 'What is in the water bottle?'

encoder.forward_pass(cond_sen)
cond = encoder.get_root_embedding()

print('')
print('Sentence: ', s1)
print('Conditioning Context: ', cond_sen)

condition(encoder, decoder, s1, s2, cond)

Sentence:  A woman, whose face can only be seen in a mirror, is applying eyeliner in a dimly lit room.
Conditioning Context:  A person in a hooded shirt is photographing a woman.
Predicted Entailment:  a human is wearing picture

Sentence:  A shirtless man sleeps in his blue boat out on the open waters.
Conditioning Context:  fishing
Predicted Entailment:  a shirtless man fishing on the fishing boat .

Sentence:  Seven women stand and sit around a waters edge and one of them women sitting in the middle with her bare feet in the water drinks from a water bottle.
Conditioning Context:  What is in the water bottle?
Predicted Entailment:  the young people are in the water .


In [37]:
s1 = 'Several runners compete in a road race.'
s2 = 'the dog ran quickly to the beach.'

def sub_predict(encoder, decoder, s1, s2):
    
    encoder.forward_pass(s1)
    decoder.forward_pass(s2, encoder.get_root_embedding())

    true = [node.lower_ for node in decoder.tree]
    predicted = [node.pword for node in decoder.tree]

    print('Sentence: ', s1)
    print('Predicted Entailment: ', ' '.join(predicted))
    print('')    

sub_predict(encoder, decoder, s1, s2)
    
s1 = 'Many runners compete in a road race.'
sub_predict(encoder, decoder, s1, s2)

s1 = 'Several runners compete in a talent show.'
sub_predict(encoder, decoder, s1, s2)

s1 = 'Several performers compete in a talent show.'
sub_predict(encoder, decoder, s1, s2)

s1 = 'Several performers perform in a music show.'
sub_predict(encoder, decoder, s1, s2)

s1 = 'Several swimmers compete in a fast race.'
sub_predict(encoder, decoder, s1, s2)

s1 = 'The swimmers compete in a road race.'
s2 = 'the boy is on the beach.'
sub_predict(encoder, decoder, s1, s2)

s1 = 'The swimmers compete in a swim race.'
s2 = 'a big man sits near that beach.'
sub_predict(encoder, decoder, s1, s2)

s1 = 'Several runners paint paintings.'
sub_predict(encoder, decoder, s1, s2)

s1 = 'The swimmers race before eating.'
sub_predict(encoder, decoder, s1, s2)

s1 = 'One very slow cyclist is in the indoor arena.'
sub_predict(encoder, decoder, s1, s2)

s1 = 'a little boy is in the water.'
sub_predict(encoder, decoder, s1, s2)

Sentence:  Several runners compete in a road race.
Predicted Entailment:  the runners runners outside in a race .

Sentence:  Many runners compete in a road race.
Predicted Entailment:  the runners runners outside in a race .

Sentence:  Several runners compete in a talent show.
Predicted Entailment:  the people compete outside in a compete .

Sentence:  Several performers compete in a talent show.
Predicted Entailment:  the performers compete together in a competition .

Sentence:  Several performers perform in a music show.
Predicted Entailment:  the performers performing together in a stage .

Sentence:  Several swimmers compete in a fast race.
Predicted Entailment:  the swimmers compete outside in a race .

Sentence:  The swimmers compete in a road race.
Predicted Entailment:  the people compete in the race .

Sentence:  The swimmers compete in a swim race.
Predicted Entailment:  the young swimmers swim in the race .

Sentence:  Several runners paint paintings.
Predicted Entailment

In [38]:
for x in dev_data[70:90]:
    print(x.sentence1)
    print(x.sentence2)
    print('')

A man in a black shirt is smiling at a woman in a black shirt what a tattoo and a eye brow ring.
A man is smiling at a woman.

A goalie in white runs for an approaching ball while the opponent in red who kicked it waits.
A person wearing white is running towards a ball that was kicked from another person.

kids drawing something on paper
There are kids drawing.

A young woman is singing into a microphone.
The woman is singing.

A diver is swimming with a turtle.
A turtle has a diver swimming with it.

A man with a beard skateboarding and a boy with a blue and black backpack riding a green bike in the background.
There is a man and a boy outside.

A young boy reaches for and touches the propeller of a vintage aircraft.
A young boy and an airplane.

Indian people dressed in magnificent bright colors conduct a ritual.
Indian people having a ritual

Two little girls ride an inflatable dinghy down a purple water slide.
Two girls on a water slide.

Young girl with dark hair facing camera and

In [127]:
s1 = 'The little boy is jumping into a puddle on the street.'
s2 = 'A boy plays with my favorite puddle'
sub_predict(encoder, decoder, s1, s2)

# s1 = 'An animal is jumping to catch an object.'
# s2 = 'A white moose walks.'
s1 = 'Young players engage in the sport of Water polo while others watch.'
s2 = 'A man jumps in the big puddle'
sub_predict(encoder, decoder, s1, s2)

# s2 = 'A white moose walks.'
s1 = 'A woman dancing by two men above the ceiling.'
s2 = 'The bed is dirty'
s3 = 'A man stares at some puddle'
sub_predict(encoder, decoder, s1, s3)

Sentence:  The little boy is jumping into a puddle on the street.
Predicted Entailment:  the boy is into his little puddle

Sentence:  Young players engage in the sport of Water polo while others watch.
Predicted Entailment:  the people playing in the other water

Sentence:  A woman dancing by two men above the ceiling.
Predicted Entailment:  the women dancing in the men



In [44]:
for item in decoder.tree:
    print(item.lower_, item.dep_, item.head, item.pword)


the det bed  a
bed nsubj is  woman
is ROOT is  is
dirty acomp is  outside


In [45]:
for item in decoder.tree:
    probs = np.copy(item.probs)
    idx = np.argmax(probs)
    indices =  np.argpartition(probs.flatten(), -3)[-3:]
    print(decoder.idx_to_wrd[item.dep_][idx], [(decoder.idx_to_wrd[item.dep_][x], probs[x]) for x in indices])

a [('an', array([ 0.00979189])), ('the', array([ 0.37057355])), ('a', array([ 0.61028682]))]
woman [('she', array([ 0.01600048])), ('lady', array([ 0.01862793])), ('woman', array([ 0.90540286]))]
is [('jumping', array([ 0.02618057])), ('is', array([ 0.07478745])), ('jumps', array([ 0.03118227]))]
outside [('wet', array([ 0.03886075])), ('asleep', array([ 0.05742537])), ('outside', array([ 0.1132588]))]


In [None]:
print(subvocabs['aux'])