# Generation of sentences from vector representations of sentences

In this notebook, we will see how to train a decoder to generate sentences from sentence vectors.



In [1]:
# !python3 -m venv venv_vec2seq
# !source venv_vec2seq/bin/activate

## Installation
Running vec2seq requires several packages, including:
* sentence-transformers
* gensim
* sacrebleu
* distance

In [2]:
#!pip install -r requirements.txt

Use CUDA if your system supports it; otherwise, fall back to the CPU.

In [3]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


## Training a decoder for sentence vectors
We generate each sentence from a fixed-size vector alone, with no other information.

To do so, we train a vector-to-sequence model, which is a unidirectional RNN (the decoder part of an RNN Encoder-Decoder).
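
To make the idea concrete, here is a minimal PyTorch sketch of a vector-to-sequence decoder. It is not the vec2seq implementation: the class name `TinyVectorDecoder`, the token embedding layer, and the choice to initialise the LSTM hidden state from the sentence vector are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class TinyVectorDecoder(nn.Module):
    """Hypothetical decoder: sentence vector -> initial LSTM state -> tokens."""

    def __init__(self, vocab_size, vector_dim, hid_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hid_dim)
        self.init_h = nn.Linear(vector_dim, hid_dim)
        self.rnn = nn.LSTM(hid_dim, hid_dim, batch_first=True)
        self.fc_out = nn.Linear(hid_dim, vocab_size)

    def _init_state(self, sent_vec):
        # Assumption: the sentence vector only sets the initial hidden state.
        h0 = torch.tanh(self.init_h(sent_vec)).unsqueeze(0)  # (1, batch, hid_dim)
        return h0, torch.zeros_like(h0)

    def forward(self, sent_vec, tokens):
        # sent_vec: (batch, vector_dim); tokens: (batch, seq_len) of token ids
        out, _ = self.rnn(self.embedding(tokens), self._init_state(sent_vec))
        return self.fc_out(out)  # logits: (batch, seq_len, vocab_size)

    @torch.no_grad()
    def greedy_decode(self, sent_vec, sos_id, eos_id, max_length=12):
        # sent_vec: (1, vector_dim). Returns token ids without SOS/EOS.
        h, c = self._init_state(sent_vec)
        token = torch.full((1, 1), sos_id, dtype=torch.long, device=sent_vec.device)
        ids = []
        for _ in range(max_length):
            out, (h, c) = self.rnn(self.embedding(token), (h, c))
            token = self.fc_out(out).argmax(dim=-1)  # most likely next token, (1, 1)
            if token.item() == eos_id:
                break
            ids.append(token.item())
        return ids


torch.manual_seed(0)
model = TinyVectorDecoder(vocab_size=139, vector_dim=300, hid_dim=64)
logits = model(torch.randn(2, 300), torch.randint(0, 139, (2, 5)))
generated_ids = model.greedy_decode(torch.randn(1, 300), sos_id=0, eos_id=1)
```

With teacher forcing, `forward` would be trained with a cross-entropy loss against the reference tokens shifted by one position; `greedy_decode` shows how a sentence is produced at inference time from the vector alone.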




### Loading sentence embedding model

We implement four typical methods for representing variable-length sentences as fixed-size vectors:
* *'word-vec-sum'*:  summation over word vectors
* *'word-vec-avg'*: word vectors averaging
* *'word-vec-concat'*: concatenation of word vectors
* *'bert-base-nli-mean-tokens'*: pre-trained sentence embedding model (SBERT)

Note that the methods based on word vectors require the path to a word2vec model.

In [4]:
from vec2seq.sentence2vector import encoder

encoder_name = 'word-vec-sum'
encoder_path = '/mango/homes/WANG_Liyan/model/cc.en.300.init.bin'

# Note: this rebinds the name `encoder` from the imported class to the instance.
encoder = encoder(model_name=encoder_name, model_path=encoder_path, device=device)


Print the dimension of the sentence vectors:

In [5]:
print('vector dimension: {}'.format(encoder.vector_size))

vector dimension: 300
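
The three word-vector compositions listed above can be illustrated with a few lines of plain Python. This is only a sketch: the toy 2-dimensional vectors are made up, and the zero-padding to `max_words` in the concatenation is an assumption about how a fixed size could be obtained, not a description of what vec2seq does.

```python
def sum_vectors(vectors):
    """Element-wise sum of equally sized word vectors."""
    return [sum(components) for components in zip(*vectors)]


def avg_vectors(vectors):
    """Element-wise mean of equally sized word vectors."""
    return [total / len(vectors) for total in sum_vectors(vectors)]


def concat_vectors(vectors, max_words):
    """Concatenate up to max_words vectors, zero-padding to a fixed size."""
    dim = len(vectors[0])
    flat = [x for vec in vectors[:max_words] for x in vec]
    return flat + [0.0] * (max_words * dim - len(flat))


# Hypothetical 2-dimensional word vectors, for illustration only
toy_vectors = {"I": [1.0, 0.0], "do": [0.0, 2.0], "not": [1.0, 1.0]}
vectors = [toy_vectors[w] for w in "I do not".split()]

summed = sum_vectors(vectors)                         # same size as one word vector
averaged = avg_vectors(vectors)                       # same size as one word vector
concatenated = concat_vectors(vectors, max_words=4)   # size max_words * dim
```

Summation and averaging keep the word-vector dimension (300 in this notebook) but discard word order, whereas concatenation preserves order at the cost of a vector that grows with the maximum sentence length.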


### Loading the dataset

Place the data files in a single directory and pass its path as `data_dir`. Each .tsv file contains one sentence per line.

We experiment with a toy dataset.

In [6]:
data_dir = '/mango/homes/WANG_Liyan/data/vec2seq_toy/'
train_filename = 'sent.train.tsv'
valid_filename = 'sent.valid.tsv'
test_filename = 'sent.test.tsv'

batch_size = 128
max_length = 12

Customize the special tokens:
* init_token: start of sentence (default: 'SOS')
* eos_token: end of sentence (default: 'EOS')
* pad_token: padding token (default: 'PAD')

In [7]:
init_token = 'SOS'
eos_token = 'EOS'
pad_token = 'PAD'

We then use `TextDataLoader` to batch and pad the data and to build a vocabulary.

In [8]:
from vec2seq.dataloader import TextDataLoader

dataset = TextDataLoader(data_dir=data_dir,
                         train_filename=train_filename,
                         valid_filename=valid_filename,
                         test_filename=test_filename,
                         batch_size=batch_size,
                         max_length=max_length,
                         device=device)

vocab = dataset.generate_vocabulary()


vocabulary size: 139


In [9]:
train_set,valid_set,test_set = dataset.split()

train/valid/test size: 60/20/6
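
As a rough picture of what vocabulary building and padding involve, the sketch below does both in plain Python. The helper names `build_vocab` and `encode` are hypothetical, and the convention that `max_length` counts the `SOS` and `EOS` tokens is an assumption; `TextDataLoader` may behave differently.

```python
def build_vocab(sentences, specials=("PAD", "SOS", "EOS")):
    """Map every token to an integer id, with special tokens first."""
    words = {tok for sent in sentences for tok in sent.split()}
    itos = list(specials) + sorted(words - set(specials))
    return {tok: idx for idx, tok in enumerate(itos)}


def encode(sentence, stoi, max_length,
           init_token="SOS", eos_token="EOS", pad_token="PAD"):
    """Wrap a sentence in SOS/EOS, truncate it, and pad it to max_length ids."""
    tokens = [init_token] + sentence.split()[: max_length - 2] + [eos_token]
    tokens += [pad_token] * (max_length - len(tokens))
    return [stoi[tok] for tok in tokens]


sentences = ["I do not want to die .", "It 's him"]
stoi = build_vocab(sentences)
batch = [encode(sent, stoi, max_length=12) for sent in sentences]
```

Every row of `batch` has the same length, so the rows can be stacked into a single tensor; the padding id lets the loss ignore positions after the end of each sentence.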


### Customizing a decoder

Two options configure the decoder:
* model type: `decoder_net='basicrnn'` or `decoder_net='conrnn'`
* loss function: `loss = 'sl'` or `loss = 'ml'`


In [10]:
from vec2seq.networks import basicRNN, ConRNN
from vec2seq.networks import init_weights, count_parameters

decoder_net = 'basicrnn'
vocab_size = len(vocab.vocab)
vector_dim = encoder.vector_size
hid_dim = encoder.vector_size  # hidden size set equal to the sentence-vector size
n_layers = 1
loss = 'sl'

In [11]:
if decoder_net == "conrnn":
    decoder = ConRNN(vocab_size=vocab_size, vector_dim=vector_dim, hid_dim=hid_dim, n_layers=n_layers, loss=loss)
elif decoder_net == "basicrnn":
    decoder = basicRNN(vocab_size=vocab_size, vector_dim=vector_dim, hid_dim=hid_dim, n_layers=n_layers, loss=loss)
else:
    # Fail early instead of hitting a NameError on `decoder` below
    raise ValueError("unknown decoder_net: {}".format(decoder_net))

decoder = decoder.to(device)
print(decoder)
print('The model has {} trainable parameters'.format(count_parameters(decoder)))


basicRNN(
  (rnn): LSTM(300, 300, dropout=0.4)
  (fc_out): Linear(in_features=300, out_features=139, bias=True)
  (dropout): Dropout(p=0.4, inplace=False)
)
The model has 764239 trainable parameters
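
The reported parameter count can be checked by hand. The sketch below assumes PyTorch's standard LSTM parameterisation (four gates, each with input weights, recurrent weights, and two bias vectors) and that the dropout layer has no parameters; the helper names are hypothetical.

```python
def lstm_param_count(input_dim, hid_dim):
    """Parameters of a single-layer, unidirectional LSTM."""
    # 4 gates x (input weights + recurrent weights + two bias vectors)
    return 4 * (input_dim * hid_dim + hid_dim * hid_dim + 2 * hid_dim)


def linear_param_count(in_features, out_features):
    """Parameters of a fully connected layer with bias."""
    return in_features * out_features + out_features


lstm_params = lstm_param_count(300, 300)     # 722400
fc_params = linear_param_count(300, 139)     # 41839
total_params = lstm_params + fc_params
print(total_params)  # 764239
```

The total matches the 764239 trainable parameters printed above, which suggests that the LSTM and the output layer account for all trainable weights of this `basicRNN` configuration.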


In [12]:
decoder.apply(init_weights)

basicRNN(
  (rnn): LSTM(300, 300, dropout=0.4)
  (fc_out): Linear(in_features=300, out_features=139, bias=True)
  (dropout): Dropout(p=0.4, inplace=False)
)

### Training
We pass the model arguments to the `trainer` class and train the model by calling its `train()` method.

In [13]:
from vec2seq.train import trainer

epochs = 10
patience = 5
model_save_path = 'results/vec2seq_toy.pt'

# Note: this rebinds the name `trainer` from the imported class to the instance.
trainer = trainer(decoder=decoder, encoder=encoder,
                  train_set=train_set, valid_set=valid_set,
                  valid_path=data_dir + valid_filename,
                  batch_size=batch_size, max_length=max_length,
                  corpora_field=vocab, loss_mode=loss,
                  epochs=epochs, patience=patience,
                  model_save_path=model_save_path, device=device)

trainer.train()



Epoch: 0 | Time: 0m 0s
	Train Loss: 4.922 | Train PPL: 137.338
	 Val. Loss: 4.890 |  Val. PPL: 132.974
Validation loss decreased (inf --> 4.8901567459106445).  Saving model ...
Epoch: 1 | Time: 0m 0s
	Train Loss: 4.874 | Train PPL: 130.878
	 Val. Loss: 4.842 |  Val. PPL: 126.706
Validation loss decreased (4.8901567459106445 --> 4.8418684005737305).  Saving model ...
Epoch: 2 | Time: 0m 0s
	Train Loss: 4.821 | Train PPL: 124.116
	 Val. Loss: 4.790 |  Val. PPL: 120.309
Validation loss decreased (4.8418684005737305 --> 4.790066719055176).  Saving model ...
Epoch: 3 | Time: 0m 0s
	Train Loss: 4.764 | Train PPL: 117.268
	 Val. Loss: 4.734 |  Val. PPL: 113.702
Validation loss decreased (4.790066719055176 --> 4.733580112457275).  Saving model ...
Epoch: 4 | Time: 0m 0s
	Train Loss: 4.699 | Train PPL: 109.851
	 Val. Loss: 4.666 |  Val. PPL: 106.307
Validation loss decreased (4.733580112457275 --> 4.666333198547363).  Saving model ...
Epoch: 5 | Time: 0m 0s
	Train Loss: 4.618 | Train PPL: 101.2
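
Two quantities in the log above can be reproduced with a short sketch. The perplexity (PPL) is the exponential of the loss, and the `patience` argument suggests patience-based early stopping on the validation loss. The `EarlyStopper` class is a hypothetical illustration of that idea, not the logic used inside `trainer`.

```python
import math


def perplexity(loss):
    """Perplexity is the exponential of the average cross-entropy loss."""
    return math.exp(loss)


class EarlyStopper:
    """Hypothetical helper: stop after `patience` epochs without improvement."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss      # improvement: a checkpoint would be saved here
            self.bad_epochs = 0
            return False
        self.bad_epochs += 1
        return self.bad_epochs >= self.patience


val_ppl = perplexity(4.8901567459106445)  # validation loss of epoch 0 in the log

stopper = EarlyStopper(patience=2)
decisions = [stopper.step(loss) for loss in [4.9, 4.8, 4.85, 4.86]]
```

The computed `val_ppl` agrees with the `Val. PPL` logged for epoch 0 to three decimal places. In the toy run of `EarlyStopper`, training would stop at the fourth epoch, after two consecutive epochs without a new best validation loss.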

## Generating sentences from vectors using a trained decoder

Once training is complete, we can use the model to generate sentences from vectors.

Load the trained decoder weights:

In [14]:
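pretrained_decoder_path = model_save_path
# map_location keeps loading robust if the checkpoint was saved on a different device
decoder.load_state_dict(torch.load(pretrained_decoder_path, map_location=device))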

<All keys matched successfully>

We can use the `test` function to compute the test loss and to generate a list of reference and predicted sentence pairs.

In [15]:
from vec2seq.test import test

sentence_save_path='results/sents_toy.tsv'
test_path=data_dir+test_filename

results,test_loss = test(encoder=encoder,decoder=decoder,test_set=test_set,
                         test_path=test_path,results_save_path=sentence_save_path,
                         batch_size=batch_size,corpora_field=vocab,loss_mode=loss,device=device)

Inspect the generated sentences:

In [16]:
from tabulate import tabulate

print(tabulate(results, headers=["Reference","Generation"]))

Reference                       Generation
------------------------------  ----------------------
I do not want to do it again .  I do not not not not .
I do not want to go with you .  I do not not not .
I do not want to hurt anyone .  I do not not not .
Tom does not want to die .      I do not not not .
I do not want to die .          I do not not not .
It 's him                       I do not not .


### Evaluation

We then evaluate the sentences generated by the decoder trained on the toy data.

In [17]:
print("toy data size: train {} | valid {} | test {}".format(len(train_set.dataset),len(valid_set.dataset),len(test_set.dataset)))

toy data size: train 60 | valid 20 | test 6


In [18]:
# Import under an alias so that the built-in `eval` is not shadowed
from vec2seq.eval import eval as evaluate
scores = evaluate(results)

print(tabulate(scores.items()))


-----------------  ----
BLEU               13
Accuracy            0
Distance in words   4.5
Distance in chars  12
-----------------  ----
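
For intuition about the accuracy and distance rows, the sketch below recomputes them on the six pairs shown earlier. It assumes that accuracy means exact match and that the distances are Levenshtein edit distances averaged over sentence pairs, on whitespace tokens and on characters respectively; `vec2seq.eval` may define them differently. BLEU is omitted here.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]


# (reference, generation) pairs copied from the table of generated sentences
pairs = [
    ("I do not want to do it again .", "I do not not not not ."),
    ("I do not want to go with you .", "I do not not not ."),
    ("I do not want to hurt anyone .", "I do not not not ."),
    ("Tom does not want to die .", "I do not not not ."),
    ("I do not want to die .", "I do not not not ."),
    ("It 's him", "I do not not ."),
]

accuracy = sum(ref == hyp for ref, hyp in pairs) / len(pairs)
word_distance = sum(edit_distance(r.split(), h.split()) for r, h in pairs) / len(pairs)
char_distance = sum(edit_distance(r, h) for r, h in pairs) / len(pairs)
print(accuracy, word_distance, char_distance)
```

On these six pairs the exact-match accuracy is 0 and the mean word-level distance is 4.5, in line with the table above.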
