# Hands On

Now let's get our hands dirty. We show how the whole training pipeline works, and then we look at how to write a custom Keras layer for the SC-LSTM cell. You can download the whole codebase here: [Cool Code Here](https://github.com/jderiu/e2e_nlg)

## Preprocessing
The first step in every machine learning pipeline is preprocessing. Here, it consists of the following steps:
- Delexicalizing the data: Replacing the values of selected attributes (such as the name of the restaurant) with placeholders.
- Vectorizing the data: Translating the meaning representations into binarized vectors and transforming the utterances into lists of indices, where each character is represented by its index in the vocabulary (i.e. a char2idx mapping).
- Extracting the syntactic information: Getting the first word of the utterance and of the follow-up sentences and encoding those into a binary vector.

In [1]:
import logging
import os, sys
import pprint

pp = pprint.PrettyPrinter(width=41, compact=True)
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

#makes sure that the modules can be loaded
nb_dir = os.path.split(os.getcwd())[0]
nb_dir = nb_dir.replace('\\src', '')
sys.path.append(nb_dir)
logging.info('Base directory:' + nb_dir)
## should output: /some-path/e2e_nlg

from src.data_processing.delexicalise_data import _delex_nlg_data, _retrieve_mr_ontology
data_path = os.path.join(nb_dir, 'data/e2e_nlg')
# List of what attributes we want to replace with a placeholder
delex_attributes = ["name", "near", "food"]
# File name for the table of attribute-placeholder pairs
attribute_fname = 'ontology/attribute_tags.txt'

train_delex = _delex_nlg_data('trainset.csv', data_path, delex_attributes, attribute_fname)
valid_delex = _delex_nlg_data('devset.csv', data_path, delex_attributes, attribute_fname)
test_delex = _delex_nlg_data('testset.csv', data_path, delex_attributes, attribute_fname)

print('Lengths of Trainset: {} Validationset: {} Testset: {}'.format(
    len(train_delex['mr_raw']), 
    len(valid_delex['mr_raw']), 
    len(test_delex['mr_raw'])))

#Print an example
idx = 110 #(change me)
print('Parsed MR:')
pp.pprint(train_delex['parsed_mrs'][idx])
print('Original Output: ', train_delex['outputs_raw'][idx])
print('Delexicalized Output: ', train_delex['delexicalised_texts'][idx])



2019-02-11 14:47:18,890 : INFO : Base directory:C:\Users\deri\Documents\Git Projects\e2e_nlg


Lengths of Trainset: 42061 Validationset: 4672 Testset: 4693
Parsed MR:
{'area': 'riverside',
 'customer rating': '5 out of 5',
 'eatType': 'coffee shop',
 'food': 'Japanese',
 'name': 'The Golden Palace',
 'priceRange': 'more than £30'}
Original Output:  The coffee shop The Golden Palace is north of the city centre. It serves expensive food and has a 5 star rating.
Delexicalized Output:  The coffee shop XNAMEX is north of the city centre. It serves expensive food and has a 5 star rating.
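
To make the delexicalization step more concrete, here is a minimal sketch of the idea. It is not the repository's `_delex_nlg_data`; the function `delexicalise` and the `PLACEHOLDERS` table are hypothetical names, and only `XNAMEX` and `XFOODX` are taken from the outputs in this notebook (`XNEARX` is an assumption). The sketch uses plain string replacement, whereas the real implementation may handle spelling variants and other edge cases.

```python
# Hypothetical placeholder table: XNAMEX and XFOODX appear in this notebook's
# outputs, XNEARX is an assumption made for illustration.
PLACEHOLDERS = {"name": "XNAMEX", "near": "XNEARX", "food": "XFOODX"}


def delexicalise(text, mr, placeholders=PLACEHOLDERS):
    """Replace the value of each delexicalized attribute with its placeholder."""
    # Replace longer values first so that a short value cannot
    # overwrite a part of a longer one.
    items = sorted(
        ((attr, mr[attr]) for attr in placeholders if attr in mr),
        key=lambda item: len(item[1]),
        reverse=True,
    )
    for attr, value in items:
        text = text.replace(value, placeholders[attr])
    return text


example_mr = {"name": "The Golden Palace", "eatType": "coffee shop", "food": "Japanese"}
example_text = "The coffee shop The Golden Palace is north of the city centre."
delexicalised = delexicalise(example_text, example_mr)
print(delexicalised)
```

Attributes that are not in the placeholder table (here `eatType`) are left untouched, which mirrors the behaviour seen in the example output above.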


Next we need to extract the data ontology, which we need to vectorize the data later. The ontology is a dictionary of value2idx vocabularies, one per attribute.

In [2]:
full_mr_list = train_delex['parsed_mrs'] + valid_delex['parsed_mrs'] + test_delex['parsed_mrs']
mr_data_ontology = _retrieve_mr_ontology(full_mr_list)
print('List of attributes:')
pp.pprint(list(mr_data_ontology.keys()))
print('Value2Idx Vocabulary for priceRange:')
pp.pprint(mr_data_ontology['priceRange'])

List of attributes:
['name', 'eatType', 'priceRange',
 'customer rating', 'near', 'food',
 'area', 'familyFriendly']
Value2Idx Vocabulary for priceRange:
{'cheap': 0,
 'high': 1,
 'less than £20': 2,
 'moderate': 3,
 'more than £30': 4,
 '£20-25': 5}
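
The ontology extraction can be sketched in a few lines. This is an illustration under assumptions, not the code of `_retrieve_mr_ontology`: the function name `build_ontology` is hypothetical, and we assume that indices are assigned in sorted order of the values, which is consistent with the `priceRange` vocabulary printed above but is not confirmed by the source.

```python
from collections import defaultdict


def build_ontology(parsed_mrs):
    """Build one value2idx vocabulary per attribute from a list of parsed MRs."""
    values_per_attribute = defaultdict(set)
    for mr in parsed_mrs:
        for attribute, value in mr.items():
            values_per_attribute[attribute].add(value)
    # Assumption: indices follow the sorted order of the values.
    return {
        attribute: {value: idx for idx, value in enumerate(sorted(values))}
        for attribute, values in values_per_attribute.items()
    }


# Toy data for illustration only.
toy_mrs = [
    {"name": "The Golden Palace", "priceRange": "more than £30"},
    {"name": "Aromi", "priceRange": "cheap"},
    {"name": "Aromi", "priceRange": "high"},
]
toy_ontology = build_ontology(toy_mrs)
print(toy_ontology["priceRange"])
```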


# Vectorization
Next, we transform the preprocessed data into vectors, which can be interpreted by our neural network. This requires the following steps:
- Transform the meaning representations into a binary representation. For this, we rely on the ontology we extracted in the cell above.
- Transform the utterances into lists of indices, which are then given as input to the neural network. Each index corresponds to one character of the vocabulary.

## Vectorize Meaning Representations
For each attribute, we create a one-hot encoded vector that indicates which value is present in the utterance. We add an extra dimension to the vectors for those cases where the attribute is missing. Note that the vectors of the delexicalized attributes only have a length of two. They just indicate whether the attribute is present or not, since the value is replaced by a placeholder.

We first walk through the processing of a single MR, and then we process the whole dataset.

In [3]:
from src.data_processing.vectorize_data import _compute_vector_length, _vectorize_single_mr

# First compute the length of the one-hot encoded vectors:
vector_lengts = _compute_vector_length(mr_data_ontology, delex_attributes)
pp.pprint(vector_lengts)

{'area': 3,
 'customer rating': 7,
 'eatType': 4,
 'familyFriendly': 3,
 'food': 2,
 'name': 2,
 'near': 2,
 'priceRange': 7}


In [4]:
#Process one meaning representation:
mr = train_delex['parsed_mrs'][idx]
vec = _vectorize_single_mr(mr, mr_data_ontology, vector_lengts, delex_attributes)
pp.pprint(train_delex['parsed_mrs'][idx]['priceRange'])
pp.pprint(vec['priceRange'])

'more than £30'
array([[0.],
       [0.],
       [0.],
       [0.],
       [1.],
       [0.],
       [0.]])
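
The encoding of a single attribute can be sketched as follows. This is a simplified stand-in for `_vectorize_single_mr`, with a hypothetical function name. Two details are assumptions: that the extra "attribute missing" dimension is the last position, and that for delexicalized attributes index 0 means "present". The sketch also returns flat vectors instead of the column vectors shown above.

```python
import numpy as np


def one_hot_attribute(mr, attribute, value2idx, delexicalised=False):
    """One-hot encode one attribute of a meaning representation."""
    if delexicalised:
        # Only presence matters. Assumption: index 0 = present, index 1 = missing.
        vec = np.zeros(2)
        vec[0 if attribute in mr else 1] = 1.0
        return vec
    # One dimension per known value plus one extra dimension for "missing".
    vec = np.zeros(len(value2idx) + 1)
    if attribute in mr:
        vec[value2idx[mr[attribute]]] = 1.0
    else:
        vec[-1] = 1.0  # Assumption: the last position marks a missing attribute.
    return vec


# The priceRange vocabulary as printed in the ontology cell.
price_vocab = {"cheap": 0, "high": 1, "less than £20": 2,
               "moderate": 3, "more than £30": 4, "£20-25": 5}
toy_mr = {"name": "The Golden Palace", "priceRange": "more than £30"}

price_vec = one_hot_attribute(toy_mr, "priceRange", price_vocab)
name_vec = one_hot_attribute(toy_mr, "name", {}, delexicalised=True)
print(price_vec, name_vec)
```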


Now we run the meaning representation vectorization over the whole dataset. We store the result in a dictionary that maps each attribute name to a matrix of vectors. Each row of a matrix corresponds to one datapoint.

In [5]:
# Vectorize meaning representations
from src.data_processing.vectorize_data import _vectorize_mrs

train_mr_vecs = _vectorize_mrs(train_delex['parsed_mrs'], mr_data_ontology, delex_attributes)
valid_mr_vecs = _vectorize_mrs(valid_delex['parsed_mrs'], mr_data_ontology, delex_attributes)
test_mr_vecs = _vectorize_mrs(test_delex['parsed_mrs'], mr_data_ontology, delex_attributes)
print('Dimensions: {}'.format(train_mr_vecs['priceRange'].shape))

Dimensions: (42061, 7)


## Vectorize Utterances
Now we just need to create the representation for the utterances. For this, we load the vocabulary and then apply the transformation. One important detail: since Keras works with fixed-length sequences, we need to pad the texts (or cut them off) so that all the vectors have the same length.

In [6]:
#Step 1 Load Vocabulary
import json
from src.data_processing.utils import convert2indices 

# Use a context manager so that the file handle is closed after loading
with open(os.path.join(data_path, 'vocabulary.json'), 'rt', encoding='utf-8') as char_file:
    char_vocab = json.load(char_file)
print('Vocab Len: {}'.format(len(char_vocab)))
print('Idx of character "a": {}'.format(char_vocab['a']))

#Always reserve a dummy index for padding and an unk index for unknown tokens (or characters in this case)
dummy_char = max(char_vocab.values()) + 1
unk_char = max(char_vocab.values()) + 2

print('Dummy Idx: {} Unknown Idx: {}'.format(dummy_char, unk_char))

#Step 2 Convert 2 Indices
max_sentence_len = 256

train_idx_data = convert2indices(train_delex['delexicalised_texts'], char_vocab, dummy_char, unk_char, max_sentence_len)
valid_idx_data = convert2indices(valid_delex['delexicalised_texts'], char_vocab, dummy_char, unk_char, max_sentence_len)
test_idx_data = convert2indices(test_delex['delexicalised_texts'], char_vocab, dummy_char, unk_char, max_sentence_len)

#The shape is number of datapoints x sentence length
print('Input Data Shape: {}'.format(train_idx_data.shape))

Vocab Len: 68
Idx of character "a": 5
Dummy Idx: 68 Unknown Idx: 69
Input Data Shape: (42061, 256)
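
The padding and truncation logic can be sketched as below. This is not the repository's `convert2indices`; the function name `texts_to_indices` is hypothetical, and padding on the right is an assumption (the real helper may pad on the other side).

```python
import numpy as np


def texts_to_indices(texts, char_vocab, pad_idx, unk_idx, max_len):
    """Map each text to a fixed-length row of character indices."""
    # Start with a matrix filled with the padding index.
    data = np.full((len(texts), max_len), pad_idx, dtype="int32")
    for row, text in enumerate(texts):
        # Cut off texts longer than max_len, map unknown characters to unk_idx.
        indices = [char_vocab.get(ch, unk_idx) for ch in text[:max_len]]
        data[row, :len(indices)] = indices
    return data


# Toy vocabulary for illustration only.
toy_vocab = {"a": 0, "b": 1, "c": 2}
toy_pad = max(toy_vocab.values()) + 1
toy_unk = max(toy_vocab.values()) + 2

toy_data = texts_to_indices(["abc", "abz", "abcabc"], toy_vocab, toy_pad, toy_unk, max_len=4)
print(toy_data)
```

The first text is padded to length 4, the second contains an unknown character, and the third is cut off after four characters.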


# First Word Features
Next we prepare the syntactic features, which we use to add more variety to the generated utterances. For the sake of brevity, we only show the extraction of the first-word features. The other two manipulations are present in the full code version on GitHub.

The extraction of the first word is done in the following steps:
- Word tokenize all delexicalized utterances.
- Extract the first word of each utterance. 
- Create a vocabulary of first words, i.e. a first-word-to-idx mapping. We only keep first words that appear at least 100 times. Otherwise the neural network has difficulties learning the correlation between the first word and the utterance.

In [20]:
#Step 1: Tokenize the Utterances
from src.data_processing.surface_feature_vectors import _sentence_tok, _utterance_first_word_vocab, _utt_fw_features

train_tok = _sentence_tok(train_delex['delexicalised_texts'])
valid_tok = _sentence_tok(valid_delex['delexicalised_texts'])
test_tok = _sentence_tok(test_delex['delexicalised_texts'])

#Print an example: Note that we tokenize on both the sentence and the word level. The sentence-level tokenization can be used for other manipulations.
print(train_tok[idx])

#Step 2: Generate fw2idx mapping
utt_fw_vocab = _utterance_first_word_vocab(train_tok + valid_tok + test_tok, min_freq=100)
inverse_utt_fw_vocab = {v: k for k, v in utt_fw_vocab.items()}
print('Mapping from First Word 2 Index:')
pp.pprint(list(utt_fw_vocab.items()))

#Step 3 Create Surface Level Features
train_utt_fw = _utt_fw_features(train_tok, utt_fw_vocab)
valid_utt_fw = _utt_fw_features(valid_tok, utt_fw_vocab)
test_utt_fw = _utt_fw_features(test_tok, utt_fw_vocab)

print('Shape of First words: {}'.format(train_utt_fw.shape))
print('The word "{}" corresponds to : {}'.format(train_tok[idx][0][0], train_utt_fw[idx]))

[['The', 'coffee', 'shop', 'XNAMEX', 'is', 'north', 'of', 'the', 'city', 'centre', '.'], ['It', 'serves', 'expensive', 'food', 'and', 'has', 'a', '5', 'star', 'rating', '.']]
Mapping from First Word 2 Index:
[('XNAMEX', 0), ('Located', 1),
 ('For', 2), ('In', 3), ('A', 4),
 ('XNESRX', 5), ('An', 6), ('Near', 7),
 ('There', 8), ('On', 9), ('XFOODX', 10),
 ('The', 11), ('With', 12),
 ('Serving', 13), ('If', 14), ('At', 15),
 ('Riverside', 16), ('By', 17),
 ('You', 18), ('Family', 19)]
Shape of First words: (42061, 21)
The word "The" corresponds to : [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
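
The frequency cut-off and the extra dimension in the feature vectors can be sketched as follows. The function names are hypothetical and this is not the code behind `_utterance_first_word_vocab` or `_utt_fw_features`. We assume that every first word below the threshold is mapped to one extra dummy index at the end of the vector, which would explain why 20 vocabulary entries lead to vectors of length 21.

```python
from collections import Counter

import numpy as np


def first_word_vocab(tokenised_utterances, min_freq):
    """Map each sufficiently frequent first word to an index."""
    # Each utterance is a list of sentences, each sentence a list of tokens.
    counts = Counter(utt[0][0] for utt in tokenised_utterances if utt and utt[0])
    frequent = [word for word, count in counts.most_common() if count >= min_freq]
    return {word: idx for idx, word in enumerate(frequent)}


def first_word_features(tokenised_utterances, vocab):
    """One-hot encode the first word; rare or missing words share a dummy index."""
    dummy_idx = len(vocab)  # Assumption: the dummy index is the last position.
    features = np.zeros((len(tokenised_utterances), len(vocab) + 1))
    for row, utt in enumerate(tokenised_utterances):
        first = utt[0][0] if utt and utt[0] else None
        features[row, vocab.get(first, dummy_idx)] = 1.0
    return features


# Toy data for illustration only.
toy_tok = [
    [["The", "pub", "."]], [["The", "bar", "."]], [["XNAMEX", "is", "cheap", "."]],
    [["The", "cafe", "."]], [["XNAMEX", "is", "nice", "."]], [["Rarely", "seen", "."]],
]
toy_fw_vocab = first_word_vocab(toy_tok, min_freq=2)
toy_fw_features = first_word_features(toy_tok, toy_fw_vocab)
print(toy_fw_vocab, toy_fw_features.shape)
```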


# WAIT A MOMENT: THAT'S CHEATING !!!!
Of course, we do not have access to the correct first word at test time. This is indeed a major drawback of this approach. The solution is to sample n different first words for each meaning representation at test time, which yields n different utterances. We then have to rank those utterances according to their semantic correctness, as there are conflicting combinations of meaning representations and first words, for instance when no location is mentioned but the first word is "Located".

So let's sample 10 different first words for each meaning representation in the test set. We have an extra test set which contains only the meaning representations (i.e. no reference utterances given).

In [31]:
# Sample first-word features for the MR-only test set
import numpy as np
import random
from src.data_processing.generate_evaluation_data import _read_data, _parse_raw_mr
test_mr_only = os.path.join(data_path, 'test_mr_only.csv')

#Read the MR only test set 
test_mr_only_raw = _read_data(test_mr_only)
test_process_mr_only = _parse_raw_mr(test_mr_only_raw)
test_vectorised_mrs_only = _vectorize_mrs(test_process_mr_only, mr_data_ontology, delex_attributes)

def sample_utt_fw_for_mr(nsamples):
    utt_fw_samples = random.sample(list(utt_fw_vocab.values()), k=nsamples)
    dummy_idx = max(utt_fw_vocab.values()) + 1
    utt_fw_vec = []
    for fidx in utt_fw_samples:
        v = np.zeros(shape=(dummy_idx + 1, 1))
        v[fidx] = 1.0
        utt_fw_vec.append(v)
    return utt_fw_vec
    
first_word_features = []    
for mr in test_process_mr_only:
    utt_fw_vec = sample_utt_fw_for_mr(10)
    first_word_features.append(utt_fw_vec)

print('Len of Test Set: {}'.format(len(first_word_features)))

Len of Test Set: 630
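
The ranking step mentioned above is not shown in this notebook. As a rough illustration of what "semantic correctness" could mean, here is a simple stand-in heuristic that is not the ranking method of the codebase: it counts, for the delexicalized attributes, how many placeholders are missing from or wrongly present in a generated utterance. All names are hypothetical, and `XNEARX` is an assumed placeholder.

```python
# Hypothetical placeholder table: XNEARX is an assumption.
PLACEHOLDERS = {"name": "XNAMEX", "near": "XNEARX", "food": "XFOODX"}


def slot_errors(utterance, mr, placeholders=PLACEHOLDERS):
    """Count missing and superfluous placeholders in a delexicalized utterance."""
    errors = 0
    for attribute, placeholder in placeholders.items():
        in_mr = attribute in mr
        in_text = placeholder in utterance
        if in_mr != in_text:
            errors += 1
    return errors


def rank_candidates(candidates, mr):
    """Sort candidates so that those with the fewest slot errors come first."""
    return sorted(candidates, key=lambda utt: slot_errors(utt, mr))


# Toy candidates for illustration only.
toy_mr = {"name": "The Golden Palace", "food": "Japanese"}
toy_candidates = [
    "Located near XNEARX, XNAMEX serves XFOODX food.",
    "XNAMEX serves XFOODX food.",
    "XNAMEX is a restaurant.",
]
ranked = rank_candidates(toy_candidates, toy_mr)
print(ranked[0])
```

A check like this only covers the delexicalized attributes. Attributes such as `priceRange` or `area` would need a different kind of check, for example a classifier.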
