# 1. Build the data

See `build_data.py`

We need to generate some data. Requirements:
- enough data to train, but not so much that generation and training become slow
- train and dev sets ideally have the same distribution
- test set distribution ideally a bit different
- should be reproducible (fix the random seed)
- randomize the dimension of the vectors and the scale of the distribution
- save the data to files for future re-use by the model
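As an illustration of these requirements, here is a minimal sketch of what a single example generator could look like. `build_data.py` is not shown here, so the function name `sketch_one_example`, its signature and the choice of a uniform distribution on `[-scale, scale]` are assumptions made for this sketch, not the actual implementation of `get_one_example`.

```python
import random


def sketch_one_example(dim_low, dim_upp, scale_low, scale_upp, rng=random):
    """Hypothetical stand-in for build_data.get_one_example (an assumption)."""
    # Randomize the dimension of the vector
    dimension = rng.randint(dim_low, dim_upp)
    # Randomize the scale of the distribution
    scale = rng.uniform(scale_low, scale_upp)
    # Assumption: entries are drawn uniformly in [-scale, scale]
    x = [rng.uniform(-scale, scale) for _ in range(dimension)]
    # The label is the L1 norm of the vector
    y = sum(abs(v) for v in x)
    return x, y


rng = random.Random(2018)  # fixed seed for reproducibility
x, y = sketch_one_example(10, 200, 1, 10, rng)
```

Passing a dedicated `random.Random` instance keeps the generator reproducible without touching the global random state.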

Then, the file `model/input_fn.py` takes care of the input data pipeline to the Graph.

In [2]:
import os
import logging
import random

import tensorflow as tf

from build_data import get_one_example
from build_data import export_to_file
from build_data import save_dict_to_json

In [3]:
# Set random seeds for reproducibility
random.seed(2018)

# Define hyper parameters of the dataset
DATA_PARENT_DIR = "data"
# Sizes
TRAIN_SIZE = 8000
TEST_SIZE = 1000
DEV_SIZE = 1000
# Dimensions - test distribution is wider
TRAIN_DIMENSION_LOW = 10
TRAIN_DIMENSION_UPP = 200
TEST_DIMENSION_LOW = 1
TEST_DIMENSION_UPP = 400
# Scales - test distribution is wider
TRAIN_SCALE_LOW = 1
TRAIN_SCALE_UPP = 10
TEST_SCALE_LOW = 0.1
TEST_SCALE_UPP = 30

In [4]:
# Build the datasets
data_train = [get_one_example(TRAIN_DIMENSION_LOW, TRAIN_DIMENSION_UPP,
                              TRAIN_SCALE_LOW, TRAIN_SCALE_UPP)
              for _ in range(TRAIN_SIZE)]
data_dev = [get_one_example(TRAIN_DIMENSION_LOW, TRAIN_DIMENSION_UPP,
                            TRAIN_SCALE_LOW, TRAIN_SCALE_UPP)
            for _ in range(DEV_SIZE)]
data_test = [get_one_example(TEST_DIMENSION_LOW, TEST_DIMENSION_UPP,
                             TEST_SCALE_LOW, TEST_SCALE_UPP)
             for _ in range(TEST_SIZE)]

In [5]:
# Save the data to files
if not os.path.exists(DATA_PARENT_DIR):
    os.makedirs(DATA_PARENT_DIR)

export_to_file(data_train, "train.x", "train.y")
export_to_file(data_dev, "dev.x", "dev.y")
export_to_file(data_test, "test.x", "test.y")


In [6]:
# Save datasets properties in json file
sizes = {
    'train_size': len(data_train),
    'dev_size': len(data_dev),
    'test_size': len(data_test),
}

save_dict_to_json(sizes, os.path.join(DATA_PARENT_DIR, 'dataset_params.json'))


# 2. Define the model

See `model/model_fn.py`

The L1 norm of x_1, ..., x_n can be computed exactly with the following graph, because |x| = relu(x) + relu(-x):

```
# shape = [batch_size, max_dimension in batch] (padded with zeros)
out = input_vectors

# Compute the absolute value of every entry
out = tf.nn.relu(out) + tf.nn.relu(-out)

# Sum over the last dimension
out = tf.reduce_sum(out, axis=-1)
```

It respects the requirement (only ReLUs, +, - and sum), but there is no learnable component...

We can also test whether the network can learn the right weights for the 2 dense layers, that is, learn `[1, -1] -> relu -> [1, 1] -> reduce_sum`, or `[a, -b] -> relu -> [1/a, 1/b] -> reduce_sum` with a, b > 0.

More generally, any network of the form

$$ \sum_i b_i \, \mathrm{relu}(a_i x) $$

computes $|x|$ as long as

$$ \sum_{i : a_i > 0} a_i b_i = 1 \quad \text{and} \quad \sum_{i : a_i < 0} a_i b_i = -1 $$
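The general condition can be checked numerically. The sketch below uses plain Python rather than TensorFlow. It draws arbitrary first-layer weights `a`, rescales arbitrary second-layer weights `b` so that the positive-side products sum to 1 and the negative-side products sum to -1, and compares the result with `abs(x)`. The helper names `relu` and `two_layer` are introduced only for this illustration.

```python
import random


def relu(z):
    return max(z, 0.0)


def two_layer(x, a, b):
    """Compute sum_i b_i * relu(a_i * x) for a scalar x."""
    return sum(b_i * relu(a_i * x) for a_i, b_i in zip(a, b))


rng = random.Random(0)
# Arbitrary first-layer weights: three positive units and three negative units
a = ([rng.uniform(0.5, 2.0) for _ in range(3)]
     + [-rng.uniform(0.5, 2.0) for _ in range(3)])
# Arbitrary second-layer weights, rescaled below so that both constraints hold
b = [rng.uniform(0.5, 2.0) for _ in range(6)]
pos = sum(a_i * b_i for a_i, b_i in zip(a, b) if a_i > 0)
neg = sum(a_i * b_i for a_i, b_i in zip(a, b) if a_i < 0)
b = [b_i / pos if a_i > 0 else -b_i / neg for a_i, b_i in zip(a, b)]

for x in [-3.0, -0.5, 0.0, 2.0]:
    print(x, two_layer(x, a, b), abs(x))
```

For x > 0 only the units with a_i > 0 are active, so the output is x times the first sum. For x < 0 only the units with a_i < 0 are active, so the output is x times the second sum, which is -x.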

The operations of such a graph are summarized below. We need to take care of padding, because the vectors in a batch have different dimensions.


```
# shape = [batch_size, max_dimension in batch, 1]
out = tf.expand_dims(out, axis=-1)

out = tf.layers.dense(out, params.hidden_size, activation=tf.nn.relu, use_bias=False)
out = tf.layers.dense(out, 1, activation=None, use_bias=False)

# lengths is an int tensor of shape [batch_size] where lengths[i] is the dimension of the i-th example
# shape = [batch_size, max dimension in batch, 1]
mask = tf.expand_dims(tf.cast(tf.sequence_mask(lengths), tf.float32), axis=-1)
# shape = [batch_size]
predictions = tf.reduce_sum(tf.reduce_sum(out * mask, axis=-1), axis=-1)

# Loss
sum_of_dims = tf.reduce_sum(lengths)
l2_loss = tf.reduce_mean(tf.square(predictions - labels))
loss = l2_loss / tf.cast(sum_of_dims, tf.float32)  # loss per component, independent of other hyperparameters

```
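To make the masking and the loss normalization concrete, here is a small plain-Python analogue of the steps above on a toy batch. The helper `sequence_mask` mimics what `tf.sequence_mask` produces for a fixed maximum length. The numbers are made up for illustration, and the padded slots deliberately hold a nonzero dummy value to show that the mask removes them.

```python
def sequence_mask(lengths, max_len):
    """Plain-Python analogue of tf.sequence_mask: 1.0 for real entries, 0.0 for padding."""
    return [[1.0 if j < n else 0.0 for j in range(max_len)] for n in lengths]


# Per-component outputs for a padded batch of two vectors (dimensions 3 and 1)
outputs = [[1.0, 2.0, 0.5], [4.0, 9.9, 9.9]]
lengths = [3, 1]
labels = [3.5, 3.0]

mask = sequence_mask(lengths, max_len=3)
predictions = [sum(o * m for o, m in zip(row, mask_row))
               for row, mask_row in zip(outputs, mask)]

# Mean squared error over the batch, then divided by the total number of components
l2_loss = sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(labels)
loss = l2_loss / sum(lengths)
```

Here `predictions` is `[3.5, 4.0]`, the batch L2 loss is 0.5 and the loss per component is 0.125.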


We can make one interesting observation about the problem. If all the first-layer weights a_i have the same sign (say positive), then the negative entries of the vectors are zeroed by the ReLU and never matter in the loss function. In that case, weights that satisfy

$$ \sum_{i : a_i > 0} a_i b_i = 2 $$

give a local minimum. Indeed, the way we sample the data creates roughly the same amount of negative and positive components, meaning that $ - \sum_{x_i < 0} x_i \approx \sum_{x_i > 0} x_i $, and thus a way to approximate the L1 norm is just to compute $ 2 \sum_{x_i > 0} x_i $ (the same analysis is also valid for the negative entries, by symmetry).

Thus, if we use a higher number of hidden units (say 20), we can break the symmetry. All the components (positive and negative) then matter to the cost function, and we are less likely to get stuck in a local optimum where we only take positive (or negative) components into account.
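The quality of this one-sided approximation can be estimated with a quick simulation. The sketch below assumes entries drawn uniformly from a range symmetric around zero, which is an assumption for illustration and not necessarily the exact sampling used in `build_data.py`.

```python
import random

rng = random.Random(2018)
relative_errors = []
for _ in range(1000):
    # Assumption: entries are symmetric around zero (uniform in [-5, 5])
    x = [rng.uniform(-5.0, 5.0) for _ in range(200)]
    l1_norm = sum(abs(v) for v in x)
    # Approximation that only looks at the positive components
    one_sided = 2.0 * sum(v for v in x if v > 0)
    relative_errors.append(abs(one_sided - l1_norm) / l1_norm)

mean_relative_error = sum(relative_errors) / len(relative_errors)
print("mean relative error: {:.3f}".format(mean_relative_error))
```

The error is small but not zero, which is consistent with this configuration being a local minimum rather than the exact solution.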

In [7]:
from model.utils import Params
from model.utils import set_logger
from model.training import train_and_evaluate
from model.input_fn import input_fn
from model.input_fn import load_dataset_from_text
from model.model_fn import model_fn

# Reset the default graph before building the model
tf.reset_default_graph()
# The graph-level random seed is intentionally left unset: the fixed seed
# gave a favorable initialization
# tf.set_random_seed(2018)

First, let's load the hyperparameters and set up the logging

In [8]:
model_dir = "experiments/base_model"
data_dir = "data"

# Load the parameters from the experiment params.json file in model_dir
json_path = os.path.join(model_dir, 'params.json')
assert os.path.isfile(json_path), "No json configuration file found at {}".format(json_path)
params = Params(json_path)

# Load the dataset parameters (sizes, etc.) into params
json_path = os.path.join(data_dir, 'dataset_params.json')
params.update(json_path)

# Set the logger
set_logger(os.path.join(model_dir, 'train.log'))
print(params.dict)

{'learning_rate': 0.01, 'batch_size': 32, 'num_epochs': 10, 'model_version': 'trainable', 'hidden_size': 20, 'save_summary_steps': 100, 'train_size': 8000, 'dev_size': 1000, 'test_size': 1000}


Then, load the vectors and their L1 norms with tf.data

In [9]:
# Get paths for the train and dev datasets
path_train_x = os.path.join(data_dir, 'train.x')
path_train_y = os.path.join(data_dir, 'train.y')
path_eval_x = os.path.join(data_dir, 'dev.x')
path_eval_y = os.path.join(data_dir, 'dev.y')

# Create the input data pipeline
logging.info("Creating the datasets...")
train_x = load_dataset_from_text(path_train_x)
train_labels = load_dataset_from_text(path_train_y)
eval_x = load_dataset_from_text(path_eval_x)
eval_labels = load_dataset_from_text(path_eval_y)

Creating the datasets...


Finally, create the input function and the model

In [10]:
# Specify other parameters for the dataset and the model
params.eval_size = params.dev_size
params.buffer_size = params.train_size # buffer size for shuffling

# Create the two iterators over the two datasets
train_inputs = input_fn('train', train_x, train_labels, params)
eval_inputs = input_fn('eval', eval_x, eval_labels, params)
logging.info("- done.")

# Define the models (two different sets of nodes that share weights for train and eval)
logging.info("Creating the model...")
train_model_spec = model_fn('train', train_inputs, params)
eval_model_spec = model_fn('eval', eval_inputs, params, reuse=True)
logging.info("- done.")


- done.
Creating the model...
- done.


# 3. Train the model


### Training and Hyperparameters

- Applying Dropout would not make sense as the architecture is as minimalist as possible. L2 regularization would introduce a bias, even though it would penalize large weight values (for our 2-layer network, [a, -b] and [1/a, 1/b] are valid weights for all a, b > 0, so valid solutions can have arbitrarily large weights).
- Relevant hyperparameters are batch_size, learning_rate and the optimization method.
- The training loss is the L2 loss between the predicted L1 norm and the gold L1 norm, averaged over the batch and divided by the total number of components in the batch.
- We take the average per component and per batch so that changing other hyperparameters does not impact the choice of learning_rate and batch_size.


The accuracy here is just the negative of the loss (thus the higher the better). It is used to select the best weights.
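The selection of the best weights can be sketched as follows. This only illustrates the selection rule. The actual logic lives in `train_and_evaluate`, which is not shown here. The eval accuracies are the values reported for the first three epochs in the training output of this notebook.

```python
# Eval accuracies (negative loss) for epochs 1 to 3, taken from the training output
eval_accuracies = [-0.399, -0.123, -0.033]

best_accuracy = float("-inf")
best_epoch = None
for epoch, accuracy in enumerate(eval_accuracies, start=1):
    # Higher is better, so a new best is any strictly larger value
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_epoch = epoch
        print("Found new best accuracy after epoch {}".format(epoch))
```

Because the loss decreases at every one of these epochs, each epoch produces a new best checkpoint.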

In [12]:
# Train the model
logging.info("Starting training for {} epoch(s)".format(params.num_epochs))
train_and_evaluate(train_model_spec, eval_model_spec, model_dir, params)

Starting training for 10 epoch(s)
Epoch 1/10
100%|██████████| 250/250 [00:05<00:00, 43.81it/s, loss=0.885]  
- Train metrics: loss: 91.063 ; neg_l2_loss: -12655.478 ; accuracy: -91.063
- Eval metrics: loss: 0.399 ; neg_l2_loss: -55.555 ; accuracy: -0.399
- Found new best accuracy, saving in experiments/base_model/best_weights/after-epoch-1
Epoch 2/10
100%|██████████| 250/250 [00:05<00:00, 44.57it/s, loss=0.049]
- Train metrics: loss: 0.274 ; neg_l2_loss: -32.423 ; accuracy: -0.274
- Eval metrics: loss: 0.123 ; neg_l2_loss: -18.861 ; accuracy: -0.123
- Found new best accuracy, saving in experiments/base_model/best_weights/after-epoch-2
Epoch 3/10
100%|██████████| 250/250 [00:05<00:00, 44.29it/s, loss=0.013]
- Train metrics: loss: 0.070 ; neg_l2_loss: -7.979 ; accuracy: -0.070
- Eval metrics: loss: 0.033 ; neg_l2_loss: -4.781 ; accuracy: -0.033
- Found new best accuracy, saving in experiments/base_model/best_weights/after-epoch-3
Epoch 4/10
100%|██████████| 250/250 [00:05<00:00, 43.61it/

Most of the runs reach zero error during training. After reaching a good minimum, the loss can sometimes go up a little. A possible explanation is a noisy batch with a stronger asymmetry between positive and negative components, which changes the weights a bit in an unfavorable way. This kind of issue could be mitigated with learning rate decay, for instance.
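As an illustration of what learning rate decay means, here is a minimal sketch of an exponential schedule in plain Python. This is not part of the training code of this notebook. The initial rate 0.01 and the 250 steps per epoch come from the parameters and the log above, while the decay rate 0.5 is a hypothetical value chosen for the example.

```python
def decayed_learning_rate(step, initial_rate=0.01, decay_steps=250, decay_rate=0.5):
    """Exponential decay: initial_rate * decay_rate ** (step / decay_steps)."""
    return initial_rate * decay_rate ** (step / decay_steps)


# Learning rate at the start of each of the first four epochs (250 steps per epoch)
schedule = [decayed_learning_rate(epoch * 250) for epoch in range(4)]
print(schedule)
```

With these settings the learning rate is halved after every epoch, so late noisy batches move the weights less than early ones.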

# 4. Evaluate on the Test set

In [13]:
# Reset the default graph
from model.evaluation import evaluate
tf.reset_default_graph()

# Get paths for the test set
path_test_x = os.path.join(data_dir, 'test.x')
path_test_y = os.path.join(data_dir, 'test.y')

# Create the input data pipeline
logging.info("Creating the dataset...")
test_x = load_dataset_from_text(path_test_x)
test_labels = load_dataset_from_text(path_test_y)

Creating the dataset...


In [14]:
# Specify other parameters for the dataset and the model
params.eval_size = params.test_size

# Create iterator over the test set
inputs = input_fn('eval', test_x, test_labels, params)
logging.info("- done.")

# Define the model
logging.info("Creating the model...")
model_spec = model_fn('eval', inputs, params, reuse=False)
logging.info("- done.")

- done.
Creating the model...
- done.


In [15]:
logging.info("Starting evaluation")
evaluate(model_spec, model_dir, params, "best_weights")

Starting evaluation


INFO:tensorflow:Restoring parameters from experiments/base_model/best_weights/after-epoch-9


Restoring parameters from experiments/base_model/best_weights/after-epoch-9
- Eval metrics: loss: 0.000 ; neg_l2_loss: -0.000 ; accuracy: -0.000


### Results and Observations

- Both models achieve 0 L2 loss between the two L1 norms (neural and gold).
- Some sensitivity to the learning_rate, resolved with a hyperparameter search and the use of Adam.
- Ability to generalize, as the distribution of the test set is slightly different.

We can look at the actual learned weights and check that we get what we expected

In [16]:
variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)

In [17]:
import numpy as np
saver = tf.train.Saver()
with tf.Session() as sess:
    # Reload weights from the weights subdirectory
    save_path = os.path.join(model_dir, "best_weights")
    if os.path.isdir(save_path):
        save_path = tf.train.latest_checkpoint(save_path)
    saver.restore(sess, save_path)

    a, b = sess.run(variables)
a = np.squeeze(a)
b = np.squeeze(b)

INFO:tensorflow:Restoring parameters from experiments/base_model/best_weights/after-epoch-9


Restoring parameters from experiments/base_model/best_weights/after-epoch-9


We notice that the learned values satisfy the two constraints stated in section 2:

In [18]:
sum(a[a>0] * b[a>0])

1.0000004788143997

In [19]:
sum(a[a<0] * b[a<0])

-0.9999996540136635