
batch_normalize=True doesn't work accurately with phase=Phase.* setting #23

Closed
jramapuram opened this issue May 27, 2016 · 13 comments

Comments

@jramapuram

I believe that there is an error when using phase in the default_scope coupled with batch_normalize=True.

Basically it looks like this:

    def encoder(self, inputs, latent_size, activ=tf.nn.elu, phase=pt.Phase.train):
        with pt.defaults_scope(activation_fn=activ,
                               batch_normalize=True,
                               learned_moments_update_rate=0.0003,
                               variance_epsilon=0.001,
                               scale_after_normalization=True,
                               phase=phase):
            params = (pt.wrap(inputs).
                      reshape([-1, self.input_shape[0], self.input_shape[1], 1]).
                      conv2d(5, 32, stride=2).
                      conv2d(5, 64, stride=2).
                      conv2d(5, 128, edges='VALID').
                      flatten().
                      fully_connected(self.latent_size * 2, activation_fn=None)).tensor

Full code here: https://github.com/jramapuram/CVAE/blob/master/cvae.py
If I remove phase=phase within the scope assignment my model produces the following:
[image: 2d_cluster_orig]

However, when setting the phase appropriately I get the following:
[image: 2d_cluster]

This is trained for the same number of iterations using the same model.

@eiderman
Contributor

eiderman commented Jun 1, 2016

In most typical usage, batch_normalization is only applied during training and the moving average is tracked for inference time when batch size tends to be 1. Because of this, Phase.infer and Phase.test use variables in the graph that tracked the stddev/mean of the batches during training.
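To illustrate why inference can't simply reuse batch statistics when the batch size is 1, here is a tiny numpy sketch (an illustration only, not PrettyTensor code):

    import numpy as np

    # A single test example normalized with its own "batch" statistics degenerates:
    # the batch variance of one element is 0, so the normalized output is 0 (i.e. just beta
    # after scale/shift), regardless of the input value.
    x = np.array([[2.7]])                      # one example, one channel
    mu, var, eps = x.mean(0), x.var(0), 0.001
    print((x - mu) / np.sqrt(var + eps))       # -> [[0.]]
    # Hence Phase.test / Phase.infer fall back to the moving averages tracked in training.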

I feel like the infer/test paths may need better documentation to clear this up. Are you having the problem when using pt.Phase.train as well?

@jramapuram
Author

jramapuram commented Jun 1, 2016

Yes, that is correct @eiderman. I use phase=pt.Phase.train during training and phase=pt.Phase.test during testing. I haven't permuted them yet (i.e. tried train for test, etc.).

@eiderman
Contributor

eiderman commented Jun 1, 2016

I've checked the implementation and it should be doing the correct thing. @jramapuram, would you mind explaining the graph to me? Also, how does this impact the evaluation metrics for the relevant loss on the test set?

@jramapuram
Author

jramapuram commented Jun 1, 2016

I have a convolutional variational autoencoder which maps to a two-dimensional latent space, so it disentangles the manifold seen above (of MNIST). When I do not use phase=* (in the scope) I get the first figure, which is the expected result. When I add the phase=* option I get the second figure. I have tried re-training many times, but still face the same issue. With regards to metrics: since this is unsupervised it is slightly hard to quantify.

My train/test graphs are simply this [note: in train the phase defaults to phase=pt.Phase.train and is thus omitted]:

            with tf.variable_scope("z"): # Encode our data into z and return the mean and covariance
                self.z_mean, self.z_log_sigma_sq = self.encoder(self.inputs, latent_size)
                self.z = tf.add(self.z_mean,
                                tf.mul(tf.sqrt(tf.exp(self.z_log_sigma_sq)), eps))
                # Get the reconstructed mean from the decoder
                self.x_reconstr_mean = self.decoder(self.z, self.input_size)
                self.z_summary = tf.histogram_summary("z", self.z)


            with tf.variable_scope("z", reuse=True): # The test z
                self.z_mean_test, self.z_log_sigma_sq_test = self.encoder(self.inputs, latent_size, phase=pt.Phase.test)
                self.z_test = tf.add(self.z_mean_test,
                                     tf.mul(tf.sqrt(tf.exp(self.z_log_sigma_sq_test)), eps))
                # Get the reconstructed mean from the decoder
                self.x_reconstr_mean_test = self.decoder(self.z_test, self.input_size, phase=pt.Phase.test)

@eiderman
Contributor

eiderman commented Jun 1, 2016

Batch normalization is behaving correctly, but I would really like to understand this phenomenon more because it may have modeling implications on best practice for BN.

One experiment that may help to verify it is whether your test results are as good when running smaller batches than all 10k. It may be that normalizing the output based on all test examples results in a cleaner embedding. The default inference behavior of BN is geared towards generating correct and stable predictions for small batch sizes.

It would be interesting to see how the accuracy changes on the test set if you were to attach a softmax layer to the embedding (without training the lower layers, by using no_gradients()) and test it on various batch sizes.
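For example, a rough sketch of such a probe, freezing the encoder with tf.stop_gradient (standing in here for the no_gradients() mentioned above); the layer size, the labels placeholder, and the softmax_classifier call pattern are assumptions based on the PrettyTensor tutorials:

    # Hypothetical probe: classify MNIST labels from the frozen embedding so that
    # test accuracy can be compared across batch sizes and phases.
    frozen_z = tf.stop_gradient(self.z_mean)            # no gradients flow into the encoder
    probe = (pt.wrap(frozen_z)
               .fully_connected(64, activation_fn=tf.nn.elu)
               .softmax_classifier(10, labels=labels))  # labels: a hypothetical placeholder
    probe_loss = probe.loss                             # train only this classification head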

Yet another aspect that would be interesting to test is which projection works better as a VAE. Since one of the goals is to make a decoder that can be easily sampled to generate new results, I suspect that a denser region of digits may work better, since there is less likely to be junk space that produces non-digits within the sample space.

@jramapuram
Author

@eiderman : Will give it a shot for smaller batch sizes (i.e. same as training). However, this still doesn't answer why it would work when no phase parameter is provided. Does batch normalization turn off without a provided phase parameter?

I'm not sure the softmax layer makes any sense. This is a pure unsupervised problem. There are no class labels that can be provided to update the softmax's weights & biases. I'm assuming you would be talking about a softmax+cross-entropy as an optimization objective.

@eiderman
Contributor

eiderman commented Jun 4, 2016

Jason, with the phase set, batch normalization looks like:

  • train - do a per-channel normalization so that the per-activation mean is
    0 and the standard deviation is 1. Also keep an exponential moving average
    of these values for use at inference time (see the sketch below).
  • test/infer - use the exponential moving averages stored during training to
    normalize the pre-activations.
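For concreteness, a minimal numpy sketch of the two code paths (an illustration only, not PrettyTensor's implementation; in particular the moving-average update rule and its relation to learned_moments_update_rate are assumptions):

    import numpy as np

    def bn_train(x, gamma, beta, state, update_rate=0.0003, eps=0.001):
        # Phase.train: normalize with the current batch's statistics and fold
        # them into the running moments.
        mu, var = x.mean(axis=0), x.var(axis=0)
        state['mean'] += update_rate * (mu - state['mean'])
        state['var'] += update_rate * (var - state['var'])
        return gamma * (x - mu) / np.sqrt(var + eps) + beta

    def bn_infer(x, gamma, beta, state, eps=0.001):
        # Phase.test / Phase.infer: reuse the moments tracked during training,
        # so each example's output is independent of the rest of the batch.
        return gamma * (x - state['mean']) / np.sqrt(state['var'] + eps) + beta

    state = {'mean': np.zeros(1), 'var': np.ones(1)}   # initial moments
    batch = np.random.randn(128, 1) * 3.0 + 5.0
    bn_train(batch, 1.0, 0.0, state)                   # also nudges the running moments
    print(bn_infer(batch[:1], 1.0, 0.0, state))        # uses only the stored moments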

If you do not set the phase, it defaults to 'train' in both cases. This
means that the version without the phase set is performing normalization
during inference using the test-set activations, which is not really a good
thing: it can easily push the network outside of the ranges seen during
training, and a test example's prediction may be sensitive to other items in
the batch. In your case, it appears to have made your model produce a better
separation, but there are some caveats:

  1. 2D isn't sufficient to capture the space, so this may just be noise.
  2. Both the Phase.train and Phase.test runs on the test set have large patches
    of intermingled values and it isn't obvious which is better overall.
  3. When using a VAE as a generative model, a denser embedding is often
    preferable. Empty spaces in the embedding may correspond to junk digits
    instead of plausible images.

To test 1 & 2, I would recommend either computing the test reconstruction
loss (preferable) or attaching a classification loss and only training
the classification layer. While I suggested softmax before, I think nearest
neighbor against the train set may work just as well for a smoke test.

To test 3, just sample from the model and make sure to hit the white space
on your graph to see how the digits look. Doing enough of these to achieve
statistical significance would be hard, but sampling z from the Gaussian
prior should cover the space with roughly equal probability.
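A sketch of that sampling check, assuming the tensor names from the linked cvae.py (z_test, x_reconstr_mean_test) and a hypothetical plotting helper:

    # Decode points drawn from the unit Gaussian prior, deliberately including
    # ones that land in the white space of the 2D scatter plot.
    z_prior = np.random.randn(FLAGS.batch_size, 2).astype(np.float32)
    samples = sess.run(cvae.x_reconstr_mean_test, feed_dict={cvae.z_test: z_prior})
    plot_digit_grid(samples)  # hypothetical helper: check whether off-manifold z give junk digits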


@jramapuram
Author

@eiderman: I updated my logic to run inference in batch_size chunks, as such:

def plot_2d_cvae(sess, source, cvae):
    z_mu = []
    y_sample = []
    for _ in range(np.floor(10000.0 / FLAGS.batch_size).astype(int)):
      x_sample, y = source.test.next_batch(FLAGS.batch_size)
      z_mu.append(cvae.transform(sess, x_sample))
      y_sample.append(y)

    z_mu = np.vstack(z_mu)
    y_sample = np.vstack(y_sample)
    print 'z.shape = ', z_mu.shape, ' | y_sample.shape = ', y_sample.shape

    plt.figure(figsize=(8, 6))
    plt.scatter(z_mu[:, 0], z_mu[:, 1], c=np.argmax(y_sample, 1))
    plt.colorbar()
    plt.savefig("models/2d_cluster.png", bbox_inches='tight')
    #plt.show()

When the phase is set to test it looks like the same issue is present:
[image: 2d_cluster]

However, setting phase=train for both test & train accurately separates the manifold:
[image: 2d_cluster]

To address your points:

  1. The 2d representation is perfectly sufficient for MNIST, as the manifold has been shown to be separable in this manner in the SOM, autoencoder, and t-SNE literature, so I don't believe that is the issue at hand.
  2. There is no intermingling going on. Phase.train is used for the parameters that are optimized during training time on the MNIST training data. Phase.test is used at test time with the reused parameters (i.e. weights/biases) but operating on the MNIST test data. The training loss after around 400 epochs is 138.141. This is the standard VAE loss (2-part loss). I haven't had the time to add an extra layer and such.
  3. I am not using it as a generative model for the above use case, merely as one to visualize a disentangled feature space. However, here is a visualization of the reconstruction as requested from both cases (one with Phase.train for train & Phase.test for test parameters [the correct method] and one with Phase.train set for both test & train functions [the incorrect method that shows batch_normalization is NOT working accurately]).

Listed below is the reconstruction when Phase.test is set correctly:
[image: 20d_reconstr_4]

And here is when using Phase.train:
[image: 20d_reconstr_4]

When using batch normalization with the running mean, it appears to be projecting to roughly the same location (as per the reconstruction). Thus I believe there is something wrong with the batch_normalization implementation for the conv2d op.

@eiderman
Contributor

eiderman commented Jun 6, 2016

My apologies for being obtuse. BN is working as intended, but there is a
gotcha (which I am currently fixing). In order for you to update the
averaged mean and variance variables, you need to run the update ops on
each iteration.

These are executed by adding a dependency on pt.with_update_ops as
documented here:
https://github.com/google/prettytensor/blob/master/docs/pretty_tensor_top_level.md#apply_optimizerlosses-regularizetrue-include_markedtrue
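Roughly, the pattern looks like this (a sketch following the linked docs; the exact signature and the self.cost name are assumptions):

    # Building the train op through pt.apply_optimizer with include_marked=True
    # attaches the marked update ops (the BN moving mean/variance) to every step.
    optimizer = tf.train.AdamOptimizer(1e-3)
    train_op = pt.apply_optimizer(optimizer, losses=[self.cost],  # self.cost: your VAE loss
                                  regularize=True, include_marked=True)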

This is really a poor API to trickle out to other users, so I will fix it
so that the updates are part of the graph.


@jramapuram
Author

Great! Thanks!

@eiderman
Contributor

eiderman commented Jun 7, 2016

I added a fix to automatically update the running variance/mean used at inference time. If you have any other issues, please let me know!

I'm a little surprised at how poorly the model did with the initial variance (1.0) and mean (0.0). I would have expected the training to have made it somewhat resilient to scale and shift of features.
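For reference, with those moments never updated, the inference-time transform is nearly the identity on the raw pre-activations (a numpy illustration; the gamma/beta values are assumed):

    import numpy as np

    # With moving_mean = 0 and moving_variance = 1 (the initial, never-updated values),
    # inference-time BN is just y = gamma * x / sqrt(1 + eps) + beta, so the
    # pre-activations pass through almost unchanged instead of being normalized.
    x = np.random.randn(4, 3) * 10.0 + 7.0          # un-normalized pre-activations
    gamma, beta, eps = 1.0, 0.0, 0.001
    y = gamma * (x - 0.0) / np.sqrt(1.0 + eps) + beta
    assert np.allclose(y, x, rtol=1e-3)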

@jramapuram
Author

Great! Will give it a shot and get back.

@jramapuram
Author

Thanks for the assistance @eiderman! It is working as intended now.
[image: 2d_cluster]
