batch normalization: loss increased when deepxde.maps.map.Map.training set to False #69

Closed
smao-astro opened this issue Jun 24, 2020 · 11 comments


@smao-astro
Contributor

Hi Lu,

I am trying deepxde.maps.fnn.FNN.batch_normalization at

batch_normalization=None,

And I noticed that when
def _test(self, uncertainty):
is called, the loss increases significantly compared to the loss computed at
self.train_op.minimize(

To reproduce, here are two scripts to compare against each other:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np

import deepxde as dde
from deepxde.backend import tf


def gen_testdata():
    data = np.load("dataset/Burgers.npz")
    t, x, exact = data["t"], data["x"], data["usol"].T
    xx, tt = np.meshgrid(x, t)
    X = np.vstack((np.ravel(xx), np.ravel(tt))).T
    y = exact.flatten()[:, None]
    return X, y


def main():
    def pde(x, y):
        dy_x = tf.gradients(y, x)[0]
        dy_x, dy_t = dy_x[:, 0:1], dy_x[:, 1:2]
        dy_xx = tf.gradients(dy_x, x)[0][:, 0:1]
        return dy_t + y * dy_x - 0.01 / np.pi * dy_xx

    geom = dde.geometry.Interval(-1, 1)
    timedomain = dde.geometry.TimeDomain(0, 0.99)
    geomtime = dde.geometry.GeometryXTime(geom, timedomain)

    bc = dde.DirichletBC(
        geomtime, lambda x: np.zeros((len(x), 1)), lambda _, on_boundary: on_boundary
    )
    # lambda _, on_boundary: on_boundary
    #   an anonymous function: it takes two inputs and, regardless of the
    #   first input, returns the second input unchanged
    ic = dde.IC(
        geomtime, lambda x: -np.sin(np.pi * x[:, 0:1]), lambda _, on_initial: on_initial
    )

    data = dde.data.TimePDE(
        geomtime, pde, [bc, ic], num_domain=2540, num_boundary=80, num_initial=160
    )
    net = dde.maps.FNN([2] + [20] * 3 + [1], "tanh", "Glorot normal")
    model = dde.Model(data, net)

    # we first use Adam for a certain number of iterations, and then switch to L-BFGS.
    model.compile("adam", lr=1e-3)
    model.train(epochs=1500)
    # The optimizer L-BFGS does not require learning rate,
    # and the neural network is trained until convergence,
    # so the number of iterations is also ignored for L-BFGS.
    model.compile("L-BFGS-B")
    losshistory, train_state = model.train(display_every=200)
    # dde.saveplot(losshistory, train_state, issave=True, isplot=True)
    #
    # X, y_true = gen_testdata()
    # y_pred = model.predict(X)
    # f = model.predict(X, operator=pde)
    # print("Mean residual:", np.mean(np.absolute(f)))
    # print("L2 relative error:", dde.metrics.l2_relative_error(y_true, y_pred))
    # np.savetxt("test.dat", np.hstack((X, y_true, y_pred)))


if __name__ == "__main__":
    main()

Without batch_normalization, it gives:

Using TensorFlow 1 backend.

Compiling model...
Building feed-forward neural network...
'build' took 0.305456 s

'compile' took 0.732757 s

Initializing variables...
Training model...

Step Train loss Test loss Test metric
0 [4.72e-03, 8.26e-03, 4.37e-01] [4.72e-03, 0.00e+00, 0.00e+00] []
1000 [3.72e-02, 2.52e-04, 5.29e-02] [3.72e-02, 0.00e+00, 0.00e+00] []
1500 [3.41e-02, 1.06e-04, 4.87e-02] [3.41e-02, 0.00e+00, 0.00e+00] []

Best model at step 1500:
train loss: 8.29e-02
test loss: 3.41e-02
test metric: []

'train' took 6.344179 s

Compiling model...
'compile' took 1.180172 s

Training model...

Step Train loss Test loss Test metric
1500 [3.41e-02, 1.06e-04, 4.87e-02] [3.41e-02, 0.00e+00, 0.00e+00] []
1600 [2.18e-02, 2.19e-04, 2.46e-02]
1800 [8.66e-03, 1.41e-05, 9.62e-03]
2000 [6.09e-03, 4.42e-06, 3.73e-03]
2200 [3.99e-03, 6.30e-06, 3.37e-03]
2400 [2.94e-03, 9.06e-06, 3.06e-03]
2600 [2.20e-03, 8.50e-06, 2.66e-03]
2800 [1.56e-03, 2.16e-06, 2.32e-03]
3000 [1.29e-03, 1.07e-06, 2.16e-03]
3200 [1.08e-03, 2.86e-06, 2.11e-03]
3400 [9.43e-04, 9.06e-07, 2.08e-03]
3600 [8.71e-04, 2.73e-06, 2.01e-03]
3800 [8.47e-04, 1.08e-06, 1.68e-03]
4000 [4.21e-04, 9.31e-07, 4.50e-04]
4200 [3.30e-04, 6.53e-07, 4.29e-04]
4400 [2.89e-04, 8.89e-07, 4.05e-04]
4600 [2.69e-04, 5.40e-07, 3.80e-04]
4800 [2.55e-04, 2.23e-07, 3.47e-04]
5000 [2.21e-04, 4.14e-07, 3.26e-04]
5200 [2.22e-04, 4.91e-07, 2.95e-04]
5400 [2.27e-04, 1.82e-07, 2.45e-04]
5600 [2.25e-04, 2.67e-07, 1.60e-04]
5800 [1.89e-04, 1.21e-07, 8.47e-05]
6000 [1.45e-04, 1.27e-07, 6.40e-05]
6200 [1.28e-04, 2.80e-07, 5.85e-05]
6400 [1.10e-04, 9.07e-08, 4.82e-05]
6600 [9.93e-05, 9.26e-08, 4.33e-05]
6800 [8.91e-05, 8.64e-08, 3.82e-05]
7000 [8.05e-05, 8.08e-08, 3.42e-05]
7200 [7.32e-05, 9.41e-08, 3.23e-05]
7400 [6.76e-05, 1.27e-07, 2.91e-05]
7600 [6.43e-05, 9.91e-08, 2.38e-05]
7800 [5.84e-05, 8.04e-08, 2.04e-05]
8000 [5.45e-05, 1.05e-07, 1.71e-05]
8200 [5.21e-05, 1.18e-07, 1.51e-05]
8400 [4.81e-05, 1.78e-07, 1.42e-05]
8600 [4.60e-05, 1.55e-07, 1.17e-05]
8800 [4.29e-05, 2.24e-07, 9.72e-06]
9000 [3.96e-05, 1.27e-07, 7.74e-06]
9200 [3.77e-05, 1.05e-07, 6.13e-06]
9400 [3.47e-05, 1.12e-07, 3.55e-06]
9600 [3.04e-05, 1.89e-07, 2.58e-06]
9800 [2.74e-05, 1.61e-07, 2.34e-06]
10000 [2.61e-05, 1.09e-07, 2.13e-06]
10200 [2.44e-05, 5.00e-08, 2.04e-06]
10400 [2.33e-05, 3.94e-08, 1.81e-06]
10600 [2.22e-05, 4.27e-08, 1.65e-06]
10800 [2.10e-05, 3.46e-08, 1.35e-06]
11000 [2.00e-05, 3.83e-08, 1.06e-06]
11200 [1.86e-05, 5.09e-08, 1.06e-06]
11400 [1.77e-05, 5.11e-08, 1.05e-06]
11600 [1.68e-05, 4.61e-08, 1.17e-06]
11702 [1.65e-05, 4.23e-08, 1.13e-06] [1.65e-05, 0.00e+00, 0.00e+00] []

Best model at step 11702:
train loss: 1.76e-05
test loss: 1.65e-05
test metric: []

'train' took 49.402318 s

To compare, if I add batch_normalization:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np

import deepxde as dde
from deepxde.backend import tf


def gen_testdata():
    data = np.load("dataset/Burgers.npz")
    t, x, exact = data["t"], data["x"], data["usol"].T
    xx, tt = np.meshgrid(x, t)
    X = np.vstack((np.ravel(xx), np.ravel(tt))).T
    y = exact.flatten()[:, None]
    return X, y


def main():
    def pde(x, y):
        dy_x = tf.gradients(y, x)[0]
        dy_x, dy_t = dy_x[:, 0:1], dy_x[:, 1:2]
        dy_xx = tf.gradients(dy_x, x)[0][:, 0:1]
        return dy_t + y * dy_x - 0.01 / np.pi * dy_xx

    geom = dde.geometry.Interval(-1, 1)
    timedomain = dde.geometry.TimeDomain(0, 0.99)
    geomtime = dde.geometry.GeometryXTime(geom, timedomain)

    bc = dde.DirichletBC(
        geomtime, lambda x: np.zeros((len(x), 1)), lambda _, on_boundary: on_boundary
    )
    # lambda _, on_boundary: on_boundary
    #   an anonymous function: it takes two inputs and, regardless of the
    #   first input, returns the second input unchanged
    ic = dde.IC(
        geomtime, lambda x: -np.sin(np.pi * x[:, 0:1]), lambda _, on_initial: on_initial
    )

    data = dde.data.TimePDE(
        geomtime, pde, [bc, ic], num_domain=2540, num_boundary=80, num_initial=160
    )
    net = dde.maps.FNN([2] + [20] * 3 + [1], "tanh", "Glorot normal", batch_normalization="before")
    model = dde.Model(data, net)

    # we first use Adam for a certain number of iterations, and then switch to L-BFGS.
    model.compile("adam", lr=1e-3)
    model.train(epochs=1500)
    # The optimizer L-BFGS does not require learning rate,
    # and the neural network is trained until convergence,
    # so the number of iterations is also ignored for L-BFGS.
    model.compile("L-BFGS-B")
    losshistory, train_state = model.train(display_every=200)
    # dde.saveplot(losshistory, train_state, issave=True, isplot=True)
    #
    # X, y_true = gen_testdata()
    # y_pred = model.predict(X)
    # f = model.predict(X, operator=pde)
    # print("Mean residual:", np.mean(np.absolute(f)))
    # print("L2 relative error:", dde.metrics.l2_relative_error(y_true, y_pred))
    # np.savetxt("test.dat", np.hstack((X, y_true, y_pred)))


if __name__ == "__main__":
    main()

The output is:

Using TensorFlow 1 backend.

Compiling model...
Building feed-forward neural network...
'build' took 0.459043 s

'compile' took 3.913206 s

Initializing variables...
Training model...

Step Train loss Test loss Test metric
0 [3.33e-02, 2.57e-01, 2.37e-01] [3.33e-02, 0.00e+00, 0.00e+00] []
1000 [1.20e+00, 7.88e-04, 7.63e-04] [1.20e+00, 0.00e+00, 0.00e+00] []
1500 [1.11e+00, 4.78e-04, 3.57e-04] [1.11e+00, 0.00e+00, 0.00e+00] []

Best model at step 0:
train loss: 5.27e-01
test loss: 3.33e-02
test metric: []

'train' took 21.294190 s

Compiling model...
'compile' took 3.631830 s

Training model...

Step Train loss Test loss Test metric
1500 [1.11e+00, 4.78e-04, 3.57e-04] [1.11e+00, 0.00e+00, 0.00e+00] []
1600 [1.81e-04, 4.00e-05, 4.24e-05]
1800 [7.96e-05, 1.73e-05, 2.90e-05]
2000 [3.92e-05, 1.16e-05, 2.35e-05]
2200 [2.34e-05, 1.08e-05, 1.87e-05]
2400 [1.92e-05, 8.32e-06, 1.26e-05]
2484 [2.58e+00, 4.39e-02, 1.84e-02] [2.58e+00, 0.00e+00, 0.00e+00] []

Best model at step 0:
train loss: 5.27e-01
test loss: 3.33e-02
test metric: []

'train' took 16.359440 s

Notice that at step 2484 the loss increased by three orders of magnitude.

I am guessing that the mean and standard deviation from training are either not properly stored or not properly reused at test time. Any idea? Thanks!
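
For concreteness, here is a minimal TF1 sketch (plain TensorFlow, not DeepXDE code) of what I mean by "stored" and "reused": batch norm normalizes with the current batch statistics during training and with the saved moving averages at test time, and those moving averages are only written when the ops in UPDATE_OPS are actually run.

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 2])
training = tf.placeholder(tf.bool, shape=[])

# training=True  -> normalize with the statistics of the current batch
# training=False -> normalize with the stored moving_mean / moving_variance
h = tf.layers.batch_normalization(x, training=training)

# These ops are what actually write moving_mean / moving_variance; if they
# are never run, the test-time statistics stay at their initial values
# (mean 0, variance 1).
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)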

@smao-astro
Contributor Author

By the way, #68 does use batch_normalization, so its output might be influenced by the issue mentioned here.

@lululxvi
Owner

The code of batch_normalization should be correct. You can try a simple function approximation for testing. For PDEs, it is a little complicated, and I am not sure whether batch normalization would help or not. It becomes even more complicated when using L-BFGS, which is a quasi-Newton method. I usually don't use batch norm for PDEs. Let me know if you have any specific reason to use batch norm.

@smao-astro
Contributor Author

  • The code above was just modified from Burgers.py.
  • I hope batch norm could help, since the spatial and time coordinates are usually not of the same magnitude (unless I manually choose the units to make them so).

@lululxvi
Owner

lululxvi commented Jun 25, 2020

  • You can also check other examples, and you will find that L-BFGS is not used very often.
  • In your case, it seems to me that x and t are of the same magnitude.
  • There are two ways to handle the input scaling:
    • Recommendation: Re-scale your PDE if possible.
    • Use apply_feature_transform to rescale the network inputs, e.g., if x is in [0, a] and t is in [0, b] (a fuller sketch follows after the snippet):
net = ...
net.apply_feature_transform(lambda X: tf.concat([X[:, 0:1] / a, X[:, 1:2] / b], axis=1))
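
For example, a fuller runnable sketch of the second option (just a sketch; a = 1.0 and b = 0.99 below are stand-ins for your actual domain sizes):

import deepxde as dde
from deepxde.backend import tf

a, b = 1.0, 0.99  # hypothetical domain sizes for x and t

net = dde.maps.FNN([2] + [20] * 3 + [1], "tanh", "Glorot normal")
# Map (x, t) -> (x / a, t / b) before the first layer so both inputs are O(1).
net.apply_feature_transform(
    lambda X: tf.concat([X[:, 0:1] / a, X[:, 1:2] / b], axis=1)
)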

@smao-astro
Contributor Author

Hi Lu,

I see. Thank you for your reply. I will temporarily close this issue since I also did not spot anything wrong with the batch normalization code.

@smao-astro smao-astro reopened this Oct 3, 2020
@smao-astro
Contributor Author

The issue still exists when using batch_normalization="before" AND "L-BFGS-B", and I suspect this is because update_ops is not executed when is_scipy_opts(optimizer) is True (see the sketch after the excerpt below):

deepxde/deepxde/train.py

Lines 17 to 47 in f8ab0b0

def get_train_op(loss, optimizer, lr=None, decay=None):
    if is_scipy_opts(optimizer):
        if backend.is_tf_1():
            ScipyOptimizerInterface = tf.contrib.opt.ScipyOptimizerInterface
        else:
            ScipyOptimizerInterface = external_optimizer.ScipyOptimizerInterface
        if lr is not None or decay is not None:
            print("Warning: learning rate is ignored for {}".format(optimizer))
        return ScipyOptimizerInterface(
            loss,
            method=optimizer,
            options={
                "disp": None,
                "maxcor": 50,
                "ftol": np.finfo(float).eps,
                "gtol": 1e-5,
                "eps": 1e-8,
                "maxfun": 15000,
                "maxiter": 15000,
                "maxls": 50,
            },
        )
    if lr is None:
        raise ValueError("No learning rate for {}.".format(optimizer))
    lr, global_step = _get_learningrate(lr, decay)
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
        train_op = _get_optimizer(optimizer, lr).minimize(loss, global_step=global_step)
    return train_op
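
To make the gap concrete, here is a generic TF1 sketch (not a proposed patch) of the pattern the non-scipy branch above follows: the batch-norm update ops are made a dependency of the training step, so they run on every iteration. The ScipyOptimizerInterface branch returns before this point and is handed only the loss, so nothing ever runs update_ops.

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 2])
y_true = tf.placeholder(tf.float32, [None, 1])
training = tf.placeholder(tf.bool, shape=[])

h = tf.layers.batch_normalization(x, training=training)
y = tf.layers.dense(h, 1)
loss = tf.reduce_mean(tf.square(y - y_true))

# Non-scipy branch: running train_op also runs the batch-norm update ops,
# so moving_mean / moving_variance track the training data.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

# Scipy branch: ScipyOptimizerInterface(loss, method="L-BFGS-B", ...) only
# evaluates the loss and its gradients, so update_ops is never executed and
# the moving statistics never change.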

@smao-astro
Contributor Author

import numpy as np

import deepxde as dde
from deepxde.backend import tf

import sys


def gen_testdata():
    data = np.load("dataset/Burgers.npz")
    t, x, exact = data["t"], data["x"], data["usol"].T
    xx, tt = np.meshgrid(x, t)
    X = np.vstack((np.ravel(xx), np.ravel(tt))).T
    y = exact.flatten()[:, None]
    return X, y


def main(batch_normalization):
    def pde(x, y):
        dy_x = tf.gradients(y, x)[0]
        dy_x, dy_t = dy_x[:, 0:1], dy_x[:, 1:2]
        dy_xx = tf.gradients(dy_x, x)[0][:, 0:1]
        return dy_t + y * dy_x - 0.01 / np.pi * dy_xx

    geom = dde.geometry.Interval(-1, 1)
    timedomain = dde.geometry.TimeDomain(0, 0.99)
    geomtime = dde.geometry.GeometryXTime(geom, timedomain)

    bc = dde.DirichletBC(geomtime, lambda x: 0, lambda _, on_boundary: on_boundary)
    ic = dde.IC(
        geomtime, lambda x: -np.sin(np.pi * x[:, 0:1]), lambda _, on_initial: on_initial
    )

    data = dde.data.TimePDE(
        geomtime, pde, [bc, ic], num_domain=2540, num_boundary=80, num_initial=160
    )
    net = dde.maps.FNN(
        [2] + [20] * 3 + [1],
        "tanh",
        "Glorot normal",
        batch_normalization=batch_normalization,
    )
    model = dde.Model(data, net)

    # model.compile("adam", lr=1e-4)
    # model.train(epochs=1500)
    model.compile("L-BFGS-B")
    losshistory, train_state = model.train()
    dde.saveplot(losshistory, train_state, issave=True, isplot=True)

    X, y_true = gen_testdata()
    y_pred = model.predict(X)
    f = model.predict(X, operator=pde)
    print("Mean residual:", np.mean(np.absolute(f)))
    print("L2 relative error:", dde.metrics.l2_relative_error(y_true, y_pred))
    np.savetxt("test.dat", np.hstack((X, y_true, y_pred)))


if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else None)

Running with "before" as the command-line argument gives the loss curve below:

[loss_curve figure]

The issue is that the test loss is much larger than the train loss.

@lululxvi
Owner

lululxvi commented Oct 5, 2020

Yes, "L-BFGS-B" does not work with "batch_normalization", because "L-BFGS-B" is from scipy. But the TensorFlow optimizers should work.

@smao-astro
Contributor Author

No, it does not (correct me if I am wrong). Applying batch_normalization="before" to examples/diffusion_1d.py gives the loss curve below:

[diffusion_1d_loss_curve figure]

@lululxvi
Owner

lululxvi commented Oct 8, 2020

I am not sure whether it makes sense to use batch-norm, because here we want to compute the derivatives dy/dx. My suggestion is that you may just remove batch-norm. We have worked on many different cases, and we never use batch-norm (the main purpose of batch-norm is for deep networks). There are always other ways.

@smao-astro
Contributor Author

Hi Lu,

I see, and I agree with you that one should be careful when using batch normalization in such cases. Thank you for your reply!
