batch normalization: loss increased when deepxde.maps.map.Map.training set to False #69

Closed
smao-astro opened this issue Jun 24, 2020 · 11 comments


@smao-astro
Contributor

Hi Lu,

I am trying deepxde.maps.fnn.FNN.batch_normalization at

batch_normalization=None,

And I noticed that when
def _test(self, uncertainty):
is called, the loss increases significantly compared to the loss computed at
self.train_op.minimize(

To reproduce, here are two scripts to compare against each other:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np

import deepxde as dde
from deepxde.backend import tf


def gen_testdata():
    data = np.load("dataset/Burgers.npz")
    t, x, exact = data["t"], data["x"], data["usol"].T
    xx, tt = np.meshgrid(x, t)
    X = np.vstack((np.ravel(xx), np.ravel(tt))).T
    y = exact.flatten()[:, None]
    return X, y


def main():
    def pde(x, y):
        dy_x = tf.gradients(y, x)[0]
        dy_x, dy_t = dy_x[:, 0:1], dy_x[:, 1:2]
        dy_xx = tf.gradients(dy_x, x)[0][:, 0:1]
        return dy_t + y * dy_x - 0.01 / np.pi * dy_xx

    geom = dde.geometry.Interval(-1, 1)
    timedomain = dde.geometry.TimeDomain(0, 0.99)
    geomtime = dde.geometry.GeometryXTime(geom, timedomain)

    bc = dde.DirichletBC(
        geomtime, lambda x: np.zeros((len(x), 1)), lambda _, on_boundary: on_boundary
    )
    # lambda _, on_boundary: on_boundary
    #   an anonymous function: it takes two inputs and, regardless of the
    #   first input, returns the second input unchanged
    ic = dde.IC(
        geomtime, lambda x: -np.sin(np.pi * x[:, 0:1]), lambda _, on_initial: on_initial
    )

    data = dde.data.TimePDE(
        geomtime, pde, [bc, ic], num_domain=2540, num_boundary=80, num_initial=160
    )
    net = dde.maps.FNN([2] + [20] * 3 + [1], "tanh", "Glorot normal")
    model = dde.Model(data, net)

    # we first use Adam for a certain number of iterations, and then switch to L-BFGS.
    model.compile("adam", lr=1e-3)
    model.train(epochs=1500)
    # The optimizer L-BFGS does not require learning rate,
    # and the neural network is trained until convergence,
    # so the number of iterations is also ignored for L-BFGS.
    model.compile("L-BFGS-B")
    losshistory, train_state = model.train(display_every=200)
    # dde.saveplot(losshistory, train_state, issave=True, isplot=True)
    #
    # X, y_true = gen_testdata()
    # y_pred = model.predict(X)
    # f = model.predict(X, operator=pde)
    # print("Mean residual:", np.mean(np.absolute(f)))
    # print("L2 relative error:", dde.metrics.l2_relative_error(y_true, y_pred))
    # np.savetxt("test.dat", np.hstack((X, y_true, y_pred)))


if __name__ == "__main__":
    main()

Without batch_normalization, it gives:

Using TensorFlow 1 backend.

Compiling model...
Building feed-forward neural network...
'build' took 0.305456 s

'compile' took 0.732757 s

Initializing variables...
Training model...

Step Train loss Test loss Test metric
0 [4.72e-03, 8.26e-03, 4.37e-01] [4.72e-03, 0.00e+00, 0.00e+00] []
1000 [3.72e-02, 2.52e-04, 5.29e-02] [3.72e-02, 0.00e+00, 0.00e+00] []
1500 [3.41e-02, 1.06e-04, 4.87e-02] [3.41e-02, 0.00e+00, 0.00e+00] []

Best model at step 1500:
train loss: 8.29e-02
test loss: 3.41e-02
test metric: []

'train' took 6.344179 s

Compiling model...
'compile' took 1.180172 s

Training model...

Step Train loss Test loss Test metric
1500 [3.41e-02, 1.06e-04, 4.87e-02] [3.41e-02, 0.00e+00, 0.00e+00] []
1600 [2.18e-02, 2.19e-04, 2.46e-02]
1800 [8.66e-03, 1.41e-05, 9.62e-03]
2000 [6.09e-03, 4.42e-06, 3.73e-03]
2200 [3.99e-03, 6.30e-06, 3.37e-03]
2400 [2.94e-03, 9.06e-06, 3.06e-03]
2600 [2.20e-03, 8.50e-06, 2.66e-03]
2800 [1.56e-03, 2.16e-06, 2.32e-03]
3000 [1.29e-03, 1.07e-06, 2.16e-03]
3200 [1.08e-03, 2.86e-06, 2.11e-03]
3400 [9.43e-04, 9.06e-07, 2.08e-03]
3600 [8.71e-04, 2.73e-06, 2.01e-03]
3800 [8.47e-04, 1.08e-06, 1.68e-03]
4000 [4.21e-04, 9.31e-07, 4.50e-04]
4200 [3.30e-04, 6.53e-07, 4.29e-04]
4400 [2.89e-04, 8.89e-07, 4.05e-04]
4600 [2.69e-04, 5.40e-07, 3.80e-04]
4800 [2.55e-04, 2.23e-07, 3.47e-04]
5000 [2.21e-04, 4.14e-07, 3.26e-04]
5200 [2.22e-04, 4.91e-07, 2.95e-04]
5400 [2.27e-04, 1.82e-07, 2.45e-04]
5600 [2.25e-04, 2.67e-07, 1.60e-04]
5800 [1.89e-04, 1.21e-07, 8.47e-05]
6000 [1.45e-04, 1.27e-07, 6.40e-05]
6200 [1.28e-04, 2.80e-07, 5.85e-05]
6400 [1.10e-04, 9.07e-08, 4.82e-05]
6600 [9.93e-05, 9.26e-08, 4.33e-05]
6800 [8.91e-05, 8.64e-08, 3.82e-05]
7000 [8.05e-05, 8.08e-08, 3.42e-05]
7200 [7.32e-05, 9.41e-08, 3.23e-05]
7400 [6.76e-05, 1.27e-07, 2.91e-05]
7600 [6.43e-05, 9.91e-08, 2.38e-05]
7800 [5.84e-05, 8.04e-08, 2.04e-05]
8000 [5.45e-05, 1.05e-07, 1.71e-05]
8200 [5.21e-05, 1.18e-07, 1.51e-05]
8400 [4.81e-05, 1.78e-07, 1.42e-05]
8600 [4.60e-05, 1.55e-07, 1.17e-05]
8800 [4.29e-05, 2.24e-07, 9.72e-06]
9000 [3.96e-05, 1.27e-07, 7.74e-06]
9200 [3.77e-05, 1.05e-07, 6.13e-06]
9400 [3.47e-05, 1.12e-07, 3.55e-06]
9600 [3.04e-05, 1.89e-07, 2.58e-06]
9800 [2.74e-05, 1.61e-07, 2.34e-06]
10000 [2.61e-05, 1.09e-07, 2.13e-06]
10200 [2.44e-05, 5.00e-08, 2.04e-06]
10400 [2.33e-05, 3.94e-08, 1.81e-06]
10600 [2.22e-05, 4.27e-08, 1.65e-06]
10800 [2.10e-05, 3.46e-08, 1.35e-06]
11000 [2.00e-05, 3.83e-08, 1.06e-06]
11200 [1.86e-05, 5.09e-08, 1.06e-06]
11400 [1.77e-05, 5.11e-08, 1.05e-06]
11600 [1.68e-05, 4.61e-08, 1.17e-06]
11702 [1.65e-05, 4.23e-08, 1.13e-06] [1.65e-05, 0.00e+00, 0.00e+00] []

Best model at step 11702:
train loss: 1.76e-05
test loss: 1.65e-05
test metric: []

'train' took 49.402318 s

To compare, if I add batch_normalization:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np

import deepxde as dde
from deepxde.backend import tf


def gen_testdata():
    data = np.load("dataset/Burgers.npz")
    t, x, exact = data["t"], data["x"], data["usol"].T
    xx, tt = np.meshgrid(x, t)
    X = np.vstack((np.ravel(xx), np.ravel(tt))).T
    y = exact.flatten()[:, None]
    return X, y


def main():
    def pde(x, y):
        dy_x = tf.gradients(y, x)[0]
        dy_x, dy_t = dy_x[:, 0:1], dy_x[:, 1:2]
        dy_xx = tf.gradients(dy_x, x)[0][:, 0:1]
        return dy_t + y * dy_x - 0.01 / np.pi * dy_xx

    geom = dde.geometry.Interval(-1, 1)
    timedomain = dde.geometry.TimeDomain(0, 0.99)
    geomtime = dde.geometry.GeometryXTime(geom, timedomain)

    bc = dde.DirichletBC(
        geomtime, lambda x: np.zeros((len(x), 1)), lambda _, on_boundary: on_boundary
    )
    # lambda _, on_boundary: on_boundary
    #   an anonymous function: it takes two inputs and, regardless of the
    #   first input, returns the second input unchanged
    ic = dde.IC(
        geomtime, lambda x: -np.sin(np.pi * x[:, 0:1]), lambda _, on_initial: on_initial
    )

    data = dde.data.TimePDE(
        geomtime, pde, [bc, ic], num_domain=2540, num_boundary=80, num_initial=160
    )
    net = dde.maps.FNN([2] + [20] * 3 + [1], "tanh", "Glorot normal", batch_normalization="before")
    model = dde.Model(data, net)

    # we first use Adam for a certain number of iterations, and then switch to L-BFGS.
    model.compile("adam", lr=1e-3)
    model.train(epochs=1500)
    # The optimizer L-BFGS does not require learning rate,
    # and the neural network is trained until convergence,
    # so the number of iterations is also ignored for L-BFGS.
    model.compile("L-BFGS-B")
    losshistory, train_state = model.train(display_every=200)
    # dde.saveplot(losshistory, train_state, issave=True, isplot=True)
    #
    # X, y_true = gen_testdata()
    # y_pred = model.predict(X)
    # f = model.predict(X, operator=pde)
    # print("Mean residual:", np.mean(np.absolute(f)))
    # print("L2 relative error:", dde.metrics.l2_relative_error(y_true, y_pred))
    # np.savetxt("test.dat", np.hstack((X, y_true, y_pred)))


if __name__ == "__main__":
    main()

The output is:

Using TensorFlow 1 backend.

Compiling model...
Building feed-forward neural network...
'build' took 0.459043 s

'compile' took 3.913206 s

Initializing variables...
Training model...

Step Train loss Test loss Test metric
0 [3.33e-02, 2.57e-01, 2.37e-01] [3.33e-02, 0.00e+00, 0.00e+00] []
1000 [1.20e+00, 7.88e-04, 7.63e-04] [1.20e+00, 0.00e+00, 0.00e+00] []
1500 [1.11e+00, 4.78e-04, 3.57e-04] [1.11e+00, 0.00e+00, 0.00e+00] []

Best model at step 0:
train loss: 5.27e-01
test loss: 3.33e-02
test metric: []

'train' took 21.294190 s

Compiling model...
'compile' took 3.631830 s

Training model...

Step Train loss Test loss Test metric
1500 [1.11e+00, 4.78e-04, 3.57e-04] [1.11e+00, 0.00e+00, 0.00e+00] []
1600 [1.81e-04, 4.00e-05, 4.24e-05]
1800 [7.96e-05, 1.73e-05, 2.90e-05]
2000 [3.92e-05, 1.16e-05, 2.35e-05]
2200 [2.34e-05, 1.08e-05, 1.87e-05]
2400 [1.92e-05, 8.32e-06, 1.26e-05]
2484 [2.58e+00, 4.39e-02, 1.84e-02] [2.58e+00, 0.00e+00, 0.00e+00] []

Best model at step 0:
train loss: 5.27e-01
test loss: 3.33e-02
test metric: []

'train' took 16.359440 s

Notice that at step 2484 the loss increased by three orders of magnitude.

I am guessing that the mean and standard deviation from training are either not properly stored or not properly reused at test time. Any idea? Thanks!
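
For concreteness, here is a minimal TF1 sketch (plain TensorFlow, not DeepXDE code) of what I mean by "stored" and "reused": batch norm normalizes with the current batch statistics during training and with the saved moving averages at test time, and those moving averages are only written when the ops in UPDATE_OPS are actually run.

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 2])
training = tf.placeholder(tf.bool, shape=[])

# training=True  -> normalize with the statistics of the current batch
# training=False -> normalize with the stored moving_mean / moving_variance
h = tf.layers.batch_normalization(x, training=training)

# These ops are what actually write moving_mean / moving_variance; if they
# are never run, the test-time statistics stay at their initial values
# (mean 0, variance 1).
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)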

@smao-astro
Contributor Author

By the way, #68 does use batch_normalization, so its output might be influenced by the issue mentioned here.

@lululxvi
Owner

The code of batch_normalization should be correct. You can try a simple function approximation for testing. For PDEs, it is a little complicated, and I am not sure whether batch normalization would help or not. It becomes even more complicated when using L-BFGS, which is a quasi-Newton method. I usually don't use batch norm for PDEs. Let me know if you have any specific reason to use batch norm.

@smao-astro
Contributor Author

  • The code above was just modified from Burgers.py.
  • I hope batch norm could help, since the spatial and time coordinates are usually not of the same magnitude (unless I manually choose the units to make them so).

@lululxvi
Owner

lululxvi commented Jun 25, 2020

  • You can also check other examples, and you will find that L-BFGS is not used very often.
  • In your case, it seems to me that x and t are of the same magnitude.
  • There are two ways to handle the input scaling:
    • Recommendation: Re-scale your PDE if possible.
    • Use apply_feature_transform to rescale the network inputs, e.g., if x is in [0, a] and t is in [0, b] (a fuller sketch follows after the snippet):
net = ...
net.apply_feature_transform(lambda X: tf.concat([X[:, 0:1] / a, X[:, 1:2] / b], axis=1))
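
For example, a fuller runnable sketch of the second option (just a sketch; a = 1.0 and b = 0.99 below are stand-ins for your actual domain sizes):

import deepxde as dde
from deepxde.backend import tf

a, b = 1.0, 0.99  # hypothetical domain sizes for x and t

net = dde.maps.FNN([2] + [20] * 3 + [1], "tanh", "Glorot normal")
# Map (x, t) -> (x / a, t / b) before the first layer so both inputs are O(1).
net.apply_feature_transform(
    lambda X: tf.concat([X[:, 0:1] / a, X[:, 1:2] / b], axis=1)
)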

@smao-astro
Contributor Author

Hi Lu,

I see. Thank you for your reply. I will temporarily close this issue since I also did not spot anything wrong with the batch normalization code.

@smao-astro smao-astro reopened this Oct 3, 2020
@smao-astro
Contributor Author

The issue still exists when using batch_normalization="before" AND "L-BFGS-B", and I suspect this is because update_ops is not executed when is_scipy_opts(optimizer) is True (see the sketch after the excerpt below):

deepxde/deepxde/train.py

Lines 17 to 47 in f8ab0b0

def get_train_op(loss, optimizer, lr=None, decay=None):
    if is_scipy_opts(optimizer):
        if backend.is_tf_1():
            ScipyOptimizerInterface = tf.contrib.opt.ScipyOptimizerInterface
        else:
            ScipyOptimizerInterface = external_optimizer.ScipyOptimizerInterface
        if lr is not None or decay is not None:
            print("Warning: learning rate is ignored for {}".format(optimizer))
        return ScipyOptimizerInterface(
            loss,
            method=optimizer,
            options={
                "disp": None,
                "maxcor": 50,
                "ftol": np.finfo(float).eps,
                "gtol": 1e-5,
                "eps": 1e-8,
                "maxfun": 15000,
                "maxiter": 15000,
                "maxls": 50,
            },
        )
    if lr is None:
        raise ValueError("No learning rate for {}.".format(optimizer))
    lr, global_step = _get_learningrate(lr, decay)
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
        train_op = _get_optimizer(optimizer, lr).minimize(loss, global_step=global_step)
    return train_op
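
To make the gap concrete, here is a generic TF1 sketch (not a proposed patch) of the pattern the non-scipy branch above follows: the batch-norm update ops are made a dependency of the training step, so they run on every iteration. The ScipyOptimizerInterface branch returns before this point and is handed only the loss, so nothing ever runs update_ops.

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 2])
y_true = tf.placeholder(tf.float32, [None, 1])
training = tf.placeholder(tf.bool, shape=[])

h = tf.layers.batch_normalization(x, training=training)
y = tf.layers.dense(h, 1)
loss = tf.reduce_mean(tf.square(y - y_true))

# Non-scipy branch: running train_op also runs the batch-norm update ops,
# so moving_mean / moving_variance track the training data.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

# Scipy branch: ScipyOptimizerInterface(loss, method="L-BFGS-B", ...) only
# evaluates the loss and its gradients, so update_ops is never executed and
# the moving statistics never change.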

@smao-astro
Contributor Author

import numpy as np

import deepxde as dde
from deepxde.backend import tf

import sys


def gen_testdata():
    data = np.load("dataset/Burgers.npz")
    t, x, exact = data["t"], data["x"], data["usol"].T
    xx, tt = np.meshgrid(x, t)
    X = np.vstack((np.ravel(xx), np.ravel(tt))).T
    y = exact.flatten()[:, None]
    return X, y


def main(batch_normalization):
    def pde(x, y):
        dy_x = tf.gradients(y, x)[0]
        dy_x, dy_t = dy_x[:, 0:1], dy_x[:, 1:2]
        dy_xx = tf.gradients(dy_x, x)[0][:, 0:1]
        return dy_t + y * dy_x - 0.01 / np.pi * dy_xx

    geom = dde.geometry.Interval(-1, 1)
    timedomain = dde.geometry.TimeDomain(0, 0.99)
    geomtime = dde.geometry.GeometryXTime(geom, timedomain)

    bc = dde.DirichletBC(geomtime, lambda x: 0, lambda _, on_boundary: on_boundary)
    ic = dde.IC(
        geomtime, lambda x: -np.sin(np.pi * x[:, 0:1]), lambda _, on_initial: on_initial
    )

    data = dde.data.TimePDE(
        geomtime, pde, [bc, ic], num_domain=2540, num_boundary=80, num_initial=160
    )
    net = dde.maps.FNN(
        [2] + [20] * 3 + [1],
        "tanh",
        "Glorot normal",
        batch_normalization=batch_normalization,
    )
    model = dde.Model(data, net)

    # model.compile("adam", lr=1e-4)
    # model.train(epochs=1500)
    model.compile("L-BFGS-B")
    losshistory, train_state = model.train()
    dde.saveplot(losshistory, train_state, issave=True, isplot=True)

    X, y_true = gen_testdata()
    y_pred = model.predict(X)
    f = model.predict(X, operator=pde)
    print("Mean residual:", np.mean(np.absolute(f)))
    print("L2 relative error:", dde.metrics.l2_relative_error(y_true, y_pred))
    np.savetxt("test.dat", np.hstack((X, y_true, y_pred)))


if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else None)

Running with "before" as the command-line argument gives the loss curve below:

[loss_curve figure]

The issue is that the test loss is much larger than the train loss.

@lululxvi
Owner

lululxvi commented Oct 5, 2020

Yes, "L-BFGS-B" does not work with "batch_normalization", because "L-BFGS-B" is from scipy. But the TensorFlow optimizers should work.

@smao-astro
Contributor Author

No, it does not (correct me if I am wrong). Applying batch_normalization="before" to examples/diffusion_1d.py gives the loss curve below:

[diffusion_1d_loss_curve figure]

@lululxvi
Owner

lululxvi commented Oct 8, 2020

I am not sure whether it makes sense to use batch-norm, because here we want to compute the derivatives dy/dx. My suggestion is that you may just remove batch-norm. We have worked on many different cases, and we never use batch-norm (the main purpose of batch-norm is for deep networks). There are always other ways.

@smao-astro
Contributor Author

Hi Lu,

I see, and I agree with you that one should be careful when using batch normalization in such cases. Thank you for your reply!
