sklearn.roc_auc_score custom metric with Batch Normalization freezes during training #12954

Closed
Voyz opened this issue Jun 13, 2019 · 0 comments
Voyz commented Jun 13, 2019

System information

  • Have I written custom code (as opposed to using a stock example script): Yes, https://gist.github.com/Voyz/3dded010911f6d2105020222a4e86cd0
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10 Pro v1803 / 17134.648
  • TensorFlow backend (yes / no): yes
  • TensorFlow version: 1.13.1
  • Keras version: 2.2.4
  • Python version: Python 3.6.7
  • CUDA/cuDNN version: 10.0, V10.0.130 / 7.5.0
  • GPU model and memory: NVIDIA GeForce RTX 2070 with Max-Q Design, 8 GB

Describe the current behavior
When using sklearn's roc_auc_score as a custom metric in the following form:

def auc(y_true, y_pred):
    return tf.py_function(roc_auc_score, (y_true, y_pred), tf.double)
...
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy', auc])

Along with a BatchNormalization layer applied to a 1D input (i.e. shape (None, features)), e.g.:

inputs = Input(shape=[input_dim])
x = Dense(10)(inputs)
x = BatchNormalization(axis=1)(x)
x = Dense(1)(x)

the training freezes after a few batches without throwing an error. I couldn't spot any pattern: it always stops at a different moment, sometimes just after the start, sometimes closer to the end.

Tested on sklearn versions: 0.20.3 and 0.21.2.

Apologies if this is a non-issue, or if it belongs with sklearn rather than here - I wouldn't know how to reproduce it without Keras.


Interestingly, training doesn't freeze in any of the following cases:

A) - The input to BatchNormalization has two or more non-batch dimensions, e.g.:

inputs = Input(shape=[input_dim])
x = Dense(10)(inputs)
x = Lambda(lambda x: K.expand_dims(x, -1))(x)  # changing shape to (None, input_dim, 1)
x = BatchNormalization(axis=1)(x)
x = Flatten()(x)
x = Dense(1)(x)

B) - roc_auc_score is used within a custom callback instead of a custom metric. See my gist for an implementation of such a callback.
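For reference, a minimal sketch of what such a callback can look like (hypothetical code, not the gist's exact implementation; in Keras it would subclass keras.callbacks.Callback, which is omitted here to keep the sketch self-contained):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

class RocAucCallback:
    """Hypothetical sketch of an end-of-epoch ROC-AUC callback.
    In Keras this would subclass keras.callbacks.Callback."""

    def __init__(self, validation_data):
        self.x_val, self.y_val = validation_data

    def on_epoch_end(self, epoch, model):
        # model.predict runs once per epoch, outside the compiled
        # training graph, so no tf.py_function wrapper is needed
        y_pred = model.predict(self.x_val)
        score = roc_auc_score(self.y_val, y_pred)
        print(f'epoch {epoch}: val roc-auc = {score:.4f}')
        return score
```

Because the score is computed outside the training graph, this variant avoids whatever interaction between tf.py_function and BatchNormalization causes the freeze.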

C) - TensorFlow's native AUC metric (tf.metrics.auc) is used instead; however, it is apparently less accurate - see: tensorflow/tensorflow#14834 (comment)

def auc(y_true, y_pred):
    auc = tf.metrics.auc(y_true, y_pred)[1]
    K.get_session().run(tf.local_variables_initializer())
    return auc
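On why C) is less accurate: tf.metrics.auc approximates the ROC curve at a fixed number of thresholds (num_thresholds=200 by default), whereas sklearn's roc_auc_score computes the area exactly from the ranking. A rough numpy illustration of that approximation idea (approx_auc below is a hypothetical sketch of threshold-based approximation, not TensorFlow's actual implementation):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def approx_auc(y_true, y_score, num_thresholds=5):
    """ROC AUC via a fixed grid of thresholds and the trapezoidal rule.
    A sketch of the approximation idea only, NOT TensorFlow's code."""
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    # true/false positive rates at each threshold, ordered so fpr ascends
    tpr = np.array([np.mean(y_score[y_true == 1] >= t) for t in thresholds])[::-1]
    fpr = np.array([np.mean(y_score[y_true == 0] >= t) for t in thresholds])[::-1]
    # trapezoidal integration of tpr over fpr
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

rng = np.random.RandomState(0)
y_true = rng.randint(0, 2, 1000)
y_score = y_true * 0.3 + rng.rand(1000) * 0.7  # noisy but informative scores

exact = roc_auc_score(y_true, y_score)
coarse = approx_auc(y_true, y_score, num_thresholds=5)
fine = approx_auc(y_true, y_score, num_thresholds=200)
# with few thresholds the approximation drifts from the exact value;
# with many thresholds it converges toward it
```

This is presumably what the linked TensorFlow comment refers to: with a coarse threshold grid the reported AUC can differ noticeably from the exact rank-based value.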

Describe the expected behavior
Training should complete without freezing.

Code to reproduce the issue
A minimal example; for the full code see my gist:

from keras import Input, Model
from keras.layers import Dense, BatchNormalization
import numpy as np
from sklearn.metrics import roc_auc_score
import tensorflow as tf

# generate random test data
m = 10000
f = 600

X_train = np.random.randn(m, f)
X_test = np.random.randn(m, f)
y_train = np.random.randint(0, 2, (m, 1))
y_test = np.random.randint(0, 2, (m, 1))

input_dim = X_train.shape[1]


def auc(y_true, y_pred):
    # Using sklearn.metrics.roc_auc_score here triggers the bug
    return tf.py_function(roc_auc_score, (y_true, y_pred), tf.double)

# Define the model
inputs = Input(shape=[input_dim])
x = inputs
x = Dense(10)(x)
x = BatchNormalization(axis=1)(x)
x = Dense(1)(x)

model = Model(
    inputs=[inputs],
    outputs=[x],
    name='model')

# Note: the bug persists with SGD optimizer, as well as MSE loss. It disappears if 'auc' is removed from metrics.
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy', auc])
model.summary()


model.fit(x=X_train, y=y_train, batch_size=1024, epochs=50, validation_data=(X_test, y_test))

Other info / logs
Example of a log during a freeze; the last line just hangs and nothing follows.

Using TensorFlow backend.
WARNING:tensorflow:From C:\programs\Python36\lib\site-packages\tensorflow\python\framework\op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 600)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 10)                6010      
_________________________________________________________________
batch_normalization_1 (Batch (None, 10)                40        
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 11        
=================================================================
Total params: 6,061
Trainable params: 6,041
Non-trainable params: 20
_________________________________________________________________
WARNING:tensorflow:From C:\programs\Python36\lib\site-packages\tensorflow\python\ops\math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
Train on 10000 samples, validate on 10000 samples
Epoch 1/50
2019-06-13 18:16:08.030801: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2019-06-13 18:16:09.255081: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: GeForce RTX 2070 with Max-Q Design major: 7 minor: 5 memoryClockRate(GHz): 1.185
pciBusID: 0000:01:00.0
totalMemory: 8.00GiB freeMemory: 6.59GiB
2019-06-13 18:16:09.255409: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-06-13 18:16:09.785650: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-06-13 18:16:09.785829: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-06-13 18:16:09.785932: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-06-13 18:16:09.786147: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6317 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070 with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 7.5)
2019-06-13 18:16:10.358392: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library cublas64_100.dll locally
2019-06-13 18:16:10.675287: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-06-13 18:16:10.675475: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-06-13 18:16:10.675634: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-06-13 18:16:10.675738: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-06-13 18:16:10.675918: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6317 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070 with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 7.5)

 1024/10000 [==>...........................] - ETA: 23s - loss: 6.4152 - acc: 0.2373 - auc: 0.5061
 6144/10000 [=================>............] - ETA: 1s - loss: 6.2482 - acc: 0.2371 - auc: 0.4938 
10000/10000 [==============================] - 3s 289us/step - loss: 6.2626 - acc: 0.2392 - auc: 0.4949 - val_loss: 6.2801 - val_acc: 0.2398 - val_auc: 0.4994
In a callback: 0.4957 - roc-auc_val: 0.4994                                                                                                    
Epoch 2/50

 1024/10000 [==>...........................] - ETA: 0s - loss: 6.3064 - acc: 0.2334 - auc: 0.4998
 6144/10000 [=================>............] - ETA: 0s - loss: 6.3103 - acc: 0.2419 - auc: 0.4951
10000/10000 [==============================] - 0s 20us/step - loss: 6.2647 - acc: 0.2411 - auc: 0.4959 - val_loss: 6.2949 - val_acc: 0.2366 - val_auc: 0.4994
In a callback: 0.4965 - roc-auc_val: 0.4993                                                                                                    
Epoch 3/50

 1024/10000 [==>...........................] - ETA: 0s - loss: 5.8954 - acc: 0.2490 - auc: 0.5123