sklearn.roc_auc_score custom metric with Batch Normalization freezes during training #12954

Closed
Voyz opened this issue Jun 13, 2019 · 0 comments
Voyz commented Jun 13, 2019

System information

  • Have I written custom code (as opposed to using a stock example script): Yes, https://gist.github.com/Voyz/3dded010911f6d2105020222a4e86cd0
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10 Pro v1803 / 17134.648
  • TensorFlow backend (yes / no): yes
  • TensorFlow version: 1.13.1
  • Keras version: 2.2.4
  • Python version: Python 3.6.7
  • CUDA/cuDNN version: 10.0, V10.0.130 / 7.5.0
  • GPU model and memory: NVIDIA GeForce RTX 2070 with Max-Q Design, 8 GB

Describe the current behavior
When using sklearn's roc_auc_score as a custom metric in the following form:

def auc(y_true, y_pred):
    return tf.py_function(roc_auc_score, (y_true, y_pred), tf.double)
...
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy', auc])

Along with a BatchNormalization layer applied to a 1D input (i.e. shape (None, features)), e.g.:

inputs = Input(shape=[input_dim])
x = Dense(10)(inputs)
x = BatchNormalization(axis=1)(x)
x = Dense(1)(x)

the training freezes after a few batches without throwing an error. I couldn't spot any pattern: it always stops at a different moment, sometimes just after the start, sometimes closer to the end.

Tested on sklearn versions: 0.20.3 and 0.21.2.

Apologies if this is a non-issue, or if it belongs with sklearn rather than here - I wouldn't know how to reproduce it without Keras.


Interestingly, training doesn't freeze in any of the following cases:

A) - The input to BatchNormalization has two or more non-batch dimensions, e.g.:

inputs = Input(shape=[input_dim])
x = Dense(10)(inputs)
x = Lambda(lambda x: K.expand_dims(x, -1))(x)  # changing shape to (None, input_dim, 1)
x = BatchNormalization(axis=1)(x)
x = Flatten()(x)
x = Dense(1)(x)

B) - roc_auc_score is used within a custom callback instead of a custom metric. See my gist for an implementation of such a callback.
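For reference, a minimal sketch of what such a callback can look like (hypothetical code, not the gist's exact implementation; in Keras it would subclass keras.callbacks.Callback, which is omitted here to keep the sketch self-contained):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

class RocAucCallback:
    """Hypothetical sketch of an end-of-epoch ROC-AUC callback.
    In Keras this would subclass keras.callbacks.Callback."""

    def __init__(self, validation_data):
        self.x_val, self.y_val = validation_data

    def on_epoch_end(self, epoch, model):
        # model.predict runs once per epoch, outside the compiled
        # training graph, so no tf.py_function wrapper is needed
        y_pred = model.predict(self.x_val)
        score = roc_auc_score(self.y_val, y_pred)
        print(f'epoch {epoch}: val roc-auc = {score:.4f}')
        return score
```

Because the score is computed outside the training graph, this variant avoids whatever interaction between tf.py_function and BatchNormalization causes the freeze.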

C) - TensorFlow's native AUC metric (tf.metrics.auc) is used instead; however, it is apparently less accurate - see: tensorflow/tensorflow#14834 (comment)

def auc(y_true, y_pred):
    auc = tf.metrics.auc(y_true, y_pred)[1]
    K.get_session().run(tf.local_variables_initializer())
    return auc
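On why C) is less accurate: tf.metrics.auc approximates the ROC curve at a fixed number of thresholds (num_thresholds=200 by default), whereas sklearn's roc_auc_score computes the area exactly from the ranking. A rough numpy illustration of that approximation idea (approx_auc below is a hypothetical sketch of threshold-based approximation, not TensorFlow's actual implementation):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def approx_auc(y_true, y_score, num_thresholds=5):
    """ROC AUC via a fixed grid of thresholds and the trapezoidal rule.
    A sketch of the approximation idea only, NOT TensorFlow's code."""
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    # true/false positive rates at each threshold, ordered so fpr ascends
    tpr = np.array([np.mean(y_score[y_true == 1] >= t) for t in thresholds])[::-1]
    fpr = np.array([np.mean(y_score[y_true == 0] >= t) for t in thresholds])[::-1]
    # trapezoidal integration of tpr over fpr
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

rng = np.random.RandomState(0)
y_true = rng.randint(0, 2, 1000)
y_score = y_true * 0.3 + rng.rand(1000) * 0.7  # noisy but informative scores

exact = roc_auc_score(y_true, y_score)
coarse = approx_auc(y_true, y_score, num_thresholds=5)
fine = approx_auc(y_true, y_score, num_thresholds=200)
# with few thresholds the approximation drifts from the exact value;
# with many thresholds it converges toward it
```

This is presumably what the linked TensorFlow comment refers to: with a coarse threshold grid the reported AUC can differ noticeably from the exact rank-based value.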

Describe the expected behavior
Training should complete without freezing.

Code to reproduce the issue
A minimal example; for the full code see my gist:

from keras import Input, Model
from keras.layers import Dense, BatchNormalization
import numpy as np
from sklearn.metrics import roc_auc_score
import tensorflow as tf

# generate random test data
m = 10000
f = 600

X_train = np.random.randn(m, f)
X_test = np.random.randn(m, f)
y_train = np.random.randint(0, 2, (m, 1))
y_test = np.random.randint(0, 2, (m, 1))

input_dim = X_train.shape[1]


def auc(y_true, y_pred):
    # Using sklearn.metrics.roc_auc_score here triggers the bug
    return tf.py_function(roc_auc_score, (y_true, y_pred), tf.double)

# Define the model
inputs = Input(shape=[input_dim])
x = inputs
x = Dense(10)(x)
x = BatchNormalization(axis=1)(x)
x = Dense(1)(x)

model = Model(
    inputs=[inputs],
    outputs=[x],
    name='model')

# Note: the bug persists with SGD optimizer, as well as MSE loss. It disappears if 'auc' is removed from metrics.
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy', auc])
model.summary()


model.fit(x=X_train, y=y_train, batch_size=1024, epochs=50, validation_data=(X_test, y_test))

Other info / logs
Example of a log during a freeze; the last line just hangs and nothing follows.

Using TensorFlow backend.
WARNING:tensorflow:From C:\programs\Python36\lib\site-packages\tensorflow\python\framework\op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 600)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 10)                6010      
_________________________________________________________________
batch_normalization_1 (Batch (None, 10)                40        
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 11        
=================================================================
Total params: 6,061
Trainable params: 6,041
Non-trainable params: 20
_________________________________________________________________
WARNING:tensorflow:From C:\programs\Python36\lib\site-packages\tensorflow\python\ops\math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
Train on 10000 samples, validate on 10000 samples
Epoch 1/50
2019-06-13 18:16:08.030801: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2019-06-13 18:16:09.255081: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: GeForce RTX 2070 with Max-Q Design major: 7 minor: 5 memoryClockRate(GHz): 1.185
pciBusID: 0000:01:00.0
totalMemory: 8.00GiB freeMemory: 6.59GiB
2019-06-13 18:16:09.255409: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-06-13 18:16:09.785650: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-06-13 18:16:09.785829: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-06-13 18:16:09.785932: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-06-13 18:16:09.786147: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6317 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070 with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 7.5)
2019-06-13 18:16:10.358392: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library cublas64_100.dll locally
2019-06-13 18:16:10.675287: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-06-13 18:16:10.675475: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-06-13 18:16:10.675634: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-06-13 18:16:10.675738: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-06-13 18:16:10.675918: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6317 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070 with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 7.5)

 1024/10000 [==>...........................] - ETA: 23s - loss: 6.4152 - acc: 0.2373 - auc: 0.5061
 6144/10000 [=================>............] - ETA: 1s - loss: 6.2482 - acc: 0.2371 - auc: 0.4938 
10000/10000 [==============================] - 3s 289us/step - loss: 6.2626 - acc: 0.2392 - auc: 0.4949 - val_loss: 6.2801 - val_acc: 0.2398 - val_auc: 0.4994
In a callback: 0.4957 - roc-auc_val: 0.4994                                                                                                    
Epoch 2/50

 1024/10000 [==>...........................] - ETA: 0s - loss: 6.3064 - acc: 0.2334 - auc: 0.4998
 6144/10000 [=================>............] - ETA: 0s - loss: 6.3103 - acc: 0.2419 - auc: 0.4951
10000/10000 [==============================] - 0s 20us/step - loss: 6.2647 - acc: 0.2411 - auc: 0.4959 - val_loss: 6.2949 - val_acc: 0.2366 - val_auc: 0.4994
In a callback: 0.4965 - roc-auc_val: 0.4993                                                                                                    
Epoch 3/50

 1024/10000 [==>...........................] - ETA: 0s - loss: 5.8954 - acc: 0.2490 - auc: 0.5123