System information
Describe the current behavior
When using the sklearn roc_auc_score as a custom metric in the following form (the same auc function used in the reproduction code below):
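def auc(y_true, y_pred):
    # sklearn's roc_auc_score wrapped as a Keras metric via tf.py_function
    return tf.py_function(roc_auc_score, (y_true, y_pred), tf.double)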
along with a BatchNormalization layer used on a 1-D input (i.e. shape (None, features)), the training freezes without throwing an error after a few batches. I couldn't find any pattern; it always stops at a different point, sometimes just after the start, sometimes closer to the end.
Tested on sklearn versions: 0.20.3 and 0.21.2.
Sorry if this is a non-issue, or if it belongs with sklearn rather than here - I wouldn't know how to reproduce it without Keras.
Interestingly, training doesn't freeze in any of the following cases:
A) The input to BatchNormalization has 2 dimensions or more (the buggy case above uses a 1-D input of shape (None, features)).
B) roc_auc_score is used within a custom callback instead of a custom metric. See my gist for the implementation of such a callback; a minimal sketch is shown after this list.
C) TensorFlow's native roc_auc metric is used instead, although it is apparently less accurate - see tensorflow/tensorflow#14834 (comment).
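For reference, here is a minimal sketch of what such a roc-auc callback could look like. This is only an illustration, not the exact implementation from the gist; the class name RocAucCallback is made up, and it simply computes roc_auc_score once per epoch instead of as a per-batch metric.

from keras.callbacks import Callback
from sklearn.metrics import roc_auc_score

class RocAucCallback(Callback):
    # Computes roc_auc_score on the full train/validation sets at the end of each
    # epoch, avoiding the per-batch tf.py_function metric entirely.
    def __init__(self, training_data, validation_data):
        super().__init__()
        self.x, self.y = training_data
        self.x_val, self.y_val = validation_data

    def on_epoch_end(self, epoch, logs=None):
        roc = roc_auc_score(self.y, self.model.predict(self.x))
        roc_val = roc_auc_score(self.y_val, self.model.predict(self.x_val))
        print('In a callback: %.4f - roc-auc_val: %.4f' % (roc, roc_val))

# Usage: model.fit(..., callbacks=[RocAucCallback((X_train, y_train), (X_test, y_test))])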
Describe the expected behavior
The training should complete without freezing.
Code to reproduce the issue
A minimal example (for the full code, see my gist):
from keras import Input, Model
from keras.callbacks import Callback
from keras.layers import Dense, BatchNormalization
import numpy as np
from sklearn.metrics import roc_auc_score
import tensorflow as tf

# generate random test data
m = 10000
f = 600
X_train = np.random.randn(m, f)
X_test = np.random.randn(m, f)
y_train = np.random.randint(0, 2, (m, 1))
y_test = np.random.randint(0, 2, (m, 1))
input_dim = X_train.shape[1]

def auc(y_true, y_pred):
    # Using sklearn.metrics.roc_auc_score produces the bug
    return tf.py_function(roc_auc_score, (y_true, y_pred), tf.double)

# Define the model
inputs = Input(shape=[input_dim])
x = inputs
x = Dense(10)(x)
x = BatchNormalization(axis=1)(x)
x = Dense(1)(x)
model = Model(
    inputs=[inputs],
    outputs=[x],
    name='model')

# Note: the bug persists with the SGD optimizer as well as with MSE loss.
# It disappears if 'auc' is removed from metrics.
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy', auc])
model.summary()
model.fit(x=X_train, y=y_train, batch_size=1024, epochs=50, validation_data=(X_test, y_test))
Other info / logs
Example of a log from a run that froze; the last line shown is where it hangs, nothing follows.
Using TensorFlow backend.
WARNING:tensorflow:From C:\programs\Python36\lib\site-packages\tensorflow\python\framework\op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 600) 0
_________________________________________________________________
dense_1 (Dense) (None, 10) 6010
_________________________________________________________________
batch_normalization_1 (Batch (None, 10) 40
_________________________________________________________________
dense_2 (Dense) (None, 1) 11
=================================================================
Total params: 6,061
Trainable params: 6,041
Non-trainable params: 20
_________________________________________________________________
WARNING:tensorflow:From C:\programs\Python36\lib\site-packages\tensorflow\python\ops\math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
Train on 10000 samples, validate on 10000 samples
Epoch 1/50
2019-06-13 18:16:08.030801: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2019-06-13 18:16:09.255081: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce RTX 2070 with Max-Q Design major: 7 minor: 5 memoryClockRate(GHz): 1.185
pciBusID: 0000:01:00.0
totalMemory: 8.00GiB freeMemory: 6.59GiB
2019-06-13 18:16:09.255409: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-06-13 18:16:09.785650: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-06-13 18:16:09.785829: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-06-13 18:16:09.785932: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-06-13 18:16:09.786147: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6317 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070 with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 7.5)
2019-06-13 18:16:10.358392: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library cublas64_100.dll locally
2019-06-13 18:16:10.675287: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-06-13 18:16:10.675475: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-06-13 18:16:10.675634: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-06-13 18:16:10.675738: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-06-13 18:16:10.675918: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6317 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070 with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 7.5)
1024/10000 [==>...........................] - ETA: 23s - loss: 6.4152 - acc: 0.2373 - auc: 0.5061
6144/10000 [=================>............] - ETA: 1s - loss: 6.2482 - acc: 0.2371 - auc: 0.4938
10000/10000 [==============================] - 3s 289us/step - loss: 6.2626 - acc: 0.2392 - auc: 0.4949 - val_loss: 6.2801 - val_acc: 0.2398 - val_auc: 0.4994
In a callback: 0.4957 - roc-auc_val: 0.4994
Epoch 2/50
1024/10000 [==>...........................] - ETA: 0s - loss: 6.3064 - acc: 0.2334 - auc: 0.4998
6144/10000 [=================>............] - ETA: 0s - loss: 6.3103 - acc: 0.2419 - auc: 0.4951
10000/10000 [==============================] - 0s 20us/step - loss: 6.2647 - acc: 0.2411 - auc: 0.4959 - val_loss: 6.2949 - val_acc: 0.2366 - val_auc: 0.4994
In a callback: 0.4965 - roc-auc_val: 0.4993
Epoch 3/50
1024/10000 [==>...........................] - ETA: 0s - loss: 5.8954 - acc: 0.2490 - auc: 0.5123