
metrics fmeasure and matthews_correlation don't work batchwise #4592

Closed
mhubrich opened this issue Dec 4, 2016 · 5 comments


mhubrich commented Dec 4, 2016

Hello,

In my opinion, the metrics fmeasure, matthews_correlation, precision, and recall don't work when computed batch-wise. In general, this holds for all metrics that incorporate true/false positives/negatives.

Here is a small and easy counterexample:
Let's assume we have just 4 samples: two negatives and two positives. Also, our batch size is 2:

Batch   Label   Prediction
  1       0         0
  1       0         1
  2       1         0
  2       1         1

Now we want to calculate recall (a.k.a. the TP rate) in a batch-wise manner. For the first batch, the TP rate is 0 (since there are no true positives). For the second batch, the TP rate is 0.5. Finally, we take the mean over all batches and end up with recall = TP rate = mean(0, 0.5) = 0.25.

But as we can easily see, the correct recall over the entire dataset is 0.5. The problem with the batch-wise calculation is that the first batch, which contains no positives at all, is wrongly incorporated into the average.
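
A minimal plain-NumPy sketch (added here for illustration, not part of the original report) that reproduces these numbers:

import numpy as np

labels      = np.array([0, 0, 1, 1])   # batch 1: two negatives, batch 2: two positives
predictions = np.array([0, 1, 0, 1])
batch_size  = 2

def recall(y_true, y_pred):
    # TP rate: true positives / all positives (0 if there are no positives)
    positives = np.sum(y_true == 1)
    true_positives = np.sum((y_true == 1) & (y_pred == 1))
    return true_positives / positives if positives else 0.0

batch_recalls = [recall(labels[i:i + batch_size], predictions[i:i + batch_size])
                 for i in range(0, len(labels), batch_size)]

print(np.mean(batch_recalls))       # batch-wise average: 0.25
print(recall(labels, predictions))  # recall over the whole dataset: 0.5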

@sietschie

I think I stumbled upon the same problem while experimenting with an unbalanced dataset. Here is an example I wrote to isolate the problem:

from keras.models import Sequential
from keras.layers import Dense
import numpy as np

np.random.seed(1380)

# Highly imbalanced data: ~1% ones, ~99% zeros; the target is the input itself,
# so a perfect classifier only has to learn the identity mapping.
input_data = np.random.choice(2, 10000, p=[0.99, 0.01])

model = Sequential()
model.add(Dense(1, input_dim=1, init='uniform', activation='sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy', 'precision', 'recall'])

model.fit(input_data, input_data, nb_epoch=2, batch_size=2,
          class_weight={0: 0.01, 1: 0.99})

Output:

Epoch 1/2
10000/10000 [==============================] - 12s - loss: 0.0130 - acc: 0.7067 - precision: 0.0187 - recall: 0.0218
Epoch 2/2
10000/10000 [==============================] - 11s - loss: 0.0100 - acc: 1.0000 - precision: 0.0230 - recall: 0.0230

In this simple example the classification is always correct, yet the precision and recall are almost 0. They go up when the batch size is increased, and roughly follow the probability of getting a random batch-sized subset of the data in which at least one sample is 1 (in this example 1 - 0.99**2 ≈ 0.02).

That is why I think that this is the same issue the OP described.
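
For reference, a quick numeric check of that probability argument (my own back-of-the-envelope calculation, not from the thread): with a fraction p of positive samples, a random batch of size b contains at least one positive with probability 1 - (1 - p)**b, which roughly caps the batch-averaged recall.

p = 0.01                   # fraction of positive samples in the example above
for b in (2, 32, 256):     # a few batch sizes
    print(b, 1 - (1 - p) ** b)
# 2   -> ~0.0199
# 32  -> ~0.275
# 256 -> ~0.924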


laxatives commented Feb 13, 2017

I'm also using a binary classifier and seeing some suspicious metrics reported: every epoch reports exactly the same value for accuracy, precision, and recall, over hundreds of epochs. This occurs in training, validation, and evaluation.

model.compile(optimizer='RMSprop',
              loss='binary_crossentropy',
              metrics=['accuracy', 'precision', 'recall'])

4704/4789 [============================>.] - ETA: 2s - loss: 0.5148 - acc: 0.8567 - precision: 0.8567 - recall: 0.8567
4736/4789 [============================>.] - ETA: 1s - loss: 0.5149 - acc: 0.8566 - precision: 0.8566 - recall: 0.8566
4768/4789 [============================>.] - ETA: 0s - loss: 0.5152 - acc: 0.8565 - precision: 0.8565 - recall: 0.8565
4789/4789 [==============================] - 166s - loss: 0.5153 - acc: 0.8563 - precision: 0.8563 - recall: 0.8563 - val_loss: 0.6714 - val_acc: 0.7749 - val_precision: 0.7749 - val_recall: 0.7749
Epoch 4/100
  32/4789 [..............................] - ETA: 162s - loss: 0.4381 - acc: 0.8750 - precision: 0.8750 - recall: 0.8750
  64/4789 [..............................] - ETA: 167s - loss: 0.5680 - acc: 0.8125 - precision: 0.8125 - recall: 0.8125
  96/4789 [..............................] - ETA: 164s - loss: 0.5151 - acc: 0.8438 - precision: 0.8438 - recall: 0.8438
 128/4789 [..............................] - ETA: 161s - loss: 0.4999 - acc: 0.8516 - precision: 0.8516 - recall: 0.8516

@isaacgerg

@laxatives I have this same problem with a binary classifier with a balanced dataset. I use a generator and have verified that it gives balanced data each call to next().

@DepthFirstSearch @sietschie Have you figured out the issue?

@mhubrich (Author)

@laxatives @isaacgerg I don't think you have the same issue as @sietschie or me.

As I stated in my initial post, I don't think it's possible to compute fmeasure, precision, etc. in a batch-wise manner (see my example). That's the whole issue here.
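
(A common workaround, sketched here as an aside and not something proposed in this thread, is to skip the batch-wise metrics and compute precision/recall/F1 over the full validation set once per epoch, e.g. with a Keras callback and scikit-learn. The names GlobalMetrics, x_val, and y_val below are placeholders.)

from keras.callbacks import Callback
from sklearn.metrics import precision_score, recall_score, f1_score

class GlobalMetrics(Callback):
    # Computes precision/recall/F1 over the whole validation set at each epoch end.
    def __init__(self, x_val, y_val):
        super(GlobalMetrics, self).__init__()
        self.x_val, self.y_val = x_val, y_val

    def on_epoch_end(self, epoch, logs=None):
        y_pred = (self.model.predict(self.x_val) > 0.5).astype(int).ravel()
        print('precision %.3f  recall %.3f  f1 %.3f' % (
            precision_score(self.y_val, y_pred),
            recall_score(self.y_val, y_pred),
            f1_score(self.y_val, y_pred)))

# model.fit(x_train, y_train, callbacks=[GlobalMetrics(x_val, y_val)])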

@isaacgerg

@DepthFirstSearch During training these metrics are meant to be computed for each batch, not over all batches. In any case, for a binary classifier, the metrics are still being reported incorrectly even on a per-batch basis. See #5400.
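
For context, the batch-wise precision metric in Keras 1.x looked roughly like the following (reconstructed from memory, so treat it as approximate); it only ever sees the current batch, which is exactly the limitation discussed above:

from keras import backend as K

def precision(y_true, y_pred):
    # Rounds predictions to 0/1 and computes TP / predicted positives
    # using only the samples of the current batch.
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    return true_positives / (predicted_positives + K.epsilon())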
