
Writing a metric is not easy. #4506

MoyanZitto opened this issue Nov 25, 2016 · 27 comments

@MoyanZitto
Contributor

commented Nov 25, 2016

Recently a friend asked me to write a MAP (mean average precision) metric. A sorting operation is required in this metric.

At first, I planned to use K.get_value() to get the values of y_pred and y_true, then use the sorting methods provided by numpy or Python to compute the metric value, like this:

def MAP(y_true, y_pred):

    np_y_true = K.get_value(y_true)
    np_y_pred = K.get_value(y_pred)

    zipped = zip(np_y_true, np_y_pred)
    zipped.sort(key=lambda x:x[1],reverse=True)

    np_y_true, np_y_pred = zip(*zipped)
    k_list = [i for i in range(len(np_y_true)) if int(np_y_true[i])==1]
    score = 0.
    r = np.sum(np_y_true).astype(np.int64)
    for k in k_list:
        Yk = np.sum(np_y_true[:k+1])
        score += Yk/(k+1)

    score/=r

    return K.variable(score)

However, this function doesn't work, because a metric is part of the computation graph, which means it must be a pure "tensor operation". I cannot get the values of y_true and y_pred because they are "empty" at this point: only when real data is fed into the input tensors of the whole computation graph can we obtain the values of these tensors.

I think this is not a good fit for metric functions, because a metric is not really part of the model; it is only there for evaluation, not for producing errors, gradients, or anything else that matters to the training process.

There is a wide variety of metrics, and it is not an easy job for users to define them in a pure tensor language. Perhaps we should rethink the implementation of metrics: decouple the metric from the computation graph and make it possible to use numpy/Python for such tasks.

What do you think? @fchollet

BTW, in the end I wrote a callback and call model.predict inside it to compute MAP. It works well, but this approach only works for a fixed validation set and cannot compute the metric on the training data. If more information were exposed to Callback, this could be a viable solution. Here is the code:

class MAP_eval(Callback):
    def __init__(self, validation_data):
        self.validation_data = validation_data
        self.maps = []

    def eval_map(self):
        x_val, y_true = self.validation_data
        y_pred = self.model.predict(x_val)
        y_pred = list(np.squeeze(y_pred))
        zipped = zip(y_true, y_pred)
        zipped.sort(key=lambda x:x[1],reverse=True)

        y_true, y_pred = zip(*zipped)
        k_list = [i for i in range(len(y_true)) if int(y_true[i])==1]
        score = 0.
        r = np.sum(y_true).astype(np.int64)
        for k in k_list:
            Yk = np.sum(y_true[:k+1])
            score += Yk/(k+1)
        score/=r
        return score

    def on_epoch_end(self, epoch, logs={}):
        score = self.eval_map()
        print "MAP for epoch %d is %f"%(epoch, score)
        self.maps.append(score)
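
For completeness, a minimal sketch of how this callback could be attached (model, x_train/y_train, and x_val/y_val are assumed placeholders, not code from this issue):

```python
# Minimal usage sketch; x_val / y_val are assumed numpy validation arrays.
map_cb = MAP_eval(validation_data=(x_val, y_val))
model.fit(x_train, y_train, nb_epoch=10, batch_size=32, callbacks=[map_cb])
```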
@cbaziotis


commented Nov 25, 2016

I wanted to do a similar thing, and I decided to explicitly pass the training and validation sets to my metrics callback. I don't like it, but it works for me.

MetricsCallback.py

class MetricsCallback(Callback):
    def __init__(self, metrics, validation_data, test_data):
        super().__init__()
        self.validation_data = validation_data
        self.test_data = test_data
        self.metrics = metrics
        ...

and then do

metrics_callback = MetricsCallback(validation_data=(X_val, y_val),
                                   test_data=(X_test, y_test),
                                   metrics=["macro_f1", "macro_recall"])

nn_model.fit(X_train, y_train, validation_data=(X_val, y_val),
             nb_epoch=150, batch_size=128,
             callbacks=[metrics_callback])
@MoyanZitto

Contributor Author

commented Nov 26, 2016

@cbaziotis
Yes, it does work. But you can't access the training data after each batch or epoch, can you? If this information were provided to callbacks, we could rewrite the metrics module and implement metrics as Callbacks. I think that's a possible solution.

Anyway, my point is that metrics should not be part of the computation graph, so the current implementation is inappropriate.

@cbaziotis


commented Nov 26, 2016

@MoyanZitto

Yes it does work. But you can't access the training data after each batch or epoch, can you?

Of course you can. I showed you how above. You pass them to your Callback. Here is an example where you explicitly pass your train_data and validation_data:

MetricsCallback

class MetricsCallback(Callback):
    def __init__(self, train_data, validation_data):
        super().__init__()
        self.validation_data = validation_data
        self.train_data = train_data

    def on_epoch_end(self, epoch, logs={}):

        X_train = self.train_data[0]
        y_train = self.train_data[1]

        X_val = self.validation_data[0]
        y_val = self.validation_data[1]

        # do whatever you want next


**Model**

```python
metrics_callback = MetricsCallback(train_data=(X_train, y_train),
                                   validation_data=(X_val, y_val))

nn_model.fit(X_train, y_train, validation_data=(X_val, y_val),
             nb_epoch=150, batch_size=128,
             callbacks=[metrics_callback])
```

But I agree that this is not so nice. I did it like that because I want to use my own metrics, and @fchollet said in a comment (#3230 (comment)) that you should use a callback for that.

What I don't understand is why we don't use scikit-learn's metrics in the first place. What I would really like is a way to pass a scorer created with make_scorer to the metrics argument of compile(), like this:

scorer = make_scorer(f1_score, labels=['positive', 'negative'], average='macro')

model.compile(loss='categorical_crossentropy', optimizer="adam", metrics=[scorer])
@MoyanZitto

Contributor Author

commented Nov 26, 2016

@cbaziotis
Well, if you are using fit_generator or validation_split, it doesn't work.
I mean, the callback should be able to access the training samples of each forward pass, not have them passed in by hand. To do this, the code in _fit_loop and the Callback class would have to be modified accordingly, and the metrics part of .compile could be removed.

If @fchollet agrees to adjust the structure, I can help write some PRs.

Using scikit-learn metrics is a good idea; there are also lots of other modules that could be used in Keras, K-fold cross-validation for example. Although it would add another dependency to Keras, I think it is worth it. However, @fchollet is the boss, so it's up to him.

@lebavarois

Contributor

commented Nov 29, 2016

@MoyanZitto
I'm confused about the MAP metric you propose. Your implementation looks like the formula for AP@all. So isn't it AP (average precision) instead of MAP (mean average precision)?

On the other hand, I'm confused about why the result does not match sklearn.metrics.average_precision_score:

import numpy as np
from sklearn.metrics import average_precision_score
y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0.1, 0.4, 0.35, 0.8])

def MAP(np_y_true, np_y_pred):

    zipped = zip(np_y_true, np_y_pred)
    zipped.sort(key=lambda x:x[1],reverse=True)

    np_y_true, np_y_pred = zip(*zipped)
    k_list = [i for i in range(len(np_y_true)) if int(np_y_true[i])==1]
    score = 0.
    r = np.sum(np_y_true).astype(np.int64)
    for k in k_list:
        Yk = np.sum(np_y_true[:k+1])
        score += Yk/(k+1)

    score/=r
    return score

print MAP(y_true, y_pred)
#---> 0.5
print average_precision_score(y_true, y_pred)
#---> 0.791666666667
@MoyanZitto

Contributor Author

commented Nov 30, 2016

@lebavarois
This code was written based on the algorithm given by a friend; the documentation said its name is "MAP".

But yes, it looks more like AP@all. Here is the algorithm (a small numpy sketch follows below):

  1. Rank the predicted probabilities from high to low.
  2. If the number of true positive samples in the top k predictions is Y_k, then define P@k as:
    P@k = Y_k / k
  3. Assume the indices of the positive samples are k1, k2, ..., kr, where r is the total number of positive samples. Then MAP is defined as:
    MAP = sum(P@k) / r
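
As a small numpy sketch of the algorithm above (a hypothetical helper, not the exact code from my earlier post):

```python
import numpy as np

def average_precision(y_true, y_score):
    # Sort the true labels by predicted score, highest score first.
    order = np.argsort(y_score)[::-1]
    y_sorted = np.asarray(y_true, dtype=float)[order]
    # P@k = (number of positives in the top k) / k, evaluated at each positive position.
    cum_hits = np.cumsum(y_sorted)
    ranks = np.arange(1, len(y_sorted) + 1)
    precisions = cum_hits[y_sorted == 1] / ranks[y_sorted == 1]
    # Average over the r positive positions.
    return precisions.mean()

# average_precision([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # -> 0.8333...
```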

I hope I didn't get the code wrong...

@lebavarois

Contributor

commented Nov 30, 2016

@MoyanZitto
I found out why the values did not match in my previous post.

  1. I got the wrong value in Python 2; in Python 3 it is correct.
    score += Yk/(k+1) should be score += Yk/float(k+1) in Python 2.

  2. There seems to be a problem in the latest sklearn version. With the code of this pull request, I get the same value as in your code.

import numpy as np
from sklearn.metrics import average_precision_score

y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0.1, 0.4, 0.35, 0.8])

def AP(np_y_true, np_y_pred):

    zipped = zip(np_y_true, np_y_pred)
    zipped.sort(key=lambda x:x[1],reverse=True)

    np_y_true, np_y_pred = zip(*zipped)
    k_list = [i for i in range(len(np_y_true)) if int(np_y_true[i])==1]
    score = 0.
    r = np.sum(np_y_true).astype(np.int64)
    for k in k_list:
        Yk = np.sum(np_y_true[:k+1])
        score += Yk/float(k+1)

    score/=r
    return score


print AP(y_true, y_pred)
#---> 0.8333
print average_precision_score(y_true, y_pred)
#---> 0.8333

Still, I think it is AP and not MAP. If you have a multi-label classification problem (e.g. one image with multiple class labels), AP gives you the evaluation for just one test datapoint (e.g. one test image). If you then have multiple test datapoints (e.g. multiple images), you can compute the mean over the whole test set, which is the MAP. So in this case y_true and y_pred should be 2-dimensional (multiple outputs), the AP function should be applied along the second dimension, and finally the mean is taken over the first dimension (which corresponds to the datapoints, e.g. multiple images).
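
For example, a sketch of that per-datapoint averaging (hypothetical helper name, reusing sklearn's AP per row):

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(Y_true, Y_score):
    # Rows are datapoints (e.g. images), columns are labels.
    # AP is computed per row, then averaged over the rows to give MAP.
    return np.mean([average_precision_score(t, s) for t, s in zip(Y_true, Y_score)])
```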

Maybe for other use-cases it makes more sense to implement it the way you did (if the class-label is part of the model-input and you have just one model-output, e.g. recommendation system). What kind of data are you evaluating?

@MoyanZitto

Contributor Author

commented Nov 30, 2016

@lebavarois
Well, this is the metric given by the documentation of a data mining competition. It is a binary classification problem here.

I really appreciate your response; it's very clear and helpful. Thank you very much for the explanation.

Perhaps we're drifting a little off-topic; let's come back to the main subject. My point is that although we can implement our own metrics by defining a callback, we shouldn't need to. Here are the reasons:

  • Metrics exist to evaluate the performance of our model, not to produce gradients, so they don't need to be part of the computation graph; that's why we can remove them from the compile process.

  • Removing metrics from the computation graph will make it easier for users to define metrics with more complicated logic, or to use functions provided by other packages such as scikit-learn; that's why we want to do this.

Please feel free to correct me if I'm wrong.

@lebavarois

Contributor

commented Nov 30, 2016

@MoyanZitto
I think the evaluation might be faster if it is part of the computation graph. For the few measures where computing it with keras backend is not possible, there is still the workaround you suggested using callbacks.

For the sorting you need in the MAP, it would be helpful to have a sorting function in the Keras backend. I think it could be accomplished using theano.tensor.sort for Theano and tf.nn.top_k (which also supports sorting) for TensorFlow.
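
As a rough illustration of that idea (TensorFlow backend only; the helper name is made up and this is an untested sketch):

```python
import tensorflow as tf
from keras import backend as K

def tf_average_precision(y_true, y_pred):
    # In-graph AP: sort predictions from high to low with tf.nn.top_k,
    # then average the precision values at the positive positions.
    y_true = K.flatten(y_true)
    y_pred = K.flatten(y_pred)
    n = tf.shape(y_pred)[0]
    _, order = tf.nn.top_k(y_pred, k=n, sorted=True)
    y_sorted = tf.gather(y_true, order)
    cum_hits = tf.cumsum(y_sorted)
    ranks = tf.cast(tf.range(1, n + 1), y_sorted.dtype)
    precisions = cum_hits / ranks
    positives = tf.maximum(tf.reduce_sum(y_sorted), 1.0)
    return tf.reduce_sum(precisions * y_sorted) / positives
```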

@MoyanZitto

Contributor Author

commented Dec 1, 2016

@lebavarois
I see, yes, it would be faster. I know theano.tensor.sort; it's hard to understand and harder to verify that you're using it correctly.

Anyway, speed is a reasonable argument, although compared with the training cost I don't think the evaluation process is very time-consuming. In my opinion, feasibility is more important than efficiency; besides, computation speed has never been Keras's main selling point.

Even if, for the sake of evaluation efficiency, we want to retain the current implementation, I still think we should at least add another callback class (just using scikit-learn metrics) so that users have a choice. If the metric is very complicated and training time doesn't matter much, they could choose the callback version.

@thomasjungblut


commented May 16, 2017

Bringing this up again now that some of the metrics were removed from the latest release.
It would be really great if we could come up with a good callback class that enables evaluation outside of the computation graph. Has anyone already written something like this?

@yuanzhigang10


commented Aug 10, 2017

I'm trying to implement a metric function for the F-score used in the BIO tagging scheme. Really not so easy... As @MoyanZitto said, decoupling the metric from the computation graph would be a good thing, since the F-score for various scenarios has already been implemented in other packages using numpy.

@TingDaoK


commented Aug 23, 2017

@cbaziotis
When I tried your code to calculate the training loss, I found it differed from the loss reported by Keras:

x_train = self.train_data[0]
y_train = self.train_data[1]

x_val = self.validation_data[0]
y_val = self.validation_data[1]

y_train_pre = self.model.predict(x_train)
kvar_pred = K.variable(y_train_pre)
kvar_true = K.variable(y_train)
my_loss = K.mean(K.categorical_crossentropy(kvar_pred, kvar_true))
print("\nmy train loss", K.eval(my_loss))

I was really confused: the loss reported by Keras is about 0.02, but when I printed my_loss it was about 1.00. Do you have any idea what causes this?

@fredtcaroli

Contributor

commented Sep 29, 2017

I'm okay with writing metrics as callbacks, as long as I'm able to see them properly in the progbar after an epoch is finished. I know EarlyStopping and ModelCheckpoint already work if we append extra metrics to the logs dict, but ProgbarLogger just skips the extra metrics (https://github.com/fchollet/keras/blob/master/keras/callbacks.py#L291)

If there's a new float in the logs dict when it reaches on_epoch_end, why not just show it? It could also raise a warning if a value is found that is not a float or int.
Callback printing is already a little off, so if I want to show the custom metric while training I have to either set verbose=2 or print some newlines before and after the custom metric output. Printing it with ProgbarLogger would make things way easier.

@fredtcaroli

Contributor

commented Sep 29, 2017

Since we're talking about this, and I'm not sure it was discussed elsewhere: I constantly end up having to write custom callbacks for metrics and spending extra computation time re-computing the model's output, just because metrics can't operate on multiple outputs. The way we have it now, a custom metric can only take a single prediction and its expected output counterpart. Say I have a non-mutually-exclusive classification problem: how can I get the fraction of "completely correct" samples using metric functions? I can't.

@d4nst


commented Nov 17, 2017

@fredtcaroli you can just append the metric name to self.params['metrics'] and it will show in the progbar: self.params['metrics'].append('my_custom_metric')
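
A sketch of that, combined with writing the value into logs so that other callbacks can see it too (the class and metric names here are made up, and it assumes a Keras version where self.params['metrics'] exists):

```python
import numpy as np
from keras.callbacks import Callback
from sklearn.metrics import f1_score

class MacroF1(Callback):
    def __init__(self, validation_data):
        super().__init__()
        self.x_val, self.y_val = validation_data

    def on_train_begin(self, logs=None):
        # Register the metric name so ProgbarLogger will display it.
        self.params['metrics'].append('val_macro_f1')

    def on_epoch_end(self, epoch, logs=None):
        logs = logs if logs is not None else {}
        y_pred = np.argmax(self.model.predict(self.x_val), axis=-1)
        y_true = np.argmax(self.y_val, axis=-1)  # assumes one-hot targets
        logs['val_macro_f1'] = f1_score(y_true, y_pred, average='macro')
```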

@Zvezdin


commented Dec 14, 2017

I would also like to be able to implement that kind of custom metric at the expense of a bit of performance. I don't see how the workaround by @cbaziotis solves the issue, because the custom callback never receives y_pred; it only keeps references to the network input data and expected output.

@fchollet

Collaborator

commented Dec 14, 2017

Maybe we could add something like model.add_metric(tensor) like we do with losses. Thoughts?

@Zvezdin


commented Dec 14, 2017

Wasn't the whole point of this issue to detach custom metrics from the graph, and hence be able to use y_pred and y_true as numpy arrays, for the freedom of and compatibility with the numpy ecosystem?

@fredtcaroli

Contributor

commented Dec 14, 2017

@Zvezdin callbacks are perfectly fine for dealing with that. The only thing missing is having the last prediction available to the callback, and that isn't easy (or even possible) to implement without a performance hit. That's because the internal training function never actually outputs the predicted vector, and copying it back and forth from the GPU takes time, so it's something to avoid if possible.

@evictor


commented Jul 17, 2018

Calling predict is a very expensive operation that should not have to be performed again—even just one extra time—in order to calculate metrics when predictions have already been made and are just inaccessible for architectural reasons.

For my use case an extra call to predict is not feasible because the data are ginormous and the additional predict would bloat train time considerably.

I think a good solution would be to somehow expose the latest predictions obtained during training. If they were accessible from the model interface, you could write a callback to perform whatever "fancy" (non-graph, Python-based) metric calculations you want without having to perform an expensive predict.

Maybe if I can figure out how to do this I will submit a PR; I can't imagine exposing that would be very difficult...

@fredtcaroli

Contributor

commented Jul 17, 2018

Well... If we had something like model.add_metric(tensor) then tensorflow.py_func would be an option for folks wanting to use some third-party libs to compute metrics

It's kinda limited, but should cover 90% of the use cases, I guess

I still think that unnecessarily exposing the last batch predictions is bad, but that's only my 2 cents
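
A sketch of the py_func idea (TensorFlow backend only, made-up function names; note that as a compile() metric it would be evaluated per batch rather than over the whole epoch):

```python
import numpy as np
import tensorflow as tf
from sklearn.metrics import average_precision_score

def sklearn_ap(y_true, y_pred):
    # Wrap a numpy/sklearn metric as a graph op via tf.py_func.
    def _ap(yt, yp):
        return np.float32(average_precision_score(yt.ravel().astype(int), yp.ravel()))
    score = tf.py_func(_ap, [y_true, y_pred], tf.float32, stateful=False)
    score.set_shape(())
    return score

# model.compile(optimizer='adam', loss='binary_crossentropy', metrics=[sklearn_ap])
```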

@hermansje


commented Jul 17, 2018

When this PR is merged, you can access the output by asking for it as an extra fetch using a Callback.

@evictor


commented Jul 26, 2018

I filed an issue in the TF repo to track this, so they can be aware of it for their own adaptation of Keras: tensorflow/tensorflow#21174

@theceday


commented Jul 27, 2018

While this PR might help with some problems, I think fredtcaroli's suggestion would work best for a larger set of cases. I also needed to calculate a custom accuracy during training, but it requires extra information (not extra tensors besides pred/true, but, as in the image_ocr example, a dictionary for decoding purposes).
Sending all that information into the TF session is kind of pointless, and it makes things really hard for beginners.

Though I am a beginner, I think evaluating metrics outside of graph execution might be the best choice.
Edit: at least optionally; obviously it's better to keep metrics in the graph when computation cost matters.

And if extra tensors are needed to evaluate such a metric, they should be mapped so that Keras can extract their values from the session and pass them to whatever does the metric evaluation.

BTW, I came across several questions/problems while searching; most solutions are just workarounds for a specific problem and not applicable to others. So this issue might need a little more attention on the design side.

@chigkim


commented May 8, 2019

Is there a solution for this yet? I'd also love to calculate a custom metric outside the computation graph.
