# Writing a metric is not easy. #4506

opened this issue Nov 25, 2016 · 27 comments

Contributor

### MoyanZitto commented Nov 25, 2016 • edited

Recently a friend asked me to write a MAP (mean average precision) metric. Some sorting is required in this metric. At first, I planned to use `K.get_value()` to get the values of `y_pred` and `y_true`, then use the sort methods provided by numpy or plain Python to compute the metric value, like this:

```python
def MAP(y_true, y_pred):
    np_y_true = K.get_value(y_true)
    np_y_pred = K.get_value(y_pred)
    zipped = zip(np_y_true, np_y_pred)
    zipped.sort(key=lambda x: x[1], reverse=True)
    np_y_true, np_y_pred = zip(*zipped)
    k_list = [i for i in range(len(np_y_true)) if int(np_y_true[i]) == 1]
    score = 0.
    r = np.sum(np_y_true).astype(np.int64)
    for k in k_list:
        Yk = np.sum(np_y_true[:k+1])
        score += Yk / (k+1)
    score /= r
    return K.variable(score)
```

However, this function doesn't work, because a metric is part of the computation graph, which means it must be a pure "tensor operation". I cannot get the values of `y_true` and `y_pred` because they are "empty" at this moment; only when real data is fed into the input tensors of the whole computation graph can we obtain the values of these tensors.

I think this is perhaps not a good design for metric functions, because a metric function is not really part of the model: it is only there for evaluation, not for generating errors, gradients, or anything else important to the training process. There are many different metrics, and it's not an easy job for users to define them in the pure tensor language. Perhaps we should rethink the implementation of metrics, decouple them from the computation graph, and make it possible to use numpy/Python for such tasks. What do you think? @fchollet

BTW, in the end I wrote a callback and called `model.predict` inside it to compute MAP. It works well, but this approach only works for fixed validation data and cannot compute the metric for the training data. If more information were provided in `Callback`, perhaps it would be a possible solution. Here is the code:

```python
class MAP_eval(Callback):
    def __init__(self, validation_data):
        self.validation_data = validation_data
        self.maps = []

    def eval_map(self):
        x_val, y_true = self.validation_data
        y_pred = self.model.predict(x_val)
        y_pred = list(np.squeeze(y_pred))
        zipped = zip(y_true, y_pred)
        zipped.sort(key=lambda x: x[1], reverse=True)
        y_true, y_pred = zip(*zipped)
        k_list = [i for i in range(len(y_true)) if int(y_true[i]) == 1]
        score = 0.
        r = np.sum(y_true).astype(np.int64)
        for k in k_list:
            Yk = np.sum(y_true[:k+1])
            score += Yk / (k+1)
        score /= r
        return score

    def on_epoch_end(self, epoch, logs={}):
        score = self.eval_map()
        print "MAP for epoch %d is %f" % (epoch, score)
        self.maps.append(score)
```

### cbaziotis commented Nov 25, 2016 • edited

I wanted to do a similar thing, and I decided to explicitly pass the training and validation sets to my metrics callback. I don't like it, but it works for me.

MetricsCallback.py

```python
class MetricsCallback(Callback):
    def __init__(self, metrics, validation_data, test_data):
        super().__init__()
        self.validation_data = validation_data
        self.test_data = test_data
        self.metrics = metrics
    ...
```

and then do

```python
metrics_callback = MetricsCallback(validation_data=(X_val, y_val),
                                   test_data=(X_test, y_test),
                                   metrics=["macro_f1", "macro_recall"])
nn_model.fit(X_train, y_train,
             validation_data=(X_val, y_val),
             nb_epoch=150, batch_size=128,
             callbacks=[metrics_callback])
```
Contributor Author

### MoyanZitto commented Nov 26, 2016 • edited

@cbaziotis Yes, it does work. But you can't access the training data after each batch or epoch, can you? If this information were provided in callbacks, we could rewrite the metrics module and implement the metrics as callbacks. I think that's a possible solution. Anyway, what I want to say is that metrics should not be part of the computation graph, so the current implementation is inappropriate.

### cbaziotis commented Nov 26, 2016 • edited

@MoyanZitto

> Yes it does work. But you can't access the training data after each batch or epoch, can you?

Of course you can. I showed you how above: you pass them to your callback. Here is an example where you explicitly pass your train_data and validation_data:

MetricsCallback

```python
class MetricsCallback(Callback):
    def __init__(self, train_data, validation_data):
        super().__init__()
        self.validation_data = validation_data
        self.train_data = train_data

    def on_epoch_end(self, epoch, logs={}):
        X_train = self.train_data[0]
        y_train = self.train_data[1]
        X_val = self.validation_data[0]
        y_val = self.validation_data[1]
        # do whatever you want next
```

**Model**

```python
metrics_callback = MetricsCallback(train_data=(X_train, y_train),
                                   validation_data=(X_val, y_val))
nn_model.fit(X_train, y_train,
             validation_data=(X_val, y_val),
             nb_epoch=150, batch_size=128,
             callbacks=[metrics_callback])
```

But I agree that this is not so nice. I did it like that because I want to use my own metrics, and @fchollet said in a comment (#3230 (comment)) that you should use a callback for that. What I don't understand is why not use scikit-learn's metrics in the first place. What I would really like is a way to pass a `scorer` created with `make_scorer` in the `metrics` argument of `compile()`, like this:

```python
scorer = make_scorer(f1_score, labels=['positive', 'negative'], average='macro')
model.compile(loss='categorical_crossentropy', optimizer="adam", metrics=[scorer])
```
Contributor Author

### MoyanZitto commented Nov 26, 2016

@cbaziotis Well, if you are using `fit_generator` or `validation_split`, it doesn't work. I mean, the callback should be able to access the training samples in each forward pass, not have them passed in by hand. To do this, the code in `_fit_loop` and the class `Callback` would have to be modified accordingly, and the metrics part of `.compile` would have to be removed. If @fchollet agrees to adjust the structure, I can help write some PRs. Using scikit-learn metrics is a good idea; there are also lots of other modules that could be used in Keras, K-fold cross-validation for example. Although it would add another dependency to Keras, I think it would be worth it. However, @fchollet is the boss, so it's up to him.
Contributor

### lebavarois commented Nov 29, 2016

@MoyanZitto I'm confused about the MAP metric you propose. Your implementation looks like the formula for AP@all. So isn't it AP (average precision) instead of MAP (mean average precision)? On the other hand, I'm confused why the result does not match `sklearn.metrics.average_precision_score`:

```python
import numpy as np
from sklearn.metrics import average_precision_score

y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0.1, 0.4, 0.35, 0.8])

def MAP(np_y_true, np_y_pred):
    zipped = zip(np_y_true, np_y_pred)
    zipped.sort(key=lambda x: x[1], reverse=True)
    np_y_true, np_y_pred = zip(*zipped)
    k_list = [i for i in range(len(np_y_true)) if int(np_y_true[i]) == 1]
    score = 0.
    r = np.sum(np_y_true).astype(np.int64)
    for k in k_list:
        Yk = np.sum(np_y_true[:k+1])
        score += Yk / (k+1)
    score /= r
    return score

print MAP(y_true, y_pred)                      # ---> 0.5
print average_precision_score(y_true, y_pred)  # ---> 0.791666666667
```
Contributor Author

### MoyanZitto commented Nov 30, 2016 • edited

@lebavarois This code was written based on the algorithm given by a friend; the documents said its name is "MAP". But yes, it looks more like AP@all. Here is the algorithm: rank the predicted probabilities from high to low. If the number of true positive samples among the top k predictions is Y_k, then define P@k as:

P@k = Y_k / k

Assume the indices of the positive samples are k1, k2, ..., kr, where r is the total number of positive samples. Then MAP is defined as:

MAP = (P@k1 + P@k2 + ... + P@kr) / r

I hope I didn't write the code wrong...
Contributor

### lebavarois commented Nov 30, 2016 • edited

@MoyanZitto I found out why the values did not match in my previous post: I got the wrong value in Python 2; for Python 3 it is correct. `score += Yk/(k+1)` should be `score += Yk/float(k+1)` in Python 2. There also seems to be a problem in the latest sklearn version; with the code of this pull request, I get the same value as with your code.

```python
import numpy as np
from sklearn.metrics import average_precision_score

y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0.1, 0.4, 0.35, 0.8])

def AP(np_y_true, np_y_pred):
    zipped = zip(np_y_true, np_y_pred)
    zipped.sort(key=lambda x: x[1], reverse=True)
    np_y_true, np_y_pred = zip(*zipped)
    k_list = [i for i in range(len(np_y_true)) if int(np_y_true[i]) == 1]
    score = 0.
    r = np.sum(np_y_true).astype(np.int64)
    for k in k_list:
        Yk = np.sum(np_y_true[:k+1])
        score += Yk / float(k+1)
    score /= r
    return score

print AP(y_true, y_pred)                       # ---> 0.8333
print average_precision_score(y_true, y_pred)  # ---> 0.8333
```

Still, I think it is AP and not MAP. If you have multi-label classification (e.g. one image with multiple class labels), AP gives you the evaluation for just one test datapoint (e.g. one test image). If you then have multiple test datapoints (e.g. multiple images), you can compute the mean over the whole test set, which is then MAP. So in this case `y_true` and `y_pred` should be 2-dimensional (multiple outputs), the AP function should be applied along the second dimension, and finally the mean is taken over the first dimension (which corresponds to the datapoints, e.g. multiple images); see the sketch below. Maybe for other use-cases it makes more sense to implement it the way you did (if the class label is part of the model input and you have just one model output, e.g. a recommendation system). What kind of data are you evaluating?
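A minimal numpy sketch of the MAP just described, reusing the `AP` function above; the function name here is illustrative:

```python
import numpy as np

def mean_average_precision(Y_true, Y_pred):
    # Y_true, Y_pred: 2-D arrays of shape (n_datapoints, n_labels).
    # AP over each row (one datapoint), then the mean over all rows.
    return np.mean([AP(t, p) for t, p in zip(Y_true, Y_pred)])
```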
Contributor Author

### MoyanZitto commented Nov 30, 2016 • edited

@lebavarois Well, this is the metric given in the documents of a data mining competition. It is a binary classification problem here~ I really appreciate your response; it's very clear and helpful. Thank you very much for the explanation.

Perhaps we're drifting a little from the subject, so let's come back to the main topic~ My point is: although we can implement our own metrics by defining a callback, we shouldn't have to. Here are the reasons. Metrics exist to evaluate the performance of our model, not to produce gradients, so they need not be part of the computation graph; that's why we can remove them from the `compile` process. And removing metrics from the computation graph would make it easier for users to define metrics with more complicated logic, or to use the functions provided by other packages such as scikit-learn; that's why we would want to do this. Please feel free to correct me if I am wrong.
Contributor

### lebavarois commented Nov 30, 2016 • edited

@MoyanZitto I think the evaluation might be faster if it is part of the computation graph. For the few measures that cannot be computed with the Keras backend, there is still the workaround you suggested using callbacks. For the sorting you need in the MAP, it would be helpful to have a sorting function in the Keras backend. I think it could be accomplished using `theano.tensor.sort` for Theano and `tf.nn.top_k` (which also supports sorting) for TensorFlow.
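For illustration, here is a minimal sketch (not an existing Keras API) of how the in-graph sorting could look with `tf.nn.top_k`, computing the AP discussed above entirely with tensor ops:

```python
import tensorflow as tf

def ap_metric(y_true, y_pred):
    # Flatten to 1-D so we rank every prediction.
    y_true = tf.reshape(y_true, [-1])
    y_pred = tf.reshape(y_pred, [-1])
    n = tf.shape(y_pred)[0]
    # top_k with k = n returns all scores sorted in descending order.
    _, sort_idx = tf.nn.top_k(y_pred, k=n, sorted=True)
    y_true_sorted = tf.gather(y_true, sort_idx)
    # Precision@k for every rank k, averaged over the positive positions.
    cum_positives = tf.cumsum(y_true_sorted)
    ranks = tf.cast(tf.range(1, n + 1), y_true_sorted.dtype)
    precision_at_k = cum_positives / ranks
    r = tf.reduce_sum(y_true_sorted)
    return tf.reduce_sum(precision_at_k * y_true_sorted) / r
```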
Contributor Author

### MoyanZitto commented Dec 1, 2016

@lebavarois I see; yes, it would be faster. I know `theano.tensor.sort`; it's... hard to understand, and harder to verify that you're using it correctly. Anyway, speed is a reasonable argument, although compared with the training cost I don't think the evaluation process is very time-consuming. In my opinion, feasibility is more important than efficiency, and besides, computation speed has never been an advantage of Keras. Even if, for the sake of evaluation efficiency, we want to retain the current implementation, I still think we should at least add another callback class (one that just uses scikit-learn metrics), so that users have the possibility to choose. If the metric is very complicated and we don't really care about the training time, they could choose the callback version.

### thomasjungblut commented May 16, 2017

Bringing this up again now that some of the metrics were removed from the latest release. It would be really great if we could come up with a good callback class that enables evaluation outside of the computation graph. Has anyone already written something like this?
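For concreteness, here is a minimal sketch of such a callback, assuming one-hot targets and scikit-learn metric functions; the class and argument names are illustrative, not an existing Keras API:

```python
import numpy as np
from keras.callbacks import Callback
from sklearn.metrics import f1_score

class SklearnMetricsCallback(Callback):
    """Evaluate numpy-based metrics on held-out data after each epoch."""

    def __init__(self, validation_data, metric_fns):
        super(SklearnMetricsCallback, self).__init__()
        self.x_val, self.y_val = validation_data
        self.metric_fns = metric_fns  # dict: name -> fn(y_true, y_pred)
        self.history = {name: [] for name in metric_fns}

    def on_epoch_end(self, epoch, logs=None):
        y_prob = self.model.predict(self.x_val)
        y_pred = np.argmax(y_prob, axis=-1)
        y_true = np.argmax(self.y_val, axis=-1)
        for name, fn in self.metric_fns.items():
            value = fn(y_true, y_pred)
            self.history[name].append(value)
            if logs is not None:
                logs[name] = value  # visible to EarlyStopping, ModelCheckpoint, ...

# Usage:
# cb = SklearnMetricsCallback(
#     (X_val, y_val),
#     {'macro_f1': lambda t, p: f1_score(t, p, average='macro')})
# model.fit(X_train, y_train, callbacks=[cb])
```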

### yuanzhigang10 commented Aug 10, 2017 • edited

I'm trying to implement a metric function for the f-score used in the BIO tagging scheme. Really not so easy... As @MoyanZitto said, decoupling the metric from the computation graph would be a good thing, since f-scores for various scenarios have already been implemented in other packages using NumPy.

### TingDaoK commented Aug 23, 2017

@cbaziotis When I tried your code to calculate the training loss, I found it different from the loss reported by Keras:

```python
x_train = self.train_data[0]
y_train = self.train_data[1]
x_val = self.validation_data[0]
y_val = self.validation_data[1]
y_train_pre = self.model.predict(x_train)
kvar_pred = K.variable(y_train_pre)
kvar_true = K.variable(y_train)
my_loss = K.mean(K.categorical_crossentropy(kvar_pred, kvar_true))
print("\nmy train loss")
```

I was really confused. The loss reported by Keras is about 0.02, but when I printed out `my_loss`, it was about 1.00. Do you have any idea about this problem?
Contributor

### fredtcaroli commented Sep 29, 2017

I'm okay with writing metrics as callbacks, as long as I can see them properly in the progbar after an epoch finishes. I know `EarlyStop` and `ModelCheckpoint` already work if we append extra metrics to the `logs` dict, but `ProgbarLogger` just skips the extra metrics (https://github.com/fchollet/keras/blob/master/keras/callbacks.py#L291). If there's a new `float` in the logs dict when it reaches `on_epoch_end`, why not just show it? We could also raise a warning if a value was found that is not a `float` or `int`. Callback printing is already a little off, so if I want to show a custom metric while training, I have to either set `verbose=2` or print some newlines before and after the custom metric output. Printing it with `ProgbarLogger` would make things way easier.
Contributor

### fredtcaroli commented Sep 29, 2017

Since we're talking about this, and I'm not sure it was discussed elsewhere: I constantly end up having to write custom callbacks for metrics and spending extra computation time re-computing the model's output, just because metrics can't operate on multiple outputs. The way we have it now, a custom metric takes a single prediction and its expected-output counterpart. Say I have non-mutually-exclusive classification: how can I compute the fraction of "completely correct" samples using metric functions? I can't.

### d4nst commented Nov 17, 2017

 @fredtcaroli you can just append the metric name to `self.params['metrics']` and it will show in the progbar: `self.params['metrics'].append('my_custom_metric')`

### Zvezdin commented Dec 14, 2017 • edited

I would also like to be able to implement that kind of custom metric, at the expense of a bit of performance. I don't see how the workaround by @cbaziotis solves the issue, because the custom callback never receives `y_pred`; it only keeps references to the network's input data and expected output.
Collaborator

### fchollet commented Dec 14, 2017

 Maybe we could add something like `model.add_metric(tensor)` like we do with losses. Thoughts?
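Something like the following hypothetical usage, mirroring `model.add_loss`; this API did not exist at the time of this comment, and the tracked tensor here is only an example:

```python
from keras.layers import Input, Dense
from keras.models import Model
from keras import backend as K

x = Input(shape=(10,))
y = Dense(1, activation='sigmoid')(x)
model = Model(x, y)
# Hypothetical: register an arbitrary tensor to be tracked as a metric.
model.add_metric(K.mean(y), name='mean_prediction')
model.compile(loss='binary_crossentropy', optimizer='adam')
```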

### Zvezdin commented Dec 14, 2017

Wasn't the whole point of this issue to detach the custom metrics from the graph, and hence be able to use `y_pred` and `y_true` as numpy arrays, for the freedom of and compatibility with the numpy ecosystem?
Collaborator

### fchollet commented Dec 14, 2017

If you want to use numpy metrics, you don't really need Keras-level support. You can just have your training process dump saved models, then have a different eval process load them, call `predict` on your validation data, and call a numpy metric on the output. Or any other similar workflow.
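A minimal sketch of that workflow, assuming one checkpoint per epoch; the file names and the metric choice are illustrative:

```python
from keras.callbacks import ModelCheckpoint
from keras.models import load_model
from sklearn.metrics import average_precision_score

# Training process: dump a saved model after every epoch.
checkpoint = ModelCheckpoint('model_epoch_{epoch:02d}.h5')
model.fit(X_train, y_train, callbacks=[checkpoint])

# Separate eval process: load a checkpoint, predict, score with numpy/sklearn.
saved = load_model('model_epoch_05.h5')
y_prob = saved.predict(X_val)
print(average_precision_score(y_val, y_prob))
```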
Contributor

### fredtcaroli commented Dec 14, 2017

@Zvezdin Callbacks are perfectly fine for dealing with that. The only thing missing is having the last prediction available to the callback, and that wouldn't be easy (or even possible) to implement without a performance hit. That's because the internal training function never actually outputs the predicted vector; copying it back and forth from the GPU takes time, so it's something to avoid if possible.

### evictor commented Jul 17, 2018

Calling `predict` is a very expensive operation that should not have to be performed again, even just one extra time, in order to calculate metrics when predictions have already been made and are merely inaccessible for architectural reasons. For my use case an extra call to `predict` is not feasible, because the data are ginormous and the additional `predict` would bloat training time considerably. I think a good solution would be to expose the latest predictions obtained from the model during training. If they were accessible from the model interface, you could write a callback to perform whatever "fancy" (non-graph, Python-based) metric calculations you want, without having to perform an expensive `predict`. Maybe if I can figure out how to do this I will submit a PR; I can't imagine it would be that difficult to expose that...
Contributor

### fredtcaroli commented Jul 17, 2018 • edited

Well... if we had something like `model.add_metric(tensor)`, then `tensorflow.py_func` would be an option for folks wanting to use third-party libs to compute metrics. It's kinda limited, but it should cover 90% of the use cases, I guess. I still think that unnecessarily exposing the last batch's predictions is bad, but that's only my 2 cents.
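A rough sketch of that idea, wrapping a scikit-learn metric as a graph op with the TF 1.x `tf.py_func` API. The usual caveat applies: an in-graph metric is computed per batch and then averaged, which is not the same as computing it once over the whole epoch.

```python
import numpy as np
import tensorflow as tf
from sklearn.metrics import average_precision_score

def ap_numpy(y_true, y_pred):
    # Runs as ordinary Python on evaluated numpy arrays, outside the graph.
    return np.float32(average_precision_score(y_true, y_pred))

def ap_metric(y_true, y_pred):
    # tf.py_func wraps the numpy function as an op in the graph (TF 1.x).
    return tf.py_func(ap_numpy, [y_true, y_pred], tf.float32)

model.compile(loss='binary_crossentropy', optimizer='adam',
              metrics=[ap_metric])
```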

### hermansje commented Jul 17, 2018

When this PR is merged, you will be able to access the output by asking for it as an extra fetch using a Callback.


### evictor commented Jul 26, 2018

I opened an issue in the TF repo to track this, so they can be aware of it for their own adaptation of Keras: tensorflow/tensorflow#21174

### theceday commented Jul 27, 2018 • edited

While this PR might help with some problems, I think fredtcaroli's suggestion might work best for a larger set of them. I also needed to compute a custom accuracy during training, but it needs extra information (not extra tensors beyond pred/true, but, like in the image_ocr example, a dictionary for decoding purposes). Sending all that information into the TF session is kind of pointless, and it makes things really hard for beginners. Though I am a beginner myself, evaluating metrics outside of graph execution might be the best choice.

Edit: at least optionally; it's obviously better to keep metrics in the graph where possible, for computation cost. And if extra tensors are needed to evaluate such a metric, they should be mapped so that Keras can extract the values out of the session and send them to whatever does the metric evaluation. BTW, I came across several questions/problems while searching; most solutions are just workarounds for one specific problem and not applicable to others. So this issue might need a little more attention on the design side.


### chigkim commented May 8, 2019

Is there a solution for this yet? I'd also love to calculate a custom metric outside the computation graph.