About the quality score ranges #21

Closed
li-js opened this issue Oct 8, 2020 · 6 comments

Comments

@li-js

li-js commented Oct 8, 2020

Nice work in this repo.
I have read through the other closed issues, and it seems to be expected that two models produce very different quality score ranges with the SER-FIQ method. Normalization is recommended to map the scores onto a common range purely to make them easier to read, since only the ranking is important while the actual values are not.

I do have some thoughts here. Suppose two models are observed to have different quality values and range widths on the same dataset, e.g. Model 1 has the range [0.2, 0.6], i.e. 0.4 ± 0.2, and Model 2 has the range [0.8, 0.82], i.e. 0.81 ± 0.01.

  1. Is the difference in quality values here, 0.4 vs. 0.81, an indicator of something? Is the difference in range widths, 0.4 vs. 0.02, also an indicator of something?
  2. Even with normalization, are the quality scores comparable across the two models? What kind of normalization is required to make the scores comparable between the two models? (See the small sketch below.)
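
To make question 2 concrete, here is a minimal sketch of the most common choice, per-model min-max normalization, applied to the two hypothetical ranges above. Both models end up on [0, 1], but that alone says nothing about whether a given normalized value means the same thing for both models. All numbers here are made up for illustration.

```python
import numpy as np

def minmax_normalize(scores):
    """Map one model's raw quality scores onto [0, 1] (per model)."""
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / (s.max() - s.min())

# Hypothetical raw scores matching the ranges in the example above
model1_scores = np.array([0.20, 0.35, 0.40, 0.55, 0.60])       # range [0.2, 0.6]
model2_scores = np.array([0.800, 0.805, 0.810, 0.815, 0.820])  # range [0.8, 0.82]

print(minmax_normalize(model1_scores))  # [0.    0.375 0.5   0.875 1.   ]
print(minmax_normalize(model2_scores))  # [0.    0.25  0.5   0.75  1.   ]
```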

Any thoughts or discussions are welcome.

@pterhoer
Owner

pterhoer commented Oct 8, 2020

Hi li-js,

Well summarized and very interesting questions!
I am afraid I will not be able to give you fully satisfactory answers...

  1. Assuming that the two models have the same output dimension and the output embeddings are normalized, a higher quality mean might indicate a better separation in the embedding space, since a higher quality means higher robustness and thus a lower uncertainty of the embedding. So the range might give you an indication of how separable the data samples are in the embedding space of the model. If you change the starting assumptions, I am not sure how this affects the separability.

  2. SER-FIQ uses the robustness of the representations as its indicator of future performance (sample quality). Therefore, it reflects how well recognition will work on a specific face recognition model. If the behaviour (and the biases) of two models are similar, the normalized/scaled quality predictions should be similar as well. If the models perform differently, it will be hard to align the qualities; a mapping for the different cohorts (regions with performance differences) might be needed. (A rough sketch of the robustness idea follows below.)
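
For reference, a minimal NumPy sketch of the stochastic-embedding-robustness idea: the same face is embedded several times under different random dropout patterns, and a tight cluster of embeddings is mapped to a high quality score, a wide spread to a low one. This is a toy illustration along the lines of the paper, not the implementation in this repository.

```python
import numpy as np

def serfiq_style_quality(stochastic_embeddings):
    """Toy SER-FIQ-style quality score.

    stochastic_embeddings: array of shape (m, d) -- m embeddings of the same
    face obtained with different random dropout patterns.
    Returns a value in (0, 1]; a tighter cluster -> higher quality.
    """
    X = np.asarray(stochastic_embeddings, dtype=np.float64)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # L2-normalize each embedding
    m = X.shape[0]
    dists = [np.linalg.norm(X[i] - X[j])                # pairwise Euclidean distances
             for i in range(m) for j in range(i + 1, m)]
    mean_dist = float(np.mean(dists))
    return 2.0 / (1.0 + np.exp(mean_dist))              # small spread -> score near 1

# A "robust" face (tight embedding cluster) vs. a "noisy" one (wide spread):
rng = np.random.default_rng(0)
base = rng.normal(size=512)
robust = base + 0.01 * rng.normal(size=(100, 512))
noisy = base + 0.50 * rng.normal(size=(100, 512))
print(serfiq_style_quality(robust))  # close to 1.0
print(serfiq_style_quality(noisy))   # noticeably lower
```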

Interesting questions! Let me know if you have other ideas on that :)

Best
Philipp

@rocinant

I don't understand why everyone here keeps saying that only the ranking is important; of course the absolute score is far more important. If I get a bunch of images, why would I care about their ranking among themselves, with precision down to two digits after the decimal point? The only thing I care about is an absolute threshold that can be used to filter out unqualified images. You can get their rankings, but you cannot filter out images based on rankings alone.

Suppose there is one high-quality dataset whose scores fall in the range [0.90, 0.91] and another, bad dataset whose scores fall in [0.40, 0.41], both produced by the same model. If you normalize each range based on the rankings, both will mostly end up in [0, 1]. Then, if you filter images with some threshold, say 0.5, you will throw away roughly half of the images in the good dataset as well as in the bad one, and you may even have the illusion that you have kept the better-quality images. For a given model, the absolute score clearly matters more than relative rankings, because in the end you cannot just say, "I will discard the lowest 10% of images in the batch, no matter how good their overall quality is." I think the author failed to address this issue, or maybe I am not seeing it clearly.
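
A small illustrative example of the pitfall described above, with purely hypothetical scores and per-dataset min-max normalization:

```python
import numpy as np

def minmax(scores):
    """Min-max normalize one dataset's scores onto [0, 1]."""
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / (s.max() - s.min())

rng = np.random.default_rng(1)
good_dataset = rng.uniform(0.90, 0.91, size=1000)  # high absolute quality
bad_dataset  = rng.uniform(0.40, 0.41, size=1000)  # low absolute quality

threshold = 0.5
print(np.mean(minmax(good_dataset) >= threshold))  # ~0.5
print(np.mean(minmax(bad_dataset)  >= threshold))  # ~0.5
# Per-dataset normalization keeps ~half of the images in *both* datasets,
# even though every image in the "bad" set has a lower raw score than
# every image in the "good" set.
```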

@rocinant

rocinant commented Oct 15, 2020

Now, with the proposed method, I am in the dilemma that I get all of my images' scores squeezed into [0.88, 0.89], but I can clearly see that some of them are suitable for face recognition and some are not. So how can I filter out the unqualified ones? Normalize them to [0, 1] and then drop the lowest 10%? Or should it be 20%? How do I decide this percentage, or the threshold for the normalized scores? Like I said, you cannot apply this in a real face recognition system with no absolute threshold and only some barely discriminative relative scores. Sometimes the differences are so small that you cannot even tell whether they are meaningful or just tiny differences caused by the arithmetic.

@pterhoer
Owner

Hi rocinant,

thank you for your comment; I think you point to an important difference between research experiments and real-life deployment.

(1) To reproduce the research results, the quality ranking on a whole dataset is of high importance, since it allows you to compute the recognition performance at every quality threshold. This yields the error-vs-reject curves (Figures 4 & 6 in the paper).
(2) To deploy face quality assessment in a real-life application, you need to define a quality threshold that divides input images into good and bad. Good images are accepted and bad images are rejected.

For both scenarios, scaling the quality range does not matter, since (1) it does not change the quality ranking on the dataset and (2) the quality threshold is scaled along with it. In both cases, the same results will be obtained. Scaling the quality values (in this case) only makes the numbers more comfortable for a human observer.

So, one point left: How to choose the quality threshold?
That depends on your model and your application scenario:

  1. Determine an FMR for your system (a low FMR for a more security-oriented application, a higher FMR if you focus on convenience).
  2. Take a (preferably large) face database and compute the error-vs-reject curve for your model.
  3. Choose the quality threshold that best suits your trade-off between reducing the recognition errors and rejecting fewer images (a minimal sketch of steps 2 and 3 is given below).
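
A minimal sketch of steps 2 and 3, assuming you already have per-image quality scores and genuine/impostor comparison scores for a labelled evaluation set; all names below are hypothetical and not part of this repository:

```python
import numpy as np

def fnmr_at_fmr(genuine, impostor, target_fmr):
    """FNMR at the decision threshold that yields the target FMR."""
    thr = np.quantile(impostor, 1.0 - target_fmr)  # accept comparisons with score >= thr
    return float(np.mean(np.asarray(genuine) < thr))

def error_vs_reject(pairs, qualities, target_fmr, reject_rates):
    """Error-vs-reject curve: FNMR at fixed FMR after discarding the
    lowest-quality images.

    pairs        : list of (img_a, img_b, is_genuine, comparison_score)
    qualities    : 1-D NumPy array of quality scores, indexed by image id
    reject_rates : fractions of lowest-quality images to reject
    """
    curve = []
    for r in reject_rates:
        q_thr = np.quantile(qualities, r)   # reject the lowest r fraction
        keep = qualities >= q_thr
        gen = [s for a, b, g, s in pairs if g and keep[a] and keep[b]]
        imp = [s for a, b, g, s in pairs if not g and keep[a] and keep[b]]
        curve.append((r, q_thr, fnmr_at_fmr(gen, imp, target_fmr)))
    return curve

# Step 3: pick the smallest quality threshold whose FNMR already meets your
# error budget, e.g. FNMR <= 1% at FMR = 0.1%:
# curve = error_vs_reject(pairs, qualities, target_fmr=1e-3,
#                         reject_rates=np.linspace(0.0, 0.5, 11))
# quality_threshold = next(q for r, q, e in curve if e <= 0.01)
```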

Best,
Philipp

@rocinant

rocinant commented Oct 19, 2020

Hi Philipp,

Thank you for your detailed responses, I appreciate it. While scaling the quality range indeed changes nothing except making the numbers easier to read, it also does not help in deciding the best threshold with respect to the classification error. You pointed out a possible way to obtain a good threshold. I assume the face database in step 2 should come from the real-life scenario, or at least be very close to it. Then, to produce the curve, one needs to label this database. That is not easy work; it is basically like preparing data for training a face recognition model. This means that even if you use the same face recognition model trained on a public dataset (say, the public ArcFace network trained on MS1M) and you produce the error-vs-reject plot, the resulting "best" threshold will not work once you deploy the method in your own real-life scenario. One needs to collect enough real-life images, label them, produce the plot, and derive a custom threshold. Basically, the method is closely coupled not only to the model but also to the dataset, which means you cannot obtain a best threshold from a fixed model alone. Hopefully I have made this point clear. The idea is good; it is just not cross-dataset, unlike a face recognition model (with a fixed model trained on a public dataset, you can basically say, "I think threshold xx is a good choice considering the trade-off between precision and recall", etc.).

Best,
Rocin

@pterhoer
Owner

Hi Rocin,

I am sorry for the late response. I did not notice your answer.

It is correct that the face data for step 2 should ideally come from real-life data in order to capture all the necessary kinds of variation. This is similar to face recognition systems in general.

The threshold determination process is the same as for any face recognition model. In face recognition, you need to find a suitable threshold for the comparison score to separate genuine from impostor pairs. For face quality assessment, you need a threshold for the quality score to decide whether a face image is used for identity verification or rejected. If (as you said) the face recognition threshold can be set with publicly available data, you can do the same for the quality threshold.

However, SER-FIQ has a big advantage here: it measures quality based on the embedding robustness. Face recognition models and face quality assessment models need to be evaluated under different conditions to see whether they perform reasonably in certain scenarios. For SER-FIQ, the embedding robustness is used as the quality indicator, and thus it is sufficient to evaluate the performance under different quality scenarios. The domain transition happens through the embedding robustness, so much less data is needed to find a suitable threshold than with other methods.

I hope this explanation helps you.

Best
Philipp
