About the quality score ranges #21

Closed
li-js opened this issue Oct 8, 2020 · 6 comments

Comments

@li-js

li-js commented Oct 8, 2020

Nice work in this repo.
I have read through the other closed issues, and it seems to be expected that two models produce very different quality score ranges with the SER-FIQ method. Normalization is recommended to map the scores onto a common range purely to make them easier to read, since only the ranking is important while the actual values are not.

I do have some thoughts here. Suppose two models are observed to have different quality values and range widths on the same dataset, e.g. Model 1 has the range [0.2, 0.6], i.e. 0.4 ± 0.2, and Model 2 has the range [0.8, 0.82], i.e. 0.81 ± 0.01.

  1. Is the difference in quality values here, 0.4 vs. 0.81, an indicator of something? Is the difference in range widths, 0.4 vs. 0.02, also an indicator of something?
  2. Even with normalization, are the quality scores comparable across the two models? What kind of normalization is required to make the scores comparable between the two models? (See the small sketch below.)
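
To make question 2 concrete, here is a minimal sketch of the most common choice, per-model min-max normalization, applied to the two hypothetical ranges above. Both models end up on [0, 1], but that alone says nothing about whether a given normalized value means the same thing for both models. All numbers here are made up for illustration.

```python
import numpy as np

def minmax_normalize(scores):
    """Map one model's raw quality scores onto [0, 1] (per model)."""
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / (s.max() - s.min())

# Hypothetical raw scores matching the ranges in the example above
model1_scores = np.array([0.20, 0.35, 0.40, 0.55, 0.60])       # range [0.2, 0.6]
model2_scores = np.array([0.800, 0.805, 0.810, 0.815, 0.820])  # range [0.8, 0.82]

print(minmax_normalize(model1_scores))  # [0.    0.375 0.5   0.875 1.   ]
print(minmax_normalize(model2_scores))  # [0.    0.25  0.5   0.75  1.   ]
```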

Any thoughts or discussions are welcome.

@pterhoer
Owner

pterhoer commented Oct 8, 2020

Hi li-js,

Well summarized and very interesting questions!
I am afraid I will not be able to give you fully satisfactory answers...

  1. Assuming that the two models have the same output dimension and the output embeddings are normalized, a higher quality mean might indicate a better separation in the embedding space, since a higher quality means higher robustness and thus a lower uncertainty of the embedding. So the range might give you an indication of how separable the data samples are in the embedding space of the model. If you change the starting assumptions, I am not sure how this affects the separability.

  2. SER-FIQ uses the robustness of the representations as its indicator of future performance (sample quality). Therefore, it reflects how well recognition will work on a specific face recognition model. If the behaviour (and the biases) of two models are similar, the normalized/scaled quality predictions should be similar as well. If the models perform differently, it will be hard to align the qualities; a mapping for the different cohorts (regions with performance differences) might be needed. (A rough sketch of the robustness idea follows below.)
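
For reference, a minimal NumPy sketch of the stochastic-embedding-robustness idea: the same face is embedded several times under different random dropout patterns, and a tight cluster of embeddings is mapped to a high quality score, a wide spread to a low one. This is a toy illustration along the lines of the paper, not the implementation in this repository.

```python
import numpy as np

def serfiq_style_quality(stochastic_embeddings):
    """Toy SER-FIQ-style quality score.

    stochastic_embeddings: array of shape (m, d) -- m embeddings of the same
    face obtained with different random dropout patterns.
    Returns a value in (0, 1]; a tighter cluster -> higher quality.
    """
    X = np.asarray(stochastic_embeddings, dtype=np.float64)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # L2-normalize each embedding
    m = X.shape[0]
    dists = [np.linalg.norm(X[i] - X[j])                # pairwise Euclidean distances
             for i in range(m) for j in range(i + 1, m)]
    mean_dist = float(np.mean(dists))
    return 2.0 / (1.0 + np.exp(mean_dist))              # small spread -> score near 1

# A "robust" face (tight embedding cluster) vs. a "noisy" one (wide spread):
rng = np.random.default_rng(0)
base = rng.normal(size=512)
robust = base + 0.01 * rng.normal(size=(100, 512))
noisy = base + 0.50 * rng.normal(size=(100, 512))
print(serfiq_style_quality(robust))  # close to 1.0
print(serfiq_style_quality(noisy))   # noticeably lower
```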

Interesting questions! Let me know if you have other ideas on that :)

Best
Philipp

@rocinant

I don't understand why everyone here keeps saying that only the ranking is important; of course the absolute score is far more important. If I get a bunch of images, why would I care about their ranking among themselves, with precision down to two digits after the decimal point? The only thing I care about is an absolute threshold that can be used to filter out unqualified images. You can get their rankings, but you cannot filter out images based on rankings alone.

Suppose there is one high-quality dataset whose scores fall in the range [0.90, 0.91] and another, bad dataset whose scores fall in [0.40, 0.41], both produced by the same model. If you normalize each range based on the rankings, both will mostly end up in [0, 1]. Then, if you filter images with some threshold, say 0.5, you will throw away roughly half of the images in the good dataset as well as in the bad one, and you may even have the illusion that you have kept the better-quality images. For a given model, the absolute score clearly matters more than relative rankings, because in the end you cannot just say, "I will discard the lowest 10% of images in the batch, no matter how good their overall quality is." I think the author failed to address this issue, or maybe I am not seeing it clearly.
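
A small illustrative example of the pitfall described above, with purely hypothetical scores and per-dataset min-max normalization:

```python
import numpy as np

def minmax(scores):
    """Min-max normalize one dataset's scores onto [0, 1]."""
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / (s.max() - s.min())

rng = np.random.default_rng(1)
good_dataset = rng.uniform(0.90, 0.91, size=1000)  # high absolute quality
bad_dataset  = rng.uniform(0.40, 0.41, size=1000)  # low absolute quality

threshold = 0.5
print(np.mean(minmax(good_dataset) >= threshold))  # ~0.5
print(np.mean(minmax(bad_dataset)  >= threshold))  # ~0.5
# Per-dataset normalization keeps ~half of the images in *both* datasets,
# even though every image in the "bad" set has a lower raw score than
# every image in the "good" set.
```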

@rocinant

rocinant commented Oct 15, 2020

Now, with the proposed method, I am in the dilemma that I get all of my images' scores squeezed into [0.88, 0.89], but I can clearly see that some of them are suitable for face recognition and some are not. So how can I filter out the unqualified ones? Normalize them to [0, 1] and then drop the lowest 10%? Or should it be 20%? How do I decide this percentage, or the threshold for the normalized scores? Like I said, you cannot apply this in a real face recognition system with no absolute threshold and only some barely discriminative relative scores. Sometimes the differences are so small that you cannot even tell whether they are meaningful or just tiny differences caused by the arithmetic.

@pterhoer
Owner

Hi rocinant,

thank you for your comment; I think you point to an important difference between research experiments and real-life deployment.

(1) To reproduce the research results, the quality ranking on a whole dataset is of high importance, since it allows you to compute the recognition performance at every quality threshold. This yields the error-vs-reject curves (Figures 4 & 6 in the paper).
(2) To deploy face quality assessment in a real-life application, you need to define a quality threshold that divides input images into good and bad. Good images are accepted and bad images are rejected.

For both scenarios, scaling the quality range does not matter, since (1) it does not change the quality ranking on the dataset and (2) the quality threshold is scaled along with it. In both cases, the same results will be obtained. Scaling the quality values (in this case) only makes the numbers more comfortable for a human observer.

So, one point left: How to choose the quality threshold?
That depends on your model and your application scenario:

  1. Determine an FMR for your system (a low FMR for a more security-oriented application, a higher FMR if you focus on convenience).
  2. Take a (preferably large) face database and compute the error-vs-reject curve for your model.
  3. Choose the quality threshold that best suits your trade-off between reducing the recognition errors and rejecting fewer images (a minimal sketch of steps 2 and 3 is given below).
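
A minimal sketch of steps 2 and 3, assuming you already have per-image quality scores and genuine/impostor comparison scores for a labelled evaluation set; all names below are hypothetical and not part of this repository:

```python
import numpy as np

def fnmr_at_fmr(genuine, impostor, target_fmr):
    """FNMR at the decision threshold that yields the target FMR."""
    thr = np.quantile(impostor, 1.0 - target_fmr)  # accept comparisons with score >= thr
    return float(np.mean(np.asarray(genuine) < thr))

def error_vs_reject(pairs, qualities, target_fmr, reject_rates):
    """Error-vs-reject curve: FNMR at fixed FMR after discarding the
    lowest-quality images.

    pairs        : list of (img_a, img_b, is_genuine, comparison_score)
    qualities    : 1-D NumPy array of quality scores, indexed by image id
    reject_rates : fractions of lowest-quality images to reject
    """
    curve = []
    for r in reject_rates:
        q_thr = np.quantile(qualities, r)   # reject the lowest r fraction
        keep = qualities >= q_thr
        gen = [s for a, b, g, s in pairs if g and keep[a] and keep[b]]
        imp = [s for a, b, g, s in pairs if not g and keep[a] and keep[b]]
        curve.append((r, q_thr, fnmr_at_fmr(gen, imp, target_fmr)))
    return curve

# Step 3: pick the smallest quality threshold whose FNMR already meets your
# error budget, e.g. FNMR <= 1% at FMR = 0.1%:
# curve = error_vs_reject(pairs, qualities, target_fmr=1e-3,
#                         reject_rates=np.linspace(0.0, 0.5, 11))
# quality_threshold = next(q for r, q, e in curve if e <= 0.01)
```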

Best,
Philipp

@rocinant

rocinant commented Oct 19, 2020

Hi Philipp,

Thank you for your detailed responses, I appreciate it. While scaling the quality range indeed changes nothing except making the numbers easier to read, it also does not help in deciding the best threshold with respect to the classification error. You pointed out a possible way to obtain a good threshold. I assume the face database in step 2 should come from the real-life scenario, or at least be very close to it. Then, to produce the curve, one needs to label this database. That is not easy work; it is basically like preparing data for training a face recognition model. This means that even if you use the same face recognition model trained on a public dataset (say, the public ArcFace network trained on MS1M) and you produce the error-vs-reject plot, the resulting "best" threshold will not work once you deploy the method in your own real-life scenario. One needs to collect enough real-life images, label them, produce the plot, and derive a custom threshold. Basically, the method is closely coupled not only to the model but also to the dataset, which means you cannot obtain a best threshold from a fixed model alone. Hopefully I have made this point clear. The idea is good; it is just not cross-dataset, unlike a face recognition model (with a fixed model trained on a public dataset, you can basically say, "I think threshold xx is a good choice considering the trade-off between precision and recall", etc.).

Best,
Rocin

@pterhoer
Owner

Hi Rocin,

I am sorry for the late response. I did not notice your answer.

It is correct that the face data for step 2 should ideally come from real-life data in order to capture all the necessary kinds of variation. This is similar to face recognition systems in general.

The threshold determination process is the same as for any face recognition model. In face recognition, you need to find a suitable threshold for the comparison score to separate genuine from impostor pairs. For face quality assessment, you need a threshold for the quality score to decide whether a face image is used for identity verification or rejected. If (as you said) the face recognition threshold can be set with publicly available data, you can do the same for the quality threshold.

However, SER-FIQ has a big advantage here: it measures quality based on the embedding robustness. Face recognition models and face quality assessment models need to be evaluated under different conditions to see whether they perform reasonably in certain scenarios. For SER-FIQ, the embedding robustness is used as the quality indicator, and thus it is sufficient to evaluate the performance under different quality scenarios. The domain transition happens through the embedding robustness, so much less data is needed to find a suitable threshold than with other methods.

I hope this explanation helps you.

Best
Philipp
