About the quality score ranges #21
Comments
Hi li-js, well summarized and very interesting questions!
Interesting questions! Let me know if you have other ideas on that :) Best
I don't understand why everyone here keeps saying that only the ranking is important. Of course the absolute score matters far more. If I get a bunch of images, why would I care about their rankings among themselves, with precision down to two decimal places? The only thing I care about is an absolute threshold that can be used to filter out unqualified images. You can rank the images, but you cannot filter them out based on ranking alone. Suppose one high-quality dataset gets scores in the range [0.90, 0.91] and another, bad dataset gets scores in [0.40, 0.41], both from the same model. If you then normalize each range based on rankings, both will end up spanning roughly [0, 1]. Filter with some threshold, say 0.5, and you will discard about half of the images from the good dataset and half from the bad one, and you may even be under the illusion that you have kept the better-quality images. For a given model, the absolute score is much more important than the relative ranking, because in the end you cannot just say "I will discard the lowest-ranked 10% of images in the batch" no matter how good their overall quality is. I think the author failed to address this issue, or maybe I am not seeing it clearly.
Now, with the proposed method, I am in a dilemma: all my images' scores fall within [0.88, 0.89], yet I can clearly see that some of them are suitable for face recognition and some are not. How can I filter out the unqualified ones? Normalize the scores to [0, 1] and then discard the lowest 10%? Or should it be 20%? How do I decide this percentage, or the threshold on the normalized scores? As I said, you cannot apply this in a real face recognition system with no absolute threshold, only some barely discriminative relative scores. Sometimes the differences are so small that you cannot even tell whether they are meaningful or just tiny numerical artifacts of the arithmetic.
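The point above can be sketched numerically. A minimal illustration with made-up numbers (not from any real model), showing how per-dataset min-max normalization erases the absolute quality difference between two datasets:

```python
# Illustrative only: a "good" dataset with scores in a narrow high band and
# a "bad" dataset in a narrow low band, as in the comment above.
good = [0.90, 0.901, 0.904, 0.908, 0.91]   # high-quality dataset
bad  = [0.40, 0.401, 0.404, 0.408, 0.41]   # low-quality dataset

def minmax(scores):
    """Min-max normalize a list of scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

# After per-dataset normalization both span [0, 1], so a fixed threshold
# of 0.5 rejects the same fraction of EACH dataset, regardless of the
# absolute quality gap between them.
kept_good = [s for s in minmax(good) if s >= 0.5]
kept_bad  = [s for s in minmax(bad)  if s >= 0.5]
print(len(kept_good), len(kept_bad))  # same count kept from both datasets
```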
Hi rocinant, thank you for your comment. I think you point to an important distinction between research experiments and real-life deployment. (1) To reproduce the research results, the quality ranking over the whole dataset is what matters, since it allows you to compute the recognition performance at every quality threshold. This yields the error-vs-reject curves (Figures 4 & 6 in the paper). In both scenarios, scaling the quality range does not matter, since (1) it does not change the quality ranking on the dataset and (2) the quality threshold is adapted accordingly. Either way, the same results are obtained. Scaling the quality values (in this case) only helps the human observer feel more comfortable with the numbers. So, one point is left: how to choose the quality threshold?
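The error-vs-reject evaluation mentioned above can be sketched roughly as follows. This is an illustration, not the paper's evaluation code: the data is synthetic, with errors deliberately concentrated at low quality to mimic a useful quality measure.

```python
# Sketch of an error-vs-reject curve (cf. Figures 4 & 6 in the paper):
# sort samples by quality score, reject the lowest-quality fraction, and
# recompute the error rate on the remainder.
n = 1000
quality = [i / n for i in range(n)]                      # ascending quality scores
is_error = [i % 10 == 0 and i < 500 for i in range(n)]   # 50 synthetic errors, all at low quality

order = sorted(range(n), key=lambda i: quality[i])       # indices, ascending quality
curve = {}
for reject_rate in (0.0, 0.1, 0.3, 0.5):
    kept = order[int(n * reject_rate):]                  # drop lowest-quality samples
    curve[reject_rate] = sum(is_error[i] for i in kept) / len(kept)
    print(f"reject {reject_rate:.0%} -> error {curve[reject_rate]:.4f}")
```

With a quality measure that correlates with recognition errors, the remaining error rate falls as the rejection rate grows, which is exactly the behavior the curves in the paper visualize.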
Best,
Hi Philipp, thank you for your detailed responses, I appreciate it. While scaling the quality range doesn't change anything except making the scores easier to interpret, it doesn't help in deciding the best threshold for the classification error. You pointed out a possible way to obtain the best threshold. I would bet the face database in step 2 has to come from a real-life scenario, or at least be very close to one. Then, to draw the curve, one needs to label this database. That is not easy work; it is basically like preparing data for training a face recognition model. This means that even if you use the same face recognition model trained on a public dataset, say a public ArcFace network trained on MS1M, and you obtain this error-vs-reject plot, the best threshold from that plot won't carry over when you deploy the method in your own real-life scenario. One needs to collect enough real-life pictures, label them, make the plot oneself, and derive a custom threshold. Basically, this method is tightly coupled not only to the model but also to the dataset, which means you cannot obtain a best threshold from the model alone. Hopefully I have made this point clear. The idea is good; it is just not cross-dataset, unlike a face recognition model (with a fixed model trained on a public dataset, you can basically say "I think threshold xx is a good choice considering the trade-off between precision and recall", etc.). Best,
Hi Rocin, I am sorry for the late response; I did not notice your answer. It is correct that the face data in step 2 is best taken from real-life data, to capture all the necessary kinds of variation. This is similar to face recognition systems in general. The threshold determination process is the same as for any face recognition model. In face recognition, you need to find a suitable threshold on the comparison score to separate genuine from impostor pairs. For face quality assessment, you need a threshold on the quality score to decide whether a face image is used for identity verification or rejected. If (as you said) the face recognition threshold can be set with publicly available data, you can do the same for the quality threshold. However, SER-FIQ has a big advantage here: it measures quality based on embedding robustness. Face recognition models and face quality assessment models need to be evaluated under different conditions to see whether they perform reasonably in certain scenarios. For SER-FIQ, the embedding robustness is used as the quality indicator, and thus it is sufficient to evaluate the performance under different quality scenarios. The domain transition happens through the embedding robustness, and thus much less data is needed to achieve a similar threshold suitability than with other methods. I hope this explanation helps you. Best
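As a rough illustration of the quality-from-embedding-robustness idea described above, here is a minimal sketch. The `ser_fiq_score` helper, the sigmoid mapping, and the randomly perturbed "embeddings" are assumptions for illustration, not the authors' implementation; in the real method the stochastic embeddings come from repeated forward passes of the recognition network with random dropout.

```python
import numpy as np

def ser_fiq_score(stochastic_embeddings: np.ndarray) -> float:
    """Quality from embedding robustness: tightly clustered stochastic
    embeddings of the same face => high quality (sketch, not official code)."""
    # L2-normalize each embedding, as is common for face embeddings.
    X = stochastic_embeddings / np.linalg.norm(stochastic_embeddings, axis=1, keepdims=True)
    m = len(X)
    # Mean pairwise Euclidean distance between the stochastic embeddings.
    dists = [np.linalg.norm(X[i] - X[j]) for i in range(m) for j in range(i + 1, m)]
    # Map distance to a (0, 1) quality score: 2 * sigmoid(-mean_dist).
    return float(2.0 / (1.0 + np.exp(np.mean(dists))))

# Simulated stochastic embeddings: small perturbations = robust = high quality.
rng = np.random.default_rng(0)
base = rng.normal(size=128)
stable   = base + 0.01 * rng.normal(size=(10, 128))  # robust embedding
unstable = base + 0.50 * rng.normal(size=(10, 128))  # noisy embedding
print(ser_fiq_score(stable), ser_fiq_score(unstable))
```

The robust case scores near 1, the noisy case clearly lower, which is the ranking behavior discussed throughout this thread.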
Nice work in this repo.
I have read through the other closed issues, and it seems normal that two models produce very different quality score ranges under the SER-FIQ method. Normalization is recommended to map the numbers to the same range simply to make them easier to read, since only the ranking is important while the actual values are not.
I do have some thoughts here. Suppose two models are observed to have different quality values and range widths on the same dataset. E.g. Model 1 has range [0.2, 0.6], i.e. 0.4 ± 0.2, and Model 2 has range [0.8, 0.82], i.e. 0.81 ± 0.01.
Any thoughts or discussions are welcome.
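For concreteness, a minimal sketch of the min-max normalization discussed in this thread, using illustrative score lists based on the ranges above: it preserves each model's ranking but erases the difference in range width between the two models.

```python
model1 = [0.2, 0.35, 0.5, 0.6]      # wide range   [0.2, 0.6]
model2 = [0.80, 0.81, 0.815, 0.82]  # narrow range [0.8, 0.82]

def minmax(scores):
    """Min-max normalize a list of scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def ranks(scores):
    """Indices of the scores in ascending order."""
    return sorted(range(len(scores)), key=lambda i: scores[i])

# Normalization does not change the ordering within each model...
assert ranks(model1) == ranks(minmax(model1))
assert ranks(model2) == ranks(minmax(model2))
# ...but afterwards both models span the same [0, 1] interval, so the
# information that Model 2's scores were far less spread out is lost.
print(minmax(model1), minmax(model2))
```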