Inconsistent results on DDSM testset #5
The official test set was not available when I did the study, so it could not have been part of the train set; it is effectively another hold-out set. Unfortunately, the scores on it are not as good as those on the test set I used. One thing you need to check is whether the contrast is automatically adjusted when you convert to PNG. I used "convert -auto-level" to perform the conversion. I also offer two reasons why the performance is worse:
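The `convert` tool mentioned here is ImageMagick's command-line converter, and `-auto-level` linearly stretches an image's intensities to span the full dynamic range. A minimal NumPy sketch of the same idea, for an 8-bit grayscale array (an approximation only; ImageMagick's exact handling of bit depth and channels may differ):

```python
import numpy as np

def auto_level(img: np.ndarray) -> np.ndarray:
    """Linearly stretch intensities to [0, 255], roughly mimicking
    ImageMagick's -auto-level on an 8-bit grayscale image."""
    img = img.astype(np.float64)
    lo, hi = img.min(), img.max()
    if hi == lo:  # flat image: nothing to stretch
        return np.zeros_like(img, dtype=np.uint8)
    stretched = (img - lo) / (hi - lo) * 255.0
    return np.round(stretched).astype(np.uint8)

# Example: a low-contrast image occupying only the range [100, 150]
img = np.array([[100, 125], [150, 110]], dtype=np.uint8)
out = auto_level(img)
print(out.min(), out.max())  # 0 255
```

A converted image that skips this stretching step will look much darker or flatter to the network than the images the models were trained on.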
If you want to improve the scores on the test set, you should do your own training on the train set. As a side note (unpublished): I could achieve a single-model AUC of 0.85 on the official test set when combining the CC and MLO views. Maybe you can do even better.
@lishen
If I convert the images this way, I think the contrast is not adjusted? Which Python package is this "convert -auto-level" command from?
@xuranzhao711
Hi @lishen! I'm trying to reproduce your algorithm on the DDSM official train/val/test split, but I observed a relatively large AUC gap (around 8%) between the val and test sets. So far, the best val AUC I achieved is 83%, and the test AUC of the same model is 75%. I was wondering if you observed the same, or at least a similar, AUC gap when you trained and tested on this new official split. Otherwise, I guess it may mean my model is somehow overfitting. Thank you in advance! Looking forward to your reply.
@irisliuyue, it's actually common to observe such a gap between the val and test sets. Sometimes the val AUC is even lower than the test AUC. It means the val and test sets have different distributions, and unfortunately it's hard to make them more even. If you can afford the computation, simply do multiple splits or use (nested) cross-validation.
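The "multiple splits" idea can be sketched as follows: evaluate AUC over several random splits of the data to see how much the score fluctuates from split to split. Everything below (labels, scores, split sizes) is synthetic stand-in data, and the pairwise AUC is a generic rank-based estimate, not this repo's evaluation code:

```python
import numpy as np

def auc(labels: np.ndarray, scores: np.ndarray) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked
    correctly; ties count half (Mann-Whitney formulation)."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-ins for per-image labels and model scores
labels = rng.integers(0, 2, size=n)
scores = labels * 0.5 + rng.normal(0.0, 0.5, size=n)  # imperfect classifier

aucs = []
for _ in range(20):                 # 20 random evaluation splits
    half = rng.permutation(n)[: n // 2]
    aucs.append(auc(labels[half], scores[half]))
print(f"AUC mean={np.mean(aucs):.3f} sd={np.std(aucs):.3f}")
```

The spread of the per-split AUCs gives a sense of how much of an 8% val/test gap could be plain sampling noise versus a real distribution shift.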
Hi @lishen, thanks for your reply!
a) What did you mean by multiple splits? Do you suggest mixing all train/val/test images and splitting, or mixing only train/val while leaving the official test set untouched? I don't see a huge difference between my train and val AUC scores, so my model generalises well to the unseen validation set (but not to the unseen test set). So I guess that if I mix train and val and then do cross-validation, the test performance won't take a huge leap anyway.
b) And you are right, I did notice that sometimes the val AUC is even lower than the test AUC, but it's very rare; in general, my test AUC is around 8% lower than validation. One explanation could be that val and test differ systematically, for example by having different distributions as you said. I did plot histograms of reading difficulty across the train/val/test sets according to BI-RADS assessment, but their distributions are almost identical. So I was wondering if you have any advice on how to demonstrate different distributions across datasets? Thanks!
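One generic way to probe for systematically different distributions, beyond categorical BI-RADS histograms, is a two-sample Kolmogorov-Smirnov statistic on any continuous per-image quantity (e.g. mean intensity, breast area, or the model's own scores). A self-contained sketch with synthetic stand-in data; `scipy.stats.ks_2samp` would additionally give a p-value:

```python
import numpy as np

def ks_statistic(a: np.ndarray, b: np.ndarray) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the two empirical CDFs, evaluated at every sample point."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return float(np.abs(cdf_a - cdf_b).max())

rng = np.random.default_rng(42)
val = rng.normal(0.0, 1.0, 500)    # stand-in feature values on the val set
test = rng.normal(0.3, 1.0, 500)   # same feature on test, slightly shifted
print(f"KS(val, test) = {ks_statistic(val, test):.3f}")
```

A large statistic relative to its critical value (roughly 1.36 * sqrt((m + n) / (m * n)) at the 5% level) would support the "different distributions" explanation.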
Hello, Li,
![qq 20180125204102](https://user-images.githubusercontent.com/7842342/35388875-86d7da52-0210-11e8-8ba6-55f3034bf276.png)
![qq 20180125204124](https://user-images.githubusercontent.com/7842342/35389226-df4f183e-0211-11e8-9936-62e5f813f005.png)
![qq 20180125204212](https://user-images.githubusercontent.com/7842342/35388891-930eacec-0210-11e8-976b-1e3fa8ef0548.png)
![qq 20180125204245](https://user-images.githubusercontent.com/7842342/35388894-9466a7d4-0210-11e8-8e55-3c30ee3bbde9.png)
First, congratulations on your excellent work, and thank you a lot for sharing the code. It's really helpful for people like me who are starting to work on mammography.
But when I ran a simple test of your trained whole-image models on the DDSM test set, I got AUC scores much lower than reported.
I used the CBIS-DDSM dataset, converted all images to PNG, and resized them to 1152×896. Then I took the official test set (Calc-Test and Mass-Test) and labeled "MALIGNANT" as positive and "BENIGN" and "BENIGN WITHOUT CALLBACK" as negative, which amounts to 649 images in total.
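The labeling step above can be sketched as a small mapping. The function name is hypothetical and the string variants beyond those quoted above are assumptions about how the CBIS-DDSM metadata spells them, not this repo's actual loader:

```python
# Hypothetical helper: map a CBIS-DDSM pathology string to a binary label.
def pathology_to_label(pathology: str) -> int:
    pathology = pathology.strip().upper()
    if pathology == "MALIGNANT":
        return 1
    if pathology in ("BENIGN", "BENIGN WITHOUT CALLBACK",
                     "BENIGN_WITHOUT_CALLBACK"):  # spelling assumed
        return 0
    raise ValueError(f"unexpected pathology: {pathology!r}")

labels = [pathology_to_label(p) for p in
          ["MALIGNANT", "BENIGN", "BENIGN WITHOUT CALLBACK"]]
print(labels)  # [1, 0, 0]
```

Raising on unexpected strings (rather than silently defaulting to 0) guards against a mislabeled test set quietly deflating the AUC.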
Then I used your notebook example_model_test.ipynb to test the three models you provide on the project homepage (ddsm_resnet50_s10_[512-512-1024]x2.h5, ddsm_vgg16_s10_512x1.h5, and ddsm_vgg16_s10_[512-512-1024]x2_hybrid.h5). For the three models, I got AUCs of 0.69 (resnet), 0.75 (vgg), and 0.71 (hybrid), respectively, which are much lower than the reported 0.86, 0.83, and 0.85.
Indeed, I am using a different test set, since you mentioned in your paper that you randomly split the DDSM data for training and testing. But in that case, my test set should partly overlap with your training set, resulting in better rather than worse performance.
Do you have an idea where this discrepancy in performance comes from? Some preprocessing, for example? Or did I do something evidently wrong?
Thank you very much!
Best regards,