Inconsistent results on DDSM testset #5

Open
taijizhao opened this issue Jan 25, 2018 · 6 comments

Comments

@taijizhao

Hello, Li,
First, congratulations on your excellent work, and thank you very much for sharing the code. It's really helpful for people like me who are just starting to work on mammography.
However, when I ran a simple test of your trained whole-image models on the DDSM test set, I got AUC scores much lower than the reported ones.
I used the CBIS-DDSM dataset, converted all images to PNG, and resized them to 1152x896. Then I used the official test set (Calc-Test and Mass-Test), treating "MALIGNANT" as positive and "BENIGN" and "BENIGN WITHOUT CALLBACK" as negative, which amounts to 649 images in total.
Then I used your notebook example_model_test.ipynb to test the 3 models you provided on the project homepage (ddsm_resnet50_s10_[512-512-1024]x2.h5, ddsm_vgg16_s10_512x1.h5, ddsm_vgg16_s10_[512-512-1024]x2_hybrid.h5). For the three models, I got AUCs of 0.69 (resnet), 0.75 (vgg), and 0.71 (hybrid), respectively, which are much lower than the reported values (0.86, 0.83, and 0.85, respectively).
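In rough pseudocode, the evaluation was something like the sketch below. The real preprocessing (rescaling, channel handling, featurewise normalization) follows example_model_test.ipynb; the load_png helper, the grayscale-to-3-channel stacking, and the output indexing here are only illustrative assumptions.

import numpy as np
import cv2
from keras.models import load_model
from sklearn.metrics import roc_auc_score

def load_png(path):
    # Read the 1152x896 PNG as grayscale and stack to 3 channels for the CNN.
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE).astype(np.float32)
    return np.stack([img] * 3, axis=-1)

def evaluate(model_path, png_paths, labels):
    # labels: 1 for MALIGNANT, 0 for BENIGN / BENIGN WITHOUT CALLBACK.
    model = load_model(model_path, compile=False)
    scores = []
    for p in png_paths:
        x = load_png(p)[None, ...]  # add a batch dimension
        pred = model.predict(x)[0]
        # Assume the last output is the malignancy score
        # (2-class softmax or a single sigmoid unit).
        scores.append(pred[-1])
    return roc_auc_score(labels, scores)

# e.g. evaluate("ddsm_vgg16_s10_512x1.h5", test_paths, test_labels)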
Indeed, I am using a different test set, since you mentioned in your paper that you randomly split the DDSM data into training and test sets. But in that case, my test set should partially overlap with your training set, which would result in better rather than worse performance.
Do you have an idea where this discrepancy in performance comes from? Some preprocessing step, for example? Or did I do something evidently wrong?
Thank you very much!
Best regards,

@lishen
Owner

lishen commented Jan 29, 2018

@taijizhao ,

The official test set was not available when I did the study, so it could not be part of the training set. It is actually more like another hold-out set.

Unfortunately, the scores are not as good as the ones on the test set I used. One thing you need to check is whether the contrast is automatically adjusted when you convert to PNG. I used "convert -auto-level" to perform the conversion.

I can also offer two reasons why the performance is worse:

  1. The official test set is intrinsically more difficult to classify (e.g., it contains more subtle cases) than the test set I used.
  2. The official test set contains cases whose distribution differs from that of the train set I used for model development.

If you want to improve the scores on the official test set, you should do your own training on the official training set.

As a side note (unpublished): I could achieve a single-model AUC of 0.85 on the official test set by combining the CC and MLO views. Maybe you can do even better.
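A minimal sketch of one simple way to combine the two views, by averaging the per-view malignancy scores for each breast. This is only an illustration, not necessarily the (unpublished) method referred to above:

import numpy as np
from sklearn.metrics import roc_auc_score

def two_view_auc(cc_scores, mlo_scores, labels):
    # cc_scores, mlo_scores: predicted malignancy scores per breast, one per view.
    # labels: ground-truth label per breast (1 = malignant, 0 = benign).
    combined = (np.asarray(cc_scores) + np.asarray(mlo_scores)) / 2.0
    return roc_auc_score(labels, combined)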

@xuranzhao711

@lishen
Thank you very much for your kind explanations! I'll try more.
Just one thing I want to make clear: when converting DICOM to PNG, SHOULD or SHOULD NOT the contrast be adjusted? Actually, I did the conversion with the dicom and opencv packages, something like this:

import dicom  # the legacy pydicom package (renamed to "pydicom" in later versions)
import cv2

# Read the DICOM file and take the raw pixel array (no contrast adjustment).
img = dicom.read_file(dicom_filename)
img = img.pixel_array
# cv2.resize expects (width, height), so (896, 1152) yields a 1152x896 image.
img = cv2.resize(img, (896, 1152), interpolation=cv2.INTER_CUBIC)
# Save the resized array as a PNG.
cv2.imwrite(save_path + png_save_name, img)

Done this way, I think the contrast is not adjusted?
Also, regarding your comment:

I used "convert -auto-level" to perform the conversion.

Which Python package is this "convert -auto-level" command from?
Thank you again!

@lishen
Owner

lishen commented Jan 30, 2018

@xuranzhao711
The way you converted it, no contrast adjustment was done. Doing contrast adjustment is not in itself right or wrong, but it's important to be consistent between model training and evaluation.

convert is simply a Linux command from ImageMagick. It is widely available. This is the command I used:

convert -auto-level {} -resize 896x1152! ../ConvertedPNGs/{/.}.png
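The {} and {/.} placeholders look like GNU parallel substitutions ({/.} is the input file name without directory or extension). For anyone who prefers to stay in Python, here is a minimal sketch that approximates the same pipeline, assuming -auto-level amounts to a min-max stretch of the pixel values; the function name is illustrative.

import numpy as np
import cv2
import pydicom  # modern name of the dicom package

def dicom_to_png_autolevel(dicom_path, png_path):
    img = pydicom.dcmread(dicom_path).pixel_array.astype(np.float32)
    # Min-max stretch to the full 16-bit range (approximates `convert -auto-level`).
    img = (img - img.min()) / max(float(img.max() - img.min()), 1e-8) * 65535.0
    img = img.astype(np.uint16)
    # cv2.resize takes (width, height); (896, 1152) matches `-resize 896x1152!`.
    img = cv2.resize(img, (896, 1152), interpolation=cv2.INTER_CUBIC)
    cv2.imwrite(png_path, img)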

@yueliukth

yueliukth commented Oct 10, 2018

Hi @lishen! I'm trying to reproduce your algorithm on the DDSM official train/val/test split, but I observed a relatively large AUC gap (around 8%) between the val and test sets. So far, the best val AUC I have achieved is 83%, while the test AUC of the same model is 75%. I was wondering if you observed the same, or at least a similar, AUC gap when you trained and tested on this new official split. Otherwise, I guess it may mean my model is somehow overfitting. Thank you in advance! Looking forward to your reply.

@lishen
Owner

lishen commented Oct 10, 2018

@irisliuyue, it's actually common to observe such a gap between the val and test sets. Sometimes the val AUC is even lower than the test AUC. It means the val and test sets have different distributions. Unfortunately, it's hard to make them more even. If you can afford the computation, simply do multiple splits or use (nested) cross-validation.
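A minimal sketch of the multiple-splits idea using stratified cross-validation; train_and_predict_fn is a placeholder for whatever training/prediction routine you use, so all names here are illustrative.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def cv_auc(image_paths, labels, train_and_predict_fn, n_splits=5, seed=42):
    # Estimate AUC over several stratified splits instead of a single fixed split.
    labels = np.asarray(labels)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    aucs = []
    for train_idx, test_idx in skf.split(image_paths, labels):
        # Placeholder: train on the train fold and return predicted
        # malignancy scores for the test fold.
        scores = train_and_predict_fn(train_idx, test_idx)
        aucs.append(roc_auc_score(labels[test_idx], scores))
    return np.mean(aucs), np.std(aucs)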

@yueliukth

yueliukth commented Oct 10, 2018

Hi @lishen, thanks for your reply!

a) What did you mean by multiple splits? Do you suggest mixing all train/val/test images and re-splitting, or just mixing train/val and splitting while leaving the official test set untouched?

I don't see a huge difference between my train/val AUC scores, so my model generalises well to the unseen validation set (but not to the unseen test set). So I guess that if I mix train and val and then do cross-validation, the test performance won't improve by a huge leap anyway.

b) And you are right, I did notice that sometimes the val AUC is even lower than the test AUC, but it's very rare. In general, from my observations, my test AUC is mostly around 8% lower than the validation AUC. One explanation could be that val and test differ systematically, for example by having different distributions, as you said. I did try to plot histograms showing the distribution of reading difficulty across the train/val/test datasets according to the BIRADS assessment, but the distributions are almost identical.

So I was wondering if you have any advice on how to show that the datasets actually have different distributions? Thanks!
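For what it's worth, here is a minimal sketch of one way to compare a categorical attribute such as the BIRADS assessment between two splits; the column name "assessment" is an assumption about how the metadata is stored.

import pandas as pd
from scipy.stats import chi2_contingency

def compare_birads(val_df, test_df, col="assessment"):
    # Chi-square test on the BIRADS category counts of the two splits.
    counts = pd.DataFrame({
        "val": val_df[col].value_counts(),
        "test": test_df[col].value_counts(),
    }).fillna(0)
    chi2, p_value, dof, _ = chi2_contingency(counts.T.values)
    return chi2, p_value

A non-significant result on BIRADS alone does not rule out differences in other attributes (breast density, lesion size, image statistics), so the same kind of test can be repeated on several variables.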
