
0.94 AUC not reproducible #12

Open

fbmoreira opened this issue May 22, 2019 · 13 comments
@fbmoreira

Following your README step by step and training several models, only one model has achieved 0.76 AUC on EyePACS so far. It's not clear to me whether the reported AUC of 0.94 used a single model or an ensemble... I'll try an ensemble, but most of the models I train end up around 0.52 AUC, which means they are likely not contributing much.

Are there any undocumented reasons why the code would not reproduce the paper's results?
Maybe a different seed for the distribution of images into the folders?
I used the --only_gradable flag; it's also not clear whether your paper used all images or only the gradable ones.

Thank you!

@fbmoreira changed the title from "reproduction not reproducible" to "0.94 AUC not reproducible" on May 22, 2019
@mikevoets
Owner

When it comes to AUC, we also experienced fluctuating results; for example, training sometimes stopped at 0.60 AUC. Please run it a couple more times to find better results. This code is exactly the code that produced the results in our paper, without any modifications. For the latest version of our paper we used all images.

The original paper proposes evaluating the linear average of predictions from an ensemble of 10 trained models. To create such an ensemble of trained models from the code in this repo, use the -lm parameter. To specify an ensemble, the model paths should be comma-separated or satisfy a regular expression. For example: -lm=./tmp/model-1,./tmp/model-2,./tmp/model-3
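
For illustration, here is a minimal sketch of the linear-averaging idea (placeholder data, not this repo's code; each row stands in for one model's predicted probabilities on the same test images):

```python
# Minimal sketch of linear-average ensembling with placeholder data.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)       # placeholder ground-truth labels
member_probs = rng.random(size=(10, 1000))   # 10 models x 1000 images (placeholder)

ensemble_probs = member_probs.mean(axis=0)   # linear average of the members' predictions

for i, probs in enumerate(member_probs, start=1):
    print(f"model {i}: AUC = {roc_auc_score(labels, probs):.3f}")
print(f"ensemble: AUC = {roc_auc_score(labels, ensemble_probs):.3f}")
```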

@fbmoreira
Author

Do you believe running eyepacs.sh --redistribute before training new models could result in more varied models and thus a better ensemble?

Thank you for your answer!

@mikevoets
Owner

mikevoets commented May 24, 2019

Yes, in combination with applying a different seed with the --seed parameter. Otherwise there will be no difference between the distributions.

In our study we did not redistribute, though. We only distributed once with the default seed in the script, and all our models and the ensemble were created by training on that image distribution.
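
To make the point about seeds concrete, here is a toy standalone illustration (scikit-learn with made-up file names, not eyepacs.sh itself): the split only changes when the seed does.

```python
# Toy illustration: an identical seed gives an identical split, a new seed gives a new one.
from sklearn.model_selection import train_test_split

images = [f"image_{i}.jpg" for i in range(10)]  # made-up file names

split_a, _ = train_test_split(images, test_size=0.2, random_state=42)
split_b, _ = train_test_split(images, test_size=0.2, random_state=42)
split_c, _ = train_test_split(images, test_size=0.2, random_state=7)

print(split_a == split_b)  # True: same seed, same distribution
print(split_a == split_c)  # almost certainly False: different seed, different distribution
```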

@fbmoreira
Author

Sorry for the slow reply, but the maximum AUC I got in my tests was 0.90, using evaluate.py on EyePACS with the default seed. I suppose I could eventually reach 0.94 or higher, but I think it will be better to move to an ensemble model to get better results sooner.

@Sayyam-Jain


Can you please explain how you achieved similar results?
Thanks

@fbmoreira
Author

I trained about 10 models with the default seed and ran evaluate.py on the ones that got the highest AUC during training cross-validation. One of the models, which had 0.77 AUC during cross-validation, gave me 0.90 AUC when running ./evaluate.py -e...

So I don't have any specific tip, just train more models until you get something good :P
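
If it helps, that workflow is basically a loop like the sketch below (hedged: it just reruns train.py and assumes nothing about its flags, so check the README for how each run saves its model):

```python
# Hedged sketch: rerun train.py several times to get differently initialized models.
# How each run names and saves its model depends on train.py's own options
# (not shown here); check the repo's README before using this.
import subprocess

NUM_MODELS = 10  # roughly what I did

for i in range(1, NUM_MODELS + 1):
    print(f"=== training run {i}/{NUM_MODELS} ===")
    subprocess.run(["python", "train.py"], check=True)
```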

@mikevoets
Owner

mikevoets commented Jun 5, 2019

@fbmoreira You are experiencing exactly what we experienced when we trained our models. Some are bad (around 0.70 AUC), most are ok-ish (~0.85 AUC), and some are better and exceed 0.90 AUC on evaluation. What we learned is that using the power of all these models together (both the bad and the better ones) in an ensemble always yields a better result.

@mikevoets reopened this Jun 5, 2019
@Sayyam-Jain

@fbmoreira Apologies for being dumb (still new to deep learning), but can you please explain what training different models means? Did you use different neural network architectures, or something else?

@mikevoets
Owner

@Sayyam-Jain When you run python train.py twice, you'll train two different models. That's because the weights of the network are initialized differently (randomly) every time. Because the starting points of the network's parameters differ, you essentially end up with a different network, and different results, every time. The neural network architecture is still the same. Hope this explains it well.

@fbmoreira NB: The random seed random.seed(432) is not intended to set a fixed initialization of weights here, and that's why you get different results every time, which is as intended. The random seed here is meant to set a fixed shuffling order of various data augmentations, here: https://github.com/mikevoets/jama16-retina-replication/blob/master/lib/dataset.py#L42.
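
As a toy standalone illustration of that distinction (not the repo's code): seeding Python's random module fixes the shuffling order, while values drawn from a different random source (here NumPy stands in for the weight initializer) still vary between runs.

```python
# Toy illustration: random.seed fixes the shuffle order, but it does not fix
# values drawn from a different RNG (NumPy stands in for the weight initializer).
import random
import numpy as np

for run in range(2):
    random.seed(432)                 # the seed used in lib/dataset.py
    order = list(range(5))
    random.shuffle(order)            # identical order in every run
    weights = np.random.randn(3)     # different values in every run
    print(f"run {run}: shuffle order {order}, 'weights' {np.round(weights, 3)}")
```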

@fbmoreira
Author

I didn't say anything about fixed initialization o_O I knew it was only for the dataset partition, since it was in the eyepacs.sh script, and had nothing to do with the network itself.
I did read your augmentation code; I found it curious that you did not perform vertical flips as well.

Reading your code, it was clear to me that you initialize the Inception-v3 model with ImageNet weights, and I assume the only (small) random weight initialization is in the top layer.

I think your results omitting --only-gradable were better because the noise introduced might have helped the network generalize better, hence your higher AUC. Another thing that might help in the future is to introduce Gaussian or salt-and-pepper noise as a form of augmentation, although hemorrhages and microaneurysms might be of a similar size and thus indistinguishable from the noise, so I am not sure.
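
For what it's worth, such an augmentation could look like this NumPy sketch (just an illustration, not something from this repo; the noise level is an arbitrary guess):

```python
# Hedged sketch of additive Gaussian noise augmentation (not part of this repo).
import numpy as np

def add_gaussian_noise(image, stddev=0.02, rng=None):
    """Add zero-mean Gaussian noise to an image with pixel values in [0, 1]."""
    if rng is None:
        rng = np.random.default_rng()
    noisy = image + rng.normal(0.0, stddev, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)

# Example on a dummy 299x299 RGB image (Inception-v3's input size).
dummy = np.random.rand(299, 299, 3)
augmented = add_gaussian_noise(dummy)
print(augmented.shape, float(augmented.min()), float(augmented.max()))
```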

@mikevoets
Owner

Ah ok, excuse my misunderstanding!

Regarding vertical flips in data augmentation: the objective of this project was to replicate the model and reproduce the results from that paper. Since the team behind that paper did not vertically flip images in their augmentation, we did not either.

Regarding your last point: it definitely seems likely that the noise from non-gradable images improves generalization and reduces the chance of overfitting. However, I am not sure how large the effect of training the network with wrong labels for those non-gradable images is. I still lean towards using only gradable images and applying random data augmentation to them, but during our project we did not test whether this actually leads to better results.

@slala2121

Using the ensemble of pretrained models, I get an AUC of 0.91 on the test dataset rather than 0.95. I followed the instructions for downloading the dataset. Should I be getting 0.95? Does something need to be changed?

@mikevoets reopened this Aug 19, 2020
@mikevoets
Owner

Hey @slala2121, just to confirm, did you download the models from https://figshare.com/articles/dataset/Trained_neural_network_models/8312183? Also, which TensorFlow and Python versions did you run with?
