
0.94 AUC not reproducible #12

Open

fbmoreira opened this issue May 22, 2019 · 13 comments
@fbmoreira

Following your README step by step and training several models, only one model has achieved 0.76 AUC on EyePACS so far. It's not clear to me whether the reported AUC of 0.94 used a single model or an ensemble... I'll try an ensemble, but most of the models I train end up around 0.52 AUC, which means they are likely not contributing much.

Are there any undocumented reasons why the code would not reproduce the paper's results?
Maybe a different seed for the distribution of images into the folders?
I used the --only_gradable flag; it's also not clear whether your paper used all images or only the gradable ones.

Thank you!

@fbmoreira changed the title from "reproduction not reproducible" to "0.94 AUC not reproducible" on May 22, 2019
@mikevoets
Owner

When it comes to AUC, we also experienced fluctuating results; for example, training sometimes stopped at 0.60 AUC. Please run it a couple more times to find better results. This code is exactly the code that produced the results in our paper, without any modifications. For the latest version of our paper we used all images.

The original paper proposes evaluating the linear average of predictions from an ensemble of 10 trained models. To create such an ensemble of trained models from the code in this repo, use the -lm parameter. To specify an ensemble, the model paths should be comma-separated or satisfy a regular expression. For example: -lm=./tmp/model-1,./tmp/model-2,./tmp/model-3
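
For illustration, here is a minimal sketch of the linear-averaging idea (placeholder data, not this repo's code; each row stands in for one model's predicted probabilities on the same test images):

```python
# Minimal sketch of linear-average ensembling with placeholder data.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)       # placeholder ground-truth labels
member_probs = rng.random(size=(10, 1000))   # 10 models x 1000 images (placeholder)

ensemble_probs = member_probs.mean(axis=0)   # linear average of the members' predictions

for i, probs in enumerate(member_probs, start=1):
    print(f"model {i}: AUC = {roc_auc_score(labels, probs):.3f}")
print(f"ensemble: AUC = {roc_auc_score(labels, ensemble_probs):.3f}")
```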

@fbmoreira
Author

Do you believe running eyepacs.sh --redistribute before training new models could result in more varied models and thus a better ensemble?

Thank you for your answer!

@mikevoets
Owner

mikevoets commented May 24, 2019

Yes, in combination with applying a different seed with the --seed parameter. Otherwise there will be no difference between the distributions.

In our study we did not redistribute, though. We only distributed once with the default seed in the script, and all our models and the ensemble were created by training on that image distribution.
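
To make the point about seeds concrete, here is a toy standalone illustration (scikit-learn with made-up file names, not eyepacs.sh itself): the split only changes when the seed does.

```python
# Toy illustration: an identical seed gives an identical split, a new seed gives a new one.
from sklearn.model_selection import train_test_split

images = [f"image_{i}.jpg" for i in range(10)]  # made-up file names

split_a, _ = train_test_split(images, test_size=0.2, random_state=42)
split_b, _ = train_test_split(images, test_size=0.2, random_state=42)
split_c, _ = train_test_split(images, test_size=0.2, random_state=7)

print(split_a == split_b)  # True: same seed, same distribution
print(split_a == split_c)  # almost certainly False: different seed, different distribution
```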

@fbmoreira
Author

Sorry for the slow reply, but the maximum AUC I got in my tests was 0.90, using evaluate.py on EyePACS with the default seed. I suppose I could eventually reach 0.94 or higher, but I think it will be better to move to an ensemble model to get better results sooner.

@Sayyam-Jain


Can you please explain how you achieved similar results?
Thanks

@fbmoreira
Author

I trained about 10 models with the default seed and ran evaluate.py on the ones that got the highest AUC during training cross-validation. One of the models, which had 0.77 AUC during cross-validation, gave me 0.90 AUC when running ./evaluate.py -e...

So I don't have any specific tip, just train more models until you get something good :P
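
If it helps, that workflow is basically a loop like the sketch below (hedged: it just reruns train.py and assumes nothing about its flags, so check the README for how each run saves its model):

```python
# Hedged sketch: rerun train.py several times to get differently initialized models.
# How each run names and saves its model depends on train.py's own options
# (not shown here); check the repo's README before using this.
import subprocess

NUM_MODELS = 10  # roughly what I did

for i in range(1, NUM_MODELS + 1):
    print(f"=== training run {i}/{NUM_MODELS} ===")
    subprocess.run(["python", "train.py"], check=True)
```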

@mikevoets
Owner

mikevoets commented Jun 5, 2019

@fbmoreira You are experiencing exactly what we experienced when we trained our models. Some are bad (around 0.70 AUC), most are ok-ish (~0.85 AUC), and some are better and exceed 0.90 AUC on evaluation. What we learned is that using the power of all these models together (both the bad and the better ones) in an ensemble always yields a better result.

@mikevoets reopened this Jun 5, 2019
@Sayyam-Jain

@fbmoreira Apologies for being dumb (still new to deep learning), but can you please explain what training different models means? Did you use different neural network architectures, or something else?

@mikevoets
Owner

@Sayyam-Jain When you run python train.py twice, you'll train two different models. That's because the weights of the network are initialized differently (randomly) every time. Because the starting points of the network's parameters differ, you essentially end up with a different network, and different results, every time. The neural network architecture is still the same. Hope this explains it well.

@fbmoreira NB: The random seed random.seed(432) is not intended to set a fixed initialization of weights here, and that's why you get different results every time, which is as intended. The random seed here is meant to set a fixed shuffling order of various data augmentations, here: https://github.com/mikevoets/jama16-retina-replication/blob/master/lib/dataset.py#L42.
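
As a toy standalone illustration of that distinction (not the repo's code): seeding Python's random module fixes the shuffling order, while values drawn from a different random source (here NumPy stands in for the weight initializer) still vary between runs.

```python
# Toy illustration: random.seed fixes the shuffle order, but it does not fix
# values drawn from a different RNG (NumPy stands in for the weight initializer).
import random
import numpy as np

for run in range(2):
    random.seed(432)                 # the seed used in lib/dataset.py
    order = list(range(5))
    random.shuffle(order)            # identical order in every run
    weights = np.random.randn(3)     # different values in every run
    print(f"run {run}: shuffle order {order}, 'weights' {np.round(weights, 3)}")
```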

@fbmoreira
Author

I didn't say anything about fixed initialization o_O I knew it was only for the dataset partition, since it was in the eyepacs.sh script, and had nothing to do with the network itself.
I did read your augmentation code; I found it curious that you did not perform vertical flips as well.

Reading your code, it was clear to me that you initialize the Inception-v3 model with ImageNet weights, and I assume the only (small) random weight initialization is in the top layer.

I think your results omitting --only-gradable were better because the noise introduced might have helped the network generalize better, hence your higher AUC. Another thing that might help in the future is to introduce Gaussian or salt-and-pepper noise as a form of augmentation, although hemorrhages and microaneurysms might be of a similar size and thus indistinguishable from the noise, so I am not sure.
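
For what it's worth, such an augmentation could look like this NumPy sketch (just an illustration, not something from this repo; the noise level is an arbitrary guess):

```python
# Hedged sketch of additive Gaussian noise augmentation (not part of this repo).
import numpy as np

def add_gaussian_noise(image, stddev=0.02, rng=None):
    """Add zero-mean Gaussian noise to an image with pixel values in [0, 1]."""
    if rng is None:
        rng = np.random.default_rng()
    noisy = image + rng.normal(0.0, stddev, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)

# Example on a dummy 299x299 RGB image (Inception-v3's input size).
dummy = np.random.rand(299, 299, 3)
augmented = add_gaussian_noise(dummy)
print(augmented.shape, float(augmented.min()), float(augmented.max()))
```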

@mikevoets
Owner

Ah ok, excuse my misunderstanding!

Regarding vertical flips in data augmentation: the objective of this project was to replicate the model and reproduce the results from that paper. Since the team behind that paper did not vertically flip images in their augmentation, we did not either.

Regarding your last point: it definitely seems likely that the noise from non-gradable images improves generalization and reduces the chance of overfitting. However, I am not sure how large the effect of training the network with wrong labels for those non-gradable images is. I still lean towards using only gradable images and applying random data augmentation to them, but during our project we did not test whether this actually leads to better results.

@slala2121

Using the ensemble of pretrained models, I get an AUC of 0.91 on the test dataset rather than 0.95. I followed the instructions for downloading the dataset. Should I be getting 0.95? Does something need to be changed?

@mikevoets reopened this Aug 19, 2020
@mikevoets
Owner

Hey @slala2121, just to confirm, did you download the models from https://figshare.com/articles/dataset/Trained_neural_network_models/8312183? Also, which TensorFlow and Python versions did you run with?
