Some questions about the dataset #34

Closed
Docurdt opened this issue May 19, 2019 · 2 comments
Docurdt commented May 19, 2019

Dear Nicolas,

Thanks for your brilliant work, high-quality code, and detailed documentation; they have really helped me a lot. I still have a few questions, though, and I hope to hear from you. Thanks!
1: The images from TCGA are not as clean as we expected: some are low quality or have been marked with pen by the doctors. Do these “dirty” images/tiles hurt model performance? Did you filter out the dirty tiles, either manually or automatically?
2: The tile labels are not always correct. As far as I know, tumour images always contain some normal tissue, and the so-called normal images sometimes contain cancerous regions. Did you apply any special processing to these wrongly labelled tiles?
3: How are your AUC values calculated? Are they based on the predicted results for the tiles?
4: Have you evaluated the influence of the tile size?
5: How well do you think your methodology would perform on other types of cancer? Could you share some insights on this?

Thank you very much! Looking forward to hearing from you!
Best wishes!
D

@ncoudray
Owner

Hi,

  1. They were not filtered. We haven't investigated how this affects the results, but I doubt it would have much effect given the low proportion of marked tiles.
  2. You can have a look at our paper on lung cancer. For cohorts where the tumor content is low, this is true. But for the TCGA dataset, the tumor content is really high, so the proportion of falsely labelled tiles is really low. Unless you train for too long and over-fit, the AUC is good. It could probably have been even better if we had carefully manually selected regions with high tumor content.
  3. Yes. Per tile, then aggregated per slide.
  4. No, only the magnification (which is somewhat equivalent).
  5. We have another paper coming on bioRxiv, which you can check.

Best,
N.
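
To make "per tile, then aggregated per slide" concrete, here is a minimal sketch of how a tile-level and a slide-level AUC could be computed. It is only an illustration, not the repository's code: the function names, the toy data, and the choice of the mean tile probability as the aggregation rule are all assumptions.

```python
from collections import defaultdict


def aggregate_per_slide(tile_preds):
    """Average tile probabilities per slide.

    tile_preds: iterable of (slide_id, tumor_probability), one entry per tile.
    Mean aggregation is an assumption made for this sketch.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for slide_id, prob in tile_preds:
        sums[slide_id] += prob
        counts[slide_id] += 1
    return {slide_id: sums[slide_id] / counts[slide_id] for slide_id in sums}


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    correct = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return correct / (len(pos) * len(neg))


# Hypothetical toy data: every tile inherits the label of its slide.
tile_preds = [
    ("slide_A", 0.9), ("slide_A", 0.7), ("slide_A", 0.2),
    ("slide_B", 0.1), ("slide_B", 0.3),
    ("slide_C", 0.8), ("slide_C", 0.6),
    ("slide_D", 0.4), ("slide_D", 0.65),
]
slide_labels = {"slide_A": 1, "slide_B": 0, "slide_C": 1, "slide_D": 0}

# Tile-level AUC: each tile is scored against its slide's label.
tile_auc = auc(
    [prob for _, prob in tile_preds],
    [slide_labels[slide_id] for slide_id, _ in tile_preds],
)

# Slide-level AUC: one aggregated score per slide.
slide_scores = aggregate_per_slide(tile_preds)
slides = sorted(slide_scores)
slide_auc = auc(
    [slide_scores[s] for s in slides],
    [slide_labels[s] for s in slides],
)

print(f"tile-level AUC:  {tile_auc:.2f}")
print(f"slide-level AUC: {slide_auc:.2f}")
```

On this toy data the tile-level AUC is 0.80 and the slide-level AUC is 1.00: the low-scoring tile in `slide_A` stands in for a normal region inside a tumour slide, and averaging over the slide absorbs it. That is one way to see why a small proportion of mislabelled tiles need not hurt the slide-level result.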

Docurdt commented May 21, 2019

Hi Nicolas, I really appreciate all your kind help!

All the best,
D
