
Concerns about Dataset Usage and Discrepancies in Experimental Results #50

Open
epsilontl opened this issue Jun 13, 2024 · 4 comments

@epsilontl

Dear Author,

I hope this message finds you well.

I have some concerns regarding the experimental setup and results presented in your paper, which I hope you can clarify.

1. Dataset Usage Issue: According to your code, the entire dataset, including the test set, is used for training. Isn't this setup problematic?

2. Discrepancies in Experimental Results: I re-trained the model after separating the training and test sets according to standard practice (see the evaluation sketch after this list). The mIoU results I obtained are as follows:

  • ramen: 48.0
  • teatime: 47.8
  • waldo_kitchen: 23.4
  • figurines: 31.9

These results significantly differ from those reported in the paper (ramen: 51.2, teatime: 65.1, waldo_kitchen: 44.5, figurines: 44.7).

3. Acknowledgment of Test Set Inclusion: Do you acknowledge that the test set was included in the training process? If so, do you believe the paper should be retracted under these circumstances?
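For concreteness, by "standard practice" I mean something like the following: hold out a subset of camera views that is never used for optimization, and report mIoU only on renderings of those held-out views. The snippet below is just a minimal sketch of that protocol, not code from this repository; `split_views`, `mean_iou`, and the every-8th-view convention are my own hypothetical choices.

```python
import numpy as np

def split_views(all_image_ids, hold_every=8):
    """Hold out every `hold_every`-th view for testing (llffhold-style convention).

    The held-out views never appear in training; returns (train_ids, test_ids).
    """
    test_ids = list(all_image_ids)[::hold_every]
    held_out = set(test_ids)
    train_ids = [i for i in all_image_ids if i not in held_out]
    return train_ids, test_ids

def mean_iou(pred, gt, num_classes, ignore_index=255):
    """mIoU between two integer label maps of identical shape (H, W),
    averaged over the classes that occur in either map."""
    valid = gt != ignore_index
    ious = []
    for c in range(num_classes):
        pred_c = (pred == c) & valid
        gt_c = (gt == c) & valid
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:
            continue  # class absent in both prediction and GT; skip it
        inter = np.logical_and(pred_c, gt_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0
```

Under this kind of protocol, only the held-out views contribute to the reported per-scene numbers.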

I look forward to your response.

Thank you!

@sangminkim-99

Hi @epsilontl,

In the 3D Open-Vocabulary Semantic Segmentation (3D-OVS) task, it seems common to use all images as the training set. Consequently, they did not report how closely the rendered RGB matches the ground truth; instead, they focused on how well the trained model can segment the parts of the scene. Since training does not use any information from the ground-truth segmentation masks, it is valid to use all images as the training set.

Do you think we still need to separate the training and testing datasets?

@epsilontl
Author

Hi @sangminkim-99,

Thank you for your insightful response regarding the use of all images as the training dataset in the 3D Open-Vocabulary Semantic Segmentation (3D-OVS) task. I understand that you mentioned it is common practice to use the entire dataset for training in this context. However, I would like to clarify my understanding further.

When you say "common," could you elaborate on which sources or practices this is based on? As far as I know, the standard approach in similar tasks is to separate the training and testing sets so that performance is evaluated on unseen data. For instance, among NeRF and 3DGS models used for OVS, the benchmark comparisons in LangSplat (against FFD, 3D-OVS, and LERF), as well as the more recent LEGaussian presented at CVPR 2024, all follow this principle by separating the training and testing sets and evaluating on novel viewpoints.

I would greatly appreciate any further insights you can provide on this matter.

@sangminkim-99

Hi @epsilontl, sorry for the late reply.

I referenced 3D-OVS from NeurIPS 2023.
You can see that they are using all images as the training set here.

I think the word "common" was a bit too strong, but I still think it is okay to use all images for training because training never sees any clue about the GT segmentation masks.

@epsilontl
Author

Hi @sangminkim-99, thank you for your response, and no worries about the delay.

Regarding 3D-OVS from NeurIPS 2023, they actually do not use all images as the training set. If you look further down in the code, you'll notice that all configurations set clip_input = 0.5, which means they randomly use half of the data for training. This approach ensures that both seen and unseen data are included during testing, which is acceptable.
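To make the clip_input = 0.5 behaviour concrete, a split of that kind can be sketched as below. This is only an illustration of the idea as I described it, not the actual 3D-OVS code; the function name and arguments are hypothetical.

```python
import random

def random_half_split(image_paths, clip_input=0.5, seed=0):
    """Randomly keep a `clip_input` fraction of the views for training;
    the rest remain unseen and are available for testing.

    Hypothetical sketch of the split described above, not the 3D-OVS code.
    """
    rng = random.Random(seed)
    shuffled = list(image_paths)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * clip_input)
    return shuffled[:n_train], shuffled[n_train:]
```

With a split like this, evaluation necessarily covers views the model never saw during training, which is exactly the property I am arguing for.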

However, using all images for training may not always be suitable or meaningful. In practice, we want to evaluate performance from multiple viewpoints, including unseen ones. If the model encounters a new pose at inference time, would we need to retrain it on those images? How does such an experimental setup reflect real-world application scenarios, and how can we assess the model's generalization capability under these conditions?
