
Concerns about Dataset Usage and Discrepancies in Experimental Results #50

Open
epsilontl opened this issue Jun 13, 2024 · 4 comments

@epsilontl

Dear Author,

I hope this message finds you well.

I have some concerns regarding the experimental setup and results presented in your paper, which I hope you can clarify.

1. Dataset Usage Issue: According to your code, the entire dataset, including the test set, is used for training. Isn't this setup problematic?

2. Discrepancies in Experimental Results: I re-trained the model after separating the training and test sets according to standard practice (see the evaluation sketch after this list). The mIoU results I obtained are as follows:

  • ramen: 48.0
  • teatime: 47.8
  • waldo_kitchen: 23.4
  • figurines: 31.9

These results significantly differ from those reported in the paper (ramen: 51.2, teatime: 65.1, waldo_kitchen: 44.5, figurines: 44.7).

3. Acknowledgment of Test Set Inclusion: Do you acknowledge that the test set was included in the training process? If so, do you believe the paper should be retracted under these circumstances?
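For concreteness, by "standard practice" I mean something like the following: hold out a subset of camera views that is never used for optimization, and report mIoU only on renderings of those held-out views. The snippet below is just a minimal sketch of that protocol, not code from this repository; `split_views`, `mean_iou`, and the every-8th-view convention are my own hypothetical choices.

```python
import numpy as np

def split_views(all_image_ids, hold_every=8):
    """Hold out every `hold_every`-th view for testing (llffhold-style convention).

    The held-out views never appear in training; returns (train_ids, test_ids).
    """
    test_ids = list(all_image_ids)[::hold_every]
    held_out = set(test_ids)
    train_ids = [i for i in all_image_ids if i not in held_out]
    return train_ids, test_ids

def mean_iou(pred, gt, num_classes, ignore_index=255):
    """mIoU between two integer label maps of identical shape (H, W),
    averaged over the classes that occur in either map."""
    valid = gt != ignore_index
    ious = []
    for c in range(num_classes):
        pred_c = (pred == c) & valid
        gt_c = (gt == c) & valid
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:
            continue  # class absent in both prediction and GT; skip it
        inter = np.logical_and(pred_c, gt_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0
```

Under this kind of protocol, only the held-out views contribute to the reported per-scene numbers.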

I look forward to your response.

Thank you!

@sangminkim-99

Hi @epsilontl,

In the 3D Open-Vocabulary Semantic Segmentation (3D-OVS) task, it seems common to use all images as the training set. Consequently, they did not report how closely the rendered RGB matches the ground truth; instead, they focused on how well the trained model can segment the parts of the scene. Since training does not use any information from the ground-truth segmentation masks, it is valid to use all images as the training set.

Do you think we still need to separate the training and testing datasets?

@epsilontl
Author

Hi @sangminkim-99,

Thank you for your insightful response regarding the use of all images as the training dataset in the 3D Open-Vocabulary Semantic Segmentation (3D-OVS) task. I understand that you mentioned it is common practice to use the entire dataset for training in this context. However, I would like to clarify my understanding further.

When you say "common," could you elaborate on which sources or practices this is based on? As far as I know, the standard approach in similar tasks is to separate the training and testing sets so that performance is evaluated on unseen data. For instance, among NeRF and 3DGS models used for OVS, the benchmark comparisons in LangSplat (against FFD, 3D-OVS, and LERF), as well as the more recent LEGaussian presented at CVPR 2024, all follow this principle by separating the training and testing sets and evaluating on novel viewpoints.

I would greatly appreciate any further insights you can provide on this matter.

@sangminkim-99

Hi @epsilontl, sorry for the late reply.

I referenced 3D-OVS from NeurIPS 2023.
You can see that they are using all images as the training set here.

I think the word "common" was a bit too strong, but I still think it is okay to use all images for training because training never sees any clue about the GT segmentation masks.

@epsilontl
Author

Hi @sangminkim-99, thank you for your response, and no worries about the delay.

Regarding 3D-OVS from NeurIPS 2023, they actually do not use all images as the training set. If you look further down in the code, you'll notice that all configurations set clip_input = 0.5, which means they randomly use half of the data for training. This approach ensures that both seen and unseen data are included during testing, which is acceptable.
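To make the clip_input = 0.5 behaviour concrete, a split of that kind can be sketched as below. This is only an illustration of the idea as I described it, not the actual 3D-OVS code; the function name and arguments are hypothetical.

```python
import random

def random_half_split(image_paths, clip_input=0.5, seed=0):
    """Randomly keep a `clip_input` fraction of the views for training;
    the rest remain unseen and are available for testing.

    Hypothetical sketch of the split described above, not the 3D-OVS code.
    """
    rng = random.Random(seed)
    shuffled = list(image_paths)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * clip_input)
    return shuffled[:n_train], shuffled[n_train:]
```

With a split like this, evaluation necessarily covers views the model never saw during training, which is exactly the property I am arguing for.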

However, using all images for training may not always be suitable or meaningful. In practice, we want to evaluate performance from multiple viewpoints, including unseen ones. If the model encounters a new pose at inference time, would we need to retrain it on those images? How does such an experimental setup reflect real-world application scenarios, and how can we assess the model's generalization capability under these conditions?
