bug-report: data collection questions #5
Hello,
Thank you for your work! I have several questions about the training and data preprocessing.
In the paper, you say that at each round you train the fixer and the breaker only on the newly generated data; please see the screenshot and the equations. None of the equations mentions the round-0 dataset on which the initial breaker is trained.
This seemed a bit odd to me, because I would expect the newly generated data to be merged with the existing round-0 data and the model to then be trained on this joined dataset. Since the newly generated dataset is synthetic and therefore quite biased, training only on it makes it very likely that the models forget everything learned from the initial data, which contains real-world bugs and fixes.
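For clarity, this is roughly the behavior I expected at each round (a minimal sketch with hypothetical names, not your code):

```python
# Sketch of the merging I expected (hypothetical function/variable names):
# the round-0 (real-world) pairs are kept and the newly generated pairs
# are simply added on top, so the model always sees the original data.
def build_training_set(round0_pairs, new_pairs):
    return round0_pairs + new_pairs  # train on the joined dataset
```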
At first, I thought the equations were wrong, but the text backs them up as well. See the screenshot from the paper:
To double-check, I started looking into your code and may have found several inconsistencies.
Indeed, you do merge the synthetically generated dataset with the initial dataset. See here: https://github.com/michiyasunaga/BIFI/blob/main/src/c006__generate_paired_data_from_fixer.py#L47-L52
However, you do it in a way I find strange: the dataset for the next round is built so that 1/3 of it comes from the initial data and the other 2/3 from the synthetically generated data. See here: https://github.com/michiyasunaga/BIFI/blob/main/src/c006__generate_paired_data_from_fixer.py#L59-L62
What is even stranger is that you duplicate data points: the total size is fixed at 30’000’000, and samples are repeated until it is reached, so the same samples appear multiple times in the dataset. See here: https://github.com/michiyasunaga/BIFI/blob/main/src/c006__generate_paired_data_from_fixer.py#L58-L62 (see my paraphrased sketch of this logic below).
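To make sure I understood the linked code correctly, here is how I would paraphrase the mixing and oversampling logic (a rough sketch; the function and variable names are mine, not from your script):

```python
import random

def mix_and_oversample(init_pairs, synthetic_pairs, total_size=30_000_000):
    """Paraphrase of my reading of the linked code (names are mine):
    1/3 of the output is drawn from the initial data, 2/3 from the
    synthetic data, and samples are repeated until the fixed total
    size is reached."""
    n_init = total_size // 3            # one third from the initial data
    n_syn = total_size - n_init         # two thirds from the synthetic data
    # repeating (duplicating) samples as needed to hit the target counts
    mixed = [init_pairs[i % len(init_pairs)] for i in range(n_init)]
    mixed += [synthetic_pairs[i % len(synthetic_pairs)] for i in range(n_syn)]
    random.shuffle(mixed)               # the same sample can appear many times
    return mixed
```

Please correct me if this summary of the behavior is wrong.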
Unless I missed something, none of this is mentioned in the paper, and I do not understand the motivation behind these choices.
Could you please clarify these issues?
Thanks in advance!
Best,
Berkay Berabi