
The quantity of the open-source training data does not match that mentioned in the paper. #21

Open
zytx121 opened this issue Mar 20, 2024 · 6 comments



zytx121 commented Mar 20, 2024

Thank you very much for your work!

I discovered that the quantity of the open-source training data does not match that mentioned in the paper. When using a global batch size of 144, the number of iterations I trained for is 2144, while the paper indicates 2400.

KjAeRsTuIsK (Collaborator) commented

Hi @zytx121, we further fine-tuned our model on just the grounding part of the dataset for some more steps.

baichuanzhou commented

> Hi @zytx121, we further fine-tuned our model on just the grounding part of the dataset for some more steps.

Hi, thanks for your great work.
Would you mind also open-sourcing the specific grounding dataset you used in stage 2? Thanks in advance.

KjAeRsTuIsK (Collaborator) commented

You can filter the data from the geochat_instruct file using the [refer] and [grounding] keywords.
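A minimal sketch of that filtering, assuming `geochat_instruct` is a JSON list in which each sample carries a `conversations` list of turns with a `value` field (field names inferred from the script later in this thread):

```python
def filter_grounding(data):
    """Keep samples whose conversation turns mention either task keyword."""
    return [
        sample for sample in data
        if any("[refer]" in conv["value"] or "[grounding]" in conv["value"]
               for conv in sample["conversations"])
    ]

# Tiny demo with the assumed structure; in practice `data` would be the
# parsed geochat_instruct file.
demo = [
    {"conversations": [{"value": "[grounding] describe the image"}]},
    {"conversations": [{"value": "plain caption request"}]},
]
print(len(filter_grounding(demo)))  # 1 of the 2 demo samples is kept
```

This keeps a sample if any turn contains either keyword; filtering per-turn instead would give a different count.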

zytx121 (Author) commented Mar 24, 2024

Thank you for your answer. After filtering, my training iteration count is 1472, which still does not match the over 1600 reported in the paper.


baichuanzhou commented Mar 24, 2024

> You can filter the data from the geochat_instruct file using the [refer] and [grounding] keywords.

Does stage 2 also have a batch size of 144? @KjAeRsTuIsK


baichuanzhou commented Mar 24, 2024

> Thank you for your answer. After filtering, my training iteration count is 1472, which still does not match the over 1600 reported in the paper.

May I ask how you filtered the data? I used this script to count how many samples contain the [refer] or [grounding] keywords:

```python
num_grounding = 0
for sample in data:
    conversations = sample['conversations']
    if any('[grounding]' in conv['value'] or '[refer]' in conv['value'] for conv in conversations):
        num_grounding += 1
```

num_grounding is 70464. With a batch size of 144, that gives only 490 iterations. @zytx121
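For reference, the 490 figure follows from ceiling division of the sample count by the global batch size over one epoch (both numbers taken from this thread):

```python
import math

num_grounding = 70464   # samples containing [refer] or [grounding], per the count above
batch_size = 144        # global batch size discussed in this thread
iters_per_epoch = math.ceil(num_grounding / batch_size)
print(iters_per_epoch)  # 490
```

So matching the paper's 1600+ iterations at this batch size would require roughly three epochs over the filtered subset, assuming one pass per epoch.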
