The quantity of the open-source training data does not match that mentioned in the paper. #21

zytx121 · 2024-03-20T23:58:16Z

Thank you very much for your work！

I discovered that the quantity of the open-source training data does not match that mentioned in the paper. When using a global batch size of 144, the number of iterations I trained for is 2144, while the paper indicates 2400.

KjAeRsTuIsK · 2024-03-21T10:42:38Z

Hi @zytx121 , we further finetuned our model on just the grounding part of the dataset for some more steps.

baichuanzhou · 2024-03-23T05:06:15Z

Hi @zytx121 , we further finetuned our model on just the grounding part of the dataset for some more steps.

Hi, thanks for your great work.
Would you mind also open-sourcing the specific grounding dataset that you used in stage2? Thanks in advance.

KjAeRsTuIsK · 2024-03-23T23:36:15Z

You can filter the data from the geochat_instruct file using the [refer] and [grounding] keywords.

zytx121 · 2024-03-24T01:17:32Z

Thank you for your answer. After filtering, my training sample iteration count is 1472, which still does not match the over 1600 in the paper.

baichuanzhou · 2024-03-24T02:07:52Z

You can filter the data from the geochat_instruct file using the [refer] and [grounding] keywords.

Does stage 2 also have a batch size of 144? @KjAeRsTuIsK

baichuanzhou · 2024-03-24T02:35:32Z

Thank you for your answer. After filtering, my training sample iteration count is 1472, which still does not match the over 1600 in the paper.

May I ask how did you filter the data? I used this script to find out how many samples have [refer] or [grounding] keywords:

for sample in data:
     conversations = sample['conversations']
     if any('[grounding]' in conv['value'] or '[refer]' in conv['value'] for conv in conversations):
             num_grounding += 1

num_grounding is 70464. If we set the batch size to 144, it only has 490 iterations. @zytx121

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

The quantity of the open-source training data does not match that mentioned in the paper. #21

The quantity of the open-source training data does not match that mentioned in the paper. #21

zytx121 commented Mar 20, 2024

KjAeRsTuIsK commented Mar 21, 2024

baichuanzhou commented Mar 23, 2024

KjAeRsTuIsK commented Mar 23, 2024

zytx121 commented Mar 24, 2024

baichuanzhou commented Mar 24, 2024 •

edited

Loading

baichuanzhou commented Mar 24, 2024 •

edited

Loading

The quantity of the open-source training data does not match that mentioned in the paper. #21

The quantity of the open-source training data does not match that mentioned in the paper. #21

Comments

zytx121 commented Mar 20, 2024

KjAeRsTuIsK commented Mar 21, 2024

baichuanzhou commented Mar 23, 2024

KjAeRsTuIsK commented Mar 23, 2024

zytx121 commented Mar 24, 2024

baichuanzhou commented Mar 24, 2024 • edited Loading

baichuanzhou commented Mar 24, 2024 • edited Loading

baichuanzhou commented Mar 24, 2024 •

edited

Loading

baichuanzhou commented Mar 24, 2024 •

edited

Loading