This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →


I think this method looks like MM-CoT. What's the difference from it, apart from the LLM used? #12

Closed
dongxiaolong opened this issue Apr 19, 2023 · 3 comments

Comments

@dongxiaolong

dongxiaolong commented Apr 19, 2023

Absolutely awesome work. First, thanks for the instruction-tuning data contributed to the community. I'm just confused about the question in the title.

@152334H

152334H commented Apr 19, 2023

The other important part is the synthetic GPT-4-based dataset.

@ChunyuanLI
Collaborator

The fine-tuning on ScienceQA is similar to MM-CoT in terms of architecture and data organization. One major finding in MM-CoT is that prediction order matters: reason first, then answer, which is why it is called "CoT". In our study, we find that the "CoT" claim is not very important. See the evidence in our paper:

Chain-of-thoughts. To decide the order between the answer and reasoning process in the model prediction, we run both variants and observe that answer-first reports the best number 89.77% accuracy in 12 epochs, while reasoning-first can quickly reach 89.77% accuracy in 6 epochs, but no further improvement with more training (89.96%). Training the model for 24 epochs does not improve the performance. We conclude that CoT-like reasoning-first strategy can largely improve convergence speed, but contributes relatively little to the final performance.
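The two prediction orders in that ablation differ only in how the training target is serialized. A minimal sketch of the idea (the function name and templates are hypothetical, not the actual LLaVA or MM-CoT preprocessing code):

```python
def build_target(answer: str, reasoning: str, answer_first: bool) -> str:
    """Serialize a ScienceQA-style training target in either order.

    answer_first=True  -> answer, then reasoning
    answer_first=False -> reasoning, then answer (CoT-style)
    """
    if answer_first:
        return f"The answer is {answer}. BECAUSE: {reasoning}"
    return f"{reasoning} The answer is {answer}."

# Same supervision signal, different token order; per the quoted ablation,
# reasoning-first mainly speeds up convergence rather than raising accuracy.
print(build_target("(B)", "Ice melts at 0 degrees Celsius.", answer_first=True))
```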

Now, for both papers, most of the performance gain comes from the use of a vision encoder and end-to-end training on ScienceQA. This dataset is relatively small compared with VQA 2.0, so it is easy to reach high performance by training a large model on it. I hope readers keep this in mind when drawing conclusions. Further, there are implementation differences between the two papers that lead us to different conclusions: (1) the choice of LLM; (2) we have a pre-training stage to connect the two modalities, which leads to a 5% improvement compared with training from scratch, while MM-CoT does not have this pre-training stage. We hope this can be reconsidered in future development.
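The pre-training stage mentioned in point (2) amounts to learning a projection that maps frozen vision-encoder features into the LLM's embedding space before instruction fine-tuning. A toy sketch of that idea only (class name, dimensions, and initialization are illustrative assumptions, not the paper's code):

```python
import random

class VisionToLLMProjector:
    """Toy linear projection W: vision_dim -> llm_dim.

    Illustrative assumption: in the pre-training stage, a projection
    like this is trained on image-text pairs to align frozen vision
    features with the LLM's token embedding space; the encoders stay
    fixed and only W is updated.
    """
    def __init__(self, vision_dim: int, llm_dim: int, seed: int = 0):
        rng = random.Random(seed)
        self.w = [[rng.gauss(0.0, 0.02) for _ in range(vision_dim)]
                  for _ in range(llm_dim)]

    def __call__(self, feat):
        # Matrix-vector product: one LLM-space vector per vision feature.
        return [sum(w_i * x_i for w_i, x_i in zip(row, feat))
                for row in self.w]

proj = VisionToLLMProjector(vision_dim=4, llm_dim=3)
tok = proj([0.1, 0.2, 0.3, 0.4])
print(len(tok))  # 3
```

Training this projection first gives the LLM inputs that already "look like" its own embeddings, which is consistent with the ~5% gain over training from scratch cited above.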


I'd like to clarify that ScienceQA helped us quantitatively ablate our design choices in the early stage of the project, but ScienceQA is not the single main focus of this project. We aim to help the community produce multimodal GPT-4-level capability with minimal effort: (1) a focus shift from model-centric to data-centric AI: the multimodal instruction-following data is the key, and most of our time was spent on it; (2) achieving multimodal chat with detailed description, such as OCR and complex reasoning. The current demo has preliminary capabilities on this. We hope the community is inspired to scale up this approach and reach better performance.

@ChunyuanLI ChunyuanLI added the question Further information is requested label Apr 19, 2023
@152334H

152334H commented Apr 20, 2023

In our study, we find that the "CoT" claim is not very important. See the evidence in our paper:

Wow, thanks for the work!

@haotian-liu haotian-liu removed the question Further information is requested label Apr 23, 2023
Repository owner locked and limited conversation to collaborators Apr 23, 2023
@haotian-liu haotian-liu converted this issue into discussion #45 Apr 23, 2023

