This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →


I think this method looks like MM-CoT. What's the difference from it, apart from the LLM used? #12

Closed
dongxiaolong opened this issue Apr 19, 2023 · 3 comments

Comments

@dongxiaolong

dongxiaolong commented Apr 19, 2023

Absolutely awesome work. First, thanks for the instruction-tuning data contributed to the community. I'm just confused about the question in the title.

@152334H

152334H commented Apr 19, 2023

The other important part is the synthetic GPT-4-based dataset.

@ChunyuanLI
Collaborator

The fine-tuning on ScienceQA is similar to MM-CoT in terms of architecture and data organization. One major finding in MM-CoT is that prediction order matters: reason first, then answer, which is why it is called "CoT". In our study, we find that the "CoT" claim is not very important. See the evidence in our paper:

Chain-of-thoughts. To decide the order between the answer and reasoning process in the model prediction, we run both variants and observe that answer-first reports the best number 89.77% accuracy in 12 epochs, while reasoning-first can quickly reach 89.77% accuracy in 6 epochs, but no further improvement with more training (89.96%). Training the model for 24 epochs does not improve the performance. We conclude that CoT-like reasoning-first strategy can largely improve convergence speed, but contributes relatively little to the final performance.
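The two prediction orders in that ablation differ only in how the training target is serialized. A minimal sketch of the idea (the function name and templates are hypothetical, not the actual LLaVA or MM-CoT preprocessing code):

```python
def build_target(answer: str, reasoning: str, answer_first: bool) -> str:
    """Serialize a ScienceQA-style training target in either order.

    answer_first=True  -> answer, then reasoning
    answer_first=False -> reasoning, then answer (CoT-style)
    """
    if answer_first:
        return f"The answer is {answer}. BECAUSE: {reasoning}"
    return f"{reasoning} The answer is {answer}."

# Same supervision signal, different token order; per the quoted ablation,
# reasoning-first mainly speeds up convergence rather than raising accuracy.
print(build_target("(B)", "Ice melts at 0 degrees Celsius.", answer_first=True))
```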

Now, for both papers, most of the performance gain comes from the use of a vision encoder and end-to-end training on ScienceQA. This dataset is relatively small compared with VQA 2.0, so it is easy to reach high performance by training a large model on it. I hope readers keep this in mind when drawing conclusions. Further, there are implementation differences between the two papers that lead us to different conclusions: (1) the choice of LLM; (2) we have a pre-training stage to connect the two modalities, which leads to a 5% improvement compared with training from scratch, while MM-CoT does not have this pre-training stage. We hope this can be reconsidered in future development.
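The pre-training stage mentioned in point (2) amounts to learning a projection that maps frozen vision-encoder features into the LLM's embedding space before instruction fine-tuning. A toy sketch of that idea only (class name, dimensions, and initialization are illustrative assumptions, not the paper's code):

```python
import random

class VisionToLLMProjector:
    """Toy linear projection W: vision_dim -> llm_dim.

    Illustrative assumption: in the pre-training stage, a projection
    like this is trained on image-text pairs to align frozen vision
    features with the LLM's token embedding space; the encoders stay
    fixed and only W is updated.
    """
    def __init__(self, vision_dim: int, llm_dim: int, seed: int = 0):
        rng = random.Random(seed)
        self.w = [[rng.gauss(0.0, 0.02) for _ in range(vision_dim)]
                  for _ in range(llm_dim)]

    def __call__(self, feat):
        # Matrix-vector product: one LLM-space vector per vision feature.
        return [sum(w_i * x_i for w_i, x_i in zip(row, feat))
                for row in self.w]

proj = VisionToLLMProjector(vision_dim=4, llm_dim=3)
tok = proj([0.1, 0.2, 0.3, 0.4])
print(len(tok))  # 3
```

Training this projection first gives the LLM inputs that already "look like" its own embeddings, which is consistent with the ~5% gain over training from scratch cited above.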


I'd like to clarify that ScienceQA helped us quantitatively ablate our design choices in the early stage of the project, but ScienceQA is not the single main focus of this project. We aim to help the community produce multimodal GPT-4-level capability with minimal effort: (1) a focus shift from model-centric to data-centric AI: the multimodal instruction-following data is the key, and most of our time was spent on it; (2) achieving multimodal chat with detailed description, such as OCR and complex reasoning. The current demo has preliminary capabilities on this. We hope the community is inspired to scale up this approach and reach better performance.

@ChunyuanLI ChunyuanLI added the question Further information is requested label Apr 19, 2023
@152334H

152334H commented Apr 20, 2023

In our study, we find that the "CoT" claim is not very important. See the evidence in our paper:

Wow, thanks for the work!

@haotian-liu haotian-liu removed the question Further information is requested label Apr 23, 2023
Repository owner locked and limited conversation to collaborators Apr 23, 2023
@haotian-liu haotian-liu converted this issue into discussion #45 Apr 23, 2023

