Multimodal Chain-of-Thought Reasoning in Language Models

"Imagine learning a textbook without figures or tables."

Multimodal-CoT incorporates vision features in a decoupled training framework. The framework consists of two training stages: (i) rationale generation and (ii) answer inference. Both stages share the same model architecture but differ in the input and output.
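
The data flow between the two stages can be sketched as follows. This is a minimal illustration, not the repository's implementation: `Example`, `two_stage_predict`, `generate_rationale`, and `infer_answer` are hypothetical names, the two callables stand in for the two fine-tuned models, and the vision features are represented by a plain list of floats. The sketch assumes that the second stage receives the original language input extended with the rationale generated in the first stage.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    question: str
    context: str
    options: List[str]
    vision_features: List[float]  # placeholder for extracted image features


def two_stage_predict(
    example: Example,
    generate_rationale: Callable[[str, List[float]], str],
    infer_answer: Callable[[str, List[float]], str],
) -> dict:
    # Stage (i): language input + vision features -> rationale.
    text_in = (
        f"Question: {example.question}\n"
        f"Context: {example.context}\n"
        f"Options: {' '.join(example.options)}"
    )
    rationale = generate_rationale(text_in, example.vision_features)

    # Stage (ii): the language input is extended with the generated rationale
    # (assumption) and passed with the same vision features to get the answer.
    text_with_rationale = f"{text_in}\nRationale: {rationale}"
    answer = infer_answer(text_with_rationale, example.vision_features)
    return {"rationale": rationale, "answer": answer}
```

Both stages take the same kind of arguments, which mirrors the statement that they share one architecture and differ only in input and output.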

Requirements

Install all required Python dependencies:

pip install -r requirements.txt

Datasets

Download the dataset from the following repository:

https://github.com/lupantech/ScienceQA/tree/main/data

Download the extracted vision features from vision_features and unzip the files into the vision_features folder.

Instructions

Training

# rationale generation
CUDA_VISIBLE_DEVICES=0,1 python main.py \
    --model allenai/unifiedqa-t5-base \
    --user_msg rationale --img_type detr \
    --bs 8 --eval_bs 4 --eval_acc 10 --output_len 512 \
    --final_eval --prompt_format QCM-LE

# answer inference
CUDA_VISIBLE_DEVICES=0,1 python main.py \
    --model allenai/unifiedqa-t5-base \
    --user_msg answer --img_type detr \
    --bs 8 --eval_bs 4 --eval_acc 10 --output_len 64 \
    --final_eval --prompt_format QCMG-A \
    --eval_le experiments/rationale_allenai-unifiedqa-t5-base_detr_QCM-LE_lr5e-05_bs16_op512_ep20/predictions_ans_eval.json \
    --test_le experiments/rationale_allenai-unifiedqa-t5-base_detr_QCM-LE_lr5e-05_bs16_op512_ep20/predictions_ans_test.json
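
The `--eval_le` and `--test_le` paths above point into the output folder of the rationale generation run. The folder name appears to encode that run's settings, and the sketch below reconstructs it under that assumption. The helper `experiment_dir` is hypothetical. The learning rate (5e-05), the epoch count (20), and the reading of `bs16` as the per-device batch size (8) times the two visible GPUs are inferred from the path itself and are not confirmed by this README.

```python
def experiment_dir(user_msg, model, img_type, prompt_format,
                   lr, bs, n_gpus, output_len, epochs, root="experiments"):
    # A model id contains a slash, which cannot appear in a folder name,
    # so it is replaced with a dash (assumption based on the path above).
    model_name = model.replace("/", "-")
    total_bs = bs * n_gpus  # assumed: effective batch size across GPUs
    return (f"{root}/{user_msg}_{model_name}_{img_type}_{prompt_format}"
            f"_lr{lr}_bs{total_bs}_op{output_len}_ep{epochs}")


rationale_dir = experiment_dir(
    user_msg="rationale", model="allenai/unifiedqa-t5-base",
    img_type="detr", prompt_format="QCM-LE",
    lr=5e-05, bs=8, n_gpus=2, output_len=512, epochs=20,
)
print(f"{rationale_dir}/predictions_ans_eval.json")
print(f"{rationale_dir}/predictions_ans_test.json")
```

If rationale generation is run with different flags or a different number of GPUs, the folder name will likely differ, so the two paths passed to answer inference should be checked against the actual contents of the experiments folder.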

Inference

Our trained models are available at models. To use them, place them under the models folder.

# rationale generation
CUDA_VISIBLE_DEVICES=0,1 python main.py \
    --model allenai/unifiedqa-t5-base \
    --user_msg rationale --img_type detr \
    --bs 8 --eval_bs 4 --eval_acc 10 --output_len 512 \
    --final_eval --prompt_format QCM-LE \
    --evaluate_dir models/MM-CoT-UnifiedQA-base-Rationale

# answer inference
CUDA_VISIBLE_DEVICES=0,1 python main.py \
    --model allenai/unifiedqa-t5-base \
    --user_msg answer --img_type detr \
    --bs 8 --eval_bs 4 --eval_acc 10 --output_len 64 \
    --final_eval --prompt_format QCMG-A \
    --eval_le models/rationale/predictions_ans_eval.json \
    --test_le models/rationale/predictions_ans_test.json \
    --evaluate_dir models/MM-CoT-UnifiedQA-base-Answer
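
The inference commands above rely on several paths: the two trained checkpoints under models and the rationale prediction files passed via `--eval_le` and `--test_le`. A small preflight check along these lines can catch a missing download before a model is loaded. The helper `missing_paths` is hypothetical and not part of the repository; the paths are the ones used in the commands above.

```python
from pathlib import Path


def missing_paths(paths, base="."):
    # Return the entries of `paths` that do not exist under `base`.
    return [p for p in paths if not (Path(base) / p).exists()]


required = [
    "models/MM-CoT-UnifiedQA-base-Rationale",
    "models/MM-CoT-UnifiedQA-base-Answer",
    "models/rationale/predictions_ans_eval.json",
    "models/rationale/predictions_ans_test.json",
]

# Report anything that is not in place relative to the current directory.
for p in missing_paths(required):
    print(f"missing: {p}")
```

Run it from the repository root; no output means every path referenced by the inference commands is present.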

Citing MM-CoT

@article{zhang2023multicot,
  title={Multimodal Chain-of-Thought Reasoning in Language Models},
  author={Zhang, Zhuosheng and Zhang, Aston and Li, Mu and Zhao, Hai and Karypis, George and Smola, Alex},
  journal={arXiv preprint arXiv:2302.00923},
  year={2023}
}

License

This project is licensed under the Apache-2.0 License.

Acknowledgement

Parts of our code are adapted from ScienceQA and Transformers.

We thank Pan Lu for providing the parameter sizes for the ScienceQA baselines.
