GPT3.5 ToT Performance is a lot lower #24
Does this mean that open-source models like Vicuna will perform even worse?
Actually, I want to use ToT to improve the reasoning ability of open-source models so that they can approach the reasoning level of GPT-3.5, rather than just its superficial dialogue style.
Hi @IsThatYou this is a great point --- I tried GPT-3.5 and it indeed performs badly on Game of 24. Note that IO: 36% and CoT: 42% are pass@100, though. We also tried ToT with GPT-3.5-turbo instead of GPT-4 on Creative Writing (scoring is still via GPT-4). We find that all methods perform worse, but ToT is still significantly better than the others.
In general, I believe proposing and evaluating diverse thoughts is an "emerging capability" that is hard even for GPT-4, and significantly harder for smaller/weaker models. It would be important and interesting to study how to make smaller models better at ToT reasoning!
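For readers unfamiliar with the propose-and-evaluate loop being discussed, here is a minimal sketch of ToT-style breadth-first search. The `propose` and `value` functions are hypothetical stand-ins for the two LLM calls (thought generation and thought evaluation), demonstrated on a toy domain (reach a target number by adding 1 or doubling) rather than Game of 24:

```python
def propose(state):
    """Stand-in for the LLM proposer: candidate next states from a state."""
    return [state + 1, state * 2]

def value(state, target):
    """Stand-in for the LLM evaluator: higher means more promising."""
    return -abs(target - state)

def tot_bfs(start, target, steps=5, beam=3):
    """Breadth-first ToT search keeping the top-`beam` states each step."""
    frontier = [start]
    for _ in range(steps):
        # Propose successors for every state currently kept in the beam.
        candidates = [s for state in frontier for s in propose(state)]
        # Evaluate all candidates and keep only the most promising ones.
        frontier = sorted(candidates, key=lambda s: value(s, target),
                          reverse=True)[:beam]
        if target in frontier:
            return True
    return target in frontier
```

The point of the discussion above is that both stand-in functions are easy here, but when they are LLM calls, weaker models fail at one or both, so the search degrades.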
Hi @ysymyth thank you for the response! I looked closely at and compared some of the generations from GPT-3.5 and GPT-4, and I found GPT-4 to be better at task understanding in general; GPT-3.5 degenerates more often than GPT-4. Anyway, this is pretty interesting. It would definitely be interesting to see how to make smaller models better at this. :D
Yes, I agree, and perhaps better prompt engineering could help with this issue.
Hi! I tried using GPT-3.5-turbo for the ToT experiment on Game of 24 and got similar results for everything except ToT. For both standard prompting and CoT I got answers close to what's in the paper (IO: 36%, CoT: 42%). But for ToT, without changing the script, I can only get 4% as opposed to the 45% reported in the paper. Have you seen similar behavior from GPT-3.5? What might cause this?
A quick glance over the generations suggests that GPT-3.5 is not as good at following the required format, but the size of the discrepancy is interesting.
Thanks!
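One way the format issue could translate into such a large accuracy drop: if proposals that don't match the expected step pattern are discarded before evaluation, a model that rarely matches the pattern loses most of its search tree. Below is a hedged sketch of such a validator, assuming (hypothetically) a Game of 24 step format like `4 + 8 = 12 (left: 12 6 12)` with integer operands:

```python
import re

# Assumed step pattern: "<a> <op> <b> = <c> (left: <remaining numbers>)".
# This is an illustrative check, not the repository's actual parser.
STEP_RE = re.compile(
    r"^\s*\d+\s*[-+*/]\s*\d+\s*=\s*\d+\s*\(left:(?:\s*\d+)+\s*\)\s*$"
)

def is_well_formed(step: str) -> bool:
    """Return True if a proposed step matches the assumed format."""
    return bool(STEP_RE.match(step))
```

A free-form answer like "First, add 4 and 8 to get 12." would fail this check even when the arithmetic is right, so logging the rejection rate per model could quickly confirm or rule out this explanation.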