
issues with hallucinations in tasks #20

Open

yananchen1989 opened this issue May 15, 2024 · 0 comments

yananchen1989 commented May 15, 2024

Hello authors,

Similar to the previous post #6: when I run the tasks, I also face severe hallucination issues when using GPT-3.5-turbo-0125.

The error message looks like this:

```
An error occurred: list index out of range. Traceback (most recent call last):
File "C:\Users/ITSupp/Downloads/codes/self-refine\src\utils.py", line 39, in wrapper
return func(*args, **kwargs)
File "C:\Users\ITSupp\Downloads\codes\self-refine\src\gsm\run.py", line 45, in iterative_gsm
fb_and_maybe_soln = task_feedback(solution=solution)
File "C:\Users/ITSupp/Downloads/codes/self-refine\src\gsm\feedback.py", line 52, in call
improved_soln = entire_output.split("def solution():")[1]
IndexError: list index out of range
. Left retries: 2.
n_attempts===> 0
```
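
For reference, the direct cause is that GPT-3.5 sometimes replies without the expected `def solution():` marker, so `str.split` returns a single-element list and index `[1]` fails. A minimal sketch (the model reply below is a hypothetical example of mine, and the guarded variant is an assumption, not the repo's actual code):

```python
# Hypothetical model reply that lacks the expected marker:
entire_output = "Sorry, I cannot solve this."
parts = entire_output.split("def solution():")
# parts == ["Sorry, I cannot solve this."], so parts[1] raises IndexError.

# A defensive variant: check for the marker before indexing.
if "def solution():" in entire_output:
    improved_soln = "def solution():" + entire_output.split("def solution():")[1]
else:
    improved_soln = None  # signal a parse failure so the caller can retry
```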

Take the GSM task as an example: in the script self-refine\src\gsm\run.py, I see that to avoid this type of error you add @retry_parse_fail_prone_cmd, a retry mechanism that re-runs the call several times until the model returns a decently formatted answer.
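
From the "Left retries" message in the log, I understand the decorator to behave roughly like this (a sketch of mine; the names and default retry count are assumptions, not the repo's exact code):

```python
import functools
import traceback

def retry_parse_fail_prone_cmd(func, max_retries: int = 3):
    """Sketch of a retry decorator: re-run `func` when parsing the model
    output fails, up to `max_retries` times, then give up."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        for retries_left in range(max_retries, 0, -1):
            try:
                return func(*args, **kwargs)
            except Exception as e:
                print(f"An error occurred: {e}. {traceback.format_exc()}"
                      f" Left retries: {retries_left - 1}.")
        return None  # all retries exhausted; the sample is dropped
    return wrapper
```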

OK, let's first execute run.py to produce the output file for further evaluation; I use the default max_attempts=4.

  • With retry_parse_fail_prone_cmd added, only around 50% of the test samples (slow_programs_df) are collected into results, which are then written to the outfile on disk.

  • Without retry_parse_fail_prone_cmd, the retention rate is lower, around 30%.

However, the released output file data/tasks/gsm/gsm_outputs.jsonl has 1253 samples. Out of the 1319 input test cases (data/tasks/gsm/gsm.jsonl), that is a retention rate of 95%. May I know why the retention rate is so high? Did you execute self-refine\src\gsm\run.py multiple times and merge the results, perhaps along the lines sketched below?
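
If reruns were merged, something like this would explain the higher retention: each rerun fills in samples that were previously dropped. (The file names and the "question" key are my assumptions about the jsonl schema, not the repo's actual code.)

```python
import json

# Hypothetical merge of several run.py output files: keep the first
# successful record per question so reruns fill in dropped samples.
merged = {}
for path in ["gsm_outputs_run1.jsonl", "gsm_outputs_run2.jsonl"]:
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            merged.setdefault(rec["question"], rec)

with open("gsm_outputs_merged.jsonl", "w") as f:
    for rec in merged.values():
        f.write(json.dumps(rec) + "\n")
```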

Then, evaluating the outfile via src/gsm/gsm_selfref_eval.py, the accuracy is between 70% and 80% (I ran it several times).
However, most of the time the accuracy is identical across all attempts:

```
Accuracy at attempt 0 = 74.07%
Accuracy at attempt 1 = 74.07%
Accuracy at attempt 2 = 74.07%
Accuracy at attempt 3 = 74.07%
Accuracy at attempt 4 = 74.07%
```
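
For what it's worth, flat numbers like these would follow naturally if the evaluation carries the latest available solution forward to later attempts and the model rarely revises its answer. A sketch of that reading (the record schema with "solutions" and "is_correct" fields is my assumption, not the repo's actual evaluation code):

```python
def accuracy_at_attempt(records, k):
    """Sketch: accuracy if the latest available solution is carried
    forward to attempt k. If refinement stops at attempt 0 for most
    samples, every attempt reports the same accuracy."""
    correct = 0
    for rec in records:
        attempts = rec["solutions"]                  # one entry per refinement step
        soln = attempts[min(k, len(attempts) - 1)]   # carry the last attempt forward
        correct += soln["is_correct"]
    return correct / len(records)
```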

I am not sure: was this actually common in your original experiments as well?

Thanks.
