
issues with hallucinations in tasks #20

Open

yananchen1989 opened this issue May 15, 2024 · 0 comments

yananchen1989 commented May 15, 2024

Hello authors,

Similar to the previous post #6: when I run the tasks, I also face severe hallucination issues when using GPT-3.5-turbo-0125.

The error message looks like this:

```
An error occurred: list index out of range. Traceback (most recent call last):
File "C:\Users/ITSupp/Downloads/codes/self-refine\src\utils.py", line 39, in wrapper
return func(*args, **kwargs)
File "C:\Users\ITSupp\Downloads\codes\self-refine\src\gsm\run.py", line 45, in iterative_gsm
fb_and_maybe_soln = task_feedback(solution=solution)
File "C:\Users/ITSupp/Downloads/codes/self-refine\src\gsm\feedback.py", line 52, in call
improved_soln = entire_output.split("def solution():")[1]
IndexError: list index out of range
. Left retries: 2.
n_attempts===> 0
```
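
For reference, the direct cause is that GPT-3.5 sometimes replies without the expected `def solution():` marker, so `str.split` returns a single-element list and index `[1]` fails. A minimal sketch (the model reply below is a hypothetical example of mine, and the guarded variant is an assumption, not the repo's actual code):

```python
# Hypothetical model reply that lacks the expected marker:
entire_output = "Sorry, I cannot solve this."
parts = entire_output.split("def solution():")
# parts == ["Sorry, I cannot solve this."], so parts[1] raises IndexError.

# A defensive variant: check for the marker before indexing.
if "def solution():" in entire_output:
    improved_soln = "def solution():" + entire_output.split("def solution():")[1]
else:
    improved_soln = None  # signal a parse failure so the caller can retry
```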

Take the GSM task as an example: in the script self-refine\src\gsm\run.py, I see that to avoid this type of error you add @retry_parse_fail_prone_cmd, a retry mechanism that re-runs the call several times until the model returns a decently formatted answer.
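
From the "Left retries" message in the log, I understand the decorator to behave roughly like this (a sketch of mine; the names and default retry count are assumptions, not the repo's exact code):

```python
import functools
import traceback

def retry_parse_fail_prone_cmd(func, max_retries: int = 3):
    """Sketch of a retry decorator: re-run `func` when parsing the model
    output fails, up to `max_retries` times, then give up."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        for retries_left in range(max_retries, 0, -1):
            try:
                return func(*args, **kwargs)
            except Exception as e:
                print(f"An error occurred: {e}. {traceback.format_exc()}"
                      f" Left retries: {retries_left - 1}.")
        return None  # all retries exhausted; the sample is dropped
    return wrapper
```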

OK, let's first execute run.py to produce the output file for further evaluation; I use the default max_attempts=4.

  • With retry_parse_fail_prone_cmd added, only around 50% of the test samples (slow_programs_df) are collected into results, which are then written to the outfile on disk.

  • Without retry_parse_fail_prone_cmd, the retention rate is lower, around 30%.

However, the released output file data/tasks/gsm/gsm_outputs.jsonl has 1253 samples. Out of the 1319 input test cases (data/tasks/gsm/gsm.jsonl), that is a retention rate of 95%. May I know why the retention rate is so high? Did you execute self-refine\src\gsm\run.py multiple times and merge the results, perhaps along the lines sketched below?
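
If reruns were merged, something like this would explain the higher retention: each rerun fills in samples that were previously dropped. (The file names and the "question" key are my assumptions about the jsonl schema, not the repo's actual code.)

```python
import json

# Hypothetical merge of several run.py output files: keep the first
# successful record per question so reruns fill in dropped samples.
merged = {}
for path in ["gsm_outputs_run1.jsonl", "gsm_outputs_run2.jsonl"]:
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            merged.setdefault(rec["question"], rec)

with open("gsm_outputs_merged.jsonl", "w") as f:
    for rec in merged.values():
        f.write(json.dumps(rec) + "\n")
```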

Then, evaluating the outfile via src/gsm/gsm_selfref_eval.py, the accuracy is between 70% and 80% (I ran it several times).
However, most of the time the accuracy is identical across all attempts:

```
Accuracy at attempt 0 = 74.07%
Accuracy at attempt 1 = 74.07%
Accuracy at attempt 2 = 74.07%
Accuracy at attempt 3 = 74.07%
Accuracy at attempt 4 = 74.07%
```
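
For what it's worth, flat numbers like these would follow naturally if the evaluation carries the latest available solution forward to later attempts and the model rarely revises its answer. A sketch of that reading (the record schema with "solutions" and "is_correct" fields is my assumption, not the repo's actual evaluation code):

```python
def accuracy_at_attempt(records, k):
    """Sketch: accuracy if the latest available solution is carried
    forward to attempt k. If refinement stops at attempt 0 for most
    samples, every attempt reports the same accuracy."""
    correct = 0
    for rec in records:
        attempts = rec["solutions"]                  # one entry per refinement step
        soln = attempts[min(k, len(attempts) - 1)]   # carry the last attempt forward
        correct += soln["is_correct"]
    return correct / len(records)
```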

I am not sure: was this actually common in your original experiments as well?

Thanks.
