Hello authors,

Similar to the previous post (#6), when I run the tasks I also face severe hallucination issues when using GPT-3.5-turbo-0125. The error message looks like this:
```
An error occurred: list index out of range. Traceback (most recent call last):
  File "C:\Users/ITSupp/Downloads/codes/self-refine\src\utils.py", line 39, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\ITSupp\Downloads\codes\self-refine\src\gsm\run.py", line 45, in iterative_gsm
    fb_and_maybe_soln = task_feedback(solution=solution)
  File "C:\Users/ITSupp/Downloads/codes/self-refine\src\gsm\feedback.py", line 52, in __call__
    improved_soln = entire_output.split("def solution():")[1]
IndexError: list index out of range
. Left retries: 2.
n_attempts===> 0
```
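Tracing this, the failure comes from `entire_output.split("def solution():")[1]` in `feedback.py`: when the model's reply does not contain the `def solution():` marker, index `[1]` does not exist. A purely hypothetical guard (not the repo's code) that returns a sentinel instead of raising would look like:

```python
# Hypothetical guard, not the actual code in src/gsm/feedback.py: return None
# when the completion lacks the expected marker so the caller can retry.
def extract_improved_solution(entire_output: str, marker: str = "def solution():"):
    if marker not in entire_output:
        return None  # model hallucinated a different format
    # keep everything from the marker onward, like the original split-based parse
    return marker + entire_output.split(marker, 1)[1]
```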
Take the GSM task as an example. In the script `self-refine\src\gsm\run.py`, I see that to avoid this type of error you add `@retry_parse_fail_prone_cmd`, a retry mechanism that re-queries the model several times until it produces an answer in the expected format.
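Based on the log messages, I picture that decorator as something like the following sketch (my assumption, not the actual implementation in `src/utils.py`):

```python
import functools

def retry_parse_fail_prone_cmd(func, max_retries: int = 3):
    # Sketch only: re-run the wrapped call when parsing the model output fails,
    # printing the remaining retries, and give up by returning None.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        for retries_left in range(max_retries, -1, -1):
            try:
                return func(*args, **kwargs)
            except Exception as exc:  # e.g. IndexError from a malformed completion
                print(f"An error occurred: {exc}. Left retries: {retries_left}.")
        return None
    return wrapper
```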
First, I execute `run.py` to produce the output file for later evaluation, using the default `max_attempts=4`.
With `retry_parse_fail_prone_cmd` added, only around 50% of the test samples (`slow_programs_df`) are collected into `results`, which is then written to `outfile` on disk. Without `retry_parse_fail_prone_cmd`, the retention rate is even lower, around 30%.
However, the released output file `data/tasks/gsm/gsm_outputs.jsonl` has 1253 samples out of the 1319 input test cases (`data/tasks/gsm/gsm.jsonl`), a retention rate of about 95%. May I ask why your retention rate is so high? Did you execute `self-refine\src\gsm\run.py` multiple times and merge the results?
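By "merge" I mean something like the following (hypothetical file names and field name, not from the repo): run `run.py` several times and take the union of successfully parsed samples.

```python
import glob
import json

# Hypothetical merge of several partial runs, deduplicated by the input question;
# the file pattern and the "question" field name are assumptions on my side.
merged = {}
for path in glob.glob("gsm_outputs_run*.jsonl"):
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            merged.setdefault(record["question"], record)

with open("gsm_outputs_merged.jsonl", "w") as f:
    for record in merged.values():
        f.write(json.dumps(record) + "\n")
```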
Then, evaluating the outfile via `src/gsm/gsm_selfref_eval.py`, the accuracy is between 70% and 80% (I ran it several times).
However, most of the time the accuracy is identical across all attempts:
```
Accuracy at attempt 0 = 74.07%
Accuracy at attempt 1 = 74.07%
Accuracy at attempt 2 = 74.07%
Accuracy at attempt 3 = 74.07%
Accuracy at attempt 4 = 74.07%
```
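To see whether the later attempts ever actually change the program, I inspect the outputs with a small script like this (the `run_logs`/`solution_curr` field names are my guess at the jsonl schema):

```python
import json

# Hypothetical sanity check: how many samples have a program at attempt > 0
# that differs from attempt 0? Field names are assumptions about the schema.
changed = total = 0
with open("data/tasks/gsm/gsm_outputs.jsonl") as f:
    for line in f:
        record = json.loads(line)
        attempts = record.get("run_logs", [])
        if not attempts:
            continue
        total += 1
        first = attempts[0].get("solution_curr")
        if any(a.get("solution_curr") != first for a in attempts[1:]):
            changed += 1

print(f"{changed}/{total} samples changed after attempt 0")
```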
Was this also common in your original experiments?
Thanks.