potential bug of checkpointing #337

Open
drcege opened this issue Jun 28, 2024 · 1 comment
drcege (Collaborator) commented Jun 28, 2024

As the title suggests, the issue lies in the following code.

recorded_op_num = len(self.op_record)
prefix_process = self.process_list[:recorded_op_num]
all_the_same = True
dif1, dif2 = None, None
for record_op, config_op in zip(self.op_record, prefix_process):
    if record_op != config_op:
        all_the_same = False
        dif1, dif2 = record_op, config_op
        break
if all_the_same:
    for op in self.op_record:
        op_name = list(op.keys())[0]
        logger.info(f'Skip op [{op_name}].')
    self.process_list = self.process_list[recorded_op_num:]
    return True

When the new process_list is shorter than op_record, the slice self.process_list[:recorded_op_num] does not raise an error for the out-of-range bound. Python silently truncates the slice to the available length, so len(prefix_process) < len(self.op_record). Similarly, zip stops at the end of the shorter iterable, so only the first len(prefix_process) recorded ops are compared and the remaining ones are never checked. As a result, check_ops_to_skip incorrectly concludes that the recorded operators match a prefix of the current operator list, even though the current list is shorter than the record.
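To make the behavior concrete, here is a minimal sketch that reproduces the comparison on plain lists, outside the class. The function names, the op names (`op_a`, `op_b`, `op_c`), and the guarded variant are hypothetical illustrations, not code from the repository; the guard is only one possible way to detect the length mismatch.

```python
def prefix_matches_buggy(op_record, process_list):
    """Reduced copy of the comparison quoted above, using plain lists."""
    recorded_op_num = len(op_record)
    # Slicing never raises here: a shorter process_list just yields a shorter slice.
    prefix_process = process_list[:recorded_op_num]
    # zip stops at the shorter input, so trailing recorded ops are never compared.
    for record_op, config_op in zip(op_record, prefix_process):
        if record_op != config_op:
            return False
    return True


def prefix_matches_guarded(op_record, process_list):
    """Hypothetical variant: also require the slice to cover every recorded op."""
    prefix_process = process_list[:len(op_record)]
    if len(prefix_process) != len(op_record):
        return False
    return list(op_record) == list(prefix_process)


# Hypothetical op configs: three ops were recorded, but the new config lists only two.
recorded = [{'op_a': {}}, {'op_b': {}}, {'op_c': {}}]
shorter = [{'op_a': {}}, {'op_b': {}}]
longer = recorded + [{'op_d': {}}]

print(prefix_matches_buggy(recorded, shorter))    # True, although op_c is missing
print(prefix_matches_guarded(recorded, shorter))  # False
print(prefix_matches_guarded(recorded, longer))   # True, the record is a real prefix
```

On Python 3.10 or later, `zip(..., strict=True)` would also surface the mismatch, by raising `ValueError` when the two inputs differ in length.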

Is that the case? @HYLcool @yxdyc

@drcege drcege self-assigned this Jun 28, 2024
@drcege drcege added the bug Something isn't working label Jun 28, 2024
HYLcool (Collaborator) commented Jul 11, 2024

Yes, that is a problem in this situation. 👍🏻
