Scan all status is hung after migration from 1.8 to 1.10 #9963

Closed
danfengliu opened this issue Nov 22, 2019 · 17 comments · Fixed by #10429

Comments

@danfengliu
Contributor

Test Steps:
1. Deploy Harbor 1.8.5.
2. Set the Scan All schedule to 12 12 12 * * *.
3. Wait more than 12 minutes.
4. Upgrade Harbor to 1.10.
5. View the "Interrogation Services" page.
Result:
The page stays in the scan waiting status for a long time.

@danfengliu
Contributor Author

image

@steven-zou
Contributor

en?

This is very weird. Total is 0 but ongoing is true.

@AllForNothing which property you're using to disable the button?

@AllForNothing
Contributor

> en?
>
> This is very weird. Total is 0 but ongoing is true.
>
> @AllForNothing which property you're using to disable the button?

ongoing

@reasonerjt
Contributor

Let's be clear that this issue occurs because Harbor is shut down while the job is running, and the interrupted job stays blocked after Harbor is started again.

So it does not only affect upgrades.

@danfengliu
Contributor Author

@reasonerjt Yes, it happens regardless of upgrade.

@steven-zou
Contributor

I think the issue can be generalized: if Redis restarts while lots of jobs are in progress, jobs may be lost. Once those jobs cannot be recovered, the status data in the database goes out of sync and looks hung.
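
To make the out-of-sync condition concrete, here is a minimal Go sketch of a reconciliation pass. It is not Harbor's actual code: the status names, the `reap` function, and the two in-memory maps standing in for the database and the Redis-backed queue are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Illustrative status values; not Harbor's actual constants.
const (
	StatusRunning = "Running"
	StatusError   = "Error"
	StatusSuccess = "Success"
)

// reap (hypothetical) looks for jobs that the database still records as
// running but that the queue no longer knows about, marks them as failed,
// and returns the sorted IDs of the jobs it changed.
func reap(dbStatus map[string]string, liveInQueue map[string]bool) []string {
	var fixed []string
	for id, st := range dbStatus {
		if st == StatusRunning && !liveInQueue[id] {
			dbStatus[id] = StatusError
			fixed = append(fixed, id)
		}
	}
	sort.Strings(fixed)
	return fixed
}

func main() {
	// The database view of three jobs.
	db := map[string]string{
		"scan-1": StatusRunning,
		"scan-2": StatusRunning,
		"gc-1":   StatusSuccess,
	}
	// The queue view after a restart: scan-1 was lost.
	queue := map[string]bool{"scan-2": true}

	fixed := reap(db, queue)
	fmt.Println("reaped:", fixed)
	fmt.Println("scan-1 is now:", db["scan-1"])
}
```

With a pass like this, a job lost during a restart ends in a terminal state instead of staying in progress indefinitely, so the UI is no longer stuck waiting on it.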

@steven-zou
Contributor

@stuclem @xaleeks

Could we add release notes for such known issues?

If jobs are still running in the job service and Redis restarts for any reason (power outage, crash, etc.), jobs may be lost. That causes the out-of-sync status issues seen in the scan, replication, and GC scenarios. As a result, some statuses look hung and never change.

@stuclem
Contributor

stuclem commented Dec 11, 2019

@steven-zou yes of course. Added the kind/note label.

@renmaosheng
Contributor

@steven-zou we also need to mention the workaround when this happens.

@michmike
Contributor

what's the workaround and is it something that happens automatically or someone has to manually execute it?

@michmike
Contributor

does this "hung" issue happen 100% on each upgrade?

@steven-zou
Contributor

@michmike

No. It happens if and only if jobs are running when a Redis restart occurs for any reason. Most of the time, when users are upgrading, no jobs are running.

I'll continue to investigate this issue. The possible root cause is data loss after Redis restarts.

The current plan is to deliver a fix in the patch release 1.10.X.

@steven-zou
Contributor

steven-zou commented Dec 12, 2019

The workaround approaches, which involve manually triggering API calls, will be noted in the FAQ list.

@michmike
Contributor

Comments after discussion:
- This is not a new issue.
- The data lost could be in-progress jobs in the queue. The jobs could be replication, scanning, or GC.
  - However, replication will be eventually consistent on the next replication.
  - Scanning is safe since the next scan will work. For new images, policy will prevent them from being pulled until they are scanned.

I think we are fine to release with this ticket open and fix it in the next patch release. This is not data loss, because Harbor is eventually consistent based on user intent. A Redis cache restart (for whatever reason) will cause the same effects, and Harbor will be eventually consistent.

@stuclem
Contributor

stuclem commented Dec 13, 2019

Added to the RNs, so removing kind/note: https://github.com/goharbor/harbor/releases/tag/v1.10.0#known-issues

  • Scan status freezes after migration from 1.8 to 1.10 #9963
    If there are jobs running in the job service, and a Redis restart happens, for example because of a power outage or crash, status out of sync issues might appear in scan, replication or garbage collection tasks. Consequently, some task statuses freeze in their current state, and do not update themselves. The cause is known and this will be fixed in a patch.

    Workaround: Run the task again. Replication and scanning function correctly on the next run. No data is lost.

@steven-zou
Contributor

It may be related to this issue in the upstream project : gocraft/work#146

@reasonerjt reasonerjt added this to the Sprint 77 milestone Dec 31, 2019
steven-zou added a commit to steven-zou/harbor that referenced this issue Jan 8, 2020
- improve the status hook sending/resending approach
- improve the status compare and set approach
- simplify the relevant flow
- add reaper to fix the out of sync jobs
- fix goharbor#10244 , fix goharbor#9963

Signed-off-by: Steven Zou <szou@vmware.com>
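
The "status compare and set" item in the commit message can be illustrated with a small Go sketch. It is a guess at the general idea, not the code from the commit: the status names, their ranking, and the `compareAndSet` helper are hypothetical. The point is that a late or re-sent status hook should never move a job backwards.

```go
package main

import "fmt"

// rank orders statuses so that a job can only move forward.
// The names and the ordering are assumptions for illustration.
var rank = map[string]int{
	"Pending":   0,
	"Scheduled": 1,
	"Running":   2,
	"Stopped":   3,
	"Error":     3,
	"Success":   3,
}

// compareAndSet (hypothetical) applies next only if it is a known status
// that ranks strictly later than the current one. It reports whether the
// stored status was changed.
func compareAndSet(current *string, next string) bool {
	r, ok := rank[next]
	if !ok || r <= rank[*current] {
		return false
	}
	*current = next
	return true
}

func main() {
	status := "Pending"

	ok := compareAndSet(&status, "Running") // forward move
	fmt.Println(ok, status)

	ok = compareAndSet(&status, "Pending") // stale hook arriving late
	fmt.Println(ok, status)

	ok = compareAndSet(&status, "Success") // terminal state
	fmt.Println(ok, status)
}
```

A real implementation would have to perform the comparison and the write atomically in the backing store; this sketch only shows the ordering rule.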
steven-zou added a commit to steven-zou/harbor that referenced this issue Jan 9, 2020
- improve the status hook sending/resending approach
- improve the status compare and set approach
- simplify the relevant flow
- add reaper to fix the out of sync jobs
- fix goharbor#10244 , fix goharbor#9963

Signed-off-by: Steven Zou <szou@vmware.com>
@steven-zou
Contributor

The code work is done and is pending PR review.

Moving to the next sprint, SP 78.

@steven-zou steven-zou modified the milestones: Sprint 77, Sprint 78 Jan 17, 2020
wy65701436 pushed a commit to wy65701436/harbor that referenced this issue Jan 20, 2020
- improve the status hook sending/resending approach
- improve the status compare and set approach
- simplify the relevant flow
- add reaper to fix the out of sync jobs
- fix goharbor#10244 , fix goharbor#9963

Signed-off-by: Steven Zou <szou@vmware.com>
wy65701436 added a commit to wy65701436/harbor that referenced this issue Jan 20, 2020
* fix[jobservice]:job status is hung after restart

- improve the status hook sending/resending approach
- improve the status compare and set approach
- simplify the relevant flow
- add reaper to fix the out of sync jobs
- fix goharbor#10244 , fix goharbor#9963

Signed-off-by: Steven Zou <szou@vmware.com>

* add content trust middleware in new v2 handler

* fix

* add policy check middleware in v2 handler

* merge with latest

* fix

Co-authored-by: Steven Zou <loneghost1982@gmail.com>
steven-zou added a commit to steven-zou/harbor that referenced this issue Feb 6, 2020
- improve the status hook sending/resending approach
- improve the status compare and set approach
- simplify the relevant flow
- add reaper to fix the out of sync jobs
- fix goharbor#10244 , fix goharbor#9963

Signed-off-by: Steven Zou <szou@vmware.com>
AllForNothing pushed a commit to AllForNothing/harbor that referenced this issue Feb 7, 2020
- improve the status hook sending/resending approach
- improve the status compare and set approach
- simplify the relevant flow
- add reaper to fix the out of sync jobs
- fix goharbor#10244 , fix goharbor#9963

Signed-off-by: Steven Zou <szou@vmware.com>
hobti01 pushed a commit to hobti01/harbor that referenced this issue Feb 9, 2020
- improve the status hook sending/resending approach
- improve the status compare and set approach
- simplify the relevant flow
- add reaper to fix the out of sync jobs
- fix goharbor#10244 , fix goharbor#9963

Signed-off-by: Steven Zou <szou@vmware.com>