Scan all status is hung after migration from 1.8 to 1.10 #9963

danfengliu · 2019-11-22T03:21:51Z

Test Steps:
1. Deploy Harbor 1.8.5;
2. Set the Scan all schedule to 12 12 12 * * *;
3. Wait more than 12 minutes;
4. Update Harbor to 1.10;
5. View the "Interrogation Services" page;
Result:
The page stays in scan waiting status for a long time.
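
For reference, the schedule in step 2 has six fields. Assuming a six-field cron layout with seconds first (an assumption worth checking against the documentation for your Harbor version), the expression fires once a day at 12:12:12. The sketch below only labels the fields; `describeCron` is a hypothetical helper, not part of Harbor.

```go
package main

import (
	"fmt"
	"strings"
)

// describeCron labels the fields of a six-field cron expression.
// The seconds-first layout is an assumption about the schedule format.
func describeCron(expr string) (map[string]string, error) {
	names := []string{"second", "minute", "hour", "day-of-month", "month", "day-of-week"}
	fields := strings.Fields(expr)
	if len(fields) != len(names) {
		return nil, fmt.Errorf("expected %d fields, got %d", len(names), len(fields))
	}
	out := make(map[string]string, len(names))
	for i, n := range names {
		out[n] = fields[i]
	}
	return out, nil
}

func main() {
	desc, err := describeCron("12 12 12 * * *")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("hour=%s minute=%s second=%s\n", desc["hour"], desc["minute"], desc["second"])
}
```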

danfengliu · 2019-12-09T08:52:46Z

steven-zou · 2019-12-10T06:05:37Z

Hm?

This is very weird. Total is 0 but ongoing is true.

@AllForNothing which property are you using to disable the button?

AllForNothing · 2019-12-10T10:43:48Z

> This is very weird. Total is 0 but ongoing is true.
>
> @AllForNothing which property you're using to disable the button?

ongoing

reasonerjt · 2019-12-11T02:27:23Z

To be clear, this issue occurs because Harbor is shut down while the job is running, and the interrupted job stays blocked after Harbor is started again.

So it does not only affect upgrades.

danfengliu · 2019-12-11T02:29:55Z

@reasonerjt Yes, it happens regardless of upgrade.

steven-zou · 2019-12-11T08:17:23Z

I think the issue can be generalized: if Redis restarts while lots of jobs are in progress, jobs may be lost. Once the jobs cannot be recovered, the status data in the database goes out of sync and appears hung.

steven-zou · 2019-12-11T08:25:05Z

@stuclem @xaleeks

Could we add release notes for such known issues?

If jobs are still running in the job service and Redis restarts for any reason (power outage, crash, etc.), jobs may be lost. That causes status out-of-sync issues in the scan, replication or GC scenarios. As a result, some statuses appear hung and no longer change.
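
The out-of-sync condition described here can be pictured as a mismatch between two stores: the database still records a task as in progress, while the job queue no longer has a record of the job. A minimal sketch follows, assuming hypothetical names (`Task`, `isFinal`, `findOrphans`) and status strings; none of them are taken from Harbor's code base.

```go
package main

import "fmt"

// Task is a hypothetical status row as kept in the database.
type Task struct {
	JobID  string
	Status string // illustrative values: "Pending", "Running", "Success", "Error", "Stopped"
}

// isFinal reports whether a status is terminal and will not change again.
func isFinal(status string) bool {
	return status == "Success" || status == "Error" || status == "Stopped"
}

// findOrphans returns tasks the database still considers in progress
// although the job queue (liveJobs) has no record of them. This is the
// situation that can arise if queue data is lost on a Redis restart.
func findOrphans(tasks []Task, liveJobs map[string]bool) []Task {
	var orphans []Task
	for _, t := range tasks {
		if !isFinal(t.Status) && !liveJobs[t.JobID] {
			orphans = append(orphans, t)
		}
	}
	return orphans
}

func main() {
	tasks := []Task{
		{JobID: "a", Status: "Running"},
		{JobID: "b", Status: "Success"},
		{JobID: "c", Status: "Running"},
	}
	live := map[string]bool{"c": true} // job "a" was lost from the queue
	for _, t := range findOrphans(tasks, live) {
		fmt.Printf("task %s looks hung (status %s)\n", t.JobID, t.Status)
	}
}
```

In this sketch only task `a` is reported: `b` has already finished and `c` is still known to the queue.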

stuclem · 2019-12-11T08:26:28Z

@steven-zou yes of course. Added the kind/note label.

renmaosheng · 2019-12-12T01:30:00Z

@steven-zou we also need to mention the workaround when this happens.

michmike · 2019-12-12T01:42:57Z

what's the workaround and is it something that happens automatically or someone has to manually execute it?

michmike · 2019-12-12T01:45:01Z

does this "hung" issue happen 100% on each upgrade?

steven-zou · 2019-12-12T01:56:37Z

@michmike

No. It happens if and only if jobs are running and a Redis restart then occurs for any reason. Most of the time, there are no jobs running when users upgrade.

I'll continue to investigate this issue. The possible root cause is data loss after Redis restarts.

The current plan is to deliver a fix in the patch release 1.10.X.

steven-zou · 2019-12-12T01:57:33Z

The workaround, which is to trigger API calls manually, will be noted in the FAQ list.

michmike · 2019-12-12T05:08:52Z

Comments after discussion:
- This is not a new issue.
- Data loss could be: in-progress jobs in the queue. Jobs could be replication, scanning, GC.
  - However, replication will be eventually consistent on the next replication.
  - Scanning is safe since the next scan will work. For new images, policy will prevent them from being pulled until they are scanned.

I think we are fine to release with this ticket and fix it in the next patch release. This is not data loss, because Harbor is eventually consistent based on user intent. A Redis cache restart (for whatever reason) will cause the same effects and Harbor will be eventually consistent.

stuclem · 2019-12-13T08:16:39Z

Added to the release notes, so removing kind/note: https://github.com/goharbor/harbor/releases/tag/v1.10.0#known-issues

Scan status freezes after migration from 1.8 to 1.10 #9963
If there are jobs running in the job service, and a Redis restart happens, for example because of a power outage or crash, status out of sync issues might appear in scan, replication or garbage collection tasks. Consequently, some task statuses freeze in their current state, and do not update themselves. The cause is known and this will be fixed in a patch.

Workaround: Run the task again. Replication and scanning function correctly on the next run. No data is lost.

steven-zou · 2019-12-28T04:10:40Z

It may be related to this issue in the upstream project : gocraft/work#146

- improve the status hook sending/resending approach
- improve the status compare and set approach
- simplify the relevant flow
- add reaper to fix the out of sync jobs
- fix goharbor#10244 , fix goharbor#9963

Signed-off-by: Steven Zou <szou@vmware.com>
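
The "status compare and set" and "reaper" items in this commit message can be illustrated roughly as follows. This is a sketch under assumptions, not the code from the referenced fix: the status names, their ordering, and the `Store` type with its methods are all hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// rank orders statuses so that a late or replayed status hook cannot move
// a job backwards. Names and ordering are illustrative assumptions.
var rank = map[string]int{
	"Pending": 0, "Scheduled": 1, "Running": 2,
	"Success": 3, "Error": 3, "Stopped": 3,
}

const finalRank = 3

// Store is a hypothetical in-memory stand-in for the status table.
type Store struct {
	mu     sync.Mutex
	status map[string]string
}

func NewStore() *Store { return &Store{status: make(map[string]string)} }

// CompareAndSet applies next only if it ranks strictly higher than the
// current status; unknown statuses and stale updates are rejected.
func (s *Store) CompareAndSet(jobID, next string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	nr, ok := rank[next]
	if !ok {
		return false
	}
	if cur, exists := s.status[jobID]; exists && rank[cur] >= nr {
		return false
	}
	s.status[jobID] = next
	return true
}

// Reap marks every non-final job that the queue no longer tracks as
// "Error" and returns the IDs it changed.
func (s *Store) Reap(live map[string]bool) []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	var reaped []string
	for id, st := range s.status {
		if rank[st] < finalRank && !live[id] {
			s.status[id] = "Error"
			reaped = append(reaped, id)
		}
	}
	return reaped
}

func main() {
	s := NewStore()
	s.CompareAndSet("j1", "Running")
	fmt.Println("stale update accepted:", s.CompareAndSet("j1", "Pending"))
	fmt.Println("reaped:", s.Reap(map[string]bool{})) // queue lost j1
}
```

The design choice in this sketch is that a terminal status is never overwritten, so a status hook that arrives after the reaper has run is simply ignored.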

steven-zou · 2020-01-17T09:33:01Z

The code work is done and is pending PR review.

Moving to the next sprint (Sprint 78).

* fix[jobservice]:job status is hung after restart
  - improve the status hook sending/resending approach
  - improve the status compare and set approach
  - simplify the relevant flow
  - add reaper to fix the out of sync jobs
  - fix goharbor#10244 , fix goharbor#9963

  Signed-off-by: Steven Zou <szou@vmware.com>
* add content trust middleware in new v2 handler
* fix
* add policy check middleware in v2 handler
* merge with latest
* fix

Co-authored-by: Steven Zou <loneghost1982@gmail.com>


reasonerjt assigned steven-zou Nov 22, 2019

reasonerjt added the area/interrogation-service label Nov 22, 2019

steven-zou added area/job-services known-issue labels Dec 11, 2019

stuclem added the kind/note label Dec 11, 2019

stuclem removed the kind/note label Dec 13, 2019

steven-zou added the target/1.10.1 label Dec 17, 2019

xaleeks added target/2.0.0 and removed target/1.10.1 labels Dec 30, 2019

steven-zou added target/1.10.1 and removed target/2.0.0 labels Dec 31, 2019

reasonerjt added this to the Sprint 77 milestone Dec 31, 2019

steven-zou mentioned this issue Jan 8, 2020

fix[jobservice]:job status is hung after restart #10429

Merged

steven-zou modified the milestones: Sprint 77, Sprint 78 Jan 17, 2020

steven-zou closed this as completed in #10429 Jan 20, 2020

steven-zou mentioned this issue Feb 6, 2020

[CHERRY-PICK] fix[jobservice]:job status is hung after restart #10649

Merged
