Skip to content

Fix paused sync file move issue #5949#5953

Merged
ckamm merged 6 commits into2.4from
fix_5949
Oct 17, 2017
Merged

Fix paused sync file move issue #5949#5953
ckamm merged 6 commits into2.4from
fix_5949

Conversation

@mrow4a
Copy link
Copy Markdown
Contributor

@mrow4a mrow4a commented Aug 10, 2017

@SamuAlfageme can you verify that this fixes issue for ChunkingNG for you? I checked on my local machine and this fixed it.

@guruz I will also add similar thing, but for V1, however there I will check if the file bytes been sent or not.

Issue #5949

Comment thread src/libsync/owncloudpropagator.h Outdated
if (_runningJobs.size() == 0) {
break;
}
Utility::sleep(1);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is required, because if composite job gets aborted, but one of the job isnt, it will not finalize adding file to journal

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but that's not acceptable. You can't let the main thread sleep like this.
You need to find a better way to wait that the job is actually finished (by listening to the finished signal or so)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh, true, just tested and it stops the main thread, yep will do that with signals

Comment thread src/libsync/propagateuploadng.cpp Outdated
foreach (AbstractNetworkJob *job, _jobs) {
// Abort only PUT and MKDIR jobs, since abording MoveJob
// might result in conflict.
if (job->reply() && !job->inherits("OCC::MoveJob")) {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it is not MoveJob, just exit. (Wont it result in tmp new chunking directory be left without deletion, is server deleting periodicaly these tmp directories? @DeepDiver1975 )

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: use qobject_cast (more efficient and also compile-time checked)

Also, we shold probzbly decrease the timeout. For example, if the abort happens because the connection was detected to be dropped, we should not wait a full HTTP timeout again, IMHO.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And to answer your question about stale directory, the next sync will resume the upload, so it won't be stale.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to qobject_cast.

Copy link
Copy Markdown
Contributor

@ogoffart ogoffart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That goes in the right direction,

Also the chunkingv1 need to be changed not to abort the last job. This is also what is used when the file is too small to be split in chunks.

Comment thread src/libsync/owncloudpropagator.h Outdated
if (_runningJobs.size() == 0) {
break;
}
Utility::sleep(1);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but that's not acceptable. You can't let the main thread sleep like this.
You need to find a better way to wait that the job is actually finished (by listening to the finished signal or so)

Comment thread src/libsync/propagateuploadng.cpp Outdated
foreach (AbstractNetworkJob *job, _jobs) {
// Abort only PUT and MKDIR jobs, since abording MoveJob
// might result in conflict.
if (job->reply() && !job->inherits("OCC::MoveJob")) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: use qobject_cast (more efficient and also compile-time checked)

Also, we shold probzbly decrease the timeout. For example, if the abort happens because the connection was detected to be dropped, we should not wait a full HTTP timeout again, IMHO.

@guruz
Copy link
Copy Markdown
Contributor

guruz commented Aug 11, 2017

Also the chunkingv1 need to be changed not to abort the last job. This is also what is used when the file is too small to be split in chunks.

I think it's better to hook into how UploadDevice is used and from there control an abort. Then you for sure know you didn't send all bytes and can safely abort.

@mrow4a
Copy link
Copy Markdown
Contributor Author

mrow4a commented Aug 14, 2017

@ogoffart can you have a look on this approach? One disadvantage is that now after paused, MOVE job continues (1st stage):
selection_162

after it is finished, view switches to (2nd stage):
selection_163

One thing is that the button resume sync is still available also in "1st stage", not sure what happens if it still processes MOVE, but someone clicks resume in that time.

@SamuAlfageme I did manual testing and for ChunkingNG problem is solved.

Comment thread src/libsync/owncloudpropagator.h Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not needed, abort was already sent

So you don't need to put abortJobs in a different function

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea here was to force the abort. Thus if after 5 seconds they did not finish, you kill the job just by sending abort jobs again, please have a look on chunkingng implementation

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, maybe in fact I dont need it also... it will roll back to what was before this PR for the jobs which did not terminate within 5sec

Comment thread src/libsync/owncloudpropagator.cpp Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to count the jobs and only emit the signal when ALL subjobs have finished their abort.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah no. i got confused with CompositeJob.

Well, you don't need a slot for this single line, just connect to SIGNAL(abortFinished())

Copy link
Copy Markdown
Contributor Author

@mrow4a mrow4a Aug 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

abortFinished signal is being send after 5 seconds timeout regardless if jobs terminated in that time or not. I give jobs max 5 seconds to finish what they started after receiving first abort signal.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe I should change namings, since I see your confusion.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ogoffart so I can connect signal to signal?

@ogoffart
Copy link
Copy Markdown
Contributor

Looks mostly good. But don't forget the put in the V1, this one is even more important as it happens more often.

@mrow4a
Copy link
Copy Markdown
Contributor Author

mrow4a commented Aug 14, 2017

Yep, I will do now V1, just wanted to verify that aborting implementation will work for NG (if it will work there, it will probably work everywhere just by implementing abort() in the job)

@mrow4a mrow4a force-pushed the fix_5949 branch 2 times, most recently from 89b7324 to 6d533eb Compare August 14, 2017 20:27
@mrow4a
Copy link
Copy Markdown
Contributor Author

mrow4a commented Aug 15, 2017

Guys, I tried also the version where CompositeJob waits for aborts from its child, but for some reason with "not aborted MOVE", it was not working properly (looks like a mix of finished and abortFinished signals or even their lack messes things). It took me quite long to debug everything.

Should I try once again Timeout in the job itself, or keep with version of timeout in composite job (which will finish sync run no matter what since it does not care if subjob finished abort or not) ?

@mrow4a mrow4a changed the title [WIP] Fix paused sync file move issue #5949 Fix paused sync file move issue #5949 Aug 15, 2017
@mrow4a
Copy link
Copy Markdown
Contributor Author

mrow4a commented Aug 15, 2017

@ogoffart @guruz @SamuAlfageme PR ready, fixes both V1 and NG

Comment thread src/libsync/owncloudpropagator.cpp Outdated
void PropagatorCompositeJob::abort()
{
foreach (PropagatorJob *j, _runningJobs)
j->abort();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use { and }

Comment thread src/libsync/owncloudpropagator.h Outdated
foreach (PropagatorJob *j, _runningJobs)
j->abort();
}
virtual void abort() Q_DECL_OVERRIDE;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also document here that the abort() might not be synchronous/immediate

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the PropagatorJob::abort documentation should state that this function must emit abortFinished

guruz
guruz previously requested changes Aug 15, 2017
Copy link
Copy Markdown
Contributor

@guruz guruz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm missing a bit of info what happens in a situation where an immediate abort does not work. What will happen after those 5000msec? It will go to emitFinished but what happens with the running stuff?

Also added some other comments:)

@mrow4a
Copy link
Copy Markdown
Contributor Author

mrow4a commented Aug 15, 2017

I'm missing a bit of info what happens in a situation where an immediate abort does not work. What will > happen after those 5000msec? It will go to emitFinished but what happens with the running stuff?

@guruz Good question, in my observation when job finishes the sync run is already gone, thus whatever finishes after that time is ignored.

I think I will try to implement "forced abort".

@guruz
Copy link
Copy Markdown
Contributor

guruz commented Aug 15, 2017

I'm not sure, maybe the QNetworkReplys are parented to the Jobs which get deleted when SyncEngine is deleted.
(Which is the force abort, basically)

@mrow4a
Copy link
Copy Markdown
Contributor Author

mrow4a commented Aug 15, 2017

@guruz how to check it? @jturcotte

@jturcotte
Copy link
Copy Markdown
Member

jturcotte commented Aug 15, 2017

Maybe just call abort() in the code where you think the issue could appear.

For example in tests we have FakeFolder::execUntilItemCompleted that allows you to something right after an itemCompleted signal. But you could also just test this manually in the code with a QTimer or something like that.

Copy link
Copy Markdown
Contributor

@ogoffart ogoffart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the composite job, you need to increment a counter for every subjob you call abort(). connectconnect to the abortFinished on each job Only when all job have emited the abortFinished, you can safely emit the CompositeJob's abortFinished.

Note that all other jobs need to emit abortfinished, so you need to review every job's abort to make sure they emit that signal.

Comment thread src/libsync/owncloudpropagator.h Outdated
{
// Abort first job and sub jobs
// Finished abort will be announced via abortFinished()
// (look contructor implementation)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Maybe you can do the connection here.)

Here as well you need to have a counter to make sure that both jobs sent their abortFinished.

@mrow4a
Copy link
Copy Markdown
Contributor Author

mrow4a commented Aug 15, 2017

@ogoffart I did that, and for some reason it was not working properly e.g. finished signal never went back after aborting, and other issues... I can give a try once again on that approach..

I just discovered that I also have to be careful not to break abort mechanism on network error with move, since it will finish in infinite aborts loop (it will retry all the time).

@ckamm
Copy link
Copy Markdown
Contributor

ckamm commented Sep 22, 2017

I've rebased to master and will review the current state to help move this forward.

@ckamm
Copy link
Copy Markdown
Contributor

ckamm commented Sep 25, 2017

@mrow4a Looks good. My remaining worry is about what happens when abort(Async) is called twice. Do we guard against this somewhere?

I've also added my test case here.

@mrow4a
Copy link
Copy Markdown
Contributor Author

mrow4a commented Sep 25, 2017

I mean, if you call it again, it will basically let unfinished job run (as with previous call of abort(Async)), the other jobs were synchronously aborted with previous call. The only difference is, that now you might have 2 timers (timeout timers) running in parallel, and you add another connects on signals. @ogoffart is that a problem?

@ckamm
Copy link
Copy Markdown
Contributor

ckamm commented Sep 28, 2017

@mrow4a I added a minor guard against it that seems sufficient for me.

@ckamm
Copy link
Copy Markdown
Contributor

ckamm commented Sep 28, 2017

@mrow4a @guruz @ogoffart Are there any remaining concerns (also about the code I added)? Can we merge this?

Copy link
Copy Markdown
Contributor

@ogoffart ogoffart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment thread src/libsync/propagateupload.h Outdated
PropagateUploadFileV1(OwncloudPropagator *propagator, const SyncFileItemPtr &item)
: PropagateUploadFileCommon(propagator, item)
, _startChunk(0)
, _currentChunk(0)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this could be initialized in the header file.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I step in and fix it, or @ckamm ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I missed the notifications for this. Looking now.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mrow4a (in general, feel free to just fix issues if I'm unresponsive!)

if (abortType == AbortType::Asynchronous && (((_currentChunk + _startChunk) % _chunkCount) == 0)
&& putJob->device()->atEnd()) {
if (abortType == AbortType::Asynchronous
&& _chunkCount > 0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder how chunk count can be 0

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

chunkCount gets set by doStartUpload and the job's abort can be called before we get there.

// since this might result in conflicts
if (PUTFileJob *putJob = qobject_cast<PUTFileJob *>(job)){
if (abortType == AbortType::Asynchronous
&& _chunkCount > 0
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate more here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is safety check right?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above. I was getting crashes because of zero chunkCount when I aborted jobs quickly.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh that would make sense did not try indeed this corner case.

@mrow4a
Copy link
Copy Markdown
Contributor Author

mrow4a commented Oct 5, 2017

Any progress here?

@ckamm
Copy link
Copy Markdown
Contributor

ckamm commented Oct 6, 2017

@mrow4a I think you can merge this now

@mrow4a
Copy link
Copy Markdown
Contributor Author

mrow4a commented Oct 6, 2017

What is happening with Jenkins build? Did you run tests locally?

@ckamm
Copy link
Copy Markdown
Contributor

ckamm commented Oct 6, 2017

No, I hadn't. The test that fails to compile on jenkins works locally.

The problem here is that jenkins merges on top of master, and this patch needs some adjustments to work there. I'll take care of the rebase and fixups.

mrow4a and others added 6 commits October 6, 2017 13:04
Dont abort final chunk immedietally

Use sync and async aborts
_chunkCount could be 0, leading to a floating point exception

I also added initializers for several uninitialized integers in the
upload jobs.
@ckamm
Copy link
Copy Markdown
Contributor

ckamm commented Oct 6, 2017

Jenkins should pass now, let's wait and see.

@ckamm ckamm added this to the 2.4.0 milestone Oct 9, 2017
@ckamm
Copy link
Copy Markdown
Contributor

ckamm commented Oct 9, 2017

@DeepDiver1975 I think we might want this in 2.4, but jenkins failed in something after running the tests. Is this a transient issue where retriggering jenkins would help? Could you do that?

@ogoffart
Copy link
Copy Markdown
Contributor

ogoffart commented Oct 13, 2017

We anyway need to merge this to the 2.4 branch if we want this in 2.4 so it will need to be re-targetted

@guruz guruz dismissed their stale review October 16, 2017 13:05

(others reviewed)

@ckamm ckamm changed the base branch from master to 2.4 October 17, 2017 07:44
@ckamm ckamm merged commit b2a8ffc into 2.4 Oct 17, 2017
@ckamm ckamm deleted the fix_5949 branch October 17, 2017 07:44
@ckamm
Copy link
Copy Markdown
Contributor

ckamm commented Oct 17, 2017

Targeted to 2.4 and merged!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants