
Add sync lock to create job functions #75

Merged · 1 commit · Mar 2, 2017

Conversation

No-ops (Collaborator) commented Jan 16, 2017

Add sync lock to make create job calls thread safe.

sadlil (Collaborator) commented Feb 3, 2017

This method has a select {} at the end, so my thinking is it may block for a while if we use locking there. In my view, why not set lastcall = "c" + job-handle, or some random value, to prevent overwriting?

No-ops (Collaborator, Author) commented Feb 3, 2017

The function we saved is used when we get a JOB_CREATED response in the processLoop() function. That's the first time we see the job handle, so we can't use it as a key. We also need to be able to find the correct handler at that point, so a random value would not work either.

The only alternative I can think of that might work would be a queue, instead of a single function, to handle the JOB_CREATED responses we get. However, I can't find it documented anywhere that the gearman server guarantees JOB_CREATED responses arrive in the same order we sent the SUBMIT_JOB requests.
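
To make the dispatch concrete, here is a minimal sketch of the JOB_CREATED path. It is not the library's actual code: the type names and fields are assumptions, and the later sketches in this thread reuse these definitions. (JOB_CREATED is packet type 8 in the gearman protocol.)

```go
package gearsketch

// Response and ResponseHandler stand in for the library's decoded server
// packet and callback type.
type Response struct {
	DataType uint32
	Handle   string
}

type ResponseHandler func(*Response)

const dtJobCreated = 8 // JOB_CREATED packet type in the gearman protocol

// Client is a sketch; the real client has more fields.
type Client struct {
	responses    chan *Response // fed by the connection's read loop
	innerHandler map[string]ResponseHandler
}

// processLoop dispatches JOB_CREATED to whatever callback is currently
// stored under the fixed key "c". The job handle first appears here, in
// resp.Handle, so it cannot serve as the map key at submit time, and a
// random key chosen at submit time could never be looked up here.
func (c *Client) processLoop() {
	for resp := range c.responses {
		switch resp.DataType {
		case dtJobCreated:
			if h, ok := c.innerHandler["c"]; ok {
				h(resp)
			}
			// ... other response types elided ...
		}
	}
}
```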

sadlil (Collaborator) commented Feb 6, 2017

Putting "c" in the inner handler is safe because it puts the same handler in place for every call, so there is no overlapping. But lastcall = 'c' is not safe. lastcall is only used inside processLoop(): when the client receives an error, lastcall is used to map the error to the last calling job. That mapping is not safe; it can be overridden.

Digging a bit deeper, I found that this mapping is not required. A job failure is returned with WorkFail; Error is a kind of network error, so it could be ignored or maybe only logged.

cc: @mikespook

No-ops (Collaborator, Author) commented Feb 6, 2017

The function is not the same on every call, because it uses the "result" channel from the outer scope. This means that when the innerhandler["c"] function is called in processLoop() and there are several callers, it can deliver the job handle to the wrong job. We would then return the wrong handle from the select statement, which means the results of jobs may be sent as responses to the wrong one.
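
Reusing the sketch types from above, the racy shape of the create-job path is roughly this (sendSubmitJob stands in for writing the SUBMIT_JOB packet; all names are assumptions, not the actual code):

```go
// submitRacy installs a new closure over this caller's own result channel.
// Without a lock, a second caller can overwrite innerHandler["c"] before
// the first caller's JOB_CREATED arrives; processLoop then delivers the
// first handle to the second caller's channel, so the second caller gets
// the wrong handle and the first caller never gets one at all.
func (c *Client) submitRacy(sendSubmitJob func() error) (string, error) {
	result := make(chan string, 1)
	c.innerHandler["c"] = func(resp *Response) { // unsynchronized write
		result <- resp.Handle
	}
	if err := sendSubmitJob(); err != nil {
		return "", err
	}
	return <-result, nil // may be another caller's handle, or never arrive
}
```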

tamalsaha (Collaborator) commented Feb 6, 2017

@No-ops, I work with @sadlil and we are trying to solve this issue in our fork https://github.com/appscode/g2.

Based on my understanding, I think this problem can't be solved over a TCP connection shared between client and server. Say multiple goroutines in an application p1 are trying to submit background jobs. The first goroutine submits a job with uid_1, then the second goroutine submits a job with uid_2. Now the gearman server might return the JOB_CREATED responses with handle_1 and handle_2 in either order; the protocol spec guarantees no ordering. So the only way to be completely sure is to use a separate TCP connection (a separate client). Then the gearman server will return each JOB_CREATED response, with its job handle, to the correct client.

Using a mutex to block until the JOB_CREATED response for uid_1 is returned will fix this issue, but I think that may severely hamper the performance of the client.

Using a queue will not work, because the JOB_CREATED response does not carry any field from the original client request, so there is no way to associate a SUBMIT_JOB request with a JOB_CREATED response. IMHO, this is a mistake in the gearman protocol spec.

All the other client requests can share one connection, since both request and response contain the job handle.

I would like to hear your thoughts, @No-ops .

No-ops (Collaborator, Author) commented Feb 6, 2017

I agree that the inability to attach some sort of id to a SUBMIT_JOB that could be used to identify the related JOB_CREATED is an oversight in the gearman protocol. I also agree that my solution will slow down job creation significantly, but since it only slows down the same cases that would otherwise hit the race condition, I find that acceptable.

I haven't looked that deeply into the gearman protocol, but wouldn't WORK_COMPLETE be sent to the client on the same connection the SUBMIT_JOB was sent on? In that case the whole client library would have to use a dedicated connection for each job.

tamalsaha (Collaborator) commented:

> I haven't looked that deeply into the gearman protocol, but wouldn't WORK_COMPLETE be sent to the client on the same connection the SUBMIT_JOB was sent on? In that case the whole client library would have to use a dedicated connection for each job.

I am making the assumption that for a background job, the server will always send the JOB_CREATED response before sending any WORK_xxx responses. This can be enforced on the server and sounds quite reasonable.

mikespook (Owner) commented:

I think @No-ops is correct: there is a race in the function. But maybe we needn't use defer to unlock the mutex. Do you think putting the Unlock before lines 240 and 242 would be better?

No-ops (Collaborator, Author) commented Feb 8, 2017

@mikespook If we unlock at line 242 we don't solve the race condition. The race happens because innerhandler["c"] can be assigned a new function, with a different result channel, before we get a response and go through the select statement. The call needs to complete before we unlock and allow anyone else to write to the innerhandler["c"] callback, which means the unlock has to happen after the select statement.
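
In sketch form (continuing the assumed names from above, with a sync.Mutex now assumed to be embedded in Client), the locked region has to span the whole exchange:

```go
import (
	"errors"
	"time"
)

// submitLocked holds the mutex from the moment innerHandler["c"] is
// installed until the JOB_CREATED response has been consumed, so no other
// caller can replace the callback in between.
func (c *Client) submitLocked(sendSubmitJob func() error) (handle string, err error) {
	c.Lock()
	// Unlocking right after the write below would reopen the race: another
	// caller could replace innerHandler["c"] while we sit in the select.
	defer c.Unlock()

	result := make(chan string, 1)
	c.innerHandler["c"] = func(resp *Response) { result <- resp.Handle }

	if err = sendSubmitJob(); err != nil {
		return
	}
	select {
	case handle = <-result: // JOB_CREATED consumed; now safe to unlock
	case <-time.After(10 * time.Second): // assumed timeout value
		err = errors.New("no JOB_CREATED response")
	}
	return
}
```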

mikespook (Owner) commented Feb 8, 2017

@No-ops Oops, I think you are right.
Gearman's protocol is badly designed here. That innerhandler["c"] was a tricky design to make handling responses easier. However, it was designed several years ago; maybe we need a redesign here.

I've added you (@No-ops, @sadlil, and @tamalsaha) as collaborators; let's discuss what the better solution is. Thanks, all of you!

PS: I'm not using gearman in a production environment anymore, but I'm still interested in working on gearman-go.

tamalsaha (Collaborator) commented:

Thanks for adding us as collaborators, @mikespook. We think the client must be concurrency-safe. Here is my opinion on this issue:

  • Immediate term: merge this PR to add the lock. This fixes the bug.
  • Medium term: create an internal TCP connection for job submission and remove the lock.
  • Long term: define the next iteration of the protocol and fix the underlying issue in the spec. This is out of scope for this ticket.

No-ops (Collaborator, Author) commented Mar 2, 2017

Since this has been open for a while now and there are no objections, I'm going ahead and merging this.

sadlil (Collaborator) commented Mar 2, 2017

@No-ops is there any plan to move this forward with the proposal?

No-ops (Collaborator, Author) commented Mar 2, 2017

Do you mean the suggestion @tamalsaha made?

I don't think we can completely eliminate the risk of having to wait for the gearman server to send JOB_CREATED. The "inner connection" we create must be kept alive until the job finalizes if we want a response, so making one for each job would be resource intensive and error prone. I suggest we make a connection pool for creating jobs. That way we mitigate the problem without risking too many long-running TCP connections.
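
A rough sketch of that pool idea (every name here is hypothetical; dialing and the per-connection read loop are elided):

```go
// submitConn wraps one dedicated TCP connection for job creation, with its
// own JOB_CREATED handler and read loop (elided).
type submitConn struct{}

// submitPool hands out connections via a buffered channel, the simplest Go
// pool. Waiting for JOB_CREATED still blocks, but it only contends with
// other submits for a pooled connection, not with the whole client.
type submitPool struct {
	conns chan *submitConn
}

func newSubmitPool(size int) *submitPool {
	p := &submitPool{conns: make(chan *submitConn, size)}
	for i := 0; i < size; i++ {
		p.conns <- &submitConn{} // real code would dial the server here
	}
	return p
}

// submit borrows a connection, runs one SUBMIT_JOB/JOB_CREATED exchange on
// it alone, and returns it to the pool.
func (p *submitPool) submit(exchange func(*submitConn) (string, error)) (string, error) {
	c := <-p.conns // blocks only when every pooled connection is busy
	defer func() { p.conns <- c }()
	return exchange(c)
}
```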

No-ops (Collaborator, Author) commented Mar 2, 2017

Created a new issue for this, #77, so we can move the discussion over there.

sadlil pushed a commit to appscode/g2 that referenced this pull request Mar 6, 2017
…ob functions

Signed-off-by: sadlil <sadlil@appscode.com>
sadlil pushed a commit to appscode/g2 that referenced this pull request Mar 9, 2017
* remove rescheduling job while running worker stopped

* Separate jobDone, jobFailed, jobFailedWithException
Support CanDOTimeout

* remove dependency from db & fix some error

* Server and worker restart support
 - set timeout for every running job
 - monitor all running job & remove if timeout expire

* report to client that job failed with timeout exception

* generic db call for all.

* DB test fixed

* fix

* Merge Pull Request mikespook/gearman-go#75, Add sync lock to create job functions

Signed-off-by: sadlil <sadlil@appscode.com>