
Add sync lock to create job functions #75

Merged · 1 commit · Mar 2, 2017

Conversation

No-ops (Collaborator) commented Jan 16, 2017

Add sync lock to make create job calls thread safe.

sadlil (Collaborator) commented Feb 3, 2017

This method has a select {} at the end, so my thinking is it may block for a while if we use locking there. In my view, why not set lastcall = "c" + job-handle, or some random value, to prevent overwriting?

No-ops (Collaborator, Author) commented Feb 3, 2017

The function we saved is used when we get a JOB_CREATED response in the processLoop() function. That's the first time we see the job handle, so we can't use it as a key. We also need to be able to find the correct handler at that point, so a random value would not work either.

The only alternative I can think of that might work would be a queue, instead of a single function, to handle the JOB_CREATED responses we get. However, I can't find it documented anywhere that the gearman server guarantees JOB_CREATED responses arrive in the same order we sent the SUBMIT_JOB requests.
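
To make the dispatch concrete, here is a minimal sketch of the JOB_CREATED path. It is not the library's actual code: the type names and fields are assumptions, and the later sketches in this thread reuse these definitions. (JOB_CREATED is packet type 8 in the gearman protocol.)

```go
package gearsketch

// Response and ResponseHandler stand in for the library's decoded server
// packet and callback type.
type Response struct {
	DataType uint32
	Handle   string
}

type ResponseHandler func(*Response)

const dtJobCreated = 8 // JOB_CREATED packet type in the gearman protocol

// Client is a sketch; the real client has more fields.
type Client struct {
	responses    chan *Response // fed by the connection's read loop
	innerHandler map[string]ResponseHandler
}

// processLoop dispatches JOB_CREATED to whatever callback is currently
// stored under the fixed key "c". The job handle first appears here, in
// resp.Handle, so it cannot serve as the map key at submit time, and a
// random key chosen at submit time could never be looked up here.
func (c *Client) processLoop() {
	for resp := range c.responses {
		switch resp.DataType {
		case dtJobCreated:
			if h, ok := c.innerHandler["c"]; ok {
				h(resp)
			}
			// ... other response types elided ...
		}
	}
}
```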

sadlil (Collaborator) commented Feb 6, 2017

Putting "c" in the inner handler is safe because it puts the same handler in place for every call, so there is no overlapping. But lastcall = 'c' is not safe. lastcall is only used inside processLoop(): when the client receives an error, lastcall is used to map the error to the last calling job. That mapping is not safe; it can be overridden.

Digging a bit deeper, I found that this mapping is not required. A job failure is returned with WorkFail; Error is a kind of network error, so it could be ignored or maybe only logged.

cc: @mikespook

No-ops (Collaborator, Author) commented Feb 6, 2017

The function is not the same on every call, because it uses the "result" channel from the outer scope. This means that when the innerhandler["c"] function is called in processLoop() and there are several callers, it can deliver the job handle to the wrong job. We would then return the wrong handle from the select statement, which means the results of jobs may be sent as responses to the wrong one.
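
Reusing the sketch types from above, the racy shape of the create-job path is roughly this (sendSubmitJob stands in for writing the SUBMIT_JOB packet; all names are assumptions, not the actual code):

```go
// submitRacy installs a new closure over this caller's own result channel.
// Without a lock, a second caller can overwrite innerHandler["c"] before
// the first caller's JOB_CREATED arrives; processLoop then delivers the
// first handle to the second caller's channel, so the second caller gets
// the wrong handle and the first caller never gets one at all.
func (c *Client) submitRacy(sendSubmitJob func() error) (string, error) {
	result := make(chan string, 1)
	c.innerHandler["c"] = func(resp *Response) { // unsynchronized write
		result <- resp.Handle
	}
	if err := sendSubmitJob(); err != nil {
		return "", err
	}
	return <-result, nil // may be another caller's handle, or never arrive
}
```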

tamalsaha (Collaborator) commented Feb 6, 2017

@No-ops, I work with @sadlil and we are trying to solve this issue in our fork https://github.com/appscode/g2.

Based on my understanding, I think this problem can't be solved over a TCP connection shared between client and server. Say multiple goroutines in an application p1 are trying to submit background jobs. The first goroutine submits a job with uid_1, then the second goroutine submits a job with uid_2. Now the gearman server might return the JOB_CREATED responses with handle_1 and handle_2 in either order; the protocol spec guarantees no ordering. So the only way to be completely sure is to use a separate TCP connection (a separate client). Then the gearman server will return each JOB_CREATED response, with its job handle, to the correct client.

Using a mutex to block until the JOB_CREATED response for uid_1 is returned will fix this issue, but I think that may severely hamper the performance of the client.

Using a queue will not work, because the JOB_CREATED response does not carry any field from the original client request, so there is no way to associate a SUBMIT_JOB request with a JOB_CREATED response. IMHO, this is a mistake in the gearman protocol spec.

All the other client requests can share one connection, since both request and response contain the job handle.

I would like to hear your thoughts, @No-ops .

No-ops (Collaborator, Author) commented Feb 6, 2017

I agree that the inability to attach some sort of id to a SUBMIT_JOB that could be used to identify the related JOB_CREATED is an oversight in the gearman protocol. I also agree that my solution will slow down job creation significantly, but since it only slows down the same cases that would otherwise hit the race condition, I find that acceptable.

I haven't looked that deeply into the gearman protocol, but wouldn't WORK_COMPLETE be sent to the client on the same connection the SUBMIT_JOB was sent on? In that case the whole client library would have to use a dedicated connection for each job.

tamalsaha (Collaborator) commented:

> I haven't looked that deeply into the gearman protocol, but wouldn't WORK_COMPLETE be sent to the client on the same connection the SUBMIT_JOB was sent on? In that case the whole client library would have to use a dedicated connection for each job.

I am making the assumption that for a background job, the server will always send the JOB_CREATED response before sending any WORK_xxx responses. This can be enforced on the server and sounds quite reasonable.

mikespook (Owner) commented:

I think @No-ops is correct: there is a race in the function. But maybe we needn't use defer to unlock the mutex. Do you think putting the Unlock before lines 240 and 242 would be better?

No-ops (Collaborator, Author) commented Feb 8, 2017

@mikespook If we unlock at line 242 we don't solve the race condition. The race happens because innerhandler["c"] can be assigned a new function, with a different result channel, before we get a response and go through the select statement. The call needs to complete before we unlock and allow anyone else to write to the innerhandler["c"] callback, which means the unlock has to happen after the select statement.
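
In sketch form (continuing the assumed names from above, with a sync.Mutex now assumed to be embedded in Client), the locked region has to span the whole exchange:

```go
import (
	"errors"
	"time"
)

// submitLocked holds the mutex from the moment innerHandler["c"] is
// installed until the JOB_CREATED response has been consumed, so no other
// caller can replace the callback in between.
func (c *Client) submitLocked(sendSubmitJob func() error) (handle string, err error) {
	c.Lock()
	// Unlocking right after the write below would reopen the race: another
	// caller could replace innerHandler["c"] while we sit in the select.
	defer c.Unlock()

	result := make(chan string, 1)
	c.innerHandler["c"] = func(resp *Response) { result <- resp.Handle }

	if err = sendSubmitJob(); err != nil {
		return
	}
	select {
	case handle = <-result: // JOB_CREATED consumed; now safe to unlock
	case <-time.After(10 * time.Second): // assumed timeout value
		err = errors.New("no JOB_CREATED response")
	}
	return
}
```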

mikespook (Owner) commented Feb 8, 2017

@No-ops Oops, I think you are right.
Gearman's protocol is badly designed here. That innerhandler["c"] was a tricky design to make handling responses easier. However, it was designed several years ago; maybe we need a redesign here.

I've added you (@No-ops, @sadlil, and @tamalsaha) as collaborators; let's discuss what the better solution is. Thanks, all of you!

PS: I'm not using gearman in a production environment anymore, but I'm still interested in working on gearman-go.

tamalsaha (Collaborator) commented:

Thanks for adding us as collaborators, @mikespook. We think the client must be concurrency-safe. Here is my opinion on this issue:

  • Immediate term: merge this PR to add the lock. This fixes the bug.
  • Medium term: create an internal TCP connection for job submission and remove the lock.
  • Long term: define the next iteration of the protocol and fix the underlying issue in the spec. This is out of scope for this ticket.

No-ops (Collaborator, Author) commented Mar 2, 2017

Since this has been open for a while now and there are no objections, I'm going ahead and merging this.

sadlil (Collaborator) commented Mar 2, 2017

@No-ops is there any plan to move this forward with the proposal?

No-ops (Collaborator, Author) commented Mar 2, 2017

Do you mean the suggestion @tamalsaha made?

I don't think we can completely eliminate the risk of having to wait for the gearman server to send JOB_CREATED. The "inner connection" we create must be kept alive until the job finalizes if we want a response, so making one for each job would be resource intensive and error prone. I suggest we make a connection pool for creating jobs. That way we mitigate the problem without risking too many long-running TCP connections.
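
A rough sketch of that pool idea (every name here is hypothetical; dialing and the per-connection read loop are elided):

```go
// submitConn wraps one dedicated TCP connection for job creation, with its
// own JOB_CREATED handler and read loop (elided).
type submitConn struct{}

// submitPool hands out connections via a buffered channel, the simplest Go
// pool. Waiting for JOB_CREATED still blocks, but it only contends with
// other submits for a pooled connection, not with the whole client.
type submitPool struct {
	conns chan *submitConn
}

func newSubmitPool(size int) *submitPool {
	p := &submitPool{conns: make(chan *submitConn, size)}
	for i := 0; i < size; i++ {
		p.conns <- &submitConn{} // real code would dial the server here
	}
	return p
}

// submit borrows a connection, runs one SUBMIT_JOB/JOB_CREATED exchange on
// it alone, and returns it to the pool.
func (p *submitPool) submit(exchange func(*submitConn) (string, error)) (string, error) {
	c := <-p.conns // blocks only when every pooled connection is busy
	defer func() { p.conns <- c }()
	return exchange(c)
}
```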

No-ops (Collaborator, Author) commented Mar 2, 2017

Created a new issue for this, #77, so we can move the discussion over there.

sadlil pushed a commit to appscode/g2 that referenced this pull request Mar 6, 2017
…ob functions

Signed-off-by: sadlil <sadlil@appscode.com>
sadlil pushed a commit to appscode/g2 that referenced this pull request Mar 9, 2017
* remove rescheduling job while running worker stopped

* Separate jobDone, jobFailed, jobFailedWithException
Support CanDOTimeout

* remove dependency from db & fix some error

* Server and worker restart support
 - set timeout for every running job
 - monitor all running job & remove if timeout expire

* report to client that job failed with timeout exception

* generic db call for all.

* DB test fixed

* fix

* Merge Pull Request mikespook/gearman-go#75, Add sync lock to create job functions

Signed-off-by: sadlil <sadlil@appscode.com>