Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SDK locking #646

Open
natario1 opened this issue Apr 28, 2017 · 20 comments
Open

SDK locking #646

natario1 opened this issue Apr 28, 2017 · 20 comments

Comments

@natario1
Copy link
Contributor

I had some time to investigate an issue that looks serious.
I think the SDK hangs with a deadlock somewhere when doing too many operations concurrently, where too many depends on the device but is a pretty low number (even down to 3).

Working case

The concurrent policy is managed by bolts in this class. You can see that CORE_POOL_SIZE is the number of processors + 1.

When you schedule too much operation and exceed the pool size, tasks are scheduled. You can see this in action calling this function:

private Task<Void> tryHangingWithBolts() {
    List<Task<Void>> tasks = new ArrayList<>();
    for (int i = 0; i < 50; i++) {
        final int num = i;
        tasks.add(Task.callInBackground(new Callable<Void>() {
            @Override
            public Void call() throws Exception {
                Log.e("Bolts", "About to start task "+num);
                Task.delay(3000).waitForCompletion();
                Log.e("Bolts", "Finished task "+num);
                return null;
            }
        }));
    }
    return Task.whenAll(tasks);
}

This starts 50 operations concurrently. Each operation will just wait for 3 seconds. This works. It’s also nice to look at the logs because you get to see how the bolts executor works.

About to start task 1
About to start task 0
About to start task 4
About to start task 3
About to start task 6
About to start task 5
About to start task 2
Finished task 1
Finished task 4
Finished task 3
About to start task 8
About to start task 7
About to start task 9
Finished task 0
About to start task 10 ... and so on

My device has 6 processors, so the core pool size is 7. I can see in logs that the first 7 tasks are started, then there’s a delay because the pool is full, then some tasks end and new tasks are added and executed. With a reasonable delay, all 50 tasks are executed.

Hanging case

If you replace my fake tasks with parse tasks, the SDK hangs completely and indefinitely after the pool is full. For example, here I get 10 objects through query, and then call fetch() on each of these at the same time.

private Task<Void> tryHangingWithParse() {
    // get objects from query first
    ParseQuery<ParseObject> query = new ParseQuery<>("Tag");
    query.setLimit(10);
    return query.findInBackground().onSuccessTask(new Continuation<List<ParseObject>, Task<Void>>() {
        @Override
        public Task<Void> then(Task<List<ParseObject>> task) throws Exception {
            // We have objects. Fetch them concurrently.
            List<ParseObject> objectsToFetch = task.getResult();
            List<Task<Void>> tasks = new ArrayList<>();
            int count = 0;
            for (final ParseObject o : objectsToFetch) {
                final int currCount = count;
                tasks.add(Task.callInBackground(new Callable<Void>() {
                    @Override
                    public Void call() throws Exception {
                        Log.e("Parse", "About to fetch object "+currCount);
                        o.fetch();
                        Log.e("Parse", "Fetched object "+currCount);
                        return null;
                    }
                }));
                count++;
            }
            return Task.whenAll(tasks);
        }
    });
}

This is the output though:

About to fetch object 0
About to fetch object 1
About to fetch object 2
About to fetch object 3
About to fetch object 5
About to fetch object 6
About to fetch object 4

Just 7. And nothing more, forever. No objects is fetched, not even those belonging to the first tasks. Other tasks are not scheduled. After this operation, the SDK is locked permanently.

It’s not uncommon to practically end up in this situation. E.g. displaying a list that fetches child objects onBind. And with a low end device / slow connection, this is much more frequent.

Can you reproduce this @rogerhu @Jawnnypoo @hermanliang ? I have no idea what’s happening and why.

@natario1
Copy link
Contributor Author

Going deeper with logs I think the culprit is this line. ParseRequest does everything in a super powered executor, with more threads than bolts default executor (Task.BACKGROUND_EXECUTOR).

  /**
   * We want to use more threads than default in {@code bolts.Executors} since most of the time
   * the threads will be asleep waiting for data.
   */

This is fine, but in the line I mentioned, after the request is executed in this NETWORK EXECUTOR, it falls back to Task.BACKGROUND_EXECUTOR for a moment, causing a bottleneck. Replacing Task.BACKGROUND_EXECUTOR with NETWORK_EXECUTOR there fixes the issue.

(I can’t see why Task.BACKGROUND_EXECUTOR would hang though, it should just enqueue...?)

@natario1
Copy link
Contributor Author

I amended that line, and everything works. It appears that when you call fetch or most other object methods, you allocate one thread in Task.BACKGROUND_EXECUTOR and a child thread in NETWORK_EXECUTOR.

So it’s important that the child NETWORK_EXECUTOR thread won’t try to execute in Task.BACKGROUND_EXECUTOR as it was now, or there’s a lock.

This also means that this big NETWORK_EXECUTOR is useless in this case, since each thread is wrapped in a Task.BACKGROUND_EXECUTOR thread and so it is bounded by a smaller pool size...

@Jawnnypoo
Copy link
Member

Great findings @natario1. Yeah, it really seems that having that Task.BACKGROUND_EXECUTOR was possibly a copy paste error? It doesn't make much sense at all to have it switch to the smaller thread pool just to check the task for an error. Maybe we could get some comments from @grantland or @wangmengyan95 on this?

As an aside, I think it would be great to have another module in this project for benchmarking. Parse should remain the SDK, ParseStarter a starter implementation, and Benchmarks (keeping with the caps casing) for where we can test things like this? I suppose this could potentially have been caught or found out via some test within the SDK, but I could also see there being a case for creating a codebase to really hammer down and test the real world performance of Parse (as it can be sluggish, especially with things like local storage)

@Jawnnypoo
Copy link
Member

@natario1 Looks like there are some similar cases of this with the database thread executor, but there, we see the comment that they want to seemingly purposefully jump off of the DB executor service:

https://github.com/parse-community/Parse-SDK-Android/blob/master/Parse/src/main/java/com/parse/ParseSQLiteDatabase.java#L124

Only thing I could think is that by jumping off and onto the other executor, it would free another resource for the db pool? But still, this also seems like it would lead to thrashing, as the many db threads would be fighting over jumping onto the lesser Task.BACKGROUND_EXECUTOR. Thoughts?

@natario1
Copy link
Contributor Author

was possibly a copy paste error?

@Jawnnypoo more of a habit, since Task.BACKGROUND_EXECUTOR is used everywhere? Btw I have tried to catch this bug in a ParseRequestTest, willing to include it in #648 , but I was not able to... Which means that

  • root cause for this bug is not 100% clear to me (would be great to know from grantland and wang)
  • unit tests are not appropriate for something like this that includes multiple components

So I agree with the need of more comprehensive tests. Meanwhile I think #648 is a urgent fix even if we don’t get why it happens. As far as I know the db executor should not be a problem, it is single threaded so jumping into Task.BACKGROUND_EXECUTOR means jumping into a bigger pool.

@natario1
Copy link
Contributor Author

natario1 commented Apr 30, 2017

@rogerhu can we leave this open for a while? The PR fixes e.g. my function tryHangingWithParse but that’s really high level and doesn’t reveal where the problem is.

I am still experiencing the bug in my app in more complex situations (fetch Posts from query, display them in a list, and when each Post is shown (recycler view’s onBind), fetch a couple of child objects)

Looks that if LDS is enabled, tryHangingWithParse still locks the SDK... so it’s not about the task complexity. #648 fixed just the case of LDS disabled. This is driving me crazy :-)

@rogerhu rogerhu reopened this Apr 30, 2017
@natario1
Copy link
Contributor Author

natario1 commented May 2, 2017

Just some thoughts if anyone is interested. Many hours, big post :-) I don’t think this is going to be fixed easily in parse, and I also think that #648 should be reverted (though it does no big harm). The issue seems clear to me now and goes down to bolts. I might be wrong on everything of course.

When the SDK hangs

When BACKGROUND_EXECUTOR is full, and all threads inside it are requesting a new BACKGROUND_EXECUTOR thread to complete. Since it’s full, it can’t offer new threads, and there’s a lock. Unfortunately the pool is pretty small (7 threads if you have 6 processors...), and by design it won’t allow new threads to be created.

BACKGROUND_EXECUTOR pool (max 7 threads for 6 processors)
|- background-executor-thread-1 (needs another background thread to complete)
|- background-executor-thread-2 (needs another background thread to complete)
|- background-executor-thread-3 (needs another background thread to complete)
|- background-executor-thread-4 (needs another background thread to complete)
|- background-executor-thread-5 (needs another background thread to complete)
|- background-executor-thread-6 (needs another background thread to complete)
|- background-executor-thread-7 (needs another background thread to complete)

The BACKGROUND_EXECUTOR

Defined here. The comments say if a task cannot be queued, a new thread will be created unless this would exceed max pool size. Strictly this is true, but with the queue they are using, new LinkedBlockingQueue<Runnable>(), tasks can always be queued, because it has infinite capacity. So in practice we are stuck with 7 (say) threads, and all other commands stay in a endless queue that keeps growing, but is never released. If not confident with thread pool executor, see oracle docs, describing queuing strategies. The executor is using the second strategy:

Using an unbounded queue (for example a LinkedBlockingQueue without a predefined capacity) will cause new tasks to wait in the queue when all corePoolSize threads are busy. Thus, no more than corePoolSize threads will ever be created. (And the value of the maximumPoolSize therefore doesn't have any effect.) This ... admits the possibility of unbounded work queue growth when commands continue to arrive on average faster than they can be processed.

So this is an error in bolts (or in their comments). Anyway, even if the queue had a finite capacity, our issue would hardly be solved. What would work is a queue with size 0 (or very small), and a big enough maximumPoolSize. This is the first strategy in the docs, called the direct handoff strategy (added emphasis):

A good default choice for a work queue is a SynchronousQueue that hands off tasks to threads without otherwise holding them. Here, an attempt to queue a task will fail if no threads are immediately available to run it, so a new thread will be constructed. This policy avoids lockups when handling sets of requests that might have internal dependencies.

It fits the Parse use case. Bolts designed the thread pool emulating android AsyncTask but by design async tasks are independent units, they don’t depend on each other. Tasks do. However, I doubt that such a change would be allowed in bolts. Maybe grantland can give his opinion.

Why the SDK hangs

Because sometimes, deep down a BACKGROUND_EXECUTOR task, we request another BACKGROUND_EXECUTOR thread. If enough of these tasks are executed concurrently, we end up in the locking scenario. This happens in two cases.

When moving to different executors

Sometimes we perform certain blocks of a long task in a executor different than BACKGROUND_EXECUTOR. This happens e.g. in two base blocks, ParseRequest and ParseSQLiteDatabase. As you can see in the sqlite database (and as it was in parse request before #648 ), after using the special executor the task jumps back to BACKGROUND_EXECUTOR, by explicitly mentioning it.

public Task<Void> openDatabase() {
  return Task.call(new Callable<Void>() {
    // open the database 
  }, DATABASE_EXECUTOR).continueWithTask(new Continuation<Void, Task<Void>>() {
    // jump into background executor. But this requests a new thread!
    return task;
  }, Task.BACKGROUND_EXECUTOR);
}

This is correct. Imagine the opposite, i.e. simply

public Task<Void> openDatabase() {
  return Task.call(new Callable<Void>() {
    // open the database 
  }, DATABASE_EXECUTOR);
}

In this case anyone from outside the class calling db.openDatabase().continueWith(continuation) without mentioning an executor, would be executing the continuation in the database executor. That is not desirable. So tasks that use a special executor fall back to BACKGROUND_EXECUTOR before ending, in order to not expose their internal executor to accessors.

The issue is that anytime you mention BACKGROUND_EXECUTOR a new (or cached) thread is requested. So you call openDatabase from a background thread, the task is performed in the database executor, and then a second background thread is requested (while the first is still locked). With the current queue strategy of BACKGROUND_EXECUTOR, that’s bad!

The SDK would benefit from removing these two special executors (in the database class and in ParseRequest), but I don’t see that as a possible solution, they are there for a reason.

When going too deep down the stack

The task chain of public methods like fetch() is huge. 99% of times we run tasks and continuations without mentioning an executor, so the efficient BoltsExecutors.ImmediateExecutor is used. It executes continuations in the current thread, unless the stack is too deep - there were too many nested calls to execute() from the same thread. In that case, it requests a new thread from BACKGROUND_EXECUTOR. Bad again!

This could be mitigated by replacing lots of useless continueWithTask with continueWith, and generally trying to reduce the nestedness in continuations. But it’s not a solution. The executor has a max depth of 15 and I think the SDK goes beyond that value in many cases, really often when LDS is enabled and there are database reads/writes. continueWithTask and similar are used extensively as if they don't have a price, but they have.

Solutions

  • Have bolts change their strategy. Strictly this is a one liner, just changing LinkedBlockingQueue to SynchronousQueue, and increasing the max pool size to reasonably big number. As docs say, it’s a decent solution for tasks that have internal dependencies. I have run tests with this change and everything is fixed. We could also keep using LinkedBlockingQueue, but use capacity < corePoolSize
  • Drop ourselves Task.BACKGROUND_EXECUTOR and create a background executor that fits our needs. This is easier said than done, it should be used everywhere. There would be hard work spotting all public methods and ensuring they start in the new EXECUTOR, and this work would be useless anyway: internal tasks that are too deep will fall back automatically to Task.BACKGROUND_EXECUTOR (see ImmediateExecutor here, already mentioned). That’s how bolts work.
  • ???

It was fun to dig into this at first, but it’s not now that I don’t see any practical solution. I can see the SDK hanging pretty easily. The developer can slow down / serialize her requests to avoid it, but that’s not deterministic - lock depends on number of processors and network speed. A single core device would hang with 2 concurrent fetch()...

@Jawnnypoo
Copy link
Member

@natario1 Really awesome work digging into this. This is important stuff that really deserves time and attention.

The huge downside to this problem in my opinion is the fact that the Parse SDK relies on bolts. Bolts itself seems to be a dead project, with no commits in over a year. I regretfully doubt that if you were to bring this to anyone's attention on the bolts project, even with a pull request, would even get looked at.

I think the current potential immediate solution would be to open an issue on bolts, giving the description here. If it manages to get the fix, that is the best case scenario.

If not, the second option is to pull in bolts as a local module for the Parse SDK and make the needed fixes. This would be somewhat problematic in that we would have to change the package name of the bolts framework so as to not collide with any projects that are currently using bolts alongside the Parse SDK.

The solution I think would be best, but would result in quite a bit of work and rewrite, as well as a 2.0 release would be to drop bolts entirely in favor of a more tried and true and supported reactive framework, RxJava 2. I think most Android developer are much more likely to be familiar with RxJava than bolts, and even if not, there are tons of resources available to do so.

Using RxJava, we would be able to leave the threading to the user, using whatever Scheduler they desire to do Parse operations. Instead of having to have both async and synchronous operations for everything, we would provide the appropriate Observable, Single, Completable, etc. to the user and they would make the decision.

Any thoughts on these options, or other ideas?

@rogerhu
Copy link
Contributor

rogerhu commented May 2, 2017

Let's see how Bolt can be fixed if possible and we can ping Facebook for a review.

Moving to RxJava is obviously a lot of work and a steep learning curve to get right.

@natario1
Copy link
Contributor Author

natario1 commented May 2, 2017

I am going to open a issue there, let’s try. If you happen to know who to ping please ping him/her there. If this doesn’t work out, we might have to fork but that’s a breaking change? e.g. one might compile parse and facebook-sdk { exclude: bolts } right now

@natario1
Copy link
Contributor Author

natario1 commented May 8, 2017

Not sure we want this, but there is actually room for some tweaking:

  ((ThreadPoolExecutor) Task.BACKGROUND_EXECUTOR).setCorePoolSize(...)
  ((ThreadPoolExecutor) Task.BACKGROUND_EXECUTOR).setMaxPoolSize(...)

@felipesouto
Copy link

Any progress on this issue? Thanks!

@felipesouto
Copy link

"I think the current potential immediate solution would be to open an issue on bolts, giving the description here. If it manages to get the fix, that is the best case scenario."

Opening an issue on Bolts doesn't seem to work. Are there any plans for this? Thanks!

@natario1
Copy link
Contributor Author

natario1 commented Jun 9, 2017

Hi @felipesouto we will eventually figure out how to get out of this, but it doesn't seem something we can fix in a few days at the moment. If this is critical for your application, you can

  • refactor your code to reduce the number of concurrent calls, e.g. chain operations instead of executing in parallel, using continueWith
  • (do at your own risk) tweak the executor settings using ((ThreadPoolExecutor) Task.BACKGROUND_EXECUTOR).setCorePoolSize and setMaxPoolSize. But I strongly suggest to take the first route.

I spent a lot of time on this and think it's a critical issue. But it would be great if we could solve it through Bolts.

In the long run (v2) we should probably consider dropping it in favor of something else. After the shutdown SDK is left with this huge dependency that is not under control and unresponsive.

If anyone has ideas or plans...

@Jawnnypoo
Copy link
Member

Jawnnypoo commented Jun 9, 2017

I will throw in my advocation for RxJava again. It is the most highly adopted reactive framework for Java and would allow people to have more control over the SDK than Bolts does. There is some truth to the fact that it has a learning curve, but you could argue the same about Bolts if you are using it for not the simplest of cases.

@Jawnnypoo
Copy link
Member

Any thoughts on this? I would love to move forward Rx'ifying Parse and removing Bolts if others are on board.

@flovilmart
Copy link
Contributor

If there is a move to make, i’d Rather not pick a side, bolts vs Rx. But have a flexible architecture that rx bindings are easy enough to implementz

@rogerhu
Copy link
Contributor

rogerhu commented Jun 21, 2017

RxJava 2 right? In general there is a lot of underlying plumbing/tests you will have to rework..

@flovilmart
Copy link
Contributor

Also keep in mind that many users have their app written around bolts, deprecating bolts binding without providing an alternative that enables them is a no go in the official repository.

@flovilmart
Copy link
Contributor

Any change in the async that requires deprecation is a no go if we don’t have a stable way forward. I understand people like Rx better than bolts, but 100% of the users of parse use bolts both in iOS and Android.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants