LOAD queries starve SELECT queries (and mapdql) when the total number of concurrent queries is greater than or equal to the number of CPU cores #95
Comments
Try There is a limit on the size of the Thrift server's thread pool, defaulting to 8 threads. This server-side limit is not desirable, but it is currently required to prevent some crashes we came across when using CUDA + OpenGL from multiple threads. We have a more appropriate workaround planned (moving rendering to a single thread), but no ETA at the moment.
Thanks. It works after enlarging the pool to allow 9 threads. I am curious why 7 (= NWRITER + NREADER) threads won't work when the pool allows 8 threads, why only reader threads are starved (not partway through but from the very beginning), and why ALL readers are starved rather than only one (e.g. the last one). Any hint would be great.
Here is more information about this issue, especially about "ALL readers are starved". Steps to reproduce the issue:
The table was created and the writer threads started to load data, but both readers were blocked completely. At this point I ran the mapdql shell in another window, but it was also blocked and never reached the prompt. I then opened one more window and attached gdb to mapd_server. Below is the thread information and a backtrace of one Thrift handler thread processing the SELECT query on the mapd_server side. Apparently, both SELECT query strings had been accepted by the Calcite client and had probably been submitted to the Calcite parser, but the thread then paused in __libc_recv() indefinitely.
Interestingly, this starvation only happens immediately after table creation in the same Python session of the script. For example, if the table already exists and the script runs without the '-c' option to re-create it, no threads are blocked and everything runs as expected. It looks like a corner case, but I am curious what runtime impact table creation may have on later SELECT queries.
Attached is a tcpdump capture of the test in the previous comment. It shows that the Calcite client sent two requests ('process' and 'get tables') but got no Thrift response from the Calcite server other than TCP ACKs. Also attached is another tcpdump capture, taken when the script ran without creating the table. In that capture, the Calcite server replied to each SELECT query with a 2613-byte Thrift response.
The issue here is that the Calcite server requires access to the mapd-core server to get the db metadata. If all the connections to the server are exhausted, the Calcite server will sit and wait for a connection to become available. The Calcite server only makes a connection to the mapd-core server when it first lazy-loads the metadata for the tables the query is being executed on, which explains why you sometimes see the issue and sometimes do not, depending on whether the Calcite server has already seen the metadata. This connection limit is ultimately tied to an issue in the NVIDIA driver, where render sessions have some kind of thread-local storage that fails after "too many" threads have been instantiated. We are fixing this by moving the render threads to a managed thread pool, so the limitation on the number of sessions will be removed. In the short term, if you wish to have many threads, you should use HTTP protocol connections to the web server port (9092 by default), e.g.:
There is an overhead in using the HTTP protocol, but it will avoid your thread starvation issue. It does appear, though, that you may just be testing a corner case at the moment anyway.
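The mechanism described above (a handler slot that can only finish after a nested call obtains another slot from the same bounded pool) can be illustrated with a toy model. This is only a sketch under stated assumptions: `simulate`, its parameters, and the slot accounting are hypothetical and do not reproduce how mapd_server or the Calcite server actually count connections. It only shows why readers stall when every slot is pinned by a client connection.

```python
import threading

def simulate(pool_size, n_writers, n_readers, timeout=0.5):
    """Toy model: return how many readers get their metadata callback served.

    Assumption (not taken from mapd_server): every client connection pins
    one handler slot for its lifetime, and each SELECT needs one extra slot
    for the parser's metadata callback before it can complete.
    """
    slots = threading.Semaphore(pool_size)

    # Every writer and reader connection occupies one handler slot.
    for _ in range(n_writers + n_readers):
        if not slots.acquire(timeout=timeout):
            raise RuntimeError("more client connections than handler slots")

    served = []

    def reader(idx):
        # The nested callback needs a free slot; give up after `timeout`.
        if slots.acquire(timeout=timeout):
            try:
                served.append(idx)
            finally:
                slots.release()

    threads = [threading.Thread(target=reader, args=(i,)) for i in range(n_readers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(served)

if __name__ == "__main__":
    print(simulate(pool_size=4, n_writers=2, n_readers=2))  # pool fully pinned
    print(simulate(pool_size=5, n_writers=2, n_readers=2))  # one spare slot
```

In this model writers never need a second slot, which is one plausible reason only readers would stall; whether that matches the real server is an assumption.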
When both mapd_server and my script got stuck, I tried mapdql with HTTP port 9092. It didn't hang, but it dumped core as shown below.
Yes, I'm still curious what minimal change to mapd_server or the Calcite server would let the script run to the end. Though it is a corner case, it does not seem an unusual "unit test" case: create a table, then immediately fork multiple threads to read and write it.
Corrected. After interleaving the forking of threads with a delay, both reader threads got their SELECT results :)
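A minimal sketch of what staggering the thread starts could look like. The original script is not shown in this thread, so `start_staggered`, its `delay` value, and the callables passed in are hypothetical stand-ins for the real LOAD and SELECT workers.

```python
import threading
import time

def start_staggered(targets, delay=0.5):
    """Start one thread per callable, sleeping `delay` seconds between starts."""
    threads = []
    for i, target in enumerate(targets):
        if i > 0:
            time.sleep(delay)  # give the previous thread a head start
        t = threading.Thread(target=target)
        t.start()
        threads.append(t)
    return threads

if __name__ == "__main__":
    # Placeholder workers; a real script would issue LOAD/SELECT queries here.
    workers = [lambda n=n: print("worker", n, "started") for n in range(3)]
    for t in start_staggered(workers, delay=0.1):
        t.join()
```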
One thing I'm still confused about (stubborn :)) is this: in the previously attached 1.cap file, after the Calcite server received the SELECT requests (msg 1 and 3), it could send "get_tables" requests back to mapd_server (msg 11 and 12), which means Calcite had no problem connecting back to mapd_server. If that is the case, the issue seems to be kicked back to mapd_server, and the question becomes: why didn't mapd_server reply to Calcite's "get_tables" requests?
@dwayneberry |
For issue #82, I wrote a script to simulate multiple concurrent writers doing LOAD queries and multiple concurrent readers doing SELECT queries.
The test dataset is a small subset of the NY taxicab trip data, consisting of 12 CSV files.
In the script, there are two constants NWRITER and NREADER which are set to the numbers of threads doing LOAD queries and doing SELECT queries, respectively.
A symptom is observed when running the script on my 7-core Ubuntu 16.04 VM: if NWRITER + NREADER >= 7, all readers are starved from the very beginning and none of their SELECT queries returns any result. At the same time, any newly launched mapdql client is also locked out.
That is, reader lock-out occurs when [NWRITER, NREADER] = [1, 6], [2, 5], [3, 4], [4, 3], [5, 2], or [6, 1].
Lock-out does not occur when NWRITER + NREADER < 7.
It's interesting that it is always the readers that are locked out.
Is there any compile-time or runtime parameter that can be set to work around this constraint?
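For readers who want to reproduce the setup, here is a minimal sketch of a harness with the shape described above. It is not the original script: `run_workload` and the `load` and `select` callables are hypothetical, and a real run would have them issue LOAD and SELECT queries through whatever client the script uses.

```python
import threading
import time

NWRITER = 2  # threads doing LOAD queries
NREADER = 2  # threads doing SELECT queries

def run_workload(load, select, nwriter=NWRITER, nreader=NREADER, timeout=1.0):
    """Start writers and readers together.

    Returns the names of threads still blocked after `timeout` seconds.
    """
    threads = [threading.Thread(target=load, name=f"writer-{i}", daemon=True)
               for i in range(nwriter)]
    threads += [threading.Thread(target=select, name=f"reader-{i}", daemon=True)
                for i in range(nreader)]
    for t in threads:
        t.start()
    # Share one deadline across all joins so the total wait is bounded.
    deadline = time.monotonic() + timeout
    for t in threads:
        t.join(max(0.0, deadline - time.monotonic()))
    return [t.name for t in threads if t.is_alive()]

if __name__ == "__main__":
    # Placeholder callables that return immediately, so nothing is reported.
    print(run_workload(lambda: None, lambda: None))
```

Reporting the blocked thread names makes it easy to see whether only readers are stuck for a given [NWRITER, NREADER] pair.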