Segmentation fault with extremely large maxRows values #65
If it works on Mac it might be a kernel config on CentOS.
What version of Node? What's the stack trace? Do you have a seg fault handler installed?
A maxRows of 500K could use a lot of memory. It's not a size I'd suggest in a live app. When #13 is resolved you hopefully won't need to set this value. I'll mark this as a bug so we can review whether the problem is in node-oracledb code.
I used segfault-handler to get a stack trace: PID 22687 received SIGSEGV for address: 0x0
For completeness, do you have a test case showing the CREATE TABLE statement and the node-oracledb code?
I can confirm the same; I'll try to make a simple test when time allows.
Tried on Windows x64, no problems...
I just came across a similar issue. I am using 64-bit Windows Server 2008 R2 but have 32-bit node.js/oracledb/etc. installed. I am on version 0.7.0 of node-oracledb and have about 3GB of total system RAM. (I have also tried increasing the RAM to 4GB, but this didn't appear to help.) I am running a query that currently returns about 3200 rows. I set maxRows to 99999 and noticed that the connection.execute callback would not get called about 20% of the time. I wouldn't even get an error message; the node app would just quit. My troubleshooting led me to figure out that if I lowered maxRows to 10K it would never give me a problem, but if I increased it to something high like 999999, it would consistently not work. @cjbj any traction on figuring this out?
@bchr02 we're currently looking at other things that don't have workarounds. We have a task to review internal error checking that might root out the cause. If it is, for example, one of the malloc calls which is currently not checked for NULL, you should be getting an error. Note that if my guess is correct, then you might start to see the problem at a lower maxRows in the next release, since we need to increase buffer sizes to fix the various NLS issues with server-to-client character set expansion. (Yes, I'm not happy about the malloc calls not being error checked from day 1 either.)
@cjbj thanks for the feedback. However, I would argue that it's hardly a workaround if I can't reliably get the data in one result set. And not getting an error message just makes the problem worse.
@bchr02 we will look at it. You will have to reduce maxRows or use a ResultSet until then. If you set maxRows to 99999 for a query returning 3200 rows, you are allocating a huge amount of unused memory.
@cjbj Using a ResultSet doesn't make sense in situations where one needs a single array of all the rows. For example, I am trying to take the results of a query and, using the js-xlsx module, output them to an Excel file, finally sending the file as an email. What I don't understand is why there is even a need to specify maxRows when a ResultSet is not used. When I use PHP's oci_fetch_all function, the default behavior is that all rows are returned. Here is a snippet from oci_fetch_all's function reference page:
Why can't oracledb behave the same way? In scenarios where ResultSets are not used, it should auto-allocate the amount of memory based on the data returned. If one is trying to build a scalable application and needs the results in one array, having to hard-code a limit makes no sense. What, am I supposed to guess what that number should be? Or am I supposed to do an initial query to count the rows before the main query? Come on! Another thing: when not using ResultSets, having to set maxRows will force everyone to assign a much higher number than what they would expect the query to output, which will lead to unneeded higher memory utilization (based on how you are telling me it works). Based on these remarks, I hope that you will try to change the driver so that by default all rows are pulled. Thank you.
Why don't you query with `resultSet: true`? Here is a little helper function to do the job:

```js
function queryAll(conn, sql, args, cb) {
  var allRows = [];
  conn.execute(sql, args, {
    resultSet: true
  }, function(err, result) {
    if (err) return cb(err);
    function fetch() {
      var max = 50;
      result.resultSet.getRows(max, function(err, rows) {
        if (err) return cb(err);
        allRows = allRows.concat(rows);
        if (rows.length === max) {
          fetch();
        } else {
          result.resultSet.close(function(err) {
            if (err) return cb(err);
            cb(null, allRows);
          });
        }
      });
    }
    fetch();
  });
}
```

This way you won't force the driver to allocate huge amounts of data, and you will still be able to get all your results in a single huge array. Note that bringing a huge amount of data into memory to generate an XLS file is a bit of an anti-pattern in node. You should handle this in streaming mode instead. Of course you'll need a streaming XLSX writer for the output.
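For illustration, here is the helper exercised against a mocked connection. The mock, its synchronous callbacks, and the row counts are all fabricated for this sketch; a real app would pass a node-oracledb connection and real SQL. The helper is repeated so the sketch is self-contained:

```javascript
// queryAll as posted above, repeated so this sketch runs standalone.
function queryAll(conn, sql, args, cb) {
  var allRows = [];
  conn.execute(sql, args, { resultSet: true }, function (err, result) {
    if (err) return cb(err);
    function fetch() {
      var max = 50;
      result.resultSet.getRows(max, function (err, rows) {
        if (err) return cb(err);
        allRows = allRows.concat(rows);
        if (rows.length === max) {
          fetch();
        } else {
          result.resultSet.close(function (err) {
            if (err) return cb(err);
            cb(null, allRows);
          });
        }
      });
    }
    fetch();
  });
}

// Minimal mock of a connection: just enough shape
// (execute -> result.resultSet.getRows/close) to drive the helper.
// Callbacks fire synchronously here purely to keep the demo simple.
function makeMockConn(totalRows) {
  var fetched = 0;
  var resultSet = {
    getRows: function (max, cb) {
      var n = Math.min(max, totalRows - fetched);
      var rows = [];
      for (var i = 0; i < n; i++) rows.push({ ID: fetched + i });
      fetched += n;
      cb(null, rows);
    },
    close: function (cb) { cb(null); }
  };
  return {
    execute: function (sql, args, opts, cb) {
      cb(null, { resultSet: resultSet });
    }
  };
}

// 123 rows forces two full 50-row chunks plus a final 23-row chunk.
var out;
queryAll(makeMockConn(123), 'SELECT id FROM t', [], function (err, rows) {
  if (err) throw err;
  out = rows;
});
console.log(out.length); // 123
```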
@bjouhier Thanks for the reply and for the example. I have considered accumulating the data, but to me this seems inefficient and like something the driver should do. I can't think of a scenario where someone who doesn't use result sets and chooses not to set maxRows would not want all the data. My point is: when one does not want the driver to allocate huge amounts of data, they should either set a limit via maxRows or use result sets. But by default, when one chooses to do neither of those two things, the driver should return the full result set. Nobody is forcing the driver to do anything; by contrast, we shouldn't be forced to reassemble the data. This is an extremely inefficient approach. In regards to your note, I am utilizing a separate node process for this task and am performing asynchronous calls, so I can't think of a reason not to use node for something like this.
The driver has to materialize the result in the main JS thread. So if you ask it to fetch all results in a single async call, it will pause the event loop for a very long time and memory will peak very high during materialization. With the helper function, the event loop will have smaller pauses and memory should stay lower because the driver will work with smaller chunks. My guess is that the helper is likely to be more efficient (less pressure on the GC). The cost of reassembling is very small; it is just a reallocation of an array of pointers. The helper could be packaged into the driver (implemented in the small JS wrapper, not in C++), but to me it does not belong there because this is an anti-pattern 👎.
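To see why the reassembly cost is small: `Array.prototype.concat` copies only references to the row objects, not the rows themselves. A quick illustration, independent of the driver, with made-up rows:

```javascript
// Two fabricated "chunks" of rows, as consecutive getRows() calls
// would return them.
var chunk1 = [{ ID: 1 }, { ID: 2 }];
var chunk2 = [{ ID: 3 }];

// concat builds a new array of references; the row objects themselves
// are shared between the chunk arrays and the combined array.
var all = chunk1.concat(chunk2);

console.log(all.length);           // 3
console.log(all[0] === chunk1[0]); // true: same object, not a copy
```

So reassembling N rows from chunks costs one pass over N references, regardless of how wide each row is.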
In my opinion the oracledb driver should handle everything that has to do with getting the data and it should do so in the most efficient way possible. So, if this means in the scenario where one chooses to get all the results and that the most efficient way would be to use small chunks then it should do the job of reassembling them and providing them that way (in one Array). And if you don't agree with that then why aren't you advocating to make mandatory the use of result sets? I won't argue, that it would probably be more efficient if I rewrote my program so that the data is streamed in chunks so as they came out of OCI they would be processed and written to disk. But sometimes it's not practical to do so. In my example, the program will only be utilized, once per day. The time and effort to recode it and test it doesn't make sense. That is why it's my opinion that the oracledb driver should be more robust and handle this scenario. |
See the simple solution Bruno posted above.
@rinie you are missing the point. I understand your and @bjouhier's point that using a ResultSet is more efficient, even if it's just to put the data back together. My point is that this should happen automatically when ResultSets are not used. Otherwise there will be issues where we are setting arbitrary maxRows values that are either too large or not large enough.
I think the default behaviour should prevent using multiple gigabytes of memory or long pauses.
@bchr02 If you think that a no-limit API is necessary, then you can create a small helper package and publish it to NPM (feel free to borrow the code that I posted). This is as usable and robust as having it in the driver. This is part of node's philosophy: keep the core small and create specialized packages for non-core scenarios. |
@bjouhier No, I don't think it's necessary because I think it should be part of the core. It's a trivial amount of code to abstract. I guess we just have a difference of opinion on this. I would be interested to see if anyone else chimes in on this topic and what @cjbj's response is. Regardless, I really appreciate you and @rinie taking the time to discuss this with me.
Lots of good points are being discussed back and forth. I know the PHP "it just works" point of view well and agree with it. That's why I'm keen on seeing #146 resolved, as is @dmcghan.

Up front, let me say @krismohan has the maxRows seg fault on the priority list to fix. I personally don't recall reproducing it, but @dmcghan saw it on a small VM, so I believe we should be able to track it down.

@bchr02 can you open a new issue with your points from #65 (comment) and #65 (comment) in it? We are still looking at tuning, best practices, and how to make users' apps efficient. Having your points under a new enhancement request issue would be useful. If non-RS queries allowed unlimited growth, it could be possible for unexpected out-of-memory issues to occur, but the maxRows definition could change to be the limit of allocation. Effectively non-RS would change from pre-allocated memory to allocated-on-demand memory. We could overload prefetchRows, or have a new property to indicate how much memory to initially allocate, to make the fetches efficient. I'm speculating here, and you know there are plenty of other tasks to be prioritized.

As to the current design: in driver-land we see a lot of apps doing single-row queries. Using a non-RS query makes a lot of sense here for efficiency. The app should set maxRows to 1. There's minimal memory allocation, it happens up front, there's minimal data copying, and there's no overhead of a RS. This is the advantage of having a maxRows-like setting: memory can be allocated up front and the driver knows where to put the data. When we introduced ResultSets we had extensive discussions on performance and use cases. We did consider whether to have a single fetch model, possibly breaking the current one. In the end we decided to keep both RS & non-RS to satisfy different use cases.

I think using a ResultSet when you want a single large array is a valid use case. Call getRows(), preferably with a script like @bjouhier's from #65 (comment), or even use a large initial numRows if you are happy with its memory allocation. One expectation I had of node-oracledb was that most people would be using it for fast, streaming applications. I am surprised to see more than one user pushing the bounds on maxRows.
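The allocated-on-demand idea above could be sketched as follows. Everything here is hypothetical: `fetchChunk` stands in for an internal driver fetch, and the maxRows-as-cap semantics are only the speculation from the comment, not the driver's actual behaviour:

```javascript
// Hypothetical sketch: grow the result array chunk by chunk, but fail
// with an error once a caller-supplied cap (maxRows) would be exceeded.
// fetchChunk is an invented stand-in for an internal driver fetch.
function fetchAllCapped(fetchChunk, maxRows, cb) {
  var all = [];
  (function next() {
    fetchChunk(function (err, rows) {
      if (err) return cb(err);
      if (rows.length === 0) return cb(null, all); // source exhausted
      if (all.length + rows.length > maxRows) {
        return cb(new Error('result exceeds maxRows=' + maxRows));
      }
      all = all.concat(rows);
      next();
    });
  })();
}

// Demo with a fake synchronous source that yields 3 chunks of 2 rows.
function makeSource(chunks) {
  var i = 0;
  return function (cb) { cb(null, i < chunks.length ? chunks[i++] : []); };
}

var okResult;
fetchAllCapped(makeSource([[1, 2], [3, 4], [5, 6]]), 10, function (err, rows) {
  if (err) throw err;
  okResult = rows;
});
console.log(okResult); // [ 1, 2, 3, 4, 5, 6 ]

var capErr;
fetchAllCapped(makeSource([[1, 2], [3, 4], [5, 6]]), 3, function (err) {
  capErr = err;
});
console.log(capErr.message); // result exceeds maxRows=3
```

The design point is that memory grows with the actual result size, while maxRows still acts as a safety limit rather than a pre-allocation size.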
I published the helper as a gist: https://gist.github.com/bjouhier/f4f991895fbe62ab1972. I also improved it to avoid reallocating the rows array. One easy way to get this feature into the driver would be to tie it to a special maxRows value. @bchr02 Thanks. I enjoy these discussions 😄
@bjouhier once your OCA is approved, what about submitting the helper as a new example? |
We believe we have fixed this in node-oracledb 1.0. (There is some more hardening still to be done, but the main point of failure is no longer an issue.) Note that the buffer size increases for NLS might mean you now get an expected error at a lower maxRows than you were using before. I'm going to close this, and we can follow up the design questions in issue #158.
The original report:

```js
var maxRows = 500000;
var a = { outFormat: Oracle.OBJECT, maxRows: maxRows };
conn.execute(...)
```

The process then gets:

```
Segmentation fault
```

A CentOS 6 server gets this error; strangely, on the Mac it works fine.