NegativeArraySizeException when reading data from binary table #131

Closed
vforchi opened this issue Jul 6, 2018 · 8 comments
Labels: enhancement (A new feature and/or an improved capability)

@vforchi
Contributor

vforchi commented Jul 6, 2018

I have a big binary table (70 million rows), and when I use the following method:
fits.getHDU(1).getColumn(0)

I get the following exception:

java.lang.NegativeArraySizeException
at nom.tam.util.ArrayFuncs.newInstance(ArrayFuncs.java:480)
at nom.tam.fits.BinaryTable$ColumnDesc.newInstance(BinaryTable.java:152)
at nom.tam.fits.BinaryTable.createTable(BinaryTable.java:1306)
at nom.tam.fits.BinaryTable.getData(BinaryTable.java:645)
at nom.tam.fits.BinaryTable.ensureData(BinaryTable.java:1360)
at nom.tam.fits.BinaryTable.getColumn(BinaryTable.java:633)
at nom.tam.fits.TableHDU.getColumn(TableHDU.java:275)

This is because ArrayFuncs.newInstance takes an int for the dimension, and the second column is a string with 30 characters, so the flattened char array for that column needs more elements than an int can represent.
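To make the failure mode concrete, here is a minimal sketch with an assumed row count (the exact count and the code path inside ArrayFuncs may differ): once rows × width for the text column exceeds Integer.MAX_VALUE, int arithmetic wraps the size to a negative value, which is what produces the NegativeArraySizeException.

```java
public class OverflowDemo {
    public static void main(String[] args) {
        long rows = 72_000_000L;  // hypothetical row count, roughly the scale reported above
        int width = 30;           // fixed-width text column: 30 characters per row

        long flatSize = rows * width;  // 2_160_000_000, above Integer.MAX_VALUE (2_147_483_647)
        int intSize = (int) flatSize;  // wraps to a negative int

        System.out.println(flatSize);  // 2160000000
        System.out.println(intSize);   // -2134967296

        // java.lang.reflect.Array.newInstance(char.class, intSize)
        // would throw java.lang.NegativeArraySizeException here.
    }
}
```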

Note that the column I am trying to read contains integers, so its size fits well within the int range: is there a way for me to read just one column without the library trying to read all the others?

@vforchi
Contributor Author

vforchi commented Dec 19, 2018

Is there any news on this? I have a workaround, but it is considerably slower and more verbose than the other approach.

@attipaci
Collaborator

I'd be happy to work with you on this if you are still interested, especially if you have a specific solution (maybe even a pull request) in mind... Let me know...

@attipaci attipaci added the enhancement A new feature and/or an improved capability label Sep 24, 2021
@vforchi
Contributor Author

vforchi commented Sep 25, 2021

If I remember the problem correctly, I'm afraid there is no simple solution:
ArrayFuncs.newInstance takes int because Array.newInstance takes int, so there is no way around that.
I think the problem was that char columns are mapped to char[] instead of String[], so there is a limitation that the number of rows times the size of a text column must be smaller than MAXINT.
This is, in my opinion, a pretty strong limitation, but removing it would probably require a lot of low-level changes.

@attipaci
Collaborator

I did take a peek last night, and getMemoryRow(int) does use Java arrays, as you mention, to locate the data -- so once the BinaryTable has been fully read, supporting larger tables would require a major rewrite of how we store and access column data. However, getFileRow() should readily handle larger values, since it repositions in the file to the element to read (which uses long)... I think there is a rationale to that: if you have more than 2^31 rows, you probably don't want to read them all at once anyway... So, perhaps, we can support more rows by:

  1. Changing getRow(int) to getRow(long)...
  2. For tables with more than 2^31 rows, we don't even attempt to read the table into RAM...
  3. Having getRow(long) throw an ArrayIndexOutOfBoundsException whenever the requested row is beyond the available rows (whether loaded in RAM or not)

What do you think?
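For concreteness, a rough sketch of what the long-indexed access described above could look like (names and shape are hypothetical, not the current nom-tam-fits API):

```java
// Hypothetical sketch of the proposal above; not an existing nom-tam-fits interface.
public interface LongRowAccess {

    /** Total number of rows, which may exceed Integer.MAX_VALUE. */
    long getNRows();

    /**
     * Returns one row, reading it directly from the file when the table is too
     * large to be held in memory as Java arrays.
     *
     * @throws ArrayIndexOutOfBoundsException if the requested row is outside
     *         the table, whether or not the data is loaded in RAM
     */
    Object[] getRow(long row) throws java.io.IOException;
}
```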

@vforchi
Contributor Author

vforchi commented Sep 25, 2021

I don't think this would solve my problem: getColumn calls ensureData.
Also, I'm not asking for more than 2^31 rows; my file had only 70 million. The problem is in how char columns are handled.

@attipaci
Collaborator

I see. Maybe we can think about a better way to store column data, other than primitive Java arrays. For example, I do think that, rather than using a raw char[] array, column data should have its own object type (e.g. a TableColumn class), which abstracts the underlying storage from the user API. Then we would have more freedom to tweak how that class stores data internally in a more optimal way... I think it's worth thinking about. Maybe for a future release (1.16 or later), since it would inevitably affect the API in more fundamental ways...
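As a strawman for the column abstraction idea above (all names are hypothetical; nothing like this exists in the library yet):

```java
// Strawman only: a column type that hides the backing storage from callers, so a
// fixed-width text column could be backed by String[] chunks, a memory-mapped
// buffer, or deferred file access without changing the user-facing API.
public interface TableColumn<T> {

    /** Number of entries in the column; not limited to the int range. */
    long size();

    /** The element type handed to callers, e.g. String.class for text columns. */
    Class<T> getElementType();

    /** Fetches a single entry by row index. */
    T get(long row);

    /** Replaces a single entry. */
    void set(long row, T value);
}
```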

@attipaci attipaci unpinned this issue Sep 26, 2021
@vforchi
Contributor Author

vforchi commented Sep 26, 2021

That is certainly a good approach, but it might require a bigger refactoring. Strictly speaking, I think we just need to replace the char[] with String[], but that might cause performance issues when reading the table.
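A minimal sketch of that idea, assuming the column's fixed-width ASCII bytes are handed to the reader contiguously (in the real file the rows interleave all columns, so a proper reader would have to skip the other fields); the helper below is made up for illustration:

```java
import java.io.DataInput;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Illustration only: decode a fixed-width text column into one String per row,
// so no single array ever needs rows * width elements.
final class TextColumnReader {

    static String[] readTextColumn(DataInput in, int rows, int width) throws IOException {
        String[] column = new String[rows];
        byte[] buf = new byte[width];
        for (int i = 0; i < rows; i++) {
            in.readFully(buf);  // one fixed-width field per row
            column[i] = new String(buf, StandardCharsets.US_ASCII).trim();
        }
        return column;
    }
}
```

The per-row String allocation is where the extra reading cost would show up on tens of millions of rows.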

@attipaci
Collaborator

That might work too. But whatever we do, we'll be breaking backward compatibility for getColumn() to some degree. That means (a) we have to tread carefully, and (b) a definite bump in the version number, e.g. 1.16 at the earliest.

If we do that, I'd vote for the better, more comprehensive fix rather than the quick fix... If interested, you could go ahead and play with some ideas on branches of your fork. It would be nice if you tried some benchmarking too, to see how much performance is affected by the changes. It's unlikely that performance will change so much as to be a showstopper, but it would be good to know anyway...
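A quick-and-dirty timing sketch along those lines (illustrative only; a real comparison should use a harness like JMH and the library's actual read path):

```java
import java.nio.charset.StandardCharsets;

// Rough comparison of filling one flat char[] versus one String per row for a
// fixed-width text column; numbers are made up and only meant as a starting point.
public final class ColumnStorageBench {

    public static void main(String[] args) {
        final int rows = 2_000_000;
        final int width = 30;
        byte[] raw = "just thirty characters of text".getBytes(StandardCharsets.US_ASCII);

        long t0 = System.nanoTime();
        char[] flat = new char[rows * width];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < width; c++) {
                flat[r * width + c] = (char) raw[c];
            }
        }
        long t1 = System.nanoTime();

        String[] perRow = new String[rows];
        for (int r = 0; r < rows; r++) {
            perRow[r] = new String(raw, StandardCharsets.US_ASCII);
        }
        long t2 = System.nanoTime();

        // Touch the results so the loops are not trivially optimized away.
        System.out.println(flat[flat.length - 1] + " / " + perRow[rows - 1].length());
        System.out.printf("char[]   fill: %d ms%n", (t1 - t0) / 1_000_000);
        System.out.printf("String[] fill: %d ms%n", (t2 - t1) / 1_000_000);
    }
}
```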

@attipaci attipaci pinned this issue Sep 26, 2021
@attipaci attipaci unpinned this issue Oct 16, 2021
@attipaci attipaci closed this as not planned Mar 28, 2023
@attipaci attipaci added this to the 1.18.0 milestone Jul 5, 2023
@attipaci attipaci self-assigned this Jul 5, 2023