Currently one of the databases I'm working with is around 85 GB uncompressed.
Compressed with e.g. xz it's only around 2 GB.
When you have a lot of databases lying around for different purposes, it really takes a toll on the remaining free space on the hard drive.
Supporting compressed database files would free up hundreds of GBs on my drive, and indexing/querying performance shouldn't suffer much.
The individual lines representing documents are already compressed individually. But for small documents there's a lot of duplication across documents once they are base64 encoded, so per-line compression doesn't capture it.
The main issue right now is that, to read compressed files efficiently, we have to use a BufferedInputStream. So even though we can count the number of uncompressed bytes read from the stream, the count of compressed bytes read is not accurate.
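Just to illustrate the layering I have in mind (a minimal sketch, using GZIP as a stand-in for xz and a hypothetical CountingInputStream wrapper, not the project's actual classes): the counter sits above the decompressor, so it sees exact uncompressed offsets, while the buffered stream underneath reads ahead and hides the true compressed position.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

// Counts the bytes that pass through it; placed above the decompressor it
// yields uncompressed offsets.
final class CountingInputStream extends FilterInputStream {
    private long count = 0;

    CountingInputStream(InputStream in) { super(in); }

    long getCount() { return count; }

    @Override
    public int read() throws IOException {
        int b = in.read();
        if (b != -1) count++;
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = in.read(buf, off, len);
        if (n > 0) count += n;
        return n;
    }
}

public class CompressedRead {
    public static void main(String[] args) throws IOException {
        // file <- BufferedInputStream <- GZIPInputStream <- CountingInputStream
        try (CountingInputStream counter = new CountingInputStream(
                new GZIPInputStream(
                    new BufferedInputStream(new FileInputStream(args[0]))))) {
            byte[] buf = new byte[8192];
            while (counter.read(buf) != -1) {
                // counter.getCount() is an exact uncompressed offset here, but the
                // position in the compressed file is only known up to the read-ahead
                // of the BufferedInputStream below the decompressor.
            }
            System.out.println("uncompressed bytes read: " + counter.getCount());
        }
    }
}
```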
This can probably be mitigated by using a seekable input stream on top of the compressed one. Then we can index based on the uncompressed byte count. The downside is that we may need to keep a seekable input stream open for as long as we keep the database value, to avoid opening and closing a whole pipe of streams for every query (see the sketch below).
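To show why keeping the stream open matters, here is a sketch of the naive alternative: reopen the compressed file per query and skip forward through the decompressor to the indexed uncompressed offset. The names (openAt, readDocumentAt) and the GZIP stand-in are purely illustrative assumptions, not the project's API; the point is that every query pays for re-decompressing everything up to the offset.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

public class UncompressedSeek {

    // Position a fresh decompressing stream at the given uncompressed offset.
    static InputStream openAt(String path, long uncompressedOffset) throws IOException {
        InputStream in = new GZIPInputStream(
                new BufferedInputStream(new FileInputStream(path)));
        long remaining = uncompressedOffset;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped <= 0) {              // skip() may return 0; fall back to read()
                if (in.read() == -1) break;  // hit EOF before reaching the offset
                skipped = 1;
            }
            remaining -= skipped;
        }
        return in;
    }

    // Read one newline-terminated document line starting at the indexed offset.
    static String readDocumentAt(String path, long uncompressedOffset) throws IOException {
        try (InputStream in = openAt(path, uncompressedOffset)) {
            StringBuilder line = new StringBuilder();
            int b;
            while ((b = in.read()) != -1 && b != '\n') {
                line.append((char) b);
            }
            return line.toString();
        }
    }
}
```

Keeping one long-lived stream (or a small pool of them) per database value would avoid this per-query skip cost, at the price of holding file handles open.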
I'm open to other ideas! :)