Importing long literals fails with cryptic error #158

Closed
rasendubi opened this issue Mar 22, 2023 · 13 comments

@rasendubi

rasendubi commented Mar 22, 2023

I'm trying to import my dataset, but quadstore fails with a cryptic "too long" error.

Error: too long: 21312
    at padNumStart (.../quadstore/dist/esm/serialization/utils.js:13:11)
    at Object.write (.../quadstore/dist/esm/serialization/terms.js:54:30)
    at Object.write (.../quadstore/dist/esm/serialization/terms.js:142:49)
    at Object.ingest (.../quadstore/dist/esm/serialization/quads.js:12:20)
    at Quadstore._batchPut (.../quadstore/dist/esm/quadstore.js:112:28)
    at Quadstore.put (.../quadstore/dist/esm/quadstore.js:127:14)

I found this to be caused by a long literal in my dataset and it looks like the limit is 9999 bytes. Is it possible to support larger values?

Minimal repro case:

import { MemoryLevel } from 'memory-level';
import dataFactory from '@rdfjs/data-model';
import { Quadstore } from 'quadstore';

const store = new Quadstore({
  backend: new MemoryLevel(),
  dataFactory,
});

await store.open();
// This put fails: the 10,000-char literal exceeds quadstore's term length limit.
await store.put(
  dataFactory.quad(
    dataFactory.namedNode('ex://s'),
    dataFactory.namedNode('ex://p'),
    dataFactory.literal('x'.repeat(10000))
  )
);
@jacoscaz
Owner

jacoscaz commented Mar 22, 2023

Hello @rasendubi!

That error is indeed a bit cryptic; I'll work on making it more explicit. As for the issue per se, the underlying problem is that the serialization mechanism requires the length of each term to be appended at the end of the serialized quad, so that terms can be reconstructed efficiently upon index reads. Supporting unlimited lengths would require scanning for separators, which would make deserialization significantly slower.

The current limitation of 9999 chars is due to the fact that quadstore reserves 4 characters (digits) for each of these lengths, a tradeoff that covers the vast majority of cases. Also, consider that each quad (and therefore each term in a quad) appears in the database as many times as there are configured indexes, meaning a 20,000-char term would actually take up 20,000 × 6 = 120,000 chars when the store is configured with the six default indexes.
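
For illustration, here is a simplified sketch of this scheme (not quadstore's actual code; this padNumStart only mimics the function in the stack trace above):

// Each term's length is encoded as a fixed-width, zero-padded decimal number,
// which makes deserialization a matter of slicing rather than scanning.
const LENGTH_DIGITS = 4; // fixed width => maximum encodable length is 9999

const padNumStart = (num) => {
  const str = String(num);
  if (str.length > LENGTH_DIGITS) {
    throw new Error(`too long: ${num}`); // the error reported above
  }
  return str.padStart(LENGTH_DIGITS, '0');
};

// Serialize a quad's terms, appending their lengths at the end.
const serializeQuad = (terms) =>
  terms.join('') + terms.map((term) => padNumStart(term.length)).join('');

serializeQuad(['ex://s', 'ex://p', 'hello']);
// => 'ex://sex://phello000600060005'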

Presently, terms of arbitrary length can be stored with a little pre-processing at ingestion time and post-processing at query time via quadstore's preWrite hook, which allows you to persist additional values to the underlying KV store atomically with the quad being added. The general idea would be to replace terms longer than X chars with either a NamedNode or a BlankNode following a specific pattern, storing the original term separately via the preWrite hook. Then, at query time, you'd reconstruct the original quad by intercepting quads with replacement terms and querying the KV store for the originals.

The preWrite hook was actually added to quadstore by @gsvarovsky for precisely this use case: storing terms longer than quadstore's limits in m-ld. I'm not sure whether this is still the case, but there is some usage of preWrite here: https://github.com/m-ld/m-ld-js/blob/e86c2244842cd0c5fdcaa32e080e6895d6276e59/src/engine/dataset/index.ts#L243
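
For illustration, a rough sketch of that idea (MAX_LENGTH, LONG_PREFIX and hashOf are illustrative names, not quadstore API, and the exact preWrite signature should be checked against the docs; it's assumed here to receive the backend's chained batch):

import { createHash } from 'node:crypto';

const MAX_LENGTH = 1000;          // replacement threshold, chosen arbitrarily
const LONG_PREFIX = 'ex://long/'; // recognizable pattern for replacement terms

const hashOf = (value) => createHash('sha256').update(value).digest('hex');

const putWithLongLiterals = async (store, dataFactory, quad) => {
  const { object } = quad;
  if (object.termType === 'Literal' && object.value.length > MAX_LENGTH) {
    const key = LONG_PREFIX + hashOf(object.value);
    const replaced = dataFactory.quad(
      quad.subject,
      quad.predicate,
      dataFactory.namedNode(key), // replacement term
      quad.graph,
    );
    // preWrite persists the original value atomically with the quad.
    await store.put(replaced, {
      preWrite: (batch) => batch.put(key, object.value),
    });
  } else {
    await store.put(quad);
  }
};

At query time, any NamedNode whose value starts with LONG_PREFIX would be swapped back by fetching the stored value from the underlying store.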

@jacoscaz
Owner

I wonder if it'd be worth shipping this mechanism within quadstore itself, perhaps as an opt-in feature. I've never had to store terms anywhere near that long, but yours is the third use case for extra-long terms that I have come across.

@rasendubi
Author

I believe this has some (un)expected consequences, such as range queries no longer working? Not that I expect to range over these long values, but that might be a counter-argument against including this by default.

Another option is changing the length encoding. Decimal encoding is quite inefficient: it only encodes 10,000 distinct values in 4 bytes. If we encoded lengths in binary, we could support lengths of up to 4 GiB with the same 4 bytes. This has the downside of requiring us to process serialized values as byte strings (which we probably should do anyway). If that is undesirable for some reason, we could use hex (64 KiB), base32 (1 MiB), or base64 (16 MiB) encodings.

It's also possible to use a variable-length encoding (e.g., base-128) to support unlimited lengths and simultaneously decrease the average size: most terms (those up to 127 chars long) would need only a single byte to encode their length.
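
For illustration, a minimal sketch of such a base-128 (varint) length encoding, assuming serialized values are handled as byte arrays:

// Each byte carries 7 bits of the length; the high bit marks "more bytes follow".
const encodeLength = (n) => {
  const bytes = [];
  do {
    let b = n & 0x7f;
    n >>>= 7;
    if (n > 0) b |= 0x80; // continuation bit
    bytes.push(b);
  } while (n > 0);
  return Uint8Array.from(bytes);
};

const decodeLength = (bytes, offset = 0) => {
  let n = 0;
  let shift = 0;
  let i = offset;
  for (;;) {
    const b = bytes[i++];
    n |= (b & 0x7f) << shift;
    if ((b & 0x80) === 0) break;
    shift += 7;
  }
  return { length: n, next: i };
};

encodeLength(100);   // one byte:  [0x64]
encodeLength(10000); // two bytes: [0x90, 0x4e]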

These are obviously not backward-compatible and would mess up all existing databases. One way to fix that is to exploit the fact that the current encoding has a lot of bits that are always set to 0. We could sacrifice one bit to mark the encoding (0 for old, 1 for new). This would halve the available length range but would keep things backward-compatible.
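
As a sketch of the marker-bit idea (again assuming byte-level access; legacy lengths are four ASCII digits, bytes 0x30 to 0x39, whose high bit is always 0):

const readLength = (bytes) => {
  if ((bytes[0] & 0x80) === 0) {
    // legacy encoding: four ASCII decimal digits
    return parseInt(String.fromCharCode(...bytes.subarray(0, 4)), 10);
  }
  // new encoding: big-endian 31-bit binary length, marker bit masked off
  return ((bytes[0] & 0x7f) << 24) | (bytes[1] << 16) | (bytes[2] << 8) | bytes[3];
};

readLength(Uint8Array.from('0042', (c) => c.charCodeAt(0))); // 42
readLength(Uint8Array.from([0x80, 0x00, 0x53, 0x40]));       // 21312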

@jacoscaz
Owner

@rasendubi I love this - lots of good ideas in your comment.

I've experimented quite a bit with binary encoding of lengths but couldn't find a way to do so without having to resort to TextEncoder / TextDecoder, which I have found to be a) horribly slow and b) not as well supported as I'd like, at least last I checked.

I had never even thought of using hex, base32 or base64! Do you reckon we could swap the current parseInt() with one of those without incurring a significant performance loss?

I have thought in the past about using variable-length length encoding but ultimately put that on the backburner as I had more pressing priorities. So far, fixed-length length encoding has made for a good tradeoff between performance, simplicity and storage space.

All that said, I'd be more than happy to revisit all of the current assumptions and choices and evaluate new ways of doing things.

These are obviously not backward-compatible and would mess up all existing databases.

This is not a big issue. Quadstore is the first project of this kind I have ever worked on, both when it comes to databases and when it comes to RDF/JS. It is bound to evolve quickly. Most major versions have had breaking changes when it comes to the (de)serialization mechanism. We could slot these changes in for the next major release. What I'm not too keen on is major regressions in performance and readability.

@jacoscaz
Owner

I believe this has some (un)expected consequences, such as range queries no longer working? Not that I expect to range over these long values, but that might be a counter-argument against including this by default.

Correct, although there might be a way to work around this, as we'd be able to add a dedicated (de)serialization type for long literals in places like https://github.com/belayeng/quadstore/blob/master/src/serialization/terms.ts#L210

@jacoscaz
Owner

Worth noting that a very simple tweak would be to use a different radix while keeping the current approach:

parseInt('zzzz', 36);      // returns 1_679_615
(1_679_615).toString(36);  // returns 'zzzz'
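
Padded to a fixed width, that slots straight into the existing 4-character scheme; a quick sketch (encodeLength36 / decodeLength36 are illustrative names):

// Same fixed 4-character slot, radix-36 digits instead of decimal.
// Maximum encodable length becomes 36 ** 4 - 1 = 1_679_615.
const encodeLength36 = (n) => n.toString(36).padStart(4, '0');
const decodeLength36 = (s) => parseInt(s, 36);

encodeLength36(21312);                 // '0gg0'
decodeLength36(encodeLength36(21312)); // 21312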

@jacoscaz
Owner

Quickly tested using radix 36 in https://github.com/belayeng/quadstore/tree/encoding-lengths-using-radix-36; performance remains exactly the same.

@rasendubi
Author

4-byte radix-36 encoding should give us a length limit of roughly 1.6 million chars: certainly good enough for my use case, and it seems like a very quick fix 👍

It would be interesting to experiment with and benchmark the other encodings, though (especially the variable-length one, which I feel has good potential), but that's likely a longer project.

@jacoscaz
Owner

@rasendubi I'll release a 13.x alpha version later tonight with this fix. In the meantime, you can check out the branch linked above and npm link the repository into your project; that should allow you to start working with it right away. May I ask you to comment on your use case in the related discussion, if you can? I'm very curious!

@jacoscaz
Owner

@rasendubi just published quadstore@13.0.0-alpha.1 to NPM!

@rasendubi
Author

Just tested with my data and it works fine.

@gsvarovsky
Contributor

The preWrite hook was actually added to quadstore by @gsvarovsky for precisely this use case: storing terms longer than quadstore's limits in m-ld. I'm not sure whether this is still the case

It certainly is!

I'm starting work on integrating a text CRDT into m-ld (m-ld/m-ld-spec#35). It's very likely I'll be making more use of preWrite, because I generally need an escape hatch that lets custom literal datatypes manage their own storage while still keeping writes atomic.

Many thanks for keeping me tagged, and sorry to be late to the party; it looks like you found a nice solution for your use case.

@jacoscaz
Owner

FYI, just released version 13.0.0-alpha.5 to NPM to keep the 13.x line in sync with recent changes on master.
