Importing long literals fails with cryptic error #158

Closed
rasendubi opened this issue Mar 22, 2023 · 13 comments

@rasendubi

rasendubi commented Mar 22, 2023

I'm trying to import my dataset, but quadstore fails with a cryptic "too long" error.

Error: too long: 21312
    at padNumStart (.../quadstore/dist/esm/serialization/utils.js:13:11)
    at Object.write (.../quadstore/dist/esm/serialization/terms.js:54:30)
    at Object.write (.../quadstore/dist/esm/serialization/terms.js:142:49)
    at Object.ingest (.../quadstore/dist/esm/serialization/quads.js:12:20)
    at Quadstore._batchPut (.../quadstore/dist/esm/quadstore.js:112:28)
    at Quadstore.put (.../quadstore/dist/esm/quadstore.js:127:14)

I found this to be caused by a long literal in my dataset and it looks like the limit is 9999 bytes. Is it possible to support larger values?

Minimal repro case:

import { MemoryLevel } from 'memory-level';
import dataFactory from '@rdfjs/data-model';
import { Quadstore } from 'quadstore';

const store = new Quadstore({
  backend: new MemoryLevel(),
  dataFactory,
});

await store.open();
// This put fails: the 10,000-char literal exceeds quadstore's term length limit.
await store.put(
  dataFactory.quad(
    dataFactory.namedNode('ex://s'),
    dataFactory.namedNode('ex://p'),
    dataFactory.literal('x'.repeat(10000))
  )
);
@jacoscaz
Owner

jacoscaz commented Mar 22, 2023

Hello @rasendubi!

That error is indeed a bit cryptic; I'll work on making it more explicit. As for the issue per se, the underlying problem is that the serialization mechanism requires the length of each term to be appended at the end of the serialized quad, so that terms can be reconstructed efficiently upon index reads. Supporting unlimited lengths would require scanning for separators, which would make deserialization significantly slower.

The current limitation of 9999 chars is due to the fact that quadstore reserves 4 characters (digits) for each of these lengths, a tradeoff that covers the vast majority of cases. Also, consider that each quad (and therefore each term in a quad) appears in the database as many times as there are configured indexes, meaning a 20,000-char term would actually take up 20,000 × 6 = 120,000 chars when the store is configured with the six default indexes.
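
For illustration, here is a simplified sketch of this scheme (not quadstore's actual code; this padNumStart only mimics the function in the stack trace above):

// Each term's length is encoded as a fixed-width, zero-padded decimal number,
// which makes deserialization a matter of slicing rather than scanning.
const LENGTH_DIGITS = 4; // fixed width => maximum encodable length is 9999

const padNumStart = (num) => {
  const str = String(num);
  if (str.length > LENGTH_DIGITS) {
    throw new Error(`too long: ${num}`); // the error reported above
  }
  return str.padStart(LENGTH_DIGITS, '0');
};

// Serialize a quad's terms, appending their lengths at the end.
const serializeQuad = (terms) =>
  terms.join('') + terms.map((term) => padNumStart(term.length)).join('');

serializeQuad(['ex://s', 'ex://p', 'hello']);
// => 'ex://sex://phello000600060005'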

Presently, terms of arbitrary length can be stored with a little pre-processing at ingestion time and post-processing at query time via quadstore's preWrite hook, which allows you to persist additional values to the underlying KV store atomically with the quad being added. The general idea would be to replace terms longer than X chars with either a NamedNode or a BlankNode following a specific pattern, storing the original term separately via the preWrite hook. Then, at query time, you'd reconstruct the original quad by intercepting quads with replacement terms and querying the KV store for the originals.

The preWrite hook was actually added to quadstore by @gsvarovsky for precisely this use case: storing terms longer than quadstore's limits in m-ld. I'm not sure whether this is still the case, but there is some usage of preWrite here: https://github.com/m-ld/m-ld-js/blob/e86c2244842cd0c5fdcaa32e080e6895d6276e59/src/engine/dataset/index.ts#L243
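
For illustration, a rough sketch of that idea (MAX_LENGTH, LONG_PREFIX and hashOf are illustrative names, not quadstore API, and the exact preWrite signature should be checked against the docs; it's assumed here to receive the backend's chained batch):

import { createHash } from 'node:crypto';

const MAX_LENGTH = 1000;          // replacement threshold, chosen arbitrarily
const LONG_PREFIX = 'ex://long/'; // recognizable pattern for replacement terms

const hashOf = (value) => createHash('sha256').update(value).digest('hex');

const putWithLongLiterals = async (store, dataFactory, quad) => {
  const { object } = quad;
  if (object.termType === 'Literal' && object.value.length > MAX_LENGTH) {
    const key = LONG_PREFIX + hashOf(object.value);
    const replaced = dataFactory.quad(
      quad.subject,
      quad.predicate,
      dataFactory.namedNode(key), // replacement term
      quad.graph,
    );
    // preWrite persists the original value atomically with the quad.
    await store.put(replaced, {
      preWrite: (batch) => batch.put(key, object.value),
    });
  } else {
    await store.put(quad);
  }
};

At query time, any NamedNode whose value starts with LONG_PREFIX would be swapped back by fetching the stored value from the underlying store.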

@jacoscaz
Owner

I wonder if it'd be worth shipping this mechanism within quadstore itself, perhaps as an opt-in feature. I've never had to store terms anywhere near that long, but yours is the third use case for extra-long terms that I have come across.

@rasendubi
Author

I believe this has some (un)expected consequences, such as range queries no longer working? Not that I expect to range over these long values, but that might be a counter-argument against including this by default.

Another option is changing the length encoding. Decimal encoding is quite inefficient: it only encodes 10,000 distinct values in 4 bytes. If we encoded lengths in binary, we could support lengths of up to 4 GiB with the same 4 bytes. This has the downside of requiring us to process serialized values as byte strings (which we probably should do anyway). If that is undesirable for some reason, we could use hex (64 KiB), base32 (1 MiB), or base64 (16 MiB) encodings.

It's also possible to use a variable-length encoding (e.g., base-128) to support unlimited lengths and simultaneously decrease the average size: most terms (those up to 127 chars long) would need only a single byte to encode their length.
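
For illustration, a minimal sketch of such a base-128 (varint) length encoding, assuming serialized values are handled as byte arrays:

// Each byte carries 7 bits of the length; the high bit marks "more bytes follow".
const encodeLength = (n) => {
  const bytes = [];
  do {
    let b = n & 0x7f;
    n >>>= 7;
    if (n > 0) b |= 0x80; // continuation bit
    bytes.push(b);
  } while (n > 0);
  return Uint8Array.from(bytes);
};

const decodeLength = (bytes, offset = 0) => {
  let n = 0;
  let shift = 0;
  let i = offset;
  for (;;) {
    const b = bytes[i++];
    n |= (b & 0x7f) << shift;
    if ((b & 0x80) === 0) break;
    shift += 7;
  }
  return { length: n, next: i };
};

encodeLength(100);   // one byte:  [0x64]
encodeLength(10000); // two bytes: [0x90, 0x4e]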

These are obviously not backward-compatible and would mess up all existing databases. One way to fix that is to exploit the fact that the current encoding has a lot of bits that are always set to 0. We could sacrifice one bit to mark the encoding (0 for old, 1 for new). This would halve the available length range but would keep things backward-compatible.
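
As a sketch of the marker-bit idea (again assuming byte-level access; legacy lengths are four ASCII digits, bytes 0x30 to 0x39, whose high bit is always 0):

const readLength = (bytes) => {
  if ((bytes[0] & 0x80) === 0) {
    // legacy encoding: four ASCII decimal digits
    return parseInt(String.fromCharCode(...bytes.subarray(0, 4)), 10);
  }
  // new encoding: big-endian 31-bit binary length, marker bit masked off
  return ((bytes[0] & 0x7f) << 24) | (bytes[1] << 16) | (bytes[2] << 8) | bytes[3];
};

readLength(Uint8Array.from('0042', (c) => c.charCodeAt(0))); // 42
readLength(Uint8Array.from([0x80, 0x00, 0x53, 0x40]));       // 21312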

@jacoscaz
Owner

@rasendubi I love this - lots of good ideas in your comment.

I've experimented quite a bit with binary encoding of lengths but couldn't find a way to do so without having to resort to TextEncoder / TextDecoder, which I have found to be a) horribly slow and b) not as well supported as I'd like, at least last I checked.

I had never even thought of using hex, base32 or base64! Do you reckon we could swap the current parseInt() with one of those without incurring a significant performance loss?

I have thought in the past about using variable-length length encoding but ultimately put that on the backburner as I had more pressing priorities. So far, fixed-length length encoding has made for a good tradeoff between performance, simplicity and storage space.

All that said, I'd be more than happy to revisit all of the current assumptions and choices and evaluate new ways of doing things.

These are obviously not backward-compatible and would mess up all existing databases.

This is not a big issue. Quadstore is the first project of this kind I have ever worked on, both when it comes to databases and when it comes to RDF/JS. It is bound to evolve quickly. Most major versions have had breaking changes when it comes to the (de)serialization mechanism. We could slot these changes in for the next major release. What I'm not too keen on is major regressions in performance and readability.

@jacoscaz
Owner

I believe this has some (un)expected consequences, such as range queries no longer working? Not that I expect to range over these long values, but that might be a counter-argument against including this by default.

Correct, although there might be a way to work around this, as we'd be able to add a dedicated (de)serialization type for long literals in places like https://github.com/belayeng/quadstore/blob/master/src/serialization/terms.ts#L210

@jacoscaz
Owner

Worth noting that a very simple tweak would be to use a different radix while keeping the current approach:

parseInt('zzzz', 36);      // returns 1_679_615
(1_679_615).toString(36);  // returns 'zzzz'
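
Padded to a fixed width, that slots straight into the existing 4-character scheme; a quick sketch (encodeLength36 / decodeLength36 are illustrative names):

// Same fixed 4-character slot, radix-36 digits instead of decimal.
// Maximum encodable length becomes 36 ** 4 - 1 = 1_679_615.
const encodeLength36 = (n) => n.toString(36).padStart(4, '0');
const decodeLength36 = (s) => parseInt(s, 36);

encodeLength36(21312);                 // '0gg0'
decodeLength36(encodeLength36(21312)); // 21312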

@jacoscaz
Owner

Quickly tested using radix 36 in https://github.com/belayeng/quadstore/tree/encoding-lengths-using-radix-36; performance remains exactly the same.

@rasendubi
Author

4-byte radix-36 encoding should give us a length limit of roughly 1.6 million chars: certainly good enough for my use case, and it seems like a very quick fix 👍

It would be interesting to experiment with and benchmark the other encodings, though (especially the variable-length one, which I feel has good potential), but that's likely a longer project.

@jacoscaz
Owner

@rasendubi I'll release a 13.x alpha version later tonight with this fix. In the meantime, you can check out the branch linked above and npm link the repository into your project; that should allow you to start working with it right away. May I ask you to comment on your use case in the related discussion, if you can? I'm very curious!

@jacoscaz
Owner

@rasendubi just published quadstore@13.0.0-alpha.1 to NPM!

@rasendubi
Author

Just tested with my data and it works fine.

@gsvarovsky
Contributor

The preWrite hook was actually added to quadstore by @gsvarovsky for precisely this use case: storing terms longer than quadstore's limits in m-ld. I'm not sure whether this is still the case

It certainly is!

I'm starting work on integrating a text CRDT into m-ld (m-ld/m-ld-spec#35). It's very likely I'll be making more use of preWrite, because I generally need an escape hatch that lets custom literal datatypes manage their own storage while still keeping writes atomic.

Many thanks for keeping me tagged, and sorry to be late to the party; it looks like you found a nice solution for your use case.

@jacoscaz
Owner

FYI, just released version 13.0.0-alpha.5 to NPM to keep the 13.x line in sync with recent changes on master.
