
serialized invertedIndex : more space efficient format possible? #268

Open
nkuehn opened this issue May 26, 2017 · 7 comments


@nkuehn commented May 26, 2017

Since the index format will be incompatible with v2.1 anyway (discussed e.g. in #263), this could be an opportunity to optimize it a bit for size (if possible!).

One thing I wonder is whether the empty objects in the serialized index will ever be used for anything. I guess they could be where the "metadataWhitelist" fields go, but that part of the code is a bit opaque to me; I can't really see them being used.

Here's an example invertedIndex entry in my serialized index (stemmed token "custom"). It's 584 characters minified. The index covers only about 80 documents, so this is not even an extreme case.

[
        "custom",
        {
          "_index": 444,
          "title": {
            "14": {},
            "15": {},
            "16": {},
            "17": {},
            "22": {},
            "53": {},
            "54": {}
          },
          "description": {
            "14": {},
            "15": {},
            "16": {},
            "17": {},
            "22": {},
            "53": {},
            "54": {}
          },
          "categories": {},
          "body": {
            "3": {},
            "4": {},
            "7": {},
            "8": {},
            "9": {},
            "10": {},
            "11": {},
            "12": {},
            "13": {},
            "14": {},
            "15": {},
            "16": {},
            "17": {},
            "18": {},
            "19": {},
            "20": {},
            "21": {},
            "22": {},
            "23": {},
            "24": {},
            "25": {},
            "26": {},
            "27": {},
            "28": {},
            "29": {},
            "30": {},
            "31": {},
            "32": {},
            "34": {},
            "36": {},
            "38": {},
            "41": {},
            "43": {},
            "44": {},
            "45": {},
            "47": {},
            "48": {},
            "49": {},
            "50": {},
            "51": {},
            "52": {},
            "53": {},
            "54": {},
            "56": {},
            "57": {},
            "58": {},
            "59": {},
            "60": {},
            "63": {},
            "64": {}
          }
        }
      ],

Does anything speak against serializing the document references as an array? I'm not sure whether the IDs are integers or strings; I'm assuming integers here, but strings would work too (though less efficiently). The same entry is now about 260 characters minified, roughly a 55% saving on that part of the index.

[
        "custom",
        {
          "_index": 444,
          "title": [14,15,16,17,22,53,54],
          "description": [14,15,16,17,22,53,54],
          "categories": [],
          "body": [3,4,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,34,36,38,41,43,44,45,47,48,49,50,51,52,53,54,56,57,58,59,60,63,64]
        }
      ],

Even if it's just 30% on less pronounced cases (terms that appear less frequently) and only 15% on the overall index size, I think it's worth considering, since the key advantage of lunr.js over other search engines is that it runs in-browser, and index size is a key bottleneck there.
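For illustration, converting between the two forms would be a pair of small helpers. This is a sketch only: compactPostings and expandPostings are hypothetical names, not part of the lunr.js API, and the sketch assumes the postings carry no per-term metadata (every value is an empty object), which is exactly the case the array format can represent.

```javascript
// { "14": {}, "15": {} }  ->  [14, 15]
// Integer-like keys are enumerated in ascending numeric order in JS,
// so the resulting array is sorted.
function compactPostings(postings) {
  return Object.keys(postings).map(Number);
}

// [14, 15]  ->  { "14": {}, "15": {} }
function expandPostings(docRefs) {
  var postings = {};
  docRefs.forEach(function (ref) {
    postings[ref] = {};
  });
  return postings;
}

var compact = compactPostings({ "14": {}, "15": {}, "22": {} });
var roundTripped = expandPostings(compact);
```

Expanding would happen once at load time, so the in-memory hash lookup structure is unchanged; only the wire format differs.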

@Andargor commented May 26, 2017

I generally agree on size reduction; however, I think the object keys are used for fast retrieval and intrinsic uniqueness, whereas your example would require iterating over each entry.

@nkuehn commented May 26, 2017

@Andargor sure, the property maps are much more suitable for the in-memory representation. But lunr.js has explicit serialization / deserialization code that maps the in-memory data structures to JSON.

So moving to a more efficient serialization does not imply changing the runtime structure: property hash lookups in the native JS engine are unbeatably fast, so those need to stay, of course.

@olivernn (owner) commented May 27, 2017

In 2.1.0-alpha.1 the structure of the index didn't need to change; only the vector references were slightly altered, from "docRef" to "fieldName/docRef". That said, it's still worth discussing what the format of a serialised index should be.

There are a couple of things that need to be balanced when designing the serialisation format. Obviously size (after minifying and compressing) is important: all other things being equal, a format that leads to less data to transfer is a win. How long a serialised index takes to deserialise also matters, and again, the quicker the better.

Balancing these two isn't always straightforward. Take the two examples from this issue: the existing version is basically just a JSON.parse away from being the same object that is held in memory, while the proposed version would require extra work to re-create the in-memory structure. It also doesn't account for any metadata stored for a term. And with compression, the difference in size is 20%; still worth considering, but not the 55% seen uncompressed.

I think the approach I want to take is to have something that is simple and fast to deserialise and reasonably small on the wire. I want to publish the schema for this format (it'll still be JSON, so probably JSON Schema), which would then allow more specific minification, such as the structure suggested at the top of the issue. I'm currently trying this out with a binary format for the index, which will be generated from the JSON serialised index. Deserialisation can either re-create lunr objects directly, or deserialise to the standard JSON format and rely on the built-in lunr.Index.load method.

This way, for small to medium sized indexes, the built in serialisation will work well, and if people want smaller index sizes, or faster deserialisation, they can use a plugin that caters to their needs.

What do you think?

@Andargor commented May 28, 2017

Just for another perspective: my current application has a serialized index that is 20 MB in size (including the wrapping search-function HTML, which is 1-2 KB). With compression enabled in node/express, 4.3 MB are actually transferred on the wire.

I'm just wondering what real gains a marginally smaller index would bring. I would think the smaller cleartext index would compress slightly worse, offsetting any advantage. I'm also not sure what processing advantages there would be, if any, since in my current example the transfer impacts user experience far more than the local decompression.

Just thinking out loud.

@nkuehn commented May 29, 2017

You're all right - it's a question that can only be answered by actually testing with real-world data (compressed and uncompressed).

I tried things like this a lot in the past, across Java and JavaScript (server and browser), and found a few commonalities:

  • Custom binary formats never did the trick for serialization / deserialization speed. The native JSON (de)serializers (or their heavily tuned counterparts like Jackson in the Java world) are unbelievably optimized, and after compression the actual gain was pretty low.
    E.g. you can try BSON, MessagePack, or CBOR (the latter being the best fit as a JSON drop-in), but after compression and deserialization I ended up with either noise or bigger sizes (e.g. because BSON is optimized for in-place modifiability on disk, not for size).
  • Compression does what it promises: it removes most of the duplication and repetitive noise. Just making artifacts smaller doesn't change much, but getting rid of artifacts altogether does.
  • Compressed wire size is king in web apps. We're talking about an order of magnitude difference between the optimization potential at serialization / deserialization time and the potential on the wire. Implicitly, that means caching, HTTP, and how the index is loaded in the browser are crucial.

My actual background here is that the site is statically generated (Jekyll-based) - which is the reason lunr.js is a good fit at all - and "hosted" by just dumping it on S3. So there is no compression without further DevOps effort, scripts, etc., and the uncompressed size does become interesting.

@olivernn (owner) commented May 29, 2017

@nkuehn regarding compression: if you can't compress some of your assets before deploying to S3, you can look at setting up CloudFront, which can do the compression on the fly for you; you might also be able to do the same with other CDN providers.

As for binary formats, serialisation performance isn't really an issue, as it is done offline at build time. Obviously it shouldn't take hours, but it's unlikely to be a priority. Deserialisation is unlikely to be as quick as JSON.parse, which is why a binary format is probably best suited to large indexes, where reducing the time spent transferring the index is a higher priority. The amount of data transferred, and what that costs the user on, say, a mobile data plan, should also be considered.

I'd be surprised if a custom binary format didn't provide a significant decrease in the size of an index, but I need to test this.

Getting back to the original issue, if there was a published schema for the plain JSON serialisation format for indexes, you could build on this to create a minified version as in your example.

@nkuehn commented Jun 21, 2017

PS on the topic: I ran across a bigger analysis the Uber engineering team did a while ago: https://eng.uber.com/trip-data-squeeze/ - probably a comparable data set (trip logs, i.e. a relatively repetitive structure).

TL;DR: you can get 20% smaller even after compression with a schema-free but optimized format like CBOR or MessagePack.
"Real" savings are only possible by switching to a schema-based (i.e. "with IDL") format like Google's Protocol Buffers: not transmitting the structure and property names, combined with more efficient encoding, can lead to savings > 50%.

You can "emulate" such a structure in a custom but then totally unreadable JSON that consists of just nested arrays etc. - not a real improvement IMHO.
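To make the trade-off concrete, such a positional encoding might look like the sketch below. This is hypothetical, not a lunr.js format: field names are implied by array position instead of being transmitted, so both sides must agree on the schema out of band, and the payload becomes unreadable.

```javascript
// Hypothetical positional encoding of one inverted-index entry:
// [term, _index, titleRefs, descriptionRefs, categoriesRefs, bodyRefs]
var entry = ["custom", 444, [14, 15, 16], [14, 15], [], [3, 4, 7]];

// Decoding requires the field list (the implicit "schema") to be
// known on both sides.
var FIELDS = ["title", "description", "categories", "body"];

function decode(entry) {
  var obj = { _index: entry[1] };
  FIELDS.forEach(function (name, i) {
    obj[name] = entry[i + 2];
  });
  return obj;
}

var decoded = decode(entry);
```

Any change to the field list silently breaks old serialised indexes, which is the maintenance cost hinted at above.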
