About online index updating #447

Open
justmao945 opened this issue Dec 9, 2020 · 4 comments
Labels
enhancement New feature or request help wanted Extra attention is needed priority:medium

Comments

@justmao945

justmao945 commented Dec 9, 2020

Dear friends,
First, thank you all for the great project!
This is the fanciest search engine I've found on GitHub!
In our case, we have streaming data that should be added to the existing index while queries are being served, but I can't find any description of this feature in the docs.
Is there any plan to support this feature?

@justmao945 justmao945 added the enhancement New feature or request label Dec 9, 2020
@JMMackenzie
Member

Hi there,

Unfortunately, we do not currently support online index updates, but this is something that we may be willing to support in the future. Depending on the size of the index and the frequency of the updates, there are various approaches which may be best. Perhaps you could list some more details, and we might be able to determine if anything is likely to be worked on in the future. Of course, we welcome collaborators, so you are welcome to work on this yourself too.

@amallia and @elshize - any thoughts?

@elshize
Member

elshize commented Jan 4, 2021

I haven't really given this feature much thought, and I don't really know what approach would be best. But I've been thinking about what the easiest way to support it would be, and here are some thoughts:

As you may know from the documentation, PISA has a rather unique indexing pipeline, with indexing split into separate stages: parsing, inverting, and compression. This diagram could help to visualize it. (btw @JMMackenzie this image doesn't render in the docs, we should probably fix that.)

I don't see a way to quickly update the index with a single document without serious refactoring and possibly even structural changes (but maybe I'm missing something?). However, I can see us supporting reasonably fast updates in batches. This is not to say a batch couldn't be just a single document, but it wouldn't be much faster than a batch of a thousand documents.

Parsing

Parsing is largely independent except for one crucial piece of functionality: documents and terms are assigned IDs at this stage. The way it works now, though, is that parsing is done in batches, which are then merged together, including the ID mappings. During an update, merging the forward index could be optional: if someone doesn't care about maintaining the forward index, only the mappings would be merged. We would need to make sure that the old document IDs stay the same.
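
To make the ID-stability requirement concrete, here is a minimal Python sketch of merging a batch into an existing mapping. It is a toy model, not PISA code: the function name `extend_mapping` and the plain `dict` representation are hypothetical, and it assumes IDs are dense integers starting at 0.

```python
def extend_mapping(old_ids: dict[str, int], new_keys: list[str]) -> dict[str, int]:
    """Return a mapping that keeps every old ID and appends IDs for unseen keys.

    Hypothetical helper for illustration only; assumes old IDs are 0..n-1.
    """
    merged = dict(old_ids)      # copy, so the old mapping is left untouched
    next_id = len(merged)       # first free ID after the existing ones
    for key in new_keys:
        if key not in merged:   # keys that already exist keep their old ID
            merged[key] = next_id
            next_id += 1
    return merged


old_terms = {"apple": 0, "banana": 1}
merged_terms = extend_mapping(old_terms, ["banana", "cherry", "apple", "date"])
```

Because new keys only ever receive IDs past the end of the old range, anything built on top of the old IDs remains valid after the update.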

Inverting

We could use the newly parsed forward index to build a small inverted index, and merge it with the old one. I believe we already have the mechanism to do so in our code, since we invert in batches as well.
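
As a rough illustration of that merge step (again a toy sketch, not PISA's actual batch-merging code), the example below models an uncompressed inverted index as a dictionary from term ID to a posting list of `(doc_id, frequency)` pairs sorted by document ID. The name `merge_inverted` and this representation are assumptions. If every document in the new batch gets an ID larger than all old ones, merging reduces to appending posting lists.

```python
Postings = dict[int, list[tuple[int, int]]]  # term ID -> [(doc ID, frequency), ...]


def merge_inverted(old: Postings, batch: Postings) -> Postings:
    """Append the batch's posting lists to the old ones, term by term.

    Hypothetical helper; assumes every batch doc ID exceeds every old doc ID.
    """
    merged = {term: list(plist) for term, plist in old.items()}
    for term, plist in batch.items():
        target = merged.setdefault(term, [])
        # Guard the assumption that keeps each posting list sorted by doc ID.
        if target and plist and plist[0][0] <= target[-1][0]:
            raise ValueError("batch document IDs must follow the old ones")
        target.extend(plist)
    return merged


old_index = {0: [(0, 2), (1, 1)], 1: [(1, 3)]}
batch_index = {1: [(2, 1)], 2: [(2, 4)]}
merged_index = merge_inverted(old_index, batch_index)
```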

Compression

Probably the whole thing should be rebuilt. Significantly more work would be required otherwise (I think).

Caveats

Note that the above approach means that you need to keep your uncompressed index and keep merging to it. This might or might not be a problem depending on the size.

This wouldn't be super fast, but it might be acceptable, depending on your update pattern.

Also, note that most of the "merging" I refer to is taking a number of files and producing a new one. The old ones could be removed right after, but you still need roughly twice as much storage available as your index.
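
The storage point can be seen in a small Python sketch of the "produce a new file, then drop the old ones" pattern. This is only an illustration under stated assumptions: `merge_files` is a hypothetical helper, and plain byte concatenation stands in for whatever a real index merge would do.

```python
import os
import tempfile


def merge_files(old_path: str, batch_path: str) -> None:
    """Write old + batch into a new file, then swap it in place of the old one.

    Hypothetical helper; concatenation is a stand-in for a real index merge.
    """
    directory = os.path.dirname(os.path.abspath(old_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as out:
        for path in (old_path, batch_path):
            with open(path, "rb") as src:
                while chunk := src.read(1 << 20):  # copy in 1 MiB chunks
                    out.write(chunk)
    # Here the old file and the merged copy both exist on disk,
    # which is why roughly twice the index size must be available.
    os.replace(tmp_path, old_path)  # swap the merged file into place
    os.remove(batch_path)           # the batch input is no longer needed
```

Writing to a temporary file in the same directory and swapping it in with `os.replace` means readers never see a half-written index, at the cost of the temporary extra disk space described above.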

Advantages

The advantage of such a "hacky" approach is that it would be doable in a reasonable time and without overhauling the entire indexing pipeline. It also shouldn't be too difficult.

@troycheng

Hi, is there any progress? I just found this awesome project and am looking for some info about this feature.

@elshize
Member

elshize commented Jul 21, 2021

Unfortunately, I haven't been able to look into that. I just graduated and started a new job, plus I have been busy in my personal life. Currently, I cannot tell when I'd have time to look into it, but if someone else took the lead on it, I'd probably be able to help with review and discussion.

@elshize elshize added the help wanted Extra attention is needed label Feb 27, 2022
4 participants