v1 #280

Open: wants to merge 62 commits into master

Member

elshize commented Oct 28, 2019

I've started a draft of the index format specification, along with some code examples.

Let's discuss!

elshize added the wip label Oct 28, 2019
elshize added this to the v1.0 milestone Oct 28, 2019
elshize self-assigned this Oct 28, 2019
elshize added this to In progress in 1.0 Stabilization via automation Oct 28, 2019
Member Author

elshize commented Nov 12, 2019

As of the latest commit, we can compress an index (compress program) and run queries, both for evaluation and for benchmarks, with the query program. It isn't parameterized by algorithm yet, so only ranked OR runs, but it already performs very well compared to the old ranked OR. Below are results for blocked posting lists with SIMDBP encoding on Robust (measured on my laptop).

New:

[2019-11-11 21:01:47.675] [info] Mean: 1568.39
[2019-11-11 21:01:47.675] [info] 50% quantile: 1013
[2019-11-11 21:01:47.675] [info] 90% quantile: 3199
[2019-11-11 21:01:47.675] [info] 95% quantile: 6066

Old:

[2019-11-11 21:01:53.201] [stderr] [info] Mean: 1751.2
[2019-11-11 21:01:53.201] [stderr] [info] 50% quantile: 1079
[2019-11-11 21:01:53.202] [stderr] [info] 90% quantile: 3483
[2019-11-11 21:01:53.202] [stderr] [info] 95% quantile: 6693
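
To put the two runs side by side, the relative change per statistic can be worked out directly from the figures above. This is only arithmetic on the reported numbers: the logs do not state the time unit, so none is assumed here, and the dictionary and function names are illustrative.

```python
# Figures copied from the "New" and "Old" log lines above (unit not stated in the logs).
new = {"mean": 1568.39, "q50": 1013, "q90": 3199, "q95": 6066}
old = {"mean": 1751.2, "q50": 1079, "q90": 3483, "q95": 6693}


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which `after` is lower than `before`."""
    return (before - after) / before


reductions = {key: relative_reduction(old[key], new[key]) for key in old}

for key, value in reductions.items():
    print(f"{key}: {value:.1%} lower than the old ranked OR")
```

On these figures the new ranked OR comes out roughly 6% to 10% lower across the reported statistics.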

The more interesting algorithms aren't implemented yet, but stay tuned.

Also, compression is concurrent and quite fast: Robust compresses in seconds.

Member Author

elshize commented Nov 12, 2019

A quick update: I implemented precomputed scores (only as floats for now, without quantization), stored as 4 bytes each. This takes a lot of space, but the idea is to eventually quantize each score to one byte and get a very fast score lookup. I expect that to be at least as fast. I'll also experiment with different codecs later, once the scores are integers instead of floats.

[2019-11-11 21:47:38.576] [info] Mean: 793.68
[2019-11-11 21:47:38.576] [info] 50% quantile: 418
[2019-11-11 21:47:38.576] [info] 90% quantile: 1948
[2019-11-11 21:47:38.576] [info] 95% quantile: 4284
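
The one-byte idea mentioned above could look roughly like the following. This is a minimal sketch of linear (uniform) quantization, not the code in this PR: the function names, the use of a single global maximum score as the scale, and rounding to the nearest level are all assumptions made for illustration.

```python
from array import array


def quantize_scores(scores, max_score):
    """Map float scores in [0, max_score] to one-byte integers in [0, 255]."""
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    quantized = array("B")  # unsigned 8-bit integers, one byte per score
    for score in scores:
        clamped = min(max(score, 0.0), max_score)
        quantized.append(round(clamped / max_score * 255))
    return quantized


def dequantize(value, max_score):
    """Approximate the original score from its one-byte representation."""
    return value / 255 * max_score


scores = [0.0, 1.25, 3.5, 7.0]
max_score = max(scores)
packed = quantize_scores(scores, max_score)

floats = array("f", scores)  # single-precision floats, as in the unquantized version
print(len(floats) * floats.itemsize, "bytes as floats")
print(len(packed) * packed.itemsize, "bytes quantized")
```

Under these assumptions each score shrinks from four bytes to one, and the reconstruction error is at most half a quantization step, that is `max_score / 510`.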

```
Header := Version, Type, Encoding
Version := Major, Minor, Path
Type := ValueId, Count
```

Member

I am a bit confused by Type. What are ValueId and Count?

Member Author

ValueId would be the value type, and Count would be how many of those values make up one element: one, a pair, a tuple, and so on.

Actually, so far I have implemented it like this: https://github.com/pisa-engine/pisa/pull/280/files#diff-2a007c99bc1af07f1fb150c293383559R71

Another approach would be to always store scalars in one file and join multiple files for tuples. But then we couldn't store arrays of undetermined length (say, a positional index). All of this is up for discussion.

> The latter should be further discussed.

```
Posting File := Header, [Posting Block]
```

Member

Is there one Header for each Posting Block?

Member Author

No, one header, followed by a list of blocks.
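
To make the grammar and the answers above concrete, here is a minimal sketch of a posting file with a single header followed by a list of blocks. The byte layout is entirely hypothetical (one byte per header field, a 4-byte little-endian length prefix per block) and is not the layout used in this PR. Treating the third Version field as a patch number is also an assumption, as are all names and the numeric identifiers for value type and encoding.

```python
import struct
from dataclasses import dataclass

# Hypothetical layout: Major, Minor, third Version field, ValueId, Count, Encoding,
# each stored as one unsigned byte.
HEADER = struct.Struct("<6B")
BLOCK_LEN = struct.Struct("<I")  # hypothetical 4-byte length prefix per block


@dataclass(frozen=True)
class Header:
    major: int
    minor: int
    patch: int
    value_id: int  # which scalar type is stored
    count: int     # how many scalars per element: 1 = scalar, 2 = pair, ...
    encoding: int  # identifier of the block codec


def write_posting_file(header, blocks):
    """Serialize exactly one header, then every block with its length prefix."""
    out = bytearray(HEADER.pack(header.major, header.minor, header.patch,
                                header.value_id, header.count, header.encoding))
    for block in blocks:
        out += BLOCK_LEN.pack(len(block)) + block
    return bytes(out)


def read_posting_file(data):
    """Parse the single leading header, then read blocks until the data ends."""
    header = Header(*HEADER.unpack_from(data, 0))
    offset, blocks = HEADER.size, []
    while offset < len(data):
        (length,) = BLOCK_LEN.unpack_from(data, offset)
        offset += BLOCK_LEN.size
        blocks.append(data[offset:offset + length])
        offset += length
    return header, blocks


header = Header(major=1, minor=0, patch=0, value_id=3, count=2, encoding=7)
data = write_posting_file(header, [b"\x01\x02", b"\x03"])
parsed_header, parsed_blocks = read_posting_file(data)
print(parsed_header, parsed_blocks)
```

The header is written once at offset zero and never repeated, so a reader only needs it to decide how to decode every block that follows.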

Member Author

elshize commented Nov 12, 2019

Quick tests on Clueweb09B show essentially the same results for BM25 ranked OR as before, while the average drops from 392.932 to 267.576 with precomputed scores quantized to 1 byte each, without further compression.
