v1 #280

elshize · 2019-10-28T18:19:59Z

I've started a draft of the index format specification, along with some code examples.

Let's discuss!

(very early stage of) draft and lib code in include/pisa/v1
examples in test/test_v1.cpp

elshize · 2019-11-12T02:12:42Z

As of the latest commit, we can compress an index (with the compress program) and run queries, for both evaluation and benchmarking, with the query program. It's not yet parameterized by algorithm, so only ranked OR runs, but it performs very well compared to the old ranked OR. Below are the results for blocked posting lists with SIMDBP encoding on Robust (on my laptop).

New:

[2019-11-11 21:01:47.675] [info] Mean: 1568.39
[2019-11-11 21:01:47.675] [info] 50% quantile: 1013
[2019-11-11 21:01:47.675] [info] 90% quantile: 3199
[2019-11-11 21:01:47.675] [info] 95% quantile: 6066

Old:

[2019-11-11 21:01:53.201] [stderr] [info] Mean: 1751.2
[2019-11-11 21:01:53.201] [stderr] [info] 50% quantile: 1079
[2019-11-11 21:01:53.202] [stderr] [info] 90% quantile: 3483
[2019-11-11 21:01:53.202] [stderr] [info] 95% quantile: 6693

The more interesting algorithms aren't implemented yet, but stay tuned.

Also, the compression is concurrent and quite fast: Robust compresses in seconds.

elshize · 2019-11-12T02:53:23Z

A quick update: I implemented precomputed scores (only as floats for now, without quantization), stored as 4 bytes each. This takes a lot of space, but the idea is to eventually quantize each score to one byte and get a very fast score lookup. I expect it to be at least as fast. I'll also experiment with different codecs later, once I have integers instead of floats.

[2019-11-11 21:47:38.576] [info] Mean: 793.68
[2019-11-11 21:47:38.576] [info] 50% quantile: 418
[2019-11-11 21:47:38.576] [info] 90% quantile: 1948
[2019-11-11 21:47:38.576] [info] 95% quantile: 4284
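
To make the one-byte idea concrete, here is a minimal sketch of linear quantization. It is only an illustration under assumptions: the names `quantize` and `quantize_all` are hypothetical and not part of the PISA code, and it assumes scores are non-negative and scaled against the largest score in the list.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical helper (not PISA's API): maps a score in [0, max_score]
// linearly onto 0..255. Scores outside the range are clamped.
inline std::uint8_t quantize(float score, float max_score)
{
    if (max_score <= 0.0F) {
        return 0;  // degenerate range: everything maps to zero
    }
    float clamped = std::clamp(score, 0.0F, max_score);
    return static_cast<std::uint8_t>(std::lround(clamped / max_score * 255.0F));
}

// Quantizes a whole list, using its largest score as the upper bound
// (an assumption; a global bound over the index would work the same way).
inline std::vector<std::uint8_t> quantize_all(std::vector<float> const& scores)
{
    std::vector<std::uint8_t> out;
    if (scores.empty()) {
        return out;
    }
    float max_score = *std::max_element(scores.begin(), scores.end());
    out.reserve(scores.size());
    for (float s : scores) {
        out.push_back(quantize(s, max_score));
    }
    return out;
}
```

With this scheme each score costs one byte instead of four, at the price of rounding error of at most half a quantization step.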

amallia · 2019-11-12T16:27:14Z

include/pisa/v1/README.md

+```
+Header := Version, Type, Encoding
+Version := Major, Minor, Path
+Type := ValueId, Count


I am a bit confused by Type. What are ValueId and Count?

ValueId would be the element type, and Count would be how many of those make up one value: a single one, a pair, a tuple, etc.

Actually, so far I implemented it like this: https://github.com/pisa-engine/pisa/pull/280/files#diff-2a007c99bc1af07f1fb150c293383559R71

Another approach would be to always store scalars in one file, and join multiple files for tuples. But then we couldn't store arrays of undetermined length (say, for a positional index). All of this is up for discussion.
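
As a rough illustration of the `Type := ValueId, Count` idea, the sketch below packs both parts into a single byte. Everything in it is an assumption made for the example: the enumerator names, the one-byte layout, and the `encode_type`/`decode_type` helpers are hypothetical and do not describe the implementation linked above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical element types (assumed for illustration).
enum class ValueId : std::uint8_t { U32 = 0, F32 = 1 };

// Hypothetical arities: a single value, a pair, or an array of
// undetermined length (as for a positional index).
enum class Count : std::uint8_t { Scalar = 0, Pair = 1, Array = 2 };

struct Type {
    ValueId value;
    Count count;
};

// Packs the type into one byte: high 4 bits hold Count, low 4 bits ValueId.
constexpr std::uint8_t encode_type(Type t)
{
    return static_cast<std::uint8_t>(
        (static_cast<unsigned>(t.count) << 4U) | static_cast<unsigned>(t.value));
}

// Inverse of encode_type.
constexpr Type decode_type(std::uint8_t byte)
{
    return Type{static_cast<ValueId>(byte & 0x0FU), static_cast<Count>(byte >> 4U)};
}
```

Under this assumed layout, a pair of floats would encode as `0x11`, and a reader can recover both parts from the header byte without any other context.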

amallia · 2019-11-12T16:27:36Z

include/pisa/v1/README.md

+> The latter should be further discussed.
+
+```
+Posting File := Header, [Posting Block]


Is there one Header for each Posting Block?

No, one header, followed by a list of blocks.
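
A minimal sketch of that layout, one header followed by a list of blocks, could look like the code below. The details are assumptions for the sake of the example and are not taken from the draft: the 8-byte opaque header, the 4-byte length prefix in front of each block (in host byte order), and the names `PostingFile`, `serialize`, and `parse` are all hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

constexpr std::size_t kHeaderSize = 8;  // assumed size, for illustration only

struct PostingFile {
    std::array<std::uint8_t, kHeaderSize> header{};  // written exactly once
    std::vector<Bytes> blocks;                       // zero or more blocks
};

// Writes the header once, then each block prefixed with its length.
inline Bytes serialize(PostingFile const& file)
{
    Bytes out(file.header.begin(), file.header.end());
    for (auto const& block : file.blocks) {
        auto len = static_cast<std::uint32_t>(block.size());
        std::uint8_t prefix[sizeof(len)];
        std::memcpy(prefix, &len, sizeof(len));
        out.insert(out.end(), prefix, prefix + sizeof(len));
        out.insert(out.end(), block.begin(), block.end());
    }
    return out;
}

// Reads the header once, then blocks until the input is exhausted.
// Returns std::nullopt if the input is truncated.
inline std::optional<PostingFile> parse(Bytes const& in)
{
    if (in.size() < kHeaderSize) {
        return std::nullopt;
    }
    PostingFile file;
    std::copy_n(in.begin(), kHeaderSize, file.header.begin());
    std::size_t pos = kHeaderSize;
    while (pos < in.size()) {
        std::uint32_t len = 0;
        if (in.size() - pos < sizeof(len)) {
            return std::nullopt;
        }
        std::memcpy(&len, in.data() + pos, sizeof(len));
        pos += sizeof(len);
        if (in.size() - pos < len) {
            return std::nullopt;
        }
        auto first = in.begin() + static_cast<std::ptrdiff_t>(pos);
        file.blocks.emplace_back(first, first + static_cast<std::ptrdiff_t>(len));
        pos += len;
    }
    return file;
}
```

The point of the sketch is only that the header is read once and everything after it is block data; how blocks are actually delimited is a separate design question.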

elshize · 2019-11-12T20:45:28Z

Quick tests on Clueweb09B show essentially the same results for BM25 ranked OR as before, while the average drops from 392.932 to 267.576 with precomputed scores quantized to 1 byte each, without further compression.

elshize added 12 commits October 22, 2019 15:15

Add tl::optional dependency

73e1000

Minimal partial example

58dd9c8

ZipCursor

f58a72a

Additional methods

f59af77

Merge remote-tracking branch 'origin/master' into v1

e3acf75

Intersections + union + bigram index

41e6549

Type-erase source and reader.

2223a6e

Add tl::expected

49712b5

Add tl::expected

6833d25

Posting header and builder

b189252

Index runner

36ec48c

Update cursor API

0b03bcc

elshize added the wip label Oct 28, 2019

elshize added this to the v1.0 milestone Oct 28, 2019

elshize requested review from JMMackenzie and amallia October 28, 2019 18:19

elshize self-assigned this Oct 28, 2019

elshize added this to In progress in 1.0 Stabilization via automation Oct 28, 2019

elshize added 5 commits October 29, 2019 17:43

On-the-fly BM25 scoring

8f9faa8

Precomputed scores

3c5ad7d

Index building tool

f76fdd7

Query and postings tools

d0994ca

Blocked cursor + SIMDBP

27a62ba

Precomputed scores

242223b

Quantized scores

4a40e13

amallia reviewed Nov 12, 2019

elshize added 30 commits December 6, 2019 19:50

Update porter2

547389c

JSON list queries and improved CLI

9c61991

Fixes to filtering queries

92dec02

Update script

baeb423

Merge branch 'master' into v1

ac633ea

Test fixes after merge

ac51df9

Intersections with JSON

6a457f6

Small fixes

d428402

Translation units + WAND

abd480e

PEF index

e2b8738

Minor fixes

14b7790

Selecting best bigrams

131b305

Add cereal library submodule

36e8da9

Fixes to selecting pairs for indexing

826a772

Support posting stats

2dc73a2

Selecting term-pairs and refactoring

087d51b

Update gitignore

83d4120

Multi-threaded pair index building

5007856

Merge branch 'master' into v1

15fe455

Fix queries test after merge

7f5eb8b

Improved UL

72e7ef4

Scripts and tweaks

fde4d98

Script update

f147046

Merge remote-tracking branch 'origin/master' into v1

848d6aa

Expand LookupUnion stats

43acf12

Merge remote-tracking branch 'origin/master' into v1

1df5869

Merge remote-tracking branch 'origin/master' into v1

dd28799

Refactor and test query inspection

5e853de

cmake

8480dc9

Add counting individual term postings

c931089
