v1 #280
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
#pragma once | ||
|
||
#include <cstdint> | ||
#include <optional> | ||
#include <string> | ||
#include <vector> | ||
|
||
namespace pisa { | ||
|
||
using term_id_type = std::uint32_t; | ||
using term_id_vec = std::vector<term_id_type>; | ||
|
||
struct Query { | ||
std::optional<std::string> id; | ||
std::vector<term_id_type> terms; | ||
std::vector<float> term_weights; | ||
}; | ||
|
||
} // namespace pisa |
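A minimal usage sketch for the `Query` struct above. It assumes that `terms` and `term_weights` are meant to be parallel vectors, which the diff does not state. `make_uniform_query` and the `sketch` namespace are hypothetical and not part of PISA.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <vector>

namespace sketch {

using term_id_type = std::uint32_t;

// Mirrors the struct from the diff above.
struct Query {
    std::optional<std::string> id;
    std::vector<term_id_type> terms;
    std::vector<float> term_weights;
};

// Hypothetical helper: builds a query in which every term has weight 1.0.
// Assumes one weight per term (parallel vectors).
inline Query make_uniform_query(std::optional<std::string> id, std::vector<term_id_type> terms)
{
    Query query;
    query.id = std::move(id);
    query.term_weights.assign(terms.size(), 1.0F);
    query.terms = std::move(terms);
    assert(query.terms.size() == query.term_weights.size());
    return query;
}

}  // namespace sketch
```

A query without an identifier would be built as `sketch::make_uniform_query(std::nullopt, {3, 7, 42})`.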
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,79 @@ | ||
> This document is a **work in progress**. | ||
|
||
# Introduction | ||
|
||
As we work toward v1.0 of both PISA and our index format, | ||
we should start a discussion about their shape from the point of view | ||
of both the binary format and how we can use it in our library. | ||
|
||
## Index Format specification | ||
|
||
This document mainly discusses the binary file format of each index component, | ||
as well as how these components come together to form a cohesive structure. | ||
|
||
## Reference Implementation | ||
|
||
Along with the format description and discussion, this directory includes a | ||
reference implementation of the discussed structures and of some algorithms working on them. | ||
|
||
The goal is to show how things work on concrete examples, | ||
and to find out what works and what still needs to be thought through. | ||
|
||
> Look in `test/test_v1.cpp` for code examples. | ||
|
||
# Posting Files | ||
|
||
> Example: `v1/raw_cursor.hpp`. | ||
|
||
Each _posting file_ starts with a single header encoding information about the type of payload, | ||
followed by a list of blocks of data, each related to a single term. | ||
|
||
> Do we need the header? I would say "yes" because even if we store the information | ||
> somewhere else, then we might want to (1) verify that we are reading what we think | ||
> we are reading, and (2) verify format version compatibility. | ||
> The latter should be further discussed. | ||
|
||
``` | ||
Posting File := Header, [Posting Block] | ||
``` | ||
|
||
Each posting block encodes a list of homogeneous values, called _postings_. | ||
Encoding is not fixed. | ||
|
||
> Note that _block_ here means the entire posting list area. | ||
> We can work on the terminology. | ||
|
||
## Header | ||
|
||
> Example: `v1/posting_format_header.hpp`. | ||
|
||
We should store the type of the postings in the file, as well as the encoding used. | ||
**This might be tricky because we'd like it to be an open set of values/encodings.** | ||
|
||
``` | ||
Header := Version, Type, Encoding | ||
Version := Major, Minor, Patch | ||
Type := ValueId, Count | ||
Review comment: I am a bit confused by
Reply: Actually, so far I implemented it like this: https://github.com/pisa-engine/pisa/pull/280/files#diff-2a007c99bc1af07f1fb150c293383559R71 Another approach would be to always have scalars in one file, and join multiple ones for tuples. But then we can't store arrays of undetermined length (say, positional index). But all of this is up for discussion.
|
||
``` | ||
|
||
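The header grammar above could be laid out in memory roughly as follows. This is only a sketch under stated assumptions: the field widths (one byte per version component, one byte each for `ValueId` and `Count`, a 32-bit little-endian encoding ID) and the compatibility rule are made up for illustration, and none of the names come from `v1/posting_format_header.hpp`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

namespace sketch {

// Version := Major, Minor, Patch (one byte each; the widths are assumed).
struct Version {
    std::uint8_t v_major;
    std::uint8_t v_minor;
    std::uint8_t v_patch;
};

// Header := Version, Type, Encoding, with Type := ValueId, Count.
struct Header {
    Version version;
    std::uint8_t value_id;   // assumed: one byte identifying the value type
    std::uint8_t count;      // assumed: one byte
    std::uint32_t encoding;  // assumed: 32-bit encoding ID, little-endian on disk
};

constexpr std::size_t header_size = 9;  // 3 + 1 + 1 + 4 bytes
using HeaderBytes = std::array<std::uint8_t, header_size>;

// Writes the header field by field, so the layout does not depend on padding.
inline HeaderBytes serialize(Header const& header)
{
    HeaderBytes bytes{};
    bytes[0] = header.version.v_major;
    bytes[1] = header.version.v_minor;
    bytes[2] = header.version.v_patch;
    bytes[3] = header.value_id;
    bytes[4] = header.count;
    for (std::size_t i = 0; i < 4; ++i) {
        bytes[5 + i] = static_cast<std::uint8_t>((header.encoding >> (8 * i)) & 0xFFU);
    }
    return bytes;
}

// Reads the header back using the same assumed layout.
inline Header parse(HeaderBytes const& bytes)
{
    Header header{{bytes[0], bytes[1], bytes[2]}, bytes[3], bytes[4], 0};
    for (std::size_t i = 0; i < 4; ++i) {
        header.encoding |= static_cast<std::uint32_t>(bytes[5 + i]) << (8 * i);
    }
    return header;
}

// One possible compatibility policy (an assumption, still to be discussed):
// same major version, and the file's minor version is not newer than the reader's.
inline bool is_compatible(Version file, Version reader)
{
    return file.v_major == reader.v_major && file.v_minor <= reader.v_minor;
}

}  // namespace sketch
```

Serializing field by field rather than copying the struct keeps the on-disk layout independent of compiler padding, which matters if the header is also meant to verify format compatibility.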
## Posting Types | ||
|
||
I think supporting these types will be sufficient to express just about anything we | ||
would want, including single-value lists, document-frequency (or score) lists, | ||
positional indexes, etc. | ||
|
||
``` | ||
Type := Primitive | List[Type] | Tuple[Type] | ||
Primitive := int32 | float32 | ||
``` | ||
|
||
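The type grammar above maps naturally onto a small recursive data structure. The sketch below is one possible reading, not the PR's implementation: it assumes a tuple repeats a single element type a fixed number of times (suggested by `Type := ValueId, Count` in the header), and all names in the `sketch` namespace are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <variant>

namespace sketch {

// Primitive := int32 | float32
enum class Primitive { Int32, Float32 };

struct Type;

// List[Type]: a variable-length sequence of values of one type.
struct List {
    std::shared_ptr<Type const> element;
};

// Tuple[Type]: assumed here to be a fixed number of values of one type.
struct Tuple {
    std::shared_ptr<Type const> element;
    std::uint8_t count;
};

// Type := Primitive | List[Type] | Tuple[Type]
struct Type {
    std::variant<Primitive, List, Tuple> value;
};

// Renders a type description as text, recursing into nested types.
inline std::string to_string(Type const& type)
{
    struct Visitor {
        std::string operator()(Primitive p) const
        {
            return p == Primitive::Int32 ? "int32" : "float32";
        }
        std::string operator()(List const& list) const
        {
            return "List[" + to_string(*list.element) + "]";
        }
        std::string operator()(Tuple const& tuple) const
        {
            return "Tuple[" + to_string(*tuple.element) + "; " + std::to_string(tuple.count) + "]";
        }
    };
    return std::visit(Visitor{}, type.value);
}

}  // namespace sketch
```

Under this reading, the positions attached to one posting of a positional index would be described as `List[int32]`, which is the case that needs an array of undetermined length.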
## Encodings | ||
|
||
We can identify encodings by either a name or ID/hash, or both. | ||
I can imagine that an index reader could **register** new encodings, | ||
and default to whatever we define in PISA. | ||
We should then also verify that this encoding implements an `Encoding<Type>` "concept". | ||
This is not the same as our "codecs". | ||
It would be more like a posting list reader. | ||
|
||
> Example: `IndexRunner` in `v1/index.hpp`. |
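The idea of registering encodings by ID can be illustrated with a small registry that maps an encoding ID to a posting list reader. This is a hypothetical sketch, not the `IndexRunner` from `v1/index.hpp`: the `EncodingRegistry` class, the `Reader` alias, the ID values, and the raw layout (values stored back to back in native byte order) are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <stdexcept>
#include <unordered_map>
#include <utility>
#include <vector>

namespace sketch {

using Bytes = std::vector<std::uint8_t>;

// A reader decodes the bytes of one posting list into values of type T.
template <class T>
using Reader = std::function<std::vector<T>(Bytes const&)>;

// Maps encoding IDs (as they would appear in a header) to readers.
template <class T>
class EncodingRegistry {
  public:
    void register_encoding(std::uint32_t id, Reader<T> reader)
    {
        readers_[id] = std::move(reader);
    }

    // Throws if the ID read from a file has not been registered.
    Reader<T> const& lookup(std::uint32_t id) const
    {
        auto pos = readers_.find(id);
        if (pos == readers_.end()) {
            throw std::invalid_argument("unknown encoding ID");
        }
        return pos->second;
    }

  private:
    std::unordered_map<std::uint32_t, Reader<T>> readers_;
};

// Assumed "raw" encoding: 32-bit values stored back to back in native byte order.
inline std::vector<std::uint32_t> read_raw(Bytes const& bytes)
{
    std::vector<std::uint32_t> values(bytes.size() / sizeof(std::uint32_t));
    if (!values.empty()) {
        std::memcpy(values.data(), bytes.data(), values.size() * sizeof(std::uint32_t));
    }
    return values;
}

}  // namespace sketch
```

A library could pre-populate such a registry with its default encodings and let users add their own under new IDs; a runtime lookup like this trades some dispatch overhead for the open set of encodings discussed above.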
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
#pragma once | ||
|
||
#include <cstddef> | ||
#include <cstring> | ||
#include <type_traits> | ||
|
||
#include <gsl/span> | ||
|
||
namespace pisa::v1 { | ||
|
||
template <class T> | ||
constexpr auto bit_cast(gsl::span<const std::byte> mem) -> std::remove_const_t<T> | ||
{ | ||
std::remove_const_t<T> dst{}; | ||
std::memcpy(&dst, mem.data(), sizeof(T)); | ||
return dst; | ||
} | ||
|
||
} // namespace pisa::v1 |
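A dependency-free sketch of how a `bit_cast` like the one above can be used. To stay self-contained it takes a pointer and a length instead of `gsl::span`, so it is a stand-in rather than the function from the diff. It also adds two checks the diff version does not have: a `static_assert` on trivial copyability and an assertion that the buffer holds at least `sizeof(T)` bytes. `to_bytes` is a hypothetical helper for the round trip.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

namespace sketch {

// Stand-in for the bit_cast in the diff: pointer + size instead of gsl::span.
template <class T>
auto bit_cast(std::byte const* data, std::size_t size) -> std::remove_const_t<T>
{
    static_assert(std::is_trivially_copyable_v<T>, "memcpy requires a trivially copyable type");
    assert(size >= sizeof(T));  // the buffer must hold a whole T
    (void)size;                 // silences the unused warning when NDEBUG is set
    std::remove_const_t<T> dst{};
    std::memcpy(&dst, data, sizeof(T));
    return dst;
}

// Hypothetical helper: copies the object representation of a value into bytes.
template <class T>
auto to_bytes(T const& value) -> std::array<std::byte, sizeof(T)>
{
    static_assert(std::is_trivially_copyable_v<T>, "memcpy requires a trivially copyable type");
    std::array<std::byte, sizeof(T)> bytes{};
    std::memcpy(bytes.data(), &value, sizeof(T));
    return bytes;
}

}  // namespace sketch
```

Because the bytes are copied in native order, a value written on one machine reads back correctly only on machines with the same endianness, which is worth settling in the format specification.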
Review comment: Is there one `Header` for each `Posting Block`?
Reply: No, one header, followed by a list of blocks.