v1 #280

Open — wants to merge 62 commits into base: master

Changes from 19 commits (62 commits in total)
73e1000
Add tl::optional dependency
elshize Oct 22, 2019
58dd9c8
Minimal partial example
elshize Oct 22, 2019
f58a72a
ZipCursor
elshize Oct 23, 2019
f59af77
Additional methods
elshize Oct 25, 2019
e3acf75
Merge remote-tracking branch 'origin/master' into v1
elshize Oct 25, 2019
41e6549
Intersections + union + bigram index
elshize Oct 26, 2019
2223a6e
Type-erase source and reader.
elshize Oct 27, 2019
49712b5
Add tl::expected
elshize Oct 27, 2019
6833d25
Add tl::expected
elshize Oct 27, 2019
b189252
Posting header and builder
elshize Oct 28, 2019
36ec48c
Index runner
elshize Oct 28, 2019
0b03bcc
Update cursor API
elshize Oct 28, 2019
8f9faa8
On-the-fly BM25 scoring
elshize Oct 29, 2019
3c5ad7d
Precomputed scores
elshize Oct 30, 2019
f76fdd7
Index building tool
elshize Oct 31, 2019
d0994ca
Query and postings tools
elshize Nov 4, 2019
27a62ba
Blocked cursor + SIMDBP
elshize Nov 12, 2019
242223b
Precomputed scores
elshize Nov 12, 2019
4a40e13
Quantized scores
elshize Nov 12, 2019
c4e3e35
Add yaml-cpp dependency
elshize Nov 13, 2019
66b3dd4
Creating bigram index from query terms
elshize Nov 13, 2019
9c2442f
Union-lookup query (without precomputed scores and tool)
elshize Nov 15, 2019
024d017
Max scores + maxscore + union-lookup
elshize Nov 20, 2019
008efb7
Add rapidcheck
elshize Nov 23, 2019
945b644
Union-lookup with bigrams
elshize Nov 28, 2019
319895c
Add json library
elshize Nov 28, 2019
ba7f62c
Refactoring
elshize Nov 29, 2019
db28627
Add scripts
elshize Nov 29, 2019
a49dc0d
Two-phase union-lookup
elshize Nov 29, 2019
37edd68
Precomputed scores for bigram index
elshize Dec 2, 2019
c5cedf2
Union-lookup updates
elshize Dec 5, 2019
f44507a
Union-lookup cleanup
elshize Dec 6, 2019
547389c
Update porter2
elshize Dec 6, 2019
9c61991
JSON list queries and improved CLI
elshize Dec 17, 2019
92dec02
Fixes to filtering queries
elshize Dec 18, 2019
baeb423
Update script
elshize Dec 18, 2019
ac633ea
Merge branch 'master' into v1
elshize Dec 18, 2019
ac51df9
Test fixes after merge
elshize Dec 18, 2019
6a457f6
Intersections with JSON
elshize Dec 19, 2019
d428402
Small fixes
elshize Dec 20, 2019
abd480e
Translation units + WAND
elshize Dec 28, 2019
e2b8738
PEF index
elshize Dec 31, 2019
14b7790
Minor fixes
elshize Jan 6, 2020
131b305
Selecting best bigrams
elshize Jan 6, 2020
36e8da9
Add cereal library submodule
elshize Jan 6, 2020
826a772
Fixes to selecting pairs for indexing
elshize Jan 7, 2020
2dc73a2
Support posting stats
elshize Jan 8, 2020
087d51b
Selecting term-pairs and refactoring
elshize Jan 10, 2020
83d4120
Update gitignore
elshize Jan 10, 2020
5007856
Multi-threaded pair index building
elshize Jan 15, 2020
15fe455
Merge branch 'master' into v1
elshize Jan 15, 2020
7f5eb8b
Fix queries test after merge
elshize Jan 15, 2020
72e7ef4
Improved UL
elshize Jan 21, 2020
fde4d98
Scripts and tweaks
elshize Jan 25, 2020
f147046
Script update
elshize Jan 31, 2020
848d6aa
Merge remote-tracking branch 'origin/master' into v1
elshize Feb 3, 2020
43acf12
Expand LookupUnion stats
elshize Feb 3, 2020
1df5869
Merge remote-tracking branch 'origin/master' into v1
elshize Feb 4, 2020
dd28799
Merge remote-tracking branch 'origin/master' into v1
elshize Feb 4, 2020
5e853de
Refactor and test query inspection
elshize Feb 17, 2020
8480dc9
cmake
elshize Feb 21, 2020
c931089
Add counting individual term postings
elshize Feb 21, 2020
6 changes: 6 additions & 0 deletions .gitmodules
@@ -67,3 +67,9 @@
[submodule "external/wapopp"]
path = external/wapopp
url = https://github.com/pisa-engine/wapopp.git
[submodule "external/optional"]
path = external/optional
url = https://github.com/TartanLlama/optional.git
[submodule "external/expected"]
path = external/expected
url = https://github.com/TartanLlama/expected.git
15 changes: 10 additions & 5 deletions CMakeLists.txt
@@ -69,12 +69,15 @@ endif()
set(THREADS_PREFER_PTHREAD_FLAG ON)
find_package(Threads REQUIRED)

file(GLOB_RECURSE PISA_SRC_FILES FOLLOW_SYMLINKS "src/v1/*cpp")
list(SORT PISA_SRC_FILES)

include_directories(include)
add_library(pisa INTERFACE)
target_include_directories(pisa INTERFACE
add_library(pisa ${PISA_SRC_FILES})
target_include_directories(pisa PUBLIC
$<BUILD_INTERFACE:${PROJECT_SOURCE_DIR}/include/pisa>
)
target_link_libraries(pisa INTERFACE
target_link_libraries(pisa PUBLIC
Threads::Threads
Boost::boost
QMX
@@ -95,10 +98,12 @@ target_link_libraries(pisa INTERFACE
spdlog
fmt::fmt
range-v3
optional
)
target_include_directories(pisa INTERFACE external)
target_include_directories(pisa PUBLIC external)

add_subdirectory(src)
add_subdirectory(v1)
#add_subdirectory(src)

if (PISA_ENABLE_TESTING AND BUILD_TESTING)
enable_testing()
10 changes: 10 additions & 0 deletions external/CMakeLists.txt
@@ -116,3 +116,13 @@ add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/spdlog)

# Add range-v3
add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/range-v3)

# Add tl::optional
set(OPTIONAL_ENABLE_TESTS OFF CACHE BOOL "skip tl::optional testing")
add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/optional EXCLUDE_FROM_ALL)

# Add tl::expected
#set(EXPECTED_BUILD_TESTS OFF CACHE BOOL "skip tl::expected testing")
#set(EXPECTED_BUILD_PACKAGE OFF CACHE BOOL "skip tl::expected package")
#set(EXPECTED_BUILD_PACKAGE_DEB OFF CACHE BOOL "skip tl::expected package deb")
#add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/expected EXCLUDE_FROM_ALL)
1 change: 1 addition & 0 deletions external/expected
Submodule expected added at 3d7417
1 change: 1 addition & 0 deletions external/optional
Submodule optional added at 5c4876
22 changes: 11 additions & 11 deletions include/pisa/codec/simdbp.hpp
@@ -1,8 +1,8 @@
#pragma once

#include <vector>
#include "util/util.hpp"
#include "codec/block_codecs.hpp"
#include "util/util.hpp"
#include <vector>

extern "C" {
#include "simdcomp/include/simdbitpacking.h"
@@ -14,7 +14,8 @@ struct simdbp_block {
static void encode(uint32_t const *in,
uint32_t sum_of_values,
size_t n,
std::vector<uint8_t> &out) {
std::vector<uint8_t> &out)
{

assert(n <= block_size);
uint32_t *src = const_cast<uint32_t *>(in);
@@ -23,23 +24,22 @@ struct simdbp_block {
return;
}
uint32_t b = maxbits(in);
thread_local std::vector<uint8_t> buf(8*n);
uint8_t * buf_ptr = buf.data();
thread_local std::vector<uint8_t> buf(8 * n);
uint8_t *buf_ptr = buf.data();
*buf_ptr++ = b;
simdpackwithoutmask(src, (__m128i *)buf_ptr, b);
out.insert(out.end(), buf.data(), buf.data() + b * sizeof(__m128i) + 1);
}
static uint8_t const *decode(uint8_t const *in,
uint32_t *out,
uint32_t sum_of_values,
size_t n) {
static uint8_t const *decode(uint8_t const *in, uint32_t *out, uint32_t sum_of_values, size_t n)
{
assert(n <= block_size);
if (PISA_UNLIKELY(n < block_size)) {
return interpolative_block::decode(in, out, sum_of_values, n);
}
uint32_t b = *in++;
simdunpack((const __m128i *)in, out, b);
return in + b * sizeof(__m128i);
return in + b * sizeof(__m128i);
}
};
} // namespace pisa

} // namespace pisa
14 changes: 14 additions & 0 deletions include/pisa/io.hpp
@@ -61,4 +61,18 @@ void for_each_line(std::istream &is, Function fn)
return data;
}

[[nodiscard]] inline auto load_bytes(std::string const &data_file)
{
std::vector<std::byte> data;
std::basic_ifstream<std::byte> in(data_file.c_str(), std::ios::binary);
in.seekg(0, std::ios::end);
std::streamsize size = in.tellg();
in.seekg(0, std::ios::beg);
data.resize(size);
if (not in.read(data.data(), size)) {
throw std::runtime_error("Failed reading " + data_file);
}
return data;
}

} // namespace pisa::io
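
One caveat with the hunk above: `std::basic_ifstream<std::byte>` requires a `std::char_traits<std::byte>` specialization, which the standard library does not provide (it is specialized only for the character types), so this may fail to compile with common standard libraries. A portable sketch — the name `load_bytes_portable` is hypothetical, not part of the PR — reads through a plain `char` stream instead:

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Portable variant of load_bytes: read through a char stream and expose
// the result as std::vector<std::byte>.
[[nodiscard]] inline auto load_bytes_portable(std::string const &data_file)
    -> std::vector<std::byte>
{
    std::ifstream in(data_file, std::ios::binary);
    if (not in) {
        throw std::runtime_error("Failed opening " + data_file);
    }
    in.seekg(0, std::ios::end);
    auto size = static_cast<std::size_t>(in.tellg());
    in.seekg(0, std::ios::beg);
    std::vector<std::byte> data(size);
    // reinterpret_cast between char* and std::byte* is well-defined here,
    // since std::byte may alias any object representation.
    if (not in.read(reinterpret_cast<char *>(data.data()),
                    static_cast<std::streamsize>(size))) {
        throw std::runtime_error("Failed reading " + data_file);
    }
    return data;
}
```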
19 changes: 5 additions & 14 deletions include/pisa/query/queries.hpp
@@ -12,7 +12,7 @@
#include <spdlog/spdlog.h>

#include "index_types.hpp"
#include "query/queries.hpp"
#include "query/query.hpp"
#include "scorer/score_function.hpp"
#include "term_processor.hpp"
#include "tokenizer.hpp"
@@ -24,15 +24,6 @@

namespace pisa {

using term_id_type = uint32_t;
using term_id_vec = std::vector<term_id_type>;

struct Query {
std::optional<std::string> id;
std::vector<term_id_type> terms;
std::vector<float> term_weights;
};

[[nodiscard]] auto split_query_at_colon(std::string const &query_string)
-> std::pair<std::optional<std::string>, std::string_view>
{
@@ -98,10 +89,10 @@ struct Query
{
if (terms_file) {
auto term_processor = TermProcessor(terms_file, stopwords_filename, stemmer_type);
return [&queries, term_processor = std::move(term_processor)](
std::string const &query_line) {
queries.push_back(parse_query_terms(query_line, term_processor));
};
return
[&queries, term_processor = std::move(term_processor)](std::string const &query_line) {
queries.push_back(parse_query_terms(query_line, term_processor));
};
} else {
return [&queries](std::string const &query_line) {
queries.push_back(parse_query_ids(query_line));
19 changes: 19 additions & 0 deletions include/pisa/query/query.hpp
@@ -0,0 +1,19 @@
#pragma once

#include <cstdint>
#include <optional>
#include <string>
#include <vector>

namespace pisa {

using term_id_type = std::uint32_t;
using term_id_vec = std::vector<term_id_type>;

struct Query {
std::optional<std::string> id;
std::vector<term_id_type> terms;
std::vector<float> term_weights;
};

} // namespace pisa
79 changes: 79 additions & 0 deletions include/pisa/v1/README.md
@@ -0,0 +1,79 @@
> This document is a **work in progress**.

# Introduction

As we work toward v1.0 of both PISA and our index format,
we should start a discussion about the shape of things, from the point of view
of both the binary format and how we can use it in our library.

## Index Format specification

This document mainly discusses the binary file format of each index component,
as well as how these components come together to form a cohesive structure.

## Reference Implementation

Along with the format description and discussion, this directory includes a
reference implementation of the discussed structures and some algorithms working on them.

The goal is to show how things work on concrete examples,
and to find out what works and what doesn't and still needs to be thought through.

> Look in `test/test_v1.cpp` for code examples.

# Posting Files

> Example: `v1/raw_cursor.hpp`.

Each _posting file_ contains a header encoding information about the type of payload,
followed by a list of blocks of data, each related to a single term.

> Do we need the header? I would say "yes" because even if we store the information
> somewhere else, then we might want to (1) verify that we are reading what we think
> we are reading, and (2) verify format version compatibility.
> The latter should be further discussed.

```
Posting File := Header, [Posting Block]
```

> **Review comment:** Is there one Header for each Posting Block?
>
> **Reply (author):** No, one header, followed by a list of blocks.

Each posting block encodes a list of homogeneous values, called _postings_.
The encoding is not fixed.

> Note that _block_ here means the entire posting list area.
> We can work on the terminology.
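
To make the layout concrete, here is a minimal parsing sketch. The names (`PostingFileView`, `parse_posting_file`) and the three-byte header are entirely hypothetical — the real header is defined in `v1/posting_format_header.hpp` — but the shape is the one described above: one header, then the block area.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical, simplified view of a posting file: a version triple
// followed by the area holding all posting blocks.
struct PostingFileView {
    std::uint8_t major, minor, patch;  // format version (illustrative width)
    std::byte const *blocks;           // start of the posting-block area
    std::size_t blocks_size;           // size of that area in bytes
};

// Parse a posting file: one header, followed by a list of posting blocks.
inline auto parse_posting_file(std::byte const *data, std::size_t size) -> PostingFileView
{
    assert(size >= 3);
    PostingFileView view{};
    std::memcpy(&view.major, data, 1);
    std::memcpy(&view.minor, data + 1, 1);
    std::memcpy(&view.patch, data + 2, 1);
    view.blocks = data + 3;
    view.blocks_size = size - 3;
    return view;
}
```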

## Header

> Example: `v1/posting_format_header.hpp`.

We should store the type of the postings in the file, as well as the encoding used.
**This might be tricky because we'd like it to be an open set of values/encodings.**

```
Header := Version, Type, Encoding
Version := Major, Minor, Patch
Type := ValueId, Count
```

> **Review comment:** I am a bit confused by Type. What are ValueId and Count?
>
> **Reply (author):** ValueId would be the type, and Count would be how many of those, as in one, or a pair, or a tuple, etc. Actually, so far I implemented it like this: https://github.com/pisa-engine/pisa/pull/280/files#diff-2a007c99bc1af07f1fb150c293383559R71
>
> Another approach would be to always have scalars in one file, and join multiple ones for tuples. But then we can't store arrays of undetermined length (say, a positional index). All of this is up for discussion.
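
A hypothetical in-memory rendering of this grammar could look like the structs below. The field widths and the flat `encoding` ID are illustrative assumptions for this sketch, not what `v1/posting_format_header.hpp` actually serializes.

```cpp
#include <cassert>
#include <cstdint>

// Version := Major, Minor, Patch
struct Version {
    std::uint16_t major;
    std::uint16_t minor;
    std::uint16_t patch;
};

// Type := ValueId, Count
struct Type {
    std::uint32_t value_id;  // which primitive (e.g. int32, float32)
    std::uint32_t count;     // arity: 1 = scalar, 2 = pair, ...
};

// Header := Version, Type, Encoding
struct Header {
    Version version;
    Type type;
    std::uint64_t encoding;  // ID/hash of the registered encoding
};
```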

## Posting Types

I think supporting the following types will be sufficient to express just about anything we
would want to, including single-value lists, document-frequency (or score) lists,
positional indexes, etc.

```
Type := Primitive | List[Type] | Tuple[Type]
Primitive := int32 | float32
```
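
The recursive grammar above can be sketched in C++ with `std::variant`. The names below are illustrative only; the point is that `List` and `Tuple` nodes refer back to `Type`, so arbitrarily nested posting types can be described.

```cpp
#include <cassert>
#include <memory>
#include <variant>
#include <vector>

// Primitive := int32 | float32
enum class Primitive { int32, float32 };

struct Type;  // forward declaration so List/Tuple can refer back to Type

struct List {
    std::shared_ptr<Type> element;  // List[Type]: element type, any nesting
};

struct Tuple {
    std::vector<Type> elements;  // Tuple[Type]: fixed number of components
};

// Type := Primitive | List[Type] | Tuple[Type]
struct Type {
    std::variant<Primitive, List, Tuple> node;
};

// Example: a positional posting could be Tuple[int32, List[int32]]
// (a document ID paired with a list of positions).
inline auto positional_posting_type() -> Type
{
    Type docid{Primitive::int32};
    Type positions{List{std::make_shared<Type>(Type{Primitive::int32})}};
    return Type{Tuple{{docid, positions}}};
}
```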

## Encodings

We can identify encodings by a name, an ID/hash, or both.
I can imagine that an index reader could **register** new encodings,
and default to whatever we define in PISA.
We should then also verify that such an encoding implements an `Encoding<Type>` "concept".
This is not the same as our "codecs"; it would be more like a posting-list reader.

> Example: `IndexRunner` in `v1/index.hpp`.
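
A minimal sketch of that registration idea — hypothetical names, not the actual `IndexRunner` mechanism — might map encoding names to decode functions:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <unordered_map>

// Illustrative decoder signature: (encoded input, output buffer, count) -> count.
using Decoder =
    std::function<std::size_t(std::uint8_t const *, std::uint32_t *, std::size_t)>;

// A reader-side registry: PISA defaults would be pre-registered, and users
// could add their own encodings under new names.
class EncodingRegistry {
  public:
    void register_encoding(std::string name, Decoder decoder)
    {
        m_decoders.emplace(std::move(name), std::move(decoder));
    }
    [[nodiscard]] auto find(std::string const &name) const -> std::optional<Decoder>
    {
        if (auto pos = m_decoders.find(name); pos != m_decoders.end()) {
            return pos->second;
        }
        return std::nullopt;
    }

  private:
    std::unordered_map<std::string, Decoder> m_decoders;
};
```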
17 changes: 17 additions & 0 deletions include/pisa/v1/bit_cast.hpp
@@ -0,0 +1,17 @@
#pragma once

#include <cstring>

#include <gsl/span>

namespace pisa::v1 {

template <class T>
constexpr auto bit_cast(gsl::span<const std::byte> mem) -> std::remove_const_t<T>
{
std::remove_const_t<T> dst{};
std::memcpy(&dst, mem.data(), sizeof(T));
return dst;
}

} // namespace pisa::v1
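
For illustration, here is a self-contained usage sketch of the same idea, with a raw `std::byte const *` standing in for `gsl::span` so the snippet has no GSL dependency; the name `bit_cast_bytes` is invented for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Reinterpret a byte buffer as a value of type T via memcpy, which avoids
// the undefined behavior of a direct pointer cast on misaligned data.
template <class T>
auto bit_cast_bytes(std::byte const *mem) -> std::remove_const_t<T>
{
    std::remove_const_t<T> dst{};
    std::memcpy(&dst, mem, sizeof(T));
    return dst;
}
```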