v1 #280

Open
wants to merge 62 commits into base: master

62 commits
73e1000
Add tl::optional dependency
elshize Oct 22, 2019
58dd9c8
Minimal partial example
elshize Oct 22, 2019
f58a72a
ZipCursor
elshize Oct 23, 2019
f59af77
Additional methods
elshize Oct 25, 2019
e3acf75
Merge remote-tracking branch 'origin/master' into v1
elshize Oct 25, 2019
41e6549
Intersections + union + bigram index
elshize Oct 26, 2019
2223a6e
Type-erase source and reader.
elshize Oct 27, 2019
49712b5
Add tl::expected
elshize Oct 27, 2019
6833d25
Add tl::expected
elshize Oct 27, 2019
b189252
Posting header and builder
elshize Oct 28, 2019
36ec48c
Index runner
elshize Oct 28, 2019
0b03bcc
Update cursor API
elshize Oct 28, 2019
8f9faa8
On-the-fly BM25 scoring
elshize Oct 29, 2019
3c5ad7d
Precomputed scores
elshize Oct 30, 2019
f76fdd7
Index building tool
elshize Oct 31, 2019
d0994ca
Query and postings tools
elshize Nov 4, 2019
27a62ba
Blocked cursor + SIMDBP
elshize Nov 12, 2019
242223b
Precomputed scores
elshize Nov 12, 2019
4a40e13
Quantized scores
elshize Nov 12, 2019
c4e3e35
Add yaml-cpp dependency
elshize Nov 13, 2019
66b3dd4
Creating bigram index from query terms
elshize Nov 13, 2019
9c2442f
Union-lookup query (without precomputed scores and tool)
elshize Nov 15, 2019
024d017
Max scores + maxscore + union-lookup
elshize Nov 20, 2019
008efb7
Add rapidcheck
elshize Nov 23, 2019
945b644
Union-lookup with bigrams
elshize Nov 28, 2019
319895c
Add json library
elshize Nov 28, 2019
ba7f62c
Refactoring
elshize Nov 29, 2019
db28627
Add scripts
elshize Nov 29, 2019
a49dc0d
Two-phase union-lookup
elshize Nov 29, 2019
37edd68
Precomputed scores for bigram index
elshize Dec 2, 2019
c5cedf2
Union-lookup updates
elshize Dec 5, 2019
f44507a
Union-lookup cleanup
elshize Dec 6, 2019
547389c
Update porter2
elshize Dec 6, 2019
9c61991
JSON list queries and improved CLI
elshize Dec 17, 2019
92dec02
Fixes to filtering queries
elshize Dec 18, 2019
baeb423
Update script
elshize Dec 18, 2019
ac633ea
Merge branch 'master' into v1
elshize Dec 18, 2019
ac51df9
Test fixes after merge
elshize Dec 18, 2019
6a457f6
Intersections with JSON
elshize Dec 19, 2019
d428402
Small fixes
elshize Dec 20, 2019
abd480e
Translation units + WAND
elshize Dec 28, 2019
e2b8738
PEF index
elshize Dec 31, 2019
14b7790
Minor fixes
elshize Jan 6, 2020
131b305
Selecting best bigrams
elshize Jan 6, 2020
36e8da9
Add cereal library submodule
elshize Jan 6, 2020
826a772
Fixes to selecting pairs for indexing
elshize Jan 7, 2020
2dc73a2
Support posting stats
elshize Jan 8, 2020
087d51b
Selecting term-pairs and refactoring
elshize Jan 10, 2020
83d4120
Update gitignore
elshize Jan 10, 2020
5007856
Multi-threaded pair index building
elshize Jan 15, 2020
15fe455
Merge branch 'master' into v1
elshize Jan 15, 2020
7f5eb8b
Fix queries test after merge
elshize Jan 15, 2020
72e7ef4
Improved UL
elshize Jan 21, 2020
fde4d98
Scripts and tweaks
elshize Jan 25, 2020
f147046
Script update
elshize Jan 31, 2020
848d6aa
Merge remote-tracking branch 'origin/master' into v1
elshize Feb 3, 2020
43acf12
Expand LookupUnion stats
elshize Feb 3, 2020
1df5869
Merge remote-tracking branch 'origin/master' into v1
elshize Feb 4, 2020
dd28799
Merge remote-tracking branch 'origin/master' into v1
elshize Feb 4, 2020
5e853de
Refactor and test query inspection
elshize Feb 17, 2020
8480dc9
cmake
elshize Feb 21, 2020
c931089
Add counting individual term postings
elshize Feb 21, 2020
2 changes: 1 addition & 1 deletion .clang-format
@@ -34,7 +34,7 @@ PenaltyBreakFirstLessLess: 20
 PenaltyBreakString: 1000
 PenaltyExcessCharacter: 1000000
 PenaltyReturnTypeOnItsOwnLine: 200
-PointerAlignment: Right
+PointerAlignment: Left
 SpaceAfterControlStatementKeyword: true
 SpaceBeforeAssignmentOperators: true
 SpaceInEmptyParentheses: false
17 changes: 17 additions & 0 deletions .clang-tidy
@@ -0,0 +1,17 @@
+HeaderFilterRegex: '.*include/pisa.*\.hpp'
+Checks: |
+  *,
+  -clang-diagnostic-c++17-extensions,
+  -llvm-header-guard,
+  -cppcoreguidelines-pro-type-reinterpret-cast,
+  -google-runtime-references,
+  -fuchsia-*,
+  -google-readability-namespace-comments,
+  -llvm-namespace-comment,
+  -clang-diagnostic-error,
+  -cppcoreguidelines-pro-bounds-pointer-arithmetic,
+  -cppcoreguidelines-avoid-magic-numbers,
+  -cppcoreguidelines-pro-bounds-array-to-pointer-decay,
+  -modernize-use-trailing-return-type,
+  -misc-non-private-member-variables-in-classes,
+  -readability-magic-numbers
1 change: 1 addition & 0 deletions .gitignore
@@ -25,3 +25,4 @@ docs/_build/
 node_modules

 .clangd/
+compile_commands.json
15 changes: 15 additions & 0 deletions .gitmodules
@@ -67,6 +67,21 @@
 [submodule "external/wapopp"]
 path = external/wapopp
 url = https://github.com/pisa-engine/wapopp.git
+[submodule "external/optional"]
+path = external/optional
+url = https://github.com/TartanLlama/optional.git
+[submodule "external/expected"]
+path = external/expected
+url = https://github.com/TartanLlama/expected.git
+[submodule "external/yaml-cpp"]
+path = external/yaml-cpp
+url = https://github.com/jbeder/yaml-cpp.git
 [submodule "external/rapidcheck"]
 path = external/rapidcheck
 url = https://github.com/emil-e/rapidcheck.git
+[submodule "external/json"]
+path = external/json
+url = https://github.com/nlohmann/json.git
+[submodule "external/cereal"]
+path = external/cereal
+url = https://github.com/USCiLab/cereal.git
40 changes: 33 additions & 7 deletions CMakeLists.txt
@@ -9,6 +9,9 @@ set(CMAKE_CXX_EXTENSIONS OFF)
option(PISA_BUILD_TOOLS "Build command line tools." ON)
option(PISA_ENABLE_TESTING "Enable testing of the library." ON)
option(PISA_ENABLE_BENCHMARKING "Enable benchmarking of the library." ON)
option(PISA_COMPILE_TOOLS "Compile CLI tools." ON)
option(FORCE_COLORED_OUTPUT "Always produce ANSI-colored output (GNU/Clang only)." ON)
option(PISA_LIBCXX "Use libc++ standard library." OFF)

configure_file(
${PISA_SOURCE_DIR}/include/pisa/pisa_config.hpp.in
@@ -30,7 +33,7 @@ ExternalProject_Add(gumbo-external
SOURCE_DIR ${CMAKE_CURRENT_SOURCE_DIR}/external/gumbo-parser
BINARY_DIR ${CMAKE_CURRENT_SOURCE_DIR}/external/gumbo-parser
CONFIGURE_COMMAND ./autogen.sh && ./configure --prefix=${CMAKE_BINARY_DIR}/gumbo-parser
BUILD_BYPRODUCTS ${CMAKE_BINARY_DIR}/gumbo-parser/lib/libgumbo.a
BUILD_BYPRODUCTS ${CMAKE_BINARY_DIR}/gumbo-parser/lib/libgumbo.a
BUILD_COMMAND ${MAKE})
add_library(gumbo::gumbo STATIC IMPORTED)
set_property(TARGET gumbo::gumbo APPEND PROPERTY INTERFACE_INCLUDE_DIRECTORIES
@@ -47,20 +50,38 @@ list(APPEND CMAKE_MODULE_PATH "${CMAKE_SOURCE_DIR}/external/CMake-codecov/cmake"
find_package(codecov)
list(APPEND LCOV_REMOVE_PATTERNS "'${PROJECT_SOURCE_DIR}/external/*'")


set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fno-strict-aliasing")
set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -DGSL_UNENFORCED_ON_CONTRACT_VIOLATION -flto")
if (UNIX)
# For hardware popcount and other special instructions
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -march=native")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -march=native -Wno-odr")

# Extensive warnings
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall -Wextra -Wno-missing-braces")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall -Wno-missing-braces")
#set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall -Wextra -Wno-missing-braces")

if (USE_SANITIZERS)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsanitize=address -fno-omit-frame-pointer")
endif ()

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -ggdb") # Add debug info anyway
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -ggdb -gdwarf") # Add debug info anyway

#set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fno-omit-frame-pointer")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fno-omit-frame-pointer -Wfatal-errors")

if (${FORCE_COLORED_OUTPUT})
if ("${CMAKE_CXX_COMPILER_ID}" STREQUAL "GNU")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fdiagnostics-color=always")
elseif ("${CMAKE_CXX_COMPILER_ID}" STREQUAL "Clang")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fcolor-diagnostics")
endif ()
endif ()

endif()

if (PISA_LIBCXX)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -stdlib=libc++")
set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -stdlib=libc++ -lc++abi")
endif()

find_package(OpenMP)
@@ -100,16 +121,21 @@ target_link_libraries(pisa PUBLIC # TODO(michal): are there any of these we can
spdlog
fmt::fmt
range-v3
optional
yaml-cpp
nlohmann_json
)
target_include_directories(pisa PUBLIC external)

if (PISA_BUILD_TOOLS)
if (PISA_COMPILE_TOOLS)
add_subdirectory(v1)
add_subdirectory(tools)
endif()

if (PISA_ENABLE_TESTING AND BUILD_TESTING)
enable_testing()
add_subdirectory(test)
#add_subdirectory(test)
add_subdirectory(test/v1)
endif()

if (PISA_ENABLE_BENCHMARKING)
19 changes: 18 additions & 1 deletion external/CMakeLists.txt
@@ -106,8 +106,9 @@ set(TRECPP_BUILD_TOOL OFF CACHE BOOL "skip trecpp testing")
 add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/trecpp)

 # Add trecpp
+set(JSON_MultipleHeaders ON CACHE BOOL "")
 set(WAPOPP_ENABLE_TESTING OFF CACHE BOOL "skip wapopp testing")
-add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/wapopp)
+add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/wapopp EXCLUDE_FROM_ALL)

 # Add fmt
 add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/fmt)
@@ -118,6 +119,22 @@ add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/spdlog)
 # Add range-v3
 add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/range-v3)

+# Add tl::optional
+set(OPTIONAL_ENABLE_TESTS OFF CACHE BOOL "skip tl::optional testing")
+add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/optional EXCLUDE_FROM_ALL)
+
+# Add yaml-cpp
+set(YAML_CPP_BUILD_TOOLS OFF CACHE BOOL "skip building YAML tools")
+set(YAML_CPP_BUILD_TESTS OFF CACHE BOOL "skip building YAML tests")
+add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/yaml-cpp EXCLUDE_FROM_ALL)
+
 # Add RapidCheck
 add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/rapidcheck)
 target_compile_options(rapidcheck PRIVATE -Wno-error=all)
+
+# Add json
+# TODO(michal): I had to comment this out because `wapocpp` already adds this target.
+# How should we deal with it?
+#set(JSON_MultipleHeaders ON CACHE BOOL "")
+#set(JSON_BuildTests OFF CACHE BOOL "skip building JSON tests")
+#add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/json)
1 change: 1 addition & 0 deletions external/cereal
Submodule cereal added at a5a309
1 change: 1 addition & 0 deletions external/expected
Submodule expected added at 3d7417
1 change: 1 addition & 0 deletions external/json
Submodule json added at e7b3b4
1 change: 1 addition & 0 deletions external/optional
Submodule optional added at 5c4876
1 change: 1 addition & 0 deletions external/yaml-cpp
Submodule yaml-cpp added at 72f699
19 changes: 13 additions & 6 deletions include/pisa/codec/integer_codes.hpp
@@ -5,7 +5,7 @@
 namespace pisa {

 // note: n can be 0
-inline void write_gamma(bit_vector_builder &bvb, uint64_t n)
+inline void write_gamma(bit_vector_builder& bvb, uint64_t n)
 {
     uint64_t nn = n + 1;
     uint64_t l = broadword::msb(nn);
@@ -14,22 +14,27 @@ inline void write_gamma(bit_vector_builder &bvb, uint64_t n)
     bvb.append_bits(nn ^ hb, l);
 }

-inline void write_gamma_nonzero(bit_vector_builder &bvb, uint64_t n)
+inline void write_gamma_nonzero(bit_vector_builder& bvb, uint64_t n)
 {
     assert(n > 0);
     write_gamma(bvb, n - 1);
 }

-inline uint64_t read_gamma(bit_vector::enumerator &it)
+template <typename BitVectorEnumerator>
+inline uint64_t read_gamma(BitVectorEnumerator& it)
 {
     uint64_t l = it.skip_zeros();
     assert(l < 64);
     return (it.take(l) | (uint64_t(1) << l)) - 1;
 }

-inline uint64_t read_gamma_nonzero(bit_vector::enumerator &it) { return read_gamma(it) + 1; }
+template <typename BitVectorEnumerator>
+inline uint64_t read_gamma_nonzero(BitVectorEnumerator& it)
+{
+    return read_gamma(it) + 1;
+}

-inline void write_delta(bit_vector_builder &bvb, uint64_t n)
+inline void write_delta(bit_vector_builder& bvb, uint64_t n)
 {
     uint64_t nn = n + 1;
     uint64_t l = broadword::msb(nn);
@@ -38,9 +43,11 @@ inline void write_delta(bit_vector_builder &bvb, uint64_t n)
     bvb.append_bits(nn ^ hb, l);
 }

-inline uint64_t read_delta(bit_vector::enumerator &it)
+template <typename BitVectorEnumerator>
+inline uint64_t read_delta(BitVectorEnumerator& it)
 {
     uint64_t l = read_gamma(it);
     return (it.take(l) | (uint64_t(1) << l)) - 1;
 }
+
 } // namespace pisa
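
Reviewer note: a minimal sketch of the gamma round trip that `write_gamma` and the now-templated `read_gamma` appear to implement. `ToyBitWriter`, `ToyBitReader`, `toy_msb`, `toy_write_gamma`, and `toy_read_gamma` are hypothetical stand-ins invented for this illustration; they are not PISA types. The sketch assumes that bits are appended least-significant first, that `skip_zeros()` also consumes the terminating one bit, and that the line elided from the hunk above appends `hb` in `l + 1` bits. It is meant only to show why taking the enumerator as a template parameter lets any type with `skip_zeros()` and `take()` be decoded.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a bit vector builder: stores bits LSB-first.
struct ToyBitWriter {
    std::vector<bool> bits;

    void append_bits(std::uint64_t value, std::size_t len)
    {
        for (std::size_t i = 0; i < len; ++i) {
            bits.push_back(((value >> i) & 1U) != 0);
        }
    }
};

// Hypothetical stand-in for an enumerator exposing skip_zeros() and take().
struct ToyBitReader {
    std::vector<bool> const* bits;
    std::size_t pos;

    explicit ToyBitReader(std::vector<bool> const& b) : bits(&b), pos(0) {}

    // Counts zeros up to the next set bit and consumes that set bit as well.
    std::uint64_t skip_zeros()
    {
        std::uint64_t zeros = 0;
        while (!(*bits)[pos]) {
            ++pos;
            ++zeros;
        }
        ++pos;
        return zeros;
    }

    // Reads len bits, least-significant first.
    std::uint64_t take(std::size_t len)
    {
        std::uint64_t value = 0;
        for (std::size_t i = 0; i < len; ++i) {
            if ((*bits)[pos++]) {
                value |= std::uint64_t(1) << i;
            }
        }
        return value;
    }
};

// Index of the highest set bit; requires x > 0.
inline std::uint64_t toy_msb(std::uint64_t x)
{
    std::uint64_t l = 0;
    while ((x >>= 1) != 0) {
        ++l;
    }
    return l;
}

// Gamma code of n (n may be 0, but must be below UINT64_MAX).
inline void toy_write_gamma(ToyBitWriter& w, std::uint64_t n)
{
    std::uint64_t nn = n + 1;
    std::uint64_t l = toy_msb(nn);
    std::uint64_t hb = std::uint64_t(1) << l;
    w.append_bits(hb, l + 1);   // l zeros followed by a single one
    w.append_bits(nn ^ hb, l);  // the l bits below the highest set bit
}

// Works with any reader type that provides skip_zeros() and take().
template <typename BitVectorEnumerator>
std::uint64_t toy_read_gamma(BitVectorEnumerator& it)
{
    std::uint64_t l = it.skip_zeros();
    assert(l < 64);
    return (it.take(l) | (std::uint64_t(1) << l)) - 1;
}
```

Under these assumptions, writing a sequence of values with `toy_write_gamma` and reading it back through a `ToyBitReader` returns the same values in the same order.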
22 changes: 11 additions & 11 deletions include/pisa/codec/simdbp.hpp
@@ -1,8 +1,8 @@
 #pragma once

-#include <vector>
-#include "util/util.hpp"
 #include "codec/block_codecs.hpp"
+#include "util/util.hpp"
+#include <vector>

 extern "C" {
 #include "simdcomp/include/simdbitpacking.h"
@@ -14,7 +14,8 @@ struct simdbp_block {
     static void encode(uint32_t const *in,
                        uint32_t sum_of_values,
                        size_t n,
-                       std::vector<uint8_t> &out) {
+                       std::vector<uint8_t> &out)
+    {

         assert(n <= block_size);
         uint32_t *src = const_cast<uint32_t *>(in);
@@ -23,23 +24,22 @@ struct simdbp_block {
             return;
         }
         uint32_t b = maxbits(in);
-        thread_local std::vector<uint8_t> buf(8*n);
-        uint8_t * buf_ptr = buf.data();
+        thread_local std::vector<uint8_t> buf(8 * n);
+        uint8_t *buf_ptr = buf.data();
         *buf_ptr++ = b;
         simdpackwithoutmask(src, (__m128i *)buf_ptr, b);
         out.insert(out.end(), buf.data(), buf.data() + b * sizeof(__m128i) + 1);
     }
-    static uint8_t const *decode(uint8_t const *in,
-                                 uint32_t *out,
-                                 uint32_t sum_of_values,
-                                 size_t n) {
+    static uint8_t const *decode(uint8_t const *in, uint32_t *out, uint32_t sum_of_values, size_t n)
+    {
         assert(n <= block_size);
         if (PISA_UNLIKELY(n < block_size)) {
             return interpolative_block::decode(in, out, sum_of_values, n);
         }
         uint32_t b = *in++;
         simdunpack((const __m128i *)in, out, b);
-        return in + b * sizeof(__m128i);
+        return in + b * sizeof(__m128i);
     }
 };
-} // namespace pisa
+
+} // namespace pisa
16 changes: 9 additions & 7 deletions include/pisa/cursor/block_max_scored_cursor.hpp
@@ -1,9 +1,11 @@
#pragma once

#include <vector>

#include "query/queries.hpp"
#include "scorer/bm25.hpp"
#include "scorer/index_scorer.hpp"
#include "wand_data.hpp"
#include "query/queries.hpp"
#include <vector>

namespace pisa {

@@ -20,9 +22,9 @@ struct block_max_scored_cursor {
};

template <typename Index, typename WandType, typename Scorer>
[[nodiscard]] auto make_block_max_scored_cursors(Index const &index,
WandType const &wdata,
Scorer const &scorer,
[[nodiscard]] auto make_block_max_scored_cursors(Index const& index,
WandType const& wdata,
Scorer const& scorer,
Query query)
{
auto terms = query.terms;
@@ -34,7 +36,7 @@ template <typename Index, typename WandType, typename Scorer>
query_term_freqs.begin(),
query_term_freqs.end(),
std::back_inserter(cursors),
[&](auto &&term) {
[&](auto&& term) {
auto list = index[term.first];
auto w_enum = wdata.getenum(term.first);
float q_weight = term.second;
@@ -45,4 +47,4 @@ template <typename Index, typename WandType, typename Scorer>
return cursors;
}

} // namespace pisa
} // namespace pisa
14 changes: 8 additions & 6 deletions include/pisa/cursor/max_scored_cursor.hpp
@@ -1,9 +1,11 @@
#pragma once

#include <vector>

#include "query/queries.hpp"
#include "scorer/bm25.hpp"
#include "scorer/index_scorer.hpp"
#include "wand_data.hpp"
#include "query/queries.hpp"
#include <vector>

namespace pisa {

@@ -17,9 +19,9 @@ struct max_scored_cursor {
};

template <typename Index, typename WandType, typename Scorer>
[[nodiscard]] auto make_max_scored_cursors(Index const &index,
WandType const &wdata,
Scorer const &scorer,
[[nodiscard]] auto make_max_scored_cursors(Index const& index,
WandType const& wdata,
Scorer const& scorer,
Query query)
{
auto terms = query.terms;
@@ -30,7 +32,7 @@ template <typename Index, typename WandType, typename Scorer>
std::transform(query_term_freqs.begin(),
query_term_freqs.end(),
std::back_inserter(cursors),
[&](auto &&term) {
[&](auto&& term) {
auto list = index[term.first];
float q_weight = term.second;
auto max_weight = q_weight * wdata.max_term_weight(term.first);