
Percolator Queries

Mark Papadakis edited this page Sep 23, 2017 · 1 revision

If you are familiar with Elasticsearch’s percolate query, you already know what this is about.

With ES, you register queries with the service, and whenever you submit a document it tells you which of the registered queries matched. With Trinity, you create as many instances of Trinity::percolator_query derived classes as you need, each based on a parsed Trinity::query. Then, for each document you want to check against those queries, you identify and track either all terms in the document or only the distinct terms involved in the queries, and call each query's match() method to determine whether it matches the document.
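The "only the distinct terms involved in the queries" option can be sketched as below. This is an illustration rather than Trinity code: collect_terms() and its tokenisation rules (split on non-alphanumeric characters, lower-case everything) are assumptions, and std::string stands in for str8_t. The set of wanted terms would be gathered from the queries you have created.

```cpp
#include <cctype>
#include <cstdint>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical helper, not part of Trinity.
// Splits `content` on non-alphanumeric characters, lower-cases each token and
// keeps only the tokens found in `wanted`. Positions are 1-based and count
// every token, including dropped ones, so phrase checks see true offsets.
// Positions are uint16_t, so they wrap for documents longer than 65535 tokens.
std::vector<std::pair<std::string, uint16_t>>
collect_terms(const std::string &content,
              const std::unordered_set<std::string> &wanted)
{
	std::vector<std::pair<std::string, uint16_t>> out;
	std::string token;
	uint16_t position{0};

	const auto flush = [&]() {
		if (token.empty())
			return;

		++position;
		if (wanted.count(token))
			out.emplace_back(token, position);
		token.clear();
	};

	for (const char c : content) {
		const auto uc = static_cast<unsigned char>(c);

		if (std::isalnum(uc))
			token.push_back(static_cast<char>(std::tolower(uc)));
		else
			flush();
	}
	flush(); // the last token has no trailing separator

	return out;
}
```

Tracking only the wanted terms keeps the per-document state small when documents are long and the queries involve few distinct terms.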

ES indexes the document in memory and executes the query against that special transient index, which can be expensive. Trinity instead compiles each query from its AST representation to the internal execution nodes tree representation, the same one the execution engine uses for regular Trinity queries, and executes against that intermediate tree. The compiler also runs the various optimiser passes, so a percolator query benefits from the same optimisations as a regular query.
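To make the idea of executing against a tree concrete, here is a deliberately tiny sketch. It is not Trinity's execution engine: Node, make_term(), make_group() and eval() are hypothetical names, only AND, OR and single terms are modelled, and no optimiser passes are run. The point is that matching needs nothing more than a "does the document have this term?" callback, with no index built for the document.

```cpp
#include <cstdint>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical node type; Trinity's real execution nodes are richer.
struct Node {
	enum class Kind { Term, And, Or };

	Kind kind{Kind::Term};
	uint16_t term{0};                        // used when kind == Term
	std::vector<std::unique_ptr<Node>> kids; // used when kind is And or Or
};

std::unique_ptr<Node> make_term(const uint16_t id)
{
	auto n = std::make_unique<Node>();

	n->kind = Node::Kind::Term;
	n->term = id;
	return n;
}

std::unique_ptr<Node> make_group(const Node::Kind kind,
                                 std::unique_ptr<Node> lhs,
                                 std::unique_ptr<Node> rhs)
{
	auto n = std::make_unique<Node>();

	n->kind = kind;
	n->kids.push_back(std::move(lhs));
	n->kids.push_back(std::move(rhs));
	return n;
}

// Walks the tree and asks `has_term` about each term id it reaches.
// AND stops at the first child that fails, OR at the first that succeeds.
bool eval(const Node &n, const std::function<bool(uint16_t)> &has_term)
{
	switch (n.kind) {
	case Node::Kind::Term:
		return has_term(n.term);

	case Node::Kind::And:
		for (const auto &k : n.kids) {
			if (!eval(*k, has_term))
				return false;
		}
		return true;

	case Node::Kind::Or:
		for (const auto &k : n.kids) {
			if (eval(*k, has_term))
				return true;
		}
		return false;
	}
	return false;
}
```

In this sketch the callback plays the role that match_term() plays in a percolator_query subclass: the tree holds term ids, and the application decides whether the current document contains them.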

It works like so:

struct document_repr
{
	// For each term in the document, we just track the term
	// and its position in the document, counted in terms
	std::vector<std::pair<str8_t, uint16_t>> terms;
	
	void reset(const std::string &content)
	{
		// parse content into terms
	}
	
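	bool match_term(const str8_t term)
	{
		// match_term_impl() is application-defined:
		// true if term appears anywhere in terms
		return match_term_impl(term);
	}
	
	bool match_phrase(const str8_t *terms, const uint16_t cnt)
	{
		// match_phrase_impl() is application-defined:
		// true if the cnt terms appear at consecutive positions
		return match_phrase_impl(terms, cnt);
	}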
};

struct AppPercolatorQuery final
	: public Trinity::percolator_query
{
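	// Inherit the base class constructors, so that
	// AppPercolatorQuery pq(q) compiles
	using Trinity::percolator_query::percolator_query;

	document_repr *curDocument{nullptr};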
	
	bool match_term(const uint16_t term) override final
	{
		// return true if the term is set in the "current" document
		return curDocument->match_term(term_by_index(term));
	}
	
	bool match_phrase(const uint16_t *phraseTerms, const uint16_t phraseTermsCnt) override final
	{
		// return true if the phrase identified by the sequence of
		// those terms is set in the "current" document

		// Note: a variable-length array is a GCC/Clang extension,
		// not standard C++
		str8_t phrase[phraseTermsCnt];
		
		for (uint16_t i{0}; i != phraseTermsCnt; ++i)
			phrase[i] = term_by_index(phraseTerms[i]);
		
		return curDocument->match_phrase(phrase, phraseTermsCnt);
	}
};

Trinity::query q("search (lucene OR trinity OR google OR duckduckgo)");
AppPercolatorQuery pq(q);

// We can use pq.distinct_terms() to get a list of
// all distinct terms involved with the query.

// Notice how the two match methods accept uint16_t for terms.
// To get the actual term (string) for the id, you can use
// pq.term_by_index(idx)

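// When you want to process a document, build a document_repr
// for it and point the query at that instance
document_repr docRepr;

docRepr.reset(content); // content: the text of the document
pq.curDocument = &docRepr;
const bool matched = pq.match();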

// You can do this for all other "registered" queries as well
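
The listing above leaves match_term_impl() and match_phrase_impl() to the application. One possible way to implement them over the (term, position) pairs is sketched below. It is an assumption-laden stand-in, not Trinity code: simple_document and has_term_at() are hypothetical names, std::string replaces str8_t, and a phrase is taken to match when its terms occur at consecutive positions.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for document_repr, using std::string for terms.
struct simple_document {
	// (term, 1-based position) for every term tracked for the document
	std::vector<std::pair<std::string, uint16_t>> terms;

	// True if `term` was seen at exactly `position`.
	bool has_term_at(const std::string &term, const uint16_t position) const
	{
		for (const auto &t : terms) {
			if (t.second == position && t.first == term)
				return true;
		}
		return false;
	}

	// True if `term` appears anywhere in the document.
	bool match_term(const std::string &term) const
	{
		for (const auto &t : terms) {
			if (t.first == term)
				return true;
		}
		return false;
	}

	// True if phrase[0], phrase[1], ... occur at consecutive positions.
	// Linear scans keep the sketch short; a real implementation would
	// likely index positions by term.
	bool match_phrase(const std::string *phrase, const uint16_t cnt) const
	{
		if (cnt == 0)
			return false;

		for (const auto &start : terms) {
			if (start.first != phrase[0])
				continue;

			uint16_t i{1};

			while (i != cnt && has_term_at(phrase[i], static_cast<uint16_t>(start.second + i)))
				++i;

			if (i == cnt)
				return true;
		}
		return false;
	}
};
```

Because positions count every term in the document, a gap between two tracked terms (a dropped or unrelated word) correctly prevents a phrase match.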