Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

issues/2: adding hybrid solution for hoisting constants #145

Closed
wants to merge 13 commits into from
Closed

issues/2: adding hybrid solution for hoisting constants #145

wants to merge 13 commits into from

Conversation

pressenna
Copy link
Contributor

This is the initial hybrid solution.
It is far from perfect, but it hoisted constants correctly (in most cases, except for CASE) into the query functions .entry block.

valid constanstants which are then hoisted into the query functions
.entry block.
Copy link
Contributor

@shtilman shtilman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Search for hoisted variables, multiple flavors of codegenHoistedConstants, Scalar visitor that codegens as it goes.. Seems a bit overengineered. Could you clean it up a bit and maybe describe the steps you're going through?

The way I thought about this hoisting was quite straightforward:

  • visit and collect Analyzer::Constants
  • when building the .entry block - go through that list and codegen these constants - that will fill the .literals payload and generate appropriate loads in the entry block
  • maintain a map that for each Constant would establish its relationship to codegen-ed llvm value (load) and also corresponding row_func arg
  • then in row_func whenever you need to codegen a Constant you first look in that map: if it's in, you pick the row_func arg it's tied to.

And all this should of course be guarded by hoist_literals flag.

@@ -300,12 +302,26 @@ llvm::Function* query_template(llvm::Module* mod,
Value* error_code = &*(++query_arg_it);
error_code->setName("error_code");

// WARNING: this does not work with the LICM optimizer... the generated IR seems correct, but it SIGSEGVs during peephole optimisation
// BasicBlock* bb_hoisted = BasicBlock::Create(mod->getContext(), ".hoisted_variables", query_func_ptr, 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, a separate block is not needed here - not to mention that we wouldn't want to create it if there's nothing to hoist, and it's a good idea to let the block called .entry to remain the entry block.

auto bb_entry = BasicBlock::Create(mod->getContext(), ".entry", query_func_ptr, 0);
BasicBlock* bb_hoisted = bb_entry;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed.

auto bb_preheader = BasicBlock::Create(mod->getContext(), ".loop.preheader", query_func_ptr, 0);
auto bb_forbody = BasicBlock::Create(mod->getContext(), ".for.body", query_func_ptr, 0);
auto bb_crit_edge = BasicBlock::Create(mod->getContext(), "._crit_edge", query_func_ptr, 0);
auto bb_exit = BasicBlock::Create(mod->getContext(), ".exit", query_func_ptr, 0);

if (NULL != bb_hoisted) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check not needed, call generator on bb_entry

const bool is_group_by{!query_mem_desc.group_col_widths.empty()};
auto query_func =
is_group_by ? query_group_by_template(cgen_state_->module_,
is_nested_,
co.hoist_literals_,
query_mem_desc,
co.device_type_,
ra_exe_unit.scan_limit)
ra_exe_unit.scan_limit, visitor_to_use)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please clang-format your changes

llvm::Value* findHoistedValueInBlock(llvm::Function* function, std::string name) {
// WARNING: this does not work with the LICM optimizer... the generated IR seems correct, but it SIGSEGVs during peephole optimisation
// llvm::BasicBlock* hoistedVariablesBlock = find_BasicBlock_by_name(function, ".hoisted_variables");
llvm::BasicBlock* hoistedVariablesBlock = find_BasicBlock_by_name(function, ".entry");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed we wanted to generate literal loads in the entry block, no need in extra blocks.
Function's entry block should remain the first in BB list, you shouldn't search for it, just calling front() should be fine. Not sure why you need to search for hoisted values by name..

@@ -657,24 +658,27 @@ std::vector<llvm::Value*> generate_column_heads_load(const int num_columns,
llvm::Function* query_func,
llvm::LLVMContext& context) {
auto max_col_local_id = num_columns - 1;
auto& fetch_bb = query_func->front();
llvm::BasicBlock * entryBlock = find_BasicBlock_by_name(query_func, ".entry");
llvm::BasicBlock& fetch_bb(*entryBlock);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, no need to look for .entry, it's the first block, front() should work just fine.

{
// WARNING: this does not work with the LICM optimizer... the generated IR seems correct, but it SIGSEGVs during peephole optimisation
// llvm::BasicBlock* hoistedVarsBlock = find_BasicBlock_by_name(query_func, ".hoisted_variables");
llvm::BasicBlock* hoistedVarsBlock = find_BasicBlock_by_name(query_func, ".entry");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

query_func->front()


if (bb_hoisted != bb_entry) {
BranchInst::Create(bb_entry, bb_hoisted);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Branch not needed.

@@ -583,12 +600,26 @@ llvm::Function* query_group_by_template(llvm::Module* mod,
Value* error_code = &*(++query_arg_it);
error_code->setName("error_code");

// WARNING: this does not work with the LICM optimizer... the generated IR seems correct, but it SIGSEGVs during peephole optimisation
// BasicBlock* bb_hoisted = BasicBlock::Create(mod->getContext(), ".hoisted_variables", query_func_ptr, 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for this template, .entry is entry

std::vector<llvm::Value*> Executor::codegenHoistedConstants(const std::vector<const Analyzer::Constant*>& constants,
const EncodingType enc_type,
const int dict_id) {
CHECK(!constants.empty());
const auto& type_info = constants.front()->get_type_info();
auto lit_buff_lv = get_arg_by_name(cgen_state_->row_func_, "literals");
int16_t lit_off{-1};
int64_t lit_off{-1};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you describe why you decided to revamp this function? I see different versions.. What is the workflow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • codegenHoistedConstantsInPlace
    Creates the literal load from the literals buffer in place, effectively the original codegenHoistedConstants enhanced to also set the names for IR nodes accordingly
  • codegenHoistedConstants
    This is the original entry point, it has been changed to first check if the Constant was already hoisted (by checking presence in the function argument list). If not, it calls codegenHoistedConstantsInPlace to generate the load.
  • codegenHoistedConstantsInEntryBlock (just renamed it)
    New entry point for the generation of the loads in the .entry block. It only creates the load, if the Constand had not been hoisted before.

@pressenna
Copy link
Contributor Author

Hey @shtilman ,

Sorry for the poor documentation on the change... will try to avoid this in the future.

The basic idea was as follows:

  1. Scan the RA Tree for Constants via the visitor
  2. Generate loads for found and valid Constants into the .entry block
    This is being done directly while scanning the RA Tree, because sometimes the parent RA node is required for correct type information for code generation. Doing this only on the Constants caused quite a few Tests to fail (especially where varchars were involved).
  3. Use the .entry block's IR as an indirect form of mapping
    The gen context already maintains a Constant -> offset mapping
  4. and finally when generating row_func it looks up the the right argument by name and, if found, uses it, otherwise generates a new load from the literals buffer.

This only works because duplicate Constants are eliminated during code generation, so the offset into the literals buffer is identical for Constants equal by value, and thus the names can be used for looking them up.

The rational for using the .entry block directly as mapping was my concern with the IN operator.
It has a hard limit of 5 million entries, so was trying to avoid maintaining too many mappings for the same thing (as the de-duplication already does this in some form: Constant -> offset ).

The hoist_literals flag is still being considered. If it is not set, then no constants will be hoisted.
This is achieved by passing the relevant methods a NULL instead of an instance of ConstantHoistingVisitor.

Please advice how to proceed.

@shtilman
Copy link
Contributor

Ok, let me dig a bit deeper into your approach.

@pressenna
Copy link
Contributor Author

I had a revisit as well and I think your are right to introduce a mapping between Constans (or their offset) and the function arguments / .basic_block literals load instructions.

It turns out that I am currently doing a linear search (by name) for the function arguments and in the .entry block for hoisted literal loads. I was hoping to do a binary search, but sadly llvm::SymbolTableList is a linked list, which sort of diminishes the benefits of binary search.

Please bear with me until I make the amendments.

@asuhan
Copy link
Contributor

asuhan commented Jan 13, 2018

Why would you even care about a binary search? We're talking maybe 100 constants at most.

@pressenna
Copy link
Contributor Author

Maybe I am missing something, but looking at the RelAlgTranslator::translateInOper I had the impression that it can create quite a few Constants instances. There is a special path for integers and dict encoded strings, but for decimals and floats it would take the hit and create as many Constants as there are results in the result set. Let me see if I can create a test case for this scenario.

@asuhan
Copy link
Contributor

asuhan commented Jan 14, 2018

Sure, but there is a lot of linear-time work to be done anyway during code generation, not sure what the impact would be. To be honest, I'm not entirely convinced we'd survive generating a kernel with enough constants for binary search to make a difference anyway. Also, in terms of use cases, the right hand side of an IN expression should be something which makes sense to enumerate (user names or ids, for example), which isn't really the case for floating point and decimals. So I'm not particularly bothered by the lack of binary search and would rather take simpler code.

@pressenna
Copy link
Contributor Author

Yes, makes sense. In that case I am going to leave it as it is for now, so as not to create too much deltas for a review.

Copy link
Contributor

@shtilman shtilman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @psockali,

Sorry for the pause.

I'm not sure about this fused visitor/codegen that tries to cover everything, and these repeated searches by prefix. Feels a bit fragile and heavy. How can we simplify it?

What if we abandon the visitor idea altogether and just let codegen go through all of the incoming Constants as it did before - getting the old offsets or adding new offsets as before, but then instead of generating corresponding loads it would generate named offset-based value references, accumulating a list of (offset, type) tuples along the way. Then this list would be used to generate all needed literal loads in query_func's entry block and adjust the declaration of row_func, and the Call as well.

@pressenna
Copy link
Contributor Author

Hey @shtilman,

this whole 'Spagat' only made sense with the assumption that there might be up to 5 million Constants (by using the IN operator). Now that assumption has been rendered void. It all feels like doing an initial pass to capture things and a final pass to generate the target IR seems most reasonable.
How do you feel about trashing this PR and instead refocusing on two/multi pass based IR generation, with constants hoisting as the initial benefactor of the multi-pass paradigm?

@asuhan
Copy link
Contributor

asuhan commented Jan 22, 2018

In which situation we care about could the IN operator create over 5 million constants? For integer and strings we generate a bitmap instead. Can we throw an exception if the IN operator generates 5+ million constants and still work under the same assumption?

@pressenna
Copy link
Contributor Author

@asuhan, I am not sure I am following you correctly.

The check is already there (https://github.com/mapd/mapd-core/blob/647a0efd7d931aa27105652796aa7416f19b1dc4/QueryEngine/RelAlgTranslator.cpp#L369), but I am not sure if g_enable_watchdog is true.

And I thought that decimal/floats are virtually not a use case.

Are you saying that a single pass with a retrospective mutation of the query_function's declaration (including a search and replace of the query_function's IR) outweighs the complexity it introduces?

I mean, if there are at most about 100 hoisted literals, I doubt it would make a big difference.

Either way, please advise if I should attempt the single pass approach...

Pressenna Sockalingasamy added 4 commits January 22, 2018 21:35
This reverts commit 4781182.
…pfront for valid constanstants which are then hoisted into the query functions .entry block."

This reverts commit 390cf4a.
@shtilman
Copy link
Contributor

@psockali it seems to me that we need to tie constant hoisting to the actual [staged] codegen, which would solve the problem of on-the-fly constants and also subquery constants, since hoisting would be staged too. All relevant constants would be injected into each query's compilation, after all subqueries have been compiled and executed.

My proposal is to continue filling up literals_ buffer but defer literal load generation. Or, perhaps, redirect generation of literal loads to the query func's entry block, i.e. whenever a constant is codegened, query_func's entry block would get the load from literal buffer (if it hasn't yet been loaded) and row_func would get an appropriately named arg value. You know the offset and the type of each literal so you can then connect the dots in the IR before compiling it.

If number of constants gets too big you could throw an exception or you could stop hoisting.

@asuhan
Copy link
Contributor

asuhan commented Jan 23, 2018

@psockali g_enable_watchdog is true by default, we only disable it for unit tests and various experiments. I'd rather reject the 5M+ case altogether (even with watchdog disabled) than make heroic efforts for a kernel which might even fail to compile or do so very slowly -- 5M+ instructions is a lot.

@pressenna
Copy link
Contributor Author

But @asuhan, that is what I am saying. I misread the code. It actually does not create 5M load instructions, that is an upper limit for the elements in the bit-set. Either way, I see if I can get the modification of the function declaration / signature working.

@asuhan
Copy link
Contributor

asuhan commented Jan 23, 2018

@psockali In cases where a bitmap cannot be used (floats for example) it would. There would be 5M+ Analyzer::Constant expressions generating the same number of constants to be loaded from the literal buffer when the IN expression is generated.

@shtilman
Copy link
Contributor

shtilman commented Jan 24, 2018

@psockali modifying row_func signature may get messy. You may want to consider switching to global variables with internal linkage. For each new offset you could create a literal_<offset> global visible to the entire module (that is, if it hasn't been created yet - call getNamedGlobal(name) to check). If the module doesn't have it yet - 1. create it 2. then define it with a load from literals_ in query_func()'s entry block and 3. use it wherever you are in row_func(). If module already has it - skip steps 1. and 2.
Look at the resulting IR. Global optimizer should be able to handle these globals after row_func() is inlined, they should all be localized.
If that works well - no need to open Pandora's box messing with function signatures.

@pressenna
Copy link
Contributor Author

@asuhan: Would it be OK if I disable the float/decimal use-case for the IN operator?
@shtilman: I was hoping to use the function copy helper to change the signature, but I shall try global variables first.

@asuhan
Copy link
Contributor

asuhan commented Jan 28, 2018

@psockali No it wouldn't, but it'd be ok to reject it past a certain threshold (5M is as good of a value as any).

@pressenna
Copy link
Contributor Author

Just noticed a cast from size_t to int16_t when loading values from the literals buffer...

https://github.com/mapd/mapd-core/blob/245f9ffe9e036f88171555c71e63d10827efd9b9/QueryEngine/ConstantIR.cpp#L110

I guess it really does not hold a lot of elements.

@pressenna
Copy link
Contributor Author

@shtilman, i am not sure how to correctly initialize the GlobalVariable, shall I create a correct pointer type and use store/load? This then feels like copying the buffer into a .data segment and then loading the values from that .data segment for each use. Not sure if the optimizer will be able to detect this correctly.

@pressenna
Copy link
Contributor Author

The cast actually limits the buffer to 32K, with the CHECK enforcing this, so there is then already a reasonable upper limit.

@shtilman
Copy link
Contributor

@psockali yes, I think a simple store into a global var should work. You first load the value from the literal buffer at specific offset, then store into corresponding global which is named after that offset - this is in the entry block. Then you can use that literal_<offset> global whenever you need literals_[offset].

Make sure you create each global var in the query module, with internal linkage. After row_func is inlined, the access to each global you had created will be limited to a single non-recursive function. Then the GlobalOptimizer pass (which follows the inliner in the pipeline) should take all these internal globals and localize them, i.e. convert them into local allocas.

@pressenna
Copy link
Contributor Author

@shtilman, I did as you said, but it complains at link time.

LLVM ERROR: Program used external function 'global_literal_0' which could not be resolved!

@pressenna
Copy link
Contributor Author

I have also pushed the changes

@shtilman
Copy link
Contributor

Have you looked at the IR that you end up generating?
Could you run explain on that failing query and post the result here?

@pressenna
Copy link
Contributor Author

Yes, I did. The IR looks OK. The global variable is still present after the optimisation causing the linker to fails, because I am not passing in sufficient linkage information.

The query I used was:

select cast(-1.1 as int) as x from t2;

The generated bitcode is attached: bitcode.zip

@pressenna
Copy link
Contributor Author

@shtilman I have committed now a version that alters the row_funcs signature (or more precise, creates a new row_func_hoisted fuction with the correct signature). This passes all the tests and the optimized IR looks good as well, as far as I can tell.

@shtilman
Copy link
Contributor

All right, was hoping globals would simplify the hoisting, provided globalopt worked well. Not sure why globalopt failed to localize them, there are a few straightforward conditions each global should meet, perhaps globalvar declarations needed a bit of massaging.. Don't have time to investigate, we'll do signature change.

@@ -1047,6 +1047,10 @@ Executor::CompilationResult Executor::compileWorkUnit(const std::vector<InputTab
bind_pos_placeholders("group_buff_idx", false, query_func, cgen_state_->module_);
bind_pos_placeholders("pos_step", false, query_func, cgen_state_->module_);

cgen_state_->query_func_ = query_func;
cgen_state_->query_func_entry_ir_builder_.SetInsertPoint(&query_func->front(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use &query_func->getEntryBlock() instead

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The create_row_func() method already adds some instructions, thus I am using this mean to ensure that literals resolution ends up as the first instructions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. What I meant was replacing query_func->front() with query_func->getEntryBlock().
You could point that irbuilder to the very beginning of the entry block like this:

cgen_state_->query_func_entry_ir_builder_.SetInsertPoint(&query_func->getEntryBlock().front());

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed it accordingly

@@ -100,8 +100,10 @@ std::vector<llvm::Value*> Executor::codegenHoistedConstants(const std::vector<co
const EncodingType enc_type,
const int dict_id) {
CHECK(!constants.empty());

int64_t initial_watermark = cgen_state_->literal_bytes_high_watermark(0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Watermarks seem redundant.

int64_t allocated_watermark = cgen_state_->literal_bytes_high_watermark(0);
std::string literal_name = "literal_" + std::to_string(lit_off);

if (allocated_watermark == initial_watermark) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, watermarks don't seem to be needed. Could just check if defined_literals_ map contains something for this particular lit_off.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was just some form of precaution.
If something was already present in the literals buffer, I expect it to be present in the defined_literals map. I change it to only check against the map.

if (co.hoist_literals_ && !cgen_state_->defined_literals_.empty()) {
// we have some literals...

// create a new row_function with the literals as argument
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need a better comment, something like

// row_func_ is using literals whose defs have been hoisted up to the query_func_,
// extend row_func_ signature to include extra args to pass these literal values. 

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, will do


auto ft = llvm::FunctionType::get(get_int_type(32, cgen_state_->context_), row_process_arg_types, false);
auto row_func_hoisted = llvm::Function::Create(
ft, llvm::Function::ExternalLinkage, "row_func_hoisted", cgen_state_->row_func_->getParent());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe give it a clearer name, e.g. "row_func_hoisted_literals" ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, makes sense

@pressenna pressenna closed this Apr 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants