Improve hoisted literals code generation #2

asuhan · 2017-05-08T20:48:55Z

Currently, we generate memory loads (from the literal buffer) when the literal node is visited and rely on loop invariant code motion pass to hoist these loads outside of the query loop. This is a heavy-handed use of LICM and might sometimes not be optimized. We should collect the constant expression with a visitor and generate the loads in the entry block.

pressenna · 2017-12-17T19:57:34Z

Hi, can I tackle this one?

billmaimone · 2017-12-18T01:06:07Z

Yes, good one for you.

asuhan · 2017-12-18T01:23:15Z

@psockali You can use ScalarExprVisitor to build a constant-collecting visitor. There's a pitfall though: we don't know all constants upfront when sub-queries are involved. Have a look at RelAlgTranslator::translateInOper. We can use a hybrid approach: collect and generate all the other constants upfront and leave the current handling in place (and keep LICM). Alternatively, we can reorganize the compiler to do a full pass first without emitting any IR, which we know we'll eventually have to do for other reasons as well. Going for the first approach won't hurt our ability to achieve the second and the code should be fully reusable once we have the full solution in place, so I recommend going for it as an incremental, nearly-complete solution.

pressenna · 2017-12-18T12:32:30Z

Hi @asuhan , I will do so.
Am I correct in assuming that only runtime constants (within the scope of a block) shall be hoisted, but literals can remain as they are?

pressenna · 2017-12-19T17:26:44Z

Hey @asuhan

I think I need to be much more clearer when addressing things in mapd. (e.g. literal in LLVM IR, literal in REX, literal in Analyzer::Exp, etc...).

Finally managed to look at this in more detail and it turns out that Analyzer::Constant literals are hoisted in Executor::codegenHoistedConstants into a memory buffer. And each unoptimized access to the literal requires a memory read from that buffer. As you have pointed out RelAlgTranslator::translateInOper is creating Analyzer::Constant dynamically, so scanning early on, right after translating from REX, is not very promising.

Currently I would propose to scan for Analyzer::Constant via an ScalarExprVisitor before the actual LLVM IR generation (somewhere in Executor::compileWorkUnit) and also extend the Constant class to hold a few more attributes required for this extra step.

I would then generate the memory load for each literal in @query_group_by_template and pass them as arguments to the always inlined @row_func , which also needs to be extended to take these new parameters.

I would also make this, in effect, a configurable behaviour and would probably introduce a new compiler option flag.

I hope to start tomorrow on it, if there are no objections to the above approach.

shtilman · 2017-12-21T04:04:28Z

Hi @psockali, your approach sounds reasonable. Add a constant scanning visitor, generate corresponding literal loads in the first block of the query func and feed these values to the always_inlined @row_func. Not sure if any Constant extensions are necessary but let's see if your motivation can be justified. No need to make this behavior configurable - we're just putting some literal loads where they should have been generated from the get go, instead of relying on LICM to move them there.

You mentioned @query_group_by_template, note that it's just one of the templates that the query func can be built from, the other one is @query_template.

pressenna · 2017-12-21T21:25:20Z

Hey @shtilman, as usual: theory does not reflect reality.

It turns out that during LLVM IR generation Analyzer::Constant nodes are created on the fly and sometimes the original Constant node is translated.
See Executor::codegenRowId, Executor::codegenCmpDecimalConst and Executor::codegenOuterJoinNullPlaceholder.

This invalidates the proposed approach, because scanning the source Analyzer::Exp tree will not be reliable nor sufficient.

I have to rethink .. please bear with me.

pressenna · 2017-12-21T22:51:33Z

Hey @shtilman,

@asuhan mentioned two approaches:

Hybrid:
Currently this is the approach I am going for.
Full Pass without IR generation:
I am not confident enough at this stage to make such a central change to the compiler as I am not sure what state(s) the compiler currently maintains and how I would reset it correctly.

I think there is an alternative way, but it would require to clone the row_function (in order to correct the function type with the hoisted variable).

Either way, let me first try the hybrid approach, it probably will be complicated enough.

pressenna · 2017-12-28T11:00:18Z

Hi @shtilman / @asuhan
I committed an initial version of the hybrid solution. the PR is #145 .
Please let me know if this approach is OK for now as I fear that it does not capture all the cases.

shtilman · 2018-01-22T14:12:02Z

Comments and discussion are in #145 PR.

…ay (#2) Correctly merge separate arrow arrays offets to one chunk offset array

* Register local / global hint in Calcite * Support g_ prefix for global query hint name * Translate global hint in analyzer * Add tests * Apply comments #1: global hint registration * Apply comments #2: global hint flag identification * Apply comments #3: global hint translation * Apply comments #4: remove unnecessary virtual keyword * Fixup a bug on allow_gpu_hashtable build hint for overlaps join * Fixup a bug related to a query having multiple identical subqueries * Add global hint tests related to overlaps join hashtable * Apply comments #5: misc cleanup

…tions * Pass work_unit to deliver the body node to extract hash table cache key * Separate partition sorting and actual computing * Pass cache_key to window context object * Disallow hash table cache access for invalid join type * Support hash table cache for window context * Support sorted partitoon cache for window context * Address comment #1: advocate std::memcpy for copying sorted partition * Address comment #2: move common logic used in both perfect/baseline join hash table to HashJoin.h * Address comment #3: improve payload_copy logic Co-authored-by: yoonminnam <yoon-min.nam@heavy.ai>

* Add necessary method to propagate update query info in the projection node * Enhance related logic to properly handle update query with window function expression * Add tests * Enable window func update query on non-fragmented table in CtasUpdateTest * Misc cleanup * Disallow the update query having window func in dist mode * Address comment #1: change test queries to get deterministic query results * Address comment #2: revert const qualifier Co-authored-by: yoonminnam <yoon-min.nam@heavy.ai>

* Support ms / us / ns time units for INTERVAL keyword * Add few constants related to small time units and a new ExtractField to represent invalid status * Trasnlate window frame bound expression for range mode with date / time / timestamp types * Fixup null value generator for date / time / timestamp types for codegen * Implement fixed length date value encoder * Implement computation logic of search range of aggregate tree for timeinterval type ordering col * Cast timestamp col var iff it is encoded type * Support codegen * Fix issue #1 : compute size of aggregate tree properly * Fix issue #2 : finding null range from sorted column * Fix issue #3 : handling edge case while computing window aggregation over the frame * Add tests * Disable range mode with date type ordering column in dist mode * Address comments v1

… considering `max_gpu_slab_size` (#7133) * Cleaup perfect join hash table init * Cleanup baseline join hashtable init * Add rowid size getter * Introduce hash table entry info class in perfect join hash table * Introduce hash table entry info class in baseline join hash table * Introduce hash table entry info class in overlaps and range join hash tables * Fixup init log related to hash table layout * Fixup hash table init on an empty table * Control the maximum hash table by # hash entries for CPU and its size for GPU * Address comment #1: revert partial_err checking logic in baseline join hash table builder * Address comment #2 Signed-off-by: jack <jack@omnisci.com>

…ion (#7185) * Add runtime function to create null StringView * Support projection of none encoded string column after left join * Address comment #1: create null StringView obj * Address comment #2: avoid compilation error * Address comments #3: improve codegen logic Signed-off-by: Misiu Godfrey <misiu.godfrey@kraken.mapd.com>

* Renaming OverlapsJoinHashTable class to BoundingBoxIntersectJoinHashTable * Rename global flag g_enable_overlaps_hashjoin to g_enable_bbox_intersect_hashjoin * Rename overlaps hash join tuner * Rename kOVERLAPS SQL_OP * Fixup serialization type toString * Rename convertOverlaps * Rename cache types * Rename is_overlaps_oper() function * Rename query rewriter logic * Change server configurations v2 * Change namings in query rewriter * Completely remove overlaps keyword used in codebase * Address comments #1 * Address comment #2 Signed-off-by: Misiu Godfrey <misiu.godfrey@kraken.mapd.com>

* Fixup the function's logic * Address comments * Address comments #2 Signed-off-by: Misiu Godfrey <misiu.godfrey@kraken.mapd.com>

* Determine the size of projection buffer based on filtered per-device cardinality when applicable * Address comments * Add logging about mem alloc * Add loggig about cardinality estimation * Address comments #1 * Address comments #2 * Disable dist test; cannot access executor instance from QR * Add test case for sharded table (in both single and dist) Signed-off-by: Misiu Godfrey <misiu.godfrey@kraken.mapd.com>

* Rename device bitmap ptr variable and its getter * Rename count distinct bitmap host mem ptr getter * Rename count_distinct device mem ptr variable * Improve buf allocation logic for count distinct * Cleanup code & improve logic for mode and tdigest * Cleanup logic v2 * Address comments * Address comments #2 Signed-off-by: Misiu Godfrey <misiu.godfrey@kraken.mapd.com>

* Fixup the function's logic * Address comments * Address comments #2 Signed-off-by: Misiu Godfrey <misiu.godfrey@kraken.mapd.com>

* Determine the size of projection buffer based on filtered per-device cardinality when applicable * Address comments * Add logging about mem alloc * Add loggig about cardinality estimation * Address comments #1 * Address comments #2 * Disable dist test; cannot access executor instance from QR * Add test case for sharded table (in both single and dist) Signed-off-by: Misiu Godfrey <misiu.godfrey@kraken.mapd.com>

* Rename device bitmap ptr variable and its getter * Rename count distinct bitmap host mem ptr getter * Rename count_distinct device mem ptr variable * Improve buf allocation logic for count distinct * Cleanup code & improve logic for mode and tdigest * Cleanup logic v2 * Address comments * Address comments #2 Signed-off-by: Misiu Godfrey <misiu.godfrey@kraken.mapd.com>

* Fixup the function's logic * Address comments * Address comments #2 Signed-off-by: Misiu Godfrey <misiu.godfrey@kraken.mapd.com>

* Determine the size of projection buffer based on filtered per-device cardinality when applicable * Address comments * Add logging about mem alloc * Add loggig about cardinality estimation * Address comments #1 * Address comments #2 * Disable dist test; cannot access executor instance from QR * Add test case for sharded table (in both single and dist) Signed-off-by: Misiu Godfrey <misiu.godfrey@kraken.mapd.com>

* Add buffer holders for GPU execution * Rename structures used to codegen * Introduce WindowFunctionCtx namespace * Add preparation for GPU execution in window ctx * Cleanup & improve WindowFunctionContext::compute() * Improve a logic to build aggregate tree w/ supporting reusing * Improve segment tree constructor * Rebase * Address comments #1 * Address comments #2: refactor bool param functions * Address comments #3 * Address comments #4: tbb * Address comments #5 * Address comments #6 * Fixup test failures Signed-off-by: Misiu Godfrey <misiu.godfrey@kraken.mapd.com>

asuhan self-assigned this May 8, 2017

asuhan changed the title ~~Improve literal hoisting~~ Improve hoisted literals code generation May 8, 2017

asuhan added enhancement lowpri labels May 8, 2017

asuhan removed their assignment May 8, 2017

njmarshall mentioned this issue Jun 13, 2017

NUMERIC/DECIMAL datatype has flaw. #28

Open

This was referenced Sep 27, 2017

mapd_server crashed when queries and inserts proceeded concurrently #82

Closed

LOAD queries starve SELECT queries (and mapdql) when number of total concurrent queries is GTE than number of cpu cores #95

Closed

billmaimone added the starter label Dec 14, 2017

pressenna mentioned this issue Dec 19, 2017

Implement ROUND function #73

Open

gshimansky mentioned this issue Nov 18, 2019

run_benchmark_import.py script improved to allow importing multiple data files and specify fragment size when table is created #449

Merged

alexbaden pushed a commit that referenced this issue Jan 2, 2020

Correctly merge separate arrow arrays offsets to one chunk offset arr…

27f2bd4

…ay (#2) Correctly merge separate arrow arrays offets to one chunk offset array

This was referenced Mar 22, 2021

sigsegv in query_template #634

Open

determinism and SELECT DISTINCT #636

Open

yoonminnam mentioned this issue Jun 10, 2021

Add initial sub-fragments support. #643

Closed

misiugodfrey pushed a commit that referenced this issue Sep 26, 2023

Fixup Executor::getDeviceTypeForTargets function's logic (#7448)

4ced0a5

* Fixup the function's logic * Address comments * Address comments #2 Signed-off-by: Misiu Godfrey <misiu.godfrey@kraken.mapd.com>

misiugodfrey pushed a commit that referenced this issue Sep 28, 2023

Fixup Executor::getDeviceTypeForTargets function's logic (#7448)

e03ce22

* Fixup the function's logic * Address comments * Address comments #2 Signed-off-by: Misiu Godfrey <misiu.godfrey@kraken.mapd.com>

misiugodfrey pushed a commit that referenced this issue Oct 12, 2023

Fixup Executor::getDeviceTypeForTargets function's logic (#7448)

73aaa69

* Fixup the function's logic * Address comments * Address comments #2 Signed-off-by: Misiu Godfrey <misiu.godfrey@kraken.mapd.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve hoisted literals code generation #2

Improve hoisted literals code generation #2

asuhan commented May 8, 2017

pressenna commented Dec 17, 2017

billmaimone commented Dec 18, 2017

asuhan commented Dec 18, 2017

pressenna commented Dec 18, 2017

pressenna commented Dec 19, 2017

shtilman commented Dec 21, 2017

pressenna commented Dec 21, 2017

pressenna commented Dec 21, 2017

pressenna commented Dec 28, 2017

shtilman commented Jan 22, 2018 •

edited

Loading

Improve hoisted literals code generation #2

Improve hoisted literals code generation #2

Comments

asuhan commented May 8, 2017

pressenna commented Dec 17, 2017

billmaimone commented Dec 18, 2017

asuhan commented Dec 18, 2017

pressenna commented Dec 18, 2017

pressenna commented Dec 19, 2017

shtilman commented Dec 21, 2017

pressenna commented Dec 21, 2017

pressenna commented Dec 21, 2017

pressenna commented Dec 28, 2017

shtilman commented Jan 22, 2018 • edited Loading

shtilman commented Jan 22, 2018 •

edited

Loading