Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FEAT: Add time context in scope in execution for pandas backend #2306

Merged
merged 35 commits into from
Sep 4, 2020

Conversation

LeeTZ
Copy link
Contributor

@LeeTZ LeeTZ commented Aug 3, 2020

What is the change

This PR purpose a change to the data structure scope in pandas execution. Time context is added to scope and getting, setting scope methods are changed accordingly.

Existing mechanism

  • Scope key is the op associated with each expression.

  • Scope value is a pd.DataFrame or pd.Series that is the result of executing key op.

  • Set scope kv pair: This is done in multiple places:

    • in execute_until_in_scope, new_scope is a combination of scopes of all children(calculating recursively), and scope from pre_execute.

    • in window.py window has a similar log of setting scope kv pair.

    • in selection.py selction is adding additional_scope to scope by remapping columns to data. See compute_projection_scalar_expr, compute_projection_column_expr etc.

      Although this kv is set in multiple places by different code pieces, conceptually, it's just scope[op] = result.

  • Get scope kv pair:

    • This is done in execute_until_in_scope using data = [ new_scope[arg.op()] if hasattr(arg, 'op') else arg for arg in computable_args ] that retrives computed data from scope, and pass on to execute_node implementation.
    • Also, there are shortcuts across execute_util_in_scope that return an op without execution if op is already in scope.
  • Scope miss: If there is a miss in getting scope[op], we will execute op.

Proposed mechanism

  • Scope key is still op associated with each expression.

  • Scope value is another key-value map
    - value: pd.DataFrame or pd.Series that is the result of executing key op, this is same as the original scope value.
    - timecontext: of type TimeContext, the time context associated with the data stored in value

    Note that the idea is: data should be determined by both op and timecontext. So the data structure above is conceptually same as making (op, timecontext) as the key for scope. But that may increase the time complexity in querying, so we make timecontext as another key of op. See following getting and setting logic for details.

  • Set scope kv pair: before setting the value op in scope we need to perform the following check first:

    • Test if op is in scope yet
      • No, then put op in scope, set timecontext to be the current timecontext (None if timecontext is not present), set value to be the DataFrame or Series of the actual data.
      • Yes, then get the timecontext stored in scope for op as old_timecontext, and compare it with current timecontext:
        • If curernt timecontext is same as old_timecontext, do nothing.
        • If current timecontext is a subset of old_timecontext, that means we already cache a larger range of data. Do nothing and we will trim data in later execution process.
        • If current timecontext is a superset of old_timecontext, that means we need to update cache. Set value to be the current data and set timecontext to be the current timecontext for op.
        • If current timecontext is neither a subset nor a superset of old_timcontext, If current time context is neither a subset nor a superset of old_timcontext, but they overlap, or not overlap at all. For example, this will happen when there is a window that looks forward, over a window that looks back. So in this case, we should not trust the data stored either, and go on to execute this node. For simplicity, we update the cache in this case as well.
  • Get scope kv pair: getting will change from scope[op] to get_scope(scope, op, timecontext) in get calls.

  • Scope miss: If there is a miss in getting scope[op], we will execute op to get the result.

Restrictions

  • Note that we have an important restriction here: For op with time context, we only use scope to cache leaf table nodes now. We may assume that if we cache a large time range that is the superset of all other time contexts, we could always use this cache and trim data to their associated time context later. But this assumption doesn't hold true for all cases. This is because some ops cannot use the cached results with a different time context to get the correct result. For example, a groupby followed by count, if we use a larger or smaller from the cache, we will probably get an error in the result. Such ops global aggregation, ops whose result is depending on other rows in result Dataframe, cannot use the cached result with different context to optimize calculation. For simplicity, we skip these ops for now and will address these in followups.

How is this change tested

Unit tests are added in ibis/pandas/tests/test_timecontext.py to test if time context comparison works. Also, concrete examples that use this feature like window in mutate and multi-window tests, are added as well. See inline comments for the explanation of each test case.

@LeeTZ LeeTZ changed the title Add time context in scope in execution for pandas backend FEAT: Add time context in scope in execution for pandas backend Aug 3, 2020
@@ -554,3 +610,113 @@ def canonicalize_context(
f'begin time {begin} must be before or equal' f' to end time {end}'
)
return begin, end


class TimeContextRelation(enum.Enum):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps move this to a timecontext module

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense, will move this to an upper level (ibis level) so that we could reuse this module for other backends

NONOVERLAP = 3


def compare_timecontext(cur_context: TimeContext, old_context: TimeContext):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move to a timecontext module?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, same as above.

if op not in scope:
return None
# for ops without timecontext
if timecontext is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency, we should perhaps use an timecontext place holder for the timecontext is None case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean have an entry of {"value":None, "timecontext":None}?

@@ -161,14 +161,17 @@ def trim_with_timecontext(data, timecontext: Optional[TimeContext]):
if not timecontext:
return data

df = data.reset_index()
df = data.reset_index(level=1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK this is tricky. Here, data is a Series of MultiIndex (int, Timestamp) where int is some index (0,1,2,3 ...) and Timestamp column is our time column. When we filter by time, the index is also filtered so we will have something like:

index  time  value
2       2010-01-02 2
3       2010-01-03 3

As you can see the index may not start from 0. And later if we put this Series back to generate result. There will be NaN and NaT filling in, since pandas will align rows according to index number.
Therefore, all changes in this function is trying to reindex the int inside the Multiindex after filtering, to guarantee the correctness of our result.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I think this needs to be explained in the code, otherwise it's kind of confusing.

Also, is it always the case that time is an index here? What if this is not a moving window, e.g.

win = window(group_by='id')

What's the behavior of this code in that case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or perhaps

win = window(preceding=interval(days=1), group_by='id', order_by='some_other_time_column')

?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or

win = window(preceding=10, group_by='id', order_by='time')

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I think this function is not very clear about what the expected input and output. It would be helpful to clarify.

data,
order_by,
group_by=group_by,
timecontext=new_timecontext,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename new_timecontext to adjusted_timecontext?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

factory=OrderedDict,
)
additional_scope = {}
for t in operand.op().root_tables():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why changing the list comprehension to for loop?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be better ways but let me explain why, scope_item() now return with a dict {op: {"value":xx, "timecontext":xx}}, how can I turn it into (key, value) like (t, source) in a one-linere we had previously?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your code is the same as

toolz.merge(
    set_scope_item(t, source, adjusted_timecontext)
    for t in
    operand.op().root_tables()
)

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

return TimeContextRelation.OVERLAP


def scope_item(op, result, timecontext):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we lift scope related code to its own module so the pyspark backend can also reuse it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, will move this to Ibis level

# in result Dataframe, cannot use cached result with different time
# context to optimize calculation. For simplicity, we skip these ops
# for now.
if isinstance(op, ops.TableColumn) or isinstance(op, ops.TableNode):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a xfail test for projection?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK will do

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per discussion offline. I think we should just enable this for all nodes instead of leaf nodes.

@@ -220,7 +264,7 @@ def execute_with_scope(
**kwargs,
),
**kwargs,
)[op]
)[op]['value']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the value here is for a context that is larger than the one that user requested?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checked again, this seems unlikely to happen because execute_until_in_scope would return with a dict with op as the only key. This is not reading from the scope. Time context should be already taken care of in execution phases in execute_until_in_scope

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think execute_until_in_scope used to return the scope after execution. Is that still the case? If so, shouldn't the code here be:

result = get_scope_item(scope, op, timecontext)

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it makes sense, however, execute_until_in_scope is a recursive call and it always returns the executed result of the expr, instead of the whole scope. Before this change, the return value is just {op: computed}. So I think we don't need to call get method again here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@@ -330,7 +380,7 @@ def execute_until_in_scope(

# pass our computed arguments to this node's execute_node implementation
data = [
new_scope[arg.op()] if hasattr(arg, 'op') else arg
get_scope(new_scope, arg.op()) if hasattr(arg, 'op') else arg
Copy link
Contributor

@icexelloss icexelloss Aug 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think get_scope_item and set_scope_item is better. You basically want a get and set method to get/set item from scope with a op and timecontext

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I will change these signature

Copy link
Contributor

@icexelloss icexelloss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High level looks good. Let's try to decouple timecontext scope and execution modules. I think timecontext and scope module can be reused by other backends as well.

@@ -343,7 +394,7 @@ def execute_until_in_scope(
**kwargs,
)
computed = post_execute_(op, result, timecontext=timecontext)
return {op: computed}
return set_scope_item(op, computed, timecontext)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the first arg to set_scope_item should be scope

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactored and changed arg accrodingly

from ibis.timecontext.util import TimeContextRelation, compare_timecontext


def set_scope_item(op, result, timecontext: Optional[TimeContext]):
Copy link
Contributor

@icexelloss icexelloss Aug 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a "set" method. A set method should change the state of object and returns nothing.

This set method would make more sense to me

def set_scope_item(scope, op, timecontext, value):
      scope[op] = {'value': result, 'timecontext': timecontext}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I first name it as scope_item for the same reason. Now it is refactored and this is addressed.

timecontext,
)

if get_scope_item(scope, op, timecontext) is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why change the order to be after Literal check here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, this is a bug.

@@ -330,7 +381,7 @@ def execute_until_in_scope(

# pass our computed arguments to this node's execute_node implementation
data = [
new_scope[arg.op()] if hasattr(arg, 'op') else arg
get_scope_item(new_scope, arg.op()) if hasattr(arg, 'op') else arg
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing timecontext here when calling get_scope_item?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is the same bug as the previous comment, op was directly executed and not add to op result in a None in getting time context. Will fix it

ibis/common/scope.py Outdated Show resolved Hide resolved
TIME_COL = 'time'


class TimeContextRelation(enum.Enum):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps call this ibis/expr/timecontext.py?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, renamed the file.

from ibis.expr.typing import TimeContext


def adjust_context_asof_join(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably define dispatch methods for extending.

@singledispatch
def adjust_timecontext(op, timecontext):
      raise NotImplementedError()


@adjust_timecontext.register(AsOfJoin):
 def adjust_timecontext_asof_join(op, timecontext):
        return ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order for this to work, I added an extra arg clients in compute_time_context dispatch, we will need to pass clients so that in these dispatch function at expr level, we could run execute on op args.

ibis/timecontext/util.py Outdated Show resolved Hide resolved

Returns
-------
result : Enum[TimeContextRelation]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the type for this is just TimeContextRelation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, fixed.

Copy link
Contributor

@icexelloss icexelloss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed another around. Left some comments.

@LeeTZ LeeTZ reopened this Aug 27, 2020

`scope` in Scope class is the main cache. It is a dictionary mapping
ibis node instances to concrete data, and the time context associate
with it(if any).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you update here

ibis/expr/scope.py Outdated Show resolved Hide resolved
ibis/expr/scope.py Show resolved Hide resolved
ibis/expr/scope.py Show resolved Hide resolved
ibis/expr/scope.py Outdated Show resolved Hide resolved
ibis/expr/scope.py Show resolved Hide resolved
ibis/expr/scope.py Outdated Show resolved Hide resolved
-------
Scope: a new Scope instance with op in it.
"""
item = namedtuple('item', ['value', 'timecontext'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be ScopeItem and created in the module (the definition)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at the top of this module, do
ScopeItem= namedtuple(....)

ibis/expr/timecontext.py Show resolved Hide resolved
ibis/expr/timecontext.py Show resolved Hide resolved
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks really good. just a small change in the impl of Scope and should be good

-------
Scope: a new Scope instance with op in it.
"""
item = namedtuple('item', ['value', 'timecontext'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at the top of this module, do
ScopeItem= namedtuple(....)

ibis/expr/scope.py Outdated Show resolved Hide resolved
ibis/expr/scope.py Outdated Show resolved Hide resolved
ibis/expr/scope.py Show resolved Hide resolved

class Scope:
def __init__(self, items: Dict[str, ScopeItem] = None):
self.items = items or {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

save this as self._items, then you can make a .items() method which return iter(self._items) and then Scope becomes more like a dictionary

ibis/expr/scope.py Show resolved Hide resolved
ibis/expr/scope.py Show resolved Hide resolved
a new Scope instance with items in two scope merged.
"""
result = Scope()
for op, v in self._get_items().items():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you can fix this as above

@LeeTZ LeeTZ force-pushed the tiezheng/add_time_context_in_scope branch from d706fd1 to f9e2b46 Compare September 4, 2020 16:03
@LeeTZ LeeTZ force-pushed the tiezheng/add_time_context_in_scope branch from f9e2b46 to 8b83630 Compare September 4, 2020 16:11
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one small additional comment, then lgtm. ping on green.

"""
result = Scope()
for op in self.items():
result._items[op] = self._items[op]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on Scope I would define __getitem__(self, op) and __setitem__(self, op) as public methods (and then use them here)

def __getitem__(self, op):
   return self._items[op]
def __setitem__(self, op, value):
   self._items[op] = value

result = Scope()

for op in self.items():
result.__setitem__(op, self.__getitem__(op))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use

result[op] = self[op]

"""
result = Scope()
for op in self.items():
result.__setitem__(op, self.__getitem__(op))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

result[op] = self[op]

def __getitem__(self, op):
return self._items[op]

def __setitem__(self, op, value):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should return None, type value as Any, op is ScopeItem?

def items(self):
return iter(self._items)

def __getitem__(self, op):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

op is ScopeItem? returns -> Any

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the diff between __getitem__ and get?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

__getitem__ returns a ScopeItem like {'value': 'foo', 'timecontext': None}
get is an api to return the value, which is foo in the case above.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. But be careful this is exposing implementation details

ibis/expr/scope.py Outdated Show resolved Hide resolved
Copy link
Contributor

@icexelloss icexelloss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping on green

Copy link
Contributor

@icexelloss icexelloss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jreback jreback added this to the Next Feature Release milestone Sep 4, 2020
@jreback
Copy link
Contributor

jreback commented Sep 4, 2020

@LeeTZ if you can add a release note. pls ping when green.

@LeeTZ
Copy link
Contributor Author

LeeTZ commented Sep 4, 2020

Ping @jreback for green. Thank you all for reviewing!

@jreback jreback merged commit 816f521 into ibis-project:master Sep 4, 2020
@jreback
Copy link
Contributor

jreback commented Sep 4, 2020

awesome @LeeTZ very nice!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
expressions Issues or PRs related to the expression API pandas The pandas backend
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants