BUG: Fix scope get to use hashmap lookup instead of list lookup #2386

Merged: 9 commits, Sep 17, 2020

Conversation

LeeTZ
Contributor

@LeeTZ LeeTZ commented Sep 16, 2020

Overview

This PR fixes a bug in looking up an op in a scope. The current implementation looks items up with a list-style scan over iter(_items). This is slower than a hashmap lookup, and it can give wrong results for objects whose equality is ill-defined.

Example

>>> a = ibis.interval(days=1).op()
>>> b = ibis.interval(hours=1).op()
>>> a
Literal(1, Interval*value_type=int8, unit='D', nullable=True)
>>> a == b
True

In this case, equality for Literal is ill-defined. If only a is in scope, b in iter(scope._items) will be True, while b in scope._items returns False.
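The mismatch between the two lookups can be reproduced without Ibis. The following is a minimal sketch, assuming a hypothetical FakeLiteral class (not part of Ibis) whose __eq__ ignores a field that __hash__ includes:

```python
class FakeLiteral:
    """Hypothetical stand-in for an op whose equality is ill-defined."""

    def __init__(self, value, unit):
        self.value = value
        self.unit = unit

    def __eq__(self, other):
        # Ill-defined: compares only the value and ignores the unit.
        return isinstance(other, FakeLiteral) and self.value == other.value

    def __hash__(self):
        # The hash, unlike __eq__, takes the unit into account.
        return hash((self.value, self.unit))


a = FakeLiteral(1, "D")
b = FakeLiteral(1, "h")
items = {a: "cached result for a"}

list_lookup = b in iter(items)  # linear scan, relies on == only
hash_lookup = b in items        # hash lookup, compares hashes before ==

print(a == b)       # True
print(list_lookup)  # True
print(hash_lookup)  # False
```

The scan reports a hit for an op that was never cached, while the hash lookup does not.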

Tests

This is tested in ibis/pandas/tests/test_core.py with a new test case, test_scope_look_up.

@icexelloss
Contributor

I found this class a little confusing. It looks like we are emulating the dict interface by implementing __setitem__ and __getitem__ in addition to get_value and merge_scope, but:

  • It doesn't implement other dict methods, such as __contains__.
  • Scope.items() returns the keys in the scope (the ops), but dict.items() returns the dict entries ((k, v) pairs). Why make them different if we are emulating the dict interface?
  • It's not clear to me whether this class is meant to be mutable or immutable. On the one hand, merge_scope does not seem to mutate the existing scope, but __setitem__ indicates the class is mutable.

I think clarifying these points would make it clearer how to use this class.

@LeeTZ
Contributor Author

LeeTZ commented Sep 16, 2020

Thanks, @icexelloss.
The Scope class was refactored from a dict, so at first glance it makes sense to implement some of the dict interface. However, from the perspective of a user of this class, the key is important but the stored value is not. The stored value is a data structure we made up to wrap the time context and the result, so we need an API to get the real value of the cached result, but not necessarily an API to get {"value": v, "timecontext": t}.
As for set, most scope updates in Ibis merge scopes generated by different functions. This class tries to preserve that behavior, so merge is really the set method for this class.

Given these get and set use cases, I would say this class works without __getitem__, __setitem__, or any other dict-like interface. Dict methods are helpful for implementing some internal methods of this class, but they add ambiguity about how to use it.

@jreback, would you mind sharing your thoughts on @icexelloss's questions about __contains__ and items()?

@LeeTZ
Contributor Author

LeeTZ commented Sep 16, 2020

I refactored a little bit.

  • Removed __getitem__. As I commented above, we don't care about the object associated with an op; it is an implementation detail of this class and should be hidden from users. get_value is exposed to get the cached result for a given op.
  • Removed __setitem__. We want to make this class immutable and always use merge to add new items to a Scope. This is how scopes were used everywhere in Ibis (by calling toolz.merge, before the time context refactor).
  • Added __contains__. We need a way to test whether an op is in a Scope. This replaces op in scope._items.keys() or scope.get_value(op).
  • This class doesn't fully emulate the dict interface since we have made it immutable, so not all dict methods are implemented.

To make this class clearer, this is the exposed API of the Scope class:

  1. Create a Scope: use Scope() to create an empty Scope, or call make_scope with an op, a result, and a time context to create a new Scope with one item in it.
  2. Get from a Scope: call get_value to get the cached result for a given op.
  3. Set in a Scope: always use merge_scope or merge_scopes to set values.
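The three operations listed above can be sketched as follows. This is a simplified, hypothetical version for illustration only, not the code in ibis/expr/scope.py: it leaves out time context handling entirely, so make_scope here takes only an op and a result, and plain strings stand in for ops.

```python
from typing import Any, Dict, Iterator, Optional


class Scope:
    """Simplified, hypothetical sketch; time context handling is omitted."""

    def __init__(self, items: Optional[Dict[Any, Any]] = None) -> None:
        # Copy so that later changes to the caller's dict cannot leak in.
        self._items = dict(items) if items else {}

    def __contains__(self, op: Any) -> bool:
        # Hash lookup on the dict itself, not a scan over its keys.
        return op in self._items

    def __iter__(self) -> Iterator[Any]:
        return iter(self._items.keys())

    def get_value(self, op: Any) -> Any:
        # Returns the cached result, or None when op is not in scope.
        return self._items.get(op)

    def merge_scope(self, other: "Scope", overwrite: bool = False) -> "Scope":
        # Build a new Scope rather than mutating either input.
        merged = dict(self._items)
        for op in other:
            if overwrite or op not in merged:
                merged[op] = other._items[op]
        return Scope(merged)


def make_scope(op: Any, result: Any) -> Scope:
    # The real function also takes a time context; it is left out here.
    return Scope({op: result})


empty = Scope()
s1 = make_scope("op_a", 1)
s2 = make_scope("op_b", 2)
merged = empty.merge_scope(s1).merge_scope(s2)

print("op_a" in merged)          # True
print(merged.get_value("op_b"))  # 2
print("op_b" in s1)              # False, s1 was not mutated by the merge
```

Because merge_scope always returns a new object, callers never observe a Scope changing underneath them, which is the immutability the list above asks for.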

@@ -122,17 +116,17 @@ def merge_scope(self, other_scope: 'Scope', overwrite=False) -> 'Scope':
         """
         result = Scope()

-        for op in self.items():
-            result[op] = self[op]
+        for op in self._items.keys():
Contributor

@icexelloss icexelloss Sep 16, 2020

Is this basically what you want?
https://stackoverflow.com/questions/6354436/python-dictionary-merge-by-updating-but-not-overwriting-if-value-exists

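# Requires `import itertools` at module level.
# Later pairs win in dict(), so the scope listed last takes priority.
# Assumes Scope can be constructed from a dict of items.
if overwrite:
    new_items = dict(itertools.chain(self._items.items(), other_scope._items.items()))
else:
    new_items = dict(itertools.chain(other_scope._items.items(), self._items.items()))

return Scope(new_items)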
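The ordering trick in the suggestion above can be checked with plain dicts. This is a small sketch with made-up keys and values; it only illustrates that the last pair seen for a key wins when building a dict from chained items.

```python
import itertools

mine = {"a": 1, "b": 2}
theirs = {"b": 20, "c": 30}

# Later pairs win, so putting `theirs` last lets it overwrite `mine`.
overwritten = dict(itertools.chain(mine.items(), theirs.items()))

# Putting `mine` last keeps the existing values for shared keys.
preserved = dict(itertools.chain(theirs.items(), mine.items()))

print(overwritten)  # {'a': 1, 'b': 20, 'c': 30}
print(preserved)    # {'b': 2, 'c': 30, 'a': 1}
```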

Contributor Author

This is nice, but we cannot update this simply. We also need to take the time context into account; that logic is covered by the check result.get_value(op, v.timecontext) is None.

Contributor

I see. Ok what you have is fine.

Contributor

why don't we define

def __iter__(self):
    return iter(self._items.keys())

then

for op in self:
   ...

will just work

or is this more confusing?

Contributor Author

This is fine, scope is indeed iterable, I will do that
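A minimal sketch of the suggestion, using a hypothetical OpCache class in place of Scope and strings in place of ops:

```python
class OpCache:
    """Hypothetical container that keeps its entries in a private dict."""

    def __init__(self, items):
        self._items = dict(items)

    def __iter__(self):
        # Iterating the container yields its keys, like a dict does.
        return iter(self._items.keys())


cache = OpCache({"op_a": 1, "op_b": 2})
ops = [op for op in cache]
print(ops)  # ['op_a', 'op_b']
```

One thing to keep in mind: when a class defines __iter__ but not __contains__, Python evaluates `x in obj` by iterating and comparing with ==, which is the linear scan this PR is trying to avoid. Defining __contains__ on top of the dict keeps membership tests as hash lookups.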

Contributor

@icexelloss icexelloss left a comment

Left some comments

@icexelloss
Contributor

icexelloss commented Sep 16, 2020

I think it makes sense. Emulating the dict interface is nice but not necessary, IMO.

-    def __setitem__(self, op: Node, value: Any) -> None:
-        self._items[op] = value
+    def __contains__(self, op):
+        return op in self._items.keys()
Contributor

@icexelloss icexelloss Sep 16, 2020

why not just

return op in self._items

?

Contributor

Oh I see Jeff's comments. Ok this is fine.

Contributor

hmm, no i agree this should be in self._items, only iteration should use the keys

Contributor

For my knowledge, is k in self._items.keys() O(n) or O(1)? The type of self._items.keys() is dict_keys, so it's not clear to me whether the __contains__ method of dict_keys is linear or constant.

Contributor Author

I looked around and it seems that in Python 3.x, k in self._items and k in self._items.keys() are equivalent, and both are O(1).

Contributor Author

I tested with a Python 3.6 kernel

>>> dic = dict.fromkeys(range(10**5))
>>> %timeit 10000 in dic
35.8 ns ± 0.757 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>> %timeit 10000 in dic.keys()
74.7 ns ± 0.385 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

It seems keys() is still about 2x slower than testing membership on the dict directly.
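The %timeit magic only works inside IPython or Jupyter. For anyone who wants to repeat the comparison in a plain Python script, here is a rough equivalent using the standard timeit module. The iteration count is an arbitrary choice, and absolute numbers will vary by machine and Python version.

```python
import timeit

dic = dict.fromkeys(range(10**5))

# Both forms are hash lookups; .keys() adds a method call and a view object.
assert (10000 in dic) and (10000 in dic.keys())

runs = 200_000
direct = timeit.timeit("10000 in dic", globals={"dic": dic}, number=runs)
via_keys = timeit.timeit("10000 in dic.keys()", globals={"dic": dic}, number=runs)

print(f"in dic:        {direct / runs * 1e9:.1f} ns per lookup")
print(f"in dic.keys(): {via_keys / runs * 1e9:.1f} ns per lookup")
```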

Contributor

@icexelloss icexelloss left a comment

LGTM

@LeeTZ LeeTZ closed this Sep 17, 2020
@LeeTZ LeeTZ reopened this Sep 17, 2020
@jreback jreback added the window functions label Sep 17, 2020
@jreback jreback added this to the Next Bugfix Release milestone Sep 17, 2020
@icexelloss icexelloss added the pandas and expressions labels and removed the window functions and expressions labels Sep 17, 2020
             result._items[op] = self._items[op]

-        for op in other_scope._items.keys():
+        for op in other_scope._items:
Contributor

Should this be

for op in other_scope

?

Contributor Author

It is, thanks

Contributor

@icexelloss icexelloss left a comment

Small comments

Contributor

@jreback jreback left a comment

Can you add this PR number to the original scope update in the release notes?

Contributor

@icexelloss icexelloss left a comment

+1

Contributor

@jreback jreback left a comment

lgtm. let's merge on green.

@LeeTZ
Contributor Author

LeeTZ commented Sep 17, 2020

Thank you all for reviewing! Green now @jreback .

@jreback jreback merged commit 899804c into ibis-project:master Sep 17, 2020
Labels
pandas The pandas backend