New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: _dir_additions returns also the first level of a MultiIndex #16326

Merged
merged 7 commits into from Dec 11, 2017

Conversation

Projects
None yet
2 participants
@BibMartin
Contributor

BibMartin commented May 11, 2017

  • tests added / passed
  • passes git diff upstream/master --name-only -- '*.py' | flake8 --diff
  • whatsnew entry

Assuming a DataFrame with MultiIndex columns like this:

>> df
    foo   bar      
          foo   bar
0  0.32  0.15  0.45
1  0.62  0.73  0.36
2  0.21  0.68  0.10
3  0.05  0.36  0.90

one can access to df['foo'] and df['bar'] with the shortcuts df.foo and df.bar,
but one don't benefit from autocompletion because 'foo' and 'bar' are not listed in dir(df).

>> 'bar' in dir(df)
False

This PR extends df._dir_additions so that the first level of a MultiIndex is listed in dir(df).

>> 'bar' in dir(df)
True
@codecov

This comment has been minimized.

codecov bot commented May 11, 2017

Codecov Report

Merging #16326 into master will increase coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16326      +/-   ##
==========================================
+ Coverage   90.37%   90.37%   +<.01%     
==========================================
  Files         161      161              
  Lines       50863    50863              
==========================================
+ Hits        45966    45969       +3     
+ Misses       4897     4894       -3
Flag Coverage Δ
#multiple 88.16% <ø> (+0.02%) ⬆️
#single 40.33% <ø> (-0.11%) ⬇️
Impacted Files Coverage Δ
pandas/core/generic.py 91.94% <ø> (ø) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.59% <0%> (-0.1%) ⬇️
pandas/plotting/_converter.py 65.05% <0%> (+1.81%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0607e03...456a932. Read the comment docs.

@codecov

This comment has been minimized.

codecov bot commented May 11, 2017

Codecov Report

❗️ No coverage uploaded for pull request base (master@e909ea0). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #16326   +/-   ##
=========================================
  Coverage          ?   91.56%           
=========================================
  Files             ?      153           
  Lines             ?    51272           
  Branches          ?        0           
=========================================
  Hits              ?    46947           
  Misses            ?     4325           
  Partials          ?        0
Flag Coverage Δ
#multiple 89.42% <100%> (?)
#single 40.68% <0%> (?)
Impacted Files Coverage Δ
pandas/core/generic.py 95.9% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e909ea0...4ee5b9f. Read the comment docs.

@jreback

This comment has been minimized.

Contributor

jreback commented May 11, 2017

I am +0 on this. This add some needless complexity, though from a user POV might be nice. Would need some additional tests to consider.

@BibMartin

This comment has been minimized.

Contributor

BibMartin commented May 11, 2017

@jreback

Would need some additional tests to consider.

Sure. I plan to put tests if the principle is accepted. Any idea is welcome.
Btw, I've not seen whether there are tests about dir behavior.

@jreback

This comment has been minimized.

Contributor

jreback commented May 11, 2017

Btw, I've not seen whether there are tests about dir behavior.

bash-3.2$ grep -R 'dir(' pandas/tests/

@BibMartin

This comment has been minimized.

Contributor

BibMartin commented May 13, 2017

Added test_tab_completion for DataFrames, inpired from Series.

@jreback

This comment has been minimized.

Contributor

jreback commented Aug 1, 2017

@jorisvandenbossche thoughts here

@jreback

This comment has been minimized.

Contributor

jreback commented Sep 23, 2017

this is ok in principle. can you rebase / update

@jreback

This comment has been minimized.

Contributor

jreback commented Nov 12, 2017

can you rebase

@BibMartin

This comment has been minimized.

Contributor

BibMartin commented Nov 29, 2017

Sorry, I've been long to rebase.

Do I need to add a whatsneww entry ? In which version file ?

@@ -196,7 +196,11 @@ def __unicode__(self):
def _dir_additions(self):
""" add the string-like attributes from the info_axis """
additions = set([c for c in self._info_axis
if isinstance(c, string_types) and isidentifier(c)])
if isinstance(c, string_types) and isidentifier(c)] +
[c[0] for c in self._info_axis

This comment has been minimized.

@jreback

jreback Nov 29, 2017

Contributor

I would rather both of these to (private) methods on Index and overriden on a MultiIndex; this become much simpler

e.g.

def _to_identifiers(self):
    return [c for c in self if isinstance(c, string_types) and isidentifier(c)]

with some doc-strings & tests

then this method becomes really simple.

This comment has been minimized.

@BibMartin

BibMartin Nov 30, 2017

Contributor

Now that I understand the code better, I realize that another simple implementation could be:

additions = set([c for c in self._info_axis.get_level_values(0)
                 if isinstance(c, string_types) and isidentifier(c)])

Please tell me which method would be best.

This comment has been minimized.

@jreback

jreback Dec 1, 2017

Contributor

yes this looks reasonable. pls add a comment on what is going one.

@BibMartin

This comment has been minimized.

Contributor

BibMartin commented Dec 1, 2017

Changes are done, and rebased. Btw, I've added a test on Series auto-completion.

@jreback

can you add a note in 0.22, other enhancements.

@@ -234,6 +234,39 @@ def test_tab_completion(self):
assert 'str' not in dir(s)
assert 'dt' in dir(s) # as it is a datetime categorical
def test_index_tab_completion(self):
# dir contains string-like values of the Index.

This comment has been minimized.

@jreback

jreback Dec 1, 2017

Contributor

you can parametrize this with index

assert (not isinstance(x, string_types) or
not isidentifier(x) or x in dir_s)
# dir contains string-like values of the MultiIndex first level.

This comment has been minimized.

@jreback

jreback Dec 1, 2017

Contributor

you can add these in the parameterization above

@jreback

This comment has been minimized.

Contributor

jreback commented Dec 1, 2017

one additional thing I think we need. If you have a very large index, _dir_additions actually takes quite a bit of time (this is not exclusive of this change).

So what I would do is if the index is say < 100, use the currently _dir_addition, otherwise return an empty list! (its essentially too big to use tab completion for anyhow). can you make this change and add an asv for this (could be a separate PR as well)

@jreback

This comment has been minimized.

Contributor

jreback commented Dec 1, 2017

xref #18587

@jreback

This comment has been minimized.

Contributor

jreback commented Dec 1, 2017

you can actually use .unique(level=0) to make this efficient

@BibMartin

This comment has been minimized.

Contributor

BibMartin commented Dec 5, 2017

I think that it's ready to go ; please tell me if I shall squash the commits.

@@ -77,6 +77,8 @@ Other Enhancements
- :func:`Series.fillna` now accepts a Series or a dict as a ``value`` for a categorical dtype (:issue:`17033`)
- :func:`pandas.read_clipboard` updated to use qtpy, falling back to PyQt5 and then PyQt4, adding compatibility with Python3 and multiple python-qt bindings (:issue:`17722`)
- Improved wording of ``ValueError`` raised in :func:`read_csv` when the ``usecols`` argument cannot match all columns. (:issue:`17301`)
- :func:`NDFrame._dir_additions` (tab completion) also returns identifiers in the first level of a :func:`MultiIndex`. (:issue:`16326`)
- :func:`NDFrame._dir_additions` (tab completion) limits to 100 values, for better performance. (:issue:`18587`)

This comment has been minimized.

@jreback

jreback Dec 6, 2017

Contributor

can you move 2nd to performance. don't use NDFrame. just say Series/DataFrame tab completion

def test_tab_completion(self):
# DataFrame whose columns are identifiers shall have them in __dir__.
df = pd.DataFrame([list('abcd'), list('efgh')], columns=list('ABCD'))
assert 'A' in dir(df)

This comment has been minimized.

@jreback

jreback Dec 6, 2017

Contributor

actually you can assert all of the items are in the dir() (use a loop)

columns=pd.MultiIndex.from_tuples(list(zip('ABCD', 'EFGH'))))
assert 'A' in dir(df)
assert isinstance(df.__getitem__('A'), pd.DataFrame)

This comment has been minimized.

@jreback

jreback Dec 6, 2017

Contributor

assert EFGH are NOT in the result (in a loop) and assert ABCD there

@@ -10,7 +10,7 @@
from pandas import Index, Series, DataFrame, date_range
from pandas.core.indexes.datetimes import Timestamp
from pandas.compat import range
from pandas.compat import range, lzip, isidentifier, string_types

This comment has been minimized.

@jreback

jreback Dec 6, 2017

Contributor

can you add a test that has 101 columns and assert first 100 there and last 1 is not.

This comment has been minimized.

@jreback

jreback Dec 9, 2017

Contributor

can you do thsi

This comment has been minimized.

@BibMartin

BibMartin Dec 11, 2017

Contributor

I've put this in the other test (See here and there)
Please tell me if you prefer that I put it in a separate test.

""" add the string-like attributes from the info_axis.
If info_axis is a MultiIndex, it's first level values are used.
"""
additions = set(

This comment has been minimized.

@jreback

jreback Dec 6, 2017

Contributor

can you add an asv that does dir() on a Series/DataFrame with say 10000 elements (we might already have one of these)

@BibMartin

This comment has been minimized.

Contributor

BibMartin commented Dec 8, 2017

Rebased and ready

@jreback jreback added this to the 0.22.0 milestone Dec 11, 2017

@jreback jreback merged commit 2aa4aa9 into pandas-dev:master Dec 11, 2017

0 of 3 checks passed

ci/circleci Your tests are queued behind your running builds
Details
continuous-integration/appveyor/pr Waiting for AppVeyor build to complete
Details
continuous-integration/travis-ci/pr The Travis CI build is in progress
Details
@jreback

This comment has been minimized.

Contributor

jreback commented Dec 11, 2017

thanks @BibMartin nice patch!

@BibMartin

This comment has been minimized.

Contributor

BibMartin commented Dec 11, 2017

Thanks to you @jreback : it's been a great occasion to learn for me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment