New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add .iloc and .ndloc integer indexing methods for Datasets #1435

Merged
merged 20 commits into from Jun 19, 2017

Conversation

Projects
None yet
2 participants
@philippjfr
Member

philippjfr commented May 14, 2017

Adds a tabular integer indexing interface for all data interfaces, allowing slicing and indexing by row and column indices. In the case of gridded ndarray data this operates on the flattened arrays. The main issue is that dask does not support integer indexing so iloc actually has to evaluate the graph and load a whole column at a time into memory, we should probably warn about this.

Valid signatures:

# Get first row
ds.iloc[0]

# Get first 10 rows
ds.iloc[:10]

# Get 1st, 3rd and 5th row
ds.iloc[[1, 3, 5]]

# Get every 3rd row
ds.iloc[::3]

# Get first column
ds.iloc[:, 0]

# Get 1st and 3rd column
ds.iloc[:, [0, 3]]

# Get columns 1 to 3
ds.iloc[:, 1:4]

And any combination of these.

  • Support for row, column indexing on all interfaces
  • Support ndarray like indexing on gridded interfaces
  • Unit tests
  • Better docstrings
  • Documentation

This also allows for the following optimizations:

  • Improve the decimate operation implementation
  • Improve Tabular indexing and implement truncating of bokeh table output
  • Vectorized Image.sample using sheet2matrixidx (fixing #1450)

@philippjfr philippjfr changed the title from Add ``iloc`` integer indexing interface for Datasets to Add .iloc integer indexing interface for Datasets May 14, 2017

@philippjfr philippjfr changed the title from Add .iloc integer indexing interface for Datasets to Add .iloc integer indexing method for Datasets May 14, 2017

@jlstevens

This comment has been minimized.

Member

jlstevens commented May 14, 2017

Still looking through this PR but my initial impression is that it looks good.

Here are my first two questions:

  1. Why not call TabularIndex something like ILoc? That seems all it is used for right now.
  2. How do the signatures that you listed compare to pandas iloc indexing?
@philippjfr

This comment has been minimized.

Member

philippjfr commented May 14, 2017

Why not call TabularIndex something like ILoc? That seems all it is used for right now.

Could do I suppose, not sure that's a better name though.

How do the signatures that you listed compare to pandas iloc indexing?

They match although I realized I still need to add boolean array indexing support (or at least add tests for it).

@jlstevens

This comment has been minimized.

Member

jlstevens commented May 14, 2017

I think naming it the same as the method (but capitalized) makes it instantly clear what the class is for. I agree that TabularIndex would be a better name if you are planning a use for it outside of the iloc method.

We already have a class redim(object) implementing the redim method and in my PR there is class periodic(object) implementing the periodic method. To be consistent, this should be class iloc(object) unless you see another use for it.

@philippjfr

This comment has been minimized.

Member

philippjfr commented Jun 18, 2017

I've now added a section to our Indexing and Selecting Data User Guide, which will be committed to the big docs PR.

@philippjfr philippjfr changed the title from Add .iloc integer indexing method for Datasets to Add .iloc and .ndloc integer indexing method for Datasets Jun 18, 2017

@philippjfr philippjfr changed the title from Add .iloc and .ndloc integer indexing method for Datasets to Add .iloc and .ndloc integer indexing methods for Datasets Jun 18, 2017

@philippjfr

This comment has been minimized.

Member

philippjfr commented Jun 18, 2017

I've now also added an .ndloc interface for gridded datasets. It works by applying integer indexing to the canonical value dimension orientations. Since our Image ndarray interface is flipped along the y-axis the indexing behavior there is probably unintuitive but it has to be consistent.

Here's an example of creating an Image using the Image ndarray, xarray and gridded dictionary interfaces and applying ndloc:

arr = np.random.randn(5, 10)
ds = hv.Dataset({'x': range(10), 'y': range(5), 'z': arr}, kdims=['x', 'y'], vdims=['z'],
               datatype=['grid'])
dict_img = hv.Image(ds, label='Gridded Dictionary')
xr_img = img.clone(datatype=['xarray'], label='XArray')
arr_img = hv.Image(arr[::-1], bounds=img.bounds, label='NdArray+Bounds')

(dict_img + dict_img.ndloc[1:3, 0:5] +
 xr_img + xr_img.ndloc[1:3, 0:5] +
 arr_img + arr_img.ndloc[1:3, 0:5]).cols(2)

image

@philippjfr

This comment has been minimized.

Member

philippjfr commented Jun 18, 2017

With my latest commit Image.sample now uses sheet2matrixidx and ndloc to efficiently sample the underlying array in continuous coordinates, this addresses issues with floating point precision on sampling and is considerably more efficient (addressing #1450), all existing sampling unit tests pass. I'll have to add some more direct unit tests and docstrings for ndloc though.

@philippjfr philippjfr requested a review from jlstevens Jun 19, 2017

@philippjfr

This comment has been minimized.

Member

philippjfr commented Jun 19, 2017

@jlstevens Ready for review.

@@ -550,7 +550,7 @@
"source": [
"print(rgb_parrot)\n",
"print(rgb_parrot[0,0])\n",
"print(rgb_parrot[0,0][0])"
"print(rgb_parrot[0,0].iloc[0, 0])"

This comment has been minimized.

@philippjfr

philippjfr Jun 19, 2017

Member

This was left over from when we returned tuples when indexing RGBs, so I ended up updating it, suppose rgb_parrot[0, 0, 'R'] would have been clearer but we're probably throwing this notebook out right?

This comment has been minimized.

@jlstevens
Allow selection by integer index, slice and list of integer
indices and boolean arrays, e.g.:
Examples:

This comment has been minimized.

@jlstevens

jlstevens Jun 19, 2017

Member

Bit redundant after 'e.g:' (which is sufficient)

"""
Dask does not support iloc, therefore iloc will execute
the call graph and lose the laziness of the operation.
"""

This comment has been minimized.

@jlstevens

jlstevens Jun 19, 2017

Member

I wonder if there could be optional performance warnings we could issue when laziness is lost. Not for this PR though.

@jlstevens

This comment has been minimized.

Member

jlstevens commented Jun 19, 2017

Other than a very minor comment about the docstrings, this PR looks good: the API is nice and clean and thanks for adding all those unit tests. I don't think people will need integer indexing often but there are certainly times when you need it and having this API will really help.

Happy to merge when the docstring is updated. The tests are passing except for one transient in the Python2 pr build.

@jlstevens

This comment has been minimized.

Member

jlstevens commented Jun 19, 2017

Thanks! Travis/conda is having issues at the moment but the tests were passing before the docstring changes so I'll just merge.

@jlstevens jlstevens merged commit 3804863 into master Jun 19, 2017

0 of 2 checks passed

continuous-integration/travis-ci/pr The Travis CI build is in progress
Details
continuous-integration/travis-ci/push The Travis CI build is in progress
Details

@philippjfr philippjfr deleted the iloc_indexing branch Jun 25, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment