Slicing awkward column of DataFrame view not behaving as expected #27
Comments
I see the issue, but I'm not certain what to do about it yet. The index of the df after filtering looks like
i.e., the integer rows which matched the filter. When extracting We can cope with simple cases of such indexing, like the example diff below. However, as @jpivarski will tell you, the set of things you might pass to awkward's getitem is complex (e.g., field strings, where the order of column selection and row selection can commute), and it's not entirely clear we can cover them all.
--- a/src/awkward_pandas/accessor.py
+++ b/src/awkward_pandas/accessor.py
@@ -46,11 +46,20 @@ class AwkwardAccessor:
     @property
     def array(self):
         return self.extarray._data
-    def __getitem__(self, *items):
-        ds = self.array.__getitem__(*items)
-        return pd.Series(AwkwardExtensionArray(ds))
+    def __getitem__(self, items):
+        """Extract components using awkward indexing"""
+        ds = self.array.__getitem__(items)
+        index = None
+        if items[0]:
+            if (
+                not isinstance(items[0], str)
+                and not (isinstance(items[0], list) and isinstance(items[0][0], str))
+            ):
+                index = self._obj.index[items[0]]
+        return pd.Series(AwkwardExtensionArray(ds), index=index)
Is this a follow-up from scikit-hep/awkward#803? I'll try to think of a way to get an "index slicer" from an arbitrary slice of an Awkward Array, so that you have something to apply to the Pandas index. It would have to turn any slicer into a one-dimensional slicer somehow.
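As a rough sketch of the "index slicer" idea (not the awkward-pandas implementation), one could keep only the part of the getitem argument that addresses the first axis and return nothing when the argument only selects fields. The helper name `row_slicer` is hypothetical, and it deliberately ignores harder inputs such as `...`, `np.newaxis`, scalars, and option-type arrays.

```python
import pandas as pd


def row_slicer(items):
    """Return the part of a getitem argument that selects rows, or None.

    Hypothetical helper for illustration only; it covers slices, lists,
    field names, and tuples of those.
    """
    # A tuple addresses several axes at once; only its first entry touches rows.
    if isinstance(items, tuple):
        if not items:
            return None  # an empty tuple selects everything
        first = items[0]
    else:
        first = items

    # Field selection (a string or a list of strings) leaves the rows alone.
    if isinstance(first, str):
        return None
    if isinstance(first, list) and first and all(isinstance(f, str) for f in first):
        return None
    return first


index = pd.Index([3, 5, 8, 13])  # e.g. the labels left over after filtering
for items in ("x", ["x", "y"], slice(1, 3), ([0, 2], "x")):
    rows = row_slicer(items)
    labels = list(index) if rows is None else list(index[rows])
    print(repr(items), "->", labels)
```

Whatever comes back from such a helper is one-dimensional by construction in these simple cases, so it can be applied to the pandas index directly.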
No, at least not from my side. I came here from scikit-hep/uproot5#803 because I thought it makes more sense here than as a discussion in
Is it a coincidence that both are 803? :) I am pleased that this package is actually directly used from uproot, I didn't realise that was the case.
Oh, yes, Uproot is using awkward-pandas! I'm sorry that I didn't mention this; I didn't realize that I hadn't. This was a follow-up project from @kkothari2001. I considered it a prerequisite for implementing
Yes, I think there's been some confusion in authoring links; @ast0815 filed this after our discussion in uproot. I think scikit-hep/awkward#803 is unrelated (we're into the 2000s now).
There is a well-defined set of slicer types that are accepted by Awkward Array. It hasn't been written down in documentation, but it all happens in (The second is for handling tuple items.) This list is:
So, which ones do you have to think about in awkward-pandas?
>>> array = np.array([0.0, 1.1, 2.2, 3.3, 4.4])
>>> array[np.array([2, 0, 0, 1])]
array([2.2, 0. , 0. , 1.1])
>>> array[np.array([[2, 0], [0, 1]])]
array([[2.2, 0. ],
[0. , 1.1]])
>>> df = pd.DataFrame({"x": [0.0, 1.1, 2.2, 3.3, 4.4]})
>>> df.loc[np.array([2, 0, 0, 1])]
x
2 2.2
0 0.0
0 0.0
1 1.1
>>> df.loc[np.array([[2, 0], [0, 1]])]
...
ValueError: Cannot index with multidimensional key
>>> df.iloc[np.array([[2, 0], [0, 1]])]
...
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
>>> df[np.array([[2, 0], [0, 1]])]
...
KeyError: "None of [Index([(2, 0), (0, 1)], dtype='object')] are in the [columns]"
So while there's no output to produce, be sure to not apply a multidimensional NumPy array to the Awkward Array and then do nothing to the DataFrame index: this needs to raise an error.
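A minimal sketch of such a guard, assuming the row selection arrives as something NumPy can convert to an array. The function name `slice_index` and the error message are illustrative choices, not part of awkward-pandas.

```python
import numpy as np
import pandas as pd


def slice_index(index, rows):
    """Apply a positional integer or boolean array to a pandas Index, 1-D only."""
    rows = np.asarray(rows)
    if rows.ndim != 1:
        # Refuse keys that cannot map one-to-one onto index labels,
        # rather than silently leaving the index untouched.
        raise ValueError("Cannot index with multidimensional key")
    return index[rows]


index = pd.Index([10, 11, 12, 13, 14])
print(list(slice_index(index, [2, 0, 0, 1])))  # labels 12, 10, 10, 11

try:
    slice_index(index, [[2, 0], [0, 1]])
except ValueError as err:
    print("rejected:", err)
```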
>>> array = ak.Array([0.0, 1.1, 2.2, 3.3, 4.4])
>>> array[ak.Array([False, True, True, None, True])]
<Array [1.1, 2.2, None, 4.4] type='4 * ?float64'>
>>> array[ak.Array([2, 0, 0, None, 1])]
<Array [2.2, 0, 0, None, 1.1] type='5 * ?float64'>
Since the For an array of option-type integers, it's not clear to me what should happen. The output needs to have Maybe this case can be
Those are all the cases! The only hard ones are
I meant scikit-hep/uproot5#803! That's what happened. It probably also explains why I didn't see an automatic cross-link there.
Answering this bit:
Yes you can, but you shouldn't! One of the many weird pandas cases. The most general form of the index is just like any other series; it doesn't need to be unique, ordered, or satisfy any other particular condition. Indexing using such a series will be slow, and the presence of None (or pd.NA or nan...) will probably break something.
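For illustration, here is a small example with made-up values showing that pandas accepts a repeated, unordered index, and that a label lookup then returns every matching row:

```python
import pandas as pd

# Labels repeat and are not sorted; pandas accepts this without complaint.
s = pd.Series([2.2, 0.0, 0.0, 1.1], index=[2, 0, 0, 1])

print(s.index.is_unique)  # False
print(s.loc[0])           # a two-row Series: both entries labelled 0
print(s.loc[2])           # a scalar, because label 2 occurs once
```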
Hello,
I hope the title is somewhat correct.
What I tried to do is select the first element of an awkward column in a Pandas DataFrame and create a new column with just those elements. Because there are entries with 0 elements, I filtered those out before:
Unfortunately, in this case the output is not as expected, and the last entries are all just the same number repeated over and over:
If I use reset_index in between, it works as expected:
It seems to me like the weird behaviour of the first instance is a bug. At least it is pretty unexpected. Or maybe I am just using the accessor wrong. I did not find an example of how to do what I want in the documentation.
I am using the latest version of awkward and awkward-pandas.
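The effect of the filtered index can be reproduced with plain pandas, as an analogy rather than a reproduction of the report. The column names and values below are made up, and with an Awkward extension column the symptom may look different from the NaN shown here. A result that comes back with a fresh 0, 1, 2, ... index is aligned on labels, not positions, when it is assigned to the filtered DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"n": [2, 0, 1, 0, 3]})
kept = df[df["n"] > 0]  # surviving index labels are 0, 2, 4

# A result built without an index gets fresh labels 0, 1, 2.
first = pd.Series([10, 20, 30])

# Assignment aligns on labels, so 20 is dropped and label 4 gets NaN.
misaligned = kept.assign(first=first)
print(misaligned)

# Reusing the filtered index (or calling reset_index beforehand) lines things up.
aligned = kept.assign(first=pd.Series([10, 20, 30], index=kept.index))
print(aligned)
```

This is why carrying the filtered index through to the returned Series, as in the diff discussed in the comments, or calling reset_index first, gives the expected column.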
This is where I first discussed this issue in the uproot context: scikit-hep/uproot5#803