Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Allow genfromtxt to unpack structured arrays #16650

Merged
merged 13 commits into from Sep 11, 2020

Conversation

AndrewEckart
Copy link
Contributor

Previously, genfromtxt failed to transpose its output when
unpack=True and dtype was structured (or dtype=None and the
inferred dtype was structured). This patch resolves the issue by
returning a list of arrays, as in loadtxt. Includes tests and updates
the docstring to match loadtxt.

Fixes #4341

numpy/lib/npyio.py Outdated Show resolved Hide resolved
@charris charris changed the title BUG: Allow genfromtxt to unpack structured arrays ENH: Allow genfromtxt to unpack structured arrays Jun 21, 2020
rossbar
rossbar previously approved these changes Jun 26, 2020
Copy link
Contributor

@rossbar rossbar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a couple minor things. Thanks @AndrewEckart !

numpy/lib/tests/test_io.py Outdated Show resolved Hide resolved
numpy/lib/npyio.py Outdated Show resolved Hide resolved
if unpack:
return output.squeeze().T
return output.squeeze()
if names is not None and len(names) > 1:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the len(names) > 1 condition?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the goal is to match squeeze, then len(names) != 1 would work better for when names == [].

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can change it, but I think names == [] necessarily implies output is empty (i.e. the input was an empty file).

Consider the following two calls:

np.genfromtxt(StringIO(''), dtype=None, unpack=True)
np.genfromtxt(StringIO(''), dtype={'names': (), 'formats': ()}, unpack=True)

With len(names > 1), both will return an empty ndarray , but with len(names) != 1), both will return an empty list.

def test_unpack_auto_dtype(self):
# Regression test for gh-4341
# Unpacking should work when dtype=None
txt = TextIO("21 72\n35 58")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, LGTM! would be also useful to add a test with a mix of integer and string to make sure that the field array dtypes are being discerned correctly.

# even if dtype is structured
txt = TextIO("1")
dt = {'names': ('a',), 'formats': ('<i4',)}
x = np.genfromtxt(txt, dtype=dt, unpack=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's what I get with this example (using StringIO instead of the internal TextIO):

In [37]: dt = np.dtype({'names': ('a',), 'formats': ('<i4',)})

In [38]: x = np.genfromtxt(StringIO('1'), dtype=dt, unpack=True)

In [39]: x                                                                                                        
Out[39]: array((1,), dtype=[('a', '<i4')])

I don't think the dtype of the return value should still be the structured dtype in this case--unpack=True is supposed to unpack the fields. So I expect this output to be array(1, dtype=int32). In other words, the structure should still be pulled apart even when len(names) == 1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I modified the logic to return the single unpacked ndarray when len(names) == 1 and a list of ndarrays when len(names > 1), which I think is a little bit cleaner. Technically, len(names) == 0) is still possible with an empty input file and a structured dtype, but this already gives a UserWarning.

Previously, `genfromtxt` failed to transpose its output when
`unpack=True` and `dtype` was structured (or `dtype=None` and the
inferred `dtype` was structured). This patch resolves the issue by
returning a list of arrays, as in `loadtxt`. Includes tests and updates
the docstring to match `loadtxt`.

Fixes numpy#4341
mattip and others added 2 commits July 14, 2020 10:37
Co-authored-by: Eric Wieser <wieser.eric@gmail.com>
numpy/lib/npyio.py Outdated Show resolved Hide resolved
@seberg
Copy link
Member

seberg commented Jul 15, 2020

I think this is probably fine, although did not look too careful. Maybe @WarrenWeckesser can push this through?

It seems that this does change behaviour fairly strongly? Probably the old behaviour can be defined as a clear bug here? @AndrewEckart could you comment briefly on that for me, and add a release note unless it was a loud error before? The release notes go into numpy/doc/release/upcoming_changes/ there is a Readme there to explain how to create one.

@AndrewEckart
Copy link
Contributor Author

I think this is probably fine, although did not look too careful. Maybe @WarrenWeckesser can push this through?

It seems that this does change behaviour fairly strongly? Probably the old behaviour can be defined as a clear bug here? @AndrewEckart could you comment briefly on that for me, and add a release note unless it was a loud error before? The release notes go into numpy/doc/release/upcoming_changes/ there is a Readme there to explain how to create one.

I'm not sure how to characterize it. In #4341, it's described as a bug, and the old behavior is contrasted with loadtxt which correctly unpacks structured arrays. It wasn't a loud error, it just failed to unpack and returned as if you had passed unpack=False. I don't think it's a major change to make it behave as expected (and in parity with loadtxt), but I'm happy to update the release notes if you like.

@seberg
Copy link
Member

seberg commented Jul 15, 2020

Definitely needs a release note IMO. Someones code will probably break somewhere. We have to make a call that it is very few people...

@AndrewEckart
Copy link
Contributor Author

Definitely needs a release note IMO. Someones code will probably break somewhere. We have to make a call that it is very few people...

Would you consider this compatibility or improvement?

@seberg
Copy link
Member

seberg commented Jul 15, 2020

Lets put it under compat, since the actual improvement is pretty straight forward (and arguably even a bug fix).

numpy/lib/tests/test_io.py Outdated Show resolved Hide resolved
@rossbar rossbar self-requested a review July 29, 2020 20:30
one for each column::

>>> np.genfromtxt(StringIO("1 2.0 \n 3 4.0"), unpack=True)
[array([1, 3]), array([2., 4.])]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this example requires a structured dtype= being passed in? It may be good to point/show an example which triggers the changed behaviour without passing in dtype= (if that is possible).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(i.e. I tried the example on this branch, and the behaviour is not as written here – also there is an additional closing bracket)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about that, it looks like the numbers in my example were too small so the inferred dtype was '<u4'. I replaced it with the test case for dtype=None.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I have to take a closer look at the tests, or am I running the wrong code? But it seems to me that this gives:

In [2]: from io import StringIO
In [3]: np.genfromtxt(StringIO("21 58.0 \n 35 72.0"), unpack=True)
Out[3]: 
array([[21., 35.],
       [58., 72.]])

OTOH, this one changed behaviour as expected:

In [5]: np.genfromtxt(StringIO("21 58.0 \n 35 72.0"), unpack=True, dtype="i,d")

so, right now I think the example is still incorrect, and if there are tests for this, they should probably be refined to check that the result is (correctly, I assume) a single array in the first case, since it should be a single array when there is no structured dtype detected (which seems to be the default?).

Unless this should change and I am missing something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, you were running the right code; it turns out that the example was incorrect. I forgot that detection of a structured dtype relies on passing dtype=None to override the default dtype=float (otherwise, a single array is the correct return value). My example in the release note failed to specify the dtype argument. Should be correct now.

@mattip
Copy link
Member

mattip commented Aug 10, 2020

Unfortunately CircleCI does not merge-from-master before building. Could you either merge from master or rebase to clear the non-related failure?

Copy link
Contributor

@rossbar rossbar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took the liberty of expanding the release note a bit. There are a couple things that need addressing in the updated tests

numpy/lib/tests/test_io.py Outdated Show resolved Hide resolved
numpy/lib/tests/test_io.py Outdated Show resolved Hide resolved
@rossbar rossbar dismissed their stale review August 10, 2020 20:02

Changes introduced after initial approval

Copy link
Contributor

@rossbar rossbar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - thanks for sticking with this @AndrewEckart !

@rossbar rossbar merged commit 3329d26 into numpy:master Sep 11, 2020
@charris charris mentioned this pull request Oct 10, 2020
20 tasks
@emirkmo
Copy link

emirkmo commented Jan 19, 2022

Hi, this change means that genfromtxt now ignores names=('column1','column2') when unpack=True is passed. Instead of giving a named/rec array as before, a list of arrays with no names is given. Is this the correct behavior? It means that passing the names explicitly into genfromtxt does nothing and the output is not an array of arrays/columns, which was very useful. Lists can't be indexed the same way. Can I be guaranteed that the returned list is in the right order so I can manually assign the names? Should I open an issue about this?

The example I am thinking of is, say we have a simple csv or space delimited document with 4 columns:

val1 val2 valignore1 valignore2
val3 val4 valignore3 valignore4
val5 val6 valignore5 valignore6

I read this in using np.genfromtxt(data.dat, unpack=True, names=('column1', 'column2'), usecols=(0,1)), the result is a list and the names are ignored. Previously, the names would be attached and the result would be a two column array that can be indexed with the given names. The list obviously can't be.

Now we get [array(val1, val3, val5), array(val1, val3, val5)], previously it would be an array of arrays with names (used to be called recarray I think but maybe that name has changed).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

genfromtxt won't unpack if dtype=None
9 participants