Make itertuples really an iterator/generator in implementation, not just return type #20783

mitar · 2018-04-22T06:11:59Z

itertuples is not really an iterator/generator: it constructs a copy of the whole DataFrame in memory. Ideally it would return just an iterator and construct each row only as it is iterated over.


jreback · 2018-04-22T14:20:46Z

looks like a generator to me

In [1]: df = pd.DataFrame({'A': range(3), 'B': list('ABC')})

In [2]: df.itertuples()
Out[2]: <map at 0x10922c080>

In [3]: list(df.itertuples())
Out[3]: 
[Pandas(Index=0, A=0, B='A'),
 Pandas(Index=1, A=1, B='B'),
 Pandas(Index=2, A=2, B='C')]

In [5]: i = df.itertuples()

In [6]: next(i)
Out[6]: Pandas(Index=0, A=0, B='A')

In [7]: next(i)
Out[7]: Pandas(Index=1, A=1, B='B')

In [8]: next(i)
Out[8]: Pandas(Index=2, A=2, B='C')

In [9]: next(i)
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-9-a883b34d6d8a> in <module>()
----> 1 next(i)

StopIteration:

mitar · 2018-04-22T14:56:59Z

That's just because you return a zip. But inside the function, you create the whole list of values first. See here:

arrays.extend(self.iloc[:, k] for k in range(len(self.columns)))

It looks like an iterator because of zip(*arrays), but arrays is not an iterator. This is a problem.
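To make the distinction concrete, here is a minimal pandas-free sketch. The names `Column`, `eager_rows`, and `lazy_rows` are hypothetical stand-ins, not pandas code: `Column` mimics a column and counts how many of its values have been converted, so the amount of work done before the first `next()` call can be observed directly.

```python
class Column:
    """Hypothetical stand-in for a column; counts values converted so far."""

    def __init__(self, n):
        self.n = n
        self.converted = 0

    def value(self, i):
        # Simulates converting one stored value to a Python object.
        self.converted += 1
        return i


def eager_rows(columns):
    # Every column is fully converted to a list before zip() is returned.
    arrays = [[col.value(i) for i in range(col.n)] for col in columns]
    return zip(*arrays)


def lazy_rows(columns):
    # Generator: each row is built only when the consumer asks for it.
    n = min(col.n for col in columns)
    for i in range(n):
        yield tuple(col.value(i) for col in columns)


eager_cols = [Column(1000), Column(1000)]
eager_iter = eager_rows(eager_cols)
# Work already done before any call to next():
eager_before_first = sum(c.converted for c in eager_cols)

lazy_cols = [Column(1000), Column(1000)]
lazy_iter = lazy_rows(lazy_cols)
lazy_before_first = sum(c.converted for c in lazy_cols)

first_eager = next(eager_iter)
first_lazy = next(lazy_iter)
lazy_after_first = sum(c.converted for c in lazy_cols)
```

Both functions return objects that satisfy the iterator protocol, but only `lazy_rows` defers the per-row work until the rows are requested.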

mitar · 2018-04-22T15:26:25Z

You can test the issue here by doing:

d = pandas.DataFrame({'a': range(100000000)})
for a in d.itertuples(index=False, name=None):
    print(a)

Do this in the Python interpreter. Note the time it takes to create d. Then, after you press Enter twice following the print line, note how long it takes before anything is printed. The delay occurs because all rows are first created in memory before iteration begins.
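Instead of timing by eye, the delay before the first row can be measured. The sketch below assumes a hypothetical helper, `time_to_first_item`, and uses plain Python iterables instead of a DataFrame so that it runs without pandas. It compares an iterator backed by a fully built list with a generator.

```python
import time


def time_to_first_item(make_iterator):
    """Return (first_item, seconds) for a zero-argument callable that builds an iterator."""
    start = time.perf_counter()
    iterator = make_iterator()
    first = next(iterator)
    return first, time.perf_counter() - start


n = 1_000_000

# Eager: the whole list of tuples exists before the first item is available.
first_eager, eager_seconds = time_to_first_item(
    lambda: iter([(i,) for i in range(n)])
)

# Lazy: only the first tuple is built.
first_lazy, lazy_seconds = time_to_first_item(
    lambda: ((i,) for i in range(n))
)

print(f"eager: {eager_seconds:.4f}s, lazy: {lazy_seconds:.6f}s")
```

The same helper could be pointed at the frame above by passing `lambda: d.itertuples(index=False, name=None)`; the measured time depends on the pandas version and the machine.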

jreback · 2018-04-22T15:27:29Z

and if u want to fix it pls submit a PR
this is truly an iterator

so what u describe is an implementation detail

mitar · 2018-04-22T15:29:38Z

Implementation detail which blows up memory and performance?

Anything can be made to look like an iterator. But if it does not really behave like an iterator, it is not an iterator.

I think this is a bug. Please reopen this. Then I or somebody else can make a pull request.

mitar · 2018-04-23T06:10:19Z

This goes even deeper. Iterating over a Series also constructs a list internally:

d = pandas.DataFrame({'a': range(100000000)})
for a in d['a']:
    print(a)

This will also have a large delay before it starts producing results.

jorisvandenbossche · 2018-04-23T07:05:57Z

I agree that ideally the iteration would be lazier (although it is not really a priority issue for me), and since we would accept a PR to fix it, let's keep the issue open.

jreback · 2018-04-23T10:36:35Z

@mitar you are welcome to submit a PR, however, this method follows exactly the pandas paradigm. We create a new copy of things then hand it back to you, here the handing back is an iterator. If you can optimize this, great, but you are fighting standard practice. Further you may slow things down by doing this in the common case.

TomAugspurger · 2018-04-23T11:16:34Z

This would be a welcome fix if possible.

@mitar a couple things to watch out for, which we just hit with Categorical.__iter__:

- Scalars have to be converted from NumPy scalars to Python scalars
- We need to avoid tons of calls to Series/Frame.__getitem__, as this is relatively slow
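One possible way to balance laziness against per-element overhead is to convert values in chunks. The sketch below is only an illustration under stated assumptions and is not what pandas or PR #20796 does: `iter_chunked` is a hypothetical helper, and the standard library's `array.array` stands in for a column's typed storage so the example runs without NumPy. With a NumPy array, `ndarray.tolist()` plays the analogous role and also returns Python scalars.

```python
from array import array


def iter_chunked(values, chunksize=10_000):
    """Yield Python scalars from a typed buffer, converting one chunk at a time."""
    for start in range(0, len(values), chunksize):
        # One slice plus one tolist() call per chunk, instead of one
        # indexing call per element.
        yield from values[start:start + chunksize].tolist()


# Stand-in for a column's typed storage (signed 64-bit integers).
column = array("q", range(25_000))

it = iter_chunked(column, chunksize=10_000)
first_three = [next(it) for _ in range(3)]

# Count the three items already taken plus everything that remains.
total = 3 + sum(1 for _ in it)
```

At most one chunk of Python objects is alive at a time, so peak memory is bounded by `chunksize` and not by the length of the column, while the cost of slicing is amortized over many elements.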

mitar · 2018-04-23T14:52:53Z

I made: #20796

mitar · 2018-04-23T15:08:57Z

> We need to avoid tons of calls to Series/Frame.__getitem__, as this is relatively slow

Is calling iloc OK? Or is it just as slow as __getitem__?

mitar · 2018-04-25T22:34:16Z

I think that the PR #20796 is ready to be reviewed.

jreback closed this as completed Apr 22, 2018

jreback added the Usage Question and Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) labels Apr 22, 2018

jreback added this to the No action milestone Apr 22, 2018

mitar changed the title ~~Make itertuples really an iterator/generator~~ Make itertuples really an iterator/generator in implementation, not just return type Apr 22, 2018

mitar mentioned this issue Apr 23, 2018

Surprising type conversion when iterating #20791

Open

jorisvandenbossche reopened this Apr 23, 2018

jorisvandenbossche removed the Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) and Usage Question labels Apr 23, 2018

jorisvandenbossche modified the milestones: No action, Someday Apr 23, 2018

jreback added the Indexing (Related to indexing on series/frames, not to indexes themselves), Performance (Memory or execution speed performance), Dtype Conversions (Unexpected or buggy dtype conversions), and Difficulty Intermediate labels Apr 23, 2018

mitar mentioned this issue Apr 23, 2018

ENH: Implemented lazy iteration #20796

Merged

4 tasks

jreback added this to the 0.24.0 milestone May 29, 2018

jreback modified the milestones: 0.24.0, 0.25.0 Oct 23, 2018

jreback modified the milestones: 0.25.0, 0.24.0 Dec 23, 2018

jreback closed this as completed in #20796 Dec 25, 2018
