Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make itertuples really an iterator/generator in implementation, not just return type #20783

Closed
mitar opened this issue Apr 22, 2018 · 12 comments

Comments

Projects
None yet
4 participants
@mitar
Copy link
Contributor

commented Apr 22, 2018

itertuples is not really an iterator/generator and constructs a copy of whole DataFrame in memory. Ideally it would return just an iterator and construct row by row as it is being iterated over.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Apr 22, 2018

looks like a generator to me

In [1]: df = pd.DataFrame({'A': range(3), 'B': list('ABC')})

In [2]: df.itertuples()
Out[2]: <map at 0x10922c080>

In [3]: list(df.itertuples())
Out[3]: 
[Pandas(Index=0, A=0, B='A'),
 Pandas(Index=1, A=1, B='B'),
 Pandas(Index=2, A=2, B='C')]

In [5]: i = df.itertuples()

In [6]: next(i)
Out[6]: Pandas(Index=0, A=0, B='A')

In [7]: next(i)
Out[7]: Pandas(Index=1, A=1, B='B')

In [8]: next(i)
Out[8]: Pandas(Index=2, A=2, B='C')

In [9]: next(i)
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-9-a883b34d6d8a> in <module>()
----> 1 next(i)

StopIteration: 

@jreback jreback closed this Apr 22, 2018

@jreback jreback added this to the No action milestone Apr 22, 2018

@mitar

This comment has been minimized.

Copy link
Contributor Author

commented Apr 22, 2018

That's just because you return a zip. But inside the function, you create whole list of values first. See here:

arrays.extend(self.iloc[:, k] for k in range(len(self.columns)))

It looks like iterator because of zip(*arrays), but arrays is not an iterator. This is a problem.

@mitar

This comment has been minimized.

Copy link
Contributor Author

commented Apr 22, 2018

You can test the issue here by doing:

d = pandas.DataFrame({'a': range(100000000)})
for a in d.itertuples(index=False, name=None):
    print(a)

Do this in Python interpreter. Note the time it takes to create d. Now, when you press enter after print line, twice, note the time it takes to start producing anything. It takes this time because it is first creating all rows in memory, before iterating over them.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Apr 22, 2018

and if u want to fix it pls submit a PR
this is truly an iterator

so what u describe is an implementation detail

@mitar

This comment has been minimized.

Copy link
Contributor Author

commented Apr 22, 2018

Implementation detail which blows up memory and performance?

Anything can be made look like iterator. But if does not really behave like iterator, it is not an iterator.

I think this is a bug. Please reopen this. And then me or somebody else can make a pull request.

@mitar mitar changed the title Make itertuples really an iterator/generator Make itertuples really an iterator/generator in implementation, not just return type Apr 22, 2018

@mitar

This comment has been minimized.

Copy link
Contributor Author

commented Apr 23, 2018

This goes even deeper. Also iterating over a series constructs a list internally:

d = pandas.DataFrame({'a': range(100000000)})
for a in d['a']:
    print(a)

This will also have a large delay before starting sending results back.

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Apr 23, 2018

I agree that ideally it would be more lazy the iteration (although not really a priority issue for me), and since we would accept a PR to fix, let's keep the issue open.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Apr 23, 2018

@mitar you are welcome to submit a PR, however, this method follows exactly the pandas paradigm. We create a new copy of things then hand it back to you, here the handing back is an iterator. If you can optimize this, great, but you are fighting standard practice. Further you may slow things down by doing this in the common case.

@TomAugspurger

This comment has been minimized.

Copy link
Contributor

commented Apr 23, 2018

This would be a welcome fix if possible.

@mitar a couple things to watch out for, which we just hit with Categorical.__iter__:

  • Scalars have to be converted from NumPy scalars to Python scalars
  • We need to avoid tons of calls to Series/Frame.__getitem__, as this is relatively slow

@mitar mitar referenced this issue Apr 23, 2018

Merged

ENH: Implemented lazy iteration #20796

4 of 4 tasks complete
@mitar

This comment has been minimized.

Copy link
Contributor Author

commented Apr 23, 2018

I made: #20796

@mitar

This comment has been minimized.

Copy link
Contributor Author

commented Apr 23, 2018

We need to avoid tons of calls to Series/Frame.__getitem__, as this is relatively slow

Calling iloc is OK? Or is this the same slow as __getitem__?

@mitar

This comment has been minimized.

Copy link
Contributor Author

commented Apr 25, 2018

I think that the PR #20796 is ready to be reviewed.

@jreback jreback added this to the 0.24.0 milestone May 29, 2018

@jreback jreback modified the milestones: 0.24.0, 0.25.0 Oct 23, 2018

@jreback jreback modified the milestones: 0.25.0, 0.24.0 Dec 23, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.