Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make itertuples really an iterator/generator in implementation, not just return type #20783

Closed
mitar opened this issue Apr 22, 2018 · 12 comments · Fixed by #20796
Closed

Make itertuples really an iterator/generator in implementation, not just return type #20783

mitar opened this issue Apr 22, 2018 · 12 comments · Fixed by #20796
Labels
Dtype Conversions Unexpected or buggy dtype conversions Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance
Milestone

Comments

@mitar
Copy link
Contributor

mitar commented Apr 22, 2018

itertuples is not really an iterator/generator and constructs a copy of whole DataFrame in memory. Ideally it would return just an iterator and construct row by row as it is being iterated over.

@jreback
Copy link
Contributor

jreback commented Apr 22, 2018

looks like a generator to me

In [1]: df = pd.DataFrame({'A': range(3), 'B': list('ABC')})

In [2]: df.itertuples()
Out[2]: <map at 0x10922c080>

In [3]: list(df.itertuples())
Out[3]: 
[Pandas(Index=0, A=0, B='A'),
 Pandas(Index=1, A=1, B='B'),
 Pandas(Index=2, A=2, B='C')]

In [5]: i = df.itertuples()

In [6]: next(i)
Out[6]: Pandas(Index=0, A=0, B='A')

In [7]: next(i)
Out[7]: Pandas(Index=1, A=1, B='B')

In [8]: next(i)
Out[8]: Pandas(Index=2, A=2, B='C')

In [9]: next(i)
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-9-a883b34d6d8a> in <module>()
----> 1 next(i)

StopIteration: 

@jreback jreback closed this as completed Apr 22, 2018
@jreback jreback added Usage Question Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Apr 22, 2018
@jreback jreback added this to the No action milestone Apr 22, 2018
@mitar
Copy link
Contributor Author

mitar commented Apr 22, 2018

That's just because you return a zip. But inside the function, you create whole list of values first. See here:

arrays.extend(self.iloc[:, k] for k in range(len(self.columns)))

It looks like iterator because of zip(*arrays), but arrays is not an iterator. This is a problem.

@mitar
Copy link
Contributor Author

mitar commented Apr 22, 2018

You can test the issue here by doing:

d = pandas.DataFrame({'a': range(100000000)})
for a in d.itertuples(index=False, name=None):
    print(a)

Do this in Python interpreter. Note the time it takes to create d. Now, when you press enter after print line, twice, note the time it takes to start producing anything. It takes this time because it is first creating all rows in memory, before iterating over them.

@jreback
Copy link
Contributor

jreback commented Apr 22, 2018

and if u want to fix it pls submit a PR
this is truly an iterator

so what u describe is an implementation detail

@mitar
Copy link
Contributor Author

mitar commented Apr 22, 2018

Implementation detail which blows up memory and performance?

Anything can be made look like iterator. But if does not really behave like iterator, it is not an iterator.

I think this is a bug. Please reopen this. And then me or somebody else can make a pull request.

@mitar mitar changed the title Make itertuples really an iterator/generator Make itertuples really an iterator/generator in implementation, not just return type Apr 22, 2018
@mitar
Copy link
Contributor Author

mitar commented Apr 23, 2018

This goes even deeper. Also iterating over a series constructs a list internally:

d = pandas.DataFrame({'a': range(100000000)})
for a in d['a']:
    print(a)

This will also have a large delay before starting sending results back.

@jorisvandenbossche
Copy link
Member

I agree that ideally it would be more lazy the iteration (although not really a priority issue for me), and since we would accept a PR to fix, let's keep the issue open.

@jorisvandenbossche jorisvandenbossche removed Reshaping Concat, Merge/Join, Stack/Unstack, Explode Usage Question labels Apr 23, 2018
@jorisvandenbossche jorisvandenbossche modified the milestones: No action, Someday Apr 23, 2018
@jreback
Copy link
Contributor

jreback commented Apr 23, 2018

@mitar you are welcome to submit a PR, however, this method follows exactly the pandas paradigm. We create a new copy of things then hand it back to you, here the handing back is an iterator. If you can optimize this, great, but you are fighting standard practice. Further you may slow things down by doing this in the common case.

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance Dtype Conversions Unexpected or buggy dtype conversions Difficulty Intermediate labels Apr 23, 2018
@TomAugspurger
Copy link
Contributor

TomAugspurger commented Apr 23, 2018

This would be a welcome fix if possible.

@mitar a couple things to watch out for, which we just hit with Categorical.__iter__:

  • Scalars have to be converted from NumPy scalars to Python scalars
  • We need to avoid tons of calls to Series/Frame.__getitem__, as this is relatively slow

@mitar
Copy link
Contributor Author

mitar commented Apr 23, 2018

I made: #20796

@mitar
Copy link
Contributor Author

mitar commented Apr 23, 2018

We need to avoid tons of calls to Series/Frame.__getitem__, as this is relatively slow

Calling iloc is OK? Or is this the same slow as __getitem__?

@mitar
Copy link
Contributor Author

mitar commented Apr 25, 2018

I think that the PR #20796 is ready to be reviewed.

@jreback jreback added this to the 0.24.0 milestone May 29, 2018
@jreback jreback modified the milestones: 0.24.0, 0.25.0 Oct 23, 2018
@jreback jreback modified the milestones: 0.25.0, 0.24.0 Dec 23, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants