Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix generator support in fromdicts - use file cache instead of iterto… #625

Conversation

arturponinski
Copy link
Collaborator

@arturponinski arturponinski commented Jul 13, 2022

…ols.tee

This PR has the objective of improving the support of generators in fromdicts.
The current implementation uses itertools.tee which according to docs and production deployments uses large amounts of memory, leading to out of memory kills of processes.
This PR aims to keep the improved support of generators by using a filecache, similar to sorting, to allow multiple iterations.

Closes #618

Changes

  1. Refactored DictsGeneratorView in petl.io.json to use file cache
  2. Moved _iterchunk from sorts to petl.util.base. Imported to sorts as _iterchunk for BC

Checklist

Use this checklist for assuring the quality of pull requests that include new code and or make changes to existing code.

  • Source Code rules apply:
    • Includes unit tests
    • New functions have docstrings with examples that can be run with doctest
    • New functions are included in API docs
    • Docstrings include notes for any changes to API or behaviour
    • All changes are documented in docs/changes.rst
  • Versioning and history tracking rules apply:
    • Using atomic commits when possible
    • Commits are reversible when possible
    • There is no incomplete changes in the pull request
    • There is no accidental garbage added in source code
  • Testing rules apply:
    • Tested locally using tox / pytest
    • Automated testing passes (see CI)
    • Unit test coverage has not decreased (see Coveralls)
  • State of these changes is:
    • Just a proof of concept
    • Work in progress / Further changes needed
    • Ready to review
    • Ready to merge

@arturponinski arturponinski added Bug It must work in all situations, but this failed Performance labels Jul 13, 2022
@arturponinski arturponinski added this to the v1.7.11 milestone Jul 13, 2022
@coveralls
Copy link

coveralls commented Jul 13, 2022

Pull Request Test Coverage Report for Build 2693186888

  • 78 of 79 (98.73%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.03%) to 91.085%

Changes Missing Coverage Covered Lines Changed/Added Lines %
petl/io/json.py 34 35 97.14%
Totals Coverage Status
Change from base Build 2692922856: 0.03%
Covered Lines: 12689
Relevant Lines: 13931

💛 - Coveralls

@@ -742,6 +746,18 @@
return peek, chain(peek, it)


def iterchunk(fn):
# reopen so iterators from file cache are independent
debug('iterchunk, opening %s' % fn)

Check notice

Code scanning

Formatting a regular string which could be a f-string (consider-using-f-string)

Formatting a regular string which could be a f-string (consider-using-f-string)
@@ -742,6 +746,18 @@
return peek, chain(peek, it)


def iterchunk(fn):
# reopen so iterators from file cache are independent
debug('iterchunk, opening %s' % fn)

Check notice

Code scanning

Use lazy % formatting in logging functions (logging-not-lazy)

Use lazy % formatting in logging functions (logging-not-lazy)
yield pickle.load(f)
except EOFError:
pass
debug('end of iterchunk, closed %s' % fn)

Check notice

Code scanning

Formatting a regular string which could be a f-string (consider-using-f-string)

Formatting a regular string which could be a f-string (consider-using-f-string)
yield pickle.load(f)
except EOFError:
pass
debug('end of iterchunk, closed %s' % fn)

Check notice

Code scanning

Use lazy % formatting in logging functions (logging-not-lazy)

Use lazy % formatting in logging functions (logging-not-lazy)
@@ -175,16 +177,50 @@

class DictsGeneratorView(DictsView):

def __init__(self, dicts, header=None, sample=1000, missing=None):
super(DictsGeneratorView, self).__init__(dicts, header, sample, missing)

Check notice

Code scanning

Consider using Python 3 style super() without arguments (super-with-arguments)

Consider using Python 3 style super() without arguments (super-with-arguments)
yield self._header

if not self._filecache:
self._filecache = NamedTemporaryFile(delete=False, mode='wb')

Check notice

Code scanning

Consider using 'with' for resource-allocating operations (consider-using-with)

Consider using 'with' for resource-allocating operations (consider-using-with)
# reverse order
second_row3 = next(it2)
first_row3 = next(it1)
assert second_row3 == first_row3

Check warning

Code scanning / Bandit (reported by Codacy)

Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.

Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.
assert second_row3 == first_row3
ieq(actual, actual)
assert actual.header() == ('n', 'foo', 'bar')
assert len(actual) == 6

Check warning

Code scanning / Bandit (reported by Codacy)

Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.

Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.
@@ -1,6 +1,7 @@
from __future__ import absolute_import, print_function, division


import logging
import pickle

Check warning

Code scanning / Bandit (reported by Codacy)

Consider possible security implications associated with pickle module.

Consider possible security implications associated with pickle module.
Copy link

@github-advanced-security github-advanced-security bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 15 potential problems in the proposed changes. Check the Files changed tab for more details.

Copy link

@github-advanced-security github-advanced-security bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 16 potential problems in the proposed changes. Check the Files changed tab for more details.

Copy link

@github-advanced-security github-advanced-security bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 16 potential problems in the proposed changes. Check the Files changed tab for more details.

@arturponinski arturponinski force-pushed the fix-fromdicts-generator-support branch 2 times, most recently from a598913 to 177756b Compare July 18, 2022 20:05
@arturponinski arturponinski force-pushed the fix-fromdicts-generator-support branch from 177756b to 62cebec Compare July 18, 2022 20:10
Copy link

@github-advanced-security github-advanced-security bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 14 potential problems in the proposed changes. Check the Files changed tab for more details.

Copy link

@github-advanced-security github-advanced-security bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 17 potential problems in the proposed changes. Check the Files changed tab for more details.

Copy link

@github-advanced-security github-advanced-security bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 14 potential problems in the proposed changes. Check the Files changed tab for more details.

Copy link

@github-advanced-security github-advanced-security bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 17 potential problems in the proposed changes. Check the Files changed tab for more details.

Copy link

@github-advanced-security github-advanced-security bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 17 potential problems in the proposed changes. Check the Files changed tab for more details.

Copy link

@github-advanced-security github-advanced-security bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 14 potential problems in the proposed changes. Check the Files changed tab for more details.

@arturponinski
Copy link
Collaborator Author

@juarezr I've updated the docstring for fromdicts, please re-review this. Once you accept I'd like to ship it 🎉

@juarezr
Copy link
Member

juarezr commented Jul 20, 2022

@arturponinski,

Nice solution.
If you need any help, please don't be afraid of contacting me.

@juarezr
Copy link
Member

juarezr commented Jul 29, 2022

@arturponinski,

Nice solution. If you need any help, please don't be afraid of contacting me.

@arturponinski,

I didn't understand if you consider this PR ready to merge.
I don't have any objections if you want to merge and ship it.

Copy link

@github-advanced-security github-advanced-security bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 13 potential problems in the proposed changes. Check the Files changed tab for more details.

Copy link

@github-advanced-security github-advanced-security bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 14 potential problems in the proposed changes. Check the Files changed tab for more details.

Copy link

@github-advanced-security github-advanced-security bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 13 potential problems in the proposed changes. Check the Files changed tab for more details.

Copy link

@github-advanced-security github-advanced-security bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 13 potential problems in the proposed changes. Check the Files changed tab for more details.

self._filecache = NamedTemporaryFile(delete=False, mode='wb')
it = iter(self.dicts)
for o in it:
row = tuple(o[f] if f in o else self.missing for f in self._header)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
row = tuple(o[f] if f in o else self.missing for f in self._header)
row = tuple(o.get(f, self.missing) for f in self._header)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commited in the branch that has superseded this one: 662b785

@arturponinski
Copy link
Collaborator Author

@juarezr this took a while. I had asked for some more reviews and discussion about this solution and ended up not entirely happy with it. Dumping the whole generator to the file before yielding any data was considered as a faulty solution. Done some more trials and I've finally created a solution that I am happy with:
#626

@juarezr
Copy link
Member

juarezr commented Aug 21, 2022

@arturponinski,

Can we close this as #626 superseded it and it was merged?

@arturponinski
Copy link
Collaborator Author

Definitely to be closed, this one is obsolete now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug It must work in all situations, but this failed Performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Generator support in fromdicts requires large amount of memory
4 participants