Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix fromdicts generator support lazy #626

Conversation

arturponinski
Copy link
Collaborator

@arturponinski arturponinski commented Aug 17, 2022

This PR has the objective of improving the support of generators in fromdicts.
The current implementation uses itertools.tee which according to docs and production deployments uses large amounts of memory, leading to out of memory kills of processes.
This PR aims to keep the improved support of generators by using a filecache, similar to sorting, to allow multiple iterations.

Closes #618

Changes
Moved _iterchunk from sorts to petl.util.base. Imported to sorts as _iterchunk for BC

Changes

  1. Refactored DictsGeneratorView in petl.io.json to use file cache in a lazy manner, dumping data during the first requested pass

Checklist

Use this checklist for assuring the quality of pull requests that include new code and or make changes to existing code.

  • Source Code rules apply:
    • Includes unit tests
    • New functions have docstrings with examples that can be run with doctest
    • New functions are included in API docs
    • Docstrings include notes for any changes to API or behaviour
    • All changes are documented in docs/changes.rst
  • Versioning and history tracking rules apply:
    • Using atomic commits when possible
    • Commits are reversible when possible
    • There is no incomplete changes in the pull request
    • There is no accidental garbage added in source code
  • Testing rules apply:
    • Tested locally using tox / pytest
    • Automated testing passes (see CI)
    • Unit test coverage has not decreased (see Coveralls)
  • State of these changes is:
    • Just a proof of concept
    • Work in progress / Further changes needed
    • Ready to review
    • Ready to merge

@coveralls
Copy link

coveralls commented Aug 17, 2022

Pull Request Test Coverage Report for Build 2880829446

  • 95 of 97 (97.94%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.04%) to 91.093%

Changes Missing Coverage Covered Lines Changed/Added Lines %
petl/io/json.py 46 48 95.83%
Totals Coverage Status
Change from base Build 2692922856: 0.04%
Covered Lines: 12713
Relevant Lines: 13956

💛 - Coveralls

# reverse order
second_row3 = next(it2)
first_row3 = next(it1)
assert second_row3 == first_row3

Check notice

Code scanning

Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.

Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.
assert second_row3 == first_row3
ieq(actual, actual)
assert actual.header() == ('n', 'foo', 'bar')
assert len(actual) == 6

Check notice

Code scanning

Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.

Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.
@@ -175,9 +205,58 @@

class DictsGeneratorView(DictsView):

def __init__(self, dicts, header=None, sample=1000, missing=None):
super(DictsGeneratorView, self).__init__(dicts, header, sample, missing)

Check notice

Code scanning

Consider using Python 3 style super() without arguments (super-with-arguments)

Consider using Python 3 style super() without arguments (super-with-arguments)

def _determine_header(self):
it = iter(self.dicts)
header = list()

Check notice

Code scanning

Consider using [] instead of list() (use-list-literal)

Consider using [] instead of list() (use-list-literal)
if PY2:
self._filecache = NamedTemporaryFile(delete=False, mode='wb+', bufsize=0)
else:
self._filecache = NamedTemporaryFile(delete=False, mode='wb+', buffering=0)

Check notice

Code scanning

Consider using 'with' for resource-allocating operations (consider-using-with)

Consider using 'with' for resource-allocating operations (consider-using-with)

if not self._filecache:
if PY2:
self._filecache = NamedTemporaryFile(delete=False, mode='wb+', bufsize=0)

Check notice

Code scanning

Unexpected keyword argument 'bufsize' in function call (unexpected-keyword-arg)

Unexpected keyword argument 'bufsize' in function call (unexpected-keyword-arg)
self._filecache = NamedTemporaryFile(delete=False, mode='wb+', buffering=0)

position = 0
it = iter(self.dicts)

Check notice

Code scanning

Variable name "it" doesn't conform to snake_case naming style

Variable name "it" doesn't conform to snake_case naming style
yield row

def _determine_header(self):
it = iter(self.dicts)

Check notice

Code scanning

Variable name "it" doesn't conform to snake_case naming style

Variable name "it" doesn't conform to snake_case naming style
self.dicts = it
if isinstance(peek, dict):
peek = [peek]
for o in peek:

Check notice

Code scanning

Variable name "o" doesn't conform to snake_case naming style

Variable name "o" doesn't conform to snake_case naming style
def _determine_header(self):
it = iter(self.dicts)
header = list()
peek, it = iterpeek(it, self.sample)

Check notice

Code scanning

Variable name "it" doesn't conform to snake_case naming style

Variable name "it" doesn't conform to snake_case naming style
@@ -175,9 +205,58 @@

class DictsGeneratorView(DictsView):

def __init__(self, dicts, header=None, sample=1000, missing=None):
super(DictsGeneratorView, self).__init__(dicts, header, sample, missing)

Check notice

Code scanning

Consider using Python 3 style super() without arguments

Consider using Python 3 style super() without arguments
@@ -213,3 +231,29 @@
('c', 2))
ieq(expect, actual)
ieq(expect, actual)


def test_fromdicts_generator_random_access():

Check notice

Code scanning

Missing function or method docstring

Missing function or method docstring
Copy link

@github-advanced-security github-advanced-security bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 13 potential problems in the proposed changes. Check the Files changed tab for more details.

@arturponinski arturponinski added Bug It must work in all situations, but this failed Performance labels Aug 17, 2022
@arturponinski arturponinski added this to the v1.7.11 milestone Aug 17, 2022
Copy link

@github-advanced-security github-advanced-security bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 13 potential problems in the proposed changes. Check the Files changed tab for more details.

Copy link

@github-advanced-security github-advanced-security bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 16 potential problems in the proposed changes. Check the Files changed tab for more details.

@arturponinski arturponinski merged commit c000bb5 into petl-developers:master Aug 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug It must work in all situations, but this failed Performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Generator support in fromdicts requires large amount of memory
3 participants