Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add file arguments #26

Merged
merged 1 commit into from Mar 3, 2016
Merged

Add file arguments #26

merged 1 commit into from Mar 3, 2016

Conversation

ExpHP
Copy link
Contributor

@ExpHP ExpHP commented Mar 1, 2016

Some API questions are how to handle force= and the default
output name for patching; I decided to ignore force for
files. For the latter issue, I raise ValueError when nml_fname
is a file since there is no sane default.

(personally, if it was my library, I would rip out the default name
functionality entirely, and make it so that when there is no output
file or path, it will still build and return the patched Namelist
object without writing a patched namelist file. But that's
just me)

Addresses #25 (although there is no string API here, at least yet)

Some API questions are how to handle `force=` and the default
output name for patching;  I decided to ignore `force` for
files.  For the latter issue, I raise `ValueError` when `nml_fname`
is a file since there is no sane default.

(personally, if it was my library, I would rip out the default name
 functionality entirely, and make it so that when there is no output
 file or path, it will still build and return the patched Namelist
 *object* without writing a patched namelist *file*.  But that's
 just me)
@marshallward marshallward changed the title Add file arguments Add file arguments Mar 1, 2016
@marshallward
Copy link
Owner

A big component of the patching is the preservation of the comments inside of the namelist file, which is lost in the conversion to Namelist, so unfortunately there isn't a good way to decouple patching from reading and writing.

If I recall, we parse and modify the file line by line (mostly) and immediately push the output to the target patched file.

I'll give your patch a try today and get back to you, thanks for your help with this issue!

@ExpHP
Copy link
Contributor Author

ExpHP commented Mar 1, 2016

Yeah, I have come to understand that over the course of writing the patch. read taking an output file seemed really odd at first, but I can't think of any cleaner way to preserve comments.

My thought process behind eliminating the default name feature is the following: Suppose that somebody provides a patch without specifying where they want the patched file to be written. Then I can think of three categories:
(a) they are not interested in preserving format and comments
(b) they are interested, but they probably think (incorrectly) that they will be able to call write() on the patched Namelist.
(c) they are interested, they are aware that a default name will be used, and they know what the default name will be.

The most useful behavior for (c) would be the original behavior, but (c) requires both that the default name is documented, and that people actually read the documentation, so I find it unlikely.

The most useful behavior for (a) would be to return a patched Namelist without writing a file. (e.g. self.pfile = open(os.devnull, 'w'))
The most useful behavior for (b) would be to throw a "missing output argument" (to make them look back at the documentation!).
Thus, I feel that the most sensible options are either (a) or (b). (though it would be better to pick one and stick with it, as opposed to what I have done in this patch where it is a mixture of (c) and (b))

@marshallward
Copy link
Owner

I think the interpretation of patch is probably a separate issue. For now, it's probably simplest to preserve the current functionality.

I wasn't able to reproduce the issue raised in #25 which required nested exception blocks. Do you have a test case which produces this error? I was able to just move the open call outside the try block.

@marshallward
Copy link
Owner

My modification is here:

1f7cc6c

I haven't gone over the namelist.py changes yet so I may add some other suggestions.

@ExpHP
Copy link
Contributor Author

ExpHP commented Mar 2, 2016

Oops, I mentioned that because I thought it might be related to the problem you had here and I didn't see the edit.

But now that I look at it I see it causes no directly observable problems, because it is simply a resource leak. It can however be exposed with the following stress test:

def test_expose_resource_leak(self):
    empty_nml = f90nml.read('empty.nml')
    parsers = [f90nml.Parser() for i in range(8000)]
    for p in parsers:
        try: p.read('non-existent.nml', empty_nml, 'out.nml')
        except Exception: pass

    # throws "too many open files!"
    with open('out.nml', 'w') as f:
        pass

This is because pfile is never explicitly closed in the following statement:

parser.read('a_file_that_does_not_exist.nml', patch, 'out.nml')

The leak is easier to observe in a garbage-collected implementation like Pypy. In CPython, files are automatically closed when refcount drops to zero, so f90nml.patch('non-existent.nml', empty_nml, 'out.nml') would be a resource leak in Pypy but not in CPython (where the temporary Parser is immediately destructed).

@ExpHP
Copy link
Contributor Author

ExpHP commented Mar 2, 2016

Just to be clear, on the default name thing, is what I have currently okay for now, or are you looking for something closer to the original behavior in some way? The reason I currently throw ValueError is because I couldn't see any way to extend the old behavior to new cases (I cannot do name + '~' because many file-like objects have no filename, like StringIO).

@marshallward
Copy link
Owner

I see the problem now with the file descriptors, thanks for the explanation. Moving the namelist open call after the patch open call left the patch file unresolved, and is unrelated to the new exception blocks.

Moving the namelist open prior to the patch management would just shift the problem to the patch open. Maybe there's some way to organise all this better, but in any case I now understand the nested exception blocks.


As for changing the default patch name output, I still need to think about this.

Currently, the purpose of patch is to produce a modified version of the original file (or character stream).

From what I can tell, you are (very justifiably) interpreting patch as a batch update to a namelist file, or perhaps a one-line combination of f90nml.read and nml.update()? To me, this is already a very streamlined operation which can be done in 2 lines, and would not require a new function.

But parsing a file while preserving its whitespace and comments is a much more deliberate operation, and is the motivation for including patch. It's true that patch also returns a Namelist, but this is mainly because the two operations are essentially the same function, and perhaps the user will have some use for the generated Namelist. But generally I would not expect users to need or use it when calling patch.

If we do not select a target file, and only want the function output, then we are really just saying that we want to do a read and update. In which case, there's no need to call patch.

So for now, I'd prefer to either retain the ~ default filename, or require an output filepath as an argument. For now, at least.

Generalising everything to support file objects and character streams will require some of these ideas to be revisited, such as the appropriate output of patch and write. But first I'd like to just see a working version that supports file objects, even if it does end up creating a bit of extra work and clutter in the longer term.

As you mentioned before, there may be some confusion about the choice of file name, or some questions about documentation, but I think that needs to be a separate discussion.

@marshallward
Copy link
Owner

I just pushed a change to the file openings that seems to resolve some of the file descriptor problems without the nested exceptins (your script works at least).

I can work through the conflicts if necessary, no need to push extra work on to you.

I do think that the default filename issue needs to be discussed before merging though.

@ExpHP
Copy link
Contributor Author

ExpHP commented Mar 2, 2016

Okay, thanks for the explanation, I think I see better where you are coming from about patch.

In retrospect, I think my thought process was colored by the fact that most of my short time on this project has been spent on Parser.read (a method whose name is more evocative of "returning an object"), which I've come to conflate with f90nml.patch simply because I know that one is a wrapper around the other. But that is just an implementation detail!

On the other hand, somebody first visiting the project and seeing a method named patch which takes an input file would probably not be thinking about what the function returns.

@ExpHP
Copy link
Contributor Author

ExpHP commented Mar 2, 2016

I believe the new version would still have a resource leak in Pypy if opening the pfile fails. For instance:

self.patch('types.nml', patch, 'non-existent-directory/out.nml')

This one can't be exposed in CPython by any sort of stress test because in this case, the issue is nml_file, which is a local variable not saved anywhere and therefore will get destructed as soon as the function exits.

Basically, as I see it, the only way to do this kind of stuff 100% correctly is to always have try as the next thing immediately after an open (or at least, don't put any statements that can have exceptions in-between). It's very easy to make mistakes, which is why with open exists (unfortunately in our specific case it is tough to use with since we are conditionally opening files).

@marshallward
Copy link
Owner

I tested out your original example in Pypy (2 and 3, very recent versions) and wasn't able to reproduce the same issue, even after ramping up the number of parsers to 80k. unittest also didn't detect any unclosed files.

(Also, I had missed your original comment about CPython. My original revision also showed file resource problems in CPython, though maybe for different reasons.)

Do you have a similar case that shows the unopened files in Pypy? Otherwise I can try to monitor file descriptors during a run (lsof or looking inside /proc I guess? Never done this before)

BTW this is what was changed. It looks OK to me but I know it's easy to miss these things.

# Open file descriptors
nml_file = open(nml_fname, 'r')
if nml_patch_in and patch_fname:
    self.pfile = open(patch_fname, 'w')

try:
    return self.readstream(nml_file, nml_patch)
finally:
    # Close the unfinished files on any exceptions within readstream
    nml_file.close()
    if self.pfile:
        self.pfile.close()

@marshallward
Copy link
Owner

Nevermind, I am being silly. Yes, of course if opening pfile fails, then there's nothing to close the original nml_file, if only because we never reach the exception blocks.

I did just try several cpython and pypy tests and didn't detect anything, but I'm guessing that they are just clever enough to prevent resource consumption.

@marshallward
Copy link
Owner

Do you think moving the file opens inside the try block would work?

try:
    # Open file descriptors
    nml_file = open(nml_fname, 'r')
    if nml_patch_in and patch_fname:
        self.pfile = open(patch_fname, 'w')

    return self.readstream(nml_file, nml_patch)
finally:
    # Close the unfinished files on any exceptions within readstream
    nml_file.close()
    if self.pfile:
        self.pfile.close()

@marshallward
Copy link
Owner

Nope... then self.pfile is undefined on a failed open call...

@marshallward
Copy link
Owner

Wait no.. the __init__ already defined self.pfile.

Actually now nml_file is undefined on failed open... I think I need to slow down a bit here, sorry...

@ExpHP
Copy link
Contributor Author

ExpHP commented Mar 2, 2016

Haha, yeah, exceptions can be hard to reason about! This kind of stuff is why some modern languages like e.g. Rust and Go don't even have them. (note: don't quote me on that :P)

Oftentimes I like to solve problems like this by writing a helper class. Here is a prototype of what I mean:

class maybe_open:
    def __init__(self, file, mode='r'):
        self._is_path = not hasattr(file, 'read') # NOTE: typo fixed
        self._file = open(file,mode) if self._is_path else file

    @property
    def is_file(self): return not self._is_path
    @property
    def is_path(self): return self._is_path

    # ... delegate some read/write methods to self._file ...

    # support with statements
    def __enter__(self):
        pass
    def __exit__(self, type, value, traceback):
        if self._is_path:
            self._file.close()

The idea being that we could then simply write something like this:

with maybe_open(nml_fname) as nml_file:
    with maybe_open(patch_fname, 'w') as pfile:
        # ... use nml_file and pfile

and know that the files will be correctly closed no matter how we exit the with statement.

@marshallward
Copy link
Owner

Sorry for going on about this, I just don't want to inadvertently break anything in the future!

I'll just revert any changes that I made and merge the pull request in. But do you think you could revert the changes on namelist.py or restore the default filename behaviour some other way?

Also, if you have any idea how to introduce a test to check for unclosed file descriptors that would be useful, but its not too important. (Mostly just to ensure that it doesn't get re-broken in the future.)

@ExpHP
Copy link
Contributor Author

ExpHP commented Mar 3, 2016

But do you think you could revert the changes on namelist.py

Note those are how I implement write(file). Though certainly I can leave that feature off for now if you would like to just add patch and read.

restore the default filename behaviour some other way?

I'm not sure if perhaps there is some miscommunication here? The behavior in this patch should be 100% backwards compatible; I intended for it to behave identical to the way it did before in all situations that worked before. I.e. if nml_fname is "in.nml" it still defaults to "in.nml~".

The new ValueError only occurs in situations that were previously impossible (i.e. when nml_fname is a file), and the purpose of it is basically to say "please provide an output path!"

@marshallward
Copy link
Owner

I see that it works, sorry. I thought I had tested it out but must have overlooked the output. No worries.

Thanks agan for the work and the explanations, I will merge this in now.

marshallward added a commit that referenced this pull request Mar 3, 2016
@marshallward marshallward merged commit 8054312 into marshallward:master Mar 3, 2016
@ExpHP
Copy link
Contributor Author

ExpHP commented Mar 3, 2016

I'm getting the impression that the changes I made to namelist are confusing; If you would like me to discuss what I did to it and why, I can. (though long story short, it is all to avoid the second open and truncate, which was no longer possible with file-like objects).

edit: I will anyways, because I would hate to think that I contributed confusing code!

So, the purpose of the os.truncate in the original code appeared to be to remove the newline that is automatically printed after the last section. Thus, the rewritten function tries to avoid writing that newline.

At first, I tried a "minimalist" approach. I tried adding a do_newline argument to write_nmlgrp. When do_newline=False, it would not print the newline. Then, in write, for each iteration of grp_vars and g_vars, I would try to compute whether or not this is the last group, and set do_newline=False. In the end I had something like this

for grp_idx, (grp_name, grp_vars) in enumerate(self.items()):
    grp_is_last = (grp_idx == len(self) - 1)
    if isinstance(grp_vars, list):
        for g_idx, g_vars in enumerate(grp_vars):
            g_is_last = (g_idx == len(grp_vars) - 1)
            self.write_nmlgrp(grp_name, g_vars, nml_file, do_newline=not (g_is_last and grp_is_last))
        else:   
            self.write_nmlgrp(grp_name, grp_vars, nml_file, do_newline=not grp_is_last)

I bring this up in case you find it easier to read the above than the current code. To me however, I find this very difficult to read, so I got rid of the do_newline flag I added and tried to streamline the logic with lines 231-236. It flattens out self.items() so that we can iterate over each section individually. Basically, if this is self.items():

[('foo', x), ('bar', [y, z])]

Then this is groups:

[('foo', x), ('bar', y), ('bar', z)]

The purpose of the code from line 240-246 is to call write_nmlgrp on each object, with newlines inbetween. The first namelist is given special treatment to avoid printing a newline before it. In hindsight, I probably could've kept the do_newline flag, and then these 7 lines could have been shortened to just 3:

for index, (name, vals) in enumerate(groups):
    do_newline = (index != len(groups) - 1)
    self.write_nmlgrp(name, vals, nml_file, do_newline)

which is perhaps clearer.

My apologies if I am drowning you in details you don't care to read about!

@ExpHP ExpHP deleted the file-arguments branch March 3, 2016 02:01
@marshallward
Copy link
Owner

No worries, more documentation is always better!

I can see that purpose is to avoid the extra carriage return. (I can also now see that the truncate call would have probably not have worked on Windows, for whatever that's worth.)

It does seem to me that flattening and rebuilding the namelist is a fair bit of extra work just to avoid a redundant carriage return. Though it's a pretty tiny amount of work or memory either way, of course. (Nearly all the time is in the rather slow tokenizing of shlex, anyway!)

As you say, the real issue is the closing and re-opening of the target file, which is probably a bad idea if the target is an already-opened file object. Maybe it would be ok to do the lseek without closing and re-opening. Though even this was never a very palatable solution. I can never settle on a good solution to these "head-tail" patterns.

I might have a play with it to avoid building and using groups, but I can see how the old truncate method is less practical for a file object input, so it's fine as it is.

@marshallward
Copy link
Owner

Honestly, part of me is tempted to just go back to the old behaviour which did not have spaces between the groups.

@marshallward
Copy link
Owner

I have an implementation which uses a flag inside the class to track when a newline is required.

It's under the newline branch, diff is here: https://github.com/marshallward/f90nml/compare/newline

If it looks OK to you, then I'll swap to this version.

@ExpHP
Copy link
Contributor Author

ExpHP commented Mar 3, 2016

Seems fair to me!

@marshallward
Copy link
Owner

Cool, I have merged it in. Thanks again for submitting this and going through it with me, it's often no fun to troll through someone else's old code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants