Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes issue #88 Added a 'parents' functionality #169

Closed
wants to merge 9 commits into from

Conversation

melvin15may
Copy link

Added functionality to create directory if parents option is set to True.
Similar to mkdir -p functionality.

New PR fixing unrelated file change issues in closed PR #126

@@ -36,6 +36,8 @@
from ssl import SSLError
import sys

import errno
from ntpath import split
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you use ntpath instead of os.path?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@menshikh-iv
Copy link
Contributor

Thanks @melvin15may for PR! Can you describe, what's pros/cons of this feature?

@melvin15may
Copy link
Author

This PR adds a feature which basically helps segregate files automatically instead of creating directory structure beforehand and then using those. Especially useful when streaming to a new file.

@mpenkov
Copy link
Collaborator

mpenkov commented Jan 15, 2018

@melvin15may What are the advantages/disadvantages of this PR over the below alternative?

def my_open(path, *args, **kwargs):  # untested pseudo-code
    subdir = os.path.dirname(path)
    if not os.path.isdir(subdir):
        os.mkdir(subdir)
    return smart_open.smart_open(path, *args, **kwargs)

and then use my_open instead of smart_open.smart_open in your code.

@melvin15may
Copy link
Author

melvin15may commented Jan 20, 2018

Gives the user an option (--parents) if the user wants to create a directory, support for windows and linux (hence the use of ntpath) and tested code.

@mpenkov
Copy link
Collaborator

mpenkov commented Jan 22, 2018

@melvin15may In my opinion, this pull request and the my_open example above achieve the same thing. The advantages of the PR is that it prevents the user from having to write and test the code that creates a subdirectory if it doesn't already exist. However, the code that achieves this is trivial, so not having to write it is only a small advantage.

On the other hand, including this PR in the library code has several disadvantages that I can think of. In no particular order:

  • We have to add a new keyword parameter to the smart_open function and propagate it down to the lower-level functions. We also have to document this new behavior.
  • We have to implement meaningful behavior for code paths where parent is not a valid parameter (e.g. S3)
  • We have to update existing tests (looks like this is done) add new tests (I think the tests in the PR do not actually exercise the new code at the moment)
  • We have to maintain all of the above for the remainder of the library's lifetime

@menshikh-iv Given that the advantages of merging this PR are smaller than the disadvantages, my opinion is that we should handle directory creation outside the library.

@menshikh-iv
Copy link
Contributor

I agree with you @mpenkov, looks simple, but in depth can be slightly painful for us.
I think it's not worth it.

Wdyt @piskvorky?

@piskvorky
Copy link
Owner

piskvorky commented Jan 22, 2018

Up to you guys. I personally never needed such feature, it sounds dangerous to me to be creating deep directories transparently (I'd prefer to do it explicitly in upstream app code, with clear comments, at a place where it's needed). But I don't know all use cases, it may be useful to others.

@menshikh-iv
Copy link
Contributor

@melvin15may any additional reasons?

@melvin15may
Copy link
Author

I had a use case once for this. I wrote a function to do it then, submitted a PR looking at the issue #169. I didn't think that much about the complications it could cause (thanks to @mpenkov for the explanation).

@menshikh-iv
Copy link
Contributor

Ok, I close this PR.
Sorry @melvin15may, this is good work.

@gojomo
Copy link
Contributor

gojomo commented Jan 26, 2018

IMO taking on those 'disadvantages' are exactly what a utility-library does, to avoid duplication/less-optimal-reimplementations elsewhere, and thus amortize the cost of "doing it well" over all the projects that can then rely on the library.

This facility is useful in cmd-line mkdir, and it's a common-enough need that a variant is also available in Pyton's stdlib os.makedirs().

And for perhaps the most-common backend other than local volumes, S3 (& similar APIs), it's essentially a no-op – but making parents=True work locally will allow paths that already worked on S3 (with path-sep characters in them) also work locally, without the library-user having to write extra special-filesystem-specific non-library dir-making code. (And hiding such differences is again the kind of convenience such a library usually provides.)

@kenahoo
Copy link
Contributor

kenahoo commented Dec 23, 2019

I know this is an old thread, but I agree with @gojomo - in order to transparently write to both S3 and a local file, I'm going to work around this (poorly) by checking whether the URL starts with s3:, and if not, doing os.makedirs() on its basename.

It's a very kludgey solution that essentially undermines the cross-backend advantages of smart_open. The right place for code like this would really be inside smart_open IMO.

@rajeev
Copy link

rajeev commented Oct 16, 2020

Second, @gojomo and @kenahoo … we're facing this same problem with SFTP. It would've been nice if smart_open provided a common path for the supported platforms without requiring addl. code. In the case of SFTP this is a bit hard, as sftpclient.open simply fails with an exception. Then we have to construct the connection (which smart_open does beautifully, but hidden internally), and create all the dirs before calling it again.

@kenahoo
Copy link
Contributor

kenahoo commented Oct 19, 2020

@mpenkov wrote:

@melvin15may What are the advantages/disadvantages of this PR over the below alternative?

def my_open(path, *args, **kwargs):  # untested pseudo-code
    subdir = os.path.dirname(path)
    if not os.path.isdir(subdir):
        os.mkdir(subdir)
    return smart_open.smart_open(path, *args, **kwargs)

and then use my_open instead of smart_open.smart_open in your code.

IMO the large disadvantage is that it negates any advantage of using smart_open in the first place. We use smart_open to write transparently to lots of different destinations - e.g. local files, S3 paths, SFTP paths, etc., wherever the user configures a system to write. Obviously, checking os.path and using os.mkdir as in the above example can't work for remote paths, removing really the only reason to use smart_open for stuff like this.

@rajeev
Copy link

rajeev commented Oct 19, 2020

@kenahoo another (admittedly unclean) approach would be that, if the dir doesn't exist then throw an exception with the underlying client (sftpclient, etc) provided within the exception. Then, we have a way of getting the client, creating the dir and pushing the file. Again, I realise this is totally unclean, but at least we can do something with it.

Further, the S3 client doesn't complain at all. Isn't this a leaky abstraction in the sense that, one destination works, but another doesn't?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants