Conversation

@tdammers

This allows convenient binary communication with subprocesses.

Some concerns that may need addressing:

  • I introduced these functions in a rather dirty copy-and-paste style; they should work fine, but there is some code duplication now. I don't feel confident refactoring it myself, because I'm not familiar with the decisions that went into these particular functions.
  • I introduced a dependency on the bytestring package; this isn't usually something I worry about, but considering how process has hardly any dependencies, adding one feels like a somewhat bigger deal than usual.
  • The bytestring variants might deserve a module of their own.
  • I haven't added any tests; the ones in /tests seem to be dead, or at least cabal test doesn't run them, but copying the process006 test over and adapting it to the bytestring flavors should work.

@snoyberg
Collaborator

@hvr I haven't reviewed the code yet, but I wanted to touch base regarding the bytestring dependency. Do you know of any reason we should avoid it?

@tdammers
Author

I can't think of any serious reason, really. Maybe there are people with use cases where avoiding unnecessary dependencies is absolutely vital (embedded development, maybe? I haven't done any of that, so I have no idea what the situation is like there), but my expectation is that bytestring is such a basic and ubiquitous library that depending on it should be acceptable even then.

TL;DR: If I were in charge, I wouldn't worry, but maybe others have pressing concerns here.

@tdammers
Author

Also, the build fails for GHC 7.4 and earlier (https://travis-ci.org/haskell/process/jobs/72742820); as far as I can tell, this is because there is no DeepSeq.NFData instance for strict ByteStrings yet. I'm not sure how to tackle this. Since we're dealing with strict ByteStrings, deepseq-ing things isn't actually required, so we could just strip that part out?

@snoyberg
Collaborator

Sounds correct on the deepseq parts. The only reason I'm concerned at all about bytestring is that I'm not familiar with the GHC build process, and don't want to cause problems there.

Strict ByteStrings have no suitable NFData instances for deepseq'ing on older
GHCs (<= 7.4), but we don't actually need deepseq here, because
BS.hGetContents is already strict. Instead, we do our reading and
writing directly in concurrent threads and send the results back to the
main thread through MVars.
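
The pattern described in the commit message above can be sketched roughly like this (readHandleAsync is an illustrative name of mine, not a function from the patch):

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)
import qualified Data.ByteString as BS
import System.IO (Handle, IOMode (ReadMode), openFile)

-- Drain a handle in a worker thread. BS.hGetContents is strict, so the
-- value handed back through the MVar is already fully evaluated; no
-- deepseq is needed. The caller later collects the result with takeMVar.
readHandleAsync :: Handle -> IO (MVar BS.ByteString)
readHandleAsync h = do
    var <- newEmptyMVar
    _ <- forkIO (BS.hGetContents h >>= putMVar var)
    return var
```

The main thread would call this once for the child's stdout handle and once for stderr, then takeMVar both results after waitForProcess.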
@hvr
Member

hvr commented Jul 27, 2015

I can't think of any reason not to depend on bytestring in the context of GHC.
(btw, with process being so low in the dependency tree, it's even more important to specify exact version bounds, so that the cabal solver can avoid faulty install-plans -- process is an upgradable package even if one version is bundled w/ GHC by default)

I'd avoid however depending on deepseq if it's merely for strict ByteStrings.

However, I would have expected lazy ByteStrings being used, especially as the stdout/stderr output of processes is of unknown size and potentially unbounded... but this is just a quick observation...

@tdammers
Author

This last commit removes the deepseq-ing for ByteStrings, instead putting the hGetContents call (which is strict for strict ByteStrings) into the worker threads; it should build fine on older GHCs and also play nicer with threading.

@tdammers
Author

@hvr: true, lazy bytestrings would be expected, but I'm not sure how this would play along with the "read outputs strictly" part.

Re the deepseq dependency: that one was already in place in order to strictly read Strings from the subprocess; the ByteString-specific code in my last commit no longer depends on deepseq, though.

@snoyberg
Collaborator

The data is going to be read out strictly regardless. The advantage of a lazy ByteString in this case would be theoretically better memory layout, since it can be chunked instead of a single large allocation. However:

  1. We shouldn't be using these functions for large outputs anyway, as that can lead to memory exhaustion
  2. From my experience with http-conduit, people get very confused about lazy-but-not-really-lazy ByteStrings

OK, I'll move ahead with reviewing this code, sounds like we don't have any blockers on merge besides the code itself ;)

@hvr
Member

hvr commented Jul 27, 2015

fair enough, in the interest of some more bikeshedding, I'm slightly +1 for

  • The bytestring variants might deserve a module of their own.

=)

@tdammers
Author

The data is going to be read out strictly regardless.

Indeed - the intended behavior is "block until the child process terminates, then return stdout and stderr", which is kind of impossible with lazy reads (can't read lazily when the child process has already terminated).

@hvr: Should I move the bytestring variants out then?

@snoyberg
Collaborator

I'm +1 on the separate module approach too, it fits nicely with other core libraries (like network). It also makes it easy to make a shim compatibility package.

@tdammers
Author

Alright, separated out the ByteString functions into System.Process.ByteString. A bunch of support functions had to move to System.Process.Internal in the process; most of them weren't exported, but those that were are now re-exported from System.Process.

@tdammers
Author

OK, turns out I'm rather dumb:

http://hackage.haskell.org/package/process-extras-0.3.3.5/docs/System-Process-ByteString.html

I guess that means my pull request is moot then...

@bgamari
Contributor

bgamari commented Jul 28, 2015

@tdammers I believe you really want lazy bytestrings here. Currently your approach uses the strict hGetContents, which reads chunks and then concatenates them into a single buffer, resulting in a needless copy.

For this reason, you really want to use Data.ByteString.Lazy.hGetContents and force the result. The easiest way to accomplish this is with deepseq. That said, you should be able to get the same effect with foldrChunks.
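
The foldrChunks alternative could look something like this sketch (forceChunks is my name for the helper, not anything proposed in the thread):

```haskell
import qualified Data.ByteString.Lazy as BL

-- Walk the lazy ByteString's chunk spine. Each chunk is itself a strict
-- ByteString, so forcing the spine pulls all the data into memory
-- without concatenating it into one buffer and without needing an
-- NFData instance (i.e. without a deepseq dependency).
forceChunks :: BL.ByteString -> BL.ByteString
forceChunks lbs = BL.foldrChunks (\_chunk rest -> rest) () lbs `seq` lbs
```

The `seq` forces the fold's result, which in turn forces every Chunk constructor in the spine; the chunks stay separate, avoiding the copy that strict hGetContents makes.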

@hvr
Member

hvr commented Jul 28, 2015

@tdammers it may still be desirable to move System.Process.ByteString from process-extras into process though...

@hvr
Member

hvr commented Jul 28, 2015

@snoyberg re

We shouldn't be using these functions for large outputs anyway, as that can lead to memory exhaustion

maybe those strictly-consuming-output functions should take a max-size parameter as an upper bound; this would encourage a more defensive programming style...
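
One hypothetical shape for such an upper bound (hGetAtMost is an illustrative name of mine, not a proposed API): read one byte past the limit so truncation is detectable, and fail loudly instead of exhausting memory.

```haskell
import qualified Data.ByteString as BS
import System.IO (Handle, IOMode (ReadMode), openFile)

-- Read at most 'limit' bytes from a handle. Asking for limit + 1 bytes
-- lets us distinguish "the output fit within the limit" from "the
-- output was truncated", so callers can fail defensively.
hGetAtMost :: Int -> Handle -> IO (Either String BS.ByteString)
hGetAtMost limit h = do
    bs <- BS.hGet h (limit + 1)
    return $ if BS.length bs > limit
                 then Left "output exceeded limit"
                 else Right bs
```

Note that BS.hGet blocks until it has the requested number of bytes or hits EOF, which matches the "consume all output" use case being discussed.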

@snoyberg
Collaborator

Due to prior art, I'm very much tempted to take the two ByteString modules from process-extras exactly as they are to make for as easy a transition as possible. When it comes to consuming possibly-large data, I think dropping down to a Handle is what we need to encourage, and do so via the documentation, not necessarily any added functionality. If we're going to consider improving the API in that direction, I'd think of using streaming-commons's approach, but that should be a separate issue.

@hvr
Member

hvr commented Jul 28, 2015

For the sake of compat/transition, I would have used process-extras's type signatures, added a big warning, and added some ...WithLimit versions.

For me personally, those ByteString variants become much less attractive (other than for quick-and-dirty runghc scripts or GHCi experiments where I don't care much if they terminate with OOMs) if they have neither laziness nor an explicit length limit. Since I don't want to have to mess with Handles, I guess I'll have to resort to streaming APIs based on conduit/machines/pipes/io-streams instead... :-)

@snoyberg
Collaborator

To be clear: a strict ByteString API (whether based on actual strict ByteStrings or lazy ByteStrings that are fully evaluated) is semantically identical to the current functions in process, except for the obvious differences with a ByteString (space usage and character vs octet). readProcess is already strict, and will already run out of memory.

If we wanted to double down on lazy I/O (which I don't want), a function signature like the following could be added:

readProcessLazily
  :: CreateProcess
  -> L.ByteString -- ^ stdin
  -> (L.ByteString -> L.ByteString -> IO a) -- ^ inner function, takes lazy stdout and stderr
  -> IO (a, ExitCode)

And then the inner function will run concurrently with the process, instead of the current API which requires that the process exit before getting access to the data at all.
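
A sketch of how that signature might be implemented, purely for illustration (no such function exists in process; note also that this naive version can deadlock if one pipe's buffer fills while the inner function only demands the other stream):

```haskell
import qualified Data.ByteString.Lazy as L
import System.Exit (ExitCode (..))
import System.IO (hClose)
import System.Process

readProcessLazily
    :: CreateProcess
    -> L.ByteString                            -- stdin
    -> (L.ByteString -> L.ByteString -> IO a)  -- lazy stdout and stderr
    -> IO (a, ExitCode)
readProcessLazily cp input inner = do
    (Just hin, Just hout, Just herr, ph) <-
        createProcess cp { std_in  = CreatePipe
                         , std_out = CreatePipe
                         , std_err = CreatePipe }
    L.hPut hin input
    hClose hin
    out <- L.hGetContents hout  -- lazy reads: data arrives as the
    err <- L.hGetContents herr  -- inner function demands it
    a   <- inner out err        -- runs concurrently with the process
    ec  <- waitForProcess ph
    return (a, ec)
```

The inner function sees the streams while the process is still running, in contrast to the current API, which only hands over data after exit.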

@snoyberg
Collaborator

Is everyone OK with closing this based on the presence of process-extras?
