pipelines stalling #270
I suspect what's going on here is that there's something about your access pattern that we haven't implemented correctly. Really sorry you hit this; we'll get on it ASAP. Would it be possible for you to get us the pipeline that triggered it and the container images?
The container was job-shim (f7b5c3e1894f) with the custom executable added in.
Could you do
I'll do you one better: I have the entire container on Docker Hub as jonathanfraser/refmt.
:D you da man
Err, hopefully the last thing I need: any way I could get the dataset?
We've sent a message out of band about access to our data. For the time being I was able to mock it up with some quick scripts and such. I've put everything up here.
@JonathanFraser awesome, thanks so much for distilling this down to a test case for us :) we'll get this solved for you ASAP.
On another note, I'm watching my free memory slowly tick down, so I suspect there's a memory leak in there somewhere.
OK, it looks like there are a few issues here that are falling over each other. Looking at the pachd logs, I'm getting a lot of bufio.Scanner: token too long errors when trying to putFile into the output pipeline. This makes sense, as it looks like you guys are using a scanner to read the bytes line by line from the protostream. I presume this is so blocks contain only full lines of text. This of course fails if a single line exceeds 64 KB. The second issue, which I have yet to track down, is why this isn't triggering an error in the job. That error is getting swallowed somewhere and being exposed as just a pipeline that stalls out.
@JonathanFraser interesting, your diagnosis is right: we do error when we hit long lines. I also got your sample code and started using it to push data in. I'm working on profiling and optimizing that as well, but I'm going to switch gears now to try to address this bug.
Is there any reason that scanner can't be changed to a bufio.Reader? Its ReadLine function would probably do what you want: just keep reading until you've reached your block_size and isPrefix is false. It would also save the step of having to append a newline explicitly.
No reason in particular. Scanner seemed a bit better suited to our needs, and the docs for ReadLine actually say it's too low-level for most callers and list Scanner as one of the alternatives, so I figured we should heed that. I've added a test to our test suite which creates a job that outputs a line that's too long. I'm seeing the job get stalled, stuck in the running state.
Yeah, it sits there in a running state without erroring out in any way.
Partially fixes #270. The job is still erroring.
Progress! gRPC behaves differently than I thought with respect to streaming calls. The second part of this issue is to make the job fail correctly.
In addition to fixing that one call, we should add some test coverage to make sure we have this fixed more generally.
Whoops, didn't mean to close this.
A few updates on this: we have the start of a fix in PR #275.
Updates: #275 has fixes for the initial problem in this thread and a few other FUSE issues that arose. We're working on getting CI passing on it, then we'll merge it in! Thanks for bearing with us on this one; it led to us discovering that we were using FUSE incorrectly in many ways.
Alright, we're finally ready to close this one. The fix has passed our code review and been merged in. As always, please feel free to reopen if you see the problem again.
I have a pachyderm repo with a few commits in it (each a separate branch). Each commit is about 100 MB in size. The processing step begins and starts outputting data into the next pipeline, but stops working after about 1 MB of data output. If I force-finish the commit and inspect the output, I see that only some of the expected output was generated, and some of the files list the Unix epoch as their creation date some of the time. Moreover, the pipeline takes an exceptionally long time to run compared to running it outside of pachyderm.