pipelines stalling #270

Closed

JonathanFraser opened this issue Apr 6, 2016 · 20 comments
@JonathanFraser
Contributor

I have a pachyderm repo with a few commits in it (each a separate branch). Each commit is about 100MB in size. The processing step begins and starts outputting data into the next pipeline, but stops working after about 1MB of output. If I force-finish the commit and inspect the output, I see that only some of the expected output was generated, and some of the files sometimes list the Unix epoch as their creation date. Moreover, the pipeline takes exceptionally long to run compared to running it outside of pachyderm.

@jdoliner
Member

jdoliner commented Apr 6, 2016

I suspect what's going on here is that there's something about your access pattern that we haven't implemented correctly. Really sorry you hit this; we'll get on it ASAP. Would it be possible for you to get us the pipeline that triggered it and the container images?

@JonathanFraser
Contributor Author

The container was job-shim (f7b5c3e1894f) with the custom executable added in.
here is the pipeline:

{ "pipeline": { "name": "tokamak_fmt" }, "transform": { "image":"refmt", "cmd": [ "sh" ], "stdin": [ "./refmt -S=/pfs/tokamak -D=/pfs/out > /pfs/out/log_file 2> /pfs/out/err_file", "cp /pfs/tokamak/shotnumber /pfs/out/shotnumber" ] }, "shards": "1", "inputs": [ { "repo": { "name": "tokamak" } } ] }

@jdoliner
Member

jdoliner commented Apr 6, 2016

Could you do docker export f7b5c3e1894f -o image.tar.gz and send the file it spits out?
Thanks :)

@JonathanFraser
Contributor Author

I'll do you one better: the entire container is on Docker Hub at jonathanfraser/refmt.

@jdoliner
Member

jdoliner commented Apr 6, 2016

:D you da man

@jdoliner
Member

jdoliner commented Apr 7, 2016

Err, hopefully the last thing I need: any way I could get the dataset?

@JonathanFraser
Contributor Author

We've sent a message out of band about access to our data. For the time being I was able to mock it up with some quick scripts. I've put everything up here.
You should be able to go get it and go install the gencsv command.
mk-source-repo creates a repo and fills it with dummy data, after which you can attach the pipeline. You can either build the container yourself or pull the one on Docker Hub.

@jdoliner
Member

jdoliner commented Apr 7, 2016

@JonathanFraser awesome, thanks so much for distilling this down to a test case for us :) we'll get this solved for you asap

@JonathanFraser
Contributor Author

On another note, I'm watching my free memory slowly tick down, so I suspect there is a memory leak in there somewhere.

@JonathanFraser
Contributor Author

Ok, it looks like there are a few issues here that are falling over each other. Looking at the pachd logs, I'm getting a lot of bufio.Scanner: token too long errors when trying to putFile into the output pipeline. This makes sense, as it looks like you are using a scanner to read bytes line by line from the protostream; I presume this is so blocks contain only full lines of text. This of course fails if a single line exceeds 64kB. The second issue, which I have yet to track down, is why this isn't triggering an error in the job. That error is getting swallowed somewhere and exposed as just a pipeline that stalls out.
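For context, a minimal sketch (not pachyderm's actual code) of the failure mode: bufio.Scanner gives up once a single token exceeds its default 64 KiB limit.

package main

import (
	"bufio"
	"fmt"
	"strings"
)

func main() {
	// A single 128 KiB line; no newline until the very end.
	long := strings.Repeat("x", 128*1024) + "\n"
	scanner := bufio.NewScanner(strings.NewReader(long))
	for scanner.Scan() {
		// Never reached: the token exceeds bufio.MaxScanTokenSize (64 KiB).
	}
	fmt.Println(scanner.Err()) // bufio.Scanner: token too long
}

(Go 1.6 added Scanner.Buffer, which can raise that limit, at the cost of buffering arbitrarily long lines in memory.)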

@jdoliner
Member

jdoliner commented Apr 8, 2016

@JonathanFraser interesting, your diagnosis is right: we do error when we hit long lines.
The fact that it isn't getting returned by the Job is definitely a problem. I'll look into that and see what I can find.

I also got your sample code and started using it to push data in. I'm working on profiling and optimizing that as well but I'm going to switch gears now to try to address this bug.

@JonathanFraser
Contributor Author

Is there any reason the scanner can't be changed to a bufio.Reader? The ReadLine function would probably do what you want: just keep reading until you've reached your block_size and isPrefix is false. It would also save the step of having to append a newline explicitly.
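Something like this, say (a minimal sketch; readFullLine is a name I'm making up, and the block_size bookkeeping is omitted):

package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

// readFullLine drains one logical line from r, however long, by looping
// while ReadLine reports isPrefix == true.
func readFullLine(r *bufio.Reader) ([]byte, error) {
	var buf bytes.Buffer
	for {
		chunk, isPrefix, err := r.ReadLine()
		if err != nil {
			return nil, err
		}
		buf.Write(chunk)
		if !isPrefix {
			return buf.Bytes(), nil
		}
	}
}

func main() {
	long := strings.Repeat("x", 128*1024) + "\n" // over Scanner's 64 KiB default
	line, err := readFullLine(bufio.NewReader(strings.NewReader(long)))
	fmt.Println(len(line), err) // 131072 <nil>; no "token too long"
}

Unlike the Scanner version, this never errors on length; it just keeps accumulating until the real end of the line.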

@jdoliner
Member

jdoliner commented Apr 8, 2016

No reason in particular. Scanner seemed a bit better suited to our needs, and the docs for ReadLine actually say it's too low-level for most callers and list Scanner as one of the alternatives, so I figured we should heed that.

I've added a test to our test suite which creates a job that outputs a line that's too long. I'm seeing the job get stalled, stuck in the RUNNING state. Does this match up with what you're seeing? pachctl list-job should tell you.
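For anyone who wants to reproduce this outside the test suite, a pipeline along these lines should trigger it (a mock in the same format as the spec above, not the literal test code; image and repo names are placeholders):

{
  "pipeline": { "name": "long_line_test" },
  "transform": {
    "image": "ubuntu",
    "cmd": [ "sh" ],
    "stdin": [ "head -c 100000 /dev/zero | tr '\\0' 'x' > /pfs/out/long_line" ]
  },
  "shards": "1",
  "inputs": [ { "repo": { "name": "data" } } ]
}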

@JonathanFraser
Contributor Author

Yeah, it sits there in a running state without erroring out in any way.
A few jobs get some of the output being produced, but I presume that's because it succeeds a few times before hitting a line that is too long.

jdoliner added a commit that referenced this issue Apr 9, 2016
Partially fixez #270. The job is still erroring.
@jdoliner
Member

jdoliner commented Apr 9, 2016

Progress!

gRPC behaves differently than I thought with respect to streaming calls.
It turns out that the receiver of a streaming RPC must always call Recv until it gets EOF; otherwise the sending end can block.

The second part of this issue is to make the job fail correctly.
It's currently reporting SUCCESS.
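In code, the receiving side needs a loop shaped like this (a sketch against a made-up stream interface, not pachyderm's actual client):

package stream

import "io"

// recvStream stands in for the Recv half of a generated gRPC
// server-streaming stub (real stubs return a message type, not []byte).
type recvStream interface {
	Recv() ([]byte, error)
}

// drain consumes the stream until io.EOF. Returning early, e.g. after the
// first chunk, can leave the sender blocked in Send; that blocked sender
// is the kind of stall seen in this issue.
func drain(s recvStream, handle func([]byte)) error {
	for {
		chunk, err := s.Recv()
		if err == io.EOF {
			return nil // stream fully consumed; the sender can finish
		}
		if err != nil {
			return err
		}
		handle(chunk)
	}
}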

@sjezewski sjezewski added this to the v1.0 milestone Apr 11, 2016
@sjezewski
Contributor

In addition to fixing that one call, we should add some test coverage to make sure we have this fixed more generally.
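For instance, a unit test against the drain sketch above could pin down the invariant directly (fakeStream is made up for the test):

package stream

import (
	"io"
	"testing"
)

// fakeStream yields its chunks and then io.EOF, counting Recv calls so
// the test can assert the consumer drained everything.
type fakeStream struct {
	chunks [][]byte
	calls  int
}

func (f *fakeStream) Recv() ([]byte, error) {
	f.calls++
	if len(f.chunks) == 0 {
		return nil, io.EOF
	}
	c := f.chunks[0]
	f.chunks = f.chunks[1:]
	return c, nil
}

// The invariant whose violation stalled pipelines: every chunk plus the
// trailing EOF must be read.
func TestDrainReadsToEOF(t *testing.T) {
	f := &fakeStream{chunks: [][]byte{[]byte("a"), []byte("b")}}
	var got int
	if err := drain(f, func(b []byte) { got += len(b) }); err != nil {
		t.Fatal(err)
	}
	if got != 2 || f.calls != 3 { // two chunks + the final EOF
		t.Fatalf("got %d bytes in %d Recv calls", got, f.calls)
	}
}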

@jdoliner jdoliner reopened this Apr 11, 2016
@jdoliner
Member

Whoops, didn't mean to close this.

@jdoliner
Member

A few updates on this: we have the start of a fix up in PR #275.

@jdoliner
Member

Updates: #275 has fixes for the initial problem in this thread and a few other FUSE issues that arose. We're working on getting CI passing on it, then we'll merge it in!

Thanks for bearing with us on this one, it led to us discovering that we were using FUSE incorrectly in many ways.

@jdoliner
Member

Alright, we're finally ready to close this one. The fix has passed our CR and been merged in.
Sorry this took so long; it wound up revealing several issues in our FUSE layer that had to be dealt with.

As always please feel free to reopen if you see the problem again.
