pipelines stalling #270

Closed

JonathanFraser opened this issue Apr 6, 2016 · 20 comments
@JonathanFraser
Contributor

I have a pachyderm repo with a few commits in it (each a separate branch). Each commit is about 100MB in size. The processing step begins and starts outputting data into the next pipeline, but stops working after about 1MB of output. If I force-finish the commit and inspect the output, I see that only some of the expected output was generated, and some of the files sometimes list the Unix epoch as their creation date. Moreover, the pipeline takes exceptionally long to run compared to running it outside of pachyderm.

@jdoliner
Member

jdoliner commented Apr 6, 2016

I suspect what's going on here is that there's something about your access pattern that we haven't implemented correctly. Really sorry you hit this; we'll get on it ASAP. Would it be possible for you to get us the pipeline that triggered it and the container images?

@JonathanFraser
Contributor Author

The container was job-shim (f7b5c3e1894f) with the custom executable added in.
here is the pipeline:

{ "pipeline": { "name": "tokamak_fmt" }, "transform": { "image":"refmt", "cmd": [ "sh" ], "stdin": [ "./refmt -S=/pfs/tokamak -D=/pfs/out > /pfs/out/log_file 2> /pfs/out/err_file", "cp /pfs/tokamak/shotnumber /pfs/out/shotnumber" ] }, "shards": "1", "inputs": [ { "repo": { "name": "tokamak" } } ] }

@jdoliner
Member

jdoliner commented Apr 6, 2016

Could you do docker export f7b5c3e1894f -o image.tar.gz and send the file it spits out?
Thanks :)

@JonathanFraser
Contributor Author

I'll do you one better: the entire container is on Docker Hub at jonathanfraser/refmt.

@jdoliner
Member

jdoliner commented Apr 6, 2016

:D you da man

@jdoliner
Member

jdoliner commented Apr 7, 2016

Err, hopefully the last thing I need: any way I could get the dataset?

@JonathanFraser
Contributor Author

We've sent a message out of band about access to our data. For the time being I was able to mock it up with some quick scripts. I've put everything up here.
You should be able to go get it and go install the gencsv command.
mk-source-repo creates a repo and fills it with dummy data, after which you can attach the pipeline. You can either build the container yourself or pull the one on Docker Hub.

@jdoliner
Member

jdoliner commented Apr 7, 2016

@JonathanFraser awesome, thanks so much for distilling this down to a test case for us :) we'll get this solved for you asap

@JonathanFraser
Contributor Author

On another note, I'm watching my free memory slowly tick down, so I suspect there is a memory leak in there somewhere.

@JonathanFraser
Contributor Author

Ok, it looks like there are a few issues here that are falling over each other. Looking at the pachd logs, I'm getting a lot of bufio.Scanner: token too long errors when trying to putFile into the output pipeline. This makes sense, as it looks like you are using a scanner to read bytes line by line from the protostream; I presume this is so blocks contain only full lines of text. This of course fails if a single line exceeds 64kB. The second issue, which I have yet to track down, is why this isn't triggering an error in the job. That error is getting swallowed somewhere and exposed as just a pipeline that stalls out.
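For context, a minimal sketch (not pachyderm's actual code) of the failure mode: bufio.Scanner gives up once a single token exceeds its default 64 KiB limit.

package main

import (
	"bufio"
	"fmt"
	"strings"
)

func main() {
	// A single 128 KiB line; no newline until the very end.
	long := strings.Repeat("x", 128*1024) + "\n"
	scanner := bufio.NewScanner(strings.NewReader(long))
	for scanner.Scan() {
		// Never reached: the token exceeds bufio.MaxScanTokenSize (64 KiB).
	}
	fmt.Println(scanner.Err()) // bufio.Scanner: token too long
}

(Go 1.6 added Scanner.Buffer, which can raise that limit, at the cost of buffering arbitrarily long lines in memory.)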

@jdoliner
Member

jdoliner commented Apr 8, 2016

@JonathanFraser interesting, your diagnosis is right: we do error when we hit long lines.
The fact that it isn't getting returned by the Job is definitely a problem. I'll look into that and see what I can find.

I also got your sample code and started using it to push data in. I'm working on profiling and optimizing that as well but I'm going to switch gears now to try to address this bug.

@JonathanFraser
Contributor Author

Is there any reason the scanner can't be changed to a bufio.Reader? The ReadLine function would probably do what you want: just keep reading until you've reached your block_size and isPrefix is false. It would also save the step of having to append a newline explicitly.
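Something like this, say (a minimal sketch; readFullLine is a name I'm making up, and the block_size bookkeeping is omitted):

package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

// readFullLine drains one logical line from r, however long, by looping
// while ReadLine reports isPrefix == true.
func readFullLine(r *bufio.Reader) ([]byte, error) {
	var buf bytes.Buffer
	for {
		chunk, isPrefix, err := r.ReadLine()
		if err != nil {
			return nil, err
		}
		buf.Write(chunk)
		if !isPrefix {
			return buf.Bytes(), nil
		}
	}
}

func main() {
	long := strings.Repeat("x", 128*1024) + "\n" // over Scanner's 64 KiB default
	line, err := readFullLine(bufio.NewReader(strings.NewReader(long)))
	fmt.Println(len(line), err) // 131072 <nil>; no "token too long"
}

Unlike the Scanner version, this never errors on length; it just keeps accumulating until the real end of the line.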

@jdoliner
Member

jdoliner commented Apr 8, 2016

No reason in particular. Scanner seemed a bit better suited to our needs, and the docs for ReadLine actually say it's too low-level for most callers and list Scanner as one of the alternatives, so I figured we should heed that.

I've added a test to our test suite which creates a job that outputs a line that's too long. I'm seeing the job get stalled, stuck in the RUNNING state. Does this match up with what you're seeing? pachctl list-job should tell you.
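For anyone who wants to reproduce this outside the test suite, a pipeline along these lines should trigger it (a mock in the same format as the spec above, not the literal test code; image and repo names are placeholders):

{
  "pipeline": { "name": "long_line_test" },
  "transform": {
    "image": "ubuntu",
    "cmd": [ "sh" ],
    "stdin": [ "head -c 100000 /dev/zero | tr '\\0' 'x' > /pfs/out/long_line" ]
  },
  "shards": "1",
  "inputs": [ { "repo": { "name": "data" } } ]
}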

@JonathanFraser
Contributor Author

Yeah, it sits there in a running state without erroring out in any way.
A few jobs get some of the output being produced, but I presume that's because it succeeds a few times before hitting a line that is too long.

jdoliner added a commit that referenced this issue Apr 9, 2016
Partially fixez #270. The job is still erroring.
@jdoliner
Member

jdoliner commented Apr 9, 2016

Progress!

gRPC behaves differently than I thought with respect to streaming calls.
It turns out that the receiver of a streaming RPC must always call Recv until it gets EOF; otherwise the sending end can block.

The second part of this issue is to make the job fail correctly.
It's currently reporting SUCCESS.
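In code, the receiving side needs a loop shaped like this (a sketch against a made-up stream interface, not pachyderm's actual client):

package stream

import "io"

// recvStream stands in for the Recv half of a generated gRPC
// server-streaming stub (real stubs return a message type, not []byte).
type recvStream interface {
	Recv() ([]byte, error)
}

// drain consumes the stream until io.EOF. Returning early, e.g. after the
// first chunk, can leave the sender blocked in Send; that blocked sender
// is the kind of stall seen in this issue.
func drain(s recvStream, handle func([]byte)) error {
	for {
		chunk, err := s.Recv()
		if err == io.EOF {
			return nil // stream fully consumed; the sender can finish
		}
		if err != nil {
			return err
		}
		handle(chunk)
	}
}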

@sjezewski sjezewski added this to the v1.0 milestone Apr 11, 2016
@sjezewski
Contributor

In addition to fixing that one call, we should add some test coverage to make sure we have this fixed more generally.
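For instance, a unit test against the drain sketch above could pin down the invariant directly (fakeStream is made up for the test):

package stream

import (
	"io"
	"testing"
)

// fakeStream yields its chunks and then io.EOF, counting Recv calls so
// the test can assert the consumer drained everything.
type fakeStream struct {
	chunks [][]byte
	calls  int
}

func (f *fakeStream) Recv() ([]byte, error) {
	f.calls++
	if len(f.chunks) == 0 {
		return nil, io.EOF
	}
	c := f.chunks[0]
	f.chunks = f.chunks[1:]
	return c, nil
}

// The invariant whose violation stalled pipelines: every chunk plus the
// trailing EOF must be read.
func TestDrainReadsToEOF(t *testing.T) {
	f := &fakeStream{chunks: [][]byte{[]byte("a"), []byte("b")}}
	var got int
	if err := drain(f, func(b []byte) { got += len(b) }); err != nil {
		t.Fatal(err)
	}
	if got != 2 || f.calls != 3 { // two chunks + the final EOF
		t.Fatalf("got %d bytes in %d Recv calls", got, f.calls)
	}
}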

@jdoliner jdoliner reopened this Apr 11, 2016
@jdoliner
Member

Whoops, didn't mean to close this.

@jdoliner
Member

A few updates on this: we have the start of a fix up in PR #275.

@jdoliner
Member

Updates: #275 has fixes for the initial problem in this thread and a few other FUSE issues that arose. We're working on getting CI passing on it, then we'll merge it in!

Thanks for bearing with us on this one, it led to us discovering that we were using FUSE incorrectly in many ways.

@jdoliner
Member

Alright, we're finally ready to close this one. The fix has passed our CR and been merged in.
Sorry this took so long; it wound up revealing several issues in our FUSE layer that had to be dealt with.

As always please feel free to reopen if you see the problem again.
