Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better async job status handling #2851

Merged
merged 1 commit into from Aug 3, 2017
Merged

Conversation

chefsalim
Copy link
Contributor

This change fixes a race condition in the builder-scheduler where a job status can get lost. Now, the incoming job status is persisted by the handler thread, and later picked up from the datastore to be processed by the scheduler. Switching over to this mode also yields a simplification where we can eliminate the need to keep a separate socket around for status notifications.

Signed-off-by: Salim Alam salam@chef.io

tenor-248153294

@thesentinels
Copy link
Contributor

Thanks for the pull request! Here is what will happen next:

  1. Your PR will be reviewed by the maintainers
  2. If everything looks good, one of them will approve it, and your PR will be merged.

Thank you for contributing!

LIMIT $1
);
END
$$ LANGUAGE plpgsql VOLATILE"#,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these functions can be implemented in SQL rather than PL/pgSQL. Functions that are implemented in SQL are "transparent" to the planner, allowing it to optimize execution in ways that it can't when given the "black box" of a PL/pgSQL function.

Granted, for these simple functions, that's likely not going to make a huge impact, but always giving the system more information is probably a good habit to get into. As a nice bonus, you'll also get to drop the BEGIN and END keywords, so you get shorter functions 👅

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah these are super simple functions that can go either way - we should look at a high level at our PL/pgSQL vs. SQL usage. PL/pgSQL seems to offer much more flexibility, including the ability to trap errors, better performance with plan caching when things don't get inlined, etc. It would be nice to just pick one language and stick with it so we can copy/paste more easily and use the patterns for that language. I've leaned toward pgSQL since that's what the vast bulk of our code is atm.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like the to see as little PL/pgSQL as possible. Not such a big deal for these functions but removing as much complexity from the data layer even if it sacrifices some sugar is a +. The only exception is where there are blatant perf improvements to the PL/pgSQL route.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're wrong about the PL/pgSQL planner conversation. Each function gets each statement automatically made into a prepared one, and then the actual execution plan gets cached and re-used. It's the most efficient thing possible for the planner - to get the same, you would need to have every statement prepared in advance manually. There is no benefit to using raw SQL functions most of the time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also made a concerted decision to use functions as part of the contract between the application and the database. The alternative is putting this kind of logic in the application, which in large part is not better.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. My primary concern is to avoid the temptation of including extra logic in postgres procs and keeping them in rust but that may be a different topic entirely.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Postgres is actually able to inline code and fold function bodies into outer queries in some cases when the function is written in SQL, but can't do that when it's in PL/pgSQL.

Another good thing about writing in SQL is that it forces you to express your problem in terms of set operations, which is almost without exception going to be better than expressing it iteratively. For instance, I helped @raskchanky refactor a PL/pgSQL function that was making n+1 insert statements into an SQL statement that does the same, but in a single operation. True, nothing is stopping you from executing that same single query in a PL/pgSQL function, but that seems like just adding noise at that point.

Certainly, if there are situations where we really do need the capabilities of PL/pgSQL (like exception handling, for example), then by all means, we should use PL/pgSQL. You can do a lot more things in SQL in Postgres these days, though, so that's becoming increasingly unnecessary, particularly for the kinds of things we're doing on this project. That, coupled with the clarity that being forced to express your problem in set-oriented operations, makes me reach for SQL first.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ya - the important thing is to not change the contract between the database and the application. We're consciously making the database interface be functions with versions in order to make forward migration and new behavior easy to add. It's not the only way to do that, but it's the way we chose, and we should make sure we don't change it accidentally. :) All things being equal, I'm 2/10 on wether a given function is written in SQL or PL/pgSQL. I'm 8/10 that we should keep the current function/app boundaries as they are, unless we want to really reconsider the entire approach.

net_error_msg text,
build_started_at timestamptz,
build_finished_at timestamptz
) RETURNS SETOF job_status AS $$
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can our Rust database code deal with functions that return a single database type, or not? If not, that's fine; I was just thinking that declaring this as RETURNS job_status would be ideal, since it's only ever going to return one; there's no real possibility of it returning multiple rows.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think we can go either way, however the way we are doing things above is part of a pattern that is established to help keep things DRY. There's some explanatory comments at https://github.com/habitat-sh/habitat/blob/master/components/builder-jobsrv/src/data_store.rs#L124

JobState::Processing => "Processing",
JobState::Complete => "Complete",
JobState::Rejected => "Rejected",
JobState::Failed => "Failed",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we implement FromStr (so we can use parse) and Display for JobState instead, to handle the conversion to and from a String?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch - I shoulda done that.

DateTime::<UTC>::from_str(job.get_build_finished_at()).unwrap(),
),
false => None,
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These might be more concise as if statements, rather than match statements.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah - some copy pasta here.

LIMIT $1;
RETURN;
END
$$ LANGUAGE plpgsql VOLATILE"#)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name and argument type of this function suggests that you're getting the status for the job whose ID you're supplying; naming it something like get_oldest_active_job, pop_job_queue or similar would be more intention-revealing.

Also, is there any time we're calling this without an argument of 1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will tweak the function name to make it clearer. The current usage is always 1, however like a number of other functions like the pending jobs ones, this is designed to eventually be able to handle more than just one status at a time.

FROM job_status
ORDER BY created_at ASC
LIMIT $1
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could be implemented in terms of a JOIN with the output of get_job_status_v1, rather than duplicating that function's logic here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will simplify this by deleting based on id, as per your other comment.

}

// Clear the top job status from the queue
self.datastore.clear_job_status(1)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it could have the potential for subtle bugs; can we just remove, by ID, the entry for the job we just processed, instead of assuming that it's at the top of our queue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah agreed, I think that will be simpler as well - will update.

Ok(())
}

pub fn get_job_status(&self, count: i32) -> Result<Vec<Job>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment re: naming applies here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will tweak.

@adamhjk
Copy link
Contributor

adamhjk commented Aug 1, 2017 via email

@chefsalim chefsalim force-pushed the SA/scheduler-sync-status branch 4 times, most recently from 4859382 to 2463c2e Compare August 2, 2017 00:28
Copy link
Contributor

@christophermaier christophermaier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look good overall, but thinking a bit more on it, I wonder if it'd be better to rethink how we're describing the functionality that's being added here. It seems like we're fundamentally talking about managing a queue in the database, but a lot of the naming is more narrow than that, and potentially confusing. We're not just dealing with the status of a job in a narrow sense; we're managing our queue of work. The more that our tables, functions, etc. reflect that intention, the clearer the code will be.

"Failed" => jobsrv::JobState::Failed,
_ => return Err(Error::UnknownJobState),
};
let job_state = jobsrv::JobState::from_str(&js).map_err(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit pick, but it seems like it's more idiomatic to use parse rather than calling from_str directly.

_ => (),
}

// Clear the top job status from the queue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment isn't quite in line with the implementation anymore.

@chefsalim chefsalim force-pushed the SA/scheduler-sync-status branch 3 times, most recently from 50da7a2 to 02a5c63 Compare August 2, 2017 23:56
Signed-off-by: Salim Alam <salam@chef.io>
@adamhjk
Copy link
Contributor

adamhjk commented Aug 3, 2017

Looks good to me. In particular, this is a step toward the right thing regardless - it's either going to be fine for the long term, or need another refactor (one that deals with job volume/persistence in the database) - but assuming @christophermaier feels good, I say merge this.

Nice work all around, btw - @chefsalim for the code, and the team for the review.

@christophermaier
Copy link
Contributor

I agree with @adamhjk... I really like how this ended up; talking about the queue as a queue really makes the code flow very nicely. Nice job @chefsalim.

@thesentinels approve

@thesentinels
Copy link
Contributor

🤘 I am testing your branch against master before merging it. We do this to ensure that the master branch is never failing tests.

@thesentinels
Copy link
Contributor

:neckbeard: Travis CI has started testing this PR.

@thesentinels
Copy link
Contributor

💖 Travis CI reports this PR passed.

It always makes me feel nice when humans approve of one anothers work. I'm merging this PR now.

I just want you and the contributor to answer me one question:

gif-keyboard-3280869874741411265

@thesentinels thesentinels merged commit b779fcc into master Aug 3, 2017
@thesentinels thesentinels deleted the SA/scheduler-sync-status branch August 3, 2017 02:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants