Better async job status handling #2851

chefsalim · 2017-07-29T02:35:20Z

This change fixes a race condition in the builder-scheduler where a job status can get lost. Now, the incoming job status is persisted by the handler thread, and later picked up from the datastore to be processed by the scheduler. Switching over to this mode also yields a simplification where we can eliminate the need to keep a separate socket around for status notifications.

Signed-off-by: Salim Alam salam@chef.io

thesentinels · 2017-07-29T02:35:21Z

Thanks for the pull request! Here is what will happen next:

Your PR will be reviewed by the maintainers
If everything looks good, one of them will approve it, and your PR will be merged.

Thank you for contributing!

christophermaier · 2017-07-31T14:06:33Z

components/builder-scheduler/src/data_store.rs

+                                      LIMIT $1
+                                  );
+                              END
+                              $$ LANGUAGE plpgsql VOLATILE"#,


All these functions can be implemented in SQL rather than PL/pgSQL. Functions that are implemented in SQL are "transparent" to the planner, allowing it to optimize execution in ways that it can't when given the "black box" of a PL/pgSQL function.

Granted, for these simple functions, that's likely not going to make a huge impact, but always giving the system more information is probably a good habit to get into. As a nice bonus, you'll also get to drop the BEGIN and END keywords, so you get shorter functions 👅

Yeah these are super simple functions that can go either way - we should look at a high level at our PL/pgSQL vs. SQL usage. PL/pgSQL seems to offer much more flexibility, including the ability to trap errors, better performance with plan caching when things don't get inlined, etc. It would be nice to just pick one language and stick with it so we can copy/paste more easily and use the patterns for that language. I've leaned toward pgSQL since that's what the vast bulk of our code is atm.

I really like the to see as little PL/pgSQL as possible. Not such a big deal for these functions but removing as much complexity from the data layer even if it sacrifices some sugar is a +. The only exception is where there are blatant perf improvements to the PL/pgSQL route.

You're wrong about the PL/pgSQL planner conversation. Each function gets each statement automatically made into a prepared one, and then the actual execution plan gets cached and re-used. It's the most efficient thing possible for the planner - to get the same, you would need to have every statement prepared in advance manually. There is no benefit to using raw SQL functions most of the time.

We also made a concerted decision to use functions as part of the contract between the application and the database. The alternative is putting this kind of logic in the application, which in large part is not better.

Fair enough. My primary concern is to avoid the temptation of including extra logic in postgres procs and keeping them in rust but that may be a different topic entirely.

Postgres is actually able to inline code and fold function bodies into outer queries in some cases when the function is written in SQL, but can't do that when it's in PL/pgSQL.

Another good thing about writing in SQL is that it forces you to express your problem in terms of set operations, which is almost without exception going to be better than expressing it iteratively. For instance, I helped @raskchanky refactor a PL/pgSQL function that was making n+1 insert statements into an SQL statement that does the same, but in a single operation. True, nothing is stopping you from executing that same single query in a PL/pgSQL function, but that seems like just adding noise at that point.

Certainly, if there are situations where we really do need the capabilities of PL/pgSQL (like exception handling, for example), then by all means, we should use PL/pgSQL. You can do a lot more things in SQL in Postgres these days, though, so that's becoming increasingly unnecessary, particularly for the kinds of things we're doing on this project. That, coupled with the clarity that being forced to express your problem in set-oriented operations, makes me reach for SQL first.

Ya - the important thing is to not change the contract between the database and the application. We're consciously making the database interface be functions with versions in order to make forward migration and new behavior easy to add. It's not the only way to do that, but it's the way we chose, and we should make sure we don't change it accidentally. :) All things being equal, I'm 2/10 on wether a given function is written in SQL or PL/pgSQL. I'm 8/10 that we should keep the current function/app boundaries as they are, unless we want to really reconsider the entire approach.

christophermaier · 2017-07-31T14:09:45Z

components/builder-scheduler/src/data_store.rs

+                               net_error_msg text,
+                               build_started_at timestamptz,
+                               build_finished_at timestamptz
+                           ) RETURNS SETOF job_status AS $$


Can our Rust database code deal with functions that return a single database type, or not? If not, that's fine; I was just thinking that declaring this as RETURNS job_status would be ideal, since it's only ever going to return one; there's no real possibility of it returning multiple rows.

Yeah, I think we can go either way, however the way we are doing things above is part of a pattern that is established to help keep things DRY. There's some explanatory comments at https://github.com/habitat-sh/habitat/blob/master/components/builder-jobsrv/src/data_store.rs#L124

christophermaier · 2017-07-31T14:16:21Z

components/builder-scheduler/src/data_store.rs

+            JobState::Processing => "Processing",
+            JobState::Complete => "Complete",
+            JobState::Rejected => "Rejected",
+            JobState::Failed => "Failed",


Should we implement FromStr (so we can use parse) and Display for JobState instead, to handle the conversion to and from a String?

Good catch - I shoulda done that.

christophermaier · 2017-07-31T14:19:46Z

components/builder-scheduler/src/data_store.rs

+                DateTime::<UTC>::from_str(job.get_build_finished_at()).unwrap(),
+            ),
+            false => None,
+        };


These might be more concise as if statements, rather than match statements.

Yeah - some copy pasta here.

christophermaier · 2017-07-31T14:28:43Z

components/builder-scheduler/src/data_store.rs

+                                        LIMIT $1;
+                                   RETURN;
+                               END
+                               $$ LANGUAGE plpgsql VOLATILE"#)?;


The name and argument type of this function suggests that you're getting the status for the job whose ID you're supplying; naming it something like get_oldest_active_job, pop_job_queue or similar would be more intention-revealing.

Also, is there any time we're calling this without an argument of 1?

Will tweak the function name to make it clearer. The current usage is always 1, however like a number of other functions like the pending jobs ones, this is designed to eventually be able to handle more than just one status at a time.

christophermaier · 2017-07-31T14:31:46Z

components/builder-scheduler/src/data_store.rs

+                                      FROM job_status
+                                      ORDER BY created_at ASC
+                                      LIMIT $1
+                                  );


I think this could be implemented in terms of a JOIN with the output of get_job_status_v1, rather than duplicating that function's logic here.

Will simplify this by deleting based on id, as per your other comment.

christophermaier · 2017-07-31T14:33:30Z

components/builder-scheduler/src/server/scheduler.rs

+                    }
+
+                    // Clear the top job status from the queue
+                    self.datastore.clear_job_status(1)?;


This seems like it could have the potential for subtle bugs; can we just remove, by ID, the entry for the job we just processed, instead of assuming that it's at the top of our queue?

Yeah agreed, I think that will be simpler as well - will update.

christophermaier · 2017-07-31T14:34:52Z

components/builder-scheduler/src/data_store.rs

+        Ok(())
+    }
+
+    pub fn get_job_status(&self, count: i32) -> Result<Vec<Job>> {


Same comment re: naming applies here

Will tweak.

adamhjk · 2017-08-01T18:29:26Z

I care more about not abandoning the function boundaries we have now. If it's a SQL function rather than a pl/pgsql one, great. If we move more logic into the application, we have to start rethinking the design for the forward compatibility of the back end. Not that we can't do it, but that's why it is the way it is.

…

On Tue, Aug 1, 2017 at 10:55 AM Christopher Maier ***@***.***> wrote: ***@***.**** commented on this pull request. ------------------------------ In components/builder-scheduler/src/data_store.rs <#2851 (comment)>: > + END + $$ LANGUAGE plpgsql VOLATILE"#)?; + + migrator.migrate( + "scheduler", + r#"CREATE OR REPLACE FUNCTION delete_job_status_v1 (integer) RETURNS void AS $$ + BEGIN + DELETE FROM job_status + WHERE id IN ( + SELECT id + FROM job_status + ORDER BY created_at ASC + LIMIT $1 + ); + END + $$ LANGUAGE plpgsql VOLATILE"#, Postgres is actually able to inline code and fold function bodies into outer queries in some cases when the function is written in SQL, but can't do that when it's in PL/pgSQL. Another good thing about writing in SQL is that it forces you to express your problem in terms of set operations, which is almost without exception going to be better than expressing it iteratively. For instance, I helped @raskchanky <https://github.com/raskchanky> refactor a PL/pgSQL function that was making n+1 insert statements into an SQL statement that does the same, but in a single operation. True, nothing is stopping you from executing that same single query in a PL/pgSQL function, but that seems like just adding noise at that point. Certainly, if there are situations where we really do need the capabilities of PL/pgSQL (like exception handling, for example), then by all means, we should use PL/pgSQL. You can do a lot more things in SQL in Postgres these days, though, so that's becoming increasingly unnecessary, particularly for the kinds of things we're doing on this project. That, coupled with the clarity that being forced to express your problem in set-oriented operations, makes me reach for SQL first. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#2851 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAAQ0IKkuCJVRFtIz2OvpyiGUExBpf1vks5sT2abgaJpZM4OnOxv> .

christophermaier

Look good overall, but thinking a bit more on it, I wonder if it'd be better to rethink how we're describing the functionality that's being added here. It seems like we're fundamentally talking about managing a queue in the database, but a lot of the naming is more narrow than that, and potentially confusing. We're not just dealing with the status of a job in a narrow sense; we're managing our queue of work. The more that our tables, functions, etc. reflect that intention, the clearer the code will be.

christophermaier · 2017-08-02T14:15:07Z

components/builder-jobsrv/src/data_store.rs

-        "Failed" => jobsrv::JobState::Failed,
-        _ => return Err(Error::UnknownJobState),
-    };
+    let job_state = jobsrv::JobState::from_str(&js).map_err(


Minor nit pick, but it seems like it's more idiomatic to use parse rather than calling from_str directly.

christophermaier · 2017-08-02T14:22:35Z

components/builder-scheduler/src/server/scheduler.rs

+                        _ => (),
+                    }
+
+                    // Clear the top job status from the queue


This comment isn't quite in line with the implementation anymore.

Signed-off-by: Salim Alam <salam@chef.io>

adamhjk · 2017-08-03T01:14:05Z

Looks good to me. In particular, this is a step toward the right thing regardless - it's either going to be fine for the long term, or need another refactor (one that deals with job volume/persistence in the database) - but assuming @christophermaier feels good, I say merge this.

Nice work all around, btw - @chefsalim for the code, and the team for the review.

christophermaier · 2017-08-03T01:45:47Z

I agree with @adamhjk... I really like how this ended up; talking about the queue as a queue really makes the code flow very nicely. Nice job @chefsalim.

@thesentinels approve

thesentinels · 2017-08-03T01:45:48Z

🤘 I am testing your branch against master before merging it. We do this to ensure that the master branch is never failing tests.

thesentinels · 2017-08-03T01:46:07Z

Travis CI has started testing this PR.

thesentinels · 2017-08-03T02:06:08Z

💖 Travis CI reports this PR passed.

It always makes me feel nice when humans approve of one anothers work. I'm merging this PR now.

I just want you and the contributor to answer me one question:

chefsalim force-pushed the SA/scheduler-sync-status branch from ee90df9 to 1357ae1 Compare July 29, 2017 02:54

christophermaier reviewed Jul 31, 2017

View reviewed changes

chefsalim force-pushed the SA/scheduler-sync-status branch from 1357ae1 to 574ce5c Compare August 1, 2017 21:25

chefsalim requested a review from raskchanky as a code owner August 1, 2017 21:25

chefsalim force-pushed the SA/scheduler-sync-status branch 4 times, most recently from 4859382 to 2463c2e Compare August 2, 2017 00:28

christophermaier reviewed Aug 2, 2017

View reviewed changes

chefsalim force-pushed the SA/scheduler-sync-status branch 3 times, most recently from 50da7a2 to 02a5c63 Compare August 2, 2017 23:56

Better async job status handling

1194cc1

Signed-off-by: Salim Alam <salam@chef.io>

chefsalim force-pushed the SA/scheduler-sync-status branch from 02a5c63 to 1194cc1 Compare August 3, 2017 00:11

thesentinels merged commit b779fcc into master Aug 3, 2017

thesentinels deleted the SA/scheduler-sync-status branch August 3, 2017 02:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Better async job status handling #2851

Better async job status handling #2851

chefsalim commented Jul 29, 2017

thesentinels commented Jul 29, 2017

christophermaier Jul 31, 2017

chefsalim Aug 1, 2017

mwrock Aug 1, 2017

adamhjk Aug 1, 2017

adamhjk Aug 1, 2017

mwrock Aug 1, 2017

christophermaier Aug 1, 2017

adamhjk Aug 1, 2017

christophermaier Jul 31, 2017

chefsalim Aug 1, 2017

christophermaier Jul 31, 2017 •

edited

chefsalim Aug 1, 2017

christophermaier Jul 31, 2017

chefsalim Aug 1, 2017

christophermaier Jul 31, 2017

chefsalim Aug 1, 2017

christophermaier Jul 31, 2017

chefsalim Aug 1, 2017

christophermaier Jul 31, 2017

chefsalim Aug 1, 2017

christophermaier Jul 31, 2017

chefsalim Aug 1, 2017

adamhjk commented Aug 1, 2017 via email

christophermaier left a comment

christophermaier Aug 2, 2017

christophermaier Aug 2, 2017

adamhjk commented Aug 3, 2017

christophermaier commented Aug 3, 2017

thesentinels commented Aug 3, 2017

thesentinels commented Aug 3, 2017

thesentinels commented Aug 3, 2017

Better async job status handling #2851

Better async job status handling #2851

Conversation

chefsalim commented Jul 29, 2017

thesentinels commented Jul 29, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

christophermaier Jul 31, 2017 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

adamhjk commented Aug 1, 2017 via email

christophermaier left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

adamhjk commented Aug 3, 2017

christophermaier commented Aug 3, 2017

thesentinels commented Aug 3, 2017

thesentinels commented Aug 3, 2017

thesentinels commented Aug 3, 2017

christophermaier Jul 31, 2017 •

edited