
Move to one-file-per-query for sqlx-cli prepare #570

Closed
mehcode opened this issue Jul 27, 2020 · 18 comments · Fixed by #2363

@mehcode
Member

mehcode commented Jul 27, 2020

This seems to have a lot of advantages and I can't remember why we didn't do this originally.

  • git merge won't get in the way
  • orphaned query files are already a problem, so this is a net zero
  • we already collect to multiple files during build
  • we could make cargo build transparently always do cargo sqlx prepare and leave the latter as a "make sure" / verify step
@abonander
Collaborator

git merge won't get in the way

It will for schema changes where the queries were unaffected. Then you'll have potentially dozens of files to resolve merge conflicts on rather than one.

In general, I think we agreed that a single file was better because it could easily be skimmed over in diffs and if a merge conflict did arise it would be a single "accept ours"/"accept theirs" resolution.

@jplatte
Contributor

jplatte commented Jul 27, 2020

Having worked with this for a bit, it's always annoying to have to go back when CI failed because of a stale cache. I would be very happy if the separate cargo sqlx prepare step was not needed.

In general, I think we agreed that a single file was better because it could easily be skimmed over in diffs and if a merge conflict did arise it would be a single "accept ours"/"accept theirs" resolution.

I think what I would do after resolving the rest of the merge conflicts (if any) is just delete .sqlx / sqlx-data.json and regenerate. With editor integration that auto-runs cargo check, and with the files always being written, maybe I wouldn't even have to do anything in the directory case? Either way, I don't see it becoming harder with lots of files (except when resolving conflicts in the GitHub web interface, but I generally don't do that much, and I imagine one would often fall back to local resolution for more complex scenarios anyway).

@abonander
Collaborator

If the files were contained in a standard directory then it would probably be possible to set default resolution rules with .gitattributes or the like.
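
Such a rule could look like the following sketch. This is an assumption, not anything sqlx ships: the `.sqlx/` path and the `ours` driver setup are illustrative only.

```
# .gitattributes — on conflict, keep our side of generated query files
.sqlx/*.json merge=ours
```

Git does not define the `ours` low-level merge driver by default; it can be enabled with `git config merge.ours.driver true`. Since the files are regenerated anyway, which side "wins" a conflict barely matters.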

I'm wary of the macros automatically creating such a directory outside of target/ though. Perhaps they shouldn't output any files if the directory doesn't exist, and then cargo sqlx prepare could create that directory.

As for the directory name, maybe .sqlx? The leading dot would hide it by default on *nix, I'm not sure if we want that or not.

@abonander
Collaborator

abonander commented Jul 27, 2020

The reason the macros don't directly write to one file is that they might potentially be run in parallel (not currently, I don't think, but possibly in the foreseeable future), so if they all tried to rewrite it they would probably clobber each other. However, if the sqlx-data format supported appends (CSV maybe?), I believe those are atomic, so multiple processes writing to the same file would be fine. Then cargo sqlx prepare could just go through and clean up the file afterwards.

Although the more that I think about it, this would end up just being an inferior implementation of writing everything to a directory anyway.

@jplatte
Contributor

jplatte commented Jul 29, 2020

Although the more that I think about it, this would end up just being an inferior implementation of writing everything to a directory anyway.

Sounds like changing to one file per query is realistic then? Would this be something that might end up in 0.4, or would you rather release 0.4 with sqlx-data.json and change to single file per query for 0.5, if at all?

@jplatte
Contributor

jplatte commented Jul 30, 2020

I'm wary of the macros automatically creating such a directory outside of target/ though. Perhaps they shouldn't output any files if the directory doesn't exist, and then cargo sqlx prepare could create that directory.

If that's all it does, that seems like a pointless abstraction though. In that case the documentation could just mention that users can create .sqlx if they want the query cache.

@abonander
Collaborator

abonander commented Aug 18, 2020

Actually, I don't think we have to do anything special to arrive at a file format that's good for appends. We could just atomically append JSON objects to a file, each object essentially being the same as the current sqlx-data.json format but with an individual query each, then cargo sqlx prepare can merge these objects together.

serde_json provides a way to parse multiple values from a stream already, the macros on a non-compacted file would just do a linear search to find the data for their query: https://docs.rs/serde_json/1.0.57/serde_json/de/struct.StreamDeserializer.html
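
A minimal sketch of that lookup, using only std instead of serde_json's StreamDeserializer: each invocation appends one JSON object per line, and a macro scans line by line for its own query. The record shape (a `"query"` field) is hypothetical, not sqlx's actual format.

```rust
// Linear search over an append-only, line-delimited JSON file.
// serde_json's StreamDeserializer would replace the naive string
// matching used here; this only illustrates the access pattern.
fn find_record<'a>(data: &'a str, query: &str) -> Option<&'a str> {
    // {:?} on a &str produces a JSON-compatible quoted string for
    // simple ASCII queries, which is all this sketch needs.
    let needle = format!("\"query\":{:?}", query);
    data.lines().find(|line| line.contains(&needle))
}

fn main() {
    let file = concat!(
        "{\"query\":\"SELECT 1\",\"describe\":{}}\n",
        "{\"query\":\"SELECT * FROM users\",\"describe\":{}}\n",
    );
    let hit = find_record(file, "SELECT * FROM users").unwrap();
    println!("{hit}");
}
```

A non-compacted file may contain duplicate records for the same query; taking the first (or last) match and letting `prepare` deduplicate later would both work under this scheme.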

Then a user could have a pre-commit hook that runs cargo sqlx prepare --compact to merge the entries without touching the database.

@mehcode any objections?

@abonander
Collaborator

Hmm, on further research it seems like we may not be able to guarantee atomicity of writes to files anyway. It's very OS and filesystem specific and not at all specified.

Multiple files it is then.

@abonander
Collaborator

As per the discussion on Discord, we still have the issue with concurrent writes with multiple files, if the filenames are the hashes of the queries (which is necessary for the macros looking up the query data).
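
For illustration, hash-derived filenames might look like this. std's DefaultHasher stands in for whatever hash sqlx actually uses, and the `query-{hash}.json` naming is hypothetical; the point is that identical queries map to the same file, which is exactly the concurrent-write hazard.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Derive a cache filename from the query text. DefaultHasher and the
// "query-{hash}.json" pattern are stand-ins, not sqlx's real scheme.
fn cache_filename(query: &str) -> String {
    let mut hasher = DefaultHasher::new();
    query.hash(&mut hasher);
    format!("query-{:016x}.json", hasher.finish())
}

fn main() {
    let a = cache_filename("SELECT * FROM users WHERE id = $1");
    let b = cache_filename("SELECT * FROM users WHERE id = $1");
    // Two macro invocations with the same query target the same file.
    assert_eq!(a, b);
    println!("{a}");
}
```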

We could make the filenames the spans of the macro invocations, which should be unique, but that has other issues:

  1. it duplicates data for identical queries from different macro invocations
  2. it relies on the Debug representation of Span, which isn't guaranteed to be a valid filename fragment (that could be resolved by hashing the debug repr, I guess); it also assumes spans are stable across compilation runs on different machines
  3. making changes unrelated to macro invocations in the same file will move spans around, changing the filename, creating even more duplicate files which would have to be cleaned up in a separate step
  4. if you don't recompile between your last change and the time of commit, the data will be stale anyway

Instead, I propose we encourage users to configure Git to run cargo sqlx prepare as a pre-commit hook. I don't like pre-commit hooks myself because sometimes I just want to commit and push broken code when I need to step away from the computer and they're usually configured to block commits if they fail, but this commit hook doesn't need to do that; it just means that CI will fail (which it should for broken code anyway).
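
A non-blocking hook along those lines might look like the following sketch (the hook path is standard Git; everything else here is illustrative, and failure deliberately never aborts the commit):

```
#!/bin/sh
# .git/hooks/pre-commit (hypothetical): refresh the offline query data,
# but never block the commit even if preparation fails.
cargo sqlx prepare ||
    echo "warning: cargo sqlx prepare failed; CI may catch stale query data"
exit 0
```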

@jplatte
Contributor

jplatte commented Sep 14, 2020

@abonander: What about first writing to a temporary file and then moving that to the actual location? I think I've seen that used to work around concurrent write issues before.

Personally I'm pretty sure I won't use a pre-commit hook for something that takes as long as cargo sqlx prepare because it would disturb my workflow – in particular interactive rebases where I squash or amend a few commits would become extremely annoying.

@jplatte
Contributor

jplatte commented Nov 18, 2020

There's one more advantage to this that hasn't been mentioned: It should improve performance quite a bit for projects with lots of queries, since every macro call only has to parse (possibly with debug-compiled parsing code) its own JSON metadata, not a JSON file with all of the queries for the entire application.

@jplatte
Contributor

jplatte commented Dec 11, 2020

I assume with one file per query, always rebuilding the entire crate also becomes unnecessary, right? I'm not entirely sure but I think it has a decent impact on both perf and cache disk usage.

@abonander @mehcode Would you be interested in a PR that implements this, using the aforementioned strategy of writing to a temporary file and then moving it to the actual location to avoid issues with concurrent writes to the same file?

@mehcode
Member Author

mehcode commented Dec 11, 2020

I'm personally for this, yes. I imagine a sqlx/ folder that is always populated during cargo build and that is either gitignored or not, depending on preference. E.g., totally remove the concept of an explicit online mode; it's just seamless.

@jplatte
Contributor

jplatte commented Dec 12, 2020

This is in scope for 0.4 too, right? I would be happy to also then rebase it on top of 0.5, but if possible would like to see another 0.4.x release that has this.

Edit: Actually, it would probably be too confusing to have the sqlx directory just appear for people who activated the offline feature. So for this to become available in a 0.4.x, the old behaviour would have to still be supported 🤔

@jplatte
Contributor

jplatte commented Dec 29, 2020

Since 0.5 is actually not far off, I'm now working on this and am already positive I'll have a PR ready later 🙂

@argv-minus-one
Contributor

@abonander

Hmm, on further research it seems like we may not be able to guarantee atomicity of writes to files anyway. It's very OS and filesystem specific and not at all specified.

That's true, but replacing a file is atomic. To do that:

  1. Use Builder::tempfile_in from the tempfile crate to create a random-named temporary file (which is atomic) in the same folder (in order to guarantee that the temporary file is on the same file system as the final file, which it must be for step 4 to work).
  2. Write into the file (which isn't atomic but doesn't need to be).
  3. flush the file to ensure that the file's contents actually make it to disk.
  4. persist the file with its final name (which is atomic as long as steps 2 and 3 happen before step 4).

This is guaranteed to result in one of the following:

  • The file is completely replaced.
  • The original file is left untouched. (This will happen if replacement fails.)
  • The original file is left untouched and a stale temporary file is left behind. (This will happen if replacement fails and NamedTempFile's destructor fails to delete the temporary file, usually because of abrupt process termination, OS crash, power failure, or I/O error. It is possible to avoid this on Linux using O_TMPFILE, but the author of tempfile wants consistent behavior across platforms, and I don't think any other platform has equivalent functionality.)

None of these results involve the original file being only partially overwritten. Hence, replacing the file is atomic.
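
The steps above can be sketched with std alone. The thread recommends the tempfile crate (Builder::tempfile_in plus persist); this std-only version uses a per-process temp name instead of tempfile's random names, so it is a simplification, not the suggested implementation.

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

// Write-then-rename so the target file is replaced atomically.
fn atomic_write(path: &Path, contents: &[u8]) -> std::io::Result<()> {
    // 1. Create the temp file in the target's directory so the final
    //    rename never crosses a filesystem boundary.
    let dir = path
        .parent()
        .filter(|p| !p.as_os_str().is_empty())
        .unwrap_or(Path::new("."));
    let tmp_path = dir.join(format!(".tmp-{}", std::process::id()));
    let mut tmp = fs::File::create(&tmp_path)?;
    // 2. Write the contents (not atomic, but nothing reads the temp file).
    tmp.write_all(contents)?;
    // 3. Flush to disk before the rename makes the data visible.
    tmp.sync_all()?;
    drop(tmp);
    // 4. Atomic replacement: readers see either the old file or the new one.
    fs::rename(&tmp_path, path)
}

fn main() -> std::io::Result<()> {
    let target = std::env::temp_dir().join("query-cache-demo.json");
    atomic_write(&target, b"{\"query\":\"SELECT 1\"}")?;
    println!("{}", fs::read_to_string(&target)?);
    Ok(())
}
```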

As per the discussion on Discord, we still have the issue with concurrent writes with multiple files, if the filenames are the hashes of the queries (which is necessary for the macros looking up the query data).

I haven't seen the Discord discussion, but if you mean the hash could collide, that shouldn't be an issue if you use a cryptographic hash like SHA256. It's incredibly hard to find collisions in such a hash even intentionally, let alone accidentally.

@abonander
Collaborator

We ultimately came to the conclusion that the concurrent write issue is rather moot since macros aren't expanded in parallel and probably won't be for some time.

It's incredibly hard to find collisions in such a hash even intentionally, let alone accidentally.

I'm not worried about hash collisions, which are certainly unlikely with SHA256, but rather the user copy-pasting the exact same query into multiple invocations, which is orders of magnitude more likely.

@argv-minus-one
Copy link
Contributor

That shouldn't be a problem, as long as the query data files are replaced atomically (as described above).
