Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

move versioning to the code and handle unversioned S3 Buckets #10

Closed
wants to merge 2 commits into from

Conversation

candrsn
Copy link

@candrsn candrsn commented Sep 30, 2021

This PR fixes a problm when accessing data stored in an unversioned S3 bucket.

There is one other change. moving the version info into the code. (I can retract that part if it is not wanted)

This url demonstrates the problem
https://fema-cap-imagery.s3.us-east-1.amazonaws.com/Support/cap_index.gpkg

A sample query might be
with sqlite_s3_query.sqlite_s3_query("https://fema-cap-imagery.s3.us-east-1.amazonaws.com/Support/cap_index.gpkg") as query:
with query("""SELECT fid, Id, ImageEventName FROM images LIMIT 4 OFFSET 1000""") as (cols, rows):
......

@michalc
Copy link
Owner

michalc commented Sep 30, 2021

Hello 👋

Thanks for the PR

However, I think I'm fairly anti this supporting non-versioned buckets... I pretty much want it to fail if it's unversioned, so I know to activate versioning on it. (In fact in doing a bit of testing a day or two ago, it failed because I had forgotten to switch on versioning, and I was very pleased that it did)

I'm also torn on having the version in the code: what's the benefit of this?

@michalc
Copy link
Owner

michalc commented Oct 10, 2021

After thinking further, I'm still happy with sqlite-s3-query not working with unversioned buckets.

@michalc michalc closed this Oct 10, 2021
@Wrufesh
Copy link

Wrufesh commented Aug 7, 2023

I am one of the user who do not want to enable versioning.
Not sure why you think to cut off users who do not want to enable versioning.

@michalc
Copy link
Owner

michalc commented Aug 9, 2023

Hi @Wrufesh,

It's to keep the scope of this project very focused - my aim is for it to be good at querying SQLite files on versioned buckets.

It's not a free thing to have code for more cases. The code will be more complex, and so there will costs. Firstly, a cost in terms of maintenance time to keep this "good" - for example lots of tests would have to run both on versioned and unversioned buckets. And secondly, a cost in terms of risk - having this behaviour in the code runs the risk of wanting versioning, but accidentally running on unversioned buckets, which then risks queries failing in strange ways if they run during the replace of the database object (or even worse, succeeding with strange results). The "REPEATABLE READ" behaviour that's mentioned in the README - this depends on the bucket being versioned. These costs are essentially incurred by myself and other users that do want to run with versioning.

Not sure if it's helpful to say, or even maybe a bit... condescending (and if so I apologise), but I know it's frustrating to be on the side that's deemed out of scope of a project: I have been there myself. The one thing I can emphasise is that the code is very permissibly licensed with its MIT license. You are free to add this behaviour into a fork and use the fork on unversioned buckets in virtually any project.

That all being said...

I am one of the user who do not want to enable versioning

I am curious as to why you don't want to enable versioning?

Michal

@Wrufesh
Copy link

Wrufesh commented Aug 9, 2023

Hello @michalc,

Thank you for your reply.

To be honest I have a limited understanding of how it works. I believe that we are creating virtual file system and making range request based on actions performed on vfs.

I was able to work with un-versioned buckets by making this small change. #84

Please take a moment to take a look and let me know if it is acceptable.

I do not want to enable versioning because I want to replace the file every time it is written to s3 with same filename.
This is so because I am using s3 as a backend.
The sqlite db written is latest index of some dataset being uploaded on my system.

Rupesh

@michalc
Copy link
Owner

michalc commented Aug 9, 2023

I do not want to enable versioning because I want to replace the file every time it is written to s3 with same filename.

Ah so this is exactly the case when enforcing versioning is useful, since it allows this without risk:

  1. A query starts on the object
  2. The object is replaced
  3. The query continues on the object

Without versioning, at step 3, sqlite will see a corrupted database because it finishes the query on one that is different to the one it started on. With versioning, step 3 will use the same version as it used in step 1, and so the query will complete successfully.

Nothing in versioning prevents you from always starting queries on the most recent version of an object at the time if that's what you want. In fact, that's exactly how I use sqlite-s3-query.

I was able to work with un-versioned buckets by making this small change. #84
Please take a moment to take a look and let me know if it is acceptable.

I'm going to say no for the reasons above

@Wrufesh
Copy link

Wrufesh commented Aug 9, 2023

Thank you for your response.

I have limited s3 resource, and I do not want to save different versions of objects.
This is the case which only occurs when update happens, and in my case the update of the sqlite file is rare.

@michalc
Copy link
Owner

michalc commented Aug 9, 2023

I have limited s3 resource, and I do not want to save different versions of objects.

Ah I didn't consider this. Then would a lifecycle policy in S3 be acceptable? I suspect the shortest time is 1 day, so there would be at most 1 day of both versions existing at the same time. So something like this configured in the console for the bucket?

image

@michalc
Copy link
Owner

michalc commented Aug 9, 2023

I'm also wondering, how big are the objects that we're talking about?

@michalc
Copy link
Owner

michalc commented Aug 9, 2023

my case the update of the sqlite file is rare.

Ah and also, how rare is expected? Once a day/month/year?

@Wrufesh
Copy link

Wrufesh commented Aug 9, 2023

I am using minio object store on private cloud. In my case s3 is used as backend. The created objects are always associated with some kind of batch job. Batch job can be versioned; output objects from versioned jobs are always constant.

In my use case there is no use of bucket versioning.

@michalc
Copy link
Owner

michalc commented Aug 9, 2023

So I don't really understand I have to admit. But can I push on:

In my use case there is no use of bucket versioning.

Can you expand on why you can't enable it? I'm not sure if I'm missing something, but there shouldn't be a code impact - nothing needs to care about versioning if it doesn't want to, and cost impact I would imagine is relatively minimal in most cases - especially if you seem to have a non trivial setup with a few moving parts already.

And for background/context, are you looking for sqlite-s3-query to query minio, or S3?

@Wrufesh
Copy link

Wrufesh commented Aug 9, 2023

I am using sqlite-s3-query with minio as it supports s3 api.

Currently I am using the fork of sqlite-S3-query which doesn't require versioning.

I can ask the provider to turn on the versioning for me. But I want to be be free to choose as s3 object storage has given us freedom to choose whether to enable versioning or not.

S3 provider could force the user to use versioning but it choose to give freedom to it's user.

Similarly I propose to give here freedom as well.

That is just my personal opinion but I understand your concerns. 🙂

@michalc
Copy link
Owner

michalc commented Aug 10, 2023

I've pondered a bit more, and I'm going to stick with requiring versioning

  • You can make a fork as you have
  • Or, it sounds like in your specific case, you can probably enable versioning if you need to
  • Or, I've just realised there there is another library https://github.com/litements/s3sqlite, that I don't think requires versioning

It sounds like there are options - from what I can tell, nothing in keeping this library for versioned buckets prevents you from achieving your aims.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants