Skip to content

feat: add conditions, args to purge_ttl script#668

Merged
jrconlin merged 5 commits intomasterfrom
feat/631
Jun 12, 2020
Merged

feat: add conditions, args to purge_ttl script#668
jrconlin merged 5 commits intomasterfrom
feat/631

Conversation

@jrconlin
Copy link
Member

Description

attempt to try and provide a way to allow the purge_ttl script to complete.

Adds arguments (ENV VARS):

--instance_id (INSTANCE_ID) Spanner instance id
--database_id (DATABASE_ID) Spanner database id
--sync_database_url (SYNC_DATABASE_URL) Spanner DSN
spanner://instance/database
--collection_ids (COLLECTION_IDS)
JSON formatted list of collections to limit deletions
e.g. --collection_ids=123 limits to just collection 123
--collection_ids=[123,456] limits to both 123 & 456
default is all collections

Testing

May be used with the stage spanner database instance

Issue(s)

Issue #631

attempt to try and provide a way to allow the purge_ttl script to
complete.
* Adds arguments (ENV VARS):

  --instance_id (INSTANCE_ID)  Spanner instance id
  --database_id (DATABASE_ID)  Spanner database id
  --sync_database_url (SYNC_DATABASE_URL) Spanner DSN
        `spanner://instance/database`
  --collection_ids (COLLECTION_IDS)
        JSON formatted list of collections to limit deletions
        e.g. `--collection_ids=123` limits to just collection 123
             `--collection_ids=[123,456]` limits to both 123 & 456
             default is all collections

Issue #631
@jrconlin jrconlin requested review from a team and erkolson June 11, 2020 20:01
@erkolson
Copy link

erkolson commented Jun 12, 2020

TBC, the syntax for the env var COLLECTION_IDS is identical to the cli flag? i.e.

export COLLECTION_IDS="[123,456]"

and providing COLLECTION_IDS="[]" will enable or disable all collections?

@jrconlin
Copy link
Member Author

Yep. It uses the same logic. Let me know if that's a problem and I'll come up with something different. I tried it out locally on bash and it works, provided the value is quoted like you're doing.

if args.sync_database_url and not (
args.instance_id and args.database_id):
(args.instance_id,
args.collection_id) = from_env(args.sync_database_url)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
args.collection_id) = from_env(args.sync_database_url)
args.database_id) = from_env(args.sync_database_url)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

d'oh

collections = [collections]
args.collection_ids = collections
if args.sync_database_url and not (
args.instance_id and args.database_id):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

INSTANCE/DATABASE_ID have a default so they'll pretty much always override a SYNC_DATABASE_URL here. I think we can kill these defaults (or maybe these args entirely unless @erkolson or anyone else uses them, I don't), the cron job uses SYNC_DATABASE_URL.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have always used the SYNC_DATABASE_URL and was unaware of the other options.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh, I use the others because I can never remember what the spanner DSN is. I've reworked things a bit to be smarter and more clear what's going on. Thanks!

Comment on lines +51 to +52
logging.info("{}: removed {} rows, batches_duration: {}".format(
name, result, end - start))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logging.info("{}: removed {} rows, batches_duration: {}".format(
name, result, end - start))
logging.info("{}: removed {} rows, {}_duration: {}".format(
name, result, name, end - start))

query += " = {:d}".format(args.collection_ids[0])
else:
query += " in ({})".format(
', '.join(map(str, args.collection_ids)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I always prefer bind params :) but I guess this'll do for now

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh, so do I, but it would have made things really complicated. Hopefully we trust @erkolson not to insert ";drop tables;"

@erkolson
Copy link

@jrconlin , what is the behavior when this is passed COLLECTION_IDS="[]"?

@jrconlin jrconlin requested a review from pjenvey June 12, 2020 18:58
database = instance.database(args.database_id)

logging.info("For {}:{}".format(args.instance_id, args.database_id))
batch_query = add_conditions(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sorry, almost forgot: let's not apply this to batches.

They hopefully won't ever have timeout problems, their expiry index is sane: a "global" index or whatever spanner calls it, not interleaved in user_collections. They're always going to have much less expired data to purge and their expiry logic is also a little different: all batches expire (and rather quickly) regardless of collection_id.

The batch delete job currently takes less than 10 min on prod.

@jrconlin jrconlin requested a review from pjenvey June 12, 2020 21:57
@jrconlin jrconlin merged commit 2a14eb2 into master Jun 12, 2020
@jrconlin jrconlin deleted the feat/631 branch July 30, 2020 14:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants