[hailctl] Initialize Hail Environment Interactively #13279

Merged: 32 commits into hail-is:main, Aug 23, 2023

Conversation

@jigold (Contributor) commented Jul 21, 2023:

I wanted to get feedback on the code I've written thus far before I start testing everything. I'm worried it might be too complicated / brittle to maintain.

I also chose to blow away the entire existing environment rather than resetting the variables with new values. Not sure if that's what we want.

@jigold (Contributor, Author) commented Jul 21, 2023:

In GCP, this code will check that the remote_tmpdir exists and is compatible with the configured regions, and will assign service account permissions. Same for the artifact registry.
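For illustration, a region-compatibility check along these lines could be built on the get_gcp_bucket_information helper reviewed later in this PR; the wrapper below is a sketch, not the PR's actual code:

```python
async def remote_tmpdir_matches_region(storage_client, remote_tmpdir: str, expected_region: str) -> bool:
    # remote_tmpdir looks like gs://my-bucket/batch/tmp; extract the bucket name
    bucket = remote_tmpdir.removeprefix('gs://').split('/')[0]
    bucket_info = await get_gcp_bucket_information(storage_client, bucket)  # helper from this PR
    if bucket_info is None:
        return False  # the bucket does not exist, or we cannot see it
    # A multi-regional bucket incurs cross-region egress charges when Batch
    # workers in a single region read from and write to it.
    return (bucket_info.location_type.lower() == 'region'
            and bucket_info.location.lower() == expected_region.lower())
```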

@jigold (Contributor, Author) commented Jul 24, 2023:

@daniel-goldstein When you get a chance, I'd appreciate your high-level opinion on this addition before I start thoroughly testing everything. The goal is to have something ready before the ATGU Welcome Workshop on September 5th. I know the import locations are an issue in initialize.py.

I mainly want feedback on whether the code passes a high-level smell test, and whether I've over-designed this with too many checks and prompts or am forgetting something that would be nice to check for.

@daniel-goldstein (Contributor):

This is super cool! I am a big fan of the idea and the overall approach, particularly when it comes to setting up the tmp bucket and getting the permissions on it correct. Here are my high-level thoughts. Sorry for the wall of text, but I found these a little hard to articulate.

Regarding number of prompts

I think this is my primary concern. There's a lot of great automation here, but it's a lot right off the bat. What this is aiming to do is make it quick and simple to start running batches, and every time someone has to stop and ask how they should respond to some prompt, that process gets longer and more complicated. I think it's worth considering what the first batch people run might be and designing for a minimal first experience. IMO, a temp bucket is an absolutely crucial piece of configuration before you can do anything interesting, and configuring a temp bucket is something that hailctl can easily be very opinionated about.

The container registry is another matter: there are harder questions there, and you can run a lot of cool batches before having to worry about provisioning your own. It's also not actually part of the hailctl config (unless something has changed recently), so it feels a little unusual in this flow. I still think it's helpful to set people up with an AR and keep them from footguns, but maybe that can go in a separate command that the initial init command points to once you're done? Something along the lines of "if you get to the point where you need to upload custom container images, you can use hailctl to set up a registry"?

Another thing that gives me a little pause is the wording around Google projects. I get that you need one to create a bucket, but I think we should steer clear of the implication that you are "selecting a GCP project to use for Hail Batch", because that implies a link or ownership that isn't there. I think there's a quick fix here, though: for a given resource that we are creating for Hail's use, like the temp bucket, ask for the name first and then ask which project it should be created in, using the projects listed in gcloud as choices with the option to write in your own.

Regarding number of checks

I think it'd be good to avoid warnings when possible. From looking at this I see a pattern of

  1. Ask a leading question
  2. Emit a warning if the user selects the alternative option instead of the suggested option

I think I would prefer instead to ask a leading question and in the prompt explain why the alternative option might be undesirable. Then when they make a decision just move on.

On a broader note, I think we should focus on having good documentation and linking to it, rather than on perfectly thorough explanations in the CLI. At some point in a long-winded interactive setup I start spamming Enter, but if it was quick and at the end said something to the effect of "Your current configuration could result in excess cloud cost. See the documentation about common pitfalls and how to avoid them", I might read through that FAQ with a more discerning eye. This is just my opinion, though; I'd be curious to see whether other folks disagree about the UX.

@jigold (Contributor, Author) commented Jul 25, 2023:

I'm thinking that I want to have this:

Basic setup for the domain, remote_tmpdir, and billing_project, selecting only one region:

hailctl init

Which will prompt for:

  1. GCP project
  2. Hail domain
  3. region
  4. remote tmpdir location

And set:

  • domain
  • batch/billing_project
  • batch/regions
  • batch/remote_tmpdir
  • batch/backend
  • query/backend

And then print a message showing the default settings, with instructions on how to change any of them using hailctl config set ... and warnings about the configuration. I'll (hopefully) have the autocomplete PR merged by then, so we can make that nice. The trial billing project is used automatically; we don't prompt for that.

Then we'll have optional flags for components to configure

hailctl init --container-registry --requester-pays-buckets --extra-query-settings --extra-batch-settings

This would prompt for the same values as above, plus advanced region settings and query settings, and would create the container registry.

And lastly

hailctl config check

which prints warnings about misspecified regions and tmpdirs.
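Concretely, after the basic flow the user's config file (~/.config/hail/config.ini) would end up looking something like this; the values are illustrative, and the [global] section for domain is an assumption based on the hailctl config set paths above:

```ini
[global]
domain = hail.is

[batch]
billing_project = trial
regions = us-central1
remote_tmpdir = gs://hail-batch-tmp-a1b2c3/batch/tmp
backend = service

[query]
backend = batch
```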

@jigold (Contributor, Author) commented Jul 25, 2023:

Actually, we can just pick a random bucket name and create the bucket automatically; then the user doesn't even have to specify the tmp directory.

@daniel-goldstein (Contributor):

This seems good. I have one objection but I might have just misunderstood the following sentence.

> We can just pick a random bucket name and create it automatically and the user doesn't even specify the tmp directory.

I'm cool with having a default random name as an option but I think there should be a prompt for the bucket name so that the user is aware that we are creating resources on their behalf that will cost them money. I feel like they should always confirm any resource creation that hailctl does and the docs should inform them on when they might want to change or delete the resource later on.

@jigold (Contributor, Author) commented Jul 25, 2023:

You're absolutely right. We should pick a default name for the prompt. Actually, I already have the code to do that!
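A prompt with a generated default could look roughly like this with rich; the naming scheme here is made up for illustration:

```python
import secrets

from rich.prompt import Prompt


def prompt_for_tmp_bucket_name(username: str) -> str:
    # GCS bucket names are globally unique, so add a short random suffix
    default_name = f'hail-batch-tmp-{username}-{secrets.token_hex(3)}'
    return Prompt.ask(
        'Name of the bucket to create for Hail Batch temporary files '
        '(you will be billed for this bucket)',
        default=default_name,
    )
```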

@danking (Contributor) left a comment:

Cool stuff! A few comments. Seems broadly on the right track.

@@ -32,8 +32,8 @@ def _load_user_config():
     user_config.read(config_file)


-def get_user_config() -> configparser.ConfigParser:
-    if user_config is None:
+def get_user_config(reload=False) -> configparser.ConfigParser:
@danking:

I don't like this. It seems too easy to create races, and the only use-case is printing the config after init.


async def grant_service_account_bucket_write_access(storage_client, bucket_info: BucketInfo, service_account: str):
    service_account_member = f'serviceAccount:{service_account}'
    await storage_client.grant_bucket_write_access(bucket_info.name, service_account_member)
@danking:

It seems like the grant_bucket_write_access function is easy to use incorrectly (e.g., forgetting to prepend serviceAccount:). Maybe it should have required, mutually exclusive keyword args user= or serviceAccount=?
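A sketch of that mutually exclusive keyword-argument pattern (snake_case names here; _grant_bucket_role is a hypothetical helper):

```python
from typing import Optional


async def grant_bucket_write_access(self, bucket: str, *,
                                    user: Optional[str] = None,
                                    service_account: Optional[str] = None) -> None:
    if (user is None) == (service_account is None):
        raise ValueError('specify exactly one of user= or service_account=')
    # Build the IAM member string here so callers cannot forget the prefix
    member = f'user:{user}' if user is not None else f'serviceAccount:{service_account}'
    await self._grant_bucket_role(bucket, member, 'roles/storage.objectCreator')  # hypothetical helper
```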

    'roles/storage.objectCreator',
)

iam_policy = await storage_client.get_bucket_iam_policy(bucket_info.name)
@danking:

I don't think you can assume you have permission to read the IAM policy. I don't know where you plan to use check_service_account_has_bucket_write_access, but if this is a bucket that Hail might write to anyway, it seems very reasonable to just write to it.

bindings = iam_policy['bindings']
for binding in bindings:
    role = binding['role']
    if role in write_roles:
@danking:

This isn't quite right. There can be custom roles that grant write access.
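Rather than enumerating role names, one could probe access directly, in line with the "just write to it" suggestion above. A sketch using the synchronous google-cloud-storage client (not the PR's async storage client); note it checks the caller's own access, so it only applies where the caller is the identity that will do the writing:

```python
from google.api_core.exceptions import Forbidden
from google.cloud import storage


def can_write_to_bucket(client: storage.Client, bucket_name: str) -> bool:
    """Probe write access by attempting a tiny write; this also covers custom roles."""
    blob = client.bucket(bucket_name).blob('batch/tmp/.hail-write-probe')
    try:
        blob.upload_from_string(b'')
        blob.delete()  # best-effort cleanup of the probe object
        return True
    except Forbidden:
        return False
```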

    await storage_client.insert_bucket(bucket_info.project, body=body)


async def check_service_account_has_bucket_read_access(storage_client, bucket_info: BucketInfo, service_account: str) -> bool:
@danking:

Same argument as below: just check that you can list the bucket. That's not exactly the same as read, but in practice Hail needs both list and read.
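In the same spirit, list access can be probed directly (same caveats as the write probe above):

```python
from google.api_core.exceptions import Forbidden
from google.cloud import storage


def can_list_bucket(client: storage.Client, bucket_name: str) -> bool:
    try:
        # Fetch at most one object name; we only care whether listing is allowed
        next(iter(client.list_blobs(bucket_name, max_results=1)), None)
        return True
    except Forbidden:
        return False
```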

'name': bucket_info.name,
'location': bucket_info.location,
'locationType': bucket_info.location_type,
'lifecycle': {'rule': [{'action': {'type': 'Delete'}, 'condition': {'age': bucket_info.retention_policy_days}}]}
@danking:

This is not a retention policy, it's a lifecycle policy.

@danking:

Retention is the opposite: it prevents deletion.

        return self.location_type.lower() != 'region'


async def get_gcp_bucket_information(storage_client, bucket: str) -> Optional[BucketInfo]:
@danking:

This function is in hailctl.config.utils, so I'm going to apply the scrutiny I would to any method made available to all Hail developers.

The fact that this doesn't read and set the lifecycle policy information is confusing. It makes me think that None can mean either (1) there is no lifecycle policy or (2) we just don't know the lifecycle policy.

async def get_gcp_default_project() -> Optional[str]:
    from hailtop.utils import check_shell_output  # pylint: disable=import-outside-toplevel
    try:
        project, _ = await check_shell_output("gcloud config list --format 'value(core.project)'")
@danking:

gcloud config get-value project feels more direct.

@danking:

project is defined as the default project by the gcloud config docs: https://cloud.google.com/sdk/gcloud/reference/config/set
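The helper from the diff above, rewritten with the more direct command; a sketch that assumes check_shell_output returns stdout bytes, as the tuple unpacking in the original suggests:

```python
from typing import Optional


async def get_gcp_default_project() -> Optional[str]:
    from hailtop.utils import check_shell_output  # pylint: disable=import-outside-toplevel
    try:
        project, _ = await check_shell_output('gcloud config get-value project')
    except Exception:
        return None
    project_str = project.decode().strip()
    # get-value prints nothing (or "(unset)") when no default project is configured
    if project_str in ('', '(unset)'):
        return None
    return project_str
```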

    f'Does the bucket {bucket} exist? '
    f'Have you installed gcloud? For directions see https://cloud.google.com/sdk/docs/install '
    f'Do you have admin privileges to grant permissions to resources? '
    f'Are you logged into your google account by running `gcloud auth login`?')
@danking:

We should catch the exception that says "you do not have permission" and specifically suggest that the user request the roles "XXX" and "YYY" from the relevant admin.

remote_tmpdir = f'{bucket_name}/batch/tmp'
errors = False

retention_days = IntPrompt.ask(f'How many days should files be retained in bucket {bucket_name}?', default=30)
@danking:

Let's not use the words "retained" and "retention", because Google uses those words to talk about prevention of deletion, not automatic deletion. Let's say it directly: "After how many days should files be automatically deleted in your temporary bucket, {bucket_name}?"

@danking:

And you probably need to give them an option to not do that.
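One way to offer the opt-out is a confirmation before the IntPrompt; the wording and the Confirm step are assumptions:

```python
from typing import Optional

from rich.prompt import Confirm, IntPrompt


def prompt_lifecycle_days(bucket_name: str) -> Optional[int]:
    """Return the lifecycle age in days, or None if the user opts out."""
    if not Confirm.ask(
            f'Should files in {bucket_name} be automatically deleted after a fixed number of days?',
            default=True):
        return None
    return IntPrompt.ask(
        f'After how many days should files be automatically deleted from bucket {bucket_name}?',
        default=30)
```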

@jigold marked this pull request as ready for review, July 28, 2023 17:48
@jigold (Contributor, Author) commented Jul 28, 2023:

Example with mismatching regions:
[screenshot]

Example with verbose turned on:
[screenshot]

@danking (Contributor) commented Jul 28, 2023:

Closing to ease CI burden

@danking closed this Jul 28, 2023
@danking reopened this Jul 28, 2023
@daniel-goldstein (Contributor) left a comment:

I like this level of granularity in terms of the initial configuration. Added some more detailed comments, mostly around managing user expectations.

import typer

from typer import Abort, Exit, Option as Opt
# from rich import print
@daniel-goldstein:

Remove.



def initialize(
    incremental: Ann[bool, Opt('--incremental', '-i', help='Do not destroy the existing configuration before setting new variables.')] = False,
@daniel-goldstein:

I feel like it's safer to default to not erasing anything from the user's config. How about instead having an --overwrite flag that explicitly overwrites any existing configuration?
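Flipping the default might look like this, mirroring the diff above (the Ann alias and help text follow the PR's style but are assumptions here):

```python
from typing import Annotated as Ann  # assuming the PR aliases Annotated like this

from typer import Option as Opt


def initialize(
    overwrite: Ann[bool, Opt('--overwrite', help='Erase the existing configuration before setting new variables.')] = False,
):
    # By default nothing is erased; the user must opt in to a clean slate
    ...
```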

if location != expected_region or bucket_info['locationType'] != 'region':
    typer.secho(f'WARNING: remote temporary directory {remote_tmpdir} is not located in {expected_region} or is multi-regional. '
                f'Found {location} with location type {bucket_info["locationType"]}.',
                fg=typer.colors.MAGENTA)
@daniel-goldstein:

I feel like yellow/gold is a more expected warning color.

lifecycle_days = IntPrompt.ask(
    f'After how many days should files be automatically deleted from bucket {bucket_name}?', default=30)
if lifecycle_days <= 0:
    typer.secho(f'invalid value for lifecycle rule in days {lifecycle_days}', fg=typer.colors.RED)
@daniel-goldstein:

Suggested change:

-    typer.secho(f'invalid value for lifecycle rule in days {lifecycle_days}', fg=typer.colors.RED)
+    typer.secho(f'Invalid value for lifecycle rule in days {lifecycle_days}', fg=typer.colors.RED)

        verbose=verbose,
    )
    typer.secho(f'Created bucket {bucket_name} in project {project} with lifecycle rule set to {lifecycle_days} days.', fg=typer.colors.GREEN)
except InsufficientPermissions as e:
@daniel-goldstein:

Let's put the minimum amount of code we expect might fail in the try block, which AFAICT starts on the await create_gcp_bucket line.
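That restructuring would look roughly like this; create_gcp_bucket's signature is inferred from snippets in this review, so treat it as a sketch:

```python
# Prompting happens outside the try block: only the bucket creation itself
# is expected to raise InsufficientPermissions
lifecycle_days = IntPrompt.ask(
    f'After how many days should files be automatically deleted from bucket {bucket_name}?', default=30)

try:
    await create_gcp_bucket(
        project=project,
        bucket=bucket_name,
        location=location,
        lifecycle_days=lifecycle_days,
        verbose=verbose,
    )
except InsufficientPermissions as e:
    typer.secho(str(e), fg=typer.colors.RED)
    raise Exit(1) from e
typer.secho(f'Created bucket {bucket_name} in project {project} with lifecycle rule set to {lifecycle_days} days.', fg=typer.colors.GREEN)
```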

def get_default_region(supported_regions: List[str], cloud: str) -> Optional[str]:
    if cloud == 'gcp':
        if 'us-central1' in supported_regions:
            return 'us-central1'
@daniel-goldstein:

Is this supposed to represent the default region in the deploy config? That isn't necessarily true for, e.g., the Australians. How about another endpoint, like the one for supported regions?

If this doesn't represent the default region of the Hail Batch service, then I think we should not have a default and should force the user to think about which region they really want.

labels_str = None

try:
    await check_shell(f'gcloud --project {project} storage buckets create gs://{bucket} --location={location}', echo=verbose)
@daniel-goldstein:

To use check_shell we need to shell-quote, but this isn't really using any shell functionality, just the gcloud executable, so it should use check_exec_output.

@jigold:

I find writing commands for check_exec_output to be significantly harder to read.

@daniel-goldstein:

Fair point, but I think using the shell when we don't have to is a footgun, because we could forget to properly quote inputs, and shell invocations end up in bash history files (so we have to be careful not to check_shell with anything sensitive). What if we just made a wrapper for check_exec_output that takes a single string, splits on spaces, and passes it to check_exec_output?

@jigold:

I don't like this option. It's too hard to get the splitting right. I'll just make it use check_exec_output.
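For reference, the exec-based version of the call above passes the program and its arguments separately, which is what removes the need for shell quoting; the exact check_exec_output signature, including echo=, is assumed:

```python
from hailtop.utils import check_exec_output

await check_exec_output(
    'gcloud', '--project', project,
    'storage', 'buckets', 'create', f'gs://{bucket}',
    f'--location={location}',
    echo=verbose,
)
```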

        f'to assign you the StorageIAMAdmin role in Google Cloud Storage or ask them to update the bucket {bucket} on your behalf by giving ' \
        f'service account {service_account} the role "roles/storage.objectCreator".'
    raise InsufficientPermissions(msg) from e
raise
@daniel-goldstein:

We can replace these two functions with one function that also accepts a role.
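A sketch of the merged function, parameterized by role; the gcloud subcommand is real, but the error-string matching and message text are adapted loosely from the snippets in this review:

```python
from subprocess import CalledProcessError


async def grant_service_account_bucket_access(project: str, bucket: str, service_account: str,
                                              role: str, verbose: bool = False):
    try:
        await check_exec_output(
            'gcloud', '--project', project,
            'storage', 'buckets', 'add-iam-policy-binding', f'gs://{bucket}',
            f'--member=serviceAccount:{service_account}', f'--role={role}',
            echo=verbose,
        )
    except CalledProcessError as e:
        if 'does not have' in e.stderr.decode('utf-8'):
            msg = (f'ERROR: You do not have the necessary permissions to update bucket {bucket}. '
                   f'Ask an administrator to grant service account {service_account} '
                   f'the role "{role}" on the bucket on your behalf.')
            raise InsufficientPermissions(msg) from e
        raise
```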

except CalledProcessError as e:
    if 'does not have storage.buckets.get access to the Google Cloud Storage bucket' in e.stderr.decode('utf-8'):
        msg = f'ERROR: You do not have the necessary permissions to update bucket {bucket} in project {project}. Ask a project administrator ' \
              f'to assign you the StorageAdmin role in Google Cloud Storage or ask them to update the bucket {bucket} on your behalf.'
@daniel-goldstein:

Users only need that role on that particular bucket, not project-wide.

except CalledProcessError as e:
    if 'does not have storage.buckets.create access to the Google Cloud project' in e.stderr.decode('utf-8'):
        msg = f'ERROR: You do not have the necessary permissions to create buckets in project {project}. Ask a project administrator ' \
              f'to assign you the StorageAdmin role in Google Cloud Storage or ask them to create the bucket {bucket} on your behalf.'
@daniel-goldstein:

Can we replace this with a narrower role that allows users to create buckets?

@jigold (Contributor, Author) commented Aug 2, 2023:

I think I addressed most of the comments. I haven't tested the new code -- I don't want to do that until we're happy with it. I don't know that I like how this is turning out, though. I think we're conflating two things: what hailctl config init should be, which is initializing an environment configuration file, versus a quick start for using Hail Batch and QoB. My intention for this feature was to idiot-proof the latter, especially for the ATGU workshop, and to make it as few commands as possible. Needing to run hailctl auth login first means this is no longer "just run this single command and you can get going with little effort". I don't think hailctl config init is the right place for what I have written. Maybe it should be hailctl batch quick-start?

@danking (Contributor) commented Aug 2, 2023:

Agreed that hailctl init is maybe the wrong name b/c hailctl also controls Dataproc. hailctl batch init seems right to me. I agree that hailctl config init seems wrong if we're creating buckets.

On the issue of login, I think we can appease both points of view. If you're not logged in, have hailctl batch init do this:

You are not currently logged in to Hail. Redirecting you to a login screen.

... user does login flow ...

In the future, you can use hailctl auth login to login to Hail.

... continue with hailctl batch init ...

I think we should punt on addressing domain. Broad users don't need to interact with it at all.

What are the other concerns about how this is developing?

@danking (Contributor) commented Aug 2, 2023:

(And to be clear, I'm on board with the idea of a single command that gets you from zero to simple batch pipelines; the jury is out on the Artifact Registry. I think making sure that's configured correctly might be better upstreamed to Sam. Normal users might not have permission to enable/disable things like that.)

@daniel-goldstein (Contributor):

Hm, I guess I just don't personally see the value in one command instead of two. If the workshop is meant to teach users how to use batch, and one aspect of using batch is knowing how to log in, shouldn't they go through the motions?

@danking (Contributor) commented Aug 2, 2023:

That's a fair counter-point, but wouldn't it be annoying if you did:

hailctl batch init

And it was like:

Run hailctl auth login first.

Why not just do it for me?

@daniel-goldstein (Contributor):

I suppose this is where I started getting entangled with the domain issue. If the Australians run a workshop, what should happen if their users run hailctl batch init? Should they have to supply some additional argument so that if they're not authenticated they get sent somewhere other than hail.is? Maybe that's ok, seems kind of awkward though.

While it doesn't work this way today, I imagine that we should ultimately configure hailctl auth login to accept a domain. I feel like that would make the AUS scenario slightly less awkward, and the tool more consistent, though I admit it is conceding some of our own convenience.

@daniel-goldstein (Contributor):

Heck, not even just the Australians: what about Azure?

@danking (Contributor) commented Aug 3, 2023:

I agree that we never carefully considered switching domains at the user-level. I'm mostly concerned with Broadies using the GCP instance for now. I think how to deal with domains really gets into the idea of having config profiles, which just makes all of this more complicated. Ergo, my gut is: if you're not a Broadie using Google, you have to do some extra work, but that set of users is very small.

@danking (Contributor) commented Aug 3, 2023:

But let's solve for Aus and Azure as a separate thread of work that doesn't block this PR.

I imagine whatever we decide will either make this command need to take a domain or enhance this command to create a profile per-domain.

@jigold (Contributor, Author) commented Aug 14, 2023:

Could you please do one more broad look over this before I start testing again with the new changes?

@daniel-goldstein (Contributor):

This looks great! I love how straightforward async_basic_initialize is.

@danking merged commit 1391db0 into hail-is:main, Aug 23, 2023