Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Require authentication for dry run function and run gcloud auth when … #5171

Merged
merged 5 commits into from
Mar 6, 2024

Conversation

scholtzan
Copy link
Collaborator

@scholtzan scholtzan commented Mar 5, 2024

…not logged in

I opened https://mozilla-hub.atlassian.net/browse/DSRE-1558 to actually require authentication for the dryrun function and give dryrun permissions against restricted datasets. The PR should be merged before we apply these changes.

Fixes #5168

Checklist for reviewer:

  • Commits should reference a bug or github issue, if relevant (if a bug is referenced, the pull request should include the bug number in the title).
  • If the PR comes from a fork, trigger integration CI tests by running the Push to upstream workflow and provide the <username>:<branch> of the fork as parameter. The parameter will also show up
    in the logs of the manual-trigger-required-for-fork CI task together with more detailed instructions.
  • If adding a new field to a query, ensure that the schema and dependent downstream schemas have been updated.
  • When adding a new derived dataset, ensure that data is not available already (fully or partially) and recommend extending an existing dataset in favor of creating new ones. Data can be available in the bigquery-etl repository, looker-hub or in looker-spoke-default.

For modifications to schemas in restricted namespaces (see CODEOWNERS):

┆Issue is synchronized with this Jira Task

@scholtzan scholtzan force-pushed the authenticated-dryrun-function branch 6 times, most recently from a0627cc to fe03660 Compare March 5, 2024 22:55
@dataops-ci-bot

This comment has been minimized.

@scholtzan scholtzan force-pushed the authenticated-dryrun-function branch from fe03660 to 19c024c Compare March 5, 2024 23:02
@dataops-ci-bot

This comment has been minimized.

@scholtzan scholtzan marked this pull request as ready for review March 5, 2024 23:14
Comment on lines 48 to 52
try:
subprocess.run(["gcloud", "auth", "application-default", "login"])
except Exception as e:
print(f"Could not log in to GCP: {e}")
return False
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will prompt the browser authentication flow

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • If that gcloud command somehow ends up getting run in a non-interactive context (e.g. Airflow tasks, CI jobs) it seems like it could hang the process waiting for the browser-based auth to be completed, unless the command is smart enough to detect that it's running in a non-interactive context or the attempt to open a browser fails. Is there a reasonable way to guard against that (e.g. checking for particular environment variables)?
  • Even in an interactive context this could cause a headache if it gets called in a parallelized way (e.g. pytest, SQL generators), where it could spam open a bunch of browser windows. To guard against that I think we'd want to make sure authentication is triggered either prior to the parallelization starting, or in some controlled way (e.g. for pytest maybe using some scoped fixture?).
  • While putting this logic into the is_authenticated() function is expedient, IMO it conflicts with the natural expectations people would have based on the function name that it's just a status check with no side effects. I'd suggest putting this logic in a new imperatively-named function like authenticate() or ensure_authenticated(), which would raise an appropriate error if authentication fails.
  • It might be a good idea to cache this function (and/or the alternate imperative function if you go that route) to avoid the authentication getting checked repeatedly unnecessarily.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I believe check=True needs to be specified for subprocess.run() to raise an exception if the command exited with a non-zero exit code.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's a good point. I don't see a reliable way to check for non-interactive environment. Perhaps for now it's best to not call gcloud auth and print out an error and a suggestion to run authentication.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened #5174 for now, to keep track of this. I think for now we just keep the flow as is, and ask users to run gcloud auth themselves.

@scholtzan scholtzan requested a review from sean-rose March 5, 2024 23:16
.circleci/workflows.yml Show resolved Hide resolved
.circleci/workflows.yml Outdated Show resolved Hide resolved
bigquery_etl/cli/dryrun.py Outdated Show resolved Hide resolved
@@ -44,7 +45,11 @@ def is_authenticated():
try:
bigquery.Client(project="")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In testing this locally I didn't get prompted to authenticate because I did have application default credentials, but they were expired. I think an alternate approach is needed, like calling google.auth.default() to get the credentials (which will also raise DefaultCredentialsError if no default credentials were found) and then checking its valid property.

Comment on lines 48 to 52
try:
subprocess.run(["gcloud", "auth", "application-default", "login"])
except Exception as e:
print(f"Could not log in to GCP: {e}")
return False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • If that gcloud command somehow ends up getting run in a non-interactive context (e.g. Airflow tasks, CI jobs) it seems like it could hang the process waiting for the browser-based auth to be completed, unless the command is smart enough to detect that it's running in a non-interactive context or the attempt to open a browser fails. Is there a reasonable way to guard against that (e.g. checking for particular environment variables)?
  • Even in an interactive context this could cause a headache if it gets called in a parallelized way (e.g. pytest, SQL generators), where it could spam open a bunch of browser windows. To guard against that I think we'd want to make sure authentication is triggered either prior to the parallelization starting, or in some controlled way (e.g. for pytest maybe using some scoped fixture?).
  • While putting this logic into the is_authenticated() function is expedient, IMO it conflicts with the natural expectations people would have based on the function name that it's just a status check with no side effects. I'd suggest putting this logic in a new imperatively-named function like authenticate() or ensure_authenticated(), which would raise an appropriate error if authentication fails.
  • It might be a good idea to cache this function (and/or the alternate imperative function if you go that route) to avoid the authentication getting checked repeatedly unnecessarily.

bigquery_etl/dryrun.py Outdated Show resolved Hide resolved
bigquery_etl/dryrun.py Outdated Show resolved Hide resolved
@dataops-ci-bot

This comment has been minimized.

@scholtzan scholtzan requested a review from sean-rose March 6, 2024 18:43
Comment on lines +200 to +201
- sql/moz-fx-data-shared-prod/fenix_derived/ltv_state_values_v1/query.sql
- sql/moz-fx-data-shared-prod/fenix_derived/ltv_state_values_v2/query.sql
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed in the Jira ticket, the dryrun service account won't have access to analysis. These are the only two datasets atm that access an analysis dataset. They'll get productionized soon, so until then skip dryruns.

@dataops-ci-bot

This comment has been minimized.

Copy link
Contributor

@sean-rose sean-rose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r+wc

Comment on lines +122 to +130
- &skip_forked_pr
run:
name: Early return if this build is from a forked PR
command: |
if [ -n "$CIRCLE_PR_NUMBER" ]; then
echo "Cannot pass creds to forked PRs," \
"so marking this step successful"
circleci-agent step halt
fi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test-sql and test-routines jobs use similar forked-PR-detection logic. Should they use this instead?

Comment on lines +134 to +140
- &authenticate
run:
name: Authenticate to GCP
command: |
export GOOGLE_APPLICATION_CREDENTIALS="/tmp/gcp.json"
echo 'export GOOGLE_APPLICATION_CREDENTIALS="/tmp/gcp.json"' >> "$BASH_ENV"
echo "$GCLOUD_SERVICE_KEY" > "$GOOGLE_APPLICATION_CREDENTIALS"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test-sql, integration, and private-generate-sql jobs don't set up authentication. Should they?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

private-generate-sql does indeed need this. At least the current tests in integration and test-sql don't depend on dryruns. So they don't need to change

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least the current tests in integration and test-sql don't depend on dryruns. So they don't need to change

But the test-sql job has forked-PR-detection logic that says "Cannot pass creds to forked PRs". Is that referring to different credentials?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, those are different credentials afaik. for running some of these SQL tests we write test data to a BigQuery test project that needs these credentials


if not is_authenticated():
print(
"Authentication to GCP required. Run `gcloud auth login` "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested and verified that the specified gcloud auth login command isn't sufficient without the additional --update-adc argument.

Suggested change
"Authentication to GCP required. Run `gcloud auth login` "
"Authentication to GCP required. Run `gcloud auth login --update-adc` "

(this exact same message is used in a bunch of other places, but updating all those is probably beyond the scope of this PR).

"Authentication to GCP required. Run `gcloud auth login` "
"and check that the project is set correctly."
)
sys.exit(1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that authentication is being required in many more cases, it might be good to update the "GCP CLI tools" section of the README to instruct non-DEs to authenticate.

@dataops-ci-bot

This comment has been minimized.

@scholtzan scholtzan force-pushed the authenticated-dryrun-function branch from 6fba024 to 68ef6f5 Compare March 6, 2024 22:29
@dataops-ci-bot

This comment has been minimized.

@dataops-ci-bot
Copy link

Integration report for "Refactor skip_fork in CI, clarify login requirements"

sql.diff

No content detected.

Link to full diff

@scholtzan scholtzan merged commit 70a355a into main Mar 6, 2024
22 checks passed
@scholtzan scholtzan deleted the authenticated-dryrun-function branch March 6, 2024 23:06
BenWu pushed a commit that referenced this pull request Mar 19, 2024
#5171)

* Require authentication for dry run function and run gcloud auth when not logged in

* authenticate step in CI, remove interactive gcloud auth

* Skip dryrun for ltv_state_values_v2

* Refactor skip_fork in CI, clarify login requirements
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Require authentication for dry-run function
3 participants