Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor how RBAC enabled/disabled status is determined for Grafana Cloud stacks #4279

Merged
merged 20 commits into from
May 14, 2024

Conversation

joeyorlando
Copy link
Contributor

@joeyorlando joeyorlando commented Apr 25, 2024

What this PR does

In cloud we are currently (somewhat) improperly determining whether or not a Grafana stack had the accessControlOnCall feature flag enabled. At first things worked fine. We would enable this feature toggle via the Grafana Admin UI, and then the OnCall backend would read this value from GCOM's GET /instance/<stack_id> endpoint (via config.feature_toggles), and everything worked as expected.

There was a recent change made in grafana/deployment_tools to set this feature flag to True for all stacks. However, for some reason, the GCOM endpoint above doesn't return the accessControlOnCall feature toggle value in config.feature_toggles if it is set in this manner (it only returns the value if it is set via the Grafana Admin UI).

So what we should instead be doing is such instead of asking GCOM for this feature toggle, infer whether RBAC is enabled on the stack by doing a HEAD /api/access-control/users/permissions/search (this endpoint is only available on a Grafana stack if accessControlOnCall is enabled).

Few caveats to this ☝️

  1. we first have to make sure that the cloud stack is in an active state (ie. not paused). This is because, no matter if the accessControlOnCall is enabled or not, if the stack is in a paused state it will ALWAYS return HTTP 200 which can be misleading and lead to bugs (this feels like a bug on the Grafana API, will follow up with core grafana team)
  2. Once we roll out this change we will effectively actually be enabling RBAC for OnCall for all orgs. The Identity Access team would prefer a progressive rollout, which is why I decided to introduce the concept of settings.CLOUD_RBAC_ROLLOUT_PERCENTAGE (see also Organization. should_be_considered_for_rbac_permissioning)

Which issue(s) this PR closes

Related to https://github.com/grafana/identity-access-team/issues/667

Checklist

  • Unit, integration, and e2e (if applicable) tests updated
  • Documentation added (or pr:no public docs PR label added if not required)
  • Added the relevant release notes label (see labels prefixed w/ release:). These labels dictate how your PR will
    show up in the autogenerated release notes.

@joeyorlando joeyorlando added pr:no public docs Added to a PR that does not require public documentation updates release:patch PR will be added to "Other Changes" section of release notes labels Apr 25, 2024
@joeyorlando joeyorlando self-assigned this Apr 25, 2024
@joeyorlando joeyorlando requested review from a team and gamab April 25, 2024 16:04
Comment on lines 32 to 51
# NOTE: checking whether or not RBAC is enabled depends on whether we are dealing with an open-source or cloud
# stack. For Cloud we should make a call to the GCOM API, using an admin API token, and get the list of
# feature_toggles enabled for the stack. For open-source, simply make a HEAD request to the grafana instance's API
# and consider RBAC enabled if the list RBAC permissions endpoint returns 200. We cannot simply rely on the HEAD
# call in cloud because if an instance is not active, the grafana gateway will still return 200 for the
# HEAD request.
# stack. For open-source, simply make a HEAD request to the grafana instance's API and consider RBAC enabled if
# the list RBAC permissions endpoint returns 200.
#
# For cloud, we need to check the stack's status first. If the stack is active, we can make the same HEAD request
# to the grafana instance's API. If the stack is not active, we will simply rely on the org's previous state of
# org.is_rbac_permissions_enabled
if settings.LICENSE == settings.CLOUD_LICENSE_NAME:
# We cannot simply rely on the HEAD call in cloud because if an instance is not active
# the grafana gateway will still return 200 for the HEAD request.
stack_id = organization.stack_id
gcom_client = GcomAPIClient(settings.GRAFANA_COM_ADMIN_API_TOKEN)
rbac_is_enabled = gcom_client.is_rbac_enabled_for_stack(organization.stack_id)

# the stack MUST be active for this check.. if it is in any other state
# the Grafana API risks returning an HTTP 200 but the actual permissions data that is
# synced later on will be empty (and we'd erase all RBAC permissions stored in OnCall)
if gcom_client.is_stack_active(stack_id):
rbac_is_enabled = grafana_api_client.is_rbac_enabled_for_organization()
else:
rbac_is_enabled = organization.is_rbac_permissions_enabled
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

main change

Comment on lines +24 to +79
@contextmanager
def patched_grafana_api_client(organization, is_rbac_enabled_for_organization=False):
GRAFANA_INCIDENT_PLUGIN_BACKEND_URL_KEY = "backendUrl"

with patch("apps.user_management.sync.GrafanaAPIClient") as mock_grafana_api_client:
mock_grafana_api_client.GRAFANA_INCIDENT_PLUGIN_BACKEND_URL_KEY = GRAFANA_INCIDENT_PLUGIN_BACKEND_URL_KEY

mock_client_instance = mock_grafana_api_client.return_value

mock_client_instance.get_users.return_value = [
{
"userId": 1,
"email": "test@test.test",
"name": "Test",
"login": "test",
"role": "admin",
"avatarUrl": "test.test/test",
"permissions": [],
},
]
mock_client_instance.get_teams.return_value = (
{
"totalCount": 1,
"teams": (
{
"id": 1,
"name": "Test",
"email": "test@test.test",
"avatarUrl": "test.test/test",
},
),
},
None,
)
mock_client_instance.get_team_members.return_value = (
[
{
"orgId": organization.org_id,
"teamId": 1,
"userId": 1,
},
],
None,
)
mock_client_instance.get_grafana_incident_plugin_settings.return_value = (
{"enabled": True, "jsonData": {GRAFANA_INCIDENT_PLUGIN_BACKEND_URL_KEY: MOCK_GRAFANA_INCIDENT_BACKEND_URL}},
None,
)
mock_client_instance.get_grafana_labels_plugin_settings.return_value = (
{"enabled": True, "jsonData": {}},
None,
)
mock_client_instance.check_token.return_value = (None, {"connected": True})
mock_client_instance.is_rbac_enabled_for_organization.return_value = is_rbac_enabled_for_organization

yield mock_grafana_api_client
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

most of the changes in this file are related to this..

we had a lot of duplication around mocking out responses on the GrafanaAPIClient (plus several instances where we had like 8 nested context managers 😆 ).. this gets rid of that

Comment on lines 348 to 363
def should_be_considered_for_rbac_permissioning(self) -> bool:
"""
this is sort of a hacky workaround to address a cloud issue we introduced with the accessControlOncall
feature flag. The flag is technically enabled for all stacks, but the way in which OnCall used to be
reading it (via GCOM config.feature_flags for the stack) made it such that RBAC wasn't actually being
enabled for most stacks from the oncall backend perspective. Once we change things to start HEADing
the permissions search endpoint, this will effectively turn on RBAC for all orgs.. soo instead lets
slowly turn it on via the logic here
"""
cloud_rbac_rollout_percentage = settings.CLOUD_RBAC_ROLLOUT_PERCENTAGE

# if rbac permissions are already enabled for the org, they're "grandfathered" in
if not cloud_rbac_rollout_percentage or self.is_rbac_permissions_enabled:
return True
return self.id <= math.floor(Organization.objects.count() * cloud_rbac_rollout_percentage)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc: @gamab

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will eventually be removed once we're comfortable rolling this out to all stacks in cloud (ie. we'll slowly ramp up the value of CLOUD_RBAC_ROLLOUT_PERCENTAGE)

Copy link
Contributor

@gamab gamab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. As discussed with you, I left a couple of questions.

Side note: what happens if the search endpoint returns an error (!=200) do we remove the users permissions?

engine/apps/user_management/sync.py Show resolved Hide resolved
engine/apps/user_management/models/organization.py Outdated Show resolved Hide resolved
engine/apps/grafana_plugin/helpers/client.py Show resolved Hide resolved
engine/apps/user_management/sync.py Show resolved Hide resolved
@joeyorlando joeyorlando added this pull request to the merge queue May 14, 2024
Merged via the queue into dev with commit 2582a1b May 14, 2024
21 checks passed
@joeyorlando joeyorlando deleted the jorlando/rbac-feature-flag-fixes branch May 14, 2024 16:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr:no public docs Added to a PR that does not require public documentation updates release:patch PR will be added to "Other Changes" section of release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants