Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat/replace identity in kedro telemetry with unique username #58

Merged

Conversation

jmholzer
Copy link
Contributor

@jmholzer jmholzer commented Aug 23, 2022

Description

Resolves #50

This PR assigns the hashed username of the Kedro user to the identity field in calls to the Heap API. This allows heap to aggregate data collected by the same user. A default username of 'anonymous' is assigned in the case that it cannot be determined. Currently, in this case, the identity is also set to 'anonymous'. This would cause Heap to pool together all 'anonymous' identities; it is unclear if this meets requirements, some feedback here would be good.

Development notes

The user's username is collected in before_command_run, using getpass.getuser() (the best cross-platform choice according to the pwd docs). It is then passed to _format_user_cli_data and _send_heap_event. Tests have been changed to reflect this.

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
@jmholzer jmholzer force-pushed the feat/replace-identity-in-kedro-telemetry-with-unique-username branch 3 times, most recently from b693dd8 to 02187f1 Compare August 24, 2022 16:15
@noklam
Copy link
Contributor

noklam commented Aug 24, 2022

      - unless:
          condition:
            or: 
              - and:
                - equal: [ "3.10", <<parameters.python_version>> ]
                - equal: [true, <<pipeline.parameters.run-build-kedro-datasets>>]
              - or:
                - equal: [false, <<pipeline.parameters.run-build-kedro-datasets>>]

The CI config has been modified and it works fine, but this is very hard to reason. I think it makes sense to refactor this into
this and make it a separate PR, can you try that?

      - when:
          condition:
            and: 
              - and:
                - equal: [true, <<pipeline.parameters.run-build-kedro-datasets>>]
              - or:
                - equal: ["3.7", <<parameters.python_version>>]
                - equal: ["3.8", <<parameters.python_version>>]
                - equal: ["3.9", <<parameters.python_version>>]

Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation looks good to me 🙂

On the point of grouping all "anonymous" events together: I think we should keep the same behaviour as before where if we can't determine the username we don't send any data at all. You've already switched to getpass.getuser() which is more robust then os.getlogin() afaik. But it would be good to hear what @yetudada thinks about this.

@@ -128,19 +127,6 @@ def test_before_command_run_no_consent_given(self, mocker, fake_metadata):

mocked_heap_call.assert_not_called()

def test_before_command_run_socket_timeout(self, mocker, fake_metadata):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you remove this test?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed now you removed the timeout logic.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation looks good to me 🙂

On the point of grouping all "anonymous" events together: I think we should keep the same behaviour as before where if we can't determine the username we don't send any data at all. You've already switched to getpass.getuser() which is more robust then os.getlogin() afaik. But it would be good to hear what @yetudada thinks about this.

So I think following standup, it should be that if we can't determine the username they should get the unique Heap ID.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect, I made this change in 7181e4d.

As I understand it from the heap docs, if the request is made with no identity field, heap automatically assigns the unique ID. In the case that no username can be found, the identity field is removed from the request.

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
@jmholzer jmholzer marked this pull request as ready for review August 31, 2022 08:53
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 👍 Don't forget to update the release notes with this change.

@SajidAlamQB
Copy link
Contributor

Pasting our DM here for clarity:

Regarding the telemetry ticket, your implementation looks good. Heap automatically assigns a User ID for each user and pools those users who have the same Identity into that User ID profile. So, I believe what Yetu meant in the case where we can't determine the Identity, we shouldn't set anything to Identity including "anonymous" and just let heaps automatic User ID pick it up. This seems to be the recommended method that heap suggests: https://developers.heap.io/docs/using-identify#identify-pitfalls

Copy link
Contributor

@SajidAlamQB SajidAlamQB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you!

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
@jmholzer jmholzer merged commit 4f5b1c8 into main Sep 12, 2022
@jmholzer jmholzer deleted the feat/replace-identity-in-kedro-telemetry-with-unique-username branch September 12, 2022 14:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Replace identity in kedro telemetry with unique username
5 participants