Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Give orchest api access to the analytics module #947

Merged

Conversation

fruttasecca
Copy link
Member

@fruttasecca fruttasecca commented May 12, 2022

Description

This PR gives access to the analytics module to the orchest-api by moving it in the shared library. Analytics events that were previously sent by the webserver and that exist in the orchest-api are now responsibility of the orchest-api since this allows to leverage a number of things like asynchronous delivery, retries, access to the orchest-api db, etc.

The orchest-api analytics back-end is implemented as a subscriber subscribed to all events.

Now that the analytics event is in the shared library I think the interface/contract between the caller and the callee should be stronger. In particular, sending an event through the analytics module works by doing something akin to analytics.send_event(event_type, event_properties) , where event_type is an Enum and event_properties is a dictionary. It might be better to have a defined type (a class, data class, or typed dict) and make sending an event have the form of analytics.send_event(my_event_instance). This would reduce coupling between the analytics module and its callers, along with providing better guarantees about what is being sent, since currently we rely on the caller to act correctly w.r.t. the content of event_properties. @yannickperrenet Keen to know your opinion.

This PR changes the name of the analytics events, bringing them in line with how events are named in the orchest-api, more details will be provided once other changesets to improv/orchest-api-analytics-base are done, in particular, the following PRs:

  • moving more analytics events from the orchest-webserver to the orchest-api for the aforementioned reasons
  • expanding the types of orchest-api events, thus expanding analytics
  • possibly, a PR that refactors the interface of send_event like explained in the previous paragraph

The base of this PR was started pre-controller and I ended up not being able to split this into two smaller PRs for easier review, sorry about that :).

@fruttasecca fruttasecca added the improvement An improvement or enhancement to an existing feature. label May 12, 2022
Comment on lines 35 to 40
if isinstance(sub, models.AnalyticsSubscriber):
if app_utils.OrchestSettings()["TELEMETRY_DISABLED"]:
_logger.info(
"Telemetry is disabled, skipping event delivery to analytics."
)
continue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about moving the check for TELEMETRY_DISABLED into the AnalyticsSubscriber, feels like it is something the subscriber should own, e.g.: "Am I active?".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean at the database level? Or just under the umbrella of the AnalyticsSubscriber logic? Note that models.py won't be able to access OrchestSettings() because said settings use the models.py module, but we might be able to fetch the TELEMETRY_DISABLED value through current_app, although I am not too happy to haves models.py access the current_app, feels a bit peculiar.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or just under the umbrella of the AnalyticsSubscriber logic?

This. I agree that putting it inside the models.py feels wrong.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. AnalyticsSubscriber is a model so we are in a bit of a rut given the lack of extension methods in python (and I'd rather not monkey patch), nevertheless, I'll look for improvements.

Copy link
Contributor

@yannickperrenet yannickperrenet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I think it looks great! I thought creating an Analytics subscriber was a clever way to implement things ;)

Apart from what comes out of the discussions on this PR, I think it can be merged.

Identified that the dictionary defined two duplicates:
- `ONE_OFF_JOB_PIPELINE_RUN_CREATED`
- `CRON_JOB_PIPELINE_RUN_CREATED`
@fruttasecca
Copy link
Member Author

My reasoning behind going with this split was that event_properties contains the raw object as it lives within Orchest (with certain elements removed for anonimization) and derived_properties some information about the removed (due to anonimization) elements if possible.[...]

I see, I think it's very reasonable from the application POV, I think we can leave it as it is, let's see what's @astrojuanlu opinion.

@fruttasecca
Copy link
Member Author

fruttasecca commented May 18, 2022

I don't yet see the benefit of moving the two arguments (event_type and event_properties) into one argument, may it be a class (my_event_instance) or whatever that encapsulates the current arguments.

About the two arguments, if we are to create ad hoc classes or types.py entries I don't see the reason for having to pass the event type, given that said class would already encapsulate the information provided by the first argument, but we (the users of the module, and the implementor) would be in charge of verifying that what's passed is correct.

But (like I explained earlier in this comment) I agree that the interface could/should be improved. Maybe we can re-use the types.py or schema.py or do something similar.

About the use of a class as the contract, I think one of the main "issues" (for lack of a better word) is that the analytics module of the internal library is shared, which means it has no knowledge of the types of single services. If we want to provide a contract we have a few alternatives:

  • allow single services to pass already anonymized events to be sent, every service is responsible for what it's sent, this allows the analytics module to ignore the structure of the event itself, only define the event_types that it accepts.
  • typing of the arguments to be passed through a class, typeddict, etc., the analytics module ends up having to define the hierarcy/structure of events (likely redundant w.r.t. what the single services have defined)
  • move types.py to the shared lib, I am not sure we are ready for that, it implies we are moving to shared "global" types across all services

About point 2:

  • current approach: callers have to make sure to not add fields that would be missed by the analytics anonymization logic, it's a weak interface
  • TypedDict: can't share behaviour through inheritance/methods, overall not too big of an big issue. TypedDicts have no default values, which makes for worse ergonomics when it comes to some pre-defined fields (like the full name of the event type). The types.py module belongs to the orchest-api, making it shared would require more work and I am not sure we want to do that, regardless of that, the shortcomings of TypedDicts remain
  • data classes (https://docs.python.org/3/library/dataclasses.html): can reuse behaviour, have a bunch of initialization nice to haves like frozen or kw_only, default values, etc. A lot of implementation overhead of "normal" classes it taken away.

If I were to synthesize:

  • "bring your own anonymized event": no redundancy when it comes to schema/typing (I kinda like this because it plays very well with an implementation similar to to_notification_payload through the class hierarchy of events), but it's a shift in responsibility
  • "pass an event to analytics": either error-prone (weak interface) or leads to redundant and maybe harder to maintain code due to all extra typings required
  • "define global types used across services": quite a task imo
  • we can also leave things like it is to save time, I might make a small PR for some minor cleanup but that's it

…-api

Add environment created/delete events to `orchest-api` (notifications)
@yannickperrenet
Copy link
Contributor

"bring your own anonymized event"

I think I like this best (although not completely convinced about it being an improvement). One of the main reasons I went with the current "weak interface" is that I didn't want to pollute a bunch of webserver proxy endpoints to do the anonymization. Given that this constraint is now no longer valid (although might translate to the orchest-api) the "bring your own" seems like a valid proposal.

In the end, to share anonymization functions across endpoints it feels like pretty much the same logic is repeated as it is currently. Just in a different place.

Ideally, the send_event call is super easy to use as to encourage usage of it throughout the codebase. Having to import the correct anonymization functions to anonymize the data you want to send doesn't sound like an improvement.

@fruttasecca
Copy link
Member Author

fruttasecca commented May 18, 2022

After an internal chat we decided to go for a "bring your own anonymized event" approach, will be done in another PR.

…orchest-api

Move job_update event to the `orchest-api` (notifications)
…thub.com:orchest/orchest into improv/add-interactive-runs-event-to-orchest-api
…t-to-orchest-api

Add  interactive pipeline run events to `orchest-api` (notifications)
…hest into improv/add-builds-events-to-orchest-api
…est-api

Add  jupyter builds and interactive sessions events to `orchest-api` (notifications)
It's redundant and it being in synchronous API call can lead to issues
with big projects.
@astrojuanlu
Copy link
Contributor

👍🏽 to having event_properties with the original, raw information as much as possible (modulo anonymization of course) and splitting derived_properties!

@fruttasecca fruttasecca merged commit b535ce5 into improv/orchest-api-analytics-base May 19, 2022
@fruttasecca fruttasecca deleted the improv/orchest-api-analytics branch May 19, 2022 10:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement An improvement or enhancement to an existing feature.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants