Refactor GraphQL backend #964

Merged
merged 35 commits into main from chore/refactor-graphql
Jul 19, 2022

Conversation

antonymilne
Contributor

@antonymilne antonymilne commented Jul 7, 2022

Description

Originally I was just doing a bit of tidying to enable #958, but the deeper I got the more I decided that our current experiment tracking backend was not up to scratch. It's still not perfect, but it's a lot better and we now have some clear steps to improve it further.

The only new functionality introduced here is the ability to query by group, which isn't yet used on the frontend. Following this PR, though, the backend should be much easier to extend, e.g. to add plots.

Renaming and restructuring

api has been broken into two new folders that reflect the GraphQL and REST parts of the app. All GraphQL related code was previously in graphql.py; this has now been broken down into several separate files.

api
├── __init__.py
├── apps.py
├── graphql
│   ├── __init__.py
│   ├── router.py
│   ├── schema.py
│   ├── serializers.py
│   └── types.py
└── rest
    ├── __init__.py
    ├── responses.py
    └── router.py

Some naming has been aligned (e.g. experiments_tracking.py ➡️ experiment_tracking.py, graph.py ➡️ flowchart.py).

Separation of responsibility

To separate out responsibilities better we now have some new components on the backend:

  • in the data access layer, a new TrackingDatasetRepository
  • in the domain models, TrackingDatasetModel and TrackingDatasetGroup.

This means that, e.g., data loading is no longer leaking out into the API as it was before.

The way that tracking data is populated and retrieved is worth documenting here:

  • initially, populate_data is called when the app is first loaded, and TrackingDatasetRepository is populated by data_access_manager.add_catalog
  • this adds tracking datasets for the whole pipeline (i.e. the registered pipelines dropdown menu and --pipeline argument affect only the flowchart view)
  • If you add a new tracked dataset and do a kedro run then the new dataset won't show up in experiment tracking unless --autoreload is set, because populate_data won't be called again
  • the data for a run_id is only loaded (through dataset._load) when the GraphQL query run_tracking_data is run with that run_id. Subsequent run_tracking_data queries will not perform the load from disk again but instead read out of memory
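The load-once behaviour described in the last bullet can be sketched roughly as follows. This is only an illustration, not the code in this PR: LazyTrackingDataset, load_tracking_data and fake_loader are hypothetical names, and the loader stands in for whatever actually reads a run's data from disk.

```python
from typing import Any, Callable, Dict, List


class LazyTrackingDataset:
    """Hypothetical stand-in for a tracking dataset model that caches loaded runs."""

    def __init__(self, name: str, loader: Callable[[str], Any]) -> None:
        self.name = name
        self._loader = loader  # assumed to read one run's data from disk
        self.runs: Dict[str, Any] = {}  # run_id -> data that has already been loaded

    def load_tracking_data(self, run_id: str) -> Any:
        # Only call the loader the first time a given run_id is requested.
        if run_id not in self.runs:
            self.runs[run_id] = self._loader(run_id)
        return self.runs[run_id]


calls: List[str] = []


def fake_loader(run_id: str) -> Dict[str, Any]:
    # Records each call so we can see how often "disk" is touched.
    calls.append(run_id)
    return {"run_id": run_id, "accuracy": 0.9}


dataset = LazyTrackingDataset("metrics", fake_loader)
first = dataset.load_tracking_data("2022-07-07T10.00.00.000Z")
second = dataset.load_tracking_data("2022-07-07T10.00.00.000Z")
```

The second call returns the cached object without calling the loader again, which is also why data that changes on disk is not picked up until the repository is repopulated.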

Query by group

The runTrackingData query now accepts a group argument with type TrackingDatasetGroup:

enum TrackingDatasetGroup {
  METRIC
  JSON
}

This is not yet used on the frontend, but it will be very useful for querying by group without needing another layer in the tracking datasets hierarchy, e.g.

query QueryMetricsJSON {
  metric: runTrackingData(group: METRIC, runIds: ...) {
    data
    datasetName
    datasetType
  }
  json: runTrackingData(group: JSON, runIds: ...) {
    data
    datasetName
    datasetType
  }
}
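On the Python side, filtering by group could look something like the sketch below. It assumes a simple mapping from dataset type to group; the type strings, TrackingDatasetRecord and get_by_group are illustrative names only, and the enum here is a plain Python Enum rather than the strawberry one.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class TrackingDatasetGroup(Enum):
    METRIC = "metric"
    JSON = "json"


# Illustrative mapping only; the real dataset type strings may differ.
GROUP_BY_DATASET_TYPE = {
    "tracking.MetricsDataSet": TrackingDatasetGroup.METRIC,
    "tracking.JSONDataSet": TrackingDatasetGroup.JSON,
}


@dataclass
class TrackingDatasetRecord:
    """Hypothetical record holding just what the filter needs."""

    dataset_name: str
    dataset_type: str


def get_by_group(
    records: List[TrackingDatasetRecord], group: TrackingDatasetGroup
) -> List[TrackingDatasetRecord]:
    # Keep only the datasets whose type maps to the requested group.
    return [r for r in records if GROUP_BY_DATASET_TYPE.get(r.dataset_type) is group]


records = [
    TrackingDatasetRecord("metrics", "tracking.MetricsDataSet"),
    TrackingDatasetRecord("companies_columns", "tracking.JSONDataSet"),
]
metric_names = [r.dataset_name for r in get_by_group(records, TrackingDatasetGroup.METRIC)]
json_names = [r.dataset_name for r in get_by_group(records, TrackingDatasetGroup.JSON)]
```

Adding a new group (e.g. plots) would then mean adding one enum member and one mapping entry, with no change to the hierarchy.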

Tidying implementation

Lots of small changes to implementation just to make things tidier, e.g. using json.JSONDataSet._load rather than re-writing it; removing unnecessary custom strawberry type JSONObject; putting strawberry resolvers directly as class methods rather than external functions.

Documentation

The backend (specifically strawberry) defines the GraphQL schema and writes it to schema.graphql and generates a png from it:
[image: diagram generated from the GraphQL schema]

A new CI check ensures this is always up to date.

All parts of the schema now have descriptions. Unlike docstrings, these are rendered in GraphiQL:
[image: schema descriptions rendered in GraphiQL]

Next steps

Write lots of tests. Improve the way we do GraphQL tests in general (various ideas here).

For adding plots to experiment tracking:

  • Add query by group as TrackingDatasetGroup and produce existing behaviour for metrics and JSON.
  • Try to align TrackingDatasetModel and TrackingDataset. Consider a model for run, which would simplify (maybe even remove) format_run_tracking_data. How to query by run_id correctly?
  • Might be worth doing a new model for each TrackingDatasetRun in strawberry but not a dataclass model. Maybe do this as GraphQL interface with different implementations, serializers for plots, etc.

Important other refactoring:

  • Reuse DataNode and DataNodeMetadata models. There's too much duplication between these and tracking datasets.
  • Better system for check_db_session, e.g. decorator argument that returns empty iterable (could be done automatically from type hint)? Null class?
  • Consider whether is_tracking_dataset should use isinstance instead, but be careful with imports
  • Think about serializers. Is format_runs needed? Should formatting go into constructor or class method? Are they needed at all?
  • Consider structure of GraphQL models and response. e.g. why isn't TrackedDataSets a field in Run?

QA notes

Extended Python GraphQL query tests to be more e2e and accommodate new structure ✅
Manually tested ✅
Tested schema generated and CI check works ✅

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added new entries to the RELEASE.md file
  • Added tests to cover my changes

Signed-off-by: Antony Milne <antony.milne@quantumblack.com>
…re/refactor-graphql

Signed-off-by: Antony Milne <antony.milne@quantumblack.com>
@@ -49,7 +54,7 @@ class JSONObject(dict):
description="Generic scalar type representing a JSON object",
)


# TODO: where should format functions go?
Collaborator

hey so this is what I usually call serializers in API development. A serializer takes data in one format (domain model) and serializes it into another format (GraphQL type or FastAPI Response Model), so a file called graphql_serializers might be a good idea.

Contributor Author

Thanks Lim! This is exactly what I was thinking.

My other idea was that it should live in the Run type itself. It seems that strawberry.scalar has a serialize option but strawberry.type doesn't, so would need to do this as a method:

@strawberry.type
class Run:
    author: Optional[str]
    bookmark: Optional[bool]
    ...

    def format(self):
        ...

What do you think? Is this a horrible misunderstanding of the purpose of these objects (maybe there's a good reason that strawberry.type doesn't have serialize) or a good way to do it?

Collaborator

Yeah, one downside I can think of is unit testing. Ideally you would want to be able to test the serialization logic, which is mostly standalone and stateless, without having to mock out the whole API.

Contributor Author

This is what I suspected, thank you for confirming!
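To make the testing point concrete, here is a minimal sketch of a standalone serializer and its unit test. RunModel, RunType and serialize_run are hypothetical names, and plain dataclasses stand in for the domain model and the strawberry type, so nothing here depends on the API being set up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunModel:
    """Hypothetical domain model, e.g. as read from the session store."""

    id: str
    title: Optional[str] = None
    bookmark: Optional[bool] = None


@dataclass
class RunType:
    """Hypothetical stand-in for the GraphQL type returned by the API."""

    id: str
    title: str
    bookmark: bool


def serialize_run(run: RunModel) -> RunType:
    # Stateless conversion: fill in defaults, no access to the schema or app.
    return RunType(id=run.id, title=run.title or run.id, bookmark=bool(run.bookmark))


def test_serialize_run_fills_defaults() -> None:
    result = serialize_run(RunModel(id="2022-07-08T09.00.00.000Z"))
    assert result.title == "2022-07-08T09.00.00.000Z"
    assert result.bookmark is False


test_serialize_run_fills_defaults()
```

Because the function is a free function rather than a method on the type, the test needs no schema, router or mocked request.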

Signed-off-by: Antony Milne <antony.milne@quantumblack.com>
@antonymilne
Contributor Author

antonymilne commented Jul 8, 2022

Here's how I've reorganised things:

api
├── __init__.py
├── apps.py
├── graphql
│   ├── __init__.py
│   ├── router.py
│   ├── schema.py
│   ├── serializers.py
│   └── types.py
└── rest
    ├── __init__.py
    ├── responses.py
    └── router.py

My proposal for tests would be:

  • router is very small and doesn't need tests
  • types doesn't need testing
  • schema methods (query, mutation, subscription) largely delegate to data_access_manager.get... (already covered by unit tests) and then call format functions on the results if required. Tests are covered by e2e-style query tests that we already have
  • serializers should be unit tested - currently missing

What do you think? Please tell me if this sounds right to you 🙏 Thank you very much! @limdauto

Signed-off-by: Antony Milne <antony.milne@quantumblack.com>
antonymilne and others added 4 commits July 18, 2022 14:53
Signed-off-by: Antony Milne <antony.milne@quantumblack.com>
…re/refactor-graphql

Signed-off-by: Antony Milne <antony.milne@quantumblack.com>
Signed-off-by: Antony Milne <antony.milne@quantumblack.com>
Contributor

@rashidakanchwala rashidakanchwala left a comment

thanks for this <3 -- amazinggggggg!

Member

@tynandebold tynandebold left a comment

Legend! Thank you so much 💥 I didn't see any issues when testing the app. All worked as it should.

Does this work warrant a line in the release.md file?

@tynandebold tynandebold self-requested a review July 19, 2022 12:26
Member

@tynandebold tynandebold left a comment

Helping Rashida with her next-step PR, I'm seeing an issue: the metadata and tracking data aren't showing up in the proper order anymore. It may be difficult to explain here. Let's sync and I'll show you to see if you see it too.

Signed-off-by: Antony Milne <antony.milne@quantumblack.com>
Signed-off-by: Antony Milne <antony.milne@quantumblack.com>
Signed-off-by: Antony Milne <antony.milne@quantumblack.com>
@tynandebold tynandebold self-requested a review July 19, 2022 16:06
Member

@tynandebold tynandebold left a comment

Brilliant!

Collaborator

@limdauto limdauto left a comment

Sorry I didn't have a lot of time to look into this in detail, but the new structure looks great to me. Thank you!

Re testing logic: +1 to your proposal.

@@ -37,12 +37,22 @@ lint-check:
flake8 --config=package/.flake8 package
mypy --config-file=package/mypy.ini package

schema-fix:
Collaborator

curious: what's the need for this?

Contributor Author

This is what automatically generates the schema.graphql file and png from the strawberry schema. It's called schema-fix (bit of a weird name, I know) by analogy with make format-fix, which makes the make format-check CI pass. Similarly, make schema-fix makes the make schema-check CI pass; that check tests whether the schema file and diagram are up to date.
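Conceptually, the check/fix pair boils down to comparing the freshly generated schema with the committed file. The sketch below is not the Makefile target itself: schema_diff and fix_schema are hypothetical helpers, and the generated schema is passed in as a plain string rather than exported from strawberry.

```python
import difflib
import tempfile
from pathlib import Path


def schema_diff(generated_sdl: str, schema_path: Path) -> str:
    """Return a unified diff between the committed schema file and the generated one."""
    committed = schema_path.read_text() if schema_path.exists() else ""
    diff = difflib.unified_diff(
        committed.splitlines(keepends=True),
        generated_sdl.splitlines(keepends=True),
        fromfile=str(schema_path),
        tofile="generated",
    )
    return "".join(diff)  # empty string means the file is up to date


def fix_schema(generated_sdl: str, schema_path: Path) -> None:
    # The "fix" step simply overwrites the committed file.
    schema_path.write_text(generated_sdl)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "schema.graphql"
    path.write_text("type Query {\n  version: String!\n}\n")
    new_sdl = "type Query {\n  version: String!\n  runsList: [Run!]!\n}\n"

    stale = schema_diff(new_sdl, path) != ""  # check fails before the fix
    fix_schema(new_sdl, path)
    fresh = schema_diff(new_sdl, path) == ""  # check passes after the fix
```

A CI check along these lines would fail whenever the diff is non-empty, and running the fix locally would make it pass again.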

@limdauto
Collaborator

@AntonyMilneQB if you are fishing for your next project, I'd recommend drilling into this one:

If you add a new tracked dataset and do a kedro run then the new dataset won't show up in experiment tracking unless --autoreload is set, because populate_data won't be called again

This is related to this issue that I opened a few weeks ago: #872 -- the ideal flow in my head is:

  • populate_data will populate a sqlite db through the repositories in the data access layer on first run. Think of the sqlite as a backend for the data access manager. Currently it's an in-memory backend.
  • Then we listen to changes on the Kedro project and add new entries to this database without having to re-run the whole populate_data like --autoreload is currently doing.
  • The API layer can keep polling this DB and return new data to the client via subscription.
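A very rough sketch of that flow, using only the standard library sqlite3 module: SqliteTrackingRepository, the table layout and fetch_since are all hypothetical, and the demo uses the ":memory:" path only so that it runs without touching the filesystem.

```python
import json
import sqlite3
from typing import Any, Dict, List


class SqliteTrackingRepository:
    """Hypothetical repository that stores tracking data in SQLite."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS tracking_data ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "dataset_name TEXT NOT NULL, "
            "run_id TEXT NOT NULL, "
            "data TEXT NOT NULL)"
        )

    def add(self, dataset_name: str, run_id: str, data: Dict[str, Any]) -> int:
        # Called once per entry by the initial population and by a change listener.
        cursor = self._conn.execute(
            "INSERT INTO tracking_data (dataset_name, run_id, data) VALUES (?, ?, ?)",
            (dataset_name, run_id, json.dumps(data)),
        )
        self._conn.commit()
        return cursor.lastrowid

    def fetch_since(self, last_seen_id: int) -> List[Dict[str, Any]]:
        # The API layer can poll with the last id it has already sent to clients.
        rows = self._conn.execute(
            "SELECT id, dataset_name, run_id, data FROM tracking_data "
            "WHERE id > ? ORDER BY id",
            (last_seen_id,),
        ).fetchall()
        return [
            {"id": r[0], "dataset_name": r[1], "run_id": r[2], "data": json.loads(r[3])}
            for r in rows
        ]


repo = SqliteTrackingRepository()
first_id = repo.add("metrics", "run-1", {"accuracy": 0.9})  # initial population
repo.add("metrics", "run-2", {"accuracy": 0.95})  # entry added after a later kedro run
new_entries = repo.fetch_since(first_id)
```

Polling by id means only rows added since the last poll are returned, so new data could be pushed to the client without repopulating everything.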

@antonymilne
Contributor Author

@limdauto thanks so much for the comments, much appreciated! Makes sense about populate_data also - let me add that comment to #872 so I don't lose track of it.

@antonymilne antonymilne merged commit 45d81ef into main Jul 19, 2022
@antonymilne antonymilne deleted the chore/refactor-graphql branch July 19, 2022 16:57
rashidakanchwala added a commit that referenced this pull request Jul 20, 2022
This PR was created based on the backend refactor ticket. #964
The backend has now changed: runTrackingData can now also be queried by the group (dataset type) the tracking data belongs to, i.e. Metrics, JSON Data (and in future Plots).

In this ticket, we have adjusted the front-end to get information from the backend for each dataset. If a particular group (Metrics, JSON Dataset) has no datasets, that group will not be shown in the front-end. Otherwise, all run tracking data will now have another parent in the hierarchy/accordion, i.e. the type of dataset (Metrics, JSON Data, Plots).
Labels
Enhancement Python Pull requests that update Python code
Projects
Status: Done
4 participants