Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve IAMC endpoint query performance #56

Merged
merged 8 commits into from
Mar 6, 2024

Conversation

meksor
Copy link
Contributor

@meksor meksor commented Mar 6, 2024

In use some performance problems have been identified with the iamc endpoints in some production databases.
The culprits seem to be queries like this:

select
	distinct model.name,
	model.created_at,
	model.created_by,
	model.id
from
	model
join run on
	run.model__id = model.id
join iamc_timeseries on
	iamc_timeseries.run__id = run.id
join iamc_datapoint_universal on
	iamc_datapoint_universal.time_series__id = iamc_timeseries.id
order by
	model.id asc

If the iamc_datapoint table is very big this query takes quite a long time. A simple fix i found was to use a subquery instead of a join:

select
	distinct model.name,
	model.created_at,
	model.created_by,
	model.id
from
	model
join run on
	run.model__id = model.id
join iamc_timeseries on
	iamc_timeseries.run__id = run.id
where 
	exists (select * from iamc_datapoint_universal where iamc_datapoint_universal.time_series__id = iamc_timeseries.id  )
order by
	model.id asc

This is about x10 faster on my dataset, but still not ideal. The optimization i went with was the following:
First add some custom logic to the datapoint deletion: When datapoints are deleted, orphaned timeseries rows should be deleted. That way i can completely avoid the join on the iamc_datapoints table and just join the timeseries table:

select
	distinct model.name,
	model.created_at,
	model.created_by,
	model.id
from
	model
join run on
	run.model__id = model.id
join iamc_timeseries on
	iamc_timeseries.run__id = run.id
order by
	model.id asc

Query time is down to 300ms from 10s :)

This PR also adds test for the backend repository count function.

Copy link
Member

@danielhuppmann danielhuppmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

But is there a way to write a test for the new behavior, to ensure that the timeseries-row is indeed deleted (but only if all datapoints are removed).

@meksor
Copy link
Contributor Author

meksor commented Mar 6, 2024

Good point, i added a check to do_run_datapoints so all the iamc data tests also check if the timeseries get deleted.
Also found a better approach to do the deletion.

Copy link
Member

@danielhuppmann danielhuppmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks great!

@meksor meksor merged commit 6e04ed3 into main Mar 6, 2024
6 checks passed
@meksor meksor deleted the improvement/iamc-query-performance branch March 6, 2024 11:48
@glatterf42 glatterf42 mentioned this pull request Mar 15, 2024
3 tasks
@meksor meksor mentioned this pull request Mar 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants