
Performance Improvement Proposal for queries related with the Task Table #20505

Open
peresureda opened this issue May 28, 2024 · 23 comments · Fixed by #20545

Comments

@peresureda

peresureda commented May 28, 2024

Is your feature request related to a problem? Please describe.
When you have the Trivy scanner activated and want to perform one of the following operations:

  • List the artifacts of a project.
  • Execute a ScanAll.

The following query is executed against the database:

```sql
SELECT * FROM task WHERE extra_attrs::jsonb @> cast( $1 as jsonb )
```

This query becomes very heavy on the database CPU, and in some cases Harbor times out when the number of artifacts is large.

This is because the extra_attrs field is of type json, which does not support GIN indexing for containment queries.

Solution Proposal

Since the json and jsonb data types accept almost identical input and produce the same output, my recommendation is to change the column type to jsonb to allow indexing:

  1. ALTER TABLE task ALTER COLUMN extra_attrs SET DATA TYPE jsonb
  2. CREATE INDEX indx_task_jsonb ON task USING GIN(extra_attrs jsonb_path_ops);

Improvements
With this improvement, we have achieved the following results:

  • Reduced artifact listing time from 45 seconds to 2 seconds.
  • Reduced scan time by 80%.
  • Reduced CPU consumption during a ScanAll from 60% to 25%.
@Vad1mo
Member

Vad1mo commented May 28, 2024

Thank you for the improvement, this makes sense.

One question: does `ALTER TABLE task ALTER COLUMN extra_attrs SET DATA TYPE jsonb` convert the existing data?

@peresureda
Author

Hi @Vad1mo ,

Important points:

  1. jsonb has been available since PostgreSQL 9.4 (2014).
  2. PostgreSQL offers two types for storing JSON data: json and jsonb. The json and jsonb data types accept almost identical sets of values as input. The json data type stores an exact copy of the input text, which processing functions must reparse on each execution, while jsonb data is stored in a decomposed binary format that makes it slightly slower to input due to added conversion overhead, but significantly faster to process, since no reparsing is needed. jsonb also supports indexing, which can be a significant advantage.

In summary: you can migrate a json column to jsonb simply by altering the table. However, it's not possible to revert back, since jsonb does not store the JSON in its raw format. If you want to be able to revert, first make a copy of the table:
create table task_aux as select * from task;
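
The migration and its backup can be sketched end to end as follows (a minimal, illustrative sketch, not Harbor's official migration; the transaction wrapping and the index name follow the proposal above):

```sql
-- Illustrative migration sketch. Back up first: jsonb does not keep the
-- raw JSON text, so key order, duplicate keys and whitespace are lost.
BEGIN;

-- 1. Keep a copy so the change can be reverted if needed.
CREATE TABLE task_aux AS SELECT * FROM task;

-- 2. Convert the column in place; existing rows are rewritten as jsonb.
ALTER TABLE task ALTER COLUMN extra_attrs SET DATA TYPE jsonb;

-- 3. Add a GIN index so whole-column containment queries (@>) can use
--    an index scan instead of a sequential scan.
CREATE INDEX indx_task_jsonb ON task USING GIN (extra_attrs jsonb_path_ops);

COMMIT;
```

Reverting then means restoring the column contents from task_aux, since the original raw JSON text cannot be reconstructed from the jsonb values.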

@flbla
Contributor

flbla commented May 31, 2024

Hi,
we probably have the same issue.
Since we updated Harbor from 2.10.0 to 2.10.2, we see a lot of slowness when we access the artifacts page in the UI, and the same with the artifacts API.
We have seen huge CPU usage on our PostgreSQL database since the update.
It looks like it's due to this change: https://github.com/goharbor/harbor/pull/20169/files#diff-529262240b8fd8c4239ff9411c3c13ae66343e674ff7ce1fa4978f692079089eR118

EXPLAIN output for one of the `SELECT * FROM task` queries:
[screenshot]

Difference in CPU usage between today (yellow line) and the previous week (gray dotted line):
[screenshot]

@peresureda
Author

Thank you for the update. Now I think everything is clear. In any case, my recommendation is to change the data type (to jsonb) instead of performing the conversion inside the query. In our performance tests, changing the data type reduced CPU consumption by almost 50% compared to how it was implemented before.

@flbla
Contributor

flbla commented May 31, 2024

Hi @peresureda, we tried it on our staging env. Indeed, with the conversion to jsonb and the added index it's much faster. Thank you!

@wy65701436 @Vad1mo do you think this change could be included in a 2.10 version? Thank you.

@AlenversFr

AlenversFr commented May 31, 2024

> hi @peresureda, we tried it on our staging env, indeed with the conversion to jsonb and the add of the index it's much more faster, thank you
>
> @wy65701436 @Vad1mo do you think this change could be included in a 2.10 version ? thank you

Query analysis :

Before applying the fix :

```
                                           Table "public.task"
     Column      |            Type             | Collation | Nullable |             Default              
-----------------+-----------------------------+-----------+----------+----------------------------------
 id              | integer                     |           | not null | nextval('task_id_seq'::regclass)
 execution_id    | integer                     |           | not null | 
 job_id          | character varying(64)       |           |          | 
 status          | character varying(16)       |           | not null | 
 status_code     | integer                     |           | not null | 
 status_revision | integer                     |           |          | 
 status_message  | text                        |           |          | 
 run_count       | integer                     |           |          | 
 extra_attrs     | jsonb                       |           |          | 
 creation_time   | timestamp without time zone |           |          | now()
 start_time      | timestamp without time zone |           |          | 
 update_time     | timestamp without time zone |           |          | 
 end_time        | timestamp without time zone |           |          | 
 vendor_type     | character varying(64)       |           | not null | 
Indexes:
    "task_pkey" PRIMARY KEY, btree (id)
    "idx_task_extra_attrs_report_uuids" gin ((extra_attrs -> 'report_uuids'::text))
```


When you run a query like the one below, the EXPLAIN output indicates a sequential scan. It means the index "idx_task_extra_attrs_report_uuids" defined on the task table is not used.

```sql
EXPLAIN SELECT * FROM task WHERE extra_attrs::jsonb @> cast('{"report_uuids": ["6526b828-f6ef-4864-a27e-9b0dc38f1502"]}' as jsonb);
```

If you just apply:

```sql
ALTER TABLE task ALTER COLUMN extra_attrs SET DATA TYPE jsonb;
```

you might think the index will now be used, but that is not the case: it is still a sequential scan.

After applying the new index:

```sql
CREATE INDEX indx_task_jsonb ON task USING GIN(extra_attrs jsonb_path_ops);
```

Here is the result of the explain command : 
```explain SELECT * FROM task WHERE extra_attrs::jsonb @> cast('{"report_uuids": ["6526b828-f6ef-4864-a27e-9b0dc38f1502"]}' as jsonb );
                                                QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on task  (cost=8.01..10.02 rows=1 width=253)
   Recheck Cond: (extra_attrs @> '{"report_uuids": ["6526b828-f6ef-4864-a27e-9b0dc38f1502"]}'::jsonb)
   ->  Bitmap Index Scan on indx_task_jsonb  (cost=0.00..8.01 rows=1 width=0)
         Index Cond: (extra_attrs @> '{"report_uuids": ["6526b828-f6ef-4864-a27e-9b0dc38f1502"]}'::jsonb)
(4 rows)```
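
A note on why the ALTER alone changes nothing here: PostgreSQL only uses an expression index when the query predicate is written on that exact expression. The two indexes discussed in this thread therefore serve different query shapes (a sketch using the index names from this issue; the commented predicates are illustrative):

```sql
-- Expression index: serves predicates written on (extra_attrs -> 'report_uuids').
CREATE INDEX idx_task_extra_attrs_report_uuids
    ON task USING GIN ((extra_attrs -> 'report_uuids'));
-- matched by:
--   ... WHERE extra_attrs::jsonb -> 'report_uuids' @> '"6526b828-f6ef-4864-a27e-9b0dc38f1502"'::jsonb

-- Whole-column index: serves containment tests on extra_attrs itself.
CREATE INDEX indx_task_jsonb
    ON task USING GIN (extra_attrs jsonb_path_ops);
-- matched by:
--   ... WHERE extra_attrs @> '{"report_uuids": ["6526b828-f6ef-4864-a27e-9b0dc38f1502"]}'::jsonb
```

A whole-column `@>` test cannot be answered from the expression index, which is why the query in this issue falls back to a sequential scan until indx_task_jsonb exists.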

In our staging environment, here is the performance impact of the modification:

  • Number of rows in the task table: 13k
  • Average time of the painful query before: 175 ms/query
  • Average time of the painful query after (ALTER + INDEX): 0.25 ms/query

When the task table is heavily loaded (more than 300k rows, like our production), the average for this query is around 1 s.

Modifying the type of the column means we have to check all the internal services of Harbor that might use this table.
It's not easy to evaluate the impact.

Here is a query that helps to validate the correct interactions between the task and execution tables:

```sql
WITH task_error as (
    select b.execution_id,count(b.id) as cnt from task as b where b.status='Error' group by b.vendor_type, b.execution_id
), task_success as (
    select b.execution_id,count(b.id) as cnt from task as b where b.status='Success' group by b.vendor_type, b.execution_id
), task_stopped as (
    select b.execution_id,count(b.id) as cnt from task as b where b.status='Stopped' group by b.vendor_type, b.execution_id
)
select a.vendor_type, to_char(a.start_time,'YYYY-MM-DD'), a.status,
 count(distinct a.id) as nb_execution,
 SUM(task_error.cnt) as task_error_cnt,
 SUM(task_success.cnt) as task_success_cnt,
 SUM(task_stopped.cnt) as task_stopped_cnt
from execution as a
left join task_error on task_error.execution_id=a.id
left join task_success on task_success.execution_id=a.id
left join task_stopped on task_stopped.execution_id=a.id
group by a.vendor_type, to_char(a.start_time,'YYYY-MM-DD'), a.status
order by to_char(a.start_time,'YYYY-MM-DD') desc limit 20;
```
 
In order to validate the performance fix, this query should indicate no regression for any execution and task dependencies.

As of now, in our staging environment the fix seems to have no negative impact, but it needs to be confirmed by the maintainers.

```
   vendor_type     |  to_char   | status  | nb_execution | task_error_cnt | task_success_cnt | task_stopped_cnt 
--------------------+------------+---------+--------------+----------------+------------------+------------------
 EXECUTION_SWEEP    | 2024-05-31 | Success |            3 |                |                3 |                 
 GARBAGE_COLLECTION | 2024-05-31 | Success |            1 |                |                1 |                 
 IMAGE_SCAN         | 2024-05-31 | Running |            7 |                |                  |                 
 REPLICATION        | 2024-05-31 | Success |           10 |                |               16 |                 
 SCAN_ALL           | 2024-05-31 | Running |            1 |                |             2082 |                 
 SCHEDULER          | 2024-05-31 | Running |            1 |                |                  |  
```
 
An easy way to reproduce the performance issue is to create a replication that scans all the Harbor projects.
Example of a configuration with a single check on a label:
 
![image](https://github.com/goharbor/harbor/assets/57966036/ee5bec28-e14a-4202-bdd4-bd3fd0cb4a33)


The fix seems legitimate and operational; we just need confirmation from the maintainers and to wait for a new release.

stonezdj added a commit to stonezdj/harbor that referenced this issue Jun 4, 2024
…trs_report_uuids

 fixes goharbor#20505

Signed-off-by: stonezdj <stone.zhang@broadcom.com>
stonezdj added a commit that referenced this issue Jun 4, 2024
…_task_extra_at… (#20546)

Adjust the query by UUID sql so that it can use the idx_task_extra_attrs_report_uuids

  fixes #20505

Signed-off-by: stonezdj <stone.zhang@broadcom.com>
@Vad1mo
Member

Vad1mo commented Jun 4, 2024

@stonezdj @wy65701436, thank you for the swift fix of the issues.

It is not clear to me why, in addition to changing the query, we cannot also change the data type for generally better performance:

```sql
ALTER TABLE task ALTER COLUMN extra_attrs SET DATA TYPE jsonb;
CREATE INDEX indx_task_jsonb ON task USING GIN(extra_attrs jsonb_path_ops);
```

@peresureda
Author

We are reverting to the previous situation, where we cast the object on each execution in order to use an index. Tests show that CPU consumption decreased dramatically by avoiding this cast and converting the data type. Am I missing any place where you need this field to be json and it cannot be migrated to jsonb?

@stonezdj, @wy65701436, any thoughts?

stonezdj added a commit to stonezdj/harbor that referenced this issue Jun 4, 2024
…trs_report_uuids

 fixes goharbor#20505

Signed-off-by: stonezdj <stone.zhang@broadcom.com>
@Vad1mo
Member

Vad1mo commented Jun 4, 2024

I would also suggest transitioning to the jsonb data type; it is clearly the recommended option for dealing with JSON in PostgreSQL, specifically for performance reasons. There are also other opportunities to switch to jsonb.

@stonezdj
Contributor

stonezdj commented Jun 5, 2024

The PR merged in 2.11 just fixed the unused-index issue. For converting the json column to jsonb and adding more indexes, we need to do more testing to verify it.

stonezdj added a commit to stonezdj/harbor that referenced this issue Jun 5, 2024
…trs_report_uuids

 fixes goharbor#20505

Signed-off-by: stonezdj <stone.zhang@broadcom.com>
@wy65701436
Contributor

wy65701436 commented Jun 5, 2024

Since we didn't encounter any performance issues prior to v2.10.1, the index on report_uuids has proven beneficial. The recent commit did not leverage this existing index, leading to performance problems.

The fix is to rewrite the SQL to ensure the DB engine utilizes the report_uuids index. This approach avoids updating the DB schema, which would require additional upgrade work and testing.

This fix will be in v2.9.5, v2.10.3 and v2.11.0.
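
The rewrite described here amounts to moving the predicate onto the indexed expression. A sketch of the two query shapes (the `<uuid>` value is a placeholder; the exact ORM change is in the linked PRs):

```sql
-- Old shape: containment test on the whole column; the expression index
-- on (extra_attrs -> 'report_uuids') cannot serve this predicate.
SELECT * FROM task
WHERE extra_attrs::jsonb @> cast('{"report_uuids": ["<uuid>"]}' as jsonb);

-- Rewritten shape: the predicate is on the indexed expression itself, so
-- the existing idx_task_extra_attrs_report_uuids GIN index can be used
-- without any schema change.
SELECT * FROM task
WHERE extra_attrs::jsonb -> 'report_uuids' @> '"<uuid>"'::jsonb;
```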

@Vad1mo
Member

Vad1mo commented Jun 5, 2024

@stonezdj @wy65701436, thank you for the additional info; as a quick fix this is a good way forward.

However, I am quite confident that we can optimize this further.

@flbla @thcdrt @peresureda, you have rather large data sets; can you provide some data points?

This will also open up opportunities for further improvements, as we have other json data that we query.

stonezdj added a commit to stonezdj/harbor that referenced this issue Jun 6, 2024
…trs_report_uuids

 fixes goharbor#20505

Signed-off-by: stonezdj <stone.zhang@broadcom.com>
stonezdj added a commit that referenced this issue Jun 6, 2024
…_task_extra_at… (#20551)

Adjust the query by UUID sql so that it can use the idx_task_extra_attrs_report_uuids

 fixes #20505

Signed-off-by: stonezdj <stone.zhang@broadcom.com>
stonezdj added a commit that referenced this issue Jun 6, 2024
…_task_extra_at… (#20552)

Adjust the query by UUID sql so that it can use the idx_task_extra_attrs_report_uuids

 fixes #20505

Signed-off-by: stonezdj <stone.zhang@broadcom.com>
stonezdj added a commit that referenced this issue Jun 11, 2024
#20545)

Adjust the query by UUID sql so that it can use the idx_task_extra_attrs_report_uuids

 fixes #20505

Signed-off-by: stonezdj <stone.zhang@broadcom.com>
@peresureda
Author

Hi @Vad1mo

This morning I ran extra tests with the different scenarios discussed, and I can provide more data:

Execution plan:

```sql
EXPLAIN SELECT * FROM task WHERE extra_attrs::jsonb -> 'report_uuids' @> '{"artifact_id":10216,"artifact_tag":"1.585.0","report_uuids":["94aed6ca-8ac4-42c8-9325-80e70bf8040f"],"robot_id":14160}'
```

```
Bitmap Heap Scan on task  (cost=68.01..75.71 rows=2 width=220)
  Recheck Cond: (((extra_attrs)::jsonb -> 'report_uuids'::text) @> '{"robot_id": 14160, "artifact_id": 10216, "artifact_tag": "1.585.0", "report_uuids": ["94aed6ca-8ac4-42c8-9325-80e70bf8040f"]}'::jsonb)
  ->  Bitmap Index Scan on idx_task_extra_attrs_report_uuids  (cost=0.00..68.01 rows=2 width=0)
        Index Cond: (((extra_attrs)::jsonb -> 'report_uuids'::text) @> '{"robot_id": 14160, "artifact_id": 10216, "artifact_tag": "1.585.0", "report_uuids": ["94aed6ca-8ac4-42c8-9325-80e70bf8040f"]}'::jsonb)
```

  • Applied fix https://github.com/goharbor/harbor/pull/20545 + change of DATA TYPE to jsonb: this case doesn't make sense, because even if we change the data type, if the query is the one provided in the fix the performance is the same, since the way the query is written doesn't use the jsonb index.

```sql
EXPLAIN SELECT * FROM task WHERE extra_attrs::jsonb -> 'report_uuids' @> '{"artifact_id":10216,"artifact_tag":"1.585.0","report_uuids":["94aed6ca-8ac4-42c8-9325-80e70bf8040f"],"robot_id":14160}'
```

```
Bitmap Heap Scan on task  (cost=72.01..79.73 rows=2 width=250)
  Recheck Cond: ((extra_attrs -> 'report_uuids'::text) @> '{"robot_id": 14160, "artifact_id": 10216, "artifact_tag": "1.585.0", "report_uuids": ["94aed6ca-8ac4-42c8-9325-80e70bf8040f"]}'::jsonb)
  ->  Bitmap Index Scan on idx_task_extra_attrs_report_uuids  (cost=0.00..72.01 rows=2 width=0)
        Index Cond: ((extra_attrs -> 'report_uuids'::text) @> '{"robot_id": 14160, "artifact_id": 10216, "artifact_tag": "1.585.0", "report_uuids": ["94aed6ca-8ac4-42c8-9325-80e70bf8040f"]}'::jsonb)
```

  • Proposal: looking at all the results, the real performance improvement comes from combining the previous query with the proposed change, as we then use the jsonb enhancements provided by PostgreSQL. In this case, the previous ListAll is reduced to 2-3 seconds.

```sql
EXPLAIN SELECT * FROM task WHERE extra_attrs::jsonb @> cast( '{"artifact_id":10216,"artifact_tag":"1.585.0","report_uuids":["94aed6ca-8ac4-42c8-9325-80e70bf8040f"],"robot_id":14160}' as jsonb )
```

```
Bitmap Heap Scan on task  (cost=40.01..47.72 rows=2 width=250)
  Recheck Cond: (extra_attrs @> '{"robot_id": 14160, "artifact_id": 10216, "artifact_tag": "1.585.0", "report_uuids": ["94aed6ca-8ac4-42c8-9325-80e70bf8040f"]}'::jsonb)
  ->  Bitmap Index Scan on indx_task_jsonb  (cost=0.00..40.01 rows=2 width=0)
        Index Cond: (extra_attrs @> '{"robot_id": 14160, "artifact_id": 10216, "artifact_tag": "1.585.0", "report_uuids": ["94aed6ca-8ac4-42c8-9325-80e70bf8040f"]}'::jsonb)
```

@Vad1mo Vad1mo reopened this Jun 14, 2024
@flbla
Contributor

flbla commented Jun 14, 2024

Hi @Vad1mo,

I built a version 2.10 with the rollback of the SELECT FROM task query.
I deployed it on my staging environment: ~7K artifacts (for comparison, our prod env has around 125K artifacts).
I reset my pg_stat_statements table before I started my tests.

The size of the task table when I started my tests was 11313 rows.

| Query | Tot_Calls | Tot_Time | Avg_Time |
| --- | --- | --- | --- |
| SELECT * FROM task WHERE extra_attrs::jsonb -> $2 @> $1 | 1056 | 87428.31 ms | 82.79 ms |

After I launched some "GC", "Scan All" and "Replications" actions :
Size of table task : 10408 lines

| Query | Tot_Calls | Tot_Time | Avg_Time |
| --- | --- | --- | --- |
| SELECT * FROM task WHERE extra_attrs::jsonb -> $2 @> $1 | 10012 | 1864485.13 ms | 186.23 ms |

Then I tested with the same built image + change of DATA TYPE to jsonb + added indexes + reset of my pg_stat_statements table:

```sql
ALTER TABLE task ALTER COLUMN extra_attrs SET DATA TYPE jsonb;
CREATE INDEX indx_task_jsonb ON task USING GIN(extra_attrs jsonb_path_ops);
CREATE INDEX idx_task_extra_attrs_report_uuids ON public.task USING gin (((extra_attrs -> 'report_uuids'::text)));
```

| Query | Tot_Calls | Tot_Time | Avg_Time |
| --- | --- | --- | --- |
| SELECT * FROM task WHERE extra_attrs::jsonb -> $2 @> $1 | 406 | 185.97 ms | 0.46 ms |

After I launched some "GC", "Scan All" and "Replications" actions :
Size of table task : 10414 lines

| Query | Tot_Calls | Tot_Time | Avg_Time |
| --- | --- | --- | --- |
| SELECT * FROM task WHERE extra_attrs::jsonb -> $2 @> $1 | 8526 | 2642.18 ms | 0.31 ms |

As you can see, the average time:

  • BEFORE the switch to jsonb + creation of the indexes, it increases with the number of actions done in Harbor.
  • AFTER the switch to jsonb + creation of the indexes, it is less than 0.5 ms and does not change with the number of actions done in Harbor.
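
For anyone reproducing these measurements: they come from the pg_stat_statements extension mentioned above. A sketch of the workflow (the column names assume PostgreSQL 13+, where the timing columns are total_exec_time/mean_exec_time; older versions call them total_time/mean_time, and the filter pattern here is illustrative):

```sql
-- Reset the counters before each test run.
SELECT pg_stat_statements_reset();

-- ... exercise Harbor (GC, Scan All, replications) ...

-- Read totals and averages (in milliseconds) for the task queries.
SELECT query,
       calls           AS tot_calls,
       total_exec_time AS tot_time,
       mean_exec_time  AS avg_time
FROM pg_stat_statements
WHERE query ILIKE '%FROM task%'
ORDER BY total_exec_time DESC
LIMIT 10;
```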

@flbla
Contributor

flbla commented Jun 14, 2024

I then switched back to the 2.10.2 version (the version with the slowness), kept the jsonb column and indexes, and reset pg_stat_statements:
task table size: 10415 rows

| Query | Tot_Calls | Tot_Time | Avg_Time |
| --- | --- | --- | --- |
| SELECT * FROM task WHERE extra_attrs::jsonb @> cast( $1 as j | 61 | 15.21 ms | 0.25 ms |

after some actions on harbor (scan all, gc, replications, ...) :
table task size : 12282 lines

| Query | Tot_Calls | Tot_Time | Avg_Time |
| --- | --- | --- | --- |
| SELECT * FROM task WHERE extra_attrs::jsonb @> cast( $1 as j | 3520 | 1545.42 ms | 0.44 ms |

```sql
EXPLAIN SELECT * FROM task WHERE extra_attrs::jsonb @> cast('{"report_uuids": ["90246945-5f4d-4b04-b57a-c89a0283b487"]}' as jsonb);
```

```
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on task  (cost=8.01..10.02 rows=1 width=254)
   Recheck Cond: (extra_attrs @> '{"report_uuids": ["90246945-5f4d-4b04-b57a-c89a0283b487"]}'::jsonb)
   ->  Bitmap Index Scan on indx_task_jsonb  (cost=0.00..8.01 rows=1 width=0)
         Index Cond: (extra_attrs @> '{"report_uuids": ["90246945-5f4d-4b04-b57a-c89a0283b487"]}'::jsonb)
(4 rows)
```

@flbla
Contributor

flbla commented Jun 14, 2024

So from my point of view, it could be good to do the following:

```sql
ALTER TABLE task ALTER COLUMN extra_attrs SET DATA TYPE jsonb;
CREATE INDEX indx_task_jsonb ON task USING GIN(extra_attrs jsonb_path_ops);
CREATE INDEX idx_task_extra_attrs_report_uuids ON public.task USING gin (((extra_attrs -> 'report_uuids'::text)));
```

@wy65701436
Contributor

We have reverted to using the index defined in the execution table. Please upgrade to the latest version (v2.11) to validate the performance changes.

If the performance still does not meet expectations, we can continue considering the use of the JSONB type. However, based on our testing results, we have observed a significant performance improvement after reverting.

mohamedawnallah pushed a commit to mohamedawnallah/harbor that referenced this issue Jul 2, 2024
goharbor#20545)

Adjust the query by UUID sql so that it can use the idx_task_extra_attrs_report_uuids

 fixes goharbor#20505

Signed-off-by: stonezdj <stone.zhang@broadcom.com>
@peresureda
Author

Hi @wy65701436 ,

After reverting the manual changes (changing the type from jsonb back to json and removing the index) and applying the patch, we have observed that the performance of the ScanAll process is very similar to what we had with the jsonb + index improvement. Where there is a small difference is in resource usage: as we mentioned, the execution cost of having to transform to jsonb at runtime is higher than already having the data in that format.

CPU usage during ScanAll before the upgrade, with jsonb + index:
Screenshot 2024-07-11 at 09 53 22

CPU usage during ScanAll with 2.11, without the manual improvements:
Screenshot 2024-07-11 at 09 53 36

From here, the decision is yours.
Regards!

@wy65701436
Contributor

@peresureda thanks for your data; it's helpful for our further investigation.

The performance problem was resolved after upgrading to v2.11, but you still see higher CPU usage, do I understand correctly?
If so, I will investigate further.

cc @stonezdj

@stonezdj
Contributor

stonezdj commented Aug 14, 2024

I created 1 million records in the task table and ran the following test.

| Operation/query | Before (json) | After (jsonb + GIN index) |
| --- | --- | --- |
| ScanAll | 25s | 26s |
| SELECT * FROM task WHERE extra_attrs::jsonb -> 'report_uuids' @> $1 | 0.7ms | 0.6ms |

```
(Before) Aug 14 05:20:20 ubuntu core[2185963]: [ORM]2024/08/14 09:20:20  -[Queries/default] - [  OK /    db.Query /     0.7ms] - [SELECT * FROM task WHERE extra_attrs::jsonb -> 'report_uuids' @> $1] - `"0e9f119d-30ba-46b6-9b79-4f7df6b06cea"`
(After)  Aug 14 05:22:30 ubuntu core[2185963]: [ORM]2024/08/14 09:22:30  -[Queries/default] - [  OK /    db.Query /     0.6ms] - [SELECT * FROM task WHERE extra_attrs::jsonb -> 'report_uuids' @> $1] - `"0e9f119d-30ba-46b6-9b79-4f7df6b06cea"`
```

The CPU/memory comparison:

Before:
[screenshot: before_cpu_memory]

After:
[screenshot: after_cpu_memory]

ScanAll test with 10000 artifacts:
[screenshot: before-after]

The performance improvement doesn't seem very significant.

@flbla
Contributor

flbla commented Aug 14, 2024

@stonezdj
Did you create both indexes?

```sql
CREATE INDEX indx_task_jsonb ON task USING GIN(extra_attrs jsonb_path_ops);
CREATE INDEX idx_task_extra_attrs_report_uuids ON public.task USING gin (((extra_attrs -> 'report_uuids'::text)));
```

@stonezdj
Contributor

Yes. After the change it should be:

```
registry=# \d task;
                                           Table "public.task"
     Column      |            Type             | Collation | Nullable |             Default              
-----------------+-----------------------------+-----------+----------+----------------------------------
 id              | integer                     |           | not null | nextval('task_id_seq'::regclass)
 execution_id    | integer                     |           | not null | 
 job_id          | character varying(64)       |           |          | 
 status          | character varying(16)       |           | not null | 
 status_code     | integer                     |           | not null | 
 status_revision | integer                     |           |          | 
 status_message  | text                        |           |          | 
 run_count       | integer                     |           |          | 
 extra_attrs     | jsonb                       |           |          | 
 creation_time   | timestamp without time zone |           |          | CURRENT_TIMESTAMP
 start_time      | timestamp without time zone |           |          | 
 update_time     | timestamp without time zone |           |          | 
 end_time        | timestamp without time zone |           |          | 
 vendor_type     | character varying(64)       |           | not null | 
Indexes:
    "task_pkey" PRIMARY KEY, btree (id)
    "idx_task_extra_attrs_report_uuids" gin ((extra_attrs -> 'report_uuids'::text))
    "idx_task_job_id" btree (job_id)
    "indx_task_jsonb" gin (extra_attrs jsonb_path_ops)
    "task_execution_id_idx" btree (execution_id)
Foreign-key constraints:
    "task_execution_id_fkey" FOREIGN KEY (execution_id) REFERENCES execution(id)
```

@flbla
Contributor

flbla commented Aug 16, 2024

I don't remember about the CPU and memory, but in my previous tests I had much better times with jsonb + GIN indexes: #20505 (comment)
Did you try to run `EXPLAIN SELECT * FROM task WHERE extra_attrs::jsonb -> 'report_uuids' @> '"0e9f119d-30ba-46b6-9b79-4f7df6b06cea"'` to be sure it uses the correct indexes?
