Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Not seeing huge performance gain - could be my query? #23

Closed
pholly opened this issue Aug 14, 2017 · 8 comments
Closed

Not seeing huge performance gain - could be my query? #23

pholly opened this issue Aug 14, 2017 · 8 comments

Comments

@pholly
Copy link

pholly commented Aug 14, 2017

Postgresql 9.6.4
Windows Server 2016 with 32GB ram / SSD

160K hits, Rum index query time: 80+ seconds

When querying a table partition with 500K rows for a search term that returns 160K hits I am not seeing a large performance increase using rum tsvector <=> vs gin ts_rank. tsvector column includes weights. Each document is pretty large, similar to Wikipedia. They are all legal documents.

Judging by explain (analyze, buffers) output, looks like majority of time is reading and not sorting. Could this be why I am not seeing performance gain? Also, output shows there's still a bitmap heap scan performed when searching the RUM index.

I clear shared buffers and restart Postgresql (thanks https://stackoverflow.com/a/43186594/812610) so there's nothing in memory.

Appreciate any help and thanks for putting RUM on github!

RUM tsvector column: tsv_natural_rum

Index: CREATE INDEX code_docs_fl_tsv_natural_rum_idx on code_docs_fl using rum (tsv_natural_rum rum_tsvector_ops)

Query:

SELECT id, node_id, 
            ts_headline('mcc_natural_config', title, query, 'highlightall=true') title, 
	    ts_headline('mcc_natural_config', content, query, 'maxfragments=1') fragment, 
            ancestors, product_id, product_name, client_name, state_abbr, 
 	    0 AS full_count, --window function for getting full count here but removed it for simplicity
            tsv_natural_rum <=> query AS rank 
FROM code_docs_fl, 
           (select ts_rewrite(plainto_tsquery('mcc_natural_config',  'code'), 'SELECT target, sub FROM 
                aliases_natural WHERE to_tsquery(''mcc_natural_config'',  ''code'') @> target') query) query
WHERE tsv_natural_rum @@ query
ORDER BY tsv_natural_rum <=> query
LIMIT 10 OFFSET 0

Result:

Count: 159,613
Limit  (cost=9279.80..9284.95 rows=10 width=435) (actual time=81780.797..82289.912 rows=10 loops=1)
  Buffers: shared hit=314429 read=190532
  ->  Result  (cost=9279.80..10576.57 rows=2518 width=435) (actual time=81780.786..82289.862 rows=10 loops=1)
        Buffers: shared hit=314429 read=190532
        ->  Sort  (cost=9279.80..9286.10 rows=2518 width=829) (actual time=81764.315..81764.391 rows=10 loops=1)
              Sort Key: ((code_docs_fl.tsv_natural_rum <=> (ts_rewrite('''code'''::tsquery, 'SELECT target, sub FROM aliases_natural WHERE to_tsquery(''mcc_natural_config'',  ''code'') @> target'::text))))
              Sort Method: top-N heapsort  Memory: 35kB
              Buffers: shared hit=314405 read=190481
              ->  Nested Loop  (cost=43.52..9225.39 rows=2518 width=829) (actual time=174.173..80714.142 rows=159613 loops=1)
                    Buffers: shared hit=314402 read=190481
                    ->  Result  (cost=0.00..0.26 rows=1 width=32) (actual time=2.055..2.057 rows=1 loops=1)
                          Buffers: shared hit=56 read=4
                    ->  Bitmap Heap Scan on code_docs_fl  (cost=43.52..9193.65 rows=2518 width=821) (actual time=163.860..18193.512 rows=159613 loops=1)
                          Recheck Cond: (tsv_natural_rum @@ (ts_rewrite('''code'''::tsquery, 'SELECT target, sub FROM aliases_natural WHERE to_tsquery(''mcc_natural_config'',  ''code'') @> target'::text)))
                          Heap Blocks: exact=68933
                          Buffers: shared hit=4 read=69085
                          ->  Bitmap Index Scan on code_docs_fl_tsv_natural_rum_idx  (cost=0.00..42.89 rows=2518 width=0) (actual time=125.595..125.595 rows=159613 loops=1)
                                Index Cond: (tsv_natural_rum @@ (ts_rewrite('''code'''::tsquery, 'SELECT target, sub FROM aliases_natural WHERE to_tsquery(''mcc_natural_config'',  ''code'') @> target'::text)))
                                Buffers: shared read=156
Planning time: 276.156 ms
Execution time: 82299.866 ms

Gin TS_Rank_CD tsvector column: tsv_natural (exact same contents as tsv_natural_rum)

Index: gin index on column tsv_natural

Query:

SELECT id, node_id, 
            ts_headline('mcc_natural_config', title, query, 'highlightall=true') title, 
	    ts_headline('mcc_natural_config', content, query, 'maxfragments=1') fragment, 
            ancestors, product_id, product_name, client_name, state_abbr, 
 	    0 AS full_count, --window function for getting full count here but removed it for simplicity
            ts_rank_cd(tsv_natural, query, 4|1) AS rank 
FROM code_docs_fl, 
           (select ts_rewrite(plainto_tsquery('mcc_natural_config',  'code'), 'SELECT target, sub FROM 
                aliases_natural WHERE to_tsquery(''mcc_natural_config'',  ''code'') @> target') query) query
WHERE tsv_natural @@ query
ORDER BY rank DESC
LIMIT 10 OFFSET 0

Result:

Count: 159,613
Limit  (cost=10959.80..10964.95 rows=10 width=435) (actual time=104597.190..104597.762 rows=10 loops=1)
  Buffers: shared hit=415488 read=255421 dirtied=11754 written=3715
  ->  Result  (cost=10959.80..12256.57 rows=2518 width=435) (actual time=104597.180..104597.728 rows=10 loops=1)
        Buffers: shared hit=415488 read=255421 dirtied=11754 written=3715
        ->  Sort  (cost=10959.80..10966.10 rows=2518 width=829) (actual time=104597.046..104597.061 rows=10 loops=1)
              Sort Key: (ts_rank_cd(code_docs_fl.tsv_natural, (ts_rewrite('''code'''::tsquery, 'SELECT target, sub FROM aliases_natural WHERE to_tsquery(''mcc_natural_config'',  ''code'') @> target'::text)), 5)) DESC
              Sort Method: top-N heapsort  Memory: 27kB
              Buffers: shared hit=415488 read=255421 dirtied=11754 written=3715
              ->  Nested Loop  (cost=1723.52..10905.39 rows=2518 width=829) (actual time=535.603..103393.362 rows=159613 loops=1)
                    Buffers: shared hit=415485 read=255421 dirtied=11754 written=3715
                    ->  Result  (cost=0.00..0.26 rows=1 width=32) (actual time=1.971..1.972 rows=1 loops=1)
                          Buffers: shared hit=55 read=5
                    ->  Bitmap Heap Scan on code_docs_fl  (cost=1723.52..10873.65 rows=2518 width=976) (actual time=525.526..25184.048 rows=159613 loops=1)
                          Recheck Cond: (tsv_natural @@ (ts_rewrite('''code'''::tsquery, 'SELECT target, sub FROM aliases_natural WHERE to_tsquery(''mcc_natural_config'',  ''code'') @> target'::text)))
                          Heap Blocks: exact=119510
                          Buffers: shared hit=1 read=120034 dirtied=7722 written=1293
                          ->  Bitmap Index Scan on code_docs_fl_tsv_natural_idx  (cost=0.00..1722.89 rows=2518 width=0) (actual time=450.870..450.870 rows=322447 loops=1)
                                Index Cond: (tsv_natural @@ (ts_rewrite('''code'''::tsquery, 'SELECT target, sub FROM aliases_natural WHERE to_tsquery(''mcc_natural_config'',  ''code'') @> target'::text)))
                                Buffers: shared hit=1 read=524
Planning time: 255.426 ms
Execution time: 104601.713 ms
@za-arthur
Copy link
Contributor

za-arthur commented Aug 15, 2017

I think it is because of your query is in FROM clause.
Can you try to rewrite your query using prepared statement in the following way?

prepare code_docs_fl_fts (tsquery) as
	SELECT id, node_id, 
		ts_headline('mcc_natural_config', title, $1, 'highlightall=true') title, 
		ts_headline('mcc_natural_config', content, $1, 'maxfragments=1') fragment, 
		ancestors, product_id, product_name, client_name, state_abbr, 
		0 AS full_count, --window function for getting full count here but removed it for simplicity
		tsv_natural_rum <=> $1 AS rank 
	FROM code_docs_fl
	WHERE tsv_natural_rum @@ $1
	ORDER BY tsv_natural_rum <=> $1
	LIMIT 10 OFFSET 0;

explain analyze execute code_docs_fl_fts(ts_rewrite(plainto_tsquery('mcc_natural_config',  'code'),
	'SELECT target, sub FROM aliases_natural WHERE to_tsquery(''mcc_natural_config'',  ''code'') @> target'));

@pholly
Copy link
Author

pholly commented Aug 15, 2017

YESSSSSSSSSSSSSSSSSSSSSSSSSS

Limit  (cost=20.00..52.87 rows=10 width=435) (actual time=388.606..981.048 rows=10 loops=1)
  ->  Index Scan using code_docs_fl_tsv_natural_rum_idx on code_docs_fl  (cost=20.00..528655.06 rows=160825 width=435) (actual time=388.599..980.999 rows=10 loops=1)
        Index Cond: (tsv_natural_rum @@ '''code'''::tsquery)
        Order By: (tsv_natural_rum <=> '''code'''::tsquery)
Execution time: 990.911 ms

Execution time went from 100 seconds to 1 second.

Thank you!!!!!! I was about to switch to Solr for full-text search because some queries would take over 8 minutes.

Just for completeness, I tried using a prepared statement with ts_rank / gin index column, and still got over 100 seconds. Hitting index only is awesome!

Do you mind briefly explaining why having the query in FROM clause was so bad?

@obartunov
Copy link

obartunov commented Aug 15, 2017 via email

@za-arthur
Copy link
Contributor

Excellent!

Do you mind briefly explaining why having the query in FROM clause was so bad?

As Oleg said, when you put a to_tsquery() into your FROM clause PostgreSQL's planner use nested-loop join to join your table with to_tsquery()'s result. In your example nested-loop takes a lot of time:

Nested Loop (cost=43.52..9225.39 rows=2518 width=829) (actual time=174.173..80714.142 rows=159613 loops=1)

@pholly
Copy link
Author

pholly commented Aug 15, 2017

Interestingly, prepared statement with ts_rank / gin index takes longer than when the query is in the FROM clause. Looks like gin index isn't hit.

prepare code_docs_fl_fts_gin (tsquery) as
	SELECT id, node_id, 
		ts_headline('mcc_natural_config', title, $1, 'highlightall=true') title, 
		ts_headline('mcc_natural_config', content, $1, 'maxfragments=1') fragment, 
		ancestors, product_id, product_name, client_name, state_abbr, 
		0 AS full_count, --window function for getting full count here but removed it for simplicity
		ts_rank_cd(tsv_natural, $1, 4|1) AS rank 
	FROM code_docs_fl
	WHERE tsv_natural @@ $1
	ORDER BY rank desc
	LIMIT 10 OFFSET 0;
	
explain (analyze, buffers) 
execute code_docs_fl_fts_gin(ts_rewrite(plainto_tsquery('mcc_natural_config',  'code'),
	'SELECT target, sub FROM aliases_natural WHERE to_tsquery(''mcc_natural_config'',  ''code'') @> target'));
Limit  (cost=189800.03..189805.18 rows=10 width=435) (actual time=132878.664..132879.236 rows=10 loops=1)
  Buffers: shared hit=1591860 read=444385
  ->  Result  (cost=189800.03..272624.91 rows=160825 width=435) (actual time=132878.656..132879.202 rows=10 loops=1)
        Buffers: shared hit=1591860 read=444385
        ->  Sort  (cost=189800.03..190202.10 rows=160825 width=782) (actual time=132878.488..132878.506 rows=10 loops=1)
              Sort Key: (ts_rank_cd(tsv_natural, '''code'''::tsquery, 5)) DESC
              Sort Method: top-N heapsort  Memory: 27kB
              Buffers: shared hit=1591860 read=444385
              ->  Seq Scan on code_docs_fl  (cost=0.00..186324.66 rows=160825 width=782) (actual time=9.306..131641.933 rows=159613 loops=1)
                    Filter: (tsv_natural @@ '''code'''::tsquery)
                    Rows Removed by Filter: 344010
                    Buffers: shared hit=1591857 read=444385
Execution time: 132879.361 ms

Thanks again. I'm still waiting for trial and licensing info for Postgres Pro Enterprise. When I hear back I'd like to try it out.

@za-arthur
Copy link
Contributor

za-arthur commented Aug 15, 2017

Interestingly, prepared statement with ts_rank / gin index takes longer than when the query is in the FROM clause. Looks like gin index isn't hit.

For GIN index prepared statement won't bring a significant benefit. It anyway will use Bitmap Index Scan or Seq Scan (as in your last case with prepared statement). For ordering it needs to read both index pages and heap (table) pages.
To return 10 rows RUM needs to read only a few index pages. Whereas GIN returns 160K rows, only then PostgreSQL sort them and returns 10 rows.

Thanks again. I'm still waiting for trial and licensing info for Postgres Pro Enterprise. When I hear back I'd like to try it out.

I checked it and our salesman sent to you a mail about Windows Postgres Pro Enterprise yesterday. Did you receive it?

@pholly
Copy link
Author

pholly commented Aug 15, 2017

I checked it and our salesman sent to you a mail about Windows Postgres Pro Enterprise yesterday. Did you receive it?

Thanks for checking. I found it. It went into my quarantine folder for some reason.

@za-arthur
Copy link
Contributor

It is good!
I'm closing the issue. Feel free to open issues or report bugs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants