Performance Issue with Large Tables and HNSW Indexes #455

Closed
williamjeong2 opened this issue Feb 7, 2024 · 9 comments

@williamjeong2

Hello,

I'm currently facing performance challenges with pgvector on PostgreSQL: queries against large tables are taking a significant time to execute. I'd like to share my situation and ask for advice on optimizations or configuration changes that could improve performance.

Environment & Configuration:

  • PostgreSQL version: Using the Docker image with pgvector version 0.6.0 included, which is based on PostgreSQL 16.
  • Hardware: The server has 28 cores, 56 threads, and 256GB of RAM, but we're using HDDs instead of SSDs, which might be impacting performance.
  • Tables: We have around 10 tables, each with approximately 10 to 20 million rows.
  • Current settings: shared_buffers is set to 80GB, and effective_cache_size is set to 120GB.
  • Indexes: Utilizing HNSW indexes with cosine similarity (a sketch of the setup follows below).
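
For reference, the setup looks roughly like this (a sketch: the table and index names are taken from the EXPLAIN output later in the thread, and the m/ef_construction values are pgvector's defaults, assumed here rather than my exact DDL):

-- Sketch of the schema and HNSW index (parameters are assumptions).
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE paper (
    id        bigserial PRIMARY KEY,
    embedding vector(768)  -- 768-dim embeddings, matching the queries below
);

-- vector_cosine_ops pairs with the <=> (cosine distance) operator.
CREATE INDEX paper_embedding_idx ON paper
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);  -- pgvector defaults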

Issues & Observations:

  • Query Performance: Queries on these tables are taking at least 10 seconds each, which seems unusually high.
  • Considering partitioning: I'm contemplating partitioning the tables to improve performance and would appreciate any insight into whether it might be beneficial in my case.

Given the above configuration and the challenges faced, I have a few questions:

  1. Are there any recommended configurations or optimizations specific to pgvector, especially when dealing with large tables and HNSW indexes on an HDD setup?
  2. Could the use of HDD instead of SSD be the primary factor in the observed performance issues? Would transitioning to SSDs significantly improve query performance?
  3. Would partitioning my tables potentially improve query times, and are there specific considerations with pgvector indexes when partitioning?

I understand that the hardware might be a limiting factor, but any guidance on optimizing PostgreSQL or pgvector settings to mitigate some of these performance issues would be greatly appreciated.

Thank you for your time and assistance.

@ankane
Member

ankane commented Feb 7, 2024

Hi @williamjeong2, can you paste the output of EXPLAIN (ANALYZE, BUFFERS) for one of the queries?
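
That is, something along these lines (a sketch; the table, column, and query vector are placeholders):

-- Run the nearest-neighbor query under EXPLAIN to see planner cost and buffer usage.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM paper
ORDER BY embedding <=> '[0.1, 0.2, 0.3]'::vector  -- substitute a full 768-dim query vector
LIMIT 3;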

@williamjeong2
Author

Hi @ankane, here is the output (768-dim embeddings):

Limit  (cost=224.84..226.81 rows=3 width=1247) (actual time=8249.103..8318.815 rows=3 loops=1)
  Buffers: shared hit=128 read=1246
  ->  Index Scan using paper_embedding_idx on paper  (cost=224.84..11281870.44 rows=17203648 width=1247) (actual time=8249.099..8318.807 rows=3 loops=1)
        Order By: (embedding <=> '[0.8093864,0.60382533,...]'::vector)
        Buffers: shared hit=128 read=1246
Planning:
  Buffers: shared hit=118
Planning Time: 2.416 ms
Execution Time: 8319.119 ms

@ankane
Member

ankane commented Feb 8, 2024

It looks like the buffer hit rate is pretty low, so a lot of reads are happening from disk. I suspect you'll see better performance with an SSD, especially since HNSW does a lot of random access. I don't think partitioning will help in this situation (unless you're filtering by the partition key). You could also try prewarming the index with pg_prewarm to get it in memory, but it may not stay there.
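
Prewarming looks like this (pg_prewarm ships as a PostgreSQL contrib module; the index name is from your plan above):

-- Load the index pages into shared buffers; returns the number of blocks read.
CREATE EXTENSION IF NOT EXISTS pg_prewarm;

SELECT pg_prewarm('paper_embedding_idx');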

@williamjeong2
Author

@ankane
Thanks so much. I'll try the things you suggested and share the results here for future reference.

@ankane
Member

ankane commented Feb 8, 2024

Awesome, sounds good.

@williamjeong2
Author

I measured QPS after simply moving the database to SSD. Each line below is a different table, with more rows as you go down: the top row has about 10 million rows and the bottom row has about 20 million:

Average QPS: 7.245211969821667, Total Queries Processed: 8
Average QPS: 6.831864681418859, Total Queries Processed: 7
Average QPS: 77.75631549154483, Total Queries Processed: 76
Average QPS: 78.14635672947067, Total Queries Processed: 99
Average QPS: 6.005239444042287, Total Queries Processed: 7
Average QPS: 5.587675697528546, Total Queries Processed: 7
Average QPS: 5.5092620964731545, Total Queries Processed: 6
Average QPS: 5.468562478180391, Total Queries Processed: 6

And this is the output of explain (analyze, buffers) for one of the queries:

Table that had slow QPS.

Limit  (cost=224.84..226.83 rows=3 width=1261) (actual time=163.940..164.732 rows=3 loops=1)
  Buffers: shared hit=473 read=1186
  ->  Index Scan using paper_2016_embedding_idx on paper_2016  (cost=224.84..3815999.63 rows=5742623 width=1261) (actual time=163.937..164.727 rows=3 loops=1)
        Order By: (embedding <=> '[0.6871753,-0.11484199,-0.52307606,...]'::vector)
        Buffers: shared hit=473 read=1186
Planning:
  Buffers: shared hit=1
Planning Time: 0.153 ms
Execution Time: 164.778 ms

Table that had fast QPS.

Limit  (cost=224.84..226.81 rows=3 width=1245) (actual time=15.788..16.005 rows=3 loops=1)
  Buffers: shared hit=1210 read=29
  ->  Index Scan using paper_2018_embedding_idx on paper_2018  (cost=224.84..5054610.74 rows=7706712 width=1245) (actual time=15.784..15.998 rows=3 loops=1)
        Order By: (embedding <=> '[0.6871753,-0.11484199,-0.52307606,...]'::vector)
        Buffers: shared hit=1210 read=29
Planning:
  Buffers: shared hit=42 read=10
Planning Time: 1.375 ms
Execution Time: 16.055 ms

Things are much better than before, but it looks like only the fastest tables are in the buffer cache. Trying pg_prewarm is still on my list. However, my largest table is 100GB excluding its index (80GB), so I don't expect it to work well there. I'll give it a try anyway.
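
For reference, this is roughly how I'm comparing on-disk sizes against memory (a sketch; the table and index names are from my schema):

-- Compare sizes against shared_buffers to estimate what can stay cached.
SELECT pg_size_pretty(pg_table_size('paper'))                  AS table_size,
       pg_size_pretty(pg_relation_size('paper_embedding_idx')) AS index_size,
       current_setting('shared_buffers')                       AS shared_buffers;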

@pashkinelfe
Contributor

pashkinelfe commented Feb 8, 2024

@williamjeong2 Considering you have a 20GB HNSW index, it is likely that pg_prewarm will help; otherwise, 'natural' warm-up could take a long time. To check, try running your test for around 1,000 seconds and see how TPS evolves. Or just use pg_prewarm. :)
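
To see how much of the index is actually in shared buffers, something like this works (a sketch using the pg_buffercache contrib module; the index name is assumed):

-- Count shared-buffer pages (8kB each by default) currently holding the index.
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

SELECT count(*)                        AS cached_pages,
       pg_size_pretty(count(*) * 8192) AS cached_size
FROM pg_buffercache
WHERE relfilenode = pg_relation_filenode('paper_embedding_idx')
  AND reldatabase = (SELECT oid FROM pg_database WHERE datname = current_database());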

@williamjeong2
Author

@ankane @pashkinelfe Your suggestions were very helpful.

Here are the results of using pg_prewarm. In the numbers below, the top two tables had their indexes warmed with pg_prewarm, which didn't take long.

Average QPS: 110.34824108994871, Total Queries Processed: 1079
Average QPS: 107.39188954299956, Total Queries Processed: 1112
Average QPS: 5.389447228919696, Total Queries Processed: 54
Average QPS: 16.25110630066733, Total Queries Processed: 155
Average QPS: 5.931909562061934, Total Queries Processed: 60
Average QPS: 6.5119287118480065, Total Queries Processed: 63
Average QPS: 6.479642628293995, Total Queries Processed: 67
Average QPS: 6.736313818006796, Total Queries Processed: 70

To summarize, using pg_prewarm has performed the best in my case so far. Of course, since my vector indexes are very large (about 400GB across all tables combined), keeping them warm will require a lot of memory. Nevertheless, I think faster QPS than I'm seeing now should be achievable, since I've seen very fast QPS numbers in blog posts, including Supabase's. Other variables such as vector length and PostgreSQL configuration remain to be explored.
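
For anyone in a similar situation, this is the kind of statement I plan to use to prewarm all of the HNSW indexes at once (a sketch; it assumes shared_buffers is large enough to hold what gets loaded):

-- Prewarm every HNSW index outside the system schemas; each call returns blocks loaded.
SELECT indexname,
       pg_prewarm(format('%I.%I', schemaname, indexname)::regclass) AS blocks_loaded
FROM pg_indexes
WHERE indexdef LIKE '%USING hnsw%'
  AND schemaname NOT IN ('pg_catalog', 'information_schema');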

@ankane
Member

ankane commented Feb 14, 2024

Great, thanks for sharing @williamjeong2!

ankane closed this as completed Feb 14, 2024