Search and documents way too slow #9649

axelopale · 2025-04-14T14:41:33Z

axelopale
Apr 14, 2025

Hi there,

On a dedicated server with 10 cores and 90 GB RAM (doing nothing else), search is really slow: it takes 2 minutes to return 50 results per page. I have 30,000 documents and 700 tags in Paperless, installed via Docker.
When performing the search and/or displaying the documents, the server load increases a little, but never beyond a 0.8 load average. If I refresh the same page, it takes 2 minutes again.
The global search field works better.

I tried tuning the MariaDB settings in Docker, but no luck.
I tried rebooting the containers as well as the server.

If anyone can point me to something, it would be nice, Thanks

EDIT : I enabled the MySQL slowlog. Scary ....

27 seconds for this simple query:

SELECT `documents_tag`.`id` FROM `documents_tag` LEFT OUTER JOIN `documents_document_tags` ON (`documents_tag`.`id` = `documents_document_tags`.`tag_id`) LEFT OUTER JOIN `documents_document` ON (`documents_document_tags`.`document_id` = `documents_document`.`id`) GROUP BY `documents_tag`.`id`, LOWER(`documents_tag`.`name`) ORDER BY LOWER(`documents_tag`.`name`) ASC;

shamoon · 2025-04-14T15:07:52Z

shamoon
Apr 14, 2025
Maintainer

27 seconds for this simple query:

SELECT `documents_tag`.`id` FROM `documents_tag` LEFT OUTER JOIN `documents_document_tags` ON (`documents_tag`.`id` = `documents_document_tags`.`tag_id`) LEFT OUTER JOIN `documents_document` ON (`documents_document_tags`.`document_id` = `documents_document`.`id`) GROUP BY `documents_tag`.`id`, LOWER(`documents_tag`.`name`) ORDER BY LOWER(`documents_tag`.`name`) ASC;

As you noted, that's not normal, so it's very hard for anyone here to tell you why this is happening, but fundamentally it seems to be an issue with your database (installation), perhaps?

9 replies

virtadpt Apr 14, 2025

What about doing something like what Archivebox does - the possibility of using ripgrep, pdfgrep, or sonic?

shamoon Apr 14, 2025
Maintainer

I mean, it's not simple to replace the search backend here, but it has been discussed. Ironically, in most cases searching with the db is faster than the “search index” (which uses software called Whoosh). Does that (the “advanced search” on the documents page) work faster for you than the db?

I suppose if there was a search backend that outperformed the db and had feature parity we could just use that entirely. Again, huge amount(s) of work…

axelopale Apr 15, 2025
Author

In my opinion, 30k documents is not that much, I'm very surprised that Paperless starts struggling so fast.
I would imagine that 1M documents is a lot.
Either SQL queries must be optimized for advanced search (indexes, fulltext, split in multiple but smaller queries), or a search backend like Meilisearch, Solr or Elastic should be used.

ke-ma-fi Apr 17, 2025

Hi there,

I'm running Paperless-ngx with Docker and PostgreSQL, currently handling around 12,000 documents and over 40 million characters – and search results come back instantly for me.

My Paperless instance runs on a system with an M.2 SSD and 8 GB RAM, and I've applied the following performance tweaks to PostgreSQL:

Memory tuning

shared_buffers = 2GB
work_mem = 32MB
effective_cache_size = 6GB

WAL & checkpoints

checkpoint_completion_target = 0.9
wal_buffers = 32MB

Disk I/O tuning

random_page_cost = 1.1
seq_page_cost = 1.0
effective_io_concurrency = 200

This setup has been working flawlessly for me.
You might want to consider migrating to PostgreSQL and testing it out.
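
The two larger memory values above happen to line up with a widely repeated starting point: about 25% of RAM for shared_buffers and about 75% for effective_cache_size. The sketch below only reproduces that arithmetic for a given amount of RAM. It assumes that rule of thumb, which is not something Paperless-ngx prescribes, and the helper names `suggest_memory_settings` and `render_conf` are hypothetical.

```python
def suggest_memory_settings(ram_gb: int) -> dict[str, str]:
    """Return rule-of-thumb PostgreSQL memory settings for a host with ram_gb of RAM.

    Assumption: ~25% of RAM for shared_buffers and ~75% for
    effective_cache_size. These are starting points, not tuned values.
    """
    ram_mb = ram_gb * 1024
    return {
        "shared_buffers": f"{ram_mb // 4}MB",
        "effective_cache_size": f"{ram_mb * 3 // 4}MB",
    }


def render_conf(settings: dict[str, str]) -> str:
    """Format the settings as lines in postgresql.conf syntax."""
    return "\n".join(f"{name} = {value}" for name, value in settings.items())


if __name__ == "__main__":
    # 8 GB host, as in the post above
    print(render_conf(suggest_memory_settings(8)))
```

For the 8 GB machine described above this gives 2048MB and 6144MB, i.e. the same 2GB and 6GB. Whether these values help depends on the workload, as the later replies in this thread show.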

axelopale Apr 17, 2025
Author

Thanks @ke-ma-fi
With 30k documents, I had more than 10B characters.
I spent more time than necessary on this so I deleted it all and cancelled my dedicated server.
Looking for another option, considering a Dropbox equivalent.

bytec77 · 2025-04-17T20:53:52Z

bytec77
Apr 17, 2025

Same problem here. The advanced search "customer AND *Product" takes 1:30 min. 10,000 documents, 8 cores, 16 GB RAM, SSD, PostgreSQL. Load is under 0.5.

@ke-ma-fi your tweaks change nothing for me

Unfortunately, I can't use paperless like this. It's a shame because it's otherwise great software. I currently use Devonthink but I would like to switch to paperless.
When I enter the above search there, the result is there in 2 seconds.

I'm hoping for a high-performance search engine.

8 replies

ke-ma-fi Apr 18, 2025

@bytec77 alright, this is similar to my setup, which also runs in a Proxmox LXC. I don't think it's due to Paperless itself. Mine is working flawlessly with 40 million characters, and it should with 10B as well (query times of 2-3 seconds should be doable for Postgres).
I think it's worth checking disk I/O while querying, as well as the tsvector and GIN index. If you also have high disk I/O while running a search, it's likely because the system isn't using the index properly. When I query my db, disk I/O is nearly zero and Postgres hits the cached index 99% of the time.

bytec77 Apr 18, 2025

the disk I/O is nearly at zero. How do I check the tsvector and GIN Index?

ke-ma-fi Apr 19, 2025

How to Check PostgreSQL Cache Hit Rate and tsvector Index in Paperless-ngx

You can use the following steps to verify that PostgreSQL is caching effectively and that your full-text search is using the tsvector field with a proper GIN index.

If using docker, pls check for correct db container naming and change commands accordingly. For bare metal, just connect to db and run the queries.

Pls also note that the following instructions are AI generated. They may have some mistakes.

1. Check Cache Hit Rate

This tells you how often PostgreSQL serves data from memory instead of reading from disk.

docker exec -it paperless-db-1 psql -U paperless -d paperless -c "SELECT ROUND(100 * sum(blks_hit)::numeric / nullif(sum(blks_hit) + sum(blks_read), 0), 2) AS cache_hit_ratio FROM pg_stat_database;"

≥ 99% → Excellent
< 95% → Could indicate disk reads → possible performance bottleneck
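
To make the arithmetic in that query explicit, here is a minimal sketch of the same calculation outside the database. It assumes you have already read the summed `blks_hit` and `blks_read` counters from `pg_stat_database`; the function name `cache_hit_ratio` is hypothetical.

```python
from typing import Optional


def cache_hit_ratio(blks_hit: int, blks_read: int) -> Optional[float]:
    """Percentage of block requests served from PostgreSQL's buffer cache.

    Mirrors the SQL above: hits / (hits + reads), rounded to two decimals.
    Returns None when there is no activity yet, like NULLIF(..., 0) does.
    """
    total = blks_hit + blks_read
    if total == 0:
        return None
    return round(100 * blks_hit / total, 2)


if __name__ == "__main__":
    # Made-up counters: 9991 cache hits and 9 disk reads
    print(cache_hit_ratio(9991, 9))
```

Note that the counters are cumulative since the last statistics reset, so the ratio describes the whole period and not a single slow query.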

2. Check if `tsvector` and GIN Index Exist

Connect to the PostgreSQL container:

docker exec -it paperless-db-1 psql -U paperless -d paperless

Then list the structure of the document table:

\d+ document_document

Look for:

A column named: document_search_vector
An index like: gin_idx_document_search_vector | gin | (document_search_vector)

3. Check if the Index is Being Used

Run an example full-text query with EXPLAIN ANALYZE:

EXPLAIN ANALYZE
SELECT * FROM document_document
WHERE document_search_vector @@ to_tsquery('english', 'customer & invoice');

You want to see something like:

Bitmap Index Scan on gin_idx_document_search_vector

If you instead see:

Seq Scan on document_document

...then PostgreSQL is not using the index, which likely causes slow search performance.
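
If you capture the EXPLAIN output as text, the check described in step 3 can be automated roughly as follows. This is only a sketch: it pattern-matches the node names PostgreSQL prints in its text plan format, and the sample plan strings, the index name in them, and the helper names are made up for illustration.

```python
import re

# Scan node names as they appear in PostgreSQL's text-format plans.
# Longer names come first so "Bitmap Index Scan" is not read as "Index Scan".
SCAN_PATTERN = re.compile(
    r"Seq Scan|Bitmap Index Scan|Bitmap Heap Scan|Index Only Scan|Index Scan"
)


def scan_nodes(plan_text: str) -> list[str]:
    """Return the scan node types found in a plan, in order of appearance."""
    return SCAN_PATTERN.findall(plan_text)


def uses_only_seq_scans(plan_text: str) -> bool:
    """True if the plan contains scan nodes and all of them are sequential."""
    nodes = scan_nodes(plan_text)
    return bool(nodes) and all(node == "Seq Scan" for node in nodes)


if __name__ == "__main__":
    slow_plan = "Seq Scan on documents_document  (cost=0.00..1.00 rows=1 width=4)"
    fast_plan = (
        "Bitmap Heap Scan on documents_document\n"
        "  ->  Bitmap Index Scan on some_gin_index"
    )
    print(uses_only_seq_scans(slow_plan), uses_only_seq_scans(fast_plan))
```

A sequential scan is not always a problem on a small table, so treat the result as a hint about where to look.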

Let me know what you find!

bytec77 Apr 19, 2025

@ke-ma-fi
thanks for your help.

The cache hit rate is 99,91%

But I think the tsvector field and the GIN index are not used. In my setup there is no table called document_document, only documents_document.
look at the screenshots

ke-ma-fi Apr 19, 2025

Hmm... I checked that with my instance and I don't have tsvector with a GIN Index either. I thought paperless is utilising that. So I guess search is performed in the title & content setting with a simple Django ORM filter? Maybe @shamoon can clarify that... so i don't really know why the performance of your instance is so much worse than mine.

bytec77 · 2025-04-20T15:55:49Z

bytec77
Apr 20, 2025

@ke-ma-fi
@shamoon

I did a bit more testing. Specifically, I wanted to run the same query from the “Advanced Search” on the command line using a PostgreSQL query. To do this, I set up the pg_trgm extension and then created the corresponding index:

CREATE EXTENSION IF NOT EXISTS pg_trgm; CREATE INDEX IF NOT EXISTS idx_documents_document_content_trgm ON public.documents_document USING gin (content gin_trgm_ops);

After that, I executed the same query as in Paperless and got the result in under a second. In Paperless it takes almost a minute, as you can see in the screen recordings.
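
For readers unfamiliar with pg_trgm: the extension breaks text into three-character sequences (trigrams), and a GIN index built with gin_trgm_ops lets PostgreSQL narrow down candidate rows for LIKE and ILIKE patterns from the trigrams they contain. The sketch below approximates how trigrams are extracted from a string (lowercased, each word padded with two leading spaces and one trailing space). It is an illustration under those assumptions, handles only ASCII letters and digits, and is not the extension's actual code.

```python
import re


def trigrams(text: str) -> set[str]:
    """Approximate the trigram set pg_trgm derives from a string."""
    grams: set[str] = set()
    # Treat runs of letters and digits as words; everything else separates them.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "  # two spaces before, one after
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


if __name__ == "__main__":
    print(sorted(trigrams("cat")))
```

Very short search terms contain few or no complete trigrams, so an index of this kind is most useful for terms of three characters or more.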

Why does it take so long in Paperless, and is there any way to change that?

paperless.mov

cli.mov

7 replies

bytec77 Apr 21, 2025

OK, thanks. I'll have a look at whether I can change the search mode for my setup to a direct PostgreSQL query.

bytec77 Apr 21, 2025

I changed filters.py so that the Title & Content search uses the psql query. Thanks @ke-ma-fi for your hint about the GIN index.

paperless-ngx/
└── src/
    └── documents/
        └── filters.py

from 50sec to 1sec :-)

paperless2.mov

now I just have to be careful with updates.

ke-ma-fi Apr 22, 2025

@bytec77 Would you provide me with the changes you made to the filters.py? And on the db you just created that index?
I think I might have to do that as well somewhere in the future :)

Edit:
And I am still curious, why a similar setup with similar amount of data gets different results. Still don't get why my instance is searching fine and yours was not. Which Postgres Version are you on? And did you try dumping and reinstalling the db? I can't see where the problem of the slow query is...

bytec77 Apr 22, 2025

Just to make it clear again: it is only this slow in the Advanced Search when I combine terms with AND, for example; otherwise it is fast as well. But I often have to search for a supplier from a certain year, and that only works this way.

I only changed the filters.py, namely the class: class TitleContentFilter(Filter)
Then I created the index and that's it.

filters.py:

@extend_schema_field(serializers.CharField)
class TitleContentFilter(Filter):
    """
    Extended search with:
      - AND across all terms outside parentheses
      - OR within ( ) groups
    Uses the pg_trgm GIN index via qs.extra()
    """
    def filter(self, qs, value):
        if not value:
            return qs

        import re

        val = value.strip()

        # 1) Detect the pattern: <and_terms> AND (<or_terms>)
        m = re.match(r'^(.*?)\s+AND\s*\((.*?)\)\s*$', val, flags=re.IGNORECASE)
        if m:
            # a) all AND terms (outside the parenthesized group)
            and_terms = m.group(1).split()
            # b) all OR terms (inside the parentheses)
            or_terms  = re.split(r'\s+OR\s+', m.group(2), flags=re.IGNORECASE)

            # WHERE clauses: one "content ILIKE %s" per term
            where = []
            params = []

            # each AND term is appended as its own clause (joined with AND)
            for term in and_terms:
                where.append("content ILIKE %s")
                params.append(f"%{term}%")

            # the whole OR group as a single parenthesized clause
            or_clause = "(" + " OR ".join(["content ILIKE %s"] * len(or_terms)) + ")"
            where.append(or_clause)
            for term in or_terms:
                params.append(f"%{term}%")

            return qs.extra(where=where, params=params)

        # 2) Fallback: old logic (plain AND or OR across all terms),
        #    as in the previous version
        val_upper = val.upper()
        if ' ODER ' in val_upper or ' OR ' in val_upper:
            parts = re.split(r'\s+(?:ODER|OR)\s+', val, flags=re.IGNORECASE)
            clause = "(" + " OR ".join(["content ILIKE %s"] * len(parts)) + ")"
            return qs.extra(where=[clause], params=[f"%{p}%" for p in parts])
        else:
            parts = val.split()
            return qs.extra(
                where=["content ILIKE %s"] * len(parts),
                params=[f"%{p}%" for p in parts]
            )

index create:

CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX IF NOT EXISTS idx_documents_document_content_trgm
  ON public.documents_document
  USING gin (content gin_trgm_ops);
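
To try the parsing logic without a Paperless checkout, the clause-building part of the filter above can be pulled into a plain function. This is a sketch, not the code bytec77 runs: `build_where` is a hypothetical name, the plain OR fallback is left out for brevity, and, unlike the original, literal AND tokens in the left-hand part are dropped so they are not searched as words.

```python
import re


def build_where(value: str) -> tuple[list[str], list[str]]:
    """Return (where_clauses, params) for qs.extra(where=..., params=...)."""
    val = value.strip()
    where: list[str] = []
    params: list[str] = []

    # Pattern: <and_terms> AND (<or_terms>)
    m = re.match(r"^(.*?)\s+AND\s*\((.*?)\)\s*$", val, flags=re.IGNORECASE)
    if m:
        # Deviation from the original: skip the word AND itself
        and_terms = [t for t in m.group(1).split() if t.upper() != "AND"]
        or_terms = re.split(r"\s+OR\s+", m.group(2), flags=re.IGNORECASE)
        for term in and_terms:
            where.append("content ILIKE %s")
            params.append(f"%{term}%")
        # The whole OR group becomes one parenthesized clause
        where.append("(" + " OR ".join(["content ILIKE %s"] * len(or_terms)) + ")")
        params.extend(f"%{term}%" for term in or_terms)
        return where, params

    # Fallback: every whitespace-separated term must match
    parts = val.split()
    return ["content ILIKE %s"] * len(parts), [f"%{p}%" for p in parts]


if __name__ == "__main__":
    print(build_where("acme AND (2023 OR 2024)"))
```

The terms are passed as query parameters, so they are not interpolated into the SQL string. A `%` or `_` typed by the user still acts as a LIKE wildcard, because the terms are not escaped.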

R3dst4r Apr 29, 2025

Running paperless-ngx 2.14.7, mariadb 11.7.2, all system checks from "settings" -> "system status" are green.
Installation was created on 26th Jan, 2025. Back then initial version was 2.14.5, I guess, maybe a bit older but not before 1st Sep, 2024 (filesystem creation timestamp = date of VM installation).

410M characters, 50,000 docs (lots of pictures), 4,900 tags, 180 correspondents, 60 doctypes. 95% of them were added through the consume directory.
73GB payload, 80GB VM memory, 6 VM cores (12 core ryzen 5 3600), hypervisor running on internal m2 (no RAID), payload stored on internal SATA SSD (ZFS), 1 gbit/s NIC.
Proxmox 8.3.5 qemu VM, other docker containers on this VM are idling (16x paperless in total. (Yes, I have lots of paper :D )).
While writing this I just noticed concerning swapping on the hypervisor, but I am confident this was not the case when the problem occurred for the first time after migrating the payload to Paperless. I will fix this.

I did some WebUI speed tests:

After fresh login
Statistics: 20 sec
Documents: Title & content: 14 sec
Documents: Advanced search: 13-14 sec
Documents: Title: 12 sec
ASN: 12 sec
Custom fields: 14 sec

Title & content:
Click on tags: 3 sec
List documents with a tag that has 19,100 docs: 11 sec
Resetting the filter: 13 sec
Title search: "picasso" takes 40 sec, leading to a list of 4,000 documents. This is twice as long as the regular "create document list" process. I assume it takes twice as long because Paperless already starts searching while I am typing. Typing is heavily delayed, too. I guess with better performance in my installation this would no longer be an issue, so I consider it a side quest.

Idle:
#docker stats db-1 webserver-1 gotenberg-1 broker-1 tika-1
CONTAINER ID   NAME          CPU %    MEM USAGE / LIMIT      MEM %   NET I/O           BLOCK I/O         PIDS
fd7f458ca009   db-1          0.01%    1.809GiB / 78.55GiB    2.30%   1.15GB / 11.3GB   10.2TB / 31.5GB   12
7766f8f8f94d   webserver-1   0.38%    5.786GiB / 78.55GiB    7.37%   11.5GB / 2.12GB   167GB / 37.3GB    97
724c6f925fd4   gotenberg-1   0.02%    7.395MiB / 78.55GiB    0.01%   128kB / 126B      0B / 0B           10
b6f15039c09f   broker-1      0.21%    18.3MiB / 78.55GiB     0.02%   765MB / 222MB     14.9MB / 1.08GB   6
513ad5faf3f8   tika-1        0.14%    423.3MiB / 78.55GiB    0.53%   204kB / 1.21kB    1.84MB / 376MB    63

Searching:
CONTAINER ID   NAME          CPU %    MEM USAGE / LIMIT      MEM %   NET I/O           BLOCK I/O         PIDS
fd7f458ca009   db-1          71.02%   1.755GiB / 78.55GiB    2.23%   1.15GB / 11.3GB   10.2TB / 31.5GB   20
7766f8f8f94d   webserver-1   0.54%    5.804GiB / 78.55GiB    7.39%   11.5GB / 2.12GB   167GB / 37.3GB    98
724c6f925fd4   gotenberg-1   0.01%    7.395MiB / 78.55GiB    0.01%   128kB / 126B      0B / 0B           10
b6f15039c09f   broker-1      0.25%    18.29MiB / 78.55GiB    0.02%   766MB / 222MB     14.9MB / 1.08GB   6
513ad5faf3f8   tika-1        0.18%    423.3MiB / 78.55GiB    0.53%   204kB / 1.21kB    1.84MB / 376MB    63

Notice the high CPU usage of the db-1 container while querying (71.02%, versus 0.01% when idle).

Interestingly, I cannot see any disk reading on the VM (I just loaded a fresh document list 2 minutes ago):

Temporarily adding more RAM didn't help, so I reduced it back to normal:

@ke-ma-fi
Do you know how to check cache hit ratio (or other tests/metrics) on mariadb?
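
For MariaDB, one roughly comparable figure is the InnoDB buffer pool hit ratio. It can be derived from two counters returned by SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%': Innodb_buffer_pool_read_requests (logical reads) and Innodb_buffer_pool_reads (reads that had to go to disk). The sketch below only does the arithmetic and assumes you copy the two values in by hand; the function name is hypothetical.

```python
from typing import Optional


def innodb_hit_ratio(read_requests: int, disk_reads: int) -> Optional[float]:
    """Percentage of InnoDB logical reads served from the buffer pool.

    read_requests: value of Innodb_buffer_pool_read_requests
    disk_reads:    value of Innodb_buffer_pool_reads
    """
    if read_requests == 0:
        return None  # no reads recorded yet
    return round(100 * (read_requests - disk_reads) / read_requests, 2)


if __name__ == "__main__":
    # Made-up counters for illustration
    print(innodb_hit_ratio(1000, 10))
```

As with the PostgreSQL figure, these counters accumulate from server start, so compare a reading taken before a slow search with one taken after it.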

Merinorus · 2025-07-15T09:29:02Z

Merinorus
Jul 15, 2025

Hi,

My comment will be about title & content search, not the advanced search with Whoosh.

Indeed, the title & content search uses the Django ORM. Plus, the search cannot be indexed as it uses a LIKE %A% type filter. It could work with some index in Postgres, but I don't know any way to index this with MariaDB and SQLite. Another solution would be combining full-text search with a proper index and the LIKE %A% filter.
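
The indexing limitation can be observed with Python's built-in sqlite3 module. The sketch below uses a throwaway in-memory table (the `docs` table and the index name are invented for the demo, not Paperless's schema) and asks SQLite for its query plan: an equality filter can search the B-tree index, while a LIKE pattern with a leading wildcard has to visit every row.

```python
import sqlite3


def query_plan(conn: sqlite3.Connection, sql: str) -> list[str]:
    """Return the human-readable steps of SQLite's plan for a query."""
    # The description of each step is the last column of every row.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("CREATE INDEX idx_docs_title ON docs (title)")

exact_plan = query_plan(conn, "SELECT id FROM docs WHERE title = 'invoice'")
substring_plan = query_plan(conn, "SELECT id FROM docs WHERE title LIKE '%invoice%'")

print(exact_plan)      # expect a SEARCH step that uses idx_docs_title
print(substring_plan)  # expect a SCAN step, i.e. all rows are visited
```

The exact wording of the plan steps varies between SQLite versions, so only the SEARCH versus SCAN distinction should be relied on.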
I was working on it because of this discussion, then I found this one with Bytec77's solution.
Note: The Django documentation suggests NOT using the .extra() method (too bad, because I found it convenient), but .raw() or RawSQL instead.

Here is an attempt, working with the three backends (PostgreSQL, MariaDB, and SQLite). I didn't implement the "AND" and "OR" filters from the Whoosh search, but it might be possible: https://github.com/Merinorus/paperless-ngx/tree/feature-faster-title-content-search

The Django ORM's advanced features (full-text search, etc.) are almost exclusively available to PostgreSQL. In Paperless-ngx, we have to support multiple database backends. This results in almost plain SQL. It may be error-prone in the future because it is not correlated anymore with the ORM model, but sometimes the ORM is not enough to deliver proper performance.

class TitleContentFilter(Filter):
    def filter(self, qs, value):
        if not value:
            return qs
        tokens = split_tokens(value)
        limit = 1000
        fulltext_tokens = [t for t in tokens if len(t) >= FULLTEXT_MINIMAL_TOKEN_LENGTH]
        vendor = connection.vendor
        if vendor == "postgresql" and tokens:
            # PostgreSQL fulltext search benefits from tokens less than 3 characters long
            ft_search_exp = " & ".join(tokens) + ":*"
            id_query = RawSQL(
                "SELECT id FROM documents_document WHERE to_tsvector('simple', title) @@ to_tsquery('simple', %s) OR to_tsvector('simple', content) @@ to_tsquery('simple', %s) limit %s",
                params=(ft_search_exp, ft_search_exp, limit),
            )
            return qs.filter(id__in=(id_query))
        elif vendor in {"mysql", "mariadb"} and fulltext_tokens:
            # MariaDB needs at least 3 characters to be able to use the full-text search
            ft_search_exp = "+" + " +".join(fulltext_tokens) + "*"
            like_search_exp = "%" + "_%".join(tokens) + "%"
            id_query = RawSQL(
                "SELECT id FROM documents_document WHERE (MATCH(title) AGAINST(%s IN BOOLEAN MODE) OR MATCH(content) AGAINST(%s IN BOOLEAN MODE)) AND (title LIKE %s OR content LIKE %s) limit %s",
                params=(
                    ft_search_exp,
                    ft_search_exp,
                    like_search_exp,
                    like_search_exp,
                    limit,
                ),
            )
            return qs.filter(id__in=(id_query))
        elif vendor == "sqlite" and fulltext_tokens:
            ft_search_exp = " ".join(fulltext_tokens) + "*"
            like_search_exp = "%" + "_%".join(tokens) + "%"
            if len(fulltext_tokens) < len(tokens):
                id_query = (
                    Document.objects.filter(
                        id__in=(
                            RawSQL(
                                "SELECT rowid FROM documents_document_fts WHERE title MATCH %s AND title LIKE %s OR content MATCH %s AND (length(content) > 100000 OR content LIKE %s) LIMIT %s",
                                params=(
                                    ft_search_exp,
                                    like_search_exp,
                                    ft_search_exp,
                                    like_search_exp,
                                    limit,
                                ),
                            )
                        ),
                    )
                    .order_by()
                    .values_list("id", flat=True)
                )
            else:
                # Full-text clause only seems much more performant than being combined with the LIKE %A% clause in SQLite,
                # so we use this when possible
                id_query = (
                    Document.objects.filter(
                        id__in=(
                            RawSQL(
                                "SELECT rowid FROM documents_document_fts WHERE title MATCH %s OR content MATCH %s LIMIT %s",
                                params=(ft_search_exp, ft_search_exp, limit),
                            )
                        ),
                    )
                    .order_by()
                    .values_list("id", flat=True)
                )
            ids = [id for id in id_query.all()]
            return qs.filter(id__in=ids)
        else:
            # Fallback to non-indexed legacy search method if no proper solution is available
            return qs.filter(Q(title__icontains=value) | Q(content__icontains=value))
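
The search expressions built per backend are easier to follow in isolation. The sketch below reproduces only the string construction from the filter above. `split_tokens` and `FULLTEXT_MINIMAL_TOKEN_LENGTH` are not shown in the post, so the versions here (whitespace splitting, a minimum length of 3) are assumptions, and `build_expressions` is a hypothetical helper.

```python
FULLTEXT_MINIMAL_TOKEN_LENGTH = 3  # assumed value, not shown in the post


def split_tokens(value: str) -> list[str]:
    """Hypothetical stand-in: split the search string on whitespace."""
    return value.split()


def build_expressions(value: str, vendor: str) -> dict[str, str]:
    """Return the full-text and LIKE expressions used for a database vendor."""
    tokens = split_tokens(value)
    fulltext_tokens = [t for t in tokens if len(t) >= FULLTEXT_MINIMAL_TOKEN_LENGTH]
    # Tokens in order, with at least one character between them
    like = "%" + "_%".join(tokens) + "%"
    if vendor == "postgresql":
        return {"fulltext": " & ".join(tokens) + ":*"}
    if vendor in {"mysql", "mariadb"}:
        return {"fulltext": "+" + " +".join(fulltext_tokens) + "*", "like": like}
    if vendor == "sqlite":
        return {"fulltext": " ".join(fulltext_tokens) + "*", "like": like}
    return {}


if __name__ == "__main__":
    for vendor in ("postgresql", "mariadb", "sqlite"):
        print(vendor, build_expressions("acme invoice", vendor))
```

In each expression the prefix marker (`:*` or `*`) is attached to the last token only, so earlier tokens are matched as whole words. The raw tokens go into the full-text expressions unchanged, which means operator characters typed by the user would be interpreted by the database's query parser.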

3 replies

axelopale Jul 15, 2025
Author

Hello, seems to me to be a search engine's job.
Why not use Meilisearch? Users would be able to choose between a local instance (via Docker) and a cloud instance.

Merinorus Jul 15, 2025

I wanted to quickly scale an existing feature. IMHO, searching one or two words in the title and content is an easy task for any SQL database. It is also less complex to maintain and convenient for low RAM hardware. But I agree that a dedicated search engine like Meilisearch is powerful. We can think about it, and at the same time optimize database queries. They are not mutually exclusive.

Merinorus Jul 24, 2025

The pull request is available: #10448. Testing and feedback are welcome!

2026-01-21T03:33:45Z

github-actions[bot]
Bot Jan 21, 2026

This discussion has been automatically closed due to inactivity. Please see our contributing guidelines for more details.

0 replies

2026-02-20T03:48:26Z

github-actions[bot]
Bot Feb 20, 2026

This discussion has been automatically locked since there has not been any recent activity after it was closed. Please open a new discussion for related concerns. See our contributing guidelines for more details.

0 replies
