Pdfgrep and rga comparison #31

Closed
ykonstant1 opened this issue Nov 8, 2019 · 5 comments

@ykonstant1

Hello; this is not a bug or anything, just a question.

I know that ripgrep is blazingly fast compared to other grep options, and I was recently recommended to use your ripgrep-all for my scripts to mass grep/sort/filter on pdfs. With little knowledge of specifics, I expected little from rga for search in pdfs: it seemed to me that decoding the pdf would be the time-consuming part and ripgrep would not make much difference for, say, a 200 page text over other greps. On the contrary, when I benchmarked rga against pdfgrep the difference was ridiculous, and diffs of the two outputs come out clean, so no inconsistencies so far.

Could you let me know, briefly, what makes rga so much faster than things like pdfgrep (that is, if you know or can guess)? The speed difference is so remarkable that, for my purposes, it makes pdfgrep useless.

@phiresky
Owner

phiresky commented Nov 8, 2019

Hey, thanks for the compliment! :)

You are indeed correct that ripgrep's speed advantages are mostly irrelevant for searching a directory of mainly pdf files, since parsing / rendering the pdf is by far slower than the walking or the grepping.

There are three reasons why rga is faster than pdfgrep:

  1. Caching. rga will only run extractors once and then use cached extraction results on all subsequent runs (if the compressed extracted text is under a size limit). This is only relevant if you run rga multiple times on the same files.
  2. Parallelization. ripgrep will call the preprocessing command in parallel for multiple files, so it uses multiple CPU cores for extraction (potentially with $(nproc)x speedup), as opposed to pdfgrep, which is single-threaded.
  3. pdftotext is still ~2x faster than pdfgrep on a single file for unknown reasons. For example:
    $ time pdfgrep a test.pdf >/dev/null 
    pdfgrep  0,791 total
    $ time pdftotext test.pdf >/dev/null
    pdftotext 0,397 total
    $ time rga-preproc test.pdf >/dev/null
    rga-preproc 0,428 total
    $ time rga-preproc test.pdf >/dev/null # after caching
    rga-preproc 0,005 total
    
    You can see that pdfgrep is almost twice as slow as pdftotext and that rga-preproc is almost as fast as pdftotext on the first run (and of course almost instant on the second run).
    This is unexpected since pdftotext uses poppler and pdfgrep uses poppler for rendering as well. Maybe it's because pdfgrep does some work on every page separately so it might have overhead from data structure initialization or similar.
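
To make points 1 and 2 concrete, here is a minimal sketch in Python. It is not rga's actual implementation: the `ExtractionCache` class, the `search` helper, the in-memory store, and the `MAX_CACHED_BYTES` value are all assumptions made for illustration. The idea is just that extraction runs once per unchanged file, the result is stored compressed if it is under a size limit, and extraction for many files runs in parallel.

```python
import os
import zlib
from concurrent.futures import ThreadPoolExecutor

MAX_CACHED_BYTES = 2_000_000  # assumed limit for illustration, not rga's real value


class ExtractionCache:
    """Caches compressed extractor output, keyed by path, mtime and size."""

    def __init__(self, extract, max_bytes=MAX_CACHED_BYTES):
        self.extract = extract      # hypothetical callable: path -> extracted text
        self.max_bytes = max_bytes
        self.store = {}             # in-memory stand-in for an on-disk cache

    def _key(self, path):
        st = os.stat(path)
        # A changed mtime or size invalidates the cached entry.
        return (os.path.abspath(path), st.st_mtime_ns, st.st_size)

    def text(self, path):
        key = self._key(path)
        blob = self.store.get(key)
        if blob is not None:
            return zlib.decompress(blob).decode("utf-8")
        text = self.extract(path)   # the slow part, e.g. a call out to pdftotext
        blob = zlib.compress(text.encode("utf-8"))
        if len(blob) <= self.max_bytes:
            self.store[key] = blob  # only cache results under the size limit
        return text


def search(paths, needle, cache, workers=None):
    """Return the paths whose extracted text contains needle."""
    paths = list(paths)
    # Threads are enough when the extractor spends its time in a subprocess.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        texts = pool.map(cache.text, paths)
        return [p for p, t in zip(paths, texts) if needle in t]
```

On a second call to `search` with the same cache and unchanged files, the extractor is not invoked again, which is where the near-instant second run comes from.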

I wanted to use a Rust pdf rendering library instead of poppler of course, but the only existing one is slower (4.2s on the above pdf) and not very mature.

@ykonstant1
Author

That is very useful information. For academics like me, one of the most frequent situations is to have a gigantic, mostly static (slowly varying), flat folder of books and articles and search things in there all the time. For that purpose the caching seems fantastic.

About 3, I did not expect such a difference in speed! That said, my pdfgrep seems suspiciously slow even granted these benchmarks, and I am wondering if there is some misconfiguration on my part, or some problem with my version of pdfgrep. In any case, since rga exists I am less motivated to pursue the riddle.

Thanks for your time and your work!

@phiresky
Owner

phiresky commented Nov 8, 2019

By the way, it seems pdfgrep also added caching a while ago, enabled with pdfgrep --cache. It still can't really compete with rga though (very rough benchmark on a fairly large directory):

pdfgrep --cache -r hello (first run): 1:41 min
pdfgrep --cache -r hello (second run): 1.71 s

rga --rga-adapters=poppler hello (first run): 7.57 s
rga --rga-adapters=poppler hello (second run): 0.17 s

Also, pdfgrep's cache is much larger than rga's since it doesn't use compression.

du -sh ~/.cache/rga: 7.8M
du -sh ~/.cache/pdfgrep: 31M

@phiresky
Owner

Closing this since I think the documentation kinda answers this and the comments here contain more details :)

@rien333

rien333 commented Aug 20, 2020

Being an academic whose workflow partly depends on pdfgrep, I was also very impressed with rga's speed. There are some small caveats to the speed comparison, however. On my machine, pdfgrep is faster than rga when no files have been cached yet. This seems partly due to the fact that pdftotext can get stuck on some particular files for quite a while (not sure why yet). After having cached files, however, rga is the clear winner. Furthermore, note that pdfgrep can also be parallelized using GNU parallel (though I've never gotten this to work properly).
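
For what it's worth, the per-file parallelization can also be approximated without GNU parallel. The following is only a rough sketch: `grep_one` and `grep_many` are hypothetical helper names, and it assumes `pdfgrep` is on the PATH and that its `-n` flag prints page numbers. It runs one process per file and collects the matching output lines.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

DEFAULT_CMD = ("pdfgrep", "-n")  # assumption: pdfgrep is installed; -n prints page numbers


def grep_one(pattern, path, base_cmd=DEFAULT_CMD):
    """Run one grep process on one file and return (path, output lines)."""
    proc = subprocess.run(
        [*base_cmd, pattern, path],
        capture_output=True,
        text=True,
    )
    # The exit status is not checked here; a file without matches just yields no lines.
    return path, proc.stdout.splitlines()


def grep_many(pattern, paths, base_cmd=DEFAULT_CMD, workers=None):
    """Search many files concurrently; return {path: lines} for files with output."""
    paths = list(paths)
    # Each worker thread only waits on its subprocess, so threads are sufficient.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda p: grep_one(pattern, p, base_cmd), paths)
        return {path: lines for path, lines in results if lines}
```

Calling `grep_many("hello", list_of_pdfs)` would then spread the work over several processes, though it does nothing about individual files on which extraction gets stuck.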

Which brings me to my main question (sorry for hijacking this thread a bit): is rga meant to be a pdfgrep killer/alternative? If so, that would be great, given that pdfgrep is not that actively developed anymore.

If rga is meant to be a true alternative to pdfgrep, it will be helpful to learn from pdfgrep's successes and failures. There are some things rga seems to do better, some slightly worse, and some things both programs can't do.

Formatting

I like the way rga displays its results:

$ rga -C3 --rga-adapters=poppler Nullam **pdf
filename1.pdf
page 123: **Nullam** tempus.  
page 123: Nulla posuere.  
page 124: Donec posuere augue in quam.  
filename2.pdf
page 123: **Nullam** tempus.  
...

This seems like an improvement over pdfgrep, which displays the filename on every line. Displaying the filename on every line can cause long lines to wrap, which IMO degrades readability.

Other parts of the formatting of rga can be improved, however. First, for whatever reason, the page numbers are not colored. Page numbers in pdfgrep, on the other hand, are colored, and likewise for line numbers in rg. Colored page numbers help visually differentiate the main content from other information, so adding colors to page numbers would be greatly appreciated. Second, I don't really like how every page number is prefixed with "page". When searching through PDFs, this prefix seems redundant, and it creates clutter and potential line wrapping. pdfgrep doesn't have this prefix either.
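
Until something like this exists in rga itself, both points could be worked around by post-processing the output. The sketch below is only an illustration: it assumes result lines look like `page 123: ...` as in the example above, and the `restyle` function and the choice of green are my own inventions, not anything rga provides.

```python
import re

# Assumption: each result line starts with "page <number>:" as in the output above.
PAGE_PREFIX = re.compile(r"^page (\d+):", re.IGNORECASE)
GREEN = "\033[32m"   # ANSI escape sequence for green text
RESET = "\033[0m"    # ANSI escape sequence that resets attributes


def restyle(line, color=True):
    """Drop the 'page ' prefix and optionally color the page number."""
    def repl(match):
        number = match.group(1)
        return f"{GREEN}{number}{RESET}:" if color else f"{number}:"
    # Lines without the prefix (such as filenames) pass through unchanged.
    return PAGE_PREFIX.sub(repl, line, count=1)
```

Applying `restyle` to every line of rga's output would give `123: Nullam tempus.` with the number in green, which is close to what pdfgrep prints.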

Multicolumn documents

One of my main gripes with pdfgrep is searching through multicolumn documents. When doing this in a small terminal window, the results often look really messy and/or cause a lot of line wrapping. Additionally, when a match is found in one column, the information in the other column is often contextually irrelevant, and thus only adds clutter. For a more detailed description of this issue, see https://gitlab.com/pdfgrep/pdfgrep/-/issues/42.

Is there any way that multicolumn search can be fixed in rga, by for example reflowing text to a one column format?
