Pdfgrep and rga comparison #31
Comments
Hey, thanks for the compliment! :) You are indeed correct that ripgrep's speed advantages are mostly irrelevant for searching a directory of mainly pdf files, since parsing / rendering the pdf is by far slower than the walking or the grepping. There are three reasons why rga is faster than pdfgrep:
I wanted to use a Rust pdf rendering library instead of poppler of course, but the only existing one is slower (4.2s on the above pdf) and not very mature.
That is very useful information. For academics like me, one of the most frequent situations is to have a gigantic, mostly static (slowly varying), flat folder of books and articles and search things in there all the time. For that purpose the caching seems fantastic. About 3, I did not expect such a difference in speed! Although my pdfgrep seems suspiciously slow even granted these benchmarks, and I am wondering if there is some misconfiguration on my part, or some problem with my version of pdfgrep. In any case, since rga exists I am less motivated to pursue the riddle. Thanks for your time and your work!
By the way, seems like pdfgrep also implemented caching a while ago when you use
Also, pdfgrep's cache is much larger than rga's, since it doesn't use compression.
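To make the cache-size point concrete, here is a minimal sketch of a compressed text cache. It is an illustration only, not how rga or pdfgrep actually store their caches: `cache_key`, `cached_text`, the `extract` callback, and the cache directory layout are all hypothetical names chosen for this example, and zlib is simply one readily available compressor.

```python
import hashlib
import zlib
from pathlib import Path
from typing import Callable


def cache_key(path: Path) -> str:
    """Derive a key from path, size, and mtime, so an edited file is re-extracted."""
    st = path.stat()
    raw = f"{path.resolve()}:{st.st_size}:{st.st_mtime_ns}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def cached_text(path: Path, extract: Callable[[Path], str], cache_dir: Path) -> str:
    """Return the extracted text for `path`, running `extract` only on a cache miss.

    `extract` stands in for the slow step (for example, a PDF-to-text call).
    Entries are stored zlib-compressed, which is what keeps the cache small.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    entry = cache_dir / cache_key(path)
    if entry.exists():
        # Cache hit: decompress instead of re-parsing the document.
        return zlib.decompress(entry.read_bytes()).decode("utf-8")
    text = extract(path)
    entry.write_bytes(zlib.compress(text.encode("utf-8"), level=6))
    return text
```

Extracted text is usually highly compressible, so a cache that skips this step will tend to take noticeably more disk space for the same set of documents.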
Closing this since I think the documentation kinda answers this and the comments here contain more details :)
Being an academic whose workflow partly depends on pdfgrep, I was also very impressed with rga's speed. Some small caveats on their comparison in terms of speed, however. On my machine, pdfgrep is faster than rga when no files have been cached yet. This seems partly due to the fact that

Which brings me to my main question (sorry for hijacking this thread a bit): is rga meant to be a pdfgrep killer/alternative? If so, that would be great, given that pdfgrep is not that actively developed anymore. If rga is meant to be a true alternative to pdfgrep, it would be helpful to learn from pdfgrep's successes and failures. There are some things rga seems to do better, some slightly worse, and some things neither program can do.

Formatting

I like the way rga displays its results:

$ rga -C3 --rga-adapters=poppler Nullam **pdf
filename1.pdf
page 123: **Nullam** tempus.
page 123: Nulla posuere.
page 124: Donec posuere augue in quam.
filename2.pdf
page 123: **Nullam** tempus.
...

This seems like an improvement over pdfgrep, which displays the filename on every line. Displaying the filename on every line can cause long lines to wrap, which IMO degrades readability. Other parts of rga's formatting can be improved, however. First, for whatever reason, the page numbers are not colored. Page numbers in pdfgrep, on the other hand, are colored, and likewise for line numbers in

Multicolumn documents

One of my main gripes with pdfgrep is searching through multicolumn documents. When doing this in a small terminal window, the results often look really messy and/or cause a lot of line wrapping. Additionally, when a match is found in one column, the information in the other column is often contextually irrelevant, and thus only adds clutter. For a more detailed description of this issue, see https://gitlab.com/pdfgrep/pdfgrep/-/issues/42. Is there any way that multicolumn search can be fixed in rga, for example by reflowing text to a one-column format?
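As a rough illustration of what "reflowing to one column" could mean, here is a small sketch that works on already-extracted, layout-preserving text. It is not something rga or pdfgrep does, as far as this thread shows; `find_gutter` and `reflow_two_columns` are hypothetical helpers, and the approach assumes the two columns are separated by a run of character positions that is blank on every line of the page.

```python
from typing import List, Optional, Tuple


def find_gutter(lines: List[str], min_width: int = 3) -> Optional[Tuple[int, int]]:
    """Find the widest interior run of character columns that is blank on every line.

    Returns (start, end) as a half-open range, or None if no run is at
    least `min_width` wide. Leading and trailing margins are ignored.
    """
    width = max((len(line) for line in lines), default=0)
    padded = [line.ljust(width) for line in lines]
    blank = [all(row[i] == " " for row in padded) for i in range(width)]

    best = None
    i = 0
    while i < width:
        if not blank[i]:
            i += 1
            continue
        j = i
        while j < width and blank[j]:
            j += 1
        # Require text on both sides, so page margins do not count as a gutter.
        if i > 0 and j < width and j - i >= min_width:
            if best is None or j - i > best[1] - best[0]:
                best = (i, j)
        i = j
    return best


def reflow_two_columns(lines: List[str], min_width: int = 3) -> List[str]:
    """Return the left column's lines followed by the right column's lines."""
    gutter = find_gutter(lines, min_width)
    if gutter is None:
        # No gutter detected: treat the page as single-column.
        return [line.rstrip() for line in lines]
    start, end = gutter
    left = [line[:start].rstrip() for line in lines]
    right = [line[end:].rstrip() for line in lines]
    return [line for line in left + right if line]


page = [
    "Nullam tempus.      Donec posuere",
    "Nulla posuere.      augue in quam.",
]
for line in reflow_two_columns(page):
    print(line)
```

Real PDFs are messier than this (ragged columns, figures spanning the gutter, proportional fonts), so a usable version would likely need to work from the glyph coordinates the PDF library reports rather than from padded text.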
Hello; this is not a bug or anything, just a question.
I know that ripgrep is blazingly fast compared to other grep options, and I was recently recommended to use your ripgrep-all for my scripts to mass grep/sort/filter on pdfs. With little knowledge of the specifics, I expected little from rga for searching in pdfs: it seemed to me that decoding the pdf would be the time-consuming part, and that ripgrep would not make much difference over other greps for, say, a 200-page text. On the contrary, when I benchmarked rga against pdfgrep the difference was ridiculous, and the diffs of the results look clean, so no inconsistencies so far.
Could you let me know, briefly, what makes rga so much faster than things like pdfgrep (that is, if you know or can guess)? The speed difference seems so remarkable that, for my purposes, it makes pdfgrep useless.