Incremental build affected by large bib file #256

Open
thrau opened this issue Feb 24, 2019 · 11 comments

@thrau
thrau commented Feb 24, 2019

I have a 700 KB BibTeX file with about a thousand entries, and one file to render it. So building the entire source is naturally a little slow (22 seconds).

However, I found that jekyll --watch --incremental takes the same amount of time when building files that have no dependency on the BibTeX file.

My publications.md file is below. Interestingly, when I add a query to filter, e.g., only publications from 2018 (50-100 or so), the build speeds up drastically (22s -> 2s).

Any idea what the problem could be? In particular, it seems odd to me that the incremental build of unrelated files is affected by the number of BibTeX entries rendered in other files. I'm not familiar enough with jekyll to tell whether this is a jekyll problem or has something to do with the plugin.

---
---

## Publications

{% bibliography %}

  • I'm using jekyll 3.8.5 and the latest jekyll-scholar release.
  • scholar config in _config.yml:

        scholar:
          style: apa

  • build command: /usr/bin/ruby2.3 /usr/local/bin/jekyll b -d /var/www/html/ -s /home/webmaster/source --watch --incremental

@inukshuk
Owner

Is it possible that the detail pages are being generated? If I remember correctly, all the generator plugins will run regardless of which pages need to be built. The long build time is probably caused by loading the style and processing the references; you're right that this should not be necessary unless building pages that contain the actual references, so it would be great to improve that!

@thrau
Author

thrau commented Feb 26, 2019

Detail pages are not being generated in my configuration.

@cardi

cardi commented May 30, 2019

I'm seeing similarly long build times, which likely come from generating multiple bibliographies from citations across different pages, in addition to a monolithic page that iterates through all references.

Using Jekyll 3.8.5 with jekyll-scholar 5.14.1.

references.bib contains 203 entries, 197KB, with an ACM SIG proceedings style.

Some (rough) benchmarks:

  • baseline (jekyll-scholar loaded, empty references.bib): 30s
  • generate only cited references, no detail pages: 297s
  • generate all references, no detail pages: 271s
  • generate all references, detail pages: 1500s (!)

Given that most of the detail pages won't change too often from the underlying BibTeX or style used, a one-time expensive cost in the initial generation is manageable.

Generating references and detail pages might benefit from the upcoming Jekyll 4.0 Cache API.

I'm not familiar with Ruby or the internal workings of jekyll/jekyll-scholar, but if you can point to where I or someone else might start, that would be helpful.

@inukshuk
Owner

To speed up the generation of detail pages, you could add some conditions around here. After generating the detail pages, we could write some kind of manifest or save a timestamp which we could compare to the modification date of the bib file. That way, we'd generate details only if the bib file has changed since the last time the detail pages were generated. (A more granular approach, at the entry level, is probably not worth the effort.)
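
A minimal sketch of the timestamp idea, in plain Ruby. The helper names `details_stale?` and `mark_details_built` and the stamp file location are hypothetical; nothing here is existing jekyll-scholar code, and the actual generator hook is not shown.

```ruby
require "fileutils"

# Hypothetical helper: true when the detail pages should be regenerated,
# i.e. no stamp exists yet or the bib file is newer than the stamp.
def details_stale?(bib_path, stamp_path)
  return true unless File.exist?(stamp_path)

  File.mtime(bib_path) > File.mtime(stamp_path)
end

# Hypothetical helper: record that the detail pages are up to date.
# Creating or touching the stamp sets its mtime to the current time.
def mark_details_built(stamp_path)
  FileUtils.mkdir_p(File.dirname(stamp_path))
  FileUtils.touch(stamp_path)
end
```

A generator could wrap its detail-page loop in `if details_stale?(bib, stamp)` and call `mark_details_built(stamp)` afterwards. One caveat of this sketch: an edit to the bib file made while the build is running would be hidden by the later stamp.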

@cardi

cardi commented May 31, 2019

Thanks. That seems like a reasonable approach–I'll see if I can't make a first pass over the next few days.

The other aspect in build times is, I think, building bibliographies from citations (e.g., {% bibliography --cited %}).

If every entry in the bibliography is parsed, and the references for each cited entry are then rebuilt every time the bibliography tag is called, caching the rendered entries could plausibly save a lot of time.
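
To illustrate the kind of caching meant here, the following is a sketch of memoizing rendered references by citation key. `ReferenceRenderer` and `expensive_render` are hypothetical stand-ins; the real cost in jekyll-scholar would be the citation style processing, which is not reproduced here.

```ruby
# Hypothetical stand-in for whatever formats a single reference.
class ReferenceRenderer
  attr_reader :render_count

  def initialize(entries)
    @entries = entries   # citation key => raw entry text
    @rendered = {}       # memo: citation key => rendered reference
    @render_count = 0    # how often the expensive step actually ran
  end

  # Render each key at most once; later calls return the memoized string.
  def reference(key)
    @rendered.fetch(key) do
      @render_count += 1
      @rendered[key] = expensive_render(@entries.fetch(key))
    end
  end

  private

  # Placeholder for the real (slow) formatting step.
  def expensive_render(raw)
    raw.upcase
  end
end
```

With one renderer shared across all bibliography tags in a build, an entry cited on many pages would be formatted only once.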

@cardi

cardi commented Jun 9, 2019

I've prototyped something quickly to use the Cache API when generating details pages. The results are looking very promising:

| run | files | total (sec) | average (sec) | median (sec) | min (sec) | max (sec) |
|---|---|---|---|---|---|---|
| first run (all cache misses) | 206 | 205.353 | 0.996859 | 1.007 | 0.333644 | 2.166210 |
| second run (all cache hits) | 206 | 0.040451 | 0.000196364 | 0.000186 | 0.000141 | 0.000586 |

Results may vary, since the underlying cache is loading each entry from disk the first time it's called (perhaps as the Cache API evolves, jekyll could warm up the cache by loading the entirety of the cache from disk into memory, or using a different backing store, but I don't anticipate working on that anytime soon.)

This should work well, especially for incremental builds: jekyll+jekyll-scholar will only build new BibTeX entries.

Some edge cases I haven't quite thought about yet, that won't trigger a rebuild of the details pages:

  • editing existing BibTeX entries (after the first run)
  • editing the template for details pages

I'd expect the above operations to happen rarely, so I think incurring the expensive cost is OK, but at the moment there are two ways to trigger a complete rebuild:

  1. delete .jekyll-cache directory
  2. modify _config.yml

Perhaps there will be a flag that one can pass to jekyll build that clears the cache when 4.0 is released.

@inukshuk
Owner

Looks great! Did you figure out where scholar was modifying site.config?

Regarding the cache invalidation, perhaps we could create some kind of manifest file for the details pages with a checksum of the BibTeX file? That way we could detect when a rebuild is required.

@cardi

cardi commented Jun 15, 2019

> Looks great! Did you figure out where scholar was modifying site.config?

I haven't, but I plan on taking a closer look after I've polished the caching code.

> Regarding the cache invalidation, perhaps we could create some kind of manifest file for the details pages with a checksum of the BibTeX file? That way we could detect when a rebuild is required.

That seems like a good approach that will take care of most of the issues, even if it is a bit heavy-handed. I suppose we could do the same with the layout for the details page.
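
A sketch of such a manifest, covering both the BibTeX file and the details layout. The helper names and the JSON manifest format are assumptions made for illustration, not part of jekyll-scholar or Jekyll.

```ruby
require "digest"
require "json"

# Hypothetical helper: SHA-256 checksum of every input file that affects
# the generated details pages, keyed by path.
def input_checksums(paths)
  paths.sort.map { |path| [path, Digest::SHA256.file(path).hexdigest] }.to_h
end

# A rebuild is needed when there is no manifest yet, or when any recorded
# checksum differs from the current one.
def rebuild_needed?(manifest_path, paths)
  return true unless File.exist?(manifest_path)

  JSON.parse(File.read(manifest_path)) != input_checksums(paths)
end

# Write the manifest after the details pages were generated successfully.
def write_manifest(manifest_path, paths)
  File.write(manifest_path, JSON.pretty_generate(input_checksums(paths)))
end
```

Unlike a modification-time check, this only triggers a rebuild when the file contents actually change, at the cost of reading each input file on every build.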

Another, possibly easier approach that I've just thought of is to cache a hash of each BibTeX entry: if the cached hash is missing or doesn't match, then rebuild that particular entry. I think this would only work if the BibTeX object (dictionary?) in Ruby is serialized in a deterministic order.
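
On the ordering concern: Ruby hashes keep insertion order, so the same fields written in a different order would serialize differently unless they are sorted first. A sketch under the assumption that each entry can be reduced to a plain hash of field names and values (`entry_fingerprint` and `stale_keys` are hypothetical names):

```ruby
require "digest"
require "json"

# Hypothetical helper: order-independent fingerprint of one entry.
# Fields are converted to string pairs and sorted before hashing.
def entry_fingerprint(fields)
  canonical = fields.map { |name, value| [name.to_s, value.to_s] }.sort
  Digest::SHA256.hexdigest(JSON.generate(canonical))
end

# Keys of entries whose fingerprint is missing from the cache or differs,
# i.e. the only entries that would need to be rebuilt.
def stale_keys(entries, cached_fingerprints)
  entries.reject { |key, fields| cached_fingerprints[key] == entry_fingerprint(fields) }.keys
end
```

This would also cover the case of editing an existing entry after the first run, since the edited entry's fingerprint no longer matches the cached one.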

@sneakers-the-rat

@cardi still working on this? want to join forces on a PR with what we're talking about here? #335

@cardi

cardi commented Dec 1, 2021

> @cardi still working on this? want to join forces on a PR with what we're talking about here? #335

It's been a while since I've looked at this, and I'm still interested in having this feature implemented.

I made a first pass at using Jekyll's Cache API here: https://github.com/cardi/jekyll-scholar/tree/cached-details, but a critical blocker (that may or may not have been resolved since) is that any change to site.config internally will invalidate the cache and rebuild everything.

While I documented the issue and my findings in #262, I don't have a proposed fix for it. (Maybe storing some of jekyll-scholar's settings in a different variable outside of site.config?)
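
A sketch of what keeping the settings outside the shared config could look like. Everything here is illustrative: `DEFAULTS`, its values, and `config_fingerprint` are hypothetical, and the fingerprint is only a stand-in for however the cache detects a changed configuration.

```ruby
require "digest"

# Hypothetical plugin defaults, for illustration only.
DEFAULTS = { "style" => "apa", "sort_by" => "none" }.freeze

# Stand-in for a "has the config changed?" check.
def config_fingerprint(config)
  Digest::SHA256.hexdigest(Marshal.dump(config))
end

# Mutating variant: writes the merged defaults back into the shared
# config hash, which changes its fingerprint.
def scholar_settings_mutating!(site_config)
  site_config["scholar"] = DEFAULTS.merge(site_config["scholar"] || {})
end

# Non-mutating variant: returns the merged settings in a new hash and
# leaves the shared config hash untouched.
def scholar_settings(site_config)
  DEFAULTS.merge(site_config["scholar"] || {})
end
```

In this sketch the plugin would read its effective settings from the hash returned by `scholar_settings` rather than from the shared config.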

I think #262 has to be resolved before caching can be implemented and used.

@inukshuk
Owner

inukshuk commented Dec 1, 2021

@cardi I took a quick look at this and I think that the BibTeX converter merged in the default scholar config during initialization. Give it another go, to see if this fixes the issue you'd been seeing.
