
Generating a site with about 60K posts takes forever #560

Closed
404pnf opened this Issue May 24, 2012 · 12 comments


404pnf commented May 24, 2012

Tried to migrate a site from Drupal to Jekyll. Unfortunately there is no migration script for Drupal 7, so I generated all the individual posts (57,882 of them) in the _posts folder and ran Jekyll. It took forever and never succeeded, even on a server (RAM: 12GB, CPU: 16 cores).

Is Jekyll suitable for a site of this size? What should I do to speed things up?


alexcp commented May 25, 2012

With such a huge number of posts, I think it would be easier to do it in smaller batches.
You could write a simple Bash or Ruby script to compile, say, a thousand posts or fewer at a time, repeating the operation automatically until every post is processed.
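
A minimal sketch of the kind of batching script suggested here, assuming posts are staged in a _staging/ directory (the directory names, batch size, and output-merging step are illustrative, not anything Jekyll provides):

  #!/usr/bin/env ruby
  # Move staged posts into _posts/ a thousand at a time, run Jekyll on
  # each batch, and copy the generated output aside so the next batch
  # starts from an empty _posts/.
  require 'fileutils'

  BATCH_SIZE = 1000
  FileUtils.mkdir_p('_site_all')

  Dir.glob('_staging/*').sort.each_slice(BATCH_SIZE).with_index(1) do |batch, i|
    FileUtils.mv(batch, '_posts')
    puts "Building batch #{i} (#{batch.size} posts)..."
    system('jekyll') or abort("jekyll failed on batch #{i}")
    FileUtils.cp_r(Dir.glob('_site/*'), '_site_all')  # accumulate output
    FileUtils.rm_f(Dir.glob('_posts/*'))              # clear for the next batch
  end

As the next comment points out, batching like this only covers individual posts; aggregate pages such as the archive and category indexes need to see every post at once.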


404pnf commented May 28, 2012

I am not a programmer, so let me paraphrase what you said to see if I've got it: I should write a script that instructs Jekyll to generate, e.g., 1,000 articles per run. Translated into actions, that means: move 1,000 articles into _posts and run Jekyll, then move another 1,000 in and re-run Jekyll. I think this works for individual posts, but not for aggregated pages: archive.html, categories.html, etc.

The tricky part is categories and tags, since Jekyll has to collect all the posts to know that a category has X posts.

How do I go about that?

According to the doc here, Jekyll can't generate a site incrementally:

Jekyll collects data.
Jekyll scans the posts directory and collects all post files as post objects. It then scans the layout assets and collects those, and finally scans other directories in search of pages.

Jekyll computes data.
Jekyll takes these objects, computes metadata (permalinks, tags, categories, titles, dates) from them, and constructs one big site object that holds all the posts, pages, layouts, and respective metadata. At this stage your site is one big computed Ruby object.

http://jekyllbootstrap.com/lessons/jekyll-introduction.html


plusjade commented May 28, 2012

@404pnf, you are right: if you batch-process your posts, you lose the aggregate data for the entire post collection. Unfortunately, degraded performance with large numbers of posts is a known issue for Jekyll. However, it seems you have plenty of processing power on your server. Can you successfully compile, say, 100 posts? 1,000?


Member

parkr commented May 29, 2012

Yeah, check to see what the limit is. It may be 10,000. Because each post and page is held as a separate object, I'm sure 60,000 posts require a huge amount of memory. You may have the capacity, though.

It'd be cool to add a jekyll compile command, sort of like what we have for Compass, that runs once and compiles the site. Right now, I think there's only the server functionality, and running WEBrick with this number of posts might not be possible. It could incrementally save some cache file that is compressed and quickly readable for efficiency, though the problems may run deeper than that.
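
A rough sketch of the kind of cache imagined above, using Marshal plus gzip. Nothing like this exists in Jekyll; the file name and payload shape are invented for illustration:

  # Hypothetical compile-time cache: serialize the computed site data to
  # a compressed file once, then reload it on later runs instead of
  # recomputing everything from scratch.
  require 'zlib'

  CACHE_FILE = '.jekyll-cache.gz'

  def write_cache(site_payload)
    Zlib::GzipWriter.open(CACHE_FILE) { |gz| gz.write(Marshal.dump(site_payload)) }
  end

  def read_cache
    return nil unless File.exist?(CACHE_FILE)
    Zlib::GzipReader.open(CACHE_FILE) { |gz| Marshal.load(gz.read) }
  end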


404pnf commented May 29, 2012

I will follow the advice, test the limits, and report back.

My observation from previous attempts is that the RAM required is roughly two to three times the total size of the posts.

In my case, if I process all the posts (247 MB) in one run, the RAM usage shown by top is 450-600 MB and stays there. If the posts total 41 MB, it takes 112 MB of resident (RES) RAM.


404pnf commented May 30, 2012

Benchmark testing

Command used:

  jekyll --server --kramdown --no-auto --limit_posts=num

All aggregated pages are disabled, and the YAML front matter has only layout and title.

| number of posts | elapsed time       | RES RAM | VIRT RAM | CPU  | load average |
|-----------------|--------------------|---------|----------|------|--------------|
| 2,000           | < 10 min           | 317 MB  | 354 MB   | 100% | 1.6-3        |
| 10,000          | ~90 min            | 463 MB  | 493 MB   | 100% | 1.6-3        |
| 20,000          | ~4 hrs             | 494 MB  | 522 MB   | 100% | 1.6-3        |
| 30,000          | 6 hrs and counting | 513 MB  | 606 MB   | 100% | 1.6-3        |
| 50,000          | didn't try         |         |          |      |              |

I think the 30K run will eventually finish successfully.

Machine spec:

RAM: 32GB
CPU: 16 cores

Conclusion?

Huge RAM doesn't help, since memory usage is predictably less than three times the total size of the posts.

A multi-core CPU doesn't help, since Ruby only uses one core.

Proposal?

For this volume, if I don't want any aggregate pages, I would like a generator that reads in the paths of all the files to be converted and converts them one by one. But I am not a programmer. Would anyone give a skeleton code example of how to do that in Jekyll? I will try to figure out the rest.

In my case, I can use a separate generator just for the aggregated pages. Since those pages (categories, tags, archive, and sitemap) only need a title and a URL for each post, with the exception of atom.xml (which I don't use), the code only needs to collect the YAML front matter and generate 4 pages. I don't know how to do that either. :)
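
A skeleton of the standalone approach described above, under stated assumptions: posts use the standard YYYY-MM-DD-slug.ext filenames, only the YAML front matter is read, and the URL scheme and output file are illustrative:

  #!/usr/bin/env ruby
  # Collect only title + URL from each post's front matter, skipping the
  # post bodies entirely, then emit a simple archive listing. Category,
  # tag, and sitemap pages could be built the same way from the posts array.
  require 'yaml'

  posts = []
  Dir.glob('_posts/*').sort.each do |path|
    m = File.read(path).match(/\A---\s*\n(.*?)\n---/m) or next  # front matter only
    front = YAML.load(m[1]) || {}
    y, mo, d, slug = File.basename(path, File.extname(path)).split('-', 4)
    posts << { 'title' => front['title'], 'url' => "/#{y}/#{mo}/#{d}/#{slug}.html" }
  end

  File.open('archive.html', 'w') do |f|
    posts.each { |p| f.puts %(<a href="#{p['url']}">#{p['title']}</a><br>) }
  end

Because this never renders Markdown or builds a full site object, memory use should stay proportional to the front matter alone.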


plusjade commented May 30, 2012

@404pnf you don't need to spawn a server if you are just trying to compile the site, though I don't know whether it matters performance-wise:

$ jekyll --kramdown --no-auto --limit_posts=num

Disabling pages that display aggregate data does not mean Jekyll skips computing that data; it's a consequence of the one-huge-site-object design.

I think it might be possible to hack Jekyll to process one page at a time, with the obvious consequence that there would be no aggregate data.

I'm currently working on my own static blog generator, which might address these bottlenecks, and I'd be willing to test your website against my engine: http://ruhoh.com. I can't promise anything, but ruhoh takes a more functional approach, so I can very likely get it to process one file at a time. Additionally, I might even be able to do a kind of two-stage processing (sketched below):

  1. Extract and compile all aggregate data and store it either in memory or as a JSON file.
  2. Process each page one at a time, using the aggregate file object where needed.

I'm much more committed to my project than to Jekyll at this point, so let me know if you want some help working on your site.
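
A sketch of that two-stage idea under stated assumptions (this is not ruhoh code; the aggregate.json name and the per-page step are placeholders):

  #!/usr/bin/env ruby
  # Stage 1: scan front matter only and persist the aggregate data as JSON.
  require 'json'
  require 'yaml'

  aggregate = []
  Dir.glob('_posts/*').sort.each do |path|
    m = File.read(path).match(/\A---\s*\n(.*?)\n---/m) or next
    aggregate << (YAML.load(m[1]) || {}).merge('path' => path)
  end
  File.write('aggregate.json', JSON.pretty_generate(aggregate))

  # Stage 2: render pages one at a time; each page sees only its own
  # entry plus the shared aggregate, never a full site object.
  data = JSON.parse(File.read('aggregate.json'))
  data.each do |entry|
    puts "rendering #{entry['path']} (#{data.size} posts in aggregate)"
  end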


404pnf commented May 31, 2012

@plusjade I tried ruhoh; very promising! Especially when working with a large volume of posts, ruhoh tells you what it is currently doing. That information is comforting, because staring at a blank screen makes you think the program isn't working.

I do have some questions; I will post them as ruhoh issues.


Member

parkr commented Mar 19, 2013

We have a Drupal 7 migration script now! Look at the jekyll/jekyll-import repo.

We've planned out a possible iterative regeneration, but it won't come until after the 1.0.0 release.

parkr closed this Mar 19, 2013


404pnf commented Mar 20, 2013

Looking forward to testing the d7 migration script! Thank you!


davepeck commented Feb 18, 2016

Here's a case where I ran into this wall.

I've slowly been making moves to collect everything I post online under one (jekyll-managed) roof. As part of this, I exported my entire Twitter history and wrote a quick script to shred tweets into post files. HFS+ and Git handle these ~13,500 new files gracefully; Jekyll doesn't.

I'm not surprised I ran into this wall; I certainly didn't expect Jekyll to perform nicely here. Then again, perhaps an argument can be made that somewhere in the distant future, Jekyll really should handle this scale gracefully?


Member

parkr commented Feb 18, 2016

> perhaps an argument can be made that somewhere in the distant future, Jekyll really should handle this scale gracefully?

@davepeck We are working toward this goal; however, Jekyll will still have to process all 13,000 posts. Jekyll 3 does have an --incremental flag (jekyll build --incremental) to ease the pain after the initial build, but there are still some missing pieces. Known improvement points:

  • Read & write in parallel (disk I/O is expensive)
  • Build a dependency tree instead of naïvely rendering sequentially
  • Cache more aggressively & ensure we're only producing what we need (Ruby GC is very slow)

If you're willing to work on the above items or have further input, we can start a tracking issue for speeding up Jekyll if there isn't one already. Thanks!

jekyll locked and limited conversation to collaborators Feb 18, 2016
