This is my first open source contribution so if I missed anything please let me know and I'd be happy to correct it.
This commit comes from my use case where certain groups in my sitemap.rb take hours to run, even longer in development. This made testing sitemap generation an issue.
In many cases I would want to run everything but the very slow groups. In some cases I would want to test run only the groups I modified. My workaround was to comment out unwanted groups but since my sitemap.rb was several hundred lines it became cumbersome.
My solution is to add rake task options so that only the groups I want generated get run, e.g.
To skip the long-running groups:
rake sitemap:refresh:no_ping EXCLUDE=very_slow_group
To run only the modified groups:
rake sitemap:refresh:no_ping ONLY=newly_modified_group
Groups that are skipped are printed like:
! group "very_slow_group" not processed - not included in ONLY option
! group "very_slow_group" not processed - not included in EXCLUDE option
My implementation was to essentially piggyback off the VERBOSE option definition.
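For readers who want a feel for the filtering logic, here is a minimal sketch of how ONLY and EXCLUDE could be read from the environment to decide whether a group runs. It is not the code from this commit: `parse_group_list` and `process_group?` are hypothetical helper names, and accepting a comma-separated list of group names is an assumption (the examples above only show a single name).

```ruby
# Hypothetical helpers, not the gem's actual implementation.

# Split a value such as "stores, contents" into ["stores", "contents"].
# Comma-separated input is an assumption made for this sketch.
def parse_group_list(value)
  value.to_s.split(',').map(&:strip).reject(&:empty?)
end

# Returns true when the named group should be generated.
# `env` defaults to ENV but can be any hash-like object, which keeps the
# logic easy to exercise without touching the real environment.
def process_group?(name, env = ENV)
  only    = parse_group_list(env['ONLY'])
  exclude = parse_group_list(env['EXCLUDE'])
  name    = name.to_s

  return false if !only.empty? && !only.include?(name) # not listed in ONLY
  return false if exclude.include?(name)               # listed in EXCLUDE
  true
end

puts process_group?(:very_slow_group, 'EXCLUDE' => 'very_slow_group')
puts process_group?(:newly_modified_group, 'ONLY' => 'newly_modified_group')
```

When neither variable is set, every group is processed, which matches the existing behaviour of the rake tasks.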
Thank you and I appreciate any feedback you can give me.
Add `only` and `exclude` options to conditionally generate only speci…
This is exactly what I was looking for. My only concern is with the sitemap_index generated.
Let's say I have two groups called stores and contents, and I've previously generated two sitemaps for the stores group.
If I use the only option with the stores group, and the new set of URLs generates 3 sitemaps, what will happen to the sitemap index? Will it be updated with the 3 sitemaps associated with the stores group? Will it include the sitemaps associated with the contents group (which were not generated this time)?
Yeah, the kicker is what happens to the index. If you have a group which you don't want to run very often, but you still want included in the index, then you kinda need to detect and include the existing sitemap files (without regenerating them). This isn't outside the realm of possibility, since we would have all the info about file locations, naming conventions, etc. The group could cycle through its files and, if a file exists, add it to the index. This would be a useful feature but would have to be coded in a way that is easy to use. Since it's so easy to use ENV variables in Ruby, I'm reluctant to add much built-in support for special variables unless they are very useful and take normal use cases into account.
I think I have an easier solution which uses multiple sitemap configs rather than conditional code execution. I'll post more about it, but the gist of it is to generate all your slow groups into a directory of their own, and have them in their own configuration, with create_index = false, so no index is written for them.
Then in your "regular" config you generate your sitemaps as per usual and then add each of the files from the slow group's directory to the index manually using the add_to_index() method. You can add the files by iterating over them and calling that method for each one, so it is dynamic and will add whatever files happen to be there.
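A rough sketch of the "iterate over whatever files happen to be there" step, kept independent of the gem so it runs on its own. `existing_sitemap_urls` is a hypothetical helper, and the `*.xml.gz` pattern, the directory layout and the example.com URL are all assumptions for illustration. Inside a real config, each value would be handed to add_to_index(); check the gem's documentation for the exact form of link it expects.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper: lists the sitemap files that already exist in `dir`
# and maps each one to the URL it is assumed to be served from.
def existing_sitemap_urls(dir, base_url)
  Dir.glob(File.join(dir, '*.xml.gz')).sort.map do |path|
    "#{base_url.chomp('/')}/#{File.basename(path)}"
  end
end

# Demo: a temporary directory stands in for the slow groups' output directory.
Dir.mktmpdir do |dir|
  %w[slow_group.xml.gz slower_group.xml.gz].each do |name|
    FileUtils.touch(File.join(dir, name))
  end

  existing_sitemap_urls(dir, 'http://example.com/sitemaps/slow').each do |url|
    # In the "regular" config this is where add_to_index would be called.
    puts url
  end
end
```

Because the list is built from the directory contents at run time, the index picks up however many files the slow config last produced without regenerating them.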
Interesting solution, and I think it could be generalized to all groups (not just the slow ones you mentioned above). Tell me if I'm wrong, but if you set create_index = false for all groups, and then use add_to_index() manually for all groups (iterating the same way you mentioned), you can regenerate any group you want (even using the only and exclude options), and the sitemap index would be built from scratch every time.
@sivicencio that seems about right. However, the create_index option doesn't really have any effect within groups per se. By that I mean that groups don't control index creation, but they do participate in it by having their files added to the index. Unfortunately, whether or not the index is actually written out, links are added to it as if it were going to be written, and only at the end do we decide whether to write it. So if you were to use add_to_index() to add a bunch of links, and then used a group() which created a bunch of files, I think those links would be added again, so there would be dupes. The reason I did it that way is that I needed to know what was generated in order to print out the summary lines, and the index is where all that info resides... so a bit of a poor design choice, but not the end of the world. What I see one could do is something like the following. You break up the index and group creation into separate create() blocks:
# Generate groups prior to index
SitemapGenerator::Sitemap.create(:create_index => false) do
  group(:filename => 'slow_group') do
    # add the slow group's links here
  end
  group(:filename => 'slower_group') do
    # add the slower group's links here
  end
end
# loop through files like 'slow_group', 'slower_group'