Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sitemaps with less than 50,000 urls still use sitemap index #60

Closed
ehoch opened this issue Jan 6, 2012 · 18 comments
Closed

Sitemaps with less than 50,000 urls still use sitemap index #60

ehoch opened this issue Jan 6, 2012 · 18 comments

Comments

@ehoch
Copy link

ehoch commented Jan 6, 2012

When I create a sitemap with less than 50,000 urls, it should not use a sitemap index file. I personally run multiple sitemaps for my sites, including a small news-only one, and the idea of a sitemap index file for a sitemap with 20 urls seems silly. Other than that I love this project and think it will be perfect.

Could anyone spearhead fixing? Willing to donate for this!

@kjvarga
Copy link
Owner

kjvarga commented Jan 8, 2012

I read up on sitemap submissions and it seems like this is okay to do. So I can add an option for it. Something like SitemapGenerator::Sitemap.create_index = false. Seems pretty easy to do.

@ehoch
Copy link
Author

ehoch commented Jan 8, 2012

kjvarga,

The (or at least my) preferred way would be to default to no sitemap index, unless you have more than 50,000 pages, then it automatically creates an index (like dynamic_sitemaps)

Of course you could always add a SitemapGenerator::Sitemap.force_index = true to keep the old format for users that prefer.

You could even allow users to specify less than 50,000 the pagination like dynamic_sitemaps does.

@kjvarga
Copy link
Owner

kjvarga commented Jan 10, 2012

I'm not a fan of making this the default mainly because you would have to modify your robots.txt file to point to either the sitemap or the sitemap index depending on how many links you have. Most webmasters I think simply don't care if there is the extra file when they have < 50 000 urls and if you have dynamic numbers of links it's quite feasible to go back and forth below and above 50000 urls which means you would have to constantly change your robots.txt file. So for those reasons I would like to keep it as an option.

@ehoch
Copy link
Author

ehoch commented Jan 10, 2012

I guess I forgot about the very important aspect of legacy support. I was thinking that it would just be /sitemap.xml.gz as the sitemap if <= 50,000 pages and /sitemap.xml.gz would be the index if it was more than 50,000.

You could keep the _index.html as default for legacy support...

@vishaldpatel
Copy link

Heya,
It would be great to have this functionality:
SitemapGenerator::Sitemap.create_index = false

Thank you for creating such a great plugin, btw =).

Best,

  • V

@troex
Copy link

troex commented Apr 12, 2012

I also would like option to not create index files

@kjvarga
Copy link
Owner

kjvarga commented Apr 15, 2012

Ok I'll get to this hopefully this week. I'm thinking:

SitemapGenerator::Sitemap.create_index = (:auto, :always, :never, true, false)

:always and true are the same thing. :auto means create as needed. :never suppresses index creation altogether. I'm just not decided on what false should be. I think it should be the same as :auto i.e. don't create until needed.

@ehoch
Copy link
Author

ehoch commented Apr 16, 2012

Sure, that works for me. Maybe all you really need is?

true - Forces an index (current / default)
false - Force no index. Even if it exceeds 50,000 pages.
auto - Create an index only if needed. Probably my description above where sitemap.xml becomes the index and sitemap1.xml becomes the first sitemap...

@ehoch
Copy link
Author

ehoch commented Sep 14, 2012

Just pinging here to see if there's any update...

Also, thanks for taking my PR on autoplay (and cleaning it up). All warnings gone for me...

kjvarga added a commit that referenced this issue Oct 3, 2012
- Tweak finalisation: 
- add a check that we don't finalise an already finalised sitemap in the SitemapIndexFile
- In LinkSet don't try to add default links to a finalised sitemap.
@kjvarga
Copy link
Owner

kjvarga commented Oct 3, 2012

Hey Eric, I've released the feature in v3.3. Thanks again for generously paying for this feature! I'm sure everyone that uses it will appreciate that. I've added a shout-out to you on the readme.

And I'm glad that you are enjoying using the gem!

The naming convention hasn't changed. So the first sitemap will be sitemap1.xml.gz. However, I could add a simple namer that will generate sitemap.xml.gz, sitemap1.xml.gz etc but I wanted to get the main stuff out first.

@kjvarga kjvarga closed this as completed Oct 3, 2012
@ehoch
Copy link
Author

ehoch commented Oct 3, 2012

Soo... you're going to hate me. But now I'm going into client mode. This wasn't 100% how I envisioned and doesn't quite accomplish my goals due to the naming.

I need a static place that I can always point my robots / 302 (which I use on my root to point to where they are on my s3 bucket) to where I don't need to worry if I'm above or below 50k.

So I envisioned create_sitemap = :auto to create a "sitemap.xml.gz" for its first sitemap. If we get more than 50k, that sitemap.xml.gz would become the index and point to the sitemap[1-9]+.xml.gz and so on...

The problem with status quo is I still need to know where I'm pointing, /sitemap1.xml.gz or /sitemap_index.xml.gz

@kjvarga
Copy link
Owner

kjvarga commented Oct 4, 2012

Damn, you had to go there :/ Yeah that bit is more tricky. But I'm glad that I broke them into separate chunks because they are conceptually different features. Thinking about it I really like the idea though. In fact I would like it to be the default for the next major version. That way you always point to sitemap.xml.gz, no matter how many links your site has.

I can see a way to do it, but it is tricky because of the way things happen now. The index has to be written out last. And currently each sitemap is written out as it becomes full so that it doesn't occupy memory. What would have to happen is we need to delay writing out the first sitemap until we know with certainty whether there will be two sitemaps (or more, but two is enough to trigger the index). Then there's the issue that the index and the sitemaps will share the same namer instance. So I'm thinking there needs to be another configuration option that indicates that this naming convention is being used (any ideas of what to call this convention? 'zero-based' or 'simple' is all that comes to mind). However, whether or not the special naming convention is being used it doesn't hurt to adopt the same logic for all cases, which simplifies things a bit. It just means that the first sitemap sits in memory for a teeny bit longer.

So I think the sitemap index will need to keep an object reference to the sitemaps added to it, at least the first one. When those sitemaps become full they are not automatically written out. They are 'finalized' and added to the index but not written out. When the index has more than one sitemap, it reserves the first name, assigns the other names to the added sitemaps and writes them out. From then on new sitemaps get names and are written out when full as per usual. Then when the index is finalized there needs to be some handling of the different values for create_index.

So there's keeping track of sitemaps added to the index and keeping track of whether a name has been reserved. As well as delaying writing out of some sitemaps, so some tracking of whether writing has occurred. And a new namer instance to handle the new sequence.

@ehoch
Copy link
Author

ehoch commented Oct 6, 2012

Okay, so this is sounding like a 4.0 release. I think :simple could work. Since it's looking like it's not happening until the next major release, you could also just make this the new default and put a massive bold warning in the readme. That way, doesn't matter what the name of the configuration option is. No one will ever be specifying it...

And semver clearly states you're allow to do this...

I guess just give me an estimate on when you think you'll be able to do this. And hope I didn't ruin your vacation...

@kjvarga
Copy link
Owner

kjvarga commented Oct 11, 2012

Hey man, i've made some progress on this already. I think I'll be done in a week or so, haven't got too much on the go. I'll probably get out a prerelease and then we can see where we're at. Adding unit tests will take a bit longer.

@ehoch
Copy link
Author

ehoch commented Oct 12, 2012

Can't wait to see it. Just holding off on some of my multi-sitemaps until it's released. Thanks again!

@kjvarga
Copy link
Owner

kjvarga commented Oct 16, 2012

Okay try out 4.0.alpha.

You'll need these two config settings:

SitemapGenerator::Sitemap.create_index = :auto
SitemapGenerator::Sitemap.sitemaps_namer = SitemapGenerator::Sitemap.sitemap_index_namer = SitemapGenerator::SimpleNamer.new(:sitemap)

From my testing it seems to work as it should! I've not made these the defaults yet, just want to get it out and get some more people trying it out.

@ehoch
Copy link
Author

ehoch commented Oct 16, 2012

It works! Beautifully!!

@kjvarga
Copy link
Owner

kjvarga commented May 2, 2013

Hi guys, the 4.0 release is finally out! It was more work than I anticipated making this change official. I also want to feel confident that it was working correctly and that I had covered all the cases. I think I have and it feels like it does the intuitive thing.

So @ehoch you can remove the create_index and namer lines from your config and things should be working the same for you now, since that is the new default behaviour. The thing I like most about this is that it's clearer for everyone: point search engines to sitemap.xml.gz in all cases. Doesn't matter how many links you have. And now it is even more standards compliant because that is the filename that search engines expect. So win-win.

Also the ability to turn off index creation opens up some opportunities for more complex setups where one wants to generate sitemaps without affecting the index. The index can even be built "manually" by adding links to it using add_to_index and no sitemaps need to be written even. So one could separate the two tasks or do a hybrid setup.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants