This repository has been archived by the owner on Nov 18, 2021. It is now read-only.

GitHub prevents crawling of repository's Wiki pages - no Google search #1683

Open
kevinjwalters opened this issue Oct 27, 2019 · 18 comments

Comments

@kevinjwalters

GitHub currently has a robots.txt which prevents crawling of the paths associated with the Wiki area of every repository. This is explicit and looks very intentional. I asked about this (19-Oct-2019) and got no response; the ticket number is 430217.

I've attached the current (27-Oct-2019) robots.txt file.

github.com.robots.20191027.txt

The gist of it:

Allow: /*/*/tree/master
Allow: /*/*/blob/master
...
Disallow: /*/*/wiki/*/*

I would like this to change to make the Wiki areas searchable using popular search engines.
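For anyone who wants to check the live rules themselves, here's a quick sketch in Python (the only assumptions are the hostname and the "wiki" keyword filter) that downloads robots.txt and prints the user-agent and wiki-related lines:

from urllib.request import urlopen

# Fetch GitHub's robots.txt and print the user-agent lines plus any rule
# mentioning "wiki".
with urlopen("https://github.com/robots.txt") as resp:
    robots = resp.read().decode("utf-8")

for line in robots.splitlines():
    lowered = line.lower()
    if lowered.startswith("user-agent") or "wiki" in lowered:
        print(line)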

@kevinjwalters
Author

Got an update from GH today,

I'm not quite sure of the reason behind excluding wiki from Google's index but I'll pass your request
onto the team to consider.

I can't promise if or when it will be implemented, but your feedback is definitely in the right hands!

@gwlucastrig

I urge the GitHub team to remove this restriction on robots scanning GitHub wiki pages.

I put a great deal of effort into providing wiki pages that would assist users of my open source software. I also hoped that they would help potential users find the project by providing meaningful content related to the problems my software addresses. The fact that Google cannot index my pages seriously limits the effectiveness of that content.

For example, I've written an article on Natural Neighbor Interpolation, which is a function my software supports. It's a specialty topic and the information I supply is not well-covered elsewhere. Enough people have linked to my article that if you run a Google search on "Natural Neighbor Interpolation" my wiki page comes up as the fourth item in the search. But, disappointingly, the description line on Google's search page reads "No information is available for this page".

Therefore I respectfully request that Github reconsider its position on restricting web crawlers from indexing wiki pages.

@ann0see

ann0see commented Sep 22, 2020

Same problem here. Any updates on GitHub removing this entry? It does more harm than good.

@reallyuniquename

Still blocked from crawlers.

@EdVassie

Please can GitHub remove the restriction on Google etc. crawling Wiki pages. I want my Wiki to be seen!

@ann0see

ann0see commented Jan 20, 2021

GitHub should remove this entry in the robots.txt file and let the repo owner decide. The default setting for a wiki page could be "noindex, nofollow" set in a meta tag, but it should be possible to unset it.
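A minimal sketch of that suggestion, assuming a hypothetical per-repository wiki_indexable setting (nothing like this exists on GitHub today):

def robots_meta_tag(wiki_indexable: bool) -> str:
    # Default stays conservative; the repo owner would have to opt in to indexing.
    directives = "index, follow" if wiki_indexable else "noindex, nofollow"
    return f'<meta name="robots" content="{directives}">'

print(robots_meta_tag(False))  # <meta name="robots" content="noindex, nofollow">
print(robots_meta_tag(True))   # <meta name="robots" content="index, follow">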

@kevinjwalters
Author

They appear to have shifted to a custom crawling process. First two lines of current robots.txt:

# If you would like to crawl GitHub contact us via https://support.github.com/contact/
# We also provide an extensive API: https://developer.github.com/

Google and DuckDuckGo still aren't indexing GitHub wiki pages. I found another obscure search engine called Bing, which also gives no results for wiki pages. I've not had any further updates from GitHub. I'll prod them again to see why they've ignored this request and persist in permitting only a partial crawl of GitHub.

For reference: github.com.robots.20210120.txt

@kevinjwalters
Author

kevinjwalters commented Jan 20, 2021

I've just put in a new support ticket asking GitHub to review this fiasco, and I mentioned this issue for detail and in support of the fix.

@kevinjwalters
Author

GitHub support says:

According to our SEO and engineering teams, we originally blocked /wiki in January 2012 to address spam and any risks from wikis being open to anyone adding content. (When wikis were first introduced the default settings meant that anyone could edit them, whether they were a collaborator on the repository or not.)

Some pages had slipped through since it wasn’t written with a proper wildcard (*). That was fixed in May 2020 blocking all /wiki/ directories.

I’m afraid this is a deliberate decision, and it is not likely to be reversed due to the risk of wikis being used for spammy purposes.

So sorry about that; I completely understand why this could be a blocker.

Although it's unlikely to be unblocked, I am forwarding your ticket to the Product team to record your request for this change. They read and evaluate all feedback, however we cannot guarantee a response to every submission.

Kevin responds:

It's still not clear to me why you wouldn't allow the wiki areas which are not publicly editable to be available via Google Search and the like. I've not looked into this, but I'd imagine that's trivial to do by allowing a full crawl of /wiki and then putting the appropriate indexing hints into HTTP response headers or HTML to restrict indexing for wiki areas based on the repository configuration.
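Something along these lines, purely as a sketch (wiki_publicly_editable is a made-up stand-in for however GitHub stores that setting):

def wiki_response_headers(wiki_publicly_editable: bool) -> dict:
    # Serve every wiki page to crawlers, but ask search engines not to index
    # the wikis that anyone can edit.
    headers = {"Content-Type": "text/html; charset=utf-8"}
    if wiki_publicly_editable:
        headers["X-Robots-Tag"] = "noindex, nofollow"
    return headers

print(wiki_response_headers(True))   # open-to-anyone wiki: noindex hint attached
print(wiki_response_headers(False))  # edit-restricted wiki: indexable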

@sergiomb2

Wikis not being crawlable is nonsense.

@nelsonjchen

The comma.ai community put a lot of work into the FAQ and many other pages. It's a bummer that it isn't indexed. I'm sure a few other projects have similar wikis with lots of content in them that are pretty much invisible.

Maybe a warning should be added to the Wiki functionality, noting that content in Wikis is generally invisible to search engines.

@EdVassie

The suggestion that a 'closed' Wiki that does not allow public edits should be eligible to be crawled sounds sensible to me. This would stop people spamming GitHub, and would allow each project to decide whether to make its Wiki searchable.

In any event, if someone wanted to spam GitHub, most projects allow issues to be raised. The argument that Wikis must not be crawled in order to stop spamming is a bit thin, because Issues could just as easily be used as a vector for spam or trolling.

Please allow projects to make their Wiki crawlable.

@jstavats

A wiki's whole purpose is sharing useful information, and that purpose is defeated if its content cannot reach as wide an audience as its creators intend. Sure, there should be a way to have "private" wikis, but there should also be a way to have public ones. Otherwise projects will use other services to host such things (which I've seen in the past and not understood until now).

Setting non-crawlable as a default seems reasonable, but not allowing projects to choose otherwise does not. Please reconsider.

@nelsonjchen

nelsonjchen commented Apr 4, 2021

I think the URLs are visible to Google and other search engines. When I search for terms that match a URL, the matching parts of the URL are bolded and the page does come up in the results. I'm not sure whether the page content is used, though.

If you've ever searched for something that exists on StackOverflow, you may have noticed mirrors of StackOverflow content also ranking highly. I don't particularly like these operations, but maybe what they're doing can help here.

I hastily made this service to try to get the comma.ai openpilot wiki content indexed:

https://github-wiki-see.page/m/commaai/openpilot/wiki

It's quite sloppy, but it should work for other wikis too if a relevant link is placed in a crawlable place. I'm no SEO expert, so this experiment may very well crater, but I figured I'd try something for not a lot of money. I doubt it'll rank highly since there are no links to it and it is in no way canonical.

I've also made some PRs (as you can see in the issue reference alerts) to update the GitHub documentation. In them, I've suggested recommending that users who want crawlable content that also accepts public contributions set up a GitHub Pages site backed by a public repository. To be honest though, that setup is kind of a pain in the ass for all parties, and we're all lazy bastards.

@nelsonjchen

nelsonjchen commented Jun 6, 2021

💸

I ran this big boy of a query in BigQuery as part of my project to generate sitemaps for my workaround:

#standardSQL
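-- GollumEvent is the GitHub Archive event type for wiki page creates/edits;
-- the JS UDF below pulls each touched page's html_url out of the event payload.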
CREATE TEMPORARY FUNCTION
  parsePayload(payload STRING)
  RETURNS ARRAY<STRING>
  LANGUAGE js AS """ try { return JSON.parse(payload).pages.reduce((a,
      s) => {a.push(s.html_url); return a},
    []); } catch (e) { return []; } """;
SELECT
  *
FROM (
  WITH
    parsed_payloads AS (
    SELECT
      parsePayload(payload) AS html_urls,
      created_at
    FROM
      `githubarchive.month.*` 
    WHERE type = "GollumEvent")
  SELECT
    DISTINCT html_url,
    created_at,
    ROW_NUMBER() OVER(PARTITION BY html_url ORDER BY created_at DESC) AS rn
  FROM
    parsed_payloads
  CROSS JOIN
    UNNEST(parsed_payloads.html_urls) AS html_url)
WHERE
  rn = 1
  AND html_url NOT LIKE "%/wiki/Home"
  AND html_url NOT LIKE "%/wiki/_Sidebar"
  AND html_url NOT LIKE "%/wiki/_Footer"
  AND html_url NOT LIKE "%/wiki/_Header"

$45 poorer, I had a list of 4,566,331 wiki pages that have been touched over the last decade, excluding Home and the sidebar/footer/header trimmings. That's a lot of content being excluded by robots.txt!

I've saved the results into the publicly accessible github-wiki-see.show.touched_wiki_pages_upto_202106 table if anyone else wants a gander. It's a small ~500MB dataset compared to the $45 worth of 9TB I had BQ crunch through.

I've also been using the litmus tests of "openpilot wiki nissan" and "openpilot wiki nissan leaf" to see what search engines do about GitHub wikis. If the terms are in the URL, a result does show up:

[screenshots: Google and DuckDuckGo results where the search terms appear in the wiki page URL]

If you search for "openpilot wiki nissan leaf", though, no results show up in Google. As a side note, my GHWSEE tool does show up in DDG/Bing 😄:

[screenshot: GHWSEE page appearing in DuckDuckGo/Bing results]

I think search engines don't index the page content when robots.txt excludes it, but they do index the URL components.

@nelsonjchen

I've since produced a new BigQuery table, and a new bundle of sitemaps from it, that checks all the links and only includes pages returning 200: github-wiki-see.show.checked_touched_wiki_pages_upto_202106. There are 2,090,792 pages returning 200.
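For the curious, the sitemap bundling step is roughly this shape (a sketch, not the actual GHWSEE code; urls.txt stands in for the exported BigQuery results, and 50,000 URLs is the sitemap protocol's per-file cap):

from itertools import islice
from xml.sax.saxutils import escape

SITEMAP_LIMIT = 50_000  # per-file limit in the sitemap protocol

def chunked(iterable, size):
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

# urls.txt: one wiki page URL per line, exported from the BigQuery table above.
with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

for i, chunk in enumerate(chunked(urls, SITEMAP_LIMIT)):
    with open(f"sitemap-{i:04d}.xml", "w") as out:
        out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        out.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for url in chunk:
            out.write(f"  <url><loc>{escape(url)}</loc></url>\n")
        out.write("</urlset>\n")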

@nelsonjchen


FWIW, regarding the support response quoted above: I've made my mirroring tool append the attribute rel="nofollow ugc" to any links going outside of GitHub. Maybe GitHub could do something like this if they decide to change their minds.
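The effect is roughly this (a sketch, not the actual GHWSEE implementation; it assumes BeautifulSoup for the HTML rewriting):

from urllib.parse import urlparse
from bs4 import BeautifulSoup

def mark_external_links(html: str) -> str:
    # Tag every link that leaves github.com as untrusted user-generated content.
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a", href=True):
        host = urlparse(a["href"]).netloc
        if host and host != "github.com" and not host.endswith(".github.com"):
            a["rel"] = ["nofollow", "ugc"]
    return str(soup)

print(mark_external_links('<a href="https://example.com/">example</a>'))
# <a href="https://example.com/" rel="nofollow ugc">example</a>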

@nelsonjchen

It turns out they already attach rel="nofollow" to external links but not rel="nofollow ugc".
