This repository has been archived by the owner on Nov 18, 2021. It is now read-only.

GitHub prevents crawling of repository's Wiki pages - no Google search #1683

Open
kevinjwalters opened this issue Oct 27, 2019 · 18 comments

Comments

@kevinjwalters

GitHub currently has a robots.txt which prevents crawling of the paths associated with the Wiki area of every repository. This is explicit and looks very intentional. I asked about this (19-Oct-2019) and got no response; the ticket number is 430217.

I've attached the current (27-Oct-2019) robots.txt file.

github.com.robots.20191027.txt

The gist of it:

Allow: /*/*/tree/master
Allow: /*/*/blob/master
...
Disallow: /*/*/wiki/*/*

I would like this to change to make the Wiki areas searchable using popular search engines.
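For anyone who wants to check the live rules themselves, here's a quick sketch in Python (the only assumptions are the hostname and the "wiki" keyword filter) that downloads robots.txt and prints the user-agent and wiki-related lines:

from urllib.request import urlopen

# Fetch GitHub's robots.txt and print the user-agent lines plus any rule
# mentioning "wiki".
with urlopen("https://github.com/robots.txt") as resp:
    robots = resp.read().decode("utf-8")

for line in robots.splitlines():
    lowered = line.lower()
    if lowered.startswith("user-agent") or "wiki" in lowered:
        print(line)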

@kevinjwalters
Author

Got an update from GH today,

I'm not quite sure of the reason behind excluding wiki from Google's index but I'll pass your request
onto the team to consider.

I can't promise if or when it will be implemented, but your feedback is definitely in the right hands!

@gwlucastrig

I urge the GitHub team to remove this restriction on robots scanning GitHub wiki pages.

I put a great deal of effort into providing wiki pages that would assist users of my open source software. I also hoped that they would help potential users find the project by providing meaningful content related to the problems my software addresses. The fact that Google cannot index my pages seriously limits the effectiveness of that content.

For example, I've written an article on Natural Neighbor Interpolation, which is a function my software supports. It's a specialty topic and the information I supply is not well-covered elsewhere. Enough people have linked to my article that if you run a Google search on "Natural Neighbor Interpolation" my wiki page comes up as the fourth item in the search. But, disappointingly, the description line on Google's search page reads "No information is available for this page".

Therefore I respectfully request that Github reconsider its position on restricting web crawlers from indexing wiki pages.

@ann0see

ann0see commented Sep 22, 2020

Same problem here. Any updates on GitHub removing this entry? It does more harm than good.

@reallyuniquename

Still blocked from crawlers.

@EdVassie

Please can GitHub remove the restriction on Google etc. crawling Wiki pages. I want my Wiki to be seen!

@ann0see

ann0see commented Jan 20, 2021

GitHub should remove this entry in the robots.txt file and let the repo owner decide. The default setting for a wiki page could be "noindex, nofollow" set in a meta tag, but it should be possible to unset it.
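A minimal sketch of that suggestion, assuming a hypothetical per-repository wiki_indexable setting (nothing like this exists on GitHub today):

def robots_meta_tag(wiki_indexable: bool) -> str:
    # Default stays conservative; the repo owner would have to opt in to indexing.
    directives = "index, follow" if wiki_indexable else "noindex, nofollow"
    return f'<meta name="robots" content="{directives}">'

print(robots_meta_tag(False))  # <meta name="robots" content="noindex, nofollow">
print(robots_meta_tag(True))   # <meta name="robots" content="index, follow">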

@kevinjwalters
Author

They appear to have shifted to a custom crawling process. First two lines of current robots.txt:

# If you would like to crawl GitHub contact us via https://support.github.com/contact/
# We also provide an extensive API: https://developer.github.com/

Google and DuckDuckGo still aren't indexing GitHub wiki pages. I found another obscure search engine called Bing, which also gives no results for wiki pages. I've not had any further updates from GitHub. I'll prod them again to see why they've ignored this request and persist in permitting only a partial crawl of GitHub.

For reference: github.com.robots.20210120.txt

@kevinjwalters
Author

kevinjwalters commented Jan 20, 2021

I've just put in a new support ticket asking GitHub to review this fiasco, and I mentioned this issue for detail and in support of the fix.

@kevinjwalters
Author

GitHub support says:

According to our SEO and engineering teams, we originally blocked /wiki in January 2012 to address spam and any risks from wikis being open to anyone adding content. (When wikis were first introduced the default settings meant that anyone could edit them, whether they were a collaborator on the repository or not.)

Some pages had slipped through since it wasn’t written with a proper wildcard (*). That was fixed in May 2020 blocking all /wiki/ directories.

I’m afraid this is a deliberate decision, and it is not likely to be reversed due to the risk of wikis being used for spammy purposes.

So sorry about that; I completely understand why this could be a blocker.

Although it's unlikely to be unblocked, I am forwarding your ticket to the Product team to record your request for this change. They read and evaluate all feedback, however we cannot guarantee a response to every submission.

Kevin responds:

It's still not clear to me why you wouldn't allow the wiki areas which are not publicly editable to be available via Google Search and the like. I've not looked into this, but I'd imagine that's trivial to do by allowing a full crawl of /wiki and then putting the appropriate indexing hints into HTTP response headers or HTML to restrict indexing for wiki areas based on the repository configuration.
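Something along these lines, purely as a sketch (wiki_publicly_editable is a made-up stand-in for however GitHub stores that setting):

def wiki_response_headers(wiki_publicly_editable: bool) -> dict:
    # Serve every wiki page to crawlers, but ask search engines not to index
    # the wikis that anyone can edit.
    headers = {"Content-Type": "text/html; charset=utf-8"}
    if wiki_publicly_editable:
        headers["X-Robots-Tag"] = "noindex, nofollow"
    return headers

print(wiki_response_headers(True))   # open-to-anyone wiki: noindex hint attached
print(wiki_response_headers(False))  # edit-restricted wiki: indexable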

@sergiomb2

Wikis not being crawlable is nonsense.

@nelsonjchen

The comma.ai community put a lot of work into the FAQ and many other pages. It's a bummer that it isn't indexed. I'm sure a few other projects have similar wikis with lots of content in them that are pretty much invisible.

Maybe a warning should be added to the Wiki functionality, noting that content in Wikis is generally invisible to search engines.

@EdVassie

The suggestion that a 'closed' Wiki that does not allow public edits should be eligible to be crawled sounds sensible to me. This would stop people spamming GitHub, and would allow each project to decide whether to make its Wiki searchable.

In any event, if someone wanted to spam GitHub, most projects allow issues to be raised. The argument that Wikis must not be crawled in order to stop spamming is a bit thin, because Issues could just as easily be used as a vector for spam or trolling.

Please allow projects to make their Wiki crawlable.

@jstavats

A wiki's whole purpose is sharing useful information, and that purpose is defeated if its content cannot reach as wide an audience as its creators intend. Sure, there should be a way to have "private" wikis, but there should also be a way to have public ones. Otherwise projects will use other services to host such things (which I've seen in the past and not understood until now).

Setting non-crawlable as a default seems reasonable, but not allowing projects to choose otherwise does not. Please reconsider.

@nelsonjchen

nelsonjchen commented Apr 4, 2021

I think the URLs are visible to Google and other search engines. When I search for terms that match a URL, the matching parts of the URL are bolded and the page does come up in the results. I'm not sure whether the page content is used, though.

If you've ever searched for something that exists on StackOverflow, you may have noticed mirrors of StackOverflow content also ranking highly. I don't particularly like these operations, but maybe what they're doing can help here.

I hastily made this service to try to get the comma.ai openpilot wiki content indexed:

https://github-wiki-see.page/m/commaai/openpilot/wiki

It's quite sloppy, but it should work for other wikis too if a relevant link is placed in a crawlable place. I'm no SEO expert, so this experiment may very well crater, but I figured I'd try something for not a lot of money. I doubt it'll rank highly since there are no links to it and it is in no way canonical.

I've also made some PRs (as you can see in the issue reference alerts) to update the GitHub documentation. In them, I've suggested recommending that users who want crawlable content that also accepts public contributions set up a GitHub Pages site backed by a public repository. To be honest though, that setup is kind of a pain in the ass for all parties, and we're all lazy bastards.

@nelsonjchen

nelsonjchen commented Jun 6, 2021

💸

I ran this big boy of a query in BigQuery as part of my project to generate sitemaps for my workaround:

#standardSQL
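-- GollumEvent is the GitHub Archive event type for wiki page creates/edits;
-- the JS UDF below pulls each touched page's html_url out of the event payload.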
CREATE TEMPORARY FUNCTION
  parsePayload(payload STRING)
  RETURNS ARRAY<STRING>
  LANGUAGE js AS """ try { return JSON.parse(payload).pages.reduce((a,
      s) => {a.push(s.html_url); return a},
    []); } catch (e) { return []; } """;
SELECT
  *
FROM (
  WITH
    parsed_payloads AS (
    SELECT
      parsePayload(payload) AS html_urls,
      created_at
    FROM
      `githubarchive.month.*` 
    WHERE type = "GollumEvent")
  SELECT
    DISTINCT html_url,
    created_at,
    ROW_NUMBER() OVER(PARTITION BY html_url ORDER BY created_at DESC) AS rn
  FROM
    parsed_payloads
  CROSS JOIN
    UNNEST(parsed_payloads.html_urls) AS html_url)
WHERE
  rn = 1
  AND html_url NOT LIKE "%/wiki/Home"
  AND html_url NOT LIKE "%/wiki/_Sidebar"
  AND html_url NOT LIKE "%/wiki/_Footer"
  AND html_url NOT LIKE "%/wiki/_Header"

$45 poorer, I had a list of 4,566,331 wiki pages that have been touched over the last decade, excluding Home and the sidebar/footer/header trimmings. That's a lot of content being excluded by robots.txt!

I've saved the results into the publicly accessible github-wiki-see.show.touched_wiki_pages_upto_202106 table if anyone else wants a gander. It's a small ~500MB dataset compared to the $45 worth of 9TB I had BQ crunch through.

I've also been using the litmus tests of "openpilot wiki nissan" and "openpilot wiki nissan leaf" to see what search engines do about GitHub wikis. If the terms are in the URL, a result does show up:

[screenshots: Google and DuckDuckGo results where the search terms appear in the wiki page URL]

If you search for "openpilot wiki nissan leaf", though, no results show up in Google. As a side note, my GHWSEE tool does show up in DDG/Bing 😄:

[screenshot: GHWSEE page appearing in DuckDuckGo/Bing results]

I think search engines don't index the page content when robots.txt excludes it, but they do index the URL components.

@nelsonjchen

I've since produced a new BigQuery table, and a new bundle of sitemaps from it, that checks all the links and only includes pages returning 200: github-wiki-see.show.checked_touched_wiki_pages_upto_202106. There are 2,090,792 pages returning 200.
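For the curious, the sitemap bundling step is roughly this shape (a sketch, not the actual GHWSEE code; urls.txt stands in for the exported BigQuery results, and 50,000 URLs is the sitemap protocol's per-file cap):

from itertools import islice
from xml.sax.saxutils import escape

SITEMAP_LIMIT = 50_000  # per-file limit in the sitemap protocol

def chunked(iterable, size):
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

# urls.txt: one wiki page URL per line, exported from the BigQuery table above.
with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

for i, chunk in enumerate(chunked(urls, SITEMAP_LIMIT)):
    with open(f"sitemap-{i:04d}.xml", "w") as out:
        out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        out.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for url in chunk:
            out.write(f"  <url><loc>{escape(url)}</loc></url>\n")
        out.write("</urlset>\n")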

@nelsonjchen


FWIW, regarding the support response quoted above: I've made my mirroring tool append the attribute rel="nofollow ugc" to any links going outside of GitHub. Maybe GitHub could do something like this if they decide to change their minds.
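The effect is roughly this (a sketch, not the actual GHWSEE implementation; it assumes BeautifulSoup for the HTML rewriting):

from urllib.parse import urlparse
from bs4 import BeautifulSoup

def mark_external_links(html: str) -> str:
    # Tag every link that leaves github.com as untrusted user-generated content.
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a", href=True):
        host = urlparse(a["href"]).netloc
        if host and host != "github.com" and not host.endswith(".github.com"):
            a["rel"] = ["nofollow", "ugc"]
    return str(soup)

print(mark_external_links('<a href="https://example.com/">example</a>'))
# <a href="https://example.com/" rel="nofollow ugc">example</a>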

@nelsonjchen

It turns out they already attach rel="nofollow" to external links but not rel="nofollow ugc".
