Content is not indexed despite opt-in #22452

Open
v-p-b opened this issue Dec 18, 2022 · 11 comments
Labels
bug Something isn't working

Comments

@v-p-b

v-p-b commented Dec 18, 2022

Steps to reproduce the problem

  1. Opt-in for search engine indexing
  2. Wait
  3. Search for shared content in search engines
  4. Be disappointed :,(

Expected behaviour

Posts are indexed by search engines

Actual behaviour

Posts are (mostly) not indexed by search engines

Detailed description

I really want my posts to be discoverable, and while I know about Mastodon's stance on search inside the platform, having search giants do the work seems like a good option.

I opted in to search engine indexing the minute I moved to my current instance (I'm @buherator@infosec.exchange), and I even created a website that points to my posts and fed it to Google for scraping via Search Console.

Still, I can't look up my content in Google (or Bing for that matter).

I see a chance that it's just Google not handling Mastodon posts correctly: on post URLs, the content is only present in META tags, and the visible content is dynamically generated with JavaScript, so this is probably something that needs to be handled on the search engine's side. However, some content from my instance is indexed by Google, see https://www.google.com/search?q=site%3Ainfosec.exchange+twitter .

This is a post of mine:

https://infosec.exchange/@buherator/109535230739398168

I can see no noindex attributes here, but I'm not sure if any other tags can scare crawlers away.
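One way to make this check repeatable is to scan a page's static HTML for robots directives. The following is a minimal sketch using only the Python standard library; the class name, the helper name, and the two HTML snippets are hypothetical stand-ins, not taken from Mastodon. It only inspects meta tags, so a noindex sent through an X-Robots-Tag HTTP header would not be detected.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> style tags."""

    # Meta names checked here; "googlebot" is the Google-specific variant.
    ROBOT_NAMES = {"robots", "googlebot"}

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also end up here.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() in self.ROBOT_NAMES:
            content = attr.get("content") or ""
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def blocks_indexing(html: str) -> bool:
    """Return True if the static HTML asks crawlers not to index the page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    # "none" is shorthand for "noindex, nofollow".
    return bool({"noindex", "none"} & parser.directives)


# Hypothetical snippets standing in for the static HTML of two pages.
indexable_page = '<html><head><meta name="description" content="a post"></head></html>'
blocked_page = "<html><head><meta name='robots' content='noindex' /></head></html>"

print(blocks_indexing(indexable_page))
print(blocks_indexing(blocked_page))
```

To test a real post, the HTML fetched without executing JavaScript would be passed to blocks_indexing in place of the sample strings.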

As of today this is the robots.txt on my instance:

# See http://www.robotstxt.org/robotstxt.html for documentation on how to use the robots.txt file

User-agent: *
Disallow: /media_proxy/
Disallow: /interact/

As far as I can tell this doesn't affect my posts' visibility either.
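This reading of the robots.txt can be double-checked mechanically. Below is a small sketch that feeds the rules quoted above to Python's standard urllib.robotparser and asks whether the example post URL may be fetched. The is_allowed helper and the /media_proxy/123 path are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted from the instance's robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /media_proxy/
Disallow: /interact/
"""


def is_allowed(url: str, user_agent: str = "*") -> bool:
    """Check a URL against the rules above for the given crawler name."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


post_url = "https://infosec.exchange/@buherator/109535230739398168"
proxy_url = "https://infosec.exchange/media_proxy/123"  # hypothetical path

print(is_allowed(post_url))
print(is_allowed(proxy_url))
```

The post URL does not fall under either Disallow prefix, so these rules alone should not keep a crawler away from it.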

Any advice on how I can debug this further would be appreciated!

Specifications

Mastodon v4.0.2+glitch

@v-p-b v-p-b added the bug Something isn't working label Dec 18, 2022
@rbairwell

rbairwell commented Dec 18, 2022

I believe that the major search engines are still "coming to terms" with Mastodon - I know "in the olden days" Google had full access to the "Twitter Firehose", which significantly bumped Twitter up in the SERPs.

Looking at my https://mastodon.org.uk/@rbairwell profile in private mode, I can see that the page is actually a JavaScript React.js app. This still causes difficulties for search engines, as they have to run a JavaScript-interpreting browser (such as headless Chromium) to index the site, which is slower and handles lower throughput than a basic HTML parser. I would say just give the search engines time: it may take a few months before they've fully updated their systems to "properly handle" Mastodon servers.

@v-p-b
Author

v-p-b commented Dec 18, 2022

Fortunately the provided meta tags provide a way for at least partial parsing without browser emulation. This is proven by the numerous toots already indexed (see the google search I linked).

In the meantime I diffed the static contents of an indexed link against an unindexed one and saw no significant differences, so it might really be some prioritization delaying the indexing of less popular accounts.
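A comparison like this can be sketched with Python's standard difflib. The two HTML strings below are invented placeholders for the static source of an indexed and an unindexed post, and static_diff is a hypothetical helper name; the tool actually used for the comparison above is not stated.

```python
import difflib


def static_diff(indexed_html: str, unindexed_html: str) -> list[str]:
    """Return a unified diff of two pages' static HTML, line by line."""
    return list(
        difflib.unified_diff(
            indexed_html.splitlines(),
            unindexed_html.splitlines(),
            fromfile="indexed",
            tofile="unindexed",
            lineterm="",
        )
    )


# Invented placeholders; real input would be the HTML fetched without JavaScript.
indexed = (
    "<head>\n"
    '<meta property="og:title" content="Post A">\n'
    '<meta property="og:type" content="article">\n'
    "</head>"
)
unindexed = (
    "<head>\n"
    '<meta property="og:title" content="Post B">\n'
    '<meta property="og:type" content="article">\n'
    "</head>"
)

for line in static_diff(indexed, unindexed):
    print(line)
```

If only per-post values such as the title differ, as in this sample, the markup itself is unlikely to explain why one page is indexed and the other is not.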

@ineffyble
Member

ineffyble commented Dec 18, 2022

Google can definitely index posts on Mastodon 4.0.2.

@v-p-b My recommendation would be signing up for Google Search Console and verifying your domain: https://search.google.com/search-console/about

Once you do that, you should get actionable data about the indexing status of your content (in a day or two).

EDIT: Ah, you're not the instance admin. Have you contacted the admins of infosec.exchange about this?

@v-p-b
Author

v-p-b commented Dec 19, 2022

@ineffyble No, I haven't yet. This will be my next step, and I'll post any updates here! Thanks for the feedback!

@nemobis
Contributor

nemobis commented Jan 22, 2023

Is your profile listed in the profile directory?

Before Mastodon 4, the logged-out profile directory served as a sort of sitemap, because it was easy to crawl. See https://respublicae.eu/explore as an example. https://infosec.exchange is running on v4.1.0rc1+glitch, so I'm not sure how your account would be discovered by search engines.

The example post also doesn't contain any real hashtag, so that's another way it can't be discovered.

@v-p-b
Author

v-p-b commented Jan 22, 2023

> https://infosec.exchange/ is running on v4.1.0rc1+glitch, so I'm not sure how your account would be discovered by search engines.

I set up a site specifically to provide in-links directly to my Mastodon posts (https://web.archive.org/web/20221219231050/https://infosex.exchange/), and made Google index that domain via Search Console. It didn't improve much (I have since converted the domain to Akkoma, which supports search by default).

My instance admin also registered infosec.exchange with Search Console as advised by @ineffyble, but there are still only a couple of my posts indexed.

> The example post also doesn't contain any real hashtag, so that's another way it can't be discovered.

The point would be not relying on hashtags, but proper full-text search.

@nemobis
Contributor

nemobis commented Jan 22, 2023 via email

@v-p-b
Author

v-p-b commented Jan 22, 2023

> Do search engines agree with this recommendation for discovery?

I'm not sure I follow. Search engines tend to be able to find stuff without hashtags. There are also examples of Google indexing Mastodon posts (with or without hashtags).

There is a possibility that the root of the problem is that search engines don't currently have a way to discover new posts (+ time between crawls makes them miss posts). In this case I'm interested in what my instance admin could do to make such a feature available for users who opt-in.

@supernovae

Search engines are increasing their crawl of major instances.

I think the problem is - Mastodon doesn't adhere to the expected behavior with these options.

image

There are still lots of parts of Mastodon that are noindex even with that set (looking at the code, it doesn't even check whether these parameters are set).

@supernovae

supernovae commented Mar 21, 2023

There is a lot of code in the js/jsx files that sets noindex, like

<Helmet> <meta name='robots' content='noindex' /> </Helmet>

when I think many admins expect it to be more like

<meta name='robots' content={(isLocal && isIndexable) ? 'all' : 'noindex'} /> (or something that can respect an instances approach to being public)

but I'm not even sure this achieves much, as isIndexable may not reflect the public timeline options but rather the unfederated property of the local instance, where a post being public can't be guaranteed anywhere else.

I spoke to Gargron about this on Slack a while ago when I noticed my landing pages were having bad SERPs, and Claire even asked me about it, as it wasn't using my meta descriptions. When I removed some of the noindex tags, Google started showing my real meta descriptions, and my landing pages were correct and more than just empty links.

@reggi

reggi commented May 9, 2023

Hey all, what's the consensus here? I've been using Mastodon for over a year now on indieweb.social. If you search site:indieweb.social/@thomasreggi on Google, only three results come up. All of my posts have been public. Is there anything I need to ask my admin to configure? Perhaps checking in with them and seeing if they set up Google Search Console?


6 participants