Pulumi AI is poisoning Google search results with AI answers #79

petetnt · 2024-03-21T11:49:44Z

What happened?

Today I was googling various infrastructure-related topics and noticed a worrying trend: Pulumi AI answers are getting indexed and ranking high in Google results, regardless of the quality of the AI answer itself or whether the question involved Pulumi in the first place. This happened with multiple searches and will probably get even worse as time goes on.

Example

For example, searching for "AWS Lightsail xray" brings up this AI Answer from Pulumi as the top result:

Link to the AI Answer: https://www.pulumi.com/ai/answers/bLHAi4DutXJvbJyNngGRvS/optimizing-aws-lightsail-and-x-ray-deployment.

While this might seem like a good thing to some, spamming highly ranked results that are at best misleading and at worst destructive does not seem like something I would want associated with Pulumi as a brand. There's already tons of generated, false content available on the internet, and adding even more noise to the search results is not a good idea.

I would highly recommend using a Disallow rule in robots.txt (or similar functionality) to keep robots from crawling https://www.pulumi.com/ai/answers.
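As a rough illustration of what such a rule would do, the sketch below feeds a hypothetical robots.txt to Python's standard `urllib.robotparser` and asks whether a crawler may fetch a given URL. The rule text and the example paths are assumptions made up for this example, not the contents of Pulumi's actual file. Keep in mind that robots.txt governs crawling; on its own it does not guarantee that already-indexed URLs disappear from results.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption, not Pulumi's real file).
ROBOTS_TXT = """\
User-agent: *
Disallow: /ai/answers
"""


def is_crawl_allowed(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example paths are placeholders chosen for illustration.
blocked = is_crawl_allowed(ROBOTS_TXT, "https://www.pulumi.com/ai/answers/example")
allowed = is_crawl_allowed(ROBOTS_TXT, "https://www.pulumi.com/docs/")
print(blocked, allowed)
```

The standard-library parser does simple prefix matching, so the sketch uses a plain path prefix rather than a wildcard pattern.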

Additional context

Adding -inurl:[pulumi.com/ai] to your query will remove Pulumi AI answers from the search results, but it's cumbersome.

Contributing

Vote on this issue by adding a 👍 reaction.


AaronFriel · 2024-03-22T18:25:24Z

Hey @petetnt, we've taken this feedback from you and others and we've taken steps to remove more than half (almost two thirds) of AI Answers, and we plan to continue to ensure that these AI answers are complementary to our existing documentation.

We are also taking steps to:

Ensure our pages are grounded in the real APIs and upstream documentation
Validate the results, e.g. via type checking and testing the generated code
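Purely as an illustration of the cheapest kind of automated gate (this is not Pulumi's pipeline, which isn't shown in this thread), the sketch below rejects generated Python snippets that don't even parse. It is a syntax check only, which is much weaker than type checking: a real check would also need a type checker such as mypy or tsc plus the provider SDKs. The function name and both sample snippets are hypothetical.

```python
import ast


def syntax_errors(source: str) -> list[str]:
    """Return syntax error messages for source (empty list if it parses)."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return [f"line {exc.lineno}: {exc.msg}"]
    return []


# Hypothetical generated snippets, made up for this example.
GOOD_SNIPPET = "import json\nprint(json.dumps({'ok': True}))\n"
BAD_SNIPPET = "def broken(:\n    pass\n"

for name, snippet in [("good", GOOD_SNIPPET), ("bad", BAD_SNIPPET)]:
    problems = syntax_errors(snippet)
    print(name, "passes" if not problems else f"fails: {problems[0]}")
```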

cnunciato · 2024-03-22T19:00:27Z

we've taken steps to remove more than half of AI Answers

Worth mentioning this list was submitted to Google this morning, so it could be a bit before they're removed from search results. We expect this to happen fairly soon, though.

petetnt · 2024-03-22T20:16:49Z

Thank you @AaronFriel and @cnunciato for the prompt (sic) and solid response 🫡

petetnt · 2024-04-29T13:06:35Z

This was trending on my Twitter feed today, so it's pretty safe to assume that the situation is still dire: https://twitter.com/ProgrammerDude/status/1784833971731223033

Camel · 2024-04-29T14:54:34Z

It honestly makes using Pulumi itself very challenging... it's hard to find valid answers on how to do something because the Pulumi AI-generated ones crowd the results, and if you try them they don't actually work. And for some time (at least as of 2-3 weeks ago), the links to Pulumi's site for these generated results were 404'ing.

I appreciate there's a GTM benefit to this SEO work, but at least for me it cut in the opposite direction. I wanted to use Pulumi, but this was such a pain point that I just stuck with Terraform.

mbomb007 · 2024-04-29T15:09:51Z

Hey @petetnt, we've taken this feedback from you and others and we've taken steps to remove more than half (almost two thirds) of AI Answers, and we plan to continue to ensure that these AI answers are complementary to our existing documentation.

It doesn't sound like robots.txt was changed. Removing answers isn't going to fix the issue if LLM-generated answers are still available in search results.

However, for people who don't like Google, the issue will probably help alternatives to Google gain market share.

daaain · 2024-05-03T11:39:04Z

You ABSOLUTELY MUST add a report button on these pages at the very least, ASAP!

If somebody asks a question about stuff that doesn't exist, the LLM will hallucinate it, it'll rank high in searches (as no one else will have written about a solution that isn't possible), and it'll confuse the hell out of whoever finds it!

I'm pretty knowledgeable about GCP (I actually have the GCP Professional Cloud Architect certification), but I was chasing down the wrong idea that it would be possible to pre-create a Cloud Function with Pulumi and then use gcloud to deploy from source. I can't find those pages any more, but while looking for them I randomly found 2 wrong answers:

https://www.pulumi.com/ai/answers/7Kzx1a8vhPuAX6yYjEpeG3/deciphering-google-cloud-artifact-registry-and-cloud-functions-v2-integration
https://www.pulumi.com/ai/answers/gpu3nhpaDXc7gDC5hG61PZ/deploying-gke-and-cloud-functions-on-google-cloud

cnunciato · 2024-05-03T19:23:52Z

@daaain Thank you for pointing this out! I actually thought we were doing this already. I just opened a PR to add the same feedback widget we use elsewhere in Pulumi AI.

cnunciato · 2024-05-03T20:49:09Z

I just opened a PR to add the same feedback widget we use elsewhere in Pulumi AI.

The PR's been merged and the site's been updated. Thanks again for the report!

cnunciato · 2024-05-03T21:04:06Z

It doesn't sound like robots.txt was changed.

@mbomb007 We actually did update our robots.txt file to point to a sitemap we built to tell Google about unpublished pages, all of which return HTTP 410 with <meta name="robots" content="noindex"> directives. (This is the sitemap that was submitted to Google on March 22.) This combination (HTTP 410 + noindex) is the strongest signal we know of to tell Google that these pages are gone and aren't coming back. It's unclear why it's taking so long for Google to remove them.
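A minimal sketch of that combination, assuming a plain Python `http.server` handler rather than whatever actually serves pulumi.com: unpublished paths get HTTP 410 with a `noindex` robots meta tag in the body. The `UNPUBLISHED` set, the example path, and the page markup are all hypothetical.

```python
from http import HTTPStatus
from http.server import BaseHTTPRequestHandler

# Hypothetical unpublished paths; the real list isn't shown in this thread.
UNPUBLISHED = {"/ai/answers/example-unpublished"}

GONE_BODY = (
    "<!doctype html><html><head>"
    '<meta name="robots" content="noindex">'
    "<title>Gone</title></head>"
    "<body>This page has been removed.</body></html>"
)
OK_BODY = "<!doctype html><html><body>Published page</body></html>"


def respond(path: str) -> tuple[int, str]:
    """Return (status code, HTML body) for a request path."""
    if path in UNPUBLISHED:
        # 410 Gone plus noindex: the page is intentionally and permanently removed.
        return HTTPStatus.GONE.value, GONE_BODY
    return HTTPStatus.OK.value, OK_BODY


class AnswersHandler(BaseHTTPRequestHandler):
    """Serves the response chosen by respond() for GET requests."""

    def do_GET(self) -> None:
        status, html = respond(self.path)
        body = html.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it locally, something like `HTTPServer(("127.0.0.1", 8000), AnswersHandler).serve_forever()` (with `HTTPServer` imported from `http.server`) would serve these responses.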

daaain · 2024-05-04T00:00:39Z

The PR's been merged and the site's been updated. Thanks again for the report!

Amazing turnaround time, thanks a lot for your hard work on this today!

I'll make sure to flag nonsense generated code when I find more (did on the 2 pages I linked above with explanation).

tobytteh · 2024-05-04T21:20:16Z

This use of AI is monstrously stupid. Let us all pray to our respective gods that someone at Pulumi is intelligent enough to end this.

mbomb007 · 2024-05-06T15:04:06Z

It doesn't sound like robots.txt was changed.

@mbomb007 We actually did update our robots.txt file to point to a sitemap we built to tell Google about unpublished pages, all of which return HTTP 410 with <meta name="robots" content="noindex"> directives. (This is the sitemap that was submitted to Google on March 22.) This combination (HTTP 410 + noindex) is the strongest signal we know of to tell Google that these pages are gone and aren't coming back. It's unclear why it's taking so long for Google to remove them.

You could maybe speed reindexing up using Google Search Console?

AaronFriel · 2024-05-06T16:34:02Z

@mbomb007 we did that as well, submitting the "unpublished" sitemap and the console reported it scanned (IIRC, it did not say "crawled") those pages. Our last resort has been using a tool that allows us to remove up to 1,000 URLs per day from Google's index, but it is fairly manual.
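To give a sense of the pacing a per-day cap implies, here is a small sketch that splits a URL list into daily batches. The 1,000-per-day limit is the figure mentioned above; the URL list and its size are placeholders invented for the example, and the removal tool itself is not modeled.

```python
def daily_batches(urls: list[str], per_day: int = 1000) -> list[list[str]]:
    """Split urls into consecutive batches of at most per_day items."""
    if per_day <= 0:
        raise ValueError("per_day must be positive")
    return [urls[i:i + per_day] for i in range(0, len(urls), per_day)]


# Placeholder URLs; the count is made up for illustration.
urls = [f"https://www.pulumi.com/ai/answers/placeholder-{n}" for n in range(2500)]
batches = daily_batches(urls)
print(f"{len(urls)} URLs need {len(batches)} days at 1000 per day")
```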

petetnt · 2024-05-07T08:57:10Z

Why not add a <meta name="robots" content="noindex"> tag to all the pages under the /ai path? Or explicitly add Disallow: /ai/* to the robots.txt at https://www.pulumi.com/robots.txt?
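For anyone who wants to verify whether a given page carries that tag, here is a minimal sketch using Python's standard `html.parser`. The class and function names and the sample HTML are made up for the example. One caveat when combining the two suggestions: a crawler that is blocked by robots.txt never fetches the page, so it cannot see a `noindex` tag on it.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for part in attr.get("content", "").split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def has_noindex(html: str) -> bool:
    """Return True if the HTML contains a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Sample markup invented for illustration.
sample = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(has_noindex(sample))
```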

mbomb007 · 2024-05-07T13:50:26Z

@AaronFriel This might help: https://developers.google.com/search/docs/crawling-indexing/block-indexing#debugging-noindex-issues

AaronFriel · 2024-05-07T18:00:33Z

@petetnt @mbomb007 Thanks, we've already taken some of those steps, and we've added the meta tag to the pages we want to remove because they didn't meet our quality bar and were cluttering search results (roughly 2/3 of the pages we published).

I think the point you're getting at is: why publish AI Answers at all? In short: we've gotten very positive feedback from users when the pages show up appropriately and don't clutter the first page of results in a search engine. We don't want to throw out the good with the bad, and we've marked those pages as noindex. The URL Inspection tool reports what we expect since setting these pages to 404:

I'll speak to @tobytteh's comment here, which I think captures the frustration folks have and the underlying question of why Pulumi feels comfortable generating code examples with AI:

This use of AI is monstrously stupid. Let us all pray to our respective gods that someone at Pulumi is intelligent enough to end this.

Code generation is Pulumi's bread and butter; it is a core competency of our engineering org. Every one of our providers has a rich schema describing the SDK (for example, the Docker provider's schema.json). Those schemas are then used to generate the SDKs for each language (source code in github.com/pulumi/pulumi/pkg/codegen). Pulumi AI combines this with retrieval-augmented generation and type checking of generated programs to be more than ten times more likely to generate valid, working code for many questions than ChatGPT (GPT-4) on its own. Generating ten times better code than state-of-the-art language models is itself a feat, but we aren't resting on our laurels, and we're continuing to work on setting an even higher bar for ourselves to ensure that every program we publish would work from copy-and-paste to pulumi up.

That said, we certainly didn't expect the result of publishing as many pages as we did, and that's why we've taken drastic steps to withdraw a significant number (around 2/3) of the AI Answers. We'll continue to raise the bar on quality and prune pages that do not meet our standards.

petetnt · 2024-05-07T18:21:25Z

Personally, I don’t think the (alleged) 2/3 is nearly a good enough ratio to justify spamming the internet full of absolutely wrong answers. Not to mention that the page mentioned in the first post is still up and indexed, which makes me think that you are willing to risk it for a piece of the much obsessed AI pie.

For example, publishing an index of valid answers while keeping the actual answers out of the search results would probably satisfy those looking for AI answers too.

AaronFriel · 2024-05-07T18:36:17Z

In the interest of transparency, I'm happy to set up a call to chat and prove that 2/3 figure. Email me at my last name at pulumi.com.

That said, there are three issues here:

Opposition to AI generated content, full stop
Low quality code examples/documentation
Cluttering of search results

While I see the pros and cons of 1, the issues we want to solve are 2 and 3. If you have examples where the answer is "absolutely wrong", that falls under 2, so please create issues or use the feedback buttons to let us know.

AaronFriel · 2024-05-21T17:57:10Z

Thanks everyone for your feedback. In February when we saw the impact Pulumi AI Answers had on search result quality, we started work on solutions and we’re now seeing dramatic improvements from the work we’ve done:

Since day one, we’ve labeled our content as AI generated for search engines
In March, we removed 2/3 of the AI Answers pages that did not meet our quality bar
We’ve verified published answers to ensure they contain valid, compiling code snippets and are continuing to invest in quality controls, automated checks, and code generation

The good news is this has been effective! We’re pleased to see search engines use these signals to place our authoritative, expert-written docs content first.

Pulumi AI is still providing a ton of value to users - we're seeing thousands of questions asked and answered every day, helping devs build faster on any cloud. With quality checks in place and search results cleaned up, we’ve made Pulumi AI a better resource that is more correctly ranked relative to our other docs such as our Registry API Docs. And we will keep iterating on these improvements to documentation, code generation and verification of AI generated content.

We’ll close this issue as resolved, and thanks again for pushing us to make Pulumi better.

petetnt · 2024-05-21T18:26:26Z

Sadly, for me the original issue persists, with the example in the OP still being one of the many answers that provide me with negative value, so I guess I’ll just consider this more of a "wontfix" than completed.

joeduffy · 2024-05-22T20:26:26Z

Bizarrely, this is one of the only queries that still seems to rank so highly. It honestly baffles me why Google ranks this page above everything else. I've tried numerous others, and we've validated that most of the traffic Google was sending to these pages has died down considerably. We will keep monitoring and iterating.

That said, for what it's worth, the example on this page works! Is there a particular reason it isn't perceived as a reasonable page to have on the Internet? I'm not an AWS Lightsail expert, so apologies if I'm missing something obvious.

petetnt added the kind/bug and needs-triage labels Mar 21, 2024

AaronFriel removed the needs-triage label Mar 22, 2024

AaronFriel assigned AaronFriel and interurban Mar 22, 2024

petetnt closed this as completed Mar 22, 2024

petetnt reopened this Apr 29, 2024

AaronFriel closed this as completed May 21, 2024
