Pulumi AI is poisoning Google search results with AI answers #79

Closed
petetnt opened this issue Mar 21, 2024 · 22 comments
Labels: kind/bug (Some behavior is incorrect or out of spec)

Comments

@petetnt

petetnt commented Mar 21, 2024

What happened?

Today I was googling various infrastructure-related topics and noticed a worrying trend: Pulumi AI answers are getting indexed and ranking high in Google results, regardless of the quality of the AI answer itself or whether the question involved Pulumi in the first place. This happened with multiple searches and will probably get worse as time goes on.

Example

For example, searching for AWS Lightsail xray brings up this AI Answer from Pulumi as the top result:

image

Link to the AI Answer: https://www.pulumi.com/ai/answers/bLHAi4DutXJvbJyNngGRvS/optimizing-aws-lightsail-and-x-ray-deployment.

While this might seem like a good thing to someone, spamming high-ranking results that are at best misleading and at worst destructive does not seem like something I would want associated with Pulumi as a brand. There's already tons of generated, false content available on the internet, and adding even more noise to the search results is not a good idea.

I would highly recommend Disallow:ing robots from crawling https://www.pulumi.com/ai/answers via robots.txt or similar functionality.
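As a rough illustration of the suggestion above, here is a minimal sketch that checks what a `Disallow` rule for that path would block, using Python's standard-library robots.txt parser. The rule text is a hypothetical example, not the contents of Pulumi's actual robots.txt, and the docs URL is only there for contrast.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; the real pulumi.com robots.txt may differ.
ROBOTS_TXT = """\
User-agent: *
Disallow: /ai/answers
"""


def is_crawlable(url: str, robots_txt: str = ROBOTS_TXT, agent: str = "*") -> bool:
    """Return True if the given rules allow `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


answer_url = (
    "https://www.pulumi.com/ai/answers/bLHAi4DutXJvbJyNngGRvS/"
    "optimizing-aws-lightsail-and-x-ray-deployment"
)
docs_url = "https://www.pulumi.com/docs/"

print(is_crawlable(answer_url))  # False: path starts with /ai/answers
print(is_crawlable(docs_url))    # True: no rule matches
```

Note that this parser matches rules by path prefix, so the sketch uses `/ai/answers` rather than a wildcard pattern.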

Additional context

Adding -inurl:[pulumi.com/ai] to your query will remove Pulumi AI answers from the search results, but it's cumbersome.
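For anyone who wants to script this workaround, a small sketch follows. It appends the exclusion operator exactly as written above to a query and builds a search URL. The `https://www.google.com/search?q=` URL shape and the helper name are assumptions made for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# The operator is copied verbatim from the comment above.
EXCLUDE_PULUMI_AI = "-inurl:[pulumi.com/ai]"


def search_url(query: str, exclusion: str = EXCLUDE_PULUMI_AI) -> str:
    """Build a search URL whose query has the exclusion operator appended."""
    full_query = f"{query} {exclusion}"
    # Assumed URL shape for a Google search; urlencode handles the escaping.
    return "https://www.google.com/search?" + urlencode({"q": full_query})


url = search_url("AWS Lightsail xray")
# Decode the query string again to confirm the round trip.
print(parse_qs(urlsplit(url).query)["q"][0])
```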

Contributing

Vote on this issue by adding a 👍 reaction.

petetnt added the kind/bug and needs-triage labels Mar 21, 2024
@AaronFriel

AaronFriel commented Mar 22, 2024

Hey @petetnt, we've taken this feedback from you and others and we've taken steps to remove more than half (almost two thirds) of AI Answers, and we plan to continue to ensure that these AI answers are complementary to our existing documentation.

We are also taking steps to:

  • Ensure our pages are grounded in the real APIs and upstream documentation
  • Validate the results, e.g., via type checking and testing the generated code

AaronFriel removed the needs-triage label Mar 22, 2024
@cnunciato

> we've taken steps to remove more than half of AI Answers

Worth mentioning this list was submitted to Google this morning, so it could be a bit before they're removed from search results. We expect this to happen fairly soon, though.

@petetnt
Author

petetnt commented Mar 22, 2024

Thank you @AaronFriel and @cnunciato for the prompt (sic) and solid response 🫡

petetnt closed this as completed Mar 22, 2024
petetnt reopened this Apr 29, 2024
@petetnt
Author

petetnt commented Apr 29, 2024

This was trending on my Twitter feed today, so it's pretty safe to assume that the situation is dire still https://twitter.com/ProgrammerDude/status/1784833971731223033

@Camel

Camel commented Apr 29, 2024

It honestly makes using Pulumi itself very challenging... hard to find valid answers on how to do something b/c the pulumi ask AI generated ones crowd the results and if you try them they don't actually work. And for some time (at least as of 2-3 weeks ago), the links to Pulumi's site for these generated results were 404'ing.

Appreciate there's a GTM benefit to this SEO work, but at least for me it cut the opposite direction. Wanted to use Pulumi but this was such a pain point I just stuck with Terraform.

@mbomb007

mbomb007 commented Apr 29, 2024

> Hey @petetnt, we've taken this feedback from you and others and we've taken steps to remove more than half (almost two thirds) of AI Answers, and we plan to continue to ensure that these AI answers are complementary to our existing documentation.

It doesn't sound like robots.txt was changed. Removing answers isn't going to fix the issue if LLM-generated answers are still available in search results.

However, for people who don't like Google, the issue will probably help alternatives to Google gain market share.

@daaain

daaain commented May 3, 2024

You ABSOLUTELY MUST add a report button on these pages at the very least, ASAP!

If somebody asks a question about stuff that doesn't exist, the LLM will hallucinate it, it'll rank high in searches (as no one else will have written about the solution that isn't possible) and it'll confuse the hell out of whoever finds it!

I'm pretty knowledgeable about GCP (actually have the GCP Professional Cloud Architect certification) but I was chasing down the wrong idea that it would be possible to pre-create a Cloud Function with Pulumi and then use gcloud to deploy from source. I can't find those pages any more, but while looking for them I randomly found 2 wrong answers:

https://www.pulumi.com/ai/answers/7Kzx1a8vhPuAX6yYjEpeG3/deciphering-google-cloud-artifact-registry-and-cloud-functions-v2-integration
https://www.pulumi.com/ai/answers/gpu3nhpaDXc7gDC5hG61PZ/deploying-gke-and-cloud-functions-on-google-cloud

@cnunciato

@daaain Thank you for pointing this out! I actually thought we were doing this already. I just opened a PR to add the same feedback widget we use elsewhere in Pulumi AI.

@cnunciato

> I just opened a PR to add the same feedback widget we use elsewhere in Pulumi AI.

The PR's been merged and the site's been updated. Thanks again for the report!

@cnunciato

cnunciato commented May 3, 2024

> It doesn't sound like robots.txt was changed.

@mbomb007 We actually did update our robots.txt file to point to a sitemap we built to tell Google about unpublished pages, all of which return HTTP 410 with <meta name="robots" content="noindex"> directives. (This is the sitemap that was submitted to Google on March 22.) This combination (HTTP 410 + noindex) is the strongest signal we know of to tell Google that these pages are gone and aren't coming back. It's unclear why it's taking so long for Google to remove them.
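The combination described above (HTTP 410 plus a `noindex` robots meta tag) can be spot-checked with a small script. The following is a minimal sketch that inspects a status code and an HTML body that have already been fetched; the function and class names are made up for illustration and do not reflect Pulumi's tooling.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def signals_permanent_removal(status_code: int, html: str) -> bool:
    """True only if the response is HTTP 410 and the page carries noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return status_code == 410 and "noindex" in parser.directives


gone_page = '<html><head><meta name="robots" content="noindex"></head><body>Gone</body></html>'
print(signals_permanent_removal(410, gone_page))  # True
print(signals_permanent_removal(200, gone_page))  # False: wrong status code
```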

@daaain

daaain commented May 4, 2024

> The PR's been merged and the site's been updated. Thanks again for the report!

Amazing turnaround time, thanks a lot for your hard work on this today!

I'll make sure to flag nonsense generated code when I find more (did on the 2 pages I linked above with explanation).

@tobytteh

tobytteh commented May 4, 2024

This use of AI is monstrously stupid. Let us all pray to our respective gods that someone at Pulumi is intelligent enough to end this.

@mbomb007

mbomb007 commented May 6, 2024

> It doesn't sound like robots.txt was changed.
>
> @mbomb007 We actually did update our robots.txt file to point to a sitemap we built to tell Google about unpublished pages, all of which return HTTP 410 with <meta name="robots" content="noindex"> directives. (This is the sitemap that was submitted to Google on March 22.) This combination (HTTP 410 + noindex) is the strongest signal we know of to tell Google that these pages are gone and aren't coming back. It's unclear why it's taking so long for Google to remove them.

You could maybe speed reindexing up using Google Search Console?

@AaronFriel

AaronFriel commented May 6, 2024

@mbomb007 we did that as well, submitting the "unpublished" sitemap and the console reported it scanned (IIRC, it did not say "crawled") those pages. Our last resort has been using a tool that allows us to remove up to 1,000 URLs per day from Google's index, but it is fairly manual.
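To give a sense of how a 1,000-URLs-per-day limit plays out, here is a short sketch that splits a list of URLs into daily batches. The URL count and the placeholder URLs are invented for illustration; the thread does not say how many pages were involved.

```python
DAILY_LIMIT = 1000  # per-day limit mentioned in the comment above


def daily_batches(urls, limit=DAILY_LIMIT):
    """Split `urls` into consecutive batches of at most `limit` items."""
    return [urls[i:i + limit] for i in range(0, len(urls), limit)]


# Hypothetical backlog of 2,500 placeholder URLs.
backlog = [f"https://example.com/page/{n}" for n in range(2500)]
batches = daily_batches(backlog)

print(len(batches))               # 3 days of submissions
print([len(b) for b in batches])  # [1000, 1000, 500]
```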

@petetnt
Author

petetnt commented May 7, 2024

Why not add a <meta name="robots" content="noindex"> tag to all the pages under the /ai path? Or explicitly add Disallow: /ai/* to the robots.txt at https://www.pulumi.com/robots.txt
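The first option in this comment could look roughly like the sketch below, which adds the tag to any page served under a path prefix. It assumes pages are available as HTML strings with a literal `<head>` tag; the helper name is hypothetical and this is not how Pulumi's site is built.

```python
NOINDEX_TAG = '<meta name="robots" content="noindex">'


def add_noindex(path: str, html: str, prefix: str = "/ai/") -> str:
    """Insert a noindex meta tag into pages whose path starts with `prefix`.

    Pages outside the prefix, pages that already carry the tag, and pages
    without a literal <head> tag are returned unchanged.
    """
    if not path.startswith(prefix) or NOINDEX_TAG in html:
        return html
    return html.replace("<head>", "<head>" + NOINDEX_TAG, 1)


page = "<html><head><title>Answer</title></head><body></body></html>"
print(add_noindex("/ai/answers/example", page))
print(add_noindex("/docs/", page) == page)  # True: outside /ai/, untouched
```

One caveat about combining the two options: a crawler can only read a meta tag on a page it is allowed to fetch, so a `Disallow` rule covering the same path would likely keep the `noindex` tag from being seen.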

@AaronFriel

AaronFriel commented May 7, 2024

@petetnt @mbomb007 Thanks, we've already taken some of those steps and we've added the meta tag for the pages we want to remove that didn't meet our quality bar and were cluttering search results (affecting roughly 2/3 of the pages we published).

I think the point you're getting at is: why publish AI Answers at all? In short: we've gotten very positive feedback from users when the pages show up appropriately and don't clutter the first page of results in a search engine. We don't want to throw out the good with the bad, and we've marked those pages as noindex. The URL Inspection tool reports what we expect since setting these pages to 404:

[Image: Google URL Inspection tool reporting a page as 404ed]

I'll speak to @tobytteh's comment here, which I think captures the frustration folks have and the underlying question of why Pulumi feels comfortable generating code examples with AI:

> This use of AI is monstrously stupid. Let us all pray to our respective gods that someone at Pulumi is intelligent enough to end this.

Code generation is Pulumi's bread and butter; it is a core competency of our engineering org. Every one of our providers has a rich schema describing the SDK (example: Docker provider schema.json). Those schemas are then used to generate the SDKs for each language (source code in github.com/pulumi/pulumi/pkg/codegen). Pulumi AI combines this with retrieval-augmented generation and type checking of generated programs to be more than ten times more likely to generate valid, working code for many questions than ChatGPT (GPT-4) on its own. Generating ten times better code than state-of-the-art language models is itself a feat, but we aren't resting on our laurels: we're continuing to work on setting an even higher bar for ourselves to ensure that every program we publish would work from copy-and-paste to pulumi up.
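The thread does not describe how Pulumi's validation is implemented. As a much weaker stand-in for the type checking mentioned above, the sketch below shows a syntax-only gate that rejects generated Python snippets that do not even parse. The function name and sample snippets are invented for illustration; a real pipeline would need type checking and test runs on top of this.

```python
import ast


def passes_syntax_check(source: str) -> bool:
    """Return True if `source` parses as Python. This checks syntax only:
    it does not import modules, resolve names, or check types."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


# Hypothetical generated snippets.
valid_snippet = "import pulumi\npulumi.export('name', 'demo')\n"
broken_snippet = "def broken(:\n    pass\n"

print(passes_syntax_check(valid_snippet))   # True
print(passes_syntax_check(broken_snippet))  # False
```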

That said, we certainly didn't expect the result of publishing as many pages as we did, and that's why we've taken drastic steps to withdraw a significant number (around 2/3) of the AI Answers. We'll continue to raise the bar on quality and prune pages that do not meet our standards.

@petetnt
Author

petetnt commented May 7, 2024

Personally I don’t think the (alleged) 2/3rds is nearly a good enough ratio to spam the internet full of absolutely wrong answers. Not to mention the page mentioned in the first post is still up and indexed for example, which makes me think that you are willing to risk it for a piece of the much obsessed AI pie.

For example, publishing an index of valid answers while keeping the actual index out of the results would probably satisfy those looking for AI answers too.

@AaronFriel

AaronFriel commented May 7, 2024

In the interest of transparency, I'm happy to set up a call to chat and prove that 2/3 figure. Email me at my last name at pulumi.com.

That said, there are three issues here:

  1. Opposition to AI generated content, full stop
  2. Low quality code examples/documentation
  3. Cluttering of search results

While I see the pros and cons of 1, the issues we want to solve are 2 and 3. If you have examples where the example is "absolutely wrong", that falls under 2, so please create issues or use the feedback buttons to let us know.

@AaronFriel

Thanks everyone for your feedback. In February when we saw the impact Pulumi AI Answers had on search result quality, we started work on solutions and we’re now seeing dramatic improvements from the work we’ve done:

  • Since day one, we’ve labeled our content as AI generated for search engines
  • In March, we removed 2/3 of the AI Answers pages that did not meet our quality bar
  • We’ve verified published answers to ensure they contain valid, compiling code snippets and are continuing to invest in quality controls, automated checks, and code generation

The good news is this has been effective! We’re pleased to see search engines use these signals to place our authoritative, expert-written docs content first.

Pulumi AI is still providing a ton of value to users - we're seeing thousands of questions asked and answered every day, helping devs build faster on any cloud. With quality checks in place and search results cleaned up, we’ve made Pulumi AI a better resource that is more correctly ranked relative to our other docs such as our Registry API Docs. And we will keep iterating on these improvements to documentation, code generation and verification of AI generated content.

We’ll close this issue as resolved, and thanks again for pushing us to make Pulumi better.

@petetnt
Author

petetnt commented May 21, 2024

Sadly for me the original issue persists, with the example in the OP still being one of the many answers that provide me with negative value, so I guess I’ll just consider this more of a "wontfix" than completed.

@joeduffy
Member

Bizarrely, this is one of the only queries that seems to still rank so highly. It honestly baffles me why Google ranks this page above everything else. I've tried numerous others, and we've validated that most of the traffic Google was sending to these pages has died down considerably. We will keep monitoring and iterating.

That said, for what it's worth, the example on this page works! Is there a particular reason it isn't perceived as a reasonable page to have on the Internet? I'm not an AWS Lightsail expert, so apologies if I'm missing something obvious.

9 participants