DOC: Add a better search engine #17037

Closed
bjnath opened this issue Aug 9, 2020 · 40 comments

@bjnath
Contributor

bjnath commented Aug 9, 2020

We have an opportunity to get better doc search on our site, with the new numpy.org thrown in as well.

It's the search engine pytorch.org uses. If you do a search there and click the Algolia logo at the bottom right, you'll see the deal: Algolia provides pytorch.org with search for free.

Algolia is looking to work with sites like ours -- sites hosting open-source docs.

Better site search would help people get the right content for the right NumPy version, rather than spinning the wheel with Google.

Does anyone object if I get started with this? They'll crawl our site and get back to us with an API key.

I'll integrate it; it's been done with Sphinx before.

@rgommers
Member

rgommers commented Aug 9, 2020

This sounds great to me, thanks for working on that @bjnath!

@mattip
Member

mattip commented Aug 11, 2020

I am concerned that Algolia is a commercial service. As far as I can tell, the search bar currently uses an internal javascript engine to find relevant pages. So we are trading an internal implementation for a dependency on a commercial web service? Or am I mistaken about the current search mechanism?

@eric-wieser
Member

Related: sphinx-doc/sphinx#3812

@eric-wieser
Member

My understanding of the situation is:

  • Now: sphinx creates a search index at build time, which is loaded by the browser (on page load?) and used in the search bar.
  • Proposed: algolia crawls our website, and our search bar calls out to their site.

Some obvious downsides to this second approach:

  • It breaks if algolia goes down
  • It doesn't work on PR branches
  • It doesn't work offline (I think the current search bar does?)
  • It may not know about the documentation structure in the same way Sphinx does, assuming it does not use the generated index.
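
To make the "Now" case concrete, here is a minimal sketch of a search index that is built once and then queried without any network call. It is not Sphinx's actual implementation (the real index is more elaborate), and the page names and contents are hypothetical.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of page names that contain it."""
    index = defaultdict(set)
    for name, text in pages.items():
        for word in re.findall(r"[a-z0-9_]+", text.lower()):
            index[word].add(name)
    return index


def search(index, query):
    """Return the pages that contain every word of the query, sorted by name."""
    words = re.findall(r"[a-z0-9_]+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)


# Hypothetical page contents, for illustration only.
pages = {
    "reference/asarray": "Convert the input to an array",
    "reference/random": "Random sampling with the Generator class",
    "neps/nep-0019": "Random number generator policy",
}
index = build_index(pages)
print(search(index, "random generator"))
```

Because the index ships with the built pages, a lookup like this keeps working offline and on PR builds, which is the trade-off the list above describes.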

@rgommers
Member

I'm +1 on this change for the deployed docs, assuming the search results are accurate (playing with pytorch.org search suggests they will be). Our content is now spread over 3 sites (website, main docs, and NEPs), and the benefit of having a unified search over all those 3 sites outweighs any of the above concerns. I did a quick test on the devdocs, and it cannot find any NEP content for example.

It doesn't work offline (I think the current search bar does?)

This is the main relevant concern I think. Maybe there's a switch we can build in for that.

@bjnath
Contributor Author

bjnath commented Aug 11, 2020

So we are trading an internal implementation for a dependency on a commercial web service?

The internal implementation is alive and well. It's not being shut off and we can always go back to it.

@eric-wieser
Member

I think all my concerns would be addressed if the sphinx builtin search is still accessible somewhere, and is discoverable for users looking for it.

@bjnath
Contributor Author

bjnath commented Aug 11, 2020

if the sphinx builtin search is still accessible somewhere

It is: https://15086-908607-gh.circle-artifacts.com/0/doc/build/html/search.html

@eric-wieser
Member

eric-wieser commented Aug 11, 2020

Not particularly useful to us, but perhaps interesting: ReadTheDocs have their own search provider that they replace the sphinx one with, https://readthedocs-sphinx-search.readthedocs.io/en/latest/?

@bjnath
Contributor Author

bjnath commented Aug 11, 2020

We in fact need the original Sphinx for searching versions other than devdocs and stable.

I believe the Algolia results display is customizable. We'd add a note saying that for offline search, or to search an earlier release, use this link to the Sphinx search page.

@bashtage
Contributor

bashtage commented Aug 11, 2020

IMO this should be strictly opt-in. Anyone who wishes to build docs locally or in a corporate environment will not want to be compelled, and should not be defaulted, into relying on a commercial service. Commercial services come with other issues such as privacy policies and potential data sharing.

From their privacy policy:

The types of personal information we collect and share depend on which of our Websites or Marketing Activities you use. In general, we may collect information that identifies you, information about how you use our Websites, and the information that you create while you interact with our Websites.

Information that we collect from you may include the following:

  • Information that can personally identify you, such as name, photograph, postal address, email address, or telephone number.
  • Information about your Internet connection, the equipment you use to access our Websites, and your use of our Websites.
  • Information that you provide when you fill in forms on our Websites.
  • Information that identifies you in planning and hosting corporate events related to our Marketing Activities (e.g., user conferences).
  • Records and copies of your correspondence (including email addresses and Twitter handles), if you contact us.
  • Your responses to surveys that we might ask you to complete for research purposes.
  • Details of transactions you carry out through our Websites or related to orders of our Services.
  • You also may provide information as a contributing user that is published or displayed (hereafter, referred to as "posted") on public areas of our Websites, on the Algolia Community site, or transmitted to other users of our Websites or third-parties. Your user contributions are posted and transmitted to others at your own risk.

My emphasis.

@bashtage
Contributor

It's the search engine pytorch.org uses. If you do a search there and click the Algolia logo at the bottom right, you'll see the deal: Algolia provides pytorch.org with search for free.

Algolia doesn't actually replace sphinx's search. pytorch uses it to aggregate searches across different domains but then retains sphinx search within the actual documentation.

This page is a pretty clear standard sphinx search results page:

https://pytorch.org/docs/stable/search.html?q=tensor&check_keywords=yes&area=default

@bjnath
Contributor Author

bjnath commented Aug 11, 2020

@bashtage

Anyone who wishes to build docs locally or in a corporate environment will not want to be compelled, and should not be defaulted, into relying on a commercial service.

Those users are not affected. This doesn't crawl a site. Algolia crawls the public NumPy devdocs and stable sites, and the searchbar gives access to their server's search results.

It's as if we added a Google searchbar. It doesn't give access to anything not publicly accessible.

This should be strictly opt-in

That is the plan. There will be continued access to the built-in Sphinx search, and in fact users need it if they want to search anything other than the devdocs and stable sites. Nobody needs to use the Algolia search.

@bjnath
Contributor Author

bjnath commented Aug 11, 2020

Algolia doesn't actually replace sphinx's search

If we're checking Algolia usage at open-source sites, there are several more to choose from:
image

@bashtage
Contributor

Those users are not affected. This doesn't crawl a site. Algolia crawls the public NumPy devdocs and stable sites, and the searchbar gives access to their server's search results.

If one builds locally, will the search box perform a local sphinx search or will it redirect to hosted numpy.org docs?

It's as if we added a Google searchbar. It doesn't give access to anything not publicly accessible.

I don't think anyone is suggesting that it would crawl other information on the computer. My point is that a search of locally built docs should not transmit any information from my computer, whether I am searching a locally built or an intranet-hosted copy of the docs.

If we're checking Algolia usage at open-source sites, there are several to choose from:

You should check how pytorch uses it. It does not use it in the way you are proposing in the related PR. Your approach is putting algolia at a lower level of the documentation tree than pytorch is using it.

@bjnath
Contributor Author

bjnath commented Aug 11, 2020

The motivation for this change is the SciPy user survey, where users said

  • "searching for terms does not work"
  • "Google search results often yield confusing links"
  • "the search function ought to be improved"

@bashtage
Contributor

  • "Google search results often yield confusing links"

This one is because the docs do not implement canonical URLs, which then leaves it up to google to choose which version of asarray should be treated as authoritative. Docs should always point to the canonical "stable" implementation from any version-specific copy.

@bjnath
Contributor Author

bjnath commented Aug 11, 2020

Algolia lists 41 open source sites as users.

Though we don't have to follow their lead, sites like ours are finding this solution acceptable. Whether or not Pytorch uses it on every page, they put it on the page that's most likely to get search questions.

@rgommers
Member

Docs should always point to the canonical "stable" implementation from any version-specific copy.

That exists now, as of a few months ago.

@bashtage
Contributor

That exists now, as of a few months ago.

Are you sure?

When I look at

https://numpy.org/doc/stable/reference/random/generator.html#numpy.random.Generator

I don't see the word canonical in the source, and when I looked at the headers I don't see the header either.

https://support.google.com/webmasters/answer/139066
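
The manual check described above (looking for a canonical link in the page source) could be automated along these lines. This is only a sketch: it inspects the HTML for a `<link rel="canonical">` element and does not look at HTTP response headers, and the function and class names are made up for illustration.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Record the href of the first <link rel="canonical"> element seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # rel may hold several space-separated tokens, or be missing entirely.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_tokens and self.canonical is None:
            self.canonical = attributes.get("href")


def find_canonical(html_text):
    """Return the canonical URL declared in html_text, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html_text)
    return finder.canonical
```

Feeding it the saved source of a docs page returns either the declared canonical URL or `None`, which is the same yes/no answer as searching the source for the word "canonical".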

@bashtage
Contributor

Though we don't have to follow their lead, sites like ours are finding this solution acceptable. Whether or not Pytorch uses it on every page, they put it on the page that's most likely to get search questions.

The associated PR defaults to a change in behavior. I think this should be avoided.

Making the change to NumPy.org seems pretty low cost since it doesn't even have a search box, and would clearly benefit from an enhancement.

@bjnath
Contributor Author

bjnath commented Aug 11, 2020

@bashtage @rgommers

Took me a moment to catch on to the difference between having it on numpy.org vs the Sphinx pages.

Let me pursue adding it to numpy.org (in a new PR separate from #17056).

I'll separately see if there's a feasible approach where numpy-hosted Sphinx pages use the Algolia search and deployments elsewhere are unchanged (continue to use Sphinx search).
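
One possible shape for such a switch, sketched as a `conf.py` fragment. The environment variable name and the `use_hosted_search` key are hypothetical; `html_context` is the standard Sphinx option for passing values to the HTML templates, which would then have to choose the search box accordingly.

```python
import os


def hosted_search_enabled(environ):
    """Return True only when the deploy job has explicitly opted in."""
    # Hypothetical variable name; any build that does not set it keeps
    # the built-in Sphinx search.
    return environ.get("NUMPY_DOCS_HOSTED_SEARCH", "0") == "1"


# Values in html_context are made available to the HTML templates.
html_context = {"use_hosted_search": hosted_search_enabled(os.environ)}
```

With a default of off, local and intranet builds would behave exactly as they do today, and only the numpy.org deployment would need to set the variable.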

Is this agreeable to everyone?

@mattip
Member

mattip commented Aug 11, 2020

My concerns about adding Javascript that sends information to a commercial entity still remain.

@mattip
Member

mattip commented Aug 11, 2020

I think a local solution that allows search from numpy.org into numpy.org/doc/stable, and numpy.org/neps is preferable. I assume such a javascript-based solution exists. Only if we have exhausted those possibilities should we reach for something like Algolia. For instance, a quick search led me to https://gohugo.io/tools/search/ with many options for adding search to a hugo website, we should be able to adapt those to index the rest of numpy.org as well.

@bjnath
Contributor Author

bjnath commented Aug 11, 2020

@mattip

local solution that allows search from numpy.org into numpy.org/doc/stable

I've in fact proposed this: numpy/numpy.org#263 (comment)

It of course doesn't search the Hugo site or neps, and it has all the inadequacies of Sphinx search, and it needs de-uglification.

@bjnath
Contributor Author

bjnath commented Aug 11, 2020

many options for adding search to a hugo website

The question is finding someone who'd be willing to take on this kind of overhaul. It would have to be someone who shares your fervency about commercial tools, because anyone else would say, "Have you tried Algolia?"

@eric-wieser
Member

eric-wieser commented Aug 11, 2020

Note that if we add algolia to numpy.org, we are probably required in the EU to add a cookie consent popup of some kind. I've gotten mixed results about whether we already are required to have one for google analytics.

@bjnath
Contributor Author

bjnath commented Aug 11, 2020

@eric-wieser Good point. I'll ask.

@bjnath
Contributor Author

bjnath commented Aug 11, 2020

@eric-wieser They address GDPR here, though I'm not sure what their compliance means to us. I can email their GDPR link and ask.

@eric-wieser
Member

At a glance (looking at the use of Algolia on the Bootstrap docs), it looks like their server payload contains only the information that I searched for, and it leaves no cookies.

@mattip
Member

mattip commented Aug 11, 2020

@bjnath please refrain from loading your comments with unnecessary barbs and purposefully manipulative expressions. Terms like "shares your fervor" or "Algolia is the devil" do nothing to further your cause, quite the opposite.

@rgommers
Member

Are you sure?

When I look at

https://numpy.org/doc/stable/reference/random/generator.html#numpy.random.Generator

I don't see the word canonical in the source, and when I looked at the headers I don't see the header either.

https://support.google.com/webmasters/answer/139066

Ah okay, I wasn't familiar with that. We added stable a few months ago, so links now are actually stable rather than having the version in the link. I thought that was enough. Looks like we have to add a bit of page metadata though - let me open a separate issue for that now.
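
For reference, one way to add that page metadata, assuming the docs are built with Sphinx 1.8 or later, is the `html_baseurl` option in `conf.py`, which Sphinx uses to emit a canonical link on each page. Pointing every versioned build at the stable URL is an assumption of this sketch, not something decided in this thread.

```python
# conf.py (fragment)
# Assumption: every versioned build treats the "stable" copy as canonical,
# so search engines are told which version of a page is authoritative.
html_baseurl = "https://numpy.org/doc/stable/"
```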

@bjnath
Contributor Author

bjnath commented Aug 12, 2020

@mattip I apologize; I had no intention to offend you. "Algolia is the devil" is a caricature of an extremist very unlike you, who isn't objecting quietly on principle but raving at the very idea. The intended reading was "even if you feel a thousand times stronger than Matti."

"Shares your fervor" is commendation; your convictions are deep and not lightly held. Look around; nobody here lacks fervor. Fervor held strong against the mainstream is called courage. No insult intended.

@mattip
Member

mattip commented Aug 12, 2020

Thanks for the clarification.

@mireille-raad

mireille-raad commented Sep 26, 2020

I saw/found this as open source
https://www.meilisearch.com/
https://github.com/meilisearch/MeiliSearch

It looks legit open source and Algolia-like. Interestingly, it is written in rust. Is anyone interested in looking into it and poking around together?

Do we have a list of common search queries on the site that we can use to test? (not needed, but nice to have)

@bjnath
Contributor Author

bjnath commented Sep 26, 2020

Thanks very much for calling this to our attention and for offering to help. We concluded that Algolia search is OK, on a restricted set of pages. The issue is not that Algolia is commercial -- we use other commercial tools -- but that it sends information offsite.

The pages where Algolia is not allowed do have their own search. It's not great, but it's built-in and sends nothing offsite. Even if MeiliSearch gives better results, my guess is that we'd be reluctant to add a non-Python dependency to those pages. Others may feel differently.

But thank you again for the suggestion! And, as I say, others may have a different view.

@mireille-raad

mireille-raad commented Sep 27, 2020

Sure. Thanks for clarifying. I see the concerns with Algolia, and I understand the privacy concerns about sending data off-site. It also makes sense not to introduce a hundred languages into a project.

Do you think it makes sense to close this issue since it was decided? (this won't make a dent in the 1.9k issues left :p). I was looking for issues to work on related to documentation and I saw this :)

@bjnath
Contributor Author

bjnath commented Sep 27, 2020

@mireille-raad

I was looking for issues to work on related to documentation

Are you able to attend our documentation team meeting tomorrow? https://hackmd.io/oB_boakvRqKR-_2jRV-Qjg

Do you think it makes sense to close this issue

Algolia hasn't been installed in the pages for which it would be allowed, and it would be useful there. In particular, we have no search bar yet on the home page.

@rgommers
Member

Do you think it makes sense to close this issue since it was decided?

Normally we keep an issue open till all the work is done, rather than when a decision is made. But in this case, it may indeed make sense to close it because the search bar needs to go on numpy.org and we have numpy/numpy.org#263 to keep track of that.

@mattip do you still need this issue for unified search between the two Sphinx sites, or can this be closed?

(this won't make a dent in the 1.9k issues left :p).

any tiny dent helps:)

I was looking for issues to work on related to documentation and I saw this :)

If you can't find a good one please let me know. We have a lot to do on improving our documentation (in particular the more narrative docs rather than API reference), but we may have to open some new issues with good actionable descriptions.

@mattip
Member

mattip commented Sep 28, 2020

Closing, please reopen if there is more to discuss here.

mattip closed this as completed Sep 28, 2020