DOC: Add a better search engine #17037
Comments
This sounds great to me, thanks for working on that @bjnath! |
I am concerned that Algolia is a commercial service. As far as I can tell, the search bar currently uses an internal javascript engine to find relevant pages. So we are trading an internal implementation for a dependency on a commercial web service? Or am I mistaken about the current search mechanism? |
Related: sphinx-doc/sphinx#3812 |
My understanding of the situation is:
Some obvious downsides to this second approach:
|
I'm +1 on this change for the deployed docs, assuming the search results are accurate (playing with pytorch.org search suggests they will be). Our content is now spread over 3 sites (website, main docs, and NEPs), and the benefit of having a unified search over all those 3 sites outweighs any of the above concerns. I did a quick test on the devdocs, and it cannot find any NEP content for example.
This is the main relevant concern I think. Maybe there's a switch we can build in for that. |
The internal implementation is alive and well. It's not being shut off and we can always go back to it. |
I think all my concerns would be addressed if the sphinx builtin search is still accessible somewhere, and is discoverable for users looking for it. |
It is: https://15086-908607-gh.circle-artifacts.com/0/doc/build/html/search.html |
Not particularly useful to us, but perhaps interesting: ReadTheDocs have their own search provider that they replace the sphinx one with, https://readthedocs-sphinx-search.readthedocs.io/en/latest/? |
We in fact need the original Sphinx search for searching versions other than devdocs and stable. I believe the Algolia results display is customizable. We'd add words to say that for offline search, or to search an earlier release, use this link to the Sphinx search page. |
IMO this should be strictly opt-in. Anyone who wishes to build docs locally or in a corporate environment will not want to be compelled, and should not be defaulted, into relying on a commercial service. Commercial services come with other issues such as privacy policies and potential data sharing. From their privacy policy: The types of personal information we collect and share depend on which of our Websites or Marketing Activities you use. In general, we may collect information that identifies you, information about how you use our Websites, and the information that you create while you interact with our Websites. Information that we collect from you may include the following:
My emphasis. |
Algolia doesn't actually replace sphinx's search. pytorch uses it to aggregate searches across different domains but then retains sphinx search within the actual documentation. This page is a pretty clear standard sphinx search results page: https://pytorch.org/docs/stable/search.html?q=tensor&check_keywords=yes&area=default |
Those users are not affected. This doesn't crawl a site. Algolia crawls the public NumPy devdocs and stable sites, and the searchbar gives access to their server's search results. It's as if we added a Google searchbar. It doesn't give access to anything not publicly accessible.
That is the plan. There will be continued access to the built-in Sphinx search, and in fact users need it if they want to search anything other than the devdocs and stable sites. Nobody needs to use the Algolia search. |
If one builds locally, will the search box perform a local sphinx search or will it redirect to hosted numpy.org docs?
I don't think anyone is suggesting that it would crawl other information on the computer. My point is that a search of locally built docs should not transmit any information from my computer, whether the copy is built locally or hosted on an intranet.
You should check how pytorch uses it. It does not use it in the way you are proposing in the related PR. Your approach is putting algolia at a lower level of the documentation tree than pytorch is using it. |
The motivation for this change is the SciPy user survey, where users said
|
This one is because the docs do not implement canonical URLs, which then leaves it up to google to choose which version of |
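For readers unfamiliar with the term: a canonical URL is a `<link rel="canonical">` element in a page's head that tells search engines which copy of a page to prefer. Sphinx can emit it when `html_baseurl` is set in `conf.py`. The sketch below only illustrates what such a tag looks like; the helper name `canonical_link` is hypothetical and is not part of Sphinx or the NumPy build.

```python
from html import escape
from urllib.parse import urljoin


def canonical_link(base_url: str, page_path: str) -> str:
    """Build a canonical <link> tag for one page (hypothetical helper).

    base_url  -- root of the preferred docs version, e.g. the stable docs
    page_path -- path of the page relative to that root
    """
    # urljoin drops the last path segment unless the base ends with "/".
    if not base_url.endswith("/"):
        base_url += "/"
    href = urljoin(base_url, page_path.lstrip("/"))
    return f'<link rel="canonical" href="{escape(href, quote=True)}" />'


if __name__ == "__main__":
    print(canonical_link("https://numpy.org/doc/stable",
                         "reference/random/generator.html"))
```

With a tag like this on every versioned copy of a page, a search engine has an explicit hint about which version to show, rather than choosing on its own.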
Algolia lists 41 open source sites as users. Though we don't have to follow their lead, sites like ours are finding this solution acceptable. Whether or not Pytorch uses it on every page, they put it on the page that's most likely to get search questions. |
That exists now, as of a few months ago. |
Are you sure? When I look at https://numpy.org/doc/stable/reference/random/generator.html#numpy.random.Generator I don't see the word |
The associated PR defaults to a change in behavior. I think this should be avoided. Making the change to NumPy.org seems pretty low cost since it doesn't even have a search box, and would clearly benefit from an enhancement. |
Took me a moment to catch on to the difference between having it on numpy.org vs the Sphinx pages. Let me pursue adding it to numpy.org (in a new PR separate from #17056). I'll separately see if there's a feasible approach where numpy-hosted Sphinx pages use the Algolia search and deployments elsewhere are unchanged (continue to use Sphinx search). Is this agreeable to everyone? |
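One way the "hosted pages opt in, everything else unchanged" idea could look is a switch in the Sphinx `conf.py`. This is a minimal sketch under stated assumptions, not what the related PR does: the environment variable `NUMPY_DOCS_USE_ALGOLIA`, the function `search_backend`, and the `search_backend` template key are all hypothetical names, and the theme's search template would still have to check that key.

```python
import os


def search_backend(environ=os.environ):
    """Return "algolia" only on explicit opt-in; otherwise "sphinx".

    NUMPY_DOCS_USE_ALGOLIA is a hypothetical variable that only the
    numpy.org deployment job would set. Local and intranet builds never
    set it, so they keep the built-in Sphinx search.
    """
    flag = environ.get("NUMPY_DOCS_USE_ALGOLIA", "")
    opted_in = flag.strip().lower() in {"1", "true", "yes"}
    return "algolia" if opted_in else "sphinx"


# html_context is a real Sphinx option: its values are passed to the HTML
# templates. The "search_backend" key itself is an assumption of this sketch.
html_context = {"search_backend": search_backend()}
```

Defaulting to the built-in search means that anyone building the docs themselves gets no change in behavior and nothing is sent offsite unless the deployment explicitly asks for it.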
My concerns about adding Javascript that sends information to a commercial entity still remain. |
I think a local solution that allows search from numpy.org into numpy.org/doc/stable and numpy.org/neps is preferable. I assume such a JavaScript-based solution exists. Only if we have exhausted those possibilities should we reach for something like Algolia. For instance, a quick search led me to https://gohugo.io/tools/search/, which lists many options for adding search to a Hugo website. We should be able to adapt those to index the rest of numpy.org as well. |
I've in fact proposed this: numpy/numpy.org#263 (comment) It of course doesn't search the Hugo site or neps, and it has all the inadequacies of Sphinx search, and it needs de-uglification. |
The question is finding someone who'd be willing to take on this kind of overhaul. It would have to be someone who shares your fervency about commercial tools, because anyone else would say, "Have you tried Algolia?" |
Note that if we add algolia to numpy.org, we are probably required in the EU to add a cookie consent popup of some kind. I've gotten mixed results about whether we already are required to have one for google analytics. |
@eric-wieser Good point. I'll ask. |
@eric-wieser They address GDPR here, though I'm not sure what their compliance means to us. I can email their GDPR link and ask. |
At a glance (looking at the use of Algolia on the Bootstrap docs), it looks like the server payload contains only the information that I searched for, and it leaves no cookies. |
@bjnath please refrain from loading your comments with unnecessary barbs and purposefully manipulative expressions. Terms like "shares your fervor" or "Algolia is the devil" do nothing to further your cause, quite the opposite. |
Ah okay, I wasn't familiar with that. We added |
@mattip I apologize; I had no intention to offend you. "Algolia is the devil" is a caricature of an extremist very unlike you, who isn't objecting quietly on principle but raving at the very idea. The intended reading was "even if you feel a thousand times stronger than Matti." "Shares your fervor" is commendation; your convictions are deep and not lightly held. Look around; nobody here lacks fervor. Fervor held strong against the mainstream is called courage. No insult intended. |
Thanks for the clarification. |
I found this open-source project. It looks like legitimate open source and is Algolia-like. Interestingly, it is written in Rust. Is anyone interested in looking into it and poking around together? Do we have a list of common search queries on the site that we can use to test? (Not needed, but nice to have.) |
Thanks very much for calling this to our attention and for offering to help. We concluded that Algolia search is OK, on a restricted set of pages. The issue is not that Algolia is commercial -- we use other commercial tools -- but that it sends information offsite. The pages where Algolia is not allowed do have their own search. It's not great, but it's built-in and sends nothing offsite. Even if MeiliSearch gives better results, my guess is that we'd be reluctant to add a non-Python dependency to those pages. But thank you again for the suggestion! And, as I say, others may have a different view. |
Sure. Thanks for clarifying. I see the concerns with Algolia, and I understand the concerns about privacy and sending data off-site. It also makes sense not to introduce a hundred languages into a project. Do you think it makes sense to close this issue since it was decided? (this won't make a dent in the 1.9k issues left :p). I was looking for issues to work on related to documentation and I saw this :) |
Are you able to attend our documentation team meeting tomorrow? https://hackmd.io/oB_boakvRqKR-_2jRV-Qjg
Algolia hasn't been installed in the pages for which it would be allowed, and it would be useful there. In particular, we have no search bar yet on the home page. |
Normally we keep an issue open till all the work is done, rather than when a decision is made. But in this case, it may indeed make sense to close it because the search bar needs to go on numpy.org and we have numpy/numpy.org#263 to keep track of that. @mattip do you still need this issue for unified search between the two Sphinx sites, or can this be closed?
any tiny dent helps:)
If you can't find a good one please let me know. We have a lot to do on improving our documentation (in particular the more narrative docs rather than API reference), but we may have to open some new issues with good actionable descriptions. |
Closing, please reopen if there is more to discuss here. |
We have an opportunity to get better doc search on our site, with the new numpy.org thrown in as well.
It's the search engine pytorch.org uses. If you do a search there and click the Algolia logo at the bottom right, you'll see the deal: Algolia provides pytorch.org with search for free. Algolia is looking to work with sites like ours -- sites hosting open-source docs.
Better site search would help people get the right content for the right NumPy version, rather than spinning the wheel with Google.
Does anyone object if I get started with this? They'll crawl our site and get back to us with an API key.
I'll integrate it; it's been done with Sphinx before.