Migrate search autocomplete from Algolia to Typesense #33
Comments
We never enabled it on the new infra for doc sites, and it was also never enabled on the blogs. The default search seems good enough on WordPress 6, plus for the most important sites (jquery.com, jqueryui.com) we use Typesense/Algolia. Ref jquery/infrastructure-puppet#17. Ref jquery/infrastructure-puppet#33.
Fix a long-standing bug that also affected Algolia previously, where excerpts of API pages all start with "Description:". Ref jquery/infrastructure-puppet#33
These are not specific to algolia-docsearch. Ref jquery/infrastructure-puppet#33
I've drafted the WordPress integration at https://github.com/jquery/jquery-wp-content/tree/draft-typesense.

Visually, the input field is slightly taller, and I also made it a bit wider to make up for the increased spacing and preserve a more balanced feel (I think?).

Search field

Search results

For jQuery Core, the main difference is that we no longer duplicate results from the same page (an intentional configuration difference), so a search like "ajax" won't show 5/5 results from the same page but will instead give you 5 different pages to choose from.

For jQuery UI, the results are more or less the same. The improvements I made are that it now follows local brand colors (only requires 2 CSS variables!), and the prevalent …

There are also various small match and ranking improvements based on special characters. And of course, most significantly, Typesense is freely-licensed open source software, the whole minibar widget is only 2kB (compared to ~100 kB), and it is served without privacy-leaking third-party requests.
This includes the sitemap so that we're sure no content is missed. Unlike api.jquery.com, api.jquerymobile.com does not start with an index that links to all content pages. This means the crawler would have to rely on category pages to discover all content, except we don't want the crawler to index /category/ pages, so they are matched by stop_urls, which means they are never crawled. If there were a variant of `stop_urls` that behaved like `follow,noindex` instead of `nofollow,noindex` we could use that, but I'm not aware of such a feature. The sitemap accomplishes the same thing in a more efficient manner. Ref jquery/infrastructure-puppet#33
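The discovery problem described above can be illustrated with a toy crawl. This is only a sketch under stated assumptions: the link graph and page paths (`/button/`, `/pagecreate/`, the category names) are hypothetical, and treating each `stop_urls` entry as a regular expression searched against the URL is a simplification, not a description of how the real scraper is implemented.

```python
import re

# Hypothetical link graph: the home page links only to category pages,
# and only the category pages link to the content pages.
LINKS = {
    "/": ["/category/widgets/", "/category/events/"],
    "/category/widgets/": ["/button/"],
    "/category/events/": ["/pagecreate/"],
    "/button/": [],
    "/pagecreate/": [],
}

# Assumption: a stopped URL is neither fetched nor indexed,
# so the links on that page are never seen.
STOP_URLS = [r"/category/"]


def crawl(seeds):
    """Return the sorted list of pages reached from the given seed URLs."""
    seen, queue = set(), list(seeds)
    while queue:
        url = queue.pop()
        if url in seen or any(re.search(p, url) for p in STOP_URLS):
            continue
        seen.add(url)
        queue.extend(LINKS.get(url, []))
    return sorted(seen)


# Starting from the home page alone, the content pages are never discovered.
without_sitemap = crawl(["/"])

# Seeding the crawl with sitemap entries reaches them directly.
with_sitemap = crawl(["/", "/button/", "/pagecreate/"])

print(without_sitemap)
print(with_sitemap)
```

In this model the sitemap acts as an extra list of seed URLs, which is why the category pages can stay excluded without losing any content pages.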
https://github.com/jquery/jquerymobile.com/actions/runs/6490030580/job/17625221359

```
DEBUG:scrapy.core.engine:Crawled (200) <GET https://api.jquerymobile.com/wp-sitemap.xml> (referer: None)
ERROR:scrapy.core.scraper:Spider error processing <GET https://api.jquerymobile.com/wp-sitemap.xml> (referer: None)
Traceback (most recent call last):
  …
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/scrapy/spiders/sitemap.py" in _parse_sitemap
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/scrapy/http/request/__init__.py" in self._set_url(url)
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/scrapy/http/request/__init__.py" in _set_url
ValueError: Missing scheme in request url: //api.jquerymobile.com/wp-sitemap-posts-post-1.xml
2023-10-12 01:21:37 [scrapy.core.scraper] ERROR: Spider error processing <GET https://api.jquerymobile.com/wp-sitemap.xml> (referer: None)
```

Ref jquery/infrastructure-puppet#33
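The `ValueError` comes from the sitemap listing a protocol-relative URL (starting with `//`), which has a host but no scheme. The sketch below uses only the Python standard library to show why such a URL has no scheme and how it resolves against the sitemap's own URL. It illustrates the failure mode only; it is not the fix that was applied, nor a description of what Scrapy does internally.

```python
from urllib.parse import urljoin, urlparse

# Both URLs are taken from the log output above.
sitemap_url = "https://api.jquerymobile.com/wp-sitemap.xml"
loc = "//api.jquerymobile.com/wp-sitemap-posts-post-1.xml"

parsed = urlparse(loc)
# A protocol-relative URL carries a host but an empty scheme.
print(repr(parsed.scheme), parsed.netloc)

# Resolving it against the URL it was found on inherits that URL's scheme.
resolved = urljoin(sitemap_url, loc)
print(resolved)
```

Emitting absolute URLs in the sitemap avoids the problem at the source, since consumers then do not need to resolve anything.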
Found by Typesense crawler.

```
DEBUG:scrapy.core.engine:Crawled (404) <GET https://jquerymobile.com/changelog/1.3.2/href=%22https:/github.com/jquery/jquery-mobile/issues/5974> (referer: https://jquerymobile.com/changelog/1.3.2/)
```

Ref jquery/infrastructure-puppet#33
Currently, the heading IDs are not actually on the heading elements. This means that scrapers such as Algolia and Typesense often link text excerpts under a heading to something far away like `#content` instead of the nearest preceding heading. For semantics, and to benefit search suggestions, I think the heading IDs are better suited on headings. Ref jquery/infrastructure-puppet#33.
… pages

Override the default from https://github.com/typesense/typesense-docsearch-scraper/blob/0.6.0/scraper/src/typesense_helper.py#L58

> 'token_separators': ['_', '-']

This should make it so that "jQuery.ajax" is tokenised as "jquery ajax" instead of "jqueryajax". Ref typesense/typesense-docsearch-scraper#40.
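To make the effect of `token_separators` concrete, here is a deliberately simplified model of the tokenisation. It is not Typesense's actual tokenizer: it assumes that whitespace and listed separators split tokens, and that any other non-alphanumeric character is dropped so its neighbours are joined. Adding `"."` to the list is likewise an assumption about what the override contains, inferred from the "jQuery.ajax" example.

```python
def tokenize(text, token_separators):
    """Simplified model: separators split tokens, other symbols are dropped."""
    out = []
    for ch in text.lower():
        if ch in token_separators or ch.isspace():
            out.append(" ")  # split point
        elif ch.isalnum():
            out.append(ch)
        # Any other character is dropped, joining the text on either side.
    return "".join(out).split()


# Scraper default quoted above.
default_tokens = tokenize("jQuery.ajax", ["_", "-"])

# Assumed override that also treats "." as a separator.
override_tokens = tokenize("jQuery.ajax", ["_", "-", "."])

print(default_tokens)
print(override_tokens)
```

With the default list, a query for "ajax" has no separate token to match in "jQuery.ajax"; with the dot as a separator it does.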
To update this checked-in dependency in the future, change the number in composer.json and run `composer deps`. Ref jquery/infrastructure-puppet#33.
E.g.:
- https://blog.jquery.com/
- https://learn.jquery.com/
- https://jquery.org/team/

These now use the typesense-minibar HTML appearance but without the data attributes and JS payload to hydrate them, keeping the same no-js behaviour as before, based on WordPress search.

Also:
* Remove `input:focus` override to improve accessibility. It didn't look very good on the previous design but seems fine with typesense-minibar and matches how typesense-minibar is used in its own demo.
* Fix order of stylesheets and simplify selectors accordingly. Previously I was fighting specificity because our overrides were applied *before* typesense-minibar.css was applied. This allows various selectors to be simplified.

Ref jquery/infrastructure-puppet#33
Currently, the heading IDs are not actually on the heading elements. This means that scrapers such as Algolia and Typesense often link text excerpts under a heading to something far away like `#content` instead of the nearest preceding heading. For semantics, and to benefit search suggestions, I think the heading IDs are better suited on headings. Closes gh-90. Ref jquery/infrastructure-puppet#33
Background as documented previously:
Since April 2023 we have had an instance of Typesense running in the new infra, provisioned through this repository (558de96). I also developed a 2kB minimalistic HTML-first client and user interface for it at https://github.com/jquery/typesense-minibar and integrated it with our Jekyll theme at https://github.com/qunitjs/jekyll-theme-amethyst/. This has been live on https://qunitjs.com/ for the past few months.
Next, we need to migrate the remaining doc sites which are still using the (now stale and deprecated) Algolia DocSearch indexes: