Migrate search autocomplete from Algolia to Typesense #33

Krinkle · 2023-09-16T06:58:26Z

Background as documented previously:

As of 2021, we're exploring an open-source solution that we can support within the free software ecosystem. In doing so we will increase security and availability (by reducing client-side dependence on third-party domains), and lower our privacy budget.

We first evaluated Meilisearch and experienced some suboptimal aspects. These included: difficult upgrades (not yet committing to forward compatibility or automatic in-place upgrades), opt-out telemetry instead of opt-in, no official Debian packages, non-trivial interactive setup, missing support for querying multiple indexes (e.g. qunitjs.com and api.qunitjs.com), and a not yet clear future in terms of business model (Meilisearch Cloud was not yet in the picture, and the backend is not GPL licensed).

In mid-2022, the experiment transitioned to focus on Typesense instead.

Since April 2023 we have an instance of Typesense running in the new infra, provisioned through this repository (558de96). I also developed a 2kB minimalistic HTML-first client and user interface for it at https://github.com/jquery/typesense-minibar and integrated it with our Jekyll theme at https://github.com/qunitjs/jekyll-theme-amethyst/. This has been live on https://qunitjs.com/ for the past few months.

Next, we need to migrate the remaining doc sites which are still using the (now stale and deprecated) Algolia DocSearch indexes:


We never enabled it on the new infra for doc sites, and it was also never enabled on the blogs. The default search seems good enough on WordPress 6, plus for the most important sites (jquery.com, jqueryui.com) we use Typesense/Algolia. Ref jquery/infrastructure-puppet#17. Ref jquery/infrastructure-puppet#33.


Fix a long-standing bug that also affected Algolia previously, where excerpts of API pages all start with "Description:". Ref jquery/infrastructure-puppet#33

These are not specific to algolia-docsearch. Ref jquery/infrastructure-puppet#33


Krinkle · 2023-10-11T03:50:20Z

I've drafted WordPress integration at https://github.com/jquery/jquery-wp-content/tree/draft-typesense.

Visually, the input field is slightly taller, and I also made it a bit wider to make up for the increased spacing, so as to preserve a more balanced feel (I think?).

Search field

Screenshot jQuery .com and jQuery UI .com

Search results

For jQuery Core, the main difference is that we no longer duplicate results from the same page (which is an intentional configuration difference), so that a search like "ajax" won't show 5/5 results of the same page but rather give you 5 different pages to choose from.
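A minimal sketch of the "one result per page" behaviour described above, assuming each hit carries a `url` whose fragment points at a section within the page. The hit shape, the fragment names, and the sample URLs are hypothetical; this is plain Python for illustration and not how Typesense or the scraper implements it.

```python
from urllib.parse import urldefrag


def one_hit_per_page(hits, limit=5):
    """Keep the highest-ranked hit for each page, up to `limit` pages.

    `hits` is assumed to be ordered best-first. Each hit is a dict with a
    "url" key (hypothetical shape, for illustration only).
    """
    seen_pages = set()
    results = []
    for hit in hits:
        # Strip the "#fragment" so all sections of a page share one key.
        page, _fragment = urldefrag(hit["url"])
        if page in seen_pages:
            continue
        seen_pages.add(page)
        results.append(hit)
        if len(results) == limit:
            break
    return results


# Illustrative hits for a query like "ajax" (URLs and anchors are made up).
hits = [
    {"url": "https://api.jquery.com/jQuery.ajax/#section-a"},
    {"url": "https://api.jquery.com/jQuery.ajax/#section-b"},
    {"url": "https://api.jquery.com/ajaxStart/"},
    {"url": "https://api.jquery.com/jQuery.ajax/"},
    {"url": "https://api.jquery.com/ajaxComplete/#section-c"},
]
deduplicated = [hit["url"] for hit in one_hit_per_page(hits)]
print(deduplicated)
```

Only the first hit for each page survives, so the remaining slots go to other pages instead of further sections of the same one.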

For jQuery UI, the results are more or less the same. The improvements I made are that it now follows local brand colors (only requires 2 CSS variables!), and that the prevalent "link" words and QuickNav "Examples" words are gone from all results, by adding .icon-link.toc-link and #quick-nav header h2 to the scraper's selectors_exclude config.

There are also various small match and ranking improvements based on special characters. And of course, most significantly, Typesense is freely-licensed open source software, the whole minibar widget is only 2kB (compared to ~100 kB), and it is served without privacy-leaking third-party requests.

Before	After


This includes the sitemap so that we're sure no content is missed. Unlike api.jquery.com, api.jquerymobile.com does not start with an index that links to all content pages. This means the crawler would have to rely on category pages to discover all content, except we don't want the crawler to index /category/ pages, so they are matched by stop_urls, which means they are never crawled. If there was a variant of `stop_urls` that behaved like `follow,noindex` instead of `nofollow,noindex` we could use that, but I'm not aware of such a feature. The sitemap accomplishes the same thing in a more efficient manner. Ref jquery/infrastructure-puppet#33
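A minimal sketch of the discovery problem described above, under stated assumptions: the `/category/` pattern, the link graph, and the page URLs are hypothetical, and the crawl loop is a simplified model rather than the scraper's actual logic. It shows that when category pages are stopped, content pages only reachable through them are never found unless they are seeded some other way, such as from a sitemap.

```python
import re

# Assumed stop pattern for illustration; the real config may differ.
STOP_URLS = [r"/category/"]


def is_stopped(url):
    """True if the URL matches any stop pattern (never fetched, never indexed)."""
    return any(re.search(pattern, url) for pattern in STOP_URLS)


def crawl(start_url, links, sitemap_urls=()):
    """Return the set of pages a crawler would index.

    `links` maps a page URL to the URLs it links to (hypothetical graph).
    URLs listed in `sitemap_urls` are queued up front alongside the start URL.
    """
    queue = [start_url, *sitemap_urls]
    indexed = set()
    while queue:
        url = queue.pop()
        if url in indexed or is_stopped(url):
            continue
        indexed.add(url)
        queue.extend(links.get(url, []))
    return indexed


# Hypothetical site: the home page links only to a category page,
# and the category page links to the actual content page.
HOME = "https://api.jquerymobile.com/"
CATEGORY = "https://api.jquerymobile.com/category/widgets/"
CONTENT = "https://api.jquerymobile.com/button/"
LINKS = {HOME: [CATEGORY], CATEGORY: [CONTENT]}

without_sitemap = crawl(HOME, LINKS)
with_sitemap = crawl(HOME, LINKS, sitemap_urls=[CONTENT])
print(sorted(without_sitemap))
print(sorted(with_sitemap))
```

In this model the content page is only indexed in the second run, because the stopped category page is the sole path to it in the link graph.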

https://github.com/jquery/jquerymobile.com/actions/runs/6490030580/job/17625221359

```
DEBUG:scrapy.core.engine:Crawled (200) <GET https://api.jquerymobile.com/wp-sitemap.xml> (referer: None)
ERROR:scrapy.core.scraper:Spider error processing <GET https://api.jquerymobile.com/wp-sitemap.xml> (referer: None)
Traceback (most recent call last):
…
File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/scrapy/spiders/sitemap.py" in _parse_sitemap
File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/scrapy/http/request/__init__.py" in self._set_url(url)
File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/scrapy/http/request/__init__.py" in _set_url
ValueError: Missing scheme in request url: //api.jquerymobile.com/wp-sitemap-posts-post-1.xml
2023-10-12 01:21:37 [scrapy.core.scraper] ERROR: Spider error processing <GET https://api.jquerymobile.com/wp-sitemap.xml> (referer: None)
```

Ref jquery/infrastructure-puppet#33
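The `ValueError` in the log is raised for a scheme-relative URL (`//api.jquerymobile.com/…`) listed in the sitemap index. As a hedged illustration only (the thread does not say how this was resolved), the sketch below shows how such a URL can be made absolute by resolving it against the URL of the sitemap that listed it, using the standard library. The helper name `absolutize` is hypothetical.

```python
from urllib.parse import urljoin


def absolutize(sitemap_url, loc):
    """Resolve a sitemap <loc> value against the sitemap's own URL.

    A scheme-relative value such as "//host/path" inherits the scheme of
    `sitemap_url`; values that are already absolute are returned unchanged.
    """
    return urljoin(sitemap_url, loc)


sitemap_url = "https://api.jquerymobile.com/wp-sitemap.xml"
loc = "//api.jquerymobile.com/wp-sitemap-posts-post-1.xml"
fixed = absolutize(sitemap_url, loc)
print(fixed)  # https://api.jquerymobile.com/wp-sitemap-posts-post-1.xml
```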

Found by Typesense crawler.

```
DEBUG:scrapy.core.engine:Crawled (404) <GET https://jquerymobile.com/changelog/1.3.2/href=%22https:/github.com/jquery/jquery-mobile/issues/5974> (referer: https://jquerymobile.com/changelog/1.3.2/)
```

Ref jquery/infrastructure-puppet#33

Currently, the heading IDs are not actually on the heading elements. This means that scrapers such as Algolia and Typesense often link text excerpts under a heading to something far away like `#content` instead of the nearest preceding heading. For semantics, and to benefit search suggestions, I think the heading IDs are better suited on headings. Ref jquery/infrastructure-puppet#33.

… pages

Override the default from https://github.com/typesense/typesense-docsearch-scraper/blob/0.6.0/scraper/src/typesense_helper.py#L58

> 'token_separators': ['_', '-']

This should make it so that "jQuery.ajax" is tokenised as "jquery ajax" instead of "jqueryajax".

Ref typesense/typesense-docsearch-scraper#40.
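A minimal sketch of why the separator list matters, using a deliberately simplified tokenizer model: characters listed as separators split a word into several tokens, and any other non-alphanumeric character is dropped. This is an approximation for illustration, not Typesense's actual tokenizer, and the overridden list that adds `.` is an assumption, since the message above does not show the new value.

```python
import re


def tokenize(text, token_separators):
    """Simplified model: split on separators, drop other special characters."""
    tokens = []
    for chunk in text.lower().split():
        if token_separators:
            # Build a character class such as [_\-\.] from the separator list.
            pattern = "[" + re.escape("".join(token_separators)) + "]"
            parts = re.split(pattern, chunk)
        else:
            parts = [chunk]
        for part in parts:
            cleaned = "".join(ch for ch in part if ch.isalnum())
            if cleaned:
                tokens.append(cleaned)
    return tokens


# Scraper default quoted above: "." is not a separator, so it is dropped.
default_tokens = tokenize("jQuery.ajax", ["_", "-"])
# Assumed override that also treats "." as a separator.
override_tokens = tokenize("jQuery.ajax", ["_", "-", "."])
print(default_tokens)   # ['jqueryajax']
print(override_tokens)  # ['jquery', 'ajax']
```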

To update this checked in dependency in the future, change the number in composer.json and run `composer deps`. Ref jquery/infrastructure-puppet#33.

E.g.:

- https://blog.jquery.com/
- https://learn.jquery.com/
- https://jquery.org/team/

These now use the typesense-minibar HTML appearance but without the data attributes and JS payload to hydrate them, keeping the same no-js behaviour as before, based on WordPress search.

Also:

* Remove `input:focus` override to improve accessibility. It didn't look very good on the previous design but seems fine with typesense-minibar and matches how typesense-minibar is used in its own demo.
* Fix order of stylesheets and simplify selectors accordingly. Previously I was fighting specificity because our overrides were applied *before* typesense-minibar.css was applied. This allows various selectors to be simplified.

Ref jquery/infrastructure-puppet#33

Currently, the heading IDs are not actually on the heading elements. This means that scrapers such as Algolia and Typesense often link text excerpts under a heading to something far away like `#content` instead of the nearest preceding heading. For semantics, and to benefit search suggestions, I think the heading IDs are better suited on headings. Closes gh-90 Ref jquery/infrastructure-puppet#33


Krinkle self-assigned this Sep 16, 2023

Krinkle added the Service: Search Typesense, previously Algolia. label Sep 16, 2023

Krinkle added a commit to jquery/jqueryui.com that referenced this issue Oct 11, 2023

Build: Enable typesense scraper

1a3cfe3

Ref jquery/infrastructure-puppet#33

Krinkle added a commit to jquery/jqueryui.com that referenced this issue Oct 11, 2023

Build: Enable typesense scraper

c80fa31

Ref jquery/infrastructure-puppet#33

Krinkle added a commit to jquery/jqueryui.com that referenced this issue Oct 11, 2023

Build: Enable typesense scraper

396e608

Ref jquery/infrastructure-puppet#33

Krinkle added a commit to jquery/jquery-wp-content that referenced this issue Oct 11, 2023

All: Move searchform styles to base.css

d0b0497

These are not specific to algolia-docsearch. Ref jquery/infrastructure-puppet#33

Krinkle added a commit to jquery/jquery-wp-content that referenced this issue Oct 11, 2023

[WIP] Implement typesense-minibar integration

421906f

Ref jquery/infrastructure-puppet#33.

Krinkle added a commit to jquery/jquery-wp-content that referenced this issue Oct 11, 2023

All: Move searchform styles to base.css

78b4979

These are not specific to algolia-docsearch. Ref jquery/infrastructure-puppet#33

Krinkle added a commit to jquery/jqueryui.com that referenced this issue Oct 11, 2023

Build: Enable typesense scraper

95e538f

Ref jquery/infrastructure-puppet#33

Krinkle added a commit to jquery/jquerymobile.com that referenced this issue Oct 11, 2023

Build: Enable typesense scraper

74a53e1

Ref jquery/infrastructure-puppet#33

Krinkle added a commit to jquery/jquerymobile.com that referenced this issue Oct 12, 2023

Build: Enable typesense scraper

1d7ad24

Ref jquery/infrastructure-puppet#33

Krinkle added a commit to jquery/jqueryui.com that referenced this issue Oct 12, 2023

Build: Enable typesense scraper

6cf42a4

Ref jquery/infrastructure-puppet#33

Krinkle mentioned this issue Oct 12, 2023

Util: Move toc ID in parseMarkdown from icon to heading jquery/grunt-jquery-content#90

Merged

Krinkle mentioned this issue Oct 13, 2023

Upgrade TypeSense from 0.24 to 0.25.1 #36

Open

3 tasks

Krinkle closed this as completed Oct 31, 2023
