Migrate search autocomplete from Algolia to Typesense #33

Closed
10 tasks done
Krinkle opened this issue Sep 16, 2023 · 1 comment

@Krinkle
Member

Krinkle commented Sep 16, 2023

Background as documented previously:

As of 2021, we're exploring an open-source solution that we can support within the free software ecosystem. In doing so we will increase security and availability (by reducing client-side dependence on third-party domains), and lower our privacy budget.

We first evaluated Meilisearch and experienced some suboptimal aspects. These included: difficult upgrades (not yet committing to forward compatibility or automatic in-place upgrades), opt-out telemetry instead of opt-in, no official Debian packages, non-trivial interactive setup, missing support for querying multiple indexes (e.g. qunitjs.com and api.qunitjs.com), and a not yet clear future in terms of business model (Meilisearch Cloud was not yet in the picture, and the backend is not GPL licensed).

In mid-2022, the experiment transitioned to focus on Typesense instead.


Since April 2023 we have had an instance of Typesense running in the new infra, provisioned through this repository (558de96). I also developed a 2kB minimalistic HTML-first client and user interface for it at https://github.com/jquery/typesense-minibar and integrated it with our Jekyll theme at https://github.com/qunitjs/jekyll-theme-amethyst/. This has been live on https://qunitjs.com/ for the past few months.

Next, we need to migrate the remaining doc sites which are still using the (now stale and deprecated) Algolia DocSearch indexes:

@Krinkle Krinkle self-assigned this Sep 16, 2023
@Krinkle Krinkle added the Service: Search Typesense, previously Algolia. label Sep 16, 2023
Krinkle added a commit to jquery/jquery-wp-content that referenced this issue Sep 18, 2023
We never enabled it on the new infra for doc sites, and it was also
never enabled on the blogs.

The default search seems good enough on WordPress 6, plus for the
most important sites (jquery.com, jqueryui.com) we use
Typesense/Algolia

Ref jquery/infrastructure-puppet#17.
Ref jquery/infrastructure-puppet#33.
Krinkle added a commit to jquery/jqueryui.com that referenced this issue Oct 11, 2023
Krinkle added a commit to jquery/jqueryui.com that referenced this issue Oct 11, 2023
Krinkle added a commit to jquery/jqueryui.com that referenced this issue Oct 11, 2023
Krinkle added a commit to jquery/api.jquery.com that referenced this issue Oct 11, 2023
Fix a long-standing bug that also affected Algolia previously,
where excerpts of API pages all start with "Description:".

Ref jquery/infrastructure-puppet#33
Krinkle added a commit to jquery/jquery-wp-content that referenced this issue Oct 11, 2023
These are not specific to algolia-docsearch.

Ref jquery/infrastructure-puppet#33
Krinkle added a commit to jquery/jquery-wp-content that referenced this issue Oct 11, 2023
@Krinkle
Member Author

Krinkle commented Oct 11, 2023

I've drafted WordPress integration at https://github.com/jquery/jquery-wp-content/tree/draft-typesense.

Visually, the input field is slightly taller, and I also made it a bit wider to make up for the increased spacing and so preserve a more balanced feel (I think?).

Search field

[Screenshot: search field on jQuery.com and jQuery UI.com]

Search results

For jQuery Core, the main difference is that we no longer duplicate results from the same page (which is an intentional configuration difference), so that a search like "ajax" won't show 5/5 results of the same page but rather give you 5 different pages to choose from.
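
The deduplication described above can be pictured as grouping hits by page and keeping only the best-ranked hit per page. Below is a minimal Python sketch of that idea. It assumes hits arrive ordered by relevance and carry a hypothetical `url` key that may include an `#anchor`. It illustrates the behaviour only and is not the configuration or code used by Typesense or these sites.

```python
from urllib.parse import urldefrag


def one_hit_per_page(hits, limit=5):
    """Keep the first (best-ranked) hit for each page, up to `limit` pages.

    `hits` is assumed to be a relevance-ordered list of dicts with a
    hypothetical "url" key. The fragment (#anchor) is ignored when
    deciding whether two hits belong to the same page.
    """
    seen = set()
    result = []
    for hit in hits:
        page, _fragment = urldefrag(hit["url"])
        if page in seen:
            continue  # a better-ranked hit from this page was already kept
        seen.add(page)
        result.append(hit)
        if len(result) == limit:
            break
    return result
```

With this kind of grouping, several matches inside one long API page collapse into a single suggestion, leaving room for other pages in the list.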

For jQuery UI, the results are more or less the same. The improvements I made are that it now follows local brand colors (only requires 2 CSS variables!), and that the prevalent link words and QuickNav Examples words are gone from all results, by adding `.icon-link.toc-link` and `#quick-nav header h2` to the scraper's `selectors_exclude` config.

There are also various small match and ranking improvements based on special characters. And of course, most significantly, Typesense is freely-licensed open source software, the whole minibar widget is only 2kB (compared to ~100 kB), and it is served without privacy-leaking third-party requests.

[Before/After comparison: four pairs of screenshots, taken 2023-10-10]

Krinkle added a commit to jquery/jquery-wp-content that referenced this issue Oct 11, 2023
These are not specific to algolia-docsearch.

Ref jquery/infrastructure-puppet#33
Krinkle added a commit to jquery/jqueryui.com that referenced this issue Oct 11, 2023
Krinkle added a commit to jquery/jquerymobile.com that referenced this issue Oct 11, 2023
Krinkle added a commit to jquery/jquerymobile.com that referenced this issue Oct 12, 2023
Krinkle added a commit to jquery/jqueryui.com that referenced this issue Oct 12, 2023
Krinkle added a commit to jquery/jquerymobile.com that referenced this issue Oct 12, 2023
This includes the sitemap so that we're sure no content is missed.

Unlike api.jquery.com, api.jquerymobile.com does not start with an
index that links to all content pages. This means the crawler would
have to rely on category pages to discover all content, except
we don't want the crawler to index /category/ pages, so they are
matched by stop_urls, which means they are never crawled.

If there was a variant of `stop_urls` that behaved like `noindex,follow`
instead of `noindex,nofollow` we could use that, but I'm not aware of
such a feature. The sitemap accomplishes the same thing in a more
efficient manner.

Ref jquery/infrastructure-puppet#33
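
The discovery problem described in this commit message can be reproduced with a toy crawler. The sketch below is a simplified model, not the scraper's actual code. The `crawl` function, the miniature `site` link graph, and its paths are all hypothetical. It assumes a URL matching a stop pattern is skipped entirely, so its outgoing links are never seen.

```python
import re
from collections import deque


def crawl(links, start_urls, stop_patterns):
    """Return the set of pages a simple breadth-first crawler would index.

    `links` maps each URL to the URLs it links to. A URL matching any
    pattern in `stop_patterns` is skipped entirely: it is not indexed and
    its outgoing links are never followed.
    """
    stop = [re.compile(p) for p in stop_patterns]
    indexed = set()
    queue = deque(start_urls)
    while queue:
        url = queue.popleft()
        if url in indexed or any(p.search(url) for p in stop):
            continue
        indexed.add(url)
        queue.extend(links.get(url, []))
    return indexed


# Hypothetical miniature site: content is only linked from a category page.
site = {
    "/": ["/category/widgets/"],
    "/category/widgets/": ["/button/", "/slider/"],
    "/button/": [],
    "/slider/": [],
}

# Starting from the home page alone, the content pages are never discovered.
without_sitemap = crawl(site, ["/"], [r"/category/"])

# Seeding the crawl with URLs listed in a sitemap reaches them directly.
with_sitemap = crawl(site, ["/", "/button/", "/slider/"], [r"/category/"])
```

In this model the sitemap acts as an extra list of start URLs, which is why it sidesteps the excluded category pages without indexing them.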
Krinkle added a commit to jquery/jquery-wp-content that referenced this issue Oct 12, 2023
https://github.com/jquery/jquerymobile.com/actions/runs/6490030580/job/17625221359

```
DEBUG:scrapy.core.engine:Crawled (200) <GET https://api.jquerymobile.com/wp-sitemap.xml> (referer: None)
ERROR:scrapy.core.scraper:Spider error processing <GET https://api.jquerymobile.com/wp-sitemap.xml> (referer: None)
Traceback (most recent call last):
  …
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/scrapy/spiders/sitemap.py"
  in _parse_sitemap
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/scrapy/http/request/__init__.py"
  in self._set_url(url)
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/scrapy/http/request/__init__.py"
  in _set_url
ValueError: Missing scheme in request url: //api.jquerymobile.com/wp-sitemap-posts-post-1.xml
2023-10-12 01:21:37 [scrapy.core.scraper] ERROR: Spider error processing <GET https://api.jquerymobile.com/wp-sitemap.xml> (referer: None)
```

Ref jquery/infrastructure-puppet#33
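
The `ValueError` in the log comes from a protocol-relative URL (`//host/path`) listed in the sitemap index. The Python sketch below shows why such a URL has no scheme of its own, and one generic way to resolve it against the document that listed it. It uses only the standard library and the two URLs from the log. It is an illustration and not necessarily the change made in this commit.

```python
from urllib.parse import urljoin, urlparse

# Both URLs are taken from the log above.
sitemap_url = "https://api.jquerymobile.com/wp-sitemap.xml"
loc = "//api.jquerymobile.com/wp-sitemap-posts-post-1.xml"

# A protocol-relative URL names a host but carries no scheme of its own.
has_scheme = bool(urlparse(loc).scheme)

# Resolving it against the document it came from borrows that document's scheme.
resolved = urljoin(sitemap_url, loc)
```

Here `has_scheme` is false for the raw value, while `resolved` is an absolute `https://` URL that a request can be built from.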
Krinkle added a commit to jquery/jquerymobile.com that referenced this issue Oct 12, 2023
Krinkle added a commit to jquery/grunt-jquery-content that referenced this issue Oct 12, 2023
Currently, the heading IDs are not actually on the heading
elements. This means that scrapers such as Algolia and
Typesense often link text excerpts under a heading to
something far away like `#content` instead of the nearest
preceding heading.

For semantics, and to benefit search suggestions, I think the
heading IDs are better suited on headings.

Ref jquery/infrastructure-puppet#33.
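
To make the effect concrete, here is a toy Python model of how a scraper might choose the anchor for a text excerpt. It is an assumption for illustration, not the logic of Algolia's or Typesense's scraper. The model uses a heading's own `id` and otherwise falls back to the closest enclosing element that has one. The `AnchorModel` class and both HTML snippets are hypothetical.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class AnchorModel(HTMLParser):
    """Toy model: pair each paragraph with the anchor a scraper might pick.

    Assumes well-formed markup without void elements such as <br> or <img>.
    """

    def __init__(self):
        super().__init__()
        self.open_ids = []   # id (or None) of each currently open element
        self.anchor = None   # anchor chosen at the most recent heading
        self.in_p = False
        self.results = []    # (paragraph text, anchor) pairs

    def handle_starttag(self, tag, attrs):
        el_id = dict(attrs).get("id")
        if tag in HEADINGS:
            # Prefer the heading's own id; otherwise fall back to the
            # closest enclosing element that has an id.
            enclosing = [i for i in self.open_ids if i]
            self.anchor = el_id or (enclosing[-1] if enclosing else None)
        self.open_ids.append(el_id)
        if tag == "p":
            self.in_p = True

    def handle_endtag(self, tag):
        if self.open_ids:
            self.open_ids.pop()
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p and data.strip():
            self.results.append((data.strip(), self.anchor))


def anchors(html):
    model = AnchorModel()
    model.feed(html)
    model.close()
    return model.results


# Hypothetical markup: the id sits on a child of the heading, not the heading.
before = '<div id="content"><h2><span id="usage"></span>Usage</h2><p>Call it.</p></div>'
# Hypothetical markup: the id sits on the heading element itself.
after = '<div id="content"><h2 id="usage">Usage</h2><p>Call it.</p></div>'
```

Under these assumptions, `anchors(before)` pairs the paragraph with `content`, while `anchors(after)` pairs it with `usage`.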
Krinkle referenced this issue in jquery/api.jquery.com Oct 13, 2023
… pages

Override the default from
https://github.com/typesense/typesense-docsearch-scraper/blob/0.6.0/scraper/src/typesense_helper.py#L58

> 'token_separators': ['_', '-']

This should make it so that "jQuery.ajax" is tokenised as "jquery ajax"
instead of "jqueryajax".

Ref typesense/typesense-docsearch-scraper#40.
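
The effect of the separator list can be approximated with a small Python model. This is a simplified sketch, not Typesense's tokenizer: it lowercases, splits on whitespace and the given separators, and drops any other non-alphanumeric ASCII characters. It also assumes the override adds `.` to the list, since the commit message does not show the new value.

```python
import re


def tokenize(text, token_separators):
    """Simplified model of separator-based tokenisation.

    Splits on whitespace and on each character in `token_separators`,
    then removes any remaining characters other than a-z and 0-9.
    """
    split_chars = "".join(re.escape(c) for c in token_separators)
    parts = re.split(rf"[\s{split_chars}]+", text.lower())
    tokens = [re.sub(r"[^0-9a-z]", "", part) for part in parts]
    return [t for t in tokens if t]


# Scraper default quoted above: the dot is not a separator, so it is dropped.
default_tokens = tokenize("jQuery.ajax", ["_", "-"])

# Assumed override: treating "." as a separator splits the name in two.
override_tokens = tokenize("jQuery.ajax", ["_", "-", "."])
```

In this model the default yields the single token `jqueryajax`, and the assumed override yields `jquery` and `ajax`, so a query for "ajax" can match the method name.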
Krinkle added a commit to jquery/jquery-wp-content that referenced this issue Oct 13, 2023
To update this checked-in dependency in the future, change the
number in composer.json and run `composer deps`.

Ref jquery/infrastructure-puppet#33.
Krinkle added a commit to jquery/jquery-wp-content that referenced this issue Oct 13, 2023
E.g.:

- https://blog.jquery.com/
- https://learn.jquery.com/
- https://jquery.org/team/

These now use the typesense-minibar HTML appearance but without the
data attributes and JS payload to hydrate them, keeping the same
no-js behaviour as before, based on WordPress search.

Also:

* Remove `input:focus` override to improve accessibility.
  It didn't look very good on the previous design but seems fine
  with typesense-minibar and matches how typesense-minibar is used
  in its own demo.

* Fix order of stylesheets and simplify selectors accordingly.
  Previously I was fighting specificity because our overrides were
  applied *before* typesense-minibar.css. The new order allows various
  selectors to be simplified.

Ref jquery/infrastructure-puppet#33
mgol pushed a commit to jquery/grunt-jquery-content that referenced this issue Oct 30, 2023
Currently, the heading IDs are not actually on the heading
elements. This means that scrapers such as Algolia and
Typesense often link text excerpts under a heading to
something far away like `#content` instead of the nearest
preceding heading.

For semantics, and to benefit search suggestions, I think the
heading IDs are better suited on headings.

Closes gh-90
Ref jquery/infrastructure-puppet#33
Krinkle added a commit to jquery/jquerymobile.com that referenced this issue Oct 31, 2023
This includes the sitemap so that we're sure no content is missed.

Unlike api.jquery.com, api.jquerymobile.com does not start with an
index that links to all content pages. This means the crawler would
have to rely on category pages to discover all content, except
we don't want the crawler to index /category/ pages, so they are
matched by stop_urls, which means they are never crawled.

If there was a variant of `stop_urls` that behaved like `noindex,follow`
instead of `noindex,nofollow` we could use that, but I'm not aware of
such a feature. The sitemap accomplishes the same thing in a more
efficient manner.

Ref jquery/infrastructure-puppet#33
@Krinkle Krinkle closed this as completed Oct 31, 2023