utf-8 search support #826

Closed
coffeetalkh opened this Issue Feb 16, 2016 · 11 comments

@coffeetalkh

coffeetalkh commented Feb 16, 2016

Hey, we are working on a Persian version of a mkdocs theme. Everything works fine, but search does not work with Persian content: when I search for an expression, nothing is found. I also tested it with Arabic, Chinese and Hebrew, and all of them have the same problem. Is there any solution?

@coffeetalkh coffeetalkh changed the title from Search utf-8 support to utf-8 search support Feb 16, 2016

@d0ugal

Member

d0ugal commented Feb 16, 2016

Unfortunately, search won't work well with non-Latin character sets at the moment. We use lunr.js to provide the search functionality. I believe it can work with them, but I am not sure what is involved in making that happen.

It is possible for themes to override and completely replace the search implementation. To do this, a theme would need to at least replace this file (or be modified so that the file isn't loaded) by including a JavaScript file in the theme with the path mkdocs/js/search.js. However, as I said at the top, I have no experience in making search work for different character sets, so I don't know how difficult it will be.

We could also consider adding a setting to disable search. While this isn't ideal, it would be better to turn search off than have it in a broken state.

I don't think there is a good, easy solution for this at the moment. 😞

@waylan

Member

waylan commented Mar 31, 2016

There is also some related discussion in #859. Apparently, non-English search indexes are extremely large because lunr.js only "excludes" insignificant English words from the index by default. MihaiValentin/lunr-languages appears to be an add-on to lunr.js which adds support for various other languages, including Japanese and Russian (neither of which uses the Latin alphabet). Unfortunately, at this time, Persian, Arabic, Chinese and Hebrew are not on the list of supported languages, although there are currently outstanding requests for Arabic and Chinese to be added. If you would like to have support added for a certain language, I would suggest starting there. If and when mkdocs adds language support, your language will work if it's supported by lunr.js.

@netroby

This comment has been minimized.

@waylan

Member

waylan commented Sep 29, 2017

@netroby, that is some old legacy (now deprecated) code which will be removed in the next major release. It has nothing to do with the current search implementation or any future updates to search.

@waylan

Member

waylan commented Feb 1, 2018

In a brief discussion about how to pass config settings to lunr.js, @squidfunk pointed out that the Material (third party) theme passes the values in via meta tags in the template (<meta name="lang:search.language" content="en">). However, it occurred to me that the document still has <html lang="en"> at the top. Obviously, if the documents contain something other than English content (requiring a search lang other than en), then perhaps the html tag should indicate that. In fact, Declaring language in HTML seems to indicate as much.

Therefore, I'm thinking that we approach this in two ways:

  1. Create a global lang config setting which is used as <html lang="{{ config.lang }}"> in the template. Search could then use this as a default language to use if one is not otherwise specified.
  2. Create a search plugin specific lang setting which can contain multiple language codes. Lunr supports indexing multiple languages and multiple languages can exist on one page. However, only one language gets defined in the html tag (see Declaring language in HTML). The plugin specific setting should include all possible languages for that site that will be indexed (see Multi-Language Content in Lunr docs).

Finally, rather than using the meta tags to pass the plugin specific settings, I'm inclined to include them in the JSON index file as so:

{ "docs": [...], "config": { "lang": ["en", "es"], "otherkey": "value" } }

Of course, any third party theme can ignore those. For example, Material would continue to work without changes (except to use config.plugins.search.lang rather than config.extra.search.lang as the source of its data to populate the meta tag).
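
As an illustration only, a client script could read the proposed `config` key from the parsed index roughly as follows. This is a minimal sketch assuming the payload shape shown above; the helper name `getSearchLangs` and the fallback to `['en']` are assumptions made for the example, not MkDocs code.

```javascript
// Hypothetical helper: extract the list of search languages from the
// parsed JSON index, falling back to English when none is configured.
function getSearchLangs(index) {
  var lang = index && index.config && index.config.lang;
  if (typeof lang === 'string') {
    lang = [lang]; // tolerate a single code given as a plain string
  }
  if (!Array.isArray(lang) || lang.length === 0) {
    return ['en']; // assumed default for this sketch
  }
  return lang;
}

// Usage: an index with two configured languages, and one with no config.
var payload = JSON.parse('{"docs": [], "config": {"lang": ["en", "es"]}}');
console.log(getSearchLangs(payload));      // the two configured codes
console.log(getSearchLangs({ docs: [] })); // falls back to English
```

A theme that ignores the `config` key is unaffected, since the helper only reads from the payload and never modifies it.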

@waylan

Member

waylan commented Feb 1, 2018

I just realized that having a global lang config setting (which sets <html lang="en">) means we also need to have Internationalization support for hardcoded text in templates (see #211). And I'm unsure how that will work with pages in different languages (see #774). I think at this time it makes sense to implement the search config setting (part 2 above) and leave the global setting to be implemented as part of #211 and/or #774.

In the meantime, the search config setting plugins.search.lang should probably default to ['en'], which is consistent with the current behavior.

Or perhaps all multi-language support matters should be addressed together, in which case this issue would be delayed to be addressed with #211 and #774 rather than as part of the search refactor.

@waylan

Member

waylan commented Feb 2, 2018

I almost have language support working. I was able to get it working within a worker in 6034dc2. I tried to adapt that to also run in the main thread in db1a84a as a fallback for browsers which don't support workers. Unsurprisingly, it fails, as the scripts are not finished loading before the code attempts to run. I understand how to use a callback to avoid that sort of problem, but it is not obvious to me how to make that work here, where multiple scripts could potentially be loaded and those calls are all nested within various separate conditionals. JavaScript's async nature is my biggest weakness.

@yeraydiazdiaz I started with your code. Any suggestions from you or anyone?

@waylan

Member

waylan commented Feb 2, 2018

For the record, you can see all of my recent search related work (including the code this builds on) at master...waylan:new-search.

@yeraydiazdiaz

Contributor

yeraydiazdiaz commented Feb 3, 2018

Nice job adding language support.

I tweaked your version slightly to have a separate process where we first load the JSON, then resolve the Lunr files to fetch, grab them asynchronously and finally activate the search. Seems to work well with and without Web Worker support.
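
For anyone following along, the "resolve the Lunr files to fetch" step might look roughly like the sketch below. The file names follow the naming used by the lunr-languages add-on (`lunr.stemmer.support.js`, `lunr.multi.js`, `lunr.<code>.js`); the function name `resolveLunrScripts` and the exact ordering are assumptions for illustration, not the code in the branch.

```javascript
// Hypothetical helper: given the configured language codes, list the
// scripts that must be loaded (in order) before the index can be built.
function resolveLunrScripts(langs) {
  var scripts = ['lunr.js'];
  // English is built into lunr.js, so only other codes need extra files.
  var nonEnglish = langs.filter(function (code) { return code !== 'en'; });
  if (nonEnglish.length > 0) {
    // Language plugins depend on the shared stemmer support file.
    scripts.push('lunr.stemmer.support.js');
    if (langs.length > 1) {
      // Indexing more than one language needs the multi-language plugin.
      scripts.push('lunr.multi.js');
    }
    nonEnglish.forEach(function (code) {
      scripts.push('lunr.' + code + '.js');
    });
  }
  return scripts;
}

// Usage: an English-only site loads just lunr.js; adding Spanish pulls in
// the stemmer support, the multi-language plugin and the Spanish plugin.
console.log(resolveLunrScripts(['en']));
console.log(resolveLunrScripts(['en', 'es']));
```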

Give it a go and let me know what you think.

Edit: Of course, as I posted this I found an error. It seems the multi-language support scripts require Lunr to be loaded before they are executed. I'll see if I can work around that.

@yeraydiazdiaz

Contributor

yeraydiazdiaz commented Feb 3, 2018

I managed to solve the issue by abusing $.getScript callbacks. It's pretty awful, but I guess it beats using require.js just for old browser support.

Here are the full changes against your branch: waylan:new-search...yeraydiazdiaz:new-search.
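
For readers unfamiliar with the pattern, chaining load callbacks so that each script starts only after the previous one has finished can be sketched as below. This is not the code from the branch: `loadScriptsInOrder` is a hypothetical name, and the `loadScript(url, callback)` argument stands in for a loader such as jQuery's `$.getScript(url, success)`.

```javascript
// Load each URL only after the previous one has finished, then call done().
// loadScript is any function of the form (url, callback); it is injected so
// the sketch does not depend on a browser or on jQuery being present.
function loadScriptsInOrder(urls, loadScript, done) {
  function next(i) {
    if (i >= urls.length) {
      done(); // every script has been loaded, safe to initialise search
      return;
    }
    loadScript(urls[i], function () {
      next(i + 1); // continue with the following script
    });
  }
  next(0);
}

// Usage with a stand-in loader that records the order of requests and
// completes asynchronously, as a real network loader would.
var loaded = [];
function fakeLoader(url, callback) {
  loaded.push(url);
  setTimeout(callback, 0);
}
loadScriptsInOrder(
  ['lunr.js', 'lunr.stemmer.support.js', 'lunr.es.js'],
  fakeLoader,
  function () { console.log('ready:', loaded.join(', ')); }
);
```

Because the next request is only issued from inside the previous callback, a plugin script never runs before the script it depends on, at the cost of loading the files one at a time rather than in parallel.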

@waylan

Member

waylan commented Feb 3, 2018

I considered recursively loading each script but it wasn't obvious to me how to make it work. Nicely done.

@waylan waylan moved this from To Do to In Progress in Refactor search. Feb 5, 2018

@waylan waylan closed this in #1418 Mar 6, 2018

Refactor search. automation moved this from In Progress to Done Mar 6, 2018

waylan added a commit that referenced this issue Mar 6, 2018

Refactor search plugin (#1418)
* Use a web worker in the browser with a fallback (fixes #859 & closes #1396).
* Optionally pre-build search index (fixes #859 & closes #1061).
* Upgrade to lunr.js 2.x (fixes #1319).
* Support search in languages other than English (fixes #826).
* Allow the user to define the word separators (fixes #867).
* Only run searches for queries of length > 2 (fixes #1127).
* Remove dependency on require.js, mustache, etc. (fixes #1218).
* Compress the search index (fixes #1128).