
Large search indexes can cause Lunr to freeze the page while it creates the index. #859

Closed
sashadt opened this issue Mar 3, 2016 · 56 comments

@sashadt commented Mar 3, 2016

When using the readthedocs theme, search shows "no results" for a few seconds while the search is being performed, and then the results are loaded. However, a few seconds are enough to convince users that nothing was found, and they move on.

For example, see http://docs.datatorrent.com. Try searching for "test" and you will see the following for a while:

[screenshot: the search results page showing "Sorry, page not found" while the search runs]

Perhaps an indicator of search in progress can be added?

@sashadt sashadt changed the title Search shows "No results" for a few seconds before results are loaded Search shows "Sorry, page not found" for a few seconds before results are loaded Mar 3, 2016
@waylan (Member) commented Mar 3, 2016

Perhaps an indicator of search in progress can be added?

That seems reasonable to me. Presumably the problem occurs due to the large number of pages. On a smaller site, this would not be a problem as the wait time would be much less.

@sashadt (Author) commented Mar 3, 2016

We have 50 files with 18,936 lines of Markdown. I don't know if that's considered large, but even for smaller projects running on slower servers it would provide a better user experience. Thanks for looking into it!

@waylan (Member) commented Mar 4, 2016

It appears that the RTD theme includes the default content "Sorry, page not found." on the search results page (see search.html#L18). However, the search tool will also replace any content with <p>No results found</p> (see search.js#L63) which makes the initial filler redundant. It appears that RTD displays nothing. Should we do the same or display a message to wait for results?

@d0ugal (Member) commented Mar 4, 2016

This is quite interesting. I assumed the delay was the length of time it takes to download the search index (about 0.5 MB). However, there is a long delay after it is downloaded. So is the delay while lunr.js is processing the index? That would mean the slowness depends more on users' machines than on the server. Looking at the Lunr change log, there are a few performance improvements in 0.5.8 and later (we ship 0.5.7). So I think there are two things we should do:

  • Add a loading indicator (this could just be static text saying "Searching...") and remove the incorrect text in ReadTheDocs's search.html
  • Update Lunr.js to 0.6.0
@d0ugal (Member) commented Mar 4, 2016

I just cloned and tested with the DataTorrent docs locally, which confirms the slowness is with the JavaScript as downloading is almost instantaneous.

Even worse, my browser (Chrome) is locked up by the JavaScript for a good few seconds, so I can't actually click anything. I did some very rough measurements to try to find what was making it so slow. My timings below are all rounded because I don't have much confidence in my measuring.

The loading of the JSON into Lunr is the culprit. This for loop takes around 8(!!) seconds. More specifically, it is the call to index.add(doc), so I timed that for each document (the project has 1022, as we add each markdown document and each page header individually to improve the search results). Most of them are very fast (less than 50ms), but two are really slow. The Application Developer Guide takes around 1 second consistently, and the Release Notes take around 4.5 seconds consistently.

I don't have time to keep looking now, but we need to try and determine what makes these so slow. Is it just because these are long pages? Is there something that causes an issue with Lunr? I tested with both Lunr 0.5.7 and 0.6.0 and got similar results. 0.6.0 might actually be a little slower, but I didn't do enough tests to be confident about that.
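Rough per-document timings like the ones above can be collected with a small harness along these lines. This is a sketch, not the actual measurement code: timeIndexAdds is a hypothetical helper, and the stub index below only stands in for a real lunr index.

```javascript
// Time each index.add(doc) call individually to find the slow documents.
// `index` can be anything with an add(doc) method, e.g. a real lunr index.
function timeIndexAdds(index, docs) {
  const timings = [];
  for (const doc of docs) {
    const start = Date.now();
    index.add(doc);
    timings.push({ title: doc.title, ms: Date.now() - start });
  }
  return timings;
}

// Demo with a stub index so the sketch runs stand-alone.
const stubIndex = { add: function (doc) { /* real tokenising/indexing happens here */ } };
const timings = timeIndexAdds(stubIndex, [
  { title: 'Application Developer Guide', text: '...' },
  { title: 'Release Notes', text: '...' },
]);
// Flag any document that takes more than a second to add.
console.log(timings.filter(function (t) { return t.ms > 1000; }));
```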

@d0ugal d0ugal added the Bug label Mar 4, 2016
@waylan (Member) commented Mar 4, 2016

Even worse, my browser (Chrome) is locked up by the JavaScript for a good few seconds, so I can't actually click anything.

I noticed the same behavior when looking into this as well, although I didn't dig any deeper. The point is, I can confirm this behavior is not isolated.

Add a loading indicator (this could just be static text saying "Searching...") and remove the incorrect text in ReadTheDocs's search.html

I don't see any reason why we can't do this now. I'll push a PR when I get a few minutes unless someone beats me to it.

waylan added a commit that referenced this issue Mar 4, 2016
The search tool will replace the text with "No results found" if that is the case.
No reason to display it here. Also, the default text displays while the search
is running (especially if the search takes a long time), so it might as well be 
accurate and indicate that the search is running. Addresses #859.
d0ugal pushed a commit to d0ugal/lunr.js that referenced this issue Mar 4, 2016
The loops in Index.add can easily be called hundreds of thousands of times for
large documents. Larger documents can perform particularly badly[1] under
Chrome. Removing the usage of reduce and filter in favour of inline native for
loops can lead to a 20-30% performance increase in Chrome. Firefox also sees a
similar but much smaller improvement (~5%).

[1]: mkdocs/mkdocs#859 (comment)
@d0ugal (Member) commented Mar 6, 2016

I investigated further. FireFox is much faster than Chrome. I managed to improve the performance under Chrome a bit. See: olivernn/lunr.js#208

olivernn added a commit to olivernn/lunr.js that referenced this issue Mar 8, 2016
@olivernn commented Mar 8, 2016

I've pushed @d0ugal's changes in 0.7.0 of lunr, so there should hopefully be some improvements from that change.

Without knowing anything of how mkdocs makes use of lunr, I have a few suggestions that could improve the performance and user experience further:

  • Pre-build the index and then load this pre-built index
  • Move searching/building over to a webworker

As an example, the example lunr app loads a pre-built index. I think a member of the Angular.js team wrote up how they improved performance using web workers too.
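A worker along these lines might use a simple message protocol, sketched below with all names hypothetical. The handler is written as a plain function so the same logic could back self.onmessage in a real web worker (and be exercised without one):

```javascript
// Sketch of a search web worker's message handling (names are hypothetical).
// In a real worker this would be wired up as:
//   self.onmessage = function (e) { self.postMessage(handleMessage(e.data)); };
function createSearchWorker(buildIndex) {
  let index = null;
  return function handleMessage(msg) {
    switch (msg.type) {
      case 'init':
        // Build (or deserialise) the index off the main thread.
        index = buildIndex(msg.docs);
        return { type: 'ready' };
      case 'search':
        // The UI thread stays responsive while this runs in the worker.
        return { type: 'results', results: index.search(msg.query) };
      default:
        return { type: 'error', message: 'unknown message type: ' + msg.type };
    }
  };
}
```

The page would post an init message once (showing a "Searching..." indicator until ready comes back) and a search message per query.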

d0ugal added a commit to d0ugal/mkdocs that referenced this issue Mar 9, 2016
@d0ugal d0ugal changed the title Search shows "Sorry, page not found" for a few seconds before results are loaded Large search indexes can cause Lunr to freeze the page while it creates the index. Mar 9, 2016
@d0ugal (Member) commented Mar 9, 2016

Our usage of Lunr is very simple. I think all the relevant code is less than 100 lines. It is a testament to Lunr that it has gotten us this far before we hit an issue like this.

Thanks for the pointers; it sounds like those would both be valuable additions. I'll look into pre-building the index; I think that would be the biggest win.

@d0ugal (Member) commented Mar 9, 2016

The main issue for us is that we are Python-based, so pre-building the index would be tricky. We would need to shell out to Node and call a file. I would be okay with that if it gracefully fell back to the current approach when Node isn't available. However, this sort of functionality might be better suited to a plugin (so another candidate for #206).

Maybe that means we should look into web workers for now and revisit this later.

@pcdinh commented Mar 16, 2016

I got the same issue, and I only have a few files. Every time I visit my mkdocs-based documentation, the page freezes. Firefox's slow-script warning dialog appeared several times, which allowed me to stop lunr.js.

I found that the search_index.json response was about 1.2MB. It is fast to download but quite slow to parse in JavaScript.

@waylan (Member) commented Mar 16, 2016

The main issue for us, is that we are Python based, pre-building the index would be tricky. We would need to shell out to node and call a file.

I had wondered if there was a way to pre-build the index in Python. This is what I found:

The index is just JSON, so technically it should be possible to pre-build the index and serve that rather than the JSON file we serve today. However, there is no published spec for the index, although it is versioned. The index stores the version of lunr.js which was used to create it, and an error is raised if a different version of lunr.js attempts to load the index. This suggests that backward-incompatible changes between versions are possible/expected. However, as the lunr.js lib is shipped with MkDocs, we have full control over which version is being used.

The tricky part is the lack of a spec. Someone would have to go through the JS code and re-implement the index builder in Python, and transitioning from one version to the next has the potential to require a similar amount of work. I suppose if a third party created such a lib, it might make sense to use it (or via a Plugin if/when MkDocs adds support), but until then, calling out to Node is the more sensible approach. But even that feels more like something for a Plugin.

In the end, using webworkers to avoid a browser freeze is probably the most sensible short-term solution.

@rickpeters commented Mar 25, 2016

Maybe I don't understand it, but search_index.json is already generated when you run mkdocs build. The problem is not in generating or downloading it; the problem is that lunr.js is slow to parse this download. I am not a real expert on this; I just looked at the webworker example, and I have no idea about all the JavaScript dependencies that are necessary. It just seems that Python is not the issue here?
keep up the good work guys, I love the product!

regards,
Rick

@waylan (Member) commented Mar 25, 2016

@rickpeters the search_index.json file generated by MkDocs is not a lunr.js search index. It is simply a JSON file which contains all of the content of every page in a project. Every time you do a search, that file is downloaded by the browser, and the list of JSON objects is stepped through (one page at a time), with each one being passed to lunr.js to build the index, which lunr.js then uses to run the search. And with each page load, the entire index has to be rebuilt. At least that is the way it works now.

The ideal solution would be to have search_index.json be an actual prebuilt lunr.js index. However, as a workaround, a webworker allows the browser to parse and build the index in a separate process from the one which is displaying the page. On page reloads, that existing separate process would continue to be used and the index would not need to be rebuilt. Even if we had a proper prebuilt index, the webworker would eliminate the need to reload the index on each page load, so using a webworker is a good idea regardless of how we generate an index. It also avoids browser freezes on index loading/building. Hope that clears things up.
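The flow described above can be sketched roughly as follows. This is not MkDocs' actual search.js; the field names, boosts, and function names are assumptions for illustration, and lunr stands for the lunr.js entry point:

```javascript
// Simplified sketch of the current client-side flow: this whole function runs
// again on every page load, rebuilding the index from search_index.json.
function initSearch(lunr, searchIndex) {
  const index = lunr(function () {
    this.field('title', { boost: 10 });
    this.field('text');
    this.ref('location');
  });
  const pages = {};
  for (const page of searchIndex.docs) {
    index.add(page);              // the expensive step that freezes the page
    pages[page.location] = page;  // kept to render titles/snippets in results
  }
  return {
    search: function (query) {
      return index.search(query).map(function (r) { return pages[r.ref]; });
    }
  };
}
```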

@rickpeters commented Mar 25, 2016

Thanks very much for this clarification!
Do you have pointers on where I could find out how to prebuild this index? I currently run full continuous integration / continuous delivery for my mkdocs site (> 250 pages) using Docker, including extra steps like the inclusion of a dot-language plugin. It generates the site and then packages it into an nginx container for presentation. It works great and is very transferable.
I would be willing to look into integrating the necessary parts for pre-building the index, which could then be used by lunr.js, I guess?
Even then, the webworker would be a good idea, I think. Would the webworker need server-side JavaScript as well? If needed, I could package that into the container too.
And of course, all of this could be added to mkdocs as open-source code showing how to host your docs using Docker containers, or eventually transferring the static site to another hosting solution.

@waylan (Member) commented Mar 25, 2016

The lunr.js library includes an example script which could be adapted and run under Node.js to build the index. You would need to call out to Node in your build process (or call JavaScript from Python some way). That would give you a prebuilt index. However, the current lunr.js calling code in MkDocs would need to be updated to accept a prebuilt index (at approximately search.js#L32), similar to this.

That code in search.js would need to be modified for a webworker as well. The code would need to be placed in a webworker, with a fallback for when webworkers are not available (in some older browsers).
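Such a Node.js pre-build step might look roughly like the sketch below, loosely following the idea of lunr's example script. The field names and the surrounding file handling are assumptions, not MkDocs' or lunr's actual code:

```javascript
// Build a serialised lunr index server-side so the browser can load it with
// lunr.Index.load() instead of re-indexing every page itself.
function buildSerialisedIndex(lunr, searchIndex) {
  const index = lunr(function () {
    this.field('title', { boost: 10 });
    this.field('text');
    this.ref('location');
  });
  for (const page of searchIndex.docs) {
    index.add(page);
  }
  return index.toJSON(); // plain JSON, restorable via lunr.Index.load()
}

// When run directly under Node (requires lunr to be installed), something like:
//   const lunr = require('lunr');
//   const fs = require('fs');
//   const searchIndex = JSON.parse(fs.readFileSync('search_index.json', 'utf8'));
//   fs.writeFileSync('prebuilt_index.json',
//                    JSON.stringify(buildSerialisedIndex(lunr, searchIndex)));
```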

@waylan (Member) commented Mar 28, 2016

FYI, I have succeeded in getting the search index to build here: 3009a3a. It was relatively easy using the PyExecJS lib. Still not usable though as search.js has not yet been modified.

You can follow my progress here: master...waylan:search

@waylan (Member) commented Mar 29, 2016

So I thought I had a working solution for pre-building the index (here: master...waylan:search), but I'm getting weird errors I can't make sense of on Windows. It works fine on Linux and I don't have access to a Mac for testing right now. If anyone wants to help, please test (my branch) on your local machine and report back.

@facelessuser (Contributor) commented Mar 29, 2016

The require path must use forward slashes:

        lunr_path = os.path.join(
            os.path.dirname(os.path.abspath(__file__)),
            'assets/search/mkdocs/js/lunr.min.js'
        ).replace('\\', '/')
@waylan (Member) commented Mar 29, 2016

Thanks @facelessuser. That solved one problem.

I have Node installed on all my machines, and it is working now. However, as I understand it, PyExecJS does not need any dependencies installed on Windows or OS X, as it can use JScript or Apple JavaScriptCore, each of which is installed by default on its respective system. (Linux will always need an additional dependency, one of Node.js, PyV8, SpiderMonkey, PhantomJS, etc., but that seems like an acceptable compromise.)

When I force the use of JScript, I'm getting an error I can't make sense of (TypeError: Object expected, with no indication of where in the JS the error is being raised), and I don't have access to a Mac, so any additional feedback would be appreciated.

@waylan (Member) commented Oct 3, 2016

Before we jump down the "store the index in LocalStorage" rabbit hole, we need to address the creation of the index in the first place. Even using LocalStorage, the first visit to a large site would be painfully slow and unresponsive for the user. You have lost your user on their first visit, and they probably won't be coming back. Until we have that problem fixed, there is no point in exploring how to store the index across multiple visits. And if we do fix that problem, then maybe we don't need to worry about storing the index across visits. My point is, even if LocalStorage is possible, it doesn't solve the problem. It just addresses one of the symptoms.

As an aside, my current roadmap for this issue is as follows:

  1. Pre-build the index (with a client-side fallback for those who don't have node installed)
  2. Finish the Plugin API (see #206)
  3. Refactor search to be a plugin
  4. Wrap client built index in a webworker
  5. Explore other options for storing index client-side longterm

Note that I only have item 1 completed. I started item 4 and realized that the above would be a better approach, so I stopped and moved my attention to the Plugin API. The great thing about using a Plugin is that others can create their own search which starts with a different set of assumptions (or different backend tools) to better meet their needs.

@olivernn commented Oct 3, 2016

@waylan completely agree that pre-building is the best option here. Was just saying that there is nothing in lunr preventing a serialised index being stored client side. The tradeoff made with pre-building the index and serving it to the client is that, even gzipped, it can be quite large. That said, if it was me, I'd still opt for serving a pre-built index.

If there is anything I can help with from lunr then please do let me know, I'm more than happy to help.

@waylan (Member) commented Oct 3, 2016

If there is anything I can help with from lunr then please do let me know, I'm more than happy to help.

What would be ideal is a Python library that builds the index. Requiring our users to have both Python and Node installed is a little much (our Python script calls out to a Node script to build the index). As it happens, many users have both installed anyway, but I'm not comfortable making it a hard requirement and having the fallback results in two very different user experiences. Of course, I don't expect you to go build a Python clone of your JavaScript lib, but a publicly published spec of your index file format could go a long way toward someone else creating a Python lib (or Ruby, or ... whatever other language static site generators are written in).

@olivernn commented Oct 3, 2016

This is similar to an approach I am hoping to move towards with a new version of lunr, documenting (JSON Schema?) the serialisation format would open up the possibility of using the JavaScript version as a client, with the actual indexing etc being done in any other language that can generate the serialised index. Python would probably be an excellent choice for this given the number of NLP and IR libraries available.

My Python is non-existent, so I'm unlikely to be able to produce anything production ready myself, but by documenting the format others could certainly produce an implementation that would be easier to integrate into MkDocs (or other tools).

waylan added a commit to waylan/mkdocs that referenced this issue Oct 3, 2016
This builds an actual lunr.js search index via Node.js in addition to a JSON
array of page objects previously used to build the index client side. Note that
both are needed as the array of page objects is used to display search
results.

If Node fails (not installed or errors out), then fallback to old behavior
by creating an empty index. If the client (browser) receives an empty index,
then it builds the index from the array of pages as previously, which is
fine for most (small) sites. Large sites will want to make sure Node is
installed to prevent the browser from hanging during index creation.

Partially addresses mkdocs#859.
@waylan waylan self-assigned this Dec 2, 2016
@waylan waylan added this to To Do in Refactor search. May 2, 2017
@squidfunk commented Oct 22, 2017

Just a note: this may be a perfect use case for requestIdleCallback, which allows processing in the background without freezing the UI. I will think about including this in Material. Reasonable polyfills are available.
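A chunked version of the index-building loop could look like the sketch below (names hypothetical). The scheduler is passed in so the same code can use requestIdleCallback where available and fall back to something like setTimeout elsewhere:

```javascript
// Add documents to the index in small bursts, yielding the main thread back
// to the browser between bursts so the UI never freezes.
function addDocsWhenIdle(index, docs, schedule, done) {
  let i = 0;
  function work(deadline) {
    // With requestIdleCallback, deadline.timeRemaining() says how much idle
    // time is left; without a deadline (e.g. setTimeout fallback) just run.
    while (i < docs.length && (!deadline || deadline.timeRemaining() > 0)) {
      index.add(docs[i++]);
    }
    if (i < docs.length) {
      schedule(work); // yield, then continue in the next idle period
    } else {
      done();
    }
  }
  schedule(work);
}

// In a browser one might use:
//   const schedule = window.requestIdleCallback
//     ? function (fn) { window.requestIdleCallback(fn); }
//     : function (fn) { setTimeout(fn, 0); };
```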

@waylan waylan added the Plugin label Nov 1, 2017
@waylan waylan mentioned this issue Jan 16, 2018
waylan added a commit to waylan/mkdocs that referenced this issue Jan 31, 2018
@waylan waylan moved this from To Do to In Progress in Refactor search. Feb 5, 2018
waylan added a commit to waylan/mkdocs that referenced this issue Feb 27, 2018
@waylan (Member) commented Feb 28, 2018

FYI to everyone following this: A refactor of search is available for review in #1418 which addresses all of the concerns raised here and more.

@waylan waylan closed this in #1418 Mar 6, 2018
The "Refactor search" project automation moved this from In Progress to Done Mar 6, 2018
waylan added a commit that referenced this issue Mar 6, 2018
* Use a web worker in the browser with a fallback (fixes #859 & closes #1396).
* Optionally pre-build search index (fixes #859 & closes #1061).
* Upgrade to lunr.js 2.x (fixes #1319).
* Support search in languages other than English (fixes #826).
* Allow the user to define the word separators (fixes #867).
* Only run searches for queries of length > 2 (fixes #1127).
* Remove dependency on require.js, mustache, etc. (fixes #1218).
* Compress the search index (fixes #1128).