Improving search functionality with a stored cache of keywords #325

Open
choldgraf opened this issue Sep 17, 2019 · 4 comments

@choldgraf choldgraf commented Sep 17, 2019

Currently, our search functionality uses a fairly simple method. It grabs the first N words of each article and turns that into the store that's used to look up keywords in pages. Here's where that happens:

https://github.com/jupyter/jupyter-book/blob/master/jupyter_book/book_template/_includes/search/lunr/lunr-store.js#L26

As @psychemedia has noted, this is suboptimal for a number of reasons: the first N words aren't guaranteed to include the most "important" words, and moreover that cache of words may change over time in unpredictable ways.

It would be fairly straightforward to re-create the functionality of the lunr store js code but using Python at build time. This could do something like:

  • Any time the book is built, read in each page
  • Grab the page's content from the notebook
  • Remove all "common" English words from the content
  • Remove all repeated words from the content
  • Sort the words by some measure (length, frequency, etc.)
  • Take the top N of those words
  • Store the result in the same JSON structure that the lunr store code uses.
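
The steps above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the stop-word list is a tiny placeholder, the names `top_keywords` and `build_store` are hypothetical, and the `url -> {"title", "content"}` shape is only an assumption about what the lunr store expects.

```python
import json
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real build would use a fuller one.
STOP_WORDS = frozenset(
    "a an and are as at be by for from in is it of on or that the this to with".split()
)


def top_keywords(text, n=50):
    """Return up to n distinct non-stop-words, most frequent first."""
    # Lowercase and keep tokens of two or more characters that start with a letter.
    words = re.findall(r"[a-z][a-z0-9_]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    # Rank by frequency, then by length, then alphabetically so the order is stable.
    ranked = sorted(counts, key=lambda w: (-counts[w], -len(w), w))
    return ranked[:n]


def build_store(pages, n=50):
    """Map each page URL to its title and a space-joined keyword string.

    `pages` is assumed to look like {url: {"title": str, "content": str}}.
    """
    return {
        url: {
            "title": page["title"],
            "content": " ".join(top_keywords(page["content"], n)),
        }
        for url, page in pages.items()
    }


if __name__ == "__main__":
    pages = {"/intro": {"title": "Intro", "content": "The cache of keywords for search"}}
    print(json.dumps(build_store(pages), indent=2))
```

Because the ranking has deterministic tie-breaking, rebuilding an unchanged page yields the same keyword list, which addresses the concern about the cache changing unpredictably.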

I think this could be a way to get a more representative sample of words for lunr.js to look through.

Another option is to find a way for the search to scale better so we can search through the full article text instead of just a subset of words.

@psychemedia psychemedia commented Sep 23, 2019

The lunr.py Python package implements lunr.js in Python and may provide handy functions for generating the index in the correct format.

I did a naive pencil sketch some time ago here that indexed a set of notebooks using lunr.py then generated the JS index object required for embedding in the HTML search page.

There is also the question of whether to index code, and if so, how it should be treated.

@choldgraf choldgraf commented Sep 23, 2019

Oh awesome! I didn't realize there was a Python implementation. That should simplify things quite a bit!

re: code vs. text, I think we could start with the content and decide later on if the code is something we want to include in the search database.

@psychemedia psychemedia commented Oct 21, 2019

I've just started looking around for whether there are any more recent alternatives to lunr.js and elasticlunr looks like it could be interesting.

For example, field-based searching would presumably allow searching within markdown, code and code outputs if such fields were separately indexed. (This should be easy enough when building from an .ipynb corpus, but perhaps a bit more challenging when using e.g. .md, unless it were piped through Jupytext and converted to .ipynb for indexing purposes?)
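
As a rough sketch of what separately indexable fields might look like, the snippet below splits an already-loaded notebook (a dict following the nbformat v4 JSON layout) into markdown, code and output text. The function name `notebook_fields` and the three field names are hypothetical, and only stream and `text/plain` outputs are considered.

```python
def _as_text(value):
    """nbformat stores text as either a string or a list of strings."""
    if isinstance(value, list):
        return "".join(value)
    return value or ""


def notebook_fields(nb):
    """Collect markdown, code and plain-text output from a notebook dict."""
    fields = {"markdown": [], "code": [], "output": []}
    for cell in nb.get("cells", []):
        kind = cell.get("cell_type")
        if kind == "markdown":
            fields["markdown"].append(_as_text(cell.get("source")))
        elif kind == "code":
            fields["code"].append(_as_text(cell.get("source")))
            for out in cell.get("outputs", []):
                if out.get("output_type") == "stream":
                    # stdout / stderr text
                    fields["output"].append(_as_text(out.get("text")))
                elif "data" in out:
                    # execute_result / display_data: keep only the plain-text form
                    fields["output"].append(_as_text(out["data"].get("text/plain")))
    # One string per field, ready to hand to whichever indexer is chosen.
    return {name: "\n".join(parts) for name, parts in fields.items()}
```

Each returned string could then be registered as its own field in a field-aware index, so a query can be restricted to, say, code only.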

@choldgraf choldgraf commented Oct 21, 2019

@psychemedia at this point, we're actually converting all text files into .ipynb first with Jupytext, so this could still work even if it only worked with notebooks.

I think the main question is how lightweight the solution is - we need this to be something that works 99% of the time, that does most of what we want, and that will be fairly straightforward for developers to grok and maintain.
