improving search functionality #374

Merged
merged 3 commits into from
Oct 28, 2019

choldgraf
Member

This is a slight improvement to search functionality.

The old search

Would use the first N words of page content (where N is 100 by default)

This PR's search

Calculates the top N most common words on a page after some cleaning (dropping "very common" English words, removing special characters, etc.). It stores these words in a search: field in the page's YAML metadata.
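
To make the difference concrete, here is a minimal sketch of both approaches in plain Python. The function names, the tiny stop-word list, and the cleaning regex are assumptions for illustration and are not taken from the code in this PR.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "are", "as", "in", "is", "it", "of", "on", "the", "to"}


def first_n_words(text, n=100):
    """Old behaviour (sketch): keep the first n whitespace-separated words."""
    return text.split()[:n]


def top_n_words(text, n=100):
    """New behaviour (sketch): keep the n most common words after cleaning."""
    # Lowercase, then keep only runs of letters (drops digits and special characters).
    words = re.findall(r"[a-z]+", text.lower())
    # Drop very common English words.
    words = [word for word in words if word not in STOP_WORDS]
    # most_common orders by count; ties keep first-seen order (Python 3.7+).
    return [word for word, _ in Counter(words).most_common(n)]


page = "The search index is built from the page. Search terms are cached in the index."
print(first_n_words(page, n=3))  # ['The', 'search', 'index']
print(top_n_words(page, n=3))    # ['search', 'index', 'built']
```

With the first approach the result depends entirely on how the page happens to open; with the second it depends on the word counts over the whole page.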

This should make searching more consistent as a page's content evolves, and it should also ensure that the "more important" (or at least more common) words on a page are in the search cache. Curious what @psychemedia thinks about this

@psychemedia
Contributor

I haven't had a chance to try this yet, but off the top of my head: in lots of retrieval algorithms you want rare words, ones that appear in one document but not in the others, to rank highly in a search, as with the TF-IDF (term frequency-inverse document frequency) score.
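
For reference, a minimal dependency-free sketch of one common TF-IDF variant (relative term frequency times log(N / document frequency)) might look like this. The function names and the two sample pages are made up for illustration; nothing here comes from Jupyter Book itself.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep only runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(pages):
    """Return {page_name: {word: score}} for a dict of {page_name: text}."""
    counts = {name: Counter(tokenize(text)) for name, text in pages.items()}
    # Document frequency: the number of pages in which each word appears.
    doc_freq = Counter()
    for page_counts in counts.values():
        doc_freq.update(page_counts.keys())
    n_pages = len(pages)
    scores = {}
    for name, page_counts in counts.items():
        total = sum(page_counts.values())
        scores[name] = {
            # Term frequency within the page times inverse document frequency.
            word: (count / total) * math.log(n_pages / doc_freq[word])
            for word, count in page_counts.items()
        }
    return scores


# Hypothetical sample pages.
pages = {
    "intro": "jupyter book builds a book from notebooks",
    "search": "jupyter book search uses a search index",
}
scores = tf_idf(pages)
best = max(scores["search"], key=scores["search"].get)
print(best)
```

Words that appear on every page (here "jupyter", "book", and "a") score zero, while words that are frequent on one page and absent elsewhere score highest.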

@choldgraf
Member Author

@psychemedia that's a good point - I think the question for this PR is: "is it better than taking the first 100 words, and does this open the door to improve search in the future?".

I had thought about using gensim or some other NLP tool for doing proper munging and scoring of words, but I'm not sure we wanna add a dependency like that. However, since this PR gets us to using Python for the creation of page search content, it should be easier to swap that out with something better in the future.

@psychemedia
Contributor

Would it make more sense to have a search tool that:

  1. takes an index in a particular format;
  2. has a tool that will by default create an index (or different sorts of default index) with no dependency;
  3. has an easy way to import / embed an index generated according to the specified index format;

and then at least one other demo repo that shows how to generate an index and use it in the Jupyter Book context. This would then set up the basis for a recipe for how to:

a) generate;
b) incorporate

other search indexes for Jupyter Book?
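
One possible reading of this, sketched in plain Python: the index format is assumed to be a JSON object mapping page names to lists of words, and `build_default_index`, `load_index`, and `search` are hypothetical names invented for this sketch, not existing Jupyter Book functions.

```python
import json
import re


def build_default_index(pages):
    """Dependency-free default indexer: map each page to its sorted unique words."""
    return {
        name: sorted(set(re.findall(r"[a-z]+", text.lower())))
        for name, text in pages.items()
    }


def load_index(raw_json):
    """Import an index produced by any tool, as long as it follows the format."""
    index = json.loads(raw_json)
    if not isinstance(index, dict) or not all(
        isinstance(words, list) for words in index.values()
    ):
        raise ValueError("index must map page names to lists of words")
    return index


def search(index, query):
    """Return the pages whose word list contains every word in the query."""
    terms = query.lower().split()
    return [name for name, words in index.items() if all(t in words for t in terms)]


# Hypothetical sample pages.
pages = {"intro": "Jupyter Book builds books", "search": "Search uses an index"}
# Round-trip through JSON to mimic an index generated elsewhere and then embedded.
index = load_index(json.dumps(build_default_index(pages)))
print(search(index, "index"))  # ['search']
```

Under this reading, any external tool (a TF-IDF scorer, gensim, etc.) could replace `build_default_index` as long as its output passes `load_index`.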

@choldgraf added the enhancement label Oct 15, 2019
@choldgraf
Member Author

@psychemedia - I am not sure I entirely grok what you're suggesting there, mind explaining it again, perhaps in an issue? I think I'll merge this PR in, because it's a clear improvement over the current search setup (the current setup just takes the first N words, whereas this PR takes the N most-used uncommon words). I'm happy to keep iterating on this in an issue though!

@choldgraf merged commit bd3d3d7 into executablebooks:master Oct 28, 2019
@choldgraf deleted the search_improve branch October 28, 2019 19:54
@hamelsmu

Is there a reason search is not mentioned in the docs at all? I'm having trouble finding any docs on how this functionality works or how it can be configured.

@hamelsmu

I'll open an issue instead, my apologies. This PR was one of the few things I could find with keyword search.

@choldgraf
Member Author

@hamelsmu - ah that is because this PR is for an older version of Jupyter Book (before it was built on Sphinx) and so this PR is now out of date :-)
