improving search functionality #374
Not had a chance to try this yet, but off the top of my head: in lots of retrieval algorithms, you often want rare words that appear in one document but not in others to rank highly in a search, as per the TF-IDF score (the term frequency-inverse document frequency statistic).
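To make the TF-IDF idea above concrete, here is a minimal sketch in plain Python. It is not part of this PR: the `tokenize` and `tfidf_scores` helpers and the sample pages are hypothetical, and it assumes the simplest unsmoothed weighting (term frequency times log of inverse document frequency).

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_scores(pages):
    """Map each page name to a {word: tf-idf score} dict.

    `pages` is a hypothetical {page name: page text} mapping.
    """
    tokens = {name: tokenize(text) for name, text in pages.items()}
    n_pages = len(tokens)

    # Document frequency: the number of pages each word appears in.
    doc_freq = Counter()
    for words in tokens.values():
        doc_freq.update(set(words))

    scores = {}
    for name, words in tokens.items():
        counts = Counter(words)
        total = len(words) or 1  # guard against empty pages
        scores[name] = {
            word: (count / total) * math.log(n_pages / doc_freq[word])
            for word, count in counts.items()
        }
    return scores


# Made-up sample pages, for illustration only.
pages = {
    "intro": "jupyter book builds a book from notebooks",
    "search": "the search index lists words from each page of the book",
    "deploy": "deploy the book with github pages",
}
scores = tfidf_scores(pages)
```

With this weighting, a word that appears on every page (here "book") scores zero everywhere, while a word unique to one page (here "jupyter") gets a positive score on that page.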
@psychemedia that's a good point. I think the question for this PR is: "is it better than taking the first 100 words, and does this open the door to improving search in the future?". I had thought about using gensim or some other NLP tool for doing proper munging and scoring of words, but I'm not sure we wanna add a dependency like that. However, since this PR gets us to using Python for the creation of page search content, it should be easier to swap that out for something better in the future.
Would it make more sense to have a search tool that:
and then at least one other demo repo that shows how to generate an index and use it in the Jupyter Book context. This would then set up the basis for a recipe for how to: a) generate; other search indexes for Jupyter Book?
@psychemedia - I am not sure I entirely grok what you're suggesting there. Mind explaining it again, perhaps in an issue? I think I'll merge this PR in, because it's a clear improvement over the current search setup (current: takes the first N words; this PR: takes the N most-used uncommon words). I'm happy to keep iterating on this in an issue though!
Is there a reason search is not mentioned in the docs at all? I'm having trouble finding any docs on how this functionality works or how it can be configured.
I'll open an issue instead, my apologies. This PR was one of the few things I could find with keyword search.
@hamelsmu - ah, that is because this PR is for an older version of Jupyter Book (before it was built on Sphinx), and so this PR is now out of date :-)
This is a slight improvement to search functionality.
The old search
Would use the first N words of page content (where N is 100 by default)
This PR's search
Calculates the top N most common words on a page after some cleaning (dropping "very common" English words, removing special characters, etc). It stores these words in a `search:` field in the page's YAML metadata.
This should make searching more consistent as a page's content evolves, and it should also ensure that the "more important" (or at least more common) words on a page are in the search cache. Curious what @psychemedia thinks about this.
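The difference between the two approaches can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `first_n_words`, `top_n_words`, and the tiny `STOP_WORDS` set are hypothetical stand-ins, and the PR's actual cleaning rules may differ.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {
    "the", "a", "an", "and", "of", "to", "in", "is",
    "it", "on", "for", "with", "that", "this",
}


def first_n_words(text, n=100):
    """Old behaviour: keep the first n whitespace-separated words."""
    return text.split()[:n]


def top_n_words(text, n=100):
    """New behaviour (sketch): the n most common words after cleaning."""
    # Lowercase, then replace anything that is not a letter, digit,
    # or whitespace with a space.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    words = [w for w in cleaned.split() if w not in STOP_WORDS]
    return [word for word, _ in Counter(words).most_common(n)]


# Made-up page content, for illustration only.
page = (
    "Search! The search index is built from the words on a page. "
    "Words that repeat, like search, matter."
)

print(first_n_words(page, n=3))  # ['Search!', 'The', 'search']
print(top_n_words(page, n=2))    # ['search', 'words']
```

The first function depends entirely on how a page happens to open, while the second keeps returning the same words as long as the page's dominant vocabulary stays the same.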