improving search functionality #374

choldgraf · 2019-10-03T22:59:15Z

This is a slight improvement to search functionality.

The old search

Used the first N words of the page content (where N is 100 by default).

This PR's search

Calculates the top N most common words on a page after some cleaning (dropping "very common" English words, removing special characters, etc.). It stores these words in a search: field in the page's YAML metadata.
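A minimal sketch of the two approaches, for comparison. This is not the PR's actual code: the function names, the stop-word list, and the regex-based cleaning are assumptions made for illustration only.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the PR's real list of "very common" words is not shown here.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
              "this", "that", "for", "on", "with", "as", "are", "be"}


def first_n_words(text, n=100):
    """Old behaviour as described: the first N words of the page content."""
    return text.split()[:n]


def top_n_words(text, n=100):
    """New behaviour as described: the N most common words after cleaning."""
    # Lowercase and keep only alphabetic runs, which drops special characters.
    words = re.findall(r"[a-z]+", text.lower())
    # Drop very common words so they do not crowd out informative ones.
    words = [w for w in words if w not in STOP_WORDS]
    return [word for word, _ in Counter(words).most_common(n)]


page = ("The search index is built from the page. "
        "Search terms matter; the index stores search terms.")
print(first_n_words(page, n=3))
print(top_n_words(page, n=3))
```

Because the second function depends on word counts rather than word position, editing a page's opening paragraph should change its search terms less than it would under the first-N-words approach.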

This should make searching more consistent as a page's content evolves, and it should also ensure that the "more important" (or at least more common) words on a page are in the search cache. Curious what @psychemedia thinks about this

psychemedia · 2019-10-06T14:35:25Z

I haven't had a chance to try this yet, but off the top of my head: in lots of retrieval algorithms, you often want rare words that appear in one document but not in others to rank highly in a search, as per the TF-IDF score (the term frequency-inverse document frequency statistic).
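For reference, a small dependency-free sketch of one common TF-IDF variant (term frequency as count divided by document length, inverse document frequency as log(N / document frequency)). The function name and the sample documents are made up for illustration; libraries such as gensim or scikit-learn use smoothed or normalized variants.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: score} dict per document."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]

    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for words in tokenized:
        doc_freq.update(set(words))

    scores = []
    for words in tokenized:
        counts = Counter(words)
        scores.append({
            # tf * idf: words that appear in every document score 0.
            word: (count / len(words)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


docs = [
    "jupyter book search",
    "jupyter book themes",
    "jupyter notebook widgets",
]
scores = tf_idf(docs)
# Words sorted from most to least distinctive for the first document.
print(sorted(scores[0], key=scores[0].get, reverse=True))
```

In this toy corpus "jupyter" appears in every document and scores zero, while "search" appears in only one and ranks highest for that document, which is the effect described above.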

choldgraf · 2019-10-06T16:13:35Z

@psychemedia that's a good point - I think the question for this PR is: "is it better than taking the first 100 words, and does this open the door to improve search in the future?".

I had thought about using gensim or some other NLP tool for doing proper munging and scoring of words, but I'm not sure we wanna add a dependency like that. However, since this PR gets us to using Python for the creation of page search content, it should be easier to swap that out with something better in the future.

psychemedia · 2019-10-06T20:56:16Z

Would it make more sense to have a search tool that:

takes an index in a particular format;
comes with a tool that will by default create an index (or different sorts of default index) with no dependency;
provides an easy way to import / embed an index generated according to the specified index format

and then at least one other demo repo that shows how to generate an index and use it in the Jupyter Book context. This would then set up the basis for a recipe for how to:

a) generate;
b) incorporate

other search indexes for Jupyter Book?
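One possible reading of this proposal, sketched in Python. Everything here is hypothetical: the JSON layout (a "version" key plus a "pages" mapping from URL to terms), the function names, and the sample pages are assumptions for illustration, not an existing Jupyter Book API.

```python
import json


def build_default_index(pages, n_words=100):
    """Default, dependency-free builder: the first n_words words of each page."""
    return {
        "version": 1,
        "pages": {url: text.lower().split()[:n_words] for url, text in pages.items()},
    }


def load_index(raw):
    """Import an index produced by any tool, as long as it follows the format."""
    index = json.loads(raw)
    if not isinstance(index, dict) or not isinstance(index.get("pages"), dict):
        raise ValueError("index must be an object with a 'pages' mapping")
    return index


def search(index, term):
    """Return the URLs of pages whose indexed terms contain the query term."""
    term = term.lower()
    return sorted(url for url, words in index["pages"].items() if term in words)


pages = {
    "/intro": "Welcome to the book",
    "/search": "How search works in the book",
}
# The index is plain JSON, so an external tool (e.g. one using TF-IDF)
# could write the same structure and the book would load it unchanged.
serialized = json.dumps(build_default_index(pages))
index = load_index(serialized)
print(search(index, "search"))
```

The point of the split is that the search code only depends on the index format, so a demo repo could generate a richer index with heavier tooling without adding a dependency to the default build.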

choldgraf · 2019-10-28T19:52:25Z

@psychemedia - I am not sure I entirely grok what you're suggesting there. Mind explaining it again, perhaps in an issue? I think I'll merge this PR in, because it's a clear improvement over the current search setup (the current setup just takes the first N words, whereas this PR takes the N most-used uncommon words). I'm happy to keep iterating on this in an issue though!

hamelsmu · 2021-02-23T23:15:34Z

Is there a reason search is not mentioned in the docs at all? I'm having trouble finding any docs on how this functionality works or how it can be configured.

hamelsmu · 2021-02-23T23:20:10Z

I'll open an issue instead, my apologies. This PR was one of the few things I could find with keyword search

choldgraf · 2021-02-23T23:37:09Z

@hamelsmu - ah that is because this PR is for an older version of Jupyter Book (before it was built on Sphinx) and so this PR is now out of date :-)

improving search functionality

e44f25a

choldgraf added the enhancement label Oct 15, 2019

choldgraf added 2 commits October 28, 2019 08:40

adding test for improved search

0b77e8b

Merge branch 'master' into search_improve

47edf1f

choldgraf merged commit bd3d3d7 into executablebooks:master Oct 28, 2019

choldgraf deleted the search_improve branch October 28, 2019 19:54
