Proposal: Improved fulltext search and adding keyword scoped searching #1183
Comments
I see that could be very useful for linking related projects to a main topic, e.g. all
Thanks for starting this proposal @bcardarella. Improving search and discoverability would be really good for Hex and the community.
I think one big part of the issue here is that we search for both name and description. Maybe we should default to only search names and have description search under a label such as
Can we run this on generic CPUs or do we need more specialized hardware?
I think
Maybe we can also give suggestions if there are similar tags that are already used by many packages? This would need to pre-check during
It looks like there are 3 scopes of work here:
(1) is the easiest and would be a direct improvement over the current search. We need to look into how it would affect the description full-text search, but we can make something work. (2) I have long been hesitant to add keywords or categories because I believed a good keyword system would need some kind of moderation. But with good normalization and suggestions for keywords we may not need it. It's also very likely better to have a keyword system than no keyword system, even if not every package ends up perfectly categorized. (3) This is the biggest unknown for me, mostly because I have little practical experience with Nx. It seems to be the most work, but I also think it could be very useful.
I can check with Sean about using Nx and what hardware support would be needed. If it is going to require a change in hosting, I would suggest not using Nx for this. I believe we can improve the search experience quite a bit with fulltext changes alone.
I am also curious if we could add a list of languages a library/project revolves around, for instance: Erlang, Elixir, LFE, Gleam, JSON, JSON-Schema, GraphQL, XML, XSD, SQL, HTML, CSS, JavaScript, TypeScript, so that when publishing you could supply a list of languages the project revolves around/uses?
I am also curious if you have found a way yet to supply information that could go into a meta description tag per module. The reason would be search engine discovery. That description could, however, also help with package search.
I don’t think we will have both categories and tags; to me they serve the same purpose, just different ways of doing it. I am hesitant about anything that requires moderation, such as maintaining a list of approved categories/tags, because it does not scale unless we also build some good tooling around the moderation. Packages already have a description metadata field.
Just my 2 cents. Taking the Rust ecosystem as an example, they provide a curated list of categories https://crates.io/category_slugs where authors can pick up to 5 categories, if I'm not mistaken. Besides that, authors may set keywords; see https://crates.io/crates/serde, which displays keywords at the top and categories in the metadata section. I don't know the motivation for having both, and I understand the concern of keeping that list up to date, so it seems like keywords chosen by authors would work well for this proposal.
Fab suggestion, thanks @bcardarella! I've often wondered how we could make package discovery easier. I think normalising the keywords could be quite challenging. If I were to make an embeddable scripting language I might pick any of "interpreter", "vm", "virtual-machine", or "abstract-machine", but really they all mean the same thing. My main issue with the fixed list of Rust categories is that there just aren't very many tags one can pick from, and I often struggle to find categories that feel suitable. I think I could be happy with a fixed list, but it would be somewhat larger than what Rust offers.
Searching
The current searching on hex.pm can lead to a lot of false positives. For example, searching for `openid authentication` gives results that match on either or both terms. In this case the `authentication` term is resulting in a lot of noise completely unrelated to `openid`.
There are a few different potential solutions. The first, and easiest, would be to change each word to an `and` search instead of an `or` search, giving higher priority to the order of the terms themselves. With `openid authentication`, `openid` would have higher sort priority than `authentication`.
The second option would be to use Nx to build a semantic search function for Hex. @seanmor5 wrote about this on the DY blog recently: https://dockyard.com/blog/2022/09/28/semantic-search-with-phoenix-axon-and-elastic and https://dockyard.com/blog/2023/01/11/semantic-search-with-phoenix-axon-bumblebee-and-exfaiss
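To make the first option concrete, here is a rough Python sketch of an `and` search that ranks by term order. It is a stand-in for illustration only, not Hex's actual query: the package names and descriptions are made up, and "priority by term order" is interpreted (as an assumption) as sorting by how often the first term occurs, then the second, and so on.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def and_search(packages, query):
    """Return names of packages whose name or description contains every term.

    Results are sorted by per-term occurrence counts, compared in query
    order, so earlier terms carry more weight than later ones.
    """
    terms = tokenize(query)
    if not terms:
        return []
    ranked = []
    for name, description in packages:
        tokens = tokenize(f"{name} {description}")
        counts = tuple(tokens.count(term) for term in terms)
        if all(counts):  # "and": every term must occur at least once
            ranked.append((counts, name))
    # Python's sort is stable, so ties keep their original order.
    ranked.sort(key=lambda item: item[0], reverse=True)
    return [name for _, name in ranked]


# Hypothetical packages, invented for this example.
PACKAGES = [
    ("auth_kit", "Generic authentication helpers"),
    ("openid_client", "OpenID authentication client"),
    ("sso_plug", "Plug for authentication: OpenID and SAML authentication"),
]

print(and_search(PACKAGES, "openid authentication"))
```

With an `or` search all three packages would match; here `auth_kit` drops out because it never mentions `openid`, and `openid_client` sorts ahead of `sso_plug` because it matches the first term more often.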
We could also explore other search methods. TBH, as much as it would be fun to implement Nx, I think changing the query from `or` to `and` would be the best ROI.
Keywords
Prior works
npm spec: https://docs.npmjs.com/cli/v6/configuring-npm/package-json#keywords
npm search: https://www.npmjs.com/search?q=keywords%3Aauthentication
cargo spec: https://doc.rust-lang.org/cargo/reference/manifest.html#the-keywords-field
cargo search: https://crates.io/keywords/authentication
Adding keywords to Hex will allow for categorical deep linking and scoped searching. This is something I feel is necessary as the Elixir ecosystem continues to grow. First, it allows consumers to easily and quickly find relevant packages or browse a list of relevant packages. Second, it incentivizes producers, since their efforts become more easily discoverable.
I would recommend the following update to the `mix.exs` spec:
Keyword normalization
@ericmj voiced the concern around similarly intended keywords. For example: `LiveComponent` and `live_component`. To further illustrate this we could add `live-component`. I believe we can account for this by normalizing the keywords during publishing:
This should also happen when querying against a keyword. We store one normalized version of the keyword. There should, however, be a recommended format in the documentation and I would recommend the downcased version of a keyword phrase.
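As an illustration of what such normalization might look like, here is a small Python sketch. The function name `normalize_keyword` and the exact rules (split CamelCase, treat hyphens and spaces as separators, downcase, join with underscores) are assumptions chosen so that the three spellings above collapse to one form; they are not the rules from the proposal itself.

```python
import re


def normalize_keyword(keyword):
    """Collapse common spellings of a keyword into one canonical form.

    Hypothetical rules: CamelCase boundaries, hyphens and whitespace all
    become a single underscore, and the result is downcased.
    """
    # "LiveComponent" -> "Live_Component"
    s = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", keyword.strip())
    # Hyphens, whitespace and repeated underscores become one underscore.
    s = re.sub(r"[-\s_]+", "_", s)
    return s.strip("_").lower()


for spelling in ["LiveComponent", "live_component", "live-component"]:
    print(spelling, "->", normalize_keyword(spelling))
```

The same function would be applied both when a package is published and when a keyword query comes in, so only the normalized form is ever stored or compared. One caveat of this particular rule set: mixed-case acronyms such as `OpenID` would become `open_id`, which a real implementation may want to treat differently.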
Deep linking
Allowing projects to deep link into keyword categories will provide value for consumers. For example, https://phoenixframework.org could deep link directly from its site or documentation to https://hex.pm/packages?keywords=live_component
Scoped searching
Searching within a keyword scope provides additional value when a keyword returns too many results. Proposed URL: https://hex.pm/packages?keywords=live_component&search=authentication
Multiple keywords
I've used `keywords` plural as the query param. It opens the possibility of providing an intersection of two or more keywords: https://hex.pm/packages?keywords=live_components+authentication. Similar to the search proposal above, I believe this should be an `and`