Proposal: Improved fulltext search and adding keyword scoped searching #1183

Open
bcardarella opened this issue Feb 14, 2023 · 9 comments

@bcardarella

bcardarella commented Feb 14, 2023

Searching

The current searching on hex.pm can lead to a lot of false positives. For example, searching for openid authentication:

[Screenshot: hex.pm search results for "openid authentication"]

gives results that match on either or both terms. In this case the authentication term is resulting in a lot of noise completely unrelated to openid.

There are a few potential solutions. The first, and easiest, would be to treat the search terms as an and search instead of an or search, giving higher priority to terms that appear earlier in the query. With openid authentication, openid would have higher sort priority than authentication.
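To make the first option concrete, here is a minimal in-memory sketch. It is not how hex.pm queries its database: the `SearchSketch` module, the package maps, and the weighting scheme (earlier terms score higher when they appear in the package name) are all assumptions made for illustration.

```elixir
defmodule SearchSketch do
  # Keep only packages that contain every term (AND), then sort so that
  # packages whose name matches an earlier term come first.
  def search(packages, query) do
    terms = query |> String.downcase() |> String.split()

    packages
    |> Enum.filter(fn pkg -> Enum.all?(terms, &matches?(pkg, &1)) end)
    |> Enum.sort_by(&score(&1, terms), :desc)
  end

  # A term matches if it appears in the name or the description.
  defp matches?(pkg, term) do
    String.contains?(String.downcase(pkg.name <> " " <> pkg.description), term)
  end

  # Assumed weighting: the first term is worth the most, the last the least,
  # and a term only counts toward the score when it appears in the name.
  defp score(pkg, terms) do
    name = String.downcase(pkg.name)
    count = length(terms)

    terms
    |> Enum.with_index()
    |> Enum.map(fn {term, index} ->
      if String.contains?(name, term), do: count - index, else: 0
    end)
    |> Enum.sum()
  end
end

# Hypothetical sample data.
packages = [
  %{name: "guardian", description: "Token based authentication library"},
  %{name: "auth_kit", description: "Authentication helpers with OpenID support"},
  %{name: "authentication_plug", description: "Plug for OpenID login"},
  %{name: "openid_connect", description: "OpenID Connect authentication client"}
]

packages |> SearchSketch.search("openid authentication") |> Enum.map(& &1.name)
# => ["openid_connect", "authentication_plug", "auth_kit"]
```

With an or search, guardian would also be returned because it matches authentication alone; requiring every term drops it.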

The second option would be to use Nx to build a semantic search function for Hex. @seanmor5 wrote about this on the DY blog recently: https://dockyard.com/blog/2022/09/28/semantic-search-with-phoenix-axon-and-elastic and https://dockyard.com/blog/2023/01/11/semantic-search-with-phoenix-axon-bumblebee-and-exfaiss

We could also explore other search methods. TBH, as much as it would be fun to implement this with Nx, I think changing the query from or to and would give the best ROI.

Keywords

Prior works

npm spec: https://docs.npmjs.com/cli/v6/configuring-npm/package-json#keywords
npm search: https://www.npmjs.com/search?q=keywords%3Aauthentication

cargo spec: https://doc.rust-lang.org/cargo/reference/manifest.html#the-keywords-field
cargo search: https://crates.io/keywords/authentication

Adding keywords to Hex will allow for categorical deep linking and scoped searching. This is something I feel is necessary as the Elixir ecosystem continues to grow. First, it allows consumers to quickly find relevant packages or browse a list of them. Second, it gives producers an incentive, because their work becomes easier to discover.

I would recommend the following update to the mix.exs spec:

  def project do
    [
      app: :my_app,
      version: "0.0.1",
      elixir: "~> 1.13",
      start_permanent: Mix.env() == :prod,
      elixirc_paths: elixirc_paths(Mix.env()),
      consolidate_protocols: Mix.env() != :test,
      package: package(),
      description: description(),
      source_url: @scm_url,
      docs: docs(),
      deps: deps(),
      keywords: ~w{live_view_component authentication}
    ]
  end

Keyword normalization

@ericmj voiced the concern around similarly intended keywords. For example: LiveComponent and live_component. To further illustrate this we could add live-component. I believe we can account for this by normalizing the keywords during publishing:

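def normalize_keyword(keyword) do
   # Macro.underscore/1 turns CamelCase into snake_case; hyphens are replaced explicitly
   keyword |> Macro.underscore() |> String.replace("-", "_") |> String.downcase()
end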

This should also happen when querying against a keyword, so that we only ever store and compare one normalized version of each keyword. There should, however, be a recommended format in the documentation, and I would recommend the downcased, underscored version of a keyword phrase.

Deep linking

Allowing projects to deep link into keyword categories will provide value for consumers. For example, https://phoenixframework.org could deep link directly from its site or documentation to https://hex.pm/packages?keywords=live_component

Scoped searching

Searching within a keyword scope provides additional value when a keyword returns too many results. Proposed URL: https://hex.pm/packages?keywords=live_component&search=authentication

Multiple keywords

I've used keywords (plural) as the query param. It opens the possibility of requesting the intersection of two or more keywords: https://hex.pm/packages?keywords=live_components+authentication. Similar to the search proposal above, I believe this should be an and.
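A rough sketch of how the proposed query params could be interpreted. Everything here is hypothetical (the `KeywordFilterSketch` module, the shape of the package maps, and matching the search term against the description only); the only standard-library behavior relied on is that `URI.decode_query/1` decodes `+` as a space.

```elixir
defmodule KeywordFilterSketch do
  # Same normalization idea as in the proposal, applied at query time.
  def normalize(keyword) do
    keyword |> Macro.underscore() |> String.replace("-", "_") |> String.downcase()
  end

  # Parses a query string such as
  # "keywords=live_component+authentication&search=oauth".
  def parse(query_string) do
    params = URI.decode_query(query_string)

    keywords =
      params
      |> Map.get("keywords", "")
      |> String.split(" ", trim: true)
      |> Enum.map(&normalize/1)

    %{keywords: keywords, search: Map.get(params, "search")}
  end

  # A package must carry every requested keyword (AND) and, when a search
  # term is given, also match that term.
  def filter(packages, %{keywords: keywords, search: search}) do
    Enum.filter(packages, fn pkg ->
      Enum.all?(keywords, &(&1 in pkg.keywords)) and search_match?(pkg, search)
    end)
  end

  defp search_match?(_pkg, nil), do: true

  defp search_match?(pkg, search) do
    String.contains?(String.downcase(pkg.description), String.downcase(search))
  end
end

# Hypothetical sample data; stored keywords are assumed to be normalized already.
packages = [
  %{name: "login_form", description: "Login form with OAuth",
    keywords: ["live_component", "authentication"]},
  %{name: "date_picker", description: "Date picker", keywords: ["live_component"]}
]

filters = KeywordFilterSketch.parse("keywords=LiveComponent+authentication")
KeywordFilterSketch.filter(packages, filters) |> Enum.map(& &1.name)
# => ["login_form"]
```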

@leandrocp
Contributor

I can see that being very useful for linking related projects to a main topic, e.g. all broadway_* packages linked to a #broadway keyword, all nerves_* packages linked to #nerves, and so on. That would make it easier to find "sub-ecosystem" packages among all the packages on Hex.

@ericmj
Member

ericmj commented Feb 21, 2023

Thanks for starting this proposal @bcardarella. Improving search and discoverability would be really good for Hex and the community.

> There are a few potential solutions. The first, and easiest, would be to treat the search terms as an and search instead of an or search, giving higher priority to terms that appear earlier in the query. With openid authentication, openid would have higher sort priority than authentication.

I think one big part of the issue here is that we search both name and description. Maybe we should default to searching names only and put description search under a label such as description:foo. Another option would be to sort all name hits before description hits.
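Both options can be sketched in a few lines. This is an in-memory illustration only; the `NameFirstSketch` module and the package maps are hypothetical, and the real search would presumably live in the database query.

```elixir
defmodule NameFirstSketch do
  # Option A: a bare term searches names only, while "description:foo"
  # opts in to searching descriptions.
  def search_by_label(packages, "description:" <> term), do: filter(packages, :description, term)
  def search_by_label(packages, term), do: filter(packages, :name, term)

  # Option B: search both fields, but list name hits before
  # description-only hits.
  def search_name_first(packages, term) do
    name_hits = filter(packages, :name, term)
    description_hits = filter(packages -- name_hits, :description, term)
    name_hits ++ description_hits
  end

  # Case-insensitive substring match on a single field.
  defp filter(packages, field, term) do
    term = String.downcase(term)

    Enum.filter(packages, fn pkg ->
      pkg |> Map.fetch!(field) |> String.downcase() |> String.contains?(term)
    end)
  end
end

# Hypothetical sample data.
packages = [
  %{name: "plug_auth", description: "Authentication plugs"},
  %{name: "ueberauth", description: "Works with openid providers"},
  %{name: "openid", description: "OpenID client"}
]

NameFirstSketch.search_name_first(packages, "openid") |> Enum.map(& &1.name)
# => ["openid", "ueberauth"]
```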

> The second option would be to use Nx to build a semantic search function for Hex.

Can we run this on generic CPUs or do we need more specialized hardware?

> I would recommend the following update to the mix.exs spec:

I think :keywords should be under :package unless there is other tooling that will also use it.
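For illustration, that placement might look like the fragment below. The :keywords entry is the field proposed in this thread, not an option Hex reads today; :licenses and :links are existing package options, and @scm_url is the module attribute from the example above.

```elixir
defp package do
  [
    licenses: ["MIT"],
    links: %{"GitHub" => @scm_url},
    # Proposed in this thread; hypothetical, not part of the current spec.
    keywords: ~w(live_view_component authentication)
  ]
end
```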

> @ericmj voiced the concern around similarly intended keywords. For example: LiveComponent and live_component. To further illustrate this we could add live-component. I believe we can account for this by normalizing the keywords during publishing:

Maybe we can also give suggestions if there are similar tags that are already used by many packages? This would need to pre-check during mix hex.publish before we upload the package tarball.
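One possible shape for such a suggestion check, as a sketch only. It uses `String.jaro_distance/2` from the standard library; the `KeywordSuggestSketch` module, the usage counts, and the 0.85 threshold are assumptions, and nothing like this exists in mix hex.publish as far as this thread indicates.

```elixir
defmodule KeywordSuggestSketch do
  # `existing` maps an already-published keyword to the number of packages
  # using it. Returns similar keywords, most widely used first.
  def suggest(keyword, existing, threshold \\ 0.85) do
    existing
    |> Enum.reject(fn {known, _count} -> known == keyword end)
    |> Enum.map(fn {known, count} ->
      {known, count, String.jaro_distance(keyword, known)}
    end)
    |> Enum.filter(fn {_known, _count, similarity} -> similarity >= threshold end)
    |> Enum.sort_by(fn {_known, count, _similarity} -> count end, :desc)
    |> Enum.map(fn {known, _count, _similarity} -> known end)
  end
end

# Hypothetical usage counts.
existing = %{"live_component" => 40, "authentication" => 120}

KeywordSuggestSketch.suggest("live_components", existing)
# => ["live_component"]
```

The client could print these suggestions and ask for confirmation before uploading the tarball.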

@ericmj
Member

ericmj commented Feb 21, 2023

It looks like there are 3 scopes of work here:

  1. Improving search by changing or to and.
  2. Keywords.
  3. Semantic search.

(1) is the easiest and would be a direct improvement over the current search. We need to look into how it would affect the description full-text search, but we can make something work.

(2) I have long been hesitant to add keywords or categories because I believed a good keyword system would need some kind of moderation. But with good normalization and keyword suggestions we may not need it. It's also very likely better to have a keyword system than none, even if not every package ends up perfectly categorized.

(3) This is the biggest unknown for me, mostly because I have little practical experience with Nx. It seems to be the most work, but I also think it could be very useful.

@bcardarella
Author

keywords can go into package, that's 👍 from me.

I can check with Sean about using Nx and what hardware support would be needed. If it is going to require a change in hosting I would suggest not using Nx for this. I believe we can improve search experience quite a bit with fulltext changes alone.

@inoas

inoas commented Mar 5, 2023

  • I assume categories (or keywords) would be a controlled vocabulary, probably controlled by the hex.pm team?
    • IMHO best if there was a (json) web endpoint to get them from, so that one can validate against them.
    • The controlled vocabulary would probably be append only - maybe allow deprecation/obsolete-marking of categories?
  • tags on the other hand could be uncontrolled

I am also curious whether we could add a list of languages a library/project revolves around, for instance: Erlang, Elixir, LFE, Gleam, JSON, JSON-Schema, GraphQL, XML, XSD, SQL, HTML, CSS, JavaScript, TypeScript. When publishing, you could supply the list of languages the project revolves around or uses.

@inoas

inoas commented Mar 5, 2023

I am also curious if you have yet found a way to supply information that could go into meta description tag per module. The reason would be search engine discovery. That description could however also help for package search.

@ericmj
Member

ericmj commented Mar 5, 2023

I don’t think we will have both categories and tags, to me they serve the same purpose, just different ways of doing it.

I am hesitant about anything that requires moderation, such as maintaining a list of approved categories/tags, because it does not scale unless we also build some good tooling around the moderation.

Packages already have a description metadata field.

@leandrocp
Contributor

Just my 2 cents. Taking the Rust ecosystem as an example, they provide a curated list of categories (https://crates.io/category_slugs) from which authors can pick up to 5, if I'm not mistaken. Besides that, authors may set keywords; see https://crates.io/crates/serde, which displays keywords at the top and categories in the metadata section. I don't know the motivation for having both, and I understand the concern about keeping that list up to date, so it seems like keywords chosen by authors would work well for this proposal.

@lpil

lpil commented Mar 6, 2023

Fab suggestion, thanks @bcardarella! I've often wondered how we could make package discovery easier.

I think normalising the keywords could be quite challenging. If I were to make an embeddable scripting language I might pick any of "interpreter", "vm", "virtual-machine", or "abstract-machine", but really they all mean the same thing.

My main issue with the fixed list of Rust categories is that there just aren't very many to pick from, and I often struggle to find categories that feel suitable. I think I could be happy with a fixed list, but it would need to be somewhat larger than what Rust offers.
