DrHyde edited this page Jul 6, 2011 · 2 revisions

Searching and Sorting

The search results could be improved in two ways. First, the results should be as relevant as possible. Second, they should by default be sorted so that the most useful results are on the first page. These are two separate concerns, and relevance comes first: sorting is useless if the results are not relevant.

Improving Relevance

Some ways that the search results might be made more relevant:

  • Rank the presence of a word in the distro name very highly. If someone searches for "Moose", then the Moose distro absolutely must be the first result, followed by other distros (not modules!) with Moose in the name.
    • This may need some massaging so that the search engine knows that "Moose" and "MooseX" are synonyms.
  • If all (or most) of the modules in a distro include the word in their names, that should improve the result's relevance score. By contrast, if just one module of many matches, it should reduce relevance. For example, a search for "Moose" should find Moose, MooseX modules, Any::Moose, etc. It should not consider the presence of Mason::Moose in the Mason distro very relevant.
  • The presence of the word in a module's documentation should be considered much less relevant than the above two criteria.
  • We need a way to de-rank distros. A good example is MARC-Moose, a badly named distro that includes Moose in every module name in the distro. It will be considered relevant by the first two criteria, so we need some way to override them in cases like MARC-Moose.
    • This could be done either by an administrator explicitly de-ranking this distro for the term Moose, or by a voting mechanism in the search results that lets users vote on the quality of a result for the given search term.
    • This voting should not be confused with voting on the quality of the distro as a whole. MARC-Moose might be fabulously useful at whatever it does, but it's not a good result when searching for "Moose".
    • The risk here is that most end users won't know what the difference means, so the admin route may be more effective. There's no reason there couldn't be 100 admins. Maybe we could combine admins with a "report bad search result" link. Admins could review reports and decide whether or not to take action.
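
As a rough illustration of how the criteria above might combine, here is a minimal Python sketch. Everything in it is an assumption made for the example: the weights, the `Distro` record, the rule that treats "MooseX" as a synonym by allowing a trailing "x", and the `DERANK` table standing in for admin decisions. It also simplifies the "one module of many" rule: a low share of matching modules earns a small bonus rather than an explicit penalty.

```python
from dataclasses import dataclass

# All weights below are illustrative assumptions, not values from this page.
W_DISTRO_NAME = 10.0   # term appears in the distro name
W_MODULE_SHARE = 5.0   # scaled by the fraction of modules whose names match
W_DOCS = 1.0           # term appears in the documentation text

# Hypothetical admin-maintained overrides: (distro, term) -> multiplier.
DERANK = {("MARC-Moose", "moose"): 0.1}


@dataclass
class Distro:
    name: str
    modules: list[str]
    doc_text: str = ""


def name_tokens(name: str) -> list[str]:
    """Split 'Any::Moose' or 'MARC-Moose' into lowercase tokens."""
    return [t.lower() for t in name.replace("::", "-").split("-") if t]


def token_matches(token: str, term: str) -> bool:
    """Assumed synonym rule: 'moosex' counts as a match for 'moose'."""
    return token == term or token == term + "x"


def relevance(distro: Distro, term: str) -> float:
    term = term.lower()
    score = 0.0
    # Criterion 1: the term is part of the distro name.
    if any(token_matches(t, term) for t in name_tokens(distro.name)):
        score += W_DISTRO_NAME
    # Criterion 2: share of module names that contain the term.
    if distro.modules:
        hits = sum(
            any(token_matches(t, term) for t in name_tokens(m))
            for m in distro.modules
        )
        score += W_MODULE_SHARE * hits / len(distro.modules)
    # Criterion 3: a documentation match counts for much less.
    if term in distro.doc_text.lower():
        score += W_DOCS
    # Admin override: de-rank a specific distro for a specific term only.
    return score * DERANK.get((distro.name, term), 1.0)
```

With these made-up weights, Moose and MooseX-Types both outscore Mason (one matching module of several), and MARC-Moose drops below them for the term "Moose" only, leaving its score for any other term untouched.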

Improving Sorting

Once we have relevant results, we still want to sort them. Obviously, a relevance score can be used in the sorting, but there are lots of other things that could be included:

  • CPAN ratings
  • How many modules depend on this one? (CPANdeps has an XML feed for the distributions that this one depends on; I could easily add one for the distributions that depend on it -- DrHyde)
  • Web of Trust (trusted module authors explicitly trust this distro)
  • Kwalitee score
  • Test pass rate
  • Lots of other things
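
One possible way to fold several of these signals into a single sort key is a weighted sum, sketched below in Python. The weights, the `Result` fields, and the normalisation choices (ratings assumed to be on a 0..5 scale, kwalitee and relevance assumed to be already scaled to 0..1, dependents divided by the largest count in the result set) are all assumptions for illustration. Web of Trust is left out because the page does not say how it would be measured.

```python
from dataclasses import dataclass

# Hypothetical weights; they sum to 1.0 so the combined score stays in [0, 1].
WEIGHTS = {
    "relevance": 0.5,
    "rating": 0.15,
    "dependents": 0.15,
    "kwalitee": 0.1,
    "test_pass_rate": 0.1,
}


@dataclass
class Result:
    distro: str
    relevance: float       # assumed already normalised to 0..1
    rating: float          # CPAN ratings average, assumed 0..5
    dependents: int        # number of distros that depend on this one
    kwalitee: float        # assumed normalised to 0..1
    test_pass_rate: float  # fraction of passing test reports, 0..1


def sort_score(r: Result, max_dependents: int) -> float:
    """Weighted sum of the signals, each scaled to 0..1 first."""
    signals = {
        "relevance": r.relevance,
        "rating": r.rating / 5.0,
        "dependents": r.dependents / max_dependents if max_dependents else 0.0,
        "kwalitee": r.kwalitee,
        "test_pass_rate": r.test_pass_rate,
    }
    return sum(WEIGHTS[k] * v for k, v in signals.items())


def sort_results(results: list[Result]) -> list[Result]:
    """Return results ordered from highest to lowest combined score."""
    max_dep = max((r.dependents for r in results), default=0)
    return sorted(results, key=lambda r: sort_score(r, max_dep), reverse=True)
```

Giving relevance the largest weight reflects the point made at the top of the page: the other signals should reorder results that are already relevant, not promote irrelevant ones.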