This repository has been archived by the owner on Jan 16, 2021. It is now read-only.

Proposal for Package Ranking #320

Open
carlisia opened this issue Sep 24, 2015 · 20 comments
@carlisia

Overview

Rank by Objective Quality Standards

A mechanism to transparently and objectively rank projects by quality level.

We're proposing to avoid the use of a user-based rating or feedback system. It would require the creation and maintenance of accounts and would be subject to abuse. Instead, we propose using a published set of project quality standards that are not subject to simple manipulation:

  • Percent of code coverage achieved by unit tests
  • Presence of package and type documentation for public types
  • Number of forks and downloads on repository.
  • Measure of recent activity on the project (commits in the last X days)
  • Number of includes in other public projects.
  • Number of pageviews for that project on GoDoc.org
  • [other criteria to be established by the community]

This could use (via API or similar) the system implemented by GoReportCard, for example:
[Image: GoReportCard example]

The results of the ranking score would be used as the primary sort mechanism when browsing packages by search or by category. The information would also be linked/displayed at the top of each documentation page on Godoc.org.
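As a rough illustration of using a ranking score as the primary sort key, here is a minimal sketch. The `Package` fields, the `Score` function, and all of the weights are assumptions made up for this example; the proposal does not fix a formula, and this is not how gddo computes its score.

```go
package main

import (
	"fmt"
	"sort"
)

// Package holds a few of the signals named in the proposal.
// The field set is an illustrative assumption, not a gddo type.
type Package struct {
	Path        string
	Importers   int     // other public packages importing this one
	Stars       int     // GitHub stars
	DocCoverage float64 // fraction of exported identifiers with docs, 0..1
}

// Score folds the signals into one number. The weights are hypothetical.
func Score(p Package) float64 {
	return 1.0*float64(p.Importers) + 0.1*float64(p.Stars) + 10.0*p.DocCoverage
}

// Rank returns a copy of pkgs sorted by descending score.
func Rank(pkgs []Package) []Package {
	out := append([]Package(nil), pkgs...)
	sort.SliceStable(out, func(i, j int) bool {
		return Score(out[i]) > Score(out[j])
	})
	return out
}

func main() {
	pkgs := []Package{
		{Path: "example.com/a", Importers: 2, Stars: 500, DocCoverage: 0.2},
		{Path: "example.com/b", Importers: 80, Stars: 40, DocCoverage: 0.9},
		{Path: "example.com/c", Importers: 0, Stars: 3, DocCoverage: 1.0},
	}
	for _, p := range Rank(pkgs) {
		fmt.Printf("%-16s %.1f\n", p.Path, Score(p))
	}
}
```

Keeping the weights in one function makes it easy to publish them alongside the score, which is what the "transparent" part of the proposal asks for.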

References:
Examples of quality assessment tools: https://medium.com/@jgautheron/quality-pipeline-for-go-projects-497e34d6567

Include Ranking in Display

Once the ranking system is in place, include a summary of the score (and link to view details/explanation) on both the documentation pages and in a second column on search results.


Contributors to this proposal

@carlisia
@gdey
@rafaeljusto

@adg
Contributor

adg commented Sep 24, 2015

I'm in favor of adding more signal to godoc.org's ranking algorithm. As such, I'm generally in favor of this proposal.

Let me comment on a few of the proposed metrics, and how they might be computed:

Percent of code coverage achieved by unit tests

This requires integration with a continuous integration system. As such, it's probably the hardest one to gather data for. Defer til last. (Also, this is easily gamed; just add a function 100k lines long and write one test to invoke it.)
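The gaming concern is simple arithmetic. The sketch below assumes coverage is computed as covered statements divided by total statements; the package sizes are invented for illustration.

```go
package main

import "fmt"

// Coverage returns the percentage of statements executed by tests.
func Coverage(covered, total int) float64 {
	if total == 0 {
		return 0
	}
	return 100 * float64(covered) / float64(total)
}

func main() {
	// Hypothetical package: 2,000 real statements, none of them tested.
	fmt.Printf("honest: %.1f%%\n", Coverage(0, 2000))

	// Same package after adding one 100,000-statement filler function
	// and a single test that calls it.
	fmt.Printf("gamed:  %.1f%%\n", Coverage(100000, 102000))
}
```

The second figure comes out at roughly 98% even though none of the real code is exercised.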

Presence of package and type documentation for public types

Sounds good, if we're not doing this already (can't recall). I implemented this in a separate project a long time ago.

Number of forks and downloads on repository.
Measure of recent activity on the project (commits in the last X days)

How do we measure repository downloads? GitHub stars seem like good signals. Forks are ambiguous (@garyburd raises some interesting questions).

Number of includes in other public projects.

We already have this data. Seems like a no-brainer.

Number of pageviews for that project on GoDoc.org

We're not currently gathering this data, but it's probably worth doing.

When we decide to embark on implementing any of these specific metrics, please create a separate issue for that particular metric so that we can nail down the design before implementation.

@jbuberel

@garyburd That's a good point about low activity on an established, high-quality project. Do you agree with @adg's suggestion that GitHub stars are a reasonable proxy for "this is a good project"?

One of the odd things about Github stars is that they're cumulative, with no decay function. You can star a project at a point when it was well maintained. A year goes by, it's been abandoned, but your star is still there. Hmmm. I'm having a hard time thinking of a better replacement.

@adg's spot on about the percent test coverage metric. That would require real compute time (and real money). That being said, it would be one of the better signals of project quality.

Regarding download counts: I do not think there is any real way of doing this. Even the GitHub repos API does not report this. You do get stargazers_count though.
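For reference, a minimal sketch of reading the star count from a GitHub repository response. It decodes a made-up sample payload rather than calling the API; the `Repo` type and `ParseRepo` helper are names invented here, while `stargazers_count`, `forks_count`, `fork`, and `pushed_at` are fields of the repository JSON.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"time"
)

// Repo mirrors a few fields of the GitHub repository JSON document.
// There is no download count among them.
type Repo struct {
	FullName   string    `json:"full_name"`
	Fork       bool      `json:"fork"`
	Stargazers int       `json:"stargazers_count"`
	Forks      int       `json:"forks_count"`
	PushedAt   time.Time `json:"pushed_at"`
}

// ParseRepo decodes a repository JSON document.
func ParseRepo(data []byte) (Repo, error) {
	var r Repo
	err := json.Unmarshal(data, &r)
	return r, err
}

func main() {
	// Invented sample payload, trimmed to the fields used above.
	sample := []byte(`{
		"full_name": "example/project",
		"fork": false,
		"stargazers_count": 42,
		"forks_count": 7,
		"pushed_at": "2015-09-01T12:00:00Z"
	}`)
	r, err := ParseRepo(sample)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(r.FullName, r.Stargazers, r.Forks, r.PushedAt.Year())
}
```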

@gdey

gdey commented Sep 25, 2015

@garyburd

> What are your goals for search ranking and how does ranking by these metrics support the goals?
> One goal might be to help developers find the best package in some domain. Ranking by number of imports supports this goal because each import is an indication that a developer found the package useful.

As you pointed out there are two goals for developers looking for packages.

  1. Make it easier for developers to find packages in a domain.
  2. Make it easier for developers to evaluate the quality of packages in a domain.

Towards this end, rankings based on imports and tests help. As far as tests go, I think the bar needs to be fairly liberal: is there at least one test, and does it cover at least 5% of the code? I don't think we should expect the percentage to be more than a single digit. I think the more important metric is usage in other projects.

For quality, I think the percentage of documentation for the package and its public members is important. The presence of examples, and especially testable examples, is even more important, as is a ranking based on the output of go fmt, go vet, and other code linting tools.

But there is another audience as well; that is the package authors themselves.
For them the goal is to provide feedback on how they can improve the quality of their projects. Having a ranking system that is objective and that shows what you need to do to improve your package's standing will, I believe, raise the baseline quality of all packages.

> Ranking by recent activity does not necessarily support this goal. Activity on a high quality mature package can be low. Activity on a buggy package can be high.

A part of the original proposal document that went towards building these requests was left out by mistake.

Archive Expired Packages

An archive section to hold libraries that have been inactive for a given period of time and are not used by a minimum number of other active projects. This would help keep the listing to only active projects. Inactive is defined as "Most recent commit > 365 days ago" and "number of imports < 10", and can be adjusted.

The 365 and 10 are just placeholders and can be changed if needed. The idea here is that we also need to group or filter packages so that there isn't an overwhelming amount of choice.

I would like to reference #90 as another issue that is trying to solve the problem of having too many packages to tell the trees from the forest.

@adg adg added the proposal label Sep 25, 2015
@adg
Contributor

adg commented Sep 25, 2015

Note that this is related to #52

@adg
Contributor

adg commented Sep 25, 2015

And #172

@jbuberel

This is a terrific idea, @garyburd:

I suggest writing a command line app to generate a list of the packages to archive for a given import count and last commit time, and checking the results to see if the filter is doing the right thing.

Implement the proposed filtering criteria, then test against the search results for a common term with a large result set, such as "web" or "sql" or "middleware". Generate side-by-side diff-able output with and without the filter.

Once we're confident that the filtering is "fair", then move onto changes in the ranking. Again, with the intent of being able to compare current vs proposed ranking so we can vet the diffs.

@jbuberel

@garyburd Very much agreed.

@rafaeljusto

@garyburd I started building the command line:
https://github.com/rafaeljusto/gddoexp

@dmitshur
Contributor

The expired package idea looks promising. My intuition is that a conservative filter of 0 imports from outside of the repo and no commits in two years will filter out a lot of junk.

I think that's a great idea. But, an observation, that will have false positives for commands or libraries meant to be used at go generate time, since they're typically imported from other packages in // +build ignore files. For similar reasons, it won't work for libraries that are meant to be used with OSes or architectures that godoc does not support/know about.

@rafaeljusto

I ran the tool that checks the expired packages on a database dump from 2015-10-01. It analyzed 132277 packages in 36h45m due to Github rate limit policies. The results:

%       Description
3.88    not a Github project
2.75    should be archived
0.63    unexpected status code from Github

The unexpected status codes from Github are probably rate limit issues (403 Forbidden) that could be solved by adjusting the token bucket values or by analyzing the HTTP response headers from Github. The tool's algorithm currently makes two checks to identify whether a package should be archived:

  • No other packages reference the analyzed package (ImporterCount)
  • Package wasn't modified in the last 2 years (Github response)

We also got 6 connection timeouts.

@jbuberel

jbuberel commented Oct 5, 2015

Indeed, the list of "should be archived" is crucial here. We need to sanity check that to ensure that no legitimately keep-worthy projects would get the archive treatment :-) @rafaeljusto, can you pastebin or gist it for us?

@rafaeljusto

I checked some of them to see if they were modified in the last 2 years, but I didn't check whether they were referenced by other packages; I'm trusting the gddo database information.

Here is the list of packages to archive:
https://gist.github.com/rafaeljusto/0ef14863b39c23517e0a

@rafaeljusto

Sure! I've created another program that reports the packages with score 0 (zero) from an input list. So, from the list of packages that should be archived we have:

%       Description
67.30   has no score
32.70   has score

The list of packages with a score that should be archived is here:
https://gist.github.com/rafaeljusto/d2795a100f4661b9b126

@rafaeljusto

Working on it. =)

@rafaeljusto

I've created a new filter that checks for forks with a maximum of 2 commits in the week after the fork date; I called them "fast forks".

On the list of scored packages that should be archived, when applying this filter we got:

%       Description
51.05   fast fork
48.95   not fast fork

The list of packages after applying these 2 filters can be found below:
https://gist.github.com/rafaeljusto/3131e9e43c905d2e0808

@jbuberel

jbuberel commented Oct 8, 2015

I just spot-checked about 50 items from the new list, and I didn't see any false-positives (project that would have been archived but should not have been). So far, LGTM.

@jbuberel

jbuberel commented Oct 9, 2015

Sounds like we're in general agreement that the new fast-fork filter is working well as an identifier of packages that should be considered "archived" and therefore not displayed in search results.

Using that filtered set as the base, it makes sense to begin experimenting with the rankings (the primary goal of this proposal). Given that gddo already takes import counts into account, how about an experiment to apply the stored page view counts for a small set of common search terms ("sql" and "middleware"), with outputs that allow us to diff/compare the ordering of:

  • filtered + ranked by imports
  • filtered + ranked by imports + ranked by pageviews
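The two orderings above could be compared with something like the following sketch. How imports and pageviews should be combined is not specified in the thread; the `Combined` score and its 0.01 weight are assumptions, as are the type and function names and the sample data.

```go
package main

import (
	"fmt"
	"sort"
)

// Pkg carries the two signals being compared.
type Pkg struct {
	Path    string
	Imports int
	Views   int
}

// ImportsOnly scores by import count alone.
func ImportsOnly(p Pkg) float64 { return float64(p.Imports) }

// Combined adds weighted pageviews; the weight is a made-up value.
func Combined(p Pkg) float64 { return float64(p.Imports) + 0.01*float64(p.Views) }

// RankBy returns package paths ordered by descending score.
func RankBy(pkgs []Pkg, score func(Pkg) float64) []string {
	sorted := append([]Pkg(nil), pkgs...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return score(sorted[i]) > score(sorted[j])
	})
	paths := make([]string, len(sorted))
	for i, p := range sorted {
		paths[i] = p.Path
	}
	return paths
}

// Shifts maps each path to its old position minus its new position,
// so a positive value means the package moved up.
func Shifts(before, after []string) map[string]int {
	pos := make(map[string]int, len(before))
	for i, p := range before {
		pos[p] = i
	}
	out := make(map[string]int, len(after))
	for i, p := range after {
		out[p] = pos[p] - i
	}
	return out
}

func main() {
	// Invented data standing in for filtered search results.
	pkgs := []Pkg{
		{Path: "example.com/a", Imports: 50, Views: 100},
		{Path: "example.com/b", Imports: 40, Views: 5000},
		{Path: "example.com/c", Imports: 10, Views: 200},
	}
	before := RankBy(pkgs, ImportsOnly)
	after := RankBy(pkgs, Combined)
	shift := Shifts(before, after)

	// One row per rank, so the two orderings can be read side by side.
	for i := range before {
		fmt.Printf("%d  %-16s %-16s %+d\n", i+1, before[i], after[i], shift[after[i]])
	}
}
```

Writing each ordering to its own file instead would allow an ordinary text diff.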

@rafaeljusto

I ran the tool again, this time replacing the 2-year condition with the fast fork condition. It analyzed 132277 packages in 4h0m42s (we got many cache hits). The results:

%       Description
3.88    not a Github project
14.14   should be archived
1.47    unexpected status code from Github

We also got 9 connection timeouts.

The new list of packages that could be archived is below:
https://gist.github.com/rafaeljusto/db8318b69efb10f622aa

This increased the share of packages to archive by 11.39 percentage points compared with the first result (from 2.75% to 14.14%). I think we could apply both rules: we archive if the package is a fast fork or has gone more than two years with no changes, in both cases provided that no other packages reference it.

There are other cases where we got a 404 from the Github API that we could also archive, but these are only a few cases. I still need to work on the tool to avoid the rate limit and decrease the "unexpected status code from Github" percentage.

PS: I will be offline for a week (hello vacations!)

@carlisia
Author

This writeup is relevant for this discussion: https://github.com/mikeal/go-stats/blob/master/README.md

rafaeljusto added a commit to rafaeljusto/gddo that referenced this issue Oct 29, 2015
After a discussion, we decided that many packages of the gddo database could be
suppressed from the search results. We are currently adopting 2 rules to
suppress a package:

1. Package project wasn't modified in the last 2 years and there're no other
projects with references to it.

2. Package project is a fork with a small number of commits near the fork date
(what we called a fast fork).

The periodic will check all the packages from the repository and send queries
to the Github API to determine the current state.

See golang#320
@jamra

jamra commented Dec 15, 2015

@carlisia I don't know why package count should be compared with other communities. It shouldn't matter. Some Go projects on github divide their package into many small packages. That is the Go way to do it and helps make things go gettable.

In terms of ranking packages on godoc, you bring up a good point: We can use an "imported by" metric to rank packages. That could remove some of the noise added by some of these sub packages.


7 participants