
Report Median Review Score #656

Closed

oxinabox opened this issue Apr 14, 2019 · 22 comments

oxinabox (Contributor) commented Apr 14, 2019

Problem you are facing

The mean is a poor metric for summarising scores from multiple sources,
particularly with small numbers of reviewers,
since a single reviewer who either hates or loves something
can completely swing the score.

Possible Solution

A better metric is the median.
An alternative is the mean, dropping the best and worst scores.

Context

I was looking at which talks to accept for our conference.
Our nominal cutoff is a score of 4.
I was double-checking those that were borderline.
The first one I checked was scored 5, 4, 4, 4, 2, so a net score of 3.8.
I'd rather ignore that 2 and 5 and see it summarised as 4.
Then, when sorting, it would have appeared in the "definitely accept" area,
not the "borderline" one.

rixx (Member) commented Apr 14, 2019

Thank you for your input! I'm definitely willing to work on the review scoring evaluation here – I have a couple of questions to figure out what the best way forward would look like:

  • Do you think the median is a display that will be useful for (nearly) every conference?
  • As an organiser, would you like this option to be configurable? (Please keep in mind that too many options are not a good thing, especially in software that already features a lot of them.)
  • Do you prefer the median, or the mean without considering spikes?
  • In your experience, how deep of an explanation would your fellow organisers (or maybe organisers of other conferences, if you have additional experience) require to understand which metric is applied to review scores?

oxinabox (Contributor, Author) commented Apr 14, 2019

Do you think the median is a display that will be useful for (nearly) every conference?

Yes. In general, for almost all purposes, the median is better than the mean.

As an organiser, would you like for this option to be configurable?

Not really; I don't see how it could be.
Just display mean and median, and allow sorting by both.

Do you prefer the median or the mean without considering spikes?

Median

In your experience, how deep of an explanations would your fellow organisers (or maybe organisers of other conferences, if you have additional experience) require to understand which metric is applied to review scores?

Very little; at least where I am from, mean and median are both covered in primary school maths.
I think if the column title said "mean" / "median", anyone could look up the definitions.

rixx (Member) commented Apr 14, 2019

I think if the column title said "mean" / "median", anyone could look up the definitions.

I think at least some help text on hover would be good.

Just display mean and median, and allow sorting by both.

"Just" is a lie! 😉

Up until now, your feature request sounded like you wanted to see the mean replaced by the median, hence my request for clarification. What purpose would the display of the mean serve, in your opinion? (We're trying to provide a comprehensive overview, after all, not a jumping-off point for statistical analysis.)

vchuravy commented Apr 14, 2019

If I had to choose between mean and median, I would choose the median because it is a more stable metric, more robust against outliers.

But in some ways I am actually interested in scores like this as a starting point for statistical analysis, since I am looking for the answer to: was there controversy here, or are all reviewers in agreement?

Maybe we can actually show a miniature histogram + median instead of just one value?
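A miniature text histogram along those lines is cheap to compute. A minimal sketch, not pretalx code; the function name and the 1–5 score scale are assumptions:

```python
from collections import Counter

def mini_histogram(scores, scale=range(1, 6)):
    """Render review scores as a compact per-score count, e.g. '2:1 4:3 5:1'."""
    counts = Counter(scores)
    # Only emit scores that actually occur, in ascending score order.
    return " ".join(f"{s}:{counts[s]}" for s in scale if counts[s])

# mini_histogram([5, 4, 4, 4, 2]) -> "2:1 4:3 5:1"
```

A string like this could sit next to the median in a table cell without needing chart rendering.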

rixx (Member) commented Apr 14, 2019

Maybe we can actually show a miniature histogram + median instead of just one value?

That escalated quickly. I think a histogram instead of a numerical value has a lot of problems. It is not intuitive for people to sort/filter/reason by, for one, and it doesn't serialize well. Having 50+ small histograms on a page will also potentially be a lot of visual clutter, helping nobody.

oxinabox (Contributor, Author) commented Apr 14, 2019

I would be happy to see mean replaced by median, but I assumed (for some reason) that having both was more acceptable.

oxinabox (Contributor, Author) commented Apr 15, 2019

To break it down some, here is a table.
I am assuming the normal case is 5 or fewer reviews:

| Reviews       | R0          | N1,P1       | N1,R0,P1     | N2,N1,P1,P2     | N2,N1,R0,P1,P2     |
|---------------|-------------|-------------|--------------|-----------------|--------------------|
| Median        | R0          | (N1+P1)/2   | R0           | (N1+P1)/2       | R0                 |
| Defanged Mean | Not defined | Not defined | R0           | (N1+P1)/2       | (N1+R0+P1)/3       |
| Mean          | R0          | (N1+P1)/2   | (N1+R0+P1)/3 | (N2+N1+P1+P2)/4 | (N2+N1+R0+P1+P2)/5 |

For 1 or 2 reviews, the median is the same as the mean, and the defanged mean is not defined.
For 3–4 reviews, the median is the same as the defanged mean.
At 5 reviews, they are all different.

If you have far more than 5 reviews, say 10 or 20,
then one starts thinking:
I would like to guard against 2 or 3 outlier reviews, so I want a doubly or triply defanged mean.

One could then define defanging proportionally, say dropping the 1st and 5th quintiles.
But it does start to feel a bit complicated.
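The "defanged mean" above (drop the k lowest and k highest scores before averaging) is a trimmed mean, and can be sketched in a few lines; this is illustrative, not pretalx code:

```python
from statistics import mean, median

def defanged_mean(scores, k=1):
    """Mean after dropping the k lowest and k highest scores.
    Not defined (None) with fewer than 2*k + 1 scores, matching the table."""
    if len(scores) < 2 * k + 1:
        return None
    trimmed = sorted(scores)[k:-k]
    return mean(trimmed)

scores = [5, 4, 4, 4, 2]  # the borderline talk from earlier in the thread
# mean(scores) == 3.8, median(scores) == 4, defanged_mean(scores) == 4
```

With these five scores, both median and defanged mean ignore the outlying 2 and 5, which is exactly the behaviour asked for in the issue.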

From a user-understanding perspective that seems awful, though TBH I suspect it is a better measure. It is the kind of thing the company I work for uses a bit for assessing the performance of some of our models, along with the median. However, we also have frequent debates on this kind of thing, so I think there is no clear answer.

At the end of the day, people serious about doing this
most likely are going to want to dump everything out into some easy-to-parse format,
and then throw Julia/Python/R at it,
and think hard about whether they should be doing things like normalizing across reviewers (IMO answer: only if each reviewer reviews >20 things and >X% of the whole), or even doing some kind of rank merging (generate a ranked list for each reviewer, then merge those to get the final ranking; feels viable, and this is somewhat like how some grants are awarded).
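The per-reviewer normalization mentioned above could be sketched as a z-score per reviewer; all names here are hypothetical, and as noted it only makes sense once each reviewer has a decent sample size:

```python
from statistics import mean, stdev

def normalize_by_reviewer(reviews):
    """reviews: list of (reviewer, submission, score) tuples.
    Returns the same tuples with each score z-scored against that
    reviewer's own mean and standard deviation."""
    by_reviewer = {}
    for reviewer, _, score in reviews:
        by_reviewer.setdefault(reviewer, []).append(score)
    # Fall back to a spread of 1.0 for reviewers with one score or no variance.
    stats = {r: (mean(s), stdev(s) if len(s) > 1 else 1.0)
             for r, s in by_reviewer.items()}
    return [(r, sub, (score - stats[r][0]) / (stats[r][1] or 1.0))
            for r, sub, score in reviews]
```

A harsh reviewer's 4 and a generous reviewer's 4 then land at different normalized values, which is the correction being described.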

And for some conferences that want good debate, it might be that something scoring
(5, 5, 1, 1) is way better than something scoring (3, 3, 3, 3), or even than something scoring (5, 5, 2, 2).
So you can't please everyone with a single metric.

So really we want to use as the standard metric something that is good enough (i.e. minimal weird behaviours) for most people, and easy enough to understand that people know whether it is capturing what they in particular are interested in.

vchuravy commented Apr 15, 2019

My reason to look at a histogram, besides the mean, is that I am interested in the variance and in seeing whether the distribution is bimodal. But yes, in the end I am much more likely to get the data through the API and then do statistical analysis on it to prep for a discussion round.

rixx (Member) commented Apr 15, 2019

Thank you for your extensive explanations, they are appreciated. I very much agree with your assessment:

So really we want to use as the standard metric something that is good enough (i.e. minimal weird behaviours) for most people, and easy enough to understand that people know whether it is capturing what they in particular are interested in.

I'll have to try and find a way to build this data without the review list becoming very slow, as the mean is just a database-level aggregation, and the median isn't. I think anything more unusual than the median might be pushing it.
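For a sense of the asymmetry: the mean collapses to a single aggregate the database can compute, while the median needs the individual scores. A pure-Python sketch of the median side (data shapes hypothetical; in Django the mean could instead come from an `Avg(...)` annotation):

```python
import statistics

def score_summary(scores_by_submission):
    """scores_by_submission: dict mapping a submission code to its review scores.
    Returns {code: (mean, median)}; submissions without scores are skipped."""
    return {
        code: (statistics.fmean(scores), statistics.median(scores))
        for code, scores in scores_by_submission.items() if scores
    }

# score_summary({"ABC123": [5, 4, 4, 4, 2]}) -> {"ABC123": (3.8, 4)}
```

The performance concern above is that this per-submission Python work replaces a single SQL aggregate, so it scales with the number of reviews fetched.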

meisterluk (Contributor) commented Apr 20, 2019

+1 for the preference of median over mean.

I guess in terms of statistical analysis, it would be more important to be able to access feedback via the API. This is not possible so far, AFAICS. Example (there is no feedback included here):

{
  "code": "9AJUNL",
  "speakers": [
    {
      "code": "7RZJR9",
      "name": "Jakob Miksch",
      "biography": "",
      "avatar": "/media/Jakob_Miksch.jpg"
    }
  ],
  "title": "Malawi Atlas - eine SDI mit PostGIS, GeoServer und GeoExt",
  "submission_type": {
    "de": "Vortrag"
  },
  "track": null,
  "state": "confirmed",
  "abstract": "Der Malawi-Atlas ist eine Plattform, um Naturgefahren in Malawi mittels Geodaten zu visualisieren. Im Hintergrund wird auf bewährte Open-Source-Komponenten wie PostGIS, GeoServer, GeoExt und OpenLayers gesetzt.",
  "description": "Der Vortrag berichtet von Erfahrungen über Konzeption, Umsetzung und Dokumentation der Plattform \"Malawi Atlas\". Das Projekt  basiert auf einem vier Jahre alten Vorgänger, der auf moderne Technologien aktualisiert werden musste. Es wird erläutert, aus welchen Gründen die Komponenten PostGIS, GeoServer, GeoExt und OpenLayers gewählt wurden und welche Alternativen es noch gegeben hätte. Es wird ein Überblick über die verwendeten Daten geben und wie diese organisiert und strukturiert wurden. Anschließend wird genauer darauf eingegangen, wie Funktionen wie PDF-Erzeugung, dynamische Diagrammerstellung und Download technisch umgesetzt wurden.",
  "duration": "00:30",
  "slot_count": 1,
  "do_not_record": false,
  "is_featured": false,
  "content_locale": "de",
  "slot": {
    "room": {
      "de": "Physik Z254"
    },
    "start": "2019-03-15T10:00:00+01:00",
    "end": "2019-03-15T10:30:00+01:00"
  },
  "image": null
}

rixx (Member) commented Apr 21, 2019

We're currently not talking about feedback, though, but about reviews – please note the distinction!

Reviews wouldn't be embedded in the submission endpoint; once they arrive at the API level, they would receive a separate review/ endpoint instead, to avoid overloading the poor submission resource endpoint further.

rixx closed this in 0d06baa on Apr 21, 2019

meisterluk (Contributor) commented Apr 21, 2019

We're currently not talking about feedback, though, but about reviews – please note the distinction!

Oh man. Making too many mistakes recently. Thanks for pointing it out.

wedge-jarrad commented Aug 24, 2020

I'd like to offer a dissenting opinion. We just wrapped up a review and found that median made it difficult to rank talks.

We had a 5 person review team and asked each to rate each submission on a scale from 1 to 5 with 5 being the best. What we found was that the median makes a submission with scores 3 3 3 5 5 indistinguishable from a submission with scores 3 3 3 1 1. Bell curves being what they are, we ended up with the majority of submissions having a rating of 3.
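The tie described above is easy to reproduce with the standard library:

```python
from statistics import fmean, median

a = [3, 3, 3, 5, 5]  # well-liked submission
b = [3, 3, 3, 1, 1]  # poorly-liked submission

# Both medians are 3, so the two are indistinguishable under median alone,
# while the means (3.8 vs 2.2) separate them cleanly.
```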

Ranking is important to us because we often have a limited number of slots to fill and many qualified submissions to fill them with. What we found ourselves doing was looking through the reviews and mentally figuring out how they compared: "that one is a high three", "that three was mixed", and so on. Since so many talks ended up with a median of three, finding the best of the threes was a bit of a chore.

I appreciate the arguments made in favor of the median and agree to a point, but not having the average available to rank submissions for which the median is a tie was rather inconvenient. I would certainly appreciate having both available in the future. I know "simple" changes like that are often anything but, so thanks for considering it.

Thank you, @rixx, for Pretalx, by the way. This is our second year using it and it is quite an improvement over how we were doing things before. I need to get around to adding our event to your list one of these days :)

rixx (Member) commented Aug 24, 2020

Thank you for the feedback, @wedge-jarrad! I'd be willing to merge a PR that makes this configurable (either as "show both" or "show either", I don't really care). In the past I didn't want to add too many settings, but now that we have a dedicated review settings section, it probably wouldn't be too bad to have this setting. I won't have time to work on this myself, though.

As a note to the future implementer: please take care that the review page does not slow down significantly, as it's currently not paginated and I'd like to keep it that way if possible. Average scores can be added via DB annotations pretty easily, so that's probably the way to go. You can refer to the way this was done before we switched to median scores (though you may have to change some details).

oxinabox (Contributor, Author) commented Aug 24, 2020

We just finished our review for JuliaCon after the swap from mean to median.
For us it was really good, though I think we would rather have both to help with close calls.
I wonder if its utility also depends on the distribution of reviews.
We ended up accepting everything with a median of 4.5 or higher,
which boils down to needing a near-perfect review,
and which got us the number we were willing to accept.

We had a little space to take some that were median 4, and working out which of those to accept would have been helped if we also had the mean, I think.

But median was a definite improvement over last year's use of mean as the focus,
as we didn't have the scenario where one reviewer who just didn't get it would sink an otherwise perfectly scored proposal.

One day we will have the writable API,
and then we will be able to do things like pull all the reviews done,
normalize them by the reviewers' average ratings, then push the numbers back up,
which would let us correct for reviewers who are more or less harsh,
or do things like not counting all reviews from a given person who didn't follow policy.

But in the short term, mean+median would rock.

rixx (Member) commented Jul 7, 2021

Once again, closing this issue via 9d6f1e4 with a new solution: Organisers can now choose between mean and median, with median continuing to be the default.

I realise that this is not ideal, since some of you would like both to be present. However, pretalx has since gained the feature of independent review scores, and having both aggregation methods present would lead to unintuitive amounts of data: for each independent measure there would be three columns (your score, the average, and the median). I don't think that's particularly useful, so I introduced a switch between the two measurements instead.

For now, the switch is a setting, to ensure that all reviewers are talking about the same comparisons. I'm open to adding a table-wide toggle in the future, though this would probably need to be an outside contribution.

rixx closed this Jul 7, 2021

plaindocs commented Jul 7, 2021

The new mean/median selection is great! Thanks!

I'm going to add a request here, but let me know if I need to delete it or move it to a new issue.

What I'd really like to be able to do is make the mean a secondary sort criterion, so that all of my median-4s are then sorted by mean:

| Median | Mean |
|--------|------|
| 4      | 3.5  |
| 4      | 3.5  |
| 4      | 3.0  |
| 4      | 3.0  |
| 4      | 2.5  |

(regardless of whether these numbers specifically are possible)

Sort by median, then by mean. I suspect this is a non-starter, because the sort is configurable in the display.
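Independent of the UI question, the requested ordering is just a compound sort key (descending median, then descending mean); a sketch with made-up talks:

```python
from statistics import fmean, median

talks = {
    "A": [4, 4, 3],  # median 4, mean ~3.67
    "B": [4, 4, 4],  # median 4, mean 4.0
    "C": [5, 3, 3],  # median 3, mean ~3.67
}

# Tuples compare element by element, so the mean only breaks median ties.
ranked = sorted(talks, key=lambda t: (median(talks[t]), fmean(talks[t])),
                reverse=True)
# ranked == ["B", "A", "C"]
```

"B" and "A" tie on median and are separated by mean, while "C" sorts last despite sharing "A"'s mean.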

rixx (Member) commented Jul 7, 2021

I'd merge a PR that provides a good UI for this, but I can't think of one off the top of my head.

plaindocs commented Jul 7, 2021

This isn't a terrible write-up: https://ux.stackexchange.com/questions/34786/sorting-tables-after-multiple-columns

But all of the approaches have quite some display overhead. If one of them looks like something you might be able to live with, let me know and maybe I'll attempt a PR.

rixx (Member) commented Jul 7, 2021

Ah, but all of these still assume that you're fine with having a minimum of three pure-number columns at the start of your review dashboard, which I feel will look like a ton of overhead. In particular, because each independent review score will add two more columns, it'd be easy to end up with five or seven …

We could just make the non-included aggregation method a secondary sort choice by default, though that would feel a bit opaque.

plaindocs commented Jul 7, 2021

Agreed.

The current default secondary sort isn't transparent either, FWIW.

And unless I'm missing a setting somewhere, only organizers get to see the median sort anyway, not the reviewers.

rixx (Member) commented Jul 8, 2021

Reviewers who are able to see other reviews and review scores should also be able to see aggregate scores.
