
Report Median Review Score #656

Closed
oxinabox opened this issue Apr 14, 2019 · 12 comments

@oxinabox

commented Apr 14, 2019

Problem you are facing

The mean is a poor metric for summarising scores from multiple sources,
particularly with small numbers of reviewers,
since it allows a single reviewer who either hates or loves something
to completely swing the score.

Possible Solution

A better metric is the median.
An alternative is the mean after dropping the best and worst scores.

Context

I was looking at what talks to accept for our conference.
Our nominal cutoff is a score of 4.
I was double-checking those that were borderline.
The first one I checked was scored 5, 4, 4, 4, 2, for a net score of 3.8.
I'd rather ignore that 2 and 5 and see it summarised as 4.
Then, when sorting, it would have appeared in the "definitely accept" area,
not the "borderline" one.
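For concreteness, the three summaries under discussion can be sketched with Python's standard library, using the 5, 4, 4, 4, 2 scores from the example above:

```python
from statistics import mean, median

scores = [5, 4, 4, 4, 2]

# Plain mean: the single low outlier drags the talk below the cutoff of 4.
print(mean(scores))    # 3.8

# Median: robust to the single outlier.
print(median(scores))  # 4

# Mean with the best and worst score dropped (the "defanged" mean):
trimmed = sorted(scores)[1:-1]
print(mean(trimmed))   # 4
```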

@rixx

Member

commented Apr 14, 2019

Thank you for your input! I'm definitely willing to work on the review scoring evaluation here – I have a couple of questions to figure out what the best way forward would look like:

  • Do you think the median is a display that will be useful for (nearly) every conference?
  • As an organiser, would you like for this option to be configurable? (Please keep in mind that too many options are not a good thing, especially in software that already features a lot of them.)
  • Do you prefer the median or the mean without considering spikes?
  • In your experience, how deep an explanation would your fellow organisers (or maybe organisers of other conferences, if you have additional experience) require to understand which metric is applied to review scores?
@oxinabox

Author

commented Apr 14, 2019

Do you think the median is a display that will be useful for (nearly) every conference?

Yes. In general, for almost all purposes, the median is better than the mean.

As an organiser, would you like for this option to be configurable?

Not really; I don't see how it could be.
Just display both the mean and the median, and allow sorting by either.

Do you prefer the median or the mean without considering spikes?

Median

In your experience, how deep of an explanations would your fellow organisers (or maybe organisers of other conferences, if you have additional experience) require to understand which metric is applied to review scores?

Very little; at least where I am from, mean and median are both covered in primary-school maths.
I think if the column title said "mean" / "median", anyone could look up those definitions.

@rixx

Member

commented Apr 14, 2019

I think if the column title said "mean" / "median", anyone could look up those definitions.

I think at least some help text on hover would be good.

Just display mean, and median and allow to sort by both.

"Just" is a lie! 😉

Up until now, your feature request sounded like you wanted to see the mean replaced by the median, hence my request for clarification. What purpose would the display of the mean serve, in your opinion? (We're trying to provide a comprehensive overview, after all, not a jumping-off point for statistical analysis.)

@vchuravy


commented Apr 14, 2019

If I had to choose between mean and median, I would choose the median because it is a more stable metric, more robust against outliers.

But in some ways I am actually interested in scores like this as a starting point for statistical analysis, since I am looking for the answer to: was there controversy here, or are all reviewers in agreement?

Maybe we can actually show a miniature histogram plus the median instead of just one value?
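As a rough sketch of what such a miniature display could look like (the 1-5 score range and the text-bar rendering are assumptions for illustration):

```python
from collections import Counter

def mini_histogram(scores, lo=1, hi=5):
    """Render score counts as a compact text histogram, e.g. '1:## 5:##'."""
    counts = Counter(scores)
    return " ".join(f"{s}:{'#' * counts[s]}" for s in range(lo, hi + 1) if counts[s])

print(mini_histogram([5, 5, 1, 1]))  # 1:## 5:##  (bimodal: controversy)
print(mini_histogram([3, 3, 3, 3]))  # 3:####     (unanimous)
```

Both score sets have the same mean, but the shape makes the disagreement visible at a glance.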

@rixx

Member

commented Apr 14, 2019

Maybe we can actually show a miniature histogram plus the median instead of just one value?

That escalated quickly. I think a histogram instead of a numerical value has a lot of problems. It is not intuitive for people to sort/filter/reason by, for one, and it doesn't serialize well. Having 50+ small histograms on a page would also be a lot of visual clutter, helping nobody.

@oxinabox

Author

commented Apr 14, 2019

I would be happy to see mean replaced by median, but I assumed (for some reason) that having both was more acceptable.

@oxinabox

Author

commented Apr 15, 2019

To break it down some, here is a table.
I am assuming it is normal to have 5 or fewer reviews:

| Reviews | R0 | N1,P1 | N1,R0,P1 | N2,N1,P1,P2 | N2,N1,R0,P1,P2 |
|---|---|---|---|---|---|
| Median | R0 | (N1+P1)/2 | R0 | (N1+P1)/2 | R0 |
| Defanged Mean | Not defined | Not defined | R0 | (N1+P1)/2 | (N1+R0+P1)/3 |
| Mean | R0 | (N1+P1)/2 | (N1+R0+P1)/3 | (N2+N1+P1+P2)/4 | (N2+N1+R0+P1+P2)/5 |

For 1-2 reviews, the median is the same as the mean, and the defanged mean is not defined.
For 3-4 reviews, the median is the same as the defanged mean.
At 5 reviews, all three differ.

If you have far more than 5 reviews, say 10 or 20,
then one starts thinking:
I would like to guard against 2 or 3 outlier reviews, so I want a doubly or triply defanged mean.

One could then define defanging proportionally, say dropping the 1st and 5th quintiles.
But it does start to feel a bit complicated.
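A proportionally defanged mean along those lines might be sketched as follows; the choice to drop the lowest and highest fifth (and to round the cut count down) is an assumption about how the quintile trimming would be defined — this is usually called a symmetric trimmed mean:

```python
from statistics import mean

def defanged_mean(scores, drop_fraction=0.2):
    """Mean after dropping the lowest and highest `drop_fraction` of scores
    (count rounded down), i.e. a symmetric trimmed mean."""
    k = int(len(scores) * drop_fraction)
    trimmed = sorted(scores)[k:len(scores) - k]
    return mean(trimmed)

# With 5 reviews, drop_fraction=0.2 drops exactly the best and worst score:
print(defanged_mean([5, 4, 4, 4, 2]))  # 4
# With 10 reviews it drops the two best and two worst:
print(defanged_mean([1, 1, 4, 4, 4, 4, 4, 4, 5, 5]))  # 4
```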

From a user-understanding perspective that seems awful, though TBH I suspect it is a better measure. It is the kind of thing the company I work for uses a bit for assessing the performance of some of our models, along with the median. However, we also have frequent debates on this kind of thing, so I think there is no clear answer.

At the end of the day, people serious about doing this
most likely want to dump everything out into some easy-to-parse format,
throw Julia/Python/R at it,
and think hard about whether they should be doing things like normalising across reviewers (IMO answer: only if each reviewer reviews >20 things and >X% of the whole), or even doing some kind of rank merging (generate a ranked list for each reviewer, then merge those to get the final ranking; this feels viable, and it is somewhat like how some grants are awarded).
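One simple instance of the rank-merging idea above is a Borda count, sketched here as an illustration (the submission names and the point scheme are hypothetical, not anything pretalx implements):

```python
from collections import defaultdict

def borda_merge(rankings):
    """Merge per-reviewer ranked lists (best first) by summing Borda points:
    a submission in position i of an n-item list earns n - i points."""
    points = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for i, submission in enumerate(ranking):
            points[submission] += n - i
    return sorted(points, key=points.get, reverse=True)

# Three reviewers ranking three hypothetical submissions A, B, C:
print(borda_merge([["A", "B", "C"], ["B", "A", "C"], ["A", "B", "C"]]))
# ['A', 'B', 'C']  (A: 8 points, B: 7, C: 3)
```

A nice property is that each reviewer's list only needs to be ordinal, so it sidesteps the question of whether one reviewer's "4" means the same as another's.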

And for some conferences that want good debate, it might be that something scoring
(5, 5, 1, 1) is way better than something scoring (3, 3, 3, 3), or even than something scoring (5, 5, 2, 2).
So you can't please everyone with a single metric.

So really we want to use as the standard metric something that is good enough (i.e. has minimal weird behaviours) for most people, and is easy enough to understand that people know whether it captures what they in particular are interested in.

@vchuravy


commented Apr 15, 2019

My reason to look at a histogram, besides the mean, is that I am interested in the variance and in seeing whether the distribution is bimodal. But yes, in the end I am much more likely to get the data through the API and then do statistical analysis on it to prep for a discussion round.
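A minimal sketch of the disagreement check described here, using only the standard library — the standard deviation flags controversy even when the means match:

```python
from statistics import mean, pstdev

def controversy(scores):
    """Population standard deviation as a rough disagreement signal."""
    return pstdev(scores)

# Same mean (3), very different levels of agreement:
print(mean([3, 3, 3, 3]), controversy([3, 3, 3, 3]))  # 3 0.0
print(mean([5, 5, 1, 1]), controversy([5, 5, 1, 1]))  # 3 2.0
```

Detecting bimodality properly needs more than one number, but sorting by spread like this is enough to surface the (5, 5, 1, 1)-style cases for a discussion round.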

@rixx

Member

commented Apr 15, 2019

Thank you for your extensive explanations, they are appreciated. I very much agree with your assessment:

So really we want to use as the standard metric something that is good enough (i.e. has minimal weird behaviours) for most people, and is easy enough to understand that people know whether it captures what they in particular are interested in.

I'll have to try and find a way to build this data without the review list becoming very slow, since the mean is just a database-level aggregation and the median isn't. I think anything more unusual than the median might be pushing it.
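One way to avoid per-submission queries, sketched here under the assumption that all scores can be fetched in a single query, is to aggregate the mean in the database and compute the medians in application code in one pass:

```python
from collections import defaultdict
from statistics import median

def medians_per_submission(rows):
    """rows: iterable of (submission_id, score) pairs, e.g. the result of one
    query over the review table. Returns {submission_id: median score}
    without issuing a separate query per submission."""
    scores = defaultdict(list)
    for submission_id, score in rows:
        scores[submission_id].append(score)
    return {sid: median(vals) for sid, vals in scores.items()}

rows = [(1, 5), (1, 4), (1, 4), (1, 4), (1, 2), (2, 3), (2, 5)]
print(medians_per_submission(rows))  # {1: 4, 2: 4.0}
```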

@meisterluk

Contributor

commented Apr 20, 2019

+1 for the preference of median over mean.

I guess in terms of statistical analysis, it would be more important to be able to access feedback via the API. This is not possible so far, AFAICS. Example (there is no feedback included here):

{
  "code": "9AJUNL",
  "speakers": [
    {
      "code": "7RZJR9",
      "name": "Jakob Miksch",
      "biography": "",
      "avatar": "/media/Jakob_Miksch.jpg"
    }
  ],
  "title": "Malawi Atlas - eine SDI mit PostGIS, GeoServer und GeoExt",
  "submission_type": {
    "de": "Vortrag"
  },
  "track": null,
  "state": "confirmed",
  "abstract": "Der Malawi-Atlas ist eine Plattform, um Naturgefahren in Malawi mittels Geodaten zu visualisieren. Im Hintergrund wird auf bewährte Open-Source-Komponenten wie PostGIS, GeoServer, GeoExt und OpenLayers gesetzt.",
  "description": "Der Vortrag berichtet von Erfahrungen über Konzeption, Umsetzung und Dokumentation der Plattform \"Malawi Atlas\". Das Projekt  basiert auf einem vier Jahre alten Vorgänger, der auf moderne Technologien aktualisiert werden musste. Es wird erläutert, aus welchen Gründen die Komponenten PostGIS, GeoServer, GeoExt und OpenLayers gewählt wurden und welche Alternativen es noch gegeben hätte. Es wird ein Überblick über die verwendeten Daten geben und wie diese organisiert und strukturiert wurden. Anschließend wird genauer darauf eingegangen, wie Funktionen wie PDF-Erzeugung, dynamische Diagrammerstellung und Download technisch umgesetzt wurden.",
  "duration": "00:30",
  "slot_count": 1,
  "do_not_record": false,
  "is_featured": false,
  "content_locale": "de",
  "slot": {
    "room": {
      "de": "Physik Z254"
    },
    "start": "2019-03-15T10:00:00+01:00",
    "end": "2019-03-15T10:30:00+01:00"
  },
  "image": null
}
@rixx

Member

commented Apr 21, 2019

We're currently not talking about feedback, though, but about reviews – please note the distinction!

Reviews wouldn't be embedded in the submission endpoint; once they arrive at the API level, they would receive a separate review/ endpoint instead, to avoid overloading the poor submission resource endpoint further.

@rixx rixx closed this in 0d06baa Apr 21, 2019

@meisterluk

Contributor

commented Apr 21, 2019

We're currently not talking about feedback, though, but about reviews – please note the distinction!

Oh man. Making too many mistakes recently. Thanks for pointing it out.
