Statistically nonsensical to use skewness in label_distribution #659

@bauwenst

Description

The label_distribution measurement is supposed to quantify how biased a column of labels is towards one value or another. It computes the fraction each label value takes up out of all labels (which is useful) and, problematically, the skewness of that distribution:

```python
def _compute(self, data):
    """Returns the fraction of each label present in the data"""
    c = Counter(data)
    label_distribution = {"labels": [k for k in c.keys()], "fractions": [f / len(data) for f in c.values()]}
    if isinstance(data[0], str):
        label2id = {label: id for id, label in enumerate(label_distribution["labels"])}
        data = [label2id[d] for d in data]
    skew = stats.skew(data)
    return {"label_distribution": label_distribution, "label_skew": skew}
```

This is statistical nonsense.

A class label is a multinoulli variable: a discrete variable that can take on a finite number of values whose magnitudes have no meaning other than being different from each other. If you have classes {cat, dog, giraffe}, it has no meaning whether we choose to label cat as 0 or 3.1415, and it has no meaning whether cat is labelled 0 or dog is labelled 0, as long as the labels are distinct.

Skewness, however, is meant for continuous variables: it is sensitive both to the magnitudes of the values and to which class gets mapped to which value. It measures the symmetry of the distribution, not its uniformity.

  • The label column [0, 0, 0, 1, 1, 1, 2, 2, 2] has 0 skewness, because it is symmetrical.
  • The label column [0, 0, 1, 1, 1, 1, 1, 2, 2] has 0 skewness, because it is symmetrical. Yet, clearly, there is heavy bias towards label 1.
  • The label column [0, 0, 1, 1, 2, 2, 2, 2, 2] is exactly as biased as the previous one (2 labels, 2 labels, 5 labels) and yet now it has skewness != 0 because the weight of the distribution is "on the right", which has zero meaning.
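The three bullets above can be verified directly with the same `stats.skew` call the measurement uses:

```python
from scipy import stats

symmetric_uniform = [0, 0, 0, 1, 1, 1, 2, 2, 2]
symmetric_biased  = [0, 0, 1, 1, 1, 1, 1, 2, 2]  # heavy bias towards label 1
shifted_biased    = [0, 0, 1, 1, 2, 2, 2, 2, 2]  # same (2, 2, 5) counts, relabelled

print(stats.skew(symmetric_uniform))  # 0.0
print(stats.skew(symmetric_biased))   # 0.0 despite the bias
print(stats.skew(shifted_biased))     # ≈ -0.68, despite identical bias
```

The last two columns have identical class-count multisets, so any sensible bias measure should give them the same score; `skew` does not.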

The entropy of the labels is what you are looking for to measure uniformity, not skewness. Entropy is invariant under class permutation and is maximised by the uniform distribution. If you want to normalise it to [0, 1], divide by that maximal entropy (the Hartley function, i.e. the log of the number of classes).
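A minimal sketch of the proposed replacement (the function name `normalized_entropy` is hypothetical, not from the library):

```python
import math
from collections import Counter

def normalized_entropy(data):
    """Entropy of the label fractions divided by the Hartley maximum:
    1.0 for a perfectly uniform label column, 0.0 for a single class."""
    counts = Counter(data)
    fractions = [c / len(data) for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in fractions)
    max_entropy = math.log(len(counts))  # Hartley function
    return entropy / max_entropy if max_entropy > 0 else 0.0

print(normalized_entropy([0, 0, 0, 1, 1, 1, 2, 2, 2]))  # ≈ 1.0 (uniform)
print(normalized_entropy([0, 0, 1, 1, 1, 1, 1, 2, 2]))  # < 1.0 (biased)
# Permutation-invariant: relabelling the classes changes nothing.
print(normalized_entropy([2, 2, 0, 0, 0, 0, 0, 1, 1]))  # same as the previous
```

Unlike skewness, this assigns the same score to the two biased columns from the bullets above, and never depends on which integer a class happens to receive.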
