Fix number formatting for numeric columns #709

danielgavrilov · 2020-11-17T12:45:21Z

Fixes:

All formatting methods now accept overrides. These are necessary I think because the formatting depends on the context the number is formatted for. E.g. in the country picker we want to use short number prefixes (M = million, B = billion), but not in the tooltip. Both are using formatValueShort, because formatValueLong uses a long unit. We could introduce something likeformatValueVeryShort for the country picker use case, but that seems a bit much.
~~I removed the Population, PopulationDensity and Age columns. I don't think they are used anywhere?~~

It probably doesn't make sense to go that granular? Age could be used in different contexts – age of the planet? of people? life expectancy? They all might need slightly different settings. I think the current numeric types + formatting config are enough. Reverted. More comments below.

danielgavrilov · 2020-11-17T13:01:22Z

Actually I don't think we even need PercentageColumn and all those variations. Their magnitudes (and numDecimalPlaces) might vary from context to context. And the unit we can just set in the formatting options?

The Ratio column for some reason had numDecimalPlaces: 1 but this is again hugely dependent on context.

From the numeric columns, I think we only need DecimalPercentageColumn which multiplies the value by 100. But this also goes against how we do it – with display.conversionFactor.

danielgavrilov · 2020-11-17T13:06:32Z

In the way I've implemented it, we also can't change the unit of a CurrencyColumn, because it hardcodes the unit now. And that is the only special thing about the currency column. We can just achieve that with display.unit?

breck7 · 2020-11-17T20:18:14Z

All formatting methods now accept overrides. These are necessary I think because the formatting depends on the context the number is formatted for. E.g. in the country picker we want to use short number prefixes (M = million, B = billion), but not in the tooltip. Both are using formatValueShort, because formatValueLong uses a long unit. We could introduce something likeformatValueVeryShort for the country picker use case, but that seems a bit much.

I actually like formatValueVeryShort, or maybe formatValueShortWithAbbreviations. As old Uncle Bob says, a great function has 0 params, a good function has 1 param, and a buggy function has.

I removed the Population, PopulationDensity and Age columns. I don't think they are used anywhere?

I think they are used in the Covid Explorer? But this is a very good point, we don't have a connection yet between the authored Explorer files in owid-content and the parser, so hard to see uses. I think we can get there in the next week or two.

It probably doesn't make sense to go that granular? Age could be used in different contexts – age of the planet? of people? life expectancy? They all might need slightly different settings. I think the current numeric types + formatting config are enough.

This one is a very interesting topic. I think we should go super granular--"type the world! schema.org style"--with an expanding hierarchy of types. I'm thinking we should be prepared for hundreds, if not thousands+ of types. The reason for this is training good program synthesis models becomes exponentially easier the more specific your types. So Grapher v3 someone could paste a CSV in, and we would detect a "Population" type and a "GDP" type, and could immediately suggest adding a "GDP Per Capita" column, and suggest good chart types, colors to use, etc. Authors don't have to get too specific, and if they stick with just "Number" that's fine, you would just be helping us train better models and provide better suggestions if you added that extra bit of information.

breck7 · 2020-11-17T21:06:16Z

In the way I've implemented it, we also can't change the unit of a CurrencyColumn, because it hardcodes the unit now. And that is the only special thing about the currency column. We can just achieve that with display.unit?

I think (but need to check), that we could just ditch unit, and instead use type?

marcelgerber · 2020-11-19T11:50:03Z

@danielgavrilov This looks good!

What your PR still doesn't achieve, though, is that the author-provided numDecimalPlaces is not used for the tooltip formatting.
E.g. in http://localhost:3030/admin/test/embeds?ids=2069, all dimensions have a numDecimalPlaces of 2, but the tooltip only shows one decimal place.

I've briefly looked into it and the reason for this is that the map tooltip uses the axes' formatTick (because it doesn't have access to the underlying column?).

danielgavrilov · 2020-11-19T11:51:53Z

I actually like formatValueVeryShort, or maybe formatValueShortWithAbbreviations. As old Uncle Bob says, a great function has 0 params, a good function has 1 param, and a buggy function has.

Yeah it doesn't sound bad to add it actually.

I think they are used in the Covid Explorer? But this is a very good point, we don't have a connection yet between the authored Explorer files in owid-content and the parser, so hard to see uses. I think we can get there in the next week or two.

Ah good point I totally forgot that these are in the config, I was just thinking from implementation point of view. Will bring them back.

This one is a very interesting topic. I think we should go super granular--"type the world! schema.org style"--with an expanding hierarchy of types. I'm thinking we should be prepared for hundreds, if not thousands+ of types.

Yeah, again I missed the point. From a config point of view it's better to have as much semantic information as possible, so agree we should make super granular types.

In the current setup I don't think they scale very well though, I'm already confused what each inherits from its ancestors. Can be a long chain sometimes. We could/should make these classes very "stupid" and predictable, so that you don't have to follow the inheritance chain.

I think (but need to check), that we could just ditch unit, and instead use type?

There are many many units. These are just some:

SELECT DISTINCT display->"$.unit"
FROM variables
WHERE display->"$.unit" IS NOT NULL

We'll always need chart-level overrides

On the chart-level, I think we'll always need the option to override units. Often the units will be mentioned in the title, and the column/axis/formatting units only serve to disambiguate. E.g. on a ScatterPlot you might specify the unit in detail, but on a LineChart you'd try to be as concise as possible, as the unit is almost always in the title.

We should standardize dataset-level types

I am definitely in favour of making these more standardized on the dataset-level. It doesn't make that much sense that we have the same unit expressed in many different ways across datasets. It would massively help long-term to standardize them, as we're only creating a bigger and bigger mess.

It would be mostly an effort of the authors though, and those who upload datasets. So I think we need to be able to add/remove/update/browse types very easily, without touching the code each time.

The hierarchy as we've started it doesn't matter so much I think, and I think we'll definitely run into more and more cases where it would become hard to decide what to inherit from, and would make the hierarchy less relevant.

Primitive types with behaviour in code, semantic types in external collection?

It feels to me like this might be the best way to go. We need Numeric, Percentage, Integer, String, Color, Boolean, etc types in the codebase in order to specify different behaviour. But beyond that, the types can be semantic-only and specify some config defaults (like unit, shortUnit, numDecimalPlaces, showPlus, numberPrefixes, etc).

If we want to have some associations between the semantic types, I think it should be possible to point to multiple parents, or we can just have "related types". I think a clear hierarchy is not possible, mostly because the variables themselves are often derived from multiple variables. (I think. But if we want to find out we can go through the top 50 OWID units or something, and try out strategies for associations.)

So:

The primitive types will be few and defined in code. These will be in the tens and probably change very little once we put them together.
The semantic types would be maintained in a repo by authors, and there would be hundreds. Probably would be tweaked quite often.
Each semantic type must define a single primitive type it is based on.
Each semantic type can define multiple other semantic types it is related to.

But overall, I think to make a semantic type library we'd need to research a bit more and find out what works and doesn't. And think a bit about what we want to get out of it.

E.g. one idea I'm thinking about: Some of the units in my screenshot are derived units, e.g. people (thousands). We could handle unit conversion better than we do today. Currently it's not possible to convert °C to °F to K, because we only support multiplicative factors and not formulas.

danielgavrilov · 2020-11-19T13:25:18Z

the author-provided numDecimalPlaces is not used for the tooltip formatting.

Thanks Marcel! Should be fixed now.

This is handled in the ScatterPlot transformTable() now

We need super-short formatting, and this seems appropriate.

danielgavrilov · 2020-11-19T15:04:18Z

I used formatValueForMobile instead of creating formatValueShortWithAbbreviations. I think formatValueForMobile should be renamed maybe?

These overrides are still complicated to follow with all the supers. If there was some "definition navigator" which would tell you all the different settings that are applied in the inheritance chain it would be great. You just open up PercentageChangeOverTime and it tells you that the unit is set to % by PercentageColumn, can be overridden by unit or shortUnit if the author chooses to (implemented in NumericColumn), etc. Otherwise not easy to follow the hierarchy in code.

No longer retrievable since they are all merged together

marcelgerber · 2020-11-19T17:49:38Z

I just noticed this chart has wrong data on this branch, but is alright on next:
http://localhost:3030/admin/explorers/preview/covid-test?tab=map&time=166&Metric=Excess%20mortality&Interval=7-day%20rolling%20average&Relative%20to%20Population=true&Align%20outbreaks=false&hash=a48d63ce&selection=~United%20States&pickerSort=desc&pickerMetric=total_cases

Looks like the * 100 factor is applied twice?

breck7 · 2020-11-19T19:01:18Z

SELECT DISTINCT display->"$.unit"
FROM variables
WHERE display->"$.unit" IS NOT NULL

Ah great idea—already can see a good library of types.

We'll always need chart-level overrides
On the chart-level, I think we'll always need the option to override units. Often the units will be mentioned in the title, and the column/axis/formatting units only serve to disambiguate. E.g. on a ScatterPlot you might specify the unit in detail, but on a LineChart you'd try to be as concise as possible, as the unit is almost always in the title.

You can set the Title to anything you want for any chart, right "title = *". So if you wanted to mention them in the title, you could? One idea for a pattern we could implement is less "chart synthesis at runtime", and more "synthesis at write time". So instead of doing switches to build things like titles, we could employ N strategies while the author is editing and provide suggested titles, which they then select and hard code into the chart. Not too sure just thinking out loud.

The ability for an author to manually set title, subtitle, footer, etc, coupled with a rich column type library and an ability to cheaply add columns via "transforms", might reduce the need for a lot of display params?

We should standardize dataset-level types

I like the declarative idea you had—having declared types and less supers. Could probably whip up a quick DSL that we can use to define new types, that authors could easily edit as well.

The primitive types will be few and defined in code. These will be in the tens and probably change very little once we put them together.

The semantic types would be maintained in a repo by authors, and there would be hundreds. Probably would be tweaked quite often.

Each semantic type must define a single primitive type it is based on.

Each semantic type can define multiple other semantic types it is related to.

Great plan. Makes sense.

But overall, I think to make a semantic type library we'd need to research a bit more and find out what works and doesn't. And think a bit about what we want to get out of it.
E.g. one idea I'm thinking about: Some of the units in my screenshot are derived units, e.g. people (thousands). We could handle unit conversion better than we do today. Currently it's not possible to convert °C to °F to K, because we only support multiplicative factors and not formulas.

There are a few open source projects we could look into tapping (https://fossies.org/linux/units/definitions.units). A cool relevant project is typedefs (https://news.ycombinator.com/item?id=24972271)

breck7 · 2020-11-19T19:05:29Z

I used formatValueForMobile instead of creating formatValueShortWithAbbreviations. I think formatValueForMobile should be renamed maybe?

I haven't looked at all the use places we use these things, but these are cheap methods to create and we don't have to worry about backwards compat, so I would say lets create more of them, look for any patterns, and then trim.

These overrides are still complicated to follow with all the supers. If there was some "definition navigator" which would tell you all the different settings that are applied in the inheritance chain it would be great. You just open up PercentageChangeOverTime and it tells you that the unit is set to % by PercentageColumn, can be overridden by unit or shortUnit if the author chooses to (implemented in NumericColumn), etc. Otherwise not easy to follow the hierarchy in code.

This is a good point. Moving away from these as classes, and to a more declarative style makes a lot of sense. And instead of extending, for DRY we can do something analogous to:

const MetricDistanceDef = {
...
  }
const Centimeters = {
   ...MetricDistanceDef,
unit: "cm"
}

danielgavrilov · 2020-11-19T22:06:38Z

👍 I introduced formatValueShortWithAbbreviations and dropped the duplication stuff from DecimalPercentage because of the issue Marcel mentioned.

I think we can leave the declarative type library for 2021 unless anyone is feeling ambitious.

I will merge this since it (hopefully) fixes more things than it breaks but feel free to change!

danielgavrilov requested a review from breck7 November 17, 2020 12:45

danielgavrilov force-pushed the next-number-formatting branch from 3d5bf83 to 8181eb3 Compare November 17, 2020 12:55

danielgavrilov added 2 commits November 19, 2020 13:21

Fix number formatting for numeric columns

30e7773

Fix tests

1ed2d1c

danielgavrilov force-pushed the next-number-formatting branch from 0de2bf8 to d7dd484 Compare November 19, 2020 13:23

danielgavrilov added 2 commits November 19, 2020 13:26

Fix: use formatColumn in tooltip instead of axis.formatTick

28980af

Fix: do not hardcode props on subclasses to allow overriding

097bd79

danielgavrilov force-pushed the next-number-formatting branch from d7dd484 to 912b1dc Compare November 19, 2020 13:39

danielgavrilov added 5 commits November 19, 2020 14:50

Bring back Population, PopulationDensity and Age column

67b253f

Do not change default numDecimalPlaces for PercentageColumn

4c7d17f

Call super methods instead of own methods when overriding formatting

45c2a07

Remove resolved tolerance comment

39b880a

This is handled in the ScatterPlot transformTable() now

Use formatValueForMobile() in EntityPicker

64e4e51

We need super-short formatting, and this seems appropriate.

danielgavrilov force-pushed the next-number-formatting branch 2 times, most recently from 72392d3 to 5462417 Compare November 19, 2020 14:55

danielgavrilov added 2 commits November 19, 2020 15:24

Fix ScatterPlot avg annual change formatting

a399a12

Remove "original database unit" note

14cb8a9

No longer retrievable since they are all merged together

danielgavrilov force-pushed the next-number-formatting branch from 5462417 to 14cb8a9 Compare November 19, 2020 15:25

Fix: display settings should take precedence

708cd18

danielgavrilov added 2 commits November 19, 2020 21:55

Create formatValueShortWithAbbreviations for EntityPicker

3f1d1e3

Do not multiply value in DecimalPercentageColumn

c3b1a24

danielgavrilov merged commit 4c82f29 into next Nov 19, 2020

danielgavrilov deleted the next-number-formatting branch November 19, 2020 22:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix number formatting for numeric columns #709

Fix number formatting for numeric columns #709

danielgavrilov commented Nov 17, 2020 •

edited

Loading

danielgavrilov commented Nov 17, 2020 •

edited

Loading

danielgavrilov commented Nov 17, 2020

breck7 commented Nov 17, 2020

breck7 commented Nov 17, 2020

marcelgerber commented Nov 19, 2020

danielgavrilov commented Nov 19, 2020 •

edited

Loading

danielgavrilov commented Nov 19, 2020

danielgavrilov commented Nov 19, 2020

marcelgerber commented Nov 19, 2020 •

edited

Loading

breck7 commented Nov 19, 2020 •

edited

Loading

breck7 commented Nov 19, 2020 •

edited

Loading

danielgavrilov commented Nov 19, 2020

Fix number formatting for numeric columns #709

Fix number formatting for numeric columns #709

Conversation

danielgavrilov commented Nov 17, 2020 • edited Loading

danielgavrilov commented Nov 17, 2020 • edited Loading

danielgavrilov commented Nov 17, 2020

breck7 commented Nov 17, 2020

breck7 commented Nov 17, 2020

marcelgerber commented Nov 19, 2020

danielgavrilov commented Nov 19, 2020 • edited Loading

We'll always need chart-level overrides

We should standardize dataset-level types

Primitive types with behaviour in code, semantic types in external collection?

danielgavrilov commented Nov 19, 2020

danielgavrilov commented Nov 19, 2020

marcelgerber commented Nov 19, 2020 • edited Loading

breck7 commented Nov 19, 2020 • edited Loading

breck7 commented Nov 19, 2020 • edited Loading

danielgavrilov commented Nov 19, 2020

danielgavrilov commented Nov 17, 2020 •

edited

Loading

danielgavrilov commented Nov 17, 2020 •

edited

Loading

danielgavrilov commented Nov 19, 2020 •

edited

Loading

marcelgerber commented Nov 19, 2020 •

edited

Loading

breck7 commented Nov 19, 2020 •

edited

Loading

breck7 commented Nov 19, 2020 •

edited

Loading