Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix number formatting for numeric columns #709

Merged
merged 14 commits into from
Nov 19, 2020
Merged

Conversation

danielgavrilov
Copy link
Contributor

@danielgavrilov danielgavrilov commented Nov 17, 2020

Fixes:

  1. All formatting methods now accept overrides. These are necessary I think because the formatting depends on the context the number is formatted for. E.g. in the country picker we want to use short number prefixes (M = million, B = billion), but not in the tooltip. Both are using formatValueShort, because formatValueLong uses a long unit. We could introduce something likeformatValueVeryShort for the country picker use case, but that seems a bit much.

  2. I removed the Population, PopulationDensity and Age columns. I don't think they are used anywhere?

    It probably doesn't make sense to go that granular? Age could be used in different contexts – age of the planet? of people? life expectancy? They all might need slightly different settings. I think the current numeric types + formatting config are enough. Reverted. More comments below.

@danielgavrilov
Copy link
Contributor Author

danielgavrilov commented Nov 17, 2020

Actually I don't think we even need PercentageColumn and all those variations. Their magnitudes (and numDecimalPlaces) might vary from context to context. And the unit we can just set in the formatting options?

The Ratio column for some reason had numDecimalPlaces: 1 but this is again hugely dependent on context.

From the numeric columns, I think we only need DecimalPercentageColumn which multiplies the value by 100. But this also goes against how we do it – with display.conversionFactor.

@danielgavrilov
Copy link
Contributor Author

In the way I've implemented it, we also can't change the unit of a CurrencyColumn, because it hardcodes the unit now. And that is the only special thing about the currency column. We can just achieve that with display.unit?

@breck7
Copy link
Contributor

breck7 commented Nov 17, 2020

  1. All formatting methods now accept overrides. These are necessary I think because the formatting depends on the context the number is formatted for. E.g. in the country picker we want to use short number prefixes (M = million, B = billion), but not in the tooltip. Both are using formatValueShort, because formatValueLong uses a long unit. We could introduce something likeformatValueVeryShort for the country picker use case, but that seems a bit much.

I actually like formatValueVeryShort, or maybe formatValueShortWithAbbreviations. As old Uncle Bob says, a great function has 0 params, a good function has 1 param, and a buggy function has.

  1. I removed the Population, PopulationDensity and Age columns. I don't think they are used anywhere?

I think they are used in the Covid Explorer? But this is a very good point, we don't have a connection yet between the authored Explorer files in owid-content and the parser, so hard to see uses. I think we can get there in the next week or two.

It probably doesn't make sense to go that granular? Age could be used in different contexts – age of the planet? of people? life expectancy? They all might need slightly different settings. I think the current numeric types + formatting config are enough.

This one is a very interesting topic. I think we should go super granular--"type the world! schema.org style"--with an expanding hierarchy of types. I'm thinking we should be prepared for hundreds, if not thousands+ of types. The reason for this is training good program synthesis models becomes exponentially easier the more specific your types. So Grapher v3 someone could paste a CSV in, and we would detect a "Population" type and a "GDP" type, and could immediately suggest adding a "GDP Per Capita" column, and suggest good chart types, colors to use, etc. Authors don't have to get too specific, and if they stick with just "Number" that's fine, you would just be helping us train better models and provide better suggestions if you added that extra bit of information.

@breck7
Copy link
Contributor

breck7 commented Nov 17, 2020

In the way I've implemented it, we also can't change the unit of a CurrencyColumn, because it hardcodes the unit now. And that is the only special thing about the currency column. We can just achieve that with display.unit?

I think (but need to check), that we could just ditch unit, and instead use type?

@marcelgerber
Copy link
Member

@danielgavrilov This looks good!

What your PR still doesn't achieve, though, is that the author-provided numDecimalPlaces is not used for the tooltip formatting.
E.g. in http://localhost:3030/admin/test/embeds?ids=2069, all dimensions have a numDecimalPlaces of 2, but the tooltip only shows one decimal place.

I've briefly looked into it and the reason for this is that the map tooltip uses the axes' formatTick (because it doesn't have access to the underlying column?).

@danielgavrilov
Copy link
Contributor Author

danielgavrilov commented Nov 19, 2020

I actually like formatValueVeryShort, or maybe formatValueShortWithAbbreviations. As old Uncle Bob says, a great function has 0 params, a good function has 1 param, and a buggy function has.

Yeah it doesn't sound bad to add it actually.


I think they are used in the Covid Explorer? But this is a very good point, we don't have a connection yet between the authored Explorer files in owid-content and the parser, so hard to see uses. I think we can get there in the next week or two.

Ah good point I totally forgot that these are in the config, I was just thinking from implementation point of view. Will bring them back.


This one is a very interesting topic. I think we should go super granular--"type the world! schema.org style"--with an expanding hierarchy of types. I'm thinking we should be prepared for hundreds, if not thousands+ of types.

Yeah, again I missed the point. From a config point of view it's better to have as much semantic information as possible, so agree we should make super granular types.

In the current setup I don't think they scale very well though, I'm already confused what each inherits from its ancestors. Can be a long chain sometimes. We could/should make these classes very "stupid" and predictable, so that you don't have to follow the inheritance chain.


I think (but need to check), that we could just ditch unit, and instead use type?

There are many many units. These are just some:

Screenshot 2020-11-19 at 11 19 48

SELECT DISTINCT display->"$.unit"
FROM variables
WHERE display->"$.unit" IS NOT NULL

We'll always need chart-level overrides

On the chart-level, I think we'll always need the option to override units. Often the units will be mentioned in the title, and the column/axis/formatting units only serve to disambiguate. E.g. on a ScatterPlot you might specify the unit in detail, but on a LineChart you'd try to be as concise as possible, as the unit is almost always in the title.

We should standardize dataset-level types

I am definitely in favour of making these more standardized on the dataset-level. It doesn't make that much sense that we have the same unit expressed in many different ways across datasets. It would massively help long-term to standardize them, as we're only creating a bigger and bigger mess.

It would be mostly an effort of the authors though, and those who upload datasets. So I think we need to be able to add/remove/update/browse types very easily, without touching the code each time.

The hierarchy as we've started it doesn't matter so much I think, and I think we'll definitely run into more and more cases where it would become hard to decide what to inherit from, and would make the hierarchy less relevant.

Primitive types with behaviour in code, semantic types in external collection?

It feels to me like this might be the best way to go. We need Numeric, Percentage, Integer, String, Color, Boolean, etc types in the codebase in order to specify different behaviour. But beyond that, the types can be semantic-only and specify some config defaults (like unit, shortUnit, numDecimalPlaces, showPlus, numberPrefixes, etc).

If we want to have some associations between the semantic types, I think it should be possible to point to multiple parents, or we can just have "related types". I think a clear hierarchy is not possible, mostly because the variables themselves are often derived from multiple variables. (I think. But if we want to find out we can go through the top 50 OWID units or something, and try out strategies for associations.)

So:

  • The primitive types will be few and defined in code. These will be in the tens and probably change very little once we put them together.
  • The semantic types would be maintained in a repo by authors, and there would be hundreds. Probably would be tweaked quite often.
  • Each semantic type must define a single primitive type it is based on.
  • Each semantic type can define multiple other semantic types it is related to.

But overall, I think to make a semantic type library we'd need to research a bit more and find out what works and doesn't. And think a bit about what we want to get out of it.

E.g. one idea I'm thinking about: Some of the units in my screenshot are derived units, e.g. people (thousands). We could handle unit conversion better than we do today. Currently it's not possible to convert °C to °F to K, because we only support multiplicative factors and not formulas.

@danielgavrilov
Copy link
Contributor Author

the author-provided numDecimalPlaces is not used for the tooltip formatting.

Thanks Marcel! Should be fixed now.

@danielgavrilov danielgavrilov force-pushed the next-number-formatting branch 2 times, most recently from 72392d3 to 5462417 Compare November 19, 2020 14:55
@danielgavrilov
Copy link
Contributor Author

I used formatValueForMobile instead of creating formatValueShortWithAbbreviations. I think formatValueForMobile should be renamed maybe?

These overrides are still complicated to follow with all the supers. If there was some "definition navigator" which would tell you all the different settings that are applied in the inheritance chain it would be great. You just open up PercentageChangeOverTime and it tells you that the unit is set to % by PercentageColumn, can be overridden by unit or shortUnit if the author chooses to (implemented in NumericColumn), etc. Otherwise not easy to follow the hierarchy in code.

@marcelgerber
Copy link
Member

marcelgerber commented Nov 19, 2020

@breck7
Copy link
Contributor

breck7 commented Nov 19, 2020

SELECT DISTINCT display->"$.unit"
FROM variables
WHERE display->"$.unit" IS NOT NULL

Ah great idea—already can see a good library of types.

We'll always need chart-level overrides
On the chart-level, I think we'll always need the option to override units. Often the units will be mentioned in the title, and the column/axis/formatting units only serve to disambiguate. E.g. on a ScatterPlot you might specify the unit in detail, but on a LineChart you'd try to be as concise as possible, as the unit is almost always in the title.

You can set the Title to anything you want for any chart, right "title = *". So if you wanted to mention them in the title, you could? One idea for a pattern we could implement is less "chart synthesis at runtime", and more "synthesis at write time". So instead of doing switches to build things like titles, we could employ N strategies while the author is editing and provide suggested titles, which they then select and hard code into the chart. Not too sure just thinking out loud.

The ability for an author to manually set title, subtitle, footer, etc, coupled with a rich column type library and an ability to cheaply add columns via "transforms", might reduce the need for a lot of display params?

We should standardize dataset-level types

I like the declarative idea you had—having declared types and less supers. Could probably whip up a quick DSL that we can use to define new types, that authors could easily edit as well.

  • The primitive types will be few and defined in code. These will be in the tens and probably change very little once we put them together.
  • The semantic types would be maintained in a repo by authors, and there would be hundreds. Probably would be tweaked quite often.
  • Each semantic type must define a single primitive type it is based on.
  • Each semantic type can define multiple other semantic types it is related to.

Great plan. Makes sense.

But overall, I think to make a semantic type library we'd need to research a bit more and find out what works and doesn't. And think a bit about what we want to get out of it.
E.g. one idea I'm thinking about: Some of the units in my screenshot are derived units, e.g. people (thousands). We could handle unit conversion better than we do today. Currently it's not possible to convert °C to °F to K, because we only support multiplicative factors and not formulas.

There are a few open source projects we could look into tapping (https://fossies.org/linux/units/definitions.units). A cool relevant project is typedefs (https://news.ycombinator.com/item?id=24972271)

@breck7
Copy link
Contributor

breck7 commented Nov 19, 2020

I used formatValueForMobile instead of creating formatValueShortWithAbbreviations. I think formatValueForMobile should be renamed maybe?

I haven't looked at all the use places we use these things, but these are cheap methods to create and we don't have to worry about backwards compat, so I would say lets create more of them, look for any patterns, and then trim.

These overrides are still complicated to follow with all the supers. If there was some "definition navigator" which would tell you all the different settings that are applied in the inheritance chain it would be great. You just open up PercentageChangeOverTime and it tells you that the unit is set to % by PercentageColumn, can be overridden by unit or shortUnit if the author chooses to (implemented in NumericColumn), etc. Otherwise not easy to follow the hierarchy in code.

This is a good point. Moving away from these as classes, and to a more declarative style makes a lot of sense. And instead of extending, for DRY we can do something analogous to:

const MetricDistanceDef = {
...
  }
const Centimeters = {
   ...MetricDistanceDef,
unit: "cm"
}

@danielgavrilov
Copy link
Contributor Author

👍 I introduced formatValueShortWithAbbreviations and dropped the duplication stuff from DecimalPercentage because of the issue Marcel mentioned.

I think we can leave the declarative type library for 2021 unless anyone is feeling ambitious.

I will merge this since it (hopefully) fixes more things than it breaks but feel free to change!

@danielgavrilov danielgavrilov merged commit 4c82f29 into next Nov 19, 2020
@danielgavrilov danielgavrilov deleted the next-number-formatting branch November 19, 2020 22:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants