Parse PluralRules from CLDR #24

Merged: jeffijoe merged 27 commits into jeffijoe:master from kostya9:parse_pluralrules_from_cldr on Apr 26, 2021
Contributor

kostya9 commented Apr 1, 2021

The fully generated rules file is here: https://gist.github.com/kostya9/467b404cddfbaae2fa63b5e6c6bfb584

I used the specification at https://www.unicode.org/reports/tr35/tr35-numbers.html#Language_Plural_Rules as a guide.

I implemented a source generator that

  1. Parses pluralrules.xml (taken from the latest CLDR release, http://cldr.unicode.org/index/downloads) into an intermediate representation.
  2. Generates C# code for each rule in it.
  3. Lets the MessageFormatter fall back to this generated metadata when it can't find a rule for a locale in its own Pluralizer dictionary.

Feel free to start the review either from the tests or from the source generator entry point, PluralLanguagesGenerator.cs.

For variables, I implemented only

  • n | absolute value of the source number.
  • i | integer digits of n.
  • v | number of visible fraction digits in n, with trailing zeros (partially: so far I only determine whether a number is fractional or not. Most languages only test v = 0 or v != 0; languages that use other values of v are excluded from the metadata).

Locales that use other variables were excluded.

On the right side, the generator understands both ranges (11..14) and individual numbers (15).
On the left side, the generator understands both plain variables (i) and modulo operations (i % 10).

Contributes to #22

UPD:
Added support for all variables except exponents
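
To make the description above concrete, here is a minimal sketch of what a rule built from the operands i and v, a modulo on the left side, and ranges on the right side might look like once translated to C#. It is not the generator's actual output: the class and method names are hypothetical, and the rule text in the comments is a hand transcription of the CLDR plural rules for Russian, so it should be checked against pluralrules.xml before being relied on.

```csharp
using System;

// Hypothetical illustration only; not the code emitted by PluralLanguagesGenerator.
public static class RussianPluralSketch
{
    // True when value lies in the inclusive range [low, high], i.e. CLDR "low..high".
    private static bool InRange(long value, long low, long high) =>
        value >= low && value <= high;

    // i: integer digits of the (absolute) number; v: number of visible fraction digits.
    public static string Select(long i, int v)
    {
        // Every non-"other" condition below requires v = 0.
        if (v != 0) return "other";

        long mod10 = i % 10;
        long mod100 = i % 100;

        // one: v = 0 and i % 10 = 1 and i % 100 != 11
        if (mod10 == 1 && mod100 != 11) return "one";

        // few: v = 0 and i % 10 = 2..4 and i % 100 != 12..14
        if (InRange(mod10, 2, 4) && !InRange(mod100, 12, 14)) return "few";

        // many: v = 0 and (i % 10 = 0 or i % 10 = 5..9 or i % 100 = 11..14)
        if (mod10 == 0 || InRange(mod10, 5, 9) || InRange(mod100, 11, 14)) return "many";

        return "other";
    }
}
```

For example, `RussianPluralSketch.Select(21, 0)` selects the "one" form while `Select(11, 0)` selects "many", which is the distinction the `i % 100` conditions exist to make.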

Contributor Author

kostya9 commented Apr 1, 2021

Hey @jeffijoe, could you please update the .NET SDK to 5.0.* in GitHub Actions?

</ItemGroup>

<PropertyGroup>
<PluralLanguagesMetadataExcludeLocales>si da is mk ceb fil tl lv prg bs hr sh sr fr dsb hsb lt</PluralLanguagesMetadataExcludeLocales>
Contributor Author

These languages contain constructs that this generator cannot parse yet

Owner

da is Danish? Damn, sucks to be me 😂

Contributor Author

Eventually, support for these should be there, but I figured that it would be even harder to review with more changes 😅

<LangVersion>latest</LangVersion>
<Nullable>enable</Nullable>
- <TargetFramework>netstandard1.1</TargetFramework>
+ <TargetFramework>netstandard2.0</TargetFramework>
Contributor Author

Was forced to update to netstandard2.0 to consume the source generator. Feel free to evaluate whether this is worth it or not

Owner

What are we losing by doing this?

The .NET Standard page documents this, but in layman's terms we would lose support for:

  • .NET Framework 4.5 through 4.6 (4.5 and 4.5.1 have reached EoL).
  • .NET Core 1.x (which has reached EoL).
  • Mono 4.5 through 5.3
  • All flavors of Xamarin and UWP would have to be upgraded slightly (probably not an issue for most)

For now keeping a single .NET Standard 2.0 target is the best option for supporting both .NET Framework and .NET Core without adding all of the dependencies that .NET Standard 1.x had.

You may also wish to consider the tradeoffs of supporting earlier versions of .NET Framework and improved API support for .NET Standard 2.1 which you could gain at the price of multi-targeting and keeping conditional sections to gracefully degrade for .NET Framework, both of which add complexity. But IMO, abandoning .NET Standard 1.x at this stage is a good decision either way.

Owner

So TL;DR is that upgrading to 2.0 is what we want to do either way? 😄

Maybe. If multi-targeting .NET Framework 4.5 and .NET Standard 2.1, there isn't a huge gap that .NET Standard 2.0 fills (basically .NET Core 2.x and some older mobile platforms). It depends on how high your tolerance for setting up a more complicated build and for dealing with conditional compilation sections is 😉.

Owner

jeffijoe, Apr 8, 2021

I don't mind removing support for EoL targets.


var numberSpan = ConsumeCharacters(numbersCount);

var number = int.Parse(new string(numberSpan.ToArray()));
Contributor Author

Didn't find a span-based int.Parse overload in netstandard2.0 :(

Owner

numberSpan.ToString() might be more efficient than ToArray() + new string()?

Contributor Author

Didn't know you could do that, thanks
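
The suggestion in this thread can be sketched as follows. `SpanParseSketch.ParseDigits` is a hypothetical stand-in for the parser code under review, not the PR's actual method. The sketch assumes that `ReadOnlySpan<char>` is available on netstandard2.0 (typically through the System.Memory package) and that the project is built with an SDK that defines the `NET5_0_OR_GREATER` symbol.

```csharp
using System;

// Hypothetical helper illustrating the review suggestion; names are not from the PR.
public static class SpanParseSketch
{
    public static int ParseDigits(ReadOnlySpan<char> digits)
    {
#if NET5_0_OR_GREATER
        // Newer targets have a span-based overload, so no intermediate string is needed.
        return int.Parse(digits);
#else
        // netstandard2.0 has no span-based int.Parse, so a string is still allocated,
        // but ToString() on a ReadOnlySpan<char> avoids the extra char[] from ToArray().
        return int.Parse(digits.ToString());
#endif
    }
}
```

Calling `SpanParseSketch.ParseDigits("123".AsSpan())` takes whichever branch matches the target framework; the result is the same either way, only the allocations differ.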

Owner

jeffijoe commented Apr 1, 2021

> Hey @jeffijoe could you please update the .NET SDK to 5.0.* in github actions?

Isn't that controlled in the workflow file in the branch?

Contributor Author

kostya9 commented Apr 1, 2021

Thanks, that worked! Wasn't sure of how the security model works with forks, but seems that I can change SDKs freely.

Owner

jeffijoe commented Apr 1, 2021

This is amazing! I'm going to need some time to play with it and review, but wow! 🤩

@NightOwl888

Rather than passing "Locale" as a string in .NET, it would be more intuitive (and more compatible) to pass in a CultureInfo instance. Of course, this would create some gaps that you would need to account for, but probably less so in .NET 5 which now uses ICU by default.

Also, it is quite non-intuitive to attach CultureInfo (or Locale) to an object instance in .NET by passing it through a constructor. It would be better if it could be passed to the method that does the formatting. This would also have the advantage that a single formatter could work with multiple threads at the same time. Ideally, there would be 2 overloads of FormatMessage(), one that accepts a CultureInfo and one that uses the CultureInfo from the current thread.

I am not saying you have to make these changes, but this is the way Microsoft would do it if they add support for message formatting.

Contributor Author

kostya9 commented Apr 2, 2021

Accepting CultureInfo in method argument sounds reasonable, but would not satisfy my use-case. My situation is that multiple users are being served by one server. These users can have different languages configured for their personal account. Basically, I have a language code, and need to format a string for that language code.

I imagine that going from a language code to a CultureInfo may be a bit awkward.

Contributor Author

kostya9 commented Apr 2, 2021

I agree with your point that it would have been easier to pass the locale to the formatting method, though. Probably, that sounds like a separate discussion thread, and the parsing logic for CLDR rules will be the same regardless of whether the locale-specific pluralization data is needed in the constructor or in the formatting method.

@NightOwl888

> Accepting CultureInfo in method argument sounds reasonable, but would not satisfy my use-case. My situation is that multiple users are being served by one server. These users can have different languages configured for their personal account. Basically, I have a language code, and need to format a string for that language code.

There isn't a lot of pluralization data, and it doesn't need to be updated at runtime. It could be shipped as a default set of pluralization rules that is cached.

But the fact that you are serving multiple users per server is the point. If this is a webserver, each user can pass their culture in the URL or somewhere else in the request envelope and it can then be made "the" culture of the current thread.

CultureInfo.CurrentCulture = CultureInfo.GetCultureInfo(userCulture);

This is done at the beginning of each request, so when calling MessageFormatter.FormatMessage() you don't necessarily have to pass the culture at that point, because it is already set on the thread for the user. In fact, if the data is pulled from a read-only cache (or, as it looks like here, all of the rules are already codified at that point), you could potentially have a singleton MessageFormatter that serves all users on all threads.

> I imagine that going from a language code to a CultureInfo may be a bit awkward.

> I agree with your point that it would have been easier to pass the locale to the formatting method, though. Probably, that sounds like a separate discussion thread, and the parsing logic for CLDR rules will be the same regardless of whether the locale-specific pluralization data is needed in the constructor or in the formatting method.

I haven't analyzed this in detail to see how big the change would be. But most users will be using the built-in CultureInfo object because it is attached to the current thread already. In fact, not using the one from the current thread would probably seem odd to most .NET developers because culture is context-sensitive by default in many .NET features.

I believe you could use CultureInfo.TwoLetterISOLanguageName to convert most cultures to locale, but there will probably be some gaps that need to be mapped from .NET to the CLDR name. It would be better for end users if closing those gaps was built-in rather than leaving it up to each one to roll their own solution.

Agreed it probably doesn't change the parsing of the file, but maybe change how it is dealt with at the top level after it is parsed. I just noticed it was being passed in to a constructor and wanted to point out that isn't the .NET way because .NET is already aware of its current culture at the top level. It would be best if MessageFormatter were aware of that culture and provide a way to override it.
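
The culture-to-locale mapping discussed in this comment could look roughly like the sketch below. `PluralLocaleSketch.FromCulture` is a hypothetical helper and not part of the library. It uses the current culture when none is passed and otherwise lets the caller override it. Falling back to "en" for the invariant culture is an assumption made for illustration only, and any gaps between .NET culture names and CLDR locale names would still need an explicit mapping table.

```csharp
using System;
using System.Globalization;

// Hypothetical helper; the name and the "en" fallback are assumptions, not library behavior.
public static class PluralLocaleSketch
{
    public static string FromCulture(CultureInfo culture = null)
    {
        // Default to the ambient culture of the current thread, as most .NET APIs do.
        if (culture == null) culture = CultureInfo.CurrentCulture;

        // The invariant culture has an empty name and no language of its own,
        // so pick a default language instead (assumed here to be English).
        if (culture.Name.Length == 0) return "en";

        // Usually a two-letter ISO 639-1 code such as "uk" for "uk-UA".
        return culture.TwoLetterISOLanguageName;
    }
}
```

With a helper like this, a formatting method could accept an optional CultureInfo and pass `PluralLocaleSketch.FromCulture(culture)` to the rule lookup, which keeps a single formatter instance usable across threads serving different users.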


<PropertyGroup>
- <TargetFramework>netcoreapp3.1</TargetFramework>
+ <TargetFramework>net5.0</TargetFramework>

I would suggest multi-targeting the tests. This adds enough complexity that it may differ on .NET Framework vs .NET Core vs .NET 5.

Owner

Good idea, is this easy to do?

Yes, just change the test projects from using

<TargetFramework>net5.0</TargetFramework>

to

<TargetFrameworks>net5.0;netcoreapp3.1;net48</TargetFrameworks>

TIP: Make sure you pay attention to the trailing s in TargetFrameworks, or you might be scratching your head for a while.

In Visual Studio 2019, it will then show you all of the tests on each of the target frameworks and you can run them all or select the ones you want to run.

At the command line, it is a bit more complicated because AFAIK you have to run a dotnet test on each target framework as a separate command.

I believe installing just the .NET 5.0 SDK will work. At the very least, you can always use the .NET 5 SDK as the entry point, along with all of the options it supports, though you may have to install other SDKs, such as .NET Core 3.1, before installing .NET 5.0.

Working Example

My setup is more complicated than yours because I test on multiple operating systems and use build assets to transfer the same binaries from the build agent to the test agents (you could just run dotnet test which would build from source on each OS, but I prefer to test the binaries I ship). I am also using Azure DevOps rather than GitHub Actions, but you should be able to set it up similarly there, too if you want to.

In my case, I do dotnet publish (in combination with properly setting the <IsPublishable> property to true for test projects and false for all of the others) to make the tests runnable on each OS and I specifically set up a build agent to test 1 target framework. But you are welcome to analyze the templates here and here and use what you need (from anywhere in the project) to get it working.

namespace Jeffijoe.MessageFormat.MetadataGenerator
{
[Generator]
public class PluralLanguagesGenerator : ISourceGenerator

Brilliant use of a source code generator!

Contributor Author

kostya9 commented Apr 5, 2021

The more I think about it, the more your point about CultureInfo makes sense. I tried to parse a couple of locales via CultureInfo.GetCultureInfo, and they just work. Will need to research if it works for all languages that I care about.

I will not make any changes for this in this PR, but the source generator should still work if the CultureInfo approach is introduced at some point in the future. The partial method PluralRulesMetadata.TryGetRuleByLocale will just be called differently: PluralRulesMetadata.TryGetRuleByLocale(culture.TwoLetterISOLanguageName, out var formatter); where culture is a CultureInfo instance.

Move AST to a separate folder, move classes into separate files.
Use parametrized expression for documentation file.
Apply simplification for ParseRuleContent.
Remove 'and' calculation out of loop.
Simplify number construction.
Owner

jeffijoe commented Apr 8, 2021

> Wasn't sure how to test that, I can't possibly test all of the languages. I can write a couple of test cases for languages I speak (en-ru-uk), what do you think about that?

Yeah that sounds good! You may be able to use the compiler to run the generated class?

> Yeah, that should work. Didn't do that yet because was thinking: maybe there is another way without allocating a string when calculating the variables?

Could you give me an example of what you mean here?

Contributor Author

kostya9 commented Apr 17, 2021

> Yeah that sounds good! You may be able to use the compiler to run the generated class?

The generated rules are in tests assembly already, because MessageFormat assembly already contains the generated code when referenced. Added tests. Made sure they work by commenting out the line which uses generated metadata and ensuring the tests fail (except EN, of course).

> Could you give me an example of what you mean here?

I mean that string.Format creates (allocates) a new string, so I was hesitant to use it. I wonder if there is a simple enough mathematical method (divide the initial number by something, multiply by something) to calculate these variables. If there is no other reasonable way, well, that's better than nothing :)

Owner

jeffijoe commented Apr 17, 2021

@kostya9 oh I completely misunderstood what the v was actually for! Can you give me an example of a rule where the basic v!=/==0 does not apply?

EDIT: Thinking about it more, it really doesn't make sense that an input number will have trailing zeros because number types don't even encode that information. Unless we are saying that you may format a number-like string in which case we have to parse the string.

EDIT2: I completely forgot about the decimal data type which does in fact encode that information!

We currently convert to double here:

If we change from double to decimal then we should be able to use this approach:

int count = BitConverter.GetBytes(decimal.GetBits(value)[3])[2];
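
As a sketch of the idea above: the fourth element returned by `decimal.GetBits` holds the scale (the number of digits after the decimal point) in bits 16 to 23, so v can be read without formatting the number as a string. `DecimalScaleSketch` is a hypothetical name used here for illustration; the shift-and-mask form reads the same field as the one-liner above but does not depend on the byte order of the machine.

```csharp
using System;

// Hypothetical helper; shows how v could be derived once the value is a decimal.
public static class DecimalScaleSketch
{
    // Returns the number of visible fraction digits, trailing zeros included.
    public static int VisibleFractionDigits(decimal value)
    {
        // Element 3 holds the flags: bits 16-23 are the scale, bit 31 is the sign.
        int flags = decimal.GetBits(value)[3];
        return (flags >> 16) & 0xFF;
    }
}
```

For instance, `VisibleFractionDigits(1.50m)` yields 2 and `VisibleFractionDigits(1m)` yields 0. A value that has passed through double no longer carries this information, which is why the conversion mentioned above would have to change for this to work.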

Contributor Author

kostya9 commented Apr 25, 2021

@jeffijoe Finally found some time to work on this :)

I'm exploring adding string manipulation for fractional numbers to support all the rules properly (trailing zeroes and stuff).
Could you please add some context on what is 'offset' here ?

var offsetExtension = arguments.Extensions.FirstOrDefault(x => x.Extension == "offset");

Owner

jeffijoe commented Apr 25, 2021

> Could you please add some context on what is 'offset' here ?

Yes, see this: https://messageformat.github.io/messageformat/guide/#plural-offset

Contributor Author

kostya9 commented Apr 25, 2021

Added support for all rules except the exponent ones (an exponent cannot be expressed in any way other than passing a string that contains it directly).
See the new version of the generated file here: https://gist.github.com/kostya9/c95fa4395811c0d3d24e7654b563c322

Thanks @pointnet, I did something like what you proposed.

<LangVersion>latest</LangVersion>
<Nullable>enable</Nullable>
- <TargetFramework>netstandard1.1</TargetFramework>
+ <TargetFrameworks>netstandard2.0;net5.0</TargetFrameworks>
Contributor Author

kostya9, Apr 25, 2021

Added net5.0 to target frameworks to allocate less when parsing PluralContext on projects that support it

@jeffijoe
Owner

Amazing work @kostya9, and shoutout to @pointnet and @NightOwl888 for the input! Much appreciated!

This looks good to merge, unless you have some final cleanup you want to do first?

Contributor Author

kostya9 commented Apr 26, 2021

Thanks! Made a final pass through the code - made the plural metadata internal. Everything else looks good to me, feel free to merge

@jeffijoe
Owner

Would you happen to have R# or Rider? Maybe running a code cleanup before merge would be good hygiene 😄

Contributor Author

kostya9 commented Apr 26, 2021

My Rider instance generates an insane diff when I try to do that )

I suppose it can't infer the code style properly, it's clueless without an explicit editorconfig

@jeffijoe
Owner

I think there's a DotSettings file in the repo, granted it was generated many years ago when I was just using ReSharper. All good then. I just saw some places where the code was not formatted but it's fine. :)

@jeffijoe jeffijoe merged commit c956c09 into jeffijoe:master Apr 26, 2021
@kostya9 kostya9 deleted the parse_pluralrules_from_cldr branch April 30, 2021 20:28