Spellout numbering #41

Closed
michaelhkay opened this issue Nov 23, 2022 · 3 comments · Fixed by #46

Comments

@michaelhkay
michaelhkay commented Nov 23, 2022

As far as I can tell, ICU4N doesn't include the spellout numbering capabilities of ICU4J.

I'm interested in assessing whether it's feasible to port this code and contribute it to the project. Having no familiarity with ICU-J internals, I wouldn't know where to start, but if you can provide any initial thoughts (perhaps you've looked at it and decided it's too hard...) then I'd appreciate any pointers.

Alternatively, rather than doing it ourselves we could sponsor the development.

Note, we are currently using ICU4N in the SaxonCS project for localised collation support.

@NightOwl888
Owner

Thanks for the inquiry.

This is a bit of a can of worms because .NET doesn't provide much support for extending formatters and parsers. In the ideal world, we would extend .NET to do this.

  • Formatting - all we've got for extensibility is IFormatProvider and ICustomFormatter and they only work on built-in types in the string.Format(IFormatProvider, string, object) method, and for value types that means boxing. Passing a custom formatter to long.ToString() gets you a type cast error because it is hard coded to only support NumberFormatInfo, which is a sealed class.
  • Parsing - parsers are not extensible at all. There are a handful of baked-in configuration options, and that is it.
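To make the formatting limitation concrete, here is a minimal sketch of the one hook that does work: an ICustomFormatter picked up by string.Format(). The DigitWordFormatProvider class, its "w" format string, and the 0 to 9 range are hypothetical and exist only for illustration; nothing here is an ICU4N API.

```csharp
using System;
using System.Globalization;

// Hypothetical provider for illustration only; not an ICU4N type.
public sealed class DigitWordFormatProvider : IFormatProvider, ICustomFormatter
{
    private static readonly string[] Words =
        { "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine" };

    // string.Format() asks the provider for an ICustomFormatter.
    public object GetFormat(Type formatType)
        => formatType == typeof(ICustomFormatter) ? this : null;

    public string Format(string format, object arg, IFormatProvider formatProvider)
    {
        // The value arrives as object, so an int has already been boxed here.
        if (format == "w")
        {
            if (arg is int n && n >= 0 && n <= 9)
                return Words[n];
            // Outside the toy range: fall back to plain digits.
            return Convert.ToString(arg, CultureInfo.InvariantCulture) ?? string.Empty;
        }

        // Any other format string is passed through to the built-in formatting.
        return arg is IFormattable f
            ? f.ToString(format, CultureInfo.InvariantCulture)
            : (arg?.ToString() ?? string.Empty);
    }
}
```

With this in place, string.Format(new DigitWordFormatProvider(), "{0:w} and {1}", 7, 42) produces "seven and 42".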

I had some recent experience with how the Java and .NET approaches differ when porting over the parsers from .NET to add the ability to parse Java-specific formats in J2N.Numerics. A few things of note:

  • Java formatters include both parse() and format() overloads, so they are always round-trippable. This makes them more advanced, but also makes them very slow. .NET uses static methods to optimize performance. Also, not every format is round-trippable in .NET.
  • Java formatters are classes that can inherit other formatters. Built-in .NET formatters cannot be extended (only the options above).

Given the limitations in .NET formatters, it seems like it would be better to aim for extension methods to expose the APIs publicly on number types and provide an IFormatProvider that can be used in string.Format() (although we would need to box, it is still the only option for building up strings that mix in other formatters). As for the actual implementation, it could go a couple of ways.

  • Port the whole RuleBasedNumberFormat implementation including the ability to extend it through inheritance.
  • Take the .NET approach and create highly-optimized static parsers and formatters that can be customized with rules, but not extended.

It would generally be simpler to maintain the first approach as a line-by-line port from Java. But it comes with a pretty high performance cost. That being said, it is also a pretty big project to make a rules-based parser at the optimized level that the .NET runtime uses.

RuleBasedNumberFormat

What are your requirements? The RuleBasedNumberFormat class is highly extensible by design and would be very useful, indeed. But it also has dependencies on BigDecimal and BigInteger, which are another can of worms.

  • Do you need the format to be round-trippable, or are you only looking to convert numbers to strings?
  • Do you need it to support floating point types, or only integral types?
  • Do you need it to support BigInteger and/or BigDecimal?
  • Do you need to support every language?
  • Are you more concerned with performance or extensibility and customization?

The RuleBasedNumberFormat rules would need to be utilized if we go much beyond English (even Spanish has specialized rules for gender that need to be taken into account).

Unfortunately, it is based on Java's number formatting syntax, which makes it a bit of an oddball in .NET.

I also haven't worked out how to unpack the .res format that ICU4J uses (which may or may not be required to change the format) - we are simply using a port of ByteBuffer to read it in big-endian format. The raw data that is compiled to .res is here, but so far my attempts at compiling resources haven't been successful. So, changing the syntax of the formatter to align with .NET formatters will take some research.
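For reference, reading a big-endian value in .NET does not strictly need a ByteBuffer port. The sketch below uses System.Buffers.Binary.BinaryPrimitives; the BigEndianReader class is a hypothetical name, and the example only illustrates byte-order handling. It says nothing about the actual layout of a .res file.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical helper for illustration; not part of ICU4N.
public static class BigEndianReader
{
    // Reads a 32-bit signed integer stored most-significant byte first,
    // starting at the given offset in the buffer.
    public static int ReadInt32(byte[] buffer, int offset)
        => BinaryPrimitives.ReadInt32BigEndian(buffer.AsSpan(offset, 4));
}
```

For example, BigEndianReader.ReadInt32(new byte[] { 0x00, 0x00, 0x01, 0x02 }, 0) returns 258.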

Existing Options

  1. If all you care about is number > string and only need support for English, int and long, we have some extension methods in EnglishNumberFormatExtensions.cs that could be utilized.
  2. It is possible to compile icu4j to .NET using ikvm-maven (another project I contribute to).

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>netcoreapp3.1</TargetFramework>
  </PropertyGroup>

  <ItemGroup>
    <PackageReference Include="IKVM.Maven.Sdk" Version="1.1.1" />
  </ItemGroup>

  <ItemGroup>
    <MavenReference Include="com.ibm.icu:icu4j" Version="60.1" />
  </ItemGroup>

</Project>

using System;

namespace ICU4JExperimentation
{
    internal class Program
    {
        static void Main(string[] args)
        {
            double num = 2718.28;
            var locale = java.util.Locale.ENGLISH;
            var nf = new com.ibm.icu.text.RuleBasedNumberFormat(locale, com.ibm.icu.text.RuleBasedNumberFormat.SPELLOUT);

            string formatted = nf.format(num);
            Console.WriteLine(formatted);
            var parsed = nf.parse(formatted);
            Console.WriteLine(parsed.doubleValue());
        }
    }
}

two thousand seven hundred eighteen point two eight
2718.28

As great as this seems, there are some drawbacks:

  1. AFAIK, there is no easy way to get from System.Globalization.CultureInfo to java.util.Locale.
  2. It currently requires compilation on either .NET Framework or netcoreapp3.1; it doesn't compile on later versions of .NET Core (although a library built for netcoreapp3.1 can be consumed by later versions of .NET Core).
  3. It doesn't currently run everywhere .NET Core runs. It doesn't fully support macOS, for example.
  4. It comes with around 70MB of dependencies (per target framework).
  5. It uses the Java formatting syntax.

The Plan

Due to the limitations of IKVM, our plan is not to utilize it for Lucene.NET except for the Lucene.Net.Analysis.OpenNLP module, where there are currently no other good options in .NET. Although, we will probably use <MavenReference> after the deployment size is cut down a bit by breaking up the JRE into separate class libraries.

As for the formatters, we have many compile warnings in work we did to support MessageFormat. Being that it is only used in 1 place internally to run a condition from a resource file, the plan was to try to replicate that without the baggage of MessageFormat and then remove all of the formatters from the codebase before the release. They aren't required by anything else we maintain, so it wasn't a concern (until now), and would give us the opportunity to analyze the formatters at a high level and do a more .NET-like port of them in the future.

Contributing

In light of the above, if you still wish to help to port RuleBasedNumberFormat functionality to .NET, we can continue the conversation and perhaps shift gears on the existing formatter code depending on where that takes us.

Funding

Yes, please. Given the number of useful tools I contribute to, I am surprised that there are not more people willing to kick a few dollars my way every month. Unfortunately, I am not great at self-promotion so the millions of package downloads are not translating into cash.

We have had a bit of support from Microsoft and iText Software, but at present we have no major funding and it is really tough to work on this enough to get it done when I have to seek other work to pay the bills.

@michaelhkay
Author

michaelhkay commented Nov 24, 2022

Many thanks for the detailed response -- I thought it was likely to be complicated, but not that complicated!

Let's start with requirements: our requirement is primarily to support format-integer(xx, "w", lang) in XPath 3.1. For example format-integer(12, "w", "en") returns "twelve". We do need it to work for a wide variety of languages (it's easy to implement English ourselves). Ideally we would support arbitrary big integers (we use Singulinks.Numeric for this) but frankly, no-one actually is going to use it for numbers in the trillions so it would be fine to impose a limit. We don't need support for non-integer values, and we don't need the reverse function.
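As a point of comparison for the English-only case, the "w" behaviour can be sketched in a few lines. The EnglishSpellout class below is hypothetical, handles only non-negative values below one quadrillion, and follows the wording of the ICU4J output shown earlier in this thread (no "and"); it is not the ICU rule set and does not generalize to other languages.

```csharp
using System;

// Hypothetical English-only sketch; not the ICU rule-based implementation.
public static class EnglishSpellout
{
    private static readonly string[] Small =
    {
        "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
        "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
        "seventeen", "eighteen", "nineteen"
    };

    private static readonly string[] Tens =
        { "", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety" };

    // Ordered from largest to smallest so the first match is the leading unit.
    private static readonly (long Value, string Name)[] Scales =
    {
        (1_000_000_000_000, "trillion"), (1_000_000_000, "billion"),
        (1_000_000, "million"), (1_000, "thousand"), (100, "hundred")
    };

    public static string Spell(long n)
    {
        if (n < 0 || n >= 1_000_000_000_000_000)
            throw new ArgumentOutOfRangeException(nameof(n));
        if (n < 20) return Small[(int)n];
        if (n < 100)
            return Tens[(int)(n / 10)] + (n % 10 == 0 ? "" : "-" + Small[(int)(n % 10)]);

        foreach (var (value, name) in Scales)
        {
            if (n >= value)
            {
                // Spell the count of this unit, then recurse on the remainder.
                string head = Spell(n / value) + " " + name;
                long rest = n % value;
                return rest == 0 ? head : head + " " + Spell(rest);
            }
        }
        throw new InvalidOperationException("unreachable: n >= 100 always matches a scale");
    }
}
```

EnglishSpellout.Spell(12) returns "twelve" and EnglishSpellout.Spell(2718) returns "two thousand seven hundred eighteen". Languages with gender or case agreement need rule data, which is where the ICU rules come in.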

Integration with existing APIs in .NET isn't a concern for us at all. We'd be fine with a completely freestanding library that gives us a single method corresponding to the above call.

SaxonCS is a commercial product (we will probably have an open source version at some stage, but we may well keep this functionality as one of the bonuses you get in the paid-for version) so we're happy to talk about funding the development of this as a component which you release as open source. There's certainly value in making the component open source as this will tend to stimulate support for more languages. Contact me off-list at saxonica.com to talk about commercial matters.

Oh, and I should add, this is about .NET Core. In the past we delivered Saxon on .NET using IKVM, but that didn't work on Core, so we developed SaxonCS by creating our own source-level Java-to-C# transpiler.

@NightOwl888
Owner

Thanks also for following up. I have been analyzing this a bit more and have some more details.

> Let's start with requirements: our requirement is primarily to support format-integer(xx, "w", lang) in XPath 3.1. For example format-integer(12, "w", "en") returns "twelve". We do need it to work for a wide variety of languages (it's easy to implement English ourselves). Ideally we would support arbitrary big integers (we use Singulinks.Numeric for this) but frankly, no-one actually is going to use it for numbers in the trillions so it would be fine to impose a limit. We don't need support for non-integer values, and we don't need the reverse function.

This is good news. Limiting the scope like this allows us to commit to a long-term stable API for the spellout functionality while keeping the rest of the implementation internal until we decide how best to present it (which can even be done after we have a production release).

Eliminating double also saves some work. Strangely, this formatter doesn't support float (except through casting).

BigInteger is actually processed in BigDecimal by reading in its value using BigInteger.ToString().ToCharArray(). So, this should be completely compatible with the .NET BigInteger struct. There are more efficient ways to read its value than this, though.
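A minimal sketch of that digit-reading step against System.Numerics.BigInteger follows. The BigIntegerDigits helper is a hypothetical name, and the code only shows that the sign and decimal digits are easy to obtain; how a ported BigDecimal would consume them is an assumption left open here.

```csharp
using System.Globalization;
using System.Numerics;

// Hypothetical helper for illustration; not part of ICU4N.
public static class BigIntegerDigits
{
    // Returns the sign and the decimal digits of the magnitude, most significant first.
    public static (bool IsNegative, char[] Digits) ToDecimalDigits(BigInteger value)
    {
        // Format the absolute value so the array holds digits only, with no sign character.
        char[] digits = BigInteger.Abs(value)
            .ToString(CultureInfo.InvariantCulture)
            .ToCharArray();
        return (value.Sign < 0, digits);
    }
}
```

For BigInteger.Parse("-12345678901234567890") this yields IsNegative = true and the 20 digits of the magnitude.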

Oddly, long.MinValue is processed through DecimalFormat, which seems to be an edge case that couldn't be handled without the conversion. Given that it is a single edge-case value, a dirty workaround might be to do multiple format operations and work out how to do a string.Replace() to insert the missing info for that specific value.
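The likely reason long.MinValue is special is that its absolute value does not fit in a long, so any code that negates the input first breaks on it (in .NET, Math.Abs(long.MinValue) throws OverflowException). The sketch below is one possible alternative to the Replace() workaround, offered as an assumption and not as what ICU4J does: compute the magnitude as a ulong. Int64Magnitude is a hypothetical name.

```csharp
using System;

// Hypothetical helper for illustration; not part of ICU4N or ICU4J.
public static class Int64Magnitude
{
    // Returns |value| as an unsigned 64-bit integer without overflowing.
    // For negative input, -(value + 1) is always representable as a long,
    // and adding 1 after the cast restores the true magnitude.
    public static ulong Of(long value)
        => value >= 0 ? (ulong)value : (ulong)(-(value + 1)) + 1UL;
}
```

Int64Magnitude.Of(long.MinValue) returns 9223372036854775808, which a formatter could then process like any other non-negative number and prefix with the negative-number wording.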

Of course, this means that to process BigInteger we would need to port BigDecimal. For the short term, I see no issue with creating a line-by-line port of it and not exposing it publicly. It turns out that it is self-contained to 2 classes and has no other dependencies. It does approximately double the amount of code to port from ~4500 lines to ~8500 lines, though (roughly, there are a lot of documentation comments that I am including in this estimate).

DecimalFormat depends on both BigDecimal and DecimalFormatSymbols, which is another ~4000 lines, but it is only required to support double and long.MinValue.

FYI - In .NET, the decimal format symbols are loaded internally in the NumberFormatInfo class (which is exposed through the CultureInfo.NumberFormat property), and the settings on NumberFormatInfo allow tweaking the behavior. Usually this is done by calling NumberFormatInfo.Clone(), changing the settings, and then passing the new instance to one of the format or parse methods that accept IFormatProvider.
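The clone-and-tweak pattern described above looks roughly like this. The NumberFormatInfoDemo class and the chosen separators are arbitrary, for illustration only.

```csharp
using System.Globalization;

// Illustration of the NumberFormatInfo.Clone() pattern; the class name is arbitrary.
public static class NumberFormatInfoDemo
{
    public static string FormatWithCustomSeparators(double value)
    {
        // The invariant culture's NumberFormatInfo is read-only; Clone() returns a writable copy.
        var nfi = (NumberFormatInfo)CultureInfo.InvariantCulture.NumberFormat.Clone();
        nfi.NumberGroupSeparator = " ";
        nfi.NumberDecimalSeparator = ",";

        // NumberFormatInfo implements IFormatProvider, so it can be passed directly.
        return value.ToString("N2", nfi);
    }
}
```

NumberFormatInfoDemo.FormatWithCustomSeparators(2718.28) returns "2 718,28". The symbols can be changed this way, but the formatting algorithm itself cannot be replaced.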

> Integration with existing APIs in .NET isn't a concern for us at all. We'd be fine with a completely freestanding library that gives us a single method corresponding to the above call.

Yea, my gut reaction told me that putting it in a separate library made more sense, also. That is, until I started analyzing how RuleBasedNumberFormat is put together and its dependencies.

It turns out I was completely wrong about having to deal with any Java style formatting (more good news). I guess I had in mind the MessageFormat, which does that bit. Every other formatter generally handles formatting exactly 1 type so there is no special syntax.

What this means is that the raw format in the .res files is exactly the format we require. And they deal with most of the special cases rather than handling them in conditional code. Trying to build up the data from the CLDR database and design our own rule-based formatter means we don't leverage the work the ICU team has already done to organize the data and design how to process the rules. Using the .res files as-is saves a huge amount of work, and makes it easier to upgrade ICU4N to match a later version of ICU4J.

As for the design of RuleBasedNumberFormat, it is set up similar to how the Regex class is set up. That is, there is a stage where it loads the settings/compiles that can be pre-loaded in the instance, which saves from having to create an instance every time it is called. But like the Regex class, it is not thread safe. So, a little extra effort would be involved with using it with a cache and presenting it as a static extension method. But it is definitely something that is doable.
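One way the cache plus extension method could be shaped is sketched below, assuming the formatter instance is expensive to build and must not be shared across threads. StubSpelloutFormatter, SpelloutExtensions, and ToSpellout are hypothetical names, and the stub only returns a placeholder string; none of this is the eventual ICU4N API.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Threading;

// Stand-in for a formatter that is costly to construct and not thread-safe.
// Hypothetical: the real class would load and compile rule sets in its constructor.
public sealed class StubSpelloutFormatter
{
    public static int InstancesCreated; // demonstration counter only
    private readonly string _cultureName;

    public StubSpelloutFormatter(CultureInfo culture)
    {
        _cultureName = culture.Name;
        Interlocked.Increment(ref InstancesCreated);
    }

    // Placeholder output; a real formatter would apply the spellout rules.
    public string Format(long value)
        => "[" + _cultureName + " spellout of " + value.ToString(CultureInfo.InvariantCulture) + "]";
}

public static class SpelloutExtensions
{
    // One cache per thread, so a non-thread-safe instance is never shared between threads.
    [ThreadStatic]
    private static Dictionary<string, StubSpelloutFormatter> t_cache;

    public static string ToSpellout(this long value, CultureInfo culture)
    {
        if (t_cache == null)
            t_cache = new Dictionary<string, StubSpelloutFormatter>();

        // Reuse the formatter for this culture if this thread already built one.
        if (!t_cache.TryGetValue(culture.Name, out StubSpelloutFormatter formatter))
        {
            formatter = new StubSpelloutFormatter(culture);
            t_cache[culture.Name] = formatter;
        }
        return formatter.Format(value);
    }
}
```

A call such as 12L.ToSpellout(CultureInfo.GetCultureInfo("en")) builds the formatter once per thread and culture and reuses it afterwards. The trade-off of a per-thread cache is duplicated setup work on each thread; a shared cache with locking, or a formatter made immutable after construction, would be alternatives.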

RuleBasedNumberFormat spans 8 separate files (~4500 lines + ~2000 lines for the tests), but basically all of its dependencies that weren't mentioned above (including the rather involved resource loading and UCultureInfo implementation) are already ported or mostly ported in ICU4N. The tests look pretty straightforward to port.

At this point, it is looking very much like a line-by-line port of RuleBasedNumberFormat is the way to go, and we can expose the SPELLOUT functionality (as well as potentially the other 3 modes, ORDINAL, DURATION, and NUMBERING_SYSTEM) as extension methods of each of the primitive integral types. It would be part of ICU4N rather than an external library, since the latter would involve duplicating a lot of dependencies or moving them into a shared library.

> Contact me off-list at saxonica.com to talk about commercial matters.

Will do.
