Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ability to change text encoding post-parse #27

Closed
mdsitton opened this issue Jun 16, 2019 · 7 comments
Closed

Ability to change text encoding post-parse #27

mdsitton opened this issue Jun 16, 2019 · 7 comments
Labels
feature request API changes required
Projects

Comments

@mdsitton
Copy link

mdsitton commented Jun 16, 2019

So right now if you need to try and handle multiple midi files of various text encodings at runtime you are somewhat out of luck and would need to continuously re-parse the whole midi until the correct encoding is found.

Going from this:

public string Text { get; set; }
protected sealed override void ReadContent(MidiReader reader, ReadingSettings settings, int size)
{
    ThrowIfArgument.IsNegative(nameof(size), size,
            "Text event cannot be read since the size is negative number.");

    if (size == 0)
        return;

    textData = reader.ReadBytes(size);
    var encoding = settings.TextEncoding ?? SmfUtilities.DefaultEncoding;
    Text = encoding.GetString(bytes);
}

to something like this (this isnt 100% working code but just something to get an idea of what i'm saying here)

protected byte[] _textData = null;
protected string _textEncoded;

public string Text
{
    get
    {
        if (_textData != null)
        {
            var encoding = settings.TextEncoding ?? SmfUtilities.DefaultEncoding;
            _textEncoded = encoding.GetString(_textData);
            return _textEncoded;

        }
        else
        {
            return _textEncoded;
        }
    }
    set => _textEncoded = value;
}

protected sealed override void ReadContent(MidiReader reader, ReadingSettings settings, int size)
{
    ThrowIfArgument.IsNegative(nameof(size), size, "Text event cannot be read since the size is negative number.");

    if (size == 0)
        return;

    _textData = reader.ReadBytes(size);
}

It would need some other api changes perhaps with how settings and/or encodings are handled.

@mdsitton
Copy link
Author

Just a btw i'm one of the clone hero developers that we were talking on discord a while back, for which I also messaged this basic idea in that group message @melanchall

@melanchall
Copy link
Owner

Hi Matthew,

Do you need to have ability to change encoding somewhere in the future after file is parsed or you need to try different encodings during file reading?

For first case I suggest just get bytes from string and convert it to string using different encodings. Then you can set Text property of a MIDI event.

For second case I suggest to introduce some callback for text parsing to ReadingSettings/WritingSettings.

Please describe your task in more details so we can come to best solution.

Max

@mdsitton
Copy link
Author

@melanchall The issue with doing that is that is that decoding text from bytes into a string is not a lossless process. For example if you have a latin1 encoding decoded as UTF-8 or ASCII unsupported characters will be replaced with alternative codepoints in those encoding ranges.

@mdsitton
Copy link
Author

mdsitton commented Jun 16, 2019

I don't really specifically need access to the text parsing stage of decoding as mentioned in option 2, that wouldn't really work too well in how the application is setup. Having access to change encoding after the file is parsed would be most useful like i mentioned originally.

Basically what I'm doing is parsing lyrics from midi files, and some of these lyrics are in various different encodings. Mainly UTF-8 and ISO-8859-1. Currently i'm checking if "U+FFFD" is produced and re trying parsing that phrase as a different encoding through a tweak in ReadContent but i'd like to avoid this hack if i can.

@melanchall
Copy link
Owner

I bow to option 2. My arguments:

  • MIDI file hasn't such concept as "encoding". Text events just hold text string, there is no encoding here.
  • Text events are mutable. What if I set new string to Text property? What encoding this string in? Without this information it's impossible to convert string to another encoding.

When I'm designing API I should always think about different cases and most versatile solution. Introducing something like "encoding" in MIDI file creates more troubles than profit.

As I mentioned in option 2 ReadingSettings will provide callback for text parsing which can be used to any processing you want including charset detection (exact API is a subject of discussion).

Also conversion between different encodings is lossless in terms of bytes. At now you can just use this code without modifying the library code:

var originalEncoding = readingSettings.TextEncoding ?? Encoding.ASCII;
var bytes = originalEncoding.GetBytes(textEvent.Text);

// convert bytes to any encoding and set Text property of event

Not good but will work. So it seems implementing text parsing callback is a best option.

You wrote:

wouldn't really work too well in how the application is setup

Can you explain why it wouldn't really work too well?

@melanchall
Copy link
Owner

@mdsitton So is it good to have text parsing callback?

@melanchall
Copy link
Owner

I've added DecodeTextCallback property to ReadingSettings.

@melanchall melanchall added the feature request API changes required label Feb 2, 2020
@melanchall melanchall added this to Done in DryWetMIDI Mar 23, 2020
ronnieoverby pushed a commit to ronnieoverby/drywetmidi that referenced this issue Feb 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request API changes required
Projects
DryWetMIDI
  
Done
Development

No branches or pull requests

2 participants