Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove Non-ASCII Character Clobbering #324

Closed
wants to merge 1 commit into from

Conversation

NoSuck
Copy link
Contributor

@NoSuck NoSuck commented Mar 11, 2021

IT.TXT even explicitly mentions the ability to use characters that libxmp currently converts to periods for some reason:

3) A couple of people have asked about ASCII characters > 128. I'm
   sorry - I can't let you have 'em on anywhere else except the message
   editor.

Furthermore, this comment only pertains to the interface of IT.EXE and not the IT spec. In other words, instrument names and other strings should not care about character encodings (but I will leave those fixes for other pull requests).

IT.TXT even explicitly mentions the ability to use characters that libxmp currently converts to periods for some reason:

    3) A couple of people have asked about ASCII characters > 128. I'm
       sorry - I can't let you have 'em on anywhere else except the message
       editor.

Furthermore, this comment only pertains to the interface of IT.EXE and not the IT spec.  In other words, instrument names and other strings should not care about character encodings (but I will leave those fixes for other pull requests).
@NoSuck
Copy link
Contributor Author

NoSuck commented Mar 11, 2021

... I will leave those fixes for other pull requests.

Libxmp doesn't touch 'em, so nothing needs to be done. 👍

@AliceLR
Copy link
Contributor

AliceLR commented Mar 17, 2021

I don't personally see an issue with allowing characters >127 but IMO control codes should probably still be filtered out unless there's a good reason for them. With -DDEBUG enabled said control codes will be printed out, causing terminal beeps and potentially other unwanted side effects.

@NoSuck
Copy link
Contributor Author

NoSuck commented Mar 23, 2021

The song message is an encoding-independent array of bytes. We are unable to assume what constitutes a control code.

@AliceLR
Copy link
Contributor

AliceLR commented Mar 23, 2021

I don't really have the desire to argue about what is and isn't a control code since that is well-defined for most platforms libxmp supports. Intentionally providing unsanitized strings to the API (and console) by default seems in bad taste to me, though. Deferring this one to @cmatsuoka.

@sezero
Copy link
Collaborator

sezero commented Mar 23, 2021

I agree with @AliceLR

@NoSuck
Copy link
Contributor Author

NoSuck commented Mar 24, 2021

I don't really have the desire to argue about what is and isn't a control code since that is well-defined for most platforms libxmp supports.

This illustrates a fundamental misunderstanding of the issue. Perhaps I failed to communicate well.

Whether or not any given byte represents a control code is constituted by the character encoding, not the platform.

@cmatsuoka
Copy link
Collaborator

cmatsuoka commented Mar 29, 2021

In an ideal world it would be great if we could convert the platform-specific characters to something similar in the running system so we could show messages as the original author intended, but dealing with encodings, especially inside the library, is not worth the pain. I would add proper message display to the next generation module player wishlist and leave libxmp as is for now.

One possible way to implement it in libxmp-ng could be to have an "original platform" attribute in formats and offer a way to retrieve sanitized or raw bytes for each displayable piece of data, so that the application using the library could convert and display it (but for that we would also need a better API that doesn't expose structure fields directly.)

@cmatsuoka
Copy link
Collaborator

cmatsuoka commented Mar 29, 2021

Regarding the specific usage with IT modules, I see the specs allow it but I think the actual output the user would see (typically utf-8) would not match what the author intended. Also I agree that control characters for the host system should be filtered out to prevent it to mess with terminal output.

What encoding is Windows using these days?

@sagamusix How does OpenMPT deal with characters > 127 in the module message?

@sagamusix
Copy link
Contributor

OpenMPT itself currently doesn't do any codepage conversion for a number of reasons. libopenmpt however does faithful codepage conversion to UTF-8 (for some formats it's trivial, for other formats it requires guesswork based on which tracker was used to save the file, etc...). Its library interface has used UTF-8 since day 1.

@zzo38
Copy link

zzo38 commented Aug 8, 2022

I think that it would be better to expose the raw text without conversion, and to include a function to request the IBM code page number of the text (or zero if it is unknown).

(I, personally, would rather to deal with the raw text directly and not deal with UTF-8.)

I think that the command-line interface should include filtering out control codes, but the library should not do so.

@sezero
Copy link
Collaborator

sezero commented Jun 20, 2023

This looks like a WONTFIX, yes?

@sezero sezero closed this Jun 22, 2023
@sagamusix
Copy link
Contributor

Not sure why it needs to be rejected - if this sort of "fixup" was only done for IT files, I see no reason to keep it, because arguably with all other file formats it will already be possible to retrieve text with unknown encoding through the API. If libxmp wanted to make sure that no extended characters with unknown encoding are provided through its API, this should be applied in a central place for all formats, not just for IT.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants