Files with UTF-16 TIT2 (and others) have invalid bytes before name #61

Closed
nicklan opened this issue Nov 24, 2015 · 16 comments

Comments

@nicklan

nicklan commented Nov 24, 2015

I have some files that have utf-16 titles. When looking at them in demo.browse, the values get prefixed with invalid characters. These show up as ? chars in my browser, but looking at the returned data, they are not valid utf-16 either. For instance, for one file, the comments_html section contains:

album   array (1)   0   string (22)     �[correct subsequent characters for album]

This is for a number of different files, and other tools process the tags correctly.

Let me know if you need more info, or what else I can do to help track down what's wrong. I'm on version 1.9.10-20150914

@JamesHeinrich
Owner

A UTF-16 sample file would be a great start.

@nicklan
Author

nicklan commented Nov 24, 2015

Sure. This file: http://datashat.net/music_for_programming_10-unity_gain_temple.mp3 (from http://musicforprogramming.net/) shows the problem.

Screenshot of what I'm seeing:
[screenshot: 2015-11-23-221744_1066x675_scrot]

and output of id3v2 -l:
[screenshot: 2015-11-23-221905_893x102_scrot]

@JamesHeinrich
Owner

Those non-displayable characters are indeed the Byte Order Mark (BOM) from the UTF-16 text.

The ID3 documentation specifies this regarding text encodings:

Frames that allow different types of text encoding contains a text
encoding description byte. Possible encodings:

 $00   ISO-8859-1 [ISO-8859-1]. Terminated with $00.
 $01   UTF-16 [UTF-16] encoded Unicode [UNICODE] with BOM. All
       strings in the same frame SHALL have the same byteorder.
       Terminated with $00 00.
 $02   UTF-16BE [UTF-16] encoded Unicode [UNICODE] without BOM.
       Terminated with $00 00.
 $03   UTF-8 [UTF-8] encoded Unicode [UNICODE]. Terminated with $00.

Strings dependent on encoding are represented in frame descriptions
as <text string according to encoding>, or <full text string
according to encoding> if newlines are allowed. Any empty strings of
type $01 which are NULL-terminated may have the Unicode BOM followed
by a Unicode NULL ($FF FE 00 00 or $FE FF 00 00).

Your file is tagged with encoding 01 "UTF-16" which means the text could be either big-endian or little-endian, as determined by the BOM at the start of the string. Without the BOM it is unknown how to display (or convert) the text since it's not known what order the bytes come in. With encoding 02 "UTF-16BE" the order is known so the BOM is not needed.

I did make a small change to remove the BOM from empty frame description fields (the description is usually empty). The BOM will remain for non-empty descriptions as well as for the actual data.
88d284f

Normally you would pull the comment data you need from $info['comments']['title'] rather than $info['id3v2']['COMM'][0]['data'], and the data there is (by default) already converted to UTF-8, which intrinsically removes the BOM. If you do need to process your data directly in UTF-16 for whatever reason, then you would need the BOM intact; otherwise the byte order of your string would be unknown and it couldn't be handled.
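
To illustrate the BOM handling described above, here is a minimal sketch of decoding a text payload according to the encoding byte from the quoted spec. It is not getID3's code: `decode_id3_text` is a hypothetical helper, it assumes PHP's iconv extension is working, and it ignores string terminators.

```php
<?php
// Hypothetical helper, NOT part of getID3: convert the text payload of an
// ID3v2 frame to UTF-8 based on the frame's text encoding byte.
function decode_id3_text(int $encodingByte, string $raw): string
{
    switch ($encodingByte) {
        case 0x00:
            $charset = 'ISO-8859-1';
            break;
        case 0x01:
            // UTF-16 with BOM: the BOM decides the byte order, then is dropped.
            $bom = substr($raw, 0, 2);
            if ($bom === "\xFF\xFE") {
                $charset = 'UTF-16LE';
            } elseif ($bom === "\xFE\xFF") {
                $charset = 'UTF-16BE';
            } else {
                throw new InvalidArgumentException('Encoding $01 requires a BOM');
            }
            $raw = substr($raw, 2);
            break;
        case 0x02:
            // UTF-16BE without BOM: the byte order is fixed by the encoding byte.
            $charset = 'UTF-16BE';
            break;
        case 0x03:
            return $raw; // already UTF-8
        default:
            throw new InvalidArgumentException('Unknown text encoding byte');
    }

    $utf8 = iconv($charset, 'UTF-8', $raw);
    if ($utf8 === false) {
        throw new RuntimeException("iconv could not convert from $charset");
    }
    return $utf8;
}
```

With a working iconv, `decode_id3_text(0x01, "\xFF\xFE\x48\x00\x69\x00")` returns `"Hi"`. Naming the byte order explicitly (UTF-16LE or UTF-16BE) after reading the BOM means the result does not depend on how the converter itself treats a BOM.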

@nicklan
Author

nicklan commented Nov 24, 2015

Ahh yes, this makes sense. Can I ask, though, why $info['comments']['title'] seems to be an array of two elements: one without the BOM but shortened, and one still with the BOM (I assume) but with all the rest of the text? See below:
[screenshot: 2015-11-24-150841_902x363_scrot]

JamesHeinrich reopened this Nov 24, 2015
@JamesHeinrich
Owner

That shouldn't be. There should only be one instance of each title, and it should not include the BOM. Please check that you've mirrored all the changes from GitHub.

[screenshot: g61]

@nicklan
Author

nicklan commented Nov 25, 2015

I have the latest version and I'm still seeing the same as above. I made a fresh checkout of the repo, and at the bottom of the page I see "Powered by getID3() v1.9.10-201511241457" which seems to be the latest version. (Thanks very much for looking into this by the way!)

@nicklan
Author

nicklan commented Nov 25, 2015

Well, I think I know why there are two entries: it seems one is coming from the id3v1 tag (the shortened one) and one from the id3v2 tag (with the BOM). You probably already figured that out :) But I'm still not sure why you're not seeing that behavior. Could there be something in my PHP settings? I'm on 5.6.4 64-bit.
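
One way to see why the ID3v1 copy comes out shortened: ID3v1 stores the title in a fixed 30-byte field, so anything longer is cut off when the tag is written. The sketch below only illustrates that limit; both function names are hypothetical and none of this is getID3 code.

```php
<?php
// Illustrative only: ID3v1 keeps the title in a fixed 30-byte field.
const ID3V1_TITLE_BYTES = 30;

// Build the 30-byte title field: truncate long titles, NUL-pad short ones.
function id3v1_title_field(string $title): string
{
    return str_pad(substr($title, 0, ID3V1_TITLE_BYTES), ID3V1_TITLE_BYTES, "\x00");
}

// Read the field back, dropping trailing NUL or space padding.
function read_id3v1_title(string $field): string
{
    return rtrim($field, "\x00 ");
}

$long = 'A title that is clearly longer than thirty bytes';
echo read_id3v1_title(id3v1_title_field($long)), "\n";
// prints "A title that is clearly longer"
```

The ID3v2 frame has no such fixed width, which is why the second array element carries the full text.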

@JamesHeinrich
Owner

My best guess would be that your PHP installation doesn't have native iconv() support and it's relying on getid3_lib::iconv_fallback() and there may be an issue in there.

Note that this is simply a guess at this point, I'll need to take a look at that tomorrow and see if I can find a problem. I'll let you know.

JamesHeinrich reopened this Nov 25, 2015
@JamesHeinrich
Owner

Can you save the entire output of demo.browse for that file to a .html file and attach it here please?

@nicklan
Author

nicklan commented Nov 25, 2015

Sure, attached below (as .txt so github would let me). I'll have a look too and see if I can figure anything out with the iconv thing, thanks for the hint.

getID3() - _demo_demo.browse.php (sample script).txt

@JamesHeinrich
Owner

If I disable the built-in iconv and use getID3's version it still works correctly. Perhaps there is an issue with your built-in version of iconv?

First let's check if it's there, what version if available, and then try a very simple conversion using both PHP's iconv() function and getID3's version:

require_once('N:/webroot/_github/getID3/getid3/getid3.lib.php');
$string = "\xFF\xFE\x48\x00\x69\x00"; // BOM+"Hi"
echo '<pre>';
echo (function_exists('iconv') ? 'yes: '.`iconv --version` : 'no').'<hr>';
var_dump(iconv('UTF-16', 'UTF-8//TRANSLIT', $string));
var_dump(getid3_lib::iconv_fallback('UTF-16', 'UTF-8//TRANSLIT', $string));
echo '</pre>';

They should both just say "Hi" with no BOM, 2 chars long. I suspect one of them will be 4-chars with a BOM.

@nicklan
Author

nicklan commented Nov 25, 2015

Yep, looks like PHP's iconv() is failing and getID3's fallback is leaving the BOM:

yes: iconv (Gentoo 2.21-r1 p5) 2.21
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Ulrich Drepper.
bool(false)
string(6) "��Hi"

@nicklan
Author

nicklan commented Nov 25, 2015

ahh, and the iconv error is: "Notice: iconv(): Wrong charset, conversion from `UTF-16' to `UTF-8//TRANSLIT' is not allowed in [path_to_test].php on line 13" (the iconv line)

@nicklan
Author

nicklan commented Nov 25, 2015

couple of other notes

  • on the command line, iconv seems to be able to convert from utf-16 to utf-8 without a problem (i.e. not going through php). not sure if that's at all relevant but I wanted to test.
  • i've tried UTF-8//IGNORE and UTF-8 with the same results

@nicklan
Author

nicklan commented Nov 25, 2015

ohh, and if i run php at the command line, it works. outputting:

<pre>yes: iconv (Gentoo 2.21-r1 p5) 2.21
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Ulrich Drepper.
<hr>string(2) "Hi"
string(2) "Hi"

So it must be something with my nginx install. Yar. I will keep hunting.

@nicklan
Author

nicklan commented Nov 28, 2015

Okay, it turned out to be an issue with php-fpm, which wasn't loading the iconv shared libraries properly. Thanks for the help pinpointing it!
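
For anyone chasing a similar CLI-versus-php-fpm difference, a small diagnostic along these lines can be run under both and the results compared. It is only a sketch: `iconv_diagnostics` is a hypothetical name, and it reports what the current SAPI sees rather than fixing anything.

```php
<?php
// Hypothetical diagnostic, NOT part of getID3: report how iconv behaves under
// the current SAPI so the CLI and php-fpm results can be compared.
function iconv_diagnostics(): array
{
    $report = array(
        'sapi'        => php_sapi_name(),
        'php_version' => PHP_VERSION,
        'loaded'      => extension_loaded('iconv'),
        'impl'        => defined('ICONV_IMPL') ? ICONV_IMPL : null,
        'version'     => defined('ICONV_VERSION') ? ICONV_VERSION : null,
        'converted'   => null,
    );

    if ($report['loaded']) {
        // BOM + "Hi" in little-endian UTF-16, the same probe used in the thread.
        // @ hides the notice/warning so a failure shows up as false.
        $out = @iconv('UTF-16', 'UTF-8', "\xFF\xFE\x48\x00\x69\x00");
        $report['converted'] = ($out === false) ? false : bin2hex($out);
    }

    return $report;
}

var_export(iconv_diagnostics());
```

A healthy setup should show `'converted' => '4869'` (the hex for "Hi") under every SAPI; `false` under php-fpm but not under the CLI points at the same kind of environment problem found here.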
