Files with UTF-16 TIT2 (and others) have invalid bytes before name #61

nicklan · 2015-11-24T00:37:22Z

I have some files that have utf-16 titles. When looking at them in demo.browse, the values get prefixed with invalid characters. These show up as ? chars in my browser, but looking at the returned data, they are not valid utf-16 either. For instance, for one file, the comments_html section contains:

album   array (1)   0   string (22)     &#2089536;[correct subsequent characters for album]

This happens with a number of different files, and other tools process the tags correctly.

Let me know if you need more info, or what else I can do to help track down what's wrong. I'm on version 1.9.10-20150914.


JamesHeinrich · 2015-11-24T00:38:12Z

A UTF-16 sample file would be a great start.

nicklan · 2015-11-24T06:20:21Z

Sure. This file: http://datashat.net/music_for_programming_10-unity_gain_temple.mp3 (from http://musicforprogramming.net/) shows the problem.

Screenshot of what I'm seeing:

and output of id3v2 -l:


JamesHeinrich · 2015-11-24T19:25:22Z

Those non-displayable characters are indeed the Byte Order Mark (BOM) from the UTF-16 text.

The ID3 documentation specifies this regarding text encodings:

Frames that allow different types of text encoding contains a text
encoding description byte. Possible encodings:

 $00   ISO-8859-1 [ISO-8859-1]. Terminated with $00.
 $01   UTF-16 [UTF-16] encoded Unicode [UNICODE] with BOM. All
       strings in the same frame SHALL have the same byteorder.
       Terminated with $00 00.
 $02   UTF-16BE [UTF-16] encoded Unicode [UNICODE] without BOM.
       Terminated with $00 00.
 $03   UTF-8 [UTF-8] encoded Unicode [UNICODE]. Terminated with $00.

Strings dependent on encoding are represented in frame descriptions
as <text string according to encoding>, or <full text string
according to encoding> if newlines are allowed. Any empty strings of
type $01 which are NULL-terminated may have the Unicode BOM followed
by a Unicode NULL ($FF FE 00 00 or $FE FF 00 00).

Your file is tagged with encoding 01 "UTF-16" which means the text could be either big-endian or little-endian, as determined by the BOM at the start of the string. Without the BOM it is unknown how to display (or convert) the text since it's not known what order the bytes come in. With encoding 02 "UTF-16BE" the order is known so the BOM is not needed.
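To make the role of the BOM concrete, here is a minimal sketch of how the first two bytes of an encoding-$01 string select the byte order. The function name `split_utf16_bom` is hypothetical and is not part of getID3; it only illustrates the rule quoted above.

```php
<?php
// Hypothetical helper, not part of getID3: inspect the BOM of an ID3v2
// encoding-$01 string and report which byte order it declares.
function split_utf16_bom(string $raw): array
{
    $bom = (string) substr($raw, 0, 2);
    if ($bom === "\xFF\xFE") {
        // FF FE: least significant byte first
        return ['UTF-16LE', (string) substr($raw, 2)];
    }
    if ($bom === "\xFE\xFF") {
        // FE FF: most significant byte first
        return ['UTF-16BE', (string) substr($raw, 2)];
    }
    // No BOM: the byte order cannot be determined from the data alone.
    return [null, $raw];
}

// Same sample bytes as the test string used later in this thread: BOM + "Hi".
list($charset, $payload) = split_utf16_bom("\xFF\xFE\x48\x00\x69\x00");
```

With the byte order known, the payload could then be handed to a converter such as iconv() using the explicit charset name, which is why the BOM has to stay attached to raw UTF-16 data.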

I did make a small change (commit 88d284f) to remove the BOM from empty frame description fields (descriptions are usually blank). The BOM will remain for non-empty descriptions as well as for the actual data.

Normally you would pull the comment data you need from $info['comments']['title'] rather than $info['id3v2']['COMM'][0]['data']. The data there is (by default) already converted to UTF-8, which intrinsically removes the BOM. If you do need to process your data directly in UTF-16 for whatever reason, then you would need the BOM intact; otherwise your string couldn't be handled.
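As a rough illustration of that access pattern, the sketch below reads the first value for a key from the 'comments' section. The helper name `first_comment` and the hand-written `$info` array are assumptions for demonstration only; a real `$info` would come from getID3 analyzing a file, and its exact contents depend on the file.

```php
<?php
// Hypothetical helper, not part of getID3: return the first value stored
// under $info['comments'][$key], or null when the key is absent or empty.
function first_comment(array $info, string $key): ?string
{
    $values = $info['comments'][$key] ?? [];
    if (!is_array($values) || count($values) === 0) {
        return null;
    }
    return (string) reset($values);
}

// Hand-written stand-in for an analyzed file (assumed shape: each key in
// 'comments' maps to a list of UTF-8 strings).
$info = [
    'comments' => [
        'title' => ['Example Title'],
    ],
];

$title = first_comment($info, 'title');
$genre = first_comment($info, 'genre'); // key not present in the stand-in
```

Because each key holds a list, more than one value can appear when several tag formats contribute to the same field.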

nicklan · 2015-11-24T23:10:40Z

Ahh yes, this makes sense. Can I ask, though, why $info['comments']['title'] seems to be an array of two elements: one without the BOM but shortened, and one still with the BOM (I assume) but with all the rest of the text? See below:

JamesHeinrich · 2015-11-24T23:35:12Z

That shouldn't be. There should only be one instance of each title, without the BOM. Please check that you've mirrored all the changes from GitHub.

nicklan · 2015-11-25T00:00:23Z

I have the latest version and I'm still seeing the same as above. I made a fresh checkout of the repo, and at the bottom of the page I see "Powered by getID3() v1.9.10-201511241457" which seems to be the latest version. (Thanks very much for looking into this by the way!)

nicklan · 2015-11-25T01:23:19Z

Well, I think I know why there are two values: it seems one is coming from the id3v1 tag (the shortened one) and one from the id3v2 tag (with the BOM). You've probably already figured that out :) But I'm still not sure why you're not seeing that behavior. Could there be something in my PHP settings? I'm on 5.6.4 64-bit.

JamesHeinrich · 2015-11-25T04:05:04Z

My best guess would be that your PHP installation doesn't have native iconv() support, so it's relying on getid3_lib::iconv_fallback(), and there may be an issue in there.

Note that this is simply a guess at this point, I'll need to take a look at that tomorrow and see if I can find a problem. I'll let you know.

JamesHeinrich · 2015-11-25T05:31:01Z

Can you save the entire output of demo.browse for that file to a .html file and attach it here please?

nicklan · 2015-11-25T07:12:50Z

Sure, attached below (as .txt so github would let me). I'll have a look too and see if I can figure anything out with the iconv thing, thanks for the hint.

getID3() - _demo_demo.browse.php (sample script).txt

JamesHeinrich · 2015-11-25T15:13:47Z

If I disable the built-in iconv and use getID3's version it still works correctly. Perhaps there is an issue with your built-in version of iconv?

First let's check if it's there, what version if available, and then try a very simple conversion using both PHP's iconv() function and getID3's version:

require_once('N:/webroot/_github/getID3/getid3/getid3.lib.php');
$string = "\xFF\xFE\x48\x00\x69\x00"; // BOM+"Hi"
echo '<pre>';
echo (function_exists('iconv') ? 'yes: '.`iconv --version` : 'no').'<hr>';
var_dump(iconv('UTF-16', 'UTF-8//TRANSLIT', $string));
var_dump(getid3_lib::iconv_fallback('UTF-16', 'UTF-8//TRANSLIT', $string));
echo '</pre>';

They should both just say "Hi" with no BOM, 2 chars long. I suspect one of them will be 4-chars with a BOM.

nicklan · 2015-11-25T23:15:22Z

Yep, looks like PHP's iconv() is failing and getID3's fallback is leaving the BOM:

yes: iconv (Gentoo 2.21-r1 p5) 2.21
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Ulrich Drepper.
bool(false)
string(6) "��Hi"

nicklan · 2015-11-25T23:20:04Z

Ahh, and the iconv error is: "Notice: iconv(): Wrong charset, conversion from `UTF-16' to `UTF-8//TRANSLIT' is not allowed in [path_to_test].php on line 13" (the iconv line)

nicklan · 2015-11-25T23:35:11Z

A couple of other notes:

- On the command line, iconv seems to be able to convert from utf-16 to utf-8 without a problem (i.e. not going through PHP). Not sure if that's at all relevant, but I wanted to test.
- I've tried UTF-8//IGNORE and UTF-8 with the same results.

nicklan · 2015-11-25T23:39:02Z

Ohh, and if I run PHP at the command line, it works, outputting:

<pre>yes: iconv (Gentoo 2.21-r1 p5) 2.21
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Ulrich Drepper.
<hr>string(2) "Hi"
string(2) "Hi"

So it must be something with my nginx install. Yar. I will keep hunting.

nicklan · 2015-11-28T21:24:55Z

Okay, it turned out to be an issue with php-fpm, which wasn't loading the iconv shared libraries properly. Thanks for the help pinpointing it!
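For anyone hitting the same symptom, a small diagnostic along these lines may help show whether the command-line and php-fpm environments disagree about iconv. This is only a sketch: the function name `iconv_report` and the report keys are made up for illustration, and it assumes nothing beyond PHP's own iconv extension.

```php
<?php
// Hypothetical diagnostic, not part of getID3: report which SAPI is running
// and whether iconv() can actually convert UTF-16 with a BOM to UTF-8.
function iconv_report(): array
{
    $available = function_exists('iconv');
    $report = [
        'sapi'             => PHP_SAPI, // e.g. "cli" or "fpm-fcgi"
        'iconv_available'  => $available,
        'iconv_impl'       => $available && defined('ICONV_IMPL') ? ICONV_IMPL : null,
        'iconv_version'    => $available && defined('ICONV_VERSION') ? ICONV_VERSION : null,
        'utf16_to_utf8_ok' => false,
    ];
    if ($available) {
        // "@" hides the "Wrong charset" notice; a failed conversion returns false.
        $out = @iconv('UTF-16', 'UTF-8', "\xFF\xFE\x48\x00\x69\x00"); // BOM + "Hi"
        $report['utf16_to_utf8_ok'] = ($out === 'Hi');
    }
    return $report;
}

$report = iconv_report();
print_r($report);
```

Running the same script once from the command line and once through the web server, then comparing the two reports, should show whether the difference lies in the server-side PHP environment rather than in getID3.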

JamesHeinrich added a commit that referenced this issue Nov 24, 2015

ID3v2 remove BOM from frame descriptions

88d284f


JamesHeinrich closed this as completed Nov 24, 2015

JamesHeinrich reopened this Nov 24, 2015

JamesHeinrich closed this as completed Nov 24, 2015

JamesHeinrich reopened this Nov 25, 2015

nicklan closed this as completed Nov 28, 2015

Rello mentioned this issue Aug 27, 2016

Version 1.1.0 doesn't work error 15% scan Rello/audioplayer#49

Closed
