Non-latin characters in PelEntryWindowsString #16

dmitrii-fediuk · 2011-01-02T04:40:39Z

Why does PelEntryWindowsString support only the Latin-1 character set?
Windows surely supports other single-byte character sets in the file properties dialog (in particular Russian: Windows-1251).

lsolesen · 2011-01-02T09:37:29Z

It should certainly support other character sets. I think we should just go for UTF8. Do you have a test case showing that it does not support your character set?

dmitrii-fediuk · 2011-01-02T12:18:48Z

Test Case: http://pastie.org/1423164

Results (given and expected): http://img35.imageshack.us/img35/4917/testresults.png

lsolesen · 2011-01-03T13:58:53Z

Could you post the test image also?

dmitrii-fediuk · 2011-01-03T14:13:00Z

I tested it with several JPEG images from different sources; the result is the same.
For example, the image:
http://img689.imageshack.us/img689/1230/testss.jpg

EXIF data produced by the library (see the "subject" field): http://regex.info/exif.cgi?b=3&url=http://img689.imageshack.us/img689/1230/testss.jpg

lsolesen · 2011-01-03T14:50:56Z

I have just added a testcase and it seems to pass at my end. Are you using utf8 as encoding for the file you are using?

dmitrii-fediuk · 2011-01-03T14:58:06Z

Yes, I use utf8.
Can you check that Windows correctly shows the "subject" field in the file properties dialog after your test?
I downloaded the image attached to your test case, and the "subject" field is wrong...
The right subject is: "Превед, медвед!"
The wrong subject (as Windows shows it) is: "Ïðåâåä, ìåäâåä!"

lsolesen · 2011-01-03T15:03:27Z

I have not put a subject field in the image. I just used your image as a reference. As you can see in the test case, I copy the picture and work on the copy. You can change tearDown() so that it does not unlink the test image, and check it for yourself. Sorry, but I do not have a Windows machine. Could you let me know what it generates?

dmitrii-fediuk · 2011-01-03T15:17:23Z

My Windows 7 shows the subject as "Ïðåâåä, ìåäâåä!", which is wrong...
At the same time, if I manually fill the "subject" field with "Превед, медвед!", Windows correctly saves the Russian (Cyrillic) characters.
Here is the file with the subject that Windows shows correctly (filled manually): http://img407.imageshack.us/img407/8624/rightgg.jpg

Here is the file with the subject that Windows shows wrongly (filled by the script): http://img10.imageshack.us/img10/1867/wrongne.jpg

You can see the difference in an online EXIF viewer:
right: http://regex.info/exif.cgi?b=3&url=http://img407.imageshack.us/img407/8624/rightgg.jpg

wrong: http://regex.info/exif.cgi?b=3&url=http://img10.imageshack.us/img10/1867/wrongne.jpg

lsolesen · 2011-01-03T15:20:05Z

I am aware of the difference. Have you tried running the test case?

php test/gh-16.php

Change the tearDown() method so that it does not unlink the tmp file, and read the subject in that file.

dmitrii-fediuk · 2011-01-03T17:29:23Z

In your test case you write subject string as UTF-8 and then read it as UTF-8.
Therefore, your test case passes correctly.

But Windows, I think, does not support UTF-8 for PelTag::XP_SUBJECT (nor, I suspect, for the other PelTag::XP_* tags).
Windows expects a single-byte encoding there.
And when I open the image produced by your test case in Windows, it shows the subject as "Ð�Ñ�ÐµÐ²ÐµÐ´, Ð¼ÐµÐ´Ð²ÐµÐ´!"
That is wrong.

In my test case I take into account that Windows expects PelTag::XP_SUBJECT in a single-byte encoding, and recode from UTF-8 to Windows-1251 (the Russian Windows encoding):
$subject = iconv("UTF-8", "windows-1251", "Превед, медвед!");
But in this case your library treats it as Latin-1 and produces a wrong result too: "Ïðåâåä, ìåäâåä!".
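
A minimal sketch of why the Windows-1251 attempt ends up as "Ïðåâåä, ìåäâåä!". It assumes that the library widens every input byte to a 16-bit little-endian unit by appending a zero byte (as the setValue code quoted later in this thread does), and that the reader then decodes those units as UTF-16LE. The helper name padBytesWithNul is hypothetical and not part of PEL; the source file is assumed to be saved as UTF-8, and the iconv and mbstring extensions are assumed to be available.

```php
<?php
// Hypothetical helper: widen each byte to a 16-bit little-endian unit
// by appending 0x00, without looking at the string's real encoding.
function padBytesWithNul(string $bytes): string
{
    $out = '';
    for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
        $out .= $bytes[$i] . "\x00";
    }
    return $out;
}

$utf8   = "Превед, медвед!";
// Recode to Windows-1251: one byte per Cyrillic letter (e.g. 0xCF for "П").
$cp1251 = iconv('UTF-8', 'windows-1251', $utf8);

// Byte 0xCF becomes the 16-bit unit 0x00CF, which is U+00CF ("Ï"), not "П".
$padded = padBytesWithNul($cp1251);

// Decode the padded bytes the way a UTF-16LE reader would.
$shown = mb_convert_encoding($padded, 'UTF-8', 'UTF-16LE');
echo $shown, PHP_EOL; // prints: Ïðåâåä, ìåäâåä!
```

Widening a byte with a zero high byte is the same as treating it as a Latin-1 code point, so any input that is not Latin-1 is mapped to the wrong characters.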

dmitrii-fediuk · 2011-01-03T19:06:03Z

Finally, I got it working!

The problem is in PelEntryWindowsString::setValue.

In your library, it works only for Latin-1 and looks like this:

function setValue($str) {
    $l = strlen($str);

    $this->components = 2 * ($l + 1);
    $this->str        = $str;
    $this->bytes      = '';
    for ($i = 0; $i < $l; $i++)
        $this->bytes .= $str{$i} . chr(0x00);

    $this->bytes .= chr(0x00) . chr(0x00);
}

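I rewrote it as:
function setValue($str) {
    // $str is expected to be UTF-8; count characters, not bytes.
    $l = mb_strlen($str, 'UTF-8');

    $this->components = 2 * ($l + 1);
    $this->str        = $str;
    // Convert to UCS-2 little-endian and append a 16-bit NUL terminator.
    $this->bytes      = mb_convert_encoding($str, 'UCS-2LE', 'UTF-8')
                      . chr(0x00) . chr(0x00);
}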

My function expects the $str argument to be UTF-8.
It converts the string to UCS-2 little-endian, which is the format that Windows expects.
From what I found by searching, Windows always expects XP_* data to be little-endian, regardless of the byte order used for the rest of the data.
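
The conversion described above can be checked outside the library with a small standalone sketch. The function name encodeWindowsString is hypothetical (it is not a PEL API); it assumes UTF-8 input, characters within the Basic Multilingual Plane, and the mbstring extension.

```php
<?php
// Hypothetical standalone version of the conversion: UTF-8 in,
// UCS-2 little-endian bytes with a 16-bit NUL terminator out.
function encodeWindowsString(string $utf8): string
{
    return mb_convert_encoding($utf8, 'UCS-2LE', 'UTF-8') . "\x00\x00";
}

// "П" is U+041F and "р" is U+0440; little-endian puts the low byte first.
$bytes = encodeWindowsString("Пр");
echo bin2hex($bytes), PHP_EOL; // prints: 1f0440040000

// The byte count matches the component count used in setValue:
// 2 * (number of characters + 1 terminator).
$components = 2 * (mb_strlen("Пр", 'UTF-8') + 1);
echo $components, PHP_EOL; // prints: 6
```

For plain ASCII input the result is identical to the original byte-padding loop, so existing Latin-only strings should be unaffected: encodeWindowsString("Hi") yields the bytes 48 00 69 00 00 00.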

lsolesen · 2016-04-08T12:30:13Z

@mage2pro Do you know whether Windows has started supporting UTF-8? It's been a while, so I think (hope) they may support it now.

lsolesen · 2016-04-08T12:32:32Z

Closing this for now. Please reopen, if you still have this problem.

ghost assigned lsolesen Apr 11, 2011

lsolesen modified the milestones: v0.10.0, v0.9.4 Mar 21, 2016

lsolesen closed this as completed Apr 8, 2016
