This repository has been archived by the owner on Jun 23, 2023. It is now read-only.

Non-latin characters in PelEntryWindowsString #16

Closed
dmitrii-fediuk opened this issue Jan 2, 2011 · 13 comments

Comments

@dmitrii-fediuk

Why does PelEntryWindowsString support only the Latin-1 character set?
Windows certainly supports other single-byte character sets in the file properties dialog (in particular Russian: Windows-1251).

@lsolesen
Collaborator

lsolesen commented Jan 2, 2011

It should certainly support other character sets. I think we should just go for UTF-8. Do you have a test case showing that it does not support your character set?

@dmitrii-fediuk
Author

Test Case: http://pastie.org/1423164

Results (given and expected): http://img35.imageshack.us/img35/4917/testresults.png

@lsolesen
Collaborator

lsolesen commented Jan 3, 2011

Could you post the test image also?

@dmitrii-fediuk
Author

I tested it with several JPEG images from different sources, and the result is the same.
For example, this image:
http://img689.imageshack.us/img689/1230/testss.jpg

EXIF data produced by the library (see the "subject" field): http://regex.info/exif.cgi?b=3&url=http://img689.imageshack.us/img689/1230/testss.jpg

@lsolesen
Collaborator

lsolesen commented Jan 3, 2011

I have just added a test case and it seems to pass at my end. Is the file you are using encoded as UTF-8?

@dmitrii-fediuk
Author

Yes, I use UTF-8.
Can you check that Windows shows the "subject" field correctly in the file properties dialog after your test?
I downloaded the image attached to your test case, and its "subject" field is wrong.
The correct subject is: "Превед, медвед!"
The wrong subject (as Windows shows it) is: "Ïðåâåä, ìåäâåä!"

@lsolesen
Collaborator

lsolesen commented Jan 3, 2011

I have not put a subject field in the image. I just used your image as a reference. As you can see in the test case, I copy the picture and work on the copy. You can change tearDown() so that it does not unlink the test image, and then check it for yourself. Sorry, but I do not have a Windows machine. Let me know what it generates.

@dmitrii-fediuk
Author

My Windows 7 shows the subject as "Ïðåâåä, ìåäâåä!", which is wrong.
At the same time, if I manually fill in the "subject" field with "Превед, медвед!", Windows saves the Russian (Cyrillic) characters correctly.
Here is the file whose subject Windows shows correctly (filled in manually): http://img407.imageshack.us/img407/8624/rightgg.jpg

Here is the file whose subject Windows shows incorrectly (filled in by the script): http://img10.imageshack.us/img10/1867/wrongne.jpg

You can see the difference in an online EXIF viewer:
right: http://regex.info/exif.cgi?b=3&url=http://img407.imageshack.us/img407/8624/rightgg.jpg

wrong: http://regex.info/exif.cgi?b=3&url=http://img10.imageshack.us/img10/1867/wrongne.jpg

@lsolesen
Collaborator

lsolesen commented Jan 3, 2011

I am aware of the difference. Have you tried running the test case?

php test/gh-16.php

Uncomment the tearDown() method and read the subject in the tmp file?

@dmitrii-fediuk
Author

In your test case you write the subject string as UTF-8 and then read it back as UTF-8.
Therefore, your test case passes.

But Windows (I think) does not support UTF-8 for PelTag::XP_SUBJECT (nor, I think, for the other PelTag::XP_* tags).
Windows expects it to be ASCII.
And when I open the image produced by your test case in Windows, it shows the subject as "��евед, медвед!"
That is wrong.

In my test case I take into account that Windows expects PelTag::XP_SUBJECT to be ASCII, so I recode the string from UTF-8 to Windows-1251 (the Russian Windows encoding):
$subject = iconv("UTF-8", "windows-1251", "Превед, медвед!");
But in this case your library treats it as Latin-1 and produces a wrong result too: "Ïðåâåä, ìåäâåä!".
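
For anyone following along, here is a minimal sketch of why the Windows-1251 bytes turn into "Ïðåâåä, ìåäâåä!" when they are read as Latin-1. It assumes the mbstring extension is available; the variable names are illustrative and not part of PEL.

```php
<?php
// The subject as a UTF-8 string (this source file is assumed to be UTF-8).
$utf8 = "Превед, медвед!";

// Recode to Windows-1251: each Cyrillic letter becomes one byte (e.g. 0xCF for "П").
$cp1251 = mb_convert_encoding($utf8, "Windows-1251", "UTF-8");

// Now misread the very same bytes as Latin-1 (ISO-8859-1),
// where byte 0xCF means "Ï", 0xF0 means "ð", and so on.
$shown = mb_convert_encoding($cp1251, "UTF-8", "ISO-8859-1");

echo $shown, PHP_EOL; // Ïðåâåä, ìåäâåä!
```

The bytes are never changed; only the code page used to interpret them differs, which is why padding each byte with 0x00 only works for Latin-1 input.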

@dmitrii-fediuk
Author

Finally, I got it working!

The problem is in PelEntryWindowsString::setValue.

In your library, it works only for Latin-1 and looks like this:

function setValue($str) {
    $l = strlen($str);

    $this->components = 2 * ($l + 1);
    $this->str        = $str;
    $this->bytes      = '';
    for ($i = 0; $i < $l; $i++)
        $this->bytes .= $str{$i} . chr(0x00);

    $this->bytes .= chr(0x00) . chr(0x00);
}

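I rewrote it as:
function setValue($str) {
    // $str is expected to be UTF-8. Pass the encoding explicitly so the
    // character count does not depend on mbstring's internal encoding.
    $l = mb_strlen($str, "UTF-8");

    $this->components = 2 * ($l + 1);
    $this->str        = $str;
    // Convert to UCS-2 little-endian and append a two-byte NUL terminator.
    $this->bytes      = mb_convert_encoding($str, "UCS-2LE", "UTF-8")
                      . chr(0x00) . chr(0x00);
}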

My function expects the $str argument to be UTF-8.
It converts the string to UCS-2 little-endian, which is the format that Windows expects.
From what I found online, Windows always expects XP_* data to be little-endian, regardless of the byte order of the other data...
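
As a rough illustration of the conversion described above, the sketch below encodes a UTF-8 string to UCS-2LE with a two-byte terminator and decodes it again. The helper names encodeWindowsString and decodeWindowsString are hypothetical and not part of PEL's API; the mbstring extension is assumed.

```php
<?php
// Hypothetical helper (not part of PEL): UTF-8 in, UCS-2LE bytes plus terminator out.
function encodeWindowsString(string $utf8): string
{
    return mb_convert_encoding($utf8, "UCS-2LE", "UTF-8") . "\x00\x00";
}

// Hypothetical inverse: strip the two-byte terminator and convert back to UTF-8.
function decodeWindowsString(string $bytes): string
{
    return mb_convert_encoding(substr($bytes, 0, -2), "UTF-8", "UCS-2LE");
}

$bytes = encodeWindowsString("Превед, медвед!");

echo strlen($bytes), PHP_EOL;                 // 32 (15 characters * 2 bytes + 2)
echo bin2hex(substr($bytes, 0, 4)), PHP_EOL;  // 1f044004 ("П" = U+041F, "р" = U+0440, low byte first)
echo decodeWindowsString($bytes), PHP_EOL;    // Превед, медвед!
```

The byte length matches the 2 * ($l + 1) component count for this input. Characters outside the Basic Multilingual Plane cannot be represented in UCS-2, so this sketch does not cover them.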

@ghost ghost assigned lsolesen Apr 11, 2011
@lsolesen lsolesen modified the milestones: v0.10.0, v0.9.4 Mar 21, 2016
@lsolesen
Collaborator

lsolesen commented Apr 8, 2016

@mage2pro Do you know whether Windows has started supporting UTF-8? It's been a while, so I think (hope) they may support it now.

@lsolesen
Collaborator

lsolesen commented Apr 8, 2016

Closing this for now. Please reopen if you still have this problem.

@lsolesen lsolesen closed this as completed Apr 8, 2016