Problem with toAscii() function in uploaded file name #614

Closed
Chates opened this Issue Apr 4, 2012 · 8 comments


Chates commented Apr 4, 2012

If you upload a file named, for example, "rozčarovaný ceník.pdf" via \Nette\Application\UI\Form, there is a problem with the getSanitizedName() method.

If you call:

$form['uploadedFile']->getValue()->getSanitizedName();

Only "rozc" is returned. It looks like an encoding problem in the \Nette\Utils\Strings::webalize() function, in the toAscii() function to be precise.

I have done some experiments:
mb_detect_encoding($form['uploadedFile']->getValue()->name, 'UTF-8', true); // Is valid UTF-8
\Nette\Utils\Strings::webalize('rozčarovaný ceník.txt'); // works fine
\Nette\Utils\Strings::webalize($form['uploadedFile']->getValue()->name); // returns only "rozc"
\Nette\Utils\Strings::webalize(iconv('UTF-8', 'UTF-8', $values['attachment_1']->name)); // returns only "rozc"

Owner

dg commented Apr 5, 2012

If \Nette\Utils\Strings::webalize('rozčarovaný ceník.txt'); works fine, then $form['uploadedFile']->getValue()->name is not the same string.
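
Two strings can render identically and still differ byte for byte. The following is a minimal sketch, not code from the thread: it assumes the literal in the source file is stored precomposed and the uploaded name arrives decomposed, and it spells both out as explicit UTF-8 bytes so the difference is visible. The variable names are illustrative only.

```php
<?php
// Precomposed spelling: "č", "ý", "í" are each a single code point
// (UTF-8 bytes C4 8D, C3 BD, C3 AD).
$composed = "roz\xC4\x8Darovan\xC3\xBD cen\xC3\xADk.txt";

// Decomposed spelling: a plain base letter followed by a combining mark
// (U+030C combining caron = CC 8C, U+0301 combining acute = CC 81).
$decomposed = "rozc\xCC\x8Carovany\xCC\x81 ceni\xCC\x81k.txt";

var_dump($composed === $decomposed); // bool(false)
var_dump(strlen($composed));         // int(24)
var_dump(strlen($decomposed));       // int(27)

// Hex dump of the first six bytes: "rozc" followed by the combining caron.
echo bin2hex(substr($decomposed, 0, 6)), "\n"; // 726f7a63cc8c
```

A byte length of 27 for "rozčarovaný ceník.txt" is what the decomposed spelling produces, so comparing bin2hex() output of the two values is a quick way to check which form a given string uses.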

Member

milo commented Apr 6, 2012

@WebToad Are you sure that the file name is passed correctly to PHP itself? Try dump($_FILES).

Chates commented Apr 6, 2012

@dg It obviously isn't the same string. But why? It looks the same. The dumped value is correct, and so is the encoding. Maybe that's the bug. A few more examples:

dump($form['attachment_1']->getValue()->name); // "rozčarovaný ceník.txt" (27)
\Nette\Utils\Strings::webalize($form['uploadedFile']->getValue()->name); // returns only "rozc"

@milo dump($_FILES);

array(1) {
   attachment_1 => array(5) {
      name => "rozčarovaný ceník.txt" (27)
      type => "text/plain" (10)
      tmp_name => "/Applications/MAMP/tmp/php/phpfnhYOv" (36)
      error => 0
      size => 888
   }
}

Member

milo commented Apr 6, 2012

What does dump($form['uploadedFile']->getValue()->name) show?

Chates commented Apr 6, 2012

@milo It shows exactly "rozčarovaný ceník.txt" (27)

Maybe it is a system encoding thing. I am on OS X 10.7.3; maybe the encoding of file names differs or something.

If you try it on the server, it behaves differently. Screenshot: http://cl.ly/FdRc/o It's as if every diacritic is treated as a separate character.

You can try the behaviour here: http://hlasovani.webtoad.cz/test/

Member

milo commented Apr 6, 2012

It really seems that the diacritical marks are separate characters, here in the GitHub comments too. When I copy "rozčarovaný ceník.txt" (27) into PSPad, it is obvious. Oh, UTF-8 character magic :)

Maybe extending http://api.nette.org/2.0/source-Utils.Strings.php.html#178 to more characters would "fix" the problem.

Contributor

voda commented Apr 6, 2012

The 'č' char is actually the char 'c' plus a separate combining character for the caron. Have a look at http://www.fileformat.info/info/unicode/char/30c/index.htm and http://en.wikipedia.org/wiki/Combining_character
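
To see which combining characters a string actually contains, one option is to match the Unicode "nonspacing mark" category with PCRE. This is only a sketch: listCombiningMarks() is a hypothetical helper, not part of Nette, and it assumes PHP 7.2+ with the mbstring extension (for mb_ord()) and PCRE built with Unicode support.

```php
<?php
// Hypothetical helper: return every combining (nonspacing) mark found in a
// UTF-8 string, formatted as U+XXXX code points, in order of appearance.
function listCombiningMarks(string $s): array
{
    // \p{Mn} matches the Unicode general category "Mark, nonspacing";
    // the /u modifier makes PCRE treat the subject as UTF-8.
    preg_match_all('/\p{Mn}/u', $s, $matches);

    return array_map(function (string $mark): string {
        return sprintf('U+%04X', mb_ord($mark, 'UTF-8'));
    }, $matches[0]);
}

// Decomposed file name written as explicit UTF-8 bytes:
// "c" + U+030C, "y" + U+0301, "i" + U+0301.
$name = "rozc\xCC\x8Carovany\xCC\x81 ceni\xCC\x81k.txt";

print_r(listCombiningMarks($name)); // U+030C, U+0301, U+0301
```

An empty result would suggest the string uses precomposed letters, in which case the combining characters are not the cause.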

Chates commented Apr 6, 2012

So what about modifying the toAscii() function (http://api.nette.org/2.0/source-Utils.Strings.php.html#164) to strip these combining characters instead of replacing them with "-"?
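
A rough sketch of that idea, written as a standalone function rather than as a patch to Nette: stripCombiningMarks() is a hypothetical name, and this is not necessarily how the eventual commit implements it. It assumes the marks of interest fall in the Combining Diacritical Marks block (U+0300 to U+036F).

```php
<?php
// Hypothetical helper: drop characters from the Combining Diacritical Marks
// block so that only the base letters of a decomposed string remain.
function stripCombiningMarks(string $s): string
{
    // The /u modifier is required for \x{...} code point escapes in UTF-8.
    return preg_replace('/[\x{0300}-\x{036F}]/u', '', $s);
}

// Decomposed file name: "c" + U+030C, "y" + U+0301, "i" + U+0301.
$decomposed = "rozc\xCC\x8Carovany\xCC\x81 ceni\xCC\x81k.txt";

echo stripCombiningMarks($decomposed), "\n"; // rozcarovany cenik.txt
```

Precomposed letters such as a single-code-point "č" pass through this function unchanged, so it would complement the existing transliteration rather than replace it. Where the intl extension is available, normalizing the input with Normalizer::normalize($s, Normalizer::FORM_C) first is another option.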

dg closed this in 07d1bf4 Apr 6, 2012

juzna pushed a commit to juzna/nette that referenced this issue May 15, 2012

dg: Strings::toAscii() removes combining character [Closes #614] ab8ba04