offending filename should be logged on Malformed UTF-8 error #18

Open
ericcj opened this Issue Jan 27, 2011 · 17 comments

Comments

Projects
None yet

ericcj commented Jan 27, 2011

also, this bug was closed for a bogus reason and should be reopened: https://github.com/protocool/AckMate/issues/closed#issue/13

+1 for re-opening issue #13

Perl has different ways of treating utf8 stuff:
http://perldoc.perl.org/Encode.html#UTF-8-vs.-utf8-vs.-UTF8

The ackmate_ack that is shipped with ackmate uses Encode::encode_utf8($buffer) instead of Encode::encode('UTF-8',$buffer). The previous malformed erros I was getting disappeared when I used the latter version of encode, although I don't know why (yet). I thought this could help others investigate this issue.

Owner

protocool commented Apr 21, 2011

13 was closed because people were getting sick of getting emails about it.

As for using the latter version, I can't recall where I saw it, but I recall seeing a statement that encode_utf8 was the correct call. It's also odd that encode('UTF-8', $buffer) would result in fewer errors if it's more strict...

However, it's been a very long time since I burned any hours on this issue.

If you can come up with a few test files that prove it works better then I will gratefully apply any pull request.

Trev

Hi Trev, I'll see if I can come up with something and possibly submit a pull request to your ack fork.

Best regards.

tedeh commented May 16, 2011

+1 for this issue, I guess demanding a fix for the actual error might be too much but at least outputting the erroneous file name should be easy for anyone versed in Perl.

jjb commented May 28, 2011

Alright folks, I've put together a solution for reporting offending files. Tell me what you think: https://github.com/protocool/AckMate/wiki/Unicode-UTF-8-error-message

Would be really nice if this could be fixed. As far as I'm aware, all my files are encoded as UTF-8, but given that the kinds of directories I need to search over contain upward of 1,000 files, it's hard to track down the offending article. I've downgraded for now, but you know how we developers like to be on the bleeding edge :)

Even if someone could put together a script that will report all the "broken" files (or suggest one that already exists) and of course ideally fix them (wishful thinking I know!) that would be brilliant too.

artzte commented Aug 9, 2011

I added a similar patch in another location of ackmate_ack: https://gist.github.com/1135457

airways commented Dec 11, 2011

Addd a note to #13 just because it's the first thing you find in google, but I thought I should comment here:

Adding these lines to the end of needs_line_scan fixed this issue completely for me:

use Encode;
$buffer =  encode("UTF-8", $buffer);
return $buffer =~ /$regex/m;

Granted, there may be better places to do this encoding, but I just needed a quick fix. Hopefully this helps someone else out if they find the tickets.

yyyc514 commented Dec 15, 2011

At the end of my needs_line_scan method (around line 2597) I changed the end of the method to:

    my $regex = $opt->{regex};
    if ( ! utf8::valid($buffer) ) {
            $buffer = Encode::encode_utf8($buffer);
    }
        return $buffer =~ /$regex/m;

And you have to add 'use utf8;' to the beginning of the script... does this seem reasonable?

Owner

protocool commented Dec 15, 2011

Guys,

thanks for taking the lead on sorting this stuff out.

I've seen someone mention that setting export LC_CTYPE=en_US.UTF-8 in the environment helps.

I'd like to experiment with the various options before merging into AM proper, and at the same time updating AM's version of ack to the latest.

Can any of you point me to files which cause the UTF explosions so I can test against it?

Thanks, Trev

airways commented Dec 15, 2011

I attempted to add logging as recommended by @jjb above, but the log items did not show up in the output. I can play with it a bit later unless @yyyc514 gets to it first. :)

@yyyc514 solution works for me

@protocool I was getting stung on language files in ext.js

trisweb commented Feb 15, 2012

No errors should be logged because AFAIK they aren't errors, they're files not in UTF-8 format being interpreted as if they are.

Here's a patch for the ackmate_ack to solve the issue - read the file with :raw format first, then convert to UTF-8 using the Encode lib. Does not complain. It also uses @protocool's (or @airways ?) encode strategy for the other possible error section, with which I had no issues.

I have not yet submitted this to the ack project because I am unsure if it's universally applicable, but it should work for our purposes.

To apply, first paste into an ackmate_ack.patch somewhere, then:

cd ~/Library/Application Support/TextMate/Plugins/AckMate.tmplugin/Contents/Resources
cp [path/to/ackmate_ack.patch] .
patch ackmate_ack < ackmate_ack.patch

--- ackmate_ack 2012-02-15 13:46:32.000000000 -0500
+++ ackmate_ack 2012-02-15 13:59:25.000000000 -0500
@@ -1553,12 +1553,15 @@

     # If there's no extension, or we don't recognize it, check the shebang line
     my $fh;
-    if ( !open( $fh, '<', $filename ) ) {
+    if ( !open( $fh, '<:raw', $filename ) ) {
         App::Ack::warn( "$filename: $!" );
         return;
     }
     my $header = <$fh>;
     close $fh;
+   use Encode;
+   $header = encode("UTF-8", $header);

     if ( $header =~ /^#!/ ) {
         return ($1,TEXT)       if $header =~ /\b(ruby|p(?:erl|hp|ython))\b/;
@@ -2607,10 +2610,14 @@
         App::Ack::warn( "$self->{filename}: $!" );
         return 1;
     }
+
     return 0 unless $rc && ( $rc == $size || length(Encode::encode_utf8($buffer)) == $size );

     my $regex = $opt->{regex};
-    return $buffer =~ /$regex/m;
+
+   use Encode;
+   $buffer = encode("UTF-8", $buffer);
+   return $buffer =~ /$regex/m;
 }

airways commented Feb 15, 2012

This is basically the same as my patch for the malformed UTF-8 errors, which I posted on #13. I feel like this ticket should just be closed and the patch should be applied. No one has pointed out any issues with it, so I'm not sure why this hasn't been resolved already.

I agree that the request in this ticket seems pointless, we don't need logging for the error - we just need it to go away since ack and ackmate shouldn't tell us what format our files are supposed to be in.

trisweb commented Feb 15, 2012

Correct. I've forked the project and attempted to apply, but the problem, you see, is that ackmate_ack is actually generated from http://github.com/petdance/ack - see: https://github.com/trisweb/AckMate/blob/master/bundle_extras/ackmate_ack.autogenerated

So it's not exactly simple.

@protocool I wget this file http://www.fiscontex.com.br/legislacao/ICMS/aliquotainternaicms.htm and the problem happens with it. This file seens to be written on Windows FrontPage and have weird windows encoding chars. Hope it helps.

normann commented Jul 31, 2012

@trisweb 's patch totally work in my case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment