Figure out if Unicode::UTF8 can be replaced #11

Closed
jhthorsen opened this issue Oct 16, 2014 · 8 comments

@jhthorsen
Owner

Maybe we can do something funky with this code? https://metacpan.org/source/RIBASUSHI/Devel-PeekPoke-0.03/lib/Devel/PeekPoke/PP.pm#L100

@nicomen

nicomen commented Nov 11, 2016

I think this illustrates a way to do it with a core module:

use Data::Dumper;
use Encode;
use feature 'say';
use warnings;
use strict;

my $latin_bytes = "\346\370\345";
my $utf8_bytes = "æøå";

my $flagged_utf8_str = decode_utf8("æøå");

say "Decoding two raw strings with byte values corresponding to latin1 and utf8: ";
say Dumper({ latin1_bytes => $latin_bytes, utf8_bytes => $utf8_bytes, flagged_utf8_str => $flagged_utf8_str });

say encode_utf8(
    decode( 'utf-8', "$latin_bytes $utf8_bytes", sub {
        decode( 'latin1', chr($_[0]), sub {
            chr($_[0]);
        })
    })
);

__END__

Decoding two raw strings with byte values corresponding to latin1 and utf8: 
$VAR1 = {
          'flagged_utf8_str' => "\x{e6}\x{f8}\x{e5}",
          'latin1_bytes' => '���',
          'utf8_bytes' => 'æøå'
        };

æøå æøå

@jhthorsen
Owner Author

What do you think @marcusramberg ?

@marcusramberg
Collaborator

@jhthorsen I'm not sure. I would appreciate @chansen's input here, as he made the original recommendation.

@pink-mist

Is there a test for this in the test suite? I'm not quite certain I understand what the issue actually is, and if I could see code that would verify things, it'd probably be enlightening.

@jhthorsen
Owner Author

I can't seem to find any... The test should be something like this:

  1. Receive a latin1 message (like latin1 "hey æøå!" or something) from the IRC server.
  2. Make sure it is passed on as a valid utf8 "hey æøå!" string in the callback.

The reason for using this module is guessing: if a message is received from an IRC client using latin1, we still want the characters decoded as utf8 on our side.

Note: I might be very wrong here. Maybe we want the scalar to contain bytes, without any encoding? I can't really remember.
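
A minimal sketch of such a test, assuming the callback should receive decoded characters rather than raw bytes. `decode_message` is a hypothetical stand-in for whatever the client uses to decode incoming lines. Here it relies on the core Encode module's fallback callback, and no real IRC server is involved:

```perl
use strict;
use warnings;
use Encode ();
use Test::More;

my $utf8 = Encode::find_encoding('UTF-8') or die 'UTF-8 encoding not found';

# Hypothetical stand-in for the client's decoding step (not part of any module).
sub decode_message {
  my ($octets) = @_;

  # Bytes that are not well-formed UTF-8 reach the callback as ordinals.
  # Each one is treated as a latin1 code point and re-encoded as UTF-8.
  return $utf8->decode("$octets", sub { $utf8->encode(chr $_[0]) });
}

my $expected = "hey \x{e6}\x{f8}\x{e5}!";    # "hey æøå!" as characters

is decode_message("hey \xE6\xF8\xE5!"), $expected, 'latin1 octets are decoded';
is decode_message("hey \xC3\xA6\xC3\xB8\xC3\xA5!"), $expected, 'utf8 octets are decoded';

done_testing;
```

A real test would feed the same octets through the client's message handling instead of calling the helper directly.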

@Grinnz
Collaborator

Grinnz commented Jan 10, 2017

I'm not sure what it's used for either, but for the purpose of guessing+decoding messages received from the IRC server, IRC::Utils::decode_irc will do the job.

@jhthorsen
Owner Author

That's interesting @Grinnz! Thanks.

@chansen

chansen commented Jan 14, 2017

@jhthorsen, the code in Devel/PeekPoke/PP.pm#L100 detects whether or not the given string contains code points outside the octet code space.
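
To illustrate what "outside the octet code space" means, here is a rough sketch of one way to make that check in plain Perl. It is not the Devel::PeekPoke code, and `has_wide_characters` is a hypothetical name:

```perl
use strict;
use warnings;

# Hypothetical helper, not taken from Devel::PeekPoke.
# True when the string holds at least one code point above 0xFF,
# meaning it cannot be represented as a sequence of octets.
sub has_wide_characters {
  my ($string) = @_;
  return $string =~ /[^\x00-\xFF]/ ? 1 : 0;
}

print has_wide_characters("hey \xE6\xF8\xE5!") ? "wide\n" : "octets only\n";
print has_wide_characters("hey \x{263A}!")     ? "wide\n" : "octets only\n";
```

The first call reports "octets only" and the second reports "wide", since U+263A does not fit in a single octet.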

@nicomen, close but no cigar! You should use Encode::encode("UTF-8", chr $_[0]) in the callback, not Encode::decode("latin1", chr $_[0]).

@Grinnz IRC::Utils::decode_irc() does not handle messages that contain mixed encodings. It's also very slow because it invokes Encode::Guess::guess().

Unicode::UTF8

The callback passed to Unicode::UTF8::decode_utf8() decodes any ill-formed UTF-8 sequences as Latin1:

$ perl -MUnicode::UTF8=decode_utf8 -e '
  printf "<%s>\n", join " ",
    map { sprintf "U+%.4X", ord } split //, decode_utf8("☺\xE6\xF8\xE5✔!", sub { $_[0] })
'
<U+263A U+00E6 U+00F8 U+00E5 U+2714 U+0021>

The reason I recommended Unicode::UTF8 to @marcusramberg was that it works correctly on all supported Perl versions (>= 5.8.1), is fast, and is lightweight on memory usage.

Encode

It's possible to use Encode::decode()'s fallback mechanism to achieve the same:

$ perl -MEncode -e '
  printf "<%s>\n", join " ",
    map { sprintf "U+%.4X", ord } split //,
      Encode::decode("UTF-8", "☺\xE6\xF8\xE5✔!", sub { Encode::encode("UTF-8", chr $_[0]) })
'
<U+263A U+00E6 U+00F8 U+00E5 U+2714 U+0021>

Portability + Efficiency

This code uses an instance of Encode::utf8 (subclass of Encode::Encoding) instead of the factory function Encode::decode to gain better performance.

my $Encoding;
BEGIN {
  my $has_unicode_utf8 = !!eval { require Unicode::UTF8; 1 };

  unless ($has_unicode_utf8) {
    require Encode;
    $Encoding = Encode::find_encoding('UTF-8') 
      or die q/Could not find UTF-8 encoding in Encode/;
  }
  *HAS_UNICODE_UTF8 = sub () { $has_unicode_utf8 };
}

sub decode_irc {
  @_ == 1 or die q/Usage: decode_irc($octets)/;
  if (HAS_UNICODE_UTF8) {
    no warnings 'utf8';
    return Unicode::UTF8::decode_utf8($_[0], sub { $_[0] });
  }
  else {
    # The stringification of $_[0] is intentional!
    # Older versions of Encode have had bugs with GETMAGIC and issues
    # with references and overloaded objects causing segfaults.
    return $Encoding->decode("$_[0]", sub { $Encoding->encode(chr $_[0]) });
  }
}

--
chansen
