Figure out if Unicode::UTF8 can be replaced #11

jhthorsen · 2014-10-16T17:59:25Z

Maybe we can do something funky with this code? https://metacpan.org/source/RIBASUSHI/Devel-PeekPoke-0.03/lib/Devel/PeekPoke/PP.pm#L100

nicomen · 2016-11-11T09:14:06Z

I think this might illustrate a way to do it with a core module:

use Data::Dumper;
use Encode;
use feature 'say';
use warnings;
use strict;

my $latin_bytes = "\346\370\345";
my $utf8_bytes = "æøå";

my $flagged_utf8_str = decode_utf8("æøå");

say "Decoding two raw strings with byte values corresponding to latin1 and utf8: ";
say Dumper({ latin1_bytes => $latin_bytes, utf8_bytes => $utf8_bytes, flagged_utf8_str => $flagged_utf8_str });

say encode_utf8(
    decode( 'utf-8', "$latin_bytes $utf8_bytes", sub {
        decode( 'latin1', chr($_[0]), sub {
            chr($_[0]);
        })
    })
);

__END__

Decoding two raw strings with byte values corresponding to latin1 and utf8: 
$VAR1 = {
          'flagged_utf8_str' => "\x{e6}\x{f8}\x{e5}",
          'latin1_bytes' => '���',
          'utf8_bytes' => 'æøå'
        };

æøå æøå

jhthorsen · 2016-12-11T10:52:46Z

What do you think @marcusramberg ?

marcusramberg · 2016-12-12T21:51:40Z

@jhthorsen I'm not sure. Would appreciate @chansen's input here as he made the original recommendation.

pink-mist · 2017-01-04T11:23:48Z

Is there a test for this in the test suite? I'm not quite certain I understand what the issue actually is, and if I could see code that would verify things, it'd probably be enlightening.

jhthorsen · 2017-01-04T13:43:44Z

I can't seem to find any... The test should be something like this:

Receive a latin1 message (like latin1 "hey æøå!" or something) from the IRC server.
Make sure it is passed on as a valid utf8 "hey æøå!" string in the callback.
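
A minimal sketch of those two steps as a test, using only core modules. The decode_message helper is hypothetical: it is a stand-in that tries strict UTF-8 first and falls back to latin1 for the whole message. It is not the decoder this project uses, and it does not cover a single message that mixes encodings.

```perl
use strict;
use warnings;
use Encode ();
use Test::More;

# Hypothetical stand-in decoder: strict UTF-8 first, latin1 as fallback.
sub decode_message {
  my ($octets) = @_;
  my $chars = eval {
    Encode::decode('UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
  };
  return defined $chars ? $chars : Encode::decode('ISO-8859-1', $octets);
}

my $expected = "hey \x{E6}\x{F8}\x{E5}!";    # "hey æøå!" as characters

# The same message as a latin1 client and as a UTF-8 client would send it.
my $latin1_octets = "hey \xE6\xF8\xE5!";
my $utf8_octets   = "hey \xC3\xA6\xC3\xB8\xC3\xA5!";

is decode_message($latin1_octets), $expected, 'latin1 input is decoded';
is decode_message($utf8_octets),   $expected, 'UTF-8 input is decoded';

done_testing;
```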

The reason for using this module is that if a message is received from an IRC client using latin1, we still want its characters decoded correctly on our side, just as if it had been sent as utf8.

Note: I might be very wrong here. Maybe we want the scalar to contain bytes, without any encoding? I can't really remember.

Grinnz · 2017-01-10T04:28:48Z

I'm not sure what it's used for either, but for the purpose of guessing+decoding messages received from the IRC server, IRC::Utils::decode_irc will do the job.

jhthorsen · 2017-01-10T15:31:25Z

That's interesting @Grinnz! Thanks.

chansen · 2017-01-14T11:51:30Z

@jhthorsen, the code in Devel/PeekPoke/PP.pm#L100 detects whether or not the given string contains code points outside the octet code space.
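
For illustration, the same kind of check can be written in plain Perl with a character class. This is only a sketch of the idea, not the code from Devel::PeekPoke, and the helper name has_wide_chars is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: true if any code point in $str is above 0xFF,
# i.e. the string cannot be a plain sequence of octets.
sub has_wide_chars {
  my ($str) = @_;
  return $str =~ /[^\x00-\xFF]/ ? 1 : 0;
}

# Latin1-range characters only, so this reports "octets only".
print has_wide_chars("hey \xE6\xF8\xE5!") ? "wide\n" : "octets only\n";

# U+263A is outside the octet range, so this reports "wide".
print has_wide_chars("smile \x{263A}") ? "wide\n" : "octets only\n";
```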

@nicomen, close but no cigar! You should use Encode::encode("UTF-8", chr $_[0]) in the callback, not Encode::decode("latin1", chr $_[0]).

@Grinnz IRC::Utils::decode_irc() does not handle messages that contain mixed encodings. It's also very slow because it invokes Encode::Guess::guess().

Unicode::UTF8

The callback passed to Unicode::UTF8::decode_utf8() decodes any ill-formed UTF-8 sequences as Latin1:

$ perl -MUnicode::UTF8=decode_utf8 -e '
  printf "<%s>\n", join " ", 
    map { sprintf "U+%.4X", ord } split //, decode_utf8("☺\xE6\xF8\xE5✔!", sub { $_[0] })
'
<U+263A U+00E6 U+00F8 U+00E5 U+2714 U+0021>

The reason I recommended Unicode::UTF8 to @marcusramberg was that it works correctly on all supported Perl versions (>= 5.8.1), is fast, and is light on memory usage.

Encode

It's possible to use Encode::decode()'s fallback mechanism to achieve the same:

$ perl -MEncode -e '
  printf "<%s>\n", join " ", 
    map { sprintf "U+%.4X", ord } split //, 
      Encode::decode("UTF-8", "☺\xE6\xF8\xE5✔!", sub { Encode::encode("UTF-8", chr $_[0]) })
'
<U+263A U+00E6 U+00F8 U+00E5 U+2714 U+0021>

Portability + Efficiency

This code uses an instance of Encode::utf8 (subclass of Encode::Encoding) instead of the factory function Encode::decode to gain better performance.

my $Encoding;
BEGIN {
  my $has_unicode_utf8 = !!eval { require Unicode::UTF8; 1 };

  unless ($has_unicode_utf8) {
    require Encode;
    $Encoding = Encode::find_encoding('UTF-8') 
      or die q/Could not find UTF-8 encoding in Encode/;
  }
  *HAS_UNICODE_UTF8 = sub () { $has_unicode_utf8 };
}

sub decode_irc {
  @_ == 1 or die q/Usage: decode_irc($octets)/;
  if (HAS_UNICODE_UTF8) {
    no warnings 'utf8';
    return Unicode::UTF8::decode_utf8($_[0], sub { $_[0] });
  }
  else {
    # The stringification of $_[0] is intentional!
    # Older versions of Encode have had bugs with GETMAGIC and issues 
    # with references and overloaded objects causing segfaults.
    return $Encoding->decode("$_[0]", sub { $Encoding->encode(chr $_[0]) });
  }
}
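
As a small check of the point about the encoding object, the sketch below assumes only core Encode. It looks the object up once with Encode::find_encoding and compares its decode method with the Encode::decode factory function on well-formed input. It makes no claim about how much faster the object call is.

```perl
use strict;
use warnings;
use Encode ();

# Look the encoding object up once instead of on every call.
my $encoding = Encode::find_encoding('UTF-8')
  or die q/Could not find UTF-8 encoding in Encode/;

my $octets = "\xE2\x98\xBA!";    # UTF-8 octets for U+263A followed by "!"

# Stringified copies are passed so that $octets itself is never touched.
my $via_object  = $encoding->decode("$octets");
my $via_factory = Encode::decode('UTF-8', "$octets");

print $via_object eq $via_factory ? "same result\n" : "different result\n";
printf "U+%.4X\n", ord $via_object;
```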

--
chansen

jhthorsen added the enhancement label Oct 16, 2014

jhthorsen assigned marcusramberg Oct 16, 2014

marcusramberg closed this as completed Mar 21, 2018

jhthorsen mentioned this issue Oct 27, 2019

Get rid of Mojo::IRC and Unicode::UTF8 as dependencies convos-chat/convos#392

Closed

jhthorsen mentioned this issue Dec 23, 2019

Get rid of Unicode::UTF8 as dependency convos-chat/convos#425

Closed
