Support for reading UTF-8 strings #78

Closed
shioyama opened this issue Apr 15, 2016 · 6 comments

@shioyama

It seems that stupidedi does not support UTF-8 characters when reading an EDI document. I get:

ArgumentError: universe does not contain element "あ"

Digging a bit, I found that Stupidedi::Reader::C_BYTES only includes ASCII characters, so the reader crashes when other characters appear.

I believe this should be fixable simply by changing the definition of C_BYTES, but I'm wondering whether there are other considerations that would prevent this from working.
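
For context, here is a rough sketch of what seems to be happening (illustrative only, not stupidedi's actual code): each character is looked up in a fixed "universe" built from C_BYTES, so anything outside that set raises.

# Hypothetical reconstruction of the failure mode:
universe = Hash[Stupidedi::Reader::C_BYTES.each_char.with_index.to_a]
universe.fetch("あ") { raise ArgumentError, 'universe does not contain element "あ"' }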

shioyama changed the title from "Support for UTF-8 strings" to "Support for reading UTF-8 strings" on Apr 15, 2016
@shioyama (Author)

I've just done this for now:

C_BYTES    = (0..65535).inject("") { |string, c|
  begin; string << c; rescue RangeError; end; string }.freeze # rescue skips the UTF-16 surrogate range

Not really a solution, but enough so I can continue testing.

@kputnam (Owner) commented Apr 16, 2016

Unfortunately, X12 does not support UTF-8. The specification lists the set of allowed characters, which is H_BASIC (a subset of C_BYTES). There is an extended set of characters, H_EXTENDED, which can be used if both trading partners agree.
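
Roughly, the two sets look like this (a sketch from my reading of the spec -- check the standard for the authoritative lists; these are not stupidedi's literal definitions):

# Basic set: uppercase letters, digits, space, and limited punctuation.
basic    = ("A".."Z").to_a + ("0".."9").to_a + [" "] +
           %w[! " & ' ( ) * + , - . / : ; ? =]
# Extended set adds lowercase letters, more punctuation, and select national characters.
extended = basic + ("a".."z").to_a + %w(% ~ @ [ ] _ { } \\ | < > # $)

extended.include?("あ")  # => false -- no amount of extending within X12 gets you UTF-8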

If you're constructing an X12 document to send to someone, you may have to transcode your UTF-8 into the limited character set. I would be surprised if extending C_BYTES works, just because it wasn't intended to. You might try generating a document, writing it out as X12, and then reading it back in (even just using edi-pp) to make sure it looks right.
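
If you do end up transcoding, plain Ruby can at least force the output into ASCII (standard library only, nothing stupidedi-specific; a transliteration pass would make the replacements smarter):

"5番10号".encode("US-ASCII", invalid: :replace, undef: :replace, replace: "?")
# => "5?10?"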

@kputnam (Owner) commented Apr 16, 2016

I'm trying to think through how it would work, but one thing that might cause trouble is that Reader.is_control_character? returns true if a character isn't in H_BASIC or H_EXTENDED. So I think most UTF-8 characters would be treated as control characters.

From what I can work out, the consume_isa method in StreamReader will look for the start of the document, which is always ISA, and ignore control characters. That should be OK unless you have an input where something like IあああSああA occurs before the X12 part of the file starts, since this will be tokenized as ISA and the reader will think that's the beginning of an X12 document. Probably most people have files that are entirely X12, but I had files with an arbitrary header message before the ISA token (the spec doesn't forbid this), so consume_isa is written to skip that.
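
To illustrate (toy code, not the library's): if everything outside the allowed sets is dropped as a control character, that header collapses into a spurious ISA token.

header = "IあああSああA"
header.chars.reject { |c| c.ord > 127 }.join
# => "ISA"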

So, to summarize: StreamReader figures out where the X12 starts in a stream of arbitrary characters. Because it would throw away the new UTF-8 characters (it thinks they are control characters), it might identify a sequence of characters as the start of the X12 when it isn't. That seems unlikely to actually happen unless your X12 files have random junk in between the ISA/IEA envelopes.

Next, TokenReader is what scans the stream of characters for segment identifiers like ST, GE, etc., for specific characters or delimiters, or for entire parts of a segment, like all of its elements. In most of these functions, when reading until a particular substring is matched, any "control characters" are thrown out as if they weren't even present in the input. So I think this will probably cause all of the new UTF-8 characters to be discarded, since they are classified as control characters along with things like line endings; see the toy version below.
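
A toy version of that behavior (not TokenReader itself): read up to the element separator while silently dropping "control" characters.

def read_until(input, separator)
  buffer = ""
  input.each_char do |c|
    break if c == separator
    buffer << c if c.ord <= 127  # everything else is treated as a "control character"
  end
  buffer
end

read_until("住所TOKYO*NEXT", "*")  # => "TOKYO" -- the Japanese text silently vanishes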

If you notice that happening, then you might look at Reader.is_control_character? and change it so that everything it previously classified as a control character (e.g., \n\t\f\v and various single-byte characters) still is, but the characters you've added above 255.chr aren't marked as control characters. That might actually work!
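
A hedged sketch of that change (I'm assuming H_BASIC and H_EXTENDED support member?, which may not match the actual internals -- adapt to the real method):

module Stupidedi
  module Reader
    def self.is_control_character?(character)
      # Keep the old classification for the single-byte range, but don't
      # treat the newly added multi-byte characters as control characters.
      character.ord <= 255 &&
        !H_BASIC.member?(character) && !H_EXTENDED.member?(character)
    end
  end
end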

@shioyama (Author)

@kputnam Thanks very much for the detailed reply! If X12 does not support UTF-8, then I think that's enough to convince me not to use it. Actually, our partner asked us if we could send the data in non-UTF-8 characters, so I think we'll have to do that.

Now I just have to think of a way to map UTF-8 (Japanese) addresses to their corresponding English addresses... which is a slightly different problem.

@kputnam (Owner) commented Apr 18, 2016

Good luck!

@shioyama (Author)

For anybody who encounters this problem, geocoder is your friend:

result = Geocoder.search("東京都武蔵野市吉祥寺本町二丁目5番10号 いちご吉祥寺ビル").first
result.address
# => "2 Chome-5-10 Kichijōji Honchō, Musashino-shi, Tōkyō-to 180-0004, Japan"
