Support for reading UTF-8 strings #78

Closed
shioyama opened this issue Apr 15, 2016 · 6 comments

@shioyama

It seems that stupidedi does not support UTF-8 characters when reading an EDI document. I get:

ArgumentError: universe does not contain element "あ"

Digging a bit, I found that Stupidedi::Reader::C_BYTES only includes ASCII characters, so the reader crashes when other characters appear.

I believe this should be fixable simply by changing the definition of C_BYTES, but I'm wondering whether there are other considerations that would prevent this from working.
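
For context, here is a rough sketch of what seems to be happening (illustrative only, not stupidedi's actual code): each character is looked up in a fixed "universe" built from C_BYTES, so anything outside that set raises.

# Hypothetical reconstruction of the failure mode:
universe = Hash[Stupidedi::Reader::C_BYTES.each_char.with_index.to_a]
universe.fetch("あ") { raise ArgumentError, 'universe does not contain element "あ"' }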

shioyama changed the title from "Support for UTF-8 strings" to "Support for reading UTF-8 strings" on Apr 15, 2016
@shioyama (Author)

I've just done this for now:

C_BYTES    = (0..65535).inject("") { |string, c|
  begin; string << c; rescue RangeError; end; string }.freeze # rescue skips the UTF-16 surrogate range

Not really a solution, but enough so I can continue testing.

@kputnam (Owner) commented Apr 16, 2016

Unfortunately, X12 does not support UTF-8. The specification lists the set of allowed characters, which is H_BASIC (a subset of C_BYTES). There is an extended set of characters, H_EXTENDED, which can be used if both trading partners agree.
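
Roughly, the two sets look like this (a sketch from my reading of the spec -- check the standard for the authoritative lists; these are not stupidedi's literal definitions):

# Basic set: uppercase letters, digits, space, and limited punctuation.
basic    = ("A".."Z").to_a + ("0".."9").to_a + [" "] +
           %w[! " & ' ( ) * + , - . / : ; ? =]
# Extended set adds lowercase letters, more punctuation, and select national characters.
extended = basic + ("a".."z").to_a + %w(% ~ @ [ ] _ { } \\ | < > # $)

extended.include?("あ")  # => false -- no amount of extending within X12 gets you UTF-8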

If you're constructing an X12 document to send to someone, you may have to transcode your UTF-8 into the limited character set. I would be surprised if extending C_BYTES works, just because it wasn't intended to. You might try generating a document, writing it out as X12, and then reading it back in (even just using edi-pp) to make sure it looks right.
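
If you do end up transcoding, plain Ruby can at least force the output into ASCII (standard library only, nothing stupidedi-specific; a transliteration pass would make the replacements smarter):

"5番10号".encode("US-ASCII", invalid: :replace, undef: :replace, replace: "?")
# => "5?10?"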

@kputnam (Owner) commented Apr 16, 2016

I'm trying to think through how it would work, but one thing that might cause trouble is that Reader.is_control_character? returns true if a character isn't in H_BASIC or H_EXTENDED. So I think most UTF-8 characters would be treated as control characters.

From what I can work out, the consume_isa method in StreamReader will look for the start of the document, which is always ISA, and ignore control characters. That should be OK unless you have an input where something like IあああSああA occurs before the X12 part of the file starts, since this will be tokenized as ISA and the reader will think that's the beginning of an X12 document. Probably most people have files that are entirely X12, but I had files with an arbitrary header message before the ISA token (the spec doesn't forbid this), so consume_isa is written to skip that.
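
To illustrate (toy code, not the library's): if everything outside the allowed sets is dropped as a control character, that header collapses into a spurious ISA token.

header = "IあああSああA"
header.chars.reject { |c| c.ord > 127 }.join
# => "ISA"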

So, to summarize: StreamReader figures out where the X12 starts in a stream of arbitrary characters. Because it would throw away the new UTF-8 characters (it thinks they are control characters), it might identify a sequence of characters as the start of the X12 when it isn't. That seems unlikely to actually happen unless your X12 files have random junk in between the ISA/IEA envelopes.

Next, TokenReader is what scans the stream of characters for segment identifiers like ST, GE, etc., for specific characters or delimiters, or for entire parts of a segment, like all of its elements. In most of these functions, when reading until a particular substring is matched, any "control characters" are thrown out as if they weren't even present in the input. So I think this will probably cause all of the new UTF-8 characters to be discarded, since they are classified as control characters along with things like line endings; see the toy version below.
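
A toy version of that behavior (not TokenReader itself): read up to the element separator while silently dropping "control" characters.

def read_until(input, separator)
  buffer = ""
  input.each_char do |c|
    break if c == separator
    buffer << c if c.ord <= 127  # everything else is treated as a "control character"
  end
  buffer
end

read_until("住所TOKYO*NEXT", "*")  # => "TOKYO" -- the Japanese text silently vanishes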

If you notice that happening, then you might look at Reader.is_control_character? and change it so that everything it previously classified as a control character (e.g., \n\t\f\v and various single-byte characters) still is, but the characters you've added above 255.chr aren't marked as control characters. That might actually work!
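
A hedged sketch of that change (I'm assuming H_BASIC and H_EXTENDED support member?, which may not match the actual internals -- adapt to the real method):

module Stupidedi
  module Reader
    def self.is_control_character?(character)
      # Keep the old classification for the single-byte range, but don't
      # treat the newly added multi-byte characters as control characters.
      character.ord <= 255 &&
        !H_BASIC.member?(character) && !H_EXTENDED.member?(character)
    end
  end
end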

@shioyama (Author)

@kputnam Thanks very much for the detailed reply! If X12 does not support UTF-8, then I think that's enough to convince me not to use it. Actually, our partner asked us if we could send the data in non-UTF-8 characters, so I think we'll have to do that.

Now I just have to think of a way to map UTF-8 (Japanese) addresses to their corresponding English addresses... which is a slightly different problem.

@kputnam (Owner) commented Apr 18, 2016

Good luck!

@shioyama (Author)

For anybody who encounters this problem, geocoder is your friend:

result = Geocoder.search("東京都武蔵野市吉祥寺本町二丁目5番10号 いちご吉祥寺ビル").first
result.address
# => "2 Chome-5-10 Kichijōji Honchō, Musashino-shi, Tōkyō-to 180-0004, Japan"
