Handling ISO-8859-1 characters #157

Closed
ossiangrr opened this Issue Mar 19, 2013 · 6 comments


@ossiangrr

I'm not sure whether this is a problem with IRC in general, with JavaScript, or with Node.

I have been writing a simple bot that works as a search engine for a card game (VTES). Some cards have names with foreign characters, and I'd like them to be searchable by literal character.
I am listening with addListener("message#",callback) and addListener("pm",callback).

If someone sends a UTF-8 character -- say, ö or ç -- it works great!

But if their encoding is ISO-8859-1, my bot sees every one of the "special" characters as the same replacement character: �
They don't even arrive as distinct byte sequences that I could translate by brute force.

How can I get my bot to see these as different characters?
Or is this just a limitation of JavaScript/Node that I'll have to suck up and deal with?

(I do have an option for users to search by "ascii-ized" versions of the name, so there's a workaround, but it would be nice if I could handle more literally-typed or copy-pasted strings)

Here is a real-world excerpt.
In the first query of each pair, the "foreign" character is sent as UTF-8. In the second, it is sent as ISO-8859-1.

-> gramle whois Zöe
Gramle Zöe. Clan: Malkavian Group: 2 Capacity: 3 cel obf AUS
Gramle Camarilla: Zöe does not get the usual +1 stealth when hunting.

-> gramle whois Zöe
Gramle No results found for 'whois Z�e'.


-> gramle whois Monçada
Gramle Ambrosio Luis Monçada, Plenipotentiary. Clan: Lasombra Group: 2 Capacity: 10 aus for DOM OBT POT PRE
Gramle Sabbat cardinal: Monçada cannot block. Other Methuselahs' actions targeting Monçada cost an additional pool. If Monçada is ready during your discard phase, he can untap another ready Lasombra.

-> gramle whois Monçada
Gramle No results found for 'whois Mon�ada'.

@damianb
damianb commented Mar 19, 2013

You can do most of this with Node's built-in Buffer. http://nodejs.org/api/buffer.html#buffer_new_buffer_str_encoding

You'll need to determine somehow whether the incoming text is UTF-8 or not. That part is up to you.
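
A minimal sketch of the Buffer approach, assuming the raw bytes are available before anything has decoded them as UTF-8 (nothing in this thread confirms that node-irc exposes them). `decodeIrcBytes` is a hypothetical helper name, not part of node-irc. It uses `Buffer.from`, the current replacement for the `new Buffer(str, encoding)` constructor linked above, and it relies on a heuristic: text with non-ASCII ISO-8859-1 characters is rarely valid UTF-8, so a failed strict UTF-8 decode is treated as ISO-8859-1.

```javascript
// Hypothetical helper: decode raw bytes as UTF-8 when they form valid UTF-8,
// otherwise fall back to ISO-8859-1 (named 'latin1' in Node's Buffer API).
function decodeIrcBytes(buf) {
  try {
    // fatal: true makes the decoder throw on invalid UTF-8 instead of
    // silently substituting U+FFFD for the offending bytes.
    return new TextDecoder('utf-8', { fatal: true }).decode(buf);
  } catch (err) {
    return buf.toString('latin1');
  }
}

// The same name as each kind of client would send it on the wire.
const utf8Bytes = Buffer.from('Z\u00f6e', 'utf8');     // bytes 5a c3 b6 65
const latin1Bytes = Buffer.from('Z\u00f6e', 'latin1'); // bytes 5a f6 65

console.log(decodeIrcBytes(utf8Bytes));
console.log(decodeIrcBytes(latin1Bytes));
```

Both calls should yield the same string, so a search for either form would match. The heuristic is not airtight: some ISO-8859-1 byte sequences also happen to be valid UTF-8 and would be decoded as UTF-8.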

@ossiangrr

Well, at the earliest moment that I have access to the string (inside an addListener callback), it's already in the "garbled" state.
So I guess what you're saying is that the changes would have to be made inside the node-irc library itself. I guess I could attempt to modify it locally and see what happens... I'm a relative newcomer to Node, so I was hoping there was something within the irc library that I had just overlooked.

@damianb
damianb commented Mar 19, 2013

@ossiangrr is there an actual difference when looking at the buffer's state directly?

Check this: use console.dir on the string provided there and look at the hex values to see whether they differ. That'll tell you how low you've gotta go.

@ossiangrr

Yeah, those still come out as the "same character" using console.dir, so it would have to be something inside node-irc.
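
For illustration, here is a small sketch of why the characters look identical by that point. It assumes the incoming bytes were decoded as UTF-8 somewhere upstream, which the thread suggests but does not confirm. The byte values are the ISO-8859-1 encodings of the two names from the excerpt, and `codePointsHex` is a hypothetical helper for printing code points.

```javascript
// ISO-8859-1 bytes for the two names in the excerpt above.
const zoeBytes = Buffer.from([0x5a, 0xf6, 0x65]);                         // "Zöe"
const moncadaBytes = Buffer.from([0x4d, 0x6f, 0x6e, 0xe7, 0x61, 0x64, 0x61]); // "Monçada"

// Decoding as UTF-8: 0xf6 and 0xe7 are not valid UTF-8 on their own,
// so each one is replaced by U+FFFD and the original byte value is lost.
const decodedO = zoeBytes.toString('utf8');
const decodedC = moncadaBytes.toString('utf8');

// Hypothetical helper: list a string's code points in hex.
function codePointsHex(str) {
  return Array.from(str, (ch) => ch.codePointAt(0).toString(16));
}

console.log(codePointsHex(decodedO)); // [ '5a', 'fffd', '65' ]
console.log(codePointsHex(decodedC));

// Decoding the same bytes as ISO-8859-1 keeps the characters distinct.
console.log(zoeBytes.toString('latin1'), moncadaBytes.toString('latin1'));
```

Once the string holds U+FFFD, no conversion applied to the string can recover ö or ç, which is why the fix has to happen where the raw bytes are still available.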

@ossiangrr

I've found references in node-irc's forums to "encoding" patches, but I don't understand Node and/or GitHub well enough to figure out whether I can use this patch: #113

I have also found this: https://github.com/bnoordhuis/node-iconv
Which, again, I would use to modify node-irc itself if I were a little better versed in the code.

Maybe the core node-irc team could work with these links better than me?

@jacobrask

Did anyone figure out a solution, in node-irc or outside? I have both ISO-8859-1 users and UTF-8 users.

@sigkell sigkell closed this May 29, 2014