Non ascii characters in SBD mode #86

Closed
alphapats opened this issue Jul 11, 2022 · 4 comments
alphapats commented Jul 11, 2022

In 'sbd' mode, reassembler.py prints bytes in the printable ASCII range as characters and escapes everything else as hex. This leaves most SBD data garbled and unreadable. The snippet from util.py that converts integer values to the corresponding ASCII characters is:
    if( c>=32 and c<127): str1+=chr(c)
I investigated these hex values and found that they correspond to characters from other languages, such as Arabic and French. I started replacing them by hand:
    str = str.replace(r'\x{e2}\x{80}\x{99}',"'")
    str = str.replace(r'\x{e2}\x{80}\x{a6}',"…")
    str = str.replace(r'\x{f4}',"ô")
    str = str.replace(r'\x{c0}','À')
    str = str.replace(r'\x{c7}',"Ç")
    str = str.replace(r'\x{ea}',"ê")
    str = str.replace(r'\x{f9}',"ù")
    str = str.replace(r'\x{80}',"€")
    str = str.replace(r'\x{20}\x{A3}',"₣")
    str = str.replace(r'\x{c2}',"Â")
    str = str.replace(r'\x{e8}',"è")
    str = str.replace(r'\x{c9}',"É")
    str = str.replace(r'\x{ca}',"Ê")
How can this code be modified to display non-ASCII characters (French or Arabic)? Replacing each non-ASCII hex value with its character by hand works, but it is very time consuming. Is there an efficient way to convert these non-ASCII values to the corresponding non-English characters?
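
One way to avoid maintaining a replacement table by hand is to turn the escaped output back into raw bytes and let Python's codecs do the decoding. The sketch below is only an illustration and not part of the toolkit: `unescape_to_bytes` and `decode_lenient` are hypothetical helper names, the input is assumed to be unmodified `\x{hh}`-escaped output that otherwise contains only ASCII, and UTF-8 is a guess at the encoding (`e2 80 99` is the UTF-8 encoding of `’`, while a lone `f4` for `ô` points to ISO-8859-1 instead). Bytes that do not form valid text in the chosen encoding are kept as `\xhh` escapes.

```python
import re

# Matches the \x{hh} escapes seen in the reassembler output quoted above.
_ESCAPE = re.compile(r'\\x\{([0-9a-fA-F]{2})\}')

def unescape_to_bytes(text: str) -> bytes:
    """Rebuild raw bytes from a string that mixes ASCII with \\x{hh} escapes.

    Assumption: everything outside the escapes is plain ASCII. A literal
    backslash in the payload cannot be told apart from an escape.
    """
    out = bytearray()
    pos = 0
    for m in _ESCAPE.finditer(text):
        out += text[pos:m.start()].encode('ascii')  # literal ASCII run
        out.append(int(m.group(1), 16))             # one escaped byte
        pos = m.end()
    out += text[pos:].encode('ascii')
    return bytes(out)

def decode_lenient(raw: bytes, encoding: str = 'utf-8') -> str:
    """Decode with a guessed encoding; undecodable bytes stay visible as \\xhh."""
    return raw.decode(encoding, errors='backslashreplace')

# Guess 1: UTF-8. The apostrophe decodes, the binary header bytes stay escaped.
raw = unescape_to_bytes(r"\x{87})C*\x{d9}#I\x{e2}\x{80}\x{99}ll check now.")
utf8_text = decode_lenient(raw)

# Guess 2: ISO-8859-1, for messages that use single-byte accented characters.
latin1_text = decode_lenient(unescape_to_bytes(r"h\x{f4}tel"), 'latin-1')
```

Which encoding to pass remains a per-message guess, since nothing in the data announces it.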

@alphapats
Author

I have modified the code in util.py to include Arabic, French, punctuation, Roman numerals, Hindi, etc.:
```
for c in data:
    if mask:
        c=c&0x7f
    if(c>=32 and c<126):
        str1+=chr(c)
    #elif( c in [128,130,132,135,136,137,138,139,145,146,147,148,149,152,153,154]):
    #    str1+=chr(c)
    elif c in [233, 224, 232, 249, 226, 234, 238, 244, 251, 231, 235, 239, 252]: #french
        str1+=chr(c)
        #print('french')
    elif(c>=8208 and c<=8231): #punctuation
        str1+=chr(c)
    elif(c>=8240 and c<=8231): #punctuation (range is empty as written: 8240 > 8231)
        str1+=chr(c)
    elif(c>=8308 and c<=8334): #superscript
        str1+=chr(c)
    elif(c>=8531 and c<=8579): #roman
        str1+=chr(c)
    elif (c >= 1569 and c<=1791): #arabic
        str1+=chr(c)
    elif (c>=3840 and c<=4047): #tibetan
        str1+=chr(c)
    elif (c>=8528 and c<=8579): #number
        str1+=chr(c)
    elif (c>=4096 and c<=4185): #myanmar
        str1+=chr(c)
    elif(c>=2305 and c<=2416): #hindi
        str1+=chr(c)
    elif(c>=3584 and c<=3675): #thai
        str1+=chr(c)
    elif(c>=880 and c<=1011): #greek
        str1+=chr(c)
    elif(c>=3458 and c<=3572): #sinhala
        str1+=chr(c)
    elif(c>=8448 and c<=8506): #letterlikesymbol
        str1+=chr(c)
    else:
        if dot:
            str1+="."
        elif escape:
            if c==0x0d:
                str1+='\r'
            elif c==0x0a:
                str1+='\n'
            else:
                str1+='\\x{%02x}'%c
        else:
            str1+="[%02x]"%c
```

@Sec42
Member

Sec42 commented Jul 28, 2022

Hi,

SBD data is m2m (machine-to-machine) communication, so most of it will be binary, and without knowledge of the protocol and/or the participating endpoints it is difficult to understand.

I don't think blindly printing characters will help with understanding these protocols.

If you have concrete examples where this change helps understanding a protocol, please let me know.

@alphapats
Author

alphapats commented Jul 29, 2022

I have received a few Short Burst Data messages when using -m sbd. They do contain message content sent from one machine terminal to another over SBD. If the content is ASCII, it is readable in English. If the message was sent in another language, hex values are printed instead.
04-06-2022T17:39:46,DL,<26:02:5b:01:00:47:96>,\x{87})C*\x{d9}#I\x{e2}€\x{99}ll check now. Yesterday was 118Q\x{01}R\x{01}U\x{d3}\x{00}\x{00}\x{01}\x{81}.\x{9e}\x{97}\x{bb}C\x{c4}\x{06}\x{13}\x{07}i\x{04}\x{83}O@\x{c4}\x{06}\x{17} pSx\x{8f}
04-06-2022T17:44:08,DL,<26:02:5c:02:00:19:cf>,\x{87})C*\x{d9}\x{d1}145 opened 21 clicked on various links but some of those links were to Wikipedia.. so approx 15 clicked on actual trips. 2 unsubscribed. I guess we will not know the results until you can check your mailbox Q\x{01}R\x{01}U\x{d3}\x{00}\x{00}\x{01}\x{81}.\x{a2}J\x{b4}C\x{c4}\x{06}\x{13}\x{07}i\x{04}\x{83}O@\x{c4}\x{06}\x{17} pSx\x{8f}
The example above is in English. I also found some messages in French/Spanish. Those were readable wherever the French/Spanish characters fall in the ASCII range; everything else showed up as hex. I replaced the hex values with the corresponding French/Spanish characters and was able to recover the complete message.

PS: Out of 200-300 messages, only 5-10 contain readable text. The rest come out entirely as hex.

@Sec42
Member

Sec42 commented Dec 4, 2022

I understand where you're coming from. Unfortunately, without knowing the code page/encoding, mappings like these just amount to guessing.

Case in point: most of your code references codepoints > 255, which can't happen since the message is parsed byte-wise.

Your decoding of the "french" characters works more or less by accident, since the iso-8859-1 standard (which is what I guess is being used in your case) matches the first 256 characters of Unicode (which is what chr() uses).

I guess decoding/displaying the accented characters of iso-8859-1 would not do much harm, and just be mildly confusing. I'll test it for a bit & see how I feel about it.

However implementing speculative decoding of utf-8 (or other multi-byte encodings) is definitely out of scope here.
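
The points about byte-wise parsing and iso-8859-1 can be checked directly. The following is a minimal sketch under stated assumptions: `to_text_latin1` is a hypothetical variant of the printable-character test, not the toolkit's code, and the 160-255 range is simply the printable upper half of iso-8859-1 (it includes the no-break space and soft hyphen).

```python
def to_text_latin1(data: bytes) -> str:
    """Hypothetical variant: show ASCII and iso-8859-1 printables, escape the rest."""
    out = []
    for c in data:  # iterating over bytes yields ints in 0..255, never more
        if 32 <= c < 127 or 160 <= c <= 255:
            out.append(chr(c))            # same result as Latin-1 decoding
        else:
            out.append('\\x{%02x}' % c)   # same escape style as the thread
    return ''.join(out)

# chr() and iso-8859-1 agree on every possible byte value.
assert all(chr(b) == bytes([b]).decode('latin-1') for b in range(256))

# A UTF-8 right single quote is three bytes, each below 256, so a
# byte-wise loop never sees the codepoint 8217 itself.
assert list('\u2019'.encode('utf-8')) == [0xE2, 0x80, 0x99]

accented = to_text_latin1(b'h\xf4tel')              # single-byte o-circumflex
quote = to_text_latin1('\u2019'.encode('utf-8'))    # UTF-8 bytes, shown byte-wise
```

With this variant a Latin-1 message becomes readable, while UTF-8 text turns into an accented letter followed by escapes, which is the "mildly confusing" outcome described above.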

@Sec42 Sec42 closed this as completed Dec 4, 2022