A brute-force legacy encodings decoder, available as both a package and a self-hosted web service

kirisakow/whatever-disentangler

whatever-disentangler is a brute-force disentangler for legacy encodings

Use cases

  • When you already know what the expected (disentangled) string looks like
  • When you know which encodings you want to try, even without knowing what the expected string looks like
  • Tough cases that need two-step disentangling
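
To make the idea concrete, here is a minimal sketch of what a brute-force search over encoding pairs can look like, using only the Python standard library. It is not the package's actual implementation: the function name `brute_force_fix` and its parameters are hypothetical and chosen for illustration. It covers the case where the expected string is known, and its `depth` parameter mimics two-step disentangling.

```python
from itertools import product


# Hypothetical helper for illustration; NOT part of whatever-disentangler's API.
def brute_force_fix(garbled, expected, encodings, depth=1):
    """Return every chain of (encode_with, decode_with) steps, at most
    `depth` steps long, that turns `garbled` into `expected`."""
    found = []

    def search(text, chain):
        for enc_from, enc_to in product(encodings, repeat=2):
            if enc_from == enc_to:
                continue  # same codec both ways leaves the text unchanged
            try:
                candidate = text.encode(enc_from).decode(enc_to)
            except (UnicodeEncodeError, UnicodeDecodeError):
                continue  # this pair cannot process the text; skip it
            step = chain + [(enc_from, enc_to)]
            if candidate == expected:
                found.append(step)
            elif len(step) < depth:
                search(candidate, step)  # try one more step on the result

    search(garbled, [])
    return found


# Simulate the damage: UTF-8 bytes misread as cp1252.
garbled = "échéancier".encode("utf_8").decode("cp1252")
print(brute_force_fix(garbled, "échéancier", ["cp1250", "cp1252", "utf_8"]))
```

The search space grows quickly with the number of encodings and with the depth, which is why narrowing the list of candidate encodings helps.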

Installation with Poetry

git clone https://github.com/kirisakow/whatever-disentangler.git

cd whatever-disentangler

poetry install

Use whatever-disentangler as a CLI executable

Run the script with no arguments to see the complete usage note. The key points:

  1. str_to_fix is the only required argument and the only positional one. As a positional argument it takes no key, only a value, and it must be either the very first or the very last item on the command line. Prefer the first position: placed last, it may be mistaken for one more value of an option that accepts multiple values. If the string contains spaces, enclose it in quotation marks.
  2. All other arguments are optional, and each key must be paired with its value: --expected_str "the actual expected string". Keys may be written with either underscores or hyphens; that is, both snake_case and kebab-case are accepted.
  3. The optional arguments --encoding-from and --encoding-to accept multiple values, separated by spaces or by another IFS character.
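
The rules above can be illustrated with a small `argparse` sketch. This is an assumption about how such a command line could be modelled, not the project's actual parser. It shows why the positional string is safer in first position and how both key spellings can be accepted.

```python
import argparse

# Illustrative parser only; the real CLI may be implemented differently.
parser = argparse.ArgumentParser(prog="whatever_disentangler")
parser.add_argument("str_to_fix")  # the single positional argument
# Registering both spellings accepts snake_case and kebab-case keys.
parser.add_argument("--expected-str", "--expected_str", dest="expected_str")
parser.add_argument("--encoding-from", "--encoding_from",
                    dest="encoding_from", nargs="+")
parser.add_argument("--encoding-to", "--encoding_to",
                    dest="encoding_to", nargs="+")

# Positional first: the multi-value option cannot swallow it.
args = parser.parse_args(["Ã©chÃ©ancier", "--encoding_from", "cp1250", "cp1252"])
print(args.str_to_fix, args.encoding_from)

# Positional last, right after a multi-value option: it is read as one
# more encoding, and the parser then reports str_to_fix as missing.
try:
    parser.parse_args(["--encoding-from", "cp1250", "cp1252", "Ã©chÃ©ancier"])
except SystemExit:
    print("str_to_fix was consumed by --encoding-from")
```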

Examples:

python whatever_disentangler "échéancier" --recursivity-depth 2 --expected-str "échéancier" --encoding_from cp1250 cp1251 cp1252
...
'échéancier' ('cp1252') -> 'échéancier' ('utf_8')
    -> 'échéancier' ('cp1252') -> 'échéancier' ('utf_8')
    -> 'échéancier' ('cp1252') -> 'échéancier' ('utf_8_sig')
...

Use whatever-disentangler as an importable library in Python code

Add whatever-disentangler as a dependency so you can import it:

cd your-project

poetry add --editable ../rel/path/to/whatever-disentangler/

poetry install

Use whatever-disentangler either as an offline disentangler or as a caller of a remote HTTP API:

import asyncio

from whatever_disentangler import whatever_disentangler as wd

# this one is an offline disentangler:
disentangler = wd.Disentangler()
disentangler.flatten_legibly(
  disentangler.disentangle(str_to_fix="боз▌з╤з╙з╤ б░з▄зтз╤Б0Ж3з▀Б0┌1! Б0┘5з╓зтзрз┴з▐ зуз▌з╤з╙з╤!", expected_str="Слава Україні! Героям слава!", recursivity_depth=2)
)

# and this one is remote: it calls a homemade REST API.
# fetch_response is awaited, so it has to run inside a coroutine:
async def main():
  remote_disentangler = wd.RemoteDisentangler(endpoint='https://crac.ovh/fix_legacy_encoding')
  response_obj = await remote_disentangler.fetch_response(str_to_fix="Ţč޻޹ަ ŢÓަޮޢ޴޷޵޺! Ţč޻޹ަ ޹ަŢŔްޢ!", expected_str="Жыве Беларусь! Жыве вечна!", recursivity_depth=2)
  remote_disentangler.flatten_legibly(response_obj)

asyncio.run(main())

To see whatever_disentangler in action,
