Hungarian dictionary contains invalid UTF-8 sequences #559

@dimztimz

The issues can be of three types (put an X between the brackets):

  • Bug reports
  • Change request or feature request
  • Others, questions

Reporting bugs

When reporting a bug you must tell us the state of your system and the
steps to reproduce the bug. For the state please specify the following:

key value
OS, distro, version = Linux, Ubuntu
Hunspell version = 1.6.2
Dictionary,package name,version = Hungarian

Steps to reproduce

Install the dictionary and open hu_HU.aff in gedit as UTF-8:

sudo apt install hunspell-hu
gedit /usr/share/hunspell/hu_HU.aff --encoding=UTF-8

Bugged behavior (output)

gedit shows an error. If it instead tries to interpret the file as ISO-8859-15, open the file with gedit's --encoding option.

Expected behavior (output)

No error should be shown by the text editor. Valid UTF-8 is expected.
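
The validity check can also be done without a text editor. The following is a minimal sketch, assuming an iconv implementation that rejects malformed input when converting UTF-8 to UTF-8 (GNU iconv does); the function name is_valid_utf8 is hypothetical and not part of Hunspell or the dictionary package.

```shell
#!/bin/bash
# Hypothetical helper (not part of Hunspell): returns success only if
# the given file decodes cleanly as UTF-8.
# Assumption: iconv exits non-zero on an illegal input sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Example call, using the path from the steps to reproduce:
#   is_valid_utf8 /usr/share/hunspell/hu_HU.aff && echo valid || echo invalid
```

Running iconv by hand without the redirection also prints the position of the first illegal sequence, which helps locate the offending comment or flag vector.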

Solution

Invalid UTF-8 appears only in comments and in flag vectors.

Upstream is at https://sourceforge.net/projects/magyarispell/; open the source tarball.

The fix is in the file bin/u8myspell. The following script should fix it completely.

#!/bin/bash
set -x
export LANG=en_US
export LC_ALL=C

# All three arguments are required.
case $# in
0|1|2) echo "u8myspell - converts MySpell dictionaries to UTF-8
usage: u8myspell source_name output_name source_charset"; exit 1;;
esac

i=$1
o=$2
charset=$3
localdir="$(dirname "$0")"

# Convert the word list to UTF-8, then run it through l1_u8.sed.
iconv -f "$charset" -t UTF-8 "$i.dic" | sed -f "$localdir"/l1_u8.sed > "$o.dic"
# Convert the affix file, replace the SET line with SET UTF-8 plus FLAG UTF-8,
# then run it through l1_u8.sed.
iconv -f "$charset" -t UTF-8 "$i.aff" |
sed 's/^SET .*$/SET UTF-8\
FLAG UTF-8/' | sed -f "$localdir"/l1_u8.sed > "$o.aff"

Basically, the Latin-2 (ISO-8859-2) input is converted to UTF-8, and the FLAG UTF-8 command is additionally emitted in the .aff file.
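
To see the effect of the header rewrite in isolation, here is a small sketch that applies the same iconv and sed steps to standard input. The function name to_utf8_aff and the two-line sample are made up for illustration, and the l1_u8.sed pass is left out because its contents are not shown here.

```shell
#!/bin/bash
# Hypothetical helper for illustration only: reads an .aff file in the
# given charset on stdin and writes the UTF-8 version to stdout.
# It mirrors the iconv + SET/FLAG rewrite above, without l1_u8.sed.
to_utf8_aff() {
    iconv -f "$1" -t UTF-8 |
    sed 's/^SET .*$/SET UTF-8\
FLAG UTF-8/'
}

# Example call on a made-up two-line Latin-2 header
# (octal 365 is the Latin-2 byte for the letter o with double acute):
#   printf 'SET ISO8859-2\nTRY \365\n' | to_utf8_aff ISO-8859-2
```

The SET line is replaced by two lines, so flags as well as text are declared as UTF-8, and every remaining line passes through with only its encoding changed.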
