# Converting the lexical resources
Author: Pierre Nugues

The lexical resources of _Granska_ are encoded in Latin 1 (ISO 8859-1), or possibly Latin 9 (ISO 8859-15). These are legacy encodings that many programs no longer handle well, and Latin 1 in particular is poorly designed. We convert the files to Unicode and store them as UTF-8.
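As an illustration of what the conversion involves, here is a minimal sketch on a hand-made byte string. The sample is an assumption for the example and is not taken from the Granska files. It decodes the same bytes as Latin 1 and as Latin 9, then re-encodes the text as UTF-8.

```python
# Hypothetical sample: the Swedish word "sjö", a space, and the byte 0xA4
raw = b'sj\xf6 \xa4'

as_latin1 = raw.decode('latin-1')      # 0xA4 is read as the currency sign
as_latin9 = raw.decode('iso-8859-15')  # 0xA4 is read as the euro sign

# Non-ASCII characters take one byte in Latin 1 and two or more in UTF-8
utf8 = as_latin1.encode('utf-8')
print(as_latin1, as_latin9, len(raw), len(utf8))
```

The letter `ö` decodes identically in both encodings, which is why the choice between Latin 1 and Latin 9 rarely matters for Swedish words.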

## The modules

In [None]:
from os.path import join, dirname, exists
from os import mkdir
from urllib.request import urlopen
import regex as re

## The resource names and locations

We find the resources in Viggo Kann's GitHub repository:

In [None]:
src_url = 'https://raw.githubusercontent.com/viggokann/granska/willes/lex/'

With the `morfs` folder and its files

In [None]:
morfs_folder = 'morfs/'
morfs_files = ['cw', 'cwt']

The `tags` folder and its files

In [None]:
tags_folder = 'tags/'
tags_files = ['ct', 'ctm', 'ctt', 'cttt', 'features', 'taginfo']

And finally `words`

In [None]:
words_folder = 'words/'
words_files = ['bitransitivaverb', 'compound-begin-ok.w', 'compound-end-stop.w',
              'cw', 'cwtl', 'feminina', 'foreign.w', 'inflection.lex', 'inflection.rules',
              'intransitivaverb', 'opt_space_words', 'spellNotOK', 'spellOK']
words_problematic_files = ['cw', 'cwtl']

We store the converted files in this folder. If it does not exist, we create it.

In [None]:
dest_folder = '../../lex/'
if not exists(dest_folder):
    mkdir(dest_folder)
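Note that `mkdir` creates only the last directory of a path and raises an error if a parent directory is missing. A possible alternative, sketched here on a temporary directory rather than on `dest_folder`, is `os.makedirs` with `exist_ok=True`. The nested path in the example is hypothetical.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'lex', 'words')  # hypothetical nested path
    os.makedirs(target, exist_ok=True)           # creates 'lex' and 'words'
    os.makedirs(target, exist_ok=True)           # calling it again is harmless
    created = os.path.isdir(target)

print(created)
```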

## Retrieving and converting the resources

We assume the original files are in Latin-1.
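Decoding with `latin-1` never raises an error, because every byte value maps to a character, so a successful decoding does not confirm the assumption. One way to estimate how much the choice matters, sketched below, is to count the bytes that Latin 1 and Latin 9 interpret differently. The function name `ambiguous_bytes` is ours and is not part of Granska.

```python
# Byte values that decode to different characters in Latin 1 and Latin 9
DIFFERING = [b for b in range(256)
             if bytes([b]).decode('latin-1') != bytes([b]).decode('iso-8859-15')]

def ambiguous_bytes(raw: bytes) -> int:
    """Count the bytes whose meaning depends on Latin 1 versus Latin 9."""
    return sum(raw.count(bytes([b])) for b in DIFFERING)

# Hypothetical samples: plain Swedish text, then two ambiguous bytes
print(ambiguous_bytes(b'sj\xf6'), ambiguous_bytes(b'\xa4 \xbd'))
```

If the count is zero for a file, decoding it as Latin 1 or as Latin 9 gives the same text.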

### `morfs`

We start with the `morfs` folder. If it does not exist, we create it.

In [None]:
if not exists(dest_folder + morfs_folder):
    mkdir(dest_folder + morfs_folder)

We retrieve and convert the files

In [None]:
for file in morfs_files:
    data = urlopen(src_url + morfs_folder + file).read().decode('latin-1')
    # The with statement closes the file as soon as the data is written
    with open(dest_folder + morfs_folder + file, 'w', encoding='utf-8') as f:
        f.write(data)

### `tags`

We do the same thing for `tags`. The folder:

In [None]:
if not exists(dest_folder + tags_folder):
    mkdir(dest_folder + tags_folder)

And the files

In [None]:
for file in tags_files:
    data = urlopen(src_url + tags_folder + file).read().decode('latin-1')
    # The with statement closes the file as soon as the data is written
    with open(dest_folder + tags_folder + file, 'w', encoding='utf-8') as f:
        f.write(data)

### `words`

Finally `words`. The folder:

In [None]:
if not exists(dest_folder + words_folder):
    mkdir(dest_folder + words_folder)

And the files

In [None]:
for file in words_files:
    data = urlopen(src_url + words_folder + file).read().decode('latin-1')
    # The with statement closes the file as soon as the data is written
    with open(dest_folder + words_folder + file, 'w', encoding='utf-8') as f:
        f.write(data)
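The three loops above follow the same pattern, so they could be factored into one function. The sketch below is one possible way to do it and is not part of the original notebook. The name `convert_folder` and its `fetch` parameter are assumptions, the latter added so that the function can be tried without a network connection.

```python
from os import makedirs
from os.path import join
from urllib.request import urlopen

def convert_folder(src_url, folder, files, dest_folder, fetch=None):
    """Download each file, decode it as Latin 1, and save it as UTF-8."""
    if fetch is None:
        # Default: read the raw bytes from the URL
        fetch = lambda url: urlopen(url).read()
    makedirs(join(dest_folder, folder), exist_ok=True)
    for file in files:
        data = fetch(src_url + folder + file).decode('latin-1')
        with open(join(dest_folder, folder, file), 'w', encoding='utf-8') as f:
            f.write(data)
```

With this function, a call such as `convert_folder(src_url, tags_folder, tags_files, dest_folder)` would replace the folder creation and the loop for `tags`.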

## Correcting the encoding

Two files in the `words` folder, `cw` and `cwtl`, contain spurious bell characters (`\a`). These control characters have no printable counterpart in the UTF-8 file, and the corresponding word ends up as an empty string. We therefore download these two files again and use a new loop that discards the lines whose word consists only of such a control character, as well as the lines without a tab-separated word field.

In [None]:
for file in words_problematic_files:
    data = urlopen(src_url + words_folder + file).read().decode('latin-1')
    kept = []
    for line in re.split(r'[\r\n]+', data):
        fields = re.split(r'\t+', line)
        # Keep the lines whose second field is not a lone bell character
        if len(fields) > 1 and not re.match(r'\a$', fields[1]):
            kept.append(line)
    with open(dest_folder + words_folder + file, 'w', encoding='utf-8') as f:
        f.write('\n'.join(kept))
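To see the effect of this filter, here is a small sketch applied to a hand-made sample instead of the downloaded files. The sample lines and the function name `drop_bell_lines` are assumptions for the example, and the sketch uses the standard `re` module, which also understands `\a`, rather than `regex`.

```python
import re

def drop_bell_lines(data):
    """Remove the lines whose second tab-separated field is a lone bell."""
    kept = []
    for line in re.split(r'[\r\n]+', data):
        fields = re.split(r'\t+', line)
        if len(fields) > 1 and not re.match(r'\a$', fields[1]):
            kept.append(line)
    return '\n'.join(kept)

# Hypothetical sample: three lines, the second has a bell as its word
sample = '12\tsjö\r\n3\t\x07\r\n7\thav\r\n'
cleaned = drop_bell_lines(sample)
print(repr(cleaned))
```

The result keeps the first and third lines, joined with a single `\n` and without a trailing newline.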