# Programs to load the JSON files and restructure them

Here are some examples of how to read the JSON files, restructure them into dictionaries, and save them.

In [None]:
import json

The source folder that contains the original files

In [None]:
src_folder = '../../lex/'

## The `morfs` folder

The `morfs` folder and files

In [None]:
morfs_folder = 'morfs/'
morfs_files = ['cw.json', 'cwt.json']

We load the files

In [None]:
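# Open each file in a context manager so that it is closed after loading
with open(src_folder + morfs_folder + 'cw.json', encoding='utf-8') as f:
    cw = json.load(f)
with open(src_folder + morfs_folder + 'cwt.json', encoding='utf-8') as f:
    cwt = json.load(f)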
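
The same loading pattern recurs for every file in this notebook. A minimal sketch of a small helper is given below; the name `load_json` and the demonstration file are hypothetical and not part of the original data. It assumes the files are UTF-8 encoded.

```python
import json
import tempfile
from pathlib import Path


def load_json(path):
    """Load a JSON file as UTF-8 and close it when done."""
    with open(path, encoding='utf-8') as f:
        return json.load(f)


# Demonstration on a throwaway file with made-up content
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / 'demo.json'
    demo_path.write_text('{"cw": [[3, "bil"], [1, "bilar"]]}', encoding='utf-8')
    demo = load_json(demo_path)

print(demo['cw'])
```

With such a helper, a call like `load_json(src_folder + morfs_folder + 'cw.json')` would replace the explicit `open()`.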

The `cw` file contains a list of [count, word] pairs under the key `cw`. Some examples:

In [None]:
cw['cw'][:4]

We create a dictionary from the list, where the key is the word and the value is its frequency.

In [None]:
cw = {pair[1]: pair[0] for pair in cw['cw']}

In [None]:
cw['bil'], cw['bilar'], cw['bi']
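
A dictionary comprehension keeps only the last count when a word appears more than once in the list. Whether the real file contains repeated words is not checked here; the sketch below uses made-up data to show the difference between the plain comprehension and a variant that adds the counts up.

```python
from collections import Counter

# Made-up sample in the same [count, word] layout as cw['cw']
sample = [[120, 'bil'], [45, 'bilar'], [7, 'bi'], [3, 'bi']]

# Plain comprehension: a repeated word keeps only its last count
last_wins = {word: count for count, word in sample}

# Counter-based variant: the counts of a repeated word are added up
summed = Counter()
for count, word in sample:
    summed[word] += count

print(last_wins['bi'], summed['bi'])
```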

The `cwt` file contains a list of [count, word, tag] triples under the key `cwt`.

In [None]:
cwt['cwt'][:4]

We encode them as dictionary entries of the form (w, t): count. We use tuples as keys because they are hashable and their elements, the word and the tag, are easy to extract.

In [None]:
cwt = {(triple[1], triple[2]): triple[0] for triple in cwt['cwt']}

In [None]:
cwt[('bil', 'pm.nom')], cwt[('bil', 'nn.utr.sin.ind.nom')], cwt[('bil', 'jj.pos.utr.sin.ind.nom')]
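
Tuple keys unpack directly in a loop, which makes it easy to regroup the counts. The following is a minimal sketch on made-up counts; `sample_cwt` and `by_word` are hypothetical names.

```python
from collections import defaultdict

# Made-up (word, tag): count entries in the same shape as cwt
sample_cwt = {
    ('bil', 'nn.utr.sin.ind.nom'): 50,
    ('bil', 'pm.nom'): 2,
    ('bilar', 'nn.utr.plu.ind.nom'): 20,
}

# Regroup as word -> {tag: count}
by_word = defaultdict(dict)
for (word, tag), count in sample_cwt.items():
    by_word[word][tag] = count

# Most frequent tag of a word
best_tag = max(by_word['bil'], key=by_word['bil'].get)
print(best_tag)
```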

## Saving the dictionaries as JSON files

As JSON only allows strings as object keys, we have to encode the tuple keys as strings.

In [None]:
cwt = {str(x):y for x, y in cwt.items()}

In [None]:
# The context manager closes the file even if json.dump() raises an error
with open(src_folder + morfs_folder + 'cwt_new' + '.json', 'w', encoding='utf-8') as fp:
    json.dump(cwt, fp, indent=4, ensure_ascii=False)
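
To see why the string encoding is needed, the sketch below tries to serialize a made-up dictionary with a tuple key. Python's `json` module raises a `TypeError` in that case, while the string-keyed version is accepted.

```python
import json

# Made-up entry in the same shape as cwt
sample_cwt = {('bil', 'pm.nom'): 2}

# Tuple keys are rejected by the json module
try:
    json.dumps(sample_cwt)
    tuple_keys_ok = True
except TypeError:
    tuple_keys_ok = False

# Keys converted with str() are accepted
encoded = json.dumps({str(k): v for k, v in sample_cwt.items()}, ensure_ascii=False)
print(tuple_keys_ok, encoded)
```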

## Reading the JSON dictionaries

We read the dictionaries with `load()`:

In [None]:
with open(src_folder + morfs_folder + 'cwt_new' + '.json', encoding='utf-8') as f:
    cwt = json.load(f)

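The keys of `cwt` are conceptually tuples (pairs of a word and a part of speech). For JSON compatibility, we encoded them as strings. We apply `ast.literal_eval()` to convert them back into tuples. Unlike `eval()`, it only parses Python literals and does not execute arbitrary code read from the file.

In [None]:
import ast

cwt = {ast.literal_eval(x): y for x, y in cwt.items()}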

We access the word-tag counts the usual way.

In [None]:
cwt[('bil', 'pm.nom')], cwt[('bil', 'nn.utr.sin.ind.nom')], cwt[('bil', 'jj.pos.utr.sin.ind.nom')]
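
The whole round trip can be checked in memory. This is a minimal sketch on made-up counts, using `ast.literal_eval()` as a safer alternative to `eval()` for parsing the string keys.

```python
import ast
import json

# Made-up (word, tag): count entries
original = {('bil', 'pm.nom'): 2, ('bil', 'nn.utr.sin.ind.nom'): 50}

# Encode the tuple keys as strings and serialize
text = json.dumps({str(k): v for k, v in original.items()})

# Deserialize and parse the keys back into tuples
restored = {ast.literal_eval(k): v for k, v in json.loads(text).items()}

print(restored == original)
```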

## The `tags` folder

The `tags` folder contains counts, which we could encode as we did with `morfs`. Here we load the two other files, `features.json` and `taginfo.json`.

The name of the `tags` folder

In [None]:
tags_folder = 'tags/'

In [None]:
with open(src_folder + tags_folder + 'features.json', encoding='utf-8') as f:
    features = json.load(f)
with open(src_folder + tags_folder + 'taginfo.json', encoding='utf-8') as f:
    taginfo = json.load(f)

The `taginfo` entry is just a list. Its first items:

In [None]:
taginfo['taginfo'][:6]

The `features` file is more complex. Maybe it could be simplified. Here we get the list of parts of speech.

In [None]:
features['wordcl']['values'].keys()

And the list of values for the case feature (`case`)

In [None]:
features['case']['values'].keys()

The information on a given case value, here `sms`, is the corresponding value in the dictionary

In [None]:
features['case']['values']['sms']
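
The same access pattern can be applied to all the features at once. The sketch below assumes that every feature has a `values` dictionary, as `wordcl` and `case` do above; the sample data and its descriptions are made-up placeholders, not the content of the real file.

```python
# Made-up stand-in for the loaded features dictionary; the descriptions
# are placeholders, not the content of the real file
sample_features = {
    'wordcl': {'values': {'nn': 'placeholder', 'jj': 'placeholder'}},
    'case': {'values': {'nom': 'placeholder', 'gen': 'placeholder', 'sms': 'placeholder'}},
}

# Map each feature name to the sorted list of its value codes;
# a feature without a 'values' key yields an empty list
value_codes = {
    name: sorted(info.get('values', {}))
    for name, info in sample_features.items()
}
print(value_codes['case'])
```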

## The `words` folder

The `words` folder has the same type of files. The most complex one is `inflection.rules`. We examine it now.

The name of the `words` folder

In [None]:
words_folder = 'words/'

`inflection_rules` is a list of possible inflections in Swedish and the parts of speech they apply to. We load it.

In [None]:
with open(src_folder + words_folder + 'inflection.rules.json', encoding='utf-8') as f:
    inflection_rules = json.load(f)

Each item in the list corresponds to a specific inflection model depending on a part of speech and a paradigm. It is represented as a dictionary with two keys: `feat_infl` and `paradigm`. `feat_infl` lists the parts of speech and the grammatical features. `paradigm` is a list of inflection models. In each model, the first two items are an inflection code and a string of suffixes separated by commas. The remaining items are the inflections, in the same order as the `feat_infl` list.

In [None]:
inflection_rules['inflection.rules'][1]

The features governing the inflections

In [None]:
inflection_rules['inflection.rules'][0]['feat_infl']

And the paradigms, where each paradigm starts with a code and a string of suffixes (which will form a tuple). The rest is the list of inflections corresponding to the grammatical features

In [None]:
inflection_rules['inflection.rules'][0]['paradigm'][10]

We convert each list of paradigms into a dictionary keyed by the tuple (code, suffixes), for more convenient access to the paradigms

In [None]:
for rule in inflection_rules['inflection.rules']:
    temp_list = rule['paradigm']
    rule['paradigm'] = {}
    for paradigm in temp_list:
        rule['paradigm'][(paradigm[0], paradigm[1])] = paradigm[2:]

In [None]:
inflection_rules['inflection.rules'][0]['paradigm'][('u1', 'behå,byrå,hå,slå,så,^å')]
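
Since the inflections follow the order of the `feat_infl` list, the two can be paired with `zip()`. Here is a minimal sketch on a made-up rule; the feature names, codes, and inflections are placeholders that only mimic the layout described above.

```python
# Made-up rule in the layout described above; all values are placeholders
rule = {
    'feat_infl': ['feat_a', 'feat_b', 'feat_c'],
    'paradigm': [
        ['code1', 'x,y', 'infl_a1', 'infl_b1', 'infl_c1'],
        ['code2', 'z', 'infl_a2', 'infl_b2', 'infl_c2'],
    ],
}

# Same conversion as above: (code, suffixes) -> list of inflections
paradigms = {(p[0], p[1]): p[2:] for p in rule['paradigm']}

# Pair each grammatical feature with its inflection
aligned = dict(zip(rule['feat_infl'], paradigms[('code1', 'x,y')]))
print(aligned)
```

The suffix string can also be split with `'x,y'.split(',')` when the individual suffixes are needed.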