Niko Partanen, 3.12.2019

## ELAN tests

These are the ELAN file validation tests used in the IKDP research project and its continuation project [Language Documentation meets Language Technology: The Next Step in the Description of Komi](https://langdoc.github.io/IKDP-2/). Some parts of the code are very old, some are newer, and for some parts the final implementation is still being worked out. Especially for tests that compare ELAN files with metadata there are countless ways to implement them, and the current method is probably not the final one. Similarly, it is still a bit unclear what the best way to store the project's common attributes is. Currently we use a YAML file called `project.yaml`, but there may be better alternatives. The idea is that, in principle, the methods could be adapted to other projects by editing this configuration file. In practice this may be more complicated.

I use the Pympi package a lot in these tests, but I did make one change to the `Elan.py` file at line 1437. Otherwise there is a warning every time I parse an ELAN file. So this:

```
if tree_root.attrib['VERSION'] not in ['2.8', '2.7']:
```

is changed to:

```
if tree_root.attrib['VERSION'] not in ['2.8', '2.7', '3.0']:
```
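
Before deciding which values to accept, it can help to see which version a given file declares. The following is a minimal sketch using only the standard library. The names `eaf_version` and `ACCEPTED_VERSIONS` are hypothetical and not part of Pympi, and the inline XML is a stripped-down stand-in for a real ELAN file.

```python
import xml.etree.ElementTree as ET

# Versions accepted after the change above (hypothetical constant name).
ACCEPTED_VERSIONS = ["2.7", "2.8", "3.0"]


def eaf_version(xml_text):
    """Return the VERSION attribute of the root element, or None if it is missing."""
    root = ET.fromstring(xml_text)
    return root.attrib.get("VERSION")


# Stripped-down stand-in for the root element of an ELAN file.
sample = '<ANNOTATION_DOCUMENT FORMAT="3.0" VERSION="3.0"></ANNOTATION_DOCUMENT>'

version = eaf_version(sample)
print(version, version in ACCEPTED_VERSIONS)
```

For real files, `ET.parse(path).getroot()` could be used in place of `ET.fromstring`.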

Mikatools provides some of the functions used:

In [1]:
from mikatools import *

The tests themselves are loaded from this file:

In [2]:
from elan_tests import *

Here are other packages that are used.

In [3]:
from uralicNLP import uralicApi
from uralicNLP.cg3 import Cg3
import pympi
from nltk.tokenize import word_tokenize
import glob
import os
import re
import yaml
from pathlib import Path
import pandas as pd

When it comes to corpus metadata, the scripts currently assume the following structure in a JSON file.

In [None]:
meta_json = []

session = {}

session["session_name"] = "recording_session_1"

participants = [{"participant":"S1"},
                {"participant":"S2"}]

session["participants"] = participants

meta_json.append(session)

meta_json


So if the ELAN file is called `recording_session_1.eaf`, the information about this recording is stored in the metadata under the object with this identifier. Each session contains participant information, where the field `participant` holds the IDs that are also present in the ELAN files.

At this point we are mainly testing for two potential problems. First, every ELAN file should have a corresponding item in the metadata. If the ELAN file's session name does not match anything in the metadata, it will be impossible to connect any external information about the recording to the content of the ELAN file. Second, each participant in the ELAN file should be present in the right place within the metadata. It is of course entirely possible that the metadata lists participants who are not present in the ELAN file, for example, if someone was present during the recording but did not say anything.
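
The two checks can be sketched roughly as follows. This is only an illustration under the metadata structure shown above, not the implementation in `elan_tests`. The function names `find_session` and `missing_participants` are hypothetical, and the participant IDs of the ELAN file are passed in as a plain list instead of being read from a file.

```python
from pathlib import Path


def find_session(elan_file, meta):
    """Return the metadata entry whose session_name matches the file name, or None."""
    session_name = Path(elan_file).stem
    for session in meta:
        if session.get("session_name") == session_name:
            return session
    return None


def missing_participants(elan_participants, session):
    """Return participant IDs used in the ELAN file but absent from the session metadata."""
    known = {p["participant"] for p in session.get("participants", [])}
    return sorted(set(elan_participants) - known)


meta = [{"session_name": "recording_session_1",
         "participants": [{"participant": "S1"}, {"participant": "S2"}]}]

session = find_session("corpus/recording_session_1.eaf", meta)
if session is None:
    print("Session not found in metadata")
else:
    # S3 is used in the (imaginary) ELAN file but is not in the metadata.
    print(missing_participants(["S1", "S3"], session))
```

The set difference goes in one direction only: a participant listed in the metadata but silent in the recording is not reported.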

Our project's own metadata is easily loaded this way. 

In [None]:
corpus_meta = json_load('../ikdp-meta/ikdp_meta.json')

## Running the tests one by one

The easiest way to run the tests is to put them in a loop that goes through each ELAN file and prints the result if there are issues.

Some tests are possible only if previous tests succeeded. For example, it is useless to look for a speaker ID in the metadata if the session is not found. The logic between the steps could probably be arranged better, but the current structure at least attempts to reflect a movement from one domain to another in order of increasing complexity.

Our focus is on structural issues that could invalidate the ELAN file. These tests focus either on issues in the ELAN files themselves or on ways in which the information in ELAN files and metadata may mismatch harmfully. One could argue that things like the presence of different metadata attributes could also be checked. Yes, they could, and we should do it, but if the information is not repeated in the ELAN files, the problems caused by missing or changing values are not directly connected to ELAN.
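
One possible way to express that later tests depend on earlier ones is to run the checks in order and stop at the first one that reports a problem. The sketch below is not how the project currently does it. `run_checks` and the two toy checks are hypothetical stand-ins for the real tests.

```python
def run_checks(elan_file, checks):
    """Run checks in order and stop at the first one that reports problems.

    Each check is a function that takes the file path and returns a list of
    problem descriptions (an empty list means the check passed).
    """
    for check in checks:
        problems = check(elan_file)
        if problems:
            return [f"{elan_file}: {problem}" for problem in problems]
    return []


# Two toy checks standing in for the real tests.
def check_extension(elan_file):
    return [] if elan_file.endswith(".eaf") else ["not an .eaf file"]


def check_prefix(elan_file):
    return [] if elan_file.startswith("kpv_") else ["unexpected file name prefix"]


for message in run_checks("notes.txt", [check_extension, check_prefix]):
    print(message)
```

Here the prefix check is never run for `notes.txt`, because the extension check already failed.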

In [None]:
elan_file_paths = glob.glob(f"{corpus_location}/**/kpv_izva*.eaf", recursive=True)

for elan_file in elan_file_paths:
    
    pass
    
#   Is ELAN file named according to the scheme
#   test_session_names(elan_file)
    
#    In transcriptions, do we have only characters that are supposed to be there
#    check_illegal_characters(elan_file, verification_list = manually_verified_files)

#    Do we have other whitespace than spaces
#    check_illegal_characters(elan_file, verification_list = manually_verified_files, tier_type = "orthT", unwanted_characters = "[\t\n\r]")

#    Are all required tier types present
#    test_tier_types(elan_file, types = ['refT', 'orthT', 'ft-engT', 'ft-rusT'])

#    Are all wanted tiers present in the file
#    test_tier_existence(elan_file, tier_prefixes = ['ref', 'orth', 'ft-word', 'ft-rus'])

#    Are right tier types used on right tiers
#    test_tier_type_consistency(elan_file)

#    Is session name in metadata
#    test_tier_type_consistency(elan_file)

#    Is participant ID in metadata
#    test_participant_meta(elan_file, corpus_meta, "orthT")

### NOT YET IMPLEMENTED

#    Check if linked files exist
#    check_linked_files(elan_file_path)
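
The whitespace check above passes a regular expression as `unwanted_characters`. The core of such a check can be sketched as follows, assuming the annotation values have already been extracted into a list of strings. `find_unwanted` is a hypothetical name, not the function in `elan_tests`, and the example strings are made up for illustration.

```python
import re


def find_unwanted(annotations, unwanted_characters="[\t\n\r]"):
    """Return (index, annotation) pairs for annotations that match the pattern."""
    pattern = re.compile(unwanted_characters)
    return [(i, text) for i, text in enumerate(annotations) if pattern.search(text)]


# Made-up annotation values; the second one contains a tab character.
annotations = ["пышйас", "лэдзы\tаслы", "гырйас"]

for index, text in find_unwanted(annotations):
    print(index, repr(text))
```

Printing with `repr` makes otherwise invisible characters such as tabs visible in the output.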

It is also possible to go one step further and analyse the linguistic content directly. For example, as we have a Komi morphological analyser, we can test which words it does not recognise.

In [7]:
print_unknown_words("/Volumes/langdoc/langs/kpv/kpv_lit19570000lytkin-1323_2az/kpv_lit19570000lytkin-1323_2az.eaf")

пасьтадас (2)
ӧддьйӧджык (1)
чысьянӧдас (1)
сійӧс-тӧ (1)
пышйас (1)
платтьӧн (1)
ней (1)
мӧдӧдіс. (1)
лэдзы (1)
ичӧтьлик (1)
гӧсьньӧч (1)
гырйас (1)
вӧртіын (1)
верднылӧй (1)
бӧртчис (1)
аслы (1)
Патурлик (1)
Всё (1)


As usual, there are dialectal words, unknown words and Russian words that do not get an analysis. There are also incorrectly tokenized words, which brings us to other questions, such as how best to tokenize a language documentation corpus.
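
As an illustration of the tokenization question, a simple regular expression tokenizer would drop the trailing full stop from a form like `мӧдӧдіс.` while keeping hyphenated forms such as `сійӧс-тӧ` together. The sketch below is not the tokenizer used in `print_unknown_words`. The names `tokenize` and `count_unknown` are hypothetical, and the analyser is replaced by a toy `is_known` function.

```python
import re
import unicodedata
from collections import Counter

# A word is a run of word characters, optionally with internal hyphens.
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*")


def tokenize(text):
    """Split text into word tokens, leaving out surrounding punctuation."""
    # NFC composes letters with combining diacritics where a precomposed
    # form exists, so that \w matches them as single letters.
    return TOKEN_PATTERN.findall(unicodedata.normalize("NFC", text))


def count_unknown(tokens, is_known):
    """Count the tokens for which the is_known callable returns False."""
    return Counter(token for token in tokens if not is_known(token))


tokens = tokenize("сійӧс-тӧ мӧдӧдіс. Всё.")

# Toy stand-in for the analyser: treat every word as unknown.
for word, count in count_unknown(tokens, lambda token: False).most_common():
    print(f"{word} ({count})")
```

Whether hyphenated forms should stay together or be split is itself a corpus-specific decision, so the pattern here is only one of several reasonable choices.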