## ELAN file tokenization with Python 

Pympi can be installed with `pip install pympi-ling`.

In [None]:
import pympi

An ELAN file is read into Pympi by giving its file path. Without a path the result is just an empty ELAN file, which can also be useful, for example when adding content from other sources.

In [248]:
elan_file = pympi.Eaf(file_path='../test.eaf')

Parsing unknown version of ELAN spec... This could result in errors...


Let's check which tiers use the transcription linguistic type. We could also just list all the tiers, but usually we want to do something more focused:

In [249]:
tiers = elan_file.get_tier_ids_for_linguistic_type('orthT')
tiers

['orth@Niko', 'orth@Naknek']

In [250]:
elan_file.get_tier_names()

dict_keys(['ref@Niko', 'orth@Niko', 'ref@Naknek', 'orth@Naknek'])

In [251]:
elan_file.get_linguistic_type_names()

dict_keys(['refT', 'orthT', 'wordT'])

So we have a linguistic type called `wordT`, but there is no tier for words. Let's add those! For each tier we picked up earlier we add a child tier. We take the participant name from the parent tier and use it in the new tier's name and attributes.

In [252]:
for tier in tiers:
    participant = elan_file.get_parameters_for_tier(tier)['PARTICIPANT']
    elan_file.add_tier(tier_id = 'word@' + participant, parent = 'orth@' + participant, ling = 'wordT', part=participant)

Now we have all tiers, which we can check by getting tier names:

In [254]:
elan_file.get_tier_names()

dict_keys(['ref@Niko', 'orth@Niko', 'ref@Naknek', 'orth@Naknek', 'word@Niko', 'word@Naknek'])

This is fine, but still a bit boring! We can create new tiers, but it is more exciting to add some content to them. So let's tokenize the transcription tier and add the content to the word tier we just created.

In [105]:
import nltk

In [180]:
from nltk.tokenize import wordpunct_tokenize

This can be used to tokenize different strings. It is worth noting that tokenization is in itself a rather complex task, and I'm not going into the details here, but for ELAN corpora we usually need to set up some more specific ways to handle it. Tokenization can be done on both the sentence and the word level, and there are lots of reasons why someone would do it one way or another.

In [258]:
wordpunct_tokenize('This is a sentence. Тайӧ мӧд сёрникузя.')

['This', 'is', 'a', 'sentence', '.', 'Тайӧ', 'мӧд', 'сёрникузя', '.']
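To make it a bit more concrete what "more specific ways" could mean, here is a minimal sketch using only the standard library. To my understanding `wordpunct_tokenize` is built on the regular expression `\w+|[^\w\s]+`, so `simple_wordpunct` below should behave the same way. The second function, `keep_hyphens`, is a hypothetical variant I made up for illustration, not something NLTK or Pympi provides.

```python
import re

# Runs of word characters, or runs of characters that are neither
# word characters nor whitespace (i.e. punctuation).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")


def simple_wordpunct(text):
    """Split text into word tokens and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def keep_hyphens(text):
    """Hypothetical variant: keep word-internal hyphens and apostrophes attached."""
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]+", text)


example = "Well-known words don't split nicely."
print(simple_wordpunct(example))
print(keep_hyphens(example))
```

Which behaviour is right depends on the orthography and the transcription conventions of the corpus, so a pattern like this would have to be adjusted for each project.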

In the next code block we have a more complicated for-loop. It takes all annotations from each transcription tier, including their start and end times, tokenizes the content of each annotation, and writes the tokens into the word-level tier.

In [242]:
for tier in tiers:
    participant = elan_file.get_parameters_for_tier(tier)['PARTICIPANT']
    annotations = elan_file.get_annotation_data_for_tier(tier)
    # Each item unpacks into start time, end time, text and a reference value.
    for start, end, text, reference in annotations:
        for token_position, token in enumerate(wordpunct_tokenize(text)):
            if token_position == 0:
                # The first token only needs a time that falls inside the parent annotation.
                elan_file.add_ref_annotation(value=token, id_tier='word@' + participant, tier2='ref@' + participant, time=start + 1)
            else:
                # Later tokens are chained to the annotation that was added just before.
                elan_file.add_ref_annotation(value=token, id_tier='word@' + participant, tier2='ref@' + participant, time=start + 1, prev='a' + str(elan_file.maxaid))


The result can now be saved to a file:

In [None]:
elan_file.to_file('tokenized_file.eaf')

And it looks like this:

![Imgur](https://i.imgur.com/xtazRYn.png)

The result looks very nice, and the file works well. But if we examine it closely at the XML level, we find some conventions that deviate from the way ELAN would handle the file:

![Imgur](https://i.imgur.com/DnZq2yh.png?2)

It is a bit unclear to me whether this is a problem or not. It certainly is a problem if we want to read the file into R and rely on how the annotation IDs match, as they now follow a different pattern than before. And maybe there are some situations where ELAN uses the tier numbering for some of its own internal purposes, and this will lead to new problems?
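For looking at the annotation IDs without opening the XML by hand, something like the following sketch could help. It uses only the standard library and assumes the usual EAF layout, where each `TIER` contains `ANNOTATION` elements that wrap either an `ALIGNABLE_ANNOTATION` or a `REF_ANNOTATION`. The function name `list_annotation_ids` and the tiny sample document are made up for illustration; the sample is not the file from this notebook.

```python
import xml.etree.ElementTree as ET


def list_annotation_ids(root):
    """Return (tier id, annotation id, referenced id, previous id) per annotation."""
    rows = []
    for tier in root.iter("TIER"):
        for annotation in tier.iter("ANNOTATION"):
            # The only child is ALIGNABLE_ANNOTATION or REF_ANNOTATION.
            for node in annotation:
                rows.append((
                    tier.get("TIER_ID"),
                    node.get("ANNOTATION_ID"),
                    node.get("ANNOTATION_REF"),        # None on aligned tiers
                    node.get("PREVIOUS_ANNOTATION"),   # None unless chained
                ))
    return rows


# Invented sample data, only to show the shape of the output.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIER TIER_ID="ref@X" LINGUISTIC_TYPE_REF="refT">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>001</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="word@X" LINGUISTIC_TYPE_REF="wordT" PARENT_REF="orth@X">
    <ANNOTATION>
      <REF_ANNOTATION ANNOTATION_ID="a2" ANNOTATION_REF="a1">
        <ANNOTATION_VALUE>This</ANNOTATION_VALUE>
      </REF_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <REF_ANNOTATION ANNOTATION_ID="a3" ANNOTATION_REF="a1" PREVIOUS_ANNOTATION="a2">
        <ANNOTATION_VALUE>is</ANNOTATION_VALUE>
      </REF_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""

for row in list_annotation_ids(ET.fromstring(SAMPLE_EAF)):
    print(row)
```

For a real file the root would come from `ET.parse('tokenized_file.eaf').getroot()`, and the same listing could be produced for a file saved by ELAN itself to compare the two patterns side by side.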

I just opened an [issue](https://github.com/dopefishh/pympi/issues/11) in the Pympi GitHub repository, so maybe I'm missing something or this issue has another solution.

Anyway, to wrap up this exercise, we can turn the code above into a new function that can be applied to other ELAN files.

In [263]:
def tokenize_eaf(filename):
    """Add word tiers to an ELAN file and fill them with tokens from the orthT tiers."""
    elan_file = pympi.Eaf(file_path=filename)
    tiers = elan_file.get_tier_ids_for_linguistic_type('orthT')

    # Create an empty word tier under each transcription tier.
    for tier in tiers:
        participant = elan_file.get_parameters_for_tier(tier)['PARTICIPANT']
        elan_file.add_tier(tier_id='word@' + participant, parent='orth@' + participant, ling='wordT', part=participant)

    # Tokenize every transcription and write the tokens into the word tier.
    for tier in tiers:
        participant = elan_file.get_parameters_for_tier(tier)['PARTICIPANT']
        annotations = elan_file.get_annotation_data_for_tier(tier)

        for start, end, text, reference in annotations:
            for token_position, token in enumerate(wordpunct_tokenize(text)):
                if token_position == 0:
                    elan_file.add_ref_annotation(value=token, id_tier='word@' + participant, tier2='ref@' + participant, time=start + 1)
                else:
                    # Chain each later token to the annotation added just before it.
                    elan_file.add_ref_annotation(value=token, id_tier='word@' + participant, tier2='ref@' + participant, time=start + 1, prev='a' + str(elan_file.maxaid))

    # Note: the result is written back to the same path that was read.
    elan_file.to_file(filename)
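Applying a function like this to a whole folder could look roughly like the sketch below. The helper name `apply_to_eaf_files` is my own invention, and it takes the processing function as an argument so that it does not depend on Pympi itself. Errors are collected instead of stopping the loop, since one odd file should not block the rest.

```python
from pathlib import Path


def apply_to_eaf_files(folder, func):
    """Call func(path) for every .eaf file in folder and return the failures."""
    failures = []
    for path in sorted(Path(folder).glob("*.eaf")):
        try:
            func(str(path))
        except Exception as error:
            # Remember which file failed and why, then carry on.
            failures.append((path.name, str(error)))
    return failures


# Hypothetical usage, assuming the files are in a folder called 'corpus':
# failures = apply_to_eaf_files('corpus', tokenize_eaf)
```

As `tokenize_eaf` writes back to the path it reads from, it is probably wise to run something like this on copies of the files first.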

So with something like this we should be able to loop through a list of ELAN files and tokenize them. There are of course numerous questions and places for improvement:

- Better tokenization method
- Checking whether the tier exists
- Checking whether the linguistic type exists
- Should the tier be removed when it already exists? Or renamed? Do we want a warning in this case, or what should happen?
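The existence checks from the list could be sketched as follows. This is only one possible policy (warn and skip when the word tier is already there, raise an error when the linguistic type or the parent tier is missing), and the function name `ensure_word_tier` is hypothetical. It relies only on the Pympi methods already used above: `get_tier_names`, `get_linguistic_type_names` and `add_tier`.

```python
import warnings


def ensure_word_tier(eaf, participant, ling="wordT"):
    """Add 'word@<participant>' if it is missing. Return True if it was created."""
    tier_id = "word@" + participant
    parent = "orth@" + participant

    if ling not in eaf.get_linguistic_type_names():
        raise ValueError("Linguistic type " + ling + " is not defined in this file")
    if parent not in eaf.get_tier_names():
        raise ValueError("Parent tier " + parent + " does not exist")

    if tier_id in eaf.get_tier_names():
        # Assumed policy: keep the existing tier untouched and tell the user.
        warnings.warn("Tier " + tier_id + " already exists, skipping")
        return False

    eaf.add_tier(tier_id=tier_id, parent=parent, ling=ling, part=participant)
    return True


# Hypothetical usage with the object from earlier cells:
# ensure_word_tier(elan_file, 'Niko')
```

Whether skipping is the right choice is exactly the open question above; removing or renaming the old tier would be just as easy to plug in at the same place.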

