GitHub - langdoc/eaf2korp: This is a script that converts an ELAN file into VRT format used in Korp

ELAN to Korp pipeline

This is a preliminary script for converting ELAN files, structured as in the IKDP project and several other language documentation projects, into the VRT file format used in Korp. An introduction to the VRT format can be found in the Language Bank of Finland's Korp documentation.

Bringing the corpus into Korp, so that a wider audience can access it easily, fits well with the goals of the IKDP-2 project, which focuses on applying language technology to Komi resources, with an emphasis on spoken corpora.

The pipeline parses annotations from ELAN files with the pympi package and adds morphological annotations (currently POS tags) with the uralicNLP package. This should make the pipeline rather flexible, as the only variables that need to be changed are the transcription tier and the language.
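
As a rough illustration of this flow, the sketch below reads one tier with pympi, looks up POS tags with uralicNLP, and writes one token per line in VRT. It is not the code in eaf2korp.py. The function names `read_tier`, `pos_of` and `to_vrt`, the `<utterance>` element, the whitespace tokenization and the assumption that analyses look like `lemma+POS+...` are all illustrative choices.

```python
from xml.sax.saxutils import escape, quoteattr


def read_tier(eaf_path, tier):
    """Return (start_ms, end_ms, text) tuples for one ELAN tier."""
    from pympi.Elan import Eaf  # third-party: pip install pympi-ling
    annotations = Eaf(eaf_path).get_annotation_data_for_tier(tier)
    # Each annotation starts with (start, end, value); keep only those fields.
    return [(a[0], a[1], a[2]) for a in annotations]


def pos_of(token, language):
    """Return the POS tags uralicNLP offers for a token, joined with '|'."""
    from uralicNLP import uralicApi  # third-party; the language model must be installed
    tags = set()
    for analysis, _weight in uralicApi.analyze(token, language):
        parts = analysis.split("+")  # assumption: analysis looks like lemma+POS+...
        if len(parts) > 1:
            tags.add(parts[1])
    # Ambiguity is kept as-is here; "_" marks tokens without any analysis.
    return "|".join(sorted(tags)) or "_"


def to_vrt(text_id, utterances, tag):
    """Build a VRT string: structural tags plus one tab-separated token per line."""
    lines = [f"<text id={quoteattr(text_id)}>"]
    for start, end, text in utterances:
        lines.append(f'<utterance start="{start}" end="{end}">')
        for token in text.split():  # naive whitespace tokenization
            lines.append(f"{escape(token)}\t{escape(tag(token))}")
        lines.append("</utterance>")
    lines.append("</text>")
    return "\n".join(lines) + "\n"


# With both packages installed, a call could look like this
# (the tier name "orth@speaker" is hypothetical):
# vrt = to_vrt("korp_example",
#              read_tier("korp_example.eaf", "orth@speaker"),
#              lambda token: pos_of(token, "kpv"))
```

Passing the tagger in as a function keeps the tier name and the language as the only things that change between projects, and lets `to_vrt` be tried out without any language model installed.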

To-do

Timecodes and speaker IDs need to be read from the ELAN file.
Metadata needs to be added in one form or another.
Other morphological tags, besides POS, need to be added in a separate column. This also requires deciding which tags we want and how ambiguity is handled at that level.
The utterances could be restructured into sentences by splitting at punctuation characters. This would make the content more readable in Korp.
The utterances, or whatever units we deal with, need to be ordered somehow in Korp. The most natural order would probably be by start time.
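
The timecode, speaker, ordering and sentence-splitting items could be approached along the lines of the sketch below. It reads the EAF XML directly with the standard library instead of pympi, so that it runs without extra packages. The inline sample document, the tier names and the helper names `timed_utterances` and `split_sentences` are made up for illustration, and the `PARTICIPANT` attribute of a tier is assumed to hold the speaker ID.

```python
import re
import xml.etree.ElementTree as ET

# A tiny hand-written EAF fragment; real files carry more attributes and tiers.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="900"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="500"/>
    <TIME_SLOT TIME_SLOT_ID="ts4" TIME_VALUE="1500"/>
  </TIME_ORDER>
  <TIER TIER_ID="orth@B" PARTICIPANT="B">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts3" TIME_SLOT_REF2="ts4">
        <ANNOTATION_VALUE>Second one. And more!</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="orth@A" PARTICIPANT="A">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>First one.</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def timed_utterances(eaf_xml):
    """Return (start_ms, end_ms, speaker, text) tuples ordered by start time."""
    root = ET.fromstring(eaf_xml)
    # Assumption: every time slot has a TIME_VALUE (unaligned slots would need handling).
    times = {slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
             for slot in root.iter("TIME_SLOT")}
    rows = []
    for tier in root.iter("TIER"):  # a real pipeline would pick only transcription tiers
        speaker = tier.get("PARTICIPANT", "")
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            rows.append((times[ann.get("TIME_SLOT_REF1")],
                         times[ann.get("TIME_SLOT_REF2")],
                         speaker,
                         ann.findtext("ANNOTATION_VALUE", default="")))
    return sorted(rows, key=lambda row: row[0])


def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


for start, end, speaker, text in timed_utterances(SAMPLE_EAF):
    print(start, end, speaker, split_sentences(text))
```

Splitting on punctuation alone will mis-handle abbreviations and unpunctuated speech, so this is only a starting point for the sentence restructuring idea.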

Ideas

There should perhaps be two columns, one with the analyses before CG (Constraint Grammar) disambiguation and one with the analyses after it.

Folders and files

LICENSE
README.md
eaf2korp.py
environment.yaml
example.py
ikdp2korp.ipynb
korp_example.eaf
korp_example.pfsx
korp_example.vrt