NER tagging plugin for Vulyk, a crowdsourcing framework

To connect the plugin to Vulyk:

Install this plugin as a pip package: pip install git+https://github.com/lang-uk/vulyk-ner.git
Make sure to include its configuration in local_settings.py:

ENABLED_TASKS = {
    # other plugins will be somewhere here
    'vulyk_ner': 'NERTaggingTaskType'
}

Full installation instructions can be found at https://github.com/mrgambal/vulyk

Running tests after making changes

python -m unittest discover -s test

Conversion utils included

There are two utilities included in this package:

convert2vulyk.py, a Swiss Army knife for converting or tagging texts into a format suitable for vulyk tasks
convert_vulyk2iob.py, which converts individual answers, exported from vulyk with the ./manage.py db export command, into standard IOB

Convert texts with `convert2vulyk.py`

The convert subcommand of convert2vulyk.py converts a batch of files (either txt or json, see --format) into a jsonlines file that you can feed directly into vulyk. It can also autodiscover an annotation layer in brat standoff format (see --ann_autodiscovery). You can supply a glob-style string as the input_files parameter for batch processing. Beware: when applied to a raw txt file, the tool will fix the whitespace around punctuation according to the rules of typography.

Converting pre-tokenized json files ([["sent1_word1", "sent1_word2"], ["sent2_word1"]] format)

python bin/convert2vulyk.py -f json convert  tokenized/jsons/*.json > vulyk_tasks.jsonlines
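
The pre-tokenized input is a JSON list of sentences, each of which is a list of token strings. A minimal sketch of writing such a file and checking its shape is shown below; the helper names, the sample sentences and the file name are made up for illustration and are not part of this package.

```python
import json
import tempfile
from pathlib import Path


def write_tokenized(path, sentences):
    """Write sentences as a JSON list of token lists."""
    Path(path).write_text(
        json.dumps(sentences, ensure_ascii=False), encoding="utf-8"
    )


def read_tokenized(path):
    """Read the file back and verify it is a list of lists of strings."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    well_formed = isinstance(data, list) and all(
        isinstance(sent, list) and all(isinstance(tok, str) for tok in sent)
        for sent in data
    )
    if not well_formed:
        raise ValueError("expected a list of lists of strings")
    return data


# Hypothetical sample data: two sentences, already split into tokens.
sentences = [["Kyiv", "is", "the", "capital", "."], ["It", "is", "large", "."]]

# A temporary directory keeps the demo free of side effects.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.json"
    write_tokenized(sample, sentences)
    restored = read_tokenized(sample)
```

A file written this way should be usable as input to the command above.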

Converting text files with no annotations

This will tokenize the input text files using a whitespace tokenizer, adjust whitespace and ignore the annotation layer (if any):

python bin/convert2vulyk.py -f txt convert --ignore_annotations tokenized/txt/*.txt > vulyk_tasks.jsonlines
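
For readers unfamiliar with the term, whitespace tokenization splits the text on runs of whitespace and nothing else, so punctuation stays attached to the neighbouring word. The sketch below illustrates the general idea and keeps character offsets for each token; it is an assumption about the technique, not the code used by convert2vulyk.py, and the function name is hypothetical.

```python
import re


def whitespace_tokenize(text):
    """Return (token, start, end) triples for every run of non-whitespace."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


# Note the double space: it is skipped, but the offsets still point
# into the original string.
tokens = whitespace_tokenize("Kyiv is  the capital.")
```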

Converting text files with annotation layer stored in *.ann format

This will tokenize the input text files using a whitespace tokenizer, adjust whitespace, autodiscover the *.ann file next to each *.txt file (if any) and adjust the positions of the NER tokens found:

python bin/convert2vulyk.py -f txt convert  tokenized/txt/*.txt > vulyk_tasks.jsonlines
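
In brat standoff format, each entity in the *.ann file is a line of the form ID, TAB, label with start and end character offsets, TAB, surface text. The following sketch reads such lines and checks the offsets against the text. It is a simplified illustration rather than the parser used by this package: the labels PER and LOC and the sample sentence are invented, and discontinuous spans are skipped.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    id: str
    label: str
    start: int
    end: int
    text: str


def parse_ann(ann_text):
    """Parse text-bound (T...) annotations from brat standoff content."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # relations, attributes, notes etc. are ignored here
        parts = line.split("\t")
        if len(parts) < 3 or ";" in parts[1]:
            continue  # malformed line or discontinuous span: skipped in this sketch
        label, start, end = parts[1].split()
        entities.append(Entity(parts[0], label, int(start), int(end), parts[2]))
    return entities


# Hypothetical sample pair of .txt and .ann contents.
SAMPLE_TXT = "Taras Shevchenko was born in Moryntsi."
SAMPLE_ANN = "T1\tPER 0 16\tTaras Shevchenko\nT2\tLOC 29 37\tMoryntsi\n"

entities = parse_ann(SAMPLE_ANN)
# Offsets index into the raw text, which is why they must be adjusted
# whenever the converter changes whitespace.
offsets_match = all(SAMPLE_TXT[e.start:e.end] == e.text for e in entities)
```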

Tag texts with `convert2vulyk.py`

The tag subcommand lets you pre-annotate the given texts (tokenized or raw) using either stanza or spacy. You can also specify your own model with --ner_model.

To do so, you have to install the extra dependencies:

pip install -r extra_requirements.txt

Tag pre-tokenized json files with a spaCy model

python bin/convert2vulyk.py -f json tag --ner_framework spacy --ner_model /my/best/spacy/model tokenized/json/*.json > vulyk_tasks.jsonlines

Tokenize and tag raw text files with stanza model

Beware: to tokenize raw texts, the script uses lang-uk's tokenize-uk tokenizer, which is sometimes naïve.

python bin/convert2vulyk.py -f txt tag --ner_framework stanza --ner_model "uk" tokenized/txt/*.txt > vulyk_tasks.jsonlines

Import to Vulyk

./manage.py db load ner_tagging_task --batch batch_name ./path/save_to_file.json

For more details and possible parameters refer to python bin/convert2vulyk.py -h
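
Before loading a batch, it can be worth checking that the task file is valid JSON Lines, that is, one JSON value per non-empty line. The check below is a generic sketch that is not part of vulyk or this plugin; the function name is hypothetical.

```python
import json


def check_jsonlines(lines):
    """Return (number of valid records, list of (line_no, error message))."""
    valid, errors = 0, []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are ignored
        try:
            json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((line_no, str(exc)))
        else:
            valid += 1
    return valid, errors


# Any iterable of lines works, including an open file handle.
valid, errors = check_jsonlines(['{"a": 1}\n', "\n", "{broken\n"])
```

With a real file, pass the open file object, for example check_jsonlines(open("vulyk_tasks.jsonlines", encoding="utf-8")).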

Convert vulyk results to IOB with `convert_vulyk2iob.py`

The tool lets you convert one or more batches of answers exported from vulyk into IOB files:

python bin/convert_vulyk2iob.py "test_results/*.jsonlines" test_results/iobs/

Each individual answer from an annotator is stored according to the scheme {batch_dir}/{username}/{task_id}.iob, where batch_dir is the basename of the input file, username is the name of the annotator, and task_id is the unique identifier of the task in vulyk.
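
The naming scheme can be sketched with pathlib as follows. This is an illustration only: the function name is hypothetical, and it assumes that the basename is taken without the file extension and that the scheme is applied under the output directory given on the command line.

```python
from pathlib import Path


def iob_output_path(out_dir, input_file, username, task_id):
    """Build {out_dir}/{batch_dir}/{username}/{task_id}.iob for one answer."""
    # Assumption: "basename" means the file name without its extension.
    batch_dir = Path(input_file).stem
    return Path(out_dir) / batch_dir / username / f"{task_id}.iob"


# Hypothetical input file, annotator name and task id.
example = iob_output_path(
    "test_results/iobs", "test_results/batch1.jsonlines", "alice", "42"
)
```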

As usual, python bin/convert_vulyk2iob.py -h is your friend.
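
For readers new to IOB, the scheme tags the first token of an entity with B-LABEL, the following tokens of the same entity with I-LABEL, and everything else with O. The sketch below shows the idea on character-offset spans. It is not the conversion code from convert_vulyk2iob.py, it assumes that entity boundaries coincide with token boundaries, and the labels and sample data are invented.

```python
def to_iob(tokens, entities):
    """Assign an IOB tag to each token.

    tokens:   list of (text, start, end) triples
    entities: list of (label, start, end) triples
    """
    tags = []
    for _text, tok_start, tok_end in tokens:
        tag = "O"
        for label, ent_start, ent_end in entities:
            # A token belongs to an entity when it lies fully inside its span.
            if tok_start >= ent_start and tok_end <= ent_end:
                prefix = "B-" if tok_start == ent_start else "I-"
                tag = prefix + label
                break
        tags.append(tag)
    return tags


# Hypothetical sentence: "Taras Shevchenko was born in Moryntsi."
tokens = [
    ("Taras", 0, 5), ("Shevchenko", 6, 16), ("was", 17, 20), ("born", 21, 25),
    ("in", 26, 28), ("Moryntsi", 29, 37), (".", 37, 38),
]
entities = [("PER", 0, 16), ("LOC", 29, 37)]
tags = to_iob(tokens, entities)
```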

