This project aims to evaluate how changes to a corpus annotation affect POS tagging. It assumes the Czech corpus DESAM (with its attributive tagset) and RFTagger. Originally developed as part of my master’s thesis.
3rdparty
modifier
.gitignore
LICENSE
README.md
annotate.py
bootstrap.py
compare_evaluation.py
configure
convert_to_latex.py
copy_rows.js
create_lexicon.py
create_model.py
cz_brno_attributive.py
deploy.sh
enqueue_job.py
evaluate_tagging.py
generate_measurements.py
generate_rftagger_lexicon_majka.sh
html_writer.py
log.py
majka.py
makefile.py
models.py
modify_corpus.py
mwe.py
my-uri-handler.py
original_wordclass.txt
pokusy.html
protocol-dicto.desktop
protocol-enqueue.desktop
protocol-geany.desktop
protocol-notes.desktop
protocol-okular.desktop
protocol-spooler.desktop
protocol-tagsetbench.desktop
pygments_lexer.py
readme.html
regenerated-wordclass.txt
rftagger.py
rftagger_lexicon.py
rftagger_possible_unknown_tags.txt
rftagger_possible_unknown_tags_kA.txt
run_measurements.py
shlex.py
split_corpus.py
spooler.py
style.css
symbols.py
tags.py
tagsetbench.py
template.html
vertical.py
wordclass.txt
xml_tag.py

README.md

tagsetbench

This project aims to evaluate, using cross-validation, how changes to a corpus annotation affect POS tagging.

It assumes the Czech corpus DESAM (with its attributive tagset) and RFTagger. Originally developed as part of my master’s thesis.
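As a rough illustration of the cross-validation idea, here is a minimal sketch. It is not the code used in this repository: the function names `k_fold_splits` and `tagging_accuracy` are hypothetical, and the interleaved fold assignment is just one simple choice. The idea is to tag each held-out fold with a model trained on the rest, once for the original annotation and once for the modified one, and then compare the per-fold accuracies.

```python
from typing import Iterator, List, Sequence, Tuple


def k_fold_splits(
    sentences: Sequence[str], k: int
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (train, test) pairs; every sentence is held out exactly once.

    Hypothetical helper: sentence i goes to fold i % k.
    """
    if not 2 <= k <= len(sentences):
        raise ValueError("k must be between 2 and the number of sentences")
    for fold in range(k):
        test = [s for i, s in enumerate(sentences) if i % k == fold]
        train = [s for i, s in enumerate(sentences) if i % k != fold]
        yield train, test


def tagging_accuracy(
    gold_tags: Sequence[str], predicted_tags: Sequence[str]
) -> float:
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if not gold_tags or len(gold_tags) != len(predicted_tags):
        raise ValueError("tag sequences must be non-empty and equally long")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)
```

In a real run, the training and tagging of each fold would be delegated to RFTagger; the sketch only covers the splitting and scoring around it.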

MIT licensed, except for the third-party files, which have their own licences.

TODO

Currently, the code includes parts of my (unreleased) chart parser “ijáček”. It should be released as well, and the common code should be shared between the two projects.

Usage notes

Proper usage notes are yet to be written. At a minimum you need Python 3.5 or newer, RFTagger, and GNU Make, plus the DESAM corpus or any other corpus that uses the Czech attributive tagset. This tagset is also used by Majka, a free morphological analyzer.
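The requirements above can be checked before running anything. The following is only a sketch and not part of the repository: `missing_prerequisites` is a hypothetical helper, and it assumes that RFTagger is installed as the commands `rft-train` and `rft-annotate` and that GNU Make is available as `make`. Adjust the names if your installation differs.

```python
import shutil
import sys

REQUIRED_PYTHON = (3, 5)
# Assumed command names; your RFTagger or Make installation may use others.
REQUIRED_COMMANDS = ("make", "rft-train", "rft-annotate")


def missing_prerequisites(python_version=sys.version_info, which=shutil.which):
    """Return a list of prerequisites that could not be found.

    `which` is injectable so that the check can be tested without
    touching the real PATH.
    """
    problems = []
    if tuple(python_version[:2]) < REQUIRED_PYTHON:
        problems.append("Python >= {}.{}".format(*REQUIRED_PYTHON))
    for command in REQUIRED_COMMANDS:
        # Only checks that the command exists, not that `make` is GNU Make.
        if which(command) is None:
            problems.append(command)
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print("missing: {}".format(problem))
```

The sketch avoids f-strings on purpose so that it still runs on Python 3.5.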

There may be some useful description in readme.html.

(Optional) Python 3 packages, available in Arch Linux AUR:

  • python-beautifulsoup4 4.5.1-1 (required by compare_evaluation.py)
  • python-tabulate (convert_to_latex.py, just a helper script)
  • python-colorlog (optional)
  • python-pygments (pygments_lexer.py, also an unnecessary part)
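To see which of these optional packages are present, a small check like the one below may help. It is a sketch, not repository code: `missing_optional_modules` is a hypothetical helper, and the import names (`bs4`, `tabulate`, `colorlog`, `pygments`) are the usual module names of the packages listed above.

```python
import importlib.util

# Import name -> what it is used for, according to the list above.
OPTIONAL_MODULES = {
    "bs4": "compare_evaluation.py",
    "tabulate": "convert_to_latex.py",
    "colorlog": "optional (use not stated in the README)",
    "pygments": "pygments_lexer.py",
}


def missing_optional_modules(modules=OPTIONAL_MODULES,
                             find_spec=importlib.util.find_spec):
    """Return the sorted import names that cannot be found.

    `find_spec` is injectable so the check can be tested in isolation.
    """
    return sorted(name for name in modules if find_spec(name) is None)


if __name__ == "__main__":
    for name in missing_optional_modules():
        print("{} not installed (used by: {})".format(
            name, OPTIONAL_MODULES[name]))
```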

Further notes

Firefox 51 or newer is recommended: it displays colourful emojis, which give better visual cues than plain shapes or glyphs when navigating the generated HTML tables.

The Czech comments in the code do not contain anything important.
