Telugu Part of Speech (POS) Tagger

Usage:

Only UNIX-based systems are supported. No installation is required.

To tag the sample file included with this software, run:

make tag

The sample file provided with the tool is telugu.input.txt. When you run the command, a file named telugu.output.txt will be created. For more tagging options, modify the Makefile.

For a sample output, see telugu.sample.out.txt.

Output Format:

The output contains the following tab-separated columns.

word	lemma	POS tag	suffix	coarse pos	gender	number	case marker
మీకోసం	మీరు	NN	కోసం	pn	any	pl	2

You probably need only the first three columns. The main POS tag is the third column (NN in the example above).
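The column layout above can be read with a few lines of Python. This is a minimal sketch, assuming each output line holds exactly the eight tab-separated fields listed in the header; the `TaggedToken` type, its field names, and the `parse_line` helper are hypothetical and simply mirror that header, they are not part of the tool.

```python
from typing import NamedTuple


class TaggedToken(NamedTuple):
    # Field names follow the column header shown in this README.
    word: str
    lemma: str
    pos: str
    suffix: str
    coarse_pos: str
    gender: str
    number: str
    case_marker: str


def parse_line(line: str) -> TaggedToken:
    """Split one tab-separated output line into its eight fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, got {len(fields)}")
    return TaggedToken(*fields)


# The sample row from this README.
sample = "మీకోసం\tమీరు\tNN\tకోసం\tpn\tany\tpl\t2"
token = parse_line(sample)
print(token.word, token.lemma, token.pos)
```

To process a whole file such as telugu.output.txt, the same helper could be applied line by line, skipping any empty lines.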

Tagset:

We use the IIIT tagset described in posguidelines.pdf (Bharati et al., 2006).


Category            Count  Values
Fine Grained Tags   ~300   (discarding low-frequency tags)
Main POS Tag        25     CC, JJ, NN, VM, . . .
Coarse POS Tag      9      adj, n, num, unk . . .
Gender              6      any, f, m, n, punc, null
Number              4      any, pl, sg, null
Person              5      1, 2, 3, any, null
Case                3      d, o, null
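The fully enumerated categories (gender, number, person, case) are small enough to check programmatically. The following is a rough sketch, assuming the value lists above are complete for those four categories; `FEATURE_VALUES` and `is_valid_feature` are hypothetical names used only for illustration. The main and coarse POS tags are left out because the README lists them only partially.

```python
# Value inventories copied from the tagset summary in this README.
FEATURE_VALUES = {
    "gender": {"any", "f", "m", "n", "punc", "null"},
    "number": {"any", "pl", "sg", "null"},
    "person": {"1", "2", "3", "any", "null"},
    "case": {"d", "o", "null"},
}


def is_valid_feature(category: str, value: str) -> bool:
    """Return True if `value` is a listed value for `category`."""
    return value in FEATURE_VALUES[category]


print(is_valid_feature("number", "pl"))
print(is_valid_feature("case", "x"))
```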

Citation:

Please cite either http://sivareddy.in/downloads or the reference below. The paper can be downloaded from http://sivareddy.in/papers/clia2011IndianCrossLang.pdf


@InProceedings{reddy-sharoff:2011:CLIA5,
  author    = {Reddy, Siva  and  Sharoff, Serge},
  title     = {Cross Language POS Taggers (and other Tools) for Indian Languages: An Experiment with Kannada using Telugu Resources},
  booktitle = {Proceedings of the Fifth International Workshop On Cross Lingual Information Access},
  month     = {November},
  year      = {2011},
  address   = {Chiang Mai, Thailand},
  publisher = {Asian Federation of Natural Language Processing},
  pages     = {11--19},
  url       = {http://www.aclweb.org/anthology/W11-3603}
}

This work is supported by the Intellitext project [1] and Lexical Computing Ltd [2] (Sketch Engine).

[1] http://corpus.leeds.ac.uk/it/
[2] http://www.sketchengine.co.uk/?page=Website/Company

Description:

The tagger is similar to Model 5 described in Table 2 of (Reddy and Sharoff 2011), but with a focus on Telugu. A short synopsis follows.

Large web corpora of Telugu are downloaded, cleaned, and tagged with a high-precision but low-recall tagger. A morphological analyzer is also run on this data. The tagger learns morphological analysis and POS tagging at the same time, so that POS tagging benefits from morphological analysis and vice versa. Because the tagger is trained on a large amount of data, it is expected to handle a large vocabulary and to predict the tags of unknown words from known words.

The current tagger is based on the TnT tagger. TnT is well known for its robustness and speed; however, it first loads the lex and trigram files, which may take some time. Once loading is finished, we expect the tagger to be very fast.
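As background on why trigram files are involved: TnT is generally described as a second-order (trigram) hidden Markov model tagger. Under that assumption, the tag sequence chosen for a sentence of words w_1 ... w_T can be written in the standard textbook form below. This is a simplified statement that is not taken from this README, and it omits TnT's smoothing and unknown-word handling.

```latex
\hat{t}_{1}^{T} \;=\; \operatorname*{arg\,max}_{t_1,\dots,t_T}\;
  \prod_{i=1}^{T} P(t_i \mid t_{i-2},\, t_{i-1}) \; P(w_i \mid t_i)
```

Here the transition term P(t_i | t_{i-2}, t_{i-1}) would come from the trigram counts and the emission term P(w_i | t_i) from the lexicon.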

License:

The model files are distributed under the GNU GPL. Feel free to use, modify, and redistribute them as necessary. The TnT tagger binaries, however, are free only for research purposes (obtain a TnT license from http://www.coli.uni-saarland.de/~thorsten/tnt/).

Contact:

For additional corpora and tools for other languages, please email your queries to siva@sivareddy.in

Training Details:

Trained on a corpus containing 3,152,199 tokens.

The lexicon contains 365,591 tokens.

Evaluation Results of Main POS tag:


Equal    :   19463 /  21452 ( 90.73%)
Different:    1989 /  21452 (  9.27%)

Tag     Freq    Prec        Rec         F-Measure
=================================================
NN      6754    0.837423    0.944922    0.887930
SYM     4963    0.999526    0.850494    0.919007
VM      4469    0.925830    0.960841    0.943011
PRP     1075    0.977591    0.973953    0.975769
NNP     677     0.817787    0.556869    0.662566
NST     546     0.951493    0.934066    0.942699
JJ      458     0.884444    0.868996    0.876652
RB      424     0.847599    0.957547    0.899225
QC      335     0.950292    0.970149    0.960118
WQ      327     0.945161    0.896024    0.919937
PSP     276     0.925676    0.992754    0.958042
DEM     254     0.988327    1.000000    0.994129
UT      191     0.974093    0.984293    0.979167
INTF    156     0.847561    0.891026    0.868750
QF      137     0.906977    0.854015    0.879699
RP      118     0.876106    0.838983    0.857143
CC      88      0.766355    0.931818    0.841026
RDP     46      0.833333    0.326087    0.468750
INJ     38      0.916667    0.578947    0.709677
CL      12      1.000000    0.916667    0.956522
QO      9       1.000000    1.000000    1.000000
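The figures above can be cross-checked with a short script. This is a minimal sketch, assuming the F-Measure column is the usual harmonic mean of precision and recall (the README does not state the formula); the `f_measure` helper is a hypothetical name, and the numbers are copied from the table.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Overall accuracy: tokens whose main POS tag matched the reference.
equal, total = 19463, 21452
accuracy = equal / total
print(f"accuracy = {accuracy:.2%}")

# Per-tag check for the NN row (precision and recall from the table).
nn_f = f_measure(0.837423, 0.944922)
print(f"NN F-measure = {nn_f:.6f}")
```

Because the precision and recall values in the table are already rounded, a recomputed F-measure may differ from the listed one in the last decimal place.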

Acknowledgements:

Avinesh PVS

References:

Bharati, Akshar, Rajeev Sangal, Dipti Misra Sharma, and Lakshmi Bai. "AnnCorra: Annotating Corpora Guidelines for POS and Chunk Annotation for Indian Languages." LTRC-TR31 (2006).

Reddy, Siva, and Serge Sharoff. "Cross Language POS Taggers (and other Tools) for Indian Languages: An Experiment with Kannada using Telugu Resources." Proceedings of the Fifth International Workshop on Cross Lingual Information Access (2011): 11-19.
