POS tagging #13

Open
TviNet opened this issue May 22, 2019 · 4 comments
Labels
enhancement New feature or request

Comments

@TviNet

TviNet commented May 22, 2019

https://universaldependencies.org/ has labelled data for parts of speech, dependencies, and morphology for Hindi, Sanskrit, Marathi, Tamil and Telugu.
I plan to use an LM-LSTM-CRF architecture for sequence tagging. However, the language models in iNLTK use SentencePiece tokens. Could anyone guide me on using the existing LMs with word-level tokens, or do I need to retrain the embeddings on word tokens?

@goru001
Owner

goru001 commented May 23, 2019

@TviNet Thanks for reaching out!
I glanced over the LM-LSTM-CRF repo and saw that it treats every space-separated word as a token. I think you can do that for Indic languages as well, but in that case you might not be able to use transfer learning (i.e. the pretrained LMs). I might be wrong here, since I have only had a quick look at the repo and would need to dig deeper.

The way I was thinking of tackling POS tagging is to keep transfer learning by pre-processing the dataset: break every word down into its tokens (using the tokenizer we already have in iNLTK), and expand the word's tag into a matching sequence such as <sometag1, sometag2, sometag3>, depending on the number of tokens the word is split into. I think this will yield a better model and better results, but we should experiment.
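
A minimal sketch of that pre-processing step, for illustration only. The thread does not say how the per-token tags should be derived, so this assumes the simplest scheme: every piece inherits its word's tag, and predictions are mapped back to words by keeping the first piece's tag. `toy_tokenize` is a hypothetical stand-in for the real SentencePiece tokenizer and is not part of iNLTK.

```python
from typing import Callable, List, Tuple


def expand_tags(
    words: List[str],
    tags: List[str],
    tokenize: Callable[[str], List[str]],
) -> Tuple[List[str], List[str], List[int]]:
    """Split each word into pieces and repeat the word's tag for every piece."""
    if len(words) != len(tags):
        raise ValueError("words and tags must have the same length")
    pieces: List[str] = []
    piece_tags: List[str] = []
    word_ids: List[int] = []  # index of the source word for each piece
    for idx, (word, tag) in enumerate(zip(words, tags)):
        word_pieces = tokenize(word)
        pieces.extend(word_pieces)
        piece_tags.extend([tag] * len(word_pieces))
        word_ids.extend([idx] * len(word_pieces))
    return pieces, piece_tags, word_ids


def collapse_tags(piece_tags: List[str], word_ids: List[int]) -> List[str]:
    """Map piece-level tags back to words by keeping the first piece's tag."""
    word_tags: List[str] = []
    previous = None
    for tag, idx in zip(piece_tags, word_ids):
        if idx != previous:
            word_tags.append(tag)
            previous = idx
    return word_tags


def toy_tokenize(word: str) -> List[str]:
    # Hypothetical tokenizer: fixed 3-character chunks instead of learned pieces.
    return [word[i:i + 3] for i in range(0, len(word), 3)]


words = ["playing", "cat"]
tags = ["VERB", "NOUN"]
pieces, piece_tags, word_ids = expand_tags(words, tags, toy_tokenize)
restored = collapse_tags(piece_tags, word_ids)
```

Keeping `word_ids` alongside the pieces makes it possible to score the tagger at word level, which is how the Universal Dependencies data is annotated. Other schemes (for example, tagging only the first piece and masking the rest) would need the same bookkeeping.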

Let me know what your thoughts are. Thanks!

@TviNet
Author

TviNet commented May 23, 2019

I tried averaging the subtoken representations of each word and feeding the result to an LSTM+CRF. This gave decent results for Hindi (13k training sentences, 96.3% accuracy) but not for Tamil (400 training sentences, 87% accuracy). The other languages similarly have very few training samples.
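
A rough sketch of the averaging step, assuming "averaging subtokens" means taking the mean of the subtoken vectors that belong to each word. This is not the code used for the numbers above; the `embedding` lookup table and its values are made up for illustration, and the LSTM+CRF that would consume the word vectors is omitted.

```python
from typing import Dict, List


def average_subtoken_vectors(
    pieces_per_word: List[List[str]],
    embedding: Dict[str, List[float]],
) -> List[List[float]]:
    """Return one vector per word: the element-wise mean of its piece vectors."""
    word_vectors: List[List[float]] = []
    for pieces in pieces_per_word:
        if not pieces:
            raise ValueError("every word needs at least one piece")
        vectors = [embedding[piece] for piece in pieces]
        dim = len(vectors[0])
        mean = [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)]
        word_vectors.append(mean)
    return word_vectors


# Hypothetical 2-dimensional embeddings; real LM embeddings are much larger.
embedding = {
    "pla": [1.0, 0.0],
    "yin": [0.0, 1.0],
    "g": [2.0, 2.0],
    "cat": [4.0, 6.0],
}
pieces_per_word = [["pla", "yin", "g"], ["cat"]]
word_vectors = average_subtoken_vectors(pieces_per_word, embedding)
```

The result has exactly one vector per word, so a word-level tagger can be trained on the original Universal Dependencies tags without changing the label sequence.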

@goru001
Owner

goru001 commented May 25, 2019

Yes, that's why I think transfer learning is important here, especially for low-resource languages.

goru001 added the enhancement label on Apr 10, 2020
@sarves

sarves commented Jan 11, 2021

Hi,

In case you are interested in a BiLSTM-based Tamil POS tagger (developed using the Stanza framework): https://github.com/sarves/thamizhi-pos
You can find the relevant models and tagged data there.

Sarves

3 participants