Corpus Reader for Universal Dependency Treebank #693

Closed
stevenbird opened this issue Jul 8, 2014 · 7 comments

@stevenbird
Member

Add the corpus reader for the Universal Dependency Treebank
https://code.google.com/p/uni-dep-tb/

@shcherbin

I'll try to implement this reader.

@stevenbird
Member Author

@shcherbin any progress? I'd like to find someone else if you don't have time, thanks.

@shcherbin

I wrote you an email about this, I believe in August, but received no reply.

This corpus is in CoNLL format.
By inheriting from ConllCorpusReader (/nltk/corpus/reader/conll.py) and supplying the correct columntypes, there is no need to override the following interface methods:

raw()
words()
sents()
tagged_words()
tagged_sents()

They all work as we would expect them to.
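
As a rough illustration of the column mapping described above, here is a minimal sketch. It assumes the treebank files use the 10-column CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL). The names `UD_COLUMN_TYPES` and `make_ud_reader` and the `.conll` file pattern are hypothetical, not part of NLTK or of the PR discussed here.

```python
# Assumed layout: ID FORM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL PHEAD PDEPREL.
# ConllCorpusReader only needs to know which column holds the words and
# which holds the POS tags; every other column can be marked "ignore".
UD_COLUMN_TYPES = (
    "ignore",  # ID
    "words",   # FORM
    "ignore",  # LEMMA
    "pos",     # CPOSTAG (coarse-grained tag)
    "ignore",  # POSTAG
    "ignore",  # FEATS
    "ignore",  # HEAD
    "ignore",  # DEPREL
    "ignore",  # PHEAD
    "ignore",  # PDEPREL
)


def make_ud_reader(root, fileids=r".*\.conll"):
    """Build a reader over a directory of treebank files (hypothetical helper)."""
    # Imported here so the column layout can be inspected without NLTK installed.
    from nltk.corpus.reader.conll import ConllCorpusReader

    return ConllCorpusReader(root, fileids, UD_COLUMN_TYPES)
```

With a reader built this way, `words()`, `sents()`, `tagged_words()` and `tagged_sents()` should draw on the FORM and CPOSTAG columns. The dependency columns (HEAD, DEPREL) are ignored in this sketch, so it does not expose the dependency structure itself.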

@stevenbird
Member Author

Thanks @shcherbin – I must have missed that email. I can't merge this PR as is, but will adapt it, and add data to the NLTK Data repository.

stevenbird self-assigned this on Dec 10, 2014
stevenbird added this to the late-December milestone on Dec 10, 2014

@shcherbin

There are two versions of the Universal Dependency Treebank. Do we need to add both?

@stevenbird
Member Author

This issue is superseded by #809.

@ohenrik

ohenrik commented Jan 16, 2018

@shcherbin Any idea what the correct column types are? Could you provide an example?

Here is some test data:

# sent_id =  000001
# text = Lam og piggvar på bryllupsmenyen
1	Lam	lam	NOUN	_	Definite=Ind|Gender=Neut|Number=Sing	0	root	_	_
2	og	og	CCONJ	_	_	3	cc	_	_
3	piggvar	piggvar	NOUN	_	Definite=Ind|Gender=Masc|Number=Sing	1	conj	_	_
4	på	på	ADP	_	_	5	mark	_	_
5	bryllupsmenyen	bryllupsmeny	NOUN	_	Definite=Def|Gender=Masc|Number=Sing	1	xcomp	_	_
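
One way to read the test data above is sketched below with the standard library only. It assumes the data follows the 10-field, tab-separated CoNLL-U layout and that the `#` lines are sentence-level comments. As far as I can tell, ConllCorpusReader expects every non-blank line in a block to have the same number of columns, so those comment lines would probably need to be filtered out before the data is handed to it. `CONLLU_FIELDS` and `parse_conllu` are hypothetical names used for illustration.

```python
# Field names for the assumed 10-column CoNLL-U layout.
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")

# The test data from this comment, with explicit tab separators.
SAMPLE = (
    "# sent_id =  000001\n"
    "# text = Lam og piggvar på bryllupsmenyen\n"
    "1\tLam\tlam\tNOUN\t_\tDefinite=Ind|Gender=Neut|Number=Sing\t0\troot\t_\t_\n"
    "2\tog\tog\tCCONJ\t_\t_\t3\tcc\t_\t_\n"
    "3\tpiggvar\tpiggvar\tNOUN\t_\tDefinite=Ind|Gender=Masc|Number=Sing\t1\tconj\t_\t_\n"
    "4\tpå\tpå\tADP\t_\t_\t5\tmark\t_\t_\n"
    "5\tbryllupsmenyen\tbryllupsmeny\tNOUN\t_\tDefinite=Def|Gender=Masc|Number=Sing\t1\txcomp\t_\t_\n"
)


def parse_conllu(text):
    """Yield each sentence as a list of {field: value} dicts."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            # Skip sentence-level comments such as sent_id and text.
            continue
        else:
            sentence.append(dict(zip(CONLLU_FIELDS, line.split("\t"))))
    if sentence:
        yield sentence


sents = list(parse_conllu(SAMPLE))
# (word, tag) pairs, analogous to what tagged_sents() would return.
tagged = [[(tok["form"], tok["upos"]) for tok in sent] for sent in sents]
```

Here the second field plays the role of the "words" column and the fourth the "pos" column, which is the same mapping a ConllCorpusReader `columntypes` tuple would need to express.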
