
Using all of the Dutch wikipedia #1

Open
kootenpv opened this issue Mar 12, 2017 · 12 comments

Comments

@kootenpv

I believe that, using this guide, it is possible to use the whole of the Dutch Wikipedia for spaCy?

Just changing "en" to "nl" works.

https://dumps.wikimedia.org/nlwiki/latest/nlwiki-latest-pages-articles.xml.bz2

I was able to train the word embeddings in a few hours.
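For reference, a rough sketch of how such a run can look with gensim (a sketch only: the dump file name matches the link above, and the parameter names assume gensim 4, where older versions use `size` instead of `vector_size`):

```python
from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models import Word2Vec

# Stream tokenized article texts straight from the compressed dump;
# dictionary={} skips building a Dictionary we don't need here.
wiki = WikiCorpus("nlwiki-latest-pages-articles.xml.bz2", dictionary={})
sentences = list(wiki.get_texts())

# Skip-gram embeddings; min_count prunes rare words from the vocabulary.
model = Word2Vec(sentences, vector_size=300, window=5,
                 min_count=5, sg=1, workers=4)
model.save("nlwiki_word2vec.model")
```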

Is it correct that spaCy's Dutch support resulted from this repository? If so, great work :)
Can't wait to see the POS tagging / dependency parsing included as well!

@jvdzwaan

@kootenpv Thank you for your suggestion! It is correct that the spaCy Dutch resources come from this repository.

At the moment I'm still struggling with learning Brown clusters from the Common Crawl data. When that is finished, I'll have a look at the next task...

@kootenpv

kootenpv commented Mar 20, 2017

I haven't really seen Brown clusters being used by spaCy. What are they needed for?

If it is about creating a vocab, maybe the above guide could help to create a (pruned) vocab and associated vectors (heavily relying on gensim and the whole of the Dutch Wikipedia data)? In my case it resulted in a model with 883089 words.

Or is there a goal beyond vocab+vectors specifically for POS/dependency tagging?
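In case it helps, a rough sketch of getting such gensim vectors into a spaCy pipeline (a sketch only; the `spacy init vectors` command belongs to the current v3 CLI, and older spaCy versions used a different workflow such as `init-model`):

```python
from gensim.models import Word2Vec

# Export the trained vectors in the plain word2vec text format,
# which spaCy's vector-import command can read directly.
model = Word2Vec.load("nlwiki_word2vec.model")
model.wv.save_word2vec_format("nlwiki_vectors.txt")

# Then, from the shell (spaCy v3 CLI):
#   python -m spacy init vectors nl nlwiki_vectors.txt ./nl_vectors_model
```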

@jvdzwaan

Brown clusters are used as features for training the POS tagger. Apparently, they help. The performance of the Dutch POS tagger has room for improvement :)

https://spacy.io/docs/usage/adding-languages#brown-clusters
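For anyone who wants to retrain them, a rough sketch of preparing input for Percy Liang's brown-cluster tool (https://github.com/percyliang/brown-cluster), one commonly used implementation; the file names and the output path are illustrative:

```python
from gensim.corpora.wikicorpus import WikiCorpus

# wcluster expects plain text: one whitespace-tokenized sentence/document per line.
wiki = WikiCorpus("nlwiki-latest-pages-articles.xml.bz2", dictionary={})
with open("nlwiki_tokenized.txt", "w", encoding="utf-8") as out:
    for tokens in wiki.get_texts():
        out.write(" ".join(tokens) + "\n")

# Then, after cloning and building the brown-cluster tool:
#   ./wcluster --text nlwiki_tokenized.txt --c 1000 --threads 8
# The cluster bit strings end up in a "paths" file inside the output directory
# (something like nlwiki_tokenized-c1000-p1.out/).
```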

@jvdzwaan

Perhaps it is possible to use pretrained word vectors: https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md

@redevries

redevries commented May 16, 2017

Did you look at using the FastText models? They have pre-trained word vectors for 294 languages :) https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md If not, I'll have a go :)
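In case it helps whoever picks this up, a minimal sketch of pushing the `wiki.nl.vec` file from that page into a blank Dutch pipeline (API names assume spaCy v2+, i.e. `spacy.blank` and `Vocab.set_vector`; the v1 releases current at the time worked differently):

```python
import numpy as np
import spacy

nlp = spacy.blank("nl")

# wiki.nl.vec is plain text: a "<num_words> <dim>" header line,
# then one "word v1 v2 ... vD" row per word.
with open("wiki.nl.vec", encoding="utf-8") as f:
    n_words, dim = map(int, f.readline().split())
    for line in f:
        pieces = line.rstrip().split(" ")
        word, values = pieces[0], pieces[1:]
        if len(values) != dim:  # skip malformed rows
            continue
        nlp.vocab.set_vector(word, np.asarray(values, dtype="float32"))

nlp.to_disk("./nl_fasttext_vectors")
```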

@jvdzwaan

@redevries Looking forward to your pull request :) (sorry, I haven't had time yet to look at pre-trained word vectors for spaCy Dutch)

@KaterinaDK

Hello all!

Thank you for developing spaCy for parsing Dutch texts. I am using it in English and Dutch to create POS tags and to model EEG data according to POS tags and categories thereof. I have observed that the Dutch POS tags are not very good in general (see attached example).
NL-tags.pdf

I am wondering whether I have missed a bigger vocab, if one is available somewhere. Currently, the command `print len(vocab_nl.strings)` gives me 263751, which is about 1/6 of the English vocab.
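(For completeness, the comparison I ran looks roughly like this; it is just a sketch, and the model names depend on what is installed, e.g. the small `nl_core_news_sm`/`en_core_web_sm` packages on newer spaCy versions versus plain "en"/"nl" on older ones.)

```python
import spacy

# Assumes small English and Dutch models are installed locally;
# older spaCy versions loaded them as plain "en" / "nl".
nlp_nl = spacy.load("nl_core_news_sm")
nlp_en = spacy.load("en_core_web_sm")

print(len(nlp_nl.vocab.strings))  # Dutch vocab size
print(len(nlp_en.vocab.strings))  # English vocab size
```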

I'd be grateful for any help on this.

Best,
Katerina

@jvdzwaan

Hi @KaterinaDK, have you read the readme? In it we explain that the performance of the POS tagger isn't very good and make some suggestions for improving it. As mentioned there, we used a very small corpus for creating the vocabulary and Brown clusters. You are very welcome to use our code to retrain on a larger corpus; that's why we have made everything available.

@KaterinaDK

Hey @jvdzwaan, thanks for the reply. OK, I'll try to retrain. The Wikipedia dump links seem to be broken, though. Where can I get the dump files?

@twielfaert

Wikipedia makes periodic dumps; the latest one for Dutch can be found here: https://dumps.wikimedia.org/nlwiki/20171020/

@KaterinaDK

Wonderful, thank you!

@jvdzwaan

In the meantime, I found the spaCy dev resources, which provide scripts to create a vocabulary, word vectors, and Brown clusters: https://github.com/explosion/spacy-dev-resources/tree/master/vocab

The script also downloads the latest Wikipedia dump. Training the Brown clusters requires more computational resources than I currently have (I ran it for 33 hours on a 4-core laptop), but otherwise everything seems to be working okay.

There is also a script for creating a spaCy language model, but I haven't tried that one.
