Using all of the Dutch wikipedia #1
@kootenpv Thank you for your suggestion! It is correct that the spaCy Dutch resources come from this repository. At the moment I'm still struggling with learning Brown clusters from the Common Crawl data. When that is finished, I'll have a look at the next task...
I haven't really seen Brown clusters being used by spaCy. What are they needed for? If it is about creating a vocab, maybe the above guide could help to create a (pruned) vocab and associated vectors (relying heavily on gensim and the whole of the Dutch Wikipedia data)? In my case it resulted in a model with 883,089 words. Or is there a goal beyond vocab + vectors, specifically for POS/dependency tagging?
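The pruning step mentioned above can be sketched in plain Python. This is a hypothetical illustration (the function name, thresholds, and toy counts are mine, not the guide's code): keep only the most frequent words that clear a minimum count, which is essentially what capping a gensim vocabulary does.

```python
# Hypothetical sketch: prune a word-frequency table to a fixed-size vocab,
# as one might after counting tokens over a Wikipedia dump.
from collections import Counter

def prune_vocab(counts, max_size, min_count=5):
    """Keep the max_size most frequent words occurring at least min_count times."""
    frequent = [(w, c) for w, c in counts.items() if c >= min_count]
    # Most frequent first; break ties alphabetically for determinism.
    frequent.sort(key=lambda wc: (-wc[1], wc[0]))
    return dict(frequent[:max_size])

counts = Counter({"de": 100, "het": 80, "fiets": 6, "zeldzaam": 2})
vocab = prune_vocab(counts, max_size=3)
```

Here `"zeldzaam"` falls below `min_count` and is dropped before the size cap is even applied.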
Brown clusters are used as features for training the POS tagger. Apparently, it helps. Performance of the Dutch POS tagger has room for improvement :)
Perhaps it is possible to use pretrained word vectors: https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md |
Did you look at using the FastText models? They have pre-trained word vectors for 294 languages :) https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md If not, I'll have a go :) |
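For reference, the pre-trained `.vec` files linked above are plain text: a header line with the word count and dimensionality, then one word followed by its floats per line. A minimal parser might look like this (reading from an in-memory sample here; a real run would open e.g. the downloaded Dutch file instead):

```python
# Sketch of a reader for the fastText .vec text format:
# first line "<n_words> <dim>", then "<word> <f1> <f2> ... <f_dim>" per line.
import io

def read_vec(stream):
    n_words, dim = map(int, stream.readline().split())
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors

sample = io.StringIO("2 3\nfiets 0.1 0.2 0.3\nkat 0.4 0.5 0.6\n")
dim, vectors = read_vec(sample)
```

The resulting `word -> list[float]` mapping can then be loaded into whatever vocab/vectors pipeline you are using.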
@redevries Looking forward to your pull request :) (sorry, I haven't had time yet to look at pre-trained word vectors for spaCy Dutch) |
Hello all! Thank you for developing spaCy for parsing Dutch texts. I am using it in English and Dutch to create POS tags and to model EEG data according to POS tags and categories thereof. I have observed that the Dutch POS tags are not very good in general (see the attached example), and I am wondering whether I have missed including a bigger vocab, if one is available somewhere. The command I am currently using is included in the attachment. I'd be grateful for any help on this. Best,
Hi @KaterinaDK, have you read the readme? In it we explain that the performance of the POS tagger isn't very good, and we make some suggestions for improving it. As mentioned there, we used a very small corpus for creating the vocabulary and Brown clusters. You are very welcome to use our code to retrain on a larger corpus; that's why we have made everything available.
Hey @jvdzwaan, thanks for the reply. OK, I'll try to retrain. The links to the Wikipedia dumps seem to be broken, though. Where can I get the dump files from?
Wikipedia makes periodic dumps, the latest for Dutch can be found here: https://dumps.wikimedia.org/nlwiki/20171020/ |
Wonderful, thank you! |
In the meantime, I found the spaCy dev resources, which provide a script to create a vocabulary, word vectors, and Brown clusters: https://github.com/explosion/spacy-dev-resources/tree/master/vocab The script also downloads the latest Wikipedia dump. Training the Brown clusters requires more computational resources than I currently have (I ran it for 33 hours on a 4-core laptop), but everything else seems to be working okay. There is also a script for creating a spaCy language model, but I haven't tried that one.
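If the clustering run does finish, the standard `paths` output of Liang's brown-cluster tool is a tab-separated file with one line per word: bit-path, word, count. A small sketch for reading it back (the prefix length is an arbitrary choice here, and the sample lines are invented):

```python
# Sketch: map each word to a prefix of its Brown-cluster bit path,
# assuming the usual "<bit-path>\t<word>\t<count>" paths format.
import io

def read_clusters(stream, prefix_len=4):
    clusters = {}
    for line in stream:
        path, word, _count = line.rstrip("\n").split("\t")
        clusters[word] = path[:prefix_len]
    return clusters

sample = io.StringIO("0010\tfiets\t17\n0011\tauto\t23\n")
clusters = read_clusters(sample)
```

Using bit-path prefixes of a few different lengths is a common way to turn the cluster hierarchy into tagger features.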
I believe that, using this guide, it is possible to use the whole of Wikipedia for spaCy. Just changing "en" to "nl" works:
https://dumps.wikimedia.org/nlwiki/latest/nlwiki-latest-pages-articles.xml.bz2
I was able to train the word embeddings in a few hours.
Is it correct that spaCy's Dutch support resulted from this repository? If so, great work :)
Can't wait to see the POS tagging / dependency parsing included as well!