
Using all of the Dutch wikipedia #1

Open
kootenpv opened this issue Mar 12, 2017 · 12 comments

Comments

@kootenpv

I believe that, using this guide, it is possible to use the whole of the Dutch Wikipedia for spaCy?

Just changing "en" to "nl" works.

https://dumps.wikimedia.org/nlwiki/latest/nlwiki-latest-pages-articles.xml.bz2

I was able to train the word embeddings in a few hours.
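For reference, a rough sketch of how such a run can look with gensim (a sketch only: the dump file name matches the link above, and the parameter names assume gensim 4, where older versions use `size` instead of `vector_size`):

```python
from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models import Word2Vec

# Stream tokenized article texts straight from the compressed dump;
# dictionary={} skips building a Dictionary we don't need here.
wiki = WikiCorpus("nlwiki-latest-pages-articles.xml.bz2", dictionary={})
sentences = list(wiki.get_texts())

# Skip-gram embeddings; min_count prunes rare words from the vocabulary.
model = Word2Vec(sentences, vector_size=300, window=5,
                 min_count=5, sg=1, workers=4)
model.save("nlwiki_word2vec.model")
```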

Is it correct that spaCy's Dutch support resulted from this repository? If so, great work :)
Can't wait to see the POS tagging / dependency parsing included as well!

@jvdzwaan

@kootenpv Thank you for your suggestion! It is correct that the spaCy Dutch resources come from this repository.

At the moment I'm still struggling with learning Brown clusters from the Common Crawl data. When that is finished, I'll have a look at the next task...

@kootenpv

kootenpv commented Mar 20, 2017

I haven't really seen Brown clusters being used by spaCy. What are they needed for?

If it is about creating a vocab, maybe the above guide could help to create a (pruned) vocab and associated vectors (heavily relying on gensim and the whole of the Dutch Wikipedia data)? In my case it resulted in a model with 883089 words.

Or is there a goal beyond vocab+vectors specifically for POS/dependency tagging?
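In case it helps, a rough sketch of getting such gensim vectors into a spaCy pipeline (a sketch only; the `spacy init vectors` command belongs to the current v3 CLI, and older spaCy versions used a different workflow such as `init-model`):

```python
from gensim.models import Word2Vec

# Export the trained vectors in the plain word2vec text format,
# which spaCy's vector-import command can read directly.
model = Word2Vec.load("nlwiki_word2vec.model")
model.wv.save_word2vec_format("nlwiki_vectors.txt")

# Then, from the shell (spaCy v3 CLI):
#   python -m spacy init vectors nl nlwiki_vectors.txt ./nl_vectors_model
```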

@jvdzwaan

Brown clusters are used as features for training the POS tagger. Apparently, they help. The performance of the Dutch POS tagger has room for improvement :)

https://spacy.io/docs/usage/adding-languages#brown-clusters
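For anyone who wants to retrain them, a rough sketch of preparing input for Percy Liang's brown-cluster tool (https://github.com/percyliang/brown-cluster), one commonly used implementation; the file names and the output path are illustrative:

```python
from gensim.corpora.wikicorpus import WikiCorpus

# wcluster expects plain text: one whitespace-tokenized sentence/document per line.
wiki = WikiCorpus("nlwiki-latest-pages-articles.xml.bz2", dictionary={})
with open("nlwiki_tokenized.txt", "w", encoding="utf-8") as out:
    for tokens in wiki.get_texts():
        out.write(" ".join(tokens) + "\n")

# Then, after cloning and building the brown-cluster tool:
#   ./wcluster --text nlwiki_tokenized.txt --c 1000 --threads 8
# The cluster bit strings end up in a "paths" file inside the output directory
# (something like nlwiki_tokenized-c1000-p1.out/).
```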

@jvdzwaan

Perhaps it is possible to use pretrained word vectors: https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md

@redevries

redevries commented May 16, 2017

Did you look at using the FastText models? They have pre-trained word vectors for 294 languages :) https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md If not, I'll have a go :)
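In case it helps whoever picks this up, a minimal sketch of pushing the `wiki.nl.vec` file from that page into a blank Dutch pipeline (API names assume spaCy v2+, i.e. `spacy.blank` and `Vocab.set_vector`; the v1 releases current at the time worked differently):

```python
import numpy as np
import spacy

nlp = spacy.blank("nl")

# wiki.nl.vec is plain text: a "<num_words> <dim>" header line,
# then one "word v1 v2 ... vD" row per word.
with open("wiki.nl.vec", encoding="utf-8") as f:
    n_words, dim = map(int, f.readline().split())
    for line in f:
        pieces = line.rstrip().split(" ")
        word, values = pieces[0], pieces[1:]
        if len(values) != dim:  # skip malformed rows
            continue
        nlp.vocab.set_vector(word, np.asarray(values, dtype="float32"))

nlp.to_disk("./nl_fasttext_vectors")
```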

@jvdzwaan

@redevries Looking forward to your pull request :) (sorry, I haven't had time yet to look at pre-trained word vectors for spaCy Dutch)

@KaterinaDK

Hello all!

Thank you for developing spaCy for parsing Dutch texts. I am using it in English and Dutch to create POS tags and to model EEG data according to POS tags and categories thereof. I have observed that the Dutch POS tags are not very good in general (see attached example).
NL-tags.pdf

I am wondering whether I have missed a bigger vocab, if one is available somewhere. Currently, the command `print len(vocab_nl.strings)` gives me 263751, which is about 1/6 of the English vocab.
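(For completeness, the comparison I ran looks roughly like this; it is just a sketch, and the model names depend on what is installed, e.g. the small `nl_core_news_sm`/`en_core_web_sm` packages on newer spaCy versions versus plain "en"/"nl" on older ones.)

```python
import spacy

# Assumes small English and Dutch models are installed locally;
# older spaCy versions loaded them as plain "en" / "nl".
nlp_nl = spacy.load("nl_core_news_sm")
nlp_en = spacy.load("en_core_web_sm")

print(len(nlp_nl.vocab.strings))  # Dutch vocab size
print(len(nlp_en.vocab.strings))  # English vocab size
```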

I'd be grateful for any help on this.

Best,
Katerina

@jvdzwaan

Hi @KaterinaDK, have you read the readme? In it we explain that the performance of the POS tagger isn't very good and make some suggestions for improving it. As mentioned there, we used a very small corpus for creating the vocabulary and Brown clusters. You are very welcome to use our code to retrain on a larger corpus; that's why we have made everything available.

@KaterinaDK

Hey @jvdzwaan, thanks for the reply. OK, I'll try to retrain. The Wikipedia dump links seem to be broken, though. Where can I get the dump files?

@twielfaert

Wikipedia makes periodic dumps; the latest one for Dutch can be found here: https://dumps.wikimedia.org/nlwiki/20171020/

@KaterinaDK

Wonderful, thank you!

@jvdzwaan

In the meantime, I found the spaCy dev resources, which provide scripts to create a vocabulary, word vectors, and Brown clusters: https://github.com/explosion/spacy-dev-resources/tree/master/vocab

The script also downloads the latest Wikipedia dump. Training the Brown clusters requires more computational resources than I currently have (I ran it for 33 hours on a 4-core laptop), but otherwise everything seems to be working okay.

There is also a script for creating a spaCy language model, but I haven't tried that one.
