add norwegian model
mollerhoj committed Mar 13, 2019
1 parent 55063e1 commit 3224830
Showing 5 changed files with 19 additions and 5 deletions.
21 changes: 16 additions & 5 deletions README.md
@@ -1,22 +1,33 @@
# Danish ULMFiT
# Scandinavian ULMFiT


Inductive transfer learning has greatly impacted computer vision, but existing approaches in NLP still require task-specific modifications and training from scratch.

This repository contains the weights for the embedding layer of a ULMFiT language model that can be used as the first step in fine-tuning a model for any Natural Language Processing task.

The weights were trained on 90% of all text in the Danish Wikipedia as per 3. July 2018. The remaining 10% was used for validation.
The weights were trained on 90% of all text in the corresponding language's Wikipedia as of July 3, 2018. The remaining 10% was used for validation.

# Supported Languages:

- Danish

Trained on 78,373,122 tokens, and validated on 7,837,310 tokens. We achieve a perplexity of 30.9.

- Norwegian

Trained on 80,284,231 tokens, and validated on 8,920,387 tokens. We achieve a perplexity of 26.31.

We achieve a perplexity of 30.9 on the validation data.
Training higher-performance models is possible, but requires more (costly) training time. If you need a model with higher performance, feel free to contact us.
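
For readers comparing the reported numbers: perplexity is conventionally the exponential of the mean per-token negative log-likelihood, so the perplexities above can be converted back to a validation loss. The sketch below assumes the loss is measured in nats (natural logarithm), which the README does not state; the function names are illustrative only.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token).

    Assumes the per-token losses are in nats (natural log).
    """
    return math.exp(sum(token_nlls) / len(token_nlls))


def loss_from_perplexity(ppl):
    """Inverse of the above: mean per-token loss implied by a perplexity."""
    return math.log(ppl)


# Perplexities reported in this README.
danish_loss = loss_from_perplexity(30.9)      # roughly 3.43 nats per token
norwegian_loss = loss_from_perplexity(26.31)  # roughly 3.27 nats per token

print(f"Danish:    {danish_loss:.2f}")
print(f"Norwegian: {norwegian_loss:.2f}")
```

A lower perplexity means the model assigns higher probability to the held-out text. The two figures come from different validation sets and vocabularies, so they are not directly comparable across languages.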

### Paper

See Universal Language Model Fine-tuning for Text Classification, Jeremy Howard, Sebastian Ruder, https://arxiv.org/abs/1801.06146

### File descriptions

- dawt.h5 (Danish WikiText) contains the weights in 'Hierarchical Data Format'
- enc.h5 contains the weights in 'Hierarchical Data Format'

- dawt.pth (Danish WikiText) contains the weights in 'Pytorch model format'
- enc.pth contains the weights in 'PyTorch model format'

- itos.pkl (Integers to Strings) contains the vocabulary mapping from ids (0 - 30000) to strings
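
As a rough illustration of how the vocabulary file might be used, the sketch below loads an integers-to-strings list and builds the reverse mapping. It assumes `itos.pkl` is a pickled Python list of token strings with the unknown-token placeholder at index 0; neither detail is confirmed by this README. The toy file, the `_unk_` token, and the `load_vocab` helper are hypothetical stand-ins so the example runs without the real file.

```python
import pickle
import tempfile
from collections import defaultdict
from pathlib import Path


def load_vocab(path):
    """Load an integers-to-strings list and build the reverse mapping.

    Assumption: the file is a pickled list of token strings, and id 0
    is the unknown-token placeholder, so unseen tokens map to 0.
    """
    with open(path, "rb") as f:
        itos = pickle.load(f)
    stoi = defaultdict(int, {token: idx for idx, token in enumerate(itos)})
    return itos, stoi


# Toy stand-in for itos.pkl (hypothetical contents).
toy_path = Path(tempfile.mkdtemp()) / "itos.pkl"
with open(toy_path, "wb") as f:
    pickle.dump(["_unk_", "og", "er", "det"], f)

itos, stoi = load_vocab(toy_path)
ids = [stoi[token] for token in ["det", "er", "ukendt"]]
print(ids)
```

Unpickling executes arbitrary code, so only load `.pkl` files from a source you trust.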

File renamed without changes.
File renamed without changes.
File renamed without changes.
3 changes: 3 additions & 0 deletions norwegian_enc.h5
Git LFS file not shown
