Adding new multilingual model

jacobdevlin-google committed Nov 24, 2018
1 parent 1cd50d7 commit 332a687
Showing 3 changed files with 38 additions and 15 deletions.
19 changes: 18 additions & 1 deletion README.md
@@ -1,5 +1,20 @@
# BERT

**\*\*\*\*\* New November 23rd, 2018: Un-normalized multilingual model + Thai +
Mongolian \*\*\*\*\***

We uploaded a new multilingual model which does *not* perform any normalization
on the input (no lower casing, accent stripping, or Unicode normalization), and
additionally includes Thai and Mongolian.

**It is recommended to use this version for developing multilingual models,
especially for languages with non-Latin alphabets.**

This does not require any code changes, and can be downloaded here:

* **[`BERT-Base, Multilingual Cased`](https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip)**:
104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
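
For reference, here is a minimal sketch (not part of this commit) of the kind of
normalization the original `Uncased` model applies at tokenization time, and
which this new `Cased` model skips; it mirrors the accent-stripping logic in
`tokenization.py` but is shown standalone for clarity:

```python
import unicodedata

def uncased_normalize(text):
  """Lower-cases text and strips combining accents (NFD, drop 'Mn' chars)."""
  text = unicodedata.normalize("NFD", text.lower())
  return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

print(uncased_normalize(u"Montréal"))  # -> "montreal"
```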

**\*\*\*\*\* New November 15th, 2018: SOTA SQuAD 2.0 System \*\*\*\*\***

We released code changes to reproduce our 83% F1 SQuAD 2.0 system, which is
@@ -207,7 +222,9 @@ The links to the models are here (right-click, 'Save link as...' on the name):
12-layer, 768-hidden, 12-heads, 110M parameters
* **`BERT-Large, Cased`**: 24-layer, 1024-hidden, 16-heads, 340M parameters
(Not available yet. Needs to be re-generated).
* **[`BERT-Base, Multilingual`](https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip)**:
* **[`BERT-Base, Multilingual Cased (New, recommended)`](https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip)**:
104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
* **[`BERT-Base, Multilingual Uncased (Orig, not recommended)`](https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip)**:
102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
* **[`BERT-Base, Chinese`](https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip)**:
Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M
32 changes: 19 additions & 13 deletions multilingual.md
@@ -4,12 +4,20 @@ There are two multilingual models currently available. We do not plan to release
more single-language models, but we may release `BERT-Large` versions of these
two in the future:

* **[`BERT-Base, Multilingual`](https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip)**:
* **[`BERT-Base, Multilingual Cased (New, recommended)`](https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip)**:
104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
* **[`BERT-Base, Multilingual Uncased (Orig, not recommended)`](https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip)**:
102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
* **[`BERT-Base, Chinese`](https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip)**:
Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M
parameters

**The `Multilingual Cased (New)` model also fixes normalization issues in many
languages, so it is recommended for languages with non-Latin alphabets (and is
often better for most languages with Latin alphabets). When using this model,
make sure to pass `--do_lower_case=false` to `run_pretraining.py` and other
scripts.**
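
The same setting applies when calling the tokenizer directly from Python. A
sketch (the vocab path below is a placeholder for wherever the model was
unzipped):

```python
import tokenization  # this repository's tokenizer module

tokenizer = tokenization.FullTokenizer(
    vocab_file="multi_cased_L-12_H-768_A-12/vocab.txt",  # placeholder path
    do_lower_case=False)  # must match --do_lower_case=false on the CLI

print(tokenizer.tokenize(u"Montréal"))  # case and accents are preserved
```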

See the [list of languages](#list-of-languages) that the Multilingual model
supports. The Multilingual model does include Chinese (and English), but if your
fine-tuning data is Chinese-only, then the Chinese model will likely produce
@@ -26,13 +34,14 @@ XNLI, not Google NMT). For clarity, we only report on 6 languages below:

<!-- mdformat off(no table) -->

| System | English | Chinese | Spanish | German | Arabic | Urdu |
| ------------------------------- | -------- | -------- | -------- | -------- | -------- | -------- |
| XNLI Baseline - Translate Train | 73.7 | 67.0 | 68.8 | 66.5 | 65.8 | 56.6 |
| XNLI Baseline - Translate Test | 73.7 | 68.3 | 70.7 | 68.7 | 66.8 | 59.3 |
| BERT - Translate Train | **81.4** | **74.2** | **77.3** | **75.2** | **70.5** | 61.7 |
| BERT - Translate Test | 81.4 | 70.1 | 74.9 | 74.4 | 70.4 | **62.1** |
| BERT - Zero Shot | 81.4 | 63.8 | 74.3 | 70.5 | 62.1 | 58.3 |
| System | English | Chinese | Spanish | German | Arabic | Urdu |
| --------------------------------- | -------- | -------- | -------- | -------- | -------- | -------- |
| XNLI Baseline - Translate Train | 73.7 | 67.0 | 68.8 | 66.5 | 65.8 | 56.6 |
| XNLI Baseline - Translate Test | 73.7 | 68.3 | 70.7 | 68.7 | 66.8 | 59.3 |
| BERT - Translate Train Cased | **81.9** | **76.6** | **77.8** | **75.9** | **70.7** | 61.6 |
| BERT - Translate Train Uncased | 81.4 | 74.2 | 77.3 | 75.2 | 70.5 | 61.7 |
| BERT - Translate Test Uncased | 81.4 | 70.1 | 74.9 | 74.4 | 70.4 | **62.1** |
| BERT - Zero Shot Uncased | 81.4 | 63.8 | 74.3 | 70.5 | 62.1 | 58.3 |

<!-- mdformat on -->

@@ -292,8 +301,5 @@ chosen because they are the top 100 languages with the largest Wikipedias:
* Western Punjabi
* Yoruba

The only language which we had to unfortunately exclude was Thai, since it is
the only language (other than Chinese) that does not use whitespace to delimit
words, and it has too many characters-per-word to use character-based
tokenization. Our WordPiece algorithm is quadratic with respect to the size of
the input token so very long character strings do not work with it.
The **Multilingual Cased (New)** release additionally contains **Thai** and
**Mongolian**, which were not included in the original release.
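
To make the removed caveat above concrete: greedy longest-match-first
WordPiece tries every suffix length at each position, so a single unsegmented
"word" of n characters can cost O(n^2) vocabulary lookups. A standalone
sketch of that matching loop (simplified from `tokenization.py`, with the
vocabulary as a plain set):

```python
def greedy_wordpiece(word, vocab, unk_token="[UNK]"):
  """Greedy longest-match-first WordPiece; worst case O(len(word)**2)."""
  tokens, start = [], 0
  while start < len(word):
    end = len(word)
    cur = None
    while start < end:  # shrink the candidate until it is in the vocab
      piece = word[start:end] if start == 0 else "##" + word[start:end]
      if piece in vocab:
        cur = piece
        break
      end -= 1
    if cur is None:  # no piece matched: the whole word becomes unknown
      return [unk_token]
    tokens.append(cur)
    start = end
  return tokens

print(greedy_wordpiece("unaffable", {"un", "##aff", "##able"}))
# -> ['un', '##aff', '##able']
```
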
2 changes: 1 addition & 1 deletion tokenization.py
@@ -249,7 +249,7 @@ def _clean_text(self, text):
class WordpieceTokenizer(object):
  """Runs WordPiece tokenization."""

  def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100):
  def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200):
    self.vocab = vocab
    self.unk_token = unk_token
    self.max_input_chars_per_word = max_input_chars_per_word
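
Where this constant bites (paraphrased from `WordpieceTokenizer.tokenize` in
this file, with the matching loop stubbed out): any whitespace-delimited token
longer than the cap is mapped straight to the unknown token, so raising the
limit from 100 to 200 lets long unsegmented Thai strings reach WordPiece at
all, while still bounding the quadratic matching cost.

```python
def tokenize_one(token, vocab, unk_token="[UNK]", max_input_chars_per_word=200):
  """Guard from WordpieceTokenizer.tokenize; matching loop stubbed out."""
  if len(list(token)) > max_input_chars_per_word:
    return [unk_token]  # too long: skip the quadratic matching entirely
  # ... greedy longest-match-first matching would run here; stand-in below:
  return [token] if token in vocab else [unk_token]
```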
