Adding new multilingual model

jacobdevlin-google committed Nov 24, 2018
1 parent 1cd50d7 commit 332a687
Showing 3 changed files with 38 additions and 15 deletions.
19 changes: 18 additions & 1 deletion README.md
@@ -1,5 +1,20 @@
# BERT

**\*\*\*\*\* New November 23rd, 2018: Un-normalized multilingual model + Thai +
Mongolian \*\*\*\*\***

We uploaded a new multilingual model which does *not* perform any normalization
on the input (no lower casing, accent stripping, or Unicode normalization), and
additionally includes Thai and Mongolian.

**It is recommended to use this version for developing multilingual models,
especially for languages with non-Latin alphabets.**

This does not require any code changes, and can be downloaded here:

* **[`BERT-Base, Multilingual Cased`](https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip)**:
104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
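
For reference, here is a minimal sketch (not part of this commit) of the kind of
normalization the original `Uncased` model applies at tokenization time, and
which this new `Cased` model skips; it mirrors the accent-stripping logic in
`tokenization.py` but is shown standalone for clarity:

```python
import unicodedata

def uncased_normalize(text):
  """Lower-cases text and strips combining accents (NFD, drop 'Mn' chars)."""
  text = unicodedata.normalize("NFD", text.lower())
  return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

print(uncased_normalize(u"Montréal"))  # -> "montreal"
```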

**\*\*\*\*\* New November 15th, 2018: SOTA SQuAD 2.0 System \*\*\*\*\***

We released code changes to reproduce our 83% F1 SQuAD 2.0 system, which is
@@ -207,7 +222,9 @@ The links to the models are here (right-click, 'Save link as...' on the name):
12-layer, 768-hidden, 12-heads, 110M parameters
* **`BERT-Large, Cased`**: 24-layer, 1024-hidden, 16-heads, 340M parameters
(Not available yet. Needs to be re-generated).
* **[`BERT-Base, Multilingual`](https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip)**:
* **[`BERT-Base, Multilingual Cased (New, recommended)`](https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip)**:
104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
* **[`BERT-Base, Multilingual Uncased (Orig, not recommended)`](https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip)**:
102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
* **[`BERT-Base, Chinese`](https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip)**:
Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M
32 changes: 19 additions & 13 deletions multilingual.md
@@ -4,12 +4,20 @@ There are two multilingual models currently available. We do not plan to release
more single-language models, but we may release `BERT-Large` versions of these
two in the future:

* **[`BERT-Base, Multilingual`](https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip)**:
* **[`BERT-Base, Multilingual Cased (New, recommended)`](https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip)**:
104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
* **[`BERT-Base, Multilingual Uncased (Orig, not recommended)`](https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip)**:
102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
* **[`BERT-Base, Chinese`](https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip)**:
Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M
parameters

**The `Multilingual Cased (New)` model also fixes normalization issues in many
languages, so it is recommended for languages with non-Latin alphabets (and is
often better for most languages with Latin alphabets). When using this model,
make sure to pass `--do_lower_case=false` to `run_pretraining.py` and other
scripts.**
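
The same setting applies when calling the tokenizer directly from Python. A
sketch (the vocab path below is a placeholder for wherever the model was
unzipped):

```python
import tokenization  # this repository's tokenizer module

tokenizer = tokenization.FullTokenizer(
    vocab_file="multi_cased_L-12_H-768_A-12/vocab.txt",  # placeholder path
    do_lower_case=False)  # must match --do_lower_case=false on the CLI

print(tokenizer.tokenize(u"Montréal"))  # case and accents are preserved
```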

See the [list of languages](#list-of-languages) that the Multilingual model
supports. The Multilingual model does include Chinese (and English), but if your
fine-tuning data is Chinese-only, then the Chinese model will likely produce
@@ -26,13 +34,14 @@ XNLI, not Google NMT). For clarity, we only report on 6 languages below:

<!-- mdformat off(no table) -->

| System | English | Chinese | Spanish | German | Arabic | Urdu |
| ------------------------------- | -------- | -------- | -------- | -------- | -------- | -------- |
| XNLI Baseline - Translate Train | 73.7 | 67.0 | 68.8 | 66.5 | 65.8 | 56.6 |
| XNLI Baseline - Translate Test | 73.7 | 68.3 | 70.7 | 68.7 | 66.8 | 59.3 |
| BERT - Translate Train | **81.4** | **74.2** | **77.3** | **75.2** | **70.5** | 61.7 |
| BERT - Translate Test | 81.4 | 70.1 | 74.9 | 74.4 | 70.4 | **62.1** |
| BERT - Zero Shot | 81.4 | 63.8 | 74.3 | 70.5 | 62.1 | 58.3 |
| System | English | Chinese | Spanish | German | Arabic | Urdu |
| --------------------------------- | -------- | -------- | -------- | -------- | -------- | -------- |
| XNLI Baseline - Translate Train | 73.7 | 67.0 | 68.8 | 66.5 | 65.8 | 56.6 |
| XNLI Baseline - Translate Test | 73.7 | 68.3 | 70.7 | 68.7 | 66.8 | 59.3 |
| BERT - Translate Train Cased | **81.9** | **76.6** | **77.8** | **75.9** | **70.7** | 61.6 |
| BERT - Translate Train Uncased | 81.4 | 74.2 | 77.3 | 75.2 | 70.5 | 61.7 |
| BERT - Translate Test Uncased | 81.4 | 70.1 | 74.9 | 74.4 | 70.4 | **62.1** |
| BERT - Zero Shot Uncased | 81.4 | 63.8 | 74.3 | 70.5 | 62.1 | 58.3 |

<!-- mdformat on -->

@@ -292,8 +301,5 @@ chosen because they are the top 100 languages with the largest Wikipedias:
* Western Punjabi
* Yoruba

The only language which we had to unfortunately exclude was Thai, since it is
the only language (other than Chinese) that does not use whitespace to delimit
words, and it has too many characters-per-word to use character-based
tokenization. Our WordPiece algorithm is quadratic with respect to the size of
the input token so very long character strings do not work with it.
The **Multilingual Cased (New)** release additionally contains **Thai** and
**Mongolian**, which were not included in the original release.
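
To make the removed caveat above concrete: greedy longest-match-first
WordPiece tries every suffix length at each position, so a single unsegmented
"word" of n characters can cost O(n^2) vocabulary lookups. A standalone
sketch of that matching loop (simplified from `tokenization.py`, with the
vocabulary as a plain set):

```python
def greedy_wordpiece(word, vocab, unk_token="[UNK]"):
  """Greedy longest-match-first WordPiece; worst case O(len(word)**2)."""
  tokens, start = [], 0
  while start < len(word):
    end = len(word)
    cur = None
    while start < end:  # shrink the candidate until it is in the vocab
      piece = word[start:end] if start == 0 else "##" + word[start:end]
      if piece in vocab:
        cur = piece
        break
      end -= 1
    if cur is None:  # no piece matched: the whole word becomes unknown
      return [unk_token]
    tokens.append(cur)
    start = end
  return tokens

print(greedy_wordpiece("unaffable", {"un", "##aff", "##able"}))
# -> ['un', '##aff', '##able']
```
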
2 changes: 1 addition & 1 deletion tokenization.py
@@ -249,7 +249,7 @@ def _clean_text(self, text):
class WordpieceTokenizer(object):
  """Runs WordPiece tokenization."""

  def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100):
  def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200):
    self.vocab = vocab
    self.unk_token = unk_token
    self.max_input_chars_per_word = max_input_chars_per_word
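
Where this constant bites (paraphrased from `WordpieceTokenizer.tokenize` in
this file, with the matching loop stubbed out): any whitespace-delimited token
longer than the cap is mapped straight to the unknown token, so raising the
limit from 100 to 200 lets long unsegmented Thai strings reach WordPiece at
all, while still bounding the quadratic matching cost.

```python
def tokenize_one(token, vocab, unk_token="[UNK]", max_input_chars_per_word=200):
  """Guard from WordpieceTokenizer.tokenize; matching loop stubbed out."""
  if len(list(token)) > max_input_chars_per_word:
    return [unk_token]  # too long: skip the quadratic matching entirely
  # ... greedy longest-match-first matching would run here; stand-in below:
  return [token] if token in vocab else [unk_token]
```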
