Commit 337f684: xlmr
zdou0830 committed Aug 11, 2021 (1 parent: 1e0543f)
Showing 19 changed files with 7,049 additions and 2,912 deletions.
19 changes: 9 additions & 10 deletions README.md
@@ -1,6 +1,6 @@
## AWESOME: Aligning Word Embedding Spaces of Multilingual Encoders

`awesome-align` is a tool that can extract word alignments from multilingual BERT (mBERT) [[Demo]](https://colab.research.google.com/drive/1205ubqebM0OsZa1nRgbGJBtitgHqIVv6?usp=sharing) and allows you to fine-tune mBERT on parallel corpora for better alignment quality (see [our paper](https://arxiv.org/abs/2101.08231) for more details).
This is the XLM-R version of `awesome-align`.

### Dependencies

@@ -17,11 +17,11 @@ Inputs should be *tokenized* and each line is a source language sentence and its

### Extracting alignments

Here is an example of extracting word alignments from multilingual BERT:
Here is an example of extracting word alignments from XLM-R:

```bash
DATA_FILE=/path/to/data/file
MODEL_NAME_OR_PATH=bert-base-multilingual-cased
MODEL_NAME_OR_PATH=xlm-roberta-base
OUTPUT_FILE=/path/to/output/file

CUDA_VISIBLE_DEVICES=0 awesome-align \
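# (hunk collapsed in this diff view; per the upstream awesome-align README
# the extraction command continues roughly as below -- treat the exact flag
# set as an assumption and check `awesome-align --help`)
#   --output_file=$OUTPUT_FILE \
#   --model_name_or_path=$MODEL_NAME_OR_PATH \
#   --data_file=$DATA_FILE \
#   --extraction 'softmax' \
#   --batch_size 32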
@@ -42,7 +42,7 @@ You can also set `MODEL_NAME_OR_PATH` to the path of your fine-tuned model as sh

If there is parallel data available, you can fine-tune embedding models on that data.

Here is an example of fine-tuning mBERT that balances well between efficiency and effectiveness:
Here is an example of fine-tuning XLM-R:

```bash
TRAIN_FILE=/path/to/train/file
@@ -51,7 +51,7 @@ OUTPUT_DIR=/path/to/output/directory

CUDA_VISIBLE_DEVICES=0 awesome-train \
--output_dir=$OUTPUT_DIR \
--model_name_or_path=bert-base-multilingual-cased \
--model_name_or_path=xlm-roberta-base \
--extraction 'softmax' \
--do_train \
--train_tlm \
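# (remaining flags collapsed in this diff view; per the upstream awesome-align
# README the command continues roughly as below -- these flag names are an
# assumption quoted from that README, so verify with `awesome-train --help`)
#   --train_data_file=$TRAIN_FILE \
#   --per_gpu_train_batch_size 2 \
#   --gradient_accumulation_steps 4 \
#   --num_train_epochs 1 \
#   --learning_rate 2e-5 \
#   --save_steps 4000 \
#   --max_steps 20000 \
#   --do_eval \
#   --eval_data_file=$EVAL_FILE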
@@ -76,7 +76,7 @@ OUTPUT_DIR=/path/to/output/directory

CUDA_VISIBLE_DEVICES=0 awesome-train \
--output_dir=$OUTPUT_DIR \
--model_name_or_path=bert-base-multilingual-cased \
--model_name_or_path=xlm-roberta-base \
--extraction 'softmax' \
--do_train \
--train_mlm \
@@ -108,7 +108,7 @@ OUTPUT_DIR=/path/to/output/directory

CUDA_VISIBLE_DEVICES=0 awesome-train \
--output_dir=$OUTPUT_DIR \
--model_name_or_path=bert-base-multilingual-cased \
--model_name_or_path=xlm-roberta-base \
--extraction 'softmax' \
--do_train \
--train_so \
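# (the --train_mlm and --train_so variants above are collapsed the same way
# and take the same remaining data/eval flags; see the TLM sketch above)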
@@ -132,9 +132,8 @@ The following table shows the alignment error rates (AERs) of our models and pop
| [fast\_align](https://github.com/clab/fast_align) | 27.0 | 10.5 | 32.1 | 51.1 | 38.1 |
| [eflomal](https://github.com/robertostling/eflomal) | 22.6 | 8.2 | 25.1 | 47.5 | 28.7 |
| [Mgiza](https://github.com/moses-smt/mgiza) | 20.6 | 5.9 | 26.4 | 48.0 | 35.1 |
| Ours (w/o fine-tuning, softmax) | 17.4 | 5.6 | 27.9 | 45.6 | 18.1 |
| Ours (multilingually fine-tuned <br/> w/o `--train_co`, softmax) [[Download]](https://drive.google.com/file/d/1IcQx6t5qtv4bdcGjjVCwXnRkpr67eisJ/view?usp=sharing) | 15.2 | **4.1** | 22.6 | **37.4** | **13.4** |
| Ours (multilingually fine-tuned <br/> w/ `--train_co`, softmax) [[Download]](https://drive.google.com/file/d/1IluQED1jb0rjITJtyj4lNMPmaRFyMslg/view?usp=sharing) | **15.1** | 4.5 | **20.7** | 38.4 | 14.5 |
| Ours-mBERT (w/o fine-tuning, softmax) | 17.4 | 5.6 | 27.9 | 45.6 | 18.1 |
| Ours-XLM-R (w/o fine-tuning, softmax) | 23.1 | 9.2 | 28.6 | 62.0 | 30.7 |
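
For reference, AER scores a predicted alignment set A against sure links S and possible links P (S ⊆ P) as AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|); lower is better.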


### Citation
50 changes: 22 additions & 28 deletions awesome_align/configuration_bert.py
@@ -17,6 +17,7 @@


import logging

from awesome_align.configuration_utils import PretrainedConfig


@@ -38,13 +39,14 @@
"bert-base-cased-finetuned-mrpc": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-finetuned-mrpc-config.json",
"bert-base-german-dbmdz-cased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-cased-config.json",
"bert-base-german-dbmdz-uncased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-uncased-config.json",
"bert-base-japanese": "https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese-config.json",
"bert-base-japanese-whole-word-masking": "https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese-whole-word-masking-config.json",
"bert-base-japanese-char": "https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese-char-config.json",
"bert-base-japanese-char-whole-word-masking": "https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese-char-whole-word-masking-config.json",
"bert-base-finnish-cased-v1": "https://s3.amazonaws.com/models.huggingface.co/bert/TurkuNLP/bert-base-finnish-cased-v1/config.json",
"bert-base-finnish-uncased-v1": "https://s3.amazonaws.com/models.huggingface.co/bert/TurkuNLP/bert-base-finnish-uncased-v1/config.json",
"bert-base-dutch-cased": "https://s3.amazonaws.com/models.huggingface.co/bert/wietsedv/bert-base-dutch-cased/config.json",
"cl-tohoku/bert-base-japanese": "https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese/config.json",
"cl-tohoku/bert-base-japanese-whole-word-masking": "https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese-whole-word-masking/config.json",
"cl-tohoku/bert-base-japanese-char": "https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese-char/config.json",
"cl-tohoku/bert-base-japanese-char-whole-word-masking": "https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese-char-whole-word-masking/config.json",
"TurkuNLP/bert-base-finnish-cased-v1": "https://s3.amazonaws.com/models.huggingface.co/bert/TurkuNLP/bert-base-finnish-cased-v1/config.json",
"TurkuNLP/bert-base-finnish-uncased-v1": "https://s3.amazonaws.com/models.huggingface.co/bert/TurkuNLP/bert-base-finnish-uncased-v1/config.json",
"wietsedv/bert-base-dutch-cased": "https://s3.amazonaws.com/models.huggingface.co/bert/wietsedv/bert-base-dutch-cased/config.json",
# See all BERT models at https://huggingface.co/models?filter=bert
}
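# After this rename, entries are keyed by the org-qualified Hub id.
# Minimal lookup sketch (assumes this module is importable as-is):
#
#   from awesome_align.configuration_bert import BERT_PRETRAINED_CONFIG_ARCHIVE_MAP
#   url = BERT_PRETRAINED_CONFIG_ARCHIVE_MAP["cl-tohoku/bert-base-japanese"]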


@@ -54,12 +56,9 @@ class BertConfig(PretrainedConfig):
It is used to instantiate a BERT model according to the specified arguments, defining the model
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
the BERT `bert-base-uncased <https://huggingface.co/bert-base-uncased>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
for more information.
Args:
vocab_size (:obj:`int`, optional, defaults to 30522):
Vocabulary size of the BERT model. Defines the different tokens that
@@ -88,25 +87,17 @@ class BertConfig(PretrainedConfig):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):
The epsilon used by the layer normalization layers.
gradient_checkpointing (:obj:`bool`, optional, defaults to False):
If True, use gradient checkpointing to save memory at the expense of a slower backward pass.
Example::
from transformers import BertModel, BertConfig
# Initializing a BERT bert-base-uncased style configuration
configuration = BertConfig()
# Initializing a model from the bert-base-uncased style configuration
model = BertModel(configuration)
# Accessing the model configuration
configuration = model.config
Attributes:
pretrained_config_archive_map (Dict[str, str]):
A dictionary containing all the available pre-trained checkpoints.
>>> from transformers import BertModel, BertConfig
>>> # Initializing a BERT bert-base-uncased style configuration
>>> configuration = BertConfig()
>>> # Initializing a model from the bert-base-uncased style configuration
>>> model = BertModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
"""
pretrained_config_archive_map = BERT_PRETRAINED_CONFIG_ARCHIVE_MAP
model_type = "bert"

def __init__(
@@ -123,9 +114,11 @@ def __init__(
type_vocab_size=2,
initializer_range=0.02,
layer_norm_eps=1e-12,
pad_token_id=0,
gradient_checkpointing=False,
**kwargs
):
super().__init__(**kwargs)
super().__init__(pad_token_id=pad_token_id, **kwargs)

self.vocab_size = vocab_size
self.hidden_size = hidden_size
@@ -139,3 +132,4 @@ def __init__(
self.type_vocab_size = type_vocab_size
self.initializer_range = initializer_range
self.layer_norm_eps = layer_norm_eps
self.gradient_checkpointing = gradient_checkpointing
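
A minimal sketch of the two options this hunk adds to the vendored `BertConfig` (values shown are the new defaults; assumes the package is installed):

    from awesome_align.configuration_bert import BertConfig

    # pad_token_id (default 0) is now forwarded to PretrainedConfig;
    # gradient_checkpointing trades a slower backward pass for lower memory.
    config = BertConfig(gradient_checkpointing=True)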
61 changes: 61 additions & 0 deletions awesome_align/configuration_roberta.py
@@ -0,0 +1,61 @@
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" RoBERTa configuration """


import logging

from awesome_align.configuration_bert import BertConfig


logger = logging.getLogger(__name__)

ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"roberta-base": "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-config.json",
"roberta-large": "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-config.json",
"roberta-large-mnli": "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-mnli-config.json",
"distilroberta-base": "https://s3.amazonaws.com/models.huggingface.co/bert/distilroberta-base-config.json",
"roberta-base-openai-detector": "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-openai-detector-config.json",
"roberta-large-openai-detector": "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-openai-detector-config.json",
}


class RobertaConfig(BertConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.RobertaModel`.
It is used to instantiate a RoBERTa model according to the specified arguments, defining the model
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
the BERT `bert-base-uncased <https://huggingface.co/bert-base-uncased>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
for more information.
The :class:`~transformers.RobertaConfig` class directly inherits :class:`~transformers.BertConfig`.
It reuses the same defaults. Please check the parent class for more information.
Example::
>>> from transformers import RobertaConfig, RobertaModel
>>> # Initializing a RoBERTa configuration
>>> configuration = RobertaConfig()
>>> # Initializing a model from the configuration
>>> model = RobertaModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
"""
model_type = "roberta"

def __init__(self, pad_token_id=1, bos_token_id=0, eos_token_id=2, **kwargs):
"""Constructs RobertaConfig.
"""
super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)
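
These defaults encode RoBERTa-style special token ids (pad=1, bos=0, eos=2), which XLM-R shares; a quick sanity check, assuming the vendored `PretrainedConfig` stores these kwargs as attributes like upstream transformers:

    from awesome_align.configuration_roberta import RobertaConfig

    config = RobertaConfig()
    assert (config.pad_token_id, config.bos_token_id, config.eos_token_id) == (1, 0, 2)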
