Commit 337f684: xlmr
zdou0830 committed Aug 11, 2021 (1 parent: 1e0543f)
Showing 19 changed files with 7,049 additions and 2,912 deletions.
19 changes: 9 additions & 10 deletions README.md
@@ -1,6 +1,6 @@
## AWESOME: Aligning Word Embedding Spaces of Multilingual Encoders

`awesome-align` is a tool that can extract word alignments from multilingual BERT (mBERT) [[Demo]](https://colab.research.google.com/drive/1205ubqebM0OsZa1nRgbGJBtitgHqIVv6?usp=sharing) and allows you to fine-tune mBERT on parallel corpora for better alignment quality (see [our paper](https://arxiv.org/abs/2101.08231) for more details).
This is the XLM-R version of `awesome-align`.

### Dependencies

@@ -17,11 +17,11 @@ Inputs should be *tokenized* and each line is a source language sentence and its

### Extracting alignments

Here is an example of extracting word alignments from multilingual BERT:
Here is an example of extracting word alignments from XLM-R:

```bash
DATA_FILE=/path/to/data/file
MODEL_NAME_OR_PATH=bert-base-multilingual-cased
MODEL_NAME_OR_PATH=xlm-roberta-base
OUTPUT_FILE=/path/to/output/file

CUDA_VISIBLE_DEVICES=0 awesome-align \
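# (hunk collapsed in this diff view; per the upstream awesome-align README
# the extraction command continues roughly as below -- treat the exact flag
# set as an assumption and check `awesome-align --help`)
#   --output_file=$OUTPUT_FILE \
#   --model_name_or_path=$MODEL_NAME_OR_PATH \
#   --data_file=$DATA_FILE \
#   --extraction 'softmax' \
#   --batch_size 32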
@@ -42,7 +42,7 @@ You can also set `MODEL_NAME_OR_PATH` to the path of your fine-tuned model as sh

If there is parallel data available, you can fine-tune embedding models on that data.

Here is an example of fine-tuning mBERT that balances well between efficiency and effectiveness:
Here is an example of fine-tuning XLM-R:

```bash
TRAIN_FILE=/path/to/train/file
@@ -51,7 +51,7 @@ OUTPUT_DIR=/path/to/output/directory

CUDA_VISIBLE_DEVICES=0 awesome-train \
--output_dir=$OUTPUT_DIR \
--model_name_or_path=bert-base-multilingual-cased \
--model_name_or_path=xlm-roberta-base \
--extraction 'softmax' \
--do_train \
--train_tlm \
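# (remaining flags collapsed in this diff view; per the upstream awesome-align
# README the command continues roughly as below -- these flag names are an
# assumption quoted from that README, so verify with `awesome-train --help`)
#   --train_data_file=$TRAIN_FILE \
#   --per_gpu_train_batch_size 2 \
#   --gradient_accumulation_steps 4 \
#   --num_train_epochs 1 \
#   --learning_rate 2e-5 \
#   --save_steps 4000 \
#   --max_steps 20000 \
#   --do_eval \
#   --eval_data_file=$EVAL_FILE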
@@ -76,7 +76,7 @@ OUTPUT_DIR=/path/to/output/directory

CUDA_VISIBLE_DEVICES=0 awesome-train \
--output_dir=$OUTPUT_DIR \
--model_name_or_path=bert-base-multilingual-cased \
--model_name_or_path=xlm-roberta-base \
--extraction 'softmax' \
--do_train \
--train_mlm \
@@ -108,7 +108,7 @@ OUTPUT_DIR=/path/to/output/directory

CUDA_VISIBLE_DEVICES=0 awesome-train \
--output_dir=$OUTPUT_DIR \
--model_name_or_path=bert-base-multilingual-cased \
--model_name_or_path=xlm-roberta-base \
--extraction 'softmax' \
--do_train \
--train_so \
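# (the --train_mlm and --train_so variants above are collapsed the same way
# and take the same remaining data/eval flags; see the TLM sketch above)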
@@ -132,9 +132,8 @@ The following table shows the alignment error rates (AERs) of our models and pop
| [fast\_align](https://github.com/clab/fast_align) | 27.0 | 10.5 | 32.1 | 51.1 | 38.1 |
| [eflomal](https://github.com/robertostling/eflomal) | 22.6 | 8.2 | 25.1 | 47.5 | 28.7 |
| [Mgiza](https://github.com/moses-smt/mgiza) | 20.6 | 5.9 | 26.4 | 48.0 | 35.1 |
| Ours (w/o fine-tuning, softmax) | 17.4 | 5.6 | 27.9 | 45.6 | 18.1 |
| Ours (multilingually fine-tuned <br/> w/o `--train_co`, softmax) [[Download]](https://drive.google.com/file/d/1IcQx6t5qtv4bdcGjjVCwXnRkpr67eisJ/view?usp=sharing) | 15.2 | **4.1** | 22.6 | **37.4** | **13.4** |
| Ours (multilingually fine-tuned <br/> w/ `--train_co`, softmax) [[Download]](https://drive.google.com/file/d/1IluQED1jb0rjITJtyj4lNMPmaRFyMslg/view?usp=sharing) | **15.1** | 4.5 | **20.7** | 38.4 | 14.5 |
| Ours-mBERT (w/o fine-tuning, softmax) | 17.4 | 5.6 | 27.9 | 45.6 | 18.1 |
| Ours-XLM-R (w/o fine-tuning, softmax) | 23.1 | 9.2 | 28.6 | 62.0 | 30.7 |
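
For reference, AER scores a predicted alignment set A against sure links S and possible links P (S ⊆ P) as AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|); lower is better.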


### Citation
50 changes: 22 additions & 28 deletions awesome_align/configuration_bert.py
@@ -17,6 +17,7 @@


import logging

from awesome_align.configuration_utils import PretrainedConfig


@@ -38,13 +39,14 @@
"bert-base-cased-finetuned-mrpc": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-finetuned-mrpc-config.json",
"bert-base-german-dbmdz-cased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-cased-config.json",
"bert-base-german-dbmdz-uncased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-uncased-config.json",
"bert-base-japanese": "https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese-config.json",
"bert-base-japanese-whole-word-masking": "https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese-whole-word-masking-config.json",
"bert-base-japanese-char": "https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese-char-config.json",
"bert-base-japanese-char-whole-word-masking": "https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese-char-whole-word-masking-config.json",
"bert-base-finnish-cased-v1": "https://s3.amazonaws.com/models.huggingface.co/bert/TurkuNLP/bert-base-finnish-cased-v1/config.json",
"bert-base-finnish-uncased-v1": "https://s3.amazonaws.com/models.huggingface.co/bert/TurkuNLP/bert-base-finnish-uncased-v1/config.json",
"bert-base-dutch-cased": "https://s3.amazonaws.com/models.huggingface.co/bert/wietsedv/bert-base-dutch-cased/config.json",
"cl-tohoku/bert-base-japanese": "https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese/config.json",
"cl-tohoku/bert-base-japanese-whole-word-masking": "https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese-whole-word-masking/config.json",
"cl-tohoku/bert-base-japanese-char": "https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese-char/config.json",
"cl-tohoku/bert-base-japanese-char-whole-word-masking": "https://s3.amazonaws.com/models.huggingface.co/bert/cl-tohoku/bert-base-japanese-char-whole-word-masking/config.json",
"TurkuNLP/bert-base-finnish-cased-v1": "https://s3.amazonaws.com/models.huggingface.co/bert/TurkuNLP/bert-base-finnish-cased-v1/config.json",
"TurkuNLP/bert-base-finnish-uncased-v1": "https://s3.amazonaws.com/models.huggingface.co/bert/TurkuNLP/bert-base-finnish-uncased-v1/config.json",
"wietsedv/bert-base-dutch-cased": "https://s3.amazonaws.com/models.huggingface.co/bert/wietsedv/bert-base-dutch-cased/config.json",
# See all BERT models at https://huggingface.co/models?filter=bert
}
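# After this rename, entries are keyed by the org-qualified Hub id.
# Minimal lookup sketch (assumes this module is importable as-is):
#
#   from awesome_align.configuration_bert import BERT_PRETRAINED_CONFIG_ARCHIVE_MAP
#   url = BERT_PRETRAINED_CONFIG_ARCHIVE_MAP["cl-tohoku/bert-base-japanese"]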


@@ -54,12 +56,9 @@ class BertConfig(PretrainedConfig):
It is used to instantiate a BERT model according to the specified arguments, defining the model
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
the BERT `bert-base-uncased <https://huggingface.co/bert-base-uncased>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
for more information.
Args:
vocab_size (:obj:`int`, optional, defaults to 30522):
Vocabulary size of the BERT model. Defines the different tokens that
@@ -88,25 +87,17 @@ class BertConfig(PretrainedConfig):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):
The epsilon used by the layer normalization layers.
gradient_checkpointing (:obj:`bool`, optional, defaults to False):
If True, use gradient checkpointing to save memory at the expense of a slower backward pass.
Example::
from transformers import BertModel, BertConfig
# Initializing a BERT bert-base-uncased style configuration
configuration = BertConfig()
# Initializing a model from the bert-base-uncased style configuration
model = BertModel(configuration)
# Accessing the model configuration
configuration = model.config
Attributes:
pretrained_config_archive_map (Dict[str, str]):
A dictionary containing all the available pre-trained checkpoints.
>>> from transformers import BertModel, BertConfig
>>> # Initializing a BERT bert-base-uncased style configuration
>>> configuration = BertConfig()
>>> # Initializing a model from the bert-base-uncased style configuration
>>> model = BertModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
"""
pretrained_config_archive_map = BERT_PRETRAINED_CONFIG_ARCHIVE_MAP
model_type = "bert"

def __init__(
@@ -123,9 +114,11 @@ def __init__(
type_vocab_size=2,
initializer_range=0.02,
layer_norm_eps=1e-12,
pad_token_id=0,
gradient_checkpointing=False,
**kwargs
):
super().__init__(**kwargs)
super().__init__(pad_token_id=pad_token_id, **kwargs)

self.vocab_size = vocab_size
self.hidden_size = hidden_size
@@ -139,3 +132,4 @@ def __init__(
self.type_vocab_size = type_vocab_size
self.initializer_range = initializer_range
self.layer_norm_eps = layer_norm_eps
self.gradient_checkpointing = gradient_checkpointing
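
A minimal sketch of the two options this hunk adds to the vendored `BertConfig` (values shown are the new defaults; assumes the package is installed):

    from awesome_align.configuration_bert import BertConfig

    # pad_token_id (default 0) is now forwarded to PretrainedConfig;
    # gradient_checkpointing trades a slower backward pass for lower memory.
    config = BertConfig(gradient_checkpointing=True)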
61 changes: 61 additions & 0 deletions awesome_align/configuration_roberta.py
@@ -0,0 +1,61 @@
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" RoBERTa configuration """


import logging

from awesome_align.configuration_bert import BertConfig


logger = logging.getLogger(__name__)

ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"roberta-base": "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-config.json",
"roberta-large": "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-config.json",
"roberta-large-mnli": "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-mnli-config.json",
"distilroberta-base": "https://s3.amazonaws.com/models.huggingface.co/bert/distilroberta-base-config.json",
"roberta-base-openai-detector": "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-openai-detector-config.json",
"roberta-large-openai-detector": "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-openai-detector-config.json",
}


class RobertaConfig(BertConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.RobertaModel`.
It is used to instantiate a RoBERTa model according to the specified arguments, defining the model
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
the BERT `bert-base-uncased <https://huggingface.co/bert-base-uncased>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
for more information.
The :class:`~transformers.RobertaConfig` class directly inherits :class:`~transformers.BertConfig`.
It reuses the same defaults. Please check the parent class for more information.
Example::
>>> from transformers import RobertaConfig, RobertaModel
>>> # Initializing a RoBERTa configuration
>>> configuration = RobertaConfig()
>>> # Initializing a model from the configuration
>>> model = RobertaModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
"""
model_type = "roberta"

def __init__(self, pad_token_id=1, bos_token_id=0, eos_token_id=2, **kwargs):
"""Constructs RobertaConfig.
"""
super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)
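
These defaults encode RoBERTa-style special token ids (pad=1, bos=0, eos=2), which XLM-R shares; a quick sanity check, assuming the vendored `PretrainedConfig` stores these kwargs as attributes like upstream transformers:

    from awesome_align.configuration_roberta import RobertaConfig

    config = RobertaConfig()
    assert (config.pad_token_id, config.bos_token_id, config.eos_token_id) == (1, 0, 2)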
