Language-agnostic Representation from Multilingual Sentence Encoders for Cross-lingual Similarity Estimation
We propose a method to distill a language-agnostic meaning embedding from a multilingual sentence encoder. By removing language-specific information from the original embedding, we retrieve an embedding that fully represents the meaning of the sentence. The proposed method relies only on parallel corpora without any human annotations. Our meaning embedding allows for efficient cross-lingual sentence similarity estimation using a simple cosine similarity calculation.
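Once meaning embeddings have been distilled, cross-lingual similarity estimation reduces to a cosine between two vectors. Below is a minimal sketch of that final step; `encode_meaning` is a hypothetical stand-in for a trained encoder, not a function provided by this repository:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical: encode_meaning() would return the distilled,
# language-agnostic meaning embedding of a sentence.
# emb_en = encode_meaning("The weather is nice today.")
# emb_de = encode_meaning("Das Wetter ist heute schön.")
# print(cosine_similarity(emb_en, emb_de))  # high for translation pairs

# Self-contained demo with placeholder vectors:
emb_a = np.array([0.2, 0.9, 0.1])
emb_b = np.array([0.25, 0.85, 0.05])
print(f"similarity: {cosine_similarity(emb_a, emb_b):.3f}")
```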
- Install the requirements: `$ pip install -r requirements.txt`
- Prepare the training data and its embeddings from the multilingual sentence encoder of your choice (a minimal sketch follows this list).
- Edit the model in `train_model.py` and run: `$ python train_model.py`
- The result will be written to `result.csv`.
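A minimal sketch of the data-preparation step above, assuming `sentence-transformers` is installed and using LaBSE as the base multilingual encoder; the corpus file names and the `.npy` output format are illustrative assumptions, not requirements of `train_model.py`:

```python
# Embed both sides of a parallel corpus with the same multilingual encoder.
# File names ("train.en", "train.ja") and .npy outputs are assumptions for
# this sketch; adapt them to whatever train_model.py expects.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

# Parallel corpus: line i of each file is a translation pair.
with open("train.en", encoding="utf-8") as f:
    src_sentences = [line.strip() for line in f]
with open("train.ja", encoding="utf-8") as f:
    tgt_sentences = [line.strip() for line in f]

np.save("train.en.npy", model.encode(src_sentences))
np.save("train.ja.npy", model.encode(tgt_sentences))
```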
Paper link: https://aclanthology.org/2021.emnlp-main.612/
BibTeX for citation:
@inproceedings{tiyajamorn-etal-2021-language,
title = "Language-agnostic Representation from Multilingual Sentence Encoders for Cross-lingual Similarity Estimation",
author = "Tiyajamorn, Nattapong and
Kajiwara, Tomoyuki and
Arase, Yuki and
Onizuka, Makoto",
booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2021",
address = "Online and Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.emnlp-main.612",
pages = "7764--7774",
abstract = "We propose a method to distill a language-agnostic meaning embedding from a multilingual sentence encoder. By removing language-specific information from the original embedding, we retrieve an embedding that fully represents the sentence{'}s meaning. The proposed method relies only on parallel corpora without any human annotations. Our meaning embedding allows efficient cross-lingual sentence similarity estimation by simple cosine similarity calculation. Experimental results on both quality estimation of machine translation and cross-lingual semantic textual similarity tasks reveal that our method consistently outperforms the strong baselines using the original multilingual embedding. Our method consistently improves the performance of any pre-trained multilingual sentence encoder, even in low-resource language pairs where only tens of thousands of parallel sentence pairs are available.",
}