Knowledge-enhanced Bilingual Textual Representations for Cross-lingual Semantic Textual Similarity

Introduction

Cross-lingual representation learning has received considerable attention recently, but it is still restricted by the availability of parallel data.

In this work we propose a method that jointly embeds texts and KB entities on comparable data. Unlike other cross-lingual representation learning methods, ours can be applied to weakly supervised cross-lingual data (articles, paragraphs, and texts) while still capturing cross-lingual information well. The method is validated on the Semantic Textual Similarity (STS) task, including the mono-lingual and cross-lingual datasets of SemEval-14 and SemEval-17.

Inspiration

Existing STS methods are incapable of assessing the semantic similarity of cross-lingual texts. In the mono-lingual setting, the similarity of text pairs can be graded by supervised models (on a scale from 0 to 5).

It is easy to assess the similarity of text pairs by fitting models on annotated similarity labels, but annotating data is expensive. To reduce the need for supervised data, we propose a method that generates comparable texts and trains cross-lingual sentence embeddings on the generated data.

To address the issues mentioned above, we propose a framework that integrates textual information with Knowledge Base (KB) entities to model the cross-lingual semantic relatedness of texts. Unlike standard STS tasks, our method can assess the relatedness of texts of imbalanced lengths (e.g., a sentence against a paragraph).


Method

The overall framework is shown in the figure below and consists of three modules: (1) mono-lingual word and entity learning, (2) a cross-lingual text regularizer, and (3) joint learning of text and entity embeddings.

[Figure: overall framework]

The model is based on the Skip-Gram algorithm. In our method, all word and KB-entity embeddings are trained jointly, covering both mono-lingual and cross-lingual embeddings. A minimal sketch follows.
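To make the joint training concrete, here is a minimal sketch of Skip-Gram with negative sampling over a mixed vocabulary of words and KB entities. It is an illustrative reconstruction, not the authors' code: the "ENT:" token convention, the toy corpus, and the hyperparameters are all assumptions.

```python
# Minimal sketch (not the authors' released code) of joint Skip-Gram training
# with negative sampling. Entity mentions are treated as single tokens
# (hypothetical "ENT:" prefix), so words and entities share one embedding space.
import numpy as np

rng = np.random.default_rng(0)

# Toy comparable corpus: an English and a Spanish text about the same entity.
corpus = [
    ["the", "city", "of", "ENT:Beijing", "is", "large"],
    ["la", "ciudad", "de", "ENT:Beijing", "es", "grande"],
]

vocab = sorted({tok for sent in corpus for tok in sent})
idx = {tok: i for i, tok in enumerate(vocab)}

dim, window, neg_k, lr = 50, 2, 5, 0.05  # illustrative hyperparameters
W_in = rng.normal(scale=0.1, size=(len(vocab), dim))   # target vectors
W_out = np.zeros((len(vocab), dim))                    # context vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(200):
    for sent in corpus:
        for pos, target in enumerate(sent):
            t = idx[target]
            for cpos in range(max(0, pos - window), min(len(sent), pos + window + 1)):
                if cpos == pos:
                    continue
                # One positive context token plus neg_k random negatives.
                samples = [(idx[sent[cpos]], 1.0)]
                samples += [(int(rng.integers(len(vocab))), 0.0) for _ in range(neg_k)]
                grad_t = np.zeros(dim)
                for s, label in samples:
                    g = sigmoid(W_in[t] @ W_out[s]) - label
                    grad_t += g * W_out[s]
                    W_out[s] -= lr * g * W_in[t]
                W_in[t] -= lr * grad_t

# Words and entities now share one space; e.g., the entity vector:
print(W_in[idx["ENT:Beijing"]][:5])
```

In this toy setup the shared entity token acts as a cross-lingual anchor, tying the two mono-lingual texts together in the joint space.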


Experiment

In this work, we use the SemEval-2017 STS task as our evaluation dataset.

In the experiment, we compare our method with five other cross-lingual embedding methods: BiCVM, BilBOWA, BWE Skip-Gram, VecMap, and LASER. All models are trained on the same dataset, and the results are reported in the following table:

[Table: STS results of our method versus the compared methods]
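For reference, STS evaluation follows a standard protocol: embed each sentence, score pairs by cosine similarity, and report the Pearson correlation with the gold 0-5 labels. The sketch below is our illustration under those assumptions, not the released evaluation code; the toy vectors and sentence pairs are placeholders.

```python
# Minimal sketch of the standard STS scoring protocol: average word vectors
# into sentence vectors, score pairs by cosine similarity, report Pearson r.
import numpy as np
from scipy.stats import pearsonr

def sentence_vec(tokens, emb, dim=300):
    # Average the vectors of in-vocabulary tokens (a common STS baseline).
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def evaluate_sts(pairs, gold, emb):
    """pairs: (tokens_lang1, tokens_lang2) tuples; gold: 0-5 similarity scores."""
    preds = [cosine(sentence_vec(s1, emb), sentence_vec(s2, emb))
             for s1, s2 in pairs]
    return pearsonr(preds, gold)[0]

# Toy usage: random vectors stand in for trained cross-lingual embeddings,
# which share one space, so a single lookup table serves both languages.
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=300) for w in
       ["a", "big", "city", "una", "ciudad", "grande", "perro"]}
pairs = [(["a", "big", "city"], ["una", "ciudad", "grande"]),
         (["big"], ["grande"]),
         (["a", "big", "city"], ["perro"])]
print(evaluate_sts(pairs, [4.5, 4.0, 0.5], emb))
```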


Released Embeddings

In this work, we choose three languages (English, Chinese, and Spanish) as our main research languages. All embeddings are 300-dimensional and can be downloaded from the links below.

| Language pair | First language | Second language |
| ------------- | -------------- | --------------- |
| En-Zh         | En (1.85 GB)   | Zh (2.4 GB)     |
| En-Es         | En (1.7 GB)    | Es (1.8 GB)     |
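
To use the released files, something like the following should work with gensim, assuming the vectors are in the standard word2vec text format (an assumption; check the actual files) and substituting the real download paths for the hypothetical file names below.

```python
# Minimal loading sketch; file names are hypothetical placeholders for the
# actual downloads, and the word2vec text format is an assumption.
import numpy as np
from gensim.models import KeyedVectors

en = KeyedVectors.load_word2vec_format("kebtr.en-zh.en.vec", binary=False)
zh = KeyedVectors.load_word2vec_format("kebtr.en-zh.zh.vec", binary=False)

# The two files share one 300d cross-lingual space, so cosine similarity
# across languages is directly meaningful.
v_en, v_zh = en["city"], zh["城市"]
print(np.dot(v_en, v_zh) / (np.linalg.norm(v_en) * np.linalg.norm(v_zh)))
```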

Citation

If you find this work helpful, please consider citing it as:

@InProceedings{10.1007/978-981-15-0118-0_33,
	author="Lu, Hsuehkuan
	and Cao, Yixin
	and Hou, Lei
	and Li, Juanzi",
	editor="Cheng, Xiaohui
	and Jing, Weipeng
	and Song, Xianhua
	and Lu, Zeguang",
	title="Knowledge-Enhanced Bilingual Textual Representations for Cross-Lingual Semantic Textual Similarity",
	booktitle="Data Science",
	year="2019",
	publisher="Springer Singapore",
	address="Singapore",
	pages="425--440",
	isbn="978-981-15-0118-0"
}
