
Bridging the domain gap in cross-lingual document classification

This repository contains the preprocessing scripts, augmentation data, and pretrained models for our paper

Bridging the domain gap in cross-lingual document classification

Guokun Lai, Barlas Oguz, Veselin Stoyanov

Data Preprocessing

We include the data preprocessing scripts and download links for the cross-lingual sentiment classification task in the corresponding folder.

For the news classification task, we directly use the MLDoc dataset, which can be obtained from https://github.com/facebookresearch/MLDoc.

After preprocessing, the data is stored in several TSV files. Each line is one data sample in the format "$label \t $text". For unlabeled data files, the $label field contains a placeholder.
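For illustration, here is a minimal Python sketch for reading files in this format (the function name and file path are ours, not part of the repository):

```python
def read_tsv(path):
    """Read a "$label \t $text" TSV file into (label, text) pairs."""
    samples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Split on the first tab only, in case the text itself contains tabs.
            label, text = line.rstrip("\n").split("\t", 1)
            samples.append((label, text))
    return samples

# Usage (hypothetical file name):
# train = read_tsv("train.tsv")
```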

Augmentation Data

We also provide our generated augmentation data on Google Drive. In each folder, the augmentation file contains the augmentation samples generated from the corresponding unlabeled data.

The data format for each line is "$label \t $original \t $augmented \t $original-lang \t $augmented-lang". The $original-lang field denotes the language of the original sample.
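A minimal sketch for parsing this five-field format, assuming no field contains a literal tab (the helper and class names are ours, not part of the repository):

```python
from typing import NamedTuple

class AugSample(NamedTuple):
    label: str
    original: str
    augmented: str
    original_lang: str
    augmented_lang: str

def read_augmentation_file(path):
    """Parse one AugSample per line from the five-field augmentation format."""
    with open(path, encoding="utf-8") as f:
        return [AugSample(*line.rstrip("\n").split("\t")) for line in f]
```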

Note that the text in these files has been BPE-tokenized using the scripts from the XLM repo. If you want to recover the original text, a simple approach is to undo the BPE tokenization by deleting the "@@ " symbols.
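For example, a one-line sketch of this cleanup:

```python
def remove_bpe(text):
    """Undo BPE segmentation by merging subword units marked with "@@ "."""
    return text.replace("@@ ", "")

# remove_bpe("econ@@ omic growth")  ->  "economic growth"
```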

Pretrained XLM Models

The pretrained models used in this project are also on Google Drive. We provide XLM models pretrained on unlabeled data from different domains. They are based on the XNLI-15 version of XLM and use the same storage format.
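Since the checkpoints share the XNLI-15 XLM storage format, they should load the same way as the official XLM checkpoints. A minimal sketch, assuming a checkpoint has been downloaded locally (the file name is hypothetical; see the XLM repo for the full model-construction code):

```python
import torch

# XLM checkpoints are plain torch pickles; load on CPU to inspect them.
checkpoint = torch.load("xlm_pretrained.pth", map_location="cpu")

# The XLM format bundles the model weights together with the vocabulary
# and training parameters; list the stored keys to see what is included.
print(checkpoint.keys())
```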
