An XLM-RoBERTa-based classifier for predicting code-switch points in English–Spanish human–human dialogue.
Link to the full paper and citation here.
This code incorporates the LIL layer from the SelfExplain framework (https://arxiv.org/abs/2103.12279).
Make sure to unzip the data folder and place it inside an enclosing folder named `bangor_data/`; otherwise, update the filepaths for `$DATA_FOLDER` or `--dataset_basedir` in the shell scripts below.
Data for preprocessing is available in the `bangor_data/` folder.
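For example, setting up the data might look like this (a minimal sketch; `data.zip` is a placeholder for the actual archive name):

```sh
# Sketch only: "data.zip" stands in for the released archive name.
mkdir -p bangor_data
unzip data.zip -d bangor_data/
```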
To generate the phrase masks, run the preprocessing scripts:

```sh
sh scripts/run_preprocessing_bangor_idx.sh
sh scripts/run_preprocessing_bangor_desc.sh
```

To train models with a context size of one previous utterance, run:

```sh
sh scripts/baseline_train_ctx1.sh
sh scripts/list_models_ctx1.sh
```
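If your data lives elsewhere, one option is to set `$DATA_FOLDER` at invocation time; this assumes the scripts read it from the environment rather than hard-coding it, so if that fails, edit the variable at the top of each script instead:

```sh
# Assumption: the script resolves $DATA_FOLDER from the environment.
# If the path is hard-coded, edit it inside the script instead.
DATA_FOLDER=/path/to/bangor_data sh scripts/baseline_train_ctx1.sh
```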
Preprocessed data is available for download here.
Data for control experiments is available for download here.
Extract these under the `bangor_data/` folder.
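For example (the archive names below are placeholders for whatever the downloads are actually called):

```sh
# Placeholder archive names; substitute the real download filenames.
unzip preprocessed_data.zip -d bangor_data/
unzip control_experiments.zip -d bangor_data/
```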
Model outputs for the unbalanced validation and test sets are available under the `model_outputs/` folder. Folders are organized by split (test or validation), model type (speaker-prompted or baseline), and context size (number of previous utterances).
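As an illustration, the layout follows this pattern (the folder names shown are illustrative, not verbatim):

```
model_outputs/
├── test/
│   ├── baseline/
│   │   ├── ctx1/
│   │   └── ctx2/
│   └── speaker_prompted/
│       └── ctx1/
└── validation/
    └── ...
```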