Swedish sentence BERT (KB-SBERT) was trained to emulate a strong English sentence embedding model called paraphrase-mpnet-base-v2. The model achieved a Pearson correlation coefficient of 0.918 and a Spearman's rank correlation coefficient of 0.911 on the SweParaphrase test set.
An article explaining the data and the model in further detail can be found on the KBLab blog.
We trained on 14.6 million sentences from different parallel corpora. The model was trained for 50 hours on a single A100 GPU with 40 GB of memory.
To replicate the training, first download the data by running all of the get_parallel_data_{}.py scripts. Make sure all resulting data files are created correctly in the folder parallel-sentences/.
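The exact contents of these files depend on the get_parallel_data_{}.py scripts, but files of this kind are typically tab-separated English–Swedish sentence pairs, possibly gzipped. A quick, purely illustrative sanity check of the downloaded files could look like this (the tab-separated layout is an assumption to verify against the scripts):

```python
# Hypothetical sanity check: print the first sentence pair from each downloaded file.
# Assumes tab-separated lines (English \t Swedish), possibly gzipped, in parallel-sentences/.
import glob
import gzip

for path in sorted(glob.glob("parallel-sentences/*")):
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        first_line = f.readline().rstrip("\n")
    print(f"{path}: {first_line}")
```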
Training should simply be a matter of running

python train.py

Adjust the variables in the script if GPU memory is an issue (lower batch_size and max_seq_length).
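For orientation, train.py appears to follow the sentence-transformers teacher–student ("make multilingual") recipe: the English teacher paraphrase-mpnet-base-v2 encodes the English side of each sentence pair, and the Swedish student is trained with an MSE loss to reproduce those embeddings for both languages. The sketch below illustrates that setup; the student base model (KB/bert-base-swedish-cased), file name, and hyperparameters are placeholders, not the exact values used in train.py.

```python
# Minimal sketch of the teacher-student setup, not the exact configuration in train.py.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses
from sentence_transformers.datasets import ParallelSentencesDataset

teacher = SentenceTransformer("paraphrase-mpnet-base-v2")         # English teacher (see above)
word_emb = models.Transformer("KB/bert-base-swedish-cased",       # assumed Swedish student base
                              max_seq_length=128)
pooling = models.Pooling(word_emb.get_word_embedding_dimension())
student = SentenceTransformer(modules=[word_emb, pooling])

# Parallel English-Swedish sentences: the student learns to reproduce the teacher's
# embedding of the English sentence for both the English and the Swedish side.
train_data = ParallelSentencesDataset(student_model=student, teacher_model=teacher)
train_data.load_data("parallel-sentences/example-en-sv.tsv.gz")   # hypothetical file name
train_dataloader = DataLoader(train_data, shuffle=True, batch_size=64)
train_loss = losses.MSELoss(model=student)

student.fit(train_objectives=[(train_dataloader, train_loss)],
            epochs=1,
            warmup_steps=1000,
            output_path="output/make-multilingual-en-sv")
```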
A script for launching the training as a SLURM job is available in slumrp_gpu.sh.
The model was evaluated on 4 different SuperLim test sets. Run the code in each of the evaluate_{}.py scripts to recreate the results. If you train your own model, you can load it by pointing to the model directory, e.g.:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("output/make-multilingual-en-sv-2021-08-05_19-59-51")
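As a quick check of a loaded model, one way to compare two Swedish sentences is to encode them and take the cosine similarity of their embeddings. The sentence pair below is just an illustrative example:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("output/make-multilingual-en-sv-2021-08-05_19-59-51")

# Encode a pair of Swedish sentences and compare them with cosine similarity.
sentences = ["Mannen spelar gitarr.", "En man spelar på en gitarr."]
embeddings = model.encode(sentences, convert_to_tensor=True)
print(util.cos_sim(embeddings[0], embeddings[1]).item())
```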
Download the data for evaluation here: https://spraakbanken.gu.se/en/resources/superlim
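The SweParaphrase figures quoted at the top come from comparing the model's similarity scores with the gold scores. As a rough sketch of that kind of evaluation (the actual logic lives in the evaluate_{}.py scripts), sentence-transformers' EmbeddingSimilarityEvaluator can be used; the file name, column names, and gold-score scale below are assumptions to check against the downloaded SuperLim data:

```python
# Hedged sketch of a SweParaphrase-style evaluation; verify the column layout and
# gold-score scale (assumed 0-5 here) against the downloaded data and evaluate_{}.py.
import csv
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SimilarityFunction

model = SentenceTransformer("output/make-multilingual-en-sv-2021-08-05_19-59-51")

sentences1, sentences2, scores = [], [], []
with open("sweparaphrase-test.tsv", encoding="utf-8") as f:    # hypothetical file name
    for row in csv.DictReader(f, delimiter="\t"):
        sentences1.append(row["Sentence 1"])                   # assumed column names
        sentences2.append(row["Sentence 2"])
        scores.append(float(row["Score"]) / 5.0)               # normalise gold scores to [0, 1]

evaluator = EmbeddingSimilarityEvaluator(sentences1, sentences2, scores,
                                         main_similarity=SimilarityFunction.COSINE,
                                         name="sweparaphrase")
# Returns the Spearman correlation between cosine similarities and gold scores.
print(evaluator(model))
```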
Big thanks to the sentence-transformers package. Code was adapted and modified from their training examples.
We gratefully acknowledge the HPC RIVR consortium (www.hpc-rivr.si) and EuroHPC JU (eurohpc-ju.europa.eu) for funding this research by providing computing resources of the HPC system Vega at the Institute of Information Science (www.izum.si).