sif-embedding

This is a Rust implementation of simple but powerful sentence embedding algorithms based on SIF and uSIF described in the following papers:

Sanjeev Arora, Yingyu Liang, and Tengyu Ma, A Simple but Tough-to-Beat Baseline for Sentence Embeddings, ICLR 2017
Kawin Ethayarajh, Unsupervised Random Walk Sentence Embeddings: A Strong but Simple Baseline, RepL4NLP 2018

Features

No GPU required: This library runs on CPU only.
Fast embeddings: This library provides fast sentence embeddings thanks to the simple algorithms of SIF and uSIF. We observed that our SIF implementation could process ~80K sentences per second on M2 MacBook Air. (See benchmarks.)
Reasonable evaluation scores: The performances of SIF and uSIF on similarity evaluation tasks do not outperform those of SOTA models such as SimCSE. However, they are not so worse. (See evaluations.)

This library will help you if

DNN-based sentence embeddings are too slow for your application,
you do not have an option using GPUs, or
you want baseline sentence embeddings for your development.

Documentation

https://docs.rs/sif-embedding/

Getting started

See tutorial.

Benchmarks

benchmarks provides speed benchmarks.

We observed that, with an English Wikipedia dataset, our SIF implementation could process ~80K sentences per second on MacBook Air (one core of Apple M2, 24 GB RAM).

Evaluations

evaluations provides tools to evaluate sif-embedding on several similarity evaluation tasks.

STS/SICK

evaluations/senteval provides evaluation tools and results for SentEval STS/SICK Tasks.

As one example, the following table shows the evaluation results with the Spearman's rank correlation coefficient for the STS-Benchmark.

Model	train	dev	test	Avg.
sif_embedding::Sif	65.2	75.3	63.6	68.0
sif_embedding::USif	68.0	78.2	66.3	70.8
princeton-nlp/unsup-simcse-bert-base-uncased	76.9	81.7	76.5	78.4
princeton-nlp/sup-simcse-bert-base-uncased	83.3	86.2	84.3	84.6

JSTS/JSICK

eveluations/japanese provides evaluation tools and results for JGLUE JSTS and JSICK tasks.

As one example, the following table shows the evaluation results with the Spearman's rank correlation coefficient.

Model	JSICK (test)	JSTS (train)	JSTS (val)	Avg.
sif_embedding::Sif	79.7	67.6	74.6	74.0
sif_embedding::USif	79.7	69.3	76.0	75.0
cl-nagoya/unsup-simcse-ja-base	79.0	74.5	79.0	77.5
cl-nagoya/unsup-simcse-ja-large	79.6	77.8	81.4	79.6
cl-nagoya/sup-simcse-ja-base	82.8	77.9	80.9	80.5
cl-nagoya/sup-simcse-ja-large	83.1	79.6	83.1	81.9

Similarity search

qdrant-examples provides an example of using sif-embedding with qdrant/rust-client.

Wiki

Trouble shooting: Tips on how to resolve errors I faced in my environment.

Licensing

Licensed under either of

Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

at your option.

Name		Name	Last commit message	Last commit date
Latest commit History 127 Commits
.github/workflows		.github/workflows
benchmarks/wiki1m		benchmarks/wiki1m
evaluations		evaluations
finalfusion-tools		finalfusion-tools
qdrant-examples		qdrant-examples
src		src
tutorial		tutorial
.gitignore		.gitignore
Cargo.toml		Cargo.toml
LICENSE-APACHE		LICENSE-APACHE
LICENSE-MIT		LICENSE-MIT
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Licenses found

Repository files navigation

sif-embedding

Features

Documentation

Getting started

Benchmarks

Evaluations

STS/SICK

JSTS/JSICK

Similarity search

Wiki

Licensing

About

Licenses found

Releases 6

Packages

Languages

License

Licenses found

kampersanda/sif-embedding

Folders and files

Latest commit

History

Repository files navigation

sif-embedding

Features

Documentation

Getting started

Benchmarks

Evaluations

STS/SICK

JSTS/JSICK

Similarity search

Wiki

Licensing

About

Topics

Resources

License

Licenses found

Stars

Watchers

Forks

Releases 6

Packages 0

Languages

Packages