https://huggingface.co/blog/train-sparse-encoder

https://github.com/UKPLab/sentence-transformers/blob/master/examples/sparse_encoder/applications/retrieve_rerank/hybrid_search.py


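## Why use sparse embedding models?
In short, neural sparse embedding models sit between traditional lexical methods such as BM25 and dense embedding models such as Sentence Transformers. They offer the following advantages: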

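Hybrid potential: they combine very effectively with dense models, which can struggle on searches where lexical matches matter
Interpretability: you can see exactly which tokens contribute to a match
Performance: competitive with or better than dense models on many retrieval tasks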
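
As an illustration of the hybrid point above, one common way to combine a sparse and a dense ranking is reciprocal rank fusion (RRF). The sketch below is a minimal, library-free version; the function name, the document ids, and the constant `k=60` are assumptions made for this example, and it is not necessarily the approach used by the hybrid search script linked at the top.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 results from a sparse and a dense retriever
sparse_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
print(fused)
```

Documents that both retrievers rank highly rise to the top, while a document found by only one retriever still stays in the fused list.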

In [None]:
from sentence_transformers import SparseEncoder

# Download from the 🤗 Hub
model = SparseEncoder("naver/splade-v3")

# Run inference
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 30522)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[   32.4323,     5.8528,     0.0258],
#         [    5.8528,    26.6649,     0.0302],
#         [    0.0258,     0.0302,    24.0839]])

# Let's decode our embeddings to be able to interpret them
decoded = model.decode(embeddings, top_k=10)
# Use a separate loop variable so the `decoded` list is not shadowed
for tokens, sentence in zip(decoded, sentences):
    print(f"Sentence: {sentence}")
    print(f"Decoded: {tokens}")
    print()


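In this example, the embeddings are 30,522-dimensional vectors in which each dimension corresponds to a token in the model's vocabulary. The `decode` method returns the 10 tokens with the highest values in each embedding, which lets us see which tokens contribute most to it.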
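
To make the idea concrete without downloading a model, here is a toy sketch that stores each sparse embedding as a `{dimension: weight}` dictionary. The five-word vocabulary and all weights are invented for illustration, and the helper names `dot` and `decode` are hypothetical. It assumes dot-product similarity, which the unnormalized scores in the output above suggest.

```python
# Toy vocabulary and weights, invented for illustration (not real model output)
vocab = ["weather", "sunny", "today", "stadium", "drove"]

emb_a = {0: 2.0, 1: 0.5, 2: 1.5}  # stands in for "The weather is lovely today."
emb_b = {0: 1.0, 1: 2.5}          # stands in for "It's so sunny outside!"
emb_c = {3: 2.0, 4: 1.8}          # stands in for "He drove to the stadium."


def dot(u, v):
    """Dot product of two sparse vectors stored as {dimension: weight}."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v[i] for i, w in u.items() if i in v)


def decode(embedding, top_k=10):
    """Return the top_k (token, weight) pairs with the largest weights."""
    top = sorted(embedding.items(), key=lambda item: item[1], reverse=True)[:top_k]
    return [(vocab[i], w) for i, w in top]


print(dot(emb_a, emb_b))       # shared dimensions 0 and 1 contribute
print(dot(emb_a, emb_c))       # no shared dimensions
print(decode(emb_a, top_k=2))
```

Only dimensions that are non-zero in both vectors contribute to the score, which is why each match can be traced back to specific tokens.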

## Finetune

In [None]:
from sentence_transformers import SparseEncoder

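# Option 1: continue finetuning an existing sparse encoder
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

# Option 2: start from a fill-mask (masked language model) checkpoint.
# Running this line reassigns `model`, replacing the one from Option 1.
model = SparseEncoder("google-bert/bert-base-uncased")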
# SparseEncoder(
#   (0): MLMTransformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'BertForMaskedLM'})
#   (1): SpladePooling({'pooling_strategy': 'max', 'activation_function': 'relu', 'word_embedding_dimension': None})
# )
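
The printed architecture lists a `SpladePooling` module with `max` pooling and a `relu` activation. The sketch below shows what such a pooling step might compute, assuming the usual SPLADE formulation: apply `log(1 + relu(logit))` to every per-token vocabulary logit, then take the maximum over tokens. The function name and the logits are hypothetical, and the library's own implementation also has to handle batching and attention masks, which are omitted here.

```python
import math


def splade_max_pooling(token_logits):
    """Collapse per-token vocabulary logits into one sparse vector.

    token_logits: one row per input token, one column per vocabulary entry.
    Applies log(1 + relu(x)) to every logit, then takes the max over tokens.
    """
    vocab_size = len(token_logits[0])
    pooled = [0.0] * vocab_size
    for row in token_logits:
        for j, logit in enumerate(row):
            activated = math.log1p(max(logit, 0.0))
            pooled[j] = max(pooled[j], activated)
    return pooled


# Hypothetical MLM logits for a 2-token input over a 4-entry vocabulary
logits = [
    [1.0, -2.0, 0.0, 3.0],
    [-1.0, -0.5, 0.0, 1.0],
]
embedding = splade_max_pooling(logits)
print(embedding)
```

Vocabulary entries whose logits are never positive end up at exactly zero, which is where the sparsity of the resulting embedding comes from.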

