diff --git a/src/posts/embedding-benchmark-2026.md b/src/posts/embedding-benchmark-2026.md
index 1ae4f15..98f0f9b 100644
--- a/src/posts/embedding-benchmark-2026.md
+++ b/src/posts/embedding-benchmark-2026.md
@@ -7,7 +7,7 @@ tag:
 ---
 # Benchmark Text Embedding Models for RecSys in 2026
 
-In the 2025 post [Text Embedding Benchmark for Recommender Systems](./embedding-benchmark.md), we benchmarked the performance of text embedding models in similarity-based recommendations. Within six months of that post's publication, Alibaba Cloud and Google launched a new generation of open-source text embedding models: [qwen3-embedding](https://github.com/QwenLM/Qwen3-Embedding) by Alibaba Cloud and [embeddinggemma](https://ai.google.dev/gemma/docs/embeddinggemma) by Google. Recently, the [gorse-cli](https://github.com/gorse-io/gorse/tree/master/cmd/gorse-cli) also added a benchmarking feature for text embedding models. This post will use [gorse-cli](https://github.com/gorse-io/gorse/tree/master/cmd/gorse-cli) and the playground dataset to conduct a comprehensive benchmark of popular open-source text embedding models.
+In the 2025 post [Text Embedding Benchmark for Recommender Systems](./embedding-benchmark.md), we benchmarked the performance of text embedding models in similarity-based recommendations. Within six months of that post's publication, Alibaba Cloud and Google launched a new generation of open-source text embedding models: [qwen3-embedding](https://github.com/QwenLM/Qwen3-Embedding) by Alibaba Cloud and [embeddinggemma](https://ai.google.dev/gemma/docs/embeddinggemma) by Google. Recently, the [gorse-bench](https://github.com/gorse-io/gorse/tree/master/cmd/gorse-bench) also added a benchmarking feature for text embedding models. This post will use [gorse-bench](https://github.com/gorse-io/gorse/tree/master/cmd/gorse-bench) and the playground dataset to conduct a comprehensive benchmark of popular open-source text embedding models.
 
 ## Evaluation: 1-shot Similarity-based Recommendation
 
@@ -39,10 +39,10 @@ OPENAI_BASE_URL="https://integrate.api.nvidia.com/v1"
 OPENAI_AUTH_TOKEN="NVIDIA_API_KEY"
 ```
 
-Compile [gorse-cli](https://github.com/gorse-io/gorse/tree/master/cmd/gorse-cli) from Gorse repository and run the following command to evaluate the performance of the text embedding model:
+Compile [gorse-bench](https://github.com/gorse-io/gorse/tree/master/cmd/gorse-bench) from Gorse repository and run the following command to evaluate the performance of the text embedding model:
 
 ```bash
-./gorse-cli bench-embedding --config ./config/config.toml \
+./gorse-bench embedding --config ./config/config.toml \
     --text-column item.Comment \
     --embedding-model qwen3-embedding:0.6b \
     --embedding-dimensions 1024 \
@@ -146,4 +146,4 @@ For text embedding models for recommender systems in 2026, we offer the followin
 - **Cost-Efficiency/Private Deployment**: [qwen3-embedding:4b](https://huggingface.co/Qwen/Qwen3-Embedding-4B) is the current king of cost-efficiency. It achieves recommendation accuracy comparable to commercial models with fewer parameters.
 - **Low Latency/Edge**: [qwen3-embedding:0.6b](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) with 64 or 128-dimension is the best lightweight solution.
 
-While this post provides some guidance, it is recommended to use [gorse-cli](https://github.com/gorse-io/gorse/tree/master/cmd/gorse-cli) to evaluate on your own dataset to choose the text embedding model that best fits your specific business scenario.
+While this post provides some guidance, it is recommended to use [gorse-bench](https://github.com/gorse-io/gorse/tree/master/cmd/gorse-bench) to evaluate on your own dataset to choose the text embedding model that best fits your specific business scenario.
diff --git a/src/posts/llm-ranker.md b/src/posts/llm-ranker.md
index 9978b8f..28f8191 100644
--- a/src/posts/llm-ranker.md
+++ b/src/posts/llm-ranker.md
@@ -64,14 +64,14 @@ After saving the recommendation flow, Gorse will load the recommendation flow de
 
 ## Evaluation
 
-The accuracy of the LLM-based reranker needs to be evaluated using the *gorse-cli* tool.
+The accuracy of the LLM-based reranker needs to be evaluated using the *gorse-bench* tool.
 
-1. Compile [gorse-cli](https://github.com/gorse-io/gorse/tree/master/cmd/gorse-cli) from Gorse repository.
-2. *gorse-cli* temporarily does not support recommendation flows defined by the RecFlow editor, so the recommendation workflow configuration needs to be written into the configuration file. Additionally, database access methods also need to be provided via the configuration file or environment variables.
+1. Compile [gorse-bench](https://github.com/gorse-io/gorse/tree/master/cmd/gorse-bench) from Gorse repository.
+2. *gorse-bench* temporarily does not support recommendation flows defined by the RecFlow editor, so the recommendation workflow configuration needs to be written into the configuration file. Additionally, database access methods also need to be provided via the configuration file or environment variables.
 3. Run the following command to evaluate the performance of an LLM-based reranker:
 
 ```bash
-./gorse-cli bench-llm --config config.toml
+./gorse-bench llm --config config.toml
 ```
 
 This tool will read the user's historical feedback and split the feedback into training and test sets in an 8:2 ratio. For each user, the query is rendered using positive feedback from the training set, and documents are rendered using feedback from the test set (including both positive and negative feedback).
Finally, Group AUC (GAUC)[^1] is used to calculate the ranking accuracy:
diff --git a/src/zh/posts/embedding-benchmark-2026.md b/src/zh/posts/embedding-benchmark-2026.md
index 5b31c8c..b1f2df2 100644
--- a/src/zh/posts/embedding-benchmark-2026.md
+++ b/src/zh/posts/embedding-benchmark-2026.md
@@ -8,7 +8,7 @@ tag:
 ---
 # 2026年哪个本文嵌入模型最适合推荐系统
 
-在2025年的文章[推荐场景下文本嵌入模型性能对比](./embedding-benchmark.md)中,我们评估了本文嵌入模型在相似推荐上的表现。在文章发布之后的半年内,阿里云和谷歌相继推出了新一代的开源本文嵌入模型,分别是阿里云的[qwen3-embedding](https://github.com/QwenLM/Qwen3-Embedding)和谷歌的[embeddinggemma](https://ai.google.dev/gemma/docs/embeddinggemma)。最近[gorse-cli](https://github.com/gorse-io/gorse/tree/master/cmd/gorse-cli)工具也新增了文本嵌入模型的基准测试功能,本文将使用[gorse-cli](https://github.com/gorse-io/gorse/tree/master/cmd/gorse-cli)和playground数据集,对热门的开源本文嵌入模型进行一次全面的评测。
+在2025年的文章[推荐场景下文本嵌入模型性能对比](./embedding-benchmark.md)中,我们评估了本文嵌入模型在相似推荐上的表现。在文章发布之后的半年内,阿里云和谷歌相继推出了新一代的开源本文嵌入模型,分别是阿里云的[qwen3-embedding](https://github.com/QwenLM/Qwen3-Embedding)和谷歌的[embeddinggemma](https://ai.google.dev/gemma/docs/embeddinggemma)。最近[gorse-bench](https://github.com/gorse-io/gorse/tree/master/cmd/gorse-bench)工具也新增了文本嵌入模型的基准测试功能,本文将使用[gorse-bench](https://github.com/gorse-io/gorse/tree/master/cmd/gorse-bench)和playground数据集,对热门的开源本文嵌入模型进行一次全面的评测。
 
 ## 评估方法:基于相似度的单样本推荐
 
@@ -40,10 +40,10 @@ OPENAI_BASE_URL="https://integrate.api.nvidia.com/v1"
 OPENAI_AUTH_TOKEN="NVIDIA_API_KEY"
 ```
 
-从代码仓库编译好[gorse-cli](https://github.com/gorse-io/gorse/tree/master/cmd/gorse-cli)运行以下命令评估文本嵌入模型的准确率:
+从代码仓库编译好[gorse-bench](https://github.com/gorse-io/gorse/tree/master/cmd/gorse-bench)运行以下命令评估文本嵌入模型的准确率:
 
 ```bash
-./gorse-cli bench-embedding --config ./config/config.toml \
+./gorse-bench embedding --config ./config/config.toml \
     --text-column item.Comment \
     --embedding-model qwen3-embedding:0.6b \
     --embedding-dimensions 1024 \
@@ -147,4 +147,4 @@ OPENAI_AUTH_TOKEN="NVIDIA_API_KEY"
 -
**追求高性价比/私有化部署**:[qwen3-embedding:4b](https://huggingface.co/Qwen/Qwen3-Embedding-4B)是目前的性价比之王。它以较小的参数量实现了媲美商业模型的推荐精度。
 - **低延迟/端侧场景**:[qwen3-embedding:0.6b](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B)结合64或128维向量,是最佳的轻量化方案。
 
-即使本文提供了一些建议,但是在实际选型时,建议使用[gorse-cli](https://github.com/gorse-io/gorse/tree/master/cmd/gorse-cli)在自己的数据集上进行评测,以选择最适合自己业务场景的文本嵌入模型。
+即使本文提供了一些建议,但是在实际选型时,建议使用[gorse-bench](https://github.com/gorse-io/gorse/tree/master/cmd/gorse-bench)在自己的数据集上进行评测,以选择最适合自己业务场景的文本嵌入模型。
diff --git a/src/zh/posts/llm-ranker.md b/src/zh/posts/llm-ranker.md
index 9d6cbcb..99df552 100644
--- a/src/zh/posts/llm-ranker.md
+++ b/src/zh/posts/llm-ranker.md
@@ -65,14 +65,14 @@ docker run -p 8088:8088 \
 
 ## 排序准确率评估
 
-大语言模型重排的准确率需要使用 *gorse-cli* 工具进行评估。
+大语言模型重排的准确率需要使用 *gorse-bench* 工具进行评估。
 
-1. 从代码仓库编译 [gorse-cli](https://github.com/gorse-io/gorse/tree/master/cmd/gorse-cli)
-2. *gorse-cli* 暂时不支持流程编辑器定义的推荐流程,因此需要将推荐工作流的配置写入配置文件中。另外,数据库的访问方式也需要通过配置文件或者环境变量提供。
+1. 从代码仓库编译 [gorse-bench](https://github.com/gorse-io/gorse/tree/master/cmd/gorse-bench)
+2. *gorse-bench* 暂时不支持流程编辑器定义的推荐流程,因此需要将推荐工作流的配置写入配置文件中。另外,数据库的访问方式也需要通过配置文件或者环境变量提供。
 3. 运行以下命令评估 大语言模型重排器 的准确率:
 
 ```bash
-./gorse-cli bench-llm --config config.toml
+./gorse-bench llm --config config.toml
 ```
 
 此工具会读取用户的历史反馈,将反馈按照8:2的比例划分为训练集和测试集。针对每个用户,使用训练集中的正反馈渲染查询,使用测试集中的反馈(包括正反馈和负反馈)渲染文档,最后使用 GAUC[^1] 计算排序准确率:
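
Reviewer note: the context lines in both `llm-ranker.md` hunks say the benchmark scores rankings with Group AUC (GAUC), but the formula itself falls outside this diff. As a rough illustration only, the sketch below computes a per-user AUC and averages it across users. The weighting by each user's number of test samples, the skipping of users with only one label class, and the names `auc`, `gauc`, and `rows` are all assumptions made for this example; they are not taken from the gorse-bench source.

```python
from collections import defaultdict


def auc(labels, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly.

    Ties count as half a win. Returns None when only one class is present,
    because AUC is undefined in that case.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gauc(rows):
    """Group AUC over rows of (user_id, label, score).

    Assumption: each user's AUC is weighted by that user's sample count,
    and users whose AUC is undefined are skipped.
    """
    by_user = defaultdict(lambda: ([], []))
    for user, label, score in rows:
        by_user[user][0].append(label)
        by_user[user][1].append(score)
    weighted_sum = 0.0
    total_weight = 0.0
    for labels, scores in by_user.values():
        user_auc = auc(labels, scores)
        if user_auc is None:
            continue  # user has only positives or only negatives
        weighted_sum += len(labels) * user_auc
        total_weight += len(labels)
    return weighted_sum / total_weight if total_weight else float("nan")


# Hypothetical test-set feedback: (user, label, reranker score).
rows = [
    ("u1", 1, 0.9), ("u1", 0, 0.1),
    ("u2", 1, 0.2), ("u2", 0, 0.8), ("u2", 0, 0.1),
    ("u3", 1, 0.5),  # only positive feedback, so this user is skipped
]
print(gauc(rows))
```

In this toy data, user `u1` is ranked perfectly and user `u2` gets one of two pairs right, so the weighted average lands between the two per-user values. Grouping by user keeps one user's score scale from being compared against another's.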