This repository contains the implementation of SnipRec, a novel conversational recommender system that leverages Large Language Models to extract atomic "snippets" from user-generated content (customer reviews) for representing item information and user preferences.
├── config/ # YAML configuration files
│ ├── experiments/ # Experiment configurations
│ ├── recommender/ # Recommender system configurations
│ └── user_simulator/ # User simulator configurations
├── data/ # Dataset storage
│ ├── yelp_philadelphia/ # Yelp restaurant data
│ ├── amazon_books/ # Amazon book data
│ └── amazon_clothing/ # Amazon clothing data
├── data_preprocessing/ # Data filtering and preprocessing scripts
├── snippet_extraction/ # LLM-based snippet extraction
├── recommenders/ # SnipRec implementation
├── baselines/ # Baseline recommender systems
├── user_simulators/ # User simulation for evaluation
├── scripts/ # Experiment runner scripts
│ ├── run_experiments.py # SnipRec experiments (GPU support)
│ └── run_baseline_experiments.py # Baseline experiments
├── docs/ # Detailed documentation
│ ├── preprocessing.md # Data preprocessing guide
│ └── experiments.md # Experiment execution guide
uv sync
- Yelp Reviews: https://business.yelp.com/data/resources/open-dataset/
- Amazon Reviews: https://amazon-reviews-2023.github.io/
Place the downloaded datasets in data/{dataset}/raw/ directories.
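As a convenience, the expected raw-data directories can be created up front. The sketch below is not part of the repository: the helper names `raw_dirs` and `ensure_raw_dirs` are hypothetical, and the dataset names are taken from the directory tree above.

```python
from pathlib import Path

# Dataset directory names as listed in the project structure above.
DATASETS = ["yelp_philadelphia", "amazon_books", "amazon_clothing"]


def raw_dirs(root="data"):
    """Return the data/{dataset}/raw paths for every known dataset."""
    return [Path(root) / name / "raw" for name in DATASETS]


def ensure_raw_dirs(root="data"):
    """Create any missing raw directories and return all of them."""
    dirs = raw_dirs(root)
    for d in dirs:
        # parents=True creates data/ and data/{dataset}/ when needed;
        # exist_ok=True makes the call safe to repeat.
        d.mkdir(parents=True, exist_ok=True)
    return dirs


if __name__ == "__main__":
    for d in ensure_raw_dirs():
        print(d)
```

Running it from the repository root leaves existing files untouched and only adds directories that are missing.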
Create a .env file with your OpenAI API key:
OPENAI_API_KEY=your_api_key_here
Set FIREWORKS_API_KEY if using Fireworks API.
See .env.example for reference.
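Before starting a long extraction or experiment run, it can help to confirm that the required keys are actually set. The following is a minimal standard-library sketch, not the repository's own loader: `read_env_file` and `missing_keys` are hypothetical helpers, and the parser only handles plain KEY=VALUE lines.

```python
import os
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    p = Path(path)
    if not p.exists():
        return values
    for line in p.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(required, env_file=".env", environ=None):
    """Return the required keys found neither in the environment nor in the file."""
    environ = os.environ if environ is None else environ
    file_values = read_env_file(env_file)
    return [k for k in required if not (environ.get(k) or file_values.get(k))]


if __name__ == "__main__":
    missing = missing_keys(["OPENAI_API_KEY"])
    if missing:
        raise SystemExit("Missing API keys: " + ", ".join(missing))
```

Add FIREWORKS_API_KEY to the required list only when the Fireworks API is in use.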
The preprocessing pipeline consists of four main stages:
- Filter reviews and items based on quality metrics
- Convert data formats (JSONL → Parquet) for efficient processing
- Extract evaluation seeds: positive reviews from users with 10-99 reviews
- Generate anonymized reviews to avoid answer leakage
- Use OpenAI Batch API to extract snippets (atomic propositions) from reviews
- Create FAISS vector indices for retrieval
- Create document-level indices using LlamaIndex
- Split documents into sentences with spaCy and create FAISS indices
Detailed instructions: See docs/preprocessing.md
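To illustrate the evaluation-seed step, here is a rough sketch of selecting positive reviews from users with 10-99 reviews. It is an illustration under assumptions rather than the repository's code: the field names `user_id` and `rating`, the function `select_seed_reviews`, and the 4.0 threshold for "positive" are all hypothetical, so check docs/preprocessing.md for the actual criteria.

```python
from collections import Counter


def select_seed_reviews(reviews, min_reviews=10, max_reviews=99, min_rating=4.0):
    """Keep positive reviews written by users with a moderate review count.

    `reviews` is an iterable of dicts with "user_id" and "rating" keys
    (assumed field names). The rating threshold is an assumption.
    """
    reviews = list(reviews)
    # Count how many reviews each user wrote across the whole dataset.
    counts = Counter(r["user_id"] for r in reviews)
    return [
        r
        for r in reviews
        if min_reviews <= counts[r["user_id"]] <= max_reviews
        and r["rating"] >= min_rating
    ]
```

Counting over the full review list before filtering by rating matters here: a user's eligibility depends on their total activity, not on how many positive reviews they wrote.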
Experiments compare SnipRec against document-level, sentence-level, and snippet-level baseline approaches using simulated conversations.
Experiments are configured by YAML files:
config/experiments/ → references → config/recommender/ + config/user_simulator/
SnipRec:
CUDA_VISIBLE_DEVICES=0 python scripts/run_experiments.py config/experiments/yelp_philadelphia/sniprec/model-ask.gpt-4o-mini.v1.yaml --use-gpu
Baselines:
python scripts/run_baseline_experiments.py config/experiments/yelp_philadelphia/baseline/doc.gpt-4o-mini.v1.yaml
Parallel execution:
# Multiple GPUs
CUDA_VISIBLE_DEVICES=0,1,2 python scripts/run_experiments.py config.yaml -j 3 --use-gpu
# CPU parallel
python scripts/run_baseline_experiments.py config.yaml -j 3
Detailed instructions: See docs/experiments.md
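The multi-GPU command pairs three visible devices with `-j 3`. How scripts/run_experiments.py distributes work is not described here, so the sketch below only shows one plausible scheme, round-robin assignment of workers to visible devices. The function names `visible_gpus` and `assign_devices` are hypothetical.

```python
import os


def visible_gpus(env=None):
    """Parse CUDA_VISIBLE_DEVICES into a list of device id strings."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    return [d.strip() for d in raw.split(",") if d.strip()]


def assign_devices(num_workers, gpus):
    """Assign each worker a device round-robin; None means run on CPU."""
    if not gpus:
        return [None] * num_workers
    return [gpus[i % len(gpus)] for i in range(num_workers)]


if __name__ == "__main__":
    # With CUDA_VISIBLE_DEVICES=0,1,2 and three workers, each worker
    # is paired with its own device under this scheme.
    for worker, device in enumerate(assign_devices(3, visible_gpus())):
        print(f"worker {worker} -> {device if device is not None else 'cpu'}")
```

Under this scheme, requesting more workers than devices makes several workers share a GPU, so matching `-j` to the number of visible devices keeps one job per device.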
If you use this code, please cite our paper:
@misc{sun2025snippetbasedconversationalrecommender,
title={Snippet-based Conversational Recommender System},
author={Haibo Sun and Naoki Otani and Hannah Kim and Dan Zhang and Nikita Bhutani},
year={2025},
eprint={2411.06064},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2411.06064},
}
This software may include, incorporate, or access open source software (OSS) components, datasets and other third party components, including those identified below. The license terms respectively governing the datasets and third-party components continue to govern those portions, and you agree to those license terms may limit any distribution, use, and copying. You may use any OSS components under the terms of their respective licenses, which may include BSD 3, Apache 2.0, and other licenses. In the event of conflicts between Megagon Labs, Inc. (“Megagon”) license conditions and the OSS license conditions, the applicable OSS conditions governing the corresponding OSS components shall prevail. You agree not to, and are not permitted to, distribute actual datasets used with the OSS components listed below. You agree and are limited to distribute only links to datasets from known sources by listing them in the datasets overview table below. You agree that any right to modify datasets originating from parties other than Megagon are governed by the respective third party’s license conditions. You agree that Megagon grants no license as to any of its intellectual property and patent rights. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS (INCLUDING MEGAGON) “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
You agree to cease using, incorporating, and distributing any part of the provided materials if you do not agree with the terms or the lack of any warranty herein. While Megagon makes commercially reasonable efforts to ensure that citations in this document are complete and accurate, errors may occur. If you see any error or omission, please help us improve this document by sending information to contact_oss@megagon.ai.
All datasets used within the product are listed below (including their copyright holders and the license information).
For Datasets having different portions released under different licenses, please refer to the included source link specified for each of the respective datasets for identifications of dataset files released under the identified licenses.
| ID | OSS Component Name | Modified | Copyright Holder | Upstream Link | License |
|---|---|---|---|---|---|
| 1 | Yelp Open Dataset | Yes | Yelp | link | Yelp Data Agreement |
| 2 | Amazon Reviews 2023 | Yes | Amazon | link | MIT License |