Snippet-based Conversational Recommender System

This repository contains the implementation of SnipRec, a novel conversational recommender system that leverages Large Language Models to extract atomic "snippets" from user-generated content (customer reviews) for representing item information and user preferences.

Directory Structure

├── config/                          # YAML configuration files
│   ├── experiments/                 # Experiment configurations
│   ├── recommender/                 # Recommender system configurations
│   └── user_simulator/              # User simulator configurations
├── data/                            # Dataset storage
│   ├── yelp_philadelphia/           # Yelp restaurant data
│   ├── amazon_books/                # Amazon book data
│   └── amazon_clothing/             # Amazon clothing data
├── data_preprocessing/              # Data filtering and preprocessing scripts
├── snippet_extraction/              # LLM-based snippet extraction
├── recommenders/                    # SnipRec implementation
├── baselines/                       # Baseline recommender systems
├── user_simulators/                 # User simulation for evaluation
├── scripts/                         # Experiment runner scripts
│   ├── run_experiments.py           # SnipRec experiments (GPU support)
│   └── run_baseline_experiments.py  # Baseline experiments
├── docs/                            # Detailed documentation
│   ├── preprocessing.md             # Data preprocessing guide
│   └── experiments.md               # Experiment execution guide

Setup

Install Dependencies

uv sync

Download Data

Yelp Reviews: https://business.yelp.com/data/resources/open-dataset/
Amazon Reviews: https://amazon-reviews-2023.github.io/

Place the downloaded datasets in data/{dataset}/raw/ directories.

Environment Configuration

Create a .env file with your OpenAI API key:

OPENAI_API_KEY=your_api_key_here

Set FIREWORKS_API_KEY if using Fireworks API.

See .env.example for reference.

Data Preprocessing Overview

The preprocessing pipeline consists of four main stages:

1. Data Filtering

Filter reviews and items based on quality metrics
Convert data formats (JSONL → Parquet) for efficient processing

2. User Simulation Preparation

Extract evaluation seeds: positive reviews from users with 10-99 reviews
Generate anonymized reviews to avoid answer leakage

3. Snippet Extraction

Use OpenAI Batch API to extract snippets (atomic propositions) from reviews
Create FAISS vector indices for retrieval

4. Baseline Setup

Create document-level indices using LlamaIndex
Split documents into sentences with spaCy and create FAISS indices

Detailed instructions: See docs/preprocessing.md

Experiments Overview

Experiments compare SnipRec against document-level, sentence-level, and snippet-level baseline approaches using simulated conversations.

Configuration

Experiments are configured by YAML files:

config/experiments/ → references → config/recommender/ + config/user_simulator/

Running Experiments

SnipRec:

CUDA_VISIBLE_DEVICES=0 python scripts/run_experiments.py config/experiments/yelp_philadelphia/sniprec/model-ask.gpt-4o-mini.v1.yaml --use-gpu

Baselines:

python scripts/run_baseline_experiments.py config/experiments/yelp_philadelphia/baseline/doc.gpt-4o-mini.v1.yaml

Parallel execution:

# Multiple GPUs
CUDA_VISIBLE_DEVICES=0,1,2 python scripts/run_experiments.py config.yaml -j 3 --use-gpu

# CPU parallel
python scripts/run_baseline_experiments.py config.yaml -j 3

Detailed instructions: See docs/experiments.md

Citation

If you use this code, please cite our paper:

@misc{sun2025snippetbasedconversationalrecommender,
      title={Snippet-based Conversational Recommender System},
      author={Haibo Sun and Naoki Otani and Hannah Kim and Dan Zhang and Nikita Bhutani},
      year={2025},
      eprint={2411.06064},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2411.06064},
}

Disclosures

This software may include, incorporate, or access open source software (OSS) components, datasets and other third party components, including those identified below. The license terms respectively governing the datasets and third-party components continue to govern those portions, and you agree to those license terms may limit any distribution, use, and copying. You may use any OSS components under the terms of their respective licenses, which may include BSD 3, Apache 2.0, and other licenses. In the event of conflicts between Megagon Labs, Inc. (“Megagon”) license conditions and the OSS license conditions, the applicable OSS conditions governing the corresponding OSS components shall prevail. You agree not to, and are not permitted to, distribute actual datasets used with the OSS components listed below. You agree and are limited to distribute only links to datasets from known sources by listing them in the datasets overview table below. You agree that any right to modify datasets originating from parties other than Megagon are governed by the respective third party’s license conditions. You agree that Megagon grants no license as to any of its intellectual property and patent rights. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS (INCLUDING MEGAGON) “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. You agree to cease using, incorporating, and distributing any part of the provided materials if you do not agree with the terms or the lack of any warranty herein. While Megagon makes commercially reasonable efforts to ensure that citations in this document are complete and accurate, errors may occur. If you see any error or omission, please help us improve this document by sending information to contact_oss@megagon.ai.

Dataset

All datasets used within the product are listed below (including their copyright holders and the license information).

For Datasets having different portions released under different licenses, please refer to the included source link specified for each of the respective datasets for identifications of dataset files released under the identified licenses.

ID	OSS Component Name	Modified	Copyright Holder	Upstream Link	License
1	Yelp Open Dataset	Yes	Yelp	link	Yelp Data Agreement
2	Amazon Reviews 2023	Yes	Amazon	link	MIT License

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Snippet-based Conversational Recommender System

Directory Structure

Setup

Install Dependencies

Download Data

Environment Configuration

Data Preprocessing Overview

1. Data Filtering

2. User Simulation Preparation

3. Snippet Extraction

4. Baseline Setup

Experiments Overview

Configuration

Running Experiments

Citation

Disclosures

Dataset

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
baselines		baselines
config		config
data_preprocessing		data_preprocessing
docs		docs
recommenders		recommenders
scripts		scripts
snippet_extraction		snippet_extraction
user_simulators		user_simulators
.env.example		.env.example
.gitignore		.gitignore
.python-version		.python-version
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
utils.py		utils.py
uv.lock		uv.lock

License

megagonlabs/sniprec

Folders and files

Latest commit

History

Repository files navigation

Snippet-based Conversational Recommender System

Directory Structure

Setup

Install Dependencies

Download Data

Environment Configuration

Data Preprocessing Overview

1. Data Filtering

2. User Simulation Preparation

3. Snippet Extraction

4. Baseline Setup

Experiments Overview

Configuration

Running Experiments

Citation

Disclosures

Dataset

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages