This repository is an official implementation of ItemRAG, an item-based retrieval-augmented generation technique for LLM-based recommendation.
- Title: ItemRAG: Item-based Retrieval-Augmented Generation for LLM-based Recommendation.
- Authors: Sunwoo Kim, Geon Lee, Kyungho Kim, Jaemin Yoo, and Kijung Shin.
- Venue: SIGIR 2026 (short paper)
In this work, we use the four datasets from https://amazon-reviews-2023.github.io/.
| Name | # Users | # Items |
|---|---|---|
| Sports & Outdoors | 25,363 | 15,701 |
| Toys & Games | 19,026 | 14,718 |
| Beauty & Personal Care | 45,490 | 31,151 |
| Arts, Crafts & Sewing | 24,511 | 18,884 |
The entire dataset, including (1) user-item interactions and (2) item titles, is available at the link below:
- Link: https://www.dropbox.com/scl/fo/olpz13hyfcdn5jg6a4tzy/AN1Na5w_ySyO4nnYFyYnRpE?rlkey=5i6iyloeq9fpa48tz29ztmaqj&st=ibnulph8&dl=0
- Description: Refer to the README.txt file within the link.
ItemRAG consists of the four steps below:
- Step 1: Find semantically similar items based on textual similarity, which is implemented in step1_get_similar_title_items.py
- Step 2: Retrieve relevant items, which is implemented in step2_do_retrieval.py
- Step 3: Generate a summary of the relevant items, which is implemented in step3_generate_summary.py
- Step 4: Perform final LLM-based recommendation, which is implemented in evaluation.py
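To make Step 1 more concrete, the sketch below ranks items by cosine similarity between title embeddings and keeps the top K neighbors of each item. It is an illustration under assumptions, not the repository's code: the function names `cosine` and `top_k_similar`, the toy 2-dimensional embeddings, and the choice of cosine similarity are all hypothetical, and step1_get_similar_title_items.py may compute textual similarity differently.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_similar(embeddings, k):
    """For each item, return the k other items with the highest cosine similarity.

    `embeddings` maps an item id to its title embedding (assumed precomputed).
    """
    result = {}
    for item, vec in embeddings.items():
        scored = [
            (cosine(vec, other_vec), other)
            for other, other_vec in embeddings.items()
            if other != item
        ]
        # Highest similarity first; ties are broken by item id for determinism.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        result[item] = [other for _, other in scored[:k]]
    return result


# Toy embeddings (hypothetical values, for illustration only).
toy = {
    "yoga mat": [1.0, 0.1],
    "exercise mat": [0.9, 0.2],
    "fishing rod": [0.1, 1.0],
}
neighbors = top_k_similar(toy, k=1)
```

Here `k` plays the role that '--K' plays in Step 1: it bounds how many similar items are kept per item.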
The repository is expected to have the following layout:
/
- dataset
- /dataset/sports_outdoors_full.pickle
- /dataset/toys_games_full.pickle
- /dataset/beauty_care_full.pickle
- /dataset/arts_full.pickle
- step1_get_similar_title_items.py
- step2_do_retrieval.py
- step3_generate_summary.py
- evaluation.py
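Before running any step, it can help to confirm that the files listed above are in place. The following is a minimal sketch of such a check; the helper name `missing_files` is hypothetical and not part of the repository, while the file names are copied from the hierarchy above.

```python
from pathlib import Path

# Relative paths copied from the expected repository hierarchy.
EXPECTED_FILES = [
    "dataset/sports_outdoors_full.pickle",
    "dataset/toys_games_full.pickle",
    "dataset/beauty_care_full.pickle",
    "dataset/arts_full.pickle",
    "step1_get_similar_title_items.py",
    "step2_do_retrieval.py",
    "step3_generate_summary.py",
    "evaluation.py",
]


def missing_files(root):
    """Return the expected files that are absent under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    # Check the current directory, assumed to be the repository root.
    for rel in missing_files("."):
        print(f"missing: {rel}")
```

Running the script from the repository root prints one line per missing file and prints nothing when the layout is complete.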
All scripts accept a --dataset argument, which specifies the name of the target dataset. It should be one of:
- Sports & Outdoors -> 'sports_outdoors'
- Toys & Games -> 'toys_games'
- Beauty & Personal Care -> 'beauty_care'
- Arts, Crafts & Sewing -> 'arts'
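The mapping above can be captured in a small lookup table. The sketch below also shows one way to validate the value with argparse and to derive the pickle path from it; `DATASET_KEYS`, `dataset_path`, and `build_parser` are hypothetical names, and the repository's scripts may parse and validate --dataset differently. The path pattern follows the file names in the repository hierarchy.

```python
import argparse

# Display name -> value passed to --dataset (taken from the list above).
DATASET_KEYS = {
    "Sports & Outdoors": "sports_outdoors",
    "Toys & Games": "toys_games",
    "Beauty & Personal Care": "beauty_care",
    "Arts, Crafts & Sewing": "arts",
}


def dataset_path(key):
    """Relative path of the pickle file for a --dataset value."""
    if key not in DATASET_KEYS.values():
        raise ValueError(f"unknown dataset: {key!r}")
    return f"dataset/{key}_full.pickle"


def build_parser():
    """Parser that rejects any --dataset value outside the four keys."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--dataset", choices=sorted(DATASET_KEYS.values()), required=True
    )
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--dataset", "toys_games"])
path = dataset_path(args.dataset)
```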
We detail each Python file:
- Step 1: Finding semantically similar items. '--device' specifies the GPU device used for the backbone language model. '--K' specifies the number of semantically similar items to find for each item.
python3 step1_get_similar_title_items.py --dataset sports_outdoors --device cuda:0 --K 5
- Step 2: Retrieving relevant items. '--K' indicates the number of items to retrieve.
python3 step2_do_retrieval.py --dataset sports_outdoors --K 50
- Step 3: Generating a summary. Note that a valid GPT API key must be set in the file before running it!
python3 step3_generate_summary.py --dataset sports_outdoors
- Step 4: Performing LLM-based recommendation.
python3 evaluation.py --dataset sports_outdoors
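The four commands above can be chained for one dataset with a small driver. This is a sketch rather than part of the repository: `build_commands` and `run_pipeline` are hypothetical helpers, and the default values for `device`, `k_similar`, and `k_retrieve` simply mirror the example commands, not necessarily the scripts' own defaults.

```python
import shlex
import subprocess


def build_commands(dataset, device="cuda:0", k_similar=5, k_retrieve=50):
    """Return the four step commands as argument lists, in pipeline order."""
    return [
        ["python3", "step1_get_similar_title_items.py",
         "--dataset", dataset, "--device", device, "--K", str(k_similar)],
        ["python3", "step2_do_retrieval.py",
         "--dataset", dataset, "--K", str(k_retrieve)],
        ["python3", "step3_generate_summary.py", "--dataset", dataset],
        ["python3", "evaluation.py", "--dataset", dataset],
    ]


def run_pipeline(dataset, dry_run=True, **kwargs):
    """Print each command; when dry_run is False, also execute it.

    check=True makes subprocess.run raise on a non-zero exit code,
    so later steps do not run after a failed one.
    """
    for cmd in build_commands(dataset, **kwargs):
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Dry run: only prints the commands, nothing is executed.
run_pipeline("sports_outdoors")
```

Calling `run_pipeline("sports_outdoors", dry_run=False)` from the repository root would execute the steps in order; Step 3 still requires the API key to be set in step3_generate_summary.py beforehand.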