
PeterGriffinJin/LMIndexer

Language Models as Semantic Indexers

This repository contains the source code and datasets for Language Models as Semantic Indexers, ICML 2024.

Requirements

The code is written in Python 3.8. Before running it, install the required packages with the following command (using a virtual environment is recommended):

pip3 install -r requirements.txt

Overview

LMIndexer is a self-supervised framework that learns to tokenize documents into semantic IDs.

LMIndexer can be applied to various downstream tasks, including recommendation and retrieval.

Data Preparation

Download processed data. To reproduce the results in our paper, first download the processed datasets. Then put the dataset folders under data/rec-data/{data_name} (data_name=Beauty, Sports, Toys) and data/retrieval-data/{data_name} (data_name=NQ_aug, macro), respectively.
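As a quick sanity check after downloading, the layout described above can be verified with a small script. This is only a sketch: the folder names are taken from the paragraph above, while the helper names (`expected_dirs`, `missing_dirs`) and the default root `data` are assumptions for illustration and are not part of the repository.

```python
from pathlib import Path

# Folder names come from the README text above.
# The helper functions below are hypothetical, not part of LMIndexer.
EXPECTED = {
    "rec-data": ["Beauty", "Sports", "Toys"],
    "retrieval-data": ["NQ_aug", "macro"],
}


def expected_dirs(root="data"):
    """Return every dataset folder that should exist under `root`."""
    return [
        Path(root) / task / name
        for task, names in EXPECTED.items()
        for name in names
    ]


def missing_dirs(root="data"):
    """Return the expected dataset folders that are not present on disk."""
    return [p for p in expected_dirs(root) if not p.is_dir()]


if __name__ == "__main__":
    missing = missing_dirs()
    if missing:
        print("Missing dataset folders:")
        for path in missing:
            print(f"  {path}")
    else:
        print("All expected dataset folders are present.")
```

Run it from the repository root; it only checks that the folders exist and does not inspect the files inside them.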

Raw data & data processing. Raw data can be downloaded directly from Amazon-Recommendation, Amazon-Retrieval, NQ and MS MARCO. More details about the data processing for recommendation, product retrieval and document retrieval can be found here.

Learn Semantic IDs

The code is in SemanticID/. Please refer to the README.md in that folder.

Downstream Tasks

The code is in downstream/. Please refer to the README.md in that folder.

Citations

Please cite the following paper if you find the code helpful for your research.

@article{jin2023language,
  title={Language Models As Semantic Indexers},
  author={Jin, Bowen and Zeng, Hansi and Wang, Guoyin and Chen, Xiusi and Wei, Tianxin and Li, Ruirui and Wang, Zhengyang and Li, Zheng and Li, Yang and Lu, Hanqing and others},
  journal={arXiv preprint arXiv:2310.07815},
  year={2023}
}
