Efficient Sequential Recommendation for Long Term User Interest Via Personalization

Changes to Original HLLM CodeBase

Rewrote the dataloader to support larger datasets and make data loading more efficient;
Deprecated CSV support in favor of the Parquet format;
Added support for efficient sequential modeling via attention masking.

Installation

This code can run on a devgpu without buck2. To do so, follow these steps:

0. Set up the proxy according to this wiki.
1. conda create -n hllm
2. conda activate hllm
3. conda install pip (conda install can be very slow).
4. If you are using Python 3.13, switch to Python 3.10: conda install python=3.10
5. conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia
6. pip install ninja (speeds up the flash_attn build).
7. pip install -r requirements.txt (flash_attn can be very slow to build).
8. pip install xformers (needed for Merrec only).
9. For Llama 3.1+: pip install transformers==4.43.1

Dataset

To support larger datasets and make data loading more efficient, CSV support has been deprecated in favor of the Parquet format. You can use the following snippet to convert CSV to Parquet:

import pandas as pd

# Read the CSV file
df = pd.read_csv("your_file.csv")

# Write the DataFrame to a Parquet file
df.to_parquet("your_file.parquet")

TODO: how to process Merrec and NB-NERD dataset.

Please make sure you have configured text_path, data_path, item_pretrain_dir and user_pretrain_dir accordingly.
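Before launching a job, it can help to confirm that these four settings point at real locations. The sketch below is only an illustration: check_paths is a hypothetical helper, not part of this repository, and it assumes each setting is a filesystem path given as a plain string.

```python
from pathlib import Path


def check_paths(settings):
    """Return the names of settings whose path does not exist.

    `settings` maps a setting name (e.g. "text_path") to a path string.
    This helper is hypothetical and not part of the repository.
    """
    required = ("text_path", "data_path", "item_pretrain_dir", "user_pretrain_dir")
    missing = []
    for name in required:
        value = settings.get(name)
        # A setting is a problem if it is unset or points nowhere.
        if not value or not Path(value).exists():
            missing.append(name)
    return missing
```

An empty return value means all four settings resolve to existing files or directories.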

Training

We use torchx as the helper to launch training and evaluation jobs. You can find the scripts under reproduce/. To train HLLM on PixelRec / Amazon Book Reviews, run the following command.

For multi-node training, set master_addr, master_port, nproc_per_node, nnodes and node_rank as environment variables.
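As a rough illustration of these settings, the sketch below reads the five values from the environment and checks that they are mutually consistent. It assumes the variables use exactly the lowercase names listed above, and it falls back to single-node defaults chosen here for illustration. read_dist_env is a hypothetical helper; how the launch scripts under reproduce/ actually consume these variables is not shown in this README.

```python
import os


def read_dist_env(env=None):
    """Collect multi-node settings from environment variables.

    Assumption: variables use the lowercase names from this README.
    The defaults describe a single-node, single-process run and are
    illustrative only.
    """
    env = os.environ if env is None else env
    cfg = {
        "master_addr": env.get("master_addr", "127.0.0.1"),
        "master_port": int(env.get("master_port", "29500")),
        "nproc_per_node": int(env.get("nproc_per_node", "1")),
        "nnodes": int(env.get("nnodes", "1")),
        "node_rank": int(env.get("node_rank", "0")),
    }
    if not 0 <= cfg["node_rank"] < cfg["nnodes"]:
        raise ValueError("node_rank must be in [0, nnodes)")
    if cfg["nproc_per_node"] < 1:
        raise ValueError("nproc_per_node must be at least 1")
    # Total number of worker processes across all nodes.
    cfg["world_size"] = cfg["nnodes"] * cfg["nproc_per_node"]
    return cfg
```

Every node shares the same master_addr, master_port, nproc_per_node and nnodes; only node_rank differs from node to node.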

All hyperparameters (except the model config) are listed in code/REC/utils/argument_list.py and can be passed through the CLI. Additional model hyperparameters are in IDNet/* or HLLM/*.

# Item and User LLM are initialized from their respective pretrain_dir.
# DeepSpeed is used for training by default (overall/LLM_deepspeed.yaml).
# --dataset: choose one of Pixel200K, Pixel1M, Pixel8M, amazon_books.
# --item_pretrain_dir / --user_pretrain_dir: set to the LLM directory.
# --text_path: use an absolute path to the text files.
# --text_keys: remove "tag" for the books dataset.
python3 main.py \
--config_file overall/LLM_deepspeed.yaml HLLM/HLLM.yaml \
--loss nce \
--epochs 5 \
--dataset {Pixel200K / Pixel1M / Pixel8M / amazon_books} \
--train_batch_size 16 \
--MAX_TEXT_LENGTH 256 \
--MAX_ITEM_LIST_LENGTH 10 \
--checkpoint_dir saved_path \
--optim_args.learning_rate 1e-4 \
--item_pretrain_dir item_pretrain_dir \
--user_pretrain_dir user_pretrain_dir \
--text_path text_path \
--text_keys '[\"title\", \"tag\", \"description\"]'

You can use --gradient_checkpointing True and --stage 3 with deepspeed to save memory.

You can also train ID-based models with the following command.

python3 main.py \
--config_file overall/ID.yaml IDNet/{hstu / sasrec / llama_id}.yaml \
--loss nce \
--epochs 201 \
--dataset {Pixel200K / Pixel1M / Pixel8M / amazon_books} \
--train_batch_size 64 \
--MAX_ITEM_LIST_LENGTH 10 \
--optim_args.learning_rate 1e-4

Some key args:

number_of_user_tokens: number of learnable tokens to be added.
insert_method: how to apply the learnable tokens. Use 0 to insert all of them after pretrain_item_num, or a JSON-encoded List[Tuple[offset, k]] to insert k user embeddings at offset of pos_emb. sum(k) must equal number_of_user_tokens.
pretrain_item_num: number of items to be used for the longer history. If set, the full item sequence is used for prediction.
model: HSTU or PersonalizedHSTU.
recent_item_num: number of items to be used for the recent history.
train_feat_max_length: the number of items loaded from the sequence for training and inference. If not provided, MAX_ITEM_LIST_LENGTH + 1 is used.
item_start_position: default 0. The index of the first item (inclusive) from train_feat_max_length to be used for training and inference.
item_end_position: default 0 or None. The index of the last item (inclusive) from train_feat_max_length to be used for training and inference.
MAX_ITEM_LIST_LENGTH: maximum length of the item sequence. This is also the maximum length the transformer can support. MAX_ITEM_LIST_LENGTH must equal pretrain_item_num + recent_item_num.
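The two constraints stated in this list can be checked before a run. The following is a minimal sketch, assuming insert_method is passed either as 0 (or the string "0") or as a JSON-encoded list of [offset, k] pairs. check_token_args is a hypothetical helper, not a function from this codebase, and representing the 0 case as a single pair at offset pretrain_item_num is an interpretation made here for illustration.

```python
import json


def check_token_args(number_of_user_tokens, insert_method,
                     pretrain_item_num, recent_item_num,
                     max_item_list_length):
    """Validate the documented relationships between the key args.

    Hypothetical helper: the parsing of insert_method is an assumption
    based on the description in this README.
    """
    # MAX_ITEM_LIST_LENGTH must cover the long history plus recent items.
    if max_item_list_length != pretrain_item_num + recent_item_num:
        raise ValueError(
            "MAX_ITEM_LIST_LENGTH must equal pretrain_item_num + recent_item_num")

    spec = json.loads(insert_method) if isinstance(insert_method, str) else insert_method
    if spec == 0:
        # Assumed reading of 0: all user tokens go after pretrain_item_num.
        return [(pretrain_item_num, number_of_user_tokens)]

    pairs = [(int(offset), int(k)) for offset, k in spec]
    if sum(k for _, k in pairs) != number_of_user_tokens:
        raise ValueError("sum(k) must equal number_of_user_tokens")
    return pairs
```

For example, with number_of_user_tokens set to 4, the string "[[0, 1], [5, 3]]" passes the check because 1 + 3 = 4, while "[[0, 1], [5, 2]]" is rejected.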

Inference

We cannot release the model weights at this time due to company policy, but we are working on it.

You can evaluate models with the following command (the same as training, but with --val_only True).

# Same flags as training; DeepSpeed is used by default (overall/LLM_deepspeed.yaml).
# --dataset: choose one of Pixel200K, Pixel1M, Pixel8M, amazon_books.
# --item_pretrain_dir / --user_pretrain_dir: set to the LLM directory.
# --text_path: use an absolute path to the text files.
# --text_keys: remove "tag" for the books dataset.
# --val_only True: add this for evaluation.
python3 main.py \
--config_file overall/LLM_deepspeed.yaml HLLM/HLLM.yaml \
--loss nce \
--epochs 5 \
--dataset {Pixel200K / Pixel1M / Pixel8M / amazon_books} \
--train_batch_size 16 \
--MAX_TEXT_LENGTH 256 \
--MAX_ITEM_LIST_LENGTH 10 \
--checkpoint_dir saved_path \
--optim_args.learning_rate 1e-4 \
--item_pretrain_dir item_pretrain_dir \
--user_pretrain_dir user_pretrain_dir \
--text_path text_path \
--text_keys '[\"title\", \"tag\", \"description\"]' \
--val_only True

Citation

If our work has been helpful to yours, feel free to give us a star ⭐ or cite us using:

TODO

Thanks to the excellent code repositories HLLM, RecBole, VisRec, PixelRec and HSTU! This repository is released under the Apache License 2.0. Some code is modified from HLLM, which is also released under the Apache License 2.0.
