
WSDM Cup 2023 Source Code


  • This repo contains the source code of our solutions for the WSDM Cup 2023 tasks: Pre-training for Web Search and Unbiased Learning for Web Search.
  • In the Pre-training task, we provide implementations in both PyTorch and PaddlePaddle (you can pretrain & finetune in either framework).
  • In the Unbiased LTR task, the implementation is in PyTorch.
  • All checkpoints are available here: Download

Paper released

Please refer to our paper for the details of our solution in this competition:

Below are the details for the Pre-training task. For the Unbiased LTR task, see its README.md.

Method Overview

  • Pre-training BERT with an MLM loss and a CTR prediction loss (or a multi-task CTR prediction loss); a minimal loss sketch follows below.
  • Finetuning BERT with a pairwise ranking loss (sketched under Finetuning).
  • Obtaining prediction scores from the different finetuned BERTs.
  • Ensemble learning to combine BERT features and sparse features.

Details can be found in the paper above.
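
For intuition, here is a minimal PyTorch sketch of combining the MLM loss with a CTR prediction loss in one pretraining objective. The head shapes, label conventions, and loss weighting are illustrative assumptions, not the exact code in pytorch_pretrain.

```python
import torch
import torch.nn.functional as F

def pretrain_loss(mlm_logits, mlm_labels, ctr_logits, clicks, ctr_weight=1.0):
    """Masked-LM cross-entropy plus a click-through-rate prediction loss.

    mlm_logits: (batch, seq_len, vocab) token predictions
    mlm_labels: (batch, seq_len) target ids, -100 at unmasked positions
    ctr_logits: (batch,) click logits from the [CLS] head
    clicks:     (batch,) observed 0/1 click labels
    """
    mlm_loss = F.cross_entropy(
        mlm_logits.view(-1, mlm_logits.size(-1)),
        mlm_labels.view(-1),
        ignore_index=-100,  # unmasked positions do not contribute
    )
    ctr_loss = F.binary_cross_entropy_with_logits(ctr_logits, clicks.float())
    return mlm_loss + ctr_weight * ctr_loss
```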

Training

  • In all start.sh files in pytorch_pretrain (or paddle_pretrain), set data_root={Place Your Data Root Path Here}/baidu_ultr to your data path.
  • Set NPROC to the number of GPUs you use.

1) Pretraining

cd pytorch_pretrain/pretrain (or paddle_pretrain/pretrain)
sh start.sh

2) Finetuning

cd pytorch_pretrain/finetune (or paddle_pretrain/finetune)
sh start.sh
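
As a reference for the finetuning objective, here is a minimal sketch of a pairwise ranking loss; the margin-free logistic form and the tensor shapes are assumptions for illustration, not the repo's exact training code.

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(pos_scores: torch.Tensor,
                          neg_scores: torch.Tensor) -> torch.Tensor:
    """Pushes the score of the clicked (positive) document above the
    unclicked (negative) one for the same query.

    pos_scores / neg_scores: (batch,) relevance scores from the BERT ranker.
    """
    # -log sigmoid(s_pos - s_neg) == softplus(s_neg - s_pos), batch-averaged
    return F.softplus(neg_scores - pos_scores).mean()

# Toy usage with random scores standing in for BERT outputs.
pos = torch.randn(8, requires_grad=True)
neg = torch.randn(8)
loss = pairwise_ranking_loss(pos, neg)
loss.backward()
```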

3) Inference for submission

cd pytorch_pretrain/submit (or paddle_pretrain/submit)
sh start.sh

Ensemble learning

We use LambdaMART, as implemented in LightGBM, to ensemble the scores from the finetuned BERT models; a minimal sketch appears at the end of this section.

Sparse features:

  • query length
  • document length
  • query frequency
  • number of query terms that appear in the document
  • BM25 score (see the sketch after this list)
  • TF-IDF score
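
For reference, a minimal BM25 sketch over pre-tokenized text; the k1/b defaults and the tokenization are illustrative assumptions, not necessarily the settings used to build the competition features.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Okapi BM25 for one query-document pair.

    doc_freq:    dict term -> number of documents containing the term
    num_docs:    total documents in the collection
    avg_doc_len: average document length in tokens
    """
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        if term not in tf:
            continue
        df = doc_freq.get(term, 0)
        idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1.0)
        denom = tf[term] + k1 * (1 - b + b * len(doc_terms) / avg_doc_len)
        score += idf * tf[term] * (k1 + 1) / denom
    return score
```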

BERT features:

1) Model details: Checkpoints Download Here
| Index | Model Flag | Method | Pretrain step | Finetune step | DCG on leaderboard |
|-------|------------|--------|---------------|---------------|--------------------|
| 1 | large_group2_wwm_from_unw4625K | M1 | 1700K | 5130 | 11.96214 |
| 2 | large_group2_wwm_from_unw4625K | M1 | 1700K | 5130 | NAN |
| 3 | base_group2_wwm | M2 | 2150K | 5130 | ~11.32363 |
| 4 | large_group2_wwm_from_unw4625K | M1 | 590K | 5130 | 11.94845 |
| 5 | large_group2_wwm_from_unw4625K | M1 | 1700K | 4180 | NAN |
| 6 | large_group2_mt_pretrain | M3 | 1940K | 5130 | NAN |
2) Method details
| Method | Model Layers | Details |
|--------|--------------|---------|
| M1 | 24 | WWM & CTR prediction as pretraining tasks |
| M2 | 12 | WWM & CTR prediction as pretraining tasks |
| M3 | 24 | WWM & multi-task CTR prediction as pretraining tasks |

The procedure contains two steps:

  1. Cross-validation on the validation set to determine the best parameters. See ./lambdamart/cross_validation.ipynb.
  2. Generate the final scores with the parameters determined in step 1. See ./lambdamart/run.ipynb.
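
A minimal sketch of the LambdaMART setup in LightGBM, assuming a feature matrix X (BERT scores plus sparse features), graded relevance labels y, and per-query group sizes; the hyperparameters are illustrative, not the values tuned in cross_validation.ipynb.

```python
import numpy as np
import lightgbm as lgb

# Hypothetical data: 100 queries with 10 candidate docs each and
# 12 features per doc (6 BERT scores + 6 sparse features).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 12))
y = rng.integers(0, 5, size=1000)   # graded relevance labels
group = [10] * 100                  # docs per query, in row order

train_set = lgb.Dataset(X, label=y, group=group)
params = {
    "objective": "lambdarank",      # LambdaMART-style ranking objective
    "metric": "ndcg",
    "eval_at": [10],
    "learning_rate": 0.05,
    "num_leaves": 31,
}
booster = lgb.train(params, train_set, num_boost_round=100)
scores = booster.predict(X)         # ensemble scores used for ranking
```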

Reproduce results on leaderboard

1) Convert Torch checkpoint to Paddle checkpoint.

  • Install X2Paddle and onnxsim
  • We use the first method in this link, with the following three commands:
python ./paddle_pretrain/convert/convert-onnx.py 
python -m onnxsim model.onnx model_sim.onnx
x2paddle --framework=onnx --model=model_sim.onnx --save_dir=./pd_model

This outputs a folder named ./pd_model containing x2paddle.py (the model definition in PaddlePaddle) and model.pdparams (the trained parameters). Copy x2paddle.py to ./paddle_pretrain/review/x2paddle.py. We already provide a generated file there that converts a 24-layer model; you can use it directly if you always use a 24-layer model.
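
For orientation, a minimal sketch of the Torch-to-ONNX export that ./paddle_pretrain/convert/convert-onnx.py performs; the model, input shapes, and opset here are illustrative assumptions rather than the script's exact contents.

```python
import torch
from transformers import BertModel

# Hypothetical 24-layer ranker standing in for the trained checkpoint.
model = BertModel.from_pretrained("bert-large-uncased")
model.config.return_dict = False    # tuple outputs trace more cleanly
model.eval()

# Dummy inputs fix the traced (batch, seq_len) shape.
input_ids = torch.ones(1, 128, dtype=torch.long)
attention_mask = torch.ones(1, 128, dtype=torch.long)

torch.onnx.export(
    model,
    (input_ids, attention_mask),
    "model.onnx",                   # then: onnxsim -> x2paddle
    input_names=["input_ids", "attention_mask"],
    output_names=["output"],
    opset_version=11,
)
```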

2) Inference score for each bert model

  • Modify data_root to your path in ./paddle_pretrain/review/start.sh and run it with sh start.sh.
  • It uses the PaddlePaddle framework to infer the score of each query-document pair.

3) Ensemble learning

  • We have already inferred the scores of the 6 models; they are all contained in ./lambdamart/features.
  • Run all cells in ./lambdamart/run.ipynb. This reproduces the scores of our final submission by ensembling all model scores, matching ./lambdamart/features/final_result_submit.csv.

Environment

We open-source Docker images for both PyTorch and PaddlePaddle to save you environment configuration time.

| Version | Key configuration |
|---------|-------------------|
| PyTorch | Python 3.6, torch 1.8.0, transformers 4.18.0 |
| PaddlePaddle | Python 3.9, Paddle 2.4, CUDA 11.2 + cuDNN 8.2 |
To be updated.

Contacts
