greenwolf-nsk/yandex-cup-2022-recsys

Yandex Cup 2022: Like Prediction, 2nd place solution

This solution uses a two-stage recommender system: candidate selection with several different methods, followed by ranking with GBDT.

Environment and running

  • if you want to use cuDF, install the RAPIDS conda distribution from https://rapids.ai/start.html
  • on top of it, run pip install -r requirements.txt
  • running with USE_CUDF=0 may be problematic, because some places call cudf-only methods such as .to_pandas()
  • to run the full pipeline, put the unzipped files into data and run dvc repro
  • to run only training and CV, use dvc repro train_lightgbm_cv
  • you can also use dvc exp run with different params, e.g. dvc exp run -S train.working_dir=data/processed/sample runs the pipeline on a small sample of the data

Hardware

All experiments were run on a rig with 512GB RAM and an A100 GPU. The most memory-intensive step is model training, which takes ~250GB RAM at peak. The GPU is only needed for fast calculation of co-occurrence features with cudf; it's possible to use pandas instead (set env USE_CUDF=0), subject to the caveat noted under Environment and running. The full pipeline with inference takes ~8 hours if the stages are executed sequentially with a GPU.

Candidate selection

Next-item co-occurrence

  • build a dictionary {item: {next_item: count}} from all consecutive item pairs in the train and test data
  • take as candidates the most common next items for the last item in the user's history, then for the second-to-last item, and so on
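
The two steps above can be sketched as follows. This is a minimal illustration, not the repository's code: the function names, the `offset` parameter (0 = last item, 1 = second-to-last) and the representation of sessions as plain lists of item ids are all assumptions.

```python
from collections import Counter, defaultdict


def build_next_item_counts(sessions):
    """Build {item: {next_item: count}} from consecutive pairs in each session."""
    counts = defaultdict(Counter)
    for items in sessions:
        for current, following in zip(items, items[1:]):
            counts[current][following] += 1
    return counts


def next_item_candidates(counts, history, offset=0, top_n=300):
    """Most common followers of the item `offset` steps from the end of history."""
    if offset >= len(history):
        return []
    key = history[-1 - offset]
    # .get avoids creating empty entries for unseen items in the defaultdict
    return [item for item, _ in counts.get(key, Counter()).most_common(top_n)]


# Toy data: three sessions of item ids (hypothetical)
sessions = [[1, 2, 3], [1, 2, 4], [5, 2, 3]]
counts = build_next_item_counts(sessions)
candidates = next_item_candidates(counts, history=[7, 2], offset=0, top_n=2)
```

Calling `next_item_candidates` once per offset gives one candidate list per history position, which matches the "300 per rank" wording used for the final models.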

Smart co-occurrence

  • get candidates by aggregating (summing) co-occurrence counts with the last N items in the user's history
  • one thing I missed during the competition and checked afterwards: weighting by the rank of the action in history. Using 1 / (rank + 1) as the weight boosts recall by 10% and precision by 30% compared to an evenly weighted sum.
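
A rough sketch of the weighted aggregation is shown below. It assumes that rank 0 is the most recent action and that co-occurrence is counted inside a symmetric window around each item; the function names, the default window and the other defaults are illustrative choices, not values taken from the repository code.

```python
from collections import Counter, defaultdict


def build_window_counts(sessions, window=7):
    """Count co-occurrences of items that appear within +-window positions."""
    counts = defaultdict(Counter)
    for items in sessions:
        for i, item in enumerate(items):
            lo = max(0, i - window)
            hi = min(len(items), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[item][items[j]] += 1
    return counts


def smart_candidates(counts, history, last_n=16, top_n=1500, rank_weighted=True):
    """Sum co-occurrence counts over the last_n history items.

    With rank_weighted=True each history item contributes with weight
    1 / (rank + 1), where rank 0 is assumed to be the most recent action.
    """
    scores = Counter()
    recent = history[-last_n:][::-1]  # most recent item first
    for rank, item in enumerate(recent):
        weight = 1.0 / (rank + 1) if rank_weighted else 1.0
        for other, count in counts.get(item, {}).items():
            scores[other] += weight * count
    return scores.most_common(top_n)


sessions = [[1, 2, 3], [2, 3], [1, 4]]
counts = build_window_counts(sessions, window=1)
weighted = dict(smart_candidates(counts, history=[1, 2]))
uniform = dict(smart_candidates(counts, history=[1, 2], rank_weighted=False))
```

Setting `rank_weighted=False` reproduces the evenly weighted sum, so the two variants can be compared on the same candidate budget.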

Implicit BM25 (Item2Item)

  • train an item-item recommender and take the items most similar to the user's last item as candidates
  • other types of I2I models from implicit also work, but BM25 gives slightly better recall

Implicit ALS

  • use an implicit ALS model for candidates. I used recalculate_user=True, but using the real (trained) user factors could be a bit better

Popular items

  • since popular tracks change over time, popularity is counted using only the last items in user sessions
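
As an illustration of recency-aware popularity, the sketch below counts only the trailing items of each session. The `last_k` parameter is hypothetical: the source does not say how many trailing items per session were used.

```python
from collections import Counter


def popular_candidates(sessions, last_k=1, top_n=200):
    """Rank items by how often they appear among the last_k items of a session."""
    if last_k < 1:
        raise ValueError("last_k must be at least 1")
    counts = Counter()
    for items in sessions:
        counts.update(items[-last_k:])
    return [item for item, _ in counts.most_common(top_n)]


# Item 1 is frequent early in sessions but never appears at the end,
# so it is not treated as currently popular.
sessions = [[1, 2, 3], [1, 4, 3], [1, 3, 5], [1, 9, 5]]
popular = popular_candidates(sessions, last_k=1, top_n=2)
```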

Last artist items

  • recommend the top tracks of the artist of the user's last liked track
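
A possible shape for this candidate source, assuming a `track_artist` mapping from track id to artist id is available (the mapping name and both function names are made up for this sketch):

```python
from collections import Counter, defaultdict


def top_tracks_by_artist(sessions, track_artist):
    """Count likes per track, grouped by the track's artist."""
    per_artist = defaultdict(Counter)
    for items in sessions:
        for track in items:
            per_artist[track_artist[track]][track] += 1
    return per_artist


def last_artist_candidates(per_artist, track_artist, history, top_n=100):
    """Top tracks of the artist of the last track in the user's history."""
    if not history:
        return []
    artist = track_artist[history[-1]]
    return [t for t, _ in per_artist.get(artist, Counter()).most_common(top_n)]


track_artist = {1: "a", 2: "a", 3: "b"}  # hypothetical track -> artist map
sessions = [[1, 3], [1, 2], [3, 1]]
per_artist = top_tracks_by_artist(sessions, track_artist)
candidates = last_artist_candidates(per_artist, track_artist, history=[3, 2])
```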

Features

  • score and rank from each candidate engine
  • co-occurrence aggregated stats (mean, max, std, min)
  • ALS similarity aggregated stats
  • i2i similarity aggregated stats
  • item/artist statistics with different offsets (last 10 actions, last 50 actions, etc.)
  • user features: number of likes, unique artists, likes per artist
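
To make the "aggregated stats" features concrete, here is one way the co-occurrence variant could be computed for a single (user, candidate) pair: look up the candidate's co-occurrence count with every history item, then aggregate. The function name, the dictionary layout and the use of population standard deviation are assumptions; the same pattern would apply to ALS or i2i similarities.

```python
import statistics


def cooc_history_stats(counts, history, candidate):
    """Aggregate co-occurrence counts between a candidate and each history item."""
    values = [counts.get(item, {}).get(candidate, 0) for item in history]
    if not values:
        return {"mean": 0.0, "max": 0.0, "std": 0.0, "min": 0.0}
    return {
        "mean": statistics.fmean(values),
        "max": float(max(values)),
        "std": statistics.pstdev(values),  # population std, an arbitrary choice
        "min": float(min(values)),
    }


counts = {1: {9: 4}, 2: {9: 2}, 3: {}}  # toy {item: {other_item: count}}
stats = cooc_history_stats(counts, history=[1, 2, 3], candidate=9)
```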

Ranker

tl;dr - LightGBM with lambdarank objective. Some things to note:

  • 3-fold CV with averaged predictions
  • downsample negative items with rate 0.3 (e.g. keep 300k negatives out of 1M)
  • use a custom numba MRR implementation for early stopping
  • 100 early stopping rounds, 600 iterations on average
  • hyperparams were tuned once on a small subset of data, almost at the beginning; trying to change any of them later did not help
  • learning rate: 0.04
  • l1 reg: ~1
  • l2 reg: ~8
  • colsample: 0.6
  • subsample: 0.6
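
The sketch below illustrates three of the points above: negative downsampling, a grouped MRR metric, and the listed hyperparameters written as a LightGBM parameter dictionary. It is plain Python rather than the numba implementation mentioned above. The cutoff `k=100`, the row layout with a `label` key, `bagging_freq=1` and the mapping of "colsample"/"subsample" to `feature_fraction`/`bagging_fraction` are assumptions made for illustration.

```python
import random


def downsample_negatives(rows, rate=0.3, seed=0):
    """Keep every positive row and each negative row with probability `rate`."""
    rng = random.Random(seed)
    return [row for row in rows if row["label"] == 1 or rng.random() < rate]


def mean_reciprocal_rank(labels, scores, group_sizes, k=100):
    """MRR@k over consecutive groups (one group of candidates per user)."""
    total, start = 0.0, 0
    for size in group_sizes:
        group = sorted(
            zip(scores[start:start + size], labels[start:start + size]),
            key=lambda pair: -pair[0],
        )
        for rank, (_, label) in enumerate(group[:k], start=1):
            if label == 1:
                total += 1.0 / rank
                break
        start += size
    return total / len(group_sizes) if group_sizes else 0.0


# Hyperparameters from the list above, using LightGBM parameter names.
lgb_params = {
    "objective": "lambdarank",
    "learning_rate": 0.04,
    "lambda_l1": 1.0,
    "lambda_l2": 8.0,
    "feature_fraction": 0.6,  # "colsample"
    "bagging_fraction": 0.6,  # "subsample"
    "bagging_freq": 1,        # assumed; bagging is inactive when this is 0
}

mrr = mean_reciprocal_rank(
    labels=[0, 1, 0, 1, 0],
    scores=[0.9, 0.8, 0.1, 0.7, 0.2],
    group_sizes=[3, 2],
)
```

In the toy call, the first user's positive item is ranked second and the second user's is ranked first, so the metric averages 1/2 and 1.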

Final ensemble

The final submission is generated by blending 3 submission files with an inverse rank blend (see blend.py for an example).
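
For readers unfamiliar with the idea, an inverse rank blend can be sketched as follows: each submission contributes 1 / rank for every item it lists, and items are re-sorted by the summed score. This is a generic illustration and not a copy of blend.py; the function name, the optional per-submission weights, the tie-breaking rule and the `top_n` cutoff are assumptions.

```python
from collections import defaultdict


def inverse_rank_blend(submissions, weights=None, top_n=100):
    """Blend ranked lists; each submission maps user -> ranked list of items."""
    weights = weights or [1.0] * len(submissions)
    users = set().union(*submissions)  # union of user ids across submissions
    blended = {}
    for user in users:
        scores = defaultdict(float)
        for weight, submission in zip(weights, submissions):
            for rank, item in enumerate(submission.get(user, []), start=1):
                scores[item] += weight / rank
        # Sort by descending score; ties are broken by item id for determinism
        ranked = sorted(scores, key=lambda item: (-scores[item], item))
        blended[user] = ranked[:top_n]
    return blended


first = {"u": [1, 2, 3]}
second = {"u": [2, 1, 4]}
third = {"u": [2, 3, 1]}
blended = inverse_rank_blend([first, second, third])
```

Item 2 wins in the toy example because it is ranked first by two of the three lists, even though the first list prefers item 1.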

Features and LightGBM parameters were pretty much the same between all three models.

first (0.0849 lb, 0.0845 cv, 0.49 recall)

  • next-item co-occurrence candidates for history ranks 0, 1, 2, 3, 4, 5 (300 per rank)
  • default implicit ALS, 300 items
  • last item similar candidates, 300 items
  • 200 popular items
  • 100 last artist top items
  • LEFT JOIN CANDIDATES :D

second (0.0854 lb, 0.0852 cv, 0.62 recall)

  • next-item co-occurrence candidates for history ranks 0 and 1 (300 per rank)
  • 1500 "smart co-occurrence" candidates (co-occurrence calculated in a ±7 range, using the last 100 items in history)
  • default implicit ALS, 300 items
  • last item similar candidates, 300 items
  • 200 popular items

third (0.08608 lb, 0.0856 cv, 0.64 recall)

  • next-item co-occurrence candidates for history ranks 0 and 1 (300 per rank)
  • 1500 "smart co-occurrence" candidates (co-occurrence calculated in a ±7 range, using the last 16 items in history)
  • default implicit ALS, 300 items
  • last item similar candidates, 300 items
  • 200 popular items

Bonus - DVC pipeline flow chart

flowchart LR
        node1["calculate_als_candidates@test"]
        node2["calculate_als_candidates@val"]
        node3["calculate_artist_candidates@test"]
        node4["calculate_artist_candidates@val"]
        node5["calculate_cooc_candidates@test"]
        node6["calculate_cooc_candidates@val"]
        node7["calculate_cooc_smart_candidates@test"]
        node8["calculate_cooc_smart_candidates@val"]
        node9["calculate_cooc_stats"]
        node10["calculate_cooc_stats_for_smart"]
        node11["calculate_popular_candidates@test"]
        node12["calculate_popular_candidates@val"]
        node13["calculate_similar_candidates@test"]
        node14["calculate_similar_candidates@val"]
        node15["create_artist_features"]
        node16["create_item_features"]
        node17["create_submission"]
        node18["create_submission_cv"]
        node19["create_user_artist_features@test"]
        node20["create_user_artist_features@val"]
        node21["create_user_features@test"]
        node22["create_user_features@val"]
        node23["create_user_history_als_features@test"]
        node24["create_user_history_als_features@val"]
        node25["create_user_history_artist_features@test"]
        node26["create_user_history_artist_features@val"]
        node27["create_user_history_cooc_features@test"]
        node28["create_user_history_cooc_features@val"]
        node29["create_user_history_features@test"]
        node30["create_user_history_features@val"]
        node31["create_user_history_similarity_features@test"]
        node32["create_user_history_similarity_features@val"]
        node33["merge_candidates@test"]
        node34["merge_candidates@val"]
        node35["merge_candidates_and_features@test"]
        node36["merge_candidates_and_features@val"]
        node37["prepare_data"]
        node38["split_test_by_chunks"]
        node39["train_als_candidates"]
        node40["train_cooc_candidates"]
        node41["train_lightgbm"]
        node42["train_lightgbm_cv"]
        node43["train_popular_candidates"]
        node44["train_similar_candidates"]
        node1-->node33
        node2-->node34
        node5-->node33
        node6-->node34
        node7-->node33
        node8-->node34
        node9-->node27
        node9-->node28
        node10-->node7
        node10-->node8
        node11-->node33
        node12-->node34
        node13-->node33
        node14-->node34
        node15-->node35
        node15-->node36
        node16-->node35
        node16-->node36
        node19-->node35
        node20-->node36
        node21-->node35
        node22-->node36
        node23-->node35
        node24-->node36
        node27-->node35
        node28-->node36
        node29-->node35
        node30-->node36
        node31-->node35
        node32-->node36
        node33-->node23
        node33-->node25
        node33-->node27
        node33-->node31
        node33-->node35
        node34-->node24
        node34-->node26
        node34-->node28
        node34-->node32
        node34-->node36
        node35-->node17
        node35-->node38
        node36-->node41
        node36-->node42
        node37-->node1
        node37-->node2
        node37-->node3
        node37-->node4
        node37-->node5
        node37-->node6
        node37-->node7
        node37-->node8
        node37-->node9
        node37-->node10
        node37-->node11
        node37-->node12
        node37-->node13
        node37-->node14
        node37-->node15
        node37-->node16
        node37-->node19
        node37-->node20
        node37-->node21
        node37-->node22
        node37-->node23
        node37-->node24
        node37-->node25
        node37-->node26
        node37-->node27
        node37-->node28
        node37-->node29
        node37-->node30
        node37-->node31
        node37-->node32
        node37-->node39
        node37-->node40
        node37-->node41
        node37-->node43
        node37-->node44
        node38-->node18
        node39-->node1
        node39-->node2
        node39-->node23
        node39-->node24
        node40-->node5
        node40-->node6
        node41-->node17
        node42-->node18
        node43-->node11
        node43-->node12
        node44-->node13
        node44-->node14
        node44-->node31
        node44-->node32
