ranking algorithm: position unbiased option #4531

Closed
wants to merge 39 commits

Conversation

@robert-howley-zocdoc (Author) commented Aug 18, 2021

  • adds support for the position unbiased adjustments described in the Unbiased LambdaMART paper (see the sketch after this list)
    • this methodology attempts to correct for position bias in the result set
    • the implementation assumes queries are fed into training in the order in which they appeared
  • note for fellow practitioners ...
    • you'll often see lower ndcg@1 but higher ndcg at deeper cutoffs relative to standard (no bias correction) training
    • this isn't an unreasonable outcome of reducing position bias
  • this pr is loosely inspired by the author's implementation here
    • warning: interpreting that repo can be difficult because some parameters were either implicitly or explicitly fixed
    • happy to talk in more detail about the implications of the fixed sigmoid in that repo and how it translates
  • created a branch position_unbiased_notes that has more detailed comments on how the formulas of the paper line up with the code changes
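
For context, a rough sketch of the correction this PR implements, following the Unbiased LambdaMART paper (notation approximated from the paper; this is a summary, not the exact equations):

$\lambda_{ij} \leftarrow \frac{\lambda_{ij}}{t_i^{+}\, t_j^{-}}, \qquad t_i^{+} = \left(\frac{C_i^{+}}{C_0^{+}}\right)^{1/(1+p)}, \qquad t_j^{-} = \left(\frac{C_j^{-}}{C_0^{-}}\right)^{1/(1+p)}$

where $t_i^{+}$ and $t_j^{-}$ are the estimated click and non-click propensities at positions $i$ and $j$, $C_i^{\pm}$ are pairwise costs accumulated per position, and $p$ is bias_p_norm. In the diff, $t^{+}$ and $t^{-}$ correspond to i_biases_pow_ and j_biases_pow_, and the exponent $1/(1+p)$ is the position_bias_regularizer member.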

robhowley and others added 30 commits February 1, 2021 21:53
run tests on pr to position_debias
add unbiased config params
feat: rough migration of acbull's source
chore: give better name to lambdarank bias regularizer
remove unused param position_bins_, replaced by truncation_level
@robert-howley-zocdoc (Author) commented Sep 7, 2021

> Thanks very much for the contribution @robert-howley-zocdoc !
>
> Once you sign the CLA, we'd be happy to start providing a review. Ask here if you have any questions or concerns!

BTW, sorry for the long lapse. Getting (internal) approval on the CLA took a bit, and then I was on vacation for 2+ weeks. Back now and will be responsive.

@jameslamb (Collaborator)

> back and will be responsive.

no problem at all, we understand! If you're able to submit a separate PR with some of the recommended small changes like the .gitignore one (#4531 (comment)), you might be able to go faster here because you won't have to wait for maintainers to manually approve Continuous Integration runs (GitHub only applies that gate to first-time contributors in a repo).

I also want to set the expectation that I personally am probably not qualified to review whether the submitted code is algorithmically correct, and that we'll need a review from @guolinke or @shiyu1994 or @btrotta for that.

Thanks again for working with us on this great addition!

@jameslamb (Collaborator)

It's been about two months since this PR had any activity.

@shiyu1994 @guolinke @btrotta could one of you please provide a review? I'd like to bring this to a resolution before the 4.0 release.


@robert-howley-zocdoc could you update this to the latest master? I tried to do that myself tonight, but I think you've opted out of allowing maintainers to push to branches on your fork.

What I tried:
git clone \
    git@github.com:Zocdoc/LightGBM.git \
    zocdoc-lgb

cd zocdoc-lgb
git remote add upstream git@github.com:microsoft/LightGBM.git
git pull upstream master
git fetch origin position_unbiased
git checkout position_unbiased
git merge master
git push origin position_unbiased

ERROR: Permission to Zocdoc/LightGBM.git denied to jameslamb.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

@shiyu1994 (Collaborator)

Sorry for the late response. Will review this soon.

@thvasilo commented Dec 3, 2021

Hi @robert-howley-zocdoc, could you clarify this assumption a bit?

> implementation assumes queries are fed into training in the order in which they appeared

I was trying to determine how the document position enters the calculation. Does the assumption mean that, for a query/document group like:

idx  query_id     var1      var2      var3  relevance
0           0  0.624776  0.191463  0.598358          0
1           0  0.258280  0.658307  0.148386          0
2           0  0.893683  0.059482  0.340426          0
3           0  0.879514  0.526022  0.712648          1
4           0  0.188580  0.279471  0.062942          0

the implementation assumes that the record with idx=0 appeared first in the results, idx=1 second, etc.? So it assumes that the input dataset was created with presentation order in mind, correct? Otherwise we'd need to include another column indicating position in the calculation.
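
To make the assumption concrete, a minimal sketch (hypothetical illustration, not code from this PR; query_boundaries mirrors how LightGBM delimits query groups):

#include <cstdio>
#include <vector>

// Minimal sketch (hypothetical): under the contiguity assumption, a record's
// presentation position is simply its offset inside its query group; there is
// no explicit position column anywhere.
int main() {
  std::vector<int> query_boundaries = {0, 5};  // one query with 5 records, as in the table above
  for (int q = 0; q + 1 < static_cast<int>(query_boundaries.size()); ++q) {
    for (int i = query_boundaries[q]; i < query_boundaries[q + 1]; ++i) {
      int position = i - query_boundaries[q];  // idx=0 -> position 0, idx=1 -> position 1, ...
      std::printf("record idx=%d assumed shown at position %d\n", i, position);
    }
  }
  return 0;
}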

@thvasilo commented Dec 20, 2021

Hi @shiyu1994, based on the comment above, my limited testing of this PR, and a read through the paper/code, this looks correct to me. I'd still like to run a couple of tests with the same datasets as in the paper to see how close this gets to the results there.

One question I had: after this gets merged, one extension I'd like to work on is allowing the position to be a separate column that is available to the optimization process. The reason I'm suggesting this is that real-world data often has missing rows/data and other logging errors that would break the contiguity assumption of this implementation, leading to incorrect position biases being updated.

I wanted to get your early opinion on this potential contribution. Position data could be added as a metadata attribute, similar to query boundaries.
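
For concreteness, a hypothetical sketch of what that might look like (illustrative names only, not actual LightGBM API):

#include <vector>

// Hypothetical sketch: store an explicit logged position per record,
// parallel to the labels, the same way query boundaries are stored today.
struct RankingMetadata {
  std::vector<int> query_boundaries_;   // length num_queries + 1, as today
  std::vector<int> record_positions_;   // length num_data: logged display position of each record
};

// The position of record r within query q would then be
//   metadata.record_positions_[metadata.query_boundaries_[q] + r]
// rather than the implicit offset r itself, so gaps in logged result lists
// would no longer corrupt the bias estimates.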

@robhowley

@thvasilo

> the implementation assumes that the record with idx=0 appeared first in the results, idx=1 second, etc.? So it assumes that the input dataset was created with presentation order in mind, correct? Otherwise we'd need to include another column indicating position in the calculation.

Correct on all three. I use LightGBM via mmlspark, so I added a sortWithinPartitions call to Zocdoc's fork of that repo to take care of it. Your point about real-world data potentially having incomplete sets is completely valid.

@thvasilo commented Dec 22, 2021

Hi @robhowley, one thing I'm noticing in the debug logs is that while the position biases monotonically decrease for most positions, the cutoff position (truncation level) has a large bias; I even see it go above 1 for bias_i, whereas the biases for the positions just before it are very low, as expected. Any idea what could be causing that? What's special about the bias value at the cutoff?

Another thing I'm observing is that the bias_j values for positions 2 and 3 are > 1 (I've seen this for many values of p_norm). Have you observed that, or could this be a data issue (because of the real-world data problems I mentioned)?
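
(A note that may partly explain values above 1, derived from the bias update quoted later in this thread, $t_i = (C_i / C_0)^{1/(1+p)}$: the biases are normalized by position 0's accumulated cost, so $t_0 = 1$ by construction, and any position whose accumulated cost exceeds $C_0$ necessarily gets a bias greater than 1; nothing in the update clamps $t_i$ to $[0, 1]$.)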

Comment on lines +232 to +233
int debias_high_rank = static_cast<int>(std::min(high, truncation_level_ - 1));
int debias_low_rank = static_cast<int>(std::min(low, truncation_level_ - 1));
@thvasilo commented Jan 10, 2022

Hi, I am trying to investigate the issue I'm observing with the biases at the truncation cutoff spot, and I'm wondering if the thresholding here is the culprit.

Within the context of this implementation, the case where debias_high_rank gets the value truncation_level - 1 is when high > truncation_level - 1, i.e. sorted_idx[high_rank] > truncation_level - 1, i.e. sorted_idx[j] > truncation_level - 1.

Correct me if I'm wrong, but the only case where the above happens is when j is the high_rank; j starts from i+1, which can take a value greater than truncation_level, right?

I was wondering if it would be better to check the truncation level for both high and low ranks and only update the position biases when both are lower than truncation_level, instead of thresholding their values.

As I'm working on the update that introduces the position column as a metadata value, what I'm thinking is along the lines of:

const data_size_t high = sorted_idx[high_rank];
const data_size_t low = sorted_idx[low_rank];
data_size_t high_position, low_position;
if (unbiased_) {
  // record_positions_ is a vector of length data_size, with the position of each 
  // record in the data. start is the offset for the current query
  high_position = record_positions_[start + sorted_idx[high_rank]];
  low_position = record_positions_[start + sorted_idx[low_rank]];
} else {
  high_position = high;
  low_position = low;
}

if (unbiased_) {
  double p_cost = log(1.0f / (1.0f - p_lambda)) * delta_pair_NDCG;

  // more relevant (clicked) gets debiased by less relevant (unclicked), only if within truncation levels
  if (high_position < truncation_level_ && low_position < truncation_level_) {  // 0-based positions: valid bias indices are [0, truncation_level_)
    i_costs_buffer_[tid][high_position] += p_cost / j_biases_pow_[low_position];
    j_costs_buffer_[tid][low_position] += p_cost / i_biases_pow_[high_position];  // and vice versa
  }
}
// By default we set values of 1.0 as no-ops
double i_bias_pow = 1.0;
double j_bias_pow = 1.0;
// We only use actual bias values if they are both within the truncation limits
if (unbiased_ && high_position < truncation_level_ && low_position < truncation_level_) {
  i_bias_pow = i_biases_pow_[high_position];
  j_bias_pow = j_biases_pow_[low_position];
}
// update, either with the actual biases or with the 1.0 no-ops if at least one of the data points fell outside the truncation threshold
p_lambda *= -sigmoid_ * delta_pair_NDCG / i_bias_pow / j_bias_pow;
p_hessian *= sigmoid_ * sigmoid_ * delta_pair_NDCG / i_bias_pow / j_bias_pow;

Does this make sense? The main suggestion is to not "clamp" the bias positions into [0, truncation_level - 1], but rather to skip the bias updates when one of the document-pair items falls outside the truncation level (making all bias positions valid).

The difference between high/low and high_rank/low_rank is still a bit unclear to me. IIRC, we sort the records in a query by their score, and to establish high/low rank we compare their label values.

What I want to ensure is that, since my record_positions_ will follow the original data order, the extracted position corresponds to the expected record, which I think is what record_positions_[start + sorted_idx[high_rank]] accomplishes.

@shiyu1994 (Collaborator)

@thvasilo Thanks for reviewing this.
What confuses me here is that the index into the cost buffer is a truncation of the high and low variables. That doesn't make sense to me, because high and low are just document indices. According to equations (28)~(31) in the Unbiased LambdaMART paper, the truncation should be applied to the current rank given by the model, not to the document index, since with the truncation level we truncate the loss in equation (28) to consider only the top-ranked i's.

@shiyu1994 (Collaborator)

I believe static_cast<int>(std::min(high, truncation_level_ - 1)); and static_cast<int>(std::min(low, truncation_level_ - 1)); here should instead be static_cast<int>(std::min(high_rank, truncation_level_ - 1)); and static_cast<int>(std::min(low_rank, truncation_level_ - 1));.

@shiyu1994 (Collaborator) left a review

@robert-howley-zocdoc Thanks for the great contribution. Please kindly check my comments. The implementation is OK overall, except for some minor mistakes and inconsistencies with the Unbiased LambdaMART paper.

@@ -199,15 +222,26 @@ class LambdarankNDCG : public RankingObjective {
   // get delta NDCG
   double delta_pair_NDCG = dcg_gap * paired_discount * inverse_max_dcg;
   // regular the delta_pair_NDCG by score distance
-  if (norm_ && best_score != worst_score) {
+  if ((norm_ || unbiased_) && best_score != worst_score) {
@shiyu1994 (Collaborator)

It is unclear to me why, with unbiased LambdaMART, we must normalize the delta score here.


   // update
-  p_lambda *= -sigmoid_ * delta_pair_NDCG;
-  p_hessian *= sigmoid_ * sigmoid_ * delta_pair_NDCG;
+  p_lambda *= -sigmoid_ * delta_pair_NDCG / i_biases_pow_[debias_high_rank] / j_biases_pow_[debias_low_rank];
@shiyu1994 (Collaborator)

Suppose we are now just before starting iteration k + 1; then i_biases_pow_ and j_biases_pow_ would have been calculated based on the prediction results up to iteration k - 1. Only the newly updated i_biases_pow_ and j_biases_pow_ in UpdatePositionBiasesAndGradients, which runs after GetGradients, consider the prediction values up to iteration k. This is inconsistent with Algorithm 1 in the paper, though I don't think it should affect the performance much.
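
In pseudocode, the ordering being described (a sketch of this reading of the PR; the boosting-loop framing is an assumption, not quoted from the code):

// At boosting iteration k + 1, as read from this PR:
//   GetGradients(scores up to iteration k)
//     -> lambdas are divided by i_biases_pow_ / j_biases_pow_, which were
//        last refreshed from the scores up to iteration k - 1
//   UpdatePositionBiasesAndGradients()
//     -> biases are re-estimated from the costs accumulated at iteration k
// Algorithm 1 in the paper instead re-estimates the biases before the lambdas
// that use them, so the biases applied here lag by one iteration.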


j_biases_pow_[i] = pow(j_costs_[i] / j_costs_[0], position_bias_regularizer);
}

LogDebugPositionBiases();
@shiyu1994 (Collaborator)

Yes. Please remove this, or wrap the debug code in an #ifdef DEBUG block.
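
I.e., a sketch of what that could look like:

#ifdef DEBUG
  LogDebugPositionBiases();  // compiled only into debug builds
#endif  // DEBUG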

double bias_p_norm_;

/*! \brief Position bias regularizer exponent, 1 / (1 + bias_p_norm_) */
double position_bias_regularizer;
@shiyu1994 (Collaborator)

We'd better add a trailing underscore to the member name to be consistent with the rest of the code base, i.e. position_bias_regularizer_.

@thvasilo commented Jun 21, 2022

Hi @shiyu1994, if @robert-howley-zocdoc doesn't have the time to make the requested changes, I could do the final few changes on top of @robert-howley-zocdoc's great work to push this through. Then I can create a follow-up PR with my extensions that enable a custom position column.

@StrikerRUS (Collaborator)

@thvasilo Sounds great!

@shiyu1994 (Collaborator)

@thvasilo Thanks! That would be great! Do you have permission to push to this ongoing branch? If not, let's figure out how to make that possible for you.

@robhowley

Yeah, sorry for not being on top of this. I don't work at Zocdoc anymore, which makes allocating time to this a little tricky. Feel free to take over as needed.

@shiyu1994 (Collaborator)

I'm picking this up. But it seems that I don't have permission to push to Zocdoc/LightGBM, so I'll merge this into a branch in our repo and open a new PR.

@jameslamb (Collaborator)

Thanks @shiyu1994, yeah, I agree with that approach! Thanks @robhowley for getting this work to this point, and for agreeing to let us pick it up and carry it forward. Really, really appreciate it!

@shiyu1994 (Collaborator)

Closing this and moving to #5393.

@github-actions (bot)

This pull request has been automatically locked since there has not been any recent activity since it was closed.
To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues
including a reference to this.
