Training LightGBMRanker several times gives different NDCG on testing set #580

daureg · 2019-06-05T16:02:12Z

I noticed that when I train on Databricks several times with the same parameters and the same data, the resulting models don't give the same predictions, as evidenced by different NDCG values on a separate test set.
Here is my training function. My training set has 400K examples in 5K lists, with 60 features:

def train(): Unit = {
  val lgbm = new LightGBMRanker()
    .setCategoricalSlotIndexes(Array(0, 2, 3, 4, 6, 7, 8, 59))
    .setFeaturesCol("features")
    .setGroupCol("query_id")
    .setLabelCol("label")
    .setMaxPosition(10)
    .setParallelism("voting")
    .setNumIterations(15)
    .setMaxDepth(4)
    .setNumLeaves(12)
  val training = table("training")
  val model = lgbm.fit(training)
}

Is that inherent to distributed training (on 5 executors) or should I change some parameters of my LightGBMRanker instance?

daniloascione · 2019-06-05T16:20:40Z

If the table is repartitioned to 1 partition (table(s"training").repartition(1)), then the results are consistent, but this means no parallelism.

imatiach-msft · 2019-06-05T16:20:48Z

@daureg thank you for reporting this issue. This looks similar to #564.
I will need to investigate this problem more to figure out the root cause of the randomness; I'm not sure if it is fixable. It's on my todo list now, but not as high priority as #569 and #483.
Does one model always give the same predictions? Or is it only different models trained on the same data?

daureg · 2019-06-05T16:29:00Z

Indeed, it's the same as #564 (unless there is something specific to the ranker, but most likely not). I will also try predicting several times with the same model, but for now it's different models trained on the same data.

daniloascione · 2019-06-05T16:56:10Z

@imatiach-msft maybe we need to ensure that each partition gets all the elements of the same group, and to enforce the group sorting by adding a sortWithinPartitions here https://github.com/Azure/mmlspark/blob/master/src/lightgbm/src/main/scala/LightGBMBase.scala#L45 (similarly to https://github.com/Azure/mmlspark/blob/master/src/lightgbm/src/main/scala/LightGBMRanker.scala#L67)

imatiach-msft · 2019-06-05T17:12:35Z

@daniloascione yes, that was something I was going to add later. I'm not sure whether it should be a separate utility or be done in the ranker itself (which may hurt performance significantly, since it would incur a shuffle across partitions). Note that it wouldn't go into LightGBMBase, because that is the base class for the classifier and regressor as well, and this is needed only for the ranker. I sort in LightGBMRanker so that the groups are ordered, but I don't ensure that a group doesn't cross partitions; as you said, in the ranker case each group should be in only one partition. I'm not sure whether this is related to your specific issue, though. Even if each group is confined to a single partition, you may still get different results from run to run, although the difference from model to model should at least be smaller.

imatiach-msft · 2019-06-06T17:40:36Z

@daniloascione @daureg just out of curiosity, how are you computing the NDCG? I would like to add an evaluator for LightGBMRanker, similar to the Spark ML evaluators and MLlib metrics. Does one already exist? I couldn't find anything in Spark ML.

daniloascione · 2019-06-07T07:36:55Z

@imatiach-msft I tried to add ranking metrics to Spark ML in the past (apache/spark#16618 and https://issues.apache.org/jira/browse/SPARK-14409), but things got stuck for several reasons. Currently, we are using a UDF-based implementation of NDCG, which is similar to this one: http://lobotomys.blogspot.com/2016/08/normalised-discounted-cumulative-gain.html
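For readers who need a starting point, here is a minimal sketch of NDCG@k for a single query in plain Python. It is not the UDF mentioned above. The exponential gain (2^rel - 1) and the log2 position discount are assumptions (a common convention), and the function names are hypothetical.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k items.

    Assumes exponential gain (2^rel - 1) and a log2 position discount.
    """
    return sum(
        (2.0 ** rel - 1.0) / math.log2(position + 2)
        for position, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances_in_predicted_order: Sequence[float], k: int) -> float:
    """NDCG@k for one query.

    The input holds the true relevance labels, ordered by descending
    predicted score. Returns 0.0 when the query has no relevant item.
    """
    ideal = dcg_at_k(sorted(relevances_in_predicted_order, reverse=True), k)
    if ideal == 0.0:
        return 0.0
    return dcg_at_k(relevances_in_predicted_order, k) / ideal
```

To compare models, one would compute this per query group and average over groups; on Spark that could be done by wrapping the function in a UDF applied to the collected labels of each group.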

kbafna-antuit · 2020-03-28T11:32:35Z

@daniloascione @daureg I am facing a similar issue: training the model on the same data with the same parameters results in different predictions each time.
Did you find a fix for this?

daniloascione · 2020-03-28T13:24:41Z

@KeertiBafna No, I didn't find a fix, unfortunately. I haven't tried the "sort within partitions" idea yet (see above); maybe it is time to look at this.

kbafna-antuit · 2020-04-02T12:32:26Z

@daniloascione Can I repartition by a key, as below?
For example, if I repartition my data into 8 partitions and add a column 'key' with values from 0 to 7, will the line below ensure that each partition has the same key group and order every time?
df.repartition(8, 'key').sortWithinPartitions('order_col')

daniloascione · 2020-04-03T10:15:16Z

Yes, I think so. The partitions should stay sorted at least until the next operation that involves a shuffle.
I recommend writing tests anyway.
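One such test could check that no group is split across partitions. Below is a minimal sketch in plain Python, assuming the (partition id, group id) pairs have already been collected from the DataFrame (for example by selecting PySpark's `spark_partition_id()` next to the group column). The helper name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple


def groups_spanning_partitions(
    pairs: Iterable[Tuple[int, Hashable]],
) -> Dict[Hashable, List[int]]:
    """Return the groups that appear in more than one partition.

    `pairs` yields (partition_id, group_id) tuples, one per row.
    An empty result means every group lives in exactly one partition.
    """
    partitions_by_group = defaultdict(set)
    for partition_id, group_id in pairs:
        partitions_by_group[group_id].add(partition_id)
    return {
        group_id: sorted(partition_ids)
        for group_id, partition_ids in partitions_by_group.items()
        if len(partition_ids) > 1
    }
```

An empty dictionary from this check says nothing about the row order inside each partition; that would need a separate assertion on the sort column.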

daniloascione · 2020-06-24T15:33:25Z

@imatiach-msft is this issue solved in later versions? I believe you mentioned in another issue that you added a sortWithinPartitions to preserve the sorting.

imatiach-msft self-assigned this Jun 6, 2019

mhamilton723 added the area/lightgbm label Aug 26, 2019
