Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add the option to get Feature Contributions in LightGBMBooster used by LightGBMRanker #791

Merged
merged 9 commits into from Feb 11, 2020

Conversation

JoanFM
Copy link
Contributor

@JoanFM JoanFM commented Jan 30, 2020

feat: This PR tries to add the option to get feature contribution for a given score. For now I applied it to lightGBMRanker but it might be useful for other models, but I am not sure.

test: I have added tests in VerifyLightGBMRanker to ensure the features shap length and results are expected.

Please let me know if you have any doubt or any request for me to change in this PR. I hope you find this PR helpful.

@JoanFM JoanFM closed this Jan 30, 2020
@JoanFM JoanFM reopened this Jan 30, 2020
@imatiach-msft
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft
Copy link
Contributor

@JoanFM this is very interesting! As someone who was worked a lot on https://github.com/slundberg/shap and https://github.com/interpretml/interpret-community lately, it's very exciting to see this contribution in mmlspark. It would be even more amazing if we could add a spark-based implementation of TreeExplainer that works with the SparkML tree-based models natively.

@JoanFM JoanFM force-pushed the predict_contrib branch 2 times, most recently from 51d9e5f to 148ed60 Compare January 30, 2020 18:16
@imatiach-msft
Copy link
Contributor

(similar to the lime explainer implementation that @mhamilton723 worked on in mmlspark)

@imatiach-msft
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@codecov
Copy link

codecov bot commented Jan 30, 2020

Codecov Report

Merging #791 into master will decrease coverage by 42.10%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff             @@
##           master     #791       +/-   ##
===========================================
- Coverage   52.40%   10.30%   -42.11%     
===========================================
  Files         241      185       -56     
  Lines        9704     8504     -1200     
  Branches      529      525        -4     
===========================================
- Hits         5085      876     -4209     
- Misses       4619     7628     +3009     
Impacted Files Coverage Δ
...ain/scala/org/apache/spark/ml/param/UDFParam.scala 0.00% <0.00%> (-100.00%) ⬇️
...n/scala/org/apache/spark/ml/param/UDPyFParam.scala 0.00% <0.00%> (-100.00%) ⬇️
...cala/com/microsoft/ml/spark/io/binary/Binary.scala 0.00% <0.00%> (-100.00%) ⬇️
...cala/org/apache/spark/ml/param/DataTypeParam.scala 0.00% <0.00%> (-100.00%) ⬇️
...ala/org/apache/spark/ml/param/ByteArrayParam.scala 0.00% <0.00%> (-100.00%) ⬇️
...ala/org/apache/spark/ml/param/DataFrameParam.scala 0.00% <0.00%> (-100.00%) ⬇️
...ala/org/apache/spark/ml/param/EstimatorParam.scala 0.00% <0.00%> (-100.00%) ⬇️
...ala/org/apache/spark/ml/param/EvaluatorParam.scala 0.00% <0.00%> (-100.00%) ⬇️
...la/org/apache/spark/ml/param/ParamSpaceParam.scala 0.00% <0.00%> (-100.00%) ⬇️
...a/org/apache/spark/ml/param/TransformerParam.scala 0.00% <0.00%> (-100.00%) ⬇️
... and 200 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 875f89d...cf092d6. Read the comment docs.

@imatiach-msft
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@JoanFM
Copy link
Contributor Author

JoanFM commented Jan 31, 2020

@imatiach-msft is it normal to have cognitive UnitTests failing or is it just some system unstability?

@imatiach-msft
Copy link
Contributor

imatiach-msft commented Jan 31, 2020

probably some system unstability

@imatiach-msft
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft
Copy link
Contributor

tagging @mhamilton723 for the cognitive test failures

@imatiach-msft
Copy link
Contributor

@mhamilton723 any idea why the cognitive tests are failing? might you be able to take a look? Those test failures seem to be unrelated to this PR.

@imatiach-msft
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft
Copy link
Contributor

@JoanFM could you please resolve the conflicts in src/main/scala/com/microsoft/ml/spark/lightgbm/LightGBMBooster.scala? Thank you!

@imatiach-msft
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft
Copy link
Contributor

@JoanFM sorry, could you again resolve the conflicts in src/main/scala/com/microsoft/ml/spark/lightgbm/LightGBMBooster.scala? It's due to the new PR you had which was just merged. Thank you!

@JoanFM
Copy link
Contributor Author

JoanFM commented Feb 11, 2020

@JoanFM sorry, could you again resolve the conflicts in src/main/scala/com/microsoft/ml/spark/lightgbm/LightGBMBooster.scala? It's due to the new PR you had which was just merged. Thank you!

I just did, I was expecting to have conflicts with this set of PRs

@imatiach-msft
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft
Copy link
Contributor

@JoanFM looks like there was some compilation failure:

[error] /home/vsts/work/1/s/src/main/scala/com/microsoft/ml/spark/lightgbm/LightGBMBooster.scala:198:24: type mismatch;
[error] found : ThreadLocal[com.microsoft.ml.spark.lightgbm.LongLongNativePtrHandler]
[error] required: com.microsoft.ml.lightgbm.SWIGTYPE_p_long_long
[error] boosterHandler.scoredDataLengthLongPtr, boosterHandler.scoredDataOutPtr)
[error] required: com.microsoft.ml.lightgbm.SWIGTYPE_p_double
[error] boosterHandler.shapDataLengthLongPtr, boosterHandler.shapDataOutPtr)
[error] ^
[error] /home/vsts/work/1/s/src/main/scala/com/microsoft/ml/spark/lightgbm/LightGBMBooster.scala:222:24: type mismatch;
[error] found : ThreadLocal[com.microsoft.ml.spark.lightgbm.LongLongNativePtrHandler]
[error] required: com.microsoft.ml.lightgbm.SWIGTYPE_p_long_long
[error] boosterHandler.shapDataLengthLongPtr, boosterHandler.shapDataOutPtr)
[error] ^
[error] /home/vsts/work/1/s/src/main/scala/com/microsoft/ml/spark/lightgbm/LightGBMBooster.scala:222:62: type mismatch;
[error] found : ThreadLocal[com.microsoft.ml.spark.lightgbm.DoubleNativePtrHandler]
[error] required: com.microsoft.ml.lightgbm.SWIGTYPE_p_double
[error] boosterHandler.shapDataLengthLongPtr, boosterHandler.shapDataOutPtr)
[error] ^
[error] /home/vsts/work/1/s/src/main/scala/com/microsoft/ml/spark/lightgbm/LightGBMBooster.scala:224:32: type mismatch;
[error] found : ThreadLocal[com.microsoft.ml.spark.lightgbm.DoubleNativePtrHandler]
[error] required: com.microsoft.ml.lightgbm.SWIGTYPE_p_double
[error] shapToArray(boosterHandler.shapDataOutPtr)
[error] ^
[error] 15 errors found
[error] (Compile / compileIncremental) Compilation failed
[error] Total time: 68 s, completed Feb 11, 2020 3:42:48 PM

@imatiach-msft
Copy link
Contributor

@JoanFM I made a suggestion for where the fix needs to be added

@imatiach-msft
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft imatiach-msft merged commit f702921 into microsoft:master Feb 11, 2020
@JoanFM JoanFM deleted the predict_contrib branch February 11, 2020 18:19
ocworld pushed a commit to AhnLab-OSS/mmlspark that referenced this pull request Mar 24, 2020
…used by LightGBMRanker (microsoft#791)

* Allow LightGBMRanker to compute features shap

* Take featureShapGetter into trait that potentially can be used by other models

* Fix data used to be the one for shap and add tests for getShapFeatures in LightGBMRanker

* Fix issues with merge conflict resolution

* Refactor to share predictForMat and predictForCSR from Score, Shap and LeafIndex usage

* Fix compilation issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants