
Explicit support for Regression, performed major refactoring of tests, removed unused code and updated notebooks to work (again). #248

Conversation

ReinierKoops
Contributor

@ReinierKoops ReinierKoops commented Mar 20, 2024

This PR depends on PR #242 being accepted first.


The PR fixes the following:

ReinierKoops and others added 18 commits March 26, 2024 16:16
Set the random states explicitly. 

Tasks:

- [x] Adjust the code where "sample" does not use random_state
- [x] Adjust the test code for it
- [x] Make sure the tests use it consistently.
- [x] Check whether we can remove some unnecessary checks, as mentioned in
this issue: ing-bank#221
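
The idea behind the tasks above can be illustrated with a minimal sketch. It is not code from this PR: `sample_rows` is a hypothetical helper that uses only the standard library as a stand-in for a sampling call such as pandas' `DataFrame.sample`, which accepts a `random_state` argument. Passing the seed explicitly makes repeated runs return the same sample.

```python
import random


def sample_rows(rows, n, random_state=None):
    """Draw n rows without replacement (hypothetical helper, not from this PR).

    A local generator seeded with random_state keeps the result reproducible
    and leaves the global random state untouched.
    """
    rng = random.Random(random_state)
    return rng.sample(list(rows), n)


rows = list(range(100))

# The same seed yields the same sample, so tests can compare results exactly.
first = sample_rows(rows, 5, random_state=42)
second = sample_rows(rows, 5, random_state=42)
assert first == second
```

Without an explicit seed, two calls would usually differ, which is what makes tests built on unseeded sampling flaky.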
Updates the requirements on
[catboost](https://github.com/catboost/catboost) to permit the latest
version.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/catboost/catboost/releases">catboost's
releases</a>.</em></p>
<blockquote>
<h2>1.2.3</h2>
<h2>Python package</h2>
<ul>
<li>Support Python 3.12. <a
href="https://redirect.github.com/catboost/catboost/issues/2510">#2510</a></li>
<li>[Performance]: Fix ineffective loops in Cython. Significant speedups
(up to 3x) on dataset construction from data in C-order can be
expected.</li>
<li>[Performance]: Make features data initialization from C-order
<code>numpy.ndarray</code>s with <code>float32</code> data type
multithreaded. Significant speedups of 5x up to 10x (on CPUs with many
cores) can be expected. <a
href="https://redirect.github.com/catboost/catboost/issues/385">#385</a>,
<a
href="https://redirect.github.com/catboost/catboost/issues/2542">#2542</a></li>
<li>Save training metrics into the model metadata. So
<code>best_score_</code>, <code>evals_result_</code>,
<code>best_iteration_</code> model attributes now work after model
saving and loading. Can be removed by model metadata manipulation if
needed. <a
href="https://redirect.github.com/catboost/catboost/issues/1166">#1166</a></li>
<li>[Breaking change]. Support a separate boolean target type, now
<code>Class</code> predictions for models that have been trained with
boolean targets will also be boolean instead of <code>True</code>,
<code>False</code> strings as before. Such models will be incompatible
with the previous versions of CatBoost appliers. If you want the old
behavior convert your target to <code>False</code>, <code>True</code>
strings before training. <a
href="https://redirect.github.com/catboost/catboost/issues/1954">#1954</a></li>
<li>Restrict <code>jupyterlab</code> version for setup to 3.x for now.
Fixes <a
href="https://redirect.github.com/catboost/catboost/issues/2530">#2530</a></li>
<li><code>utils.read_cd</code>: Support CD files with non-increasing
column indices.</li>
<li>Make <code>log_cout</code>, <code>log_cerr</code> specification
consistent, avoid reset in recursive calls.</li>
<li>Late-initialize default values for <code>log_cout</code>,
<code>log_cerr</code>. <a
href="https://redirect.github.com/catboost/catboost/issues/2195">#2195</a></li>
<li>Add missing generated metrics: <code>Cox</code>,
<code>PairLogitPairwise</code>, <code>UserPerObjMetric</code>,
<code>SurvivalAft</code>.</li>
</ul>
<h2>New features</h2>
<ul>
<li>Support boolean target/labels type during training in Python and
Spark (in the latter case only when using <code>fit</code> with
<code>Pool</code> arguments) and <code>Class</code> prediction in
Python. <a
href="https://redirect.github.com/catboost/catboost/issues/1954">#1954</a></li>
<li>[Spark]: Support Spark 3.5.x.</li>
<li>[C/C++ applier]. Add functions for getting indices of features of
different types to C and C++ API. <a
href="https://redirect.github.com/catboost/catboost/issues/2568">#2568</a>.
Thanks to <a
href="https://github.com/nimusp"><code>@​nimusp</code></a>.</li>
<li>[C/C++ applier]. Add staged prediction functions to C API. <a
href="https://redirect.github.com/catboost/catboost/issues/2584">#2584</a>.
Thanks to <a
href="https://github.com/Mb-NextTime"><code>@​Mb-NextTime</code></a>.</li>
<li>[JVM applier]. Add loading CatBoostModel from a byte array to API.
<a
href="https://redirect.github.com/catboost/catboost/issues/2539">#2539</a></li>
<li>[Linux] Support CgroupsV2 when computing default number of threads
used in parallel computations. <a
href="https://redirect.github.com/catboost/catboost/issues/2519">#2519</a>.
Thanks to <a
href="https://github.com/elukey"><code>@​elukey</code></a>.</li>
<li>[CLI] Support printing <code>Auxiliary</code> columns by name in
evaluation result output. <a
href="https://redirect.github.com/catboost/catboost/issues/1659">#1659</a></li>
<li>Save training metrics into the model metadata. Can be removed by
model metadata manipulation if needed. <a
href="https://redirect.github.com/catboost/catboost/issues/1166">#1166</a></li>
</ul>
<h2>Build &amp; testing</h2>
<ul>
<li>[Windows]: Use <code>clang-cl</code> compiler and tools from Visual
Studio 2022 for the build without CUDA (build with CUDA still uses
standard Microsoft toolchain from Visual Studio 2019).</li>
<li>[macOS]: Pass <code>os.version</code> to <code>conan</code> host
settings to ensure version consistency.</li>
<li>[Linux aarch64]: Set <code>-mno-outline-atomics</code> for modern
versions of Clang and GCC to avoid unresolved-symbol linking errors. <a
href="https://redirect.github.com/catboost/catboost/issues/2527">#2527</a></li>
<li>Added missing <code>CMakeLists</code> for unit tests for
<code>util</code>. <a
href="https://redirect.github.com/catboost/catboost/issues/2525">#2525</a></li>
</ul>
<h2>Bugfixes</h2>
<ul>
<li>[Performance]: Fix performance regression that could slow down
training on GPU by 50% on some datasets that had been introduced in
release 1.2. Thanks to <a
href="https://github.com/JeanPaulShapo"><code>@​JeanPaulShapo</code></a>.</li>
<li>[Python-package]: Fix segfault on Pool(data=None). <a
href="https://redirect.github.com/catboost/catboost/issues/2522">#2522</a></li>
<li>[Python-package]: Fix Python exception in <code>Pool()</code> when
<code>pairs_weight</code> is a numpy array. <a
href="https://redirect.github.com/catboost/catboost/issues/1913">#1913</a></li>
<li>[Python-package]: Fix segfault and other strange errors when
specifying custom logger with <code>__call__</code> method. <a
href="https://redirect.github.com/catboost/catboost/issues/2277">#2277</a></li>
<li>[Python-package]: Fix returning complex params in hyperparameter
search. <a
href="https://redirect.github.com/catboost/catboost/issues/1741">#1741</a>,
<a
href="https://redirect.github.com/catboost/catboost/issues/1833">#1833</a></li>
<li>[Python-package]: Fix ignored exceptions for missed metrics
descriptions on startup. This has not been visible to users but has been
making debugging more difficult.</li>
<li>[Python-package]: Fix misleading <code>Targets are required for
YetiRank loss function.</code> error in Cross validation. <a
href="https://redirect.github.com/catboost/catboost/issues/2083">#2083</a></li>
<li>[Python-package]: Fix <code>Pool.get_label()</code> returning constant
<code>True</code> for boolean labels. <a
href="https://redirect.github.com/catboost/catboost/issues/2133">#2133</a></li>
<li>[Python-package]: Copying models does not lose
<code>best_score_</code>, <code>evals_result_</code>,
<code>best_iteration_</code> attributes values anymore. <a
href="https://redirect.github.com/catboost/catboost/issues/1793">#1793</a></li>
<li>[Spark]: Fix hangs at the end of the training. <a
href="https://redirect.github.com/catboost/catboost/issues/2151">#2151</a></li>
<li><code>Precision</code> metric default value in the absence of
positive samples is changed to 0 and a warning is added
(similar to the behavior of <code>scikit-learn</code> implementation).
<a
href="https://redirect.github.com/catboost/catboost/issues/2422">#2422</a></li>
<li>Fix ignoring embedding features</li>
<li>Try to avoid hash collisions when computing group ids for datasets
with a lot of groups (may occur in datasets with around 10^9
samples).</li>
<li>Fix Multiclass models export to C++ and Python code. <a
href="https://redirect.github.com/catboost/catboost/issues/2549">#2549</a></li>
<li>Fix dataset_statistics mode when no <code>Target</code> data is
available.</li>
<li>Fix <code>Error: can't proceed some features</code> error on GPU. <a
href="https://redirect.github.com/catboost/catboost/issues/1024">#1024</a></li>
<li>Fix <code>allow_const_label=True</code> for classification. <a
href="https://redirect.github.com/catboost/catboost/issues/1933">#1933</a></li>
<li>Add checking of approx and target dimensions for
<code>SurvivalAft</code> objective/metric.</li>
<li>Fix Focal loss derivatives sign. <a
href="https://redirect.github.com/catboost/catboost/issues/2563">#2563</a></li>
</ul>
</blockquote>
</details>
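
The breaking change in the release notes above suggests converting boolean targets to `False`/`True` strings before training to keep the old behavior. A minimal sketch of that conversion follows, using plain Python only; `booleans_to_legacy_labels` is a hypothetical helper name, and it does not call any CatBoost API.

```python
def booleans_to_legacy_labels(targets):
    """Map boolean targets to the strings "True" / "False".

    Hypothetical helper: it only prepares the labels and assumes the
    result is then passed to training as the target.
    """
    return ["True" if bool(t) else "False" for t in targets]


y = [True, False, True]
y_legacy = booleans_to_legacy_labels(y)
assert y_legacy == ["True", "False", "True"]
```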
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/catboost/catboost/blob/master/RELEASE.md">catboost's
changelog</a>.</em></p>
<blockquote>
<h1>Release 1.2.3</h1>
<h2>Python package</h2>
<ul>
<li>Support Python 3.12. <a
href="https://redirect.github.com/catboost/catboost/issues/2510">#2510</a></li>
<li>[Performance]: Fix ineffective loops in Cython. Significant speedups
(up to 3x) on dataset construction from data in C-order can be
expected.</li>
<li>[Performance]: Make features data initialization from C-order
<code>numpy.ndarray</code>s with <code>float32</code> data type
multithreaded. Significant speedups of 5x up to 10x (on CPUs with many
cores) can be expected. <a
href="https://redirect.github.com/catboost/catboost/issues/385">#385</a>,
<a
href="https://redirect.github.com/catboost/catboost/issues/2542">#2542</a></li>
<li>Save training metrics into the model metadata. So
<code>best_score_</code>, <code>evals_result_</code>,
<code>best_iteration_</code> model attributes now work after model
saving and loading. Can be removed by model metadata manipulation if
needed. <a
href="https://redirect.github.com/catboost/catboost/issues/1166">#1166</a></li>
<li>[Breaking change]. Support a separate boolean target type, now
<code>Class</code> predictions for models that have been trained with
boolean targets will also be boolean instead of <code>True</code>,
<code>False</code> strings as before. Such models will be incompatible
with the previous versions of CatBoost appliers. If you want the old
behavior convert your target to <code>False</code>, <code>True</code>
strings before training. <a
href="https://redirect.github.com/catboost/catboost/issues/1954">#1954</a></li>
<li>Restrict <code>jupyterlab</code> version for setup to 3.x for now.
Fixes <a
href="https://redirect.github.com/catboost/catboost/issues/2530">#2530</a></li>
<li><code>utils.read_cd</code>: Support CD files with non-increasing
column indices.</li>
<li>Make <code>log_cout</code>, <code>log_cerr</code> specification
consistent, avoid reset in recursive calls.</li>
<li>Late-initialize default values for <code>log_cout</code>,
<code>log_cerr</code>. <a
href="https://redirect.github.com/catboost/catboost/issues/2195">#2195</a></li>
<li>Add missing generated metrics: <code>Cox</code>,
<code>PairLogitPairwise</code>, <code>UserPerObjMetric</code>,
<code>SurvivalAft</code>.</li>
</ul>
<h2>New features</h2>
<ul>
<li>Support boolean target/labels type during training in Python and
Spark (in the latter case only when using <code>fit</code> with
<code>Pool</code> arguments) and <code>Class</code> prediction in
Python. <a
href="https://redirect.github.com/catboost/catboost/issues/1954">#1954</a></li>
<li>[Spark]: Support Spark 3.5.x.</li>
<li>[C/C++ applier]. Add functions for getting indices of features of
different types to C and C++ API. <a
href="https://redirect.github.com/catboost/catboost/issues/2568">#2568</a>.
Thanks to <a
href="https://github.com/nimusp"><code>@​nimusp</code></a>.</li>
<li>[C/C++ applier]. Add staged prediction functions to C API. <a
href="https://redirect.github.com/catboost/catboost/issues/2584">#2584</a>.
Thanks to <a
href="https://github.com/Mb-NextTime"><code>@​Mb-NextTime</code></a>.</li>
<li>[JVM applier]. Add loading CatBoostModel from a byte array to API.
<a
href="https://redirect.github.com/catboost/catboost/issues/2539">#2539</a></li>
<li>[Linux] Support CgroupsV2 when computing default number of threads
used in parallel computations. <a
href="https://redirect.github.com/catboost/catboost/issues/2519">#2519</a>.
Thanks to <a
href="https://github.com/elukey"><code>@​elukey</code></a>.</li>
<li>[CLI] Support printing <code>Auxiliary</code> columns by name in
evaluation result output. <a
href="https://redirect.github.com/catboost/catboost/issues/1659">#1659</a></li>
<li>Save training metrics into the model metadata. Can be removed by
model metadata manipulation if needed. <a
href="https://redirect.github.com/catboost/catboost/issues/1166">#1166</a></li>
</ul>
<h2>Build &amp; testing</h2>
<ul>
<li>[Windows]: Use <code>clang-cl</code> from Visual Studio 2022 for the
build without CUDA (build with CUDA still uses standard Microsoft
toolchain from Visual Studio 2019).</li>
<li>[macOS]: Pass <code>os.version</code> to <code>conan</code> host
settings to ensure version consistency.</li>
<li>[Linux aarch64]: Set <code>-mno-outline-atomics</code> for modern
versions of Clang and GCC to avoid unresolved-symbol linking errors. <a
href="https://redirect.github.com/catboost/catboost/issues/2527">#2527</a></li>
<li>Added missing <code>CMakeLists</code> for unit tests for
<code>util</code>. <a
href="https://redirect.github.com/catboost/catboost/issues/2525">#2525</a></li>
</ul>
<h2>Bugfixes</h2>
<ul>
<li>[Performance]: Fix performance regression that could slow down
training on GPU by 50% on some datasets that had been introduced in
release 1.2. Thanks to <a
href="https://github.com/JeanPaulShapo"><code>@​JeanPaulShapo</code></a>.</li>
<li>[Python-package]: Fix segfault on Pool(data=None). <a
href="https://redirect.github.com/catboost/catboost/issues/2522">#2522</a></li>
<li>[Python-package]: Fix Python exception in <code>Pool()</code> when
<code>pairs_weight</code> is a numpy array. <a
href="https://redirect.github.com/catboost/catboost/issues/1913">#1913</a></li>
<li>[Python-package]: Fix segfault and other strange errors when
specifying custom logger with <code>__call__</code> method. <a
href="https://redirect.github.com/catboost/catboost/issues/2277">#2277</a></li>
<li>[Python-package]: Fix returning complex params in hyperparameter
search. <a
href="https://redirect.github.com/catboost/catboost/issues/1741">#1741</a>,
<a
href="https://redirect.github.com/catboost/catboost/issues/1833">#1833</a></li>
<li>[Python-package]: Fix ignored exceptions for missed metrics
descriptions on startup. This has not been visible to users but has been
making debugging more difficult.</li>
<li>[Python-package]: Fix misleading <code>Targets are required for
YetiRank loss function.</code> error in Cross validation. <a
href="https://redirect.github.com/catboost/catboost/issues/2083">#2083</a></li>
<li>[Python-package]: Fix <code>Pool.get_label()</code> returning constant
<code>True</code> for boolean labels. <a
href="https://redirect.github.com/catboost/catboost/issues/2133">#2133</a></li>
<li>[Python-package]: Copying models does not lose
<code>best_score_</code>, <code>evals_result_</code>,
<code>best_iteration_</code> attributes values anymore. <a
href="https://redirect.github.com/catboost/catboost/issues/1793">#1793</a></li>
<li>[Spark]: Fix hangs at the end of the training. <a
href="https://redirect.github.com/catboost/catboost/issues/2151">#2151</a></li>
<li><code>Precision</code> metric default value in the absence of
positive samples is changed to 0 and a warning is added
(similar to the behavior of <code>scikit-learn</code> implementation).
<a
href="https://redirect.github.com/catboost/catboost/issues/2422">#2422</a></li>
<li>Fix ignoring embedding features</li>
<li>Try to avoid hash collisions when computing group ids for datasets
with a lot of groups (may occur in datasets with around 10^9
samples).</li>
<li>Fix Multiclass models export to C++ and Python code. <a
href="https://redirect.github.com/catboost/catboost/issues/2549">#2549</a></li>
<li>Fix dataset_statistics mode when no <code>Target</code> data is
available.</li>
<li>Fix <code>Error: can't proceed some features</code> error on GPU. <a
href="https://redirect.github.com/catboost/catboost/issues/1024">#1024</a></li>
<li>Fix <code>allow_const_label=True</code> for classification. <a
href="https://redirect.github.com/catboost/catboost/issues/1933">#1933</a></li>
<li>Add checking of approx and target dimensions for
<code>SurvivalAft</code> objective/metric.</li>
</ul>
</blockquote>
<p>... (truncated)</p>
</details>
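
The `Precision` bugfix listed above (default value 0 plus a warning) can be sketched in plain Python. This is an illustration, not CatBoost's implementation: it assumes "absence of positive samples" means that nothing was predicted positive, so the denominator TP + FP is zero, and the function name and warning text are made up for the example.

```python
import warnings


def precision(y_true, y_pred):
    """Precision = TP / (TP + FP), returning 0.0 with a warning when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    if tp + fp == 0:
        # Assumed behavior: fall back to 0 and tell the user about it.
        warnings.warn("No positive predictions; precision set to 0.", RuntimeWarning)
        return 0.0
    return tp / (tp + fp)


# One true positive and one false positive give a precision of 0.5.
assert precision([1, 0, 1], [1, 1, 0]) == 0.5
```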
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/catboost/catboost/commit/fe0941b208f9c392ce788c314463b6816d335c6a"><code>fe0941b</code></a>
Use paths from CMAKE_*_DIR when running in open source to avoid issues
on Win...</li>
<li><a
href="https://github.com/catboost/catboost/commit/cf282f73707cbfaca6a8976693a317bf942d1a90"><code>cf282f7</code></a>
CatBoost release 1.2.3.</li>
<li><a
href="https://github.com/catboost/catboost/commit/ec263e7d0e4ed96a39820d1cee9d8f205a0624d4"><code>ec263e7</code></a>
Update contrib/python/ipywidgets/py3 to 8.1.2</li>
<li><a
href="https://github.com/catboost/catboost/commit/704a5d8ba26afdcd4fd28842dcdd7541b176531e"><code>704a5d8</code></a>
Intermediate changes</li>
<li><a
href="https://github.com/catboost/catboost/commit/a13b5ba1b650cf68e6129a165d3fa144642bc132"><code>a13b5ba</code></a>
Add loading CatBoostModel from a byte array to API.. Fix <a
href="https://redirect.github.com/catboost/catboost/issues/2539">#2539</a></li>
<li><a
href="https://github.com/catboost/catboost/commit/56a0b44e8529920def3db9f5a13204bdb8c37cbe"><code>56a0b44</code></a>
Add Get*FeaturesIndices to C++ wrapper. <a
href="https://redirect.github.com/catboost/catboost/issues/2323">#2323</a>,
<a
href="https://redirect.github.com/catboost/catboost/issues/2568">#2568</a></li>
<li><a
href="https://github.com/catboost/catboost/commit/caed72b46db0f4de9cb92b4e12195e683579ce36"><code>caed72b</code></a>
Add GetEmbeddingFeaturesCount() to C++ wrapper</li>
<li><a
href="https://github.com/catboost/catboost/commit/4490314cac96e0aacb0925483042db6287b091f4"><code>4490314</code></a>
Add Spark 3.5 to pyspark_wrapper_generator.</li>
<li><a
href="https://github.com/catboost/catboost/commit/98c36676fed9e44892b5f228166694b2f71e1cec"><code>98c3667</code></a>
Support boolean target type in Spark (where possible).</li>
<li><a
href="https://github.com/catboost/catboost/commit/ad980dac979d2d8b4d633dffab06cb2e7303c126"><code>ad980da</code></a>
Manually unroll the loop</li>
<li>Additional commits viewable in <a
href="https://github.com/catboost/catboost/compare/v1.1...v1.2.3">compare
view</a></li>
</ul>
</details>
<br />


Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.


---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Reinier Koops <info@reinier.work>
rebase master
rebase master
@ReinierKoops
Contributor Author

@PaulZhutovsky @adri0 the branch should be cleaned up a bit now (I rebased it), so reviewing should be (hopefully) easier.

Collaborator

@adri0 adri0 left a comment

LGTM

@ReinierKoops ReinierKoops requested review from adri0 and removed request for PaulZhutovsky April 11, 2024 10:18
Collaborator

@adri0 adri0 left a comment

Approving again

@ReinierKoops ReinierKoops merged commit ab672d4 into ing-bank:main Apr 11, 2024
15 checks passed
ReinierKoops added a commit that referenced this pull request Apr 23, 2024
…252)

This PR depends on PR #248 being accepted first.

______

This cleanup removes more unused code and simplifies parts of our
implementations. It should give a small performance boost for larger
use cases.

Also fix:

- [x] #242 comments 
- [x] #255
- [x] #245
Development

Successfully merging this pull request may close these issues.

Investigate if parts of the codebase can leverage other libraries code
2 participants