
integrations: sklearn #5

Closed · 5 tasks done
pared opened this issue Nov 19, 2020 · 14 comments

@pared (Contributor) commented Nov 19, 2020:

It seems we should support at least a few popular frameworks.

Considering their popularity, we should probably start with:

  • keras - we have an initial implementation
  • sklearn
  • xgboost

Worth considering:

TF and PyTorch - it seems to me that their pure form is used when users need highly custom models, and in those cases they will probably be able to wire up dvclive by hand.
@dmpetrov did I miss any popular framework?

EDIT:
crossing out FastAi as it now has its own issue

@pared pared changed the title integrations: integrate with FastAi integrations Nov 19, 2020
@pared pared self-assigned this Jan 12, 2021
@dberenbaum (Contributor) commented:
I think it's easy enough for users to add integrations as needed (or for the dvc team to add them in response to demand), so it's probably not worthwhile to spend time adding more now.

How do we plan to handle dependencies for multiple frameworks? Each supported framework is pretty heavy, and I think it's already unreasonable to expect an XGBoost user to install Tensorflow to use dvclive. Similar concerns would apply to dvcx.

Thoughts @pared @dmpetrov ?

@dberenbaum (Contributor) commented:
See #25 for more discussion of dependency management.

@pared (Contributor, Author) commented Jan 28, 2021:

@dberenbaum
I think leaving particular implementations to our users is a good idea; those are easy tasks. Writing tests might be harder, but I guess we can help users write them instead of doing all the legwork ourselves, without even knowing whether a particular integration will be desired by the user base.

As for installation, you are right: we already do this in dvc (for the different backends), and we will have to go the same way here too.
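To make the "do it like dvc does for its backends" idea concrete, here is a minimal sketch of a per-framework optional-dependency layout. The extra names and dependency lists are hypothetical, not dvclive's actual packaging; this only illustrates the extras_require pattern being discussed, where a user would run e.g. `pip install dvclive[sklearn]` instead of pulling in every framework.

```python
# Hypothetical extras_require mapping for dvclive's setup.py, mirroring
# the per-backend extras pattern dvc uses. Each integration pulls in only
# its own heavy framework dependency.
FRAMEWORKS = {
    "tf": ["tensorflow"],
    "xgb": ["xgboost"],
    "sklearn": ["scikit-learn"],
}

EXTRAS = {
    **FRAMEWORKS,
    # convenience extra that installs every supported framework
    "all": sorted({dep for deps in FRAMEWORKS.values() for dep in deps}),
}

# This dict would be passed as extras_require=EXTRAS in setup(...).
print(EXTRAS["all"])
```

With this layout, `pip install dvclive` stays lightweight and a framework is only installed when its extra is requested.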

@dberenbaum (Contributor) commented:
On second thought, is it worthwhile to add the sklearn integration ourselves? Since it is such a large framework, the integration may be complex, and if we have an opinion about how to implement it, it is probably better to add it now than to wait for contributions. Even if that means implementing only one particular model or class of models, it could serve as a worthwhile template. Thoughts?

@pared (Contributor, Author) commented Jan 29, 2021:

Makes sense. I will get to that once I am done with supporting dvclive output caching.

@dberenbaum (Contributor) commented:
sklearn is largely not focused on deep learning, which has been the primary use case for dvclive. Should other algorithms be supported? If the primary purpose is to track model training progress, it seems only useful where models are trained iteratively. I only know of a couple of classes of algorithms where this is true:

  • Gradient descent (including neural networks/deep learning)
  • Ensemble methods (such as gradient boosting)

@pared (Contributor, Author) commented Mar 11, 2021:

@dberenbaum Yes, after digging through the documentation, it seems to me that learning algorithms generally divide into those that expose only a fit method and those that expose both fit and partial_fit. I don't think we can provide an integration for fit-only models, and in the case of partial_fit models the workflow will probably look more like the torch one, which in my opinion does not require any integration, since the training loop is written by hand.

The only place where I could see some integration is methods accepting a scoring param, which can be a Callable, but it seems really hard to define how such an integration would work.
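To illustrate why partial_fit models may not need a dedicated integration: the user already owns the loop, so per-step logging can be dropped in by hand, much like in a torch training loop. This is a self-contained sketch; the `history` list stands in for calls to dvclive's logging API, which is not imported here.

```python
# Hand-written training loop over sklearn's partial_fit, the workflow
# discussed above. Appending to `history` is a stand-in for a per-step
# dvclive logging call.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # linearly separable toy labels

model = SGDClassifier(random_state=0)
history = []

for epoch in range(10):
    # partial_fit requires the full list of classes on the first call
    model.partial_fit(X, y, classes=np.array([0, 1]))
    # here a user would log the metric and advance the step in dvclive
    history.append({"step": epoch, "accuracy": model.score(X, y)})

print(history[-1])
```

Since the user controls exactly where each epoch ends, they can log any metric at any step without dvclive needing to hook into sklearn itself.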

@daavoo (Contributor) commented Apr 1, 2021:

I am considering working on the integration with pytorch-lightning, but I'm not sure where to contribute the new logger (i.e. this repository or pytorch-lightning itself). See #70 (comment)

@daavoo (Contributor) commented Jun 2, 2021:

I added an integration with mmcv:

open-mmlab/mmcv#1075

@pared (Contributor, Author) commented Jun 3, 2021:

@daavoo That's great news! Can we do anything to help with that pull request?

@daavoo (Contributor) commented Jun 4, 2021:

> @daavoo That's great news! Can we do anything to help with that pull request?

It has already been approved, so I think it will be merged soon. Thanks!

@daavoo (Contributor) commented Jun 10, 2021:

I think it might be a good idea to have separate issues for each integration, in order to better track progress and keep the discussions focused (i.e. this issue got "populated" by sklearn-specific discussions).

E.g. #83

@pared (Contributor, Author) commented Jun 10, 2021:

@daavoo That's right. In the beginning we intended this to be an umbrella issue, since individual implementations seemed like easy tasks. As the sklearn example shows, we should probably track each integration separately.

For future reference:
changing the name of this issue to cover sklearn only. Other integrations should be tracked in separate issues.

@pared pared changed the title integrations sklearn integration Jun 10, 2021
@pared pared changed the title sklearn integration integrations: sklearn Jun 10, 2021
@daavoo daavoo added A: frameworks Area: ML Framework integration feature request labels Jul 14, 2021
@daavoo (Contributor) commented Oct 27, 2021:

Reviving this, as I think sklearn should be the entry point for discussing what dvclive can provide in "stepless" scenarios (no deep learning, no gradient boosting) beyond #182.

Taking a quick look at our example repositories using sklearn (https://github.com/iterative/example-get-started), it looks like low-hanging fruit to add some utility that goes from (y_true, y_pred) to PRC / ROC plots.

Given that example repo, we would remove quite a few lines for users, from:

# Given: labels, predictions, prc_file, roc_file
import json
import math

from sklearn import metrics

precision, recall, prc_thresholds = metrics.precision_recall_curve(labels, predictions)
fpr, tpr, roc_thresholds = metrics.roc_curve(labels, predictions)

# ROC has a drop_intermediate arg that reduces the number of points.
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve.
# PRC lacks this arg, so we manually reduce to 1000 points as a rough estimate.
nth_point = math.ceil(len(prc_thresholds) / 1000)
prc_points = list(zip(precision, recall, prc_thresholds))[::nth_point]
with open(prc_file, "w") as fd:
    json.dump(
        {
            "prc": [
                {"precision": p, "recall": r, "threshold": t}
                for p, r, t in prc_points
            ]
        },
        fd,
        indent=4,
    )

with open(roc_file, "w") as fd:
    json.dump(
        {
            "roc": [
                {"fpr": fp, "tpr": tp, "threshold": t}
                for fp, tp, t in zip(fpr, tpr, roc_thresholds)
            ]
        },
        fd,
        indent=4,
    )

To:

from dvclive.sklearn import log_precision_recall_curve, log_roc_curve

log_precision_recall_curve(labels, predictions)
log_roc_curve(labels, predictions)
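As a sketch of what the proposed helper could look like under the hood: the function name comes from the proposal above, but the signature, the `path`/`max_points` parameters, and the output format are assumptions lifted from the example-get-started snippet, not an actual dvclive API.

```python
# Hypothetical implementation of the proposed log_precision_recall_curve,
# wrapping the boilerplate from the example above. The JSON layout matches
# the PRC plot format used in example-get-started.
import json
import math

from sklearn import metrics


def log_precision_recall_curve(labels, predictions, path="prc.json", max_points=1000):
    precision, recall, thresholds = metrics.precision_recall_curve(labels, predictions)
    # precision/recall have one more entry than thresholds; zip() drops the extras.
    # Downsample to at most max_points, since PRC has no drop_intermediate arg.
    nth_point = math.ceil(len(thresholds) / max_points)
    points = list(zip(precision, recall, thresholds))[::nth_point]
    with open(path, "w") as fd:
        json.dump(
            {
                "prc": [
                    {"precision": float(p), "recall": float(r), "threshold": float(t)}
                    for p, r, t in points
                ]
            },
            fd,
            indent=4,
        )


# toy labels/scores from the sklearn precision_recall_curve docs
log_precision_recall_curve([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

A `log_roc_curve` helper would be analogous, using `metrics.roc_curve` and the `{"roc": [...]}` layout.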

@pared pared removed their assignment Nov 3, 2021
@daavoo daavoo closed this as completed Sep 26, 2022