
Remove scikit-learn #1063

Merged
merged 18 commits into main from remove_sklearn on Apr 24, 2023

Conversation

@calebrob6 (Member)

We only use sklearn for GroupShuffleSplit. This PR reimplements our own version so that we can remove the dependency.
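For context, the usage being replaced looks roughly like this (a minimal sketch with made-up sample indices and group labels, not the exact call site in the datamodules):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical sample indices and the group each sample belongs to.
samples = np.arange(10)
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])

# One shuffled split in which no group appears in both partitions.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, val_idx = next(splitter.split(samples, groups=groups))

# The group labels in the two partitions are disjoint.
assert set(groups[train_idx]).isdisjoint(groups[val_idx])
```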

@github-actions bot added the datamodules, dependencies, and testing labels on Jan 29, 2023
@adamjstewart (Collaborator)

Trying to decide if the +100 lines is worth it to remove one dependency...

@calebrob6 (Member, Author)

Which is more likely to cause issues? (I'm thinking sklearn obviously -- more generally, I'm tired of dependencies breaking everything, it seems we spend as much time fiddling with CI and dependencies as we do on actual features)

Another relevant question is "do we see any other features of sklearn that we'll be using in the future?" (I'm thinking maybe if we implement fine tuning and want to use sklearn, but that isn't necessary)

@calebrob6 (Member, Author)

Can you give some arguments for/against?

@adamjstewart (Collaborator)

Pros

  • -1 required direct dependency
  • -11 required indirect dependencies (including scipy which was listed as optional)
  • Fewer dependabot PRs
  • Faster installs during tests

+/- sklearn isn't a huge deal just because most ML people will already have it installed anyway. But it's always nice to decrease the number of our deps, both for install times and for simpler solves. For example, sklearn requires setuptools < 60, while fiona requires setuptools 61+. Pip can handle this, but Spack can't, meaning you would have to choose between the latest version of sklearn and the latest version of fiona; you couldn't have both. This was fixed the other day, but still.

Cons

  • +100 lines of code we have to maintain
  • Possibility of sklearn implementation getting bug fixes we don't
  • Possibility of sklearn implementation changing API someday
  • Possibility we may decide to re-add sklearn for another feature someday

Maintenance burden is my biggest fear. We aren't explicitly making this function public (it doesn't get an import alias), but it's still something people could potentially try to use. If we do keep it, we should prob prefix with an underscore to avoid people relying on it.

Alternatives

  • Could move sklearn to [datasets] and make it optional
  • Could simplify the function to the absolute bare minimum feature with no error checks, but that might lose compatibility with the sklearn implementation, making it difficult to revert someday if we decide to use sklearn for something else

I'm really on the fence with this one, not sure how to decide. Curious how you feel about alternative 1 (moving to [datasets]). Most of the pros go away, but at least it's optional. Technically it's for datamodules, not datasets, but close enough.

@calebrob6 (Member, Author)

I don't really like the idea of moving sklearn to [datasets], as the whole point of this was to reduce dependency-related maintenance burden.

If we divide our current lines of code by our current number of dependencies, we get an idea of how many lines of code are worth one dependency. That number will assuredly be larger than 100 (if it isn't, I'm sure I can code-golf the current implementation of group splitting...). Put differently, would you take on a new dependency just to get rid of 100 lines of code? (I think no way in hell 🙂)

For the "Possibility we may decide to re-add sklearn for another feature someday" -- that doesn't seem like a con at all. If we do, then we can drop these 100 lines of code, else it is a non-issue.

The function that we are maintaining has pretty clear logic:

  • Takes a list of something (hashables, maybe?), groups
  • Splits the input list into two (according to a given percentage) while ensuring that no group value appears in both splits.

We don't really care if sklearn changes their API for doing this because we just care about the functionality.
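A simplified sketch of that logic (illustrative only; the `group_shuffle_split` name and signature here are made up, not the exact code in this PR):

```python
import math
import random
from collections.abc import Hashable
from typing import Any


def group_shuffle_split(
    items: list[Any], groups: list[Hashable], train_size: float = 0.8, seed: int = 0
) -> tuple[list[Any], list[Any]]:
    """Split *items* in two so that no group value spans both splits."""
    unique_groups = sorted(set(groups), key=str)
    random.Random(seed).shuffle(unique_groups)

    # Assign whole groups to the training split until the target fraction is met.
    n_train_groups = math.ceil(train_size * len(unique_groups))
    train_groups = set(unique_groups[:n_train_groups])

    train = [item for item, g in zip(items, groups) if g in train_groups]
    val = [item for item, g in zip(items, groups) if g not in train_groups]
    return train, val
```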

@adamjstewart (Collaborator)

I was 50/50 on this but now I'm more like 60/40 in favor. Still approaching critical mass...

@calebrob6 (Member, Author)

> I was 50/50 on this but now I'm more like 60/40 in favor. Still approaching critical mass...

Anything actionable?

@adamjstewart (Collaborator)

Not yet, still mulling...

@adamjstewart (Collaborator)

Yeah let's do this. We're adding a bunch more new deps now, so it would be good to reduce. Can you rebase?

@calebrob6 (Member, Author) commented Apr 23, 2023

I'm not trying to mimic torch.utils or torchgeo.datasets.splits here. This is meant to be a simple drop-in replacement that doesn't depend on all of scikit-learn. (Also, are you thinking of something other than torchgeo.datasets.splits? That method doesn't support lengths or generators.)

@adamjstewart (Collaborator)

I'm talking about torchgeo/datasets/splits.py.

If we're going to write our own splitting utility, I don't see why we shouldn't follow the PyTorch style instead of the sklearn style. Could even put it in torchgeo/datasets/splits.py and export it to the user.
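For comparison, the two styles differ mainly in what they operate on and what they return; `random_split` below is the real torch.utils.data API, while the trailing comment describes the sklearn-style contract:

```python
import torch
from torch.utils.data import TensorDataset, random_split

# PyTorch style: operate on a Dataset, pass lengths and an optional generator,
# and get Subset objects back.
dataset = TensorDataset(torch.arange(10))
train_ds, val_ds = random_split(dataset, [8, 2], generator=torch.Generator().manual_seed(0))

# sklearn style (GroupShuffleSplit): operate on arrays of sample indices and
# group labels, and get back index arrays that the caller uses to subset the data.
```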

@calebrob6 (Member, Author) commented Apr 23, 2023

Okay, I see what you're saying now! Sorry that took me a minute -- the disconnect was because I wasn't seeing this as something that operated on torch Datasets. It could go in torchgeo/datasets/splits.py but it isn't really specific to Datasets.

(and, given that, should it really be in torchgeo in the first place 😉 ?)

@adamjstewart (Collaborator)

That's fair. I'm fine with keeping this internal-only and using a different style since it doesn't yet support passing in a NonGeoDataset. But if we can figure out a stable API that would support that it would be pretty cool.

Want me to merge this as is and save this for another day?

3 review threads on torchgeo/datamodules/utils.py (resolved)
@calebrob6 (Member, Author)

Tests fail with `round(...)`; I expect the behavior guaranteed by ceiling.

@adamjstewart (Collaborator)

> I expect the behavior guaranteed by ceiling.

I would expect it to behave similarly to our other splitting functions.
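To make the difference concrete with made-up group counts:

```python
import math

# With 7 groups and a 0.2 test fraction: 0.2 * 7 == 1.4
round(0.2 * 7)      # -> 1 test group
math.ceil(0.2 * 7)  # -> 2 test groups

# With only 2 groups, rounding can yield an empty test split:
round(0.2 * 2)      # -> 0
math.ceil(0.2 * 2)  # -> 1 (ceiling guarantees a non-empty split for any nonzero fraction)
```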

calebrob6 and others added 3 commits April 24, 2023 13:26
@calebrob6 (Member, Author)

Can you finish this?

@github-actions bot added the documentation label on Apr 24, 2023
@adamjstewart adamjstewart added this to the 0.4.2 milestone Apr 24, 2023
@adamjstewart adamjstewart enabled auto-merge (squash) April 24, 2023 21:40
@adamjstewart adamjstewart merged commit 4f714f7 into main Apr 24, 2023
@adamjstewart adamjstewart deleted the remove_sklearn branch April 24, 2023 22:00
@adamjstewart adamjstewart modified the milestones: 0.4.2, 0.5.0 Sep 28, 2023
Labels

  • datamodules: PyTorch Lightning datamodules
  • dependencies: Packaging and dependencies
  • documentation: Improvements or additions to documentation
  • testing: Continuous integration testing