Finalize naming conventions for estimators and package #6

grovduck · 2023-04-17T16:49:34Z

Currently, our estimators are named to follow method names introduced in R's yaImpute package. We likely want to conform to scikit-learn's model of being more explicit about whether estimators are classifiers or regressors. kNN methods fit into a funny space here - most often they are thought of as classifiers as they use the k nearest neighbors in prediction. However, some attributes are continuous and are calculated as (weighted) means from the k neighbors attributes, which is more similar to KNeighborsRegressor.

Also need to decide whether it makes sense to stick with scikit-learn-knn as a package name as knn is already a meaningful term in scikit-learn. We could also decide to "pay homage" to yaImpute which is serving as the inspiration for this package.

The text was updated successfully, but these errors were encountered:

aazuspan · 2023-05-05T00:29:51Z

Regarding estimator names, what do you think about using Imputer as opposed to Classifier or Regressor? To me this seems like the most technically accurate term, but I'm guessing you shied away from it to avoid confusion with the imputers (including KNNImputer) built in to sklearn? The tricky thing seems to be that our estimators actually very closely match the functionality of the KNNImputer, so we will need to be careful to distinguish how this package differs from the built-in features of sklearn...

grovduck · 2023-05-05T19:07:00Z

@aazuspan, honestly I hadn't come across the Imputer classes in sklearn until a few weeks ago when we were diving in. I definitely agree with you that our estimators basically do both straight-forward and fancy imputation and really closely mimic the behavior of KNNImputer (which is basically our Raw estimator).

The only reason I've shied away from using this terminology on our estimators is because they seem to treat imputation as a way to fill missing values on the way to transformation/classification/regression. That is, it's a necessary pre-filling step such that other estimators down the line don't choke.

The expectation for our estimators is that the X we pass to fit is complete whereas in KNNImputer, the expectation is that the X passed to fit will be transformed to get rid of missing values.

But recognize that I haven't yet come up with a better term yet ... how about GNNClassifierAndRegressorAndReallyAnImputer 😄.

I'm willing to go with Imputer if you think we can effectively explain the differences.

aazuspan · 2023-05-05T20:24:25Z

GNNClassifierAndRegressorAndReallyAnImputer

Perfect, I'll just close this issue as resolved 😆

they seem to treat imputation as a way to fill missing values on the way to transformation/classification/regression

I completely agree. It's a challenging distinction to make because they function almost identically, but have very different purposes.

I'm willing to go with Imputer if you think we can effectively explain the differences.

No, I think your instincts were right. The association between imputation and filling missing data is just too strong for sklearn users, even if it's technically an accurate description of what we're doing.

Back to the drawing board!

grovduck · 2023-05-08T18:29:05Z

Based on discussions in #2 and #13, starting to learn toward using Regressor here to describe these custom estimators. If we do indeed subclass our estimators from KNearestRegressor, logical names would be:

RawRegressor
EuclideanRegressor
MahalanobisRegressor
MSNRegressor
GNNRegressor

I might be tempted to give a nod to being explicit that these are KNearest regressors, so we could consider names like:

RawKnnRegressor
...

It gets a bit awkward/redundant with MSN (most similar neighbor) and GNN (gradient nearest neighbor) because they already have NN implied. So, my current thinking is:

RawKnnRegressor
EuclideanKnnRegressor
MahalanobisKnnRegressor
MSNRegressor
GNNRegressor
(and eventually RFNNRegressor)

Does that make scikit-learn-knn-regression (package) and sklearn-knn-regression (imported module) too unwieldy as a names?

aazuspan · 2023-05-08T21:47:22Z

So, my current thinking is:

I'm happy with these names, and I'm also in favor of being more explicit by including Knn while avoiding the redundancy with the MSN and GNN estimators.

In the interest of thinking through every option, would there be any advantage to combining similar estimators into a single estimator that could be selected with a parameter like method? For example, say a KNNRegressor where method could be raw, Euclidean, or Mahalanobis? I'm not sure I like this idea, but I'm curious if you have any inclination.

Does that make scikit-learn-knn-regression (package) and sklearn-knn-regression (imported module) too unwieldy as a names?

They are a little wordy... I think the package name would be okay since it's only used once or twice, but the module name is definitely a mouthful to import every time. I don't want to criticize without tossing out some alternatives, so just to put a few ideas on the board for consideration:

sknn (this is already in use by a relatively popular archived project, so probably not a good choice)
skknn
skneighbors
sknear

grovduck · 2023-05-10T17:15:22Z

In the interest of thinking through every option, would there be any advantage to combining similar estimators into a single estimator that could be selected with a parameter like method? For example, say a KNNRegressor where method could be raw, Euclidean, or Mahalanobis? I'm not sure I like this idea, but I'm curious if you have any inclination.

I definitely agree that there is very little difference between the estimators we've introduced and that they likely could be combined into a single estimator. My aversion to this is mainly based on trying to decipher the yai function in yaImpute which takes exactly this approach (method is a parameter). The complexity of that function is quite high with a lot of conditionals (granted it has even more parameters than just method that causes this complexity). So it's entirely possible that I've gone overboard (too many distinct classes) when porting these estimators over. I'm curious what you feel is the right balance.

I think one goal that I have in mind is inter-comparison between the methods along with their hyperparameters through GridSearchCV or RandomizedSearchCV. Perhaps having method as a hyperparameter for a single estimator versus different estimators makes this comparison easier? Is that what you were thinking?

I don't want to criticize without tossing out some alternatives, so just to put a few ideas on the board for consideration

I agree we'd probably want to avoid an import name that has been used before (unfortunately "nearest-neighbors" and "neural-networks" share the same abbreviation). All of your suggestions are better than mine, but thought I'd throw one more out - sknnr (aka "skinner" - a nod to the Simpsons principal). It looks like there is a package on PyPi called "skinner", but not under active development. I like "skneighbors" as well.

aazuspan · 2023-05-10T18:52:58Z

My aversion to this is mainly based on trying to decipher the yai function in yaImpute which takes exactly this approach (method is a parameter). The complexity of that function is quite high with a lot of conditionals

Yes, I think that kind of parameter spaghetti is definitely something we should try to avoid. The sklearn framework gives us an advantage there since it forces separation of the model configuration from the fitting, predicting, and scoring parameters, but I still think you're right to err on the side of atomic over configurable.

I'm curious what you feel is the right balance.

I think either approach is reasonable, but I'm leaning towards the way you have it set up now. I agree that simpler functions are better, and in particular I'm not a fan of string params with predefined options because they're harder for users to remember and force us to handle invalid inputs. In any case, I don't think we're locking ourselves into anything and can always reconsider later. I imagine if we wanted to switch to a method param, we would probably leave the estimators as is and just add a KNNRegressor function that dispatches to the appropriate estimator, so no big code changes needed.

I think one goal that I have in mind is inter-comparison between the methods along with their hyperparameters through GridSearchCV or RandomizedSearchCV. Perhaps having method as a hyperparameter for a single estimator versus different estimators makes this comparison easier?

I don't have a ton of experience with those, but the conclusion I came to was that a method parameter would only help for comparing within that single estimator, and might make it harder to compare between other estimators. This is a good thing to keep in mind though.

All of your suggestions are better than mine, but thought I'd throw one more out - sknnr (aka "skinner" - a nod to the Simpsons principal). It looks like there is a package on PyPi called "skinner", but not under active development. I like "skneighbors" as well.

I like sknnr! That matches the proposed package name better and looks like it should be more searchable than skneighbors, which Google thinks is a typo for kneighbors. That's at the top of my list.

grovduck · 2023-05-10T21:11:34Z

All sounds good to me, @aazuspan. I agree that we can revisit the decision of the method parameter, but let's stick with what we have now. Are we ready to implement this? I'm happy to take a cut at implementing it, although I could use some advice on the cleanest way to rename the package without really messing things up. I guess that's probably a simple rename on Github and it shouldn't impact any local repos (because you can give the local repo whatever name you want, right?)

aazuspan · 2023-05-10T21:29:50Z

Are we ready to implement this?

I think so! It will be a little painful to merge that change in with any branches we have ongoing since I think git will consider all the old src files deleted, but we'll have to deal with it at some point, so I'm not sure we gain anything by waiting.

I could use some advice on the cleanest way to rename the package without really messing things up

I think locally it should be as simple as a find-replace in VS Code and renaming the module folder. There could be some other snags I'm not remembering, but I've done this a few times and can't remember any major issues. It's also possible we'll have to rebuild our hatch environments using hatch env prune, but I'm not sure.

I guess that's probably a simple rename on Github and it shouldn't impact any local repos (because you can give the local repo whatever name you want, right?)

Yeah, that should be pretty painless. You can continue to push to the old repo name and Github will redirect it, although it might raise a warning. I think the long-term fix is to run git remote set-url origin ... with the new URL.

grovduck · 2023-05-10T21:47:08Z

It will be a little painful to merge that change in with any branches we have ongoing since I think git will consider all the old src files deleted, but we'll have to deal with it at some point, so I'm not sure we gain anything by waiting.

I may be showing my ignorance here, but considering our flow to this point, it seems like we only have one or maybe two active feature branches going and right now they are branching off fb_add_estimators (itself supposedly a feature branch). I don't necessarily think we need to retain a develop branch based on how we've been working (likely more trouble than its worth). If we make the naming change now, continue to develop on fb_add_estimators until we've got the first five estimators implemented and then rebase onto main, do we avoid the problem you mention?

grovduck · 2023-05-10T22:03:27Z

I propose renaming MyStandardScaler to StandardScalerWithDOF as part of this as well. What do you think of this implementation?

import numpy as np

class StandardScalerWithDOF(StandardScaler):
    def __init__(self, ddof=0):
        super().__init__()
        self.ddof_ = ddof
    def fit(self, X, y=None, sample_weight=None):
        scaler = super().fit(X, y, sample_weight)
        scaler.scale_ = np.std(X, axis=0, ddof=self.ddof_)
        return scaler

aazuspan · 2023-05-10T22:27:30Z

If we make the naming change now, continue to develop on fb_add_estimators until we've got the first five estimators implemented and then rebase onto main, do we avoid the problem you mention?

The problem I'm thinking of is with unpushed local branches. For example, if you're working on src/sklearn_knn/msn.py and we rename that to src/sknnr/msn.py, I think that merging those changes back to your local branch will require manually pulling code out of the "deleted" original file and into the renamed file. Does that sound right to you?

I don't necessarily think we need to retain a develop branch based on how we've been working (likely more trouble than its worth).

I've never worked with a develop branch before, so unless you think there would be some advantages once we start making releases, I'm happy to do without it.

I propose renaming MyStandardScaler to StandardScalerWithDOF as part of this as well. What do you think of this implementation?

I like the name and the added flexibility with the implementation. My only concern would be how that fits in with #15. If it's a quick fix to make it dataframe compatible, maybe we could tackle two issues with one commit, so to speak.

I was also wondering if that should be moved out of _base and into the transformers module?

grovduck · 2023-05-10T22:39:12Z

The problem I'm thinking of is with unpushed local branches

Ah, yep, I totally agree with this. At this time, I don't have any unpushed local branches (I stashed the work on MSN and deleted that branch before bringing in your work on use_regressors. You're right that I'll still need to do some manual transfer after popping the stash, but I think it should be fairly painless at this stage. And, like you say, better now than later.

I've never worked with a develop branch before, so unless you think there would be some advantages once we start making releases, I'm happy to do without it.

Again, we can always introduce it down the line if we find that it becomes necessary.

My only concern would be how that fits in with #15. If it's a quick fix to make it dataframe compatible, maybe we could tackle two issues with one commit, so to speak. I was also wondering if that should be moved out of _base and into the transformers module?

Good thoughts for both of these. I'm leaning toward a separate PR once we do the renaming. I'll do the renaming now.

aazuspan · 2023-05-10T23:33:01Z

I'm leaning toward a separate PR once we do the renaming. I'll do the renaming now.

Sounds good to me!

grovduck · 2023-05-11T18:22:09Z

Resolved all but StandardScaler via #18

grovduck mentioned this issue Apr 17, 2023

Refactor estimators for simplicity and make tests more modular #1

Merged

grovduck added the refactor Code cleanup without changing functionality label May 1, 2023

aazuspan mentioned this issue May 8, 2023

Support continuous multioutput targets #13

Closed

aazuspan assigned grovduck May 10, 2023

grovduck mentioned this issue May 10, 2023

Rename estimators and package #18

Merged

grovduck closed this as completed May 11, 2023

grovduck mentioned this issue May 11, 2023

Rename MyStandardScaler and fix to work with dataframes #19

Merged

grovduck linked a pull request Jun 7, 2023 that will close this issue

Rename estimators and package #18

Merged

grovduck added this to the Core Estimators milestone Jun 7, 2023

grovduck mentioned this issue Oct 3, 2023

Set up documentation #59

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Finalize naming conventions for estimators and package #6

Finalize naming conventions for estimators and package #6

grovduck commented Apr 17, 2023

aazuspan commented May 5, 2023

grovduck commented May 5, 2023

aazuspan commented May 5, 2023

grovduck commented May 8, 2023

aazuspan commented May 8, 2023

grovduck commented May 10, 2023

aazuspan commented May 10, 2023

grovduck commented May 10, 2023

aazuspan commented May 10, 2023

grovduck commented May 10, 2023

grovduck commented May 10, 2023

aazuspan commented May 10, 2023

grovduck commented May 10, 2023

aazuspan commented May 10, 2023

grovduck commented May 11, 2023

Finalize naming conventions for estimators and package #6

Finalize naming conventions for estimators and package #6

Comments

grovduck commented Apr 17, 2023

aazuspan commented May 5, 2023

grovduck commented May 5, 2023

aazuspan commented May 5, 2023

grovduck commented May 8, 2023

aazuspan commented May 8, 2023

grovduck commented May 10, 2023

aazuspan commented May 10, 2023

grovduck commented May 10, 2023

aazuspan commented May 10, 2023

grovduck commented May 10, 2023

grovduck commented May 10, 2023

aazuspan commented May 10, 2023

grovduck commented May 10, 2023

aazuspan commented May 10, 2023

grovduck commented May 11, 2023