Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pyproject.toml #11

Merged
merged 5 commits into from May 2, 2023
Merged

pyproject.toml #11

merged 5 commits into from May 2, 2023

Conversation

aazuspan
Copy link
Contributor

This would close #9 by swapping setup.py for pyproject.toml and setuptools for hatch.

@grovduck I went ahead and set this up using hatch just to give you an idea how it would work, but I'd like to get your thoughts on that. In particular, from a build perspective, there's a question of whether current lack of support for Cython in hatch is a deal-breaker. I would lean towards no, but you obviously have a lot more experience with the performance considerations here, so I'd defer to you.

The other consideration is whether to use hatch for environment management. I set this up with that in mind (ie removing the optional dependencies in favor of using hatch environments), and I think there are some strong benefits to making sure developers and CI are all testing with identical environments, but if it doesn't work with your workflow then we should obviously reconsider. We could still use hatch for building, versioning, and publishing, but go back to using pip install -e .[dev] and leaving it up to developers to set up their environments.

Other changes:

- Remove optional test dependencies in favor of hatch environments
- Remove optional dependencies already defined by pre-commit
- Remove bumpversion and twine in favor of hatch
- Set dynamic __version__ handled by hatch
- Update CI to use hatch
- Add Aaron Zuspan as second author
- Update developer guide
`hatch shell` doesn't seem to work in a Github action. Instead, it
should work to use `hatch run` to run the sourcery login in the dev
environment and let the pre-commit action handle installation of
pre-commit.
@aazuspan aazuspan added the dx Developer experience, tools, efficiency, etc. label Apr 27, 2023
@grovduck
Copy link
Member

In particular, from a build perspective, there's a question of whether current lack of support for Cython in hatch is a deal-breaker. I would lean towards no, but you obviously have a lot more experience with the performance considerations here, so I'd defer to you.

If we think of this package as just the extended estimators for now, any bottlenecks would likely be in fitting and not in prediction. For the "complicated" estimators (e.g. GNN and MSN), once we get the matrix to do the transform on the environmental data through the fit process, it should be fast to apply this in a predict or score context. KNeighborsClassifier itself uses either BallTree or KDTree search algorithms which are already optimized.

The only unknown for me at this point is RFNN which is likely the slowest of the bunch. But I don't think we're going to write anything other than the nearest neighbor "distance" which is really just a similarity matrix between a 1-d vector (target) and a 2-d array of vectors (references). That should also be fairly well optimized in numpy, but I could be wrong about that!

I played around with hatch for 30 minutes at the end of last week and can understand the appeal for sure. Given that you've done all the hard work of setting up the initial configuration, my hope is that I'm trainable and can get up to speed quickly. Plus, you've already shown your capacity for putting up with my stupid questions!

@grovduck
Copy link
Member

The only change I'd probably make is to add my name rather than "LEMMA group @ Oregon State University". That will be obvious in the repo name already.

Also, I noticed the switch back to __version__ in sklearn_knn/__init__.py. Sounds like this is the way to do dynamic version bumping with hatch? In a really quick Google, I found that some folks and perhaps the hatch project itself put the version in an about.py file instead? Any preference here?

@aazuspan
Copy link
Contributor Author

Thanks for the thoughts on performance! That was my hope, that the existing optimizations in scikit-learn and numpy will be able to do the heavy lifting for us. I suppose if we do hit a performance bottleneck in the future, we can always reconsider the build system at that time.

The only change I'd probably make is to add my name rather than "LEMMA group @ Oregon State University"

Good point, will do!

Sounds like this is the way to do dynamic version bumping with hatch?

Yes. I guess this is technically contrary to the PEP 396 recommendation we discussed before that recommended against a __version__, but this seems to be the hatch approach and still allows for importlib.metdata.version("scikit-learn-knn") to function. As far as I can tell it's just a win-win since it allows for users to access it both ways without any maintenance hassle on our end.

In a really quick Google, I found that some folks and perhaps the hatch project itself put the version in an about.py file instead? Any preference here?

Yeah, I think you're right that this is the standard approach. My first thought was that less files is better and we could avoid importing __version__ into __init__.py to make it accessible, but sticking to the standards and separating the metadata is probably preferable overall. I'll move that over!

@grovduck
Copy link
Member

All looks great to me, @aazuspan. I made a particular valuable edit from "Matthew" to "Matt". I think the side benefit is that the pre-commit with the new sourcery key didn't cause any issues.

@grovduck
Copy link
Member

grovduck commented May 1, 2023

Hey @aazuspan, just test-driving using hatch for setting up the development environment (sorry that I hadn't done this previously). Here were my steps:

> git clone git@github.com:lemma-osu/scikit-learn-knn.git
> git checkout pyproject
> hatch shell (I have hatch installed globally)
(scikit-learn-knn) > pip list
Package          Version Editable project location
---------------- ------- -------------------------
cfgv             3.3.1
distlib          0.3.6
filelock         3.12.0
identify         2.5.23
joblib           1.2.0
nodeenv          1.7.0
numpy            1.24.3
pip              23.1
platformdirs     3.5.0
pre-commit       3.2.2
PyYAML           6.0
scikit-learn     1.2.2
scikit-learn-knn 0.1.0   D:\scikit-learn-knn
scipy            1.10.1
setuptools       67.6.1
sourcery         1.2.0
threadpoolctl    3.1.0
virtualenv       20.23.0
wheel            0.40.0

(scikit-learn-knn) > python
>>> from sklearn_knn import GNN
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'sklearn_knn'

I'm sure it's a silly error, but not sure what I'm doing incorrectly and guessing users might hit up against the same issues? pip show scikit-learn-knn shows the correct info and tests run fine. Ahh, I think I just stumbled into it. This actually works:

(scikit-learn-knn) > python
>>> from src.sklearn_knn import GNN

So, somehow hatch isn't registering src as the root code location?

@aazuspan
Copy link
Contributor Author

aazuspan commented May 1, 2023

Apparently I hadn't tried this either, because I get the same error. Good catch!

I did find that changing name in pyproject.toml to sklearn_knn resolves the issue, so it seems to be a problem with how hatch locates the package. There must be a way to set up a different package and distribution name, since that's such a common practice, but I haven't found it yet...

EDIT: I'm stumped, so I opened pypa/hatch#845

See pypa/hatch#845. This was necessary because our project name
is different from our package name, so hatch wasn't able to
locate it in the default path. Note that if the shell has been
built before making this change, it will need to be manually
rebuilt by running `hatch env prune`.
@aazuspan
Copy link
Contributor Author

aazuspan commented May 2, 2023

We have a solution courtesy of the hatch maintainers! We just had to manually point hatch to the sklearn_knn path, similar to how you'd specify packages in a setup.py.

You will probably have to remove the environment for the change to take effect, which you should be able to do with hatch env prune. Let me know if that works on your end 🤞

EDIT: Back to the question of performance, if we hit bottlenecks we could look into mypyc which converts typed Python to C extensions and has a hatch plugin. I've never used it before, but it sounds like it should be a lot easier than having to manually write C or Cython...

@grovduck
Copy link
Member

grovduck commented May 2, 2023

Awesome find, @aazuspan. Just checked and it works for me after pruning my environment!

Wow, mypyc looks really cool. It has the promise of being so much easier than Cython.

Even though I'm likely 25% of understanding everything hatch does for us, I feel like you have a great handle on the configuration and I'm ready to merge this PR whenever you are. I got Mahalanobis coded and tests passing yesterday, but it seems like we might want to merge this one back into fb_add_estimators before I branch just because of the major changes to the build environment. Thoughts?

@aazuspan
Copy link
Contributor Author

aazuspan commented May 2, 2023

Great, glad you caught this before we merged!

Very exciting to have Mahalanobis up and running! I think you're right that merging this first makes sense, since it will change the environment the tests run in. Shouldn't make a practical difference, but better to be on the safe side.

Consider it merged!

@aazuspan aazuspan merged commit fe5fe01 into fb_add_estimators May 2, 2023
10 checks passed
@aazuspan aazuspan deleted the pyproject branch May 2, 2023 17:58
@grovduck
Copy link
Member

grovduck commented May 2, 2023

Excellent, thx!

Quick question regarding VSCode. When you're using hatch do you change the interpreter path to the install location of the hatch venv? Right now, it's pointing to my "default" venv, but I assume it's best to point to the hatch venv or perhaps even the hatch test venv (for me, this is something like ...\AppData\Local\hatch\env\virtual\scikit-learn-knn\kdkx8vzg\test?

Capture

@aazuspan
Copy link
Contributor Author

aazuspan commented May 2, 2023

Yeah, I think pointing to the test env like you suggested probably makes the most sense. That's what I'm currently doing.

@grovduck grovduck linked an issue Jun 7, 2023 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dx Developer experience, tools, efficiency, etc.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pyproject.toml and build system
2 participants