Merge branch 'main' into refactor_ordination
grovduck committed Oct 17, 2023
2 parents a5d0d84 + e112c2c commit 12b6641
Showing 22 changed files with 447 additions and 48 deletions.
13 changes: 13 additions & 0 deletions .readthedocs.yaml
@@ -0,0 +1,13 @@
version: 2

build:
  os: ubuntu-22.04
  tools:
    python: "3.11"

mkdocs:
  configuration: docs/mkdocs.yml

python:
  install:
    - requirements: docs/requirements.txt
106 changes: 58 additions & 48 deletions README.md
@@ -1,77 +1,87 @@
# scikit-learn-knn-regression
# sknnr

This package is in active development.
> ⚠️ **WARNING: sknnr is in active development!** ⚠️
## Developer Guide
## What is sknnr?

### Setup
`sknnr` is a package for running k-nearest neighbor (kNN) imputation[^imputation] methods using estimators that are fully compatible with [`scikit-learn`](https://scikit-learn.org/stable/). Notably, common methods such as most similar neighbor (MSN, Moeur & Stage 1995), gradient nearest neighbor (GNN, Ohmann & Gregory 2002), and random forest nearest neighbors[^rfnn] (RFNN, Crookston & Finley 2008) are included in this package.

This project uses [hatch](https://hatch.pypa.io/latest/) to manage the development environment and build and publish releases. Make sure `hatch` is [installed](https://hatch.pypa.io/latest/install/) first:
## Features

```bash
$ pip install hatch
```
- 🤝 Tight integration with the [`scikit-learn`](https://scikit-learn.org/stable/) API
- 🐼 Native support for [`pandas`](https://pandas.pydata.org/) dataframes
- 📊 [Multi-output](https://scikit-learn.org/stable/modules/multiclass.html) estimators for [regression and classification](https://sknnr.readthedocs.io/usage/#regression-and-classification)
- 📝 Results validated against [yaImpute](https://cran.r-project.org/web/packages/yaImpute/index.html) (Crookston & Finley 2008)[^validation]

Now you can [enter the development environment](https://hatch.pypa.io/latest/environment/#entering-environments) using:
## Why the Name "sknnr"?

```bash
$ hatch shell
```
`sknnr` is an acronym of its three main components:

This will install development dependencies in an isolated environment and drop you into a shell (use `exit` to leave).
1. **"s"** is for `scikit-learn`. All estimators in this package derive from the `sklearn.base.BaseEstimator` class and comply with the requirements associated with [developing custom estimators](https://scikit-learn.org/stable/developers/develop.html).
2. **"knn"** is for k-nearest neighbors. All estimators use the _k_ >= 1 samples that are nearest in feature space to create their prediction. Each estimator in this package defines that feature space in a different way, which often leads to different neighbors being chosen for the prediction.
3. **"r"** is for regression. Estimators in this package are run in [regression mode](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html). For nearest neighbor imputation, this is simply an (optionally-weighted) average of its _k_ neighbors. When _k_ is set to 1, this effectively behaves as in [classification mode](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html). All estimators support multi-output prediction so that multiple features can be predicted with the same estimator.
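
The "regression mode" described in the list above can be sketched in a few lines of plain Python. This is only an illustration of the idea (an optionally weighted average of the _k_ nearest neighbors, applied to several outputs at once), not the `sknnr` implementation. The function name `knn_predict`, the inverse-distance weighting, and the toy data are assumptions made for this example, and it uses plain Euclidean distance on untransformed features, whereas each `sknnr` estimator defines its own feature space.

```python
import math


def knn_predict(X_train, y_train, x_query, k=1, weighted=False):
    """Predict all outputs for one query as the mean of its k nearest neighbors.

    Hypothetical helper for illustration only; not part of sknnr.
    """
    # Euclidean distance from the query to every training sample.
    distances = [math.dist(x, x_query) for x in X_train]
    # Indices of the k closest training samples.
    nearest = sorted(range(len(X_train)), key=distances.__getitem__)[:k]
    if weighted:
        # Inverse-distance weights; the small constant avoids division by zero.
        weights = [1.0 / (distances[i] + 1e-12) for i in nearest]
    else:
        weights = [1.0] * len(nearest)
    total = sum(weights)
    n_outputs = len(y_train[0])
    # One weighted average per output column (multi-output prediction).
    return [
        sum(w * y_train[i][j] for w, i in zip(weights, nearest)) / total
        for j in range(n_outputs)
    ]


# Toy data: four samples, two features, two outputs.
X_train = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
y_train = [[10.0, 1.0], [20.0, 2.0], [30.0, 3.0], [100.0, 9.0]]

print(knn_predict(X_train, y_train, [0.1, 0.0], k=1))  # [10.0, 1.0]
print(knn_predict(X_train, y_train, [0.1, 0.0], k=3))  # [20.0, 2.0]
```

With `k=1` the prediction is simply the single nearest neighbor's values, which is why that setting behaves like classification mode.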

### Pre-commit
## Quick-Start

Use [pre-commit](https://pre-commit.com/) to run linting, type-checking, and formatting:
1. Follow the [installation guide](https://sknnr.readthedocs.io/installation).
2. Import any `sknnr` estimator, like [MSNRegressor](https://sknnr.readthedocs.io/api/estimators/msn), as a drop-in replacement for a `scikit-learn` regressor.
```python
from sknnr import MSNRegressor

```bash
$ pre-commit run --all-files
est = MSNRegressor()
```
3. Load a custom dataset like [SWO Ecoplot](https://sknnr.readthedocs.io/api/datasets/swo_ecoplot) (or bring your own).
```python
from sknnr.datasets import load_swo_ecoplot

...or install it to run automatically before every commit with:

```bash
$ pre-commit install
X, y = load_swo_ecoplot(return_X_y=True, as_frame=True)
```
4. Train, predict, and score [as usual](https://scikit-learn.org/stable/getting_started.html#fitting-and-predicting-estimator-basics).
```python
from sklearn.model_selection import train_test_split

You can run pre-commit hooks separately and pass additional arguments to them. For example, to run `black` on a single file:
X_train, X_test, y_train, y_test = train_test_split(X, y)

```bash
$ pre-commit run black --files=src/sknnr/_base.py
est = est.fit(X_train, y_train)
est.score(X_test, y_test)
```
5. Check out the additional features like [independent scoring](https://sknnr.readthedocs.io/usage/#independent-scores-and-predictions), [dataframe indexing](https://sknnr.readthedocs.io/usage/#retrieving-dataframe-indexes), and [dimensionality reduction](https://sknnr.readthedocs.io/usage/#dimensionality-reduction).
```python
# Evaluate the model using the second-nearest neighbor in the training set
print(est.fit(X, y).independent_score_)

### Testing

Unit tests are *not* run by `pre-commit`, but can be run manually using `hatch` [scripts](https://hatch.pypa.io/latest/config/environment/overview/#scripts):
# Get the dataframe index of the nearest neighbor to each plot
print(est.kneighbors(return_dataframe_index=True, return_distance=False))

```bash
$ hatch run test:all
# Apply dimensionality reduction using CCorA ordination
MSNRegressor(n_components=3).fit(X_train, y_train)
```

Measure test coverage with:
## History and Inspiration
`sknnr` was heavily inspired by (and endeavors to implement functionality of) the [yaImpute](https://cran.r-project.org/web/packages/yaImpute/index.html) package for R (Crookston & Finley 2008). As Crookston and Finley (2008) note in their abstract,
> Although nearest neighbor imputation is used in a host of disciplines, the methods implemented in the yaImpute package are tailored to imputation-based forest attribute estimation and mapping ... [there is] a growing interest in nearest neighbor imputation methods for spatially explicit forest inventory, and a need within this research community for software that facilitates comparison among different nearest neighbor search algorithms and subsequent imputation techniques.
```bash
$ hatch run test:coverage
```
Indeed, many regional (e.g. [LEMMA](https://lemmadownload.forestry.oregonstate.edu/)) and national (e.g. [BIGMAP](https://storymaps.arcgis.com/stories/c710684b98f54452804e8960d37905b2), [TreeMap](https://www.firelab.org/project/treemap-tree-level-model-forests-united-states)) projects use nearest-neighbor methods to
estimate and map forest attributes across time and space.

Any additional arguments are passed to `pytest`. For example, to run a subset of tests matching a keyword:
To that end, `sknnr` ports and expands the functionality present in `yaImpute` into a Python package that helps facilitate intercomparison between k-nearest neighbor methods (and other built-in estimators from `scikit-learn`) using an API which is familiar to `scikit-learn` users.

```bash
$ hatch run test:all -k gnn
```
## Acknowledgements

### Releasing
Thanks to Andrew Hudak (USDA Forest Service Rocky Mountain Research Station) for the inclusion of the [Moscow Mountain / St. Joes dataset](https://sknnr.readthedocs.io/api/datasets/moscow_stjoes) (Hudak 2010), and the USDA Forest Service Region 6 Ecology Team for the inclusion of the [SWO Ecoplot dataset](https://sknnr.readthedocs.io/api/datasets/swo_ecoplot) (Atzet et al., 1996). Development of this package was funded by:

First, use `hatch` to [update the version number](https://hatch.pypa.io/latest/version/#updating).
- an appointment to the United States Forest Service (USFS) Research Participation Program administered by the Oak Ridge Institute for Science and Education (ORISE) through an interagency agreement between the U.S. Department of Energy (DOE) and the U.S. Department of Agriculture (USDA).
- a joint venture agreement between USFS Pacific Northwest Research Station and Oregon State University (agreement 19-JV-11261959-064).
- a cost-reimbursable agreement between USFS Region 6 and Oregon State University (agreement 21-CR-11062756-046).

```bash
$ hatch version [major|minor|patch]
```
## References

Then, [build](https://hatch.pypa.io/latest/build/#building) and [publish](https://hatch.pypa.io/latest/publish/#publishing) the release to PyPI with:
- Atzet T, White DE, McCrimmon LA, Martinez PA, Fong PR, Randall VD. 1996. Field guide to the forested plant associations of southwestern Oregon. USDA Forest Service. Pacific Northwest Region, Technical Paper R6-NR-ECOL-TP-17-96.
- Crookston NL, Finley AO. 2008. yaImpute: An R package for kNN imputation. Journal of Statistical Software, 23(10), 16.
- Hudak AT. 2010. Field plot measures and predictive maps for "Nearest neighbor imputation of species-level, plot-scale forest structure attributes from LiDAR data". Fort Collins, CO: U.S. Department of Agriculture, Forest Service, Rocky Mountain Research Station. https://www.fs.usda.gov/rds/archive/Catalog/RDS-2010-0012.
- Moeur M, Stage AR. 1995. Most Similar Neighbor: An Improved Sampling Inference Procedure for Natural Resources Planning. Forest Science, 41(2), 337–359.
- Ohmann JL, Gregory MJ. 2002. Predictive Mapping of Forest Composition and Structure with Direct Gradient Analysis and Nearest Neighbor Imputation in Coastal Oregon, USA. Canadian Journal of Forest Research, 32, 725–741.

```bash
$ hatch clean
$ hatch build
$ hatch publish
```
[^imputation]: In a mapping context, kNN imputation refers to predicting feature values for a target from its k-nearest neighbors, and should not be confused with the usual `scikit-learn` usage as a pre-filling strategy for missing input data, e.g. [`KNNImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html).
[^rfnn]: In [development](https://github.com/lemma-osu/scikit-learn-knn-regression/issues/24)!
[^validation]: All estimators and parameters with equivalent functionality in `yaImpute` are tested to 3 decimal places against the R package.
6 changes: 6 additions & 0 deletions docs/abbreviations.md
@@ -0,0 +1,6 @@
*[GNN]: Gradient Nearest Neighbor
*[MSN]: Most Similar Neighbor
*[kNN]: k-nearest neighbor
*[RFNN]: Random Forest Nearest Neighbor
*[CCorA]: Canonical Correlation Analysis
*[CCA]: Canonical Correspondence Analysis
82 changes: 82 additions & 0 deletions docs/mkdocs.yml
@@ -0,0 +1,82 @@
site_name: sknnr
repo_url: https://github.com/lemma-osu/scikit-learn-knn-regression
repo_name: lemma-osu/scikit-learn-knn-regression
docs_dir: pages/

nav:
  - Home: index.md
  - Installation: installation.md
  - Usage: usage.md
  - "API Reference":
    - Estimators:
      - RawKNNRegressor: api/estimators/raw.md
      - EuclideanKNNRegressor: api/estimators/euclidean.md
      - MahalanobisKNNRegressor: api/estimators/mahalanobis.md
      - GNNRegressor: api/estimators/gnn.md
      - MSNRegressor: api/estimators/msn.md
    - Transformers:
      - StandardScalerWithDOF: api/transformers/standardscalerwithdof.md
      - MahalanobisTransformer: api/transformers/mahalanobis.md
      - CCATransformer: api/transformers/cca.md
      - CCorATransformer: api/transformers/ccora.md
    - Datasets:
      - Dataset: api/datasets/dataset.md
      - "Moscow Mountain / St. Joes": api/datasets/moscow_stjoes.md
      - "SWO Ecoplot": api/datasets/swo_ecoplot.md
  - Contributing: contributing.md

theme:
  name: material
  features:
    - search.suggest
    - search.highlight
    - navigation.instant
    - navigation.path
    - content.code.copy
    - content.code.annotate
  palette:
    - media: "(prefers-color-scheme: light)"
      scheme: default
      toggle:
        icon: material/weather-night
        name: Dark mode
    - media: "(prefers-color-scheme: dark)"
      scheme: slate
      toggle:
        icon: material/weather-sunny
        name: Light mode

plugins:
  - search
  - mkdocstrings:
      handlers:
        python:
          paths: [../src]
          options:
            show_source: false
            inherited_members: true
            undoc_members: true
            docstring_style: numpy
            show_if_no_docstring: true
            show_signature_annotations: true
            show_root_heading: true
            show_category_heading: true
            merge_init_into_class: true
            signature_crossrefs: true

markdown_extensions:
  - abbr
  - admonition
  - tables
  - footnotes
  - toc:
      permalink: true
  - pymdownx.snippets:
      auto_append:
        - docs/abbreviations.md
  - pymdownx.highlight:
      anchor_linenums: true
      line_spans: __span
      pygments_lang_class: true
  - pymdownx.inlinehilite
  - pymdownx.superfences
1 change: 1 addition & 0 deletions docs/pages/api/datasets/dataset.md
@@ -0,0 +1 @@
::: sknnr.datasets._base.Dataset
1 change: 1 addition & 0 deletions docs/pages/api/datasets/moscow_stjoes.md
@@ -0,0 +1 @@
::: sknnr.datasets.load_moscow_stjoes
1 change: 1 addition & 0 deletions docs/pages/api/datasets/swo_ecoplot.md
@@ -0,0 +1 @@
::: sknnr.datasets.load_swo_ecoplot
1 change: 1 addition & 0 deletions docs/pages/api/estimators/euclidean.md
@@ -0,0 +1 @@
::: sknnr.EuclideanKNNRegressor
1 change: 1 addition & 0 deletions docs/pages/api/estimators/gnn.md
@@ -0,0 +1 @@
::: sknnr.GNNRegressor
1 change: 1 addition & 0 deletions docs/pages/api/estimators/mahalanobis.md
@@ -0,0 +1 @@
::: sknnr.MahalanobisKNNRegressor
1 change: 1 addition & 0 deletions docs/pages/api/estimators/msn.md
@@ -0,0 +1 @@
::: sknnr.MSNRegressor
1 change: 1 addition & 0 deletions docs/pages/api/estimators/raw.md
@@ -0,0 +1 @@
::: sknnr.RawKNNRegressor
1 change: 1 addition & 0 deletions docs/pages/api/transformers/cca.md
@@ -0,0 +1 @@
::: sknnr.transformers.CCATransformer
1 change: 1 addition & 0 deletions docs/pages/api/transformers/ccora.md
@@ -0,0 +1 @@
::: sknnr.transformers.CCorATransformer
1 change: 1 addition & 0 deletions docs/pages/api/transformers/mahalanobis.md
@@ -0,0 +1 @@
::: sknnr.transformers.MahalanobisTransformer
1 change: 1 addition & 0 deletions docs/pages/api/transformers/standardscalerwithdof.md
@@ -0,0 +1 @@
::: sknnr.transformers.StandardScalerWithDOF
91 changes: 91 additions & 0 deletions docs/pages/contributing.md
@@ -0,0 +1,91 @@
# Contributing

## Developer Guide

### Setup

This project uses [hatch](https://hatch.pypa.io/latest/) to manage the development environment and build and publish releases. Make sure `hatch` is [installed](https://hatch.pypa.io/latest/install/) first:

```bash
$ pip install hatch
```

Now you can [enter the development environment](https://hatch.pypa.io/latest/environment/#entering-environments) using:

```bash
$ hatch shell
```

This will install development dependencies in an isolated environment and drop you into a shell (use `exit` to leave).

### Pre-commit

Use [pre-commit](https://pre-commit.com/) to run linting, type-checking, and formatting:

```bash
$ pre-commit run --all-files
```

...or install it to run automatically before every commit with:

```bash
$ pre-commit install
```

You can run pre-commit hooks separately and pass additional arguments to them. For example, to run `black` on a single file:

```bash
$ pre-commit run black --files=src/sknnr/_base.py
```

### Testing

Unit tests are *not* run by `pre-commit`, but can be run manually using `hatch` [scripts](https://hatch.pypa.io/latest/config/environment/overview/#scripts):

```bash
$ hatch run test:all
```

Measure test coverage with:

```bash
$ hatch run test:coverage
```

Any additional arguments are passed to `pytest`. For example, to run a subset of tests matching a keyword:

```bash
$ hatch run test:all -k gnn
```

### Documentation

Documentation is built with [mkdocs](https://www.mkdocs.org/). During development, you can run a live-reloading server with:

```bash
$ hatch run docs:serve
```

The API reference is generated from NumPy-style docstrings using [mkdocstrings](https://mkdocstrings.github.io/). New classes can be added to the API reference by creating a new markdown file in the `docs/pages/api` directory, adding that file to the [`nav` tree](https://www.mkdocs.org/user-guide/configuration/#nav) in `docs/mkdocs.yml`, and [including the docstring](https://mkdocstrings.github.io/python/usage/#injecting-documentation) in the markdown file:

```markdown
::: sknnr.module.class
```
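
For reference, a docstring written in the style that the API reference is generated from might look like the sketch below. The class `ExampleTransformer` and its `factor` parameter are hypothetical and exist only to show the section layout (`Parameters`, `Attributes`, `Returns`); they are not part of `sknnr`.

```python
class ExampleTransformer:
    """Scale features by a constant factor.

    Hypothetical class used only to illustrate the docstring layout.

    Parameters
    ----------
    factor : float, default=1.0
        Multiplier applied to every feature value.

    Attributes
    ----------
    n_features_in_ : int
        Number of features seen during `fit`.
    """

    def __init__(self, factor=1.0):
        self.factor = factor

    def fit(self, X, y=None):
        """Record the number of input features.

        Parameters
        ----------
        X : list of list of float
            Training samples, one row per sample.
        y : None
            Ignored. Present for API consistency.

        Returns
        -------
        self : ExampleTransformer
            The fitted transformer.
        """
        self.n_features_in_ = len(X[0])
        return self

    def transform(self, X):
        """Return a scaled copy of X.

        Parameters
        ----------
        X : list of list of float
            Samples to scale, one row per sample.

        Returns
        -------
        X_scaled : list of list of float
            Each value of X multiplied by `factor`.
        """
        return [[value * self.factor for value in row] for row in X]
```

If such a class lived at a (hypothetical) path like `sknnr.transformers.ExampleTransformer`, the matching markdown file would contain a single `:::` line with that dotted path.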

Whenever the docs are updated, they will be automatically rebuilt and deployed by [ReadTheDocs](https://about.readthedocs.com). Build status can be monitored [here](https://readthedocs.org/projects/sknnr/builds/).

### Releasing

First, use `hatch` to [update the version number](https://hatch.pypa.io/latest/version/#updating).

```bash
$ hatch version [major|minor|patch]
```

Then, [build](https://hatch.pypa.io/latest/build/#building) and [publish](https://hatch.pypa.io/latest/publish/#publishing) the release to PyPI with:

```bash
$ hatch clean
$ hatch build
$ hatch publish
```
1 change: 1 addition & 0 deletions docs/pages/index.md
@@ -0,0 +1 @@
--8<-- "README.md"
17 changes: 17 additions & 0 deletions docs/pages/installation.md
@@ -0,0 +1,17 @@
# Installation

!!! info
    `sknnr` will be available through PyPI and conda-forge once it is ready for release. Until then, you can install it from source.

## From Source

```bash
pip install git+https://github.com/lemma-osu/scikit-learn-knn-regression@main
```

## Dependencies

- Python >= 3.8
- scikit-learn >= 1.2
- numpy
- scipy
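
One way to check an environment against the minimums listed above is sketched below. It uses only the standard library (`importlib.metadata`, which itself needs Python 3.8 or newer). The helper names `version_tuple`, `meets_minimum`, and `check_environment` are made up for this example, and the comparison only looks at the major and minor version numbers, so treat it as a rough check rather than a full version parser.

```python
import re
import sys
from importlib import metadata


def version_tuple(version, length=2):
    """Convert a version string such as '1.2.2' into leading integers, e.g. (1, 2)."""
    parts = []
    for piece in version.split(".")[:length]:
        # Keep only the leading digits so that '3rc1' is read as 3.
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed major.minor version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_environment():
    """Return a list of problems found; the list is empty if nothing is wrong."""
    problems = []
    if sys.version_info < (3, 8):
        problems.append("Python >= 3.8 is required")
    try:
        sklearn_version = metadata.version("scikit-learn")
    except metadata.PackageNotFoundError:
        problems.append("scikit-learn is not installed")
    else:
        if not meets_minimum(sklearn_version, "1.2"):
            problems.append(
                f"scikit-learn >= 1.2 is required, found {sklearn_version}"
            )
    # numpy and scipy have no minimum listed, so only check that they are present.
    for name in ("numpy", "scipy"):
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed")
    return problems


print(check_environment() or "All listed dependencies look satisfied.")
```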