diff --git a/.readthedocs.yaml b/.readthedocs.yaml new file mode 100644 index 0000000..53b9204 --- /dev/null +++ b/.readthedocs.yaml @@ -0,0 +1,13 @@ +version: 2 + +build: + os: ubuntu-22.04 + tools: + python: "3.11" + +mkdocs: + configuration: docs/mkdocs.yml + +python: + install: + - requirements: docs/requirements.txt \ No newline at end of file diff --git a/README.md b/README.md index 84fbb80..40dad85 100644 --- a/README.md +++ b/README.md @@ -1,77 +1,87 @@ -# scikit-learn-knn-regression +# sknnr -This package is in active development. +> ⚠️ **WARNING: sknnr is in active development!** ⚠️ -## Developer Guide +## What is sknnr? -### Setup +`sknnr` is a package for running k-nearest neighbor (kNN) imputation[^imputation] methods using estimators that are fully compatible with [`scikit-learn`](https://scikit-learn.org/stable/). Notably, common methods such as most similar neighbor (MSN, Moeur & Stage 1995), gradient nearest neighbor (GNN, Ohmann & Gregory, 2002), and random forest nearest neighbors[^rfnn] (RFNN, Crookston & Finley, 2008) are included in this package. -This project uses [hatch](https://hatch.pypa.io/latest/) to manage the development environment and build and publish releases. Make sure `hatch` is [installed](https://hatch.pypa.io/latest/install/) first: +## Features -```bash -$ pip install hatch -``` +- 🤝 Tight integration with the [`scikit-learn`](https://scikit-learn.org/stable/) API +- 🐼 Native support for [`pandas`](https://pandas.pydata.org/) dataframes +- 📊 [Multi-output](https://scikit-learn.org/stable/modules/multiclass.html) estimators for [regression and classification](https://sknnr.readthedocs.io/usage/#regression-and-classification) +- 📝 Results validated against [yaImpute](https://cran.r-project.org/web/packages/yaImpute/index.html) (Crookston & Finley 2008)[^validation] -Now you can [enter the development environment](https://hatch.pypa.io/latest/environment/#entering-environments) using: +## Why the Name "sknnr"? 
-```bash -$ hatch shell -``` +`sknnr` is an acronym of its three main components: -This will install development dependencies in an isolated environment and drop you into a shell (use `exit` to leave). +1. **"s"** is for `scikit-learn`. All estimators in this package derive from the `sklearn.BaseEstimator` class and comply with the requirements associated with [developing custom estimators](https://scikit-learn.org/stable/developers/develop.html). +2. **"knn"** is for k-nearest neighbors. All estimators use the _k_ >= 1 samples that are nearest in feature space to create their prediction. Each estimator in this package defines that feature space in a different way, which often leads to different neighbors being chosen for the prediction. +3. **"r"** is for regression. Estimators in this package are run in [regression mode](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html). For nearest neighbor imputation, this is simply an (optionally-weighted) average of its _k_ neighbors. When _k_ is set to 1, this effectively behaves as in [classification mode](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html). All estimators support multi-output prediction so that multiple features can be predicted with the same estimator. -### Pre-commit ## Quick-Start -Use [pre-commit](https://pre-commit.com/) to run linting, type-checking, and formatting: +1. Follow the [installation guide](https://sknnr.readthedocs.io/installation). +2. Import any `sknnr` estimator, like [MSNRegressor](https://sknnr.readthedocs.io/api/estimators/msn), as a drop-in replacement for a `scikit-learn` regressor. ```python +from sknnr import MSNRegressor -```bash -$ pre-commit run --all-files +est = MSNRegressor() ``` +3. Load a custom dataset like [SWO Ecoplot](https://sknnr.readthedocs.io/api/datasets/swo_ecoplot) (or bring your own). 
+```python +from sknnr.datasets import load_swo_ecoplot -...or install it to run automatically before every commit with: - -```bash -$ pre-commit install +X, y = load_swo_ecoplot(return_X_y=True, as_frame=True) ``` +4. Train, predict, and score [as usual](https://scikit-learn.org/stable/getting_started.html#fitting-and-predicting-estimator-basics). +```python +from sklearn.model_selection import train_test_split -You can run pre-commit hooks separately and pass additional arguments to them. For example, to run `black` on a single file: +X_train, X_test, y_train, y_test = train_test_split(X, y) -```bash -$ pre-commit run black --files=src/sknnr/_base.py +est = est.fit(X_train, y_train) +est.score(X_test, y_test) ``` +5. Check out the additional features like [independent scoring](https://sknnr.readthedocs.io/usage/#independent-scores-and-predictions), [dataframe indexing](https://sknnr.readthedocs.io/usage/#retrieving-dataframe-indexes), and [dimensionality reduction](https://sknnr.readthedocs.io/usage/#dimensionality-reduction). +```python +# Evaluate the model using the second-nearest neighbor in the training set +print(est.fit(X, y).independent_score_) -### Testing - -Unit tests are *not* run by `pre-commit`, but can be run manually using `hatch` [scripts](https://hatch.pypa.io/latest/config/environment/overview/#scripts): +# Get the dataframe index of the nearest neighbor to each plot +print(est.kneighbors(return_dataframe_index=True, return_distance=False)) -```bash -$ hatch run test:all +# Apply dimensionality reduction using CCorA ordination +MSNRegressor(n_components=3).fit(X_train, y_train) ``` -Measure test coverage with: ## History and Inspiration +`sknnr` was heavily inspired by (and endeavors to implement the functionality of) the [yaImpute](https://cran.r-project.org/web/packages/yaImpute/index.html) package for R (Crookston & Finley 2008). 
As Crookston and Finley (2008) note in their abstract, +> Although nearest neighbor imputation is used in a host of disciplines, the methods implemented in the yaImpute package are tailored to imputation-based forest attribute estimation and mapping ... [there is] a growing interest in nearest neighbor imputation methods for spatially explicit forest inventory, and a need within this research community for software that facilitates comparison among different nearest neighbor search algorithms and subsequent imputation techniques. -```bash -$ hatch run test:coverage -``` +Indeed, many regional (e.g. [LEMMA](https://lemmadownload.forestry.oregonstate.edu/)) and national (e.g. [BIGMAP](https://storymaps.arcgis.com/stories/c710684b98f54452804e8960d37905b2), [TreeMap](https://www.firelab.org/project/treemap-tree-level-model-forests-united-states)) projects use nearest-neighbor methods to +estimate and map forest attributes across time and space. -Any additional arguments are passed to `pytest`. For example, to run a subset of tests matching a keyword: +To that end, `sknnr` ports and expands the functionality present in `yaImpute` into a Python package that helps facilitate intercomparison between k-nearest neighbor methods (and other built-in estimators from `scikit-learn`) using an API which is familiar to `scikit-learn` users. -```bash -$ hatch run test:all -k gnn -``` +## Acknowledgements -### Releasing +Thanks to Andrew Hudak (USDA Forest Service Rocky Mountain Research Station) for the inclusion of the [Moscow Mountain / St. Joes dataset](https://sknnr.readthedocs.io/api/datasets/moscow_stjoes) (Hudak 2010), and the USDA Forest Service Region 6 Ecology Team for the inclusion of the [SWO Ecoplot dataset](https://sknnr.readthedocs.io/api/datasets/swo_ecoplot) (Atzet et al., 1996). Development of this package was funded by: -First, use `hatch` to [update the version number](https://hatch.pypa.io/latest/version/#updating). 
+- an appointment to the United States Forest Service (USFS) Research Participation Program administered by the Oak Ridge Institute for Science and Education (ORISE) through an interagency agreement between the U.S. Department of Energy (DOE) and the U.S. Department of Agriculture (USDA). +- a joint venture agreement between USFS Pacific Northwest Research Station and Oregon State University (agreement 19-JV-11261959-064). +- a cost-reimbursable agreement between USFS Region 6 and Oregon State University (agreement 21-CR-11062756-046). -```bash -$ hatch version [major|minor|patch] -``` ## References -Then, [build](https://hatch.pypa.io/latest/build/#building) and [publish](https://hatch.pypa.io/latest/publish/#publishing) the release to PyPI with: +- Atzet, T, DE White, LA McCrimmon, PA Martinez, PR Fong, and VD Randall. 1996. Field guide to the forested plant associations of southwestern Oregon. USDA Forest Service. Pacific Northwest Region, Technical Paper R6-NR-ECOL-TP-17-96. +- Crookston, NL, Finley, AO. 2008. yaImpute: An R package for kNN imputation. Journal of Statistical Software, 23(10), 1–16. +- Hudak, A.T. 2010. Field plot measures and predictive maps for "Nearest neighbor imputation of species-level, plot-scale forest structure attributes from LiDAR data". Fort Collins, CO: U.S. Department of Agriculture, Forest Service, Rocky Mountain Research Station. https://www.fs.usda.gov/rds/archive/Catalog/RDS-2010-0012. +- Moeur M, Stage AR. 1995. Most Similar Neighbor: An Improved Sampling Inference Procedure for Natural Resources Planning. Forest Science, 41(2), 337–359. +- Ohmann JL, Gregory MJ. 2002. Predictive Mapping of Forest Composition and Structure with Direct Gradient Analysis and Nearest Neighbor Imputation in Coastal Oregon, USA. Canadian Journal of Forest Research, 32, 725–741. 
-```bash -$ hatch clean -$ hatch build -$ hatch publish -``` +[^imputation]: In a mapping context, kNN imputation refers to predicting feature values for a target from its k-nearest neighbors, and should not be confused with the usual `scikit-learn` usage as a pre-filling strategy for missing input data, e.g. [`KNNImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html). +[^rfnn]: In [development](https://github.com/lemma-osu/scikit-learn-knn-regression/issues/24)! +[^validation]: All estimators and parameters with equivalent functionality in `yaImpute` are tested to 3 decimal places against the R package. \ No newline at end of file diff --git a/docs/abbreviations.md b/docs/abbreviations.md new file mode 100644 index 0000000..dc91b57 --- /dev/null +++ b/docs/abbreviations.md @@ -0,0 +1,6 @@ +*[GNN]: Gradient Nearest Neighbor +*[MSN]: Most Similar Neighbor +*[kNN]: k-nearest neighbor +*[RFNN]: Random Forest Nearest Neighbor +*[CCorA]: Canonical Correlation Analysis +*[CCA]: Canonical Correspondence Analysis \ No newline at end of file diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml new file mode 100644 index 0000000..d4e7fd0 --- /dev/null +++ b/docs/mkdocs.yml @@ -0,0 +1,82 @@ +site_name: sknnr +repo_url: https://github.com/lemma-osu/scikit-learn-knn-regression +repo_name: lemma-osu/scikit-learn-knn-regression +docs_dir: pages/ + +nav: + - Home: index.md + - Installation: installation.md + - Usage: usage.md + - "API Reference": + - Estimators: + - RawKNNRegressor: api/estimators/raw.md + - EuclideanKNNRegressor: api/estimators/euclidean.md + - MahalanobisKNNRegressor: api/estimators/mahalanobis.md + - GNNRegressor: api/estimators/gnn.md + - MSNRegressor: api/estimators/msn.md + - Transformers: + - StandardScalerWithDOF: api/transformers/standardscalerwithdof.md + - MahalanobisTransformer: api/transformers/mahalanobis.md + - CCATransformer: api/transformers/cca.md + - CCorATransformer: api/transformers/ccora.md + - Datasets: + - 
Dataset: api/datasets/dataset.md + - "Moscow Mountain / St. Joes": api/datasets/moscow_stjoes.md + - "SWO Ecoplot": api/datasets/swo_ecoplot.md + - Contributing: contributing.md + +theme: + name: material + features: + - search.suggest + - search.highlight + - navigation.instant + - navigation.path + - content.code.copy + - content.code.annotate + palette: + - media: "(prefers-color-scheme: light)" + scheme: default + toggle: + icon: material/weather-night + name: Dark mode + - media: "(prefers-color-scheme: dark)" + scheme: slate + toggle: + icon: material/weather-sunny + name: Light mode + +plugins: + - search + - mkdocstrings: + handlers: + python: + paths: [../src] + options: + show_source: false + inherited_members: true + undoc_members: true + docstring_style: numpy + show_if_no_docstring: true + show_signature_annotations: true + show_root_heading: true + show_category_heading: true + merge_init_into_class: true + signature_crossrefs: true + +markdown_extensions: + - abbr + - admonition + - tables + - footnotes + - toc: + permalink: true + - pymdownx.snippets: + auto_append: + - docs/abbreviations.md + - pymdownx.highlight: + anchor_linenums: true + line_spans: __span + pygments_lang_class: true + - pymdownx.inlinehilite + - pymdownx.superfences diff --git a/docs/pages/api/datasets/dataset.md b/docs/pages/api/datasets/dataset.md new file mode 100644 index 0000000..952e48a --- /dev/null +++ b/docs/pages/api/datasets/dataset.md @@ -0,0 +1 @@ +::: sknnr.datasets._base.Dataset \ No newline at end of file diff --git a/docs/pages/api/datasets/moscow_stjoes.md b/docs/pages/api/datasets/moscow_stjoes.md new file mode 100644 index 0000000..c07aba5 --- /dev/null +++ b/docs/pages/api/datasets/moscow_stjoes.md @@ -0,0 +1 @@ +::: sknnr.datasets.load_moscow_stjoes \ No newline at end of file diff --git a/docs/pages/api/datasets/swo_ecoplot.md b/docs/pages/api/datasets/swo_ecoplot.md new file mode 100644 index 0000000..1c22436 --- /dev/null +++ 
b/docs/pages/api/datasets/swo_ecoplot.md @@ -0,0 +1 @@ +::: sknnr.datasets.load_swo_ecoplot \ No newline at end of file diff --git a/docs/pages/api/estimators/euclidean.md b/docs/pages/api/estimators/euclidean.md new file mode 100644 index 0000000..e11bf16 --- /dev/null +++ b/docs/pages/api/estimators/euclidean.md @@ -0,0 +1 @@ +::: sknnr.EuclideanKNNRegressor \ No newline at end of file diff --git a/docs/pages/api/estimators/gnn.md b/docs/pages/api/estimators/gnn.md new file mode 100644 index 0000000..408bbb7 --- /dev/null +++ b/docs/pages/api/estimators/gnn.md @@ -0,0 +1 @@ +::: sknnr.GNNRegressor \ No newline at end of file diff --git a/docs/pages/api/estimators/mahalanobis.md b/docs/pages/api/estimators/mahalanobis.md new file mode 100644 index 0000000..3fd17f8 --- /dev/null +++ b/docs/pages/api/estimators/mahalanobis.md @@ -0,0 +1 @@ +::: sknnr.MahalanobisKNNRegressor \ No newline at end of file diff --git a/docs/pages/api/estimators/msn.md b/docs/pages/api/estimators/msn.md new file mode 100644 index 0000000..fb12841 --- /dev/null +++ b/docs/pages/api/estimators/msn.md @@ -0,0 +1 @@ +::: sknnr.MSNRegressor \ No newline at end of file diff --git a/docs/pages/api/estimators/raw.md b/docs/pages/api/estimators/raw.md new file mode 100644 index 0000000..fcb24a0 --- /dev/null +++ b/docs/pages/api/estimators/raw.md @@ -0,0 +1 @@ +::: sknnr.RawKNNRegressor \ No newline at end of file diff --git a/docs/pages/api/transformers/cca.md b/docs/pages/api/transformers/cca.md new file mode 100644 index 0000000..e644f32 --- /dev/null +++ b/docs/pages/api/transformers/cca.md @@ -0,0 +1 @@ +::: sknnr.transformers.CCATransformer \ No newline at end of file diff --git a/docs/pages/api/transformers/ccora.md b/docs/pages/api/transformers/ccora.md new file mode 100644 index 0000000..732868c --- /dev/null +++ b/docs/pages/api/transformers/ccora.md @@ -0,0 +1 @@ +::: sknnr.transformers.CCorATransformer \ No newline at end of file diff --git a/docs/pages/api/transformers/mahalanobis.md 
b/docs/pages/api/transformers/mahalanobis.md new file mode 100644 index 0000000..e0a40e3 --- /dev/null +++ b/docs/pages/api/transformers/mahalanobis.md @@ -0,0 +1 @@ +::: sknnr.transformers.MahalanobisTransformer \ No newline at end of file diff --git a/docs/pages/api/transformers/standardscalerwithdof.md b/docs/pages/api/transformers/standardscalerwithdof.md new file mode 100644 index 0000000..9a391a0 --- /dev/null +++ b/docs/pages/api/transformers/standardscalerwithdof.md @@ -0,0 +1 @@ +::: sknnr.transformers.StandardScalerWithDOF \ No newline at end of file diff --git a/docs/pages/contributing.md b/docs/pages/contributing.md new file mode 100644 index 0000000..d6ad3b6 --- /dev/null +++ b/docs/pages/contributing.md @@ -0,0 +1,91 @@ +# Contributing + +## Developer Guide + +### Setup + +This project uses [hatch](https://hatch.pypa.io/latest/) to manage the development environment and build and publish releases. Make sure `hatch` is [installed](https://hatch.pypa.io/latest/install/) first: + +```bash +$ pip install hatch +``` + +Now you can [enter the development environment](https://hatch.pypa.io/latest/environment/#entering-environments) using: + +```bash +$ hatch shell +``` + +This will install development dependencies in an isolated environment and drop you into a shell (use `exit` to leave). + +### Pre-commit + +Use [pre-commit](https://pre-commit.com/) to run linting, type-checking, and formatting: + +```bash +$ pre-commit run --all-files +``` + +...or install it to run automatically before every commit with: + +```bash +$ pre-commit install +``` + +You can run pre-commit hooks separately and pass additional arguments to them. 
For example, to run `black` on a single file: + +```bash +$ pre-commit run black --files=src/sknnr/_base.py +``` + +### Testing + +Unit tests are *not* run by `pre-commit`, but can be run manually using `hatch` [scripts](https://hatch.pypa.io/latest/config/environment/overview/#scripts): + +```bash +$ hatch run test:all +``` + +Measure test coverage with: + +```bash +$ hatch run test:coverage +``` + +Any additional arguments are passed to `pytest`. For example, to run a subset of tests matching a keyword: + +```bash +$ hatch run test:all -k gnn +``` + +### Documentation + +Documentation is built with [mkdocs](https://www.mkdocs.org/). During development, you can run a live-reloading server with: + +```bash +$ hatch run docs:serve +``` + +The API reference is generated from Numpy-style docstrings using [mkdocstrings](https://mkdocstrings.github.io/). New classes can be added to the API reference by creating a new markdown file in the `docs/pages/api` directory, adding that file to the [`nav` tree](https://www.mkdocs.org/user-guide/configuration/#nav) in `docs/mkdocs.yml`, and [including the docstring](https://mkdocstrings.github.io/python/usage/#injecting-documentation) in the markdown file: + +```markdown +::: sknnr.module.class +``` + +Whenever the docs are updated, they will be automatically rebuilt and deployed by [ReadTheDocs](https://about.readthedocs.com). Build status can be monitored [here](https://readthedocs.org/projects/sknnr/builds/). + +### Releasing + +First, use `hatch` to [update the version number](https://hatch.pypa.io/latest/version/#updating). 
+ +```bash +$ hatch version [major|minor|patch] +``` + +Then, [build](https://hatch.pypa.io/latest/build/#building) and [publish](https://hatch.pypa.io/latest/publish/#publishing) the release to PyPI with: + +```bash +$ hatch clean +$ hatch build +$ hatch publish +``` diff --git a/docs/pages/index.md b/docs/pages/index.md new file mode 100644 index 0000000..4ab0748 --- /dev/null +++ b/docs/pages/index.md @@ -0,0 +1 @@ +--8<-- "README.md" \ No newline at end of file diff --git a/docs/pages/installation.md b/docs/pages/installation.md new file mode 100644 index 0000000..5a71ab8 --- /dev/null +++ b/docs/pages/installation.md @@ -0,0 +1,17 @@ +# Installation + +!!! info + `sknnr` will be available through PyPI and conda-forge once it is ready for release. Until then, you can install it from source. + +## From Source + +```bash +pip install git+https://github.com/lemma-osu/scikit-learn-knn-regression@main +``` + +## Dependencies + +- Python >= 3.8 +- scikit-learn >= 1.2 +- numpy +- scipy \ No newline at end of file diff --git a/docs/pages/usage.md b/docs/pages/usage.md new file mode 100644 index 0000000..45b8a01 --- /dev/null +++ b/docs/pages/usage.md @@ -0,0 +1,153 @@ +## Estimators + +`sknnr` provides five estimators that are fully compatible, drop-in replacements for `scikit-learn` estimators: + +- [RawKNNRegressor](api/estimators/raw.md) +- [EuclideanKNNRegressor](api/estimators/euclidean.md) +- [MahalanobisKNNRegressor](api/estimators/mahalanobis.md) +- [GNNRegressor](api/estimators/gnn.md) +- [MSNRegressor](api/estimators/msn.md) + +These estimators can be used like any other `sklearn` regressor (or [classifier](#regression-and-classification))[^sklearn-docs]. + +[^sklearn-docs]: Check out the [sklearn docs](https://scikit-learn.org/stable/getting_started.html#fitting-and-predicting-estimator-basics) for a refresher on estimator basics. 
+ +```python +from sknnr import EuclideanKNNRegressor +from sknnr.datasets import load_swo_ecoplot +from sklearn.model_selection import train_test_split + +X, y = load_swo_ecoplot(return_X_y=True) +X_train, X_test, y_train, y_test = train_test_split(X, y) +est = EuclideanKNNRegressor(n_neighbors=3).fit(X_train, y_train) + +print(est.score(X_test, y_test)) +# 0.11496218649569434 +``` + +In addition to their core functionality of fitting, predicting, and scoring, `sknnr` estimators offer a number of other features, detailed below. + +### Regression and Classification + +The estimators in `sknnr` are all initialized with an optional parameter `n_neighbors` that determines how many plots a target plot's attributes will be predicted from. When `n_neighbors` > 1, a plot's attributes are calculated as optionally-weighted averages of each of its _k_ nearest neighbors. Predicted values can fall anywhere between the observed plot values, making this "regression mode" suitable for continuous attributes (e.g. basal area). To maintain categorical attributes (e.g. dominant species type), the estimators can be run in "classification mode" with `n_neighbors` = 1, where each attribute is imputed directly from its nearest neighbor. To predict a combination of continuous and categorical attributes, it's possible to use two estimators and concatenate their predictions manually. + +### Independent Scores and Predictions + +When an independent test set is not available, the accuracy of a kNN regressor can be estimated by comparing each sample in the training set to its second-nearest neighbor, i.e. the closest point *excluding itself*. All `sknnr` estimators set `independent_prediction_` and `independent_score_` attributes when they are fit, which store the predictions and scores of this independent evaluation. 
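For intuition, the mechanics of this second-nearest-neighbor evaluation can be sketched with plain `scikit-learn` on synthetic data. This is an illustration of the concept only, not `sknnr`'s internal implementation; `sknnr` computes these attributes for you during `fit`, and the array shapes and toy data below are assumptions for the sketch.

```python
# Conceptual sketch only: sknnr computes independent predictions
# internally during fit. Synthetic data stands in for real plots.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))  # toy environmental features
y = rng.normal(size=(20, 2))  # toy multi-output attributes

# Query two neighbors per training sample: the nearest is the sample
# itself (distance 0), so the second is the closest *other* sample.
nn = NearestNeighbors(n_neighbors=2).fit(X)
_, indices = nn.kneighbors(X)
second_nearest = indices[:, 1]

# With k=1, the independent prediction for each sample is simply the
# attribute vector of its second-nearest neighbor.
independent_pred = y[second_nearest]
```

After fitting, `sknnr` exposes the analogous result directly on the estimator: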
+ +```python +print(est.independent_score_) +# 0.10243925752772305 +``` + +### Retrieving Dataframe Indexes + +In `sklearn`, the `KNeighborsRegressor.kneighbors` method can identify the array index of the nearest neighbor to a given sample. Estimators in `sknnr` offer an additional parameter `return_dataframe_index` that allows neighbor samples to be identified directly by their dataframe index. + +```python +X, y = load_swo_ecoplot(return_X_y=True, as_frame=True) +est = est.fit(X, y) + +# Find the distance and dataframe index of the nearest neighbors to the first plot +distances, neighbor_ids = est.kneighbors(X.iloc[:1], return_dataframe_index=True) + +# Preview the nearest neighbors by their dataframe index +print(y.loc[neighbor_ids[0]]) +``` + +| | ABAM_COV | ABGRC_COV | ABPRSH_COV | ACMA3_COV | ALRH2_COV | |------:|-----------:|------------:|-------------:|------------:|------------:| | 52481 | 0 | 0 | 39.3469 | 0 | 0 | | 60089 | 0 | 0 | 22.1199 | 0 | 0 | | 56253 | 0 | 0 | 22.8948 | 0 | 0 | +!!! warning + An estimator must be fit with a `DataFrame` in order to use `return_dataframe_index=True`. + +!!! tip + In forestry applications, users typically store a unique inventory plot identification number as the index in the dataframe. + +### Y-Fit Data + +The [GNNRegressor](api/estimators/gnn.md) and [MSNRegressor](api/estimators/msn.md) estimators can be fit with `X` and `y` data, but they also accept an optional `y_fit` parameter. If provided, `y_fit` is used to fit the ordination transformer while `y` is used to fit the kNN regressor. + +In forest attribute estimation, the underlying ordination transformations for these two estimators (CCA for GNN and CCorA for MSN) typically use a matrix of species abundances or presence/absence information to relate the species data to environmental covariates, but often the user wants predictions based not on these features, but rather attributes that describe forest structure (e.g. biomass) or composition (e.g. species richness). 
In this case, the species matrix would be specified as `y_fit` and the stand attributes would be specified as `y`. + +```python +from sknnr import GNNRegressor + +est = GNNRegressor().fit(X, y, y_fit=y_fit) +``` + +### Dimensionality Reduction + +The ordination transformers used by the [GNNRegressor](api/estimators/gnn.md) and [MSNRegressor](api/estimators/msn.md) estimators apply dimensionality reduction by creating components that are linear combinations of the features in the `X` data. For both transformers, components that explain more variation present in the `y` (or `y_fit`) matrix are ordered first. Users can further reduce the number of components that are used to determine nearest neighbors by specifying `n_components` when instantiating the estimator. + +```python +est = GNNRegressor(n_components=3).fit(X, y) +``` + +!!! warning + The maximum number of components depends on the input data and the estimator. Specifying `n_components` greater than the maximum number of components will raise an error. + +### Custom Transformers + +Most estimators in `sknnr` work by applying specialized transformers like [CCA](api/transformers/cca.md) and [CCorA](api/transformers/ccora.md) to the input data. These transformers can be used independently of the estimators, like any other `sklearn` transformer. + +```python +from sknnr.transformers import CCATransformer + +cca = CCATransformer(n_components=3) +print(cca.fit_transform(X, y)) +``` + +`sknnr` currently provides the following transformers: + +- [StandardScalerWithDOF](api/transformers/standardscalerwithdof.md) +- [MahalanobisTransformer](api/transformers/mahalanobis.md) +- [CCATransformer](api/transformers/cca.md) +- [CCorATransformer](api/transformers/ccora.md) + +## Datasets + +`sknnr` estimators can be used for any multi-output regression problem, but they excel at predicting forest attributes. 
The `sknnr.datasets` module contains a number of test datasets with plot-based forest measurements and environmental attributes. + +```python +from sknnr.datasets import load_swo_ecoplot, load_moscow_stjoes +``` + +### Dataset Format + +Like in `sklearn`, datasets in `sknnr` can be loaded in a variety of formats, including as a `dict`-like [`Dataset` object](api/datasets/dataset.md): + +```python +dataset = load_swo_ecoplot() +print(dataset) +# Dataset(n=3005, features=18, targets=25) +``` + +...as an X, y `tuple` of Numpy arrays: + +```python +X, y = load_swo_ecoplot(return_X_y=True) +print(X.shape, y.shape) +# (3005, 18) (3005, 25) +``` + +...or as `tuple` of Pandas dataframes: + +```python +X_df, y_df = load_swo_ecoplot(return_X_y=True, as_frame=True) +print(X_df.head()) +``` + +| | ANNPRE | ANNTMP | AUGMAXT | CONTPRE | CVPRE | DECMINT | DIFTMP | SMRTMP | SMRTP | ASPTR | DEM | PRR | SLPPCT | TPI450 | TC1 | TC2 | TC3 | NBR | +|------:|---------:|---------:|----------:|----------:|--------:|----------:|---------:|---------:|--------:|--------:|--------:|--------:|---------:|---------:|--------:|---------:|---------:|--------:| +| 52481 | 740 | 514.667 | 2315 | 517.667 | 8971.67 | -583.111 | 2899.11 | 1136.11 | 212.222 | 197.667 | 1870.11 | 13196.7 | 48.3333 | 33.7778 | 218.778 | 68.5556 | -86.2222 | 343.556 | +| 52482 | 742 | 563.556 | 2354.33 | 502 | 9124.33 | -543.556 | 2898.89 | 1179.44 | 221.111 | 190.222 | 1713.11 | 16355.8 | 5.4444 | 6.4444 | 210.222 | 60.3333 | -96.6667 | 261.667 | +| 52484 | 738.556 | 639.111 | 2468.89 | 545.889 | 8897.22 | -479.111 | 2949 | 1266.22 | 236 | 194.556 | 1612.11 | 15132.6 | 15.5556 | -1.2222 | 157 | 110.222 | -17.4444 | 721 | +| 52485 | 730.333 | 622.667 | 2405.33 | 555 | 8829.78 | -481.222 | 2887.56 | 1244.22 | 234 | 196.444 | 1682.33 | 15146.7 | 19.8889 | -16.8889 | 152.556 | 86.1111 | -31.6667 | 597.111 | +| 52494 | 720 | 778.556 | 2678.11 | 658.556 | 8638 | -386.667 | 3065.78 | 1396 | 262 | 191.778 | 1345.67 | 16672.1 | 2 
| 0.4444 | 214.667 | 58.5556 | -88.1111 | 294.222 | + +!!! note + `pandas` must be installed to use `as_frame=True`. diff --git a/docs/requirements.txt b/docs/requirements.txt new file mode 100644 index 0000000..199bd4f --- /dev/null +++ b/docs/requirements.txt @@ -0,0 +1,3 @@ +mkdocs +mkdocs-material +mkdocstrings[python] \ No newline at end of file diff --git a/pyproject.toml b/pyproject.toml index 4d884cc..746504d 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -50,6 +50,17 @@ dependencies = [ all = "pytest {args}" coverage = "pytest --cov=src/sknnr {args}" +[tool.hatch.envs.docs] +dependencies = [ + "mkdocs", + "mkdocs-material", + "mkdocstrings[python]" +] + +[tool.hatch.envs.docs.scripts] +serve = "mkdocs serve --config-file docs/mkdocs.yml --watch ./README.md" +build = "mkdocs build --config-file docs/mkdocs.yml" + [tool.pytest.ini_options] pythonpath = "src/" markers = [