Add load functions for example datasets #3

Closed
aazuspan opened this issue Apr 14, 2023 · 7 comments · Fixed by #39

aazuspan (Contributor) commented Apr 14, 2023

Mimic the sklearn.datasets module with functions to load the example datasets from yaImpute (pending approval for us to share the datasets). Unlike sklearn, probably package datasets in a purpose-made class rather than a Bunch object, but try to keep a consistent API.

It may make sense for us to define a Dataset class for public examples with a TestingDataset subclass that contains additional neighbors and distances for internal testing, but we can figure that out later.
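A minimal sketch of that split, with all names hypothetical and nothing settled:

from dataclasses import dataclass

import numpy as np


@dataclass
class Dataset:
    """Public example dataset, loosely following sklearn's Bunch layout."""

    data: np.ndarray          # (n_samples, n_features) environmental variables
    target: np.ndarray        # (n_samples, n_targets) attributes to impute
    feature_names: list[str]  # column names for data
    target_names: list[str]   # column names for target


@dataclass
class TestingDataset(Dataset):
    """Internal-only variant that also carries expected NN results."""

    neighbors: np.ndarray   # (n_samples, k) expected neighbor indices
    distances: np.ndarray   # (n_samples, k) expected neighbor distances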

grovduck (Member) commented Apr 14, 2023

Andy Hudak has given us permission to use the Moscow Mountain / St. Joe Woodlands data as test/example data. Here are the data citations:

Hudak, A.T. (2010) Field plot measures and predictive maps for "Nearest neighbor imputation of species-level, plot-scale forest structure attributes from LiDAR data". Fort Collins, CO: U.S. Department of Agriculture, Forest Service, Rocky Mountain Research Station. https://www.fs.usda.gov/rds/archive/Catalog/RDS-2010-0012

Hudak, A.T. (2010) Field plot measures and predictive maps for "Regression modeling and mapping of coniferous forest basal area and tree density from discrete-return lidar and multispectral satellite data". Fort Collins, CO: U.S. Department of Agriculture, Forest Service, Rocky Mountain Research Station. https://www.fs.usda.gov/rds/archive/Catalog/RDS-2010-0013 (no longer needed; does not contain species-level basal area and density attributes)

aazuspan added the documentation label on Apr 15, 2023
aazuspan (Contributor, Author) commented May 3, 2023

Hey @grovduck, I'm pushing ahead with this and had a few questions.

First, am I understanding right that Moscow Mountain / St. Joseph's is a single dataset that we're currently referring to as just moscow in the tests? But both citations above apply? In the datasets module, any preferences on how we refer to it, e.g. load_moscow or load_moscow_stjoes or another alternative?

Second, and more complicated, how do we want to refer to attributes in the dataset? It seems to me that we're sort of going to be trading off between compatibility with the sklearn standards and descriptiveness, so I outlined a few possible approaches below.

Option 1

Stick with the sklearn standard with only a few necessary modifications. load_moscow would return a Dataset with the following attributes:

  • data (165, 28): array of environmental variables
  • target (165, 35): array of composition/structure variables
  • ids (165,): array of plot IDs
  • feature_names (28,): array of column names for data
  • target_names (35,): array of column names for target

The inconsistencies are 1) adding an ids field and 2) changing target_names from a key that corresponds with class IDs to a list of columns. This is because the sklearn datasets use target to store a 1D array of classes rather than a 2D array of predicted features. Maybe it's more confusing than helpful to use the same attribute names with slightly different meanings?
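To make Option 1 concrete, a hypothetical session (load_moscow is not implemented yet; shapes match the list above):

>>> dataset = load_moscow()
>>> dataset.data.shape  # environmental variables
(165, 28)
>>> dataset.target.shape  # composition/structure variables
(165, 35)
>>> dataset.ids.shape  # plot IDs
(165,)
>>> len(dataset.feature_names), len(dataset.target_names)
(28, 35)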

Option 2

Sacrifice some compatibility by using more descriptive names for the array attributes. Obviously sklearn uses very general attributes because they have a big variety of datasets, but we could instead go with something like environmental, species, ids, environmental_names, and species_names. Or something similarly specific.

Option 3

Ditch compatibility and just use pandas dataframes instead of arrays. load_moscow would return a Dataset with something like the attributes below:

  • environmental: dataframe of IDs and environmental variables
  • species: dataframe of IDs and composition/structure variables

Of course we could adjust those names or consider splitting ids back into its own attribute. In any case, this would require either adding pandas as a dependency or making it an optional dependency to use the datasets module (i.e. try to import it and tell the user to pip install pandas if it fails).
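That guarded-import pattern would look something like this sketch (the helper name is made up):

def _import_pandas():
    """Import pandas lazily so it remains an optional dependency."""
    try:
        import pandas as pd
    except ImportError as exc:
        raise ImportError(
            "pandas is required to load the example datasets as dataframes. "
            "Install it with `pip install pandas`."
        ) from exc
    return pd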

Let me know what you think, or if there are other options I'm not considering! I have a PR that's about ready to go using option 1 as a placeholder, so if it would be easier to discuss this with the code in place I can submit that as a draft.

grovduck (Member) commented May 3, 2023

> First, am I understanding right that Moscow Mountain / St. Joseph's is a single dataset that we're currently referring to as just moscow in the tests? But both citations above apply? In the datasets module, any preferences on how we refer to it, e.g. load_moscow or load_moscow_stjoes or another alternative?

Correct, it is a single dataset containing species information and environmental/lidar covariates at two locations in Idaho (Moscow Mountain and St. Joe Woodlands - I had mistakenly called it St. Joseph's, but have verified that it should be St. Joes). I just downloaded both datasets and only the first one ("Nearest neighbor imputation of species-level ...") contains the species level basal areas/densities so I think we only need to include the first citation. I've updated my comment above.

Yes, I've used moscow as a short-hand for these data, but I like your suggestion of load_moscow_stjoes. Of course, that probably means we might also want to rename the test data files?

> Second, and more complicated, how do we want to refer to attributes in the dataset? It seems to me that we're sort of going to be trading off between compatibility with the sklearn standards and descriptiveness, so I outlined a few possible approaches below.

My first inclination is to go with Option 1 to be as consistent with sklearn as possible (hopefully, I can infer from your work already that you prefer this option as well 😉). It's possible that the terminology in Option 2 might be more familiar to ecologists, but I think our target audience is going to be folks who are already familiar with sklearn, and this will be an easy transition for the Venn diagram of ecologists and sklearn users. Pandas dataframes are great to work with, and it looks like at least one package is looking to make this connection, but I'd feel more comfortable sticking to the standard of passing arrays.

> The inconsistencies are 1) adding an ids field and 2) changing target_names from a key that corresponds with class IDs to a list of columns. This is because the sklearn datasets use target to store a 1D array of classes rather than a 2D array of predicted features. Maybe it's more confusing than helpful to use the same attribute names with slightly different meanings?

I think you're actually OK on this. Take a look at load_linnerud. I think this is pretty much exactly what you've proposed for target and target_names, correct? (This example dataset might inspire our multi-output approach.)

>>> from sklearn.datasets import load_linnerud
>>> bunch = load_linnerud()
>>> print(bunch.target)
[[191.  36.  50.]
 [189.  37.  52.]
 [193.  38.  58.]
 [162.  35.  62.]
 [189.  35.  46.]
 ...
 [176.  37.  54.]
 [157.  32.  52.]
 [156.  33.  54.]
 [138.  33.  68.]]
>>> print(bunch.target_names)
['Weight', 'Waist', 'Pulse']

I think we'll have to make the exception to include ids as it's so essential for NN problems (and especially when we get to the mapping step). That doesn't bother me to include it as an attribute here.

aazuspan (Contributor, Author) commented May 3, 2023

Thanks for the clarification on the dataset! I suppose the test files could be renamed, but I'm not too worried about it since it's just an internal detail.

> My first inclination is to go with Option 1 to be as consistent with sklearn as possible

As you guessed, this was my leaning too, and I agree with all the points you brought up. Targeting this towards an sklearn audience is probably a good thing to always keep in mind, design-wise.

> Take a look at load_linnerud. I think this is pretty much exactly what you've proposed

Great point! I was only comparing against load_iris, but you're right that this seems to fit their model for a multi-output dataset. That makes me feel better. And as you suggested, that should be a valuable reference elsewhere, so it's a good find.

One more question. Any thoughts on how we should handle the return_X_y option? Do we add another element to the tuple for IDs? If so, do we stick with return_X_y or do we need to think about renaming it to return_X_y_ids? I keep going back and forth on which is the best option since they'll all break compatibility in some way...
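For reference, the two signatures I keep weighing (both hypothetical):

# (a) Keep the sklearn flag name but widen the returned tuple to include IDs:
X, y, ids = load_moscow_stjoes(return_X_y=True)

# (b) Rename the flag so the extra element is explicit:
X, y, ids = load_moscow_stjoes(return_X_y_ids=True)

Option (a) changes the arity sklearn users expect from return_X_y; option (b) departs from the sklearn name.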

grovduck (Member) commented May 4, 2023

@aazuspan, I'm trying to process some thoughts and get your feedback about y_predict attributes, which I believe have some bearing on the choices here. Sorry to block progress on this issue.

aazuspan (Contributor, Author) commented May 4, 2023

Yeah, I realized I might be jumping the gun trying to tackle this before we got prediction up and running! I'm good to pause this until we have a plan there :)

aazuspan (Contributor, Author) commented
Resolved by #39!
