Add load functions for example datasets #3

Closed
aazuspan opened this issue Apr 14, 2023 · 7 comments · Fixed by #39

aazuspan (Contributor) commented Apr 14, 2023

Mimic the sklearn.datasets module with functions to load the example datasets from yaImpute (pending approval for us to share the datasets). Unlike sklearn, probably package datasets in a purpose-made class rather than a Bunch object, but try to keep a consistent API.

It may make sense for us to define a Dataset class for public examples with a TestingDataset subclass that contains additional neighbors and distances for internal testing, but we can figure that out later.
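A minimal sketch of that split, with all names hypothetical and nothing settled:

from dataclasses import dataclass

import numpy as np


@dataclass
class Dataset:
    """Public example dataset, loosely following sklearn's Bunch layout."""

    data: np.ndarray          # (n_samples, n_features) environmental variables
    target: np.ndarray        # (n_samples, n_targets) attributes to impute
    feature_names: list[str]  # column names for data
    target_names: list[str]   # column names for target


@dataclass
class TestingDataset(Dataset):
    """Internal-only variant that also carries expected NN results."""

    neighbors: np.ndarray   # (n_samples, k) expected neighbor indices
    distances: np.ndarray   # (n_samples, k) expected neighbor distances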

grovduck (Member) commented Apr 14, 2023

Andy Hudak has given us permission to use the Moscow Mountain / St. Joe Woodlands data as test/example data. Here are the data citations:

Hudak, A.T. (2010) Field plot measures and predictive maps for "Nearest neighbor imputation of species-level, plot-scale forest structure attributes from LiDAR data". Fort Collins, CO: U.S. Department of Agriculture, Forest Service, Rocky Mountain Research Station. https://www.fs.usda.gov/rds/archive/Catalog/RDS-2010-0012

Hudak, A.T. (2010) Field plot measures and predictive maps for "Regression modeling and mapping of coniferous forest basal area and tree density from discrete-return lidar and multispectral satellite data". Fort Collins, CO: U.S. Department of Agriculture, Forest Service, Rocky Mountain Research Station. https://www.fs.usda.gov/rds/archive/Catalog/RDS-2010-0013 (no longer needed; does not contain species-level basal area and density attributes)

aazuspan added the documentation label on Apr 15, 2023
aazuspan (Contributor, Author) commented May 3, 2023

Hey @grovduck, I'm pushing ahead with this and had a few questions.

First, am I understanding right that Moscow Mountain / St. Joseph's is a single dataset that we're currently referring to as just moscow in the tests? But both citations above apply? In the datasets module, any preferences on how we refer to it, e.g. load_moscow or load_moscow_stjoes or another alternative?

Second, and more complicated, how do we want to refer to attributes in the dataset? It seems to me that we're sort of going to be trading off between compatibility with the sklearn standards and descriptiveness, so I outlined a few possible approaches below.

Option 1

Stick with the sklearn standard with only a few necessary modifications. load_moscow would return a Dataset with the following attributes:

  • data (165, 28): array of environmental variables
  • target (165, 35): array of composition/structure variables
  • ids (165,): array of plot IDs
  • feature_names (28,): array of column names for data
  • target_names (35,): array of column names for target

The inconsistencies are 1) adding an ids field and 2) changing target_names from a key that corresponds with class IDs to a list of columns. This is because the sklearn datasets use target to store a 1D array of classes rather than a 2D array of predicted features. Maybe it's more confusing than helpful to use the same attribute names with slightly different meanings?
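To make Option 1 concrete, a hypothetical session (load_moscow is not implemented yet; shapes match the list above):

>>> dataset = load_moscow()
>>> dataset.data.shape  # environmental variables
(165, 28)
>>> dataset.target.shape  # composition/structure variables
(165, 35)
>>> dataset.ids.shape  # plot IDs
(165,)
>>> len(dataset.feature_names), len(dataset.target_names)
(28, 35)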

Option 2

Sacrifice some compatibility by using more descriptive names for the array attributes. Obviously sklearn uses very general attributes because they have a big variety of datasets, but we could instead go with something like environmental, species, ids, environmental_names, and species_names. Or something similarly specific.

Option 3

Ditch compatibility and just use pandas dataframes instead of arrays. load_moscow would return a Dataset with something like the attributes below:

  • environmental: dataframe of IDs and environmental variables
  • species: dataframe of IDs and composition/structure variables

Of course we could adjust those names or consider splitting ids back into its own attribute. In any case, this would require either adding pandas as a dependency or making it an optional dependency to use the datasets module (i.e. try to import it and tell the user to pip install pandas if it fails).
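That guarded-import pattern would look something like this sketch (the helper name is made up):

def _import_pandas():
    """Import pandas lazily so it remains an optional dependency."""
    try:
        import pandas as pd
    except ImportError as exc:
        raise ImportError(
            "pandas is required to load the example datasets as dataframes. "
            "Install it with `pip install pandas`."
        ) from exc
    return pd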

Let me know what you think, or if there are other options I'm not considering! I have a PR that's about ready to go using option 1 as a placeholder, so if it would be easier to discuss this with the code in place I can submit that as a draft.

grovduck (Member) commented May 3, 2023

> First, am I understanding right that Moscow Mountain / St. Joseph's is a single dataset that we're currently referring to as just moscow in the tests? But both citations above apply? In the datasets module, any preferences on how we refer to it, e.g. load_moscow or load_moscow_stjoes or another alternative?

Correct, it is a single dataset containing species information and environmental/lidar covariates at two locations in Idaho (Moscow Mountain and St. Joe Woodlands - I had mistakenly called it St. Joseph's, but have verified that it should be St. Joes). I just downloaded both datasets and only the first one ("Nearest neighbor imputation of species-level ...") contains the species level basal areas/densities so I think we only need to include the first citation. I've updated my comment above.

Yes, I've used moscow as a short-hand for these data, but I like your suggestion of load_moscow_stjoes. Of course, that probably means we might also want to rename the test data files?

> Second, and more complicated, how do we want to refer to attributes in the dataset? It seems to me that we're sort of going to be trading off between compatibility with the sklearn standards and descriptiveness, so I outlined a few possible approaches below.

My first inclination is to go with Option 1 to be as consistent with sklearn as possible (hopefully, I can infer from your work already that you prefer this option as well 😉). It's possible that the terminology in Option 2 might be more familiar to ecologists, but I think our target audience is going to be folks who are already familiar with sklearn, and this will be an easy transition for the Venn diagram of ecologists and sklearn users. Pandas dataframes are great to work with, and it looks like at least one package is looking to make this connection, but I'd feel more comfortable sticking to the standard of passing arrays.

> The inconsistencies are 1) adding an ids field and 2) changing target_names from a key that corresponds with class IDs to a list of columns. This is because the sklearn datasets use target to store a 1D array of classes rather than a 2D array of predicted features. Maybe it's more confusing than helpful to use the same attribute names with slightly different meanings?

I think you're actually OK on this. Take a look at load_linnerud. I think this is pretty much exactly what you've proposed for target and target_names, correct? (This example dataset might inspire our multi-output approach.)

>>> from sklearn.datasets import load_linnerud
>>> bunch = load_linnerud()
>>> print(bunch.target)
[[191.  36.  50.]
 [189.  37.  52.]
 [193.  38.  58.]
 [162.  35.  62.]
 [189.  35.  46.]
 ...
 [176.  37.  54.]
 [157.  32.  52.]
 [156.  33.  54.]
 [138.  33.  68.]]
>>> print(bunch.target_names)
['Weight', 'Waist', 'Pulse']

I think we'll have to make the exception to include ids as it's so essential for NN problems (and especially when we get to the mapping step). That doesn't bother me to include it as an attribute here.

aazuspan (Contributor, Author) commented May 3, 2023

Thanks for the clarification on the dataset! I suppose the test files could be renamed, but I'm not too worried about it since it's just an internal detail.

> My first inclination is to go with Option 1 to be as consistent with sklearn as possible

As you guessed, this was my leaning too, and I agree with all the points you brought up. Targeting this towards an sklearn audience is probably a good thing to always keep in mind, design-wise.

> Take a look at load_linnerud. I think this is pretty much exactly what you've proposed

Great point! I was only comparing against load_iris, but you're right that this seems to fit their model for a multi-output dataset. That makes me feel better. And as you suggested, that should be a valuable reference elsewhere, so it's a good find.

One more question. Any thoughts on how we should handle the return_X_y option? Do we add another element to the tuple for IDs? If so, do we stick with return_X_y or do we need to think about renaming it to return_X_y_ids? I keep going back and forth on which is the best option since they'll all break compatibility in some way...
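For reference, the two signatures I keep weighing (both hypothetical):

# (a) Keep the sklearn flag name but widen the returned tuple to include IDs:
X, y, ids = load_moscow_stjoes(return_X_y=True)

# (b) Rename the flag so the extra element is explicit:
X, y, ids = load_moscow_stjoes(return_X_y_ids=True)

Option (a) changes the arity sklearn users expect from return_X_y; option (b) departs from the sklearn name.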

grovduck (Member) commented May 4, 2023

@aazuspan, I'm trying to process some thoughts and get your feedback about y_predict attributes, which I believe have some bearing on the choices here. Sorry to block progress on this issue.

aazuspan (Contributor, Author) commented May 4, 2023

Yeah, I realized I might be jumping the gun trying to tackle this before we got prediction up and running! I'm good to pause this until we have a plan there :)

aazuspan (Contributor, Author) commented
Resolved by #39!
