Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

areaacres & obterm in get_soilseries_from_NASIS(), and use of areasymbol generally #272

Open
smroecker opened this issue Oct 11, 2022 · 6 comments
Assignees
Labels
enhancement NASIS-local This tag is used for pull requests, issues, discussions etc. for soilDB local NASIS functions

Comments

@smroecker
Copy link
Member

I noticed a couple of issues with the get_soilseries_from_NASIS() function today.

  1. areaacres and obterm were added to the get_soilseries_from_NASIS() function last year (url)[https://github.com/ncss-tech/soilDB/blob/bd6c6405b223b1d4438fcb7c9ea3e90400b42b68/R/get_soilseries_from_NASIS.R#L25]. Both those columns seem misleading in this function. They refer to the state area acres and not the soil series acres, which I suspect will confuse users who aren't familiar with the NASIS data model. Therefore I suggest they be removed, unless their is a compelling reason to keep them here.
  2. This issue appears to be part of a broader issue when it comes to the use of "areasymbol" in reports/queries/functions. areasymbol by itself is incomplete without the accompanying areatypename. It is possible that a report could include multiple areasymbols (e.g. site area overlaps)(something I hadn't considered fully until recently). Therefore in this instance and moving forward, I suggest the resulting column name for area overlaps be named according the original areaiidref name followed by the associated area table column name (e.g. typelocstareaiidref.areasymbol or typelocstareaiidref.areaname). Hopefully this is explicit enough to be obvious to users. I'm open to other suggestions.

Cheers

@brownag
Copy link
Member

brownag commented Oct 11, 2022

I think item 1 is not that misleading as the areasymbol, areatypename etc. are all together and are corresponding to the "State or Territory" in the query result. I can definitely see needing to disambiguate if multiple areatypename were used in the same query.

Granted that information is not particularly useful for soil series overlap... the fact that those new columns were included was probably an oversight from #188 which added several columns that were not present in the result for consistency.

I don't think there was a specific need for the areaacres or obterm in this context. I suppose we can safely remove these columns as there are plenty of places to get the information on state area and the obterm is not relevant in this context. EDIT: removed in b5f5692

I suggest the resulting column name for area overlaps be named according the original areaiidref name followed by the associated area table column name (e.g. typelocstareaiidref.areasymbol or typelocstareaiidref.areaname)

This seems like a fine suggestion in terms of identifying related tables and the context associated with an area symbol or name. I don't think the "iidref" suffix is necessary considering this is a made up table/column physical name combination.

More generally: changing column names would be a fairly large breaking change for basically all existing uses of the areasymbol column if I understand correctly. As suggested would result in some pretty convoluted names (that arguably are more confusing than just understanding how the area table works, or just documenting in a function where the data come from), so implementing this would require some careful testing and assessment of impacts.

I don't really think this is a "big problem" in terms of people understanding NASIS (I have never had anyone contact me about it), but I would gladly review pull request(s) from you to implement these changes when you have time. Do you have an idea of how many/which functions you would expect to be impacted by this change?

@brownag brownag added enhancement NASIS-local This tag is used for pull requests, issues, discussions etc. for soilDB local NASIS functions labels Oct 11, 2022
@smroecker
Copy link
Member Author

I think the original column name/iidref linkage (e.g. typelocstarea.areasymbol) better convenes the column purpose where it's used in various tables (like the project table). I agreed the iidref part isn't necessary. Also, in many instances the areaiidref column name isn't unique, even when it is elsewhere when referring to the same areatype. If we adopt this convention it'll make including the areatypename unnecessary. To be clear, at this time, I'm not advocating a wholesale change of all the functions that use areasymbol, but would recommend it moving forward. The one logical exception I can see is where the areasymbol is linked to the legend and mapunit (which would minimzie the majority of disruptions). Since we seem to be in agreement, perhaps we could start here.

@brownag
Copy link
Member

brownag commented Oct 12, 2022

Agreed if we carve out an exception for legend/mapunit that probably would cover most use cases that would "break".

I am open to a PR to implement this where you want, and we can start with get_soilseries_from_NASIS(), I think anything that is less ambiguous is a "good" change even if we have to fix things. A PR is the right place to test these changes while we work out the kinks.

I can't help but think there are probably other instances beyond areaiidrefs where this idea could be implemented to make it less ambiguous where data come from when a query behind the scenes does something that involves joins or related tables with potentially ambiguous column names.

I think you are advocating for a wholesale change--though I recognize you would like to focus on this function for now. In the big picture I would like to see a roadmap of sorts for where we plan to implement this new recommendation. A checklist of places to implement and sketch of the rules could be a good topic for either a new issue and/or for a section in the "soilDB developers guide" (#269; which I would appreciate your help in drafting)

I think there might be some other things to consider when you have child tables referencing area that have many:1 relationship with the parent... examples relevant to get_soilseries_from_NASIS() would be if we wanted to include information on Soil Series MLRAs Using or Soil Series States Using in addition to the "type location state". The former would need to be concatenated, I believe

@smroecker
Copy link
Member Author

Sounds good. I'd be interested to discuss the developer guide more.

@brownag
Copy link
Member

brownag commented Feb 12, 2024

Coming back to this topic, specifically item 2 from OP. I have received a request to add the state, area and county information to the Site table results returned by get_site_data_from_NASIS_db(). This reminded me of this issue, which was only half resolved.

Unfortunately I have not progressed on the "developers guide" in the last 1.5 years or so, but still would like to set aside some time to rough it out it some day. The following could be snippets to include related to dealing with the frequent relationships to the Area table that occur throughout NASIS.


Site table is a bit of a unique case (and related to item 2) because 3 different areaiidrefs are present with custom "relationship name" for each. The relationship name would be a good derived column name replacement for "areasymbol" in most cases (excluding legend table).

I suggest this option rather than making something up like mlraarea.areasymbol. The main issue I have with the latter is mlraarea is not a table, but the name suggests that it is.

So, for the site table we could have: site_state, site_county and site_mlra column names derived from the relationship name for site.stateareaiidref:area.areaiid, site.countyareaiidref:area.areaiid and site.mlraareaiidref:area.areaiid, respectively.

image

Most tables related to Area use the "default" relationship, but others, such as lmapunit, also have a defined relationship to Area. The naming scheme for these relationships is consistent and easy to understand. lmapunit follows the same scheme as in the site data where lmapunit_mlra denotes relationship lmapunit.mlraareaiidref:area.areaiid for the dominant MLRA where that legend mapunit is used.

image

I think we can emulate what is done for these named relationships to the Area table for all cases where we need to bring in these related areasymbol columns.

Bringing it back to Soil Series, which doesn't have a named relationship to Area, naming the "Type Location State" areasymbol column soilseries_typelocst would be the analog to relationships for Site and Legend Mapunit.
image

@brownag brownag reopened this Feb 12, 2024
brownag added a commit that referenced this issue Feb 12, 2024
@brownag
Copy link
Member

brownag commented Feb 12, 2024

Initial implementation of above for:

  • get_soilseries_from_NASIS: 2bd68ae
  • get_mapunit_from_NASIS_db: 1ea2390
  • get_site_data_from_NASIS_db: d28901b

I started to look at get_projectmapunit_from_NASIS() and a few other cases, but have not implemented it there yet. For project it would be "mlra_sso" and "nonmlra_ssa" relationship names:
image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement NASIS-local This tag is used for pull requests, issues, discussions etc. for soilDB local NASIS functions
Projects
None yet
Development

No branches or pull requests

2 participants