
Read CSVs using info from MODEL_SPEC #1345

Merged
merged 17 commits into natcap:main from feature/1328 on Aug 15, 2023

Conversation

@emlys (Member) commented on Jul 6, 2023

Description

Fixes #1328

  • Added index_col attribute to the spec for most CSVs
  • Updated utils.read_csv_to_dataframe to parse tables according to the info in the table spec

I simplified read_csv_to_dataframe a bit by standardizing some things that weren't consistent across models (a rough sketch follows the list):

  • All column names will be lowercased and whitespace removed
  • All freestyle_string and option_string values will be lowercased and whitespace removed
  • All path values will be expanded relative to the table location
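
A rough sketch of how those rules might be applied; the spec layout, type names, and csv_path here are placeholders, and the real logic lives in utils.read_csv_to_dataframe:

import os
import pandas

csv_path = 'data/biophysical.csv'  # placeholder path
spec = {'columns': {'lulc_desc': {'type': 'freestyle_string'},
                    'raster_path': {'type': 'raster'}}}
df = pandas.DataFrame({' LULC_Desc ': [' Forest '],
                       'raster_path': ['lulc.tif']})

df.columns = df.columns.str.strip().str.lower()            # rule 1
for col, col_spec in spec['columns'].items():
    if col_spec['type'] in ('freestyle_string', 'option_string'):
        df[col] = df[col].str.strip().str.lower()          # rule 2
    elif col_spec['type'] in ('raster', 'vector', 'csv', 'file'):
        df[col] = df[col].apply(lambda p: os.path.join(    # rule 3
            os.path.dirname(os.path.abspath(csv_path)), str(p).strip()))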

Because read_csv_to_dataframe now enforces data types, some type checking and casting could be removed from models.

In many places, it was simpler to refactor things to use the dataframe directly. In general, where a simple dictionary mapping index-to-value (such as LULC code to a biophysical parameter) was needed, I used this pattern: value_map = biophysical_df[parameter].to_dict()
Where the index maps to multiple values (such as LULC code to multiple biophysical parameters), I passed in the dataframe directly, rather than formatting it into a nested dictionary.
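
For illustration, a toy version of both patterns (invented column names and values):

import pandas

biophysical_df = pandas.DataFrame(
    {'load_n': [1.2, 3.4], 'eff_n': [0.5, 0.7]},
    index=pandas.Index([1, 2], name='lucode'))

# One value per index: collapse a single column into a plain dict.
value_map = biophysical_df['load_n'].to_dict()  # {1: 1.2, 2: 3.4}

# Multiple values per index: keep the dataframe and look up rows directly.
row = biophysical_df.loc[1]  # both parameters for lucode 1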

This PR has more deletions than additions, which feels like a sign that we're moving in the right direction!

Checklist

  • Updated HISTORY.rst and link to any relevant issue (if these changes are user-facing)
  • Updated the user's guide (if needed)
  • Tested the affected models' UIs (if relevant)

    patterns.append(f'{groups[0]}(.+){groups[2]}')
else:
    # for regular column names, use the exact name as the pattern
    patterns.append(column.replace('(', '\(').replace(')', '\)'))
@emlys (Member, Author):

Needed this to handle the one case in HRA where a column name contains parentheses: stressor buffer (meters). It would be nice if we could require that column names not contain any special regex characters.
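
An alternative to restricting column names would be to escape every regex metacharacter up front with re.escape; a minimal sketch, not what the PR does:

import re

column = 'stressor buffer (meters)'
# re.escape backslash-escapes all metacharacters, not just parentheses.
patterns.append(re.escape(column))  # 'patterns' as in the snippet above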

@emlys requested a review from @phargogh on July 19, 2023 17:06
@emlys marked this pull request as ready for review on July 19, 2023 17:06
@phargogh (Member) left a comment:

Whew, this is a big change! So exciting to see a lot of the model-specific table parsing and type checking handled by the new function. I had one or two comments/questions about functionality and there's a merge conflict on a file, but that's about it! Thanks @emlys !

-unique_lulc_classnames = set(
-    params['lulc-class'] for params in biophysical_parameters.values())
-if len(unique_lulc_classnames) != len(biophysical_parameters):
+if not biophysical_df['lulc-class'].is_unique:
@phargogh (Member):

Oh that's a nice check
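
A tiny demo of the pandas check, with made-up values:

import pandas

# is_unique is True only when no value in the Series repeats.
print(pandas.Series(['forest', 'grass', 'forest']).is_unique)  # False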

Comment on lines 423 to 427
'protein', 'lipid', 'energy', 'ca', 'fe', 'mg', 'ph', 'k', 'na', 'zn',
'cu', 'fl', 'mn', 'se', 'vita', 'betac', 'alphac', 'vite', 'crypto',
'lycopene', 'lutein', 'betat', 'gammat', 'deltat', 'vitc', 'thiamin',
'riboflavin', 'niacin', 'pantothenic', 'vitb6', 'folate', 'vitb12',
'vitk']
@phargogh (Member):

I realize this may be slightly outside the scope of this PR, but would it be worth removing this list and simply using the MODEL_SPEC definition now that they're all lowercased?

@emlys (Member, Author):

Sure!
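
A hypothetical sketch of that suggestion; the 'nutrient_table' key and NON_NUTRIENT_COLS are assumptions, not the actual MODEL_SPEC layout:

# Derive the nutrient column names from the spec instead of hardcoding them.
NON_NUTRIENT_COLS = {'crop'}  # assumed set of non-nutrient columns
nutrient_cols = [
    name for name in MODEL_SPEC['args']['nutrient_table']['columns']
    if name not in NON_NUTRIENT_COLS]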

Comment on lines +652 to +656
 if pandas.isna(row[carbon_pool_type]):
     raise ValueError(
         "Could not interpret carbon pool value as a number. "
         f"lucode: {lucode}, pool_type: {carbon_pool_type}, "
-        f"value: {biophysical_table[lucode][carbon_pool_type]}")
+        f"value: {row[carbon_pool_type]}")
@phargogh (Member):

Wouldn't we still need a test for whether the value is a valid number, not just NaN?

The exception here says the value could not be interpreted as a number, but looking through the isna docs, the function only detects NaNs. So float('my invalid value') would raise a ValueError, but isna('my invalid value') would not cause this exception to be raised.

@emlys (Member, Author):

If the value was not a valid number, an error would be raised earlier in utils.read_csv_to_dataframe, since it's now validating against the spec for the biophysical table. If we've gotten to this point, row[carbon_pool_type] should be guaranteed to be a float or NaN.
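
A minimal illustration of that guarantee (toy data; the real casting happens inside utils.read_csv_to_dataframe):

import pandas

# Casting to float at read time raises before any model code runs.
pandas.Series(['1.5', 'not a number']).astype(float)  # ValueError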

@phargogh (Member):

Aha! Makes sense, thanks!

src/natcap/invest/habitat_quality.py (resolved)
Comment on lines -581 to +583
-weight_sum = 0.0
-for threat_data in threat_dict.values():
-    # Sum weight of threats
-    weight_sum = weight_sum + threat_data['weight']
+weight_sum = threat_df['weight'].sum()
@phargogh (Member):

Huh, I wonder why we hadn't just sum()ed these before!

@emlys (Member, Author):

Working with everything in dataframes rather than dictionaries makes it a lot easier to do simple aggregations like this!

-    ] for lucode in sorted_lucodes
-], dtype=numpy.float32)
+    1 - biophysical_df[f'rc_{soil_group}'].to_numpy()
+    for soil_group in ['a', 'b', 'c', 'd']], dtype=numpy.float32).T
@phargogh (Member):

The transposition here is an interesting change in behavior! Sorry if I'm missing something here, but what were the circumstances that made the transpose necessary?

@emlys (Member, Author):

Sorry, this is confusing! But the behavior shouldn't be changed. The transpose is needed because I replaced the nested dictionary with a dataframe, and that effectively swaps the rows and columns.

The original code

numpy.array([
    [1 - biophysical_dict[lucode][f'rc_{soil_group}']
     for soil_group in ['a', 'b', 'c', 'd']]
    for lucode in sorted_lucodes])

produces an array of arrays, where the rows represent lucodes and the columns represent soil groups.

The new code

numpy.array([
    1 - biophysical_df[f'rc_{soil_group}'].to_numpy()
    for soil_group in ['a', 'b', 'c', 'd']])

produces an array of arrays where the rows are soil groups and the columns are lucodes. So the .T is needed to get it back to the original format.
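
A toy check of that equivalence (invented values, two soil groups for brevity, with a plain dict standing in for both data sources):

import numpy

biophysical = {1: {'rc_a': 0.1, 'rc_b': 0.2},
               2: {'rc_a': 0.3, 'rc_b': 0.4}}
sorted_lucodes = [1, 2]
soil_groups = ['a', 'b']

by_lucode = numpy.array([
    [1 - biophysical[code][f'rc_{g}'] for g in soil_groups]
    for code in sorted_lucodes])
by_group = numpy.array([
    [1 - biophysical[code][f'rc_{g}'] for code in sorted_lucodes]
    for g in soil_groups])

assert numpy.array_equal(by_lucode, by_group.T)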

Comment on lines +622 to +627
Also sets custom defaults for some kwargs passed to ``pandas.read_csv``,
which you can override with kwargs:

- sep=None: lets the Python engine infer the separator
- engine='python': The 'python' engine supports the sep=None option.
- encoding='utf-8-sig': 'utf-8-sig' handles UTF-8 with or without BOM.
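
In plain pandas terms, those defaults amount to a call like this (the path is illustrative):

import pandas

df = pandas.read_csv(
    'some_table.csv',
    sep=None,              # infer the delimiter...
    engine='python',       # ...which only the python engine supports
    encoding='utf-8-sig')  # handles UTF-8 with or without a BOM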
@phargogh (Member):

This is a great idea to note the pandas kwarg defaults in the docstring!

Comment on lines +701 to +702
    df[col] = df[col].astype('boolean')
except ValueError as err:
@phargogh (Member):

Would it be worth adding an else clause here? Or are we assuming that any other type will be caught by the MODEL_SPEC tests?

@emlys (Member, Author):

Yeah, since this is only used internally by invest models, I think we can be sure we'll only encounter valid types.
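
For context, a hypothetical shape of that dispatch with the suggested else clause; the branch details are assumed, not the actual implementation:

if col_spec['type'] == 'boolean':
    df[col] = df[col].astype('boolean')
elif col_spec['type'] in ('freestyle_string', 'option_string'):
    df[col] = df[col].str.strip().str.lower()
else:
    # Fails fast if a MODEL_SPEC ever contains an unexpected type.
    raise ValueError(f"Unknown type in spec: {col_spec['type']}")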

@emlys self-assigned this on Aug 2, 2023
@phargogh merged commit 6db2256 into natcap:main on Aug 15, 2023
19 checks passed
@emlys deleted the feature/1328 branch on October 3, 2024 22:58
Development

Successfully merging this pull request may close these issues.

  • Proposal: use info from MODEL_SPEC to read CSVs and add index_col key to CSV specs (#1328)