Read CSVs using info from MODEL_SPEC
#1345
Conversation
```python
        patterns.append(f'{groups[0]}(.+){groups[2]}')
    else:
        # for regular column names, use the exact name as the pattern
        patterns.append(column.replace('(', '\(').replace(')', '\)'))
```
Needed this to handle the one case in HRA where a column name contains parentheses: `stressor buffer (meters)`. It would be nice if we could require that column names not contain any special regex characters.
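As a side note, a more general way to neutralize regex metacharacters in a column name is `re.escape`, which handles every special character rather than just parentheses. A minimal sketch (the column name comes from the HRA case mentioned above):

```python
import re

# re.escape escapes all regex metacharacters, not only parentheses,
# so chained str.replace calls aren't needed.
column = 'stressor buffer (meters)'
pattern = re.escape(column)
assert re.fullmatch(pattern, 'stressor buffer (meters)')
```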
This reverts commit 5276b52.
Whew, this is a big change! So exciting to see a lot of the model-specific table parsing and type checking handled by the new function. I had one or two comments/questions about functionality and there's a merge conflict on a file, but that's about it! Thanks @emlys !
```diff
-unique_lulc_classnames = set(
-    params['lulc-class'] for params in biophysical_parameters.values())
-if len(unique_lulc_classnames) != len(biophysical_parameters):
+if not biophysical_df['lulc-class'].is_unique:
```
Oh that's a nice check
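For illustration, `Series.is_unique` is `False` as soon as any value repeats, which is what makes it a drop-in replacement for the set-length comparison (the class names here are made up):

```python
import pandas as pd

# A duplicated 'lulc-class' value makes is_unique False.
df = pd.DataFrame({'lulc-class': ['forest', 'urban', 'forest']})
assert not df['lulc-class'].is_unique
```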
```python
    'protein', 'lipid', 'energy', 'ca', 'fe', 'mg', 'ph', 'k', 'na', 'zn',
    'cu', 'fl', 'mn', 'se', 'vita', 'betac', 'alphac', 'vite', 'crypto',
    'lycopene', 'lutein', 'betat', 'gammat', 'deltat', 'vitc', 'thiamin',
    'riboflavin', 'niacin', 'pantothenic', 'vitb6', 'folate', 'vitb12',
    'vitk']
```
I realize this may be slightly outside the scope of this PR, but would it be worth removing this list and simply using the MODEL_SPEC definition now that they're all lowercased?
Sure!
```diff
 if pandas.isna(row[carbon_pool_type]):
     raise ValueError(
         "Could not interpret carbon pool value as a number. "
         f"lucode: {lucode}, pool_type: {carbon_pool_type}, "
-        f"value: {biophysical_table[lucode][carbon_pool_type]}")
+        f"value: {row[carbon_pool_type]}")
```
Wouldn't we still need a test for whether the value is a valid number and not just `NaN`? The exception here is about the value not being interpretable as a number, but looking through the `isna` docs, the function seems to be only about `NaN`s. So `float('my invalid value')` would cause a `ValueError`, but `isna('my invalid value')` would not cause this exception to be raised.
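The distinction can be demonstrated directly:

```python
import pandas

# pandas.isna only detects NaN/None-like values; it does not flag
# strings that merely fail to parse as numbers.
assert not pandas.isna('my invalid value')
assert pandas.isna(float('nan'))

# float() on the same string, by contrast, raises ValueError.
try:
    float('my invalid value')
except ValueError:
    pass
```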
If the value was not a valid number, an error would be raised earlier in `utils.read_csv_to_dataframe`, since it's now validating against the spec for the biophysical table. If we've gotten to this point, `row[carbon_pool_type]` should be guaranteed to be a float or NaN.
Aha! Makes sense, thanks!
```diff
-weight_sum = 0.0
-for threat_data in threat_dict.values():
-    # Sum weight of threats
-    weight_sum = weight_sum + threat_data['weight']
+weight_sum = threat_df['weight'].sum()
```
Huh, I wonder why we hadn't just `sum()`ed these before!
Working with everything in dataframes rather than dictionaries makes it a lot easier to do simple aggregations like this!
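A tiny sketch of the aggregation (the weights here are made-up values):

```python
import pandas as pd

# Summing a dataframe column replaces the manual accumulation loop
# over a nested dictionary.
threat_df = pd.DataFrame({'weight': [0.5, 1.0, 1.5]})  # hypothetical values
weight_sum = threat_df['weight'].sum()
assert weight_sum == 3.0
```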
```diff
-    ] for lucode in sorted_lucodes
-    ], dtype=numpy.float32)
+    1 - biophysical_df[f'rc_{soil_group}'].to_numpy()
+    for soil_group in ['a', 'b', 'c', 'd']], dtype=numpy.float32).T
```
The transposition here is an interesting change in behavior! Sorry if I'm missing something here, but what were the circumstances that made the transpose necessary?
Sorry, this is confusing! But the behavior shouldn't be changed. The transpose is needed because I replaced the nested dictionary with a dataframe, and that effectively swaps the rows and columns.
The original code

```python
numpy.array([
    [1 - biophysical_dict[lucode][f'rc_{soil_group}']
     for soil_group in ['a', 'b', 'c', 'd']
    ] for lucode in sorted_lucodes])
```

produces an array of arrays, where the rows represent lucodes and the columns represent soil groups.
The new code

```python
numpy.array([
    1 - biophysical_df[f'rc_{soil_group}'].to_numpy()
    for soil_group in ['a', 'b', 'c', 'd']
])
```

produces an array of arrays where the rows are soil groups and the columns are lucodes. So the `.T` is needed to get it back to the original format.
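The equivalence of the two constructions can be checked with a small self-contained example (all `rc_*` values and lucodes here are made up):

```python
import numpy
import pandas

# Hypothetical data: rc_* values for two lucodes and soil groups a-d.
biophysical_dict = {
    1: {'rc_a': 0.1, 'rc_b': 0.2, 'rc_c': 0.3, 'rc_d': 0.4},
    2: {'rc_a': 0.5, 'rc_b': 0.6, 'rc_c': 0.7, 'rc_d': 0.8},
}
sorted_lucodes = [1, 2]

# Original pattern: rows are lucodes, columns are soil groups.
original = numpy.array([
    [1 - biophysical_dict[lucode][f'rc_{soil_group}']
     for soil_group in ['a', 'b', 'c', 'd']
    ] for lucode in sorted_lucodes], dtype=numpy.float32)

# Dataframe pattern: rows come out as soil groups, so .T restores the layout.
biophysical_df = pandas.DataFrame(biophysical_dict).T
new = numpy.array([
    1 - biophysical_df[f'rc_{soil_group}'].to_numpy()
    for soil_group in ['a', 'b', 'c', 'd']], dtype=numpy.float32).T

assert numpy.allclose(original, new)
```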
```text
Also sets custom defaults for some kwargs passed to ``pandas.read_csv``,
which you can override with kwargs:

- sep=None: lets the Python engine infer the separator
- engine='python': The 'python' engine supports the sep=None option.
- encoding='utf-8-sig': 'utf-8-sig' handles UTF-8 with or without BOM.
```
This is a great idea to note the pandas kwarg defaults in the docstring!
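A sketch of what those defaults buy, using a hypothetical semicolon-separated table with a UTF-8 BOM:

```python
import io
import pandas

# Hypothetical CSV using ';' as the separator, prefixed with a BOM.
# sep=None with the python engine lets pandas sniff the separator,
# and utf-8-sig strips the BOM if one is present.
text = '\ufefflucode;lulc-class\n1;forest\n2;urban\n'
df = pandas.read_csv(
    io.BytesIO(text.encode('utf-8')),
    sep=None, engine='python', encoding='utf-8-sig')
assert list(df.columns) == ['lucode', 'lulc-class']
```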
```python
    df[col] = df[col].astype('boolean')
except ValueError as err:
```
Would it be worth adding an `else` clause here? Or are we assuming that any other type will be caught by the MODEL_SPEC tests?
Yeah, since this is only being used internally by invest models, I think we can be sure that we'll only encounter valid types.
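For context, the `'boolean'` conversion above targets pandas' nullable boolean dtype, which keeps missing entries as `pd.NA` rather than forcing the column to object or float. A minimal illustration:

```python
import pandas as pd

# The nullable 'boolean' dtype preserves missing values as pd.NA.
col = pd.Series([True, False, None])
converted = col.astype('boolean')
assert converted.dtype.name == 'boolean'
assert converted.isna().tolist() == [False, False, True]
```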
Description
Fixes #1328
- Added an `index_col` attribute to the spec for most CSVs
- Updated `utils.read_csv_to_dataframe` to parse tables according to the info in the table spec

I simplified `read_csv_to_dataframe` a bit by standardizing some things that weren't consistent across models:
- `freestyle_string` and `option_string` values will be lowercased and whitespace removed

Because `read_csv_to_dataframe` now enforces data types, some type checking and casting could be removed from models.

In many places, it was simpler to refactor things to use the dataframe directly. In general, where a simple dictionary mapping index-to-value (such as LULC code to a biophysical parameter) was needed, I used this pattern:

```python
value_map = biophysical_df[parameter].to_dict()
```
Where the index maps to multiple values (such as LULC code to multiple biophysical parameters), I passed in the dataframe directly, rather than formatting it into a nested dictionary.
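Both access patterns side by side, on a hypothetical biophysical table indexed by lucode (column names and values are made up for illustration):

```python
import pandas as pd

# Hypothetical biophysical table indexed by lucode.
biophysical_df = pd.DataFrame(
    {'kc': [0.6, 1.0], 'root_depth': [300, 1500]},
    index=pd.Index([1, 2], name='lucode'))

# One parameter: a plain lucode -> value dict.
value_map = biophysical_df['kc'].to_dict()
assert value_map == {1: 0.6, 2: 1.0}

# Multiple parameters: use the dataframe directly, e.g. a row/column lookup.
assert biophysical_df.loc[2, 'root_depth'] == 1500
```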
This PR has more deletions than additions, which feels like a sign that we're moving in the right direction!
Checklist