
Read CSVs using info from MODEL_SPEC #1345

Merged
merged 17 commits into natcap:main from feature/1328 on Aug 15, 2023

Conversation

@emlys (Member) commented on Jul 6, 2023

Description

Fixes #1328

  • Added index_col attribute to the spec for most CSVs
  • Updated utils.read_csv_to_dataframe to parse tables according to the info in the table spec

I simplified read_csv_to_dataframe a bit by standardizing some things that weren't consistent across models (a rough sketch follows the list):

  • All column names will be lowercased and whitespace removed
  • All freestyle_string and option_string values will be lowercased and whitespace removed
  • All path values will be expanded relative to the table location
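
A rough sketch of how those rules might be applied; the spec layout, type names, and csv_path here are placeholders, and the real logic lives in utils.read_csv_to_dataframe:

import os
import pandas

csv_path = 'data/biophysical.csv'  # placeholder path
spec = {'columns': {'lulc_desc': {'type': 'freestyle_string'},
                    'raster_path': {'type': 'raster'}}}
df = pandas.DataFrame({' LULC_Desc ': [' Forest '],
                       'raster_path': ['lulc.tif']})

df.columns = df.columns.str.strip().str.lower()            # rule 1
for col, col_spec in spec['columns'].items():
    if col_spec['type'] in ('freestyle_string', 'option_string'):
        df[col] = df[col].str.strip().str.lower()          # rule 2
    elif col_spec['type'] in ('raster', 'vector', 'csv', 'file'):
        df[col] = df[col].apply(lambda p: os.path.join(    # rule 3
            os.path.dirname(os.path.abspath(csv_path)), str(p).strip()))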

Because read_csv_to_dataframe now enforces data types, some type checking and casting could be removed from models.

In many places, it was simpler to refactor things to use the dataframe directly. In general, where a simple dictionary mapping index-to-value (such as LULC code to a biophysical parameter) was needed, I used this pattern: value_map = biophysical_df[parameter].to_dict()
Where the index maps to multiple values (such as LULC code to multiple biophysical parameters), I passed in the dataframe directly, rather than formatting it into a nested dictionary.
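
For illustration, a toy version of both patterns (invented column names and values):

import pandas

biophysical_df = pandas.DataFrame(
    {'load_n': [1.2, 3.4], 'eff_n': [0.5, 0.7]},
    index=pandas.Index([1, 2], name='lucode'))

# One value per index: collapse a single column into a plain dict.
value_map = biophysical_df['load_n'].to_dict()  # {1: 1.2, 2: 3.4}

# Multiple values per index: keep the dataframe and look up rows directly.
row = biophysical_df.loc[1]  # both parameters for lucode 1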

This PR has more deletions than additions, which feels like a sign that we're moving in the right direction!

Checklist

  • Updated HISTORY.rst and link to any relevant issue (if these changes are user-facing)
  • Updated the user's guide (if needed)
  • Tested the affected models' UIs (if relevant)

    patterns.append(f'{groups[0]}(.+){groups[2]}')
else:
    # for regular column names, use the exact name as the pattern
    patterns.append(column.replace('(', '\(').replace(')', '\)'))
@emlys (Member, Author):

Needed this to handle the one case in HRA where a column name contains parentheses: stressor buffer (meters). It would be nice if we could require that column names not contain any special regex characters.
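
An alternative to restricting column names would be to escape every regex metacharacter up front with re.escape; a minimal sketch, not what the PR does:

import re

column = 'stressor buffer (meters)'
# re.escape backslash-escapes all metacharacters, not just parentheses.
patterns.append(re.escape(column))  # 'patterns' as in the snippet above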

@emlys requested a review from @phargogh on July 19, 2023 17:06
@emlys marked this pull request as ready for review on July 19, 2023 17:06
@phargogh (Member) left a comment:

Whew, this is a big change! So exciting to see a lot of the model-specific table parsing and type checking handled by the new function. I had one or two comments/questions about functionality and there's a merge conflict on a file, but that's about it! Thanks @emlys !

-unique_lulc_classnames = set(
-    params['lulc-class'] for params in biophysical_parameters.values())
-if len(unique_lulc_classnames) != len(biophysical_parameters):
+if not biophysical_df['lulc-class'].is_unique:
@phargogh (Member):

Oh that's a nice check
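
A tiny demo of the pandas check, with made-up values:

import pandas

# is_unique is True only when no value in the Series repeats.
print(pandas.Series(['forest', 'grass', 'forest']).is_unique)  # False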

Comment on lines 423 to 427
'protein', 'lipid', 'energy', 'ca', 'fe', 'mg', 'ph', 'k', 'na', 'zn',
'cu', 'fl', 'mn', 'se', 'vita', 'betac', 'alphac', 'vite', 'crypto',
'lycopene', 'lutein', 'betat', 'gammat', 'deltat', 'vitc', 'thiamin',
'riboflavin', 'niacin', 'pantothenic', 'vitb6', 'folate', 'vitb12',
'vitk']
@phargogh (Member):

I realize this may be slightly outside the scope of this PR, but would it be worth removing this list and simply using the MODEL_SPEC definition now that they're all lowercased?

@emlys (Member, Author):

Sure!
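
A hypothetical sketch of that suggestion; the 'nutrient_table' key and NON_NUTRIENT_COLS are assumptions, not the actual MODEL_SPEC layout:

# Derive the nutrient column names from the spec instead of hardcoding them.
NON_NUTRIENT_COLS = {'crop'}  # assumed set of non-nutrient columns
nutrient_cols = [
    name for name in MODEL_SPEC['args']['nutrient_table']['columns']
    if name not in NON_NUTRIENT_COLS]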

Comment on lines +652 to +656
 if pandas.isna(row[carbon_pool_type]):
     raise ValueError(
         "Could not interpret carbon pool value as a number. "
         f"lucode: {lucode}, pool_type: {carbon_pool_type}, "
-        f"value: {biophysical_table[lucode][carbon_pool_type]}")
+        f"value: {row[carbon_pool_type]}")
@phargogh (Member):

Wouldn't we still need a test for whether the value is a valid number, not just NaN?

The exception here says the value could not be interpreted as a number, but looking through the isna docs, the function only detects NaNs. So float('my invalid value') would raise a ValueError, but isna('my invalid value') would not cause this exception to be raised.

@emlys (Member, Author):

If the value was not a valid number, an error would be raised earlier in utils.read_csv_to_dataframe, since it's now validating against the spec for the biophysical table. If we've gotten to this point, row[carbon_pool_type] should be guaranteed to be a float or NaN.
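
A minimal illustration of that guarantee (toy data; the real casting happens inside utils.read_csv_to_dataframe):

import pandas

# Casting to float at read time raises before any model code runs.
pandas.Series(['1.5', 'not a number']).astype(float)  # ValueError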

@phargogh (Member):

Aha! Makes sense, thanks!

src/natcap/invest/habitat_quality.py (resolved)
Comment on lines -581 to +583
-weight_sum = 0.0
-for threat_data in threat_dict.values():
-    # Sum weight of threats
-    weight_sum = weight_sum + threat_data['weight']
+weight_sum = threat_df['weight'].sum()
@phargogh (Member):

Huh, I wonder why we hadn't just sum()ed these before!

@emlys (Member, Author):

Working with everything in dataframes rather than dictionaries makes it a lot easier to do simple aggregations like this!

-    ] for lucode in sorted_lucodes
-], dtype=numpy.float32)
+    1 - biophysical_df[f'rc_{soil_group}'].to_numpy()
+    for soil_group in ['a', 'b', 'c', 'd']], dtype=numpy.float32).T
@phargogh (Member):

The transposition here is an interesting change in behavior! Sorry if I'm missing something here, but what were the circumstances that made the transpose necessary?

@emlys (Member, Author):

Sorry, this is confusing! But the behavior shouldn't be changed. The transpose is needed because I replaced the nested dictionary with a dataframe, and that effectively swaps the rows and columns.

The original code

numpy.array([
    [1 - biophysical_dict[lucode][f'rc_{soil_group}']
     for soil_group in ['a', 'b', 'c', 'd']]
    for lucode in sorted_lucodes])

produces an array of arrays, where the rows represent lucodes and the columns represent soil groups.

The new code

numpy.array([
    1 - biophysical_df[f'rc_{soil_group}'].to_numpy()
    for soil_group in ['a', 'b', 'c', 'd']])

produces an array of arrays where the rows are soil groups and the columns are lucodes. So the .T is needed to get it back to the original format.
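
A toy check of that equivalence (invented values, two soil groups for brevity, with a plain dict standing in for both data sources):

import numpy

biophysical = {1: {'rc_a': 0.1, 'rc_b': 0.2},
               2: {'rc_a': 0.3, 'rc_b': 0.4}}
sorted_lucodes = [1, 2]
soil_groups = ['a', 'b']

by_lucode = numpy.array([
    [1 - biophysical[code][f'rc_{g}'] for g in soil_groups]
    for code in sorted_lucodes])
by_group = numpy.array([
    [1 - biophysical[code][f'rc_{g}'] for code in sorted_lucodes]
    for g in soil_groups])

assert numpy.array_equal(by_lucode, by_group.T)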

Comment on lines +622 to +627
Also sets custom defaults for some kwargs passed to ``pandas.read_csv``,
which you can override with kwargs:

- sep=None: lets the Python engine infer the separator
- engine='python': The 'python' engine supports the sep=None option.
- encoding='utf-8-sig': 'utf-8-sig' handles UTF-8 with or without BOM.
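
In plain pandas terms, those defaults amount to a call like this (the path is illustrative):

import pandas

df = pandas.read_csv(
    'some_table.csv',
    sep=None,              # infer the delimiter...
    engine='python',       # ...which only the python engine supports
    encoding='utf-8-sig')  # handles UTF-8 with or without a BOM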
@phargogh (Member):

This is a great idea to note the pandas kwarg defaults in the docstring!

Comment on lines +701 to +702
    df[col] = df[col].astype('boolean')
except ValueError as err:
@phargogh (Member):

Would it be worth adding an else clause here? Or are we assuming that any other type will be caught by the MODEL_SPEC tests?

@emlys (Member, Author):

Yeah, since this is only used internally by invest models, I think we can be sure we'll only encounter valid types.
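
For context, a hypothetical shape of that dispatch with the suggested else clause; the branch details are assumed, not the actual implementation:

if col_spec['type'] == 'boolean':
    df[col] = df[col].astype('boolean')
elif col_spec['type'] in ('freestyle_string', 'option_string'):
    df[col] = df[col].str.strip().str.lower()
else:
    # Fails fast if a MODEL_SPEC ever contains an unexpected type.
    raise ValueError(f"Unknown type in spec: {col_spec['type']}")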

@emlys self-assigned this on Aug 2, 2023
@phargogh merged commit 6db2256 into natcap:main on Aug 15, 2023
19 checks passed
@emlys deleted the feature/1328 branch on October 3, 2024 22:58
Development

Successfully merging this pull request may close these issues.

  • Proposal: use info from MODEL_SPEC to read CSVs and add index_col key to CSV specs (#1328)