feat(pandas): Flopy pandas support #1955

Merged on Oct 4, 2023 (1 commit)

@scottrp scottrp commented Sep 19, 2023

This PR is the first step towards integrating Pandas into Flopy. This integration takes place in the MFPandasList and MFPandasTransientList classes (MFPandas*), which are used instead of the MFList and MFTransientList classes under the following conditions:

  1. The data is in a package that is not considered advanced
  2. The data type meets certain criteria, such as not being a jagged list with variable column length
  3. The flopy simulation data option use_pandas is set to true (the default)
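
The three conditions above amount to a simple predicate. The sketch below is only an illustration of that reading: `should_use_pandas` and its parameter names are hypothetical and are not taken from Flopy's code.

```python
def should_use_pandas(
    is_advanced_package: bool,
    is_jagged: bool,
    use_pandas: bool = True,
) -> bool:
    """Hypothetical predicate mirroring the three conditions listed above.

    The MFPandas* classes would be chosen only when the package is not
    advanced, the list is not jagged, and the use_pandas option is on.
    """
    return (not is_advanced_package) and (not is_jagged) and use_pandas


# A regular (non-jagged) list in a non-advanced package, default option.
print(should_use_pandas(is_advanced_package=False, is_jagged=False))
# The same list with the option switched off falls back to the MFList path.
print(should_use_pandas(False, False, use_pandas=False))
```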

The MFPandas* classes currently support the same interface as MFList and MFTransientList and should behave similarly for the end user. However, MFPandas* stores data internally in a pandas DataFrame and reads and writes data using the pandas “read_csv” and “to_csv” methods, which can be significantly faster than flopy’s current file reading. The MFPandas* classes’ set_data methods accept DataFrames, and their new “get_dataframe” method returns data as a pandas DataFrame (“get_data” still returns a recarray).

Remaining work on this PR includes:

  1. When reading files, do not use Python’s “tell” method to record the start and finish of data (this can be problematic for files opened as text).
  2. Convert recarrays to pandas DataFrames using the “from_records” method instead of the DataFrame constructor.
  3. Remove cellid tuple support from Flopy. Flopy will only accept cellids stored in separate layer, row, and column fields (or the appropriate fields for the discretization) instead of also supporting cellids as a single field containing a tuple (layer, row, column). All flopy lists (including the old MFList* classes and the new MFPandasList* classes) will store each component of the cellid in a separate column. This feature may or may not be part of this PR depending on timing.
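
To make the third item concrete, here is a minimal sketch of what moving from a tuple cellid to separate columns looks like for a single record. The helper `flatten_cellid` is a hypothetical name used for illustration; it is not part of Flopy.

```python
def flatten_cellid(record, cellid_index=0):
    """Expand a tuple cellid inside `record` into separate fields.

    `record` is a sequence such as ((layer, row, column), value, ...).
    Records whose cellid is already stored in separate fields are
    returned unchanged (as a tuple).
    """
    cellid = record[cellid_index]
    if not isinstance(cellid, tuple):
        return tuple(record)
    before = tuple(record[:cellid_index])
    after = tuple(record[cellid_index + 1:])
    return before + cellid + after


# Old style: the cellid is a single field holding (layer, row, column).
old_style = [((0, 1, 2), 10.5), ((0, 3, 4), 11.0)]
new_style = [flatten_cellid(rec) for rec in old_style]
print(new_style)  # [(0, 1, 2, 10.5), (0, 3, 4, 11.0)]
```

With every cellid component in its own field, each record has a fixed number of scalar columns, which maps directly onto DataFrame columns.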

codecov bot commented Sep 19, 2023

Codecov Report

Merging #1955 (061dcbe) into develop (6e23400) will decrease coverage by 1.0%.
The diff coverage is 22.6%.

@@            Coverage Diff            @@
##           develop   #1955     +/-   ##
=========================================
- Coverage     72.6%   71.7%   -1.0%     
=========================================
  Files          257     258      +1     
  Lines        57800   57412    -388     
=========================================
- Hits         42017   41179    -838     
- Misses       15783   16233    +450     
Files Coverage Δ
flopy/mf6/data/mfdata.py 75.8% <100.0%> (+0.1%) ⬆️
flopy/mf6/data/mfdataarray.py 60.9% <ø> (-0.1%) ⬇️
flopy/mf6/data/mfdatalist.py 71.0% <100.0%> (-0.4%) ⬇️
flopy/mf6/data/mfdatascalar.py 60.5% <ø> (-0.2%) ⬇️
flopy/mf6/mfsimbase.py 67.5% <100.0%> (+<0.1%) ⬆️
flopy/mf6/modflow/mfgwfchd.py 100.0% <ø> (ø)
flopy/mf6/modflow/mfgwfdrn.py 100.0% <ø> (ø)
flopy/mf6/modflow/mfgwfevt.py 100.0% <ø> (ø)
flopy/mf6/modflow/mfgwfevta.py 100.0% <ø> (ø)
flopy/mf6/modflow/mfgwfghb.py 100.0% <ø> (ø)
... and 17 more

... and 58 files with indirect coverage changes

@wpbonelli (Member) commented:

Linking back to the comment on the original PR. Sorry again for the accidental close.

"""
self._set_data(data, check_data=check_data)

def set_record(self, data_record, autofill=False, check_data=True):

should this be "set_data_record" to be consistent with the variable that it is setting? Or set_control_record?

Is the hidden and exposed method necessary for this? set_record() is only calling _set_record()

Variable name changed to be more consistent with the method name. I changed the variable name instead of the method name because changing the method name would break the existing interface.

Having the hidden method is not necessary. I was originally doing this to make sure the correct method (parent vs child class) got called. But it is better to just explicitly define this, which I am now doing.

"""
self._set_record(data_record, autofill, check_data)

def _set_record(self, data_record, autofill=False, check_data=True):

should this be _set_data_record to be consistent with the variable it is setting?

Renamed variables to be consistent with method name

self._resync()
try:
# convert to tuple
tuple_record = ()

I'm not sure I understand why lists are being converted to tuples here. It looks like there is support for lists in .append_data.

And if this is a single list record, could this just be tuple(record)?

Removed the conversion code and now passing list instead of tuple.

ex,
)

def update_record(self, record, key_index):

this would be clearer if key_index was kper or stress_period, etc...

This interface is also used for packages like Time-Array Series, whose "TIME" block (BEGIN TIME <tas_time>) has a key that is not a stress period. Similarly, the Observation package "CONTINUOUS" block (BEGIN CONTINUOUS FILEOUT <obs_output_file_name>) has a key that is a file name. I therefore chose the generalized name "key_index".
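
As a rough illustration of why a generic key name fits, the sketch below stores records per block key, where the key may be a stress period, a time, or a file name. `KeyedBlocks` and its replace-or-append behavior are assumptions made for this example, not Flopy's implementation of update_record.

```python
class KeyedBlocks:
    """Hypothetical container holding one list of records per block key."""

    def __init__(self):
        self._blocks = {}

    def update_record(self, record, key_index):
        """Replace the record whose first field matches, else append it.

        `key_index` can be any hashable block key: an int stress period,
        a float time, or a str file name.
        """
        block = self._blocks.setdefault(key_index, [])
        for i, existing in enumerate(block):
            if existing[0] == record[0]:
                block[i] = record
                return
        block.append(record)

    def get(self, key_index):
        """Return a copy of the records stored under `key_index`."""
        return list(self._blocks.get(key_index, []))


blocks = KeyedBlocks()
blocks.update_record(((0, 1, 2), 10.5), 0)          # key is a stress period
blocks.update_record((1.0, 0.25), 365.0)            # key is a time
blocks.update_record(("h1", "HEAD"), "heads.csv")   # key is a file name
print(blocks.get("heads.csv"))
```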

Comment on lines 583 to 623
message = (
    f"ERROR: Data list {self._data_name} supplied the "
    f"wrong number of columns of data, expected "
    f"{len(self._data_item_names)} got {len(data[0])}."
)
type_, value_, traceback_ = sys.exc_info()
raise MFDataException(
    self._data_dimensions.structure.get_model(),
    self._data_dimensions.structure.get_package(),
    self._data_dimensions.structure.path,
    "setting list data",
    self._data_dimensions.structure.name,
    inspect.stack()[0][3],
    type_,
    value_,
    traceback_,
    message,
    self._simulation_data.debug,
)

It seems like we could return the name of the missing (or extra) column here to provide a more detailed error message.

I cannot easily determine the name of the missing column, since this code accepts pandas DataFrames with incorrect column names (it corrects the column names below, which is trivial to do provided the DataFrame has the correct number of columns). I did, however, make the error message more detailed: it now lists both the supplied and the expected data column names.
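
A message of the kind described in this reply might look like the sketch below. The function name `column_mismatch_message` and the exact wording are assumptions for illustration; the message actually committed in the PR may differ.

```python
def column_mismatch_message(data_name, expected_names, supplied_names):
    """Build an error message listing both expected and supplied columns.

    All three arguments are hypothetical stand-ins for the values the
    real code has on hand when the column counts disagree.
    """
    expected = ", ".join(str(name) for name in expected_names)
    supplied = ", ".join(str(name) for name in supplied_names)
    return (
        f"ERROR: Data list {data_name} supplied the wrong number of "
        f"columns of data, expected {len(expected_names)} got "
        f"{len(supplied_names)}. Expected columns: {expected}. "
        f"Supplied columns: {supplied}."
    )


print(
    column_mismatch_message(
        "stress_period_data",
        ["layer", "row", "column", "head"],
        ["layer", "row", "head"],
    )
)
```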

Comment on lines 1421 to 1424
if isinstance(dataset_one, mfdataplist.MFPandasList) or isinstance(
    dataset_one, mfdataplist.MFPandasTransientList
):

This could be simplified to isinstance(dataset_one, (mfdataplist.MFPandasList, mfdataplist.MFPandasTransientList))

Simplified
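
For readers following this thread, the suggested simplification relies on `isinstance` accepting a tuple of classes. The sketch below uses empty stand-in classes with the same names as the Flopy classes so that it runs without Flopy installed; the helper `is_pandas_list` is hypothetical.

```python
class MFPandasList:
    """Stand-in for mfdataplist.MFPandasList (not the real class)."""


class MFPandasTransientList:
    """Stand-in for mfdataplist.MFPandasTransientList (not the real class)."""


def is_pandas_list(dataset) -> bool:
    # A single isinstance call with a tuple of classes is equivalent to
    # chaining two isinstance calls with `or`.
    return isinstance(dataset, (MFPandasList, MFPandasTransientList))


print(is_pandas_list(MFPandasList()))           # True
print(is_pandas_list(MFPandasTransientList()))  # True
print(is_pandas_list([1, 2, 3]))                # False
```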

Comment on lines 1426 to 1428
assert isinstance(
    dataset, mfdataplist.MFPandasList
) or isinstance(dataset, mfdataplist.MFPandasTransientList)

isinstance simplification here too

Simplified

Comment on lines 132 to +134
write_headers=True,
lazy_io=False,
use_pandas=True,

What does write_headers write?

write_headers: bool
    When true flopy writes a header to each package file indicating that
    it was created by flopy.

Comment on lines 2896 to 2898
elif isinstance(
    value, mfdatalist.MFTransientList
) or isinstance(value, mfdataplist.MFPandasTransientList):

The isinstance statement can be simplified: `isinstance(value, (mfdatalist.MFTransientList, mfdataplist.MFPandasTransientList))`

Simplified

Comment on lines 2916 to 2917
elif isinstance(value, mfdatalist.MFList) or isinstance(
value, mfdataplist.MFPandasList

Same isinstance comment.

Simplified

@spaulins-usgs spaulins-usgs merged commit f82fdf6 into modflowpy:develop Oct 4, 2023
21 checks passed