Removed aggregation in util_list.py, get_dataframe() #1033

tomvansteijn · 2021-01-04T07:41:35Z

In the current implementation list input is grouped by layer, row, column when creating a dataframe. Variables are summed. This is incorrect for levels, e.g. stages or river bottoms.

Removed the groupby altogether. I believe it is not needed here? Not aware of the requirements for NetCDF, but a groupby can be performed after this function call.

…develop

codecov · 2021-01-04T07:58:34Z

Codecov Report

Merging #1033 (06b0a71) into develop (943ace4) will increase coverage by 0.679%.
The diff coverage is 92.857%.

@@              Coverage Diff              @@
##           develop     #1033       +/-   ##
=============================================
+ Coverage   71.769%   72.448%   +0.679%     
=============================================
  Files          225       221        -4     
  Lines        51895     51116      -779     
=============================================
- Hits         37245     37033      -212     
+ Misses       14650     14083      -567

Impacted Files	Coverage Δ
flopy/utils/util_list.py	`71.943% <92.857%> (-0.305%)`	⬇️
flopy/mt3d/mtsft.py	`6.569% <0.000%> (-72.263%)`	⬇️
flopy/utils/formattedfile.py	`88.311% <0.000%> (-2.598%)`	⬇️
flopy/utils/mfreadnam.py	`75.912% <0.000%> (-1.460%)`	⬇️
flopy/modflow/mflmt.py	`91.578% <0.000%> (-1.053%)`	⬇️
flopy/mf6/utils/testutils.py
flopy/mf6/utils/reference.py
flopy/mt3d/mtcts.py
flopy/utils/voronoi.py

tomvansteijn · 2021-01-04T09:38:21Z

Ok, this is not an ideal solution. The dataframe will become sparse when the number of stresses changes between periods. I think it would be more proper to concatenate different periods as rows instead of columns. That would require the period number as an additional index level.

jdhughes-usgs · 2021-01-21T01:00:42Z

Any thoughts @mnfienen-usgs, @aleaf-usgs, and/or @jtwhite79??

briochh · 2021-01-21T01:57:01Z

For what it is worth, I think I am probably responsible for some of that aggregation business. From what I remember, it was to try and avoid issues with repeating indices in the resulting dataframe. I think @tomvansteijn is right though, it doesn't make much sense for some of the package list-types that now exist, so should be all good as long as repeated cells no longer cause and issue in the df construction.

tomvansteijn · 2021-01-21T08:30:26Z

To be honest I think a better solution would be to add the period as an index level and concatenate the stress period data in row direction instead of the column direction.

Instead of the dataframe looking like this:

           node    stage0      cond0     rbot0  isub0
k i   j
9 53  89  12568  0.300000   8.052216 -0.380000    1.0
  54  87  12806  0.300000   4.652384 -0.886000    1.0
      88  12807  0.300000  12.746740 -0.372659    1.0

Have this:

               node     stage       cond      rbot  isub
per k i   j
0   9 53  89  12568  0.300000   8.052216 -0.380000   1.0
      54  87  12806  0.300000   4.652384 -0.886000   1.0
          88  12807  0.300000  12.746740 -0.372659   1.0

This would make aggregation no longer necessary and make it easier to get a Series over period index for a given k, i, j (or node). What do you think?

jtwhite79 · 2021-01-21T16:40:47Z

@tomvansteijn - I think adding stress period as an index level is probably the best approach since it would let us slice and dice more easily. And Im fine with dropping the aggregation.

briochh · 2021-01-21T19:42:20Z

@tomvansteijn, kper index level also works for me. I think we could still end up with repeated index though, if a cell entry is repeated within a stress period (which I think can happen in MODFLOW inputs) (kper, k, i, j) would be the same. Not sure how pandas deals with this now, previously it threw an error, hence the introduction of that aggregation step to ensure unique index.

tomvansteijn · 2021-01-21T20:56:26Z

Good to know. I thinkt there is no problem when there are duplicates in the index, when concatenating the stress period data in along the same dimension. I will adjust the PR for this. The change may break some people's code though.

langevin-usgs · 2021-01-22T15:39:36Z

Hey @tomvansteijn, one way to fix the duplicate issue would be to include boundary number as another column in the data frame. The boundary number would just correspond to which feature it is in the bound list. So if there are 10 wells, for a stress period, they could just be numbered zero to nine. That way, if the ten wells were all in the same cell, they could would at least have a useful indicator of which one is which. Not sure what to call that column, maybe IBNDNO? Or something like that?

langevin-usgs · 2021-02-06T18:29:55Z

Hey @tomvansteijn, we are going to be making a flopy release in the next week or two. Any chance you are able to update this PR?

…elop

… variables in the columns

tomvansteijn · 2021-02-06T22:38:34Z

Hi Chris, thanks for the reminder. I have made some changes to include "per" as the first index column. Seems to work, but I am still checking the tests.

…que multi-index

tomvansteijn · 2021-02-07T21:14:25Z

Some trouble here with test t004_test_utilarray.py::test_mflist. The last assert fails in this block:

    # test get_dataframe() on mflist obj
    sp_data3 = {
        0: [1, 1, 1, 1.0],
        1: [[1, 1, 3, 3.0], [1, 1, 2, 6.0]],
        2: [
            [1, 2, 4, 8.0],
            [1, 2, 3, 4.0],
            [1, 2, 2, 4.0],
            [1, 1, 3, 3.0],
            [1, 1, 2, 6.0],
        ],
    }
    wel4 = flopy.modflow.ModflowWel(ml, stress_period_data=sp_data3)
    df = wel4.stress_period_data.get_dataframe()
    df = df.set_index(["per", "k", "i", "j"])
    assert df.loc[0, "flux"].sum() == 1.0
    assert df.loc[1, "flux"].sum() == 9.0
    assert df.loc[2, "flux"].sum() == 25.0

The expression df.loc[2, "flux"].sum() results in 16.0 instead of 25.0. The sum of the fluxes in sp_data3[2] is indeed 25.0, but the squeeze option in get_dataframe(), which is on by default, results in removal of the values at (1, 1, 3) and (1, 1, 2) in stress period 2 (the third stress period), as these have not changed since the previous period, making the sum 25.0 - 3.0 - 6.0 = 16.0.

If you agree I will change the assertion to:

assert df.loc[2, "flux"].sum() == 16.0

Honestly I think it is better to turn the squeeze option off by default, since it is changing the size of the input data implicitly. But then we would still want a test for get_dataframe with and without squeezing.

jtwhite79 · 2021-02-07T22:55:21Z

That squeeze option sounds dangerous! I think if the sum of input flux is 25 then the assert is technically correct right?

tomvansteijn · 2021-02-08T08:25:25Z

The assertion is not correct when two entries are removed in the squeeze operation. The remaining entries sum to 16.0.
I set the squeeze option to False in this PR, so the assertion does not need to change. I can always revert.

You can also consider to move the squeeze operation to a separate util function, or to remove it.

tomvansteijn · 2021-02-08T11:52:56Z

Tests are passing so I think this is it. Let me know what you think.

tomvansteijn · 2021-02-08T13:18:06Z

Clueless why t065 is failing now, as I did not change anything related to gridintersect. t065 is passing on my machine.

langevin-usgs · 2021-02-08T18:36:41Z

@dbrakenhoff, we are getting some weird failures with this PR, and I've restarted a couple times, but they keep occurring. From what I can tell, the version of shapely has not changed since we've had a successful tests, so it's unclear what is going on here. This PR changes the behavior of util_list.py and retrieval of a pandas data frame. Any thoughts? PR failure can be seen here: https://github.com/modflowpy/flopy/pull/1033/checks?check_run_id=1856766475

======================================================================
ERROR: t065_test_gridintersect.test_rect_grid_multilinestring_in_one_cell
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/share/miniconda/envs/flopy/lib/python3.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/runner/work/flopy/flopy/autotest/t065_test_gridintersect.py", line 483, in test_rect_grid_multilinestring_in_one_cell
    [LineString([(1., 1), (9., 1.)]), LineString([(1., 9.), (9., 9.)])]))
  File "/home/runner/work/flopy/flopy/flopy/utils/gridintersect.py", line 190, in intersect
    shp, keepzerolengths
  File "/home/runner/work/flopy/flopy/flopy/utils/gridintersect.py", line 768, in _intersect_linestring_structured
    n, l, v, ixs = self._get_nodes_intersecting_linestring(ls)
  File "/home/runner/work/flopy/flopy/flopy/utils/gridintersect.py", line 960, in _get_nodes_intersecting_linestring
    x = intersect.xy[0]
  File "/usr/share/miniconda/envs/flopy/lib/python3.7/site-packages/shapely/geometry/linestring.py", line 115, in xy
    return self.coords.xy
AttributeError: 'list' object has no attribute 'xy'

======================================================================
FAIL: t065_test_gridintersect.test_rect_grid_linestring_in_2cells
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/share/miniconda/envs/flopy/lib/python3.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/runner/work/flopy/flopy/autotest/t065_test_gridintersect.py", line 435, in test_rect_grid_linestring_in_2cells
    assert len(result) == 2
AssertionError

======================================================================
FAIL: t065_test_gridintersect.test_rect_grid_linestring_on_outer_boundary
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/share/miniconda/envs/flopy/lib/python3.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/runner/work/flopy/flopy/autotest/t065_test_gridintersect.py", line 451, in test_rect_grid_linestring_on_outer_boundary
    assert len(result) == 2
AssertionError

======================================================================
FAIL: t065_test_gridintersect.test_rect_grid_linestring_on_inner_boundary
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/share/miniconda/envs/flopy/lib/python3.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/runner/work/flopy/flopy/autotest/t065_test_gridintersect.py", line 467, in test_rect_grid_linestring_on_inner_boundary
    assert len(result) == 2
AssertionError

dbrakenhoff · 2021-02-08T19:23:01Z

I saw the comment and I already tried replicating the error locally but to no avail. The intersect result seems to be different for some reason. I'll try to figure out what's going on...

dbrakenhoff · 2021-02-08T21:14:41Z

I don't quite understand why yet, but the error seems to be caused by the odd behavior when intersecting LineStrings with a polygon representing the bounding box . The LineStrings in these cases intersect the bounding box, but the intersection result is an empty LineString.

The snippet below illustrates the problem:

    from shapely.geometry import LineString, box

    pl = box(0, 0, 20, 20)
    shp = LineString([(5., 5.), (15., 5.)])

    shp.intersects(pl)  # returns True

    intersect = shp.intersection(pl)
    intersect.__geo_interface__  # Returns: {'type': 'LineString', 'coordinates': ()}

Also weird, I can only replicate this error (thus far) using the environment file etc/environment.yml. When I create new minimal environments (only installing shapely) to test the snippet above on my PC (Ubuntu 20.04) for Python 3.7, 3.8 and 3.9, the resulting intersect is what you would expect (not empty...).

So I'm sort of at a loss now. Anyone else have any ideas?

mwtoews · 2021-02-09T19:52:26Z

Hi all. I'm able to replicate the error using GEOS 3.9.0. This is found in a few recent commits from the development branch, e.g.

GEOS 3.9.0 (used by Shapely) is the current version, released December 2020, which uses a new overlay engine dubbed OverlayNG. The snippet that @dbrakenhoff showed above indeed looks strange. The expected behaviour is found with geos==3.8.1, so my suggestion is to pin this version while I investigate why Shapely/GEOS is providing the wrong intersection with geos==3.9.0.

mwtoews · 2021-02-09T20:53:38Z

Some good news: this issue (libgeos/geos#408) was just fixed 19 hours ago in the geos 3.9 branch (thanks @dr-jts !). The bad news that you'll need to avoid geos==3.9.0 and wait until geos==3.9.1 comes out -- and I'm not sure when this could be.

Update: a release could be expected in the next few days.

Update2: a release was announced. So it's just a matter of time for (e.g.) conda to catch-up to enable geos==3.9.1

langevin-usgs · 2021-02-11T16:20:13Z

@tomvansteijn, can you merge develop into this PR and resubmit? I think the t065 test should pass now. It seems like there is still a little bit of unsettled business with this squeeze option, but turning it off by default seems like the safest idea for now. The Pandas people might consider taking a closer look at this get_dataframe() method to ensure that records and data aren't being lost, which is what it sounds like now, when the squeeze option is used.

…elop

tomvansteijn · 2021-02-12T10:43:59Z

Getting a "rate limit exceeded" while building Modflow using pymake in the test setup..

tomvansteijn · 2021-02-12T11:59:40Z

As for the squeeze option in the get_dataframe method. The original doc says:

squeeze : bool
            Reduce number of columns in dataframe to only include
            stress periods where a variable changes.

That is, stress periods where none of the cells have changed, in any variable, since the previous stress period, are ommited from the dataframe. For example if the stress period data of period 5 is identical to the stress period data in period 4, period 5 is omitted from the dataframe. This would not lead to a different simulation after putting the dataframe back to the package data (unless the omitted period becomes an empty block).

Moving the periods from the columns to the index led me to a different understanding of the squeeze operation: Omit any row (period, node) where the variables have not changed since the previous period. This different from the original implementation. For example the stress period data for node (1, 1, 3) in period 5 is equal to the stress period data in period 4, therefore node (1, 1, 3) is omitted from period 5 in the dataframe. I understand now that this would lead to a different result in a simulation. So this implementation is not correct.

In general I think it is best to maintain a 1-to-1 relation between the package data and the dataframe, hence my suggestion to remove the squeeze option completely. Otherwise I can correct it.

jtwhite79 · 2021-02-12T14:51:13Z

I agree. Squeeeze makes sense if the entire dataframe is identical between stress periods but otherwise, we lose that 1-to-1 relation.

tomvansteijn · 2021-02-14T20:08:44Z

I have adjusted the squeeze option once more. Only fully duplicate periods are removed from the stress period data. See the additional test in t004. @langevin-usgs I have used your suggestion of adding a unique identifier for multiple entries in the same period and cell index.

langevin-usgs · 2021-02-15T23:26:38Z

Thanks for the perseverance!

Steijn van added 8 commits September 24, 2020 14:50

debug for complex linestrings

a473912

check on geoms attribute instead of geom_type

eebeb46

fix long line; variable names

3b7ea47

travis fix _transform_geo_interface_polygon

566b90d

black /w line lengthh 79

02bd6ab

rerun black after update from black 19.10 to black 20.8

8249987

removed faulty aggregation in get_dataframe()

82164cc

Merge branch 'develop' of https://github.com/tomvansteijn/flopy into …

ce682c1

…develop

Steijn van added 2 commits January 4, 2021 10:27

handle multiple stress periods; debug t004 for non-unique index

030efc8

reformat with black

2b6dd4a

jdhughes-usgs requested review from a user and jtwhite79 January 21, 2021 00:59

Steijn van added 3 commits February 6, 2021 19:33

Merge branch 'develop' of https://github.com/modflowpy/flopy into dev…

66118ca

…elop

MfList.get_dataframe; changed dataframe to per, k, i, j in the index;…

7c30904

… variables in the columns

blacken

446edbb

Steijn van added 3 commits February 7, 2021 09:21

test for util_list get_dataframe

5d3aaa7

fixed t009 for new dataframe format; fixed squeeze method for non-uni…

ab67134

…que multi-index

blacken

6ff8c68

Steijn van added 2 commits February 8, 2021 08:30

squeeze option=False in get_dataframe()

6bcbee9

account for new period index in t004

c2b3d44

added test for the squeeze option

4d39928

langevin-usgs mentioned this pull request Feb 9, 2021

fix(numpy): aliases of builtin types is deprecated as of numpy 1.20 #1052

Merged

Merge branch 'develop' of https://github.com/modflowpy/flopy into dev…

f68abc0

…elop

adjusted squeeze option in get_dataframe; remove only duplicate periods

06b0a71

langevin-usgs merged commit 1332b96 into modflowpy:develop Feb 15, 2021

jdhughes-usgs mentioned this pull request Aug 4, 2021

aggregation method in util_list.py #764

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Removed aggregation in util_list.py, get_dataframe() #1033

Removed aggregation in util_list.py, get_dataframe() #1033

tomvansteijn commented Jan 4, 2021

codecov bot commented Jan 4, 2021 •

edited

tomvansteijn commented Jan 4, 2021

jdhughes-usgs commented Jan 21, 2021

briochh commented Jan 21, 2021

tomvansteijn commented Jan 21, 2021

jtwhite79 commented Jan 21, 2021

briochh commented Jan 21, 2021

tomvansteijn commented Jan 21, 2021

langevin-usgs commented Jan 22, 2021 •

edited

langevin-usgs commented Feb 6, 2021

tomvansteijn commented Feb 6, 2021

tomvansteijn commented Feb 7, 2021

jtwhite79 commented Feb 7, 2021

tomvansteijn commented Feb 8, 2021

tomvansteijn commented Feb 8, 2021

tomvansteijn commented Feb 8, 2021

langevin-usgs commented Feb 8, 2021

dbrakenhoff commented Feb 8, 2021

dbrakenhoff commented Feb 8, 2021

mwtoews commented Feb 9, 2021

mwtoews commented Feb 9, 2021 •

edited

langevin-usgs commented Feb 11, 2021

tomvansteijn commented Feb 12, 2021

tomvansteijn commented Feb 12, 2021

jtwhite79 commented Feb 12, 2021

tomvansteijn commented Feb 14, 2021

langevin-usgs commented Feb 15, 2021

Removed aggregation in util_list.py, get_dataframe() #1033

Removed aggregation in util_list.py, get_dataframe() #1033

Conversation

tomvansteijn commented Jan 4, 2021

codecov bot commented Jan 4, 2021 • edited

Codecov Report

tomvansteijn commented Jan 4, 2021

jdhughes-usgs commented Jan 21, 2021

briochh commented Jan 21, 2021

tomvansteijn commented Jan 21, 2021

jtwhite79 commented Jan 21, 2021

briochh commented Jan 21, 2021

tomvansteijn commented Jan 21, 2021

langevin-usgs commented Jan 22, 2021 • edited

langevin-usgs commented Feb 6, 2021

tomvansteijn commented Feb 6, 2021

tomvansteijn commented Feb 7, 2021

jtwhite79 commented Feb 7, 2021

tomvansteijn commented Feb 8, 2021

tomvansteijn commented Feb 8, 2021

tomvansteijn commented Feb 8, 2021

langevin-usgs commented Feb 8, 2021

dbrakenhoff commented Feb 8, 2021

dbrakenhoff commented Feb 8, 2021

mwtoews commented Feb 9, 2021

mwtoews commented Feb 9, 2021 • edited

langevin-usgs commented Feb 11, 2021

tomvansteijn commented Feb 12, 2021

tomvansteijn commented Feb 12, 2021

jtwhite79 commented Feb 12, 2021

tomvansteijn commented Feb 14, 2021

langevin-usgs commented Feb 15, 2021

codecov bot commented Jan 4, 2021 •

edited

langevin-usgs commented Jan 22, 2021 •

edited

mwtoews commented Feb 9, 2021 •

edited