Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removed aggregation in util_list.py, get_dataframe() #1033

Merged
merged 21 commits into from Feb 15, 2021

Conversation

tomvansteijn
Copy link
Contributor

See #764.

In the current implementation list input is grouped by layer, row, column when creating a dataframe. Variables are summed. This is incorrect for levels, e.g. stages or river bottoms.

Removed the groupby altogether. I believe it is not needed here? Not aware of the requirements for NetCDF, but a groupby can be performed after this function call.

@codecov
Copy link

codecov bot commented Jan 4, 2021

Codecov Report

Merging #1033 (06b0a71) into develop (943ace4) will increase coverage by 0.679%.
The diff coverage is 92.857%.

@@              Coverage Diff              @@
##           develop     #1033       +/-   ##
=============================================
+ Coverage   71.769%   72.448%   +0.679%     
=============================================
  Files          225       221        -4     
  Lines        51895     51116      -779     
=============================================
- Hits         37245     37033      -212     
+ Misses       14650     14083      -567     
Impacted Files Coverage Δ
flopy/utils/util_list.py 71.943% <92.857%> (-0.305%) ⬇️
flopy/mt3d/mtsft.py 6.569% <0.000%> (-72.263%) ⬇️
flopy/utils/formattedfile.py 88.311% <0.000%> (-2.598%) ⬇️
flopy/utils/mfreadnam.py 75.912% <0.000%> (-1.460%) ⬇️
flopy/modflow/mflmt.py 91.578% <0.000%> (-1.053%) ⬇️
flopy/mf6/utils/testutils.py
flopy/mf6/utils/reference.py
flopy/mt3d/mtcts.py
flopy/utils/voronoi.py

@tomvansteijn
Copy link
Contributor Author

Ok, this is not an ideal solution. The dataframe will become sparse when the number of stresses changes between periods. I think it would be more proper to concatenate different periods as rows instead of columns. That would require the period number as an additional index level.

@jdhughes-usgs jdhughes-usgs requested review from a user and jtwhite79 January 21, 2021 00:59
@jdhughes-usgs
Copy link
Contributor

Any thoughts @mnfienen-usgs, @aleaf-usgs, and/or @jtwhite79??

@briochh
Copy link
Contributor

briochh commented Jan 21, 2021

For what it is worth, I think I am probably responsible for some of that aggregation business. From what I remember, it was to try and avoid issues with repeating indices in the resulting dataframe. I think @tomvansteijn is right though, it doesn't make much sense for some of the package list-types that now exist, so should be all good as long as repeated cells no longer cause and issue in the df construction.

@tomvansteijn
Copy link
Contributor Author

To be honest I think a better solution would be to add the period as an index level and concatenate the stress period data in row direction instead of the column direction.

Instead of the dataframe looking like this:

           node    stage0      cond0     rbot0  isub0
k i   j
9 53  89  12568  0.300000   8.052216 -0.380000    1.0
  54  87  12806  0.300000   4.652384 -0.886000    1.0
      88  12807  0.300000  12.746740 -0.372659    1.0

Have this:

               node     stage       cond      rbot  isub
per k i   j
0   9 53  89  12568  0.300000   8.052216 -0.380000   1.0
      54  87  12806  0.300000   4.652384 -0.886000   1.0
          88  12807  0.300000  12.746740 -0.372659   1.0

This would make aggregation no longer necessary and make it easier to get a Series over period index for a given k, i, j (or node). What do you think?

@jtwhite79
Copy link
Contributor

@tomvansteijn - I think adding stress period as an index level is probably the best approach since it would let us slice and dice more easily. And Im fine with dropping the aggregation.

@briochh
Copy link
Contributor

briochh commented Jan 21, 2021

@tomvansteijn, kper index level also works for me. I think we could still end up with repeated index though, if a cell entry is repeated within a stress period (which I think can happen in MODFLOW inputs) (kper, k, i, j) would be the same. Not sure how pandas deals with this now, previously it threw an error, hence the introduction of that aggregation step to ensure unique index.

@tomvansteijn
Copy link
Contributor Author

Good to know. I thinkt there is no problem when there are duplicates in the index, when concatenating the stress period data in along the same dimension. I will adjust the PR for this. The change may break some people's code though.

@langevin-usgs
Copy link
Contributor

langevin-usgs commented Jan 22, 2021

Hey @tomvansteijn, one way to fix the duplicate issue would be to include boundary number as another column in the data frame. The boundary number would just correspond to which feature it is in the bound list. So if there are 10 wells, for a stress period, they could just be numbered zero to nine. That way, if the ten wells were all in the same cell, they could would at least have a useful indicator of which one is which. Not sure what to call that column, maybe IBNDNO? Or something like that?

@langevin-usgs
Copy link
Contributor

Hey @tomvansteijn, we are going to be making a flopy release in the next week or two. Any chance you are able to update this PR?

@tomvansteijn
Copy link
Contributor Author

Hi Chris, thanks for the reminder. I have made some changes to include "per" as the first index column. Seems to work, but I am still checking the tests.

@tomvansteijn
Copy link
Contributor Author

Some trouble here with test t004_test_utilarray.py::test_mflist. The last assert fails in this block:

    # test get_dataframe() on mflist obj
    sp_data3 = {
        0: [1, 1, 1, 1.0],
        1: [[1, 1, 3, 3.0], [1, 1, 2, 6.0]],
        2: [
            [1, 2, 4, 8.0],
            [1, 2, 3, 4.0],
            [1, 2, 2, 4.0],
            [1, 1, 3, 3.0],
            [1, 1, 2, 6.0],
        ],
    }
    wel4 = flopy.modflow.ModflowWel(ml, stress_period_data=sp_data3)
    df = wel4.stress_period_data.get_dataframe()
    df = df.set_index(["per", "k", "i", "j"])
    assert df.loc[0, "flux"].sum() == 1.0
    assert df.loc[1, "flux"].sum() == 9.0
    assert df.loc[2, "flux"].sum() == 25.0

The expression df.loc[2, "flux"].sum() results in 16.0 instead of 25.0. The sum of the fluxes in sp_data3[2] is indeed 25.0, but the squeeze option in get_dataframe(), which is on by default, results in removal of the values at (1, 1, 3) and (1, 1, 2) in stress period 2 (the third stress period), as these have not changed since the previous period, making the sum 25.0 - 3.0 - 6.0 = 16.0.

If you agree I will change the assertion to:

assert df.loc[2, "flux"].sum() == 16.0

Honestly I think it is better to turn the squeeze option off by default, since it is changing the size of the input data implicitly. But then we would still want a test for get_dataframe with and without squeezing.

@jtwhite79
Copy link
Contributor

That squeeze option sounds dangerous! I think if the sum of input flux is 25 then the assert is technically correct right?

@tomvansteijn
Copy link
Contributor Author

The assertion is not correct when two entries are removed in the squeeze operation. The remaining entries sum to 16.0.
I set the squeeze option to False in this PR, so the assertion does not need to change. I can always revert.

You can also consider to move the squeeze operation to a separate util function, or to remove it.

@tomvansteijn
Copy link
Contributor Author

Tests are passing so I think this is it. Let me know what you think.

@tomvansteijn
Copy link
Contributor Author

Clueless why t065 is failing now, as I did not change anything related to gridintersect. t065 is passing on my machine.

@langevin-usgs
Copy link
Contributor

@dbrakenhoff, we are getting some weird failures with this PR, and I've restarted a couple times, but they keep occurring. From what I can tell, the version of shapely has not changed since we've had a successful tests, so it's unclear what is going on here. This PR changes the behavior of util_list.py and retrieval of a pandas data frame. Any thoughts? PR failure can be seen here: https://github.com/modflowpy/flopy/pull/1033/checks?check_run_id=1856766475

======================================================================
ERROR: t065_test_gridintersect.test_rect_grid_multilinestring_in_one_cell
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/share/miniconda/envs/flopy/lib/python3.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/runner/work/flopy/flopy/autotest/t065_test_gridintersect.py", line 483, in test_rect_grid_multilinestring_in_one_cell
    [LineString([(1., 1), (9., 1.)]), LineString([(1., 9.), (9., 9.)])]))
  File "/home/runner/work/flopy/flopy/flopy/utils/gridintersect.py", line 190, in intersect
    shp, keepzerolengths
  File "/home/runner/work/flopy/flopy/flopy/utils/gridintersect.py", line 768, in _intersect_linestring_structured
    n, l, v, ixs = self._get_nodes_intersecting_linestring(ls)
  File "/home/runner/work/flopy/flopy/flopy/utils/gridintersect.py", line 960, in _get_nodes_intersecting_linestring
    x = intersect.xy[0]
  File "/usr/share/miniconda/envs/flopy/lib/python3.7/site-packages/shapely/geometry/linestring.py", line 115, in xy
    return self.coords.xy
AttributeError: 'list' object has no attribute 'xy'

======================================================================
FAIL: t065_test_gridintersect.test_rect_grid_linestring_in_2cells
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/share/miniconda/envs/flopy/lib/python3.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/runner/work/flopy/flopy/autotest/t065_test_gridintersect.py", line 435, in test_rect_grid_linestring_in_2cells
    assert len(result) == 2
AssertionError

======================================================================
FAIL: t065_test_gridintersect.test_rect_grid_linestring_on_outer_boundary
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/share/miniconda/envs/flopy/lib/python3.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/runner/work/flopy/flopy/autotest/t065_test_gridintersect.py", line 451, in test_rect_grid_linestring_on_outer_boundary
    assert len(result) == 2
AssertionError

======================================================================
FAIL: t065_test_gridintersect.test_rect_grid_linestring_on_inner_boundary
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/share/miniconda/envs/flopy/lib/python3.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/runner/work/flopy/flopy/autotest/t065_test_gridintersect.py", line 467, in test_rect_grid_linestring_on_inner_boundary
    assert len(result) == 2
AssertionError

@dbrakenhoff
Copy link
Contributor

I saw the comment and I already tried replicating the error locally but to no avail. The intersect result seems to be different for some reason. I'll try to figure out what's going on...

@dbrakenhoff
Copy link
Contributor

I don't quite understand why yet, but the error seems to be caused by the odd behavior when intersecting LineStrings with a polygon representing the bounding box . The LineStrings in these cases intersect the bounding box, but the intersection result is an empty LineString.

The snippet below illustrates the problem:

    from shapely.geometry import LineString, box

    pl = box(0, 0, 20, 20)
    shp = LineString([(5., 5.), (15., 5.)])

    shp.intersects(pl)  # returns True

    intersect = shp.intersection(pl)
    intersect.__geo_interface__  # Returns: {'type': 'LineString', 'coordinates': ()}

Also weird, I can only replicate this error (thus far) using the environment file etc/environment.yml. When I create new minimal environments (only installing shapely) to test the snippet above on my PC (Ubuntu 20.04) for Python 3.7, 3.8 and 3.9, the resulting intersect is what you would expect (not empty...).

So I'm sort of at a loss now. Anyone else have any ideas?

@mwtoews
Copy link
Contributor

mwtoews commented Feb 9, 2021

Hi all. I'm able to replicate the error using GEOS 3.9.0. This is found in a few recent commits from the development branch, e.g.

GEOS 3.9.0 (used by Shapely) is the current version, released December 2020, which uses a new overlay engine dubbed OverlayNG. The snippet that @dbrakenhoff showed above indeed looks strange. The expected behaviour is found with geos==3.8.1, so my suggestion is to pin this version while I investigate why Shapely/GEOS is providing the wrong intersection with geos==3.9.0.

@mwtoews
Copy link
Contributor

mwtoews commented Feb 9, 2021

Some good news: this issue (libgeos/geos#408) was just fixed 19 hours ago in the geos 3.9 branch (thanks @dr-jts !). The bad news that you'll need to avoid geos==3.9.0 and wait until geos==3.9.1 comes out -- and I'm not sure when this could be.

Update: a release could be expected in the next few days.

Update2: a release was announced. So it's just a matter of time for (e.g.) conda to catch-up to enable geos==3.9.1

@langevin-usgs
Copy link
Contributor

@tomvansteijn, can you merge develop into this PR and resubmit? I think the t065 test should pass now. It seems like there is still a little bit of unsettled business with this squeeze option, but turning it off by default seems like the safest idea for now. The Pandas people might consider taking a closer look at this get_dataframe() method to ensure that records and data aren't being lost, which is what it sounds like now, when the squeeze option is used.

@tomvansteijn
Copy link
Contributor Author

Getting a "rate limit exceeded" while building Modflow using pymake in the test setup..

@tomvansteijn
Copy link
Contributor Author

As for the squeeze option in the get_dataframe method. The original doc says:

squeeze : bool
            Reduce number of columns in dataframe to only include
            stress periods where a variable changes.

That is, stress periods where none of the cells have changed, in any variable, since the previous stress period, are ommited from the dataframe. For example if the stress period data of period 5 is identical to the stress period data in period 4, period 5 is omitted from the dataframe. This would not lead to a different simulation after putting the dataframe back to the package data (unless the omitted period becomes an empty block).

Moving the periods from the columns to the index led me to a different understanding of the squeeze operation: Omit any row (period, node) where the variables have not changed since the previous period. This different from the original implementation. For example the stress period data for node (1, 1, 3) in period 5 is equal to the stress period data in period 4, therefore node (1, 1, 3) is omitted from period 5 in the dataframe. I understand now that this would lead to a different result in a simulation. So this implementation is not correct.

In general I think it is best to maintain a 1-to-1 relation between the package data and the dataframe, hence my suggestion to remove the squeeze option completely. Otherwise I can correct it.

@jtwhite79
Copy link
Contributor

I agree. Squeeeze makes sense if the entire dataframe is identical between stress periods but otherwise, we lose that 1-to-1 relation.

@tomvansteijn
Copy link
Contributor Author

I have adjusted the squeeze option once more. Only fully duplicate periods are removed from the stress period data. See the additional test in t004. @langevin-usgs I have used your suggestion of adding a unique identifier for multiple entries in the same period and cell index.

@langevin-usgs langevin-usgs merged commit 1332b96 into modflowpy:develop Feb 15, 2021
@langevin-usgs
Copy link
Contributor

Thanks for the perseverance!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants