Issue with converting to dataframe #45

cheginit · 2019-03-20T05:02:31Z

HydroFunctions version: 0.1.7
Python version: 3.7.2
Operating System: Manjaro

Description

I tried to get the streamflow data for PA but when I tried to make a dataframe I got the following error:

Shape of passed values is (368, 546), indices imply (366, 546)

It works fine for other states; only PA fails.

What I Did

import hydrofunctions as hf
start = "2017-01-01"
end = "2017-12-31"
request = hf.NWIS(None, "dv", start, end, stateCd='PA').get_data()
request.df()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-f4a9c8304bcf> in <module>
      1 request = hf.NWIS(None, "dv", start, end, stateCd='PA').get_data()
----> 2 request.df()

~/anaconda/envs/hydro/lib/python3.7/site-packages/hydrofunctions/station.py in <lambda>()
    165         self.json = lambda: self.response.json()
    166         # set self.df without calling it.
--> 167         self.df = lambda: hf.extract_nwis_df(self.json())
    168 
    169         # Another option might be to do this:

~/anaconda/envs/hydro/lib/python3.7/site-packages/hydrofunctions/hydrofunctions.py in extract_nwis_df(nwis_dict)
    362         # except that package requires their n-dimensional structures to all be
    363         # the same datatype.
--> 364         DF = pd.concat([DF, dfa], axis=1)
    365 
    366     # replace missing values in the dataframe

~/.local/lib/python3.7/site-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    227                        verify_integrity=verify_integrity,
    228                        copy=copy, sort=sort)
--> 229     return op.get_result()
    230 
    231 

~/.local/lib/python3.7/site-packages/pandas/core/reshape/concat.py in get_result(self)
    424             new_data = concatenate_block_managers(
    425                 mgrs_indexers, self.new_axes, concat_axis=self.axis,
--> 426                 copy=self.copy)
    427             if not self.copy:
    428                 new_data._consolidate_inplace()

~/.local/lib/python3.7/site-packages/pandas/core/internals/managers.py in concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy)
   2063         blocks.append(b)
   2064 
-> 2065     return BlockManager(blocks, axes)

~/.local/lib/python3.7/site-packages/pandas/core/internals/managers.py in __init__(self, blocks, axes, do_integrity_check)
    112 
    113         if do_integrity_check:
--> 114             self._verify_integrity()
    115 
    116         self._consolidate_check()

~/.local/lib/python3.7/site-packages/pandas/core/internals/managers.py in _verify_integrity(self)
    309         for block in self.blocks:
    310             if block._verify_integrity and block.shape[1:] != mgr_shape[1:]:
--> 311                 construction_error(tot_items, block.shape[1:], self.axes)
    312         if len(self.items) != tot_items:
    313             raise AssertionError('Number of manager items must equal union of '

~/.local/lib/python3.7/site-packages/pandas/core/internals/managers.py in construction_error(tot_items, block_shape, axes, e)
   1689         raise ValueError("Empty data passed with indices specified.")
   1690     raise ValueError("Shape of passed values is {0}, indices imply {1}".format(
-> 1691         passed, implied))
   1692 
   1693 

ValueError: Shape of passed values is (368, 546), indices imply (366, 546)


mroberge · 2019-03-20T15:09:08Z

Thanks for filing this! I'm looking into this problem now.
What I've figured out so far:

This is a large request! It has 546 stations in it. Good thing you didn't ask for iv values!
One of the stations returned some duplicate values in the time index, somehow. Instead of 366 days of data being returned, apparently 368 were returned by one of the datasets.
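
A minimal sketch of how duplicate timestamps like these might be spotted with pandas. The series below is made-up illustration data, not the actual PA response, and the variable names are hypothetical:

```python
import pandas as pd

# Hypothetical daily series with one timestamp repeated, standing in for
# the station that returned extra rows.
index = pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-02", "2017-01-03"])
series = pd.Series([10.0, 11.0, 11.0, 12.0], index=index, name="discharge")

# is_unique is False as soon as any timestamp appears more than once.
has_duplicates = not series.index.is_unique

# duplicated(keep=False) flags every row that shares its timestamp
# with at least one other row.
duplicate_rows = series[series.index.duplicated(keep=False)]

print(has_duplicates)
print(duplicate_rows)
```

Running a check like this on each station's series before combining them should point to the dataset that carries the extra rows.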

I'm still working on this!
-Marty

mroberge · 2019-03-20T18:07:48Z

It turns out that the request is 31.2 MB! That's without zip compression.

I added a few lines of code that check for duplicated rows and get rid of them. This request works now, but it takes forever to combine all of the data series into one table. Your error message had 546 data series in it, but it was just getting started when it choked on the bad data! The final dataframe has 2618 columns!! Many of these are for temperature readings, which get summarized with a daily max, a daily min, and a third column too.
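
One way such a duplicate check could look is sketched below. This is an assumption for illustration, not the code from the actual bugfix; `drop_duplicate_timestamps` and the station frames are hypothetical names and data:

```python
import pandas as pd


def drop_duplicate_timestamps(frame):
    """Keep the first row for each timestamp and discard later repeats."""
    return frame[~frame.index.duplicated(keep="first")]


clean_index = pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-03"])
dupe_index = pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-02", "2017-01-03"])

# Made-up data: station_b repeats 2017-01-02, like the problem station.
station_a = pd.DataFrame({"station_a": [1.0, 2.0, 3.0]}, index=clean_index)
station_b = pd.DataFrame({"station_b": [4.0, 5.0, 5.0, 6.0]}, index=dupe_index)

# Deduplicate each piece first, then join them column-wise on the date index.
pieces = [drop_duplicate_timestamps(df) for df in (station_a, station_b)]
combined = pd.concat(pieces, axis=1)

print(combined.shape)
```

Collecting the pieces in a list and calling `pd.concat` once, rather than concatenating inside a loop, avoids re-copying the growing table on every step, which may help with the slow combine described above.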

BTW, this is a much smaller request that duplicates your problem:
request2 = hf.NWIS('03107698', "dv", start, end)

mroberge · 2019-03-20T19:20:03Z

closed with merged pull request #47.

mroberge · 2019-03-20T19:27:29Z

@taataam
I'm trying to think how you can get this bugfix installed. Unfortunately, my new version has changed the internals substantially, so I can't just patch my old version with the fix. I'm getting ready to release version 0.1.8, but I've got to rewrite a lot of the docstrings and the user's manual, so you probably don't want to wait a week or two for that.

However, you can install the new version directly from GitHub. Try using:
pip install git+https://github.com/mroberge/hydrofunctions.git@develop

I'm about to merge the bugfix into develop now too.

cheginit · 2019-03-20T19:49:54Z

@mroberge Thank you for your quick response and help. I will give it a try.

All the other states worked fine. I think it took about half an hour for one year of data for all the states to be downloaded and saved to an HDF file. My final goal is to get the data for a period of 20 or 30 years.

mroberge · 2019-03-20T20:11:47Z

@taataam So you are trying to download all of the data from all of the states for the past 20 to 30 years?
That is a lot!!!

One thing you can do is to limit your requests to only the discharge data. You probably don't want the temperature or chemistry data, for example.

Also, you might want to reconsider getting all of the data locally. Why not use the internet as your hard drive, and request the data at the moment you need it? For example, if you wanted to calculate a flow duration chart for every station, you could download all of the data for one station, create your chart, and then move on to the next station.
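
The one-station-at-a-time workflow could be sketched roughly as follows. The `fetch_discharge` callable is a hypothetical stand-in for whatever downloads a station's daily discharge values, and the Weibull plotting position (rank / (n + 1)) is one common choice for a flow duration curve, assumed here for illustration:

```python
def flow_duration_curve(discharges):
    """Return (exceedance_percent, discharge) pairs, highest flow first.

    Uses the Weibull plotting position, rank / (n + 1).
    """
    ordered = sorted(discharges, reverse=True)
    n = len(ordered)
    return [(100.0 * rank / (n + 1), q) for rank, q in enumerate(ordered, start=1)]


def process_stations(station_ids, fetch_discharge):
    """Fetch one station at a time and keep only its (small) curve."""
    curves = {}
    for station_id in station_ids:
        # fetch_discharge is a hypothetical callable supplied by the caller.
        discharges = fetch_discharge(station_id)
        curves[station_id] = flow_duration_curve(discharges)
    return curves


# Made-up data standing in for a real download.
fake_data = {"A": [3.0, 1.0, 2.0]}
curves = process_stations(["A"], fake_data.__getitem__)
print(curves["A"])
```

Only the summary for each station stays in memory, so the raw data for all stations never has to be held or stored at once.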

If you include all of the EPA chemistry data, there are over a million data collection sites!!!

cheginit · 2019-03-20T20:20:58Z

@mroberge I think I read somewhere in your documentation that by default it downloads only the discharge data. In the final data that I got with my code, there were only two columns other than the date: discharge and the qualification. So do I have to explicitly give the data type in the request line?

The reason that I download it locally is exactly because of the large amount of computation that I am planning to do with the data. The local files act as checkpoints, so if something goes wrong somewhere in the code, whether a bug or a hardware issue (especially on a cluster), I don't have to do everything from the beginning.
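
A minimal sketch of this kind of checkpointing, assuming each step's result can be pickled. `load_or_compute` is a hypothetical helper, and pickle is used here only to keep the example dependency-free (the thread itself mentions HDF files):

```python
import pickle
from pathlib import Path


def load_or_compute(checkpoint_path, compute):
    """Return the cached result if the checkpoint exists; otherwise compute and save it."""
    path = Path(checkpoint_path)
    if path.exists():
        with path.open("rb") as handle:
            return pickle.load(handle)

    result = compute()

    # Write to a temporary file first so an interrupted run never leaves
    # a half-written checkpoint behind.
    tmp_path = path.with_name(path.name + ".tmp")
    with tmp_path.open("wb") as handle:
        pickle.dump(result, handle)
    tmp_path.replace(path)
    return result
```

Wrapping each state's download in a call like `load_or_compute("PA_2017.pkl", download_pa)`, where `download_pa` is whatever function performs the request, would let a restarted job skip the states that already finished.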

mroberge · 2019-03-21T19:05:12Z

In the new versions, the software will request every variable that gets measured at a site unless you specify which parameter you want. So, for example, if you only want discharge, then you can do this:

my_PA_discharge = hf.NWIS(service='dv', parameterCd='00060', stateCd='pa')

I'm sorry that the User's Guide is in such a woeful state! The docstrings do a much better job of explaining the parameters, and I've kept them up to date better. You can access them in IPython by typing ?func_name or using the help() function, like this: help(hf.NWIS).

I haven't been updating the User's Guide much lately because the code has been going through some major changes. Now that I've merged everything into my develop branch, I'm going to be working on the documentation before releasing version 0.1.8. I may even make this 0.2.0, but we'll see.

Please feel free to contact me by email too.

-Marty

cheginit · 2019-03-22T15:44:09Z

Thanks for the tip. Then, I will check the help for now. The library is very useful, thanks for the time and effort.

mroberge · 2019-03-22T15:58:42Z

My pleasure! Please let me know if there are any features that you think should be included. And of course, I would love to have you contribute some code or a test or a change to the documentation! It looks good for a project to have multiple contributors, and it helps me feel like the software is useful to someone!

…

-Marty

Martin Roberge · Professor, Geography and Environmental Planning <http://www.towson.edu/cla/departments/geography/> · Towson University <http://www.towson.edu/> · 8000 York Road · Towson, Maryland, 21252-0001 · p. 410-704-5011


cheginit · 2019-03-23T23:13:43Z

Thank you. Sure, would be happy to contribute as much as I can.

This was referenced Mar 20, 2019

Bugfix dupe records #46

Closed

Bugfix dupe records #47

Merged

mroberge closed this as completed Mar 20, 2019
