Issue with converting to dataframe #45

Closed
cheginit opened this issue Mar 20, 2019 · 11 comments

@cheginit

  • HydroFunctions version: 0.1.7
  • Python version: 3.7.2
  • Operating System: Manjaro

Description

I tried to get the streamflow data for PA but when I tried to make a dataframe I got the following error:

Shape of passed values is (368, 546), indices imply (366, 546)

It works fine for the other states; only PA fails.

What I Did

import hydrofunctions as hf
start = "2017-01-01"
end = "2017-12-31"
request = hf.NWIS(None, "dv", start, end, stateCd='PA').get_data()
request.df()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-f4a9c8304bcf> in <module>
      1 request = hf.NWIS(None, "dv", start, end, stateCd='PA').get_data()
----> 2 request.df()

~/anaconda/envs/hydro/lib/python3.7/site-packages/hydrofunctions/station.py in <lambda>()
    165         self.json = lambda: self.response.json()
    166         # set self.df without calling it.
--> 167         self.df = lambda: hf.extract_nwis_df(self.json())
    168 
    169         # Another option might be to do this:

~/anaconda/envs/hydro/lib/python3.7/site-packages/hydrofunctions/hydrofunctions.py in extract_nwis_df(nwis_dict)
    362         # except that package requires their n-dimensional structures to all be
    363         # the same datatype.
--> 364         DF = pd.concat([DF, dfa], axis=1)
    365 
    366     # replace missing values in the dataframe

~/.local/lib/python3.7/site-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    227                        verify_integrity=verify_integrity,
    228                        copy=copy, sort=sort)
--> 229     return op.get_result()
    230 
    231 

~/.local/lib/python3.7/site-packages/pandas/core/reshape/concat.py in get_result(self)
    424             new_data = concatenate_block_managers(
    425                 mgrs_indexers, self.new_axes, concat_axis=self.axis,
--> 426                 copy=self.copy)
    427             if not self.copy:
    428                 new_data._consolidate_inplace()

~/.local/lib/python3.7/site-packages/pandas/core/internals/managers.py in concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy)
   2063         blocks.append(b)
   2064 
-> 2065     return BlockManager(blocks, axes)

~/.local/lib/python3.7/site-packages/pandas/core/internals/managers.py in __init__(self, blocks, axes, do_integrity_check)
    112 
    113         if do_integrity_check:
--> 114             self._verify_integrity()
    115 
    116         self._consolidate_check()

~/.local/lib/python3.7/site-packages/pandas/core/internals/managers.py in _verify_integrity(self)
    309         for block in self.blocks:
    310             if block._verify_integrity and block.shape[1:] != mgr_shape[1:]:
--> 311                 construction_error(tot_items, block.shape[1:], self.axes)
    312         if len(self.items) != tot_items:
    313             raise AssertionError('Number of manager items must equal union of '

~/.local/lib/python3.7/site-packages/pandas/core/internals/managers.py in construction_error(tot_items, block_shape, axes, e)
   1689         raise ValueError("Empty data passed with indices specified.")
   1690     raise ValueError("Shape of passed values is {0}, indices imply {1}".format(
-> 1691         passed, implied))
   1692 
   1693 

ValueError: Shape of passed values is (368, 546), indices imply (366, 546)
@mroberge
Owner

mroberge commented Mar 20, 2019

Thanks for filing this! I'm looking into this problem now.
What I've figured out so far:

  • This is a large request! It has 546 stations in it. Good thing you didn't ask for iv values!
  • One of the stations returned some duplicate values in the time index, somehow. Instead of 366 days of data being returned, apparently 368 were returned by one of the datasets.
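A quick way to find which series carries the repeated timestamps is to check each index for duplicates. The sketch below uses synthetic data and a hypothetical helper, `find_duplicate_timestamps`; it is not part of hydrofunctions and only assumes that each station's data is available as a pandas Series with a datetime index.

```python
import pandas as pd

def find_duplicate_timestamps(series_by_name):
    """Return {name: [repeated timestamps]} for every series whose index repeats."""
    report = {}
    for name, series in series_by_name.items():
        # duplicated(keep="first") flags the second and later copies of a label.
        dupes = series.index[series.index.duplicated(keep="first")]
        if len(dupes) > 0:
            report[name] = list(dupes)
    return report

# Synthetic stand-in data: series "B" reports 2017-01-02 twice.
days = pd.date_range("2017-01-01", periods=4, freq="D")
clean = pd.Series([1.0, 2.0, 3.0, 4.0], index=days)
dirty = pd.Series([1.0, 2.0, 2.0, 3.0, 4.0], index=days.insert(2, days[1]))

report = find_duplicate_timestamps({"A": clean, "B": dirty})
print(report)
```

A series that is longer than the others (368 rows instead of 366 in this issue) is a good first candidate to inspect.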

I'm still working on this!
-Marty

@mroberge
Owner

It turns out that the request is 31.2 MB! That's without zip compression.

I added a few lines of code that check for duplicated rows and get rid of them. This request works now, but it takes forever to combine all of the data series into one table. Your error message had 546 data series in it, but it was just getting started when it choked on the bad data! The final dataframe has 2618 columns!! Many of these are for temperature readings, which get summarized with a daily max, a daily min, and a third column too.

BTW, this is a much smaller request that duplicates your problem:
request2 = hf.NWIS('03107698', "dv", start, end)
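The duplicate-removal step described above can be sketched with plain pandas. This is not the code from the actual fix; it is a minimal illustration on synthetic data, and it assumes that keeping the first copy of each repeated timestamp is acceptable. The helper name `drop_duplicate_index` is made up for this example.

```python
import pandas as pd

def drop_duplicate_index(df):
    """Keep only the first row for each repeated index label."""
    return df[~df.index.duplicated(keep="first")]

# Synthetic stand-in data: frame "b" repeats the second day.
days = pd.date_range("2017-01-01", periods=3, freq="D")
a = pd.DataFrame({"site_a": [1.0, 2.0, 3.0]}, index=days)
b = pd.DataFrame({"site_b": [5.0, 6.0, 6.0, 7.0]}, index=days.insert(2, days[1]))

# Clean every frame first, then combine them column-wise in a single call.
combined = pd.concat([drop_duplicate_index(x) for x in (a, b)], axis=1)
print(combined.shape)
```

Collecting the cleaned frames in a list and calling `pd.concat` once is usually faster than concatenating inside a loop, because each call copies the data accumulated so far. That may help with the slow combine step, though it has not been measured on this request.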

This was referenced Mar 20, 2019
@mroberge
Owner

Closed with merged pull request #47.

@mroberge
Owner

mroberge commented Mar 20, 2019

@taataam
I'm trying to think how you can get this bugfix installed. Unfortunately, my new version has changed the internals substantially, so I can't just patch my old version with the fix. I'm getting ready to release version 0.1.8, but I've got to rewrite a lot of the docstrings and the user's manual, so you probably don't want to wait a week or two for that.

You can install the new version directly from GitHub, however. Try using:
pip install git+https://github.com/mroberge/hydrofunctions.git@develop

I'm about to merge the bugfix into develop now too.

@cheginit
Author

@mroberge Thank you for your quick response and help. I will give it a try.

All the other states worked fine. I think it took about half an hour to download one year of data for all the states and save it to an HDF file. My final goal is to get the data for a period of 20 or 30 years.

@mroberge
Owner

@taataam So you are trying to download all of the data from all of the states for the past 20 to 30 years?
That is a lot!!!

One thing you can do is to limit your requests to only the discharge data. You probably don't want the temperature or chemistry data, for example.

Also, you might want to reconsider getting all of the data locally. Why not use the internet as your hard drive, and request the data at the moment you need it? For example, if you wanted to calculate a flow duration chart for every station, you could download all of the data for one station, create your chart, and then move on to the next station.
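The one-station-at-a-time pattern can be sketched as follows. `fetch_discharge` is a hypothetical stand-in that returns synthetic data so the example runs offline; in practice it would be replaced by a single-site request like the ones shown in this thread. The exceedance formula uses the Weibull plotting position, rank / (n + 1), which is one common choice and an assumption here.

```python
import numpy as np
import pandas as pd

def flow_duration(discharge):
    """Return discharge sorted high to low with its exceedance probability (%)."""
    q = discharge.dropna().sort_values(ascending=False).to_numpy()
    rank = np.arange(1, len(q) + 1)
    exceedance = 100.0 * rank / (len(q) + 1)  # Weibull plotting position
    return pd.DataFrame({"discharge": q, "exceedance_pct": exceedance})

def fetch_discharge(site_id):
    """Hypothetical stand-in for a per-station request; returns synthetic data."""
    rng = np.random.default_rng(int(site_id))
    days = pd.date_range("2017-01-01", "2017-12-31", freq="D")
    return pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=len(days)), index=days)

summaries = {}
for site_id in ["01", "02", "03"]:  # placeholder IDs, not real stations
    fdc = flow_duration(fetch_discharge(site_id))
    # Store only a small summary (flow exceeded 50% of the time);
    # the full series is discarded before the next station is fetched.
    summaries[site_id] = float(np.interp(50.0, fdc["exceedance_pct"], fdc["discharge"]))

print(summaries)
```

Only one station's data is in memory at a time, so the memory footprint stays roughly constant no matter how many stations are processed.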

If you include all of the EPA chemistry data, there are over a million data collection sites!!!

@cheginit
Author

@mroberge I think I read somewhere in your documentation that by default it downloads only the discharge data. In the final data that I got with my code, there were only two columns other than the date: discharge and the qualification. So do I have to explicitly give the data type in the request line?

The reason that I download it locally is exactly because of the large amount of computation that I am planning to do with the data. The saved files act as checkpoints, so if something goes wrong somewhere in the code, whether a bug or a hardware issue (especially on a cluster), I don't have to do everything from the beginning.
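A minimal sketch of this checkpoint idea, assuming one file per state: `load_or_fetch` and `fake_fetch` are hypothetical names, the download is replaced by a stub, and pickle files stand in for HDF so the example needs nothing beyond pandas.

```python
from pathlib import Path
import tempfile

import pandas as pd

def load_or_fetch(key, fetch, cache_dir):
    """Return the cached DataFrame for `key`, fetching and saving it only once."""
    path = Path(cache_dir) / f"{key}.pkl"
    if path.exists():
        return pd.read_pickle(path)
    df = fetch(key)
    tmp = path.with_name(path.name + ".tmp")
    df.to_pickle(tmp)
    # Rename last, so a crash mid-write never leaves a truncated checkpoint.
    tmp.replace(path)
    return df

calls = []

def fake_fetch(state):
    """Hypothetical stand-in for a slow download; records each real fetch."""
    calls.append(state)
    return pd.DataFrame({"discharge": [1.0, 2.0]})

cache_dir = tempfile.mkdtemp()
for state in ["PA", "MD", "PA"]:
    frame = load_or_fetch(state, fake_fetch, cache_dir)

print(calls)
```

After a crash, rerunning the same loop skips every state that already has a file on disk and resumes with the first missing one.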

@mroberge
Owner

In the new versions, the software will request every variable that gets measured at a site unless you specify which parameter you want. So, for example, if you only want discharge, then you can do this:

my_PA_discharge = hf.NWIS(service='dv', parameterCd='00060', stateCd='pa')

I'm sorry that the User's Guide is in such a woeful state! The docstrings do a much better job of explaining the parameters, and I've kept them up to date better. You can access them in IPython by typing ?func_name or using the help() function, like this: help(hf.NWIS).

I haven't been updating the User's Guide much lately because the code has been going through some major changes. Now that I've merged everything into my develop branch, I'm going to be working on the documentation before releasing version 0.1.8. I may even make this 0.2.0, but we'll see.

Please feel free to contact me by email too.

-Marty

@cheginit
Author

Thanks for the tip. Then, I will check the help for now. The library is very useful, thanks for the time and effort.

@mroberge
Owner

mroberge commented Mar 22, 2019 via email

@cheginit
Author

Thank you. Sure, would be happy to contribute as much as I can.
