get_baseline_data does not partition data (using daily data set). #363

Closed
sc0ttyg opened this issue Aug 12, 2019 · 3 comments

Comments

sc0ttyg commented Aug 12, 2019

Report installed package versions

eemeter==2.7.2
pandas==0.23.4
scipy==1.3.0
numpy==1.16.4

Describe the bug
Calling get_baseline_data with max_days=365 returns the input dataframe unchanged, rather than a subset limited to 365 days.

  1. Include a short, self-contained Python snippet reproducing the problem. You can
    format the code nicely by using GitHub Flavored Markdown:

    In [1]: import eemeter

    In [2]: import pandas as pd

    In [3]: meter_data, temperature_data, metadata = \
       ...:     eemeter.load_sample('il-electricity-cdd-hdd-daily')

    In [5]: data = eemeter.create_caltrack_daily_design_matrix(meter_data, temperature_data)

    In [6]: baseline_data, warnings = eemeter.get_baseline_data(data, max_days=365)

    In [7]: baseline_data.equals(data)
    Out[7]: True

    In [8]: eemeter.get_version()
    Out[8]: '2.7.2'

    In [9]: pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 61 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: None
pip: 19.1.1
setuptools: 41.0.1
Cython: None
numpy: 1.16.4
scipy: 1.3.0
pyarrow: None
xarray: None
IPython: 7.6.1
sphinx: None
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.1.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.3.6
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

    In [10]: import scipy

    In [12]: scipy.__version__
    Out[12]: '1.3.0'

    In [13]: import numpy

    In [15]: numpy.__version__
    Out[15]: '1.16.4'

    In [16]: len(baseline_data)
    Out[16]: 810

    In [17]: len(data)
    Out[17]: 810

Expected behavior

I expected a dataframe of 365 rows covering only the first 365 days of data.


philngo (Contributor) commented Aug 13, 2019

@sc0ttyg The get_baseline_data function requires an end date to be set (this function assumes you want to go some number of days back before a project or intervention). Correct usage is the following:

eemeter.get_baseline_data(meter_data, max_days=365, end=datetime(...))

There's a bit more detail on this behavior in the docs for get_baseline_data: http://eemeter.openee.io/api.html#eemeter.get_baseline_data, though it's easy to miss. This behavior might warrant a warning - if so, I'd be happy to take a look at a pull request.

max_days (int) – The maximum length of the period. Ignored if end is not set. The stricter of this or start is used to determine the earliest allowable baseline period date.
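To make the quoted rule concrete, here is a minimal pandas-only sketch of the documented semantics. It is not eemeter's implementation: `select_baseline` is a hypothetical helper, the half-open window `[end - max_days, end)` is an assumption, and the real function also returns warnings and applies further logic.

```python
from datetime import timedelta

import pandas as pd


def select_baseline(data, end=None, max_days=365):
    """Hypothetical helper (not part of eemeter) mimicking the documented rule."""
    if end is None:
        # Without an end date, max_days is ignored and the data is returned as-is.
        return data
    earliest = end - timedelta(days=max_days)
    # Assumption: keep rows in the half-open window [earliest, end).
    mask = (data.index >= earliest) & (data.index < end)
    return data[mask]


# 810 days of dummy daily data, matching the length reported in the issue.
index = pd.date_range("2016-01-01", periods=810, freq="D", tz="UTC")
data = pd.DataFrame({"value": 1.0}, index=index)

unchanged = select_baseline(data, max_days=365)
end = pd.Timestamp("2018-01-01", tz="UTC")
trimmed = select_baseline(data, end=end, max_days=365)

print(len(unchanged), len(trimmed))
```

In this sketch the call without `end` returns all 810 rows, which mirrors the behavior reported above, while the call with `end` keeps only the 365 days leading up to it.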

Reading between the lines here, if you want to select data forward from a date (e.g., the start of your data), you can use the get_reporting_data function, which allows that operation, although it is admittedly a bit unintuitive when you're selecting down to baseline data.

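import eemeter
import pandas as pd

# Six years of dummy daily meter data on a UTC-aware DatetimeIndex.
meter_data = pd.DataFrame({'value': 1}, index=pd.date_range(start='2013-01-01', end='2019-01-01', freq='D', tz='utc'))

# Select forward from the first timestamp, keeping at most 365 days.
eemeter.get_reporting_data(meter_data, start=meter_data.index[0], max_days=365)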

If neither of these methods quite matches your use case, you can get full flexibility by selecting directly on the pandas DatetimeIndex.
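As a rough illustration of selecting directly on the index, the sketch below keeps the first 365 days of the same dummy data using plain pandas. The variable names `cutoff` and `first_year` are made up for this example, and the half-open window is a choice, not something eemeter prescribes.

```python
import pandas as pd

# Same dummy daily data as in the example above.
index = pd.date_range(start="2013-01-01", end="2019-01-01", freq="D", tz="UTC")
meter_data = pd.DataFrame({"value": 1}, index=index)

start = meter_data.index[0]
cutoff = start + pd.Timedelta(days=365)

# Boolean mask over the DatetimeIndex: keep rows in [start, cutoff).
first_year = meter_data[(meter_data.index >= start) & (meter_data.index < cutoff)]

print(len(first_year))
```

A boolean mask is used here because label slicing with `.loc[start:cutoff]` includes both endpoints, which would return one extra day.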

sc0ttyg (Author) commented Aug 13, 2019

@philngo Great, thanks for the clarification. I'll consider a pull request after I get to know the code a little better.

philngo (Contributor) commented Aug 13, 2019

@sc0ttyg Thanks for reaching out! I'm going to go ahead and close this issue. Please consider helping our developer community by filling out our first-time issue/PR contributor survey.

philngo closed this as completed Aug 13, 2019