Upsample spline interpolation with smoothing #26309

Open
tritemio opened this issue May 7, 2019 · 6 comments
Labels
Enhancement, Missing-data, Resample

Comments

@tritemio

tritemio commented May 7, 2019

Code Sample, a copy-pastable example if possible

from io import StringIO
import numpy as np
import pandas as pd
from scipy import interpolate
import matplotlib.pyplot as plt

csv = StringIO("""
date_time,temperature
2019-05-05 01:30:00,38.8
2019-05-05 03:20:00,39.5
2019-05-05 04:30:00,39.3
2019-05-05 05:30:00,38.9
2019-05-05 07:55:00,38.7
2019-05-05 09:45:00,39.1
""")
df = pd.read_csv(csv, parse_dates=['date_time'], index_col=0)

s = 0.3
x = df.index.values
y = df.temperature.values
xnew = np.arange(x[0], x[-1], np.timedelta64(5, 'm'))

tck = interpolate.splrep(x, y, k=3, s=s)
ynew = interpolate.splev(xnew.view(int), tck)

df.temperature.plot(marker='o', ls='', label='data')
plt.plot(xnew, ynew, label=f'scipy cubic spline (s={s})')
(df.temperature.resample('5min')
               .interpolate(method='spline', order=3, s=s)
               .plot(label=f'pandas cubic spline (s={s})'))
plt.legend();

See the same script as a notebook:

Problem description

Spline interpolation of order 3 with smoothing (s > 0) should give a curve that does not pass exactly through the data points. Scipy's version shows this behaviour. Pandas's version shows a smooth spline that "jumps" at each data point in order to "pass through the data". See figure below:

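[Figure: the data points, the scipy cubic spline (s=0.3), and the pandas cubic spline (s=0.3) plotted together]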

Expected Output

Scipy and pandas interpolation should match.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: 4.3.0
pip: 19.0.3
setuptools: 40.8.0
Cython: None
numpy: 1.16.2
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 7.3.0
sphinx: None
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.3.2
bs4: 4.7.1
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

gfyoung added the Resample and Visualization labels May 8, 2019
@gfyoung
Member

gfyoung commented May 8, 2019

cc @TomAugspurger

@TomAugspurger
Contributor

@tritemio can you edit the original post to be fully reproducible on its own? Specifically, remove the need for read_csv of an external file (you can use read_csv(StringIO(...))).

I believe that pandas always keeps "valid" (non-NaN) points present, so we really are just interpolating rather than smoothing. We could consider an alternative API for smoothers, but I'm not sure what that would look like.
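The "valid points are kept" behaviour can be seen in a minimal sketch. It uses linear interpolation so that scipy is not needed, and the series below is illustrative rather than the data from the report:

```python
import numpy as np
import pandas as pd

# A regular 5-minute grid where only some slots hold measurements.
idx = pd.date_range("2019-05-05 01:30", periods=7, freq="5min")
ser = pd.Series([38.8, np.nan, np.nan, 39.5, np.nan, np.nan, 39.3], index=idx)

# interpolate() fills the NaN slots only.
filled = ser.interpolate(method="linear")

# Positions that held data before interpolation are unchanged afterwards.
observed = ser.notna()
unchanged = filled[observed].equals(ser[observed])
print(unchanged)
```

Under this model a smoothing parameter can only influence the filled-in gaps, which would be consistent with the jumps at the data points shown in the report.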

TomAugspurger removed the Visualization label May 14, 2019
@tritemio
Author

@TomAugspurger, I modified the example to be self-contained.

As you can now see more clearly, the timestamps in the original data are all multiples of 5 minutes. When you evaluate the interpolation at exactly the same timestamps as the original data, the result should differ from the original data, by definition of a smoothing spline.
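A small check of that claim, as a sketch using scipy only. The x values are the timestamps from the example converted to minutes since the first sample (an assumption made here to keep the arrays plain floats):

```python
import numpy as np
from scipy import interpolate

# Minutes elapsed since 01:30 for each sample in the example data.
x = np.array([0.0, 110.0, 180.0, 240.0, 385.0, 495.0])
y = np.array([38.8, 39.5, 39.3, 38.9, 38.7, 39.1])

# s=0 forces the spline through every point; s>0 relaxes that constraint.
tck_interp = interpolate.splrep(x, y, k=3, s=0)
tck_smooth = interpolate.splrep(x, y, k=3, s=0.3)

# Residuals of each fit evaluated at the original sample positions.
resid_interp = interpolate.splev(x, tck_interp) - y
resid_smooth = interpolate.splev(x, tck_smooth) - y

print(np.abs(resid_interp).max())  # effectively zero
print(np.abs(resid_smooth).max())  # clearly non-zero
```

With s > 0, splrep only requires the sum of squared residuals to stay within s, so the fitted curve is free to miss the individual samples.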

Since scipy does the right thing, it is strange that pandas would override some of the result values. It looks like pandas is adding the data points back without checking whether they already exist in the interpolated data.

@TomAugspurger
Contributor

TomAugspurger commented May 14, 2019 via email

@tritemio
Author

Except for spline, I don't think scipy.interpolate has other methods using "smoothing".

Instead of adding a new API, would it make sense to special-case method='spline' with s > 0 so that all values in the resampled axis are filled?
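Until something like that exists, the fill-everything behaviour can be approximated outside pandas. The sketch below is only an illustration: `smooth_resample` is a hypothetical helper name, not a pandas API, and it assumes a sorted DatetimeIndex without duplicates.

```python
import numpy as np
import pandas as pd
from scipy import interpolate


def smooth_resample(series, rule, s, k=3):
    """Hypothetical helper (not part of pandas): fit a smoothing spline
    with scipy and evaluate it at every timestamp of the resampled grid."""
    # Regular target grid spanning the original data.
    new_index = pd.date_range(series.index[0], series.index[-1], freq=rule)
    # Work in float seconds relative to the first sample.
    origin = series.index[0]
    x = (series.index - origin).total_seconds().to_numpy()
    xnew = (new_index - origin).total_seconds().to_numpy()
    tck = interpolate.splrep(x, series.to_numpy(dtype=float), k=k, s=s)
    # Every grid point takes the spline value, including original timestamps.
    return pd.Series(interpolate.splev(xnew, tck), index=new_index,
                     name=series.name)


index = pd.to_datetime([
    "2019-05-05 01:30:00", "2019-05-05 03:20:00", "2019-05-05 04:30:00",
    "2019-05-05 05:30:00", "2019-05-05 07:55:00", "2019-05-05 09:45:00",
])
temperature = pd.Series([38.8, 39.5, 39.3, 38.9, 38.7, 39.1],
                        index=index, name="temperature")
smoothed = smooth_resample(temperature, "5min", s=0.3)
```

Here the values at the original timestamps come from the spline rather than from the data, which is the behaviour the scipy curve in the report shows.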

@TomAugspurger
Contributor

TomAugspurger commented May 14, 2019 via email

mroeschke added the Missing-data label Jul 2, 2021

4 participants