Upsample spline interpolation with smoothing #26309

tritemio · 2019-05-07T15:20:41Z

Code Sample, a copy-pastable example if possible

from io import StringIO
import numpy as np
import pandas as pd
from scipy import interpolate
import matplotlib.pyplot as plt

csv = StringIO("""
date_time,temperature
2019-05-05 01:30:00,38.8
2019-05-05 03:20:00,39.5
2019-05-05 04:30:00,39.3
2019-05-05 05:30:00,38.9
2019-05-05 07:55:00,38.7
2019-05-05 09:45:00,39.1
""")
df = pd.read_csv(csv, parse_dates=['date_time'], index_col=0)

s = 0.3
x = df.index.values
y = df.temperature.values
xnew = np.arange(x[0], x[-1], np.timedelta64(5, 'm'))

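# Fit and evaluate on int64 nanosecond timestamps so both calls use the same x units
tck = interpolate.splrep(x.astype('int64'), y, k=3, s=s)
ynew = interpolate.splev(xnew.astype('int64'), tck)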

df.temperature.plot(marker='o', ls='', label='data')
plt.plot(xnew, ynew, label=f'scipy cubic spline (s={s})')
(df.temperature.resample('5min')
               .interpolate(method='spline', order=3, s=s)
               .plot(label=f'pandas cubic spline (s={s})'))
plt.legend();

See the same script as a notebook:

https://gist.github.com/tritemio/aea8c0e18e9bde6984ca199b7c012ad8

Problem description

A cubic (order 3) spline with smoothing (s > 0) produces a curve that is not required to pass through the data points. SciPy's version behaves this way. The pandas version shows a smooth spline that "jumps" at each original data point in order to "pass through the data". See figure below:

Expected Output

Scipy and pandas interpolation should match.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: 4.3.0
pip: 19.0.3
setuptools: 40.8.0
Cython: None
numpy: 1.16.2
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 7.3.0
sphinx: None
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.3.2
bs4: 4.7.1
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None


gfyoung · 2019-05-08T22:48:43Z

cc @TomAugspurger

TomAugspurger · 2019-05-14T13:50:32Z

@tritemio can you edit the original post to be fully reproducible on its own? Specifically, remove the need for read_csv of an external file (you can use read_csv(StringIO(...))).

I believe that pandas always keeps "valid" (non-NaN) points present, so we really are just interpolating rather than smoothing. We could consider an alternative API for smoothers, but I'm not sure what that would look like.

tritemio · 2019-05-14T19:01:44Z

@TomAugspurger, I modified the example to be self-contained.

As you can now see more clearly, the times in the original data are all multiples of 5 minutes. When the interpolation is evaluated at a timestamp that is also present in the original data, the result should in general differ from the original value, by definition of a smoothing spline.

Since scipy does the right thing, it is strange that pandas would override some of the resulting values. It looks like pandas is adding the data points back without checking whether they already exist in the interpolated data.

TomAugspurger · 2019-05-14T19:29:42Z

Thanks. I'm not sure about the original intent of pandas' interpolate, but it's fundamentally based around filling missing values. All the current interpolate methods only update missing values; non-missing values are passed through unchanged. This use case seems worth supporting, but someone will need to design an API (either an additional keyword, or an alternative method).
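The fill-only behaviour described above can be checked on a toy series. This is a minimal sketch, assuming a numeric index and made-up NaN positions (the values are borrowed from the report, the gaps are not); it needs scipy installed for method="spline".

```python
import numpy as np
import pandas as pd

# Toy series on a default integer index; NaN marks the slots to be filled.
# (Illustrative data: the NaN positions are an assumption, not from the report.)
raw = pd.Series([38.8, np.nan, 39.5, 39.3, np.nan, 38.9, 38.7, np.nan, 39.1])

# Smoothing spline (s > 0), called the same way as in the original example.
filled = raw.interpolate(method="spline", order=3, s=0.3)

was_valid = raw.notna()
# True if every originally valid value came back unchanged.
unchanged = filled[was_valid].equals(raw[was_valid])
# True if every former NaN slot now holds a spline value.
gaps_filled = bool(filled[~was_valid].notna().all())
print(unchanged, gaps_filled)
```

If `unchanged` is true, the smoothing parameter only affected the values written into the gaps, which is consistent with the "jumps" seen at the original data points.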


tritemio · 2019-05-14T21:07:58Z

Except for spline, I don't think scipy.interpolate has other methods using "smoothing".

Instead of adding a new API, would it make sense to special-case if method='spline' and s > 0 to fill all values in the resampled axis?
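Until something like that exists, the smoothed values can be computed outside `interpolate`. The following is only a sketch of a possible workaround, not a pandas API: `smooth_upsample` is a hypothetical helper name, and it assumes a sorted DatetimeIndex and uses `scipy.interpolate.UnivariateSpline` directly.

```python
import numpy as np
import pandas as pd
from scipy.interpolate import UnivariateSpline


def smooth_upsample(series, freq, s, k=3):
    """Hypothetical helper (not part of pandas): smoothing spline on a regular grid."""
    t0 = series.index[0]
    new_index = pd.date_range(t0, series.index[-1], freq=freq)
    # Seconds since the first sample; the spline needs increasing numeric x.
    x = np.asarray((series.index - t0).total_seconds(), dtype=float)
    x_new = np.asarray((new_index - t0).total_seconds(), dtype=float)
    spline = UnivariateSpline(x, np.asarray(series, dtype=float), k=k, s=s)
    # Every grid point, including the original timestamps, gets the spline value.
    return pd.Series(spline(x_new), index=new_index, name=series.name)


raw = pd.Series(
    [38.8, 39.5, 39.3, 38.9, 38.7, 39.1],
    index=pd.to_datetime([
        "2019-05-05 01:30:00",
        "2019-05-05 03:20:00",
        "2019-05-05 04:30:00",
        "2019-05-05 05:30:00",
        "2019-05-05 07:55:00",
        "2019-05-05 09:45:00",
    ]),
    name="temperature",
)

smoothed = smooth_upsample(raw, "5min", s=0.3)
# Differences at the original timestamps; nonzero entries mean the curve was smoothed there.
residuals = smoothed.loc[raw.index] - raw
```

Unlike `resample(...).interpolate(...)`, this returns the spline value at every point of the new grid, so the original observations are not written back into the result.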

TomAugspurger · 2019-05-14T21:14:58Z

To do that, we would need to deprecate the existing behavior first, which would require a new keyword (which itself would then need to be deprecated). A keyword to control smoothing or not for existing points seems the most sensible (would probably raise for non-spline methods).


gfyoung added the Resample (resample method) and Visualization (plotting) labels May 8, 2019

TomAugspurger removed the Visualization (plotting) label May 14, 2019

mroeschke added the Enhancement label May 11, 2020

mroeschke added the Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) label Jul 2, 2021
