## Using LOWESS to smooth noisy data

In [None]:
import itertools

import numpy as np
import pandas as pd
from statsmodels.nonparametric.smoothers_lowess import lowess

In [None]:
import matplotlib.pyplot as plt
from IPython.core.pylabtools import figsize
import seaborn as sns

In [None]:
sns.set_theme()
figsize(12, 6)

[This article](https://medium.economist.com/mistakes-weve-drawn-a-few-8cdd8a42d368) suggests that a smooth curve is a better way to show noisy polling data over time.

Here's their before and after:

![](https://cdn-images-1.medium.com/max/800/1*9GzHVtm4y_LeVmFCjqV3Ww.png)

And here's their data:

In [None]:
df = pd.read_csv('../data/economist_brexit.csv', header=3, parse_dates=[0])
df.rename(columns={
    '% responding right': 'right',
    '% responding wrong': 'wrong'
}, inplace=True)
# df.index = df['Date']
df.head()

In [None]:
df.tail()

Before smoothing, reshape the table from wide to long format, with one row per date and response. This is the layout seaborn expects, and the smoothed values from StatsModels will be added to it as a new column later.

In [None]:
df_long = df.melt(
    id_vars = 'Date',
    value_vars = df.columns[1:],
    value_name = 'Percentage',
    var_name = 'Response'
)
df_long
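
As a minimal sketch of what `melt` does here, the toy table below (made-up numbers, not the polling data) has the same wide layout. The name `toy_wide` is an assumption for illustration only.

```python
import pandas as pd

# Hypothetical toy data in the same wide layout as the polling table
toy_wide = pd.DataFrame({
    'Date': pd.to_datetime(['2016-08-01', '2016-08-08']),
    'right': [46, 45],
    'wrong': [42, 43],
})

# One row per (Date, Response) pair; the value columns are stacked
# one after another, in the order given in value_vars
toy_long = toy_wide.melt(
    id_vars='Date',
    value_vars=['right', 'wrong'],
    var_name='Response',
    value_name='Percentage',
)
print(toy_long)
```

All the `right` rows come first, followed by all the `wrong` rows, which matters later when the smoothed values are attached by position.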

As you can see, it's a very noisy dataset.

In [None]:
title = 'In hindsight, do you think Britain was right or wrong to vote to leave the EU?'
p = sns.lineplot(
    data = df_long,
    x = 'Date',
    y = 'Percentage',
    hue = 'Response'
);
p.set(
    xlabel = None,
    title = title
);

A scatter plot doesn't make things any clearer.

In [None]:
p = sns.scatterplot(
    data = df_long,
    x = 'Date',
    y = 'Percentage',
    hue = 'Response'
)
p.set(
    xlabel = None,
    title = title
);

To fit a [lowess](https://www.statsmodels.org/stable/generated/statsmodels.nonparametric.smoothers_lowess.lowess.html) curve we need the x values of the observed points (the date) and the y values (the % response). Note that `lowess` takes the y values first, then the x values.

In [None]:
right_smooth = lowess(df['right'], df.Date, return_sorted=False)
wrong_smooth = lowess(df['wrong'], df.Date, return_sorted=False)
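
The calls above use the default smoothing span. As a rough sketch of how the `frac` parameter changes the result, here is the same function applied to synthetic data (a noisy sine wave, which is an assumption for illustration and has nothing to do with the polling figures).

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable
x = np.linspace(0, 10, 200)      # synthetic x values (not the polling data)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)  # noisy signal

# Argument order is lowess(endog, exog): y first, then x.
# frac is the fraction of the points used for each local fit (default 2/3).
smooth_wide = lowess(y, x, frac=2/3, return_sorted=False)
smooth_narrow = lowess(y, x, frac=0.2, return_sorted=False)

# return_sorted=False returns one fitted value per input point, in input order
print(smooth_wide.shape, smooth_narrow.shape)
```

A smaller `frac` follows the data more closely, while a larger one gives a smoother curve. The right value depends on the data, so it is worth plotting a few.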

In [None]:
df_long['lowess'] = list(itertools.chain(right_smooth, wrong_smooth))

In [None]:
# Equivalent alternative using NumPy
df_long['lowess'] = np.hstack((right_smooth, wrong_smooth))
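
Both assignments above rely on `melt` having stacked every `right` row before every `wrong` row. If the long table is ever sorted or filtered, that positional match no longer holds. One possible alternative, sketched below, fits the curve separately for each group and writes the values back by index. The helper name `add_lowess`, the toy data, and the conversion of dates to days since the first date are all assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd
from statsmodels.nonparametric.smoothers_lowess import lowess


def add_lowess(long_df, x='Date', y='Percentage', group='Response', frac=2/3):
    """Return a copy of long_df with a 'lowess' column fitted per group."""
    out = long_df.copy()
    # Days since the earliest date: an explicit numeric x for the smoother
    days = (out[x] - out[x].min()).dt.total_seconds() / 86400.0
    out['lowess'] = np.nan
    for _, idx in out.groupby(group).groups.items():
        # idx holds the row labels of one group, so the fitted values
        # land on the right rows whatever the row order is
        out.loc[idx, 'lowess'] = lowess(
            out.loc[idx, y], days.loc[idx], frac=frac, return_sorted=False
        )
    return out


# Toy data (made up), deliberately interleaved so row order cannot be relied on
dates = pd.date_range('2017-01-01', periods=30, freq='W')
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    'Date': dates.repeat(2),
    'Response': ['right', 'wrong'] * 30,
    'Percentage': rng.normal(loc=44, scale=2, size=60),
})
toy = add_lowess(toy)
print(toy.head())
```

With the notebook's data, the equivalent call would be `df_long = add_lowess(df_long)`, assuming the column names used above.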

In [None]:
df_long.head()

In [None]:
df.Date.dt.strftime('%b %Y')

In [None]:
# relplot returns a FacetGrid instance
p = sns.relplot(
    kind='scatter',
    x='Date',
    y='Percentage',
    hue = 'Response',
    data=df_long,
    height=5,
    aspect=1.5
)
p.set_axis_labels('', 'Percentage');
p.map_dataframe(sns.lineplot, 'Date', 'lowess', hue='Response');

Not great. It's better to create the axes with `plt.subplots` and draw both layers on them.

In [None]:
fig, ax = plt.subplots()
sns.scatterplot(
    data=df_long,
    x='Date',
    y='Percentage',
    hue='Response',
    legend=False,
    ax=ax,
    alpha=0.5
)
sns.lineplot(
    data=df_long,
    x='Date',
    y='lowess',
    hue='Response',
    ax=ax
);
ax.set_title(title);
ax.set_xlabel('');