# Time Utilities in Pandas 

It is worth mentioning that pandas has some *amazing* utilities for dealing with timestamps. This notebook demonstrates some of them.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dates = pd.date_range("2015-01-01", "2018-01-01")
values = np.random.normal(0, 1, len(dates)).cumsum()
df = pd.DataFrame({"dates": dates, "values": values}).set_index("dates")
df.plot(figsize=(16,4));

## Easy Aggregations 

If you have a dataframe with a datetime index, you can use the `.resample()` method to perform a groupby-like aggregation over time bins of the index.

For example, we can easily calculate the mean per year by running:

In [2]:
df.resample("Y").mean()
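
To make the "groupby-like" description concrete, here is a small sketch on hypothetical data (the `demo` frame and its `value` column are made up for illustration). It suggests that a weekly `.resample()` and a `groupby` with `pd.Grouper` produce the same bins.

```python
import numpy as np
import pandas as pd

# Hypothetical four weeks of daily data, seeded for reproducibility.
rng = np.random.default_rng(0)
days = pd.date_range("2015-01-05", periods=28, freq="D")
demo = pd.DataFrame({"value": rng.normal(size=len(days))}, index=days)

# Weekly mean via resample ...
weekly_resample = demo.resample("W").mean()

# ... and the same weekly bins written as an explicit groupby.
weekly_groupby = demo.groupby(pd.Grouper(freq="W")).mean()

print(weekly_resample)
print(weekly_groupby)
```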

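You can also run the same aggregation per month (`M`), week (`W`) or quarter (`Q`). If the index also includes times, you can aggregate per hour as well. The cell below demonstrates this by calculating the mean per hour. Note that pandas 2.2 and later prefer different spellings for some of these aliases (`YE`, `ME`, `QE` and lowercase `h`) and warn about the older forms.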

In [3]:
seconds = pd.date_range("2015-01-01 00:00:00", "2015-01-02 00:00:00", freq="s")
values_s = np.random.normal(0, 1, len(seconds)).cumsum()
df_seconds = pd.DataFrame({"time": seconds, "value": values_s}).set_index("time")
df_seconds.resample("H").mean().head(6)

Note that we can also use the more general `.apply()` or `.transform()` methods here.

In [4]:
(df_seconds
 .resample("H")
 .apply(lambda d: pd.Series({
     "mean_value": d['value'].mean(), 
     "var_value": d['value'].var()
 }))
 .head())
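
When the statistics you need are built in, passing a list of names to `.agg()` is usually simpler than a custom `.apply()`. The sketch below uses a hypothetical minute-level series, not the notebook's `df_seconds`.

```python
import numpy as np
import pandas as pd

# Hypothetical three hours of minute-level data.
rng = np.random.default_rng(1)
minutes = pd.date_range("2015-01-01", periods=180, freq="min")
demo = pd.DataFrame({"value": rng.normal(size=len(minutes)).cumsum()}, index=minutes)

# One row per hour, one column per requested statistic.
hourly_stats = demo["value"].resample("h").agg(["mean", "var"])
print(hourly_stats)
```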

In [5]:
(df_seconds
 .resample("H")
 ['value']
 .transform("mean")
 .reset_index()
 .head())
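
The difference from an aggregation is that `.transform()` returns one value per original row instead of one per bin. A minimal sketch, assuming a hypothetical two-hour series, uses this to compute each point's deviation from its hourly mean.

```python
import numpy as np
import pandas as pd

# Hypothetical two hours of minute-level data.
rng = np.random.default_rng(2)
minutes = pd.date_range("2015-01-01", periods=120, freq="min")
s = pd.Series(rng.normal(size=len(minutes)).cumsum(), index=minutes, name="value")

# Same length as `s`: every row gets the mean of the hour it belongs to.
hourly_mean = s.resample("h").transform("mean")

# Deviation of each observation from its own hourly mean.
deviation = s - hourly_mean
print(deviation.head())
```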

## Time Based Features 

Any pandas column with a datetime dtype exposes a `.dt` accessor for vectorised datetime operations, much like the `.str` accessor on string columns. It is worth exploring, since the alternative (looping over the values in Python) is not vectorised and much slower.

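Below is an example of getting the ISO week of the year. Older pandas versions exposed this as `.dt.weekofyear`, which was removed in pandas 2.0; `.dt.isocalendar().week` is the replacement. Feel free to explore other properties and methods.

In [6]:
dates = pd.date_range("2015-01-01", "2018-01-01")
date_column = pd.DataFrame({"date": dates})['date']
date_column.dt.isocalendar().week.head(10)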
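
As a starting point for that exploration, the sketch below collects a few other `.dt` properties into a feature table. The column names (`day_of_week`, `is_weekend`, and so on) are chosen here purely for illustration.

```python
import pandas as pd

dates = pd.Series(pd.date_range("2015-01-01", periods=5, freq="D"), name="date")

features = pd.DataFrame({
    "date": dates,
    "day_of_week": dates.dt.dayofweek,        # Monday=0 ... Sunday=6
    "day_name": dates.dt.day_name(),
    "month": dates.dt.month,
    "iso_week": dates.dt.isocalendar().week,
    "is_weekend": dates.dt.dayofweek >= 5,
})
print(features)
```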

## Rolling and Smoothing

Sometimes you might want to create a rolling average. Pandas supports this via the `.rolling()` method, which can be called on both a dataframe and a series. It can be used to calculate multiple statistics too.

In [7]:
dates = pd.date_range("2015-01-01", "2018-01-01")
values = np.random.normal(0, 1, len(dates)).cumsum()
df = pd.DataFrame({"dates": dates, "values": values}).set_index("dates")
df.assign(rolling_mean = lambda d: d['values'].rolling(20).mean()).plot(figsize=(16, 4));
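
To illustrate computing several rolling statistics at once, here is a minimal sketch on a tiny hypothetical series. It passes a list of statistic names to `.agg()` on the rolling object.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# One column per statistic; the first two rows are NaN because
# a window of 3 is not yet full there.
stats = s.rolling(3).agg(["mean", "std", "min", "max"])
print(stats)
```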

Note that the rolling window can be centered, and it can also be defined as a time span such as a week or 30 days; pandas recognizes datetime values in the index and bases the window on them.

You can see the effect of centering below: the green line (`rolling_mean_center`) no longer lags behind the data. To achieve this, however, the window needs information from the future, so the green line is a bit naughty to use in timeseries predictions.

In [8]:
(df
 .assign(rolling_mean_d = lambda d: d['values'].rolling("30D").mean())
 .assign(rolling_mean_center = lambda d: d['values'].rolling(30, center=True).mean())
).plot(figsize=(16,4));
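
The two window options are easier to see on tiny inputs. The sketch below uses made-up series: the first part contrasts a trailing and a centered window of three points, the second contrasts a count-based window with a time-based one on an index that has a gap.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Trailing: position i averages positions i-2 .. i.
trailing = s.rolling(3).mean()
# Centered: position i averages positions i-1 .. i+1 (uses a future point).
centered = s.rolling(3, center=True).mean()

# Hypothetical series with a gap between Jan 2 and Jan 5.
gappy = pd.Series(
    [1.0, 2.0, 4.0],
    index=pd.to_datetime(["2015-01-01", "2015-01-02", "2015-01-05"]),
)
# Count-based: always the last two rows, however far apart they are.
by_count = gappy.rolling(2).sum()
# Time-based: only rows within the two days up to and including each timestamp.
by_time = gappy.rolling("2D").sum()

print(pd.DataFrame({"trailing": trailing, "centered": centered}))
print(pd.DataFrame({"by_count": by_count, "by_time": by_time}))
```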

Also note that you are not limited to the mean; you can compute other statistics as well.

In [9]:
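(df
 # np.var defaults to the population variance (ddof=0) ...
 .assign(rolling_var_d = lambda d: d['values'].rolling("30D").apply(np.var, raw=True))
 # ... while pandas' .var() defaults to the sample variance (ddof=1)
 .assign(rolling_var_center = lambda d: d['values'].rolling(30, center=True).var())
).plot(figsize=(16,4));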
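
One detail to keep in mind when mixing NumPy functions with the built-in rolling methods is that `np.var` and pandas' `.var()` use different default degrees of freedom. A small sketch on a hypothetical series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

# np.var divides by n (population variance, ddof=0).
pop_via_numpy = s.rolling(3).apply(np.var, raw=True)
# pandas' rolling .var() divides by n - 1 by default (sample variance, ddof=1).
sample_default = s.rolling(3).var()
# Passing ddof=0 makes the pandas method agree with np.var.
pop_via_pandas = s.rolling(3).var(ddof=0)

print(pd.DataFrame({
    "np.var": pop_via_numpy,
    "var()": sample_default,
    "var(ddof=0)": pop_via_pandas,
}))
```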

An alternative to calculating the rolling statistics is to smooth the timeseries exponentially with the following formula:

$$\hat{y}_t = \alpha y_t + (1-\alpha) \hat{y}_{t-1}$$

The idea is to smooth the series recursively by taking a weighted average of the current value and the previous smoothed value. If alpha is high there is little smoothing but the average responds quickly to changes; if alpha is low the result is much flatter.

In [10]:
(df
 .assign(smoothed1=lambda d: d['values'].ewm(alpha=0.01).mean())
 .assign(smoothed2=lambda d: d['values'].ewm(alpha=0.1).mean())
).plot(figsize=(16, 4));
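
The recursion can also be written out by hand. The sketch below is an illustration, not pandas' internal implementation: `exponential_smooth` is a hypothetical helper that starts from the first observation. It matches `ewm(..., adjust=False)`, whereas the default `adjust=True` used in the cell above weights the early observations differently.

```python
import numpy as np
import pandas as pd

def exponential_smooth(values, alpha):
    """Apply y_hat[t] = alpha * y[t] + (1 - alpha) * y_hat[t-1], with y_hat[0] = y[0]."""
    smoothed = np.empty(len(values))
    smoothed[0] = values[0]
    for t in range(1, len(values)):
        smoothed[t] = alpha * values[t] + (1 - alpha) * smoothed[t - 1]
    return smoothed

rng = np.random.default_rng(3)
y = rng.normal(size=50).cumsum()

manual = exponential_smooth(y, alpha=0.1)
pandas_recursive = pd.Series(y).ewm(alpha=0.1, adjust=False).mean().to_numpy()
pandas_default = pd.Series(y).ewm(alpha=0.1).mean().to_numpy()

print(np.abs(manual - pandas_recursive).max())  # difference to adjust=False
print(np.abs(manual - pandas_default).max())    # difference to the default
```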

## Fill NA

Note that these smoothing functions can be especially nice when you have missing data.

In [11]:
import random
random.seed(42)

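df_nan = (df
          .head(40)
          # keep roughly 60% of the values and replace the rest with NaN
          .assign(missing=lambda d: [v if random.random() < 0.6 else np.nan for v in d['values']])
          # smooth the column that has gaps, so the smoothed value fills in where data is missing
          .assign(smooth=lambda d: d['missing'].ewm(alpha=0.5).mean().ffill()))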

plt.figure(figsize=(16, 4))
plt.subplot(121)
plt.scatter(range(len(df_nan)), df_nan['values'])
plt.scatter(range(len(df_nan)), df_nan['missing'], c='red')
plt.title("red: observed values, blue only: missing");

plt.subplot(122)
plt.scatter(range(len(df_nan)), df_nan['values'])
plt.scatter(range(len(df_nan)), df_nan['smooth'], c='red')
plt.title("the red values are interpolated");
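
For comparison, the sketch below puts the smoothing-based fill next to two simpler options, forward fill and linear interpolation, on a tiny made-up series. Which one is appropriate depends on the data; this only shows what each produces.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])

filled = pd.DataFrame({
    "observed": s,
    "ffill": s.ffill(),                  # repeat the last observed value
    "linear": s.interpolate(),           # straight line between observed neighbours
    "ewm": s.ewm(alpha=0.5).mean(),      # exponentially weighted mean of what was observed so far
})
print(filled)
```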

## Convolutional Pandas

One interpretation of a rolling window is that it smooths the original timeseries; in other words, we are de-noising the dataset. One advanced setting worth mentioning is the window type. Setting it to "gaussian" makes the smoothing weighted, so that points further from the center of the window have less influence. Weighted window types require SciPy to be installed. Another interpretation of this method is that we apply a convolution to the timeseries.

Extra documentation on this topic can be found [here](http://pandas.pydata.org/pandas-docs/stable/user_guide/computation.html#rolling-windows) and a demo can be seen below.

In [12]:
(df
 .assign(rolling_mean_d = lambda d: (d['values']
                                     .rolling(30, center=True, win_type="gaussian")
                                     .mean(std=5)))
).plot(figsize=(16, 4));
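
The convolution view can be sketched with NumPy alone. The weights below assume the usual Gaussian window shape, `exp(-0.5 * (k / std) ** 2)`, normalised to sum to one; that this matches what `win_type="gaussian"` does internally is an assumption here, not something the sketch verifies. It does show that a weighted centered rolling mean and `np.convolve` give the same numbers.

```python
import numpy as np
import pandas as pd

window, std = 5, 1.0

# Gaussian weights centered on the middle of the window, normalised to sum to 1.
offsets = np.arange(window) - (window - 1) / 2
weights = np.exp(-0.5 * (offsets / std) ** 2)
weights /= weights.sum()

rng = np.random.default_rng(4)
y = pd.Series(rng.normal(size=30).cumsum())

# Weighted centered rolling mean, written as a dot product per window.
via_rolling = y.rolling(window, center=True).apply(lambda x: np.dot(x, weights), raw=True)

# The same operation as a convolution (the weights are symmetric, so flipping them changes nothing).
via_convolve = np.convolve(y.to_numpy(), weights, mode="valid")

print(via_rolling.dropna().head())
print(via_convolve[:5])
```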

## Expanding

A final method worth mentioning is `.expanding()`. In essence this allows you to write functions like `cumsum()` but with more customisation options.

In [13]:
(df
 .assign(cumsum=lambda d: d['values'].cumsum())
 .assign(expanding=lambda d: d.expanding()['values'].sum())
 .head(6))
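
A few more expanding statistics on a small made-up series illustrate the extra options: `expanding().max()` reproduces `cummax()`, `expanding().mean()` gives a running mean that has no `cum*` counterpart, and `min_periods` delays the output until enough observations are available.

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, 1.0, 4.0, 1.0, 5.0])

summary = pd.DataFrame({
    "value": s,
    "cummax": s.cummax(),
    "expanding_max": s.expanding().max(),
    "expanding_mean": s.expanding().mean(),
    # NaN until at least three observations have been seen
    "expanding_mean_min3": s.expanding(min_periods=3).mean(),
})
print(summary)
```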

You can write your own aggregation functions as you wish. To show what `apply()` passes to the function, we print its input below.

In [14]:
def print_and_mean(d):
    print(d)
    return np.mean(d)

df.head(4).expanding().apply(print_and_mean)
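
What the function receives depends on the `raw` argument. The sketch below uses `make_recorder`, a hypothetical helper written for this illustration, to log the type and length of each window instead of printing it.

```python
import numpy as np
import pandas as pd

def make_recorder(log):
    """Return an aggregation function that logs each window's type and length."""
    def record_and_mean(window):
        log.append((type(window).__name__, len(window)))
        return float(np.mean(window))
    return record_and_mean

s = pd.Series([1.0, 2.0, 3.0, 4.0])

log_series, log_array = [], []
# raw=False (the default): each window arrives as a pandas Series.
out_series = s.expanding().apply(make_recorder(log_series), raw=False)
# raw=True: each window arrives as a NumPy array, which is usually faster.
out_array = s.expanding().apply(make_recorder(log_array), raw=True)

print(log_series)
print(log_array)
```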