# Time Series Intro

Time series get their own field of study because some data streams depend heavily on the flow of time. Of course, we have not totally ignored time as a feature up to this point. The selling price of a house probably *does* have some relation to the season or the year, since real estate markets grow and decline with the wider economy. But time is surely not the most important predictor of house price: square footage is likely to be more strongly correlated with price than date of sale is.

But there are other sorts of data that more readily lend themselves to a temporal analysis. One canonical example is numbers from a stock exchange: First, data from stock tickers often arrive as numbers anchored to consecutive units of time. I get the selling price for some stock on January 1, say, and the next bit of information I gain will be the selling price for that stock on January 2. (We'll explore this feature of time series below.) Second, and more important, if I'm interested in actually *predicting* the selling price of a stock for, say, tomorrow, then very likely one piece of very salient (i.e. *correlated*) information would be the selling price of that stock *today*.
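To make the "today predicts tomorrow" idea concrete, here is a minimal sketch on made-up data. The "price" is a simulated random walk, not real stock data, and the names `price` and `noise` are hypothetical. It compares the lag-1 autocorrelation of the walk with that of pure noise.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical "price": a random walk, so each value is yesterday's value plus a small shock
price = pd.Series(100 + rng.normal(0, 1, 500).cumsum())

# Pure noise for comparison: each value is drawn independently of the one before it
noise = pd.Series(rng.normal(0, 1, 500))

# Correlation between each value and the value one step earlier
price_lag1 = price.autocorr(lag=1)
noise_lag1 = noise.autocorr(lag=1)

print(f"lag-1 autocorrelation, random walk: {price_lag1:.3f}")
print(f"lag-1 autocorrelation, pure noise:  {noise_lag1:.3f}")
```

The walk should show a lag-1 autocorrelation close to 1, while the noise should sit near 0. That gap is what makes the previous value such a useful predictor for one series and useless for the other.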

What other examples of this sort of time-dependent data can you think of?

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

In [None]:
pd.read_csv('data/google-trends_game-of-thrones_us.csv').head()

In [None]:
# Define a function that will help us load and
# clean up a dataset.

def load_trend(trend_name='football', country_code='us'):
    df = pd.read_csv('data/google-trends_'
                     + trend_name + '_'
                     + country_code
                     + '.csv').iloc[1:, :]
    df.columns = ['counts']
    df['counts'] = df['counts'].str.replace('<1', '0').astype(int)
    return df

In [None]:
df = load_trend(**{'trend_name': 'data-science', 'country_code': 'us'})

In [None]:
df

In [None]:
trends = [
    {'trend_name': 'data-science', 'country_code': 'us'},
    {'trend_name': 'football', 'country_code': 'us'},
    {'trend_name': 'football', 'country_code': 'uk'},
    {'trend_name': 'game-of-thrones', 'country_code': 'us'},
    {'trend_name': 'pokemon', 'country_code': 'us'},
    {'trend_name': 'taxes', 'country_code': 'us'},   
]

In [None]:
np.random.shuffle(trends)

In [None]:
trend_dfs = [load_trend(**trend) for trend in trends]

In [None]:
# Let's see if we can guess which is which just by looking
# at their graphs.

plt.style.use('ggplot')

fig, axs = plt.subplots(len(trend_dfs), 1, figsize=(8, 10))
plt.tight_layout()
for i, trend_df in enumerate(trend_dfs):
    ax = axs[i]
    #ax.set_title(str(trends[i]))
    ax.plot(np.array(trend_df.index), trend_df['counts'])
    ticks = ax.get_xticks()
    ax.set_ylim((0, 100))
    ax.set_xticks([tick for tick in ticks if tick%24 == 0])

We could do a histogram of our data, say of the taxes counts:

In [None]:
taxes_df = load_trend('taxes')

In [None]:
plt.hist(taxes_df['counts']);

But a histogram throws away the order of the observations, so we would clearly be missing something important about how the data is structured. Let's try to capture that. We'll stick with the taxes data.

In [None]:
# Adding a running time index i and a month column (0-11)

taxes_df['i'] = np.arange(len(taxes_df))
taxes_df['month'] = taxes_df['i'] % 12

In [None]:
# Fitting a straight-line trend on the time index i (month is not used yet)

trend_model = LinearRegression()
trend_model.fit(taxes_df[['i']], taxes_df['counts'])
trend_line = trend_model.predict(taxes_df[['i']])

In [None]:
trend_line[:12]

In [None]:
plt.plot(taxes_df['i'], taxes_df['counts'])
plt.plot(taxes_df['i'], trend_line);

Clearly, this model leaves something to be desired! A straight line in `i` cannot capture the yearly cycle. Let's try again, and this time we'll also give the model the month of each observation.

In [None]:
month_encoder = OneHotEncoder(categories='auto')
month_encoder.fit(taxes_df[['month']])
month_data = month_encoder.transform(taxes_df[['month']]).toarray()

In [None]:
month_data[0]

In [None]:
lr = LinearRegression()

In [None]:
data = np.hstack((taxes_df[['i']].values, month_data))

In [None]:
data[0]

In [None]:
lr.fit(data, taxes_df['counts'])
lr_pred = lr.predict(data)  # Predictive model based on i and month

In [None]:
lr_pred[:12]

In [None]:
trend_df = taxes_df
fig, ax = plt.subplots(figsize=(8, 4))
ax.set_title('Taxes')
ax.plot(trend_df['i'], trend_df['counts'], label='Data',
       linewidth=.5, alpha=.8)
ax.plot(trend_df['i'], trend_line, label='Trend')
ax.plot(trend_df['i'], lr_pred, label='Regression', linestyle="dotted")
plt.legend()
ticks = ax.get_xticks()
ax.set_ylim((0, 100))
ax.set_xticks([tick for tick in ticks if tick%24 == 0])
plt.show()

In [None]:
residuals = trend_df['counts'] - lr_pred

fig, ax = plt.subplots(figsize=(8, 4))
ax.set_title("Residuals")
ax.plot(trend_df['i'], trend_df['counts'], label='Data',
       linewidth=.5, alpha=.8)
ax.plot(trend_df['i'], lr_pred, label='Regression', linestyle="dotted")
ax.plot(trend_df['i'], residuals,
        label='Residuals', linewidth=.5)

#ax.plot(trend_df.index, trend_line, label='trend')
plt.legend()
ticks = ax.get_xticks()
ax.set_ylim((-10, 90))
ax.set_xticks([tick for tick in ticks if tick%24 == 0])
plt.show()

## Datetime Objects

These provide a standard way of dealing with dates and times in Python. The standard library has a `datetime` [module](https://docs.python.org/3/library/datetime.html), and `pandas` provides its own `Timestamp` type as well as a `to_datetime()` function.

In [None]:
import datetime

In [None]:
datetime.datetime(2020, 12, 31)

### Datetime objects have the parts of a date as attributes

In [None]:
now = datetime.datetime(2020, 12, 31)

In [None]:
now.year

In [None]:
now.month

In [None]:
now.day

In [None]:
now.hour

### `timedelta`

In [None]:
moment = datetime.timedelta(minutes=100)
moment.days

In [None]:
moment.seconds
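A short sketch of how a `timedelta` stores its value and how it combines with `datetime` objects. The variable names `next_day` and `gap` are only for illustration.

```python
import datetime

moment = datetime.timedelta(minutes=100)

# A timedelta is stored as whole days plus leftover seconds (and microseconds)
print(moment.days)             # 100 minutes is less than one day
print(moment.seconds)          # the remainder, expressed in seconds
print(moment.total_seconds())  # the full duration as a single float

# Adding a timedelta to a datetime gives a new datetime
next_day = datetime.datetime(2020, 12, 31) + datetime.timedelta(days=1)
print(next_day)

# Subtracting two datetimes gives a timedelta
gap = datetime.datetime(2021, 1, 1) - datetime.datetime(2020, 12, 31, 22, 20)
print(gap == moment)
```

Note that `.seconds` is only the part left over after whole days are removed, so `total_seconds()` is usually the safer choice when you need the full duration.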

In [None]:
# A bare integer is read as nanoseconds since 1970-01-01 (the Unix epoch)
pd.to_datetime(100)

In [None]:
pd.to_datetime('2020-12-31')

In [None]:
pd.to_datetime(['2020-12-30', '2020-12-31'], format='%Y-%m-%d')

There are [many](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes) format codes that `datetime` supports.
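As a small illustration of format codes, the sketch below parses day-first date strings, which are easy to misread without an explicit format. The sample strings and variable names are made up for the example.

```python
import datetime
import pandas as pd

# strptime: string -> datetime, following the given format codes
parsed = datetime.datetime.strptime("31/12/2020 23:59", "%d/%m/%Y %H:%M")
print(parsed)

# strftime: datetime -> string, using the same family of codes
label = parsed.strftime("%Y-%m-%d")
print(label)

# pandas accepts the same codes through the format argument
stamps = pd.to_datetime(["31/12/2020", "01/01/2021"], format="%d/%m/%Y")
print(stamps)
```

Passing `format` explicitly removes any ambiguity about whether `01/02/2021` means January 2 or February 1.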

## Upsampling and Downsampling

Sometimes we have information recorded at one level of the temporal hierarchy but we'd like to have it at a different level. For example, we have monthly data but we want to look at quarters, or we have daily data but we want to visualize hourly trends.

**To upsample** is to *increase the frequency* of the data of interest. <br/>
**To downsample** is to *decrease the frequency* of the data of interest.

There is a `.resample()` method available for `pandas` objects:

In [None]:
times = np.arange(10)
np.random.seed(42)
target = np.random.random(size=10)

In [None]:
rev_times = pd.to_datetime(times, format='%M')
rev_times

In [None]:
df = pd.DataFrame({'target': target}, index=rev_times)
df

### Upsampling with `.resample()`

In [None]:
df.resample('S')

#### `.ffill()`

In [None]:
df_up = df.resample('S').ffill()

In [None]:
df_up.head()

#### `.bfill()`

In [None]:
df_up = df.resample('S').bfill()
df_up.head()

#### `.interpolate()`

In [None]:
df_up = df.resample('S').interpolate()
df_up.head()

In [None]:
# Average change per second between the first two original observations
(df_up.iloc[60] - df_up.iloc[0]) / 60

In [None]:
[df_up.iloc[j+1,].values - df_up.iloc[j,].values for j in range(59)][:10]
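To see the three fill strategies side by side, here is a minimal sketch on a two-point series. The series `s` and the table `compare` are made up for this comparison and are not part of the data above.

```python
import pandas as pd

# Two observations three minutes apart
s = pd.Series(
    [0.0, 3.0],
    index=pd.to_datetime(["2020-01-01 00:00", "2020-01-01 00:03"]),
)

# Upsample to one row per minute, filling the two new rows three different ways
compare = pd.DataFrame({
    "ffill": s.resample("min").ffill(),              # carry the last known value forward
    "bfill": s.resample("min").bfill(),              # pull the next known value backward
    "interpolate": s.resample("min").interpolate(),  # straight line between known values
})
print(compare)
```

Forward and backward fill produce step functions, while the default linear interpolation changes by the same amount at every step, which is what the differences computed above show for `df_up`.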

### Downsampling with `.resample()`

In [None]:
df.resample('h')

In [None]:
df_down = df.resample('h').ffill()

In [None]:
df_down
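When downsampling, many rows collapse into one, so it is common to aggregate rather than fill. The sketch below is one way to do that, using a made-up minute-level series; the choice of a 5-minute window and of `mean` and `last` as summaries is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Ten made-up observations, one per minute, with values 0 through 9
idx = pd.date_range("2020-01-01 00:00", periods=10, freq="min")
s = pd.Series(np.arange(10, dtype=float), index=idx)

# Downsample to 5-minute bins, summarizing each bin in two different ways
five_min_mean = s.resample("5min").mean()  # average of the five values in each bin
five_min_last = s.resample("5min").last()  # final value in each bin

print(five_min_mean)
print(five_min_last)
```

Which aggregation is appropriate depends on the quantity: counts are often summed, prices are often summarized by their last value, and measurements such as temperature are often averaged.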

[Here](https://machinelearningmastery.com/resample-interpolate-time-series-data-python/) is a helpful post on this sort of resampling in Python.

## Decomposing a Time Series

Statsmodels has a great tool for looking at a time series as a sum of parts: its `seasonal_decompose()` function. It splits a series into a general trend, a seasonal component, and whatever is left over, which is often called the residual. (Why?)
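In symbols, the additive view (the default model in `seasonal_decompose()`) can be written as follows. The notation is chosen here for illustration and is not taken from the statsmodels documentation:

$$y_t = T_t + S_t + R_t$$

where $y_t$ is the observed value at time $t$, $T_t$ is the trend, $S_t$ is the seasonal component, and $R_t$ is the residual. Rearranged, $R_t = y_t - T_t - S_t$: the residual is what remains after the trend and the seasonality have been subtracted from the data.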

In [None]:
taxes_df.index = pd.to_datetime(taxes_df.index)

In [None]:
from statsmodels.tsa.seasonal import seasonal_decompose
decomposition = seasonal_decompose(taxes_df['counts'])

observed = decomposition.observed
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid

In [None]:
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

In [None]:
plt.figure(figsize=(12,8))
plt.subplot(411)
plt.plot(observed, label='Original', color="blue")
plt.legend(loc='upper left')
plt.subplot(412)
plt.plot(trend, label='Trend', color="blue")
plt.legend(loc='upper left')
plt.subplot(413)
plt.plot(seasonal,label='Seasonality', color="blue")
plt.legend(loc='upper left')
plt.subplot(414)
plt.plot(residual, label='Residuals', color="blue")
plt.legend(loc='upper left')
plt.tight_layout()

Exercise: check that the residual really captures *all* of the remaining information in our time series.

For various technical reasons that won't concern us here, some of the components of the decomposition have NaNs at their heads and tails. But we can just use `np.nansum()`.

In [None]:
trend.head()

In [None]:
# NaN-aware sum of (observed - trend - seasonal - residual);
# a result of (approximately) zero means nothing is left over.
myst = np.nansum(taxes_df['counts'] - trend - seasonal - residual)
myst
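For intuition about where the three parts come from, here is a simplified, pandas-only sketch of an additive decomposition on synthetic data. It is not the exact algorithm statsmodels uses; the 13-month centered window, the synthetic series, and all variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: upward trend + yearly cycle + noise
rng = np.random.default_rng(0)
idx = pd.date_range("2015-01-01", periods=72, freq="MS")
t = np.arange(72)
y = pd.Series(0.3 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 1, 72),
              index=idx)

# Trend: centered moving average. The first and last 6 points lack a full window.
trend_est = y.rolling(window=13, center=True).mean()

# Seasonality: the average detrended value for each calendar month
detrended = y - trend_est
seasonal_est = detrended.groupby(detrended.index.month).transform("mean")

# Residual: whatever is left after removing trend and seasonality
resid_est = y - trend_est - seasonal_est

# Wherever all three parts are defined, they add back up to the data
recon = trend_est + seasonal_est + resid_est
mask = recon.notna()
print(trend_est.isna().sum())
print(np.allclose(recon[mask], y[mask]))
```

In this sketch the missing values at both ends of the trend come from the centered moving average, which needs a full window of neighbors on each side.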

# Extra Resources for Time Series Manipulation

- [Aileen Nielsen SciPy - 2016](https://www.youtube.com/watch?v=JNfxr4BQrLk)

- [Aileen Nielsen - Github](https://github.com/AileenNielsen/TimeSeriesAnalysisWithPython)

- [A project on Flight Fares](https://achyutjoshi.github.io/btp/datacollection)

- [Python DS Handbook - Working with TS](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html)