In [None]:
### DATACAMP:  Machine Learning for Time Series Data in Python:
### https://learn.datacamp.com/courses/machine-learning-for-time-series-data-in-python

In [None]:
#### Recommend this other DATACAMP course:  Manipulating Time Series Data in Python  (pandas DateTimeIndex)
#### https://learn.datacamp.com/courses/manipulating-time-series-data-in-python

In [None]:
"""  Smooth the data, to reduce noise in the data (using concept of rolling windows)"""

In [None]:
"""  Time-shifted dataset to train on """

In [None]:
"""  Cross validation iterator, unique to time series data """

In [None]:
""" Idea of stationarity and stability of data """

In [None]:
"""  tsfresh in python helps find features in time series """

In [None]:
"""  Kaggle has datasets; Quantopian has financial data and models you can play with """

In [None]:
"""Engineering multiple rolling features at once
Now that you've practiced some simple feature engineering, let's move on to something more complex. 
You'll calculate a collection of features for your time series data and visualize what they look like over time. 
This process resembles how many other time series models operate"""

In [None]:
# Define a rolling window with Pandas; closed='right' includes the right-most datapoint of each window
prices_perc_rolling = prices_perc.rolling(20, min_periods=5, closed='right')

# Define the features you'll calculate for each window
features_to_calculate = [np.min, np.max, np.mean, np.std]

# Calculate these features for your rolling window object
features = prices_perc_rolling.aggregate(features_to_calculate)

# Plot the results
ax = features.loc[:"2011-01"].plot()
prices_perc.loc[:"2011-01"].plot(ax=ax, color='k', alpha=.2, lw=3)
ax.legend(loc=(1.01, .6))
plt.show()

In [None]:
"""Percentiles and partial functions
In this exercise, you'll practice how to pre-choose arguments of a function so that you can pre-configure how it runs. 
You'll use this to calculate several percentiles of your data using the same percentile() function in numpy."""
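
As a minimal sketch of the idea, the snippet below pre-fills the q argument of np.percentile with functools.partial. The array `data` and the names `median_func` and `p90_func` are made up for illustration and are not part of the course workspace.

```python
from functools import partial

import numpy as np

# Hypothetical data: the integers 1..100
data = np.arange(1, 101)

# Pre-fill the q argument so each new function only needs the data
median_func = partial(np.percentile, q=50)
p90_func = partial(np.percentile, q=90)

# Each call is equivalent to np.percentile(data, q=...)
print(median_func(data), p90_func(data))
```

Because each partial object takes a single array argument, it can be used wherever a plain aggregation function is expected.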

In [None]:
# Import partial from functools
from functools import partial
percentiles = [1, 10, 25, 50, 75, 90, 99]

# Use a list comprehension to create a partial function for each quantile
percentile_functions = [partial(np.percentile, q=percentile) for percentile in percentiles]

# Calculate each of these quantiles on the data using a rolling window
prices_perc_rolling = prices_perc.rolling(20, min_periods=5, closed='right')
features_percentiles = prices_perc_rolling.aggregate(percentile_functions)

# Plot a subset of the result
ax = features_percentiles.loc[:"2011-01"].plot(cmap=plt.cm.viridis)
ax.legend(percentiles, loc=(1.01, .5))
plt.show()

In [None]:
"""Using "date" information
It's easy to think of timestamps as pure numbers, but don't forget they generally correspond to things 
that happen in the real world. That means there's often extra information encoded in the data 
such as "is it a weekday?" or "is it a holiday?". This information is often useful in predicting timeseries data.

In this exercise, you'll extract these date/time based features. A single time series has been loaded 
in a variable called prices_perc."""
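
A small self-contained sketch of the same idea, assuming a DataFrame with a daily DatetimeIndex. The frame `df` and its columns are hypothetical and only illustrate the attributes involved.

```python
import pandas as pd

# Hypothetical data: one week of daily values starting on Monday 2021-01-04
idx = pd.date_range("2021-01-04", periods=7, freq="D")
df = pd.DataFrame({"value": range(7)}, index=idx)

# dayofweek counts from Monday=0 to Sunday=6
df["day_of_week"] = df.index.dayofweek
# A boolean "is it a weekend?" feature derived from the day of the week
df["is_weekend"] = df.index.dayofweek >= 5
df["month"] = df.index.month

print(df)
```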

In [None]:
# Extract date features from the data, add them as columns
prices_perc['day_of_week'] = prices_perc.index.dayofweek
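# DatetimeIndex.weekofyear was removed in pandas 2.0; isocalendar().week is its replacement
prices_perc['week_of_year'] = prices_perc.index.isocalendar().week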
prices_perc['month_of_year'] = prices_perc.index.month

# Print prices_perc
print(prices_perc)

In [None]:
#############

In [None]:
"""Creating time-shifted features
In machine learning for time series, it's common to use information about previous time points 
to predict a subsequent time point.

In this exercise, you'll "shift" your raw data and visualize the results. You'll use the percent change time series 
that you calculated in the previous chapter, this time with a very short window. A short window is important because, 
in a real-world scenario, you want to predict the day-to-day fluctuations of a time series, not its change 
over a longer window of time."""
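
To make the alignment concrete, here is a tiny sketch on a made-up series (`s` and `lagged` are illustrative names). Row t of the column lag_k holds the value from k steps earlier, so the first k rows are missing.

```python
import pandas as pd

# Hypothetical series of five consecutive observations
s = pd.Series([10, 11, 12, 13, 14], name="value")

# shift(k) moves every value k rows down, leaving NaN in the first k rows
lagged = pd.DataFrame({"lag_1": s.shift(1), "lag_2": s.shift(2)})

print(pd.concat([s, lagged], axis=1))
```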

In [None]:
""" 
* Use a dictionary comprehension to create multiple time-shifted versions of prices_perc using the lags specified in shifts.
* Convert the result into a DataFrame.
* Use the given code to visualize the results. """

In [None]:
# These are the "time lags"
shifts = np.arange(1, 11).astype(int)

# Use a dictionary comprehension to create name: value pairs, one pair per shift
shifted_data = {"lag_{}_day".format(day_shift): prices_perc.shift(day_shift) for day_shift in shifts}

# Convert into a DataFrame for subsequent use
prices_perc_shifted = pd.DataFrame(shifted_data)

# Plot the first 100 samples of each
ax = prices_perc_shifted.iloc[:100].plot(cmap=plt.cm.viridis)
prices_perc.iloc[:100].plot(ax=ax, color='r', lw=2)
ax.legend(loc='best')
plt.show()

In [None]:
"""Special case: Auto-regressive models
Now that you've created time-shifted versions of a single time series, you can fit an auto-regressive model. 
This is a regression model where the input features are time-shifted versions of the output time series data. 
You are using previous values of a timeseries to predict current values of the same timeseries (thus, it is auto-regressive).

By investigating the coefficients of this model, you can explore any repetitive patterns that exist in a timeseries, 
and get an idea for how far in the past a data point is predictive of the future."""
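
The following is a rough sketch of an auto-regressive fit on synthetic data, not the course dataset. It assumes a simulated AR(1) process with coefficient 0.8, and the names `series`, `lags`, `target` and `ar_model` are hypothetical. If the approach works, the coefficient for lag_1 should come out near 0.8 and the others near zero.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

# Simulate x[t] = 0.8 * x[t-1] + noise (synthetic data, for illustration only)
rng = np.random.default_rng(0)
n = 2000
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.8 * x[t - 1] + rng.normal()
series = pd.Series(x)

# Time-shifted copies of the series are the input features;
# rows where a lag is undefined are dropped
lags = pd.DataFrame({f"lag_{k}": series.shift(k) for k in range(1, 4)}).dropna()
target = series.loc[lags.index]

# Fit a ridge regression that predicts the current value from the three lags
ar_model = Ridge(alpha=1.0).fit(lags, target)
print(dict(zip(lags.columns, ar_model.coef_.round(2))))
```

Dropping the incomplete rows is one option; the exercise below fills them with the median instead.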

In [None]:
"""
* Replace missing values in prices_perc_shifted with the median of the DataFrame and assign it to X.
* Replace missing values in prices_perc with the median of the series and assign it to y.
* Fit a regression model using the X and y arrays."""

In [None]:
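from sklearn.linear_model import Ridge

# Replace missing values with the overall median
# (np.nanmedian without an axis argument returns a single scalar, not one value per column)
X = prices_perc_shifted.fillna(np.nanmedian(prices_perc_shifted))
y = prices_perc.fillna(np.nanmedian(prices_perc))

# Fit the model
model = Ridge()
model.fit(X, y)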

In [None]:
"""Visualize regression coefficients
Now that you've fit the model, let's visualize its coefficients. This is an important part of machine learning 
because it gives you an idea for how the different features of a model affect the outcome.

The shifted time series DataFrame (prices_perc_shifted) and the regression model (model) are available in your workspace.

In this exercise, you will create a function that, given a set of coefficients and feature names, 
visualizes the coefficient values."""

In [None]:
""" Define a function (called visualize_coefficients) that takes as input an array of coefficients, 
an array of each coefficient's name, and an instance of a Matplotlib axis object. It should then generate a bar plot 
for the input coefficients, with their names on the x-axis."""

In [None]:
def visualize_coefficients(coefs, names, ax):
    # Make a bar plot for the coefficients, including their names on the x-axis
    ax.bar(names, coefs)
    ax.set(xlabel='Coefficient name', ylabel='Coefficient value')
    
    # Set formatting so it looks nice
    plt.setp(ax.get_xticklabels(), rotation=45, horizontalalignment='right')
    return ax

In [None]:
""" Use this function (visualize_coefficients()) with the coefficients contained in the model variable and 
column names of prices_perc_shifted. """

In [None]:
# Visualize the output data up to "2011-01"
fig, axs = plt.subplots(2, 1, figsize=(10, 5))
y.loc[:'2011-01'].plot(ax=axs[0])

# Run the function to visualize model's coefficients
visualize_coefficients(model.coef_, prices_perc_shifted.columns, ax=axs[1])
plt.show()

In [None]:
"""Auto-regression with a smoother time series
Now, let's re-run the same procedure using a smoother signal. You'll use the same percent change algorithm as before, 
but this time use a much larger window (40 instead of 20). As the window grows, the difference between 
neighboring timepoints gets smaller, resulting in a smoother signal. What do you think this will do 
to the auto-regressive model?

prices_perc_shifted and model (updated to use a window of 40) are available in your workspace."""
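
One way to see why a larger window gives a smoother signal is to measure how strongly neighboring points correlate. The sketch below is an assumption-laden stand-in: it smooths white noise with a rolling mean rather than the course's percent change function, and `noise` and `lag1_autocorr` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Synthetic white noise (illustrative only)
rng = np.random.default_rng(0)
noise = pd.Series(rng.normal(size=2000))

# Lag-1 autocorrelation of the rolling mean for two window sizes
lag1_autocorr = {}
for window in (20, 40):
    smooth = noise.rolling(window).mean().dropna()
    lag1_autocorr[window] = smooth.autocorr(lag=1)

print(lag1_autocorr)
```

When neighboring values are this strongly correlated, the most recent lags carry most of the predictive information, which is worth keeping in mind when reading the coefficient plot.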

In [None]:
""" Using the function (visualize_coefficients()) you created in the last exercise, generate a plot with coefficients 
of model and column names of prices_perc_shifted. """

In [None]:
# Visualize the output data up to "2011-01"
fig, axs = plt.subplots(2, 1, figsize=(10, 5))
y.loc[:'2011-01'].plot(ax=axs[0])

# Run the function to visualize model's coefficients
visualize_coefficients(model.coef_, prices_perc_shifted.columns, ax=axs[1])
plt.show()

In [None]:
#############

In [None]:
"""Time-based cross-validation
Finally, let's visualize the behavior of the time series cross-validation iterator in scikit-learn. 
Use this object to iterate through your data one last time, visualizing the training data used 
to fit the model on each iteration.

An instance of the linear regression model object is available in your workspace, along with the arrays X and y 
(the training data)."""
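
Before plotting, it can help to print the raw indices that the iterator yields. This is a minimal sketch on a made-up array of 12 samples (`X_demo`, `cv_demo` and `splits` are illustrative names): every training set ends before its test set begins, and the training set grows with each split.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical data: 12 samples in time order
X_demo = np.arange(12).reshape(-1, 1)

cv_demo = TimeSeriesSplit(n_splits=3)
splits = list(cv_demo.split(X_demo))

# Training indices always precede the test indices
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```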

In [None]:
# Import TimeSeriesSplit
from sklearn.model_selection import TimeSeriesSplit

# Create time-series cross-validation object
cv = TimeSeriesSplit(n_splits=10)

# Iterate through CV splits
fig, ax = plt.subplots()
for ii, (tr, tt) in enumerate(cv.split(X, y)):
    # Plot the training data on each iteration, to see the behavior of the CV
    ax.plot(tr, ii + y[tr])

ax.set(title='Training data on each CV iteration', ylabel='CV iteration')
plt.show()

In [None]:
#############

In [None]:
"""Bootstrapping a confidence interval
A useful tool for assessing the variability of some data is the bootstrap. In this exercise, you'll write 
your own bootstrapping function that can be used to return a bootstrapped confidence interval.

This function takes three parameters: a 2-D array of numbers (data), a list of percentiles to calculate (percentiles), 
and the number of bootstrap iterations to use (n_boots). It uses the resample function to generate a bootstrap sample, 
and then repeats this many times to calculate the confidence interval."""
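
For intuition, here is a NumPy-only sketch of the same procedure for a 1-D sample. It draws indices with a NumPy random generator instead of scikit-learn's resample, and the data and variable names are made up for illustration.

```python
import numpy as np

# Hypothetical sample of 200 observations
rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=200)

n_boots = 1000
boot_means = np.empty(n_boots)
for i in range(n_boots):
    # Draw indices with replacement, keeping the original sample size
    idx = rng.integers(0, len(sample), size=len(sample))
    boot_means[i] = sample[idx].mean()

# The 2.5th and 97.5th percentiles of the bootstrapped means bound a 95% interval
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(lo, sample.mean(), hi)
```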

In [None]:
from sklearn.utils import resample

def bootstrap_interval(data, percentiles=(2.5, 97.5), n_boots=100):
    """Bootstrap a confidence interval for the mean of columns of a 2-D dataset."""
    # Create empty array to fill the results
    bootstrap_means = np.zeros([n_boots, data.shape[-1]])
    for ii in range(n_boots):
        # Resample the rows of data *with* replacement, then take the mean of each column
        random_sample = resample(data)
        bootstrap_means[ii] = random_sample.mean(axis=0)

    # Compute the percentiles of choice for the bootstrapped means
    percentiles = np.percentile(bootstrap_means, percentiles, axis=0)
    return percentiles

In [None]:
"""Calculating variability in model coefficients
In this lesson, you'll re-run the cross-validation routine used before, but this time paying attention 
to the model's stability over time. You'll investigate the coefficients of the model, as well as the uncertainty 
in its predictions.

Begin by assessing the stability (or uncertainty) of a model's coefficients across multiple CV splits. 
Remember, the coefficients are a reflection of the pattern that your model has found in the data.

An instance of the linear regression object (model) is available in your workspace, along with the arrays X and y 
(the data)."""

In [None]:
# Iterate through CV splits
n_splits = 100
cv = TimeSeriesSplit(n_splits=n_splits)

# Create empty array to collect coefficients
coefficients = np.zeros([n_splits, X.shape[1]])

for ii, (tr, tt) in enumerate(cv.split(X, y)):
    # Fit the model on training data and collect the coefficients
    model.fit(X[tr], y[tr])
    coefficients[ii] = model.coef_

In [None]:
"""Finally, calculate the 95% confidence interval for each coefficient in coefficients using the bootstrap_interval() function 
you defined in the previous exercise. You can run bootstrap_interval? if you want a refresher on the parameters 
that this function takes."""

In [None]:
# Calculate a confidence interval around each coefficient
bootstrapped_interval = bootstrap_interval(coefficients)

# Plot it
fig, ax = plt.subplots()
ax.scatter(feature_names, bootstrapped_interval[0], marker='_', lw=3)
ax.scatter(feature_names, bootstrapped_interval[1], marker='_', lw=3)
ax.set(title='95% confidence interval for model coefficients')
plt.setp(ax.get_xticklabels(), rotation=45, horizontalalignment='right')
plt.show()

In [None]:
"""Visualizing model score variability over time
Now that you've assessed the variability of each coefficient, let's do the same for the performance (scores) of the model. 
Recall that the TimeSeriesSplit object will use successively-later indices for each test set. This means that you can treat 
the scores of your validation as a time series. You can visualize this over time in order to see how the model's performance 
changes over time.

An instance of the Linear regression model object is stored in model, a cross-validation object in cv, and data in X and y."""
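
The scorer my_pearsonr used below is supplied by the course workspace and is not defined in these notes. As a guess at what such a scorer might look like, the sketch here defines a hypothetical `pearsonr_scorer` with the (estimator, X, y) signature that cross_val_score accepts for a callable, and runs it on synthetic data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score


def pearsonr_scorer(est, X, y):
    """Hypothetical scorer: Pearson correlation between predictions and targets."""
    y_pred = np.asarray(est.predict(X)).squeeze()
    return np.corrcoef(y_pred, np.asarray(y).squeeze())[0, 1]


# Synthetic regression problem (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 2))
y_demo = X_demo @ np.array([1.0, -0.5]) + rng.normal(scale=0.1, size=300)

# One score per split; later splits are tested on later samples
demo_scores = cross_val_score(
    LinearRegression(), X_demo, y_demo,
    cv=TimeSeriesSplit(n_splits=5), scoring=pearsonr_scorer,
)
print(demo_scores)
```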

In [None]:
from sklearn.model_selection import cross_val_score

# Generate scores for each split to see how the model performs over time
scores = cross_val_score(model, X, y, cv=cv, scoring=my_pearsonr)

# Convert to a Pandas Series object
scores_series = pd.Series(scores, index=times_scores, name='score')

# Bootstrap a rolling confidence interval for the mean score
scores_lo = scores_series.rolling(20).aggregate(partial(bootstrap_interval, percentiles=2.5))
scores_hi = scores_series.rolling(20).aggregate(partial(bootstrap_interval, percentiles=97.5))

In [None]:
# Plot the results
fig, ax = plt.subplots()
scores_lo.plot(ax=ax, label="Lower confidence interval")
scores_hi.plot(ax=ax, label="Upper confidence interval")
ax.legend()
plt.show()

In [None]:
"""Accounting for non-stationarity
In this exercise, you will again visualize the variations in model scores, but now for data that changes its statistics 
over time.

An instance of the Linear regression model object is stored in model, a cross-validation object in cv, and the data 
in X and y."""

"""Create an empty DataFrame to collect the results.
Iterate through multiple window sizes, each time creating a new TimeSeriesSplit object.
Calculate the cross-validated scores (using a custom scorer we defined for you, my_pearsonr) of the model on training data."""
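
The max_train_size argument is what limits how far back each training set reaches. A small sketch on a made-up array of 100 samples (all names are illustrative): with max_train_size=10, each training set keeps only the 10 samples immediately before its test set.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical data: 100 samples in time order
X_demo = np.arange(100).reshape(-1, 1)

# Limit each training set to the 10 most recent samples
cv_window = TimeSeriesSplit(n_splits=4, max_train_size=10)
window_splits = list(cv_window.split(X_demo))
train_sizes = [len(tr) for tr, _ in window_splits]

for tr, tt in window_splits:
    print("train:", tr[0], "to", tr[-1], "| test:", tt[0], "to", tt[-1])
```

A short lookback lets the model follow data whose statistics drift over time, at the cost of fitting on fewer samples.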

In [None]:
# Pre-initialize window sizes
window_sizes = [25, 50, 75, 100]

# Create an empty DataFrame to collect the scores
all_scores = pd.DataFrame(index=times_scores)

# Generate scores for each split to see how the model performs over time
for window in window_sizes:
    # Create cross-validation object using a limited lookback window
    cv = TimeSeriesSplit(n_splits=100, max_train_size=window)
    
    # Calculate scores across all CV splits and collect them in a DataFrame
    this_scores = cross_val_score(model, X, y, cv=cv, scoring=my_pearsonr)
    all_scores['Length {}'.format(window)] = this_scores

In [None]:
# Visualize the scores
ax = all_scores.rolling(10).mean().plot(cmap=plt.cm.coolwarm)
ax.set(title='Scores for multiple windows', ylabel='Correlation (r)')
plt.show()