In [3]:
# prompt: let's move the downloaded file to Drive

import shutil
import os

# Replace 'downloaded_file.zip' with the actual name of your downloaded file
downloaded_file = 'data'

# Replace '/content/drive/My Drive/my_folder' with your desired destination folder in Google Drive
destination_folder = '/content/drive/My Drive/timeseries'

# Create the destination folder if it doesn't exist
if not os.path.exists(destination_folder):
  os.makedirs(destination_folder)

# Construct the full destination path
destination_path = os.path.join(destination_folder, downloaded_file)

# Move the file
try:
  shutil.move(downloaded_file, destination_path)
  print(f"File '{downloaded_file}' moved to '{destination_path}' successfully.")
except FileNotFoundError:
  print(f"Error: File '{downloaded_file}' not found.")
except Exception as e:
  print(f"An error occurred: {e}")


File 'data' moved to '/content/drive/My Drive/timeseries/data' successfully.


In [6]:
!git clone https://github.com/PacktPublishing/Modern-Time-Series-Forecasting-with-Python-2E.git

Cloning into 'Modern-Time-Series-Forecasting-with-Python-2E'...
remote: Enumerating objects: 836, done.
remote: Counting objects: 100% (237/237), done.
remote: Compressing objects: 100% (141/141), done.
remote: Total 836 (delta 131), reused 150 (delta 96), pack-reused 599 (from 1)
Receiving objects: 100% (836/836), 101.37 MiB | 19.04 MiB/s, done.
Resolving deltas: 100% (434/434), done.
Updating files: 100% (155/155), done.


**What is a time series?**

To keep it simple, a time series is a set of observations taken sequentially in time. The focus is on the word time. If we keep taking the same observation at different points in time, we will get a time series. For example, if you keep recording the number of bars of chocolate you have in a month, you’ll end up with a time series of your chocolate consumption. Suppose you are recording your weight at the beginning of every month. You get another time series of your weight. Is there any relation between the two time series?


**Types of time series**

There are two types of time series data based on time intervals, as outlined here:

**Regular time series:** This is the most common type of time series, where we have observations coming in at regular intervals of time, such as every hour or every month. For example, if we take a time series of temperature in a city, we will get the time series in a regular interval (whichever frequency we choose for observation).

**Irregular time series:** There are a few time series where we do not have observations at regular intervals of time. For example, consider we have a sequence of readings from lab tests of a patient. We see an observation in the time series only when the patient heads to the clinic and carries out the lab test, and this may not happen at regular intervals.
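
As a rough illustration of the difference, the sketch below checks whether the gaps between consecutive timestamps are all equal. The helper `is_regular` and the example timestamps are hypothetical and only meant to mirror the temperature and lab-test examples above.

```python
import numpy as np

def is_regular(timestamps):
    """Return True when all gaps between consecutive timestamps are equal."""
    gaps = np.diff(np.asarray(timestamps, dtype="datetime64[s]"))
    return bool(np.all(gaps == gaps[0]))

# Hourly temperature readings: one observation every hour (regular)
hourly = np.arange("2024-01-01T00", "2024-01-01T06", dtype="datetime64[h]")

# Made-up lab-test dates for one patient: the gaps vary (irregular)
lab_visits = np.array(
    ["2024-01-03", "2024-02-17", "2024-02-20", "2024-06-01"],
    dtype="datetime64[D]",
)

print("hourly readings regular:", is_regular(hourly))
print("lab visits regular:", is_regular(lab_visits))
```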


**Data-generating process (DGP)**

We have seen that time series data is a collection of observations made sequentially along the time dimension. Any time series is, in turn, generated by some kind of mechanism. For example, time series data of daily shipments of your favorite chocolate from the manufacturing plant is affected by a lot of factors, such as the time of the year (holiday season, for example), the availability of cocoa, the uptime of the machines working on the plant, and so on. In statistics, this underlying process that generates the time series is referred to as the DGP. Time series data is produced by stochastic and deterministic processes. The deterministic processes involve quantities that evolve in a predictable manner over time. An example of this is the radioactive decay of an element, where the remaining quantity diminishes according to a precise mathematical formula, leading to a consistent reduction over time. But most of the interesting time series (from a forecasting perspective) are generated by a stochastic process. A stochastic process is a way to describe how things change over time in a random but somewhat predictable manner, like how the weather changes daily with some patterns and probabilities involved. So, let’s discuss more about time series generated from stochastic processes.
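
A minimal sketch of the two kinds of process, using plain NumPy. The decay constant, starting quantity, and the choice of a random walk as the stochastic example are illustrative assumptions, not values from the text.

```python
import numpy as np

steps = np.arange(50)

# Deterministic process: exponential decay N(t) = N0 * exp(-lam * t).
# N0 and lam are illustrative values.
N0, lam = 100.0, 0.1
decay = N0 * np.exp(-lam * steps)

# Stochastic process: a random walk, where each step adds a random shock.
def random_walk(n_steps, seed):
    rng = np.random.default_rng(seed)
    return np.cumsum(rng.normal(size=n_steps))

walk_a = random_walk(len(steps), seed=1)
walk_b = random_walk(len(steps), seed=2)

# The deterministic path is fully fixed by its formula, while two
# realizations of the stochastic process follow different paths.
print("decay is strictly decreasing:", bool(np.all(np.diff(decay) < 0)))
print("two random walks identical:", bool(np.allclose(walk_a, walk_b)))
```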


If we had complete and perfect knowledge of reality, all we would need to do would be to put this DGP together in a mathematical form and you would get the most accurate forecast possible. But sadly, nobody has complete and perfect knowledge of reality. So, what we try to do is approximate the DGP, mathematically, as much as possible so that our imitation of the DGP gives us the best possible forecast (or any other output we want from the analysis). This imitation is called a model that provides a useful approximation of the DGP.
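
To make the distinction between a DGP and a model concrete, here is a small sketch. It assumes a made-up DGP (a linear trend plus Gaussian noise) that we pretend not to know, and fits a straight line with `np.polyfit` as the model. The fitted coefficients come close to the true ones but do not match them exactly.

```python
import numpy as np

rng = np.random.default_rng(42)
t = np.arange(200)

# The "true" DGP (illustrative values): intercept 2.0, slope 0.5, unit-variance noise
true_intercept, true_slope = 2.0, 0.5
observed = true_intercept + true_slope * t + rng.normal(scale=1.0, size=t.size)

# The model: a straight line fitted to the observations by least squares.
# np.polyfit returns coefficients from the highest degree down.
est_slope, est_intercept = np.polyfit(t, observed, deg=1)

print(f"true slope {true_slope}, estimated slope {est_slope:.3f}")
print(f"true intercept {true_intercept}, estimated intercept {est_intercept:.3f}")
```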

But we must remember that the model is not the DGP, but a representation of some essential aspects of reality. For example, let’s consider an aerial view of London and a map of London.

[pic here]

The map of London is certainly useful—we can use it to go from point A to point B. But a map of London is not the same as a photo of London. It doesn’t showcase the bustling nightlife or the insufferable traffic. A map is just a model that represents some useful features of a location, such as roads and places. The following diagram might help us internalize the concept and remember it:


Naturally, the next question would be this: **Do we have a useful model?** Every model has limitations and challenges. As we have seen, a map of London does not perfectly represent London. But if our purpose is to navigate London, then a map is a very useful model. What if we want to understand the culture? A map doesn’t give you a flavor of that. So, now, the same model that was useful is utterly useless in the new context.

Different kinds of models are required in different situations and for different objectives. For example, the best model for forecasting may not be the same as the best model for making a causal inference.

We can use the concept of DGPs to generate multiple synthetic time series of varying degrees of complexity.


**White and red noise**

An extreme case of a stochastic process that generates a time series is a white noise process. It is a sequence of uncorrelated random numbers with zero mean and constant variance. This is also one of the most popular assumptions of noise in a time series.

Let’s see how we can generate such a time series and plot it:

In [1]:
# !pip install --upgrade pip
%pip install git+https://github.com/TimeSynth/TimeSynth.git

Collecting git+https://github.com/TimeSynth/TimeSynth.git
  Cloning https://github.com/TimeSynth/TimeSynth.git to /tmp/pip-req-build-z4u2o9o6
  Running command git clone --filter=blob:none --quiet https://github.com/TimeSynth/TimeSynth.git /tmp/pip-req-build-z4u2o9o6
  Resolved https://github.com/TimeSynth/TimeSynth.git to commit e50cdb9015d415adf46a4eae161a087c5c378564
  Preparing metadata (setup.py) ... done
Collecting symengine>=0.4 (from timesynth==0.2.4)
  Downloading symengine-0.14.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.3 kB)
Collecting jitcdde==1.4 (from timesynth==0.2.4)
  Downloading jitcdde-1.4.0.tar.gz (136 kB)
  Preparing metadata (setup.py) ... done
Collecting jitcxde_common==1.4.1 (from timesynth==0.2.4)
  Downloading jitcxde_common-1.4.1.tar.gz (22 kB)
  Preparing metadata (setup.py) ...

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import os
import plotly.express as px
import plotly.io as pio

pio.templates.default = "plotly_white"
import timesynth as ts
import pandas as pd
np.random.seed()

def plot_time_series(time, values, label, legends=None):
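    if isinstance(values, list):
        # Fall back to generic names when no legends are given
        if legends is None:
            legends = [f"Series {i + 1}" for i in range(len(values))]
        assert len(legends) == len(values)
        series_dict = {"Time": time}
        for v, l in zip(values, legends):
            series_dict[l] = v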
        plot_df = pd.DataFrame(series_dict)
        plot_df = pd.melt(plot_df,id_vars="Time",var_name="ts", value_name="Value")
    else:
        series_dict = {"Time": time, "Value": values, "ts":""}
        plot_df = pd.DataFrame(series_dict)

    if isinstance(values, list):
        fig = px.line(plot_df, x="Time", y="Value", line_dash="ts")
    else:
        fig = px.line(plot_df, x="Time", y="Value")
    fig.update_layout(
        autosize=False,
        width=900,
        height=500,
        # Font sizes are nested under each title; the older `titlefont`
        # shortcut is deprecated in recent Plotly releases
        title={
            'text': label,
            'x': 0.5,
            'xanchor': 'center',
            'yanchor': 'top',
            'font': {'size': 25},
        },
        yaxis=dict(title=dict(text="Value", font=dict(size=12))),
        xaxis=dict(title=dict(text="Time", font=dict(size=12))),
    )
    return fig


def generate_timeseries(signal, noise=None):
    time_sampler = ts.TimeSampler(stop_time=20)
    regular_time_samples = time_sampler.sample_regular_time(num_points=100)
    timeseries = ts.TimeSeries(signal_generator=signal, noise_generator=noise)
    samples, signals, errors = timeseries.sample(regular_time_samples)
    return samples, regular_time_samples, signals, errors

In [14]:
import numpy as np
import matplotlib.pyplot as plt
time = np.arange(200)

values = np.random.randn(200) * 100
plot_time_series(time, values, "White Noise")
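
The defining properties of white noise can also be checked numerically. The sketch below is an illustration with a fixed seed and a larger sample than the plot uses. It estimates the mean, the standard deviation, and the lag-1 autocorrelation, which should be close to 0, 100, and 0 respectively. The helper `lag1_autocorr` is a hypothetical name introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
white = rng.normal(loc=0.0, scale=100.0, size=10_000)

def lag1_autocorr(x):
    """Sample correlation between x[t] and x[t-1]."""
    return np.corrcoef(x[:-1], x[1:])[0, 1]

print(f"mean: {white.mean():.2f}")
print(f"standard deviation: {white.std():.2f}")
print(f"lag-1 autocorrelation: {lag1_autocorr(white):.3f}")
```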

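Red noise, on the other hand, has zero mean and constant variance but is serially correlated in time. This serial correlation, or redness, is parameterized by a correlation coefficient $r$, such that:

$$x_t = r \cdot x_{t-1} + \sqrt{1 - r^2} \cdot w_t$$

where $w_t$ is a random sample from a white noise distribution.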

In [13]:
r = 0.4

# Generate the time axis
time = np.arange(200)
# Generate white noise
white_noise = np.random.randn(200)*100
# Create Red Noise by introducing correlation between subsequent values in the white noise
values = np.zeros(200)
for i, v in enumerate(white_noise):
    if i==0:
        values[i] = v
    else:
        values[i] = r*values[i-1]+ np.sqrt((1-np.power(r,2))) *v
plot_time_series(time, values, "Red Noise")
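
As a sanity check on the recursion used above, the following sketch generates a longer red noise series with a fixed seed and confirms that the lag-1 autocorrelation is close to `r` and that the scaling by `sqrt(1 - r**2)` keeps the standard deviation close to that of the underlying white noise. The function name `red_noise` and its parameters are assumptions made for this illustration.

```python
import numpy as np

def red_noise(n, r, scale=1.0, seed=0):
    """Red noise: x[i] = r * x[i-1] + sqrt(1 - r^2) * w[i], with w white noise."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=scale, size=n)
    x = np.zeros(n)
    x[0] = w[0]
    for i in range(1, n):
        x[i] = r * x[i - 1] + np.sqrt(1 - r**2) * w[i]
    return x

red = red_noise(n=20_000, r=0.4, scale=100.0)
lag1 = np.corrcoef(red[:-1], red[1:])[0, 1]

print(f"lag-1 autocorrelation: {lag1:.3f}")
print(f"standard deviation: {red.std():.1f}")
```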

**Cyclical or seasonal signals**

Among the most common signals you see in time series are seasonal or cyclical signals. You can introduce seasonality into your generated series in a few ways.

In [16]:
# Sinusoidal signal with amplitude=1.5 and frequency=0.25
signal_1 = ts.signals.Sinusoidal(amplitude=1.5, frequency=0.25)
# Sinusoidal signal with amplitude=1 and frequency=0.5
signal_2 = ts.signals.Sinusoidal(amplitude=1, frequency=0.5)
# Generating the time series
samples_1, regular_time_samples, signals_1, errors_1 = generate_timeseries(signal=signal_1)
samples_2, regular_time_samples, signals_2, errors_2 = generate_timeseries(signal=signal_2)

In [18]:
plot_time_series(regular_time_samples,
                 [samples_1, samples_2],
                 "Sinusoidal Waves",
                 legends=["Amplitude = 1.5 | Frequency = 0.25", "Amplitude = 1 | Frequency = 0.5"])
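
For readers who want to see what amplitude and frequency do without TimeSynth, here is a plain NumPy sketch. It assumes the standard sinusoid form `amplitude * sin(2 * pi * frequency * t)` and is not TimeSynth's own implementation. With this form, one full cycle takes `1 / frequency` time units, so the two waves repeat every 4 and every 2 time units.

```python
import numpy as np

def sinusoid(t, amplitude, frequency):
    """Standard sinusoid: amplitude * sin(2 * pi * frequency * t)."""
    return amplitude * np.sin(2 * np.pi * frequency * t)

# 100 evenly spaced points between 0 and 20, similar to the series above
t_axis = np.linspace(0, 20, 100)
wave_1 = sinusoid(t_axis, amplitude=1.5, frequency=0.25)
wave_2 = sinusoid(t_axis, amplitude=1.0, frequency=0.5)

print("period of wave 1:", 1 / 0.25)
print("period of wave 2:", 1 / 0.5)
print("peak of wave 1 at t=1:", sinusoid(1.0, 1.5, 0.25))
```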

In [3]:
# PseudoPeriodic signal with Amplitude=1 & Frequency=0.25
signal = ts.signals.PseudoPeriodic(amplitude=1, frequency=0.25)
#Generating Timeseries
samples, regular_time_samples, signals, errors = generate_timeseries(signal=signal)
plot_time_series(regular_time_samples,
                 samples,
                 "Pseudo Periodic")

**Autoregressive signals**

Another very popular signal in the real world is an autoregressive (AR) signal. We will go into this in more detail in Chapter 4, Setting a Strong Baseline Forecast, but for now, an AR signal refers to when the value of a time series for the current timestep is dependent on the values of the time series in the previous timesteps. This serial correlation is a key property of the AR signal, and it is parametrized by a few parameters, outlined as follows:

- Order of serial correlation: the number of previous timesteps the signal depends on
- Coefficients to combine the previous timesteps

In [12]:
%cd /content/Modern-Time-Series-Forecasting-with-Python-2E
# We have re-implemented the class in src because of a bug in TimeSynth
from src.synthetic_ts.autoregressive import AutoRegressive
# Autoregressive signal with parameters 1.5 and -0.75
# y(t) = 1.5*y(t-1) - 0.75*y(t-2)
signal= AutoRegressive(ar_param=[1.5, -0.75])
#Generate Timeseries
samples, regular_time_samples, signals, errors = generate_timeseries(signal=signal)
plot_time_series(regular_time_samples,
                 samples,
                 "Auto Regressive")

/content/Modern-Time-Series-Forecasting-with-Python-2E
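
The recursion in the comment above can also be written out directly. The sketch below is a plain NumPy illustration, not the `AutoRegressive` class from the repository. It assumes zero initial values and Gaussian innovations with a hypothetical `noise_std` parameter, and it adds a common stationarity check: the roots of the characteristic polynomial must lie inside the unit circle.

```python
import numpy as np

def simulate_ar(ar_params, n_steps, noise_std=1.0, seed=0):
    """Simulate y(t) = a1*y(t-1) + ... + ap*y(t-p) + noise, starting from zeros."""
    rng = np.random.default_rng(seed)
    p = len(ar_params)
    y = np.zeros(n_steps + p)  # the first p entries are the zero initial values
    for t in range(p, n_steps + p):
        # y[t - p:t][::-1] is [y(t-1), y(t-2), ..., y(t-p)]
        y[t] = np.dot(ar_params, y[t - p:t][::-1]) + rng.normal(scale=noise_std)
    return y[p:]

def is_stationary(ar_params):
    """Check that all roots of z^p - a1*z^(p-1) - ... - ap have modulus below 1."""
    roots = np.roots(np.r_[1.0, -np.asarray(ar_params, dtype=float)])
    return bool(np.all(np.abs(roots) < 1))

ar_series = simulate_ar([1.5, -0.75], n_steps=100)
print("AR(2) with [1.5, -0.75] stationary:", is_stationary([1.5, -0.75]))
print("AR(1) with [1.0] (random walk) stationary:", is_stationary([1.0]))
```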


**Mix and match**

There are many more components that you can use to create your DGP and thereby generate a time series, but let’s quickly look at how we can combine the components we have already seen to generate a realistic time series.



In [13]:
# Generating a pseudo-periodic signal with Gaussian (white) noise
pseudo_samples, regular_time_samples, _, _ = generate_timeseries(
    signal=ts.signals.PseudoPeriodic(amplitude=1, frequency=0.25),
    noise=ts.noise.GaussianNoise(std=0.3),
)
# Generating an autoregressive signal
ar_samples, regular_time_samples, _, _ = generate_timeseries(
    signal=AutoRegressive(ar_param=[1.5, -0.75])
)
# Combining the two signals using a mathematical equation.
# The result is named combined_ts (not ts) so the timesynth alias is not overwritten.
combined_ts = pseudo_samples * 2 + ar_samples
plot_time_series(regular_time_samples,
                 combined_ts,
                 "Pseudo Periodic with AutoRegression and White Noise")

**Stationary and non-stationary time series**

In time series, stationarity is of great significance and is a key assumption in many modeling approaches. Ironically, many (if not most) real-world time series are non-stationary. So, let’s understand what a stationary time series is from a layman’s point of view.

There are multiple ways to look at stationarity, but one of the clearest and most intuitive ways is to think of the probability distribution or the data distribution of a time series. We call a time series stationary when the probability distribution remains the same at every point in time. In other words, if you pick different windows in time, the data distribution across all those windows should be the same.

A Gaussian distribution is defined by two parameters: the mean and the variance. So, there are two ways the stationarity assumption can be broken, as outlined here:

**Change in mean over time**

**Change in variance over time**

Let’s look at these two cases in detail and understand them better.

**Change in mean over time**

This is the most popular way a non-stationary time series presents itself. If there is an upward/downward trend in the time series, the mean across two windows of time would not be the same.

Another way non-stationarity manifests itself is in the form of seasonality. Suppose we are looking at the time series of average temperature measurements per month for the last 5 years. From our experience, we know that temperature peaks during summer and falls in winter. So, when we take the mean temperature of winter and the mean temperature of summer, they will be different.
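
A change in mean can be checked numerically by comparing windows. The sketch below is a plain NumPy illustration with assumed parameters (a sinusoid, Gaussian noise with standard deviation 0.3, and a linear trend with slope 0.4). It splits the series into two halves and compares their means.

```python
import numpy as np

rng = np.random.default_rng(0)
t_points = np.linspace(0, 20, 100)

# Seasonal component + white noise + upward linear trend
seasonal = np.sin(2 * np.pi * 0.25 * t_points)
noise = rng.normal(scale=0.3, size=t_points.size)
trended = seasonal + noise + 0.4 * t_points

# Compare the mean of the first window with the mean of the second window
first_half, second_half = np.array_split(trended, 2)
print(f"mean of first window: {first_half.mean():.2f}")
print(f"mean of second window: {second_half.mean():.2f}")
```

Because of the trend, the second window has a clearly higher mean than the first, which is exactly the violation of stationarity described above.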

In [17]:
import timesynth as ts # reimporting to correct the overwritten name
# Sinusoidal Signal with Amplitude=1 & Frequency=0.25
signal=ts.signals.Sinusoidal(amplitude=1, frequency=0.25)
# White Noise with standard deviation = 0.3
noise=ts.noise.GaussianNoise(std=0.3)
# Generate the time series
sinusoidal_samples, regular_time_samples, _, _ = generate_timeseries(signal=signal, noise=noise)
# Regular_time_samples is a linear increasing time axis and can be used as a trend
trend = regular_time_samples*0.4
# Combining the signal and trend
ts_new = sinusoidal_samples+trend # changing to ts_new
plot_time_series(regular_time_samples,
                 ts_new,  # using ts_new variable for plotting
                 "Sinusoidal with Trend and White Noise")

**Change in variance over time**

Non-stationarity can also present itself in the fluctuating variance of a time series. If the time series starts off with low variance and as time progresses, the variance keeps getting bigger and bigger, we have a non-stationary time series. In statistics, there is a scary name for this phenomenon—heteroscedasticity. The Air Passengers dataset, which is the “iris dataset” of time series (the most popular, over-used, and useless) is a classic example of a heteroscedastic time series. Let’s look at the plot:
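
The same effect can be reproduced on synthetic data. The sketch below does not use the Air Passengers dataset. It is an illustration with made-up parameters in which the noise standard deviation grows linearly over time, and the spread is then compared across four consecutive windows.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 1_000

# Noise standard deviation grows linearly from 1 to 10 (illustrative values)
std_over_time = np.linspace(1.0, 10.0, n_points)
hetero = rng.normal(size=n_points) * std_over_time

# Standard deviation within four consecutive windows of 250 points each
window_stds = [window.std() for window in np.array_split(hetero, 4)]
for i, s in enumerate(window_stds, start=1):
    print(f"window {i}: std = {s:.2f}")
```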

**Forecasting terminology**
There are a few terms that will help you understand this book as well as other literature on time series. These terms are described in more detail here:

**Forecasting**
Forecasting is the prediction of future values of a time series using the known past values of the time series and/or some other related variables. This is very similar to prediction in ML, where we use a model to predict unseen data.

**Multivariate forecasting**
Multivariate time series consist of more than one time series variable, where each variable depends not only on its own past values but also on the other variables. For example, a set of macroeconomic indicators, such as gross domestic product (GDP) and inflation, of a particular country can be considered a multivariate time series. The aim of multivariate forecasting is to come up with a model that captures both the interrelationships between the variables and each variable's relationship with its own past, and that forecasts all the time series together.

**Explanatory forecasting**
In addition to the past values of a time series, we might use some other information to predict the future values of a time series. For example, when predicting retail store sales, information regarding promotional offers (both historical and future ones) is usually helpful. This type of forecasting, which uses information other than its own history, is called explanatory forecasting.

**Backtesting**
Setting aside a validation set from your training data to evaluate your models is a practice that is common in the ML world. Backtesting is the time series equivalent of validation, whereby you use the history to evaluate a trained model. We will cover the different ways of doing validation and cross-validation for time series data later.
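
One common way to backtest is an expanding window, where each evaluation fold trains on everything before a cutoff and tests on the next few timesteps. The sketch below is a minimal illustration of that idea. The function `expanding_window_splits` and its parameters are hypothetical names introduced here, not part of any library.

```python
def expanding_window_splits(n_obs, n_splits, horizon):
    """Yield (train_indices, test_indices); training data always precedes test data."""
    for k in range(n_splits, 0, -1):
        test_start = n_obs - k * horizon
        yield range(0, test_start), range(test_start, test_start + horizon)

# 12 observations, 3 folds, each fold tests on the next 2 timesteps
splits = list(expanding_window_splits(n_obs=12, n_splits=3, horizon=2))
for train_idx, test_idx in splits:
    print(f"train on 0..{train_idx[-1]}, test on {list(test_idx)}")
```

Unlike a random train/validation split, no fold ever trains on observations that come after the ones it is tested on.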

**In-sample and out-sample**
Again drawing parallels with ML, in-sample refers to training data and out-sample refers to unseen or testing data. When you hear in-sample metrics, this refers to metrics calculated on training data, and out-sample metrics refers to metrics calculated on testing data.

**Exogenous and endogenous variables**
Exogenous variables are parallel time series variables that are not modeled directly for output but used to help us model the time series that we are interested in. Typically, exogenous variables are not affected by other variables in the system. Endogenous variables are variables that are affected by other variables in the system. A purely endogenous variable is a variable that is entirely dependent on the other variables in the system. Relaxing the strict assumptions a bit, we can consider the target variable as the endogenous variable and the explanatory regressors we include in the model as exogenous variables.

**Forecast combination**
Forecast combinations in the time series world are similar to ensembles from the ML world. Forecast combination is a process by which we combine multiple forecasts by using a function, either learned or heuristic-based, such as a simple average of three forecast models.
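
A minimal sketch of both flavors, using made-up forecasts from three hypothetical models. The simple average is the heuristic combination mentioned above. The weights in the second variant are fixed by hand here purely for illustration, whereas a learned combination would estimate them from data.

```python
import numpy as np

# Made-up forecasts from three models for the same four future timesteps
forecasts = np.array([
    [10.0, 11.0, 12.0, 13.0],  # model A
    [12.0, 12.0, 12.0, 12.0],  # model B
    [11.0, 13.0, 15.0, 17.0],  # model C
])

# Heuristic combination: simple average across the three models
simple_average = forecasts.mean(axis=0)

# Weighted combination: weights sum to 1 (hand-picked for illustration)
weights = np.array([0.5, 0.3, 0.2])
weighted_average = weights @ forecasts

print("simple average:", simple_average)
print("weighted average:", weighted_average)
```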