# Forecastable Analysis

As in the Stationary Analysis, this analysis double-checks how forecastable the candidate time series are.

In [6]:
import pandas as pd
import numpy as np
from statsmodels.tsa.seasonal import seasonal_decompose

In [2]:
sample_train_df1 = pd.read_pickle('sample_train_df1.pkl')
sample_test_df1 = pd.read_pickle('sample_test_df1.pkl')
print(sample_train_df1.shape, sample_test_df1.shape)

sample_train_df2 = pd.read_pickle('sample_train_df2.pkl')
sample_test_df2 = pd.read_pickle('sample_test_df2.pkl')
print(sample_train_df2.shape, sample_test_df2.shape)

sample_train_df1.head()

(100, 2) (14, 2)
(100, 2) (10, 2)


purchase_date,purchase_amount,client_count
2019-02-11,58743.49,470
2019-02-12,73984.82,686
2019-02-13,95749.55,876
2019-02-14,118723.62,1095
2019-02-15,152350.76,1379


In [5]:
train1_col = sample_train_df1['client_count']
train2_col = sample_train_df2['client_count']
test1_col = sample_test_df1['client_count']
test2_col = sample_test_df2['client_count']

# Sample 1 residuals
decomposition1_train = seasonal_decompose(train1_col, model='multiplicative')
decomposition1_test = seasonal_decompose(test1_col, model='multiplicative')
residual1_train = decomposition1_train.resid.dropna()
residual1_test = decomposition1_test.resid.dropna()

# Sample 2 residuals - additive
decomposition2_train_add = seasonal_decompose(train2_col, model='additive')
decomposition2_test_add = seasonal_decompose(test2_col, model='additive', period=2)
residual2_train_add = decomposition2_train_add.resid.dropna()
residual2_test_add = decomposition2_test_add.resid.dropna()

# Sample 2 residuals - multiplicative
decomposition2_train_mulp = seasonal_decompose(train2_col, model='multiplicative')
decomposition2_test_mulp = seasonal_decompose(test2_col, model='multiplicative', period=2)
residual2_train_mulp = decomposition2_train_mulp.resid.dropna()
residual2_test_mulp = decomposition2_test_mulp.resid.dropna()

### What is Sample Entropy

* "Sample entropy (SampEn) is a measure of complexity of a sequence of value. For a given embedding dimension m, tolerance r and number of data points N, SampEn is the negative natural logarithm of the probability that if two sets of simultaneous data points of length m have distance <r then two sets of simultaneous data points of length m+1 also have distance <r."
* Compared with Approximate Entropy, Sample Entropy works better on short sequences.
* The lower the entropy value, the less complex the time series and the more forecastable it is.
* Another benefit of sample entropy (and approximate entropy) is that the scale of the time series doesn't matter when the tolerance r is set relative to the standard deviation of the series, as it is in the cells below, since all comparisons are made within each sequence.
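
The quoted definition can be written compactly. This is only a sketch: the symbols $A$ and $B$ are shorthand introduced here and do not appear in the quoted text.

$$
\mathrm{SampEn}(m, r, N) = -\ln\frac{A}{B}
$$

Here $B$ is the number of pairs of distinct length-$m$ subsequences whose maximum element-wise (Chebyshev) distance is within $r$, and $A$ is the same count for subsequences of length $m+1$. A ratio $A/B$ close to 1 means that runs which match for $m$ points usually keep matching for one more point, so the entropy is close to 0 and the series is regular.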

In [9]:
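def SampEn(U, m, r):
    """
    Compute Sample entropy.
    
    U: time series data (array-like, indexed by position).
    m: int, the embedding dimension (length of the compared runs of data).
    r: a positive real number, the tolerance (filtering level).
    
    Reference: https://www.machinelearningplus.com/time-series/time-series-analysis-python/
    """
    N = len(U)
    
    def _maxdist(x_i, x_j):
        # Chebyshev distance: largest element-wise absolute difference
        return max(abs(ua - va) for ua, va in zip(x_i, x_j))

    def _phi(m):
        # All overlapping runs of length m
        x = [[U[j] for j in range(i, i + m)] for i in range(N - m + 1)]
        # For each run, count the other runs within tolerance r (self-matches excluded)
        C = [sum(1 for j in range(len(x)) if i != j and _maxdist(x[i], x[j]) <= r) for i in range(len(x))]
        return sum(C)
    
    # Negative log of the ratio of (m+1)-length matches to m-length matches
    return -np.log(_phi(m+1) / _phi(m))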
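
As a quick sanity check of the idea, the sketch below compares a regular signal with random noise and checks the effect of rescaling. It is not the notebook's code: `sample_entropy` is a hypothetical NumPy-vectorized variant intended to mirror the counting in `SampEn` above, and the sine wave, the noise series, the seed and the scaling factor are all made up for illustration.

```python
import numpy as np

def sample_entropy(series, m, r):
    """Hypothetical vectorized variant of SampEn, for illustration only."""
    x = np.asarray(series, dtype=float)

    def count_matches(length):
        # All overlapping runs of the given length, shape (n - length + 1, length)
        templates = np.lib.stride_tricks.sliding_window_view(x, length)
        # Pairwise Chebyshev distances between runs
        dist = np.abs(templates[:, None, :] - templates[None, :, :]).max(axis=2)
        # Ordered pairs within tolerance, minus the diagonal (self-matches)
        return int((dist <= r).sum()) - len(templates)

    b = count_matches(m)
    a = count_matches(m + 1)
    if a == 0 or b == 0:
        # No matches found: the ratio is undefined, report infinite entropy
        return float("inf")
    return -np.log(a / b)

rng = np.random.default_rng(0)          # assumed seed, illustration only
t = np.arange(300)
sine = np.sin(2 * np.pi * t / 20)       # highly regular series
noise = rng.normal(size=300)            # irregular series

en_sine = sample_entropy(sine, 2, 0.2 * np.std(sine))
en_noise = sample_entropy(noise, 2, 0.2 * np.std(noise))

# Rescaling the series leaves the entropy unchanged because r scales with the std
scaled = 1024 * noise
en_scaled = sample_entropy(scaled, 2, 0.2 * np.std(scaled))

print(en_sine, en_noise, en_scaled)
```

The regular sine wave should come out with a much lower entropy than the noise, and the rescaled noise should give the same value as the unscaled one, which is the scale independence mentioned in the bullets above.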

### Sample 1

In [21]:
# original time series
entropy_smp1 = SampEn(train1_col.values, m=2, r=0.2*np.std(train1_col.values))

# logged original time series
ts_log = np.log(train1_col)
entropy_smp1_log = SampEn(ts_log.values, m=2, r=0.2*np.std(ts_log.values))

# logged moving average time series
ts_moving_avg = ts_log.rolling(window=4,center=False).mean().dropna()
entropy_smp1_log_ma = SampEn(ts_moving_avg.values, m=2, r=0.2*np.std(ts_moving_avg.values))

# residuals
entropy_residuals = SampEn(residual1_train.values, m=2, r=0.2*np.std(residual1_train.values))

# differenced residuals
residuals_diff = residual1_train - residual1_train.shift(3) 
residuals_diff = residuals_diff.dropna()
entropy_residuals_diff = SampEn(residuals_diff.values, m=2, r=0.2*np.std(residuals_diff.values))

print(entropy_smp1, entropy_smp1_log, entropy_smp1_log_ma)
print(entropy_residuals, entropy_residuals_diff)

0.22783839561841748 0.18301144988789228 0.13943250735323678
1.8021222562636017 2.178532444324067
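
The two preprocessing steps used in the cell above each shorten the series by three rows, which is why `dropna()` follows them. A minimal sketch on a made-up series (the name `toy` and its values are hypothetical) shows this, and that subtracting `shift(3)` is the same operation as pandas' `diff(periods=3)`:

```python
import pandas as pd

# Hypothetical toy series, not taken from the notebook's data
toy = pd.Series([1.0, 2.0, 4.0, 7.0, 11.0, 16.0])

# 4-point trailing moving average: the first 3 rows have no full window
toy_ma = toy.rolling(window=4, center=False).mean().dropna()

# Lag-3 differencing written as in the notebook, and the built-in equivalent
toy_diff_manual = (toy - toy.shift(3)).dropna()
toy_diff_builtin = toy.diff(periods=3).dropna()

print(len(toy), len(toy_ma), len(toy_diff_manual))
print(toy_diff_manual.equals(toy_diff_builtin))
```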


#### Note

* Compared with the test statistics in the Stationary Analysis, the sample entropy results are more clear-cut.
* They confirm that the residuals are more complex than the original time series and its transformed versions, and that the logged moving average has the lowest entropy (about 0.14) of the sequences compared here.

### Sample 2

In [23]:
# original time series
entropy_smp2 = SampEn(train2_col.values, m=2, r=0.2*np.std(train2_col.values))

# logged original time series
ts_log = np.log(train2_col)
entropy_smp2_log = SampEn(ts_log.values, m=2, r=0.2*np.std(ts_log.values))

# logged moving average time series
ts_moving_avg = ts_log.rolling(window=4,center=False).mean().dropna()
entropy_smp2_log_ma = SampEn(ts_moving_avg.values, m=2, r=0.2*np.std(ts_moving_avg.values))

# Additive residuals
entropy_residuals_add = SampEn(residual2_train_add.values, m=2, r=0.2*np.std(residual2_train_add.values))

# differenced additive residuals
residuals_add_diff = residual2_train_add - residual2_train_add.shift(3) 
residuals_add_diff = residuals_add_diff.dropna()
entropy_residuals_add_diff = SampEn(residuals_add_diff.values, m=2, r=0.2*np.std(residuals_add_diff.values))

# Multiplicative residuals
entropy_residuals_mulp = SampEn(residual2_train_mulp.values, m=2, r=0.2*np.std(residual2_train_mulp.values))

# differenced multiplicative residuals
residuals_mulp_diff = residual2_train_mulp - residual2_train_mulp.shift(3) 
residuals_mulp_diff = residuals_mulp_diff.dropna()
entropy_residuals_mulp_diff = SampEn(residuals_mulp_diff.values, m=2, r=0.2*np.std(residuals_mulp_diff.values))

print(entropy_smp2, entropy_smp2_log, entropy_smp2_log_ma)
print(entropy_residuals_add, entropy_residuals_add_diff)
print(entropy_residuals_mulp, entropy_residuals_mulp_diff)

0.5052854433903752 0.475845904869964 0.16911898215202087
2.268683541318364 1.9136492868370918
1.5553706911638245 1.739115735742633


## Summary

Comparing the sample entropy of all these series, the logged moving average of the original time series has the lowest entropy in both samples, so it still appears to be the most forecastable choice.