## Notebook for Model Performance and Walk-Forward Validation Techniques
Investigate various methods for model performance and walk-forward (WF) validation.
1. Currently using Stratified Shuffle Split, based on the QTS book
    - Should I continue to use it? Probably not.
2. Have read in RobotWealth and Hyndman that time series WF validation should use a time-series-type split, also called a "rolling origin" forecast
    - How does this affect CM calculations?
    - Should I use repeated CV?
    - Should I use a custom rolling origin that doesn't use the whole series, as TimeSeriesSplit does?
        - Use a rolling series of data?
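
On the CM question, assuming CM refers to the confusion matrix, one possible approach is to sum the out-of-sample confusion matrices from each walk-forward fold and compute metrics on the pooled counts. The sketch below uses made-up labels and predictions for three folds; `confusion_2x2` is a hypothetical helper for binary targets, not part of this project's library.

```python
import numpy as np

def confusion_2x2(y_true, y_pred):
    """Rows are the actual class (0, 1); columns are the predicted class (0, 1)."""
    cm = np.zeros((2, 2), dtype=int)
    for actual, predicted in zip(y_true, y_pred):
        cm[actual, predicted] += 1
    return cm

# Hypothetical out-of-sample labels and predictions from three folds
folds = [
    (np.array([1, 0, 1, 1]), np.array([1, 0, 0, 1])),
    (np.array([0, 0, 1, 1]), np.array([0, 1, 1, 1])),
    (np.array([1, 1, 0, 0]), np.array([1, 1, 0, 0])),
]

# Pool the per-fold counts, then compute accuracy on the pooled matrix
pooled_cm = sum(confusion_2x2(y_true, y_pred) for y_true, y_pred in folds)
accuracy = np.trace(pooled_cm) / pooled_cm.sum()
print(pooled_cm)
print("pooled accuracy:", accuracy)
```

Pooling weights each out-of-sample observation equally; averaging per-fold metrics instead would weight each fold equally, which may matter when folds differ in size.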

For time-dependent data, we can employ walk-forward analysis or rolling forecast origin techniques. These come in various flavors (see the illustration below). We first divide the data set into, say, ten periods. We fit a trading algorithm on the first period, then see how it performs on the “out-of-sample” second period. Then we repeat for the second and third periods, third and fourth periods, and so on until we reach the end of the data set. The out-of-sample periods are then used to evaluate the potential performance of the trading system in question. If we like, we can hold back a final data set, perhaps the most recent one, for final evaluation of whatever trading system passes this initial test.
![Image](split_time-1.png?raw=true)
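
The period-by-period scheme described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: `walk_forward_periods` is a hypothetical helper, the sample count of 100 is arbitrary, and model fitting is left out so that only the index bookkeeping is shown.

```python
import numpy as np

def walk_forward_periods(n_samples, n_periods=10):
    """Yield (train_idx, test_idx): fit on period k, evaluate on period k + 1."""
    # np.array_split tolerates n_samples not being divisible by n_periods
    periods = np.array_split(np.arange(n_samples), n_periods)
    for k in range(n_periods - 1):
        yield periods[k], periods[k + 1]

pairs = list(walk_forward_periods(n_samples=100, n_periods=10))
for train_idx, test_idx in pairs[:2]:
    print("TRAIN:", train_idx[0], "-", train_idx[-1],
          "TEST:", test_idx[0], "-", test_idx[-1])
```

Ten periods give nine train/test pairs, and every test period lies strictly after its training period.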

#### Import libraries

In [1]:
%matplotlib inline

In [2]:
from Code.lib.plot_utils import PlotUtility
from Code.lib.time_utils import TimeUtility
from Code.lib.model_utils import ModelUtility
from Code.lib.retrieve_data import DataRetrieve, ComputeTarget
from Code.lib.config import current_feature, feature_dict
from Code.models import models_utils

In [3]:
from sklearn.model_selection import TimeSeriesSplit
from sklearn.utils import indexable
from sklearn.utils.validation import _num_samples
import numpy as np
import datetime
from dateutil.relativedelta import relativedelta
import matplotlib.pyplot as plt

In [4]:
from sklearn.model_selection import StratifiedShuffleSplit

#### Get data for analysis

In [7]:
timeUtil = TimeUtility()
ct = ComputeTarget()
dSet = DataRetrieve()
modelUtil = ModelUtility()
    
issue = "TLT"
# Set IS-OOS parameters
pivotDate = datetime.date(2019, 1, 3)
is_oos_ratio = 2
oos_months = 4
segments = 4

df = dSet.read_issue_data(issue)
dataLoadStartDate = df.Date[0]
lastRow = df.shape[0]
dataLoadEndDate = df.Date[lastRow-1]
dataSet = dSet.set_date_range(df, dataLoadStartDate,dataLoadEndDate)
# Forward-fill any NAs for now
dataSet.ffill(inplace=True)
# Set beLong level
beLongThreshold = 0.000
dataSet = ct.setTarget(dataSet, "Long", beLongThreshold)

Successfully retrieved data series for TLT


#### Set date range for analysis using the typical IS-OOS ratio

In [8]:
# set date splits
isOosDates = timeUtil.is_oos_data_split(issue, pivotDate, is_oos_ratio, oos_months, segments)
dataLoadStartDate = isOosDates[0]
is_start_date = isOosDates[1]
oos_start_date = isOosDates[2]
is_months = isOosDates[3]
is_end_date = isOosDates[4]
oos_end_date = isOosDates[5]

modelStartDate = is_start_date
modelEndDate = modelStartDate + relativedelta(months=is_months)
print("Issue: " + issue)
print("Start date: " + str(modelStartDate) + "  End date: " + str(modelEndDate))
mmData = dataSet[modelStartDate:modelEndDate].copy()
model_results = []
predictor_vars = "Temp holding spot"

                    Segments:  4
                IS OOS Ratio:  2
                  OOS months:  4
                   IS Months:  8
              Months to load:  36
              Data Load Date:  2016-12-03
              IS Start  Date:  2017-01-03
              OOS Start Date:  2017-09-03
                  Pivot Date:  2019-01-03
Issue: TLT
Start date: 2017-01-03  End date: 2017-09-03
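
The printed dates are consistent with simple month arithmetic, sketched below. This is inferred from the output values only, not from the source of `TimeUtility.is_oos_data_split`; `add_months` is a hypothetical helper, and the "Months to load" figure is not modelled here.

```python
import calendar
import datetime

def add_months(d, months):
    """Shift a date by whole months, clamping the day to the month's length."""
    month_index = d.year * 12 + (d.month - 1) + months
    year, month = divmod(month_index, 12)
    month += 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return datetime.date(year, month, day)

pivot_date = datetime.date(2019, 1, 3)
is_oos_ratio, oos_months, segments = 2, 4, 4

# Assumed relationship: the in-sample window is ratio x the out-of-sample window
is_months = is_oos_ratio * oos_months
# Assumed: step back from the pivot by one IS window plus all OOS segments
is_start = add_months(pivot_date, -(is_months + segments * oos_months))
oos_start = add_months(is_start, is_months)

print("IS months:", is_months)
print("IS start:", is_start, " OOS start:", oos_start)
```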


#### Prep data sets for classification

In [9]:
dX, dy = modelUtil.prepare_for_classification(mmData)

### Investigate TimeSeriesSplit parameters
#### Fixed Window: FALSE, Overlapping: TRUE, Number of splits = 8

In [10]:
tscv = TimeSeriesSplit(n_splits=8)
for train_index, test_index in tscv.split(dX,dy):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = dX[train_index], dX[test_index]
    y_train, y_test = dy[train_index], dy[test_index] 

TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] TEST: [25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42] TEST: [43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
 50 51 52 53 54 55 56 57 58 59 60] TEST: [61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
 75 76 77 78] TEST: [79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 ...

#### Fixed Window: FALSE, Overlapping: TRUE, Number of splits = 3, Max Train size = 24

In [11]:
tscv = TimeSeriesSplit(n_splits=3, max_train_size=24)
for train_index, test_index in tscv.split(dX,dy):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = dX[train_index], dX[test_index]
    y_train, y_test = dy[train_index], dy[test_index] 

TRAIN: [19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42] TEST: [43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67
 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84]
TRAIN: [61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84] TEST: [ 85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102
 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
 121 122 123 124 125 126]
TRAIN: [103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
 121 122 123 124 125 126] TEST: [127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
 163 164 165 166 167 168]
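
The index patterns printed in the two cells above can be reproduced with plain integer arithmetic. The sketch below is a stand-in written for illustration, not scikit-learn's own code: `expanding_splits` is a hypothetical helper, and the sample count of 169 is inferred from the highest printed index (168).

```python
import numpy as np

def expanding_splits(n_samples, n_splits, max_train_size=None):
    """Yield (train_idx, test_idx) with equal-size test blocks at the end of the data."""
    n_folds = n_splits + 1
    test_size = n_samples // n_folds
    indices = np.arange(n_samples)
    # Leftover samples (n_samples % n_folds) fall into the first training set
    test_starts = range(n_samples - n_splits * test_size, n_samples, test_size)
    for test_start in test_starts:
        train_start = 0
        if max_train_size is not None and max_train_size < test_start:
            # Keep only the most recent max_train_size samples
            train_start = test_start - max_train_size
        yield (indices[train_start:test_start],
               indices[test_start:test_start + test_size])

for n_splits, cap in [(8, None), (3, 24)]:
    train_idx, test_idx = next(expanding_splits(169, n_splits, cap))
    print(n_splits, "splits:", "TRAIN", train_idx[0], "-", train_idx[-1],
          "TEST", test_idx[0], "-", test_idx[-1])
```

With `max_train_size` set, the window slides rather than expands, but the test blocks stay the same size, so training on 24 rows is evaluated against 42 rows here.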


### Investigation of the Improved TimeSeriesSplit

In [13]:
class TimeSeriesSplitImproved(TimeSeriesSplit):
    """Time Series cross-validator
    Provides train/test indices to split time series data samples
    that are observed at fixed time intervals, in train/test sets.
    In each split, test indices must be higher than before, and thus shuffling
    in cross validator is inappropriate.
    This cross-validation object is a variation of :class:`KFold`.
    In the kth split, it returns first k folds as train set and the
    (k+1)th fold as test set.
    Note that unlike standard cross-validation methods, successive
    training sets are supersets of those that come before them.
    Read more in the :ref:`User Guide `.
    Parameters
    ----------
    n_splits : int, default=3
        Number of splits. Must be at least 1.
    Examples
    --------
    >>> import numpy as np
    >>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
    >>> y = np.array([1, 2, 3, 4])
    >>> tscv = TimeSeriesSplitImproved(n_splits=3)
    >>> for train_index, test_index in tscv.split(X):
    ...    print("TRAIN:", train_index, "TEST:", test_index)
    ...    X_train, X_test = X[train_index], X[test_index]
    ...    y_train, y_test = y[train_index], y[test_index]
    TRAIN: [0] TEST: [1]
    TRAIN: [0 1] TEST: [2]
    TRAIN: [0 1 2] TEST: [3]
    >>> for train_index, test_index in tscv.split(X, fixed_length=True):
    ...     print("TRAIN:", train_index, "TEST:", test_index)
    ...     X_train, X_test = X[train_index], X[test_index]
    ...     y_train, y_test = y[train_index], y[test_index]
    TRAIN: [0] TEST: [1]
    TRAIN: [1] TEST: [2]
    TRAIN: [2] TEST: [3]
    >>> for train_index, test_index in tscv.split(X, fixed_length=True,
    ...     train_splits=2):
    ...     print("TRAIN:", train_index, "TEST:", test_index)
    ...     X_train, X_test = X[train_index], X[test_index]
    ...     y_train, y_test = y[train_index], y[test_index]
    TRAIN: [0 1] TEST: [2]
    TRAIN: [1 2] TEST: [3]
 
    Notes
    -----
    When ``fixed_length`` is ``False``, the training set has size
    ``i * train_splits * n_samples // (n_splits + 1) + n_samples %
    (n_splits + 1)`` in the ``i``th split, with a test set of size
    ``n_samples//(n_splits + 1) * test_splits``, where ``n_samples``
    is the number of samples. If fixed_length is True, replace ``i``
    in the above formulation with 1, and ignore ``n_samples %
    (n_splits + 1)`` except for the first training set. The number
    of test sets is ``n_splits + 2 - train_splits - test_splits``.
    """
 
    def split(self, X, y=None, groups=None, fixed_length=False, train_splits=1, test_splits=1):
        """Generate indices to split data into training and test set.
        Parameters
        ----------
        X : array-like, shape (n_samples, n_features)
            Training data, where n_samples is the number of samples
            and n_features is the number of features.
        y : array-like, shape (n_samples,)
            Always ignored, exists for compatibility.
        groups : array-like, with shape (n_samples,), optional
            Always ignored, exists for compatibility.
        fixed_length : bool, whether training sets should always have
            common length
        train_splits : positive int, for the minimum number of
            splits to include in training sets
        test_splits : positive int, for the number of splits to
            include in the test set
        Returns
        -------
        train : ndarray
            The training set indices for that split.
        test : ndarray
            The testing set indices for that split.
        """
        X, y, groups = indexable(X, y, groups)
        n_samples = _num_samples(X)
        n_splits = self.n_splits
        n_folds = n_splits + 1
        train_splits, test_splits = int(train_splits), int(test_splits)
        if n_folds > n_samples:
            raise ValueError(("Cannot have number of folds ={0} greater than the number of samples: {1}.").format(n_folds,n_samples))
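        # Check positivity and the fold budget separately so each message matches its condition
        if train_splits < 1 or test_splits < 1:
            raise ValueError("Both train_splits and test_splits must be positive integers.")
        if train_splits + test_splits > n_folds:
            raise ValueError(("train_splits + test_splits ={0} cannot exceed the number of folds: {1}.").format(train_splits + test_splits, n_folds))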
        indices = np.arange(n_samples)
        split_size = (n_samples // n_folds)
        test_size = split_size * test_splits
        train_size = split_size * train_splits
        test_starts = range(train_size + n_samples % n_folds, n_samples - (test_size - split_size), split_size)
        if fixed_length:
            for i, test_start in enumerate(test_starts):
                rem = 0
                if i == 0:
                    rem = n_samples % n_folds
                yield (indices[(test_start - train_size - rem):test_start],
                       indices[test_start:test_start + test_size])
        else:
            for test_start in test_starts:
                yield (indices[:test_start],
                    indices[test_start:test_start + test_size])
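
To check the docstring's claim that the number of test sets is `n_splits + 2 - train_splits - test_splits`, the same boundary arithmetic can be run without any data. This is a sketch only: `improved_split_bounds` is a hypothetical helper that mirrors the class above, and the sample count of 127 is an assumed value chosen for illustration.

```python
def improved_split_bounds(n_samples, n_splits, fixed_length=False,
                          train_splits=1, test_splits=1):
    """Return (train_start, test_start, test_stop) for each split, half-open."""
    n_folds = n_splits + 1
    split_size = n_samples // n_folds
    remainder = n_samples % n_folds
    train_size = split_size * train_splits
    test_size = split_size * test_splits
    test_starts = range(train_size + remainder,
                        n_samples - (test_size - split_size),
                        split_size)
    bounds = []
    for i, test_start in enumerate(test_starts):
        if fixed_length:
            # Only the first window absorbs the leftover samples
            extra = remainder if i == 0 else 0
            train_start = test_start - train_size - extra
        else:
            train_start = 0
        bounds.append((train_start, test_start, test_start + test_size))
    return bounds

bounds = improved_split_bounds(127, n_splits=10, fixed_length=True,
                               train_splits=1, test_splits=2)
print(len(bounds), "splits; first:", bounds[0], "second:", bounds[1])
```

Because the first fixed-length window also takes the remainder, it is longer than the others (17 rows against 11 in this example), which is worth keeping in mind when comparing fold scores.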

#### Fixed Window: TRUE, Overlapping: TRUE, Number of splits = 8, Train splits = 4

In [18]:
tscvi = TimeSeriesSplitImproved(n_splits=8)
for train_index, test_index in tscvi.split(dX, fixed_length=True, train_splits=4):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = dX[train_index], dX[test_index]
    y_train, y_test = dy[train_index], dy[test_index] 

TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
 75 76 77 78] TEST: [79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96]
TRAIN: [25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96] TEST: [ 97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114]
TRAIN: [ 43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60
  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78
  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96
  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114] TEST: [115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132]
...

#### Fixed Window: TRUE, Overlapping: TRUE, Number of splits = 25, Train splits = 4

In [89]:
tscvi = TimeSeriesSplitImproved(n_splits=25)
for train_index, test_index in tscvi.split(dX, fixed_length=True, train_splits=4):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = dX[train_index], dX[test_index]
    y_train, y_test = dy[train_index], dy[test_index] 

TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35 36 37 38] TEST: [39 40 41 42]
TRAIN: [27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42] TEST: [43 44 45 46]
TRAIN: [31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46] TEST: [47 48 49 50]
TRAIN: [35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50] TEST: [51 52 53 54]
TRAIN: [39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54] TEST: [55 56 57 58]
TRAIN: [43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58] TEST: [59 60 61 62]
TRAIN: [47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62] TEST: [63 64 65 66]
TRAIN: [51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66] TEST: [67 68 69 70]
TRAIN: [55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70] TEST: [71 72 73 74]
TRAIN: [59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74] TEST: [75 76 77 78]
TRAIN: [63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78] TEST: [79 80 81 82]
TRAIN: [67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82] TEST: [83 84 85 86]
TRAIN: ...

#### Fixed Window: TRUE, Overlapping: FALSE, Number of splits = 10

In [105]:
tscvi = TimeSeriesSplitImproved(n_splits=10)
for train_index, test_index in tscvi.split(dX, fixed_length=True):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = dX[train_index], dX[test_index]
    y_train, y_test = dy[train_index], dy[test_index] 

TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16] TEST: [17 18 19 20 21 22 23 24 25 26 27]
TRAIN: [17 18 19 20 21 22 23 24 25 26 27] TEST: [28 29 30 31 32 33 34 35 36 37 38]
TRAIN: [28 29 30 31 32 33 34 35 36 37 38] TEST: [39 40 41 42 43 44 45 46 47 48 49]
TRAIN: [39 40 41 42 43 44 45 46 47 48 49] TEST: [50 51 52 53 54 55 56 57 58 59 60]
TRAIN: [50 51 52 53 54 55 56 57 58 59 60] TEST: [61 62 63 64 65 66 67 68 69 70 71]
TRAIN: [61 62 63 64 65 66 67 68 69 70 71] TEST: [72 73 74 75 76 77 78 79 80 81 82]
TRAIN: [72 73 74 75 76 77 78 79 80 81 82] TEST: [83 84 85 86 87 88 89 90 91 92 93]
TRAIN: [83 84 85 86 87 88 89 90 91 92 93] TEST: [ 94  95  96  97  98  99 100 101 102 103 104]
TRAIN: [ 94  95  96  97  98  99 100 101 102 103 104] TEST: [105 106 107 108 109 110 111 112 113 114 115]
TRAIN: [105 106 107 108 109 110 111 112 113 114 115] TEST: [116 117 118 119 120 121 122 123 124 125 126]
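
The "Overlapping" label in these headings can be checked numerically by counting the indices shared by consecutive training windows. The sketch below uses window bounds read off the outputs in this notebook, written as half-open `(start, stop)` pairs; `window_overlap` is a hypothetical helper.

```python
def window_overlap(prev_train, next_train):
    """Number of indices shared by two half-open windows (start, stop)."""
    shared = min(prev_train[1], next_train[1]) - max(prev_train[0], next_train[0])
    return max(0, shared)

# train_splits=1, n_splits=10: training windows 17..27 and 28..38
one_block = [(17, 28), (28, 39)]
# train_splits=4, n_splits=8: training windows 25..96 and 43..114
four_blocks = [(25, 97), (43, 115)]

no_overlap = window_overlap(*one_block)
shared_rows = window_overlap(*four_blocks)
window_len = four_blocks[0][1] - four_blocks[0][0]
print("train_splits=1 shared rows:", no_overlap)
print("train_splits=4 shared rows:", shared_rows, "of", window_len)
```

With `train_splits=4`, consecutive windows share three of their four blocks, so the fold scores are not independent of one another.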


#### Fixed Window: TRUE, Overlapping: FALSE, Number of splits = 10, Test splits = 2

In [130]:
tscvi = TimeSeriesSplitImproved(n_splits=10)
for train_index, test_index in tscvi.split(dX, fixed_length=True, train_splits=1,test_splits=2):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = dX[train_index], dX[test_index]
    y_train, y_test = dy[train_index], dy[test_index] 

TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16] TEST: [17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38]
TRAIN: [17 18 19 20 21 22 23 24 25 26 27] TEST: [28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49]
TRAIN: [28 29 30 31 32 33 34 35 36 37 38] TEST: [39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60]
TRAIN: [39 40 41 42 43 44 45 46 47 48 49] TEST: [50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71]
TRAIN: [50 51 52 53 54 55 56 57 58 59 60] TEST: [61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82]
TRAIN: [61 62 63 64 65 66 67 68 69 70 71] TEST: [72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93]
TRAIN: [72 73 74 75 76 77 78 79 80 81 82] TEST: [ 83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100
 101 102 103 104]
TRAIN: [83 84 85 86 87 88 89 90 91 92 93] TEST: [ 94  95  96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111
 112 113 114 115]
TRAIN: [