
Difference between SAX and KBinsDiscretizer #142

Closed
janthmueller opened this issue Dec 12, 2022 · 4 comments

@janthmueller

Is there any difference between SAX with an 'ordinal' alphabet and KBinsDiscretizer?

def transform(self, X):
    """Bin the data with the given alphabet.

    Parameters
    ----------
    X : array-like, shape = (n_samples, n_timestamps)
        Data to transform.

    y
        Ignored

    Returns
    -------
    X_new : array, shape = (n_samples, n_timestamps)
        Binned data.

    """
    X = check_array(X, dtype='float64')
    n_timestamps = X.shape[1]
    alphabet = self._check_params(n_timestamps)
    discretizer = KBinsDiscretizer(
        n_bins=self.n_bins, strategy=self.strategy)
    indices = discretizer.fit_transform(X)
    if isinstance(alphabet, str):
        return indices
    else:
        return alphabet[indices]
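
As an aside, the last line above relies on NumPy fancy indexing to map the integer bin indices to symbols. A minimal illustration with made-up values:

import numpy as np

alphabet = np.array(['a', 'b', 'c'])  # one symbol per bin
indices = np.array([[0, 2, 1],
                    [1, 1, 0]])       # integer bin index for each timestamp
print(alphabet[indices])
# [['a' 'c' 'b']
#  ['b' 'b' 'a']]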

If not, can I reproduce the original algorithm by building a pipeline using:

  • StandardScaler
  • PAA
  • KBinsDiscretizer with the 'normal' strategy

or should I apply the StandardScaler after PAA?

@johannfaouzi (Owner)

KBinsDiscretizer is different from SAX because SAX is a "row-wise" discretization (i.e., discretizing each time series independently) while KBinsDiscretizer is a "column-wise" discretization (i.e., discretizing each column independently). KBinsDiscretizer is used for the Symbolic Fourier Approximation (SFA) algorithm which discretizes the Fourier coefficients of a time series.

If you want to reproduce the original SAX algorithm, you indeed need a pipeline consisting of:

  • StandardScaler (so that each time series has zero mean and unit variance)
  • PAA (to decrease the number of time points in the time series)
  • SAX with the 'normal' strategy (to use the quantiles of the standard normal distribution as cut-off values)
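
A minimal sketch of that pipeline with pyts (window_size=5 and n_bins=4 below are illustrative values, not part of the question):

from sklearn.pipeline import make_pipeline
from pyts.datasets import load_gunpoint
from pyts.preprocessing import StandardScaler
from pyts.approximation import (PiecewiseAggregateApproximation,
                                SymbolicAggregateApproximation)

X = load_gunpoint(return_X_y=True)[0]  # GunPoint training set, shape (50, 150)

sax_pipeline = make_pipeline(
    StandardScaler(),                                # zero mean, unit variance per series
    PiecewiseAggregateApproximation(window_size=5),  # 150 -> 30 time points
    SymbolicAggregateApproximation(n_bins=4, strategy='normal'),  # Gaussian cut-offs
)
X_sax = sax_pipeline.fit_transform(X)
print(X_sax.shape)  # (50, 30)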

@janthmueller (Author)

So, as I understand it, this means that given an input X in the form (n_samples, n_timestamps), the KBinsDiscretizer discretizes over the individual timestamps while SAX discretizes over the individual samples?

However, this would mean that KBinsDiscretizer could not compute bins when n_samples = 1, which contradicts what I observe.
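
To support this with code: pyts's KBinsDiscretizer happily bins a single time series, which should be impossible if the bins were computed across samples (a tiny check with made-up values):

import numpy as np
from pyts.preprocessing import KBinsDiscretizer

single = np.array([[0., 1., 2., 3., 4., 5.]])  # n_samples = 1
print(KBinsDiscretizer(n_bins=3, strategy='uniform').transform(single))
# Expected: [[0 0 1 1 2 2]], i.e. three equal-width bins over the single series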

A further test of the equivalence of both methods over different numbers of bins and strategies suggests to me that the two methods are the same. For example:

import numpy as np
from pyts.datasets import load_gunpoint
# Both classes below are the pyts versions: scikit-learn's KBinsDiscretizer has
# no 'normal' strategy, and pyts's StandardScaler works sample-wise without fitting.
from pyts.preprocessing import StandardScaler, KBinsDiscretizer
from pyts.approximation import SymbolicAggregateApproximation

strategy = 'normal'
n_bins = 10

ts = load_gunpoint(return_X_y=True)[0]

SCALER = StandardScaler()
ts = SCALER.transform(ts)
SAX = SymbolicAggregateApproximation(n_bins=n_bins, strategy=strategy, alphabet='ordinal')
KBINS = KBinsDiscretizer(n_bins=n_bins, strategy=strategy)
same = np.all(KBINS.transform(ts) == SAX.transform(ts))
print(same)

Outputs: True

@johannfaouzi (Owner) commented Dec 12, 2022

Sorry for my mistake; what I said is indeed totally wrong.

When you mentioned KBinsDiscretizer, I actually thought about MultipleCoefficientBinning, which performs the discretization column-wise and is used in Symbolic Fourier Approximation.

KBinsDiscretizer is indeed very similar to SAX; it just does not support the alphabet argument. If you look at the source code of SAX, you can see that it simply uses KBinsDiscretizer and takes care of returning symbols (instead of integers) if necessary. KBinsDiscretizer, like most of the preprocessing module, is mainly there to bring scikit-learn's preprocessing tools, which are applied column-wise (feature-wise), to pyts, where they are applied row-wise (sample-wise).
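
To make the row-wise/column-wise distinction concrete, here is a small sketch with made-up values contrasting pyts's KBinsDiscretizer (row-wise) with MultipleCoefficientBinning (column-wise):

import numpy as np
from pyts.preprocessing import KBinsDiscretizer
from pyts.approximation import MultipleCoefficientBinning

X = np.array([[ 0.,  1.,  2.,  3.],
              [10., 11., 12., 13.],
              [20., 21., 22., 23.],
              [30., 31., 32., 33.]])

# Row-wise: each series is binned on its own scale, so every row gets the same codes.
print(KBinsDiscretizer(n_bins=2, strategy='uniform').transform(X))
# Expected: [0 0 1 1] repeated for all four rows

# Column-wise: each column is binned across the samples, so rows get different codes.
print(MultipleCoefficientBinning(n_bins=2, strategy='uniform',
                                 alphabet='ordinal').fit_transform(X))
# Expected: [[0 0 0 0]
#            [0 0 0 0]
#            [1 1 1 1]
#            [1 1 1 1]]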

Edit: most of the tools in the preprocessing module are rarely used in practice (or not really mentioned in the literature, I think). They are just there for convenience and usually just wrap the implementation in scikit-learn (transposing the input matrix, applying the scikit-learn implementation, then transposing the output matrix). That's not the case for KBinsDiscretizer, because I didn't want to allow the k-means strategy to compute the bins, and I wanted to add the 'normal' strategy. For reference, here is the relevant part of SAX's transform method:

discretizer = KBinsDiscretizer(
    n_bins=self.n_bins, strategy=self.strategy,
    raise_warning=self.raise_warning
)
indices = discretizer.fit_transform(X)
if isinstance(alphabet, str):
    return indices
else:
    return alphabet[indices]
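
The "transpose trick" mentioned in the edit above looks roughly like this (a hypothetical illustration, not pyts's actual code), using scikit-learn's StandardScaler to standardize each time series instead of each feature:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[ 0.,  1.,  2.,  3.],
              [10., 20., 30., 40.]])

# scikit-learn scales column-wise (feature-wise); transposing before and after
# makes it operate row-wise (sample-wise), i.e. once per time series.
X_scaled = StandardScaler().fit_transform(X.T).T
print(X_scaled.mean(axis=1))  # ~[0. 0.]
print(X_scaled.std(axis=1))   # ~[1. 1.]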

@janthmueller (Author)

Thank you for clearing up the misunderstanding, and for the additional info about scikit-learn's implementation!
