GitHub - naviveztim/CDP_python: Concatenation Decision Paths implementation in python

Python implementation of Concatenation Decision Paths (CDP), a fast and accurate method for time series classification

Overview

CDP is a novel shapelet-based method for time-series classification. It targets the main limitation of traditional shapelet-based methods, their slow training, while maintaining high accuracy. The algorithm trains many small decision trees and concatenates their decisions into unique patterns that identify each time series. The method has been tested on datasets from the UCR database.
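To make the idea concrete, here is a minimal sketch of how decisions from several small shapelet trees could be concatenated into one pattern string. It assumes each tree node holds one shapelet and a distance threshold, and that a series goes left when its distance to the shapelet is at or below the threshold. The function names, the dictionary layout of a node, and the left/right convention are illustrative assumptions, not the package's API.

```python
import math


def shapelet_distance(series, shapelet):
    """Smallest Euclidean distance between the shapelet and any window of the series."""
    m = len(shapelet)
    best = math.inf
    for start in range(len(series) - m + 1):
        window = series[start:start + m]
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(window, shapelet)))
        best = min(best, dist)
    return best


def decision_path(series, node):
    """Walk one tree from the root to a leaf and record each turn as 'L' or 'R'."""
    path = []
    while node is not None:
        go_left = shapelet_distance(series, node["shapelet"]) <= node["threshold"]
        path.append("L" if go_left else "R")
        node = node["left"] if go_left else node["right"]
    return "".join(path)


def concatenated_pattern(series, trees):
    """Concatenate the decision paths of all trees into a single pattern."""
    return "".join(decision_path(series, tree) for tree in trees)


# Two hypothetical one-node trees (a child of None marks a leaf)
tree_a = {"shapelet": [0.0, 1.0, 0.0], "threshold": 0.5, "left": None, "right": None}
tree_b = {"shapelet": [1.0, 1.0], "threshold": 0.5, "left": None, "right": None}
series = [0.0, 0.0, 1.0, 0.0, 0.0]

pattern = concatenated_pattern(series, [tree_a, tree_b])
print(pattern)  # LR
```

The series contains the first shapelet exactly (distance 0, so left), while its closest window to the second shapelet is at distance 1 (above the threshold, so right).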

Main characteristics

The Python implementation of the CDP algorithm has the following advantages:

very fast to (re)train (training time varies from seconds to minutes for datasets from UCR)
produces compact models (~KB), in comparison with large standard models (~100 MB)
maintains high accuracy, comparable to and in some cases better than state-of-the-art algorithms (Fig. 1)
does not depend on any other machine learning package; its only dependencies are standard Python packages
very simple to maintain (consists of 8 Python files spread across two folders)

Installation

pip install cdp-ts

Donate

If you like this project, consider supporting me by donating.

Training & Testing

from cdp_tsc.core.cdp import CDP
from cdp_tsc.utils.logger import logger
from cdp_tsc.utils.dataset import Dataset
from cdp_tsc.utils.utils import process_dataset
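import numpy as np

# Placeholder settings: replace every value below with your own
TRAIN_DATASET_PATH = "<path to train file>"
TEST_DATASET_PATH = "<path to test file>"
DELIMITER = "\t"
MODELS_FOLDER_PATH = "<path to models folder>"
COMPRESSION_FACTOR = 1  # positive integer: 1, 2, 3, 4, ...
NORMALIZE = True  # True or False
DERIVATIVE = True  # True or False
NUM_CLASSES_PER_TREE = 2
NUM_TREES = 200  # example value only; Table 1 uses 200 to 800 trees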


def train():
    """Demo function that shows how to create and train a CDP model."""

    # Obtain train dataset from 'ucr' type csv file
    train_dataset = Dataset(filepath=TRAIN_DATASET_PATH
                            , delimiter=DELIMITER)

    # Apply pre-processing
    train_dataset = process_dataset(dataset=train_dataset
                                    , compression_factor=COMPRESSION_FACTOR
                                    , normalize=NORMALIZE
                                    , derivative=DERIVATIVE)

    # Initialize CDP
    cdp = CDP(model_folder=MODELS_FOLDER_PATH
              , num_classes_per_tree=NUM_CLASSES_PER_TREE
              , num_trees=NUM_TREES
              )

    # Train the model
    cdp.fit(train_dataset)


def test():
    """Demo function that loads a trained CDP model and evaluates it on the test dataset."""

    # Initialize CDP
    cdp2 = CDP(model_folder=MODELS_FOLDER_PATH
               , num_classes_per_tree=NUM_CLASSES_PER_TREE
               , num_trees=NUM_TREES
               )

    # Load already trained model 
    cdp2.load_model()

    # Obtain test dataset
    test_dataset = Dataset(filepath=TEST_DATASET_PATH
                           , delimiter=DELIMITER)

    # Apply the same pre-processing that was applied to the train dataset
    test_dataset = process_dataset(dataset=test_dataset
                                   , compression_factor=COMPRESSION_FACTOR
                                   , normalize=NORMALIZE
                                   , derivative=DERIVATIVE)

    # Predict class indexes of a test dataset
    predicted_class_indexes = cdp2.predict(test_dataset)

    # Count how many of the predicted class indexes are correct
    matching_count = np.sum(np.array(predicted_class_indexes) == test_dataset.class_indexes)
    logger.info(f"Accuracy: {100 * matching_count / len(predicted_class_indexes):.2f}%")


if __name__ == "__main__":
    train()
    test()
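The README does not spell out what process_dataset does for each option. The sketch below shows one plausible reading, purely as an assumption: compression averages non-overlapping groups of compression_factor points, derivative takes first differences, and normalize applies z-normalization. The helper names and the order of the steps are hypothetical and are not taken from the package.

```python
import math


def compress(series, factor):
    """Average each non-overlapping group of `factor` consecutive points (assumed meaning)."""
    groups = [series[i:i + factor] for i in range(0, len(series), factor)]
    return [sum(group) / len(group) for group in groups]


def derivative(series):
    """First difference between consecutive points (assumed meaning)."""
    return [b - a for a, b in zip(series, series[1:])]


def z_normalize(series):
    """Shift to zero mean and scale to unit standard deviation (assumed meaning)."""
    mean = sum(series) / len(series)
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / len(series))
    if std == 0:
        return [0.0] * len(series)  # a constant series carries no shape information
    return [(x - mean) / std for x in series]


def preprocess(series, compression_factor=1, use_derivative=False, normalize=False):
    """Apply the three optional steps in one assumed order."""
    out = compress(series, compression_factor)
    if use_derivative:
        out = derivative(out)
    if normalize:
        out = z_normalize(out)
    return out


print(preprocess([1.0, 3.0, 5.0, 7.0], compression_factor=2))  # [2.0, 6.0]
```

Whatever the actual steps are, the demo above applies the same settings to the train and test datasets, which any such pipeline requires for the patterns to be comparable.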

Performance - accuracy and training time

The CDP model trains quickly: training time varies from seconds to minutes for datasets from the UCR database. The tables below show elapsed training times and the corresponding accuracies, along with the hyper-parameters used. Fig. 1 compares the accuracy of the CDP method with some state-of-the-art time series classification methods. Note: the accuracies reported in Fig. 1 were obtained with the C# implementation of the CDP method (for more information: cdp-project.com). Table 1 contains training times and accuracies obtained with the Python implementation of the CDP method, and Table 2 the corresponding results from the C# implementation. The present Python implementation does not use any acceleration techniques such as numba or multiprocessing.

Table 1. Training time and accuracy of python implementation with numba of CDP method

UCR Dataset	Num. classes	Num. train samples	Num. test samples	Training time, [sec]	Accuracy, [%]	Compression rate	Num. decision trees	Normalize	Derivative
SwedishLeaf	15	500	625	99	85.4%	2	500	No	No
Beef	5	30	30	43	70.1%	1	200	Yes	Yes
OliveOil	4	30	30	35	76.6%	2	200	Yes	No
Symbols	6	25	995	62	86.9%	4	600	Yes	Yes
OsuLeaf	6	200	242	98	90.1%	4	800	Yes	Yes

There is also a C# implementation of the CDP algorithm, which on the same CPU was faster and more accurate on most of these datasets (Table 2).

Table 2. Training time and accuracy of C# implementation of CDP method

UCR Dataset	Num. classes	Num. train samples	Num. test samples	Training time, [sec]	Accuracy, [%]	Compression rate	Num. decision trees	Normalize	Derivative
SwedishLeaf	15	500	625	16	92.7%	2	700	No	No
Beef	5	30	30	24	86.8%	1	400	Yes	Yes
OliveOil	4	30	30	71	90.1%	2	200	Yes	No
Symbols	6	25	995	4	95.6%	4	600	Yes	Yes
OsuLeaf	6	200	242	15	88.9%	4	800	Yes	Yes
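As a rough way to read the two tables side by side, the snippet below computes the training-time ratio and the accuracy difference per dataset from the values in Table 1 and Table 2. Note that the number of decision trees differs between the tables for SwedishLeaf and Beef, so this is only an indicative comparison, not a controlled benchmark.

```python
# (training time in seconds, accuracy in %) copied from Table 1 and Table 2
python_results = {
    "SwedishLeaf": (99, 85.4),
    "Beef": (43, 70.1),
    "OliveOil": (35, 76.6),
    "Symbols": (62, 86.9),
    "OsuLeaf": (98, 90.1),
}
csharp_results = {
    "SwedishLeaf": (16, 92.7),
    "Beef": (24, 86.8),
    "OliveOil": (71, 90.1),
    "Symbols": (4, 95.6),
    "OsuLeaf": (15, 88.9),
}

comparison = {}
for name, (py_time, py_acc) in python_results.items():
    cs_time, cs_acc = csharp_results[name]
    speedup = round(py_time / cs_time, 2)   # above 1 means C# trained faster
    acc_gain = round(cs_acc - py_acc, 1)    # above 0 means C# was more accurate
    comparison[name] = (speedup, acc_gain)
    print(f"{name}: C# speedup x{speedup}, accuracy difference {acc_gain:+} points")
```

On these numbers, C# trains faster on every dataset except OliveOil and is more accurate on every dataset except OsuLeaf.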

We tested several time series classification methods on 40 datasets from the UCR database. The CDP method compares well in terms of accuracy, as shown in the figure below.

Fig. 1. Comparison of state-of-the-art classifiers and the CDP method, using the C# implementation of CDP.

Model

Two files are produced during training. The first contains the sequence of decision trees in .pickle format. The second, in csv format, contains the concatenated decision patterns that the decision trees produce for each time series in the train dataset, as shown in the example below.

class_index,class_pattern
1,LLRLRLLRRLLLRLLLLRL...
1,LLLLRRRRLLLLLLRRRRR...
2,LLLLRRRRLLLLLLLLLLL...

These files are stored in the model folder given as an input parameter. Their names are hardcoded (defined in cdp.py) as follows:

# Filename of trained model - contains sequence of decision trees
MODEL_FILENAME = 'cdp_model.pickle'
# Filename of csv file that contains predicted class indexes
PATTERNS_FILE_NAME = 'patterns.csv'
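Because the patterns file is plain csv, it can be inspected with the standard library alone. The sketch below assumes the two-column header shown in the example above and integer class indexes; the sample rows are short made-up patterns, and load_patterns is a hypothetical helper, not part of the package.

```python
import csv
import io
from collections import defaultdict

# Made-up sample in the same layout as patterns.csv (real patterns are much longer)
SAMPLE = "class_index,class_pattern\n1,LLRL\n1,LLLR\n2,RRLL\n"


def load_patterns(handle):
    """Read (class_index, class_pattern) pairs from an open csv file handle."""
    reader = csv.DictReader(handle)
    return [(int(row["class_index"]), row["class_pattern"]) for row in reader]


# For a real model folder, open the file instead, for example:
#   with open(os.path.join(model_folder, "patterns.csv"), newline="") as handle:
#       patterns = load_patterns(handle)
patterns = load_patterns(io.StringIO(SAMPLE))

# Group the patterns by class to see how many train series each class has
by_class = defaultdict(list)
for class_index, class_pattern in patterns:
    by_class[class_index].append(class_pattern)

for class_index, class_patterns in sorted(by_class.items()):
    print(class_index, len(class_patterns))
```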

Classification

Currently, classification is done by producing the decision pattern of an incoming time series and comparing it with the patterns from the train dataset. The train pattern that most closely resembles the incoming pattern determines the predicted class index.
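The comparison step can be sketched as a nearest-pattern search. The README does not state which similarity measure is used, so the example below assumes a simple count of matching positions (the complement of Hamming distance); similarity and classify are hypothetical names, and the patterns are made up.

```python
def similarity(pattern_a, pattern_b):
    """Number of positions at which the two patterns agree (assumed measure)."""
    return sum(a == b for a, b in zip(pattern_a, pattern_b))


def classify(pattern, train_patterns):
    """Return the class index of the most similar train pattern.

    train_patterns is a list of (class_index, class_pattern) pairs.
    """
    best_index, _ = max(train_patterns, key=lambda item: similarity(pattern, item[1]))
    return best_index


train_patterns = [(1, "LLRLR"), (1, "LLLLR"), (2, "RRRLL")]

print(classify("RRRLR", train_patterns))  # 2
print(classify("LLRLL", train_patterns))  # 1
```

Each prediction scans every train pattern, so the cost grows with both the number of train series and the pattern length.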

This default classification process is somewhat slow, because the incoming pattern has to be compared with many train patterns.

More advanced classifiers, such as neural networks or random forests, could be applied for faster and more precise classification by taking the produced decision patterns as input features.
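To use the patterns as input features, they first have to be turned into numbers. A minimal sketch, assuming patterns contain only the characters L and R, maps each position to 0 or 1 so that every pattern becomes a fixed-length feature vector. The helper names are hypothetical, and the choice of classifier to train on the result is left open.

```python
def pattern_to_features(pattern):
    """Encode a decision pattern as a list of 0/1 values (L -> 0, R -> 1)."""
    mapping = {"L": 0, "R": 1}
    try:
        return [mapping[ch] for ch in pattern]
    except KeyError as err:
        raise ValueError(f"unexpected character in pattern: {err.args[0]!r}") from None


def build_feature_matrix(rows):
    """Split (class_index, class_pattern) pairs into features X and labels y."""
    X = [pattern_to_features(pattern) for _, pattern in rows]
    y = [class_index for class_index, _ in rows]
    return X, y


rows = [(1, "LLR"), (2, "RRL")]
X, y = build_feature_matrix(rows)
print(X)  # [[0, 0, 1], [1, 1, 0]]
print(y)  # [1, 2]
```

X and y are in the row-per-sample layout that most classifier libraries accept, so they could be passed to a random forest or a small neural network in place of the nearest-pattern search.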

Website:

cdp-project.com

References:

“Concatenated Decision Paths Classification for Datasets with Small Number of Class Labels”, Ivan Mitzev and N.H. Younan, ICPRAM, Porto, Portugal, 24-26 February 2017

“Concatenated Decision Paths Classification for Time Series Shapelets”, Ivan Mitzev and N.H. Younan, International journal for Instrumentation and Control Systems (IJICS), Vol. 6, No. 1, January 2016

“Combined Classifiers for Time Series Shapelets”, Ivan Mitzev and N.H. Younan, CS & IT-CSCP 2016 pp. 173–182, Zurich, Switzerland, January 2016

“Time Series Shapelets: Training Time Improvement Based on Particle Swarm Optimization”, Ivan Mitzev and N.H. Younan, IJMLC 2015 Vol. 5(4): 283-287 ISSN: 2010-3700

