Add comment regarding the current state of libraries for PdM (#28)

* Paper: Add comment regarding the current state of libraries for PdM
* Paper: Comment on Industry 5.0 and add appropriate references
* Paper: Add missing DOIs
* Setup: Add upper bound for scikit-learn due to antropy incompatibility
lucianolorenti · Jul 23, 2023 · c12e24d
1 parent 02600e8

Showing 3 changed files with 77 additions and 10 deletions.
diff --git a/paper/paper.bib b/paper/paper.bib
@@ -132,7 +132,8 @@ @article{scikit-learn
  journal={Journal of Machine Learning Research},
  volume={12},
  pages={2825--2830},
- year={2011}
+ year={2011},
+ doi={10.48550/arXiv.1201.0490}
 }
 
 @misc{tensorflow2015-whitepaper,
@@ -181,4 +182,61 @@ @misc{tensorflow2015-whitepaper
     Yuan~Yu and
     Xiaoqiang~Zheng},
   year={2015},
+  doi={10.48550/arXiv.1603.04467}
+}
+
+@misc{2022_nasa_prog_models,
+    author    = {Christopher Teubert and Matteo Corbetta and Chetan Kulkarni and Katelyn Jarvis and Matthew Daigle},
+    title     = {Prognostics Models Python Package},
+    month     = December,
+    year      = 2022,
+    version   = {1.4},
+    url       = {https://github.com/nasa/prog\_models}
+    }
+
+  @article{christ2018time,
+  title={Time series feature extraction on basis of scalable hypothesis tests (tsfresh--a python package)},
+  author={Christ, Maximilian and Braun, Nils and Neuffer, Julius and Kempa-Liehr, Andreas W},
+  journal={Neurocomputing},
+  volume={307},
+  pages={72--77},
+  year={2018},
+  publisher={Elsevier},
+  doi={10.1016/j.neucom.2018.03.067}
+}
+
+@article{JMLR:v21:20-091,
+  author  = {Romain Tavenard and Johann Faouzi and Gilles Vandewiele and
+             Felix Divo and Guillaume Androz and Chester Holtz and
+             Marie Payne and Roman Yurchak and Marc Ru{\ss}wurm and
+             Kushal Kolar and Eli Woods},
+  title   = {Tslearn, A Machine Learning Toolkit for Time Series Data},
+  journal = {Journal of Machine Learning Research},
+  year    = {2020},
+  volume  = {21},
+  number  = {118},
+  pages   = {1-6},
+  url     = {http://jmlr.org/papers/v21/20-091.html}
+}
+
+@article{van2022predictive,
+  title={Predictive maintenance for industry 5.0: behavioural inquiries from a work system perspective},
+  author={van Oudenhoven, Bas and Van de Calseyde, Philippe and Basten, Rob and Demerouti, Evangelia},
+  journal={International Journal of Production Research},
+  pages={1--20},
+  year={2022},
+  publisher={Taylor \& Francis},
+  doi={10.1080/00207543.2022.2154403}
+}
+
+@article{khan2023changes,
+  title={Changes and improvements in Industry 5.0: A strategic approach to overcome the challenges of Industry 4.0},
+  author={Khan, Moin and Haleem, Abid and Javaid, Mohd},
+  journal={Green Technologies and Sustainability},
+  volume={1},
+  number={2},
+  pages={100020},
+  year={2023},
+  publisher={Elsevier},
+  doi={10.1016/j.grets.2023.100020} 
 }
diff --git a/paper/paper.md b/paper/paper.md
@@ -23,28 +23,37 @@ bibliography: paper.bib
 
 # Summary
 
-`CeRULEo`, which stands for Comprehensive utilitiEs for Remaining Useful Life Estimation methOds, is a Python package designed to train and evaluate regression models for predicting remaining useful life (RUL) of equipment. RUL estimation is a process that uses prediction methods to forecast the future performance of machinery and obtain the time left before machinery loses its operation ability.  The remaining useful life  estimation has been considered as a central 
+`CeRULEo`, which stands for Comprehensive utilitiEs for Remaining Useful Life Estimation methOds, is a Python package designed to train and evaluate regression models for predicting remaining useful life (RUL) of equipment. RUL estimation is a process that uses prediction methods to forecast the future performance of machinery and obtain the time left before machinery loses its operation ability.  The RUL  estimation has been considered as a central 
 technology of Predictive Maintenance (PdM) [@heimes2008recurrent; @li2018remaining].  PdM  techniques can statistically evaluate a piece of equipment's health status,  enabling early identification of impending failures and prompt pre-failure  interventions, thanks to prediction tools based on historical data [@susto2014machine].  `CeRULEo` offers a comprehensive suite of tools to help with the analysis and pre-processing of preventive maintenance data. These tools also enable the training and evaluation of RUL models that are tailored to the specific needs of the problem at hand. 
 
 
 # Statement of need
 
-Effective maintenance management helps reduce costs related to defective products and equipment downtime. A well-planned maintenance strategy improves reliability, prevents unexpected outages, and lowers operating costs. In Industry 4.0, data from the manufacturing process can enhance decision-making. RUL estimation uses prediction techniques to forecast a machine's future performance based on historical data and determine its remaining useful life, enabling early identification of potential failures and prompt pre-failure interventions. In this context, `CeRULEo` provides a comprehensive set of utilities designed to train and evaluate regression models for predicting remaining useful life of equipment. 
+Effective maintenance management helps reduce costs related to defective products and equipment downtime. A well-planned maintenance strategy improves reliability, prevents unexpected outages, and lowers operating costs. 
 
-In order to achieve good performance, RUL regression requires data preparation and feature engineering. Typically, machinery data is provided as time series data from various sensors during operation. The first step in data preparation is often to create a dataset based on run-to-failure cycles. This involves dividing the time series into segments where the equipment starts in a healthy state and ends in a failure state, or is close to failure. The second step of data preparation is preprocessing. While PdM models can be used in a variety of contexts with different data sources and errors, there are some general techniques that can be applied [@serradilla2022deep], such as time-series validation, imputing missing values, handling homogeneous or non-homogeneous sampling rates, addressing values, range and behaviour differences across difference machines and the creation of run-to-failure-cycle-based data. 
+
+In Industry 5.0, the industrial machines produce a large amount of data which can be used to predict an asset’s life [@khan2023changes]. RUL estimation uses prediction techniques to forecast a machine's future performance based on historical data, enabling early identification of potential failures and prompt pre-failure interventions. 
+
+Within the PdM and RUL regression ecosystem, finding a library that effectively combines modelling, feature extraction capabilities, and tools for model comparison poses a significant challenge. While numerous repositories and libraries exist for models and feature extraction in time series data [@christ2018time; @JMLR:v21:20-091], few offer a comprehensive solution that integrates both aspects effectively. The prog_models and prog_als libraries from NASA [@2022_nasa_prog_models] come closest to fulfilling this requirement. However, they have a strong focus on simulation and lack extensive mechanisms for feature extraction from time series data. 
+
+On the other hand, `CeRULEo` provides a comprehensive set of utilities designed to train and evaluate regression models for predicting RUL of equipment. `CeRULEo`  emphasizes a data-driven approach using industrial data, particularly when a simulation model is unavailable or costly to develop, prioritizing model library-agnosticism for easy deployment in any production environment. 
+
+In order to achieve good performance, RUL regression requires data preparation and feature engineering. Typically, machinery data is provided as time series data from various sensors during operation. The first step in data preparation is often to create a dataset based on run-to-failure cycles. This involves dividing the time series into segments where the equipment starts in a healthy state and ends in a failure state, or is close to failure. The second step of data preparation is preprocessing. While PdM models can be used in a variety of contexts with different data sources and errors, there are some general techniques that can be applied [@serradilla2022deep], such as time-series validation, imputing missing values, handling homogeneous or non-homogeneous sampling rates, addressing values, range and behaviour differences across different machines and the creation of run-to-failure-cycle-based data. 
 
 
 `CeRULEo` addresses these issues by providing a comprehensive toolkit for preprocessing time series data for use in PdM models, with a focus on run-to-failure cycles. The preprocessing includes sensor data validation methods, for studying not only missing and corrupted values but also distribution drift among different pieces of equipment. 
 
-In addition to preprocessing, it enables the iteration of machine data for use in both mini-batch and full-batch regression models, and is compatible with popular machine learning frameworks such as scikit-learn [@scikit-learn] and tensorflow [@tensorflow2015-whitepaper]. The library also includes a catalog of successful deep learning models [@jayasinghe2019temporal; @li2020remaining; @CHEN2022104969] from the literature and a collection of commonly used remaining useful life datasets for quick model evaluation.
+In addition to preprocessing, it enables the iteration of machine data for use in both mini-batch and full-batch regression models, and is compatible with popular machine learning frameworks such as scikit-learn [@scikit-learn] and tensorflow [@tensorflow2015-whitepaper]. The library also includes a catalog of successful deep learning models [@jayasinghe2019temporal; @li2020remaining; @CHEN2022104969] from the literature and a collection of commonly used RUL datasets for quick model evaluation.
+
+The acceptance of PdM technologies is pivotal in Industry 5.0 for successful implementation, but hesitations or reluctance by decision-makers  can still pose significant barriers [@van2022predictive]. One effective approach to foster acceptance and understanding is through explainability, which plays a crucial role in PdM.
+As such, `CeRULEo`  incorporates explainable models capable of providing additional information about the predictions, enhancing comprehension: one that can select the most relevant features for the model [@lemhadri2021lassonet], and a convolutional model [@fauvel2021xcm] that provides post-hoc explanations of the predictions to understand the reasoning behind the predicted RUL. 
 
-In the context of predictive maintenance, explainability is crucial. As such, `CeRULEo` includes two explainable models: one that can select the most relevant features for the model [@lemhadri2021lassonet], and a convolutional model [@fauvel2021xcm] that provides post-hoc explanations of the predictions to understand the reasoning behind the predicted remaining useful life. This helps users better understand and trust the model's predictions.
+Moreover, `CeRULEo` provides tools for evaluating and comparing PdM models based on not only traditional regression metrics, but also on their ability to prevent errors and reduce costs. In many cases, the costs of not accurately detecting or anticipating faults can be much higher than the cost of inspections or maintenance due to reduced efficiency, unplanned downtime, and corrective maintenance expenses. In PdM, it is particularly important to be accurate about the RUL  of equipment near the end of its lifespan, as an overestimation of RUL can have serious consequences when immediate action is required. `CeRULEo` addresses this issue by providing mechanisms for weighting samples according to their importance and asymmetric losses for training models, as well as visualization tools for understanding model performance in relation to true RUL.
 
-Moreover, `CeRULEo` provides tools for evaluating and comparing PdM models based on not only traditional regression metrics, but also on their ability to prevent errors and reduce costs. In many cases, the costs of not accurately detecting or anticipating faults can be much higher than the cost of inspections or maintenance due to reduced efficiency, unplanned downtime, and corrective maintenance expenses. In predictive maintenance, it is particularly important to be accurate about the remaining useful life  of equipment near the end of its lifespan, as an overestimation of RUL can have serious consequences when immediate action is required. `CeRULEo` addresses this issue by providing mechanisms for weighting samples according to their importance and asymmetric losses for training models, as well as visualization tools for understanding model performance in relation to true RUL.
 
 
 # Financial Acknowledgement
 
-The Italian Government PNRR iniatiatives 'Partenariato 11: Made in Italy circolare e sostenibile' and 'Ecosistema dell'Innovazione - iNest' are gratefully acknowledged for partially financing this research activity.
+The Italian Government PNRR initiatives 'Partenariato 11: Made in Italy circolare e sostenibile' and 'Ecosistema dell'Innovazione - iNest' are gratefully acknowledged for partially financing this research activity.
 
 # References
diff --git a/pyproject.toml b/pyproject.toml
@@ -18,15 +18,15 @@ dependencies = [
         "pandas >= 1.5",
         "numpy >= 1.22",
         "tqdm >= 4.56",
-        "scikit-learn >= 0.24",
+        "scikit-learn >= 0.24, <1.3",
         "emd >= 0.4",
         "mmh3   >= 2.0",
         "pyarrow >= 4",
         "gdown >= 4.2",
         "pyinform >= 0.2",
         "pyts >= 0.12",
         "seaborn >= 0.11",
-        "antropy >= 0.1",
+        "antropy >= 0.1.5",
         "uncertainties >= 3.1",
         "PyWavelets >= 1.3",
 ]
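
A quick way to check whether an existing environment falls inside the new `scikit-learn >= 0.24, <1.3` range is sketched below. This is only an illustration and not part of the commit: the helper names `parse_version` and `satisfies_sklearn_bound` are hypothetical, and the version parsing is a deliberate simplification that looks only at the leading numeric components (a full PEP 440 comparison would need a dedicated parser).

```python
import re
from importlib import metadata


def parse_version(text):
    """Turn '1.2.2' into (1, 2, 2).

    Simplification (assumption): only the first three numeric components
    are kept; suffixes such as 'rc1' or '.post1' are ignored.
    """
    public = text.split("+")[0]  # drop any local version label
    return tuple(int(part) for part in re.findall(r"\d+", public)[:3])


def satisfies_sklearn_bound(version):
    """Return True if `version` lies in the range >= 0.24, < 1.3."""
    return (0, 24) <= parse_version(version) < (1, 3)


if __name__ == "__main__":
    try:
        installed = metadata.version("scikit-learn")
    except metadata.PackageNotFoundError:
        print("scikit-learn is not installed")
    else:
        status = "inside" if satisfies_sklearn_bound(installed) else "outside"
        print(f"scikit-learn {installed} is {status} the pinned range")
```

Running the script in the target environment prints whether the installed release would be accepted by the constraint added in `pyproject.toml`.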