- Provide a set of utilities and tools for time series forecasting.
- Capable of training models at scale using PySpark.
- Build these standardized tools into a shared package for reuse.
- Contributing Data Scientists can add new functionality, and the codebase can evolve as they experiment with new feature engineering approaches and machine learning algorithms.
- JSON configuration file: Provides parameters for the dataset, feature engineering, model training, and evaluation. Sample configuration files are provided in the `notebooks/configs/json` folder.
- TSFA source code.
- Sample notebooks that illustrate:
  - How to use specific components of TSFA, with minimal config dictionaries providing the required parameters. These samples can be found in the `notebooks/module_samples` folder.
  - The various TSFA usage scenarios, emphasizing flexibility and end-to-end model training with walk-forward cross validation. These notebooks can be found in the `notebooks/how_to_use_tsfa` folder.
- Documentation to support TSFA understanding and usage.
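To give a sense of what such a configuration looks like, here is a hypothetical minimal example; the exact field names and structure are illustrative assumptions, so refer to the samples in `notebooks/configs/json` for the authoritative schema:

```json
{
  "dataset": {
    "table_name": "sales_daily",
    "time_column": "date",
    "target_column": "sales"
  },
  "feature_engineering": {
    "lags": [1, 7, 28],
    "holidays": true
  },
  "model": {
    "algorithm": "random_forest",
    "hyperparameters": {"numTrees": 100}
  },
  "evaluation": {
    "metric": "wmape"
  }
}
```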
- Clone the repo locally.
- `cd` into the project.
- `poetry install` - Leverages the poetry configuration file `pyproject.toml` and creates a Python virtual environment for you.
- `poetry add <package>` - Lets you add new Python dependencies from the command line.
- `poetry shell` - Spins up the local virtual environment for you to develop and test code.
- `poetry build` - Creates a wheel package of the `tsfa` code.
Refer to the Time Series forecasting model development lifecycle documentation for an examination of the proposed model development lifecycle and the phases in which TSFA can be used. However, it's important to consider the following prerequisites before diving into model development:
- The experimentation dataset should be created independently of this accelerator.
- It is the responsibility of the Data Scientist and Engineer to ensure reconciliation and correctness of the data against source systems.
- Source data aggregation and transformations, such as imputation to handle missing values, should be done prior to using this accelerator.
- Exploratory Data Analysis (EDA) should be conducted prior to experimentation with this accelerator.
- It is encouraged to dedicate time to data understanding prior to experimentation. Users should have performed basic EDA steps such as data validation and missing value imputation before using the accelerator.
- It is also encouraged to create an experiment hypothesis backlog so that experiments are planned in advance, specifying the features and ML algorithms to consider.
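To illustrate the kind of pre-processing expected before using the accelerator, the sketch below reindexes a daily series to a complete calendar and forward-fills missing values using pandas. This is one of many possible imputation strategies and is not part of TSFA itself:

```python
import pandas as pd

# Hypothetical daily sales series with a gap (Jan 3 absent) and a missing value.
raw = pd.DataFrame(
    {
        "date": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-04"]),
        "sales": [100.0, None, 130.0],
    }
)

# Reindex to a complete daily calendar so gaps become explicit NaNs.
full = (
    raw.set_index("date")
    .reindex(pd.date_range("2023-01-01", "2023-01-04", freq="D"))
    .rename_axis("date")
)

# Impute missing values with a forward fill.
full["sales"] = full["sales"].ffill()
print(full["sales"].tolist())  # [100.0, 100.0, 100.0, 130.0]
```

Whether forward fill, interpolation, or a domain-specific rule is appropriate depends on the data; the point is that this reconciliation happens before the dataset reaches the accelerator.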
Refer to Time Series Forecasting Accelerator coding standards document for further details.
The table below provides details of the folders and their content.
| Folder name | Description |
|---|---|
| docs | Accelerator documentation, and references to other forecasting resources are stored in this folder. |
| notebooks | Comprises sample notebooks illustrating specific TSFA component usage, and example notebooks illustrating the various TSFA usage scenarios, emphasizing flexibility as well as end-to-end model training with walk-forward cross validation. |
| tests | Contains unit tests for core accelerator functionality. |
| tsfa/common | Contains classes with common helper methods used for data preparation and model training, e.g. the ConfigParser class, which is used to parse, get, and set configuration parameters. |
| tsfa/data_prep | Contains classes and functions for data preparation and data validation. |
| tsfa/error_analysis | Contains the error analysis class for time-series forecasting. |
| tsfa/evaluation | Contains the Evaluator class, which computes model performance metrics such as WMAPE. |
| tsfa/feature_engineering | Contains the feature engineering utility classes. The FeaturesUtils class is a "wrapper" class that orchestrates feature computations based on the specified configuration parameters. |
| tsfa/models | Contains the model utility classes. The MultivariateModelWrapper and UnivariateModelWrapper classes are "wrapper" classes used to add new models to the accelerator. |
| tsfa/ml_experiment.py | The ML experiment orchestrator that utilizes the feature engineering, model, and evaluation classes to train a specified model. |
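For reference, the WMAPE (weighted mean absolute percentage error) metric mentioned above can be sketched in plain Python as follows; this is an illustration of the formula, not the accelerator's Evaluator implementation:

```python
def wmape(actuals, forecasts):
    """Weighted MAPE: total absolute error divided by total absolute actuals."""
    abs_error = sum(abs(a - f) for a, f in zip(actuals, forecasts))
    total_actuals = sum(abs(a) for a in actuals)
    return abs_error / total_actuals

# Example: total abs error = 10 + 20 = 30; total actuals = 100 + 200 = 300.
print(wmape([100, 200], [110, 180]))  # 0.1
```

Unlike plain MAPE, WMAPE weights errors by the magnitude of the actuals, so it remains stable when individual actual values are near zero.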
- Azure Databricks
- Azure DevOps
- MLflow
- Pandas API on Spark
- PySpark ML (`pyspark.ml`) package
- Technical tutorial: Random Forest models with Python and Spark ML
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit Contributor License Agreements. When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
If you would like to contribute to this library, please refer to How to contribute to Time Series Forecasting Accelerator document for further details.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.