
Overview of the Batch Processing Framework

Batch Processing

This solution places a heavy emphasis on scalability using modern cloud architectures. There are three phases to launch this system:

  • Data acquisition
  • Modeling
  • Productionization

These are the same three phases as the stream processing pipeline. However, the processing is not performed in real time; instead, it runs in periodic executions according to the client's needs. Each phase of the framework is explained in detail in the following sections.

Data Acquisition

Regardless of the processing flavor, enough data must be collected to build the models. In this case, we get batches of data from sensors periodically; how frequently these batches are generated depends on the client's needs.

[Diagram: batch data acquisition]

  1. A scheduled job is created to send the file containing the data to the staging area in the Databricks environment for further processing.
  2. Inside Databricks, data is cleaned and features are extracted from raw data using Spark.
  3. After the data is processed, it is appended to the corresponding Delta table in the warehouse (see the PySpark sketch after this list).
  4. This workflow will be triggered regularly using a scheduled job on Databricks.
  5. Whenever needed, sensor data can be joined with maintenance logs to form a training dataset for modeling.
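As a rough illustration of steps 2 and 3, the PySpark sketch below reads a staged batch, applies basic cleaning, and appends the result to a Delta table. The staging path, column names, and table name are placeholders for illustration, not the project's actual configuration.

```python
# Minimal PySpark sketch of the ingestion step (Databricks job context).
# The staging path, column names, and table name below are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Read the latest batch of sensor data dropped into the staging area.
raw = (
    spark.read.format("csv")
    .option("header", True)
    .load("/mnt/staging/sensor_batch.csv")
)

# Basic cleaning: drop rows with missing readings and normalize types.
clean = (
    raw.dropna(subset=["sensor_id", "reading", "event_time"])
       .withColumn("event_time", F.to_timestamp("event_time"))
       .withColumn("reading", F.col("reading").cast("double"))
)

# Append the cleaned batch to the corresponding Delta table in the warehouse.
clean.write.format("delta").mode("append").saveAsTable("telemetry_raw")
```

The same notebook can be attached to a scheduled Databricks job so the whole flow runs at whatever cadence the client requires.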

Modeling

With the data acquisition process in place, we gather data to train statistical or machine learning models. This phase is implemented in the same way as in the stream processing framework. There are mainly two types of models covered here:

  • Anomaly detection
  • Estimation of time-until-service

You can see the technologies that are going to be used for this purpose in the diagram below.

[Diagram: modeling]

  1. The input dataset will be generated by extracting and aggregating both the telemetry and maintenance log data stored in the Delta warehouse. This dataset will be converted into a Spark DataFrame for scalable feature extraction.

  2. Spark SQL will be used to extract features from the DataFrame in a scalable fashion. Since the telemetry data is numeric, the features will include:

  • Rolling average
  • Tumbling average
  • Rolling median
  • Standard deviation
  • Min
  • Max
  • Sum

In addition, static data and maintenance logs will be used under the hypothesis that, over time, the service required for equipment repairs becomes more frequent. The resulting feature dataset will be stored in another table in the warehouse for future access and will be used for the modeling process.
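As a sketch of this step, rolling statistics can be computed with Spark window functions and tumbling aggregates with a time-based groupBy. The table names, column names, and window sizes below are assumptions for illustration only.

```python
# Hypothetical feature-extraction sketch using PySpark window functions.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
telemetry = spark.table("telemetry_raw")  # table name is an assumption

# Rolling window: the current reading plus the two previous ones per sensor.
rolling = (
    Window.partitionBy("sensor_id")
          .orderBy("event_time")
          .rowsBetween(-2, Window.currentRow)
)

rolling_features = (
    telemetry
    .withColumn("rolling_avg", F.avg("reading").over(rolling))
    .withColumn("rolling_std", F.stddev("reading").over(rolling))
    .withColumn("rolling_min", F.min("reading").over(rolling))
    .withColumn("rolling_max", F.max("reading").over(rolling))
)

# Tumbling (non-overlapping) one-hour aggregates per sensor.
tumbling_features = (
    telemetry
    .groupBy("sensor_id", F.window("event_time", "1 hour"))
    .agg(
        F.avg("reading").alias("tumbling_avg"),
        F.sum("reading").alias("sum_reading"),
    )
)

# Persist the feature sets to Delta tables for the modeling phase.
rolling_features.write.format("delta").mode("overwrite").saveAsTable("telemetry_rolling_features")
tumbling_features.write.format("delta").mode("overwrite").saveAsTable("telemetry_tumbling_features")
```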

  3. TensorFlow will be used when there is sufficient data to generate a time-series neural network model that estimates the time-until-failure of equipment. Scikit-learn will be used to generate a model capable of anomaly detection; these models do not need as much data as a neural network and can be put into production earlier.

  4. MLflow will be used to track performance and register models. When a model is promoted to the Production stage, MLflow allows us to serve it in either batch or real-time mode, as shown in the sketch after this list.
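As an illustrative sketch of steps 3 and 4, the snippet below trains a scikit-learn anomaly detector and registers it with MLflow. The feature table, feature columns, and registered model name are assumptions carried over from the earlier snippets, not project settings.

```python
# Hypothetical training-and-registration sketch; feature table, feature columns,
# and the registered model name are assumptions.
import mlflow
import mlflow.sklearn
from pyspark.sql import SparkSession
from sklearn.ensemble import IsolationForest

spark = SparkSession.builder.getOrCreate()

# Load the feature table produced in the previous step and move it to pandas.
feature_cols = ["rolling_avg", "rolling_std", "rolling_min", "rolling_max"]
features = (
    spark.table("telemetry_rolling_features")
    .select(*feature_cols)
    .dropna()
    .toPandas()
)

with mlflow.start_run():
    # Unsupervised anomaly detector; needs far less data than a neural network.
    model = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
    model.fit(features)

    # Track the run and register the model so it can be promoted to Production.
    mlflow.log_param("n_estimators", 100)
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="equipment-anomaly-detector",
    )
```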

Scoring and Dashboard

Here, the model is loaded and inference is performed on new data stored in Databricks Delta tables. The difference from the stream processing framework is that the model does not need to be wrapped in a web service; a scheduled job can be created to perform the inference. Predictions are written to another Delta table, and Power BI is used to generate dashboards from both the predictions and the aggregated features. The dashboard will be refreshed regularly as well.
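A minimal batch-scoring sketch, assuming the registered model and the hypothetical table names from the earlier snippets, could look like this:

```python
# Batch-scoring sketch for the scheduled job; the model URI and table names are
# hypothetical and follow the earlier snippets.
import mlflow.pyfunc
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Wrap the Production-stage registered model as a Spark UDF for scalable scoring.
predict_udf = mlflow.pyfunc.spark_udf(
    spark, model_uri="models:/equipment-anomaly-detector/Production"
)

features = spark.table("telemetry_rolling_features")
scored = features.withColumn(
    "prediction",
    predict_udf(F.struct("rolling_avg", "rolling_std", "rolling_min", "rolling_max")),
)

# Write predictions to another Delta table that the Power BI dashboard reads.
scored.write.format("delta").mode("append").saveAsTable("predictions")
```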

[Diagram: batch dashboard]