# Anomaly Detection in Machines using Audio

References:


1.   [Anomalous Sound Detection with Machine Learning: A Systematic Review (2021)](https://arxiv.org/pdf/2102.07820.pdf)
2.   [Performing anomaly detection on industrial equipment using audio signals (2021)](https://aws.amazon.com/blogs/machine-learning/performing-anomaly-detection-on-industrial-equipment-using-audio-signals/)
3.   [Automatic Feature Extraction for Heartbeat Anomaly Detection (2021)](https://arxiv.org/pdf/2102.12289v1.pdf)
4.   [Analysis of Feature Representations for Anomalous Sound Detection (2020)](https://arxiv.org/pdf/2012.06282v1.pdf)
5.   [Acoustic Anomaly Detection for Machine Sounds based on Image Transfer Learning (2020)](https://arxiv.org/pdf/2006.03429v2.pdf)
6.   [Audio-based detection of malfunctioning machines using deep convolutional autoencoders](https://www.aes.org/e-lib/browse.cfm?elib=20747)

## 1. Overview
A rattling noise from a car engine or the sound of a pump running dry alerts a person to an ongoing problem with the machine. Beyond raising the alarm, the type of sound can also point to the source of the problem or help locate it in a specific part or component. Consider industrial pumps, which are used to lift water into irrigation canals and to keep coastal cities safe from storm surges. They can be damaged if air is sucked in or if too little water flows through them, leading to cavitation or overheating.

Recent work in industry and academia tackles this problem with audio captured from the running machine, in addition to sensors that monitor water flow rate or water pressure, to catch anomalous behavior before it becomes a problem. With the right sensors providing data through an IoT network, we can build a model that listens to the incoming audio stream from each machine and translates it into a binary normal/abnormal signal, which can act as an indicator for the personnel operating such pumps.

[Video](https://youtu.be/T8jcNAPZ5is?t=357)

## 2. MIMII Dataset
Two common datasets for this use case are ToyADMOS and MIMII. For our use case of identifying malfunctioning pumps, the MIMII dataset contains normal sounds (5000 to 10000 seconds) and anomalous sounds (about 1000 seconds). The anomalous sounds were captured to resemble real-life failure scenarios (e.g., contamination, leakage, rotating unbalance, and rail damage), and background noise from real factories was mixed with the machine sounds. The sounds were recorded with an eight-channel microphone array at a 16 kHz sampling rate and 16 bits per sample. This dataset supports unsupervised training as well as transfer learning.

## 3. Solution

We focus on a transfer-learning-based approach that performs binary classification into normal/abnormal conditions.

### 3.1 Feature Extraction
The main methods of extracting features from audio are Mel-frequency cepstral coefficients (MFCCs), log-Mel energies, and the Mel-spectrogram. We use the Mel-spectrogram to create a spectrogram image of each channel in the audio track. An alternative is to extract the Mel-spectrogram features into a tabular dataset.
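
The per-channel feature extraction described above can be sketched with NumPy alone. This is a minimal illustration, not the implementation used here: the function names, the frame size (`n_fft=1024`), the hop (`hop=512`), and the number of Mel bands (`n_mels=64`) are all assumptions, and the test clip is synthetic. In practice an audio library would normally do this step.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters spaced evenly on the Mel scale, shape (n_mels, n_fft // 2 + 1)."""
    hz_points = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_spectrogram(signal, sr=16000, n_fft=1024, hop=512, n_mels=64):
    """Return a log-Mel spectrogram in dB, shape (n_mels, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2      # (n_frames, n_fft // 2 + 1)
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T     # (n_frames, n_mels)
    return (10.0 * np.log10(mel + 1e-10)).T

# Hypothetical 8-channel clip: one second of a 1 kHz tone on every channel.
t = np.arange(16000) / 16000.0
clip = np.tile(np.sin(2 * np.pi * 1000.0 * t), (8, 1))
features = np.stack([log_mel_spectrogram(channel) for channel in clip])
print(features.shape)  # one spectrogram per channel
```

Each slice `features[c]` can then be rendered as an image for channel `c`, or flattened into rows for the tabular variant.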

### 3.2 Model Architecture

For anomaly detection, both unsupervised and supervised approaches can yield good results.

The unsupervised approach uses autoencoders (AEs) that learn to recognize the normal signals in the audio data. A short overview of this technique is given below.

<b> Unsupervised Approach</b>
The autoencoder is trained only on normal signals, because we want the model to learn how to reconstruct them. When the trained model is fed abnormal sounds, the reconstruction error is much higher than for normal sounds. We use an error threshold to discriminate between abnormal and normal sounds.
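
The train-on-normal, threshold-on-error idea can be illustrated without a deep learning framework. The sketch below swaps the autoencoder for PCA, which behaves like a linear autoencoder (project to a few components, then project back). The data is synthetic, and the component count and the 99th-percentile threshold are assumptions chosen for illustration only.

```python
import numpy as np

def fit_linear_autoencoder(normal, n_components):
    """Fit PCA on normal feature vectors; returns (mean, components)."""
    mean = normal.mean(axis=0)
    _, _, vt = np.linalg.svd(normal - mean, full_matrices=False)
    return mean, vt[:n_components]

def reconstruction_error(x, mean, components):
    """Mean squared error between each row and its encode/decode reconstruction."""
    centered = x - mean
    reconstructed = centered @ components.T @ components
    return np.mean((centered - reconstructed) ** 2, axis=1)

# Synthetic stand-in data (assumption): normal features lie near a 4-D subspace.
rng = np.random.default_rng(0)
basis = rng.normal(size=(4, 32))
normal_train = rng.normal(size=(500, 4)) @ basis + 0.05 * rng.normal(size=(500, 32))
normal_val = rng.normal(size=(200, 4)) @ basis + 0.05 * rng.normal(size=(200, 32))
abnormal = 2.0 * rng.normal(size=(50, 32))  # does not follow the normal structure

mean, components = fit_linear_autoencoder(normal_train, n_components=4)

# Threshold chosen from held-out normal data only (assumed: 99th percentile).
threshold = np.percentile(reconstruction_error(normal_val, mean, components), 99)
is_abnormal = reconstruction_error(abnormal, mean, components) > threshold
print(f"flagged {is_abnormal.sum()} of {len(abnormal)} abnormal samples")
```

Picking the threshold from held-out normal data fixes the false-alarm rate on normal sounds up front; a convolutional autoencoder would replace the two PCA functions while the thresholding step stays the same.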

<b> Supervised Approach</b>
For supervised learning, the Mel-spectrogram can be used as an image to train a binary classifier with a CNN.

Using transfer learning, image recognition architectures such as ResNet can be used to train a CNN model that takes the spectrogram as an input image and outputs a binary value, i.e. normal/abnormal.
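
Before a spectrogram can be passed to an ImageNet-style network, it has to look like an image. The following sketch shows one possible preparation step, assuming a 3-channel input of 224×224 pixels (a common convention for pretrained ResNet models, not something fixed by this write-up). The function name and the nearest-neighbour resize are illustrative choices; the framework-specific normalization and the model itself are left out.

```python
import numpy as np

def spectrogram_to_image(log_mel, size=224):
    """Scale a 2-D spectrogram to [0, 1], resize it, and replicate it to 3 channels."""
    lo, hi = log_mel.min(), log_mel.max()
    scaled = (log_mel - lo) / (hi - lo + 1e-12)
    # Nearest-neighbour resize: map every output pixel to a source row/column.
    rows = np.arange(size) * log_mel.shape[0] // size
    cols = np.arange(size) * log_mel.shape[1] // size
    resized = scaled[np.ix_(rows, cols)]
    # Pretrained image models expect 3 channels, so the single plane is repeated.
    return np.repeat(resized[np.newaxis, :, :], 3, axis=0).astype(np.float32)

# Hypothetical spectrogram with 64 Mel bands and 30 frames.
rng = np.random.default_rng(0)
image = spectrogram_to_image(rng.normal(size=(64, 30)))
print(image.shape, image.dtype)
```

The resulting array is in channels-first layout; the pretrained network's final layer would then be replaced by a single normal/abnormal output and fine-tuned.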

### 3.4 Evaluation 

### 3.4.1 Metrics
Common metrics to measure classification performance are:

1.  AUC-ROC - The higher the better.
2.  F1-score - The higher the better.
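
Both metrics can be computed from scratch in a few lines, which makes clear what they measure. The sketch below is illustrative: the labels and scores are made up, the 0.5 decision threshold is an assumption, and a library implementation would normally be used instead.

```python
def roc_auc(labels, scores):
    """Probability that a random abnormal clip scores above a random normal clip.

    Ties count as half a win; labels are 1 for abnormal and 0 for normal.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def f1_score(labels, predictions):
    """Harmonic mean of precision and recall for the abnormal class."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up example: anomaly scores for eight clips.
labels = [0, 0, 0, 1, 1, 0, 1, 0]
scores = [0.1, 0.3, 0.2, 0.8, 0.4, 0.5, 0.9, 0.05]
predictions = [int(s >= 0.5) for s in scores]  # assumed decision threshold

print(f"AUC-ROC = {roc_auc(labels, scores):.3f}")
print(f"F1      = {f1_score(labels, predictions):.3f}")
```

AUC-ROC is computed from the raw scores and does not depend on the threshold, while F1 changes with it, so the two are worth reporting together.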

### 3.4.2 Criteria
Using a confusion matrix, we can check whether the model is learning properly, i.e. how many abnormal clips it misses and how many normal clips it flags. If multiple audio sensors are available for each pump, we can use simple majority voting or weighted voting to decide whether an audio sample is classified as normal or abnormal.
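
The two voting schemes could look like the sketch below. The function names, the tie-breaking rule (a tie counts as normal), and the example weights are assumptions. In practice the weights might reflect how close each microphone is to the pump or how reliable it has been.

```python
def majority_vote(predictions):
    """Return 1 (abnormal) when strictly more than half of the sensors vote abnormal."""
    return int(sum(predictions) * 2 > len(predictions))

def weighted_vote(scores, weights, threshold=0.5):
    """Return 1 (abnormal) when the weighted mean anomaly score reaches the threshold."""
    combined = sum(s * w for s, w in zip(scores, weights)) / sum(weights)
    return int(combined >= threshold)

# Three hypothetical sensors on one pump.
print(majority_vote([1, 1, 0]))                    # two of three sensors flag the clip
print(weighted_vote([0.9, 0.2, 0.3], [3, 1, 1]))   # the first sensor is trusted most
print(weighted_vote([0.9, 0.2, 0.3], [1, 1, 1]))   # equal weights
```

With equal weights the single confident sensor is outvoted, while weighting it more heavily lets it carry the decision.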

The model's F1-score should also be plotted against time to event (TTE), where the event is the onset of an abnormal condition in the sequence of audio clips. How far ahead of the abnormal condition the model recognizes the problem is of particular interest when evaluating it.
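
One simple way to quantify how far ahead the model reacts is the lead time between its first alarm and the event. The sketch below is an assumption about how this could be measured: the function name is hypothetical, the 10-second clip length is illustrative, and it treats the first alarm before the event as the detection, which ignores the possibility that an early alarm is a false positive.

```python
def lead_time_seconds(predictions, event_index, clip_seconds):
    """Seconds between the first alarm at or before the event and the event itself.

    predictions: per-clip 0/1 outputs in time order.
    event_index: index of the clip where the abnormal condition starts.
    Returns None when the model raised no alarm up to the event.
    """
    for i, p in enumerate(predictions[: event_index + 1]):
        if p == 1:
            return (event_index - i) * clip_seconds
    return None

# Hypothetical sequence: the fault starts at clip 5, the first alarm is at clip 3.
predictions = [0, 0, 0, 1, 1, 1, 1]
print(lead_time_seconds(predictions, event_index=5, clip_seconds=10))
```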

## 4. Conclusion

1.   While the autoencoder can be trained immediately because normal audio is readily available, a supervised model trained to distinguish normal from abnormal sounds widens the scope of the model, for example to specific classes of errors.
2.   Both supervised and unsupervised approaches can take advantage of time-distributed model architectures to smooth predictions over a period of time.
