Machine learning on embedded devices

Focuses primarily on running the inference (prediction, feed-forward) part on a microcontroller or other small embedded device. The training phase can run on a standard computer or server, using existing tools as much as possible.

State of the Art in 2019

Of ML inference on general-purpose microcontrollers.

  • Deep models have efficient implementations for ARM Cortex-M. Ex: CNN and RNN in CMSIS-NN, FC in uTensor
  • Some implementations available for non-neural models that can be used. Ex: SVM,RF,AdaBoost in sklearn-porter
  • A few special-designed ML algorithms made. Ex: ProtoNN, Bonsai
  • Basic tools available for converting Tensorflow models
  • Keyword-spotting/wake-word on audio well established. Used in commercial products (Alexa etc)
  • Human activity detection on accelerometers.
  • Computer vision is under active development
  • Lots of research and many announcements of low-power co-processors, but little on market yet


  • Neural models lacking for non-ARM micros. ESP8266/ESP32
  • Non-neural models missing inference engines designed for microcontrollers
  • "Small DNN" work mostly on computer vision for mobile phones (model size 1000x of uC)
  • Few/no pretrained models available. Transfer learning little explored?
  • Very little documentation of the entire development process. From planning, data acquisition, model design
  • Best practices underdocumented (or underdeveloped?)

Ways of advancing, make contributions

  • Faster inference. Power saving, or bigger problems.
  • Smaller models. Cheaper MCU, or bigger problems.
  • Better accuracy on a problem. Better user experience, new usecases
  • Solve a concrete usecase. Easier to deploy similar usecases
  • Comparison between approaches. Microcontroller, ML model
  • Libraries or tools. Lower time to market, enable more developers




  • X-CUBE-AI for STM32


What and when to use machine learning

The defaults right now are to do conventional signal processing (no learning) in sensor, and stream raw data to the cloud for storage and processing. Machine learning happens in the cloud. If gateways are used, they mostly forward communication (no data processing).

On-edge processing is valuable when

  • Local response needed. Autonomy
  • Adaptable response needed. Over time, in context.
  • Low/predictable latency needed
  • Sending raw sensor data has privacy implications. Audio, video
  • Unreliable connection
  • High bandwidth sensor input. Audio, video, accelerometer/IMU, current sensor, radiowaves.
  • Low bandwidth algorithm output
  • Events of interest are rare
  • Low energy usage needed
  • Full/raw sensor data is not valuable to store
  • Low cost sensor unit

Example usecases

  • Predictive maintenance, using audio/vibration data
  • Activity detection for people, using audio/accelerometer data. Assistive tech, medical
  • Appliance disaggregation, using aggregated power consumption data. "Non-Intrusive Load Monitoring" (NILM)
  • Anomaly/change detection for predictive maintenance, using audio/vibration data, or electrical data
  • Gesture recognition as human input device, using accelerometer/gyro data.
  • Speech/command recognition as human input device, using microphone. Keyword/Wake-word detection
  • Battery saving in wireless sensors. Normally sending day/week aggregates, on event/anomaly detection send data immediately
  • Health status of animals via activity detected using accelerometer
  • Monitoring eating activity using accelerometer
  • Environmental monitoring, using microphone to detect unwanted activity like cutting down trees
  • Adaptive signalling and routing for wireless transmission in Wireless Sensor networks
  • Electronic nose using arrays of MEMS detectors
  • Material identification using reflective spectrometer

Energy budgets


  • Constantly on wired power
  • Periodically used on battery, else plugged in
  • Always/normally battery powered
  • Energy harvesting. Never connected to charger, should run forever

Energy harvesting

Energy harvesting rules of thumb:

Outdoor light – 10mW/cm2
Industrial temperature difference – 1-10 mW/cm2

Industrial vibration – 100µW/cm2
Human temperature difference – 25µW/cm2
Indoor light – 10µW/cm2
Human vibration – 4µW/cm2

GSM RF – 0.1µW/cm2
Wifi RF – 0.001µW/cm2

AI and Unreliable Electronics (*batteries not included).

Wireless transmission

TODO: overview of typical energy requirements, for different wireless tech
TODO: overview of data transmission capacity, for different wireless tech
TODO: overview of sending range, for different wireless tech
TODO: cost (monetary) of data transmission, for different wireless techs


Doing more of the data processing locally makes it possible to store or transmit privacy-sensitive data less often.


  • Scalable Machine Learning with Fully Anonymized Data. Uses feature hashing on the client/sensor side, before sending to the server that performs training. The hashing trick is an established way of processing data as part of training a machine learning model. The typical motivation for using the technique is a reduction in memory requirements or the ability to perform stateless feature extraction. While feature hashing is ideally suited to categorical features, it also empirically works well on continuous features.
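As a rough illustration of the hashing trick described above, the following sketch maps named features into a fixed-length vector without storing any feature dictionary. The function name `hash_features`, the use of MD5 and the signed-update variant are assumptions made for this example, not details taken from the referenced article.

```python
import hashlib

def hash_features(features, n_buckets=16):
    """Map {feature_name: value} to a fixed-length vector (hashing trick).

    Hypothetical helper for illustration. No vocabulary is stored, so the
    extraction is stateless and could run on the sensor side.
    """
    vector = [0.0] * n_buckets
    for name, value in features.items():
        digest = hashlib.md5(name.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "little") % n_buckets
        # One extra hash bit picks the sign, so collisions tend to cancel out
        sign = 1.0 if digest[8] & 1 else -1.0
        vector[index] += sign * value
    return vector

# Categorical features are encoded as "name=value" strings with value 1.0
sample = {"sensor=kitchen": 1.0, "hour=13": 1.0, "temperature": 21.5}
hashed = hash_features(sample, n_buckets=16)
```

Only the hashed vector needs to leave the device; the original feature names cannot be read back from it directly.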


  • In audio-processing, could we use a speech detection algorithm to avoid storing samples with speech in them? Can then store/transmit the other data in order to do quality assurance and/or further data analysis.


Roughly ordered by relevance. Should both be useful for typical tasks and efficiently implementable.

  • Decision trees, random forests
  • Convolutional Neural networks (quantized)
  • Binary Neural networks
  • Support Vector Machines. SVM.
  • Naive Bayes
  • Nearest Neighbours. kNN. Reduced prototypes


  • Metric learning. Especially HDML / Hamming Distance Metric Learning (Norouzi 2012), since Hamming distance is very compact, and neighbours can be found fast.
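To show why Hamming distance is cheap, here is a minimal sketch of nearest-neighbour search over binary codes stored as integers. It only illustrates the distance computation (XOR followed by a bit count); learning the codes, as HDML does, is not shown, and the function names are made up for this example.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two binary codes."""
    return bin(a ^ b).count("1")

def nearest_neighbour(query: int, codes) -> int:
    """Index of the stored code closest to the query in Hamming distance."""
    return min(range(len(codes)), key=lambda i: hamming_distance(query, codes[i]))

# Three stored 4-bit codes; a 32-bit code would still fit in one machine word
codes = [0b0000, 0b1110, 0b0011]
best = nearest_neighbour(0b1111, codes)
```

On a microcontroller the same idea maps to one XOR and one popcount (or a small lookup table) per stored code.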

Machine learning tasks

  • Classification
  • Regression
  • Prediction
  • Outlier/novelty/anomaly detection

Introduction materials

Tree-based methods

Implemented in emtrees


  • Random Forests
  • Extremely Randomized Trees. ExtraTrees
  • Gradient Boosted Trees
  • Multiple Additive Regression Trees. Lambda-MART. λ-MART.
  • Gradient Boosted Regression Trees (GBRT)

People who need the same



  • MNIST. Kaggle. Random Forest with original features seems to peak at just under 3% error rate, for 200-1000 trees. Should be below 5% for 100 trees. Training time should be under 10 minutes.

As execution target for other ML models

  • Keywords. Knowledge distillation, model distillation, model compression
  • Distilling a Neural Network Into a Soft Decision Tree. Hinton, November 2017. Main motivation is making the decisions of the model more explainable. Could it also be used to get computationally efficient results for problems on which classically-trained trees don't perform so well? "we use the deep neural network to train a decision tree that mimics the input-output function discovered by the neural network but works in a completely different way". It is possible to transfer the generalization abilities of the neural net to a decision tree by using a technique called distillation and a type of decision tree that makes soft decisions. The soft decision tree is a hierarchical mixture of experts, with probabilities across learned distributions. Sigmoid activation function, with heating term. Note: unlike most decision trees, the soft decision trees use decision boundaries that are not aligned with the axes defined by the components of the input vector. Note: trained by first picking the size of the tree and then using mini-batch gradient descent to update all of the parameters simultaneously, rather than the more standard greedy approach that decides the splits one node at a time. The loss function minimizes the cross entropy between each leaf, weighted by its path probability, and the target distribution. Uses regularization.
  • Python implementation of Hinton2017 using TensorFlow
  • Python implementation of Hinton2017 using Pytorch


QuickScorer (QS). Claims 2-6x speedups over state-of-the-art. Evaluates the branching nodes of the trees in cache-aware feature-wise order, instead of each tree separately. Uses a bitvector for the comparison operations. Exploiting CPU SIMD Extensions to Speed-up Document Scoring with Tree Ensembles. V-QuickScorer (vQS). Extends QuickScorer to use SIMD. Claims a 3.2x speedup with AVX-2, scoring 8 documents in parallel.

  • Can we reduce number of nodes in a tree? Using pruning?
  • Can we eliminate redundant decisions across trees in the forest?
  • How can one make use of SIMD? Do N comparisons at a time. Challenge: divergent branches. One work that considers SIMD for this reports that "SIMD-parallelization outweighs its initial overhead only for 7 or more nodes in parallel"
  • How to best make use of multithreading? Split by sample. Only works for batch predictions. Split by tree. Works also for single sample.

Related methods



Metric Learning / similarity / distance learning

  • Similarity Forests
  • Random forest distance (RFD)

Naive Bayes

Implemented in embayes

Simple generative model. Very effective at some classification problems. Quick and easy to train, just basic descriptive statistics. Making predictions also quick, amounts to calculating probabilities for each class.
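The description above can be sketched in a few lines for the Gaussian case: training computes a mean and variance per class and feature, and prediction sums log-probabilities per class. This is an illustrative pure-Python sketch, not the embayes implementation; the function names and the small variance floor are assumptions.

```python
import math

def fit_gaussian_nb(X, y):
    """Return {label: (log_prior, means, variances)} from training data."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # Small floor (an assumption) avoids division by zero for constant features
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(zip(*rows), means)
        ]
        model[label] = (math.log(n / len(X)), means, variances)
    return model

def predict_gaussian_nb(model, x):
    """Pick the class with the highest log-posterior for sample x."""
    def log_posterior(params):
        log_prior, means, variances = params
        total = log_prior
        for v, m, var in zip(x, means, variances):
            total += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return total
    return max(model, key=lambda label: log_posterior(model[label]))

X = [[1.0, 2.0], [1.2, 1.8], [5.0, 6.0], [5.2, 6.1]]
y = [0, 0, 1, 1]
model = fit_gaussian_nb(X, y)
```

At inference time only the per-class priors, means and variances need to be stored, which is what makes the model attractive for small devices.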


  • Gaussian, Multinomial, Bernoulli
  • Adaptive Naive Bayes (ANBC)
  • Fuzzy Naive Bayes
  • Rough Gaussian Naive Bayes
  • Non-naive Bayes. Actually takes covariance into account

Naive Bayes classifier implementations

Techniques for improvement

  • Compensate for naivety, covariance. Bagging, one-against-many


Gaussian Naive Bayes

Prior art for embayes optimization

Bayesian Networks

Neural Networks


Convolutional Neural Networks.

Much more compact models than fully connected for spatial-patterns.

  • Convolutional Neural Networks for Visual Recognition. Explains FC->CONV equivalence and Layer Sizing Patterns. Generally early layers have few kernels, many pixels (generic features). Late layers have many kernels, few pixels (specific features). A fully connected (FC) layer can be converted to a CONV layer. This reduces the number of passes needed over the input, which is more computationally efficient. Several layers of small kernels (3x3) are more expressive than one bigger kernel (7x7), because they can combine in non-linear ways thanks to activation functions. They also require fewer parameters.

Optimizing CNNs

  • convolution-flavors. Kernel to Row and Kernel to Column tricks save memory. Also has Vectored Convolution and 1D Winograd Convolution, which are computationally faster.
  • ResNet use stride instead of pooling layers to reduce size. Less computations?
  • Grouped convolutions. Not convolving (output of) entire past layers, but just subsets. Allows to eliminate redundant computations. - approx twice as efficient as MobileNets

Kernel dictionaries

Use a fixed set of kernels in a CNN or convolutional feature-map extractor. Store their index instead of individual weights? To reduce weight storage size for big networks.

Kernel weights 3x3 @ 8 bit = 72 bits. Kernel map index: 32-256kernels: 5-8 bit. Approx factor 10x reduction.

Can be done using vector quantization or clustering on the CNN weights.
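A minimal sketch of the idea, assuming plain k-means over flattened 3x3 kernels and a hypothetical `storage_bits` helper that redoes the arithmetic above while also counting the storage for the dictionary itself. The toy kernel values are random and only serve to exercise the code.

```python
import math
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(kernel, centroids):
    return min(range(len(centroids)), key=lambda j: sq_dist(kernel, centroids[j]))

def kmeans(kernels, k, iterations=10, seed=0):
    """Cluster flattened kernels; returns (centroids, index per kernel)."""
    rng = random.Random(seed)
    centroids = [list(c) for c in rng.sample(kernels, k)]
    for _ in range(iterations):
        assignments = [nearest(kern, centroids) for kern in kernels]
        for j in range(k):
            members = [kern for kern, a in zip(kernels, assignments) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, [nearest(kern, centroids) for kern in kernels]

def storage_bits(n_kernels, k, kernel_size=9, weight_bits=8):
    """Bits needed without and with a kernel dictionary of k entries."""
    index_bits = max(1, math.ceil(math.log2(k)))
    original = n_kernels * kernel_size * weight_bits
    compressed = n_kernels * index_bits + k * kernel_size * weight_bits
    return original, compressed

rng = random.Random(1)
toy_kernels = [[rng.uniform(-1, 1) for _ in range(9)] for _ in range(6)]
centroids, indices = kmeans(toy_kernels, k=2)
original, compressed = storage_bits(n_kernels=10000, k=256)
```

For 10000 kernels and a 256-entry dictionary this gives 720000 versus 98432 bits, a bit over 7x. The ratio approaches 72/8 = 9x as the network grows, since the dictionary cost is amortized.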


  • Clustering Convolutional Kernels to Compress Deep Neural Networks. Sanghyun Son. ECCV 2018. From a pre-trained model, extract representative 2D kernel centroids using k-means clustering. Each centroid replaces the corresponding kernels. Use indexed representations instead of saving whole kernels. Applied to ResNet-18. Outperforms its uncompressed counterpart in ILSVRC2012 with over 10x compression ratio. 3.3 Accelerating convolution via shared kernel representations. Rewriting multiple shared convolutions to Add-then-conv. 3.4 Transform invariant clustering. Vertical,horizontal flip, 90 degree rotation. Only 32 centroids required for VGG-16 to get reasonable accuracy. Fig. 5: The 16 most(Top)/least(Bottom) frequent centroids.

Stacking kernels

  • Learning Separable Fixed-Point kernels for Deep Convolutional Neural Networks. Sajid Anwar. Approximate the separable kernels from non-separable ones using SVD. For a separable K x K filter, the count of weights is reduced from K*K to K+K and the speedup is (K*K)/(2K) = K/2.
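To make the weight count concrete, the sketch below applies an exactly separable 3x3 kernel (a Sobel kernel, chosen here as an example) as a column pass followed by a row pass, and compares that with the full 2D pass. It does not cover the SVD approximation step from the paper, and the helper names are made up for illustration.

```python
def correlate2d(image, kernel):
    """'Valid' 2D correlation of a list-of-lists image with a kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def separable_correlate(image, col, row):
    """Same result using a K x 1 pass followed by a 1 x K pass."""
    vertical = correlate2d(image, [[v] for v in col])
    return correlate2d(vertical, [row])

col = [1, 2, 1]    # K weights
row = [1, 0, -1]   # K weights
sobel = [[a * b for b in row] for a in col]  # K*K weights, outer product

image = [[(r * 5 + c) ** 2 % 7 for c in range(5)] for r in range(5)]
full = correlate2d(image, sobel)
fast = separable_correlate(image, col, row)
```

Here K = 3, so 6 weights are stored instead of 9, and each output pixel needs roughly 2K instead of K*K multiplications.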

Transfer learning

  • DeCAF: A deep convolutional activation feature for generic visual recognition. DeCAF layers taken from a general task, plus SVM/LogisticRegression training outperform existing state-of-the-art.
  • CNN features off the shelf: an astounding baseline for recognition. Pre-trained CNN plus SVM. "SIFT and HOG descriptors produced big performance gains a decade ago and now deep convolutional features are providing a similar breakthrough for recognition" "In any case, if you develop any new algorithm for a recognition task, it must be compared against the strong baseline of generic deep features + simple classifier"

Could one try well-known normalized kernels, instead of learning them? Ex: Sobel edge detectors, median/gaussian averaging etc.

Could it give performance benefits to flatten a deep model? Ie use a deep model, compute typical activations for the different layers, and mimic these with flat convolutional kernels. Teacher-student type learning.

Other kernel learning methods

Could one learn CNN kernels using greedy selection of sets of N kernels?

Layer-wise greedy learning can be done. Can be unsupervised or supervised. Goes back to Y Bengio, 2007. Not so popular in 2015, after ReLU etc improved. Can still be beneficial for datasets with a small amount of labeled samples. Especially unsupervised pre-training/initialization, with supervised fine-tuning using back-propagation. Stacked Autoencoders is one approach.

A pre-training strategy for convolutional neural network applied to Chinese digital gesture recognition. Principal Component Analysis (PCA) is employed to learn convolution kernels as the pre-training strategy. Called PCA-based Convolutional Neural Network (PCNN).

Gabor filters

  • GaborCNN. 2016. Convolutional neural network combined with Gabor filters for strengthening the learning of texture information. 81.53% on ImageNet.
  • "Informative Spectro-Temporal Bottleneck Features for Noise-Robust Speech Recognition". 2013. Uses Gabor filters for feature extraction, on a power-normalized spectrogram. Filters selected using sparse PCA. Multi-layer perceptron used, pretrained with a Restricted Boltzmann Machine.
  • "Robust CNN-based Speech Recognition With Gabor Filter Kernels". 2014. Gabor Convolutional Neural Network (GCNN). Incorporates Gabor functions into convolutional filter kernels. Gabor-DNN features. Power-normalized spectrum (PNS) instead of mel-spectrogram. Gammatone auditory filters equally spaced on the equivalent rectangular bandwidth (ERB) scale. Medium-duration power bias is subtracted, where the bias level calculation was based on the ratio of arithmetic mean and geometric mean (AM/GM ratio) of the medium duration power. A power nonlinearity with an exponent of 0.1 replaces the logarithm nonlinearity used for compression.
  • Gabor Convolutional Networks (GCNs). Convolutional Gabor orientation Filters (GoFs). Learned convolution filters modulated by Gabor filters. Performs slightly better than ResNet with slightly fewer trainable parameters.

Q. When transfer learning on CNNs, can one transfer kernels from different models/architectures? SubQ. Do kernels in different CNNs tend to be similar? Cluster...

Model compression

  • Pruning Convolutional Neural Networks for Resource Efficient Inference. ICLR 2017. Pruning criterion based on Taylor expansion that approximates the change in the cost function induced by pruning network parameters. Focus on transfer learning. Pruning large CNNs after adaptation to fine-grained classification tasks. Demonstrates superior performance compared to other criteria, e.g. the norm of kernel weights or feature map activation.
  • Structural Compression of Convolutional Neural Networks Based on Greedy Filter Pruning.
  • "Compressing Convolutional Neural Networks in the Frequency Domain". Wenlin Chen. Called FreshNets. Uses HashedNets for FC parts. Evaluated at 1/16 and 1/64 compression.
  • CNNpack: Packing Convolutional Neural Networks in the Frequency Domain. Yunhe Wang. NIPS 2016. 41 citations. Treat convolutional kernels as images. Decompose each kernel into common parts (ie cluster centers), and residual (unique to the kernel). A large number of low-energy frequency coefficients in both parts can be discarded to produce high compression without significantly compromising accuracy. Relax the computational burden of convolution operations in CNNs by linearly combining the convolution responses of discrete cosine transform (DCT) bases. Method: kernel coefficients -> DCT -> k-means clustering -> l1 shrinkage -> quantization -> Huffman -> Compressed Sparse Row. For similar performance, 30-40x compression on AlexNet/VGG16, 10-25x speedup. 13x compression on ResNet50. !!

Neural Architecture Search

For embedded, one is almost always interested in performance under constraints on RAM, FLASH and CPU time / energy usage. Of interest is finding a Pareto-optimal family of models, which offers the best performance/constraint tradeoff.

Large amount of literature linked from

SpArSe: Sparse Architecture Search for CNNs on Resource-Constrained Microcontrollers CNNs for 2kB RAM.

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

On Random Weights and Unsupervised Feature Learning

NIPS 2010.

In this paper we pose the question, why do random weights sometimes do so well? Our answer is that certain convolutional pooling architectures can be inherently frequency selective and translation invariant, even with random weights. Demonstrate the viability of extremely fast architecture search by using random weights to evaluate candidate architectures, thereby sidestepping the time-consuming learning process. Process is approximately 30x faster. TODO: does this have modern citations?

AutoML: Methods, Systems, Challenges 2018 Reviews state of the art

An Overview of Machine Learning within Embedded and Mobile Devices–Optimizations and Applications

Reviews methods, tools, optimization techniques and applications for constrained embedded and mobile devices. Methods include k-NN, HMM, SVM, GMM and deep neural networks. For each of the methods covers some optimization schemes.

For SVM concludes that the Laplacian kernel is the most efficient, since it can be implemented with shifts. The Laplacian kernel can be used in scikit-learn by precomputing the kernel. On a practical level the number of support vectors also impacts runtime. In scikit-learn one can use the NuSVC model to control the number of support vectors.

Efficient Multi-objective Neural Architecture Search via Lamarckian Evolution

Proposes LEMONADE April 2018 - Feb 2019 Bosch / Uni Freiburg

Trained on CIFAR-10, evaluated on ImageNet64x64. Accuracy versus number of parameters: Pareto optimal over NASNet and MobileNet V1, parity with MobileNet V2. 24-56 GPU days used.

Small Convolutional Neural Nets

  • SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. 2016. Prefers 1x1 convolutions over 3x3, with the percentage tunable as a hyperparameter. Pooling very late in the layers. No fully-connected end, uses convolutional instead. 5MB model performs like AlexNet on ImageNet. 650KB when compressed to 8bit at 33% sparsity.
  • MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. 2017. Extensive use of 1×1 Conv layers, which account for 95% of its computation time and 75% of parameters. 25% of parameters are in the final fully-connected layer. Also depthwise-separable convolutions: a combination of a depthwise convolution and a pointwise convolution. Has two hyperparameters: image size and a width multiplier alpha (0.0-1.0). Figure 4 shows log linear dependence between accuracy and computation. 0.5 MobileNet-160 has 76M mult-adds, versus SqueezeNet 1700M mult-adds, both around 60% on ImageNet. Smallest tested was 0.25 MobileNet-128, with 15M mult-adds and 200k parameters.
  • ShuffleNet. Zhang, 2017. Introduces the three variants of the Shuffle unit. Group convolutions and channel shuffles. Group convolution splits the channels into groups and convolves each group separately, which reduces computations. Channel shuffle permutes the output channels of the group convolution, so that information can flow between groups.
  • MobileNetV2: Inverted Residuals and Linear Bottlenecks. 2018. Inserts linear bottleneck layers into the convolutional blocks. The ratio between the size of the input bottleneck and the inner size is called the expansion ratio. Shortcut connections between bottlenecks. ReLU6 as the non-linearity, y = min(max(x, 0), 6). Designed for low-precision computation (8 bit fixed-point). Max activations size 200K float16, versus 800K for MobileNetV1 and 600K for ShuffleNet. Smallest network at 96x96 with 12M mult-adds, 0.35 width. Performance curve very similar to ShuffleNet. Combined with SSDLite, gives similar object detection performance as YOLOv2 at 10% model size and 5% compute. 200ms on Pixel1 phone using TensorFlow Lite.
  • EffNet. Freeman, 2018. Spatial separable convolutions. Made of depthwise convolution with a line kernel (1x3), followed by a separable pooling, and finished by a depthwise convolution with a column kernel (3x1).
  • FD-MobileNet: Improved MobileNet with a Fast Downsampling Strategy. 2018.
  • 3 Small But Powerful Convolutional Networks. Explains MobileNet, ShuffleNet, EffNet. Visualizations of most important architecture differences, and the computational complexity benefits.
  • Why MobileNet and Its Variants (e.g. ShuffleNet) Are Fast. Covers MobileNet, ShuffleNet, FD-MobileNet. Explains the convolution variants used visually. Pointwise convolution (conv1x1), grouped convolution, depthwise convolution.
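The computational benefit of the depthwise-separable convolutions used by MobileNet can be estimated with a little arithmetic. The sketch below counts multiply-adds for one layer using the cost formulas from the MobileNets paper; the layer dimensions are arbitrary example values, not taken from any of the networks above.

```python
def standard_conv_mult_adds(kernel, in_ch, out_ch, out_size):
    """Multiply-adds for a standard convolution layer (square kernel and output)."""
    return kernel * kernel * in_ch * out_ch * out_size * out_size

def depthwise_separable_mult_adds(kernel, in_ch, out_ch, out_size):
    """Multiply-adds for a depthwise convolution followed by a 1x1 pointwise one."""
    depthwise = kernel * kernel * in_ch * out_size * out_size
    pointwise = in_ch * out_ch * out_size * out_size
    return depthwise + pointwise

# Example layer (assumed values): 3x3 kernel, 64 -> 128 channels, 56x56 output
standard = standard_conv_mult_adds(3, 64, 128, 56)
separable = depthwise_separable_mult_adds(3, 64, 128, 56)
reduction = standard / separable
```

The reduction factor equals 1 / (1/out_ch + 1/kernel^2), so for 3x3 kernels it stays a little below 9x regardless of how many output channels the layer has.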

Open source projects

  • nn_dataflow. Energy-efficient dataflow scheduling for neural networks (NNs), including array mapping, loop blocking and reordering, and parallel partitioning.
  • Sparse-Winograd-CNN. Efficient Sparse-Winograd Convolutional Neural Networks paper. ICLR 2018.
  • wincnn. Simple python module for computing minimal Winograd convolution algorithms for use with convolutional neural networks. "Fast Algorithms for Convolutional Neural Networks" Lavin and Gray, CVPR 2016.
  • Tencent/FeatherCNN. High performance inference engine for convolutional neural networks. For embedded Linux and mobile, especially ARM processors.
  • dll. C++ implementation of Restricted Boltzmann Machine (RBM) and Deep Belief Network (DBN) and their convolution versions.
  • CNN-Inference-Engine-Comparison. Overview of CNN inference engines, and performance. Shows MobileNetV1 at 60ms on 2-core 1.8 GHz Cortex-A72, ResNet-18 in 200ms.
  • ESP-WHO. Face recognition based on MobileNets. Which custom CNN implementation does it use?
  • tflite micro. TensorFlow Lite for microcontrollers. Since November 2018. Supports ARM Cortex M, RISC-V and Linux/MacOS host.

Sequence modelling

Temporal Convolutional Networks. Using convolutions instead of Recurrent Neural Networks. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling.

Quantized Neural Networks

Using integer quantization, typically down to 8-bit. Reduces the size of weights, and allows using wider SIMD instructions. Most interesting when applied to already-efficient deep learning architectures. Examples are MobileNet, SqueezeNet. In microcontrollers, ARM Cortex M4F and M7 can do SIMD operations with 4x 8-bit integers.
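As an illustration of 8-bit quantization, here is a minimal sketch of an affine (scale plus zero-point) scheme applied to a list of weights. This is one common formulation and is written from scratch for this example; it is not the exact scheme of any particular library, and the function names are assumptions.

```python
def quantize_params(values, num_bits=8):
    """Choose scale and zero-point so that [min, max] maps onto [0, 2^bits - 1]."""
    qmax = 2 ** num_bits - 1
    lo = min(min(values), 0.0)  # range always includes 0.0
    hi = max(max(values), 0.0)
    scale = (hi - lo) / qmax or 1.0  # fall back to 1.0 if all values are zero
    zero_point = round(-lo / scale)
    return scale, zero_point

def quantize(values, scale, zero_point, num_bits=8):
    """Convert floats to unsigned integers, clipping to the valid range."""
    qmax = 2 ** num_bits - 1
    return [min(qmax, max(0, round(v / scale) + zero_point)) for v in values]

def dequantize(quantized, scale, zero_point):
    """Recover approximate float values from the integers."""
    return [scale * (q - zero_point) for q in quantized]

weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
scale, zero_point = quantize_params(weights)
q_weights = quantize(weights, scale, zero_point)
restored = dequantize(q_weights, scale, zero_point)
```

Each weight now takes one byte instead of four, and the reconstruction error stays within about half a quantization step.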


  • Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. Paper on the quantization supported in TensorFlow Lite. Evaluated on Accuracy versus Latency tradeoff. Tested on Qualcomm Snapdragon 835 LITTLE. Using ReLU6 non-linearity. Primarily 8-bit arithmetics, 32-bit for some parts.
  • Papers are expected from CVPR 2018 On-Device Visual Intelligence Challenge (June 2018), competition focused on improving accuracy/latency tradeoff.



  • gemmlowp. Low-precision General Matrix Multiplication. C++ library for optimized GEMM using 8-bit integers, with 32 bit accumulator. Can utilize SSE4, NEON for SIMD. Seems to run on Cortex M. Well-documented architecture and implementation of the quantization.

Binarized Neural Networks

Bitwise arithmetic packed into integer representations. Decreases weights storage drastically. Implementable efficiently on constrained hardware (only fixed-point units). Also called BNN, Binary Neural Network, and XNOR neural network.
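The core operation can be sketched as follows: weights and activations restricted to +1/-1 are packed one per bit, and a dot product becomes an XNOR followed by a bit count. This is a minimal illustration with made-up helper names, not code from any specific BNN library.

```python
def pack_bits(signs):
    """Pack a list of +1/-1 values into an integer, bit i set when signs[i] is +1."""
    word = 0
    for i, s in enumerate(signs):
        if s > 0:
            word |= 1 << i
    return word

def binary_dot(a_word, b_word, n):
    """Dot product of two packed +1/-1 vectors of length n via XNOR and popcount."""
    mask = (1 << n) - 1
    matches = bin(~(a_word ^ b_word) & mask).count("1")
    # Each matching position contributes +1, each mismatch -1
    return 2 * matches - n

a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
result = binary_dot(pack_bits(a), pack_bits(b), len(a))
reference = sum(x * y for x, y in zip(a, b))
```

With 32-bit words, one XNOR and one popcount replace 32 multiply-accumulate operations, and 32 weights fit in 4 bytes.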


Another alternative is binary shift networks, which replace multiplications with bitshifts.

Support Vector Machines

Strong linear classifier/regressor, also strong non-linear method when using a kernel (polynomial,RBF).


  • O(nd) classification for n support vectors and d features.


  • Kernel function can be expensive
  • Can sometimes output a lot of coefficients (support vectors)
  • Training slow on large datasets, O(n^3)


  • Soft-margin, to reduce effects of noise/outliers near margin
  • Sparse methods. Reducing number of support vectors
  • Approximations. For instance using Nyström method. sklearn Nystroem/RBFSampler
  • Sampling to speed up training. Usually only on linear SVM. Stochastic Gradient Descent (SGD), Sequential minimal optimization (SMO)
  • Transductive, for semi-supervised learning
  • Support vector clustering
  • One-class SVM, for anomaly detection


Nearest Neighbours

Canonical example is kNN. However conventional kNN requires all training points to be stored, which is typically way too much for a microcontroller.


  • Condensed kNN. Data reduction technique. Selects prototypes for each class and eliminates points that are not needed. sklearn
  • Fast condensed nearest neighbor rule. 2005. Paper
  • Approximate nearest neighbours.
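A minimal sketch of the condensed nearest neighbour idea (in the spirit of Hart's rule): start with one stored prototype and add only those training points that the current prototypes misclassify. The function names are made up for this example, and the result depends on the order of the training data.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_label(store, x):
    """Label of the stored prototype closest to x (1-NN)."""
    return min(store, key=lambda p: sq_dist(p[0], x))[1]

def condense(X, y):
    """Return a reduced list of (point, label) prototypes."""
    store = [(X[0], y[0])]
    changed = True
    while changed:
        changed = False
        for x, label in zip(X, y):
            if (x, label) in store:
                continue  # already kept; also guards against endless re-adding
            if nearest_label(store, x) != label:
                store.append((x, label))
                changed = True
    return store

X = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
y = [0, 0, 0, 1, 1, 1]
prototypes = condense(X, y)
```

In this toy case two prototypes are enough to classify all six training points, so only a third of the data would need to be stored on the device.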


Existing work


  • RPS-RNN. Small physical device that can play Rock, Paper, Scissors slightly better than chance. Custom electronics and 3d-printed casing. 3-layer RNN running on 8-bit microcontroller, Attiny1614.

Software libraries


  • Machine Learning for Embedded Systems: A Case Study Support Vector Machines. Target system used for auto-tuning a mobile ad-hoc network (MANET) by learning the relationships among configuration parameters. Running on ARMv7 and PPC, 128MB+ RAM. Lots of detail about how they optimized an existing SVM implementation, in the end running 20x faster.


Research groups


  • SensiML. ISV providing the SensiML Analytics Toolkit, on-microcontroller ML algorithms and supporting tools.
  • ST devices. At WDC2018 announced STM32CubeMX AI, an SDK for neural networks on STM32 micros, and intent to develop a NN coprocessor. Available since December 2018. Integrated with the STM32 device selector: analyzes the provided model file, and filters possible devices that can fit it. Supports Keras, Lasagne, Caffe. Has integrated model compression, with 4x/8x settings. Templates for validation, system performance check and applications. Validation compares the on-device compressed model against the full model, either with random input or custom data files. Shows ROM and RAM used, time per inference, and time per layer. Function packs available for audio and motion. Motion performs Human Activity Recognition (HAR). Uses a 4th-order high-pass filter at 1 Hz to separate the gravity component. The dynamic portion is rotated such that it is always in the same direction, using Rodrigues' rotation formula. 3 different models. HAR_GMP: ST proprietary design trained on an ST proprietary data set. HAR_IGN: ST simplified design taken from Andrey Ignatov, “Real-time human activity recognition from accelerometer data using convolutional neural networks”, Applied Soft Computing 62 (2018), pp 915-922, trained on an ST proprietary data set. HAR_IGN_WSDM: same network topology as HAR_IGN but trained on the public Wireless Sensor Data Mining (WSDM) dataset in Jennifer R. Kwapisz, Gary M. Weiss and Samuel A. Moore, “Activity Recognition using Cell Phone Accelerometers”, ACM SIGKDD Exploration Newsletter, volume 12 issue 2, December 2010, pp 74-82. Audio computes a log-mel representation. Acoustic Scene Classification (ASC) as example, with 3 classes, using a simplified version of "Virtanen, DCASE 2016 acoustic scene classification using convolutional neural networks". 16 kHz, 30 mels, 32 frames, inference every 1024 ms. Android application allows pushing labels to the device and storing them on SD card. SensorTile hardware used. Looks like 1 mA @ 1.8 V average power consumption. STM32CubeMX AI works OK on Linux when combined with the free.
Tested on an AI project setup from scratch, and STM32L476-Nucleo AI example. The Makefile generation did however not work out-of-the-box.
  • Reality AI.
  • Lattice Semiconductors. Announced CNN acceleration IP blocks, tools and devkits for their ICE40 FPGAs October 2018. eejournal. Himax HM01B0 UPduino Shield, with ultra-low-power imaging module and support for 2 microphones.
  • STMicroelectronics. STM32 X-CUBE-AI.
  • Sensory. Wakeword/keyword spotting, speech recognition, biometric authentication.


  • Why the Future of Machine Learning is Tiny (devices). Tiny Computers are Already Cheap and Everywhere. Energy is the Limiting Factor. We Capture Much More Sensor Data Than We Use.
  • Bringing machine learning to the edge. Predictions are much lower bandwidth than the raw sensor data (e.g. video). It allows for local adaptation in the AI logic (L-DNN). It achieves lower latency between observed event and action resulting from AI logic. "the most important question is what is the least amount accuracy and computation complexity we can do while still delivering the business value?" Top mistake: "Continuing with the top down approach ‘let’s make it perform the task first and then squeeze it on device’ instead of switching to bottom up ‘let’s make it run on device and fulfill all hardware constraints first, and then tune it for the task at hand’."
  • How to run deep learning model on microcontroller with CMSIS-NN. Why run deep learning model on a microcontroller? Sensitive data gets to the cloud, photos, and audio recordings. The company who sells this may charge a service fee to use its service and even worse sell your private data. It won't work without the network connection to the server. Data traveling back and forth between the device and server introduces lag. Require network and wireless hardware components on the circuit design which increase the cost. It might waste bandwidth sending useless data.
  • eetimes: AI at the Very, Very Edge. Report from first TinyML meetup in BayArea.

Open hardware platforms

  • OpenMV, very nice machine vision devkit with STM32F7 and MicroPython. Not low-power, 200mA cited. 75 USD.
  • AudioMoth. Field audio recording device designed for battery power. 16 mAh with 10sec rec/10 sec sleep on 96kHz samplerate. 3xAA batteries. Over 100 days runtime. Silicon Labs Cortex M4F. External 256kB SRAM. Records and processes up to 384kHz. 50 USD. Paper: AudioMoth: Evaluation of a smart open acoustic device for monitoring biodiversity and the environment. "AudioMoth can be programmed to filter relevant sounds such that only those of interest are saved, thus reducing post‐processing time, power usage and data storage requirements." "AudioMoth creates a unique opportunity for users to design specific classification algorithms for individual projects." Uses the Goertzel filter for real‐time classification algorithms. This filter evaluates specific terms of a fast Fourier transform on temporarily buffered audio samples without the computational expense of a complete transform. Samples are split into N windows. Filter coefficients and the Hamming window are precomputed. From 10 to 25 mW consumption when processing samples. Many of the natural environments most prone to poaching have no Wi‐Fi or mobile coverage, ruling out the use of cloud‐based acoustic systems. 5‐month total period of field deployment of 87 AudioMoths resulted in 129 hr of audio triggered by positive algorithm responses. These were identified as false positives from a number of sources, including dog whistles, leaf noise during strong winds, and bird songs. In comparison, recording continuously for 12 hr per day over the same period would have created 156,600 hr of audio data. The most energy intensive task on AudioMoth was writing data to the microSD card, which consumed 17–70 mW. 80 μW when sleeping between samples, approx 6 years standby. Further developments are exploring the potential for networking AudioMoth by LoRa radio, to link them to a base station for real‐time signalling of acoustic events triggered by the detection algorithm. Another direction: record alternative types of data to memory instead of memory-inefficient uncompressed WAV files, for example by summarising the important characteristics of sounds with measurements known as acoustic indices.
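
The Goertzel filter mentioned for AudioMoth can be sketched in a few lines. This is a minimal illustration of the general technique, not the AudioMoth firmware: the function names, the optional Hamming window argument and the choice of returning squared magnitude are assumptions made for this example.

```python
import math

def hamming_window(n):
    """Hamming window of length n; can be precomputed once and reused."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def goertzel_power(samples, sample_rate, target_freq, window=None):
    """Squared magnitude of the single DFT bin closest to target_freq."""
    n = len(samples)
    k = round(n * target_freq / sample_rate)        # nearest DFT bin index
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)   # precomputable per target frequency
    s_prev = s_prev2 = 0.0
    for i, x in enumerate(samples):
        if window is not None:
            x = x * window[i]
        s = x + coeff * s_prev - s_prev2            # one multiply and two adds per sample
        s_prev2, s_prev = s_prev, s
    return s_prev * s_prev + s_prev2 * s_prev2 - coeff * s_prev * s_prev2

# Usage: a 1 kHz tone sampled at 8 kHz, 128-sample buffer
tone = [math.sin(2.0 * math.pi * 1000.0 * i / 8000.0) for i in range(128)]
on_target = goertzel_power(tone, 8000, 1000)
off_target = goertzel_power(tone, 8000, 2000)
```

Only the bins of interest are evaluated, so the cost grows with the number of target frequencies rather than with the full transform size.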


Machine Hearing

General info

  • Blog: Book: Human and Machine Hearing: Extracting Meaning from Sound Paper in IEEE, 2010 Key ideas. Modelling human hearing. Reusing machine vision learnings by representing sound as (moving) images. Combined audiovisual models.
  • What’s wrong with CNNs and spectrograms for audio processing? Challenges: Sounds intermix/blend into each other, i.e. are "transparent". Can also have complex relationships like phase cancellation. The directions in a spectrogram have different meanings: frequency and time. Features are not invariant wrt frequency, but generally are wrt time. Activations in the spectrogram are non-local, e.g. formants. Sound information is serial 'events', observed only in one instance of time, not visual stationary 'objects'. If freezing time, cannot understand much of the information present (compared with video). Temporal patterns are critical.

Keyword spotting

Aka "wake word detection"

Existing work on microcontrollers

  • ML-KWS-for-MCU. Speech recognition on microcontroller. Neural network trained with TensorFlow, then deployed on Cortex-M7, with FPU. Using CMSIS NN and DSP modules.

  • CASE2012. Implemented speech recognition using MFCC on 16-bit dsPIC with 40 MIPS and 16kB RAM. A Cortex-M3 at 80 MHz should have 100+MIPS.

  • An Optimized Recurrent Unit for Ultra-Low-Power Keyword Spotting. February 2019. Introduces eGRU, claimed to be 60x faster and 10x smaller than standard GRU cell. Omits the reset gate (and associated weights W_r). Uses softsign instead of sigmoid and tanh. Faster, less prone to saturation. Uses fixed-point integer operations only. Q15, bitshifts for divide/mul. 3-bit exponential weight quantization. -1,-0.5,-0.25,0,0.25,0.5,1.0. Bitwise operations, no lookup table. Uses quantization-aware training. Quantized in forward pass, full precision in backward (for gradient). Evaluated on Keyword Spotting, Cough Detection and Environmental Sound Classification. Sampling rate 8kHz. 128 samples STFT window, no overlap. 64 bands. No mel filtering! 250 timesteps for Urbansound8k. eGRU_opt Urbansound8k scores 61.2%. 3kB model size. eGRU_arc Urbansound8k score of 72%. Indicates 8kHz enough!

  • How to Achieve High-Accuracy Keyword Spotting on Cortex-M Processors. Reviews many deep learning approaches. DNN, CNN, RNN, CRNN, DS-CNN. Considering 3 different sizes of networks, bound by NN memory limit and ops/second limits. Small= 80KB, 6M ops/inference. Depthwise Separable Convolutional Neural Network (DS-CNN) provides the best accuracy while requiring significantly lower memory and compute resources. 94.5% accuracy for small network. ARM Cortex M7 (STM32F746G-DISCO). 8-bit weights and 8-bit activations, with KWS running at 10 inferences per second. Each inference – including memory copying, MFCC feature extraction and DNN execution – takes about 12 ms. 10x inferences/second. Rest sleeping = 12% duty cycle.

  • QuickLogic partners with Nordic Semiconductor for its Amazon Alexa-compatible wearables reference design using Voice-over-Bluetooth Low Energy. 2017/11 Always-on wake word detection at 640uWatt typical. nRF51822 with external MCU. EOS S3 SoC’s (Cortex M4F) hardware integrated Low Power Sound Detector.

  • Convolutional Recurrent Neural Networks for Small-Footprint Keyword Spotting. Uses Per-Channel Energy Normalization on mel spectrograms. CRNN with ~230k parameters, acceptably low latency on mobile devices. Achieves 97.71% accuracy at 0.5 FA/hour for 5 dB signal-to-noise ratio. Down to 45k parameters tested.
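
For the eGRU entry in the list above, the two cheap building blocks (softsign activation, and weights restricted to the seven levels -1, -0.5, -0.25, 0, 0.25, 0.5, 1.0) can be illustrated as below. This is a floating-point sketch only: rounding to the nearest level is an assumption of this example, and the paper's fixed-point Q15 arithmetic and training procedure are not reproduced.

```python
# The seven weight levels listed in the notes (zero plus signed powers of two).
LEVELS = [-1.0, -0.5, -0.25, 0.0, 0.25, 0.5, 1.0]

def quantize_weight(w):
    """Map a full-precision weight to the nearest allowed level (assumed rounding rule)."""
    return min(LEVELS, key=lambda level: abs(w - level))

def softsign(x):
    """softsign(x) = x / (1 + |x|); needs no exponential, unlike sigmoid/tanh."""
    return x / (1.0 + abs(x))

# Usage
weights = [0.3, -0.8, 0.05, 2.0]
quantized = [quantize_weight(w) for w in weights]
```

Because every non-zero level is a power of two, multiplying by a quantized weight reduces to a sign flip and a bit shift in fixed-point code.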


  • Audio classification overview. Criteria for good features, PCA/LDA for dimensionality reduction. Sequential forward/backward selection
  • Environmental sound recognition: a survey (2014). Mentions MPEG-7 based features, efficient and perceptual.
  • Dolph-Chebyshev Window, good window function for audio. C reference implementation.
  • Voice Activity Detection, tutorial Using 5 simple features.
  • Machine Learning for Audio, Image and Video Analysis.
  • Notes on Music Information Retrieval, series of Jupyter notebooks. Lots of goodies, from feature extraction to high-level algorithms.
  • Detection and Classification of Acoustic Scenes and Events. 2014 Review of state of the art in machine listening. Problem 1: Acoustic scene classification. Characterize acoustic environment of an audio stream by selecting a semantic label for it. Single-label classification. Similar to: Music genre recognition. Speaker recognition. Also similar to other time-based classification, i.e. in video. Approach 1. Bag of frames. Long-term statistical distribution of local spectral features. Ex MFCC. Compare feature distributions using GMM. Approach 2. Intermediate representation using higher level vocabulary/dictionary of "acoustic atoms". Problem 2: Acoustic event detection. Label temporal regions within an audio recording; start time, end time and label for each event instance. Related to: Automatic music transcription. Speaker diarisation. Typically treated as monophonic problem, but polyphonic is desirable. More challenging than scene classification. One strategy to handle polyphonic signals is to perform audio source separation, and then to analyse the resulting signals individually.


Acoustic event detection (AED)

  • Aka Automatic Environmental Sound Recognition (AESR)

  • Competitions: CLEAR "Classification of Events, Activities and Relationships". DCASE Detection and Classification of Acoustic Scenes and Events (2016,2013). Website shows progress on same dataset up to modern methods with f1-score=69.3% using Convolutional Recurrent Neural Networks. Dataset TUT-SED2009 TUT-CASA2009

  • Bag-of-Features Methods for Acoustic Event Detection and Classification. Grzeszick, 2014/2017. Features are calculated for all frames in a given time window. Then, applying the bag-of-features concept, these features are quantized with respect to a learned codebook and a histogram representation is computed. Bag-of-features approaches are particularly interesting for online processing as they have a low computational cost. Using GFCC (Gammatone frequency cepstral coefficients), in addition to MFCC. Codebook quantizations used: soft quantization, supervised codebook learning, and temporal modeling. Using DCASE 2013 office live dataset and the ITC-IRST multichannel. BoF principle: Learn intermediate representation of features in unsupervised manner. Clustering like k-means. Hard-quantization: All N*K feature vectors are clustered. Only cluster centroids are of interest. Assign based on minimum distance. Soft-quantization: GMM with expectation maximization. Codebook has mean,variance. Supervised-quantization: GMM per class, concatenated. Re-introducing temporality: Pyramid scheme, feature augmentation by adding quantized time coordinate. SVM classification. Multiclass. Linear, RBF. Histogram-intersection kernel works well. Random Forests. Works well for AED. Frame size = 1024 samples @ 44.1 kHz = 23.2 ms. The current python implementation uses a single core on a standard desktop machine and requires less than 20% of the real time for computation.

  • Bird Audio Detection using probability sequence kernels Judges award DCASE2016 for most computationally efficient. MFCC features (voicebox), GMM, SVM classifier from libsvm with probability sequence kernel (PSK). AUC of 73% without short-term Gaussianization to adapt to dataset differences.

  • LEARNING FILTER BANKS USING DEEP LEARNING FOR ACOUSTIC SIGNALS. Shuhui Qu. Based on the procedure of log Mel-filter banks, we design a filter bank learning layer. Urbansound8K dataset, the experience guided learning leads to a 2% accuracy improvement.

  • Automatic Environmental Sound Recognition: Performance Versus Computational Cost. 2016. Sigtia,...,Mark D. Plumbley Results suggest that Deep Neural Networks yield the best ratio of sound classification accuracy across a range of computational costs, while Gaussian Mixture Models offer a reasonable accuracy at a consistently small cost, and Support Vector Machines stand between both in terms of compromise between accuracy and computational cost. ! No Convolutional Neural networks. ! used MFCC instead of mel-spectrogram

  • EFFICIENT CONVOLUTIONAL NEURAL NETWORK FOR AUDIO EVENT DETECTION. Meyer, 2017. Structural optimizations reduce the memory requirement by a factor 500, and the computational effort by a factor of 2.1 while performing 9.2 % better. Final weights are 904 kB. Which fits in progmem, but not in RAM on an ARM Cortex M7. Needs 75% of theoretical performance wrt MACs, which is likely not realizable. They suggest use of a dedicated accelerator chip.

  • Robust Audio Event Recognition with 1-Max Pooling Convolutional Neural Networks. ! Very shallow network performs similar to state-of-the-art in event detection on very noisy datasets. Convolution (3..25 wide x 52 tall) -> MaxPool per frame -> Softmax across frames. Claims to also outperform with a single filter width setting. Also uses window averaging to downsample spectrogram bins to 52 bins instead of typical triangular mel. This architecture should be suitable also for Acoustic Scene Classification?

  • Baby Cry Sound Detection: A Comparison of Hand Crafted Features and Deep Learning Approach. 2017 Shows that hand-crafted features can give same performance as state-of-art CNN at 20x the computational cost. Features: Voiced unvoiced counter (VUVC), Consecutive F_0 (CF0), Harmonic ratio accumulation (HRA). Classifier: Support Vector Data Description (SVDD). "Further research should investigate ways of reducing complexity of CNN, by decreasing the number of filters and their size" Dataset was constructed from and Approx 1 hour cry, 1 hour non-cry for training. ! Testing set has only 26 baby cry events (15 min) as base. Upsampled by mixing in noise at 18dB. Makes 4h of sound with sparse amounts of target event, and 2 hours without.

  • SwishNet: A Fast Convolutional Neural Network for Speech, Music and Noise Classification and Segmentation 1D Convolutional Neural Network (CNN). Operates on MFCC, 20 band. Uses combinations of 1x3 and 1x6 convolutions. Only convolutions across temporal bands. ? Gated activations between each step. ? skip connections with Add. Architecture inspired by Inception and WaveNet architecture. Optionally use distilled knowledge from MobileNet trained on ImageNet. Tested on MUSAN, GTZAN. ! used background noise removal 5k and 18k parameters. Versus 220k for MobileNet. 1ms prediction time for 1 second window on desktop CPU. ! simple problems, GMM baseline performed 96-99% and 90%, MobileNet Random initialized 98-00% and 94-96%

  • Kaggle: The Marinexplore and Cornell University Whale Detection Challenge. Features & classification approaches. Many approaches used with good results. Large range in feature sets. Mostly deep learning and tree ensembles, some SVM. Winner used image template on spectrograms with a GradientBoostingClassifier.
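
The hard-quantization variant of bag-of-features described in the list above (assign each frame to its nearest codeword, then build a histogram) can be sketched as follows. The codebook is assumed to have been learned beforehand, for instance with k-means; the function names and the toy 2-dimensional features are hypothetical and not taken from the paper's implementation.

```python
def nearest_codeword(frame, codebook):
    """Index of the codebook centroid with minimum squared Euclidean distance."""
    best_index, best_dist = 0, float("inf")
    for i, centroid in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(frame, centroid))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index

def bof_histogram(frames, codebook):
    """Normalized histogram of codeword assignments for one time window."""
    counts = [0] * len(codebook)
    for frame in frames:
        counts[nearest_codeword(frame, codebook)] += 1
    total = len(frames)
    return [c / total for c in counts] if total else [0.0] * len(codebook)

# Usage: 4 frames of 2-dimensional features, codebook with 2 centroids
codebook = [[0.0, 0.0], [1.0, 1.0]]
frames = [[0.1, 0.0], [0.9, 1.2], [1.0, 1.0], [0.2, -0.1]]
histogram = bof_histogram(frames, codebook)
```

The fixed-length histogram is what gets passed to the classifier (SVM or Random Forest in the notes above), regardless of how many frames the window contains.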

Environmental sound monitoring


  • Environmental Noise Monitoring
  • Noise source identification



  • Acoustic Event Detection Using Machine Learning: Identifying Train Events. Shannon Mckenna,David Mclare. Using RMS over 0.125 seconds and 1/3 octave frequency bands. Classify individual time instances as train-event, then require a cluster of 3 train events successive. "Performance of our classifier was significantly increased when we normalized the noise levels by subtracting out the mean noise level of each 1/3 octave band and dividing by the standard deviation" Used Logistic Regression and SVM. From 0.6 to 0.9 true positive rate (depending on site), with <0.05 false positive rate. Tested across 10 sites.
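
Two steps from the train-event entry above lend themselves to a short sketch: per-band normalization (subtract the mean, divide by the standard deviation) and requiring several successive positive frames. Interpreting "a cluster of 3" as a run of at least 3 consecutive positive frames is an assumption of this example, as are the function names.

```python
def zscore_bands(levels):
    """Normalize each frequency band over time.

    levels: list of frames, each a list of band levels (e.g. dB per 1/3 octave band).
    """
    n = len(levels)
    n_bands = len(levels[0])
    means = [sum(frame[b] for frame in levels) / n for b in range(n_bands)]
    stds = []
    for b in range(n_bands):
        variance = sum((frame[b] - means[b]) ** 2 for frame in levels) / n
        stds.append(variance ** 0.5 or 1.0)   # avoid division by zero for constant bands
    return [[(frame[b] - means[b]) / stds[b] for b in range(n_bands)] for frame in levels]

def require_consecutive(flags, min_run=3):
    """Keep positives only where they belong to a run of at least min_run frames."""
    out = [False] * len(flags)
    run_start = None
    for i, flag in enumerate(list(flags) + [False]):   # sentinel closes a trailing run
        if flag and run_start is None:
            run_start = i
        elif not flag and run_start is not None:
            if i - run_start >= min_run:
                for j in range(run_start, i):
                    out[j] = True
            run_start = None
    return out

# Usage
normalized = zscore_bands([[1.0, 10.0], [3.0, 10.0]])
events = require_consecutive([True, True, False, True, True, True, False, True])
```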

Detection of Anomalous Noise Events on Low-Capacity Acoustic Nodes for Dynamic Road Traffic Noise Mapping within an Hybrid WASN

Speech commands


Speaker detection

Bird audio detection

  • DCASE 2018. Audio Event Detection, single-class: bird-present. 6 datasets of some k samples each.
  • BirdCLEF 2016. 24k audio clips of 999 birds species

Human Activity Recognition

Terms used

  • Activity Recognition / human activity recognition (AR)
  • Activities of Daily Living (ADL).
  • Action recognition
  • Fall detection. (FD)


Existing work

Gesture recognition

Using IMUs.


Existing work

Feature processing

  • Vector quantization
  • Acceleration statistics
  • Motion histogram
  • Zero velocity compensation (ZVC)
  • DTW (Dynamic Time Warping). FastDTW, approximation of DTW in linear time and linear space.
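
For reference, the classic dynamic time warping distance that the last item approximates can be written directly as a dynamic program. This sketch is the plain quadratic-time version on 1-dimensional sequences with absolute difference as local cost; it is not the linear-time approximation itself.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences, O(len(a) * len(b))."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j]: best accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a advances, b repeats
                                     cost[i][j - 1],      # b advances, a repeats
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]

# Usage: the same gesture performed at different speeds aligns with zero cost
same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
offset = dtw_distance([0, 0], [1, 1])
```

The full cost matrix is what makes plain DTW expensive in both time and memory on a microcontroller, which is the motivation for approximations.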

Vibration analysis

Often used for 'machine condition' analysis, especially for rotating machines.

Predictive maintenance

NASA Prognostics Data Repository. Collection of datasets for operational and failed systems. Thermal, vibration, electrical

Computer vision


"Recent studies show that the latencies to upload a JPEG-compressed input image (i.e. 152KB) for a single inference of a popular CNN–“AlexNet” via stable wireless connections with 3G (870ms), LTE (180ms) and Wi-Fi (95ms), can exceed that of DNN computation (6∼82ms) by a mobile or cloud-GPU." Moreover,the communication energy is comparable with the associated DNN computation energy.

Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars, and L. Tang, “Neurosurgeon: Collaborative intelligence between the cloud and mobile edge,” in Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 2017, pp. 615–629.


  • VLFeat. Portable C library with lots of feature extractors for computer vision tasks.


ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation 0.7 MB weights (16 bit floats). 3.83 GFLOPS on 3x640x260 images.

Classifying JPEG-compressed data

Can one do classification and object detection on compressed JPEG straight from the camera? Instead of computing the framebuffer from the JPEG.

Can it be also done in a streaming fashion?

Operating on the blocks with DCT coefficients.

Prior work:


New sensor types

How ML/DL is Disrupting Sensor Design. Compressed sensing. Random projections.

Rise of the super sensor. CMU has developed a generic 'synthetic sensor', using audio/vibration etc. "the revolution is to install a super sensor once, then all future sensing (and the actions based on that sensing) is a software solution that does not involve new devices"

Soft Robotics Youtube video about easy-to-construct soft gripper with integrated resistive sensors. Could train algorithms to detect objects gripped.

Time series

Deep learning for time series classification: a review. Compares many different model types across 97 time-series datasets. Finds that CNNs and ResNet perform the best.

Change detection

Novelty detection. Anomaly detection.

Change point detection. In Time series, at which point something changes. Often growth rate. Can also be amplitude. Changes in distribution.

Breakout detection. In time series, when the mean shifts relatively suddenly. Mean divergence/shift. Or transition to/from (rampup).
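
A naive way to illustrate a mean shift is to compare the mean of the window just before each point with the mean of the window just after it. This is only a sketch to make the idea concrete, with hypothetical names and parameters; it is not the algorithm used by any particular breakout-detection library.

```python
def mean_shift_points(series, window, threshold):
    """Indices t where mean(series[t:t+window]) differs from mean(series[t-window:t])
    by more than threshold."""
    points = []
    for t in range(window, len(series) - window + 1):
        before = sum(series[t - window:t]) / window
        after = sum(series[t:t + window]) / window
        if abs(after - before) > threshold:
            points.append(t)
    return points

# Usage: level jumps from 0 to 5 at index 10
series = [0.0] * 10 + [5.0] * 10
breakouts = mean_shift_points(series, window=5, threshold=4.0)
```

On a microcontroller the two window sums can be updated incrementally per sample, so the memory cost is two windows of history.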


  • Network Intrusion Detection (IDS)
  • Condition Monitoring of machines



  • tslearn. Time-series machine learning tools. scikit-learn inspired
  • seglearn Time-series. scikit-learn inspired.
  • banpei. Change-point detection using Singular Spectrum Transform (SST), Outlier detection using Hotelling's theory.


  • Outlier Detection DataSets (ODDS). Huge number of datasets. In multiple groups: Multi-dimensional point, time-series point (uni/multivariate), time-series graph data, adversarial/attack data.
  • Numenta Anomaly Benchmark NAB. 50+ different time-series, benchmarked on many methods.
  • UCSD Anomaly Detection Dataset. Video of pedestrian walkway. Anomalies are non-pedestrians. A subset has pixel-level masks.


  • DeepADoTS. From paper "A Systematic Evaluation of Deep Anomaly Detection Methods for Time Series". Implements 7 deep neural models for anomaly detection.
  • telemanom. Uses LSTMs to detect anomalies in multivariate time series data. Includes anomaly dataset from NASA Mars Rover.

Online learning

It is also desirable to learn on-the-fly. First level is hybrid systems where new samples are used to tune/improve a pre-trained model. More advanced is on-line training which can automatically detect new classes. This gets closer to the typical Artificial Intelligence field, since one now has an intelligent agent able to learn on its own.

Hybrid learning, adaptive machine learning, progressive learning, semi-supervised learning. Q-learning (reinforcement learning).

Transfer learning

Kinds of transfer learning

  • Inductive transfer. Little labeled target data is available. Source data is used as auxiliary.
  • Transductive transfer. Lots of labeled source data. Lots of unlabeled data in target.
  • Unsupervised transfer. Both source and target data is unlabeled.

Note: in transfer learning, performance on source dataset is generally ignored. Goal is good performance on target.

How to transfer

  • Instance based. Reuse instances in source domains that are similar to target domain. Ex: Instance reweighting, importance sampling
  • Feature based. Find an alternate feature space for learning target domain, while projecting source into this space. Feature subset selection, feature space transformation.
  • Model/parameter based. Use model parameters/hyper-parameters to influence learning target. Parameter-space partitioning, superimposing shape constraints.

Boosting based transfers

  • TrAdaBoost
  • TransferBoost. Based on AdaBoost



Sparsity Lesson of Fundamentals of Digital Image and Video Processing. Applications: Noise smoothing, inpainting, superresolution. Foreground/background separation. Compressive sensing. Images are sparse in DCT decomposition. Can throw away many of the coefficients with minimal quality loss. Noise is not correlated and will not compress well. This fact is used in image denoising. Compute a sparse representation, then reconstruct. Can be done with standard basis like DCT, or a learned dictionary. Basis pursuit. Matching Pursuit. Orthogonal Matching Pursuit. Foreground/background separation in video. Singular Value Decomposition. Can one do foreground separation of audio in a similar manner?

Compressive Data Acquisition. Replace sampling at Nyquist followed by compression with sampling matrices. Surprising result: Random matrices work.


PDF approximations

How rough/fast an approximation of the Normal PDF can be used in Gaussian methods?

  • Naive Bayes classification
  • Gaussian Mixture model (GMM)
  • Hidden Markov Model (HMM)
  • Kalman filtering
  • RBF/Gaussian kernel in kernel SVM/logits

It looks like the logarithm of normpdf() can be easily simplified to the quadratic function, see embayes: Quadratic function in Gaussian Naive Bayes.
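
The simplification follows from log N(x; mean, std) = -(x - mean)^2 / (2 std^2) - 0.5 log(2 pi std^2), which is a quadratic a*x^2 + b*x + c in x. The sketch below precomputes the three coefficients per feature and class; the function names are hypothetical and the code is not taken from embayes.

```python
import math

def gaussian_logpdf_coefficients(mean, std):
    """Coefficients (a, b, c) such that log N(x; mean, std) == a*x*x + b*x + c."""
    var = std * std
    a = -1.0 / (2.0 * var)
    b = mean / var
    c = -(mean * mean) / (2.0 * var) - 0.5 * math.log(2.0 * math.pi * var)
    return a, b, c

def logpdf_quadratic(x, coefficients):
    """Evaluate the log-PDF with two multiplies and two adds (Horner form)."""
    a, b, c = coefficients
    return (a * x + b) * x + c

def logpdf_reference(x, mean, std):
    """Direct evaluation via exp() and log(), for comparison."""
    pdf = math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))
    return math.log(pdf)

# Usage
coefficients = gaussian_logpdf_coefficients(mean=1.0, std=2.0)
fast = logpdf_quadratic(0.3, coefficients)
slow = logpdf_reference(0.3, 1.0, 2.0)
```

Since Naive Bayes only compares sums of log-likelihoods across classes, no exp() or log() is needed at inference time once the coefficients are stored.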

Can this be applied for SVM with RBF kernel also? RBF: k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2). Does not seem like it, at least not with the kernel trick. No way to split the sum in the decision function: sum(z_i * y_i * k(x_i, x))



Avoid anthropomorphic language

Ask “what worked?” and “why?”, rather than just “how well?” Strongest empirical papers include:

  • error analysis
  • ablation studies
  • robustness checks

"Would I rely on this explanation for making predictions or for getting a system to work?"

ML Paper Checklist


Paper publishing


  • IJASUC: International Journal of Ad hoc, Sensor & Ubiquitous Computing. Open-access, bi-monthly. CFP example
  • SDAP: Smart Devices, Applications, and Protocols for the IoT. (bi)Yearly.

Research questions

  • How to take inference cost into account in model/feature evaluation/selection? (time,energy)

Especially with computationally heavy features, that one can generate lots of, e.g. a dictionary of convolutional kernels. Perhaps set the desired score as a hyperparameter, and then optimize for inference cost? Alternatively set an inference cost budget, and find the best model that fits. Using standard FS wrapper methods (and uniform feature costs): do SBS/SFS to obtain a score/feature mapping, apply an inference cost function, choose according to budget. Or (with methods robust to irrelevant/redundant features), estimate the number of features within budget. Could one implement model/feature searches that take this into account? Could be first feature then model. Or joint optimization? Does it perform better than other model/feature selection methods? Or is it more practical (ease of use)? Non-uniform feature-calculation costs, e.g. different sized convolution kernels, convolutions in different layers. Classifier hyper-parameters influencing prediction time, e.g. RandomForest min_samples_split/min_samples_leaf/max_depth. Need to specify a cost function: number of features, typical/average depth in tree based methods, number of layers in CNN-like architecture.
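
The two selection strategies above (best score within a cost budget, or cheapest model reaching a target score) can be sketched as simple filters over already-evaluated candidates. The candidate names, scores and costs below are made-up illustration values, and the cost unit is left abstract.

```python
def best_within_budget(candidates, budget):
    """Highest-scoring candidate whose inference cost fits the budget, or None."""
    feasible = [c for c in candidates if c["cost"] <= budget]
    return max(feasible, key=lambda c: c["score"]) if feasible else None

def cheapest_reaching(candidates, target_score):
    """Lowest-cost candidate reaching the target score, or None."""
    good_enough = [c for c in candidates if c["score"] >= target_score]
    return min(good_enough, key=lambda c: c["cost"]) if good_enough else None

# Usage with hypothetical evaluation results (cost in arbitrary units, e.g. ms per inference)
candidates = [
    {"name": "4 features", "score": 0.81, "cost": 2.0},
    {"name": "8 features", "score": 0.88, "cost": 5.0},
    {"name": "16 features", "score": 0.90, "cost": 12.0},
]
by_budget = best_within_budget(candidates, budget=6.0)
by_score = cheapest_reaching(candidates, target_score=0.85)
```

With non-uniform feature costs, only the computation of the `cost` field changes; the selection step stays the same.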

  • How to optimize/adapt existing machine learning methods for use on small CPUs.

Memory usage, CPU, prediction time. RandomForest on-demand memoized computation of non-trivial features? Approximation of RBF kernel in SVM? PDF simplification in Naive Bayes. (Recurrent) Convolutional Neural Network. How to compress models efficiently? How to conserve memory (and compute) by minimizing the receptive field of the network? Feature selection techniques. Sparsity constraints. Can something like feature permutation be used to find/eliminate irrelevant features? In frequency domain. In time domain. How to select the right resolution, to minimize compute. Frequency domain (filterbands). Time domain (window size, overlap). Network architecture search. Constrained by compute resources. Tool(s) for reasoning about computational efficiency of CNN. Constraint/solver based. Give base architecture and constraints like Nparameters,FLOPs => produce model variations that match. Could also just do random mutations (within ranges), check the FLOPs/parameter count, and filter those not matching?
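
The last idea (random mutations, then filtering on parameter and operation counts) could look roughly like the sketch below. It assumes a plain stack of 2D convolutions with stride 1 and 'same' padding, counts multiply-accumulates rather than FLOPs, and uses arbitrary mutation ranges; all names and defaults are hypothetical.

```python
import random

def conv_stack_cost(layers, in_channels, height, width):
    """Parameter and multiply-accumulate counts for a stack of conv layers.

    layers: list of (kernel_size, out_channels); stride 1 and 'same' padding assumed,
    so the spatial size stays height x width throughout.
    """
    params = macs = 0
    c_in = in_channels
    for kernel, c_out in layers:
        params += kernel * kernel * c_in * c_out + c_out       # weights + biases
        macs += kernel * kernel * c_in * c_out * height * width
        c_in = c_out
    return params, macs

def mutate(layers, rng):
    """Randomly change kernel size and channel count of every layer (within ranges)."""
    return [(rng.choice([1, 3, 5]), max(1, c_out + rng.randint(-8, 8)))
            for _, c_out in layers]

def search(base, n_trials, max_params, max_macs,
           in_channels=1, height=32, width=32, seed=0):
    """Generate random variations of base and keep those within the constraints."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_trials):
        candidate = mutate(base, rng)
        params, macs = conv_stack_cost(candidate, in_channels, height, width)
        if params <= max_params and macs <= max_macs:
            kept.append((candidate, params, macs))
    return kept

# Usage
base = [(3, 16), (3, 32)]
variants = search(base, n_trials=50, max_params=5000, max_macs=5_000_000)
```

The surviving variants would still need to be trained and scored; this step only prunes architectures that cannot fit the target device.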