Review Papers


2019



2018 ASPDAC
  • ReGAN: A Pipelined ReRAM-Based Accelerator for Generative Adversarial Networks. (University of Pittsburgh, Duke) [Paper]
  • Accelerator-centric Deep Learning Systems for Enhanced Scalability, Energy-efficiency, and Programmability. (POSTECH)
  • Architectures and Algorithms for User Customization of CNNs. (Seoul National University, Samsung) [Paper]
  • Optimizing FPGA-based Convolutional Neural Networks Accelerator for Image Super-Resolution. (Sogang University) [Paper]
  • Running sparse and low-precision neural network: when algorithm meets hardware. (Duke)
2018 ISSCC
  • A 55nm Time-Domain Mixed-Signal Neuromorphic Accelerator with Stochastic Synapses and Embedded Reinforcement Learning for Autonomous Micro-Robots. (Georgia Tech) [Paper]
  • A Shift Towards Edge Machine-Learning Processing. (Google)
  • QUEST: A 7.49TOPS Multi-Purpose Log-Quantized DNN Inference Engine Stacked on 96MB 3D SRAM Using Inductive-Coupling Technology in 40nm CMOS. (Hokkaido University, Ultra Memory, Keio University)
  • UNPU: A 50.6TOPS/W Unified Deep Neural Network Accelerator with 1b-to-16b Fully-Variable Weight Bit-Precision. (KAIST)
  • A 9.02mW CNN-Stereo-Based Real-Time 3D Hand-Gesture Recognition Processor for Smart Mobile Devices. (KAIST)
  • An Always-On 3.8μJ/86% CIFAR-10 Mixed-Signal Binary CNN Processor with All Memory on Chip in 28nm CMOS. (Stanford, KU Leuven)
  • Conv-RAM: An Energy-Efficient SRAM with Embedded Convolution Computation for Low-Power CNN-Based Machine Learning Applications. (MIT) [Paper]
  • A 42pJ/Decision 3.12TOPS/W Robust In-Memory Machine Learning Classifier with On-Chip Training. (UIUC)
  • Brain-Inspired Computing Exploiting Carbon Nanotube FETs and Resistive RAM: Hyperdimensional Computing Case Study. (Stanford, UC Berkeley, MIT) [Paper]
  • A 65nm 1Mb Nonvolatile Computing-in-Memory ReRAM Macro with Sub-16ns Multiply-and-Accumulate for Binary DNN AI Edge Processors. (NTHU) [Paper]
  • A 65nm 4Kb Algorithm-Dependent Computing-in-Memory SRAM Unit Macro with 2.3ns and 55.8TOPS/W Fully Parallel Product-Sum Operation for Binary DNN Edge Processors. (NTHU, TSMC, UESTC, ASU)
  • A 1μW Voice Activity Detector Using Analog Feature Extraction and Digital Deep Neural Network. (Columbia University)
2018 HPCA
  • Making Memristive Neural Network Accelerators Reliable. (University of Rochester) [Paper]
  • Towards Efficient Microarchitectural Design for Accelerating Unsupervised GAN-based Deep Learning. (University of Florida) [Paper]
  • Compressing DMA Engine: Leveraging Activation Sparsity for Training Deep Neural Networks. (POSTECH, NVIDIA, UT-Austin) [Paper]
  • In-situ AI: Towards Autonomous and Incremental Deep Learning for IoT Systems. (University of Florida, Chongqing University, Capital Normal University) [Paper]
2018 ASPLOS
  • Bridging the Gap Between Neural Networks and Neuromorphic Hardware with A Neural Network Compiler. (Tsinghua, UCSB) [Paper]
  • MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects. (Georgia Tech) [Paper]
    • Higher PE utilization: Use an augmented reduction tree (reconfigurable interconnects) to construct arbitrarily sized virtual neurons (see the sketch after this list).
  • VIBNN: Hardware Acceleration of Bayesian Neural Networks. (Syracuse University, USC) [Paper]
  • Exploiting Dynamical Thermal Energy Harvesting for Reusing in Smartphone with Mobile Applications. (Guizhou University, University of Florida)
  • Potluck: Cross-application Approximate Deduplication for Computation-Intensive Mobile Applications. (Yale) [Paper]
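
To make the MAERI bullet above concrete, here is a minimal Python sketch (my own illustration, not the paper's design) of grouping a fixed pool of multiplier lanes into arbitrarily sized virtual neurons through a configurable reduction step; all function names are made up.

```python
# Hypothetical sketch: grouping a fixed pool of multiplier lanes into
# arbitrarily sized "virtual neurons" via a configurable reduction step,
# in the spirit of MAERI's augmented reduction tree (names are illustrative).

def configure_virtual_neurons(num_lanes, neuron_sizes):
    """Split the multiplier lanes into contiguous segments, one per virtual neuron."""
    assert sum(neuron_sizes) <= num_lanes, "mapping exceeds available lanes"
    segments, start = [], 0
    for size in neuron_sizes:
        segments.append(range(start, start + size))
        start += size
    return segments

def run_cycle(weights, activations, segments):
    """One cycle: every lane multiplies, then each segment reduces to one partial sum."""
    products = [w * a for w, a in zip(weights, activations)]
    return [sum(products[i] for i in seg) for seg in segments]

if __name__ == "__main__":
    # 8 lanes mapped as two virtual neurons of size 3 and one of size 2.
    segs = configure_virtual_neurons(8, [3, 3, 2])
    psums = run_cycle([1, 2, 3, 4, 5, 6, 7, 8],
                      [1, 1, 1, 1, 1, 1, 1, 1], segs)
    print(psums)  # [6, 15, 15] -> one partial sum per virtual neuron
```
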
2018 VLSI
  • STICKER: A 0.41‐62.1 TOPS/W 8bit Neural Network Processor with Multi‐Sparsity Compatible Convolution Arrays and Online Tuning Acceleration for Fully Connected Layers. (THU)
  • 2.9TOPS/W Reconfigurable Dense/Sparse Matrix‐Multiply Accelerator with Unified INT8/INT16/FP16 Datapath in 14nm Tri‐gate CMOS. (Intel)
  • A Scalable Multi‐TeraOPS Deep Learning Processor Core for AI Training and Inference. (IBM) [Paper]
  • An Ultra‐high Energy‐efficient reconfigurable Processor for Deep Neural Networks with Binary/Ternary Weights in 28nm CMOS. (THU)
  • B‐Face: 0.2 mW CNN‐Based Face Recognition Processor with Face Alignment for Mobile User Identification. (KAIST)
  • A 141 uW, 2.46 pJ/Neuron Binarized Convolutional Neural Network based Self-learning Speech Recognition Processor in 28nm CMOS. (THU)
  • A Mixed‐Signal Binarized Convolutional‐Neural-Network Accelerator Integrating Dense Weight Storage and Multiplication for Reduced Data Movement. (Princeton)
  • PhaseMAC: A 14 TOPS/W 8bit GRO based Phase Domain MAC Circuit for In‐Sensor‐Computed Deep Learning Accelerators. (Toshiba)
2018 FPGA
  • C-LSTM: Enabling Efficient LSTM using Structured Compression Techniques on FPGAs. (Peking Univ, Syracuse Univ, CUNY) [paper]
  • DeltaRNN: A Power-efficient Recurrent Neural Network Accelerator. (ETHZ, BenevolentAI) [paper]
  • Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA. (National Univ of Defense Tech) [Chain-NN]
  • A Customizable Matrix Multiplication Framework for the Intel HARPv2 Xeon+FPGA Platform - A Deep Learning Case Study. (The Univ of Sydney, Intel) [ppt]
  • A Framework for Generating High Throughput CNN Implementations on FPGAs. (USC) [Paper]
  • Liquid Silicon: A Data-Centric Reconfigurable Architecture enabled by RRAM Technology. (UW Madison)
2018 ISCA
  • RANA: Towards Efficient Neural Acceleration with Refresh-Optimized Embedded DRAM. (THU) [ppt]
  • Brainwave: A Configurable Cloud-Scale DNN Processor for Real-Time AI. (Microsoft) [Paper] [Paper]
  • PROMISE: An End-to-End Design of a Programmable Mixed-Signal Accelerator for Machine Learning Algorithms. (UIUC) [ppt]
  • Computation Reuse in DNNs by Exploiting Input Similarity. (UPC) [ppt]
  • GANAX: A Unified SIMD-MIMD Acceleration for Generative Adversarial Network. (Georgia Tech, IPM, Qualcomm, UCSD, UIUC) [paper]
  • SnaPEA: Predictive Early Activation for Reducing Computation in Deep Convolutional Neural Networks. (UCSD, Georgia Tech, Qualcomm) [paper]
  • UCNN: Exploiting Computational Reuse in Deep Neural Networks via Weight Repetition. (UIUC, NVIDIA) [paper]
  • An Energy-Efficient Neural Network Accelerator based on Outlier-Aware Low Precision Computation. (Seoul National)
  • Prediction based Execution on Deep Neural Networks. (Florida) [paper]
  • Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Networks. (Georgia Tech, ARM, UCSD) [paper]
  • Gist: Efficient Data Encoding for Deep Neural Network Training. (Michigan, Microsoft, Toronto)
  • The Dark Side of DNN Pruning. (UPC) [paper]
  • Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks. (Michigan) [paper]
  • EVA^2: Exploiting Temporal Redundancy in Live Computer Vision. (Cornell) [arxiv]
  • Euphrates: Algorithm-SoC Co-Design for Low-Power Mobile Continuous Vision. (Rochester, Georgia Tech, ARM) [paper]
  • Feature-Driven and Spatially Folded Digital Neurons for Efficient Spiking Neural Network Simulations. (POSTECH/Berkeley, Seoul National)
  • Space-Time Algebra: A Model for Neocortical Computation. (Wisconsin) [paper]
  • Scaling Datacenter Accelerators With Compute-Reuse Architectures. (Princeton) [paper]
    • Add an NVM-based storage layer to the accelerator for computation reuse (a minimal memoization sketch follows this list).
  • Enabling Scientific Computing on Memristive Accelerators. (Rochester) [paper]
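
A minimal sketch of the computation-reuse idea in the Princeton entry above: results of expensive kernel invocations are memoized and looked up by a hash of their inputs. The in-memory dict merely stands in for the NVM-based storage layer; names and structure are my own illustration.

```python
# Hypothetical sketch of compute reuse: results of expensive accelerator calls
# are memoized in a (here in-memory) table standing in for the NVM storage layer.

import hashlib
import pickle

class ReuseStore:
    def __init__(self):
        self.table = {}          # stands in for the NVM-based storage layer
        self.hits = self.misses = 0

    def _key(self, kernel_name, args):
        return hashlib.sha256(pickle.dumps((kernel_name, args))).hexdigest()

    def compute(self, kernel_name, args, kernel_fn):
        key = self._key(kernel_name, args)
        if key in self.table:            # reuse a previously computed result
            self.hits += 1
            return self.table[key]
        self.misses += 1
        result = kernel_fn(*args)        # fall back to the actual computation
        self.table[key] = result
        return result

if __name__ == "__main__":
    store = ReuseStore()
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    print(store.compute("dot", ((1, 2, 3), (4, 5, 6)), dot))  # computed: 32
    print(store.compute("dot", ((1, 2, 3), (4, 5, 6)), dot))  # reused: 32
    print(store.hits, store.misses)                           # 1 1
```
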
2018 DATE
  • MATIC: Learning Around Errors for Efficient Low-Voltage Neural Network Accelerators. (University of Washington) [paper]
    • Learn around errors resulting from SRAM voltage scaling, demonstrated on a fabricated 65nm test chip.
  • Maximizing System Performance by Balancing Computation Loads in LSTM Accelerators. (POSTECH)
    • A sparse matrix format that load-balances computation across processing elements, demonstrated for LSTMs (see the sketch after this list).
  • CCR: A Concise Convolution Rule for Sparse Neural Network Accelerators. (CAS)
    • Decompose convolution into multiple dense and zero kernels for sparsity savings.
  • Block Convolution: Towards Memory-Efficient Inference of Large-Scale CNNs on FPGA. (CAS)
  • moDNN: Memory Optimal DNN Training on GPUs. (University of Notre Dame, CAS)
  • HyperPower: Power and Memory-Constrained Hyper-Parameter Optimization for Neural Networks. (CMU, Google) [paper]
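
For the POSTECH load-balancing entry above, a toy sketch of one way to balance sparse-matrix work across processing elements by dealing nonzeros out round-robin. This is an assumption-laden illustration, not the paper's actual format.

```python
# Hypothetical sketch of load balancing a sparse weight matrix across PEs:
# nonzeros are dealt out round-robin so every PE receives a similar amount of work.

def balance_nonzeros(sparse_matrix, num_pes):
    """sparse_matrix: list of rows, each row a list of (col, value) nonzeros."""
    pe_work = [[] for _ in range(num_pes)]
    nnz = 0
    for r, row in enumerate(sparse_matrix):
        for c, v in row:
            pe_work[nnz % num_pes].append((r, c, v))  # round-robin assignment
            nnz += 1
    return pe_work

if __name__ == "__main__":
    W = [[(0, 0.5), (3, -1.0)],           # row 0: 2 nonzeros
         [(1, 2.0)],                      # row 1: 1 nonzero
         [(0, 0.1), (2, 0.2), (3, 0.3)]]  # row 2: 3 nonzeros
    for pe_id, work in enumerate(balance_nonzeros(W, 2)):
        print(pe_id, len(work), work)     # each PE ends up with 3 nonzeros
```
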
2018 DAC
  • Compensated-DNN: Energy Efficient Low-Precision Deep Neural Networks by Compensating Quantization Errors. (Best Paper, Purdue, IBM)
    • Introduce a new fixed-point representation, Fixed Point with Error Compensation (FPEC): computation bits plus compensation bits that represent the quantization error.
    • Propose a low-overhead sparse compensation scheme to estimate the error in the MAC design (an illustrative sketch follows this list).
  • Calibrating Process Variation at System Level with In-Situ Low-Precision Transfer Learning for Analog Neural Network Processors. (THU)
  • DPS: Dynamic Precision Scaling for Stochastic Computing-Based Deep Neural Networks. (UNIST)
  • DyHard-DNN: Even More DNN Acceleration With Dynamic Hardware Reconfiguration. (Univ. of Virginia)
  • Exploring the Programmability for Deep Learning Processors: from Architecture to Tensorization. (Univ. of Washington)
  • LCP: Layer Clusters Paralleling Mapping Mechanism for Accelerating Inception and Residual Networks on FPGA. (THU)
  • A Kernel Decomposition Architecture for Binary-weight Convolutional Neural Networks. (THU)
  • Ares: A Framework for Quantifying the Resilience of Deep Neural Networks. (Harvard)
  • ThUnderVolt: Enabling Aggressive Voltage Underscaling and Timing Error Resilience for Energy Efficient Deep Learning Accelerators. (New York Univ., IIT Kanpur)
  • Loom: Exploiting Weight and Activation Precisions to Accelerate Convolutional Neural Networks. (Univ. of Toronto)
  • Parallelizing SRAM Arrays with Customized Bit-Cell for Binary Neural Networks. (Arizona)
  • Thermal-Aware Optimizations of ReRAM-Based Neuromorphic Computing Systems. (Northwestern Univ.)
  • SNrram: An Efficient Sparse Neural Network Computation Architecture Based on Resistive Random Access Memory. (THU, UCSB)
  • Long Live TIME: Improving Lifetime for Training-In-Memory Engines by Structured Gradient Sparsification. (THU, CAS, MIT)
  • Bandwidth-Efficient Deep Learning. (MIT, Stanford)
  • Co-Design of Deep Neural Nets and Neural Net Accelerators for Embedded Vision Applications. (Berkeley)
  • Sign-Magnitude SC: Getting 10X Accuracy for Free in Stochastic Computing for Deep Neural Networks. (UNIST)
  • DrAcc: A DRAM Based Accelerator for Accurate CNN Inference. (National Univ. of Defense Technology, Indiana Univ., Univ. of Pittsburgh)
  • On-Chip Deep Neural Network Storage With Multi-Level eNVM. (Harvard)
  • VRL-DRAM: Improving DRAM Performance via Variable Refresh Latency. (Drexel Univ., ETHZ)
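
To illustrate the Compensated-DNN bullet at the top of this list, here is a rough sketch of quantizing weights into low-precision computation bits plus a few compensation bits that capture the quantization error. The encoding below is a simplification and does not follow the paper's exact FPEC format.

```python
# Illustrative sketch of fixed-point quantization with a few extra
# "compensation" bits that capture an estimate of the quantization error.

def quantize(x, frac_bits):
    step = 2.0 ** -frac_bits
    q = round(x / step) * step
    return q, x - q                      # quantized value and residual error

def encode_fpec(w, frac_bits=4, comp_bits=2):
    q, err = quantize(w, frac_bits)
    # compensation field: the residual error quantized even more coarsely
    comp, _ = quantize(err, frac_bits + comp_bits)
    return q, comp

def dot(weights, acts, frac_bits=4, comp_bits=2):
    plain = comp_sum = 0.0
    for w, a in zip(weights, acts):
        q, c = encode_fpec(w, frac_bits, comp_bits)
        plain += q * a                   # main MAC on computation bits
        comp_sum += c * a                # low-overhead error compensation
    return plain + comp_sum

if __name__ == "__main__":
    ws, xs = [0.33, -0.71, 0.05], [1.0, 2.0, 3.0]
    exact = sum(w * x for w, x in zip(ws, xs))
    print(exact, dot(ws, xs))            # compensated result tracks the exact one
```
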
2018 HotChips
  • ARM's First Generation ML Processor. (ARM) [ppt]
  • The NVIDIA Deep Learning Accelerator. (NVIDIA) [ppt]
  • Xilinx Tensor Processor: An Inference Engine, Network Compiler + Runtime for Xilinx FPGAs. (Xilinx) [ppt]
  • Tachyum Cloud Chip for Hyperscale workloads, deep ML, general, symbolic and bio AI. (Tachyum) [ppt]
  • SMIV: A 16nm SoC with Efficient and Flexible DNN Acceleration for Intelligent IoT Devices. (ARM) [ppt]
  • NVIDIA's Xavier System-on-Chip. (NVIDIA) [ppt]
  • Xilinx Project Everest: HW/SW Programmable Engine. (Xilinx) [ppt]
2018 ICCAD
  • Tetris: Re-architecting Convolutional Neural Network Computation for Machine Learning Accelerators. (CAS) [paper]
  • 3DICT: A Reliable and QoS Capable Mobile Process-In-Memory Architecture for Lookup-based CNNs in 3D XPoint ReRAMs. (Indiana University Bloomington, Florida International Univ.)
  • TGPA: Tile-Grained Pipeline Architecture for Low Latency CNN Inference. (PKU, UCLA, Falcon)
  • NID: Processing Binary Convolutional Neural Network in Commodity DRAM. (KAIST)
  • Adaptive-Precision Framework for SGD using Deep Q-Learning. (PKU) [paper]
  • Efficient Hardware Acceleration of CNNs using Logarithmic Data Representation with Arbitrary log-base. (Robert Bosch GmbH)
  • C-GOOD: C-code Generation Framework for Optimized On-device Deep Learning. (SNU)
  • Mixed Size Crossbar based RRAM CNN Accelerator with Overlapped Mapping Method. (THU) [paper]
  • FCN-Engine: Accelerating Deconvolutional Layers in Classic CNN Processors. (HUT, CAS, NUS)
  • DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs. (UIUC)
  • DIMA: A Depthwise CNN In-Memory Accelerator. (Univ. of Central Florida)
  • EMAT: An Efficient Multi-Task Architecture for Transfer Learning using ReRAM. (Duke)
  • FATE: Fast and Accurate Timing Error Prediction Framework for Low Power DNN Accelerator Design. (NYU) [paper]
  • Designing Adaptive Neural Networks for Energy-Constrained Image Classification. (CMU) [paper]
  • Watermarking Deep Neural Networks for Embedded Systems. (UCLA)
  • Defensive Dropout for Hardening Deep Neural Networks under Adversarial Attacks. (Northeastern Univ., Boston Univ., Florida International Univ.)
  • A Cross-Layer Methodology for Design and Optimization of Networks in 2.5D Systems. (Boston Univ., UCSD)
2018 MICRO
  • Addressing Irregularity in Sparse Neural Networks: A Cooperative Software/Hardware Approach. (USTC, CAS)
  • Diffy: a Deja vu-Free Differential Deep Neural Network Accelerator. (University of Toronto)
  • Beyond the Memory Wall: A Case for Memory-centric HPC System for Deep Learning. (KAIST) [paper]
  • Towards Memory Friendly Long-Short Term Memory Networks (LSTMs) on Mobile GPUs. (University of Houston, Capital Normal University)
  • A Network-Centric Hardware/Algorithm Co-Design to Accelerate Distributed Training of Deep Neural Networks. (UIUC, THU, SJTU, Intel, UCSD) [paper]
  • PermDNN: Efficient Compressed Deep Neural Network Architecture with Permuted Diagonal Matrices. (City University of New York, University of Minnesota, USC)
  • GeneSys: Enabling Continuous Learning through Neural Network Evolution in Hardware. (Georgia Tech) [arxiv]
  • Processing-in-Memory for Energy-efficient Neural Network Training: A Heterogeneous Approach. (UCM, UCSD, UCSC) [paper]
  • LerGAN: A Zero-free, Low Data Movement and PIM-based GAN Architecture. (THU, University of Florida)
  • Multi-dimensional Parallel Training of Winograd Layer through Distributed Near-Data Processing. (KAIST) [paper]
  • SCOPE: A Stochastic Computing Engine for DRAM-based In-situ Accelerator. (UCSB, Samsung)
  • Morph: Flexible Acceleration for 3D CNN-based Video Understanding. (UIUC) [paper]
  • Inter-thread Communication in Multithreaded, Reconfigurable Coarse-grain Arrays. (Technion) [paper]
  • An Architectural Framework for Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware. (Cornell) [ppt]

2014 ASPLOS
  • DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning. (CAS, Inria) [paper]
2014 MICRO
  • DaDianNao: A Machine-Learning Supercomputer. (CAS, Inria, Inner Mongolia University) [paper]
2015 ISCA
  • ShiDianNao: Shifting Vision Processing Closer to the Sensor. (CAS, EPFL, Inria) [paper]
2015 ASPLOS
  • PuDianNao: A Polyvalent Machine Learning Accelerator. (CAS, USTC, Inria)
2015 FPGA
  • Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks. (Peking University, UCLA) [paper]
2015 DAC
  • Reno: A Highly-Efficient Reconfigurable Neuromorphic Computing Accelerator Design. (University of Pittsburgh, Tsinghua University, San Francisco State University, Air Force Research Laboratory, University of Massachusetts)
  • Scalable Effort Classifiers for Energy Efficient Machine Learning. (Purdue University, Microsoft Research) [paper]
  • Design Methodology for Operating in Near-Threshold Computing (NTC) Region. (AMD)
  • Opportunistic Turbo Execution in NTC: Exploiting the Paradigm Shift in Performance Bottlenecks. (Utah State University)
2016 DAC
  • DeepBurning: Automatic Generation of FPGA-based Learning Accelerators for the Neural Network Family. (Chinese Academy of Sciences)
    • Hardware generator: Basic building blocks for neural networks, and an address generation unit (RTL).
    • Compiler: Dynamic control flow (configurations for different models), and data layout in memory.
    • Simply report their framework and describe some stages.
  • C-Brain: A Deep Learning Accelerator that Tames the Diversity of CNNs through Adaptive Data-Level Parallelization. (Chinese Academy of Sciences) [paper]
  • Simplifying Deep Neural Networks for Neuromorphic Architectures. (Incheon National University)
  • Dynamic Energy-Accuracy Trade-off Using Stochastic Computing in Deep Neural Networks. (Samsung, Seoul National University, Ulsan National Institute of Science and Technology)
  • Optimal Design of JPEG Hardware under the Approximate Computing Paradigm. (University of Minnesota, TAMU) [paper]
  • Perform-ML: Performance Optimized Machine Learning by Platform and Content Aware Customization. (Rice University, UCSD) [paper]
  • Low-Power Approximate Convolution Computing Unit with Domain-Wall Motion Based “Spin-Memristor” for Image Processing Applications. (Purdue University)
  • Cross-Layer Approximations for Neuromorphic Computing: From Devices to Circuits and Systems. (Purdue University)
  • Switched by Input: Power Efficient Structure for RRAM-based Convolutional Neural Network. (Tsinghua University) [paper]
  • A 2.2 GHz SRAM with High Temperature Variation Immunity for Deep Learning Application under 28nm. (UCLA, Bell Labs)
2016 ISSCC
  • A 1.42TOPS/W Deep Convolutional Neural Network Recognition Processor for Intelligent IoE Systems. (KAIST)
  • Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. (MIT, NVIDIA)
  • A 126.1mW Real-Time Natural UI/UX Processor with Embedded Deep Learning Core for Low-Power Smart Glasses Systems. (KAIST)
  • A 502GOPS and 0.984mW Dual-Mode ADAS SoC with RNN-FIS Engine for Intention Prediction in Automotive Black-Box System. (KAIST)
  • A 0.55V 1.1mW Artificial-Intelligence Processor with PVT Compensation for Micro Robots. (KAIST)
  • A 4Gpixel/s 8/10b H.265/HEVC Video Decoder Chip for 8K Ultra HD Applications. (Waseda University)
2016 ISCA
  • Cnvlutin: Ineffectual-Neuron-Free Deep Convolutional Neural Network Computing. (University of Toronto, University of British Columbia)
  • EIE: Efficient Inference Engine on Compressed Deep Neural Network. (Stanford University, Tsinghua University)
  • Minerva: Enabling Low-Power, High-Accuracy Deep Neural Network Accelerators. (Harvard University)
  • Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks. (MIT, NVIDIA)
    • Present an energy analysis framework.
    • Propose an energy-efficient dataflow called Row Stationary, which considers three levels of reuse (a toy sketch follows this list).
  • Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory. (Georgia Institute of Technology, SRI International)
    • Propose an architecture integrated in 3D DRAM, with a mesh-like NOC in the logic layer.
    • Describe the data movements in the NoC in detail.
  • ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars. (University of Utah, HP Labs)
  • A Novel Processing-in-memory Architecture for Neural Network Computation in ReRAM-based Main Memory. (UCSB, HP Labs, NVIDIA, Tsinghua University)
  • RedEye: Analog ConvNet Image Sensor Architecture for Continuous Mobile Vision. (Rice University)
  • Cambricon: An Instruction Set Architecture for Neural Networks. (Chinese Academy of Sciences, UCSB)
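
A toy Python sketch of the Row Stationary idea from the Eyeriss entry above: each "PE" keeps one filter row resident and performs a 1D convolution over one input row, and partial sums from a column of such PEs form one output row. This is only a functional illustration of the dataflow, not the accelerator's actual mapping details.

```python
# Minimal, purely illustrative sketch of the Row Stationary idea: each "PE"
# keeps one filter row resident and slides it over one input row (1D conv);
# partial sums from R such PEs are added to form one output row.

def pe_1d_conv(filter_row, input_row):
    """One PE: the filter row stays put, the input row streams through."""
    out_w = len(input_row) - len(filter_row) + 1
    return [sum(f * input_row[x + i] for i, f in enumerate(filter_row))
            for x in range(out_w)]

def conv2d_row_stationary(filt, image):
    """Valid 2D convolution built from per-row PEs plus cross-PE accumulation."""
    R, H = len(filt), len(image)
    out = []
    for y in range(H - R + 1):                    # one PE column per output row
        psums = [pe_1d_conv(filt[r], image[y + r]) for r in range(R)]
        out.append([sum(col) for col in zip(*psums)])
    return out

if __name__ == "__main__":
    image = [[1, 2, 3, 4],
             [5, 6, 7, 8],
             [9, 10, 11, 12],
             [13, 14, 15, 16]]
    filt = [[1, 0, -1]] * 3
    print(conv2d_row_stationary(filt, image))     # [[-6, -6], [-6, -6]]
```
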
2016 DATE
  • The Neuro Vector Engine: Flexibility to Improve Convolutional Network Efficiency for Wearable Vision. (Eindhoven University of Technology, Soochow University, TU Berlin) [pdf]
    • Propose a SIMD accelerator for CNNs.
  • Efficient FPGA Acceleration of Convolutional Neural Networks Using Logical-3D Compute Array. (UNIST, Seoul National University)
    • The compute tile is organized along three dimensions: Tm, Tr, and Tc (see the loop-nest sketch after this list).
  • NEURODSP: A Multi-Purpose Energy-Optimized Accelerator for Neural Networks. (CEA LIST)
  • MNSIM: Simulation Platform for Memristor-Based Neuromorphic Computing System. (Tsinghua University, UCSB, Arizona State University) [pdf]
  • Accelerated Artificial Neural Networks on FPGA for Fault Detection in Automotive Systems. (Nanyang Technological University, University of Warwick) [pdf]
  • Significance Driven Hybrid 8T-6T SRAM for Energy-Efficient Synaptic Storage in Artificial Neural Networks. (Purdue University) [pdf]
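
For the Logical-3D compute-array entry above, a loop-nest sketch showing a compute tile organized along Tm output channels, Tr output rows, and Tc output columns. Tile sizes and the conv_tiled helper are illustrative assumptions, not the paper's parameters.

```python
# Illustrative loop-nest sketch of a compute tile organized along three
# dimensions -- Tm output channels, Tr output rows, Tc output columns.

import numpy as np

def conv_tiled(inp, wts, Tm=2, Tr=2, Tc=2):
    """inp: [N, H, W], wts: [M, N, K, K]; valid conv, unit stride."""
    N, H, W = inp.shape
    M, _, K, _ = wts.shape
    OH, OW = H - K + 1, W - K + 1
    out = np.zeros((M, OH, OW))
    for m0 in range(0, M, Tm):                 # tile over output channels
        for r0 in range(0, OH, Tr):            # tile over output rows
            for c0 in range(0, OW, Tc):        # tile over output columns
                # the Tm*Tr*Tc outputs inside one tile map to parallel units in hardware
                for m in range(m0, min(m0 + Tm, M)):
                    for r in range(r0, min(r0 + Tr, OH)):
                        for c in range(c0, min(c0 + Tc, OW)):
                            out[m, r, c] = np.sum(wts[m] * inp[:, r:r+K, c:c+K])
    return out

if __name__ == "__main__":
    x = np.arange(2 * 5 * 5, dtype=float).reshape(2, 5, 5)
    w = np.ones((3, 2, 3, 3))
    ref = conv_tiled(x, w, Tm=3, Tr=5, Tc=5)   # degenerate single tile
    print(np.allclose(conv_tiled(x, w), ref))  # True: tiling doesn't change results
```
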
2016 FPGA
  • Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. [Slides][Demo] (Tsinghua University, MSRA)
    • The first work I have seen that runs the entire CNN flow, including both CONV and FC layers.
    • Point out that CONV layers are computation-centric, while FC layers are memory-centric.
    • The FPGA runs VGG16-SVD without reconfiguring its resources, but the convolver can only support k=3.
    • Dynamic-precision data quantization is creative, but it is not implemented in hardware (a minimal sketch of the idea follows this list).
  • Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks. [Slides] (Arizona State Univ, ARM)
    • Spatially allocate FPGA's resources to CONV/POOL/NORM/FC layers.
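
A hedged sketch of per-layer dynamic-precision fixed-point quantization as described in the Going Deeper entry above: for a fixed word length, search the fractional bit width that minimizes quantization error layer by layer. The search below is a simplification, not the paper's exact flow.

```python
# Sketch: for each layer, pick the fractional length (within a fixed word
# length) that minimizes total quantization error of that layer's weights.

import numpy as np

def quantize(x, word_bits, frac_bits):
    step = 2.0 ** -frac_bits
    lo = -(2 ** (word_bits - 1)) * step
    hi = (2 ** (word_bits - 1) - 1) * step
    return np.clip(np.round(x / step) * step, lo, hi)

def best_frac_bits(tensor, word_bits=8):
    """Pick the fractional length with the smallest total quantization error."""
    errors = {f: np.abs(tensor - quantize(tensor, word_bits, f)).sum()
              for f in range(word_bits)}
    return min(errors, key=errors.get)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layers = {"conv1": rng.normal(0, 1.0, 1000),    # wide range -> fewer frac bits
              "fc6":   rng.normal(0, 0.05, 1000)}   # narrow range -> more frac bits
    for name, w in layers.items():
        print(name, "frac bits:", best_frac_bits(w))
```
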
2016 ASPDAC
  • Design Space Exploration of FPGA-Based Deep Convolutional Neural Networks. (UC Davis) [paper]
  • LRADNN: High-Throughput and Energy-Efficient Deep Neural Network Accelerator using Low Rank Approximation. (Hong Kong University of Science and Technology, Shanghai Jiao Tong University) [paper]
  • Efficient Embedded Learning for IoT Devices. (Purdue University)
  • ACR: Enabling Computation Reuse for Approximate Computing. (Chinese Academy of Sciences) [paper]
2016 VLSI
  • A 0.3‐2.6 TOPS/W Precision‐Scalable Processor for Real‐Time Large‐Scale ConvNets. (KU Leuven)
    • Use dynamic precision for different CONV layers, and scale down the MAC array's supply voltage at lower precision.
    • Skip memory fetches and MAC operations by exploiting ReLU-induced sparsity (see the toy sketch after this list).
  • A 1.40mm2 141mW 898GOPS Sparse Neuromorphic Processor in 40nm CMOS. (University of Michigan)
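
A tiny functional sketch of the ReLU-sparsity guarding mentioned in the KU Leuven entry above: when an activation is zero, the corresponding weight fetch and MAC are skipped. Purely illustrative; fetch_weight is a stand-in for a weight-memory access.

```python
# Toy sketch of guarding MACs with ReLU sparsity: when an activation is zero,
# the weight fetch and the multiply are simply skipped.

def sparse_dot(acts, fetch_weight):
    """fetch_weight(i) stands in for a (costly) weight-memory access."""
    acc, fetches, macs = 0.0, 0, 0
    for i, a in enumerate(acts):
        if a == 0.0:          # ReLU produced a zero: skip fetch + MAC
            continue
        acc += fetch_weight(i) * a
        fetches += 1
        macs += 1
    return acc, fetches, macs

if __name__ == "__main__":
    acts = [0.0, 1.5, 0.0, 0.0, 2.0, 0.0]           # 2/6 nonzero after ReLU
    weights = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
    result, fetches, macs = sparse_dot(acts, lambda i: weights[i])
    print(result, fetches, macs)                    # 1.3, 2 fetches, 2 MACs
```
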
2016 ICCAD
  • Efficient Memory Compression in Deep Neural Networks Using Coarse-Grain Sparsification for Speech Applications. (Arizona State University)
  • Memsqueezer: Re-architecting the On-chip memory Sub-system of Deep Learning Accelerator for Embedded Devices. (Chinese Academy of Sciences)
  • Caffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks. (Peking University, UCLA, Falcon) [Paper]
    • Propose a uniformed convolutional matrix-multiplication representation for accelerating CONV and FC layers on FPGA (see the sketch after this list).
    • Propose a weight-major convolutional mapping method for FC layers, which achieves good data reuse, long DRAM access bursts, and high effective bandwidth.
  • BoostNoC: Power Efficient Network-on-Chip Architecture for Near Threshold Computing. (Utah State University)
  • Design of Power-Efficient Approximate Multipliers for Approximate Artificial Neural Network. (Brno University of Technology)
  • Neural Networks Designing Neural Networks: Multi-Objective Hyper-Parameter Optimization. (McGill University) [Paper]
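
To illustrate the Caffeine entry above, a small sketch of the "uniformed" representation idea: CONV layers are lowered to matrix multiplication via im2col, and an FC layer is fed to the same GEMM path as a matrix-vector product. The helpers below are my own simplification, not the paper's weight-major mapping.

```python
# Sketch: CONV lowered to GEMM via im2col, FC handled by the same GEMM path.

import numpy as np

def im2col(inp, K):
    """inp: [C, H, W] -> matrix of flattened K x K patches, one column per output pixel."""
    C, H, W = inp.shape
    OH, OW = H - K + 1, W - K + 1
    cols = [inp[:, r:r+K, c:c+K].reshape(-1) for r in range(OH) for c in range(OW)]
    return np.stack(cols, axis=1)                        # [C*K*K, OH*OW]

def conv_as_mm(inp, wts):
    """CONV via the shared GEMM engine: weights [M, C, K, K] become an [M, C*K*K] matrix."""
    M, C, K, _ = wts.shape
    return wts.reshape(M, -1) @ im2col(inp, K)           # [M, OH*OW]

def fc_as_mm(x, W):
    """FC via the same engine: a matrix-vector product (a one-column 'feature map')."""
    return W @ x.reshape(-1, 1)

if __name__ == "__main__":
    x = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
    w_conv = np.random.default_rng(1).normal(size=(3, 2, 3, 3))
    print(conv_as_mm(x, w_conv).shape)    # (3, 4): 3 output maps of 2x2 pixels
    W_fc = np.ones((5, 32))
    print(fc_as_mm(x, W_fc).shape)        # (5, 1): FC handled by the same GEMM path
```
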
2016 MICRO
  • From High-Level Deep Neural Models to FPGAs. (Georgia Institute of Technology, Intel) [Paper]
    • Develop a macro dataflow ISA for DNN accelerators.
    • Develop hand-optimized template designs that are scalable and highly customizable.
    • Provide a Template Resource Optimization search algorithm to co-optimize the accelerator architecture and scheduling.
  • vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design. (NVIDIA) [Paper]
  • Stripes: Bit-Serial Deep Neural Network Computing. (University of Toronto, University of British Columbia) [Paper]
    • Introduce serial computation and reduced precision computation to neural network accelerator designs, enabling accuracy vs. performance trade-offs.
    • Design a bit-serial computing unit that scales performance linearly with precision reduction (see the toy model after this list).
  • Cambricon-X: An Accelerator for Sparse Neural Networks. (Chinese Academy of Sciences)
  • NEUTRAMS: Neural Network Transformation and Co-design under Neuromorphic Hardware Constraints. (Tsinghua University, UCSB) [Paper]
  • Fused-Layer CNN Accelerators. (Stony Brook University)
    • Fuse multiple CNN layers (CONV+POOL) to reduce DRAM access for input/output data.
  • Bridging the I/O Performance Gap for Big Data Workloads: A New NVDIMM-based Approach. (The Hong Kong Polytechnic University, NSF/University of Florida) [Paper]
  • A Patch Memory System For Image Processing and Computer Vision. (NVIDIA)
  • An Ultra Low-Power Hardware Accelerator for Automatic Speech Recognition. (Universitat Politecnica de Catalunya) [Paper]
  • Perceptron Learning for Reuse Prediction. (TAMU, Intel Labs) [Paper]
    • Train neural networks to predict reuse of cache blocks.
  • A Cloud-Scale Acceleration Architecture. (Microsoft Research) [Paper]
  • Reducing Data Movement Energy via Online Data Clustering and Encoding. (University of Rochester) [Paper]
  • The Microarchitecture of a Real-time Robot Motion Planning Accelerator. (Duke University) [Paper]
  • Chameleon: Versatile and Practical Near-DRAM Acceleration Architecture for Large Memory Systems. (UIUC, Seoul National University) [Paper]
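
A toy model of the bit-serial computation in the Stripes entry above: activation bits are processed one per cycle, so the cycle count scales linearly with activation precision while the result stays exact. Unsigned values only; the real design handles signs and many serial lanes in parallel.

```python
# Illustrative bit-serial dot product: one activation bit-plane per cycle,
# accumulated with shift-and-add, so cycles scale with precision p.

def bit_serial_dot(acts, weights, precision):
    """acts: unsigned ints of `precision` bits; one activation bit-plane per cycle."""
    acc, cycles = 0, 0
    for b in range(precision):                       # one cycle per bit position
        bit_plane = [(a >> b) & 1 for a in acts]     # serial: just 1 bit of each act
        partial = sum(w for w, bit in zip(weights, bit_plane) if bit)
        acc += partial << b                          # shift-and-add accumulation
        cycles += 1
    return acc, cycles

if __name__ == "__main__":
    acts, weights = [5, 3, 7], [2, 4, 1]
    exact = sum(a * w for a, w in zip(acts, weights))           # 29
    print(bit_serial_dot(acts, weights, precision=3))           # (29, 3) -> 3 cycles
    print(bit_serial_dot(acts, weights, precision=8))           # (29, 8) -> 8 cycles
```
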
2016 FPL
  • A High Performance FPGA-based Accelerator for Large-Scale Convolutional Neural Network. (Fudan University) [paper]
  • Overcoming Resource Underutilization in Spatial CNN Accelerators. (Stony Brook University) [ppt]
    • Build multiple accelerators, each specialized for specific CNN layers, instead of a single accelerator with uniform tiling parameters.
  • Accelerating Recurrent Neural Networks in Analytics Servers: Comparison of FPGA, CPU, GPU, and ASIC. (Intel) [paper]
2016 HPCA
  • A Performance Analysis Framework for Optimizing OpenCL Applications on FPGAs. (Nanyang Technological University, HKUST, Cornell University)
  • TABLA: A Unified Template-based Architecture for Accelerating Statistical Machine Learning. (Georgia Institute of Technology) [paper]
  • Memristive Boltzmann Machine: A Hardware Accelerator for Combinatorial Optimization and Deep Learning. (University of Rochester) [paper]
2017 FPGA
  • An OpenCL Deep Learning Accelerator on Arria 10. (Intel) [paper]
    • Minimum bandwidth requirement: All the intermediate data in AlexNet's CONV layers are cached in the on-chip buffer, so their architecture is compute-bound.
    • Reduced operations: Winograd transformation (a worked 1D example follows this list).
    • High usage of the available DSPs + reduced computation -> higher performance on FPGA -> competitive efficiency vs. Titan X.
  • ESE: Efficient Speech Recognition Engine for Compressed LSTM on FPGA. (Stanford University, DeepPhi, Tsinghua University, NVIDIA) [paper]
  • FINN: A Framework for Fast, Scalable Binarized Neural Network Inference. (Xilinx, Norwegian University of Science and Technology, University of Sydney) [paper]
  • Can FPGA Beat GPUs in Accelerating Next-Generation Deep Neural Networks? (Intel) [paper]
  • Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs. (Cornell University, UCLA, UCSD) [paper]
  • Improving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural Network. (UW-Madison) [paper]
  • Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System. (USC) [paper]
  • Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks. (Arizona State University) [paper]
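
A worked 1D Winograd F(2,3) example for the "reduced operations" point in the Intel Arria 10 entry above: two outputs of a 3-tap filter are produced with 4 multiplications instead of 6. The paper applies the 2D, tiled version; this scalar sketch only verifies the arithmetic.

```python
# Worked 1D Winograd F(2,3): 2 outputs of a 3-tap filter with 4 multiplies.

def winograd_f23(d, g):
    """d: 4 inputs, g: 3 filter taps -> 2 outputs, using 4 multiplications."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct_f23(d, g):
    """Reference: the same 2 outputs with the usual 6 multiplications."""
    return [d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
            d[1]*g[0] + d[2]*g[1] + d[3]*g[2]]

if __name__ == "__main__":
    d, g = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0]
    print(winograd_f23(d, g), direct_f23(d, g))   # both [4.5, 6.0]
```
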
2017 ISSCC
  • A 2.9TOPS/W Deep Convolutional Neural Network SoC in FD-SOI 28nm for Intelligent Embedded Systems. (ST)
  • DNPU: An 8.1TOPS/W Reconfigurable CNN-RNN Processor for General-Purpose Deep Neural Networks. (KAIST)
  • ENVISION: A 0.26-to-10TOPS/W Subword-Parallel Computational Accuracy-Voltage-Frequency-Scalable Convolutional Neural Network Processor in 28nm FDSOI. (KU Leuven)
  • A 288µW Programmable Deep-Learning Processor with 270KB On-Chip Weight Storage Using Non-Uniform Memory Hierarchy for Mobile Intelligence. (University of Michigan, CubeWorks)
  • A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine with >0.1 Timing Error Rate Tolerance for IoT Applications. (Harvard)
  • A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating (MIT) [ppt]
  • A 0.62mW Ultra-Low-Power Convolutional-Neural-Network Face Recognition Processor and a CIS Integrated with Always-On Haar-Like Face Detector. (KAIST)
2017 HPCA
  • FlexFlow: A Flexible Dataflow Accelerator Architecture for Convolutional Neural Networks. (Chinese Academy of Sciences) [paper]
  • PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning. (University of Pittsburgh, University of Southern California) [paper]
  • Towards Pervasive and User Satisfactory CNN across GPU Microarchitectures. (University of Florida) [paper]
    • Satisfaction of CNN (SoC) is the combination of SoC_time, SoC_accuracy, and energy consumption.
    • The P-CNN framework is composed of offline compilation and run-time management.
      • Offline compilation: Generally optimizes runtime, and generates scheduling configurations for the run-time stage.
      • Run-time management: Generates tuning tables through accuracy tuning, and calibrates accuracy and runtime (selecting the best tuning table) during long-term execution.
  • Supporting Address Translation for Accelerator-Centric Architectures. (UCLA) [paper]
2017 ASPLOS
  • Tetris: Scalable and Efficient Neural Network Acceleration with 3D Memory. (Stanford University) [paper]
    • Move accumulation operations close to the DRAM banks.
    • Develop a hybrid partitioning scheme that parallelizes the NN computations over multiple accelerators.
  • SC-DCNN: Highly-Scalable Deep Convolutional Neural Network using Stochastic Computing. (Syracuse University, USC, The City College of New York) [paper]
2017 ISCA
  • Maximizing CNN Accelerator Efficiency Through Resource Partitioning. (Stony Brook University) [paper]
    • An extension of their FPL'16 paper.
  • In-Datacenter Performance Analysis of a Tensor Processing Unit. (Google) [paper]
  • SCALEDEEP: A Scalable Compute Architecture for Learning and Evaluating Deep Networks. (Purdue University, Intel) [ppt]
    • Propose a full-system (server-node) architecture, focusing on the challenges of DNN training.
  • SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks. (NVIDIA, MIT, UC Berkeley, Stanford University) [paper]
  • Scalpel: Customizing DNN Pruning to the Underlying Hardware Parallelism. (University of Michigan, ARM) [paper]
  • Understanding and Optimizing Asynchronous Low-Precision Stochastic Gradient Descent. (Stanford) [paper]
  • LogCA: A High-Level Performance Model for Hardware Accelerators. (AMD, University of Wisconsin-Madison) [paper]
  • APPROX-NoC: A Data Approximation Framework for Network-On-Chip Architectures. (TAMU) [paper]
2017 FCCM
  • Escher: A CNN Accelerator with Flexible Buffering to Minimize Off-Chip Transfer. (Stony Brook University)
  • Customizing Neural Networks for Efficient FPGA Implementation. [paper]
  • Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs. [paper]
  • FP-DNN: An Automated Framework for Mapping Deep Neural Networks onto FPGAs with RTL-HLS Hybrid Templates. (Peking University, HKUST, MSRA, UCLA) [paper]
    • Compute-intensive part: RTL-based generalized matrix multiplication kernel.
    • Layer-specific part: HLS-based control logic.
    • Memory-intensive part: several techniques to lower DRAM bandwidth requirements.
  • FPGA accelerated Dense Linear Machine Learning: A Precision-Convergence Trade-off. [paper]
  • A Configurable FPGA Implementation of the Tanh Function using DCT Interpolation.
2017 DAC
  • Deep^3: Leveraging Three Levels of Parallelism for Efficient Deep Learning. (UCSD, Rice)
  • Real-Time meets Approximate Computing: An Elastic Deep Learning Accelerator Design with Adaptive Trade-off between QoS and QoR. (CAS)
    • I'm not sure whether the proposed tuning scenario and direction are sufficient to find feasible solutions.
  • Exploring Heterogeneous Algorithms for Accelerating Deep Convolutional Neural Networks on FPGAs. (PKU, CUHK, SenseTime) [paper]
  • Hardware-Software Codesign of Highly Accurate, Multiplier-free Deep Neural Networks. (Brown University) [paper]
  • A Kernel Decomposition Architecture for Binary-weight Convolutional Neural Networks. (KAIST)
  • Design of An Energy-Efficient Accelerator for Training of Convolutional Neural Networks using Frequency-Domain Computation. (Georgia Tech) [paper]
  • New Stochastic Computing Multiplier and Its Application to Deep Neural Networks. (UNIST)
  • TIME: A Training-in-memory Architecture for Memristor-based Deep Neural Networks. (THU, UCSB) [paper]
  • Fault-Tolerant Training with On-Line Fault Detection for RRAM-Based Neural Computing Systems. (THU, Duke) [paper]
  • Automating the systolic array generation and optimizations for high throughput convolution neural network. (PKU, UCLA, Falcon) [paper]
  • Towards Full-System Energy-Accuracy Tradeoffs: A Case Study of An Approximate Smart Camera System. (Purdue) [paper]
    • Synergistically tunes component-level approximation knobs to achieve system-level energy-accuracy tradeoffs.
  • Error Propagation Aware Timing Relaxation For Approximate Near Threshold Computing. (KIT)
  • RESPARC: A Reconfigurable and Energy-Efficient Architecture with Memristive Crossbars for Deep Spiking Neural Networks. (Purdue)
  • Rescuing Memristor-based Neuromorphic Design with High Defects. (University of Pittsburgh, HP Lab, Duke)
  • Group Scissor: Scaling Neuromorphic Computing Design to Big Neural Networks. (University of Pittsburgh, Duke) [paper]
  • Towards Aging-induced Approximations. (KIT, UT Austin)
  • SABER: Selection of Approximate Bits for the Design of Error Tolerant Circuits. (University of Minnesota, TAMU)
  • On Quality Trade-off Control for Approximate Computing using Iterative Training. (SJTU, CUHK)
2017 DATE
  • DVAFS: Trading Computational Accuracy for Energy Through Dynamic-Voltage-Accuracy-Frequency-Scaling. (KU Leuven)
  • Accelerator-friendly Neural-network Training: Learning Variations and Defects in RRAM Crossbar. (Shanghai Jiao Tong University, University of Pittsburgh, Lynmax Research)
  • A Novel Zero Weight/Activation-Aware Hardware Architecture of Convolutional Neural Network. (Seoul National University)
  • Understanding the Impact of Precision Quantization on the Accuracy and Energy of Neural Networks. (Brown University) [arxiv]
  • Design Space Exploration of FPGA Accelerators for Convolutional Neural Networks. (Samsung, UNIST, Seoul National University) [paper]
  • MoDNN: Local Distributed Mobile Computing System for Deep Neural Network. (University of Pittsburgh, George Mason University, University of Maryland) [paper]
  • Chain-NN: An Energy-Efficient 1D Chain Architecture for Accelerating Deep Convolutional Neural Networks. (Waseda University) [paper]
  • LookNN: Neural Network with No Multiplication. (UCSD) [paper]
  • Energy-Efficient Approximate Multiplier Design using Bit Significance-Driven Logic Compression. (Newcastle University) [paper]
  • Revamping Timing Error Resilience to Tackle Choke Points at NTC Systems. (Utah State University) [paper]
2017 VLSI
  • A 3.43TOPS/W 48.9pJ/Pixel 50.1nJ/Classification 512 Analog Neuron Sparse Coding Neural Network with On-Chip Learning and Classification in 40nm CMOS. (University of Michigan, Intel)
  • BRein Memory: A 13-Layer 4.2 K Neuron/0.8 M Synapse Binary/Ternary Reconfigurable In-Memory Deep Neural Network Accelerator in 65 nm CMOS. (Hokkaido University, Tokyo Institute of Technology, Keio University)
  • A 1.06-To-5.09 TOPS/W Reconfigurable Hybrid-Neural-Network Processor for Deep Learning Applications. (Tsinghua University)
  • A 127mW 1.63TOPS sparse spatio-temporal cognitive SoC for action classification and motion tracking in videos. (University of Michigan)
2017 ICCAD
  • AEP: An Error-bearing Neural Network Accelerator for Energy Efficiency and Model Protection. (University of Pittsburgh) [paper]
  • VoCaM: Visualization oriented convolutional neural network acceleration on mobile system.
  • AdaLearner: An Adaptive Distributed Mobile Learning System for Neural Networks.
  • MeDNN: A Distributed Mobile System with Enhanced Partition and Deployment for Large-Scale DNNs.
  • TraNNsformer: Neural Network Transformation for Memristive Crossbar based Neuromorphic System Design. (Purdue University). [paper]
  • A Closed-loop Design to Enhance Weight Stability of Memristor Based Neural Network Chips.
  • Fault injection attack on deep neural network.
  • ORCHARD: Visual Object Recognition Accelerator Based on Approximate In-Memory Processing. (UCSD) [paper]
2017 HotChips
  • A Dataflow Processing Chip for Training Deep Neural Networks. (Wave Computing) [paper]
  • Brainwave: Accelerating Persistent Neural Networks at Datacenter Scale. (Microsoft) [paper]
  • DNN ENGINE: A 16nm Sub-uJ Deep Neural Network Inference Accelerator for the Embedded Masses. (Harvard, ARM) [paper]
  • DNPU: An Energy-Efficient Deep Neural Network Processor with On-Chip Stereo Matching. (KAIST) [paper]
  • Evaluation of the Tensor Processing Unit (TPU): A Deep Neural Network Accelerator for the Datacenter. (Google) [paper]
  • NVIDIA’s Volta GPU: Programmability and Performance for GPU Computing. (NVIDIA) [paper]
  • Knights Mill: Intel Xeon Phi Processor for Machine Learning. (Intel) [paper]
  • XPU: A programmable FPGA Accelerator for diverse workloads. (Baidu) [paper]
2017 MICRO
  • Distributed FPGA Acceleration for Learning. (Georgia Tech, UCSD)
  • Bit-Pragmatic Deep Neural Network Computing. (NVIDIA, University of Toronto) [arxiv]
  • CirCNN: Accelerating and Compressing Deep Neural Networks Using Block-Circulant Weight Matrices. (Syracuse University, City University of New York, USC, California State University, Northeastern University) [arxiv]
  • Addressing Compute and Memory Bottlenecks for DNN Execution on GPUs. (University of Michigan) [paper]
  • DRISA: A DRAM-based Reconfigurable In-Situ Accelerator. (UCSB, Samsung) [paper]
  • Scale-Out Acceleration for Machine Learning. (Georgia Tech, UCSD) [paper]
  • DeftNN: Addressing Bottlenecks for DNN Execution on GPUs via Synapse Vector Elimination and Near-compute Data Fission. (Univ. of Michigan, Univ. of Nevada) [paper]
  • Data Movement Aware Computation Partitioning. (PSU, TOBB University of Economics and Technology) [paper]
    • Partition computation on a manycore system for near data processing.