# Search-time Efficient Device Constraints-Aware Neural Architecture Search

Oshin Dutta, Tanu Kanvar, and Sumeet Agarwal

Indian Institute of Technology {oshin.dutta,sumeet}@ee.iitd.ac.in, kanvar.tanu@gmail.com

**Abstract.** Edge computing aims to enable edge devices, such as IoT devices, to process data locally instead of relying on the cloud. However, deep learning techniques like computer vision and natural language processing can be computationally expensive and memory-intensive. Creating manual architectures specialized for each device is infeasible due to their varying memory and computational constraints. To address these concerns, we automate the construction of task-specific deep learning architectures optimized for device constraints through Neural Architecture Search (NAS). We present DCA-NAS, a principled method of fast neural network architecture search that incorporates edge-device constraints such as model size and floating-point operations. It incorporates weight sharing and channel bottleneck techniques to reduce the search time. Based on our experiments, we see that DCA-NAS outperforms manual architectures for similarly sized models and is comparable to popular mobile architectures on various image classification datasets like CIFAR-10, CIFAR-100, and Imagenet-1k. Experiments with the DARTS and NAS-Bench-201 search spaces show the generalization capabilities of DCA-NAS. On further evaluating our approach on Hardware-NAS-Bench, device-specific architectures with low inference latency and state-of-the-art performance were discovered.

**Keywords:** Neural Architecture Search  $\cdot$  DARTS  $\cdot$  Meta-Learning  $\cdot$  Edge Inference  $\cdot$  Constrained Optimization

#### 1 Introduction

In recent years, there has been significant progress in developing Deep Neural Network (DNN) architectures for edge and mobile devices. However, designing DNN architectures for specific hardware constraints and tasks is a time-consuming and computationally expensive process [2]. To address this, Neural Architecture Search (NAS) [39] has become popular as it discovers optimal architectures given a task and network operations. Despite its success, traditional NAS techniques cannot guarantee an optimal architecture for specific devices with hardware constraints such as storage memory and maximum supported FLOPs. To address this concern, researchers have developed hardware-aware algorithms [28,3] that find optimal device architectures with low resource training overhead and search time. These methods often use inference latency [3], FLOPs [28] or a combination of hardware metrics [28] as constraints scaled by a tunable factor. However, the time to tune the scaling factor is often not considered within the NAS search time and can be ten times the reported search time.

Fig. 1: DCA-NAS framework: Weight sharing in the search space and derived cells lower the search time relative to other DNAS methods. The target device constraint is used to query the search constraint from the look-up graph for constrained optimization.

To address these issues, we propose the Device Constraints-Aware NAS (DCA-NAS), a principled differentiable NAS method that introduces total allowable model size or floating-point operations (FLOPs) as constraints within the optimization problem, with minimal hyper-parameter tuning. Unlike inference latency, which is task dependent, FLOPs and memory are specified for a given hardware platform and thus are appropriate for our generic method. The approach is adaptable to other hardware metrics such as energy consumption or inference latency using additional metric-measuring functions. This paper makes the following significant contributions:

- It introduces a fast method that uses weight sharing among operations in the search space and channel bottleneck, along with a differentiable resource constraint, for continuous exploration of the search space.
- It provides a training pipeline that allows a user to input device memory or FLOPs and search for an optimal architecture with minimal hyper-parameter tuning.
- It demonstrates the efficiency of the method through extensive experiments on the vision datasets CIFAR-10, CIFAR-100, TinyImagenet and Imagenet-1k, and through inference-latency comparisons of trained models on Hardware-NAS-Bench. The generalization of the method to different search spaces is shown with experiments on DARTS and NAS-Bench-201.

# 2 Related Work

**Neural Architecture Search.** Popular approaches designed architectures for high performance on specific tasks or datasets with the traditional deep learning perspective that bigger is better, resulting in computationally and memory-intensive inference on edge devices. Network pruning and channel removal [27] can compress architectures, but require pre-training and hyperparameter tuning, and often lack transferability. Neural Architecture Search (NAS) methods such as Reinforcement Learning [3], Evolutionary Learning [31] and Differentiable Neural Architecture Search (DNAS) [21] can automatically search for architectures without user intervention, and can transfer across similar tasks. DNAS with surrogate metrics [33] has also been used to explore the architecture search space. However, architectures found by DNAS methods are not optimized for deployment on edge devices, and smaller models obtained by reducing layers or channels are often sub-optimal.

**Hardware-aware Neural Architecture Search.** Certain NAS methods [3,2,15] optimize for constraints such as latency, inference speed [32], FLOPs [29], or memory usage [20]. Some use a separate DNN to predict constraint metrics and evolutionary search to obtain hardware-aware optimal models [2], while others consider real-time latencies of edge devices or provide specific architectures for specific devices [22,7]. However, these methods require significant search time and tuning of the scaling factors that control the trade-off between performance and the constraint, and do not always yield optimal architectures. In contrast, we use a differentiable hardware-aware objective function with generic hardware metrics, and do not require a tunable scaling factor. Certain methods [2,8] train a supernet first and then search for a smaller architecture, but this is only efficient when there are more than fifteen different edge devices with different limitations or deployment scenarios [2], as training the supernet takes huge resources: 32 V100 GPUs for about 1,200 GPU hours. A search stage followed by evaluation, as done in our approach, is more efficient when the number of distinct target edge devices is less than fifteen.

# 3 DCA-NAS: Device Constraints Aware Fast Neural Architecture Search

We present the preliminary gradient-based NAS objective function in Section 3.1, formulate hardware-aware NAS as a constrained optimization problem in Section 3.2, and describe techniques to reduce the search time in Section 3.3. The framework of our approach is illustrated in Figure 1.

#### 3.1 Gradient-based NAS Objective Function

Popular DNAS techniques [21,37] have two stages: the search phase and the evaluation phase. During the search phase, given a task or a dataset, the techniques search for a network of cells, which are directed acyclic graphs with $N$ nodes. The edges of the graph are network layers, whose operations are to be selected from a pre-defined set  $\mathcal{O}$  containing operations such as 3x3 separable convolution and identity, with trainable weights  $w_o$ . The search is made differentiable by relaxing the choice of a particular operation into a softmax over the architecture weights  $\alpha$  of all operations. Thus, the intermediate output  $z_j$  at node $j$ is given by,

$$z_{j} = \sum_{o \in \mathcal{O}} \frac{\exp\left\{\alpha_{o}^{i,j}\right\}}{\sum_{o' \in \mathcal{O}} \exp\left\{\alpha_{o'}^{i,j}\right\}} \cdot o\left(w_{o}^{i,j}, \mathbf{z}_{i}\right) \tag{1}$$
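To make the relaxation in Equation 1 concrete, the following is a minimal sketch of a softmax-weighted mixed operation on a single edge $(i, j)$. The scalar features, the three toy operations and the names `CANDIDATE_OPS` and `mixed_edge_output` are illustrative assumptions and are not taken from the DCA-NAS implementation.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy operations on a scalar feature; real candidates would be
# convolutions, pooling, identity and so on.
CANDIDATE_OPS = {
    "identity": lambda w, z: z,
    "scale": lambda w, z: w * z,  # stand-in for a parametric operation
    "zero": lambda w, z: 0.0,
}

def mixed_edge_output(alpha, weights, z_i):
    """Softmax-weighted sum of all candidate operations on one edge (i, j)."""
    names = list(CANDIDATE_OPS)
    probs = softmax([alpha[name] for name in names])
    return sum(p * CANDIDATE_OPS[name](weights[name], z_i)
               for p, name in zip(probs, names))

# Equal architecture weights give each operation probability 1/3.
alpha = {"identity": 0.0, "scale": 0.0, "zero": 0.0}
weights = {"identity": None, "scale": 2.0, "zero": None}
z_j = mixed_edge_output(alpha, weights, z_i=3.0)
print(z_j)
```

Because every operation contributes in proportion to its softmax weight, the output is differentiable with respect to $\alpha$, which is what allows gradient-based search.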

#### 3.2 DCA-NAS formulation

Previous DNAS approaches [21,36,37] did not prioritize searching architectures for resource-constrained inference. In contrast, we formulate the DNAS objective function as a constrained optimization problem by incorporating device resource constraints (memory or FLOPs) in the search objective function. The constrained bi-level optimization problem is written as,

$$\begin{aligned} \min_{\alpha} \quad & \mathcal{L}_{\text{val}} \left( w^*(\alpha), \alpha \right) \\ \text{s.t.} \quad & w^*(\alpha) = \operatorname{argmin}_w \mathcal{L}_{\text{train}} \left( w, \alpha \right), \\ & k_s(\alpha) \le K_d \end{aligned} \tag{2}$$

where the training dataset is split into train and val sets to jointly optimize $w$ and  $\alpha$  in each iteration, while ensuring that the architecture's parameter or FLOPs count  $k_s$  remains within the device resource constraint  $K_d$ . Our method can also be adapted to use other metrics such as latency and energy consumption with additional metric-measuring functions. The architecture's number of parameters or FLOPs during search, given the number of cells $c_n$ and with $b(o)$ denoting the parameter or FLOPs count of operation $o$, is calculated as

$$k_s(\alpha) = c_n \sum_{(i,j) \in N} \sum_{o \in \mathcal{O}} \frac{\exp\{\alpha_o^{i,j}\} \cdot b(o)}{\sum_{o' \in \mathcal{O}} \exp\{\alpha_{o'}^{i,j}\}} \tag{3}$$
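As an illustration of Equation 3, the sketch below computes the softmax-weighted cost $k_s(\alpha)$ for a toy cell with two edges and two candidate operations, and contrasts it with the cost of a discretized cell. The per-operation costs, the architecture weights and the helper names are hypothetical, and keeping only the single highest-weighted operation per edge is a simplification made here for illustration.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def search_phase_cost(alphas, op_cost, num_cells):
    """k_s(alpha): softmax-weighted b(o), summed over edges, times c_n."""
    total = 0.0
    for edge_alpha in alphas.values():
        names = list(edge_alpha)
        probs = softmax([edge_alpha[name] for name in names])
        total += sum(p * op_cost[name] for p, name in zip(probs, names))
    return num_cells * total

def discretized_cost(alphas, op_cost, num_cells):
    """Cost when each edge keeps only its highest-weighted operation."""
    total = 0.0
    for edge_alpha in alphas.values():
        best = max(edge_alpha, key=edge_alpha.get)
        total += op_cost[best]
    return num_cells * total

# Hypothetical parameter counts b(o) and architecture weights alpha.
op_cost = {"sep_conv_3x3": 1000.0, "skip_connect": 0.0}
alphas = {
    (0, 2): {"sep_conv_3x3": 2.0, "skip_connect": 0.0},
    (1, 2): {"sep_conv_3x3": 2.0, "skip_connect": 0.0},
}
k_s = search_phase_cost(alphas, op_cost, num_cells=8)
k_discrete = discretized_cost(alphas, op_cost, num_cells=8)
print(round(k_s, 1), k_discrete)
```

In this toy setting the softmax-weighted count is smaller than the discretized count, because the zero-cost operation still receives some probability mass during search.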

**Tackling the difference in search and evaluation networks.** The size of the architecture in the search phase,  $k_s$ , differs from the architecture size in the evaluation phase because of the softmax weighting factor in Equation 3 (a demonstration can be found in the supplementary material<sup>1</sup>). To address this, we introduce a search constraint  $K_{d'}$  that is a tighter bound than the device resource constraint  $K_d$ . A lookup graph (LUG) is made for each dataset by varying  $K_{d'}$  within appropriate bounds and running the algorithm until convergence each time to obtain the corresponding device resource constraint  $K_d$ . The computation time of the LUG can be reduced by running the searches in parallel. Thus, on incorporating the tighter constraint obtained by looking up the graph for the given device resource constraint  $K_d$ , along with the trainable Lagrange multiplier  $\lambda$ , in Equation 2, the objective function is re-written as,

$$\begin{aligned} \widetilde{\mathcal{L}} &= \mathcal{L}_{\text{val}} \left( w^*(\alpha), \alpha \right) + \lambda \left( k_s(\alpha) - \operatorname{LUG}(K_d) \right) \\ \text{s.t.} \quad w^*(\alpha) &= \operatorname{argmin}_w \mathcal{L}_{\text{train}} \left( w, \alpha \right) \end{aligned} \tag{4}$$
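The pieces of Equation 4 can be sketched as follows. This is only an illustration under stated assumptions: the lookup graph is represented as a few hypothetical $(K_d, K_{d'})$ pairs with linear interpolation between them, and the multiplier is updated by projected gradient ascent, a common choice for Lagrangian methods. The paper states only that $\lambda$ is trainable, so the update rule and all function names here are assumptions.

```python
from bisect import bisect_left

def lookup_search_constraint(lug_points, device_constraint):
    """Interpolate the search constraint K_d' for a requested K_d.

    lug_points: (K_d, K_d') pairs sorted by K_d (hypothetical values).
    """
    device_values = [k_d for k_d, _ in lug_points]
    if device_constraint <= device_values[0]:
        return lug_points[0][1]
    if device_constraint >= device_values[-1]:
        return lug_points[-1][1]
    upper = bisect_left(device_values, device_constraint)
    (x0, y0), (x1, y1) = lug_points[upper - 1], lug_points[upper]
    fraction = (device_constraint - x0) / (x1 - x0)
    return y0 + fraction * (y1 - y0)

def lagrangian(val_loss, k_s, search_constraint, lam):
    """Validation loss plus the weighted constraint violation."""
    return val_loss + lam * (k_s - search_constraint)

def update_multiplier(lam, k_s, search_constraint, step_size):
    """Assumed rule: gradient ascent on lambda, clipped at zero."""
    return max(0.0, lam + step_size * (k_s - search_constraint))

# Hypothetical lookup graph, in millions of parameters.
lug = [(2.0, 1.5), (3.5, 2.5), (5.5, 4.0)]
k_search = lookup_search_constraint(lug, 4.5)
loss = lagrangian(val_loss=0.8, k_s=3.0, search_constraint=2.5, lam=0.1)
lam_up = update_multiplier(0.1, k_s=3.0, search_constraint=2.5, step_size=0.2)
lam_down = update_multiplier(0.05, k_s=2.0, search_constraint=2.5, step_size=0.2)
print(k_search, loss, lam_up, lam_down)
```

Under this assumed rule, the multiplier grows while the architecture exceeds the looked-up constraint and shrinks towards zero once the constraint is satisfied.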

#### 3.3 Techniques to reduce search time

**Channel Bottleneck.** We use convolutional layers with 1x1 kernels to reduce the number of output channels of operations in the search space, which saves computation time and memory overhead.
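A rough parameter count shows why a 1x1 bottleneck saves memory. The layout below (a 1x1 convolution that reduces the channel count, followed by a 3x3 convolution at the reduced width) and the reduction factor of 4 are assumptions chosen for illustration; the exact placement used in DCA-NAS is described in its supplementary material and is not reproduced here.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Number of weights in a standard convolution, ignoring bias."""
    return in_channels * out_channels * kernel_size * kernel_size

def bottlenecked_conv_params(in_channels, out_channels, kernel_size, reduction):
    """Assumed layout: 1x1 channel reduction, then a k x k convolution."""
    reduced = in_channels // reduction
    return (conv_params(in_channels, reduced, 1)
            + conv_params(reduced, out_channels, kernel_size))

plain = conv_params(64, 64, 3)
bottlenecked = bottlenecked_conv_params(64, 64, 3, reduction=4)
print(plain, bottlenecked, plain / bottlenecked)
```

With these illustrative numbers the bottlenecked variant needs 10,240 weights instead of 36,864, a reduction of about 3.6x.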

<sup>1</sup> https://github.com/oshindutta/DCA-NAS

Table 1: Performance comparison of architectures evaluated on the visual datasets CIFAR-10 and TinyImagenet. '(CIFAR-10)' indicates search with CIFAR-10. 'X M' in 'DCA-NAS-X M' denotes the input memory constraint. RCAS: Resource Constrained Architecture Search.

| Dataset | Search Strategy | Method | Accuracy (%) | Parameters (Million) | GPU Hours |
|---------|-----------------|--------|--------------|----------------------|-----------|
| CIFAR-10 | manual | PyramidNet-110 (2017) [10] | 95.74 | 3.8 | - |
| | manual | VGG-16 pruned (2017) [13] | 93.4 | 5.4 | - |
| | evolution | Evolution + Cutout (2019) [31] | 96.43 | 5.8 | 12 |
| | random | NAO Random-WS (2019) [25] | 96.08 | 3.9 | 7.2 |
| | gradient | ENAS + micro + Cutout (2018) [24] | 96.46 | 4.6 | 12 |
| | gradient | DARTS + Cutout (2nd) (2018) [21] | $97.24 \pm 0.09$ | 3.3 | 24 |
| | gradient | SNAS + Cutout (2018) [34] | 97.15 | 2.8 | 36 |
| | gradient | PC-DARTS (2019) [36] | $97.43 \pm 0.07$ | 3.6 | 2.4 |
| | gradient | SGAS (2020) [18] | 97.34 | 3.7 | 6 |
| | gradient | DrNAS (2020) [6] | $97.46 \pm 0.03$ | 4.0 | 9.6 |
| | gradient | DARTS+PT (2021) [30] | $97.39 \pm 0.08$ | 3.0 | 19.2 |
| | gradient | Shapley-NAS (2022) [33] | $97.53 \pm 0.04$ | 3.4 | 7.2 |
| | RCAS | DCA-NAS-3.5 M (CIFAR-10) | $97.2 \pm 0.09$ | 3.4 | 1.37 |
| Tiny ImageNet | manual | SqueezeNet (2016) [14] | 54.40 | - | - |
| | manual | PreActResNet18 (2020) [17] | 63.48 | - | - |
| | manual | DenseNet (2020) [1] | 62.73 | 11.8 | - |
| | gradient | DARTS + Cutout (2018) [21] | $62.15 \pm 0.15$ | 7.3 | 219 |
| | RCAS | DCA-NAS-3.5 M | $61.34 \pm 0.09$ | 3.5 | 12.5 |
| | RCAS | DCA-NAS-3.5 M (CIFAR-10) | $61.4 \pm 0.15$ | 3.4 | 1.37 |

[Figure 2: three panels (CIFAR-10, TinyImagenet, Imagenet-1k), each plotting model performance against Parameters (Million). Methods shown: DCA-NAS, DARTS and PyramidNet-272 on CIFAR-10; DCA-NAS, DARTS and a ResNet18 variant on TinyImagenet; DCA-NAS, ProxylessNAS, Mobilenet-v2 and PC-DARTS on Imagenet-1k.]

Fig. 2: Plots show that the DCA-NAS method discovers models with fewer parameters than other NAS methods and manual architectures without sacrificing prediction performance to a large extent.

**Derived Cell and Weight Sharing.** During architecture search, a single cell with trainable architecture parameters  $\alpha$  is used. The target network for inference is built by stacking cells with architectures derived from highly weighted operations. This derivation process, performed iteratively, reduces computation and memory overhead [37]. In addition, within a cell we share weights among instances of the same operation that originate from the same node $i$, across all destination nodes $j$ with $i < j < N$. This is motivated by the observation that a non-parametric operation applied to the representation of a node produces the same feature map irrespective of the output node; we extend this sharing to parametric operations. Thus, Equation 1 may be re-written as,

$$z_{j} = \sum_{o \in \mathcal{O}} \frac{\exp\left\{\alpha_{o}^{i,j}\right\}}{\sum_{o' \in \mathcal{O}} \exp\left\{\alpha_{o'}^{i,j}\right\}} \cdot o\left(w_{o}^{i}, \mathbf{z}_{i}\right) \tag{5}$$
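The effect of replacing per-edge weights $w_o^{i,j}$ with per-origin weights $w_o^{i}$ in Equation 5 can be illustrated by counting weight sets in a cell. The cell shape used below (two input nodes, four intermediate nodes, four parametric operations) is an assumption that mirrors a typical DARTS-style cell, and `count_weight_sets` is a hypothetical helper, not part of the released code.

```python
def count_weight_sets(num_nodes, num_inputs, num_param_ops):
    """Return (unshared, shared) counts of parametric-operation weight sets.

    Nodes 0..num_inputs-1 are cell inputs; every later node j receives an
    edge from each earlier node i < j.
    """
    edges = [(i, j) for j in range(num_inputs, num_nodes) for i in range(j)]
    origins = {i for i, _ in edges}
    unshared = len(edges) * num_param_ops  # one w_o^{i,j} per edge and operation
    shared = len(origins) * num_param_ops  # one w_o^{i} per origin and operation
    return unshared, shared

unshared, shared = count_weight_sets(num_nodes=6, num_inputs=2, num_param_ops=4)
print(unshared, shared)
```

For this assumed cell, 14 edges share only 5 originating nodes, so the number of weight sets held during search drops from 56 to 20.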

### 4 Experimental Results

Our approach is evaluated on two search spaces, DARTS and NAS-Bench-201, with the vision datasets CIFAR-10, TinyImagenet, ImageNet-16-120 and Imagenet-1k. The details of the search space and implementation are given in the supplementary material.

Table 2: Performance comparison of architectures evaluated on Imagenet-1k. The label '(Imagenet)' indicates that the architecture has been searched and evaluated on Imagenet-1k; otherwise it is searched on CIFAR-10. 'X M' in 'DCA-NAS-X M' denotes the input memory constraint.

| Method | Top-1 Test Error (%) | Top-5 Test Error (%) | Parameters (Million) | FLOPs (Million) | Search Cost (GPU days) | Search Strategy |
|-----------------------------------------|--------|-----------|------------|-------|-------------|-----------|
| MobileNet_V2 (2018) [26]                | 28.0   | 9.0       | 3.4        | 300   | -           | manual    |
| ShuffleNet 2× (v2) (2018) [23]          | 25.1   | -         | 5          | 591   | -           | manual    |
| MnasNet-92 (2020) [11]                  | 25.2   | 8.0       | 4.4        | 388   | -           | RL        |
| AmoebaNet-C (2019) [25]                 | 24.3   | 7.6       | 6.4        | 570   | 3150        | evolution |
| DARTS+Cutout (2018) [21]                | 26.7   | 8.7       | 4.7        | 574   | 1.0         | gradient  |
| SNAS (2018) [34]                        | 27.3   | 9.2       | 4.3        | 522   | 1.5         | gradient  |
| GDAS (2019) [9]                         | 26.0   | 8.5       | 5.3        | 545   | 0.3         | gradient  |
| BayesNAS (2019) [39]                    | 26.5   | 8.9       | 3.9        | -     | 0.2         | gradient  |
| P-DARTS (2018) [24]                     | 24.4   | 7.4       | 4.9        | 557   | 0.3         | gradient  |
| SGAS (Cri 1. best) (2020) [18]          | 24.2   | 7.2       | 5.3        | 585   | 0.25        | gradient  |
| SDARTS-ADV (2020) [5]                   | 25.2   | 7.8       | 6.1        | -     | 0.4         | gradient  |
| Shapley-NAS (2022) [33]                 | 24.3   | -         | 5.1        | 566   | 0.3         | gradient  |
| RC-DARTS (2019) [16]                    | 25.1   | 7.8       | 4.9        | 590   | 1           | RCAS      |
| DCA-NAS                                 | 25.1   | 8.1       | 5.1        | 578   | 0.06        | RCAS      |
| ProxylessNAS (GPU) (2019) [3] (Imagenet) | 24.9   | 7.5       | 7.1        | 465   | 8.3         | gradient  |
| PC-DARTS (2019) [36] (Imagenet)         | 24.2   | 7.3       | 5.3        | 597   | 3.8         | gradient  |
| DrNAS (2020) [6] (Imagenet)             | 24.2   | 7.3       | 5.2        | 644   | 3.9         | gradient  |
| DARTS+PT (2021) [30] (Imagenet)         | 25.5   | -         | 4.7        | 538   | 3.4         | gradient  |
| Shapley-NAS (2022) [33] (Imagenet)      | 23.9   | -         | 5.4        | 582   | 4.2         | gradient  |
| RCNet-B (2019) [35] (ImageNet)          | 25.3   | 8.0       | 4.7        | 471   | 9           | RCAS      |
| DCA-NAS-5.5 M (Imagenet)                | 24.4   | 7.2       | 5.3        | 597   | 1.9         | RCAS      |

## 4.1 Results on DARTS search space

**Transferability: learning of coarse features during search.** We transfer the architecture searched on CIFAR-10 to train and evaluate the model weights on TinyImagenet (Table 1) and Imagenet-1k (Table 2). This transferred model yields higher performance than manually designed architectures [26,23] for the target dataset. The performance of the transferred model is comparable to that of the architecture searched on the target dataset itself, which can be attributed to the architecture learning coarse features rather than objects during search.

**Performance versus device-constraints trade-off.** DCA-NAS discovers 2 to 4% better-performing architectures than manual designs with a memory constraint of 3.5 million parameters on CIFAR-10, and similar performance on TinyImagenet, as in Table 1. On Imagenet-1k, DCA-NAS yields models with similar performance to other NAS methods [33,6,36] with a constraint of 5.5 million parameters (chosen to yield models of similar size to other NAS methods), as in Table 2. We vary the input device resource constraint and plot the performance of the searched models against the number of parameters in Figure 2. As observed, DCA-NAS can yield models 15x smaller than manual architectures like PyramidNet-272 [10] with at most 1% reduction in accuracy on CIFAR-10. On TinyImagenet, DCA-NAS yields models similar in performance but 6x smaller in size than the manual ResNet variant. In comparison to ProxylessNAS [3] for Imagenet-1k, DCA-NAS yields a 32% smaller model in terms of model parameters for similar accuracy. In comparison to DNAS methods [21,36] for each of the three datasets, we observe that the performance of the DCA-NAS searched models is retained to a certain extent as resources are further limited, after which the model performance degrades. A DCA-NAS model of similar size has the advantage of better performance (by 1%) and of being automatically searched over MobileNet-v2 [26], a manually designed network, on Imagenet-1k.

Fig. 3: Plots show that DCA-NAS searched models have similar performance but lower inference latency (on two devices, Pixel 3 and Raspberry Pi 4) compared with the previous SOTA NAS method, PC-DARTS, when evaluated on the NAS-Bench dataset.

**Search time comparison.** For evaluation on TinyImagenet in Table 1, the architecture searched on CIFAR-10 with DCA-NAS demonstrates superior search-time efficiency, highlighting the transferability property. Our method requires about 4x lower search cost than SGAS [18], which performs the best among the other transferred architectures, and 16x lower search time than the other resource-constrained approach [16] for similar performance, as seen in Table 2. Moreover, ProxylessNAS [3] takes about 4x more search time than DCA-NAS, whereas PC-DARTS takes about 2x more search time with no capability to constrain model size.
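The ratios quoted above can be checked directly against the search costs in Table 2. The short calculation below copies the GPU-day values from that table; pairing ProxylessNAS and PC-DARTS with the Imagenet-searched DCA-NAS-5.5 M row is our reading of the comparison and should be treated as an assumption.

```python
# Search costs in GPU days, copied from Table 2.
gpu_days = {
    "DCA-NAS (CIFAR-10 search)": 0.06,
    "SGAS": 0.25,
    "RC-DARTS": 1.0,
    "DCA-NAS-5.5 M (Imagenet search)": 1.9,
    "ProxylessNAS": 8.3,
    "PC-DARTS (Imagenet search)": 3.8,
}

def cost_ratio(baseline, ours):
    """How many times more GPU days the baseline search takes."""
    return gpu_days[baseline] / gpu_days[ours]

ratios = {
    "SGAS": cost_ratio("SGAS", "DCA-NAS (CIFAR-10 search)"),
    "RC-DARTS": cost_ratio("RC-DARTS", "DCA-NAS (CIFAR-10 search)"),
    "ProxylessNAS": cost_ratio("ProxylessNAS", "DCA-NAS-5.5 M (Imagenet search)"),
    "PC-DARTS": cost_ratio("PC-DARTS (Imagenet search)",
                           "DCA-NAS-5.5 M (Imagenet search)"),
}
for name, value in ratios.items():
    print(f"{name}: {value:.1f}x")
```

The computed ratios are roughly 4.2x, 16.7x, 4.4x and 2.0x, in line with the rounded figures in the text.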

#### 4.2 Results on NAS-Bench-201 search space

**Performance and latency comparisons on different devices.** We report the mean over five runs with different random seeds. Figure 3 compares the performance of models searched with DCA-NAS and PC-DARTS under varying latency constraints. It shows that, unlike PC-DARTS, DCA-NAS can search for more efficient models which have lower inference latency for similar test accuracy. Moreover, we observe that models with similar performance have lower latency when tested on Pixel 3 than on Raspberry Pi 4, due to the faster RAM in Pixel 3. DCA-NAS takes the lowest search time among all the NAS methods due to the addition of search-time-efficient techniques, while being on par in terms of performance across all datasets.

#### 5 Ablation Study

**Effectiveness of various algorithmic augmentations for faster search.** We analyze the effectiveness of the algorithmic augmentations described previously in Section 3.3 for reducing search cost. We sequentially add weight sharing, channel bottleneck, and derived cells to the baseline DARTS [21] method and measure search time and accuracy. Weight sharing, channel bottleneck, and derived cells were observed to significantly reduce search memory overhead, enabling us to use larger batch sizes and reducing overall search cost, as seen in Figure 4a. Adding the resource constraint in the final DCA-NAS method negligibly increases search cost while maintaining performance.



Fig. 4: (a) Ablation study with the CIFAR-10 dataset: each component added to DARTS reduces the search cost of DCA-NAS while performance is retained. WS: Weight Sharing, CB: Channel Bottleneck, DC: Derived Cell, RC: Resource Constraint, BS: Batch Size. (b) Stability of the performance of DCA-NAS searched models across runs with varying seeds on the CIFAR-10 dataset.

**Stability of the approach.** We test stability by running the search algorithm independently five times with different initial seeds and the same constraints and hyperparameters. The architectures found during each run have similar performance when re-trained and evaluated, as shown in Fig. 4b. Smaller models have lower performance due to restrictions in model complexity compared to larger models.

#### 6 Conclusion

We present DCA-NAS, a device constraints-aware neural architecture search framework which discovers architectures optimized to the memory and computational constraints of an edge device in a time-efficient manner. It does so by incorporating a constraint in terms of the number of parameters or floating-point operations (FLOPs) in the objective function with the help of a Lagrange multiplier. DCA-NAS in essence searches for a Pareto-optimal solution given the edge device memory or FLOPs constraint. Moreover, it enables architecture search with a search cost 4 to 17 times lower than previous state-of-the-art hardware-aware NAS approaches. DCA-NAS can discover models about 10 to 15 times smaller than manually designed architectures for similar performance. In comparison to DARTS and its NAS variants, DCA-NAS can discover models up to 3x smaller in size with similar performance. This hardware-aware approach can be generalized to any future updates to differentiable neural architecture search and possibly to training-free methods of NAS with some adaptation.

#### Acknowledgement

We thank the anonymous reviewers; Profs. Surendra Prasad and Brejesh Lall of IIT Delhi; and colleagues at Cadence India for their valuable feedback and inputs. This research is supported by funding from Cadence India; the first author is also supported by a fellowship from the Ministry of Education, India.

#### References

- 1. Abai, Z., Rajmalwar, N.: Densenet models for tiny imagenet classification (2020)
- Cai, H., Gan, C., Wang, T., Zhang, Z., Han, S.: Once-for-All: Train One Network and Specialize it for Efficient Deployment (Apr 2020), http://arxiv.org/abs/1908. 09791, arXiv:1908.09791 [cs, stat]
- 3. Cai, H., Zhu, L., Han, S.: Proxylessnas: Direct neural architecture search on target task and hardware (2019)
- Chen, W., Gong, X., Wang, Z.: Neural architecture search on imagenet in four gpu hours: A theoretically inspired perspective. arXiv preprint arXiv:2102.11535 (2021)
- Chen, X., Hsieh, C.J.: Stabilizing differentiable architecture search via perturbation-based regularization. In: International conference on machine learning. pp. 1554–1565. PMLR (2020)
- Chen, X., Wang, R., Cheng, M., Tang, X., Hsieh, C.J.: Drnas: Dirichlet neural architecture search. arXiv preprint arXiv:2006.10355 (2020)
- Chu, G., Arikan, O., Bender, G., Wang, W., Brighton, A., Kindermans, P.J., Liu, H., Akin, B., Gupta, S., Howard, A.: Discovering multi-hardware mobile models via architecture search. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3022–3031 (2021)
- 8. Ding, Y., Wu, Y., Huang, C., Tang, S., Wu, F., Yang, Y., Zhu, W., Zhuang, Y.: Nap: Neural architecture search with pruning. Neurocomputing 477, 85–95 (2022)
- Dong, X., Yang, Y.: Searching for a robust neural architecture in four gpu hours. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1761–1770 (2019)
- Han, D., Kim, J., Kim, J.: Deep pyramidal residual networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 5927–5935 (2017)
- 11. He, C., Ye, H., Shen, L., Zhang, T.: Milenas: Efficient neural architecture search via mixed-level reformulation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11993–12002 (2020)
- 12. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
- He, Y., Zhang, X., Sun, J.: Channel pruning for accelerating very deep neural networks. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1389–1397 (2017)
- 14. Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K.: Squeezenet: Alexnet-level accuracy with 50x fewer parameters and ¡0.5mb model size (2016)
- 15. Jiang, Q., Zhang, X., Chen, D., Do, M.N., Yeh, R.A.: EH-DNAS: End-to-End Hardware-aware Differentiable Neural Architecture Search. arXiv:2111.12299 [cs] (Nov 2021), http://arxiv.org/abs/2111.12299, arXiv: 2111.12299
- Jin, X., Wang, J., Slocum, J., Yang, M.H., Dai, S., Yan, S., Feng, J.: Redarts: Resource constrained differentiable architecture search. arXiv preprint arXiv:1912.12814 (2019)
- 17. Kim, J.H., Choo, W., Song, H.O.: Puzzle mix: Exploiting saliency and local statistics for optimal mixup (2020)
- 18. Li, G., Qian, G., Delgadillo, I.C., Müller, M., Thabet, A., Ghanem, B.: Sgas: Sequential greedy architecture search (2020)

- 19. Li, L., Talwalkar, A.: Random search and reproducibility for neural architecture search. In: Uncertainty in artificial intelligence. pp. 367–377. PMLR (2020)
- Lin, J., Chen, W.M., Lin, Y., Gan, C., Han, S., et al.: Mcunet: Tiny deep learning on iot devices. Advances in Neural Information Processing Systems 33, 11711– 11722 (2020)
- 21. Liu, H., Simonyan, K., Yang, Y.: Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055 (2018)
- 22. Lyu, B., Yuan, H., Lu, L., Zhang, Y.: Resource-Constrained Neural Architecture Search on Edge Devices. IEEE Transactions on Network Science and Engineering 9(1), 134–142 (Jan 2022). https://doi.org/10.1109/TNSE.2021.3054583, conference Name: IEEE Transactions on Network Science and Engineering
- 23. Ma, N., Zhang, X., Zheng, H.T., Sun, J.: Shufflenet v2: Practical guidelines for efficient cnn architecture design. In: Proceedings of the European conference on computer vision (ECCV). pp. 116–131 (2018)
- 24. Pham, H., Guan, M.Y., Zoph, B., Le, Q.V., Dean, J.: Efficient neural architecture search via parameter sharing. In: ICML (2018)
- Real, E., Aggarwal, A., Huang, Y., Le, Q.V.: Regularized evolution for image classifier architecture search (2019)
- 26. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: Mobilenetv2: Inverted residuals and linear bottlenecks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4510–4520 (2018)
- 27. Srivastava, A., Dutta, O., Gupta, J., Agarwal, S., AP, P.: A variational information bottleneck based method to compress sequential networks for human action recognition. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 2745–2754 (2021)
- 28. Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., Le, Q.V.: Mnasnet: Platform-aware neural architecture search for mobile. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2815–2823 (2019)
- 29. Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International conference on machine learning. pp. 6105–6114. PMLR (2019)
- 30. Wang, R., Cheng, M., Chen, X., Tang, X., Hsieh, C.J.: Rethinking architecture selection in differentiable nas. arXiv preprint arXiv:2108.04392 (2021)
- 31. Wistuba, M.: Deep learning architecture search by neuro-cell-based evolution with function-preserving mutations. In: Berlingerio, M., Bonchi, F., Gärtner, T., Hurley, N., Ifrim, G. (eds.) Machine Learning and Knowledge Discovery in Databases. pp. 243–258. Springer International Publishing, Cham (2019)
- 32. Wu, Y., Gong, Y., Zhao, P., Li, Y., Zhan, Z., Niu, W., Tang, H., Qin, M., Ren, B., Wang, Y.: Compiler-Aware Neural Architecture Search for On-Mobile Real-time Super-Resolution (Jul 2022), http://arxiv.org/abs/2207.12577, arXiv:2207.12577 [cs, eess]
- 33. Xiao, H., Wang, Z., Zhu, Z., Zhou, J., Lu, J.: Shapley-NAS: Discovering Operation Contribution for Neural Architecture Search (Jun 2022), http://arxiv.org/abs/2206.09811, arXiv:2206.09811 [cs]
- 34. Xie, S., Zheng, H., Liu, C., Lin, L.: Snas: stochastic neural architecture search. In: International Conference on Learning Representations (2018)
- 35. Xiong, Y., Mehta, R., Singh, V.: Resource constrained neural network architecture search: Will a submodularity assumption help? In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1901–1910 (2019)

- 36. Xu, Y., Xie, L., Zhang, X., Chen, X., Qi, G.J., Tian, Q., Xiong, H.: Pc-darts: Partial channel connections for memory-efficient architecture search. arXiv preprint arXiv:1907.05737 (2019)
- 37. Yang, Y., You, S., Li, H., Wang, F., Qian, C., Lin, Z.: Towards improving the consistency, efficiency, and flexibility of differentiable neural architecture search. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6667–6676 (2021)
- 38. Zhang, M., Su, S.W., Pan, S., Chang, X., Abbasnejad, E.M., Haffari, R.: idarts: Differentiable architecture search with stochastic implicit gradients. In: International Conference on Machine Learning. pp. 12557–12566. PMLR (2021)
- 39. Zhou, H., Yang, M., Wang, J., Pan, W.: Bayesnas: A bayesian approach for neural architecture search (2019)

# Appendix

## A Deriving cell architectures

The searched cells are stacked to form the network whose weights are trained and evaluated. The number of layers in this network during the evaluation phase is varied from 4 to 20. Models searched with DARTS using only 2 cells perform as well as those from an 8-cell search when the target model has more than 10 layers. Hence, in our experiments, instead of training architecture parameters for all 8 cells, we train only 2 cells: one normal cell and one reduction cell. The architectures of the other 6 cells stacked to form the network during search are derived from either the normal or the reduction cell, as shown in Figure 1.
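The derivation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the operation subset `OPS`, the helper names `derive_cell`, `cell_layout` and `build_search_network`, the DARTS-style placement of reduction cells at one third and two thirds of the depth, and the choice of the first cell of each type as the one carrying trainable architecture parameters are all assumptions made for the example.

```python
import math

OPS = ["skip_connect", "sep_conv_3x3", "max_pool_3x3"]  # illustrative subset


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def derive_cell(alpha):
    """Keep only the top-weighted operation on every edge of a searched cell."""
    derived = {}
    for edge, logits in alpha.items():
        weights = softmax(logits)
        best = max(range(len(OPS)), key=lambda i: weights[i])
        derived[edge] = OPS[best]
    return derived


def cell_layout(num_cells=8):
    """Label each position as 'normal' or 'reduction' (assumed DARTS placement)."""
    reduction_at = {num_cells // 3, 2 * num_cells // 3}
    return ["reduction" if i in reduction_at else "normal" for i in range(num_cells)]


def build_search_network(alpha_normal, alpha_reduce, num_cells=8):
    """One searched cell per type carries alpha; all other cells are derived."""
    seen = set()
    network = []
    for kind in cell_layout(num_cells):
        alpha = alpha_normal if kind == "normal" else alpha_reduce
        if kind not in seen:
            # Assumption: the first cell of each type is the trainable one.
            seen.add(kind)
            network.append((kind, "searched", alpha))
        else:
            network.append((kind, "derived", derive_cell(alpha)))
    return network
```

With 8 cells this yields 2 searched cells and 6 derived cells, so architecture parameters are stored and updated for only a quarter of the stack.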

## B Calculation of search-stage architecture size

The size of the architecture in the search phase, $k_s$, differs from its size in the evaluation phase because of the softmax weighting factor in equation 3 (illustrated in Figure 2). To address this, we introduce a tighter search constraint $K_{d'}$, which is less than the device resource constraint $K_d$. A lookup graph (LUG) must be built for each dataset by varying $K_{d'}$ within appropriate bounds and running the algorithm until convergence each time to obtain the corresponding device resource constraint $K_d$. The computation time of the LUG can be reduced by running the searches in parallel.
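The gap between the two sizes, and the lookup step, can be illustrated with a short sketch. It assumes that $k_s$ is the softmax-weighted sum of the candidate operation sizes on each edge, that the evaluation-phase size keeps only the top-weighted operation per edge, and that the LUG is stored as measured $(K_{d'}, K_d)$ pairs that increase together and are read by linear interpolation. The function names and the numbers in the usage note are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def search_phase_size(alpha, op_sizes):
    """Softmax-weighted size of all candidate operations (assumed form of k_s)."""
    total = 0.0
    for logits in alpha:  # one list of logits per edge
        weights = softmax(logits)
        total += sum(w * s for w, s in zip(weights, op_sizes))
    return total


def evaluation_phase_size(alpha, op_sizes):
    """Size after keeping only the top-weighted operation on each edge."""
    return sum(op_sizes[max(range(len(l)), key=lambda i: l[i])] for l in alpha)


def lookup_search_constraint(lug, k_d):
    """Linearly interpolate K_d' for a device constraint K_d.

    `lug` is a list of (K_d', K_d) pairs, one per converged search run.
    """
    points = sorted(lug, key=lambda p: p[1])
    if not points[0][1] <= k_d <= points[-1][1]:
        raise ValueError("K_d lies outside the range covered by the lookup graph")
    for (s0, d0), (s1, d1) in zip(points, points[1:]):
        if d0 <= k_d <= d1:
            if d1 == d0:
                return s0
            t = (k_d - d0) / (d1 - d0)
            return s0 + t * (s1 - s0)
    return points[-1][0]
```

For example, with the made-up graph `[(1.0, 2.0), (2.0, 3.5), (3.0, 5.0)]`, a device constraint of 2.75 maps to a search constraint of 1.5.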

## C Algorithm

The practical implementation of our resource-constrained gradient descent-based approach is given in Algorithm 1.
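As a rough illustration of the update order in Algorithm 1, the sketch below runs the loop on a toy problem with a single edge. Several things here are assumptions rather than the paper's method: the validation loss is replaced by a fixed per-operation loss `op_losses`, there are no network weights $w$ to update, $\alpha$ starts at zero instead of random values, and $\lambda$ is updated by projected gradient ascent (the source only says "Update $\lambda$").

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def search_size(alpha, op_sizes):
    """k_s(alpha): softmax-weighted size of the candidate operations."""
    return sum(p * s for p, s in zip(softmax(alpha), op_sizes))


def update_lambda(lam, k_s, k_search, lr):
    """Projected gradient ascent on the multiplier (assumed update rule)."""
    return max(0.0, lam + lr * (k_s - k_search))


def constrained_search(op_losses, op_sizes, k_search, lam=0.0,
                       alpha_lr=0.5, lambda_lr=0.05, steps=500):
    alpha = [0.0] * len(op_losses)  # toy initialisation; the paper uses random values
    for _ in range(steps):
        probs = softmax(alpha)
        # Per-operation cost inside the Lagrangian: loss_o + lambda * size_o.
        costs = [l + lam * s for l, s in zip(op_losses, op_sizes)]
        mean_cost = sum(p * c for p, c in zip(probs, costs))
        # d/d alpha_j of sum_o p_o * c_o equals p_j * (c_j - mean_cost).
        grads = [p * (c - mean_cost) for p, c in zip(probs, costs)]
        alpha = [a - alpha_lr * g for a, g in zip(alpha, grads)]
        # Re-evaluate the search-time size, then update the multiplier.
        lam = update_lambda(lam, search_size(alpha, op_sizes), k_search, lambda_lr)
    best = max(range(len(alpha)), key=lambda i: alpha[i])
    return best, alpha, lam
```

Setting `lambda_lr=0.0` freezes the multiplier, which makes the trade-off easy to see: with `lam=0.0` the search favours the lowest-loss operation regardless of size, while a large fixed `lam` pushes it toward the smallest operation.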

## D Implementation Details

The experiments with the smaller vision datasets (MNIST, FashionMNIST, CIFAR-10, Imagenet-16-120 and TinyImagenet) were run on a single Tesla V100 GPU. Training and evaluation on Imagenet-1k were performed on a cluster containing eight V100 GPUs.

Fig. 1: Top: the regular DARTS cell with nodes connected by weighted operations, and the derived cell made of top-weighted operations. Bottom: the network comprising the normal cell (bold border) and reduction cells (dotted border) with trainable architecture parameters (red border), and the derived cells (green border) without any architecture parameters.

Fig. 2: Calculation of the memory size of a single cell in the architecture during the search phase (left) and the evaluation phase (right).

**Algorithm 1** DCA-NAS: gradient-descent-based search method

```
Assign random weights to \alpha^{i,j} on edges i,j denoting weights of operations in the mixed set
Input look-up graph G and device memory constraint K_d
Look-up corresponding search memory constraint K_{d'} from G
Calculate total search time memory size k_s(\alpha)
while not converged do
    Calculate \widetilde{\mathcal{L}}(w,\alpha,\lambda) = \mathcal{L}_{\text{val}}(w(\alpha),\alpha) + \lambda(k_s(\alpha) - K_{d'})
    Update weights w by descending \nabla_w \, \widetilde{\mathcal{L}}_{train}(w,\alpha,\lambda)
    Update \alpha by descending \nabla_\alpha \, \widetilde{\mathcal{L}}_{val}(w^*,\alpha,\lambda)
    Calculate total search time memory size k_s(\alpha)
    Calculate loss as in equation 4
    Update \lambda
end while
Derive the final architecture based on the learned \alpha by connecting the top weighted operations among the mixed set
```

For all datasets except Imagenet-1k, the super-net used for search consists of 8 cells (6 normal cells and 2 reduction cells) with an initial channel count of 16. Each cell has 6 nodes, with the first 2 nodes in cell k serving as input nodes. The super-net is trained for 50 epochs with a batch size of 512 and optimized using SGD with a momentum of 0.9 and a weight decay of 3e-4. The learning rate is initially set to 0.2 and gradually reduced to zero using a cosine scheduler. Architecture parameters $\alpha$ are optimized using the Adam optimizer with a learning rate of 6e-4, momentum coefficients of (0.5, 0.999), and a weight decay of 1e-3. The search is run 5 times, and the architecture with the highest validation accuracy is chosen.

For evaluation, the target-net has 20 cells (18 normal cells and 2 reduction cells) with an initial channel count of 36. The target-net is trained for 600 epochs with a batch size of 96 and optimized using SGD with a momentum of 0.9, a weight decay of 3e-4, and gradient clipping of 5. The initial learning rate is set to 0.025 and gradually reduced to zero using a cosine scheduler. Additional settings include a cutout length of 16, a dropout rate of 0.2, and the use of an auxiliary head.

For Imagenet-1k, we reduce the input size from $224 \times 224$ to $28 \times 28$ using three convolution layers with a stride of 2. The super-net for search has 8 cells starting with 16 channels, and the target-net for evaluation has 14 cells starting with 48 channels. Both search and evaluation use a batch size of 1,024. In search, we train for 50 epochs with a learning rate of 0.5 (annealed down to zero using a cosine scheduler) and a learning rate of 6e-3 for the architecture parameters. In evaluation, we train for 250 epochs using the SGD optimizer with a momentum of 0.9 and a weight decay of 3e-5, and adopt an auxiliary head and the label smoothing technique.
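The cosine learning-rate schedules mentioned in these settings can be written in closed form. The sketch below uses the standard cosine-annealing formula, $\eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_0 - \eta_{\min})(1 + \cos(\pi t / T))$, with a minimum of zero. Whether the authors stepped the schedule per epoch or per iteration is not stated, so the per-epoch stepping and the function name `cosine_lr` are assumptions.

```python
import math


def cosine_lr(epoch, total_epochs, base_lr, min_lr=0.0):
    """Cosine-annealed learning rate for a 0-indexed epoch."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Super-net search on the smaller datasets: 50 epochs, starting from 0.2.
search_schedule = [cosine_lr(e, 50, 0.2) for e in range(51)]

# Target-net evaluation on the smaller datasets: 600 epochs, starting from 0.025.
eval_schedule = [cosine_lr(e, 600, 0.025) for e in range(601)]
```

The rate starts at the base value, passes through half of it at the midpoint of training, and reaches zero at the final epoch.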

## E Model performance under varying FLOPs constraints on CIFAR-10, TinyImagenet and Imagenet-1k

In addition to the number of model parameters, we also experiment with FLOPs as the constraint in our objective function. As shown in Figure 3, DCA-NAS retains its performance as the FLOPs constraint is tightened down to a certain point, below which performance degrades. Compared with manual architectures, our NAS approach yields models that require far fewer FLOPs and are therefore expected to have lower latency.
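To make the difference between the two constraint types concrete, the sketch below counts parameters and multiply-accumulate operations for a single bias-free 2-D convolution. The helper `conv2d_cost` is hypothetical, and counting one multiply-accumulate as one FLOP is a convention assumed here (some tools report twice this number).

```python
def conv2d_cost(in_channels, out_channels, kernel_size,
                out_height, out_width, groups=1):
    """Return (parameters, multiply-accumulates) of a bias-free 2-D convolution."""
    params = (in_channels // groups) * out_channels * kernel_size * kernel_size
    # Every output position applies the full set of filter weights once.
    macs = params * out_height * out_width
    return params, macs


# A 3x3 convolution with 16 input and 16 output channels on a 32x32 feature map.
standard = conv2d_cost(16, 16, 3, 32, 32)

# The depthwise counterpart (groups equal to the number of channels).
depthwise = conv2d_cost(16, 16, 3, 32, 32, groups=16)
```

The parameter count does not depend on the feature-map resolution, while the operation count scales with it, so a FLOPs constraint penalizes operations applied at high resolution more strongly than a parameter constraint does.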



Fig. 3: Plots show that the DCA-NAS method discovers models with fewer FLOPs than other NAS methods and manual architectures without sacrificing prediction performance.

Table 1: Performance comparison of architectures on the NAS-Bench-201 search space. DCA-NAS - 6 M denotes DCA-NAS with a memory constraint of 6 million parameters. Search cost is measured on CIFAR-10 for all methods, while evaluation is done on the specific dataset.

| Architecture              | CIFAR-10    | CIFAR-100   | Imagenet-16-120 | Search Cost (GPU sec.) | Search Strategy |
|---------------------------|-------------|-------------|-----------------|------------------------|-----------------|
| ResNet [12]               | 93.97       | 70.86       | 43.63           | -           | -               |
| RSPS [19]                 | 87.66(1.69) | 58.33(4.34) | 31.14(3.88)     | 8007.13     | random          |
| ENAS [24]                 | 54.30(0.00) | 15.61(0.00) | 16.32(0.00)     | 13314.51    | RL              |
| TE-NAS [4]                | 93.9(0.47)  | 71.24(0.56) | 42.38(0.46)     | 1558        | training-free   |
| DARTS +Cutout† (1st) [21] | 54.30(0.00) | 15.61(0.00) | 16.32(0.00)     | 10889.87    | gradient        |
| DARTS+Cutout† (2nd) [21]  | 54.30(0.00) | 15.61(0.00) | 16.32(0.00)     | 29901.67    | gradient        |
| SNAS [34]                 | 92.77(0.83) | 69.34(1.98) | 43.16(2.64)     | -           | gradient        |
| GDAS [9]                  | 93.61(0.09) | 70.70(0.30) | 41.84(0.90)     | 28925.91    | gradient        |
| PC-DARTS [36]             | 93.41(0.30) | 67.48(0.89) | 41.31(0.22)     | 8012        | gradient        |
| iDARTS† [38]              | 93.58(0.32) | 70.83(0.48) | 40.89(0.68)     | -           | gradient        |
| DrNAS [6]                 | 94.36(0.00) | 73.51(0.00) | 46.34(0.00)     | 8219        | gradient        |
| Shapley-NAS [33]          | 94.37(0.00) | 73.51(0.00) | 46.85(0.12)     | -           | gradient        |
| DCA-NAS - 6 M             | 94.32(0.3)  | 72.4(0.40)  | 43.1(0.60)      | 3800        | gradient        |
| Optimal                   | 94.37       | 73.51       | 47.31           | -           | -               |