# Looped ReLU MLPs May Be All You Need as Practical Programmable Computers

Yingyu Liang<sup>♦,♥</sup> Zhizhou Sha<sup>♣</sup> Zhenmei Shi<sup>♠</sup> Zhao Song<sup>♠</sup> Yufa Zhou<sup>♡</sup>

<sup>♠</sup> University of Wisconsin-Madison.

<sup>♠</sup> Tsinghua University.

<sup>©</sup> University of Pennsylvania.

<sup>♠</sup> The Simons Institute for the Theory of Computing at the University of California, Berkeley.

#### Abstract

Previous work has demonstrated that attention mechanisms are Turing complete. More recently, it has been shown that a looped 9layer Transformer can function as a universal programmable computer. In contrast, the multi-layer perceptrons with ReLU activation (ReLU-MLP), one of the most fundamental components of neural networks, is known to be expressive; specifically, a two-layer neural network is a universal approximator given an exponentially large number of hidden neurons. However, it remains unclear whether a ReLU-MLP can be made into a universal programmable computer using a practical number of weights. In this work, we provide an affirmative answer that a looped 23-layer ReLU-MLP is capable of performing the basic necessary operations, more efficiently and effectively functioning as a programmable computer than a looped Transformer. This indicates simple modules have stronger expressive power than previously expected and have not been fully explored. Our work provides insights into the mechanisms of neural networks and demonstrates that complex tasks. such as functioning as a programmable computer, do not necessarily require advanced architectures like Transformers.

## 1 INTRODUCTION

Transformers (Vaswani et al., 2017) have demonstrated their potential across a variety of tasks, emerg-

Proceedings of the 28<sup>th</sup> International Conference on Artificial Intelligence and Statistics (AISTATS) 2025, Mai Khao, Thailand. PMLR: Volume 258. Copyright 2025 by the author(s).

ing as a dominant choice for a wide spectrum of practical applications, including natural language processing (NLP) (Devlin et al., 2019; Raffel et al., 2020; OpenAI, 2023; Abdin et al., 2024; Anthropic, 2024; OpenAI, 2024; Meta, 2024) and computer vision (Dosovitskiy et al., 2020; Peebles and Xie, 2023; Han et al., 2022; Arnab et al., 2021; Khan et al., 2022; Wang et al., 2023b,a, 2024b; Liang et al., 2024b; Wang et al., 2024a), among others. The success of Transformers is largely attributed to their ability to perform complex operations, such as induction head (Olsson et al., 2022; Crosbie and Shutova, 2024), in-context learning (Dong et al., 2022; Wei et al., 2021, 2022b; Xu et al., 2024; Shi et al., 2024c; Chen et al., 2025a), information retrieval Shi et al. (2024a) and chain of thoughts (Wei et al., 2022c; Kojima et al., 2022). Meanwhile, another line of research explores the theoretical capability of Transformers. For instance, Pérez et al. (2021) has proven the Turing completeness of the attention mechanism. However, Turing completeness is not a feature unique to Transformers. Siegelmann and Sontag (1992); Chung and Siegelmann (2021) has demonstrated that Recurrent Neural Networks (RNNs) are also Turing complete. Other works (Pinkus, 1999; Kidger and Lyons, 2020) have shown that the most basic module in deep learning, the Multi-Layer Perceptron (MLP), is a universal approximator.

However, the concept of Turing completeness inherently necessitates infinite memory, an impracticality in real-world scenarios due to the finite nature of available memory. In Giannou et al. (2023), they bridge the gap between the theoretical Turing machine and the practical Transformer-based programmable computer by illustrating that a looped 9-layer Transformer possesses the necessary expressiveness to operate as a programmable computer. Given that ReLU-MLP also has the capability for achieving Turing completeness and itself is a universal approximator, this raises an interesting question:

Is ReLU-MLP expressive enough to be a practical

programmable computer?

To the best of our knowledge, no practical solution for constructing a ReLU-MLP as a general-purpose computer has been proposed before. Therefore, we explore the conditions under which a ReLU-MLP can function as a universal programmable computer and prove that a 23-layer looped ReLU-MLP is capable of emulating a general-purpose computer. Our main approach involves constructing a 23-layer ReLU-MLP to demonstrate that the minimalistic SUBLEQ instruction can be implemented, thereby showing that the computational power of a 23-layer ReLU-MLP is comparable to that of a programmable computer. This idea was first introduced by Giannou et al. (2023), but our key contribution lies in using a simpler ReLU-MLP architecture, in contrast to the more complex Transformer architecture employed in Giannou et al. (2023).

Our research centers on MLPs equipped with the ReLU activation function, which is fundamental to their design. We commence by formally defining the ReLU activation function as follows:

**Definition 1.1** (ReLU). We define ReLU as follows: For a vector  $x \in \mathbb{R}^n$ , the output of the *i*-th entry of ReLU for  $i \in \{1, 2, ..., n\}$  is ReLU $(x)_i = \max\{x_i, 0\}$ .

Then, we introduce the definition of MLP with ReLU activation as follows:

**Definition 1.2** (ReLU-MLP). Let n be the size of the state. Let input be  $x \in \mathbb{R}^n$ . The weights and biases are  $W \in \mathbb{R}^{n \times n}, b \in \mathbb{R}^n$ . We have the 1-layer output of ReLU Multiple-Layer Perceptron (ReLU-MLP) as

$$ReLU(Wx+b) \in \mathbb{R}^n$$
. (1)

The m-layer ReLU-MLP is then the compositions of m such Perceptrons in Eq. (1), e.g., ReLU( $W_2 \cdot \text{ReLU}(W_1x+b_1)+b_2$ ) for 2-layer ReLU-MLP.

Based on this setting, we have demonstrated that a 23-layer looped ReLU-MLP is capable of emulating a general-purpose computer. This is accomplished by constructing a 23-layer ReLU-MLP that can execute a generalized form of the single instruction SUBLEQ (see also Algorithm 1). The SUBLEQ instruction, which stands for Subtract and Branch if Less-than or Equal to zero, operates on three addresses  $a, b, c \in \mathbb{R}^{\log n}$ . It subtracts the value at memory location mem[b] from the value at memory location mem[a], stores the result back at mem[b], and if the result is less than or equal to zero, the program execution continues at the address specified by c. Otherwise, it will execute the next instruction without branching. Despite the instruction's simplicity, it is powerful enough to be the foundation of a universal computing system (Mavaddat and Parhami,

1988; Esolangs, 2025). Consequently, we have established that our ReLU-MLP design constitutes a functional One Instruction Set Computer (OISC).

In contrast to previous research (Giannou et al., 2023), which utilized Transformers as the fundamental building block to create a universal computer, our approach harnesses the simple ReLU-MLP to accomplish the same objective. Furthermore, for each forward pass, our looped ReLU-MLP takes  $O(n \log n)$  time complexity, while looped Transformer takes  $O(n^2)$  (Section 6.2). These findings suggest that basic ReLU-MLP modules are sufficiently expressive, and their potential is underexplored. With careful design, they might exhibit emergent abilities like in-context learning, indicating that complex architectures like Transformers may not always be necessary for certain computational tasks. Understanding the fundamental capabilities of neural networks before adopting complex models could lead to more efficient, less resourceintensive solutions. Challenging the belief that only advanced models can handle complex tasks opens avenues for research, prioritizing simplicity and efficiency without sacrificing performance. In a world of finite computational resources, this discovery could lead to more sustainable and accessible AI solutions.

To sum up, we conclude our contributions as follows:

- To the best of our knowledge, we are the first work to prove that a looped 23-layer ReLU-MLP satisfies the conditions required to function as programmable computers (Theorem 4.1).
- For each forward pass, our looped ReLU-MLP takes  $O(n \log n)$  time complexity, while looped Transformer takes  $O(n^2)$  (Section 6.2), highlighting the importance of understanding the capabilities of the fundamental ReLU-MLP component and demonstrating its powerful expressivity.
- Our findings show that traditional neural networks, such as ReLU-MLP, have not been fully explored, challenging the belief that only advanced architectures can perform complex tasks.

Roadmap. The paper is organized as follows: In Section 2, we discuss related literature. In Section 3, we provide our notation system and key concepts and definitions. In Section 4, we introduce our main result that a looped 23-layer ReLU-MLP can emulate a programmable computer. In Section 5, we demonstrate how to implement basic operations such as read, write, conditional branching, and SUBLEQ using ReLU-MLP. In Section 6, we discuss the high-level intuition and potential future directions of our findings. In Section 7, we conclude our paper.

# 2 RELATED WORK

Complexity and Neural Networks. Circuit complexity, a branch of computational complexity theory, studies circuit families as models of computation<sup>1</sup>. Several circuit complexity classes are significant in machine learning. Specifically,  $AC^0$  represents problems highly parallelizable with standard logic gates, while TC<sup>0</sup> extends this to include threshold gates, and NC<sup>1</sup> denotes the language recognizable by  $O(\log n)$ -depth circuits with bounded gate arity (Merrill et al., 2022). It is known that  $AC^0 \subset TC^0 \subseteq NC^1$ , but whether  $TC^0 \neq NC^1$  remains an open question. Assuming this inequality, Liu et al. (2022) shows that Transformer depth must depend on input sequence length when simulating non-solvable semiautomata. Li et al. (2024c) explore relationships among constant-depth Transformers, Transformers with Chain-of-Thought (CoT), and circuit complexity. They demonstrate:  $\mathsf{T}[\mathsf{poly}(n), 1, 1] \subseteq \mathsf{CoT}[\log n, \mathsf{poly}(n), 1, 1] \subseteq \mathsf{AC}^0$  and  $\mathsf{T}[\mathsf{poly}(n), \log n, 0] \subseteq \mathsf{CoT}[\log n, \mathsf{poly}(n), \log n, 0] \subseteq$  $\mathsf{TC}^0$  where  $\mathsf{T}[d(n), s(n), e(n)]$  denotes a constant-depth Transformers with embedding size d(n), precision s(n)bits, and exponent bits e(n) for input length n and  $\mathsf{CoT}[T(n), d(n), s(n), e(n)]$  denotes a T(n)-step CoT of a constant-depth Transformer T[d(n), s(n), e(n)]. It provides theoretical insights into the emergent CoT ability of Transformers, showing that intermediate reasoning steps enable tackling more complex problems.

The Strong Exponential Time Hypothesis (SETH), introduced by Impagliazzo and Paturi (2001), strengthens the  $P \neq NP$  conjecture by asserting that current best SAT algorithms are roughly optimal: for every  $\epsilon >$ 0, there exists k > 3 such that k-SAT cannot be solved in  $O(2^{(1-\epsilon)n})$  time, even randomly. SETH is widely used to prove fine-grained lower bounds for various algorithmic problems (Williams, 2018) and has been applied to derive lower bounds for Transformer training/inference (Alman and Song, 2023, 2024b; Liang et al., 2024a) and tensor attention (Alman and Song, 2024a; Liang et al., 2024c). Specifically, Alman and Song (2023) demonstrates that unless the SETH fails, no algorithm exists that can compute the forward pass of an attention network in truly subquadratic time. On the other hand, Alman and Song (2024b) establishes that the same condition applies to the backward computation of attention networks, i.e., unless the SETH fails, no truly-subquadratic time algorithm can be devised for the backward computation of attention networks. In essence, complexity theory provides a powerful framework for investigating neural networks'

computational capabilities by rigorously analyzing the computational problems they can efficiently solve.

Turing Completeness of Neural Networks. In recent years, neural networks (NNs) have demonstrated great potential in performing tasks that were previously considered impossible for traditional numerical approximation methods. This remarkable capability is largely attributed to their properties as universal approximators (Pinkus, 1999; Yun et al., 2019; Kidger and Lyons, 2020; Chen et al., 2025c; Hu et al., 2024d) and, in some cases, their Turing completeness (Siegelmann and Sontag, 1992; Pérez et al., 2019; Dehghani et al., 2019; Chung and Siegelmann, 2021; Pérez et al., 2021; Stogin et al., 2024). Specifically, Pérez et al. (2019); Pérez et al. (2021) show that Transformers with attention mechanism under infinite precision are Turing complete, whereas Dehghani et al. (2019) demonstrates that this is not the case under fixed precision. Another line of work (Pollack, 1987; Siegelmann and Sontag, 1992; Indyk, 1995; Kilian and Siegelmann, 1996; Chung and Siegelmann, 2021; Stogin et al., 2024) focuses on recurrent neural networks (RNNs) and proves their Turing completeness. Moreover, Wei et al. (2022a) demonstrates that ReLU-MLP can meaningfully approximate Boolean circuits, and Transformers can meaningfully approximate Turing machines. It is important to note that Turing completeness deals with discrete computations, such as processing language, whereas universal approximation focuses on continuous functions. Shi et al. (2021, 2024b) show that ReLU-MLP is expressive and over fixed feature methods like kernels. Thus, one property does not imply the other (Yun et al., 2019), and it is necessary to study these two subjects separately.

Limitations of Transformers. Transformers have demonstrated remarkable capability in natural language processing tasks, yet their proficiency in mathematical computations remains a concern (Charton, 2022). Therefore, research has been directed toward delineating the computational limits of Transformers when faced with mathematical tasks. Merrill and Sabharwal (2023) has shown that if  $L \neq P^2$  (i.e. not all polynomial-time problems are solvable in logarithmic space), Transformers are incapable of accurately resolving linear inequalities or determining membership in an arbitrary context-free grammar that includes empty productions, and Feng et al. (2024) illustrates that unless  $TC^0 = NC^1$ , there is no log-precision Transformers is capable to solve arithmetic

<sup>&</sup>lt;sup>1</sup>We refer the reader to the chapter 6 and 14 of Arora and Barak (2009) or chapter 1 and 2 of *Handbook of Theoretical Computer Science* (Boas, 2014; Johnson, 1990) for more detailed background of circuit complexity.

 $<sup>^2{\</sup>rm The~class}$  L represents the set of problems that can be resolved using logarithmic space, whereas P denotes the class of problems that can be solved within polynomial time constraints.

and equation-solving problems. Li et al. (2025c); Ke et al. (2025a,b,c); Chen et al. (2025b); Li et al. (2024b,a); Chen et al. (2024); Hu et al. (2024a,b,c) show the limitation of Transformers by circuit complexity or some other frameworks.

Neural Networks Can Perform Algorithms.

Given their Turing completeness, it is not surprising that neural networks (NNs) can perform algorithms once properly trained. One example is their ability for in-context learning (ICL) (Olsson et al., 2022; Min et al., 2022; Xu et al., 2024; Shi et al., 2024c; Gao et al., 2023), where Transformers produce the correct output based on the context provided by examples without adapting their parameters. Studies have shown that ICL can implement optimization algorithms like gradient descent across layers (Von Oswald et al., 2023; Akyürek et al., 2023; Ahn et al., 2024; Mahankali et al., 2023; Gatmiry et al., 2024), and interestingly, Transformers can in-context fine-tune smaller Transformers (Panigrahi et al., 2024). Moreover, Li et al. (2024c) shows that when equipped with enough steps of Chainof-Thought (CoT) reasoning (Wei et al., 2022c; Kojima et al., 2022), constant-depth Transformers using constant-bit precision and embedding size can solve any problem solvable by Boolean circuits. Other studies (Zhao et al., 2023; Allen-Zhu and Li, 2023; Li et al., 2025b,a) observe that Transformers perform dynamic programming to generate. Transformers have also been shown to efficiently learn arithmetic operations such as addition, multiplication, and elementary functions like square roots, and can even simulate a programmable computer (Lee et al., 2023; Giannou et al., 2023). Additionally, Hertrich and Sering (2024) proves that ReLU-MLP can solve exact max-flow problems.

Neural Networks as Practical Programmable Computer. The programmable computer is known as a powerful and controllable computing architecture. Numerous studies strive to establish the equivalence of their proposed neural network architectures with the programmable computer, thereby illustrating the efficacy of their designs. Giannou et al. (2023) employs Transformers as the building block to build a programmable computer, showcasing the latent capabilities of Transformer-based neural networks. Additionally, other research initiatives adopt distinct methodologies to realize programmable computer architectures. For example, Carpinlioğlu et al. (2024) introduces a construction utilizing optical neural networks. Studies such as Liang and Van den Broeck (2019); Liang (2021) are dedicated to investigating the feasibility of attaining universal computation through probabilistic circuits. Moreover, Weiss et al. (2021) proposes a computational model for the Transformer-encoder using a domain-specific language called the Restricted Access Sequence Processing Language (RASP). Building on this, Lindner et al. (2024) introduce Tracr, a compiler that leverages RASP to use Transformer networks as programmable units.

#### 3 PRELIMINARY

This section provides essential definitions used in this paper. In Section 3.1, we introduce some basic notations. In Section 3.2, we present several key concepts related to the state vector of the programmable computer constructed by ReLU-MLP.

#### 3.1 Notations

We use [n] to denote  $\{1, 2, ..., n\}$  for any  $n \in \mathbb{N}_+$ . We use  $e_i$  to denote a vector in which only the i-th location is 1 and zeros everywhere else. We denote an all 1 vector using  $\mathbf{1}_n \in \mathbb{R}^n$ . We denote an all 0 vector using  $\mathbf{0}_n \in \mathbb{R}^n$ . We use  $a^{\mathsf{T}}b$  to denote the inner product of  $a, b \in \mathbb{R}^d$  i.e.  $a^{\mathsf{T}}b := \sum_{i=1}^d a_i b_i$ . We use  $\circ$  to denote the Hadamard product, i.e., the i-th entry of  $a \circ b$  is  $a_i b_i$ . Let  $I_{d \times d} \in \mathbb{R}^{d \times d}$  denote an identity matrix.

#### 3.2 Key Concepts

We begin by introducing the way we organize the data. Different from conventional  $\{0,1\}$  representation of the data, we use  $\{\pm 1\}$  to represent the data. This design will benefit the calculation of the address vectors, which will be discussed in Remark 3.5.

**Definition 3.1** (One-Bit Data). We define one-bit data as  $v \in \{-1, 1\}$ .

One bit is not capable of representing the integers or floats or other data types used in modern computers. Thus, we introduce the d-bits data vector as follows:

**Definition 3.2** (d-Bits Data). We define data as  $v \in \{-1,1\}^d$  using two's complement. Here, data dimension d means the number of bits, and the data type can be int32 or float64 in a computer.

In this work, we consider all data as integers. Specifically, we use 2's complement to represent the integer. It is worth mentioning that the data type can be easily extended. Due to the space limitation, we temporarily consider only integer data.

**Definition 3.3** (2's Complement). For a d-bit data value  $v \in \{-1,1\}^d$ , which represents an integer with bits  $b_d, b_{d-1}, \ldots, b_2, b_1$ , where  $b_i \in \{\pm 1\}$  for  $i \in [d]$ , we denote  $b_d$  as the most significant bit (MSB).

The integer value of v is defined as follows:

- If  $b_d = -1$ , the integer is considered positive, with a value given by:  $\sum_{i=1}^{d-1} 2^{i-1} \frac{b_i+1}{2}$ .
- If  $b_d = +1$ , the integer is considered negative, with a value given by:  $-2^{d-1} + \sum_{i=1}^{d-1} 2^{i-1} \frac{b_i+1}{2}$ .

Suppose there are total n bits in the programmable computer. Hence, we choose the length of the address vector as  $\log(n)$ , which is the most efficient way to locate the total n address. We present the definition of address vector as follows:

**Definition 3.4** (Address). We define the address of a data as  $a \in \{-1, +1\}^{\log(n)}$  with state size n.

Since we choose  $\{\pm 1\}$  instead of  $\{0,1\}$  as our data representation, only the inner product of the address vectors with the same address will be  $\log(n)$ . Any inner product of address vectors with different addresses will be strictly less than  $\log(n)$ . This property facilitates our addressing operation.

Remark 3.5 (Address Property). In Definition 3.4, we use vectors with value  $\pm 1$  to represent the address, which is different from classical  $\{0,1\}$  representation. Under this setting, we have  $\forall i \in [n], a_i^{\top} a_i = \log(n)$ , and  $\forall i, j \in [n], i \neq j$ , we have  $a_i^{\top} a_j < \log(n)$  because of Cauchy-Schwarz inequality.

In this work, we mainly focus on constructing ReLU-MLP for executing "SUBLEQ". This focus stems from the fact that a One Instruction Set Computer (OISC) constructed with the "SUBLEQ" instruction is functionally equivalent to a programmable computer in terms of its computational capabilities (Mavaddat and Parhami, 1988). Since we only consider the "SUBLEQ" instruction, we do not need any bits to encode the type of instruction. We only need to encode three address vectors used by the "SUBLEQ" instruction. By simply concatenating the three address vectors, we have the length of the instruction as  $3\log(n)$ .

**Definition 3.6** (Instruction). Let "SUBLEQ" instruction be defined as in Algorithm 1. Let the address vector be defined as Definition 3.4. Then we define the instruction vector  $c_i \in \{\pm 1\}^{3\log(n)}$  by simple concatenating three address vectors  $a, b, c \in \{\pm 1\}^{\log(n)}$  required by the "SUBLEQ" instruction. Namely, we have  $c_i$  satisfies the following equation:  $c_i = [a, b, c]$ .

Based on all the crucial concepts introduced above, we now introduce the state vector for our programmable computer. This state vector contains all the computer's registers, data/memory, and instructions.

**Definition 3.7** (One-Bit State). Let n denote the size of the state vector. Let  $r_{d_1}, r_{d_1} \in \{\pm 1\}$  denote two data registers. Let  $r_c \in \{\pm 1\}$  denote the carry bit. Let  $r_{a_1}, r_{a_2}, r_{a_3} \in \{\pm 1\}^{\log(n)}$  denote three address registers. Let  $r_{pc} \in \{\pm 1\}^{\log(n)}$  denote the program counter.

Let  $c_1, c_2, \dots, c_m \in \{\pm 1\}^{3\log(n)}$  denote m instructions, where  $c_m = c_{\text{EOF}}$  is the End Of File (EOF) instruction, which means the program should terminate here. Let  $v_1, v_2, \dots, v_k \in \{\pm 1\}$  denote k one-bit data stored in the memory and the memory size k satisfies  $k = n - 2 - 4\log(n) - 3m\log(n)$ . We define our one-bit state of ReLU-MLP as follows:



# 4 MAIN RESULT

In this section, we introduce our thrilling finding, which demonstrates a looped 23-layer ReLU-MLP is capable of emulating a programmable computer.

**Theorem 4.1** (Looped ReLU-MLP as Programmable Computer, Informal Version of Theorem E.1). Let ReLU-MLP be defined as Definition 1.2. Let n be the size of the state vector. Let m be the number of instructions. Let k be the number of one-bit data stored in the memory. For  $i \in [k]$ , each data is  $v_i \in \{\pm 1\}$  and the memory size k satisfies  $k = n - 2 - 4\log(n) - 3m\log(n)$ . Let the address vector  $a_i \in \{\pm 1\}^{\log(n)}$ . Suppose we have two data registers  $r_{d_1}, r_{d_2} \in \{\pm 1\}$ , one carry bit  $r_c \in \{\pm 1\}$ , three address registers  $r_{a_1}, r_{a_2}, r_{a_3} \in$  $\{\pm 1\}^{\log(n)}$ , and one program counter  $r_{pc} \in \{\pm 1\}^{\log(n)}$ in the scratchpad. Then, a 23-layer ReLU-MLP with width n can emulate a programmable computer, where d is the number of bits we use to store each integer. Namely, this "computer" supports integers within the range  $[-2^{d-1}, 2^{d-1} - 1]$ .

The high-level idea of the proof is that we first prove the 23-layer ReLU-MLP is capable of emulating the one-bit version "SUBLEQ" instruction. Then, we extend it to supporting d-bits version "SUBLEQ". Finally, according to the finding in Mavaddat and Parhami (1988), the One Instruction Set Computer (OISC) constructed by the looped ReLU-MLP is Turing complete, which indicates it is equivalent to a programmable computer. Due to the space limitation, we refer readers to Appendix E for more in-depth analysis.

Furthermore, in Section 6.1, we show that our model only requires  $O(n \log n)$  number of parameters by low-rank decomposition. Thus, each forward pass of our 23-layer ReLU-MLP only takes  $O(n \log n)$  time complexity, while looped Transformer in Giannou et al. (2023) takes  $O(n^2)$  time complexity for each forward pass. We refer readers to Section 6.2 for more details.

#### 5 IMPLEMENT KEY FUNCTIONS

In this section, we outline a selection of critical functions that the ReLU-MLP model is capable of executing. Due to the space limitation, the detailed proofs are deferred to the Appendix. Specifically, in Section 5.1, we implement the read operation. Section 5.2 covers the write operation. Section 5.3 presents addition, while Section 5.4 addresses subtraction. Conditional branching is implemented in Section 5.5, and the SUBLEQ operation is in Section 5.6.

#### 5.1 Read

We first consider the most basic "read" operation, which is to read any data or instruction from memory into a register. We begin with introducing the read operation for one-bit data as follows:

**Lemma 5.1** (Read One-Bit Data, Informal Version of Lemma A.1). Let ReLU-MLP be defined as Definition 1.2. Let n denote the number of data in the memory. For  $i \in [n]$ , each data  $v_i \in \{\pm 1\}$ . Let the address vector  $a_i \in \{\pm 1\}^{\log(n)}$ . Then, a 2-layer ReLU-MLP can read any one-bit data from the memory to the register.

Then, to achieve the read operation for d-bits data, we only need to use the ReLU-MLP introduced in the previous Lemma to perform a read operation on each bit in the d-bits data.

**Lemma 5.2** (Read d-Bits Data, informal version of Lemma A.2). Let ReLU-MLP be defined as Definition 1.2. Let n denote the number of data in the memory. For  $i \in [n]$ , each data  $v_i \in \{\pm 1\}$ . Let  $v \in \{\pm 1\}^d$  denote a d dimension vector. Let the address matrix  $a_i \in \{\pm 1\}^{\log(n) \times d}$ . Then, a 2-layer ReLU-MLP looped for d times can read any d-bits data vector from the memory to the register.

#### 5.2 Write

Corresponding to the read operation, in this section, we introduce how to use ReLU-MLP to implement the "write" operation. We start with the write operation of one-bit data.

**Lemma 5.3** (Write One-Bit Data, Informal Version of Lemma A.3). Let ReLU-MLP be defined as Definition 1.2. Let n denote the number of data in the memory. For  $i \in [n]$ , each data  $v_i \in \{\pm 1\}$ . Let the address vector  $a_i \in \{\pm 1\}^{\log(n)}$ . Then, a 2-layer ReLU-MLP can write any one-bit data from the register to the memory.

Subsequently, we can expand our approach to accommodate the "write" operation for d-bit data by sequentially processing each bit within the d-bit data.

**Lemma 5.4** (Write d-Bits Data, Informal Version of Lemma A.4). Let ReLU-MLP be defined as Definition 1.2. Let n denote the number of data in the memory. For  $i \in [n]$ , each data  $v_i \in \{\pm 1\}$ . Let  $v \in \{\pm 1\}^d$  denote a d dimension vector. Let the address matrix  $a_i \in \{\pm 1\}^{\log(n) \times d}$ . Then, a 2-layer ReLU-MLP looped for d times can write any d-bits data vector from the register to the memory.

#### 5.3 Addition

Beyond the fundamental memory operations of reading and writing, a pivotal set of operations involves algorithmic functions. Consequently, we demonstrate how the addition and subtraction operations can be emulated by the ReLU-MLP. We begin with introducing the emulation of one-bit addition operation via ReLU-MLP.

**Lemma 5.5** (One-Bit Addition, Informal Version of Lemma B.1). Let ReLU-MLP be defined as Definition 1.2. Then, a 6-layer ReLU-MLP can emulate the "addition" operation for any one-bit data.

We extend to d-bits addition as follows:

**Lemma 5.6** (d-Bits Addition, Informal Version of Lemma B.2). Let ReLU-MLP be defined as Definition 1.2. Then, a 6-layer ReLU-MLP looped for 2d times (due to carry bit) can emulate the "addition" operation for any d-dimension vectors.

#### 5.4 Subtraction

We move on to introducing the subtraction operation. Since we use 2's complement (Definition 3.3) as the representation of the data, the subtraction operation can be decomposed into two steps. The first step is to negate the subtrahend, and the second step is to add the negated subtrahend to the minuend. Combining the above statement with the addition operation introduced in the previous section, we can easily perform a subtraction operation with a 7-layer ReLU-MLP.

**Lemma 5.7** (d-Bits Subtraction, Informal Version of Lemma B.3). Let ReLU-MLP be defined as Defini-

tion 1.2. Then, a 7-layer ReLU-MLP can emulate the "subtraction" operation for any d-dimension vectors.

# 5.5 Conditional Branching

Conditional branching is an essential operation in computer design since it enables the processor to make decisions and execute different paths of code based on the evaluation of conditions (the flag), thereby allowing for more complex and adaptable program flow. We present our design for conditional branching via ReLUMLP as follows:

**Lemma 5.8** (Conditional Branching, Informal Version of Lemma C.1). Let ReLU-MLP be defined as Definition 1.2. Let n be the number of data in the memory. For  $i \in [n]$ , each data  $v_i \in \{\pm 1\}$ . Let the address vector  $a_i \in \{\pm 1\}^{\log(n)}$ . Then, a 4-layer ReLU-MLP can emulate the "conditional branching" operation.

#### 5.6 SUBLEQ Instruction

Now, we present our method for implementing the "SUBLEQ" instruction using a looped 23-layer ReLU-MLP. The "SUBLEQ" instruction operates with three addresses as parameters, denoted as a, b, and c. It first computes the subtraction between the values stored at addresses b and a, i.e.,  $\mathtt{mem}[b] - \mathtt{mem}[a],$  and then updates the memory at  $\mathtt{mem}[b]$  with this result. If the value at  $\mathtt{mem}[b]$  is zero or negative, the program counter will be transferred to the address specified by c; otherwise, the program counter will go to the next instruction without branching. Algorithm 1 demonstrates the execution process of "SUBLEQ" instruction.

**Algorithm 1** SUBtract and branch if Less-than or Equal to zero (SUBLEQ), (Mavaddat and Parhami, 1988; Giannou et al., 2023)

```
    procedure SUBLEQ(a, b, c)
    mem[b] = mem[b] - mem[a] ▷ mem[a] denotes the value of address a in the memory
    if mem[b] ≤ 0 then
    goto instruction c
    else goto next instruction
    end if
    end procedure
```

We integrate the read and write operations, along with the addition and subtraction operations and the conditional branching mechanisms discussed in the preceding sections, to facilitate the implementation of the "SUBLEQ" instruction. The accompanying lemma is stated as follows:

**Lemma 5.9** (ReLU-MLP Emulate SUBLEQ, Informal Version of Lemma D.1). Let ReLU-MLP be defined as Definition 1.2. Let n denote the size of the state vector.

Let m denote the number of instructions. Let k denote the number of one-bit data stored in the memory. For  $i \in [k]$ , each data is  $v_i \in \{\pm 1\}$  and the memory size k satisfies  $k = n - 2 - 4\log(n) - 3m\log(n)$ . Let the address vector  $a_i \in \{\pm 1\}^{\log(n)}$ . Let the instruction  $c_i \in \{\pm 1\}^{3\log(n)}$  be defined as Definition 3.6. Suppose we have three data registers  $r_c, r_{d_1}, r_{d_2} \in \{\pm 1\}$ , one carry bit  $r_c \in \{\pm 1\}$ , three address registers  $r_{a_1}, r_{a_2}, r_{a_3} \in \{\pm 1\}^{\log(n)}$ , and one program counter  $r_{pc} \in \{\pm 1\}^{\log(n)}$  in the scratchpad. Then, a 23-layer ReLU-MLP with width n can emulate the "SUBLEQ" operation (Algorithm 1).

To make it easier for readers to understand our proof, we provide a proof sketch here, which contains some high-level ideas used in our proof. Specifically, we first read the necessary address vectors and data required by the instruction from the memory. Then, we perform subtraction and conditional branching via the respective ReLU-MLPs discussed in the previous sections.

Proof sketch. We use the state vector x as defined in Definition 3.7. The first step is to read the "SUBLEQ" instruction and the data required by the instruction from memory, where the read operation is supported by the ReLU-MLP discussed in Lemma 5.2. After this step, the state vector will change as follows:

$$x = \begin{bmatrix} r_c \\ r_{d_1} \\ r_{d_2} \\ r_{a_1} \\ r_{a_2} \\ r_{a_3} \\ r_{pc} \\ \hline v_1 \\ \vdots \\ v_k \\ \hline c_1 \\ \vdots \\ c_{m-1} \\ c_{\rm EOF} \end{bmatrix} \rightarrow \begin{bmatrix} r_c \\ r_{d_1} \\ r_{d_2} \\ a \\ b \\ c \\ r_{pc} \\ \hline v_1 \\ \vdots \\ v_k \\ \hline c_1 \\ \vdots \\ c_{m-1} \\ c_{\rm EOF} \end{bmatrix} \rightarrow \begin{bmatrix} r_c \\ \text{mem}[a] \\ \text{mem}[b] \\ a \\ b \\ c \\ r_{pc} \\ \hline v_1 \\ \vdots \\ v_k \\ \hline c_1 \\ \vdots \\ c_{m-1} \\ c_{\rm EOF} \end{bmatrix}$$

where we first read the address vectors a and b from the memory to the address registers, then read the corresponding data mem[a] and mem[b] from the memory to the data registers.

In the second step, we calculate the subtraction of mem[b] and mem[a], store its result to mem[b], and calculate the  $r_{pc+1}$  according to  $r_{pc}$ , storing  $r_{pc+1}$  at the address register which previously stores b. The above operations can be supported by Lemma 5.6 and Lemma 5.4. After this step, the state vector will

change as follows:



In the final step, we perform the conditional branching operation. We view the result  $\mathtt{mem}[b]-\mathtt{mem}[a]$  as the flag of the conditional branching. Then, by Lemma 5.8, the program counter will point to the target address  $r_{\mathrm{target}}$  for continue executing the program, where we have  $r_{\mathrm{target}} \in \{c, r_{pc+1}\}$ . After this step, the state vector will change as follows:



In the first step, we treat the value of  $\mathtt{mem}[b] - \mathtt{mem}[a]$  merely as a flag for conditional branching purposes. No operations are conducted in this first step. In the second step, we perform the conditional branching based on the flag's value. Specifically, if the flag is zero or negative, the target register  $r_{\mathrm{target}}$  is set to the value of c. Conversely, if the flag is positive,  $r_{\mathrm{target}}$  is assigned the value of  $r_{pc+1}$ .

Up to this point, we have obtained the execution result for the current "SUBLEQ" instruction. To process the subsequent instruction, we simply need to iterate through the provided 23-layer ReLU-MLP once more, thereby executing the "SUBLEQ" instruction for the next cycle.

We only provided a sketch of the proof above. We refer the readers to Appendix D for more details.

## 6 DISCUSSION AND EXTENSION

Section 6.1 shows that we can reduce the number of parameters. Section 6.2 compares ReLU-MLP with the attention mechanism. In Section 6.3, we discuss the potential capability of ReLU-MLP. In Section 6.4, we discuss the potential inspirations our work offers for designing more efficient neural network architectures.

#### 6.1 Low-rank Decomposition for Efficiency

Though the naive way to construct the looped-ReLU-MLP will take up to  $O(n^2)$  parameter, the parameter requirement can be easily reduced by low-rank decomposition. The high-level idea is that, for all  $n \times n$  matrices used in our proof are equal to an identity matrix minus a matrix with rank  $O(\log(n))$ . Thus, we can decompose our weight matrix into a low-rank format from  $n \times n$  into  $n \times O(\log(n))$  and  $O(\log(n)) \times n$ .

We formalize the above high-level idea into math formulas. Let  $x \in \mathbb{R}^n$  be a vector, where n is the sequence length, and let W be the weight matrix. As all our matrices can be viewed as an identity matrix with a substitution of a submatrix with some  $O(\log n) \times O(\log n)$  matrix on the diagonal, i.e., the matrix W is equal to an identity matrix minus a matrix A with rank  $O(\log(n))$ , corresponding to residual connections. Then, we have A can be decomposed into  $A = U_A V_A^{\top}$ , where  $U_A, V_A \in \mathbb{R}^{n \times O(\log n)}$ . And we have  $Wx = (I - U_A V_A^{\top})x = x - U_A V_A^{\top}x$ . Since we only need to store matrices  $U_A$  and  $V_A$ , the parameter is reduced from  $O(n^2)$  to  $O(n \log n)$ . In other words, for each operation, we only need the  $O(\log n) \times O(\log n)$  matrix to modify the corresponding part of the state vector (vector with length n) and keep the rest part of the state vector unchanged. For example, the  $W_1$  matrix used in Lemma C.1 is a submatrix. Then, the entire matrix is  $W'_1 \in \mathbb{R}^{n \times n}$ , where

$$W_1' = \mathbf{I} - \begin{bmatrix} W_1 + I & 0 & 0 & \cdots & 0 \\ 0 & 0 & 0 & \cdots & 0 \\ 0 & 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 0 \end{bmatrix} = \mathbf{I} + A$$

Since  $W_1 - I \in \mathbb{R}^{O(\log n) \times O(\log n)}$ , the matrix A is at most rank- $O(\log n)$ . Therefore, when equipped with low-rank decomposition methods, all  $n \times n$  matrices used in our method can be decomposed into  $n \times \log(n)$  matrices. The overall parameter used by our method is  $O(n \log(n))$ .

#### 6.2 Comparison with Attention

In terms of computational cost, since all matrices in our method can be decomposed into matrices with rank- $O(\log n)$ , the computation cost for one forward pass is  $O(n\log n)$ . This is because the computational bottleneck of our method is the multiplication of the  $n \times \log n$  size matrix and the n size state vector, which requires  $O(n\log n)$  time. However, for the looped Transformers proposed in Giannou et al. (2023), they have to calculate the matrix multiplication of the  $n \times n$  attention matrix and the  $n \times d$  value matrix in each forward pass, which requires  $O(n^2)$  time complexity.

On the other hand, as discussed in Section 4, we have demonstrated that a looped 23-layer ReLU-MLP is capable of emulating a programmable computer. In contrast to the findings reported by Giannou et al. (2023), where a looped 9-layer Transformer is shown to be capable of such emulation, our result indicates that even a basic component of deep learning, the ReLU-MLP, possesses the potential to handle complex computational tasks. This suggests that while advanced architectures like Transformers have shown proficiency in processing intricate tasks, this proficiency may not come from its advanced architecture design. Instead, it may derive from the inherent capabilities of fundamental components such as the ReLU-MLP. Our finding underscores the importance of investigating the core mechanisms behind the capabilities of advanced architectures, an area that warrants further exploration.

## 6.3 Exploring the Potential of ReLU-MLP

Our research has confirmed that the capabilities of the ReLU-MLP are on par with Transformers when it comes to constructing programmable computers. As noted in Pinkus (1999); Kidger and Lyons (2020), ReLU-MLPs have been demonstrated to be universal approximators. Consequently, we conjecture the programmable computer represents just one of the many downstream tasks that ReLU-MLPs are capable of achieving. Building on the insights from this study, it is promising to further investigate the potential capabilities of ReLU-MLPs. Our approach to analyzing the looped ReLU-MLP mitigates the black-box nature often associated with traditional deep learning training. We are confident that through the analytical methods outlined in our work, we can systematically

probe the capacity of ReLU-MLPs as universal approximators to tackle more complex tasks. This line of work will be done in our future research.

# 6.4 Towards More Efficient Architecture Design for Specific Tasks

The construction-based proof in our work can inspire future explorations of the smallest feasible network structure for specific tasks. Since the main focus of our work is to prove the feasibility of ReLU-MLPbased computer construction, the ReLU-MLP-based construction we provide may not be the smallest construction that can support a programmable computer. Therefore, in fact, we provide a possible lower bound that can achieve a programmable computer. The current mainstream methods for model compression are mainly quantization and model pruning, which often provide model compression solutions based on experimental results and lack theoretical measurements between model capabilities and model capabilities required by the downstream tasks. Thus, leveraging the methodologies applied in this study, we can first identify the minimal requirements of a model for a particular task, enabling researchers to create the smallest yet efficient models to handle it. This strategy permits us to bypass the costly process of the expensive process of scaling up data or model parameters. We leave these directions as our future directions.

# 7 CONCLUSION

In this work, we investigate the computational potential of a looped ReLU-MLP. We begin by demonstrating that a looped 23-layer ReLU-MLP with a single pass is capable of emulating the execution process of the "SUBLEQ" instruction. Drawing on the conclusions presented in Mavaddat and Parhami (1988), we establish that the One Instruction Set Computer (OISC) implemented by the looped 23-layer ReLU-MLP is functionally equivalent to a programmable computer. Contrary to Giannou et al. (2023), which achieved the emulation of a programmable computer using a looped 9layer Transformer, our approach leverages a more efficient and more fundamental building block of deep learning, a 23-layer ReLU-MLP, to accomplish the same objective. This finding prompts us to consider two key insights: (i) The untapped potential of ReLU-MLPs warrants further exploration, offering a deeper understanding of the capability boundaries of existing neural networks; (ii) By adopting the methodologies employed in this study, it may facilitate us to gain insights for designing the minimal neural network requirements for any specific tasks.

# Acknowledgement

Research is partially supported by the National Science Foundation (NSF) Grants 2023239-DMS, CCF-2046710, and Air Force Grant FA9550-18-1-0166.

#### References

- Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024.
- Kwangjun Ahn, Xiang Cheng, Hadi Daneshmand, and Suvrit Sra. Transformers learn to implement preconditioned gradient descent for in-context learning. Advances in Neural Information Processing Systems, 36, 2024.
- Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. What learning algorithm is in-context learning? investigations with linear models. In *The Eleventh International Conference on Learning Representations*, 2023.
- Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 1, context-free grammar. arXiv preprint arXiv:2305.13673, 2023.
- Josh Alman and Zhao Song. Fast attention requires bounded entries. In *NeurIPS*, 2023.
- Josh Alman and Zhao Song. How to capture higherorder correlations? generalizing matrix softmax attention to kronecker computation. In *ICLR*, 2024a.
- Josh Alman and Zhao Song. The fine-grained complexity of gradient computation for training large language models. In *NeurIPS*, 2024b.
- Anthropic. Claude 3.5 sonnet, 2024. URL https://www.anthropic.com/news/claude-3-5-sonnet.
- Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6836–6846, 2021.
- Sanjeev Arora and Boaz Barak. Computational complexity: a modern approach. Cambridge University Press, 2009.
- P Van Emde Boas. Machine models and simulations. Handbook of Theoretical Computer Science, volume A, pages 1–66, 2014.
- Bora Çarpınlıoğlu, Bahrem Serhat Daniş, and Uğur Teğin. Genetically programmable optical random neural networks. arXiv preprint arXiv:2403.12490, 2024.

- François Charton. What is my math transformer doing?—three results on interpretability and generalization. arXiv preprint arXiv:2211.00170, 2022.
- Bo Chen, Xiaoyu Li, Yingyu Liang, Jiangxuan Long, Zhenmei Shi, and Zhao Song. Circuit complexity bounds for rope-based transformer architecture. arXiv preprint arXiv:2411.07602, 2024.
- Bo Chen, Xiaoyu Li, Yingyu Liang, Zhenmei Shi, and Zhao Song. Bypassing the exponential dependency: Looped transformers efficiently learn incontext by multi-step gradient descent. In *International Conference on Artificial Intelligence and Statistics*, 2025a.
- Yifang Chen, Xiaoyu Li, Yingyu Liang, Zhenmei Shi, and Zhao Song. The computational limits of state-space models and mamba via the lens of circuit complexity. In *Conference on Parsimony and Learning*. PMLR, 2025b.
- Yifang Chen, Xiaoyu Li, Yingyu Liang, Zhenmei Shi, and Zhao Song. Universal approximation of visual autoregressive transformers. arXiv preprint arXiv:2502.06167, 2025c.
- Stephen Chung and Hava Siegelmann. Turing completeness of bounded-precision recurrent neural networks. Advances in neural information processing systems, 34:28431–28441, 2021.
- Joy Crosbie and Ekaterina Shutova. Induction heads as an essential mechanism for pattern matching in in-context learning. arXiv preprint arXiv:2407.07011, 2024.
- Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Universal transformers. In *International Conference on Learn*ing Representations, 2019.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding.
   In Proceedings of naacL-HLT, volume 1, page 2.
   Minneapolis, Minnesota, 2019.
- Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. A survey on in-context learning. arXiv preprint arXiv:2301.00234, 2022.
- Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- Esolangs. Subleq, 2025. URL https://esolangs.org/wiki/Subleq.

- Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, Di He, and Liwei Wang. Towards revealing the mystery behind chain of thought: a theoretical perspective. Advances in Neural Information Processing Systems, 36, 2024.
- Yeqi Gao, Zhao Song, and Shenghao Xie. In-context learning for attention scheme: from single softmax regression to multiple softmax regression via a tensor trick. arXiv preprint arXiv:2307.02419, 2023.
- Khashayar Gatmiry, Nikunj Saunshi, Sashank J Reddi, Stefanie Jegelka, and Sanjiv Kumar. Can looped transformers learn to implement multi-step gradient descent for in-context learning? In Fortyfirst International Conference on Machine Learning, 2024.
- Angeliki Giannou, Shashank Rajput, Jy-yong Sohn, Kangwook Lee, Jason D Lee, and Dimitris Papailiopoulos. Looped transformers as programmable computers. In *International Conference on Machine Learning*, pages 11398–11442. PMLR, 2023.
- Kai Han, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu, et al. A survey on vision transformer. *IEEE transactions on pattern analysis and machine intelligence*, 45(1):87–110, 2022.
- Christoph Hertrich and Leon Sering. Relu neural networks of polynomial size for exact maximum flow computation. *Mathematical Programming*, pages 1–30, 2024.
- Jerry Yao-Chieh Hu, Thomas Lin, Zhao Song, and Han Liu. On computational limits of modern hopfield models: A fine-grained complexity analysis. In Forty-first International Conference on Machine Learning, 2024a.
- Jerry Yao-Chieh Hu, Maojiang Su, En-Jui Kuo, Zhao Song, and Han Liu. Computational limits of low-rank adaptation (lora) fine-tuning for transformer models. In *The Thirteenth International Conference on Learning Representations*, 2024b.
- Jerry Yao-Chieh Hu, Wei-Po Wang, Ammar Gilani, Chenyang Li, Zhao Song, and Han Liu. Fundamental limits of prompt tuning transformers: Universality, capacity and efficiency. arXiv preprint arXiv:2411.16525, 2024c.
- Jerry Yao-Chieh Hu, Weimin Wu, Yi-Chen Lee, Yu-Chao Huang, Minshuo Chen, and Han Liu. On statistical rates of conditional diffusion transformers: Approximation, estimation and minimax optimality. arXiv preprint arXiv:2411.17522, 2024d.
- Russell Impagliazzo and Ramamohan Paturi. On the complexity of k-sat. *Journal of Computer and System Sciences*, 62(2):367–375, 2001.

- Piotr Indyk. Optimal simulation of automata by neural nets. In *Annual Symposium on Theoretical Aspects of Computer Science*, pages 337–348. Springer, 1995.
- David S Johnson. A catalog of complexity classes. In *Algorithms and complexity*, pages 67–161. Elsevier, 1990.
- Yekun Ke, Xiaoyu Li, Yingyu Liang, Zhizhou Sha, Zhenmei Shi, and Zhao Song. On computational limits and provably efficient criteria of visual autoregressive models: A fine-grained complexity analysis. arXiv preprint arXiv:2501.04377, 2025a.
- Yekun Ke, Xiaoyu Li, Yingyu Liang, Zhenmei Shi, and Zhao Song. Circuit complexity bounds for visual autoregressive model. arXiv preprint arXiv:2501.04299, 2025b.
- Yekun Ke, Yingyu Liang, Zhenmei Shi, Zhao Song, and Chiwun Yang. Curse of attention: A kernel-based perspective for why transformers fail to generalize on time series forecasting and beyond. In *Conference on Parsimony and Learning*. PMLR, 2025c.
- Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. Transformers in vision: A survey. *ACM computing surveys (CSUR)*, 54(10s):1–41, 2022.
- Patrick Kidger and Terry Lyons. Universal approximation with deep narrow networks. In *Conference on learning theory*, pages 2306–2327. PMLR, 2020.
- Joe Kilian and Hava T Siegelmann. The dynamic universality of sigmoidal neural networks. *Information and computation*, 128(1):48–56, 1996.
- Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022.
- Nayoung Lee, Kartik Sreenivasan, Jason D Lee, Kangwook Lee, and Dimitris Papailiopoulos. Teaching arithmetic to small transformers. arXiv preprint arXiv:2307.03381, 2023.
- Chenyang Li, Yingyu Liang, Zhenmei Shi, Zhao Song, and Tianyi Zhou. Fourier circuits in neural networks and transformers: A case study of modular arithmetic with multiple inputs. In *International Conference on Artificial Intelligence and Statistics*, 2025a.
- Xiaoyu Li, Yuanpeng Li, Yingyu Liang, Zhenmei Shi, and Zhao Song. On the expressive power of modern hopfield networks. arXiv preprint arXiv:2412.05562, 2024a.

- Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song, and Mingda Wan. Theoretical constraints on the expressive power of rope-based tensor attention transformers. arXiv preprint arXiv:2412.18040, 2024b.
- Xiaoyu Li, Yingyu Liang, Jiangxuan Long, Zhenmei Shi, Zhao Song, and Zhen Zhuang. Neural algorithmic reasoning for hypergraphs with looped transformers. arXiv preprint arXiv:2501.10688, 2025b.
- Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song, Wei Wang, and Jiahao Zhang. On the computational capability of graph neural networks: A circuit complexity bound perspective. arXiv preprint arXiv:2501.06444, 2025c.
- Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. Chain of thought empowers transformers to solve inherently serial problems. In *The Twelfth International Conference on Learning Representations*, 2024c.
- Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song, and Yufa Zhou. Multi-layer transformers gradient can be approximated in almost linear time. arXiv preprint arXiv:2408.13233, 2024a.
- Yingyu Liang, Zhenmei Shi, Zhao Song, and Yufa Zhou. Unraveling the smoothness properties of diffusion models: A gaussian mixture perspective. arXiv preprint arXiv:2405.16418, 2024b.
- Yingyu Liang, Zhenmei Shi, Zhao Song, and Yufa Zhou. Tensor attention training: Provably efficient learning of higher-order transformers. arXiv preprint arXiv:2405.16411, 2024c.
- Yitao Liang. Unifying Probabilistic Reasoning, Learning, and Classification with Circuit Representations. University of California, Los Angeles, 2021.
- Yitao Liang and Guy Van den Broeck. Learning logistic circuits. In *Proceedings of the AAAI Conference on Artificial Intelligence*, pages 4277–4286, 2019.
- David Lindner, János Kramár, Sebastian Farquhar, Matthew Rahtz, Tom McGrath, and Vladimir Mikulik. Tracr: Compiled transformers as a laboratory for interpretability. Advances in Neural Information Processing Systems, 36, 2024.
- Bingbin Liu, Jordan T Ash, Surbhi Goel, Akshay Krishnamurthy, and Cyril Zhang. Transformers learn shortcuts to automata. arXiv preprint arXiv:2210.10749, 2022.
- Arvind Mahankali, Tatsunori B Hashimoto, and Tengyu Ma. One step of gradient descent is provably the optimal in-context learner with one layer of linear self-attention. arXiv preprint arXiv:2307.03576, 2023.
- Farhad Mavaddat and Behrooz Parhami. Urisc: the ultimate reduced instruction set computer. *Interna-*

- tional Journal of Electrical Engineering Education, 25(4):327–334, 1988.
- William Merrill and Ashish Sabharwal. The parallelism tradeoff: Limitations of log-precision transformers. Transactions of the Association for Computational Linguistics, 11:531–545, 2023.
- William Merrill, Ashish Sabharwal, and Noah A Smith. Saturated transformers are constant-depth threshold circuits. *Transactions of the Association for Computational Linguistics*, 10:843–856, 2022.
- Meta. Llama 3, 2024. URL https://ai.meta.com/blog/meta-llama-3/.
- Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? arXiv preprint arXiv:2202.12837, 2022.
- Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022.
- OpenAI. Gpt-4 turbo, 2023.
- OpenAI. Openai o1, 2024. URL https://openai.com/o1/.
- Abhishek Panigrahi, Sadhika Malladi, Mengzhou Xia, and Sanjeev Arora. Trainable transformer in transformer. In Forty-first International Conference on Machine Learning, 2024.
- William Peebles and Saining Xie. Scalable diffusion models with transformers. In *Proceedings of the IEEE/CVF International Conference on Computer Vision*, pages 4195–4205, 2023.
- Jorge Pérez, Pablo Barceló, and Javier Marinkovic. Attention is turing-complete. *Journal of Machine Learning Research*, 22(75):1–35, 2021.
- Allan Pinkus. Approximation theory of the mlp model in neural networks. *Acta numerica*, 8:143–195, 1999.
- Jordan Bruce Pollack. On connectionist models of natural language processing. *PhD thesis, Computer Science Dept., University of Illinois*, 1987.
- Jorge Pérez, Javier Marinković, and Pablo Barceló. On the turing completeness of modern neural network architectures. In *International Conference on Learning Representations*, 2019.
- Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. *Journal of machine learning research*, 21 (140):1–67, 2020.

- Zhenmei Shi, Junyi Wei, and Yingyu Liang. A theoretical analysis on feature learning in neural networks: Emergence from inputs and advantage over fixed features. In *International Conference on Learning Representations*, 2021.
- Zhenmei Shi, Yifei Ming, Xuan-Phi Nguyen, Yingyu Liang, and Shafiq Joty. Discovering the gems in early layers: Accelerating long-context llms with 1000x input token reduction. arXiv preprint arXiv:2409.17422, 2024a.
- Zhenmei Shi, Junyi Wei, and Yingyu Liang. Provable guarantees for neural networks via gradient feature learning. Advances in Neural Information Processing Systems, 36, 2024b.
- Zhenmei Shi, Junyi Wei, Zhuoyan Xu, and Yingyu Liang. Why larger language models do incontext learning differently? arXiv preprint arXiv:2405.19592, 2024c.
- Hava T Siegelmann and Eduardo D Sontag. On the computational power of neural nets. In *Proceedings of the fifth annual workshop on Computational learning theory*, pages 440–449, 1992.
- John Stogin, Ankur Mali, and C Lee Giles. A provably stable neural network turing machine with finite precision and time. *Information Sciences*, 658:120034, 2024.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
- Johannes Von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient descent. In *International Conference on Machine Learning*, pages 35151–35174. PMLR, 2023.
- Jiayu Wang, Yifei Ming, Zhenmei Shi, Vibhav Vineet, Xin Wang, Yixuan Li, and Neel Joshi. Is a picture worth a thousand words? delving into spatial reasoning for vision language models. Advances in Neural Information Processing Systems, 36, 2024a.
- Yilin Wang, Zeyuan Chen, Liangjun Zhong, Zheng Ding, Zhizhou Sha, and Zhuowen Tu. Dolfin: Diffusion layout transformers without autoencoder. arXiv preprint arXiv:2310.16305, 2023a.
- Yilin Wang, Haiyang Xu, Xiang Zhang, Zeyuan Chen, Zhizhou Sha, Zirui Wang, and Zhuowen Tu. Omnicontrolnet: Dual-stage integration for conditional image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7436–7448, 2024b.

- Zirui Wang, Zhizhou Sha, Zheng Ding, Yilin Wang, and Zhuowen Tu. Tokencompose: Grounding diffusion with token-level supervision. arXiv preprint arXiv:2312.03626, 2023b.
- Colin Wei, Yining Chen, and Tengyu Ma. Statistically meaningful approximation: a case study on approximating turing machines with transformers. Advances in Neural Information Processing Systems, 35:12071–12083, 2022a.
- Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
- Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022b.
- Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. *Advances in neural information processing systems*, 35:24824–24837, 2022c.
- Gail Weiss, Yoav Goldberg, and Eran Yahav. Thinking like transformers. In *International Conference on Machine Learning*, pages 11080–11090. PMLR, 2021.
- Virginia Vassilevska Williams. On some fine-grained questions in algorithms and complexity. In *Proceedings of the international congress of mathematicians: Rio de janeiro 2018*, pages 3447–3487. World Scientific, 2018.
- Zhuoyan Xu, Zhenmei Shi, and Yingyu Liang. Do large language models have compositional ability? an investigation into limitations and scalability. In ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models, 2024.
- Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J Reddi, and Sanjiv Kumar. Are transformers universal approximators of sequence-to-sequence functions? arXiv preprint arXiv:1912.10077, 2019.
- Haoyu Zhao, Abhishek Panigrahi, Rong Ge, and Sanjeev Arora. Do transformers parse while predicting the masked word? arXiv preprint arXiv:2303.08117, 2023.

# Checklist

- 1. For all models and algorithms presented, check if you include:
  - (a) A clear description of the mathematical setting, assumptions, algorithm, and/or model. [Yes]
  - (b) An analysis of the properties and complexity (time, space, sample size) of any algorithm. [Yes]
  - (c) (Optional) Anonymized source code, with specification of all dependencies, including external libraries. [Not Applicable]
- 2. For any theoretical claim, check if you include:
  - (a) Statements of the full set of assumptions of all theoretical results. [Yes]
  - (b) Complete proofs of all theoretical results. [Yes]
  - (c) Clear explanations of any assumptions. [Yes]
- 3. For all figures and tables that present empirical results, check if you include:
  - (a) The code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL). [Not Applicable]
  - (b) All the training details (e.g., data splits, hyperparameters, how they were chosen). [Not Applicable]
  - (c) A clear definition of the specific measure or statistics and error bars (e.g., with respect to the random seed after running experiments multiple times). [Not Applicable]
  - (d) A description of the computing infrastructure used. (e.g., type of GPUs, internal cluster, or cloud provider). [Not Applicable]
- 4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets, check if you include:
  - (a) Citations of the creator If your work uses existing assets. [Not Applicable]
  - (b) The license information of the assets, if applicable. [Not Applicable]
  - (c) New assets either in the supplemental material or as a URL, if applicable. [Not Applicable]
  - (d) Information about consent from data providers/curators. [Not Applicable]
  - (e) Discussion of sensible content if applicable, e.g., personally identifiable information or offensive content. [Not Applicable]

- 5. If you used crowdsourcing or conducted research with human subjects, check if you include:
  - (a) The full text of instructions given to participants and screenshots. [Not Applicable]
  - (b) Descriptions of potential participant risks, with links to Institutional Review Board (IRB) approvals if applicable. [Not Applicable]
  - (c) The estimated hourly wage paid to participants and the total amount spent on participant compensation. [Not Applicable]

# Looped ReLU MLPs May Be All You Need as Programmable Computers: Supplementary Materials

# Contents

| 1            | INTRODUCTION                                                      | 1  |
|--------------|-------------------------------------------------------------------|----|
| <b>2</b>     | RELATED WORK                                                      | 3  |
| 3            | PRELIMINARY                                                       | 4  |
|              | 3.1 Notations                                                     | 4  |
|              | 3.2 Key Concepts                                                  | 4  |
| 4            | MAIN RESULT                                                       | 5  |
| 5            | IMPLEMENT KEY FUNCTIONS                                           | 6  |
|              | 5.1 Read                                                          | 6  |
|              | 5.2 Write                                                         | 6  |
|              | 5.3 Addition                                                      | 6  |
|              | 5.4 Subtraction                                                   | 6  |
|              | 5.5 Conditional Branching                                         | 7  |
|              | 5.6 SUBLEQ Instruction                                            | 7  |
| 6            | DISCUSSION AND EXTENSION                                          | 8  |
|              | 6.1 Low-rank Decomposition for Efficiency                         | 8  |
|              | 6.2 Comparison with Attention                                     | 9  |
|              | 6.3 Exploring the Potential of ReLU-MLP                           | 9  |
|              | 6.4 Towards More Efficient Architecture Design for Specific Tasks | 9  |
| 7            | CONCLUSION                                                        | 9  |
| A            | READ AND WRITE                                                    | 16 |
| В            | ADDITION AND SUBTRACTION                                          | 21 |
| $\mathbf{C}$ | CONDITIONAL BRANCHING                                             | 24 |
| D            | SUBLEQ                                                            | 26 |

**Roadmap.** We organize our appendix as follows: In Section A, we introduce our construction of ReLU-MLP for emulating the read and write operation. In Section B, we present the way we achieve the addition and subtraction operation via ReLU-MLP . In Section C, we show that a 4-layer ReLU-MLP is capable of emulating the conditional branching operation. In Section D, we illustrate the "SUBLEQ" can be emulated by a looped 23-layer ReLU-MLP. In Section E, we demonstrate how the looped 23-layer ReLU-MLP introduced in the previous section is capable of functioning as a programmable computer.

## A READ AND WRITE

In this section, we mainly focus on the construction of a multi-layer ReLU-MLP for supporting read and write operations. We begin with the scenario where we read one-bit data from the memory to the data register.

**Lemma A.1** (Read One-Bit Data, Formal Version of Lemma 5.1). If the following conditions hold:

- Let ReLU-MLP be defined as Definition 1.2.
- Let n denote the number of data in the memory. For  $i \in [n]$ , each data  $v_i \in \{\pm 1\}$ .
- Let the address vector  $a_i \in \{\pm 1\}^{\log(n)}$ .

Then, we can show that a 2-layer ReLU-MLP can read any one-bit data from the memory to the register.

*Proof.* Consider a simplified case where we have one data register  $r_d \in \{\pm 1\}$  which stores data  $v_0 \in \{\pm 1\}$  initially, and one address register  $r_a \in \{\pm 1\}^{\log(n)}$  which stores address  $a_i \in \{\pm 1\}^{\log(n)}$  initially. The address  $a_i$  points to the location where the data should be copied from. Namely, we want to perform the following operation:

$$x := \begin{bmatrix} \begin{matrix} r_d : v_0 \\ \hline r_a : a_i \\ \hline v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix} \rightarrow \begin{bmatrix} \begin{matrix} r_d : v_i \\ \hline r_a : a_i \\ \hline v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix}$$

For simplicity, we ignore the notations  $r_d$ : and  $r_a$ : in the following proof. We use  $\mathbf{0}_d \in \mathbb{R}^d$  to denote a vector with entries are 0.

# Step 1: Erase $v_0$ .

We construct the following weight matrix  $W_1 \in \mathbb{R}^{(n+\log(n)+1)\times(n+\log(n)+1)}$ :

$$W_1 := egin{bmatrix} 0 & \mathbf{0}_{\log(n)}^{ op} & 0 & \cdots & 0 \ \mathbf{0}_{\log(n)} & I_{\log(n) imes \log(n)} & \mathbf{0}_{\log(n)} & \cdots & \mathbf{0}_{\log(n)} \ 0 & \mathbf{0}_{\log(n)}^{ op} & 1 & \cdots & 0 \ dots & dots & dots & dots & \ddots & dots \ 0 & \mathbf{0}_{\log(n)}^{ op} & 0 & \cdots & 1 \ \end{bmatrix}$$

Then we erase  $v_0$  by  $W_1$ .

$$x_1 := W_1 x = \begin{bmatrix} 0 \\ \underline{a_i} \\ v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix}$$

## Step 2: Extract $v_i$ .

We need to construct the location vector first. We construct the following weight matrix  $W_2 \in \mathbb{R}^{n \times \log(n)}$ :

$$W_2 := [a_1, \cdots, a_n]^\top$$

Then, we perform the following operations to get the location vector:

$$\mathsf{ReLU}(\underbrace{W_2}_{n \times \log(n)} \underbrace{a_i}_{\log(n) \times 1} - \underbrace{(\log(n) - 1)}_{1 \times 1} \cdot \underbrace{\mathbf{1}_n}_{n \times 1}) = \underbrace{e_i}_{n \times 1}$$

where the first step follows from we have  $\forall i, a_i^{\top} a_i = \log(n)$  and  $\forall i \neq j, a_i^{\top} a_j < \log(n)$  (Remark 3.5). Here, the  $e_i$  denotes the vector whose *i*-th entry is 1, and other entries are 0.

Then, we extract  $v_i$  by the following operation:

$$\underbrace{e_i^{\top}}_{1 \times n} \underbrace{\begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix}}_{n \times 1} = \underbrace{v_i}_{1 \times 1}$$

## Step 3: Put $v_i$ to register.

Then, we construct  $x_{v_i}$  by concatenating  $v_i$  with some zero vectors:

$$x_{v_i} := egin{bmatrix} v_i \ \mathbf{0}_{\log(n)} \ 0 \ 0 \ dots \ 0 \end{bmatrix}$$

Finally, we add  $x_{v_i}$  and  $x_1$  together to get our final result.

$$y := x_1 + x_{v_i} = \begin{bmatrix} v_i \\ a_i \\ v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix}$$

To sum up, we have

- Step 1 uses one-layer ReLU-MLP.
- Step 2 uses one-layer ReLU-MLP.
- Step 3 uses vector addition, so doesn't use ReLU-MLP.

Therefore, we use a two-layer ReLU-MLP to emulate the "read" operation.

Then, we move on to considering the read operation for d-bits data.

**Lemma A.2** (Read d-Bits Data, Formal Version of Lemma 5.2). If the following conditions hold:

• Let ReLU-MLP be defined as Definition 1.2.

#### Looped ReLU MLPs May Be All You Need as Practical Programmable Computers

- Let n denote the number of data in the memory. For  $i \in [n]$ , each data  $v_i \in \{\pm 1\}$ .
- Let  $v \in \{\pm 1\}^d$  denote a d dimension vector.
- Let the address matrix  $a_i \in \{\pm 1\}^{\log(n) \times d}$ .

Then, we can show that a 2-layer ReLU-MLP, looped for d times, can read any d-bits data vector from the memory to the register.

*Proof.* By Lemma A.1, we can read one bit from memory to register via a two-layer ReLU-MLP.

Considering the case where we have an input matrix X with size  $(1 + \log(n) + n) \times d$ , which can be written as follows, where the superscript denotes the dimension:

$$X := \begin{bmatrix} r_d^1 & r_d^2 & \cdots & r_d^d \\ r_a^1 & r_a^2 & \cdots & r_a^d \\ \hline v_1^1 & v_1^2 & \cdots & v_1^d \\ v_2^1 & v_2^2 & \cdots & v_2^d \\ \vdots & \vdots & \ddots & \vdots \\ v_n^1 & v_n^2 & \cdots & v_n^d \end{bmatrix}$$

We apply the two-layer ReLU-MLP to each column of X. Since X has d columns, we loop the two layers ReLU-MLP for d times. Then, we can perform the "read" operation for any d-dimension vector from the memory to the register.

Writing can be considered as the inverse process of reading. Summarily, we first consider writing one-bit data from the data register to the memory.

**Lemma A.3** (Write One-Bit Data, Formal Version of Lemma 5.3). If the following conditions hold:

- Let ReLU-MLP be defined as Definition 1.2.
- Let n denote the number of data in the memory. For  $i \in [n]$ , each data  $v_i \in \{\pm 1\}$ .
- Let the address vector  $a_i \in \{\pm 1\}^{\log(n)}$ .

Then, we can show that a 2-layer ReLU-MLP can write any one-bit data from the register to the memory.

*Proof.* Consider a simplified case where we have one data register  $r_d \in \{\pm 1\}$  which stores data  $v_0 \in \{\pm 1\}$  initially, and one address register  $r_a \in \{\pm 1\}^{\log(n)}$  which stores address  $a_i \in \{\pm 1\}^{\log(n)}$  initially. The address  $a_i$  points to the location where the data should be copied from. Namely, we want to perform the following operation:

$$x := \begin{bmatrix} r_d : v_0 \\ r_a : a_i \\ v_1 \\ v_2 \\ \vdots \\ v_{i-1} \\ v_i \\ v_{i+1} \\ \vdots \\ v_n \end{bmatrix} \rightarrow \begin{bmatrix} r_d : v_0 \\ r_a : a_i \\ v_1 \\ v_2 \\ \vdots \\ v_{i-1} \\ v_0 \\ v_{i+1} \\ \vdots \\ v_n \end{bmatrix}$$

For simplicity, we ignore the notations  $r_d$ : and  $r_a$ : in the following proof.

Let  $N := n + \log(n) + 1$ . Then we have  $x \in \mathbb{R}^N$ .

### Step 1: Erase $v_i$ .

We need to construct the location vector first. We construct the following weight matrix  $W_1 \in \mathbb{R}^{n \times \log(n)}$ :

$$W_1 := [a_1, \cdots, a_n]^{\top}$$

Then, we perform the following operations:

$$\mathsf{ReLU}(\underbrace{W_1}_{n \times \log(n)} \cdot \underbrace{a_i}_{\log(n) \times 1} - \underbrace{(\log(n) - 1)}_{1 \times 1} \cdot \underbrace{\mathbf{1}_n}_{n \times 1}) = \underbrace{e_i}_{n \times 1}$$

where the first step follows from we have  $\forall i, a_i^{\top} a_i = \log(n)$  and  $\forall i \neq j, a_i^{\top} a_j < \log(n)$  (Remark 3.5). Here, the  $e_i$  denotes the vector whose *i*-th entry is 1, and other entries are 0.

Then, we erase the  $v_i$  in x by first performing the following operation:

$$\underbrace{(1-e_i)}_{n\times 1} \circ \underbrace{\begin{bmatrix} v_1\\v_2\\\vdots\\v_{i-1}\\v_i\\v_{i+1}\\\vdots\\v_n\end{bmatrix}}_{n\times 1} = \underbrace{\begin{bmatrix} v_1\\v_2\\\vdots\\v_{i-1}\\0\\v_{i+1}\\\vdots\\v_n\end{bmatrix}}_{v_i\times 1}$$

Then, we define  $x_1$  as follows:

$$x_{1} := \begin{bmatrix} v_{0} \\ -\frac{a_{i}}{v_{1}} \\ v_{2} \\ \vdots \\ v_{i-1} \\ 0 \\ v_{i+1} \\ \vdots \\ v_{n} \end{bmatrix}$$

# Step 2: Extract $v_0$ .

We construct the following weight matrix  $W_2 \in \mathbb{R}^{1 \times N}$ :

$$W_2 := [1, \mathbf{0}_{\log(n)}^\top, \mathbf{0}_n^\top]$$

Then, we have

$$\underbrace{W_2}_{1\times N}\cdot\underbrace{x_1}_{N\times 1}=\underbrace{v_0}_{1\times 1}$$

#### Step 3: Put $v_0$ to memory.

We use the  $e_i$  acquired from **Step 1** to perform the following operation:

$$\underbrace{v_0}_{1\times 1} \cdot \underbrace{e_i}_{n\times 1} + \underbrace{\begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_{i-1} \\ 0 \\ v_{i+1} \\ \vdots \\ v_n \end{bmatrix}}_{n\times 1} = \underbrace{\begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_{i-1} \\ v_0 \\ v_{i+1} \\ \vdots \\ v_n \end{bmatrix}}_{n\times 1}$$

Then, by concatenation, we have our final result:

$$y := \begin{bmatrix} r_d : v_0 \\ r_a : a_i \\ \hline v_1 \\ v_2 \\ \vdots \\ v_{i-1} \\ v_0 \\ v_{i+1} \\ \vdots \\ v_n \end{bmatrix}$$

To sum up, we have

- Step 1 uses one-layer ReLU-MLP.
- Step 2 uses one-layer ReLU-MLP.
- Step 3 use concatenation, so doesn't use ReLU-MLP.

Therefore, we use a two-layer ReLU-MLP to emulate the "write" operation.

Then, we show how we extend writing one-bit data to writing d-bits data as follows:

**Lemma A.4** (Write d-Bits Data, Formal Version of Lemma 5.4). If the following conditions hold:

- Let ReLU-MLP be defined as Definition 1.2.
- Let n denote the number of data in the memory. For  $i \in [n]$ , each data  $v_i \in \{\pm 1\}$ .
- Let  $v \in \{\pm 1\}^d$  denote a d dimension vector.
- Let the address matrix  $a_i \in \{\pm 1\}^{\log(n) \times d}$ .

Then, we can show that, a 2-layer ReLU-MLP, looped for d times, can write any d-bits data vector from the register to the memory.

*Proof.* By Lemma A.3, we can write one bit from the register to the memory via a two-layer ReLU-MLP.

Considering the case where we have an input matrix X with size  $(1 + \log(n) + n) \times d$ , which can be written as follows, where the superscript denotes the dimension:

$$X := \begin{bmatrix} r_d^1 & r_d^2 & \cdots & r_d^d \\ r_a^1 & r_a^2 & \cdots & r_a^d \\ \hline v_1^1 & v_1^2 & \cdots & v_1^d \\ v_2^1 & v_2^2 & \cdots & v_2^d \\ \vdots & \vdots & \ddots & \vdots \\ v_n^1 & v_n^2 & \cdots & v_n^d \end{bmatrix}$$

We apply the two layers ReLU-MLP to each column of X. Since X has d columns, we loop the two-layer ReLU-MLP for d times. Then, we can perform the "write" operation for any d-dimension vector from the register to the memory.

# B ADDITION AND SUBTRACTION

In this section, we focus on implementing ReLU-MLP to support basic algorithmic operations, i.e., addition and subtraction.

We begin with introducing the construction for one-bit addition. It is worth mentioning that we also consider the carry bit in the addition.

**Lemma B.1** (One-Bit Addition, Formal Version of Lemma 5.5). If the following conditions hold:

• Let ReLU-MLP be defined as Definition 1.2.

Then, we can show that a 6-layer ReLU-MLP can emulate the "addition" operation for any one-bit data.

*Proof.* Consider a simplified case, where we have the following components in the scratchpad: two data register  $r_{d_1}, r_{d_2} \in \{\pm 1\}$ , and one data register  $r_c$  which stores the carry of the addition operation. We want to perform an addition operation and store the result in the first register. Namely, we want the following operation:

$$x := \left[egin{array}{c} r_c \\ r_{d_1} \\ r_{d_2} \end{array}
ight] 
ightarrow \left[egin{array}{c} r_c \\ r_{d_1} + r_{d_2} \\ r_{d_2} \end{array}
ight]$$

Here, we want the addition result  $r_{d_1} + r_{d_2} \in \{\pm 1\}$  and  $r_c = 1$  if and only if  $r_{d_1} = r_{d_2} = 1$ .

Step 1:  $\{\pm 1\}$  representation to  $\{0,1\}$  representation and reset  $r_c$ .

We apply one layer ReLU-MLP with the weight matrix  $W_1$  be defined as follows:

$$W_1 = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

Our goal is to reset the carry register  $r_c$  and change the representation of  $r_{d_1}, r_{d_2}$  from  $\{\pm 1\}$  to  $\{0, 1\}$ . Namely, we perform the following operation:

$$x_1 := \mathsf{ReLU}(W_1 \cdot x) = \begin{bmatrix} 0 \\ \mathsf{ReLU}(r_{d_1}) \\ \mathsf{ReLU}(r_{d_2}) \end{bmatrix}$$

Since  $r_{d_1}, r_{d_2} \in \{\pm 1\}$ ,  $ReLU(r_{d_1}), ReLU(r_{d_2}) \in \{0, 1\}$ .

Step 2: Construct two flag vectors for ReLU( $r_{d_1}$ ).

Let 
$$x_{11} := ReLU(r_{d_1}), x_{12} := ReLU(r_{d_2}).$$

We construct the weight matrix  $W_2$  of one ReLU-MLP as follows:

$$W_2 = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \quad b_2 = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}$$

Then, we have

$$x_2 := W_2 \cdot x_1 + b_2 = \begin{bmatrix} 1 \\ x_{11} \\ 0 \end{bmatrix}$$

We construct the weight matrix  $W_3$  and bias vector  $b_3$  of one ReLU-MLP as follows:

$$W_3 = \begin{bmatrix} 0 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \quad b_3 = \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix}$$

Then, we have

$$x_3 = W_3 \cdot x_1 + b_3 = \begin{bmatrix} 1 \\ 1 - x_{11} \\ 0 \end{bmatrix}$$

# Step 3: Construct two flag vectors for $ReLU(r_{d_2})$ .

We construct the weight matrix  $W_4$  of one ReLU-MLP as follows:

$$W_4 = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix}, \quad b_4 = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}$$

Then, we have

$$x_4 := W_4 \cdot x_1 + b_4 = \begin{bmatrix} 1 \\ x_{12} \\ 0 \end{bmatrix}$$

We construct the weight matrix  $W_5$  and bias vector  $b_5$  of one ReLU-MLP as follows:

$$W_5 = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & -1 \\ 0 & 0 & 0 \end{bmatrix}, \quad b_3 = \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix}$$

Then, we have

$$x_5 = W_5 \cdot x_1 + b_3 = \begin{bmatrix} 1 \\ 1 - x_{12} \\ 0 \end{bmatrix}$$

# Step 4: Construct the addition vector.

Recall in the previous two steps, we defined  $x_{11} := \text{ReLU}(r_{d_1}), x_{12} := \text{ReLU}(r_{d_2})$ , and we have the following four vectors:

$$x_2 = \begin{bmatrix} 1 \\ x_{11} \\ 0 \end{bmatrix}, \quad x_3 = \begin{bmatrix} 1 \\ 1 - x_{11} \\ 0 \end{bmatrix}, \quad x_4 = \begin{bmatrix} 1 \\ x_{12} \\ 0 \end{bmatrix}, \quad x_5 = \begin{bmatrix} 1 \\ 1 - x_{12} \\ 0 \end{bmatrix},$$

Then, we construct  $x_6$  with the following operation:

$$x_6 := x_2 \circ x_4 \circ \begin{bmatrix} 1 \\ -1 \\ 0 \end{bmatrix} + x_2 \circ x_5 \circ \begin{bmatrix} -1 \\ 1 \\ 0 \end{bmatrix} + x_3 \circ x_4 \circ \begin{bmatrix} -1 \\ 1 \\ 0 \end{bmatrix} + x_3 \circ x_5 \circ \begin{bmatrix} -1 \\ -1 \\ 0 \end{bmatrix}$$

Our construction is reasonable because only when both  $r_{d_1}$  and  $r_{d_2}$  are 1, the carry register  $r_c$  will be set to 1, otherwise it should be set to -1.

# Step 5: Erase $r_{d_1}$ and add the addition vector $x_6$ .

We construct the weight matrix  $W_6$  as follows:

$$W_6 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

Then, we perform the following operation:

$$y := W_6 \cdot x + x_6 = \begin{bmatrix} r_c \\ r_{d_1} + r_{d_2} \\ r_{d_2} \end{bmatrix}$$

Therefore, we get our addition result.

To sum up, we have

- Step 1 uses one ReLU-MLP.
- Step 2 uses two ReLU-MLP.
- Step 3 uses two ReLU-MLP.
- Step 4 doesn't use ReLU-MLP.
- Step 5 uses one ReLU-MLP.

Therefore, we emulate the "addition" operation with a six-layer ReLU-MLP.

Since we already have a ReLU-MLP for one-bit addition, then we extend it to support d-bits addition as follows: Lemma B.2 (d-Bits Addition, Formal Version of Lemma 5.6). If the following conditions hold:

• Let ReLU-MLP be defined as Definition 1.2.

Then, we can show that a 6-layer ReLU-MLP, looped for 2d times, can emulate the "addition" operation for any d-dimension vectors.

*Proof.* By Lemma B.1, we have a six-layer ReLU-MLP that can perform a one-bit addition operation and store the carry result in the carry register.

For each dimension in d, we first perform the addition operation between the carry register from the previous dimension with one data register. Then, we perform the addition operation.

For each dimension, we need 2 loops. Therefore, we can emulate the d-dimension vector addition via 2d loops of that six-layer ReLU-MLP.

In this work, we use 2's-complement as the representation of data. Therefore, the subtraction can be viewed as first negating the subtrahend, then adding 1, and adding to the minuend. We follow the aforementioned high-level idea to implement the subtraction as follows:

**Lemma B.3** (d-Bits Subtraction, Formal Version of Lemma 5.7). If the following conditions hold:

• Let ReLU-MLP be defined as Definition 1.2.

Then, we can show that a 7-layer ReLU-MLP can emulate the "subtraction" operation for any d-dimension vectors.

*Proof.* By Lemma B.2, we have we can perform addition operation for d-dimension vectors via a six-layer ReLU-MLP with 2d loops.

In our setting, we use 2's complement (Definition 3.3) to represent our data. Therefore, to perform subtraction  $x_1 - x_2$ , we can first calculate  $-x_2$ , then add  $x_1$  and  $-x_2$  to get the final result.

#### Step 1: Calculate $-x_2$ .

According to 2's complement, to calculate  $-x_2$ , we first need to invert all entries of  $x_2$   $(1 \to -1; -1 \to 1)$ . This step requires a one-layer ReLU-MLP with weight matrix  $W = -I_{d \times d}$ .

The second step is to add 1 to the inverted  $x_2$ , which can be achieved by the six-layer ReLU-MLP used for the addition operation.

# Step 2: Add $x_1$ and $-x_2$ .

This step keeps uses the same six-layer ReLU-MLP with Step 1. Since that six-layer ReLU-MLP can perform addition operation, we can use that six-layer ReLU-MLP to perform  $x_1 + (-x_2)$ , which is our desired result.

To sum up, we need an additional layer to invert  $x_2$  as discussed in **Step 1**, and the six-layer ReLU-MLP is used for the addition operation. Therefore, the subtraction operation requires a seven-layer ReLU-MLP.

#### C CONDITIONAL BRANCHING

In this section, we implement the "conditional branching" instruction through ReLU-MLP. Conditional branching is a critical instruction in computer programs since it enables the computer programs to achieve controllability. **Lemma C.1** (Conditional Branching, Formal Version of Lemma 5.8). If the following conditions hold:

- Let ReLU-MLP be defined as Definition 1.2.
- Let n denote the number of data in the memory. For  $i \in [n]$ , each data  $v_i \in \{\pm 1\}$ .
- Let the address vector  $a_i \in \{\pm 1\}^{\log(n)}$ .

Then, we can show that a 4-layer ReLU-MLP can emulate the "conditional branching" operation.

Proof. Consider a simplified case where we have the following components in the scratchpad: a flag register stores flag{ $\pm 1$ }. If we jump (flag = 1), the program counter register  $r_{pc} \in \{\pm 1\}^{\log(n)}$ , points to a target address register  $r_{ab} \in \{\pm 1\}^{\log(n)}$ . If we stay (flag = -1), the program counter register  $r_{pc}$  will be incremented by one, which means it will point to an address register  $r_{pc+1} \in \{\pm 1\}^{\log(n)}$ , where the  $r_{pc+1}$  can be calculated by Lemma B.2 with adding  $r_{pc}$  with 1.

Namely, if flag = 1, we want the following operation:

$$x = \begin{bmatrix} \text{flag} \\ r_{pc} \\ r_{pc+1} \\ r_{a_b} \end{bmatrix} \to \begin{bmatrix} \text{flag} \\ r_{a_b} \\ r_{pc+1} \\ r_{a_b} \end{bmatrix}$$

If flag = -1, we want the following operation:

$$x = \begin{bmatrix} \text{flag} \\ r_{pc} \\ r_{pc+1} \\ r_{ab} \end{bmatrix} \to \begin{bmatrix} \text{flag} \\ r_{pc+1} \\ r_{pc+1} \\ r_{ab} \end{bmatrix}$$

# Step 1: Extract $r_{a_b}$ .

We construct the following weight matrix  $W_1 \in \mathbb{R}^{(3\log(n)+1)\times(3\log(n)+1)}$ :

$$W_1 = \begin{bmatrix} 0 & \mathbf{0}_{\log(n)}^\top & \mathbf{0}_{\log(n)}^\top & \mathbf{0}_{\log(n)}^\top \\ \mathbf{0}_{\log(n)} & \mathbf{0}_{\log(n) \times \log(n)} & I_{\log(n) \times \log(n)} & \mathbf{0}_{\log(n) \times \log(n)} \\ \mathbf{0}_{\log(n)} & \mathbf{0}_{\log(n) \times \log(n)} & \mathbf{0}_{\log(n) \times \log(n)} & \mathbf{0}_{\log(n) \times \log(n)} \\ \mathbf{0}_{\log(n)} & \mathbf{0}_{\log(n) \times \log(n)} & \mathbf{0}_{\log(n) \times \log(n)} & \mathbf{0}_{\log(n) \times \log(n)} \end{bmatrix}$$

Then we have

$$x_1 := W_1 \cdot x = \begin{bmatrix} 0 \\ r_{a_b} \\ \mathbf{0}_{\log(n)} \\ \mathbf{0}_{\log(n)} \end{bmatrix}$$

## Step 2: Extract $r_{pc+1}$ .

We construct the following weight matrix  $W_2 \in \mathbb{R}^{(3\log(n)+1)\times(3\log(n)+1)}$ :

$$W_2 = \begin{bmatrix} \mathbf{0} & \mathbf{0}_{\log(n)}^\top & \mathbf{0}_{\log(n)}^\top & \mathbf{0}_{\log(n)}^\top \\ \mathbf{0}_{\log(n)} & \mathbf{0}_{\log(n) \times \log(n)} & \mathbf{0}_{\log(n) \times \log(n)} & I_{\log(n) \times \log(n)} \\ \mathbf{0}_{\log(n)} & \mathbf{0}_{\log(n) \times \log(n)} & \mathbf{0}_{\log(n) \times \log(n)} & \mathbf{0}_{\log(n) \times \log(n)} \\ \mathbf{0}_{\log(n)} & \mathbf{0}_{\log(n) \times \log(n)} & \mathbf{0}_{\log(n) \times \log(n)} & \mathbf{0}_{\log(n) \times \log(n)} \end{bmatrix}$$

Then we have

$$x_2 := W_2 \cdot x = \begin{bmatrix} 0 \\ r_{pc+1} \\ \mathbf{0}_{\log(n)} \\ \mathbf{0}_{\log(n)} \end{bmatrix}$$

## Step 3: Simulate the flag.

We use ReLU-MLP to extract flag  $\in \{\pm 1\}$  and perform the following operation: We construct the weight matrix  $W_3 \in \mathbb{R}^{(3\log(n)+1)\times(3\log(n)+1)}$ :

$$W_3 = \begin{bmatrix} \mathbf{0} & \mathbf{0}_{\log(n)}^\top & \mathbf{0}_{\log(n)}^\top & \mathbf{0}_{\log(n)}^\top \\ \mathbf{1}_{\log(n)} & \mathbf{0}_{\log(n) \times \log(n)} & \mathbf{0}_{\log(n) \times \log(n)} & \mathbf{0}_{\log(n) \times \log(n)} \\ \mathbf{0}_{\log(n)} & \mathbf{0}_{\log(n) \times \log(n)} & \mathbf{0}_{\log(n) \times \log(n)} & \mathbf{0}_{\log(n) \times \log(n)} \\ \mathbf{0}_{\log(n)} & \mathbf{0}_{\log(n) \times \log(n)} & \mathbf{0}_{\log(n) \times \log(n)} & \mathbf{0}_{\log(n) \times \log(n)} \end{bmatrix}$$

We apply  $W_3$  to x, we have

$$W_3 \cdot x = \begin{bmatrix} 0 \\ \text{flag} \cdot \mathbf{1}_{\log(n)} \\ \mathbf{0}_{\log(n)} \\ \mathbf{0}_{\log(n)} \end{bmatrix}$$

Then, we have

$$x_3 := \mathsf{ReLU}(W_3 x) \circ x_1 + \mathsf{ReLU}(-1 \cdot (W_3 x)) \circ x_2$$

Step 4: Erase  $r_{pc}$  and repoint. We construct the weight matrix  $W_4 \in \mathbb{R}^{(3\log(n)+1)\times(3\log(n)+1)}$ :

$$W_4 = \begin{bmatrix} \mathbf{1} & \mathbf{0}_{\log(n)}^\top & \mathbf{0}_{\log(n)}^\top & \mathbf{0}_{\log(n)}^\top \\ \mathbf{0}_{\log(n)} & \mathbf{0}_{\log(n) \times \log(n)} & \mathbf{0}_{\log(n) \times \log(n)} & \mathbf{0}_{\log(n) \times \log(n)} \\ \mathbf{0}_{\log(n)} & \mathbf{0}_{\log(n) \times \log(n)} & I_{\log(n) \times \log(n)} & \mathbf{0}_{\log(n) \times \log(n)} \\ \mathbf{0}_{\log(n)} & \mathbf{0}_{\log(n) \times \log(n)} & \mathbf{0}_{\log(n) \times \log(n)} & I_{\log(n) \times \log(n)} \end{bmatrix}$$

Then we have

$$y := W_4 x + x_3$$

Then, y is the final output we want.

To sum up, we have

- Step 1 uses one ReLU-MLP with width  $3\log(n) + 1$ .
- Step 2 uses one ReLU-MLP with width  $3\log(n) + 1$ .
- Step 3 uses one ReLU-MLP with width  $3\log(n) + 1$ .
- Step 4 uses one ReLU-MLP with width  $3\log(n) + 1$ .

Therefore, we use ReLU-MLP with four layers and width  $O(\log n)$  to emulate the "conditional branching" operation.

# D SUBLEQ

Based on the fundamental operations introduced in the previous sections, we are ready to construct the "SUBLEQ" instruction.

Lemma D.1 (ReLU-MLP Emulate SUBLEQ, Formal Version of Lemma 5.9). If the following conditions hold:

- Let ReLU-MLP be defined as Definition 1.2.
- Let n denote the size of the state vector.
- Let m denote the number of instructions.
- Let k denote the number of one-bit data stored in the memory. For  $i \in [k]$ , each data is  $v_i \in \{\pm 1\}$  and the memory size k satisfies  $k = n 2 4\log(n) 3m\log(n)$ .
- Let the address vector  $a_i \in \{\pm 1\}^{\log(n)}$ .
- Let the instruction  $c_i \in \{\pm 1\}^{3 \log(n)}$  be defined as Definition 3.6.
- Suppose we have three data registers  $r_c, r_{d_1}, r_{d_2} \in \{\pm 1\}$ , one carry bit  $r_c \in \{\pm 1\}$ , three address registers  $r_{a_1}, r_{a_2}, r_{a_3} \in \{\pm 1\}^{\log(n)}$ , and one program counter  $r_{pc} \in \{\pm 1\}^{\log(n)}$  in the scratchpad.

Then, we can show that, a 23-layer ReLU-MLP with width n can emulate the "SUBLEQ" operation (Algorithm 1).

*Proof.* Consider we organize our state vector as follows:



Here,  $r_c \in \{\pm 1\}$  denote the carry bit,  $r_{d_1}, r_{d_1} \in \{\pm 1\}$  denote two data registers;  $r_{a_1}, r_{a_2}, r_{a_3} \in \{\pm 1\}^{\log(n)}$  denote three address registers;  $r_{pc} \in \{\pm 1\}^{\log(n)}$  denotes the program counter;  $v_1, v_2, \cdots, v_k \in \{\pm 1\}$  denote k one-bit data stored in the memory;  $c_1, c_2, \cdots, c_m \in \{\pm 1\}^{3\log(n)}$  denote m instructions, where  $c_m = c_{\text{EOF}}$ , denoting the End Of File (EOF) instruction, which means the program should terminate here. Since the memory size k satisfies  $k = n - 2 - 4\log(n) - 3m\log(n)$ , the total length of the state vector is n.

## Step 1: Read the three addresses.

In this step, we read the instruction from the memory according to the address provided in  $r_{pc}$ . As shown in Algorithm 1, the instruction contains three addresses. Here we denote them as  $a, b, c \in \{\pm 1\}^{\log(n)}$ .

By Lemma A.2, we read the three address vectors from the memory to the three registers in the scratchpad, using two-layer ReLU-MLP. Then, the state x transforms as follows:

$$x = \begin{bmatrix} r_{c} \\ r_{d_{1}} \\ r_{d_{2}} \\ r_{a_{1}} \\ r_{a_{2}} \\ r_{a_{3}} \\ r_{pc} \\ v_{1} \\ v_{2} \\ \vdots \\ v_{k} \\ \hline c_{1} \\ c_{2} \\ \vdots \\ c_{m-1} \\ c_{\text{EOF}} \end{bmatrix} \begin{bmatrix} r_{c} \\ r_{d_{1}} \\ r_{d_{2}} \\ \vdots \\ v_{d_{2}} \\ \vdots \\ v_{k} \\ \hline c_{1} \\ c_{2} \\ \vdots \\ c_{m-1} \\ c_{\text{EOF}} \end{bmatrix}$$

Overall, this step requires a two-layer ReLU-MLP.

## Step 2: Read the data required by the instruction.

In this step, we read the two data required by the instruction. Namely mem[a] and mem[b]. By Lemma A.2, we achieve this operation by a two-layer ReLU-MLP. Then, the state x transforms as follows:

$$x = \begin{bmatrix} r_c \\ r_{d_1} \\ r_{d_2} \\ a \\ b \\ c \\ \hline v_1 \\ v_2 \\ \vdots \\ v_k \\ \hline c_1 \\ c_2 \\ \vdots \\ c_{m-1} \\ c_{\text{EOF}} \end{bmatrix} \rightarrow \begin{bmatrix} r_c \\ \text{mem}[a] \\ \text{mem}[b] \\ a \\ b \\ c \\ \hline r_{pc} \\ \hline v_1 \\ v_2 \\ \vdots \\ v_k \\ \hline c_1 \\ c_2 \\ \vdots \\ c_{m-1} \\ c_{\text{EOF}} \end{bmatrix}$$

Overall, this step requires a two-layer ReLU-MLP.

# Step 3: Perform subtraction.

In this step, we perform the subtraction operation, mem[b] - mem[a]. By Lemma B.3, we achieve this operation by a seven-layer ReLU-MLP. Then, the state x transforms as follows:

$$x = \begin{bmatrix} r_c \\ \text{mem}[a] \\ a \\ b \\ c \\ \hline r_{pc} \\ \hline v_1 \\ v_2 \\ \vdots \\ v_k \\ \hline c_1 \\ c_2 \\ \vdots \\ c_{m-1} \\ c_{\text{EOF}} \end{bmatrix} \rightarrow \begin{bmatrix} r_c \\ \text{mem}[b] - \text{mem}[a] \\ \text{mem}[b] \\ a \\ b \\ c \\ \hline r_{pc} \\ \hline v_1 \\ v_2 \\ \vdots \\ v_k \\ \hline c_1 \\ c_2 \\ \vdots \\ c_{m-1} \\ c_{\text{EOF}} \end{bmatrix}$$

This step requires a seven-layer ReLU-MLP.

# Step 4: Write back mem[b] - mem[a].

In this step, we write back the mem[b] - mem[a] to the memory according to the address vector b. By Lemma A.4, we achieve this operation via a two-layer ReLU-MLP.



This step requires a two-layer ReLU-MLP.

# Step 5: Calculate $r_{pc+1}$ .

In this step, we calculate  $r_{pc+1}$  and store it at the place which stores b previously. (Since the address vectors a and b will not be used in the following steps, and we can overwrite them with any other address vectors.) By Lemma B.2, we achieve this operation via a six-layer ReLU-MLP. Then, the state x transforms as follows:

$$x = \begin{bmatrix} r_c \\ \text{mem}[b] - \text{mem}[a] \\ \text{mem}[b] \\ a \\ b \\ c \\ r_{pc} \\ \hline v_1 \\ v_2 \\ \vdots \\ v_k \\ \hline c_1 \\ c_2 \\ \vdots \\ c_{m-1} \\ c_{\text{EOF}} \end{bmatrix} \rightarrow \begin{bmatrix} r_c \\ \text{mem}[b] - \text{mem}[a] \\ \text{mem}[b] \\ a \\ \hline r_{pc+1} \\ c \\ \hline v_1 \\ v_2 \\ \vdots \\ v_k \\ \hline c_1 \\ c_2 \\ \vdots \\ c_{m-1} \\ c_{\text{EOF}} \end{bmatrix}$$

This step requires a six-layer ReLU-MLP.

# Step 6: Conditional branching.

In this step, we perform the conditional branching operation according to mem[b] - mem[a]. For simplicity, let flag := mem[b] - mem[a]. We use  $r_{target} \in \{r_{pc+1}, c\}$  to denote the final address stored in the program counter  $r_{pc}$ . Namely, if flag = 1,  $r_{target} = c$ ; if flag = -1,  $r_{target} = r_{pc+1}$ . By Lemma C.1, we achieve this operation via a four-layer ReLU-MLP. Then, the state x transforms as follows:

$$x = \begin{bmatrix} r_c \\ \text{flag} \\ \text{mem}[b] \\ a \\ r_{pc+1} \\ c \\ \hline v_1 \\ v_2 \\ \vdots \\ v_k \\ \hline c_1 \\ c_2 \\ \vdots \\ c_{m-1} \\ c_{\text{EOF}} \end{bmatrix} \rightarrow \begin{bmatrix} r_c \\ \text{flag} \\ \text{mem}[b] \\ a \\ r_{pc+1} \\ c \\ \hline v_1 \\ v_2 \\ \vdots \\ v_k \\ \hline c_1 \\ c_2 \\ \vdots \\ c_{m-1} \\ c_{\text{EOF}} \end{bmatrix}$$

This operation requires a four-layer ReLU-MLP.

# Step 7: Program termination.

The special instruction  $c_{\text{EOF}}$  is used to signal the end of the program. Similar to other instructions,  $c_{\text{EOF}}$  also consists of three address vectors.

To terminate the program, our idea is to make  $r_{pc}$  always point to  $c_{EOF}$  after executing  $c_{EOF}$ . For the three address vectors in  $c_{EOF}$ , we set the a = b, and  $c = \&(c_{EOF})$ , where  $\&(c_{EOF})$  denotes the address of the  $c_{EOF}$ 

instruction. This design is reasonable because setting a=b brings mem[a]=mem[b]. Therefore, the condition always holds in the "SUBLEQ" instruction. Hence, the program counter  $r_{pc}$  will always jump to c. Since we have set c to the address of  $c_{\rm EOF}$ , the  $r_{pc}$  still points to  $c_{\rm EOF}$  after execute  $c_{\rm EOF}$ .

To sum up, we have

- Step 1: Read the three addresses requires a two-layer ReLU-MLP.
- Step 2: Read the data required by the instruction requires a two-layer ReLU-MLP.
- Step 3: Perform subtraction requires a seven-layer ReLU-MLP.
- Step 4: Write back mem[b] mem[a] requires a two-layer ReLU-MLP.
- Step 5: Calculate  $r_{pc+1}$  requires an six-layer ReLU-MLP.
- Step 6: Conditional branching requires a four-layer ReLU-MLP.

Since we always operate the state vector  $x \in \mathbb{R}^n$ , then the width of the above-mentioned ReLU-MLP is n.

Therefore, to emulate the "SUBLEQ" instruction, we need a twenty-three-layer ReLU-MLP with size  $n \times n$ .

# E LOOPED ReLU-MLP AS PROGRAMMABLE COMPUTER

Finally, in this section, we demonstrate the One Instruction Set Computer (OISC) constructed by the "SUBLEQ" instruction is equivalent to the programmable computer in terms of computational ability, which further indicates that the 23-layer ReLU-MLP discussed in the previous section is capable of functioning as a programmable computer.

**Theorem E.1** (Looped ReLU-MLP as Programmable Computer, Formal Version of Theorem 4.1). *If the following conditions hold:* 

- Let ReLU-MLP be defined as Definition 1.2.
- Let n denote the size of the state vector.
- Let m denote the number of instructions.
- Let k denote the number of one-bit data stored in the memory. For  $i \in [k]$ , each data is  $v_i \in \{\pm 1\}$  and the memory size k satisfies  $k = n 2 4\log(n) 3m\log(n)$ .
- Let the address vector  $a_i \in \{\pm 1\}^{\log(n)}$ .
- Let the instruction  $c_i \in \{\pm 1\}^{3 \log(n)}$  be defined as Definition 3.6.
- Suppose we have two data registers  $r_{d_1}, r_{d_2} \in \{\pm 1\}$ , one carry bit  $r_c \in \{\pm 1\}$ , three address registers  $r_{a_1}, r_{a_2}, r_{a_3} \in \{\pm 1\}^{\log(n)}$ , and one program counter  $r_{pc} \in \{\pm 1\}^{\log(n)}$  in the scratchpad.

Then, we can show that a 23-layer ReLU-MLP with width n can emulate a programmable computer, where d is the number of bits we use to store each integer. Namely, this "computer" supports integers within the range  $[-2^{d-1}, 2^{d-1} - 1]$ .

*Proof.* In Lemma D.1, we have proved that a 23-layer ReLU-MLP is capable of emulating the one-bit version of "SUBLEQ" instruction.

#### Step 1: Extend to d-bits "SUBLEQ".

We demonstrate how the 1-bit SUBLEQ instruction can be extended to a d-bit SUBLEQ instruction. Similar to the proof of Lemma A.2 and A.4, we can extend horizontally from the one-bit version to d-bits version, where we apply the 23-layer ReLU-MLP row by row, with totally d loops.

To be more specific, each operation on the d-bit data can be decomposed into d individual operations on 1-bit data. Our design for the 1-bit operations takes this into account. Specifically, the 1-bit addition operation (Lemma 5.5) also outputs a carry bit, indicating whether the operation results in a carry. By simply stacking the 1-bit operations d times, we can effectively support the entire d-bit operation.

## Step 2: Extend to OISC.

By Mavaddat and Parhami (1988), the One Instruction Set Computer (OISC) with the instruction "SUBLEQ" is Turing complete, which means it can compute arbitrary programs. Therefore, we can conclude that our looped 23-layer ReLU-MLP can actually function as a programmable computer.

Note that Mavaddat and Parhami (1988) shows that anything a programmable computer can do can also be accomplished by stacking "SUBLEQ" instructions. This means that, within our framework, any computation performed by a programmable computer can also be executed by looping our 23-layer ReLU-MLP with enough iterations.