1

# Efficient and Access-Pattern Secure Computations on TEE via Dynamic Program Partitioning

Ju Chen † Cheng Xu † Kai Li † Yuzhe (Richard) Tang † etc.
†Syracuse University, NY, USA, Email: {jchen133,kli111,ytang100}@syr.edu
†Hong Kong Baptist University, Kowloon Tong, Hong Kong, Email: {chengxu}@comp.hkbu.edu.hk

### **Abstract**

With the advent of commercial trusted execution environment (TEE), memory access-pattern attacks become a reality. The need of defense is critical to the wide adoption of these TEEs in real security applications. Existing defense techniques engineer data oblivious algorithms and cause high performance overhead.

This work addresses the efficiency of data-oblivious computations in TEE. The motivating observation is that the external-memory data-oblivious algorithms are a class of algorithms with the best complexity, thus promising for the efficiency in concrete performance. To engineer external oblivious algorithms, we define a new security notation, called cache-miss obliviousness (or CMO), where the trace of cache misses in a program execution is oblivious to the computation data. CMO expresses the target computation in two parts: external oblivious computation where the cache-misses are oblivious to the secret data, and internal computation that is fully contained inside CPU and does not incur any cache misses. By this means, the data security is achieved against all memory access-pattern attacks including the bus-tapping attacks. The efficiency is achieved by allowing the more efficient, data-dependent computation in the internal part.

In this work, we propose a suite of programming tools for engineering cache-miss oblivious computations in Intel SGX. The key technical problem is the dynamic program partitioning, that is, how to leverage the runtime information to split the program execution to the internal/external computation partitions. This is different from existing work relying on static or manual program partitioning (i.e., T-SGX or Cloak).

We implement a library for the programming support of CMO. The library is functional on Intel SGX and TSX, where the capability of transactional memory (TSX) is to enforce no cache-miss of the internal computation. First, the library exposes API for developers to specify the scope of leaky code section and the types of data accessed there. Second, for dynamic program partitioning, the program execution is monitored to trigger the insertion of TSX transactional boundaries. Third, we optimize the performance by amortizing the TSX transactional cost over a large code partition.

We evaluate the CMO library in systems security, computation expressiveness, and performance. Using the library, we implement a variety of data-analytical computations and cryptographic operations. Through performance study, we show that the CMO-based computation achieves a performance speedup in multiple orders of magnitude comparing existing data-oblivious techniques.



Fig. 1: System-design goals in security and data scale

## I. Introduction

The modern architecture of trusted execution environment (TEE) has been commercially available, notably Intel Software Guard eXtension (SGX [1]) and ARM TrustZone [2], and has recently been adopted in the cloud services, such as Google Cloud Platform [3] and Microsoft Azure [4]. These TEEs supports the architecture of a "hardware enclave" for secure and trustworthy execution of programs on an untrusted third-party host, such as Cloud. Despite the security-oriented design (e.g., encrypted memory, CPU access control), existing TEEs are vulnerable to one major type of side-channel attacks – memory-access pattern attacks [5], [6], [7], [8].

To mitigate memory-access pattern attacks on SGX, existing work either closes the leakage channel by detecting the presence of a memory-access sniffer [9], [10], [6] or makes the disclosed access trace irrelevant to sensitive information (i.e., data obliviousness [11], [12], [13]). The mitigation by attack detection relies on some hardware capabilities, such as Intel transactional memory (TSX). This approach limits resource sharing and prohibits data movement across trust boundary (G1). The mitigation by data-obliviousness [11], [14], [15] runs expensive oblivious algorithms. Both approaches incur high performance overhead, especially in the application running data-intensive computations.

In this work, we tackle the open challenge of supporting secure and efficient computations on Intel SGX. The security goal is to prevent any information leakage from the memory access trace, and to achieve the security against bus-tapping attacks as well as all other memory-access pattern attacks. The only assumption is that the processor is trusted. For efficiency, we adopt the external oblivious algorithms to express the target computation. External oblivious algorithms are a class of oblivious algorithm with the best time complexity (in both big-O notation and constant), thus promising for the good concrete performance.

In this work, we propose a security-performance notion, cache-miss obliviousness (CMO). Informally, it requires that the trace of cache misses in executing a program is oblivious to sensitive data, thus safe to be disclosed (to a bus-tapping adversary). The rest of memory accesses are resolved in the trusted CPU (namely cache hits and register accesses) and can afford to be secret-dependent. With cache-miss obliviousness, our observation is that the security to bus-tapping attacks can be attained, and performance efficiency can be retained due to that it allows for secret-dependent computations and algorithm design with asymptotic efficiency (See § IV-B1 for details).

Engineering CMO in software presents a non-trivial task. To start with, we focus on the class of external-oblivious computations [16], [17]. The common computing paradigm is that the data is stored 1) in a small internal memory with leaky access and 2) in a much larger external memory with non-leaky access (see § VI for a list of examples). Externally oblivious computations have better asymptotic performance other than oblivious mechanisms.

First, for expressing CMO computations, we propose a programming interface for expert developers to annotate the program and to distinguish the code section with leaky data access and that without. It annotates the begin and end of leaky code sections and data types accessed in TSX transactions.

Second, we propose a program-partitioning framework that splits the program execution and wraps them in TSX transactions. The use of TSX transactions is to ensure that there is no occurrence of cache misses during the program execution (which is similar to prior works [10], [18], [19]). However, we uniquely study the <u>dynamic</u> partitioning problem that partitions the program based on various runtime information. For security, our scheme only partitions the non-leaky data accesses across transactions, while keeping the leaky data accesses atomic. To implement the dynamic-partitioning scheme, we build a library that monitors program execution at runtime, looking for triggering conditions of partitioning. When conditions are met, it partitions the computation by inserting TSX instructions at transaction boundaries.

Third, we optimize the performance by enlarging the transaction size. The observation is that the larger a transaction is, the more instructions the transactional setup cost can be amortized to. To ensure the successful execution of large transactions, we propose to arrange the cache data usage based on the data-access type; the intuition is that different data types in a transaction are bounded by different cache conditions. For instance, read-only transaction data can reside in both last-level cache and level-one cache, whereas read-write transactional data must reside in level-one cache.



Fig. 2: Taxonomy of memory-access attacks and defenses in SGX: Existing attacks shown in the diagram are code-named by the publication venue and year, including USS17 (USENIX Security) [6], CCS17 [8], SysTex17 [21], ATC17 [20], OAKLAND15 [5]. Existing defenses shown in the diagram are code-named similarly and with their name, including T-SGX [9], Deja Vu [22], and Cloak [10].

To evaluate the expressiveness of our scheme, we build programs for a variety of common computations, including sorting, shuffle, binary search, k-means, etc. To evaluate the performance and data scalability, we measure the execution time of the computation.

### II. RELATED WORK

### A. Preliminary

**TEE** and memory access-pattern attacks: The architecture of SGX-alike TEE features a trusted world (a.k.a. enclave) and untrusted world sharing a processor, the boundary between which is protected by some isolation mechanisms. The main memory usually resides in the untrusted world where the content is protected by hardware encryption. In this TEE architecture, a memory reference has to access a series of micro-architectural "buffers" that are shared, such as last-level cache (LLC), translation look-aside buffer (TLB), memory row buffer, etc., creating leakage side channels.

A passive memory-access attacker monitors the access trace of an enclave execution and infers sensitive information. To monitor, the attacker *breaks* enclave execution and *observes* some micro-architectural side-channel. Existing memory-access attacks can be characterized by the spatial resolution of the leakage channel being observed and the temporal resolution of breaking the enclave control flow. For instance, in the seminal work [5], a controlled page-fault attacker both breaks and observes the enclave execution through page-fault interrupts, where the trace spatial granularity is at page numbers and the temporal granularity is individual page accesses. More advanced attacks exploit other leakage channels (e.g., page-table access/dirty bits [6], [8], shared last-level caches [20]) to improve spatial and temporal resolution. To make the defense more difficult, recent *stealthy* attacks break the enclave execution invisibly through hyper-threading that eliminate interrupts [10], [8].

**Data-oblivious algorithms**: The conventional oblivious algorithms ensure oblivious data access at the "word" granularity. These *word-oblivious algorithms* (A1) are constructed based on primitives such as compare-exchanges and tend to be not very practical (e.g., AKS sorting [23] and sub-optimal column sort [24]). In the external-memory model, **external oblivious algorithms** improve the time/IO complexity by assuming a small amount of internal storage that is accessed in a leaky fashion. Due to a smaller constant and optimal complexity, this family of algorithms are of great practical interest. For instance, Melbourne shuffle [16] causes  $O(\sqrt{N})$  IO with  $O(N \log N)$  time at the expense of  $\sqrt{N}$  internal space and was recently used in constructing practical systems (e.g., MapReduce [12]). Research to improve the algorithmic "locality" and to further reduce the internal-space complexity is underway [25]. The *oblivious RAM* (A3) [26], [27] is an external data structure supporting oblivious external data access by translating virtual random-access to oblivious physical data access with poly-logarithmic blowup.

Comparing ORAM (A3) and word-oblivious algorithms (A1), the external oblivious algorithms (A2) cause lower overhead in complexity. Comparing generic ORAM, external oblivious algorithms are computation specific; existing algorithms support sort, compaction, and relational queries and other batched-oriented computations [16], [28], [17], [29].

# B. Existing Work on Access-Pattern Secure Systems

**Memory-access attack mitigation**: A successful memory-access attack entails two conditions: 1) an observable channel of memory-access leakage, and 2) the capability of relating the leaked memory-access to sensitive information. A memory-access attack can be mitigated by breaking either condition: M1) Detecting that a leakage channel is being observed by *enclave fortification*, or M2) Eliminating the source of leakage by making it difficult to correlate the access trace with sensitive information, via *obliviousness*.

For the former, existing enclave-fortification work detects the attack on the events of page-fault [30], [9], cache-miss [10], [31], etc.

For the latter, existing work on **data-oblivious systems** is based on oblivious algorithms [11], [12] or less-efficient oblivious RAM [13], [32]. Concretely, data obliviousness can eliminate the source of leakage in the presence of memory-access attacks on TEE. When building data-oblivious systems, a design choice is which oblivious-computing mechanisms to use: Opaque [11] supports database queries by engineering word-oblivious algorithms (Column Sort [24]). Oblivious machine-learning [14] combines word-oblivious algorithms and the trivial mechanism of converting random-access to a full-array scan. Scan-based transformation is similarly adopted in building ZeroTrace [13], an ORAM system in SGX. OblivM [33] is a compiler that supports the engineering of both word-oblivious algorithms and ORAM [26], [27]. Oblivious Map-Reduce [12] and Prochlo [34] use external oblivious shuffle [16] to protect distributed systems under traffic analysis, but insecure under micro-architectural memory-access attacks.

### III. RESEARCH FORMULATION

### A. System Model

**System model**: In this work, we consider the modern computing paradigm that a security-sensitive computation is outsourced and run on a third-party host. This paradigm becomes increasingly popular due to the advent of cloud computing which has the advantage in cost effectiveness, availability, etc. Concretely, the data owner possessing security-sensitive personal data uploads the data and program to a third-part host in the cloud where the program is executed over the data to evaluate the computation on the owner's behalf.

The host in our system has an SGX-alike trusted execution environment, or enclave. The enclave is trusted by the data owner, which includes only the processor at the hardware layer and the owner-provided program at the software layer. Hardware and software outside this perimeter of the enclave is untrusted by the data owner.

a) The target computation: of this work is what we call by external-oblivious computation. The working set of this computation consists of a large volume of data accessed in a non-leaky fashion (namely, non-leaky data) and a small volume of data accessed in a leaky fashion (namely, access-leaky data). The small volume of access-leaky data is defined to be cache resident.

In practice, we deem there are a large number of real-world applications that meet the external-oblivious computations, especially in data-analytical applications. For instance, streaming applications feature a big-data stream that is accessed in a non-oblivious fashion, while keeping a small state that is accessed randomly (or in a leaky fashion). In other data-analytical computations, a common paradigm is to iterate through the dataset multiple iterations, each of which runs a stateful computation with a data pass through the dataset. Even for computations that are not natively external-oblivious, such as sorting (quick sort or merge sort are not oblivious), there are external oblivious algorithms, such as Melbourne shuffle [16], oblivious merge sorts [28]. We conduct a study of a series of related cases and support sample external-oblivious computations in § VI.

Out of the scope is the computation that randomly accesses a large external dataset, such as database search (with index); these computations can be supported by existing work such as ZeroTrace [13] that engineers an ORAM in external storage.

## B. Threat Model

This work considers memory-access pattern attacks in the emerging hardware enclaves. In the target system, a program runs inside an enclave of trusted processor and encrypted memory, and the adversary monitors certain micro-architectural side-channel during the enclave execution. Based on the observed access trace, she then infers other sensitive information. The exploited leakage channels include page-fault [5], page-table access/dirty bits [7], [6], cache-timing [8], [20]. Access-driven attacks against data-intensive computations can lead to significant information disclosure [12], [11], [35]. For clarity, the nongoals of this work include other side-channel attacks (e.g., timing attacks, power analysis), denial-of-service attacks, rollback attacks, etc.

## C. Security Definition: Cache-Miss Obliviousness

Intuitively, the cache-miss obliviousness or CMO requires that the trace of cache-misses (as exposed in bus-tapping attacks) in a program execution should be independent with any data of the execution (both input and intermediate data). Consider the execution of a program P with data I. The execution produces trace T consisting of last-level cache misses. CMO requires that given two data values,  $I_0$  and  $I_1$ , a CMO execution produces two identical traces, that is,  $T_P(I_0) = T_P(I_1)$ . This definition assumes that the computation induced by P is deterministic. And the definition of obliviousness is "perfect" as it requires the two traces of different data are exactly the same. We skip the more formal and generic definition of cache-miss obliviousness (e.g., based on indistinguishability formation [36]).

<sup>&</sup>lt;sup>1</sup>Data analytical algorithms usually have super-linear time complexity and need to access data multiple iterations.

## D. Research Problem and Distinction

The focus of this work is the support of efficient data-intensive computations (in § III-A0a) with cache-miss obliviousness (in § III-C) on the TEE platforms (in § III-A). Particularly, the cache-miss obliviousness ensures the memory-access security under cache-timing attacks while retaining the efficiency of external-oblivious algorithms.

This goal presents an open research problem: Existing research on cache-timing attack detection (e.g., by HTM) is limited in data scalability as the working set must be cache-resident. Existing research on data-obliviousness engineering largely focuses on word-oblivious algorithms (e.g., Opaque [11], ObliviousML [14]). The limited work considering the more efficient external-oblivious algorithms (e.g., ZeroTrace [13], ObliviousMR [12], Prochlo [34]) has either sub-par security or sub-optimal performance efficiency.

We survey existing research from the angles of strong cache-attack security, data scalability, and algorithmic complexity. The research distinction of our work is presented in Table I.

|                               | Method                       | Cache access-<br>pattern security | Data scalability | Algorithmic complexity |
|-------------------------------|------------------------------|-----------------------------------|------------------|------------------------|
| T-SGX [9], Cloak [10]         | Attack detection by HTM      | +                                 | -                | +                      |
| Opaque [11], ObliviousML [14] | Word-oblivious algo.         | +                                 | +                | -                      |
| ObliviousMR [12],Prochlo [34] | External oblivious algo.     | -                                 | +                | +                      |
| ZeroTrace [13]                | External ORAM                | +                                 | +                | -                      |
| This work                     | External oblivious algo./HTM | +                                 | +                | +                      |

TABLE I: Research distinction

## E. Proposed Method

Our method for engineering external obliviousness is that given the target computation expressed by an external-oblivious algorithm, we place and only place the part of program accessing internal memory in external-oblivious algorithms to the HTM transactions. By this means, the cache-misses which only occur outside the internal program are oblivious, while cache-hits and other internal accesses can be made secure by HTM protection.

The motivation to combine HTM and external oblivious algorithms can be better illustrated by an example: Consider implementing a sort on Intel SGX with cache-attack security. The first baseline (BL1) entails implementing the word-oblivious sorting network at the expense of a logarithmic multiplicative factor (i.e.,  $O(N\log^2 N)$ ). The second baseline is to use HTM to protect the leaky accesses in a classic sorting algorithm of  $O(N\log^2 N)$  complexity, such as quick sort or merge sort. This approach limits the data size by CPU cache size. In a big-data setting (with large N), either approach is feasible as the former is inefficient and the latter is unscalable. By contrast, our proposed method combining HTM and external oblivious algorithms could lead to an optimal  $O(N\log N)$  sorting algorithm (based on the scramble-then-shuffle paradigm [37]) implemented with cache-attack security.

Note that our approach shares some general idea with Cloak and T-SGX in the sense that it leverages Intel TSX to detect the memory-access attacks. However, Cloak does not automatically partition the computation and is limited to small-data case. T-SGX is a static program-partition scheme, which limits its applicability in handling loops and thus big-data computations. Our work uniquely addresses the problem of dynamically partitioning the programs to support big-data computations with access-pattern security.

## IV. SYSTEM OVERVIEW AND API

## A. System Overview

Our goal is to build a data-intensive runtime system with access-pattern security on hardware enclaves. Concretely, given an external oblivious algorithm, we provide an API (S1) for developers to annotate the leaky section of the program. Given the annotated program, we provide an execution engine (S2) runs the program on architecture of two execution environments: the one with access-pattern protection (e.g., prohibiting cache-misses) and the other without. We call the former by fortified enclave, and the latter by unprotected enclave. Briefly, the execution engine partitions the runtime instance dynamically in a way that assures no cache-miss yet can scale to relatively large data. Figure 3 illustrates the overview of our system. In the following, we present the detailed design and implementation of our system in each component, S1 in § IV-B2 and S2 in § V.

# B. External-Obliviousness API

1) Preliminary: External-Obliviousness Algorithms: External oblivious algorithms are a class of oblivious mechanisms that has advantageous time/round complexity at the expenses of assuming a trusted internal memory. External oblivious algorithms are proposed for a wide variety of computations, including shuffle (Melbourne shuffle [16]), sort (oblivious merge sort [28]), most aggregation computations, etc. Comparing word-oblivious algorithms (e.g., sorting networks causing a multiplicative factor of  $O(\log N)$ ), an external oblivious algorithm has better time complexity (e.g., Melbourne shuffle based sorting [12], [37] with the O(1) multiplicative factor). Comparing ORAM (with a multiplicative factor of  $O(\log^2 N)$ ), an external oblivious algorithm is specific to target computation and is more efficient. While external oblivious algorithms achieve better complexity, it may



Fig. 3: System overview: A target computation is annotated by our programming API that specifies data-access pattern. Our runtime system first partitions the program dynamically to units (i.e., dynamic partitioning), and then runs each unit in an TSX transaction (i.e., tx-execution engine).

present limitation on data scalability. Concretely, most existing external oblivious algorithms are designed for a client-server setting, and assume a large internal memory at a client side. For instance, a Melbourne shuffle assumes an internal memory of  $O(\sqrt{N})$ . The large internal memory may be suited for a client machine (in the big-data setting), but presents challenges when treating cache as internal memory, whose size is very limited; for instance, a L1 cache is 32 KB and L3 cache is 8 MB in an Intel SGX CPU. Very recently, space-efficient external oblivious algorithms are proposed, for instance, stash shuffle with small  $O(\log N)$  internal space.

2) API: Our system provides a programming interface for developers to annotate the leaky section of the program and define the type of data referenced inside the leaky section. Specifically, 1) we provide two library functions, begin\_leaky() and end\_leaky(), for developers to declare the begin and end of a program snippet where leaky memory access occurs. 2) In addition, all the memory data accessed inside a leaky section needs to be defined with our data type. We provide four data-container types as defined below. The explicit data type allows our execution engine to be know ahead of time the possible data-access pattern and take action in partitioning more precisely.

The data-container types are classified based on whether the access is leaky and whether the data is read-only. The four data types are Nobro, Obro, Nobro, Nobro,

```
class ObRO {
                                                                1 void MergeSort(int[] array, int start, int end){
    int32_t read_next();
                                                                    if (start == end - 1) return;
3
    void reset();
4
                                                                3
                                                                    MergeSort(array, start, (end + start)/2);
                                                                    MergeSort(array, (end + start)/2, end);
  class ObRW {
    void write_next(int32_t data);
                                                                    Merge(array, 0, (end + start)/2, (end + start)/2, end);
                                                                5
    void reset();
                                                                6 }
8
                                                                7 void Merge(int[] array, int start1, int end1, int start2, int
  class NobRW {
                                                                         end2){
                                                                8
                                                                    NobRW parray = new NobRW(array);
10
    int32 t read at(int32 t addr);
    void write_at(int32_t addr, int32_t data);
11
                                                                    begin_leaky();
                                                               10
                                                                    originalMerge(parray, start1, end1, start2, end2);
12
13 class NobRO {
                                                               11
                                                                    end_leaky();
14
    int32_t nob_read_at(int32_t addr);
                                                               12 }
15
```

Fig. 4: Implementing merge sort using our CMO API: in-transaction data types and leaky section

## V. DYNAMIC PROGRAM PARTITIONING

#### A. Motivation

To secure a computation against memory-access pattern attacks, existing work leverages HTM features to detect any unintended cache misses. Notably, Cloak [10] detects cache attacks using Intel TSX. Briefly, it executes the computation

#### Read-set Write-set NobRO **NobRW** Leaky C2) Leaky data data resides in cache C1) Tx C1) Tx readset writeset Non-leaky smaller smaller than LLC than L1 data **ObRO ObRW**

Fig. 5: Transactional data constraints imposed by hardware (C1) and security (C2) requirements. The data is classified to four types based on read/write and leaky/non-leaky.

in HTM transactions, such that the intra-transaction memory accesses are resolved by cache hits and thus are concealed from the adversary. Note that the inter-transaction memory accesses are disclosed. The security of these Cloak alike schemes requires that access-leaky memory region must reside in data cache; this security requirement significantly limits the data scale. In addition, Cloak's programming interface is rather low-level and requires programmer to manually specify the scope of individual HTM transactions, which further limit its applicability in big-data computations.

To systematically improve the data scalability, we study two technical problems: (O1) dynamic program partitioning and (O2) enlarging transaction. First, given a target computation, we dynamically partition the program to smaller execution units, each run in an individual transaction. Comparing existing work, such as Cloak [10] and T-SGX [9], our unique perspective about program partitioning is to take into account various runtime information (see § III-E for detailed comparison).

Second, for each execution unit, we enlarge the transaction as much as possible. Here, the larger a transaction is, the performance efficiency it can be, as the transaction setup cost (e.g., running xbegin/xend instructions) can be amortized among more instructions. Note that the size of any TSX transaction has a theoretic limit in that its working set cannot exceed a CPU data cache. Our goal is to reach this theoretic bound in transaction size.

## B. Problem of Dynamic Program Partitioning

As revealed in our API design in § IV-B2, a starting point of our approach is to expose to the runtime system some high-level semantic information. This information includes different types of data accesses; concretely, the data-container types in our API lets the runtime be aware of four data-access classes: 1) non-oblivious reads (or access-leaky reads as in Nobro), 2) non-oblivious writes (or access-leaky writes as in Nobro), 3) oblivious reads (as in Obro) and 4) oblivious writes (as in Obro).

With awareness of data access type, we can specify the constraints of program partitioning problem. In general, there are two constraints, the constraint imposed by the TSX hardware (C1) and that imposed by the security requirement (C2). Figure 5 illustrates the two constraints in the presence of four types of data accesses. For C1, it requires that a transaction's write-set should not cause L1 cache replacement, and a transaction's read-set should not cause LLC cache replacement. For C2, the access-leaky memory region must reside in data cache without being split. Here, an access-leaky memory region is defined by memory data accessed in a data-dependent way; for instance, NobRO and NobRW instances are access-leaky regions.<sup>2</sup>. The constraint of C2 is due to the strong security concern. Because the HTM based protection only conceals the intra-transaction memory access pattern and the inter-transaction memory access pattern is disclosed. Placing the leaky memory accesses across multiple transactions would inevitably disclose the data secrets.

Because of the constraints (particularly C2), a key design choice we made is that we focus on partitioning the non-leaky part of the program. That is, we consider the problem of partitioning the program into smaller units, each unit accessing a partition of non-leaky data (ObXX) but accessing leaky data in their entirety. Note that the leaky data is not partitioned and has to be loaded in total to the data cache. The implication of this design choice is that we only support computation with small cache-bound leaky data (NobXX).<sup>3</sup>, which is consistent with our target computations (as described in § III-A0a).

Our goal is two-fold: (O1) Support computations of (non-leaky) data as large as possible by dynamically partitioning the computations, and (O2) Run these computations using transactions of as large size as possible.

The high-level approach is illustrated in Figure 6. As in the figure, consider a computation whose application memory consists of (cache-bound) access-leaky data (i.e., A1 and A2) and large non-leaky data (i.e., A3 and A4). We split the non-leaky data to a series of smaller data partitions (i.e., {P3} and {P4}), each of which is small enough to fit in a CPU data cache. To

 $<sup>^2\</sup>mbox{We}$  may use terms  $\mbox{NobXX}$  to represent access-leaky regions

<sup>&</sup>lt;sup>3</sup>Note that the assumption presents a fundamental limitation of our approach in supporting computation of large access-leaky data.



Fig. 6: Splitting data to transaction-wise partitions and mapping data in memory/cache hierarchy: Shadow-memory is an indirection layer that arranges data in cache based on access pattern. From LL (Last-Level) cache to L1 (Level-one) cache, read-only data replacement is enabled.

support large transactions (O2), we enable <u>L1 cache replacement on read-only data</u>. That is, the footprint of A1 and P3 in L1 cache can be reduced to just one set, as will be explained next.

Formally, our work can be formulated as an optimization problem: Given a program G on data  $\{A1, A2, A3, A4\}$ , it partitions G to a series of  $\{g\}$  with each program partition g operate on data A1, A2, P3, P4 such that it maximizes  $\min P3, P4$  and it is subject to the following constraints:

Here, L is the maximal number of LLC cachelines that can be mapped to one L1 cache set.

# C. Supporting Large Transactions by Shadow Memory

A1 can be replaced As will be described next, read-only data (A1 and P3) in a transaction can reuse L1 data cache without aborting the transaction.

cache full utilization is essential.4

Efficient use of the cache is of particular importance to ensuring large transactions. The transaction will not abort when the cache is underutilized; for instance, the transaction may end when only one cache set (out of 64) becomes full while the other sets are empty.

We propose a technique called shadow memory that arranges the in- transaction memory layout such that memory data is mapped to the CPU cache hierarchy without unnecessary conflict. Briefly, we enforce the rules that different types of data (A1, ...A4) should not overlap in data caches. Specifically, consider a cache hierarchy with a 8-way 32KB L1 cache and 16-way 8MB L2 cache. We allocate 60 L1C sets to be A2 (NobRW), 2 L1C sets to be P4, 1 L1C set to be P3 and L1C set to be A1. Note that A1 and P3 can reuse the L1 cache and their sizes are bounded by LLC. Figure 6 illustrates an example of the shadow memory.

## D. Implementing Dynamic Partitioning

To realize the dynamic partitioning on real hardware, there are two problems to be tackled: 1) How to realize the partitioning, in other words, how and when to insert the xbegin and xend instructions such that it can successfully execute the transaction (without abort) yet does not prematurely end a transaction? 2) How to realize the shadow memory in its full life cycle? 3) How to decide the condition that the computation cannot be executed using transactions at an early time (e.g., when the A1 is larger than LLC).

We build a library that realizes the partitioning life cycle. Recall that our API supports declaring the scope of leak section and wrapping each in-transaction memory access by custom data types. Upon leaky section declaration, we realize the allocation of shadow memory. Upon the in-transaction memory access, we hook the transaction partitioning schemes. The overall mechanism is illustrated in Figure 7.

<sup>&</sup>lt;sup>4</sup>For instance, a poorly utilized cache may abort a transaction when just one set (out of say 64 cache sets) is full. A well utilized cache may abort a transaction when the entire cache is full, which results in a much larger transaction.



Fig. 7: Library implementation: Each box represents a function in our library. White boxes run outside TSX transactions and blue boxes run inside TSX transactions.

Concretely, for shadow memory allocation, we allocate the shadow memory according to the layout described in § V-C. Note that the shadow-memory size is fixed (as the total size of a cache is fixed). We perform checks on whether the data size, specifically the size of access-leaky memory data, is small enough to fit into the data cache. In other words, we enforce the constraint C1 that A1 does not exceed L1 and A2 does not exceed LLC.

For transaction partitioning, we implement two components: 1) monitoring cache usage, and 2) realizing transaction partitioning. For 1), we simply monitor the condition that if each non-leaky data overflows, that is, the iterator call to ObXX.next()<sup>5</sup> reaches its capability. Note that we enforce a joint capability limit, that is, the total size of ObXX objects is bounded instead of individual ObXX.

For 2), we insert the TSX instruction xend and xbegin. Between them, we make data copy between shadow memory and application memory of the non-leaky data. Because the computation is partitioned based on non-leaky data, different transactions in the same leaky-section computation will use different non-leaky data, thus making it necessary to reload non-leaky data across transactions. In addition, we perform double preloading across the transactional boundary.

```
begin_leaky(){
   shadow_memory_alloc();
2
3
5
  ObXX::next(){
     //dynamically partition computations in next();
     //1. monitoring the overflow
    if(shadow_mem.ob.size + 1 < shadow_mem.ob.capability()) return;</pre>
     //2. start to execute the partitioning
10
    xend();
     //reload non-leaky data (ob) in shadow memory from application memory
11
12
     shadow_mem.reload_ob();
13
14
     double preload();
15
     xbegin();
16
```

Fig. 8: Library implementation: hooking dynamic-partitioning trigger in non-leaky data scans.

## VI. CASE STUDY

Our system supports external-oblivious computing. In our perspective, there are some computations that are natively external-oblivious, such as binary search, iterative machine learning, etc.. There are some computations whose classic algorithms are not oblivious, but there exist external-oblivious algorithms. For instance, for sorting, the most common algorithms, such as quick sort, are not oblivious. There are external-oblivious sorting algorithms, such as oblivious merge sort and Melbourne shuffle based sort, that have the same  $O(N \log N)$  complexity with quick sort.

In this section, we present our experience building various external-oblivious computations using our library.

## A. K-Means Computation

A K-means computation takes as input a data array and randomly initialized K centroids. It produces the output of K centroids in K clusters that are the most representative of the data array. Here, each element in the data array represents a data point and there are pair-wise distances defined between them. The key-means computation runs iteratively, where each

<sup>&</sup>lt;sup>5</sup>ObXX refers to non-leaky data type, either ObRO or ObRW.



Fig. 9: Performance under varying transaction sizes

iteration consists of two data passes, one (KM1) is a nested loop that (re)-assigns every array element to the closest centroid, and the other (KM2) is an array scan that randomly accesses the centroid array to update the centroid position.

K-means is an external oblivious computation in the sense that it keeps the centroid as internal memory accessed in a leaky way, and the data array as an external memory, which is accessed obliviously.

To implement K-means using our library, we declare the data pass (KM2) to be the leaky section where the internal memory is accessed. In the leaky section, the centroid array is defined as a NobRW, data array as an ObRO, the data-centroid mapping as an ObRW.

## B. Melbourne Shuffle

A Melbourne shuffle is a randomized algorithm for data shuffling. A data-shuffle operation takes as input two arrays of equal length, one storing data and the other storing a permutation. It produces the output of permuted data array. For instance,  $\mathtt{mshuffle}(x,y,z,1,0,2) = y,x,z$ . Internally, Melbourne shuffle is a randomized algorithm that bucketizes each of the two array to  $\sqrt{N}$  buckets, each of  $\sqrt{N}$  size. A Melbourne shuffle runs in two rounds, each round using  $\log N\sqrt{N}$  internal memory accessed in a leaky fashion. The original input arrays are in external memory.

To implement Melbourne shuffle using our library, we declare both rounds to be leaky sections. In particular, for scalability, we declare the first round, which is to "distribute data bucket based on permutation bucket", to be two leaky sections. Briefly, the first leaky section in the distribution round produces the unordered, linked list as output, without dummy elements. The second leaky section adds dummy elements. In this way, we can improve the data scalability significantly and reduce the access-leaky internal memory from  $O(\log N\sqrt{N})$  to  $O(\sqrt{N})$ .

## C. Oblivious Merge Sort

In oblivious merge sort, the operation of merging two sorted lists is made oblivious by a randomized algorithm [28]. Given two sorted lists, an oblivious merge keeps a  $O(\sqrt{N})$  internal-memory buffer and finishes the merge computation with the same O(N) complexity. It is a randomized algorithm and the probability of internal buffer overflowing is made negligible.

We implement oblivious merge sort by declaring the merge operation to be leaky section and the internal buffer to be a NobRW and the two merging arrays to be two ObRO.

## D. Streaming Binary Search

Given a query and a sorted data array, a binary search locates the matching element in the array to the query. Given a stream of queries, the binary search can be treated as an external-oblivious process in the sense that the internal memory maintains the data array and external memory maintains the query stream.

To implement the streaming binary search using our library, we declare the binary search to be the leaky section, the data array to be a NobRO, the query stream (array) to be an ObRO.

# VII. PERFORMANCE EVALUATION

In this section, we evaluate our system in terms of data scalability and performance. We first conduct a micro-benchmark study on the transaction success rate. We then conduct a macro-benchmark study that extensively evaluates the system performance under different data scales.

## A. Micro-Benchmark: Performance and Transaction Size

In this micro-benchmark study, we evaluate the performance with varying transaction size. Speculatively, a small transaction would cause high performance overhead.

For this experiment, we consider a simple computation of a loop writing to a data array. Each array element has the size of a cache line, such that each element access leads to a cache miss. We control the transaction size by the number of elements accessed.

We run the program on an Intel SGX machine with following specs: Intel(R) Core(TM) i5-7500 CPU @ 3.40GHz 64 bits, with 32 KB L1 cache size (8-way 64 sets) and a 8 GB RAM. In the experiment, we vary the size of transaction and the data size (i.e., the array length). We conduct the program run 1000 times and measure the average execution time and its standard deviation.

The performance result is presented in Figure 9. With increasing transaction sizes, the execution time generally decreases. Specifically, when transaction size increases from 8 to 64, the execution time decreases by 2 - 5 times. When the transaction grows beyond 1024, the execution time becomes very unstable. When the transaction size is 2048 cachelines, the execution is limited for small data, such as below 256 cachelines.

The reason of performance improvement with increasing transaction size is that the cost of transaction setup (e.g., running xbegin and xend) is amortized over more instructions with larger transactions. When transaction grows over a threshold, the return (of performance increase) diminishes. And if the transaction grows too large (e.g., larger than cache size), the transaction abort rate will increase, due to cache replacement in transactions, which leads to unstable execution and large execution time.

## B. Macro Benchmark: Execution Time

1) Comparison with Scan Baseline: This set of experiments evaluates the performance of CMO, with comparison to scan-based baselines. The scan-based baseline translates each leaky random access (on NobXX) to a full array scan. Note that while ORAM [26], [27] presents a generic solution with better (asymptotic) efficiency, the scan-based data-obliviousness approach is the state-of-the-art; it is implemented more widely for in-memory data processing due to its simplicity, such as in oblivious machine learning [14] and ZeroTrace [13]<sup>6</sup>. In addition, at a medium data scale, the better asymptotic efficiency of ORAM may not translate to better concrete performance. We implement the scan baseline in our library by converting each access to NobXX to a full-array scan.

We consider the various computations in our case study. In each experiment, we measure the performance by execution time. Specifically, the execution time only includes the time spent on executing the computation, and excludes the time spent on data loading and system initialization. We use numeric datasets and generate them randomly.

We did all the experiments on a laptop with an Intel 8-core i7-6820HK CPU of 2.70GHz, 32KB L1 and 8MB LL cache, 32 GB RAM and 1 TB Disk. This is one of the Skylake CPUs equipped with both SGX and TSX features.

We vary the data size in each computation and report the results in Figures 10 and 11.

Figure 11a and 11b evaluate the execution time of Melbourne shuffle and oblivious merge sort. Comparing the baseline of scan approach, the Melbourne shuffle by CMO achieves a speedup of more than 100X at the largest data scale of 1 million records. For oblivious merge sort, the speedup is much less significant, about 10

Figures 10b and 10a present the performance result of streaming binary search with varying data size and the number of queries. With varying data size, the CMO outperforms the scan baseline by a speedup ranging from 100X to 1000X. With varying query number, CMO first keeps the execution time constant, until the cache cannot accommodate all the queries. CMO's execution time then linearly scale with the query number. With a large query number, CMO outperforms the scan baseline by 1000X times. Similarly, the performance result of K-means in Figures 10c and 10d also show multi-magnitude speedup of CMO.

2) Comparing Different Algorithms: In this set of experiments, we evaluate the importance of dynamic partitioning by measuring the performance of CMO against two baselines without partitioning: 1) pure data-obviousness approach and 2) TSX based attack-detection (as in Cloak [10]). We allow one to choose the best and most-efficient algorithm in each baseline. For TSX attack detection, we consider quick sort. For data-oblivious computing, we consider bubble sort. For CMO, we consider three sorting algorithms, Melbourne shuffle with quick sort (following the scramble-then-compute paradigm [37]), oblivious merge sort, and cache shuffle with quick sort.

The performance result is in Figure 12. It can be seen clearly the TSX attack detection is limited in data scalability, the baseline of bubble sort causes high performance overhead, especially as data size grows. The CMO paradigm, with different algorithms, achieves much better performance than the baselines with up to two orders of magnitudes speedup.

## REFERENCES

- [1] "Intel corp. software guard extensions programming reference."
- [2] "ARM TrustZone, https://www.arm.com/products/security-on-arm/trustzone."

<sup>&</sup>lt;sup>6</sup>ZeroTrace uses ORAM for disk data access and scan for its internal-memory data access.



Fig. 10: Comparing CMO with scan: Binary searches and k-means



Fig. 11: Comparing CMO with scan: External-oblivious sorts

- [3] "Google cloud platform supports intel sgx."
- [4] "Introducing azure confidential computing."
- [5] Y. Xu, W. Cui, and M. Peinado, "Controlled-channel attacks: Deterministic side channels for untrusted operating systems," in 2015 IEEE Symposium on Security and Privacy, SP 2015, San Jose, CA, USA, May 17-21, 2015. IEEE Computer Society, 2015, pp. 640–656. [Online]. Available: https://doi.org/10.1109/SP.2015.45
- [6] J. V. Bulck, N. Weichbrodt, R. Kapitza, F. Piessens, and R. Strackx, "Telling your secrets without page faults: Stealthy page table-based attacks on enclaved execution," in 26th USENIX Security Symposium, USENIX Security 2017, Vancouver, BC, Canada, August 16-18, 2017., E. Kirda and T. Ristenpart, Eds. USENIX Association, 2017, pp. 1041–1056. [Online]. Available: https://www.usenix.org/conference/usenixsecurity17/technical-sessions/presentation/van-bulck
- [7] S. Lee, M. Shih, P. Gera, T. Kim, H. Kim, and M. Peinado, "Inferring fine-grained control flow inside SGX enclaves with branch shadowing," in <a href="26th USENIX Security Symposium">26th USENIX Security Symposium</a>, USENIX Security 2017, Vancouver, BC, Canada, August 16-18, 2017., E. Kirda and T. Ristenpart, Eds. USENIX Association, 2017, pp. 557–574. [Online]. Available: https://www.usenix.org/conference/usenixsecurity17/technical-sessions/presentation/lee-sangho
- [8] W. Wang, G. Chen, X. Pan, Y. Zhang, X. Wang, V. Bindschaedler, H. Tang, and C. A. Gunter, "Leaky cauldron on the dark land: Understanding memory side-channel hazards in SGX," in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS 2017, Dallas, TX, USA, October 30 November 03, 2017, B. M. Thuraisingham, D. Evans, T. Malkin, and D. Xu, Eds. ACM, 2017, pp. 2421–2434. [Online]. Available: http://doi.acm.org/10.1145/3133956.3134038
- [9] M.-W. Shih, S. Lee, T. Kim, and M. Peinado, "T-sgx: Eradicating controlled-channel attacks against enclave programs," in NDSS Symposium 2017 in San Diego, California.
- [10] D. Gruss, J. Lettner, F. Schuster, O. Ohrimenko, I. Haller, and M. Costa, "Strong and efficient cache side-channel protection using hardware transactional memory," in 26th USENIX Security Symposium, USENIX Security 2017, Vancouver, BC, Canada, August 16-18, 2017., E. Kirda and T. Ristenpart, Eds.



Fig. 12: Comparing CMO with word-obliviousness and Cloak: the case of sorting

- USENIX Association, 2017, pp. 217-233. [Online]. Available: https://www.usenix.org/conference/usenixsecurity17/technical-sessions/presentation/gruss
- [11] W. Zheng, A. Dave, J. G. Beekman, R. A. Popa, J. E. Gonzalez, and I. Stoica, "Opaque: An oblivious and encrypted distributed analytics platform," in 14th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2017, Boston, MA, USA, March 27-29, 2017, 2017, pp. 283–298. [Online]. Available: https://www.usenix.org/conference/nsdi17/technical-sessions/presentation/zheng
- [12] O. Ohrimenko, M. Costa, C. Fournet, C. Gkantsidis, M. Kohlweiss, and D. Sharma, "Observing and preventing leakage in mapreduce," in <a href="Proceedings">Proceedings</a>
  of the 22nd ACM SIGSAC Conference on Computer and Communications Security, Denver, CO, USA, October 12-6, 2015, I. Ray, N. Li, and C. Kruegel, Eds. ACM, 2015, pp. 1570–1581. [Online]. Available: http://doi.acm.org/10.1145/2810103.2813695
- [13] S. Sasy, S. Gorbunov, and C. W. Fletcher, "Zerotrace: Oblivious memory primitives from intel SGX," <u>IACR Cryptology ePrint Archive</u>, vol. 2017, p. 549, 2017. [Online]. Available: http://eprint.iacr.org/2017/549
- [14] O. Ohrimenko, F. Schuster, C. Fournet, A. Mehta, S. Nowozin, K. Vaswani, and M. Costa, "Oblivious multi-party machine learning on trusted processors," in 25th USENIX Security Symposium, USENIX Security 16, Austin, TX, USA, August 10-12, 2016., T. Holz and S. Savage, Eds. USENIX Association, 2016, pp. 619–636. [Online]. Available: https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/ohrimenko
- [15] T. T. A. Dinh, P. Saxena, E. Chang, B. C. Ooi, and C. Zhang, "M2R: enabling stronger privacy in mapreduce computation," in 24th USENIX Security Symposium, USENIX Security 15, Washington, D.C., USA, August 12-14, 2015., J. Jung and T. Holz, Eds. USENIX Association, 2015, pp. 447–462. [Online]. Available: https://www.usenix.org/conference/usenixsecurity15/technical-sessions/presentation/dinh
- [16] O. Ohrimenko, M. T. Goodrich, R. Tamassia, and E. Upfal, "The melbourne shuffle: Improving oblivious storage in the cloud," in <u>Automata, Languages, and Programming 41st International Colloquium, ICALP 2014, Copenhagen, Denmark, July 8-11, 2014, Proceedings, Part II, 2014, pp. 556–567. [Online]. Available: https://doi.org/10.1007/978-3-662-43951-7\_47</u>
- [17] M. T. Goodrich, "Data-oblivious external-memory algorithms for the compaction, selection, and sorting of outsourced data," in SPAA 2011: Proceedings of the 23rd Annual ACM Symposium on Parallelism in Algorithms and Architectures, San Jose, CA, USA, June 4-6, 2011 (Co-located with FCRC 2011), 2011, pp. 379–388. [Online]. Available: http://doi.acm.org/10.1145/1989493.1989555
- [18] J. Chen, Y. Tang, and H. Zhou, "Strongly secure and efficient data shuffle on hardware enclaves," <u>CoRR</u>, vol. abs/1711.04243, 2017. [Online]. Available: http://arxiv.org/abs/1711.04243
- [19] J. Chen, Y. R. Tang, and H. Zhou, "Strongly secure and efficient data shuffle on hardware enclaves," ACM SOSP Workshop (SysTex), 2017.
- [20] M. Hähnel, W. Cui, and M. Peinado, "High-resolution side channels for untrusted operating systems," in 2017 USENIX Annual Technical Conference, USENIX ATC 2017, Santa Clara, CA, USA, July 12-14, 2017. USENIX Association, 2017, pp. 299–312. [Online]. Available: https://www.usenix.org/conference/atc17/technical-sessions/presentation/hahnel
- [21] J. V. Bulck, F. Piessens, and R. Strackx, "Sgx-step: A practical attack framework for precise enclave execution control," <u>ACM SOSP Workshop (SysTex)</u>, 2017
- [22] S. Chen, X. Zhang, M. K. Reiter, and Y. Zhang, "Detecting privileged side-channel attacks in shielded execution with déjà vu," in <u>Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, AsiaCCS 2017, Abu Dhabi, United Arab Emirates, April 2-6, 2017, R. Karri, O. Sinanoglu, A. Sadeghi, and X. Yi, Eds. ACM, 2017, pp. 7–18. [Online]. Available: http://doi.acm.org/10.1145/3052973.3053007</u>
- [23] M. Ajtai, J. Komlós, and E. Szemerédi, "An o(n log n) sorting network," in Proceedings of the 15th Annual ACM Symposium on Theory of Computing, 25-27 April, 1983, Boston, Massachusetts, USA, D. S. Johnson, R. Fagin, M. L. Fredman, D. Harel, R. M. Karp, N. A. Lynch, C. H. Papadimitriou, R. L. Rivest, W. L. Ruzzo, and J. I. Seiferas, Eds. ACM, 1983, pp. 1–9. [Online]. Available: http://doi.acm.org/10.1145/800061.808726
- [24] Y. Azar and U. Vishkin, "Tight comparison bounds on the complexity of parallel sorting," SIAM J. Comput., vol. 16, no. 3, pp. 458–464, 1987. [Online]. Available: https://doi.org/10.1137/0216032
- [25] P. Maniatis, I. Mironov, and K. Talwar, "Oblivious stash shuffle," CoRR, vol. abs/1709.07553, 2017. [Online]. Available: http://arxiv.org/abs/1709.07553
- [26] E. Stefanov, M. van Dijk, E. Shi, C. W. Fletcher, L. Ren, X. Yu, and S. Devadas, "Path ORAM: an extremely simple oblivious RAM protocol," in 2013 ACM SIGSAC Conference on Computer and Communications Security, CCS'13, Berlin, Germany, November 4-8, 2013, 2013, pp. 299–310. [Online]. Available: http://doi.acm.org/10.1145/2508859.2516660
- [27] O. Goldreich and R. Ostrovsky, "Software protection and simulation on oblivious rams," J. ACM, vol. 43, no. 3, pp. 431–473, 1996. [Online]. Available: http://doi.acm.org/10.1145/233551.233553
- [28] P. Williams and R. Sion, "Usable PIR," in Proceedings of the Network and Distributed System Security Symposium, NDSS 2008, San Diego, California, USA, 10th February 13th February 2008. The Internet Society, 2008. [Online]. Available: http://www.isoc.org/isoc/conferences/ndss/08/papers/09\_usable\_pir.pdf
- [29] A. Arasu and R. Kaushik, "Oblivious query processing," in Proc. 17th International Conference on Database Theory (ICDT), Athens, Greece, March 24-28, 2014, 2014, pp. 26–37. [Online]. Available: http://dx.doi.org/10.5441/002/icdt.2014.07
- [30] S. Shinde, Z. L. Chua, V. Narayanan, and P. Saxena, "Preventing page faults from telling your secrets," in Proceedings of the 11th ACM on Asia Conference on Computer and Communications Security, AsiaCCS 2016, Xi'an, China, May 30 June 3, 2016, X. Chen, X. Wang, and X. Huang, Eds. ACM, 2016, pp. 317–328. [Online]. Available: http://doi.acm.org/10.1145/2897845.2897885
- [31] F. Brasser, U. Müller, A. Dmitrienko, K. Kostiainen, S. Capkun, and A. Sadeghi, "Software grand exposure: SGX cache attacks are practical," in 11th USENIX Workshop on Offensive Technologies, WOOT 2017, Vancouver, BC, Canada, August 14-15, 2017., W. Enck and C. Mulliner, Eds. USENIX Association, 2017. [Online]. Available: https://www.usenix.org/conference/woot17/workshop-program/presentation/brasser
- [32] S. Eskandarian and M. Zaharia, "An Oblivious General-Purpose SQL Database for the Cloud," ArXiv e-prints, Oct. 2017.
- [33] C. Liu, X. S. Wang, K. Nayak, Y. Huang, and E. Shi, "Oblivm: A programming framework for secure computation," in 2015 IEEE Symposium on Security and Privacy, SP 2015, San Jose, CA, USA, May 17-21, 2015. IEEE Computer Society, 2015, pp. 359–376. [Online]. Available: http://dx.doi.org/10.1109/SP.2015.29
- [34] A. Bittau, Ú. Erlingsson, P. Maniatis, I. Mironov, A. Raghunathan, D. Lie, M. Rudominer, U. Kode, J. Tinnés, and B. Seefeld, "Prochlo: Strong privacy for analytics in the crowd," in <u>Proceedings of the 26th Symposium on Operating Systems Principles, Shanghai, China, October 28-31, 2017</u>. ACM, 2017, pp. 441–459. [Online]. Available: <a href="http://doi.acm.org/10.1145/3132747.3132769">http://doi.acm.org/10.1145/3132747.3132769</a>
- [35] J. Bater, G. Elliott, C. Eggen, S. Goel, A. N. Kho, and J. Duggan, "SMCQL: secure query processing for private data networks," <u>CoRR</u>, vol. abs/1606.06808, 2016. [Online]. Available: http://arxiv.org/abs/1606.06808
- [36] J. Katz and Y. Lindell, Introduction to Modern Cryptography. Chapman and Hall/CRC Press, 2007.
- [37] H. Dang, T. T. A. Dinh, E. Chang, and B. C. Ooi, "Privacy-preserving computation with trusted computing via scramble-then-compute," PoPETs, vol. 2017, no. 3, p. 21, 2017. [Online]. Available: https://doi.org/10.1515/popets-2017-0026
- [38] 2015 IEEE Symposium on Security and Privacy, SP 2015, San Jose, CA, USA, May 17-21, 2015. IEEE Computer Society, 2015. [Online]. Available: http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=7160813
- [39] E. Kirda and T. Ristenpart, Eds., 26th USENIX Security Symposium, USENIX Security 2017, Vancouver, BC, Canada, August 16-18, 2017. USENIX Association, 2017. [Online]. Available: https://www.usenix.org/conference/usenixsecurity17