



# Custom Processing Unit: Tracing and Patching Intel Atom Microcode

Black Hat USA 2022

Pietro Borrello

Sapienza University of Rome

Michael Schwarz

CISPA Helmholtz Center for Information Security

Martin Schwarzl

Graz University of Technology

**Daniel Gruss** 

Graz University of Technology










- 1. Deep dive on CPU µcode
- 2. µcode Software Framework
- 3. Reverse Engineering of the secret μcode update algorithm
- 4. Some bonus content ;)





- This is based on our understanding of CPU Microarchitecture.
- In theory, it may be all wrong.
- In practice, a lot seems right.














- Red Unlock of Atom Goldmont (GLM) CPUs
- Extraction and reverse engineering of GLM μcode format
- Discovery of undocumented control instructions to access internal buffers

#### Microcoded Instructions 101



[Figure: `cpuid` and the XLAT block]
















| OP1          | OP2          | OP3          | SEQW     |
|--------------|--------------|--------------|----------|
| 09282eb80236 | 0008890f8009 | 092830f80236 | 0903e480 |

# Deep Dive into the μcode



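| Address | Encoding     | Decoded                                           |
|---------|--------------|---------------------------------------------------|
| U1a54   | 09282eb80236 | `CMPUJZ_DIRECT_NOTTAKEN(tmp6, 0x2, U0e2e)`        |
| U1a55   | 0008890f8009 | `tmp8:= ZEROEXT_DSZ32(0x2389)`                    |
| U1a56   | 092830f80236 | `SYNC-> CMPUJZ_DIRECT_NOTTAKEN(tmp6, 0x3, U0e30)` |
| U1a57   | 000000000000 | `NOP`                                             |
| SEQW    | 0903e480     | `SEQW GOTO U03e4`                                 |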

#### Building a Ghidra µcode Decompiler



```
U32f0: 002165071408    tmp1:= CONCAT_DSZ32(0x04040404)
U32f1: 004700031c75    tmp1:= NOTAND_DSZ64(tmp5, tmp1)
U32f2: 006501031231    tmp1:= SHR_DSZ64(tmp1, 0x00000001)
       01c4c980        SEQW GOTO U44c9
U32f4: 0251f25c0278    UJMPCC_DIRECT_NOTTAKEN_CONDNS(tmp8, U37f2)
U32f5: 006275171200    tmp1:= MOVEFROMCREG_DSZ64( , PMH_CR_EMRR_MASK)
U32f6: 186a11dc02b1    BTUJB_DIRECT_NOTTAKEN(tmp1, 0x0000000b, generate #GP) !m0,m1
       01e15080        SEQW GOTO U6150
U32f8: 000c85e80280    SAVEUIP( , 0x01, U5a85) !m0
U32f9: 000406031d48    tmp1:= AND_DSZ32(0x00000006, tmp5)
U32fa: 1928119c0231    CMPUJZ_DIRECT_NOTTAKEN(tmp1, 0x00000002, generate #GP) !m0,m1
       0187bd80        SEQW GOTO U07bd
U32fc: 00251a032235    tmp2:= SHR_DSZ32(tmp5, 0x0000001a)
U32fd: 0062c31b1200    tmp1:= MOVEFROMCREG_DSZ64( , 0x6c3)
U32fe: 000720031c48    tmp1:= NOTAND_DSZ32(0x00000020, tmp1)
       01c4d580        SEQW GOTO U44d5
```



```
void rc4_decrypt(ulong tmp0_i, ulong tmp1_j, byte *ucode_patch_tmp5, int len_tmp6, byte *S_tmp7,
                 long callback_tmp8)
{
  byte bVar1;
  byte bVar2;

  do {
    tmp0_i = (ulong)(byte)((char)tmp0_i + 1);
    bVar1 = S_tmp7[tmp0_i];
    tmp1_j = (ulong)(byte)(bVar1 + (char)tmp1_j);
                    /* swap S[i] and S[j] */
    bVar2 = S_tmp7[tmp1_j];
    S_tmp7[tmp0_i] = bVar2;
    S_tmp7[tmp1_j] = bVar1;
    *ucode_patch_tmp5 = S_tmp7[(byte)(bVar2 + bVar1)] ^ *ucode_patch_tmp5;
    ucode_patch_tmp5 = ucode_patch_tmp5 + 1;
    len_tmp6 += -1;
  } while (len_tmp6 != 0);
  (*(code *)(callback_tmp8 * 0x10))();
  return;
}
```
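The decompiled loop above is the keystream half of RC4 running over a state array that is passed in. As a minimal sketch for cross-checking it, the Python below re-expresses the same loop. The key-scheduling helper `rc4_ksa` is the textbook RC4 one and is an assumption here: the slides do not show how `S` is initialized or which key is used, so the demo key is only a public test vector.

```python
def rc4_ksa(key):
    """Textbook RC4 key scheduling (assumed; not shown in the slides)."""
    S = list(range(256))
    j = 0
    for i in range(256):
        j = (j + S[i] + key[i % len(key)]) & 0xFF
        S[i], S[j] = S[j], S[i]
    return S


def rc4_xor(S, data, i=0, j=0):
    """Mirror of the decompiled loop: XOR `data` with the RC4 keystream.

    Mutates S in place, as the decompiled routine does with its state array.
    """
    out = bytearray(data)
    for n in range(len(out)):
        i = (i + 1) & 0xFF
        b1 = S[i]
        j = (j + b1) & 0xFF
        b2 = S[j]
        S[i], S[j] = b2, b1            # swap S[i] and S[j]
        out[n] ^= S[(b1 + b2) & 0xFF]  # keystream byte XOR data byte
    return bytes(out)


# Public RC4 test vector, unrelated to any real microcode key.
ciphertext = rc4_xor(rc4_ksa(b"Key"), b"Plaintext")
print(ciphertext.hex())
```

Because RC4 is a stream cipher, running `rc4_xor` again on the ciphertext with a freshly scheduled state gives back the plaintext.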










- CPU interacts with its internal components through the CRBUS
- MSRs → CRBUS addr
- Control and Status registers
- SMM configuration
- Local Direct Access Test (LDAT) access








- The μcode Sequencer manages the access to μcode ROM and RAM
- → The LDAT has access to the μcode Sequencer
- → We can access the LDAT through the CRBUS
- → If we can access the CRBUS we can control μcode!










Mark Ermolov, Maxim Goryachy & Dmitry Sklyarov discovered the existence of two secret instructions, `udbgrd` and `udbgwr`, that can access (RW):

- System agent
- URAM
- Staging buffer
- I/O ports
- Power supply unit
- CRBUS

## e.g., Writing to the CRBUS



```
def CRBUS_WRITE(ADDR, VAL):
    udbgwr(
        rax: ADDR,
        rbx|rdx: VAL,
        rcx: 0,
    )
```



```
// Decompile of: U2782 - part of ucode update routine
write_8(crbus_06a0, (ucode_address - 0x7c00));
MSLOOPCTR = (*(ushort *)((long)ucode_update_ptr + 3) - 1);
syncmark();
if ((in_ucode_ustate & 8) != 0) {
  syncfull();
  write_8(crbus_06a1, 0x30400);
  ucode_ptr = (ulong *)((long)ucode_update_ptr + 5);
  do {
    ucode_qword = *ucode_ptr;
    ucode_ptr = ucode_ptr + 1;
    write_8(crbus_06a4, ucode_qword);
    write_8(crbus_06a5, ucode_qword >> 0x20);
    syncwait();
    MSLOOPCTR -= 1;
  } while (-1 < MSLOOPCTR);
  syncfull();
```

### Writing to the µcode Sequencer



```
def ucode_sequencer_write(SELECTOR, ADDR, VAL):
    CRBUS[0x6a1] = 0x30000 | (SELECTOR << 8)
    CRBUS[0x6a0] = ADDR
    CRBUS[0x6a4] = VAL & 0xffffffff
    CRBUS[0x6a5] = VAL >> 32
    CRBUS[0x6a1] = 0

# with SELECTOR:
#   2 -> SEQW PATCH RAM
#   3 -> MATCH & PATCH
#   4 -> UCODE PATCH RAM
```
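To make the write sequence easier to follow, here is a small runnable sketch that replays it against a fake bus. `FakeCrbus` is a hypothetical stand-in that only records writes; on real hardware each assignment would be a CRBUS write through `udbgwr`. The selector constants come from the list above, and the 48-bit value is the first encoding from the decoded example earlier in the talk.

```python
class FakeCrbus:
    """Hypothetical stand-in for the CRBUS: records (address, value) writes in order."""

    def __init__(self):
        self.log = []

    def __setitem__(self, addr, value):
        self.log.append((addr, value))


# Selector values as listed on the slide.
SEQW_PATCH_RAM, MATCH_AND_PATCH, UCODE_PATCH_RAM = 2, 3, 4


def ucode_sequencer_write(crbus, selector, addr, val):
    crbus[0x6A1] = 0x30000 | (selector << 8)  # select the target array
    crbus[0x6A0] = addr                       # address within that array
    crbus[0x6A4] = val & 0xFFFFFFFF           # low 32 bits of the value
    crbus[0x6A5] = val >> 32                  # remaining high bits
    crbus[0x6A1] = 0                          # clear the control register


bus = FakeCrbus()
ucode_sequencer_write(bus, UCODE_PATCH_RAM, 0, 0x09282EB80236)
for addr, value in bus.log:
    print(f"CRBUS[{addr:#x}] = {value:#x}")
```

With selector 4 the first control word is `0x30400`, the same constant that the U2782 decompile writes to `crbus_06a1`.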



Match & Patch redirects execution from μcode ROM to μcode RAM to execute patches.








Leveraging udbgrd/wr we can patch μcode via software

- Completely observe CPU behavior
- Completely control CPU behavior
- All within a BIOS or kernel module










Patch µcode



Hook μcode



Trace µcode










We can change the CPU's behavior.

- Change microcoded instructions
- Add functionalities to the CPU

# μcode patch Hello World!



```
.patch 0x0428 # RDRAND ENTRY POINT
.org 0x7c00
rax:= ZEROEXT_DSZ64(0x6f57206f6c6c6548) # 'Hello Wo'
rbx:= ZEROEXT_DSZ64(0x21646c72) # 'rld!\x00'
UEND
```


- 1. Assemble μcode
- 2. Write μcode at 0x7c00
- 3. Set up Match & Patch: 0x0428 → 0x7c00
- 4. rdrand → "Hello World!"
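As a quick sanity check of the two constants in the patch, the sketch below decodes them the way a caller would see them after `rdrand` returns, assuming the usual little-endian byte order of x86 registers. The helper name `reg_to_bytes` is made up for this illustration.

```python
RAX = 0x6F57206F6C6C6548  # constant loaded into rax by the patch
RBX = 0x21646C72          # constant loaded into rbx by the patch


def reg_to_bytes(value, width=8):
    """Return the little-endian byte image of a 64-bit register value."""
    return value.to_bytes(width, "little")


message = (reg_to_bytes(RAX) + reg_to_bytes(RBX)).rstrip(b"\x00").decode("ascii")
print(message)
```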




rdrand returns random data. What if we make it return SMM memory?

```
.patch 0x0428 # RDRAND ENTRY POINT
.org 0x7c00
tmp1:= MOVEFROMCREG_DSZ64(CR_SMRR_MASK) # SAVE SMRR MASK
tmp2:= ZEROEXT_DSZ64(0x0)
MOVETOCREG_DSZ64(tmp2, CR_SMRR_MASK) # DISABLE SMM MEMORY RANGE
rax:= LDPPHYS_DSZ64(0x7b000000) # SMROM ADDR
MOVETOCREG_DSZ64(tmp1, CR_SMRR_MASK) # RESTORE SMRR MASK
UEND
```











Install µcode hooks to observe events.

- Set up Match & Patch to execute custom μcode at certain events
- Resume execution


### Make your own performance counter



We can make the CPU react to certain μcode events, e.g., `verw` executed

```
.patch 0xXXXX # INSTRUCTION ENTRY POINT
.org 0x7da0
tmp0:= ZEROEXT_DSZ64(<counter_address>)
tmp1:= LDPPHYSTICKLE_DSZ64_ASZ64_SC1(tmp0)
tmp1:= ADD_DSZ64(tmp1, 0x1) # INCREMENT COUNTER
STADPPHYSTICKLE_DSZ64_ASZ64_SC1(tmp0, tmp1)
UJMP(0xXXXX + 1) # JUMP TO NEXT UOP
```
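The control flow of that hook can be illustrated with a toy model in Python. Nothing here talks to hardware: `rom`, `hooks`, `run`, and the addresses are all hypothetical, and the model only mirrors what the listing does, which is to bump a counter at the hooked entry point and resume at the following address.

```python
def run(entry, rom, hooks, state):
    """Walk toy uops from `entry`; a hooked address runs its hook instead."""
    trace = []
    addr = entry
    while addr in rom:
        hook = hooks.get(addr)
        if hook is not None:
            addr = hook(state, addr)  # the hook decides where to resume
            continue
        trace.append(rom[addr])
        addr += 1
    return trace


def counting_hook(state, addr):
    state["counter"] += 1  # like the load / add / store on <counter_address>
    return addr + 1        # like UJMP(0xXXXX + 1)


rom = {0x100: "uop_a", 0x101: "uop_b", 0x102: "uop_c"}  # made-up instruction flow
hooks = {0x100: counting_hook}                           # made-up entry point
state = {"counter": 0}

for _ in range(3):  # "execute" the hooked instruction three times
    run(0x100, rom, hooks, state)
print(state["counter"])
```

In this toy, as in the listing, the uop at the hooked address is replaced by the hook body and execution continues with the next one.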





- 1. dump timestamp
- 2. disable hook
- 3. continue























- Trigger a μcode update
- Trace if a microinstruction is executed
- Repeat for all the possible μcode instructions
- Restore order
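The four steps above, combined with the per-hook sequence of dumping a timestamp, disabling the hook, and continuing, can be sketched as a toy simulation. This assumes the traced routine behaves the same way on every run, so that timestamps from separate runs are comparable. The names `trace_one`, `trace_all`, and `toy_update`, and the address sequence, are invented for illustration.

```python
import itertools


def trace_one(candidate, execute_update):
    """Hook a single address, trigger the routine once, return its timestamp or None."""
    clock = itertools.count()
    hit = {}

    def on_uop(addr):
        tick = next(clock)
        if addr == candidate and candidate not in hit:
            hit[candidate] = tick  # dump timestamp; later hits are ignored (hook disabled)

    execute_update(on_uop)
    return hit.get(candidate)


def trace_all(candidates, execute_update):
    """Repeat for every candidate address, then restore order by timestamp."""
    stamps = {}
    for addr in candidates:
        tick = trace_one(addr, execute_update)
        if tick is not None:
            stamps[addr] = tick
    return sorted(stamps, key=stamps.get)


def toy_update(on_uop):
    """Stand-in for triggering a update: visits a fixed, made-up address sequence."""
    for addr in (0x2782, 0x2784, 0x2781, 0x2784, 0x2785):
        on_uop(addr)


ordered = trace_all(range(0x2780, 0x2788), toy_update)
print([hex(a) for a in ordered])
```

Each run only reveals whether and when one address executes, so the full trace needs one run per candidate address.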



`wrmsr`: move ucode patch to 0xfeb01000















































The temporary physical address where μcode is decrypted.




```
> sudo cat /proc/iomem | grep feb00000
:(
```













- Dynamically enabled by the CPU
- Access time: about 20 cycles
- Content not shared between cores
- Can fit 64-256Kb of valid data
- Replacement policy on the content?!
- It's a special CPU view on the L2 cache!

#### Parsing μcode updates



```
00000000: 0102 007c 3900 0a00 3f88 4bed c000 080c
00000010: 0b01 4780 0000 0a00 3f88 4fad 0003 0a00
00000020: 2f20 4b2d 8002 080c 0322 4740 a903 0a00
00000030: 2f20 4f6d 1902 0002 0353 6380 c000 3002
00000040: b8a6 6be8 0000 0002 0320 63c0 0003 f003
00000050: f8a6 6b28 c000 0800 03c0 0bed 0000 0b10
00000060: 7f00 0800 8001 3110 0300 a140 c000 310c
00000070: 0300 0700 0000 4012 0b30 6210 0003 4b1c
00000080: 7f00 0440 c000 3112 0310 2400 0000 310c
00000090: 0300 01c0 0003 0800 03c0 0fad 0002 00d2
```




A μcode update is bytecode: the CPU interprets commands from the μcode update










- Create a parser for μcode updates
- Automatically collect existing μcode updates for GLM
- Decrypt all GLM updates

github.com/pietroborrello/CustomProcessingUnit/ucode_collection

# **Bonus Content 1: Skylake perf traces**





# Bonus Content 2: An APIC failed exploit

















- Deepen understanding of modern CPUs with μcode access
- Develop a static and dynamic analysis framework for μcode:
  - μcode decompiler
  - μcode assembler
  - μcode patcher
  - μcode tracer
- Let's control our CPUs!

github.com/pietroborrello/CustomProcessingUnit