# Is Nucleus the answer?

Near the start of this project, I ran into a pretty showstopping problem: stripped binaries don't mark the boundaries of functions. Very quickly (and almost accidentally), I came across [Nucleus](https://ieeexplore.ieee.org/abstract/document/7961979), the "Compiler-Agnostic Function Detector" that promised to give us function boundaries with 95% accuracy (F1) on average. Early experiments (during Summer 2018) were promising, seeming to pick up on functions inside `coreutils` binaries with little difficulty.

More recent experiments might be exposing some cracks in that assumption, however. While writing the toy examples (see `../toy-examples/fingerprinting-demo.ipynb`), I attempted to use Nucleus to get the function boundaries of a simple Hello World C program. The source code looked like this:
```
#include <stdio.h>

void print_hello() {
   puts("Hello");
   return;
}

void print_world() {
   puts("World");
   return;
}

int main() {
   print_hello();
   print_world();
   return 0;
}
```
Which, in x86 assembly, looks like:
```
helloworld.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <print_hello>:
   0:	55                   	push   rbp
   1:	48 89 e5             	mov    rbp,rsp
   4:	48 8d 3d 00 00 00 00 	lea    rdi,[rip+0x0]        # b <print_hello+0xb>
   b:	e8 00 00 00 00       	call   10 <print_hello+0x10>
  10:	90                   	nop
  11:	5d                   	pop    rbp
  12:	c3                   	ret    

0000000000000013 <print_world>:
  13:	55                   	push   rbp
  14:	48 89 e5             	mov    rbp,rsp
  17:	48 8d 3d 00 00 00 00 	lea    rdi,[rip+0x0]        # 1e <print_world+0xb>
  1e:	e8 00 00 00 00       	call   23 <print_world+0x10>
  23:	90                   	nop
  24:	5d                   	pop    rbp
  25:	c3                   	ret    

0000000000000026 <main>:
  26:	55                   	push   rbp
  27:	48 89 e5             	mov    rbp,rsp
  2a:	b8 00 00 00 00       	mov    eax,0x0
  2f:	e8 00 00 00 00       	call   34 <main+0xe>
  34:	b8 00 00 00 00       	mov    eax,0x0
  39:	e8 00 00 00 00       	call   3e <main+0x18>
  3e:	b8 00 00 00 00       	mov    eax,0x0
  43:	5d                   	pop    rbp
  44:	c3                   	ret    
```
(Note that the above is the disassembly of the `.o` file, since it's cleaner)

I used the following commands to compile the executable:
```
gcc -Wall -ansi -fno-inline-functions -O0 -c helloworld.c -o helloworld.o
gcc -static helloworld.o -o helloworld
```
And it runs as expected:

In [3]:
!../res/helloworld

Hello
World


But here's what happens when I try to use Nucleus on it:

In [6]:
# Now the reader can pay attention
import nucleus
def get_functs(bin_path):
    # Load binary into a nucleus context
    nctx = nucleus.load(bin_path)
    return nctx.cfg.functions

full_functs = get_functs("../res/helloworld")
obj_functs = get_functs("../res/helloworld.o")

print("Found {} functions in ../res/helloworld".format(len(full_functs)))
print("Found {} functions in ../res/helloworld.o".format(len(obj_functs)))

Found 1321 functions in ../res/helloworld
Found 7 functions in ../res/helloworld.o


Where are these 1,300+ functions coming from? I'll start by looking at which functions Nucleus finds in just the `.o` file.

In [12]:
for count, fn in enumerate(obj_functs):
    print("{}: Spans 0x{:x} to 0x{:x} with {} basic blocks".format(count, fn.start, fn.end, len(fn.BBs)))
    for bb in fn.BBs:
        #print("  The basic block spanning 0x{:x} to 0x{:x} has {} x86 instructions".format(bb.start, bb.end, len(bb.insns)))
        for ct, ins in enumerate(bb.insns):
            print("    0x{:x}: {} {}".format(ins.start, ins.mnem, ins.op_str))
            

0: Spans 0x10 to 0x13 with 2 basic blocks
    0x10: nop 
    0x11: pop rbp
    0x12: ret 
1: Spans 0x23 to 0x26 with 2 basic blocks
    0x23: nop 
    0x24: pop rbp
    0x25: ret 
2: Spans 0x34 to 0x3e with 1 basic blocks
    0x34: mov eax, 0
    0x39: call 0x3e
3: Spans 0x3e to 0x45 with 1 basic blocks
    0x3e: mov eax, 0
    0x43: pop rbp
    0x44: ret 
4: Spans 0x0 to 0x10 with 1 basic blocks
    0x0: push rbp
    0x1: mov rbp, rsp
    0x4: lea rdi, qword ptr [rip]
    0xb: call 0x10
5: Spans 0x13 to 0x23 with 1 basic blocks
    0x13: push rbp
    0x14: mov rbp, rsp
    0x17: lea rdi, qword ptr [rip]
    0x1e: call 0x23
6: Spans 0x26 to 0x34 with 1 basic blocks
    0x26: push rbp
    0x27: mov rbp, rsp
    0x2a: mov eax, 0
    0x2f: call 0x34


Alright, looks like "functions" 4 and 5 from the output above are the first parts of the `print_hello()` and `print_world()` functions. It also seems like 0 and 1 are the epilogues of said functions. Why is Nucleus splitting these functions up? Or is this just because we're working with the .o file, which is not explicitly supported by Nucleus?
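
One likely explanation, judging from the disassembly above: in the unlinked `.o` file every `call` still has an all-zero displacement (`e8 00 00 00 00`) because relocations haven't been applied yet, so each call appears to target the very next instruction. A detector that treats direct call targets as function entry points would then start a new "function" right after every call. The sketch below decodes the call inside `print_hello` to show this. `call_target` is a helper written for this illustration, not part of Nucleus, and the claim about how Nucleus reacts is an assumption.

```python
import struct

def call_target(code, offset):
    """Return the target offset of the `call rel32` (opcode 0xE8) at `offset`."""
    if code[offset] != 0xE8:
        raise ValueError("no call rel32 at offset 0x{:x}".format(offset))
    # rel32 is a signed little-endian displacement, measured from the end
    # of the 5-byte call instruction.
    (rel32,) = struct.unpack_from("<i", code, offset + 1)
    return offset + 5 + rel32

# Bytes of print_hello, copied from the helloworld.o disassembly above.
PRINT_HELLO = bytes.fromhex("55 4889e5 488d3d00000000 e800000000 90 5d c3")

print("call at 0xb targets 0x{:x}".format(call_target(PRINT_HELLO, 0xB)))
# prints: call at 0xb targets 0x10
```

The decoded target, 0x10, is exactly where Nucleus ends "function" 4 and begins "function" 0. In the linked executable examined later in this notebook the same call has a real target (`call 0x410240`), which would be consistent with the split disappearing there.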

Let's try looking at the fully linked executable.

In [16]:
#### Helper functions. Reader can ignore this cell

from elftools.elf.elffile import ELFFile
from elftools.elf.sections import SymbolTableSection, Symbol
from reil.x86.translator import translate

def get_func_addr_range(bin_path, func_name):
    """
    Get a function's address range in a binary using its name
    
    :param bin_path: a path to the executable being searched
    :param func_name: the name of the function you're looking for
    :returns: a tuple containing the start and end addr of the function, or None if not found
    """
    with open(bin_path, 'rb') as f:
        # Load the binary's symbol table into PyELFtools 
        symtab = ELFFile(f).get_section_by_name(".symtab")
        sym = symtab.get_symbol_by_name(func_name)
        if sym is not None and len(sym) == 1:
            # Return the function's start and end addresses
            start_addr = sym[0]['st_value']
            size = sym[0]['st_size']
            return (start_addr, start_addr + size)
        else:
            print("Couldn't find symbol " + func_name)
            return None
        
def get_raw_bytes(path, start, stop):
    """
    Get the raw bytes between two MEMORY addresses in an ELF binary
    
    :param path: string path to the ELF file
    :param start: the starting address to extract
    :param stop: the address just past the last byte to extract
    :returns: a raw bytes object
    """
    with open(path, 'rb') as f:
        start_addr = list(ELFFile(f).address_offsets(start))[0]
        f.seek(start_addr)
        return f.read(stop - start)
    
def x86_64_to_reil(raw_bytes):
    """
    Wrapper function. Returns output from REIL translator in human-readable format
    """
    return list(il_ins for nat_ins in translate(raw_bytes, 0x0, x86_64=True) for il_ins in nat_ins.il_instructions)

In [19]:
print_hello_range = get_func_addr_range("../res/helloworld", "print_hello")
print_world_range = get_func_addr_range("../res/helloworld", "print_world")
print("print_hello() lives between 0x{:x} and 0x{:x}".format(print_hello_range[0], print_hello_range[1]))
print("print_world() lives between 0x{:x} and 0x{:x}".format(print_world_range[0], print_world_range[1]))

print_hello() lives between 0x400b4d and 0x400b60
print_world() lives between 0x400b60 and 0x400b73


Now let's find the functions listed within those address ranges.

In [27]:
def print_func_in_range(functs, addr_range):
    for count, fn in enumerate(functs):
        if (fn.start >= addr_range[0] and fn.end <= addr_range[1]):
            print("{}: Spans 0x{:x} to 0x{:x} with {} basic blocks".format(count, fn.start, fn.end, len(fn.BBs)))
            for bb in fn.BBs:
                #print("  The basic block spanning 0x{:x} to 0x{:x} has {} x86 instructions".format(bb.start, bb.end, len(bb.insns)))
                for ct, ins in enumerate(bb.insns):
                    print("    0x{:x}: {} {}".format(ins.start, ins.mnem, ins.op_str))
    
print("Functions within print_hello's address range")
print_func_in_range(full_functs, print_hello_range)

print("\nFunctions within print_world's address range")
print_func_in_range(full_functs, print_world_range)
            

Functions within print_hello's address range
6: Spans 0x400b4d to 0x400b60 with 3 basic blocks
    0x400b4d: push rbp
    0x400b4e: mov rbp, rsp
    0x400b51: lea rdi, qword ptr [rip + 0x915ec]
    0x400b58: call 0x410240
    0x400b5d: nop 
    0x400b5e: pop rbp
    0x400b5f: ret 

Functions within print_world's address range
7: Spans 0x400b60 to 0x400b73 with 3 basic blocks
    0x400b60: push rbp
    0x400b61: mov rbp, rsp
    0x400b64: lea rdi, qword ptr [rip + 0x915df]
    0x400b6b: call 0x410240
    0x400b70: nop 
    0x400b71: pop rbp
    0x400b72: ret 


Well... It worked! Those instructions match the original disassembled instructions of `print_hello()` and `print_world()` almost exactly. 
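
The comparison above was done by eye. As a rough sketch of how it could be automated, the snippet below classifies a symbol's address range as detected exactly, split into fragments, or missing. `Span` and `classify` are hypothetical names made up for this sketch. `Span` only mimics the `start` and `end` attributes of the function objects used above, and the addresses are copied from the outputs in this notebook.

```python
from collections import namedtuple

# Hypothetical stand-in for a detected function (only start/end are needed).
Span = namedtuple("Span", ["start", "end"])

def classify(detected, symbol_range):
    """Compare detected spans against one symbol's [start, end) range."""
    lo, hi = symbol_range
    inside = [s for s in detected if s.start >= lo and s.end <= hi]
    if len(inside) == 1 and (inside[0].start, inside[0].end) == (lo, hi):
        return "exact"
    if inside:
        return "split"
    return "missing"

# print_hello in helloworld.o: two fragments inside 0x0..0x13
obj_spans = [Span(0x0, 0x10), Span(0x10, 0x13)]
# print_hello and print_world in the static helloworld binary
static_spans = [Span(0x400B4D, 0x400B60), Span(0x400B60, 0x400B73)]

print(classify(obj_spans, (0x0, 0x13)))               # split
print(classify(static_spans, (0x400B4D, 0x400B60)))   # exact
```

Requiring an exact match on both ends is stricter than the containment test in `print_func_in_range`, which would also accept a fragment that covers only part of the symbol.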

**So it would seem that part of the problem is solved**: Nucleus can still detect function boundaries; you just have to give it a fully linked executable instead of an object file. (To be fair, none of the Nucleus literature ever claimed to work with object files, so this is on me.) This is a relief, since Nucleus needs to be able to solve the function boundary problem in order for ACE to work.

This still leaves the question of why Nucleus claims to see 1,300 functions in this tiny toy binary. My first guess is that a lot of glibc (and maybe libgcc) library functions are getting pulled into the static binary. I can test this by having Nucleus look into the dynamically-linked version of the binary.

In [24]:
!gcc -Wall -ansi -fno-inline-functions -O0 ../res/helloworld.c -o ../res/helloworld_dl
!chmod +x ../res/helloworld_dl
!../res/helloworld_dl
!file ../res/helloworld_dl

Hello
World
../res/helloworld_dl: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, BuildID[sha1]=97c1b2a35d00b53f438ffdb15a3c5957677896f6, not stripped


Now to count the functions...

In [25]:
dl_functs = get_functs("../res/helloworld_dl")

print("Found {} functions in ../res/helloworld_dl".format(len(dl_functs)))

Found 12 functions in ../res/helloworld_dl


That's a much more manageable number. Now to make sure our functions are still there:

In [30]:
dl_print_hello_range = get_func_addr_range("../res/helloworld_dl", "print_hello")
dl_print_world_range = get_func_addr_range("../res/helloworld_dl", "print_world")

print("print_hello() lives between 0x{:x} and 0x{:x}".format(dl_print_hello_range[0], dl_print_hello_range[1]))
print("print_world() lives between 0x{:x} and 0x{:x}".format(dl_print_world_range[0], dl_print_world_range[1]))

print("\nFunctions within print_hello's address range")
print_func_in_range(dl_functs, dl_print_hello_range)
print("\nFunctions within print_world's address range")
print_func_in_range(dl_functs, dl_print_world_range)

print_hello() lives between 0x63a and 0x64d
print_world() lives between 0x64d and 0x660

Functions within print_hello's address range
3: Spans 0x63a to 0x64d with 3 basic blocks
    0x63a: push rbp
    0x63b: mov rbp, rsp
    0x63e: lea rdi, qword ptr [rip + 0xbf]
    0x645: call 0x510
    0x64a: nop 
    0x64b: pop rbp
    0x64c: ret 

Functions within print_world's address range
4: Spans 0x64d to 0x660 with 3 basic blocks
    0x64d: push rbp
    0x64e: mov rbp, rsp
    0x651: lea rdi, qword ptr [rip + 0xb2]
    0x658: call 0x510
    0x65d: nop 
    0x65e: pop rbp
    0x65f: ret 


And now let's see the libraries that this executable relies on. I think it's safe to assume that these are also the libraries getting linked into the statically-linked binary.

In [34]:
print("objdump tells us what the ELF header says we need")
!objdump -p ../res/helloworld_dl | grep NEEDED
print("ldd tells us what the dynamic linker says we need")
!ldd ../res/helloworld_dl

objdump tells us what the ELF header says we need
  NEEDED               libc.so.6
ldd tells us what the dynamic linker says we need
	linux-vdso.so.1 (0x00007ffde2e8f000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fab70810000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fab70e03000)


`linux-vdso.so.1` and `ld-linux-x86-64.so.2` are kernel- and loader-related libraries, and according to [this Stack Overflow answer](https://stackoverflow.com/a/6182774), they wouldn't be linked into the statically-linked version of the binary. So the only library that would actually get linked into the static binary is libc (as the static archive `libc.a` rather than `libc.so.6`).

So is libc causing the function count to skyrocket? [This SO answer](https://stackoverflow.com/a/33188344) gives a short procedure for counting the number of functions inside libc:
```
# Normally I'd run this in a code cell, but for some reason the Jupyter kernel hung when I tried that
!nm -C /usr/lib/x86_64-linux-gnu/libc.a | grep -w T | wc -l
2483
# Now for the count excluding every symbol whose name contains an underscore
!nm -C /usr/lib/x86_64-linux-gnu/libc.a | grep -w T | grep -v _ | wc -l
491
```
So the numbers aren't exactly lining up, but essentially this shows there are a LOT of functions in libc and many of them are interdependent (usually "`__` variants" just call some other function of the same name without the `__`, but they still get an entry in the symbol table), **so it's reasonable to assume that most-if-not-all of those ~1,300 extra functions in `helloworld` are coming from libc**.
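
One reason the filtered counts in this notebook are hard to compare: the libc.a count above uses `grep -v _`, which drops every symbol containing any underscore, while the static-binary count below uses `grep -v __`, which only drops names containing a double underscore. A minimal sketch of the difference, where `grep_v` is a made-up helper that mimics `grep -v` with a plain substring match:

```python
def grep_v(names, pattern):
    """Mimic `grep -v pattern` on a list of symbol names (substring match)."""
    return [name for name in names if pattern not in name]

# `_IO_cleanup` and `main` appear in this notebook; `__libc_start_main` is a
# real glibc symbol added here purely for illustration.
names = ["_IO_cleanup", "__libc_start_main", "main"]

print(grep_v(names, "_"))   # ['main']
print(grep_v(names, "__"))  # ['_IO_cleanup', 'main']
```

So the 491 and 293 figures come from different filters and shouldn't be compared directly.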

*But wait, there's more*, since the aforementioned SO answer gives us a powerful tool in `nm`. Let's see how many function symbols are listed in our static binary.

In [23]:
print("Number of functions in the .text section:")
!nm -C ../res/helloworld | grep -w T | wc -l
print("Number above excluding __ variants:")
!nm -C ../res/helloworld | grep -w T | grep -v __ | wc -l
print("List of all functions in .text section excluding __ variants:")
!nm -C ../res/helloworld | grep -w T | grep -v __ 

Number of functions in the .text section:
718
Number above excluding __ variants:
293
List of all functions in .text section excluding __ variants:
0000000000417680 T _IO_adjust_column
0000000000471860 T _IO_adjust_wcolumn
00000000004156e0 T _IO_cleanup
0000000000416e00 T _IO_default_doallocate
0000000000417250 T _IO_default_finish
0000000000418300 T _IO_default_imbue
0000000000418130 T _IO_default_pbackfail
00000000004182d0 T _IO_default_read
00000000004182b0 T _IO_default_seek
0000000000417570 T _IO_default_seekoff
0000000000416d90 T _IO_default_seekpos
0000000000416c60 T _IO_default_setbuf
00000000004182f0 T _IO_default_showmanyc
00000000004182c0 T _IO_default_stat
0000000000417240 T _IO_default_sync
0000000000416800 T _IO_default_uflow
00000000004167f0 T _IO_default_underflow
00000000004182e0 T _IO_default_write
00000000004169e0 T _IO_default_xsgetn
0000000000416860 T _IO_default_xsputn
0000000000416740 T _IO_doallocbuf
0000000000417000 T _IO_enable_locks
000000000040fe50 T _IO_ffl

Now things are a bit more clear: our simple `helloworld` program ends up pulling in a LOT of glibc functions, probably because `puts()` sits on top of glibc's internal stdio layer (the `_IO_*` functions listed above), which in turn drags in its own dependencies.

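So according to the symbol table (the closest thing we have to ground truth), **718** global function symbols end up in our static binary, whereas Nucleus found 1321, about 1.8 times as many. So Nucleus appears to be overestimating (i.e. over-dividing functions), but only for library functions (at least, in this example). One caveat: `grep -w T` only matches global text symbols, so local (`t`) and weak (`W`) functions are missing from the 718 and the real gap is probably smaller. My guess is that glibc has some fancy optimizations that are confusing Nucleus.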
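
Since `nm` marks global text symbols with `T` and local ones with `t`, a per-type tally gives a fuller picture than a single `grep -w T`. The sketch below counts symbol types in `nm`-style text. `count_symbol_types` is a hypothetical helper, and it assumes plain C symbol names without spaces. In `SAMPLE`, the first two lines are copied from the listing above and the last two are invented to show a local and an undefined symbol.

```python
from collections import Counter

def count_symbol_types(nm_output):
    """Tally the one-letter symbol type on each line of `nm` output."""
    counts = Counter()
    for line in nm_output.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        # Undefined symbols have no address column, so the type comes first.
        sym_type = parts[0] if len(parts[0]) == 1 else parts[1]
        counts[sym_type] += 1
    return counts

SAMPLE = """\
00000000004156e0 T _IO_cleanup
0000000000416e00 T _IO_default_doallocate
0000000000400b00 t local_helper
                 U puts
"""

counts = count_symbol_types(SAMPLE)
print(counts["T"], counts["t"], counts["U"])  # 2 1 1
```

Feeding it the real output of `nm ../res/helloworld` would show how many local text symbols the 718 figure leaves out.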

### Some conclusions
 - Nucleus still seems to be adequate for our purposes of identifying function boundaries
   - Except for the precise function boundaries of highly-optimized libraries, like libc, where the optimizations can border on obfuscation
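 - Nucleus mis-split functions in the unlinked object file (`.o`), so it should be given fully linked ELF binaries. The dynamically-linked build worked even though `file` reports it as a "shared object", so `.so` files are untested rather than ruled out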
 - Simple programs pull in a huge number of functions simply by using any libc function
 
*One more interesting thing to try: write a simple program that does some basic arithmetic without any libc calls, and compile it without libc*