# Library Identification in Statically-Linked Binaries
Here, we'll explore different ways to identify what libraries are used in certain statically-linked binaries. First task is to identify library name, then version.

## Ideas
 * Run byteweight, get function data
 * Syscall analysis?
 
## Notes
 * Nucleus can give me function names and start addresses for unstripped binaries
   * For stripped binaries though, all we get are start addresses
   * If we can also obtain end address, we can analyze function content somehow


In [7]:
# Strings function borrowed from https://stackoverflow.com/questions/17195924/python-equivalent-of-unix-strings-utility
import string

def strings(filename, min=4):
    with open(filename, errors="ignore") as f:  # Python 3.x
        result = ""
        for c in f.read():
            if c in string.printable:
                result += c
                continue
            if len(result) >= min:
                yield result
            result = ""
        if len(result) >= min:  # catch result at EOF
            yield result
            
display(list(strings("/bin/echo")))

['/lib64/ld-linux-x86-64.so.2',
 '3:78Le5',
 'Pv,crbA9',
 'libc.so.6',
 'fflush',
 '__printf_chk',
 'setlocale',
 'mbrtowc',
 'strncmp',
 'strrchr',
 'dcgettext',
 'error',
 '__stack_chk_fail',
 'iswprint',
 'realloc',
 'abort',
 '_exit',
 'program_invocation_name',
 '__ctype_get_mb_cur_max',
 'calloc',
 'strlen',
 'ungetc',
 'memset',
 '__errno_location',
 'memcmp',
 '__fprintf_chk',
 'stdout',
 'lseek',
 'memcpy',
 'fclose',
 'malloc',
 'mbsinit',
 '__uflow',
 'nl_langinfo',
 '__ctype_b_loc',
 'getenv',
 '__freading',
 'stderr',
 'fscanf',
 'fileno',
 'fwrite',
 '__fpending',
 'program_invocation_short_name',
 'fdopen',
 'bindtextdomain',
 'strcmp',
 '__libc_start_main',
 'fseeko',
 '__overflow',
 'fputs_unlocked',
 'free',
 '__progname',
 '__progname_full',
 '__cxa_atexit',
 '__gmon_start__',
 'GLIBC_2.3',
 'GLIBC_2.3.4',
 'GLIBC_2.14',
 'GLIBC_2.4',
 'GLIBC_2.2.5',
 '%z_ ',
 '%r_ ',
 '%j_ ',
 'p%b_ ',
 '`%Z_ ',
 'P%R_ ',
 '@%J_ ',
 '0%B_ ',
 ' %:_ ',
 '%2_ ',
 '%*_ ',
 '%"_ ',
 '%\

## Nucleus
There's a neat little tool called [nucleus](https://bitbucket.org/vusec/nucleus) based on the paper *[Compiler-Agnostic Function Detection in Binaries](https://ieeexplore.ieee.org/document/7961979/)* that can find function entry points and their lengths pretty fast using static call graph analysis. We could use this to find functions, then disassemble using Capstone or BAP and analyze from there.

~~Nucleus is a command line tool, so we'll run the shell command `nucleus -d linear -f -e <binary_path>` and parse the output.~~ Well would you look at that, somebody wrote some [python bindings](https://bitbucket.org/AlexAltea/nucleus/src/c1529bac5968175059fdfeed706533eada104db2/bindings/python/?at=master)!

In [13]:
import nucleus
BIN_PATH = "/bin/echo"
# Load binary into a nucleus context
nctx = nucleus.load(BIN_PATH)
functs = nctx.cfg.functions

# Print the first few detected functions
print("Found {} functions in {}".format(len(functs), BIN_PATH))
count = 0
for fn in functs[:5]:
    print("{}: Spans 0x{:x} to 0x{:x} with {} basic blocks".format(count, fn.start, fn.end, len(fn.BBs)))
    count += 1
    

Found 65 functions in /bin/echo
0: Spans 0x401038 to 0x401052 with 3 basic blocks
1: Spans 0x401370 to 0x401376 with 1 basic blocks
2: Spans 0x401830 to 0x401862 with 4 basic blocks
3: Spans 0x401900 to 0x401978 with 9 basic blocks
4: Spans 0x401980 to 0x401d30 with 50 basic blocks


### Feature Engineering
Nucleus is actually pretty robust when it comes to introspection. A potentially useful feature is its ability to break functions into "basic blocks" (i.e. groups of contiguous instructions with no jumps/branches except at the end) and futher allows basic blocks to be broken into their instructions (i.e. Capstone disassembly is built-in). This leads to some interesting possibilities for feature generation:
 * Most basic: ignore BB's, just use function count and sizes as features
 * Use BB counts per function
 * Use instruction counts per BB
 * Most complex: Hash BB's (somehow) and use their combination (sums?)
   * It might make sense to try to generalize BB's, similar to how instructions were generalized in ByteWeight
   * Maybe treat each BB as a "sentence" that might look like "lea mov mov addi jmp"
     * Be weary of going all w2v on this, the overhead is going to explode exponentially if you're making a sentence out of each BB
   * But we could still add up hashes of generalized BB's, kind of like instruction counting

Nucleus also exposes a set of attributes at the BB (`.is_called()`, `.is_padding()`, `.score`, etc.) and instruction (`.flags`, `.edge_type`, `.addr_size`, etc.) levels that might be useful as features.

Keep in mind that Nucleus is based on a generated call graph, so there might be some fun stuff we can do with that. Maybe look into graph analysis

In [8]:
# Demo basic block disassembly
bb = functs[0].BBs[0]
#display(bb)
print("The basic block spanning 0x{:x} to 0x{:x} has {} instructions".format(bb.start, bb.end, len(bb.insns)))
for ct, ins in enumerate(bb.insns):
    print("0x{:x}: {} {}".format(ins.start, ins.mnem, ins.op_str))
print("Attributes: called={}, invalid={}, padding={}, trap={}".format(bb.is_called(), bb.is_invalid(), bb.is_padding(), bb.is_trap()))

The basic block spanning 0x401038 to 0x401048 has 4 instructions
0x401038: sub rsp, 8
0x40103c: mov rax, qword ptr [rip + 0x205fb5]
0x401043: test rax, rax
0x401046: je 0x40104d
Attributes: called=True, invalid=False, padding=False, trap=False


**Neat idea**: ~~Capstone (or maybe this is specific to Nucleus) assigns a set of flags to each instruction that essentially groups it based on whether or not it's a conditional, a jump, a control flow instruction, etc. That flag is stored as a single byte made by `OR`ing the individual flags together, which produces a single integer that could potentially be used as the "hash" for that instruction. A resulting BB hash could be a bin-count of occurances of each instruction hashcode~~

Okay, so this probably won't work with Nucleus flags because the they're all control flow related, meaning most basic blocks will only have a non-zero flag at the end (i.e. not a good hash because many collisions). But if I remember correctly, Capstone also has instruction groupings, ~~and I think those offer more coverage outside of control flow.~~

Upon further review, yes, Capstone has [instruction groups](https://github.com/aquynh/capstone/blob/7723175e80bcb95c73e30052cb8a10e0aceacfc4/bindings/python/capstone/x86_const.py#L1637), but the "generic" ones are basically the same as Nucleus', and I'm not certain that the "architecture-specific" ones are any more useful (they seem to be specific to fairly uncommon instructions). 

That being said, we could cook up our own groupings using the histogram data from `frequency_count/00_intro_freq_counts`. The idea would be to create ~10-20 (totally arbitrary # warning) groups based on the the most popular instructions (with the last group being a catchall for everything else). The first groups would definitely be MOVs, and perhaps we can use Capstone's [operand groupings](https://github.com/aquynh/capstone/blob/7723175e80bcb95c73e30052cb8a10e0aceacfc4/bindings/python/capstone/x86_const.py#L241) to create more specific groups (e.g. `MOV <REG> <IMM>`, `MOV <REG> <REG>`, etc.). Fair Warning: this will take a bit of work, and will be pretty architecture-specific.

In [12]:
ins = bb.insns[3]
ins.flags

11

### Optimization Experiment
__Hypothesis__: Within a single function, the number of basic blocks and their general effect (i.e. their output for a given input) does not change much across compiler optimization levels (and possibly versions).

__Basis__: compilers primarily perform optimizations within basic blocks.

To test this, we'll compile a simple application that uses a well-known library and link it statically at different optimization levels (O0, O1, O2, O3). We'll then analyze a particular function and compare their basic block makeup. Here's the C code we'll test:
```
#include <stdlib.h>
#include <math.h>

int main() {
	double num = (double)rand() / (double)(RAND_MAX/360);
	double ncos = cos(num);
	double nexp = exp(ncos);
	double nlog = log10(nexp);
	nlog = nlog + 1;
	return 0;
}
```
Compile with `gcc domath.c -Wall -ansi -lm -fno-inline-functions -OX -o domath.OX.out -static` where `X` is 0, 1, 2, or 3.

_One more thing_: Andriesse et al. mentioned in _An In-Depth Analysis of Disassembly on Full-Scale x86/x64 Binaries_ that linking-time optimizations are starting to gain popularity (p. 594). It may be useful to experiment with gcc's `-flto` option. (But this isn't a huge priority, as the authors did say it had little effect on their ability to recognize function boundaries)

In [62]:
!rm -v ../res/domath.O0.out ../res/domath.O1.out ../res/domath.O2.out ../res/domath.O3.out
!gcc ../res/domath.c -Wall -ansi -lm -fno-inline-functions -O0 -o ../res/domath.O0.out -static && echo compiled O0
!gcc ../res/domath.c -Wall -ansi -lm -fno-inline-functions -O1 -o ../res/domath.O1.out -static && echo compiled O1
!gcc ../res/domath.c -Wall -ansi -lm -fno-inline-functions -O2 -o ../res/domath.O2.out -static && echo compiled O2
!gcc ../res/domath.c -Wall -ansi -lm -fno-inline-functions -O3 -o ../res/domath.O3.out -static && echo compiled O3

removed '../res/domath.O0.out'
removed '../res/domath.O1.out'
removed '../res/domath.O2.out'
removed '../res/domath.O3.out'
compiled O0
compiled O1
compiled O2
compiled O3


__NOTE__: We use pyELFtools to extract symbols here because Nucleus doesn't seem to be returning all of the symbols of the binary

In [83]:
from elftools.elf.elffile import ELFFile
from elftools.elf.sections import SymbolTableSection, Symbol

# Binaries to look at
BIN_PATHS = ["../res/domath.O0.out", "../res/domath.O1.out", "../res/domath.O2.out", "../res/domath.O3.out"]

# Symbols to look for. Generally, the functions without underscores just call the underscored equivalent
SYMS = ["rand", "__random", "cos", "__cos32", "exp", "log10", "__log10_finite"]

# Basic process: for each binary, find the address range corresponding to each symbol in SYMS
# Disassemble that address range and let nucleus find the basic blocks
# Compare the basic bloc

def obtain_functs(BIN_PATHS, SYMS):
    addrs = {}
    sizes = {}
    nuc_functs = {}
    for bp in BIN_PATHS:
        addrs[bp] = {}
        sizes[bp] = {}

        # First, locate the addresses of the functions we want using PyELFtools
        with open(bp, 'rb') as f:
            # Load the binary's symbol table into PyELFtools 
            symtab = ELFFile(f).get_section_by_name(".symtab")
            # Loop over every symbol that we're looking for
            for sym_name in SYMS:
                sym = symtab.get_symbol_by_name(sym_name)
                if sym is not None and len(sym) == 1:
                    # Record the address of the binary
                    addr = sym[0]['st_value']
                    size = sym[0]['st_size']
                    #print("Found {} of size {} at 0x{:x}".format(sym_name, size, addr))
                    addrs[bp][str(addr)] = sym_name
                    sizes[bp][str(addr)] = size
                else:
                    print("Couldn't find symbol " + sym_name)

        # Then, we use nucleus to break those functions into basic blocks
        nuc = nucleus.load(bp)
        nuc_functs[bp] = {}
        # Loop over every discovered function and try to match it
        for func in nuc.cfg.functions:
            start_str = str(func.start)
            if start_str in addrs[bp]:
                # We found a matching function, so store it
                nuc_functs[bp][addrs[bp][start_str]] = func

            # Early breakout
            if len(nuc_functs[bp]) == len(addrs[bp]):
                break
    return nuc_functs
            
nuc_functs = obtain_functs(BIN_PATHS, SYMS)
nuc_functs

{'../res/domath.O0.out': {'__cos32': <nucleus.Function at 0x7f11df65bdc0>,
  '__random': <nucleus.Function at 0x7f11df64ad18>,
  'cos': <nucleus.Function at 0x7f11df65bc70>,
  'exp': <nucleus.Function at 0x7f11df630c70>,
  'log10': <nucleus.Function at 0x7f11df630ca8>,
  'rand': <nucleus.Function at 0x7f11df6a13e8>},
 '../res/domath.O1.out': {'__cos32': <nucleus.Function at 0x7f11df6482d0>,
  '__random': <nucleus.Function at 0x7f11df64d1f0>,
  'cos': <nucleus.Function at 0x7f11df648180>,
  'exp': <nucleus.Function at 0x7f11df6c6688>,
  'log10': <nucleus.Function at 0x7f11df6c63e8>,
  'rand': <nucleus.Function at 0x7f11df6a1848>},
 '../res/domath.O2.out': {'__cos32': <nucleus.Function at 0x7f11df6487a0>,
  '__random': <nucleus.Function at 0x7f11df64d650>,
  'cos': <nucleus.Function at 0x7f11df648650>,
  'exp': <nucleus.Function at 0x7f11df6b25a8>,
  'log10': <nucleus.Function at 0x7f11df6b20a0>,
  'rand': <nucleus.Function at 0x7f11df6a1c70>},
 '../res/domath.O3.out': {'__cos32': <nucle

Finally, for the main event, let's do some analysis on basic blocks of these extracted functions

In [84]:
qtys = {}
for sym in SYMS:
    try:
        qtys[sym] = {bp:len(d[sym].BBs) for (bp,d) in nuc_functs.items()}
    except KeyError:
        print("Warning: couldn't find symbol " + sym)
qtys



{'__cos32': {'../res/domath.O0.out': 19,
  '../res/domath.O1.out': 19,
  '../res/domath.O2.out': 19,
  '../res/domath.O3.out': 19},
 '__random': {'../res/domath.O0.out': 12,
  '../res/domath.O1.out': 12,
  '../res/domath.O2.out': 12,
  '../res/domath.O3.out': 12},
 'cos': {'../res/domath.O0.out': 2,
  '../res/domath.O1.out': 2,
  '../res/domath.O2.out': 2,
  '../res/domath.O3.out': 2},
 'exp': {'../res/domath.O0.out': 458,
  '../res/domath.O1.out': 458,
  '../res/domath.O2.out': 458,
  '../res/domath.O3.out': 458},
 'log10': {'../res/domath.O0.out': 1,
  '../res/domath.O1.out': 1,
  '../res/domath.O2.out': 1,
  '../res/domath.O3.out': 1},
 'rand': {'../res/domath.O0.out': 2,
  '../res/domath.O1.out': 2,
  '../res/domath.O2.out': 2,
  '../res/domath.O3.out': 2}}

Well, this is unexpected. It seems that, across optimization levels, there's the exact same number of basic blocks for all of theses functions. Let's try disassemblming a few

In [85]:
def print_insns(insns):
    for ins in insns:
        print("{:x}: {} {}".format(ins.start, ins.mnem, ins.op_str))

def print_opti(nuc_functs, prefix, funcname, BBnum):
    print("___O0___")
    print_insns(nuc_functs[prefix + '.O0.out'][funcname].BBs[BBnum].insns)
    print("___O1___")
    print_insns(nuc_functs[prefix + '.O3.out'][funcname].BBs[BBnum].insns)
    print("___O2___")
    print_insns(nuc_functs[prefix + '.O3.out'][funcname].BBs[BBnum].insns)
    print("___O3___")
    print_insns(nuc_functs[prefix + '.O3.out'][funcname].BBs[BBnum].insns)
    
    
print_opti(nuc_functs, "../res/domath", "exp", 100)

___O0___
4046f2: cmp edi, 0xc7
4046f8: mov edx, 0x4bebde
4046fd: mov eax, 0x4bebd7
404702: cmovle rax, rdx
404706: cmp dword ptr [rip + 0x2fb9a3], 2
40470d: mov qword ptr [rsp + 0x18], rax
404712: mov qword ptr [rsp + 0x30], 0
40471b: je 0x405d50
___O1___
404692: cmp edi, 0xc7
404698: mov edx, 0x4beb7e
40469d: mov eax, 0x4beb77
4046a2: cmovle rax, rdx
4046a6: cmp dword ptr [rip + 0x2fba03], 2
4046ad: mov qword ptr [rsp + 0x18], rax
4046b2: mov qword ptr [rsp + 0x30], 0
4046bb: je 0x405cf0
___O2___
404692: cmp edi, 0xc7
404698: mov edx, 0x4beb7e
40469d: mov eax, 0x4beb77
4046a2: cmovle rax, rdx
4046a6: cmp dword ptr [rip + 0x2fba03], 2
4046ad: mov qword ptr [rsp + 0x18], rax
4046b2: mov qword ptr [rsp + 0x30], 0
4046bb: je 0x405cf0
___O3___
404692: cmp edi, 0xc7
404698: mov edx, 0x4beb7e
40469d: mov eax, 0x4beb77
4046a2: cmovle rax, rdx
4046a6: cmp dword ptr [rip + 0x2fba03], 2
4046ad: mov qword ptr [rsp + 0x18], rax
4046b2: mov qword ptr [rsp + 0x30], 0
4046bb: je 0x405cf0


Okay, so these functions are basically identical. What's likely happening is that the standard C math library is so well-optimized that there's no further optimizations that the compiler can perform. So let's repeat this same analysis on a more complex binary using less-popular (and potentially less-optimized) libraries. We'll look at version 3.0.3 of [vsftpd](https://security.appspot.com/vsftpd.html#download). I modified the first few lines of the `Makefile` to compile statically, like below:
```
CC      = gcc
INSTALL = install
IFLAGS  = -idirafter dummyinc
CFLAGS  = -OX -static -static-libgcc --param=ssp-buffer-size=4 -D_FORTIFY_SOURCE=2 
LIBS    = `./vsf_findlibs.sh`
LINK    =	
LDFLAGS = -static -Wl,-z,relro -Wl,-z,now
```
Where X is the optimization level desired (0, 1, 2, or 3)

In [87]:
# Binaries to look at
BIN_PATHS = ["../res/vsftpd.O0.out", "../res/vsftpd.O1.out", "../res/vsftpd.O2.out", "../res/vsftpd.O3.out"]

# Symbols to look for
SYMS = [ "__gethostname", "init_connection", "hash_lookup_entry", "ssl_accept"]

ftp_nuc_functs = obtain_functs(BIN_PATHS, SYMS)

In [88]:
qtys = {}
for sym in SYMS:
    try:
        qtys[sym] = {bp:len(d[sym].BBs) for (bp,d) in ftp_nuc_functs.items()}
    except KeyError:
        print("Warning: couldn't find symbol " + sym)
qtys



{'__gethostname': {'../res/vsftpd.O0.out': 7,
  '../res/vsftpd.O1.out': 7,
  '../res/vsftpd.O2.out': 7,
  '../res/vsftpd.O3.out': 7},
 'hash_lookup_entry': {'../res/vsftpd.O0.out': 4,
  '../res/vsftpd.O1.out': 4,
  '../res/vsftpd.O2.out': 4,
  '../res/vsftpd.O3.out': 10},
 'init_connection': {'../res/vsftpd.O0.out': 12,
  '../res/vsftpd.O1.out': 139,
  '../res/vsftpd.O2.out': 140,
  '../res/vsftpd.O3.out': 159},
 'ssl_accept': {'../res/vsftpd.O0.out': 1,
  '../res/vsftpd.O1.out': 1,
  '../res/vsftpd.O2.out': 1,
  '../res/vsftpd.O3.out': 1},
 'ssl_control_handshake': {'../res/vsftpd.O0.out': 3,
  '../res/vsftpd.O1.out': 1,
  '../res/vsftpd.O2.out': 1,
  '../res/vsftpd.O3.out': 1}}

Okay, this is definitely more interesting. Let's disassemble some stuff

In [98]:
print("hash_lookup_entry, BB #2")
print_opti(nuc_functs, "../res/vsftpd", "hash_lookup_entry", 2)

hash_lookup_entry, BB #2
___O0___
40eeda: mov rax, qword ptr [rbp - 8]
40eede: mov rax, qword ptr [rax + 8]
40eee2: leave 
40eee3: ret 
___O1___
4105c0: mov rax, qword ptr [rbp + 0x18]
4105c4: mov rbx, qword ptr [rax + rbx*8]
4105c8: test rbx, rbx
4105cb: je 0x4105eb
___O2___
4105c0: mov rax, qword ptr [rbp + 0x18]
4105c4: mov rbx, qword ptr [rax + rbx*8]
4105c8: test rbx, rbx
4105cb: je 0x4105eb
___O3___
4105c0: mov rax, qword ptr [rbp + 0x18]
4105c4: mov rbx, qword ptr [rax + rbx*8]
4105c8: test rbx, rbx
4105cb: je 0x4105eb


Okay, some variation visible between optimized and unoptimized code. Let's take a look at a function that had a lot more variation between optimization levels

In [103]:
print("init_connection, BB #4")
print_opti(nuc_functs, "../res/vsftpd", "init_connection", 4)

init_connection, BB #4
___O0___
40148f: mov eax, dword ptr [rip + 0x3241e7]
401495: test eax, eax
401497: je 0x4014af
___O1___
401389: mov eax, dword ptr [rip + 0x32747d]
40138f: mov qword ptr [rsp + 0x10], 0
401398: mov dword ptr [rsp + 0x18], 0
4013a0: mov dword ptr [rsp + 0x1c], 0
4013a8: test eax, eax
4013aa: je 0x4013ec
___O2___
401389: mov eax, dword ptr [rip + 0x32747d]
40138f: mov qword ptr [rsp + 0x10], 0
401398: mov dword ptr [rsp + 0x18], 0
4013a0: mov dword ptr [rsp + 0x1c], 0
4013a8: test eax, eax
4013aa: je 0x4013ec
___O3___
401389: mov eax, dword ptr [rip + 0x32747d]
40138f: mov qword ptr [rsp + 0x10], 0
401398: mov dword ptr [rsp + 0x18], 0
4013a0: mov dword ptr [rsp + 0x1c], 0
4013a8: test eax, eax
4013aa: je 0x4013ec
