# Annotating Compiled Binaries

So I'm trying to build up a dataset of binaries, but my experiments require some form of *ground truth* to anchor a comparison between two functions, especially if we're comparing them at the assembly level and fingerprint level. For example: using fingerprints, can we claim that two fingerprint-identical functions are also code-identical but not byte-identical?

We can't use the assembly code (since it's just disassembled from the bytes, so two functions that are assembly-identical are also byte-identical) as a ground truth. We *could* potentially use assembly code *generated by the compiler*, since that's more similar to the original C(++) code, but that complicates things.

Basically what I want at the end is a dataframe that looks something like:

| Function Name | ACE Fingerprint | Bytes | Source Code |
| --- | --- | --- | --- |
| printf() | [234, 3, 0, ... -4] | 0xDEAD, 0xBEEF... | function printf(char* param1...) { int i; ... } |

We already know how to extract and link all of the fields, except for Source Code. There doesn't seem to be any common way of embedding source code in a binary either. The first thing I thought of was GDB's `list <func_name>` command, but it turns out that GDB essentially just reads from the original source code files for that (although the binary does contain the name of the source files that the function comes from).

So my initial thought was: why don't we just use gdb to get the source code? We'd have to compile everything ourselves, but I expected that anyways. The only problem would be with library code: GDB doesn't seem to have a way of digging into the source of library functions.

One potentially-fatal flaw: GDB knows where functions begin, but it doesn't seem to be able to tell us where they *end*. Running `list <func_name>,` (note the comma) shows a certain number of lines of code, starting from the function signature. Running `set listsize unlimited` just shows all lines of each file...

There's a more promising but also more-involved-looking option: parse the C file yourself. This would require finding the symbol in the binary that indicates the name of the source file a particular function comes from, locating and loading that source file, and then parsing that file using a library like [pycparser](https://github.com/eliben/pycparser). Seems like someone had a similar need in this [StackOverflow answer, with accompanying code](https://stackoverflow.com/a/55082235). There might also be a less-hackish method using [python-ctags3](https://github.com/jonashaag/python-ctags3).

Possibly helpful: Fedora apparently has this tool called [annobin](https://developers.redhat.com/blog/2018/02/20/annobin-storing-information-binaries/) that embeds extra information into ELF notes. 

## Attempt 1: Using `ctags`
We'll attempt to use `ctags` to extract source code for `sed`'s `usage()` function. Adapted from https://stackoverflow.com/a/55082235

In [None]:
!ctags -x --c-kinds=f ../res/sed-src/sed/sed.c | grep usage

In [None]:
import subprocess
import glob
from shlex import quote


def get_line_number(filename, funcname):
    """
    Returns the line number on which funcname is defined in filename
    (according to ctags), or 0 if ctags doesn't report the function.
    """
    # ctags -x prints one tag per line: name, kind, line number, file, source text
    cmd = "ctags -x --c-kinds=f " + quote(filename)
    output = subprocess.getoutput(cmd)

    for line in output.splitlines():
        fields = line.split()
        # Require an exact name match; a plain grep would also match substrings,
        # and parsing the whole output at once picks the wrong tag when
        # several functions match
        if len(fields) >= 3 and fields[0] == funcname and fields[2].isdigit():
            return int(fields[2])

    # Function not found
    return 0


def extract_function_at_line(filename, line_num):
    """
    Returns the source text of the function starting at line_num (1-indexed)
    by counting braces until the opening brace is balanced again. Braces
    inside strings or comments are counted too, so this is only a heuristic.
    Returns None if no balanced function body is found.
    """
    code = ""
    brace_depth = 0
    found_start = False

    with open(filename, "r") as f:
        for i, line in enumerate(f):
            if i < line_num - 1:
                continue
            code += line

            if "{" in line:
                found_start = True
            brace_depth += line.count("{") - line.count("}")

            if found_start and brace_depth == 0:
                return code

    return None

def get_function_source_dir(function_name, source_code_dir):
    """
    Searches an entire source directory for the named function,
    stopping as soon as one is encountered and returning its
    source code.
    """
    for filename in glob.iglob(source_code_dir + "/**/*.c", recursive=True):
        line_num = get_line_number(filename, function_name)

        if line_num > 0:
            return extract_function_at_line(filename, line_num)

    # Didn't find anything in the .c files, now we'll check the .h file
    for filename in glob.iglob(source_code_dir + "/**/*.h", recursive=True):
        line_num = get_line_number(filename, function_name)

        if line_num > 0:
            return extract_function_at_line(filename, line_num)
        
    # If we get here, then we never found the function
    print("Function {} not found within {}".format(function_name, source_code_dir))


print(get_function_source_dir("usage", "../res/sed-src"))

In [None]:
# Alternatively, if we know the file that contains the function...
def get_function_source(function_name, source_file):
    line_num = get_line_number(source_file, function_name)
    if line_num == 0:
        # ctags didn't find the function in this file
        return None
    return extract_function_at_line(source_file, line_num)


print(get_function_source("free_buffer", "../res/sed-src/sed/utils.c"))

Neat, but now we have to figure out how to extract the mappings between function names and source files from the debug symbols.

This [blog post](https://alex.dzyoba.com/blog/gdb-source-path/) has some info on where GDB gets that information. In short, it all comes from the `.debug_info` section encoded within DWARF debug info entries (DIEs). Source files are typed as `DW_TAG_compile_unit`, and they contain two helpful fields: `DW_AT_name` (source file name) and `DW_AT_comp_dir` (compilation directory). Quote from the article explaining the rest:
> So this is what happens when GDB tries to show you the source code:
> * parses the `.debug_info` to find `DW_AT_comp_dir` with `DW_AT_name` attributes for the current object file (range of addresses)
> * opens the file at `DW_AT_comp_dir/DW_AT_name`
> * shows the content of the file to you

See the example below:

In [None]:
!readelf --debug-dump=info ../res/sed | grep -C 10 -m 1 DW_TAG_compile_unit
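
The path resolution described in the quote can be sketched in a few lines. This is only an illustration of the join step, not GDB's actual code: `resolve_source_path` is a made-up helper, and the attribute values passed to it are hypothetical. It uses `posixpath` because the paths recorded in an ELF binary's debug info are POSIX-style.

```python
import posixpath


def resolve_source_path(comp_dir, name):
    """Combine DW_AT_comp_dir and DW_AT_name into a single path.

    DW_AT_name may already be absolute, in which case join() discards
    comp_dir and keeps name unchanged.
    """
    return posixpath.normpath(posixpath.join(comp_dir, name))


# Hypothetical attribute values, for illustration only
print(resolve_source_path("/build/sed-4.7", "sed/sed.c"))
print(resolve_source_path("/build/sed-4.7/sed", "../lib/obstack.c"))
print(resolve_source_path("/build/sed-4.7", "/usr/include/stdio.h"))
```

Normalizing the result matters because relative names containing `..` would otherwise produce several different spellings of the same file.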

Luckily, our friend `pyelftools` knows just how to parse DIEs, and they even include [an example tool](https://github.com/eliben/pyelftools/blob/v0.26/examples/dwarf_decode_address.py) that takes an arbitrary function address and returns the source code filename and line number!

The tool is a little slow, since it loops over the entire DIE tree to find the function that contains the given address. Perhaps we can modify the tool to instead return a lookup table for all functions in the binary, a la:

| Function Name | Address Range | Source File | Line # |
| --- | --- | --- | --- |
| printf() | [0x4012e0, 0x290a] | print.c | 203 |
| vfprintf() | [0x4012e0, 0x290a] | print.c | 234 |
| usage() | [0x4012e0, 0x290a] | sed.c | 132 |

This could serve as the "ground truth" table from which all of our data originates.
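
Once such a table exists, finding the function that contains an address no longer needs a walk over every DIE. Here's a minimal sketch of that lookup using a sorted list and binary search. The start addresses, file names and line numbers are borrowed from the notebook's own output further down, but the sizes are invented placeholders, and `lookup` is a hypothetical helper.

```python
from bisect import bisect_right

# Rows: (start_address, size, function_name, source_file, line).
# Sizes are made-up placeholders for illustration.
FUNCTIONS = sorted([
    (0x4085C0, 0x20, "cancel_cleanup", "sed.c", 116),
    (0x408890, 0x30, "ck_fopen", "utils.c", 139),
    (0x408CA0, 0x30, "ck_fclose", "utils.c", 258),
])
STARTS = [row[0] for row in FUNCTIONS]


def lookup(address):
    """Return the row whose [start, start + size) range contains address, else None."""
    # Index of the last function that starts at or before the address
    i = bisect_right(STARTS, address) - 1
    if i < 0:
        return None
    start, size = FUNCTIONS[i][0], FUNCTIONS[i][1]
    return FUNCTIONS[i] if address < start + size else None


print(lookup(0x4085D0))
print(lookup(0x100))
```

Building the table costs one pass over the debug info, after which each lookup is logarithmic in the number of functions.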

~~Seems like the general formula for doing this would be~~
* ~~Loop over every `DW_TAG_subprogram` to get all function names and address ranges. Store in table~~
* ~~Loop over every `DW_TAG_compile_unit` to get every source file name and address range~~
* ~~Join (kinda) the two tables along the address ranges, i.e.~~
  * ~~for each function as f in func_table~~
    * ~~find entry in file_table where f.addr_range is within entry.addr_range~~
    * ~~store entry.`DW_AT_name` in f.source_file_name~~

Soooo I was able to do this in one line of BASH, with parallelism... (note that `dwarf_decode_address.py` should be in the correct path)

*Note that cell below is "frozen" since it takes a while to run*

In [5]:
!nm -C ../res/sed | grep -iw t | cut -d' ' -f1 | xargs -P10 -I{} -d'\n' sh -c 'printf "%s\n" "$(python3 dwarf_decode_address.py 0x{} ../res/sed)"'

Failed on 440be0
Failed on 441460
Failed on 441f10
Failed on 442090
Failed on 441350
Failed on 441030
Failed on 4420b0
Failed on 4420e0
Failed on 439590
Failed on 43f4c0
Failed on 440b70
Failed on 4420d0
Failed on 441020
Failed on 4420a0
Failed on 4405e0
Failed on 440a40
Failed on 4420c0
Failed on 4405d0
Failed on 4407c0
Failed on 440640
Failed on 440520
Failed on 4364c0
Failed on 43c480
Failed on 43e0b0
Failed on 440de0
Failed on 49e240
Failed on 43c820
Failed on 43c850
Failed on 43d850
Failed on 43c400
Failed on 43c860
Failed on 43d730
Failed on 43d0c0
Failed on 43dba0
Failed on 43c4c0
Failed on 43d4d0
Failed on 43c8c0
Failed on 43d180
Failed on 43dda0
Failed on 4414a0
Failed on 441770
Failed on 441a20
Failed on 49e390
Failed on 440000
Failed on 439420
Failed on 4368b0
Failed on 49d1b0
Failed on 43b710
Failed on 436a50
Failed on 436c50
Failed on 49e700
Failed on 48b8f0
Failed on 49e5a0
Failed on 496dc0
Failed on 440d10
Failed on 441cb0
Failed on 4395f0
Failed on 440c40
Failed on 4421

Failed on 472b70
Failed on 472bb0
Failed on 463330
Failed on 472800
Failed on 4587b0
Failed on 472810
Failed on 45dd20
Failed on 44e430
Failed on 4727c0
Failed on 45dd30
Failed on 472860
Failed on 44f000
Failed on 4587c0
Failed on 456620
Failed on 456250
Failed on 4736e0
Failed on 4736c0
Failed on 472550
Failed on 473850
Failed on 473870
Failed on 473560
Failed on 473810
Failed on 472540
Failed on 473690
Failed on 473570
Failed on 473530
Failed on 473580
Failed on 4a0550
Failed on 4792f0
Failed on 4ad560
Failed on 4b2eb0
Failed on 4b2e00
Failed on 4af2e0
Failed on 4af430
Failed on 4ad5a0
Failed on 4af370
Failed on 4ae490
Failed on 4ae2d0
Failed on 4aea00
Failed on 4add80
Failed on 4aebb0
Failed on 4ae1b0
Failed on 4adb60
Failed on 4af0a0
Failed on 4adc70
Failed on 4af140
Failed on 4af1f0
Failed on 479400
Failed on 435f30
Failed on 4793d0
Failed on 4345d0
Failed on 436790
Failed on 47b1a0
Failed on 4361c0
Failed on 421d90
Failed on 44e570
Failed on 43b940
Failed on 4374d0
Failed on 421d

Failed on 417bb0
414c80: calc_eclosure_iter @ regcomp.c:1710
4085c0: cancel_cleanup @ sed.c:116
Failed on 4a3590
Failed on 4147d0
Failed on 412790
40fe60: case_folded_counterparts @ localeinfo.c:97
Failed on 4127b0
403910: check_final_program @ compile.c:1577
40c400: charclass_index @ dfa.c:858
Failed on 4b5100
414f30: check_arrival_expand_ecl @ regexec.c:3100
414600: check_arrival_expand_ecl_sub @ regexec.c:3153
418f90: check_arrival @ regexec.c:2855
412df0: check_dst_limits_calc_pos @ regexec.c:1986
412bc0: check_dst_limits_calc_pos_1 @ regexec.c:1903
Failed on 414830
408ca0: ck_fclose @ utils.c:258
416560: check_node_accept @ regexec.c:1210
416450: check_node_accept @ regexec.c:4010
408900: ck_fdopen @ utils.c:158
408ae0: ck_fread @ utils.c:216
408890: ck_fopen @ utils.c:139
Failed on 418d30
408a50: ck_fwrite @ utils.c:204
408fe0: ck_rename @ utils.c:381
408970: ck_mkstemp @ utils.c:177
408b80: ck_getdelim @ utils.c:226
420f00: chmod_or_fchmod @ set-permissions.c:761
Failed on 408bf

411a70: quotearg_char_mem @ quotearg.c:983
4101a0: quotearg_buffer_restyled @ quotearg.c:262
411b40: quotearg_colon_mem @ quotearg.c:1005
411c90: quotearg_custom @ quotearg.c:1021
411b20: quotearg_colon @ quotearg.c:993
Failed on 411cd0
Failed on 411b10
411cb0: quotearg_custom_mem @ quotearg.c:1045
411910: quotearg_mem @ quotearg.c:939
411820: quotearg_free @ quotearg.c:852
4113e0: quotearg_n_options @ quotearg.c:879
Failed on 4118c0
411b50: quotearg_n_style_colon @ quotearg.c:1010
411be0: quotearg_n_custom_mem @ quotearg.c:1029
4119c0: quotearg_n_style_mem @ quotearg.c:964
Failed on 411c80
411930: quotearg_n_style @ quotearg.c:956
Failed on 4118e0
411a50: quotearg_style @ quotearg.c:971
411a60: quotearg_style_mem @ quotearg.c:978
4158d0: re_acquire_state_context @ regex_internal.c:1533
415b90: re_acquire_state @ regex_internal.c:1485
Failed on 418800
Failed on 415090
Failed on 433610
413d20: re_dfa_add_node @ regex_internal.c:1410
41f120: re_compile_internal @ regcomp.c:740
Failed on 
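
Eyeballing that much output is error-prone, so here's a small sketch for tallying it. It assumes the two line shapes seen above (`Failed on <addr>` and `<addr>: <func> @ <file>:<line>`); `tally` is a hypothetical helper, and anything that fits neither shape (e.g. a truncated line) is counted separately rather than guessed at.

```python
import re
from collections import Counter

# Matches successful lines such as "4085c0: cancel_cleanup @ sed.c:116"
OK_LINE = re.compile(
    r"^(?P<addr>[0-9a-f]+): (?P<func>\S+) @ (?P<file>[^:]+):(?P<line>\d+)$"
)
# Matches failures such as "Failed on 417bb0"
FAIL_LINE = re.compile(r"^Failed on (?P<addr>[0-9a-f]+)$")


def tally(output):
    """Count resolved, failed and unparseable lines; also return resolved rows."""
    counts = Counter()
    resolved = []
    for line in output.splitlines():
        line = line.strip()
        m = OK_LINE.match(line)
        if m:
            counts["ok"] += 1
            resolved.append(
                (int(m["addr"], 16), m["func"], m["file"], int(m["line"]))
            )
        elif FAIL_LINE.match(line):
            counts["failed"] += 1
        elif line:
            counts["unparsed"] += 1
    return counts, resolved


sample = """\
Failed on 417bb0
414c80: calc_eclosure_iter @ regcomp.c:1710
4085c0: cancel_cleanup @ sed.c:116
Failed on 4a3590
Failed on
"""
counts, resolved = tally(sample)
print(counts)
print(resolved)
```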

### Investigating the failures

One failure: `x2nrealloc @ xalloc.h:174`
Defined (in the header file) as:
```c
XALLOC_INLINE void *
x2nrealloc (void *p, size_t *pn, size_t s) { ... }
```
Notice the inline function... Compare to `x2realloc @ xalloc.h:178` (which **was** properly ID'd). This is *prototyped* (but **not defined**) in the header file as: 
```c
void *x2realloc (void *p, size_t *pn);
```
Through manual inspection, it would appear that this function is coming from `sed-4.7/lib/xmalloc.c:73`, where it's defined as:
```c
void *
x2realloc (void *p, size_t *pn)
{
  return x2nrealloc (p, pn, 1);
}
```

Are these two functions detectable with the StackOverflow method?

In [None]:
print(get_function_source_dir("x2nrealloc", "../res/sed-src"))
print(get_function_source_dir("x2realloc", "../res/sed-src"))

So it found them, ~~but notably, it found different definitions than what I did for `x2realloc()`: it found a template version of the function, where I had found the actual definition. This could potentially be solved by preferring code extracted from `.c` files over `.h` since (based on my experience) `.c` will usually contain the actual implementation.~~

Applied mentioned fix, and it worked!

Another failure: `44e630: two_way_long_needle @ glibc/string/str-two-way.h:387`. Notably, this file doesn't seem to exist anywhere on the system that would be reachable when compiling sed...

In [None]:
!locate str-two-way.h

This would appear to make sense, since this is a glibc function, and glibc's src *doesn't* exist in the system. Instead, the system just has the compiled static library `libc.a`: (ALSO WHAT IS THIS `--print-file-name` trickery!?)

In [None]:
!gcc --print-file-name=libc.a

So right now, here's the breakdown for attributing functions:
(DWARF=xargs, ctags=StackOverflow)

* **normal functions** can be found using the *DWARF* method
  * high confidence because compiler is telling us exactly what code it used
* **inlined functions** or those otherwise defined in `.h` files can be found using the *ctags* method 
  * moderate confidence because we're making educated guesses as to what the compiler used, since we're analyzing source files ourselves
* **external library functions** (e.g. glibc) will have to be found using the *ctags-ext* method, pointing at various source trees
  * low confidence, since name conflicts could lead to misattribution
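
The tiers above amount to a fallback chain: try the most trustworthy method first and record which one answered. A rough sketch of that idea follows. `attribute` and the three lookups are hypothetical stand-ins (plain dictionaries here); real versions would query the debug info, the project's source tree, and external source trees respectively.

```python
def attribute(function_name, attributors):
    """Try each (label, confidence, lookup) in order and return the first hit.

    Each lookup is a callable that takes a function name and returns a
    source location string, or None when it has no answer.
    """
    for label, confidence, lookup in attributors:
        location = lookup(function_name)
        if location is not None:
            return {"function": function_name, "source_loc": location,
                    "attribution": label, "confidence": confidence}
    return {"function": function_name, "source_loc": None,
            "attribution": None, "confidence": None}


# Stand-in lookups, for illustration only
dwarf = {"usage": "sed/sed.c:135"}.get
ctags = {"free_buffer": "sed/utils.c:493"}.get
ctags_ext = {"two_way_long_needle": "glibc/string/str-two-way.h:387"}.get

ATTRIBUTORS = [
    ("dwarf", "high", dwarf),
    ("ctags", "moderate", ctags),
    ("ctags-ext", "low", ctags_ext),
]

for name in ("usage", "free_buffer", "two_way_long_needle", "mystery"):
    print(attribute(name, ATTRIBUTORS))
```

Keeping the label and confidence next to each result means low-confidence rows can be filtered out later without re-running anything.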

It's likely that there are other cases that we don't currently handle (e.g., C++?), but we won't know about them until we get a more holistic view. Let's write a pipeline that takes in a binary and spits out something like the table mentioned above, a la:

| binary_name | function_name       | address_range      | source_loc                     | code                                                                          | attribution |
|-------------|---------------------|--------------------|--------------------------------|-------------------------------------------------------------------------------|-------------|
| sed         | usage               | (0x408300, 0x290a) | ../res/sed-src/sed/sed.c:135   | void  usage (int status) {   FILE *out = ...  }                               | dwarf       |
| sed         | free_buffer         | (0x408300, 0x290a) | ../res/sed-src/sed/utils.c:493 | void free_buffer (struct buffer *b) {   if (b)     free (b->b);   free (b); } | ctags       |
| sed         | two_way_long_needle | (0x44e630, 0x290a) | glibc/string/str-two-way.h:387 | (we'll have to figure out how to get this code)                               | ctags-ext   |

Before we start, we need to establish which tool can give us a true "ground truth" list of functions. Check out this diff (pyelftools on the left, nm on the right) showing all the functions in `../res/nano`: https://www.diffchecker.com/wEcpBD0Z

As shown, nm returns *50 fewer & 2 more* functions than pyelftools. Eyeballing the differences, most of the functions are system calls (or at least, standard lib. functions that do system calls). The two additional functions that nm picks up on are `__restore` and `__restore_rt`, which are both defined as `NOTYPE` symbols (as opposed to `FUNC` symbols).

Upon closer inspection of the 50 symbols nm didn't report, they *are* included in nm's output, but they're marked with `W`s instead of `T`s or `t`s. From the manpage: 
> "W" or "w": The symbol is a weak symbol that has not been specifically tagged as a weak object symbol.  When a weak defined symbol is linked with a normal defined symbol, the normal defined symbol is used with no error. When a weak undefined symbol is linked and the symbol is not defined, the value of the symbol is determined in a system-specific manner without error.  On some systems, uppercase indicates that a default value has been specified.

According to [Wikipedia](https://en.wikipedia.org/wiki/Weak_symbol), weak symbols are symbols that can be overridden by a "strong" (i.e., regular) symbol. An example would be an implementation of a library function that is slow but very portable. A library author might define this symbol as weak so that developers using that library can provide their own "strong" implementation that's faster but system-specific. 

Upon looking at nano's weak symbols, it would appear that none of them have source code attached to them, and that they are all indeed syscalls. I believe it's safe to ignore them for our purposes.

...Also while looking at nm's manpage, I came across this:
> `-l` `--line-numbers`: For each symbol, use debugging information to try to find a filename and line number.  For a defined symbol, look for the line number of the address of the symbol.  For an undefined symbol, look for the line number of a relocation entry which refers to the symbol. If line number information can be found, print it after the other symbol information.

as well as the `-P` flag, which tells nm to use a more informative output format:
```
do_statusbar_mouse T 080544ad 000000c2  /var/tmp/src/nano-2.2.4/src/prompt.c:246
```
in the format of `[name] [type] [value] [size] [src file path]:[line]`

...in other words, nm alone gives us most, if not all, of the information we need. Here are some useful commands:

* Get all functions: `nm -ClP [binary_path] | grep -iw t` (returns 1137 lines for nano)
* Get only functions that have source code attached: `nm -ClP nano | grep -iw t | grep /` (returns 667 lines for nano)

Looking at the functions that *don't* have source code attached (add `-v` to the last grep of the previous command), it would appear that they're all standard library functions.

So with all this in mind, I think it's fair to set up our pipeline to only use functions with source-code locations from nm, at least initially. Standard library functions can be taken care of later using the ctags-ext method, if we decide that's valuable.
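
To make the filtering rule concrete, here's a sketch that parses a single `nm -lP` line with only the standard library and decides whether it's a text-section function with a source location. The regex is my own guess at the line shape based on the example above (no `-A` prefix), and `parse_nm_posix_line` and `is_attributed_function` are hypothetical names. Demangled C++ names containing spaces would not match.

```python
import re

# name, type letter, hex value, optional hex size, optional "/path:line"
NM_LINE = re.compile(
    r"^(?P<name>\S+) (?P<type>[A-Za-z]) (?P<value>[0-9a-fA-F]+)"
    r"(?: (?P<size>[0-9a-fA-F]+))?"
    r"(?:\s+(?P<path>/.*):(?P<line>\d+))?\s*$"
)


def parse_nm_posix_line(line):
    """Parse one `nm -lP` line into a dict, or return None if it doesn't match."""
    m = NM_LINE.match(line)
    if m is None:
        return None
    return {
        "name": m["name"],
        "type": m["type"],
        "address": int(m["value"], 16),
        "size": int(m["size"], 16) if m["size"] else None,
        "src_path": m["path"],
        "src_line": int(m["line"]) if m["line"] else None,
    }


def is_attributed_function(entry):
    """Keep text-section symbols (T or t) that carry a source location."""
    return (entry is not None
            and entry["type"] in ("T", "t")
            and entry["src_path"] is not None)


with_src = parse_nm_posix_line(
    "do_statusbar_mouse T 080544ad 000000c2\t/var/tmp/src/nano-2.2.4/src/prompt.c:246"
)
no_src = parse_nm_posix_line("__init_array_end t 00000000006eacf8 ")
print(with_src, is_attributed_function(with_src))
print(no_src, is_attributed_function(no_src))
```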

In [1]:
from parse import compile
# Toy function for testing NM parsing. Don't actually use this since
# you'll be recompiling the two parsers every time
def parse_nm_line(line):
    nm_parser = compile("{binary}: {function} {nm_type:l} {address:x} {length:x} {src_path:>}:{src_line:d}")
    nm_parser_nosrc = compile("{binary}: {function} {nm_type:l} {address:x} {length:x}")
    nm_parser_gcc = compile("{binary}: {function} {nm_type:l} {address:x}")
    
    parsed = nm_parser.parse(line)
    
    if parsed is None:
        parsed = nm_parser_nosrc.parse(line)
        
    if parsed is None:
        parsed = nm_parser_gcc.parse(line)
    
    return parsed

display(parse_nm_line("./dc: num2str T 0804e603 000000f9       /var/tmp/src/bc-1.06/lib/number.c:1658"))
display(parse_nm_line("./dc: num2str T 0804e603 000000f9"))
display(parse_nm_line('../res/sed: _obstack_allocated_p T 0000000000412990 0000000000000038 /home/ubuntu/ace/dataset-gen/sed-4.7/lib/obstack.c:241'))
display(parse_nm_line("../res/sed: __init_array_end t 00000000006eacf8"))

<Result () {'binary': './dc', 'function': 'num2str', 'nm_type': 'T', 'address': 134538755, 'length': 249, 'src_path': '/var/tmp/src/bc-1.06/lib/number.c', 'src_line': 1658}>

<Result () {'binary': './dc', 'function': 'num2str', 'nm_type': 'T', 'address': 134538755, 'length': 249}>

<Result () {'binary': '../res/sed', 'function': '_obstack_allocated_p', 'nm_type': 'T', 'address': 4270480, 'length': 56, 'src_path': '/home/ubuntu/ace/dataset-gen/sed-4.7/lib/obstack.c', 'src_line': 241}>

<Result () {'binary': '../res/sed', 'function': '__init_array_end', 'nm_type': 't', 'address': 7253240}>

In [25]:
from shlex import quote
from subprocess import run, PIPE, CalledProcessError
from parse import compile
import pandas as pd


def nm_attribute_binary(binary_path, nm_src_only):
    """
    Uses nm to identify and attribute as many functions as possible

    :param binary_path: path to the binary whose functions are to be attributed
    :param nm_src_only: set to True to only return functions that nm can attribute source to
    """
    col_names = ['binary', 'function', 'address', 'length', 'src_path', 'src_line', 'src_code', 'attributor']

    # First, we need a ground truth list of functions. We'll use nm for this
    # C = demangle names, l = show line numbers, A = show binary path, P = POSIX format
    cmd = "nm -ClAP {} | grep -iw t".format(quote(binary_path))
    
    # If enabled, only return symbols for which nm found source code
    # (the pattern is quoted so the shell doesn't try to expand it)
    if nm_src_only:
        cmd = cmd + " | grep '[[:space:]]/'"
    
    # Execute nm. Note that grep exits non-zero when nothing matches,
    # which also ends up in the except branch
    try:
        nm_output = run(cmd, shell=True, check=True, universal_newlines=True, stdout=PIPE).stdout
    except CalledProcessError:
        print("Failed to run nm for " + binary_path)
        return pd.DataFrame(columns=col_names)
    
    # Parse the output with pre-compiled parsers
    nm_parser = compile("{binary}: {function} {nm_type:l} {address:x} {length:x}\t{src_path}:{src_line:d}")
    nm_parser_nosrc = compile("{binary}: {function} {nm_type:l} {address:x} {length:x}")
    function_dicts = []
    for line in nm_output.splitlines():
        parsed = nm_parser.parse(line)
        if parsed is not None:
            # Hit a line with source information
            function_dict = parsed.named
            function_dict['attributor'] = "nm-" + function_dict['nm_type']
            function_dicts.append(function_dict)
        else:
            # Hit a line without source information
            parsed = nm_parser_nosrc.parse(line)
            if parsed is not None:
                function_dicts.append(parsed.named)
            else:
                print("WARN: couldn't parse nm line: " + line)
    
    return pd.DataFrame(function_dicts, columns=col_names)


sed_functions = nm_attribute_binary("../res/sed", False)
sed_functions

WARN: couldn't parse nm line: ../res/sed: __do_global_dtors_aux t 0000000000401270 
WARN: couldn't parse nm line: ../res/sed: __do_global_dtors_aux_fini_array_entry t 00000000006eacf8 
WARN: couldn't parse nm line: ../res/sed: __fini_array_end t 00000000006ead20 
WARN: couldn't parse nm line: ../res/sed: __fini_array_start t 00000000006eacf8 
WARN: couldn't parse nm line: ../res/sed: __frame_dummy_init_array_entry t 00000000006eace0 
WARN: couldn't parse nm line: ../res/sed: __init_array_end t 00000000006eacf8 
WARN: couldn't parse nm line: ../res/sed: __init_array_start t 00000000006eace0 
WARN: couldn't parse nm line: ../res/sed: __restore_rt t 0000000000489620 
WARN: couldn't parse nm line: ../res/sed: _fini T 00000000004bd270 
WARN: couldn't parse nm line: ../res/sed: _init T 0000000000400418 
WARN: couldn't parse nm line: ../res/sed: deregister_tm_clones t 0000000000401200 
WARN: couldn't parse nm line: ../res/sed: frame_dummy t 00000000004012b0 
WARN: couldn't parse nm line: ../r

Unnamed: 0,binary,function,address,length,src_path,src_line,src_code,attributor
0,../res/sed,_IO_adjust_column,4461664,64,,,,
1,../res/sed,_IO_adjust_wcolumn,4429200,84,,,,
2,../res/sed,_IO_cleanup,4453568,1221,,,,
3,../res/sed,_IO_default_doallocate,4459488,89,,,,
4,../res/sed,_IO_default_finish,4460592,795,,,,
...,...,...,...,...,...,...,...,...
1429,../res/sed,xnrealloc,4269056,36,/home/ubuntu/ace/dataset-gen/sed-4.7/lib/xalloc.h,112.0,,nm-T
1430,../res/sed,xpalloc,4232176,202,/home/ubuntu/ace/dataset-gen/sed-4.7/lib/dfa.c,794.0,,nm-t
1431,../res/sed,xrealloc,4268992,54,/home/ubuntu/ace/dataset-gen/sed-4.7/lib/xmall...,51.0,,nm-T
1432,../res/sed,xstrdup,4269456,19,/home/ubuntu/ace/dataset-gen/sed-4.7/lib/xmall...,119.0,,nm-T


In case you're wondering about the warnings in the output above, those are functions that GCC adds. See https://stackoverflow.com/questions/34966097/what-functions-does-gcc-add-to-the-linux-elf for more details

Now let's try sucking in all the source code

In [28]:
import os
src_path = "/home/ubuntu/ace/dataset-gen/sed-4.7/"
filt = sed_functions.src_path.notnull()
prefix = os.path.abspath(src_path) + os.sep
sed_functions.loc[filt, 'src_path'] = [os.path.normpath(os.path.join(prefix, x)) for x in sed_functions.loc[filt, 'src_path']]
sed_functions

Unnamed: 0,binary,function,address,length,src_path,src_line,src_code,attributor
0,../res/sed,_IO_adjust_column,4461664,64,,,,
1,../res/sed,_IO_adjust_wcolumn,4429200,84,,,,
2,../res/sed,_IO_cleanup,4453568,1221,,,,
3,../res/sed,_IO_default_doallocate,4459488,89,,,,
4,../res/sed,_IO_default_finish,4460592,795,,,,
...,...,...,...,...,...,...,...,...
1429,../res/sed,xnrealloc,4269056,36,/home/ubuntu/ace/dataset-gen/sed-4.7/lib/xalloc.h,112.0,,nm-T
1430,../res/sed,xpalloc,4232176,202,/home/ubuntu/ace/dataset-gen/sed-4.7/lib/dfa.c,794.0,,nm-t
1431,../res/sed,xrealloc,4268992,54,/home/ubuntu/ace/dataset-gen/sed-4.7/lib/xmall...,51.0,,nm-T
1432,../res/sed,xstrdup,4269456,19,/home/ubuntu/ace/dataset-gen/sed-4.7/lib/xmall...,119.0,,nm-T


In [10]:
sed_functions['src_code'] = [extract_function_at_line(s, int(l)) if l >= 0 else None for s, l in zip(sed_functions['src_path'], sed_functions['src_line'])]
sed_functions

Unnamed: 0,binary,function,address,length,src_path,src_line,src_code,attributor
0,../res/sed,_IO_adjust_column,4461664,64,,,,
1,../res/sed,_IO_adjust_wcolumn,4429200,84,,,,
2,../res/sed,_IO_cleanup,4453568,1221,,,,
3,../res/sed,_IO_default_doallocate,4459488,89,,,,
4,../res/sed,_IO_default_finish,4460592,795,,,,
...,...,...,...,...,...,...,...,...
1429,../res/sed,xnrealloc,4269056,36,/home/ubuntu/ace/dataset-gen/sed-4.7/lib/xalloc.h,112.0,"xnrealloc (void *p, size_t n, size_t s)\n{\n ...",nm-T
1430,../res/sed,xpalloc,4232176,202,/home/ubuntu/ace/dataset-gen/sed-4.7/lib/dfa.c,794.0,"xpalloc (void *pa, ptrdiff_t *nitems, ptrdiff_...",nm-t
1431,../res/sed,xrealloc,4268992,54,/home/ubuntu/ace/dataset-gen/sed-4.7/lib/xmall...,51.0,"xrealloc (void *p, size_t n)\n{\n if (!n && p...",nm-T
1432,../res/sed,xstrdup,4269456,19,/home/ubuntu/ace/dataset-gen/sed-4.7/lib/xmall...,119.0,xstrdup (char const *string)\n{\n return xmem...,nm-T


In [11]:
sed_functions.count()

binary        1434
function      1434
address       1434
length        1434
src_path       267
src_line       267
src_code       267
attributor     267
dtype: int64

So out of 1434 detectable functions, nm (working from the binary's debug info) was able to attribute and load source code for 267 (~18.62%) of them.

Let's do another sanity check by looking for any instances of duplicated source code. 

In [12]:
# Only look at functions with code attributions
with_code = sed_functions.dropna()
dupe_src = with_code[with_code.duplicated(subset="src_code", keep=False)]
dupe_src

Unnamed: 0,binary,function,address,length,src_path,src_line,src_code,attributor
915,../res/sed,build_collating_symbol.isra.28,4291888,76,/home/ubuntu/ace/dataset-gen/sed-4.7/lib/regco...,3499.0,"build_equiv_class (bitset_t sbcset, re_charset...",nm-t
916,../res/sed,build_equiv_class.isra.27,4291888,76,/home/ubuntu/ace/dataset-gen/sed-4.7/lib/regco...,3499.0,"build_equiv_class (bitset_t sbcset, re_charset...",nm-t
1167,../res/sed,line_exchange.constprop.10,4216656,123,/home/ubuntu/ace/dataset-gen/sed-4.7/sed/execu...,357.0,"line_exchange (struct line *a, struct line *b,...",nm-t
1168,../res/sed,line_exchange.part.0.constprop.15,4217744,130,/home/ubuntu/ace/dataset-gen/sed-4.7/sed/execu...,357.0,"line_exchange (struct line *a, struct line *b,...",nm-t


Interesting... So these functions appear to be artifacts of compiler optimizations, specifically [constant propagation](https://stackoverflow.com/questions/14796686/what-does-the-gcc-function-suffix-constprop-mean) and [ISRA](https://stackoverflow.com/questions/13963150/what-does-the-gcc-function-suffix-isra-mean). Notice that `line_exchange` has two different addresses and lengths, despite pointing to the same source. Also notice that `build_collating_symbol.isra.28` and `build_equiv_class.isra.27` share the same address (4291888) and length (76), so they are one piece of code under two different names. My guess is that GCC decided `build_collating_symbol` is functionally equivalent to `build_equiv_class` when called with a certain mix of parameters, but I haven't been able to confirm this by looking at `lib/regcomp.c`.

**So are these functions useful, or should they be discarded?** I think they'll be interesting to look at since they're great examples of funky compiler optimizations, but they also could potentially muck up training data. I won't discard them for right now; let's see what their fingerprints look like.
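
If we ever want to group these clones with the function they came from, the suffixes can be stripped with a regex. This is a sketch that only covers the suffixes I'm aware of (`.isra`, `.constprop`, `.part`, `.cold`, each optionally followed by a number); GCC may emit others, and `base_function_name` is a hypothetical helper.

```python
import re

# One or more clone suffixes at the end of the symbol, e.g.
# ".isra.28", ".constprop.10", ".part.0.constprop.15" or ".cold"
CLONE_SUFFIX = re.compile(r"(?:\.(?:isra|constprop|part|cold)(?:\.\d+)?)+$")


def base_function_name(symbol):
    """Strip GCC clone suffixes so clones can be grouped with their origin."""
    return CLONE_SUFFIX.sub("", symbol)


for symbol in ("line_exchange.constprop.10",
               "line_exchange.part.0.constprop.15",
               "build_equiv_class.isra.27",
               "usage"):
    print(symbol, "->", base_function_name(symbol))
```

Keeping the original symbol in one column and the stripped name in another would let us either keep or drop the clones later without losing information.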


Now that we're fairly confident in the data, let's pull in the bytes and generate some fingerprints!

## Getting bytes & fingerprints

In [13]:
from elftools.elf.elffile import ELFFile
def get_raw_bytes_f(bin_f, start, stop):
    """
    Get the raw bytes between two MEMORY addresses in an ELF binary
    
    :param bin_f: File object for the binary
    :param start: the starting address to extract
    :param stop: the address just past the last byte to extract
    :returns: a raw bytes object
    """
    # Translate the virtual address into an offset within the file
    file_offset = list(ELFFile(bin_f).address_offsets(start))[0]
    bin_f.seek(file_offset)
    return bin_f.read(stop - start)

with open("../res/sed", 'rb') as f:
    with_bytes = with_code.copy()
    with_bytes['raw_bytes'] = [get_raw_bytes_f(f, a, a+l) if l >= 0 else None for a, l in zip(with_code['address'], with_code['length'])]
    
with_bytes

Unnamed: 0,binary,function,address,length,src_path,src_line,src_code,attributor,raw_bytes
857,../res/sed,_obstack_allocated_p,4270480,56,/home/ubuntu/ace/dataset-gen/sed-4.7/lib/obsta...,241.0,"_obstack_allocated_p (struct obstack *h, void ...",nm-T,b'H\x8bG\x08H\x85\xc0t-\x0f\x1f\x80\x00\x00\x0...
858,../res/sed,_obstack_begin,4270176,17,/home/ubuntu/ace/dataset-gen/sed-4.7/lib/obsta...,150.0,"_obstack_begin (struct obstack *h,\n ...",nm-T,b'\x80gP\xfeH\x89O8L\x89G@\xe9_\xff\xff\xff'
859,../res/sed,_obstack_begin_1,4270208,21,/home/ubuntu/ace/dataset-gen/sed-4.7/lib/obsta...,162.0,"_obstack_begin_1 (struct obstack *h,\n ...",nm-T,b'\x80OP\x01H\x89O8L\x89G@L\x89OH\xe9;\xff\xff...
861,../res/sed,_obstack_free,4270544,106,/home/ubuntu/ace/dataset-gen/sed-4.7/lib/obsta...,262.0,"_obstack_free (struct obstack *h, void *obj)\n...",nm-T,b'ATUH\x89\xf5SH\x8bw\x08H\x89\xfbH\x85\xf6t*\...
862,../res/sed,_obstack_memory_used,4270656,42,/home/ubuntu/ace/dataset-gen/sed-4.7/lib/obsta...,292.0,_obstack_memory_used (struct obstack *h)\n{\n ...,nm-T,b'H\x8bW\x081\xc0H\x85\xd2t\x1d\x0f\x1fD\x00\x...
...,...,...,...,...,...,...,...,...,...
1429,../res/sed,xnrealloc,4269056,36,/home/ubuntu/ace/dataset-gen/sed-4.7/lib/xalloc.h,112.0,"xnrealloc (void *p, size_t n, size_t s)\n{\n ...",nm-T,b'H\x89\xf0H\xf7\xe2H\x89\xc6\x0f\x90\xc0H\x85...
1430,../res/sed,xpalloc,4232176,202,/home/ubuntu/ace/dataset-gen/sed-4.7/lib/dfa.c,794.0,"xpalloc (void *pa, ptrdiff_t *nitems, ptrdiff_...",nm-t,b'USH\x89\xf5I\x89\xd2H\x83\xec\x08L\x8b\x0eL\...
1431,../res/sed,xrealloc,4268992,54,/home/ubuntu/ace/dataset-gen/sed-4.7/lib/xmall...,51.0,"xrealloc (void *p, size_t n)\n{\n if (!n && p...",nm-T,b'H\x85\xf6SH\x89\xf3u\x05H\x85\xffu\x1aH\x89\...
1432,../res/sed,xstrdup,4269456,19,/home/ubuntu/ace/dataset-gen/sed-4.7/lib/xmall...,119.0,xstrdup (char const *string)\n{\n return xmem...,nm-T,b'SH\x89\xfb\xe8G\xdf\xfe\xffH\x89\xdfH\x8dp\x...


In [17]:
from reil.x86.translator import translate
from acevm import REILApproximateVM, REILRegContext, REILMemContext


def x86_to_reil(raw_bytes):
    """
    Wrapper function. Translates raw x86 bytes with the REIL translator and
    returns a flat list of the resulting IL instructions.
    """
    return [
        il_ins
        for nat_ins in translate(raw_bytes, 0x0, x86_64=False)
        for il_ins in nat_ins.il_instructions
    ]


def x86_fingerprint_raw_bytes(raw_bytes, sort=True):
    """
    Executes the (optionally sorted) REIL instructions on a fresh aVM and
    returns the VM's t_regs as the fingerprint.
    """
    avm = REILApproximateVM(REILRegContext.zeros, REILMemContext.address, True)
    reil_code = x86_to_reil(raw_bytes)
    if sort:
        reil_code = sorted(reil_code)
    for ins in reil_code:
        avm.execute(ins)
    return avm.t_regs


full_fat = with_bytes.copy()
full_fat['fingerprint'] = [x86_fingerprint_raw_bytes(
    raw) for raw in full_fat.raw_bytes]
full_fat


Unnamed: 0,binary,function,address,length,src_path,src_line,src_code,attributor,raw_bytes,fingerprint
857,../res/sed,_obstack_allocated_p,4270480,56,/home/ubuntu/ace/dataset-gen/sed-4.7/lib/obsta...,241.0,"_obstack_allocated_p (struct obstack *h, void ...",nm-T,b'H\x8bG\x08H\x85\xc0t-\x0f\x1f\x80\x00\x00\x0...,"[1, 2147483649, 0, 0, 38505, 0, 0, 0, 0, 0, 0,..."
858,../res/sed,_obstack_begin,4270176,17,/home/ubuntu/ace/dataset-gen/sed-4.7/lib/obsta...,150.0,"_obstack_begin (struct obstack *h,\n ...",nm-T,b'\x80gP\xfeH\x89O8L\x89G@\xe9_\xff\xff\xff',"[63, 2147483649, 64, 0, 0, 64, 4, 2406, 64, 80..."
859,../res/sed,_obstack_begin_1,4270208,21,/home/ubuntu/ace/dataset-gen/sed-4.7/lib/obsta...,162.0,"_obstack_begin_1 (struct obstack *h,\n ...",nm-T,b'\x80OP\x01H\x89O8L\x89G@L\x89OH\xe9;\xff\xff...,"[71, 2147483649, 72, 73, 0, 72, 4, 2406, 72, 8..."
861,../res/sed,_obstack_free,4270544,106,/home/ubuntu/ace/dataset-gen/sed-4.7/lib/obsta...,262.0,"_obstack_free (struct obstack *h, void *obj)\n...",nm-T,b'ATUH\x89\xf5SH\x8bw\x08H\x89\xfbH\x85\xf6t*\...,"[1, 2147483649, 0, 0, 38505, 0, 0, 0, 0, 0, 0,..."
862,../res/sed,_obstack_memory_used,4270656,42,/home/ubuntu/ace/dataset-gen/sed-4.7/lib/obsta...,292.0,_obstack_memory_used (struct obstack *h)\n{\n ...,nm-T,b'H\x8bW\x081\xc0H\x85\xd2t\x1d\x0f\x1fD\x00\x...,"[1, 2147483649, 0, 0, 38505, 0, 0, 0, 0, 0, 0,..."
...,...,...,...,...,...,...,...,...,...,...
1429,../res/sed,xnrealloc,4269056,36,/home/ubuntu/ace/dataset-gen/sed-4.7/lib/xalloc.h,112.0,"xnrealloc (void *p, size_t n, size_t s)\n{\n ...",nm-T,b'H\x89\xf0H\xf7\xe2H\x89\xc6\x0f\x90\xc0H\x85...,"[0, 2147483649, 0, 0, 38505, 0, 0, 0, 0, 0, 0,..."
1430,../res/sed,xpalloc,4232176,202,/home/ubuntu/ace/dataset-gen/sed-4.7/lib/dfa.c,794.0,"xpalloc (void *pa, ptrdiff_t *nitems, ptrdiff_...",nm-t,b'USH\x89\xf5I\x89\xd2H\x83\xec\x08L\x8b\x0eL\...,"[8, 2147483649, 9223372041149743104, 922337204..."
1431,../res/sed,xrealloc,4268992,54,/home/ubuntu/ace/dataset-gen/sed-4.7/lib/xmall...,51.0,"xrealloc (void *p, size_t n)\n{\n if (!n && p...",nm-T,b'H\x85\xf6SH\x89\xf3u\x05H\x85\xffu\x1aH\x89\...,"[0, 2147483649, 1, 0, 38505, 1, 0, 0, 0, 0, 1,..."
1432,../res/sed,xstrdup,4269456,19,/home/ubuntu/ace/dataset-gen/sed-4.7/lib/xmall...,119.0,xstrdup (char const *string)\n{\n return xmem...,nm-T,b'SH\x89\xfb\xe8G\xdf\xfe\xffH\x89\xdfH\x8dp\x...,"[-1, 2147483649, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,..."


~~*facepalm*... I forgot that we don't have x86_64 support yet...~~ Fixed by adding a "strict mode" feature to acevm.py.

Beautiful... we have fingerprints! And a full table! Now let's wrap it all up into one function.

Before we do that though, let's take a quick look at scalability concerns.

In [16]:
# Memory usage of each column in bytes
mem = full_fat.memory_usage(deep=True)
display(mem)
display("Total {:.2f} KB".format(mem.sum()/1024))

Index            2136
binary          17889
function        19285
address          2136
length           2136
src_path        28573
src_line         2136
src_code       414415
attributor      16287
raw_bytes      135757
fingerprint     81168
dtype: int64

'Total 705.00 KB'
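
To put those numbers in perspective, here's a quick back-of-the-envelope sketch that uses only the byte counts reported in the output above. The row count is an assumption: it's inferred from the 2136-byte `address` column, assuming that column is stored as 8-byte integers.

```python
# Per-column byte counts copied from the memory_usage(deep=True) output above.
mem_bytes = {
    "Index": 2136, "binary": 17889, "function": 19285, "address": 2136,
    "length": 2136, "src_path": 28573, "src_line": 2136, "src_code": 414415,
    "attributor": 16287, "raw_bytes": 135757, "fingerprint": 81168,
}

total = sum(mem_bytes.values())

# Share of the total taken by each column, largest first.
shares = sorted(
    ((name, size / total) for name, size in mem_bytes.items()),
    key=lambda item: item[1],
    reverse=True,
)

# ASSUMPTION: `address` is int64 (8 bytes per row), so rows = 2136 / 8.
assumed_rows = mem_bytes["address"] // 8
bytes_per_function = total / assumed_rows

for name, share in shares[:3]:
    print(f"{name:<12}{share:6.1%}")
print(f"~{bytes_per_function:.0f} bytes per function across {assumed_rows} rows")
```

Under these assumptions the source text dominates, followed by the raw bytes and the fingerprints. If that holds for other binaries too, `src_code` is the first column to worry about when scaling up.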

~~ADDRESS INSTRUCTION SORTING!!~~ Instruction sorting was added by making pyreil Instruction objects sortable (adding an `__lt__` method to each class). Doing this exposed some issues the aVM had with very large numbers (which become more common once instructions are sorted); these have been addressed.
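
For reference, a minimal sketch of what making objects sortable involves in Python. The `Ins` class and its fields are hypothetical stand-ins, not pyreil's actual Instruction classes, and the sort key (opcode first, then operands) is an assumption made purely for illustration.

```python
from functools import total_ordering


@total_ordering
class Ins:
    """Hypothetical instruction: an opcode name plus a tuple of operands."""

    def __init__(self, opcode, operands=()):
        self.opcode = opcode
        self.operands = tuple(operands)

    def _key(self):
        # ASSUMPTION: order by opcode first, then by operands.
        return (self.opcode, self.operands)

    def __eq__(self, other):
        if not isinstance(other, Ins):
            return NotImplemented
        return self._key() == other._key()

    def __lt__(self, other):
        # sorted() only needs __lt__; total_ordering derives the rest.
        if not isinstance(other, Ins):
            return NotImplemented
        return self._key() < other._key()

    def __repr__(self):
        return f"Ins({self.opcode!r}, {self.operands!r})"


program = [Ins("str", (1, 2)), Ins("add", (3, 4)), Ins("add", (1, 2))]
ordered = sorted(program)
print(ordered)
```

Sorting like this maps any two sequences that contain the same instructions, in whatever order, to the same list.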

Now, to combine everything into one module.

In [2]:
import ace

ace.full_profile("../res/sed", threads = 8, ins_sort = False, src_only = False)

Unnamed: 0,binary,function,address,length,src_path,src_line,src_code,attributor,raw_bytes,fingerprint
0,../res/sed,_IO_adjust_column,4461664,64,,,,,b'Lc\xc2I\x01\xf0L9\xc6s#A\x80x\xff\nI\x8dH\xf...,"[1, -2147483648, 0, -2147483648, 0, 0, -214748..."
1,../res/sed,_IO_adjust_wcolumn,4429200,84,,,,,b'Hc\xcaL\x8d\x04\x8eL9\xc6s2A\x83x\xfc\nI\x8d...,"[1, -2147483648, 0, -2147483648, 0, 0, -214748..."
2,../res/sed,_IO_cleanup,4453568,1221,,,,,b'AWAVAUATUSH\x83\xec8I\xc7\xc5\x00\x00\x00\x0...,"[0, -2147483648, 0, 0, 38505, -2147483648, 0, ..."
3,../res/sed,_IO_default_doallocate,4459488,89,,,,,b'ATUSH\x89\xfb\xbf\x00 \x00\x00\xe8\x8f\x86\x...,"[-1, -2147483647, 0, -2147483648, -2147483648,..."
4,../res/sed,_IO_default_finish,4460592,795,,,,,b'USH\x89\xfbH\x83\xec8H\x8b\x7f8dH\x8b\x04%(\...,"[0, 0, 59, 11, 18, -2147483648, 0, 0, -1, -64,..."
...,...,...,...,...,...,...,...,...,...,...
1429,../res/sed,xnrealloc,4269056,36,/home/ubuntu/ace/dataset-gen/sed-4.7/lib/xalloc.h,112.0,"xnrealloc (void *p, size_t n, size_t s)\n{\n ...",nm-T,b'H\x89\xf0H\xf7\xe2H\x89\xc6\x0f\x90\xc0H\x85...,"[1, -2147483648, 0, 0, 38505, 0, 0, 0, -257, 0..."
1430,../res/sed,xpalloc,4232176,202,/home/ubuntu/ace/dataset-gen/sed-4.7/lib/dfa.c,794.0,"xpalloc (void *pa, ptrdiff_t *nitems, ptrdiff_...",nm-t,b'USH\x89\xf5I\x89\xd2H\x83\xec\x08L\x8b\x0eL\...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 38505, 0, -1..."
1431,../res/sed,xrealloc,4268992,54,/home/ubuntu/ace/dataset-gen/sed-4.7/lib/xmall...,51.0,"xrealloc (void *p, size_t n)\n{\n if (!n && p...",nm-T,b'H\x85\xf6SH\x89\xf3u\x05H\x85\xffu\x1aH\x89\...,"[1, -2147483647, -2147483648, -2147483648, -21..."
1432,../res/sed,xstrdup,4269456,19,/home/ubuntu/ace/dataset-gen/sed-4.7/lib/xmall...,119.0,xstrdup (char const *string)\n{\n return xmem...,nm-T,b'SH\x89\xfb\xe8G\xdf\xfe\xffH\x89\xdfH\x8dp\x...,"[-1, -2147483647, 0, -2147483648, -2147483648,..."


*Runtime observation:* ~~enabling instruction sorting really slows things down, see below~~

Instruction sorting is still slower than leaving the instructions unsorted, but after adding simulated integer overflow to the aVM, the performance hit is bearable.
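
Python integers are arbitrary-precision, so a VM written in Python has to wrap values explicitly if it wants to imitate fixed-width registers. Here's a sketch of one common way to do that. It is not the aVM's actual implementation; the helper names and the default 32-bit width are assumptions.

```python
def wrap_signed(value, bits=32):
    """Wrap an unbounded Python int into a signed two's-complement range."""
    mask = (1 << bits) - 1
    value &= mask                      # keep only the low `bits` bits
    if value & (1 << (bits - 1)):      # sign bit set -> negative number
        value -= 1 << bits
    return value


def add_wrapping(a, b, bits=32):
    """Add two ints, then wrap the result as a fixed-width register would."""
    return wrap_signed(a + b, bits)


print(add_wrapping(2**31 - 1, 1))   # wraps around to the most negative value
print(wrap_signed(2**40 + 5))       # high bits are discarded
```

Keeping intermediate values bounded this way stops them from growing into ever-larger big integers, which is one plausible reason such a change would help performance.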

In [4]:
%timeit ace.full_profile("../res/sed", threads = 1, ins_sort = False, src_only = True)
%timeit ace.full_profile("../res/sed", threads = 1, ins_sort = True, src_only = True)

1.51 s ± 38.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.43 s ± 74.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [3]:
%timeit ace.full_profile("../res/sed", threads = 8, ins_sort = False, src_only = True)
%timeit ace.full_profile("../res/sed", threads = 8, ins_sort = True, src_only = True)

801 ms ± 11.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.03 s ± 40.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [7]:
%timeit ace.full_profile("../res/sed", threads = 8, ins_sort = False, src_only = False)
%timeit ace.full_profile("../res/sed", threads = 8, ins_sort = True, src_only = False)

2.91 s ± 63.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.47 s ± 135 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
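
The relative cost of sorting is easier to compare as a ratio. This sketch uses only the mean timings reported above; the dictionary keys are just shorthand for the three configurations.

```python
# Mean runtimes in seconds, copied from the %timeit output above:
# (ins_sort=False, ins_sort=True) for each configuration.
timings = {
    "threads=1, src_only=True": (1.51, 2.43),
    "threads=8, src_only=True": (0.801, 1.03),
    "threads=8, src_only=False": (2.91, 3.47),
}

# Slowdown factor from enabling instruction sorting.
slowdown = {
    cfg: sorted_t / unsorted_t
    for cfg, (unsorted_t, sorted_t) in timings.items()
}

for cfg, factor in slowdown.items():
    print(f"{cfg}: {factor:.2f}x slower with sorting")

# Speedup from 1 to 8 threads (src_only=True, unsorted).
thread_speedup = (
    timings["threads=1, src_only=True"][0]
    / timings["threads=8, src_only=True"][0]
)
print(f"8 threads vs 1 (unsorted): {thread_speedup:.2f}x faster")
```

On these numbers the sorting overhead ranges from roughly 1.2x to 1.6x, and eight threads buy a bit under 2x on this binary.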
