# Mining Input Grammar from Binary Programs

In the [chapter on grammar miner](GrammarMiner.ipynb), we have seen various techniques that _automatically mine grammars for programs_ – by executing the programs and observing how they process which parts of the input. 

As that chapter showed, most of these techniques have only been demonstrated on Python programs. How could they be applied to larger programs written in other programming languages such as `C` or `C++`?

In this notebook, we demonstrate how these techniques can be used to automatically mine input grammars from `C` _binary programs_. The mined grammars describe syntactically valid inputs and can therefore be used for efficient and effective fuzzing.

To mine an input grammar, we reuse and extend some of the classes we saw in that chapter.

In [None]:
import fuzzingbook_utils

## Recovering the Inventory Grammar

Suppose we convert the `inventory` system example we saw in the chapter on [parsers](Parser.ipynb) into `C`:

In [None]:
inventory_src = """\
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int max_size = 1024;

char* process_van(char* year, char* company, char* model, char* my_string) {
    char *current = my_string;
    char* fmt = "We have a %s %s van from %s vintage.\\n";
    int written = snprintf(current, max_size, fmt, company, model, year);

    current += written;
    int rem = (max_size - written);

    if(atoi(year) > 2010) {
        snprintf(current, rem, "It is a recent model!\\n");
    } else {
        snprintf(current, rem, "It is an old but reliable model!\\n");
    }
    return my_string;
}

char* process_car(char* year, char* company, char* model, char* my_string) {
    char *current = my_string;
    char* fmt = "We have a %s %s car from %s vintage.\\n";
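    int written = snprintf(current, max_size, fmt, company, model, year);

    /* Advance past the first message so the second one is appended, as in process_van. */
    current += written;
    int rem = (max_size - written);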

    if(atoi(year) > 2010) {
        snprintf(current, rem, "It is a recent model!\\n");
    } else {
        snprintf(current, rem, "It is an old but reliable model!\\n");
    }
    return my_string;
}

char* process_vehicle(char* vehicle_str, char* my_string) {
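    char *data[4] = {NULL, NULL, NULL, NULL};
    int index = 0;
    char* str = vehicle_str;

    /* Read at most four comma-separated fields; never write past data[3]. */
    while (index < 4) {
       data[index] = strtok(str, ",");
       if (!data[index]) break;

       str = NULL;
       index += 1;
    }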

    char* year = data[0];
    char* kind = data[1];
    char* company = data[2];
    char* model = data[3];

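    /* Fewer than four fields: kind, company or model would be NULL. */
    if (index < 4) {
        fprintf(stderr, "Invalid entry");
        return NULL;
    }

    if(!strcmp(kind, "van")) {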
        return process_van(year, company, model, my_string);
    } else if (!strcmp(kind, "car")) {
        return process_car(year, company, model, my_string);
    } else {
        fprintf(stderr, "Invalid entry");
        return NULL;
    }
}

int main(int argc, char* argv[]) {
    /* format: year, kind, company, model */
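    if (argc < 2) {
        fprintf(stderr, "usage: %s year,kind,company,model\\n", argv[0]);
        return 1;
    }
    char *my_string = malloc(sizeof(char) * max_size);
    char* result = process_vehicle(argv[1], my_string);
    /* process_vehicle returns NULL for invalid entries. */
    if (result) printf("%s", result);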
    
    free(my_string);
    return 0;
}
"""

In [None]:
with open('inventory.c', 'w+') as f:
    print(inventory_src, file=f)

To interrogate this program, observe how it parses its input, and recover its input grammar, we need information about its execution. In the Python example, we hooked into the Python runtime to observe the arguments to a function and any local variables created, and we obtained the context of execution by inspecting the `frame` argument.

Here, we instead use `GDB`, the GNU Project debugger, which allows us to trace through a program and gives us access to the same contextual information as in the Python example. `GDB` provides a Python API that can be used to debug binary programs and to access program information at runtime.

Note that the `GDB` Python API is not a standalone library: the `gdb` module cannot be imported from a regular Python script. It is only available to scripts that run inside `GDB` itself.
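
Because the `gdb` module only exists inside a running `GDB` process, a script that may be loaded both inside and outside `GDB` can guard the import. The following is a minimal sketch; the names `INSIDE_GDB` and `describe_environment` are hypothetical and not part of `GDB` or of this notebook.

```python
# Guarded import: `gdb` is only importable from GDB's embedded Python.
try:
    import gdb
    INSIDE_GDB = True
except ImportError:
    gdb = None
    INSIDE_GDB = False


def describe_environment():
    """Say whether the GDB Python API is available (hypothetical helper)."""
    if INSIDE_GDB:
        return "running inside GDB; the gdb module is available"
    return "running outside GDB; the gdb module cannot be imported"


print(describe_environment())
```

In a regular Python interpreter this normally takes the `ImportError` branch; the same file sourced with `gdb -x` takes the other one.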

### Debugger

Before we can start debugging programs with `GDB`, we define an interface called `Debugger`. It declares the methods that a concrete debugger implements and that work together to debug a program automatically.

In [None]:
class Debugger:
    def run(self):
        raise NotImplementedError()

    def step(self):
        raise NotImplementedError()

    def break_at(self, line):
        raise NotImplementedError()

    def start_program(self, inp, binary):
        raise NotImplementedError()

    def event_loop(self):
        raise NotImplementedError()
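
To illustrate how the interface is meant to be filled in, here is a minimal sketch of a subclass that only records the debugger commands it would issue instead of sending them to a real debugger. The class `RecordingDebugger`, the example input, and the binary name `a.out` are assumptions made for illustration; the command strings (`set args`, `file`, `break`, `run`, `step`) are standard `GDB` commands.

```python
class Debugger:
    """The abstract interface, repeated here so the sketch is self-contained."""

    def run(self):
        raise NotImplementedError()

    def step(self):
        raise NotImplementedError()

    def break_at(self, line):
        raise NotImplementedError()

    def start_program(self, inp, binary):
        raise NotImplementedError()

    def event_loop(self):
        raise NotImplementedError()


class RecordingDebugger(Debugger):
    """Hypothetical stand-in: logs commands instead of executing them."""

    def __init__(self):
        self.commands = []

    def run(self):
        self.commands.append("run")

    def step(self):
        self.commands.append("step")

    def break_at(self, line):
        self.commands.append("break '%s'" % line)
        self.run()

    def start_program(self, inp, binary):
        self.commands.append("set args '%s'" % inp)
        self.commands.append("file %s" % binary)

    def event_loop(self):
        # Load the program, stop at main, then take a single step.
        self.start_program("1997,van,Ford,E350", "a.out")
        self.break_at("main")
        self.step()


d = RecordingDebugger()
d.event_loop()
for command in d.commands:
    print(command)
```

A stand-in like this makes it possible to check the order of commands without having `GDB` installed.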

### GDBDebugger

Next, we define a class `GDBDebugger` that extends `Debugger` and implements its methods. Its constructor takes the `gdb` module itself, the *binary* to be debugged, and the *program input*. The attribute `tracer` holds an instance of `GDBTracer` (defined later in this notebook), which is used for tracing.

In [None]:
class GDBDebugger(Debugger):
    def __init__(self, gdb, binary, inp, **kwargs):
        self.options(kwargs)
        self.gdb, self.binary, self.inp = gdb, binary, inp
        self.frames = []

        self._set_printer()
        self._set_logger()
        self._skip_std_files()
        self.tracer = GDBTracer(self.inp, files=self.files)

We also add a few private methods to `GDBDebugger`. The first, `_set_printer`, tells `GDB` not to include addresses in the values it prints at runtime.

In [None]:
class GDBDebugger(GDBDebugger):
    def _set_printer(self):
        self.gdb.execute('set print address off')

The `_set_logger` method issues three commands related to logging. The first tells `GDB` to overwrite its output logfile each time we run our program. By default, `GDB` writes its output to both the terminal and the logfile, so the second command instructs `GDB` to redirect its output to the logfile only. The third enables logging.

In [None]:
class GDBDebugger(GDBDebugger):
    def _set_logger(self):
        self.gdb.execute('set logging overwrite on')
        self.gdb.execute('set logging redirect on')
        self.gdb.execute('set logging on')

When stepping through binary programs, we want to avoid stepping into files that are of no interest to us, such as those of the *standard libraries*. One way to do this is to tell `GDB` to skip all files whose names match a given pattern, for example `*.S` files.

In [None]:
class GDBDebugger(GDBDebugger):
    def _skip_std_files(self):
        self.gdb.execute('skip -gfi *.S')
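
The effect of such a glob pattern can be approximated outside `GDB` with Python's `fnmatch` module. This is only a sketch: the helper `would_skip` and the example paths are hypothetical, and `GDB`'s own matching rules may differ in details from Python's.

```python
import fnmatch


def would_skip(file_name, patterns=("*.S",)):
    """Approximate a glob-based skip list: True if any pattern matches."""
    return any(fnmatch.fnmatchcase(file_name, p) for p in patterns)


# Example paths chosen for illustration only.
for path in ["../sysdeps/x86_64/multiarch/strlen-avx2.S", "inventory.c"]:
    print(path, "->", "skip" if would_skip(path) else "trace")
```

Note that the match is case-sensitive, so a pattern of `*.S` does not cover files ending in `.s`.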

To further restrict stepping to the code we care about, the `options` method stores in the attribute `files` a list of the names of the source files we are interested in tracing.

In [None]:
class GDBDebugger(GDBDebugger):
    def options(self, kwargs):
        self.files = kwargs.get('files', [])

At each step in our program, we need to check whether we are still within the context we are interested in. To do this, we provide the method `in_context`, which takes the selected frame at each *step* and checks whether its source file is one of the files listed in `files`.

In [None]:
class GDBDebugger(GDBDebugger):
    def in_context(self, frame):
        file_name = frame.find_sal().symtab.fullname()
        return any(file_name.endswith(f) for f in self.files)
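
The suffix check can be tried without `GDB` by passing the file name directly. In this sketch, `in_context_sketch` is a hypothetical standalone version of the method, and the paths are made up for illustration.

```python
def in_context_sketch(file_name, files):
    """True if the full file name ends with one of the files of interest."""
    return any(file_name.endswith(f) for f in files)


files = ["inventory.c"]
print(in_context_sketch("/home/user/project/inventory.c", files))  # True
print(in_context_sketch("/usr/src/glibc/string/strtok.c", files))  # False
print(in_context_sketch("/home/user/project/inventory.c", []))     # False
```

Two consequences of this design are worth keeping in mind: an empty `files` list means that no frame is ever in context, and because only the suffix is compared, a file such as `old_inventory.c` would also match `inventory.c`.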

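We also provide a concrete implementation of the `Debugger` interface. As their names imply, these methods run the program, step through it, and set a breakpoint. Note that `break_at` also starts the program after setting the breakpoint, and that `start_program` only sets the program arguments and loads the binary.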

In [None]:
class GDBDebugger(GDBDebugger):
    def run(self):
        self.gdb.execute('run')

In [None]:
class GDBDebugger(GDBDebugger):
    def step(self):
        self.gdb.execute('step')

In [None]:
class GDBDebugger(GDBDebugger):
    def break_at(self, line):
        self.gdb.execute("break '%s'" % line)
        self.run()

In [None]:
class GDBDebugger(GDBDebugger):
    def start_program(self, inp, binary):
        self.gdb.execute("set args '%s'" % inp)
        self.gdb.execute("file %s" % binary)
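
Wrapping the input in single quotes works as long as the input itself contains no single quote. A more defensive variant can build the command with `shlex.quote`. This is a sketch under the assumption that `GDB` starts the program through a POSIX shell, which is its default on Unix-like systems; the helper name `set_args_command` is hypothetical.

```python
import shlex


def set_args_command(inp):
    """Build a `set args` command whose argument survives shell parsing."""
    return "set args " + shlex.quote(inp)


for inp in ["1997,van,Ford,E350", "it's a van", 'a "quoted" $HOME']:
    command = set_args_command(inp)
    # shlex.split mimics how a POSIX shell would tokenize the command line.
    assert shlex.split(command) == ["set", "args", inp]
    print(command)
```

The round trip through `shlex.split` checks that the program would receive each input as exactly one argument.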

The method `get_event` keeps track of how frames are created as we step through our program. If the function name of the frame is not yet in the frame list, a function call has occurred: we add the name to the list and report the event `call`. If the name is the last entry in the frame list, we are still within that function and report `line`. Otherwise, the most recently called function has exited: we remove it from the list and report `return`.

In [None]:
class GDBDebugger(GDBDebugger):
    def get_event(self, frame):
        fname = frame.name()
        if fname not in self.frames:
            self.frames.append(fname)
            return 'call'
        elif fname == self.frames[-1]:
            return 'line'
        else:
            self.frames.pop()
            return 'return'
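
The classification can be replayed on a plain list of function names, one per step. The sketch below mirrors the logic of `get_event`; the helper `classify_events` and the step sequence are made up for illustration.

```python
def classify_events(function_names):
    """Replay the call/line/return classification on a list of names."""
    frames = []
    events = []
    for fname in function_names:
        if fname not in frames:
            frames.append(fname)      # first time we see this function
            events.append("call")
        elif fname == frames[-1]:
            events.append("line")     # still in the innermost function
        else:
            frames.pop()              # back in a caller: innermost exited
            events.append("return")
    return events


steps = ["main", "main", "process_vehicle", "process_vehicle", "main", "main"]
for name, event in zip(steps, classify_events(steps)):
    print(event, name)
```

Because frames are identified by function name only, a recursive call would be reported as `line` rather than `call`.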

The `event_loop` method starts our program and then steps through it automatically while it runs. Once the program has stopped at `main`, we store the selected frame in the variable `frame`. We never reassign `frame`; instead, `frame.is_valid()` returns `False` once that frame no longer exists, that is, when `main` has exited, and this ends the loop. At each step, we check whether the selected frame differs from `frame` and whether it is within our scope of interest. If it is not, we instruct `GDB` to `finish` the function we are not interested in and return to its caller.

In [None]:
class GDBDebugger(GDBDebugger):
    def event_loop(self):
        self.start_program(self.inp, self.binary)
        self.break_at('main')
        frame = self.gdb.selected_frame()
        try:
            while frame.is_valid():
                if self.gdb.selected_frame() != frame:
                    self.step()
                    current_frame = self.gdb.selected_frame()
                    if not self.in_context(current_frame):
                        # simply finish the current function execution.
                        self.gdb.execute('finish')
                        continue
                    event = self.get_event(current_frame)
                    self.tracer.traceit(current_frame, event, None)
                else:
                    self.step()
                    if not self.in_context(self.gdb.selected_frame()):
                        self.gdb.execute('finish')
        except gdb.error:
            return

### VarExtractor

Next, we define a class `VarExtractor`, which provides the logic to extract and process the variables in a frame.

In [None]:
class VarExtractor:
    def __init__(self, frame):
        self.frame = frame

In [None]:
class VarExtractor(VarExtractor):
    def extract_int_val(self, symbol):
        return '{}'.format(symbol.value(self.frame))

The method `extract_int_val` takes a symbol of integer type, looks it up in the frame, and returns its value as a string. This form of extraction works for integers that are not accessed through a pointer.

In [None]:
class VarExtractor(VarExtractor):
    def dereference_pointer_type(self, symbol):
        return symbol.value(self.frame).dereference()

`dereference_pointer_type` is used only for symbols of pointer type. It dereferences the pointer and returns the value it points to.

In [None]:
class VarExtractor(VarExtractor):
    def extract_struct_val(self, struct):
        return {
            f.name: str(struct[f]).strip('"')
            for f in struct.type.fields()
        }

As the name implies, `extract_struct_val` takes a value of struct type and returns a dictionary that maps each field name to the string form of its value, with surrounding double quotes removed. A dictionary is a natural representation because in `GDB`, the fields of a struct type can be iterated over.
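
The quote stripping can be seen in isolation with a stand-in for the struct. In this sketch, the dictionary `rendered` plays the role of the field values as `GDB` would typically print them with address printing turned off; both the helper name and the values are illustrative assumptions, not output captured from `GDB`.

```python
def strip_rendered_fields(rendered_fields):
    """Remove the double quotes GDB puts around printed C strings."""
    return {name: text.strip('"') for name, text in rendered_fields.items()}


# Hypothetical printed forms of the fields of a struct with two strings,
# one integer, and one null pointer.
rendered = {
    "scheme": '"http"',
    "host": '"www.example.com"',
    "port": "80",
    "user": "0x0",
}
print(strip_rendered_fields(rendered))
```

Values that are not printed as quoted strings, such as integers, pass through unchanged.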

In [None]:
class VarExtractor(VarExtractor):
    def strip_typedef(self, symbol):
        target_type = symbol.type.strip_typedefs()

        while target_type.code == gdb.TYPE_CODE_TYPEDEF:
            target_type = target_type.strip_typedefs()

        return target_type

The method `strip_typedef` is used for variables of user-defined types. It strips all layers of `typedef` and returns the true underlying type.

In [None]:
class VarExtractor(VarExtractor):
    def extract_pointer_val(self, symbol):
        true_value = self.dereference_pointer_type(symbol)

        if true_value.type.code == gdb.TYPE_CODE_INT:
            return '{}'.format(true_value.address).strip('"')

        elif true_value.type.code == gdb.TYPE_CODE_STRUCT:
            return self.extract_struct_val(true_value)

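Lastly, the method `extract_pointer_val` first dereferences a pointer and then checks the type of the object it points to. If that object is of integer type (which in `GDB` includes `char`), it returns the string form of the object's address, with surrounding double quotes removed; for a `char*`, this is intended to yield the string being pointed to, since address printing is turned off. If the object is a struct, it returns the dictionary of its fields. For any other pointee type, the method returns `None`.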

In [None]:
class VarExtractor(VarExtractor):
    def extract_typedef_val(self, symbol):
        true_type = self.strip_typedef(symbol)

        if true_type.code == gdb.TYPE_CODE_STRUCT:
            fields = true_type.fields()

### GDBContext

We have seen previously that the `Context` class provides easy access to information such as the current module and the parameter names. We can obtain the same information by using the `GDB` API to access the frame, as seen below.

We call our new context class `GDBContext`.

In [None]:
from GrammarMiner import Context

In [None]:
class GDBContext(Context):
    def __init__(self, frame):
        self.method = frame.name()
        self.parameter_names = self.get_arg_names(frame)
        self.line_no = frame.find_sal().line
        self.file_name = frame.find_sal().symtab.fullname()

In [None]:
class GDBContext(GDBContext):
    def get_arg_names(self, frame):
        return [symbol.name for symbol in frame.block() if symbol.is_argument]

`get_arg_names` is a custom method that takes a `frame` as input and returns a list of its argument names. `GDB` represents variables, constants, and arguments as symbols in a block, where a block is simply a scope in the source code. A `gdb.Block` is iterable, as the `get_arg_names` method shows.
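
The filtering can be illustrated without `GDB` by using stand-in symbols. In this sketch, `Symbol` is a hypothetical `namedtuple` that only mimics the `name`, `is_argument`, and `is_variable` attributes of a `GDB` symbol, and the list `block` imitates the scope of `process_van` from the inventory program.

```python
from collections import namedtuple

# Hypothetical stand-in for a GDB symbol; only three attributes are mimicked.
Symbol = namedtuple("Symbol", "name is_argument is_variable")


def get_arg_names(block):
    """Keep only the names of symbols that are function arguments."""
    return [symbol.name for symbol in block if symbol.is_argument]


# Arguments first, then local variables, as in process_van.
block = [
    Symbol("year", True, False),
    Symbol("company", True, False),
    Symbol("model", True, False),
    Symbol("my_string", True, False),
    Symbol("current", False, True),
    Symbol("written", False, True),
]
print(get_arg_names(block))
```

Local variables such as `current` and `written` are part of the same block but are filtered out because they are not arguments.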

In [None]:
class GDBContext(GDBContext):
    def extract_vars(self, frame):
        vals = {}
        extractor = VarExtractor(frame)

        symbols = [
            sym for sym in frame.block() if sym.is_variable or sym.is_argument
        ]
        for symbol in symbols:
            if symbol.type.code == gdb.TYPE_CODE_INT:
                vals[symbol.name] = extractor.extract_int_val(symbol)

            elif symbol.type.code == gdb.TYPE_CODE_PTR:
                vals[symbol.name] = extractor.extract_pointer_val(symbol)

            elif symbol.type.code == gdb.TYPE_CODE_TYPEDEF:
                x = extractor.extract_typedef_val(symbol)

        return {k1: v1 for k, v in vals.items() for k1, v1 in flatten(k, v)}

We also override `extract_vars`, a convenience method of the `GDBContext` class that acts on the frame. Here, we iterate through all symbols in the current block. If a symbol is a variable or an argument, we check its type, extract its value accordingly, and add it to the dictionary `vals`. Finally, nested values such as struct dictionaries are flattened into individual entries.
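
The final flattening step turns nested values, such as the dictionary produced for a struct, into individual named entries. The sketch below uses a simplified stand-in called `flatten_sketch`; it only handles dictionaries, and the key naming used by the real `flatten` in `GrammarMiner` may differ.

```python
def flatten_sketch(key, val):
    """Simplified stand-in for flatten: expand nested dicts into pairs."""
    if isinstance(val, dict):
        return [
            ("%s_%s" % (key, k1), v1)
            for k, v in val.items()
            for k1, v1 in flatten_sketch(k, v)
        ]
    return [(key, val)]


# Hypothetical extracted values: one plain string and one struct.
vals = {
    "url": "http://www.example.com/path",
    "ui": {"scheme": "http", "host": "www.example.com"},
}
flat = {k1: v1 for k, v in vals.items() for k1, v1 in flatten_sketch(k, v)}
print(flat)
```

After flattening, every entry maps a single name to a single string, which is the form in which the miner can compare variable values against the input.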

### GDBTracer

Previously, we have seen how the `Tracer` class traces through a Python program to obtain the trace information. Here, we define a new class `GDBTracer` that inherits the base implementation of `Tracer`; the only difference is that it uses the `GDBContext` class defined above. To do this, we override the method `create_context` so that it returns an instance of `GDBContext`.

In [None]:
from GrammarMiner import Tracer

IMPORTANT: verify that the version of fuzzingbook you use uses `create_context()` in the `traceit()` method.

In [None]:
class GDBTracer(Tracer):
    def create_context(self, frame):
        return GDBContext(frame)

### Recovering Grammars 

We reuse the function `recover_grammar`, which we have seen before, but this time it is built on the `GDBDebugger` class implemented above.

In [None]:
from GrammarMiner import ScopedGrammarMiner, readable, flatten, VEHICLES

In [None]:
def recover_grammar(f, inp, **kwargs):
    inp_list = inp.split('\n')
    miner = ScopedGrammarMiner()

    for inpstr in inp_list:
        d = GDBDebugger(gdb, f, inpstr, **kwargs)
        d.event_loop()

        miner.update_grammar(inpstr, d.tracer.trace)
    return (readable(miner.clean_grammar()))

In [None]:
import sys
import inspect

In [None]:
tracer_head = """\
import sys
sys.path.extend([%s])
sys.path.append('.')
import matplotlib.pyplot
matplotlib.pyplot._IP_REGISTERED = True # Hack
import fuzzingbook_utils
from GrammarMiner import GrammarMiner, Context, Tracer, Coverage, ScopedGrammarMiner, readable, flatten
import jsonpickle
import os
import gdb
""" % (', '.join("'%s'" % str(i) for i in sys.path if i))
debugger_src = fuzzingbook_utils.extract_class_definition(Debugger)
context_src = fuzzingbook_utils.extract_class_definition(GDBContext)
gdbtracer_src = fuzzingbook_utils.extract_class_definition(GDBTracer)
varextractor_src = fuzzingbook_utils.extract_class_definition(VarExtractor)
gdbdebugger_src = fuzzingbook_utils.extract_class_definition(GDBDebugger)
tracer_tail="""
file_name = 'gdbtrace'
def recover_trace(f, inp, **kwargs):
    d = GDBDebugger(gdb, f, inp, **kwargs)
    d.event_loop()
    with open(file_name, 'w+') as f:
        print(jsonpickle.encode(d.tracer.trace), file=f)
binary = 'a.out'
recover_trace(binary, arg0, files=files.split(' '))
"""
tracer_src = '\n'.join([tracer_head, debugger_src, context_src, gdbtracer_src, varextractor_src, gdbdebugger_src, tracer_tail])

In [None]:
with open('debugger.py', 'w+') as f:
    print(tracer_src, file=f)

In order to recover the input grammar for the inventory program we do the following:

In [None]:
!gcc -g -o a.out inventory.c

In [None]:
import jsonpickle

In [None]:
traces = []
for inp in VEHICLES:
    arg = '\'py arg0="%s"\'' % inp
    argfiles = '\'py files="%s"\'' % 'inventory.c'
    print(arg)
    !gdb --batch-silent -ex {arg} -ex {argfiles} -x debugger.py
    with open('gdbtrace', 'rb') as f:
        traces.append((inp, jsonpickle.decode(f.read())))
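
The loop above embeds each input between double quotes, which breaks for inputs that themselves contain a double quote. A more robust way to pass the values is to build the argument list explicitly and let `repr` produce the Python string literal. This is a sketch: the helper names `build_gdb_command` and `run_gdb` are hypothetical, and `run_gdb` is defined but not called here because it requires `gdb` to be installed.

```python
import subprocess


def build_gdb_command(inp, files, script="debugger.py", gdb_binary="gdb"):
    """Build the gdb invocation as an argument list (no shell quoting needed)."""
    return [
        gdb_binary, "--batch-silent",
        "-ex", "py arg0=%r" % inp,              # repr() yields a valid literal
        "-ex", "py files=%r" % " ".join(files),
        "-x", script,
    ]


def run_gdb(command):
    """Run the command; raises CalledProcessError if gdb exits with an error."""
    return subprocess.run(command, check=True)


print(build_gdb_command("1997,van,Ford,E350", ["inventory.c"]))
```

Because the command is a list, each element reaches `gdb` as a single argument regardless of spaces or quotes in the input.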

In [None]:
miner = ScopedGrammarMiner()
for inp, trace in traces:
    miner.update_grammar(inp, trace)

In [None]:
grammar = readable(miner.grammar)

In [None]:
grammar

In [None]:
from Grammars import START_SYMBOL, syntax_diagram, is_nonterminal
from GrammarFuzzer import GrammarFuzzer

In [None]:
syntax_diagram(grammar)

In [None]:
f = GrammarFuzzer(grammar)
for _a in range(10):
    print(f.fuzz())

## Recovering the Grammar for a URL Parser

First, we have a header file for the URL parser.

In [None]:
urlparse_h = """\

/*_
 * Copyright 2010 Scyphus Solutions Co. Ltd.  All rights reserved.
 *
 * Authors:
 *      Hirochika Asai
 */

#ifndef _URL_PARSER_H
#define _URL_PARSER_H

/*
 * URL storage
 */
struct parsed_url {
    char *scheme;               /* mandatory */
    char *host;                 /* mandatory */
    char *port;                 /* optional */
    char *path;                 /* optional */
    char *query;                /* optional */
    char *fragment;             /* optional */
    char *username;             /* optional */
    char *password;             /* optional */
};

#ifdef __cplusplus
extern "C" {
#endif

    /*
     * Declaration of function prototypes
     */
    struct parsed_url * parse_url(const char *, struct parsed_url* obj);
    void parsed_url_free(struct parsed_url *);

#ifdef __cplusplus
}
#endif

#endif /* _URL_PARSER_H */

/*
 * Local variables:
 * tab-width: 4
 * c-basic-offset: 4
 * End:
 * vim600: sw=4 ts=4 fdm=marker
 * vim<600: sw=4 ts=4
 */
"""

In [None]:
with open('url_parser.h', 'w+') as f:
    print(urlparse_h, file=f)

Next, we have the `.c` file.

In [1]:
urlparse_src = """\
/*
 * urlparse.c
 *
 * Decompose a URL into its components.
 */
#include <string.h>
#include <stdlib.h>
#include <stdio.h>

enum url_type {
    URL_NORMAL,
    URL_OLD_TFTP,
    URL_PREFIX
};

struct url_info {
    char *scheme;
    char *user;
    char *passwd;
    char *host;
    unsigned int port;
    char *path;			/* Includes query */
    enum url_type type;
};

void parse_url(struct url_info *ui, char *url){
    char *p = url;
    char *q, *r, *s;

    memset(ui, 0, sizeof *ui);

    q = strstr(p, "://");
    if (!q) {
        q = strstr(p, "::");
        if (q) {
            *q = '\\000';
            ui->scheme = "tftp";
            ui->host = p;
            ui->path = q+2;
            ui->type = URL_OLD_TFTP;
            return;
        } else {
            ui->path = p;
            ui->type = URL_PREFIX;
            return;
        }
    }

    ui->type = URL_NORMAL;

    ui->scheme = p;
    *q = '\\000';
    p = q+3;

    q = strchr(p, '/');
    if (q) {
        *q = '\\000';
        ui->path = q+1;
        q = strchr(q+1, '#');
        if (q)
            *q = '\\000';
    } else {
        ui->path = "";
    }

    r = strchr(p, '@');
    if (r) {
        ui->user = p;
        *r = '\\000';
        s = strchr(p, ':');
        if (s) {
            *s = '\\000';
            ui->passwd = s+1;
        }
        p = r+1;
    }

    ui->host = p;
    r = strchr(p, ':');
    if (r) {
        *r = '\\000';
        ui->port = atoi(r+1);
    }
}

char *url_escape_unsafe(const char *input){
    const char *p = input;
    unsigned char c;
    char *out, *q;
    int n = 0;

    while ((c = *p++)) {
        if (c < ' ' || c > '~') {
            n += 3;		/* Need escaping */
        } else {
            n++;
        }
    }

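    q = out = malloc(n+1);
    p = input;			/* Rewind: the first loop consumed the string */
    while ((c = *p++)) {
        if (c < ' ' || c > '~') {
            /* Three characters per escaped byte, matching n += 3 above */
            q += snprintf(q, 4, "%%%02X", c);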
        } else {
            *q++ = c;
        }
    }

    *q = '\\000';

    return out;
}

static int hexdigit(char c){
    if (c >= '0' && c <= '9')
        return c - '0';
    c |= 0x20;
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    return -1;
}

void url_unescape(char *buffer){
    const char *p = buffer;
    char *q = buffer;
    unsigned char c;
    int x, y;

    while ((c = *p++)) {
        if (c == '%') {
            x = hexdigit(p[0]);
            if (x >= 0) {
                y = hexdigit(p[1]);
                if (y >= 0) {
                    *q++ = (x << 4) + y;
                    p += 2;
                    continue;
                }
            }
        }
        *q++ = c;
    }
    *q = '\\000';
}


int main(int argc, char* argv[]) {
    struct url_info url;
    parse_url(&url, argv[1]);
    return 0;
}
"""

In [None]:
with open('urlparse.c', 'w+') as f:
    print(urlparse_src, file=f)

In [None]:
!gcc -g -o a.out urlparse.c

In [None]:
from GrammarMiner import URLS_X

In [None]:
traces = []
for inp in URLS_X:
    arg = '\'py arg0="%s"\'' % inp
    argfiles = '\'py files="%s"\'' % 'urlparse.c'
    print(arg)
    !gdb --batch-silent -ex {arg} -ex {argfiles} -x debugger.py
    with open('gdbtrace', 'rb') as f:
        traces.append((inp, jsonpickle.decode(f.read())))

In [None]:
miner = ScopedGrammarMiner()
for inp, trace in traces:
    miner.update_grammar(inp, trace)

In [None]:
grammar = readable(miner.grammar)

In [None]:
syntax_diagram(grammar)

In [None]:
f = GrammarFuzzer(grammar)
for _a in range(10):
    print(f.fuzz())