Bytes hash and functions hash are too often the same hash in ARM #143

Open
joxeankoret opened this issue Dec 22, 2018 · 2 comments

@joxeankoret
Owner

Reported by Huku.

@huku-

huku- commented Jan 11, 2019

Hello,

Let me elaborate more on this. It's good to have this here for reference purposes :)

In diaphora_ida.py one can see the following:

decoded_size, ins = diaphora_decode(x)
if ins.Operands[0].type in [o_mem, o_imm, o_far, o_near, o_displ]:
  decoded_size -= ins.Operands[0].offb
if ins.Operands[1].type in [o_mem, o_imm, o_far, o_near, o_displ]:
  decoded_size -= ins.Operands[1].offb
if decoded_size <= 0:
  decoded_size = 1
...

curr_bytes = GetManyBytes(x, decoded_size, False)

What happens here is that you remove operand bytes from each instruction and use only the opcode and prefixes to compute a signature, which you name function_hash. Another type of signature, named bytes_hash, takes all instruction bytes into account. So, normally, function_hash and bytes_hash should be different. This works fine for x86, but I've noticed that on ARM offb is always 0 (which makes sense, as operand encoding is interleaved with opcode encoding). In that case nothing is subtracted from decoded_size, so bytes_hash and function_hash are, most of the time, equal!
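The effect can be reproduced outside IDA with a minimal sketch. Everything here is an assumption for illustration: `Operand` and `Insn` are hypothetical stand-ins for IDA's operand and instruction objects, the offb values are illustrative, and MD5 is only a placeholder digest. The trimming logic mirrors the snippet quoted above.

```python
import hashlib
from collections import namedtuple

# Hypothetical stand-ins for IDA's operand/instruction objects (not IDA API).
# "trimmable" models: type in [o_mem, o_imm, o_far, o_near, o_displ].
Operand = namedtuple("Operand", ["trimmable", "offb"])
Insn = namedtuple("Insn", ["raw", "operands"])

def trimmed_size(insn):
    """Mirror the quoted logic: subtract offb of the first two operands."""
    size = len(insn.raw)
    for op in insn.operands[:2]:
        if op.trimmable:
            size -= op.offb
    return size if size > 0 else 1  # same fallback as the quoted code

def hashes(insns):
    """Return (bytes_hash, function_hash) style digests for the instructions."""
    all_bytes = b"".join(i.raw for i in insns)
    kept_bytes = b"".join(i.raw[:trimmed_size(i)] for i in insns)
    return (hashlib.md5(all_bytes).hexdigest(),
            hashlib.md5(kept_bytes).hexdigest())

# x86 "mov eax, 0x11223344": the immediate starts at byte offset 1.
X86_MOV = Insn(b"\xb8\x44\x33\x22\x11",
               [Operand(False, 0), Operand(True, 1)])

# ARM "ldr r0, [r1, #4]": operand bits are mixed into the word, offb is 0.
ARM_LDR = Insn(b"\x04\x00\x91\xe5",
               [Operand(False, 0), Operand(True, 0)])

x86_bytes_hash, x86_function_hash = hashes([X86_MOV])
arm_bytes_hash, arm_function_hash = hashes([ARM_LDR])
```

With these inputs the two x86 digests differ, while the two ARM digests are identical, because subtracting an offb of 0 leaves the full instruction in place.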

Let's have a look at two examples.

The following shows information exported from an ARM binary:

sqlite> SELECT COUNT(*) FROM functions WHERE bytes_hash != function_hash;
3845
sqlite> SELECT COUNT(*) FROM functions;
18424

While the following comes from an IA-32 binary:

sqlite> SELECT COUNT(*) FROM functions WHERE bytes_hash != function_hash;
20877
sqlite> SELECT COUNT(*) FROM functions;
21034
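In relative terms that is roughly 21% of functions on ARM (3845 of 18424) versus roughly 99% on IA-32 (20877 of 21034). The two queries can also be folded into one ratio; the sketch below assumes only that the database has a `functions` table with `bytes_hash` and `function_hash` columns, as in the queries above. The helper names and the in-memory demo table are hypothetical.

```python
import sqlite3

def differing_ratio(conn):
    """Fraction of functions whose bytes_hash differs from function_hash."""
    differing, total = conn.execute(
        "SELECT SUM(bytes_hash != function_hash), COUNT(*) FROM functions"
    ).fetchone()
    if not total:
        return 0.0
    # SUM() yields NULL (None) when there is nothing to add up.
    return (differing or 0) / float(total)

def make_demo_db(rows):
    """Build a tiny in-memory table with just the two columns used here."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE functions (bytes_hash TEXT, function_hash TEXT)")
    conn.executemany("INSERT INTO functions VALUES (?, ?)", rows)
    return conn

# Synthetic rows: one of four functions has differing hashes.
demo = make_demo_db([("a", "a"), ("b", "c"), ("d", "d"), ("e", "e")])
demo_ratio = differing_ratio(demo)
```

Against a real export, `sqlite3.connect()` would be pointed at the exported database file instead of the demo table.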

So in my ARM binary's Diaphora database, only 3845 functions have a bytes_hash that differs from function_hash, as opposed to the IA-32 binary, where almost all functions have different bytes_hash and function_hash values. After some investigation, it turned out that all of the 3845 functions have data elements (e.g. constants, jump tables etc.) interleaved with their instructions! I believe it's the following "fallback" code that ends up reading a single byte from data heads interleaved with the regular instruction heads of a function, but I haven't verified this:

if decoded_size <= 0:
  decoded_size = 1

I verified this tiny bug using a simple IDAPython script like the following, which prints every relevant operand whose offb is non-zero.

import idc
import idaapi
import idautils

# Operand types whose offb Diaphora subtracts when computing function_hash.
TYPES = [
    idaapi.o_mem,
    idaapi.o_imm,
    idaapi.o_far,
    idaapi.o_near,
    idaapi.o_displ
]

for segment in idautils.Segments():
    functions = idautils.Functions(idc.SegStart(segment), idc.SegEnd(segment))

    for function in functions:
        function = idaapi.get_func(function)

        for head in idautils.Heads(function.startEA, function.endEA):
            size = idaapi.decode_insn(head)

            # Not an instruction (e.g. data inside the function), so
            # idaapi.cmd does not describe this head; skip it.
            if size == 0:
                print 'No instruction %#x' % head
                continue

            # Report the first two operands if their offb is non-zero.
            if idaapi.cmd.Operands[0].type in TYPES:
                if idaapi.cmd.Operands[0].offb != 0:
                    print '%#x 0 %#x' % (idaapi.cmd.ea, idaapi.cmd.Operands[0].offb)
            if idaapi.cmd.Operands[1].type in TYPES:
                if idaapi.cmd.Operands[1].offb != 0:
                    print '%#x 1 %#x' % (idaapi.cmd.ea, idaapi.cmd.Operands[1].offb)

Here's a quick solution that can give similar results. Instead of relying on the instruction bytes, you can directly use information provided by the DecodeInstruction() API.

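insn = idautils.DecodeInstruction(head)

# Start from the instruction type, then append one byte per operand type
# (Op1..Op6), so that operand values never influence the signature.
itype = insn.itype
for i in xrange(6):
    op_type = getattr(insn, 'Op%d' % (i + 1)).type
    itype <<= 8
    itype |= op_type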
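The packing step can be tried without IDA. The following is only a sketch of the idea above, not Diaphora's implementation: `pack_signature` and `function_signature_hash` are hypothetical helpers, the itype and operand type numbers are made up, unused operand slots are assumed to have type 0, and MD5 is a placeholder digest.

```python
import hashlib

MAX_OPERANDS = 6  # same number of operand slots as the loop above

def pack_signature(itype, op_types):
    """Pack an instruction type and up to six operand types into one integer."""
    # Assumption: unused operand slots are represented by type 0.
    padded = list(op_types) + [0] * (MAX_OPERANDS - len(op_types))
    sig = itype
    for op_type in padded[:MAX_OPERANDS]:
        sig = (sig << 8) | (op_type & 0xFF)
    return sig

def function_signature_hash(insns):
    """Hash a function given (itype, op_types) pairs, one per instruction."""
    digest = hashlib.md5()
    for itype, op_types in insns:
        digest.update(("%x;" % pack_signature(itype, op_types)).encode("ascii"))
    return digest.hexdigest()

# Two made-up functions with the same instruction and operand types...
func_a = [(10, [1, 5]), (20, [1, 4])]
func_b = [(10, [1, 5]), (20, [1, 4])]
# ...and one where a single operand type differs.
func_c = [(10, [1, 5]), (20, [1, 2])]

hash_a = function_signature_hash(func_a)
hash_b = function_signature_hash(func_b)
hash_c = function_signature_hash(func_c)
```

Since only types go into the signature, two instructions that differ just in an immediate value or a displacement produce the same value on any architecture, regardless of how operands are encoded in the instruction bytes.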

@djcatter

djcatter commented Jan 11, 2019 via email
