hoosierEE/ObfuscatedBinaryClassifiers

usage: main.py [-h] tmp out

Create musl-1.2.4 dataset, train models, report accuracy.

positional arguments:
  tmp         build directory for generating dataset
  out         results and figures will go here

options:
  -h, --help  show this help message and exit

This project aims to improve the state of the art in reverse engineering and binary static analysis. We focus on the specific case of binary executable code without debug symbols.

Statically linked binaries are a common way to deploy programs to end users, but they are opaque and difficult to analyze for security purposes. Malware authors also intentionally obfuscate their code to make it harder to analyze; because obfuscation can circumvent antivirus and malware-detection software, we specifically train on obfuscated data.

Dataset

The main.py program downloads musl-1.2.4 and extracts source code for some of its functions. It then creates multiple C source code versions of each function by obfuscating their sources with Tigress. Next it compiles all these sources with diverse options, resulting in a large set of obfuscated object files. It also compiles the source code of each function without any added obfuscation, resulting in a set of “plain” object files.

The musl project uses a typical ./configure && make install workflow, so our program parses the output of make to determine the baseline options required to build each function. Tigress obfuscates single C programs and requires that each program contains a main function, so our main.py program adds these to each file before obfuscating.

Finally, we strip all symbols from the object files using GNU strip.
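
The sketch below illustrates one obfuscate-compile-strip step. It is illustrative only: the transform name, target function, environment string, and paths are placeholders, not the exact options main.py chooses.

import subprocess

def add_stub_main(src):
    # Tigress requires each input program to contain a main function.
    with open(src, 'a') as f:
        f.write('\nint main(int argc, char **argv) { return 0; }\n')

def obfuscate_compile_strip(src, obf_c, obj):
    # Placeholder transform and target function; main.py varies these per sample.
    subprocess.run(['tigress', '--Environment=x86_64:Linux:Gcc:4.6',
                    '--Transform=Flatten', '--Functions=memcmp',
                    f'--out={obf_c}', src], check=True)
    subprocess.run(['gcc', '-c', '-O2', '-o', obj, obf_c], check=True)  # one of many option sets
    subprocess.run(['strip', '--strip-all', obj], check=True)           # remove all symbols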

Dataset preprocessing

Starting with the stripped object files, we apply further processing for use with some of the machine learning models.

from _utils import fname
from collections import Counter
from pathlib import Path
import numpy as np

# Count examples per function name for the obfuscated object files
# and for the block-sampled representations.
c1 = Counter(fname(p) for p in Path('out/o').glob('*.o'))
c2 = Counter(fname(p) for p in Path('out/o_bb').glob('*'))
a1 = np.array(list(c1.values()))
a2 = np.array(list(c2.values()))
# Emit a Markdown table of per-function counts (raw strings keep \mu and \sigma literal).
print(rf'| name | obfuscated (\mu={a1.mean():.1f}, \sigma={a1.std():.1f}) | block sampled (\mu={a2.mean():.1f}, \sigma={a2.std():.1f}) |')
print('|--|--|--|')
for (k1, v1), (_, v2) in zip(sorted(c1.items()), sorted(c2.items())):
    print(f'| {k1} | {v1} | {v2} |')
| name | object file (μ=1132.3, σ=77.6) | sampled (μ=10097.4, σ=8692.7) |
|--|--|--|
| abs | 843 | 2349 |
| acos | 1168 | 15214 |
| asin | 1168 | 14068 |
| atan2 | 1084 | 20914 |
| ceil | 1168 | 7772 |
| cos | 1168 | 10043 |
| daemon | 1030 | 10385 |
| floor | 1168 | 7387 |
| inet_addr | 1140 | 3878 |
| inet_aton | 1140 | 11469 |
| isalnum | 1222 | 2819 |
| memccpy | 1152 | 7533 |
| memcmp | 1072 | 4866 |
| memmem | 1222 | 29691 |
| sin | 1168 | 12776 |
| stpcpy | 1156 | 6150 |
| stpncpy | 1164 | 8221 |
| strchr | 1129 | 4126 |
| strcpy | 1032 | 2365 |
| strncpy | 1074 | 2552 |
| strstr | 1222 | 40261 |
| strtok | 1168 | 8126 |
| tan | 1168 | 9534 |
| utime | 1153 | 4442 |
| wmemmove | 1128 | 5495 |

Bag of Bytes

Binaries are sequences of bytes, which means we can represent a binary of any length with a vector $X$ of length 256 by storing the count of each byte value $i$ at $X_i$. This bag-of-words format loses information about the ordering of the original bytes.
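
As a minimal sketch (bag_of_bytes is our name for illustration, not part of main.py), the encoding is a single byte histogram:

import numpy as np

def bag_of_bytes(blob: bytes) -> np.ndarray:
    # X[i] = number of times byte value i occurs in the binary.
    return np.bincount(np.frombuffer(blob, dtype=np.uint8), minlength=256)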

Disassembly

We disassemble the original bytes into a stream of text tokens using Capstone’s support for the amd64 ISA. Each function becomes a single string. Whitespace separates each token within the string. A small sample of these tokens looks like this:

jg 0x47 add r8b, byte ptr [rcx] add dword ptr [rax], eax

In this representation, punctuation such as brackets and commas are part of the token.
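
A minimal sketch of such a pass using Capstone's Python bindings (the byte string was chosen to reproduce the sample above; the project's actual tokenizer may differ):

from capstone import Cs, CS_ARCH_X86, CS_MODE_64

md = Cs(CS_ARCH_X86, CS_MODE_64)
code = b'\x7f\x45\x44\x02\x01\x01\x00'  # bytes that disassemble to the sample above
tokens = []
for insn in md.disasm(code, 0):
    tokens.append(insn.mnemonic)
    tokens.extend(insn.op_str.split())  # punctuation stays attached to tokens
print(' '.join(tokens))  # jg 0x47 add r8b, byte ptr [rcx] add dword ptr [rax], eax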

Bag of Tokens

After disassembly, we create another data representation using a traditional bag-of-words approach. First we build a vocabulary mapping each unique token $t$ to its rank by frequency of occurrence in the dataset. Then for each function, we store the count of each token $t$ at $V_t$. The length of this vector equals the number of unique tokens in the training data. Because the test data may contain tokens not present in the training data, we add an "unknown" token to the vocabulary.
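
A sketch of this encoding (the helper names are ours, not the project's):

from collections import Counter

def build_vocab(docs):
    # Rank tokens by corpus frequency; slot 0 is reserved for unknown tokens.
    counts = Counter(tok for doc in docs for tok in doc.split())
    return {tok: i + 1 for i, (tok, _) in enumerate(counts.most_common())}

def bag_of_tokens(doc, vocab):
    v = [0] * (len(vocab) + 1)
    for tok in doc.split():
        v[vocab.get(tok, 0)] += 1  # tokens unseen in training count toward slot 0
    return v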

Basic Block Sampling

Finally, we use Ghidra to extract basic blocks from each object file. To synthesize a new representation of a given function, we can sample any number of basic blocks from all the object files for that function.
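
For illustration, given the pooled blocks for one function, sampling could look like this (a sketch; the sampling scheme in main.py may differ):

import random

def sample_blocks(blocks, k, seed=0):
    # Draw k basic blocks (with replacement) from all object files for a
    # function and concatenate them into one synthetic representation.
    rng = random.Random(seed)
    return b''.join(rng.choices(blocks, k=k))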

Models

We use the following models with the data representations described above:

  • kNN: classify a function according to its k-nearest neighbors, using a custom distance metric.
  • RandomForest: classify a function using the "bag of bytes" representation.
  • RandomForestTokens: classify a function using the "bag of tokens" disassembly representation.
  • fastText: classify a function from its disassembly representation using n-grams of length 5.

kNN

To measure the similarity between the raw bytes of two functions, this model uses a custom metric called Normalized Compression Distance (NCD). First we compress each document separately and take the lengths of the compressed results, $D_1$ and $D_2$. Next we concatenate the two original documents and take the length of their compressed concatenation as $D_3$. Finally we express the normalized distance as:

$$\mathrm{NCD}(D_1, D_2, D_3) = \frac{D_3 - \min(D_1, D_2)}{\max(D_1, D_2)}$$

This model reports the most frequent label from the nearest neighbors for $k \in \{1, 3, 5, 9\}$. In case of a tie, it returns the first label of the sequence.
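
A sketch of the metric with zlib as the compressor (the compressor used in our experiments may differ):

import zlib

def ncd(x: bytes, y: bytes) -> float:
    d1, d2 = len(zlib.compress(x)), len(zlib.compress(y))
    d3 = len(zlib.compress(x + y))  # length of the compressed concatenation
    return (d3 - min(d1, d2)) / max(d1, d2)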

Random Forest

For both the “Bag of Bytes” and “Bag of Tokens” representations, we train a Random Forest classifier from scikit-learn (version 1.3.1) with the following non-default hyperparameters:

  • n_estimators: (40,80,100,120) (trained separately)
  • n_jobs: -1
  • class_weight: “balanced”
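
For reference, one such configuration in scikit-learn (the toy feature matrix and labels below are stand-ins for the real data):

from sklearn.ensemble import RandomForestClassifier
import numpy as np

X_train = np.random.randint(0, 5, size=(10, 256))   # stand-in bag-of-bytes vectors
y_train = np.array(['memcmp', 'strchr'] * 5)        # stand-in function-name labels

clf = RandomForestClassifier(n_estimators=100,      # also 40, 80, 120 in separate runs
                             n_jobs=-1,
                             class_weight='balanced')
clf.fit(X_train, y_train)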

fastText

This fastText 0.9.2 model operates on disassembled text tokens, with the following non-default hyperparameters:

  • thread: 18
  • lr: 2
  • epoch: 300
  • dim: 30
  • wordNgrams: 5
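
With the official fasttext Python bindings, this corresponds to the call below (train.txt is a placeholder path; each line holds __label__<name> followed by a function's token stream):

import fasttext

model = fasttext.train_supervised('train.txt',
                                  thread=18, lr=2, epoch=300,
                                  dim=30, wordNgrams=5)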

Block-Walk Embedding

Confusion matrix for the Block-Walk embedding model: [figure: walkyblocky.png]

Training for 150 epochs: [figure: myplot.png]

Limitations

Currently we use gcc (version 11.4.0) on Ubuntu Linux (version 22.04.3), and target only the x86-64 instruction set architecture (ISA). Future improvements could add more compilers, operating systems, programming languages, and ISAs.

In addition, some combinations of Tigress transformations result in invalid code. We discard these from our dataset, so improvements to Tigress transformations would give us a better and more diverse dataset in less time.

Finally, some valid C code results in object files which do not contain a symbol matching the name of the original function. This may be due to a failed Tigress transformation or aggressive function inlining by the compiler. We currently exclude these object files from our dataset, but we would rather keep the inlined results.

About

Code and data for paper: Function Classification for Obfuscated Binary Code
