Binary-Crocodile

Code for poly malware clustering

General purpose of this code is to clustering malware binaries, which were created with polymorph engines. It’s not guaranteed, that algo will help you with all instances of specific malware family, but it can help with samples in same generation and with similar pre-config set.

The main idea is quite simple - collect entropy characteristic of executable section and map them to euclid coord system.

Then map entropy features to harmonic functions characteristics and assemble them with fast fourier transform (FFT). It’s not necessary to use FFT. I choose it due to simple and flexible mechanic, but you also can choose any math model you like.

Executable section of malware file contain regions with different entropy - native code, packet data, user code, data and so on. So, we have three characteristics - size of region, entropy and it offset. Size of region == period of harmonic function. Offset inside file == phase. Value of entropy == amplitude.

Entropy tree

For localization of regions we use binary tree. Initialize root with whole executable sections then calc entropy of file from start to half for left branch and from half to end. Then repeat this for both branches while difference of entropy between children's became less than some fixed value e. I use 0.1, but you can make it even less to reach more strict borders of section. It’s also all about performance.

After split, we take crones from each depth. For amplitude of each even region we multiply it by -1 to provide more effective mapping. For testing I use ~500 malware samples of Backdoor.Bedep for win32. You can find list of them at file "Win32Bedep". For comparison effectiveness of algo was used simple clustering based on fuzzy hashing (FH) + Levenshtein distance (LD). I use fixed LD, but is probably more useful to vary limit based on length of FH. Here is statistic of clustering with FH and entropy maps of groups.

As you can see, percent of clustered files is very low, but groups is “good”. Files are similar and algo can “ignore” minor differences in file structure. Here is example of mapping entropy features for few clusterized with FH+LD.