Skip to content

tallpid/Binary-Crocodile

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Binary-Crocodile

Code for poly malware clustering

General purpose of this code is to clustering malware binaries, which were created with polymorph engines. It’s not guaranteed, that algo will help you with all instances of specific malware family, but it can help with samples in same generation and with similar pre-config set.

The main idea is quite simple - collect entropy characteristic of executable section and map them to euclid coord system.

Then map entropy features to harmonic functions characteristics and assemble them with fast fourier transform (FFT). It’s not necessary to use FFT. I choose it due to simple and flexible mechanic, but you also can choose any math model you like.

Executable section of malware file contain regions with different entropy - native code, packet data, user code, data and so on. So, we have three characteristics - size of region, entropy and it offset. Size of region == period of harmonic function. Offset inside file == phase. Value of entropy == amplitude.

Entropy tree

alt tag

For localization of regions we use binary tree. Initialize root with whole executable sections then calc entropy of file from start to half for left branch and from half to end. Then repeat this for both branches while difference of entropy between children's became less than some fixed value e. I use 0.1, but you can make it even less to reach more strict borders of section. It’s also all about performance.

alt tag

After split, we take crones from each depth. For amplitude of each even region we multiply it by -1 to provide more effective mapping. For testing I use ~500 malware samples of Backdoor.Bedep for win32. You can find list of them at file "Win32Bedep". For comparison effectiveness of algo was used simple clustering based on fuzzy hashing (FH) + Levenshtein distance (LD). I use fixed LD, but is probably more useful to vary limit based on length of FH. Here is statistic of clustering with FH and entropy maps of groups.

As you can see, percent of clustered files is very low, but groups is “good”. Files are similar and algo can “ignore” minor differences in file structure. Here is example of mapping entropy features for few clusterized with FH+LD.

Percentage of clustered with FH+LD files.

alt tag

Clusters FH+LD

alt tag alt tag

Raw entropy data, without mapping.

alt tag alt tag

For comparison of “entropy” function we integrate them and compare scales with fixed limit (0.05).

Fourie functions for groups

alt tag alt tag alt tag

Entropy maps for groups

alt tag alt tag

About

Code for poly malware clustering

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages