# Instance Compare

This notebook aims to answer the question **Are there any unique indicators of malicious samples in terms of specific API call(s) alone?** posed in **4.2.6. Dataset Analysis** of the study.

## 1. Import Libraries/Datasets

In [1]:
import pandas as pd
import time

malicious_df = pd.read_csv('./Clustering/[EDITED]KMeans_SampleHash_Common.csv', low_memory=False) #This should point to a verified SampleHash_Common.csv file
benign_df = pd.read_csv('./Clustering/Benign/API_Patterns.csv') #This should point to the API_Patterns.csv file

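#Load the list of API calls (a single comma-separated line)
API_LIST = "api_calls.txt"
DELIMITER = "NaN"
with open(API_LIST, "r") as API_FILE: #the context manager closes the file automatically
    APIS = API_FILE.readline().strip().split(',') #strip() keeps a trailing newline out of the last API name
# APIS.append(DELIMITER) #serves as a label for NaN values for Instance-based datasets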

## 2. DataFrame Preview

In [2]:
malicious_df.replace(to_replace='-',value='_', inplace=True)
malicious_df

Unnamed: 0,cluster,hash,type1,type2,type3,pattern
0,0,490d584c7d303ed35c673460b63f3ca8,trojan,dropper,pua,"GetSystemTimeAsFileTime,NtCreateMutant,GetSyst..."
1,0,9ab8ea1d2d68a0d4110df413e677976c,trojan,hacktool,_,"GetSystemTimeAsFileTime,NtCreateMutant,GetSyst..."
2,0,adbc74815ef2bd1ea4967abad812233d,trojan,_,_,"GetSystemTimeAsFileTime,NtCreateMutant,GetSyst..."
3,0,f6eb4841bba3a4cee747700dc0ee1609,_,_,_,"GetSystemTimeAsFileTime,NtCreateMutant,GetSyst..."
4,0,f5a0ad49337ebc87897698e70d03364e,trojan,dropper,_,"GetSystemTimeAsFileTime,NtCreateMutant,GetSyst..."
...,...,...,...,...,...,...
490,99,38beaa14fdd861489b7c1e88161266f9,trojan,_,_,"GetSystemTimeAsFileTime,LdrGetDllHandle,LdrGet..."
491,99,125e4dfc79fbfdadfeba0fea49533621,trojan,dropper,hacktool,"GetSystemTimeAsFileTime,LdrGetDllHandle,LdrGet..."
492,99,ce4823889c3c5f42ffd5654be87d8ff3,trojan,_,_,"GetSystemTimeAsFileTime,LdrGetDllHandle,LdrGet..."
493,99,d7f05bb88c5547e567e0a4ee484feba4,trojan,miner,hacktool,"GetSystemTimeAsFileTime,LdrGetDllHandle,LdrGet..."


In [3]:
#Drop rows that are falsely labelled, i.e. that have no malware type in any of the three type columns.
malicious_df.drop(malicious_df[(malicious_df['type1']=='_')&(malicious_df['type2']=='_')&(malicious_df['type3']=='_')].index, inplace=True)
malicious_df

Unnamed: 0,cluster,hash,type1,type2,type3,pattern
0,0,490d584c7d303ed35c673460b63f3ca8,trojan,dropper,pua,"GetSystemTimeAsFileTime,NtCreateMutant,GetSyst..."
1,0,9ab8ea1d2d68a0d4110df413e677976c,trojan,hacktool,_,"GetSystemTimeAsFileTime,NtCreateMutant,GetSyst..."
2,0,adbc74815ef2bd1ea4967abad812233d,trojan,_,_,"GetSystemTimeAsFileTime,NtCreateMutant,GetSyst..."
4,0,f5a0ad49337ebc87897698e70d03364e,trojan,dropper,_,"GetSystemTimeAsFileTime,NtCreateMutant,GetSyst..."
5,1,1ff43aa97f19dc8543aeaa1cd53e3885,trojan,adware,_,"GetSystemTimeAsFileTime,LdrLoadDll,LdrGetProce..."
...,...,...,...,...,...,...
490,99,38beaa14fdd861489b7c1e88161266f9,trojan,_,_,"GetSystemTimeAsFileTime,LdrGetDllHandle,LdrGet..."
491,99,125e4dfc79fbfdadfeba0fea49533621,trojan,dropper,hacktool,"GetSystemTimeAsFileTime,LdrGetDllHandle,LdrGet..."
492,99,ce4823889c3c5f42ffd5654be87d8ff3,trojan,_,_,"GetSystemTimeAsFileTime,LdrGetDllHandle,LdrGet..."
493,99,d7f05bb88c5547e567e0a4ee484feba4,trojan,miner,hacktool,"GetSystemTimeAsFileTime,LdrGetDllHandle,LdrGet..."
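
The row filter in cell 3 can also be expressed as a single boolean mask. The following is a minimal sketch on a toy frame; only the column names `type1`, `type2`, `type3` and the `_` placeholder come from this notebook, while the hashes and labels are made up for illustration.

```python
import pandas as pd

# Toy data for illustration only; hashes and labels are hypothetical.
toy_df = pd.DataFrame({
    "hash": ["a1", "b2", "c3", "d4"],
    "type1": ["trojan", "_", "trojan", "_"],
    "type2": ["dropper", "_", "_", "adware"],
    "type3": ["pua", "_", "_", "_"],
})

type_cols = ["type1", "type2", "type3"]

# True for rows where every type column holds the '_' placeholder.
unlabelled = toy_df[type_cols].eq("_").all(axis=1)

# Keep only rows that carry at least one real label.
labelled_df = toy_df[~unlabelled]
print(labelled_df["hash"].tolist())
```

Building the mask from a list of columns avoids repeating the comparison once per column, which helps if more type columns are added later.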


In [4]:
benign_df

Unnamed: 0,hash,t_0,t_1,t_2,t_3,t_4,t_5,t_6,t_7,t_8,...,t_91,t_92,t_93,t_94,t_95,t_96,t_97,t_98,t_99,pattern
0,5b51d65972a349f90a86984c26b12b30,SetErrorMode,OleInitialize,LdrGetDllHandle,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,...,NtClose,NtQueryDirectoryFile,NtClose,LdrGetProcedureAddress,CoCreateInstance,NtOpenSection,CreateDirectoryW,NtCreateFile,LdrGetProcedureAddress,"SetErrorMode,OleInitialize,LdrGetDllHandle,Ldr..."
1,ceb8cc125478fad641daa4e04e9b2f19,GetSystemInfo,NtAllocateVirtualMemory,NtOpenSection,GetTempPathW,CreateDirectoryW,GetFileAttributesW,FindFirstFileExW,DeleteFileW,NtQueryDirectoryFile,...,NtClose,NtCreateMutant,NtClose,LdrGetDllHandle,LdrGetProcedureAddress,NtClose,NtCreateMutant,NtClose,NtCreateFile,"GetSystemInfo,NtAllocateVirtualMemory,NtOpenSe..."
2,f108600edf46d7c20f6acc522aeba6df,GetSystemTimeAsFileTime,NtProtectVirtualMemory,SetUnhandledExceptionFilter,GetTimeZoneInformation,GetSystemTimeAsFileTime,GetTimeZoneInformation,GetSystemTimeAsFileTime,GetTimeZoneInformation,GetSystemTimeAsFileTime,...,SetErrorMode,GetFileAttributesExW,SetErrorMode,NtAllocateVirtualMemory,SetErrorMode,GetFileAttributesExW,SetErrorMode,FindFirstFileExW,NtQueryDirectoryFile,"GetSystemTimeAsFileTime,NtProtectVirtualMemory..."
3,711be6337cb78a948f04759a0bd210ce,GetSystemTimeAsFileTime,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,...,LdrGetProcedureAddress,NtAllocateVirtualMemory,LdrGetProcedureAddress,GetSystemMetrics,LdrLoadDll,LdrGetProcedureAddress,GetSystemMetrics,NtAllocateVirtualMemory,LdrLoadDll,"GetSystemTimeAsFileTime,LdrLoadDll,LdrGetProce..."
4,6de26f67ceb1e3303b889489010f4c3f,SetErrorMode,OleInitialize,LdrGetDllHandle,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,...,NtClose,NtQueryDirectoryFile,NtClose,LdrGetProcedureAddress,GetSystemWindowsDirectoryW,LoadStringW,GetSystemWindowsDirectoryW,GetSystemDirectoryW,RegOpenKeyExW,"SetErrorMode,OleInitialize,LdrGetDllHandle,Ldr..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65,c6c5563b17b7c763e51e4dbc3378ef1a,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,...,RegCloseKey,RegEnumKeyExA,RegOpenKeyExA,RegQueryValueExA,RegCloseKey,RegEnumKeyExA,RegOpenKeyExA,RegQueryValueExA,RegCloseKey,"LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,L..."
66,67db2476f1e9e962ca343f799b669225,GetSystemTimeAsFileTime,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,...,FindFirstFileExW,NtCreateFile,SetErrorMode,ReadProcessMemory,NtAllocateVirtualMemory,Module32NextW,ReadProcessMemory,SetErrorMode,FindFirstFileExW,"GetSystemTimeAsFileTime,LdrLoadDll,LdrGetProce..."
67,6e51234733dec1e25f2fc3245aea3d7c,GetSystemTimeAsFileTime,SetUnhandledExceptionFilter,NtCreateMutant,LoadStringW,FindResourceExW,LoadResource,FindResourceExW,LoadResource,NtAllocateVirtualMemory,...,FindFirstFileExW,NtClose,SetErrorMode,NtOpenSection,NtMapViewOfSection,RegOpenKeyExW,LdrLoadDll,LdrGetProcedureAddress,RegOpenKeyExW,"GetSystemTimeAsFileTime,SetUnhandledExceptionF..."
68,cfbd8d062e9baa98737a0260996f48c6,SetUnhandledExceptionFilter,NtAllocateVirtualMemory,SetErrorMode,OleInitialize,SetWindowsHookExW,NtAllocateVirtualMemory,SetWindowsHookExW,NtAllocateVirtualMemory,GetForegroundWindow,...,LdrGetProcedureAddress,NtAllocateVirtualMemory,LdrGetProcedureAddress,GetSystemMetrics,LdrLoadDll,LdrGetProcedureAddress,GetSystemMetrics,LdrGetProcedureAddress,GetKeyState,"SetUnhandledExceptionFilter,NtAllocateVirtualM..."


## 3. Identify Malware Types

In [5]:
'''Identify popular malware types in the dataset per Type as validated by VirusTotal.'''
types = ['type1', 'type2', 'type3']

def sortbyquantity(ls):
    return ls[1]

def identify(malware_type:str):
    print(f"{malware_type.upper()} LABEL")
    unique = list(malicious_df[malware_type].unique())
    if '_' in unique:
        unique.remove('_')
    print(unique)
    quantities = []
    for t in unique:
        quantities.append([t, len(malicious_df[malicious_df[malware_type]==t])])
    quantities.sort(key=sortbyquantity, reverse=True)
    for q in quantities:
        print(q)
    print("")
    
for i in types:
    identify(i)

TYPE1 LABEL
['trojan', 'downloader', 'adware', 'ransomware', 'pua', 'softomate']
['trojan', 399]
['downloader', 29]
['adware', 23]
['ransomware', 8]
['pua', 2]
['softomate', 1]

TYPE2 LABEL
['dropper', 'hacktool', 'adware', 'downloader', 'miner', 'trojan', 'ransomware', 'spyware', 'pua', 'virus', 'banker']
['adware', 280]
['trojan', 38]
['downloader', 17]
['ransomware', 9]
['dropper', 8]
['miner', 6]
['spyware', 5]
['banker', 5]
['pua', 4]
['virus', 2]
['hacktool', 1]

TYPE3 LABEL
['pua', 'trojan', 'worm', 'adware', 'downloader', 'dropper', 'virus', 'ransomware', 'spyware', 'hacktool']
['pua', 56]
['virus', 24]
['downloader', 23]
['trojan', 19]
['adware', 12]
['dropper', 5]
['hacktool', 2]
['worm', 1]
['ransomware', 1]
['spyware', 1]
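
The per-label tally above can likely be reproduced with `Series.value_counts()`, which counts and sorts in one step. This is a sketch only: `count_labels` is a hypothetical helper, and the toy labels below are not taken from the dataset.

```python
import pandas as pd

# Toy labels for illustration only; they are not taken from the dataset.
toy_df = pd.DataFrame({
    "type1": ["trojan", "trojan", "adware", "_", "trojan"],
})

def count_labels(df, column, placeholder="_"):
    """Count labels in one column, most frequent first, ignoring the placeholder."""
    labels = df[column]
    return labels[labels != placeholder].value_counts()

counts = count_labels(toy_df, "type1")
print(counts.to_dict())
```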



## 4. Extract Unique API Calls

In [6]:
malicious_apis = []
for i in range(malicious_df.shape[0]): #Only include samples with at least one type label
    if not (malicious_df['type1'].iloc[i] == '_' and malicious_df['type2'].iloc[i] == '_' and malicious_df['type3'].iloc[i] == '_'):
        malicious_apis += malicious_df['pattern'].iloc[i].split(',')
malicious_apis = list(pd.Series(malicious_apis).unique())
print("# of Unique API Calls in Verified Malicious Samples:", len(malicious_apis))
malicious_apis

# of Unique API Calls in Verified Malicious Samples: 112


['GetSystemTimeAsFileTime',
 'NtCreateMutant',
 'NtOpenKeyEx',
 'NtQueryKey',
 'LdrLoadDll',
 'LdrGetProcedureAddress',
 'RegOpenKeyExW',
 'RegQueryInfoKeyW',
 'RegEnumKeyExW',
 'RegEnumValueW',
 'RegCloseKey',
 'GetFileAttributesW',
 'RegQueryValueExW',
 'NtOpenFile',
 'NtQueryDirectoryFile',
 'NtClose',
 'NtQuerySystemInformation',
 'NtProtectVirtualMemory',
 'GetSystemDirectoryW',
 'LdrGetDllHandle',
 'NtOpenKey',
 'NtQueryValueKey',
 'NtCreateFile',
 'GetFileSize',
 'NtCreateSection',
 'NtMapViewOfSection',
 'GetSystemInfo',
 'NtUnmapViewOfSection',
 'GetFileType',
 'SetUnhandledExceptionFilter',
 'NtAllocateVirtualMemory',
 'GetSystemWindowsDirectoryW',
 'CreateActCtxW',
 'NtOpenDirectoryObject',
 'NtFreeVirtualMemory',
 'RegOpenKeyExA',
 'RegQueryValueExA',
 'CoInitializeEx',
 'WSAStartup',
 'GetSystemMetrics',
 'CreateThread',
 'CryptAcquireContextW',
 'CryptDecrypt',
 'GetUserNameW',
 'NtDuplicateObject',
 'NtCreateThreadEx',
 'NtResumeThread',
 'GetFileVersionInfoSizeW',
 'Get

In [7]:
benign_apis = []
for i in range(benign_df.shape[0]): #Include every benign sample
    benign_apis += benign_df['pattern'].iloc[i].split(',')
benign_apis = list(pd.Series(benign_apis).unique())
print("# of Unique API Calls in Benign Samples:", len(benign_apis))
benign_apis

# of Unique API Calls in Benign Samples: 134


['SetErrorMode',
 'OleInitialize',
 'LdrGetDllHandle',
 'LdrLoadDll',
 'LdrGetProcedureAddress',
 'NtOpenSection',
 'NtMapViewOfSection',
 'RegOpenKeyExW',
 'RegQueryValueExW',
 'RegCloseKey',
 'NtClose',
 'NtOpenKey',
 'NtQueryValueKey',
 'GetSystemWindowsDirectoryW',
 'NtCreateFile',
 'NtCreateSection',
 'RegOpenKeyExA',
 'CreateActCtxW',
 'GetSystemDirectoryW',
 'GetVolumeNameForVolumeMountPointW',
 'RegEnumKeyW',
 'NtQueryDirectoryFile',
 'CoCreateInstance',
 'CreateDirectoryW',
 'GetSystemInfo',
 'NtAllocateVirtualMemory',
 'GetTempPathW',
 'GetFileAttributesW',
 'FindFirstFileExW',
 'DeleteFileW',
 'RemoveDirectoryW',
 'FindResourceA',
 'LoadResource',
 'SizeofResource',
 'GetSystemTimeAsFileTime',
 'NtCreateMutant',
 'NtQueryAttributesFile',
 'NtProtectVirtualMemory',
 'SetUnhandledExceptionFilter',
 'GetTimeZoneInformation',
 'GetFileType',
 'NtWriteFile',
 'WSAStartup',
 'GetFileAttributesExW',
 'SHGetFolderPathW',
 'NtQuerySystemInformation',
 'CoInitializeEx',
 'FindResource
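
The two extraction loops in this section follow the same pattern: split each `pattern` string on commas, pool the pieces, and keep the distinct names. A vectorised sketch of that idea, assuming a `pattern` column of comma-separated API names as in the frames above (the toy patterns are made up):

```python
import pandas as pd

# Toy patterns for illustration only.
toy_df = pd.DataFrame({
    "pattern": [
        "LdrLoadDll,NtClose,LdrLoadDll",
        "NtClose,NtCreateFile",
    ],
})

# Split each pattern into a list, flatten to one call per row,
# then keep distinct names in order of first appearance.
toy_apis = toy_df["pattern"].str.split(",").explode().unique().tolist()
print(toy_apis)
```

Like the loop version, this keeps the order in which each API call is first seen.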

## 5. Identify the API Calls Found Only in Malicious Samples

In [8]:
unique = []
for m in malicious_apis:
    if m not in benign_apis:
        unique.append(m)
print("No. of API Calls found only in Malicious Samples:", len(unique), f"({len(unique)/len(benign_apis)*100:.2f}% of the number of unique API Calls in Benign Samples)")
print(f"Coverage of 'Malicious-only' API Calls to Official API Calls Oliveira.csv: {(len(unique)/len(APIS))*100:.4f}%")
print("")
print(unique)

No. of API Calls found only in Malicious Samples: 11 (8.21% of the number of unique API Calls in Benign Samples)
Coverage of 'Malicious-only' API Calls to Official API Calls Oliveira.csv: 3.5831%

['CryptDecrypt', 'CryptCreateHash', 'CryptHashData', 'SetEndOfFile', 'CreateProcessInternalW', 'SearchPathW', '__exception__', 'GetShortPathNameW', 'WriteProcessMemory', 'NtSuspendThread', 'GlobalMemoryStatus']
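
The membership test `m not in benign_apis` scans a list on every iteration. A set-based variant gives the same result with constant-time lookups. The sketch below uses toy lists: the names are real Windows API calls, but which list each one belongs to is made up.

```python
# Toy lists for illustration only; membership is hypothetical.
toy_malicious = ["NtClose", "CryptDecrypt", "LdrLoadDll", "WriteProcessMemory"]
toy_benign = ["LdrLoadDll", "NtClose", "SetErrorMode"]

benign_lookup = set(toy_benign)  # set membership tests are O(1) on average

# Preserve the order of first appearance in the malicious list.
malicious_only = [api for api in toy_malicious if api not in benign_lookup]

# Fraction of the malicious API calls that never occur in benign samples.
share_of_malicious = len(malicious_only) / len(toy_malicious)

print(malicious_only)
print(f"{share_of_malicious:.2%}")
```

Dividing by the size of the malicious list, as in the last step, states how much of the malicious vocabulary is exclusive to malware, which is a different quantity from the ratio against the benign list printed in the cell above.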


Fun Fact: The results of item #5 contain several Crypt-related API calls (`CryptDecrypt`, `CryptCreateHash`, `CryptHashData`). According to the talk of Sir Mantua, such calls are a possible key indicator that ransomware is present in the system. This is supported by the fact that some malicious samples are labelled as ransomware, as seen in the `Type2` malware types.
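
One way to probe this observation is to check, sample by sample, whether a pattern contains any `Crypt`-prefixed call and to compare that against the sample's labels. The following is a rough sketch on toy records: `has_crypt_call` is a hypothetical helper, and the hashes, labels and patterns are invented.

```python
# Toy records for illustration only; hashes, labels and patterns are hypothetical.
toy_samples = [
    {"hash": "a1", "types": ["trojan", "ransomware", "_"],
     "pattern": "LdrLoadDll,CryptAcquireContextW,CryptDecrypt"},
    {"hash": "b2", "types": ["trojan", "adware", "_"],
     "pattern": "LdrLoadDll,NtClose"},
    {"hash": "c3", "types": ["downloader", "_", "_"],
     "pattern": "CryptHashData,NtClose"},
]

def has_crypt_call(pattern):
    """Return True if any call in the comma-separated pattern starts with 'Crypt'."""
    return any(call.startswith("Crypt") for call in pattern.split(","))

for sample in toy_samples:
    is_ransomware = "ransomware" in sample["types"]
    print(sample["hash"], "ransomware:", is_ransomware,
          "crypt call:", has_crypt_call(sample["pattern"]))

crypt_hashes = [s["hash"] for s in toy_samples if has_crypt_call(s["pattern"])]
```

A `Crypt` prefix alone is not conclusive. In this notebook `CryptAcquireContextW` appears among the malicious API calls but not in the malicious-only list, so it also occurs in benign samples.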

## 6. Identify the API Calls Found in Both Malicious and Benign Samples

Due to the proliferation of trojan-like malware, malicious and benign samples may well share a large number of API calls.

In [9]:
same = []
for m in malicious_apis:
    if m in benign_apis:
        same.append(m)
print("No. of API Calls in Malicious Samples that is found in API Calls in Benign Samples:", len(same), f"({len(same)/len(benign_apis)*100:.2f}% Matches API Calls of Benign Samples)")
print(f"Coverage of 'Same-to-Malicious-Benign-Samples' API Calls to Official API Calls Oliveira.csv: {(len(same)/len(APIS))*100:.4f}%")
print("")
print(same)

No. of API Calls in Malicious Samples that is found in API Calls in Benign Samples: 101 (75.37% Matches API Calls of Benign Samples)
Coverage of 'Same-to-Malicious-Benign-Samples' API Calls to Official API Calls Oliveira.csv: 32.8990%

['GetSystemTimeAsFileTime', 'NtCreateMutant', 'NtOpenKeyEx', 'NtQueryKey', 'LdrLoadDll', 'LdrGetProcedureAddress', 'RegOpenKeyExW', 'RegQueryInfoKeyW', 'RegEnumKeyExW', 'RegEnumValueW', 'RegCloseKey', 'GetFileAttributesW', 'RegQueryValueExW', 'NtOpenFile', 'NtQueryDirectoryFile', 'NtClose', 'NtQuerySystemInformation', 'NtProtectVirtualMemory', 'GetSystemDirectoryW', 'LdrGetDllHandle', 'NtOpenKey', 'NtQueryValueKey', 'NtCreateFile', 'GetFileSize', 'NtCreateSection', 'NtMapViewOfSection', 'GetSystemInfo', 'NtUnmapViewOfSection', 'GetFileType', 'SetUnhandledExceptionFilter', 'NtAllocateVirtualMemory', 'GetSystemWindowsDirectoryW', 'CreateActCtxW', 'NtOpenDirectoryObject', 'NtFreeVirtualMemory', 'RegOpenKeyExA', 'RegQueryValueExA', 'CoInitializeEx', 'WSAStar
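
The overlap can be measured against either side, or symmetrically with a Jaccard index (shared calls divided by the union of both sets). A minimal sketch with toy sets, whose membership is made up for illustration:

```python
# Toy sets for illustration only; membership is hypothetical.
toy_malicious = {"NtClose", "CryptDecrypt", "LdrLoadDll", "WriteProcessMemory"}
toy_benign = {"LdrLoadDll", "NtClose", "SetErrorMode"}

shared = toy_malicious & toy_benign   # calls seen in both groups
union = toy_malicious | toy_benign    # calls seen in either group

share_of_benign = len(shared) / len(toy_benign)
share_of_malicious = len(shared) / len(toy_malicious)
jaccard = len(shared) / len(union)

print(sorted(shared))
print(f"{share_of_benign:.2%} of benign, "
      f"{share_of_malicious:.2%} of malicious, Jaccard {jaccard:.2f}")
```

With the counts reported in this notebook (101 shared, 112 malicious, 134 benign), the same arithmetic gives 101/134 ≈ 75.37% of the benign calls, 101/112 ≈ 90.18% of the malicious calls, and a Jaccard index of 101/145 ≈ 0.70.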