# Instance Compare

This notebook aims to answer the question **Are there any unique indicators of malicious samples in terms of specific API Call(s) alone?** from **4.2.6. Dataset Analysis** of the study.

Note that this notebook uses only the verified xxxx_SampleHash_Common.csv file, which represents a significant majority of the entire Oliveira dataset.

## Import Libraries/Datasets

In [1]:
import pandas as pd
import time

malicious_df = pd.read_csv('../Clustering/Malicious/Manual_DBSCAN_Encoded_Clustering.csv', low_memory=False) #This should point to a verified <DataClustering>_SampleHash_Common.csv file
benign_df = pd.read_csv('../Clustering/Benign/API_Patterns_Benign.csv') #This should point to the API_Patterns.csv file

# Drop rows that have no malware type label
malicious_df.dropna(inplace=True, subset=['type'])
malicious_df['type'].unique()

# Load the list of API calls (a single comma-separated line)
API_LIST = "../api_calls.txt"
DELIMITER = "NaN"
with open(API_LIST, "r") as API_FILE:
    APIS = API_FILE.readline().strip().split(',')  # strip() keeps the trailing newline out of the last name
# APIS.append(DELIMITER) #serves as a label for NaN values for Instance-based datasets

## DataFrame Preview

In [2]:
def list_to_str(ls: list):
    # Join the items into one space-separated string
    return " ".join(str(l) for l in ls)

def inject_patterns(inner_df:pd.DataFrame):
    patterns = []
    for row in range(inner_df.shape[0]):
        patterns.append(list_to_str(inner_df.iloc[row,2:5].transpose().to_list()))
    inner_df['type_pattern'] = patterns
    return inner_df

malicious_df.replace(to_replace='-',value='_', inplace=True)
malicious_df.drop(malicious_df[(malicious_df['type']=='_')].index, inplace=True) #Drop row that is falsely labelled.
malicious_df = inject_patterns(malicious_df)

print("Malicious DF")
display(malicious_df)

print("Benign DF")
display(benign_df)

Malicious DF


Unnamed: 0,hash,t_0,t_1,t_2,t_3,t_4,t_5,t_6,t_7,t_8,...,t_95,t_96,t_97,t_98,t_99,malware,type,pattern,cluster,type_pattern
0,071e8c3f8922e186e57548cd4c703a5d,RegOpenKeyExA,NtOpenKey,NtQueryValueKey,NtClose,NtOpenKey,NtQueryValueKey,NtClose,NtQueryAttributesFile,LoadStringA,...,NtClose,GetSystemMetrics,NtAllocateVirtualMemory,CreateActCtxW,GetSystemWindowsDirectoryW,1,trojan,"RegOpenKeyExA,NtOpenKey,NtQueryValueKey,NtClos...",0,NtOpenKey NtQueryValueKey NtClose
1,33f8e6d08a6aae939f25a8e0d63dd523,GetSystemTimeAsFileTime,NtAllocateVirtualMemory,NtFreeVirtualMemory,NtAllocateVirtualMemory,LdrGetDllHandle,LdrGetProcedureAddress,LdrGetDllHandle,LdrGetProcedureAddress,LdrGetDllHandle,...,NtCreateFile,NtCreateSection,NtMapViewOfSection,NtClose,GetSystemMetrics,1,pua,"GetSystemTimeAsFileTime,NtAllocateVirtualMemor...",1,NtAllocateVirtualMemory NtFreeVirtualMemory Nt...
2,b68abd064e975e1c6d5f25e748663076,SetUnhandledExceptionFilter,OleInitialize,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,...,RegOpenKeyExA,RegQueryValueExA,RegCloseKey,RegEnumKeyExA,RegOpenKeyExA,1,trojan,"SetUnhandledExceptionFilter,OleInitialize,LdrL...",2,OleInitialize LdrLoadDll LdrGetProcedureAddress
3,72049be7bd30ea61297ea624ae198067,GetSystemTimeAsFileTime,NtAllocateVirtualMemory,NtFreeVirtualMemory,NtAllocateVirtualMemory,LdrGetDllHandle,LdrGetProcedureAddress,LdrGetDllHandle,LdrGetProcedureAddress,LdrGetDllHandle,...,NtFreeVirtualMemory,NtAllocateVirtualMemory,NtWriteVirtualMemory,NtProtectVirtualMemory,NtWriteVirtualMemory,1,trojan,"GetSystemTimeAsFileTime,NtAllocateVirtualMemor...",3,NtAllocateVirtualMemory NtFreeVirtualMemory Nt...
4,c9b3700a77facf29172f32df6bc77f48,GetSystemTimeAsFileTime,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,...,RegOpenKeyExW,RegQueryValueExW,RegOpenKeyExW,RegQueryValueExW,RegOpenKeyExW,1,trojan,"GetSystemTimeAsFileTime,LdrLoadDll,LdrGetProce...",4,LdrLoadDll LdrGetProcedureAddress LdrLoadDll
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40154,e3d6d58faa040f0f9742c9d0eaf58be4,GetSystemTimeAsFileTime,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,...,RegQueryValueExW,RegOpenKeyExW,RegQueryValueExW,RegOpenKeyExW,RegQueryValueExW,1,trojan,"GetSystemTimeAsFileTime,LdrLoadDll,LdrGetProce...",17,LdrLoadDll LdrGetProcedureAddress LdrLoadDll
40155,9b917bab7f32188ae40c744f2be9aaf8,GetSystemTimeAsFileTime,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,...,EnumWindows,GetSystemTimeAsFileTime,NtDelayExecution,EnumWindows,GetSystemTimeAsFileTime,1,trojan,"GetSystemTimeAsFileTime,LdrLoadDll,LdrGetProce...",11490,LdrLoadDll LdrGetProcedureAddress LdrLoadDll
40156,35a18ee05f75f04912018d9f462cb990,GetSystemTimeAsFileTime,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,...,RegOpenKeyExW,RegQueryValueExW,RegOpenKeyExW,RegQueryValueExW,RegOpenKeyExW,1,trojan,"GetSystemTimeAsFileTime,LdrLoadDll,LdrGetProce...",11491,LdrLoadDll LdrGetProcedureAddress LdrLoadDll
40157,654139d715abcf7ecdddbef5a84f224b,GetSystemTimeAsFileTime,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,...,RegQueryValueExW,RegOpenKeyExW,RegQueryValueExW,RegOpenKeyExW,RegQueryValueExW,1,trojan,"GetSystemTimeAsFileTime,LdrLoadDll,LdrGetProce...",17,LdrLoadDll LdrGetProcedureAddress LdrLoadDll


Benign DF


Unnamed: 0,hash,t_0,t_1,t_2,t_3,t_4,t_5,t_6,t_7,t_8,...,t_91,t_92,t_93,t_94,t_95,t_96,t_97,t_98,t_99,pattern
0,5b51d65972a349f90a86984c26b12b30,SetErrorMode,OleInitialize,LdrGetDllHandle,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,...,NtClose,NtQueryDirectoryFile,NtClose,LdrGetProcedureAddress,CoCreateInstance,NtOpenSection,CreateDirectoryW,NtCreateFile,LdrGetProcedureAddress,"SetErrorMode,OleInitialize,LdrGetDllHandle,Ldr..."
1,ceb8cc125478fad641daa4e04e9b2f19,GetSystemInfo,NtAllocateVirtualMemory,NtOpenSection,GetTempPathW,CreateDirectoryW,GetFileAttributesW,FindFirstFileExW,DeleteFileW,NtQueryDirectoryFile,...,NtClose,NtCreateMutant,NtClose,LdrGetDllHandle,LdrGetProcedureAddress,NtClose,NtCreateMutant,NtClose,NtCreateFile,"GetSystemInfo,NtAllocateVirtualMemory,NtOpenSe..."
2,f108600edf46d7c20f6acc522aeba6df,GetSystemTimeAsFileTime,NtProtectVirtualMemory,SetUnhandledExceptionFilter,GetTimeZoneInformation,GetSystemTimeAsFileTime,GetTimeZoneInformation,GetSystemTimeAsFileTime,GetTimeZoneInformation,GetSystemTimeAsFileTime,...,SetErrorMode,GetFileAttributesExW,SetErrorMode,NtAllocateVirtualMemory,SetErrorMode,GetFileAttributesExW,SetErrorMode,FindFirstFileExW,NtQueryDirectoryFile,"GetSystemTimeAsFileTime,NtProtectVirtualMemory..."
3,711be6337cb78a948f04759a0bd210ce,GetSystemTimeAsFileTime,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,...,LdrGetProcedureAddress,NtAllocateVirtualMemory,LdrGetProcedureAddress,GetSystemMetrics,LdrLoadDll,LdrGetProcedureAddress,GetSystemMetrics,NtAllocateVirtualMemory,LdrLoadDll,"GetSystemTimeAsFileTime,LdrLoadDll,LdrGetProce..."
4,6de26f67ceb1e3303b889489010f4c3f,SetErrorMode,OleInitialize,LdrGetDllHandle,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,...,NtClose,NtQueryDirectoryFile,NtClose,LdrGetProcedureAddress,GetSystemWindowsDirectoryW,LoadStringW,GetSystemWindowsDirectoryW,GetSystemDirectoryW,RegOpenKeyExW,"SetErrorMode,OleInitialize,LdrGetDllHandle,Ldr..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1074,d282ef96a93986f89825508812958354,SetErrorMode,OleInitialize,LdrGetDllHandle,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,...,NtClose,LdrGetProcedureAddress,NtAllocateVirtualMemory,LdrGetDllHandle,LdrGetProcedureAddress,LdrGetDllHandle,LdrGetProcedureAddress,LdrGetDllHandle,LdrGetProcedureAddress,"SetErrorMode,OleInitialize,LdrGetDllHandle,Ldr..."
1075,c0389d256f976044adf570f0df908953,GetSystemTimeAsFileTime,SetUnhandledExceptionFilter,GetCursorPos,SetErrorMode,FindResourceW,SetWindowsHookExW,CoInitializeEx,NtDuplicateObject,NtAllocateVirtualMemory,...,NtAllocateVirtualMemory,LdrLoadDll,LdrGetProcedureAddress,NtAllocateVirtualMemory,GetSystemMetrics,RegOpenKeyExW,NtAllocateVirtualMemory,GetSystemMetrics,NtAllocateVirtualMemory,"GetSystemTimeAsFileTime,SetUnhandledExceptionF..."
1076,20316e717de5db169aecbb67377504ce,SetUnhandledExceptionFilter,NtCreateMutant,NtAllocateVirtualMemory,NtClose,NtCreateMutant,NtClose,NtCreateMutant,NtClose,NtAllocateVirtualMemory,...,RegOpenKeyExW,RegQueryValueExW,RegCloseKey,RegOpenKeyExW,RegQueryValueExW,RegCloseKey,RegOpenKeyExW,RegQueryValueExW,RegCloseKey,"SetUnhandledExceptionFilter,NtCreateMutant,NtA..."
1077,ce945d424b93ea73fbbedf0254f6bc07,NtClose,NtOpenKey,NtQueryValueKey,NtClose,NtOpenKey,NtQueryValueKey,NtClose,LdrGetDllHandle,LdrGetProcedureAddress,...,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrGetDllHandle,FindResourceExW,LoadResource,"NtClose,NtOpenKey,NtQueryValueKey,NtClose,NtOp..."
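
The row-by-row joining done by `inject_patterns` can also be written with pandas' own row-wise `apply`. This is a minimal sketch on toy data, not the notebook's actual code: `join_columns` and `toy_df` are hypothetical names, and the column positions used here are chosen for the toy frame only.

```python
import pandas as pd

def join_columns(df: pd.DataFrame, start: int, stop: int) -> pd.Series:
    """Join the values in column positions [start, stop) of each row with spaces."""
    return df.iloc[:, start:stop].astype(str).apply(" ".join, axis=1)

# Toy frame standing in for malicious_df (values are illustrative only).
toy_df = pd.DataFrame({
    "hash": ["h0", "h1"],
    "t_0": ["NtClose", "LdrLoadDll"],
    "t_1": ["NtOpenKey", "NtClose"],
    "t_2": ["NtClose", "NtClose"],
})
toy_df["joined"] = join_columns(toy_df, 1, 4)
print(toy_df["joined"].tolist())
```

Selecting by position, as `iloc` does, depends on the column order of the loaded CSV, so it is worth checking the preview to confirm which columns end up in the joined string.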


## Extract Unique API Calls

In [3]:
str_output = ""

# Collect every API call that appears in the malicious samples
malicious_apis = []
for i in range(malicious_df.shape[0]): # Skip rows whose type, Type 2 and Type 3 labels are all '_'
    if not (malicious_df['type'].iloc[i] == '_' and malicious_df['Type 2'].iloc[i] == '_' and malicious_df['Type 3'].iloc[i] == '_'):
        malicious_apis += malicious_df['pattern'].iloc[i].split(',')
malicious_apis = list(pd.Series(malicious_apis).unique())
str_output += f"# of Unique API Calls in Verified Malicious Samples: {len(malicious_apis)}\n"
str_output += str(malicious_apis) + "\n\n"

# Collect every API call that appears in the benign samples
benign_apis = []
for i in range(benign_df.shape[0]):
    benign_apis += benign_df['pattern'].iloc[i].split(',')
benign_apis = list(pd.Series(benign_apis).unique())
str_output += f"# of Unique API Calls in Benign Samples: {len(benign_apis)}\n" 
str_output += str(benign_apis) + "\n\n"

# Note: the next cell writes to the same path and overwrites this file.
with open("./Output/4 Unique_APICalls_MaliciousOnly.txt", 'w') as f:
    f.write(str_output)  # the with-block flushes and closes the file on exit
print(str_output)

# of Unique API Calls in Verified Malicious Samples: 248
['RegOpenKeyExA', 'NtOpenKey', 'NtQueryValueKey', 'NtClose', 'NtQueryAttributesFile', 'LoadStringA', 'NtAllocateVirtualMemory', 'LdrGetDllHandle', 'LdrGetProcedureAddress', 'GetSystemMetrics', 'FindResourceExW', 'LoadResource', 'LdrLoadDll', 'DrawTextExW', 'FindResourceA', 'SizeofResource', 'GetSystemWindowsDirectoryW', 'NtCreateFile', 'NtCreateSection', 'NtMapViewOfSection', 'CreateActCtxW', 'GetSystemTimeAsFileTime', 'NtFreeVirtualMemory', 'SetUnhandledExceptionFilter', 'GetFileSize', 'SetFilePointer', 'NtReadFile', 'RegOpenKeyExW', 'RegCreateKeyExW', 'RegCloseKey', 'RegSetValueExW', 'IsDebuggerPresent', 'CoInitializeEx', 'GetForegroundWindow', 'OleInitialize', 'GetNativeSystemInfo', 'RegQueryValueExW', 'LookupPrivilegeValueW', 'GetUserNameA', 'RegEnumKeyExA', 'RegQueryValueExA', 'NtProtectVirtualMemory', 'GetSystemInfo', 'NtCreateMutant', 'NtOpenKeyEx', 'NtQuerySystemInformation', 'GetSystemDirectoryW', 'NtWriteVirtualMemory',
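
The per-row loops in the cell above can also be expressed with pandas' vectorized string methods. The following is a minimal sketch on a toy frame; `unique_api_calls` and `toy_df` are hypothetical names, and the only thing assumed from the notebook is a comma-separated `pattern` column.

```python
import pandas as pd

def unique_api_calls(df: pd.DataFrame, column: str = "pattern") -> list:
    """Return the distinct API calls in a comma-separated column, in first-seen order."""
    return df[column].str.split(",").explode().unique().tolist()

# Toy data standing in for malicious_df / benign_df (not taken from the dataset).
toy_df = pd.DataFrame({"pattern": ["NtClose,NtOpenKey,NtClose", "NtOpenKey,LdrLoadDll"]})
toy_apis = unique_api_calls(toy_df)
print(toy_apis)
```

`Series.unique()` keeps the order of first appearance, so the result should list the calls in the same order as the loop-based version.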

## Identify the Unique API Calls Found Only in Malicious Samples

In [4]:
str_output = ""

# API calls seen in malicious samples but never in benign samples
unique = [m for m in malicious_apis if m not in benign_apis]
# The percentage in the first line is len(unique) relative to the number of distinct benign API calls
str_output += f"No. of truly unique API Calls only found in Malicious Samples: {len(unique)} ({len(unique)/len(benign_apis)*100:.2f}% Matches API Calls of Benign Samples)\n"
str_output += f"Coverage of 'Malicious-only' API Calls to Official API Calls Oliveira.csv: {(len(unique)/len(APIS))*100:.4f}%\n"
str_output += "Unique API Calls to Verified Malicious Samples only: "+ str(unique) + "\n"

with open("./Output/4 Unique_APICalls_MaliciousOnly.txt", 'w') as f:
    f.write(str_output)  # the with-block flushes and closes the file on exit
print(str_output)

No. of truly unique API Calls only found in Malicious Samples: 60 (30.00% Matches API Calls of Benign Samples)
Coverage of 'Malicious-only' API Calls to Official API Calls Oliveira.csv: 19.5440%
Unique API Calls to Verified Malicious Samples only: ['NtWriteVirtualMemory', 'getaddrinfo', 'MoveFileWithProgressW', 'CryptDecrypt', 'OpenSCManagerA', 'OpenServiceA', 'StartServiceA', 'SetFileTime', 'connect', 'CryptProtectData', 'CryptEncrypt', 'GetBestInterfaceEx', 'NtTerminateThread', 'CopyFileA', 'CreateServiceA', 'InternetOpenA', 'InternetConnectA', 'HttpOpenRequestA', 'NtGetContextThread', 'NtReadVirtualMemory', 'NtSetContextThread', 'GetDiskFreeSpaceW', 'InternetOpenW', 'InternetGetConnectedState', 'InternetCrackUrlA', 'HttpSendRequestA', 'InternetCloseHandle', 'CreateServiceW', 'DeleteUrlCacheEntryA', 'gethostbyname', 'send', 'DeleteUrlCacheEntryW', 'WSARecv', 'shutdown', 'InternetConnectW', 'HttpOpenRequestW', 'CreateJobObjectW', 'CopyFileExW', 'RtlRemoveVectoredExceptionHandler', 'Cr
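
The comparison above is a set difference, so it can also be sketched with Python sets, which avoid scanning the benign list once per malicious call. This is an illustrative sketch on toy lists; `compare_api_sets` and the toy names are hypothetical. It also reports the malicious-only share relative to the malicious calls, which is an alternative denominator to the one used in the cell above (with the reported counts, 60 of 248 would be about 24.19%).

```python
def compare_api_sets(malicious, benign):
    """Split the malicious API calls into malicious-only and shared-with-benign."""
    malicious_set, benign_set = set(malicious), set(benign)
    only_malicious = sorted(malicious_set - benign_set)  # never seen in benign samples
    shared = sorted(malicious_set & benign_set)          # seen in both
    return only_malicious, shared

# Toy lists standing in for malicious_apis / benign_apis (illustrative only).
toy_malicious = ["NtClose", "connect", "CryptEncrypt", "LdrLoadDll"]
toy_benign = ["NtClose", "LdrLoadDll", "GetSystemMetrics"]

only_malicious, shared = compare_api_sets(toy_malicious, toy_benign)
share_of_malicious = len(only_malicious) / len(set(toy_malicious)) * 100
print(only_malicious, shared, f"{share_of_malicious:.2f}%")
```

Sorting is used here only to make the output deterministic; it does not preserve the first-seen order that the list-based cell produces.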

**Fun Fact:** The results of item #5 include some Crypt-related API calls (e.g., `CryptDecrypt`). According to the talk of Sir Mantua (during his 4th hr. talk), such calls are a possible key indicator that ransomware is present in the system. This is consistent with the fact that some malicious samples are labelled as ransomware in the `Type 2` & `Type 3` malware types.

In [5]:
str_output = ""
for u in unique:
    str_output += u + "\n\n"
    # Substring match on the pattern column; regex=False treats the API name as a literal string
    str_output += str(malicious_df[malicious_df['pattern'].str.contains(u, regex=False)]['type'].value_counts())
    str_output += "\n=====================================================================\n"

with open("./Output/4 APICalls_MalwareTypes_MaliciousOnly.txt", 'w') as f:
    f.write(str_output)  # the with-block flushes and closes the file on exit
print(str_output)

NtWriteVirtualMemory

type
trojan    4
miner     2
Name: count, dtype: int64
getaddrinfo

type
adware        136
trojan         61
pua            10
downloader      1
miner           1
ransomware      1
Name: count, dtype: int64
MoveFileWithProgressW

type
trojan        125
pua            41
adware          7
miner           2
ransomware      1
Name: count, dtype: int64
CryptDecrypt

type
trojan        2551
pua            233
adware         120
downloader      97
Name: count, dtype: int64
OpenSCManagerA

type
trojan        168
adware          4
pua             3
downloader      2
virus           2
Name: count, dtype: int64
OpenServiceA

type
trojan        135
pua             3
downloader      2
adware          2
virus           1
Name: count, dtype: int64
StartServiceA

type
trojan    41
virus      1
Name: count, dtype: int64
SetFileTime

type
trojan    94
pua       15
adware     5
miner      1
Name: count, dtype: int64
connect

type
adware        103
trojan         62
pua            2
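
One caveat about the per-API counts: `str.contains` matches substrings, so an API name that is a prefix of another one (for instance `NtOpenKey` and `NtOpenKeyEx`, which both appear in the malicious list) is counted for rows containing either. A minimal sketch of a whole-token match follows; `type_counts_for_api` and `toy_df` are hypothetical names, and the toy rows are illustrative rather than taken from the dataset.

```python
import pandas as pd

def type_counts_for_api(df: pd.DataFrame, api: str) -> pd.Series:
    """Count malware types among rows whose pattern contains `api` as a whole token."""
    has_api = df["pattern"].str.split(",").apply(lambda calls: api in calls)
    return df.loc[has_api, "type"].value_counts()

# Toy frame standing in for malicious_df (illustrative only).
toy_df = pd.DataFrame({
    "pattern": ["NtOpenKey,NtClose", "NtOpenKeyEx,NtClose", "NtClose,NtOpenKey"],
    "type": ["trojan", "adware", "trojan"],
})

# Substring matching also hits the NtOpenKeyEx row; token matching does not.
substring_hits = int(toy_df["pattern"].str.contains("NtOpenKey", regex=False).sum())
exact_counts = type_counts_for_api(toy_df, "NtOpenKey")
print(substring_hits)
print(exact_counts.to_dict())
```

Whether the difference matters for the reported counts depends on how many API names in the dataset are substrings of others, which this sketch does not measure.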

## Identify the API Calls Found in Both Malicious and Benign Samples

*Given the proliferation of trojan-like malware, it is plausible that many API calls are shared between malicious and benign samples.*

In [6]:
str_output = ""
# API calls seen in both malicious and benign samples
same = [m for m in malicious_apis if m in benign_apis]
str_output += f"No. of API Calls in Malicious Samples that is found in API Calls in Benign Samples: {len(same)} ({len(same)/len(benign_apis)*100:.2f}% Matches API Calls of Benign Samples)\n"
str_output += f"Coverage of 'Same-to-Malicious-Benign-Samples' API Calls to Official API Calls Oliveira.csv: {(len(same)/len(APIS))*100:.4f}%\n"
str_output += "Same API Calls to both Verified Malicious and Benign Samples: "+ str(same) + "\n"

with open("./Output/4 Unique_APICalls_MaliciousBenign.txt", 'w') as f:
    f.write(str_output)  # the with-block flushes and closes the file on exit
print(str_output)

No. of API Calls in Malicious Samples that is found in API Calls in Benign Samples: 188 (94.00% Matches API Calls of Benign Samples)
Coverage of 'Same-to-Malicious-Benign-Samples' API Calls to Official API Calls Oliveira.csv: 61.2378%
Same API Calls to both Verified Malicious and Benign Samples: ['RegOpenKeyExA', 'NtOpenKey', 'NtQueryValueKey', 'NtClose', 'NtQueryAttributesFile', 'LoadStringA', 'NtAllocateVirtualMemory', 'LdrGetDllHandle', 'LdrGetProcedureAddress', 'GetSystemMetrics', 'FindResourceExW', 'LoadResource', 'LdrLoadDll', 'DrawTextExW', 'FindResourceA', 'SizeofResource', 'GetSystemWindowsDirectoryW', 'NtCreateFile', 'NtCreateSection', 'NtMapViewOfSection', 'CreateActCtxW', 'GetSystemTimeAsFileTime', 'NtFreeVirtualMemory', 'SetUnhandledExceptionFilter', 'GetFileSize', 'SetFilePointer', 'NtReadFile', 'RegOpenKeyExW', 'RegCreateKeyExW', 'RegCloseKey', 'RegSetValueExW', 'IsDebuggerPresent', 'CoInitializeEx', 'GetForegroundWindow', 'OleInitialize', 'GetNativeSystemInfo', 'RegQuer

In [7]:
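# Note: str_output is not reset here, so the summary from the previous cell is carried into this output and file.
for s in same:
    str_output += s + "\n\n"
    # Substring match on the pattern column; regex=False treats the API name as a literal string
    str_output += str(malicious_df[malicious_df['pattern'].str.contains(s, regex=False)]['type'].value_counts())
    str_output += "\n=====================================================================\n"

with open("./Output/4 APICalls_MalwareTypes_MaliciousBenign.txt", 'w') as f:
    f.write(str_output)  # the with-block flushes and closes the file on exit
print(str_output)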

No. of API Calls in Malicious Samples that is found in API Calls in Benign Samples: 188 (94.00% Matches API Calls of Benign Samples)
Coverage of 'Same-to-Malicious-Benign-Samples' API Calls to Official API Calls Oliveira.csv: 61.2378%
Same API Calls to both Verified Malicious and Benign Samples: ['RegOpenKeyExA', 'NtOpenKey', 'NtQueryValueKey', 'NtClose', 'NtQueryAttributesFile', 'LoadStringA', 'NtAllocateVirtualMemory', 'LdrGetDllHandle', 'LdrGetProcedureAddress', 'GetSystemMetrics', 'FindResourceExW', 'LoadResource', 'LdrLoadDll', 'DrawTextExW', 'FindResourceA', 'SizeofResource', 'GetSystemWindowsDirectoryW', 'NtCreateFile', 'NtCreateSection', 'NtMapViewOfSection', 'CreateActCtxW', 'GetSystemTimeAsFileTime', 'NtFreeVirtualMemory', 'SetUnhandledExceptionFilter', 'GetFileSize', 'SetFilePointer', 'NtReadFile', 'RegOpenKeyExW', 'RegCreateKeyExW', 'RegCloseKey', 'RegSetValueExW', 'IsDebuggerPresent', 'CoInitializeEx', 'GetForegroundWindow', 'OleInitialize', 'GetNativeSystemInfo', 'RegQuer