# Instance Compare

This notebook aims to answer the question **Are there any unique indicators of malicious samples in terms of specific API Call(s) alone?** from **4.2.6. Dataset Analysis** of the study.

Note that this notebook uses only the verified xxxx_SampleHash_Common.csv file, which covers a significant majority of the entire Oliveira dataset.

## Import Libraries/Datasets

In [1]:
import pandas as pd
import time

malicious_df = pd.read_csv('./Converted_(EDITED) DBSCAN_SampleHash_Common.csv', low_memory=False) #This should point to a verified <DataClustering>_SampleHash_Common.csv file
benign_df = pd.read_csv('./API_Patterns.csv') #This should point to the API_Patterns.csv file

#Load list of API calls (read from the first line of the file, comma-separated)
API_LIST = "../api_calls.txt"
DELIMITER = "NaN"
with open(API_LIST,"r") as API_FILE:
    APIS = API_FILE.readline().strip().split(',') #strip() removes the trailing newline from the last API name
# APIS.append(DELIMITER) #serves as a label for NaN values for Instance-based datasets

## DataFrame Preview

In [2]:
def list_to_str(ls:list):
    #Join the items into a single space-separated string
    return " ".join(str(l) for l in ls)

def inject_patterns(inner_df:pd.DataFrame):
    patterns = []
    for row in range(inner_df.shape[0]):
        patterns.append(list_to_str(inner_df.iloc[row,2:5].transpose().to_list()))
    inner_df['type_pattern'] = patterns
    return inner_df

malicious_df.replace(to_replace='-',value='_', inplace=True)
malicious_df.drop(malicious_df[(malicious_df['Type 1']=='_')&(malicious_df['Type 2']=='_')&(malicious_df['Type 3']=='_')].index, inplace=True) #Drop row that is falsely labelled.
malicious_df = inject_patterns(malicious_df)

print("Malicious DF")
display(malicious_df)

print("Benign DF")
display(benign_df)

Malicious DF


Unnamed: 0,cluster,hash,Type 1,Type 2,Type 3,pattern,type_pattern
0,-1,5e1f079fc9130cd508568da3cb0b219a,adware,_,_,"SetErrorMode,OleInitialize,LdrGetDllHandle,Ldr...",adware _ _
4,-1,d93b214c093a9f1e07248962aeb74fc8,trojan,_,_,"SetErrorMode,OleInitialize,LdrGetDllHandle,Ldr...",trojan _ _
5,0,bcc2e77229d428536091d0795980eb46,trojan,_,_,"RegOpenKeyExA,NtOpenKey,NtQueryValueKey,NtClos...",trojan _ _
6,0,c17a20fe53f3e9f0300a82bb371c7859,trojan,_,_,"RegOpenKeyExA,NtOpenKey,NtQueryValueKey,NtClos...",trojan _ _
7,0,b8454478faca0929d7078dc7d30fd913,trojan,_,_,"RegOpenKeyExA,NtOpenKey,NtQueryValueKey,NtClos...",trojan _ _
...,...,...,...,...,...,...,...
1490,297,05b379055a79c5e47bdabec418190ac7,trojan,_,_,"GetSystemTimeAsFileTime,LdrLoadDll,LdrGetProce...",trojan _ _
1491,297,d8c65468405b789c56754336c1f8911b,trojan,_,_,"GetSystemTimeAsFileTime,LdrLoadDll,LdrGetProce...",trojan _ _
1492,297,4b58a7c885df8e86be4769fd949d2c37,trojan,_,_,"GetSystemTimeAsFileTime,LdrLoadDll,LdrGetProce...",trojan _ _
1493,297,a4200ec0b146d8a0d37e90e32c674780,trojan,_,_,"GetSystemTimeAsFileTime,LdrLoadDll,LdrGetProce...",trojan _ _


Benign DF


Unnamed: 0,hash,t_0,t_1,t_2,t_3,t_4,t_5,t_6,t_7,t_8,...,t_91,t_92,t_93,t_94,t_95,t_96,t_97,t_98,t_99,pattern
0,5b51d65972a349f90a86984c26b12b30,SetErrorMode,OleInitialize,LdrGetDllHandle,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,...,NtClose,NtQueryDirectoryFile,NtClose,LdrGetProcedureAddress,CoCreateInstance,NtOpenSection,CreateDirectoryW,NtCreateFile,LdrGetProcedureAddress,"SetErrorMode,OleInitialize,LdrGetDllHandle,Ldr..."
1,ceb8cc125478fad641daa4e04e9b2f19,GetSystemInfo,NtAllocateVirtualMemory,NtOpenSection,GetTempPathW,CreateDirectoryW,GetFileAttributesW,FindFirstFileExW,DeleteFileW,NtQueryDirectoryFile,...,NtClose,NtCreateMutant,NtClose,LdrGetDllHandle,LdrGetProcedureAddress,NtClose,NtCreateMutant,NtClose,NtCreateFile,"GetSystemInfo,NtAllocateVirtualMemory,NtOpenSe..."
2,f108600edf46d7c20f6acc522aeba6df,GetSystemTimeAsFileTime,NtProtectVirtualMemory,SetUnhandledExceptionFilter,GetTimeZoneInformation,GetSystemTimeAsFileTime,GetTimeZoneInformation,GetSystemTimeAsFileTime,GetTimeZoneInformation,GetSystemTimeAsFileTime,...,SetErrorMode,GetFileAttributesExW,SetErrorMode,NtAllocateVirtualMemory,SetErrorMode,GetFileAttributesExW,SetErrorMode,FindFirstFileExW,NtQueryDirectoryFile,"GetSystemTimeAsFileTime,NtProtectVirtualMemory..."
3,711be6337cb78a948f04759a0bd210ce,GetSystemTimeAsFileTime,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,...,LdrGetProcedureAddress,NtAllocateVirtualMemory,LdrGetProcedureAddress,GetSystemMetrics,LdrLoadDll,LdrGetProcedureAddress,GetSystemMetrics,NtAllocateVirtualMemory,LdrLoadDll,"GetSystemTimeAsFileTime,LdrLoadDll,LdrGetProce..."
4,6de26f67ceb1e3303b889489010f4c3f,SetErrorMode,OleInitialize,LdrGetDllHandle,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,...,NtClose,NtQueryDirectoryFile,NtClose,LdrGetProcedureAddress,GetSystemWindowsDirectoryW,LoadStringW,GetSystemWindowsDirectoryW,GetSystemDirectoryW,RegOpenKeyExW,"SetErrorMode,OleInitialize,LdrGetDllHandle,Ldr..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1074,d282ef96a93986f89825508812958354,SetErrorMode,OleInitialize,LdrGetDllHandle,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,...,NtClose,LdrGetProcedureAddress,NtAllocateVirtualMemory,LdrGetDllHandle,LdrGetProcedureAddress,LdrGetDllHandle,LdrGetProcedureAddress,LdrGetDllHandle,LdrGetProcedureAddress,"SetErrorMode,OleInitialize,LdrGetDllHandle,Ldr..."
1075,c0389d256f976044adf570f0df908953,GetSystemTimeAsFileTime,SetUnhandledExceptionFilter,GetCursorPos,SetErrorMode,FindResourceW,SetWindowsHookExW,CoInitializeEx,NtDuplicateObject,NtAllocateVirtualMemory,...,NtAllocateVirtualMemory,LdrLoadDll,LdrGetProcedureAddress,NtAllocateVirtualMemory,GetSystemMetrics,RegOpenKeyExW,NtAllocateVirtualMemory,GetSystemMetrics,NtAllocateVirtualMemory,"GetSystemTimeAsFileTime,SetUnhandledExceptionF..."
1076,20316e717de5db169aecbb67377504ce,SetUnhandledExceptionFilter,NtCreateMutant,NtAllocateVirtualMemory,NtClose,NtCreateMutant,NtClose,NtCreateMutant,NtClose,NtAllocateVirtualMemory,...,RegOpenKeyExW,RegQueryValueExW,RegCloseKey,RegOpenKeyExW,RegQueryValueExW,RegCloseKey,RegOpenKeyExW,RegQueryValueExW,RegCloseKey,"SetUnhandledExceptionFilter,NtCreateMutant,NtA..."
1077,ce945d424b93ea73fbbedf0254f6bc07,NtClose,NtOpenKey,NtQueryValueKey,NtClose,NtOpenKey,NtQueryValueKey,NtClose,LdrGetDllHandle,LdrGetProcedureAddress,...,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrLoadDll,LdrGetProcedureAddress,LdrGetDllHandle,FindResourceExW,LoadResource,"NtClose,NtOpenKey,NtQueryValueKey,NtClose,NtOp..."
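
The `type_pattern` column in the malicious preview can also be built by column name instead of by position, which avoids depending on where the type columns sit in the frame. The following is a minimal sketch, assuming the three label columns are named `Type 1`, `Type 2`, and `Type 3` as in the preview; the helper name `add_type_pattern` and the toy frame are hypothetical and not part of the notebook.

```python
import pandas as pd

# Assumed column names, taken from the preview above.
TYPE_COLS = ["Type 1", "Type 2", "Type 3"]

def add_type_pattern(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a space-joined 'type_pattern' column."""
    out = df.copy()
    # Join the three labels of each row into one string, e.g. "adware _ _".
    out["type_pattern"] = out[TYPE_COLS].astype(str).agg(" ".join, axis=1)
    return out

# Toy data for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "hash": ["h1", "h2"],
    "Type 1": ["adware", "trojan"],
    "Type 2": ["_", "ransomware"],
    "Type 3": ["_", "_"],
})
toy = add_type_pattern(toy)
print(toy["type_pattern"].tolist())
```

Selecting by name keeps the result stable if a column is added or reordered in the CSV, whereas `iloc[row, 2:5]` silently picks up whatever columns happen to be in positions 2 to 4.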


## Extract Unique API Calls

In [3]:
str_output = ""

malicious_apis = []
for i in range(malicious_df.shape[0]): #Only allow rows with at least one malware type label
    if not (malicious_df['Type 1'].iloc[i] == '_' and malicious_df['Type 2'].iloc[i] == '_' and malicious_df['Type 3'].iloc[i] == '_'):
        malicious_apis += malicious_df['pattern'].iloc[i].split(',')
malicious_apis = list(pd.Series(malicious_apis).unique())
str_output += f"# of Unique API Calls in Verified Malicious Samples: {len(malicious_apis)}\n"
str_output += str(malicious_apis) + "\n\n"

benign_apis = []
for i in range(benign_df.shape[0]): #Collect the API calls of every benign sample
    benign_apis += benign_df['pattern'].iloc[i].split(',')
benign_apis = list(pd.Series(benign_apis).unique())
str_output += f"# of Unique API Calls in Benign Samples: {len(benign_apis)}\n" 
str_output += str(benign_apis) + "\n\n"

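#Note: In [4] opens this same path in 'w' mode, so it overwrites the file written here
with open("./Output/4 Unique_APICalls_MaliciousOnly.txt", 'w') as f:
    f.write(str_output) #The with-block flushes and closes the file on exit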
print(str_output)

# of Unique API Calls in Verified Malicious Samples: 147
['SetErrorMode', 'OleInitialize', 'LdrGetDllHandle', 'LdrLoadDll', 'LdrGetProcedureAddress', 'NtOpenSection', 'NtMapViewOfSection', 'RegOpenKeyExW', 'RegQueryValueExW', 'RegCloseKey', 'NtClose', 'NtOpenKey', 'NtQueryValueKey', 'GetSystemWindowsDirectoryW', 'NtCreateFile', 'NtCreateSection', 'RegOpenKeyExA', 'CreateActCtxW', 'GetSystemDirectoryW', 'GetVolumeNameForVolumeMountPointW', 'NtDuplicateObject', 'LoadStringW', 'NtCreateMutant', 'GetNativeSystemInfo', 'NtQueryAttributesFile', 'LoadStringA', 'NtAllocateVirtualMemory', 'GetSystemMetrics', 'FindResourceExW', 'LoadResource', 'DrawTextExW', 'FindResourceA', 'SizeofResource', 'GetSystemTimeAsFileTime', 'NtFreeVirtualMemory', 'SetUnhandledExceptionFilter', 'GetFileSize', 'SetFilePointer', 'NtReadFile', 'RegCreateKeyExW', 'RegSetValueExW', 'IsDebuggerPresent', 'CoInitializeEx', 'GetForegroundWindow', 'GetSystemInfo', 'FindFirstFileExW', 'LookupAccountSidW', 'NtProtectVirtualMemory
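
The row-by-row extraction above can also be expressed with pandas string operations. The sketch below assumes each `pattern` value is a comma-separated string of API names, as in the previews; the helper name `unique_api_calls` and the toy series are hypothetical.

```python
import pandas as pd

def unique_api_calls(patterns: pd.Series) -> list:
    """Unique API names, in first-seen order, from comma-separated patterns."""
    # split -> one list per row; explode -> one API name per row;
    # drop_duplicates keeps the first occurrence of each name.
    return patterns.str.split(",").explode().drop_duplicates().tolist()

# Toy data for illustration only (not taken from the dataset).
toy_patterns = pd.Series([
    "LdrLoadDll,NtClose,LdrLoadDll",
    "NtClose,CryptDecrypt",
])
print(unique_api_calls(toy_patterns))
```

Like `pd.Series(...).unique()`, this keeps the order of first appearance, so the resulting list should be comparable to the one printed by the cell.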

## Identify the Unique API Calls only found in Malicious API Calls.

In [4]:
str_output = ""

unique = []
for m in malicious_apis:
    if m not in benign_apis:
        unique.append(m)
#The percentage below is len(unique) relative to the number of unique API calls in benign samples
str_output += f"No. of truly unique API Calls only found in Malicious Samples: {len(unique)} ({len(unique)/len(benign_apis)*100:.2f}% Matches API Calls of Benign Samples)\n"
str_output += f"Coverage of 'Malicious-only' API Calls to Official API Calls Oliveira.csv: {(len(unique)/len(APIS))*100:.4f}%\n"
str_output += "Unique API Calls to Verified Malicious Samples only: "+ str(unique) + "\n"

with open("./Output/4 Unique_APICalls_MaliciousOnly.txt", 'w') as f:
    f.write(str_output) #The with-block flushes and closes the file on exit
print(str_output)

No. of truly unique API Calls only found in Malicious Samples: 6 (3.00% Matches API Calls of Benign Samples)
Coverage of 'Malicious-only' API Calls to Official API Calls Oliveira.csv: 1.9544%
Unique API Calls to Verified Malicious Samples only: ['CryptDecrypt', 'getaddrinfo', 'connect', 'GetDiskFreeSpaceW', 'SetFileTime', 'NtSuspendThread']
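
The malicious-only and shared lists are two halves of the same partition, so they can be computed together. The sketch below is one possible way to do that, with a set used for the membership tests; the function name `split_api_sets` and the toy lists are hypothetical. It also reports the malicious-only share relative to the malicious vocabulary, which is an alternative denominator to the one used in the cell, not a replacement for the reported figures.

```python
def split_api_sets(malicious_apis, benign_apis):
    """Partition malicious API names into (malicious-only, shared) lists."""
    benign = set(benign_apis)  # set lookup avoids rescanning the benign list
    only = [a for a in malicious_apis if a not in benign]
    shared = [a for a in malicious_apis if a in benign]
    return only, shared

# Toy data for illustration only (not taken from the dataset).
mal = ["NtClose", "CryptDecrypt", "LdrLoadDll", "connect"]
ben = ["LdrLoadDll", "NtClose", "GetSystemMetrics"]
only, shared = split_api_sets(mal, ben)

# Share of the malicious vocabulary that never appears in benign samples.
only_pct = len(only) / len(mal) * 100
print(only, shared, f"{only_pct:.2f}%")
```

Both output lists preserve the order of `malicious_apis`, matching the order produced by the loops in the notebook.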



**Fun Fact:** The results of item #5 contain Crypt-related API calls (e.g., `CryptDecrypt`). According to Sir Mantua (during his 4th hr. talk), such calls are a possible key indicator that ransomware is present on the system. This is supported by the fact that some malicious samples are labelled as ransomware in their `Type 2` and `Type 3` malware types.

In [5]:
str_output = ""
for u in unique:
    str_output += u + "\n\n"
    str_output += str(malicious_df[malicious_df['pattern'].str.contains(u)]['Type 1'].value_counts())
    str_output += "\n=====================================================================\n"

with open("./Output/4 APICalls_MalwareTypes_MaliciousOnly.txt", 'w') as f:
    f.write(str_output) #The with-block flushes and closes the file on exit
print(str_output)

CryptDecrypt

Type 1
trojan        63
downloader     9
adware         2
pua            1
Name: count, dtype: int64
getaddrinfo

Type 1
adware    15
Name: count, dtype: int64
connect

Type 1
adware    10
Name: count, dtype: int64
GetDiskFreeSpaceW

Type 1
trojan    3
adware    1
Name: count, dtype: int64
SetFileTime

Type 1
trojan    3
adware    1
Name: count, dtype: int64
NtSuspendThread

Type 1
trojan    1
Name: count, dtype: int64
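
One caveat about the counts above: `str.contains(u)` matches substrings (and interprets `u` as a regular expression), so an API name that is a prefix of a longer name would be counted for both. Whether this affects any of the six calls listed here is not established; the sketch below only illustrates a whole-token alternative. The helper name `type_counts_for_api` and the toy frame are hypothetical.

```python
import pandas as pd

def type_counts_for_api(df: pd.DataFrame, api: str) -> pd.Series:
    """Count 'Type 1' labels among rows whose pattern contains `api` as a whole token."""
    # Split each pattern into its API names and test exact membership.
    has_api = df["pattern"].str.split(",").apply(lambda calls: api in calls)
    return df.loc[has_api, "Type 1"].value_counts()

# Toy data for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "Type 1": ["trojan", "adware", "trojan"],
    "pattern": ["NtOpenKeyEx,NtClose", "NtOpenKey,NtClose", "connect,NtClose"],
})
counts = type_counts_for_api(toy, "NtOpenKey")
print(counts.to_dict())

# Substring matching also counts the row that only contains NtOpenKeyEx.
substring_hits = int(toy["pattern"].str.contains("NtOpenKey").sum())
print(substring_hits)
```

Comparing the two counts for each API name is a quick way to check whether substring matching inflated any of the reported totals.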



## Identify the Same API Calls found in both Malicious and Benign Samples.

*Because trojan-like malware is so prevalent, it is plausible that many API calls are shared between malicious and benign samples.*

In [6]:
str_output = ""
same = []
for m in malicious_apis:
    if m in benign_apis:
        same.append(m)
str_output += f"No. of API Calls in Malicious Samples that is found in API Calls in Benign Samples: {len(same)} ({len(same)/len(benign_apis)*100:.2f}% Matches API Calls of Benign Samples)\n"
str_output += f"Coverage of 'Same-to-Malicious-Benign-Samples' API Calls to Official API Calls Oliveira.csv: {(len(same)/len(APIS))*100:.4f}%\n"
str_output += "Same API Calls to both Verified Malicious and Benign Samples: "+ str(same) + "\n"

with open("./Output/4 Unique_APICalls_MaliciousBenign.txt", 'w') as f:
    f.write(str_output) #The with-block flushes and closes the file on exit
print(str_output)

No. of API Calls in Malicious Samples that is found in API Calls in Benign Samples: 141 (70.50% Matches API Calls of Benign Samples)
Coverage of 'Same-to-Malicious-Benign-Samples' API Calls to Official API Calls Oliveira.csv: 45.9283%
Same API Calls to both Verified Malicious and Benign Samples: ['SetErrorMode', 'OleInitialize', 'LdrGetDllHandle', 'LdrLoadDll', 'LdrGetProcedureAddress', 'NtOpenSection', 'NtMapViewOfSection', 'RegOpenKeyExW', 'RegQueryValueExW', 'RegCloseKey', 'NtClose', 'NtOpenKey', 'NtQueryValueKey', 'GetSystemWindowsDirectoryW', 'NtCreateFile', 'NtCreateSection', 'RegOpenKeyExA', 'CreateActCtxW', 'GetSystemDirectoryW', 'GetVolumeNameForVolumeMountPointW', 'NtDuplicateObject', 'LoadStringW', 'NtCreateMutant', 'GetNativeSystemInfo', 'NtQueryAttributesFile', 'LoadStringA', 'NtAllocateVirtualMemory', 'GetSystemMetrics', 'FindResourceExW', 'LoadResource', 'DrawTextExW', 'FindResourceA', 'SizeofResource', 'GetSystemTimeAsFileTime', 'NtFreeVirtualMemory', 'SetUnhandledExcep

In [7]:
#str_output is not reset here, so it still starts with the summary built in In [6]
for s in same:
    str_output += s + "\n\n"
    str_output += str(malicious_df[malicious_df['pattern'].str.contains(s)]['Type 1'].value_counts())
    str_output += "\n=====================================================================\n"

with open("./Output/4 APICalls_MalwareTypes_MaliciousBenign.txt", 'w') as f:
    f.write(str_output) #The with-block flushes and closes the file on exit
print(str_output)

No. of API Calls in Malicious Samples that is found in API Calls in Benign Samples: 141 (70.50% Matches API Calls of Benign Samples)
Coverage of 'Same-to-Malicious-Benign-Samples' API Calls to Official API Calls Oliveira.csv: 45.9283%
Same API Calls to both Verified Malicious and Benign Samples: ['SetErrorMode', 'OleInitialize', 'LdrGetDllHandle', 'LdrLoadDll', 'LdrGetProcedureAddress', 'NtOpenSection', 'NtMapViewOfSection', 'RegOpenKeyExW', 'RegQueryValueExW', 'RegCloseKey', 'NtClose', 'NtOpenKey', 'NtQueryValueKey', 'GetSystemWindowsDirectoryW', 'NtCreateFile', 'NtCreateSection', 'RegOpenKeyExA', 'CreateActCtxW', 'GetSystemDirectoryW', 'GetVolumeNameForVolumeMountPointW', 'NtDuplicateObject', 'LoadStringW', 'NtCreateMutant', 'GetNativeSystemInfo', 'NtQueryAttributesFile', 'LoadStringA', 'NtAllocateVirtualMemory', 'GetSystemMetrics', 'FindResourceExW', 'LoadResource', 'DrawTextExW', 'FindResourceA', 'SizeofResource', 'GetSystemTimeAsFileTime', 'NtFreeVirtualMemory', 'SetUnhandledExcep